Skip to content

fix: surrogate pair (emoji) handling for a, r, s, ~, ga commands#9929

Open
k1832 wants to merge 2 commits into
VSCodeVim:masterfrom
k1832:fix/surrogate-pairs
Open

fix: surrogate pair (emoji) handling for a, r, s, ~, ga commands#9929
k1832 wants to merge 2 commits into
VSCodeVim:masterfrom
k1832:fix/surrogate-pairs

Conversation

@k1832
Copy link
Copy Markdown
Contributor

@k1832 k1832 commented Feb 10, 2026

What this PR does / why we need it:

When the cursor is on a character encoded as a UTF-16 surrogate pair — emojis (😄), rare CJK characters (𩸽), mathematical symbols (𝒟, 𝔸), musical symbols (𝄞), etc. — several character-level commands break.

position.getRight() increments by 1 UTF-16 code unit, but these characters are 2 code units. Moving by 1 lands between the pair, and VSCode's validatePosition clamps it back to the start — so the cursor effectively goes backward.

Command Broken behavior
a (Append) Cursor lands before the character instead of after
r (Replace) Only replaces half the character, corrupting the text
s (Change char) Only deletes half the character before entering Insert mode
~ (Toggle case) Corrupts the character into a lone surrogate
ga (Unicode info) Shows the half-surrogate value instead of the full codepoint

Previous fixes addressed insert mode (PR #7977 / #6046) and motions (l/h) and operators (x/X/y), but these 5 commands were missed. This PR completes the surrogate pair handling for character-level commands and adds 12 regression tests to prevent future breakage.

Adds getSurrogateAwareRight()/getSurrogateAwareLeft() helpers on the Position prototype that skip past surrogate pairs, following the same pattern already used by MoveRight (l) / MoveLeft (h). Uses these helpers in the 5 affected commands. Also switches ga from charCodeAt() to codePointAt() to report the full Unicode codepoint.

Which issue(s) this PR fixes

Fixes #9931
Partially addresses #8321 — this PR fixes r and other character-level commands reported there. Remaining issues not addressed: easymotion uses raw character arithmetic for match positioning and marker decorations; xp (transpose) on surrogate pairs is also still broken (TODO added in put.ts).

Special notes for your reviewer:

  • 19/19 surrogate pair tests passing (7 existing + 12 new)
  • Full test suite passes (3157 passing, 0 regressions)
  • Tests cover emojis, rare CJK (𩸽), and mathematical symbols (𝒟, 𝔸) from U+1D400 block
  • Tests require fix: add missing @types/minimatch devDependency #9928 to build when re-resolving dependencies

@k1832 k1832 changed the title Fix surrogate pair (emoji) handling for a, r, s, ~, ga commands fix: surrogate pair (emoji) handling for a, r, s, ~, ga commands Feb 10, 2026
@k1832 k1832 force-pushed the fix/surrogate-pairs branch from 0cdb8e1 to 14abd9a Compare February 10, 2026 03:14
@k1832 k1832 mentioned this pull request Feb 10, 2026
@k1832 k1832 force-pushed the fix/surrogate-pairs branch 2 times, most recently from 164974e to c3b1300 Compare February 10, 2026 03:32
@J-Fields J-Fields requested a review from Copilot March 27, 2026 02:46
Comment thread src/common/motion/position.ts Outdated
k1832 added 2 commits April 26, 2026 22:44
`position.getRight()` increments by 1 UTF-16 code unit, but emojis
outside the Basic Multilingual Plane are encoded as 2 code units
(a surrogate pair). Moving by 1 code unit lands between the pair,
and VSCode's `validatePosition` clamps it back to the start — so
the cursor effectively goes backward.

Add `getSurrogateAwareRight()`/`getSurrogateAwareLeft()` helpers on
the Position prototype that skip past surrogate pairs, and use them
in the affected commands:

- `a` (Append): cursor now lands after the emoji
- `r` (ReplaceCharacter): replaces the full emoji, not half
- `s` (ChangeOperator): deletes the full emoji before entering insert
- `~` (ToggleCase): advances past the emoji without corrupting it
- `ga` (UnicodeInfo): shows full codepoint (e.g. U+1F604) via
  codePointAt() instead of the half-surrogate from charCodeAt()

Commands already protected (no changes needed): x/X (DeleteOperator),
l/h (MoveRight/MoveLeft), y (YankOperator).
Address review feedback: pass the TextDocument instead of a
pre-extracted line string, matching the rest of the Position API
(prevWordStart, etc.). Easier to call and removes the chance of
passing the wrong line's text.
@k1832 k1832 force-pushed the fix/surrogate-pairs branch from ddafd57 to 963c1b2 Compare April 26, 2026 13:50
@k1832 k1832 requested a review from J-Fields May 14, 2026 05:38
@k1832
Copy link
Copy Markdown
Contributor Author

k1832 commented May 14, 2026

Forgot to re-request a review...

@NanoUser3000
Copy link
Copy Markdown

note about p: p working depends one additional factor
when using j to move cursor over 2-wide / surrogate pair symbol, we can go from two columns (assuming they are normal characters). the cursor over emoji may be in 1 of 2 states: "in column more to the left" or "more to the right".

Screen capture showing the difference of going from space versus t (I x the a, then do jjpukljp…). Despite using p with cursor showing over the same character, the results are different)
Screencast from 2026-05-22 18-45-15.webm

Same thing is happening for a yanked word: the word is pasted before or after the emoji under cursor depending on this internal state.

shift+p / P seems to be working fine

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surrogate pair characters (emoji, rare CJK, etc.) break a, r, s, ~, ga commands

3 participants