Converting a String to Unicode Code Points in Java
Converting a Java String into its Unicode representation sounds simple until you hit emoji, historical scripts, or rare CJK characters: code points above U+FFFF. This guide covers both the fast path (char-based) and the correct path (code-point-based).
char vs code point
A Java char is a 16-bit unsigned integer representing a UTF-16 code unit. Characters up to U+FFFF fit in a single char; characters above that (emoji, most ancient scripts) are stored as a surrogate pair: two char values.
String text = "Hi \uD83D\uDE00"; // "Hi 😀"
System.out.println(text.length()); // 5, because the surrogate pair counts as 2 chars
The actual number of Unicode characters is 4: H, i, space, and 😀. To count code points correctly, use codePointCount:
System.out.println(text.codePointCount(0, text.length())); // 4
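Streams aside, you can also walk code points with a plain index loop, which is handy in pre-Java-8 codebases or when you need the char index alongside each code point. A minimal sketch (the helper name countCodePoints is mine; codePointAt and Character.charCount are standard JDK APIs):

```java
public class CodePointLoop {
    // Walk a string one code point at a time, advancing the char index
    // by 1 or 2 depending on whether the code point lies beyond the BMP.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // 2 for supplementary code points
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Hi \uD83D\uDE00";
        System.out.println(countCodePoints(text)); // prints 4
    }
}
```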
The simple case: no surrogates
If you're sure the string contains only BMP characters (Latin, accents, standard CJK), chars() is enough:
String s = "Java";
s.chars().forEach(c -> System.out.printf("U+%04X%n", c));
// U+004A
// U+0061
// U+0076
// U+0061
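To see why chars() is only safe for BMP text, run it over a lone emoji: each half of the surrogate pair comes out as a separate value, neither of which is a real character:

```java
public class CharsPitfall {
    public static void main(String[] args) {
        // chars() yields raw UTF-16 code units, so the emoji appears
        // as two surrogate values rather than one code point.
        "\uD83D\uDE00".chars()
            .forEach(c -> System.out.printf("U+%04X%n", c));
        // U+D83D
        // U+DE00
    }
}
```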
The correct case: codePoints() for any string
String s = "café 😀";
s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
// U+0063
// U+0061
// U+0066
// U+00E9
// U+0020
// U+1F600
codePoints() automatically combines each surrogate pair into a single int, so every element of the stream is a full Unicode code point. Prefer it whenever the input can contain characters beyond the BMP.
Collect code points into an array
int[] points = s.codePoints().toArray();
System.out.println(points.length); // 6
System.out.println(Integer.toHexString(points[5])); // 1f600
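Once you have the code points as an IntStream or int[], transformations become one-liners. For example, a sketch that strips all non-BMP code points (the helper name stripSupplementary is mine; Character.isSupplementaryCodePoint is a standard JDK method):

```java
public class StripSupplementary {
    // Keep only BMP code points; Character.isSupplementaryCodePoint
    // is true for any code point >= U+10000.
    static String stripSupplementary(String s) {
        int[] kept = s.codePoints()
                      .filter(cp -> !Character.isSupplementaryCodePoint(cp))
                      .toArray();
        return new String(kept, 0, kept.length);
    }

    public static void main(String[] args) {
        System.out.println(stripSupplementary("Hi \uD83D\uDE00!")); // Hi !
    }
}
```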
Produce a \u... representation
String escape(String s) {
    StringBuilder out = new StringBuilder();
    s.codePoints().forEach(cp -> {
        if (cp < 0x80) out.append((char) cp); // ASCII passes through unchanged
        else if (cp <= 0xFFFF) out.append(String.format("\\u%04X", cp));
        else out.append(String.format("\\U%08X", cp)); // beyond the BMP
    });
    return out.toString();
}
escape("café 😀");
// caf\u00E9 \U0001F600
Convert code points back to a String
int[] points = { 0x48, 0x69, 0x1F600 };
String s = new String(points, 0, points.length);
System.out.println(s); // Hi 😀
Or from a single code point:
String smiley = new String(Character.toChars(0x1F600)); // 😀
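When building a string incrementally, StringBuilder.appendCodePoint does the surrogate handling for you, writing one or two chars as needed. A small sketch (fromCodePoints is my helper name; appendCodePoint is a standard StringBuilder method):

```java
public class BuildFromCodePoints {
    // Append each code point; appendCodePoint emits a surrogate pair
    // automatically for supplementary code points.
    static String fromCodePoints(int... cps) {
        StringBuilder sb = new StringBuilder();
        for (int cp : cps) sb.appendCodePoint(cp);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fromCodePoints(0x48, 0x69, 0x20, 0x1F600)); // Hi 😀
    }
}
```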
Character metadata
Character exposes rich Unicode metadata: script, type, numeric value, and more:
int cp = "é".codePointAt(0);
System.out.println(Character.getName(cp));
// LATIN SMALL LETTER E WITH ACUTE
System.out.println(Character.UnicodeScript.of(cp));
// LATIN
System.out.println(Character.isLetter(cp));
// true
System.out.println(Character.getType(cp));
// 2 (LOWERCASE_LETTER)
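Combining these calls, here is a sketch that produces one metadata line per code point (the dump helper name is mine; getName and UnicodeScript.of have been in the JDK since Java 7, and toList() needs Java 16+):

```java
import java.util.List;

public class UnicodeDump {
    // One "U+XXXX NAME (SCRIPT)" line per code point in the string.
    static List<String> dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X %s (%s)",
                        cp, Character.getName(cp), Character.UnicodeScript.of(cp)))
                .toList();
    }

    public static void main(String[] args) {
        // a, é, 😀
        UnicodeDump.dump("a\u00E9\uD83D\uDE00").forEach(System.out::println);
    }
}
```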
Trim by Unicode, not by char
Simple slicing can break a surrogate pair in half, producing invalid UTF-16:
String s = "Hi 😀 there";
String bad = s.substring(0, 4); // "Hi \uD83D": half an emoji
// Correct: convert to code points, slice, convert back
int[] cps = s.codePoints().toArray();
String good = new String(cps, 0, 4); // "Hi 😀"
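For truncation, offsetByCodePoints converts a code-point count back into a char index, which avoids materialising the intermediate array. A sketch (truncate is a hypothetical helper; offsetByCodePoints and codePointCount are standard String methods):

```java
public class SafeTruncate {
    // Truncate to at most maxCodePoints code points without ever
    // splitting a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) return s;
        // Translate the code-point count into a char index for substring.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(truncate("Hi \uD83D\uDE00 there", 4)); // Hi 😀
    }
}
```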
Reverse a string with emoji intact
String reverseByCodePoint(String s) {
    int[] cps = s.codePoints().toArray();
    int[] reversed = new int[cps.length];
    for (int i = 0; i < cps.length; i++) reversed[i] = cps[cps.length - 1 - i];
    return new String(reversed, 0, reversed.length);
}
reverseByCodePoint("Hi 😀"); // "😀 iH": emoji preserved
new StringBuilder("Hi 😀").reverse().toString(); // also "😀 iH" on modern JDKs
Despite its reputation, StringBuilder.reverse() is specified to treat surrogate pairs as single characters, so it keeps emoji intact on any current JDK. The code-point version above is still useful when you want to transform the int[] (filter, map) before rebuilding the string.
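Even code-point reversal is not the last word: a user-perceived character (grapheme cluster) such as e + U+0301 or an emoji ZWJ sequence spans several code points, and reversing by code point tears it apart. A sketch of grapheme-aware reversal using the JDK's java.text.BreakIterator (the class name GraphemeReverse is mine; note that older JDKs may not treat emoji ZWJ sequences as single clusters):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GraphemeReverse {
    // Reverse by user-perceived character: collect the grapheme clusters
    // reported by BreakIterator, reverse the list, and rejoin.
    static String reverse(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        List<String> clusters = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            clusters.add(s.substring(start, end));
        }
        Collections.reverse(clusters);
        return String.join("", clusters);
    }

    public static void main(String[] args) {
        // "e" + combining acute (U+0301) stays attached after reversal
        System.out.println(reverse("xe\u0301y")); // "ye\u0301x", i.e. "yéx"
    }
}
```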
Normalisation considerations
The same visual character can have multiple code-point representations. For example, é can be U+00E9 (composed) or U+0065 U+0301 (e plus combining acute accent). Use Normalizer to get a canonical form:
import java.text.Normalizer;
String nfc = Normalizer.normalize(s, Normalizer.Form.NFC); // composed
String nfd = Normalizer.normalize(s, Normalizer.Form.NFD); // decomposed
If you're comparing user input against stored data, always normalise both sides first.
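A minimal sketch of normalised comparison, assuming NFC as the canonical form (the helper name unicodeEquals is mine; Normalizer is the standard java.text API):

```java
import java.text.Normalizer;

public class NormalizedEquals {
    // Normalise both sides to NFC before comparing, so composed and
    // decomposed spellings of the same text compare equal.
    static boolean unicodeEquals(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";    // café with U+00E9
        String decomposed = "cafe\u0301"; // café with e + U+0301
        System.out.println(composed.equals(decomposed));         // false
        System.out.println(unicodeEquals(composed, decomposed)); // true
    }
}
```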
Quick reference
| Task | API |
|---|---|
| Iterate BMP characters only | s.chars() |
| Iterate any Unicode (correct) | s.codePoints() |
| Get a single code point | s.codePointAt(index) |
| Count code points | s.codePointCount(0, s.length()) |
| Code point β String | new String(Character.toChars(cp)) |
| int[] β String | new String(points, 0, len) |
Whenever you handle user content that can contain emoji, languages beyond the BMP, or historical scripts, default to codePoints(): it's barely more verbose and spares you a category of rare but nasty bugs.