Converting a String to Unicode Code Points in Java
Converting a Java String into its Unicode representation sounds simple until you hit emoji, historical scripts, or rare CJK characters: code points above U+FFFF. This guide covers both the fast path (char-based) and the correct path (code-point-based).
char vs code point
A Java char is a 16-bit unsigned integer representing a UTF-16 code unit. Characters up to U+FFFF fit in a single char; characters above that (emoji, most ancient scripts) are stored as a surrogate pair: two char values.
String text = "Hi \uD83D\uDE00"; // "Hi 😀"
System.out.println(text.length()); // 5, because the surrogate pair counts as 2 chars
The actual number of Unicode characters is 4: H, i, space, and 😀. To count code points correctly, use codePointCount:
System.out.println(text.codePointCount(0, text.length())); // 4
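Streams aside, you can also walk code points with a plain index loop, which is handy in pre-Java-8 codebases or when you need the char index alongside each code point. A minimal sketch (the helper name countCodePoints is mine; codePointAt and Character.charCount are standard JDK APIs):

```java
public class CodePointLoop {
    // Walk a string one code point at a time, advancing the char index
    // by 1 or 2 depending on whether the code point lies beyond the BMP.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // 2 for supplementary code points
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Hi \uD83D\uDE00";
        System.out.println(countCodePoints(text)); // prints 4
    }
}
```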
The simple case: no surrogates
If you're sure the string contains only BMP characters (Latin, accents, standard CJK), chars() is enough:
String s = "Java";
s.chars().forEach(c -> System.out.printf("U+%04X%n", c));
// U+004A
// U+0061
// U+0076
// U+0061
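To see why chars() is only safe for BMP text, run it over a lone emoji: each half of the surrogate pair comes out as a separate value, neither of which is a real character:

```java
public class CharsPitfall {
    public static void main(String[] args) {
        // chars() yields raw UTF-16 code units, so the emoji appears
        // as two surrogate values rather than one code point.
        "\uD83D\uDE00".chars()
            .forEach(c -> System.out.printf("U+%04X%n", c));
        // U+D83D
        // U+DE00
    }
}
```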
The correct case: codePoints() for any string
String s = "café 😀";
s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
// U+0063
// U+0061
// U+0066
// U+00E9
// U+0020
// U+1F600
codePoints() automatically combines each surrogate pair into a single int, so every element of the stream is a full Unicode code point. Prefer it whenever the input can contain characters beyond the BMP.
Collect code points into an array
int[] points = s.codePoints().toArray();
System.out.println(points.length); // 6
System.out.println(Integer.toHexString(points[5])); // 1f600
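Once you have the code points as an IntStream or int[], transformations become one-liners. For example, a sketch that strips all non-BMP code points (the helper name stripSupplementary is mine; Character.isSupplementaryCodePoint is a standard JDK method):

```java
public class StripSupplementary {
    // Keep only BMP code points; Character.isSupplementaryCodePoint
    // is true for any code point >= U+10000.
    static String stripSupplementary(String s) {
        int[] kept = s.codePoints()
                      .filter(cp -> !Character.isSupplementaryCodePoint(cp))
                      .toArray();
        return new String(kept, 0, kept.length);
    }

    public static void main(String[] args) {
        System.out.println(stripSupplementary("Hi \uD83D\uDE00!")); // Hi !
    }
}
```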
Produce a \u... representation
String escape(String s) {
    StringBuilder out = new StringBuilder();
    s.codePoints().forEach(cp -> {
        if (cp < 0x80) out.append((char) cp); // ASCII passes through unchanged
        else if (cp <= 0xFFFF) out.append(String.format("\\u%04X", cp));
        else out.append(String.format("\\U%08X", cp)); // beyond the BMP
    });
    return out.toString();
}
escape("café 😀");
// caf\u00E9 \U0001F600
Convert code points back to a String
int[] points = { 0x48, 0x69, 0x1F600 };
String s = new String(points, 0, points.length);
System.out.println(s); // Hi 😀
Or from a single code point:
String smiley = new String(Character.toChars(0x1F600)); // 😀
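When building a string incrementally, StringBuilder.appendCodePoint does the surrogate handling for you, writing one or two chars as needed. A small sketch (fromCodePoints is my helper name; appendCodePoint is a standard StringBuilder method):

```java
public class BuildFromCodePoints {
    // Append each code point; appendCodePoint emits a surrogate pair
    // automatically for supplementary code points.
    static String fromCodePoints(int... cps) {
        StringBuilder sb = new StringBuilder();
        for (int cp : cps) sb.appendCodePoint(cp);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fromCodePoints(0x48, 0x69, 0x20, 0x1F600)); // Hi 😀
    }
}
```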
Character metadata
Character exposes rich Unicode metadata: script, type, numeric value, and more:
int cp = "é".codePointAt(0);
System.out.println(Character.getName(cp));
// LATIN SMALL LETTER E WITH ACUTE
System.out.println(Character.UnicodeScript.of(cp));
// LATIN
System.out.println(Character.isLetter(cp));
// true
System.out.println(Character.getType(cp));
// 2 (LOWERCASE_LETTER)
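Combining these calls, here is a sketch that produces one metadata line per code point (the dump helper name is mine; getName and UnicodeScript.of have been in the JDK since Java 7, and toList() needs Java 16+):

```java
import java.util.List;

public class UnicodeDump {
    // One "U+XXXX NAME (SCRIPT)" line per code point in the string.
    static List<String> dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X %s (%s)",
                        cp, Character.getName(cp), Character.UnicodeScript.of(cp)))
                .toList();
    }

    public static void main(String[] args) {
        // a, é, 😀
        UnicodeDump.dump("a\u00E9\uD83D\uDE00").forEach(System.out::println);
    }
}
```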
Trim by Unicode, not by char
Simple slicing can break a surrogate pair in half, producing invalid UTF-16:
String s = "Hi 😀 there";
String bad = s.substring(0, 4); // "Hi \uD83D": half an emoji
// Correct: convert to code points, slice, convert back
int[] cps = s.codePoints().toArray();
String good = new String(cps, 0, 4); // "Hi 😀"
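For truncation, offsetByCodePoints converts a code-point count back into a char index, which avoids materialising the intermediate array. A sketch (truncate is a hypothetical helper; offsetByCodePoints and codePointCount are standard String methods):

```java
public class SafeTruncate {
    // Truncate to at most maxCodePoints code points without ever
    // splitting a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (s.codePointCount(0, s.length()) <= maxCodePoints) return s;
        // Translate the code-point count into a char index for substring.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(truncate("Hi \uD83D\uDE00 there", 4)); // Hi 😀
    }
}
```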
Reverse a string with emoji intact
String reverseByCodePoint(String s) {
    int[] cps = s.codePoints().toArray();
    int[] reversed = new int[cps.length];
    for (int i = 0; i < cps.length; i++) reversed[i] = cps[cps.length - 1 - i];
    return new String(reversed, 0, reversed.length);
}
reverseByCodePoint("Hi 😀"); // "😀 iH": emoji preserved
new StringBuilder("Hi 😀").reverse().toString(); // also "😀 iH" on modern JDKs
Despite its reputation, StringBuilder.reverse() is specified to treat surrogate pairs as single characters, so it keeps emoji intact on any current JDK. The code-point version above is still useful when you want to transform the int[] (filter, map) before rebuilding the string.
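Even code-point reversal is not the last word: a user-perceived character (grapheme cluster) such as e + U+0301 or an emoji ZWJ sequence spans several code points, and reversing by code point tears it apart. A sketch of grapheme-aware reversal using the JDK's java.text.BreakIterator (the class name GraphemeReverse is mine; note that older JDKs may not treat emoji ZWJ sequences as single clusters):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GraphemeReverse {
    // Reverse by user-perceived character: collect the grapheme clusters
    // reported by BreakIterator, reverse the list, and rejoin.
    static String reverse(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        List<String> clusters = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            clusters.add(s.substring(start, end));
        }
        Collections.reverse(clusters);
        return String.join("", clusters);
    }

    public static void main(String[] args) {
        // "e" + combining acute (U+0301) stays attached after reversal
        System.out.println(reverse("xe\u0301y")); // "ye\u0301x", i.e. "yéx"
    }
}
```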
Normalisation considerations
The same visual character can have multiple code-point representations. For example, é can be U+00E9 (composed) or U+0065 U+0301 (e plus combining acute accent). Use Normalizer to get a canonical form:
import java.text.Normalizer;
String nfc = Normalizer.normalize(s, Normalizer.Form.NFC); // composed
String nfd = Normalizer.normalize(s, Normalizer.Form.NFD); // decomposed
If you're comparing user input against stored data, always normalise both sides first.
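A minimal sketch of normalised comparison, assuming NFC as the canonical form (the helper name unicodeEquals is mine; Normalizer is the standard java.text API):

```java
import java.text.Normalizer;

public class NormalizedEquals {
    // Normalise both sides to NFC before comparing, so composed and
    // decomposed spellings of the same text compare equal.
    static boolean unicodeEquals(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";    // café with U+00E9
        String decomposed = "cafe\u0301"; // café with e + U+0301
        System.out.println(composed.equals(decomposed));         // false
        System.out.println(unicodeEquals(composed, decomposed)); // true
    }
}
```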
Quick reference
| Task | API |
|---|---|
| Iterate BMP characters only | s.chars() |
| Iterate any Unicode (correct) | s.codePoints() |
| Get a single code point | s.codePointAt(index) |
| Count code points | s.codePointCount(0, s.length()) |
| Code point β String | new String(Character.toChars(cp)) |
| int[] β String | new String(points, 0, len) |
Whenever you handle user content that can contain emoji, languages beyond the BMP, or historical scripts, default to codePoints(): it's barely more verbose and spares you a category of rare but nasty bugs.