String#grapheme_clusters
Some hawk-eyed readers of the article on String#[] pointed out a caveat: using brackets to reference a single character gives surprising results when that character spans multiple codepoints:
string = "🇯🇵"
string[0] # => "🇯"
string[1] # => "🇵"
string.chars # => ["🇯", "🇵"]
string.codepoints # => [127471, 127477]
const string = "🇯🇵";
string[0]; // => '\uD83C' // High surrogate
string[1]; // => '\uDDEF' // Low surrogate
string[2]; // => '\uD83C' // High surrogate
string[3]; // => '\uDDF5' // Low surrogate
string[0] + string[1]; // => "🇯"
string[2] + string[3]; // => "🇵"
[...string]; // => ["🇯", "🇵"]
[...string].map(c => c.codePointAt(0)); // => [127471, 127477]
As you can see, neither Ruby nor JavaScript returns the character that we see on screen when accessing a single "character" in these complex strings. This also causes practical issues with things like string length calculations and slicing.
To bridge the gap between bytes, encodings and "real-world" characters on screen, Ruby provides the String#grapheme_clusters method which returns an array of the actual characters that users would perceive on screen:
string = "🇯🇵"
string.length # => 2
string.chars # => ["🇯", "🇵"]
string.grapheme_clusters # => ["🇯🇵"]
string.grapheme_clusters.length # => 1
const string = "🇯🇵";
string.length; // => 4
[...string]; // => ["🇯", "🇵"]
const segmenter = new Intl.Segmenter(
undefined, { granularity: "grapheme" }
);
const segments = [...segmenter.segment(string)];
const graphemeClusters = segments.map(s => s.segment); // => ["🇯🇵"]
graphemeClusters.length; // => 1
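The slicing issue mentioned earlier can be sketched in Ruby as follows. This is only a minimal illustration: `truncate_graphemes` is a hypothetical helper name, not part of Ruby's standard library, and it assumes Ruby 2.5 or newer.

```ruby
# Hypothetical helper (not a built-in): keep the first `limit`
# user-perceived characters instead of the first `limit` codepoints.
def truncate_graphemes(text, limit)
  text.grapheme_clusters.first(limit).join
end

string = "🇯🇵!"

string[0, 1]                  # => "🇯"  (codepoint slice splits the flag)
truncate_graphemes(string, 1) # => "🇯🇵"
truncate_graphemes(string, 2) # => "🇯🇵!"
```

Joining the clusters back together is one simple way to get a string again; other designs (for example, returning the array) may suit your use case better.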
String#grapheme_clusters is actually just a convenience method for String#each_grapheme_cluster.to_a. So if you want to iterate over the grapheme clusters without creating an array first, you should use #each_grapheme_cluster directly. JavaScript also provides a grapheme iterator via Intl.Segmenter, but the API is quite a bit clunkier to use.
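As a rough sketch of what iterating directly might look like (the sample string is made up for illustration, and its final "é" is deliberately written as "e" plus a combining accent):

```ruby
text = "🇯🇵 cafe\u0301"

# With a block: visit each grapheme cluster in order.
text.each_grapheme_cluster do |cluster|
  puts "#{cluster.inspect} (#{cluster.length} codepoints)"
end

# Without a block you get an Enumerator, so Enumerable methods apply.
first_two = text.each_grapheme_cluster.first(2) # => ["🇯🇵", " "]
count     = text.each_grapheme_cluster.count    # => 6
```

Here `text.length` would report 8, since both the flag and the accented "é" consist of two codepoints each.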
History
String#grapheme_clusters was added in Ruby 2.5, released on Christmas Day 2017.