
The Publishing Project

Locale-aware string splitting

 

In JavaScript, the Intl object provides several locale-aware tools for working with strings, numbers, and dates.

One of those tools is Intl.Segmenter. A segmenter splits a string into locale-aware segments at a selectable granularity: grapheme, word, or sentence.

A grapheme is a single user-perceived character, regardless of how many code points it takes to represent it. "🫵" is one grapheme, and so is a space " ".
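As a rough illustration of the difference between graphemes and the units that a string's length property counts, the sketch below compares the two. The countGraphemes helper is a hypothetical name used only for this example, and the 'en' default locale is an assumption.

```javascript
// Hypothetical helper: counts user-perceived characters (graphemes).
function countGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  // segment() returns an iterable, so spread it into an array to count.
  return [...segmenter.segment(text)].length;
}

const pointing = '🫵';
console.log(pointing.length);          // 2 (UTF-16 code units)
console.log(countGraphemes(pointing)); // 1 (grapheme)

// "e" followed by a combining acute accent: two code points, one grapheme.
const accented = 'e\u0301';
console.log(accented.length);          // 2
console.log(countGraphemes(accented)); // 1
```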

Words and sentences are self-explanatory.
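For completeness, here is a minimal sketch of sentence granularity. The sample text is made up for this example and is not the one used in the rest of the post.

```javascript
const sentenceSegmenter = new Intl.Segmenter('es', {
  granularity: 'sentence',
});

// Sample text invented for this sketch.
const sampleText = 'Me gustas cuando callas. Estás como ausente.';

// Array.from accepts the iterable returned by segment() plus a mapping
// function, so we can collect just the text of each sentence.
const sentences = Array.from(
  sentenceSegmenter.segment(sampleText),
  (item) => item.segment
);

console.log(sentences.length); // 2
console.log(sentences);
```

Each sentence keeps its trailing whitespace, so joining the segments gives back the original string.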

We first create a segmenter with two arguments: a valid language tag and an options object whose granularity property selects the granularity that we want to use. For this example, we're using word as the granularity.

const segmenterEs = new Intl.Segmenter('es', {
  granularity: 'word',
});
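If you want to check what the segmenter was actually built with, resolvedOptions() reports the resolved locale and granularity. This is only a sketch. The defaultSegmenter name is hypothetical, and the exact locale string reported may vary by runtime.

```javascript
const segmenterEs = new Intl.Segmenter('es', { granularity: 'word' });

// resolvedOptions() returns the locale and granularity in effect.
const options = segmenterEs.resolvedOptions();
console.log(options.locale);
console.log(options.granularity); // "word"

// When no granularity is given, the default is "grapheme".
const defaultSegmenter = new Intl.Segmenter('es');
console.log(defaultSegmenter.resolvedOptions().granularity); // "grapheme"
```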

Next, we call the segmenter's segment() method on a string. It returns an iterable Segments object (not an array), which we assign to a constant so we can do something with the segments.

const segments = segmenterEs.segment(
  "Me gustas cuando callas porque estás como ausente"
);

Finally, we do something with the segments. In this case, I chose to use a for...of loop to iterate over the segments and log them to the console. Each item is an object whose segment property holds the text of that segment.

for (const segment of segments) {
  console.log(segment.segment);
}
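One thing to be aware of: with word granularity, the spaces between words are segments too, so the loop above logs them as well. Each item also carries an isWordLike flag, and a minimal sketch of using it to keep only the words might look like this (the text, allSegments and words names are just for this example):

```javascript
const segmenterEs = new Intl.Segmenter('es', { granularity: 'word' });
const text = 'Me gustas cuando callas porque estás como ausente';

// Spread the iterable into an array so we can use array methods.
const allSegments = [...segmenterEs.segment(text)];

// Keep only the word-like segments and extract their text.
const words = allSegments
  .filter((item) => item.isWordLike)
  .map((item) => item.segment);

console.log(allSegments.length); // 15 (8 words and 7 spaces)
console.log(words.length);       // 8
console.log(words);
```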

It should be just as easy to append the segments to an existing element or search the segments for a given string.
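As a sketch of the search case, the hypothetical findWord function below returns the starting index of every whole-word match, using the index property that each segment carries. The function name and its parameters are assumptions made for this example.

```javascript
// Hypothetical helper: returns the start index of each whole-word match.
function findWord(text, word, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const matches = [];
  for (const { segment, index, isWordLike } of segmenter.segment(text)) {
    // Skip spaces and punctuation; compare whole segments only.
    if (isWordLike && segment === word) {
      matches.push(index);
    }
  }
  return matches;
}

const line = 'Me gustas cuando callas porque estás como ausente';
console.log(findWord(line, 'callas', 'es')); // [ 17 ]
console.log(findWord(line, 'calla', 'es'));  // [] (partial words do not match)
```

Because the comparison is made per segment, this matches whole words only, which is the main difference from calling indexOf() on the string.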
