168 lines
6.2 KiB
Bash
168 lines
6.2 KiB
Bash
cat << EOF
|
|
.Dd ${MAN_DATE}
|
|
.Dt LIBGRAPHEME 7
|
|
.Os suckless.org
|
|
.Sh NAME
|
|
.Nm libgrapheme
|
|
.Nd unicode string library
|
|
.Sh SYNOPSIS
|
|
.In grapheme.h
|
|
.Sh DESCRIPTION
|
|
The
|
|
.Nm
|
|
library provides functions to properly handle Unicode strings according
|
|
to the Unicode specification in regard to character, word, sentence and
|
|
line segmentation and case detection and conversion.
|
|
.Pp
|
|
Unicode strings are made up of user-perceived characters (so-called
|
|
.Dq grapheme clusters ,
|
|
see
|
|
.Sx MOTIVATION )
|
|
that are composed of one or more Unicode codepoints, which in turn
|
|
are encoded in one or more bytes in an encoding like UTF-8.
|
|
.Pp
|
|
There is a widespread misconception that it was enough to simply
|
|
determine codepoints in a string and treat them as user-perceived
|
|
characters to be Unicode compliant.
|
|
While this may work in some cases, this assumption quickly breaks,
|
|
especially for non-Western languages and decomposed Unicode strings
|
|
where user-perceived characters are usually represented using multiple
|
|
codepoints.
|
|
.Pp
|
|
Despite this complicated multilevel structure of Unicode strings,
|
|
.Nm
|
|
provides methods to work with them at the byte-level (i.e. UTF-8
|
|
.Sq char
|
|
arrays) while also offering codepoint-level methods.
|
|
Additionally, it is a
|
|
.Dq freestanding
|
|
library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
|
|
a standard library. This makes it easy to use in bare metal environments.
|
|
.Pp
|
|
Every documented function's manual page provides a self-contained
|
|
example illustrating the possible usage.
|
|
.Sh SEE ALSO
|
|
.Xr grapheme_decode_utf8 3 ,
|
|
.Xr grapheme_encode_utf8 3 ,
|
|
.Xr grapheme_is_character_break 3 ,
|
|
.Xr grapheme_is_lowercase 3 ,
|
|
.Xr grapheme_is_lowercase_utf8 3 ,
|
|
.Xr grapheme_is_titlecase 3 ,
|
|
.Xr grapheme_is_titlecase_utf8 3 ,
|
|
.Xr grapheme_is_uppercase 3 ,
|
|
.Xr grapheme_is_uppercase_utf8 3 ,
|
|
.Xr grapheme_next_character_break 3 ,
|
|
.Xr grapheme_next_character_break_utf8 3 ,
|
|
.Xr grapheme_next_line_break 3 ,
|
|
.Xr grapheme_next_line_break_utf8 3 ,
|
|
.Xr grapheme_next_sentence_break 3 ,
|
|
.Xr grapheme_next_sentence_break_utf8 3 ,
|
|
.Xr grapheme_next_word_break 3 ,
|
|
.Xr grapheme_next_word_break_utf8 3 ,
|
|
.Xr grapheme_to_lowercase 3 ,
|
|
.Xr grapheme_to_lowercase_utf8 3 ,
|
|
.Xr grapheme_to_titlecase 3 ,
|
|
.Xr grapheme_to_titlecase_utf8 3
|
|
.Xr grapheme_to_uppercase 3 ,
|
|
.Xr grapheme_to_uppercase_utf8 3 ,
|
|
.Sh STANDARDS
|
|
.Nm
|
|
is compliant with the Unicode ${UNICODE_VERSION} specification.
|
|
.Sh MOTIVATION
|
|
The idea behind every character encoding scheme like ASCII or Unicode
|
|
is to express abstract characters (which can be thought of as shapes
|
|
making up a written language). ASCII for instance, which comprises the
|
|
range 0 to 127, assigns the number 65 (0x41) to the abstract character
|
|
.Sq A .
|
|
This number is called a
|
|
.Dq codepoint ,
|
|
and all codepoints of an encoding make up its so-called
|
|
.Dq code space .
|
|
.Pp
|
|
Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
|
|
first 128 codepoints are identical to ASCII's. The additional code
|
|
points are needed as Unicode's goal is to express all writing systems
|
|
of the world.
|
|
To give an example, the abstract character
|
|
.Sq \[u00C4]
|
|
is not expressable in ASCII, given no ASCII codepoint has been assigned
|
|
to it.
|
|
It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
|
|
.Pp
|
|
One may assume that this process is straightfoward, but as more and
|
|
more codepoints were assigned to abstract characters, the Unicode
|
|
Consortium (that defines the Unicode standard) was facing a problem:
|
|
Many (mostly non-European) languages have such a large amount of
|
|
abstract characters that it would exhaust the available Unicode code
|
|
space if one tried to assign a codepoint to each abstract character.
|
|
The solution to that problem is best introduced with an example: Consider
|
|
the abstract character
|
|
.Sq \[u01DE] ,
|
|
which is
|
|
.Sq A
|
|
with an umlaut and a macron added to it.
|
|
In this sense, one can consider
|
|
.Sq \[u01DE]
|
|
as a two-fold modification (namely
|
|
.Dq add umlaut
|
|
and
|
|
.Dq add macron )
|
|
of the
|
|
.Dq base character
|
|
.Sq A .
|
|
.Pp
|
|
The Unicode Consortium adapted this idea by assigning codepoints to
|
|
modifications.
|
|
For example, the codepoint 0x308 represents adding an umlaut and 0x304
|
|
represents adding a macron, and thus, the codepoint sequence
|
|
.Dq 0x41 0x308 0x304 ,
|
|
namely the base character
|
|
.Sq A
|
|
followed by the umlaut and macron modifiers, represents the abstract
|
|
character
|
|
.Sq \[u01DE] .
|
|
As a side-note, the single codepoint 0x1DE was also assigned to
|
|
.Sq \[u01DE] ,
|
|
which is a good example for the fact that there can be multiple
|
|
representations of a single abstract character in Unicode.
|
|
.Pp
|
|
Expressing a single abstract character with multiple codepoints solved
|
|
the code space exhaustion-problem, and the concept has been greatly
|
|
expanded since its first introduction (emojis, joiners, etc.). A sequence
|
|
(which can also have the length 1) of codepoints that belong together
|
|
this way and represents an abstract character is called a
|
|
.Dq grapheme cluster .
|
|
.Pp
|
|
In many applications it is necessary to count the number of
|
|
user-perceived characters, i.e. grapheme clusters, in a string.
|
|
A good example for this is a terminal text editor, which needs to
|
|
properly align characters on a grid.
|
|
This is pretty simple with ASCII-strings, where you just count the number
|
|
of bytes (as each byte is a codepoint and each codepoint is a grapheme
|
|
cluster).
|
|
With Unicode-strings, it is a common mistake to simply adapt the
|
|
ASCII-approach and count the number of code points.
|
|
This is wrong, as, for example, the sequence
|
|
.Dq 0x41 0x308 0x304 ,
|
|
while made up of 3 codepoints, is a single grapheme cluster and
|
|
represents the user-perceived character
|
|
.Sq \[u01DE] .
|
|
.Pp
|
|
The proper way to segment a string into user-perceived characters
|
|
is to segment it into its grapheme clusters by applying the Unicode
|
|
grapheme cluster breaking algorithm (UAX #29).
|
|
It is based on a complex ruleset and lookup-tables and determines if a
|
|
grapheme cluster ends or is continued between two codepoints.
|
|
Libraries like ICU and libunistring, which also offer this functionality,
|
|
are often bloated, not correct, difficult to use or not reasonably
|
|
statically linkable.
|
|
.Pp
|
|
Analogously, the standard provides algorithms to separate strings by
|
|
words, sentences and lines, convert cases and compare strings.
|
|
The motivation behind
|
|
.Nm
|
|
is to make unicode handling suck less and abide by the UNIX philosophy.
|
|
.Sh AUTHORS
|
|
.An Laslo Hunhold Aq Mt dev@frign.de
|
|
EOF
|