Initial Commit

2025-08-30 16:07:19 +01:00
commit d86c15e30c
169 changed files with 121377 additions and 0 deletions
--- a/libs/libgrapheme-2.0.2/man/libgrapheme.sh
+++ b/libs/libgrapheme-2.0.2/man/libgrapheme.sh
@@ -0,0 +1,167 @@
+cat << EOF
+.Dd ${MAN_DATE}
+.Dt LIBGRAPHEME 7
+.Os suckless.org
+.Sh NAME
+.Nm libgrapheme
+.Nd Unicode string library
+.Sh SYNOPSIS
+.In grapheme.h
+.Sh DESCRIPTION
+The
+.Nm
+library provides functions to properly handle Unicode strings according
+to the Unicode specification in regard to character, word, sentence and
+line segmentation and case detection and conversion.
+.Pp
+Unicode strings are made up of user-perceived characters (so-called
+.Dq grapheme clusters ,
+see
+.Sx MOTIVATION )
+that are composed of one or more Unicode codepoints, which in turn
+are encoded in one or more bytes in an encoding like UTF-8.
+.Pp
+There is a widespread misconception that it is enough to simply
+determine codepoints in a string and treat them as user-perceived
+characters to be Unicode compliant.
+While this may work in some cases, this assumption quickly breaks,
+especially for non-Western languages and decomposed Unicode strings
+where user-perceived characters are usually represented using multiple
+codepoints.
+.Pp
+Despite this complicated multilevel structure of Unicode strings,
+.Nm
+provides methods to work with them at the byte-level (i.e. UTF-8
+.Sq char
+arrays) while also offering codepoint-level methods.
+Additionally, it is a
+.Dq freestanding
+library (see ISO/IEC 9899:1999 section 4.6) and thus does not depend on
+a standard library. This makes it easy to use in bare-metal environments.
+.Pp
+Every documented function's manual page provides a self-contained
+example illustrating the possible usage.
+.Sh SEE ALSO
+.Xr grapheme_decode_utf8 3 ,
+.Xr grapheme_encode_utf8 3 ,
+.Xr grapheme_is_character_break 3 ,
+.Xr grapheme_is_lowercase 3 ,
+.Xr grapheme_is_lowercase_utf8 3 ,
+.Xr grapheme_is_titlecase 3 ,
+.Xr grapheme_is_titlecase_utf8 3 ,
+.Xr grapheme_is_uppercase 3 ,
+.Xr grapheme_is_uppercase_utf8 3 ,
+.Xr grapheme_next_character_break 3 ,
+.Xr grapheme_next_character_break_utf8 3 ,
+.Xr grapheme_next_line_break 3 ,
+.Xr grapheme_next_line_break_utf8 3 ,
+.Xr grapheme_next_sentence_break 3 ,
+.Xr grapheme_next_sentence_break_utf8 3 ,
+.Xr grapheme_next_word_break 3 ,
+.Xr grapheme_next_word_break_utf8 3 ,
+.Xr grapheme_to_lowercase 3 ,
+.Xr grapheme_to_lowercase_utf8 3 ,
+.Xr grapheme_to_titlecase 3 ,
+.Xr grapheme_to_titlecase_utf8 3 ,
+.Xr grapheme_to_uppercase 3 ,
+.Xr grapheme_to_uppercase_utf8 3
+.Sh STANDARDS
+.Nm
+is compliant with the Unicode ${UNICODE_VERSION} specification.
+.Sh MOTIVATION
+The idea behind every character encoding scheme like ASCII or Unicode
+is to express abstract characters (which can be thought of as shapes
+making up a written language). ASCII for instance, which comprises the
+range 0 to 127, assigns the number 65 (0x41) to the abstract character
+.Sq A .
+This number is called a
+.Dq codepoint ,
+and all codepoints of an encoding make up its so-called
+.Dq code space .
+.Pp
+Unicode's code space is much larger, ranging from 0 to 0x10FFFF, but its
+first 128 codepoints are identical to ASCII's. The additional
+codepoints are needed as Unicode's goal is to express all writing systems
+of the world.
+To give an example, the abstract character
+.Sq \[u00C4]
+is not expressible in ASCII, given no ASCII codepoint has been assigned
+to it.
+It can be expressed in Unicode, though, with the codepoint 196 (0xC4).
+.Pp
+One may assume that this process is straightforward, but as more and
+more codepoints were assigned to abstract characters, the Unicode
+Consortium (which defines the Unicode standard) faced a problem:
+Many (mostly non-European) languages have such a large number of
+abstract characters that it would exhaust the available Unicode code
+space if one tried to assign a codepoint to each abstract character.
+The solution to that problem is best introduced with an example: Consider
+the abstract character
+.Sq \[u01DE] ,
+which is
+.Sq A
+with an umlaut and a macron added to it.
+In this sense, one can consider
+.Sq \[u01DE]
+as a two-fold modification (namely
+.Dq add umlaut
+and
+.Dq add macron )
+of the
+.Dq base character
+.Sq A .
+.Pp
+The Unicode Consortium adopted this idea by assigning codepoints to
+modifications.
+For example, the codepoint 0x308 represents adding an umlaut and 0x304
+represents adding a macron, and thus, the codepoint sequence
+.Dq 0x41 0x308 0x304 ,
+namely the base character
+.Sq A
+followed by the umlaut and macron modifiers, represents the abstract
+character
+.Sq \[u01DE] .
+As a side note, the single codepoint 0x1DE was also assigned to
+.Sq \[u01DE] ,
+which is a good example of the fact that there can be multiple
+representations of a single abstract character in Unicode.
+.Pp
+Expressing a single abstract character with multiple codepoints solved
+the code space exhaustion problem, and the concept has been greatly
+expanded since its first introduction (emojis, joiners, etc.). A sequence
+(which can also have length 1) of codepoints that belong together
+this way and represent an abstract character is called a
+.Dq grapheme cluster .
+.Pp
+In many applications it is necessary to count the number of
+user-perceived characters, i.e. grapheme clusters, in a string.
+A good example of this is a terminal text editor, which needs to
+properly align characters on a grid.
+This is pretty simple with ASCII strings, where you just count the number
+of bytes (as each byte is a codepoint and each codepoint is a grapheme
+cluster).
+With Unicode strings, it is a common mistake to simply adapt the
+ASCII approach and count the number of codepoints.
+This is wrong, as, for example, the sequence
+.Dq 0x41 0x308 0x304 ,
+while made up of 3 codepoints, is a single grapheme cluster and
+represents the user-perceived character
+.Sq \[u01DE] .
+.Pp
+The proper way to segment a string into user-perceived characters
+is to segment it into its grapheme clusters by applying the Unicode
+grapheme cluster breaking algorithm (UAX #29).
+It is based on a complex ruleset and lookup tables and determines whether a
+grapheme cluster ends or is continued between two codepoints.
+Libraries like ICU and libunistring, which also offer this functionality,
+are often bloated, incorrect, difficult to use or not reasonably
+statically linkable.
+.Pp
+Analogously, the standard provides algorithms to separate strings by
+words, sentences and lines, convert cases and compare strings.
+The motivation behind
+.Nm
+is to make Unicode handling suck less and abide by the UNIX philosophy.
+.Sh AUTHORS
+.An Laslo Hunhold Aq Mt dev@frign.de
+EOF