JANOS Help System
UTF-8 Encoding Reference

UTF-8 is a variable-length character encoding standard used for electronic communication. It is defined by the Unicode Standard, and its name is derived from Unicode Transformation Format 8-bit. UTF-8 can encode all 1,112,064 valid Unicode code points using one to four bytes.

Code point - UTF-8 conversion

First code   Last code   Byte1      Byte2      Byte3      Byte4
U+0000       U+007F      0xxxxxxx
U+0080       U+07FF      110xxxxx   10xxxxxx
U+0800       U+FFFF      1110xxxx   10xxxxxx   10xxxxxx
U+010000     U+10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

The bits of the code point's binary value replace the x positions, from the most significant bit on the left (in Byte1) to the least significant bit on the right (in the last byte), as needed.

ASCII characters in the range 0x00 to 0x7F are stored unchanged as a single byte. If Byte1 is larger than 0x7F, its most significant bit is 1, which indicates that additional bytes are used in the encoding. As the table above shows, the leading bits of Byte1 define how many bytes the encoding uses: 110 for two bytes, 1110 for three, and 11110 for four. Each additional byte begins with 10 and provides 6 more bits of the final binary value.

UTF-8 encodings used in the context of JNIOR are rarely more than two bytes long.

Note that JANOS offers a shortcut for selecting the appropriate Unicode character for common accenting. For instance, typing the base character 'e' and then pressing Ctrl-U twice toggles to the accented letter used in the word résumé.

[/flash/manpages/reference.hlp:769]
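
The conversion table can be followed by hand, and the short Python sketch below walks through the same steps for a single code point. It is an illustration only and not part of JANOS: the function name utf8_encode is hypothetical, and the surrogate check is an extra safeguard that the table does not mention.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the ranges in the conversion table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    # Assumption beyond the table: surrogates (U+D800..U+DFFF) are rejected,
    # since they are not valid code points for UTF-8.
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")

    if code_point <= 0x7F:
        # One byte: 0xxxxxxx (plain ASCII, stored unchanged)
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The accented letter in "résumé" is U+00E9, which needs two bytes.
print(utf8_encode(0x00E9).hex())  # c3a9
```

For U+00E9 the binary value is 11101001. The upper bits (00011) fill the x positions of 110xxxxx to give 0xC3, and the lower six bits (101001) fill 10xxxxxx to give 0xA9.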