One of the most common standards for character encoding is ASCII. Each character is encoded using 7 bits of a byte, so 128 different characters can be addressed. Reading and writing ASCII characters is straightforward, because each character is stored in exactly one byte. No encoding or decoding is necessary to write or read ASCII text. All bytes can be read directly, stored in memory and then used. The drawback of ASCII, of course, is its small character set of only 128 characters. The languages of the world use far more characters than that.
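Because ASCII uses only 7 bits, a byte sequence is plain ASCII exactly when no byte has its highest bit set. The following is a minimal sketch of such a check using only the C++ standard library; the function name `isAscii` is chosen for illustration and is not part of Pt.

```cpp
#include <string>

// Hypothetical helper, not part of Pt.
// Returns true if every byte fits into 7 bits, i.e. the text is plain ASCII.
bool isAscii(const std::string& bytes)
{
    for (unsigned char byte : bytes)
    {
        if (byte > 0x7F)
            return false;
    }

    return true;
}
```

Text for which this check succeeds can be used byte by byte without any decoding step.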
This problem was addressed by the Unicode standard, which was created to make every known character of the world available in a single character table. Each character has a defined position in the table, a so-called code point. Unicode code points currently range from 0 to 0x10FFFF, which takes 21 bits, so a 32-bit type is used to represent an unencoded Unicode character. Therefore, the Unicode character type Pt::Char is a 32-bit type. For example, Pt::Char is used by Pt::String and Pt::StringStream.
The UTF-8 encoding was introduced to store Unicode characters in byte sequences that are compatible with classic null-terminated C strings. One Unicode character is encoded into a sequence of one to four bytes. Further, the characters are encoded such that a 7-bit ASCII character has the exact same value in UTF-8, so any valid ASCII text is also valid UTF-8 encoded text. Besides UTF-8, many more encodings have been developed, for example Base64, UTF-16 or Latin-1.
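To make the byte layout concrete, here is a minimal sketch of how a single code point can be turned into its UTF-8 byte sequence. It uses only the C++ standard library, with char32_t standing in for a 32-bit character type such as Pt::Char; the function name `encodeUtf8` is an assumption for illustration and does not show how Pt::Utf8Codec is implemented.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Pt.
// Encodes one Unicode code point as a UTF-8 sequence of 1 to 4 bytes.
std::string encodeUtf8(char32_t cp)
{
    std::string out;

    if (cp <= 0x7F)
    {
        // 7-bit ASCII is stored unchanged in a single byte.
        out += static_cast<char>(cp);
    }
    else if (cp <= 0x7FF)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0xFFFF)
    {
        // Surrogate code points are reserved for UTF-16 and are not characters.
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point");

        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF)
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        throw std::invalid_argument("code point out of range");
    }

    return out;
}
```

With this sketch, the letter A (U+0041) yields the single byte 0x41, while the euro sign (U+20AC) yields the three bytes 0xE2 0x82 0xAC. None of the multi-byte sequences contains a zero byte, which is why the result can still be stored in a null-terminated C string.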
Text streams can be used to convert text between different encodings and character types. The text streams are derived from the I/O streams provided by the C++ standard library, so there is a stream type for input, one for output and one for both. The basic class templates look like this:
Text streams not only convert between text encodings, but also between character types of different sizes. The first template parameter is the character type of the decoded text and the second one is the character type of the encoded text. They are also called the internal and external character types and may be the same type. The internal character type is used as the character type of the standard C++ stream base class. Besides the three class templates, there are typedefs for the most common case in Pt, where the internal character type is Pt::Char and the external character type is char:
A text stream always works with another stream as input or output. A text input stream works with another std::basic_istream to read the encoded input. A text output stream needs another std::basic_ostream to write the encoded output. This underlying stream uses the external character type. For example, the TextIStream, which uses char as the external character type, can operate on any std::istream.
A text codec is used by a text stream to perform the actual translation. One example is the Pt::Utf8Codec. The following example shows how to read UTF-8 encoded text:
A string stream is used as the input for the text stream, which uses a text codec to convert from UTF-8 to the raw Unicode character type. The std::getline function will read all input into a Pt::String. Of course, all extraction operators can also be used, for example, to directly read numbers from the stream. The next example shows how to encode text to a UTF-8 byte sequence:
The string stream serves as the output of the text stream, which uses the same type of codec as the input text stream before. This time, the codec is used to convert from the raw Unicode character type to UTF-8. When all data has been written to the output text stream, terminate needs to be called to finish off the output byte sequence. This is especially important for encodings with shift states. All insertion operators can be used with text output streams, for example, to format numbers.
The examples so far create the text codec on the heap with the new operator, and the stream manages the lifetime of the codec. This can be avoided by passing a value different from 0 to the codec's constructor, in which case the codec must exist at least as long as the stream that uses it:
Codecs are stateless, which means that one codec can be used with multiple text streams. On systems that provide the std::locale facilities, a text codec can be used as a std::codecvt facet.
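For readers unfamiliar with the std::codecvt interface, the following sketch shows how such a facet is driven directly. It does not use a Pt codec. Instead it assumes the standard specialization std::codecvt<char32_t, char, std::mbstate_t>, which converts between UTF-32 and UTF-8 in C++11 to C++17 and is deprecated as of C++20. The names `Utf8Facet` and `decodeUtf8` are hypothetical.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: std::codecvt has a protected destructor, so a
// derived type is needed to create a facet object directly.
struct Utf8Facet : std::codecvt<char32_t, char, std::mbstate_t>
{
};

// Decodes a complete UTF-8 byte sequence using the facet's in() function.
std::u32string decodeUtf8(const std::string& bytes)
{
    Utf8Facet facet;

    // The conversion state is kept outside of the facet object.
    std::mbstate_t state = std::mbstate_t();

    // A UTF-8 sequence never decodes to more code points than it has bytes.
    std::u32string decoded(bytes.size(), U'\0');

    const char* fromBegin = bytes.data();
    const char* fromEnd = fromBegin + bytes.size();
    const char* fromNext = fromBegin;

    char32_t* toBegin = &decoded[0];
    char32_t* toEnd = toBegin + decoded.size();
    char32_t* toNext = toBegin;

    const std::codecvt_base::result result =
        facet.in(state, fromBegin, fromEnd, fromNext, toBegin, toEnd, toNext);

    if (result != std::codecvt_base::ok)
        throw std::runtime_error("invalid or incomplete UTF-8 input");

    decoded.resize(static_cast<std::size_t>(toNext - toBegin));
    return decoded;
}
```

Note that the conversion state is passed in by the caller on every call rather than being stored in the facet, which is what allows a single facet object to serve several streams at once.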
The Base64 encoding scheme is not a character encoding in the classical sense, but works very similarly to other types of encodings. The framework provides a text codec named Pt::Base64Codec to convert to and from Base64 encoded text. It can be used with the basic text stream templates, where the internal and external character types are both char. The following example shows how text is converted to Base64:
The string stream serves as the output for the Base64 encoded text. The Base64 codec is used with a basic text stream to convert the string "Hello World!". This time it is important to terminate the output sequence, because the Base64 format can require padding at the end.
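Why the final step matters can be seen in a plain Base64 encoder. The following is a minimal sketch using only the C++ standard library, not the implementation of Pt::Base64Codec; the function name `base64Encode` is an assumption. Input is processed in groups of three bytes, and only at the very end is it known whether one or two bytes are left over and need padding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Pt.
// Encodes a byte sequence as Base64 with '=' padding.
std::string base64Encode(const std::string& input)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string output;
    std::size_t i = 0;

    // Each complete group of 3 bytes becomes 4 output characters.
    for (; i + 2 < input.size(); i += 3)
    {
        const unsigned n = (static_cast<unsigned char>(input[i]) << 16)
                         | (static_cast<unsigned char>(input[i + 1]) << 8)
                         |  static_cast<unsigned char>(input[i + 2]);

        output += alphabet[(n >> 18) & 63];
        output += alphabet[(n >> 12) & 63];
        output += alphabet[(n >> 6) & 63];
        output += alphabet[n & 63];
    }

    // Finishing step: encode the 1 or 2 leftover bytes and pad the group.
    const std::size_t rest = input.size() - i;
    if (rest == 1)
    {
        const unsigned n = static_cast<unsigned char>(input[i]) << 16;

        output += alphabet[(n >> 18) & 63];
        output += alphabet[(n >> 12) & 63];
        output += "==";
    }
    else if (rest == 2)
    {
        const unsigned n = (static_cast<unsigned char>(input[i]) << 16)
                         | (static_cast<unsigned char>(input[i + 1]) << 8);

        output += alphabet[(n >> 18) & 63];
        output += alphabet[(n >> 12) & 63];
        output += alphabet[(n >> 6) & 63];
        output += '=';
    }

    return output;
}
```

In this sketch, "Hello World!" is 12 bytes long and encodes to "SGVsbG8gV29ybGQh" without padding, whereas "Hello" encodes to "SGVsbG8=". A streaming encoder cannot know which case applies until it is told that no more input follows, which is the role of the terminating call described above.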