Text Processing

This set of classes and functions extends the string and localization support of the C++ standard library to work with Unicode characters and strings. A Unicode character type and a string class (a specialization of std::basic_string) hold Unicode text. A set of functions transforms and classifies individual characters. Text can be converted, for example between different encodings, using I/O streams and text codecs. A regular expression class searches and matches patterns in Unicode strings. Localization facets are available on systems that support standard C++ locales.

Character Encodings

One of the most common standards for character encoding is ASCII. Each character is encoded using 7 bits of a byte, so 128 different characters can be addressed. Reading and writing ASCII characters is straightforward, because each character is stored in exactly one byte. The built-in C++ type char can be used to represent ASCII characters. The drawback of ASCII is its small character set of only 128 characters, far fewer than the languages of the world require.

This problem was addressed by the Unicode standard, which was created to make every known character of the world available in a single character table. Each character has a defined position in the table, a so-called code point. Code points currently range from 0 to 0x10FFFF, which needs 21 bits, so in practice a 32-bit type is used to represent a raw Unicode character.

The UTF-8 encoding was introduced to store Unicode characters in byte sequences that are compatible with classic null-terminated C strings. One Unicode character is encoded as a sequence of one to four bytes. Further, the characters are encoded such that a 7-bit ASCII character has exactly the same value in UTF-8, so any valid ASCII text is also valid UTF-8 encoded text. This demonstrates the difference between encodings and character types: ASCII and UTF-8 can both be represented by sequences of the character type char, but the values are interpreted according to the encoding. Besides UTF-8, many more encodings have been developed, for example Latin-1, UTF-16 or, in the broadest sense, Base64.
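
The variable-length scheme can be illustrated with a minimal sketch in standard C++ that does not depend on Pt. The function name encodeUtf8 is hypothetical and not part of the library, and char32_t is used here only as a stand-in for a raw code point:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Pt): encodes one code point as UTF-8.
std::string encodeUtf8(char32_t cp)
{
    // Code points above 0x10FFFF and UTF-16 surrogates cannot be encoded
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a valid Unicode scalar value");

    std::string out;
    if (cp < 0x80) {
        // 7-bit ASCII is stored unchanged in a single byte
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800) {
        // two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000) {
        // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else {
        // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encodeUtf8 with the code point of 'A' yields the single byte 'A', while the Euro sign (code point 0x20AC) yields the three bytes E2 82 AC. No byte of a multi-byte sequence is ever zero, which is what keeps the result usable as a null-terminated C string.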

Characters and Strings

The Unicode character type Pt::Char can directly represent a Unicode code point. It is used as the character type for Pt::String and Pt::StringStream. Characters can be classified or transformed using a set of functions similar to those found in the cctype header of the standard library:

Pt::Char ch = 'a';
// check character category
assert( isalpha(ch) );
assert( islower(ch) );
// convert to upper case
Pt::Char ch2 = toupper(ch);
assert( isupper(ch2) );

Pt::String is not yet another Unicode string class; it is a specialization of the std::basic_string template for the Unicode character type Pt::Char:

typedef std::basic_string<Pt::Char> String;

It offers all the functionality of the std::basic_string template. This has the advantage that all generic algorithms that work with std::basic_string should also work with Pt::String. Please refer to a standard C++ manual for a complete overview. Additional methods make it easier to work with other character types. For example, the relational operators are also overloaded for char and wchar_t.
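
As a rough illustration of this point, the following sketch applies generic standard algorithms to a std::basic_string with a wide character type. To stay compilable without Pt, it uses std::basic_string<char32_t> as a stand-in for Pt::String. That substitution, and the names UString, countChar and reversed, are assumptions made for this example only:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Stand-in for Pt::String so the sketch compiles without Pt (assumption):
// both are std::basic_string instantiations with a non-char character type.
typedef std::basic_string<char32_t> UString;

// Counts how often a character occurs, using a generic standard algorithm.
std::size_t countChar(const UString& s, char32_t ch)
{
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), ch));
}

// Returns the text reversed, character by character (not byte by byte).
UString reversed(UString s)
{
    std::reverse(s.begin(), s.end());
    return s;
}
```

Because every element of the string is a whole code point, size() reports the number of code points and algorithms such as std::reverse never split a multi-byte sequence, as they would on a UTF-8 encoded std::string.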

Since a specialization of std::char_traits is also provided, the C++ iostreams can be instantiated for Pt::Char, including the string streams. Three typedefs provide shorter names for the Unicode-capable string streams:

typedef std::basic_istringstream<Pt::Char> IStringStream;
typedef std::basic_ostringstream<Pt::Char> OStringStream;
typedef std::basic_stringstream<Pt::Char> StringStream;

The insertion and extraction operators (<< and >>) for iostreams require certain localization facets to be present in the std::locale. Pt will install specializations of std::num_put, std::num_get, std::numpunct and std::ctype for Pt::Char. This means that all other facilities that use localization facets will also work.
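
The role of these facets can be seen with any non-char stream. The sketch below uses the standard wide string streams, for which the standard library already ships the required facets, as a stand-in for the Pt::Char streams. The function names formatNumber and parseNumber are hypothetical:

```cpp
#include <sstream>
#include <string>

// Formats a number into wide text; the insertion operator delegates to
// the std::num_put and std::numpunct facets of the stream's locale.
std::wstring formatNumber(int n)
{
    std::wostringstream os;
    os << n;
    return os.str();
}

// Parses a number from wide text; the extraction operator delegates to
// the std::num_get and std::ctype facets of the stream's locale.
int parseNumber(const std::wstring& text)
{
    std::wistringstream is(text);
    int n = 0;
    is >> n;
    return n;
}
```

With the facets installed by Pt, the same insertion and extraction code is expected to work on Pt::OStringStream and Pt::IStringStream.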

Text Streams and Codecs

Text streams can be used to convert text between different encodings and character types. The text streams are derived from the I/O streams provided by the C++ standard library, so there is a stream type for input, for output and for both, respectively. The basic class templates look like this:

template <typename CharT, typename ByteT>
class BasicTextIStream : public std::basic_istream<CharT>...
template <typename CharT, typename ByteT>
class BasicTextOStream : public std::basic_ostream<CharT>...
template <typename CharT, typename ByteT>
class BasicTextStream : public std::basic_iostream<CharT>...

Text streams convert not only between text encodings, but also between character types of different sizes. The first template parameter is the character type of the decoded text and the second one is the character type of the encoded text. They are also called the internal and external character types, and they may be the same type. The internal character type is used as the character type of the standard C++ stream base class. Besides the three class templates, there are typedefs for the most common case in Pt, where the internal character type is Pt::Char and the external character type is char:

typedef BasicTextIStream<Char, char> TextIStream;
typedef BasicTextOStream<Char, char> TextOStream;
typedef BasicTextStream<Char, char> TextStream;

A text stream always works with another stream as input or output. A text input stream works with another std::basic_istream to read the encoded input. A text output stream needs another std::basic_ostream to write the encoded output. This underlying stream uses the external character type. For example, the TextIStream, which uses char as the external character type, can operate on any std::istream.

A Pt::TextCodec is used by text converters to encode and decode external byte sequences, hence the name codec. It implements the std::codecvt facet interface on systems that provide the std::locale facilities. Codecs are stateless, which means that one codec can be used with multiple text converters. A TextCodec is constructed with a reference count that indicates whether the converter or locale manages the lifetime of the codec. If that value is 0, as is the case when the TextCodec is default constructed, the text converter or locale will delete the codec.

Therefore, a default constructed TextCodec has to be created with new, as is the rule for all localization facets. This can be avoided by passing a value other than 0 to the codec's constructor, in which case the codec must exist at least as long as the stream that uses it:

Pt::Utf8Codec codec(1);
Pt::TextOStream tos(codec);
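
The same ownership rule applies to every standard localization facet, which can be shown without Pt. The following sketch defines a small facet derived from std::numpunct and installs it in both ways. The class GermanBool and the two function names are hypothetical and serve only to illustrate the reference count argument:

```cpp
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: prints booleans as German words. The refs argument
// is forwarded to the facet base class, as with any standard facet.
class GermanBool : public std::numpunct<char>
{
public:
    explicit GermanBool(std::size_t refs = 0) : std::numpunct<char>(refs) {}

protected:
    std::string do_truename() const override { return "wahr"; }
    std::string do_falsename() const override { return "falsch"; }
};

std::string formatOwnedByLocale(bool b)
{
    std::ostringstream os;
    // refs == 0: the locale deletes the facet once it is no longer used,
    // so the facet has to be created with new
    os.imbue(std::locale(os.getloc(), new GermanBool));
    os << std::boolalpha << b;
    return os.str();
}

std::string formatOwnedByCaller(bool b)
{
    // refs == 1: the locale never deletes the facet, so it may live on
    // the stack, but it must outlive every locale that refers to it
    GermanBool facet(1);
    std::ostringstream os;
    os.imbue(std::locale(os.getloc(), &facet));
    os << std::boolalpha << b;
    return os.str();
}
```

In formatOwnedByCaller the facet is declared before the stream, so the stream and its locale are destroyed first. The same ordering is needed when a codec constructed with a non-zero reference count is shared with a text stream.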

A text stream can be constructed with an underlying stream and a TextCodec, but both can also be set or reset later. If no codec is set, the stream will assign characters directly instead of converting them. If no target stream is set, the text stream will always be at EOF. The following example demonstrates how a string stream is used as the input for a text stream, which uses a Pt::Utf8Codec to decode UTF-8 encoded text:

std::istringstream iss("UTF-8 encoded text");
Pt::TextIStream tis(iss, new Pt::Utf8Codec());
Pt::String s;
std::getline(tis, s);

The std::getline function reads the input up to the first line break, here all of it, into a Pt::String. Of course, the extraction operator can also be used, for example, to directly read numbers from the stream. The next example shows how to encode text to a UTF-8 byte sequence:

std::ostringstream oss;
Pt::TextOStream tos(oss, new Pt::Utf8Codec());
Pt::String s = L"Hello World!";
tos << s;
tos.flush();

The string stream serves as the output of the text stream, which uses a Pt::Utf8Codec to encode text to UTF-8. The insertion operator can be used for strings or to format numbers. When all data has been written to the text stream, flush needs to be called to finish off the output byte sequence. This is especially important for encodings with shift states.

Regular Expressions

The Pt::Regex class matches a string pattern in Unicode text. It resembles the std::basic_regex class and can be used on systems where std::basic_regex is not available in the standard C++ implementation. The syntax of the match pattern is similar to the extended POSIX syntax. The following table shows the special characters that can be used to write regular expressions:

.      Any character
[ ]    A character in a given set
[^ ]   A character not in a given set
^      Beginning of a line
$      End of a line
\<     Beginning of a word
\>     End of a word
( )    A marked subexpression
*      Matches the preceding element zero or more times
?      Matches the preceding element zero or one time
+      Matches the preceding element one or more times
|      Matches either the expression before or the expression after the operator
\      Escapes the next character

The regular expression is constructed from a Unicode string, either a Pt::String or a null-terminated sequence of Unicode characters of type Pt::Char. It can then be matched against Unicode strings, as shown in the next example:

Pt::String expr = L"[hc]ats";
Pt::Regex regex(expr);
Pt::String str1 = L"I like cats!";
Pt::String str2 = L"I like hats!";
Pt::String str3 = L"I like bats!";
// this does match
bool matched = regex.match(str1);
// this does also match
matched = regex.match(str2);
// this does not match
matched = regex.match(str3);

It is also possible to match a regular expression against a Unicode input string and find out which tokens in the string actually matched. The match() member function has an overload that fills a Pt::RegexSMatch with the result. Note that the first result at index 0 is always the input string itself. The following example illustrates this:

// the backslash is doubled so that the regex sees an escaped dot
Pt::String expr = L"([0-9]+)\\.([0-9]+)\\.([0-9]+)\\.([0-9]+)";
Pt::Regex regex(expr);
Pt::String str = L"My IP address is 192.168.0.77";
Pt::RegexSMatch smatch;
bool matched = regex.match(str, smatch);
if(matched)
{
    std::cout << "IP: " << smatch.str(1).narrow() << std::endl;
}
else
{
    std::cout << "No IP in " << smatch.str(0).narrow() << std::endl;
}
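
For comparison, the same pattern can be tried with std::regex on systems where it is available. This is a sketch in standard C++ only, not Pt code, and the helper name matchGroups is hypothetical. It selects the POSIX extended grammar, which the text above describes as similar to the Pt::Regex syntax:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: searches text for pattern and returns all match
// results, or an empty vector if the pattern was not found.
std::vector<std::string> matchGroups(const std::string& pattern,
                                     const std::string& text)
{
    std::regex re(pattern, std::regex::extended);
    std::smatch m;
    std::vector<std::string> groups;
    if (std::regex_search(text, m, re))
    {
        // index 0 holds the matched text, 1..n the marked subexpressions
        for (std::size_t i = 0; i < m.size(); ++i)
            groups.push_back(m.str(i));
    }
    return groups;
}
```

Note one difference: std::smatch stores the matched portion of the text at index 0, whereas index 0 of a Pt::RegexSMatch is described above as the input string itself. The marked subexpressions start at index 1 in both cases.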

Base-64 Encoding

The Base64 encoding scheme is not a character encoding in the classical sense, but it works very similarly to other types of encodings. The framework provides a text codec named Pt::Base64Codec to convert to and from Base64 encoded text. It can be used with the basic text stream templates, where the internal and external character types are both char. The following example shows how text is converted to Base64:

std::ostringstream oss;
Pt::BasicTextOStream<char, char> b64(oss, new Pt::Base64Codec());
b64 << "Hello World!";
b64.flush();

The string stream serves as the output for the Base64 encoded text. The Base64 codec is used with a basic text stream to convert the string "Hello World!". Here, it is important to terminate the output sequence by calling flush, because the Base64 format requires padding at the end. Note that inserting std::endl will also terminate the Base64 sequence.
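
Why the final flush matters becomes clearer from the encoding itself: three input bytes are turned into four output characters, and an incomplete last group can only be padded once it is known that no more input follows. The following sketch in standard C++ shows the scheme. The function base64Encode is a hypothetical helper, not the implementation of Pt::Base64Codec:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Pt): Base64-encodes a byte string.
std::string base64Encode(const std::string& in)
{
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string out;
    std::size_t i = 0;

    // each complete group of 3 input bytes becomes 4 output characters
    for (; i + 2 < in.size(); i += 3)
    {
        unsigned n = (static_cast<unsigned char>(in[i]) << 16)
                   | (static_cast<unsigned char>(in[i + 1]) << 8)
                   |  static_cast<unsigned char>(in[i + 2]);
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += table[(n >> 6) & 63];
        out += table[n & 63];
    }

    // 1 or 2 leftover bytes are padded with '=' at the very end
    std::size_t rest = in.size() - i;
    if (rest == 1)
    {
        unsigned n = static_cast<unsigned char>(in[i]) << 16;
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += "==";
    }
    else if (rest == 2)
    {
        unsigned n = (static_cast<unsigned char>(in[i]) << 16)
                   | (static_cast<unsigned char>(in[i + 1]) << 8);
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += table[(n >> 6) & 63];
        out += '=';
    }
    return out;
}
```

The 12 bytes of "Hello World!" divide evenly into groups of three and encode to "SGVsbG8gV29ybGQh" without padding, whereas the two bytes of "Hi" encode to "SGk=" with one padding character.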