Vault 4.1
VCodePoint stores a Unicode code point, which is similar to a 'char' except that the range of values is vastly larger than what fits in one byte.
#include <vcodepoint.h>
Public Member Functions | |
VCodePoint (int i) | |
Creates the code point by specifying the integer value. | |
VCodePoint (char c) | |
Creates the code point by specifying a C char value. | |
VCodePoint (const VChar &c) | |
Creates the code point by specifying a VChar, which wraps a C char value. | |
VCodePoint (const VString &hexNotation) | |
Creates the code point by specifying the Unicode "U+" notational value. | |
VCodePoint (const Vu8 *buffer, int startOffset) | |
Creates the code point by examining a byte buffer at a specified offset, where there exists a valid UTF-8 formatted code point. | |
VCodePoint (VBinaryIOStream &stream) | |
Creates the code point by reading one or more bytes from the supplied stream, where the stream contains a valid UTF-8 formatted code point. | |
VCodePoint (VTextIOStream &utf8Stream) | |
Creates the code point by reading one or more bytes from the supplied stream, where the stream contains a valid UTF-8 formatted code point. | |
VCodePoint (const std::wstring &utf16WideString, int atIndex) | |
Creates the code point by reading one or two code units from the supplied wide string, where the string contains a valid UTF-16 formatted code point. | |
int | getUTF8Length () const |
Returns the length of this code point if it is formatted as UTF-8. | |
int | getUTF16Length () const |
Returns the number of code units in this code point if it is formatted as UTF-16. | |
int | intValue () const |
Returns the code point integer value. | |
VString | toString () const |
Returns a VString, that is to say the UTF-8 form of the code point as a short VString of 1 to 4 bytes. | |
VChar | toASCIIChar () const |
Returns a VChar containing the character value if it is ASCII (code points 0 through 127), or throws a VRangeException if not. | |
std::wstring | toUTF16WideString () const |
Returns a std::wstring in UTF-16 format, that is to say a small array of one or two UTF-16 code units. | |
void | writeToBinaryStream (VBinaryIOStream &stream) const |
Writes the code point to a binary stream in UTF-8 form (1 to 4 bytes). | |
bool | isNull () const |
Returns true if the code point is value zero. | |
bool | isNotNull () const |
The inverse of isNull(). | |
bool | isASCII () const |
Returns true if this code point represents an ASCII value (code points 0 through 127). | |
bool | isWhitespace () const |
Avoid using these. | |
bool | isAlpha () const |
bool | isNumeric () const |
bool | isAlphaNumeric () const |
bool | isHexadecimal () const |
Static Public Member Functions | |
static int | getUTF8LengthFromUTF8StartByte (Vu8 startByte) |
Returns the UTF-8 length of a code point given the first UTF-8 byte of the code point. | |
static int | getUTF8LengthFromCodePointValue (int intValue) |
Returns the UTF-8 length of a code point given the code point's integer value. | |
static bool | isUTF8ContinuationByte (Vu8 byteValue) |
Returns true if the specified byte from a UTF-8 byte stream is a continuation byte; that is to say, if it is not a byte value that starts a code point sequence. | |
static int | countUTF8CodePoints (const Vu8 *buffer, int numBytes) |
Returns the number of code points in the specified UTF-8 byte stream. | |
static int | getPreviousUTF8CodePointOffset (const Vu8 *buffer, int offset) |
Returns the offset of the previous UTF-8 code point start, given the offset of the current code point. | |
static int | getUTF16LengthFromCodePointValue (int intValue) |
Returns the UTF-16 length of a code point given the code point's integer value. | |
static bool | isUTF16SurrogateCodeUnit (wchar_t codeUnit) |
Returns true if the specified code unit from a UTF-16 sequence is a surrogate; that is to say, if it is part of a 2-unit sequence rather than being itself a complete 1-unit sequence. | |
Friends | |
bool | operator== (const VCodePoint &p1, const VCodePoint &p2) |
Returns true if the two code points are the same. | |
bool | operator== (const VCodePoint &cp, char c) |
bool | operator== (char c, const VCodePoint &cp) |
VCodePoint stores a Unicode code point, which is similar to a 'char' except that the range of values is vastly larger than what fits in one byte.
Because we often trade in UTF-8 (especially for VString), we have helper methods here for getting the length of the code point if it were represented in UTF-8, as well as the ability to create a (small) VString containing the code point in UTF-8. Ideally, you should deprecate use of char and VChar (at least when building and examining VString objects) in favor of VCodePoint and VString::iterator which understands UTF-8. This will make it easier to manipulate UTF-8 VStrings with proper semantics.
Definition at line 34 of file vcodepoint.h.
VCodePoint::VCodePoint | ( | int | i | ) | [explicit] |
Creates the code point by specifying the integer value.
For example, ASCII 'a' is 97 or 0x61, and GREEK CAPITAL LETTER OMEGA is 937 or 0x03A9.
i | the code point integer value |
Definition at line 20 of file vcodepoint.cpp.
VCodePoint::VCodePoint | ( | char | c | ) | [explicit] |
Creates the code point by specifying a C char value.
Because a char is a single byte, this will only work for values < 256. Non-ASCII values ( > 127) will be interpreted as the code point having the same value.
c | the C char whose value to treat as a code point |
Definition at line 27 of file vcodepoint.cpp.
VCodePoint::VCodePoint | ( | const VChar & | c | ) | [explicit] |
Creates the code point by specifying a VChar, which wraps a C char value.
Because a char is a single byte, this will only work for values < 256. Non-ASCII values ( > 127) will be interpreted as the code point having the same value.
c | the VChar whose value to treat as a code point |
Definition at line 34 of file vcodepoint.cpp.
VCodePoint::VCodePoint | ( | const VString & | hexNotation | ) | [explicit] |
Creates the code point by specifying the Unicode "U+" notational value.
This constructor makes the "U+" prefix optional, though it is recommended to include it for clarity in your code. You do NOT need to supply an even number of digits by prepending a zero. For example, ASCII 'a' is "U+61", and GREEK CAPITAL LETTER OMEGA is "U+03A9".
hexNotation | the Unicode code point string to interpret |
Definition at line 41 of file vcodepoint.cpp.
References VString::chars(), getUTF16LengthFromCodePointValue(), getUTF8LengthFromCodePointValue(), VHex::hexCharsToByte(), intValue(), VString::length(), and VString::startsWith().
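The parsing rule described above (optional "U+" prefix, any number of hex digits) can be sketched without Vault. This is only an illustration: `parseCodePointNotation` is a hypothetical helper, it uses std::string rather than VString, and the exceptions it throws are assumptions, not what the VCodePoint constructor does on bad input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Vault: converts "U+03A9" or "3A9" to 937.
int parseCodePointNotation(const std::string& notation) {
    std::size_t start = 0;
    // The "U+" prefix is optional.
    if (notation.size() >= 2 && notation[0] == 'U' && notation[1] == '+') {
        start = 2;
    }
    if (start == notation.size()) {
        throw std::invalid_argument("no hex digits supplied");
    }
    int value = 0;
    for (std::size_t i = start; i < notation.size(); ++i) {
        const char c = notation[i];
        int digit = 0;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else throw std::invalid_argument("not a hex digit");
        value = value * 16 + digit;
        // Unicode code points end at U+10FFFF; checking here also prevents overflow.
        if (value > 0x10FFFF) {
            throw std::out_of_range("beyond U+10FFFF");
        }
    }
    return value;
}
```

An odd number of digits is accepted as-is, matching the note that no leading zero is required.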
VCodePoint::VCodePoint | ( | const Vu8 * | buffer, |
int | startOffset | ||
) | [explicit] |
Creates the code point by examining a byte buffer at a specified offset, where there exists a valid UTF-8 formatted code point.
For example, if the code point is ASCII it will be a single byte; otherwise, the first byte will be the start of a 2- to 4-byte UTF-8 sequence.
buffer | a pointer to a byte buffer to examine |
startOffset | the offset into buffer, from which to start, and where there must exist a valid UTF-8 formatted code point |
Definition at line 104 of file vcodepoint.cpp.
References getUTF8LengthFromUTF8StartByte().
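As a rough illustration of what reading a code point from a byte buffer involves, the sketch below decodes standard UTF-8 at an offset. `decodeUtf8At` is a hypothetical free function, `std::uint8_t` stands in for Vu8, and it assumes (as the constructor does) that a well-formed sequence exists at the offset; it is not Vault's implementation.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper, not part of Vault: decodes the UTF-8 sequence that
// starts at buffer[startOffset] and returns the code point value.
int decodeUtf8At(const std::uint8_t* buffer, int startOffset) {
    const std::uint8_t* p = buffer + startOffset;
    const std::uint8_t b0 = p[0];
    if (b0 < 0x80) {                       // 0xxxxxxx: ASCII, one byte
        return b0;
    }
    if ((b0 & 0xE0) == 0xC0) {             // 110xxxxx: two bytes
        return ((b0 & 0x1F) << 6) | (p[1] & 0x3F);
    }
    if ((b0 & 0xF0) == 0xE0) {             // 1110xxxx: three bytes
        return ((b0 & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    }
    if ((b0 & 0xF8) == 0xF0) {             // 11110xxx: four bytes
        return ((b0 & 0x07) << 18) | ((p[1] & 0x3F) << 12) |
               ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    }
    // A continuation byte or an invalid byte cannot start a sequence.
    throw std::invalid_argument("not a UTF-8 start byte");
}
```

The sketch does not check that the trailing bytes really are continuation bytes, so it is only safe on input already known to be valid UTF-8.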
VCodePoint::VCodePoint | ( | VBinaryIOStream & | stream | ) | [explicit] |
Creates the code point by reading one or more bytes from the supplied stream, where the stream contains a valid UTF-8 formatted code point.
For example, if the code point is ASCII it will be a single byte; otherwise, the first byte will be the start of a 2- to 4-byte UTF-8 sequence. Note that VCodePoint treats binary and text streams the same since UTF-8 can be viewed as a space-efficient binary encoding.
stream | the stream to read from |
Definition at line 120 of file vcodepoint.cpp.
References getUTF8LengthFromUTF8StartByte(), and VBinaryIOStream::readU8().
VCodePoint::VCodePoint | ( | VTextIOStream & | utf8Stream | ) | [explicit] |
Creates the code point by reading one or more bytes from the supplied stream, where the stream contains a valid UTF-8 formatted code point.
For example, if the code point is ASCII it will be a single byte; otherwise, the first byte will be the start of a 2- to 4-byte UTF-8 sequence. Note that VCodePoint treats binary and text streams the same since UTF-8 can be viewed as a space-efficient binary encoding.
utf8Stream | the stream to read from |
Definition at line 130 of file vcodepoint.cpp.
References getUTF8LengthFromUTF8StartByte(), and VIOStream::readGuaranteedByte().
VCodePoint::VCodePoint | ( | const std::wstring & | utf16WideString, |
int | atIndex | ||
) | [explicit] |
Creates the code point by reading one or two code units from the supplied wide string, where the string contains a valid UTF-16 formatted code point.
A UTF-16 code point is encoded as a single code unit if it lies in the Basic Multilingual Plane (U+0000 through U+FFFF), or as two code units (a surrogate pair) otherwise. If the wide string ends in the middle of a two-unit code point, attempting to read that split code point throws a VEOFException.
utf16WideString | the UTF-16 encoded wide string to read |
atIndex | the index into the string from which to read the one or two code units |
Definition at line 140 of file vcodepoint.cpp.
References isUTF16SurrogateCodeUnit().
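The one-or-two-unit read can be sketched as follows. `decodeUtf16At` is a hypothetical helper; it throws `std::out_of_range` where the constructor is documented to throw VEOFException, and the `std::invalid_argument` for a bad trail unit is purely an assumption of this sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Vault: reads one code point from a
// wide string holding UTF-16 code units, starting at atIndex.
int decodeUtf16At(const std::wstring& utf16, std::size_t atIndex) {
    // at() throws std::out_of_range if atIndex is past the end.
    const unsigned long lead = static_cast<unsigned long>(utf16.at(atIndex));
    if (lead < 0xD800UL || lead > 0xDBFFUL) {
        // Not a lead surrogate: the unit is the code point.
        // (A lone trail surrogate is passed through unvalidated.)
        return static_cast<int>(lead);
    }
    if (atIndex + 1 >= utf16.size()) {
        // Stand-in for the VEOFException case: the pair is split by the end.
        throw std::out_of_range("string ends after a lead surrogate");
    }
    const unsigned long trail = static_cast<unsigned long>(utf16[atIndex + 1]);
    if (trail < 0xDC00UL || trail > 0xDFFFUL) {
        throw std::invalid_argument("lead surrogate without a trail surrogate");
    }
    // Combine the two 10-bit halves and add the supplementary-plane base.
    return static_cast<int>(0x10000UL + ((lead - 0xD800UL) << 10) + (trail - 0xDC00UL));
}
```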
int VCodePoint::getUTF8Length | ( | ) | const [inline] |
Returns the length of this code point if it is formatted as UTF-8.
The answer is always in the range 1 to 4.
Definition at line 105 of file vcodepoint.h.
int VCodePoint::getUTF16Length | ( | ) | const [inline] |
Returns the number of code units in this code point if it is formatted as UTF-16.
The answer is always 1 or 2.
Definition at line 110 of file vcodepoint.h.
int VCodePoint::intValue | ( | ) | const [inline] |
VString VCodePoint::toString | ( | ) | const |
Returns a VString, that is to say the UTF-8 form of the code point as a short VString of 1 to 4 bytes.
This is how you take a code point and turn it into a string that can be inserted or appended into another, longer, string.
Definition at line 165 of file vcodepoint.cpp.
References getUTF8LengthFromCodePointValue().
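To make the "1 to 4 bytes" concrete, here is a minimal sketch of standard UTF-8 encoding. `encodeUtf8` is a hypothetical helper that returns a std::string rather than a VString, and the range check is an assumption of the sketch rather than documented VCodePoint behavior.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Vault: returns the UTF-8 bytes
// (1 to 4 of them) for a code point value.
std::string encodeUtf8(int cp) {
    if (cp < 0 || cp > 0x10FFFF) {
        throw std::out_of_range("not a Unicode code point");
    }
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Appending the result to a longer std::string mirrors how the short VString from toString() is meant to be appended to another VString.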
VChar VCodePoint::toASCIIChar | ( | ) | const |
Returns a VChar containing the character value if it is ASCII (code points 0 through 127), or throws a VRangeException if not.
Unless you prefer to catch the exception, you should normally call isASCII() before invoking this conversion. The primary use case is when you are parsing a string and looking for specific ASCII syntax in it; conversion to VChar lets you 'switch' on single-quoted char constants.
Definition at line 206 of file vcodepoint.cpp.
References isASCII().
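The check-then-switch pattern described above might look like the following. The sketch works on a plain integer code point value so that it compiles without Vault; `Syntax` and `classifyCodePoint` are hypothetical names, and the range test plays the role of isASCII() before the conversion to a char.

```cpp
// Hypothetical example, not part of Vault: classify a code point while
// parsing, switching on single-quoted char constants.
enum class Syntax { OpenBrace, CloseBrace, Comma, Other };

Syntax classifyCodePoint(int codePoint) {
    // Equivalent of calling isASCII() first: only 0 through 127 convert safely.
    if (codePoint < 0 || codePoint > 127) {
        return Syntax::Other;
    }
    switch (static_cast<char>(codePoint)) {
        case '{': return Syntax::OpenBrace;
        case '}': return Syntax::CloseBrace;
        case ',': return Syntax::Comma;
        default:  return Syntax::Other;
    }
}
```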
std::wstring VCodePoint::toUTF16WideString | ( | ) | const |
Returns a std::wstring in UTF-16 format, that is to say a small array of one or two UTF-16 code units.
This is how you take a code point and turn it into a wstring that can be inserted or appended into another, longer, wstring. If you need to interface with Windows "wide" APIs, wstring is used; normally you will just take a VString and get the entire wstring from it via VString::toUTF16().
Definition at line 242 of file vcodepoint.cpp.
References getUTF16LengthFromCodePointValue().
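For reference, the standard UTF-16 encoding that produces the one or two code units can be sketched like this. `encodeUtf16` is a hypothetical helper, not Vault's implementation, and it assumes the input is a valid code point no greater than 0x10FFFF.

```cpp
#include <string>

// Hypothetical helper, not part of Vault: returns one or two UTF-16
// code units for a code point value.
std::wstring encodeUtf16(int cp) {
    std::wstring out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: the value is the single code unit.
        out.push_back(static_cast<wchar_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        const int v = cp - 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (v >> 10)));   // lead
        out.push_back(static_cast<wchar_t>(0xDC00 + (v & 0x3FF))); // trail
    }
    return out;
}
```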
void VCodePoint::writeToBinaryStream | ( | VBinaryIOStream & | stream | ) | const |
Writes the code point to a binary stream in UTF-8 form (1 to 4 bytes).
stream | the stream to write to |
Definition at line 267 of file vcodepoint.cpp.
References getUTF8LengthFromCodePointValue(), and VBinaryIOStream::writeU8().
bool VCodePoint::isNull | ( | ) | const [inline] |
Returns true if the code point is value zero.
This corresponds to an ASCII NUL character, and might be useful in rare cases where you are reading a null terminator from a buffer, or when initializing a code point with value zero and checking it later.
Definition at line 150 of file vcodepoint.h.
bool VCodePoint::isNotNull | ( | ) | const [inline] |
The inverse of isNull().
Definition at line 155 of file vcodepoint.h.
bool VCodePoint::isASCII | ( | ) | const [inline] |
Returns true if this code point represents an ASCII value (code points 0 through 127).
This test should always be made prior to calling toASCIIChar() since that method will throw a range exception if the code point is not an ASCII character, unless you intend to catch such exceptions.
Definition at line 163 of file vcodepoint.h.
bool VCodePoint::isWhitespace | ( | ) | const |
Avoid using these.
This is a temporary bridge from VChar / char, in code migrating to VCodePoint.
Definition at line 214 of file vcodepoint.cpp.
References intValue().
int VCodePoint::getUTF8LengthFromUTF8StartByte | ( | Vu8 | startByte | ) | [static] |
Returns the UTF-8 length of a code point given the first UTF-8 byte of the code point.
The length can be deduced directly from the value of the byte.
startByte | the first byte of a 1- to 4-byte UTF-8 code point sequence |
Definition at line 299 of file vcodepoint.cpp.
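The deduction relies on the high bits of the start byte, as defined by UTF-8. A minimal sketch follows; `utf8LengthFromStartByte` is a hypothetical name, and returning 0 for a byte that cannot start a sequence is an assumption of the sketch, since this page does not say how the Vault function handles that case.

```cpp
#include <cstdint>

// Hypothetical helper, not part of Vault: sequence length implied by the
// first byte of a UTF-8 sequence, or 0 if the byte cannot start one.
int utf8LengthFromStartByte(std::uint8_t startByte) {
    if (startByte < 0x80) return 1;            // 0xxxxxxx
    if ((startByte & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((startByte & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((startByte & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                                  // continuation or invalid byte
}
```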
int VCodePoint::getUTF8LengthFromCodePointValue | ( | int | intValue | ) | [static] |
Returns the UTF-8 length of a code point given the code point's integer value.
intValue | a code point value |
Definition at line 319 of file vcodepoint.cpp.
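The thresholds come from how many payload bits each UTF-8 sequence length can carry (7, 11, 16, and 21 bits). A small sketch, with `utf8LengthFromValue` as a hypothetical name and valid input (0 through 0x10FFFF) assumed:

```cpp
// Hypothetical helper, not part of Vault: number of UTF-8 bytes needed
// for a code point value, assuming 0 <= value <= 0x10FFFF.
int utf8LengthFromValue(int value) {
    if (value < 0x80) return 1;     // up to 7 bits
    if (value < 0x800) return 2;    // up to 11 bits
    if (value < 0x10000) return 3;  // up to 16 bits
    return 4;                       // up to 21 bits
}
```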
bool VCodePoint::isUTF8ContinuationByte | ( | Vu8 | byteValue | ) | [static] |
Returns true if the specified byte from a UTF-8 byte stream is a continuation byte; that is to say, if it is not a byte value that starts a code point sequence.
byteValue | a byte from a UTF-8 byte stream |
Definition at line 334 of file vcodepoint.cpp.
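In UTF-8, continuation bytes are exactly those whose top two bits are `10`. A one-line sketch of the test (hypothetical name, `std::uint8_t` standing in for Vu8):

```cpp
#include <cstdint>

// Hypothetical helper, not part of Vault: true for bytes of the form 10xxxxxx.
bool isContinuationByte(std::uint8_t byteValue) {
    return (byteValue & 0xC0) == 0x80;
}
```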
int VCodePoint::countUTF8CodePoints | ( | const Vu8 * | buffer, |
int | numBytes | ||
) | [static] |
Returns the number of code points in the specified UTF-8 byte stream.
buffer | the UTF-8 byte buffer to examine |
numBytes | the number of bytes in the buffer to examine |
Definition at line 340 of file vcodepoint.cpp.
References getUTF8Length().
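One simple way to count code points in valid UTF-8 is to count the bytes that are not continuation bytes, since every code point has exactly one start byte. The sketch below uses that technique; it is not necessarily how Vault counts (the reference to getUTF8Length() above suggests it may step through the buffer by sequence length instead), and `countCodePoints` is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical helper, not part of Vault: counts code points in a valid
// UTF-8 buffer by counting start bytes.
int countCodePoints(const std::uint8_t* buffer, int numBytes) {
    int count = 0;
    for (int i = 0; i < numBytes; ++i) {
        // Bytes of the form 10xxxxxx continue a sequence; all others start one.
        if ((buffer[i] & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```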
int VCodePoint::getPreviousUTF8CodePointOffset | ( | const Vu8 * | buffer, |
int | offset | ||
) | [static] |
Returns the offset of the previous UTF-8 code point start, given the offset of the current code point.
The answer should be 1 to 4 bytes less than the specified offset, since UTF-8 uses 1 to 4 bytes per code point. You must not call this function with offset 0 in a physical buffer, because the function would look "left" of the valid buffer memory. Of course, if the buffer parameter points inside a physical buffer (not to the start of one) then that caveat does not apply.
buffer | the buffer to examine |
offset | the offset to start looking "left" of in the buffer |
Definition at line 353 of file vcodepoint.cpp.
References isUTF8ContinuationByte().
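Looking "left" amounts to stepping back over continuation bytes until a start byte is found. A minimal sketch, under the assumption that `buffer` points at the physical start of the memory (so the scan stops at index 0) and that `offset` is greater than 0; `previousCodePointOffset` is a hypothetical name:

```cpp
#include <cstdint>

// Hypothetical helper, not part of Vault: offset of the code point that
// precedes the one starting at 'offset'. Requires offset > 0.
int previousCodePointOffset(const std::uint8_t* buffer, int offset) {
    int i = offset - 1;
    // Step left while the byte is a continuation byte (10xxxxxx).
    // The i > 0 guard is this sketch's assumption that buffer[0] is the
    // first valid byte of memory.
    while (i > 0 && (buffer[i] & 0xC0) == 0x80) {
        --i;
    }
    return i;
}
```

On valid UTF-8 the loop steps over at most three continuation bytes, which is why the result is 1 to 4 bytes less than the given offset.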
int VCodePoint::getUTF16LengthFromCodePointValue | ( | int | intValue | ) | [static] |
Returns the UTF-16 length of a code point given the code point's integer value.
intValue | a code point value |
Definition at line 380 of file vcodepoint.cpp.
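In UTF-16, code points up to U+FFFF take one code unit and everything above takes two. As a sketch (hypothetical name, valid input assumed):

```cpp
// Hypothetical helper, not part of Vault: number of UTF-16 code units
// needed for a code point value, assuming 0 <= value <= 0x10FFFF.
int utf16LengthFromValue(int value) {
    return (value < 0x10000) ? 1 : 2;
}
```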
bool VCodePoint::isUTF16SurrogateCodeUnit | ( | wchar_t | codeUnit | ) | [static] |
Returns true if the specified code unit from a UTF-16 sequence is a surrogate; that is to say, if it is part of a 2-unit sequence rather than being itself a complete 1-unit sequence.
Generally you will use this to identify the lead surrogate, and then read the trail surrogate by getting the next unit in the sequence.
codeUnit | a code unit from a UTF-16 sequence |
Definition at line 364 of file vcodepoint.cpp.
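UTF-16 reserves 0xD800 through 0xDFFF for surrogates: 0xD800 through 0xDBFF are lead surrogates and 0xDC00 through 0xDFFF are trail surrogates. The sketch below shows the range tests; the function names are hypothetical, and whether the Vault function accepts both halves or only leads is not stated on this page.

```cpp
// Hypothetical helpers, not part of Vault: classify a UTF-16 code unit.
bool isSurrogate(wchar_t codeUnit) {
    const unsigned long u = static_cast<unsigned long>(codeUnit);
    return u >= 0xD800UL && u <= 0xDFFFUL;
}

bool isLeadSurrogate(wchar_t codeUnit) {
    const unsigned long u = static_cast<unsigned long>(codeUnit);
    return u >= 0xD800UL && u <= 0xDBFFUL;  // first unit of a pair
}

bool isTrailSurrogate(wchar_t codeUnit) {
    const unsigned long u = static_cast<unsigned long>(codeUnit);
    return u >= 0xDC00UL && u <= 0xDFFFUL;  // second unit of a pair
}
```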