Vault  4.1
VCodePoint Class Reference

VCodePoint stores a Unicode code point, which is similar to a 'char' except that the range of values is vastly larger than what fits in one byte.

#include <vcodepoint.h>

Public Member Functions

 VCodePoint (int i)
 Creates the code point by specifying the integer value.
 VCodePoint (char c)
 Creates the code point by specifying a C char value.
 VCodePoint (const VChar &c)
 Creates the code point by specifying a VChar, which wraps a C char value.
 VCodePoint (const VString &hexNotation)
 Creates the code point by specifying the Unicode "U+" notational value.
 VCodePoint (const Vu8 *buffer, int startOffset)
 Creates the code point by examining a byte buffer at a specified offset, where there exists a valid UTF-8 formatted code point.
 VCodePoint (VBinaryIOStream &stream)
 Creates the code point by reading one or more bytes from the supplied stream, where the stream contains a valid UTF-8 formatted code point.
 VCodePoint (VTextIOStream &utf8Stream)
 Creates the code point by reading one or more bytes from the supplied stream, where the stream contains a valid UTF-8 formatted code point.
 VCodePoint (const std::wstring &utf16WideString, int atIndex)
 Creates the code point by reading one or two code units from the supplied wide string, where the string contains a valid UTF-16 formatted code point.
int getUTF8Length () const
 Returns the length of this code point if it is formatted as UTF-8.
int getUTF16Length () const
 Returns the number of code units in this code point if it is formatted as UTF-16.
int intValue () const
 Returns the code point integer value.
VString toString () const
 Returns a VString, that is to say the UTF-8 form of the code point as a short VString of 1 to 4 bytes.
VChar toASCIIChar () const
 Returns a VChar containing the character value if it is ASCII (code points 0 through 127), or throws a VRangeException if not.
std::wstring toUTF16WideString () const
 Returns a std::wstring in UTF-16 format, that is to say a small array of one or two UTF-16 code units.
void writeToBinaryStream (VBinaryIOStream &stream) const
 Writes the code point to a binary stream in UTF-8 form (1 to 4 bytes).
bool isNull () const
 Returns true if the code point is value zero.
bool isNotNull () const
 The inverse of isNull().
bool isASCII () const
 Returns true if this code point represents an ASCII value (code points 0 through 127).
bool isWhitespace () const
 Avoid using these.
bool isAlpha () const
bool isNumeric () const
bool isAlphaNumeric () const
bool isHexadecimal () const

Static Public Member Functions

static int getUTF8LengthFromUTF8StartByte (Vu8 startByte)
 Returns the UTF-8 length of a code point given the first UTF-8 byte of the code point.
static int getUTF8LengthFromCodePointValue (int intValue)
 Returns the UTF-8 length of a code point given the code point's integer value.
static bool isUTF8ContinuationByte (Vu8 byteValue)
 Returns true if the specified byte from a UTF-8 byte stream is a continuation byte; that is to say, if it is not a byte value that starts a code point sequence.
static int countUTF8CodePoints (const Vu8 *buffer, int numBytes)
 Returns the number of code points in the specified UTF-8 byte stream.
static int getPreviousUTF8CodePointOffset (const Vu8 *buffer, int offset)
 Returns the offset of the previous UTF-8 code point start, given the offset of a given code point.
static int getUTF16LengthFromCodePointValue (int intValue)
 Returns the UTF-16 length of a code point given the code point's integer value.
static bool isUTF16SurrogateCodeUnit (wchar_t codeUnit)
 Returns true if the specified code unit from a UTF-16 sequence is a surrogate; that is to say, if it is part of a 2-unit sequence rather than being itself a complete 1-unit sequence.

Friends

bool operator== (const VCodePoint &p1, const VCodePoint &p2)
 Returns true if the two code points are the same.
bool operator== (const VCodePoint &cp, char c)
bool operator== (char c, const VCodePoint &cp)

Detailed Description

VCodePoint stores a Unicode code point, which is similar to a 'char' except that the range of values is vastly larger than what fits in one byte.

Because we often trade in UTF-8 (especially for VString), we have helper methods here for getting the length of the code point if it were represented in UTF-8, as well as the ability to create a (small) VString containing the code point in UTF-8. Ideally, you should deprecate use of char and VChar (at least when building and examining VString objects) in favor of VCodePoint and VString::iterator, which understands UTF-8. This will make it easier to manipulate UTF-8 VStrings with proper semantics.

Definition at line 34 of file vcodepoint.h.


Constructor & Destructor Documentation

VCodePoint::VCodePoint ( int  i) [explicit]

Creates the code point by specifying the integer value.

For example, ASCII 'a' is 97 or 0x61, and GREEK CAPITAL LETTER OMEGA is 937 or 0x03A9.

Parameters:
i  the code point integer value

Definition at line 20 of file vcodepoint.cpp.

VCodePoint::VCodePoint ( char  c) [explicit]

Creates the code point by specifying a C char value.

Because a char is a single byte, this will only work for values < 256. Non-ASCII values ( > 127) will be interpreted as the code point having the same value.

Parameters:
c  the C char whose value to treat as a code point

Definition at line 27 of file vcodepoint.cpp.

VCodePoint::VCodePoint ( const VChar &  c) [explicit]

Creates the code point by specifying a VChar, which wraps a C char value.

Because a char is a single byte, this will only work for values < 256. Non-ASCII values ( > 127) will be interpreted as the code point having the same value.

Parameters:
c  the VChar whose wrapped C char value to treat as a code point

Definition at line 34 of file vcodepoint.cpp.

VCodePoint::VCodePoint ( const VString &  hexNotation) [explicit]

Creates the code point by specifying the Unicode "U+" notational value.

This constructor makes the "U+" prefix optional, though it is recommended to include it for clarity in your code. You do NOT need to supply an even number of digits by prepending a zero. For example, ASCII 'a' is "U+61", and GREEK CAPITAL LETTER OMEGA is "U+03A9".

Parameters:
hexNotation  the Unicode code point string to interpret

Definition at line 41 of file vcodepoint.cpp.

References VString::chars(), getUTF16LengthFromCodePointValue(), getUTF8LengthFromCodePointValue(), VHex::hexCharsToByte(), intValue(), VString::length(), and VString::startsWith().
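
The notation rules above (optional "U+" prefix, any number of hex digits) can be illustrated with a standalone sketch. It does not use Vault; parseCodePointNotation is a hypothetical name, and the error handling (standard exceptions, an upper bound of U+10FFFF) is an assumption, since this page does not say how the real constructor reacts to malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of Vault: turns "U+03A9", "U+61" or "3A9"
// into the integer code point value.
int parseCodePointNotation(const std::string& notation) {
    std::size_t start = 0;
    if (notation.size() >= 2 && notation[0] == 'U' && notation[1] == '+') {
        start = 2;  // skip the optional "U+" prefix
    }
    if (start == notation.size()) {
        throw std::invalid_argument("no hex digits supplied");
    }
    int value = 0;
    for (std::size_t i = start; i < notation.size(); ++i) {
        const char c = notation[i];
        int digit = 0;
        if (c >= '0' && c <= '9')      digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else throw std::invalid_argument("not a hex digit");
        value = value * 16 + digit;  // an odd number of digits needs no padding
        if (value > 0x10FFFF) {
            throw std::out_of_range("beyond the last Unicode code point");
        }
    }
    return value;
}
```

With this sketch, "U+61" yields 97 and both "U+03A9" and "3A9" yield 937, matching the examples above.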

VCodePoint::VCodePoint ( const Vu8 *  buffer,
int  startOffset 
) [explicit]

Creates the code point by examining a byte buffer at a specified offset, where there exists a valid UTF-8 formatted code point.

For example, if the code point is ASCII it will be a single byte; otherwise, the first byte will be the start of a 2- to 4-byte UTF-8 sequence.

Parameters:
buffer  a pointer to a byte buffer to examine
startOffset  the offset into buffer, from which to start, and where there must exist a valid UTF-8 formatted code point

Definition at line 104 of file vcodepoint.cpp.

References getUTF8LengthFromUTF8StartByte().
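
As a rough illustration of what reading a code point from a UTF-8 buffer involves, here is a standalone sketch. It is not Vault's implementation; decodeUTF8At is a hypothetical name, std::uint8_t stands in for Vu8, and, like the constructor, it assumes a valid sequence starts at the given offset.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the decoding described above; not Vault's code.
// Precondition: buffer[startOffset] begins a valid, complete UTF-8 sequence.
int decodeUTF8At(const std::uint8_t* buffer, int startOffset) {
    const std::uint8_t* p = buffer + startOffset;
    if (p[0] < 0x80) {
        return p[0];  // 0xxxxxxx: ASCII, one byte
    }
    if ((p[0] & 0xE0) == 0xC0) {  // 110xxxxx + 1 continuation byte
        return ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    }
    if ((p[0] & 0xF0) == 0xE0) {  // 1110xxxx + 2 continuation bytes
        return ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    }
    assert((p[0] & 0xF8) == 0xF0);  // 11110xxx + 3 continuation bytes
    return ((p[0] & 0x07) << 18) | ((p[1] & 0x3F) << 12) |
           ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
}
```

The start byte alone decides how many continuation bytes follow, which is why the constructor only needs a pointer and an offset rather than a length.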

VCodePoint::VCodePoint ( VBinaryIOStream &  stream) [explicit]

Creates the code point by reading one or more bytes from the supplied stream, where the stream contains a valid UTF-8 formatted code point.

For example, if the code point is ASCII it will be a single byte; otherwise, the first byte will be the start of a 2- to 4-byte UTF-8 sequence. Note that VCodePoint treats binary and text streams the same since UTF-8 can be viewed as a space-efficient binary encoding.

Parameters:
stream  the stream to read from

Definition at line 120 of file vcodepoint.cpp.

References getUTF8LengthFromUTF8StartByte(), and VBinaryIOStream::readU8().

VCodePoint::VCodePoint ( VTextIOStream &  utf8Stream) [explicit]

Creates the code point by reading one or more bytes from the supplied stream, where the stream contains a valid UTF-8 formatted code point.

For example, if the code point is ASCII it will be a single byte; otherwise, the first byte will be the start of a 2- to 4-byte UTF-8 sequence. Note that VCodePoint treats binary and text streams the same since UTF-8 can be viewed as a space-efficient binary encoding.

Parameters:
utf8Stream  the stream to read from

Definition at line 130 of file vcodepoint.cpp.

References getUTF8LengthFromUTF8StartByte(), and VIOStream::readGuaranteedByte().

VCodePoint::VCodePoint ( const std::wstring &  utf16WideString,
int  atIndex 
) [explicit]

Creates the code point by reading one or two code units from the supplied wide string, where the string contains a valid UTF-16 formatted code point.

A UTF-16 code point may be composed of a single code unit for the "simpler" characters, or two code units for the rest. If the wide string ends in the middle of a two-unit code point, a VEOFException will be thrown if you attempt to read the split code point at the end of the wide string.

Parameters:
utf16WideString  the UTF-16 encoded wide string to read
atIndex  the index into the string from which to read the one or two code units

Definition at line 140 of file vcodepoint.cpp.

References isUTF16SurrogateCodeUnit().
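
The one-or-two-unit rule can be sketched without Vault as follows. decodeUTF16At is a hypothetical name; the sketch uses std::u16string instead of std::wstring because the width of wchar_t varies by platform, and it throws std::out_of_range where the real constructor is documented to throw VEOFException.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in, not Vault's code: reads one code point from UTF-16.
int decodeUTF16At(const std::u16string& s, std::size_t atIndex) {
    const char16_t lead = s.at(atIndex);  // at() throws if atIndex is past the end
    if (lead < 0xD800 || lead > 0xDBFF) {
        return lead;  // not a lead surrogate: treated as a complete code point
    }
    if (atIndex + 1 >= s.size()) {
        // The string ends in the middle of a two-unit code point.
        throw std::out_of_range("string ends inside a surrogate pair");
    }
    const char16_t trail = s[atIndex + 1];
    // Each surrogate carries 10 bits; the pair encodes (code point - 0x10000).
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
}
```

For brevity the sketch does not verify that the second unit really is a trail surrogate; the constructor's contract of a valid UTF-16 code point at atIndex covers that case.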


Member Function Documentation

int VCodePoint::getUTF8Length ( ) const [inline]

Returns the length of this code point if it is formatted as UTF-8.

The answer is always in the range 1 to 4.

Returns:
the number of bytes (1 to 4) in the UTF-8 form of this code point

Definition at line 105 of file vcodepoint.h.

int VCodePoint::getUTF16Length ( ) const [inline]

Returns the number of code units in this code point if it is formatted as UTF-16.

The answer is always 1 or 2.

Returns:
the number of code units (1 or 2) in the UTF-16 form of this code point

Definition at line 110 of file vcodepoint.h.

int VCodePoint::intValue ( ) const [inline]

Returns the code point integer value.

Returns:
the integer value of this code point

Definition at line 115 of file vcodepoint.h.

VString VCodePoint::toString ( ) const

Returns a VString, that is to say the UTF-8 form of the code point as a short VString of 1 to 4 bytes.

This is how you take a code point and turn it into a string that can be inserted or appended into another, longer, string.

Returns:
the code point in UTF-8 VString form

Definition at line 165 of file vcodepoint.cpp.

References getUTF8LengthFromCodePointValue().
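
To show what "the UTF-8 form of the code point" means at the byte level, here is a standalone sketch that produces a std::string rather than a VString. encodeUTF8 is a hypothetical name and the code is not taken from Vault.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in, not Vault's code: encodes one code point as UTF-8.
std::string encodeUTF8(int cp) {
    assert(cp >= 0 && cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {            // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                    // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The three thresholds (0x80, 0x800, 0x10000) are the same ones that determine the 1-to-4 result of a UTF-8 length query.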

VChar VCodePoint::toASCIIChar ( ) const

Returns a VChar containing the character value if it is ASCII (code points 0 through 127), or throws a VRangeException if not.

Unless you prefer to catch the exception, you should normally call isASCII() before invoking this conversion. The primary use case is when you are parsing a string and looking for specific ASCII syntax in it; conversion to VChar lets you 'switch' on single-quoted char constants.

Returns:
a VChar representation of this ASCII code point

Definition at line 206 of file vcodepoint.cpp.

References isASCII().

std::wstring VCodePoint::toUTF16WideString ( ) const

Returns a std::wstring in UTF-16 format, that is to say a small array of one or two UTF-16 code units.

This is how you take a code point and turn it into a wstring that can be inserted or appended into another, longer, wstring. If you need to interface with Windows "wide" APIs, wstring is used; normally you will just take a VString and get the entire wstring from it via VString::toUTF16().

Returns:
the code point in UTF-16 std::wstring form

Definition at line 242 of file vcodepoint.cpp.

References getUTF16LengthFromCodePointValue().
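
The "one or two code units" result can be sketched as below. This is not Vault's code; encodeUTF16 is a hypothetical name, and std::u16string is used in place of std::wstring so that the example behaves the same on platforms where wchar_t is wider than 16 bits.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in, not Vault's code: encodes one code point as UTF-16.
std::u16string encodeUTF16(int cp) {
    assert(cp >= 0 && cp <= 0x10FFFF);
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        out += static_cast<char16_t>(cp);
    } else {
        // Split the 20 bits of (cp - 0x10000) across a surrogate pair.
        const int v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // lead surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // trail surrogate
    }
    return out;
}
```

For example, GREEK CAPITAL LETTER OMEGA (0x03A9) stays a single unit, while 0x1F600 becomes the pair 0xD83D, 0xDE00.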

void VCodePoint::writeToBinaryStream ( VBinaryIOStream &  stream) const

Writes the code point to a binary stream in UTF-8 form (1 to 4 bytes).

Parameters:
stream  the stream to write to

Definition at line 267 of file vcodepoint.cpp.

References getUTF8LengthFromCodePointValue(), and VBinaryIOStream::writeU8().

bool VCodePoint::isNull ( ) const [inline]

Returns true if the code point is value zero.

This corresponds to an ASCII NUL character, and might be useful in rare cases where you are reading a null terminator from a buffer, or when initializing a code point with value zero and checking it later.

Returns:
true if the value is zero

Definition at line 150 of file vcodepoint.h.

bool VCodePoint::isNotNull ( ) const [inline]

The inverse of isNull().

Returns:
true if the value is not zero

Definition at line 155 of file vcodepoint.h.

bool VCodePoint::isASCII ( ) const [inline]

Returns true if this code point represents an ASCII value (code points 0 through 127).

This test should always be made prior to calling toASCIIChar() since that method will throw a range exception if the code point is not an ASCII character, unless you intend to catch such exceptions.

Returns:
true if the value is in the range 0 to 127.

Definition at line 163 of file vcodepoint.h.

bool VCodePoint::isWhitespace ( ) const

Avoid using these.

This is a temporary bridge from VChar / char, in code migrating to VCodePoint.

Definition at line 214 of file vcodepoint.cpp.

References intValue().

int VCodePoint::getUTF8LengthFromUTF8StartByte ( Vu8  startByte) [static]

Returns the UTF-8 length of a code point given the first UTF-8 byte of the code point.

The length can be deduced directly from the high-order bits of the byte.

Parameters:
startByte  the first byte of a 1- to 4-byte UTF-8 code point sequence
Returns:
the number of bytes in the UTF-8 sequence

Definition at line 299 of file vcodepoint.cpp.
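
How the length follows from the start byte can be shown with a short standalone sketch. utf8LengthFromStartByte is a hypothetical name and std::uint8_t stands in for Vu8; returning 0 for a byte that cannot start a sequence is this sketch's choice, since the page does not state what the Vault function does in that case.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in, not Vault's code: the high-order bits of the first
// byte give the total length of the UTF-8 sequence.
int utf8LengthFromStartByte(std::uint8_t startByte) {
    if (startByte < 0x80)           return 1;  // 0xxxxxxx
    if ((startByte & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((startByte & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((startByte & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;  // continuation or invalid byte (assumed behavior for this sketch)
}
```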

int VCodePoint::getUTF8LengthFromCodePointValue ( int  intValue) [static]

Returns the UTF-8 length of a code point given the code point's integer value.

Parameters:
intValue  a code point value
Returns:
the number of bytes in the corresponding UTF-8 sequence

Definition at line 319 of file vcodepoint.cpp.

bool VCodePoint::isUTF8ContinuationByte ( Vu8  byteValue) [static]

Returns true if the specified byte from a UTF-8 byte stream is a continuation byte; that is to say, if it is not a byte value that starts a code point sequence.

Parameters:
byteValue  a byte from a UTF-8 byte stream
Returns:
true if the byte is NOT the start of a code point; false if it IS the start of a code point

Definition at line 334 of file vcodepoint.cpp.

int VCodePoint::countUTF8CodePoints ( const Vu8 *  buffer,
int  numBytes 
) [static]

Returns the number of code points in the specified UTF-8 byte stream.

Parameters:
buffer  the UTF-8 byte buffer to examine
numBytes  the number of bytes in the buffer to examine
Returns:
the number of code points counted

Definition at line 340 of file vcodepoint.cpp.

References getUTF8Length().
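
One simple way to count code points in valid UTF-8 is to count every byte that is not a continuation byte. The sketch below takes that approach; it is not Vault's code, and the reference to getUTF8Length() above suggests the real function steps through the buffer by sequence length instead. isContinuationByte and countCodePoints are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Continuation bytes have the bit pattern 10xxxxxx.
bool isContinuationByte(std::uint8_t b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical stand-in, not Vault's code: counts the bytes that start a
// code point sequence. Assumes the buffer holds valid UTF-8.
int countCodePoints(const std::uint8_t* buffer, int numBytes) {
    assert(numBytes >= 0);
    int count = 0;
    for (int i = 0; i < numBytes; ++i) {
        if (!isContinuationByte(buffer[i])) {
            ++count;
        }
    }
    return count;
}
```

For valid input both approaches give the same answer; they can differ on malformed or truncated data.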

int VCodePoint::getPreviousUTF8CodePointOffset ( const Vu8 *  buffer,
int  offset 
) [static]

Returns the offset of the previous UTF-8 code point start, given the offset of a given code point.

The answer should be 1 to 4 bytes less than the specified offset, since UTF-8 uses 1 to 4 bytes per code point. You must not call this function with offset 0 in a physical buffer, because the function would look "left" of the valid buffer memory. Of course, if the buffer parameter points inside a physical buffer (not to the start of one) then that caveat does not apply.

Parameters:
buffer  the buffer to examine
offset  the offset to start looking "left" of in the buffer
Returns:
the offset of the previous code point start; should be in the range (offset-4) to (offset-1)

Definition at line 353 of file vcodepoint.cpp.

References isUTF8ContinuationByte().
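
Looking "left" for the previous code point start amounts to stepping backward over continuation bytes. The sketch below is a standalone illustration, not Vault's code; previousCodePointOffset is a hypothetical name, and stopping at offset 0 instead of reading before the buffer is a safety choice made for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in, not Vault's code: returns the offset at which the
// code point preceding 'offset' starts. Assumes valid UTF-8 and offset > 0.
int previousCodePointOffset(const std::uint8_t* buffer, int offset) {
    assert(offset > 0);  // there is nothing to the left of offset 0
    int i = offset - 1;
    // Skip continuation bytes (10xxxxxx) until a start byte is found.
    while (i > 0 && (buffer[i] & 0xC0) == 0x80) {
        --i;
    }
    return i;
}
```

Repeated calls walk a buffer backward one code point at a time, which is the kind of traversal a UTF-8 aware reverse iterator needs.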

int VCodePoint::getUTF16LengthFromCodePointValue ( int  intValue) [static]

Returns the UTF-16 length of a code point given the code point's integer value.

Parameters:
intValue  a code point value
Returns:
the number of code units in the corresponding UTF-16 sequence

Definition at line 380 of file vcodepoint.cpp.

bool VCodePoint::isUTF16SurrogateCodeUnit ( wchar_t  codeUnit) [static]

Returns true if the specified code unit from a UTF-16 sequence is a surrogate; that is to say, if it is part of a 2-unit sequence rather than being itself a complete 1-unit sequence.

Generally you will use this to identify the lead surrogate, and then read the trail surrogate by getting the next unit in the sequence.

Parameters:
codeUnit  a code unit from a UTF-16 sequence
Returns:
true if the code unit is a surrogate (NOT a complete code point); false if it IS a complete code point

Definition at line 364 of file vcodepoint.cpp.
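
The surrogate test and the lead/trail distinction mentioned above come down to range checks on the code unit. The sketch below is a standalone illustration using char16_t instead of wchar_t; all three function names are hypothetical and none of this is taken from Vault.

```cpp
#include <cassert>

// Surrogates occupy the range 0xD800 through 0xDFFF.
bool isSurrogate(char16_t codeUnit) {
    return codeUnit >= 0xD800 && codeUnit <= 0xDFFF;
}

// Lead (high) surrogates come first in a pair: 0xD800 through 0xDBFF.
bool isLeadSurrogate(char16_t codeUnit) {
    return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

// Trail (low) surrogates come second in a pair: 0xDC00 through 0xDFFF.
bool isTrailSurrogate(char16_t codeUnit) {
    return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}
```

A reader that sees a lead surrogate at index i would then take the unit at index i + 1 as the trail surrogate to complete the code point.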


The documentation for this class was generated from the following files:

vcodepoint.h
vcodepoint.cpp

Copyright ©1997-2014 Trygve Isaacson. All rights reserved. This documentation was generated with Doxygen.