8.13. Introduction to Pointer-Based String Processing
In this section, we introduce some
common C++ Standard Library functions that facilitate string processing. The
techniques discussed here are appropriate for developing text editors, word
processors, page layout software, computerized typesetting systems and other
kinds of text-processing software. We have already used the C++ Standard Library
string class in several examples to represent
strings as full-fledged objects. For example, the GradeBook class case study in Chapters
3–7 represents a course name using a string object.
In Chapter
18 we present class string in detail. Although using
string objects is usually straightforward, we use
null-terminated, pointer-based strings in this section. Many C++ Standard
Library functions operate only on null-terminated, pointer-based strings, which
are more complicated to use than string
objects. Also, if you work with legacy C++ programs, you may be required to
manipulate these pointer-based strings.
8.13.1. Fundamentals of Characters and Pointer-Based Strings
Characters are the fundamental
building blocks of C++ source programs. Every program is composed of a sequence
of characters that—when grouped together meaningfully—is interpreted by the
compiler as a series of instructions used to accomplish a task. A program may
contain character constants. A character constant is an integer value represented as
a character in single quotes. The value of a character constant is the integer
value of the character in the machine's character set. For example, 'z'
represents the integer value of z (122 in the ASCII character set; see
Appendix
B), and '\n' represents the integer value of newline (10 in the
ASCII character set).
A string is a series of characters
treated as a single unit. A string may include letters, digits and various special
characters such as +, -, *, / and $. String literals, or string constants, in C++
are written in double quotation marks as follows:
| "John Q. Doe" | (a name) |
| "9999 Main Street" | (a street address) |
| "Maynard, Massachusetts" | (a city and state) |
| "(201) 555-1212" | (a telephone number) |
A pointer-based string in C++ is an array of characters ending in the null character ('\0'), which marks where the string terminates in memory. A string is accessed via a pointer to its first character. The value of a string is the address of its first character, but the sizeof a string literal is the length of the string including the terminating null character. In this sense, strings are like arrays, because in most expressions an array name is implicitly converted to a pointer to its first element.
A string literal may be used as
an initializer in the declaration of either a character array or a variable of
type char *. The declarations
char color[] = "blue";
const char *colorPtr = "blue";
each initialize a
variable to the string "blue". The first declaration creates a
five-element array color containing the characters 'b', 'l', 'u',
'e' and '\0'. The second declaration creates pointer variable
colorPtr that points to the letter b in the string
"blue" (which ends in '\0') somewhere in memory. String
literals have static storage class (they
exist for the duration of the program) and may or may not be shared if the same
string literal is referenced from multiple locations in a program. According to
the C++ standard (Section 2.13.4), the effect of attempting to modify a string
literal is undefined; thus, you should always declare a pointer to a string
literal as const char *.
The declaration char color[] = "blue"; could also be
written
char color[] = { 'b', 'l', 'u', 'e', '\0' };
When declaring a character array to
contain a string, the array must be large enough to store the string and its
terminating null character. The preceding declaration determines the size of the
array, based on the number of initializers provided in the initializer
list.
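The difference between the two declarations can be checked directly. The following is a minimal sketch; the function names colorArraySize, colorLength and arrayCopyIsModifiable are illustrative only and are not part of any library.

```cpp
#include <cstddef>
#include <cstring>

// Size in bytes of an array initialized from "blue":
// four letters plus the terminating null character.
std::size_t colorArraySize()
{
    char color[] = "blue";
    return sizeof( color ); // color is an array here, not a pointer
}

// Length of the string reached through a pointer to a string literal.
std::size_t colorLength()
{
    const char *colorPtr = "blue";
    return std::strlen( colorPtr ); // counts the characters before '\0'
}

// The array holds its own copy of the characters, so it may be modified.
bool arrayCopyIsModifiable()
{
    char color[] = "blue";
    color[ 0 ] = 'g'; // changes the array, not the string literal
    return std::strcmp( color, "glue" ) == 0;
}
```

Calling colorArraySize() yields 5 and colorLength() yields 4; the one-element difference is the terminating null character that the array must have room for.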
Common Programming Error 8.15
Not allocating sufficient space in a character array to store the null character that terminates a string is an error.
Common Programming Error 8.16
Creating or using a C-style string that does not contain a terminating null character can lead to logic errors.
Error-Prevention Tip 8.4
When storing a string of characters in a character array, be sure that the array is large enough to hold the largest string that will be stored. C++ allows strings of any length to be stored. If a string is longer than the character array in which it is to be stored, characters beyond the end of the array will overwrite data in memory following the array, leading to logic errors.
A string can be read into a character array using stream extraction with cin. For example, the following statement can be used to read a string into character array word[ 20 ]:
cin >> word;
The string entered by the user is stored in word. The preceding statement reads characters until a
white-space character or end-of-file indicator is encountered. Note that the
string should be no longer than 19 characters to leave room for the terminating
null character. The setw stream manipulator can
be used to ensure that the string read into word
does not exceed the size of the array. For example, the statement
cin >> setw( 20 ) >> word;
specifies that cin should read a maximum of 19
characters into array word and save the
20th location in the array to store the terminating null character for the
string. The setw stream manipulator
applies only to the next value being input. If more than 19 characters are
entered, the remaining characters are not saved in word, but will be read in and can be stored in another
variable.
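The effect of setw on input can be observed without a keyboard by reading from any input stream. The sketch below assumes a hypothetical helper named readWord that takes the stream as a parameter; passing cin would give the behavior described above.

```cpp
#include <iomanip>
#include <istream>
#include <string>

// Hypothetical helper: reads one white-space-delimited word of at most
// 19 characters from input and returns it as a string object.
std::string readWord( std::istream &input )
{
    char word[ 20 ] = ""; // start with an empty string in case input fails
    input >> std::setw( 20 ) >> word; // at most 19 characters plus '\0'
    return word;
}
```

If the input contains the 26 lowercase letters with no spaces, the first call returns the first 19 letters and a second call returns the remaining seven, because setw must be applied again for each extraction.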
In some cases, it is desirable to
input an entire line of text into an array. For this purpose, C++ provides the
function cin.getline in header file <iostream>. In Chapter
3 you were introduced to the similar function
getline from header file <string>, which read input until a newline character was entered,
and stored the input (without the newline character) into a string
specified as an argument. The cin.getline
function takes three arguments—a character array in which the line of text will
be stored, a length and a delimiter character. For example, the program
segment
char sentence[ 80 ];
cin.getline( sentence, 80, '\n' );
declares array sentence of 80
characters and reads a line of text from the keyboard into the array. The
function stops reading characters when the delimiter character '\n' is encountered, when the end-of-file indicator is
entered or when the number of characters read so far is one less than the length
specified in the second argument. (The last character in the array is reserved
for the terminating null character.) If the delimiter character is encountered,
it is read and discarded. The third argument to cin.getline has '\n' as a default value, so the
preceding function call could have been written as follows:
cin.getline( sentence, 80 );
Chapter
15, Stream Input/Output, provides a
detailed discussion of cin.getline and other input/output
functions.
Common Programming Error 8.17
Processing a single character as a char * string can lead to a fatal runtime error. A char * string is a pointer, which is probably a respectably large integer. However, a character is a small integer (ASCII values range 0–127, and a one-byte char can hold at most 256 distinct values). On many systems, treating a char value as an address and dereferencing it causes an error, because low memory addresses are reserved for special purposes such as operating system interrupt handlers, so "memory access violations" occur.
Common Programming Error 8.18
Passing a string as an argument to a function when a character is expected is a compilation error.
8.13.2. String-Manipulation Functions of the String-Handling Library
The string-handling library
provides many useful functions for manipulating string data, comparing strings,
searching strings for characters and other strings, tokenizing strings
(separating strings into logical pieces such as the separate words in a
sentence) and determining the length of strings. This section presents some
common string-manipulation functions of the string-handling library (from the
C++ standard library). The functions are summarized in Fig. 8.29;
then each is used in a live-code example. The prototypes for these functions are
located in header file <cstring>.
Fig. 8.29. String-manipulation functions of the string-handling library.
| Function prototype | Function description |
| char *strcpy( char *s1, const char *s2 ); | Copies the string s2 into the character array s1. The value of s1 is returned. |
| char *strncpy( char *s1, const char *s2, size_t n ); | Copies at most n characters of the string s2 into the character array s1. The value of s1 is returned. |
| char *strcat( char *s1, const char *s2 ); | Appends the string s2 to s1. The first character of s2 overwrites the terminating null character of s1. The value of s1 is returned. |
| char *strncat( char *s1, const char *s2, size_t n ); | Appends at most n characters of string s2 to string s1. The first character of s2 overwrites the terminating null character of s1. The value of s1 is returned. |
| int strcmp( const char *s1, const char *s2 ); | Compares the string s1 with the string s2. The function returns a value of zero, less than zero or greater than zero if s1 is equal to, less than or greater than s2, respectively. |
| int strncmp( const char *s1, const char *s2, size_t n ); | Compares up to n characters of the string s1 with the string s2. The function returns zero, less than zero or greater than zero if the n-character portion of s1 is equal to, less than or greater than the corresponding n-character portion of s2, respectively. |
| char *strtok( char *s1, const char *s2 ); | A sequence of calls to strtok breaks string s1 into tokens, such as words in a line of text. The string is broken up based on the characters contained in string s2. For instance, if we were to break the string "this:is:a:string" into tokens based on the character ':', the resulting tokens would be "this", "is", "a" and "string". Function strtok returns only one token at a time: the first call contains s1 as the first argument, and subsequent calls to continue tokenizing the same string contain NULL as the first argument. A pointer to the current token is returned by each call. If there are no more tokens when the function is called, NULL is returned. |
| size_t strlen( const char *s ); | Determines the length of string s. The number of characters preceding the terminating null character is returned. |
Note that several functions in Fig. 8.29 contain parameters with
data type size_t. This type is defined in the header file
<cstring> to be an unsigned integral type such as unsigned
int or unsigned long.
Common Programming Error 8.19
Forgetting to include the <cstring> header file when using functions from the string-handling library causes compilation errors.
Copying Strings with strcpy and strncpy
Function strcpy copies its second argument (a string) into its first argument, a character array that must be large enough to store the string and its terminating null character (which is also copied). Function strncpy is much like strcpy, except that strncpy specifies the number of characters to be copied from the string into the array. Note that function strncpy does not necessarily copy the terminating null character of its second argument: a terminating null character is written only if the number of characters to be copied is at least one more than the length of the string. For example, if "test" is the second argument, a terminating null character is written only if the third argument to strncpy is at least 5 (four characters in "test" plus one terminating null character). If the third argument is larger than 5, null characters are appended to the array until the total number of characters specified by the third argument is written.
Common Programming Error 8.20
When using strncpy, the terminating null character of the second argument (a char * string) will not be copied if the number of characters specified by strncpy's third argument is not greater than the second argument's length. In that case, a fatal error may occur if you do not manually terminate the resulting char * string with a null character.
Figure
8.30 uses strcpy (line 17) to copy the
entire string in array x into array y and uses
strncpy (line 23) to copy the first 14
characters of array x into array z. Line 24 appends a null character ('\0') to
array z, because the call to strncpy
in the program does not write a terminating null character. (The third argument
is less than the string length of the second argument plus one.)
Fig. 8.30. strcpy and
strncpy.
1 // Fig. 8.30: fig08_30.cpp
2 // Using strcpy and strncpy.
3 #include <iostream>
4 using std::cout;
5 using std::endl;
6
7 #include <cstring> // prototypes for strcpy and strncpy
8 using std::strcpy;
9 using std::strncpy;
10
11 int main()
12 {
13 char x[] = "Happy Birthday to You"; // string length 21
14 char y[ 25 ];
15 char z[ 15 ];
16
17 strcpy( y, x ); // copy contents of x into y
18
19 cout << "The string in array x is: " << x
20 << "\nThe string in array y is: " << y << '\n';
21
22 // copy first 14 characters of x into z
23 strncpy( z, x, 14 ); // does not copy null character
24 z[ 14 ] = '\0'; // append '\0' to z's contents
25
26 cout << "The string in array z is: " << z << endl;
27 return 0; // indicates successful termination
28 } // end main
|
The string in array x is: Happy Birthday to You
The string in array y is: Happy Birthday to You
The string in array z is: Happy Birthday
|
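The manual termination step used with strncpy in Fig. 8.30 can be packaged so that it is never forgotten. The following is a sketch only: copyTruncated is a hypothetical helper, not a library function, and it assumes the destination capacity destSize is at least 1.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies src into dest, truncating if necessary,
// and always leaves dest null terminated.
// Assumes dest has room for destSize characters and destSize > 0.
char *copyTruncated( char *dest, std::size_t destSize, const char *src )
{
    // copy at most destSize - 1 characters; if src is shorter,
    // strncpy pads the remainder of that range with null characters
    std::strncpy( dest, src, destSize - 1 );

    // strncpy writes no null character when src is too long,
    // so terminate the string manually in the last element
    dest[ destSize - 1 ] = '\0';
    return dest;
}
```

For example, copying "testing" into a five-element array with this helper leaves the array holding "test" followed by the terminating null character.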
Concatenating Strings with strcat and strncat
Function strcat appends its second argument (a string) to its first argument (a character array containing a string). The first character of the second argument replaces the null character ('\0') that terminates the string in the first argument. You must ensure that the array used to store the first string is large enough to store the combination of the first string, the second string and the terminating null character (copied from the second string). Function strncat appends up to a specified number of characters from the second string to the first string and appends a terminating null character to the result. The program of Fig. 8.31 demonstrates function strcat (lines 19 and 29) and function strncat (line 24).
Fig. 8.31. strcat and
strncat.
1 // Fig. 8.31: fig08_31.cpp
2 // Using strcat and strncat.
3 #include <iostream>
4 using std::cout;
5 using std::endl;
6
7 #include <cstring> // prototypes for strcat and strncat
8 using std::strcat;
9 using std::strncat;
10
11 int main()
12 {
13 char s1[ 20 ] = "Happy "; // length 6
14 char s2[] = "New Year "; // length 9
15 char s3[ 40 ] = "";
16
17 cout << "s1 = " << s1 << "\ns2 = " << s2;
18
19 strcat( s1, s2 ); // concatenate s2 to s1 (length 15)
20
21 cout << "\n\nAfter strcat(s1, s2):\ns1 = " << s1 << "\ns2 = " << s2;
22
23 // concatenate first 6 characters of s1 to s3
24 strncat( s3, s1, 6 ); // places '\0' after last character
25
26 cout << "\n\nAfter strncat(s3, s1, 6):\ns1 = " << s1
27 << "\ns3 = " << s3;
28
29 strcat( s3, s1 ); // concatenate s1 to s3
30 cout << "\n\nAfter strcat(s3, s1):\ns1 = " << s1
31 << "\ns3 = " << s3 << endl;
32 return 0; // indicates successful termination
33 } // end main
|
s1 = Happy
s2 = New Year
After strcat(s1, s2):
s1 = Happy New Year
s2 = New Year
After strncat(s3, s1, 6):
s1 = Happy New Year
s3 = Happy
After strcat(s3, s1):
s1 = Happy New Year
s3 = Happy Happy New Year
|
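Because strncat always writes a terminating null character after the characters it appends, it can be used to keep a concatenation inside the destination array. The sketch below uses a hypothetical helper, appendBounded; it assumes dest already holds a null-terminated string that is shorter than destSize.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: appends as much of src to dest as will fit.
// Assumes dest holds a valid string with strlen( dest ) < destSize.
char *appendBounded( char *dest, std::size_t destSize, const char *src )
{
    std::size_t used = std::strlen( dest ); // characters already stored
    std::size_t room = destSize - used - 1; // leave one element for '\0'

    // append at most room characters; strncat then writes '\0'
    std::strncat( dest, src, room );
    return dest;
}
```

With a 12-element array containing "Happy ", appending "New Year" in this way stores "Happy New Y", which is 11 characters plus the terminating null character.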
Comparing Strings with strcmp and strncmp
Figure 8.32
compares three strings using strcmp (lines 21, 22 and 23) and strncmp (lines 26, 27 and 28). Function
strcmp compares its first string argument
with its second string argument character by character. The function returns
zero if the strings are equal, a negative value if the first string is less than
the second string and a positive value if the first string is greater than the
second string. Function strncmp is equivalent to
strcmp, except that strncmp compares
up to a specified number of characters. Function strncmp stops comparing characters if it reaches the null character
in one of its string arguments. The program prints the integer value returned by
each function call.
Fig. 8.32. strcmp and
strncmp.
1 // Fig. 8.32: fig08_32.cpp
2 // Using strcmp and strncmp.
3 #include <iostream>
4 using std::cout;
5 using std::endl;
6
7 #include <iomanip>
8 using std::setw;
9
10 #include <cstring> // prototypes for strcmp and strncmp
11 using std::strcmp;
12 using std::strncmp;
13
14 int main()
15 {
16 const char *s1 = "Happy New Year";
17 const char *s2 = "Happy New Year";
18 const char *s3 = "Happy Holidays";
19
20 cout << "s1 = " << s1 << "\ns2 = " << s2 << "\ns3 = " << s3
21 << "\n\nstrcmp(s1, s2) = " << setw( 2 ) << strcmp( s1, s2 )
22 << "\nstrcmp(s1, s3) = " << setw( 2 ) << strcmp( s1, s3 )
23 << "\nstrcmp(s3, s1) = " << setw( 2 ) << strcmp( s3, s1 );
24
25 cout << "\n\nstrncmp(s1, s3, 6) = " << setw( 2 )
26 << strncmp( s1, s3, 6 ) << "\nstrncmp(s1, s3, 7) = " << setw( 2 )
27 << strncmp( s1, s3, 7 ) << "\nstrncmp(s3, s1, 7) = " << setw( 2 )
28 << strncmp( s3, s1, 7 ) << endl;
29 return 0; // indicates successful termination
30 } // end main
|
s1 = Happy New Year
s2 = Happy New Year
s3 = Happy Holidays
strcmp(s1, s2) = 0
strcmp(s1, s3) = 1
strcmp(s3, s1) = -1
strncmp(s1, s3, 6) = 0
strncmp(s1, s3, 7) = 1
strncmp(s3, s1, 7) = -1
|
Common Programming Error 8.21
Assuming that strcmp and strncmp return one (a true value) when their arguments are equal is a logic error. Both functions return zero (C++'s false value) for equality. Therefore, when testing two strings for equality, the result of the strcmp or strncmp function should be compared with zero to determine whether the strings are equal.
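One way to avoid this error is to wrap each comparison in a small function whose name states the question being asked. The helpers below (stringsEqual, comesBefore and samePrefix) are illustrative names, not library functions.

```cpp
#include <cstddef>
#include <cstring>

// True if the two strings contain exactly the same characters.
bool stringsEqual( const char *s1, const char *s2 )
{
    return std::strcmp( s1, s2 ) == 0; // zero means equal
}

// True if s1 is less than s2 in the machine's character ordering.
bool comesBefore( const char *s1, const char *s2 )
{
    return std::strcmp( s1, s2 ) < 0; // negative means s1 is less
}

// True if the first n characters of the two strings match.
bool samePrefix( const char *s1, const char *s2, std::size_t n )
{
    return std::strncmp( s1, s2, n ) == 0;
}
```

A condition such as if ( stringsEqual( s1, s2 ) ) then reads naturally, whereas if ( strcmp( s1, s2 ) ) would be true precisely when the strings differ.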
To understand just what it means for one
string to be "greater than" or "less than" another string, consider the process
of alphabetizing a series of last names. You would, no doubt, place "Jones"
before "Smith," because the first letter of "Jones" comes before the first
letter of "Smith" in the alphabet. But the alphabet is more than just a list of
26 letters—it is an ordered
list of characters. Each letter occurs in a specific position within the list.
"Z" is more than just a letter of the alphabet; "Z" is specifically the 26th
letter of the alphabet.
How does the computer know that one
letter comes before another? All characters are represented inside the computer
as numeric codes; when the computer compares two strings, it actually compares
the numeric codes of the characters in the strings.
In an effort to standardize
character representations, most computer manufacturers have designed their
machines to utilize one of two popular coding schemes—ASCII or EBCDIC. Recall that
ASCII stands for "American Standard Code for Information Interchange." EBCDIC
stands for "Extended Binary Coded Decimal Interchange Code." There are other
coding schemes as well.
ASCII and EBCDIC are called character codes, or
character sets. Most readers of this book will be using desktop or notebook
computers that use the ASCII character set. IBM mainframe computers use the
EBCDIC character set. As Internet and World Wide Web usage becomes pervasive,
the newer Unicode® character set is growing in popularity (www.unicode.org). String and character manipulations actually
involve the manipulation of the appropriate numeric codes and not the characters
themselves. This explains the interchangeability of characters and small
integers in C++. Since it is meaningful to say that one
numeric code is greater than, less than or equal to another numeric code, it
becomes possible to relate various characters or strings to one another by
referring to the character codes. Appendix
B contains the ASCII character codes.
Portability Tip 8.4
The internal numeric codes used to represent characters may be different on different computers that use different character sets.
Portability Tip 8.5
Do not explicitly test for ASCII codes, as in if ( rating == 65 ); rather, use the corresponding character constant, as in if ( rating == 'A' ).
[Note:
With some compilers, functions strcmp and strncmp always
return -1, 0 or 1, as in
the sample output of Fig. 8.32. With other compilers, these functions return 0 or the
difference between the numeric codes of the first characters that differ in the
strings being compared. For example, when s1 and s3 are compared, the first characters that differ between them
are the first character of the second word in each string—N (numeric
code 78) in s1 and H (numeric code 72) in s3, respectively. In this case, the return value will be
6 (or -6 if s3 is compared to s1).]
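Since only the sign of the result is portable, a program that needs a fixed value can normalize it. The sketch below shows one way to do so, along with a hand-written comparison that returns the difference of the first differing character codes. Both function names are hypothetical, and the value 6 mentioned afterward assumes the ASCII character set.

```cpp
#include <cstring>

// Reduces the result of strcmp to exactly -1, 0 or 1.
int compareSign( const char *s1, const char *s2 )
{
    int result = std::strcmp( s1, s2 );
    return ( result > 0 ) - ( result < 0 );
}

// Returns the difference between the numeric codes of the first
// characters that differ, or 0 if the strings are equal.
int codeDifference( const char *s1, const char *s2 )
{
    // advance while the characters match and s1 has not ended
    while ( *s1 != '\0' && *s1 == *s2 )
    {
        ++s1;
        ++s2;
    }

    // compare as unsigned char so that codes above 127 are not negative
    return static_cast< unsigned char >( *s1 ) -
        static_cast< unsigned char >( *s2 );
}
```

With the strings of Fig. 8.32, compareSign( s1, s3 ) is 1 on any system, while codeDifference( s1, s3 ) is 6 on a system that uses ASCII.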
Tokenizing a String with strtok
Function strtok
breaks a string into a series of tokens. A
token is a sequence of characters separated by delimiting characters
(usually spaces or punctuation marks). For example, in a line of text, each word
can be considered a token, and the spaces separating the words can be considered
delimiters.
Multiple calls to
strtok are required to break a string into tokens
(assuming that the string contains more than one token). The first call to
strtok contains two arguments, a string to be
tokenized and a string containing characters that separate the tokens (i.e.,
delimiters). Line 19 in Fig. 8.33 assigns to
tokenPtr a pointer to the first token in
sentence. The second argument, " ", indicates
that tokens in sentence are separated by spaces. Function
strtok searches for the first character in sentence that is not a delimiting character (space). This begins
the first token. The function then finds the next delimiting character in the
string and replaces it with a null ('\0') character. This terminates
the current token. Function strtok saves (in a static variable) a pointer to the next character following the
token in sentence and returns a pointer to the
current token.
Fig. 8.33. Using strtok to tokenize a
string.
1 // Fig. 8.33: fig08_33.cpp
2 // Using strtok to tokenize a string.
3 #include <iostream>
4 using std::cout;
5 using std::endl;
6
7 #include <cstring> // prototype for strtok
8 using std::strtok;
9
10 int main()
11 {
12 char sentence[] = "This is a sentence with 7 tokens";
13 char *tokenPtr;
14
15 cout << "The string to be tokenized is:\n" << sentence
16 << "\n\nThe tokens are:\n\n";
17
18 // begin tokenization of sentence
19 tokenPtr = strtok( sentence, " " );
20
21 // continue tokenizing sentence until tokenPtr becomes NULL
22 while ( tokenPtr != NULL )
23 {
24 cout << tokenPtr << '\n';
25 tokenPtr = strtok( NULL, " " ); // get next token
26 } // end while
27
28 cout << "\nAfter strtok, sentence = " << sentence << endl;
29 return 0; // indicates successful termination
30 } // end main
|
The string to be tokenized is:
This is a sentence with 7 tokens
The tokens are:
This
is
a
sentence
with
7
tokens
After strtok, sentence = This
|
Subsequent calls to strtok to
continue tokenizing sentence contain NULL as the first
argument (line 25). The NULL argument indicates that the call to
strtok should continue tokenizing from the location in
sentence saved by the last call to strtok. Note that
strtok maintains this saved information in a
manner that is not visible to you. If no tokens remain when strtok is called, strtok returns NULL. The
program of Fig. 8.33
uses strtok to tokenize the string "This is a sentence with 7 tokens". The program prints each token on a separate line. Line
28 outputs sentence after tokenization. Note that strtok modifies the
input string; therefore, a copy of the string
should be made if the program requires the original after the calls to
strtok. When sentence is output after tokenization, note that
only the word "This" prints, because strtok replaced each
blank in sentence with a null character ('\0') during the
tokenization process.
Common Programming Error 8.22
Not realizing that strtok modifies the string being tokenized, then attempting to use that string as if it were the original unmodified string is a logic error.
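When the original string must be preserved, the tokenization can be performed on a copy. The following sketch assumes a hypothetical helper named splitCopy that copies the text into a temporary buffer, tokenizes the buffer with strtok and returns the tokens as string objects.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: returns the tokens of text without modifying text.
std::vector< std::string > splitCopy( const char *text,
    const char *delimiters )
{
    // copy text, including its terminating null character
    std::vector< char > buffer( text, text + std::strlen( text ) + 1 );
    std::vector< std::string > tokens;

    // the first call receives the buffer; later calls receive NULL
    char *tokenPtr = std::strtok( &buffer[ 0 ], delimiters );

    while ( tokenPtr != NULL )
    {
        tokens.push_back( tokenPtr ); // store a copy of the current token
        tokenPtr = std::strtok( NULL, delimiters ); // get next token
    }

    return tokens;
}
```

Calling splitCopy( sentence, " " ) on the sentence of Fig. 8.33 yields seven tokens and leaves sentence unchanged. Because strtok keeps its position between calls, tokenizing a second string before the first is finished would lose the first string's saved position.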
Determining String Lengths
Function strlen takes a
string as an argument and returns the number of characters in the string—the
terminating null character is not included in the length. The length is also the
index of the null character. The program of Fig. 8.34 demonstrates function
strlen.
Fig. 8.34. strlen returns the
length of a char * string.
1 // Fig. 8.34: fig08_34.cpp
2 // Using strlen.
3 #include <iostream>
4 using std::cout;
5 using std::endl;
6
7 #include <cstring> // prototype for strlen
8 using std::strlen;
9
10 int main()
11 {
12 const char *string1 = "abcdefghijklmnopqrstuvwxyz";
13 const char *string2 = "four";
14 const char *string3 = "Boston";
15
16 cout << "The length of \"" << string1 << "\" is " << strlen( string1 )
17 << "\nThe length of \"" << string2 << "\" is " << strlen( string2 )
18 << "\nThe length of \"" << string3 << "\" is " << strlen( string3 )
19 << endl;
20 return 0; // indicates successful termination
21 } // end main
|
The length of "abcdefghijklmnopqrstuvwxyz" is 26
The length of "four" is 4
The length of "Boston" is 6
|