strtok
tags: strcspn, string.h, strspn, strtok
The strtok
function aids in splitting a string into pieces based on a given
set of characters which delimit each piece (tokenizing a string).
char *
strtok(char *s1, const char *s2);
If the delimiting characters are found then a pointer to the first non-delimiter is returned and the next delimiter is overwritten with a null character. Subsequent calls passing a null pointer as the first parameter will return a pointer to the next token from the original string, or a null pointer if no token remains.
Keep in mind that strtok
can modify the string you pass in (note the first
parameter is a char *
, rather than a const char *
). If you need to retain
the original data then it's best to copy it and tokenize the copy, leaving the
original data untouched.
It's also important to understand that strtok
will preserve a pointer across
calls to it, namely a pointer to the end of the last token so that it can
continue tokenizing if a null pointer is passed in as the first parameter.
Local Variables
As mentioned, strtok
needs to keep track of where it left off so that
subsequent calls with a null pointer as the s1
argument will continue to
tokenize the string that was last seen. In order to preserve data across
function calls we can use the static
storage class specifier to instruct the C
compiler that we want the variable to have a static storage duration which means
that the "lifetime" of the variable is constant. The way that's achieved
during execution of a program is for the variable's lifetime to be the entire
duration of the program itself. We choose this storage class on a local variable
because this data is only needed within the strtok
function itself.
static char *save = NULL;
Error Checking
The first condition we want to check is whether or not a set of token delimiters
was passed in, so we validate the s2
pointer. If it's a null pointer then we
indicate to the caller that no token was found by returning a null pointer.
if (!s2)
{
return NULL;
}
Now let's consider the string being tokenized. If multiple tokens exist in the
same string then the caller will initiall pass in a pointer to the string and
then pass in a null pointer to continue tokenizing the same string. This means
that a null pointer for s1
is not necessarily an error. It would be an error
if s1
were null before any tokens had been identified and the way we can
detect this condition is if our saved pointer is also null. If our saved pointer
wasn't null then strtok
should use the saved pointer and continue tokenizing
where it left off.
if (!s1)
{
if (!save)
{
return NULL;
}
s1 = save;
}
Implementation
The first job of strtok
is to find the initial character of the next token. If
s1
points at a non-delimiter character then that's the initial character.
However, if s1
points at a delimiting character then it, along with any
subsequent delimiters, must be skipped over. This would be called the span of
delimiting characters and is what the strspn
function can be used for - it
allows us to skip over any initial delimiting characters. In the case that we
skip to the end of the string (i.e. the remaining part of the string only
consisted of delimiters) then no token remains and we return a null pointer.
s1 += strspn(s1, s2);
if ('\0' == *s1)
{
return NULL;
}
The next job is to find the end of the token which will either be the end of the
string or at the next delimiter. This would be the span of non-delimiting
characters, said another way this is the span of the complement of the
delimiting characters which is what strcspn
can be used for. When we find the
end we'll update our saved pointer to that location.
save = s1 + strcspn(s1, s2);
Finally, we need to null terminate the token data to make the token itself a
string (i.e. a sequence of characters with a null terminator). There are two
possiblities for the character following the end of the token: it's either a
delimiter or a null character. Before we can check for these conditions, we
should validate the saved pointer since it was updated through the call to
strcspn
.
if (save)
{
/* Do something */
}
If the current token ends in a delimiter then we need to overwrite it with a
null character and update the saved pointer to the next character so that a
subsequent call to strok
with a null pointer for s1
will continue tokenizing
at the correct place.
if (save && *save)
{
*save++ = '\0';
}
If the current token already ends in a null terminator then no more work is
needed. It's important that we don't update the saved pointer in this case,
otherwise it would point beyond the null terminator into memory with unknown
data. By retaining a pointer to a null character, subsequent calls will return
properly when strtok
attempts to skip over any initial delimiting characters.
Testing
There are quite a few edge cases to check for with the functionality of strtok
so there are many test cases to go through. Since strtok
can preserve a
pointer across calls, the test strings consist of tokens which start with
different characters. This allows the test to ensure that tokenizing is
progressing correctly by checking a single character rather than comparing an
entire string.
The first tests ensure that input validation is performed properly (remember
that a null pointer for s1
isn't always an error). Next, tests are run to
validate that tokenization works with a single delimiter between the tokens, in
addition to switching the string being tokenized to ensure that strtok
starts
using the new string rather than the saved pointer. Following that are similar
tests with multiple delimiters. Next, validate that tokenizing strings with
initial and terminating characters works properly. Then we verify that the
delimiters can be changed while tokenizing the same string. Finally, the
standard includes a few test cases to help demonstrate the desired
functionality, so those are included as well.
int
strtokTest(void)
{
int ret = 1;
char str1[] = "ABCD:BCD:CD:D";
char str1_copy[] = "ABCD:BCD:CD:D";
char sep1[] = ":";
char str2[] = "EFGH:!FGH!:GH:!H";
char str2_copy[] = "EFGH:!FGH!:GH:!H";
char sep2[] = ":!";
char str3[] = ":!:IJKL!:!";
char sep3[] = ":!";
char str4[] = "MNOP!::";
char sep4[] = "!:";
char str5[] = "QRST::UVWX:!:YZ";
char sep5_1[] = ":";
char sep5_2[] = "!";
/* From the standard */
static char str[] = "?a???b,,,#c";
char *t = NULL;
do
{
/* Invalid token string on first call with invalid separators */
if (NULL != strtok(NULL, NULL))
{
break;
}
/* Still invalid token string with valid separators */
if (NULL != strtok(NULL, sep1))
{
break;
}
/* Valid token string with invalid separators */
if (NULL != strtok(str1, NULL))
{
break;
}
/* Tokenize on single separator */
if (0 != strcmp("ABCD", strtok(str1, sep1)))
{
break;
}
/* Tokenize new string */
if (0 != strcmp("EFGH", strtok(str2, sep2)))
{
break;
}
/* Tokenize on single seperator */
memcpy(str1, str1_copy, sizeof(str1));
if (0 != strcmp("ABCD", strtok(str1, sep1)))
{
break;
}
/* Tokenize on single seperator, second time */
if (0 != strcmp("BCD", strtok(NULL, sep1)))
{
break;
}
/* Tokenize on single seperator, third time */
if (0 != strcmp("CD", strtok(NULL, sep1)))
{
break;
}
/* Tokenize on single seperator, fourth time */
if (0 != strcmp("D", strtok(NULL, sep1)))
{
break;
}
/* Tokenize on single seperator, end */
if (NULL != strtok(NULL, sep1))
{
break;
}
/* Tokenize on multiple seperators */
memcpy(str2, str2_copy, sizeof(str2));
if (0 != strcmp("EFGH", strtok(str2, sep2)))
{
break;
}
/* Tokenize on multiple seperators, second time */
if (0 != strcmp("FGH", strtok(NULL, sep2)))
{
break;
}
/* Tokenize on multiple seperators, third time */
if (0 != strcmp("GH", strtok(NULL, sep2)))
{
break;
}
/* Tokenize on multiple seperators, fourth time */
if (0 != strcmp("H", strtok(NULL, sep2)))
{
break;
}
/* Tokenize on multiple seperators, end */
if (NULL != strtok(NULL, sep2))
{
break;
}
/* Tokenize with initial leading separators */
if (0 != strcmp("IJKL", strtok(str3, sep3)))
{
break;
}
/* Tokenize with initial leading separators, end */
if (NULL != strtok(NULL, sep3))
{
break;
}
/* Tokenize with trailing separators */
if (0 != strcmp("MNOP", strtok(str4, sep4)))
{
break;
}
/* Tokenize with trailing separators, repeat */
if (NULL != strtok(NULL, sep4))
{
break;
}
/* Change token string, start */
if (0 != strcmp("QRST", strtok(str5, sep5_1)))
{
break;
}
/* Change token string, change */
if (0 != strcmp(":UVWX:", strtok(NULL, sep5_2)))
{
break;
}
/* Change token string, change again */
if (0 != strcmp("YZ", strtok(NULL, sep5_1)))
{
break;
}
/* Test from the standard: 't points to the token "a"' */
t = strtok(str, "?");
if (0 != strcmp("a", t))
{
break;
}
/* Test from the standard, 't points to the token "??b" */
t = strtok(NULL, ",");
if (0 != strcmp("??b", t))
{
break;
}
/* Test from the standard, 't points to the token "c"' */
t = strtok(NULL, "#,");
if (0 != strcmp("c", t))
{
break;
}
/* Test from the standard, 't is a null pointer' */
t = strtok(NULL, "?");
if (t)
{
break;
}
ret = 0;
} while (0);
return ret;
}
Conclusion
Although the work performed by strtok
is complicated, it can be described
succinctly through the use of existing library functions (namely strspn
and
strcspn
). This is a great demonstration of the power of capturing common logic
into functions for reuse.
char *
strtok(char *s1, const char *s2)
{
static char *save = NULL;
if (!s2)
{
return NULL;
}
if (!s1)
{
if (!save)
{
return NULL;
}
s1 = save;
}
/* Skip initial delimiters */
s1 += strspn(s1, s2);
if ('\0' == *s1)
{
return NULL;
}
/* Find next delimiter */
save = s1 + strcspn(s1, s2);
if (save && *save)
{
*save++ = '\0';
}
return s1;
}