The strtok function aids in splitting a string into pieces based on a given set of characters which delimit each piece (tokenizing a string).

char *
strtok(char *s1, const char *s2);

If the delimiting characters are found then a pointer to the first non-delimiter is returned and the next delimiter is overwritten with a null character. Subsequent calls passing a null pointer as the first parameter will return a pointer to the next token from the original string, or a null pointer if no token remains.

Keep in mind that strtok can modify the string you pass in (note the first parameter is a char *, rather than a const char *). If you need to retain the original data then it's best to copy it and tokenize the copy, leaving the original data untouched.

It's also important to understand that strtok will preserve a pointer across calls to it, namely a pointer to the end of the last token so that it can continue tokenizing if a null pointer is passed in as the first parameter.

Local Variables

As mentioned, strtok needs to keep track of where it left off so that subsequent calls with a null pointer as the s1 argument will continue to tokenize the string that was last seen. In order to preserve data across function calls we can use the static storage class specifier to instruct the C compiler that we want the variable to have a static storage duration which means that the "lifetime" of the variable is constant. The way that's achieved during execution of a program is for the variable's lifetime to be the entire duration of the program itself. We choose this storage class on a local variable because this data is only needed within the strtok function itself.

static char *save = NULL;

Error Checking

The first condition we want to check is whether or not a set of token delimiters was passed in, so we validate the s2 pointer. If it's a null pointer then we indicate to the caller that no token was found by returning a null pointer.

if (!s2)
{
    return NULL;
}

Now let's consider the string being tokenized. If multiple tokens exist in the same string then the caller will initiall pass in a pointer to the string and then pass in a null pointer to continue tokenizing the same string. This means that a null pointer for s1 is not necessarily an error. It would be an error if s1 were null before any tokens had been identified and the way we can detect this condition is if our saved pointer is also null. If our saved pointer wasn't null then strtok should use the saved pointer and continue tokenizing where it left off.

if (!s1)
{
    if (!save)
    {
        return NULL;
    }

    s1 = save;
}

Implementation

The first job of strtok is to find the initial character of the next token. If s1 points at a non-delimiter character then that's the initial character. However, if s1 points at a delimiting character then it, along with any subsequent delimiters, must be skipped over. This would be called the span of delimiting characters and is what the strspn function can be used for - it allows us to skip over any initial delimiting characters. In the case that we skip to the end of the string (i.e. the remaining part of the string only consisted of delimiters) then no token remains and we return a null pointer.

s1 += strspn(s1, s2);
if ('\0' == *s1)
{
    return NULL;
}

The next job is to find the end of the token which will either be the end of the string or at the next delimiter. This would be the span of non-delimiting characters, said another way this is the span of the complement of the delimiting characters which is what strcspn can be used for. When we find the end we'll update our saved pointer to that location.

save = s1 + strcspn(s1, s2);

Finally, we need to null terminate the token data to make the token itself a string (i.e. a sequence of characters with a null terminator). There are two possiblities for the character following the end of the token: it's either a delimiter or a null character. Before we can check for these conditions, we should validate the saved pointer since it was updated through the call to strcspn.

if (save)
{
    /* Do something */
}

If the current token ends in a delimiter then we need to overwrite it with a null character and update the saved pointer to the next character so that a subsequent call to strok with a null pointer for s1 will continue tokenizing at the correct place.

if (save && *save)
{
    *save++ = '\0';
}

If the current token already ends in a null terminator then no more work is needed. It's important that we don't update the saved pointer in this case, otherwise it would point beyond the null terminator into memory with unknown data. By retaining a pointer to a null character, subsequent calls will return properly when strtok attempts to skip over any initial delimiting characters.

Testing

There are quite a few edge cases to check for with the functionality of strtok so there are many test cases to go through. Since strtok can preserve a pointer across calls, the test strings consist of tokens which start with different characters. This allows the test to ensure that tokenizing is progressing correctly by checking a single character rather than comparing an entire string.

The first tests ensure that input validation is performed properly (remember that a null pointer for s1 isn't always an error). Next, tests are run to validate that tokenization works with a single delimiter between the tokens, in addition to switching the string being tokenized to ensure that strtok starts using the new string rather than the saved pointer. Following that are similar tests with multiple delimiters. Next, validate that tokenizing strings with initial and terminating characters works properly. Then we verify that the delimiters can be changed while tokenizing the same string. Finally, the standard includes a few test cases to help demonstrate the desired functionality, so those are included as well.

int
strtokTest(void)
{
    int ret             = 1;
    char str1[]         = "ABCD:BCD:CD:D";
    char str1_copy[]    = "ABCD:BCD:CD:D";
    char sep1[]         = ":";
    char str2[]         = "EFGH:!FGH!:GH:!H";
    char str2_copy[]    = "EFGH:!FGH!:GH:!H";
    char sep2[]         = ":!";
    char str3[]         = ":!:IJKL!:!";
    char sep3[]         = ":!";
    char str4[]         = "MNOP!::";
    char sep4[]         = "!:";
    char str5[]         = "QRST::UVWX:!:YZ";
    char sep5_1[]       = ":";
    char sep5_2[]       = "!";
    /* From the standard */
    static char str[]   = "?a???b,,,#c";
    char *t             = NULL;

    do
    {
        /* Invalid token string on first call with invalid separators */
        if (NULL != strtok(NULL, NULL))
        {
            break;
        }

        /* Still invalid token string with valid separators */
        if (NULL != strtok(NULL, sep1))
        {
            break;
        }

        /* Valid token string with invalid separators */
        if (NULL != strtok(str1, NULL))
        {
            break;
        }

        /* Tokenize on single separator */
        if (0 != strcmp("ABCD", strtok(str1, sep1)))
        {
            break;
        }

        /* Tokenize new string */
        if (0 != strcmp("EFGH", strtok(str2, sep2)))
        {
            break;
        }

        /* Tokenize on single seperator */
        memcpy(str1, str1_copy, sizeof(str1));
        if (0 != strcmp("ABCD", strtok(str1, sep1)))
        {
            break;
        }

        /* Tokenize on single seperator, second time */
        if (0 != strcmp("BCD", strtok(NULL, sep1)))
        {
            break;
        }

        /* Tokenize on single seperator, third time */
        if (0 != strcmp("CD", strtok(NULL, sep1)))
        {
            break;
        }

        /* Tokenize on single seperator, fourth time */
        if (0 != strcmp("D", strtok(NULL, sep1)))
        {
            break;
        }

        /* Tokenize on single seperator, end */
        if (NULL != strtok(NULL, sep1))
        {
            break;
        }

        /* Tokenize on multiple seperators */
        memcpy(str2, str2_copy, sizeof(str2));
        if (0 != strcmp("EFGH", strtok(str2, sep2)))
        {
            break;
        }

        /* Tokenize on multiple seperators, second time */
        if (0 != strcmp("FGH", strtok(NULL, sep2)))
        {
            break;
        }

        /* Tokenize on multiple seperators, third time */
        if (0 != strcmp("GH", strtok(NULL, sep2)))
        {
            break;
        }

        /* Tokenize on multiple seperators, fourth time */
        if (0 != strcmp("H", strtok(NULL, sep2)))
        {
            break;
        }

        /* Tokenize on multiple seperators, end */
        if (NULL != strtok(NULL, sep2))
        {
            break;
        }

        /* Tokenize with initial leading separators */
        if (0 != strcmp("IJKL", strtok(str3, sep3)))
        {
            break;
        }

        /* Tokenize with initial leading separators, end */
        if (NULL != strtok(NULL, sep3))
        {
            break;
        }

        /* Tokenize with trailing separators */
        if (0 != strcmp("MNOP", strtok(str4, sep4)))
        {
            break;
        }

        /* Tokenize with trailing separators, repeat */
        if (NULL != strtok(NULL, sep4))
        {
            break;
        }

        /* Change token string, start */
        if (0 != strcmp("QRST", strtok(str5, sep5_1)))
        {
            break;
        }

        /* Change token string, change */
        if (0 != strcmp(":UVWX:", strtok(NULL, sep5_2)))
        {
            break;
        }

        /* Change token string, change again */
        if (0 != strcmp("YZ", strtok(NULL, sep5_1)))
        {
            break;
        }

        /* Test from the standard: 't points to the token "a"' */
        t = strtok(str, "?");
        if (0 != strcmp("a", t))
        {
            break;
        }

        /* Test from the standard, 't points to the token "??b" */
        t = strtok(NULL, ",");
        if (0 != strcmp("??b", t))
        {
            break;
        }

        /* Test from the standard, 't points to the token "c"' */
        t = strtok(NULL, "#,");
        if (0 != strcmp("c", t))
        {
            break;
        }

        /* Test from the standard, 't is a null pointer' */
        t = strtok(NULL, "?");
        if (t)
        {
            break;
        }

        ret = 0;
    } while (0);

    return ret;
}

Conclusion

Although the work performed by strtok is complicated, it can be described succinctly through the use of existing library functions (namely strspn and strcspn). This is a great demonstration of the power of capturing common logic into functions for reuse.

char *
strtok(char *s1, const char *s2)
{
    static char *save = NULL;

    if (!s2)
    {
        return NULL;
    }

    if (!s1)
    {
        if (!save)
        {
            return NULL;
        }

        s1 = save;
    }

    /* Skip initial delimiters */
    s1 += strspn(s1, s2);
    if ('\0' == *s1)
    {
        return NULL;
    }

    /* Find next delimiter */
    save = s1 + strcspn(s1, s2);
    if (save && *save)
    {
        *save++ = '\0';
    }

    return s1;
}
comments powered by Disqus