The inaugural string.h function is memcpy which will copy the specified number of characters from one object to another. It takes two character pointers, one to the destination object and one to the source object, along with the number of characters to copy. It returns a pointer to the destination object.

void *
memcpy(void *s1, const void *s2, size_t n);

As for the return value, the standard makes no indication that an error value would be returned if something went wrong, so memcpy lacks a way to do meaningful error checking. Also note that s2 is a const void *, meaning that the data to which it points won't be modifiable through the s2 pointer, but the pointer itself could be changed.

Local Variables

Since this function isn't very complicated, we will only need two local variables, a source pointer and a destination pointer. But, isn't that was s1 and s2 already are? Well, they are in a sense, but since they are void pointers then we can't do anything with them because they don't point to any specific type. What we need to do is copy those pointers into char pointers so that we can read and write characters.

char *pDst          = s1;
const char *pSrc    = s2;

Note that our source needs to be a const char * to match the const that is attached to s2. Without this, you should get a warning about discarding the const qualifier:

src/string.c: In function ‘memcpy’:
src/string.c:59:21: warning: initialization discards ‘const’ qualifier from
pointer target type [enabled by default]
     char *pSrc    = s2;
                     ^

Parameter Validation

The first thing we should do is check our inputs to be sure they are valid before we use them. The basic checks here would be ensuring that we don't have NULL pointers and that our size isn't zero:

if ( !s1 ||
     !s2 ||
     !n))
{
    return s1;
}

This will prevent us from having a NULL pointer dereference when we read or write a character and prevent us from executing unnecessary instructions if we didn't need to perform any actual copying.

The standard explicitly states that 'If copying takes place between objects that overlap, the behavior is undefined.' So, we don't have to verify that s1 isn't in the range of [s2, s2 + n - 1], or that s2 isn't in the range [s1, s1 + n - 1]. Undefined behavior means that the standard allows us to ignore this behavior, to implement a specific behavior that is then required to be documented, or to terminate execution completely. For right now, we will just ignore the behavior.

Function Body

Since we have the number of characters to copy stored in n, we'll use that as a loop counter and compare it against 0 to determine when we need to stop copying. We can then use our source and destination pointers to copy a single character at a time. Once we're done copying, all we need to do is return the original destination pointer.

while (0 < n--)
{
    *(pDst++) = *(pSrc++);
}

return s1;

Security

Although the standard says we can just ignore overlapping objects pointed to by s1 and s2, we should be able to easily verify that the objects don't overlap with the simple range checks that were mentioned before. Since size_t is an unsigned type, then we can use suggestions from the INT30-C section of the CERT C Coding Standard to check for overlap and wrapping:

if ( !s1 ||
     !s2 ||
     !n ||
    /* Check for object overlap */
    ((((unsigned long) s2) - ((unsigned long) s1)) < n) ||
    ((((unsigned long) s1) - ((unsigned long) s2)) < n) ||
    /* Check for wrapping while copying */
    ((((unsigned long) -1) - ((unsigned long) s1)) < n) ||
    ((((unsigned long) -1) - ((unsigned long) s2)) < n))
{
    return s1;
}

Due to a limitation of C90, there is no integral type guaranteed to be large enough to hold the value of a pointer, but in order to perform this error checking we must convert the pointers to an integral type. The largest possible type available in C90 is unsigned long, so that is what we'll convert the pointers to in order to perform arithmetic on them. We'll add a test later on to ensure that this is true on the system where this library is tested. As well, the function comment will include a warning about our assumption.

That's a lot of checking but it provides validation that our copy should succeed. There are other errors that could still happen, like reading outside of a memory boundary, but they are difficult (or impossible) to prevent in a general enough way to be included in the library. There are two camps here: either you prevent the user from doing something bad and let them handle the error checking themselves, or you let them try something bad at the expense of extra cycles spent on error checking in the library itself. Most implementations are done in the first way and I think it would be interesting to do it the latter way.

Backwards copy?

Another use for memcpy would be to slide a region of data over by some amount. But, what if we slide the region over by so little that its destination is actually within the original region? Well, our security checks will prevent that from happening! This is nice, but in actuality a user may need to do this very thing and the standard doesn't actually say we have to prevent it. To allow this we'll first have to remove the overlap checks.

Now that we allow the source and destination regions to overlap, let's look at what happens when we slide a region to lower memory such that destination region will overlap with the source. Assume that we have the characters, A, B, C, and D stored in memory adjacent to each other at address 0x8000, and we want to move it over to address 0x7FFE.

First, we read an A from 0x8000 and store it at 0x7FFE. Next, we read a B from 0x8001 and store it at 0x7FFF. Next we read a C from 0x8002 and store it at 0x8000, overwriting the A from our source region. Finally, we read a D from 0x8003 and store it at 0x8001, overwriting the B from our source region. We end up with A, B, C, D starting at address 0x7FFE. Although we overwrote our source region with new data, it wasn't an issue because we read from the source locations before overwriting them and our memcpy was a success.

Next, let's look at what happens when we slide a region to higher memory such that the destination region will overlap with the source. We'll use the same source address and data, but the destination will be address 0x8002.

First, we read an A from 0x8000 and store it at 0x8002, overwriting the C from our source region. Next, we read a B from 0x8001 and store it at 0x8003, overwriting the D from our source region. Next, we read an A from 0x8002 and store it at 0x8004. Finally, we read a B from 0x8003 and store it at 0x8005. We end up with A, B, A, B, starting at address 0x8002. This is the incorrect result and it happened because we overwrote source locations before reading from them.

So, how can we implement memcpy in such a way that the source and destination may overlap but the copy will still happen properly? Well, for the case that we slide the region in to higher memory, we can simply copy from the end of the region to the start, rather than the normal method of copying from start to end.

We'll redo our failed copy backwards and see what happens. First, we read a D from 0x8003 and store it at 0x8005. Next, we read a C from 0x8002 and store it at 0x8004. Next, we read a B from 0x8001 and store it at 0x8003, overwriting the D from our source region. Finally, we read an A from 0x8000 and store it to 0x8002, overwriting the C from our source region. We end up with A, B, C, D, starting at address 0x8002 and our memcpy was a success when done backwards.

In the same way that the forward copy failed when you have overlapping regions and move from lower to higher memory, a similar copy from higher to lower memory will fail when done backwards. As such, we must check the source and destination and perform a forward or backward copy as appropriate. The only true way to do this is by having two copies of the same code where one is implemented as normal, while the other moves the pointers to the end of the region and decrements them as it copies.

/* Copy forward or backward to handle object overlap */
if (pDst < pSrc)
{
    while (0 < n--)
    {
        *(pDst++) = *(pSrc++);
    }
}
else
{
    pDst += n;
    pSrc += n;

    while (0 < n--)
    {
        *(--pDst) = *(--pSrc);
    }
}

Our backward copy works very similarly to the forward copy with the exception of where our pointers start. Since we use the pre-decrement operator we must start out past the region we want to copy. This will allow us to back up in to the valid region and then perform the copy.

Efficiency

When you look at the complexity of an algorithm you can measure two things to get an idea of how the algorithm will perform: time and space. We'll look at how to analyze them both, how to improve the efficiency of memcpy, and some other generic notes on efficiency. That being said, keep in mind the famous quote, "Premature optimization is the root of all evil." Try not to be too clever and save the hardcore optimizations for when there is a true need for them (i.e. your algorithm isn't meeting speed requirements or it's using too much memory).

Before getting in to the details of efficiency, it's worth noting that the algorithm we have at this point is truly as efficient as it can be according to the standard. This is due to strict aliasing and object access rules which limit the ways in which we can read or write data. The standard tells us that a pointer to char may be used to access any other type of object which is why those are the only pointers we can use. Unfortunately, this limits us to copying one byte at a time.

For the sake of exploration we'll look at using other pointer types to perform the copy in a faster way, but remember that this breaks the rules of the standard.

Time

A common way to measure the time of an algorithm is to use Big O Notation which represents a cap on the growth of a function in relation to its input. This is appropriate for C code because each line, comparison, or assignment could actually take multiple instructions to execute depending on what hardware it runs on. Using Big O Notation allows us to normalize our measurements, providing an accurate description of our function regardless of where it runs.

For this memcpy algorithm, we can see that there is a direct correlation between the amount of input data, n, and the number of steps required to complete the copy. This means our function can be described with $O(n)$ in relation to how long it takes to execute.

Space

We can also use Big O Notation to describe the amount of memory, or space, that a function uses to do its work. With C programming, you may have arrays or multiple variables used to keep track of state, but the space for those will be constant and most likely on the stack. There is the concept of dynamic memory allocation where you use variable amounts of memory from the heap; these will be indicated by calls to functions like malloc or calloc. Since the amount of memory used with dynamic memory allocation is unknown at run time, this is the memory we are interested in measuring. Most other uses of memory will be known at compile time and would remain constant.

Our implementation of memcpy does not make use of dynamic memory allocation which means it will use the same amount of memory every time, regardless of the size of input data. We can describe this behavior with $O(1)$ indicating constant memory usage.

Multi-character copy

Copying one character at a time could be pretty slow if we were copying a large amount of characters. It could be more efficient to do four characters at a time using an int pointer (points to four bytes on x86_64), or even eight characters at a time using a long pointer (points to eight bytes on x86_64). This will incur some additional overhead for calculating which type to use with what values of n, as well as extra logic to copy the leftover characters if the chosen type is not a multiple of n. Let's explore this.

Side-note: copying via a pointer to a non-char type will violate the strict-aliasing rules of C because only pointers to void or char types may alias any other pointer. Since we don't have any type information about the objects being copied to or from, we could only use pointers to char types if we want to stay within the bounds of strict-aliasing rules.

Alignment

The first thing we'll look at is alignment, which deals with how data is arranged and accessed in memory. Data said to be "n-byte aligned" will always be at a memory address that is a multiple of n. For example, four-byte aligned data will always be at an address that ends in 0x0, 0x4, 0x8, or 0xC, because those are all multiples of four. This is an important concept when copying memory because different amounts of data can be accessed depending on the type used for access, but this only works if the addresses match the alignment of that type.

Each type in C may have a different alignment due to its size in bytes. Examples from x86_64 would be char, short, int, and long, which each have a size of one, two, four, and eight bytes, respectively, and also have the same values for their alignment. This means that storing 0xDEADBEEF at address 0x8000 might look like so:

unsigned char *pChar = (unsigned char *) 0x8000;
unsigned short *pShort = (unsigned short *) 0x8000;
unsigned int *pInt = (unsigned int *) 0x8000;
unsigned long *pLong = (unsigned long *) 0x8000;

*pChar = 0xEF;
*(pChar + 1) = 0xBE;
*(pChar + 2) = 0xAD;
*(pChar + 3) = 0xDE;

*pShort = 0xBEEF;
*(pShort + 1) = 0xDEAD;

*pInt = 0xDEADBEEF;

*pLong = 0xDEADBEEF;

As you can see, the larger the size of a type, the larger the amount of data that be accessed at one time. However, there could be a point at which you stop seeing benefits, as can be seen when we used int and long pointers.

The address we chose, 0x8000, is special because it fits the alignment of each of the four types. This won't always be the case and this is where problems start to happen. If we had chosen the address 0x8002 then the only valid types for accessing that address would be char and short, because 0x8002 is a multiple of both one and two, but not four or eight. This will come in to play when attempting to do multi-byte copies, because both the source and destination addresses will need to have the same alignment. The convenient part about copying one char at a time is that every address is one-byte aligned so there are no alignment issues to worry about.

The largest amount of data that a processor can access at one time is referred to as a word, so if we can copy a word at a time then we will theoretically be copying memory as fast as possible. The word size on each machine will change, but it will always be a multiple of the largest data type available on that machine. In terms of C90, this means copying a long at a time is the best we can do. The sign of the long is irrelevant here because an unsigned long occupies the same amount of memory as a long.

For more information about alignment, see this article by Johnathan Rentzsch.

Calculating Efficiency Gains

The second thing we'll look at is how to determine which type will be the most efficient at copying a certain amount of data. To do this, we need to consider how many steps are taken to perform a copy. For simplicities sake, we will consider one copy to be one step. If we use different types then that one step may copy one byte of data or it may copy eight.

To take care of any alignment issues (assuming that the source and destination are capable of being aligned), we will assume that our copies start one byte past an alignment boundary, and end one byte before an alignment boundary which will represent the worst-case scenario. For an n-byte aligned type, this means we have $n-1$ steps before and after we perform the n-byte aligned bulk copy. This also means that an n-byte aligned type will only be used to copy $3n - 2$ characters, or more. Otherwise, we wouldn't have any characters left to copy after handling alignment.

For x86_64, our thresholds for performing copies with each type are as follows:

char, 1-byte aligned: $3(1) - 2 = 3 - 2 = 1$
short, 2-byte aligned: $3(2) - 2 = 6 - 2 = 4$
int, 4-byte aligned: $3(4) - 2 = 12 - 2 = 10$
long, 8-byte aligned: $3(8) - 2 = 24 - 2 = 22$

As you can see, the thresholds are very close together. To determine how many steps, in the worst case, are used to copy $x$ number of characters with an $n$-byte aligned type, we will derive an equation based on the three parts of the copy: a) getting to an alignment boundary, b) the bulk copy, and c) copying whatever is leftover:

$$ f(x,n) = a + b + c $$

$$ a = (n - 1) $$

$$ b = \left \lfloor \dfrac{x - (n - 1)}{n} \right \rfloor $$

$$ c = x - a - nb $$

And if we substitute in $c$ we can simplify:

$$ f(x,n) = a + b + c = a + b + x - a - nb = x + b - nb = x + b(1 - n)$$

Then substitute in $b$ since $a$ isn't needed anymore:

$$ f(x,n) = x + b(n - 1) = x + \left \lfloor \dfrac{x - (n - 1)}{n} \right \rfloor (1 - n) $$

As an example, let's compare copying 100 characters using short and long types.

$$ f(100, 2) = 100 + \left \lfloor \dfrac{100 - (2 - 1)}{2} \right \rfloor (1 - 2) $$

$$ f(100, 2) = 100 + \left \lfloor \dfrac{100 - 1}{2} \right \rfloor (-1) $$

$$ f(100, 2) = 100 - \left \lfloor \dfrac{99}{2} \right \rfloor $$

$$ f(100, 2) = 100 - \left \lfloor 49.5 \right \rfloor $$

$$ f(100, 2) = 100 - 49 $$

$$ f(100, 2) = 51 $$

$$ f(100, 8) = 100 + \left \lfloor \dfrac{100 - (8 - 1)}{8} \right \rfloor (1 - 8) $$

$$ f(100, 8) = 100 + \left \lfloor \dfrac{100 - 7}{8} \right \rfloor (-7) $$

$$ f(100, 8) = 100 + \left \lfloor \dfrac{93}{8} \right \rfloor (-7) $$

$$ f(100, 8) = 100 + \left \lfloor 11.625 \right \rfloor (-7) $$

$$ f(100, 8) = 100 + 11(-7) $$

$$ f(100, 8) = 100 - 77 $$

$$ f(100, 8) = 23 $$

Considering the range of size_t, which dictates how many characters get copied at once, the efficiency gains of using a larger type are noticed for very small values of n.

Implementing a typeless memcpy

Next we'll look at implementing a multi-byte memcpy by starting with the basic copy from before and building up to a version which will work regardless of the type used for the "bulk" copy portion. Here was our original copy loop:

while (0 < n--)
{
    *(pDst++) = *(pSrc++);
}

This relied on the size of each copy being one byte allowing us to use the post-decrement operator to reduce our counter. More generically, this loop would be:

while (0 < n)
{
    *(pDst++) = *(pSrc++);
    n -= sizeof(*pDst);
}

Which will accomplish the same thing while working with any type chosen for pDst and pSrc (assuming they're the same). For now we are assuming that both pDst and pSrc point to memory addresses which are aligned and that n is a multiple of sizeof(*pDst); we'll deal with that later. We need to further abstract this to account for getting to an aligned boundary before the bulk copy and then to copy any leftover bytes.

Like we mentioned with our object overlap checks, C90 doesn't have an integral type that is guaranteed to be large enough to hold the value of a pointer and we need such a type in order to check the alignment of a pointer. The best we can do is use the largest type available to us, long, and note the invalid assumption that it will hold the value of a pointer. In our tests we can include a check to ensure that this is true.

char *pCharDst          = s1;
const char *pCharSrc    = s2;
my_type *pDst           = NULL;
const my_type *pSrc     = NULL;

/* Error checking */

while ((0 < (((unsigned long) pCharDst) & (sizeof(*pDst) - 1))) &&
       (0 < n--))
{
    *(pCharDst++) = *(pCharSrc++);
}

pDst = (my_type *) pCharDst;
pSrc = (my_type *) pCharSrc;

while (sizeof(*pDst) <= n)
{
    *(pDst++) = *(pSrc++);
    n -= sizeof(*pDst);
}

pCharDst = (char *) pDst;
pCharSrc = (char *) pSrc;

while (0 < n--)
{
    *(pCharDst++) = *(pCharSrc++);
}

Finally, we need to deal with our alignment assumption by checking the source and destination addresses for matching alignment. If their offsets from an alignment boundary do not match then they will never sync up to both match the alignment of the bulk copy type at the same time. In this case, the best we can do is use our generic char copy.

char *pCharDst          = s1;
const char *pCharSrc    = s2;
my_type *pDst           = NULL;
const my_type *pSrc     = NULL;

/* Error checking */

if ((((unsigned long) pCharDst) & (sizeof(*pDst) - 1)) ==
        (((unsigned long) pCharSrc) & (sizeof(*pSrc) - 1)))
{
    while ((0 < (((unsigned long) pCharDst) & (sizeof(*pDst) - 1))) &&
           (0 < n--))
    {
        *(pCharDst++) = *(pCharSrc++);
    }

    pDst = (my_type *) pCharDst;
    pSrc = (my_type *) pCharSrc;

    while (sizeof(*pDst) <= n)
    {
        *(pDst++) = *(pSrc++);
        n -= sizeof(*pDst);
    }

    pCharDst = (char *) pDst;
    pCharSrc = (char *) pSrc;
}

while (0 < n--)
{
    *(pCharDst++) = *(pCharSrc++);
}

Choosing when to switch types

Now we have all of tools we need to write a multi-byte memcpy. Before fully implementing the multi-byte memcpy, let's look at our thresholds one more time (again, these are specific to x86_64):

char, 1-byte aligned - $3(1) - 2 = 3 - 2 = 1$
short, 2-byte aligned - $3(2) - 2 = 6 - 2 = 4$
int, 4-byte aligned - $3(4) - 2 = 12 - 2 = 10$
long, 8-byte aligned - $3(8) - 2 = 24 - 2 = 22$

Not only are these the lowest values of n that are required to use that specific type, they are also the first values of n where copying with the smaller type will take more steps than with the larger. This means that the minimum count thresholds double as the efficiency thresholds. We could use a simple check to decide on which type to use for the bulk copy and then use the above code with the appropriate type substituted for pDst and pSrc. Due to the nature of alignment, ensuring that something is aligned to the largest type guarantees that alignment is satisfied for any smaller type (thought I wouldn't be surprised if there was an architecture for which that wasn't true). This means we can always perform our alignment check using a modulus of the largest type, sizeof(long), and if it passes then we're guaranteed that the source and destination can ultimately be aligned for any type.

#define MEMCPY_LONG_THRESHOLD   (22)
#define MEMCPY_INT_THRESHOLD    (10)
#define MEMCPY_SHORT_THRESHOLD  (4)

char *pCharDst          = s1;
const char *pCharSrc    = s2;
short *pShortDst        = NULL;
const short *pShortSrc  = NULL;
int *pIntDst            = NULL;
const int *pIntSrc      = NULL;
long *pLongDst          = NULL;
const long *pLongSrc    = NULL;

/* Error checking */

if ((MEMCPY_SHORT_THRESHOLD <= n) &&
    (((unsigned long) pCharDst) & (sizeof(*pLongDst) - 1)) ==
        (((unsigned long) pCharSrc) & (sizeof(*pLongSrc) - 1)))
{
    if (MEMCPY_LONG_THRESHOLD <= n)
    {
        while ((0 < (((unsigned long) pCharDst) & (sizeof(*pLongDst) - 1)))
               &&
               (0 < n--))
        {
            *(pCharDst++) = *(pCharSrc++);
        }

        pLongDst = (long *) pCharDst;
        pLongSrc = (long *) pCharSrc;

        while (sizeof(*pLongDst) <= n)
        {
            *(pLongDst++) = *(pLongSrc++);
            n -= sizeof(*pLongDst);
        }

        pCharDst = (char *) pLongDst;
        pCharSrc = (char *) pLongSrc;
    }

    if (MEMCPY_INT_THRESHOLD <= n)
    {
        while ((0 < (((unsigned long) pCharDst) & (sizeof(*pIntDst) - 1)))
               &&
               (0 < n--))
        {
            *(pCharDst++) = *(pCharSrc++);
        }

        pIntDst = (int *) pCharDst;
        pIntSrc = (int *) pCharSrc;

        while (sizeof(*pIntDst) <= n)
        {
            *(pIntDst++) = *(pIntSrc++);
            n -= sizeof(*pIntDst);
        }

        pCharDst = (char *) pIntDst;
        pCharSrc = (char *) pIntSrc;
    }

    if (MEMCPY_SHORT_THRESHOLD <= n)
    {
        while ((0 < (((unsigned long) pCharDst) & (sizeof(*pShortDst) - 1)))
               &&
               (0 < n--))
        {
            *(pCharDst++) = *(pCharSrc++);
        }

        pShortDst = (short *) pCharDst;
        pShortSrc = (short *) pCharSrc;

        while (sizeof(*pShortDst) <= n)
        {
            *(pShortDst++) = *(pShortSrc++);
            n -= sizeof(*pShortDst);
        }

        pCharDst = (char *) pShortDst;
        pCharSrc = (char *) pShortSrc;
    }
}

while (0 < n--)
{
    *(pCharDst++) = *(pCharSrc++);
}

You'll notice that we used if statements for each of the bulk copies, rather than if else for the int and short copies. This is because we aren't sure what type of system we will run on and there is a possibility that what is leftover from one bulk copy may be enough to do a smaller bulk copy. On x86_64 this could be seen when copying 22 bytes when the source and destination addresses are already aligned to the size of a long. Two bulk copies would happen with a long pointer, followed by three copies with a short pointer.

While this will mostly guarantee you get the fastest copy for any value of n, it has a lot of repetitive code and is overly complicated just to provide faster copies for a very small window of n values. In addition, if we look at each comparison as another step then a copy using short pointers has completely lost any efficiency gain it would've had because we spent five steps just to figure out we should be using shorts. Performing more detailed step or efficiency analysis really requires you to look at the assembly level and that's not what we're here to do. With that said, a copy using short pointers probably won't have any real benefits over the basic char copy. Likewise, the int copy would show gains for such a small window that it's unnecessary to implement unless your specific application utilizes a lot of copies in that range.

Now that we're down to only performing bulk copies with long pointers, we should consider the overhead of checking n, the alignment, and adjusting pointers as necessary for a successful multi-byte copy. The added overhead means that we should increase the threshold required to perform the extra work, otherwise we aren't actually gaining anything. We'll bump up our threshold by an order of magnitude by setting it to 200. You could more finely tune this threshold by looking at the assembly output of your memcpy implementation and comparing exactly how many instructions are used for the extra checks and the bulk copy itself. Our final multi-byte implementation (sans error checking) is below. Each loop here represents each piece of our efficiency equation from before: $a$ to get to an alignment boundary, $b$ to perform the bulk copy, and $c$ to copy any leftovers. Lastly, we've added two new variables to track the alignment offsets for us so we only have to calculate them once.

#define MEMCPY_EFFICIENCY_THRESHOLD     (200)

void *
memcpy(void *s1, const void *s2, size_t n)
{
    char *pCharDst          = s1;
    const char *pCharSrc    = s2;
    long *pLongDst          = NULL;
    const long *pLongSrc    = NULL;
    size_t dstOffset        = 0;
    size_t srcOffset        = 0;

    /* Error checking */

    dstOffset = ((unsigned long) pCharDst) & (sizeof(*pLongDst) - 1);
    srcOffset = ((unsigned long) pCharSrc) & (sizeof(*pLongSrc) - 1);

    if ((MEMCPY_EFFICIENCY_THRESHOLD <= n) &&
        (dstOffset == srcOffset))
    {
        while ((0 < dstOffset--) &&
               (0 < n--))
        {
            *(pCharDst++) = *(pCharSrc++);
        }

        pLongDst = (long *) pCharDst;
        pLongSrc = (long *) pCharSrc;

        while (sizeof(*pLongDst) <= n)
        {
            *(pLongDst++) = *(pLongSrc++);
            n -= sizeof(*pLongDst);
        }

        pCharDst = (char *) pLongDst;
        pCharSrc = (char *) pLongSrc;
    }

    while (0 < n--)
    {
        *(pCharDst++) = *(pCharSrc++);
    }

    return s1;
}

Other Efficiency Notes

Our parameter validation includes a check on n that is questionable. On one hand, our loops will check n and never run, so we don't actually need the check. On the other hand, we can exit early when n is zero and save ourselves from performing the other checks that are irrelevant since no data is being copied. The chance of (n == 0) being true is possible, but highly unlikely, and keeping the check means that we're adding an extra step to every call to memcpy. Due to this, we'll favor the calls to memcpy where (n != 0) by removing parameter validation on n, and let those unlikely calls suffer a few extra steps.

For error checking, we'll consider each check to be an additional step. So, we're adding just four steps to validate that our objects do not overlap and that we don't wrap during the copy. This is an incredibly small price to pay for the added security and safety that these checks add, so we'll definitely keep those.

Backwards copy revisited

Now that we have a more efficient version of our algorithm, we still need to make it copy backwards to handle the case of object overlap. We've already discussed why this is necessary and how to do it with the simple copy loop and walked through step-by-step the way to expand the simple version into a more efficient version. The pointer arithmetic takes care of a lot of the headache for us. Since this is an extension of what we've done already, the full version is found at the end.

Testing

Normally we could use memcmp to verify that our memcpy happened correctly, but we don't have that function yet. We'll create a simple test function that checks the boundaries of the copy to ensure we copied up to the boundaries, but no further. This function takes a control memory region that should be a copy of the destination region so it can be restored, as well as an offset in to the region such that we can verify no bytes before the region were touched. Before doing the actual copy, we verify that the source and destination are actually different at the critical points, otherwise our tests won't be meaningful. The last note about this function is with the use of the variables src2 and src3. This function will be used to test copying between overlapping objects and we must preserve the values of the source object that we intend to test, otherwise they may be overwritten and our test will appear to fail.

int
verifyCopy(
    const void * const  pVoidCtrl,
    const size_t        ctrlSize,
    void                *pVoidDst,
    const size_t        dstSize,
    const void * const  pVoidSrc,
    const size_t        srcSize,
    const size_t        offset)
{
    int ret                     = 1;
    const char * const pCtrl    = pVoidCtrl;
    char *pDst                  = pVoidDst;
    const char * const pSrc     = pVoidSrc;
    size_t n                    = 0;
    size_t i1                   = offset - 1;
    size_t i2                   = offset;
    size_t i3                   = 0;
    size_t i4                   = 0;
    char src2                   = 0;
    char src3                   = 0;

    if ((NULL == pCtrl) ||
        (NULL == pDst) ||
        (NULL == pSrc) ||
        (ctrlSize != dstSize) ||
        (0 == offset) ||
        ((2 * offset) >= ctrlSize) ||
        ((2 * offset) >= dstSize))
    {
        return ret;
    }

    /* Copy up to offset bytes from the end of the shortest region */
    if (dstSize > srcSize)
    {
        n = srcSize - (2 * offset);
    }
    else
    {
        n = dstSize - (2 * offset);
    }

    i3 = offset + n - 1;
    i4 = offset + n;

    src2 = pSrc[i2];
    src3 = pSrc[i3];

    do
    {
        /* Verify that we will be able to see a difference */
        if ((pDst[i1] == pSrc[i1]) ||
            (pDst[i2] == src2) ||
            (pDst[i3] == src3) ||
            (pDst[i4] == pSrc[i4]))
        {
            break;
        }

        memcpy(pDst + offset, pSrc + offset, n);

        /* Test the boundaries */
        if ((pDst[i1] != pCtrl[i1]) ||
            (pDst[i2] != src2) ||
            (pDst[i3] != src3) ||
            (pDst[i4] != pCtrl[i4]))
        {
            break;
        }

        ret = 0;
    } while (0);

    /* Return the destination to its original form */
    memcpy(pDst, pCtrl, dstSize);

    return ret;
}

Our main test function will explicitly test the error conditions and then use the previous function to verify that the memcpy implementation works properly. The method by which we verify our alignment handling is by testing every combination of start and end offset from zero to sizeof(long) - 1. The last two tests will ensure that copying between overlapping objects works properly.

int
memcpyTest(void)
{
    int ret         = 1;
    int failed      = 0;
    size_t i        = 0;
    size_t j        = 0;
    char bulk1[]    =   "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ";
    char bulk2[]    =   "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ";
    char bulk3[]    =   "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA";
    char bulk4[]    =   "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA"
                        "ZYXWVUTSRQPONMLKJIHGFEDCBAZYXWVUTSRQPONMLKJIHGFEDCBA";
    char simple1[]  =   "This will be a short string to test simple things";
    char simple2[]  =   "This will be a short string to test simple things";

    do
    {
        /* Alignment checking relies on an unsigned long holding a pointer value
         * without truncation.
         */
        if (sizeof(unsigned long) < sizeof(void *))
        {
            break;
        }

        /* No way to verify, we'll just crash here if we don't catch the NULL */
        memcpy(NULL, simple1, sizeof(simple1));

        /* We may crash here */
        memcpy(simple2, NULL, sizeof(simple2));

        for (i = 0; i < sizeof(simple2); i++)
        {
            if (simple1[i] != simple2[i])
            {
                failed = 1;
                break;
            }
        }

        if (1 == failed)
        {
            break;
        }

        /* We might crash here too */
        memcpy(simple2, (void *) ((unsigned long) -5), sizeof(simple2));

        for (i = 0; i < sizeof(simple2); i++)
        {
            if (simple1[i] != simple2[i])
            {
                failed = 1;
                break;
            }
        }

        if (1 == failed)
        {
            break;
        }

        /* And here might be another crash */
        memcpy((void *) ((unsigned long) -5), simple1, sizeof(simple1));

        /* Verify no characters are copied */
        memcpy(bulk2, bulk3, 0);

        for (i = 0; i < sizeof(bulk2); i++)
        {
            if (bulk1[i] != bulk2[i])
            {
                failed = 1;
                break;
            }
        }

        if (1 == failed)
        {
            break;
        }

        /* Test each permutation of 0 to sizeof(long) start positions and
         * offsets doing a forward copy.
         */
        for (i = 0; i < sizeof(long); i++)
        {
            for (j = 0; j < sizeof(long); j++)
            {
                if (0 != verifyCopy(bulk1 + i,
                                    sizeof(bulk1) - i,
                                    bulk2 + i,
                                    sizeof(bulk2) - i,
                                    bulk3 + i,
                                    sizeof(bulk3) - i,
                                    j + 5))
                {
                    failed = 1;
                    break;
                }
            }

            if (1 == failed)
            {
                break;
            }
        }

        if (1 == failed)
        {
            break;
        }

        /* Test each permutation of 0 to sizeof(long) start positions and
         * offsets doing a backward copy.
         */
        for (i = 0; i < sizeof(long); i++)
        {
            for (j = 0; j < sizeof(long); j++)
            {
                if (0 != verifyCopy(bulk3 + i,
                                    sizeof(bulk3) - i,
                                    bulk4 + i,
                                    sizeof(bulk4) - i,
                                    bulk2 + i,
                                    sizeof(bulk2) - i,
                                    j + 5))
                {
                    failed = 1;
                    break;
                }
            }

            if (1 == failed)
            {
                break;
            }
        }

        if (1 == failed)
        {
            break;
        }

        /* Slide a buffer backward */
        if (0 != verifyCopy(bulk1 + 12,
                            sizeof(bulk1) - 12,
                            bulk2 + 12,
                            sizeof(bulk2) - 12,
                            bulk2 + 12 + (3 * sizeof(long)),
                            sizeof(bulk2) - 12 - (3 * sizeof(long)),
                            7))
        {
            break;
        }

        /* Slide a buffer forward */
        if (0 != verifyCopy(bulk1 + 12 + (3* sizeof(long)),
                            sizeof(bulk1) - 12 - (3* sizeof(long)),
                            bulk2 + 12 + (3* sizeof(long)),
                            sizeof(bulk2) - 12 - (3* sizeof(long)),
                            bulk2 + 12,
                            sizeof(bulk2) - 12,
                            7))
        {
            break;
        }

        ret = 0;
    } while (0);

    return ret;
}

Conclusion

At the core, memcpy is very simple; just loop over the data and move it from source to destination. The versions we've implemented have several extra checks in place to prevent NULL pointer dereferences and reading/writing with an improperly chosen length. Since copying between overlapping objects is allowed if implemented, we chose to implement that to provide more functionality.

void *
memcpy(void *s1, const void *s2, size_t n)
{
    char *pDst          = s1;
    const char *pSrc    = s2;

    if ( !s1 ||
         !s2 ||
        /* Check for wrapping while copying */
        ((((unsigned long) -1) - ((unsigned long) s1)) < n) ||
        ((((unsigned long) -1) - ((unsigned long) s2)) < n))
    {
        return s1;
    }

    /* Copy forward or backward to handle object overlap */
    if (pDst < pSrc)
    {
        while (0 < n--)
        {
            *(pDst++) = *(pSrc++);
        }
    }
    else
    {
        pDst += n;
        pSrc += n;

        while (0 < n--)
        {
            *(--pDst) = *(--pSrc);
        }
    }

    return s1;
}

We also implemented bulk copies when available by using pointers to the largest type available in C90. Keep in mind that the bulk copy implementation relies on the assumption that an unsigned long is at least as large as a pointer type and that it also violates certain parts of the standard to perform copies in this way.

#define MEMCPY_EFFICIENCY_THRESHOLD     (200)

void *
memcpy(void *s1, const void *s2, size_t n)
{
    char *pCharDst          = s1;
    const char *pCharSrc    = s2;
    long *pLongDst          = NULL;
    const long *pLongSrc    = NULL;
    size_t dstOffset        = 0;
    size_t srcOffset        = 0;

    if ( !s1 ||
         !s2 ||
        /* Check for wrapping while copying */
        ((((unsigned long) -1) - ((unsigned long) s1)) < n) ||
        ((((unsigned long) -1) - ((unsigned long) s2)) < n))
    {
        return s1;
    }

    /* Copy forward or backward to handle object overlap */
    if (pCharDst < pCharSrc)
    {
        dstOffset = ((unsigned long) pCharDst) & (sizeof(*pLongDst) - 1);
        srcOffset = ((unsigned long) pCharSrc) & (sizeof(*pLongSrc) - 1);

        if ((MEMCPY_EFFICIENCY_THRESHOLD <= n) &&
            (dstOffset == srcOffset))
        {
            while ((0 < dstOffset--) &&
                   (0 < n--))
            {
                *(pCharDst++) = *(pCharSrc++);
            }

            pLongDst = (long *) pCharDst;
            pLongSrc = (long *) pCharSrc;

            while (sizeof(*pLongDst) <= n)
            {
                *(pLongDst++) = *(pLongSrc++);
                n -= sizeof(*pLongDst);
            }

            pCharDst = (char *) pLongDst;
            pCharSrc = (char *) pLongSrc;
        }

        while (0 < n--)
        {
            *(pCharDst++) = *(pCharSrc++);
        }
    }
    else
    {
        pCharDst += n;
        pCharSrc += n;

        dstOffset = ((unsigned long) pCharDst) & (sizeof(*pLongDst) - 1);
        srcOffset = ((unsigned long) pCharSrc) & (sizeof(*pLongSrc) - 1);

        if ((MEMCPY_EFFICIENCY_THRESHOLD <= n) &&
            (dstOffset == srcOffset))
        {
            while ((0 < dstOffset--) &&
                   (0 < n--))
            {
                *(--pCharDst) = *(--pCharSrc);
            }

            pLongDst = (long *) pCharDst;
            pLongSrc = (long *) pCharSrc;

            while (sizeof(*pLongDst) <= n)
            {
                *(--pLongDst) = *(--pLongSrc);
                n -= sizeof(*pLongDst);
            }

            pCharDst = (char *) pLongDst;
            pCharSrc = (char *) pLongSrc;
        }

        while (0 < n--)
        {
            *(--pCharDst) = *(--pCharSrc);
        }
    }

    return s1;
}

memcpy