Friday, March 30, 2012

Small String Optimization and Move Operations

EDIT: A lot of readers seem more interested in my implementation of the String class than in the purpose of this post, which is to show a somewhat remarkable fact about how small string optimization interacts with move operations. Please don't post more comments along the lines of "this implementation is not the most efficient"; you are missing the point. The example string class here is neither complete, efficient, nor representative of STL implementations. It is only used to illustrate the idea behind small string optimization. I don't even know if the code actually compiles.

Typically, std::string is implemented using small string optimization, a technique by which small strings are stored directly inside the object, rather than dynamically allocated on the heap.

Before I start discussing the issue with small string optimization, let me give you a quick introduction to how it works. Let's say we want to implement a very simplified string class ourselves. In each String object, we simply keep a pointer to the beginning of the char array together with its length (std::string also needs to keep a capacity and an allocator, as you are allowed to specify your own memory management).


class String {
public:
  String(const char* p);
  String(const String& s);

  String(String&& s);
  ~String();
  unsigned length() const { return _size; }

  char* begin() { return _beg; }
  char* end() { return begin() + _size; }
  // ...
private:
  char* _beg; // first character in array
  unsigned _size; // excluding terminating null char
};


We can implement these constructors in a straightforward manner, using low level programming for efficiency:


inline String::String(const char* p) : _size(strlen(p))
{
    _beg = new char[_size+1];
    memcpy(_beg,p,_size+1);
}

inline String::String(const String& s) : _size(s.length())
{
    _beg = new char[_size+1];
    memcpy(_beg,s._beg,_size+1);
}

inline String::String(String&& s) : _size(s.length())
{
   _beg = s._beg;    // steal resource from s
   s._beg = nullptr; // prevent s from deleting the buffer on destruction
}

inline String::~String() { delete [] _beg; }



(The implementation will be slightly faster if you replace new/delete with malloc/free, as there is no need to call any constructors/destructors for char arrays. In theory, compilers can optimize this away; in practice, they don't. See this post for a discussion of that issue, which is unrelated to small string optimization and move operations.)

So far so good, but now you realize that you can do something clever: since we need a char* pointer in every string anyway, why not use that space to store small strings of up to sizeof(char*) characters? That way, we won't need to call new/delete for small strings. In fact, since strings in most applications are quite short, why not store a little more while we're at it?

This is known as small string optimization. One way we can efficiently and elegantly implement this is by using tagged unions. Let's try it out:


class SString {
public:
  SString(const char* p);
  SString(const SString& s);

  SString(SString&& s);
  ~SString();
  unsigned length() const { return _size; }

  char* begin() { return _size < 16 ? str : _beg; }
  char* end() { return begin() + _size; }
  // ...
private:
  unsigned _size;
  union {
    char* _beg;
    char str[16];
  };
};


Note how SString::begin() was defined to either return _beg or str, depending on the size of the string.


The constructor for const char* is not that hard:


inline SString::SString(const char* p) : _size(strlen(p))
{
    if (_size < 16) {
      // Small string optimization: store in static str
      memcpy(str,p,_size+1);
    } else {
      // Dynamically allocate string (like before)
      _beg = new char[_size+1];
      memcpy(_beg,p,_size+1);
    }
}


The copy constructor is pretty much the same:


inline SString::SString(const SString& s) : _size(s.length())
{
    if (_size < 16) {
      // Small string optimization: store in static str
      memcpy(str,s.str,_size+1);
    } else {
      // Dynamically allocate string (like before)
      _beg = new char[_size+1];
      memcpy(_beg,s._beg,_size+1);
    }
}

Now comes the interesting part where you'll see the trade off when using small string optimization. Here is the move constructor:


inline SString::SString(SString&& s) : _size(s.length())
{
    if (_size < 16) {
       // ???
    } else {
       _beg = s._beg; // steal resource from s
       s._beg = nullptr; // prevent s from deleting on destruction
    }
}

So when we have a dynamically allocated array, we can steal the resource. When we've used the small string optimization, on the other hand, there is no way to steal the resource, as it is stored inside the object itself. We must hence copy it:


if (_size < 16) {
    memcpy(str,s.str,_size+1); // nothing to steal: copy the bytes
} else {
    _beg = s._beg;             // steal resource from s
    s._beg = nullptr;          // prevent s from deleting on destruction
}

What may be surprising to a user of the string class is that the performance degradation occurs only for small strings.
This effect can be seen on MSVC 2010, which uses a small string optimization of 16 characters for std::string. We take a command line string, then move it between two string objects 10 million times.
For strings of 16 and 17 characters, including the terminating null character, the execution times in seconds are:

Length      Time (s)
16          0.171
17          0.060

In other words, the small string optimization, which here becomes a small string deterioration, makes the move about 3 times slower than copying pointers. This leads to the surprising conclusion that when moving strings, it is actually faster to move larger strings.

Note that most of the time, you don't just move around strings, but allocate them, deallocate them, copy them, etc. In all those cases, of course, small string optimization is very useful.

27 comments:

  1. Don't really understand computer science, but it should be the future of the world.

    A question, UK is the country got most affected by the solar wind weeks ago, around 18th Mar. since uk has the most advanced technology in lots of filed including agriculture industry.

    the question is do you think the pc scince and coding can help to solve the problem caused by the uncertainty of the today's cosmic weather?

    ReplyDelete
  2. This is not my area of expertise, but yes, I believe they can and already do to some extent. Have a look here:
    http://en.wikipedia.org/wiki/Space_weather#Space_weather_modeling

    As mathematical models for space weather forecasting become more accurate (and computational power increases, as new models will presumably become more complex), better forecasting will likely be possible. However, the butterfly effect (http://en.wikipedia.org/wiki/Butterfly_effect) limits our ability to accurately predict the far future (if not for quantum mechanical randomness).

    ReplyDelete
  3. Yes, randomness. that's what i tried to think of to express, which added alot of difficulty in building up the model i guess.

    Am gonna read through these articles and wiki pages one by one.. will take a while.

    ReplyDelete
  4. > In theory, compilers can optimize this away; in practice, they don't.

    What compiler are you using?
    Here's g++ 4.5.3 on cygwin, and the compiler certainly does optimize it away: http://codepad.org/k5VSjbwq

    ReplyDelete
    Replies
    1. Thank you for pointing this out. I'm using MSVC 10; after checking the assembly output, I can see that it does not produce any call to the default constructor either.

      Since this is derailing slightly from the topic of this post, I've written a new post elaborating on this:
      http://john-ahlgren.blogspot.com/2012/04/newdelete-vs-mallocfree-for-pod-types.html

      Delete
  5. Came across this post via a link from Scott Meyers blog (cool to be linked from there). Excellent succinct write-up of the SSO and C++11 move issues and also how SSO is implemented. Thanks!

    ReplyDelete
    Replies
    1. Good to hear it's useful.

      I should point out though (now that there are a lot more readers here!): my intention was never to suggest that small string optimization (SSO) is a bad idea. On the contrary, I think it's an excellent optimization strategy that works most of the time. The penalty is always at worst the 16 bytes + whatever more is needed to store the range, capacity, and allocator, so the deterioration is definitely capped at a rather small value.

      Repeatedly moving objects (like I did here) does of course multiply this penalty, which is what you see. You'd be hard pressed to actually find real world issues with SSO.

      My point, as far as usefulness goes (I benchmarked it more out of curiosity), is the same as what Scott describes in his post: that people may have the illusion that "all moves are always fast".

      Delete
  6. I also don't understand computer science. Do you think an alien belt buckle would look good for my zombie hunter costume?

    ReplyDelete
    Replies
    1. I'm gonna need to see a picture of that before I decide...

      Delete
  7. > memcpy(str,s.str,_size+1);
    This is actually a very inefficient way of doing the copy, because passing a dynamic size argument to malloc prevents efficient optimization. A better way would be to always copy over the 16 bytes -- this way, the compiler can turn the malloc into a few direct moves. This loses a lot of the overhead of malloc that is ruining your benchmark time.

    It gets even better than that. The optimal move constructor for that string is actually one line:
    malloc(this,s,16+sizeof(_size))

    On modern processors, the load and store units always work on aligned 16-byte chunks, and the cache system past the L1 cache can only move aligned 64-byte chunks. This means that doing a single copy of a byte inside a cache line is doing exactly the same work as doing a single copy of 16 bytes, so long as a boundary isn't crossed. On 64-bit, that malloc with a static size is going to compile into 4 or 6 moves, depending on the set architecture (sse3 or not), and those will have a reciprocal throughput of 2 to 6 clock cycles, depending on alignment. In any case, it will be faster than the total function overhead, and considerably faster than having to do the if. (In the real situation where there are both small and large strings, so the if would occasionally be mispredicted.)

    ReplyDelete
    Replies
    1. Interesting. Question:

      On a modern C++ compiler that supports auto vectorization (e.g. GCC, Intel C++, MSVC 11 beta), does the following code produce vectorized instructions:

      memcpy(str, s.str, 16);

      I'm not sure I follow you on the if-branching. When we have more than 16 bytes, we need to copy a dynamic array. There's no way to perform a malloc/memcpy with static size, as we are resorting to arbitrary length strings.

      Delete
  8. > memcpy(str, s.str, 16);

    Depends on the target architecture. For pre SSE3, it just does normal moves, because while the relevant unaligned movs were added in SSE2, they only became fast in SSE3. On 64-bit, this only doubles the cost, and it should still stay well below the function overhead.

    > I'm not sure I follow you on the if-branching. When we have more than 16 bytes, we need to copy a dynamic array. There's no way to perform a malloc/memcpy with static size, as we are resorting to arbitrary length strings.

    No, in the move constructor you can *steal the resource*, like you did in your example. And when you do that, it's easy to realize that doing a direct copy of the object is faster than branching over "should I copy 16 bytes" or "should I copy 4 bytes", especially when copying 4 and 16 bytes are internally as fast.

    ReplyDelete
  9. That was a bad formulation on my part. What I meant is that when we have more than 16 bytes, we need to copy _the pointer_ to the dynamic array. When we have 16 bytes or less, we need to copy the static memory (more efficiently using the methods you've mentioned). But still, these are two different operations, and I don't see how you'd get an implementation of the move constructor without distinguishing between the cases (i.e. using an if-condition or something equivalent).

    Perhaps you could show a definition of your move constructor without using any if statement?

    ReplyDelete
  10. > Perhaps you could show a definition of your move constructor without using any if statement?

    I did. Here it is with the signature:

    inline SString::SString(SString&& s) : _size(s.length())
    {
    memcpy(this, &s, sizeof(SString));
    }


    Your pointer is stored in the same space as the static memory. If you just copy over this space, you are either copying the pointer, or the string, depending on which one happens to be in use.

    Also, since the other thing you need to copy is the length integer, and it happens to be right next to the static memory, you can extend that copy to just be the full object. So, the entire method body collapses to a single call inlined to at most 6 machine instructions, with no serial dependencies between the 3 pairs of movs, no control flow, or anything else. You *really* cannot get any better than that.

    ReplyDelete
    Replies
    1. Your previous example was a malloc call with 16+sizeof(SString), which confused me.

      I now see what you mean. Yeah, it doesn't get better than that. I'm impressed. Not very related to SSO and move operations anymore, but nevertheless very insightful info about low level programming. Thanks.

      Delete
    2. You have to set s._beg to 0, otherwise you will be deleting the memory twice if it is not a short string.

      Delete
  11. In your copy and move .ctor, you don't need to copy the [16]. You can just assign the pointer and the small string gets copied too. (Vs memcpy which would run a loop)

    ReplyDelete
    Replies
    1. That's only true if the pointer is the same size as char[16], which it typically isn't.

      Delete
    2. What Andreas said, and also read Unknown's comments; I believe this provides the most efficient implementation.

      @Andreas: Yes, you're right in that he needs to zero out the pointer too.

      @All: I should point out that Unknown's method, although impressively fast given the optimizations he mentions, does NOT work for general classes. If you have non-POD members, the constructors need to be called, rather than bitwise copying with memcpy (it may still work in practice, but I wouldn't rely on it; an STL vendor can of course do so). To take an obvious example (where it never works), consider a class that holds a shared_ptr: if you copy it using memcpy, the reference counter will not go up, which leads to undefined behavior. This kind of optimization is best done by the STL vendor, and not you.

      Delete
  12. Correction to self: shared_ptr will probably work in a move constructor, as far as reference counting goes, since the moved-from object still calls its destructor. This assumes that the shared resource does not point back to the object; if it does (as it could in your own or someone else's custom class), you'll have problems. The technicalities are endless, but the point remains the same: don't do it, unless you really know what you're dealing with.

    ReplyDelete
  13. John, did you get a metric for this ctor? Is it faster than the rest?

    inline SString::SString(SString&& rhs) : _size(rhs.length())
    {
    memcpy(this, &rhs, sizeof(SString));
    rhs._beg = nullptr; // prevent s from deleting on destruction
    }

    If you know the class has only POD members, then memcpying rhs is a pretty good optimisation.

    ReplyDelete
    Replies
    1. Before I answer your question:

      I think some of my readers may have misunderstood the purpose of this post. It was never to construct an efficient string class. In fact, it wasn't about constructing a string class at all, but to show (what I consider to be) a somewhat interesting observation about the interaction between small string optimization and move operations.

      To answer your question:

      No, I haven't benchmarked its performance. I think Unknown's comments above outline the same argument with regard to the optimization.

      I write C++ code in the proper fashion: using std::string and other STL containers when possible. I hardly ever use a call to memcpy in real code.

      Delete
    2. Btw, if you like to heavily optimize your code for POD types, you might want to read this: http://john-ahlgren.blogspot.com/2012/04/newdelete-vs-mallocfree-for-pod-types.html

      Delete
  14. There seems to be a conflict of purpose here:

    On one hand you state that this isn't supposed to be an efficient implementation, and on the other hand your conclusion seems to be a claim about the *performance* difference between implementations. But the performance characteristics of your unoptimised implementations tell you nothing at all about the performance characteristics of the kind of optimised implementations that you might actually use in production code.

    ReplyDelete
    Replies
    1. My code, as stated in the beginning, is for illustrative purposes only. The benchmark was conducted using MSVC's STL implementation, not my own code.

      Delete
    2. My apologies. I missed the relevant sentence.

      Delete