Never create Ruby strings longer than 23 characters
This is known as the "small string optimisation" in C++, so you can see a similar implementation in Clang's libc++[1].
One interesting corollary is that moving short strings in an implementation that does this could actually be ever so slightly (negligibly) slower than moving long ones (since byte copies are slower than word copies). But generally, this is a free lunch optimisation and can save you hundreds of megs of memory when writing programs dealing with millions of short strings.
[1] http://llvm.org/svn/llvm-project/libcxx/trunk/include/string - search for "union"
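For the curious, the core of the trick is just a tagged union. A minimal sketch in C (hypothetical layout, not libc++'s or Ruby's actual one):

#include <stddef.h>

/* Minimal small-string-optimisation sketch: the same bytes hold either
   a heap descriptor or the characters themselves. */
struct sso_string {
    unsigned char is_heap;            /* discriminates the union below */
    union {
        struct {
            char   *ptr;              /* heap buffer */
            size_t  len;
            size_t  capa;
        } heap;                       /* used when is_heap != 0 */
        char buf[3 * sizeof(char *)]; /* short strings live inline */
    } as;
};

Moving a short string means copying those inline bytes rather than one pointer, which is where the "slightly slower moves" mentioned above come from.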
http://www.slideshare.net/nirusuma/what-lies-beneath-the-bea... (from March 2012) also discusses this.
Also (pedantic):
#define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))
sizeof(char) is always 1, so that division is superfluous.

sizeof(char) is implementation defined; see limits.h for more information on the probable size of char for your target. If the sizeof(char) is 1, the division will be optimized away, so there's no loss by keeping the code portable.
No, the size of char (in bits) is implementation defined, but sizeof(char) is defined to be 1, no matter what its size in bits.
No this is incorrect, see:
http://stackoverflow.com/q/4562249/25981
sizeof(char) is always defined to be one. This can't be altered by a conforming compiler.
Wrong. sizeof(char) is defined to be one. The number of bits in a byte (char) is implementation defined (this is why CHAR_BIT exists). Not the same thing.
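A two-line check settles it, and both facts are guaranteed by the standard: sizeof(char) is 1 by definition, while the bit width of a byte is whatever CHAR_BIT reports.

#include <limits.h>
#include <stdio.h>

int main(void) {
    printf("sizeof(char) = %zu\n", sizeof(char)); /* always 1, per the C standard */
    printf("CHAR_BIT     = %d\n", CHAR_BIT);      /* 8 on mainstream targets, can differ */
    return 0;
}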
More like "ruby optimizes for short strings, and chose 23 at the cut-off point for Reasons."
Can't wait for someone to write a new faster better string class which handles strings of any length by internally chopping them into 23 character portions....
lol yes....... how reasonable. ahah when i saw the title of the article all i could think was "click comments to tune in for the most amusing flamewar this week" but so many of these comments are like "so... it's 23 characters.... why not?!"
comeon peopleeeee, a bit of an arbitrary internal standard, no?
i understand the point about "CONCLUSION: it doesn't matter for a few strings!" but.... comeonnnn it must matter on some level, otherwise why is Rails such a pain in the ass to optimize? these things must add up...
It's not arbitrary. It's the length you get for free by reusing the space the String object must contain anyway in order to handle strings allocated on the heap.
And I could write books about why Rails is such a pain to optimize, but this is not it: this is an optimisation over the generic case of allocating the buffer elsewhere. An optimisation you also happen to find in C++ implementations, for example.
What non-arbitrary "small string" length would you choose?
Btw, the fact that you don't understand it doesn't make it arbitrary. Hint: it's not arbitrary at all.
seriously... this?
#define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))
explain "3" in terms of your comp sci 101 divine computer law... Ok I get itttt "len/ptr/capa/shared". I've never written an interpreter so I can't rewrite the implementation myself, but I've worked with Rails enough to know that Ruby's is terribly inefficient, and these are problems created in it's abstraction from C, not inherent to C. Maybe this RString case is a minor issue, but it seems to me to lack some rigorousness, to be a performance leak. It's one of those things that makes you wonder "What Would Linus Do?". Although I do understand the points about including it AT ALL to be "free gain, little loss" and the language can't be C, but sometimes I don't know....
Just admit that maybe something is wrong here. Please. you'll be the first Rubyist ever, the dawn of a golden era. I've been intrigued more and more by the Lisp community because many of those devs seem to have a philosophical unwillingness to compromise performance and proper engineering, whereas in Ruby people get mad at you for not embracing the compromises!! Geez.
Hint: writing "Hint:" and then saying something really obvious that you learned from the article we all just read makes you sound craaaaaazy obnoxious
(Incidentally, it's issues like this that make me thank my lucky stars I do my web work on the JVM now.)
Haha, oh, I'm not a rubyist. Far from it; I'm a systems programmer. My ruby experience comes from writing add-ons for a hobbyist video game engine. I agree that ruby is often quite inefficient.
However, in this case, the number 23 comes from the amount of space that would be needed to store the smallest string in heap memory after taking into account malloc's bookkeeping info and whatnot.
As an aside, if you already knew the answer, why go on a rant that makes you seem ignorant?
I mean I understand the logic of optimizing for cases when you are going to be less than the lower-bound of data you can throw to the heap as you explain. I guess my question (I am honestly asking this one) is why not use the struct char array approach or some variant of it to optimize in the other direction as well (using 256 as the limit, for example, since it is closely associated with DB applications)? Is there a concern for wasted memory, or would it screw up their implementation of "union" or something? (I will, btw, willfully admit ignorance on some super in-depth C mechanics but I don't think it invalidates my point.)
I've heard of cases where Ruby performance supposedly increased tremendously because of raising some memory / garbage collection limits, and in today's world I always wondered why they don't try harder to make the sacrifice on memory footprint in order to gain speed. As a platform that is to this day routinely blasted for being so damn slow (I'm sure a lot of this falls on Rails' shoulders especially since ActiveRecord still was glacial when I stopped following... ~1.3 seconds of Ruby processing to render simple pages IIRC) why not start inserting some performance hacks or re-implementing with performance as top priority?
The answer to your question is that I'm sociopathic on the internet.
It would be a tremendous waste of memory, as a large percentage of strings are small. It would also be a massive slowdown for any situation where you are copying a big string unchanged (in which case Ruby "just" marks the string as "copy on write") or where you are copying a small string.
As for slowness, there are many, many obvious targets much higher up the list than String. Slow method dispatch being the primary culprit.
I feel you..... I was more wondering if there was some better compromise point between pre-allocated / dynamic but as you say, there are tradeoffs.
You're right, there must be a lot more low-hanging fruit out there. Something interesting though... isn't the Ruby interpreter just a standard (RubySpec)? JRuby (and Rubinius I think?) are other, better-performing implementations, correct? I've been thinking about trying JRuby since I was looking into Grails and thought "eh why not just try JRuby on Rails".
but i've kinda gone in the other direction anyway and am more interested in Clojure > Ruby, especially since Clojure "frameworks" are just dead simple libs. It's weird because RoR was around first (I think, maybe some overlap in development...) but it seems that now that Clojure and Scala exist, RoR is kind of like a bad compromise between the two.
Though Clojure is kind of minimal and not everyone is singing Scala's praises...
I hope that is meant as a joke.
More to the point, Ruby always uses strings with more than 23 characters. It's strings that are passed to the client, and an HTML page is almost always bigger than 23 characters.
Ruby is a general purpose scripting language that can be used for web development (rails, sinatra) but is often used for different purposes (puppet, chef, vagrant, shoes, ...).
And even if you'd assume web development as the only purpose, there are a lot of strings that are shorter than 23 characters: headers for requests and responses, form fields passed by the client (usernames, passwords, ...), field names passed in hashes, table and column names, template and file names, URLs, or even the occasional, totally rare string in a JSON structure. It's an optimization with major gain and little loss.
Actually, my comment needs to be understood as an answer to the OP's title "Never create Ruby strings longer than 23 characters".
I understand that there are many places where strings are shorter than 23 characters and I never meant to say that this optimization was superfluous. I actually think it's pretty clever.
There's some discussion at https://news.ycombinator.com/item?id=3425164 , including some interesting technical/benchmarky comments.
Title: "Never create Ruby strings longer than 23 characters"
Conclusion: "Don’t worry! I don’t think you should refactor all your code to be sure you have strings of length 23 or less."
This is MRI (C Ruby) behaviour and not Ruby-specific, though. However, this is still interesting information.
Wouldn't it be better to use this declaration though:
struct RString {
    struct RBasic basic;
    union {
        struct {
            long len;
            char *ptr;
            union {
                long capa;
                VALUE shared;
            } aux;
        } heap;
        char ary[1]; /* overlays heap; a flexible array member isn't valid in a union */
    } as;
};
/* apologies if I messed up the syntax here */
#define RSTRING_EMBED_LEN_MAX (sizeof(((struct RString *)0)->as) - 1)
Then you can even use the padding the compiler added, if any, plus you can add more things to heap and the embed length will grow automatically.

For anyone interested, he points to an older translation of the Ruby Hacking Guide, there is a pretty much complete translation at
Thanks for the link! I'm not interested in Ruby per se, but it's fascinating nonetheless from the perspective of data structures and how they are implemented in C.
On a related note, I've found a much less comprehensive (but still useful) guide to Python internals: http://tech.blog.aknin.name/category/my-projects/pythons-inn....
I suppose the thing to do is analyse your app for the average string length, and just recompile your Ruby with that. Would be even better if it was a command line parameter.
This isn't quite right. Even if your average string length is 1k+, you shouldn't change the embedded string size to 1k+. I think these objects sit on the C stack internally, which doesn't handle large objects like this well.
Also, I would guess the performance gains (from skipping malloc) would wash out the longer your average string gets -- even if the huge stack use doesn't kill your performance for some other reason (blowing the d-cache?).
I don't think these strings ever sit on the C stack, except maybe if some C code/extension is being really clever. The standard representation for variables is a tagged pointer as far as I know, so I would assume that is all that goes on the stack. This optimization probably just saves another level of indirection.
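For reference, the tagging looks roughly like this. A sketch modeled on MRI's FIXNUM_P/INT2FIX macros from ruby.h (simplified, not the verbatim definitions): a VALUE is a machine word that is either an immediate value or a pointer to a heap object such as an RString.

#include <stdint.h>

typedef uintptr_t VALUE;

/* MRI-style tagging sketch: odd VALUEs are immediate Fixnums; even ones
   are pointers to heap objects, which are word-aligned, so their low
   bits are free to use as tags. */
#define FIXNUM_P(v) ((v) & 1)
#define INT2FIX(i)  ((VALUE)(((long)(i) << 1) | 1))
#define FIX2INT(v)  ((long)(v) >> 1) /* relies on arithmetic shift, as MRI does in practice */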
Hm, ok. Still, at some level fragmentation is eating gobs of space even if correct for the "average".
Why does "str2 = str" actually allocate a new RString instead of just pointing both str and str2 to the same RString?
That's what it is doing. The additional RString structure associates the label "str2" with the characters (on the heap) allocated for the original string.
Ruby experts can correct me if I'm wrong, but when Ruby sees a name like "str2" it looks it up in a table, which points it to the RString structure. From there, it can follow the pointer to the actual array of characters, which in this case is only stored once.
According to the article both str and str2 will point to the same char[] on the heap, but they are represented by two different RString objects. As you said, when you want to access str and str2 you need to look them up in a table. So why not have both entries in the table point to the same RString, instead of pointing to two different RStrings that point to the same char[]?
I'm sorry, I misunderstood your question. Apparently what you describe is actually the case, as the link posted by danieldk below describes.
I haven't checked the code so I may be wrong but it's possible it's for multi-threading reasons.
MRI has a global interpreter lock, so that does not make much sense.
In fact, the diagram is simply wrong. This was rectified by the author in an article two weeks later:
http://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-s...
Extremely interesting. But I cannot quite understand why RSTRING_EMBED_LEN_MAX is calculated that way.
VALUE seems to be an unsigned integer type, defined via "typedef uintptr_t VALUE;" and "typedef unsigned __int64 uintptr_t;"
But I don't get why it's calculated like that. Can anyone explain?
The small string buffer should be the same size as the "heap" struct so as not to waste memory -- remember, they share the memory as they're members of a union. The heap struct contains three members which, taking into account alignment restrictions, usually add up to three times the machine word size (which is basically what sizeof(uintptr_t) is). The "-1" is because C strings are null-terminated, so the maximum length is one less than the size of the buffer.
What I don't know is why they don't simply use sizeof(heap) as the buffer size.
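To make the arithmetic concrete, here's a stand-alone version of the calculation (stand-in types, not Ruby's actual headers) that prints 24 and 23 on a typical 64-bit build:

#include <stdint.h>
#include <stdio.h>

typedef uintptr_t VALUE; /* stand-in for MRI's typedef */

/* Mirror of the "heap" half of RString's union. */
struct heap_fields {
    long len;
    char *ptr;
    union { long capa; VALUE shared; } aux;
};

int main(void) {
    /* On a common 64-bit ABI: 8 + 8 + 8 = 24 bytes of bookkeeping,
       which leaves room for 23 characters plus the '\0'. */
    printf("sizeof(heap struct) = %zu\n", sizeof(struct heap_fields));
    printf("embed max           = %zu\n", sizeof(VALUE) * 3 - 1);
    return 0;
}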
Ah that was obvious. Thanks, very clear answer.
It's using the storage in an RString struct that isn't otherwise occupied by the RBasic info:
https://github.com/ruby/ruby/blob/8f77cfb308061ff49de0a47e82...
Note the `as` union. The `heap` version has three VALUE-sized entries, so RSTRING_EMBED_LEN_MAX is calculated accordingly, with the -1 to account for the null terminator.
Good question. In a really roundabout way it manages to be the same size as the alternative struct.
Edit: I missed that part of it was another union; removed what I said about it being off on 32-bit.
I still don't understand why they go so roundabout by dividing by one and casting to int...
Actually in C and C++, longs are 32 bit on most 32-bit platforms. If you need a 64-bit integer type, you need either "long long" or some implementation-specific equivalent.
I know that, I misread it as using two longs plus two pointers, fixed it now.
I wonder why they didn't put the cut-off optimization point at 33?
When programmers don't know in advance how long name/email/input/whatever field is going to be - they just use the magic "power of two" length :)
So 32 (or 33) in this case would be more reasonable.
Because of this line:

#define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))

23 wasn't chosen, it was calculated to be the size that would be required for a struct describing a heap string, and will actually be a different number for different architectures. Choosing to make it bigger would add unnecessary overhead to the RString struct.

> When programmers don't know

They did know. And there are many more cutoffs than powers of two, depending on storage backend.
Reminds me of this Mr Show sketch :) https://www.youtube.com/watch?v=RkP_OGDCLY0
Is Ruby's internal encoding UTF-8, then?
Each String in Ruby has its own encoding. But by default, it is UTF-8 these days.
"never use ruby" works well for me
The designer of a string class in any language - C++, Java - has to deal with the same issue: heap allocations are slower than stack allocations. But doing a stack allocation means reserving some fixed-length memory, which is a waste if you have a lot of small strings. It's a tradeoff. The Ruby approach is reasonable. I think in Microsoft's C++ STL library, the limit is 16 rather than 23. Even with the low-level, closer-to-the-metal power of C/C++, the string class designer still has to make a decision about the tradeoff.
Strings are overrated, they should never be used until you really need them.
Who needs more than 23?
This comment is also 23
This all sounds rather terrible for Ruby, doesn't it? It isn't so much that the short string is faster (though I'm left unclear whether it itself is on the stack or heap; given the GC nature of Ruby and practical considerations of the language, it must be the heap), but rather that the cost of the short string is also added to the long string in the (assumed) heap allocation of the RString, which becomes larger and thus more difficult to malloc.
If this is intended to sit on the stack, maybe. But I find that highly unlikely, especially given the timings, which look like the delta between one malloc and two and would be much more significant if it were a stack allocation versus a heap allocation (so this is not comparable to small string optimizations for the stack in C++). Otherwise it seems like a poorly considered hack.
The string type could as easily have been dynamically allocated based upon the length of the string, where the ptr by default points inside that same allocated block. If the string is expanded it can then be realloced and the string alloced somewhere else. No waste, a single allocation, etc.
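In sketch form, that alternative would look something like this (hypothetical, not anything MRI does; names made up):

#include <stdlib.h>
#include <string.h>

/* One malloc holds the header and, by default, the characters right
   behind it; growing the string can point ptr at a new buffer instead. */
struct one_block_string {
    long  len;
    long  capa;
    char *ptr; /* points just past this header until the string grows */
};

static struct one_block_string *obs_new(const char *src, size_t n) {
    struct one_block_string *s = malloc(sizeof *s + n + 1);
    if (s == NULL) return NULL;
    s->len  = (long)n;
    s->capa = (long)n;
    s->ptr  = (char *)(s + 1); /* data lives in the same block */
    memcpy(s->ptr, src, n);
    s->ptr[n] = '\0';
    return s;
}

(One catch for MRI specifically: objects live in fixed-size GC slots, so the RString header itself can't grow with the string; the union trick is what fits within that constraint.)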
I got the distinct impression (backed up by an actual code snippet defining the max embedded string size) that the 23 byte limit was calculated to exactly match the size of the data that would otherwise have to be stored for a heap string anyway. Thus, it doesn't actually take any extra space in the struct, and those 23 bytes do not go unused in other strings.

> allocation of the RString (which becomes larger and thus more difficult to malloc), and the 23 string bytes that will sit unused for longer strings

You're exactly right on the union: I tried to edit out my error on that before someone noticed (my principal point is about the mallocs), but your comment appeared right as I saved. Shame be upon me.
So this holds (on a 64-bit machine) 8 bytes for the pointer, 8 bytes for the length (string not null-terminated), and 8 bytes for the capacity. Alternately, via a union, it stores 24 bytes of string (null-terminated). It knows which of the two it is via a flag held separately in RBasic.
I retract my jab about memory loss, but it still sounds rather terrible. Every bit of code dealing with strings needs to validate flags on every use to determine what it is dealing with, alternate between length specified or null terminated, etc. Ugh.
Look up a C++ string implementation sometime, and you'll find that this is almost exactly how most efficient C++ string implementations do this too.
And it doesn't need to alternate between length-specified or null-terminated - they're all length-specified (or how do you think short Ruby strings are also able to store ASCII NUL?)
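The flag check itself compiles down to one predictable branch. A sketch modeled loosely on MRI's RSTRING_PTR macro (simplified; the flag bit here is illustrative, not MRI's actual value):

#include <stdint.h>

typedef uintptr_t VALUE;

struct RBasic  { VALUE flags; VALUE klass; };
struct RString {
    struct RBasic basic;
    union {
        struct {
            long len;
            char *ptr;
            union { long capa; VALUE shared; } aux;
        } heap;
        char ary[3 * sizeof(VALUE)]; /* embedded characters */
    } as;
};

#define STR_NOEMBED ((VALUE)1 << 13) /* illustrative flag bit */

/* Every access branches on the embed flag; cheap and predictable. */
#define STR_EMBED_P(s) (!((s)->basic.flags & STR_NOEMBED))
#define STR_PTR(s)     (STR_EMBED_P(s) ? (s)->as.ary : (s)->as.heap.ptr)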
> but rather that the cost of the short string is also added to the long string
The key point is that for short strings, you're not adding it to the cost of the long string; you're reusing the space that would otherwise track the long string.
In other words, the memory layout you get is this:
short: [RBasic]['s', 't', 'r', 'i', 'n', 'g', '\0']
long:  [RBasic][ length ][ dataptr ][ capa or shared ]

Which leads to another interesting question: how does the optimization interact with null in the middle of short strings, since only long strings have a length stored? Does Ruby check for embedded null when creating a string, and disable this optimization?
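If I'm reading MRI's headers right, no NUL-scan is needed: short strings are length-specified too, with the embedded length packed into spare bits of the RBasic flags word (see RSTRING_EMBED_LEN_MASK in ruby.h). A sketch with illustrative shift/mask values (not MRI's real ones):

#include <stdint.h>

typedef uintptr_t VALUE;
struct RBasic { VALUE flags; VALUE klass; };

/* Illustrative constants; MRI's RSTRING_EMBED_LEN_SHIFT/MASK may differ. */
#define EMBED_LEN_SHIFT 14
#define EMBED_LEN_MASK  ((VALUE)0x1f << EMBED_LEN_SHIFT)

/* Read the embedded string's length out of the flags word. */
static long embed_len(const struct RBasic *b) {
    return (long)((b->flags & EMBED_LEN_MASK) >> EMBED_LEN_SHIFT);
}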