Python sum, why not strings? [closed]
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
def sum2(iterable, start=0): return start + reduce(operator.add, iterable)
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity sum(, 0) = sum2(,0) = 888
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead] sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine. Edit: I am aware that the standard way is to do «».join . My question is why the option of using sum for strings was banned, and no banning was there for, say, lists. Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
@NullUserException: It would be great if you were right, but sadly the + operation on strings is already overloaded to mean concatentation. So with + we already construct string «sums».
@S.Lott: I meant summing a sequence of lists compared to summing a sequence of strings. As it happens, «sum» of a list of lists concatenates the lists. You can sum two lists using + to concatenate them. You can sum two strings using + to concatenate them. So it makes as much sense to define sum as concatenation for strings as it is for lists. That is what I meant. Whether this is good or is bad is beside the question.
@S.Lott: read my question again. It is quite clear there. I said: «for all types of parameters except strings. It works for numbers and lists, for example.» Which means that numbers and lists are parameters in much the same way strings are. How did you understand the comparison between sum and «».join ?
@S.Lott Not to beat a dead horse, but I read the question and understood it instantly. And on a more technical level, characters in a Python string are just strings themselves, you can technically /can/ sum the characters, resulting in ordinary concatenation. ( ‘,’.join(‘foo’) , for example, returns ‘f,o,o’ .)
8 Answers 8
Python tries to discourage you from «summing» strings. You’re supposed to join them:
It’s a lot faster, and uses much less memory.
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)' 100 loops, best of 3: 8.46 msec per loop $ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)' 1000 loops, best of 3: 296 usec per loop
Edit (to answer OP’s edit): As to why strings were apparently «singled out», I believe it’s simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ».join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place «forever», i.e., since the sum was added as a built-in function (rev. 32347)
Actually, I think it’s the other way around: reduce and sum are similar for lists, because they do more or less the same thing. That’s why I used this benchmark: reduce , representing sum , versus join , which is an optimised way of «adding» strings.
I am feeling old now. I understand «forever» in Python as before 2.0. Things done after the introduction of new style classes are not that «forever».
It does seem «unpythonic» to discourage this usage. I agree with Muhammad that it’s a strange exception. But it seems this is the best reason available.
I think it’s there because of the chronological sequence of implementations: when they implemented the sum builtin, they already had join for strings. So, to avoid people unknowingly (or knowingly, but lazily/mischievously) using sum for strings, they specifically disallowed it. Since there wasn’t a specific «sum» for lists (or other types), they made the exception only for strings. Nowadays, I think they’d keep it this way, even if someone came up with a specific «sum», for backwards compatibility.
It’s not optimizing for a common case. Optimization would be using that same if statement blocking it to instead invoke the (already written) str.join code. Since that would be such an easy optimization to make, I’d assume the intent is to deliberate fail noisily to teach people about the existence of str.join .
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use «».join(..) anyway..
>>> class ZeroObject(object): . def __add__(self, other): . return other . >>> sum(["hi", "there"], ZeroObject()) 'hithere'
I find this very interesting. Of course it is too clever by half, but it adds to the understanding of this feature.
In the builtin_sum function we have this bit of code:
/* reject string values for 'start' parameter */ if (PyObject_TypeCheck(result, &PyBaseString_Type)) < PyErr_SetString(PyExc_TypeError, "sum() can't sum strings [use ''.join(seq) instead]"); Py_DECREF(iter); return NULL; >Py_INCREF(result); >
It’s explicitly checked in the code and rejected.
It’s interesting to see the code, but the question was «Why aren’t strings summed», not «How did they exclude strings from summing?» .
Well, I meant, the reason why its not working is because its explicitly banned in the code. Others seemed to answer by explaining why you shouldn’t do it.
Thanks for sharing the source code. But I get a different traceback when I try to sum two strings. sum([‘foo’, ‘bar’]) -> TypeError: unsupported operand type(s) for +: ‘int’ and ‘str’ . But sum([‘foo’, ‘bar’], ») -> TypeError: sum() can’t sum strings [use ».join(seq) instead] . Any ideas why the first one returns a different traceback? I think it’s more likely to occur that way.
I just noticed the answer to my above comment is explained by @u0b34a0f6ae below. I now realise the start=0 argument is the initial value for the summation process.
The preferred, fast way to concatenate a sequence of strings is by calling ».join(sequence).
By making sum refuse to operate on strings, Python has encouraged you to use the correct method.
So it’s part of the ‘one way to do it’ philosophy. (As they could just have made sum treat strings specially)
@ThomasAhle: There is a computational reason why sum is the wrong tool for the job. sum builds the result incrementally. Doing so with strings requires (potentially many) temporary immutable strings. If done naively, this process requires lots of temporary memory and lots of copying. In contrast, if done with ».join() , the appropriate amount of memory is allocated from the beginning, and result is built right there at once. See Martelli’s exhortation.
Sure, but you can do lots of inefficient things in Python, and often it doesn’t matter. I can’t think of any other cases than ‘sum of strings’ where there are tests against doing something, hard coded in the Python C code!
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
double s are always the same size, which makes sum ‘s running time O(1)×N = O(N).
int (formerly known as long ) is arbitary-length. Let M denote the absolute value of the largest sequence element. Then sum ‘s worst-case running time is lg(M) + lg(2M) + lg(3M) + . + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + . + NM = M×(1 + 2 + . + N) = O(N²).
Thus, sum ming strings would be much slower than sum ming numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum .
That argument is wrong because the same would hold for lists and tuples, which can be summed. As stated by HS, Python explicitly check for strings, and only for strings, which just doesn’t make sense.
@Philipp: The missing piece is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum .
The Reason Why
@dan04 has an excellent explanation for the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum .
Is there any reason why the sum() implementation doesn’t just call ».join() when a string is encountered in sum() instead of returning an error instructing you to call ».join() ?
Edit: Moved the parts about immutability to history.
Basically, its a question of preallocation. When you use a statement such as
and expect it to work similar to a reduce statement, the code generated looks something like
v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a") v2 = v1 + "b" # must allocate v2 and set its size to len("a") + len("b") . res = v10000 + "$" # must allocate res and set its size to len(v9999) + len("$")
In each of these steps a new string is created, which for one might give some copying overhead as the strings are getting longer and longer. But that’s maybe not the point here. What’s more important, is that every new string on each line must be allocated to it’s specific size (which. I don’t know it it must allocate in every iteration of the reduce statement, there might be some obvious heuristics to use and Python might allocate a bit more here and there for reuse – but at several points the new string will be large enough that this won’t help anymore and Python must allocate again, which is rather expensive.
A dedicated method like join , however has the job to figure out the real size of the string before it starts and would therefore in theory only allocate once, at the beginning and then just fill that new string, which is much cheaper than the other solution.
What is the fastest way to sum two strings made up of only numbers but without converting each to an int beforehand?
If I am given two strings made up of only numbers 0-9, what is the fastest way to sum the two? Doing something like str(int(num1) + int(num2)) is not allowed and fails. It mentions in a failed test that the max for an int() is 4300 characters:
ValueError: Exceeds the limit (4300) for integer string conversion: value has 2102288 digits; use sys.set_int_max_str_digits() to increase the limit
Each of the two strings of numbers can be 1,000,000+ characters/numbers long, be empty strings, have mismatched lengths, and/or have leading zeros. The only way that I can think of is to add each top/bottom digit one at a time from right-to-left and carry the 1 over when necessary. My version passes all of the tests but fails on long numbers.
def sum_strings(x, y): if not x: x = '0' if not y: y = '0' x_len = len(x) y_len = len(y) if x_len > y_len: y = y.rjust(x_len, '0') elif y_len > x_len: x = x.rjust(y_len, '0') carry = 0 total = '' for index in range(len(x) - 1, -1, -1): new_sum = int(x[index]) + int(y[index]) + carry if new_sum > 9: new_sum -= 10 carry = 1 else: carry = 0 total = f'' answer = f'' if carry else total return answer if len(answer) > 1 else answer.lstrip('0')
Times out: Here are the example «easy» test cases.
@test.describe('Basic tests') def test_examples(): @test.it('Example tests') def basic_tests(): test.assert_equals(sum_strings("1", "1"), "2") test.assert_equals(sum_strings("123", "456"), "579")
Is there any way to do it faster? EDIT: Here is the updated/working version although now that I can see other submissions I think there are cleaner ways to do it than this:
def sum_strings(x, y): if not x: x = '0' if not y: y = '0' x_len = len(x) y_len = len(y) if x_len > y_len: y = y.rjust(x_len, '0') elif y_len > x_len: x = x.rjust(y_len, '0') carry = 0 total = [] for index in range(len(x) - 1, -1, -1): new_sum = int(x[index]) + int(y[index]) + carry if new_sum > 9: new_sum -= 10 carry = 1 else: carry = 0 total.append(str(new_sum)) if carry: total.append(str(carry)) total_str = ''.join(reversed(total)) return total_str[1:] if len(total_str) > 1 and total_str[0] == '0' else total_str