combine two arrays and sort
I really doubt that you'll do any better than combining concatenate , unique , and sort without writing a compiled extension.
Since you use numpy, I doubt that bisect helps you at all. So instead I would suggest a few smaller things:
- Do not use np.sort ; use the c.sort() method instead, which sorts the array in place and avoids the copy.
- np.unique must use np.sort , which is not in place. So instead of calling np.unique , do its logic by hand: first sort in place, then reproduce what np.unique does (check its Python source as well) with flag = np.concatenate(([True], ar[1:] != ar[:-1])) , so that unique = ar[flag] (with ar sorted). To do a bit better still, you can make the flag operation itself in place, i.e. flag = np.ones(len(ar), dtype=bool) followed by np.not_equal(ar[1:], ar[:-1], out=flag[1:]) , which avoids essentially one full copy of flag .
- I am not sure about this, but .sort has three different algorithms; since your arrays may already be almost sorted, changing the sorting method might make a speed difference.
This would make the full thing close to what you got (without doing a unique beforehand):
def insort(a, b, kind='mergesort'):
    # mergesort seemed a tiny bit faster for my sorted large-array test
    c = np.concatenate((a, b))  # we still need to do this, unfortunately
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
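If you want to check the point about sort kinds against your own data, a quick timing sketch (the array sizes and the kinds compared are my arbitrary choices, not from the answer):

```python
import timeit
import numpy as np

# nearly-sorted input: two sorted arrays concatenated, as in insort above
rng = np.random.RandomState(0)
a = np.sort(rng.randint(0, 10**6, 200000))
b = np.sort(rng.randint(0, 10**6, 200000))
c0 = np.concatenate((a, b))

for kind in ('quicksort', 'mergesort'):
    # copy first so each run sorts the same unsorted input
    t = timeit.timeit(lambda: c0.copy().sort(kind=kind), number=5)
    print("%-10s %.4fs" % (kind, t))
```

Which kind wins depends on your data and NumPy version, so measure rather than assume.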
If you roll these suggestions into a function, I’d be happy to add it to the timeit framework I posted in my answer.
@mgilson added it, but if you want to compare it, compare with large arrays; small ones are likely beside the point here, since there is too much fixed overhead that does not matter for this application.
And I have to say, this answer is pretty remarkable. (As documented in my comment on the original question) I didn't think the code posted by the OP could be improved on very much. Well done. (Hopefully others will see this and upvote gratuitously.)
Inserting elements into the middle of an array is a very inefficient operation, as arrays are flat in memory, so you need to shift everything along whenever you insert another element. As a result, you probably don't want to use bisect ; the complexity of doing so would be around O(N^2) .
Your current approach is O(n*log(n)) , so that’s a lot better, but it’s not perfect.
Inserting all the elements into a hash table (such as a set ) is one option. That takes O(N) time to uniquify, but then you need to sort, which takes O(n*log(n)) . Still not great.
The real O(N) solution involves allocating an array and then populating it one element at a time by taking the smaller head of your two input lists, i.e. a merge. Unfortunately neither numpy nor Python seem to have such a thing. The solution may be to write one in Cython.
It would look vaguely like the following:
def foo(numpy.ndarray[int, ndim=1] out,
        numpy.ndarray[int, ndim=1] in1,
        numpy.ndarray[int, ndim=1] in2):
    cdef int i = 0
    cdef int j = 0
    cdef int k = 0
    while (i != len(in1)) or (j != len(in2)):
        # set out[k] to the smaller of in1[i] or in2[j]
        # increment k
        # increment one of i or j
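The merge loop sketched above can be written out in plain Python to show the idea (slow compared to a compiled version, but the O(N) structure is the same):

```python
def merge_sorted(in1, in2):
    # two-pointer merge of two presorted sequences, O(len(in1) + len(in2))
    out = [0] * (len(in1) + len(in2))
    i = j = k = 0
    while i != len(in1) or j != len(in2):
        if j == len(in2) or (i != len(in1) and in1[i] <= in2[j]):
            out[k] = in1[i]  # in1's head is smaller, or in2 is exhausted
            i += 1
        else:
            out[k] = in2[j]
            j += 1
        k += 1
    return out

print(merge_sorted([1, 2, 4, 5, 6, 8, 9], [3, 4, 7, 10]))
# [1, 2, 3, 4, 4, 5, 6, 7, 8, 9, 10]
```

Note this keeps duplicates; skipping an element when it equals the previous output would give the unique variant.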
Great efficiency improvement, as long as your assumption of presorted inputs is correct. It seems the OP intended that.
You can always check that your inputs are sorted, that only takes O(n) time, so doesn’t increase the complexity.
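Such a check is a one-liner with NumPy (this helper is my illustration, not from the answer):

```python
import numpy as np

def is_sorted(arr):
    # O(n): every element is <= its successor
    return bool(np.all(arr[:-1] <= arr[1:]))

print(is_sorted(np.array([1, 2, 2, 3])))  # True
print(is_sorted(np.array([3, 1, 2])))     # False
```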
When curious about timings, it's always best to just timeit . Below, I've listed a subset of the various methods and their timings:
import numpy as np
import timeit
import heapq

def insort(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid]:
            hi = mid
        else:
            lo = mid + 1
    return lo, np.insert(a, lo, [x])

size = 10000
a = np.array(range(size))
b = np.array(range(size))

def op(a, b):
    return np.unique(np.concatenate((a, b)))

def martijn(a, b):
    c = np.copy(a)
    lo = 0
    for i in b:
        lo, c = insort(c, i, lo)
    return c

def martijn2(a, b):
    c = np.zeros(len(a) + len(b), a.dtype)
    for i, v in enumerate(heapq.merge(a, b)):
        c[i] = v
    return c

def larsmans(a, b):
    return np.array(sorted(set(a) | set(b)))

def larsmans_mod(a, b):
    return np.array(set.union(set(a), b))

def sebastian(a, b, kind='mergesort'):
    # mergesort seemed a tiny bit faster for my sorted large-array test
    c = np.concatenate((a, b))  # we still need to do this, unfortunately
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]
martijn2      25.1079499722
OP            1.44831800461
larsmans      9.91507601738
larsmans_mod  5.87612199783
sebastian     3.50475311279e-05
My specific contribution here is larsmans_mod , which avoids creating two sets; it creates only one, and in doing so cuts execution time nearly in half.
EDIT: removed martijn as it was too slow to compete. Also tested with slightly bigger (sorted) input arrays. I also have not tested the output for correctness.
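One caveat with larsmans_mod as written: np.array(set.union(set(a), b)) passes a set straight to np.array , which yields a 0-d object array rather than a sorted 1-D array. A corrected single-set variant (my guess at the intended behaviour, not code from the answer) might be:

```python
import numpy as np

def larsmans_mod_fixed(a, b):
    # build one set, union it with b, then sort into a proper 1-D array
    return np.array(sorted(set(a).union(b)))

a = np.array([1, 2, 4, 5, 6, 8, 9])
b = np.array([3, 4, 7, 10])
print(larsmans_mod_fixed(a, b))
```

The sorted() call adds back some of the cost the original variant was avoiding, so re-time it before drawing conclusions.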
In addition to the other answer on using bisect.insort , if you are not content with performance, you may try using blist module with bisect . It should improve the performance.
Also, your arrays seem to be sorted. If so, you can use the merge function from the heapq module to exploit the fact that both arrays are presorted. This approach carries some overhead because it creates a new array in memory, but it may be an option worth considering, since this solution's time complexity is O(n+m) , while the insort solutions are O(n*m) (n elements times m insertions).
import heapq

a = [1,2,4,5,6,8,9]
b = [3,4,7,10]
it = heapq.merge(a, b)  # iterator over the merged elements of a and b
L = list(it)            # list built from the iterator
print(L)
If you want to delete repeating values, you can use groupby:
import heapq
import itertools

a = [1,2,4,5,6,8,9]
b = [3,4,7,10]
it = heapq.merge(a, b)                      # iterator over the merged elements of a and b
it = (k for k, v in itertools.groupby(it))  # keep one element per run of equal values
L = list(it)
print(L)
@Jun You should do some tests to determine the memory consumption. The problem with traditional arrays and lists is that inserting requires moving every element after the insertion point along by one, which is O(n) . blist uses a different data structure, so insertion is O(log(n)) . That will change things dramatically for large arrays and lists.
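A rough illustration of that O(n) shifting cost for a plain Python list (the sizes here are arbitrary, and blist itself is not used):

```python
import timeit

setup = "L = list(range(100000))"
# inserting at the front shifts every element; appending shifts nothing
front = timeit.timeit("L.insert(0, -1); L.pop(0)", setup=setup, number=1000)
back = timeit.timeit("L.append(-1); L.pop()", setup=setup, number=1000)
print("front insert: %.4fs  append: %.4fs" % (front, back))
```

The front-insert timing should be orders of magnitude slower, which is the gap a tree-backed structure like blist narrows.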
You could use the bisect module for such merges, merging the second Python list into the first.
The bisect* functions work for numpy arrays, but the insort* functions don't. It's easy enough to adapt the algorithm from the module source code; it's quite basic:
from numpy import array, copy, insert

def insort(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid]:
            hi = mid
        else:
            lo = mid + 1
    return lo, insert(a, lo, [x])

a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])
c = copy(a)
lo = 0
for i in b:
    lo, c = insort(c, i, lo)
Not that the custom insort is really adding anything here; the default bisect.bisect works just fine too:
import bisect

c = copy(a)
for i in b:
    lo = bisect.bisect(c, i)
    c = insert(c, lo, i)  # np.insert takes (array, index, value)
Using this adapted insort is much more efficient than concatenating and sorting. Because b is sorted as well, we can track the lo insertion point and search for the next point starting from there instead of considering the whole array on each loop.
If you don't need to preserve a , just operate directly on that array and save yourself the copy.
More efficient still: because both lists are sorted, we can use heapq.merge :
from numpy import zeros
import heapq

c = zeros(len(a) + len(b), a.dtype)
for i, v in enumerate(heapq.merge(a, b)):
    c[i] = v
That's just the code I came up with. But I realize that if array 'b' is millions of elements long, then the machine will reallocate array 'c' millions of times in your code. That doesn't look very efficient to me.
@Jun: Yeah, I don't like insert that much, so I've found a better method still using a pre-allocated array and heapq.merge.
import bisect
from numpy import array, insert

a = array([1,2,4,5,6,8,9])
b = array([3,4,7,10])
for i in b:
    pos = bisect.bisect(a, i)
    a = insert(a, pos, i)  # np.insert returns a new array, so reassign
I can't test this right now, but it should work
The sortednp package implements an efficient merge of sorted numpy arrays, just sorting the values, not making them unique:
import numpy as np
import sortednp

a = np.array([1,2,4,5,6,8,9])
b = np.array([3,4,7,10])
c = sortednp.merge(a, b)
I measured the times and compared them in this answer to a similar post where it outperforms numpy's mergesort (v1.17.4).
It seems no one has mentioned np.union1d . Currently it is a shortcut for unique(concatenate((ar1, ar2))) , but it's a short name to remember, and it has the potential to be optimized by the numpy developers since it's a library function. It performs very similarly to insort from seberg's accepted answer for large arrays. Here is my benchmark:
import numpy as np

def insort(a, b, kind='mergesort'):
    # mergesort seemed a tiny bit faster for my sorted large-array test
    c = np.concatenate((a, b))  # we still need to do this, unfortunately
    c.sort(kind=kind)
    flag = np.ones(len(c), dtype=bool)
    np.not_equal(c[1:], c[:-1], out=flag[1:])
    return c[flag]

size = int(1e7)
a = np.random.randint(np.iinfo(np.int).min, np.iinfo(np.int).max, size)
b = np.random.randint(np.iinfo(np.int).min, np.iinfo(np.int).max, size)

np.testing.assert_array_equal(insort(a, b), np.union1d(a, b))

import timeit
repetitions = 20
print("insort:  %.5fs" % (timeit.timeit("insort(a, b)", "from __main__ import a, b, insort", number=repetitions) / repetitions,))
print("union1d: %.5fs" % (timeit.timeit("np.union1d(a, b)", "from __main__ import a, b; import numpy as np", number=repetitions) / repetitions,))
insort:  1.69962s
union1d: 1.66338s
How to concatenate arrays in Python (with examples)
The easiest way to concatenate arrays in Python is with the numpy.concatenate function, which uses the following syntax:
numpy.concatenate((a1, a2, ...), axis=0)
- a1, a2, ...: a sequence of arrays
- axis: the axis along which the arrays will be joined. Defaults to 0.
This tutorial provides several examples of using this function in practice.
Example 1: Concatenate two arrays
The following code shows how to concatenate two one-dimensional arrays:
import numpy as np

# create two arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([6, 7, 8])

# concatenate the two arrays
np.concatenate((arr1, arr2))

array([1, 2, 3, 4, 5, 6, 7, 8])
The following code shows how to concatenate two two-dimensional arrays:
import numpy as np

# create two arrays
arr1 = np.array([[3, 5], [9, 9], [12, 15]])
arr2 = np.array([[4, 0]])

# concatenate the two arrays
np.concatenate((arr1, arr2), axis=0)

array([[ 3,  5],
       [ 9,  9],
       [12, 15],
       [ 4,  0]])

# concatenate the two arrays and flatten the result
np.concatenate((arr1, arr2), axis=None)

array([ 3,  5,  9,  9, 12, 15,  4,  0])
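For completeness, axis=1 joins arrays side by side when the row counts match; this example is my addition, not part of the original tutorial:

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5], [6]])

# join columns: both arrays must have the same number of rows
print(np.concatenate((arr1, arr2), axis=1))
```

With mismatched row counts, np.concatenate raises a ValueError instead.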
Example 2: Concatenate more than two arrays
We can use similar code to concatenate more than two arrays:
import numpy as np

# create four arrays
arr1 = np.array([[3, 5], [9, 9], [12, 15]])
arr2 = np.array([[4, 0]])
arr3 = np.array([[1, 1]])
arr4 = np.array([[8, 8]])

# concatenate all the arrays
np.concatenate((arr1, arr2, arr3, arr4), axis=0)

array([[ 3,  5],
       [ 9,  9],
       [12, 15],
       [ 4,  0],
       [ 1,  1],
       [ 8,  8]])

# concatenate all the arrays and flatten the result
np.concatenate((arr1, arr2, arr3, arr4), axis=None)

array([ 3,  5,  9,  9, 12, 15,  4,  0,  1,  1,  8,  8])