Python find substring between

Find common substring between two strings

I’d like to compare 2 strings and keep the matched, splitting off where the comparison fails. So if I have 2 strings:

string1 = "apples" string2 = "appleses" answer = "apples"

string1 = "apple pie available" string2 = "apple pies" answer = "apple pie"

I’m sure there is a simple Python way of doing this but I can’t work it out, any help and explanation appreciated.

The content of the question does not correspond to what is in the title. The problem described is longest common prefix

20 Answers 20

For completeness, difflib in the standard-library provides loads of sequence-comparison utilities. For instance find_longest_match which finds the longest common substring when used on strings. Example use:

from difflib import SequenceMatcher string1 = "apple pie available" string2 = "come have some apple pies" match = SequenceMatcher(None, string1, string2).find_longest_match() print(match) # -> Match(a=0, b=15, size=9) print(string1[match.a:match.a + match.size]) # -> apple pie print(string2[match.b:match.b + match.size]) # -> apple pie

If you’re using a version older than 3.9, you’need to call find_longest_match() with the following arguments:

SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

Heads up to those using this on longer strings, you might want to set the kwarg «autojunk» to False when creating the instance of SequenceMatcher.

I’ll note that there are outstanding bugs in difflib that should prevent its use in real-world scenarios. For example, it seems that the well known ‘heuristic’ interferes with the completeness of methods such as ‘get_matching_blocks’.

Warning: This answer does not find the longest common substring! Despite its name (and the method’s documentation), find_longest_match() does not do what its name implies. The class documentation for SequenceMatcher does hint at this, however, saying: This does not yield minimal edit sequences . For example, in some cases, find_longest_match() will claim there are no matches in two strings of length 1000, even though there are matching substrings of length > 500.

man, what turkey wrote that API. Forcing you to put the lengths of the strings in everytime instead of just assume its the ful strings, and the first argument to SequenceMatcher is nearly always going to be None :@

One might also consider os.path.commonprefix that works on characters and thus can be used for any strings.

import os common = os.path.commonprefix(['apple pie available', 'apple pies']) assert common == 'apple pie'

As the function name indicates, this only considers the common prefix of two strings.

Clarified answer, it should be clear what this solution does now. The question is a bit vague in that regard. The title suggests «any substring», description and examples indicate «common prefix».

@famzah You linked to the documentation of os.commonpath this is not the same as the os.commonprefix that is used in the answer. But true, there could be some limitations, just the documentation does not mention any.

def common_start(sa, sb): """ returns the longest common substring from the beginning of sa and sb """ def _iter(): for a, b in zip(sa, sb): if a == b: yield a else: return return ''.join(_iter())

>>> common_start("apple pie available", "apple pies") 'apple pie'

Or a slightly stranger way:

def stop_iter(): """An easy way to break out of a generator""" raise StopIteration def common_start(sa, sb): return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb))

Which might be more readable as

def terminating(cond): """An easy way to break out of a generator""" if cond: return True raise StopIteration def common_start(sa, sb): return ''.join(a for a, b in zip(sa, sb) if terminating(a == b))

This solution, as of now, isn’t complete. It only compares both strings from the zeroth position. For instance: >>> common_start(«XXXXXapple pie available», «apple pies») returns an empty string.

@NitinNain: That was never clarified in the original question. But yes, this solution only finds the common start of strings

No — from that document: «There are also examples of generator expressions floating around that rely on a StopIteration raised by the expression, the target or the predicate (rather than by the __next__() call implied in the for loop proper).«

@Eric still, from the Python 3.6 release notes, Raising the StopIteration exception inside a generator will now generate a DeprecationWarning . If you run your code with Python3 -W default::DeprecationWarning , the last two examples both raise DeprecationWarning s

Its called Longest Common Substring problem. Here I present a simple, easy to understand but inefficient solution. It will take a long time to produce correct output for large strings, as the complexity of this algorithm is O(N^2).

def longestSubstringFinder(string1, string2): answer = "" len1, len2 = len(string1), len(string2) for i in range(len1): match = "" for j in range(len2): if (i + j < len1 and string1[i + j] == string2[j]): match += string2[j] else: if (len(match) >len(answer)): answer = match match = "" return answer print(longestSubstringFinder("apple pie available", "apple pies")) print(longestSubstringFinder("apples", "appleses")) print(longestSubstringFinder("bapples", "cappleses"))

This algorithm is incorrect with given some inputs (e.g. «apple pie. «, «apple pie») but works if you switch parameter position. I think there’s something wrong with the if statement when you compare i+j < len1

This doesn’t work because it does not consider scenario where you will need to do a «re-matching» for the second string. For instance, in «acdaf» vs «acdacdaf», when starting from «a» of the first string it will match all the way till the «acda» part of the second string, then it will break at c. Then no matter what you can no longer pick up acdaf.

Fix bugs with the first’s answer:

def longestSubstringFinder(string1, string2): answer = "" len1, len2 = len(string1), len(string2) for i in range(len1): for j in range(len2): lcs_temp = 0 match = '' while ((i+lcs_temp < len1) and (j+lcs_templen(answer): answer = match return answer print(longestSubstringFinder("dd apple pie available", "apple pies")) print(longestSubstringFinder("cov_basic_as_cov_x_gt_y_rna_genes_w1000000", "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000") print(longestSubstringFinder("bapples", "cappleses")) print(longestSubstringFinder("apples", "apples"))

The same as Evo’s, but with arbitrary number of strings to compare:

def common_start(*strings): """ Returns the longest common substring from the beginning of the `strings` """ def _iter(): for z in zip(*strings): if z.count(z[0]) == len(z): # check all elements in `z` are the same yield z[0] else: return return ''.join(_iter())

The fastest way I’ve found is to use suffix_trees package:

from suffix_trees import STree a = ["xxxabcxxx", "adsaabc"] st = STree.STree(a) print(st.lcs()) # "abc"

This script requests you the minimum common substring length and gives all common substrings in two strings. Also, it eliminates shorter substrings that longer substrings include already.

def common_substrings(str1,str2): len1,len2=len(str1),len(str2) if len1 > len2: str1,str2=str2,str1 len1,len2=len2,len1 #short string=str1 and long string=str2 min_com = int(input('Please enter the minumum common substring length:')) cs_array=[] for i in range(len1,min_com-1,-1): for k in range(len1-i+1): if (str1[k:i+k] in str2): flag=1 for m in range(len(cs_array)): if str1[k:i+k] in cs_array[m]: #print(str1[k:i+k]) flag=0 break if flag==1: cs_array.append(str1[k:i+k]) if len(cs_array): print(cs_array) else: print('There is no any common substring according to the parametres given') common_substrings('ciguliuana','ciguana') common_substrings('apples','appleses') common_substrings('apple pie available','apple pies')

import itertools as it ''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2)))

It does the comparison from the beginning of both strings.

I’m now wanting python to make it.takewhile a language feature: a for a, b in zip(string1, string2) while a == b

».join(el[0] for el in itertools.takewhile(lambda t: t[0] == t[1], zip(«ahello», «hello»))) returns «» , which appears to be incorrect. The correct result would be «hello» .

@AndersonGreen: You are right, it doesn’t answer exactly the question, althought his examples only took into account the starting point at first char and I pointed out it in my answer too.

def matchingString(x,y): match='' for i in range(0,len(x)): for j in range(0,len(y)): k=1 # now applying while condition untill we find a substring match and length of substring is less than length of x and y while (i+k

 A Trie data structure would work the best, better than DP. Here is the code.
 class TrieNode: def __init__(self): self.child = [None]*26 self.endWord = False class Trie: def __init__(self): self.root = self.getNewNode() def getNewNode(self): return TrieNode() def insert(self,value): root = self.root for i,character in enumerate(value): index = ord(character) - ord('a') if not root.child[index]: root.child[index] = self.getNewNode() root = root.child[index] root.endWord = True def search(self,value): root = self.root for i,character in enumerate(value): index = ord(character) - ord('a') if not root.child[index]: return False root = root.child[index] return root.endWord def main(): # Input keys (use only 'a' through 'z' and lower case) keys = ["the","anaswe"] output = ["Not present in trie", "Present in trie"] # Trie object t = Trie() # Construct trie for key in keys: t.insert(key) # Search for different keys print("<> ---- <>".format("the",output[t.search("the")])) print("<> ---- <>".format("these",output[t.search("these")])) print("<> ---- <>".format("their",output[t.search("their")])) print("<> ---- <>".format("thaw",output[t.search("thaw")])) if __name__ == '__main__': main() 
 Let me know in case of doubts.
 In case we have a list of words that we need to find all common substrings I check some of the codes above and the best was https://stackoverflow.com/a/42882629/8520109 but it has some bugs for example 'histhome' and 'homehist'. In this case, we should have 'hist' and 'home' as a result. Furthermore, it differs if the order of arguments is changed. So I change the code to find every block of substring and it results a set of common substrings:
 main = input().split(" ") #a string of words separated by space def longestSubstringFinder(string1, string2): '''Find the longest matching word''' answer = "" len1, len2 = len(string1), len(string2) for i in range(len1): for j in range(len2): lcs_temp=0 match='' while ((i+lcs_temp < len1) and (j+lcs_templen(answer)): answer = match return answer def listCheck(main): '''control the input for finding substring in a list of words''' string1 = main[0] result = [] for i in range(1, len(main)): string2 = main[i] res1 = longestSubstringFinder(string1, string2) res2 = longestSubstringFinder(string2, string1) result.append(res1) result.append(res2) result.sort() return result first_answer = listCheck(main) final_answer = [] for item1 in first_answer: #to remove some incorrect match string1 = item1 double_check = True for item2 in main: string2 = item2 if longestSubstringFinder(string1, string2) != string1: double_check = False if double_check: final_answer.append(string1) print(set(final_answer)) 
 main = 'ABACDAQ BACDAQA ACDAQAW XYZCDAQ' #>>> main = 'homehist histhome' #>>>
 Источник