Longest common subsequence python

Содержание

Longest Common Subsequence: Python, C++ Example
Example
Naive Method
Example
Optimal Substructure
Recursive Method of Longest Comm Sequence
Implementation in C++:
Implementation in python:
Dynamic Programming method of Longest Common Subsequence (LCS)
Implementation in C++:
Implementation in Python:
How to implement Longest Common Subsequence in Python
Find Longest Common Subsequence in Python
Implementation of LCS in Python

Longest Common Subsequence: Python, C++ Example

Longest Common Subsequence (LCS) means you will be given two strings/patterns/sequences of objects. Among these two sequences/strings, you need to find the longest subsequence of elements in the same order present in both strings or patterns.

Example

For example, there are two strings provided.

Pattern_1 = “RGBGARGA”
Pattern_2 = “BGRARG”

From pattern_1, sequences can be produced like “RGB”, “RGGA”, ”RGAR”. For creating a sequence, you need to maintain the relative position of each character in the string.
From pattern_2, we can produce sequences like “BGR”, ”BRAG”, ”RARG”. Sequences can be produced as long as they maintain the original string’s relative position.

The term relative position means order.

For example, “BRG” is a valid sequence because “B” appeared first, then “R,” and then “G” in the original string pattern_2. However, if a sequence is “RBRG”, it’s not valid. Because in the original string (pattern_2), “B” comes first.

We have two options to find the Longest Common Subsequence from the given two sequences or arrays.

Naive method
Dynamic Programming Solution: Longest Common Subsequence is also known as LCS.

A naive Solution has larger time complexity and is not the optimal solution. Using Dynamic Programming Solution (DP), we overcome the complexity problem.

Naive Method

Naive method is a simple approach to the problem regardless of the time complexity and other optimization factors.

The Naive method consists of “Brute Force”, multiple loops, recursive methods in most cases.

The term brute force means going through all possible patterns for a given problem.

Example

From the above example of pattern1 and pattern2, let us assume pattern1 has a length of m and pattern2 has a length of n. To check every possible case, we need to evaluate every possible subsequence of pattern1 with pattern2.

Here’s a simple 4 letter string “ABCD”. For example, we need to create a sequence from “ABCD.” Either we can take a character or not. That means that, for each character, we have two choices.

The character will be added to the subsequence.
The character will not be added to the subsequence.

Here, the images show all the sequences that we can make from the string “ABCD”.

Sequence with 1 character:

Sequences with 2 characters:

Sequences with 3 characters:

From the above diagram, there’re 14 sequences. If we do not take letters, basically an empty string, the total sequences will be 15. Moreover, the string “ABCD” itself is a sequence. So, the total sequences are 16.

So, It is possible to generate 2 4 or 16 subsequences from “ABCD”. Then, a string with a length of m will have to total subsequence of 2 m .

For each subsequence, we need to check it for the whole pattern2. It will take 0(n) time. 0(n) means the complexity function that calculates the time taken for execution.

So, total time complexity becomes O(n*2 m ). For the example, we have seen above the value of m=8 and n=5.

Here are the steps of the Naive Method:

Step 1) Take a sequence from the pattern1.
Step 2) Match the sequence from step1 with pattern2.
Step 3) If it matches, then save the subsequence.
Step 4) If more sequence is left in the pattern1, then go to step 1 again.
Step 5) Print the longest subsequence.

Optimal Substructure

The term optimal substructure means an optimal solution (simple) can be found by solving the subproblems. For example, in the above example, we have pattern1 and pattern2.

Step 1) Take the first two characters from each pattern

Step 2) Take the third to fifth characters from each pattern.

Step 3) Continue similarly with the remaining characters.

We find the LCS on the sub-string (string generated from an original string). Then we keep the record of the length of the LCS of the substrings.

Now, here is another interesting property that is overlapping sub-problems. A problem is said to have overlapping sub-problems if the problem statement can be broken into small subproblems and used several times in the program.

The diagram below shows that the recursive algorithm called the function with the same parameter several times.

For example, let’s see the recursion tree.

In the dark-colored box, you can notice overlapping sub-problems. (“RG”, “RA”), (“RG”,”R”) and others are called several times.

For optimizing this, we have the approach of Dynamic Programming (DP).

Recursive Method of Longest Comm Sequence

The graph is shown above is the recursive method. Each recursive function has a base case to break the recursion or start returning from its stack.

For this implementation, we’ll use a base case. So, the algorithm is like the following:

If all elements before the last element have a match, then increase the length by one and return
Pass two patterns to the function, and take the max value of the return
If one pattern has zero length, then we have no subsequence to compare. Return 0 in this case. This is the base case of the recursion.

Pseudo Code:

def lcs: input: pattern_1, pattern_2, len_1, len_2 if len_1 or len_2 is zero: then return 0 if pattern_1[len_1 - 1] equals pattern_2[len_2 - 1] then return 1 + lcs(pattern_1, pattern_2, len_1 - 1, len_2 - 1) else max of lcs(pattern_1, pattern_2, len_1 - 1, len_2), lcs(pattern_1, pattern_2, len_1, len_2 - 1)

Implementation in C++:

#include #include using namespace std; int lcs(string pattern_1, string pattern_2, int len_1, int len_2) < if (len_1 == 0 || len_2 == 0) return 0; if (pattern_1[len_1 - 1] == pattern_2[len_2 - 1]) < return 1 + lcs(pattern_1, pattern_2, len_1 - 1, len_2 - 1); >else < return max(lcs(pattern_1, pattern_2, len_1 - 1, len_2), lcs(pattern_1, pattern_2, len_1, len_2 - 1)); >> int main()

Implementation in python:

def lcs(pattern_1, pattern_2, len_1, len_2): if len_1 == 0 or len_2 == 0: return 0 if pattern_1[len_1 - 1] == pattern_2[len_2 - 1]: return 1 + lcs(pattern_1, pattern_2, len_1 - 1, len_2 - 1) else : return max(lcs(pattern_1, pattern_2, len_1 - 1, len_2), lcs(pattern_1, pattern_2, len_1, len_2 - 1)) pattern_1 = "RGBGARGA" pattern_2 = "BGRARG" print("Lenght of LCS is: ", lcs(pattern_1, pattern_2, len(pattern_1), len(pattern_2)))

Dynamic Programming method of Longest Common Subsequence (LCS)

Dynamic programming means optimizing the plain recursive method. For example, if we see the recursive or naive approach graph, we can see that there are several same function calls. The Dynamic Programing method records all the calculations in an array and uses it when needed.

We’ll use a 2D array with dimensions of m x n, where m and n are lengths of pattern1 and pattern2.

For 2D array, we can use List data structures in python or vector/array data structures in C++.

Pseudo Code for LCS using DP:

LCS(pattern_1, pattern_2): m = length of pattern_1 + 1 n = length of pattern_2 + 1 dp[n][m] for i in range 0 to n + 1: for j in range 0 to m + 1: if i or j equals to 0: dp[i][j] = 0 else if pattern_1[i] == pattern_2[j]: dp[i]][j] = dp[i - 1][j - 1] + 1 else : dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) return dp[n][m]

Here’s the table LCS that is used as a 2D array data structure for the dynamic programming approach.

Let’s discuss the logic we used here. Steps are:

Step 1) If i or j is zero, we are taking an empty string from the given two strings and trying to find the common subsequences. However, as the substring we’re taking is empty, the subsequence length is 0.

Step 2) If two character matches, we’ll assign the value to the (i,j) index by incrementing the previously calculated LCS, which is present in the (i-1,j-1) index (from the previous row).

Step 3) If it doesn’t match, then we will take the max LCS of the adjacent two indexes. And this way, we need to fill all the values in the 2D array.

Step 4) Finally, we will return the value of the last cell of the 2D array.

Basically, all the value in the 2D array contains the length of common subsequences. Among these, the last cell contains the length of the longest common subsequence.

Implementation in C++:

#include using namespace std; int lcs(string pattern_1, string pattern_2) < int m = pattern_1.size(); int n = pattern_2.size(); // dp will store solutions as the iteration goes on int dp[n + 1][m + 1]; for (int i = 0; i & lt; n + 1; i++) < for (int j = 0; j & lt; m + 1; j++) < if (i == 0 || j == 0) < dp[i][j] = 0; >else if (pattern_2[i — 1] == pattern_1[j — 1]) < dp[i][j] = dp[i - 1][j - 1] + 1; >else < dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]); >> > return dp[n][m]; > int main()

Implementation in Python:

def lcs(pattern_1, pattern_2): m = len(pattern_1) n = len(pattern_2) # dp will store solutions as the iteration goes on dp = [ [None] * (n + 1) for item in range(m + 1) ] for i in range(m + 1): for j in range(n + 1): if i == 0 or j == 0: dp[i][j] = 0 elif pattern_1[i - 1] == pattern_2[j - 1]: dp[i][j] = dp[i - 1][j - 1] + 1 else : dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) return dp[m][n] pattern_1 = "RGBGARGA" pattern_2 = "BGRARG" print("Length of LCS: ", lcs(pattern_1, pattern_2))

So, both the strings have the longest common subsequence of length 5.

In a nutshell, we are simply calculating each task once in the DP method in the DP method. In the recursive method, we might have overlapping subproblems.

In this Dynamic Programming Algorithm, we are using a 2D matrix. There will be two strings given (assume both have length n). Then the space needed in the array is n x n. If the strings are large enough, we will need a memory-optimized version of the DP solution.

The simplified logic that was taken in the code is:

Declare a 2D Array DP[m][n].
Fill the first row and first column of the DP array with 0.
Take i and j for the iteration.
If pattern1[i] equals to pattern2[j], then update DP[i][j] = DP[i-1][j-1] + 1
If pattern1[i] not equals to pattern2[j], then DP[i][j] will be the maximum value between DP[i-1][j] and DP[i][j-1].
Continue until i and j reach m and n.
The last element or DP[m-1][n-1], will contain the length.

Here, it’s addressed as DP[m-1][n-1], because array index begins from 0.

The DP method has lower time complexity; it is O(mn), where m and n are the length of the input string or array.
DP is a faster approach than the recursive one, with the time complexity of O(n*2 m ).
Dynamic programming (DP) is not memory optimized. We used a 2D array that has the length of m*n. So, the space complexity is (m*n).
Recursive method, in a worst-case scenario, the highest memory it will occupy will be m+n, basically the total length of the inputted string.
Today’s modern computer is sufficient to handle this amount of memory.

Источник

How to implement Longest Common Subsequence in Python

This Python tutorial will help you to understand what is Longest Common Subsequence and how Python implements this algorithm. First, we will learn what is longest common subsequence.

Find Longest Common Subsequence in Python

Definition :

Longest Common Subsequence determines the longest sequence which exists in both the given strings. A subsequence or a substring can be formed from a string or a sequence.

The longest common subsequence from the above two strings or two sequences is ‘BCAD’.

Forms the basis for data comparison which will be used in the field of bioinformatics.
Also widely used by revision control systems such as Git.

Implementation of LCS in Python

Source Code: Python

def lcs(str1, str2): a = len(str1) b = len(str2) string_matrix = [[0 for i in range(b+1)] for i in range(a+1)] for i in range(1, a+1): for j in range(1, b+1): if i == 0 or j == 0: string_matrix[i][j] = 0 elif str1[i-1] == str2[j-1]: string_matrix[i][j] = 1 + string_matrix[i-1][j-1] else: string_matrix[i][j] = max(string_matrix[i-1][j], string_matrix[i][j-1]) index = string_matrix[a][b] res = [""] * index i = a j = b while i > 0 and j > 0: if str1[i-1] == str2[j-1]: res[index-1] = str1[i-1] i -= 1 j -= 1 index -= 1 elif string_matrix[i-1][j] > string_matrix[i][j-1]: i -= 1 else: j -= 1 return res if __name__ == '__main__': str1 = "acbaed" str2 = "abcadf" string1 = ''.join(lcs(str1, str2)) print("Length of LCS is:", len(string1),"\nsubsequence is:", string1) str3 = "ABAZDC" str4 = "BACBAD" string2 = ''.join(lcs(str3, str4)) print("Length of LCS is:", len(string2),"\nsubsequence is:", string2)

Case -1 :- Input : str1 = "acbaed" str2 = "abcadf" Output : Length of LCS is: 4 subsequence is: abad

Case -2 :- Input : str3 = "ABAZDC" str4 = "BACBAD" Output : Length of LCS is: 4 subsequence is: ABAD

Источник