## Motivation

String matching is one of the most studied issues in Computer Science. There are three well known algorithms by

It is very unlikely that a developer in this day will be asked to write one let develop a new one. However the reason

for a refresher is that some problems may be modeled as a string matching problem. Besides, this is the actual reason

for the refresher,

they are interesting.

## Classic String Match

The classic problem is described as: Given a string Y and a search string x is there a substring of Y that matches exactly with x. In C syntax a brute force algorithm is shown below

int strSearch(const char *Y, int N, const char *x, int n) // N is the length of string Y // n is the length of string x { for(int i=0; i+n <= N; ++i) { bool found=true; for(j=0; found && j<n; ++j) { found = (Y[i+j] == x[j]); } if (found) return i; } return -1; }

## Permutaion

An interesting variation is to find a substring in Y that is a permutation of the string x. Let’s try test driven development: Given a solution let us write code to verify it. In other words for some k such that 0<= k < N-n, verify that the substring Y[k, k+n) is a a permutation of X. There are n! (! denotes factorial) permutations of the substring Y[k,k+n). Comparing every permutation is not viable. One solution is to reduce both strings to a canonical form that can then be compared. One such form is to sort the two strings which makes comparison possible. Thus to find a permuted substring, just slide a window of size n across Y and check if it is a permuted version of x.

int strPermuteSearch(const char *Y, int N, const char *x, int n) // N is the length of string Y // n is the length of string x { char y[n]; sort(x,x+n); for(int i=0; i+n <= N; ++i) { strncpy(y, Y+i, n); sort(y,y+n); if (strncmp(x,y,n)==0) return i; } return -1; }

The time complexity of this algorithm is O(N*n*Log(n)). The n*Log(n) is because of the sorting withing the loop. This can reduced to O(N*n) by using insertion sort. Since *y* is sorted all that is needed when i is incremented is to remove *Y[i-1]* from y and insert *Y[i+n-1]* which can be done in O(n) while keeping y sorted.

## Combination

This leads to another problem. Is there a substring of Y such that some combination of n characters of the substring is a permutation of x. Again consider a verification algorithm. Given a string Z of length M where M>=n is there some collection n characters which can be permuted to match x. There are C(M,n) combinations. Trying all the combinations is not viable. Consider Z[0]. If Z[0] is present in x, our solution space is now reduced to

C(M-1,n-1). If Z[0] is not present then our solution space is reduced to C(M-1,n). The algorithm terminates when n==0 which means that a result has been found or n>M which means that Z does not contain a substring which can be permuted to x.

bool Verify(const char* Z, int M, char *x, int n) { if (n==0) return true; if (n>M) return false; int i=0; for(; i<n; ++i) if(*Z == x[i]) break; if (i<n) { //swap x[i] with x[n-1] char t = x[i]; x[i]=x[n-1]; x[n-1] =t; return Verify(Z+1, M-1, x, n-1); } else return Verify(Z+1,M-1,x,n); }

Notice that the linear search in Lines 5 and 6 push the time complexity to O(n*M). If the two strings were sorted, then we could reduce it to O(M) as shown below

bool VerifySorted(const char* Z, int M, char *x, int n) { if (n==0) return true; else if (n>M) return false; else if (*Z < *x) return Verify(Z+1, M-1,x,n); else if (*Z == *x) return Verify(Z+1,M-1,x+1,n-1); else // *Z>*x return false; }

This solution can be adapted to checking if B is a subset of A, if A and B are sequences. Now let us restate the problem adding a simple restriction: Find the smallest substring of Y such that there exists n characters in it, which can be permuted

to match x. This can be done by sliding a window of size M where n<=M<=N and calling Verify with that window as shown below

struct position { int start; int length; }; position CombinationSubset(const char* Y, int N, char *x, int n){ position result = {-1,-1}; char Z[N]; sort(x,x+n); for(M=n; M<=N; ++M) { for(i=0; i+M <=N; ++i) { strncpy(Z,Y+i,M); sort(Z,Z+M); if (VerifySorted(Z+i, M, x,n)) { result.start=i; result.length = M; return result; } } } return result; }

Notice that we could make some optimisation by checking that Z[i] and Z[i+M-1] are both in x before calling VerifySorted. As before, `sort(Z,Z+M)`

in lines 11 and 12 can be reduced to an insertion taking O(M) time.

## Distinct characters

Problem:List all sub strings of length n in a string Y such that every character in the substring is unique.

Hint 1: Find the number of duplicate characters in every substring of length n.

Hint 2: use C++ sort and unique and find the difference in length

Problem: List all substrings with exactly one duplicate.

Hint: Modify the previous solution.