Thursday, January 13, 2011

Lexicographically Lowest Possible String

The qualification round for Facebook Hacker Cup happened last weekend. I was initially tricked by one of the problems. I thought I had solved it - but the next day while browsing some blogs about the problems,  my wife discovered a mistake I had made. Fortunately, I had solved the other problem correctly to qualify for the next round.

The problem appears simple enough. You have been given a set of string and you have to find a "lexicographically lowest possible concatenation" of these strings. An example makes it perfectly clear. Assume that the only characters in the strings are lowercase characters from 'a' to 'z'.
If the input is abc xyz pqr . The correct output should be abcpqrxyz .
Of all the possible ways in which you can concatenate the strings, this will be the first one if you arrange them in ascending order. Conceptually, this is equivalent to saying, find all the N! ways in which the strings can be combined, sort them in ascending order and output the first one.

That brute force approach is horribly inefficient. So the quick solution that comes to mind is, just sort all the strings in ascending order, and then iterate over them to concatenate them one after another. That will work with our example. That's what I did. So what's the issue ?

Problem with natural sort and concat 


Consider the following example case where the input is just these 2 strings[xa xab]
Since "xa" < "xab" a simple sort and append method will produce xaxab
This is not correct, as the other possible concatenation is the correct answer.
Because xabxa < xaxab

And that's the problem.
For any 2 strings S1 and S2, S1 < S2 does NOT imply (S1.S2) < (S2.S1).
This happens only if one string is a starting substring of another. There is NO problem with natural sort and concat in following cases.
For [abc pqrabc] the answer is still in natural order = abcpqrabc which is < pqrabcabc
For [abcd abczpqr]  the answer is still in natural order = abcdabczpqr which is < abczpqrabcd
Even when we have a starting substring, the simple natural sort and concat can be correct.
For [xa xaz] answer is still in natural order = xaxaz.

Correct solution

My new solution is still a sort and append. But when the strings are being sorted, simple comparison "if S1 < S2" should not be used. The correct comparison is "if (S1.S2) < (S2.S1)". So the strings should be placed in ascending order using the comparison of their concatenations.
Generally speaking, before concatenating the strings should be ordered
[S0 S1 S2 ... Sn]
such that
for any k, the concatenation (S[k-1].S[k]) < (S[k].S[k-1])
Then the entire concatenation would be lowest.

Informal Proof

That sounds intuitively correct. Still, since I was tricked once, I just want to make sure that this approach is indeed correct.

Let's say while sorting we decide to order the strings as [S1 S2] because (S1.S2) < (S2.S1). Then we check string S3, and since (S2.S3) < (S3.S2) we decide to put it last as [S1 S2 S3]. This will give us S1.S2.S3 as our answer. It's correct based on comparison of [S1 S2] and [S2 S3]. But what about the comparison of [S1 S3] ? How do we know that S3 should not be put ahead of S1 giving us the answer S3.S1.S2 ?

To prove that above ordring is indeed correct, we need to make sure that our comparison is transitively correct.
Prove that, if (S1.S2) < (S2.S1) and (S2.S3) < (S3.S2), then it implies (S1.S2) < (S3.S1)
The only interesting case in our ordered list [S0 S1 S2 ... Sn] is when the natural sorting order is violated for all 3 strings. Any ordering [S1 S2 S3] generated by our algorithm is trivially correct if S1 < S2. There is no point comparing S3 to S1 here. In fact we can simply combine such string pairs into one bigger string, and keep reducing the number of strings till we have a violation of the natural order in at least 3 successive strings.
Consider the natual sorting order [xa xab xabc]. But for us, the correct order is [xabc xab xa]. This is the only interesting case. We want to make sure our derived answer xabcxabxa is really the lowest.
Another example.
Consider the natual sorting order [xa xab xabz]. But for us, the correct order is [xab xabz xa]. This is NOT intersting as the order for first 2 strings [xab xabz] is just natural ordering. We can reduce this to [xabxabz xa] and now our solution is trivially correct.
To repeat, the natural sorting order of [S1 S2] will be violated only if entire S1 is a substring of S2, and (S2.S1) < (S1.S2).
So let's consider the case generally and name the substrings as X, Y and Z.
S1 = Z.Y.X
S2 = Z.Y
S3 = Z

Clearly S3 < S2 < S1 using natural order, but solution using our comparison came out to be in exactly the opposite order as [S1 S2 S3]. Let's prove it correct.

Also our algorithm decided that S2.S3 < S3.S2 - meaning Z.Y.Z < Z.Z.Y Removing identical substring from the beginignof both, gives us Y.Z < Z.Y and this implies Y < Z.

Now S3.S1 will be Z.Z.Y.X and compare this to S1.S2 which is Z.Y.X.Z.Y. Since we know that Z > Y, it follows that Z.Z > Z.Y. Therefore (S3.S1) is indeed > (S1.S2) and hence S3 should not be placed ahead of S1.
Maybe I made it too complicated. But I could not figure out a simpler proof. I can also prove it a bit more formally using induction, but I don't think it's simpler. In any case, the solution is really very simple to implement, and that's what matters !

No comments:

Post a Comment

Related Posts Plugin for WordPress, Blogger...