Monday, July 16, 2012

Grouping Permutation Strings

This is a straightforward problem.

Given N strings of equal length (K) on an alphabet, you have to partition them into sets, such that every set contains strings that are permutations of each other. Of course, a string can be added to only one set and if 2 strings get added to 2 different sets, they should not be permutations of each other.

This can be easily done in O(N*K) time by using a hashmap. Each string will be inspected once, and each character of the string will be inspected only once.

We will use a hashmap, that maps the permutation key to a set of strings. We iterate over the N strings to find the permutation key of each string. In each iteration, you need an array to keep counts of characters that appear in the string. This array size will be the same as the size of the alphabet. So it will be able to keep counts of all the characters in the alphabet, that may or may not appear in a string. Then iterate over the characters in the string. Whenever you see a character, increment the corresponding count in the array. After the string is inspected, the array of counts can be used as a key in the map. If 2 strings have the same counts for all the characters, then they are permutations of each other.

Here is the Java code.

// assume all lowercase strings
int A = 26 ; // size of alphabet is 26
int N = ... ; // number of strings
String [] allStrings = new String[N] ;

// code to populate the array with equal size strings
// ...
int K = allStrings[0].length ;
// all strings are of the same length

// we are using a map with a String as key
// and a list of String as a value
// By using a List (instead of Set)
// we preserve the original order of strings
// This may or may not be important depending on the problem
Map permutations_map = new HashMap() ;

// Now iterate over the array and find permutation key for each string
for (String s : allStrings) {
    // counting array length is same as alphabet length
    int [] char_counts = new int[A] ;
    for (int i = 0 ; i < K ; i++) {
 char c = s.charAt(i) ;
 int index = ((int) c) - ((int) 'a') ; // char math !
 char_counts[index] ++ ; // increment the count
    } // 

    // now the string representation of this array can be used as a key
    String bucket_id = Arrays.toString(char_counts) ;

    // add the string to the list in map
    // create a new list if needed
    if (! permutations_map.contains(bucket_id) ) {
     permutations_map.put (bucket_id, new ArrayList()) ;
    } // 

    List str_list = permutations_map.get (bucket_id) ;
    str_list.add (s) ;

} // 

The caveat : This is dependent on character math. It's no big deal, but if it is a problem for the alphabet, just use another map, tp map the characters into unique numbers that can be used for the array index. Simple.

NOTE : For small strings on very large alphabets, an alternative scheme will be better. But the idea given here is reasonable to find permutations within numbers, or ascii strings.

I used this to solve another problem. First I had to generate a list of primes, for which I used the "Sieve of Eratosthenes". Then, I treated each prime as a string of digits, and use the above algorithm to bucket them. That was only part of the solution. I will cover the remaining parts in later posts.

Here is the Java code for grouping the primes. Note that all the numbers are assumed to have the same number of digits. In particular, they were all 4 digit primes.

public static final long MaxNumber = 9999 ;
public static final int NumberOfDigits =
    (int) (Math.ceil(Math.log10(MaxNumber))) ;
public static final long Floor =
    (long) (Math.pow(10,NumberOfDigits-1)) ; // 1000
public static final long Ceiling =
    (long) (Math.pow(10,NumberOfDigits)) ; // 10000

public static String getBucketId (long num) {
    int digits_count [] = new int[10] ;
    int pow10 = Floor ; // start dividing by 1000
    while (pow10 != 1) {
 long digit = num / pow10 ; // left most digit
 digits_count[digit] ++ ; // increment the count
 num -= digit*pow10 ; // remove the left most digit
     pow10 /= 10 ; // for next iteration
    } // 
    digits_count[num] ++ ; // last digit remains
    return Arrays.toString(digits_count) ;
} // 

public static void insertIntoBucket
                  (String bucket_id, long num) {
    if ( ! BucketedPrimes.containsKey (bucket_id) ) {
     BucketedPrimes.put (hash_value, new ArrayList()) ;
    } // 
    List permutation_list =
                      BucketedPrimes.get (bucket_id) ;
    permutation_list.add (num) ;
} // 

public static void createBuckets
                    (int num_digits, List all_primes) {
    for (Long p : all_primes) {
 long prime = p.longValue() ;
 String hash_val = getBucketId (prime) ;
 insertIntoBucket (hash_val, prime) ;
    } // 
}
This creates the BucketedPrimaes map given a list of numbers, in this case the numbers happen to be primes. Every value in the map is a list of 1 or more numbers, that are digit permutations of each other.

2 comments:

  1. Just curious. Would "hashing" the string to a numeric value and use it as a key would be quicker? Plus inserting the string to the hash map requires memory/string operations. Would it be faster to insert "index" of such strings, rather the strings itself? (This assumes the set of strings does not change).

    ReplyDelete
  2. If I understood you correctly, it's correct to say that inserting the "index" of the string is cheaper than doing memory/string operation. That's how most Java objects are added to collections. Only the reference is added, not the actual contents.

    I am not sure I understood the first question correctly. Whatever hashing mechanism is used to hash the string to a numeric value should result in the same hashcode for strings that are permutations of each other. That's what I am essentially doing. The bucket id is based on count of every character in the string.

    ReplyDelete

Related Posts Plugin for WordPress, Blogger...