Polyglot Dojo #1: Compress String

ยท

9 min read

character mess poster

Challenge

The first challenge we pick to solve is Compress Challange. The description of the challenge says:

Compress a string such that 'AAABCCDDDD' becomes 'A3BC2D4'. Only compress the string if it saves space

To solve this challenge, we must satisfy the following constraints:

  • We can assume the string is ASCII

  • The compression should be case sensitive

  • We can use additional data structures

  • We can assume the string would fit into the memory

They also provide the following test suite to check our solution:

from nose.tools import assert_equal


class TestCompress(object):

    def test_compress(self, func):
        assert_equal(func(None), None)
        assert_equal(func(''), '')
        assert_equal(func('AABBCC'), 'AABBCC')
        assert_equal(func('AAABCCDDDDE'), 'A3BC2D4E')
        assert_equal(func('BAAACCDDDD'), 'BA3C2D4')
        assert_equal(func('AAABAACCDDDD'), 'A3BA2C2D4')
        print('Success: test_compress')


def main():
    test = TestCompress()
    compress_string = CompressString()
    test.test_compress(compress_string.compress)


if __name__ == '__main__':
    main()

Python Solution

The high-level representation of the idea I can think of here is that we want to group continuous alike characters from a string, and then represent their group length as a number. Kind of like the following picture:

groupby visualization

The first approach I tried here was using collection.Counter. Here is an example of how we can use it:

from collections import Counter

count = Counter("AABBCC")

print(count)

Counter({'A': 2, 'B': 2, 'C': 2})

Well, at first sight, it looks like a correct solution, and of course, it can satisfy all of the assertions from the suggested test suite except the last one. The problem with the last test is that the Counter doesn't look for continuity of the characters, so when we pass "AABBAA" to it, look what happens:

from collections import Counter

print(Counter("AABBAA"))

Counter({'A': 4, 'B': 2})

The Counter counts a character all over the string, which is not what we want. Finding out that is not enough; I knew I need to try another approach. To extract these groups, we can iterate through the characters of the string, and check their identity with the previous one. However, it's a bit mechanical for my taste, so I preferred to search for groupby keyword in Python's standard library. And fortunately, I got lucky and found a bit bizarre groupby function.

However, before continuing that, I should add, one exciting thing I learned while working on these challenges is, usefullness of having a local copy of language's documentations. In Linux, I'm using Zeal. For Mac users, I know there is Dash which Zeal uses its document format. Also, for hardcore Emacs users, there is a dash.el which is quite impressive, but I haven't used to it yet.

Anyway, let's get back to our business. I said, groupby was a little bit bizarre to me. Let's look at how it works:

from itertools import groupby

for key, group in groupby("AABBCC"):
    print(f"{key}: {list(group)}")

A: ['A', 'A'] B: ['B', 'B'] C: ['C', 'C']

So it gives us what we want, what is bizarre about it? The function name says it is going to group by something. We haven't passed anything as the by part; I mean something like GROUP BY phrase in SQL. So, I took a second look at the documentation where it says:

The key is a function computing a key value for each element. If not specified or is None, key defaults to an identity function and returns the element unchanged

The identity function. So it uses some function to compare each element's identity, which is what we want. What is the nature of that function? We will discover that in another blog post.

So after finding a single function, that provides us with all we need to satisfy the basic logic of the problem, we can form it as a solution. Here is mine:

$ open repl.it/@shahinism/Compress-String-Python โ–ˆ

All I do here is to basically:

  • iterate through the result of groupby function, which provides me by a pair of key (the character) and a group (an iterable object of continuous repetition of the paired character).

  • Then I turn that group into a list and count() the number of elements there.

  • If the count is bigger than one, I append the count number paired with the character; otherwise, I append the character alone.

Then I checked the original answer provided by challenge collection. They have chosen the mechanical solution (which we will come back to with our Rust experiment), but I've learned about groupby ๐Ÿ˜‰.

The time complexity of string append (as value += "new value" is optimized in CPython and is equal to \(O(1)\). So this is safe to say the overall algorithm for our solution has the time complexity of \(O(N)\).

Clojure Solution

Since the interactive coding challenges repository only supports Python, we first need to port our test suite to Clojure. Here are the same constraints expressed in Clojure:

(deftest compress-test
  (testing "empty, is empty"
    (is (= (compress "") "")))
  (testing "doesn't compress fine strings"
    (is (= (compress "ABC") "ABC"))
    (is (= (compress "AABBCC") "AABBCC")))
  (testing "does compress effectively"
    (is (= (compress "AAABCCDDDDE") "A3BC2D4E"))
    (is (= (compress "BAAACCDDDD") "BA3C2D4"))
    (is (= (compress "AAABAACCDDDD") "A3BA2C2D4"))))

The syntax here is quite straightforward. The lines including (is (=... are defining the main constraints and (testing "..." are grouping related tests with a readable description. The most exciting thing for me here is the readability and simplicity of the code.

Now, we are ready to tackle the real problem. Since we already used some functional techniques while we were trying to solve this problem in Python, wandering inside the real functional land of Lisp, I hoped the underlying logic won't change much.

To explore the standard library of Clojure, I used Clojure Docs website. The first search for group-by, I found a function, that is much different from what I had in mind. But I remembered, there was a function with a more reasonable name I learned about when I was solving Clojure Koans. The functions name was partition-by. It's described as:

Applies f to each value in coll, splitting it each time f returns a new value. Returns a lazy seq of partitions. Returns a stateful transducer when no collection is provided.

So, we need a function f, which would return a different type, when the input character changes. Do you remember how groupby() function, was splitting the input string, to groups of the same continuous characters? Back there, we knew Python is using an "identity" function as the default value for the func parameter.

Interestingly enough, I searched for the identity in Clojure Docs and found it. So our grouping logic, in Clojure would be like this:

(partition-by identity "AABBCC")

Calling this would return a list of character groups like ((\A \A) (\B \B) (\C \C)), which is quite like what we need. So I turned it into a function:

(defn group-chars
  [string]
  (partition-by identity string))

As you know, must of the functional languages, don't like the usual loops! Clojure is not different. However, they provide a more compelling and usually more comfortable to reason alternatives (if you understand how they work, of course). Since the result of our group-char function, is an iterable list of items, and all we want to do with each item, is to encode them to a character + count format, our best alternative for a loop would be a map function. The old map we all know about from JavaScript's famous functional toolbox!

However, to use the map, we need to have our encoding logic, implemented as a function. Since the logic is a bit bigger than what can fit inside an anonymous function, I'm going to define it as a separate function:

(defn create-part
  [group]
  (let [length (count group)
        char (first group)]
    (if (= length 1)
      (str char)
      (str char length))
    ))

All it does is:

  • Takes a group as the input parameter.

  • Uses count to get the length of the group and stores it inside length.

  • considers the first character as the group's identity and stores it in char.

  • And finally, if the length of the group is 1, it just returns the char as a string, otherwise, returns the concatenation of char and length as the result.

One main characteristic of this function, is that it assumes the group contains a unique set of characters. A better approach here would be to check for that explicitly, and probably clojure.spec would be helpful to implement it as design by contract.

Yet, I'm not familiar with the concept or how to handle exceptions safely in Clojure as of yet. Given our software is a small challenge solution, I'm going not to bother myself with this fact, and rely on our pipeline to handle that until I learn more about Clojure (even though it's not a good practice! ๐Ÿ˜‰).

So, with that part in place, we can go forward, and implement our final part of the puzzle, which is, getting the input string, split it into groups, feed it to create the part, and concatenate the resulting map:

(defn compress
  "Compress string."
  [string]
  (let [compressed (join "" (map create-part (group-chars string)))]
    (if (= (count compressed) (count string))
      string
      compressed)))

As you see, I also used a condition to return the original string, if compressing it didn't reduce the size of the string. Putting it all together, we solve the challenge as you can confirm here:

$ open repl.it/@shahinism/Compress-String-Clojure โ–ˆ

Rust Solution

Well, the most challenging part for me in this journey was solving it using Rust, and yet I'm not happy with the results. It's probably because it's the language I know far less about comparing to others. However, let's start to learn ๐Ÿ˜Š. Again, let's port the tests:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_compress() {
        assert_eq!(compress(""), "");
        assert_eq!(compress("ABC"), "ABC");
        assert_eq!(compress("AABBCC"), "AABBCC");
        assert_eq!(compress("AAABCCDDDDE"), "A3BC2D4E");
        assert_eq!(compress("BAAACCDDDD"), "BA3C2D4");
        assert_eq!(compress("AAABAACCDDDD"), "A3BA2C2D4");
        assert_eq!(compress(String::from("AAABAACCDDDD")), "A3BA2C2D4");
    }
}

This is quite usual unit testing syntax, not much interesting. However, as you see, I've added an extra test, to make sure it can compress instances of String object just like string literals.

And here is a working solution to the problem:

$ open repl.it/@shahinism/Compress-String-Rust โ–ˆ

I'm going to describe this solution with a high-level perspective instead of focusing on the details. There are two reasons for this:

  • As I said, I'm not happy with my solution so far. My experience is quite small on Rust, and I believe all those .to_string() and .as_str() should be eliminated somehow. So I plan to refactor this solution gradually, as I learn more about Rust's type system.

  • Don't want my ego or perfectionism, get into the way of my blogging ๐Ÿ˜…. As I said in the introduction of this series, I started it to help me learn more.

With that out of the way, let's see, what is my Rusty solution to this problem. It's quite close to the original Python answer provided in interactive challenges.

  • I have a function create_part, which handles the encoding of a character, based on the number of continuous repetitions.

  • And the primary compress function which loops through all the characters of the string counting the number of their continuity and appends each part to a result string.

  • Finally, I check for the length of the compressed string, and only provide the compressed version as a result, if it's reduced the size of the original string.

Conclusion

The idea of this series is so exciting to me. Well, I always wanted to improve my overall problem-solving skills, and so far, looks like this series is helping me in that regard. As you see, different paradigms of these languages, are forcing me to solve the problems with different approaches (top-down in Rust and bottom-up in Python and Clojure).

Also, the different level of abstractions provided by each language's standard library is helping me dive into different levels of problem's basics gradually.

I don't want to lie about it, I some times get mad when something doesn't make sense, yet going back and re-reading or repeating the process, is improving my overall confidence in coding. No need to say that it's also helping me to learn new abilities of a language I thought I already knew (Python) ๐Ÿ˜‰.

ย