web developer & system programmer

coder . cl

ramblings and thoughts on programming...

base conversion tricks

published: 04-10-2012 / updated: 04-10-2012
posted in: c, development, haskell, programming, tips
by Daniel Molina Wegener

Base conversion can be used to compress number representations, and can be used to hide the real number representation. The algorithm is well known, anyone can program that algorithm, and it is not so hard to handle. I was playing with Haskell writing that algorithm to check how effective is that algorithm to create URL shorteners. My idea is quite simple, you have an URL database ID, which is an integer number, indexed as many entries on the database, but once the database starts growing, the number can reach a length that is not so easy to remember. For example the ID 999 999 999 999 is not so easy to remember.

The basic algorithm is to have an ASCII alphabet as representation of positional numeric system symbols, and then to apply the standard base conversion algorithm using that alphabet to get the string representation of the number, so encoding and decoding algorithms in Haskell can be written as follows.


alpha :: String
alpha = "0123456789"
        ++ "abcdefghijklmnopqrstuvwxyz"
        ++ "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        ++ "._"

alphaLength :: Integer
alphaLength = fromIntegral $ length alpha

aenc :: Integer -> String
aenc i = let fin = fromIntegral alphaLength
             encn :: Integer -> String -> String
             encn x y
                 | x == 0 = reverse y
                 | otherwise = let (n, m) = x `divMod` fin
                                   r = alpha !! fromIntegral m
                               in encn n $ y ++ [r]
         in encn i []

adec :: String -> Integer
adec i = let fin = alphaLength
             lns = fromIntegral $ length i
             decn :: String -> Integer -> Integer -> Integer
             decn (x:xs) y z = let
                 pw, el, nm :: Integer
                 pw = lns - (fromIntegral z + 1)
                 el = fromIntegral
                      $ fromJust
                      $ elemIndex x alpha
                 nm = (el * (fin ^ pw)) + fromIntegral y
                 in decn xs nm (z + 1)
             decn [] y _ = y
         in decn i 0 0

The encoding and decoding algorithms above are using a numeric base of 64, and the alphabet is case-sensitive. So, for the number 12 345 we have the string representation on numeric base 64 as “30V”. The number 999 999 999 999 has the representation as “ezkFg__” and 999 999 999 999 999 has the representation as “3znWAND__”. The trick on the base conversion on numbers with that size is made thanks to the data types used on the base conversion. The standard integer and long data types in C have up to 64 bits and probably 128 bits in some systems. Still they have a limit. With the Integer data type in Haskell there is almost no limit because they are “arbitrary-precision integers”. To use that kind of numbers on C, you must install a library like apcalc.

Also you can hide the number representation if you keep your alphabet private and you modify the positional symbols, where you can have a very different number representation of any number using this kind of base conversion, depending on your alphabet. Another option is to use bit representations, so you can extend the base alphabet to other ASCII symbols and have an extended base conversion, for example using 128 printable characters you can have a base 128 encoded number.

This encoding is very similar to Base N encoding, but the algorithm is quite different. It is being described on the RFC 4648. I have implemented that algorithm but any alphabet and any base encoding on the Caffeine Project — you can use base 128 if you want — and the difference with other implementations is the fact that it can handle streaming data, so you can create base encoded streams, and do not requires fixed size buffers. The standard functions can encode fixed size buffers, and encode_stream and decode_stream can encode and decode streaming buffers. You can see the source code here.