memory usage in haskell
http://coder.cl/2012/03/memory-usage-in-haskell/
Wed, 21 Mar 2012 21:17:07 +0000, by Daniel Molina Wegener

As you may know, I am a Haskell programmer. I have started a FOSS project called Apache Log Reviser, or logrev for short, and I have been playing with code optimizations in Haskell on that project. So, I want to share some experience and considerations regarding its memory usage, mainly regarding lazy evaluation and strict bindings. Lazy evaluation should be used every time you need to read or write data with delayed behaviour, for example when reading large lists or even lazy lists, while strict bindings should be used every time you need data placed in memory for immediate reading or writing.

I have replaced the main file processing loop, the one that collects data from the log lines, forcing lazy pattern matches and strict bindings (with seq) where each is required. The resulting code after the optimization is as follows.


-- the whole lines argument and each tail are matched lazily (~), while
-- every intermediate result is forced with seq before the next iteration
foldLogLines :: LogRevStatsMap
                -> LogRevOptions
                -> [String]
                -> LogRevStatsMap
foldLogLines ms _ [] = ms
foldLogLines ms o ~ls = fll ms ls
                        where fll rs [] = rs
                              fll rs (x : ~xs) = let
                                lm = parseLogLine x
                                ns = lm `seq` o `seq` procLogMachine o rs lm
                                in seq ns $ fll ns xs

Again I am using seq to force strict bindings, but now I have made the lines argument pattern entirely lazy, including the patterns inside the closure, which brings even more laziness to the folding function. The original optimization was just an implementation of a left fold; now it uses the same left fold through a closure that reads the given list of strings, the log lines, lazily. As you remember, the first chart obtained showed about ~33MB of memory usage, as follows.

ApacheLogRev with Initial Optimization

λλλ
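As a side note on the lazy patterns used above, prefixing a pattern with ~ makes the match irrefutable, so the argument is not inspected until one of its bound parts is actually demanded. The following is a generic illustration, not logrev code:

-- a strict pattern forces the list constructor as soon as the function is
-- applied, while a lazy (~) pattern does not
strictOne :: [a] -> Int
strictOne (_ : _) = 1
strictOne []      = 0

lazyOne :: [a] -> Int
lazyOne ~(_ : _) = 1   -- nothing is demanded, so the list is never inspected

Here strictOne undefined throws an exception because the list spine is forced, while lazyOne undefined simply returns 1.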

With the second optimization, the resulting memory usage has been reduced to ~13MB. Even though the readFile and lines functions are lazy, we must implement our own functions with the same lazy behaviour, and force bindings with seq only where we need immediate access to the data. Also, you should know that let and where bindings are non-strict: each binding allocates a thunk on the heap as soon as it is bound, even though its value is only computed on demand, so unforced bindings can pile up. The resulting profiling chart of the second optimization is as follows.

ApacheLogRev with Lazy Reading
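To make the strict versus non-strict distinction concrete, here is a generic sketch, not logrev code, of the thunk buildup caused by a non-strict accumulator binding and of how seq avoids it, which is exactly the role seq plays inside foldLogLines:

-- generic illustration, not logrev code: the accumulator is bound
-- non-strictly, so the heap fills with a chain of (((0 + x1) + x2) + ...) thunks
sumLazy :: [Int] -> Int
sumLazy = go 0
  where go acc []       = acc
        go acc (x : xs) = go (acc + x) xs

-- forcing the accumulator with seq before each recursive call keeps every
-- intermediate sum evaluated, so the heap stays flat
sumStrict :: [Int] -> Int
sumStrict = go 0
  where go acc []       = acc
        go acc (x : xs) = let acc' = acc + x
                          in acc' `seq` go acc' xs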

Finally, I have concluded that we must not underestimate the power of lazy evaluation and Haskell patterns. Once they are used properly, you get very good performance and optimal memory usage. Enjoy Haskell as a programming language.


a good start on logrev
http://coder.cl/2012/03/a-good-start-on-logrev/
Sat, 17 Mar 2012 12:43:36 +0000, by Daniel Molina Wegener

As you know, ApacheLogRev, or logrev for short, is my first Haskell FOSS project (I have other FOSS projects written in C and Python), and I want to share some experiences from developing the features that it should have in the future; it is still under development and not finished yet. Basically, it uses Parsec, a combinator-based parser library. One of the best advantages of this parser builder is that it generates parsers at run time, rather than producing fixed parsers from BNF grammars that are limited to static context-free processing.

Parsec allows you to create a parser dynamically at run time, rather than compiling a parser and grammar specification at compile time as is done with flex and bison. Since it is a combinatorial construct, it seems you could even write a compiler that is generated at run time, instead of having a compiler built as a static component that cannot be modified while running (with some exceptions that flex and bison do not cover, such as any C++ compiler, which understands a small subset of syntax-sugar features like templates and operators).
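As a small illustration of that run-time flexibility (this is hypothetical code, not the actual logrev grammar), a Parsec parser can be assembled from a plain list of separator characters that only exists at run time, something a grammar fixed at compile time cannot do:

import Text.Parsec
import Text.Parsec.String (Parser)

-- hypothetical example: build a field parser from run-time data; each
-- field is everything up to the next separator in the given list
fieldsUpTo :: [Char] -> Parser [String]
fieldsUpTo = mapM field
  where
    field sep = do
      value <- many (noneOf [sep])   -- consume everything before the separator
      _     <- char sep              -- then the separator itself
      return value

For example, parse (fieldsUpTo " [") "" "127.0.0.1 - - [21/Mar/2012 ..." yields Right ["127.0.0.1", "- - "].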

I have created a DSL called Log Reviser Specification, or LSR for short, which allows you to write log processing statements that define variables and processing functions and to specify reports based on the collected data. You can see the LSR parser and a sample LSR file in the github repository.

λ λ λ

The problem of memory leaks (which is present in all languages, without exception) came to my mind once the program was not able to parse a huge file without crashing. The fact that every language can suffer from memory leaks is as real as the fact that you cannot abuse recursion in procedural and object-oriented languages. So, I started profiling the application. In the initial revision, the LogLine data type was being instantiated many times without being released; many instances were kept in memory due to the lazy evaluation behaviour that is natural in Haskell. Tracing its memory usage, I got the following chart.

ApacheLogRev with Memory Leaks

The program execution was keeping a peak of LogLine instances in memory, never released, due to Haskell's lazy evaluation. So, now I am using seq and deepseq to reduce the amount of lazy evaluation and force those values.


-- a shallow NFData instance: rnf only evaluates the LogLine to weak head
-- normal form
instance NFData LogLine where
  rnf a = a `seq` ()

-- parse one log line, forcing the parsed value before wrapping it in Just
parseLogLine :: String -> Maybe LogLine
parseLogLine s = let
  r = parse logLine "[Invalid]" s
  in case r of
          Left  _   -> Nothing
          Right itm -> itm `deepseq` Just itm

-- strict left fold over the log lines: every parsed line and every new
-- state is forced with seq before the recursive call, so no thunk chain
-- referencing old lines is kept alive
foldLogLines :: [LogRevStatsAction]
                -> LogRevOptions
                -> [String]
                -> [LogRevStatsAction]
foldLogLines [] _ [] = []
foldLogLines ms _ [] = ms
foldLogLines ms o (x : xs) = let
  lm :: Maybe LogLine
  ns :: [LogRevStatsAction]
  lm = x `seq` parseLogLine x
  ns = lm `seq` o `seq` procLogMachine o ms lm
  in foldLogLines ns o xs

I have replaced the initial reading loop, which was using fmap and hGetLine, with the lazy hGetContents function for reading from the file handle, and the strict foldLogLines function for processing the results. So, for file reading I am currently using lazy reads, but strict evaluation for line processing, which allows processing large files with very good results, thanks to the strict left fold over the lazily read file. We can appreciate the difference in the following chart, and a sketch of how the two pieces fit together appears right after it.

ApacheLogRev without Memory Leaks
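Here is a minimal sketch of that wiring; processLog is an illustrative name and the starting list of actions is passed in explicitly, while LogRevOptions, LogRevStatsAction and foldLogLines are the ones shown above, so this is only an approximation of the real logrev driver:

import System.IO (IOMode (ReadMode), hGetContents, openFile)

-- minimal sketch, not the exact logrev source: the file is read lazily
-- with hGetContents, while foldLogLines forces each parsed line, so the
-- whole log never has to sit in memory at once
processLog :: LogRevOptions -> [LogRevStatsAction] -> FilePath -> IO [LogRevStatsAction]
processLog opts actions path = do
  handle   <- openFile path ReadMode
  contents <- hGetContents handle          -- lines are pulled on demand
  return (foldLogLines actions opts (lines contents))

The result is produced once the fold reaches the end of the file, so the strictness lives in foldLogLines while the laziness stays in hGetContents.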

λ λ λ

So, the garbage collector can now really do its work: no references to instances linked to past results are kept in memory, thanks also to the tail call handling in GHC, and we get an initial peak of memory usage that is reduced as the execution progresses. Also, if you have hit the Java OutOfMemoryError, the .NET OutOfMemoryException, the «PHP Allowed Memory Size Exhausted» fatal error, or similar errors in your favorite language, and your solution is to increase the memory limits, you are doing it wrong. When you have memory leaks in your code you must fix the leaks rather than increase the memory limits; reaching for bigger limits just tells me that you have no idea about memory management, even though many languages now use garbage collectors.


the logrev design
http://coder.cl/2012/03/the-logrev-design/
Sun, 04 Mar 2012 23:38:09 +0000, by Daniel Molina Wegener

On March 2nd, this past Friday, I released the initial code for a new FOSS project. The current code is hosted at github. LogRev is a log reviser tool: it extracts statistics from, currently, Apache access logs. The initial design only supports a few grouping queries, but the way it is coded will allow some interesting features that I will explain in this article. I hope that you will enjoy the design of this tool.

Well, the first idea is to have a dynamic parser, a configurable tokenizer and information extractor that will let you run queries against any application log. Currently it uses Parsec as a combinator parser fixed to one kind of log entry (Apache access logs, as explained in the previous paragraph), and due to the nature of a combinator parser, it will allow the creation of dynamically assembled tokenizers, enabling the parsing of various kinds of entries. This is why this project was implemented in Haskell rather than another language, even though Parsec has been ported to several languages, including C++.

So, we have a dynamic parser that will be configured from a DSL, whose specification I am very close to finishing, followed by an action stack. The action stack represents the dynamic placement of sequentially executed combinators that extract the statistics you want, with both predefined data collector combinators and pluggable, modular data collectors. The data collectors will be specified in the DSL for this tool, which I will probably call LSR, or Log Reviser Specification.
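To make the action stack idea concrete, here is a rough sketch of what one entry could look like; the type and field names are illustrative guesses on my part, not the actual logrev types:

-- hypothetical sketch of an action stack entry: each action folds parsed
-- log entries into its own statistics and knows how to render a report;
-- the line type is left abstract here
data LogRevAction line stats = LogRevAction
  { actionName   :: String                  -- report name, e.g. "Status"
  , actionInit   :: stats                   -- empty statistics
  , actionStep   :: stats -> line -> stats  -- data collector combinator
  , actionReport :: stats -> String         -- plain text output
  }

-- the action stack itself is just a sequence of such actions, run in
-- order over every parsed log entry
type LogRevActionStack line stats = [LogRevAction line stats]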

Finally, there is the output method, which currently supports graphic charts and plain text as output. In the future it will support other kinds of output, also specified in LSR.

Simple LogRev Design

Because server logs can be really huge and we often need to process large amounts of data, I decided to use Haskell as the main language: it supports combinators very well thanks to its type system and abstractions, and GHC compiles it to fast native code, making the program quick enough to process 5,000 log lines in about one second while generating the plain text output for two reports, the status report and the country report.

08:00 [dmw@www:3 logrev-sample]$ wc -l main.log 
5000 main.log
08:00 [dmw@www:3 logrev-sample]$ logrev --input=./main.log --output=report
Processing: ./main.log

Status:
       200:       4834    9426271      96.68      96.10
       206:          3     107131       0.06       1.09
       301:          4        733       0.08       0.01
       302:         18       7568       0.36       0.08
       403:         68      11391       1.36       0.12
       404:         73     252012       1.46       2.57

Country:
       AUS:          5      32936       0.10       0.34
       CHL:       4516    6933671      90.32      70.69
       CHN:         90     910020       1.80       9.28
       DEU:          4      28591       0.08       0.29
       ESP:        203     174777       4.06       1.78
       RUS:         45     477554       0.90       4.87
       UKR:          4        518       0.08       0.01
       USA:        133    1217649       2.66      12.41


real    0m1.104s
user    0m0.964s
sys     0m0.092s
08:01 [dmw@www:3 logrev-sample]$ ls -l *.png
-rw-rw-r-- 1 dmw dmw 13085 2012-03-05 08:01 report_country.png
-rw-rw-r-- 1 dmw dmw 12031 2012-03-05 08:01 report_status.png

Also, the graphic chart output is very nice, thanks to the Haskell Chart package from Hackage. Here is some sample output.

Country Sample Chart

Status Sample Chart
