<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-9118358517729434270</id><updated>2011-08-02T17:05:06.656-07:00</updated><category term='Haskell'/><title type='text'>Goes Good With Beer</title><subtitle type='html'>A blog about random programming and computer science topics.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://goesgoodwithbeer.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9118358517729434270/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://goesgoodwithbeer.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Tim</name><uri>http://www.blogger.com/profile/16757326329682076264</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>1</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-9118358517729434270.post-3915064417815134574</id><published>2009-07-16T14:31:00.000-07:00</published><updated>2009-07-16T17:59:26.562-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Haskell'/><title type='text'>Hackage Statistics (part 1)</title><content type='html'>I recently downloaded the package manifests from Haskell's &lt;a href="http://hackage.haskell.org/packages/hackage.html"&gt;hackage &lt;/a&gt;(a global repository for random people to post useful Haskell code) database and started performing some interesting analysis on them. My goal was to check out various easy statistics. All packages in this database are nicely indexed with a file describing all the dependencies, version information, etc. That manifest file (for package &lt;span style="font-family:courier new;"&gt;foo&lt;/span&gt;, it's &lt;span style="font-family:courier new;"&gt;foo.cabal&lt;/span&gt;) enables cabal (the package manager system Haskell / hackage uses) to interact with the main hackage database and ghc. However, there is all sorts of interesting information to visualize and toy with. For a sample of what one looks like check the manifest for the &lt;a href="http://hackage.haskell.org/package/tar"&gt;tar &lt;/a&gt;package &lt;a href="http://hackage.haskell.org/packages/archive/tar/0.3.1.0/tar.cabal"&gt;tar.cabal&lt;/a&gt;. (This package allow&lt;br /&gt;&lt;br /&gt;The whole blob of these database headers is stored in a &lt;span style="font-family:courier new;"&gt;.tar.gz&lt;/span&gt; (a tar gzip'd). So first, I &lt;span style="font-family:courier new;"&gt;wget&lt;/span&gt;'d the archive so it was stored locally.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;% wget http://hackage.haskell.org/packages/archive/00-index.tar.gz&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;I could have manually decompressed (with tar) the manifests first, but they are just fat text and I had a more interesting idea: hackage has nice packages for handling &lt;a href="http://hackage.haskell.org/package/tar"&gt;tar &lt;/a&gt;and &lt;a href="http://hackage.haskell.org/package/zlib"&gt;gzip&lt;/a&gt;'d data. I decided to use those instead.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;% cabal install tar&lt;br /&gt;...&lt;br /&gt;% cabal install zlib&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now at the header we use import all these libraries (to illustrate some).&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;import qualified Data.ByteString.Lazy as BS&lt;br /&gt;import qualified Codec.Archive.Tar as Tar&lt;br /&gt;import qualified Codec.Compression.GZip as GZip&lt;br /&gt;&lt;br /&gt;(&gt;|=) ma k = ma &gt;&gt;= return . k&lt;br /&gt;&lt;br /&gt;scanArchive fpgz = do&lt;br /&gt;    withBinaryFile fpgz ReadMode $ \h -&gt; do&lt;br /&gt;    entries &lt;- BS.hGetContents h &gt;|= GZip.decompress &gt;|= Tar.read&lt;br /&gt;    mapEntries ...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The entries value holds a &lt;a href="http://hackage.haskell.org/packages/archive/tar/0.3.1.0/doc/html/Codec-Archive-Tar.html#t%3AEntries"&gt;&lt;span style="font-family:courier new;"&gt;Tar.Entries&lt;/span&gt;&lt;/a&gt;, a linked list of tar entries. I define the &lt;span style="font-family:courier new;"&gt;(&gt;|=) &lt;/span&gt;operator as just a ghetto use of &lt;span style="font-family:courier new;"&gt;&lt;a href="http://haskell.org/ghc/docs/latest/html/libraries/base/Prelude.html#v%3Afmap"&gt;fmap&lt;/a&gt; &lt;/span&gt;(or &lt;a href="http://haskell.org/ghc/docs/latest/html/libraries/base/Control-Monad.html#v%3AliftM"&gt;&lt;span style="font-family:courier new;"&gt;liftM&lt;/span&gt;&lt;/a&gt;) to make the transformations look like more of a pipeline. Now tar entries are basically a &lt;a href="http://haskell.org/ghc/docs/latest/html/libraries/bytestring/Data-ByteString-Lazy.html"&gt;&lt;span style="font-family:courier new;"&gt;ByteString&lt;/span&gt;&lt;/a&gt; of the file and path info. Those byte buffers are the raw text of the manifest files (the .cabal files).&lt;br /&gt;&lt;br /&gt;Now all we need is is a way to turn those byte strings into abstract syntax. At any rate, we left that worry to my ill-named &lt;span style="font-family:courier new;"&gt;mapEntries &lt;/span&gt;function above and we fill in the details here.&lt;br /&gt;&lt;br /&gt;For a bit I considered actually doing some raw work, by parsing and representing the syntax of these cabal files (actually quite a bit of work). But I recalled seeing some packages when perusing the hierarchical libraries. Turns out the packages the cabal tool uses are nice and modular and are even part of ghc's default hierarchical libraries. So it was as easy as:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;import Distribution.PackageDescription&lt;br /&gt;import Distribution.PackageDescription.Parse&lt;br /&gt;import Distribution.Package&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Then all we need is to call &lt;a href="http://haskell.org/ghc/docs/latest/html/libraries/Cabal/Distribution-PackageDescription-Parse.html#v%3AparsePackageDescription"&gt;&lt;span style="font-family:courier new;"&gt;parsePackageDescription &lt;/span&gt;&lt;/a&gt;passing a &lt;span style="font-family:courier new;"&gt;String &lt;/span&gt;of the file as an argument. There are better ways to convert a &lt;span style="font-family:courier new;"&gt;ByteString &lt;/span&gt;to a &lt;span style="font-family:courier new;"&gt;String&lt;/span&gt;, I just assumed it was ASCII and used the unpack function mapping the &lt;span style="font-family:courier new;"&gt;Word8&lt;/span&gt;'s to &lt;span style="font-family:courier new;"&gt;Char&lt;/span&gt;'s via&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;(Data.Char.chr . fromIntegeral)&lt;/span&gt;&lt;br /&gt;(I think one is supposed to import a different &lt;span style="font-family:courier new;"&gt;ByteString &lt;/span&gt;package, but this works fine for me.) Now, &lt;span style="font-family:courier new;"&gt;parsePackageDescription &lt;/span&gt;gives us its super complicated syntax representation as &lt;a href="http://haskell.org/ghc/docs/latest/html/libraries/Cabal/Distribution-PackageDescription.html#t%3AGenericPackageDescription"&gt;GenericPackageDescription &lt;/a&gt;data values. After scanning all the options and depth of this syntax tree, I am really glad I didn't mess with it directly. Next, I group packages in a tree, removing older versions (each hackage component might have several versions, I only want the newest).&lt;br /&gt;&lt;br /&gt;The type of &lt;span style="font-family:courier new;"&gt;scanArchive &lt;/span&gt;is:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;scanArchive :: FilePath -&gt; IO [GenericPackageDescription]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Lazy Evaluation Rant&lt;/span&gt;&lt;br /&gt;Parsing the 12m archive with 14oo+ packages takes &lt;span style="font-family:courier new;"&gt;ghci &lt;/span&gt;around 350m in memory. I have no idea where the bloat is, I am using &lt;a href="http://haskell.org/ghc/docs/latest/html/libraries/containers/Data-Map.html#v%3Aadjust"&gt;Data.Map.adjust&lt;/a&gt; to add things to the update packages.&lt;br /&gt;Even after the parse an culling of duplicate packages, for some reason &lt;span style="font-family:courier new;"&gt;ghci &lt;/span&gt;holds onto all 350m. (Even after I force evaluation of the stupid thing via printing.)&lt;br /&gt;&lt;br /&gt;Forget &lt;span style="font-family:courier new;"&gt;&lt;a href="http://haskell.org/ghc/docs/latest/html/libraries/parallel/Control-Parallel-Strategies.html#v%3Arnf"&gt;Control.Strategies.Parallel.rnf&lt;/a&gt; &lt;/span&gt;too, the big honking &lt;span style="font-family:courier new;"&gt;GenericPackageDescription&lt;/span&gt; syntax tree of data types is not an instance, and I am not going to derive all of them myself. If I force the map holding the modules as I add them to partially reevaluate each iteration, I can get it down to about 190m with &lt;span style="font-family:courier new;"&gt;(&lt;a href="http://haskell.org/ghc/docs/latest/html/libraries/base/Control-Exception.html#8"&gt;Control.Exception.evaluate&lt;/a&gt; (m /= m'))&lt;/span&gt; where &lt;span style="font-family:courier new;"&gt;m&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;m' &lt;/span&gt;are the old and new map. Using (length . show) on the map takes much longer, but prunes the memory use to 150m. Even better, if I use the in-equality approach every 10 packages or so, it takes 200m and completes in less than 20 seconds. I'll take it.&lt;br /&gt;&lt;br /&gt;What makes me furious about this is that is not that I don't have 350m lying around, it is that I cannot easily reason about my ridiculously simple program. That is scary to me and so galling that it quickly drives me int a bloodlust.&lt;br /&gt;&lt;br /&gt;Now I have a list of all the packages. Since I don't want to wait 20 seconds per query of our database, we use the ``global'' value trick.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;db = unsafePerformIO $ scanArchive "00-index.tar.gz"&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now &lt;span style="font-family:courier new;"&gt;db &lt;/span&gt;only gets evaluated once, but holds the entire list of packages.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;The Fruits Of Our Labor&lt;/span&gt;&lt;br /&gt;A simple function to query our ``database'' creates histograms of information. It takes the database, a display function and a function to project with.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;hist :: Eq a&lt;br /&gt;    =&gt; [GenericPackageDescription]&lt;br /&gt;    -&gt; (a -&gt; String)&lt;br /&gt;    -&gt; (GenericPackageDescription -&gt; a)&lt;br /&gt;    -&gt; IO ()&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The above calls a utility function that computes a histogram of &lt;span style="font-family:courier new;"&gt;Eq &lt;/span&gt;instance values and prints the results in sorted order.&lt;br /&gt;&lt;br /&gt;As an example a histogram of the copyrights is:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;*Main&gt; hist db id $ packageDescription .&gt; category&lt;br /&gt; 128  Data&lt;br /&gt; 100  System&lt;br /&gt;  77  Text&lt;br /&gt;  68  Web&lt;br /&gt;  68  Network&lt;br /&gt;  59&lt;br /&gt;  55  Graphics&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The first arguments are self as described above. The projection function&lt;span style="font-family:courier new;"&gt; ((.&gt;) is flip (.)) &lt;/span&gt;selects all the category attributes of the package descriptions. Since this is already a String, we use id, show would add quotes. The listing shows that 128 packages are categorized as being related to data. The second most common category is system related packages.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Stability&lt;/span&gt;&lt;br /&gt;As a second example, we project on package stability. The lowercase allows us to be case insensitive (e.g. some people list Alpha and others alpha).&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;*Main&gt; hist db id $ packageDescription .&gt; stability .&gt; lowercase&lt;br /&gt; 703&lt;br /&gt; 387  experimental&lt;br /&gt;  93  alpha&lt;br /&gt;  76  stable&lt;br /&gt;  70  provisional&lt;br /&gt;  51  beta&lt;br /&gt;  23  unstable&lt;br /&gt;   8  seems to work, passes a few tests&lt;br /&gt;   4  deprecated&lt;br /&gt;   2  alpa&lt;br /&gt;   1  seems to work, but not posix yet&lt;br /&gt;   1  testing&lt;br /&gt;   1  none&lt;br /&gt;   1  experimental, not usable yet&lt;br /&gt;   1  builds&lt;br /&gt;   1  experiemental&lt;br /&gt;   1  unknown&lt;br /&gt;   1  mostly stable&lt;br /&gt;   1  pre-alpha&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(Note: that the 703 at the top &lt;span style="font-weight: bold; font-style: italic;"&gt;is not a total&lt;/span&gt;, 703 entries list the empty string as their stability level.)&lt;br /&gt;&lt;br /&gt;Now, I love hackage. In a couple of hours I cobbled together a really cool script mainly using hackage-based code. However, about, ``all the free code that is already out there for us to use,'' what do we make of its stability? I would say provisional and stable are the good ones that is 140 of like 1400 packages. So caveat emptor and choose carefully.&lt;br /&gt;&lt;br /&gt;That said, if 2/3 of software cost is maintenance; nobody wants to service a bunch of whiny users if they don't have to. It's the first 1/3 that is fun I suppose. I know if I uploaded something to hackage, I wouldn't want to babysit it once I got bored with it. But this is kind of a nice feature of hackage: post the code, if someone wants to use it, let them fix it.&lt;br /&gt;&lt;br /&gt;How many packages have a different author than maintainer? Does this imply a such a change of hands as above?&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;*Main&gt; let auth = packageDescription .&gt; author&lt;br /&gt;*Main&gt; let maint = packageDescription .&gt; maintainer&lt;br /&gt;*Main&gt; length $ filter (\d -&gt; auth d /= maint d) db&lt;br /&gt;1150&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I don't know if I trust this number, but it is kind of cool how quickly I was able to go from ``I wonder?'' to ``Let's find out!''&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Top Authors&lt;/span&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;*Main&gt; hist db id $ auth&lt;br /&gt;     89&lt;br /&gt;&lt;br /&gt;     38  Don Stewart&lt;br /&gt;&lt;br /&gt;     29  Henning Thielemann &lt;...&gt;&lt;br /&gt;&lt;br /&gt;     23  John Goerzen&lt;br /&gt;&lt;br /&gt;     17  Wang, Jinjing&lt;br /&gt;&lt;br /&gt;     16  Conal Elliott&lt;br /&gt;&lt;br /&gt;     14  Rohan Drape&lt;br /&gt;&lt;br /&gt;     14  Andy Gill&lt;br /&gt;&lt;br /&gt;     14  Michael Snoyman &lt;...&gt;&lt;br /&gt;     ...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Top Maintainers&lt;/span&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;*Main&gt; hist db id $ maint&lt;br /&gt;     45&lt;br /&gt;     34  Henning Thielemann &lt;haskell@henning-thielemann.de&gt;&lt;br /&gt;     26  libraries[at]haskell.org&lt;br /&gt;     23  John Goerzen &lt;jgoerzen@complete.org&gt;&lt;br /&gt;     19  Don Stewart &lt;dons@galois.com&gt;&lt;br /&gt;     18  dons[at]galois.com&lt;br /&gt;     17  conal[at]conal.net&lt;br /&gt;     16  Wang, Jinjing &lt;nfjinjing@gmail.com&gt;&lt;br /&gt;     15  rd[at]slavepianos.org&lt;br /&gt;     14  Micha6el Snoyman &lt;michael@snoyman.com&gt;&lt;br /&gt;     13  Lemmih (lemmih[at]gmail.com)&lt;br /&gt;     12  haskelldb-users[at]lists.sourceforge.net&lt;br /&gt;     12  Audrey Tang &lt;audreyt@audreyt.org&gt;&lt;br /&gt;     11  bjorn[at]bringert.net&lt;br /&gt;     10  TextRegexLazy[at]personal.mightyreason.com&lt;br /&gt;     10  Lennart Augustsson&lt;br /&gt;     ...&lt;br /&gt;&lt;/audreyt@audreyt.org&gt;&lt;/michael@snoyman.com&gt;&lt;/nfjinjing@gmail.com&gt;&lt;/dons@galois.com&gt;&lt;/jgoerzen@complete.org&gt;&lt;/haskell@henning-thielemann.de&gt;&lt;/pre&gt;&lt;br /&gt;You can see that Henning and Don are both big contributers although plenty of people put packages in anonymously.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9118358517729434270-3915064417815134574?l=goesgoodwithbeer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://goesgoodwithbeer.blogspot.com/feeds/3915064417815134574/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9118358517729434270&amp;postID=3915064417815134574' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9118358517729434270/posts/default/3915064417815134574'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9118358517729434270/posts/default/3915064417815134574'/><link rel='alternate' type='text/html' href='http://goesgoodwithbeer.blogspot.com/2009/07/hackage-statistics-part-1.html' title='Hackage Statistics (part 1)'/><author><name>Tim</name><uri>http://www.blogger.com/profile/16757326329682076264</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
