July 16, 2009

Hackage Statistics (part 1)

I recently downloaded the package manifests from Haskell's hackage (a global repository for random people to post useful Haskell code) database and started performing some interesting analysis on them. My goal was to check out various easy statistics. All packages in this database are nicely indexed with a file describing all the dependencies, version information, etc. That manifest file (for package foo, it's foo.cabal) enables cabal (the package manager system Haskell / hackage uses) to interact with the main hackage database and ghc. However, there is all sorts of interesting information to visualize and toy with. For a sample of what one looks like check the manifest for the tar package tar.cabal. (This package allow

The whole blob of these database headers is stored in a .tar.gz (a tar gzip'd). So first, I wget'd the archive so it was stored locally.

% wget http://hackage.haskell.org/packages/archive/00-index.tar.gz


I could have manually decompressed (with tar) the manifests first, but they are just fat text and I had a more interesting idea: hackage has nice packages for handling tar and gzip'd data. I decided to use those instead.

% cabal install tar
...
% cabal install zlib
...

Now at the header we use import all these libraries (to illustrate some).

import qualified Data.ByteString.Lazy as BS
import qualified Codec.Archive.Tar as Tar
import qualified Codec.Compression.GZip as GZip

(>|=) ma k = ma >>= return . k

scanArchive fpgz = do
withBinaryFile fpgz ReadMode $ \h -> do
entries <- BS.hGetContents h >|= GZip.decompress >|= Tar.read
mapEntries ...

The entries value holds a Tar.Entries, a linked list of tar entries. I define the (>|=) operator as just a ghetto use of fmap (or liftM) to make the transformations look like more of a pipeline. Now tar entries are basically a ByteString of the file and path info. Those byte buffers are the raw text of the manifest files (the .cabal files).

Now all we need is is a way to turn those byte strings into abstract syntax. At any rate, we left that worry to my ill-named mapEntries function above and we fill in the details here.

For a bit I considered actually doing some raw work, by parsing and representing the syntax of these cabal files (actually quite a bit of work). But I recalled seeing some packages when perusing the hierarchical libraries. Turns out the packages the cabal tool uses are nice and modular and are even part of ghc's default hierarchical libraries. So it was as easy as:

import Distribution.PackageDescription
import Distribution.PackageDescription.Parse
import Distribution.Package


Then all we need is to call parsePackageDescription passing a String of the file as an argument. There are better ways to convert a ByteString to a String, I just assumed it was ASCII and used the unpack function mapping the Word8's to Char's via
(Data.Char.chr . fromIntegeral)
(I think one is supposed to import a different ByteString package, but this works fine for me.) Now, parsePackageDescription gives us its super complicated syntax representation as GenericPackageDescription data values. After scanning all the options and depth of this syntax tree, I am really glad I didn't mess with it directly. Next, I group packages in a tree, removing older versions (each hackage component might have several versions, I only want the newest).

The type of scanArchive is:

scanArchive :: FilePath -> IO [GenericPackageDescription]


Lazy Evaluation Rant
Parsing the 12m archive with 14oo+ packages takes ghci around 350m in memory. I have no idea where the bloat is, I am using Data.Map.adjust to add things to the update packages.
Even after the parse an culling of duplicate packages, for some reason ghci holds onto all 350m. (Even after I force evaluation of the stupid thing via printing.)

Forget Control.Strategies.Parallel.rnf too, the big honking GenericPackageDescription syntax tree of data types is not an instance, and I am not going to derive all of them myself. If I force the map holding the modules as I add them to partially reevaluate each iteration, I can get it down to about 190m with (Control.Exception.evaluate (m /= m')) where m and m' are the old and new map. Using (length . show) on the map takes much longer, but prunes the memory use to 150m. Even better, if I use the in-equality approach every 10 packages or so, it takes 200m and completes in less than 20 seconds. I'll take it.

What makes me furious about this is that is not that I don't have 350m lying around, it is that I cannot easily reason about my ridiculously simple program. That is scary to me and so galling that it quickly drives me int a bloodlust.

Now I have a list of all the packages. Since I don't want to wait 20 seconds per query of our database, we use the ``global'' value trick.

db = unsafePerformIO $ scanArchive "00-index.tar.gz"

Now db only gets evaluated once, but holds the entire list of packages.

The Fruits Of Our Labor
A simple function to query our ``database'' creates histograms of information. It takes the database, a display function and a function to project with.


hist :: Eq a
=> [GenericPackageDescription]
-> (a -> String)
-> (GenericPackageDescription -> a)
-> IO ()


The above calls a utility function that computes a histogram of Eq instance values and prints the results in sorted order.

As an example a histogram of the copyrights is:

*Main> hist db id $ packageDescription .> category
128 Data
100 System
77 Text
68 Web
68 Network
59
55 Graphics
...

The first arguments are self as described above. The projection function ((.>) is flip (.)) selects all the category attributes of the package descriptions. Since this is already a String, we use id, show would add quotes. The listing shows that 128 packages are categorized as being related to data. The second most common category is system related packages.

Stability
As a second example, we project on package stability. The lowercase allows us to be case insensitive (e.g. some people list Alpha and others alpha).

*Main> hist db id $ packageDescription .> stability .> lowercase
703
387 experimental
93 alpha
76 stable
70 provisional
51 beta
23 unstable
8 seems to work, passes a few tests
4 deprecated
2 alpa
1 seems to work, but not posix yet
1 testing
1 none
1 experimental, not usable yet
1 builds
1 experiemental
1 unknown
1 mostly stable
1 pre-alpha

(Note: that the 703 at the top is not a total, 703 entries list the empty string as their stability level.)

Now, I love hackage. In a couple of hours I cobbled together a really cool script mainly using hackage-based code. However, about, ``all the free code that is already out there for us to use,'' what do we make of its stability? I would say provisional and stable are the good ones that is 140 of like 1400 packages. So caveat emptor and choose carefully.

That said, if 2/3 of software cost is maintenance; nobody wants to service a bunch of whiny users if they don't have to. It's the first 1/3 that is fun I suppose. I know if I uploaded something to hackage, I wouldn't want to babysit it once I got bored with it. But this is kind of a nice feature of hackage: post the code, if someone wants to use it, let them fix it.

How many packages have a different author than maintainer? Does this imply a such a change of hands as above?

*Main> let auth = packageDescription .> author
*Main> let maint = packageDescription .> maintainer
*Main> length $ filter (\d -> auth d /= maint d) db
1150

I don't know if I trust this number, but it is kind of cool how quickly I was able to go from ``I wonder?'' to ``Let's find out!''


Top Authors

*Main> hist db id $ auth
89

38 Don Stewart

29 Henning Thielemann <...>

23 John Goerzen

17 Wang, Jinjing

16 Conal Elliott

14 Rohan Drape

14 Andy Gill

14 Michael Snoyman <...>
...




Top Maintainers

*Main> hist db id $ maint
45
34 Henning Thielemann
26 libraries[at]haskell.org
23 John Goerzen
19 Don Stewart
18 dons[at]galois.com
17 conal[at]conal.net
16 Wang, Jinjing
15 rd[at]slavepianos.org
14 Micha6el Snoyman
13 Lemmih (lemmih[at]gmail.com)
12 haskelldb-users[at]lists.sourceforge.net
12 Audrey Tang
11 bjorn[at]bringert.net
10 TextRegexLazy[at]personal.mightyreason.com
10 Lennart Augustsson
...

You can see that Henning and Don are both big contributers although plenty of people put packages in anonymously.