Thursday, July 24, 2008

GSoC Hoogle: Week 9

I'm off camping for the next weekend in a couple of hours, so this is my early weekly summary. From next week, for a week and a half, I'll actually have an SSH connection so expect to see 200+ patches flow into the Hoogle repo in a few days.

This week I've been rewriting the type search. I spent 3 days writing code, type checking it, but not actually having enough written to run it. Late last night I finished the code, and this morning I debugged it. Amazingly (although actually quite commonly for Haskell) it worked with only minor tweaks. I now have a type search which should scale to large databases and provide fast and accurate searches.

All the basic tests work, and I can generate a Hoogle database for the array library. I still can't generate a Hoogle database for the base library, due to a stack overflow, but I think the cause of the stack overflow has changed and should be easy to debug.

Next Week: A public beta of the command line version is now overdue, and hopefully will happen next week. I aim to finish the actual search side of Hoogle, and move on to the web interface.

User Visible Changes: Type search works again, mostly.

Sunday, July 20, 2008

GSoC Hoogle: Week 8

This week I've been travelling quite a bit, and rather busy with other things. Hopefully next week I'll be able to focus more time on Hoogle!

This week I fleshed out the final part of type search, including support for instances and alpha renaming of variables. After having implemented all the bits in the type search, I tried to convert the base libraries - and it failed, taking up too much time/memory to feasibly finish.

The type search is based around the idea of having nodes in a graph representing types, and then moving between these nodes, at a cost. In order to avoid a blow-up in the number of nodes in the graph, types are alpha-normalised and then alpha-renaming is performed afterwards. Instead of having 3 type nodes for (a,b), (c,d) and (a,a) there is just one named (a,b) and a 3 sets of alpha-renamings. All is good.

However, once you introduce instance restrictions, the types blow up. For example, from the type node a, you can move to Eq a => a, Ord a => a, Show a => a etc. The large (but feasible) number of type nodes, combined with even a small number of class names, gives a huge number of nodes. In fact, for every type variable in a node there are 2^n possible instance contexts it could take. All is bad.

Fortunately there is a solution - move instance checking outside the type graph. This makes the number of nodes feasible, and should work fairly well. It also has a few other benefits, including slightly better scoring and a simpler implementation in a few places. I also came up with a strategy for moving the cost associated with alpha-renaming into the graph search, which further simplifies things.

Of course, all this work takes time, so overall progress is slower than I would have liked. However, the results so far are promising, and the problems of scale seem to have been successfully addressed. The problem of fast and accurate type searching is hard, but hopefully Hoogle 4 will have a scalable solution that should be useful.

Next week: I want to finish the implementation of type searching, and check it works on the full base libraries. A release would be good, although may take place early in the following week.

User visible changes: Creating a database for the base library will now fail with a stack overflow. Hopefully next weeks changes will fix this!

Friday, July 11, 2008

GSoC Hoogle: Week 7

This week I've continued to improve the type searching, and generated Hoogle databases for the core libraries. I'm away from a computer all weekend until Tuesday evening, which has happened every ICFP contest for the last 3 years.

I've substantially refactored the type searching, basing it on a proper abstract Graph data type. Now the mechanisms for dealing with type search and graph traversal are separate it is much easier to express clearly what type search is doing. I've also fleshed out the type searching code so that it can accurately perform searches with all the necessary features. There are still a number of tasks to do before the type searching code is finished, but each is a fairly discrete unit of work with well-understood problems.

The other challenge for the week has been generating Hoogle databases for the core libraries - the base library and all the other libraries GHC ships with a release. With these libraries in place, it is feasible to use Hoogle to perform useful queries. The libraries are generated using a combination of Cabal, Haddock and Hoogle. I've made changes in both the Haddock and Hoogle layers so that the full base libraries can now be processed.

In order to deal with the full base libraries there are numerous GHC extensions that must be supported. In particular, Hoogle now supports multi-parameter type classes, higher-ranked types, type operators, unboxed types, unboxed tuples and NDP style arrays. All of these features are translated down into Haskell 98 types, but most closely approximate their behaviour in GHC, and can be used in searches.

Throughout the week I've been profiling the database creation code in Hoogle. The databases for the core libraries come to about 4.5Mb, and are highly optimised for performing searches - often at the cost of making them harder to create. I've halved the time to create databases during the week, using profiling to direct improvements. The time required to process the Core libraries takes 60 seconds, which is a certainly an acceptable timeframe, but could always be faster. Currently the biggest culprit in the profile is the hPutByte function:


hPutByte :: Handle -> Int -> IO ()
hPutByte hndl i = hPutChar hndl $ chr i


Since the underlying databases are written using lots of hTell and hSeek commands, it is not possible to use something like the Data.Binary library. However, if anyone has any suggestions on how to improve performance they would be gratefully received.

Next week: I want to finish off the remaining type search features, and then package up a command line release for Hackage. Hopefully Hoogle 4 will be ready for initial use by early testers.

User visible changes: Type search is more robust, but still not fully featured. Database creation if faster and robust. You can search the base libraries.

Sunday, July 06, 2008

GSoC Hoogle: Week 6

This week I've been tackling type searching. I have just (in the last few minutes) got my first type search to work. At the moment type search is very limited, but all the ideas and scafolding are in place, so should now proceed relatively quickly.

In all previous versions on Hoogle, type searching was O(n), where n is the number of functions in the database. Hoogle compared the type search to each possible answer, computed a closeness score, then at the end wrote out the closest matches. This meant that before the first answer could be given, all functions had to be checked, i.e. the time for the first answer was O(n). As the Hoogle database is about to get massively bigger, this approach is insufficient.

The new version of Hoogle is much cleverer. It works by exploring a graph, following similar ideas to Dijkstra's algorithm, to reach more suitable results first. Typically, the best answers will be given without any search of the graph, and then as the graph is searched more results will appear with lower closeness. With the new scheme the complexity is O(m), where m is the number of results you want. I hope at some point after the SoC is finished to describe the algorithm properly, so others can understand it, and hopefully improve upon it.

Next week: Finishing off type searching, so it supports all the features planned. Build system work, and potentially a cabal pre-release.

User visible changes: Type search works to some degree, but not perfectly. Database debugging options (conversion and dumping to a text file) have been added.