Neil Mitchell's Blog (Haskell etc): February 2015

Wednesday, February 25, 2015

Making withSocketsDo unnecessary

Summary: Currently you have to call withSocketsDo before using the Haskell network library. In the next version you won't have to.

The Haskell network library has always had a weird and unpleasant invariant. Under Windows, you must call withSocketsDo before calling any other functions. If you forget, the error message isn't particularly illuminating (e.g. getAddrInfo, does not exist, error 10093). Calling withSocketsDo isn't harmful under Linux, but equally isn't necessary, and thus easy to accidentally omit. The network library has recently merged some patches so that in future versions there is no requirement to call withSocketsDo, even on Windows.

Existing versions of network

The reason for requiring withSocketsDo is so that the network library can initialise the Windows Winsock library. The code for withSocketsDo was approximately:

withSocketsDo :: IO a -> IO a
#if WINDOWS
withSocketsDo act = do
    initWinsock
    act `finally` termWinsock
#else
withSocketsDo act = act
#endif

Here initWinsock and termWinsock were C functions. Both checked a mutable variable so that they only initialised or terminated once. The initWinsock function immediately initialised the Winsock library. The termWinsock function did not terminate the library itself, but merely installed an atexit handler, which terminated the Winsock library when the program shut down.

As a result, in all existing versions of the network library, it is fine to nest calls to withSocketsDo, call withSocketsDo multiple times, and to perform networking operations after withSocketsDo has returned.

Future versions of network

My approach to removing the requirement to call withSocketsDo was to make it very cheap, then sprinkle it everywhere it might be needed. Making such a function cheap on non-Windows just required an INLINE pragma (although it's very likely GHC would have always inlined the function anyway).

For Windows, I changed to:

withSocketsDo act = do evaluate withSocketsInit; act 

{-# NOINLINE withSocketsInit #-}
withSocketsInit = unsafePerformIO $ do
    initWinsock
    termWinsock

Now withSocketsDo is very cheap, with subsequent calls requiring no FFI calls, and thanks to pointer tagging, just a few cheap instructions. When placing additional withSocketsDo calls my strategy was to make sure I called it before constructing a Socket (which many functions take as an argument), and when taking one of the central locks required for the network library. In addition, I identified a few places not otherwise covered.
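To illustrate why repeated calls are cheap, here is a minimal self-contained sketch of the same pattern. It assumes a hypothetical fakeInit in place of the real initWinsock, and the names initCount, initOnce and withInit are invented for the example rather than taken from the network library.

```haskell
import Control.Exception (evaluate)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical counter, used only to observe how often initialisation runs.
{-# NOINLINE initCount #-}
initCount :: IORef Int
initCount = unsafePerformIO (newIORef 0)

-- Hypothetical stand-in for the C initWinsock call.
fakeInit :: IO ()
fakeInit = modifyIORef' initCount (+ 1)

-- A top-level value: the IO action runs the first time it is forced,
-- and later uses just see the already-evaluated result.
{-# NOINLINE initOnce #-}
initOnce :: ()
initOnce = unsafePerformIO fakeInit

-- Force the initialiser, then run the action.
withInit :: IO a -> IO a
withInit act = do
    evaluate initOnce
    act
```

Nesting withInit or calling it repeatedly should leave initCount at 1, since only the first evaluate does any real work.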

In newer versions of the network library it is probably never necessary to call withSocketsDo - if you find a place where one is necessary, let me know. However, for compatibility with older versions on Windows, it is good practice to always call withSocketsDo. Libraries making use of the network library should probably call withSocketsDo on their users' behalf.

Tuesday, February 17, 2015

nub considered harmful

Summary: Don't use nub. A much faster alternative is nubOrd from the extra package.

The Haskell Data.List module contains the function nub, which removes duplicate elements. As an example:

nub [1,2,1,3] ==  [1,2,3]

The function nub has the type Eq a => [a] -> [a]. The complexity of take i $ nub xs is O(length xs * i). Assuming all elements are distinct and you want them all, that is O(length xs ^ 2). If we only have an Eq instance, that's the best complexity we can achieve. The reason is that given a list as ++ [b], to check if b should be in the output requires checking b for equality against nub as, which requires a linear scan. Since checking each element requires a linear scan, we end up with a quadratic complexity.
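The quadratic behaviour can be seen in a minimal sketch of an Eq-only nub. The name nubSketch is hypothetical, and the code follows the general shape of the idea rather than reproducing the base source.

```haskell
-- Keep the elements seen so far in a plain list.
-- With only Eq available, membership needs a linear scan via elem.
nubSketch :: Eq a => [a] -> [a]
nubSketch = go []
  where
    go _ [] = []
    go seen (x : xs)
        | x `elem` seen = go seen xs              -- duplicate: drop it
        | otherwise = x : go (x : seen) xs        -- new: emit and remember it
```

Each new element is compared against every distinct element already emitted, which is where the O(length xs * i) cost comes from.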

However, if we have an Ord instance (as we usually do), we can achieve a complexity of O(length xs * log i) - a function that grows significantly more slowly. The reason is that we can build a balanced binary tree for the previous elements, and check each new element in log time. Does that make a difference in practice? Yes. As the graph below shows, by the time we get to 10,000 elements, nub is 70 times slower. Even at 1,000 elements nub is 8 times slower.

The fact nub is dangerous isn't new information, and I even suggested changing the base library in 2007. Currently there seems to be a nub hit squad, including Niklas Hambüchen, who go around raising tickets against various projects suggesting they avoid nub. To make that easier, I've added nubOrd to my extra package, in the Data.List.Extra module. The function nubOrd has exactly the same semantics as nub (both strictness properties and ordering), but is asymptotically faster, so is almost a drop-in replacement (just the additional Ord context).
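A minimal sketch of the tree-based idea, assuming Data.Set from the containers package as the balanced tree. The name nubOrdSketch is hypothetical, and this is not the implementation in extra.

```haskell
import qualified Data.Set as Set

-- Remember previously seen elements in a Set, so each membership
-- test and insertion is logarithmic in the number of distinct elements.
nubOrdSketch :: Ord a => [a] -> [a]
nubOrdSketch = go Set.empty
  where
    go _ [] = []
    go seen (x : xs)
        | x `Set.member` seen = go seen xs                 -- duplicate: drop it
        | otherwise = x : go (Set.insert x seen) xs        -- new: emit and remember it
```

The sketch keeps the first occurrence of each element and produces output lazily, so take i only processes as much of the input as it needs.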

For the curious, the above graph was generated in Excel, with the code below. I expect the spikes in nub correspond to garbage collection, or just random machine fluctuations.

import Control.Exception
import Data.List.Extra
import Control.Monad
import System.Time.Extra

benchmark xs = do
    n <- evaluate $ length xs
    (t1,_) <- duration $ evaluate $ length $ nub xs
    (t2,_) <- duration $ evaluate $ length $ nubOrd xs
    putStrLn $ show n ++ "," ++ show t1 ++ "," ++ show t2

main = do
    forM_ [0,100..10000] $ \i -> benchmark $ replicate i 1
    forM_ [0,100..10000] $ \i -> benchmark [1..i]

Tuesday, February 10, 2015

Why is the Hoogle index so out of date?

Summary: Hoogle 4 is out of date. The alpha version Hoogle 5 has fresh code and data every day (and isn't yet ready).

Someone recently asked why Hoogle's index is so out of date. Making the index both more current (updated daily) and larger (indexing all of Stackage) is one of the goals behind my Hoogle 5 rewrite (which still isn't finished). Let's compare the different update processes:

Hoogle 4 updates took about two hours to complete, if they went well, and often had to be aborted. I first compiled the Hoogle binary on the haskell.org machines, which often failed, as typically the version of GHC was very old. Once I'd got a compiled binary, I needed to generate the database, which took about 2 hours, and occasionally failed halfway through. Once I had the new binary and databases I moved everything to the correct place for Apache, accepting a small window of downtime during the move. Assuming that worked, I did a few test searches and smiled. Often the new Hoogle binary failed to start (usually failure to find some files, sometimes permissions) and I had to switch back to the old copy. Fixing up such issues took up to an hour. I had a mix of Windows .bat and Linux .sh scripts to automate some of the steps, but they weren't very robust, and required babysitting.

Hoogle 5 updates happen automatically at 8pm every night, take 4 minutes, and have yet to fail. I have a cron script that checks out the latest code and runs an update script. That script clones a fresh repo, compiles Hoogle, builds the databases, runs the test suite, kills the old version and launches the new version. The Hoogle code is all tested on Travis, so I don't expect that to fail very often. The upgrade script is hard to test, but the two failure modes are upgrading to a broken version, or not upgrading. The upgrade script runs checks and fails if anything doesn't work as expected, so it errs on the side of not upgrading. I use Uptime Robot to run searches and check the server is working, along with a canary page which raises an error if no upgrade happens for two days.
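The "err on the side of not upgrading" behaviour can be sketched in a few lines. This is only an illustration under assumptions: the names Step, runSteps and upgrade are hypothetical, the steps are arbitrary IO actions, and nothing here is taken from the real Hoogle upgrade script.

```haskell
import Control.Exception (SomeException, try)

-- A named step, such as building or running the tests (hypothetical).
type Step = (String, IO ())

-- Run the steps in order, stopping at the first one that throws
-- and reporting which step failed.
runSteps :: [Step] -> IO (Either String ())
runSteps [] = pure (Right ())
runSteps ((name, act) : rest) = do
    r <- try act :: IO (Either SomeException ())
    case r of
        Left e -> pure (Left (name ++ ": " ++ show e))
        Right () -> runSteps rest

-- Only switch to the new version if every check passed.
-- Returns True when the switch happened.
upgrade :: [Step] -> IO () -> IO Bool
upgrade checks switchOver = do
    r <- runSteps checks
    case r of
        Left _ -> pure False
        Right () -> switchOver >> pure True
```

With this shape, any failing check leaves the old version running, which matches the preference for not upgrading over upgrading to something broken.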

Clearly, the Hoogle 5 version update story is better. But why didn't I do it that way with Hoogle 4? The answer is that Hoogle 4 came out over six years ago, and a lot has changed since then:

Hoogle 4 is a CGI binary, served through Apache, while Hoogle 5 is a Haskell Warp server. By moving the logic into Haskell, it's far easier for me to configure and manage. Warp was only released on Hackage in 2011.
Hoogle 4 runs on the main haskell.org server, where my mistakes can easily take out the haskell.org home page (as a result, the haskell.org home page once said "moo" for 10 minutes). Hoogle 5 runs on a dedicated VM where I have root, and no one else runs anything, so I can experiment with settings about swap files, IP tables and cron jobs.
My job has provided a lot of practice doing drive-by sysadmining over the last 6 years. I've also had a lot of practice doing critical releases on a nightly basis. In comparison, Hoogle is pretty simple.
The revised/rethought approach to Hoogle databases is a lot faster and uses a lot less memory, so it takes under a minute to generate databases, instead of over an hour. That time difference makes it much easier to experiment with different approaches.

When will Hoogle 5 be ready? It doesn't yet do type search, there is no offline version and no API. There are probably lots of other little pieces missing. If you want, feel free to use it now at hoogle.haskell.org. You can still use Hoogle 4 at haskell.org/hoogle, or the more up-to-date FP Complete hosted Hoogle 4.

Thursday, February 05, 2015

Refactoring with Equational Reasoning

Summary: Haskell is great for refactoring, thanks to being able to reason about and transform programs with confidence.

I think one of Haskell's strengths as a practical language is that it's easy to refactor, and more importantly, easy to refactor safely. Programs in the real world often accumulate technical debt - code that is shaped more by its history than its current purpose. Refactoring is one way to address that technical debt, making the code simpler, but not changing any meaningful behaviour.

When refactoring, you need to think of which alternative forms of code are equivalent but better. In C++ I've removed unused variables, only to find they were RAII variables, and their mere presence had a semantic effect. In R I've removed redundant if expressions, only to find the apparently pure condition had the effect of coercing a variable and changing its type. In Haskell, it's equally possible to make refactorings that at first glance appear safe but aren't - however, in practice, it happens a lot less. I think there are a few reasons for that:

Haskell is pure and evaluation strategy is largely unobservable - moving a statement "before" or "after" another lexically is usually safe.
Refactorings that do go wrong, for example variables that accidentally get moved out of scope or types which are no longer as constrained, usually result in compiler errors.
The Haskell community cares about semantics and laws. The Monad laws are satisfied by almost all monads; flagrantly breaking those laws is rare.
Functions like unsafePerformIO, which could harm refactoring, are almost always used behind a suitable pure abstraction.

Note that these reasons are due to both the language, and the conventions of the Haskell community. (Despite these factors, there are a few features that can trip up refactorings, e.g. exceptions, record wildcards, space-leaks.)

To take a very concrete example, today I was faced with the code:

f = fromMaybe (not b) . select
if f v == b then opt1 else opt2

At one point the function f was used lots, had a sensible name and nicely abstracted some properties. Now f is used once, the semantics are captured elsewhere, and the code is just unclear. We can refactor this statement, focusing on the condition:

f v == b
-- inline f
(fromMaybe (not b) . select) v == b
-- remove brackets and inline (.)
fromMaybe (not b) (select v) == b
-- expand to a case statement
(case select v of Nothing -> not b; Just x -> x) == b
-- push the == down
case select v of Nothing -> not b == b; Just x -> x == b
-- simplify not b == b
case select v of Nothing -> False; Just x -> x == b
-- collapse back up
select v == Just b

And now substitute back in:

if select v == Just b then opt1 else opt2

Our code is now much simpler and more direct. Thanks to the guarantees I expect of Haskell programs, I also have a high degree of confidence this code really is equivalent - even if it isn't obvious just looking at the beginning and end.
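Because the condition only involves a Bool and a Maybe Bool, the equivalence can also be checked exhaustively. The following is a minimal sketch, assuming select v has type Maybe Bool and writing m for its result; the names before, after and equivalent are hypothetical.

```haskell
import Data.Maybe (fromMaybe)

-- The original condition, with m standing for select v.
before :: Bool -> Maybe Bool -> Bool
before b m = fromMaybe (not b) m == b

-- The refactored condition.
after :: Bool -> Maybe Bool -> Bool
after b m = m == Just b

-- Compare the two on all six combinations of fully defined inputs.
equivalent :: Bool
equivalent =
    and [ before b m == after b m
        | b <- [False, True]
        , m <- [Nothing, Just False, Just True]
        ]
```

The check covers only fully defined values, so it complements the step-by-step reasoning rather than replacing it.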
