Blog

I'm trying to make this website my new blog, using OMD and MPP, with OCaml of course, but also the (awful) HTML/CSS/JS triplet.

Here follow the latest blog posts.

Functional visitors for a complex tree-shaped data structure

Let's consider this data type:

type e = | A | B of t | C of string * t * t | D of string | E of int
and  t = e list
[try]

If we want to write an abstraction to apply a function to this data structure (the one of type t) without being bothered to treat the whole structure everytime, we can get some inspiration from the visitor design pattern commonly used in object-oriented programming. The “problem” is that we don't have the same inheritance mechanism in functional programming, meaning that the default behaviour has to be implemented with functions instead of using methods that can be overriden in subclasses.

First attempt

The following implementation met my needs:

let rec visit f = function
  | [] -> []
  | (A | E _ | D _) as e :: tl ->
    begin match f e with
      | Some(l) -> l @ visit f tl
      | None -> e :: visit f tl
    end
  | B(t) as e :: tl ->
    begin match f e with
      | Some(l) -> l @ visit f tl
      | None -> B(visit f t) :: visit f tl
    end
  | C(s, t1, t2) as e :: tl ->
    begin match f e with
      | Some(l) -> l @ visit f tl
      | None -> C(s, visit f t1, visit f t2) :: visit f tl
    end
(* val visit : (e -> t option) -> t -> t *)
[try]

With this abstraction, it becomes very easy to write a function that removes all As:

let remove_A t = visit (function A -> Some [] | _ -> None) t
[try]

or all Bs:

let remove_B t = visit (function B _ -> Some [] | _ -> None) t
[try]

and it's also very easy to convert all Es to Ds:

let convert_Es_to_Ds t =
  visit (function E(i) -> Some [D(string_of_int i)] | _ -> None) t
[try]

Doing this with some fold abstraction is not suitable because we would need to do something like

let rec remove_B t =
  List.rev
    (fold
       (fun r ->
          function
          | B _ -> r
          | (A | E _ | D _) as e -> e::r
          | C(s, t1, t2) -> C(s, remove_B t1, remove_B t2))
       []
       t)
[try]

in which case what bothers me most is the line that has C(s, remove_B t1, remove_B t2) because it means that we still have to make trivial recursive calls that are just annoying to write (trivial code should be avoided whenever possible because it increases the chances to introduce nasty bugs).

What about genericity?

Well, perhaps we might want to have a visitor that doesn't always return a t. Can we make it generic using polymorphic abstractions? The implementation of visit returns a value, so if we want it to be able to return something else, we have to parameterise the default behaviour.

So, let's add a parameter to the function visit and see what it looks like. Well, let's call it glue since it's a function that glues 2 things together (if you find a better name, please let me know).

let rec visit f glue = function
  | [] -> []
  | (A | E _ | D _) as e :: tl ->
    begin match f e with
      | Some(l) -> glue l (visit f glue tl)
      | None -> glue [] (e :: visit f glue tl)
    end
  | B(t) as e :: tl ->
    begin match f e with
      | Some(l) -> glue l (visit f glue tl)
      | None -> glue [] (B(visit f glue t) :: visit f glue tl)
    end
  | C(s, t1, t2) as e :: tl ->
    begin match f e with
      | Some(l) -> glue l (visit f glue tl)
      | None ->
        glue []
          (C(s, visit f glue t1, visit f glue t2) :: visit f glue tl)
    end
(* val visit : (e -> 'a list option) -> ('a list -> t -> t) -> t -> t *)
[try]

We're almost there. There's still that 'a list, which is less generic than 'a. Well, glue needs to have 2 parameters because it has to be able to treat both branches of the pattern-matching filter, so we could make the first parameter optional. We could use optional arguments, and we can see that f already returns an option, can we just take advantage of that? Well, if we replace for instance

  | (A | E _ | D _) as e :: tl ->
    begin match f e with
      | Some(l) -> glue l (visit f glue tl)
      | None -> glue [] (e :: visit f glue tl)
    end
[try]

by

  | (A | E _ | D _) as e :: tl ->
      glue (f e) (visit f glue tl)
[try]

it does work for this branch but it doesn't for the others because one purpose of the visitor is to deeply traverse the data structure automatically. So we're back to optional arguments...

And so we could have that:

let rec visit f ?(glue=(fun ?l r -> match l with None -> r | Some l -> l @ r)) = function
  | [] -> []
  | (A | E _ | D _) as e :: tl ->
    begin match f e with
      | Some(l) -> glue ~l (visit f ~glue tl)
      | None -> glue (e :: visit f ~glue tl)
    end
  | B(t) as e :: tl ->
    begin match f e with
      | Some(l) -> glue ~l (visit f ~glue tl)
      | None -> glue (B(visit f ~glue t) :: visit f ~glue tl)
    end
  | C(s, t1, t2) as e :: tl ->
    begin match f e with
      | Some(l) -> glue ~l (visit f ~glue tl)
      | None ->
        glue 
          (C(s, visit f ~glue t1, visit f ~glue t2) :: visit f ~glue tl)
    end
(* val visit : (e -> t option) -> ?glue:(?l:t -> t -> t) -> t -> t *)
[try]

Then the problem is that it's not polymorphic any more, while we want it to be polymorphic! So let's drop the default value for the parameter glue.

let rec visit f (glue:(?l:'a-> t -> t)) = function
  | [] -> []
  | (A | E _ | D _) as e :: tl ->
    begin match f e with
      | Some(l) -> glue ~l (visit f glue tl)
      | None -> glue (e :: (visit f glue tl)
    end
  | B(t) as e :: tl ->
    begin match f e with
      | Some(l) -> glue ~l (visit f glue tl)
      | None -> glue (B(visit f glue t) :: visit f glue tl)
    end
  | C(s, t1, t2) as e :: tl ->
    begin match f e with
      | Some(l) -> glue ~l (visit f glue tl)
      | None ->
        glue 
          (C(s, visit f glue t1, visit f glue t2) :: visit f glue tl)
    end
(* val visit : (e -> 'a option) -> (?l:'a -> t -> t) -> t -> t *)
[try]

There we go, we have a generic visitor for structures of type t.

As we did before, we may now define remove_A, remove_B and convert_Es_to_Ds, using the following quite simple definition of glue.

let glue ?l t = match l with Some l -> l @ t | None -> t
[try]
let remove_A t = visit (function A -> Some [] | _ -> None) glue t
[try]
let remove_B t = visit (function B _ -> Some [] | _ -> None : e -> t option) glue t
[try]
let convert_Es_to_Ds t =
  visit (function E(i) -> Some [D(string_of_int i)] | _ -> None) glue t
[try]

We could actually make it more generic by allowing glue to return something else than a t, and we could have something resembling val visit : (e -> 'a option) -> (?l:'a -> t -> 'b) -> t -> 'b. Well, the problem is that we have to tell visit how to reconstruct a visited C. It would be easy to have a more generic version of visit of type val visit : (e -> 'a option) -> (?l:'a -> t -> ('b * t)) -> t -> ('b * t) but that's become far from simple to understand: it looks like some sort of visit and fold merged together. Hmmm... Let's not go that far, for now.

High-power LEDs for super-bright light

I've been "playing" with COB LEDs for about half a year now. Here's a summary of my experience.

Preliminaries

COB LEDs

A "COB LED" is a basically a bunch of tiny LEDs put on together to provide a very energy-efficient and usually very bright LED (using both parallel and series circuit schemes). Note that it means the voltage of COB LEDs are not the voltage of "normal LEDs", since putting some LEDs in series means adding up voltage values!

An LED does not behave like an incandescent light bulb

Also, note that if there's a roughly linear relation between light intensity and voltage for classic lamps, it's not the case at all with LEDs! With LEDs, basically, the light intensity varies with the amperage, not the voltage. So if you lower the voltage by a very little bit, it will shine as bright if the amperage keeps up; if you lower the voltage too much, it basically stops emitting light (e.g., if you give it 16V instead of 19V, you have so little light that it's really useless), but if you increase the voltage too much you still fry your LEDs so you don't want to do that at all.

Light colors, measured in Kelvin

Basically, 2700K to 3300K correspond to classic (discontinued) incandescent light bulbs, and it's quite close to classic candle light, which is about 1850K. 6500K is day light as in "the light provided by the Sun on a sunny day at noon", which seems very cool if it's dark outside or if you have warm light around. My favourite artificial light color is generally from 4000K to 5000K. I find the 3000K quite acceptable, below that is too warm, above 6000K is too cool.

Don't get me wrong: in this post, "cool" not a synonym for "awesome", stick to its original meaning instead.

Read more on Wikipedia.

Color spectral distribution

Note that LEDs generally have a very narrow light color spectrum. That means that instead of emitting light of a lot of different colors, including some colors you don't see or colors that are not very much noticed because they are less intensively emitted, LEDs will really emitt just the color you asked for. That is not such a good thing (it's not so bad either). But as a consequence, you might feel that the light emitted by an LED is not very comfortable. What I do is that I mainly use warm light LEDs and I add a few cool light LEDs amongst them: it goes from 1 cool for 6 warms to 1 for 1. The result of 1 for 1 is quite nice when it's bright outside, but it's not so nice if it's in the middle of the night.

Why not simply buy normal lamps?

Well, buying new lamps for having more light implicates buying light bulbs as well. I'm not a very big fan of light bulbs that contain mercury because if you ever break one, you'll breathe mercury gas (much likely in small quantities though). And I can't install lights on the wall without asking my landlady for permission, and even if I did have permission, it would mean putting electric wires inside the wall and that's not really an idea that speaks to me (I've never done that) and that means painting the wall afterwards (not a big fan of painting walls either). What about LEDs then? Don't I need to install them on the wall anyway? Well, no, I don't! I have access to some of the iron based framework, and as LEDs are light weight and flat, I stick them on some material on which I put some magnets, et voilà! (Now I'm even a bigger fan of strong magnets than before.)

Well, I won't deny it, the main reason why I wanted to use LEDs is that it's super high tech, it's fun, it's powerful and energy-efficient (and ecological), it's nice, it's the future, it's geek. Also, we plan on using LEDs massively in university of Cambridge buildings, starting with the department of computer science.

What I've used

LEDs

I've essentially used 6W COB LEDs, which — according to the specs given by their manufacturers — are as efficient as 120lm/W or 108lm/W, depending on if you prefer white light or warm white light. I gave a try to more a classic (though high-power) LED before: 9 pieces of 1W LEDs put together on an aluminium disc. Well, I switched to COB LEDs for three reasons. The first one is that COB LEDs are more energy-efficient. The second one is that the voltage of the COB LEDs that I use is higher (19V instead of 10V), which means that for a given wattage, the current intensity (amperage) is lower, and that means less chance for the wires to heat... (I'm not an electronician nor a physisist, please correct me if I'm wrong.) And the third one is that COB LEDs are very compact. The "9 pieces of 1W LEDs on a disc" is really big: the diameter is 88mm. The drawback is that the COB LEDs that I use are only available in the 6500K and 3000K color temperatures, whereas the other ones are available in 6500K, 4000K, 3000K and 2500K; 400K being my favourite.

Power supply units

It's not that easy to provide electrical power to LEDs... because we need a potentially quite high amperage, the right voltage, and the latter has to be very stable (or at least never reach a higher value than the expected one, otherwise the LEDs burn and die quite easily, and we don't want to kill some hardware that's supposed to last ten and up to thirty years).

I recommend this power supply unit (PSU) from Rapid if you're powering at most 30 watts of LEDs at a voltage of 24V or less for three reasons: it seems to me so far that it's a very good quality PSU, the voltage is well regulated, it does provide up to 30W without complaining or heating up, it switches on very quickly (virtually instantaneously). I have used 2 of them for about 6 months. (I haven't tried to make it work at 36W because it would supposedly diminish its lifespan.)

Well, that 36W PSU is currently (early October 2013) out of stock, and since I visited (randomly) a Maplin store and saw this 120W laptop power supply unit for less than £40, I bought two of them to give them a shot. They are a lot cheaper than the 36W PSU I've been using so far, if we compare the price-over-power ratio. Well, the voltage they deliver seems okay: according to my cheap multimeter, so it's not really the most reliable measurement, they deliver 19.2V when using a 19V head. The main drawback of this adaptor is that it takes time to "boot"! (Yes, it means it will not deliver power during those 5 seconds.) It's not really a problem for a laptop, but for a lighting system, having to wait 5 full seconds just seems like forever. So it means that if I want instantaneous light-on feature, I need a switch between the PSU and the LED instead of between the wall plug and the PSU. Well, in practice, I think I'll simply wait 5 seconds, because I'll only use them when I want a very very bright light, since I already have 2 LED-based lighting devices powered by those 36W PSUs that provide power instantaneously.

The other drawback is that it's made to work with (existing) laptops and they provide one connector per model (i.e., shape of the head and voltage), so you can't use any connector and still assume you'll be able to select the voltage: here, the head determines the voltage you'll get.

Heat!

One of the biggest (potential) problem with COB LEDs is that, as they are very powerful and very tiny, you're supposed to stick them on something that can dissipate the heat that they produce. What I learnt from that is that iron is a very very slow heat conductor compared to aluminium. I'm using aluminium rulers because it's the cheapest flat thing made of aluminium that I could find. Three rules very easily dissipate the heat produced by 4 LEDs. I tried tripling the density of LEDs... It was not a good idea, as the rulers become hot quite rapidly: within a few minutes, they reach close to 50°C. So with 4 LEDs per ruler, active fan cooling may dissipate the heat.

Sticking COB LEDs

To stick LEDs, I used some akasa thermal adhesive tape that I bought on Amazon, and it works pretty well.

The worst part...

... is obviously, to me, the soldering part. It's quite hard to solder those COB LEDs because the contact spots are very small, so it takes a tremendous amount of time.

Overall

Why bright light is important...

I feel that it's important to me to face (very) bright light during the day, and it's best if it comes directly from the sun. Apparently (although I'm not a hundred percent sure) it's actually important to be exposed to (very) bright light at least 30 minutes a day.

So, instead of using a very expensive S.A.D. light to prevent potential seasonal affective disorder, well, I'm using very bright state-of-the-art LEDs to light up my home. :-)

Conclusion

LEDs are awesome because they need very little space compared to other light-emitting devices.

As they're very energy-efficient, using them for indirect lighting doesn't consume too much electricity.

They can last up to 10 or 30 years if they're taken well care of. So once you make them, you don't need to worry about them for a while... Well, I'm not sure the power supply units can last that many years, but it's not the hard part to change.

If you live in Cambridge (UK), you could use Makespace to build your own LED-based super bright and/or super smart lighting system. Well, given my low-budget when I started building those LED-based lighting systems, I couldn't afford Makespace membership, so I kind of have all I need to handcraft at home.

Pictures

4 COB LEDs on 3 aluminium rulers, in parallel. The 3 aluminum rulers are stuck together using aluminium tape, which has good thermal conductivity, reflects light not too badly and is not too expensive.

4-COB-LEDs-off-640.jpg

4-COB-LEDs-off-unplugged-640.jpg

4-COB-LEDs-off2-640.jpg

If you do want LEDs but don't want to solder anything

Well, you may find a lot of websites on Google selling LEDs, you may also try this one, which belongs to a friend of mine.

(Example) A use case of MPP in an OCaml program

Let's say you have a big variant type such as this one:

type t = A | B | C | D | E | F | G
[try]

and you want to implement a t -> string function. With MPP, you can write this:

{< let l = ["A"; "B"; "C"; "D"; "E"; "F"; "G";] >}
type t = {< let _ = List.iter (Printf.printf "| %s ") l >}
let to_string = function
{< let _ = List.iter (fun e -> Printf.printf "| %s -> %S" e e) l >}

and then you have that:

type t = | A | B | C | D | E | F | G 
let to_string = function
| A -> "A"
| B -> "B"
| C -> "C"
| D -> "D"
| E -> "E"
| F -> "F"
| G -> "G"
[try]

As you may see, if you grow the type t by adding more cases, you don't have to manually update the function to_string.

OPAMaging MPP

Please read this if you want to know more about MPP.

Announcement: I've OPAMaged MPP!

So, I've eventually OPAMaged MPP (i.e., made an OPAM package for MPP)! Now MPP officially has a version number, and at the time I'm writing these lines, the current version is 0.1.0.

Happy that it eventually worked...

Yes, I'm quite happy that MPP now has an OPAM package. To do so, I copied the _oasis and the Makefile files from MPP and adapted them for MPP. At first I wanted to avoid using oasis because for reasons that are still unknown to me, some people told me (half-joking?) that it was better not to depend on oasis... Well, actually I tried not to depend on oasis! I tried but just couldn't build in a reasonable amount of time a Makefile that met the requirements for an OPAM package. Building MPP is quite trivial, installing it in the right place and being able to uninstall it, well, less trivial (perhaps I've just missed a simple thing in the process).

A side note

Although the MPP project started a while ago now, it's still at a pretty early stage of development since I've been mostly working on the tools that are needed to build the new ocaml.org website, such as OMD (read about the experience here), which I've developed from scratch, and OPAM-DOC, which I had to contribute to at some point because we had to fix it while having a very little amount time.

Now that OMD has become quite stable (although some bugs are still found from time to time, which is normal) ../..

No product of human intellect comes out right the first time. We rewrite sentences, rip out knitting stitches, replant gardens, remodel houses, and repair bridges. Why should software be any different? – Lauren Ruth Wiener

../.. and since I'm not the only contributor on opam-doc (Vincent Botbol is still participating, he fixed quite a nasty-to-fix bug last weekend, and Leo White is still working on the JavaScript part), it's not as much time-taking as OMD, but because it's not my code and because it's using some syntax extension (yeah, CamlP4's in the place, yo), patching it is not as simple as it could be...

Using MPP in two differents ways

MPP is now an OPAM package

First of all, I've to say I've finally made an OPAM package for MPP! At the time I'm writing these words, it's still waiting for merging. Read more here.

Description of MPP

MPP is a preprocessor that has been designed to allow any (non-insane) programming language to be used as a preprocessor for any text-based file. The particularity is that the users keep their initial language as their primary language: the language(s) brought by MPP is (are) secondary. It means that ideally, if you give a random text file to MPP, MPP will return it unchanged! And in practice, you can always parameterize MPP (on the command line, or in an MPP file) such that for a given text file, MPP will return it unchanged.

Of course, the purpose of MPP is not to replace the "cat" command! But having the property that you can guarantee that MPP will behave as the "cat" command if you want it to is important, because it means you can customise MPP such that MPP can work for any text file.

Side note:

Well, if you discover a language that has some kind of infinite semantics, which are somehow automatically defined by itself, such that every possible sequence of bytes has a meaning in this language, MPP might not work because you would have a hard time finding a token that's not already taken, since there're all already taken... But I don't think that it's a realistic scenario, so let's just forget about it.

Principle of MPP

The principle on which MPP is based is pretty simple. The users define the tokens they want to use to determine the beginning and the end of MPP blocks. And there are 3 kinds of MPP blocks:

  1. basic MPP block;
  2. basic MPP block with possible nesting;
  3. foreign MPP block.

The first and the second are used for the same things except that the second allows nesting. The third kind is the most interesting one, as it's the one that allows to embed virtually any programming language as a preprocessor language.

Side note:

Well, you can't easily use pure lambda calculus as a preprocessor language. Your programming language has to have some notion of input/output to process text. Using pure lambda calculus might be very painful and it would start by making it impure enough to interact with the rest of the world.

Example 1: the basics

So, let's say you're writing a program using the OCaml language, and you want x to be a random integer.

You can write this:

let x = Random.int max_int

and if you want to be sure that the number is different each time you execute your program, you should use Random.self_init or an equivalent.

Now what if you want a random number to be generated at compile time? You can't do it in pure OCaml. You need to preprocess your file and/or to read from another file...

With MPP, you can write this:

let x = << cmd echo $RANDOM >>

if you use mpp with these parameters: -so '<<' -sc '>>' to set opening and closing tokens to << and >>. Default tokens are respectively (( and )) but if you're writing some OCaml code, you shouldn't use (( and )) because there's quite a chance that they appear as OCaml code in your file.

What happens with MPP is that it simply echoes everything that is not MPP, and when it's MPP, it interprets the commands. The cmd builtin is probably the simplest and the most powerful, since it calls your shell with what follows cmd. The first line is the command that is called, and what follows is given to the command as its standard input.

So, basically,

let x = << cmd echo $RANDOM >> mod 42

given to MPP has the same semantics as this Bash script

echo -n 'let x = '
echo $RANDOM
echo ' mod 42'

Now, instead of using non-nesting tokens, you can use nesting tokens, the only difference being that you can nest commands. Wait, what if you nest with non-nesting commands? Well, MPP has a very simple syntax. If you can't nest, you really can't nest. So, if you write << echo << echo 42 >> >> it just doesn't work! Moreover, you have a syntax error because the first >> is seen as the closing token for the first <<, and MPP stops when there's a syntax error. If you want to echo << echo 42 >>, it is possible, you just have to name your block, as in <<name echo << echo 42 >> name>> and then it just works!

Note that with nesting tokens, you have to name all outer tokens. The default tokens are {{ and }}. So, to nest some commands, you can write for instance {{name set x {{ cmd cat foo.txt }} name}}. Note that the the most inner block doesn't have to be a nesting one, as it doens't nest any. It means you could write {{name set x << cmd cat foo.txt >> name}} instead. And note that only the most inner nesting block can be nameless and that's in some cases only.

Note: the technique presented above (in example 1) has been used to generate the new OCaml website.

Example 2: using OCaml as a preprocessor language

When MPP is used with the option -l, it produces a program that prints the file, using information provided to MPP about the language specified with -l. For instance, mpp -l ocaml will produce an OCaml program.

If you apply mpp -l ocaml to a file F that contains no MPP commands, then you have an OCaml program that, when executed, produces the original file F.

When you use -l ocaml, all "foreign blocks" are considered as OCaml code and it will just "pass through". So, if you write 42 {< let x = 23 >} Hello, it will be translated to this program:

let _ = print_string "42 "
let x = 23
let _ = print_string " Hello"

It's not exactly as flexible as "you can write anything at the top-level". Indeed, you may write any valid top-level definition. It could be a type definition, a module definition, really anything you want that is valid at the top level and that does not require double semi-colons. That is, you cannot write {< print_int 42 >}because it's not a valid top-level phrase, you should write this instead {< let _ = print_int 42 >}. If you really love playing at edges, you could write {< ;; print_int 42 ;; >}and it might work... if there is something preceding. If there isn't (but that's not trivial to guess because the semantics of MPP might change on this precise point, I don't know, I can't promise right now), then you could perhaps write {< print_int 42;; >}.

But you can't convert a block of text into an OCaml function. You can't write something like you would do in PHP, such as:

<ul>
  {< for i = 1 to 42 do >}
    <li> {< print_int i >} </li>
  {< done >}
</ul>

Oh but wait! You can use functors...

So in functor-style, you can write this:

<ul>
  {< module I (X:sig val i : int end) = struct >}
    <li> {< let _ = print_int X.i >} </li>
  {< end >}
  {< let _ = for i = 1 to 42 do
       let module M = I(struct let i = i end) in
       ()
     done >}
</ul>

Sorry if that hurts your eyes, but life is not always easy. Oh, and if you think it's overkill, well, of course it is, that's a tiny toy example! I'm actually using this functor-style to write this blog. It means that I generate a template with the functor-style headers and footers and I write inside the functor using Markdown, and when I need to write some OCaml code, I use MPP features. And when I need to print MPP stuff, I use the 2 worlds of MPP, but it's late and I don't want to go into details just right now. What's important is that it's nice and it works. I've always wanted to be able to easily embed the OCaml language in any text (LaTeX, ML, HTML, Bash, etc.) file, and now I can!! :)

(Exercise) Odd or even?

This is the exercice:

Given a string (of bytes), is the number of bits set to 1 odd or even?

Let's now find some possible solutions.

Well, the most naive way is to count them all, and then see if the number is even or odd. But it's probably better not to use an integer to count them all and then see if that counter is odd or even, since we can instead use a Boolean: each time we see a 1, we can simply flip the Boolean value!

let has_an_odd_number_of_1 s =
  let res = ref false in
  for i = 0 to String.length s - 1 do
    let x = int_of_char s.[i] in
    for j = 0 to 7 do
      if (x lsr j) land 1 = 1 then
        res.contents <- not res.contents;
    done
  done;
  res.contents
[try]

Side note:

Moreover, we can see that using a Boolean means it will not limit the number of bits we may read, whereas an integer... well, actually, when we reach the maximum value of integers (of type int, or Int32.t or Int64.t, which — in OCaml — are respectively Pervasives.max_int, Int32.max_int and Int64.max_int), adding one will make them even negative numbers (max_int, from Pervasives, Int32 or Int64, are all odd numbers), so we should still get the odd/even outcome right. But then, we would have to be sure that we have correctly designed the code that tells us if an integer is odd or even. For instance, we can write is_odd these ways:

Well, let's start with a very classic mathematically-elegant implementation of two mutually recursive functions:

let rec is_odd = function
 | 0 -> false
 | n -> is_even (n-1)
and is_even = function
 | 0 -> true
 | n -> is_odd (n-1)
[try]

Note that it needs a lot of computation when the integer is "big"...

And this is probably the shortest version:

let is_odd n = n mod 2 <> 0
[try]

Now note that we cannot write this and expect it to work properly in all cases:

let is_odd n = n mod 2 = 1
[try]

and that's because if n is negative and odd, then n mod 2 is -1, not 1.

A tricky version that uses properties on integers:

let is_odd n = n <> 0 && ((n / 2) * 2) <> n
[try]

Another tricky version that uses properties on machine integers:

let is_odd n = n land 1 = 1
[try]

Well, there is the same number of characters as the shortest version, and actually this version is probably the most efficient because we know that the land operation is easy for the computer, whereas the mod operation implies division, which is not an easy one. Then mod could be optimised in some cases, such as mod 2 because it's basically the same as land 1 except that the sign has to be preserved! So, it's likely to still cost more...

But there's way more efficient than that! Indeed, unless we really don't have a choice, we shouldn't read bits one by one, because it's slow!

So what can we do instead?

Well, there's a very very convenient logical operation on integers. It's called XOR (and the OCaml operator for that is lxor), and yes we can use it!

Note that 1 XOR 1 = 0; 0 XOR 0 = 0; but 1 XOR 0 or 0 XOR 1 = 1. It's very convenient because it's basically giving us the answer! When we have two 1s, we can eliminate them, that's what happens with XOR. When we have a 1 and a 0, we keep the 1. So if we take those bytes byte by byte, and XOR them with each other, at the end we will have one remaining byte that will tell us if the whole string had an odd number of 1 or not.

let has_an_odd_number_of_1__efficient s =
  let l = String.length s in
  let rec loop c i =
    if i = l then
      let res = ref false in
      for j = 0 to 7 do
        if (c lsr j) land 1 = 1 then
          res.contents <- not res.contents;
      done;
      res.contents
    else
      loop (c lxor int_of_char s.[i]) (succ i)
  in loop 0 0
[try]

And... There we go! :-)

Note that int_of_char is a zero-cost operation (but char_of_int is not, because it has to check that the value of the given integer is between 0 and 255 inclusive).

(Exercise) Are there more 1s or 0s?

The challenge is this:

Provide a program that will determine, given a set of 1s and 0s, if there are more 0s or more 1s.

So the first question that comes to my mind is about the representation of that set of 1s and 0s. Is it a linked list, a double linked list, an array, a tree, a string, or something else?

If it's a list (single linked or double linked actually doesn't matter much), it's going to be pretty straightforward: just read the list and use a counter. Yes, a single counter is enough, you can add one to the counter when you meet a 1, and subtract one when you meet a 0. At the end, if your counter has its initial value, then you have the same number of 1s and 0s. If it's lesser, then it has more 0s, else it has more 1s.

Let's declare a type for the result.

type result = Equal | More_zeros | More_ones
[try]
let count l =
  let rec loop counter = function
    | [] -> counter
    | 1::tl -> loop (counter+1) tl
    | 0::tl -> loop (counter-1) tl
    | _::tl -> loop counter tl
    (* you might want to debate whether you should stop 
       when you get something else than a 0 or a 1. *)
  in
  let r = loop 0 l in
  if r = 0 then
    Equal
  else if r > 0 then
    More_ones
  else
    More_zeros    
[try]

Well, what's the risk with this program? The risk is that the integer counter overflows, of course! If you have a very long list of 1s (or 0s) only, you may get things just wrong. However, in OCaml, in reality, there's not really a chance that the integer overflows because of the length of a single linked list, especially if you're using a 64-bit architecture based OCaml, on which the greatest integer is 4_611_686_018_427_387_903 (about 4⨉1018). There's really a long way to have such a long list because basically you would need to allocate more than (about 32⨉106 terabytes) at once, since basically a linked list of integers is made of blocks that have 2 cells each (one for the integer, one for the address of the next cell), each cell taking 64 bits (or 8 bytes).

But then, what if you don't have linked lists but some stream that gives you a very large number of 0s and/or 1s? Well, to begin with, counting from 0 to 4⨉1018 takes a really long time. If your machine can count from 0 to 109 in a single second (that would mean your machine is very fast), it would still take 4⨉109 seconds, which is about 4000000000/(606024*365) years. Oh, that means about 126 years! So let's just assume that a 63 bit signed integer is enough for us. And if you really can't assume that for some reason, you can always implement 128 bit signed integers quite easily, and if you don't know how to do that or if you're too lazy to do it, use the Big_int module.

But let's go back the representation of those 0s and 1s. I'd like to make the computation as fast as possible and I'll put those 0s and 1s in a very compact representation. Each 0 and 1 will now only take one bit in the memory (with a possible constant overhead for the whole). For that, let's use OCaml's strings, which are basically arrays of bytes. The longest string I can have on my machine will bear 1_152_921_504_606_846_904 bits (I know that because I multiplied Sys.max_string_length by 8), and that's a lot (more than 108).

Now say we want to know whether there are more 0s or 1s as fast as possible. How do we do that?

We don't want to count all 0s and 1s bit by bit, because that would have quite a high cost! Indeed, we would have to get each byte, and for each byte we would have to read each of its 8 bits (that can be done) one by one. We don't want to do that.

Instead, since we have bytes, we can conveniently allocate an array of size 256. Each cell of that array will contain the right number to add to the counter. This way, we can read a byte, get its number of 0s and 1s in O(1).

(* this table is computed only once *)
let table =
  let number_for_a_byte b =
    let r = ref 0 in
    for i = 0 to 7 do
      if (b lsr i) land 1 = 0 then
        decr r
      else
        incr r
    done;
    !r
  in
  let a = Array.make 256 0 in
  for i = 0 to 255 do
    a.(i) <- number_for_a_byte i
  done;
  a
[try]

Then let's abstract from the means to read new 0s and 1s, by assuming we'll be provided a function f that given () will return 8 bits in a value of type char, and will raise the exception End_of_file when it has no more bits to give.

let more_zeros_or_ones f =
  let c = ref 0 in
  begin try while true do
    c := !c + table.(int_of_char(f()))
  done with End_of_file -> () end;
  if !c = 0 then
    Equal
  else if !c > 0 then
    More_zeros
  else
    More_ones
[try]

Note that int_of_char has a zero-cost (i.e., it "disappears" at compile-time because int and char sort of share the same memory representation). If you want better performance, you should inline f, provided that you know what it is (you may want to check if the compiler does the inlining itself first, just in case).

Also, you may want to use a table with a size larger than 256 if you have a lot of memory but I'm not so sure you'd gain performance unless you use nasty tricks to read larger integers from a string. Then again, you may not end up using a string, in which case you have to think with the whole problem in mind.

On the Implementation of OPAMDOC

Early this year, Leo White started the implementation of opamdoc. Then Vincent Botbol worked on it from late May to mid August during his stay in Cambridge.

Now, Vincent's back to studying "computer science research" in Paris, and he continues working on opam-doc when he can on his free time.

I didn't really want to get involved too deep in the implementation of opamdoc (mainly because it takes time). Eventually, since I've ended up doing most of the technical-side implementation of the new ocaml.org web site, I had to eventually get into opamdoc's source code to integrate its output in the website.

If you look at the source code of opamdoc, you'll see there's a bunch of files at the root directory. Well, some of them are inherited from the implementation of ocamldoc, and some are new. I've mostly contributed to generate.ml and opam_doc_config.ml. The former implements the HTML backend of opamdoc, and the latter contains the JavaScript engine that loads the documentation. This blog post is mostly about my experience with those two files.

generate.ml

This big file (about 1.5 kloc) contains the functions to retrieve the information from cmt, cmd, cmti, cmdi and cmi files in order to build the HTML document.

Side note:

The .cmt files were officially introduced in the standard OCaml compiler in version 4.00.0, so it's pretty recent work. Previously, they were produced by a separate tool developed at OCamlPro for TypeRex.

Two reasons why it's not super easy

Well, this can actually be explain in just a few words. To begin with, there are many cases to address. Well, that is not exact. I should say that there are may cases to address and those cases are fairly poorly documented, which is "normal" given that there was probably no particular motivation to put efforts into documentation. This is true for the source code of ocamldoc and for its official documentation.

For instance, if you look into info.mli, you can see that the first type definition is:

type style_kind =
  | SK_bold
  | SK_italic
  | SK_emphasize
  | SK_center
  | SK_left
  | SK_right
  | SK_superscript
  | SK_subscript
  | SK_custom of string

and the only documentation for this type is (** The differents kinds of style. *). Well, you can see that there's not really any documentation needed for those SK_... until... you see SK_custom of string. There you go! You have to guess...

It's not that hard when you have to guess a few times, it's a lot harder when you keep having to guess. That's my point.

Well, the other issue is more interesting. At the time ocamldoc was designed and implemented, I bet no one imagined what folks at JaneStreet would do with OCaml! I'm talking about the implementation of their open source library, known as Core. "Core & co" use a lot of include directives. The module Core.Std includes a lot of other modules, which also include modules. If you want to generate a single full HTML page for Core.Std, you'd end up with a huge page. And such a page would contain a lot of information coming straight from other pages, so you'd end up with hundreds of megabytes. Instead of doing so, opamdoc generates only one page per package and one per module. If a module includes another one, then the first will fetch the documentation of the second and there you go. So we only have 8.4MB of HTML for {async, async_core, async_extra, async_unix, core, core_bench, core_extended, core_kernel} in total (well, this number should increase in the future, but linearly, as people will hopefully start documenting all those packages). And that's why we have a JavaScript loader.

opam_doc_config.ml, or docs_loader.js

Since the JavaScript may have to load tens of megabytes of HTML, you have to program some nasty functions and loops... and at some point it does become big enough for your browser to stop responding while it's busy loading your documentation. So there are several solutions to that. The best would probably be to stop writing in JavaScript (and use something else that compiles to JavaScript). But that's for next step. Now, we're trying to make the JavaScript work.

The problem with JavaScript is that basically there is one "event" loop, and all events are happening sequentially or concurrently, depending on how you wrote your JS, but when one event is making the browser busy, the browser is unable to do anything else. That's why your browser may tell you at some point that you have a script making it ill and ask you whether you want it to stop executing that script. One workaround for that problem when you know you ask for a lot of computation time is to divide your big computation into smaller ones. You can use window.setTimeout for that, meaning you transform your recursive calls like f() into window.setTimeout(f, 0) so that f will be called at some point. And if you have iterative loops instead, write them recursive and use window.setTimeout. That's bad. Because then the browser is unable to tell that it's crazy busy... and it's still really busy. So you can increase the value 0 to 1 or 10 or more. But if you increase it too much, it'll become too slow...

Ok, well, don't use JavaScript. It's a nightmare.

We will probably rewrite the script using Js_of_ocaml at some point, mainly because when you write in OCaml, you have static type checking, and that saves so much time!

To be followed...

OPAM-DOC with multiple OCaml packages

OPAM-DOC

The tool ocamldoc shipped with the standard distribution of the OCaml compiler suffers from some limitations because it doesn't have access to all the information it needs. Notably, if you want to build the documentation of some package a that depends on some package b, you'll find out that the documentation of the package a will not automatically refer to the documentation of the package b. Well, opamdoc solves this problem and it's very nice.

On this site, which is work in progress, you can find a list of packages with their descriptions. This list is built using opam2web. And each package description links to its documentation. And the documentation for all packages is built as a whole, which means that if the documentation of a package links to all other packages on which it depends. Isn't it nice? :)

Install and run opamdoc

opamdoc is still under active development, the current step is finding and fixing remaining issues and packaging.

If you want to use opamdoc now, here's some instructions based on Anil's instructions, which are those:

opam remote add opamdoc git://github.com/avsm/opamdoc-dev
opam switch 4.01.0beta1+opamdoc
eval `opam config env`
opam install opamdoc
opamdoc-rebuild
opamdoc-generate
open ~/.opam/4.01.0beta1+opamdoc/opamhtml/index.html
  • opamdoc-rebuild builds a database of cmt/cmd files in ~/.opam/4.01.0beta1+opamdoc/opamdoc.
  • opamdoc-generate builds a package HTML list in ~/.opam/4.01.0beta+opamdoc/opamhtml.

Well, they work quite well, except opamdoc-rebuild uses .cmd files as a base, whereas it should use .cmt files instead because sometimes .cmt files exist while .cmd don't. This is a problem for a few packages (Core is one of them).

My advice: if something doesn't work as expect, please open an issue, and if you want to customize the way the documentation is built, don't hesitate to edit opamdoc-rebuild and/or opamdoc-generate (they are Bash scripts).

A few notes about those instructions

The first line, which is opam remote add opamdoc git://github.com/avsm/opamdoc-dev, will give sense to the second one, which is opam switch 4.01.0beta1+opamdoc. And 4.01.0beta1+opamdoc is a customised compiler that produces on the side the .cm{t,d}{,i} files that are used by opamdoc. Using it is very convenient! If you don't use it, you need to produce those files in an other way of your choice (really? you want to go through such trouble?).

Note that opamdoc is written in such a way that it would ask a lot of work to make it work with versions of the compiler prior to 4.01.0 because it uses features introduced by this very version.

Just in case, here follow the customised version of opamdoc-rebuild and opamdoc-generate that I used to build the documentation available on http://ocaml-redesign.github.io/pkg/docs/.

opamdoc-rebuild

Here's my customised version of opamdoc-rebuild:

#!/usr/bin/env bash
# Rebuild the cmd/cmt archive in ~/.opam/<switch>/opamdoc
# Copies all the cmt/cmti/cmd files found in the OPAM build
# dir into a single directory structure, separated by MD5
# to keep files distinct.

set -e
#dry=echo
SWITCH=$(opam switch show)
if [ "${SWITCH}" = "system" ]; then
  echo Must be using a custom OPAM switch for this to work.
  exit 1
fi

function calc_md5_for_file()
{
  if builtin command -v md5 > /dev/null; then
    md5=$(cat $1 | md5)
  elif builtin command -v md5sum > /dev/null ; then
    md5=$(cat $1 | md5sum | awk '{print $1}')
  else
    echo "Neither md5 nor md5sum were found in the PATH"
    exit 1
  fi
}

BASE="$(dirname $(dirname $(ocamlc -where)))"
BUILD=${BASE}/build
DOC=${BASE}/opamdoc
rm -rf ${DOC}
mkdir -p ${DOC}

PKGS=$(find ${BUILD}/ -mindepth 1 -maxdepth 1 -type d "$@")
echo ${PKGS}

for pkg in ${PKGS}; do
  pkgname=$(basename $pkg)
  echo $pkgname
  mkdir -p ${DOC}/$pkgname

  CMTS=$(find ${pkg} -type f -name '*.cmt')
  for cmt in ${CMTS}; do
    d=$(dirname $cmt)
    calc_md5_for_file "$cmt";

    f=$(basename $cmt .cmt)
    mkdir -p ${DOC}/$pkgname/$md5
    r=${DOC}/$pkgname/$md5/$f
    for ext in cmt cmti cmd cmdi cmi ; do
        if [ -e $d/$f.$ext ]; then
            $dry cp $d/$f.$ext $r.$ext
        fi
    done
  done
done

opamdoc-generate

Here's my customised version of opamdoc-generate.

#!/usr/bin/env bash
# Rebuild the cmd/cmt archive in ~/.opam/<switch>/opamdoc
# Copies all the cmt/cmti/cmd files found in the OPAM build
# dir into a single directory structure, separated by MD5
# to keep files distinct.

set -e
#dry=echo
SWITCH=$(opam switch show)
if [ "${SWITCH}" = "system" ]; then
  echo Must be using a custom OPAM switch for this to work.
  exit 1
fi

OPAMDOC=${OPAMDOC:-opamdoc}


BASE="$(dirname $(dirname $(ocamlc -where)))"
BUILD=${BASE}/build
DOC=${BASE}/opamdoc
HTML=${BASE}/opamhtml

rm -rf ${HTML}
mkdir -p ${HTML}

# Grab the build dir in reverse mtime order to get the
# ordering of generation "correct".
PKGS=$(ls -1tr $BUILD)

rm -rf $HTML
mkdir -p $HTML
cd $HTML
for pkg in ${PKGS}; do
  fs="$(find $DOC/$pkg -type f)"
  if [ "$fs" != "" ]; then
    name=$(echo $pkg | awk -F. '{print $1}')
    echo $pkg $name
    $dry $OPAMDOC -p $name $fs
  fi
done

Enjoy.

OMD: a Markdown parser in OCaml

Motivations

There are so many existing implementations of Markdown, why yet another one? Well, after asking the OCaml community about existing ways to parse and manipulate Markdown documents, it seemed that there were no stand-alone complete Markdown parser written in OCaml, meaning that they were incomplete (i.e., not fully compatible with Markdown) or interfaces to some Markdown parsers implemented using other languages than OCaml.

Since in OCaml Labs we're working a lot with Github, and Github uses Markdown a lot (Github web pages hosting, Github issues, etc.) and other sites are also using Markdown as well, and Markdown is very popular and easy to learn, and flexible in the sense that you can always fall back to HTML when you want to express something that doesn't have a special syntax in Markdown, blah blah blah, it was - somehow - time to have a Markdown parser implemented using OCaml. And preferably OCaml alone, meaning that one that has OCaml installed should be able to use it easily without having to deal with too many dependencies. Well, there it is: OMD!

Issues... were essentially with Markdown itself

Every computer scientist knows how to write a parser for a language that is as simple as Markdown. Well, no, not every computer scientist, sadly, but at least every programmer should. Ok, sadly it isn't really the case either. Anyways.

Markdown is a rather simple language, if you look at the specs. Well, that depends on what you actuall call "specs", of course. According to Wikipedia, Markdown was invented by John Gruber, and Aaron Swartz helped him. Then the original document that describes the Markdown language is available online there: http://daringfireball.net/projects/markdown/syntax. And you can see that searching for "Markdown+syntax" on Google makes that page the top result (and here again, I'm kind of helping it be and stay on the top...).

Actually, Markdown is not that simple to parse. Why? Because so many things are just left to the imagination of the people who implement parsers for this language, because programs can't decide by themselves what's right to do. It's really the author(s) of the program that decides what the program does. And in Markdown, so many things are ambiguous and Gruber's document doesn't actually describe a grammar at all. It just tells you how to write this and that, but if you write it slightly differently, it doesn't tell you what the outcome would be and by just reading the document, you really can't tell what the output will be.

The one thing to have in mind is that there are no errors in Markdown. Every text file is a valid Markdown file, it might just be converted to some HTML you wouldn't expect. For instance, you may write a link using this syntax:

[blah blah blah](the-url-which-can-be-absolute-or-relative)

and it gets converted to

<p><a href="the-url-which-can-be-absolute-or-relative">blah blah blah</a></p>

But if you forget the closing parenthesis, then it becomes that instead

<p>[blah blah blah](the-url-which-can-be-absolute-or-relative</p>

precisely because nothing is wrong in Markdown.

And what if there are parentheses in your URL? What if they are unbalanced?

Flagrant Ambiguity

The following is some text that has to mean something in Markdown...

* Alice is falling in love with Bob.
    * Bob is secretly in love with Alice, but he's seeing Eve.
  * Eve is in-between, she loves both Alice and Bob, she can't help it.

So, Pandoc, which a tool that converts a document in language A to a document in language B, where A and B can be the same or different languages amongst LaTeX, HTML, Markdown and (many) others, considers that Eve is on the same level as Bob. So its HTML output is

<ul>
<li>Alice is falling in love with Bob.
<ul>
<li>Bob is secretly in love with Alice, but he's seeing Eve.</li>
</ul></li>
<li>Eve is in-between, she loves both Alice and Bob, she can't help it.</li>
</ul>

But if instead you add Dylan as in

* Alice is falling in love with Bob.
   * Dylan, a new character, is being conceived here.
    * Bob is secretly in love with Alice, but he's seeing Eve.
  * Eve is in-between, she loves both Alice and Bob, she can't help it.

then of course Eve is not on the same level as Bob anymore and goes with Dylan and Alice, Bob is on his own.

<ul>
<li>Alice is falling in love with Bob.</li>
<li>Dylan, a new character, is being conceived here.
<ul>
<li>Bob is secretly in love with Alice, but he's seeing Eve.</li>
</ul></li>
<li>Eve is in-between, she loves both Alice and Bob, she can't help it.</li>
</ul>

This doesn't make much sense...

And Github's embedded Markdown to HTML converter chooses some similar broken semantics. If one writes bullets on different levels, it shouldn't be meaningless.

Also, on Github, if you write

* 1
  * 2
    * 3
      * 4
        * 5
          * 6
            * 7
              * 8
                * 9
                  * 10

Then 2 and 3 are on the same level, as for 4 and 5, 6 and 7, and 8 and 9. 1 and 10 are on their own. And if you extract 2 and 3, meaning

  * 2
    * 3

then 2 and 3 are not on the same level anymore! See for yourself:

With OMD, hopefully, there is a deterministic meaning for each level of indentation. The list where there were Alice, Bob and Eve is converted to this to the least insane semantics I could think of.

The idea is that, since Eve is neither on the same level as Bob or Alice, Eve should be in a new list (because, obviously, she's the only one on that level anyway). So she is in a new list. And since she's not on a deeper level than Bob, she shouldn't be on a sub-list of his. But she is on a deeper level than Alice, so she has to be on a sub-list of hers. So, here is the HTML that OMD produces:

<ul>
 <li>Alice is falling in love with Bob.
  <ul>
   <li>Bob is secretly in love with Alice, but he&apos;s seeing Eve.
   </li>
  </ul>
  <ul>
   <li>Eve is in-between, she loves both Alice and Bob, she can&apos;t help it.
   </li>
  </ul>
 </li>
</ul>

Oh, you might have noticed that OMD converts quotes to &apos; because otherwise I would need to differentiate when they have to be converted from when it's optional.

Implementation

Pandoc's documentation says

In contrast to most existing tools for converting markdown to HTML, which use regex substitutions, Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document, and a set of writers, which convert this native representation into a target format. Thus, adding an input or output format requires only adding a reader or writer.

Come on, most tools are using regular expressions substitutions?! I can only imagine the nightmare that it must be to implement and debug such an implementation -- no wait, I can't because I just don't want to imagine such a nightmare.

I used functions and function calls, a lot of them are tail recursive, not all of them but then it means I don't need them to be tail recursive, and those functions basically take a list of tokens and return a new list with possibly fewer tokens and the expression to which the missing ones were converted into.

So far, in version 0.4 (which is not released yet at the time I write this), there's a little less than 8k lines of pure OCaml code. (Note that I didn't write "pure functional", I wrote "pure OCaml".)

OMD is an open-source free and libre software library that any OCaml developer can use (hopefully quite easily since it doesn't depend on anything else that the standard OCaml compiler and library). And of course, it's also a tool for any one who write Markdown and wants it to be converted (quickly) to HTML. OMD, so far, is about 10 times faster than Pandoc, and I didn't even make any efforts to make it fast.

Compatibility

OMD has been developed using OCaml 4.0.1, Christophe Troestler made me make it compatible with OCaml 3.12.1. Then I guess it might work with older version of OCaml but it's not certain (mainly because OCaml's standard library has slightly changed, as I think I don't use any language features that were introduced in 3.12 or 4.0).

By the way, thank you Christophe for your support, interest and contributions to OMD :-)

Future of OMD

OMD already is in OPAM. A very stable version of OMD should be released soon. As a tool, it takes Markdown as input and produces HTML as output. A Markdown backend has been freshly implemented, so you can output Markdown as well, which is quite useful for debugging or if you want to know how many iterations you need before you reach the fix point. You can also output "text", in the sense that it's basically the HTML without the HTML tags, so it's very non-formatted text. There also are options to output HTML table of contents and parametrise their depths.

OMD and OCaml.org

We are at OCaml Labs making a new website for ocaml.org. The design is being provided by onespacemedia. At the time I'm writing these lines, I'm using the HTML/CSS/JS for the upcoming OCaml.org to style my new Github-hosted website, so I can play with it more.

Most pages will be written in Markdown instead of HTML, so that people of the OCaml community may contribute to it in a more convenient way.

And of course, that means that OMD will be used for OCaml.org.


tags: