Blog
I'm trying to make this website my new blog, using OMD and MPP, with OCaml of course, but also the (awful) HTML/CSS/JS triplet.
Here follow the latest blog posts.
Functional visitors for a complex tree-shaped data structure
Let's consider this data type:
type e = | A | B of t | C of string * t * t | D of string | E of int
and t = e list
[try]
If we want to write an abstraction to apply a function to this data structure (the one of type t) without being bothered to treat the whole structure every time, we can get some inspiration from the visitor design pattern commonly used in object-oriented programming. The “problem” is that we don't have the same inheritance mechanism in functional programming, meaning that the default behaviour has to be implemented with functions instead of using methods that can be overridden in subclasses.
First attempt
The following implementation met my needs:
let rec visit f = function
| [] -> []
| (A | E _ | D _) as e :: tl ->
begin match f e with
| Some(l) -> l @ visit f tl
| None -> e :: visit f tl
end
| B(t) as e :: tl ->
begin match f e with
| Some(l) -> l @ visit f tl
| None -> B(visit f t) :: visit f tl
end
| C(s, t1, t2) as e :: tl ->
begin match f e with
| Some(l) -> l @ visit f tl
| None -> C(s, visit f t1, visit f t2) :: visit f tl
end
(* val visit : (e -> t option) -> t -> t *)
[try]
With this abstraction, it becomes very easy to write a function that removes all As:
let remove_A t = visit (function A -> Some [] | _ -> None) t
[try]
or all Bs:
let remove_B t = visit (function B _ -> Some [] | _ -> None) t
[try]
and it's also very easy to convert all Es to Ds:
let convert_Es_to_Ds t =
visit (function E(i) -> Some [D(string_of_int i)] | _ -> None) t
[try]
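To check that these helpers behave as expected, here is a self-contained session (the type and visit are repeated from above so it compiles on its own; the sample tree and the expected results are my own):

```ocaml
type e = A | B of t | C of string * t * t | D of string | E of int
and t = e list

(* The visitor from above, repeated so this snippet compiles on its own. *)
let rec visit f = function
  | [] -> []
  | (A | E _ | D _) as e :: tl ->
      begin match f e with
      | Some l -> l @ visit f tl
      | None -> e :: visit f tl
      end
  | B t as e :: tl ->
      begin match f e with
      | Some l -> l @ visit f tl
      | None -> B (visit f t) :: visit f tl
      end
  | C (s, t1, t2) as e :: tl ->
      begin match f e with
      | Some l -> l @ visit f tl
      | None -> C (s, visit f t1, visit f t2) :: visit f tl
      end

let remove_A t = visit (function A -> Some [] | _ -> None) t
let convert_Es_to_Ds t =
  visit (function E i -> Some [D (string_of_int i)] | _ -> None) t

let () =
  (* As get removed at every depth, including under B and C. *)
  assert (remove_A [A; B [A; E 1]; C ("c", [A], [D "d"; A])]
          = [B [E 1]; C ("c", [], [D "d"])]);
  (* Es become Ds, wherever they are. *)
  assert (convert_Es_to_Ds [E 42; B [E 1]] = [D "42"; B [D "1"]])
```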
Doing this with some fold abstraction is not suitable because we would need to do something like
(* assuming some [fold : ('a -> e -> 'a) -> 'a -> t -> 'a] over the elements *)
let rec remove_B t =
  List.rev
    (fold
       (fun r -> function
         | B _ -> r
         | (A | E _ | D _) as e -> e :: r
         | C (s, t1, t2) -> C (s, remove_B t1, remove_B t2) :: r)
       []
       t)
[try]
in which case what bothers me most is the line that has C(s, remove_B t1, remove_B t2), because it means that we still have to make trivial recursive calls that are just annoying to write (trivial code should be avoided whenever possible because it increases the chances of introducing nasty bugs).
What about genericity?
Well, perhaps we might want to have a visitor that doesn't always return a t. Can we make it generic using polymorphic abstractions?
The implementation of visit returns a value, so if we want it to be able to return something else, we have to parameterise the default behaviour.
So, let's add a parameter to the function visit and see what it looks like. Well, let's call it glue, since it's a function that glues 2 things together (if you find a better name, please let me know).
let rec visit f glue = function
| [] -> []
| (A | E _ | D _) as e :: tl ->
begin match f e with
| Some(l) -> glue l (visit f glue tl)
| None -> glue [] (e :: visit f glue tl)
end
| B(t) as e :: tl ->
begin match f e with
| Some(l) -> glue l (visit f glue tl)
| None -> glue [] (B(visit f glue t) :: visit f glue tl)
end
| C(s, t1, t2) as e :: tl ->
begin match f e with
| Some(l) -> glue l (visit f glue tl)
| None ->
glue []
(C(s, visit f glue t1, visit f glue t2) :: visit f glue tl)
end
(* val visit : (e -> 'a list option) -> ('a list -> t -> t) -> t -> t *)
[try]
We're almost there. There's still that 'a list, which is less generic than 'a. Well, glue needs to have 2 parameters because it has to be able to treat both branches of the pattern matching, so we could make the first parameter optional. We could use optional arguments, and we can see that f already returns an option; can we just take advantage of that? Well, if we replace for instance
| (A | E _ | D _) as e :: tl ->
begin match f e with
| Some(l) -> glue l (visit f glue tl)
| None -> glue [] (e :: visit f glue tl)
end
[try]
by
| (A | E _ | D _) as e :: tl ->
glue (f e) (visit f glue tl)
[try]
it does work for this branch but it doesn't for the others because one purpose of the visitor is to deeply traverse the data structure automatically. So we're back to optional arguments...
And so we could have that:
let rec visit f ?(glue=(fun ?l r -> match l with None -> r | Some l -> l @ r)) = function
| [] -> []
| (A | E _ | D _) as e :: tl ->
begin match f e with
| Some(l) -> glue ~l (visit f ~glue tl)
| None -> glue (e :: visit f ~glue tl)
end
| B(t) as e :: tl ->
begin match f e with
| Some(l) -> glue ~l (visit f ~glue tl)
| None -> glue (B(visit f ~glue t) :: visit f ~glue tl)
end
| C(s, t1, t2) as e :: tl ->
begin match f e with
| Some(l) -> glue ~l (visit f ~glue tl)
| None ->
glue
(C(s, visit f ~glue t1, visit f ~glue t2) :: visit f ~glue tl)
end
(* val visit : (e -> t option) -> ?glue:(?l:t -> t -> t) -> t -> t *)
[try]
Then the problem is that it's not polymorphic any more, while we want it to be polymorphic!
So let's drop the default value for the parameter glue.
let rec visit f (glue : ?l:'a -> t -> t) = function
| [] -> []
| (A | E _ | D _) as e :: tl ->
begin match f e with
| Some(l) -> glue ~l (visit f glue tl)
| None -> glue (e :: visit f glue tl)
end
| B(t) as e :: tl ->
begin match f e with
| Some(l) -> glue ~l (visit f glue tl)
| None -> glue (B(visit f glue t) :: visit f glue tl)
end
| C(s, t1, t2) as e :: tl ->
begin match f e with
| Some(l) -> glue ~l (visit f glue tl)
| None ->
glue
(C(s, visit f glue t1, visit f glue t2) :: visit f glue tl)
end
(* val visit : (e -> 'a option) -> (?l:'a -> t -> t) -> t -> t *)
[try]
There we go, we have a generic visitor for structures of type t.
As we did before, we may now define remove_A, remove_B and convert_Es_to_Ds, using the following quite simple definition of glue.
let glue ?l t = match l with Some l -> l @ t | None -> t
[try]
let remove_A t = visit (function A -> Some [] | _ -> None) glue t
[try]
let remove_B t = visit (function B _ -> Some [] | _ -> None : e -> t option) glue t
[try]
let convert_Es_to_Ds t =
visit (function E(i) -> Some [D(string_of_int i)] | _ -> None) glue t
[try]
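To see that the extra polymorphism actually buys us something, here is a self-contained sketch where 'a is int rather than a list: each E i gets replaced by i copies of A. The visitor is the generic one from above (repeated so this compiles alone); dup_glue and expand_Es are names of my own invention.

```ocaml
type e = A | B of t | C of string * t * t | D of string | E of int
and t = e list

(* The generic visitor from above, repeated so this compiles on its own. *)
let rec visit f (glue : ?l:'a -> t -> t) = function
  | [] -> []
  | (A | E _ | D _) as e :: tl ->
      begin match f e with
      | Some l -> glue ~l (visit f glue tl)
      | None -> glue (e :: visit f glue tl)
      end
  | B t as e :: tl ->
      begin match f e with
      | Some l -> glue ~l (visit f glue tl)
      | None -> glue (B (visit f glue t) :: visit f glue tl)
      end
  | C (s, t1, t2) as e :: tl ->
      begin match f e with
      | Some l -> glue ~l (visit f glue tl)
      | None -> glue (C (s, visit f glue t1, visit f glue t2) :: visit f glue tl)
      end

(* Here 'a = int: [f] extracts the integer carried by an [E], and the glue
   turns that integer into that many [A]s (List.init needs OCaml >= 4.06). *)
let dup_glue ?l t = match l with
  | Some n -> List.init n (fun _ -> A) @ t
  | None -> t

let expand_Es t = visit (function E i -> Some i | _ -> None) dup_glue t

let () =
  assert (expand_Es [E 2; D "x"; B [E 1]] = [A; A; D "x"; B [A]])
```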
We could actually make it more generic by allowing glue to return something else than a t, and we could have something resembling val visit : (e -> 'a option) -> (?l:'a -> t -> 'b) -> t -> 'b.
Well, the problem is that we have to tell visit how to reconstruct a visited C. It would be easy to have a more generic version of visit of type val visit : (e -> 'a option) -> (?l:'a -> t -> ('b * t)) -> t -> ('b * t) but that becomes far from simple to understand: it looks like some sort of visit and fold merged together. Hmmm... Let's not go that far, for now.
High-power LEDs for super-bright light
I've been "playing" with COB LEDs for about half a year now. Here's a summary of my experience.
Preliminaries
COB LEDs
A "COB LED" is basically a bunch of tiny LEDs put together to provide a very energy-efficient and usually very bright LED (using both parallel and series circuit schemes). Note that it means the voltage of a COB LED is not the voltage of a "normal LED", since putting some LEDs in series means adding up voltage values!
An LED does not behave like an incandescent light bulb
Also, note that while there's a roughly linear relation between light intensity and voltage for classic lamps, it's not the case at all with LEDs! With LEDs, basically, the light intensity varies with the current, not the voltage. So if you lower the voltage by a very little bit, it will shine as bright as long as the current keeps up; if you lower the voltage too much, it basically stops emitting light (e.g., if you give it 16V instead of 19V, you get so little light that it's really useless); but if you increase the voltage too much, you will fry your LEDs, so you don't want to do that at all.
Light colors, measured in Kelvin
Basically, 2700K to 3300K corresponds to classic (discontinued) incandescent light bulbs, and it's quite close to classic candle light, which is about 1850K. 6500K is daylight as in "the light provided by the Sun on a sunny day at noon", which seems very cool if it's dark outside or if you have warm light around. My favourite artificial light color is generally from 4000K to 5000K. I find 3000K quite acceptable; below that is too warm, and above 6000K is too cool.
Don't get me wrong: in this post, "cool" is not a synonym for "awesome"; stick to its original meaning instead.
Color spectral distribution
Note that LEDs generally have a very narrow light color spectrum. That means that instead of emitting light of a lot of different colors, including some colors you don't see or colors that are barely noticed because they are less intensely emitted, LEDs will emit just the color you asked for. That is not such a good thing (it's not so bad either). But as a consequence, you might feel that the light emitted by an LED is not very comfortable. What I do is that I mainly use warm light LEDs and I add a few cool light LEDs amongst them: it goes from 1 cool for 6 warms to 1 for 1. The result of 1 for 1 is quite nice when it's bright outside, but it's not so nice in the middle of the night.
spectral power distribution: read more on Wikipedia; find more on Google.
LED color rendering index: read more on Wikipedia; find more on Google.
Why not simply buy normal lamps?
Well, buying new lamps to have more light means buying light bulbs as well. I'm not a very big fan of light bulbs that contain mercury, because if you ever break one, you'll breathe mercury gas (most likely in small quantities, though). And I can't install lights on the wall without asking my landlady for permission, and even if I did have permission, it would mean putting electric wires inside the wall, and that's not really an idea that speaks to me (I've never done that), and it means painting the wall afterwards (not a big fan of painting walls either). What about LEDs then? Don't I need to install them on the wall anyway? Well, no, I don't! I have access to some of the iron-based framework, and as LEDs are lightweight and flat, I stick them on some material on which I put some magnets, et voilà! (Now I'm an even bigger fan of strong magnets than before.)
Well, I won't deny it, the main reason why I wanted to use LEDs is that it's super high tech, it's fun, it's powerful and energy-efficient (and ecological), it's nice, it's the future, it's geek. Also, we plan on using LEDs massively in University of Cambridge buildings, starting with the department of computer science.
What I've used
LEDs
I've essentially used 6W COB LEDs, which — according to the specs given by their manufacturers — are as efficient as 120lm/W or 108lm/W, depending on whether you prefer white light or warm white light. I gave a try to a more classic (though high-power) LED before: 9 pieces of 1W LEDs put together on an aluminium disc. Well, I switched to COB LEDs for three reasons. The first one is that COB LEDs are more energy-efficient. The second one is that the voltage of the COB LEDs that I use is higher (19V instead of 10V), which means that for a given wattage, the current intensity (amperage) is lower, and that means less chance for the wires to heat up... (I'm not an electronics engineer nor a physicist, please correct me if I'm wrong.) And the third one is that COB LEDs are very compact. The "9 pieces of 1W LEDs on a disc" is really big: the diameter is 88mm. The drawback is that the COB LEDs that I use are only available in the 6500K and 3000K color temperatures, whereas the other ones are available in 6500K, 4000K, 3000K and 2500K; 4000K being my favourite.
Power supply units
It's not that easy to provide electrical power to LEDs... because we need a potentially quite high amperage, the right voltage, and the latter has to be very stable (or at least never exceed the expected value, otherwise the LEDs burn and die quite easily, and we don't want to kill some hardware that's supposed to last from ten up to thirty years).
I recommend this power supply unit (PSU) from Rapid if you're powering at most 30 watts of LEDs at a voltage of 24V or less, for three reasons: it seems to me so far that it's a very good quality PSU with well-regulated voltage; it provides up to 30W without complaining or heating up; and it switches on very quickly (virtually instantaneously). I have used 2 of them for about 6 months. (I haven't tried to make it work at 36W because it would supposedly diminish its lifespan.)
Well, that 36W PSU is currently (early October 2013) out of stock, and since I visited (randomly) a Maplin store and saw this 120W laptop power supply unit for less than £40, I bought two of them to give them a shot. They are a lot cheaper than the 36W PSU I've been using so far, if we compare the price-over-power ratio. Well, the voltage they deliver seems okay: according to my cheap multimeter (so it's not really the most reliable measurement), they deliver 19.2V when using a 19V head. The main drawback of this adaptor is that it takes about 5 seconds to "boot"! (Yes, it means it will not deliver power during those 5 seconds.) It's not really a problem for a laptop, but for a lighting system, having to wait 5 full seconds just seems like forever. So it means that if I want an instantaneous light-on feature, I need a switch between the PSU and the LED instead of between the wall plug and the PSU. Well, in practice, I think I'll simply wait 5 seconds, because I'll only use them when I want a very very bright light, since I already have 2 LED-based lighting devices powered by those 36W PSUs that provide power instantaneously.
The other drawback is that it's made to work with (existing) laptops and they provide one connector per model (i.e., shape of the head and voltage), so you can't use any connector and still assume you'll be able to select the voltage: here, the head determines the voltage you'll get.
Heat!
One of the biggest (potential) problems with COB LEDs is that, as they are very powerful and very tiny, you're supposed to stick them on something that can dissipate the heat that they produce. What I learnt from that is that iron is a very very slow heat conductor compared to aluminium. I'm using aluminium rulers because it's the cheapest flat thing made of aluminium that I could find. Three rulers very easily dissipate the heat produced by 4 LEDs. I tried tripling the density of LEDs... It was not a good idea, as the rulers become hot quite rapidly: within a few minutes, they reach close to 50°C. So with 4 LEDs per ruler, active fan cooling may be needed to dissipate the heat.
Sticking COB LEDs
To stick LEDs, I used some akasa thermal adhesive tape that I bought on Amazon, and it works pretty well.
The worst part...
... is obviously, to me, the soldering part. It's quite hard to solder those COB LEDs because the contact spots are very small, so it takes a tremendous amount of time.
Overall
Why bright light is important...
I feel that it's important to me to face (very) bright light during the day, and it's best if it comes directly from the sun. Apparently (although I'm not a hundred percent sure) it's actually important to be exposed to (very) bright light at least 30 minutes a day.
So, instead of using a very expensive S.A.D. light to prevent potential seasonal affective disorder, well, I'm using very bright state-of-the-art LEDs to light up my home. :-)
Conclusion
LEDs are awesome because they need very little space compared to other light-emitting devices.
As they're very energy-efficient, using them for indirect lighting doesn't consume too much electricity.
They can last up to 10 or even 30 years if they're well taken care of. So once you make them, you don't need to worry about them for a while... Well, I'm not sure the power supply units can last that many years, but they're not the hard part to change.
If you live in Cambridge (UK), you could use Makespace to build your own LED-based super bright and/or super smart lighting system. Well, given my low budget when I started building those LED-based lighting systems, I couldn't afford Makespace membership, so I kind of have all I need to handcraft at home.
Pictures
4 COB LEDs on 3 aluminium rulers, in parallel. The 3 aluminium rulers are stuck together using aluminium tape, which has good thermal conductivity, reflects light not too badly and is not too expensive.
If you do want LEDs but don't want to solder anything
Well, you may find a lot of websites on Google selling LEDs, you may also try this one, which belongs to a friend of mine.
(Example) A use case of MPP in an OCaml program
Let's say you have a big variant type such as this one:
type t = A | B | C | D | E | F | G
[try]
and you want to implement a t -> string function.
With MPP, you can write this:
{< let l = ["A"; "B"; "C"; "D"; "E"; "F"; "G";] >}
type t = {< let _ = List.iter (Printf.printf "| %s ") l >}
let to_string = function
{< let _ = List.iter (fun e -> Printf.printf "| %s -> %S" e e) l >}
and then you have that:
type t = | A | B | C | D | E | F | G
let to_string = function
| A -> "A"
| B -> "B"
| C -> "C"
| D -> "D"
| E -> "E"
| F -> "F"
| G -> "G"
[try]
As you may see, if you grow the type t by adding more cases, you don't have to manually update the function to_string.
OPAMaging MPP
Please read this if you want to know more about MPP.
Announcement: I've OPAMaged MPP!
So, I've eventually OPAMaged MPP (i.e., made an OPAM package for MPP)! Now MPP officially has a version number, and at the time I'm writing these lines, the current version is 0.1.0.
Happy that it eventually worked...
Yes, I'm quite happy that MPP now has an OPAM package. To do so, I copied the _oasis and the Makefile files from OMD and adapted them for MPP. At first I wanted to avoid using oasis because, for reasons that are still unknown to me, some people told me (half-joking?) that it was better not to depend on oasis... Well, actually I tried not to depend on oasis! I tried, but just couldn't build, in a reasonable amount of time, a Makefile that met the requirements for an OPAM package. Building MPP is quite trivial; installing it in the right place and being able to uninstall it, well, less trivial (perhaps I've just missed a simple thing in the process).
A side note
Although the MPP project started a while ago now, it's still at a pretty early stage of development, since I've been mostly working on the tools that are needed to build the new ocaml.org website, such as OMD (read about the experience here), which I've developed from scratch, and OPAM-DOC, which I had to contribute to at some point because we had to fix it while having a very small amount of time.
Now that OMD has become quite stable (although some bugs are still found from time to time, which is normal) ../..
No product of human intellect comes out right the first time. We rewrite sentences, rip out knitting stitches, replant gardens, remodel houses, and repair bridges. Why should software be any different? – Lauren Ruth Wiener
../.. and since I'm not the only contributor on opam-doc (Vincent Botbol is still participating, he fixed quite a nasty-to-fix bug last weekend, and Leo White is still working on the JavaScript part), it's not as time-consuming as OMD; but because it's not my code and because it's using some syntax extension (yeah, CamlP4's in the place, yo), patching it is not as simple as it could be...
Using MPP in two different ways
MPP is now an OPAM package
First of all, I've to say I've finally made an OPAM package for MPP! At the time I'm writing these words, it's still waiting for merging. Read more here.
Description of MPP
MPP is a preprocessor that has been designed to allow any (non-insane) programming language to be used as a preprocessor for any text-based file. The particularity is that the users keep their initial language as their primary language: the language(s) brought by MPP is (are) secondary. It means that ideally, if you give a random text file to MPP, MPP will return it unchanged! And in practice, you can always parameterize MPP (on the command line, or in an MPP file) such that for a given text file, MPP will return it unchanged.
Of course, the purpose of MPP is not to replace the "cat" command! But having the property that you can guarantee that MPP will behave as the "cat" command if you want it to is important, because it means you can customise MPP such that MPP can work for any text file.
Side note:
Well, if you discover a language that has some kind of infinite semantics, somehow automatically defined by itself, such that every possible sequence of bytes has a meaning in this language, MPP might not work, because you would have a hard time finding a token that's not already taken, since they're all already taken... But I don't think that it's a realistic scenario, so let's just forget about it.
Principle of MPP
The principle on which MPP is based is pretty simple. The users define the tokens they want to use to determine the beginning and the end of MPP blocks. And there are 3 kinds of MPP blocks:
- basic MPP block;
- basic MPP block with possible nesting;
- foreign MPP block.
The first and the second are used for the same things, except that the second allows nesting. The third kind is the most interesting one, as it's the one that allows embedding virtually any programming language as a preprocessor language.
Side note:
Well, you can't easily use pure lambda calculus as a preprocessor language. Your programming language has to have some notion of input/output to process text. Using pure lambda calculus might be very painful and it would start by making it impure enough to interact with the rest of the world.
Example 1: the basics
So, let's say you're writing a program using the OCaml language, and you want x to be a random integer. You can write this:
You can write this:
let x = Random.int max_int
and if you want to be sure that the number is different each time you execute your program, you should use Random.self_init or an equivalent.
Now what if you want a random number to be generated at compile time? You can't do it in pure OCaml. You need to preprocess your file and/or to read from another file...
With MPP, you can write this:
let x = << cmd echo $RANDOM >>
if you use mpp with these parameters: -so '<<' -sc '>>' to set the opening and closing tokens to << and >>.
Default tokens are respectively (( and )), but if you're writing some OCaml code, you shouldn't use (( and )), because there's quite a chance that they appear as OCaml code in your file.
What happens with MPP is that it simply echoes everything that is not MPP, and when it's MPP, it interprets the commands.
The cmd builtin is probably the simplest and the most powerful, since it calls your shell with what follows cmd. The first line is the command that is called, and what follows is given to the command as its standard input.
So, basically, let x = << cmd echo $RANDOM >> mod 42 given to MPP has the same semantics as this Bash script:
echo -n 'let x = '
echo $RANDOM
echo ' mod 42'
Now, instead of using non-nesting tokens, you can use nesting tokens, the only difference being that you can nest commands. Wait, what if you nest with non-nesting commands? Well, MPP has a very simple syntax: if you can't nest, you really can't nest. So, if you write << echo << echo 42 >> >>, it just doesn't work! Moreover, you have a syntax error, because the first >> is seen as the closing token for the first <<, and MPP stops when there's a syntax error. If you want to echo << echo 42 >>, it is possible; you just have to name your block, as in <<name echo << echo 42 >> name>>, and then it just works!
Note that with nesting tokens, you have to name all outer tokens. The default tokens are {{ and }}. So, to nest some commands, you can write for instance {{name set x {{ cmd cat foo.txt }} name}}. Note that the innermost block doesn't have to be a nesting one, as it doesn't nest anything; it means you could write {{name set x << cmd cat foo.txt >> name}} instead. And note that only the innermost nesting block can be nameless, and that's in some cases only.
Note: the technique presented above (in example 1) has been used to generate the new OCaml website.
Example 2: using OCaml as a preprocessor language
When MPP is used with the option -l, it produces a program that prints the file, using information provided to MPP about the language specified with -l. For instance, mpp -l ocaml will produce an OCaml program.
If you apply mpp -l ocaml to a file F that contains no MPP commands, then you have an OCaml program that, when executed, produces the original file F.
When you use -l ocaml, all "foreign blocks" are considered as OCaml code and they just "pass through".
So, if you write 42 {< let x = 23 >} Hello, it will be translated to this program:
let _ = print_string "42 "
let x = 23
let _ = print_string " Hello"
It's not exactly as flexible as "you can write anything at the top-level". Indeed, you may write any valid top-level definition. It could be a type definition, a module definition, really anything you want that is valid at the top level and that does not require double semi-colons. That is, you cannot write {< print_int 42 >} because it's not a valid top-level phrase; you should write {< let _ = print_int 42 >} instead.
If you really love playing at the edges, you could write {< ;; print_int 42 ;; >} and it might work... if there is something preceding it. If there isn't (but that's not trivial to guess, because the semantics of MPP might change on this precise point, I don't know, I can't promise right now), then you could perhaps write {< print_int 42;; >}.
But you can't convert a block of text into an OCaml function. You can't write something like you would do in PHP, such as:
<ul>
{< for i = 1 to 42 do >}
<li> {< print_int i >} </li>
{< done >}
</ul>
Oh but wait! You can use functors...
So in functor-style, you can write this:
<ul>
{< module I (X:sig val i : int end) = struct >}
<li> {< let _ = print_int X.i >} </li>
{< end >}
{< let _ = for i = 1 to 42 do
let module M = I(struct let i = i end) in
()
done >}
</ul>
Sorry if that hurts your eyes, but life is not always easy. Oh, and if you think it's overkill, well, of course it is, that's a tiny toy example! I'm actually using this functor-style to write this blog. It means that I generate a template with the functor-style headers and footers and I write inside the functor using Markdown, and when I need to write some OCaml code, I use MPP features. And when I need to print MPP stuff, I use the 2 worlds of MPP, but it's late and I don't want to go into details just right now. What's important is that it's nice and it works. I've always wanted to be able to easily embed the OCaml language in any text (LaTeX, ML, HTML, Bash, etc.) file, and now I can!! :)
(Exercise) Odd or even?
This is the exercise:
Given a string (of bytes), is the number of bits set to 1 odd or even?
Let's now find some possible solutions.
Well, the most naive way is to count them all, and then see if the number is even or odd. But it's probably better not to use an integer to count them all and then see if that counter is odd or even, since we can instead use a Boolean: each time we see a 1, we can simply flip the Boolean value!
let has_an_odd_number_of_1 s =
let res = ref false in
for i = 0 to String.length s - 1 do
let x = int_of_char s.[i] in
for j = 0 to 7 do
if (x lsr j) land 1 = 1 then
res.contents <- not res.contents;
done
done;
res.contents
[try]
Side note:
Moreover, we can see that using a Boolean means it will not limit the number of bits we may read, whereas an integer... well, actually, when we reach the maximum value of integers (of type int, or Int32.t or Int64.t, which — in OCaml — are respectively Pervasives.max_int, Int32.max_int and Int64.max_int), adding one will make them even negative numbers (max_int, from Pervasives, Int32 or Int64, are all odd numbers), so we should still get the odd/even outcome right. But then, we would have to be sure that we have correctly designed the code that tells us if an integer is odd or even. For instance, we can write is_odd in several ways. Well, let's start with a very classic, mathematically elegant implementation of two mutually recursive functions:
let rec is_odd = function
  | 0 -> false
  | n -> is_even (n - 1)
and is_even = function
  | 0 -> true
  | n -> is_odd (n - 1)
[try]
Note that it needs a lot of computation when the integer is "big"...
And this is probably the shortest version:
let is_odd n = n mod 2 <> 0 [try]
Now note that we cannot write this and expect it to work properly in all cases:
let is_odd n = n mod 2 = 1 [try]
and that's because if n is negative and odd, then n mod 2 is -1, not 1.
A tricky version that uses properties on integers:
let is_odd n = n <> 0 && ((n / 2) * 2) <> n [try]
Another tricky version that uses properties on machine integers:
let is_odd n = n land 1 = 1 [try]
Well, it has the same number of characters as the shortest version, and this version is probably the most efficient, because we know that the land operation is easy for the computer, whereas the mod operation implies a division, which is not an easy one. The mod could be optimised in some cases, such as mod 2, because it's basically the same as land 1 except that the sign has to be preserved! So, it's likely to still cost more...
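To double-check that n land 1 = 1 agrees with n mod 2 <> 0, including on negative numbers, here is a small self-contained test (the function names are mine):

```ocaml
(* The two candidate definitions discussed above. *)
let is_odd_mod n = n mod 2 <> 0
let is_odd_land n = n land 1 = 1

let () =
  (* On a two's-complement machine, [n land 1] extracts the low bit even
     for negative numbers (e.g. -3 land 1 = 1), so both versions agree. *)
  List.iter
    (fun n -> assert (is_odd_mod n = is_odd_land n))
    [min_int; -5; -4; -3; -2; -1; 0; 1; 2; 3; 4; 5; max_int]
```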
But we can do way better than that! Indeed, unless we really don't have a choice, we shouldn't read bits one by one, because it's slow!
So what can we do instead?
Well, there's a very very convenient logical operation on integers. It's called XOR (and the OCaml operator for that is lxor), and yes we can use it!
Note that 1 XOR 1 = 0 and 0 XOR 0 = 0, but 1 XOR 0 = 0 XOR 1 = 1. It's very convenient because it's basically giving us the answer! When we have two 1s, we can eliminate them; that's what happens with XOR. When we have a 1 and a 0, we keep the 1. So if we take the string byte by byte, and XOR the bytes with each other, at the end we will have one remaining byte that will tell us whether the whole string had an odd number of 1s or not.
let has_an_odd_number_of_1__efficient s =
let l = String.length s in
let rec loop c i =
if i = l then
let res = ref false in
for j = 0 to 7 do
if (c lsr j) land 1 = 1 then
res.contents <- not res.contents;
done;
res.contents
else
loop (c lxor int_of_char s.[i]) (succ i)
in loop 0 0
[try]
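As a sanity check (a test of my own, not from the post), the XOR trick computes the same answer as the naive bit-by-bit loop. Both functions are repeated here so the snippet compiles on its own:

```ocaml
(* The naive bit-by-bit version from earlier in the post. *)
let has_an_odd_number_of_1 s =
  let res = ref false in
  for i = 0 to String.length s - 1 do
    let x = int_of_char s.[i] in
    for j = 0 to 7 do
      if (x lsr j) land 1 = 1 then res := not !res
    done
  done;
  !res

(* The XOR-based version: fold all bytes with lxor, then check the
   parity of the single remaining byte. *)
let has_an_odd_number_of_1__efficient s =
  let l = String.length s in
  let rec loop c i =
    if i = l then begin
      let res = ref false in
      for j = 0 to 7 do
        if (c lsr j) land 1 = 1 then res := not !res
      done;
      !res
    end
    else loop (c lxor int_of_char s.[i]) (succ i)
  in
  loop 0 0

let () =
  List.iter
    (fun s ->
       assert (has_an_odd_number_of_1 s = has_an_odd_number_of_1__efficient s))
    [""; "\x00"; "\x01"; "\x03"; "abc"; "hello, world"; String.make 1000 '\x7f']
```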
And... There we go! :-)
Note that int_of_char is a zero-cost operation (but char_of_int is not, because it has to check that the value of the given integer is between 0 and 255 inclusive).
(Exercise) Are there more 1s or 0s?
The challenge is this:
Provide a program that will determine, given a set of 1s and 0s, if there are more 0s or more 1s.
So the first question that comes to my mind is about the representation of that set of 1s and 0s. Is it a singly linked list, a doubly linked list, an array, a tree, a string, or something else?
If it's a list (singly or doubly linked actually doesn't matter much), it's going to be pretty straightforward: just read the list and use a counter. Yes, a single counter is enough: you can add one to the counter when you meet a 1, and subtract one when you meet a 0. At the end, if your counter has its initial value, then you have the same number of 1s and 0s. If it's less, then there are more 0s; otherwise, there are more 1s.
Let's declare a type for the result.
type result = Equal | More_zeros | More_ones
[try]
let count l =
let rec loop counter = function
| [] -> counter
| 1::tl -> loop (counter+1) tl
| 0::tl -> loop (counter-1) tl
| _::tl -> loop counter tl
(* you might want to debate whether you should stop
when you get something else than a 0 or a 1. *)
in
let r = loop 0 l in
if r = 0 then
Equal
else if r > 0 then
More_ones
else
More_zeros
[try]
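For a quick sanity check, here is the function from above repeated in a self-contained form, applied to a few inputs of my own choosing:

```ocaml
type result = Equal | More_zeros | More_ones

(* The single-counter walk from above: +1 on a 1, -1 on a 0. *)
let count l =
  let rec loop counter = function
    | [] -> counter
    | 1 :: tl -> loop (counter + 1) tl
    | 0 :: tl -> loop (counter - 1) tl
    | _ :: tl -> loop counter tl
  in
  let r = loop 0 l in
  if r = 0 then Equal else if r > 0 then More_ones else More_zeros

let () =
  assert (count [1; 0; 1; 1] = More_ones);
  assert (count [0; 0; 1] = More_zeros);
  assert (count [] = Equal);
  assert (count [1; 0; 0; 1] = Equal)
```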
Well, what's the risk with this program? The risk is that the integer counter overflows, of course! If you have a very long list of 1s (or 0s) only, you may get things just wrong. However, in OCaml, in reality, there's not really a chance that the integer overflows because of the length of a singly linked list, especially if you're using a 64-bit architecture based OCaml, on which the greatest integer is 4_611_686_018_427_387_903 (about 4⨉10¹⁸). There's really a long way to go before having such a long list, because basically you would need to allocate more than about 32⨉10⁶ terabytes at once, since a linked list of integers is made of blocks that have 2 cells each (one for the integer, one for the address of the next cell), each cell taking 64 bits (or 8 bytes).
But then, what if you don't have linked lists but some stream that gives you a very large number of 0s and/or 1s? Well, to begin with, counting from 0 to 4×10^18 takes a really long time. If your machine can count from 0 to 10^9 in a single second (that would mean your machine is very fast), it would still take 4×10^9 seconds, which is about 4_000_000_000/(60*60*24*365) years. Oh, that means about 126 years! So let's just assume that a 63-bit signed integer is enough for us. And if you really can't assume that for some reason, you can always implement 128-bit signed integers quite easily, and if you don't know how to do that or if you're too lazy to do it, use the Big_int module.
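For the record, here is a minimal sketch of such a wider counter; it's a toy construction of my own, not a full 128-bit integer implementation. A pair (hi, lo) represents hi × 2^30 + lo, and lo is kept small enough that native ints never overflow:

```ocaml
(* A toy wide counter: (hi, lo) stands for hi * base + lo,
   with lo kept in the open interval (-base, base). *)
let base = 1 lsl 30

let incr_wide (hi, lo) =
  if lo + 1 = base then (hi + 1, 0) else (hi, lo + 1)

let decr_wide (hi, lo) =
  if lo - 1 = -base then (hi - 1, 0) else (hi, lo - 1)

(* Sign of the represented value: hi dominates since |lo| < base. *)
let sign_wide (hi, lo) =
  if hi <> 0 then compare hi 0 else compare lo 0

let () =
  assert (incr_wide (0, base - 1) = (1, 0));  (* carry into hi *)
  assert (decr_wide (1, 0) = (1, -1));
  assert (sign_wide (0, -5) = -1)
```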
But let's go back to the representation of those 0s and 1s. I'd like to make the computation as fast as possible, so I'll put those 0s and 1s in a very compact representation. Each 0 and 1 will now only take one bit in memory (with a possible constant overhead for the whole). For that, let's use OCaml's strings, which are basically arrays of bytes. The longest string I can have on my machine will bear 1_152_921_504_606_846_904 bits (I know that because I multiplied Sys.max_string_length by 8), and that's a lot (more than 10^18).
Now say we want to know whether there are more 0s or 1s as fast as possible. How do we do that?
We don't want to count all 0s and 1s bit by bit, because that would have quite a high cost! Indeed, we would have to get each byte, and for each byte we would have to read each of its 8 bits (that can be done) one by one. We don't want to do that.
Instead, since we have bytes, we can conveniently allocate an array of size 256. Each cell of that array will contain the right number to add to the counter (the number of 1 bits minus the number of 0 bits in the byte). This way, we can read a byte and get its contribution to the counter in O(1).
(* this table is computed only once *)
let table =
let number_for_a_byte b =
let r = ref 0 in
for i = 0 to 7 do
if (b lsr i) land 1 = 0 then
decr r
else
incr r
done;
!r
in
let a = Array.make 256 0 in
for i = 0 to 255 do
a.(i) <- number_for_a_byte i
done;
a
[try]
Then let's abstract from the means to read new 0s and 1s, by assuming
we'll be provided a function f
that given ()
will return 8 bits in
a value of type char
, and will raise the exception End_of_file
when it has no more bits to give.
let more_zeros_or_ones f =
let c = ref 0 in
begin try while true do
c := !c + table.(int_of_char(f()))
done with End_of_file -> () end;
if !c = 0 then
Equal
else if !c > 0 then
More_ones
else
More_zeros
[try]
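To try it out, here is a self-contained version together with a small reader built on a string. string_reader is a helper of mine, not part of the post's code: it serves the bytes of a string one by one and raises End_of_file at the end.

```ocaml
type result = Equal | More_zeros | More_ones

(* Per-byte balance: +1 for each 1 bit, -1 for each 0 bit. *)
let table =
  let number_for_a_byte b =
    let r = ref 0 in
    for i = 0 to 7 do
      if (b lsr i) land 1 = 0 then decr r else incr r
    done;
    !r
  in
  Array.init 256 number_for_a_byte

let more_zeros_or_ones f =
  let c = ref 0 in
  begin try while true do
    c := !c + table.(int_of_char (f ()))
  done with End_of_file -> () end;
  if !c = 0 then Equal
  else if !c > 0 then More_ones
  else More_zeros

(* A reader serving the bytes of a string one by one. *)
let string_reader s =
  let i = ref 0 in
  fun () ->
    if !i >= String.length s then raise End_of_file
    else (incr i; s.[!i - 1])

let () =
  (* '\xFF' is eight 1 bits; '\x00' is eight 0 bits. *)
  assert (more_zeros_or_ones (string_reader "\xFF\x00") = Equal);
  assert (more_zeros_or_ones (string_reader "\xFF\xFF\x00") = More_ones)
```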
Note that int_of_char
is zero-cost (i.e., it "disappears" at
compile-time because int
and char
sort of share the same memory
representation). If you want better performance, you should inline
f
, provided that you know what it is (you may want to check whether the
compiler does the inlining itself first, just in case).
Also, you may want to use a table with a size larger than 256 if you have a lot of memory but I'm not so sure you'd gain performance unless you use nasty tricks to read larger integers from a string. Then again, you may not end up using a string, in which case you have to think with the whole problem in mind.
On the Implementation of OPAMDOC
Early this year, Leo White started the implementation of opamdoc. Then Vincent Botbol worked on it from late May to mid August during his stay in Cambridge.
Now, Vincent's back to studying "computer science research" in Paris, and he continues working on opam-doc when he can on his free time.
I didn't really want to get involved too deeply in the implementation of opamdoc (mainly because it takes time). Eventually, since I ended up doing most of the technical-side implementation of the new ocaml.org web site, I had to get into opamdoc's source code to integrate its output in the website.
If you look at the source code of
opamdoc, you'll see there's a
bunch of files at the root directory. Well, some of them are inherited
from the implementation of ocamldoc, and some are new. I've mostly
contributed to
generate.ml
and
opam_doc_config.ml
.
The former implements the HTML backend of opamdoc, and the latter
contains the JavaScript engine that loads the documentation. This blog
post is mostly about my experience with those two files.
generate.ml
This big file (about 1.5 kloc) contains the functions to retrieve the information from cmt, cmd, cmti, cmdi and cmi files in order to build the HTML document.
Side note:
The
.cmt
files were officially introduced in the standard OCaml compiler in version 4.00.0, so it's pretty recent work. Previously, they were produced by a separate tool developed at OCamlPro for TypeRex.
Two reasons why it's not super easy
Well, this can actually be explained in just a few words. To begin with, there are many cases to address. Well, that is not exact. I should say that there are many cases to address and those cases are fairly poorly documented, which is "normal" given that there was probably no particular motivation to put effort into documentation. This is true both for the source code of ocamldoc and for its official documentation.
For instance, if you look into
info.mli
, you
can see that the first type definition is:
type style_kind =
| SK_bold
| SK_italic
| SK_emphasize
| SK_center
| SK_left
| SK_right
| SK_superscript
| SK_subscript
| SK_custom of string
and the only documentation for this type is
(** The differents kinds of style. *)
.
Well, you can see that there's not really any
documentation needed for those SK_...
until... you see
SK_custom of string
. There you go! You have to guess...
It's not that hard when you have to guess a few times, it's a lot harder when you keep having to guess. That's my point.
Well, the other issue is more interesting. At the time ocamldoc
was
designed and implemented, I bet no one imagined what folks at
JaneStreet would do with OCaml! I'm talking about the implementation
of their open source library, known as Core
. "Core
& co" use a
lot of include
directives. The module Core.Std
includes a
lot of other modules, which also include modules. If you want to
generate a single full HTML page for Core.Std
, you'd end up with a
huge page. And such a page would contain a lot of information
coming straight from other pages, so you'd end up with hundreds of
megabytes. Instead of doing so, opamdoc generates only one page per
package and one per module. If a module includes another one, then the
first will fetch the documentation of the second and there you go. So
we only have 8.4MB of HTML for {async, async_core, async_extra,
async_unix, core, core_bench, core_extended, core_kernel} in total
(well, this number should increase in the future, but linearly, as
people will hopefully start documenting all those packages). And
that's why we have a JavaScript loader.
opam_doc_config.ml, or docs_loader.js
Since the JavaScript may have to load tens of megabytes of HTML, you have to program some nasty functions and loops... and at some point it does become big enough for your browser to stop responding while it's busy loading your documentation. So there are several solutions to that. The best would probably be to stop writing in JavaScript (and use something else that compiles to JavaScript). But that's for next step. Now, we're trying to make the JavaScript work.
The problem with JavaScript is that basically there is one "event"
loop, and all events are happening sequentially or concurrently,
depending on how you wrote your JS, but when one event is making the
browser busy, the browser is unable to do anything else. That's why
your browser may tell you at some point that you have a script making
it ill and ask you whether you want it to stop executing that script.
One workaround for that problem when you know you ask for a lot of
computation time is to divide your big computation into smaller ones.
You can use window.setTimeout
for that, meaning you transform your
recursive calls like f()
into window.setTimeout(f, 0)
so that f
will be called at some point. And if you have iterative loops instead,
write them recursively and use window.setTimeout
. That's bad. Because
then the browser is unable to tell that it's crazy busy... and it's
still really busy. So you can increase the value 0 to 1 or 10 or more.
But if you increase it too much, it'll become too slow...
Ok, well, don't use JavaScript. It's a nightmare.
We will probably rewrite the script using Js_of_ocaml at some point, mainly because when you write in OCaml, you have static type checking, and that saves so much time!
To be followed...
OPAM-DOC with multiple OCaml packages
OPAM-DOC
The tool ocamldoc
shipped with the standard distribution of the
OCaml compiler suffers from some limitations because it doesn't have
access to all the information it needs. Notably, if you want to build
the documentation of some package a
that depends on some package
b
, you'll find out that the documentation of the package a
will
not automatically refer to the documentation of the package b
.
Well, opamdoc solves this problem and it's very nice.
On this site, which is work in progress, you can find a list of packages with their descriptions. This list is built using opam2web. And each package description links to its documentation. And the documentation for all packages is built as a whole, which means that the documentation of a package links to the documentation of all the packages it depends on. Isn't it nice? :)
Install and run opamdoc
opamdoc is still under active development, the current step is finding and fixing remaining issues and packaging.
If you want to use opamdoc now, here are some instructions based on Anil's instructions, which are these:
opam remote add opamdoc git://github.com/avsm/opamdoc-dev
opam switch 4.01.0beta1+opamdoc
eval `opam config env`
opam install opamdoc opamdoc-rebuild opamdoc-generate
open ~/.opam/4.01.0beta1+opamdoc/opamhtml/index.html
opamdoc-rebuild
builds a database of cmt/cmd files in ~/.opam/4.01.0beta1+opamdoc/opamdoc
.
opamdoc-generate
builds a package HTML list in ~/.opam/4.01.0beta1+opamdoc/opamhtml
.
Well, they work quite well, except opamdoc-rebuild
uses .cmd
files as a base, whereas
it should use .cmt
files instead because sometimes .cmt
files exist while .cmd
don't.
This is a problem for a few packages (Core
is one of them).
My advice: if something doesn't work as expected,
please open an issue,
and if you want to customize the way the documentation is built,
don't hesitate to edit opamdoc-rebuild
and/or opamdoc-generate
(they are Bash scripts).
A few notes about those instructions
The first line, which is opam remote add opamdoc git://github.com/avsm/opamdoc-dev
,
will give sense to the second one, which is opam switch 4.01.0beta1+opamdoc
.
And 4.01.0beta1+opamdoc
is a customised compiler that produces on the side the .cm{t,d}{,i}
files
that are used by opamdoc. Using it is very convenient! If you don't use it, you need to produce
those files in another way of your choice (really? you want to go through such trouble?).
Note that opamdoc is written in such a way that it would take a lot of work to make it work with versions of the compiler prior to 4.01.0, because it uses features introduced by this very version.
Just in case, here follow the customised versions of opamdoc-rebuild
and opamdoc-generate
that
I used to build the documentation available on http://ocaml-redesign.github.io/pkg/docs/.
opamdoc-rebuild
Here's my customised version of opamdoc-rebuild
:
#!/usr/bin/env bash
# Rebuild the cmd/cmt archive in ~/.opam/<switch>/opamdoc
# Copies all the cmt/cmti/cmd files found in the OPAM build
# dir into a single directory structure, separated by MD5
# to keep files distinct.
set -e
#dry=echo
SWITCH=$(opam switch show)
if [ "${SWITCH}" = "system" ]; then
echo Must be using a custom OPAM switch for this to work.
exit 1
fi
function calc_md5_for_file()
{
if builtin command -v md5 > /dev/null; then
md5=$(cat $1 | md5)
elif builtin command -v md5sum > /dev/null ; then
md5=$(cat $1 | md5sum | awk '{print $1}')
else
echo "Neither md5 nor md5sum were found in the PATH"
exit 1
fi
}
BASE="$(dirname $(dirname $(ocamlc -where)))"
BUILD=${BASE}/build
DOC=${BASE}/opamdoc
rm -rf ${DOC}
mkdir -p ${DOC}
PKGS=$(find ${BUILD}/ -mindepth 1 -maxdepth 1 -type d "$@")
echo ${PKGS}
for pkg in ${PKGS}; do
pkgname=$(basename $pkg)
echo $pkgname
mkdir -p ${DOC}/$pkgname
CMTS=$(find ${pkg} -type f -name '*.cmt')
for cmt in ${CMTS}; do
d=$(dirname $cmt)
calc_md5_for_file "$cmt";
f=$(basename $cmt .cmt)
mkdir -p ${DOC}/$pkgname/$md5
r=${DOC}/$pkgname/$md5/$f
for ext in cmt cmti cmd cmdi cmi ; do
if [ -e $d/$f.$ext ]; then
$dry cp $d/$f.$ext $r.$ext
fi
done
done
done
opamdoc-generate
Here's my customised version of opamdoc-generate
.
#!/usr/bin/env bash
# Generate the HTML documentation from the cmd/cmt archive
# in ~/.opam/<switch>/opamdoc, writing the package HTML
# list into ~/.opam/<switch>/opamhtml.
set -e
#dry=echo
SWITCH=$(opam switch show)
if [ "${SWITCH}" = "system" ]; then
echo Must be using a custom OPAM switch for this to work.
exit 1
fi
OPAMDOC=${OPAMDOC:-opamdoc}
BASE="$(dirname $(dirname $(ocamlc -where)))"
BUILD=${BASE}/build
DOC=${BASE}/opamdoc
HTML=${BASE}/opamhtml
rm -rf ${HTML}
mkdir -p ${HTML}
# Grab the build dir in reverse mtime order to get the
# ordering of generation "correct".
PKGS=$(ls -1tr $BUILD)
rm -rf $HTML
mkdir -p $HTML
cd $HTML
for pkg in ${PKGS}; do
fs="$(find $DOC/$pkg -type f)"
if [ "$fs" != "" ]; then
name=$(echo $pkg | awk -F. '{print $1}')
echo $pkg $name
$dry $OPAMDOC -p $name $fs
fi
done
Enjoy.
OMD: a Markdown parser in OCaml
Motivations
There are so many existing implementations of Markdown, why yet another one? Well, after asking the OCaml community about existing ways to parse and manipulate Markdown documents, it seemed that there was no stand-alone complete Markdown parser written in OCaml: the existing ones were either incomplete (i.e., not fully compatible with Markdown) or interfaces to Markdown parsers implemented in languages other than OCaml.
Since in OCaml Labs we're working a lot with Github, and Github uses Markdown a lot (Github web pages hosting, Github issues, etc.), and other sites are using Markdown as well, and Markdown is very popular and easy to learn, and flexible in the sense that you can always fall back to HTML when you want to express something that doesn't have a special syntax in Markdown, blah blah blah, it was - somehow - time to have a Markdown parser implemented in OCaml. And preferably OCaml alone, meaning that anyone who has OCaml installed should be able to use it easily without having to deal with too many dependencies. Well, there it is: OMD!
Issues... were essentially with Markdown itself
Every computer scientist knows how to write a parser for a language that is as simple as Markdown. Well, no, not every computer scientist, sadly, but at least every programmer should. Ok, sadly it isn't really the case either. Anyways.
Markdown is a rather simple language, if you look at the specs. Well, that depends on what you actually call "specs", of course. According to Wikipedia, Markdown was invented by John Gruber, and Aaron Swartz helped him. The original document that describes the Markdown language is available online here: http://daringfireball.net/projects/markdown/syntax. And you can see that searching for "Markdown+syntax" on Google makes that page the top result (and here again, I'm kind of helping it be and stay on the top...).
Actually, Markdown is not that simple to parse. Why? Because so many things are just left to the imagination of the people who implement parsers for this language, because programs can't decide by themselves what's right to do. It's really the author(s) of the program that decides what the program does. And in Markdown, so many things are ambiguous and Gruber's document doesn't actually describe a grammar at all. It just tells you how to write this and that, but if you write it slightly differently, it doesn't tell you what the outcome would be and by just reading the document, you really can't tell what the output will be.
The one thing to have in mind is that there are no errors in Markdown. Every text file is a valid Markdown file, it might just be converted to some HTML you wouldn't expect. For instance, you may write a link using this syntax:
[blah blah blah](the-url-which-can-be-absolute-or-relative)
and it gets converted to
<p><a href="the-url-which-can-be-absolute-or-relative">blah blah blah</a></p>
But if you forget the closing parenthesis, then it becomes that instead
<p>[blah blah blah](the-url-which-can-be-absolute-or-relative</p>
precisely because nothing is wrong in Markdown.
And what if there are parentheses in your URL? What if they are unbalanced?
Flagrant Ambiguity
The following is some text that has to mean something in Markdown...
* Alice is falling in love with Bob.
* Bob is secretly in love with Alice, but he's seeing Eve.
* Eve is in-between, she loves both Alice and Bob, she can't help it.
So, Pandoc, which is a tool that converts a document in language A to a document in language B, where A and B can be the same or different languages amongst LaTeX, HTML, Markdown and (many) others, considers that Eve is on the same level as Bob. So its HTML output is
<ul>
<li>Alice is falling in love with Bob.
<ul>
<li>Bob is secretly in love with Alice, but he's seeing Eve.</li>
</ul></li>
<li>Eve is in-between, she loves both Alice and Bob, she can't help it.</li>
</ul>
But if instead you add Dylan as in
* Alice is falling in love with Bob.
* Dylan, a new character, is being conceived here.
* Bob is secretly in love with Alice, but he's seeing Eve.
* Eve is in-between, she loves both Alice and Bob, she can't help it.
then of course Eve is not on the same level as Bob anymore and goes with Dylan and Alice, Bob is on his own.
<ul>
<li>Alice is falling in love with Bob.</li>
<li>Dylan, a new character, is being conceived here.
<ul>
<li>Bob is secretly in love with Alice, but he's seeing Eve.</li>
</ul></li>
<li>Eve is in-between, she loves both Alice and Bob, she can't help it.</li>
</ul>
This doesn't make much sense...
And Github's embedded Markdown to HTML converter chooses some similar broken semantics. If one writes bullets on different levels, it shouldn't be meaningless.
Also, on Github, if you write
* 1
* 2
* 3
* 4
* 5
* 6
* 7
* 8
* 9
* 10
Then 2 and 3 are on the same level, as are 4 and 5, 6 and 7, and 8 and 9. 1 and 10 are on their own. And if you extract 2 and 3, meaning
* 2
* 3
then 2 and 3 are not on the same level anymore! See for yourself:
- raw version: https://raw.github.com/pw374/sandbox/master/mad-lists.md
- rendered-by-Github version: https://github.com/pw374/sandbox/blob/master/mad-lists.md
With OMD, hopefully, there is a deterministic meaning for each level of indentation. The list with Alice, Bob and Eve is converted to the least insane semantics I could think of.
The idea is that, since Eve is neither on the same level as Bob or Alice, Eve should be in a new list (because, obviously, she's the only one on that level anyway). So she is in a new list. And since she's not on a deeper level than Bob, she shouldn't be on a sub-list of his. But she is on a deeper level than Alice, so she has to be on a sub-list of hers. So, here is the HTML that OMD produces:
<ul>
<li>Alice is falling in love with Bob.
<ul>
<li>Bob is secretly in love with Alice, but he's seeing Eve.
</li>
</ul>
<ul>
<li>Eve is in-between, she loves both Alice and Bob, she can't help it.
</li>
</ul>
</li>
</ul>
Oh, you might have noticed that OMD converts quotes to &#39;,
because otherwise I would need to differentiate when they have to be
converted from when it's optional.
Implementation
Pandoc's documentation says
In contrast to most existing tools for converting markdown to HTML, which use regex substitutions, Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document, and a set of writers, which convert this native representation into a target format. Thus, adding an input or output format requires only adding a reader or writer.
Come on, most tools are using regular expression substitutions?! I can only imagine the nightmare that it must be to write and debug such an implementation -- no wait, I can't, because I just don't want to imagine such a nightmare.
I used functions and function calls; a lot of them are tail recursive (not all of them, but where they aren't, it means they don't need to be). Those functions basically take a list of tokens and return a new list with possibly fewer tokens, together with the expression into which the consumed tokens were converted.
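To give an idea of this style, here is a deliberately tiny sketch of mine (nothing like OMD's actual code): each parsing function consumes some tokens from the front of the list and returns the elements it built, possibly together with the remaining tokens.

```ocaml
(* A toy token type: words and Star emphasis markers. *)
type tok = Star | Word of string

(* The corresponding document elements. *)
type inline = Text of string | Emph of inline list

(* When we see Star, try to read an emphasised span up to the
   closing Star; if there is none, the Star is plain text. *)
let rec parse = function
  | [] -> []
  | Word w :: tl -> Text w :: parse tl
  | Star :: tl ->
    (match until_star [] tl with
     | Some (inside, rest) -> Emph (parse (List.rev inside)) :: parse rest
     | None -> Text "*" :: parse tl)

and until_star acc = function
  | [] -> None                        (* no closing Star: give up *)
  | Star :: tl -> Some (acc, tl)
  | t :: tl -> until_star (t :: acc) tl

let () =
  assert (parse [Star; Word "hi"; Star; Word "there"]
          = [Emph [Text "hi"]; Text "there"]);
  assert (parse [Star; Word "x"] = [Text "*"; Text "x"])
```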
So far, in version 0.4 (which is not released yet at the time I write this), there's a little less than 8k lines of pure OCaml code. (Note that I didn't write "pure functional", I wrote "pure OCaml".)
OMD is an open-source free and libre software library that any OCaml developer can use (hopefully quite easily, since it doesn't depend on anything other than the standard OCaml compiler and library). And of course, it's also a tool for anyone who writes Markdown and wants it converted (quickly) to HTML. OMD, so far, is about 10 times faster than Pandoc, and I didn't even make any efforts to make it fast.
Compatibility
OMD has been developed using OCaml 4.00.1; Christophe Troestler made me make it compatible with OCaml 3.12.1. I guess it might work with older versions of OCaml, but it's not certain (mainly because OCaml's standard library has slightly changed; I don't think I use any language features that were introduced in 3.12 or 4.0).
By the way, thank you Christophe for your support, interest and contributions to OMD :-)
Future of OMD
OMD already is in OPAM. A very stable version of OMD should be released soon. As a tool, it takes Markdown as input and produces HTML as output. A Markdown backend has been freshly implemented, so you can output Markdown as well, which is quite useful for debugging or if you want to know how many iterations you need before you reach the fix point. You can also output "text", in the sense that it's basically the HTML without the HTML tags, so it's very non-formatted text. There also are options to output HTML table of contents and parametrise their depths.
OMD and OCaml.org
We are at OCaml Labs making a new website for ocaml.org. The design is being provided by onespacemedia. At the time I'm writing these lines, I'm using the HTML/CSS/JS for the upcoming OCaml.org to style my new Github-hosted website, so I can play with it more.
Most pages will be written in Markdown instead of HTML, so that people of the OCaml community may contribute to it in a more convenient way.
And of course, that means that OMD will be used for OCaml.org.