Molis Hai:   Natural Language Password Generation
1 Source Text
2 Order
3 Entropy
4 Performance
5 Security
6 Other Stuff
6.2.901.900

Molis Hai: Natural Language Password Generation

John Clements <clements@racket-lang.org>

This collection produces passwords using models built on a corpus of source text. It manages to guarantee generation of passwords with known entropy by building markov models whose transitions are guided by huffman trees, allowing the use of variable numbers of bits of entropy for each transition.

To run it, use

  raco molis-hai

It also has a number of command-line arguments that you can use to control the password generation process.

1 Source Text

You can specify the source text using the -t flag. A megabyte of text is a reasonable size. I’ve included the text of Dickens’ A Tale of Two Cities, and this is used by default. Extremely long source texts will take a while to build models from, and extremely short source texts (e.g., "banana" will work, but will generate extremely long passwords.

2 Order

The tool builds a markov model mapping character-based n-grams to possible next letters. By default, it uses order 2. That is, the model will take a look at every instance of "pr", for instance, and see what could follow it.

You can also specify a larger order. If you specify order 3, then it will build its model using strings of length 3 as states; in this case, it would compute, e.g., what follows "pro" (and every other 3-letter string that occurs in the source text).

Larger models produce passwords that are more characteristic of the source text. Surprisingly, this isn’t usually such a great thing. Here are some 56-bit passwords of order 2:

Yess ott was whoc

ming to mys, ale, and ar

and the factinand cale

toner gre,' Throur and w

dis whishe coullows ing

lationsir.' Arge,--l

inesperept his calty

he led, againeved--th

And here are some 56-bit passwords of order 8:

took horse for Dover; and then stood smoking, too! replied: My child, this world

out of these footsteps destination and by the wall where he

and alluvial mud, under such thought madame, drawing themselves through her f

by Lucie at her sole reliance, under a bank overhanging about. There, said he, i

in such differed principle. And look at, am I? Why don't you be afraid of, for he stooped, and leth

the nose became much more wicked foreign woman; what plate as to matter. But when I ask your

did a great Stilton cheese, until to-morrow. For to-morrow! Now, a judiciously sensibility she wound about he

young ladies made to stand up against, and notice, when two dusty men passionate, loving, thankful that if s

The second set are clearly more Dickensian... but would make terrible passwords! I mean, yes, you’d probably be the only girl on the block to have "Stilton cheese, until tomorrow" in your password, but at the end of the day, I bet you’d be happier typing "Yess ott was whoc" when you lift the lid on your laptop.

In fact, some people might even prefer the order-1 passwords, which look like this (again, 56-bit):

arither tshelones,

a ffty,'l g wed

maged he tof hed ored

arerer pen le wagh

hemishemat. t wher

I engutsir.'sin

s ton toticould the

waped t If h inge t

Well, maybe not. Order-2 seems like a nice compromise.

3 Entropy

The generator uses a list of booleans to generate a password. Each character’s generation consumes zero or more booleans from this list. There is a bijection between the set of booleans of some length n and the set of passwords generated (assuming a fixed order and source text).

These booleans are generated by drawing randomness from "/dev/urandom" (sorry, Windows folks, tell me how to get random bits on your platform).

What this means is that passwords generated by this system are guaranteed to have the required entropy. That is, if I generate a 56-bit password this system, it’s guaranteed to have 56 bits of entropy, even though the attacker knows your source text and what order you’ve chosen.

For more about security and how it all works, take a look at my preprint on ArXiv.

4 Performance

Currently, the tool simply builds the model every time it’s called. This may seem inefficient, but the fact is that for order-2 models on 1-megabyte texts, it only takes a few seconds on my machine.

Larger models take longer to build, but they’re also huge, and would take up lots of space on disk, and aren’t really very useful (as noted above). If there’s really any demand for it, it wouldn’t be hard to add a means to compile source texts for fast password generation. Finally, this tool usually isn’t something that you run every day; waiting 30 seconds for password generation probably isn’t a big deal.

5 Security

These passwords are as secure as the bits that come from "/dev/urandom". In fact, there’s a bijection between them.

6 Other Stuff

The library contains lots of other stuff that I explored in the process of creating this; there’s some code for extracting the text of e-mails, and for html processing, and for word-based generation, etc. etc. etc. Hopefully, it’s all fairly easy to read.