Inserting whitespaces between words
Given a whitespace-less string like
“thereverseisnottrue”
and a dictionary of words ordered by probability of appearance in a text (first means more likely to appear), insert the whitespaces in the string in the appropriate positions.
In probability Zipf’s law asserts that frequencies \( f \) of certain events are inversely proportional to their rank \( r \). For example the most ranked English word the has a high frequency in a generic text. Suppose we were given the following dictionary of words ordered by descending frequency
the
not
is
there
reverse
verse
true
we can use a simplification of Zipf’s law to estimate the cost of insertion of each word: \( c(w) = \log(r+1) \) i.e. the cost of a word \( w \) is the logarithm of its rank + 1.
Now let’s take into account the example string we were given
thereverseisnottrue
with 0 characters we obviously have a cost of 0 since we haven’t inserted anything yet. If we proceed adding a character each time and checking for a correspondance in the dictionary we have that the first key found is at position 3
thereverseisnottrue
^^^^
| 'the' exists in the dictionary and has weight log(1+0)
The cost for the previous positions is inf
since there is no match for t
or th
. By keeping a history equal to the longest string in the dictionary we
can recursively evaluate each previous position for a better match, i.e. for a solution which either
- Covers all the characters encountered
- Is more cost-effective than the previous one
e.g.
thereverseisnottrue
^ ^
| |*
| | cost log(1+3)
|
| cost log(1+0)
when the position *
has been reached we can now evaluate which one of the following splits is a better match
ther + e = inf + inf
the + re = log(1+0) + inf
th + ere = inf + inf
t + here = inf + inf
'' + there = 0 + log(1+3)
therefore the best match is there
with a cost of \( 0 + \log(1+3) \). This means that if we were to stop our search at this point, there
would cost more than the
but would provide a better match since it covers all the encountered characters. It is an optimal solution for this subproblem since all the other solutions are not viable.
When the input text is covered for the length of thereverse
the process is restarted and this time the
+ reverse
has a minimum cost rather than there
+ verse
. Of course heuristics and more advanced text-recognition techniques would be used if this were a semantic-aware engine trying to reconstruct a text. In this simple instance it suffices to find the minimum cost to split the words.
Since this recursion exhibits both optimal substructure and overlapping subproblems during the dictionary checks, it is a good candidate to be solved with a dynamic programming approach
The code runs in \( O(N^2) \) and requires \( O(N) \) space.