gapWeightedSimilarity

The so-called "all-lengths gap-weighted string kernel" computes a similarity measure between s and t based on all of their common subsequences of all lengths. Gapped subsequences are also included.

To understand what gapWeightedSimilarity(s, t, lambda) computes, consider first the case lambda = 1 and the strings s = ["Hello", "brave", "new", "world"] and t = ["Hello", "new", "world"]. In that case, gapWeightedSimilarity counts the following matches:

  1. three matches of length 1, namely "Hello", "new", and "world";
  2. three matches of length 2, namely ("Hello", "new"), ("Hello", "world"), and ("new", "world");
  3. one match of length 3, namely ("Hello", "new", "world").

The call gapWeightedSimilarity(s, t, 1) simply counts all of these matches and adds them up, returning 7.

string[] s = ["Hello", "brave", "new", "world"];
string[] t = ["Hello", "new", "world"];
assert(gapWeightedSimilarity(s, t, 1) == 7);

Note how the gaps in matching are simply ignored, for example ("Hello", "new") is deemed as good a match as ("new", "world"). This may be too permissive for some applications. To eliminate gapped matches entirely, use lambda = 0:

string[] s = ["Hello", "brave", "new", "world"];
string[] t = ["Hello", "new", "world"];
assert(gapWeightedSimilarity(s, t, 0) == 4);

The call above eliminated the gapped matches ("Hello", "new"), ("Hello", "world"), and ("Hello", "new", "world") from the tally. That leaves only 4 matches.

The most interesting case is when gapped matches still participate in the result, but not as strongly as ungapped matches. The result will be a smooth, fine-grained similarity measure between the input strings. This is where values of lambda between 0 and 1 enter into play: gapped matches are exponentially penalized with the number of gaps with base lambda. This means that an ungapped match adds 1 to the return value; a match with one gap in either string adds lambda to the return value; ...; a match with a total of n gaps in both strings adds pow(lambda, n) to the return value. In the example above, we have 4 matches without gaps, 2 matches with one gap, and 1 match with three gaps. The latter match is ("Hello", "world"), which has two gaps in the first string and one gap in the second string, totaling to three gaps. Summing these up we get 4 + 2 * lambda + pow(lambda, 3).

string[] s = ["Hello", "brave", "new", "world"];
string[] t = ["Hello", "new", "world"];
assert(gapWeightedSimilarity(s, t, 0.5) == 4 + 0.5 * 2 + 0.125);

gapWeightedSimilarity is useful wherever a smooth similarity measure between sequences allowing for approximate matches is needed. The examples above are given with words, but any sequences with elements comparable for equality are allowed, e.g. characters or numbers. gapWeightedSimilarity uses a highly optimized dynamic programming implementation that needs 16 * min(s.length, t.length) extra bytes of memory and O(s.length * t.length) time to complete.

F
gapWeightedSimilarity
(
alias comp = "a == b"
R1
R2
F
)
(
R1 s
,
R2 t
,)

Meta