The so-called "all-lengths gap-weighted string kernel" computes a
similarity measure between s and t based on all of their
common subsequences of all lengths. Gapped subsequences are also
included.
To understand what gapWeightedSimilarity(s, t, lambda) computes,
consider first the case lambda = 1 and the strings s =
["Hello", "brave", "new", "world"] and t = ["Hello", "new",
"world"]. In that case, gapWeightedSimilarity counts the
following matches:
three matches of length 1, namely "Hello", "new",
and "world";
three matches of length 2, namely ("Hello", "new"), ("Hello", "world"), and ("new", "world");
one match of length 3, namely ("Hello", "new", "world").
The call gapWeightedSimilarity(s, t, 1) simply counts all of
these matches and adds them up, returning 7.
string[] s = ["Hello", "brave", "new", "world"];
string[] t = ["Hello", "new", "world"];
assert(gapWeightedSimilarity(s, t, 1) == 7);
Note how the gaps in matching are simply ignored, for example ("Hello", "new") is deemed as good a match as ("new",
"world"). This may be too permissive for some applications. To
eliminate gapped matches entirely, use lambda = 0:
string[] s = ["Hello", "brave", "new", "world"];
string[] t = ["Hello", "new", "world"];
assert(gapWeightedSimilarity(s, t, 0) == 4);
The call above eliminated the gapped matches ("Hello", "new"),
("Hello", "world"), and ("Hello", "new", "world") from the
tally. That leaves only 4 matches.
The most interesting case is when gapped matches still participate in
the result, but not as strongly as ungapped matches. The result will
be a smooth, fine-grained similarity measure between the input
strings. This is where values of lambda between 0 and 1 enter
into play: gapped matches are exponentially penalized with the
number of gaps with base lambda. This means that an ungapped
match adds 1 to the return value; a match with one gap in either
string adds lambda to the return value; ...; a match with a total
of n gaps in both strings adds pow(lambda, n) to the return
value. In the example above, we have 4 matches without gaps, 2 matches
with one gap, and 1 match with three gaps. The latter match is ("Hello", "world"), which has two gaps in the first string and one gap
in the second string, totaling to three gaps. Summing these up we get
4 + 2 * lambda + pow(lambda, 3).
gapWeightedSimilarity is useful wherever a smooth similarity
measure between sequences allowing for approximate matches is
needed. The examples above are given with words, but any sequences
with elements comparable for equality are allowed, e.g. characters or
numbers. gapWeightedSimilarity uses a highly optimized dynamic
programming implementation that needs 16 * min(s.length,
t.length) extra bytes of memory and O(s.length * t.length) time
to complete.
The so-called "all-lengths gap-weighted string kernel" computes a similarity measure between s and t based on all of their common subsequences of all lengths. Gapped subsequences are also included.
To understand what gapWeightedSimilarity(s, t, lambda) computes, consider first the case lambda = 1 and the strings s = ["Hello", "brave", "new", "world"] and t = ["Hello", "new", "world"]. In that case, gapWeightedSimilarity counts the following matches:
The call gapWeightedSimilarity(s, t, 1) simply counts all of these matches and adds them up, returning 7.
Note how the gaps in matching are simply ignored, for example ("Hello", "new") is deemed as good a match as ("new", "world"). This may be too permissive for some applications. To eliminate gapped matches entirely, use lambda = 0:
The call above eliminated the gapped matches ("Hello", "new"), ("Hello", "world"), and ("Hello", "new", "world") from the tally. That leaves only 4 matches.
The most interesting case is when gapped matches still participate in the result, but not as strongly as ungapped matches. The result will be a smooth, fine-grained similarity measure between the input strings. This is where values of lambda between 0 and 1 enter into play: gapped matches are exponentially penalized with the number of gaps with base lambda. This means that an ungapped match adds 1 to the return value; a match with one gap in either string adds lambda to the return value; ...; a match with a total of n gaps in both strings adds pow(lambda, n) to the return value. In the example above, we have 4 matches without gaps, 2 matches with one gap, and 1 match with three gaps. The latter match is ("Hello", "world"), which has two gaps in the first string and one gap in the second string, totaling to three gaps. Summing these up we get 4 + 2 * lambda + pow(lambda, 3).
gapWeightedSimilarity is useful wherever a smooth similarity measure between sequences allowing for approximate matches is needed. The examples above are given with words, but any sequences with elements comparable for equality are allowed, e.g. characters or numbers. gapWeightedSimilarity uses a highly optimized dynamic programming implementation that needs 16 * min(s.length, t.length) extra bytes of memory and O(s.length * t.length) time to complete.