gapWeightedSimilarityNormalized

The similarity per gapWeightedSimilarity has an issue in that it grows with the lengths of the two strings, even though the strings are not actually very similar. For example, the range ["Hello", "world"] is increasingly similar with the range ["Hello", "world", "world", "world",...] as more instances of "world" are appended. To prevent that, gapWeightedSimilarityNormalized computes a normalized version of the similarity that is computed as gapWeightedSimilarity(s, t, lambda) / sqrt(gapWeightedSimilarity(s, t, lambda) * gapWeightedSimilarity(s, t, lambda)). The function gapWeightedSimilarityNormalized (a so-called normalized kernel) is bounded in [0, 1], reaches 0 only for ranges that don't match in any position, and 1 only for identical ranges.

The optional parameters sSelfSim and tSelfSim are meant for avoiding duplicate computation. Many applications may have already computed gapWeightedSimilarity(s, s, lambda) and/or gapWeightedSimilarity(t, t, lambda). In that case, they can be passed as sSelfSim and tSelfSim, respectively.

Select!(isFloatingPoint!(F), F, double)
gapWeightedSimilarityNormalized
(
alias comp = "a == b"
R1
R2
F
)
(
R1 s
,
R2 t
,,
F sSelfSim = F.init
,
F tSelfSim = F.init
)

Examples

import std.math.operations : isClose;
import std.math.algebraic : sqrt;

string[] s = ["Hello", "brave", "new", "world"];
string[] t = ["Hello", "new", "world"];
assert(gapWeightedSimilarity(s, s, 1) == 15);
assert(gapWeightedSimilarity(t, t, 1) == 7);
assert(gapWeightedSimilarity(s, t, 1) == 7);
assert(isClose(gapWeightedSimilarityNormalized(s, t, 1),
                7.0 / sqrt(15.0 * 7), 0.01));

Meta