Powierża coefficient
Powierża coefficient is a statistic on strings for gauging whether a string is an "abbreviation" of another. The function is not symmetric so it is not a metric.
- Let `T` (text) be a non-empty string.
- Let `P` (pattern) be a non-empty subsequence of `T`.
- Let `p` be a partition of `P` and `p_i` be its elements, where:
  - every `p_i` is equal to some substring of `T`, `t_i`.
  - the substrings `t_i` do not overlap.
  - `t_i` are in the same order as `p_i`.
Powierża coefficient is the number of elements of the shortest partition p, less one. Alternatively, it is the number of gaps between the substrings t_i.
Used terms:
- A substring is a subsequence made of consecutive elements only. A subsequence doesn't have to be a substring. For example, `xz` is a subsequence of `xyz` but it is not its substring.
- A partition of a sequence is a sequence of pairwise disjoint subsequences that, when concatenated, are equal to the entire original sequence.
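The difference between the two terms can be checked mechanically. The following is a minimal sketch; the helper names `is_subsequence` and `is_substring` are made up for this illustration and are not part of the crate.

```rust
/// Returns true if `needle` can be obtained from `haystack` by deleting
/// zero or more characters without reordering the remaining ones.
fn is_subsequence(needle: &str, haystack: &str) -> bool {
    let mut rest = haystack.chars();
    // `any` consumes `rest` up to and including the first match,
    // so the search for the next character resumes right after it.
    needle.chars().all(|c| rest.any(|h| h == c))
}

/// Returns true if `needle` occurs in `haystack` as consecutive characters.
fn is_substring(needle: &str, haystack: &str) -> bool {
    haystack.contains(needle)
}
```

With these helpers, `is_subsequence("xz", "xyz")` is `true` while `is_substring("xz", "xyz")` is `false`.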
Intuitive explanation
Take all characters from the pattern and, while preserving the original order, align them with the same characters in the text so that there are as few groups of characters as possible. The coefficient is the number of gaps between these groups.
Examples
| P | T | p | Powierża coefficient |
|---|---|---|---|
| powcoeff | powierża coefficient | pow, coeff | 1 |
| abc | a_b_c | a, b, c | 2 |
| abc | abc | abc | 0 |
| abc | xyz | (none) | not defined |
For more examples, see tests.
Use case
The Powierża coefficient is used in kn and in nushell to determine which directory name better matches an abbreviation. Many other string coefficients and metrics were found unsuitable, including Levenshtein distance, which is biased in favour of short strings. For example, the Levenshtein distance from gra to programming (8) is greater than the distance from gra to gorgia (3), even though gorgia does not "resemble" the abbreviation. The Powierża coefficients for these pairs of strings are 0 and 2 respectively, so programming would be chosen (correctly).
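The Levenshtein side of this comparison can be reproduced with a textbook Wagner–Fischer implementation. The sketch below is written for this illustration only (the function name `levenshtein` here is a local helper, not the `strsim` function) and compares strings character by character.

```rust
/// Levenshtein distance computed with the Wagner–Fischer algorithm,
/// keeping only the previous row of the matrix in memory.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the first i - 1 characters of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        // curr[0] = i: turning i characters into the empty string takes i deletions.
        let mut curr = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = curr[j - 1] + 1;
            curr[j] = substitution.min(deletion).min(insertion);
        }
        prev = curr;
    }
    prev[b.len()]
}
```

Here `levenshtein("gra", "programming")` is 8 and `levenshtein("gra", "gorgia")` is 3, so a ranking by Levenshtein distance would put gorgia first.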
Powierża algorithm
The algorithm was inspired by the Wagner–Fischer algorithm. It is also very similar to a solution to the longest common subsequence problem. All of these algorithms are based on a matrix. Whereas in the Wagner–Fischer algorithm (WF) there are 3 types of moves (horizontal, diagonal and vertical), in my algorithm there are only two: horizontal and diagonal. The main idea is that the 'cost' of a gap is always 1, no matter how long it is. (In WF the cost of a gap is its length.)
That means the algorithm must differentiate between cells that were filled in horizontal moves and the ones that were filled in diagonal moves. Cells of the first type contain `Gap(score)`; cells of the second type contain `Continuation(score)`. A horizontal move results in `Gap(score)` if the original cell contains `Gap(score)` and in `Gap(score + 1)` if the original cell contains `Continuation(score)`. The algorithm prefers the move that results in the lower score. If both moves result in the same score, it makes the horizontal move, because a `Gap` cell can later be extended to the right at no extra cost.
- Create a matrix of `m` rows by `n` columns, where `m` is the length of `P` and `n` is the length of `T`. `m` must be less than or equal to `n`. Each cell can either be empty (that's the initial state) or contain either `Gap(score)` or `Continuation(score)`.
- Begin filling the matrix from left to right and from top to bottom. The first row is special: the cell in column `x` and row `y` is set to `Continuation(0)` if the `x`th element of `T` and the `y`th element of `P` are equal. Otherwise, it is set to `Gap(score + cost)`, where `score` is the score of its left neighbor and `cost` is 0 if that neighbor contains `Gap` and 1 if it contains `Continuation`. If its left neighbor is empty, the cell is left empty as well.
- Other cells are filled according to these rules. Let `a` be the cell being filled, `x` be `a`'s upper-left neighbor and `y` be its left neighbor:
  - The cost of a diagonal move is 0, but such a move is only possible if the element of `T` in `a`'s column equals the element of `P` in `a`'s row and if `x` isn't empty. After the move, `a` is set to `Continuation(score)`, where `score` is `x`'s score.
  - The cost of a horizontal move is 0 if `y` contains `Gap` and 1 if `y` contains `Continuation`. Such a move is only possible if `y` isn't empty. After the move, `a` is set to `Gap(score + cost)`, where `score` is `y`'s score.
  - If there are no available moves, leave `a` empty.
  - If there's only one available move, make it.
  - If there are two available moves and their scores are equal, make the horizontal move.
  - If there are two available moves and their scores aren't equal, make the move with the lower score.
- The Powierża coefficient is the least value in the last row. In some cases there are no values in the last row and the coefficient is not defined.
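The steps above can be sketched in Rust as follows. This is an illustrative reading of the description and not the crate's actual source: the names `Cell`, `horizontal`, `diagonal` and `powierza_coefficient_sketch` are hypothetical, rows are assumed to correspond to pattern characters and columns to text characters, and the whole matrix is kept in memory with no skipping of cells.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Cell {
    Empty,
    Gap(u32),
    Continuation(u32),
}

/// Score after a horizontal move out of `left`, or None if `left` is empty.
fn horizontal(left: Cell) -> Option<u32> {
    match left {
        Cell::Empty => None,
        Cell::Gap(score) => Some(score),              // extending a gap is free
        Cell::Continuation(score) => Some(score + 1), // opening a gap costs 1
    }
}

/// Score after a diagonal move out of `upper_left`, or None if it is empty.
fn diagonal(upper_left: Cell) -> Option<u32> {
    match upper_left {
        Cell::Empty => None,
        Cell::Gap(score) | Cell::Continuation(score) => Some(score),
    }
}

/// Hypothetical helper: returns None when the coefficient is not defined.
fn powierza_coefficient_sketch(pattern: &str, text: &str) -> Option<u32> {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    if p.is_empty() || p.len() > t.len() {
        return None;
    }
    // One row per pattern character, one column per text character.
    let mut m = vec![vec![Cell::Empty; t.len()]; p.len()];
    for row in 0..p.len() {
        for col in 0..t.len() {
            let left = if col > 0 { m[row][col - 1] } else { Cell::Empty };
            let h = horizontal(left);
            let d = if p[row] != t[col] {
                None // a diagonal move needs equal characters
            } else if row == 0 {
                Some(0) // first row: a match always yields Continuation(0)
            } else if col > 0 {
                diagonal(m[row - 1][col - 1])
            } else {
                None // no upper-left neighbor in the first column
            };
            m[row][col] = match (d, h) {
                (Some(d), Some(h)) if d < h => Cell::Continuation(d),
                (_, Some(h)) => Cell::Gap(h), // horizontal is cheaper, tied, or the only move
                (Some(d), None) => Cell::Continuation(d),
                (None, None) => Cell::Empty,
            };
        }
    }
    // The coefficient is the least score in the last row, if there is any.
    m[p.len() - 1]
        .iter()
        .filter_map(|cell| match cell {
            Cell::Empty => None,
            Cell::Gap(s) | Cell::Continuation(s) => Some(*s),
        })
        .min()
}
```

Under these assumptions, `powierza_coefficient_sketch("abc", "a_b_c")` returns `Some(2)` and `powierza_coefficient_sketch("abc", "xyz")` returns `None`, matching the examples table.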
Illustration
Cells with G's were filled in horizontal moves and those with C's were filled in diagonal moves. The numbers next to the letters are cells' scores. Red cells were skipped because of an optimization. Yellow cells were left empty. The coefficient is 2.
Benchmarks
The algorithm was compared with strsim's levenshtein in a benchmark run on the author's computer:
- Levenshtein distance: `[1.2908 µs 1.2946 µs 1.2987 µs]`
- Powierża coefficient: `[1.7718 µs 1.7748 µs 1.7778 µs]`
