📄 jarowinklerdistance.java
字号:
/* * LingPipe v. 3.5 * Copyright (C) 2003-2008 Alias-i * * This program is licensed under the Alias-i Royalty Free License * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i * Royalty Free License Version 1 for more details. * * You should have received a copy of the Alias-i Royalty Free License * Version 1 along with this program; if not, visit * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211, * +1 (718) 290-9170. */package com.aliasi.spell;import com.aliasi.util.Distance;import com.aliasi.util.Proximity;import java.util.Arrays;/** * The <code>JaroWinklerDistance</code> class implements the original * Jaro string comparison as well as Winkler's modifications. As a * distance measure, Jaro-Winkler returns values between * <code>0</code> (exact string match) and <code>1</code> (no matching * characters). Note that this is reversed from the original * definitions of Jaro and Winkler in order to produce a distance-like * ordering. The original Jaro-Winkler string comparator returned * <code>1</code> for a perfect match and <code>0</code> for * complete mismatch; our method returns one minus the Jaro-Winkler * measure. * * <p>The Jaro-Winkler distance measure was developed for name comparison * in the U.S. Census. It is designed to compae surnames to surnames * and given names to given names, not whole names to whole names. * There is no character-specific information in this implementation, * but assumptions are made about typical lengths and the significance * of initial matches that may not apply to all languages. * * <p>The easiest way to understand the Jaro measure and the Winkler * variants is procedurally. The Jaro measure involves two steps, * first to compute the number of "matches" and second to * compute the number of "transpositions". The Winkler * adjustment involves a final rescoring based on an exact match score * for the initial characters of both strings. * * <h4>Formal Definition of Jaro-Winkler Distance</h4> * * <p>Suppose we are comparing character sequences <code>cs1</code> * and <code>cs2</code>. The Jaro-Winkler distance is defined by * the following steps. After the definitions, we consider some * examples. * * <p><b>Step 1: Matches:</b> The match phase is a greedy alignment * step of characters in one string against the characters in another * string. The maximum distance (measured by array index) at which * characters may be matched is defined by: * * <pre> * matchRange = max(cs1.length(), cs2.length()) / 2 - 1</pre> * * <p>The match phase is a greedy alignment that proceeds character by * character through the first string, though the distance metric is * symmetric (that, is reversing the order of arguments does not * affect the result). For each character encountered in the first * string, it is matched to the first unaligned character in the * second string that is an exact character match. If there is no * such character within the match range window, the character is left * unaligned. * * <p><b>Step 2: Transpositions:</b> After matching, the subsequence * of characters actually matched in both strings is extracted. These * subsequences will be the same length. The number of characters in * one string that do not line up (by index in the matched * subsequence) with identical characters in the other string is the * number of "half transpositions". The total number of * transpoisitons is the number of half transpositions divided by two, * rounding down. * * <p>The Jaro distance is then defined in terms of the number * of matching characters <code>matches</code> and the number * of transpositions, <code>transposes</code>: * * <pre> * jaroProximity(cs1,cs2) * = ( matches(cs1,cs2) / cs1.length() * + matches(cs1,cs2) / cs2.length() * + (matches(cs1,cs2) - transposes(cs1,cs2)) / matches(cs1,cs2) ) / 3 * * jaroDistance(cs1,cs2) = 1 - jaroProximity(cs1,cs2)</pre> * * <p>In words, the measure is the average of three values; (a) the * percentage of the first string matched, (b) the percentage of the * second string matched, and (c) the percentage of matches that were * not transposed. * * <p><b>Step 3: Winkler Modification</b> The Winkler modification to * the Jaro comparison, resulting in the Jaro-Winkler comparison, * boosts scores for strings that match character for character * initially. Let <code>boostThreshold</code> be the minimum score * for a string that gets boosted. This value was set to * <code>0.7</code> in Winkler's papers (see references below). If * the Jaro score is below the boost threshold, the Jaro score is * returned unadjusted. The second parameter for the Winkler * modification is the size of the initial prefix considered, * <code>prefixSize</code>. The prefix size was set to <code>4</code> * in Winkler's papers. Next, let * <code>prefixMatch(cs1,cs2,prefixSize)</code> be the number of * characters in the prefix of <code>cs1</code> and <code>cs2</code> * that exactly match (by original index), up to a maximum of * <code>prefixSize</code>. The modified distance is then defined to * be: * * <pre> * jaroWinklerProximity(cs1,cs2,boostThreshold,prefixSize) * = jaroMeasure(cs1,cs2) <= boostThreshold * ? jaroMeasure(cs1,cs2) * : jaroMeasure(cs1,cs2) * + 0.1 * prefixMatch(cs1,cs2,prefixSize) * (1.0 - jaroDistance(cs1,cs2)) * * jaroWinklerDistance(cs1,cs2,boostThreshold,prefixSize) * = 1 - jaroWinklerProximity(cs1,cs2,boostThreshold,prefixSize)</pre> * * <p><b>Examples:</b> We will present the alignment steps in the form * of tables, with offsets in the second string below the first string * positions that match. For a simple example, consider comparing the * given (nick)name <code>AL</code> to itself. Both strings are of * length 2. Thus the maximum match distance is <code>max(2,2)/2 - 1 * = 0</code>, meaning all matches must be exact. The matches are * illustrated in the following table: * * <table cellpadding="3" border="1" style="margin-left: 2em"> * <tr><td><code>cs1</code></td><td>A</td><td>L</td></tr> * <tr><td>matches</td><td>0</td><td>1</td></tr> * <tr><td><code>cs2</code></td><td>A</td><td>L</td></tr> * </table> * * <p>The notation in the matches row is meant to indicate that the * <code>A</code> at index <code>0</code> in <code>cs1</code> is * matched to the <code>A</code> at index <code>0</code> in * <code>cs2</code>. Similarlty for the <code>L</code> at index 1 in * <code>cs1</code>, which matches the <code>L</code> at index 1 in * <code>cs2</code>. This results in <code>matches(AL,AL) = 2</code>. * There are no transpositions, so the Jaro distance is just: * * <pre> * jaroProximity(AL,AL) = 1/3*(2/2 + 2/2 + (2-0)/2) = 1.0</pre> * * <p>Applying the Winkler modification yields the same result: * * <pre> * jaroWinklerProximity(AL,AL) = 1 + 0.1 * 2 * (1.0 - 1) = 1.0</pre> * * <p>Next consider a more complex case, matching <code>MARTHA</code> and * <code>MARHTA</code>. Here the match distance is <code>max(5,5)/2 - * 1 = 1</code>, allowing matching characters to be up to one * character away. This yields the following alignment. * * <table cellpadding="3" border="1" style="margin-left: 2em"> * <tr><td><code>cs1</code></td> * <td>M</td><td>A</td><td>R</td><td><b>T</b></td><td><b>H</b></td><td>A</td> * </tr> * <tr><td>matches</td> * <td>0</td><td>1</td><td>2</td><td>4</td><td>3</td><td>5</td> * </tr> * <tr><td><code>cs2</code></td> * <td>M</td><td>A</td><td>R</td><td><b>H</b></td><td><b>T</b></td><td>A</td> * </tr> * </table> * * <p>Note that the <code>T</code> at index 3 in the first string * aligns with the <code>T</code> at index 4 in the second string, * whereas the <code>H</code> at index 4 in the first string alings * with the <code>H</code> at index 3 in the second string. The * strings that do not directly align are rendered in bold. This is * an instance of a transposition. The number of half transpositions * is determined by comparing the subsequences of the first and second * string matched, namely <code>MARTHA</code> and <code>MARHTA</code>. * There are two positions with mismatched characters, 3 and 4. This * results in two half transpositions, or a single transposition, for * a Jaro distance of: * * <pre> * jaroProximity(MARTHA,MARHTA) = 1/3 * (6/6 + 6/6 + (6 - 1)/6) = 0.944</pre> * * Three initial characters match, <code>MAR</code>, for a Jaro-Winkler * distance of: * * <pre> * jaroWinklerProximity(MARTHA,MARHTA) = 0.944 + 0.1 * 3 * (1.0 - 0.944) = 0.961</pre> * * <p>Next, consider matching strings of different lengths, such as * <code>JONES</code> and <code>JOHNSON</code>: * * <table cellpadding="3" border="1" style="margin-left: 2em"> * <tr><td><code>cs1</code></td> * <td>J</td><td>O</td><td>N</td><td><i>E</i></td><td>S</td><td></td><td></td> * </tr> * <tr><td>matches</td> * <td>0</td><td>1</td><td>3</td><td>-</td><td>5</td><td></td><td></td> * </tr> * <tr><td><code>cs2</code></td> * <td>J</td><td>O</td><td><i>H</i></td><td>N</td><td>S</td><td><i>O</i></td><td><i>N</i></td> * </tr> * </table> * * <p>The unmatched characters are rendered in italics. Here the * subsequence of matched characters for the two strings are <code>JONS</code> and * <code>JONS</code>, so there are no transpositions. Thus the Jaro * distance is: * * <pre> * jaroProximity(JONES,JOHNSON) * = 1/3 * (4/5 + 4/7 + (4 - 0)/4) = 0.790</pre> * * <p>The strings <code>JONES</code> and <code>JOHNSON</code> only * match on their first two characters, <code>JO</code>, so the * Jaro-Winkler distance is: * * <pre> * jaroWinklerProximity(JONES,JOHNSON) * = .790 + 0.1 * 2 * (1.0 - .790) = 0.832</pre> * * <p>We will now consider some artificial examples not drawn from * (Winkler 2006). First, compare <code>ABCVWXYZ</code> and * <code>CABVWXYZ</code>, which are of length 8, allowing alignments * up to <code>8/4 - 1 = 3</code> positions away. This leads * to the following alignment: * * <table cellpadding="3" border="1" style="margin-left: 2em"> * <tr><td><code>cs1</code></td> * <td><b>A</b></td><td><b>B</b></td><td><b>C</b></td><td>V</td><td>W</td><td>X</td><td>Y</td><td>Z</td> * </tr> * <tr><td>matches</td> * <td>1</td><td>2</td><td>0</td><td>3</td><td>4</td><td>5</td><td>6</td><td>7</td> * </tr> * <tr><td><code>cs2</code></td> * <td><b>C</b></td><td><b>A</b></td><td><b>B</b></td><td>V</td><td>W</td><td>X</td><td>Y</td><td>Z</td> * </tr> * </table> * * <p>Here, there are 8/8 matches in both strings. There are only * three half-transpositions, in the first three characters, * because no position of <code>CAB</code> has an identical character * to <code>ABC</code>. This yields a total of one transposition, * for a Jaro score of: * * <pre> * jaroProximity(ABCVWXYZ,CABVWXYZ) * = 1/3 * (8/8 + 8/8 + (8-1)/8) = .958</pre> * * <p>There is no initial prefix match, so the Jaro-Winkler comparison * produces the same result. Now consider matching * <code>ABCVWXYZ</code> to <code>CBAWXYZ</code>. Here, the initial * alignment is <code>2, 1, 0</code>, which yields only two half * transpositions. Thus under the Jaro distance, <code>ABC</code> is * closer to <code>CBA</code> than to <code>CAB</code>, though due to * integer rounding in computing the number of transpositions, this * will only affect the final result if there is a further * transposition in the strings. * * <p>Now consider the 10-character string <code>ABCDUVWXYZ</code>. * This allows matches up to <code>10/2 - 1 = 4</code> positions away.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -