⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 jarowinklerdistance.java

📁 一个自然语言处理的Java开源工具包。LingPipe目前已有很丰富的功能
💻 JAVA
📖 第 1 页 / 共 2 页
字号:
 * If matched against <code>DABCUVWXYZ</code>, the result is 10 * matches, and 4 half transposes, or 2 transposes.  Now consider * matching <code>ABCDUVWXYZ</code> against <code>DBCAUVWXYZ</code>. * Here, index 0 in the first string (<code>A</code>) maps to * index 3 in the second string, and index 3 in the first string * (<code>D</code>) maps to index 0 in the second string, but * positions 1 and 2 (<code>B</code> and <code>C</code>) map to * themselves.  Thus when comparing the output, there are only two * half transpositions, thus making the second example * <code>DBCAUVWXYZ</code> closer than <code>DABCUVWXYZ</code> to the * first string <code>ABCDUVWXYZ</code>. * * <p>Note that the transposition count cannot be determined solely by * the mapping.  For instance, the string <code>ABBBUVWXYZ</code> * matches <code>BBBAUVWXYZ</code> with alignment <code>4, 0, 1, 2, 5, 6, 7, * 8, 9, 1</code>.  But there are only two half-transpositions, because * only index 0 and index 3 mismatch in the subsequences of matching * characters. Contrast this with <code>ABCDUVWXYZ</code> matching * <code>DABCUVWXYZ</code>, which has the same alignment, but four * half transpositions. * * <p>The greedy nature of the alignment phase in the Jaro-Winkler * algorithm actually prevents the optimal alignments from being found * in some cases.  Consider the alignment of <code>ABCAWXYZ</code> * with <code>BCAWXYZ</code>: * * <table cellpadding="3" border="1" style="margin-left: 2em"> * <tr><td><code>cs1</code></td> *     <td><b>A<b></td><td><b>B</b></td><td><b>C</b></td><td><i>A</i></td><td>W</td><td>X</td><td>Y</td><td>Z</td> * </tr> * <tr><td>matches</td> *     <td>2</td><td>0</td><td>1</td><td>-</td><td>3</td><td>4</td><td>5</td><td>6</td> * </tr> * <tr><td><code>cs2</code></td> *     <td><b>B</b></td><td><b>C</b></td><td><b>A</b></td><td>W</td><td>X</td><td>Y</td><td>Z</td><td>&nbsp;</td> * </tr> * </table> * * <p>Here the first pair of <code>A</code> characters are matched, * leading to three half transposes (the first three matched * characters).  A better scoring, though illegal, alignment would be * the following, because it has the same number of matches, but no * transposes: * * <p><table cellpadding="3" border="1" style="margin-left: 2em"> * <tr><td><code>cs1</code></td> *     <td><i>A</i></td><td><b>B</b></td><td><b>C</b></td><td><b>A</b></td><td>W</td><td>X</td><td>Y</td><td>Z</td> * </tr> * <tr><td>matches</td> *     <td style="background-color:#FF9">-</td><td>0</td><td>1</td><td style="background-color:#FF9">2</td><td>3</td><td>4</td><td>5</td><td>6</td> * </tr> * <tr><td><code>cs2</code></td> *     <td><b>B</b></td><td><b>C</b></td><td><b>A</b></td><td>W</td><td>X</td><td>Y</td><td>Z</td><td>&nbsp;</td> * </tr> * </table> * * <p>The illegal links are highlighted in yellow.  Note that neither alignment * matches in the initial character, so the Winkler adjustments do not apply. * * <h4>Implementation Notes</h4> * * <p>This class's implementation is a literal translation of the C * algorithm used in William E. Winkler's papers and for the 1995 * U.S. Census Deduplication.  The algorithm is the work of * multiple authors and available from the folloiwng link: * * <ul> * <li> * Winkler, Bill, George McLaughlin, Matt Jaro and Marueen Lynch.  1994. * <a href="http://www.census.gov/geo/msb/stand/strcmp.c">strcmp95.c</a>, * Version 2. United States Census Bureau. * </li> * </ul> * * <p> Unlike the C version, the {@link * #distance(CharSequence,CharSequence)} and {@link * #proximity(CharSequence,CharSequence)} methods do not require its * inputs to be padded with spaces.  In addition, spaces are treated * just like any other characters within the algorithm itself.  There * is also no case normalization in this class's version. * Furthermore, the boundary conditions are changed so that two empty * strings return a score of <code>1.0</code> rather than zero, as in * the original algorithm. * * <p>The algorithm, along with applications in record linkage, is * sketched in the following highly readable survey article: * * <ul> * <li> * Winkler, William E.  2006. * <a href="http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf">Overview of * Record Linkage and Current Research Directions</a>. * Statistical Research Division, U.S. Census Bureau. * </li> * </ul> * * This document provides test cases in Table 6, which are the basis * for the unit tests for this class (though note the three 0.0 * results in the table do not agree with the return results of * <code>strcmp95.c</code> or the results of this class, which matches * <code>strcmp95.c</code>).  The description of the matching * procedure above is based on the actual <code>strcmp95</code> code, * the boundary conditions of which are not obvious from the text * descriptions in the literature.  An additional difference is that * <code>strcmp95</code>, but not the algorithms in Winkler's papers * nor the algorithm in this class, provides the possibility of * partial matches with similar-sounding characters * (e.g. <code>c</code> and <code>k</code>). * * <h4>Acknowledgements</h4> * * <p>We'd like to thank Bill Winkler for helping us understand the * versions of the algorithm and providing the <code>strcmp95.c</code> * code as a reference implementation. * * @author  Bob Carpenter * @version 3.0 * @since   LingPipe2.4 */public class JaroWinklerDistance    implements Distance<CharSequence>,               Proximity<CharSequence> {    private final double mWeightThreshold;    private final int mNumChars;    /**     * Construct a basic Jaro string distance without the Winkler     * modifications.  See the class documentation above for more information     * on the exact algorithm and its parameters.     */    public JaroWinklerDistance() {        this(Double.POSITIVE_INFINITY,0);    }    /**     * Construct a Winkler-modified Jaro string distance with the     * specified weight threshold for refinement and an initial number     * of characters over which to reweight.  See the class     * documentation above for more information on the exact algorithm     * and its parameters.     */    public JaroWinklerDistance(double weightThreshold, int numChars) {        mNumChars = numChars;        mWeightThreshold = weightThreshold;    }    /**     * Returns the Jaro-Winkler distance between the specified character     * sequences.  Teh distance is symmetric and will fall in the     * range <code>0</code> (perfect match) to <code>1</code> (no overlap).     * See the class definition above for formal definitions.     *     * <p>This method is defined to be:     *     * <pre>     *   distance(cSeq1,cSeq2) = 1 - proximity(cSeq1,cSeq2)</code></pre>     *     * @param cSeq1 First character sequence to compare.     * @param cSeq2 Second character sequence to compare.     * @return The Jaro-Winkler comparison value for the two character     * sequences.     */    public double distance(CharSequence cSeq1, CharSequence cSeq2) {        return 1.0 - proximity(cSeq1,cSeq2);    }    /**     * Return the Jaro-Winkler comparison value between the specified     * character sequences.  The comparison is symmetric and will fall     * in the range <code>0</code> (no match) to <code>1</code>     * (perfect match)inclusive.  See the class definition above for     * an exact definition of Jaro-Winkler string comparison.     *     * <p>The method {@link #distance(CharSequence,CharSequence)} returns     * a distance measure that is one minus the comparison value.     *     * @param cSeq1 First character sequence to compare.     * @param cSeq2 Second character sequence to compare.     * @return The Jaro-Winkler comparison value for the two character     * sequences.     */    public double proximity(CharSequence cSeq1, CharSequence cSeq2) {        int len1 = cSeq1.length();        int len2 = cSeq2.length();        if (len1 == 0)            return len2 == 0 ? 1.0 : 0.0;        int  searchRange = Math.max(0,Math.max(len1,len2)/2 - 1);        boolean[] matched1 = new boolean[len1];        Arrays.fill(matched1,false);        boolean[] matched2 = new boolean[len2];        Arrays.fill(matched2,false);        int numCommon = 0;        for (int i = 0; i < len1; ++i) {            int start = Math.max(0,i-searchRange);            int end = Math.min(i+searchRange+1,len2);            for (int j = start; j < end; ++j) {                if (matched2[j]) continue;                if (cSeq1.charAt(i) != cSeq2.charAt(j))                    continue;                matched1[i] = true;                matched2[j] = true;                ++numCommon;                break;            }        }        if (numCommon == 0) return 0.0;        int numHalfTransposed = 0;        int j = 0;        for (int i = 0; i < len1; ++i) {            if (!matched1[i]) continue;            while (!matched2[j]) ++j;            if (cSeq1.charAt(i) != cSeq2.charAt(j))                ++numHalfTransposed;            ++j;        }        // System.out.println("numHalfTransposed=" + numHalfTransposed);        int numTransposed = numHalfTransposed/2;        // System.out.println("numCommon=" + numCommon        // + " numTransposed=" + numTransposed);        double numCommonD = numCommon;        double weight = (numCommonD/len1                         + numCommonD/len2                         + (numCommon - numTransposed)/numCommonD)/3.0;        if (weight <= mWeightThreshold) return weight;        int max = Math.min(mNumChars,Math.min(cSeq1.length(),cSeq2.length()));        int pos = 0;        while (pos < max && cSeq1.charAt(pos) == cSeq2.charAt(pos))            ++pos;        if (pos == 0) return weight;        return weight + 0.1 * pos * (1.0 - weight);    }    /**     * A constant for the Jaro distance.  The value is the same as     * would be returned by the nullary constructor     * <code>JaroWinklerDistance()</code>.     *     * <p>Instances are thread safe, so this single distance instance     * may be used for all comparisons within an application.     */    public static final JaroWinklerDistance JARO_DISTANCE        = new JaroWinklerDistance();    /**     * A constant for the Jaro-Winkler distance with defaults set as     * in Winkler's papers.  The value is the same as would be     * returned by the nullary constructor     * <code>JaroWinklerDistance(0.7,4)</code>.     *     * <p>Instances are thread safe, so this single distance instance     * may be used for all comparisons within an application.     */    public static final JaroWinklerDistance JARO_WINKLER_DISTANCE        = new JaroWinklerDistance(0.70,4);}

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -