📄 weightededitdistance.java

📁 一个自然语言处理的Java开源工具包。LingPipe目前已有很丰富的功能
💻 JAVA
📖 第 1 页 / 共 2 页
字号:
12 下一页
/* * LingPipe v. 3.5 * Copyright (C) 2003-2008 Alias-i * * This program is licensed under the Alias-i Royalty Free License * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i * Royalty Free License Version 1 for more details. *  * You should have received a copy of the Alias-i Royalty Free License * Version 1 along with this program; if not, visit * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211, * +1 (718) 290-9170. */package com.aliasi.spell;import com.aliasi.util.Distance;import com.aliasi.util.Proximity;/** * The <code>WeightedEditDistance</code> class implements both the * proximity and distance interfaces based on the negative proximity * weights assigned to independent atomic edit operations. * * <h4>Weights Scaled as Log Probability</h4> * * <p>Weights on edit operations are scaled as log probabilities.   * Practically speaking, this means that the larger the weight, the * more likely the edit operation; keep in mind that -1 is larger than * -3, representing <code>2<sup>-1</sup> = 1/2</code> and * <code>2<sup>-3</sup> = 1/8</code> respectively on a linear * probability scale. * * <h4>Proximity and Edit Sequences</h4>  * <p>The log probability of a sequence of independent edits is the * sum of the log probabilities of the individual edits.  Proximity * between strings <code>s1</code> and <code>s2</code> is defined as * the maximum sum of edit weights over sequences of edits that * convert <code>s1</code> to <code>s2</code>. * * <p>Like the individual edit weights, proximity is scaled as * a log probability of the complete edit.  The larger the proximity, * the closer the strings; again, keep in mind that -10 is larger than * -20, representing roughly 1/1000 and 1/1,000,000 on the linear * probability scale. * * <h4>Distance is Negative Proximity</h4> * * <p>Distance is just negative proximity.  This scales edit distances * in the usual way, with distance of 3 between strings indicating they * are further away from each other than strings at distance 1.25. * * <h4>Relation to Simple Edit Distance</h4> *  * <P>This class generalizes the behavior of the class * <code>spell.EditDistance</code> without extending it in the inheritance * sense.  Weighted edit distance agrees with edit distance (up to * arithmetic precision) as a distance assuming the following weights: * match weight is 0, substitute, insert and delete weights are * <code>-1</code>, and the transposition weight is <code>-1</code> if * transpositions are allowed in the edit distance and * <code>Double.NEGATIVE_INFINITY</code> otherwise. * * <h4>Symmetry</h4> *  * <P>If the substitution and transposition weights are symmetric and * the insert and delete costs of a character are equal, then weighted * edit distance will be symmetric.   *  * <h4>Metricity</h4> * * <p>If the match weight of all * characters is zero, then the distance between a character sequence * and itself will be zero.   * <p>If transpose weights are negative infinity so that transposition is * not allowed, and if the assignment of substitution weights forms a * metric (see {@link Distance} for a definition), and if delete and * insert weights are non-negative and equal for all characters, and * if match weights are all zero, then weighted edit distance will * form a proper metric.  Other values may also form metrics, such as * a weight of -1 for all edits other than transpose. * * * <h4>Probabilistic Channel</h4> * * <p>A probabilistic relational model between strings is defined if * the weights are properly scaled as log probabilities.  Because * probabilities are between 0 and 1, log probabilities will be * between negative infinity and zero.  Proximity between two strings * <code>in</code> and <code>out</code> is defined by: * * <blockquote><pre> * proximity(in,out) * = Max<sub><sub>edit(in)=out</sub></sub> log2 P(edit) * </pre></blockquote> * * where the cost of the edit is defined to be: * * <blockquote><code> * log2 P(edit)  * <br> = log2 P(edit<sub>0</sub>,...,edit<sub>n-1</sub>) * <br> ~ log2 P(edit<sub>0</sub>) + ... + log P(edit<sub>n-1</sub>) * </code></blockquote> * * The last line is an approximation assuming edits are * independent. *  * <p>In order to create a proper probabilistic channel, exponentiated * edit weights must sum to 1.0.  This is not technically possible * with a local model if transposition is allowed, because of boundary * conditions and independence assumptions. *  * It is possible to define a proper channel if transposition is off, * and if all edit weights for a position (including all sequences of * arbitrarily long insertions) sum to 1.0.  In particular, if any * edits at all are allowed (have finite weights), then there must be * a non-zero weight assigned to matching, otherwise exponentiated * edit weight sum would exceed 1.0.  It is always possible to add an * offset to normalize the values to a probability model (the offset * will be negative if the sum exceeds 1.0 and positive if it falls * below 1.0 and zero otherwise). * <p>A fully probabilistic model would have to take the sum over all * edits rather than the maximum.  This class makes the so-called * Viterbi approximation, assuming the full probability is close to * that of the best probability, or at least proportional to it. *  *  * @author  Bob Carpenter * @version 3.0 * @since   LingPipe2.0 */public abstract class WeightedEditDistance     implements Distance<CharSequence>,               Proximity<CharSequence> {    /**     * Construct a weighted edit distance.     */    public WeightedEditDistance() {        /* do nothing */    }    /**     * Returns the weighted edit distance between the specified     * character sequences.  If the edit distances are interpreted as     * entropies, this distance may be interpreted as the entropy of     * the best edit path converting the input character sequence to     * the output sequence.  The first argument is taken to be the     * input and the second argument the output.     *     * <p>This method is thread     * safe and may be accessed concurrently if the abstract weighting     * methods are thread safe.     *     * @param csIn First character sequence.     * @param csOut Second character sequence.     * @return The edit distance between the sequences.     */    public double distance(CharSequence csIn, CharSequence csOut) {        return -proximity(csIn,csOut);    }    /**     * Returns the weighted proximity between the specified character     * sequences. The first argument is taken to be the input and the     * second argument the output.     *     * <p>This method is thread safe and may be accessed concurrently     * if the abstract weighting methods are thread safe.     *     * @param csIn First character sequence.     * @param csOut Second character sequence.     * @return The edit distance between the sequences.     */    public double proximity(CharSequence csIn, CharSequence csOut) {        return distance(csIn,csOut,true);    }    /**     * Returns the weighted edit distance between the specified     * character sequences ordering according to the specified     * similarity ordering.  The first argument is taken to     * be the input and the second argument the output.      * If the boolean flag for similarity is set to <code>true</code>,     * the distance is treated as a similarity measure, where     * larger values are closer; if it is <code>false</code>,      * smaller values are closer.     *     * <p>This method is thread safe and may be accessed concurrently     * if the abstract weighting methods are thread safe.     *     * @param csIn First character sequence.     * @param csOut Second character sequence.
12 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -