📄 weightededitdistance.java
字号:
/* * LingPipe v. 3.5 * Copyright (C) 2003-2008 Alias-i * * This program is licensed under the Alias-i Royalty Free License * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i * Royalty Free License Version 1 for more details. * * You should have received a copy of the Alias-i Royalty Free License * Version 1 along with this program; if not, visit * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211, * +1 (718) 290-9170. */package com.aliasi.spell;import com.aliasi.util.Distance;import com.aliasi.util.Proximity;/** * The <code>WeightedEditDistance</code> class implements both the * proximity and distance interfaces based on the negative proximity * weights assigned to independent atomic edit operations. * * <h4>Weights Scaled as Log Probability</h4> * * <p>Weights on edit operations are scaled as log probabilities. * Practically speaking, this means that the larger the weight, the * more likely the edit operation; keep in mind that -1 is larger than * -3, representing <code>2<sup>-1</sup> = 1/2</code> and * <code>2<sup>-3</sup> = 1/8</code> respectively on a linear * probability scale. * * <h4>Proximity and Edit Sequences</h4> * <p>The log probability of a sequence of independent edits is the * sum of the log probabilities of the individual edits. Proximity * between strings <code>s1</code> and <code>s2</code> is defined as * the maximum sum of edit weights over sequences of edits that * convert <code>s1</code> to <code>s2</code>. * * <p>Like the individual edit weights, proximity is scaled as * a log probability of the complete edit. The larger the proximity, * the closer the strings; again, keep in mind that -10 is larger than * -20, representing roughly 1/1000 and 1/1,000,000 on the linear * probability scale. * * <h4>Distance is Negative Proximity</h4> * * <p>Distance is just negative proximity. This scales edit distances * in the usual way, with distance of 3 between strings indicating they * are further away from each other than strings at distance 1.25. * * <h4>Relation to Simple Edit Distance</h4> * * <P>This class generalizes the behavior of the class * <code>spell.EditDistance</code> without extending it in the inheritance * sense. Weighted edit distance agrees with edit distance (up to * arithmetic precision) as a distance assuming the following weights: * match weight is 0, substitute, insert and delete weights are * <code>-1</code>, and the transposition weight is <code>-1</code> if * transpositions are allowed in the edit distance and * <code>Double.NEGATIVE_INFINITY</code> otherwise. * * <h4>Symmetry</h4> * * <P>If the substitution and transposition weights are symmetric and * the insert and delete costs of a character are equal, then weighted * edit distance will be symmetric. * * <h4>Metricity</h4> * * <p>If the match weight of all * characters is zero, then the distance between a character sequence * and itself will be zero. * <p>If transpose weights are negative infinity so that transposition is * not allowed, and if the assignment of substitution weights forms a * metric (see {@link Distance} for a definition), and if delete and * insert weights are non-negative and equal for all characters, and * if match weights are all zero, then weighted edit distance will * form a proper metric. Other values may also form metrics, such as * a weight of -1 for all edits other than transpose. * * * <h4>Probabilistic Channel</h4> * * <p>A probabilistic relational model between strings is defined if * the weights are properly scaled as log probabilities. Because * probabilities are between 0 and 1, log probabilities will be * between negative infinity and zero. Proximity between two strings * <code>in</code> and <code>out</code> is defined by: * * <blockquote><pre> * proximity(in,out) * = Max<sub><sub>edit(in)=out</sub></sub> log2 P(edit) * </pre></blockquote> * * where the cost of the edit is defined to be: * * <blockquote><code> * log2 P(edit) * <br> = log2 P(edit<sub>0</sub>,...,edit<sub>n-1</sub>) * <br> ~ log2 P(edit<sub>0</sub>) + ... + log P(edit<sub>n-1</sub>) * </code></blockquote> * * The last line is an approximation assuming edits are * independent. * * <p>In order to create a proper probabilistic channel, exponentiated * edit weights must sum to 1.0. This is not technically possible * with a local model if transposition is allowed, because of boundary * conditions and independence assumptions. * * It is possible to define a proper channel if transposition is off, * and if all edit weights for a position (including all sequences of * arbitrarily long insertions) sum to 1.0. In particular, if any * edits at all are allowed (have finite weights), then there must be * a non-zero weight assigned to matching, otherwise exponentiated * edit weight sum would exceed 1.0. It is always possible to add an * offset to normalize the values to a probability model (the offset * will be negative if the sum exceeds 1.0 and positive if it falls * below 1.0 and zero otherwise). * <p>A fully probabilistic model would have to take the sum over all * edits rather than the maximum. This class makes the so-called * Viterbi approximation, assuming the full probability is close to * that of the best probability, or at least proportional to it. * * * @author Bob Carpenter * @version 3.0 * @since LingPipe2.0 */public abstract class WeightedEditDistance implements Distance<CharSequence>, Proximity<CharSequence> { /** * Construct a weighted edit distance. */ public WeightedEditDistance() { /* do nothing */ } /** * Returns the weighted edit distance between the specified * character sequences. If the edit distances are interpreted as * entropies, this distance may be interpreted as the entropy of * the best edit path converting the input character sequence to * the output sequence. The first argument is taken to be the * input and the second argument the output. * * <p>This method is thread * safe and may be accessed concurrently if the abstract weighting * methods are thread safe. * * @param csIn First character sequence. * @param csOut Second character sequence. * @return The edit distance between the sequences. */ public double distance(CharSequence csIn, CharSequence csOut) { return -proximity(csIn,csOut); } /** * Returns the weighted proximity between the specified character * sequences. The first argument is taken to be the input and the * second argument the output. * * <p>This method is thread safe and may be accessed concurrently * if the abstract weighting methods are thread safe. * * @param csIn First character sequence. * @param csOut Second character sequence. * @return The edit distance between the sequences. */ public double proximity(CharSequence csIn, CharSequence csOut) { return distance(csIn,csOut,true); } /** * Returns the weighted edit distance between the specified * character sequences ordering according to the specified * similarity ordering. The first argument is taken to * be the input and the second argument the output. * If the boolean flag for similarity is set to <code>true</code>, * the distance is treated as a similarity measure, where * larger values are closer; if it is <code>false</code>, * smaller values are closer. * * <p>This method is thread safe and may be accessed concurrently * if the abstract weighting methods are thread safe. * * @param csIn First character sequence. * @param csOut Second character sequence.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -