📄 estimatortrie.java

📁 一个自然语言处理的Java开源工具包。LingPipe目前已有很丰富的功能
💻 JAVA
字号:
/* * LingPipe v. 3.5 * Copyright (C) 2003-2008 Alias-i * * This program is licensed under the Alias-i Royalty Free License * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i * Royalty Free License Version 1 for more details. *  * You should have received a copy of the Alias-i Royalty Free License * Version 1 along with this program; if not, visit * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211, * +1 (718) 290-9170. */package com.aliasi.chunk;import java.io.IOException;import java.io.ObjectInput;/** * An estimator trie is a data structure for compactly representing * the contexts and outcomes in an estimator at varying levels * of detail to allow for the computation of smoothed outcomes at * runtime. * * <p> The data file format for serialization of an estimator trie, as * read by the constructor {@link #EstimatorTrie(ObjectInput)} and * as written, for example, by {@link * com.aliasi.ne.TrainableEstimator#writeTo(java.io.DataOutputStream)}. * * <br/><br/> * <table cellpadding="5" border="1"> *   <tr> *     <td width="15%"><b>Count</b></td> *     <td width="15%"><b>Type</b></td> *     <td width="30%"><b>Variable</b></td> *     <td width="40%"><b>Content</b></td> *   </tr> *   <tr> *     <td>1</td> *     <td>int</td> *     <td>numNodes</td> *     <td>Number of nodes in trie</td> *   </tr> *   <tr> *     <td rowspan="5">numNodes</td> *     <td>int</td> *     <td>nodeSymbol</td> *     <td>Symbol identifier for node</td> *   </tr> *   <tr> *     <td>int</td> *     <td>nodeFirstOutcomeIndex</td> *     <td>Index in outcome array of first outcome</td> *   </tr> *   <tr> *     <td>int</td> *     <td>nodeFirstChild</td> *     <td>Index in this array of first child node</td> *   </tr> *   <tr> *     <td>float</td> *     <td>nodeLogOneMinusLambda</td> *     <td>log (1-lambda) for this node</td> *   </tr> *   <tr> *     <td>int</td> *     <td>nodeBackoff</td> *     <td>Index in this array of backoff node, or -1 if none</td> *   </tr> *   <tr> *     <td>1</td> *     <td>int</td> *     <td>numOutcomes</td> *     <td>Number of outcomes in trie</td> *   </tr> *   <tr> *     <td rowspan="3">numOutcomes</td> *     <td>int</td> *     <td>outcomeSymbol</td> *     <td>Symbol identifier for outcome</td> *   </tr> *   <tr> *     <td>int</td> *     <td>outcomeLogEstimate</td> *     <td>log probability estimate for this outcome</td> *   </tr> * </table> * </br> * * The parallel arrays are encoded one index at a time rather than one * whole array at a time.  All values are stored in local variables * with the given names, with the context node and outcome array * entries being indexed by the node number and outcome number * respectively. * </p> * * <p> The parallel arrays are used for lookup as follows.  A context * for estimating outcomes consists of a sequence of events, * represented here as symbols.  The events to be kept the longest in * smoothing are at the front of the sequence and are looked up first. * The first entry (index <code>0</code>) in the nodes array * constitutes the root node or null context.  It may or may not have * any outcomes on it.  The algorithm for evaluation follows the * sequence of context events, stopping only when extending the * context is not possible.  That node represents the most specific * context for which estimates can be made, and may not actually use * all of the contextual events if they haven't been seen, or seen * frequently enough, during training.  From this most specific node, * an estimate is made for the outcome event in question.  The * outcomes for the most specific context node are searched for the * outcome event being estimated.  If it is found, the log estimate on * that outcome is returned.  If it is not found, the backoff pointer * from the most specific node is followed, and the evaluation is done * again from the more general node, adding the <code>log * (1-lambda)</code> value as required by the smoothing model.  If a * backoff is attempted from a most general node without a backoff * node, an estimate of <code>log 0.0 = Double.NaN</code> is returned, * or in the case of estimation with a specified uniform distribution, * the log of the uniform estimate is added to the <code>log * (1-lambda)</code> and returned. * </p> * * <p> Lookup in the arrays is carried out by binary search over the * arrays of symbol identifiers.  For searching children in thre trie, * <code>nodeFirstChild[i]</code> is the index of the first child node * of node <code>i</code>, and <code>nodeFirstChild[i+1]</code> is the * index of the first child of node <code>i+1</code>, and so on. * Therefore, the binary search for children of <code>i</code> is * carried out over indices <code>j</code> such that <code> * nodeFirstChild[i] <= j < nodeFirstChild[i+1]</code>.  This requires * the node symbol ids to be in sorted order in these ranges.  The * outcomes are searched similarly, with outcomes for node * <code>i</code> searched on indices <code>j</code> such that * <code>nodeFirstOutcomeIndex[i] <= j < * nodeFirstOutcomeIndex[i+1]</code>. * </p> * * <p> The methods provided for estimation require the callers of * these classes to walk the trie.  This avoids unnecessary * intermediate object create to call the trie, and allows the tries * themselves to be flexible with respect to their geometry. * Given a context node (identified by index) and a symbol (by integer * identifier) for an event, the (index of a) more specific context * node that includes the symbol can be retrieved if it exists.  Given * a particular context node (identified by index), and a symbol * representing an outcome (identified by an integer from a symbol * table), the log estimate from the given context is returned by * {@link #estimateFromNode(int,int)}.  If the outcome doesn't exist * at any level of backoff, <code>Double.NaN</code> is returned.  The * method {@link #estimateFromNodeUniform(int,int,double)} includes a * double value to return instead of <code>Double.NaN</code> as a * uniform estimate.  </p> * * @author  Bob Carpenter * @version 1.0 * @since   LingPipe1.0 * @see OutcomeCounter * @see Node */class EstimatorTrie {    // Parallel arrays for trie nodes.    /** Number of nodes in the trie, indexing the parallel trie     * arrays.     */    private final int _numNodes;    /** Elements hold the identifiers for the symbol at the     * element.     */    private final int[] _nodeSymbol;    /** Elements hold the index in the outcome arrays of the first     * outcome.  One longer than other arrays, with final element     * holding the length for symmetry in all the bounded loops.     */    private final int[] _nodeFirstOutcome;    /** Elements hold the index in these arrays of the node that     * is the first child of this node in the trie structure.  One     * longer than other arrays, with final element holding the     * length for symmetry in all the bounded loops.     */    private final int[] _nodeFirstChild;    /** Elements hold the smoothing constant, <code>log 1 -     * lambda(context)</code>.     */    private final float[] _nodeLogOneMinusLambda;    /** Elements hold the index of the backoff node to check next     * for the outcome saught.  Will hold <code>-1</code> if there     * is no backoff node for the node.     */    private final int[] _nodeBackoff;    // Parallel arrays for outcomes given contexts.    /** Number of outcomes total, indexing the parallel outcome     * arrays.     */    private final int _numOutcomes;    /** Elements hold the identifier of the symbol represented by     * the outcome.     */    private final int[] _outcomeSymbol;    /** Elements hold the log estimate of the outcome.     */    private final float[] _outcomeLogEstimate;    /** Create an estimator trie from a data input stream.  Does     * not close the stream so that an estimator trie may be read     * out of a portion of a stream.     *     * @param in Data input stream from which to read this estimator's data.     * @throws IOException If there is an exception reading from the     * underlying stream.     */    public EstimatorTrie(ObjectInput in) throws IOException {        _numNodes = in.readInt();        _nodeSymbol = new int[_numNodes];        _nodeFirstOutcome = new int[_numNodes+1];        _nodeFirstChild = new int[_numNodes+1];        _nodeLogOneMinusLambda = new float[_numNodes];        _nodeBackoff = new int[_numNodes];        for (int i = 0; i < _numNodes; ++i) {            _nodeSymbol[i] = in.readInt();            _nodeFirstOutcome[i] = in.readInt();            _nodeFirstChild[i] = in.readInt();            _nodeLogOneMinusLambda[i] = in.readFloat();            _nodeBackoff[i] = in.readInt();        }        _nodeFirstChild[_numNodes] = _numNodes; // boundary        _numOutcomes = in.readInt();        _nodeFirstOutcome[_numNodes] = _numOutcomes; // boundary        _outcomeSymbol = new int[_numOutcomes];        _outcomeLogEstimate = new float[_numOutcomes];        for (int i = 0; i < _numOutcomes; ++i) {            _outcomeSymbol[i] = in.readInt();            _outcomeLogEstimate[i] = in.readFloat();        }    }    /**     * Returns the log estimate of the given outcome symbol from     * the specified node, or {@link java.lang.Double#NaN} if     * there is none.     *     * @param symbolID Outcome whose estimate is sought.     * @param nodeIndex Index of node representing context of     * estimate.     * @return Log estimate of outcome symbol given node index or     * <code>Double.NaN</code> if impossible.     */    public double estimateFromNode(int symbolID, int nodeIndex) {        if (symbolID < 0) return Double.NaN;        double backoffAccumulator = 0.0;        for(int currentNodeIndex = nodeIndex;            currentNodeIndex >= 0;            currentNodeIndex = _nodeBackoff[currentNodeIndex]) {            int low = _nodeFirstOutcome[currentNodeIndex];            int high = _nodeFirstOutcome[currentNodeIndex+1]-1;            while (low <= high) {                int mid = (high + low)/2;                if (_outcomeSymbol[mid] == symbolID)                    return backoffAccumulator + _outcomeLogEstimate[mid];                else if (_outcomeSymbol[mid] < symbolID)                    low = (low == mid ? mid+1 : mid);                else high = (high == mid ? mid-1 : mid);            }            backoffAccumulator += _nodeLogOneMinusLambda[currentNodeIndex];        }        return Double.NaN; // no more backoff nodes available    }    /**     * Returns the log estimate of the given outcome symbol from     * the specified node, or {@link java.lang.Double#NaN} if     * there is none.  Just like {@link #estimateFromNode} except     * that final backoff will be to a uniform estimate that is     * supplied as an argument.     *     * @param symbolID Outcome whose estimate is sought.     * @param nodeIndex Index of node representing context of estimate.     * @return Log estimate of outcome symbol given node index or     * <code>Double.NaN</code> if impossible.     */    public double estimateFromNodeUniform(int symbolID, int nodeIndex,                                          double uniformEstimate) {        if (symbolID < 0) return Double.NaN;        double backoffAccumulator = 0.0;        for (int currentNodeIndex = nodeIndex;             currentNodeIndex >= 0;             currentNodeIndex = _nodeBackoff[currentNodeIndex]) {            int low = _nodeFirstOutcome[currentNodeIndex];            int high = _nodeFirstOutcome[currentNodeIndex+1]-1;            while (low <= high) {                int mid = (high + low)/2;                if (_outcomeSymbol[mid] == symbolID)                    return backoffAccumulator + _outcomeLogEstimate[mid];                else if (_outcomeSymbol[mid] < symbolID)                    low = (low == mid ? mid+1 : mid);                else high = (high == mid ? mid-1 : mid);            }            backoffAccumulator += _nodeLogOneMinusLambda[currentNodeIndex];        }        return (backoffAccumulator + uniformEstimate);    }    /**     * Returns the index in the trie of the child node of the     * specified parent node index having the specified symbol ID     * or <code>-1</code> if it does not exist.     *     * @param symbolID Identifier of symbol of the child sought.     * @param parentNodeIndex Index of parent node from which to     * search for a child with specified identifier.     * @return Index of the child node of the specified parent     * node with specified symbol identifier.     */    public int lookupChild(int symbolID, int parentNodeIndex) {        int low = _nodeFirstChild[parentNodeIndex];        int high = _nodeFirstChild[parentNodeIndex+1]-1;        if (symbolID < 0) return -1;        while (low <= high) {            int mid = (high + low)/2;            if (_nodeSymbol[mid] == symbolID) return mid;            else if (_nodeSymbol[mid] < symbolID)                low = (low == mid ? mid+1 : mid);            else high = (high == mid ? mid-1 : mid);        }        return -1;    }}
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -