📄 IndexWriter.java
package org.apache.lucene.index;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Lock;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.AlreadyClosedException;
import org.apache.lucene.util.BitVector;

import java.io.File;
import java.io.IOException;
import java.io.PrintStream;
import java.util.List;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Set;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Iterator;
import java.util.Map.Entry;

/**
 * An <code>IndexWriter</code> creates and maintains an index.
 *
 * <p>The <code>create</code> argument to the <a href="#IndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.analysis.Analyzer, boolean)"><b>constructor</b></a> determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with <code>create=true</code> even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. There are also <a href="#IndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.analysis.Analyzer)"><b>constructors</b></a> with no <code>create</code> argument which will create a new index if there is not already an index at the provided path and otherwise open the existing index.</p>
 *
 * <p>In either case, documents are added with <a href="#addDocument(org.apache.lucene.document.Document)"><b>addDocument</b></a> and removed with <a href="#deleteDocuments(org.apache.lucene.index.Term)"><b>deleteDocuments</b></a>. A document can be updated with <a href="#updateDocument(org.apache.lucene.index.Term, org.apache.lucene.document.Document)"><b>updateDocument</b></a> (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, <a href="#close()"><b>close</b></a> should be called.</p>
 *
 * <p>These changes are buffered in memory and periodically flushed to the {@link Directory} (during the above method calls). A flush is triggered when there are enough buffered deletes (see {@link #setMaxBufferedDeleteTerms}) or enough added documents since the last flush, whichever is sooner. For the added documents, flushing is triggered either by RAM usage of the documents (see {@link #setRAMBufferSizeMB}) or the number of added documents. The default is to flush when RAM usage hits 16 MB. For best indexing speed you should flush by RAM usage with a large RAM buffer. You can also force a flush by calling {@link #flush}. When a flush occurs, both pending deletes and added documents are flushed to the index. A flush may also trigger one or more segment merges which by default run with a background thread so as not to block the addDocument calls (see <a href="#mergePolicy">below</a> for changing the {@link MergeScheduler}).</p>
 *
 * <a name="autoCommit"></a>
 * <p>The optional <code>autoCommit</code> argument to the <a href="#IndexWriter(org.apache.lucene.store.Directory, boolean, org.apache.lucene.analysis.Analyzer)"><b>constructors</b></a> controls visibility of the changes to {@link IndexReader} instances reading the same index. When this is <code>false</code>, changes are not visible until {@link #close()} is called. Note that changes will still be flushed to the {@link org.apache.lucene.store.Directory} as new files, but are not committed (no new <code>segments_N</code> file is written referencing the new files) until {@link #close} is called. If something goes terribly wrong (for example the JVM crashes) before {@link #close()}, then the index will reflect none of the changes made (it will remain in its starting state). You can also call {@link #abort()}, which closes the writer without committing any changes, and removes any index files that had been flushed but are now unreferenced. This mode is useful for preventing readers from refreshing at a bad time (for example after you've done all your deletes but before you've done your adds). It can also be used to implement simple single-writer transactional semantics ("all or none").</p>
 *
 * <p>When <code>autoCommit</code> is <code>true</code> then every flush is also a commit ({@link IndexReader} instances will see each flush as changes to the index). This is the default, to match the behavior before 2.2. When running in this mode, be careful not to refresh your readers while optimize or segment merges are taking place as this can tie up substantial disk space.</p>
 *
 * <p>Regardless of <code>autoCommit</code>, an {@link IndexReader} or {@link org.apache.lucene.search.IndexSearcher} will only see the index as of the "point in time" that it was opened. Any changes committed to the index after the reader was opened are not visible until the reader is re-opened.</p>
 *
 * <p>If an index will not have more documents added for a while and optimal search performance is desired, then the <a href="#optimize()"><b>optimize</b></a> method should be called before the index is closed.</p>
 *
 * <p>Opening an <code>IndexWriter</code> creates a lock file for the directory in use. Trying to open another <code>IndexWriter</code> on the same directory will lead to a {@link LockObtainFailedException}. The {@link LockObtainFailedException} is also thrown if an IndexReader on the same directory is used to delete documents from the index.</p>
 *
 * <a name="deletionPolicy"></a>
 * <p>Expert: <code>IndexWriter</code> allows an optional {@link IndexDeletionPolicy} implementation to be specified. You can use this to control when prior commits are deleted from the index. The default policy is {@link KeepOnlyLastCommitDeletionPolicy} which removes all prior commits as soon as a new commit is done (this matches behavior before 2.2). Creating your own policy can allow you to explicitly keep previous "point in time" commits alive in the index for some time, to allow readers to refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems like NFS that do not support "delete on last close" semantics, which Lucene's "point in time" search normally relies on.</p>
 *
 * <a name="mergePolicy"></a>
 * <p>Expert: <code>IndexWriter</code> allows you to separately change the {@link MergePolicy} and the {@link MergeScheduler}. The {@link MergePolicy} is invoked whenever there are changes to the segments in the index. Its role is to select which merges to do, if any, and return a {@link MergePolicy.MergeSpecification} describing the merges. It also selects merges to do for optimize(). (The default is {@link LogByteSizeMergePolicy}.) Then the {@link MergeScheduler} is invoked with the requested merges and it decides when and how to run the merges. The default is {@link ConcurrentMergeScheduler}.</p>
 */

/*
 * Clarification: Check Points (and commits)
 * Being able to set autoCommit=false allows IndexWriter to flush and
 * write new index files to the directory without writing a new segments_N
 * file which references these new files. It also means that the state of
 * the in memory SegmentInfos object is different than the most recent
 * segments_N file written to the directory.
 *
 * Each time the SegmentInfos is changed, and matches the (possibly
 * modified) directory files, we have a new "check point".
 * If the modified/new SegmentInfos is written to disk - as a new
 * (generation of) segments_N file - this check point is also an
 * IndexCommitPoint.
 *
 * With autoCommit=true, every checkPoint is also a CommitPoint.
 * With autoCommit=false, some checkPoints may not be commits.
 *
 * A new checkpoint always replaces the previous checkpoint and
 * becomes the new "front" of the index. This allows the IndexFileDeleter
 * to delete files that are referenced only by stale checkpoints
 * (files that were created since the last commit, but are no longer
 * referenced by the "front" of the index). For this, IndexFileDeleter
 * keeps track of the last non commit checkpoint.
 */

public class IndexWriter {

  /**
   * Default value for the write lock timeout (1,000).
   * @see #setDefaultWriteLockTimeout
   */
  public static long WRITE_LOCK_TIMEOUT = 1000;

  private long writeLockTimeout = WRITE_LOCK_TIMEOUT;

  /**
   * Name of the write lock in the index.
   */
  public static final String WRITE_LOCK_NAME = "write.lock";

  /**
   * @deprecated
   * @see LogMergePolicy#DEFAULT_MERGE_FACTOR
   */
  public final static int DEFAULT_MERGE_FACTOR = LogMergePolicy.DEFAULT_MERGE_FACTOR;

  /**
   * Value to denote a flush trigger is disabled
   */
  public final static int DISABLE_AUTO_FLUSH = -1;

  /**
   * Disabled by default (because IndexWriter flushes by RAM usage
   * by default). Change using {@link #setMaxBufferedDocs(int)}.
   */
  public final static int DEFAULT_MAX_BUFFERED_DOCS = DISABLE_AUTO_FLUSH;

  /**
   * Default value is 16 MB (which means flush when buffered
   * docs consume 16 MB RAM). Change using {@link #setRAMBufferSizeMB}.
   */
  public final static double DEFAULT_RAM_BUFFER_SIZE_MB = 16.0;

  /**
   * Disabled by default (because IndexWriter flushes by RAM usage
   * by default). Change using {@link #setMaxBufferedDeleteTerms(int)}.
   */
  public final static int DEFAULT_MAX_BUFFERED_DELETE_TERMS = DISABLE_AUTO_FLUSH;

  /**
   * @deprecated
   * @see LogDocMergePolicy#DEFAULT_MAX_MERGE_DOCS
   */
  public final static int DEFAULT_MAX_MERGE_DOCS = LogDocMergePolicy.DEFAULT_MAX_MERGE_DOCS;

  /**
   * Default value is 10,000. Change using {@link #setMaxFieldLength(int)}.
   */
  public final static int DEFAULT_MAX_FIELD_LENGTH = 10000;

  /**
   * Default value is 128. Change using {@link #setTermIndexInterval(int)}.
   */
  public final static int DEFAULT_TERM_INDEX_INTERVAL = 128;

  /**
   * Absolute hard maximum length for a term. If a term
   * arrives from the analyzer longer than this length, it
   * is skipped and a message is printed to infoStream, if
   * set (see {@link #setInfoStream}).
   */
  public final static int MAX_TERM_LENGTH = DocumentsWriter.MAX_TERM_LENGTH;

  // The normal read buffer size defaults to 1024, but
  // increasing this during merging seems to yield
  // performance gains. However we don't want to increase
  // it too much because there are quite a few
  // BufferedIndexInputs created during merging. See
  // LUCENE-888 for details.
  private final static int MERGE_READ_BUFFER_SIZE = 4096;

  // Used for printing messages
  private static Object MESSAGE_ID_LOCK = new Object();
  private static int MESSAGE_ID = 0;
  private int messageID = -1;
  volatile private boolean hitOOM;

  private Directory directory;                      // where this index resides
  private Analyzer analyzer;                        // how to analyze text

  private Similarity similarity = Similarity.getDefault(); // how to normalize

  private boolean commitPending;                    // true if segmentInfos has changes not yet committed
  private SegmentInfos rollbackSegmentInfos;        // segmentInfos we will fallback to if the commit fails

  private SegmentInfos localRollbackSegmentInfos;   // segmentInfos we will fallback to if the commit fails
  private boolean localAutoCommit;                  // saved autoCommit during local transaction

  private boolean autoCommit = true;                // false if we should commit only on close

  private SegmentInfos segmentInfos = new SegmentInfos(); // the segments
  private DocumentsWriter docWriter;
  private IndexFileDeleter deleter;

  private Set segmentsToOptimize = new HashSet();   // used by optimize to note those needing optimization

  private Lock writeLock;

  private int termIndexInterval = DEFAULT_TERM_INDEX_INTERVAL;

  private boolean closeDir;
  private boolean closed;
  private boolean closing;

  // Holds all SegmentInfo instances currently involved in
  // merges
  private HashSet mergingSegments = new HashSet();

  private MergePolicy mergePolicy = new LogByteSizeMergePolicy();
  private MergeScheduler mergeScheduler = new ConcurrentMergeScheduler();
  private LinkedList pendingMerges = new LinkedList();
  private Set runningMerges = new HashSet();
  private List mergeExceptions = new ArrayList();
  private long mergeGen;
  private boolean stopMerges;

  /**
   * Used internally to throw an {@link
   * AlreadyClosedException} if this IndexWriter has been
   * closed.
   * @throws AlreadyClosedException if this IndexWriter is closed
   */
  protected final void ensureOpen() throws AlreadyClosedException {
    if (closed) {
      throw new AlreadyClosedException("this IndexWriter is closed");
    }
  }

  /**
   * Prints a message to the infoStream (if non-null),
   * prefixed with the identifying information for this
   * writer and the thread that's calling it.
   */
  public void message(String message) {
    if (infoStream != null)
      infoStream.println("IW " + messageID + " [" + Thread.currentThread().getName() + "]: " + message);
  }

  private synchronized void setMessageID() {
    if (infoStream != null && messageID == -1) {
      synchronized(MESSAGE_ID_LOCK) {
        messageID = MESSAGE_ID++;
      }
    }
  }

  /**
   * Casts current mergePolicy to LogMergePolicy, and throws
   * an exception if the mergePolicy is not a LogMergePolicy.
   */
  private LogMergePolicy getLogMergePolicy() {
    if (mergePolicy instanceof LogMergePolicy)
      return (LogMergePolicy) mergePolicy;
    else
      throw new IllegalArgumentException("this method can only be called when the merge policy is the default LogMergePolicy");
  }

  /** <p>Get the current setting of whether newly flushed
   *  segments will use the compound file format. Note that
   *  this just returns the value previously set with
   *  setUseCompoundFile(boolean), or the default value
   *  (true). You cannot use this to query the status of
   *  previously flushed segments.</p>
   *
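/*
 * --- Usage sketch (not part of IndexWriter.java itself) ---
 * A minimal, hedged example of the workflow the class javadoc above describes:
 * open a writer (create=true starts a fresh index), add and update documents,
 * let the default RAM-based flushing and background merging do their work, then
 * optimize and close. The index path and field names ("/tmp/index", "id",
 * "contents") and the choice of StandardAnalyzer are illustrative assumptions,
 * not anything mandated by this file.
 */
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IndexWriterUsageSketch {
  public static void main(String[] args) throws Exception {
    // Open (and create) an index; readers already open on the same directory
    // keep searching their "point in time" snapshot until they re-open.
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

    // Flush by RAM usage (the default trigger, DEFAULT_RAM_BUFFER_SIZE_MB = 16 MB).
    writer.setRAMBufferSizeMB(32.0);

    // Buffered adds are flushed to the Directory periodically; merges run in
    // the background via the default ConcurrentMergeScheduler.
    Document doc = new Document();
    doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("contents", "hello lucene", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);

    // updateDocument deletes any document matching the term, then adds the new one.
    writer.updateDocument(new Term("id", "42"), doc);

    writer.optimize(); // optional: merge segments for best search performance
    writer.close();    // commits pending changes and releases write.lock
  }
}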