📄 IndexWriter.java
package org.apache.lucene.index;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Lock;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.AlreadyClosedException;
import org.apache.lucene.util.BitVector;

import java.io.File;
import java.io.IOException;
import java.io.PrintStream;
import java.util.List;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Set;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Iterator;
import java.util.Map.Entry;

/**
 * An <code>IndexWriter</code> creates and maintains an index.
 *
 * <p>The <code>create</code> argument to the <a href="#IndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.analysis.Analyzer, boolean)"><b>constructor</b></a> determines whether a new index is created, or whether an existing index is opened. Note that you can open an index with <code>create=true</code> even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open. There are also <a href="#IndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.analysis.Analyzer)"><b>constructors</b></a> with no <code>create</code> argument which will create a new index if there is not already an index at the provided path and otherwise open the existing index.</p>
 *
 * <p>In either case, documents are added with <a href="#addDocument(org.apache.lucene.document.Document)"><b>addDocument</b></a> and removed with <a href="#deleteDocuments(org.apache.lucene.index.Term)"><b>deleteDocuments</b></a>. A document can be updated with <a href="#updateDocument(org.apache.lucene.index.Term, org.apache.lucene.document.Document)"><b>updateDocument</b></a> (which just deletes and then adds the entire document). When finished adding, deleting and updating documents, <a href="#close()"><b>close</b></a> should be called.</p>
 *
 * <p>These changes are buffered in memory and periodically flushed to the {@link Directory} (during the above method calls). A flush is triggered when there are enough buffered deletes (see {@link #setMaxBufferedDeleteTerms}) or enough added documents since the last flush, whichever is sooner. For the added documents, flushing is triggered either by RAM usage of the documents (see {@link #setRAMBufferSizeMB}) or the number of added documents. The default is to flush when RAM usage hits 16 MB. For best indexing speed you should flush by RAM usage with a large RAM buffer. You can also force a flush by calling {@link #flush}. When a flush occurs, both pending deletes and added documents are flushed to the index. A flush may also trigger one or more segment merges which by default run with a background thread so as not to block the addDocument calls (see <a href="#mergePolicy">below</a> for changing the {@link MergeScheduler}).</p>
 *
 * <a name="autoCommit"></a>
 * <p>The optional <code>autoCommit</code> argument to the <a href="#IndexWriter(org.apache.lucene.store.Directory, boolean, org.apache.lucene.analysis.Analyzer)"><b>constructors</b></a> controls visibility of the changes to {@link IndexReader} instances reading the same index. When this is <code>false</code>, changes are not visible until {@link #close()} is called. Note that changes will still be flushed to the {@link org.apache.lucene.store.Directory} as new files, but are not committed (no new <code>segments_N</code> file is written referencing the new files) until {@link #close} is called. If something goes terribly wrong (for example the JVM crashes) before {@link #close()}, then the index will reflect none of the changes made (it will remain in its starting state). You can also call {@link #abort()}, which closes the writer without committing any changes, and removes any index files that had been flushed but are now unreferenced. This mode is useful for preventing readers from refreshing at a bad time (for example after you've done all your deletes but before you've done your adds). It can also be used to implement simple single-writer transactional semantics ("all or none").</p>
 *
 * <p>When <code>autoCommit</code> is <code>true</code> then every flush is also a commit ({@link IndexReader} instances will see each flush as changes to the index). This is the default, to match the behavior before 2.2. When running in this mode, be careful not to refresh your readers while optimize or segment merges are taking place as this can tie up substantial disk space.</p>
 *
 * <p>Regardless of <code>autoCommit</code>, an {@link IndexReader} or {@link org.apache.lucene.search.IndexSearcher} will only see the index as of the "point in time" that it was opened. Any changes committed to the index after the reader was opened are not visible until the reader is re-opened.</p>
 *
 * <p>If an index will not have more documents added for a while and optimal search performance is desired, then the <a href="#optimize()"><b>optimize</b></a> method should be called before the index is closed.</p>
 *
 * <p>Opening an <code>IndexWriter</code> creates a lock file for the directory in use. Trying to open another <code>IndexWriter</code> on the same directory will lead to a {@link LockObtainFailedException}. The {@link LockObtainFailedException} is also thrown if an IndexReader on the same directory is used to delete documents from the index.</p>
 *
 * <a name="deletionPolicy"></a>
 * <p>Expert: <code>IndexWriter</code> allows an optional {@link IndexDeletionPolicy} implementation to be specified. You can use this to control when prior commits are deleted from the index. The default policy is {@link KeepOnlyLastCommitDeletionPolicy} which removes all prior commits as soon as a new commit is done (this matches behavior before 2.2). Creating your own policy can allow you to explicitly keep previous "point in time" commits alive in the index for some time, to allow readers to refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems like NFS that do not support "delete on last close" semantics, which Lucene's "point in time" search normally relies on.</p>
 *
 * <a name="mergePolicy"></a>
 * <p>Expert: <code>IndexWriter</code> allows you to separately change the {@link MergePolicy} and the {@link MergeScheduler}. The {@link MergePolicy} is invoked whenever there are changes to the segments in the index. Its role is to select which merges to do, if any, and return a {@link MergePolicy.MergeSpecification} describing the merges. It also selects merges to do for optimize(). (The default is {@link LogByteSizeMergePolicy}.) Then the {@link MergeScheduler} is invoked with the requested merges and it decides when and how to run the merges. The default is {@link ConcurrentMergeScheduler}.</p>
 */

/*
 * Clarification: Check Points (and commits)
 * Being able to set autoCommit=false allows IndexWriter to flush and
 * write new index files to the directory without writing a new segments_N
 * file which references these new files. It also means that the state of
 * the in memory SegmentInfos object is different than the most recent
 * segments_N file written to the directory.
 *
 * Each time the SegmentInfos is changed, and matches the (possibly
 * modified) directory files, we have a new "check point".
 * If the modified/new SegmentInfos is written to disk - as a new
 * (generation of) segments_N file - this check point is also an
 * IndexCommitPoint.
 *
 * With autoCommit=true, every checkPoint is also a CommitPoint.
 * With autoCommit=false, some checkPoints may not be commits.
 *
 * A new checkpoint always replaces the previous checkpoint and
 * becomes the new "front" of the index. This allows the IndexFileDeleter
 * to delete files that are referenced only by stale checkpoints
 * (files that were created since the last commit, but are no longer
 * referenced by the "front" of the index). For this, IndexFileDeleter
 * keeps track of the last non commit checkpoint.
 */

public class IndexWriter {

  /**
   * Default value for the write lock timeout (1,000).
   * @see #setDefaultWriteLockTimeout
   */
  public static long WRITE_LOCK_TIMEOUT = 1000;

  private long writeLockTimeout = WRITE_LOCK_TIMEOUT;

  /**
   * Name of the write lock in the index.
   */
  public static final String WRITE_LOCK_NAME = "write.lock";

  /**
   * @deprecated
   * @see LogMergePolicy#DEFAULT_MERGE_FACTOR
   */
  public final static int DEFAULT_MERGE_FACTOR = LogMergePolicy.DEFAULT_MERGE_FACTOR;

  /**
   * Value to denote a flush trigger is disabled
   */
  public final static int DISABLE_AUTO_FLUSH = -1;

  /**
   * Disabled by default (because IndexWriter flushes by RAM usage
   * by default). Change using {@link #setMaxBufferedDocs(int)}.
   */
  public final static int DEFAULT_MAX_BUFFERED_DOCS = DISABLE_AUTO_FLUSH;

  /**
   * Default value is 16 MB (which means flush when buffered
   * docs consume 16 MB RAM). Change using {@link #setRAMBufferSizeMB}.
   */
  public final static double DEFAULT_RAM_BUFFER_SIZE_MB = 16.0;

  /**
   * Disabled by default (because IndexWriter flushes by RAM usage
   * by default). Change using {@link #setMaxBufferedDeleteTerms(int)}.
   */
  public final static int DEFAULT_MAX_BUFFERED_DELETE_TERMS = DISABLE_AUTO_FLUSH;

  /**
   * @deprecated
   * @see LogDocMergePolicy#DEFAULT_MAX_MERGE_DOCS
   */
  public final static int DEFAULT_MAX_MERGE_DOCS = LogDocMergePolicy.DEFAULT_MAX_MERGE_DOCS;

  /**
   * Default value is 10,000. Change using {@link #setMaxFieldLength(int)}.
   */
  public final static int DEFAULT_MAX_FIELD_LENGTH = 10000;

  /**
   * Default value is 128. Change using {@link #setTermIndexInterval(int)}.
   */
  public final static int DEFAULT_TERM_INDEX_INTERVAL = 128;

  /**
   * Absolute hard maximum length for a term. If a term
   * arrives from the analyzer longer than this length, it
   * is skipped and a message is printed to infoStream, if
   * set (see {@link #setInfoStream}).
   */
  public final static int MAX_TERM_LENGTH = DocumentsWriter.MAX_TERM_LENGTH;

  // The normal read buffer size defaults to 1024, but
  // increasing this during merging seems to yield
  // performance gains. However we don't want to increase
  // it too much because there are quite a few
  // BufferedIndexInputs created during merging. See
  // LUCENE-888 for details.
  private final static int MERGE_READ_BUFFER_SIZE = 4096;

  // Used for printing messages
  private static Object MESSAGE_ID_LOCK = new Object();
  private static int MESSAGE_ID = 0;
  private int messageID = -1;
  volatile private boolean hitOOM;

  private Directory directory;                      // where this index resides
  private Analyzer analyzer;                        // how to analyze text

  private Similarity similarity = Similarity.getDefault(); // how to normalize

  private boolean commitPending;                    // true if segmentInfos has changes not yet committed
  private SegmentInfos rollbackSegmentInfos;        // segmentInfos we will fallback to if the commit fails

  private SegmentInfos localRollbackSegmentInfos;   // segmentInfos we will fallback to if the commit fails
  private boolean localAutoCommit;                  // saved autoCommit during local transaction

  private boolean autoCommit = true;                // false if we should commit only on close

  private SegmentInfos segmentInfos = new SegmentInfos(); // the segments
  private DocumentsWriter docWriter;
  private IndexFileDeleter deleter;

  private Set segmentsToOptimize = new HashSet();   // used by optimize to note those needing optimization

  private Lock writeLock;

  private int termIndexInterval = DEFAULT_TERM_INDEX_INTERVAL;

  private boolean closeDir;
  private boolean closed;
  private boolean closing;

  // Holds all SegmentInfo instances currently involved in
  // merges
  private HashSet mergingSegments = new HashSet();

  private MergePolicy mergePolicy = new LogByteSizeMergePolicy();
  private MergeScheduler mergeScheduler = new ConcurrentMergeScheduler();
  private LinkedList pendingMerges = new LinkedList();
  private Set runningMerges = new HashSet();
  private List mergeExceptions = new ArrayList();
  private long mergeGen;
  private boolean stopMerges;

  /**
   * Used internally to throw an {@link
   * AlreadyClosedException} if this IndexWriter has been
   * closed.
   * @throws AlreadyClosedException if this IndexWriter is closed
   */
  protected final void ensureOpen() throws AlreadyClosedException {
    if (closed) {
      throw new AlreadyClosedException("this IndexWriter is closed");
    }
  }

  /**
   * Prints a message to the infoStream (if non-null),
   * prefixed with the identifying information for this
   * writer and the thread that's calling it.
   */
  public void message(String message) {
    if (infoStream != null)
      infoStream.println("IW " + messageID + " [" + Thread.currentThread().getName() + "]: " + message);
  }

  private synchronized void setMessageID() {
    if (infoStream != null && messageID == -1) {
      synchronized(MESSAGE_ID_LOCK) {
        messageID = MESSAGE_ID++;
      }
    }
  }

  /**
   * Casts current mergePolicy to LogMergePolicy, and throws
   * an exception if the mergePolicy is not a LogMergePolicy.
   */
  private LogMergePolicy getLogMergePolicy() {
    if (mergePolicy instanceof LogMergePolicy)
      return (LogMergePolicy) mergePolicy;
    else
      throw new IllegalArgumentException("this method can only be called when the merge policy is the default LogMergePolicy");
  }

  /** <p>Get the current setting of whether newly flushed
   *  segments will use the compound file format. Note that
   *  this just returns the value previously set with
   *  setUseCompoundFile(boolean), or the default value
   *  (true). You cannot use this to query the status of
   *  previously flushed segments.</p>
   *
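/*
 * --- Usage sketch (not part of IndexWriter.java itself) ---
 * A minimal, hedged example of the workflow the class javadoc above describes:
 * open a writer (create=true starts a fresh index), add and update documents,
 * let the default RAM-based flushing and background merging do their work, then
 * optimize and close. The index path and field names ("/tmp/index", "id",
 * "contents") and the choice of StandardAnalyzer are illustrative assumptions,
 * not anything mandated by this file.
 */
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IndexWriterUsageSketch {
  public static void main(String[] args) throws Exception {
    // Open (and create) an index; readers already open on the same directory
    // keep searching their "point in time" snapshot until they re-open.
    IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

    // Flush by RAM usage (the default trigger, DEFAULT_RAM_BUFFER_SIZE_MB = 16 MB).
    writer.setRAMBufferSizeMB(32.0);

    // Buffered adds are flushed to the Directory periodically; merges run in
    // the background via the default ConcurrentMergeScheduler.
    Document doc = new Document();
    doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("contents", "hello lucene", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);

    // updateDocument deletes any document matching the term, then adds the new one.
    writer.updateDocument(new Term("id", "42"), doc);

    writer.optimize(); // optional: merge segments for best search performance
    writer.close();    // commits pending changes and releases write.lock
  }
}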