
indexwriter.java

Chinese word segmentation: a modified version of the Chinese Academy of Sciences segmenter, implemented by calling the native DLL from Java.

Language: Java
      writeLock = null;
    }
  }

  /** Returns the Directory used by this index. */
  public Directory getDirectory() {
    return directory;
  }

  /** Returns the analyzer used by this index. */
  public Analyzer getAnalyzer() {
    return analyzer;
  }

  /** Returns the number of documents currently in this index. */
  public synchronized int docCount() {
    int count = 0;
    for (int i = 0; i < segmentInfos.size(); i++) {
      SegmentInfo si = segmentInfos.info(i);
      count += si.docCount;
    }
    return count;
  }

  /**
   * The maximum number of terms that will be indexed for a single field in a
   * document.  This limits the amount of memory required for indexing, so that
   * collections with very large files will not crash the indexing process by
   * running out of memory.<p/>
   * Note that this effectively truncates large documents, excluding from the
   * index terms that occur further in the document.  If you know your source
   * documents are large, be sure to set this value high enough to accommodate
   * the expected size.  If you set it to Integer.MAX_VALUE, then the only limit
   * is your memory, but you should anticipate an OutOfMemoryError.<p/>
   * By default, no more than 10,000 terms will be indexed for a field.
   */
  private int maxFieldLength = DEFAULT_MAX_FIELD_LENGTH;

  /**
   * Adds a document to this index.  If the document contains more than
   * {@link #setMaxFieldLength(int)} terms for a given field, the remainder are
   * discarded.
   */
  public void addDocument(Document doc) throws IOException {
    addDocument(doc, analyzer);
  }

  /**
   * Adds a document to this index, using the provided analyzer instead of the
   * value of {@link #getAnalyzer()}.  If the document contains more than
   * {@link #setMaxFieldLength(int)} terms for a given field, the remainder are
   * discarded.
   */
  public void addDocument(Document doc, Analyzer analyzer) throws IOException {
    DocumentWriter dw = new DocumentWriter(ramDirectory, analyzer, this);
    dw.setInfoStream(infoStream);
    String segmentName = newSegmentName();
    dw.addDocument(segmentName, doc);
    synchronized (this) {
      segmentInfos.addElement(new SegmentInfo(segmentName, 1, ramDirectory));
      maybeMergeSegments();
    }
  }

  final int getSegmentsCounter() {
    return segmentInfos.counter;
  }

  private final synchronized String newSegmentName() {
    return "_" + Integer.toString(segmentInfos.counter++, Character.MAX_RADIX);
  }
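  // Note: addDocument() above writes each document as its own single-document
  // segment in ramDirectory and then calls maybeMergeSegments(), so every
  // added document first lives in RAM and is folded into larger segments
  // incrementally.  Indexing speed is therefore tuned through the
  // mergeFactor / minMergeDocs fields below.  A hedged sketch, assuming the
  // setter methods that this class's javadoc refers to (setMaxFieldLength
  // and its siblings) are defined on the portion of the class not shown here:
  //
  //   writer.setMergeFactor(100);   // batch indexing: fewer, larger merges
  //   writer.addDocument(doc);      // buffered in RAM until merged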
  /** Determines how often segment indices are merged by addDocument().  With
   * smaller values, less RAM is used while indexing, and searches on
   * unoptimized indices are faster, but indexing speed is slower.  With larger
   * values, more RAM is used during indexing, and while searches on unoptimized
   * indices are slower, indexing is faster.  Thus larger values (> 10) are best
   * for batch index creation, and smaller values (< 10) for indices that are
   * interactively maintained.
   *
   * <p>This must never be less than 2.  The default value is {@link #DEFAULT_MERGE_FACTOR}.
   */
  private int mergeFactor = DEFAULT_MERGE_FACTOR;

  /** Determines the minimal number of documents required before the buffered
   * in-memory documents are merged and a new Segment is created.
   * Since Documents are merged in a {@link org.apache.lucene.store.RAMDirectory},
   * a large value gives faster indexing.  At the same time, mergeFactor limits
   * the number of files open in an FSDirectory.
   *
   * <p>The default value is {@link #DEFAULT_MAX_BUFFERED_DOCS}.
   */
  private int minMergeDocs = DEFAULT_MAX_BUFFERED_DOCS;

  /** Determines the largest number of documents ever merged by addDocument().
   * Small values (e.g., less than 10,000) are best for interactive indexing,
   * as this limits the length of pauses while indexing to a few seconds.
   * Larger values are best for batched indexing and speedier searches.
   *
   * <p>The default value is {@link #DEFAULT_MAX_MERGE_DOCS}.
   */
  private int maxMergeDocs = DEFAULT_MAX_MERGE_DOCS;

  /** If non-null, information about merges will be printed to this. */
  private PrintStream infoStream = null;

  /** Merges all segments together into a single segment, optimizing an index
      for search. */
  public synchronized void optimize() throws IOException {
    flushRamSegments();
    while (segmentInfos.size() > 1 ||
           (segmentInfos.size() == 1 &&
            (SegmentReader.hasDeletions(segmentInfos.info(0)) ||
             segmentInfos.info(0).dir != directory ||
             (useCompoundFile &&
              (!SegmentReader.usesCompoundFile(segmentInfos.info(0)) ||
               SegmentReader.hasSeparateNorms(segmentInfos.info(0))))))) {
      int minSegment = segmentInfos.size() - mergeFactor;
      mergeSegments(minSegment < 0 ? 0 : minSegment);
    }
  }

  /** Merges all segments from an array of indexes into this index.
   *
   * <p>This may be used to parallelize batch indexing.  A large document
   * collection can be broken into sub-collections.  Each sub-collection can be
   * indexed in parallel, on a different thread, process or machine.  The
   * complete index can then be created by merging sub-collection indexes
   * with this method.
   *
   * <p>After this completes, the index is optimized. */
  public synchronized void addIndexes(Directory[] dirs)
      throws IOException {
    optimize();                                   // start with zero or 1 seg
    int start = segmentInfos.size();
    for (int i = 0; i < dirs.length; i++) {
      SegmentInfos sis = new SegmentInfos();      // read infos from dir
      sis.read(dirs[i]);
      for (int j = 0; j < sis.size(); j++) {
        segmentInfos.addElement(sis.info(j));     // add each info
      }
    }

    // merge newly added segments in log(n) passes
    while (segmentInfos.size() > start + mergeFactor) {
      for (int base = start; base < segmentInfos.size(); base++) {
        int end = Math.min(segmentInfos.size(), base + mergeFactor);
        if (end - base > 1)
          mergeSegments(base, end);
      }
    }

    optimize();                                   // final cleanup
  }
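  // The while-loop above merges the newly added segments in log(n) passes:
  // each pass slides a window of mergeFactor segments over the additions and
  // merges every window that holds more than one segment, so n added segments
  // need on the order of log_mergeFactor(n) passes.  A parallel batch-indexing
  // sketch (the sub-index paths are hypothetical):
  //
  //   writer.addIndexes(new Directory[] {
  //       FSDirectory.getDirectory("part1", false),
  //       FSDirectory.getDirectory("part2", false) });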
  /** Merges the provided indexes into this index.
   * <p>After this completes, the index is optimized.</p>
   * <p>The provided IndexReaders are not closed.</p>
   */
  public synchronized void addIndexes(IndexReader[] readers)
      throws IOException {
    optimize();                                   // start with zero or 1 seg

    final String mergedName = newSegmentName();
    SegmentMerger merger = new SegmentMerger(this, mergedName);

    final Vector segmentsToDelete = new Vector();
    IndexReader sReader = null;
    if (segmentInfos.size() == 1) {               // add existing index, if any
      sReader = SegmentReader.get(segmentInfos.info(0));
      merger.add(sReader);
      segmentsToDelete.addElement(sReader);       // queue segment for deletion
    }

    for (int i = 0; i < readers.length; i++)      // add new indexes
      merger.add(readers[i]);

    int docCount = merger.merge();                // merge 'em

    segmentInfos.setSize(0);                      // pop old infos & add new
    segmentInfos.addElement(new SegmentInfo(mergedName, docCount, directory));

    if (sReader != null)
      sReader.close();

    synchronized (directory) {                    // in- & inter-process sync
      new Lock.With(directory.makeLock(COMMIT_LOCK_NAME), commitLockTimeout) {
        public Object doBody() throws IOException {
          segmentInfos.write(directory);          // commit changes
          return null;
        }
      }.run();
    }

    deleteSegments(segmentsToDelete);             // delete now-unused segments

    if (useCompoundFile) {
      final Vector filesToDelete = merger.createCompoundFile(mergedName + ".tmp");
      synchronized (directory) {                  // in- & inter-process sync
        new Lock.With(directory.makeLock(COMMIT_LOCK_NAME), commitLockTimeout) {
          public Object doBody() throws IOException {
            // make compound file visible for SegmentReaders
            directory.renameFile(mergedName + ".tmp", mergedName + ".cfs");
            return null;
          }
        }.run();
      }

      // delete now-unused files of segment
      deleteFiles(filesToDelete);
    }
  }

  /** Merges all RAM-resident segments. */
  private final void flushRamSegments() throws IOException {
    int minSegment = segmentInfos.size() - 1;
    int docCount = 0;
    while (minSegment >= 0 &&
           (segmentInfos.info(minSegment)).dir == ramDirectory) {
      docCount += segmentInfos.info(minSegment).docCount;
      minSegment--;
    }
    if (minSegment < 0 ||                         // add one FS segment?
        (docCount + segmentInfos.info(minSegment).docCount) > mergeFactor ||
        !(segmentInfos.info(segmentInfos.size() - 1).dir == ramDirectory))
      minSegment++;
    if (minSegment >= segmentInfos.size())
      return;                                     // none to merge
    mergeSegments(minSegment);
  }

  /** Incremental segment merger. */
  private final void maybeMergeSegments() throws IOException {
    long targetMergeDocs = minMergeDocs;
    while (targetMergeDocs <= maxMergeDocs) {
      // find segments smaller than current target size
      int minSegment = segmentInfos.size();
      int mergeDocs = 0;
      while (--minSegment >= 0) {
        SegmentInfo si = segmentInfos.info(minSegment);
        if (si.docCount >= targetMergeDocs)
          break;
        mergeDocs += si.docCount;
      }
      if (mergeDocs >= targetMergeDocs)           // found a merge to do
        mergeSegments(minSegment + 1);
      else
        break;
      targetMergeDocs *= mergeFactor;             // increase target size
    }
  }
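  // maybeMergeSegments() above implements a geometric merge schedule:
  // starting at targetMergeDocs = minMergeDocs, it merges any run of trailing
  // segments whose combined docCount reaches the target, then multiplies the
  // target by mergeFactor and repeats.  For example, with minMergeDocs = 10
  // and mergeFactor = 10, ten single-document RAM segments merge into one
  // 10-doc segment; ten 10-doc segments later merge into a 100-doc segment,
  // and so on, up to the maxMergeDocs ceiling.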
  /** Pops segments off of the segmentInfos stack down to minSegment, merges
   * them, and pushes the merged index onto the top of the segmentInfos
   * stack. */
  private final void mergeSegments(int minSegment)
      throws IOException {
    mergeSegments(minSegment, segmentInfos.size());
  }

  /** Merges the named range of segments, replacing them in the stack with a
   * single segment. */
  private final void mergeSegments(int minSegment, int end)
      throws IOException {
    final String mergedName = newSegmentName();
    if (infoStream != null) infoStream.print("merging segments");
    SegmentMerger merger = new SegmentMerger(this, mergedName);

    final Vector segmentsToDelete = new Vector();
    for (int i = minSegment; i < end; i++) {
      SegmentInfo si = segmentInfos.info(i);
      if (infoStream != null)
        infoStream.print(" " + si.name + " (" + si.docCount + " docs)");
      IndexReader reader = SegmentReader.get(si);
      merger.add(reader);
      if ((reader.directory() == this.directory) || // if we own the directory
          (reader.directory() == this.ramDirectory))
        segmentsToDelete.addElement(reader);      // queue segment for deletion
    }

    int mergedDocCount = merger.merge();

    if (infoStream != null) {
      infoStream.println(" into " + mergedName + " (" + mergedDocCount + " docs)");
    }

    for (int i = end - 1; i > minSegment; i--)    // remove old infos & add new
      segmentInfos.remove(i);
    segmentInfos.set(minSegment, new SegmentInfo(mergedName, mergedDocCount,
                                                 directory));

    // close readers before we attempt to delete now-obsolete segments
    merger.closeReaders();

    synchronized (directory) {                    // in- & inter-process sync
      new Lock.With(directory.makeLock(COMMIT_LOCK_NAME), commitLockTimeout) {
        public Object doBody() throws IOException {
          segmentInfos.write(directory);          // commit before deleting
          return null;
        }
      }.run();
    }

    deleteSegments(segmentsToDelete);             // delete now-unused segments

    if (useCompoundFile) {
      final Vector filesToDelete = merger.createCompoundFile(mergedName + ".tmp");
      synchronized (directory) {                  // in- & inter-process sync
        new Lock.With(directory.makeLock(COMMIT_LOCK_NAME), commitLockTimeout) {
          public Object doBody() throws IOException {
            // make compound file visible for SegmentReaders
            directory.renameFile(mergedName + ".tmp", mergedName + ".cfs");
            return null;
          }
        }.run();
      }

      // delete now-unused files of segment
      deleteFiles(filesToDelete);
    }
  }
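  // Commit ordering matters in mergeSegments(): the new segments file is
  // written under the commit lock *before* the old segment files are deleted,
  // so a crash between the two steps leaves a valid, readable index plus some
  // orphaned files, which the "deletable" bookkeeping below eventually
  // reclaims on a later pass.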
  /*
   * Some operating systems (e.g. Windows) don't permit a file to be deleted
   * while it is opened for read (e.g. by another process or thread).  So we
   * assume that when a delete fails it is because the file is open in another
   * process, and queue the file for subsequent deletion.
   */
  private final void deleteSegments(Vector segments) throws IOException {
    Vector deletable = new Vector();

    deleteFiles(readDeleteableFiles(), deletable); // try to delete deleteable

    for (int i = 0; i < segments.size(); i++) {
      SegmentReader reader = (SegmentReader) segments.elementAt(i);
      if (reader.directory() == this.directory)
        deleteFiles(reader.files(), deletable);   // try to delete our files
      else
        deleteFiles(reader.files(), reader.directory()); // delete other files
    }

    writeDeleteableFiles(deletable);              // note files we can't delete
  }

  private final void deleteFiles(Vector files) throws IOException {
    Vector deletable = new Vector();
    deleteFiles(readDeleteableFiles(), deletable); // try to delete deleteable
    deleteFiles(files, deletable);                 // try to delete our files
    writeDeleteableFiles(deletable);               // note files we can't delete
  }

  private final void deleteFiles(Vector files, Directory directory)
      throws IOException {
    for (int i = 0; i < files.size(); i++)
      directory.deleteFile((String) files.elementAt(i));
  }

  private final void deleteFiles(Vector files, Vector deletable)
      throws IOException {
    for (int i = 0; i < files.size(); i++) {
      String file = (String) files.elementAt(i);
      try {
        directory.deleteFile(file);               // try to delete each file
      } catch (IOException e) {                   // if delete fails
        if (directory.fileExists(file)) {
          if (infoStream != null)
            infoStream.println(e.toString() + "; Will re-try later.");
          deletable.addElement(file);             // add to deletable
        }
      }
    }
  }

  private final Vector readDeleteableFiles() throws IOException {
    Vector result = new Vector();
    if (!directory.fileExists(IndexFileNames.DELETABLE))
      return result;

    IndexInput input = directory.openInput(IndexFileNames.DELETABLE);
    try {
      for (int i = input.readInt(); i > 0; i--)   // read file names
        result.addElement(input.readString());
    } finally {
      input.close();
    }
    return result;
  }

  private final void writeDeleteableFiles(Vector files) throws IOException {
    IndexOutput output = directory.createOutput("deleteable.new");
    try {
      output.writeInt(files.size());
      for (int i = 0; i < files.size(); i++)
        output.writeString((String) files.elementAt(i));
    } finally {
      output.close();
    }
    directory.renameFile("deleteable.new", IndexFileNames.DELETABLE);
  }
}
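Taken together, the excerpt shows the classic Lucene write path: buffer each document as a tiny RAM segment, merge geometrically, commit the segments file under the commit lock, then lazily delete obsolete files. A minimal usage sketch follows, assuming the Lucene 1.9-era Document/Field API that the rest of this class uses (SegmentReader.get, IndexInput/IndexOutput); IctclasAnalyzer is a hypothetical name standing in for whatever DLL-backed Chinese analyzer this project actually ships, and setMergeFactor/setMaxFieldLength are the setters this class's own javadoc refers to.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexingExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical analyzer wrapping the ICTCLAS-style DLL via JNI;
    // substitute the Analyzer this project actually provides.
    Analyzer analyzer = new IctclasAnalyzer();

    // true = create a new index in the directory, replacing any old one
    IndexWriter writer = new IndexWriter("index", analyzer, true);
    writer.setMergeFactor(10);       // default; raise for batch indexing
    writer.setMaxFieldLength(10000); // fields truncated past this many terms

    Document doc = new Document();
    doc.add(new Field("content", "需要分词的中文文本",
                      Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);         // buffered in RAM, merged incrementally

    writer.optimize();               // collapse to one segment for fast search
    writer.close();
  }
}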
