
📄 lucenepdfdocument.java

📁 Source code for converting PDF files to text in a Lucene environment
💻 JAVA
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package lucenesearch;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

/**
 * Builds a Lucene Document from a PDF file, using PDFBox to extract the text.
 *
 * @author BileiZhu
 */
public class LucenePDFDocument {

    public static Document getDocument(File doc) {
        String docPath = doc.getAbsolutePath();
        String title = doc.getName();
        String docText = null; // stays null if the PDF cannot be read
        InputStream inputStream = null;
        COSDocument cosDoc = null;

        try {
            inputStream = new FileInputStream(doc);
            PDFParser parser = new PDFParser(inputStream);
            parser.parse();
            cosDoc = parser.getDocument();
            if (cosDoc.isEncrypted()) {
                // Skip extraction instead of attempting to read encrypted content.
                System.out.println("File is encrypted and cannot be indexed");
            } else {
                PDFTextStripper stripper = new PDFTextStripper();
                docText = stripper.getText(new PDDocument(cosDoc));
            }
        } catch (Exception e) {
            System.out.println(e);
        } finally {
            // Release the parsed document and the file handle, guarding against
            // the nulls left behind when opening or parsing failed.
            try {
                if (cosDoc != null) {
                    cosDoc.close();
                }
                if (inputStream != null) {
                    inputStream.close();
                }
            } catch (Exception e) {
                System.out.println(e);
            }
        }

        Document document = new Document();
        // Stored but not indexed: used to locate the file from a search hit.
        document.add(new Field("path", docPath, Field.Store.YES, Field.Index.NO));
        // Stored and tokenized so the file name is searchable.
        document.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        // Field(String, Reader) rejects a null reader, so add contents only on success.
        if (docText != null) {
            document.add(new Field("contents", new StringReader(docText)));
        }
        // Path plus last-modified time, stored as a single untokenized term.
        document.add(new Field("information", docPath + Long.toString(doc.lastModified()),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        return document;
    }
}
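
The "information" field joins the absolute path with the file's last-modified time. The source does not say what the field is for. One plausible use, assumed here, is detecting that a file has changed since it was indexed. The sketch below uses only the JDK; the class `InformationKeyDemo` and the helper `needsReindex` are hypothetical names and not part of the original code.

```java
import java.io.File;
import java.io.IOException;

class InformationKeyDemo {

    // Same composition as the "information" field: absolute path + last-modified millis.
    static String informationKey(File file) {
        return file.getAbsolutePath() + Long.toString(file.lastModified());
    }

    // Hypothetical helper: true when the stored key no longer matches the file on disk.
    static boolean needsReindex(String storedKey, File file) {
        return !informationKey(file).equals(storedKey);
    }

    public static void main(String[] args) throws IOException {
        // A temporary empty file stands in for a real PDF; only its metadata is read.
        File pdf = File.createTempFile("sample", ".pdf");
        pdf.deleteOnExit();

        String stored = informationKey(pdf);
        System.out.println("unchanged:  " + needsReindex(stored, pdf));

        // Simulate a later edit by moving the timestamp forward one minute.
        pdf.setLastModified(pdf.lastModified() + 60_000L);
        System.out.println("after edit: " + needsReindex(stored, pdf));
    }
}
```

Because the field is indexed as a single untokenized term, an indexer following this assumption could look the key up exactly and skip files whose key is already present.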
