
📄 pdffilter.java

📁 dspace: an open-source digital library application built on a J2EE architecture
💻 JAVA
/*
 * PDFFilter.java
 *
 * Version: $Revision: 1.8 $
 *
 * Date: $Date: 2005/07/29 15:56:07 $
 *
 * Copyright (c) 2002-2005, Hewlett-Packard Company and Massachusetts
 * Institute of Technology.  All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met:
 *
 * - Redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer.
 *
 * - Redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in the
 * documentation and/or other materials provided with the distribution.
 *
 * - Neither the name of the Hewlett-Packard Company nor the name of the
 * Massachusetts Institute of Technology nor the names of their
 * contributors may be used to endorse or promote products derived from
 * this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
 * OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
 * TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
 * USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
 * DAMAGE.
 */
package org.dspace.app.mediafilter;

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

/*
 * to do: helpful error messages
 *   - can't find mediafilter.cfg
 *   - can't instantiate filter
 *   - bitstream format doesn't exist
 */
public class PDFFilter extends MediaFilter
{
    /**
     * @param oldFilename
     *            name of the source bitstream
     *
     * @return String name for the filtered (text) bitstream
     */
    public String getFilteredName(String oldFilename)
    {
        return oldFilename + ".txt";
    }

    /**
     * @return String bundle name
     */
    public String getBundleName()
    {
        return "TEXT";
    }

    /**
     * @return String bitstreamformat
     */
    public String getFormatString()
    {
        return "Text";
    }

    /**
     * @return String description
     */
    public String getDescription()
    {
        return "Extracted text";
    }

    /**
     * @param source
     *            source input stream
     *
     * @return InputStream the resulting input stream
     */
    public InputStream getDestinationStream(InputStream source)
            throws Exception
    {
        // get input stream from bitstream
        // pass to filter, get string back
        PDFTextStripper pts = new PDFTextStripper();
        PDFParser parser = new PDFParser(source);
        parser.parse();

        COSDocument cos = parser.getDocument();
        String extractedText;

        try
        {
            extractedText = pts.getText(new PDDocument(cos));
        }
        finally
        {
            // close the pdf, even if text extraction fails
            cos.close();
        }

        // if verbose flag is set, print out extracted text
        // to STDOUT
        if (MediaFilterManager.isVerbose)
        {
            System.out.println(extractedText);
        }

        // generate an input stream with the extracted text
        // (getBytes() uses the platform default charset)
        byte[] textBytes = extractedText.getBytes();
        ByteArrayInputStream bais = new ByteArrayInputStream(textBytes);

        // this works: the stream keeps its own reference to the byte array,
        // so the array stays reachable after this method returns
        return bais;
    }
}
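The closing comment in getDestinationStream asks whether the byte array goes out of scope once the method returns. A minimal sketch of why it does not: in Java only the local variable goes out of scope, while the array itself stays alive as long as the ByteArrayInputStream refers to it. The class StreamScopeDemo and its helpers toStream and readAll are hypothetical names used for illustration and are not part of DSpace; the sketch also assumes UTF-8 rather than the platform default charset used in the original.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamScopeDemo {
    // Hypothetical helper mirroring the tail of getDestinationStream:
    // the local array variable disappears on return, but the stream
    // still references the array, so the data remains readable.
    public static InputStream toStream(String text) {
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
        return new ByteArrayInputStream(textBytes);
    }

    // Reads a stream to the end and decodes it as UTF-8.
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = toStream("Extracted text");
        System.out.println(readAll(in)); // prints "Extracted text"
    }
}
```

Passing an explicit charset, as the sketch does, would also make the extracted text independent of the server's locale settings, at the cost of a small behavior change from the original filter.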
