http:^^www.cs.cornell.edu^info^people^lagoze^papers^www.html

project, an ARPA-sponsored, CNRI-directed effort to create an online digital library of technical reports from the nation's top computer science universities. This version was installed at the five universities that form the project (<a HREF="http://cs-tr.cs.cornell.edu">Cornell</A>, CMU, Berkeley, MIT, and Stanford), and shortly thereafter at Princeton, Dartmouth, and Rochester. Here we describe a few of its features. A full account may be found in <a HREF="#Dienst">[Dienst]</A>.
<p>One uses Dienst by connecting, with a standard Web client, to any convenient Dienst server that supports the user interface services. This server will display a form for searching the collection. Unless the user restricts the search to a single publisher, all Dienst servers are searched in parallel. Each Dienst server is made aware of all other Dienst servers by fetching a list of all servers from a single, central meta-server. Thus when a new server comes online, other servers become aware of it after only a short time. The results from a search are displayed as a list of the DocID, author, title, and date for each matching document, and include a URL for each document. Selecting one displays the document in more detail, including a list of the available formats (obtained as described above). The user can retrieve the document in any of the formats.
<p>Some repositories include page images as 4-bit 72 dot per inch GIF files.
When this is the case, the user interface service is able to display the document a page at a time, inline on the user's Web client. We found that such pages are readable on most monitors and save considerable network bandwidth compared to the 600 dpi TIFF images. In addition, some sites also store reduced-size "thumbnail" page images, which allow the user to quickly browse through a document and then click to view an interesting page (say, one that contains a graphic) in full-page version. Although we do not have any formal user studies, anecdotal evidence suggests that this is a very powerful and helpful feature.
<p>The server also allows the user to download and/or print all or selected pages of the document. Local users may print directly, while remote users can download a PostScript version of the document and then print it manually. Since not all documents are available in PostScript, the server has the ability to translate from TIFF images to Level 2 PostScript on the fly.
<h2>Maintaining the Document Collection</h2>
Our goal is to simplify the process by which an author publishes digital documents. Much of the work in this area is at the document creation layer - that is, enhancements to HTML and/or HTML editors. Our approach is to allow authors to use their traditional text production system - LaTeX, troff, Word, etc. - and then provide tools by which they can submit the results of that text processing to a digital library.
<h3>Dienst simplifies digital library maintenance</h3>
Digital library technology will only propagate beyond the technologically savvy if such systems require minimal human intervention, especially by trained experts. Two points are obvious. First, authors are concerned primarily with writing documents and getting them published. Submission to a digital library should require little more skill than using a word processor.
Second, many of the organizations that wish to publish documents (e.g., government agencies, academic departments, small companies) have little technical expertise. These organizations might tolerate the need for a reasonable skill level to install a digital library system (we intend to address the skill level required to install the digital library system in future work). However, they surely will not tolerate the cost of a systems expert to maintain the library.
<p>At Cornell we have implemented a set of tools that mostly automate the process of managing a digital library. The tools are closely integrated with the Dienst digital library server. They are similar in spirit to those implemented for the Wide Area Technical Report Server (<A HREF="#WATERS">[WATERS]</A>) system, known as Techrep, but whereas Techrep is designed to maintain the centralized index and unstructured FTP-based document repository that is characteristic of WATERS, the tools described here are tailored for the distributed indexes and structured repositories characteristic of Dienst.
<p>Our design goal was to make the digital library maintainable by a document librarian (DL) with relatively low-level computer training. This DL serves four major roles: 1) as the general manager of the collection; 2) as the reviewer of document submissions, to protect against counterfeit document submissions; 3) as the clearing house for copyright issues; and 4) as the archiver of document hardcopy.
This system has recently been installed in the Cornell Computer Science Department and is now the means for all technical report submissions.
<h4>Authors add documents with an HTML form</h4>
The submitter prepares a document for submission by producing a PostScript representation. Rather than support a plethora of document formats from a variety of word processors, we determined that PostScript represents a <i>lingua franca</i> that could be generated from virtually all word or text processing systems. We recognize that there will be documents that cannot be represented in this fashion, but estimate that their number will be very small and that techniques for managing them can be developed as the process matures.
<p>The author submits a document by completing an HTML form that contains text fields for bibliographic data about the document. These fields are the document title, author(s), pathname of the PostScript file, abstract, and submitter's e-mail address. The submitter can quickly complete this form by "cutting and pasting" text from the document source.
<h4>The document librarian validates submissions to the library</h4>
The document librarian, in the role of gatekeeper of the system, learns of each submission through an automatically generated e-mail message. No document actually enters the database until the DL manually checks the submission. In addition, the DL acts as the legal gateway, ensuring that the authors complete a copyright release form that gives the department permission to make the document available over the Internet. When manual checking and copyright clearing are complete, the DL uses a simple command to assign a DocID to the document and signal that the document is ready for entry into the database.
<p>The remainder of the process is fully automated.
Software that is integrated with the digital library server generates the RFC-1357 bibliographic entry from the submitter's entry, checks the validity of the PostScript file, builds the actual database entry, and generates the GIF images for online viewing and browsing of the document.
<p>The image conversions in this process are done with the Extended Portable Bitmap Toolkit (<A HREF="#PBMPLUS">[PBMPLUS]</A>). PBMPLUS consists of a number of filters for conversion between a variety of image formats (TIFF, GIF, X bitmaps, etc.) and a small set of portable formats, and a set of tools to perform manipulations (rotation, color transformation, scaling) on the portable format files. PBMPLUS has the advantages of being free, quite reliable, usable on a wide variety of graphical formats, and quite powerful in its basic image manipulating capabilities.
<h4>Document librarian controls document withdrawal</h4>
A library system must be able to handle author requests for document withdrawal. The reason for withdrawal may be invalidation of the published research or newly published results in another document. For purposes of maintaining the integrity of the collection, we have made the document librarian the control point for this operation. Document withdrawal, via a simple command, replaces the bibliographic file with an entry whose only attributes are the document number and a "WITHDRAWN" flag - all other bibliographic information is deleted. This ensures that the DocID is not reused for another document. Furthermore, the withdrawal moves the original bibliographic file and associated image and PostScript files to a location that is not accessible to the document server.
<h4>Hardcopy is sometimes required</h4>
While electronic document delivery is the <i>raison d'etre</i> of our system, we recognize that publication quality hardcopy is sometimes needed.
The document librarian must produce paper copy for archival storage and for people who do not have electronic access. In our system, printing of TR's is done using a package provided by Cornell Information Technologies called EZ-PUBLISH <A HREF="#EZPUB">[EZPUB]</A>. EZ-PUBLISH allows users across campus on various platforms to print to a central Xerox DocuTech publishing system. This is a publication quality printer that offers very high speed and resolution (135 pages/minute, 600 dpi) and document setup facilities such as binding, different paper types, etc. With a command in the Dienst document management suite the DL can specify that multiple copies of a TR be printed on the DocuTech. The command does automatic setup of the print job including formatting of a standard Cornell Technical Report cover page.
<p>We have just begun to use this automated system in the Computer Science department at Cornell. At a later time we will evaluate the effectiveness of the system, with special attention paid to the number of documents that require a special submission procedure (i.e., are not translatable to PostScript). Obviously, if the ratio of these to the number of submitted documents is high, we need to rethink the design of the system.
<h3>Digitizing existing documents is a mostly manual task</h3>
We describe above a system for almost complete automation of the document submission process. At Cornell, we faced the additional task of converting an existing collection to digital form. While some of the tools described above were useful for this task, a large amount of manual intervention was required. The Cornell Computer Science Department has been publishing technical reports since 1968. As of September 1994 the department had published 1449 TR's, with an average length of thirty-six pages (a total of over 52,000 pages).
The digital record for many of these TR's is either non-existent, not easily available, or in a format that is difficult to interpret with current hardware and software (for example, a document formatted in an extinct version of WordStar that is only available on floppies for a long-gone CP/M system).
<p>The common form that exists for all existing documents is hardcopy - the department maintains archival copies of the entire TR corpus. A production scanning facility on campus allowed the department to convert the entire corpus to high-quality 600 dpi Group 4-compressed TIFF images. Over a nine-month period all hardcopy pages were scanned to individual TIFF files and downloaded via FTP to disk in the Computer Science Department. Each TIFF file ranges in size from around one kilobyte for a blank page to almost two megabytes for a page that contains a high quality photographic image. The total collection of page images now occupies around 3.6 gigabytes.
<p>It should be noted that scanning a collection, even one as modest as the Cornell CS TR's, is time consuming, labor intensive, and not without problems. Even the most careful scanning technician occasionally misses pages, skews pages, or misses part of a page due to an unnoticed fold when the page is put on the scanner bed. These problems are difficult, if not impossible, to detect automatically. In addition, any problems that are detected are computationally intensive to correct. For example, a simple ninety-degree rotation of a 600 dpi TIFF image (due to incorrect scanning orientation) can take up to thirty minutes on a reasonably equipped SPARCstation 10.
<p>An example illustrates the difficulty of correcting scanning problems. We discovered after all scanning was complete that many of our older TR's were scanned from pages that were oriented in landscape mode - two pages side-by-side. The result was a TIFF file containing