          Reuters-21578 text categorization test collection
                        Distribution 1.0
                      README file (v 1.2)
                       26 September 1997

                        David D. Lewis
                     AT&T Labs - Research
                    lewis@research.att.com

I. Introduction

   This README describes Distribution 1.0 of the Reuters-21578 text
categorization test collection, a resource for research in information
retrieval, machine learning, and other corpus-based research.

II. Copyright & Notification

   The copyright for the text of newswire articles and Reuters
annotations in the Reuters-21578 collection resides with Reuters Ltd.
Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free
distribution of this data *for research purposes only*.

   If you publish results based on this data set, please acknowledge
its use, refer to the data set by the name "Reuters-21578,
Distribution 1.0", and inform your readers of the current location of
the data set (see "Availability & Questions").

III. Availability & Questions

   The Reuters-21578, Distribution 1.0 test collection is available
from David D. Lewis' professional home page, currently:

             http://www.research.att.com/~lewis

Besides this README file, the collection consists of 22 data files, an
SGML DTD file describing the data file format, and six files
describing the categories used to index the data.  (See Sections VI
and VII for more details.)  Some additional files, which are not part
of the collection but have been contributed by other researchers as
useful resources, are also included.  All files are available
uncompressed, and in addition a single gzipped Unix tar archive of the
entire distribution is available as reuters21578.tar.gz.

   The text categorization mailing list, DDLBETA, is a good place to
send questions about this collection and other text categorization
issues.  You may join the list by writing David Lewis at
lewis@research.att.com.

IV. History & Acknowledgements

   The documents in the Reuters-21578 collection appeared on the
Reuters newswire in 1987.  The documents were assembled and indexed
with categories by personnel from Reuters Ltd. (Sam Dobbins, Mike
Topliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen,
Monica Cellio, Phil Hayes, Laura Knecht, Irene Nirenburg) in 1987.

   In 1990, the documents were made available by Reuters and CGI for
research purposes to the Information Retrieval Laboratory (W. Bruce
Croft, Director) of the Computer and Information Science Department at
the University of Massachusetts at Amherst.  Formatting of the
documents and production of associated data files was done in 1990 by
David D. Lewis and Stephen Harding at the Information Retrieval
Laboratory.

   Further formatting and data file production was done in 1991 and
1992 by David D. Lewis and Peter Shoemaker at the Center for
Information and Language Studies, University of Chicago.  This version
of the data was made available for anonymous FTP as "Reuters-22173,
Distribution 1.0" in January 1993.  From 1993 through 1996,
Distribution 1.0 was hosted at a succession of FTP sites maintained by
the Center for Intelligent Information Retrieval (W. Bruce Croft,
Director) of the Computer Science Department at the University of
Massachusetts at Amherst.

   At the ACM SIGIR '96 conference in August 1996, a group of text
categorization researchers discussed how published results on
Reuters-22173 could be made more comparable across studies.
It was decided that a new version of the collection should be produced
with less ambiguous formatting, and including documentation carefully
spelling out standard methods of using the collection.  The
opportunity would also be used to correct a variety of typographical
and other errors in the categorization and formatting of the
collection.

   Steve Finch and David D. Lewis did this cleanup of the collection
September through November of 1996, relying heavily on Finch's
SGML-tagged version of the collection from an earlier study.  One
result of the re-examination of the collection was the removal of 595
documents which were exact duplicates (based on identity of timestamps
down to the second) of other documents in the collection.  The new
collection therefore has only 21,578 documents, and thus is called the
Reuters-21578 collection.  This README describes version 1.0 of this
new collection, which we refer to as "Reuters-21578, Distribution
1.0".

   In preparing the collection and documentation we have benefited
from discussions with Eric Brown, William Cohen, Fred Damerau, Yoram
Singer, Amit Singhal, and Yiming Yang, among many others.  We thank
all the people and organizations listed above for their efforts and
support, without which this collection would not exist.

   A variety of other changes were also made in going from
Reuters-22173 to Reuters-21578:

   1. Documents were marked up with SGML tags, and a corresponding
SGML DTD was produced, so that the boundaries of important sections of
documents (e.g. category fields) are unambiguous.

   2. The set of categories that are legal for each of the five
controlled vocabulary fields was specified.  All category names not
legal for a field were corrected to a legal category, moved to their
appropriate field, or removed, as appropriate.

   3. Documents were given new ID numbers, in chronological order, and
are collected 1000 to a file in order by ID (and therefore in
chronological order).

V. What is a Text Categorization Test Collection and Who Cares?

   *Text categorization* is the task of deciding whether a piece of
text belongs to any of a set of prespecified categories.  It is a
generic text processing task useful in indexing documents for later
retrieval, as a stage in natural language processing systems, for
content analysis, and in many other roles [LEWIS94d].

   The use of standard, widely distributed test collections has been a
considerable aid in the development of algorithms for the related task
of *text retrieval* (finding documents that satisfy a particular
user's information need, usually expressed in a textual request).
Text retrieval test collections have allowed the comparison of
algorithms developed by a variety of researchers around the world.
(For more on text retrieval test collections see SPARCKJONES76.)

   Standard test collections have been lacking, however, for text
categorization.  Few data sets have been used by more than one
researcher, making results hard to compare.  The Reuters-22173 test
collection has been used in a number of published studies since it was
made available, and we believe that the Reuters-21578 collection will
be even more valuable.

   The collection may also be of interest to researchers in machine
learning, as it provides a classification task with challenging
properties.  There are multiple categories, the categories are
overlapping and nonexhaustive, and there are relationships among the
categories.  There are interesting possibilities for the use of domain
knowledge.  There are many possible feature sets that can be extracted
from the text, and most plausible feature/example matrices are large
and sparse.  There is even some temporal structure to the data
[LEWIS94b], though problems with the indexing and the uneven
distribution of stories within the timespan covered may make this
collection a poor one for exploring temporal issues.
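
   As a purely illustrative sketch of these properties, the following
Python fragment builds bag-of-words counts for a few multiply-labeled
documents and reports how sparse each row of the resulting
feature/example matrix is.  The stories and category assignments are
invented for the example, not drawn from the collection:

    from collections import Counter

    # Invented toy stories, *not* text from the collection; each story
    # carries one or more overlapping category labels.
    docs = [
        ("u.s. wheat and grain exports rose sharply", {"grain", "wheat"}),
        ("opec ministers met to discuss crude oil output", {"crude"}),
        ("grain shipments slowed as corn and wheat futures fell",
         {"grain", "corn", "wheat"}),
    ]

    # The full vocabulary defines the columns of a feature/example matrix.
    vocab = sorted({word for text, _ in docs for word in text.split()})

    for text, categories in docs:
        counts = Counter(text.split())
        nonzero = sum(1 for word in vocab if counts[word])
        print(sorted(categories), f"{nonzero} of {len(vocab)} features nonzero")

With real stories and the 135 TOPICS categories the same pattern
holds, only at much larger scale: each document touches a tiny
fraction of the vocabulary, so the matrix is large and sparse.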
VI. Formatting

   The Reuters-21578 collection is distributed in 22 files.  Each of
the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000
documents, while the last (reut2-021.sgm) contains 578 documents.

   The files are in SGML format.  Rather than going into the details
of the SGML language, we describe here in an informal way how the SGML
tags are used to divide each file, and each document, into sections.
Readers interested in more detail on SGML are encouraged to pursue one
of the many books and web pages on the subject.

   Each of the 22 files begins with a document type declaration line:

      <!DOCTYPE lewis SYSTEM "lewis.dtd">

The DTD file lewis.dtd is included in the distribution.  Following the
document type declaration line are individual Reuters articles marked
up with SGML tags, as described below.

VI.A. The REUTERS tag:

   Each article starts with an "open tag" of the form

      <REUTERS TOPICS=?? LEWISSPLIT=?? CGISPLIT=?? OLDID=?? NEWID=??>

where the ?? are filled in an appropriate fashion.  Each article ends
with a "close tag" of the form:

      </REUTERS>

In all cases the <REUTERS> and </REUTERS> tags are the only items on
their line.

   Each REUTERS tag contains explicit specifications of the values of
five attributes: TOPICS, LEWISSPLIT, CGISPLIT, OLDID, and NEWID.
These attributes are meant to identify documents and groups of
documents, and have the following meanings:

   1. TOPICS : The possible values are YES, NO, and BYPASS:

      a. YES indicates that *in the original data* there was at least
one entry in the TOPICS fields.

      b. NO indicates that *in the original data* the story had no
entries in the TOPICS field.

      c. BYPASS indicates that *in the original data* the story was
marked with the string "bypass" (or a typographical variant on that
string).

   This poorly-named attribute is unfortunately the subject of much
confusion.  It is meant to indicate whether or not the document had
TOPICS categories *in the raw Reuters-22173 dataset*.  The sole use of
this attribute is to define training set splits similar to those used
in previous research.  (See the section on training set splits.)  The
TOPICS attribute does **NOT** indicate anything about whether or not
the Reuters-21578 document has any TOPICS categories.  (Version 1.0 of
this document was errorful on this point.)  That can be determined by
actually looking at the TOPICS field.  A story with TOPICS="YES" can
have no TOPICS categories, and a story with TOPICS="NO" can have
TOPICS categories.

   Now, a reasonable (though not certain) assumption is that for all
TOPICS="YES" stories the indexer at least thought about whether the
story belonged to a valid TOPICS category.  Thus, the TOPICS="YES"
stories with no topics can reasonably be considered negative examples
for all 135 valid TOPICS categories.

   TOPICS="NO" stories are more problematic in their interpretation.
Some of them presumably result because the indexer made an explicit
decision that they did not belong to any of the 135 valid TOPICS
categories.  However, there are many cases where it is clear that a
story should belong to one or more TOPICS categories, but for some
reason the category was not assigned.  There appear to be certain time
intervals where large numbers of such stories are concentrated,
suggesting that some parts of the data set were simply not indexed, or
not indexed for some categories or category sets.  Also, in a few
cases, the indexer clearly meant to assign TOPICS categories, but put
them in the wrong field.  These cases have been corrected in the
Reuters-21578 data, yielding stories that have TOPICS categories even
though TOPICS="NO", because the category was not assigned in the raw
version of the data.

   "BYPASS" stories clearly were not indexed, and so are useful only
for general distributional information on the language used in the
documents.

   2. LEWISSPLIT : The possible values are TRAINING, TEST, and
NOT-USED.  TRAINING indicates the document was used in the training
set in the experiments reported in LEWIS91d (Chapters 9 and 10),
LEWIS92b, LEWIS92e, and LEWIS94b.  TEST indicates it was used in the
test set for those experiments, and NOT-USED means it was not used in
those experiments.

   3. CGISPLIT : The possible values are TRAINING-SET and
PUBLISHED-TESTSET, indicating whether the document was in the training
set or the test set for the experiments reported in HAYES89 and
HAYES90b.

   4. OLDID : The identification number (ID) the story had in the
Reuters-22173 collection.

   5. NEWID : The identification number (ID) the story has in the
Reuters-21578, Distribution 1.0 collection.  These IDs are assigned to
the stories in chronological order.

   In addition, some REUTERS tags have a sixth attribute, CSECS, which
can be ignored.

   The use of these attributes is critical to allowing comparability
between different studies with the collection, and is discussed
further in Section VIII.
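
   For readers who wish to process the files programmatically, the
Python sketch below shows one possible way to split a data file into
articles and read the five attributes described above.  It is an
illustration only, not part of the distribution: the helper name
parse_articles is invented, the pattern assumes attribute values
appear in double quotes, and the latin-1 decoding is simply a
forgiving guess for 1987-era text.

    import re

    # One article runs from a <REUTERS ...> open tag (attributes
    # captured) through the matching </REUTERS> close tag; per the
    # description above, both tags are the only items on their lines.
    ARTICLE = re.compile(r"<REUTERS([^>]*)>(.*?)</REUTERS>", re.DOTALL)
    ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')  # assumes quoted values

    def parse_articles(sgm_text):
        """Yield (attributes, body) for each article in one data file."""
        for raw_attributes, body in ARTICLE.findall(sgm_text):
            yield dict(ATTRIBUTE.findall(raw_attributes)), body

    # Hypothetical usage: count the documents in the first data file
    # that fall in the LEWISSPLIT training set (values per the list
    # above: TRAINING, TEST, NOT-USED).
    with open("reut2-000.sgm", encoding="latin-1") as f:
        articles = list(parse_articles(f.read()))

    training = [a for a, _ in articles
                if a.get("LEWISSPLIT") == "TRAINING"]
    print(len(articles), "articles,", len(training), "in the training split")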
VI.B. Document-Internal Tags

   Just as the <REUTERS> and </REUTERS> tags serve to delimit
documents within a file, other tags are used to delimit elements
within a document.  We discuss these in the order in which they
typically appear, though the exact order should not be relied upon in
processing.  In some cases, additional tags occur within an element
delimited by these top level document-internal tags.  These are
discussed in this section as well.

   We specify below whether each open/close tag pair is used exactly
