once (ONCE) per story, or a variable (VARIABLE) number of times
(possibly zero).  In many cases the start tag of a pair appears only
at the beginning of a line, with the corresponding end tag always
appearing at the end of the same line.  When this is the case, we
indicate it with the notation "SAMELINE" below, as an aid to those
processing the files without SGML tools.

     1. <DATE>, </DATE> [ONCE, SAMELINE]: Encloses the date and time
of the document, possibly followed by some non-date noise material.

     2. <MKNOTE>, </MKNOTE> [VARIABLE]: Notes on certain hand
corrections that were done to the original Reuters corpus by Steve
Finch.

     3. <TOPICS>, </TOPICS> [ONCE, SAMELINE]: Encloses the list of
TOPICS categories, if any, for the document. If TOPICS categories are
present, each will be delimited by the tags <D> and </D>.

     4. <PLACES>, </PLACES> [ONCE, SAMELINE]: Same as <TOPICS> but
for PLACES categories.

     5. <PEOPLE>, </PEOPLE> [ONCE, SAMELINE]: Same as <TOPICS> but
for PEOPLE categories.

     6. <ORGS>, </ORGS> [ONCE, SAMELINE]: Same as <TOPICS> but for
ORGS categories.

     7. <EXCHANGES>, </EXCHANGES> [ONCE, SAMELINE]: Same as <TOPICS>
but for EXCHANGES categories.

     8. <COMPANIES>, </COMPANIES> [ONCE, SAMELINE]: These tags always
appear adjacent to each other, since there are no COMPANIES categories
assigned in the collection.

     9. <UNKNOWN>, </UNKNOWN> [VARIABLE]: These tags bracket control
characters and other noisy and/or somewhat mysterious material in the
Reuters stories.

     10. <TEXT>, </TEXT> [ONCE]: We have attempted to delimit all the
textual material of each story between a pair of these tags.  Some
control characters and other "junk" material may also be included.
The whitespace structure of the text has been preserved.  The <TEXT>
tag has the following attribute:

        a. TYPE: This has one of three values: NORM, BRIEF, and
UNPROC.  NORM is the default value and indicates that the text of the
story had a normal structure.  In this case the TEXT tag appears
simply as <TEXT>.  The tag appears as <TEXT TYPE="BRIEF"> when the
story is a short one or two line note.  The tag appears as
<TEXT TYPE="UNPROC"> when the format of the story is unusual in some
fashion that limited our ability to further structure it.

The following tags optionally delimit elements inside the TEXT
element.  Not all stories will have these tags:

        a. <AUTHOR>, </AUTHOR>: Author of the story.
        b. <DATELINE>, </DATELINE>: Location the story originated
from, and day of the year.
        c. <TITLE>, </TITLE>: Title of the story.  We have attempted
to capture the text of stories with TYPE="BRIEF" within a <TITLE>
element.
        d. <BODY>, </BODY>: The main text of the story.
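Because the category fields are marked ONCE and SAMELINE, their
contents can be recovered with plain line-oriented pattern matching,
with no SGML tools at all.  The following is a minimal Python sketch
of the idea for the TOPICS field; it is an illustration only, and the
file name reut2-000.sgm is just one of the collection's data files:

    import re

    # Category fields such as <TOPICS>...</TOPICS> open and close on
    # one line (SAMELINE), with each category wrapped in <D>...</D>.
    TOPICS_RE = re.compile(r"<TOPICS>(.*)</TOPICS>")
    D_RE = re.compile(r"<D>(.*?)</D>")

    def topics_per_story(path):
        """Yield the TOPICS category list of each story in one file."""
        # The collection predates Unicode; latin-1 decodes every byte.
        with open(path, encoding="latin-1") as f:
            for line in f:
                m = TOPICS_RE.search(line)
                if m:             # exactly one <TOPICS> line per story
                    yield D_RE.findall(m.group(1))

    for cats in topics_per_story("reut2-000.sgm"):
        print(cats)               # e.g. ['cocoa'], or [] if none assigned

The same pattern works for PLACES, PEOPLE, ORGS, and EXCHANGES by
substituting the tag name.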
VII. Categories 

    A test collection for text categorization contains, at minimum, a
set of texts and, for each text, a specification of what categories
that text belongs to.  For the Reuters-21578 collection the documents
are Reuters newswire stories, and the categories are five different
sets of content-related categories.  For each document, a human
indexer decided which categories from which sets that document
belonged to.  The category sets are as follows:

              Number of    Number of Categories   Number of Categories
Category Set  Categories   w/ 1+ Occurrences      w/ 20+ Occurrences
************  **********   ********************   ********************
EXCHANGES        39                32                       7
ORGS             56                32                       9
PEOPLE          267               114                      15
PLACES          175               147                      60
TOPICS          135               120                      57

The TOPICS categories are economic subject categories.  Examples
include "coconut", "gold", "inventories", and "money-supply".  This
set of categories is the one that has been used in almost all previous
research with the Reuters data.  HAYES90b discusses some examples of
the policies (not always obvious) used by the human indexers in
deciding whether a document belonged to a particular TOPICS category.

The EXCHANGES, ORGS, PEOPLE, and PLACES categories correspond to named
entities of the specified type.  Examples include "nasdaq"
(EXCHANGES), "gatt" (ORGS), "perez-de-cuellar" (PEOPLE), and
"australia" (PLACES).  Typically a document assigned to a category
from one of these sets explicitly includes some form of the category
name in the document's text.  (Something which is usually not true for
TOPICS categories.)  However, not all documents containing a named
entity corresponding to the category name are assigned to these
categories, since the entity was required to be a focus of the news
story [HAYES90b].  Thus these proper name categories are not as simple
to assign correctly as might be thought.

Reuters-21578, Distribution 1.0 includes five files
(all-exchanges-strings.lc.txt, all-orgs-strings.lc.txt,
all-people-strings.lc.txt, all-places-strings.lc.txt, and
all-topics-strings.lc.txt) which list the names of *all* legal
categories in each set.  A sixth file, cat-descriptions_120396.txt,
gives some additional information on the category sets.

Note that a sixth category field, COMPANIES, was present in the
original Reuters materials distributed by Carnegie Group, but no
company information was actually included in these fields.  In the
Reuters-21578 collection this field is always empty.

In the table above we note how many categories appear in at least 1 of
the 21,578 documents in the collection, and how many appear in at
least 20 of the documents.  Many categories appear in no documents,
but we encourage researchers to include these categories when
evaluating the effectiveness of their categorization system.

Additional details of the documents, categories, and corpus
preparation process appear in LEWIS92b, and at greater length in
Section 8.1 of LEWIS91d.
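As a sanity check, the two right-hand columns of the table above can
be recomputed from the data files and the category-name lists.  A
minimal Python sketch, assuming the reut2-*.sgm story files and
all-topics-strings.lc.txt sit in the working directory:

    import glob
    import re
    from collections import Counter

    TOPICS_RE = re.compile(r"<TOPICS>(.*)</TOPICS>")
    D_RE = re.compile(r"<D>(.*?)</D>")

    # The legal TOPICS category names, one lowercase name per line.
    with open("all-topics-strings.lc.txt", encoding="latin-1") as f:
        legal = [line.strip() for line in f if line.strip()]

    doc_freq = Counter()
    for path in sorted(glob.glob("reut2-*.sgm")):  # assumed location
        with open(path, encoding="latin-1") as f:
            for line in f:
                m = TOPICS_RE.search(line)
                if m:
                    # Count each category at most once per document.
                    doc_freq.update(set(D_RE.findall(m.group(1))))

    # The table above gives 120 and 57 for the TOPICS row.
    print("w/ 1+ occurrences: ", sum(1 for c in legal if doc_freq[c] >= 1))
    print("w/ 20+ occurrences:", sum(1 for c in legal if doc_freq[c] >= 20))

Swapping in the other all-*-strings.lc.txt files and tag names
reproduces the remaining rows.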
VIII. Using Reuters-21578 for Text Categorization Research 

    In testing a method for text categorization it is important that
knowledge of the nature of the test data not unduly influence the
development of the system, or the performance obtained will be
unrealistically high.  One way of dealing with this is to divide a set
of data into two subsets: a training set and a test set.  An
experimenter then develops a categorization system by automated
training on the training set only, and/or by human knowledge
engineering based on examination of the training set only.  The
categorization system is then tested on the previously unexamined test
set.  A number of variations on this basic theme are possible---see
WEISS91 for a good discussion.

    Effectiveness results can only be compared between studies that
use the same training and test set (or that use cross-validation
procedures).  One problem with the Reuters-22173 collection was that
the ambiguity of formatting and annotation led different researchers
to use different training/test divisions.  This was particularly
problematic when researchers attempted to remove documents that "had
no TOPICS", as there were several definitions of what this meant.

    To eliminate these ambiguities from the Reuters-21578 collection
we specify exactly which articles are in each of the recommended
training sets and test sets by specifying the values those articles
will have on the TOPICS, LEWISSPLIT, and CGISPLIT attributes of the
REUTERS tags.  We strongly encourage that all studies on Reuters-21578
use one of the following training/test divisions (or use multiple
random splits, e.g. cross-validation):

VIII.A. The Modified Lewis ("ModLewis") Split:

 Training Set (13,625 docs): LEWISSPLIT="TRAIN"; TOPICS="YES" or "NO"
 Test Set (6,188 docs): LEWISSPLIT="TEST"; TOPICS="YES" or "NO"
 Unused (1,765 docs): LEWISSPLIT="NOT-USED" or TOPICS="BYPASS"

This replaces the 14704/6746 split (723 unused) of the Reuters-22173
collection, which was used in LEWIS91d (Chapters 9 and 10), LEWIS92b,
LEWIS92c, LEWIS92e, and LEWIS94b.  Note the following:

     1. The duplicate documents removed in forming Reuters-21578 are
of course not present.

     2. The documents with TOPICS="BYPASS" are not used, since
subsequent analysis strongly indicates that they were not categorized
by the indexers.

     3. The 1,765 unused documents should not be tested on and should
not be used for supervised learning.  However, they may be useful as
additional information on the statistical distribution of words,
phrases, and other features that might be used to predict categories.

This split assigns documents from April 7, 1987 and before to the
training set, and documents from April 8, 1987 and after to the test
set.

WARNING: Given the many changes in going from Reuters-22173 to
Reuters-21578, including correction of many typographical errors in
category labels, results on the ModLewis split cannot be compared
with any published results on the Reuters-22173 collection!

VIII.B. The Modified Apte ("ModApte") Split:

 Training Set (9,603 docs): LEWISSPLIT="TRAIN"; TOPICS="YES"
 Test Set (3,299 docs): LEWISSPLIT="TEST"; TOPICS="YES"
 Unused (8,676 docs): LEWISSPLIT="NOT-USED"; TOPICS="YES"
                   or TOPICS="NO"
                   or TOPICS="BYPASS"

This replaces the 10645/3672 split (7,856 not used) of the
Reuters-22173 collection.  These are our best approximation to the
training and test splits used in APTE94 and APTE94b.  Note the
following:

     1. As with the ModLewis split, those documents removed in forming
Reuters-21578 are not present, and BYPASS documents are not used.

     2. The intent in APTE94 and APTE94b was to use the Lewis split,
but restrict it to documents with at least one TOPICS category.
However, it was not clear exactly what Apte, et al meant by having at
least one TOPICS category (e.g. how was "bypass" treated, whether this
was before or after any fixing of typographical errors, etc.).  We
have encoded our interpretation in the TOPICS attribute.  ***Note
that, as discussed above, some TOPICS="YES" stories have no TOPICS
categories, and a few TOPICS="NO" stories have TOPICS categories.
These facts are irrelevant to the definition of the split.***  If you
are using a learning algorithm that requires each training document to
have at least one TOPICS category, you can screen out the training
documents with no TOPICS categories.  Please do NOT screen out any of
the 3,299 test documents - that will make your results incomparable
with other studies.

     3. As with ModLewis, it may be desirable to use the 8,676 unused
documents for gathering statistical information about feature
distribution.

As with ModLewis, this split assigns documents from April 7, 1987 and
before to the training set, and documents from April 8, 1987 and after
to the test set.  The difference is that only documents with at least
one TOPICS category are used.  The rationale for this restriction is
that while some documents lack TOPICS categories because no TOPICS
apply (i.e. the document is a true negative example for all TOPICS
categories), it appears that others simply were never assigned TOPICS
categories by the indexers.  (Unfortunately, the amount of time that
has passed since the collection was created has made it difficult to
establish exactly what went on during the indexing.)

WARNING: Given the many changes in going from Reuters-22173 to
Reuters-21578, including correction of many typographical errors in
category labels, results on the ModApte split cannot be compared
with any published results on the Reuters-22173 collection!
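Since each split is defined entirely by attributes on the REUTERS
tag, selecting the ModApte documents reduces to reading those
attributes.  A hedged Python sketch follows; the attribute-parsing
regex and the reut2-*.sgm file location are assumptions, and a full
reader would of course also carve out each story's text:

    import glob
    import re

    # A <REUTERS ...> open tag carries the split attributes, e.g.
    # <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" ...>
    REUTERS_RE = re.compile(r"<REUTERS[^>]*>")
    ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

    def modapte_label(open_tag):
        """Return 'train', 'test', or None (unused) under ModApte."""
        attrs = dict(ATTR_RE.findall(open_tag))
        if attrs.get("TOPICS") != "YES":
            return None
        if attrs.get("LEWISSPLIT") == "TRAIN":
            return "train"
        if attrs.get("LEWISSPLIT") == "TEST":
            return "test"
        return None        # LEWISSPLIT="NOT-USED"

    counts = {"train": 0, "test": 0}
    for path in sorted(glob.glob("reut2-*.sgm")):  # assumed location
        with open(path, encoding="latin-1") as f:
            for tag in REUTERS_RE.findall(f.read()):
                label = modapte_label(tag)
                if label:
                    counts[label] += 1
    print(counts)   # per the definition above: 9,603 train / 3,299 test

The ModLewis and ModHayes selections differ only in which attribute
values are tested.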
VIII.C. The Modified Hayes ("ModHayes") Split: 

 Training Set (20,856 docs): CGISPLIT="TRAINING-SET"
 Test Set (722 docs): CGISPLIT="PUBLISHED-TESTSET"
 Unused (0 docs)

This is the best approximation we have to the training and test splits
used in HAYES89, HAYES90b, and Chapter 8 of LEWIS91d.  It replaces the
21450/723 split of the Reuters-22173 collection.  Note the following:

     1. As with the other splits, the duplicate documents removed in
forming Reuters-21578 are not present.

     2. "Training" in HAYES89 and HAYES90b was actually done by human
beings looking at the documents and writing categorization rules.  We
cannot be sure which of the document files were actually looked at.

     3. We specify that the BYPASS stories and the TOPICS="NO" stories
are part of the training set, since they were used during manual
knowledge engineering in the original Hayes experiments.  That does
not mean researchers are obliged to give these stories to, for
instance, a supervised learning algorithm.  As mentioned for the other
splits, they may be more useful for getting distributional information
about features.

There are a number of problems with the ModHayes split that make it
less than desirable for text categorization research, including an
unusual distribution of categories, pairs of near-duplicate documents,
and chronological burstiness.  (See [LEWIS90b, Ch. 8] for more
details.)

Despite these problems, this split is of interest because it provides
the ability to compare results with those of the CONSTRUE system
[HAYES89, HAYES90b].  Comparison of results on the ModHayes split with
previously published results on the original Hayes split in HAYES89
and HAYES90b (and LEWIS90b, Ch. 8) is possible, though the following
points should be taken into account:
