readme.txt
1. The testset we provide in the ModHayes split has one fewer document than the one Hayes used. The document that was removed (OLDID="22026") was a timestamp duplicate of the document with OLDID="22027" and NEWID="13234". So in computing effectiveness measures for comparison with HAYES89/90b, the document with NEWID="13234" should be counted twice.

2. The documents in the Hayes testset had relatively few errors and anomalies in their categorization, and the errors which we did find and correct appear unlikely to have affected the original Hayes results. In particular, it appears that the only errors in the TOPICS field were the addition of a few invalid categories that were not evaluated on. However, for completeness we list the changes made to the Hayes testset documents in going from Reuters-22173 to Reuters-21578 (all documents are referred to by their NEWID):

   Removal of invalid TOPIC "loan": 13234, 16946, 17111, 17112, 17207, 17217, 17228, 17234, 17271, 17310
   Removal of invalid TOPIC "gbond": 17138, 17260
   Removal of invalid TOPIC "tbill": 17258
   Removal of invalid TOPIC "cbond": 17024
   Removal of invalid TOPIC "fbond": 17087
   Correction of invalid PEOPLE "mancera" to "mancera-aguayo": 17142, 17149, 17154, 17177, 17187
   Correction of invalid PEOPLE "andriesssen" to "andriessen": 17366
   Correction of invalid PLACES "ivory" and "coast" to the single correct PLACE "ivory-coast": 18383

3. The effectiveness measures used in HAYES89 and HAYES90b were somewhat nonstandard. See Ch. 8 of LEWIS91d for a discussion.

VIII.D. Other Splits

   We strongly encourage researchers to use one (or more) of the above splits for their experiments (or to use cross-validation on one of the sets of documents defined in the above splits). We recommend the Modified Apte ("ModApte") split for research on predicting the TOPICS field, since the evidence is that a significant number of documents that should have TOPICS do not.
   The ModLewis split can be used if the researcher has a strong need to test the ability of a system to deal with examples belonging to no category. While it is likely that some of these examples should indeed belong to a category, the ModLewis split is at least better than the corresponding split from Reuters-22173, in that it eliminates the "bypass" stories.

   We particularly encourage you to resist the following temptations:

1. Defining new splits based on whether or not the documents actually have any TOPICS categories. (See the discussion of the ModApte split.)

2. Testing your system only on the "easy" categories. This is a temptation we have succumbed to in the past, but will resist in the future. Yes, we know that some of the 135 TOPICS categories have few or no positive training examples, or few or no positive test examples, or both. Yes, purely supervised learning systems will do very badly on these categories. Knowledge-based systems, on the other hand, might do well on them, while doing poorly in comparison with supervised learning on categories with lots of positive examples. These comparisons are of great interest. Of course, it is *in addition* of great interest to analyze subsets of categories (e.g., those with many positive examples vs. those with few).

   Note that one strategy we considered and rejected was to assume that documents which have no TOPICS but do have categories in other fields (PLACES, etc.) belong to no TOPICS categories. This does not appear to be a safe assumption: we have found a number of examples of documents with PLACES but no TOPICS where there are TOPICS that clearly apply.

IX. Feature Sets in Text Categorization

   For many text categorization methods, particularly those using statistical classification techniques, it is convenient to represent documents not as a sequence of characters, but rather as a tuple of numeric or binary feature values.
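   For concreteness, such a feature-tuple representation might be computed along the following lines. This is a minimal sketch only: the term list, whitespace tokenization, and function names are illustrative assumptions, not a description of the feature-set files that were distributed with Reuters-22173.

```python
# Sketch of representing documents as feature tuples.
# TERMS is a hypothetical list of indexing terms, not from the corpus.
TERMS = ["financial", "oil", "grain"]

def binary_features(text):
    """1 if the term occurs as a whitespace-delimited token, else 0."""
    tokens = set(text.split())
    return tuple(1 if term in tokens else 0 for term in TERMS)

def count_features(text):
    """Number of occurrences of each term as a whitespace-delimited token."""
    tokens = text.split()
    return tuple(tokens.count(term) for term in TERMS)

doc = "financial markets watched oil prices and oil output"
print(binary_features(doc))  # (1, 1, 0)
print(count_features(doc))   # (1, 2, 0)
```

   A term is "present" in a document when its feature takes on a non-default value; here the default is 0 in both representations.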
   For instance, the value of feature Fi for a document Dj might be 1 if the string of characters "financial" occurred in the document with whitespace on either side, and 0 otherwise. Or the value of Fi for Dj might be the number of occurrences of "financial" in document Dj. In information retrieval such features are often called "indexing terms", and one often speaks of a term being "present" in a document, meaning that the feature takes on a non-default value. (Usually, but not always, any value but 0 is non-default.)

   Comparisons between text categorization methods that represent documents as feature tuples are aided by ensuring that the same tuple representation is used with all methods, thus avoiding conflating differences in feature extraction with differences in, say, machine learning methods. For that reason, the Reuters-22173 distribution included not only the formatted text of the Reuters stories, but also feature tuple representations of the stories in each of two feature sets, one based on words and one based on noun phrases. Surprisingly, almost no use was made of these files by other researchers, so we have not included files of this sort in the Reuters-21578 distribution.

   However, we are willing to make available as part of the distribution any tuple representations of this sort that researchers want to contribute. (Contact lewis@research.att.com if you would like to do this.) Perhaps the ideal situation would be if someone with a strong interest in feature set formation produced tuples based on a high quality set of features, which other researchers interested only in learning algorithms could make use of.

X. Bibliography

[This needs to be updated.]

@article{APTE94
,author = "Chidanand Apt{\'{e}} and Fred Damerau and Sholom M. Weiss"
,title = "Automated Learning of Decision Rules for Text Categorization"
,journal = "ACM Transactions on Information Systems"
,year = 1994
,note = "To appear."
}

@inproceedings{APTE94b
,author = "Chidanand Apt{\'{e}} and Fred Damerau and Sholom M. Weiss"
,title = "Toward Language Independent Automated Learning of Text Categorization Models"
,booktitle = sigir94
,year = 1994
,note = "To appear."
}

@inproceedings{HAYES89
,author = "Philip J. Hayes and Peggy M. Anderson and Irene B. Nirenburg and Linda M. Schmandt"
,title = "{TCS}: A Shell for Content-Based Text Categorization"
,booktitle = "IEEE Conference on Artificial Intelligence Applications"
,year = 1990
}

@inproceedings{HAYES90b
,author = "Philip J. Hayes and Steven P. Weinstein"
,title = "{CONSTRUE/TIS:} A System for Content-Based Indexing of a Database of News Stories"
,booktitle = "Second Annual Conference on Innovative Applications of Artificial Intelligence"
,year = 1990
}

@incollection{HAYES92
,author = "Philip J. Hayes"
,title = "Intelligent High-Volume Text Processing using Shallow, Domain-Specific Techniques"
,booktitle = "Text-Based Intelligent Systems"
,publisher = "Lawrence Erlbaum"
,address = "Hillsdale, NJ"
,year = 1992
,editor = "Paul S. Jacobs"
}

@inproceedings{LEWIS91c
,author = "David D. Lewis"
,title = "Evaluating Text Categorization"
,booktitle = "Proceedings of Speech and Natural Language Workshop"
,year = 1991
,month = feb
,organization = "Defense Advanced Research Projects Agency"
,publisher = "Morgan Kaufmann"
,pages = "312--318"
}

@phdthesis{LEWIS91d
,author = "David Dolan Lewis"
,title = "Representation and Learning in Information Retrieval"
,school = "Computer Science Dept.; Univ. of Massachusetts; Amherst, MA 01003"
,year = 1992
,note = "Technical Report 91--93."
}

@inproceedings{LEWIS91e
,author = "David D. Lewis"
,title = "Data Extraction as Text Categorization: An Experiment with the {MUC-3} Corpus"
,booktitle = "Proceedings of the Third Message Understanding Evaluation and Conference"
,year = 1991
,month = may
,organization = "Defense Advanced Research Projects Agency"
,publisher = "Morgan Kaufmann"
,address = "Los Altos, CA"
}

@inproceedings{LEWIS92b
,author = "David D. Lewis"
,title = "An Evaluation of Phrasal and Clustered Representations on a Text Categorization Task"
,booktitle = "Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval"
,year = 1992
,pages = "37--50"
}

@inproceedings{LEWIS92d
,author = "David D. Lewis and Richard M. Tong"
,title = "Text Filtering in {MUC-3} and {MUC-4}"
,booktitle = "Proceedings of the Fourth Message Understanding Conference ({MUC-4})"
,year = 1992
,month = jun
,organization = "Defense Advanced Research Projects Agency"
,publisher = "Morgan Kaufmann"
,address = "Los Altos, CA"
}

@inproceedings{LEWIS92e
,author = "David D. Lewis"
,title = "Feature Selection and Feature Extraction for Text Categorization"
,booktitle = "Proceedings of Speech and Natural Language Workshop"
,year = 1992
,month = feb
,organization = "Defense Advanced Research Projects Agency"
,publisher = "Morgan Kaufmann"
,pages = "212--217"
}

@inproceedings{LEWIS94b
,author = "David D. Lewis and Marc Ringuette"
,title = "A Comparison of Two Learning Algorithms for Text Categorization"
,booktitle = "Symposium on Document Analysis and Information Retrieval"
,year = 1994
,organization = "ISRI; Univ. of Nevada, Las Vegas"
,address = "Las Vegas, NV"
,month = apr
,pages = "81--93"
}

@article{LEWIS94d
,author = "David D. Lewis and Philip J. Hayes"
,title = "Guest Editorial"
,journal = "ACM Transactions on Information Systems"
,year = 1994
,volume = 12
,number = 3
,pages = "231"
,month = jul
}

@article{SPARCKJONES76
,author = "K. {Sparck Jones} and C. J. {van Rijsbergen}"
,title = "Information Retrieval Test Collections"
,journal = "Journal of Documentation"
,year = 1976
,volume = 32
,number = 1
,pages = "59--75"
}

@book{WEISS91
,author = "Sholom M. Weiss and Casimir A. Kulikowski"
,title = "Computer Systems That Learn"
,publisher = "Morgan Kaufmann"
,year = 1991
,address = "San Mateo, CA"
}