.\" Process this file with
.\"    groff -man -Tascii prepare_corpus.1
.TH PREPARE_CORPUS 1 "FEBRUARY 2004" "Infomap Project" "Infomap NLP Manual"
.SH NAME
prepare_corpus \- parse corpus for other Infomap NLP tools
.SH SYNOPSIS
.B prepare_corpus
.BR -cdir " <corpus_dir> " -mdir " <model_data_dir> " \
-cfile " <corpus_filename> " -fnfile " <file_name_filename> " \
-slfile " <stoplist_filename> " -chfile " <valid_chars_filename> " \
-rptfile " <reports_filename> "
.SH DESCRIPTION
.B prepare_corpus
parses a corpus and tabulates the term and document frequencies
of the word types in that corpus.  It writes several output files
used by other components of the Infomap NLP software package, most
directly by
.BR count_wordvec (1).
.SH OPTIONS
Note that the options must
.I all
be given, and that they must appear in the exact order shown in the
above synopsis.  Empty options should be explicitly specified using ""
on the command line.  This strict option syntax is used because
.B prepare_corpus
should usually be called via a Makefile or a wrapper script, rather
than directly from the command line.
.TP
.BI -cdir \ <corpus_dir>
The directory containing the corpus.  This directory should contain
either the corpus file
.RB ( -cfile )
or the filenames file
.RB ( -fnfile ).
.TP
.BI -mdir \ <model_data_dir>
The directory in which the output files will be written.
.TP
.BI -cfile \ <corpus_filename>
The file containing the corpus, if this is a single-file corpus.  If
this option is non-empty, then
.B -fnfile
should be empty (i.e., specified as "").
.TP
.BI -fnfile \ <file_name_filename>
A file containing a list of filenames, one filename per line, for
multi-file corpora.  The corpus consists of the contents of the files
listed in this file, each of which must contain exactly one corpus
document.  If this option is non-empty, then
.B -cfile
should be empty (i.e., specified as "").
.TP
.BI -chfile \ <valid_chars_filename>
The file containing the valid word characters.  These characters are the
ones your words will eventually be composed of.  All other characters
are considered by the tokenization to be breaking and are skipped.  The
list of characters in the valid characters file is given as a
continuous string without delimiters.

Watch out for newlines: if you have one at the end of this file, it
will be considered a legitimate part of words (which may not be what
you want).

Currently the tokenizer relies on the assumption that '<' is
.I not
a valid word character.  This makes sense given the XML-like tags that
designate document and metadata boundaries.

Currently the tokenizer treats word-internal single quotes as
non-breaking valid word characters even if they are included as valid.
This enables words like
.I isn't
to be treated as words while still stripping quotes.  Changing this
behaviour is currently not possible.
.TP
.BI -slfile \ <stoplist_filename>
The file containing the stoplist (i.e., the list of words to be ignored
during subsequent processing).  The stoplist file should list stopwords
one per line.
.B prepare_corpus
will mark these words as stopwords in the "dictionary" file that it
outputs; this marking allows these words to be ignored by later
processing.
.TP
.BI -rptfile \ <reports_filename>
A file to which reports on the operation of
.B prepare_corpus
are written, if reporting is enabled.
.SH INPUT FILES
corpus file
.RS
The file containing the corpus.  Either this file (specified as an
argument to the
.B -cfile
option) or the filenames file must be given, but not both.  Within the
corpus file, documents are marked by simple XML-like <DOC> and </DOC>
tags.
.RE
filenames file
.RS
A file containing a list of files making up the corpus.  Either this file
(specified as an argument to the
.B -fnfile
option) or the corpus file must be given, but not both.  Each of the
files listed in the filenames file must contain exactly one corpus
document.  Note that all of these files also serve as input to
.BR prepare_corpus .
.RE
.SH OUTPUT FILES
All of these files are produced in the model data directory, specified
as an argument to the
.B -mdir
option.
.PP
.I wordlist
.RS
The parsed corpus.  This file is the original corpus, printed
one token per line, with additional XML-tag-like tokens inserted
to mark document boundaries.
.RE
.I dic
.RS
The "dictionary" file.  This file lists each of the words (types)
encountered in the corpus, one word per line, sorted in decreasing
order of frequency.  Each line lists the term frequency, the document
frequency, a boolean indication of whether the term is a stopword
(1=stopword; 0=not a stopword), and the term (word) itself.  The fields
are separated by spaces.
.RE
.I numDocs
.RS
This text file contains a single number, the number of documents
making up the corpus.
.RE
.I number2name.{dir,pag}
.RS
These files form a DBM database.  Each key in this database
is a document ID number, assigned as the corpus is processed.
The corresponding value is the filename of the file consisting of the
document with that ID.  This database is only generated for multiple-file
corpora.
.RE
.I model_params.bin
.RS
This file contains essential parameters about an Infomap model in a
binary format.  This file is read by the other preprocessing-side Infomap
tools and by
.BR associate (1).
Less important corpus parameters are written to
.I model_info.bin
to keep this file as small as possible.  This file is
.I not
very portable to other architectures.
See
.BR write_text_params (1).
.RE
.I model_info.bin
.RS
This file contains corpus parameters in a binary format that is convenient
for input to other tools.  See
.BR write_text_params (1).
.RE
.I corpus_format.bin
.RS
This file contains information about the input format of the corpus,
such as which tags were used to recognize the beginning and end of documents,
and which XML/SGML character entities (like &amp;) have been stripped.
The file is in a binary format that is convenient as input for other
tools.  See
.BR write_text_params (1).
.RE
.SH SEE ALSO
.BR count_wordvec (1),
.BR svdinterface (1),
.BR encode_wordvec (1),
.BR count_artvec (1),
.BR write_text_params (1),
.BR associate (1).
.SH DIAGNOSTICS
Returns 0 to indicate success; 1 to indicate error.
.SH BUGS
Option handling should be more flexible and should probably use library
functions like
.BR getopt (3).
Please report bugs to
.BR infomap-nlp-users@lists.sourceforge.net .
.SH CREDITS
The Infomap NLP software was written by Stefan Kaufmann, Hinrich
Schuetze, Dominic Widdows, Beate Dorow, and Scott Cederberg.  The
Infomap algorithm was originally developed by Hinrich Schuetze.
.SH AUTHOR
This manual page was written by Scott Cederberg.  Please direct
inquiries and bug reports to
.BR infomap-nlp-users@lists.sourceforge.net .
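.SH EXAMPLES
The strict option order required by
.B prepare_corpus
and the line format of its
.I dic
output file can be illustrated with a minimal Python sketch.
The names
.BR build_argv ,
.BR parse_dic_line ,
and
.B DicEntry
are hypothetical helpers and are not part of the Infomap NLP software.
The sketch also assumes that terms contain no whitespace.
.PP
.nf
```python
# Hypothetical helpers (not part of Infomap): build the strictly ordered
# prepare_corpus argument list and parse one line of the "dic" output file.
from typing import List, NamedTuple

# Option order exactly as given in the SYNOPSIS section.
OPTION_ORDER = ["-cdir", "-mdir", "-cfile", "-fnfile",
                "-slfile", "-chfile", "-rptfile"]


class DicEntry(NamedTuple):
    term_freq: int
    doc_freq: int
    is_stopword: bool
    term: str


def build_argv(options: dict) -> List[str]:
    """Return the argument list; options not supplied are passed as ""."""
    # The man page requires exactly one of -cfile and -fnfile to be non-empty.
    if bool(options.get("-cfile")) == bool(options.get("-fnfile")):
        raise ValueError("give exactly one of -cfile and -fnfile")
    argv = ["prepare_corpus"]
    for name in OPTION_ORDER:
        argv += [name, options.get(name, "")]
    return argv


def parse_dic_line(line: str) -> DicEntry:
    """Split a dic line into its four space-separated fields."""
    # Assumption: the term itself contains no whitespace.
    tf, df, stop, term = line.split()
    return DicEntry(int(tf), int(df), stop == "1", term)
```
.fi
.PP
A wrapper script could pass the list returned by build_argv to a process
launcher such as Python's subprocess.run, which keeps empty options as
explicit empty arguments.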