.\" Process this file with
.\"    groff -man -Tascii count_wordvec.1
.TH COUNT_WORDVEC 1 "February 2004" "Infomap Project" "Infomap NLP Manual"
.SH NAME
count_wordvec \- compute word co-occurrence counts
.SH SYNOPSIS
.B count_wordvec
.BR -mdir " <model_data_dir> " -matlab " <0|1> " \
-precontext " <pre_context_size> " -postcontext " <post_context_size> " \
-rows " <num_rows> " -columns " <num_columns> " \
-col_labels_from_file " <0|1> " \
-col_label_file " <file_containing_col_labels> "
.SH DESCRIPTION
.B count_wordvec
reads a parsed corpus (in the format produced by
.BR prepare_corpus )
and produces a matrix of co-occurrence counts.  This matrix
has the dimensions specified by
.BR -rows \ and \ -columns .
The count in element
.B E_i,j
is the number of times the word corresponding to row
.B i
occurred within the <pre_context_size> words immediately before
an occurrence of the word corresponding to column
.BR j ,
or within the <post_context_size>
words immediately after an occurrence of word
.BR j .
.SH OPTIONS
Note that the options must
.I all
be given, and that they must appear in the exact order shown in the
above synopsis.  This strict option syntax is used because
.B count_wordvec
should usually be called via a Makefile or a wrapper script, rather
than directly from the command line.
.TP
.BI -mdir \ <model_data_dir>
.B count_wordvec
will read its input from and write its output to the
model data directory specified using this option.
.TP
.BI -matlab \ <0|1>
.B count_wordvec
writes its matrix of co-occurrence counts in a two-file,
column-major format suitable for input to
.B svdinterface
(which is a wrapper around functions from the University
of Tennessee's SVDPACKC library).
If the argument to this option is 1, output will also
be written in a format suitable for input to MATLAB.
If the argument to this option is 0, no such output
will be written.
.TP
.BI -precontext \ <pre_context_size>
.B count_wordvec
works by considering each word in a corpus in sequence.  For
each such word
.BR w ,
all words occurring in a "context window" surrounding
.B w
are considered to have co-occurred with
.BR w .
This option specifies how many words occurring immediately
.IB before \ w
should be considered to be in its context window.  Note
that context windows can be cut short by document boundaries.
.TP
.BI -postcontext \ <post_context_size>
.B count_wordvec
works by considering each word in a corpus in sequence.  For
each such word
.BR w ,
all words occurring in a "context window" surrounding
.B w
are considered to have co-occurred with
.BR w .
This option specifies how many words occurring immediately
.IB after \ w
should be considered to be in its context window.  Note
that context windows can be cut short by document boundaries.
.TP
.BI -rows \ <num_rows>
This option determines how many rows are maintained in
.BR count_wordvec 's
matrix of co-occurrence counts.  In other words, it determines
how many words
.B count_wordvec
maintains co-occurrence counts for.  The most frequent
<num_rows> words in the corpus (neglecting stopwords) will have
their co-occurrence counts recorded.
.TP
.BI -columns \ <num_columns>
This option determines how many columns are maintained in
.BR count_wordvec 's
matrix of co-occurrence counts.  Each column of the co-occurrence
matrix corresponds to a word, as does each row.  Row
.B i
is a "feature vector" for some word
.BR w_i .
Thus, while
.B -rows
specifies how many words we are learning feature vectors for,
.B -columns
specifies how many features each of these vectors contains.
The words corresponding to these features (the "column labels")
are intended to be "content-bearing words"; that is, they should be
words that provide information about the meaning of other words occurring
in their context.  The selection of content-bearing words for
use as column labels is an important part of the Infomap algorithm;
currently the words ranking 50-1049 in frequency (neglecting
stopwords) are chosen.
.TP
.BI -col_labels_from_file \ <0|1>
If equal to 1, this Boolean option indicates that the
column labels of the matrix of co-occurrence counts should be
read from the file specified via option
.BR -col_label_file .
If set to 0,
.B count_wordvec
will choose column labels automatically.  Default is 0.
.TP
.BI -col_label_file \ <file_containing_col_labels>
This option specifies the name of the file containing a set of
user-specified content-bearing words which
.B count_wordvec
will use as column labels of the matrix of co-occurrence counts.
Default is "".
.\" .SH EXAMPLES
.SH INPUT FILES
These files are read from the model data directory, specified as
an argument to the
.B -mdir
option.
.PP
.I dic
.RS
The dictionary file.  See
.BR prepare_corpus (1).
.RE
.I numFiles
.RS
The number of documents in the corpus.
See
.BR prepare_corpus (1).
.RE
.I wordlist
.RS
The parsed corpus, with indications of document breaks.
See
.BR prepare_corpus (1).
.RE
.I model_params.bin
.RS
.B count_wordvec
reads in this file and writes out a modified version.
See
.BR prepare_corpus (1).
.RE
.I model_info.bin
.RS
.B count_wordvec
reads in this file and writes out a modified version.
See
.BR prepare_corpus (1).
.RE
.SH OUTPUT FILES
All of these files are written to the model data directory, specified
as an argument to the
.B -mdir
option.
.PP
.I coll
.RS
This file and the
.I indx
file, together, serve as matrix input to
.BR svdinterface (1).
.RE
.I indx
.RS
This file and the
.I coll
file, together, serve as matrix input to
.BR svdinterface (1).
.RE
.I matlab
.RS
This file, which is only generated if the
.B -matlab
option is given an argument of "1", contains
the co-occurrence matrix in a format suitable for
input to MATLAB.
.RE
.I model_params.bin
.RS
.B count_wordvec
reads in this file and writes out a modified version.
See
.BR prepare_corpus (1).
.RE
.I model_info.bin
.RS
.B count_wordvec
reads in this file and writes out a modified version.
See
.BR prepare_corpus (1).
.RE
.SH SEE ALSO
.BR prepare_corpus (1),
.BR svdinterface (1),
.BR encode_wordvec (1),
.BR count_artvec (1),
.BR write_text_params (1).
.SH DIAGNOSTICS
Returns 0 to indicate success; a nonzero value to indicate error.
.SH BUGS
Should probably have more flexible option-handling, perhaps using
.BR getopt (3)
or something similar.  This page should have more detailed documentation
of the
.IR coll ", " indx ", and " matlab
file formats.
Please report bugs to
.BR infomap-nlp-users@lists.sourceforge.net .
.SH CREDITS
The Infomap NLP software was written by Stefan Kaufmann, Hinrich
Schuetze, Dominic Widdows, Beate Dorow, and Scott Cederberg.  The
Infomap algorithm was originally developed by Hinrich Schuetze.
.SH AUTHOR
This manual page was written by Scott Cederberg.  Please direct
inquiries and bug reports to
.BR infomap-nlp-users@lists.sourceforge.net .