-------------------------
MSTParser version 0.2
-------------------------

The following package contains a Java implementation of the dependency
parsers described in:

Non-Projective Dependency Parsing using Spanning Tree Algorithms
R. McDonald, F. Pereira, K. Ribarov and J. Hajic
HLT-EMNLP, 2005

Online Large-Margin Training of Dependency Parsers
R. McDonald, K. Crammer and F. Pereira
ACL, 2005

Online Learning of Approximate Dependency Parsing Algorithms
R. McDonald and F. Pereira
EACL, 2006

In addition, the parsers in this package can also learn and produce typed
dependency trees (i.e. trees with edge labels).

The parser should work with Java 1.4 and 1.5.

If there are any problems running the parser, email: ryantm@cis.upenn.edu
I will only respond to questions not answered in this README.

----------------
Contents
----------------

1. Compiling
2. Example of usage
3. Running the parser
   a. Input data format
   b. Training a parser
   c. Running a trained model on new data
   d. Evaluating output
4. Memory/Disk space and performance issues
5. Reproducing results in HLT-EMNLP and ACL papers

----------------
1. Compiling
----------------

To compile the code, first unzip/tar the downloaded file:

> gunzip mstparser.tar.gz
> tar -xvf mstparser.tar
> cd MSTParser

Next, run the following command:

> javac -classpath ".:lib/trove.jar" mstparser/DependencyParser.java

This will compile the package.

---------------------
2. Example Usage
---------------------

The directory data/ contains examples of training and testing data. The
data format is described in the next section.

train.ulab/test.ulab
- training and testing data with unlabeled trees

train.lab/test.lab
- training and testing data with labeled trees

To run an unlabeled parser, type:

> java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
  train train-file:data/train.ulab model-name:dep.model \
  test test-file:data/test.ulab output-file:out.txt \
  eval gold-file:data/test.ulab

This will train a parser on the training data, run it on the testing data
and evaluate the output against the gold standard. The results from running
the parser are in the file out.txt and the trained model is in dep.model.

To train a labeled parser, run the same command but use the labeled
training and testing files.

-------------------------
3. Running the Parser
-------------------------

-------------------------
3a. Input data format
-------------------------

Example data sets are given in the data/ directory.

Each sentence in the data is represented by 3 or 4 lines, and sentences are
separated by blank lines. The general format is:

w1    w2    ...    wn
p1    p2    ...    pn
l1    l2    ...    ln
d1    d2    ...    dn

....

Where,
- w1 ... wn are the n words of the sentence (tab delimited)
- p1 ... pn are the POS tags for each word
- l1 ... ln are the labels of the incoming edge to each word
- d1 ... dn are integers representing the position of each word's parent

For example, the sentence "John hit the ball" would be:

John    hit    the    ball
N       V      D      N
SBJ     ROOT   MOD    OBJ
2       0      4      2

Note that hit's parent is indexed by 0 since it is the root.

If you wish to only train or test an unlabeled parser, simply leave out the
third line for each sentence, e.g.,

John    hit    the    ball
N       V      D      N
2       0      4      2

The parser will automatically detect that it should produce unlabeled
trees.

Note that this format is the same for training AND for running the parser
on new data. Of course, you may not always know the gold standard. In that
case, just substitute line 3 (the edge labels) and line 4 (the parent
indexes) with dummy values. The parser ignores these values and produces
its own.
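To make the format concrete, here is a minimal Java sketch of a writer for
this format. It is not part of MSTParser; the class name, method names and
the joinTab helper are invented for illustration only.

  import java.io.PrintWriter;

  public class MstFormatWriter {

      // Joins tokens with tab characters, as the format requires.
      static String joinTab(String[] toks) {
          StringBuffer sb = new StringBuffer();
          for (int i = 0; i < toks.length; i++) {
              if (i > 0) sb.append('\t');
              sb.append(toks[i]);
          }
          return sb.toString();
      }

      // Writes one sentence; pass labels == null for unlabeled data.
      public static void writeSentence(PrintWriter out, String[] words,
                                       String[] pos, String[] labels,
                                       String[] parents) {
          out.println(joinTab(words));          // w1 ... wn
          out.println(joinTab(pos));            // p1 ... pn
          if (labels != null)
              out.println(joinTab(labels));     // l1 ... ln
          out.println(joinTab(parents));        // d1 ... dn, 0 = root
          out.println();                        // blank line ends the sentence
      }
  }

For the example above, calling writeSentence with the arrays
{"John","hit","the","ball"}, {"N","V","D","N"}, {"SBJ","ROOT","MOD","OBJ"}
and {"2","0","4","2"} reproduces the four lines shown.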
----------------------------
3b. Training the parser
----------------------------

If you have a set of labeled data, first put it in the format described
above.

If your training data is in a file train.txt, you can then run the command:

> java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
  train train-file:train.txt

This will train a parser with all the default properties. Additional
properties can be set with the following flags:

train
- if present, the parser will train a new model

train-file:file.txt
- use the data in file.txt to train the parser

model-name:model.name
- store the trained model in a file called model.name

training-iterations:numIters
- run the training algorithm for numIters epochs; default is 10

decode-type:type
- type is either "proj" or "non-proj", e.g. decode-type:proj
- default is "proj"
- "proj" uses the projective parsing algorithm during training,
  i.e. the Eisner algorithm
- "non-proj" uses the non-projective parsing algorithm during training,
  i.e. the Chu-Liu-Edmonds algorithm

training-k:K
- specifies the k-best parse set size used to create constraints during
  training
- default is 1
- for the non-projective parsing algorithm, k-best decoding is approximate

loss-type:type
- type is either "punc" or "nopunc", e.g. loss-type:punc
- default is "punc"
- "punc" includes punctuation in the hamming loss calculation
- "nopunc" excludes punctuation from the hamming loss calculation

create-forest:cf
- cf is either "true" or "false"
- default is "true"
- if create-forest is false, the parser will not create the training parse
  forest (see section 4); it assumes the forest has already been created
- this flag is useful if you are training many models on the same data and
  features but with different parameters (e.g. training iterations,
  decoding type)

order:ord
- ord is either 1 or 2
- default is 1
- specifies the order/scope of features: 1 only has features over single
  edges, and 2 also has features over pairs of adjacent edges in the tree

------------------------------------------------
3c. Running a trained model on new data
------------------------------------------------

This section assumes you have trained a model and it is stored in
dep.model.

First, format your data properly (section 3a).

Note that the parser expects both words and POS tags. To generate POS tags
for your data, I suggest using Adwait Ratnaparkhi's POS tagger or another
tagger of your choice.

The parser also expects the edge label and parent index lines in the input.
However, these can just be artificially inserted (e.g. with lines of
"LAB ... LAB" and "0 ... 0") since the parser will produce these lines as
output; a sketch of such a pre-processing step is given at the end of this
section.

If the data is in a file called test.txt, run the command:

> java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
  test model-name:dep.model test-file:test.txt output-file:out.txt

This will create an output file "out.txt" with the predictions of the
parser.

Other properties can be defined with the following flags:

test
- if included, a trained parser will be run on the testing data

test-file:file.txt
- the file containing the data to run the parser on

model-name:model.name
- the name of the stored model to be used

output-file:out.txt
- the file in which to write the result of running the parser on the new
  data

decode-type:type
- see section 3b

order:ord
- see section 3b. THIS NEEDS TO HAVE THE SAME VALUE AS THE TRAINED MODEL!!

Note that if you train a labeled model, you should only run it expecting
labeled output (i.e. the test data should have 4 lines per sentence), and
if you train an unlabeled model, you should only run it expecting unlabeled
output (i.e. the test data should have 3 lines per sentence).
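As an illustration of the dummy-line trick mentioned above, here is a
small, hypothetical pre-processing class. It is not part of MSTParser (the
class name and file handling are invented); it pads a file containing only
the word and POS lines with "LAB" and "0" lines:

  import java.io.*;

  public class PadInput {
      public static void main(String[] args) throws IOException {
          BufferedReader in = new BufferedReader(new FileReader(args[0]));
          PrintWriter out = new PrintWriter(new FileWriter(args[1]));
          String words;
          while ((words = in.readLine()) != null) {
              if (words.trim().length() == 0) continue; // blank separator
              String pos = in.readLine();    // POS line follows the words
              int n = words.split("\t").length;
              StringBuffer labs = new StringBuffer();
              StringBuffer pars = new StringBuffer();
              for (int i = 0; i < n; i++) {
                  if (i > 0) { labs.append('\t'); pars.append('\t'); }
                  labs.append("LAB");        // dummy edge label
                  pars.append('0');          // dummy parent index
              }
              out.println(words);
              out.println(pos);
              out.println(labs);
              out.println(pars);
              out.println();                 // blank line between sentences
          }
          out.close();
          in.close();
      }
  }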
------------------------
3d. Evaluating Output
------------------------

This section describes a simple class for evaluating the output of the
parser against a gold standard.

Assume you have a gold standard, say test.txt, and the output of the
parser, say out.txt. Then run the following command:

> java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
  eval gold-file:test.txt output-file:out.txt

This will report both labeled and unlabeled accuracy (if the data sets
contain labeled trees) as well as complete sentence accuracy, again labeled
and unlabeled.

Note that this evaluation currently includes all punctuation. In future
releases we will modify this class to allow the evaluation to ignore
punctuation, which is standard for English (Yamada and Matsumoto 03).

---------------------------------------------
4. Memory/Disk space and performance issues
---------------------------------------------

This parser is memory and disk space intensive.

MEMORY ISSUES

Remember to always run java with the flag -Xmx1800m to use all available
memory for the heap. For 64-bit machines, use an even larger value, say
-Xmx8000m.

Training a model on the WSJ can be done easily on a 32-bit machine. It
should also be possible to train a model on the entire Prague Dependency
Treebank on a 32-bit machine (I have done it), but I make no guarantees.

DISK ISSUES

To make training quicker, we store the entire parse forest on disk,
following Clark and Curran 04. This can be very large, up to and over
20GB!! Be aware of this fact.

If you train using a file called train.txt, the forest will be stored in a
file called train.txt.forest. If disk space is an issue, you can remove
this file immediately after training (it is not needed to run the parser
on new data).

However, sometimes it is good to keep this file around, particularly if
you are retraining a model on the same data and feature space but want to
try different training settings. By using the create-forest:false flag,
you can avoid having to recreate this file (which can take some time).
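For example, suppose an earlier run has already produced train.txt.forest.
A second training run with different settings (the model name here is
illustrative) can then skip forest creation:

> java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
  train train-file:train.txt model-name:dep.20iter.model \
  training-iterations:20 create-forest:false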
PERFORMANCE ISSUES

Once a model has been trained, running it on new data is pretty quick.
However, as with all discriminatively trained parsers, it does take some
time to train a parser. On a two-year-old 32-bit machine it will take
10-15 hours to train a model on the entire Penn Treebank and around 24-30
hours to train a model on the Prague Dependency Treebank. Newer machines
or 64-bit machines are of course much quicker.

-------------------------------------------------------
5. Reproducing results in HLT-EMNLP and ACL papers
-------------------------------------------------------

To reproduce the English results in McDonald et al. ACL 2005:

> java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
  train train-file:train.wsj model-name:eng.wsj.model \
  training-k:5 loss-type:nopunc decode-type:proj \
  test test-file:test.wsj output-file:out.txt \
  eval gold-file:test.wsj

This assumes that train.wsj is sections 02-21 of the Penn Treebank,
formatted as above, with dependencies extracted using the head rules of
Yamada and Matsumoto. See Joakim Nivre's Penn2Malt tool set at
http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html for converting the WSJ
to dependencies using the Yamada and Matsumoto head rules. test.wsj is
section 23 of the WSJ converted in the same way.

Furthermore, POS tags are supplied using Adwait Ratnaparkhi's MXPOST
tool-kit trained on sections 02-21. This can be found at:
http://www.cogsci.ed.ac.uk/~jamesc/taggers/MXPOST.html

Note that the evaluation will be slightly off from the reported results.
This is because the evaluation script includes punctuation. If you modify
the evaluation script to discount punctuation, the results will align (see
the sketch at the end of this README).

To reproduce the Czech results in McDonald et al. HLT-EMNLP 2005:

> java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
  train train-file:train.pdt model-name:czech.pdt.model \
  training-k:1 loss-type:punc decode-type:non-proj \
  test test-file:test.pdt output-file:out.txt \
  eval gold-file:test.pdt

This assumes train.pdt and test.pdt are the training and testing sections
of the Prague Dependency Treebank v1.0, formatted as above. We use the
automatically assigned POS tags, reduced as described in the paper.
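As noted in section 5, the bundled evaluation counts punctuation. If you
modify the evaluation to discount punctuation, one common convention
(e.g. following Yamada and Matsumoto 03) is to skip tokens whose POS tag
contains no letters or digits. A hypothetical sketch of such a filter (not
part of MSTParser):

  public class PunctFilter {
      // True if the POS tag consists entirely of punctuation characters.
      static boolean isPunct(String posTag) {
          for (int i = 0; i < posTag.length(); i++) {
              if (Character.isLetterOrDigit(posTag.charAt(i))) return false;
          }
          return posTag.length() > 0;
      }
  }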