mert-moses.pl.svn-base

From "moses, an open-source machine translation system" · SVN-BASE code · 1,162 lines total · page 1 of 3
#!/usr/bin/perl -w

# $Id$
# Usage:
# mert-moses.pl <foreign> <english> <decoder-executable> <decoder-config>
# For other options see below or run 'mert-moses.pl --help'

# Notes:
# <foreign> and <english> should be raw text files, one sentence per line
# <english> can be a prefix, in which case the files <english>0, <english>1, etc. are used

# Revision history

# 13 Feb 2007 Better handling of default values for lambda, now works with multiple
#             models and lexicalized reordering
# 11 Oct 2006 Handle different input types through parameter --inputtype=[0|1]
#             (0 for text, 1 for confusion network, default is 0) (Nicola Bertoldi)
# 10 Oct 2006 Allow skip of filtering of phrase tables (--no-filter-phrase-table)
#             useful if binary phrase tables are used (Nicola Bertoldi)
# 28 Aug 2006 Use either closest or average or shortest (default) reference
#             length as effective reference length
#             Use either normalization or not (default) of texts (Nicola Bertoldi)
# 31 Jul 2006 move gzip run*.out to avoid failure with restarting
#             adding default paths
# 29 Jul 2006 run-filter, score-nbest and mert run on the queue (Nicola; Ondrej had to type it in again)
# 28 Jul 2006 attempt at foolproof usage, strong checking of input validity, merged the parallel and nonparallel version (Ondrej Bojar)
# 27 Jul 2006 adding the safesystem() function to handle process failure
# 22 Jul 2006 fixed a bug about handling relative path of configuration file (Nicola Bertoldi)
# 21 Jul 2006 adapted for Moses-in-parallel (Nicola Bertoldi)
# 18 Jul 2006 adapted for Moses and cleaned up (PK)
# 21 Jan 2005 unified various versions, thorough cleanup (DWC)
#             now indexing accumulated n-best list solely by feature vectors
# 14 Dec 2004 reimplemented find_threshold_points in C (NMD)
# 25 Oct 2004 Use either average or shortest (default) reference
#             length as effective reference length (DWC)
# 13 Oct 2004 Use alternative decoders (DWC)
# Original version by Philipp Koehn

# for each _d_istortion, _l_anguage _m_odel, _t_ranslation _m_odel and _w_ord penalty, there is a list
# of [ default value, lower bound, upper bound ]-triples. In most cases, only one triple is used,
# but the translation model has currently 5 features

# defaults for initial values and ranges are:
my $default_triples = {
    # these two basic models exist even if not specified, they are
    # not associated with any model file
    "w" => [ [ 0.0, -1.0, 1.0 ] ],  # word penalty
};

my $additional_triples = {
    # if more lambda parameters for the weights are needed
    # (due to additional tables) use the following values for them
    "d"  => [ [ 1.0, 0.0, 2.0 ],    # lexicalized reordering model
              [ 1.0, 0.0, 2.0 ],
              [ 1.0, 0.0, 2.0 ],
              [ 1.0, 0.0, 2.0 ],
              [ 1.0, 0.0, 2.0 ],
              [ 1.0, 0.0, 2.0 ],
              [ 1.0, 0.0, 2.0 ] ],
    "lm" => [ [ 1.0, 0.0, 2.0 ] ],  # language model
    "g"  => [ [ 1.0, 0.0, 2.0 ],    # generation model
              [ 1.0, 0.0, 2.0 ] ],
    "tm" => [ [ 0.3, 0.0, 0.5 ],    # translation model
              [ 0.2, 0.0, 0.5 ],
              [ 0.3, 0.0, 0.5 ],
              [ 0.2, 0.0, 0.5 ],
              [ 0.0,-1.0, 1.0 ] ],  # ... last weight is phrase penalty
};

# moses.ini file uses FULL names for lambdas, while this training script internally (and on the command line)
# uses ABBR names.
my $ABBR_FULL_MAP = "d=weight-d lm=weight-l tm=weight-t w=weight-w g=weight-generation";
my %ABBR2FULL = map {split/=/,$_,2} split /\s+/, $ABBR_FULL_MAP;
my %FULL2ABBR = map {my ($a, $b) = split/=/,$_,2; ($b, $a);} split /\s+/, $ABBR_FULL_MAP;

# We parse moses.ini to figure out how many weights we need to optimize.
# For this, we must know the correspondence between options defining files
# for models and options assigning weights to these models.
my $TABLECONFIG_ABBR_MAP = "ttable-file=tm lmodel-file=lm distortion-file=d generation-file=g";
my %TABLECONFIG2ABBR = map {split(/=/,$_,2)} split /\s+/, $TABLECONFIG_ABBR_MAP;

# There are weights that do not correspond to any input file, they just increase the total number of lambdas we optimize
#my $extra_lambdas_for_model = {
#  "w" => 1,  # word penalty
#  "d" => 1,  # basic distortion
#};

my $minimum_required_change_in_weights = 0.00001;
    # stop if no lambda changes more than this

my $verbose = 0;
my $usage = 0; # request for --help
my $___WORKING_DIR = "mert-work";
my $___DEV_F = undef; # required, input text to decode
my $___DEV_E = undef; # required, basename of files with references
my $___DECODER = undef; # required, pathname to the decoder executable
my $___CONFIG = undef; # required, pathname to startup ini file
my $___N_BEST_LIST_SIZE = 100;
my $queue_flags = "-l mem_free=0.5G -hard";  # extra parameters for parallelizer
      # the -l ws0ssmt is relevant only to JHU workshop
my $___JOBS = undef; # if parallel, number of jobs to use (undef -> serial)
my $___DECODER_FLAGS = ""; # additional parameters to pass to the decoder
my $___LAMBDA = undef; # string specifying the seed weights and boundaries of all lambdas
my $continue = 0; # should we try to continue from the last saved step?
my $skip_decoder = 0; # and should we skip the first decoder run (assuming we got interrupted during mert)
my $___FILTER_PHRASE_TABLE = 1; # filter phrase table

# Parameter for effective reference length when computing BLEU score
# This is used by score-nbest-bleu.py
# Default is to use shortest reference
# Use "--average" to use average reference length
# Use "--closest" to use closest reference length
# Only one between --average and --closest can be set
# If both --average is used
my $___AVERAGE = 0;
my $___CLOSEST = 0;

# Use "--nonorm" to not normalize translations before computing BLEU
my $___NONORM = 0;

# set 0 if input type is text, set 1 if input type is confusion network
my $___INPUTTYPE = 0;
# set 1 if using with async decoder
my $___ASYNC = 0;

my $allow_unknown_lambdas = 0;
my $allow_skipping_lambdas = 0;

my $SCRIPTS_ROOTDIR = undef; # path to all tools (overridden by specific options)
my $cmertdir = undef; # path to cmert directory
my $pythonpath = undef; # path to python libraries needed by cmert
my $filtercmd = undef; # path to filter-model-given-input.pl
my $SCORENBESTCMD = undef;
my $qsubwrapper = undef;
my $moses_parallel_cmd = undef;
my $old_sge = 0; # assume sge<6.0
my $___CONFIG_BAK = undef; # backup pathname to startup ini file
my $obo_scorenbest = undef; # set to pathname to Ondrej Bojar's scorer (not included
                            # in scripts distribution)
my $efficient_scorenbest_flag = undef; # set to 1 to activate a time-efficient scoring of nbest lists
                                  # (this method is more memory-consumptive)
my $___ACTIVATE_FEATURES = undef; # comma-separated (or blank-separated) list of features to work on
                                  # if undef work on all features
                                  # (others are fixed to the starting values)

use strict;
use Getopt::Long;
GetOptions(
  "working-dir=s" => \$___WORKING_DIR,
  "input=s" => \$___DEV_F,
  "inputtype=i" => \$___INPUTTYPE,
  "refs=s" => \$___DEV_E,
  "decoder=s" => \$___DECODER,
  "config=s" => \$___CONFIG,
  "nbest=i" => \$___N_BEST_LIST_SIZE,
  "queue-flags=s" => \$queue_flags,
  "jobs=i" => \$___JOBS,
  "decoder-flags=s" => \$___DECODER_FLAGS,
  "lambdas=s" => \$___LAMBDA,
  "continue" => \$continue,
  "skip-decoder" => \$skip_decoder,
  "average" => \$___AVERAGE,
  "closest" => \$___CLOSEST,
  "nonorm" => \$___NONORM,
  "help" => \$usage,
  "allow-unknown-lambdas" => \$allow_unknown_lambdas,
  "allow-skipping-lambdas" => \$allow_skipping_lambdas,
  "verbose" => \$verbose,
  "rootdir=s" => \$SCRIPTS_ROOTDIR,
  "cmertdir=s" => \$cmertdir,
  "pythonpath=s" => \$pythonpath,
  "filtercmd=s" => \$filtercmd, # allow to override the default location
  "scorenbestcmd=s" => \$SCORENBESTCMD, # path to score-nbest.py
  "qsubwrapper=s" => \$qsubwrapper, # allow to override the default location
  "mosesparallelcmd=s" => \$moses_parallel_cmd, # allow to override the default location
  "old-sge" => \$old_sge, # passed to moses-parallel
  "filter-phrase-table!" => \$___FILTER_PHRASE_TABLE, # allow (disallow) filtering of phrase tables
  "obo-scorenbest=s" => \$obo_scorenbest, # see above
  "efficient_scorenbest_flag" => \$efficient_scorenbest_flag, # activate a time-efficient scoring of nbest lists
  "async=i" => \$___ASYNC, # whether script to be used with async decoder
  "activate-features=s" => \$___ACTIVATE_FEATURES, # comma-separated (or blank-separated) list of features to work on (others are fixed to the starting values)
) or exit(1);

# the 4 required parameters can be supplied on the command line directly
# or using the --options
if (scalar @ARGV == 4) {
  # required parameters: input_file references_basename decoder_executable decoder_config
  $___DEV_F = shift;
  $___DEV_E = shift;
  $___DECODER = shift;
  $___CONFIG = shift;
}

if ($___ASYNC) {
  delete $default_triples->{"w"};
  $additional_triples->{"w"} = [ [ 0.0, -1.0, 1.0 ] ];
}

print STDERR "After default: $queue_flags\n";

if ($usage || !defined $___DEV_F || !defined $___DEV_E || !defined $___DECODER || !defined $___CONFIG) {
  print STDERR "usage: mert-moses.pl input-text references decoder-executable decoder.ini
Options:
  --working-dir=mert-dir ... where all the files are created
  --nbest=100 ... how big nbestlist to generate
  --jobs=N  ... set this to anything to run moses in parallel
  --mosesparallelcmd=STRING ... use a different script instead of moses-parallel
  --queue-flags=STRING  ... anything you wish to pass to
              qsub, eg. '-l ws06osssmt=true'
              The default is
                -l mem_free=0.5G -hard
              To reset the parameters, please use \"--queue-flags=' '\" (i.e. a space between
              the quotes).
  --decoder-flags=STRING ... extra parameters for the decoder
  --lambdas=STRING  ... default values and ranges for lambdas, a complex string
         such as 'd:1,0.5-1.5 lm:1,0.5-1.5 tm:0.3,0.25-0.75;0.2,0.25-0.75;0.2,0.25-0.75;0.3,0.25-0.75;0,-0.5-0.5 w:0,-0.5-0.5'
  --allow-unknown-lambdas ... keep going even if someone supplies a new lambda
         in the lambdas option (such as 'superbmodel:1,0-1'); optimize it, too
  --continue  ... continue from the last achieved state
  --skip-decoder ... skip the decoder run for the first time, assuming that
                     we got interrupted during optimization
  --average ... Use either average or shortest (default) reference
                  length as effective reference length
  --closest ... Use either closest or shortest (default) reference
                  length as effective reference length
  --nonorm ... Do not use text normalization
  --filtercmd=STRING  ... path to filter-model-given-input.pl
  --rootdir=STRING  ... where do helpers reside (if not given explicitly)
  --cmertdir=STRING ... where is cmert installed
  --pythonpath=STRING  ... where are the python libraries needed by cmert
  --scorenbestcmd=STRING  ... path to score-nbest.py
  --old-sge ... passed to moses-parallel, assume Sun Grid Engine < 6.0
  --inputtype=[0|1] ... Handle different input types (0 for text, 1 for confusion network, default is 0)
  --no-filter-phrase-table ... disallow filtering of phrase tables
                              (useful if binary phrase tables are available)
  --efficient_scorenbest_flag ... activate a time-efficient scoring of nbest lists
                                  (this method is more memory-consumptive)
  --activate-features=STRING  ... comma-separated list of features to work on
                                  (if undef work on all features)
                                  (others are fixed to the starting values)
";
  exit 1;
}

# update variables if input is confusion network
if ($___INPUTTYPE == 1)
{
  $ABBR_FULL_MAP = "$ABBR_FULL_MAP I=weight-i";
  %ABBR2FULL = map {split/=/,$_,2} split /\s+/, $ABBR_FULL_MAP;
  %FULL2ABBR = map {my ($a, $b) = split/=/,$_,2; ($b, $a);} split /\s+/, $ABBR_FULL_MAP;

  push @{$default_triples -> {"I"}}, [ 1.0, 0.0, 2.0 ];
  #$extra_lambdas_for_model -> {"I"} = 1; #Confusion network posterior
}

# Check validity of input parameters and set defaults if needed

if (!defined $SCRIPTS_ROOTDIR) {
  $SCRIPTS_ROOTDIR = $ENV{"SCRIPTS_ROOTDIR"};
  die "Please set SCRIPTS_ROOTDIR or specify --rootdir" if !defined $SCRIPTS_ROOTDIR;
}
else {
  $ENV{"SCRIPTS_ROOTDIR"} = $SCRIPTS_ROOTDIR;
}

print STDERR "Using SCRIPTS_ROOTDIR: $SCRIPTS_ROOTDIR\n";

# path of script for filtering phrase tables and running the decoder
$filtercmd = "$SCRIPTS_ROOTDIR/training/filter-model-given-input.pl" if !defined $filtercmd;

$qsubwrapper = "$SCRIPTS_ROOTDIR/generic/qsub-wrapper.pl" if !defined $qsubwrapper;

$moses_parallel_cmd = "$SCRIPTS_ROOTDIR/generic/moses-parallel.pl"
  if !defined $moses_parallel_cmd;

$cmertdir = "$SCRIPTS_ROOTDIR/training/cmert-0.5" if !defined $cmertdir;
my $cmertcmd = "$cmertdir/enhanced-mert";

$SCORENBESTCMD = "$cmertdir/score-nbest.py" if ! defined $SCORENBESTCMD;

$pythonpath = "$cmertdir/python" if !defined $pythonpath;

$ENV{PYTHONPATH} = $pythonpath; # other scripts need to know

my ($just_cmd_filtercmd,$x) = split(/ /,$filtercmd);
die "Not executable: $just_cmd_filtercmd" if ! -x $just_cmd_filtercmd;
die "Not executable: $cmertcmd" if ! -x $cmertcmd;
die "Not executable: $moses_parallel_cmd" if defined $___JOBS && ! -x $moses_parallel_cmd;
die "Not executable: $qsubwrapper" if defined $___JOBS && ! -x $qsubwrapper;
die "Not a dir: $pythonpath" if ! -d $pythonpath;
die "Not executable: $___DECODER" if ! -x $___DECODER;

if (defined $obo_scorenbest) {
  die "Not executable: $obo_scorenbest" if ! -x $obo_scorenbest;
  die "Ondrej's scorenbest supports only closest ref length"
    if $___AVERAGE;
}

if ($___ACTIVATE_FEATURES) { $cmertcmd .= " -activate \"$___ACTIVATE_FEATURES\""; }

my $input_abs = ensure_full_path($___DEV_F);
die "File not found: $___DEV_F (interpreted as $input_abs)."
  if ! -e $input_abs;
$___DEV_F = $input_abs;

# Option to pass to qsubwrapper and moses-parallel
my $pass_old_sge = $old_sge ? "-old-sge" : "";

my $decoder_abs = ensure_full_path($___DECODER);
die "File not found: $___DECODER (interpreted as $decoder_abs)."
  if ! -x $decoder_abs;
$___DECODER = $decoder_abs;

my $ref_abs = ensure_full_path($___DEV_E);
# check if English dev set (reference translations) exist and store a list of all references
my @references;
if (-e $ref_abs) {
    push @references, $ref_abs;
}
else {
    # if multiple files, get a full list of the files
    my $part = 0;
    while (-e $ref_abs.$part) {
        push @references, $ref_abs.$part;
        $part++;
    }
    die("Reference translations not found: $___DEV_E (interpreted as $ref_abs)") unless $part;
}

my $config_abs = ensure_full_path($___CONFIG);
die "File not found: $___CONFIG (interpreted as $config_abs)."
  if ! -e $config_abs;
$___CONFIG = $config_abs;

# check validity of moses.ini and collect number of models and lambdas per model
# need to make a copy of $extra_lambdas_for_model, scan_config spoils it
#my %copy_of_extra_lambdas_for_model = %$extra_lambdas_for_model;
my %used_triples = %{$default_triples};
my ($models_used) = scan_config($___CONFIG);

# Parse the lambda config string and convert it to a nice structure in the same format as $used_triples
if (defined $___LAMBDA) {
  my %specified_triples;
  # interpreting lambdas from command line
  foreach (split(/\s+/,$___LAMBDA)) {
      my ($name,$values) = split(/:/);
      die "Malformed setting: '$_', expected name:values\n" if !defined $name || !defined $values;
      foreach my $startminmax (split/;/,$values) {
          if ($startminmax =~ /^(-?[\.\d]+),(-?[\.\d]+)-(-?[\.\d]+)$/) {
              my $start = $1;
              my $min = $2;
              my $max = $3;
              push @{$specified_triples{$name}}, [$start, $min, $max];
          }
          else {
              die "Malformed feature range definition: $name => $startminmax\n";
          }
      }
  }
  # sanity checks for specified lambda triples
  foreach my $name (keys %used_triples) {
      die "No lambdas specified for '$name', but ".($#{$used_triples{$name}}+1)." needed.\n"
          unless defined($specified_triples{$name});
      die "Number of lambdas specified for '$name' (".($#{$specified_triples{$name}}+1).") does not match number needed (".($#{$used_triples{$name}}+1).")\n"
          if (($#{$used_triples{$name}}) != ($#{$specified_triples{$name}}));
  }
  foreach my $name (keys %specified_triples) {
      die "Lambdas specified for '$name' ".(@{$specified_triples{$name}}).", but none needed.\n"
          unless defined($used_triples{$name});
  }
  %used_triples = %specified_triples;
}

# moses should use our config
if ($___DECODER_FLAGS =~ /(^|\s)-(config|f) /
|| $___DECODER_FLAGS =~ /(^|\s)-(ttable-file|t) /
|| $___DECODER_FLAGS =~ /(^|\s)-(distortion-file) /
|| $___DECODER_FLAGS =~ /(^|\s)-(generation-file) /
|| $___DECODER_FLAGS =~ /(^|\s)-(lmodel-file) /
) {
  die "It is forbidden to supply any of -config, -ttable-file, -distortion-file, -generation-file or -lmodel-file in the --decoder-flags.\nPlease use only the --config option to give the config file that lists all the supplementary files.";