mert-moses.pl.svn-base

From "The decoder is the core module of a phrase-based statistical machine translation system" · SVN-BASE code · 1,024 lines in total · page 1 of 3
#!/usr/bin/perl -w

# Usage:
# mert-moses.pl <foreign> <english> <decoder-executable> <decoder-config>
# For other options see below or run 'mert-moses.pl --help'

# Notes:
# <foreign> and <english> should be raw text files, one sentence per line
# <english> can be a prefix, in which case the files <english>0, <english>1, etc. are used

# Revision history

# 11 Oct 2006 Handle different input types through parameter --inputtype=[0|1]
#             (0 for text, 1 for confusion network, default is 0) (Nicola Bertoldi)
# 10 Oct 2006 Allow skip of filtering of phrase tables (--no-filter-phrase-table)
#             useful if binary phrase tables are used (Nicola Bertoldi)
# 28 Aug 2006 Use either closest or average or shortest (default) reference
#             length as effective reference length
#             Use either normalization or not (default) of texts (Nicola Bertoldi)
# 31 Jul 2006 move gzip run*.out to avoid failure with restarts
#             adding default paths
# 29 Jul 2006 run-filter, score-nbest and mert run on the queue (Nicola; Ondrej had to type it in again)
# 28 Jul 2006 attempt at foolproof usage, strong checking of input validity, merged the parallel and nonparallel version (Ondrej Bojar)
# 27 Jul 2006 adding the safesystem() function to handle process failure
# 22 Jul 2006 fixed a bug about handling relative path of configuration file (Nicola Bertoldi)
# 21 Jul 2006 adapted for Moses-in-parallel (Nicola Bertoldi)
# 18 Jul 2006 adapted for Moses and cleaned up (PK)
# 21 Jan 2005 unified various versions, thorough cleanup (DWC)
#             now indexing accumulated n-best list solely by feature vectors
# 14 Dec 2004 reimplemented find_threshold_points in C (NMD)
# 25 Oct 2004 Use either average or shortest (default) reference
#             length as effective reference length (DWC)
# 13 Oct 2004 Use alternative decoders (DWC)
# Original version by Philipp Koehn

# defaults for initial values and ranges are:
my $default_triples = {
  # for each _d_istortion, _l_anguage _m_odel, _t_ranslation _m_odel and _w_ord penalty, there is a list
  # of [ default value, lower bound, upper bound ]-triples. In most cases, only one triple is used,
  # but the translation model has currently 5 features
  "d" => [ [ 1.0, 0.0, 2.0 ] ],
  "lm" => [ [ 1.0, 0.0, 2.0 ] ],
  "tm" => [
            [ 0.3, 0.0, 0.5 ],
            [ 0.2, 0.0, 0.5 ],
            [ 0.3, 0.0, 0.5 ],
            [ 0.2, 0.0, 0.5 ],
            [ 0.0, -1.0, 1.0 ],
          ],
  "g" => [
           [ 1.0, 0.0, 2.0 ],
           [ 1.0, 0.0, 2.0 ],
         ],
  "w" => [ [ 0.0, -1.0, 1.0 ] ],
};

# moses.ini file uses FULL names for lambdas, while this training script internally (and on the command line)
# uses ABBR names.
my $ABBR_FULL_MAP = "d=weight-d lm=weight-l tm=weight-t w=weight-w g=weight-generation";
my %ABBR2FULL = map {split/=/,$_,2} split /\s+/, $ABBR_FULL_MAP;
my %FULL2ABBR = map {my ($a, $b) = split/=/,$_,2; ($b, $a);} split /\s+/, $ABBR_FULL_MAP;

# We parse moses.ini to figure out how many weights we need to optimize.
# For this, we must know the correspondence between options defining files
# for models and options assigning weights to these models.
my $TABLECONFIG_ABBR_MAP = "ttable-file=tm lmodel-file=lm distortion-file=d generation-file=g";
my %TABLECONFIG2ABBR = map {split(/=/,$_,2)} split /\s+/, $TABLECONFIG_ABBR_MAP;

# There are weights that do not correspond to any input file, they just increase the total number of lambdas we optimize
my $extra_lambdas_for_model = {
  "w" => 1,  # word penalty
  "d" => 1,  # basic distortion
};

my $minimum_required_change_in_weights = 0.00001;
    # stop if no lambda changes more than this

my $verbose = 0;
my $usage = 0; # request for --help
my $___WORKING_DIR = "mert-work";
my $___DEV_F = undef; # required, input text to decode
my $___DEV_E = undef; # required, basename of files with references
my $___DECODER = undef; # required, pathname to the decoder executable
my $___CONFIG = undef; # required, pathname to startup ini file
my $___N_BEST_LIST_SIZE = 100;
my $queue_flags = "-l mem_free=0.5G -hard";  # extra parameters for parallelizer
      # the -l ws0ssmt is relevant only to JHU workshop
my $___JOBS = undef; # if parallel, number of jobs to use (undef -> serial)
my $___DECODER_FLAGS = ""; # additional parameters to pass to the decoder
my $___LAMBDA = undef; # string specifying the seed weights and boundaries of all lambdas
my $continue = 0; # should we try to continue from the last saved step?
my $skip_decoder = 0; # and should we skip the first decoder run (assuming we got interrupted during mert)
my $___FILTER_PHRASE_TABLE = 1; # filter phrase table

# Parameter for effective reference length when computing BLEU score
# This is used by score-nbest-bleu.py
# Default is to use shortest reference
# Use "--average" to use average reference length
# Use "--closest" to use closest reference length
# Only one of --average and --closest can be set
# If both are set, --average is used
my $___AVERAGE = 0;
my $___CLOSEST = 0;

# Use "--nonorm" to not normalize translations before computing BLEU
my $___NONORM = 0;

# set 0 if input type is text, set 1 if input type is confusion network
my $___INPUTTYPE = 0;

my $allow_unknown_lambdas = 0;
my $allow_skipping_lambdas = 0;

my $SCRIPTS_ROOTDIR = undef; # path to all tools (overridden by specific options)
my $cmertdir = undef; # path to cmert directory
my $pythonpath = undef; # path to python libraries needed by cmert
my $filtercmd = undef; # path to filter-model-given-input.pl
my $SCORENBESTCMD = undef;
my $qsubwrapper = undef;
my $moses_parallel_cmd = undef;
my $old_sge = 0; # assume sge<6.0

use strict;
use Getopt::Long;
GetOptions(
  "working-dir=s" => \$___WORKING_DIR,
  "input=s" => \$___DEV_F,
  "inputtype=i" => \$___INPUTTYPE,
  "refs=s" => \$___DEV_E,
  "decoder=s" => \$___DECODER,
  "config=s" => \$___CONFIG,
  "nbest=i" => \$___N_BEST_LIST_SIZE,
  "queue-flags=s" => \$queue_flags,
  "jobs=i" => \$___JOBS,
  "decoder-flags=s" => \$___DECODER_FLAGS,
  "lambdas=s" => \$___LAMBDA,
  "continue" => \$continue,
  "skip-decoder" => \$skip_decoder,
  "average" => \$___AVERAGE,
  "closest" => \$___CLOSEST,
  "nonorm" => \$___NONORM,
  "help" => \$usage,
  "allow-unknown-lambdas" => \$allow_unknown_lambdas,
  "allow-skipping-lambdas" => \$allow_skipping_lambdas,
  "verbose" => \$verbose,
  "rootdir=s" => \$SCRIPTS_ROOTDIR,
  "cmertdir=s" => \$cmertdir,
  "pythonpath=s" => \$pythonpath,
  "filtercmd=s" => \$filtercmd, # allow to override the default location
  "scorenbestcmd=s" => \$SCORENBESTCMD, # path to score-nbest.py
  "qsubwrapper=s" => \$qsubwrapper, # allow to override the default location
  "mosesparallelcmd=s" => \$moses_parallel_cmd, # allow to override the default location
  "old-sge" => \$old_sge, # passed to moses-parallel
  "filter-phrase-table!" => \$___FILTER_PHRASE_TABLE, # allow (disallow) filtering of phrase tables
) or exit(1);

# the 4 required parameters can be supplied on the command line directly
# or using the --options
if (scalar @ARGV == 4) {
  # required parameters: input_file references_basename decoder_executable
  $___DEV_F = shift;
  $___DEV_E = shift;
  $___DECODER = shift;
  $___CONFIG = shift;
}

print STDERR "After default: $queue_flags\n";

if ($usage || !defined $___DEV_F || !defined $___DEV_E || !defined $___DECODER || !defined $___CONFIG) {
  print STDERR "usage: mert-moses.pl input-text references decoder-executable decoder.ini
Options:
  --working-dir=mert-dir ... where all the files are created
  --nbest=100 ... how big nbestlist to generate
  --jobs=N  ... set this to anything to run moses in parallel
  --mosesparallelcmd=STRING ... use a different script instead of moses-parallel
  --queue-flags=STRING  ... anything you wish to pass to
              qsub, e.g. '-l ws06osssmt=true'
              The default is
                -l mem_free=0.5G -hard
              To reset the parameters, please use \"--queue-flags=' '\" (i.e. a space between
              the quotes).
  --decoder-flags=STRING ... extra parameters for the decoder
  --lambdas=STRING  ... default values and ranges for lambdas, a complex string
         such as 'd:1,0.5-1.5 lm:1,0.5-1.5 tm:0.3,0.25-0.75;0.2,0.25-0.75;0.2,0.25-0.75;0.3,0.25-0.75;0,-0.5-0.5 w:0,-0.5-0.5'
  --allow-unknown-lambdas ... keep going even if someone supplies a new lambda
         in the lambdas option (such as 'superbmodel:1,0-1'); optimize it, too
  --continue  ... continue from the last achieved state
  --skip-decoder ... skip the decoder run for the first time, assuming that
                     we got interrupted during optimization
  --average ... Use either average or shortest (default) reference
                  length as effective reference length
  --closest ... Use either closest or shortest (default) reference
                  length as effective reference length
  --nonorm ... Do not use text normalization
  --filtercmd=STRING  ... path to filter-model-given-input.pl
  --rootdir=STRING  ... where do helpers reside (if not given explicitly)
  --cmertdir=STRING ... where is cmert installed
  --pythonpath=STRING  ... where are the python libraries needed by cmert
  --scorenbestcmd=STRING  ... path to score-nbest.py
  --old-sge ... passed to moses-parallel, assume Sun Grid Engine < 6.0
  --inputtype=[0|1] ... Handle different input types (0 for text, 1 for confusion network, default is 0)
  --no-filter-phrase-table ... disallow filtering of phrase tables
                              (useful if binary phrase tables are available)
";
  exit 1;
}

# update variables if input is confusion network
if ($___INPUTTYPE == 1) {
  $ABBR_FULL_MAP = "$ABBR_FULL_MAP I=weight-i";
  %ABBR2FULL = map {split/=/,$_,2} split /\s+/, $ABBR_FULL_MAP;
  %FULL2ABBR = map {my ($a, $b) = split/=/,$_,2; ($b, $a);} split /\s+/, $ABBR_FULL_MAP;

  push @{$default_triples -> {"I"}}, [ 1.0, 0.0, 2.0 ];
  $extra_lambdas_for_model -> {"I"} = 1; # Confusion network posterior
}

# Check validity of input parameters and set defaults if needed

if (!defined $SCRIPTS_ROOTDIR) {
  $SCRIPTS_ROOTDIR = $ENV{"SCRIPTS_ROOTDIR"};
  die "Please set SCRIPTS_ROOTDIR or specify --rootdir" if !defined $SCRIPTS_ROOTDIR;
}

print STDERR "Using SCRIPTS_ROOTDIR: $SCRIPTS_ROOTDIR\n";

# path of script for filtering phrase tables and running the decoder
$filtercmd = "$SCRIPTS_ROOTDIR/training/filter-model-given-input.pl" if !defined $filtercmd;

$qsubwrapper = "$SCRIPTS_ROOTDIR/generic/qsub-wrapper.pl" if !defined $qsubwrapper;

$moses_parallel_cmd = "$SCRIPTS_ROOTDIR/generic/moses-parallel.pl"
  if !defined $moses_parallel_cmd;

$cmertdir = "$SCRIPTS_ROOTDIR/training/cmert-0.5" if !defined $cmertdir;
my $cmertcmd = "$cmertdir/mert";

$SCORENBESTCMD = "$cmertdir/score-nbest.py" if !defined $SCORENBESTCMD;

$pythonpath = "$cmertdir/python" if !defined $pythonpath;

$ENV{PYTHONPATH} = $pythonpath; # other scripts need to know

die "Not executable: $filtercmd" if ! -x $filtercmd;
die "Not executable: $cmertcmd" if ! -x $cmertcmd;
die "Not executable: $moses_parallel_cmd" if defined $___JOBS && ! -x $moses_parallel_cmd;
die "Not executable: $qsubwrapper" if defined $___JOBS && ! -x $qsubwrapper;
die "Not a dir: $pythonpath" if ! -d $pythonpath;
die "Not executable: $___DECODER" if ! -x $___DECODER;

my $input_abs = ensure_full_path($___DEV_F);
die "File not found: $___DEV_F (interpreted as $input_abs)."
  if ! -e $input_abs;
$___DEV_F = $input_abs;

# Option to pass to qsubwrapper and moses-parallel
my $pass_old_sge = $old_sge ? "-old-sge" : "";

my $decoder_abs = ensure_full_path($___DECODER);
die "File not found: $___DECODER (interpreted as $decoder_abs)."
  if ! -x $decoder_abs;
$___DECODER = $decoder_abs;

my $ref_abs = ensure_full_path($___DEV_E);
# check if English dev set (reference translations) exist and store a list of all references
my @references;
if (-e $ref_abs) {
  push @references, $ref_abs;
}
else {
  # if there are multiple files, get a full list of them
  my $part = 0;
  while (-e $ref_abs.$part) {
    push @references, $ref_abs.$part;
    $part++;
  }
  die("Reference translations not found: $___DEV_E (interpreted as $ref_abs)") unless $part;
}

my $config_abs = ensure_full_path($___CONFIG);
die "File not found: $___CONFIG (interpreted as $config_abs)."
  if ! -e $config_abs;
$___CONFIG = $config_abs;

# check validity of moses.ini and collect number of models and lambdas per model
# need to make a copy of $extra_lambdas_for_model, scan_config spoils it
my %copy_of_extra_lambdas_for_model = %$extra_lambdas_for_model;
my ($lambdas_per_model, $models_used) = scan_config($___CONFIG, \%copy_of_extra_lambdas_for_model);

# Parse the lambda config string and convert it to a nice structure in the same format as $default_triples
my $use_triples = undef;
if (defined $___LAMBDA) {
  # interpreting lambdas from command line
  foreach (split(/\s+/,$___LAMBDA)) {
    my ($name,$values) = split(/:/);
    die "Malformed setting: '$_', expected name:values\n" if !defined $name || !defined $values;
    foreach my $startminmax (split/;/,$values) {
      if ($startminmax =~ /^(-?[\.\d]+),(-?[\.\d]+)-(-?[\.\d]+)$/) {
        my $start = $1;
        my $min = $2;
        my $max = $3;
        push @{$use_triples->{$name}}, [$start, $min, $max];
      }
      else {
        die "Malformed feature range definition: $name => $startminmax\n";
      }
    }
  }
} else {
  # no lambdas supplied, use the default ones, but do not forget to repeat them accordingly
  # first for our inherent models
  foreach my $name (keys %$extra_lambdas_for_model) {
    foreach (1..$extra_lambdas_for_model->{$name}) {
      die "No default weights defined for -$name"
        if !defined $default_triples->{$name};
      # XXX here was a deadly bug: we need a deep copy of the default values
      my @copy = ();
      foreach my $triple (@{$default_triples->{$name}}) {
        my @copy_triple = @$triple;
        push @copy, [ @copy_triple ];
      }
      push @{$use_triples->{$name}}, @copy;
    }
  }
  # and then for all models used
  foreach my $name (keys %$models_used) {
    foreach (1..$models_used->{$name}) {