	vectors with copies of data).
	* svm_base.c (svm_options[]): Alphabetized to improve readability.
	(svm_parse_opt): Re-ordered to mostly alphabetical to improve readability.
	(get_top_n): fixed a bug that popped up in obscure places & switched to a more intelligent algorithm (don't know why it was dumb in the first place).
	(svm_remove_bound_examples): changed the removal code around (again) as part of the new svm model. The fn now removes either bound or misclassified documents & is called by solve_svm (the innermost svm fn. that calls a solver).
	(svm_trans_or_chunk): removed chunk_svm for this. Calls either transduce_svm or solve_svm depending on the parameters/data.
	(svm_tlf): Top-Level-Fn. Permutes data & outputs a hyperplane in bow_wv if possible. This fn also chooses/sets up the proper fn (al, trans, removal, etc) to call.
	* svm_loqo.c: Updated to work with cvect instead of svm_C. Now all upper bounds come from the cvect parameter, which MUST be properly initialized (this is necessary for transduction & possibly other things).
	* svm_smo.c (opt_pair): fixed a blatant bug in the solver (the examples were added to the I0 set in cases where they shouldn't have been [see Keerthi et al. for exactly where the examples should be added if they weren't already present]). Now the upper bounds come from the cvect instead of svm_C. The algorithm is almost identical. The only difference is a little more attention to the exact upper bounds on each of the boxes.
	* svm_trans.c: stable version. Has new interface with the svm model. No known bugs. The code does have some gross inefficiencies (always zeroing out temporary values & weights, causing the solvers to restart each time), but all of the output examined has been correct.
	* bow/svm.h: Updated for a new svm interface. The relationship between the different solvers is much cleaner now that redundant code has been mostly eliminated.
	Note - the prototypes for most functions have changed, as the structure of most of the higher-level svm code has changed.

1999-12-01  Kamal Nigam  <knigam@zeno.jprc.com>

	* .cvsignore: added rainbow-be to ignore list.
	* .cvsignore: added rainbow-rank to ignore list.
	* em.c (bow_cdoc_is_train_or_unlabeled): moved to split.c.
	(bow_em_new_vpc_with_weights): removed usage of halt_using_perplexity. This option is broken, and its code was hurting performance.
	* bow/libbow.h (bow_cdoc_is_train_or_unlabeled): New prototype.
	* split.c (bow_files_source_type): added code for the bow_files_source_fraction_train and bow_files_source_number_train. These are indicated by a trailing 't', which converts some number of training documents. For example, --unlabeled-set=500t takes 500 training docs and converts them to unlabeled docs.
	(bow_split_options): Likewise.
	(bow_split_parse_opt): Likewise.
	(bow_set_doc_types_randomly_by_fraction_remaining): Likewise.
	(bow_set_doc_types): Likewise.
	(bow_set_doc_types_randomly_by_count_per_class): Added argument source_tag. To get previous behavior, call this with source_tag equal to bow_doc_untagged. Used for the new options bow_files_source_number_train and bow_files_source_fraction_train.
	(bow_set_doc_types_randomly_by_count): Likewise.
	(bow_set_doc_types_randomly_by_fraction): Likewise.
	(bow_cdoc_is_train_or_unlabeled): New function.
	* maxent.c (maxent_options): Added code for new options --maxent-iteration-docs and --maxent-constraint-docs.
	(maxent_parse_opt): Likewise.
	(bow_maxent_new_vpc_with_weights_doc_then_word): Likewise.
	(bow_maxent_new_vpc_with_weights): Likewise.

1999-11-10  Andrew McCallum  <mccallum@justresearch.com>

	* svm_base.c (sqrtf): New macro, necessary on some non-Linux machines. Bug reported by Chuck Rosenberg.

1999-11-08  Andrew McCallum  <mccallum@justresearch.com>

	* readme.texi: Add simple usage examples for arrow.
	* arrow.c (arrow_serve2): Implement the 'query' command. Change XML labels from "archer" to "arrow".
	(main): Change default number of hits on a query from 1 to 10.
	* libbow-desc.texi: Update descriptions.
	* svm_base.c: Make many printf's conditional on the bow_verbosity_level, so that by default rainbow-stats will still work.
	* array.c (cdocs_iterator_count_for_doc): Replace NAN macro with arithmetic equivalent.
	* barrel.c (barrel_iterator_count_for_doc): Likewise.
	* wv.c (bow_wv_weight_sum): New function.
	* bow/libbow.h: Declare new function.
	* train_dirichlet.c (moment_match_mccallum): Separate implementation of moment matching that determines the variance by averaging the variance of all dimensions.
	(train_dirichlet_mom_sparse): New function.
	* bow/train_dirichlet.h: Declare new function.
	* tfidf.c (TFIDF_METHOD): Use bow_wv_set_weights_to_count_times_idf() instead of bow_wv_set_weights_to_count(), as is correct for TFIDF. This was previously corrected in the scoring function.
	(bow_tfidf_params_tfidf): Change parameter settings for "tfidf" method. Previously it was identical to the "tfidf_log_words" method, now it is identical to the "tfidf_log_occur" method. In other words, previously it calculated IDF using the number of times the word occurred in the training data; now it uses the number of training documents in which the word occurs.
	* split.c (bow_split_options): Remove documentation for 'r' suffix. It's confusing and shouldn't be used unawares.
	(bow_split_parse_opt): Add a 'pcr' suffix, but it's not implemented yet.
	(bow_set_doc_types_randomly_by_count_per_class): Count the number of untagged documents in each class, and if this function is trying to tag more than are available, simply have this function tag fewer.
	* rainbow.c (bow_print_log_odds_ratio): Handle words that are not in the vocabulary.
	* ddf.c: Implement ddfmm classification method. This method fits the Dirichlet by moment matching only.
	* arrow.c (arrow_serve2): New function. Now call this instead of arrow_serve. It provides output in XML, like archer does.
	Only the rank command is implemented.

1999-11-02  Andrew McCallum  <mccallum@justresearch.com>

	* int4str.c (bow_int2str): Assert that INDEX argument is non-negative.

1999-10-28  Gregory C Schohn  <gcs@cmu.edu>

	* svm_base.c (svm_vpc_merge): fixed bug for svml-basename - all the docs still need to be output, so that the other data (like word weights) can be properly extracted.

1999-10-28  Andrew McCallum  <mccallum@justresearch.com>

	* cdmem.c (cdmem_options): New command-line option "cdmem-dist-data".
	(cdmem_parse_opt): Handle it.
	(bow_cdmem_new_vpc_with_weights): Let the command-line option determine what documents are used to learn the distance metric.
	* README-SVM (Outputing data): Added new section describing how to produce files ready for input into SVM^light.

1999-10-27  Gregory C Schohn  <gcs@cmu.edu>

	* svm_base.c (svm_vpc_merge): fixed svml bugs.
	* svm_base.c: fixed outdated documentation for parse info.
	* svm_smo.c (smo): fixed a parse error.

1999-10-26  Gregory C Schohn  <gcs@cmu.edu>

	* rainbow.c (rainbow_test): added a line for svms. When svmlight output is being generated, rainbow_test prints the label (only works for binary barrels) so that svm_score can append the data for that example.
	* svm_base.c (svm_options[]): removed some of the single character switches. Added arguments for tsvms & added svml-basename arg.
	(svm_permute_data, svm_unpermute_data): added.
	(infogain): should have made infogain compatible with sets with unlabeled data (it ignores those docs with y = 0).
	(svm_vpc_merge): added support for using unlabeled docs for transduction. Also added code to spit out svmlight friendly files.
	(svm_score): added code to write svmlight files.
	* svm_trans.c: initial version - pretty much empty now.
	* bow/svm.h: added svm_*permute_data declarations & the transduce_svm declaration.
	* svm_al.c (al_svm_test_wrapper): replaced permutation code with calls to svm_permute_data & svm_unpermute_data.
	* svm_smo.c (smo): removed srandom(1) - was only there for debugging.
	* README-SVM (Bugs): removed section about smo being broken (was fixed).
	* Makefile.in: added svm_trans.c (transductive svms) to the svm_files.

1999-10-25  Andrew McCallum  <mccallum@justresearch.com>

	* .cvsignore: Add automatically-generated archer files, and a few others.

1999-10-21  Andrew McCallum  <mccallum@justresearch.com>

	* barrel.c (bow_barrel_keep_top_words_by_infogain): Don't set the NUM_WORDS_TO_KEEP to be the WI2IG_SIZE (which is the total number of words). Set it to the MIN of this and the original NUM_WORDS_TO_KEEP. Before this fix, no words were ever getting removed. What a bug! I wonder how long this has been in there? Reported by Carsten Lanquillon <lanqui@cs.cmu.edu>.

1999-10-20  Andrew McCallum  <mccallum@justresearch.com>

	* ddf.c (bow_ddf_dirichlet_from_doc_word_counts): Only print the diagnostics for 10 sampled words, not 50.
	* bpe.c (bow_bpe_set_cdoc_word_count_from_wi2dvf_weights): Print the alphas for only 10 sampled words instead of 20.

1999-10-19  Andrew McCallum  <mccallum@justresearch.com>

	* svm_base.c: Check verbosity level before printing to stdout. Only print if above bow_progress.

1999-10-19  Gregory C Schohn  <gcs@cmu.edu>

	* svm_base.c (svm_score): removed cnt variable (useless) & fixed a typo-bug (sub_model[i] -> barrel).
	* svm_smo.c (smo): changed the printf that reports where opt_pair failed to an fprintf.

1999-10-19  Gregory C Schohn  <gcs@justresearch.com>

	* Makefile.local (DIST_ALL_FILES): added -DGCSJPRC (turns on local pedantic debugging) to DEFS.
	* Makefile.in (ALL_CPPFLAGS): added -Ibow (so that pr_loqo.h is found by pr_loqo.c even though they aren't in the same directory [since we can't change pr_loqo.*]).
	(DEFS): Changed from _DEFS & now using += instead of the temporary.
	* svm_base.c: the epsilon_crit is now /2 for SMO (since the actual eps is 2x the variable). fixed some printfs.
	* svm_loqo.c (build_svm_guts): added code to remember previous KKT epsilon (even though nobody sets the initial value to anything different than the macro).
	(build_svm_guts): added local define (GCSJPRC) for debugging stuff which includes stopping the proc & sending mail.
	* svm_smo.c: commented #DEBUG. added kcache_ages to appropriate spots across the file. removed some print statements that weren't too useful anymore.
	(opt_pair): changed an optimality check - used to use (a2+ao2)*eps to determine if something moved far enough, now just using eps_a (may not be right, but it's more correct than before) - we need it to prevent inf. looping.
	(opt_pair): Removed some unreachable code in if statements.
	(opt_pair): Fixed calculations of bup & blow - they were backwards.
	(smo): the threshold b is now (bup+blow)/2 instead of blow (which is at most epsilon_crit different).

1999-10-16  Gregory C Schohn  <gcs@justresearch.com>

	* svm_base.c: Added #ifdef HAVE_LOQO around calls to build_svm_guts.
	* svm_al.c: Added #ifdef HAVE_LOQO around calls to build_svm_guts.
	* Makefile.in: Re-enabled svm code. Made the pr_loqo checks look for ./bow/pr_loqo.h.

1999-10-16  Andrew McCallum  <mccallum@justresearch.com>

	* README-SVM (Obtaining sources): File renamed from README_SVM. Clarify directions for where to put pr_loqo.h.

1999-10-15  Andrew McCallum  <mccallum@justresearch.com>

	* Version (BOW_MINOR_VERSION): Changed from 9 to 95.
	* bow/libbow.h (BOW_MINOR_VERSION): Changed from 9 to 95. Bug fixes for distribution.
	* .cvsignore: Added rainbow-rank and rainbow-ts.
	* Makefile.in: Temporarily disable SVM from rainbow.
	(ARCHER_GENERATED_C_FILES): New variable. Remove these files from those distributed, because they should be generated.
	(ARCHER_DIST_FILES): Added archer.c and archer_query.c.
	* Makefile.in (DEMO_EXECUTABLES): New variable.
	(ARCHER_DIST_FILES): Added dirichlet.c.
	(DIST_FILES): Added archer.el.
	* multiclass.c: Comment out unused variables. Odd assortment of clean-ups.
	* bow/libbow.h (bow_random_reset_seed): Declare function.
	* train_dirichlet.c (MOMENT_MATCH_ONLY): New macro.
	(SPARSE): Change macro value from 0 to 1.
	This only affects running train_dirichlet's main() directly.
	(main): comment out the printing of the gammaln() tests. New local variable COUNTS_SIZE, increased from 100 to 10000. Print more diagnostics at the end.
	* readme.texi: Update for new front-ends and fix command-line options so they work.
	* rainbow.c (rainbow_options): Clean up wording in several places.
	(rainbow_query): Change behavior of repeated queries.
	(bow_print_log_odds_ratio): Add a new FILE* argument. All callers changed.
	* nbshrinkage.c: Allow different lambda hierarchical mixture weights for different classes.
	* mix.c (mix_options): New command-line option for setting the number of EM iterations.
	(mix_new_vpc): Don't allow initial random class_probs to be zero.
	* libbow-desc.texi: Update for new front-ends and MSWin.
	* lex-gram.c (bow_lexer_gram_open_text_fp): Properly save the return value of bow_realloc(). This fixes a nasty crash.
	* emsimple.c (bow_emsimple_new_vpc_with_weights): Print diagnostics using odds_ratio.
	* dirichlet.c (main): New command-line argument -I. Handle it.
	* dice.c (print_usage): Expand help statement.
	* ddf.c (ddf_force_large_alphas): New variable.
	(bow_ddf_dirichlet_from_doc_word_counts): Handle it.
	(ddfla): New method.
	* cdmm.c (CDMM_PRINT_ALPHAS_KEY): Change value to not conflict with the cdm method.
	* bpe.c (bpe_prior_alpha): Change default prior "ghost count" from 1 to 0.
	(bow_bpe_set_cdoc_word_count_from_wi2dvf_weights): Make the verbosity work even when the vocabulary size is less than 20.
	(bow_bpe_score): Print more information when BOW_PRINT_WORD_SCORES. Print more digits of precision of BOW_PRINT_WORD_SCORES for individual words.
	* Makefile.local (RAINBOW_METHOD_C_FILES): Move some of these to the Makefile.in.
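
Illustrative note on the lex-gram.c entry above (saving the return value of bow_realloc()): the class of bug it describes can be sketched with the standard realloc(). This is not the bow code. The struct text_buf type and the text_buf_append function are hypothetical names invented for the example, and bow_realloc() is only assumed to behave like realloc() in that it may return a different address.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable text buffer; not the actual bow lexer structure. */
struct text_buf
{
  char *data;       /* allocated storage, or NULL when empty */
  size_t len;       /* bytes in use */
  size_t capacity;  /* bytes allocated */
};

/* Append N bytes from SRC, growing the buffer if needed.  Returns 0 on
   success, -1 if memory could not be obtained (buffer left unchanged). */
int
text_buf_append (struct text_buf *buf, const char *src, size_t n)
{
  if (buf->len + n > buf->capacity)
    {
      size_t new_capacity = buf->capacity ? buf->capacity : 16;
      char *new_data;

      while (new_capacity < buf->len + n)
        new_capacity *= 2;

      /* realloc() may move the block, so its return value must be
         stored.  Continuing to use the old pointer after a move is the
         kind of mistake that tends to crash later, far from the call. */
      new_data = realloc (buf->data, new_capacity);
      if (new_data == NULL)
        return -1;
      buf->data = new_data;
      buf->capacity = new_capacity;
    }
  memcpy (buf->data + buf->len, src, n);
  buf->len += n;
  return 0;
}
```

A caller would start from a zeroed struct text_buf, append as needed, and free the data member when finished. Keeping the result in a temporary until it is known to be non-NULL also avoids leaking the old block when the allocation fails.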