📄 changelog

📁 Code from the book by Tom Mitchell, author of Machine Learning
2002-02-13  Andrew McCallum  <mccallum@slide.whizbang.com>

	* opts.c (parse_bow_opt): Make it still work if $HOME isn't
	defined (as in a CGI).
	(Patch from John McCall <rjmccall@andrew.cmu.edu>.)

	* svm_base.c (sqrtf): #define as sqrt if we don't HAVE_SQRTF.
	(Suggestion from Alberto Lavelli <lavelli@itc.it>.)

	* lex-html.c: Added the ability to deal with html entities of the
	form &ENTITY; or &#DIGITS;.  Patch from Arturo Perez
	<arturo@montecarlo.bigchalk.com>.

	* primelist.c: Fixed off-by-one error in list of primes.  This was
	occasionally causing an infinite loop.  (Reported by Mikhail
	Sogrine <mics@cnds.ucd.ie> and Arturo Perez
	<arturo@montecarlo.bigchalk.com>.)

	* int4str.c: Removed old, irrelevant comments.

2001-09-26  Andrew McCallum  <mccallum@whizbang.com>

	* wi2pv.c (bow_wi2pv_flush): Assert that there are no bytes in the
	PVM.

	* archer.c (archer_index_filename): Properly print the filename as
	part of the mmap error.  Include the vocabulary size in the
	progress message.

2001-03-17  Andrew McCallum  <mccallum@whizbang.com>

	* wi2pv.c (bow_wi2pv_new): Write something so that no PV can have
	a start offset of 0, so that 0 can be reserved in pv.c as a
	special value.

	* bow/libbow.h (_FILE_OFFSET_BITS): Define to be 64 here.  This
	fixed problem with bow_fwrite_off_t().

	* strtrie.c (bow_strtrie_present): If STR is not all lowercase,
	don't exit, just say it isn't in the trie.

	* pv.c: Use 0 instead of -1 to represent an unset seek start,
	because I'm not sure whether offset_t is signed.

	* int4str.c (_str2id): Make sure we do unsigned arithmetic.
	(_bow_str_hash_lookup): Comment out bad fix to looping hash lookup,
	and document the persistent problem in a comment below.

	* archer.c (archer_index_filename): Use mmap() to get the text
	files.

2001-02-04  Andrew McCallum  <mccallum@whizbang.com>

	* pv.c: Use off_t instead of int in appropriate places.
	(_FILE_OFFSET_BITS): New macro, defined to be 64.

	* archer.c, wi2pv.c (_FILE_OFFSET_BITS): New macro, defined to be
	64.

	* bow/archer.h: Use off_t instead of int in appropriate places.

	* bow/libbow.h (bow_fwrite_off_t): New function.
	(bow_fread_off_t): New function.

2001-01-21  Andrew McCallum  <mccallum@whizbang.com>

	* stoplist.c (stophash_init): New function.
	(bow_stoplist_present_hash): New function.

	* pv.c (bow_pv_write): Faster, architecture-dependent
	implementation based on single call to fwrite().
	(bow_pv_read): Likewise for reading.

	* int4word.c (bow_words_set_map): Allow MAP to be NULL, in which
	case this function simply initializes the WORD_MAP.
	(bow_words_write): If !ARCHIVE_COUNTS, don't write out the
	WORD_MAP_COUNTS.  Currently is archiving.
	(bow_words_read_from_fp): Likewise for reading.

	* int4str.c (_str2id): No need to initialize H twice.
	(_bow_str_hash_lookup): Make this a public interface.
	(_bow_str_hash_lookup2): New function.
	(_bow_str2int): New function that takes a pre-computed hash.  Use
	above function.
	(bow_str2int): New shell around above function.

	* archer.c (archer_index_filename): Add local lexing
	implementation, used only if USE_FAST_LEXER is #define'd to be 1
	(which is the default).  Print document count once every 200
	documents, not every 100.
	(archer_sort_hites): New function.
	(archer_query): Use it.

	* bow/libbow.h: Declare new functions.

2000-12-19  Andrew McCallum  <mccallum@whizbang.com>

	* lex-simple.c: Cast (char) to (unsigned char) before passing to
	isalpha() and other functions that take an int.  On Solaris, the
	high bit in these chars gets changed to a sign bit in the int, and
	isalpha() and islower(), etc don't work.

	* lex-simple.c (bow_lexer_simple_open_text_fp): Increase default
	document size.  Only scan FP for start string if it's non-empty.
	If there is no end pattern, read the contents more efficiently.
	(bow_lexer_simple_get_raw_word): New temporary version that is more
	efficient, but ignores many of the command-line arguments.
	(bow_lexer_simple_postprocess_word): Likewise.

2000-12-18  Andrew McCallum  <mccallum@whizbang.com>

	* bow/libbow.h: Declare strtrie functions.

	* int4str.c: Many efficiency cleanups, including better string
	hash function and more efficient code paths for str_hash_lookup().

	* stoplist.c: Use strtrie instead of hashtable.

	* archer.c, pv.c, wi2pv.c, bow/archer.h: Trimmed back to version
	as of 1999/05/20, and augmented with faster indexing.

	* Makefile.in (STANDARD_LIBBOW_C_FILES): Added primelist.c and
	strtrie.c.
	(ARCHER_H_FILES): Remove all bow/archer_query* files.
	(ARCHER_LEX_FILES, ARCHER_Y_FILES, ARCHER_GENERATED_C_FILES): Removed.
	(ARCHER_DIST_FILES): Trimmed.

	* Makefile.in (ARCHER_H_FILES): Remove redundant one of these
	definitions.

	* opts.c (_help_filter): Make sure BOW_METHODS is non-NULL before
	trying to use it.

	* primes.c: Formatting change.

	* Makefile.local: RAINBOW_METHOD_C_FILES: Added dirk.c.

	* Makefile.in (STANDARD_RAINBOW_METHOD_C_FILES): Removed dirk.c.

2000-12-07  Kamal Nigam  <knigam@whizbang.com>

	* bow/libbow.h (bow_wi2dvf_hide_all_wi): Added.
	(bow_smoothing): Added dirichlet smoothing.
	(bow_smoothing_dirichlet_filename): Added.
	(bow_smoothing_dirichlet_weight): Added.

	* bow/naivebayes.h (bow_naivebayes_dirichlet_alphas): Added to
	enable dirichlet smoothing.
	(bow_naivebayes_dirichlet_total): Likewise.
	(bow_naivebayes_load_dirichlet_alphas): Likewise.
	(bow_naivebayes_initialize_dirichlet_smoothing): Likewise.

	* bow/em.h (bow_em_calculating_perplexity): Added.

	* cotrain.c (cotrain_selection_type): Add random selection type.
	(cotrain_select_docs): Changed prototype.
	(cotrain_print_dependency_matrix): Variable for new option.
	(cotrain_vocab_split_file): Likewise.
	(cotrain_co_gem): Likewise.
	(cotrain_options): Added new options
	--cotrain-print-dependency-matrix,
	--cotrain-split-vocab-from-file, and --cotrain-co-gem.
	(cotrain_parse_opt): Likewise.
	(cotrain_calculate_perplexity): New function.
	(cotrain_split_vocabulary_from_file): New function.
	(cotrain_do_vocab_split): New code for splitting from file.
	(cotrain_generic_select): Changed prototype for changed data
	structure.
	(cotrain_select_by_confidence_weighting): Likewise.
	(cotrain_select_by_confidence): Likewise.
	(cotrain_select_by_density_weighting): Likewise.
	(cotrain_select_by_density): Likewise.
	(cotrain_select_by_random_weighting): New function for random
	selection type.
	(cotrain_select_by_random): Likewise.
	(cotrain_new_vpc_with_weights): New code for printing the dependency
	matrix.  New code for co-GEM.  New code for changed data
	structure.

	* dirichlet.c (main): Print progress information to stderr, not
	stdout.  Increase print precision on alphas.

	* em.c (bow_em_calculating_perplexity): Made non-static to allow
	access from cotraining.
	(bow_em_new_vpc_with_weights): Initialize dirichlet smoothing if using
	it.  Add word counts even when class probs are 0.  This will fill
	out the class word matrix for perplexity calculations.
	(em_calculate_perplexity): Fix class correspondence bug.  hit index is
	not class index when adding up log_prob_of_data.  Also correctly
	calculate num_data_words based on actual words occurring in the
	model.
	(bow_em_pr_wi_ci): Add in code for dirichlet smoothing.
	(bow_em_set_weights): Likewise.
	(bow_method_em): Change word vector normalization to
	set_weights_to_count.  This causes document-then-word test
	documents to have their probabilities not be scaled to the
	document length.  This can be interpreted as more correct than the
	other way.

	* naivebayes.c (bow_naivebayes_dirichlet_alphas): New global
	variable for Dirichlet smoothing.
	(bow_naivebayes_dirichlet_total): Likewise.
	(bow_naivebayes_load_dirichlet_alphas): New function.
	(bow_naivebayes_initialize_dirichlet_smoothing): New function.
	(bow_naivebayes_pr_wi_ci): Added dirichlet smoothing option.
	(bow_naivebayes_print_word_probabilities_for_class): Changed output
	format to include word count as well.
	(bow_naivebayes_set_weights): Initialize dirichlet smoothing if it's
	being used.

	* opts.c (bow_smoothing_dirichlet_filename): variable for
	Dirichlet smoothing.
	(bow_smoothing_dirichlet_weight): Likewise.
	(bow_options): Added new options --smoothing-dirichlet-filename and
	--smoothing-dirichlet-weight.
	(parse_bow_opt): Likewise.

	* wi2dvf.c (bow_wi2dvf_hide_all_wi): New function.

	* Makefile.local (RAINBOW_METHOD_C_FILES): Remove dirk.c because
	it's already in Makefile.in

	* primes.c (_bow_nextprime): Bugfix from Andrew for mixing alloca
	and realloc in weird ways.

	* em.c (em_labeled_for_start_only): New option variable.
	(em_set_vocab_from_unlabeled): New option variable.
	(em_options): Changes for new options.
	(em_parse_opt): Likewise.
	(bow_em_new_vpc_with_weights): New option --em-labeled-for-start-only
	uses the labeled data just to set the starting point of EM, and
	not used during iterations.  Option --em-set-vocab-from-unlabeled
	sets the vocabulary to only words occurring in the unlabeled data.

2000-09-08  Kamal Nigam  <knigam@whizbang.com>

	* primes.c (_bow_nextprime): Fix very peculiar memset bug.
	Somehow it doesn't seem to matter...

	* barrel.c (bow_barrel_new_from_printed_barrel_file): Fixed for
	documents with no features.  Now they won't get a " " feature.
	(bow_barrel_printf_selected): Added support for 'l' (print the word as
	many times as it occurred) and for 'P' (print docs in IPL format).

	* rainbow.c (rainbow_parse_opt): If using vocabulary from file, do
	not add to this vocab later.
	(main): Allow user to set vocab from file at indexing time.

	* bow/libbow.h (word_map): made global

	* int4word.c (word_map): make global
	(bow_word2int_add_occurrence): grow word_map multiple times, if
	necessary

	* maxent.c (bow_maxent_new_vpc_with_weights_doc_then_word):
	properly ignore all documents that have no features.
	These documents violate the constant document length assumption
	made by doc_then_word.

2000-05-21  Andrew McCallum  <mccallum@whizbang.com>

	* dirk.c (bow_dirk_score): Initialize MAX_SCORE_DI to avoid gcc
	warning.

	* bpe.c: Don't include <huge_val.h>, it is no longer necessary in
	RedHat 6.1.

	* bow/crossbow.h (crossbow_classify_doc_new_wa): Declare new
	function.

	* bow/libbow.h: Declare new functions and variables.

	* lex-simple.c (bow_lexer_max_num_words_per_document): New
	variable.
	(bow_lexer_simple_open_text_fp): Initialize it.
	(bow_lexer_simple_open_str): Likewise.
	(bow_lexer_simple_postprocess_word): Use it.  Also, handle
	BOW_XXX_WORDS_ONLY.
	(bow_lexer_infix_length): New variable, but unused.

	* crossbow.c: Added code for query serving on a socket.
	(crossbow_new_root_from_dir): When recursing directories, skip over
	directories named "unlabeled".  Yipes, this is scary, arbitrary
	behavior.
	(crossbow_index_filename): If filename path includes the directory
	"unlabeled", remove that directory from the file path.  Again,
	scary arbitrary behavior!
	(crossbow_index_filename): Verbosify the file path and class.
	(crossbow_index_multiclass_list): Fix call of strtok.  Use strtok
	instead of strsep, for the sake of Solaris.
	(crossbow_classify_doc_new_wa): New function.
	(crossbow_classify_doc) [DOC_LENGTH_SCORE_TRANSFORM]: Rescale the
	score in a document-length specific way, as an aid to improved
	estimation of confidence, for the confidence-based selection
	of which unlabeled documents to label.
	(crossbow_socket_init): New function.
	(crossbow_serve): New function.
	(crossbow_query_serving): New function.
	(crossbow_options): New command-line option "query-server".

	* rainbow.c: Include <strings.h> for bzero on Solaris.
	(rainbow_options): New command-line arguments "forking-query-server"
	and "use-saved-classifier".
	(rainbow_parse_opt): Handle them.
	(struct rainbow_arg_state): New member FORKING_SERVER.
	(rainbow_query): Handle UNIX signal for broken pipe.
	Code added by Dan Rapp <drapp@whizbang.com>.  Remove words from
	QUERY_WV that are not in the class barrel!  This fixes
	normalization by document length.  Comment out a bunch of code
	that would re-set various parameters specified on the
	command-line (such as the classification method); this makes
	--query-server work much better; this will break old behavior,
	but I don't think it is ever used.  Always set the weights and
	normalize the QUERY_WV using the class barrel; previously there
	was a preference for using the document barrel.
	(SigPipeHandler): New function.
	(rainbow_serve): Implement a forking server.
	(rainbow_test): Remove from QUERY_WV words not in the class barrel.
	(rainbow_test_files): Likewise.  If the test file can't be opened,
	don't crash, just report so on stderr.
	(main): Handle query forking server.  When testing saved model and
	looping once for each test document, remove from QUERY_WV words
	not in the class barrel.

	* opts.c (bow_xxx_words_only): New variable.
	(bow_options): New command line options "xxx-words-only" and
	"max-num-words-per-document".
	(parse_bow_opt): Handle them.

	* wv.c (bow_wv_prune_words_not_in_wi2dvf): New function.
	(bow_wv_fprintf): Print all on one line, just like --print-matrix.
	(bow_wv_printf): New function.

	* treenode.c (bow_treenode_descendant_matching_name) [WHIZBANG]:
	Rely on tree being rooted at directory named "./data" and tree
	depth being 2.  Without this code, we don't reliably find the
	right descendant if there are several treenodes with the same
	name.

	* split.c (bow_set_docs_to_type): When duplicate tags are
	requested for a document, just print a warning instead of exiting
	with an error.

	* random.c (bow_random_double): Use RAND_MAX if available.

	* random.c (bow_random_double): Handle case in which RAND_MAX is
	not defined, assuming its value is 2147483647, if necessary.
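
The two random.c entries above describe a portability fallback for
RAND_MAX.  A minimal sketch of that pattern follows; it is not the
actual bow_random_double source.  The helper name random_double and its
(low, high) parameters are hypothetical, and only the fallback value
2147483647 comes from the entry itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Fall back to 2^31 - 1 when <stdlib.h> does not define RAND_MAX,
   the value the ChangeLog entry says is assumed.  */
#ifndef RAND_MAX
#define RAND_MAX 2147483647
#endif

/* Hypothetical helper (not libbow's bow_random_double): return a
   pseudo-random double in the half-open range [low, high).  */
double
random_double (double low, double high)
{
  double unit;

  assert (high >= low);
  /* Dividing by RAND_MAX + 1.0 keeps UNIT strictly below 1 and does
     the addition in floating point, so it cannot overflow an int.  */
  unit = rand () / (RAND_MAX + 1.0);
  return low + unit * (high - low);
}
```

Calling random_double (0.0, 1.0) gives a value in [0, 1).  Seed the
generator once with srand() beforehand if a different sequence is
wanted on each run.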
💿 File size: 522 K
👤 Uploaded by: yuanata
📂 Category: Numerical algorithms / Artificial intelligence