ChangeLog
	* maxent.c (bow_maxent_score): From query_wv remove words not in
	the class_barrel (I think this affects normalization).  Also, set
	and normalize the weights---note that this might now be done twice.

	* lex-gram.c (bow_lexer_gram_get_word): Don't distinguish between
	bi-grams in the middle of a document and a tri-gram that only got
	two words before the end of the document.  Do this by removing
	trailing `;'s.

	* hem.c (crossbow_hem_incremental_labeling): New variable.
	(crossbow_hem_fixed_shrinkage): New variable, but unused.
	(crossbow_hem_options): New command-line argument
	"hem-incremental-labeling".
	(crossbow_hem_label_most_confident): New function.
	(crossbow_hem_full_em): Call it.
	(crossbow_hem_em_one_iteration): Handle incremental labeling.

2000-05-19  Andrew McCallum  <mccallum@whizbang.com>

	* wi2dvf.c (bow_wi2dvf_add_di_text_str): Assert LEX.

	* rainbow.c (rainbow_options): New command-line option
	"index-lines".
	(rainbow_index_lines): New function.
	(main): Call it.

	* opts.c (bow_options): New command-line option "use-unknown-word".

	* int4word.c (bow_word2int_use_unknown_word): New global variable.
	Use it in various functions.

	* bow/libbow.h: Define the new global variable.

	* naivebayes.c (bow_naivebayes_score): Use a temporary annealing
	weight that depends on the number of words in the query_wv.

	* crossbow.c (crossbow_index_multiclass_list): Even though
	strtok() is deprecated, switch to it from strsep(), since the
	latter doesn't exist on Solaris.

	* configure.in: Check for the nsl library, needed by Solaris.

	* cdmemr.c: New learning method "emr".  Change the annealing
	temperature from 1000 to 100.  Several other changes.
	(cdmemr_calculate_perplexity): New function.

	* barrel.c: Removed #include <nan.h>.  It's not necessary and no
	longer exists in RedHat.

	* archer-server.c: Remove #include <signum.h>.
	It isn't necessary and doesn't exist on Solaris.

2000-04-28  Andrew McCallum  <mccallum@whizbang.com>

	* cdm.c (bow_cdm_score): Normalize scores by document length.

2000-02-28  Andrew McCallum  <mccallum@justresearch.com>

	* barrel.c (bow_barrel_prune_words_in_map): Be sure not to add
	words from MAP into the vocabulary.

2000-02-23  Andrew McCallum  <mccallum@justresearch.com>

	* primes.c (_bow_nextprime): Allocate space for the sieve on the
	heap, not on the stack.  This helps when we need really, really
	big prime numbers.

2000-01-03  Andrew McCallum  <mccallum@justresearch.com>

	* split.c (bow_split_options): Changed the documentation of the
	"test-set" default from 0.3 to 0.0.  Reported by Jason Rennie.

2000-02-25  Gregory C Schohn  <gcs@cmu.edu>

	Final release of the SVM code as part of rainbow---it will become
	its own library now (a rainbow interface will always remain up to
	date, though).

	* svm_trans.c (transduce_svm): Fixed a bug in the temporary
	updating code.  Simplified some of the printing.

	* svm_al.c (al_test_data): Added fields for train and query
	errors.
	(al_svm_guts): Now using hyp_vect and cur_hyp_yvect for ALL
	labels.  Made the transduction method skip training if the labels
	queried equal the labels transduced and they weren't bound.
	Finished error reporting so that it doesn't always email me.
	(al_svm_test_wrapper): Added (and rearranged) code to print
	additional statistics (query and train accuracy).

	* svm_base.c: Changed some of what gets printed, a couple of
	constants, and other very minor things.
	(svm_score): Fixed an uninitialized memory read.
	(tlf_svm): Fixed a bug with the random seed for the splits (it
	was sometimes using uninitialized memory).

2000-02-04  Kamal Nigam  <knigam@server5.jprc.com>

	* vpc.c (bow_barrel_new_vpc_using_class_probs): New function.

	* Makefile.local (RAINBOW_METHOD_C_FILES): Removed emda.c because
	it does not exist in the repository.

	* genem.c: New file for the genem method.
	It requires a secondary method that utilizes class_probs for
	training and unlabeled docs (emsimple with rounds=1 will do,
	e.g.).

	* gaussian.c: New file for the gaussian method.  It's still very
	basic and preliminary.

	* Makefile.local (RAINBOW_METHOD_C_FILES): Added gaussian.c,
	genem.c, cotrain.c.

	* cotrain.c: New file for the cotrain method.  This version was
	used for the ICML-2000 submission.

	* bow/libbow.h (bow_doc_type): Added types bow_doc_pool and
	bow_doc_waiting for co-training.
	(bow_str2type): Extended for the new doc types.
	(bow_type2str): New macro.

	* wi2dvf.c (bow_wi2dvf_unhide_wi): New function.
	(bow_wi2dvf_hide_words_with_prefix): New function.
	(bow_wi2dvf_hide_words_without_prefix): New function.

	* vpc.c (bow_barrel_set_vpc_priors_using_class_probs): New
	function.

	* stoplist.c (bow_stoplist_present): If an infix separator is
	defined, use only the part of the string after it for stopword
	identification.

	* stem.c (bow_stem_porter): If an infix key is defined, take only
	the string after the infix key for stemming purposes.

	* split.c (bow_tag_change_tags): Changed the prototype.  Now
	returns the number of docs changed instead of returning void.

	* rainbow.c (rainbow_test): Set the priors when building the
	class barrel.  Is it really possible this bug has existed
	forever?

	* opts.c (bow_options): Added code for --lex-infix-string.
	(parse_bow_opt): Likewise.

	* next.c (bow_cdoc_is_pool): New function.
	(bow_cdoc_is_waiting): New function.

	* nbsimple.c (bow_nbsimple_set_cdoc_word_count_from_wi2dvf_weights):
	Store the number of terms per class in the normalizer, as well as
	the word_count.  This way, we have access to the un-rounded
	number if we prefer it.
	(bow_nbsimple_score): Remove the "feature" that normalizes scores
	by doc length for the document-then-word event model.  Longer
	documents will now have more extreme probabilities than shorter
	documents.  Use the normalizer for the total number of words per
	class instead of the word_count.
	This should be slightly more accurate, as it's not rounded.

	* lex-simple.c (bow_lexer_infix_separator): New variable for word
	infix recognition.
	(bow_lexer_infix_length): Likewise.

	* emsimple.c (bow_emsimple_num_em_runs): Changed the default to 10.

	* active.c (active_cdoc_is_used_for_density): New function.
	(active_doc_barrel_set_entropy): Use train, unlabeled, and pool
	docs for density-setting.  Used by cotrain.c.
	(active_doc_barrel_set_density): Likewise.  Also, don't print the
	density of each document.

2000-01-12  Gregory C Schohn  <gcs@cmu.edu>

	* svm_smo.c (smo): Added smart re-computation for *W.  If *W is
	NULL and there are non-zero weights, it is recomputed (since it
	is necessary for the error evaluations).  This saves a lot of
	cache-thrashing if the tvals vector is already up to date (it's
	not much harder to keep W up to date too).
	(svm_smo_yflip_tvals): Killed.  See the svm_base.c log for
	details.

	* bow/svm.h (svm_yflip_tvals): Killed the prototypes for this and
	the smo/loqo functions.  See the svm_smo.c log for details.

	* svm_loqo.c (svm_loqo_yflip_tvals): Killed (see the log entry
	for svm_trans).

	* svm_trans.c (transduce_svm): Fixed most of the inefficiencies
	(all of the big ones).  When the smart_vals variable is set, no
	extra recomputation is done; each SVM sub-problem's output is
	used as input for the next sub-problem (very similar to the
	active learning code, but here a lot more recomputation needs to
	be done, since labels and bounds are changing).  The hyperplane
	null/non-null convention is enforced: the plane is set to zero
	after it is freed, so that the solvers know not to look at it.
	Fixed a bug where all unlabeled documents had the same
	hypothesized label (only relevant when no-bias is also being
	used).  There is also support for hyperplane stability management
	(see svm_base and the refresh option).  A lot of debugging code
	is around for future use.  Killed the yflip functions.
	That code now just happens inside the loop, since the hyperplane
	also needs to be updated, but only for smo (so clean parameter
	passing wasn't going to happen).  TODO: get the tval-to-err
	functions working (though this is a very petty thing, especially
	if hyperplanes are being used to do the error evaluations).

	* svm_base.c (svm_options[]): Added options for
	TRANS_HYP_REFRESH_ARG (the number of iterations to go in the
	transduction loop before recalculating the hyperplane from
	scratch, to undo precision problems).  Probably never of any use;
	just a way for the user to check his/her sanity.
	(tlf_svm): Added a line to also print the running time to stdout.
	(tlf_svm): Added initialization of *W to NULL (since smo now uses
	the data in the array if the array is non-NULL).
	(svm_vpc_merge): Fixed a bug where documents were being re-loaded
	from the barrel (when weights per barrel and pairwise voting were
	used).  The unlabeled docs weren't coming back, but now they are.

	* svm_al.c (al_svm_guts): Made the loop a bit smarter and more
	efficient when transduction is used.  If the queried labels are
	the same as those hypothesized, and the weights are not bound for
	those vectors, the next problem isn't solved (since the solution
	will be exactly the same).  So far this doesn't seem to help too
	often (since running time increases as step size increases,
	making this less and less probable).  This will likely help on
	very big datasets where transduction is very helpful.

1999-12-30  Andrew McCallum  <mccallum@justresearch.com>

	* bow/libbow.h: Declare new functions.

	* wa.c (bow_wa_empty): New function.

	* rainbow.c (rainbow_options): New command-line option
	"print-doc-length".
	(bow_print_log_odds_ratio): Don't tread on the IDF any more.
	(rainbow_test): If requested, print the length of the document
	after each classification.

	* naivebayes.c: Add the capability to return simply P(d|c), and
	the ability to anneal the P(d|c) portion of P(c|d).
	(naivebayes_return_log_pr): New static variable.
	(bow_naivebayes_anneal_temperature): New global variable.
	(bow_naivebayes_score): Use the new variables.

	* bow/naivebayes.h: Declare the annealing global variable.

	* info_gain.c (bow_word_count_wa): New function.

	* em.c (bow_em_set_priors_using_class_probs): Don't set PRIOR_SUM
	to MAX_CI.  This was a very odd bug.

	* dirk.c (bow_dirk_score): Comment out printing of diagnostics.
	(bow_dirk_new_vpc): Add code that uses the CDM.  I'm not sure if
	this is working yet.

	* cdmemr.c (use_cdm): New static variable; attend to it.
	(bow_cdmemr_new_vpc_with_weights): Set the CDM anneal temperature
	and the NAIVEBAYES anneal temperature to 1000.  If we aren't very
	confident about the most confident classifications this round,
	don't label any more unlabeled documents.

	* cdmemi.c: Comments added.
	(bow_cdmemi_new_vpc_with_weights): Bug fix.  When
	BOW_CDMEMI_BINARY_SCORING, add to the WA the di from the 0th, not
	the 1st, HITS.

	* cdmem.c (bow_cdmem_new_vpc_with_weights): Only do one cdm round
	instead of 5.  Fix a bug by pre-decrementing NUM_CDM_ROUNDS
	instead of post-decrementing.

	* cdm.c (bow_cdm_anneal_temperature): New global variable.
	(bow_cdm_word_probs_using_ct_alphas): Get the number of classes
	from CLASS_COUNT_BARREL->CDOCS->LENGTH instead of from
	bow_barrel_num_classes (class_count_barrel).  This way we can use
	this code for a version of KNN with a CDM distance metric.
	(bow_cdm_score): Calculate the number of words in the query; this
	was previously used as the annealing temperature, but is no
	longer.  Divide the log-prob scores by the annealing temperature.

	* archer.c (archer_index_lines): Try to make this work again
	after the changes to archer_index() for incremental additions.
	Still not working.  For the canopies experiments, I just checked
	out an old version of archer.

	* Makefile.local (RAINBOW_METHOD_C_FILES): Added emda.c,
	cdmemi.c, cdmemr.c.

1999-11-22  Andrew McCallum  <mccallum@justresearch.com>

	* Makefile.in (STANDARD_RAINBOW_METHOD_C_FILES): Added dirk.c.
	* dirk.c (log_gamma): Cache 100 integer x's.
	(bow_dirk_log_kernel): Take the vocab size as an argument instead
	of the barrel.
	(bow_dirk_score): Add exponentiated log-densities, instead of log
	densities.  Do this by finding the max and subtracting.

1999-11-16  Andrew McCallum  <mccallum@justresearch.com>

	* cdm.c (cdm_options): New command-line options
	"cdm-print-smallest-alphas" and "cdm-print-largest-alphas".
	(cdm_parse_opt): Handle them.
	(bow_cdm_initialize_ct): New code allows this to be called more
	than once.  This way you can add new documents (and hence words)
	and re-calculate the infogain.
	(bow_cdm_ct_set_alphas): Added structure ALPHA_RECORD for
	printing the largest and smallest alphas.  Added, but commented
	out, code for smoothing the counts before fitting the Dirichlet,
	using log(alpha) in place of alpha, and smoothing the alphas.
	Print the largest and smallest alphas.
	(CDM_SCORE_ANNEAL_TEMPERATURE): New macro, currently defined not
	to be used.
	(bow_cdm_score): Handle it.

1999-12-20  Kamal Nigam  <knigam@zeno.jprc.com>

	* emsimple.c: Added option --emsimple-no-init.

1999-12-18  Gregory C Schohn  <gcs@cmu.edu>

	* svm_al.c: Updated to work with the new model (i.e. this can be
	called only by svm_tlf (top-level-fn) and calls the trans fn. or
	the setup & solve fn).  So far the usage of transduction has no
	extra heuristics set up, but the active learning module can be
	used to get stats about incrementally, randomly selected labels.
	Rewrote the code to work with transduce_svm with as little hassle
	as possible.  The code that handled the labeled and unlabeled
	arrays changed significantly.  Now there is only one array (no
	more sub_*