
changelog

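Text classification with a Bayesian learning algorithm: a general-purpose text classification algorithm based on the naive Bayes classifier.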
Thu Sep 25 17:35:01 1997  Andrew McCallum  <mccallum@jprc.com>

	* Makefile.in (LIBBOW_H_FILES): Added bow/kl.h.

Wed Sep 17 14:51:33 1997  Karl Kleinpaste  <karl@jprc.com>

	* rainbow-h.c (top): Add socket #includes.
	(rainbowh_options): Add --query-server and -n options.  Also,
	remove #define of PRINT_TREE_SCORES, in favor of runtime -n.
	(rainbowh_arg_state): Add rainbowh_query_serving to what_doing,
	plus server_port_num, for --query-server, and print_tree_scores,
	for -n.
	(rainbowh_parse_opt): Add --query-server and -n detection.
	(_hier_barrel_set_node_scores): Properly conditionalize tree score
	printing.
	(hier_barrel_print_scores_recurse): Add FILE *out.
	(hier_barrel_print_scores): Add FILE *out.
	(rainbowh_query): New routine, stripped from the mainline `if'.
	(rainbowh_socket_init): Add for --query-server capability.
	(rainbowh_serve): Add for --query-server capability.
	(main): Init print_tree_scores; init lexer end pattern; insert
	conditional call to service --query-server; slice out mainline for
	rainbowh_query().

Wed Sep 17 14:22:18 1997  Andrew McCallum  <mccallum@jprc.com>

	* split.c (bow_test_split): Remove the assertion that we use at
	least 90% of the documents as training data.

Tue Sep  9 10:55:31 1997  Andrew McCallum  <mccallum@jprc.com>

	* rainbow-h.c: Change default method from prind to naivebayes.
	(_hier_barrel_cdoc_write): Handle bow_file_format_version 5.
	(_hier_barrel_cdoc_read): Likewise.
	(hier_barrel_set_local_class_model): Call vpc function with new
	argument.
	(hier_barrel_set_vpc_with_weights): Likewise.
	(hier_barrel_add_document): Set HBARREL->DOC_BARREL->METHOD to
	HIER_DEFAULT_METHOD.
	(hier_barrel_set_vpc_and_populate_lower_branches): Only recursively
	set vpc in children branches if there are documents there.
	(hier_barrel_prob_wi_in_ci): Assert that the CDOC->NORMALIZER has been
	set.  Set M_EST_M according to CDOC->NORMALIZER, which is number
	of unique words.
	(_hier_barrel_local_score): Clean up a little.
	(main): Call hier_set_method() if BOW_ARGP_METHOD.

Sat Aug 30 19:03:17 1997  Andrew McCallum  <mccallum@jprc.com>

	* kl.c (bow_kl_score): Initialize scores to class prior divided by
	query document length, not just class prior.  This way our
	classifications match Naive Bayes, as they should.

Fri Aug 29 09:12:05 1997  Andrew McCallum  <mccallum@jprc.com>

	* barrel.c (bow_barrel_add_from_text_dir): Add newline before
	warning about a file being skipped because it is not text.

	Before the following change we were overflowing DV->ENTRY[i].DI in
	document barrels when there were more than 32767 documents.
	Karl's Yahoo experiments were trying to build models with about
	60000 documents.  We would get an error in vpc.c at the assertion
	that "ci < num_classes".

	* dv.c (bow_dv_add_di_count_weight): Warn if we overflow int,
	not short.
	(bow_dv_write_size): Adjust for change of COUNT and DI from
	short to int.
	(bow_dv_write): Likewise.
	(bow_dv_new_from_data_fp): Likewise.

	* bow/libbow.h (bow_cdoc): Change member CLASS from short to int.
	(bow_de): Change members DI and COUNT from short to int.

	* barrel.c (_bow_barrel_cdoc_write): If BOW_FILE_FORMAT_VERSION is
	5 or greater, change CDOC->CLASS from short to int.
	(_bow_barrel_cdoc_read): Likewise.

	* bow/libbow.h (BOW_DEFAULT_FILE_FORMAT_VERSION): Changed from 4 to 5.

	* io.c: Add comment about bow_file_format_version history.

Thu Aug 28 22:36:55 1997  Andrew McCallum  <mccallum@jprc.com>

	* kl.c (bow_kl_score): Add class prior probabilities.

	* lex-simple.c (bow_lexer_simple_get_raw_word): When we find the
	NULL at the end of the document, and before we find the beginning
	of a word, back up DOCUMENT_POSITION (even though we will return 0
	this time already).  Add some assertions about DOCUMENT_POSITION.

	* lex-html.c (bow_lexer_html_get_raw_word): When we find the NULL
	at the end of the document, back up DOCUMENT_POSITION so we will
	return 0 next time we are called.
	Add some assertions about DOCUMENT_LENGTH.

Wed Aug 27 11:23:34 1997  Andrew McCallum  <mccallum@jprc.com>

	* bow/libbow.h: Include <unistd.h>.

	* rainbow.c (rainbow_parse_opt): Fix typo.

	* rainbow.c (rainbow_parse_opt) [SERVER_KEY]: Set
	DOCUMENT_END_PATTERN to a single dot on a line.
	(main): Don't set DOCUMENT_END_PATTERN here for server mode.

	* lex-simple.c (bow_lexer_simple_open_text_fp): Explicitly seek
	the PRE_PIPE_FP to the end of the file!  Otherwise, we can
	sometimes read the same file over and over again in the many
	`while(open_text_fp())' loops throughout the library.

	* rainbow.c (rainbow_print_weight_vector): Change the test for
	deciding when we need to multiply by CDOC->NORMALIZER before
	printing the weight.  Instead of looking specifically for
	"naivebayes", look for a METHOD->NORMALIZE_WEIGHTS function
	pointer that is NULL.  Now this works properly for the "kl" method
	too.

	* kl.c (bow_kl_set_weights): Calculate the total number of
	occurrences of each word; store this in DV->IDF.  Set the DV
	weights to the weighted log odds ratio P(w|C)*log(P(w|C)/P(w|~C)).

	* rainbow.c (rainbow_lisp_setup): Update for new default
	arguments.
	(rainbow_lisp_query): Add LOO_CV argument to bow_barrel_score().

	* kl.c (bow_kl_score): Move declaration of SCORES_SUM.

Tue Aug 26 11:12:00 1997  Andrew McCallum  <mccallum@jprc.com>

	* vpc.c (bow_barrel_set_vpc_priors_by_counting): Add assertion
	about the PRIOR.

Mon Aug 25 14:03:58 1997  Andrew McCallum  <mccallum@jprc.com>

	* rainbow-ac.pl: As a diagnostic, print the number of predictions
	found in the file.

	* naivebayes.c (bow_naivebayes_set_weights): Set CDOC->NORMALIZER
	to the number of unique terms in each class.  (This is now used by
	rainbow-h.)

	* kl.c (bow_kl_set_weights): Add assertion about CDOC->NORMALIZER.

	* foilgain.c (bow_foilgain_ci_per_wi_new): New function.

	* bow/libbow.h (bow_default_method_name): New macro.
	* barrel.c (bow_barrel_new): Use new macro
	`bow_default_method_name' instead of "naivebayes".

Tue Aug 19 09:50:16 1997  Andrew McCallum  <mccallum@jprc.com>

	* int4word.c (bow_words_set_map): Be sure to initialize the
	map/counts if they haven't been initialized yet.  Otherwise,
	WORD_MAP_COUNTS will point nowhere and we can tromp on memory.  I
	was getting malloc() errors before this was fixed.
	(bow_words_keep_top_by_infogain): Change so that word indices are
	ordered by information gain, even when NUM_WORDS_TO_KEEP is less
	than the number of words returned by bow_infogain_per_wi_new().

	* wi2dvf.c (bow_wi2dvf_entry_at_wi_di): New function.

	* dv.c (bow_dv_entry_at_di): New function.

	* bow/libbow.h: Declare new functions.

	* barrel.c (bow_barrel_add_from_text_dir): Add verbosity when a
	file is skipped because istext() fails.
	(bow_new_slow_barrel_printf): New function.

	* vpc.c (bow_barrel_new_vpc): New argument, NUM_CLASSES.  Use it
	to initialize an array that is filled with counts of the number of
	documents per class.  Initialize CDOC->NUM_WORDS to be the number
	of documents per class.  This can then be used in "event=document"
	models.
	(bow_barrel_new_vpc_merge_then_weight): New argument, NUM_CLASSES.
	(bow_barrel_new_vpc_weight_then_merge): Likewise.

	* rainbow.c (rainbow_index): Use macro
	bow_barrel_new_vpc_with_weights(), with new `num_classes'
	argument.
	(rainbow_query): Likewise.
	(rainbow_test): Likewise.
	(main): Likewise.
	(rainbow_test_files): Likewise.  If QUERY_WV is NULL, verbosify a
	warning.

	* bow/libbow.h (bow_method): Add NUM_CLASSES argument to
	VPC_WITH_WEIGHTS.
	(bow_barrel_new_vpc_with_weights): Add NUM_CLASSES argument.
	(bow_barrel_new_vpc): Likewise.
	(bow_barrel_new_vpc_merge_then_weight): Likewise.
	(bow_barrel_new_vpc_weight_then_merge): Likewise.

	* bow/naivebayes.h (bow_params_naivebayes): Remove
	SCORE_WITH_LOG_PROBABILITIES.

	* kl.c (bow_kl_score): Reformat error message.
	* naivebayes.c (bow_naivebayes_set_weights): Only set
	CDOC->WORD_COUNT if not doing BOW_BINARY_WORD_COUNTS, otherwise
	leave them as the "document counts" as they were initialized in
	vpc.c.

Thu Aug 14 11:46:46 1997  Andrew McCallum  <mccallum@jprc.com>

	* naivebayes.c: Remove all references and code for
	SCORE_WITH_LOG_PROBABILITIES.  Use KL method instead.
	(bow_method_crossentropy): Removed, and all related structures and
	functions.

	* opts.c (bow_options): Remove "naivebayes-score-with-log-probs"
	option.
	(parse_bow_opt): Don't handle it anymore.

	* naivebayes.c: Add a naivebayes-specific command-line option by
	using "argp child".
	(naivebayes_argp_m_est_m): New static variable.
	(naivebayes_options): New argp structure.  New command-line option
	"naivebayes-m-est-m".
	(naivebayes_parse_opt): New function.
	(naivebayes_argp): New structure.
	(naivebayes_argp_child): New structure.
	(_register_method_naivebayes): Add the argp child.
	(bow_naivebayes_score): Comment out assertion that (loo_class == -1)
	because it trips up rainbow-h.

	These changes were made a while ago.

	* rainbow-h.c (hier_recursive_set_rankings): Pass new LOO argument
	to bow_barrel_score.
	(classify_single_doc): Likewise.
	(hier_barrel_set_vpc_and_populate_lower_branches): Likewise.
	(hier_barrel_prob_wi_in_ci): Add two new pass-by-ref arguments that
	return certain counts.  Pass new arguments.
	(check_prob_wi_in_ci): Pass new arguments.
	(_hier_barrel_local_score): Call above function with new arguments,
	and print them out.
	(main): Switch back to using POPULATE_BY_SCORING and HIER_NIECE
	options by default.

Wed Aug 13 16:44:07 1997  Andrew McCallum  <mccallum@jprc.com>

	* lex-simple.c (bow_lexer_simple_open_text_fp): Print error
	message if popen() call failed.

	* opts.c (bow_argp_add_child): Change assertion.  Add call to
	memset(), which should be unnecessary.

	Before this code was added, some inlinks WebKB files were being
	declared as "nontext" and skipped because many lines had the same
	length.
	* istext.c (bow_fp_is_text): Pay attention to
	BOW_ISTEXT_AVOID_UUENCODE.

	* opts.c (bow_istext_avoid_uuencode): Declare new global variable.
	(bow_options): New option "istext-avoid-uuencode".
	(parse_bow_opt): Handle it.

	* bow/libbow.h (bow_istext_avoid_uuencode): New global variable
	set by command-line option.
	(bow_lex_pipe_command): Make it extern!

	* kl.c (bow_kl_score): Give more detailed error message for LOO
	negative probabilities.

	Before this code was added, some WebKB files were being skipped
	because the non-MIME-header part was already buffered in STDIO.

	* lex-simple.c (bow_lexer_simple_open_text_fp): When using
	BOW_LEX_PIPE_COMMAND, make sure that the file descriptor file
	position matches the stdio FP position, otherwise we can get a
	premature EOF because the stdio has already read much of the file
	for buffering.

Mon Aug 11 11:51:11 1997  Andrew McCallum  <mccallum@jprc.com>

	* info_gain.c (bow_infogain_per_wi_print): If NUM_TO_PRINT is 0,
	then print infogain of all words, not zero words.

	* bow/libbow.h (bow_model_next_wv): Declare new split function.

Mon Jul 14 11:09:04 1997  Andrew McCallum  <mccallum@jprc.com>

	* rainbow-stats.pl (overall_accuracy): Shorten the label before
	the numbers.

	* istext.c (bow_fp_is_text): Initialize
	MAX_LINE_LENGTH_HISTOGRAM_LENGTH to avoid warning.

	* istext.c (bow_fp_is_text): Re-enable the uuencode-block
	detection.  Now, in order to reject the file, insist that the
	length of the lines with the most common length be greater than
	or equal to 50.  Hopefully this will not falsely reject HTML files
	as it did before.

Tue Jul  1 08:39:25 1997  Andrew McCallum  <mccallum@jprc.com>

	* kl.c (bow_kl_score): Remove assertion that SCORE_INCREMENT be
	non-zero.  It can be zero when PR_W_C == PR_W_D, then
	LOG(PR_W_C/PR_W_D) will be zero, and SCORE_INCREMENT will be zero.

Mon Jun 30 17:41:06 1997  Karl Kleinpaste  <karl@jprc.com>

	* rainbow.c (rainbow_serve): Added.
	(rainbow_socket_init): Added.
	(rainbow_parse_opt): Added SERVER_KEY case.
	(rainbow_query): Modified FILE * handling for use of other than
	stdin/stdout.
	(main): Added query-server handling.

Sat Jun 28 12:22:30 1997  Andrew McCallum  <mccallum@jprc.com>

	* rainbow.c (rainbow_test_files): Temporarily comment out code
	that removes some of the training documents from training until we
	add a scheme that really makes the default test percentage 0.
	(main): Put the call of rainbow_test_files after doing things
	necessary to update the class/word weights for the command-line
	options.  Temporarily, ALWAYS rebuild the VPC model, even if none
	of the parameters change because the weights read from disk were
	bad; find out why eventually!

	* prind.c (bow_prind_score): When BOW_PRINT_WORD_SCORES, also
	print PR_W_C.

	* prind.c (bow_prind_score): When all pre-normalized scores are
	zero, set normalized scores to -1.0/#classes, don't leave them as
	zero.  [Perhaps we should set the scores to the class priors?
	Although this does not fall out of the PrInd derivation.]

	* kl.c (bow_kl_score): When all pre-normalized scores are zero,
	set normalized scores to -1.0/#classes, not -9999.

	* arrow.c (arrow_query): Pass LOO_CV argument to score.

Thu Jun 26 14:48:28 1997  Andrew McCallum  <mccallum@jprc.com>

	* lex-simple.c (bow_lexer_simple_open_text_fp): Attend to
	BOW_LEX_PIPE_COMMAND and implement it.

	* opts.c (bow_lex_pipe_command): New global variable.
	(bow_options): New command-line option "lex-pipe-command".
	(parse_bow_opt): Handle it.

	* bow/libbow.h: Declare new global variable.

	* istext.c (bow_fp_is_text): Move local variables to avoid GCC

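Note on the Wed Aug 13 1997 lex-simple.c entry: it describes a premature EOF when a piped command reads from a file descriptor whose offset has run ahead of the stdio position, because stdio has already buffered much of the file. The following is a minimal sketch of that kind of fix, not the libbow code. The helper name sync_fd_to_stdio_position is hypothetical, and the sketch assumes a POSIX system and a seekable file.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of libbow).  stdio may have read a whole
   buffer from the descriptor even though the caller consumed only a few
   bytes, so the descriptor's offset can be far past ftell().  Moving the
   descriptor back to the logical stdio position lets a process that reads
   the descriptor directly (for example, one started through a pipe
   command) begin where the stdio reader stopped.
   Returns 0 on success, -1 on failure. */
int
sync_fd_to_stdio_position (FILE *fp)
{
  long logical = ftell (fp);	/* position as seen by the stdio caller */

  if (logical < 0)
    return -1;
  if (lseek (fileno (fp), (off_t) logical, SEEK_SET) == (off_t) -1)
    return -1;
  return 0;
}
```

A caller would invoke this once, after reading any header through stdio and before handing the descriptor to the other reader. Further stdio reads on the same stream after the call are outside the scope of this sketch.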