An attribute is "suppressed" by inserting the keyword
.I suppressed
after the colon in the attribute's definition.
.PP
The
.I data file
contains a set of classified examples.  Each example is a
comma-separated list of attribute values, followed by an atom
indicating the class of the example, followed by a period.  (It is
usually convenient to have one example per line, but this is not
required.)  Attribute values are given in the same order that
attributes are defined in the names file; most of the usual
syntaxes for numbers are supported.  Set- and bag-valued attributes
are specified by simply enumerating the elements of the set,
separated with whitespace.
Unknown attribute values are indicated with a question mark token.
.PP
Examples can also be given a weight, by inserting
.I :w
between the class name and the terminating period
(where
.I w
is a real number, the default value for which is one).
.PP
The
.I test file
is formatted in the same way as the data file.
.PP
Finally, the
.I grammar file
contains a description of a context-free grammar, roughly in BNF
notation.  The grammar file is optional for ripper,
and most users will probably not want to change the default
grammar; however, we will describe it here for completeness.
The terminal symbols of the grammar are tests on the values
of attributes defined in the names file; each sentence generated by
the grammar is thus a sequence of attribute-value tests.  Ripper
will read in this grammar and constrain its learning component so that
every rule generated by ripper will have as an antecedent a sequence
of attribute-value tests that is a sentence of the grammar.
The grammar is thus a way for the user to guide ripper's choice of rules.
.PP
More specifically, the grammar file contains a series of
.I grammar rules.
Each grammar rule consists of an atomic
.I left-hand side
followed by the token
"-->"
followed by a comma-separated list of
.I grammar symbols
followed by a period.
A
.I grammar symbol
is either a nonterminal symbol (which is simply an atom
that appears on the left-hand side of some grammar rule)
or a
.I terminal symbol.
A terminal symbol is of the form
.I attribute op value
where
.I attribute
is the name of an attribute (e.g. "height") and value is a valid
value for that attribute.  An operator
.I op
must be one of the
tokens "=", "!=", ">=", "<=", "~" or "!~".
Terminal symbols of the form
.I attribute op *
are also allowed, in which case any possible value is allowed.
.PP
The condition
.I attribute ~ symbol
is used for set- and bag-valued
attributes.
The condition
.I attribute ~ symbol
is true of an example if
.I attribute
is set-valued and
.I symbol
is contained in the set.
The condition
.I attribute !~ symbol
is true if
.I symbol
is not present in the set.
For bags, the condition
.I attribute ~ symbol__k
is true if
.I attribute
contains at least
.I k
instances of
.I symbol.
The condition
.I attribute !~ symbol__k
is treated analogously.
.PP
Often one will have several grammar rules with the same left-hand
side, but different right-hand sides.  In this case one may use the
syntax
.in +6
.br
LHS --> RHS1 | RHS2 | ... | RHSk
.in -6
rather than the wordier
.in +6
.br
LHS --> RHS1
.br
 ...
.br
LHS --> RHSk
.in -6
Finally, prefixing a grammar rule with an exclamation point indicates
to ripper that sentences generated using that grammar rule have a lower
priority; if possible, ripper will build a hypothesis without using
low-priority sentences.
Even lower priorities can be assigned by
prefixing grammar rules with a string of two or more exclamation
points.
.SH THE DEFAULT GRAMMAR
.PP
When learning rules to predict the class "class", ripper will expect
to find some left-hand side of the form "body_class" to use as the
start symbol of the grammar; if this is not present, ripper will use
the atom "body" as the start symbol.  If "body" is not present either,
ripper will construct the following default grammar:
.PP
.in +6
body --> body_conds.
.br
body_conds --> .
.br
body_conds --> cond,body_conds.
.br
cond --> attr1_cond.
.br
 ...
.br
cond --> attrk_cond.
.br
.in -6
.PP
where
.I attr1, ..., attrk
are the names of the attributes defined in the names file.
If discretization is used, then for each
continuous attribute
.I cattr,
the default grammar also contains the rules
.PP
.in +6
cattr_cond --> cattr>=t1.
.br
cattr_cond --> cattr<=t1.
.br
 ...
.br
cattr_cond --> cattr>=tn.
.br
cattr_cond --> cattr<=tn.
.br
.in -6
.PP
where
.I t1, ..., tn
are the thresholds in ripper's discretization of the training data.
Otherwise, the grammar will contain the rules
.PP
.in +6
cattr_cond --> cattr>= '*'.
.br
cattr_cond --> cattr<= '*'.
.br
.in -6
.PP
For a nominal attribute
.I nattr
the default grammar contains the rule
.PP
.in +6
nattr_cond --> nattr = '*'.
.in -6
.PP
For a set- or bag-valued attribute
.I sattr
the default grammar contains the rules
.PP
.in +6
sattr_cond --> sattr ~ '*'.
.br
sattr_cond --> sattr !~ '*'.
.br
.in -6
.PP
If the grammar file is missing or empty, then the default grammar will
be used.  If the grammar file defines some but not all of
the nonterminal symbols used in the default grammar, those definitions
will override the corresponding default definitions.
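.PP
As a rough illustration of the default grammar described above, the
following Python sketch builds the rule text for a given list of attributes.
It is not taken from ripper's source: the function name default_grammar,
the kind labels "continuous", "nominal" and "set", and the thresholds
argument are all hypothetical, and bag-valued attributes are assumed to
be declared with the kind "set".
.PP
.nf
```python
def default_grammar(attrs, thresholds=None):
    """Return default-grammar rules as a list of strings.

    attrs      -- list of (name, kind) pairs; kind is "continuous",
                  "nominal" or "set" (hypothetical labels)
    thresholds -- optional dict: continuous attribute name -> list of
                  discretization points t1..tn (an assumption)
    """
    thresholds = thresholds or {}
    rules = [
        "body --> body_conds.",
        "body_conds --> .",
        "body_conds --> cond,body_conds.",
    ]
    # One alternative for cond per attribute.
    rules += [f"cond --> {name}_cond." for name, _ in attrs]
    for name, kind in attrs:
        if kind == "continuous":
            points = thresholds.get(name)
            if points:
                # Discretized: one >= and one <= test per threshold.
                for t in points:
                    rules.append(f"{name}_cond --> {name}>={t}.")
                    rules.append(f"{name}_cond --> {name}<={t}.")
            else:
                # Not discretized: wildcard value.
                rules.append(f"{name}_cond --> {name}>= '*'.")
                rules.append(f"{name}_cond --> {name}<= '*'.")
        elif kind == "nominal":
            rules.append(f"{name}_cond --> {name} = '*'.")
        elif kind == "set":
            rules.append(f"{name}_cond --> {name} ~ '*'.")
            rules.append(f"{name}_cond --> {name} !~ '*'.")
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
    return rules
```
.fi
.PP
Under these assumptions, calling default_grammar with one continuous
attribute "height" and one nominal attribute "color", and no thresholds,
returns eight rules: the three body rules, two cond rules, the two
wildcard rules for height, and the single rule for color.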
.SH FILES
.PP
.in +6
ripper
.br
filestem.data (data file)
.br
filestem.names (names file)
.br
filestem.gram (grammar file)
.br
filestem.test (unseen data)
.br
filestem.hyp (learned rules)
.in -6
.PP
Some sample input files are also available from wcohen@research.
.SH SEE ALSO
.PP
The man page for
.I ripperaux
contains brief descriptions of some additional useful programs for
working with ripper rulesets and/or datasets.
.PP
Ripper's input files are more-or-less compatible with Quinlan's
.I C4.5
tree-learning system.
.PP
The papers "Fast effective rule induction" (Cohen, ML95) and
"Learning trees and rules with set-valued features" (Cohen, AAAI96)
describe the algorithms used in Ripper in more detail.
.SH USING RIPPER TO CLASSIFY TEXT
.PP
I am frequently asked about tools to preprocess text so that it can be
easily digested by Ripper.  I don't have any tools to distribute,
largely because I think it would be hard to have any tool that is
sufficiently general to handle all the necessary cases,
but still substantially simpler than a general-purpose text processing
language like Perl.
.PP
My current recommendation for feeding text to Ripper is to use something like
Perl (or whatever suits you) to convert punctuation to white space
and coerce everything to lower case, and then feed the result into
Ripper as a single set (or perhaps bag).  If you use sets, it is not
necessary to remove duplicate tokens.  Be careful to remove the
punctuation symbols percent sign (%), comma (,), colon (:),
single quote ('), and period (.), all of which have special meaning to Ripper.
If you want a non-default tokenization, then you must surround each
token with single quotes.  Stemming
doesn't seem to make a big difference on the benchmarks I've tried.
Coercion to lower case also means that you can safely use any
upper-case or mixed-case symbol as an attribute or class name.
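.PP
The preprocessing recommended above can be sketched in a few lines of
Python instead of Perl.  This is only an illustration, not a tool
distributed with ripper: the names SPECIAL, tokenize, set_value and
data_line are hypothetical, and data_line additionally assumes that the
example has a single set-valued attribute followed by a comma, the class
atom, and the terminating period.
.PP
.nf
```python
SPECIAL = set("%,:'.")  # characters the text above lists as special


def tokenize(text):
    """Lower-case the text, turn every non-alphanumeric character
    into white space, and split the result into tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def set_value(text):
    """Return the tokens as a whitespace-separated set (or bag) value.
    Duplicates are kept; the text says sets do not need them removed."""
    tokens = tokenize(text)
    # Sanity check: none of the special characters may survive.
    assert not any(ch in SPECIAL for tok in tokens for ch in tok)
    return " ".join(tokens)


def data_line(text, label):
    """ASSUMPTION: one set-valued attribute, a comma, the class atom,
    then the terminating period."""
    return f"{set_value(text)},{label}."
```
.fi
.PP
Because every token is lower case, an upper-case class name such as
SPAM cannot collide with a token, which is the point made at the end of
the section above.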
.SH BUGS
.PP
Attribute names, attribute values, grammar symbols, and class names
are all put in the same name space, so you can't use the same symbol
for, say, a class name and a possible set value.  This is awkward
when you're using set- or bag-valued attributes to handle text.
.PP
Ripper doesn't actually check the range
of symbolic attributes for consistency with the
declaration in the names file.
.PP
The response of ripper to the -S and -L options is sometimes
rather abrupt; i.e., small changes can sometimes have drastic consequences.
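.PP
Since all symbols share one name space, it can help to check for clashes
before running ripper.  The following Python sketch is one way to do
that; the function name shared_symbols and its keyword-argument
interface are hypothetical, and the caller is assumed to have collected
the symbols of each kind already.
.PP
.nf
```python
def shared_symbols(**groups):
    """Map each symbol that occurs in more than one group to the
    sorted list of group names that use it."""
    seen = {}
    for group, symbols in groups.items():
        # set() so a symbol repeated within one group counts once.
        for sym in set(symbols):
            seen.setdefault(sym, []).append(group)
    return {s: sorted(g) for s, g in seen.items() if len(g) > 1}
```
.fi
.PP
For instance, passing classes=["spam", "ham"] and
set_values=["buy", "spam", "now"] reports "spam" as shared by both
groups, while an upper-case class name SPAM produces no clash with
lower-case tokens.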