=head1 NAME

KinoSearch::Docs::DevGuide - hacking/debugging KinoSearch

=head1 SYNOPSIS

Developer-only documentation.  If you just want to build a search engine, you
probably don't need to read this.

=head1 Fundamental Classes

Most of the classes in KinoSearch rely on
L<KinoSearch::Util::Class|KinoSearch::Util::Class> and
L<KinoSearch::Util::ToolSet|KinoSearch::Util::ToolSet>, so you'll probably
want to familiarize yourself with them.

=head1 Object Oriented Design

=head2 No public member variables.

Multiple classes defined within a single source-code file, e.g. TermQuery and
TermWeight, may use direct access to get at each other's member variables.
Everybody else has to use accessor methods.

C-struct based classes such as TermInfo allow direct access to their members,
but only from C (of course).

=head2 Subroutine/method access levels

There are three access levels in KinoSearch.

=over

=item 1

B<public>: documented in "visible" pod.

=item 2

B<private>: subs which are prepended with an _underscore may only be used
within the package in which they reside -- as per L<perlstyle|perlstyle>
guidelines -- and in only one source file.

=item 3

B<distro>: any sub which doesn't fall into either category above may be used
anywhere within the KinoSearch distribution.

=back

=head1 Documentation Conventions

KinoSearch's public API is defined by what you get when you run the suite
through a well-behaved pod-to-whatever converter.  Developer-only
documentation is limited to comments and "invisible" =for/=begin POD blocks.

=head1 Integration of XS and C code

XS and C code in KinoSearch is stored faux-L<Inline|Inline>-style, after an
C<__END__> token, and delimited by either C<__XS__>, C<__H__>, or C<__C__>.
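
As a rough illustration of that layout, the sketch below pulls the delimited
blocks out of a module's source text.  The C<extract_blocks> sub and the
sample module are hypothetical and are not taken from KinoSearch's Build.PL;
the sketch also assumes that each delimiter sits alone on its own line and
that nothing but XS, H, and C blocks follows C<__END__>.

```perl
use strict;
use warnings;

# Hypothetical helper (not KinoSearch's real Build.PL code): return a hashref
# mapping 'XS', 'H', and 'C' to the text of the corresponding inlined blocks.
sub extract_blocks {
    my ($source) = @_;
    my %blocks;

    # Everything before the __END__ token is ordinary Perl; discard it.
    my ( undef, $inlined ) = split /^__END__[ \t]*\n/m, $source, 2;
    return \%blocks unless defined $inlined;

    # Each block runs from its delimiter line to the next delimiter line,
    # or to the end of the file if no further delimiter follows.
    while (
        $inlined =~ m{
            ^__(XS|H|C)__[ \t]*\n               # delimiter line
            (.*?)                               # block body
            (?= ^__(?:XS|H|C)__[ \t]*$ | \z )   # next delimiter or EOF
        }xmsg
        )
    {
        $blocks{$1} .= $2;
    }
    return \%blocks;
}

# A made-up module with one block of each kind.
my $source = join "\n",
    'package My::Module;',
    '1;',
    '__END__',
    '__XS__',
    'MODULE = My  PACKAGE = My::Module',
    '__H__',
    'int my_add(int a, int b);',
    '__C__',
    'int my_add(int a, int b) { return a + b; }',
    '';

my $blocks = extract_blocks($source);
print "$_ block:\n$blocks->{$_}\n" for sort keys %$blocks;
```

A build script working along these lines could then write each H and C block
to its own file and append every XS block to one shared .xs file.
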
A heavily customized Build.PL detects these code blocks and writes out hard
files at install-time, so the inlining is mostly for convenience while
editing: the XS code is often tightly coupled to the Perl code in a given
module, and having everything in one place makes it easier to see what's going
on and move things back and forth.

Build.PL writes out separate .h and .c files for each block it finds, but all
the XS blocks are concatenated into a single file -- KinoSearch.xs.  The
content of KinoSearch.xs consists of the XS block from KinoSearch.pm, followed
by all the other XS blocks in an undetermined order.  Ultimately, only a
single compiled library gets installed along with the Perl modules.

At runtime, the only module which calls XSLoader::load is KinoSearch.  Because
the KinoSearch C<MODULE> has many C<PACKAGE>s, C<use KinoSearch;> loads I<all>
of the XS routines in the entire KinoSearch suite.  A pure-Perl version of
KinoSearch.pm which did the same thing might look like this...

    package KinoSearch;
    our $VERSION = 1.0;

    package KinoSearch::Index::TermInfo;

    sub get_doc_freq {
        # ...
    }

    package KinoSearch::Store::InStream;

    sub lu_read {
        # ...
    }

    # ...

Since KinoSearch.xs is only generated/modified when Build.PL is run, an extra
command line call to Build.PL has to be integrated into the development
workflow when working on XS or C material.

    % perl Build.PL; ./Build code; perl -Mblib t/some_test.t

Build.PL tracks modification times, using them to determine whether it needs
to recompile anything.  If only pure Perl modules have been edited, it won't
force needless recompilation, and if only a limited number of .pm files
containing XS/C/H code have been edited, it will recompile as little as it
can.

=head2 Division of labor between Perl, XS, and C

Technically, most or all of the C code in KinoSearch could be stuffed into XS
functions.
However, the mechanics of moving data across the Perl/C boundary
are complicated, and the added cruft tends to make what's going on in the C
code more difficult to grok.

To maximize clarity, when possible XS in KinoSearch is limited to "glue"
code, while Perl and C do the heavy lifting.  Exceptions occur when XS
functions need to manipulate the Perl stack, for instance when returning more
than one value.

=head1 Relationship to Lucene

=head2 API Differences, both public and private

Since there is no method overloading by signature in Perl, it's impossible for
KinoSearch to mimic Lucene's API exactly.  For a variety of other reasons,
most of them performance-related, it's difficult or inadvisable to even try.

The most crucial API distinction between KinoSearch and Lucene relates to
indexing: KinoSearch, like its predecessor, Search::Kinosearch, forces you to
specify field definitions in advance; in contrast, Lucene employs an elaborate
apparatus behind the scenes which adapts field definitions on the fly.  If
KinoSearch were to implement that apparatus, indexing would be substantially
slower than it is.

Since KinoSearch could only muster a poor imitation of Lucene's API at best,
it doesn't work very hard at imitating it -- it just tries to be a good,
idiomatic Perl search API.  Here's a sampling of other ways in which
KinoSearch and Lucene differ:

=over

=item *

In Lucene both IndexWriter and IndexReader can be used to modify an index,
while in KinoSearch all index modifications are performed via the InvIndexer
class.

=item *

Since Perl's filehandles can be used to write to scalars, there's little
benefit to having separate classes like RAMIndexOutput and FSIndexOutput, and
KinoSearch only has two IO classes: InStream and OutStream.

=item *

Certain classes have had their names transformed in KinoSearch, usually by
shortening (SegmentTermDocs => SegTermDocs, Document => Doc), but sometimes
arbitrarily (Directory => InvIndex).

=item *

Other classes are specific to KinoSearch, e.g.
PostingsWriter, PolyAnalyzer.
These are either scavenged from Search::Kinosearch, or written from scratch.

=item *

A lot of the common method names in Lucene are core keywords in Perl: length,
close, seek, read, write.  Where it seemed important to disambiguate those,
they have been changed -- e.g., for a Perl programmer accustomed to the
passive behavior of CORE::close, having a close() method that triggers
file-writes is counter-intuitive, so such behaviors have typically been
shunted into a method called finish().

=back

Nevertheless, developers who speak both Perl and Java shouldn't find it too
difficult to move back and forth between Lucene and KinoSearch -- as long as
what they want to do is within the scope of KinoSearch's leaner API.  At the
time of this writing KinoSearch's public API is artificially small, both
because some pieces aren't done and because, in a young library, preserving
the flexibility to change the internal implementation is paramount.  But
KinoSearch does not aspire to do all the things that Lucene does -- only the
things that can be done well, and maybe a few tricks of its own.

=head2 Thematic architectural differences

The relatively low object-oriented overhead of Java makes possible some
architectures in Lucene which do not translate well to Perl.  For instance,
arrays are often accumulated in Lucene by calling an iterator's next()
method over and over.  Method calls in Perl, which are always dynamic, are
more costly, making this technique impractical in areas which are both
data-intensive and performance-critical.  Most often, KinoSearch solves the
problem by moving the algorithm to a for/while loop in either Perl or C.  In a
minority of cases, frequently called methods are emulated using either a
globally exposed C function or C pointer-to-function.

Lucene is also a profligate wastrel when it comes to creating and destroying
objects -- it can afford to be, because Java objects come cheap.
For example,
during the indexing phase Lucene creates a mini-inverted-index for each
document, complete with its own FieldsWriter, DocumentWriter, and so on.
Having Perl objects do the same thing in the same way yields disappointing
execution speed.  Therefore, KinoSearch does the same thing, but in a
different way: it uses a completely different merge model, based around
serialization and external sorting, which makes it possible to
write indexes by-the-segment rather than by-the-document -- slashing the OO
requirements and largely offsetting the extra overhead.

Similar architectural differences exist throughout KinoSearch.  While the
file-format dictates certain algorithms, when idiomatic alternatives presented
themselves or were already known, they were adopted.

=head2 Search Logic

KinoSearch's search logic differs from Lucene's.  In particular,
very short fields do not contribute as much to a score, making it essential to
assign a C<boost> of greater than 1 to short but important fields, e.g.
"title".

=head2 File Format

Lucene's file format is a beautiful thing.  KinoSearch was retooled to use it
because it didn't seem possible to improve on it more than superficially.
It's somewhat hard to work with, and impossible to work with efficiently using
only native Perl, but it has both data-compression and
search-time-execution-speed nailed.

Unfortunately, the way Lucene defines a String has a Java-centric quirk that
deep-sixes prospects for 100% index compatibility, and the only realistic
solutions involve changes to Lucene.  In theory, these changes (moving to
bytecount-based Strings) should actually improve Lucene's performance, and the
Lucene development community has expressed openness to them.
At some point,
time may be found to write these win-win patches, but at present working on
KinoSearch itself takes higher priority.

Here's a list of the ways in which KinoSearch's file format differs from
Lucene's:

=over

=item *

B<Strings> - Lucene's file format specifies that a String is a VInt count of
the number of Java chars a string occupies, followed by the character data
encoded in Sun's "modified UTF-8" format (a misnamed, illegal variant,
unsanctioned by the Unicode Consortium).  Dealing with this format in Perl has
proven to be so fiddly and performance-killing as to be not worth the effort.
At present, KinoSearch's String format is: a byte count, followed by arbitrary
data.

=item *

B<Field Numbers> - KinoSearch requires that field numbers correspond to the
lexically sorted order of field names; Lucene indexes are unlikely to have
this property.  KinoSearch 0.05 had compatibility code which dealt with
unordered field numbers, but that code was removed in 0.06.  It can be
restored if need be.

=item *

B<Term Vectors> - KinoSearch groups term vector data with stored field data;
Lucene uses separate files.  Keeping term vector data and field data together
minimizes disk activity for common usage patterns.  It's also a little simpler
to maintain.  However, if the String issue is resolved in a future version of
Lucene, KinoSearch will likely switch to Lucene's format.

=back

=head1 Coding style

The gist: when possible, KinoSearch's Perl code follows the recommendations
set out in Damian Conway's book, "Perl Best Practices", and its XS/C code more
or less (no tab characters allowed) emulates the style of the source
code for Perl itself.

Perl code is auto-formatted using a PerlTidy-based helper app called kinotidy,
which is basically perltidy with a profile set up to use the PBP settings.

It would be nice if there were a formatter for XS and C code that was as good
as PerlTidy.
Since there isn't, the code is manually set to look as though it
was, with one important difference: a bias towards maximum parenthetical
tightness.

In both Perl and XS/C, code is organized into commented paragraphs a few lines
in length, as per PBP recommendations.  Strong efforts are made to keep the
comment to a single line.  Stupefyingly obvious "code narration" comments are
used when something more literate doesn't present itself -- the goal is to
be able to grok the intended flow of a function by scanning the first line of
each "paragraph".

=head1 COPYRIGHT

Copyright 2005-2007 Marvin Humphrey

=head1 LICENSE, DISCLAIMER, BUGS, etc.

See L<KinoSearch|KinoSearch> version 0.163.

=cut