Contains some notes about the Harvest Broker code intended for developers.

----------------------------------------------------------------------

Registry file format (stor_reg.c):

It's one file with a record-based format.  A record looks like:

	4 bytes in network-byte order for record size
	4 bytes in network-byte order for magic number
	4 bytes in network-byte order for record flag
	4 bytes in network-byte order for URL length
	n bytes of the URL
	4 bytes in network-byte order for Gatherer Name length
	n bytes of the Gatherer Name
	[... and so on for other ASCII fields (empty fields have length 0)...]
	4 bytes in network-byte order for number 1
	4 bytes in network-byte order for number 2
	[... and so on for other numeric fields ...]
	[...end of record of record-size bytes...]

The record header for each record includes:

	4 bytes in network-byte order for a record size
	4 bytes in network-byte order for a magic number
	4 bytes in network-byte order for a flag

The flag (an unsigned int) would mark a deleted or valid record, and other
stuff in the future.

With this format, the broker issues 2 read() calls per record: the
first to get the record size, the second to read() the n bytes of the
record.  The broker code would then check the magic number, do
something with the flag, then (if needed) parse out the record.  This
helps to cut the system calls down.

We might also want a header to the registry file that includes:

	4 bytes in network-byte order for a magic number
	4 bytes in network-byte order for a version number

and maybe some other things like:

	4 bytes in network-byte order for the number of records
	4 bytes in network-byte order for the number of deleted records
	4 bytes in network-byte order for the number of valid records

The version number would tell us for certain how many ASCII fields and
numeric fields there are for each record.
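
The two-read() scheme described above can be sketched in Python.  This is
only an illustration, not the stor_reg.c code.  The magic value, the flag
values, and the field counts (two ASCII fields, two numeric fields) are all
made up for the example, and it assumes the record size counts the bytes
that follow the size field itself, which the notes do not state explicitly.

```python
import io
import struct

# Hypothetical values: the notes do not give the real magic number or flags.
MAGIC = 0x5245474D
FLAG_VALID = 0
FLAG_DELETED = 1
N_ASCII_FIELDS = 2    # assumption: URL and Gatherer Name only
N_NUMERIC_FIELDS = 2  # assumption: "number 1" and "number 2" only


def pack_record(flag, ascii_fields, numeric_fields):
    """Build one record; all integers are 4 bytes in network byte order."""
    body = struct.pack("!II", MAGIC, flag)
    for field in ascii_fields:
        body += struct.pack("!I", len(field)) + field
    for number in numeric_fields:
        body += struct.pack("!I", number)
    # Assumption: the size field counts the bytes that follow it.
    return struct.pack("!I", len(body)) + body


def read_record(stream):
    """Read one record with two reads; return None at end of file."""
    size_bytes = stream.read(4)            # first read: the record size
    if len(size_bytes) < 4:
        return None
    (size,) = struct.unpack("!I", size_bytes)
    body = stream.read(size)               # second read: rest of the record
    if len(body) != size:
        raise ValueError("truncated record")
    magic, flag = struct.unpack_from("!II", body, 0)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    offset = 8
    ascii_fields = []
    for _ in range(N_ASCII_FIELDS):
        (length,) = struct.unpack_from("!I", body, offset)
        offset += 4
        ascii_fields.append(body[offset:offset + length])
        offset += length
    numeric_fields = list(
        struct.unpack_from("!%dI" % N_NUMERIC_FIELDS, body, offset))
    return flag, ascii_fields, numeric_fields


if __name__ == "__main__":
    data = pack_record(FLAG_VALID, [b"http://example.com/", b"gatherer"], [1, 2])
    print(read_record(io.BytesIO(data)))
```

A caller that only wants valid records can look at the returned flag and
skip the field list when it equals the (hypothetical) FLAG_DELETED value.
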
The record counts in the proposed registry header would help to determine
when to garbage collect the registry file, but they would need to be
continually updated.

So, the whole file looks like:

	[registry header of 20 bytes]
	[record header of 12 bytes]	--------|
	[record data of n bytes]		|
	[...]					| n records
	[record header of 12 bytes]		|
	[record data of n bytes]	--------|

The problem with this format is garbage collection.  When you delete an entry,
you just mark the flag in the record header to show that the record was
deleted, and append the new one to the end.  However, the Broker will compress
the Registry every so often.

----------------------------------------------------------------------

Below are the valid Query manager flags to the indexers:

Common:
	#desc				Show Description Lines
	#opaque				Force no matched lines

Glimpse:
	#index case insensitive		Case Insensitive
	#index error number		Allow "number" errors
	#index matchword		Matches on word boundaries
	#index maxresult number		Allow max of "number" results

Wais:
	#index maxresult number		Allow max of "number" results

----------------------------------------------------------------------

Each SOIF object in the Registry contains the following attributes:

	URL			MANDATORY
	Gatherer-Name		MANDATORY
	Gatherer-Host		MANDATORY
	Gatherer-Version	MANDATORY
	Update-Time		MANDATORY
	MD5			OPTIONAL
	Description		OPTIONAL

Two objects are the same if they both have the same:

	Gatherer-Name, Gatherer-Host, Gatherer-Version, Update-Time

and either the same URL or the same MD5.

----------------------------------------------------------------------

Running the Broker:

To start the Broker, type:

	% broker /your/broker.conf [-new | -nocol]

The -new flag causes the broker to begin a new collection.  The broker will do
a collection immediately by default, rather than waiting for the normal
collection time.  This is useful for starting the Broker the very first time.
If you don't want the broker to do a collection on startup, then use the
-nocol flag instead.
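
As a side note on the Registry object comparison rule given earlier (same
Gatherer-Name, Gatherer-Host, Gatherer-Version and Update-Time, plus either
the same URL or the same MD5), here is a minimal Python sketch.  It is not
the Broker's code: objects are modelled as plain dicts keyed by attribute
name, and it assumes that an MD5 missing from both objects does not count as
"the same MD5".

```python
# Attributes that must all match, per the rule in these notes.
IDENTITY_ATTRS = (
    "Gatherer-Name",
    "Gatherer-Host",
    "Gatherer-Version",
    "Update-Time",
)


def same_object(a, b):
    """Return True if two SOIF objects (as dicts) are the same object."""
    # The mandatory attributes are indexed directly, so a missing one
    # raises KeyError instead of silently comparing equal.
    if any(a[name] != b[name] for name in IDENTITY_ATTRS):
        return False
    if a["URL"] == b["URL"]:
        return True
    # MD5 is OPTIONAL.  Assumption: two missing MD5s are not a match.
    md5_a = a.get("MD5")
    md5_b = b.get("MD5")
    return md5_a is not None and md5_a == md5_b


if __name__ == "__main__":
    base = {
        "URL": "http://example.com/a.html",
        "Gatherer-Name": "demo",
        "Gatherer-Host": "gatherer.example.com",
        "Gatherer-Version": "1.0",
        "Update-Time": "796000000",
    }
    mirror = dict(base, URL="http://mirror.example.com/a.html")
    print(same_object(base, dict(base)))   # same URL
    print(same_object(base, mirror))       # different URL, no MD5
```
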

----------------------------------------------------------------------

Gatherer Bookkeeping Attributes:

	Update-Time
		- The time that the summary object was last updated.
		  REQUIRED field, no default.
	Last-Modification-Time
		- The L-M-T of the object itself.  Defaults to 0.
	MD5
		- The unique string identifying the object itself.
		  Defaults to NULL.
	Refresh-Rate
		- The number of seconds after Update-Time when the
		  summary object is to be re-generated.  Defaults
		  to 1 week.
	Time-to-Live
		- The number of seconds after Update-Time when
		  the summary object is no longer valid.  Defaults
		  to 1 month.

----------------------------------------------------------------------

The Broker's Query Result set (we're in the middle of redoing it, sorry) is a
stream of newline separated items with a 3 digit code, space, hyphen, and space
at the beginning of each line.  It looks like this:

	101 - Message to the User
	103 - Error Message to the User
	111 - Error Message to the User that ends the Broker results
	120 - URL of the Match
	122 - Opaque data
	124 - nbytes\nnbytes of Description
	125 - URL of the SOIF object
	126 - URL of the Broker's home page
	130 - End of Object marker

The line '200 - ...' is always sent first (for the version) and should always
be *ignored*.  This message may be sent a few times during the output to test
the connection, so ignore it.

For bulk transfers:

	000 - Bulk xfer success
	400 - Bulk xfer error

----------------------------------------------------------------------

Glimpse Performance Issues:  Limiting the lifetime of 'glimpse' queries:

This is the broker's view of things right now, so far it works very well...

  1. The Broker runs 'glimpse', and allows it to run for LIFETIME seconds;
     it also puts a *hard* time limit of LIFETIME CPU-seconds using setrlimit.
  2. After LIFETIME seconds, if 'glimpse' has not exited, then the Broker
     sends SIGTERM to 'glimpse', sleeps for a few seconds, and sends
     SIGKILL to 'glimpse'.
  3. The Broker sends SIGUSR1 to 'glimpseserver' to verify that it really
     did a clean up.  The SIGTERM to 'glimpse' should send 'glimpseserver'
     a SIGPIPE which will also cause a cleanup.  But the redundancy helps...
  4. The Broker uses whatever results 'glimpse' returned as the result set
     and then sends it to the user.  This is nice for very heavily loaded
     brokers: you can give each user a small time slice worth of result sets.

Use <INPUT TYPE="hidden" NAME="lifetime" VALUE="LIFETIME"> in your query.html
to change the lifetime per query to LIFETIME.

The MAX_LIFETIME seconds value is configurable in the Broker's broker.conf
file.  LIFETIME is always between 10 seconds and MAX_LIFETIME seconds.  By
default, LIFETIME == MAX_LIFETIME, but LIFETIME can be passed along via
query.html.  See Glimpse-MaxLife in broker.conf.

----------------------------------------------------------------------

Debugging:  Use -Dsection,level (or -Dsection for everything) after the
	    broker.conf arg in broker

registry.c	section 70, uses level 1, 5, and 9	  REGISTRY
collector.c	section 71, uses level 1
parser.c	section 72, uses level 1
registry.c	section 73, uses level 1, 5, and 9	  HASH TABLES
stor_man.c	section 74, uses level 1
query_man.c	section 75, uses level 1
event.c		section 76, uses level 1
main.c		section 77, uses level 1
select_loop.c	section 78, uses level 9

----------------------------------------------------------------------

WIP: Proposed query result interface specification (3/95):

  BrokerReturn   --> Version Header Body Trailer
  Version        --> INTERFACEVERSION Separator VersionRev
  VersionRev     --> MajorNumber MinorNumber string
  MajorNumber    --> number
  MinorNumber    --> number
  Header         --> InfoField Header
  Header         -->
  InfoField      --> BROKER_URL       Separator string
  InfoField      --> BROKER_INDEXER   Separator string
  InfoField      --> BROKER_COLLECT   Separator string
  InfoField      --> MESSAGE_TO_USER  Separator string
  InfoField      --> USER_EXT         Separator UserExtType Separator Data
  UserExtType    --> string
  Body           --> BulkTransfer
  Body           --> ObjectList
  Body           -->
  BulkTransfer   --> CompressedBulkTransfer
  BulkTransfer   --> RawBulkTransfer
  CompressedBulkTransfer --> STARTMARKER "gzip'd RawBulkTransfer" ENDMARKER
  RawBulkTransfer   --> @DELETE  { SOIFStream } RawBulkTransfer
  RawBulkTransfer   --> @UPDATE  { SOIFStream } RawBulkTransfer
  RawBulkTransfer   --> @REFRESH { SOIFStream } RawBulkTransfer
  RawBulkTransfer   -->
  SOIFStream     --> SingleSOIFObject SOIFStream
  SOIFStream     -->
  ObjectList     --> Object ObjectList
  ObjectList     -->
  Object         --> OptWarning ResourceURL ObjectURL OptExt ObjectEnd
  OptWarning     --> WARNING Separator WarningNumber string
  OptWarning     -->
  WarningNumber  --> number
  ResourceURL    --> RESOURCE Separator string
  ObjectURL      --> OBJECT Separator string
  OptExt         --> DescData OptExt
  OptExt         --> OpaqueData OptExt
  OptExt         --> AttributeData OptExt
  OptExt         --> UserExtData OptExt
  OptExt         -->
  DescData       --> DESCRIPTION Separator Data
  OpaqueData     --> OPAQUE Separator Data
  AttributeData  --> ATTRIBUTE Separator AttrString Separator Data
  AttrString     --> string
  UserExtData    --> USEREXTENSION Separator ExtentionType Separator Data
  ExtentionType  --> string
  ObjectEnd      --> OBJEND
  Trailer        --> ObjectCount
  Trailer        --> Error
  Trailer        --> Stats
  Trailer        -->
  ObjectCount    --> OBJCOUNT Separator number
  Error          --> ERROR Separator ErrorNumber string
  Stats          --> STATS Separator Data
  ErrorNumber    --> number
  Data           --> MagicNumber Nbytes NbytesOfData
  Nbytes         --> number
  string         --> [^\n]*\n
  number         --> htonl(number)
  MagicNumber    --> htonl(0x329fa1d2)
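
The Data production at the end of the grammar (MagicNumber, Nbytes, then
Nbytes of data) can be illustrated with a small Python sketch.  It assumes
that "htonl(number)" means a 4-byte unsigned integer in network byte order;
the function names encode_data and decode_data are invented for this example
and do not come from the Broker source.

```python
import struct

MAGIC_NUMBER = 0x329FA1D2  # value taken from the MagicNumber production


def encode_data(payload):
    """Encode bytes as: MagicNumber, Nbytes, then the payload itself."""
    # "!II" packs two 4-byte unsigned ints in network (big-endian) order.
    return struct.pack("!II", MAGIC_NUMBER, len(payload)) + payload


def decode_data(buf, offset=0):
    """Decode one Data item at offset; return (payload, next_offset)."""
    magic, nbytes = struct.unpack_from("!II", buf, offset)
    if magic != MAGIC_NUMBER:
        raise ValueError("bad magic number in Data item")
    start = offset + 8
    end = start + nbytes
    if end > len(buf):
        raise ValueError("truncated Data item")
    return buf[start:end], end


if __name__ == "__main__":
    wire = encode_data(b"hello")
    print(wire.hex())
    print(decode_data(wire))
```

Because decode_data returns the offset just past the item, a caller can walk
several consecutive Data items in one buffer by feeding that offset back in.
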
