A Fast Method for Identifying Plain Text Files
==============================================

Introduction
------------

Given a file coming from an unknown source, it is sometimes desirable
to find out whether the format of that file is plain text. Although
this may appear like a simple task, a fully accurate detection of the
file type requires heavy-duty semantic analysis on the file contents.
It is, however, possible to obtain satisfactory results by employing
various heuristics.

Previous versions of PKZip and other zip-compatible compression tools
were using a crude detection scheme: if more than 80% (4/5) of the bytes
found in a certain buffer are within the range [7..127], the file is
labeled as plain text; otherwise it is labeled as binary. A prominent
limitation of this scheme is the restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes within the range [128..255], and texts using these alphabets
are most often misidentified by this scheme; in other words, the rate
of false negatives is sometimes too high, which means that the recall
is low. Another weakness of this scheme is a reduced precision, due to
the false positives that may occur when binary files containing large
amounts of textual characters are misidentified as plain text.

In this article we propose a new, simple detection scheme that features
a much increased precision and a near-100% recall. This scheme is
designed to work on ASCII, Unicode and other ASCII-derived alphabets,
and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
and variable-sized encodings (ISO-2022, UTF-8, etc.).
Wider encodings (UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.

The Algorithm
-------------

The algorithm works by dividing the set of bytecodes [0..255] into three
categories:

- The white list of textual bytecodes:
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
- The gray list of tolerated bytecodes:
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
- The black list of undesired, non-textual bytecodes:
  0 (NUL) to 6, 14 to 31.

If a file contains at least one byte that belongs to the white list and
no byte that belongs to the black list, then the file is categorized as
plain text; otherwise, it is categorized as binary. (The boundary case,
when the file is empty, automatically falls into the latter category.)

Rationale
---------

The idea behind this algorithm relies on two observations.

The first observation is that, although the full range of 7-bit codes
[0..127] is properly specified by the ASCII standard, most control
characters in the range [0..31] are not used in practice. The only
widely-used, almost universally-portable control codes are 9 (TAB),
10 (LF) and 13 (CR). There are a few more control codes that are
recognized on a reduced range of platforms and text viewers/editors:
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
codes are rarely (if ever) used alone, without being accompanied by
some printable text. Even the newer, portable text formats such as
XML avoid using control characters outside the list mentioned here.

The second observation is that most binary files tend to contain
control characters, especially 0 (NUL). Even though the older text
detection schemes observe the presence of non-ASCII codes from the range
[128..255], the precision rarely has to suffer if this upper range is
labeled as textual, because files that are genuinely binary tend to
contain both control characters and codes from the upper range. On the
other hand, the upper range needs to be labeled as textual, because it
is used by virtually all ASCII extensions.
In particular, this range is used for encoding non-Latin scripts.

Since there is no counting involved, other than simply observing the
presence or the absence of some byte values, the algorithm produces
consistent results, regardless of what alphabet encoding is being used.
(If counting were involved, it could be possible to obtain different
results on a text encoded, say, using ISO-8859-16 versus UTF-8.)

There is an extra category of plain text files that are "polluted" with
one or more black-listed codes, either by mistake or by peculiar design
considerations. In such cases, a scheme that tolerates a small fraction
of black-listed codes would provide an increased recall (i.e. more true
positives). This, however, incurs a reduced precision overall, since
false positives are more likely to appear in binary files that contain
large chunks of textual data. Furthermore, "polluted" plain text should
be regarded as binary by general-purpose text detection schemes, because
general-purpose text processing algorithms might not be applicable.
Under this premise, it is safe to say that our detection method provides
a near-100% recall.

Experiments have been run on many files coming from various platforms
and applications. We tried plain text files, system logs, source code,
formatted office documents, compiled object code, etc. The results
confirm the optimistic assumptions about the capabilities of this
algorithm.

--
Cosmin Truta
Last updated: 2006-May-28
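
Appendix: Sample Implementation
-------------------------------

The classification rule described above is simple enough to fit in a few
lines of C. The sketch below is only an illustration of the white/gray/
black list rule, not the exact code shipped in zlib or any other tool;
the function name is_plain_text and the buffer-based interface are
choices made for this example.

```c
#include <stddef.h>

/* Classify a buffer as plain text (returns 1) or binary (returns 0),
 * following the rule described above: text requires at least one
 * white-listed byte and no black-listed byte.  An empty buffer is
 * classified as binary. */
static int is_plain_text(const unsigned char *buf, size_t len)
{
    int seen_white = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c >= 32 || c == 9 || c == 10 || c == 13) {
            seen_white = 1;     /* white list: TAB, LF, CR, 32..255 */
        } else if (c == 7 || c == 8 || c == 11 || c == 12
                   || c == 26 || c == 27) {
            /* gray list: BEL, BS, VT, FF, SUB, ESC -- tolerated,
             * but on its own it does not prove the file is text */
        } else {
            return 0;           /* black list: 0..6, 14..31 */
        }
    }
    return seen_white;
}
```

Note that a file consisting solely of gray-listed bytes is classified
as binary, since it contains no white-listed byte; this matches the
observation that the gray-listed codes rarely appear without being
accompanied by printable text.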