.TH catdoc 1 "Version 0.91" "MS-Word reader"
.SH NAME
catdoc \- reads an MS-Word file and writes its content as plain text to standard output
.SH SYNOPSIS
.B catdoc
.RB [ -vlu8btawx ]
.RB [ -m
.IR number ]
.RB [ -s
.IR charset ]
.RB [ -d
.IR charset ]
.RB [ -f
.IR output-format ]
.I file
.SH DESCRIPTION
.B catdoc
behaves much like
.BR cat (1),
but it reads an MS-Word file and produces human-readable text on standard output.
Optionally it can use
.BR latex (1)
escape sequences for characters which have special meaning for LaTeX.
It also makes some effort to recognize MS-Word tables, although it never
tries to write correct headers for the LaTeX tabular environment. Additional
output formats, such as HTML, can be easily defined.
.PP
.B catdoc
doesn't attempt to extract formatting information other than tables from
the MS-Word document, so the output modes differ mainly in which
characters are escaped and in how characters
missing from the output charset are represented. See CHARACTER SUBSTITUTION below.
.PP
.B catdoc
uses an internal
.BR unicode (4)
representation of text, so it is able to convert texts when the charset of the
source document doesn't match the charset of the target system.
See CHARACTER SETS below.
.PP
If no file names are supplied,
.B catdoc
processes its standard input unless it is a terminal. It is unlikely that
somebody could type a Word document from the keyboard, so if
.B catdoc
is invoked without arguments and stdin is not redirected, it prints a brief
usage message and exits.
Processing of standard input (even among other files) can be forced by using
a dash '-' as the file name.
.PP
By default,
.B catdoc
wraps lines which are more than 72 chars long and separates paragraphs by
blank lines. This behavior can be turned off by the
.B -w
switch. In
.I wide
mode
.B catdoc
prints each paragraph as one long line, suitable for import into
word processors which perform word wrapping themselves.
.SH OPTIONS
.TP 8
.B -a
- shortcut for -f ascii.
Produces ASCII text as output.
Separates table columns with TAB.
.TP 8
.B -b
- process broken MS-Word file. Normally,
.B catdoc
checks if the first 8 bytes
of the file are the Microsoft OLE signature. If so, it processes the file; otherwise
it just copies it to stdout. This is intended to allow using
.B catdoc
as a filter for viewing all files with the
.I .doc
extension.
.TP 8
.BI -d " charset"
- specifies the destination charset name. The charset file has the format described in
CHARACTER SETS below and should have a
.B .txt
extension and reside in the
.B catdoc
library directory (normally /usr/local/lib/catdoc).
.TP 8
.BI -f " format"
- specifies the output format as described in CHARACTER SUBSTITUTION below.
.B catdoc
comes with two output formats - ascii and tex. You can add your own if you
wish.
.TP 8
.B -l
Causes
.B catdoc
to list the names of available charsets to stdout and exit successfully.
.TP 8
.BI -m " number"
Specifies the right margin for text (default 72).
.B -m 0
is equivalent to
.BR -w .
.TP 8
.BI -s " charset"
Specifies the source charset (the one used in the Word document), if the Word document
doesn't contain UTF-16 text.
.TP 8
.B -t
- shortcut for
.BR "-f tex" .
Converts all printable chars which have special meaning for
.BR LaTeX (1)
into appropriate control sequences. Separates table columns by
.BR & .
.TP 8
.B -u
- declares that the Word document contains a UNICODE (UTF-16) representation
of text (as some Word-97 documents do). If catdoc fails to process the Word document
correctly with the default charset, try this option.
.TP 8
.B -8
- declares that the Word document is 8-bit. Just in case catdoc recognizes the
file format incorrectly.
.TP 8
.B -w
disables word wrapping. By default
.B catdoc
output is split into lines not longer than 72 (or the number specified by the
-m option) characters, and paragraphs
are separated by a blank line. With this option each paragraph is one
long line.
.TP 8
.B -x
causes catdoc to output unknown UNICODE characters as \exNNNN instead
of question marks.
.TP 8
.B -v
causes catdoc to print some useless information about the Word document
structure to stdout before the actual start of the text.
.SH CHARACTER SETS
When processing an MS-Word file,
.B catdoc
uses information about two character sets, typically different: input and
output. They are stored in plain text files in the
.B catdoc
library directory. Each line of a character set file should contain two whitespace-separated
hexadecimal numbers: the 8-bit code in the character set and the 16-bit unicode code.
Anything from a hash mark to the end of line is ignored, as are blank lines.
The
.B catdoc
distribution includes some of these character sets. Additional character set
definitions, directly usable by
.B catdoc
can be obtained from ftp.unicode.org. Charset files have the
.B .txt
suffix, which shouldn't be specified on the command line or in configuration
files.
.PP
Note that
.B catdoc
is distributed with Cyrillic charsets as default. If you are not
Russian, you probably don't want them, and should reconfigure catdoc at
compile time or in the runtime configuration file.
.PP
When dealing with documents with charsets other than the default, remember
that Microsoft never uses ISO charsets. While letters in, say, cp1252 are
at the same positions as in ISO-8859-1, some punctuation signs would be
lost if you specify ISO-8859-1 as the input charset. If you use cp1252,
catdoc would deal with those signs as described in CHARACTER
SUBSTITUTION below.
.SH CHARACTER SUBSTITUTION
.B catdoc
converts the MS-Word file into the following internal unicode representation:
.TP 4
1.
Paragraphs are separated by the ASCII Line Feed symbol (0x000A)
.TP 4
2.
Table cells within a row are separated by the ASCII File Separator (FS) symbol
(0x001C)
.TP 4
3.
Table rows are separated by the ASCII Record Separator (0x001E)
.TP 4
4.
All printable characters, including whitespace, are represented by their
respective UNICODE codes.
.PP
This UNICODE representation is subsequently converted into 8-bit text in the
target character set using the following four-step algorithm:
.TP 4
1.
The list of special characters is searched for the given unicode character.
If found, the appropriate multi-character sequence is output instead of the
character.
.TP 4
2.
If there is an equivalent in the target character set, it is output.
.TP 4
3.
Otherwise, the replacement list is searched and, if there is a multi-character
substitution for this UNICODE char, it is output.
.TP 4
4.
If all of the above fail, the "Unknown char" symbol (question mark) is output.
.PP
The list of special characters and the list of substitutions are character
set-independent, because special chars should be escaped regardless of their
existence in the target character set (usually, they are part of US-ASCII, and
therefore exist in any character set), and the replacement list is searched only
for those characters which are not found in the target character set.
.PP
These lists are stored in the
.B catdoc
library directory in files with the format name as prefix. These files have the
following format:
.PP
Each line can be either a comment (starting with a hash mark) or contain a
hexadecimal UNICODE value, separated by whitespace from the string which
would be substituted for it. If the string contains no whitespace it
can be used as is; otherwise it should be enclosed in single or double
quotes. Usual backslash sequences like
.IR '\en' , '\et'
can be used in these strings.
.SH RUNTIME CONFIGURATION
Upon startup catdoc reads its system-wide configuration file
.RB ( catdocrc
in the
.B catdoc
library directory) and then the
user-specific configuration file
.BR ${HOME}/.catdocrc .
.PP
These files can contain the following directives:
.TP 8
.BI "source_charset = " charset-name
Sets the default source charset, which will be used if no
.B -s
option is specified.
Consult the configuration of a nearby Windows
workstation to find the one you need.
.TP 8
.BI "target_charset = " charset-name
Sets the default output charset. You probably know which one you use.
.TP 8
.BI "charset_path = " directory-list
Colon-separated list of directories which are searched for charset files.
This allows you to install additional charsets in your home directory.
.TP 8
.BI "map_path = " directory-list
Colon-separated list of directories which are searched for the special character
map and the replacement map.
.TP 8
.BI "format = " "format name"
Output format which will be used by default.
.B catdoc
comes with two formats -
.BR ascii " and " tex ,
but nothing prevents you from writing your own format (a set of two map files:
the special character map and the replacement map).
.TP 8
.BI "unknown_char = " "character specification"
Sets the character to output instead of an unknown unicode character (default '?').
The character specification can have one of two forms: a character enclosed in
single quotes or a hexadecimal code.
.SH BUGS
Doesn't handle
fast-saves properly. Prints footnotes as separate paragraphs at the end of the
file, instead of producing correct LaTeX commands. Cannot distinguish
between an empty table cell and the end of a table row.
.SH "SEE ALSO"
.BR xls2csv (1),
.BR cat (1),
.BR strings (1),
.BR utf (4),
.BR unicode (4)
.SH AUTHOR
V.B.Wagner <vitus@ice.ru>
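.SH EXAMPLE
The charset file format from CHARACTER SETS and the four-step conversion from
CHARACTER SUBSTITUTION can be sketched in Python roughly as follows.
This is only an illustrative sketch under stated assumptions, not
.B catdoc
source code: the names parse_charset, convert_char and convert_text are
hypothetical, and the sample charset, the HTML-like escape for the ampersand
and the replacement for the copyright sign are made up for illustration.
.PP
.nf
```python
# Illustrative sketch only; this is not catdoc's actual implementation.
# Every function and variable name below is hypothetical.

def parse_charset(text):
    """Parse charset-file text into {unicode code point: 8-bit code}.

    Each data line holds two whitespace-separated hexadecimal numbers:
    the 8-bit code and the 16-bit Unicode code. Anything from a hash
    mark to end of line is ignored, as are blank lines.
    """
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code8, code16 = line.split()[:2]
        table[int(code16, 16)] = int(code8, 16)
    return table


def convert_char(cp, special, charset, replacement, unknown="?"):
    """Apply the four-step substitution to one Unicode code point."""
    if cp in special:            # 1. special characters are always escaped
        return special[cp]
    if cp in charset:            # 2. equivalent exists in the target charset
        return chr(charset[cp])  #    (returned as str for readability)
    if cp in replacement:        # 3. multi-character replacement
        return replacement[cp]
    return unknown               # 4. nothing matched


def convert_text(text, special, charset, replacement, unknown="?"):
    """Convert a whole string one code point at a time."""
    return "".join(
        convert_char(ord(ch), special, charset, replacement, unknown)
        for ch in text
    )


# Hypothetical sample data, made up for this example.
CHARSET = parse_charset("""
0x41 0x0041  # LATIN CAPITAL LETTER A
0x26 0x0026  # AMPERSAND
""")
SPECIAL = {0x0026: "&amp;"}      # e.g. for an HTML-like output format
REPLACEMENT = {0x00A9: "(c)"}    # COPYRIGHT SIGN
```
.fi
.PP
The special character list is consulted before the target charset, which is
why the ampersand is escaped even though the sample charset contains it; the
replacement list is reached only for characters the charset lacks.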