<HTML>
<HEAD>
<TITLE>Developer.com - Online Reference Library - 0672311739:RED HAT LINUX 2ND EDITION:GNU Project Utilities</TITLE>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<SCRIPT>
<!--
function displayWindow(url, width, height) {
var Win = window.open(url,"displayWindow",'width=' + width +
',height=' + height + ',resizable=1,scrollbars=yes');
}
//-->
</SCRIPT>
</HEAD>
<!-- ISBN=0672311739 //-->
<!-- TITLE=RED HAT LINUX 2ND EDITION //-->
<!-- AUTHOR=DAVID PITTS ET AL //-->
<!-- PUBLISHER=MACMILLAN //-->
<!-- IMPRINT=SAMS PUBLISHING //-->
<!-- PUBLICATION DATE=1998 //-->
<!-- CHAPTER=17 //-->
<!-- PAGES=0351-0372 //-->
<!-- UNASSIGNED1 //-->
<!-- UNASSIGNED2 //-->
<P><CENTER>
<a href="0365-0367.html">Previous</A> | <a href="../ewtoc.html">Table of Contents</A> | <a href="0371-0372.html">Next</A>
</CENTER></P>
<A NAME="PAGENUM-368"><P>Page 368</P></A>
<H4><A NAME="ch17_ 18">
The split Command
</A></H4>
<P>The split command is probably one of the handiest
commands for transporting large files around. One of its most common uses is to split up compressed source files (to upload in
pieces or fit on a floppy). The basic syntax is
</P>
<!-- CODE SNIP //-->
<PRE>
split [options] filename [output prefix]
</PRE>
<!-- END CODE SNIP //-->
<P>where the options and output prefix are optional. If no output prefix is given,
split uses the prefix x, and the output files are named
xaa, xab, xac, and so on. By default, split puts
1000 lines in each output file (the last file can have fewer than 1000 lines). Because
1000 lines can add up to very different file sizes, the -b or
--bytes option lets you split by size instead. Its syntax is
</P>
<!-- CODE SNIP //-->
<PRE>
-b bytes[bkm]
</PRE>
<!-- END CODE SNIP //-->
<P>or
</P>
<!-- CODE SNIP //-->
<PRE>
--bytes=bytes[bkm]
</PRE>
<!-- END CODE SNIP //-->
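<P>where bytes is a number and the optional suffix is a size multiplier:
</P>
<TABLE WIDTH="360">
<TR><TD>
b
</TD><TD>
512 bytes
</TD></TR>
<TR><TD>
k
</TD><TD>
1KB (1024 bytes)
</TD></TR>
<TR><TD>
m
</TD><TD>
1MB (1,048,576 bytes)
</TD></TR>
</TABLE>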
<P>Thus,
</P>
<!-- CODE SNIP //-->
<PRE>
split -b1000k JDK.tar.gz
</PRE>
<!-- END CODE SNIP //-->
<P>will split the file JDK.tar.gz into 1000KB pieces named xaa, xab, and so on. To get output files
named JDK.tar.gz.aa, JDK.tar.gz.ab, and so on, give the prefix JDK.tar.gz. (with the trailing period):
</P>
<!-- CODE SNIP //-->
<PRE>
split -b1000k JDK.tar.gz JDK.tar.gz.
</PRE>
<!-- END CODE SNIP //-->
<P>This would create 1000KB files that could be copied to a floppy or uploaded one at a time
over a slow modem link.
</P>
<P>When the files reach their destination, they can be
joined by using cat:
</P>
<!-- CODE SNIP //-->
<PRE>
cat JDK.tar.gz.* > JDK.tar.gz
</PRE>
<!-- END CODE SNIP //-->
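<P>The round trip can be tried on a small scale. The following is a minimal sketch, assuming a POSIX shell with split, head -c, and cmp available; the file name sample.bin and the 10KB piece size are made up for illustration and do not come from the text above.
</P>

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a 25,000-byte sample file (hypothetical stand-in for JDK.tar.gz).
head -c 25000 /dev/zero > sample.bin

# Split into pieces of at most 10KB (10,240 bytes each),
# named sample.bin.aa, sample.bin.ab, and so on.
split -b 10k sample.bin sample.bin.

# Count the pieces by letting the shell expand the glob.
set -- sample.bin.??
pieces=$#

# Rejoin the pieces; the shell expands the glob in sorted order.
cat sample.bin.?? > rejoined.bin

# cmp -s exits with status 0 only if the files are byte-for-byte identical.
if cmp -s sample.bin rejoined.bin; then
    roundtrip=ok
else
    roundtrip=differs
fi
echo "$pieces pieces, round trip: $roundtrip"
```

<P>The glob is safe here because split names the pieces in alphabetical order, which is also the order in which the shell expands the pattern.
</P>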
<P>A command that is useful for confirming whether or not a split file has been joined correctly
is the cksum command. Historically, it has been used to confirm if files have been
transferred properly over noisy phone lines.
</P>
<P>cksum computes a cyclic redundancy check (CRC) for each filename argument and prints
out the CRC along with the number of bytes in the file and the filename. The easiest way to
compare the CRC for the two files is to get the CRC for the original file:
</P>
<!-- CODE SNIP //-->
<PRE>
cksum JDK.tar.gz > JDK.crc
</PRE>
<!-- END CODE SNIP //-->
<P>and then compare it to the output of cksum for the joined file.
</P>
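<P>One way to do the comparison in a script is sketched here. The file names original.dat and joined.dat are hypothetical. Reading from standard input makes cksum leave out the filename, so the two results can be compared as plain strings.
</P>

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-ins for the original file and the rejoined copy.
printf 'some sample data\n' > original.dat
cp original.dat joined.dat

# cksum prints the CRC and the byte count; with input on stdin
# the filename field is omitted.
orig_sum=$(cksum < original.dat)
joined_sum=$(cksum < joined.dat)

if [ "$orig_sum" = "$joined_sum" ]; then
    verdict="checksums match"
else
    verdict="checksums differ"
fi
echo "$verdict"
```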
<A NAME="PAGENUM-369"><P>Page 369</P></A>
<H4><A NAME="ch17_ 19">
Counting Words
</A></H4>
<P>Counting words is a handy thing to be able to do,
and there are many ways to do it. Probably the easiest is the
wc command, which stands for word count, but wc only prints the total number
of characters, words, and lines. What if you need a breakdown by word? It's a good
problem, and one that serves to introduce the next set of GNU text utilities.
</P>
<P>Here are the commands you need:
</P>
<TABLE WIDTH="360">
<TR><TD>
tr
</TD><TD>
Transliterate; changes the first set of characters it is given into the
second set of characters it is given; also deletes characters
</TD></TR>
<TR><TD>
sort
</TD><TD>
Sorts the file (or its standard input)
</TD></TR>
<TR><TD>
uniq
</TD><TD>
Prints out all the unique lines in a file (collapses duplicates into one
line and optionally gives a count)
</TD></TR>
</TABLE>
<P>I used this chapter as the text for this example. First, this line replaces the punctuation,
brackets, and braces in the input file with spaces:
</P>
<!-- CODE SNIP //-->
<PRE>
tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc
</PRE>
<!-- END CODE SNIP //-->
<P>This demonstrates the basic usage of tr:
</P>
<!-- CODE SNIP //-->
<PRE>
tr 'set1' 'set2'
</PRE>
<!-- END CODE SNIP //-->
<P>This replaces each character in set1 with the character at the same position in
set2. Usually, the characters themselves are used, but the standard C escape sequences work also (as you
will see).
</P>
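<P>A small illustration of the positional mapping, using made-up input strings rather than the chapter text:
</P>

```shell
# Map each character of set1 to the character at the same position
# in set2: l becomes 0 and o becomes 1.
mapped=$(printf 'Hello, World!\n' | tr 'lo' '01')
echo "$mapped"

# C escape sequences work inside the sets: turn every tab into a newline.
lines=$(printf 'one\ttwo\tthree\n' | tr '\t' '\n')
echo "$lines"
```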
<P>I specified set2 as ' ' (the space character) because words separated by those characters need
to remain separate. The next step is to fold the uppercase and lowercase versions of each word together
because the words To and to, the and The, and
Files and files are really the same word. To do
this, tell tr to change all the capital characters
'A-Z' into lowercase characters 'a-z':
</P>
<!-- CODE SNIP //-->
<PRE>
tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z'
</PRE>
<!-- END CODE SNIP //-->
<P>I broke the command into two lines, with the pipe character as the last character in the
first line so that the shell (sh, bash, ksh) will do the right thing and use the next line as the
command to pipe to. It's easier to read and cut and paste from an
xterm this way, also. This won't work under csh or
tcsh unless you start one of the preceding shells.
</P>
<P>Multiple spaces in the output can be squeezed into single spaces with
</P>
<!-- CODE SNIP //-->
<PRE>
tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z' | tr -s ' '
</PRE>
<!-- END CODE SNIP //-->
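<P>The effect of -s is easy to see on a short string. This sketch uses made-up input; the second command relies on tr applying -s to the characters of set2 after translating, which is how GNU tr behaves.
</P>

```shell
# Squeeze each run of spaces down to a single space.
squeezed=$(printf 'too    many   spaces\n' | tr -s ' ')
echo "$squeezed"

# Translate and squeeze in one step: commas and periods become spaces,
# and the resulting runs of spaces are squeezed.
combined=$(printf 'a, b.  c\n' | tr -s ',.' ' ')
echo "$combined"
```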
<P>To get a count of how many times each word is used, you need to sort the file. In its
simplest form, the sort command compares whole lines, so you need to have one word per line to get a
good sort. This code deletes all of the tabs (\t) and the newlines
(\n) and then changes all the spaces into newlines:
</P>
<A NAME="PAGENUM-370"><P>Page 370</P></A>
<!-- CODE SNIP //-->
<PRE>
tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z' | tr -s ' ' | tr -d '\t\n' | tr ' ' '\n'
</PRE>
<!-- END CODE SNIP //-->
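<P>One thing to watch for: deleting a newline glues the last word of a line to the first word of the next line unless the line ends in a space. The sketch below, which uses made-up input and is not the pipeline that produced the results in this chapter, compares deleting with translating. It assumes a tr that pads a short set2 with its last character, as GNU tr does.
</P>

```shell
# Sample text containing a newline and a tab between words.
sample='The cat\nsat on\tthe mat\n'

# Deleting tabs and newlines first joins neighboring words
# ("cat" + "sat", "on" + "the").
deleted=$(printf "$sample" | tr -d '\t\n' | tr ' ' '\n')

# Translating every space, tab, and newline to a newline, and squeezing
# repeats with -s, keeps one word per line.
translated=$(printf "$sample" | tr -s ' \t\n' '\n')

echo "deleted:";    echo "$deleted"
echo "translated:"; echo "$translated"
```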
<P>Now you can sort the output, so simply tack on the
sort command:
</P>
<!-- CODE SNIP //-->
<PRE>
tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z' | tr -s ' ' | tr -d '\t\n' | tr ' ' '\n' | sort
</PRE>
<!-- END CODE SNIP //-->
<P>You could eliminate all the repeats at this point by giving
sort the -u (unique) option, but you need a count of the repeats, so use the
uniq command. By default, the uniq command prints out "the unique lines in a sorted file, discarding all but one of a run of matching
lines" (man page uniq). uniq requires sorted input because it only compares consecutive lines. To
get uniq to print out how many times a word occurs, give it the
-c (count) option:
</P>
<!-- CODE SNIP //-->
<PRE>
tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z' | tr -s ' ' | tr -d '\t\n' |
tr ' ' '\n' | sort | uniq -c
</PRE>
<!-- END CODE SNIP //-->
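<P>A short demonstration of why the sort has to come first, using a made-up list of five words:
</P>

```shell
# Five lines in which no two equal lines are adjacent.
words='to\nthe\nto\nthe\nto\n'

# Without sorting, uniq finds no consecutive duplicates to collapse.
unsorted_count=$(printf "$words" | uniq | wc -l)

# After sorting, equal lines are adjacent and uniq -c can count them.
counted=$(printf "$words" | sort | uniq -c)
echo "$counted"
```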
<P>Next, you need to sort the output again because uniq -c lists
the words in alphabetical order, not in order of their counts. This time, to get
sort to compare by numeric value instead of by string
comparison and to print the largest number first, give sort the
-n (numeric) and -r (reverse) options:
</P>
<!-- CODE SNIP //-->
<PRE>
tr '!?":;[]{}(),.' ' ' < ~/docs/ch16.doc |
tr 'A-Z' 'a-z' | tr -s ' ' | tr -d '\t\n' |
tr ' ' '\n' | sort | uniq -c | sort -rn
</PRE>
<!-- END CODE SNIP //-->
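<P>For repeated use, the steps can be wrapped in a shell function that reads standard input. This is a sketch only: the name word_freq is made up, the sample sentence is not from the chapter, and tabs and newlines are translated to newlines here rather than deleted, which assumes a tr that pads a short set2 as GNU tr does.
</P>

```shell
# word_freq: hypothetical helper that reads text on standard input and
# prints "count word" lines, most frequent word first.
# Steps: punctuation to spaces, fold case, one word per line,
# sort, count duplicates, sort by count in descending order.
word_freq() {
    tr '!?":;[]{}(),.' ' ' |
    tr 'A-Z' 'a-z' |
    tr -s ' \t\n' '\n' |
    sort | uniq -c | sort -rn
}

# Show the most frequent word of a two-line sample.
top=$(printf 'The cat saw the dog.\nThe dog ran!\n' | word_freq | head -n 1)
echo "$top"
```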
<P>The first ten lines (I piped the output to
head) look like this:
</P>
<!-- CODE //-->
<PRE>
389 the
164 to
127 of
115 is
115 and
111 a
80 files
70 file
69 in
65 `
</PRE>
<!-- END CODE //-->
<P>Note that the tenth most common word is the single quote character, but I said we took
care of the punctuation with the very first tr. Well, I lied (sort of); we took care of all the
characters that would fit between quotes, and a single quote won't fit. So why not just backslash
escape that sucker? Well, not all shells will handle that properly.
</P>
<P>So what's the solution?
</P>
<P>The solution is to use the predefined character classes in
tr. The tr command knows several character classes, and the punctuation class is one of them.
Here is a complete list (names and definitions) of class names, from the man page for
tr:
</P>
<TABLE WIDTH="360">
<TR><TD>
alnum
</TD><TD>
Letters and digits
</TD></TR>
<TR><TD>
alpha
</TD><TD>
Letters
</TD></TR>
</TABLE>
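<P>As a sketch of how a class is used, the punctuation class is written [:punct:] inside the set. The input string here is made up, and the example assumes a tr that supports POSIX character classes, as GNU tr does.
</P>

```shell
# [:punct:] matches every punctuation character, including the single
# quote that could not be written inside a single-quoted set.
cleaned=$(printf '%s\n' "don't stop, (please)!" | tr '[:punct:]' ' ')
echo "$cleaned"
```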
</td>
</tr>
</table>
<!-- begin footer information -->
</body></html>