<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<TITLE>Text Processing Commands</TITLE>
<META NAME="GENERATOR" CONTENT="Modular DocBook HTML Stylesheet Version 1.76b+">
<LINK REL="HOME" TITLE="Advanced Bash-Scripting Guide" HREF="index.html">
<LINK REL="UP" TITLE="External Filters, Programs and Commands" HREF="external.html">
<LINK REL="PREVIOUS" TITLE="Time / Date Commands" HREF="timedate.html">
<LINK REL="NEXT" TITLE="File and Archiving Commands" HREF="filearchiv.html">
<META HTTP-EQUIV="Content-Style-Type" CONTENT="text/css">
<LINK REL="stylesheet" HREF="common/kde-common.css" TYPE="text/css">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META HTTP-EQUIV="Content-Language" CONTENT="en">
<LINK REL="stylesheet" HREF="common/kde-localised.css" TYPE="text/css" TITLE="KDE-English">
<LINK REL="stylesheet" HREF="common/kde-default.css" TYPE="text/css" TITLE="KDE-Default">
</HEAD>
<BODY CLASS="SECT1" BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#AA0000" VLINK="#AA0055" ALINK="#AA0000" STYLE="font-family: sans-serif;">
<DIV CLASS="NAVHEADER">
<TABLE SUMMARY="Header navigation table" WIDTH="100%" BORDER="0" CELLPADDING="0" CELLSPACING="0">
<TR><TH COLSPAN="3" ALIGN="center">Advanced Bash-Scripting Guide: An in-depth exploration of the art of shell scripting</TH></TR>
<TR><TD WIDTH="10%" ALIGN="left" VALIGN="bottom"><A HREF="timedate.html" ACCESSKEY="P">Prev</A></TD><TD WIDTH="80%" ALIGN="center" VALIGN="bottom">Chapter 15. External Filters, Programs and Commands</TD><TD WIDTH="10%" ALIGN="right" VALIGN="bottom"><A HREF="filearchiv.html" ACCESSKEY="N">Next</A></TD></TR>
</TABLE>
<HR ALIGN="LEFT" WIDTH="100%">
</DIV>
<DIV CLASS="SECT1">
<H1 CLASS="SECT1"><A NAME="TEXTPROC"></A>15.4. Text Processing Commands</H1>
<DIV CLASS="VARIABLELIST">
<P><B><A NAME="TPCOMMANDLISTING1"></A>Commands affecting text and text files</B></P>
<DL>
<DT><A NAME="SORTREF"></A><B CLASS="COMMAND">sort</B></DT>
<DD><P>File sort utility, often used as a filter in a pipe.
This command sorts a <I CLASS="FIRSTTERM">text stream</I> or file forwards or backwards, or according to various keys or character positions. Using the <TT CLASS="OPTION">-m</TT> option, it merges presorted input files. The <I CLASS="FIRSTTERM">info page</I> lists its many capabilities and options. See <A HREF="loops.html#FINDSTRING">Example 10-9</A>, <A HREF="loops.html#SYMLINKS">Example 10-10</A>, and <A HREF="contributed-scripts.html#MAKEDICT">Example A-8</A>.</P></DD>
<DT><A NAME="TSORTREF"></A><B CLASS="COMMAND">tsort</B></DT>
<DD><P><I CLASS="FIRSTTERM">Topological sort</I>, reading in pairs of whitespace-separated strings, where each pair states that the first item must precede the second, and printing an ordering consistent with those constraints. The original purpose of <B CLASS="COMMAND">tsort</B> was to sort a list of dependencies for an obsolete version of the <I CLASS="FIRSTTERM">ld</I> linker in an <SPAN CLASS="QUOTE">"ancient"</SPAN> version of UNIX.</P>
<P>The results of a <I CLASS="FIRSTTERM">tsort</I> will usually differ markedly from those of the standard <B CLASS="COMMAND">sort</B> command, above.</P></DD>
<DT><A NAME="UNIQREF"></A><B CLASS="COMMAND">uniq</B></DT>
<DD><P>This filter removes adjacent duplicate lines, so it is normally applied to a sorted file. It is often seen in a pipe coupled with <A HREF="textproc.html#SORTREF">sort</A>.</P>
<P><TABLE BORDER="0" BGCOLOR="#E0E0E0" WIDTH="90%"><TR><TD><PRE CLASS="PROGRAMLISTING">cat list-1 list-2 list-3 | sort | uniq > final.list
# Concatenates the list files,
# sorts them,
# removes duplicate lines,
# and finally writes the result to an output file.</PRE></TD></TR></TABLE></P>
<P>The useful <TT CLASS="OPTION">-c</TT> option prefixes each line of output with its number of occurrences.</P>
<P> <TABLE BORDER="0" BGCOLOR="#E0E0E0" WIDTH="90%"><TR><TD><PRE CLASS="SCREEN"><TT CLASS="PROMPT">bash$ </TT><TT CLASS="USERINPUT"><B>cat testfile</B></TT>
<TT CLASS="COMPUTEROUTPUT">This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.</TT>

<TT CLASS="PROMPT">bash$ </TT><TT CLASS="USERINPUT"><B>uniq -c testfile</B></TT>
<TT CLASS="COMPUTEROUTPUT">      1 This line occurs only once.
      2 This line occurs twice.
      3 This line occurs three times.</TT>

<TT CLASS="PROMPT">bash$ </TT><TT CLASS="USERINPUT"><B>sort testfile | uniq -c | sort -nr</B></TT>
<TT CLASS="COMPUTEROUTPUT">      3 This line occurs three times.
      2 This line occurs twice.
      1 This line occurs only once.</TT>
</PRE></TD></TR></TABLE> </P>
<P>The <TT CLASS="USERINPUT"><B>sort INPUTFILE | uniq -c | sort -nr</B></TT> command string produces a <I CLASS="FIRSTTERM">frequency of occurrence</I> listing for the <TT CLASS="FILENAME">INPUTFILE</TT> file (the <TT CLASS="OPTION">-nr</TT> options to <B CLASS="COMMAND">sort</B> cause a reverse numerical sort). This template finds use in analysis of log files and dictionary lists, and wherever the lexical structure of a document needs to be examined.</P>
<DIV CLASS="EXAMPLE"><HR><A NAME="WF"></A><P><B>Example 15-12. Word Frequency Analysis</B></P>
<TABLE BORDER="0" BGCOLOR="#E0E0E0" WIDTH="90%"><TR><TD><PRE CLASS="PROGRAMLISTING">#!/bin/bash
# wf.sh: Crude word frequency analysis on a text file.
# This is a more efficient version of the "wf2.sh" script.


# Check for input file on command line.
ARGS=1
E_BADARGS=65
E_NOFILE=66

if [ $# -ne "$ARGS" ]  # Correct number of arguments passed to script?
then
  echo "Usage: $(basename "$0") filename"
  exit $E_BADARGS
fi

if [ ! -f "$1" ]       # Check if file exists.
then
  echo "File \"$1\" does not exist."
  exit $E_NOFILE
fi



########################################################
# main ()
sed -e 's/\.//g'  -e 's/\,//g' -e 's/ /\
/g' "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
#                           =========================
#                            Frequency of occurrence

#  Filter out periods and commas, and
#+ change space between words to linefeed,
#+ then shift characters to lowercase, and
#+ finally prefix occurrence count and sort numerically.

#  Arun Giridhar suggests modifying the above to:
#  . . . | sort | uniq -c | sort +1 [-f] | sort +0 -nr
#  This adds a secondary sort key, so instances of
#+ equal occurrence are sorted alphabetically.
#  As he explains it:
#  "This is effectively a radix sort, first on the
#+ least significant column
#+ (word or string, optionally case-insensitive)
#+ and last on the most significant column (frequency)."
#
#  As Frank Wang explains, the above is equivalent to
#+       . . . | sort | uniq -c | sort +0 -nr
#+ and the following also works:
#+       . . . | sort | uniq -c | sort -k1nr -k
########################################################

exit 0

# Exercises:
# ---------
# 1) Add 'sed' commands to filter out other punctuation,
#+   such as semicolons.
# 2) Modify the script to also filter out multiple spaces and
#+   other whitespace.</PRE></TD></TR></TABLE><HR></DIV>
<P> <TABLE BORDER="0" BGCOLOR="#E0E0E0" WIDTH="90%"><TR><TD><PRE CLASS="SCREEN"><TT CLASS="PROMPT">bash$ </TT><TT CLASS="USERINPUT"><B>cat testfile</B></TT>
<TT CLASS="COMPUTEROUTPUT">This line occurs only once.
This line occurs twice.
This line occurs twice.
This line occurs three times.
This line occurs three times.
This line occurs three times.</TT>

<TT CLASS="PROMPT">bash$ </TT><TT CLASS="USERINPUT"><B>./wf.sh testfile</B></TT>
<TT CLASS="COMPUTEROUTPUT">      6 this
      6 occurs
      6 line
      3 times
      3 three
      2 twice
      1 only
      1 once</TT>
</PRE></TD></TR></TABLE> </P></DD>
<DT><A NAME="EXPANDREF"></A><B CLASS="COMMAND">expand</B>, <B CLASS="COMMAND">unexpand</B></DT>
<DD><P>The <B CLASS="COMMAND">expand</B> filter converts tabs to spaces. It is often used in a pipe.</P>
<P>The <B CLASS="COMMAND">unexpand</B> filter converts spaces to tabs. This reverses the effect of <B CLASS="COMMAND">expand</B>.</P></DD>
<DT><A NAME="CUTREF"></A><B CLASS="COMMAND">cut</B></DT>
<DD><P>A tool for extracting fields from files. It is similar to the <TT CLASS="USERINPUT"><B>print $N</B></TT> command set in <A HREF="awk.html#AWKREF">awk</A>, but more limited. It may be simpler to use <I CLASS="FIRSTTERM">cut</I> in a script than <I CLASS="FIRSTTERM">awk</I>. Particularly important are the <TT CLASS="OPTION">-d</TT> (delimiter) and <TT CLASS="OPTION">-f</TT> (field specifier) options.</P>
<P>Using <B CLASS="COMMAND">cut</B> to obtain a listing of the mounted filesystems: <TABLE BORDER="0" BGCOLOR="#E0E0E0" WIDTH="90%"><TR><TD><PRE CLASS="PROGRAMLISTING">cut -d ' ' -f1,2 /etc/mtab</PRE></TD></TR></TABLE></P>
<P>Using <B CLASS="COMMAND">cut</B> to list the OS and kernel version: <TABLE BORDER="0" BGCOLOR="#E0E0E0" WIDTH="90%"><TR><TD><PRE CLASS="PROGRAMLISTING">uname -a | cut -d" " -f1,3,11,12</PRE></TD></TR></TABLE></P>
<P>Using <B CLASS="COMMAND">cut</B> to extract message headers from an e-mail folder: <TABLE BORDER="0" BGCOLOR="#E0E0E0" WIDTH="90%"><TR><TD><PRE CLASS="SCREEN"><TT CLASS="PROMPT">bash$ </TT><TT CLASS="USERINPUT"><B>grep '^Subject:' read-messages | cut -c10-80</B></TT>
<TT CLASS="COMPUTEROUTPUT">Re: Linux suitable for mission-critical apps?
MAKE MILLIONS WORKING AT HOME!!!
Spam complaint
Re: Spam complaint</TT></PRE></TD></TR></TABLE> </P>
<P>Using <B CLASS="COMMAND">cut</B> to parse a file: <TABLE BORDER="0" BGCOLOR="#E0E0E0" WIDTH="90%"><TR><TD><PRE CLASS="PROGRAMLISTING"># List all the users in /etc/passwd.

FILENAME=/etc/passwd

for user in $(cut -d: -f1 "$FILENAME")
do
  echo "$user"
done

# Thanks, Oleg Philon for suggesting this.</PRE></TD></TR></TABLE></P>
<P><TT CLASS="USERINPUT"><B>cut -d ' ' -f2,3 filename</B></TT> is equivalent to <TT CLASS="USERINPUT"><B>awk -F'[ ]' '{ print $2, $3 }' filename</B></TT></P>
<DIV CLASS="NOTE"><TABLE CLASS="NOTE" WIDTH="90%" BORDER="0"><TR><TD WIDTH="25" ALIGN="CENTER" VALIGN="TOP"><IMG SRC="common/note.png" HSPACE="5" ALT="Note"></TD><TD ALIGN="LEFT" VALIGN="TOP"><P>It is even possible to specify a linefeed as a delimiter. The trick is to actually embed a linefeed (<B CLASS="KEYCAP">RETURN</B>) in the command sequence.</P>
<P> <TABLE BORDER="0" BGCOLOR="#E0E0E0" WIDTH="90%"><TR><TD><PRE CLASS="SCREEN"><TT CLASS="PROMPT">bash$ </TT><TT CLASS="USERINPUT"><B>cut -d'
' -f3,7,19 testfile</B></TT>
<TT CLASS="COMPUTEROUTPUT">This is line 3 of testfile.
This is line 7 of testfile.
This is line 19 of testfile.</TT>
</PRE></TD></TR></TABLE> </P>
<P>Thank you, Jaka Kranjc, for pointing this out.</P></TD></TR></TABLE></DIV>
<P>See also <A HREF="mathc.html#BASE">Example 15-46</A>.</P></DD>
<DT><A NAME="PASTEREF"></A><B CLASS="COMMAND">paste</B></DT>
<DD><P>A tool for merging different files into a single, multi-column file. In combination with <A HREF="textproc.html#CUTREF">cut</A>, it is useful for creating system log files. </P></DD>
<DT