Wednesday, 2 July 2008

Regular Expressions for grep, sed, awk

One of the examples in last week's tip used the following awk statement to extract, from a file named unixfile, the lines (records) containing the string "learn":

awk '/learn/ { print $2 " " $1 }' unixfile
The string "learn" in this statement is a regular expression that is delimited on each end by the forward slash (/) character. In addition to awk, regular expressions are often used with other UNIX utilities such as grep, sed, and vi.
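To see the statement in action, here is a small sketch. The file contents are an assumption, taken from the five text lines listed later in this tip, and the temporary directory is only there to keep the example self-contained.

```shell
# Recreate a sample unixfile (contents assumed from the listing in this tip)
tmpdir=$(mktemp -d)
printf '%s\n' 'unix training' 'learn unix' 'unix class' \
              'learning unix' 'unix course' > "$tmpdir/unixfile"

# For every line matching the regex /learn/, print field 2, a space, then field 1
awk_out=$(awk '/learn/ { print $2 " " $1 }' "$tmpdir/unixfile")
printf '%s\n' "$awk_out"

rm -rf "$tmpdir"
```

Only the two lines containing "learn" are selected, and their two fields are printed in swapped order.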

Regular expressions, often abbreviated as regex or regexp, describe a pattern or particular sequence of characters and are used to search for and replace strings.

Most characters used in a regex will represent themselves, but there are special characters (known as metacharacters) that take on special meaning in the context of the UNIX utility/tool in which they are used.

Since the topic of regular expressions is quite extensive, this brief overview will only focus on two of its frequently used positional or anchor metacharacters, the caret (^) and the dollar sign ($).

The caret is used to match at the beginning of a line, and the dollar sign is used to match at the end of a line. Carets will logically be found on the left-hand side of a regex, and dollar signs on the right.
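As a quick illustration of what the anchors change, the sketch below counts matches with and without them. The three-line sample (including the made-up line "a unix b") is hypothetical and is not the data file used in this tip.

```shell
# Hypothetical sample: "unix" at the start, at the end, and in the middle of a line
sample=$(printf '%s\n' 'unix class' 'learn unix' 'a unix b')

# No anchor: the pattern may match anywhere in the line
anywhere=$(printf '%s\n' "$sample" | grep -c 'unix')
# Caret: the match must start at the beginning of the line
at_start=$(printf '%s\n' "$sample" | grep -c '^unix')
# Dollar sign: the match must end at the end of the line
at_end=$(printf '%s\n' "$sample" | grep -c 'unix$')
# Both anchors: the pattern must account for the whole line
whole_line=$(printf '%s\n' "$sample" | grep -c '^unix class$')

echo "anywhere=$anywhere start=$at_start end=$at_end whole=$whole_line"
```

With this sample the unanchored pattern matches all three lines, while each anchored pattern matches only one.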

To demonstrate the usage of these two positional metacharacters, the same data file used for last week's tip will be used again this week. The only change made was the insertion of a blank line between each pair of text lines, for a total of 4 blank lines. The file unixfile now contains the following data:

unix training

learn unix

unix class

learning unix

unix course

Using grep, all lines in unixfile that begin with "unix" will be extracted with the help of the caret metacharacter:

# grep '^unix' unixfile
unix training
unix class
unix course

Removing the caret from the beginning of the regex and adding a dollar sign to the end will cause grep to display lines ending with "unix":

# grep 'unix$' unixfile
learn unix
learning unix
These two metacharacters can also be combined in a single regex to identify/manipulate blank lines. The -c option for grep will be used with a regex containing both the caret and the dollar sign to count the number of blank lines in unixfile:

# grep -c '^$' unixfile
4

You may recognize that this regex would be useful for removing blank lines from a file when needed.
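Two common ways to do that are sketched below: inverting the match with grep -v, or deleting the matching lines with sed. The sample file is rebuilt in a temporary directory so the example stands alone. Note that '^$' only matches truly empty lines, so a line containing spaces or tabs would not be removed.

```shell
# Rebuild unixfile with a blank line between each pair of text lines
tmpdir=$(mktemp -d)
printf '%s\n' 'unix training' '' 'learn unix' '' 'unix class' '' \
              'learning unix' '' 'unix course' > "$tmpdir/unixfile"

# Count the blank lines, as in the grep -c example
blank_count=$(grep -c '^$' "$tmpdir/unixfile")

# Option 1: -v inverts the match, so only non-blank lines are printed
no_blanks=$(grep -v '^$' "$tmpdir/unixfile")

# Option 2: the sed d command deletes every line matching the address /^$/
sed_no_blanks=$(sed '/^$/d' "$tmpdir/unixfile")

printf '%s\n' "$no_blanks"
rm -rf "$tmpdir"
```

Both commands write the cleaned text to standard output and leave the original file untouched.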

Experienced UNIX system administrators and shell script programmers understand that becoming skilled in the use of regular expressions is essential for using standard UNIX utilities (e.g. grep, awk, sed, and vi) to their fullest potential.

GREP

“grep searches the named input FILEs (or standard input if no files are named, or the file name - is given) for lines containing a match to the given PATTERN. By default, grep prints the matching lines.”

AWK

“Awk scans each input file for lines that match any of a set of patterns specified literally in prog or in one or more files specified as -f progfile. With each pattern there can be an associated action that will be performed when a line of a file matches the pattern. Each line is matched against the pattern portion of every pattern-action statement; the associated action is performed for each matched pattern.”
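The point about every pattern-action statement being tried against each line can be seen in a short sketch. The counter names (starts, ends, blanks) are made up for this illustration, and the sample file is the unixfile listing from earlier, rebuilt in a temporary directory.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'unix training' '' 'learn unix' '' 'unix class' '' \
              'learning unix' '' 'unix course' > "$tmpdir/unixfile"

# Each input line is tested against all three patterns;
# the END block runs once after the last line has been read.
awk_counts=$(awk '
  /^unix/ { starts++ }
  /unix$/ { ends++ }
  /^$/    { blanks++ }
  END     { print starts, ends, blanks }
' "$tmpdir/unixfile")

echo "$awk_counts"
rm -rf "$tmpdir"
```

The three numbers are the lines beginning with "unix", the lines ending with "unix", and the blank lines, which agree with the grep results shown above.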

SED

“The sed utility is a stream editor that reads one or more text files, makes editing changes according to a script of editing commands, and writes the results to standard output.”

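# In file1, find the line containing firststring and insert secondstring after the second space.
# Note: in s/set1/set2/g the g flag means replace every occurrence, whereas a numeric flag such as 2 replaces only the second occurrence.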
sed -i '/firststring/ s/ / secondstring/2' file1

# Append finalstring to the end of the last line of file1.
sed -i '$ s/$/finalstring/' file1
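A small sketch of what these two commands do. The three-line file1 is hypothetical, and -i is left out here so that the results go to standard output and the file itself is not changed.

```shell
# Hypothetical file1 with three lines
tmpdir=$(mktemp -d)
printf '%s\n' 'alpha beta gamma' 'has firststring in it' 'last line' > "$tmpdir/file1"

# Only on lines matching /firststring/: replace the 2nd space with " secondstring"
second_space=$(sed '/firststring/ s/ / secondstring/2' "$tmpdir/file1")

# Only on the last line ($ as an address): append finalstring at the end of the line ($ in the regex)
last_line=$(sed '$ s/$/finalstring/' "$tmpdir/file1")

printf '%s\n' "$second_space" '---' "$last_line"
rm -rf "$tmpdir"
```

The second line becomes "has firststring secondstringin it" in the first result, and the last line becomes "last linefinalstring" in the second. Nothing is added after the inserted text, so include a trailing space in the replacement if one is wanted.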


REGULAR EXPRESSIONS

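Regular expressions let you describe a pattern in your commands instead of a single literal string. Most of us are at least familiar with the shell wildcard *, where *.db means “every file ending in .db”. Strictly speaking that is a shell glob rather than a regular expression (in a regex, * means “zero or more of the preceding item”), but the idea of matching by pattern is the same. There are many useful regular expressions, and they are the key to executing the above commands efficiently. Brush up on regular expressions here.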
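One caveat worth a quick sketch: *.db is shell wildcard (glob) syntax, and the closest regular expression is '\.db$'. The three file names below are hypothetical and exist only as strings, not as files on disk.

```shell
# Shell glob: case uses the same wildcard syntax as file name expansion
glob_hits=0
for n in users.db users.db.bak notes.txt; do
  case $n in
    *.db) glob_hits=$((glob_hits + 1)) ;;
  esac
done

names=$(printf '%s\n' users.db users.db.bak notes.txt)

# Regex equivalent: an escaped literal dot, "db", then the end-of-line anchor
regex_hits=$(printf '%s\n' "$names" | grep -c '\.db$')

# Unescaped and unanchored: "." matches any character and "db" may appear anywhere
loose_hits=$(printf '%s\n' "$names" | grep -c '.db')

echo "glob=$glob_hits regex=$regex_hits loose=$loose_hits"
```

The glob and the anchored regex both select only users.db, while the loose pattern also picks up users.db.bak.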

PIPES AND REDIRECTS

In order to send the results from one command to another command you need to “pipe” them together. An example:

cat *.txt | grep "blatti.net"

cat *.txt will display the contents of every file that ends in “.txt” to the standard output. By piping those results, they become the input for the second command. grep "blatti.net" will take the standard input, and display each line that contains “blatti.net” on the standard output. So when these two commands are piped together, we will get every line that contains “blatti.net” in every file that ends in “.txt” in the working directory.

While most of your work can be done with standard input and output, you’ll probably want to write your results to a file. Enter redirection. Using the example above, let’s say we wanted to take our results and write them to a file. This is done with the > redirector.

cat *.txt | grep "blatti.net" > results.txt

This command will take every line that contains “blatti.net” in every file that ends in “.txt” in the working directory, and write them to the file “results.txt”. Note that this will overwrite any existing “results.txt” file in the working directory. Which raises the question: “What if I want to append instead of overwrite?” In that case, use >> instead of >. The >> redirector is especially useful if you are creating logs.
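The difference between > and >> can be checked with a short sketch. The two .txt files and their contents are made up, and the output goes to a hypothetical results.log in a separate directory, because a results.txt in the working directory would itself match *.txt and be read back in on the next run.

```shell
workdir=$(mktemp -d)   # holds the hypothetical input files
outdir=$(mktemp -d)    # holds the output, kept apart from the inputs
printf '%s\n' 'visit blatti.net today' 'nothing here' > "$workdir/a.txt"
printf '%s\n' 'mail admin@blatti.net' > "$workdir/b.txt"

# First run: > creates the file and writes the two matching lines
cat "$workdir"/*.txt | grep "blatti.net" > "$outdir/results.log"
first_count=$(grep -c '' "$outdir/results.log")

# Second run with >: the file is truncated first, so it still holds two lines
cat "$workdir"/*.txt | grep "blatti.net" > "$outdir/results.log"
overwrite_count=$(grep -c '' "$outdir/results.log")

# Third run with >>: the two matching lines are added after the existing ones
cat "$workdir"/*.txt | grep "blatti.net" >> "$outdir/results.log"
append_count=$(grep -c '' "$outdir/results.log")

echo "first=$first_count overwrite=$overwrite_count append=$append_count"
rm -rf "$workdir" "$outdir"
```

The line count stays the same after the second run and doubles after the third.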

Now you too can create obnoxious commands with limited knowledge! The command below is one I created using the resources listed in this article. It is part of a piece of software that pulls text out of a PDF and returns a specified section in a text file. (I’m sure it is inefficient - so someone chime in and tell me how to make it better so I’ll know for next time)

sed 's/^[0-9]$/Period &/' /tmp/skypdf.txt | sed 's/[A-Z]*$/&,/' | sed '/Period/G' | sed '$!N;s/\n/ /' | sed '/Period/{x;p;x;}' | sed 's/^ //' > ~/Desktop/full.txt

Now say that 10 times fast!


