Bash Shell Scripting: Sed

astrotycoon 2020-07-16

01. Introduction to sed

Sed (short for stream editor) is a non-interactive command line program that takes some text input, performs some command on it, and prints the result to standard output. Its support for regular expressions makes it extremely powerful, and it is often the tool of choice for filtering or transforming text.

To call the sed program, simply use the sed command.

Sed is a pretty big deal!

sed is such a versatile and powerful command line tool that there are entire books written on it. Most people use sed for just search and replace operations, but there are quite a few other useful commands.

How does sed work?

The steps in processing a sed invocation can be broken down to the following four parts:

1. Sed reads a line from the input stream (a file, a pipe, or standard input) and stores it in an internal buffer called the pattern buffer or pattern space.
2. The specified sed command is executed on the line in the pattern buffer.
3. The (possibly modified) contents of the pattern buffer are printed to standard output.
4. The pattern buffer is emptied, and the process is repeated for each remaining line until the input is exhausted.
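The four steps above can be mimicked with a plain shell loop. This is only a rough analogy, not how sed is actually implemented: the variable name pattern_space and the two-line input are made up for illustration, and the "command" applied in step 2 is the equivalent of s/^/> /.

```shell
# Hypothetical two-line input; any text stream would do.
input='first line
second line'

# A shell loop that imitates one sed cycle per input line.
loop_output=$(
    printf '%s\n' "$input" | while IFS= read -r pattern_space; do
        # Step 1 happened in "read": one line is now in pattern_space.
        pattern_space="> $pattern_space"    # Step 2: run the command on it.
        printf '%s\n' "$pattern_space"      # Step 3: print the result.
    done                                    # Step 4: start over with the next line.
)

# The same transformation done by sed itself.
sed_output=$(printf '%s\n' "$input" | sed 's/^/> /')

printf '%s\n' "$loop_output"
```

Both loop_output and sed_output should hold the same two lines, each prefixed with "> ".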

Figure: How sed works: read, execute, display.

Sample text

For this tutorial, let's use the following text titled oneOS.txt. Save the following text to a file called oneOS.txt to follow through our examples.

One OS to rule them all,
One OS to find them.
One OS to call them all,
And in salvation bind them.
In the bright land of Linux,
Where the hackers play.

Great! Time to try out some sed!

Let's try out sed

To use sed, simply invoke the command, with the first argument being a string of commands. As we'll see, this may include line addresses, a sed command (deletion, substitution, printing, etc.), flags, and more.

Before we try out the sed command, note that it may read from a file, or from standard input.

Bash

$ sed '' oneOS.txt
$ cat oneOS.txt | sed ''
$ sed '' < oneOS.txt

We prefer the first method (if possible), as it uses fewer resources and is easier to manage. The second command is an example of a Useless Use of Cat.

Let's try sed out with no commands at all and see what it does.

Bash

$ sed '' oneOS.txt
One OS to rule them all,
One OS to find them.
One OS to call them all,
And in salvation bind them.
In the bright land of Linux,
Where the hackers play.

We can see here how sed operates. It reads from the input file, stores the first line in the pattern buffer, then applies the sed command on the pattern buffer. In this case, there is no command, so no operation is performed. The (un)modified content is then outputted to standard output, and this loop is repeated for the second, third, fourth lines until the end of file.

Great! You're on your way to becoming a sed guru. Let's now see how to specify line addresses to print specific lines out.

02. Line Addresses and Printing

Invoking the sed command without any line address will cause sed to automatically perform operations on all lines.

Printing

Firstly, we can print out text with sed's p command. On its own, it simply outputs each line of the file, or of whatever is passed through standard input.

Bash

$ sed 'p' oneOS.txt
One OS to rule them all,
One OS to rule them all,
One OS to find them.
One OS to find them.
One OS to call them all,
One OS to call them all,
And in salvation bind them.
And in salvation bind them.
In the bright land of Linux,
In the bright land of Linux,
Where the hackers play.
Where the hackers play.

Now here, we can see that our text file prints out twice. This is because sed is reading in a line, performing the "print" operation on it, then outputting it again. To suppress the automatic printing, we use the -n option.

Bash

$ sed -n 'p' oneOS.txt
One OS to rule them all,
One OS to find them.
One OS to call them all,
And in salvation bind them.
In the bright land of Linux,
Where the hackers play.

Great! Keep the -n option in mind, as we'll use it later.
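To make the effect of -n concrete, here is a small sketch that counts the output lines in both cases. The scratch directory and variable names are assumptions made for this example; the file content is the oneOS.txt sample from above.

```shell
# Recreate the sample file in a scratch directory (the location is arbitrary).
workdir=$(mktemp -d)
cat > "$workdir/oneOS.txt" <<'EOF'
One OS to rule them all,
One OS to find them.
One OS to call them all,
And in salvation bind them.
In the bright land of Linux,
Where the hackers play.
EOF

# Automatic print plus explicit p, versus explicit p alone.
with_autoprint=$(sed 'p' "$workdir/oneOS.txt" | wc -l)
without_autoprint=$(sed -n 'p' "$workdir/oneOS.txt" | wc -l)

echo "sed 'p' printed $with_autoprint lines"
echo "sed -n 'p' printed $without_autoprint lines"

rm -r "$workdir"
```

The first count should be twice the second, since every line is printed once by p and once automatically.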

If we want to be more specific, we can precede the command with a line address.

Specifying address ranges

Now let's see how we can specify specific lines to print out. There are a handful of ways to specify a range or a specific pattern.

n: Operate only on line n.; '3p' would print just the third line.
n,m: Print lines n to m (inclusive).; '3,5p' would print the 3rd, 4th, and 5th lines.
1,m: Print lines from the beginning of the file to line m (inclusive).; '1,3p' would print the first three lines.
n,+m: Print line n and the m lines that follow it (GNU only).; '3,+2p' would print lines 3 to 5, inclusive.; May not be supported in some versions.
n~m: Starting from line n, print every m-th line (GNU only).; '0~2p' would print the even-numbered lines.; May not be supported in some versions.
$: Last line.; '$p' would print just the last line.
!: Follow the line address with an exclamation point to specify all lines except these addresses.; '1,2!p' would print all lines except the first and second.

In our example:

Bash

$ sed -n '2,3p' oneOS.txt
One OS to find them.
One OS to call them all,
$ sed -n '$p' oneOS.txt
Where the hackers play.
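The portable address forms can also be tried on numbered input, where it is easy to see which lines were selected. This is a sketch only: select_lines is a hypothetical helper defined just for this example, and the GNU-only forms (n,+m and n~m) are left out on purpose so the snippet should behave the same on GNU and BSD sed.

```shell
# Hypothetical helper: run one sed script over six numbered lines
# and join the selected lines with spaces for compact display.
select_lines() {
    printf '%s\n' 1 2 3 4 5 6 | sed -n "$1" | paste -sd ' ' -
}

single=$(select_lines '3p')        # one line
range=$(select_lines '3,5p')       # inclusive range
last=$(select_lines '$p')          # last line
negated=$(select_lines '1,2!p')    # everything except lines 1 and 2

echo "3p    -> $single"     # 3
echo "3,5p  -> $range"      # 3 4 5
echo "\$p    -> $last"       # 6
echo "1,2!p -> $negated"    # 3 4 5 6
```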

Line addresses with regular expressions

You may also specify lines that match a particular regular expression. Simply embed the regex between two slashes before the command.

This command prints only lines that start with "One".

Bash

$ sed -n '/^One/p' oneOS.txt
One OS to rule them all,
One OS to find them.
One OS to call them all,

This next command prints only lines that do not contain "OS" in them.

Bash

$ sed -n '/OS/!p' oneOS.txt
And in salvation bind them.
In the bright land of Linux,
Where the hackers play.

Printing sections between two patterns

We can also pull out text between two regex patterns.

Bash

$ sed -n '/call/,/hackers/p' oneOS.txt
One OS to call them all,
And in salvation bind them.
In the bright land of Linux,
Where the hackers play.
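One detail worth knowing about pattern ranges: if the second pattern never matches, the range simply runs to the end of the input. A small sketch with made-up input lines (START, END and NOPE are arbitrary markers chosen for this example):

```shell
input=$(printf '%s\n' alpha START beta END gamma)

# Both patterns match: the range stops at the END line.
closed=$(printf '%s\n' "$input" | sed -n '/START/,/END/p' | paste -sd ' ' -)

# The end pattern never matches: the range extends to the last line.
open=$(printf '%s\n' "$input" | sed -n '/START/,/NOPE/p' | paste -sd ' ' -)

echo "closed range: $closed"    # START beta END
echo "open range:   $open"      # START beta END gamma
```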

Now that we are able to select which lines we want, let's move on to how to delete lines.

03. Deleting lines and saving processed files

To delete lines from our file, we use the d command instead of the p command as we saw earlier.

Notice that for deleting, we no longer need the -n option since sed won't print out the deleted lines (once they are deleted there is nothing to be printed). Thus the output will only be the lines that were not deleted.

Don't forget an address with deleting!

Make sure to specify an address! Without one, sed will delete all lines, which probably isn't what you want.

Let's see what happens when we delete the third and fourth lines:

Bash

$ sed '3,4d' oneOS.txt
One OS to rule them all,
One OS to find them.
In the bright land of Linux,
Where the hackers play.

Saving to a new file

Notice that using the delete operation does not save the output to a new file. To save to a new file, simply direct the output to a new file with the redirection symbol (>).

Bash

$ sed '1~2d' oneOS.txt > everyOther.txt

Editing and Saving in-place

To edit the file and save directly, we can use the -i option. Note that you may lose data if used incorrectly, so be sure to create a backup prior to editing! To create a .bak file directly within the command, tack on the .bak (or any other extension) after the -i option.

Bash

$ sed -i.bak 'd' oneOS.txt
# Whoops! Accidentally deleted all lines, but good thing we made a backup!

The -i in-place option varies between implementations of sed. GNU sed accepts it with or without a backup extension, while FreeBSD's sed requires an extension argument (which may be an empty string). Older versions of OpenBSD's sed do not support the -i option at all.
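A safe way to get a feel for -i is to try it on a throwaway file first. The sketch below assumes a scratch directory and a made-up file name (notes.txt); attaching the suffix directly to -i, as in -i.bak, is the form that should be accepted by both GNU and BSD sed.

```shell
# Work on a throwaway file so nothing important can be lost.
workdir=$(mktemp -d)
printf '%s\n' 'line one' 'line two' 'line three' > "$workdir/notes.txt"

# Delete line 2 in place; the original content is kept as notes.txt.bak.
sed -i.bak '2d' "$workdir/notes.txt"

edited=$(cat "$workdir/notes.txt")
backup=$(cat "$workdir/notes.txt.bak")

echo "--- edited file ---"
echo "$edited"
echo "--- backup file ---"
echo "$backup"

rm -r "$workdir"
```

The edited file should contain two lines, while the backup still holds all three.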

Deleting lines with a regex match

To delete lines that contain a specific match to a regex, embed the regular expression between two slashes, before the command.

This command would delete the lines that start with an uppercase vowel.

Bash

$ sed '/^[AEIOU]/d' oneOS.txt
Where the hackers play.

More fun examples

Here is a list of more fun deletions that you may find helpful:

1. Deleting first and last lines

To specify multiple delete commands, separate them with a semi-colon (;).

Bash

$ sed '1d;$d' oneOS.txt
# Deletes first and last line
One OS to find them.
One OS to call them all,
And in salvation bind them.
In the bright land of Linux,

2. Deleting every other line

To delete every other line, refer back to the ~ address operator, which takes every multiple starting from an integer. Note that the ~ syntax is GNU-specific and won't work in BSD flavors of sed.

Bash

$ sed '1~2d' oneOS.txt
# Removes all odd number lines (only on GNU)
One OS to find them.
And in salvation bind them.
Where the hackers play.

3. Removing all empty lines

Use a simple regular expression to get rid of all empty lines. Alternatively, invert the match and delete every line that does not contain at least one character.

Bash

$ sed '/^$/d' oneOS.txt
$ sed '/./!d' oneOS.txt
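Since oneOS.txt has no empty lines, the two commands above leave it unchanged. Here is a sketch with made-up input that does contain blank lines. It also shows a common variation, offered as a suggestion rather than something the commands above do: treating lines that hold only spaces or tabs as blank too.

```shell
# Made-up input: two real lines, one empty line, one line of spaces and a tab.
input=$(printf 'first\n\n \t \nsecond\n')

# /^$/d removes only lines that are completely empty.
only_empty_removed=$(printf '%s\n' "$input" | sed '/^$/d')

# Adding a POSIX character class also catches whitespace-only lines.
blank_removed=$(printf '%s\n' "$input" | sed '/^[[:space:]]*$/d')

remaining=$(printf '%s\n' "$only_empty_removed" | wc -l)
echo "lines left after /^\$/d: $remaining"
printf '%s\n' "$blank_removed"
```

After /^$/d three lines remain (the whitespace-only line survives); with the character class only "first" and "second" are left.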

4. Deleting from a pattern to end of file

Bash

$ sed '/salvation/,$d' oneOS.txt
One OS to rule them all,
One OS to find them.
One OS to call them all,

04. Substituting (Search and Replace)

The substitution command is probably the most used feature of sed. A substitution takes an optional line address, the 's' command, a delimiter, the pattern to search for, the replacement, and any flags. It seems like a lot going on for a one-liner, but with practice, you'll see it's a cinch!

Figure: The anatomy of a sed substitution command.

We've already seen how line addresses work, so let's move onto the delimiter. This is the symbol used to separate the different components of our sed command. The interesting thing here is that we may use any character we want!

Canonically, we use the / character, but in the case we are matching filename paths (e.g. /usr/bin), it's common to use the semi-colon (;) instead.

If for some reason we really want to stay consistent, we can escape the forward slashes in our filenames with a backslash (\). However, this ends up looking like a picket fence, which can be difficult to read.

Bash

$ sed 's/\/usr\/bin/\/usr\/local\/bin/g' file.txt
# Use a delimiter to make reading easier.
$ sed 's;/usr/bin;/usr/local/bin;g' file.txt

In summary, whichever character immediately follows the 's' command is used as the delimiter (any character other than a backslash or a newline will do).
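Choosing a different delimiter is especially handy when the search and replacement text come from shell variables that contain slashes. A minimal sketch, assuming two hypothetical variables old_dir and new_dir and the | character as the delimiter:

```shell
# Hypothetical paths held in shell variables.
old_dir=/usr/bin
new_dir=/usr/local/bin

# Double quotes let the shell expand the variables; with | as the
# delimiter, the slashes inside the paths need no escaping.
result=$(echo 'interpreter: /usr/bin/python' | sed "s|$old_dir|$new_dir|")

echo "$result"    # interpreter: /usr/local/bin/python
```

This only works as long as the variables themselves do not contain the chosen delimiter, so pick a character that cannot appear in your data.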

Two parts - the search and the replace

A substitution has two parts: a regular expression used to find the pattern of interest (what to search for), and the replacement text that takes its place.

Bash

$ sed 's/OS/ring/' oneOS.txt
One ring to rule them all,
One ring to find them.
One ring to call them all,
And in salvation bind them.
In the bright land of Linux,
Where the hackers play.

Don't forget the last delimiter!

Make sure to add that third delimiter at the end of your sed command. If not, the command won't go through!

Bash

$ sed 's/OS/ring' oneOS.txt
sed: 1: "s/OS/ring": unterminated substitute in regular expression

Flags

After the search pattern and the replacement, we can tack on an optional modifier (also called a flag). Think of these as a secondary option we can include per command.

Global replace with g

The most common modifier you'll come across is the global modifier, g. By default, sed only replaces the first occurrence per line. This means that if you have two or more occurrences of a pattern you would like to replace on a line, only the first will be edited.

Bash

$ echo "hello, hello, hello" | sed 's/hello/hi/'
hi, hello, hello
$ echo "hello, hello, hello" | sed 's/hello/hi/g'
hi, hi, hi
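Besides g, a number can be given as the flag to replace only that occurrence on each line. The sketch below compares the three variants on the same input; the variable names are made up for this example.

```shell
line='hello, hello, hello'

first_only=$(echo "$line" | sed 's/hello/hi/')      # default: first match only
second_only=$(echo "$line" | sed 's/hello/hi/2')    # only the 2nd match
every=$(echo "$line" | sed 's/hello/hi/g')          # all matches

echo "$first_only"     # hi, hello, hello
echo "$second_only"    # hello, hi, hello
echo "$every"          # hi, hi, hi
```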

Remember that the edits won't be saved in-place unless you specify the -i option. Another way to save your output is to redirect it to some file. Refer back to the previous section for more information.

Printing lines where substitution took place with p

We may print just the lines where sed performed a substitution. Recall that the -n option is used to suppress all lines from outputting. By using this option with the p modifier, we can output only the lines where substitution took place.

Bash

$ sed -n 's/OS/ring/p' oneOS.txt
One ring to rule them all,
One ring to find them.
One ring to call them all,

Ignoring case with I

Oftentimes case-sensitivity does not matter. To match both upper and lower-case letters, use the I modifier.

Bash

$ sed 's/os/ring/I' oneOS.txt
One ring to rule them all,
One ring to find them.
One ring to call them all,
And in salvation bind them.
In the bright land of Linux,
Where the hackers play.

Note that the I flag is a GNU extension, so it may not work in other implementations such as OS X's sed.

05. Backreferences and the Ampersand

Many times when we substitute a word, we want to make a reference to the phrase or part of the phrase that we just matched. For example, we may want to surround a query in parentheses or swap out two words we found. We can perform these substitutions using backreferences and the ampersand character.

Backreferences

Recall from regular expressions that we can use backreferences to target a previously matched pattern. We capture these phrases by surrounding them with \( and \).

Later within the code, we may use \1 to recall the first pattern, \2 the second pattern, and so on for up to 9 patterns.

Example: switching two words around

To switch two words around, we capture both words in groups, and then reference them with \1 and \2. In this example, we can see how to swap any lowercase word with the word "them" that follows it.

Bash

$ sed 's/\([a-z]*\) \(them\)/\2 \1/' oneOS.txt
# Extended regular expression version of the above
$ sed -E 's/([a-z]*) (them)/\2 \1/' oneOS.txt
One OS to them rule all,
One OS to them find.
One OS to them call all,
And in salvation them bind.
In the bright land of Linux,
Where the hackers play.

The above two commands perform the same task. However, the bottom one uses extended regular expressions, so does not need the \ character to escape parentheses.
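Backreferences are not limited to two groups. As a further sketch, the snippet below rearranges a date with three captured groups, once with basic and once with extended regular expressions. The sample date and the variable names are assumptions chosen for this example.

```shell
# Sample date in year-month-day order; the three groups capture each part.
iso_date='2020-07-16'

# Basic regular expressions: groups are written \( and \).
bre=$(echo "$iso_date" | sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3.\2.\1/')

# Extended regular expressions (-E): plain parentheses.
ere=$(echo "$iso_date" | sed -E 's/([0-9]+)-([0-9]+)-([0-9]+)/\3.\2.\1/')

echo "$bre"    # 16.07.2020
echo "$ere"    # 16.07.2020
```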

& as the matched string

Sometimes you'll want to refer back to the entire matched text, for example when you use a regex as the search term and want to surround whatever it matched with asterisks.

Example: wrapping around asterisks

To specify this, use the & character. This allows you to use whatever pattern matched in your replacement query.

For example, let's say you want to wrap the first word of each line in asterisks if it starts with an uppercase vowel.

Bash

$ sed 's/^[AEIOU][a-z]*/\*&\*/' oneOS.txt
*One* OS to rule them all,
*One* OS to find them.
*One* OS to call them all,
*And* in salvation bind them.
*In* the bright land of Linux,
Where the hackers play.

To use the actual & sign, simply escape it with a backslash (\).

Bash

$ sed 's/and/\&/' file.txt
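Here is a short sketch that puts & and \& side by side, using made-up input strings. In the first command the g flag is added so that every number on the line is wrapped, not just the first.

```shell
# & stands for whatever the pattern matched, so each number gets bracketed.
bracketed=$(echo 'ports 80 and 443' | sed 's/[0-9][0-9]*/[&]/g')

# \& is a literal ampersand in the replacement.
literal=$(echo 'salt and pepper' | sed 's/and/\&/')

echo "$bracketed"    # ports [80] and [443]
echo "$literal"      # salt & pepper
```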

06. Applying sed with multiple files and commands

Now let's learn how to perform two more feats - applying a sed command to multiple files, and applying multiple commands to a single file.

Invoking multiple sed commands

You'll most likely want to substitute a number of times - not just once. We can do this in a naive manner with pipes.

Bash

$ sed 's/OS/ring/g' oneOS.txt | sed 's/Linux/Mordor/g' | sed 's/hackers/orcs/g'
One ring to rule them all,
One ring to find them.
One ring to call them all,
And in salvation bind them.
In the bright land of Mordor,
Where the orcs play.

This works, but it starts a separate sed process for every substitution, which adds unnecessary overhead. Instead, we can use the -e option to string these substitutions together in a single invocation.

Bash

$ sed -e 's/OS/ring/g' -e 's/Linux/Mordor/g' -e 's/hackers/orcs/g' oneOS.txt
One ring to rule them all,
One ring to find them.
One ring to call them all,
And in salvation bind them.
In the bright land of Mordor,
Where the orcs play.

We may also use the semi-colon character (;) instead.

Bash

$ sed 's/OS/ring/g;s/Linux/Mordor/g;s/hackers/orcs/g' oneOS.txt
One ring to rule them all,
One ring to find them.
One ring to call them all,
And in salvation bind them.
In the bright land of Mordor,
Where the orcs play.
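Whichever form you use, keep in mind that the commands run in order on the same pattern buffer, so each command sees the result of the one before it. A tiny sketch with a made-up one-word input shows why the order can matter:

```shell
# s/cat/dog/ runs first, then s/dog/bird/ acts on its result.
chained=$(echo 'cat' | sed -e 's/cat/dog/' -e 's/dog/bird/')

# Reversed order: s/dog/bird/ runs first and finds nothing to replace.
reversed=$(echo 'cat' | sed -e 's/dog/bird/' -e 's/cat/dog/')

echo "$chained"     # bird
echo "$reversed"    # dog
```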

Applying sed scripts

Oftentimes at your workstation, you'll find that you're performing the same operations for standardized formatting with sed. We could apply them all at once with -e or semi-colons (;) as shown before, but this can get messy and the whole process is not as portable.

A better way to perform a list of text manipulations is through a sed script. Here, we take each sed command and place them line-by-line in a single file. To run the file, we pass it to sed with the -f option.

Bash

# mySedScript
s/rule/cure/g
s/OS/antibiotic/g
s/hackers/doctors/g

Now we call it:

Bash

$ sed -f mySedScript oneOS.txt
One antibiotic to cure them all,
One antibiotic to find them.
One antibiotic to call them all,
And in salvation bind them.
In the bright land of Linux,
Where the doctors play.
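If you would like to try this without creating files by hand, the sketch below writes a shortened version of the script with a here-document and runs it on a single line. The scratch directory is an assumption made for this example.

```shell
workdir=$(mktemp -d)

# Write the commands to a script file; lines starting with # are comments.
cat > "$workdir/mySedScript" <<'EOF'
# mySedScript: one command per line
s/rule/cure/g
s/OS/antibiotic/g
EOF

result=$(echo 'One OS to rule them all,' | sed -f "$workdir/mySedScript")
echo "$result"    # One antibiotic to cure them all,

rm -r "$workdir"
```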

07. Writing to file(s)

There is one more feature that wasn't covered: w, which can be used as a flag on the substitution command or as a command in its own right. It writes the selected lines to a file.

# sed script
/all/w all.txt
/One/w one.txt
/the/w the.txt
...
# Up to ten output files in one script

Within each instance of sed, you can rely on up to ten output files: POSIX only guarantees that many, although some implementations (GNU sed, for example) allow more. Note the single space between the w flag and the output file name.

You can output a list of files depending on the last digit.

Examples

1) Write the first line

Bash

$ sed -n '1w output.txt' oneOS.txt

2) Write first and last line

Bash

$ sed -n -e '1w output.txt' -e '$w output.txt' oneOS.txt

3) Write lines that match a specific pattern

Bash

$ sed -n '/pattern/w output.txt' oneOS.txt
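The examples above use w as a standalone command. As a sketch of the flag form, the snippet below attaches w to a substitution so that only the lines where a replacement happened are written out. The scratch directory and the file names input.txt and changed.txt are made up for this example.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'One OS to rule them all,' 'And in salvation bind them.' > "$workdir/input.txt"

# The w flag on a substitution writes only lines where a replacement happened.
# Everything after "w " up to the end of the script is taken as the file name.
sed -n "s/OS/ring/w $workdir/changed.txt" "$workdir/input.txt"

changed=$(cat "$workdir/changed.txt")
echo "$changed"    # One ring to rule them all,

rm -r "$workdir"
```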

08. The Hold Buffer/Space

Earlier we introduced the pattern buffer, which is where the current line is stored while it is being manipulated.

There is another storage space known as the hold buffer which is used to copy or save the data in the pattern space for subsequent retrieval. The content held in the hold buffer can be exchanged or appended to the pattern buffer and vice versa.

Figure: How sed works with its hold buffer.

Let's go over the five commands, then discuss how we can use the hold buffer to reverse a file.

Exchange with x

The first command, x, is used to exchange the current pattern and holding buffers. Try this simple command to see what happens:

Bash

$ sed 'x' oneOS.txt
# The first line of output is empty

One OS to rule them all,
One OS to find them.
One OS to call them all,
And in salvation bind them.
In the bright land of Linux,

Notice how the entire file got shifted down one line, and the last line was never printed. Here is a step-by-step analysis of what is happening.

1. The first line is placed in the pattern space.
2. The contents of the pattern space are exchanged with the contents of the hold buffer, which is empty as of now.
3. The pattern buffer, which is now empty, is printed, producing a blank line.
4. Sed then makes its way to the next line, and stores this line into the pattern buffer.
5. The pattern buffer is swapped with the hold buffer, which contains the line preceding it.
6. The line within the pattern buffer is then outputted.
7. Repeat from #4 until reaching the last line.
8. On the last line, the second-to-last line will be in the hold buffer, and is swapped into the pattern buffer and printed.
9. The last line now sits in the hold buffer and is never swapped back into the pattern buffer, so it does not get printed out and sed finishes its execution. Remember that the hold buffer will not get outputted unless explicitly requested.

Phew! Nine steps for one simple command. I may have overly broken it down, since this may be a difficult concept to grasp for first-time users.
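To convince yourself that the last line really is still sitting in the hold buffer, you can explicitly fetch it at the end. The following is a sketch under assumptions: the three-line input and the variable names are made up, and the ${p;x;} group is just one possible way to recover the line.

```shell
input=$(printf '%s\n' a b c)

# Plain x: a blank line first, and the last line stays behind in the hold buffer.
shifted=$(printf '%s\n' "$input" | sed 'x')

# On the last line, print what was swapped in, then swap back so the
# automatic print at the end of the cycle outputs the real last line.
recovered=$(printf '%s\n' "$input" | sed 'x;${p;x;}')

echo "--- sed 'x' ---"
printf '%s\n' "$shifted"
echo "--- sed 'x;\${p;x;}' ---"
printf '%s\n' "$recovered"
```

The first output is a blank line followed by a and b; the second adds the missing c at the end.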

Hold with h or H

The hold function (h) is used to copy the pattern buffer (the line we are currently working on) into the hold buffer. The pattern buffer stays unchanged, while the previous contents of the hold buffer are destroyed.

If we wish to keep our hold buffer, and append to it, we can use the H command. The pattern buffer is appended to the hold buffer, separated by a newline (\n) character.

Get with g or G

The reverse of the hold command is g. It copies what's in the hold buffer into the pattern space, overwriting the current contents of the pattern space.

Similarly, to append instead of overwriting, we use the G command. This adds a new line to the pattern space, and appends the contents of the hold buffer.
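As a sketch of how these commands combine, the snippet below collects every line in the hold buffer with H and then joins them with commas on the last line, using x from the previous section to bring the collected text back. The three-word input is made up for this example.

```shell
# Collect every line in the hold buffer, then process them all at once.
#   H         append the current line to the hold buffer (after a newline)
#   $         on the last line only ...
#   x         ... swap the collected text into the pattern space
#   s/^\n//   drop the leading newline left by the first H
#   s/\n/,/g  turn the remaining newlines into commas
joined=$(printf '%s\n' red green blue | sed -n 'H;${x;s/^\n//;s/\n/,/g;p;}')

echo "$joined"    # red,green,blue
```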

Double spacing with G

Let's try the following example, which only makes use of the G command.

Bash

$ sed 'G' oneOS.txt
One OS to rule them all,

One OS to find them.

One OS to call them all,

And in salvation bind them.

In the bright land of Linux,

Where the hackers play.


How does double-spacing work? Let's break it down step-by-step.

1. The first line is placed in the pattern space.
2. A newline character is appended to the pattern space, followed by the contents of the holding space, which is empty.
3. The pattern space now holds the original line followed by an empty line.
4. The pattern space is outputted.
5. This process is repeated until end of file.

Pretty simple, right?

Reversing line order using sed

Let's try one more trick to solidify our understanding of pattern and holding spaces. This three-part command reverses the lines of a file. Remember that the semi-colon (;) is used to separate commands in sed.

Bash

$ sed -n '1!G;h;$p' oneOS.txt
Where the hackers play.
In the bright land of Linux,
And in salvation bind them.
One OS to call them all,
One OS to find them.
One OS to rule them all,

1. The first line is copied into the pattern space.
2. 1!G says to apply the G command to all lines except the first, so it is skipped for this line.
3. The h command copies the contents of the pattern space into the holding space. The previous contents of the holding space are destroyed.
4. The last command, $p, prints the pattern space only on the last line, so it is skipped for now.
5. Sed reads the next line into the pattern space.
6. Sed goes back to the first command, 1!G, which now applies.
7. The contents of the holding space are appended to the pattern buffer with a newline character. On the nth iteration, this causes the order to be: n, n-1 ... 2, 1.
8. Repeat from #3 until the last line.
9. On the last line, $p prints out the pattern space.
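A quick way to check the walkthrough above is to reverse some input twice and confirm that the original order comes back. The four-line input and the variable names in this sketch are made up for illustration.

```shell
input=$(printf '%s\n' one two three four)

# Reverse once, then reverse the result again.
reversed=$(printf '%s\n' "$input" | sed -n '1!G;h;$p')
restored=$(printf '%s\n' "$reversed" | sed -n '1!G;h;$p')

echo "--- reversed ---"
printf '%s\n' "$reversed"
echo "--- reversed twice ---"
printf '%s\n' "$restored"
```

The first listing runs from four down to one, and the second matches the original input.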

If you can understand the above, then you've got the basics of sed!