Blog Projects, Tips, Tricks, and How-Tos

Writing Better Shell Scripts – Part 2

Quick Start

As with Part 1 of this series, this information does not lend itself to having a “Quick Start” section. With that said, you can read the How-To section of this post for a quick general overview. I would highly recommend reading everything though, as a good understanding of the concepts and commands outlined here will serve you well in the future. Video and Audio are also included with this post which may work as a quick reference for you. Don’t forget that the man and info pages of your Linux/Unix system can be an invaluable resource as well when you’re learning commands and solving problems.

Video

Audio

Download

Preface

To make things easier on you, all of the black command line and script areas are set up so that you can copy the text from them. This does make using the commands easier, but if you’re not already familiar with the concepts presented here, typing the commands yourself and working through why you’re typing them will help you learn more. If you hit problems along the way, take a look at the Troubleshooting section near the end of this post for help.

There are formatting conventions that are used throughout this post that you should be aware of. The following is a list outlining the color and font formats used.

Command Name or Directory Path
Warning or Error
Command Line Snippet With Commands/Options/Arguments
Command Options and Their Arguments Only
Hyperlink

Overview

This post is the second in a series on shell script debugging, error handling, and security. The content of this post will be geared mainly toward BASH users, but there will be information that’s suitable for users of other shells as well. Information such as techniques and methodologies may transfer very well, but BASH specific constructs and commands will not. The users of other shells (CSH, KSH, etc) will have to do some homework to see what transfers and what does not.

There are a lot of opinions about how error handling should be done, which range from doing nothing to implementing comprehensive solutions. In this post, as well as in my professional work, I try to err on the side of in-depth solutions. Some people will argue that you don’t need to go through the trouble of providing error handling on small single-user scripts, but useful scripts have a way of growing past their original intent and user group. If you’re a system administrator, you need to be especially careful with error handling in your scripts. If you or an admin under you gets careless, someday you may end up getting a call from one of your users complaining that they just deleted the contents of their home directory – with one of your scripts. It’s easier to do than you might think when precautions are not taken. All you need are a couple of lines in your script like those in Listing 1.

Listing 1

#!/bin/bash
cd $1
rm -rf *

So what happens if a user forgets to supply a command line argument to Listing 1? The cd command changes into the user’s home directory, and the rm command deletes all of their files and directories without prompting. That has the makings of a bad day for both you and your user. In this post I’ll cover some ways to avoid this kind of headache.

To help ease the extra burden of making your scripts safer with error handling, we’ll talk about separating error handling code out into reusable modules which can be sourced. Once you do this and become familiar with a few error handling techniques, you’ll be able to implement robust error handling in your scripts with less effort.

The intent of this post is to give you the information you need to make good judgments about error handling within your own scripts. Both proactive and reactive error handling techniques will be covered, so that you can decide when to try to head off errors before they happen and when to try to catch them after they happen. With those things in mind, let's start off with some of the core elements of error handling.

BASH Options

There are several BASH command line options that can help you avoid some errors in your scripts. The first two are ones that we already covered in Part 1 of this series. The -e option, which is the same as set -o errexit, causes BASH to exit as soon as it detects an error. While there are a significant number of people who promote setting the -e option for all of your scripts, it can prevent you from using some of the other error handling techniques that we'll be talking about shortly. The next option, -u, which is the same as set -o nounset, causes the shell to throw an error whenever a variable is used before its value has been set. This is a simple way to prevent the risky behavior of Listing 1. If the user does not provide an argument to the script, the shell will see it as the $1 variable not being set and complain. This is usually a good option to use in your scripts.

set -o pipefail is something that we'll touch on in the Command Sequences section; it causes a whole command pipeline to error out if any one of its sections has an error. The last shell option that I want to touch on is set -o noclobber (or the -C option), which prevents the overwriting of existing files with redirection. You will just get an error similar to cannot overwrite existing file. This can save you when you're working with system configuration files, as overwriting one of them could result in any number of big problems. Listing 2 holds a quick reference list of these options, and a short demonstration script follows it.

Listing 2

errexit (-e)    Causes the script to exit whenever there is an error.
noclobber (-C)  Prevents the overwriting of files when using redirection.
nounset (-u)    Causes the shell to throw an error whenever an unset variable is used.
pipefail        Causes a pipeline to error out if any section has an error.
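
If you want to see noclobber and nounset in action before committing to them, here is the short demonstration mentioned above. It's just a throwaway sketch that trips both options on purpose; the temp file path is a placeholder of my own choosing.

#!/bin/bash
# Sketch: watch noclobber and nounset catch two common mistakes.
set -o nounset    # same as -u; using an unset variable is an error
set -o noclobber  # same as -C; redirection won't overwrite existing files

echo "first" > /tmp/optdemo.txt
echo "second" > /tmp/optdemo.txt   # blocked: cannot overwrite existing file
echo "Exit status of the blocked redirection: $?"

echo "You gave me: $1"   # aborts here under nounset if no argument was given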

Exit Status

Exit status is the 8-bit integer that is returned to a parent process when a subprocess exits (either normally or because it is forced to exit). Typically, an exit status of 0 means that the process completed successfully, and an exit status greater than 0 means that there was a problem. This may seem counterintuitive to C/C++ programmers who are used to true being 1 (non-zero) and false being 0. There are exceptions to the shell’s exit status standard, so it’s always best to understand how the distribution/shell/command combo you’re using will handle the exit status. An example of a command that acts differently is diff. When you run diff on two files, it will return 0 if the files are the same, 1 if the files are different, and some number greater than 1 if there was an error. So if you checked the exit status of diff expecting it to behave “normally”, you would think that the command failed when it was really telling you that the files are different.
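
To make that concrete, here's a minimal sketch that handles diff's three documented outcomes separately instead of lumping them into success/failure. The file names are placeholders.

#!/bin/bash
# Sketch: handle diff's three documented exit statuses separately.
diff file1.txt file2.txt > /dev/null
case $? in
    0) echo "The files are the same." ;;
    1) echo "The files differ." ;;
    *) echo "diff itself hit an error (e.g. a missing file)." ;;
esac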

Probably the easiest way to begin experimenting with exit status is to use the BASH shell’s built-in ? variable. The ? variable holds the exit status of the last command that was run. Listing 3 shows an example where I check the exit status of the true command, which always gives an exit status of 0 (success), and of the false command, which always gives an exit status of 1 (failure). Credit goes to William Shotts, Jr., whose straightforward use of true and false in his examples on this topic inspired some of the examples in this post.

Listing 3

$ true
$ echo $?
0
$ false
$ echo $?
1

In this case the true and false commands follow the 0 = success, non-zero = failure standard, so we can be certain whether or not the command succeeded. As stated above though, the meaning of the exit status is not always so clear. I check the man page for any unfamiliar commands to see what their exit statuses mean, and I suggest you do the same with the commands you use. Listing 4 lists some of the standard exit statuses and their usual meanings.

Listing 4

0        Command completed successfully.
1-125    Command did not complete successfully. Check the command's man page for the meaning of the status.
126      Command was found, but couldn't be executed.
127      Command was not found.
128-254  Command died due to receiving a signal. The signal code is added to 128 (128 + SIGNAL) to get the status.
130      Command exited due to Ctrl-C being pressed.
255      Exit status is out of range.

For statuses 128 through 254, you see that the signal that caused the command to exit is added to the base status of 128. This allows you to subtract 128 from the given exit status later to see which signal was the culprit. Some of the signals that can be added to the base of 128 are shown in Listing 5 and were obtained from the signal man page via man 7 signal . Note that SIGKILL and SIGSTOP cannot be caught, blocked, or ignored because those signals are handled at the kernel level. You may see all of these signals at one time or another, but the most common are SIGHUP, SIGINT, SIGQUIT, SIGKILL, SIGTERM, and SIGSTOP.

Listing 5

Signal     Value      Action   Comment
──────────────────────────────────────────────────────────────────────
SIGHUP        1       Term     Hangup detected on controlling terminal
                               or death of controlling process
SIGINT        2       Term     Interrupt from keyboard
SIGQUIT       3       Core     Quit from keyboard
SIGILL        4       Core     Illegal Instruction
SIGABRT       6       Core     Abort signal from abort(3)
SIGFPE        8       Core     Floating point exception
SIGKILL       9       Term     Kill signal
SIGSEGV      11       Core     Invalid memory reference
SIGPIPE      13       Term     Broken pipe: write to pipe with no readers
SIGALRM      14       Term     Timer signal from alarm(2)
SIGTERM      15       Term     Termination signal
SIGUSR1   30,10,16    Term     User-defined signal 1
SIGUSR2   31,12,17    Term     User-defined signal 2
SIGCHLD   20,17,18    Ign      Child stopped or terminated
SIGCONT   19,18,25    Cont     Continue if stopped
SIGSTOP   17,19,23    Stop     Stop process
SIGTSTP   18,20,24    Stop     Stop typed at tty
SIGTTIN   21,21,26    Stop     tty input for background process
SIGTTOU   22,22,27    Stop     tty output for background process

A listing of signals which only shows the symbolic and numeric representations without the descriptions can be obtained with either kill -l or trap -l .
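
If you want to see the 128 + SIGNAL convention for yourself, you can terminate a background command and do the subtraction. This is just a quick sketch:

#!/bin/bash
# Sketch: recover the terminating signal from an exit status.
sleep 60 &         # start a long-running command in the background
CHILD=$!
kill -TERM $CHILD  # send it SIGTERM (signal 15)
wait $CHILD        # wait's exit status becomes 128 + 15 = 143
echo "Terminating signal was: $(( $? - 128 ))"   # prints 15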

You can explicitly pass the exit status of the last command executed back to the parent process (most likely the shell) with a line like exit $? . You can do the same thing implicitly by calling the exit command without an argument. This works fine if you want to exit immediately, but if you want to do some other things with the exit status first you’ll need to store it in a variable. This is because after you read the ? variable once, it resets. Listing 6 shows one way of using an if statement to pass the exit status back to the parent after implementing your own error handling functionality.

Listing 6

#!/bin/bash -

# Run the command(s)
false

# Save the exit status (it's reset once we read it)
EXITSTAT=$?

# If the command has a non-zero exit status
if [ $EXITSTAT -gt 0 ]
then
    echo "There was an error."
    exit $EXITSTAT #Pass the exit status back to parent
fi

You can also use an if statement to directly test the exit status of a command as in Listing 7. Notice that using the command this way resets the ? variable so that you can’t use it later.

Listing 7

#!/bin/bash -

# If the command has a non-zero exit status
if ! false
then
    echo "There was an error."
    exit 1
fi

The if ! false statement is the key here. What’s inside of the if statement will be executed if the command (in this case false) returns a non-zero exit status. Using this type of statement can give you a chance to warn the user of what’s going on and take any actions that are needed before the script exits.

You can also use the if and test combination in more complex ways. For instance, according to its man page, the ls command uses an exit status of 0 for no errors, 1 for minor errors like not being able to access a subdirectory, and 2 for major errors like not being able to access a file/directory specified on the command line. With this in mind, take a look at Listing 8 to see how you could differentiate between the “no error”, “minor error”, and “major error” conditions.

Listing 8

#!/bin/bash -

function testex {
    # We can only read $? once before it resets, so save it
    exitstat=$1

    # See which condition we have
    if test $exitstat -eq 0; then
        echo "No error detected"
    elif test $exitstat -eq 1; then
        echo "Minor error detected"
    elif test $exitstat -eq 2; then
        echo "Major error detected"
    fi
}

# Try a listing that should succeed
echo "- 'ls ~/*' Executing"
ls ~/* &> /dev/null

# Check the success/failure of the ls command
testex $?

# Try a listing that should not succeed
echo "- 'ls doesnotexist' Executing"
ls doesnotexist &> /dev/null
testex $?

Inside the testex function I have placed code that looks for specific exit statuses and then tells the user what was found. Normally you wouldn’t worry about handling the situation where there’s no error (exit status 0), but doing so helps clarify the concept in our example. The output that you would get from running this script is shown in Listing 9.

Listing 9

$ ./testex.sh
- 'ls ~/*' Executing
No error detected
- 'ls doesnotexist' Executing
Major error detected

There are a couple of final things to be aware of when you’re using the ? variable. First, remember that whenever you use ? from the command line or in a script, the shell resets its value. If you need to use the ? variable more than once in your script, you’ll want to store its value in another variable and use that. The second is that ? becomes ineffective when you are using the -e option or the line set -o errexit. The reason for this is that the script will exit as soon as an error is detected, so you never get a chance to check the ? variable.

The command_not_found_handle Function

As of BASH 4.0, the provision for a command_not_found_handle function has been added. This function makes it possible to display user friendly messages when a command the user types is not found. BASH searches for the command, and if it’s not found anywhere, BASH looks to see if you have the command_not_found_handle function defined. If you do, that function is invoked with the attempted command and its arguments so that a useful message can be displayed. If you use a Debian or Ubuntu system you’ve probably seen this in action, as they’ve had this feature for a while. Listing 10 shows an example of the command_not_found_handle function output on an Ubuntu 9.10 system.

Listing 10

$ cat2
No command 'cat2' found, did you mean:
 Command 'cat' from package 'coreutils' (main)
cat2: command not found

You can implement/override the behavior of the command_not_found_handle function to provide your own functionality. Listing 11 shows an implementation of the command_not_found_handle function inside of a stand-alone script. In most cases you would want to add it to your BASH configuration file(s) so that you can make use of the function anytime that you’re at the shell prompt.

Listing 11

#!/bin/bash -
# File: cmdnf.sh

function command_not_found_handle {
    echo "The command ($1) is not valid."
    exit 127 #The command not found status
}

cat2

You would access the arguments to the original (not found) command via $2, $3 and so on. Notice that I used the exit command and passed it the code of 127, which is the command not found exit status. The exit status of the whole script is the exit status of the command_not_found_handle function. If you don’t set the exit status explicitly the script will end up returning 0 (success), thus preventing a user or script from using the exit status to determine what type of error occurred. Propagation of the exit status and terminating signal (which we’ll talk about later) is a good thing to do to prevent your users from missing important information and/or having problems. When run, the script in Listing 11 gives you the following output in Listing 12.

Listing 12

$ ./cmdnf.sh
The command (cat2) is not valid.
$ echo $?
127

Command Sequences

Command sequences are multiple commands that are linked by pipes or logical short-circuit operators. Two logical short-circuits are the double ampersand (&&) and double pipe (||) operators. The && only allows the command that comes after it in the series to be executed if the previous command exited with a status of 0. The || operator does the opposite by only allowing the next command to be executed if the previous one returned a non-zero exit status. Listing 13 shows examples of how each of these work.

Listing 13

$ true && echo 'Hello World!'
Hello World!
$ false && echo 'Hello World!'
$ true || echo 'Hello World!'
$ false || echo 'Hello World!'
Hello World!

So, one of the many ways to solve the unset variable problem we see in Listing 1 is the example shown in Listing 14.

Listing 14

#!/bin/bash

#Make sure the user provided a command line argument
[ -n "$1" ] || { echo "Please provide a command line argument."; exit 1; }

#Change to the directory and delete the files and dirs
cd $1 && rm -rf *

In the first line of interest, we check to make sure that the value of $1 is not null. If that test command fails, it means that $1 is unset and that the user did not provide a command line argument. Since the || operator only allows the next command to run if the previous one fails, our code block warns the user of their mistake and exits with a non-zero status. If a command line argument was supplied, the script continues on. In the second interesting line we use the && operator to run the rm command if, and only if, the cd command succeeds. This keeps us from accidentally deleting all of the files and directories in the user’s/script’s current working directory if the cd command fails for some reason.

The next type of command sequence that we’re going to cover is a pipeline. When commands are piped together, only the last return code will be looked at by the shell. If you have a series of pipes like the one in Listing 15, you would expect it to show a non-zero exit status, but instead it’s 0.

Listing 15

$ true | false | true
$ echo $?
0

To change the shell’s behavior so that it will return a non-zero value for a pipeline if any of its elements have a non-zero exit status, use the set -o pipefail line in your script. The result of using pipefail is shown in Listing 16.

Listing 16

$ set -o pipefail
$ true | false | true
$ echo $?
1

This method doesn’t give you any insight into where in the pipeline your error occurred though. In many cases I prefer to use the BASH array variable PIPESTATUS to check pipelines. It gives you the ability to tell where in the pipeline the error occurred, so that your script can more intelligently adapt to or warn about the error. Listing 17 gives an example.

Listing 17

$ true | false | true
$ echo ${PIPESTATUS[0]} ${PIPESTATUS[1]} ${PIPESTATUS[2]}
0 1 0

To keep things clean inside your script, you might put the code to check the PIPESTATUS array into a function and use a loop to process the array elements. This way you have reusable code that will automatically adjust to the number of commands that are in your pipe. One of the scripts in the Scripting section shows this technique.
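
As a preview of that technique, here's a simplified sketch of such a function. Passing the expanded PIPESTATUS elements as arguments lets the loop adapt to any pipeline length:

#!/bin/bash
# Sketch: report which section(s) of a pipeline failed.
function check_pipe {
    local INDEX=0
    for STATUS in "$@"   # one exit status per pipeline section
    do
        if [ "$STATUS" -ne 0 ]; then
            echo "Pipeline section $INDEX failed with status $STATUS"
        fi
        INDEX=$((INDEX + 1))
    done
}

true | false | true
check_pipe "${PIPESTATUS[@]}"   # pass it right away - PIPESTATUS resets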

If you’re running a version of BASH prior to 3.1, a potential problem with using pipes is the Broken pipe warning. If a reader in a pipeline finishes before its writer completes, the writer will get a SIGPIPE signal, which causes the Broken pipe warning to be thrown. It may be a non-issue for you, but it doesn’t hurt to be aware of it. If you’re running a version of BASH that’s 3.1 or higher, you can use the PIPESTATUS variable to see if there’s been a pipe error. I’ve done this in Listing 18, where I’ve written two scripts that will cause the pipeline to break. The code inside the scripts doesn’t really matter in this case, just the end result.

Listing 18

$ ./pipeerr2.sh | ./pipeerr.sh
test test test
$ echo ${PIPESTATUS[0]} ${PIPESTATUS[1]}
141 0

You can see that the pipe exit status for the first script (or pipeline section) is 141. This number actually results from the addition of a base exit status and the signal code, which I’ve mentioned before. The base status is 128, which the shell uses to signify that a command stopped due to receiving a signal rather than exiting normally. Added to that is the code of the signal that caused the termination, which in this case is 13 (SIGPIPE) on my system. This technique embeds the signal code in the exit status in a way that makes it easy to retrieve. Since the status is built by adding 128 and 13, all I have to do is use arithmetic expansion to extract the signal code from Listing 18: echo $((${PIPESTATUS[0]}-128)) . This gives me output showing the value of 13, which is what we expect. Keep in mind that the PIPESTATUS array variable is like the ? variable in that it resets once you access it or a new pipeline is executed.

As stated in Part 1 of this series, you can replace pipes with temporary files. This eliminates the SIGPIPE and exit status pitfalls of pipes, but as stated before, temp files are much slower than pipes and require you to clean them up after you’re done with them. In general, I would suggest staying away from temp files unless you have a compelling reason to use them. A compromise between temp files and pipes might be named pipes. On modern Linux systems you use the mkfifo command to create a named pipe, which you can then use with redirection. On older systems you may have to use mknod instead to create the pipe. In Listing 19 you can see that I’ve used named pipes instead of regular pipes, and that this technique allows me to check each of the sections of the pipeline as they’re used. Keep in mind that I’m reading from the named pipe in another terminal with cat < pipe1, since a line like true > pipe1 will block until the pipe has been read from. Also notice that I use the rm command to delete the named pipe after I’m done with it. I do this as a housekeeping measure, since I don’t want to leave named pipes lying around that I don’t need.

Listing 19

$ mkfifo pipe1
$ true > pipe1
$ echo $?
0
$ false > pipe1
$ echo $?
1
$ rm pipe1

Wrapper Functions

If there’s a command that you’re using multiple times in your script and that command requires some error handling, you might want to think about creating a wrapper function. For instance, in Listing 1 the cd command has the unwanted side effect of switching to the user’s home directory if the user hasn’t supplied a command line argument. If you’re using cd multiple times throughout the script, you could write a function that extends cd‘s functionality. Listing 20 shows an example of this.

Listing 20

#!/bin/bash -

function cdext {
    # We want to make sure that the user gave an argument
    if [ $# -eq 1 ]
    then
        cd $1
    else
        echo "You must supply a directory to change to."
        exit 1
    fi
}

# This should succeed
cdext /tmp

# Make sure that it did succeed
pwd

# This should fail with our warning
cdext

I first use the shell’s built-in # variable to make sure that the user has specified a single argument. It would probably also be a good idea to add a separate elif branch to warn the user that they supplied too many arguments, as sketched below. If the user supplied the single argument, the function uses cd to change to that directory and we make sure it worked correctly with the pwd command. If the user didn’t supply a command line argument, we warn them of their error and exit the script. This simple function adds an extra restriction to the cd command’s usage to help make your script safer.
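
A sketch of what that extra branch might look like, building on Listing 20's cdext function:

function cdext {
    # Exactly one argument is the only valid usage
    if [ $# -eq 1 ]
    then
        cd $1
    elif [ $# -gt 1 ]
    then
        echo "You supplied too many arguments."
        exit 1
    else
        echo "You must supply a directory to change to."
        exit 1
    fi
}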

To make the most of this technique you need to understand what types of things can go wrong with a command. Make sure that you’ve learned enough about the command, through resources like the man page, to handle the potential errors properly.

“Scrubbing” Error Output

What I mean by scrubbing in this instance is searching through the error output from a command looking for patterns. That pattern could be something like “file not found” or “file or directory does not exist”. Essentially what you’re doing is looking through the command’s output trying to find a string that will give you specific information about what error occurred. This method tends to be very brittle, meaning that the slightest change in the output can break your script. For this reason I don’t recommend this method, but in some cases it may be your only choice to gather more specific information about a command’s error condition. One method to make this technique slightly more robust would be to use regular expressions and case insensitivity. In Listing 21 I’ve provided a very simple example of output scrubbing.

Listing 21

$ ls doesnotexist 2>&1 | grep -i "file not found"
$ ls doesnotexist 2>&1 | grep -i "no such"
ls: cannot access doesnotexist: No such file or directory

Notice that I’m using the -i option of grep to make it case insensitive. I’m also redirecting both stdout and stderr into the pipe with the 2>&1 statement. That way I can search all of the command’s messages, errors, and warnings looking for the pattern of interest. In the first search statement I look for the pattern “file not found”, which is not a statement found in the ls command’s output. When I search for the statement “no such”, I get the line of output that contains the error. You could push this example a lot further with the use of regular expressions, but even if you’re very careful, a simple change to the command’s output by the developer could leave your script broken. I would suggest filing this technique away in your memory and using it only when you’re sure there’s not a better way to solve the problem.

Being A Good Linux/UNIX Citizen

There are some signals that we need to take extra care in dealing with, such as SIGINT. With SIGINT, all processes in the foreground see the signal, but the innermost (foremost) child process decides what will be done with it. The problem is that if the innermost process just absorbs the SIGINT signal and doesn’t act on it and/or send it on up to its parent, the user will be unable to exit the program/script with the Ctrl-C key combination. There are a few applications that trap this signal intentionally, which is fine, but doing this on your own can lead to unpredictable behavior and is what I would consider to be an undesirable practice. Try to avoid this in your own scripts unless you have a compelling reason to do otherwise and understand the consequences. To get around this issue we’ll propagate signals like SIGINT up the process stack to give the parent(s) a chance to react to them.

One way of handling error propagation is shown in Listing 22 where I’ve assumed that the shell is the direct parent of the script.

Listing 22

#!/bin/bash -

function int_handler {
    echo "SIGINT Caught"

    #Propagate the signal up to the shell
    kill -s SIGINT $$

    # 130 is the exit status from Ctrl-C/SIGINT
    exit 130
}

# Our trap to handle SIGINT/Ctrl-C
trap 'int_handler' INT

while true
do
    :
done

First of all, don’t get caught up in the trap statement if you don’t already know what it is; we’ll talk about traps shortly. This script busy-waits in a while loop until the user presses Ctrl-C or the system sends the SIGINT signal. When this happens the script uses the kill command to send SIGINT on up to the shell (whose process ID is represented by $$ in the line kill -s SIGINT $$), and then exits with an exit status corresponding to a forced exit due to SIGINT. This way the shell gets to decide what it wants to do with the SIGINT, and the exit status of our script can be examined to see what happened. Our script handles the signal properly and then allows everyone else above it to do the same.

Error Handling Functions

Since you’re most likely going to be using error handling code in multiple places in your script, it can be helpful to separate it out into a function. This keeps your script clean and free of duplicate code. Listing 23 shows one of the many ways of using a function to encapsulate some simple error handling functionality.

Listing 23

#!/bin/bash -

function err_handler {
    # Check to see which error code we were given
    if [ $1 -eq 1001 ]; then
        echo "Non-Fatal Error #1 Has Occurred"
        # We don't need to exit here
    elif [ $1 -eq 1002 ]; then
        echo "Fatal Error #2 Has Occurred"
        exit 1 # Error was fatal so exit with non-zero status
    fi
}

# Notice that I'm using my own made up error codes (1001, 1002)
err_handler 1001
err_handler 1002

Notice that I made up my own error codes (1001 and 1002). These have no correlation to the exit status of any of the commands that my script would use; they’re just for my own use. Using codes in this way keeps me from having to pass long error description strings to my function, which saves typing and keeps clutter out of my code. The drawback is that someone modifying the script later (maybe years later) can’t just glance at a line of code (err_handler 1001) and know which error it refers to. You can lessen this problem by placing error code descriptions in the comments at the top of your script. When I run the script in Listing 23 I get the output in Listing 24.

Listing 24

$ ./err_handler.sh
Non-Fatal Error #1 Has Occurred
Fatal Error #2 Has Occurred
$

Introducing The trap Command

The trap command allows you to associate a section of code with a particular signal (see Listing 5), so that when the signal is seen by the shell the code is run. The shell essentially sets up a signal handler for the signal associated with the trap. This can be very handy to allow you to correct for errors, log what happened, or remove things like temporary files before your script exits. These things highlight one of the downsides to using kill -9 because SIGKILL is one of the two signals that can’t be trapped. If you use SIGKILL, the process that you’re killing won’t get a chance to clean up after itself before exiting. That could leave things like temporary files and stale file locks around to cause problems later. It’s better to use SIGTERM to end a process because it gives the process a chance to clean up.

Listing 25 shows a couple of ways to use the trap command in a script.

Listing 25

#!/bin/bash -

function exit_handler {
    echo "Script Exiting"
}

trap "echo Ctrl-C Caught; exit 0" int
trap 'exit_handler' EXIT

while true
do
    :
done

Notice that I first use a semi-colon separated list of commands with trap to catch the SIGINT (Ctrl-C) signal. While this particular implementation is bad design because it doesn’t propagate SIGINT, it allows me to keep the example simple. The exit 0 statement is what causes the second trap that’s watching for the EXIT condition to be triggered. This second trap uses a function instead of a semi-colon separated list of commands. This is a cleaner way to handle traps that promotes code reuse, and except in simple cases should probably be your preferred method. Notice the form of the SIGINT specifier that I use at the end of the first trap statement. I use int because the prefix SIG is not required, and the signal declaration is not case sensitive. The same applies when using signals with commands like kill as well. You’re also not limited to specifying one signal per trap. You can append a list of signal specifiers onto the end of the trap statement and each one will use the error handling code specified within the trap.

One tip to be aware of is that you can specify the signals by their numeric representation, but I would advise against it. Using their symbolic representation tells anyone looking at your script (which could even be you years from now) at a glance which signal you’re using. There’s no chance for misinterpretation, and symbolic signals are more portable than just specifying a signal number since numbers tend to vary more by platform.
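
Putting those last two points together, a single trap statement can cover several symbolically specified signals at once. A small sketch:

#!/bin/bash -

function cleanup_handler {
    echo "Cleaning up before exit"
}

# One trap statement covering several symbolically-named signals
trap 'cleanup_handler' INT TERM HUP

sleep 60   # gives you time to send one of the signals and watch the handler run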

The output from running the script in Listing 25 and hitting Ctrl-C is shown in Listing 26. Notice that the SIGINT trap is processed before the EXIT trap. This is the expected behavior because the traps for all other signals should be processed before the EXIT trap.

Listing 26

$ ./trapuse.sh
^CCtrl-C Caught
Script Exiting
$

There are four signal specifiers that you’re probably going to be most interested in when using traps, and they are INT, TERM, EXIT, and ERR. All of these have been touched on so far except for ERR. If you remember from above, you could use set -o errexit to cause the shell to exit on an error. This was great from the standpoint that it kept your script from running after a potentially dangerous error had occurred, but it kept you from handling the error yourself. Setting a trap using the ERR signal specifier takes care of this shortcoming. The shell receives an ERR signal on the same conditions that cause an exit with errexit, so you can use a trap statement to do any clean up or error correction before exiting. ERR does have the limitation that an error is not detected if it is enclosed in a command sequence, an if statement’s test, a while or until statement, or if the command’s exit status is being inverted by an ! . On older versions of BASH, command substitutions $(...) that fail may not be caught by a trap statement either.
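
Here's a minimal sketch of an ERR trap that also demonstrates one of those exemptions:

#!/bin/bash -

function err_handler {
    echo "Error trapped near line $1"
}
trap 'err_handler $LINENO' ERR

false              # a plain failing command triggers the trap

if ! false; then   # inverted/tested statuses do NOT trigger it
    echo "Errors inside tests are exempt"
fi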

You can reset traps back to their original conditions before they were associated with commands using the - command specifier. For example, in the script in Listing 25 you could add the line trap - SIGINT after which the code for the SIGINT trap would no longer be called when the user hits Ctrl-C. You can also cause the shell to ignore signals by passing a null string as a signal specification as in trap "" SIGINT . This would cause the shell to ignore the user whenever they press the Ctrl-C key combination. This is not recommended though as it makes it harder for the user to terminate the process. It’s a better practice to do our clean up and then propagate the signal in the way that we talked about earlier. A handy trick is that you can simulate the functionality of the nohup command with a line like trap "" SIGHUP . What this does is cause your script to ignore the HUP (Hangup) signal so that it will keep running even after you’ve logged out.

If you run trap by itself without any arguments, it outputs the traps that are currently set. Using the -p option with trap causes the same behavior. You can also supply signal specifications (trap -p INT EXIT) and trap will output only the commands associated with those signals. This output can be redirected and stored, and with a little bit of work read back into a script to reinstate the traps later. Listing 27 shows two lines of output from the addition of the line trap -p to the script in Listing 25 just before the while loop.

Listing 27

trap -- 'exit_handler' EXIT
trap -- 'echo Ctrl-C Caught; exit 0' SIGINT

Even with all the information that I’ve given you on the trap command, there’s still more information to be had. I’ve tried to hit the highlights that I think will be most useful to you. You can open the BASH man page and search for “trap” if you want to dig deeper.

How-To

In this section I’m going to use a few of the different methods that we’ve discussed to fix the script in Listing 1. The goal is to protect the user from unexpected behavior such as having everything in their home directory deleted. I won’t cover every single way of solving the problem, instead I’ll be integrating a few of the topics we’ve covered into one script to show some practical applications. It’s my hope that by this point in the post you’re starting to see your own solutions and will be able to build on (and/or simplify) what I do here.

If you look at Listing 28 I’ve added the -u option to the shebang line of the script, and also added a check to make sure that the directory exists before changing to it.

Listing 28

#!/bin/bash -u

if [ ! -d $1 ];then
    echo "Please provide a valid directory."
    exit 1
fi

cd $1
rm -rf *

Listing 29 shows what happens when I make a couple of attempts at running the script incorrectly.

Listing 29

$ ./l1cor_1.sh
./l1cor_1.sh: line 3: $1: unbound variable
$ ./l1cor_1.sh /doesnotexist
Please provide a valid directory.

The -u option causes the unbound variable error because $1 will not be set if the user doesn’t supply at least one command line argument. The if/test statement declares that if the directory does not exist we will give the user an error message and then exit. There are also other checks that you could add to Listing 28 including one to make sure that the directory is writable by the current user. Ultimately you decide which checks are necessary, but the end goal with this particular example is to make sure that any dangerous behavior is avoided.

Listing 28 still has a problem because the rm command will run even if the cd command has thrown an error (like Permission denied). To fix this I’m going to rearrange the cd and rm commands into a command sequence using the && operator, and then check the exit status of the sequence. You can see these changes in Listing 30.

Listing 30

#!/bin/bash -u

if [ ! -d $1 ];then
    echo "Please provide a valid directory."
    exit 1
fi

cd $1 && rm -rf *

if [ $? -gt 0 ];then
    echo "An error occurred during the cd/rm process."
    exit 1
fi

The double ampersand (&&) will cause the command sequence to exit if the cd command fails, thus skipping the rm command. I do this to catch any of the other errors that can occur with the cd command. If there’s an unknown error with the cd command, we don’t want rm to delete all of the files/directories in the current directory. Remember that I can only check the exit status of the last command in the sequence, which doesn’t tell me whether it was cd or rm that failed. As a workaround, I’ll check to see if the rm command succeeded in the next step, where I set a trap on the EXIT signal. I’ve added the trap statement and a function to use with the trap in Listing 31.

Listing 31

#!/bin/bash -u

# A final check to let the user know if this script failed
# to perform its primary function - deleting files
function exit_handler {
    # Count the number of lines (files/dirs) in the directory
    DIR_ENTRIES=$(ls $1 | wc -l)

    # If there are still files in there throw an error message
    if [ $DIR_ENTRIES -gt 0 ];then
        echo "Some files/directories were not deleted"
        exit 1
    fi
}

# We want to check one last thing before exiting
trap 'exit_handler $1' EXIT

# If the directory doesn't exist, warn the user
if [ ! -d $1 ];then
    echo "Please provide a valid directory."
    exit 1
fi

# Don't execute rm unless cd succeeds and suppress messages
cd $1 &> /dev/null && rm -rf * &> /dev/null

# If there was an error with cd or rm, warn the user
if [ $? -gt 0 ];then
    echo "An error occurred during the cd/rm process."
    exit 1
fi

I’m not saying that this is the most efficient way to solve this problem, but it does show you some interesting uses of the techniques we’ve talked about. I went ahead and suppressed the messages from cd and rm so that I could substitute my own. This is done with the &> /dev/null additions to the command sequence. I also added the trap 'exit_handler $1' EXIT line to the script, which sets a trap for the EXIT signal and uses the exit_handler function to handle the event. Notice the use of single quotes around the 'exit_handler $1' argument to trap. This keeps the $1 variable reference from being expanded until the trap is called. We need that variable so that our exit handler can check the directory to make sure that all the files and directories were deleted. For our purposes the example script is now complete and does a reasonable job of protecting the user, but there is plenty of room for improvement. Tell us how you would change Listing 31 to make it better and/or simpler in the comments section of this post.

Tips and Tricks

  • You can sometimes use options with your commands to make them more fault tolerant. For instance the -p option of mkdir automatically creates the parents of the directory you specify if they don’t already exist. This keeps you from getting a No such file or directory error. Just make sure the options you use don’t introduce their own new problems.
  • It’s usually a good idea to enclose variables in quotation marks, especially the @ variable. Doing this ensures that your script can better handle spaces in filenames, paths, and arguments. So, doing something like echo "$@" instead of echo $@ can save you some trouble.
  • You can lessen your chances of leaving a file (like a system configuration file) in an inconsistent state if you make changes to a copy of the file and then use the mv command to put the altered file in place. Since mv typically only changes the information for the file and doesn’t move any bits, the changeover is much faster, so it’s less likely that another program will try to access the file while the change is being made. There are a few subtle issues to be aware of when using this method though. Have a look at David Pashley’s article (link #2) in the Resources section for more details, and see the sketch after this list for the basic pattern.
  • You can use parameter expansion (${...}) to avoid the null/unset variable problem that you see in Listing 1. Using a line like cd ${1:?"A directory to change to is required"} would display the phrase “A directory to change to is required” and exit the script if the user didn’t provide the command line argument represented by $1 . When used inside a script, the line gives you error output similar to ./expansion.sh: line 3: 1: A directory to change to is required
  • When you’re accepting input from a user, you can make your script more forgiving by using regular expressions and the case insensitive options of your commands. For instance, use the -i option of grep so that your script will not care whether it matches “Yes” or “yes”. With a regular expression, you could be as vague as ^[yY].* to match “y”, “Y”, “ya”, “Ya”, “Yeah”, “yeah”, “yes”, “Yes” and many other entries that begin with an upper/lower case “y” and have 0 or more letters that come after it.
  • Always check to make sure that you got the expected number of command line arguments before going any further in your script. If possible, also check the arguments to make sure that they’re what you expect (i.e. that a phone number wasn’t given for a directory name).
  • To avoid introducing portability errors when writing scripts for the Bourne Shell (sh), you can use the checkbashisms program from the devscripts package. This program will check to make sure that you don’t have any BASH specific statements in your Bourne Shell script.
  • Don’t catch an error at a low level inside your script and then fail to pass it back up the stack to the parent. Swallowing errors like that can cause your program to behave in a non-standard (non-Unix) way.
  • If you have a script that runs in the background, it can create a predefined file and redirect output to it so that you can see what/when/how/why your script exited.
  • If you use file locks in your scripts, you’ll want to check for dead/stale file locks each time your script starts. This is because a user may have issued a kill -9 (SIGKILL) command on your script, which doesn’t give your script a chance to clean up its lock files. If you don’t check for stale/dead locks, your user could end up having to remove the locks manually, which is definitely not ideal.
  • When you have a script that is processing a large amount of data/files, you can use trap to keep track of where your script was in the event of an unexpected exit. One way to do this would be to echo a filename into a predefined file when the trap is triggered. You can then read the start location back into the script when it starts up again and resume where you left off. If there’s a really large amount of data and you need to make sure your script keeps its place, you should probably already be continuously tracking the progress as part of the processing loop and using the trap(s) as a fallback.
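
As an example of the copy-and-mv tip above, the basic pattern might be sketched as follows. The config path and sed expression are placeholders, not a real system file:

#!/bin/bash -u
CONFIG="/etc/example.conf"    # placeholder path
TEMP="$CONFIG.tmp.$$"

# Write the edited version to a separate file first, then swap it into
# place with mv so readers never see a half-written config file.
sed 's/^Setting=.*/Setting=newvalue/' "$CONFIG" > "$TEMP" &&
mv "$TEMP" "$CONFIG" ||
{ echo "Update failed; original left untouched."; rm -f "$TEMP"; exit 1; }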

Scripting

In this scripting section I’m going to create a script that we can source to add ready made error handling functions to other scripts. You will also see a couple of conceptual additions such as the use of code blocks in an attempt to streamline sections of code. Listing 32 shows the modular script that you can source, and Listing 33 shows it in use.

Listing 32

#!/bin/bash -u
# File: error_source.sh
# Holds functions that can be used to more easily add error handling
# to your scripts.
# The -u option in the shebang line above causes the shell to throw
# an error whenever a variable is unset.

# Define our handlers for errors and/or forced exits
trap 'fatal_err $LINENO 1001' ERR   #Handle uncaught errors
trap 'clean_up; exit' HUP TERM      #Clean up and exit on SIGHUP or SIGTERM
trap 'clean_up; propagate' INT      #Clean up after and propagate SIGINT
trap 'clean_up' EXIT                #Clean up last thing before we exit

PROGNAME=$(basename $0)  #Error source program name
TEMPFILES=( )            #Array holding temp files to remove on script exit

# This function steps through each pipe section's exit status to see if
# there was an error anywhere. Takes as arguments the line number that's
# being checked and the list of pipe section exit statuses.
function check_pipe {
    # We want to see if there was an error somewhere in the pipeline
    for PIPEPART in $2
    do
        # There was an error at the current part of the pipeline
        if [ "$PIPEPART" != "0" ]
        then
            nonfatal_err $1 1002
            return 0; #We don't need to step through the rest
        fi
    done
}

# Function that gets rid of things like temp files before an exit.
function clean_up {
    # We want to remove all of the temp files we created
    for TFILE in ${TEMPFILES[@]}
    do
        # If the file doesn't exist, skip it
        [ -e $TFILE ] || continue

        # Notice the use of a code block to streamline this check
        {
            # If you use -f, errors are ignored
            rm --interactive=never $TFILE &> /dev/null
        } || nonfatal_err $LINENO 1001
    done
}

# Function to create "safe" temporary files which we'll get into more in the
# next blog post on security.
function create_temp {
    # Give preference to user tmp directory for security
    if [ -e "$HOME/tmp" ]
    then
        TEMP_DIR="$HOME/tmp"
    else
        TEMP_DIR="/tmp"
    fi

    # Construct a "safe" temp file name
    TEMP_FILE="$TEMP_DIR"/"$PROGNAME".$$.$RANDOM

    # Keep the file in an array to remove it later
    TEMPFILES+=( "$TEMP_FILE" )

    {
        touch $TEMP_FILE &> /dev/null
    } || fatal_err $LINENO "Could not create temp file $TEMP_FILE"
}

# Function that handles telling the user about critical errors that
# force an exit. It takes 2 arguments, a line number near where the
# error occurred, and an error code / message telling what happened.
function fatal_err {
    # Call function that will clean up temp files
    clean_up

    printf "Near line $1 in $PROGNAME: "

    # Check to see if the supplied error matches any predefined codes
    if [ "$2" == "1001" ];then
        printf "There has been an unknown fatal error.\n"
    # A custom error message has been specified by the caller
    else
        printf "$2\n"
    fi

    # We don't want to continue running with a fatal error
    exit 1
}

# Function that handles telling the user about non-critical errors
# that don't force an exit. It takes 2 arguments, a line number near
# where the error occurred, and an error code / message telling what
# happened.
function nonfatal_err {
    printf "Near line $1 in $PROGNAME: "

    # Check to see if the supplied error matches any predefined codes
    if [ "$2" == "1001" ];then
        printf "Could not remove temp file.\n"
    elif [ "$2" == "1002" ];then
        printf "There was an error in a pipe.\n"
    elif [ "$2" == "1003" ];then
        printf "A file you tried to access doesn't exist.\n"
    # A custom error message has been specified by the caller
    else
        printf "$2\n"
    fi
}

# Function that handles propagating the SIGINT signal up to the parent
# process, which in this case is assumed to be the shell.
function propagate {
    echo "Caught SIGINT"

    #Propagate the signal up to the shell
    kill -s SIGINT $$

    # 130 is the exit status from Ctrl-C/SIGINT
    exit 130
}

Listing 32 has 6 functions that are designed to handle various error related conditions. These functions are check_pipe, create_temp, clean_up, propagate, fatal_err, and nonfatal_err. The check_pipe function takes a list representing all the elements of the PIPESTATUS array variable, and steps through each item in the list to see if there was an error. If there was an error it throws a non-fatal error message, which could just as easily be a fatal error message that causes an exit. This makes it a little easier to check our pipes for errors without using set -o pipefail. This function could easily be modified to tell you which part of the pipe failed as well.

The create_temp function automates the process of creating “safe” temporary files for us. It gives preference to the user’s tmp directory, and uses the system /tmp directory if the user’s is not available. We’ll talk more about temporary file safety in the next blog post on security. The path/name of the temp file created is added to a global array so that it will be easier to remove it later on exit. Notice the use of the code block around the touch command that creates the temp file. It might have been easier to leave the braces out and just put the || right after the touch statement, but I felt that the code block helped streamline the code a little bit. The || at the end of the code block causes our error handling code to be executed if there’s an error with the last command in the block.

The clean_up function steps through the file names in our array of temporary files and deletes them. This is meant to be called just before we exit the script so that we don’t leave any stray temp files lying around. The function checks to make sure that it doesn’t try to delete files that have already been removed. This is to prevent a warning from being displayed when we have an error, thus calling clean_up and then exit, which also calls clean_up. There are other ways to handle this type of problem, but for our purposes the “skip if already deleted” method works fine. The propagate function uses the kill command to resend the INT signal on up to the shell, and then uses the exit command to set the exit status of the script to 130. This tells anyone checking the ? built-in variable that the script exited because of SIGINT.

The fatal_err and nonfatal_err functions are very similar, with the only difference being that fatal_err calls the clean_up function and the exit command when it runs. Both functions take 2 arguments: a line number and an error code or string. The line number is presumably the line near where the error occurred, but it won’t be exact. It’s designed to get a shell script developer close enough to the error that they should be able to find it. The error code is a four-digit number that’s used in an if statement (a case statement would be a little cleaner here) to see what error message should be given to the user. The else part of the statement allows the caller to provide their own custom error string. This way the caller isn’t stuck if they can’t find a code that fits their situation. If the script was going to see widespread general use, it might be best to dump all of the error codes into a separate function that fatal_err and nonfatal_err could both call. That way you would have consistent and reusable error codes across all of the functions.
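
If you did consolidate the codes into one place, the shared lookup might be sketched like this, with fatal_err and nonfatal_err both calling it:

# Sketch: one place to translate error codes into messages
function err_message {
    case "$1" in
        1001) printf "Could not remove temp file.\n" ;;
        1002) printf "There was an error in a pipe.\n" ;;
        1003) printf "A file you tried to access doesn't exist.\n" ;;
        *)    printf "%s\n" "$1" ;;   # fall back to a custom message
    esac
}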

To make sure that the functions are called properly, the script defines several traps at the top. The ERR signal is used to catch any errors that we haven’t handled ourselves. These are treated as “unknown” fatal errors since we obviously didn’t see them coming. The HUP and TERM signals are trapped so that we have a chance to run our clean_up function before exiting. Keep in mind that the KILL signal cannot be trapped, so if somebody runs kill -9 on our script, we’re still going to leave temp files behind. The INT signal is trapped to give us a chance to clean up as well, but we also take the opportunity to propagate the signal up to the shell. That way we’re not just absorbing SIGINT and preventing the world around us from reacting to it. The final trap is set on the EXIT condition and is our last chance to make sure that the temp files have been removed.

Listing 33

#!/bin/bash -u
# File: err_src_test.sh
# Tests the modular error_source.sh script which holds error handling functions.

# Include the modular error handling script so that we can use its functions.
. error_source.sh

# Use our function to create a random "safe" temp file
create_temp

# Be proactive in checking for problems like a file that doesn't exist
if [ -e doesnotexist ]
then
    ls doesnotexist
else
    nonfatal_err $LINENO 1003
fi

# Check a bad pipeline with a function we've created
true|false|true # Error not caught because of last true
PIPEST="${PIPESTATUS[@]}"
check_pipe $LINENO "$PIPEST"

# Check a good pipeline with the same function
true|true|true|true
PIPEST="${PIPESTATUS[@]}"
check_pipe $LINENO "$PIPEST"

# Generate a custom non-fatal error
nonfatal_err $LINENO "This is a custom error message."

# Generate an unhandled error
false

echo "The script shouldn't still be running here."

The Listing 33 implementation shows just a few ways to use the modular error handling script in one of your own scripts. The first thing that the script does is source the error_source.sh script so that it is treated like a part of our own. Once that’s done, the error handling functions can be called as if we had typed them directly into our script. That’s why we can call the create_temp function. Normally we would do something with the temporary file path/name that is created, but in this case I only want to create a temp file that can be removed later by the clean_up function. The next thing I do is be proactive in checking to see if a file/directory exists before I try to use it. If it doesn’t exist I throw a non-fatal error to warn the user. Normally you would want to throw a fatal error that would cause an exit here, but I want the script to fall all the way through to the last error so that the output in Listing 34 will be a little cleaner. Ultimately with this error handling method it’s your call on whether or not the script should exit on an error, but I would suggest erring on the side of exiting rather than letting the script continue with a potentially dangerous error in place.

The next section of Listing 33 has code that checks a pipeline with an error (the false in the middle), and after that there’s a check of a pipeline with no errors. This is done using the check_pipe function that we wrote earlier. You can see that I’ve basically converted the PIPESTATUS array elements into a string list before passing that to check_pipe. The list works a little more cleanly in the for loop that’s used to check each part of the pipeline.

Next, I’ve shown how to generate your own custom error by passing the nonfatal_err function a string instead of an error code. A custom string should fail all of the tests in the nonfatal_err if construct, causing the else to be triggered. This gives us the ability to create compact error handling code in our own scripts using error codes, but still gives us the flexibility to throw errors that haven’t been defined yet.

The last interesting thing that the script does is use the false command to generate an unhandled error, which is caught by the ERR signal’s trap. You can see that even if we miss handling an error manually, it still gets caught overall. The drawback is that although the user gets a line number for the error, they are given a message telling them that an unknown error has occurred, which doesn’t tell them very much. This is still preferable to letting your script run with an unhandled error though. The very last line of the script is just there to alert us that something very wrong has happened if our script reaches that point.

Listing 34 shows what happens when I run the script in Listing 33.

Listing 34

$ ./err_src_test.sh
Near line 16 in err_src_test.sh: A file you tried to access doesn't exist.
Near line 22 in err_src_test.sh: There was an error in a pipe.
Near line 30 in err_src_test.sh: This is a custom error message.
Near line 33 in err_src_test.sh: There has been an unknown fatal error.

If you have any additions or changes to the script(s) above don’t hesitate to tell us about it in the comments section. I would especially like to see what changes all of you would make to the script in Listing 32 to make it more useful and/or correct any flaws that it may have. Feel free to paste your updates to the code in the comments section.

Troubleshooting

This post was developed using BASH 4.0.x, so if you’re running an earlier version keep an eye out for subtle syntax differences and missing features. Post something in the comments section if you have any trouble so that we can try to help you out. Also, don’t forget to apply the debugging knowledge that you got from reading Post 1 in this series as you’re experimenting with these concepts.

Conclusion

As with shell script debugging, we can see that script error handling is a very in-depth subject. Unfortunately, error handling is often overlooked in shell scripts but is an important part of creating and maintaining production scripts. My goal with this post has been to give you a diverse set of tools to help you efficiently and effectively add error handling to your scripts. I know that opinions on this topic vary widely, so if you’ve got any suggestions or thoughts on the content of this post it would be great to hear from you. Leave a comment to let us know what you think. Thanks for reading.

Resources

Links

  1. Linux Journal, May 2008, Work The Shell, By Dave Taylor, “Handling Errors and Making Scripts Bulletproof”, pp 26-27
  2. Writing Robust Shell Scripts – DavidPashley.com
  3. Linux Planet Article On Making Friendlier Error Messages
  4. Linux Planet Article With A Good Example Of A Modularized Error Handling Script
  5. Errors and Signals and Traps (Oh My!) – Part 1 By William Shotts, Jr.
  6. Errors and Signals and Traps (Oh My!) – Part 2 By William Shotts, Jr.
  7. Turnkey Linux Article With Good Discussion In Comments Section
  8. Script Error Handling Overview
  9. Article On The “Proper handling of SIGINT/SIGQUIT”
  10. Script Error Handling Slide Presentation (Download Link)
  11. General UNIX Scripting Guide With Error Handling By Steve Parker
  12. Some General Thoughts On Making Scripts Better And Less Error Prone
  13. OpenGroup.org Article On Scripting Including A Section On “Exit Status and Errors”
  14. A checkbashisms man Page Entry
  15. Common Shell Mistakes and Error Handling Article
  16. CSIRO Advanced Scientific Computing Article
  17. Opinions On Error Handling On stackoverflow
  18. A Way To Handle Errors Using Their Error Messages
  19. Simple BASH Error Handling
  20. BASH FAQ Including Broken Pipe Warning Information
  21. Linux Journal Article On Named Pipes
  22. Example Use Of command_not_found_handle

Comments (3)

  1. Rodolfo

    2010/07/27 at 1:29 AM

    For creating temp files the “mktemp” command must be used.

  2. Jeremy Mack Wright

    2010/07/27 at 10:54 AM

    Hi Rodolfo,

    The “mktemp” command is certainly the preferred (safer) way of creating temp files, and I’ll talk about it in the next post in this series. It’s not the only way to create temp files though. What were you referring to when you said that the “mktemp command must be used”?

  3. Mr.Goldstink

    2010/07/27 at 7:00 PM

    THIS IS GOLD!! I needed help with Bash scripting so bad. Thank you!

The comments are now closed.