
Writing Better Shell Scripts – Part 2

Quick Start

As with Part 1 of this series, this information does not lend itself to having a “Quick Start” section. With that said, you can read the How-To section of this post for a quick general overview. I would highly recommend reading everything though, as a good understanding of the concepts and commands outlined here will serve you well in the future. Video and Audio are also included with this post which may work as a quick reference for you. Don’t forget that the man and info pages of your Linux/Unix system can be an invaluable resource as well when you’re learning commands and solving problems.

Video

Audio

Download

Preface

To make things easier on you, all of the black command line and script areas are set up so that you can copy the text from them. This does make using the commands easier, but if you’re not already familiar with the concepts presented here, typing the commands yourself and working through why you’re typing them will help you learn more. If you hit problems along the way, take a look at the Troubleshooting section near the end of this post for help.

There are formatting conventions that are used throughout this post that you should be aware of. The following is a list outlining the color and font formats used.

Command Name or Directory Path
Warning or Error
Command Line Snippet With Commands/Options/Arguments
Command Options and Their Arguments Only
Hyperlink

Overview

This post is the second in a series on shell script debugging, error handling, and security. The content of this post will be geared mainly toward BASH users, but there will be information that’s suitable for users of other shells as well. Information such as techniques and methodologies may transfer very well, but BASH specific constructs and commands will not. The users of other shells (CSH, KSH, etc) will have to do some homework to see what transfers and what does not.

There are a lot of opinions about how error handling should be done, which range from doing nothing to implementing comprehensive solutions. In this post, as well as in my professional work, I try to err on the side of in-depth solutions. Some people will argue that you don’t need to go through the trouble of providing error handling on small single-user scripts, but useful scripts have a way of growing past their original intent and user group. If you’re a system administrator, you need to be especially careful with error handling in your scripts. If you or an admin under you gets careless, someday you may end up getting a call from one of your users complaining that they just deleted the contents of their home directory – with one of your scripts. It’s easier to do than you might think when precautions are not taken. All you need are a couple of lines in your script like those in Listing 1.

Listing 1

#!/bin/bash
cd $1
rm -rf *

So what happens if a user forgets to supply a command line argument to Listing 1? The cd command changes into the user’s home directory, and the rm command deletes all of their files and directories without prompting. That has the makings of a bad day for both you and your user. In this post I’ll cover some ways to avoid this kind of headache.

To help ease the extra burden of making your scripts safer with error handling, we’ll talk about separating error handling code out into reusable modules which can be sourced. Once you do this and become familiar with a few error handling techniques, you’ll be able to implement robust error handling in your scripts with less effort.

The intent of this post is to give you the information you need to make good judgments about error handling within your own scripts. Both proactive and reactive error handling techniques will be covered so that you can make the decision on when to try to head off errors before they happen, and when to try to catch them after they happen. With those things in mind, let’s start off with some of the core elements of error handling.

BASH Options

There are several BASH command line options that can help you avoid some errors in your scripts. The first two are ones that we already covered in Part 1 of this series. The -e option, which is the same as set -o errexit, causes BASH to exit as soon as it detects an error. While there are a significant number of people who promote setting the -e option for all of your scripts, it can prevent you from using some of the other error handling techniques that we’ll be talking about shortly. The next option, -u, which is the same as set -o nounset, causes the shell to throw an error whenever a variable is used before its value has been set. This is a simple way to prevent the risky behavior of Listing 1. If the user does not provide an argument to the script, the shell will see the 1 variable ($1) as unset and complain. This is usually a good option to use in your scripts.

set -o pipefail is something that we’ll touch on in the Command Sequences section; it causes a whole command pipeline to error out if just one of its sections has an error. The last shell option that I want to touch on is set -o noclobber (or the -C option), which prevents the overwriting of existing files with redirection. You will just get an error similar to cannot overwrite existing file. This can save you when you’re working with system configuration files, as overwriting one of them could result in any number of big problems. Listing 2 holds a quick reference list of these options.

Listing 2

errexit (-e)     Causes the script to exit whenever there is an error.
noclobber (-C)   Prevents the overwriting of files when using redirection.
nounset (-u)     Causes the shell to throw an error whenever an unset variable is used.
pipefail         Causes a pipeline to error out if any section has an error.
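
If you’d like to see a couple of these options trip over a problem before moving on, here’s a minimal sketch (not one of the numbered listings). The file path and the NOT_DEFINED variable name are just placeholders for this demo.

#!/bin/bash -
# Sketch: watching noclobber and nounset catch problems
set -o noclobber    # Same as the -C option
set -o nounset      # Same as the -u option

DEMO_FILE=/tmp/noclobber_demo.$$   # Placeholder path for the demo

echo "first write" > "$DEMO_FILE"    # Creates the file
echo "second write" > "$DEMO_FILE"   # Fails: cannot overwrite existing file

echo "$NOT_DEFINED"                  # Fails: unbound variable, and the script exits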

Exit Status

Exit status is the 8-bit integer that is returned to a parent process when a subprocess exits (either normally or because it is forced to exit). Typically, an exit status of 0 means that the process completed successfully, and a greater than 0 exit status means that there was a problem. This may seem counterintuitive to C/C++ programmers who are used to true being 1 (non-zero) and false being 0. There are exceptions to the shell’s exit status standard, so it’s always best to understand how the distribution/shell/command combo you’re using will handle the exit status. An example of a command that acts differently is diff. When you run diff on two files, it will return 0 if the files are the same, 1 if the files are different, and some number greater than 1 if there was an error. So if you checked the exit status of diff expecting it to behave “normally”, you would think that the command failed when it was really telling you that the files are different.
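
Since diff is a good example of a command that strays from the simple success/failure convention, here’s a minimal sketch of how you might branch on its documented exit statuses (0 = same, 1 = different, greater than 1 = trouble). The two file names are whatever the caller passes in.

#!/bin/bash -
# Sketch: interpreting diff's exit status instead of treating any non-zero as failure
diff "$1" "$2" > /dev/null

case $? in
    0) echo "The files are identical" ;;
    1) echo "The files differ" ;;
    *) echo "diff reported an actual error" ;;
esac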

Probably the easiest way to begin experimenting with exit status is to use the BASH shell’s built-in ? variable. The ? variable holds the exit status of the last command that was run. Listing 3 shows an example where I check the exit status of the true command, which always gives an exit status of 0 (success), and of the false command, which always gives an exit status of 1 (failure). Credit goes to William Shotts, Jr., whose straightforward use of true and false in his examples on this topic inspired some of the examples in this post.

Listing 3

$true
$echo $?
0
$false
$echo $?
1

In this case the true and false commands follow the 0 = success, non-zero = failure standard, so we can be certain whether or not the command succeeded. As stated above though, the meaning of the exit status is not always so clear. I check the man page for any unfamiliar commands to see what their exit statuses mean, and I suggest you do the same with the commands you use. Listing 4 lists some of the standard exit statuses and their usual meanings.

Listing 4

0         Command completed successfully.
1-125     Command did not complete successfully. Check the command's man page for the meaning of the status.
126       Command was found, but couldn't be executed.
127       Command was not found.
128-254   Command died due to receiving a signal. The signal code is added to 128 (128 + SIGNAL) to get the status.
130       Command exited due to Ctrl-C being pressed.
255       Exit status is out of range.

For statuses 128 through 254, you see that the signal that caused the command to exit is added to the base status of 128. This allows you to subtract 128 from the given exit status later to see which signal was the culprit. Some of the signals that can be added to the base of 128 are shown in Listing 5 and were obtained from the signal man page via man 7 signal . Note that SIGKILL and SIGSTOP cannot be caught, blocked, or ignored because those signals are handled at the kernel level. You may see all of these signals at one time or another, but the most common are SIGHUP, SIGINT, SIGQUIT, SIGKILL, SIGTERM, and SIGSTOP.

Listing 5

Signal     Value      Action   Comment
──────────────────────────────────────────────────────────────────────
SIGHUP        1        Term    Hangup detected on controlling terminal
                               or death of controlling process
SIGINT        2        Term    Interrupt from keyboard
SIGQUIT       3        Core    Quit from keyboard
SIGILL        4        Core    Illegal Instruction
SIGABRT       6        Core    Abort signal from abort(3)
SIGFPE        8        Core    Floating point exception
SIGKILL       9        Term    Kill signal
SIGSEGV      11        Core    Invalid memory reference
SIGPIPE      13        Term    Broken pipe: write to pipe with no readers
SIGALRM      14        Term    Timer signal from alarm(2)
SIGTERM      15        Term    Termination signal
SIGUSR1   30,10,16     Term    User-defined signal 1
SIGUSR2   31,12,17     Term    User-defined signal 2
SIGCHLD   20,17,18     Ign     Child stopped or terminated
SIGCONT   19,18,25     Cont    Continue if stopped
SIGSTOP   17,19,23     Stop    Stop process
SIGTSTP   18,20,24     Stop    Stop typed at tty
SIGTTIN   21,21,26     Stop    tty input for background process
SIGTTOU   22,22,27     Stop    tty output for background process

A listing of signals which only shows the symbolic and numeric representations without the descriptions can be obtained with either kill -l or trap -l .
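
If you want to see the 128 + SIGNAL arithmetic for yourself, one quick sketch is to have a child shell terminate itself and then inspect the status it hands back. The exact “Terminated” message may vary a bit between systems, but the status math should hold.

$bash -c 'kill -s TERM $$'
Terminated
$echo $?
143
$kill -l $((143 - 128))
TERM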

You can explicitly pass the exit status of the last command executed back to the parent process (most likely the shell) with a line like exit $? . You can do the same thing implicitly by calling the exit command without an argument. This works fine if you want to exit immediately, but if you want to do some other things with the exit status first you’ll need to store it in a variable. This is because the ? variable is overwritten by every command that runs, including the command you use to check it. Listing 6 shows one way of using an if statement to pass the exit status back to the parent after implementing your own error handling functionality.

Listing 6

#!/bin/bash -

# Run the command(s)
false

# Save the exit status (the next command we run would overwrite it)
EXITSTAT=$?

# If the command has a non-zero exit status
if [ $EXITSTAT -gt 0 ]
then
    echo "There was an error."
    exit $EXITSTAT #Pass the exit status back to parent
fi

You can also use an if statement to directly test the exit status of a command, as in Listing 7. Notice that testing the command this way consumes its exit status inside the if statement, so you can’t check the ? variable for it later.

Listing 7

#!/bin/bash -

# If the command has a non-zero exit status
if ! false
then
    echo "There was an error."
    exit 1
fi

The if ! false statement is the key here. What’s inside of the if statement will be executed if the command (in this case false) returns a non-zero exit status. Using this type of statement can give you a chance to warn the user of what’s going on and take any actions that are needed before the script exits.

You can also use the if and test combination in more complex ways. For instance, according to its man page, the ls command uses an exit status of 0 for no errors, 1 for minor errors like not being able to access a subdirectory, and 2 for major errors like not being able to access a file/directory specified on the command line. With this in mind, take a look at Listing 8 to see how you could differentiate between the “no error”, “minor error”, and “major error” conditions.

Listing 8

#!/bin/bash -

function testex {
    # $? is overwritten by each new command, so save the value passed in
    exitstat=$1

    # See which condition we have
    if test $exitstat -eq 0; then
        echo "No error detected"
    elif test $exitstat -eq 1; then
        echo "Minor error detected"
    elif test $exitstat -eq 2; then
        echo "Major error detected"
    fi
}

# Try a listing that should succeed
echo "- 'ls ~/*' Executing"
ls ~/* &> /dev/null

# Check the success/failure of the ls command
testex $?

# Try a listing that should not succeed
echo "- 'ls doesnotexist' Executing"
ls doesnotexist &> /dev/null
testex $?

Inside the testex function I have placed code that looks for specific exit statuses and then tells the user what was found. Normally you wouldn’t worry about handling the situation where there’s no error (exit status 0), but doing so helps clarify the concept in our example. The output that you would get from running this script is shown in Listing 9.

Listing 9

$ ./testex.sh
- 'ls ~/*' Executing
No error detected
- 'ls doesnotexist' Executing
Major error detected

There are a couple of final things to be aware of when you’re using the ? variable. First, remember that every command you run from the command line or in a script overwrites its value, so if you need to use the ? variable more than once you’ll want to store its value in another variable and use that. The second is that ? becomes ineffective when you are using the -e option or the line set -o errexit. The reason for this is that the script will exit as soon as an error is detected, and so you never get a chance to check the ? variable.
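
A quick sketch makes that second point easier to see. With errexit set, the script never reaches the line that checks the ? variable:

#!/bin/bash -
set -o errexit              # Same as the -e option

false                       # The script exits right here
echo "Exit status was $?"   # Never reached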

The command_not_found_handle Function

As of BASH 4.0, the provision for a command_not_found_handle function has been added. This function makes it possible to display user friendly messages when a command the user types is not found. BASH searches for the command, and if it’s not found anywhere, BASH looks to see if you have the command_not_found_handle function defined. If you do, that function is invoked with the attempted command and its arguments so that a useful message can be displayed. If you use a Debian or Ubuntu system you’ve probably seen this in action, as they’ve had this feature for a while. Listing 10 shows an example of the command_not_found_handle function output on an Ubuntu 9.10 system.

Listing 10

$cat2
No command 'cat2' found, did you mean:
 Command 'cat' from package 'coreutils' (main)
cat2: command not found

You can implement/override the behavior of the command_not_found_handle function to provide your own functionality. Listing 11 shows an implementation of the command_not_found_handle function inside of a stand-alone script. In most cases you would want to add it to your BASH configuration file(s) so that you can make use of the function anytime that you’re at the shell prompt.

Listing 11

#!/bin/bash -
# File: cmdnf.sh

function command_not_found_handle {
    echo "The command ($1) is not valid."
    exit 127 #The command not found status
}

cat2

You would access the arguments to the original (not found) command via $2, $3 and so on. Notice that I used the exit command and passed it the code of 127, which is the command not found exit status. The exit status of the whole script is the exit status of the command_not_found_handle function. If you don’t set the exit status explicitly the script will end up returning 0 (success), thus preventing a user or script from using the exit status to determine what type of error occurred. Propagation of the exit status and terminating signal (which we’ll talk about later) is a good thing to do to prevent your users from missing important information and/or having problems. When run, the script in Listing 11 gives you the following output in Listing 12.

Listing 12

$./cmdnf.sh
The command (cat2) is not valid.
$echo $?
127

Command Sequences

Command sequences are multiple commands that are linked by pipes or logical short-circuit operators. Two logical short-circuits are the double ampersand (&&) and double pipe (||) operators. The && only allows the command that comes after it in the series to be executed if the previous command exited with a status of 0. The || operator does the opposite by only allowing the next command to be executed if the previous one returned a non-zero exit status. Listing 13 shows examples of how each of these work.

Listing 13

$true && echo 'Hello World!'
Hello World!
$false && echo 'Hello World!'
$true || echo 'Hello World!'
$false || echo 'Hello World!'
Hello World!

So, one of the many ways to solve the unset variable problem we see in Listing 1 is the example shown in Listing 14.

Listing 14

#!/bin/bash

#Make sure the user provided a command line argument
[ -n "$1" ] || { echo "Please provide a command line argument."; exit 1; }

#Change to the directory and delete the files and dirs
cd $1 && rm -rf *

In the first line of interest, we check to make sure that the value of $1 is not null. If that test command fails, it means that $1 is unset and that the user did not provide a command line argument. Since the || operator only allows the next command to run if the previous one fails, our code block warns the user of their mistake and exits with a non-zero status. If a command line argument was supplied, the script continues on. In the second interesting line we use the && operator to run the rm command if, and only if, the cd command succeeds. This keeps us from accidentally deleting all of the files and directories in the user’s/script’s current working directory if the cd command fails for some reason.

The next type of command sequence that we’re going to cover is a pipeline. When commands are piped together, only the last return code will be looked at by the shell. If you have a series of pipes like the one in Listing 15, you would expect it to show a non-zero exit status, but instead it’s 0.

Listing 15

$true | false | true
$echo $?
0

To change the shell’s behavior so that it will return a non-zero value for a pipeline if any of its elements have a non-zero exit status, use the set -o pipefail line in your script. The result of using pipefail is shown in Listing 16.

Listing 16

$set -o pipefail
$true | false | true
$echo $?
1

This method doesn’t give you any insight into where in the pipeline your error occurred though. In many cases I prefer to use the BASH array variable PIPESTATUS to check pipelines. It gives you the ability to tell where in the pipeline the error occurred, so that your script can more intelligently adapt to or warn about the error. Listing 17 gives an example.

Listing 17

$true | false | true
$echo ${PIPESTATUS[0]} ${PIPESTATUS[1]} ${PIPESTATUS[2]}
0 1 0

To keep things clean inside your script, you might put the code to check the PIPESTATUS array into a function and use a loop to process the array elements. This way you have reusable code that will automatically adjust to the number of commands that are in your pipe. One of the scripts in the Scripting section shows this technique.
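
If you’d like a preview of that idea before we get there, here’s a minimal sketch (the function name is just an example). Note that PIPESTATUS has to be captured right after the pipeline, before any other command runs:

#!/bin/bash -
# Sketch: checking every section of a pipeline by looping over PIPESTATUS
function check_pipe_parts {
    local INDEX=0
    for PART in "$@"
    do
        if [ "$PART" -ne 0 ]; then
            echo "Pipeline section $INDEX failed with status $PART"
        fi
        INDEX=$((INDEX + 1))
    done
}

true | false | true
check_pipe_parts "${PIPESTATUS[@]}"   # Reports that section 1 failed with status 1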

If you’re running a version of BASH prior to 3.1, a potential problem with using pipes is the Broken pipe warning. If a reader in a pipeline finishes before its writer completes, the writer command will get a SIGPIPE signal, which causes the Broken pipe warning to be thrown. It may be a non-issue for you, but it doesn’t hurt to be aware of it. If you’re running a version of BASH that’s 3.1 or higher, you can use the PIPESTATUS variable to see if there’s been a pipe error. I’ve done this in Listing 18, where I’ve written two scripts that will cause the pipeline to break. The code inside the scripts doesn’t really matter in this case, just the end result.

Listing 18

$./pipeerr2.sh | ./pipeerr.sh
test
test
test
$echo ${PIPESTATUS[0]} ${PIPESTATUS[1]}
141 0

You can see that the pipe exit status for the first script (or pipeline section) is 141. This number actually results from the addition of a base exit status and the signal code, which I’ve mentioned before. The base status is 128, which the shell uses to signify that a command stopped due to receiving a signal rather than exiting normally. Added to that is the code of the signal that caused the termination, which in this case is 13 (SIGPIPE) on my system. This technique embeds the signal code in the exit status in a way that makes it easy to retrieve. Since the status is built by adding 128 and 13, all I have to do is use arithmetic expansion to extract the signal code from Listing 18: echo $((${PIPESTATUS[0]}-128)) . This gives me output showing the value of 13, which is what we expect. Keep in mind that the PIPESTATUS array variable is like the ? variable in that it resets once you access it or a new pipeline is executed.

As stated in Part 1 of this series, you can replace pipes with temporary files. This will eliminate the SIGPIPE and exit status pitfalls of pipes, but as stated before temp files are much slower than pipes and require you to clean them up after you’re done with them. In general, I would suggest staying away from temp files unless you have a compelling reason to use them. A compromise between temp files and pipes might be named pipes. On modern Linux systems you use the mkfifo command to create a named pipe, which you can then use with redirection. On older systems you may have to use mknod instead to create the pipe. In Listing 19 you can see that I’ve used named pipes instead of regular pipes, and that this technique allows me to check each of the sections of the pipeline as they’re used. Keep in mind that I’m reading from the named pipe in another terminal with cat < pipe1 since a line like true > pipe1 will block until the pipe has been read from. Also notice that I use the rm command to delete the named pipe after I’m done with it. I do this as a housekeeping measure, since I don’t want to leave named pipes laying around that I don’t need.

Listing 19

$mkfifo pipe1
$true > pipe1
$echo $?
0
$false > pipe1
$echo $?
1
$rm pipe1

Wrapper Functions

If there’s a command that you’re using multiple times in your script and that command requires some error handling, you might want to think about creating a wrapper function. For instance, in Listing 1 the cd command has the unwanted side effect of switching to the user’s home directory if the user hasn’t supplied a command line argument. If you’re using cd multiple times throughout the script, you could write a function that extends cd’s functionality. Listing 20 shows an example of this.

Listing 20

#!/bin/bash -

function cdext {
    # We want to make sure that the user gave an argument
    if [ $# -eq 1 ]
    then
        cd $1
    else
        echo "You must supply a directory to change to."
        exit 1
    fi
}

# This should succeed
cdext /tmp

# Make sure that it did succeed
pwd

# This should fail with our warning
cdext

I first use the shell’s built-in # variable to make sure that the user has specified a single argument. It would probably also be a good idea to add a separate else clause to warn the user if they supplied too many arguments. If the user supplied the single argument, the function uses cd to change to that directory, and we make sure it worked correctly with the pwd command. If the user didn’t supply a command line argument, we warn them of their error and exit the script. This simple function adds an extra restriction to the cd command’s usage to help make your script safer.

To make the most of this technique you need to understand what types of things can go wrong with a command. Make sure that you’ve learned enough about the command, through resources like the man page, to handle the potential errors properly.

“Scrubbing” Error Output

What I mean by scrubbing in this instance is searching through the error output from a command looking for patterns. That pattern could be something like “file not found” or “file or directory does not exist”. Essentially what you’re doing is looking through the command’s output trying to find a string that will give you specific information about what error occurred. This method tends to be very brittle, meaning that the slightest change in the output can break your script. For this reason I don’t recommend this method, but in some cases it may be your only choice to gather more specific information about a command’s error condition. One method to make this technique slightly more robust would be to use regular expressions and case insensitivity. In Listing 21 I’ve provided a very simple example of output scrubbing.

Listing 21

$ls doesnotexist 2>&1 | grep -i "file not found"
$ls doesnotexist 2>&1 | grep -i "no such"
ls: cannot access doesnotexist: No such file or directory

Notice that I’m using the -i option of grep to make it case insensitive. I’m also redirecting both stdout and stderr into the pipe with the 2>&1 statement. That way I can search all of the command’s messages, errors, and warnings looking for the pattern of interest. In the first search statement I look for the pattern “file not found”, which is not a statement found in the ls command’s output. When I search for the statement “no such”, I get the line of output that contains the error. You could push this example a lot further with the use of regular expressions, but even if you’re very careful, a simple change to the command’s output by the developer could leave your script broken. I would suggest filing this technique away in your memory and using it only when you’re sure there’s not a better way to solve the problem.

Being A Good Linux/UNIX Citizen

There are some signals that we need to take extra care in dealing with, such as SIGINT. With SIGINT, all processes in the foreground see the signal, but the innermost (foremost) child process decides what will be done with it. The problem with this is that if the innermost process just absorbs the SIGINT signal and doesn’t act on it and/or send it on up to its parent, the user will be unable to exit the program/script with the Ctrl-C key combination. There are a few applications that trap this signal intentionally, which is fine, but doing this on your own can lead to unpredictable behavior and is what I would consider to be an undesirable practice. Try to avoid this in your own scripts unless you have a compelling reason to do otherwise and understand the consequences. To get around this issue we’ll propagate signals like SIGINT up the process stack to give the parent(s) a chance to react to them.

One way of handling error propagation is shown in Listing 22 where I’ve assumed that the shell is the direct parent of the script.

Listing 22

#!/bin/bash -

function int_handler {
    echo "SIGINT Caught"

    #Propagate the signal up to the shell
    kill -s SIGINT $$

    # 130 is the exit status from Ctrl-C/SIGINT
    exit 130
}

# Our trap to handle SIGINT/Ctrl-C
trap 'int_handler' INT

while true
do
    :
done

First of all, don’t get caught up in the trap statement if you don’t already know what it is. We’ll talk about traps shortly. This script busy waits in a while loop until the user presses Ctrl-C or the system sends the SIGINT signal. When this happens, the script uses the kill command to send SIGINT on up to the shell (whose process ID is represented by $$ in the line kill -s SIGINT $$), and then exits with an exit status corresponding to a forced exit due to SIGINT. This way the shell gets to decide what it wants to do with the SIGINT, and the exit status of our script can be examined to see what happened. Our script handles the signal properly and then allows everyone else above it to do the same.

Error Handling Functions

Since you’re most likely going to be using error handling code in multiple places in your script, it can be helpful to separate it out into a function. This keeps your script clean and free of duplicate code. Listing 23 shows one of the many ways of using a function to encapsulate some simple error handling functionality.

Listing 23

#!/bin/bash -

function err_handler {
    # Check to see which error code we were given
    if [ $1 -eq 1001 ]; then
        echo "Non-Fatal Error #1 Has Occurred"
        # We don't need to exit here
    elif [ $1 -eq 1002 ]; then
        echo "Fatal Error #2 Has Occurred"
        exit 1 # Error was fatal so exit with non-zero status
    fi
}

# Notice that I'm using my own made up error codes (1001, 1002)
err_handler 1001
err_handler 1002

Notice that I made up my own error codes (1001 and 1002). These have no correlation to any exit status of any of the commands that my script would use, they’re just for my own use. Using codes in this way keeps me from having to pass long error description strings to my function, and thus saves typing, space, and clutter in my code. The drawback is that someone modifying the script later (maybe years later) can’t just glance at a line of code (err_handler 1001) and know what error it is referring to. You could help lessen this problem by placing error code descriptions in the comments at the top of your script. When I run the script in Listing 23 I get the output in Listing 24.

Listing 24

$./err_handler.sh
Non-Fatal Error #1 Has Occurred
Fatal Error #2 Has Occurred
$

Introducing The trap Command

The trap command allows you to associate a section of code with a particular signal (see Listing 5), so that when the signal is seen by the shell the code is run. The shell essentially sets up a signal handler for the signal associated with the trap. This can be very handy to allow you to correct for errors, log what happened, or remove things like temporary files before your script exits. These things highlight one of the downsides to using kill -9 because SIGKILL is one of the two signals that can’t be trapped. If you use SIGKILL, the process that you’re killing won’t get a chance to clean up after itself before exiting. That could leave things like temporary files and stale file locks around to cause problems later. It’s better to use SIGTERM to end a process because it gives the process a chance to clean up.

Listing 25 shows a couple of ways to use the trap command in a script.

Listing 25

#!/bin/bash -

function exit_handler {
    echo "Script Exiting"
}

trap "echo Ctrl-C Caught; exit 0" int
trap 'exit_handler' EXIT

while true
do
    :
done

Notice that I first use a semi-colon separated list of commands with trap to catch the SIGINT (Ctrl-C) signal. While this particular implementation is bad design because it doesn’t propagate SIGINT, it allows me to keep the example simple. The exit 0 statement is what causes the second trap that’s watching for the EXIT condition to be triggered. This second trap uses a function instead of a semi-colon separated list of commands. This is a cleaner way to handle traps that promotes code reuse, and except in simple cases should probably be your preferred method. Notice the form of the SIGINT specifier that I use at the end of the first trap statement. I use int because the prefix SIG is not required, and the signal declaration is not case sensitive. The same applies when using signals with commands like kill as well. You’re also not limited to specifying one signal per trap. You can append a list of signal specifiers onto the end of the trap statement and each one will use the error handling code specified within the trap.
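
For example, a single trap line like the small sketch below would run the same handler for SIGINT, SIGTERM, and SIGHUP (clean_up here is just a placeholder for a function you would define yourself):

# One handler shared by several signals; clean_up is a placeholder function name
trap 'clean_up' INT TERM HUP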

One tip to be aware of is that you can specify the signals by their numeric representation, but I would advise against it. Using their symbolic representation tells anyone looking at your script (which could even be you years from now) at a glance which signal you’re using. There’s no chance for misinterpretation, and symbolic signals are more portable than just specifying a signal number since numbers tend to vary more by platform.

The output from running the script in Listing 25 and hitting Ctrl-C is shown in Listing 26. Notice that the SIGINT trap is processed before the EXIT trap. This is the expected behavior because the traps for all other signals should be processed before the EXIT trap.

Listing 26

$./trapuse.sh
^CCtrl-C Caught
Script Exiting
$

There are four signal specifiers that you’re probably going to be most interested in when using traps, and they are INT, TERM, EXIT, and ERR. All of these have been touched on so far except for ERR. If you remember from above, you could use set -o errexit to cause the shell to exit on an error. This was great from the standpoint that it kept your script from running after a potentially dangerous error had occurred, but it kept you from handling the error yourself. Setting a trap using the ERR specifier takes care of this shortcoming. The ERR trap is triggered under the same conditions that cause an exit with errexit, so you can use a trap statement to do any clean up or error correction before exiting. ERR does have the limitation that an error is not detected if it is enclosed in a command sequence, an if statement test, a while or until statement, or if the command’s exit status is being inverted by an ! . On older versions of BASH, command substitutions $(...) that fail may not be caught by a trap statement either.
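
Here’s a minimal sketch of an ERR trap. Keep in mind that the $LINENO value is only approximate, and that without errexit the script would keep running after the handler fires unless you exit explicitly:

#!/bin/bash -
# Sketch: reacting to errors with an ERR trap instead of plain errexit
trap 'echo "Error near line $LINENO (status $?)"; exit 1' ERR

false    # Triggers the ERR trap, which reports the problem and then exits
echo "This line is never reached."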

You can reset traps back to their original conditions before they were associated with commands using the - command specifier. For example, in the script in Listing 25 you could add the line trap - SIGINT after which the code for the SIGINT trap would no longer be called when the user hits Ctrl-C. You can also cause the shell to ignore signals by passing a null string as a signal specification as in trap "" SIGINT . This would cause the shell to ignore the user whenever they press the Ctrl-C key combination. This is not recommended though as it makes it harder for the user to terminate the process. It’s a better practice to do our clean up and then propagate the signal in the way that we talked about earlier. A handy trick is that you can simulate the functionality of the nohup command with a line like trap "" SIGHUP . What this does is cause your script to ignore the HUP (Hangup) signal so that it will keep running even after you’ve logged out.
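
The short sketch below pulls those ideas together; long_running_task is just a placeholder for whatever work you don’t want interrupted:

#!/bin/bash -
# Sketch: temporarily ignoring SIGINT, then restoring default handling
trap "" INT          # Ignore Ctrl-C during the critical section
long_running_task    # Placeholder for work that shouldn't be interrupted
trap - INT           # Put SIGINT handling back to its default

trap "" HUP          # Rough stand-in for nohup: keep running after logout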

If you run trap by itself without any arguments, it outputs the traps that are currently set. Using the -p option with trap causes the same behavior. You can also supply signal specifications (trap -p INT EXIT) and trap will output only the commands associated with those signals. This output can be redirected and stored, and with a little bit of work read back into a script to reinstate the traps later. Listing 27 shows two lines of output from the addition of the line trap -p to the script in Listing 25 just before the while loop.

Listing 27

trap -- 'exit_handler' EXIT
trap -- 'echo Ctrl-C Caught; exit 0' SIGINT

Even with all the information that I’ve given you on the trap command, there’s still more information to be had. I’ve tried to hit the highlights that I think will be most useful to you. You can open the BASH man page and search for “trap” if you want to dig deeper.

How-To

In this section I’m going to use a few of the different methods that we’ve discussed to fix the script in Listing 1. The goal is to protect the user from unexpected behavior such as having everything in their home directory deleted. I won’t cover every single way of solving the problem, instead I’ll be integrating a few of the topics we’ve covered into one script to show some practical applications. It’s my hope that by this point in the post you’re starting to see your own solutions and will be able to build on (and/or simplify) what I do here.

If you look at Listing 28 I’ve added the -u option to the shebang line of the script, and also added a check to make sure that the directory exists before changing to it.

Listing 28

#!/bin/bash -u

if [ ! -d $1 ];then
    echo "Please provide a valid directory."
    exit 1
fi

cd $1
rm -rf *

Listing 29 shows what happens when I make a couple of attempts at running the script incorrectly.

Listing 29

$./l1cor_1.sh
./l1cor_1.sh: line 3: $1: unbound variable
$./l1cor_1.sh /doesnotexist
Please provide a valid directory.

The -u option causes the unbound variable error because $1 will not be set if the user doesn’t supply at least one command line argument. The if/test statement declares that if the directory does not exist we will give the user an error message and then exit. There are also other checks that you could add to Listing 28 including one to make sure that the directory is writable by the current user. Ultimately you decide which checks are necessary, but the end goal with this particular example is to make sure that any dangerous behavior is avoided.

Listing 28 still has a problem because the rm command will run even if the cd command has thrown an error (like Permission denied). To fix this I’m going to rearrange the cd and rm commands into a command sequence using the && operator, and then check the exit status of the sequence. You can see these changes in Listing 30.

Listing 30

#!/bin/bash -u

if [ ! -d $1 ];then
    echo "Please provide a valid directory."
    exit 1
fi

cd $1 && rm -rf *

if [ $? -gt 0 ];then
    echo "An error occurred during the cd/rm process."
    exit 1
fi

The double ampersand (&&) will cause the command sequence to stop if the cd command fails, so the rm command is never run. I do this to catch any of the other errors that can occur with the cd command. If there’s an unknown error with the cd command, we don’t want rm to delete all of the files/directories in the current directory. Remember that I can only check the exit status of the last command in the sequence, which doesn’t tell me whether it was cd or rm that failed. As a workaround, I’ll check to see if the rm command succeeded in the next step, where I set a trap on the EXIT signal. I’ve added the trap statement and a function to use with the trap in Listing 31.

Listing 31

#!/bin/bash -u

# A final check to let the user know if this script failed
# to perform its primary function - deleting files
function exit_handler {
    # Count the number of lines (files/dirs) in the directory
    DIR_ENTRIES=$(ls $1 | wc -l)

    # If there are still files in there throw an error message
    if [ $DIR_ENTRIES -gt 0 ];then
        echo "Some files/directories were not deleted"
        exit 1
    fi
}

# We want to check one last thing before exiting
trap 'exit_handler $1' EXIT

# If the directory doesn't exist, warn the user
if [ ! -d $1 ];then
    echo "Please provide a valid directory."
    exit 1
fi

# Don't execute rm unless cd succeeds and suppress messages
cd $1 &> /dev/null && rm -rf * &> /dev/null

# If there was an error with cd or rm, warn the user
if [ $? -gt 0 ];then
    echo "An error occurred during the cd/rm process."
    exit 1
fi

I’m not saying that this is the most efficient way to solve this problem, but it does show you some interesting uses of the techniques we’ve talked about. I went ahead and suppressed the messages from cd and rm so that I could substitute my own. This is done with the &> /dev/null additions to the command sequence. I also added the trap 'exit_handler $1' EXIT line to the script, which sets a trap for the EXIT signal and uses the exit_handler function to handle the event. Notice the use of single quotes around the 'exit_handler $1' argument to trap. This keeps the $1 variable reference from being expanded until the trap is called. We need that variable so that our exit handler can check the directory to make sure that all the files and directories were deleted. For our purposes the example script is now complete and does a reasonable job of protecting the user, but there is plenty of room for improvement. Tell us how you would change Listing 31 to make it better and/or simpler in the comments section of this post.

Tips and Tricks

  • You can sometimes use options with your commands to make them more fault tolerant. For instance the -p option of mkdir automatically creates the parents of the directory you specify if they don’t already exist. This keeps you from getting a No such file or directory error. Just make sure the options you use don’t introduce their own new problems.
  • It’s usually a good idea to enclose variables in quotation marks, especially the @ variable. Doing this ensures that your script can better handle spaces in filenames, paths, and arguments. So, doing something like echo "$@" instead of echo $@ can save you some trouble. See the short sketch after this list for a demonstration.
  • You can lessen your chances of leaving a file (like a system configuration file) in an inconsistent state if you make changes to a copy of the file and then use the mv command to put the altered file in place. Since mv on the same filesystem typically only updates the file’s directory entry and doesn’t copy any data, the changeover is much faster, so it’s less likely that another program will try to access the file while the change is being made. There are a few subtle issues to be aware of when using this method though. Have a look at David Pashley’s article (link #2) in the Resources section for more details.
  • You can use parameter expansion (${...}) to avoid the null/unset variable problem that you see in Listing 1. Using a line like cd ${1:?"A directory to change to is required"} would display the phrase “A directory to change to is required” and exit the script if the user didn’t provide the command line argument represented by $1 . When used inside a script, the line gives you error output similar to ./expansion.sh: line 3: 1: A directory to change to is required
  • When you’re accepting input from a user, you can make your script more forgiving by using regular expressions and the case insensitive options of your commands. For instance, use the -i option of grep so that your script will not care whether it matches “Yes” or “yes”. With a regular expression, you could be as vague as ^[yY].* to match “y”, “Y”, “ya”, “Ya”, “Yeah”, “yeah”, “yes”, “Yes” and many other entries that begin with an upper/lower case “y” and have 0 or more letters that come after it.
  • Always check to make sure that you got the expected number of command line arguments before going any further in your script. If possible, also check the arguments to make sure that they’re what you expect (i.e. that a phone number wasn’t given for a directory name).
  • To avoid introducing portability errors when writing scripts for the Bourne Shell (sh), you can use the checkbashisms program from the devscripts package. This program will check to make sure that you don’t have any BASH specific statements in your Bourne Shell script.
  • Don’t catch an error at a low level inside your script and then fail to pass it back up the stack to the parent. This can cause your program to behave in a non-standard (non-Unix) way.
  • If you have a script that runs in the background, it can create a predefined file and redirect output to it so that you can see what/when/how/why your script exited.
  • If you use file locks in your scripts, you’ll want to check for dead/stale file locks each time your script starts. This is because a user may have issued a kill -9 (SIGKILL) command on your script, which doesn’t give your script a chance to clean up its lock files. If you don’t check for stale/dead locks, your user could end up having to remove the locks manually, which is definitely not ideal.
  • When you have a script that is processing a large amount of data/files, you can use trap to keep track of where your script was in the event of an unexpected exit. One way to do this would be to echo a filename into a predefined file when the trap is triggered. You can then read the start location back into the script when it starts up again and resume where you left off. If there’s a really large amount of data and you need to make sure your script keeps its place, you should probably already be continuously tracking the progress as part of the processing loop and using the trap(s) as a fallback.
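
As promised in the quoting tip above, here’s a small sketch showing why "$@" matters. The script name is just an example:

#!/bin/bash -
# File: quotedemo.sh (example name) - shows word splitting with and without quotes
echo "Unquoted \$@:"
for ARG in $@
do
    echo "  [$ARG]"
done

echo "Quoted \"\$@\":"
for ARG in "$@"
do
    echo "  [$ARG]"
done

Running ./quotedemo.sh "my file.txt" other.txt shows the unquoted loop splitting "my file.txt" into two separate arguments, while the quoted loop keeps it intact.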

Scripting

In this scripting section I’m going to create a script that we can source to add ready made error handling functions to other scripts. You will also see a couple of conceptual additions such as the use of code blocks in an attempt to streamline sections of code. Listing 32 shows the modular script that you can source, and Listing 33 shows it in use.

Listing 32

#!/bin/bash -u
# File: error_source.sh
# Holds functions that can be used to more easily add error handling
# to your scripts.
# The -u option in the shebang line above causes the shell to throw
# an error whenever a variable is unset.

# Define our handlers for errors and/or forced exits
trap 'fatal_err $LINENO 1001' ERR   #Handle uncaught errors
trap 'clean_up; exit' HUP TERM      #Clean up and exit on SIGHUP or SIGTERM
trap 'clean_up; propagate' INT      #Clean up after and propagate SIGINT
trap 'clean_up' EXIT                #Clean up last thing before we exit

PROGNAME=$(basename $0)             #Error source program name
TEMPFILES=( )                       #Array holding temp files to remove on script exit

# This function steps through each pipe section's exit status to see if
# there was an error anywhere. Takes as arguments the line number that's
# being checked and the list of pipe section exit statuses.
function check_pipe {
    # We want to see if there was an error somewhere in the pipeline
    for PIPEPART in $2
    do
        # There was an error at the current part of the pipeline
        if [ "$PIPEPART" != "0" ]
        then
            nonfatal_err $1 1002
            return 0; #We don't need to step through the rest
        fi
    done
}

# Function that gets rid of things like temp files before an exit.
function clean_up {
    # We want to remove all of the temp files we created
    for TFILE in ${TEMPFILES[@]}
    do
        # If the file doesn't exist, skip it
        [ -e $TFILE ] || continue

        # Notice the use of a code block to streamline this check
        {
            # If you use -f, errors are ignored
            rm --interactive=never $TFILE &> /dev/null
        } || nonfatal_err $LINENO 1001
    done
}

# Function to create "safe" temporary files which we'll get into more in the
# next blog post on security.
function create_temp {
    # Give preference to user tmp directory for security
    if [ -e "$HOME/tmp" ]
    then
        TEMP_DIR="$HOME/tmp"
    else
        TEMP_DIR="/tmp"
    fi

    # Construct a "safe" temp file name
    TEMP_FILE="$TEMP_DIR"/"$PROGNAME".$$.$RANDOM

    # Keep the file in an array to remove it later
    TEMPFILES+=( "$TEMP_FILE" )

    {
        touch $TEMP_FILE &> /dev/null
    } || fatal_err $LINENO "Could not create temp file $TEMP_FILE"
}

# Function that handles telling the user about critical errors that
# force an exit. It takes 2 arguments, a line number near where the
# error occurred, and an error code / message telling what happened.
function fatal_err {
    # Call function that will clean up temp files
    clean_up

    printf "Near line $1 in $PROGNAME: "

    # Check to see if the supplied error matches any predefined codes
    if [ "$2" == "1001" ];then
        printf "There has been an unknown fatal error.\n"
    # A custom error message has been specified by the caller
    else
        printf "$2\n"
    fi

    # We don't want to continue running with a fatal error
    exit 1
}

# Function that handles telling the user about non-critical errors
# that don't force an exit. It takes 2 arguments, a line number near
# where the error occurred, and an error code / message telling what
# happened.
function nonfatal_err {
    printf "Near line $1 in $PROGNAME: "

    # Check to see if the supplied error matches any predefined codes
    if [ "$2" == "1001" ];then
        printf "Could not remove temp file.\n"
    elif [ "$2" == "1002" ];then
        printf "There was an error in a pipe.\n"
    elif [ "$2" == "1003" ];then
        printf "A file you tried to access doesn't exist.\n"
    # A custom error message has been specified by the caller
    else
        printf "$2\n"
    fi
}

# Function that handles propagating the SIGINT signal up to the parent
# process, which in this case is assumed to be the shell.
function propagate {
    echo "Caught SIGINT"

    #Propagate the signal up to the shell
    kill -s SIGINT $$

    # 130 is the exit status from Ctrl-C/SIGINT
    exit 130
}

Listing 32 has 6 functions that are designed to handle various error related conditions. These functions are check_pipe, create_temp, clean_up, propagate, fatal_err, and nonfatal_err. The check_pipe function takes a list representing all the elements of the PIPESTATUS array variable, and steps through each item in the list to see if there was an error. If there was an error it throws a non-fatal error message, which could just as easily be a fatal error message that causes an exit. This makes it a little easier to check our pipes for errors without using set -o pipefail. This function could easily be modified to tell you which part of the pipe failed as well.

The create_temp function automates the process of creating “safe” temporary files for us. It gives preference to the user’s tmp directory, and uses the system /tmp directory if the user’s is not available. We’ll talk more about temporary file safety in the next blog post on security. The path/name of the temp file created is added to a global array so that it will be easier to remove it later on exit. Notice the use of the code block around the touch command that creates the temp file. It might have been easier to leave the brackets out and just put the || right after the touch statement, but I felt that the code block helped streamline the code a little bit. The || at the end of the code block causes our error handling code to be executed if there’s an error with the last command in the block.

The clean_up function steps through the file names in our array of temporary files and deletes them. This is meant to be called just before we exit the script so that we don’t leave any stray temp files laying around. The function checks to make sure that it doesn’t try to delete files that have already been removed. This is to prevent a warning from being displayed when we have an error, thus calling clean_up and then exit which also calls clean_up. There are other ways to handle this type of problem, but for our purposes the “skip if already deleted” method works fine. The propagate function uses the kill command to resend the INT signal on up to the shell, and then uses the exit command to set the exit status of the script to 130. This tells anyone checking the ? built-in variable that the script exited because of SIGINT.

The fatal_err and nonfatal_err functions are very similar, with the only difference being that fatal_err calls the clean_up function and the exit command when it runs. Both functions take 2 arguments, which are a line number and an error code or string. The line number is presumably the line near where the error occurred, but it won’t be exact. It’s designed to get a shell script developer close enough to the error that they should be able to find it. The error code is a 4 digit number that’s used in an if statement (a case statement would be a little cleaner here) to see what error message should be given to the user. The else part of the statement allows the caller to provide their own custom error string. This way the caller isn’t stuck if they can’t find a code that fits their situation. If the script was going to see widespread general use, it might be best to dump all of the error codes into a separate function that fatal_err and nonfatal_err could both call. That way you would have consistent and reusable error codes across all of the functions.

To make sure that the functions are called properly, the script defines several traps at the top. The ERR signal is used to catch any errors that we haven’t handled ourselves. These are treated as “unknown” fatal errors since we obviously didn’t see them coming. The HUP and TERM signals are trapped so that we have a chance to run our clean_up function before exiting. Keep in mind that the KILL signal cannot be trapped, so if somebody runs kill -9 on our script, we’re still going to be leaving temp files behind. The INT signal is trapped to give us a chance to clean up as well, but we also take the opportunity to propagate the signal up to the shell. That way we’re not just absorbing SIGINT and refusing to let the world around us react to it. The final trap is set on the EXIT condition and is our last chance to make sure that the temp files have been removed.

Listing 33

#!/bin/bash -u
# File: err_src_test.sh
# Tests the modular error_source.sh script which holds error handling functions.

# Include the modular error handling script so that we can use its functions.
. error_source.sh

# Use our function to create a random "safe" temp file
create_temp

# Be proactive in checking for problems like a file that doesn't exist
if [ -e doesnotexist ]
then
    ls doesnotexist
else
    nonfatal_err $LINENO 1003
fi

# Check a bad pipeline with a function we've created
true|false|true # Error not caught because of last true
PIPEST="${PIPESTATUS[@]}"
check_pipe $LINENO "$PIPEST"

# Check a good pipeline with the same function
true|true|true|true
PIPEST="${PIPESTATUS[@]}"
check_pipe $LINENO "$PIPEST"

# Generate a custom non-fatal error
nonfatal_err $LINENO "This is a custom error message."

# Generate an unhandled error
false

echo "The script shouldn't still be running here."

The Listing 33 implementation shows just a few ways to use the modular error handling script in one of your own scripts. The first thing that the script does is source the error_source.sh script so that it is treated like a part of our own. Once that’s done, the error handling functions can be called as if we had typed them directly into our script. That’s why we can call the create_temp function. Normally we would do something with the temporary file path/name that is created, but in this case I only want to create a temp file that can be removed later by the clean_up function. The next thing I do is be proactive in checking to see if a file/directory exists before I try to use it. If it doesn’t exist I throw a non-fatal error to warn the user. Normally you would want to throw a fatal error that would cause an exit here, but I want the script to fall all the way through to the last error so that the output in Listing 34 will be a little cleaner. Ultimately with this error handling method it’s your call on whether or not the script should exit on an error, but I would suggest erring on the side of exiting rather than letting the script continue with a potentially dangerous error in place.

The next section of Listing 33 has code that checks a pipeline with an error (the false in the middle), and after that there’s a check of a pipeline with no errors. This is done using the check_pipe function that we wrote earlier. You can see that I’ve basically converted the PIPESTATUS array elements into a string list before passing that to check_pipe. The list works a little more cleanly in the for loop that’s used to check each part of the pipeline.

Next, I’ve shown how to generate your own custom error by passing the nonfatal_err function a string instead of an error code. A custom string should fail all of the tests in the nonfatal_err if construct, causing the else to be triggered. This gives us the ability to create compact error handling code in our own scripts using error codes, but still gives us the flexibility to throw errors that haven’t been defined yet.

The last interesting thing that the script does is use the false command to generate an unhandled error, which is caught by the ERR signal’s trap. You can see that even if we miss handling an error manually, it still gets caught overall. The drawback is that although the user gets a line number for the error, they are given a message telling them that an unknown error has occurred, which doesn’t tell them very much. This is still preferable to letting your script run with an unhandled error though. The very last line of the script is just there to alert us that something very wrong has happened if our script reaches that point.

Listing 34 shows what happens when I run the script in Listing 33.

Listing 34

$./err_src_test.sh
Near line 16 in err_src_test.sh: A file you tried to access doesn't exist.
Near line 22 in err_src_test.sh: There was an error in a pipe.
Near line 30 in err_src_test.sh: This is a custom error message.
Near line 33 in err_src_test.sh: There has been an unknown fatal error.

If you have any additions or changes to the script(s) above don’t hesitate to tell us about it in the comments section. I would especially like to see what changes all of you would make to the script in Listing 32 to make it more useful and/or correct any flaws that it may have. Feel free to paste your updates to the code in the comments section.

Troubleshooting

This post was developed using BASH 4.0.x, so if you’re running an earlier version keep an eye out for subtle syntax differences and missing features. Post something in the comments section if you have any trouble so that we can try to help you out. Also, don’t forget to apply the debugging knowledge that you got from reading Post 1 in this series as you’re experimenting with these concepts.

Conclusion

As with shell script debugging, we can see that script error handling is a very in-depth subject. Unfortunately, error handling is often overlooked in shell scripts but is an important part of creating and maintaining production scripts. My goal with this post has been to give you a diverse set of tools to help you efficiently and effectively add error handling to your scripts. I know that opinions on this topic vary widely, so if you’ve got any suggestions or thoughts on the content of this post it would be great to hear from you. Leave a comment to let us know what you think. Thanks for reading.

Resources

Books

Links

  1. Linux Journal, May 2008, Work The Shell, By Dave Taylor, “Handling Errors and Making Scripts Bulletproof”, pp 26-27
  2. Writing Robust Shell Scripts – DavidPashley.com
  3. Linux Planet Article On Making Friendlier Error Messages
  4. Linux Planet Article With A Good Example Of A Modularized Error Handling Script
  5. Errors and Signals and Traps (Oh My!) – Part 1 By William Shotts, Jr.
  6. Errors and Signals and Traps (Oh My!) – Part 2 By William Shotts, Jr.
  7. Turnkey Linux Article With Good Discussion In Comments Section
  8. Script Error Handling Overview
  9. Article On The “Proper handling of SIGINT/SIGQUIT”
  10. Script Error Handling Slide Presentation (Download Link)
  11. General UNIX Scripting Guide With Error Handling By Steve Parker
  12. Some General Thoughts On Making Scripts Better And Less Error Prone
  13. OpenGroup.org Article On Scripting Including A Section On “Exit Status and Errors”
  14. A checkbashisms man Page Entry
  15. Common Shell Mistakes and Error Handling Article
  16. CSIRO Advanced Scientific Computing Article
  17. Opinions On Error Handling On stackoverflow
  18. A Way To Handle Errors Using Their Error Messages
  19. Simple BASH Error Handling
  20. BASH FAQ Including Broken Pipe Warning Information
  21. Linux Journal Article On Named Pipes
  22. Example Use Of command_not_found_handle

Writing Better Shell Scripts – Part 1

Quick Start

The information presented in this post doesn’t really lend itself to having a “Quick Start” section, but if you’re in a hurry we have a How-To section along with Video and Audio included with this post that may be a good quick reference for you. There are some really great general references in the Resources section that may help you as well.

Video

General Debugging

BASHDB Overview

Audio

Download

Preface

To make things easier on you, all of the black command line and script areas are set up so that you can copy the text from them. This does make using the commands and scripts easier, but if you’re not already familiar with the concepts presented here, typing things yourself and working through why you’re typing them will help you learn more. If you hit problems along the way, take a look at the Troubleshooting section near the end of this post for help.

There are formatting conventions that are used throughout this post that you should be aware of. The following is a list outlining the color and font formats used.

Command Name or Directory Path
Warning or Error
Command Line Snippet With Commands/Options/Arguments
Command Options and Their Arguments Only
Hyperlink

Overview

This post is the first in a series on shell script debugging, error handling, and security. Although I’ll be presenting some methodologies and techniques that apply to all shell languages (and most programming languages), this series will focus very heavily on BASH. Users of other shells like CSH will need to do some homework to see what information transfers and what does not.

One of the difficulties with debugging a shell script is that BASH typically doesn’t give you very much information to go on. You might get error output showing a line number, but that’s just the line where the shell became aware of the error, not necessarily the line where the error actually occurred. Add in a vague error message such as the one in Listing 1, and it gets difficult to tell what’s going on inside your script.

Listing 1

$ ./buggy_script.sh
./buggy_script.sh: line 23: syntax error: unexpected end of file

This post is written with the intent of giving you knowledge that will help when you see an error like the one in Listing 1 while trying to run a script. This type of error is just one of many errors that the shell may give you, and is more easily dealt with when you have a good understanding of scripting syntax and the debugging tools at your disposal.

Along with talking about debugging tools/techniques, I’m going to introduce a handy script debugger called BASHDB. BASHDB allows you to step through a script in much the same way as a program debugger like GNU’s GDB does with C code.

By the end of this post you should be armed with enough knowledge to handle the majority of debugging needs that you have. There’s a lot of information here, but taking the time to learn it will help make you more effective in your work with Linux.

Command Line Script Debugging

BASH has several command line options for debugging your shell scripts, and some of these are shown in Listing 2. These options will be applied to your entire script though, so it’s an all-or-nothing trade off. Later in this post I’ll talk about more selective methods of debugging.

Listing 2

-n   Checks for syntax errors without executing the script (noexec).
-u   Causes an error to be thrown whenever you try to access a variable that has not been set (nounset).
-v   Sends all lines to standard error (stderr) as they are read, even comments.
-x   Turns on execution tracing (xtrace) which displays each command as it is executed.

All of the options in Listing 2 can be used just like options with other programs (bash -x scriptname), or with the built-in set command as shown later. With the -x option, the number of + characters before each of the lines of output denotes the subshell level. The more + characters there are, the further down into nested subshells you are. If there are no + characters at the start of the line, then the line is the normal output from the execution of the script. You can use the -x and -v options together for verbose execution tracing, but the amount of output can become a little overwhelming. Using the -n and -v options together provides a verbose syntax check without executing the script.

If you decide to use the -x and -v options together, it can be helpful to use redirection in conjunction with a pager like less, or the tee command to help you handle the information. The shell sends debugging output to stderr and the normal output to stdout, so you’ll need to redirect both of them if you want the full picture of what’s going on. To do this and use the less pager to handle the information, you would use a command line like bash -xv scriptname 2>&1 | less . Instead of seeing the debugging output scroll by in the shell, you’ll be placed into the less pager where you’ll have access to functions like scrolling and search. While using the pager in this way, it’s possible that you may get an error like Broken pipe if you exit the pager before the script is done executing. This error has to do with the script trying to write output to something (less) that’s no longer there, and in this case can be ignored.

If you would prefer to redirect the debugging output to a file for later review and/or processing, you can use tee: bash -xv scriptname 2>&1 | tee scriptname.dbg . You will see the debugging output scroll by on the screen, but if you check the current working directory you will also find the scriptname.dbg file which holds the redirected output. This is what the tee command does for you. It allows you to send the output to a file while still displaying it on the screen. If the script will take a while to run, you can alter the redirection operator slightly, put the script in the background, and then use tail -f scriptname.dbg to follow the updates to the file. You can see this in action in Listing 3, where I’ve created a script that runs in an infinite loop (the code is incorrect on purpose) generating output every 2 seconds. I start the script in the background, redirecting the output to the infinite_loop.dbg file only (not to the screen too). I then start the tail -f command to follow the file for a few iterations, and then hit Ctrl-C to interrupt the tail command. Once you understand how to redirect the debugging output in this way, it’s fairly easy to split the debugging and regular output into separate files, as shown in the short example after Listing 3.

Listing 3

$ bash -xv infinite_loop.sh &> infinite_loop.dbg &
[1] 9777
$ tail -f infinite_loop.dbg
num=0
+ num=0
while [ $num -le 10 ]
do
    sleep 2
    echo "Testing"
done
+ '[' 0 -le 10 ']'
+ sleep 2
+ echo Testing
Testing
+ '[' 0 -le 10 ']'
+ sleep 2
^C
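As a follow-up to the point above about splitting the streams, remember that the trace goes to stderr while the script’s normal output goes to stdout, so the split is just a matter of redirecting each stream on its own. A couple of variations on the Listing 3 command line that do exactly that:

# Execution trace (stderr) goes to the .dbg file, regular output stays on the screen
bash -xv infinite_loop.sh 2> infinite_loop.dbg

# Or capture the two streams in separate files for later comparison
bash -xv infinite_loop.sh > infinite_loop.out 2> infinite_loop.dbg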

Internal Script Debugging

This section is called “Internal Script Debugging” because it focuses on changes that you make to the script itself to add debugging functionality. The easiest change to make in order to enable debugging is to change the shebang line of the script (the first line) to include the shell’s normal command line switches. So, instead of a shebang line like #!/bin/bash - you would have #!/bin/bash -xv. There are also both external and built-in commands for the BASH shell that make it easier for you to debug your code, the first of which is set.

The set command allows you to set shell options while your script is running. The options of the most interest for our purposes are the ones from Listing 2. For example, you can enclose sections of your script between the set -x and set +x command lines. By doing this you enable debugging for only the section of code within those lines, giving you control over what specific section of the script is debugged. Listing 4 shows a very simple script using this technique, and Listing 5 shows the script in action.

Listing 4

#!/bin/bash -
# File: set_example.sh

echo "Output #1"

set -x #Debugging on
echo "Output #2"
set +x #Debugging off

echo "Output #3"

Listing 5

$ ./set_example.sh
Output #1
+ echo 'Output #2'
Output #2
+ set +x
Output #3

As you can see, the debugging output looks like you started the script with the bash -x command line. The difference is that you get to control what is traced and what is not, instead of having the execution of the whole script traced. Notice that the command to disable execution tracing (set +x) is included in the execution trace. This makes sense because execution tracing is not actually turned off until after the set +x line is done executing.

Output statements (echo/print/printf) are useful for getting information from your script at specific points. You can use output statements to track the progression of logic throughout your script by doing things like evaluating variable values and shell expansions, and finding infinite loops. Another advantage of using output statements is that you can control the format. When using command line debugging switches you have little or no control over the format, but with echo, print, and printf, you have the opportunity to customize the output to display in a way that makes sense to you.
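As a quick illustration of that formatting control, here’s the kind of throwaway debug statement you might drop into a script while chasing a problem (the username and count variables are just made-up examples):

# Temporary debug output - formatted so it's easy to spot and grep for later.
# The username and count variables are hypothetical stand-ins.
printf "DEBUG %s line %d: username='%s' count=%d\n" "$0" "$LINENO" "$username" "$count"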

You can utilize a DEBUG function to provide a flexible and clean way to turn debugging output on and off in your script. Listing 6 shows the script in Listing 4 with the addition of the DEBUG function, and Listing 7 shows one way to switch the debugging on and off from the command line using a variable.

Listing 6

#!/bin/bash -
# File: func_example.sh

# This function can be used to selectively enable/disable debugging.
# Use with the set command to debug sections of the script.
function DEBUG()
{
    # Check to see if the enable debugging variable is set
    if [ -n "${DEBUG_ENABLE+x}" ]
    then
        # Run whatever command/option/argument combo that was
        # passed to our DEBUG function.
        $@
    fi
}

echo "Output #1"

DEBUG set -x #Debugging on
echo "Output #2"
DEBUG set +x #Debugging off

echo "Output #3"

Listing 7

$ ./func_example.sh #Without debugging
Output #1
Output #2
Output #3
$ DEBUG_ENABLE=true ./func_example.sh #With debugging
Output #1
+ echo 'Output #2'
Output #2
+ DEBUG set +x
+ '[' -n x ']'
+ set +x
Output #3

The DEBUG function treats the rest of the line after it as its arguments. If the DEBUG_ENABLE variable is set, the DEBUG function runs its arguments (the rest of the line) as a command via the $@ parameter. So, any line that has DEBUG in front of it can be turned on or off by simply setting/unsetting one variable from the command line or inside your script. This method gives you a lot of flexibility in how you set up debugging in your script, and allows you to easily hide that functionality from your end users if needed.

Instead of requiring a user to set an environment variable on the command line to enable debugging, you can add command line options to your script. For instance, you could have the user run your script with a -d option (./scriptname -d) in order to enable debugging. The mechanism that you use could be as simple as having the -d option set the DEBUG_ENABLE variable inside of the script. An example of this, with the addition of multiple debugging levels, can be seen in the Scripting section.

Another technique that you can use to track down problems in your script is to write data to temporary files instead of using pipes. Temp files are many times slower than pipes though, so I would use them sparingly and in most cases only for temporary debugging. There is a Linux Journal article by Dave Taylor (April 2010) referenced in the Resources section that talks about using temporary files in the article’s script. In a nutshell, you replace the pipe operator (|) with a redirection to file (> $temp), where $temp is a variable holding the name of your temporary file. You read the temporary file back into the script with another redirection operator (< $temp). This allows you to examine the temporary file for errors in the script’s pipeline. Listing 8 shows a very simplified example of this.

Listing 8

#!/bin/bash -

# Set the path and filename for the temp file
temp="./example.tmp"

# Dump a list of numbers into the temp file
printf "1\n2\n3\n4\n5\n" > $temp

# Process the numbers in the temp file via a loop
while read input_val
do
    # We won't do any real work, just output the values
    echo $input_val
done < $temp # Feeds the temp file into the loop

# Clean up our temp file
rm $temp

The last debugging technique that I'm going to touch on here is writing to the system log. You can use the logger command to write debugging output to /var/log/messages, or to another file if you use the -f option. I consider this technique to be primarily for production scripts that have already been released to your users, and it's a mechanism that you don't want to abuse. Flooding your system log with script debugging messages would be counterproductive for you and/or your system administrator. It's best to only log mission-critical messages like warnings and errors in this way.

To use the logger command to help track script debugging information, you would just add a line like logger "${BASH_SOURCE[0]} - My script failed somewhere before line $LINENO." to your script. The line that this adds in the system log looks like the output line in Listing 9. There are a couple of variables that I've thrown in here to make my entry in the system log more descriptive. One is BASH_SOURCE, which is an array that in this case holds the name and path of the script that logged the message. The other is LINENO, which holds the current line number that you are on in your script. There are several other useful environment variables built into the newer versions of BASH (>= 3.0). Some of these other variables (all arrays) include BASH_LINENO, BASH_ARGC, BASH_ARGV, BASH_COMMAND, BASH_EXECUTION_STRING, and BASH_SUBSHELL. See the BASH man page for details.

Listing 9

$ tail -1 /var/log/messages
May 28 14:35:35 testhost jwright: ./logger_test.sh - My script failed somewhere before line 11.
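If you want to see a few of these variables in action without digging through the man page, a quick throwaway script like the one below will print them. The function name is arbitrary; it's only there so that BASH_LINENO has a caller to report on.

#!/bin/bash -
# Quick look at a few of BASH's built-in debugging variables.
function show_debug_vars()
{
    echo "Script:           ${BASH_SOURCE[0]}"
    echo "Current line:     $LINENO"
    echo "Called from line: ${BASH_LINENO[0]}"
    echo "Subshell level:   $BASH_SUBSHELL"
}

show_debug_vars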

Introducing BASHDB

As I mentioned before, BASHDB is a debugger that does for BASH scripts what GNU's GDB does for C/C++ programs. BASHDB can do a lot, and it has four main features to help you eliminate errors from your scripts. First, it can start a script with options, arguments, and anything else that might affect its operation. Second, it allows you to set conditions on which a script will stop. Third, it gives you the ability to examine what's going on at the point in a script where it's stopped. Fourth, BASHDB allows you to manipulate things like variable values before telling the script to move on.

You can type bashdb scriptname to start BASHDB and set it to debug the script scriptname. Listing 10 shows a couple of useful options for the bashdb program.

Listing 10

-X   Traces the entire script from beginning to end without putting bashdb in interactive mode. Notice that it's capital X, not lowercase.
-c   Tests/traces a single string command. For example, "bashdb -c ls *" will allow you to step through the command string "ls *" inside the debugger.

In order to show where you're at, BASHDB displays the full path and current line number of the running script above the prompt. In interactive mode, the prompt BASHDB gives you looks something like bashdb<(1)> where 1 is the number of commands that have been executed. The parentheses around the command number denote the number of subshells you are nested within. The more parentheses there are, the deeper into subshells you are nested. Listing 11 gives a decent command reference that you can use when debugging scripts at the BASHDB interactive mode prompt.

Listing 11

-           Lists the current line and up to 10 lines that came before it.
backtrace   Abbreviated "T". Shows the trace of calls including things like functions and sourced files that have brought the script to where it is now. You can follow "backtrace" with a number, and only that number of calls will be shown.
break       Abbreviated "b". Sets a persistent breakpoint at the current line unless followed by a number, in which case a breakpoint is set at the line specified by the number. See the "continue" command for a shortcut to specifying the line number.
continue    Abbreviated "c". Resumes execution of the script and moves to the next stopping point or breakpoint. If followed by a number, "continue" works in a similar way as issuing the "break" command followed by the number and then the continue command. The difference is that "continue" sets a one time breakpoint whereas "break" sets a persistent one.
edit        Opens the text editor specified by the EDITOR environment variable to allow you to make and save changes to the current script. Typing "edit" by itself will start editing on the current line. If "edit" is followed by a number, editing will start on the line specified by that number. Once you're done editing you have to type "restart" or "R" to reload and restart the script with your changes.
help        Abbreviated "h". Lists all of the commands that are available when running in interactive mode. When you follow "help" or "h" with a command name, you are shown information on that command.
list        Abbreviated "l". Lists the current line and up to 10 lines that come after it. If followed by a number, "list" will start at the specified line and print the next 10 lines. If followed by a function name, "list" starts at the beginning of the function and prints up to 10 lines.
next        Abbreviated "n". Moves execution of the script to the next instruction, skipping over functions and sourced files. If followed by a number, "next" will move that number of instructions before stopping.
print       Abbreviated "p". When followed by a variable name, prints the value of the specified variable. Example: print $VARIABLE
quit        Exits from BASHDB.
set         Allows you to change the way BASH interacts with you while running BASHDB. You can follow "set" with an argument and then the words "on" or "off" to enable/disable a feature. Example: "set linetrace on".
step        Abbreviated "s". Moves execution of the script to the next instruction. "step" will move down into functions and sourced files. See the "next" command if you need behavior that skips these. If followed by a number, "step" will move that number of instructions before stopping.
x           Similar to the "print" command, but more powerful. Can print variable and function definitions, and can be used to explore the effects of a change to the current value of a variable. Example: "x n-1" subtracts 1 from the variable "n" and displays the result.

Normally when you hit the Enter/Return key without entering a command, BASHDB executes the next command. This behavior is overridden though when you have just run the step command. Once you've run step, pressing the Enter/Return key will re-execute step. The rest of the operation of BASHDB is fairly straightforward, and I'll run through an example session in the How-To section.

If you're a person who prefers to use a graphical interface, have a look at GNU DDD. DDD is a graphical front end for several debuggers including BASHDB, and includes some interesting features like the ability to display data structures as graphs.

How-To

If you've been reading this post straight through, you can see that there are a lot of script debugging tools at your disposal. In this section, I'm going to go through a simple example using a few of the different methods so that you can see some practical applications. Listing 12 shows a script that has several bugs intentionally added so that we can use it as our example.

Listing 12

#!/bin/bash -
# buggy_script.sh is designed to help us learn about
# shell script debugging
#

if [-z $1 ] # Space left out after first test bracket
then
    echo "TEST"
#fi #The closing fi is left out

# Use of uninitialized variable
echo "The value is: $VALUE1"

# Infinite loop caused by not incrementing num
num=0
while [ $num -le 10 ]
do
    sleep 2
    echo "Testing"
done

When I try to run the script for the first time I get the same error that we got in Listing 1. The first thing that I'm going to do is use the -x and -u options of BASH to run the script with extra debugging output (bash -xu ./buggy_script.sh). When I rerun the script this way, I see that I don't really gain anything because BASH detects the unexpected end of file bug before it even tries to execute the script. The line number isn't any help either since it just points me to the very last line of the script, and that's not very likely to be where the error occurred. I'll run into the same problems if I try to run the script with BASHDB as well.

I remember that the rule of thumb with unexpected end of file errors is that they usually mean that I've forgotten to close something out. It could be an if statement without a fi at the end, a case statement that's missing an esac or ;;, or any number of other constructs that require closure. When I start looking through the script I notice that my if statement is missing a fi, so I add (uncomment) that. This particular bug teaches us an important lesson - that there will always be some errors that will require us to do some digging on our own. We may be able to use our debugging techniques to get us close to the error, but in the end we have to know the language well enough to be able to spot syntax errors. Once I add the fi statement, I'm ready to rerun the script. The second time the script runs, I get an unbound variable error.

Listing 13

$ bash -xu ./buggy_script.sh
./buggy_script.sh: line 6: $1: unbound variable

You can see in the error that a command line argument ($1) is unbound. This tells me that I forgot to add an argument after ./buggy_script.sh . I end up with the command line bash -xu ./buggy_script.sh testarg1 which gives me the next two errors shown in Listing 14.

Listing 14

$ bash -xu ./buggy_script.sh testarg1
+ '[-z' testarg1 ']'
./buggy_script.sh: line 6: [-z: command not found
./buggy_script.sh: line 12: VALUE1: unbound variable

Execution tracing shows me that the last command executed is [-z' testarg1 '] . The first error tells me that for some reason the start of the test statement ([-z) is being treated as a command. I think about it for a second and remember that there has to be a space between test brackets and what they enclose. The statement [-z $1 ] should read [ -z $1 ] . Since I try to focus on one error at a time, I fix the test statement and rerun the script. The first error from Listing 14 goes away, but the second error remains. You can see that it's another unbound variable error, but this time it's referencing a variable that I created and not a command line argument. The problem is that I use the variable VALUE1 in an echo statement before I've even set a value for it. In this case that would just leave a blank at the end of the echo statement, but in some cases it can cause more serious problems. This is what using the -u option of BASH does for you. It warns you that a variable doesn't have a value before you try to use it. To correct this error, I add a statement right above the echo line that sets a value for the variable (VALUE1="1").

After fixing the above errors and rerunning the script, everything seems to work fine. The only problem is that even though I set the while loop up to quit after the variable num gets to 10, the loop doesn't exit. It seems that I have an infinite loop problem. This loop is simple enough that you can probably just glance at it and see the problem, but for the sake of the example we're going to take the long way around. I add an echo statement (echo "num Value: $num") to show me the value of the num variable right above the sleep 2 line. When I run the script again without the BASH -x option (to cut out some clutter), I get the output shown in Listing 15.

Listing 15

$ bash -u ./buggy_script.sh testarg1
The value is: 1
num Value: 0
Testing
num Value: 0
Testing
num Value: 0

You can see that the output from the echo statement I added is always the same (num Value: 0). This tells me that the value of num is never incremented and so it will never reach the limit of 10 that I set for the while loop. The fix is to use arithmetic expansion to increment the num variable by 1 each time around the while loop: num=$((num+1)) . When I run the script now, num increments like it should and the script exits when it's supposed to. With this bug fixed, it looks like we've eliminated all of the errors from our script. The finalized script with the num evaluation echo statement removed can be seen in Listing 16.

Listing 16

#!/bin/bash -
# buggy_script.sh is designed to help us learn about
# shell script debugging.

if [ -z $1 ] # Space added after first test bracket
then
    echo "TEST"
fi #The closing fi was added

# Set a value for our variable
VALUE1="1"

# Use of initialized variable
echo "The value is: $VALUE1"

# Finite loop caused by incrementing num
num=0
while [ $num -le 10 ]
do
    sleep 2
    echo "Testing"
    num=$((num+1))
done

Now I'll walk you through correcting the same buggy script using BASHDB. As I said above, the unexpected end of file error is best solved by applying your understanding of shell scripting syntax. Because of this, I'm going to start debugging the script right after we notice and fix the unclosed if statement. To start the debugging process, I use the line bashdb ./buggy_script.sh to launch BASHDB and have it start to step through the script. If you compiled BASHDB from source and haven't installed it, you'll need to adjust the paths in the command line accordingly.

BASHDB starts the script and then stops at line 7, the if statement. I then use the step command to move to the next instruction and get the output in Listing 17.

Listing 17

$ bashdb ./buggy_script.sh
bash Shell Debugger, release 4.0-0.4
Copyright 2002, 2003, 2004, 2006, 2007, 2008, 2009 Rocky Bernstein
This is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
(/home/jwright/Documents/Scripts/Learning/buggy_script.sh:7):
7:      if [-z $1 ] # Space left out after first test bracket
bashdb<0> step
./buggy_script.sh: line 7: [-z: command not found
(/home/jwright/Documents/Scripts/Learning/buggy_script.sh:13):
13:     echo "The value is: $VALUE1"

Notice that until I run the step command, BASHDB doesn't give me an error for line 7. That's because it has stopped on the line 7 instruction, but hasn't executed it yet. When I step through that instruction and on to the next one, I get the same error as the BASH shell gives us ([-z: command not found). As before, we realize that we've left a space out between the test bracket and the statement. To fix this, I type the edit command to open the script in the text editor specified by the EDITOR environment variable. In my case this is vim. I have to type visual to go to normal mode, and then I'm able to edit and save my changes to the script like I would in any vi/vim session. With the space added, I save the file and exit vim which puts me back at the BASHDB prompt. I type the R character and hit the Enter/Return key to restart the script, which also loads my changes. I end up right back at line 7 again.

This time when I use the step command, BASHDB moves past the if statement and stops right before executing line 13 (the next instruction). Everything looks good, so I use the step command again by simply hitting the Enter/Return key. The output in Listing 18 is what I see.

Listing 18

bashdb<1> edit
bashdb<2> R
Restarting with: /usr/local/bin/bashdb ./buggy_script.sh
bash Shell Debugger, release 4.0-0.4
Copyright 2002, 2003, 2004, 2006, 2007, 2008, 2009 Rocky Bernstein
This is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
(/home/jwright/Documents/Scripts/Learning/buggy_script.sh:7):
7:      if [ -z $1 ] # Space left out after first test bracket
bashdb<0> step
(/home/jwright/Documents/Scripts/Learning/buggy_script.sh:13):
13:     echo "The value is: $VALUE1"
bashdb<1>
The value is:
(/home/jwright/Documents/Scripts/Learning/buggy_script.sh:16):
16:     num=0

We see that the echo statement ends up not having any text after the colon, which is not what we want. What I'll do is issue an R (restart) command and then step back to line 13 so that I can check the value of the variable. Once I'm back at the echo statement on line 13, I use the command print $VALUE1 to inspect the value of that variable. A snippet of the output from the print command is in Listing 19.

Listing 19

7:      if [ -z $1 ] # Space left out after first test bracket
bashdb<0> step
(/home/jwright/Documents/Scripts/Learning/buggy_script.sh:13):
13:     echo "The value is: $VALUE1"
bashdb<1> print $VALUE1

bashdb<2>

There's a blank line between the bashdb<1> print $VALUE1 and bashdb<2> lines. This tells me that there is definitely not a value (or there's a blank string) set for the VALUE1 variable. To correct this I go back into edit mode, and add the variable declaration VALUE1="1" just above our echo statement. I follow the same edit, save, exit, restart (with the R character) routine as before, and then step down through the echo statement again.

This time the output from the echo statement is The value is: 1 which is what we would expect. With that error fixed, we continue to step down through the script until we realize that we're stuck in our infinite while loop. We can use the print statement here as well, and with the line print $num we see that the num variable is not being incremented. Once again, we enter edit mode to fix the problem. We add the statement num=$((num+1)) at the bottom of our while loop, save, exit, and restart. We now see that the num variable is incrementing properly and that the loop will exit. We can type the continue command to let the loop finish without any more intervention.

After the script has run successfully, you'll see the message Debugged program terminated normally. Use q to quit or R to restart. If you haven't been adding comments as you go, it would be a good idea at this point to re-enter edit mode and add those comments to any changes that you made. Make sure to run your script through one more time though to make sure that you didn't break anything during the process of commenting.

That's a pretty simple BASHDB session, but my hope is that it will give you a good start. BASHDB is a great tool to add to your shell script development toolbox.

Tips and Tricks

  • If you're like many of us, you may have trouble with quoting in your scripts from time to time. If you need a hint on how quoted sections are being interpreted by the shell, you can replace the command that's acting on the quoted section with the echo command. This will give you output showing how your quotes are being interpreted. It's also a handy trick when you need insight into other issues like shell expansion (there's a short example of this just after this list).
  • If you don't indent temporary (debugging) code, it will be easier to find in order to remove it before releasing your script to users. If you don't already make a habit of indenting your scripts in the first place, I would recommend that you start. It greatly increases the readability, and thus maintainability, of your scripts.
  • You can set the PS4 environment variable to include more information with the shell's debugging output. You can add things like line numbers, filenames, and more. For example, you would use the line export PS4='$LINENO ' to add line numbers to your script's debugging output. The creator of the bashdb script debugger sets the PS4 variable to (${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]} - [${SHLVL},${BASH_SUBSHELL}, $?] which gives you very detailed information about where you're at in your script. You can make this change to the variable permanent by adding an export declaration to one of your bash configuration files.
  • Make sure to use unique names for your shell scripts. You can run into problems if you name your shell script the same as a system or built-in command (i.e. test). I like to make my shell script names distinctive, and for added protection I almost always add a .sh extension onto the end of the filename.
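Here's a quick example of the echo trick from the first tip above. Swapping echo in for the real command shows you exactly what the shell hands to that command after quote removal and expansion (the file names in the output are made up for the example):

# Suppose this command isn't matching the files you expect
$ grep "some pattern" *.txt

# Replace grep with echo to see what the shell actually passes to the command
$ echo "some pattern" *.txt
some pattern notes.txt report.txt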

Scripting

These scripts are somewhat simplified and in most cases could be done other ways too, but they will work to illustrate the concepts. If you use these scripts, make sure you adapt them to your situation. Never run a script or command without understanding what it will do to your system.

Our first script example is going to have two separate parts to it. The first is a script in which we've enclosed our debugging functionality from above. This is a case where it's helpful to create modular code so that other scripts can add debugging functionality simply by sourcing one file. That way you're not duplicating code needlessly for commonly used functionality. The second script implements the debugging script, and uses a command line option (-d) to enable debugging. The script also uses multiple debugging levels to allow the user to control how verbose the output is by passing an argument to the -d option.

Listing 20

#!/bin/bash -
# File: debug_module.sh
# Holds common script debugging functionality

# Set the PS4 variable to add line #s to our debug output
PS4='Line $LINENO : '

# The function that enables the enabling/disabling of
# debugging in the script, and also takes the user
# specified debug level into account.
# 0 = No debugging
# 1 = Debug executed statements only
# 2 = Debug all lines and executed statements

function DEBUG()
{
    # We need to see what level (0-2) of debugging is set
    if [ "$1" = "0" ] #User disabled debugging
    then
        echo "Debugging Off"
        set +xv
        # Set the variable that tracks the debugging state
        _DEBUG=0
    elif [ "$1" = "1" ] #User wants minimal debugging
    then
        echo "Minimal Debugging"
        set -x
        # Set the variable that tracks the debugging state
        _DEBUG=0
    elif [ "$1" = "2" ] #User wants maximum debugging
    then
        echo "Maximum Debugging"
        set -xv
        # Set the variable that tracks the debugging state
        _DEBUG=0
    else #Run/suppress a command line depending on debug level
        # If debugging is turned on, output the line
        # that this function was passed as a parameter
        if [ $_DEBUG -gt 0 ]
        then
            $@
        fi
    fi
}

This script has two main purposes. One is to set the PS4 variable so that line numbers are added to the debugging output to make it easier to trace errors. The other is to provide a function that takes an argument of either a number (0-2), or a command line and then decides what to do with it. If the argument is a number from 0 to 2, the function sets a debugging level accordingly. Level 0 turns off all debugging (set +xv), level 1 turns on execution tracing only (set -x), and level 2 turns on execution tracing and line echoing (set -xv). Anything else that is passed to the function is treated as a command line that is either run or suppressed depending on what the debugging level is.

As always, there are many ways to improve this script. One would be to add more debugging levels to it. I created three (0-2), which accommodated only the -x and -v options. You could add another level for the -u option, or create your own custom levels. Listing 21 shows an implementation of our simple modular debugging script.

Listing 21

#!/bin/bash -
# File: debug_module_test.sh
# Used as a test of the debug_module.sh script

# Source the debug_module.sh script so that its
# function(s) will be used as this script's own
. ./debug_module.sh

# Parse the command line options and set this script up for use

while getopts "d:h" opt
do
    case $opt in
        d) _DEBUG=$OPTARG
           # Enable debugging
           DEBUG $_DEBUG
           ;;
        h) echo "Usage: $0 [-dh]" #Give the user usage info
           echo " -d   Enables debugging mode"
           echo " -h   Displays this help message"
           exit 0
           ;;
        '?') echo "$0: Invalid Option - $OPTARG"
             echo "Usage: $0 [-dh]"
             exit 1
             ;;
    esac
done
# Begin our test statements
DEBUG echo "Debugging 1"

DEBUG echo "Debugging 2"

echo "Regular Output Line"

# Turn debugging off
DEBUG 0

# Test to make sure debugging is off
DEBUG echo "Debugging 3"

# You can also create your own custom debugging output sections
_DEBUG=2 #Manually set debugging back to max for last section
[ $_DEBUG -gt 0 ] && echo "First debugging level"
[ $_DEBUG -gt 1 ] && echo "Second debugging level"

The first statement that you see in the Listing 21 script is a source statement reading the modular debugging script (debug_module.sh). This treats the debugging script as if it was part of the script we're currently running. The next major section that you see is the while loop that parses the command line options and arguments. The main option to be concerned with is "d", since it's the one that enables or disables debugging output. The getopts command requires the -d option to have an argument on the command line via the getopts "d:h" statement. The user passes a 0, 1, or 2 to the option and that in turn sets the debugging level via the _DEBUG variable and the DEBUG function. The DEBUG function is called 4 more times throughout the rest of the script. Three of those times it is used as a switch to run or suppress a line of the script, and once it is used to reset the debugging level to 0 (debugging off).

The last three lines of the script are a little different. I put them in there to show how you could implement your own custom debugging functionality. In the first of those lines, the _DEBUG variable is set to 2 (maximum debugging output). The next two lines are used to select how much debugging output you see. When you set _DEBUG to 1, the line "First debugging level" is output. If you set _DEBUG to 2 as in the script, the conditions for both the "First debugging level" (> 0) and the "Second debugging level" (> 1) statements are met, so both lines are output. Listing 22 shows the output that you get from running this script, and if you look at the bottom you'll see that the lines "First debugging level" and "Second debugging level" are output.

Listing 22

$ ./debug_module_test.sh -d 1
Minimal Debugging
Line 29 : _DEBUG=0
Line 11 : getopts d:h opt
Line 30 : DEBUG echo 'Debugging 1'
Line 18 : '[' echo = 0 ']'
Line 24 : '[' echo = 1 ']'
Line 30 : '[' echo = 2 ']'
Line 39 : '[' 0 -gt 0 ']'
Line 32 : DEBUG echo 'Debugging 2'
Line 18 : '[' echo = 0 ']'
Line 24 : '[' echo = 1 ']'
Line 30 : '[' echo = 2 ']'
Line 39 : '[' 0 -gt 0 ']'
Line 34 : echo 'Regular Output Line'
Regular Output Line
Line 37 : DEBUG 0
Line 18 : '[' 0 = 0 ']'
Line 20 : echo 'Debugging Off'
Debugging Off
Line 21 : set +xv
First debugging level
Second debugging level

This next script is somewhat like an automated unit test. It's a wrapper script that automatically runs another script with varying combinations of options and arguments so that you can easily look for errors. It takes some time up front to create this script, but it lets you quickly check whether changes you make to the script under test might cause problems for the end user. It could take a lot of time to step through and test all of the option/argument combinations manually on a complex script, and with that extra work (if we're honest) this test might get left out altogether. That's where the automation of the script in Listing 23 comes in.

Listing 23

#!/bin/bash -
# File unit_test.sh
# A wrapper script that automatically runs another script with
# a varying combination of predefined options and arguments,
# to help find any errors.

# Variables to make the script a little more readable.
_TESTSCRIPT=$1 #The script that the user wants to test
_OPTSFILE=$2 #The file holding the predefined options
_ARGSFILE=$3 #The file holding the predefined arguments

# Read the options and arguments from their files into arrays.
_OPTSARRAY=($(cat $_OPTSFILE))
_ARGSARRAY=($(cat $_ARGSFILE))

# The string that holds the option/argument combos to try.
_TRIALSTRING=""

# Step through all of the arguments one at a time.
for _ARG in ${_ARGSARRAY[*]}
do
    # The string of multiple command line options that we'll
    # build as we step through the available options.
    _OPTSTRING=""

    # Step through all of the options one at a time.
    for _OPT in ${_OPTSARRAY[*]}
    do
        # Append the new option onto the multi-option string.
        _OPTSTRING="${_OPTSTRING}$_OPT "

        # Accumulate the command lines that will be tacked onto
        # the command as we're testing it.
        _TRIALSTRING="${_TRIALSTRING}${_OPT} $_ARG\n" #Single option
        _TRIALSTRING="${_TRIALSTRING}${_OPTSTRING}$_ARG\n" #Multi-option
    done
done

# Change the Internal Field Separator to avoid newline/space troubles
# with the command list array assignment.
IFS=":"

# Sort the lines and make sure we only have unique entries. This could
# be taken care of by more clever coding above, but I'm going to let
# the shell do some extra work for me instead. An array is used to hold
# the command lines.
_CLIST=($(echo -e $_TRIALSTRING | sort | uniq | sed '/^$/d' | tr "\n" ":"))

# Step through each of the command lines that were built.
for _CMD in ${_CLIST[*]}
do
    # We can pipe the full concatenated command string into bash to run it.
    echo $_TESTSCRIPT $_CMD | bash
done

There are two files that I created to go along with this test script. The first is sample_opts, which holds a single line of possible options separated by spaces (-d -v -q). These options stand for debugging mode, verbose mode, and quiet mode respectively. The second file that I create is sample_args, which contains two possible arguments separated by a space (/etc/passwd /etc/shadow). I'll run our unit_test.sh script by passing it the name of the script to test, the sample_opts argument, and the sample_args argument. For this example, it really doesn't matter what the test script (./test_script.sh) is designed to do. We just provide the options and arguments that we want to test, and that's all the unit_test.sh script needs to know. Listing 24 shows what happens when I run the test.
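Before we look at that, if you want to recreate the two sample files yourself, they're nothing more than single lines of space-separated values, so something like this will produce them:

$ echo "-d -v -q" > sample_opts
$ echo "/etc/passwd /etc/shadow" > sample_args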

Listing 24

$ ./unit_test.sh ./test_script.sh sample_opts sample_args
Debug mode
Debug mode
Debug mode
Verbose mode
Debug mode
Verbose mode
Debug mode
Verbose mode
Quiet mode
The -v and -q options are conflicting.
Debug mode
Verbose mode
Quiet mode
The -v and -q options are conflicting.
Quiet mode
Quiet mode
Verbose mode
Verbose mode

Notice that the output from the unit test script shows that the -v and -q options cause a conflict. I have hard-coded that error in the test script for clarity, but in everyday use you would have to look for things like real errors or output that doesn't match what is expected. The error about the -v and -q options makes sense in this case because you wouldn't want to run verbose (chatty) mode and quiet (non-chatty) mode at the same time. They are mutually exclusive options that should not be used together. This unit test script not only finds errors that you might miss with manual inspection, it also allows you to easily recheck your script whenever you make a change, and it ensures that your script is checked the same way every time.

There are a lot of improvements that can be made to this unit test script. For starters, the script doesn't check every possible combination of options. It's limited by the order that the options are in the sample_opts file. The script never reorders those options. Another improvement would be to have the script automatically check for common errors like illegal option, file not found, etc. As it stands now though, you can pipe the output of the script to grep in order to look for a specific error yourself.
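For example, to hunt for a specific problem in the unit test output, something like the following would narrow things down to the conflict we saw in Listing 24 (the search string is just an example):

$ ./unit_test.sh ./test_script.sh sample_opts sample_args | grep -i "conflicting"
The -v and -q options are conflicting.
The -v and -q options are conflicting.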

Troubleshooting

The version of BASHDB that came with my chosen Linux distribution had a bug causing an error when a BASHDB function tried to return the value of -1. The problem went away though once I downloaded and compiled the latest version straight from the BASHDB website.

If a script you're debugging causes BASHDB to hang, you can try the CTRL+C key combination. This should exit from the script you're debugging and return you to the BASHDB prompt.

Conclusion

There are quite a few tools and methods at your disposal when debugging scripts. From BASH command line options, to a full debugger like BASHDB, to your own custom debugging and test scripts, there's a lot of room for creativity in making your scripts more error-free. Better and more thorough debugging of your scripts from the outset will help lessen problems down the line, reducing downtime and user frustration. In the future, I'll talk about handling runtime errors and security as the next steps in ensuring the quality and reliability of your shell scripts. Look for another post in this series soon.

Resources

  1. Expert Shell Scripting (Expert's Voice in Open Source) Book
  2. Learning the bash Shell: Unix Shell Programming (In a Nutshell (O'Reilly))
  3. BigAdmin Community Debugging Tip
  4. Shell Script Debugging Gotchas
  5. NixCraft Debugging Article
  6. Linux Journal, April 2010, Work The Shell, By Dave Taylor, "Our Twitter Autoresponder Goes Live!", pp 24-26
  7. The Linux Documentation Project Debugging Article
  8. BASHDB Homepage
  9. BASHDB Documentation
  10. Line Number Output In set -x Debugging Output
  11. 6 Cool Bash Tricks Article
  12. Using VIM as a BASH IDE
  13. General BASH Debugging Info
  14. Good Debugging Reference With Sample Error-Filled Scripts
  15. Good Debugging Tips Page By Bash-Hackers
  16. Modularizing The Debug Function To A Separate Script

Innovations Adds LPIC Classes and Social Media

Here are a couple of important updates about what’s going on at Innovations Technology Solutions.

Linux Training

First of all, I’m proud to announce that Innovations Technology Solutions is now a Linux Professional Institute Approved Training Partner. If you take a look in our navigation bar to the left (also Figure 1), you’ll notice that there is a new button called “Training”. Clicking this button will take you to a page where you can find the latest information on upcoming classes and sign up to attend one. There is also a change that has been made to the homepage so that it features a short introductory section on training at Innovations (Figure 2).

Innovations will begin by offering two classes starting in July, with each class corresponding to one of the tests required to get your LPIC-1 certification. The classes will utilize the excellent courseware developed by Guru Labs, and a Prometric test voucher will be included in the price of each class.

Figure 1
Figure 2

Social Media

Figure 3

Next, you might have already noticed that social media badges have been added to the homepage (Figure 3) and to this blog (left). This is so that you can have the easiest access possible to all of the tips, tricks, how-tos, and updates coming from Innovations. Subscribe, become a fan, follow, and always stay up-to-date with the latest news and information.

Conclusion

We hope that these changes will further our mission to help you use Linux and other open source technologies in the most practical, productive, and profitable ways possible. Click here, or on the image in Figure 3 to go to the homepage and see the changes there for yourself.

As always, we value your input. Please let us know what you think about the changes.

Shared Library Issues In Linux

Video

Audio

Download

Quick Start

If you just want enough information to fix your problem quickly, you can read the How-To section of this post and skip the rest. I would highly recommend reading everything though, as a good understanding of the concepts and commands outlined here will serve you well in the future. We also have Video and Audio included with this post that may be a good quick reference for you. Don’t forget that the man and info pages of your Linux/Unix system can be an invaluable resource as well when you’re trying to solve problems.

Preface

To make things easier on you, all of the black command line and script areas are set up so that you can copy the text from them. This does make using the commands easier, but if you’re not already familiar with the concepts presented here, typing the commands yourself and working through why you’re typing them will help you learn more. If you hit problems along the way, take a look at the Troubleshooting section near the end of this post for help.

There are formatting conventions that are used throughout this post that you should be aware of. The following is a list outlining the color and font formats used.

Command Name or Directory Path
Warning or Error
Command Line Snippet With Commands/Options/Arguments
Command Options and Their Arguments Only
Hyperlink

Where listings on command options are made available, anything with square brackets around it (“[” and “]“) is an argument to the option, and a pipe (“|“) means that you can choose one of two alternatives ([4|6] means choose 4 or 6).

Overview

This post is geared more toward system administrators than software developers, but anyone can make good use of the information that you’re going to see here. The Resources section holds links to take your study further, even into the developer realm. I’m going to start off by giving you a brief background on shared libraries and some of the rules that apply to their use. Listing 1 shows an example of an error you might see after installing PostgreSQL via a bin installer file. In this post, I’m going to step through some commands and techniques to help you deal with this type of shared library problem. I’ll also work through resolving the error in Listing 1 as an example, and give you some tips and tricks as well as items to help you if you get stuck.

Listing 1

$ ./psql
./psql: error while loading shared libraries: libpq.so.5: cannot open shared object file: No such file or directory

Background

Shared libraries are one of the many strong design features of Linux, but they can lead to headaches for inexperienced users, and even experienced users in certain situations. Shared libraries allow software developers to keep the size of their application storage and memory footprints down by using common code from a single location. The glibc library is a good example of this. There are two standardized locations for shared libraries on a Linux system, and these are the /lib and /usr/lib directories. On some distributions /usr/local/lib is included, but check the documentation for your specific distribution to be sure. These are not the only locations that you can use for libraries though, and I’ll talk about how to use other library directories later. According to the Filesystem Hierarchy Standard (FHS), /lib is for shared libraries and kernel modules that are required for booting the system and for running the commands in the root filesystem (/bin and /sbin), and /usr/lib holds most of the internal libraries that are not meant to be executed directly by users or shell scripts. The /usr/local/lib directory is not defined in the latest version of the FHS, but if it exists on a distribution it normally holds libraries that aren’t a part of the standard distribution, including libraries that the system administrator has compiled/installed after the initial setup. There are some other directories like /lib/security, which holds PAM modules, but for our discussion we’ll focus on /lib and /usr/lib.

The counterpart to the dynamically linked (shared) library is the statically linked library. Whereas dynamically linked libraries are loaded and used as they are needed by applications, statically linked libraries are either built into, or closely associated with, a program at the time it is compiled. A couple of the situations where static libraries are used are when you’re trying to work around an odd/outdated library dependency, or when you’re building a self-contained rescue system. Static linking typically makes the resulting application faster and more portable, but it increases the size (and thus the memory and storage footprint) of the binary. A static library’s footprint is also multiplied when more than one program uses it. For instance, one program using a library that is 10 MB in size consumes just 10 MB of memory (1 program x 10 MB), but if you run 10 programs with the same library compiled into them, you end up with 100 MB of memory consumed (10 programs x 10 MB). Also, when programs are statically linked, they can’t take advantage of updates made to the libraries that they depend on. They are locked into whatever version of the library they were compiled with. Programs that depend on dynamically linked libraries refer to a specific file on the Linux file system, and so when that file is updated, the program can automatically take advantage of the new features and fixes the next time it loads.
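If you have a compiler handy, you can see the size difference for yourself with a trivial C program. The hello.c file here is an assumption on my part (any small program will do), and the exact sizes will vary from system to system, so treat this as a sketch rather than a benchmark:

# Build the same trivial program twice, once dynamically and once statically linked
$ gcc -o hello_dynamic hello.c
$ gcc -static -o hello_static hello.c

# Compare the resulting file sizes - the statically linked binary carries its
# library code along with it, while the dynamic one loads libc at run time
$ ls -lh hello_dynamic hello_static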

Shared libraries typically have the extension .so, which stands for Shared Object. Library file names are followed by a version numbering scheme which can include major and minor version numbers. A system of symbolic links is used to point the majority of programs to the latest and greatest library version, while still allowing a minority of programs to use older libraries. Listing 2 shows output that I modified to illustrate this point.

Listing 2

$ ls -l | grep libread
lrwxrwxrwx 1 root root     18 2010-03-03 11:11 libreadline.so.5 -> libreadline.so.5.2
-rw-r--r-- 1 root root 217188 2009-08-24 19:10 libreadline.so.5.2
lrwxrwxrwx 1 root root     18 2010-02-02 09:34 libreadline.so.6 -> libreadline.so.6.0
-rw-r--r-- 1 root root 225364 2009-09-23 08:16 libreadline.so.6.0

You can see in the output that there are two versions of libreadline installed side-by-side (5.2 and 6.0). The version numbers are in the form major.minor, so 5 and 6 are major version numbers, with 2 and 0 being minor version numbers. You can usually mix and match libraries with the same major version number and differing minor numbers, but it can be a bad idea to use libraries with different major numbers in place of one another. Major version number changes usually represent significant changes to the interface of the library, which are incompatible with previous versions. Minor version numbers are only changed when an update such as a bug fix is added without significantly changing how the library interacts with the outside world. Another thing that you’ll notice in Listing 2 is that there are links created from libreadline.so.5 to libreadline.so.5.2 and from libreadline.so.6 to libreadline.so.6.0. This is so that programs that depend on the 5 or 6 series of the libraries don’t have to figure out where the newest version of the library is. If an application works with major version 6 of the library, it doesn’t care if it grabs 6.0, 6.5, or 6.9 as long as it’s compatible, so it just looks at the base name of the library and takes whatever that’s linked to. There are also a couple of other situations that you’re likely to encounter with this linking scheme. The first is that you may see a link file name containing no version numbers (libreadline.so) that points to the actual library file (libreadline.so.6.0). Also, even though I said that libraries with different major version numbers are risky to mix, there are situations where you will see an earlier major version number (libreadline.so.5) linked to a newer version number of the library (libreadline.so.6.0). This should only happen when your distribution maintainers or system administrators have made sure that nothing will break by doing this. Listing 3 shows an example of the first situation.

Listing 3

$ ls -l | grep ".*so " lrwxrwxrwx 1 root root 18 2010-02-02 14:48 libdbus-1.so -> libdbus-1.so.3.4.0

All things considered, the shared library methodology and numbering scheme do a good job of ensuring that your software can maintain a smaller footprint, make use of the latest and greatest library versions, and still have backwards compatibility with older libraries when needed. With that said, the shared library model isn’t perfect. There are some disadvantages to using shared libraries, but those disadvantages are typically considered to be outweighed by the benefits. One of the disadvantages is that shared libraries can slow the load time of a program. This is only a problem the first time that the library is loaded though. After that, the library is in memory and other applications that are launched won’t have to reload it. One of the most potentially dangerous drawbacks of shared libraries is that they can create a central point of failure for your system. If there is a library that a large set of your programs rely on and it gets corrupted, deleted, overwritten, etc., all of those programs are probably going to break. If any of the programs that were just taken down are needed to boot your Linux system, you’ll be dead in the water and in need of a rescue CD.

While I would argue that dependency chains are not really a “problem”, they can create extra work for a system administrator. A dependency chain happens when one library depends on another library, which in turn depends on another, and another, and so on. When dealing with a dependency chain, you may have satisfied all of the first-level dependencies, but your program still won’t run. You have to check each library in turn for its own dependencies, and then follow that chain all the way through, filling in the missing dependencies as you go.

One final problem with shared libraries that I’ll mention is version compatibility. You can end up with a situation where two different applications require different, mutually incompatible versions of the same library. That is the reason for the version numbering system that I talked about above, and robust package management systems have helped ease shared library problems from the user’s perspective, but these issues still come up in certain situations. Any time that you compile and/or install an application or library yourself on your Linux system, you have to keep an eye out for problems, since you don’t have the benefit of a package manager ensuring library compatibility.

Introducing ld-linux.so

ld-linux.so (or ld.so for older a.out binaries) is itself a library, and is responsible for managing the loading of shared libraries in Linux. For the purposes of this post we’ll be working with ld-linux.so, and if you need or want to learn more about the older style a.out loading/linking, have a look at the Resources section. The ld-linux.so library reads the /etc/ld.so.cache file, a non-human-readable file that is updated when you run the ldconfig command. When loading shared libraries, ld-linux.so looks for them first in the paths listed in the LD_LIBRARY_PATH environment variable, then in the contents of the /etc/ld.so.cache file, and finally in the default /lib directory followed by the /usr/lib directory.

The LD_LIBRARY_PATH environment variable is a colon-separated list of directories that is searched ahead of all of the other library paths in the search order described above. This means that you can use it to temporarily alter library paths when you’re trying to test a new library before rolling it out to the entire system, or to work around problems. This variable is typically not set by default on Linux distributions, and should not be used as a permanent fix. Use it with care, and give preference to the other library search path configuration methods. A handy thing about the LD_LIBRARY_PATH variable is that since it’s an environment variable, you can set it on the same line as a command and the new value will only affect that command, and not the parent environment. So, you would issue a command line like LD_LIBRARY_PATH="/home/user/lib" ./program to run program and force it to use the experimental shared libraries in /home/user/lib in preference to any others on the system. The shell that you run program in never sees the change to LD_LIBRARY_PATH. Of course you can also use the export command to set this variable, but be careful, because doing that affects every program you launch from that shell until the variable is unset. One final thing about the LD_LIBRARY_PATH variable is that you don’t have to run ldconfig after changing it. The changes take effect immediately, unlike changes to /lib, /usr/lib, and /etc/ld.so.conf. I’ll explain more about ldconfig later.
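
As a quick illustration of the difference between the two approaches, here’s a small sketch using a hypothetical library directory and program name.

# Per-command: only this invocation of ./program sees the altered library path
LD_LIBRARY_PATH="/home/user/lib" ./program

# Exported: every program started from this shell sees it until it's unset
export LD_LIBRARY_PATH="/home/user/lib"
./program
unset LD_LIBRARY_PATH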

You can use the ld-linux.so library by itself to list which libraries a program depends on. Its behavior is very much like the ldd command that we’ll talk about next, because ldd is actually a wrapper script that adds more sophisticated behavior to ld-linux.so. In most cases ldd should be your preferred command for listing required shared libraries. In order to use ld-linux.so.2 to get a listing of the libraries that the ls command depends on, you would type /lib/ld-linux.so.2 --list /bin/ls, swapping the 2 out for whatever major version of the library your system is running. I’ve shown some of the command line options for ld-linux.so in Listing 4.

Listing 4

--list                 Lists all library dependencies for the executable
--verify               Verifies that the program is dynamically linked and that the ld-linux.so linker can handle it
--library-path [PATH]  Overrides the LD_LIBRARY_PATH environment variable and uses PATH instead

You can start a program directly with ld-linux.so by using the following command line form: /lib/ld-linux.so.2 --library-path FULL_LIBRARY_PATH FULL_EXECUTABLE_PATH, where you replace 2 with whatever version of the library you are using. An example would be /lib/ld-linux.so.2 --library-path /home/user/lib /home/bin/program, which would run program using /home/user/lib as the location to look for required libraries. This should be used for testing purposes only, and not as a permanent fix on a production system.
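
As a sketch of how these options might be combined in a quick check, the snippet below verifies that a binary is something the loader can handle before asking for its dependency list. The loader path and version number are whatever your particular system uses.

#!/bin/bash -
# Sketch: verify that /bin/ls is a dynamic executable this loader understands,
# then list its shared library dependencies.
LOADER=/lib/ld-linux.so.2

if "$LOADER" --verify /bin/ls
then
    "$LOADER" --list /bin/ls
else
    echo "/bin/ls is not a dynamic executable that $LOADER can handle"
fi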

Introducing ldd

The name of the ldd command comes from its function, which is to “List Dynamic Dependencies”. As mentioned in the previous section, by default the ldd command gives you the same output as issuing the command line /lib/ld-linux.so.2 --list FULL_EXECUTABLE_PATH. Each library entry in the output includes a hexadecimal number which is the load address of the library, and can change from run to run. Chances are that system administrators will never even need to know what this value is, but I’ve mentioned it here because some people may be curious. Listing 5 shows a few of the options for ldd that I use the most.

Listing 5

-d --data-relocs      Perform data relocations and report any missing objects
-r --function-relocs  Perform relocations for both data objects and functions, and report any missing objects or functions
-u --unused           Print unused direct dependencies
-v --verbose          Print all information, including e.g. symbol versioning information

Keep in mind that you have to give ldd the full path to the binary/executable for it to work. The only way to work around giving ldd the full path is to use cd to change into the directory where the binary is. Otherwise you get an error like ldd: ./ls: No such file or directory. The only time that you would need to run ldd with root privileges would be if the binary has restrictive permissions placed on it.

As I mentioned in the Background section, you need to be aware of dependency chains when using shared libraries. Just because you’ve run the ldd command on an executable and satisfied all of its top level dependencies doesn’t mean that there aren’t more dependencies lurking underneath. If your program still won’t run, you should check each of the top level libraries to see if any of them have their own library dependencies that are unmet. You continue that process, running ldd on each library in each layer, until you’ve satisfied all of the dependencies.

Introducing ldconfig

Any time that you make changes to the installed libraries on your system, you’ll want to run the ldconfig command with root privileges to update your library cache. ldconfig rebuilds the /etc/ld.so.cache file of currently installed libraries based on what it first finds in the directories listed in the /etc/ld.so.conf file, and then in the /lib and /usr/lib directories. The /etc/ld.so.cache file is formatted in binary by ldconfig and so it’s not designed to be human readable, and should not be edited by hand. Formatting the ld.so.cache file in this way makes it more efficient for the system to retrieve the information. The ld.so.conf file may include a directive that reads include /etc/ld.so.conf.d/*.conf that tells ldconfig to check the ld.so.conf.d directory for additional configuration files. This allows the easy addition of configuration files to load third-party shared libraries such as those for MySQL. On some distributions, this include directive may be the only line you find in the ld.so.conf file.

You often need to run ldconfig manually because a Linux system cannot always know when you have made changes to the currently installed libraries. Many package management systems run ldconfig as part of the installation process, but if you compile and/or install a library without using the package management system, the system software may not know that there is a new library present. The same applies when you remove a shared library.
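
A quick way to confirm that your change actually made it into the cache is to print the cache and search it for the library in question. The library name below is just an example.

# Rebuild the cache, then check that the new library shows up in it
sudo /sbin/ldconfig
/sbin/ldconfig -p | grep libexample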

Listing 6 holds several options for the ldconfig command. This is by no means an exhaustive list, so be sure to check the man page for more information.

Listing 6

-C [file]         Specifies an alternate cache file other than ld.so.cache
-f [file]         Specifies an alternate configuration file other than ld.so.conf
-n                Rebuilds the cache using only directories specified on the command line, skipping the standard directories and ld.so.conf
-N                Only updates the symbolic links to libraries, skipping the cache rebuilding step
-p --print-cache  Lists the shared library cache, but needs to be piped to the less command because of the amount of output
-v --verbose      Gives output information about version numbers, links created, and directories scanned
-X                Opposite of -N, it rebuilds the library cache and skips updating the links to the libraries

ldconfig is not the only method used to rebuild the library cache. Gentoo handles this task in a slightly different way, which I’ll talk about next.

Introducing env-update

Gentoo takes a slightly different path to updating the cache of installed libraries which includes the use of the env-update script. env-update reads library path configuration files from the /etc/env.d directory in much the same way that ldconfig reads files from /etc/ld.so.conf.d via the ld.so.conf include directive. env-update then creates a set of files within /etc, including ld.so.conf. After this, env-update runs ldconfig so that it reloads the cache of libraries into the /etc/ld.so.cache file.
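
On a Gentoo system, a rough sketch of adding a custom library path might look like the lines below. The file name and library directory are hypothetical, and the exact layout of /etc/env.d files can vary, so treat this only as an outline to adapt.

# Hypothetical example (run as root): register an extra library directory,
# then regenerate the environment files and the shared library cache.
echo 'LDPATH="/opt/example/lib"' > /etc/env.d/99example
env-update
source /etc/profile   # pick up the regenerated environment in the current shell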

How-To

Hopefully, by the time you’re reading this section you either have, or are beginning to get, a pretty good understanding of the commands used when dealing with shared libraries. Now I’m going to take you through a sample scenario of a PostgreSQL installation running on CentOS 5.4 (a Red Hat derivative) to demonstrate how you would use these commands.

I have downloaded a bin installer to use on my CentOS installation instead of the PostgreSQL Yum repository because I wanted to install a specific older version of Postgres outside of the package management system. In most cases you’ll want to use a repository with your package management system though, as you’ll get a more integrated installation that can be kept up to date more easily. That’s assuming that your Linux distribution offers the repository mechanism for installing and updating packages, and many distributions don’t.

After installing Postgres via the bin file, I take a look around and see that the majority of the PostgreSQL files are in the /opt/PostgreSQL directory. I decide to experiment with the binaries under the pgAdmin3 directory, and so I use the cd command to move to /opt/PostgreSQL/8.4/pgAdmin3/bin. Once I’m there, I try to run the psql command and get the output in Listing 7 (same as Listing 1).

Listing 7

$ ./psql
./psql: error while loading shared libraries: libpq.so.5: cannot open shared object file: No such file or directory

There might be some of you reading this who will realize that I could have probably avoided the library error in Listing 7 by running the psql command from the /opt/PostgreSQL/8.4/bin directory. While this is true, for the sake of this example I’m going to forge ahead trying to figure out why it won’t run under the pgAdmin3 directory.

The main thing that I take away from the output in Listing 7 is that there is a shared library named libpq.so.5 that cannot be found by ld-linux.so. To dig just a little bit deeper, I use the ldd command and get the output in Listing 8.

Listing 8

$ ldd ./psql
    linux-gate.so.1 =>  (0x003fc000)
    libpq.so.5 => not found
    libxml2.so.2 => /usr/lib/libxml2.so.2 (0x00845000)
    libpam.so.0 => /lib/libpam.so.0 (0x0054f000)
    libssl.so.4 => not found
    libcrypto.so.4 => not found
    libkrb5.so.3 => /usr/lib/libkrb5.so.3 (0x00706000)
    libz.so.1 => /usr/lib/libz.so.1 (0x003d5000)
    libreadline.so.4 => not found
    libtermcap.so.2 => /lib/libtermcap.so.2 (0x00325000)
    libcrypt.so.1 => /lib/libcrypt.so.1 (0x004a3000)
    libdl.so.2 => /lib/libdl.so.2 (0x0031f000)
    libm.so.6 => /lib/libm.so.6 (0x0033f000)
    libc.so.6 => /lib/libc.so.6 (0x001d7000)
    libaudit.so.0 => /lib/libaudit.so.0 (0x00532000)
    libk5crypto.so.3 => /usr/lib/libk5crypto.so.3 (0x0079e000)
    libcom_err.so.2 => /lib/libcom_err.so.2 (0x0052d000)
    libkrb5support.so.0 => /usr/lib/libkrb5support.so.0 (0x006f6000)
    libkeyutils.so.1 => /lib/libkeyutils.so.1 (0x005ae000)
    libresolv.so.2 => /lib/libresolv.so.2 (0x00518000)
    /lib/ld-linux.so.2 (0x001b9000)
    libselinux.so.1 => /lib/libselinux.so.1 (0x003bb000)
    libsepol.so.1 => /lib/libsepol.so.1 (0x00373000)

Notice that the error given in Listing 7 only gives you the first shared library that’s missing. As you can see in Listing 8, this doesn’t mean that other libraries won’t be missing as well.

My next step is to see if the missing libraries are already installed somewhere on my system using the find command. If the libraries are not already installed, I’ll have to use the package management system or the Internet to see which package(s) I need to install to get them. Listing 9 shows the output from the find command.

Listing 9

$ sudo find / -name libpq.so.5
/opt/PostgreSQL/8.4/lib/libpq.so.5
/opt/PostgreSQL/8.4/pgAdmin3/lib/libpq.so.5

After looking in both of the directories shown in the output, I notice that all of my other missing libraries are housed within them. If you were just temporarily testing some new features of the psql command, you could use the export command to set the LD_LIBRARY_PATH environment variable as I have in Listing 10.

Listing 10

$ export LD_LIBRARY_PATH="/opt/PostgreSQL/8.4/lib/"
bash-3.2$ ./psql
Password:
psql (8.4.3)
Type "help" for help.

postgres=#

You can see that once I’ve set the LD_LIBRARY_PATH variable, all I have to do is enter my PostgreSQL password and I’m greeted with the psql command line interface. I’ve used the /opt/PostgreSQL/8.4/lib/ library directory instead of the one beneath the pgAdmin3 directory as a matter of preference. In this case both directories include the same required libraries. For a permanent solution, we can add the path via the ld.so.conf file.

I could just add /opt/PostgreSQL/8.4/lib/ directly to the ld.so.conf file on its own line, but since the ld.so.conf file on my installation has the include ld.so.conf.d/*.conf directive, I’m going to add a separate conf file instead. In Listing 11 you can see that I’ve echoed the PostgreSQL library path into a file called postgres-i386.conf under the /etc/ld.so.conf.d directory. After checking to make sure the file has the directory in it, I run the ldconfig command to update the library cache.

Listing 11

$ sudo -s
Password:
[root@localhost bin]# echo /opt/PostgreSQL/8.4/lib > /etc/ld.so.conf.d/postgres-i386.conf
[root@localhost bin]# cat /etc/ld.so.conf.d/postgres-i386.conf
/opt/PostgreSQL/8.4/lib
[root@localhost bin]# /sbin/ldconfig
[root@localhost bin]# exit
exit

Be sure to unset the LD_LIBRARY_PATH variable, though, so that you know it was your ld.so.conf configuration file change that fixed the problem, and not the environment variable. Issuing a command line such as unset LD_LIBRARY_PATH will accomplish this for you.
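
A short sketch of that double-check, using the same psql binary from this example, might look like this.

# Get the environment variable out of the way, then confirm that every library
# is still resolved via the new ld.so.conf.d entry
unset LD_LIBRARY_PATH
ldd ./psql | grep "not found" || echo "all shared libraries resolved"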

There are many scenarios beyond the one in this example, but it gives you the concepts used to work through the majority of shared library problems that you’re likely to come up against as a system administrator. If you’re interested in delving more deeply though, there are several links in the Resources section that should help you.

Tips and Tricks

  • I have read that running ldd on an untrusted program can open your system up to a malicious attack. This happens when an executable’s embedded ELF information is crafted in such a way that it will run itself by specifying its own loader. The man pages on the Ubuntu and Red Hat systems that I checked don’t mention anything about this security concern, but you’ll find a very good article by Peteris Krumins in the Resources section of this post. I would suggest at least skimming Peteris’ post so that you’re aware of the security implications of running ldd on unverified code.
  • Although it’s a little bit beyond the scope of this post, you can compile a program from source and manually control which libraries it links to. This is yet another way to work around library compatibility issues. You use the GNU C Compiler/GNU Compiler Collection (gcc) along with its -L and -l options to accomplish this. Have a look at item 13 (the YoLinux tutorial) in the Resources section for an example, and the gcc man page for details on the options.
  • Have a look at the readelf and nm commands if you want a more in-depth look at the internals of the binaries and libraries that you’re working with. readelf shows you some extra information on your ELF files by reading and parsing their internal information, and nm lists the symbols (functions, etc) within an object file.
  • You can temporarily preempt your current set of libraries and their functions with the LD_PRELOAD environment variable and/or the /etc/ld.so.preload file. Once these are set, the dynamic library loader will use the preloaded libraries/functions in preference to the ones that you have cached using ldconfig. This can help you work around shared library problems in a few instances (see the short sketch just after this list).
  • If you run into a program that has its required library path(s) hard coded into it, you can create symbolic links from each one of the missing libraries to the location that’s expected by the executable. This technique can also help you work around incompatibilities in the naming conventions between what your system software expects, and what libraries are actually named. I talk about using symbolic links in this way a little more in the Troubleshooting section.
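
Here is the LD_PRELOAD sketch promised in the list above. The library path is hypothetical, and as with LD_LIBRARY_PATH, the per-command form is the safer choice.

# Preload a replacement library for a single command only
LD_PRELOAD="/home/user/lib/libexample.so" ./program

# Or preload it for everything started from this shell (use with care)
export LD_PRELOAD="/home/user/lib/libexample.so"
./program
unset LD_PRELOAD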

Scripting

These scripts are somewhat simplified and in most cases could be done other ways too, but they will work to illustrate the concepts. If you use these scripts, make sure you adapt them to your situation. Never run a script or command without understanding what it will do to your system.

The first script shown in Listing 12 can be used to search directory trees for binaries with missing libraries. It makes use of the ldd and find commands to do the bulk of the work, looping through their output. Since I have heavily commented the scripts in Listing 12 and Listing 13, I won’t explain the details of how they work in this text.

Listing 12

#!/bin/bash -

# These variables are designed to be changed if your Linux distro's ldd output
# varies from Red Hat or Ubuntu for some reason
iself="not a dynamic executable"  # Used to see if executable is not dynamic
notfound="not.*found"             # Used to see if ldd doesn't find a library

# Step through all of the executable files in the user specified directory
for exe in $(find $1 -type f -perm /111)
do
    # Check to see if ldd can get any information from this executable. It won't
    # if the executable is something like a script or a non-ELF executable.
    if [ -z "$(ldd $exe | grep -i "$iself")" ]
    then
        # Step through each of the lines of output from the ldd command
        # substituting : for a delimiter instead of a space
        for line in $(ldd $exe | tr " " ":")
        do
            # If ldd gives us output with our "not found" variable string in it,
            # we'll need to warn the user that there is a shared library issue
            if [ -n "$(echo "$line" | grep -i "$notfound")" ]
            then
                # Grab the first field, ignoring the words "not" or "found".
                # If we don't do this, we'll end up grabbing a field with a
                # word and not the library name.
                library="$(echo $line | cut -d ":" -f 1)"
                printf "Executable %s is missing shared object %s\n" $exe $library
            fi
        done
    fi
done

When run on the /opt/PostgreSQL directory mentioned above, it finds all of the programs that exhibit our missing library problem. As it stands now, this script will only check the first layer of library dependencies. One way to improve it would be to make the script follow the dependency chain of every library to the end, making sure that there is not a library farther down the chain that is missing. Better yet, you could add a “max-depth” option so that the user could specify how deeply into the dependency chain they wanted the script to check before moving on. A max-depth setting of “0” would allow the user to specify that they wanted the script to follow the dependency chain to the very end.
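
As a rough sketch of that improvement (without the max-depth option), the script below follows the dependency chain of a single executable, running ldd on each library that resolves and flagging anything reported as “not found” along the way. It assumes a reasonably recent bash and the usual ldd output format, so treat it as a starting point rather than a finished tool.

#!/bin/bash -
# Sketch: follow the dependency chain of the executable given as the first
# argument, checking each resolved library for its own missing dependencies.

if [ -z "$1" ]
then
    echo "Usage: $0 /full/path/to/executable"
    exit 1
fi

VISITED=" "   # space-delimited list of files we have already checked

check_deps() {
    local file="$1"
    local dep

    # Skip files we've already seen so circular dependencies can't loop forever
    case "$VISITED" in
        *" $file "*) return ;;
    esac
    VISITED="$VISITED$file "

    # Read ldd's output via process substitution so VISITED isn't lost in a subshell
    while read -r line
    do
        case "$line" in
            *"not found"*)
                echo "Missing dependency for $file: ${line%%=>*}"
                ;;
            *"=> /"*)
                # Resolved entries look like "libfoo.so.1 => /lib/libfoo.so.1 (0x...)"
                dep=$(echo "$line" | awk '{print $3}')
                [ -f "$dep" ] && check_deps "$dep"
                ;;
        esac
    done < <(ldd "$file" 2>/dev/null)
}

check_deps "$1"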

In Listing 13, I have created a wrapper script that could be used when developing new software, or as a last ditch effort to work around a really tough shared library problem. It utilizes the shell’s feature of temporarily setting an environment variable for a command on the same line as the command designation. That way we’re not setting LD_LIBRARY_PATH for the overall environment, which could cause problems for other programs if there are library naming conflicts.

Listing 13

#!/bin/bash -

# Set up the variables to hold the PostgreSQL lib and bin paths. These paths may
# vary on your system, so change them accordingly.
LIB_PATH=/opt/PostgreSQL/8.4/lib               # Postgres library path
BIN_FILE=/opt/PostgreSQL/8.4/pgAdmin3/bin/psql # The binary to run

# Start the specified program with the library path and have it replace this
# process. Note that this will not change LD_LIBRARY_PATH in the parent shell.
exec env LD_LIBRARY_PATH="$LIB_PATH" "$BIN_FILE"

I’ve broken the library and binary paths out into variables to make it easier for you to adapt this script for use on your system. This script could easily serve as a template for other wrapper scripts as well, anytime that you need to alter the environment before launching a program. Remember though that this wrapper script should not be used for a permanent solution to your shared library problems unless you have no other choice.

Troubleshooting

In some cases, a program may have been hard coded to look for a specific library on your system in a certain path, thus ignoring your local library settings. In order to fix this problem, you can research what version/path of the library the program is looking for and then create a symbolic link between the expected library location and a compatible library. In some cases you can recompile the program with options set to change how/where it looks for libraries. If the programmer was really kind, they may have included a command line option to set the library location, but this would be the exception rather than the rule when library locations are hard coded.

The ldd command will not work with older style a.out binaries, and will probably give output mentioning “DLL jump” if it encounters one. It’s a good idea not to trust what ldd tells you when you’re running it on these types of binaries because the output is unpredictable and inaccurate. Newer ELF binaries have support for ldd built into them via the compiler, which is why they work.

Just because the dynamic linker finds a library doesn’t mean that the library isn’t missing “symbols” (things like functions/subroutines). If this happens, you may be able to match the ldd command output to libraries that are installed, but your program will still have unpredictable behavior (like not starting or crashing) when it tries to access the symbol(s) that are missing. In this case the ldd command’s -d and -r options can give you more information on the missing symbols, and you’ll need to dig deeper into the software developer’s documentation to see if there are compatibility issues with the specific version of the library that you’re running. Remember that you can always use the LD_LIBRARY_PATH variable to temporarily test different versions of the library to see if they fix your problem.

There may be some rare cases where ldconfig may not be able to determine a library type (libc4, 5, or 6) from its embedded information. If this happens, you can specify the type manually in the /etc/ld.so.conf file with a directive like dirname=TYPE, where TYPE can be libc4, libc5, or libc6. According to the man page for ldconfig, you can also specify this information directly on the command line to keep the change on a temporary basis.

If you have stubborn library problems that you just can’t seem to get a handle on, you might try setting the LD_DEBUG environment variable. Try typing export LD_DEBUG="help" first and then run a command (like ls) so that you can see what options are available. I normally use “all“, but you can be more selective on your choices. The next time that you run a program, you’ll see output that is like a stack trace for the library loading process. You can follow this output through to see where exactly your library problem is occurring. Issue unset LD_DEBUG to disable this debugging output again.

Conclusion

I hope that this post has armed you with the knowledge that you need to solve any shared library problems that you might come up against. Work through shared library problems step-by-step: determine which libraries are needed, find out whether they’re already installed, install any that are missing, and make sure that your Linux distribution can find them. Follow those steps and you should have no trouble fixing most of your dynamic library issues. If you have any questions, or have any information that should be added to this post, leave a comment or drop me an email. I welcome your feedback.

Resources

  1. IBM developerWorks Article On Shared Libraries By Peter Seebach
  2. Linux Foundation Reference On Statically And Dynamically Linked Libraries (Developer Oriented)
  3. LPIC-1 : Linux Professional Institute Certification Study Guide By Roderick W. Smith
  4. LPIC-1 In Depth By Michael Jang
  5. How-To On Shared Libraries From Linux Online
  6. Stack Overflow Post Giving An Example Of A Shared Library Problem
  7. OpenGuru Post On Shared Library Problem Caused By Not Having /usr/local/lib in /etc/ld.so.conf
  8. An Introduction To ELF Binaries By Eric Youngdale (Linux Journal)
  9. Short Explanation Of How To Tell a.out and ELF Binaries Apart
  10. Post On ldd Arbitrary Code Execution Security Issues By Peteris Krumins
  11. The Text Version Of The Filesystem Hierarchy System (Version 2.3)
  12. A Linux Documentation Project gcc Reference Covering Shared Libraries
  13. YoLinux Library Tutorial Including A gcc Linking Example
  14. Article By Johan Petersson Explaining What linux-gate.so.1 Is And Why You’ll Never Find It

Device Or Resource Busy Errors In Linux

Video

Part 1

Part 2

Audio

Download

Quick Start

If you just want enough information to fix your problem quickly, you can read the How-To section of this post and skip the rest. I would highly recommend reading everything though, as a good understanding of the concepts and commands outlined here will serve you well in the future. We also have Video and Audio included with this post that may be a good quick reference for you. Don’t forget that the man and info pages of your Linux/Unix system can be an invaluable resource as well when you’re trying to solve problems.

Preface

To make things easier on you, all of the black command line and script areas are set up so that you can copy the text from them. This does make using the commands easier, but if you’re not already familiar with the concepts presented here, typing the commands yourself and working through why you’re typing them will help you learn more. If you hit problems along the way, take a look at the Troubleshooting section near the end of this post for help.

There are formatting conventions that are used throughout this post that you should be aware of. The following is a list outlining the color and font formats used.

Command Name or Directory Path
Warning or Error
Command Line Snippet With Commands/Options/Arguments
Command Options and Their Arguments Only
Hyperlink

Overview

When you try to access an object on a Linux file system that is in use, you may get an error telling you that the device or resource you want is busy. When this happens, you may see a message like the one in Listing 1.

Listing 1

$ sudo umount /media/4278-62C2/
umount: /media/4278-62C2: device is busy.
(In some cases useful info about processes that use the device is found by lsof(8) or fuser(1))

Notice that there are 2 commands specified at the end of the output – lsof and fuser, which are the two commands that this post will be focused on.

Introducing lsof

lsof is used to LiSt Open Files, hence the command’s name. It’s a handy tool normally used to list the open files on a system along with the associated processes or users, and can also be used to gather information on your system’s network connections. When run without options, lsof lists all open files along with all of the active processes that have them open. To get a full and accurate view of what files are open by what processes, make sure that you run the lsof command with root privileges.

To use lsof on a specific file, you have to specify the full path to the file. Remember that everything in Linux is a file, so you can use lsof on anything from directories to devices. This makes lsof a very powerful tool once you’ve learned it.

There are many options for lsof, and I have listed summaries for the ones that I find most useful in Listing 2. Anything with square brackets around it (“[” and “]“) is an argument to the option, and a pipe (“|“) means that you can choose one of two alternatives ([4|6] means choose 4 or 6).

Listing 2

+d [directory]   Scans the specified directory and all directories/files in its top level to see if any are open.
+D [directory]   Scans the specified directory and all directories/files in it recursively to see if any are open.
-F [characters]  Allows you to specify a list of characters used to split the output up into fields to make it easier to process. Type lsof -F ? for a list of characters.
-i [address]     Shows the current user's network connections and the processes associated with them. Connection types can be specified via an argument: [4|6][protocol][@hostname|hostaddr][:service|port]
-N               Enables the scanning/listing of files on NFS mounts.
-r [seconds]     Causes lsof to repeat its scan indefinitely or every so many seconds.
+r [seconds]     A variation of the -r option that will exit on the first iteration when no open files are listed. It uses seconds as a delay value.
-t               Strips all data out of the output except the PIDs. This is good for scripts and piping data around.
-u [user|UID]    Allows you to show the open files for the user or user ID that you specify.
-w               Causes warning messages to be suppressed. Make sure that the warnings are harmless before you suppress them.

If you are extra security conscious, have a look at the SECURITY section of the lsof man page. There are 3 main issues that the developers of lsof feel may be security caveats. Many distributions have addressed at least some of these security concerns already, but it doesn’t hurt to understand them yourself.

Introducing fuser

By default fuser just gives you the PIDs of processes that have a file open on a system. The PIDs are accompanied by a single character that represents the type of access that the process is performing on that file (f=open file, m=memory mapped file or shared library, c=current directory, etc). If you want output that’s somewhat similar to the lsof command, you can add the -v option for verbose output. According to the man page, this formats the output in a “ps-like” style. To get a full and accurate view of what files are open by all processes, make sure that you run fuser with root privileges. Listing 3 holds some of the fuser options that I find most useful.

Listing 3

-i  Used with the -k option, it prompts the user before killing each process.
-k  Attempts to kill all processes that are accessing the specified file.
-m  Shows the users and processes accessing any file within a mounted file system.
-s  Silent mode where no output is shown. This is useful if you only want to check the exit code of fuser in a script to see if it was successful.
-u  Appends the user name associated with each process to each PID in the output.
-v  Gives a "ps-like" output format that is somewhat similar to the default lsof output.

fuser is supposed to be a little lighter weight than lsof when it comes to using your system resources. To get an idea of what “a little” meant, I ran some very quick tests on both of the commands. I found that fuser consistently took only 30% – 50% of the time that it took lsof to run the same scan, but used about the same amount of RAM (within 5%). My tests were quick and dirty using the ps and time commands, so your mileage may vary. In any event very few users, if any, will notice a performance difference between the two commands because they use such a small amount of system resources.
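
If you want to run a similar informal comparison on your own system, a very rough sketch is shown below; the mount point is just an example, and the numbers you see will depend entirely on your hardware and what happens to be open at the time.

# Time how long each command takes to scan the same mounted file system
time sudo lsof /media/cdrom0 > /dev/null
time sudo fuser -m /media/cdrom0 > /dev/null 2>&1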

How-To

Hopefully, by the time you’re reading this section you either have, or are beginning to get, a pretty good understanding of both the lsof and fuser commands. Either one of them can be used to solve device and/or resource busy errors in Linux. Let’s take a look at a few scenarios.

Say that I have mounted a CD to /media/cdrom0, used it for a while copying files from it, and now want to unmount it. The problem is that Linux won’t let me unmount the CD. I get the familiar error in Listing 4, but you can see that I then use lsof and fuser to track down what’s going on.

Listing 4

$ sudo umount /media/cdrom0
umount: /media/cdrom0: device is busy.
(In some cases useful info about processes that use the device is found by lsof(8) or fuser(1))
$ sudo lsof -w /media/cdrom0
COMMAND  PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
bash    2238 jwright  cwd    DIR   11,0     4096 1600 /media/cdrom0/boot
$ sudo fuser -mu /media/cdrom0
/media/cdrom0:        2238c(jwright)

Both commands tell me which PID is accessing the file system mounted on /media/cdrom0 (2238). Each of the two commands also tells me that the process is using a directory within the /media/cdrom0 file system as its current working directory. This is shown as the cwd specifier in the lsof output, and the letter c in the output of fuser (appended to the PID). Finally, each of the commands tells me that a process I (jwright) started is using the directory, and lsof goes one step further in telling me the exact directory the process (listed as bash in the COMMAND column) is using as its current working directory.

Armed with this information, I start searching around and find that I have a virtual terminal open in which I used the cd command to descend into the /media/cdrom0/boot directory. I have to change to a directory outside of the mounted file system or exit that virtual terminal for the umount command to succeed. This example uses a simple oversight on my part to illustrate the point, but many times the process holding the file open is going to be outside of your direct control. In that case you have to decide whether or not to contact the user who owns the process and/or kill the process to release the file. Be careful when killing processes without contacting your users though, as it can cause the user who is accessing the file/directory some major problems.

Another scenario is something that has happened to me when running Arch Linux. At seemingly random intervals, MPlayer (run from the command line) would refuse to output sound and started complaining that the resource /dev/dsp was busy and that it couldn’t open /dev/snd/pcmC0D0p. Listing 5 shows an excerpt from the error MPlayer was giving me, and Listing 6 is the output that I got from running the lsof command on /dev/snd/pcmC0D0p.

Listing 5

[AO OSS] audio_setup: Can't open audio device /dev/dsp: Device or resource busy
[AO_ALSA] alsa-lib: pcm_hw.c:1325:(snd_pcm_hw_open) open /dev/snd/pcmC0D0p failed: Device or resource busy

Listing 6

$ lsof /dev/snd/pcmC0D0p
COMMAND  PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
firefox 4398 jwright  mem    CHR  116,6          5818 /dev/snd/pcmC0D0p
firefox 4398 jwright   84u   CHR  116,6      0t0 5818 /dev/snd/pcmC0D0p
exe     4534 jwright  mem    CHR  116,6          5818 /dev/snd/pcmC0D0p
exe     4534 jwright   37u   CHR  116,6      0t0 5818 /dev/snd/pcmC0D0p

After doing some research, I found that the exe process was associated with the version of the Google Chrome browser that I was running and with its use of Flash player. I closed Firefox and Chrome and then tested MPlayer again, but still didn’t have any sound. I then ran the same lsof command again and noticed that the exe process was still there, apparently hung. I killed the exe process and was then able to get sound out of MPlayer immediately.

Through this investigation I found that the problem was not truly random, but occurred whenever Chrome came in contact with a Flash movie with sound. The silent MPlayer problem only seemed random because I was not accessing Flash movies with sound at consistent intervals. Now I’m not meaning to pick on Arch Linux here, because the problem seems to have been present in other distributions as well. Also, I have been unable to reproduce this problem on newer versions of Google Chrome running on Arch Linux, telling me that the issue has probably been resolved.

Listing 7 shows a basic example of how you might use the lsof command to track what services/processes are using the libwrap (TCP Wrappers) library. Keep in mind that the | head -4 text at the end of the command line just selects the first 4 lines of output.

Listing 7

$ lsof /lib/libwrap.so.0 | head -4
COMMAND    PID    USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
pulseaudi 1690 jwright mem    REG    8,1    30960  668 /lib/libwrap.so.0.7.6
gconf-hel 1693 jwright mem    REG    8,1    30960  668 /lib/libwrap.so.0.7.6
gnome-set 1703 jwright mem    REG    8,1    30960  668 /lib/libwrap.so.0.7.6

If you wanted to get a full system-wide view of the processes using libwrap, you would run the command with sudo or by first issuing the su command (I recommend using sudo instead, though).

Carrying this example further, we could add the -i option to display the network connection information as well (Listing 8). The TCP argument to the option tells lsof that we want to only look at TCP connections, excluding other connections like UDP. This is a good way to study the services that are currently being protected by the TCP Wrappers mechanism. Please note that this command may take some time to complete.

Listing 8

$ lsof -i TCP /lib/libwrap.so.0 | head -10
COMMAND    PID    USER  FD   TYPE DEVICE SIZE/OFF   NODE NAME
pulseaudi 1675 jwright mem    REG    8,1    30960    668 /lib/libwrap.so.0.7.6
gconf-hel 1678 jwright mem    REG    8,1    30960    668 /lib/libwrap.so.0.7.6
gnome-set 1690 jwright mem    REG    8,1    30960    668 /lib/libwrap.so.0.7.6
metacity  1719 jwright mem    REG    8,1    30960    668 /lib/libwrap.so.0.7.6
gnome-vol 1770 jwright mem    REG    8,1    30960    668 /lib/libwrap.so.0.7.6
firefox   1909 jwright mem    REG    8,1    30960    668 /lib/libwrap.so.0.7.6
chrome    1992 jwright  59u  IPv4 101427      0t0    TCP topbuntu.local:42427->iy-in-f83.1e100.net:https (ESTABLISHED)
chrome    1992 jwright  61u  IPv4 124360      0t0    TCP topbuntu.local:40761->208.69.36.231:https (CLOSE_WAIT)
chrome    1992 jwright  68u  IPv4  12636      0t0    TCP topbuntu.local:35689->iy-in-f18.1e100.net:https (ESTABLISHED)

By using the -t option, you receive output from lsof that can then be passed to another command like kill. Listing 9 shows that I have opened a file with two instances of tail -f so that tail will keep the file open and update me on any data that is appended to it. Listing 10 shows a quick way to terminate both of the tail processes in one shot using the -t option and back-ticks.

Listing 9

$ lsof /tmp/testfile.txt
COMMAND   PID    USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
tail    10784 jwright    3r   REG    8,1        0 16282 /tmp/testfile.txt
tail    10792 jwright    3r   REG    8,1        0 16282 /tmp/testfile.txt

Listing 10

$ kill `lsof -t /tmp/testfile.txt`

If you haven’t seen back-ticks (`) used before in the shell, it probably looks a little strange to you. The back-ticks in this instance tell the shell to execute the command between them, and then replace the back-ticked command with the output. So, for Listing 10 the section of the line within the back-ticks would be replaced by the list of PIDs that are accessing /tmp/testfile.txt. These PIDs are passed to the kill command which sends SIGTERM to each instance of tail, causing them to exit.
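
If you prefer the newer $() form of command substitution, the same one-liner can be written as shown below, here also naming the signal explicitly rather than relying on the default.

# Equivalent to the back-tick version in Listing 10, with an explicit SIGTERM
kill -TERM $(lsof -t /tmp/testfile.txt)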

An alternative to this would be what you see in Listing 11, where the -k and -i options of the fuser command are used to interactively kill both instances of tail.

Listing 11

$ fuser -ki /tmp/testfile.txt
/tmp/testfile.txt:   11106 11107
Kill process 11106 ? (y/N) y
Kill process 11107 ? (y/N) y

Tips and Tricks

  • Don’t use the -k option with fuser without checking to see which processes it will kill first. The easiest way to do this is by using the -ki option combination so that fuser will prompt you before killing the processes (see Listing 11). You can specify a signal other than SIGKILL to send to a process with the -SIGNAL argument to the -k option.
  • As mentioned above, the -r option of lsof causes it to repeat its scan every so many seconds, or indefinitely. This can be very useful when you are writing a script that may need to call lsof repeatedly because it avoids the wasted overhead of starting the command from scratch each time.
  • lsof functionality is supposed to be fairly standard across the Linux and Unix landscape, so using lsof in your scripts can be an advantage when you’re shooting for portability.
  • When you are using fuser to check who or what is using a mounted file system, add the -m option to your command line. By doing this, you tell fuser to list the users/processes that have files open in the entire file system, not just the directory you specify. This will prevent you from being confused when fuser doesn’t give you any information even though you know the mounted file system is in use. So, you would issue a command that’s something like

    sudo fuser -mu /media/cdrom

    to save you that trouble. You still don’t know which subdirectory or file is being held open, but this is easily solved by using the +D option with lsof to search the mounted file system recursively.

    sudo lsof +D /media/cdrom/

Scripting

These scripts are somewhat simplified and in most cases could be done other ways too, but they will work to illustrate the concepts. If you use these scripts, make sure you adapt them to your situation. Never run a script or command without understanding what it will do to your system.

For the first scripting example, let’s say that it’s 5:00 and you need to leave for the day, but you also have to delete a shared configuration file that’s still being used by several people. Presumably the configuration file will be automatically recreated when someone needs it next. The script in Listing 12 shows one way of taking care of the file deletion while still leaving on time, and it uses lsof. This assumes, for the sake of the example, that every system that has access to the shared configuration file releases it when users are done and log out for the night. Make sure to run this script with root privileges or it might not see everyone that’s using the file before deleting it, causing a mess.

Listing 12

#!/bin/bash -

# Check every 30 seconds to see if everyone is done with the file
lsof +r 30 /tmp/testfile.txt > /dev/null 2>&1

# We've made it past the lsof line, so we must be ok to delete the file
rm /tmp/testfile.txt

You end up with a very quick and simple script that doesn’t require a continuous while loop, or a cron job to finish its task.

Another example would be using fuser to make a decision in a script. The script could check to see if a preferred resource is in use and move on to the next one if it is. Listing 13 shows an example of such a script.

Listing 13

#!/bin/bash -
# Make sure to run this script with root privileges or it
# may not work.

# Set up a counter to track which console we are checking
COUNTER=0

# Loop until we find an unused virtual console or run out of consoles
while true
do
    # Check to see if any user/process is using the virtual console
    fuser -s /dev/tty$COUNTER

    # Check to see if we've found an unused virtual console
    if [ $? -ne 0 ]
    then
        echo "The first unused virtual console is" /dev/tty$COUNTER
        break
    fi

    # Get ready to check the next virtual console
    COUNTER=$((COUNTER+1))

    # Try to get a listing of the virtual console we are checking
    ls /dev/tty$COUNTER > /dev/null 2>&1

    # Check to see if we've run out of virtual consoles to check.
    # The ls command won't return anything if the file doesn't exist.
    if [ $? -ne 0 ]
    then
        echo "No unused virtual console was found."
        break
    fi
done

This script loops through all of the virtual console device files (/dev/tty*) and looks for one that fuser says is unused. Notice that I’m checking the exit code of both fuser and ls via the built-in variable $?, which holds the exit status of the last command that was run.
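
As a side note, you can also test a command’s exit status directly in the if statement instead of inspecting $? afterward. A tiny sketch of that style, using an example device file, is shown below.

# Test fuser's exit status directly; the device file is just an example
if ! fuser -s /dev/tty1
then
    echo "/dev/tty1 appears to be unused"
fi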

That’s just a small sampling of what you can do with lsof and fuser within scripts. There are any number of ways to improve and expand upon the scripts that I’ve given in Listing 12 and Listing 13. Having an in-depth knowledge of the commands will open up a lot of possibilities for your scripts and even for your general use of the shell.

Troubleshooting

Every time that I try to run the lsof command on my Ubuntu 9.10 machine with administrative privileges, I get the following warning:

lsof: WARNING: can't stat() fuse.gvfs-fuse-daemon file system /home/jwright/.gvfs

This warning occurs when lsof tries to access the Gnome Virtual File System (gvfs), which is (among other things) a foundational part of Gnome’s Nautilus file manager. lsof is warning you that it doesn’t have the ability to look inside of the virtual file system and so its output may not contain every relevant file. This warning should be harmless, and can be suppressed with the -w option.

Listing 14

$ sudo lsof | head -3
lsof: WARNING: can't stat() fuse.gvfs-fuse-daemon file system /home/jwright/.gvfs
      Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
init      1 root cwd  DIR    8,1     4096    2 /
init      1 root rtd  DIR    8,1     4096    2 /

becomes something like this…

Listing 15

$ sudo lsof -w | head -3
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
init      1 root cwd  DIR    8,1     4096    2 /
init      1 root rtd  DIR    8,1     4096    2 /

If lsof stops for a long time, you might need to use some of the “Precautionary Options” listed in the Apple Quickstart Guide in the Resources section. The lsof man page also has a group of sections which start at BLOCKS AND TIMEOUTS that may help you.

Conclusion

There’s a whole host of possibilities for the lsof and fuser commands beyond what I’ve mentioned here, but hopefully I’ve given you a good start. As with so many other things, the time you put into mastering your Linux system will pay you back again and again. If you have any information to add to what I’ve said here, feel free to drop a line in the comments section or send us an email.

Resources

  1. Apple’s Quickstart Guide For lsof
  2. A Good Practical lsof Reference By Philippe Hanrigou
  3. Undeleting Files With lsof and cp
  4. Using fuser To Deal With Device Busy Errors
  5. A Good Reference On fuser (Geared Toward Solaris) By Sandra Henry-Stocker
  6. What To Do With An lsof Gnome Virtual File System (gvfs) Error
  7. LPIC-1 : Linux Professional Institute Certification Study Guide By Roderick W. Smith

Hello World!

Welcome to the new Innovations Technology Solutions blog. It is our goal to make this blog a valuable resource for anyone trying to utilize Linux and open source technologies within their business. The information presented here will include instructional items such as how-tos, tips, and tricks, updates about open source technology trends, and information about Innovations’ services and projects. Anything that we think will be helpful as you leverage open source technologies for your business will be fair game. Also, keep a lookout for multimedia including video and podcasts in our posts. We will be working to enhance the provided information with these features whenever we can. We will strive to keep posts updated with new information as it becomes available, and the times/dates when posts are updated will be reflected in their Last modified field. Please let us know if you think a post is becoming outdated.

Feel free to participate in this blog through RSS feeds, comments, emails, etc. Let us know what you think.

Thanks for visiting!