Amador Pahim


For Better Exit Codes


It’s the SRE mantra: Automate Yourself Out of a Job. And by doing that, we build up, step by step, an automation infrastructure with multiple pieces of software that have to talk to each other.

Regardless of where those applications run, the exit code might be the only clue we have to figure out what happened to them when something goes wrong.

For example, if you have a Kubernetes Job that is failing, all you get from the Kubernetes API is that exit code number:

  containerStatuses:
    - restartCount: 5
      started: false
      ready: false
      name: service
      state:
        terminated:
          exitCode: 1
          reason: Error
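To make the point concrete, here is a minimal sketch of how that number could be pulled out of the status above once it has been parsed into Python data. The `pod_status` dictionary simply mirrors the snippet, and `terminated_exit_codes` is a hypothetical helper, not part of any Kubernetes client:

```python
# Hypothetical, hand-written copy of the status shown above; a real program
# would obtain this structure from the Kubernetes API.
pod_status = {
    "containerStatuses": [
        {
            "restartCount": 5,
            "started": False,
            "ready": False,
            "name": "service",
            "state": {"terminated": {"exitCode": 1, "reason": "Error"}},
        }
    ]
}


def terminated_exit_codes(status):
    """Map container name to exit code for every terminated container."""
    codes = {}
    for container in status.get("containerStatuses", []):
        terminated = container.get("state", {}).get("terminated")
        if terminated is not None:
            codes[container["name"]] = terminated["exitCode"]
    return codes


print(terminated_exit_codes(pod_status))  # {'service': 1}
```

Whatever the application wanted to say about its failure has to fit into that single integer.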

If you’re lucky enough, the container logs will tell you more, but that’s not always the case. When setting up the exit codes of your applications, you know the drill: exit code 0 is good, non-0 is bad. Still, it is common to have applications that give you exit codes like this:

No matter what happened, if it is bad, exit code 1 is used.

And making that better is what this post is about. So, for starters, your application should have different exit codes depending on the error. Something like this:
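A minimal sketch of that idea, assuming four failure modes named after the errors used later in this post. The specific numbers and the `main` helper with its `failing_step` parameter are illustrative choices:

```python
OK = 0
ERR_ACCESS = 1
ERR_PERMISSION = 2
ERR_SECRET = 3
ERR_RUNTIME = 4


def main(failing_step=None):
    """Return the code of the first failing step (hypothetical helper).

    `failing_step` only simulates a failure for the sake of the example.
    """
    steps = [
        ("access", ERR_ACCESS),
        ("permission", ERR_PERMISSION),
        ("secret", ERR_SECRET),
        ("runtime", ERR_RUNTIME),
    ]
    for name, code in steps:
        if name == failing_step:
            return code  # stop at the first error
    return OK


print(main())          # 0
print(main("secret"))  # 3
# A real script would finish with: sys.exit(main())
```

With sequential numbers like these, each exit code can describe exactly one error.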

If you get that far, you already have my appreciation. But what happens if we have more than one error in the same execution? How to flag all the errors in the exit code? One approach would be:

I’m not against it, but you might agree that it can get quite complex, since you need one exit code for each combination of errors. Also, for whoever is going to interpret that code, it won’t be easy to figure out whether a specific error happened or not. For example, if you want to know whether the run-time error happened, you have to keep your own map between the errors and all their possible codes.
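To get a feel for how quickly that grows, the sketch below enumerates every non-empty combination of four errors. The error names are placeholders for the ones in this example:

```python
from itertools import combinations

errors = ["access", "permission", "secret", "runtime"]

# Every non-empty subset of errors would need its own dedicated exit code.
subsets = [
    combo
    for size in range(1, len(errors) + 1)
    for combo in combinations(errors, size)
]

print(len(subsets))  # 15, i.e. 2**4 - 1

# Answering "did the run-time error happen?" means knowing all of these.
with_runtime = [combo for combo in subsets if "runtime" in combo]
print(len(with_runtime))  # 8
```

Four errors already need fifteen codes, and eight of them involve the run-time error.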

One way to solve that, the way I propose here, is by using bit-wise operations to both create and interpret the exit codes. With bit-wise operations, you can allocate one bit for each error. If that error happens, that bit is set. Using our example:

If the save secret error happens, we set that bit only:

If the run-time error also happens, the corresponding bit is set:

The final return code will be the binary number that accumulated all the errors, in decimal, of course. Checking if a specific error occurred is as simple as checking if that specific bit is set.
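In Python notation, the two steps above and the final check could be written roughly as follows. The bit positions match the constants defined further down; the variable name `status` is arbitrary:

```python
ERR_SECRET = 0b0100   # 4
ERR_RUNTIME = 0b1000  # 8

status = 0b0000
status |= ERR_SECRET            # set the "save secret" bit
print(format(status, "04b"))    # 0100

status |= ERR_RUNTIME           # set the "run-time" bit as well
print(format(status, "04b"))    # 1100
print(status)                   # 12

# A bit is set when AND-ing with its mask gives a non-zero result.
print(bool(status & ERR_RUNTIME))  # True
print(bool(status & 0b0001))       # False
```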

All good for the theory? Let’s do some coding. First, we define the bit for each error:

OK = 0              # Binary 0000
ERR_ACCESS = 1      # Binary 0001
ERR_PERMISSION = 2  # Binary 0010
ERR_SECRET = 4      # Binary 0100
ERR_RUNTIME = 8     # Binary 1000
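The same constants can also be written with shifts, which makes the one-bit-per-error rule explicit. One caveat worth encoding: on POSIX systems only the low 8 bits of an exit status reach the parent process, so this scheme has room for at most eight distinct errors. The `ALL_ERRORS` tuple and the assertions are additions for illustration:

```python
OK = 0
ERR_ACCESS = 1 << 0      # 0001 -> 1
ERR_PERMISSION = 1 << 1  # 0010 -> 2
ERR_SECRET = 1 << 2      # 0100 -> 4
ERR_RUNTIME = 1 << 3     # 1000 -> 8

ALL_ERRORS = (ERR_ACCESS, ERR_PERMISSION, ERR_SECRET, ERR_RUNTIME)

for flag in ALL_ERRORS:
    # Each flag must be a single bit (a power of two) ...
    assert flag > 0 and flag & (flag - 1) == 0
    # ... and must fit in the 8 bits a POSIX exit status can carry.
    assert flag <= 0b1000_0000

# No two flags may share a bit, otherwise they cannot be told apart.
assert sum(ALL_ERRORS) == ERR_ACCESS | ERR_PERMISSION | ERR_SECRET | ERR_RUNTIME
```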

Then we make sure that the proper error will be somehow flagged when it happens. Example:

def add_secret():
    error = True
    if error:
        return ERR_SECRET
    return OK

Last but not least, we accumulate the errors:

exit_status = OK
exit_status |= add_secret()

The |= there is Python’s in-place bit-wise OR operator. It takes the current value of exit_status, performs a binary OR with the value returned by add_secret(), and assigns the result back to the exit_status variable:

BINARY OR

  0000  # Initial value of exit_status
  0100  # add_secret() return value
--------
  0100  # final exit_status

At the end of the execution, we just call sys.exit(exit_status). Here’s the complete dummy code:

#!/usr/bin/env python

import sys


OK = 0              # Binary 0000
ERR_ACCESS = 1      # Binary 0001
ERR_PERMISSION = 2  # Binary 0010
ERR_SECRET = 4      # Binary 0100
ERR_RUNTIME = 8     # Binary 1000


def check_access():
    error = False
    if error:
        return ERR_ACCESS
    return OK


def check_permission():
    error = False
    if error:
        return ERR_PERMISSION
    return OK


def add_secret():
    error = True
    if error:
        return ERR_SECRET
    return OK


def run():
    error = True
    if error:
        return ERR_RUNTIME
    return OK


exit_status = OK

exit_status |= check_access()
exit_status |= check_permission()
exit_status |= add_secret()
exit_status |= run()

sys.exit(exit_status)

Executing that code, we have:

$ ./app.py 
$ echo $?
12
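The same value can be read from another Python program, which is closer to what an automation tool would do. In this sketch the child process is a stand-in one-liner that exits with 12, since `app.py` is not assumed to be on disk:

```python
import subprocess
import sys

# Stand-in for ./app.py: a child interpreter that exits with status 12.
child = [sys.executable, "-c", "import sys; sys.exit(12)"]

result = subprocess.run(child)
print(result.returncode)  # 12
```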

So, 12 is the application return code. We know that something bad happened, but to know whether ERR_RUNTIME happened, we have to check if the corresponding bit is set. To do that, we will use the bit-wise AND operator:

BINARY AND

  1100  # Application exit code (12)
  1000  # ERR_RUNTIME code (8)
--------
  1000  # If the result equals the ERR_RUNTIME
        # code (8), that error happened

Sticking to Python, we use the & operator:

$ python -c 'print(12 & 8)'
8

The call above printed 8, meaning that the error with code 8 happened in the application execution. We do that for each error that we want to check.

Now that we know how to extract the information from the exit code, let’s make it systematic:

#!/usr/bin/env python

import sys


OK = 0              # Binary 0000
ERR_ACCESS = 1      # Binary 0001
ERR_PERMISSION = 2  # Binary 0010
ERR_SECRET = 4      # Binary 0100
ERR_RUNTIME = 8     # Binary 1000

exit_code = int(sys.argv[1])

if exit_code & ERR_ACCESS:
    print('ERR_ACCESS')

if exit_code & ERR_PERMISSION:
    print('ERR_PERMISSION')

if exit_code & ERR_SECRET:
    print('ERR_SECRET')

if exit_code & ERR_RUNTIME:
    print('ERR_RUNTIME')

Running that script:

$ ./check.py 12
ERR_SECRET
ERR_RUNTIME
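As a variation, the chain of `if` statements can be collapsed into a loop. The sketch below uses `enum.IntFlag` from the standard library; the `ExitStatus` class and the `decode` function are hypothetical names, not part of the scripts above:

```python
from enum import IntFlag


class ExitStatus(IntFlag):
    ERR_ACCESS = 1
    ERR_PERMISSION = 2
    ERR_SECRET = 4
    ERR_RUNTIME = 8


def decode(exit_code):
    """Return the names of all errors whose bit is set in exit_code."""
    return [flag.name for flag in ExitStatus if exit_code & flag]


print(decode(12))  # ['ERR_SECRET', 'ERR_RUNTIME']
print(decode(0))   # []
```

Adding a new error then only means adding one line to the class, as long as it gets a bit of its own.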

That is it! We have an application that exits with a meaningful code, which accumulates all the errors that happened, and a check script that knows how to interpret the application exit code. I’m sure you can come up with a number of use cases for that (did I hear “CI pipelines”?), but the ball is in your court now. Have fun!

:wq!
