Exited (143)

One day, I had to resolve a JIRA ticket with the following message:

Steps:

(Some steps here)

Expected Result:
Every container should be up & running.

Actual Result:
A few containers are down

There was also a nice figure presenting that state:

[Figure: container listing with several entries showing "Exited (143)" in the STATUS column]

The first time I read the JIRA ticket I did not pay much attention to the figure above. I just logged in to the server to see what had happened with the applications running in the Docker containers.

There was nothing interesting. The apps seemed to have been working properly (without any exceptions, crashes etc.) …and suddenly they shut down…and they did that gracefully (meaning they closed their connections and any other IO, etc.).

It all looked like they were asked to shut down…in fact, the figure above says that right away in the STATUS column. It says “Exited (143)”…
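As a side note, the exact exit code of a stopped container can also be read back later with docker inspect (the container name below is just a placeholder):

docker inspect -f '{{.State.ExitCode}}' some_app_container
143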

I am not a DevOps engineer or an admin, so I had to make sure that the signal code was in fact the one I suspected….

There are two very important references regarding signals in Unix: Unix Signals and exit codes with special meaning.

The first one is just a list of the signals that are used in the Unix system. You can get the same list by typing


kill -l

in the terminal.
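As a quick sanity check, bash’s built-in kill can also translate a signal number back into its name:

kill -l 15
TERM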

These signals are numbered from 1 to 34 and none of these numbers matches the 143 code…

That is why the second page is more interesting as it tells us how to interpret these exit codes:

Exit Code: 128+n
Meaning:   Fatal error signal “n”
Example:   kill -9 $PPID of script
Comments:  $? returns 137 (128 + 9)

If we do the simple math, 143 - 128 = 15, we get our signal: SIGTERM is what killed the process.
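A quick way to see this mapping in action (just a sketch, run on any Linux box with bash) is to start a shell and immediately send it SIGTERM, then check what exit code the parent shell reports:

bash -c 'kill -TERM $$'
echo $?
143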

Of course, we could google more to find out what exactly SIGTERM does, etc…But the most important thing here is that the apps got killed with the SIGTERM signal….so they were asked to shut down by somebody or some external process.

This was not enough for the reporter of the JIRA ticket.

Is there anything more we can do? Yes, if we know when such an event happened.

Production Unix systems should have logs that will tell us who might have logged in and what commands the user issued.

If you are dealing with RedHat (this is what the server I worked with was running) you will find the logs under:


less /var/log/secure-20170716

Jul 13 10:35:46 ip-x-x-x-x sshd[4494]: Accepted publickey for xx-user from xx.xxx.xx.xx port XXXXX
Jul 13 10:35:46 ip-x-x-x-x sshd[4494]: pam_unix(sshd:session): session opened for user xx-user by (uid=0)
Jul 13 10:35:46 ip-x-x-x-x sudo: xx-user : TTY=unknown ; PWD=/home/xx-user ; USER=root ; COMMAND=/usr/bin/pkill -f some_app.jar
Jul 13 10:35:46 ip-x-x-x-x sshd[4496]: Received disconnect from xx.xxx.xx.xx: 11: disconnected by user
Jul 13 10:35:46 ip-x-x-x-x sshd[4494]: pam_unix(sshd:session): session closed for user xx-user

The log above gives us definite proof that user ‘xx-user’ logged in and ran the command:

COMMAND=/usr/bin/pkill -f some_app.jar

And this is the source of our SIGTERM signal that shut down the app.
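If the secure log is long, it can be narrowed down by searching for sudo entries that contain kill commands (the path and pattern here are just what fit this particular setup):

sudo grep "COMMAND=.*kill" /var/log/secure*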

In the end, it turned out that there was a Jenkins job that logged in, killed the app, then deployed and ran the new app jar (which happened to be the same app that was running in Docker).

This was very bizarre because we were not aware that the Jenkins job kills the app with the pkill command, which goes through the whole list of processes and, with -f, matches against the full command line….and it kills every matching process… So it killed our Docker app as well.
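A safer habit before letting a job run pkill -f is to preview what the pattern would actually match, for example with pgrep (the -a flag is available in newer procps releases; a plain ps -ef | grep works everywhere):

pgrep -af some_app.jar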