(flashback)
Our project was inching closer and closer to going live. On the evening before the Easter break, some clod started a massive report that tried to process all the data since the beginning of time. My colleague’s inefficient program gathered up all of the data from the database and wrote it to a temp file in the data directory.
The data directory was more than large enough for a test environment, but it wasn’t big enough for such a test. We could have easily recovered if it had happened while we were at work. Instead, the system attempted to process the nightly batches over those four days without enough space and made a pretty big mess.
My boss Theodore was more upset than he should have been for a test environment and kept yammering on about what if this had been production. He was right, of course, but one of the preconditions of the system is that enough resources are available. It is our group’s responsibility to write the programs, but it is someone else’s to ensure that the system doesn’t run out of resources.
Anyway, we implemented the boss’s warning email feature. Each time the program is run, it checks for enough disk space; when there isn’t enough, it sends out a warning email and quits. To be on the safe side, my boss asked that I add my email address as one of the recipients.
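The check described above could look something like the following. This is only a sketch, not the actual program: the function name has_enough_space, the threshold in the usage line, and the send_warning_mail helper mentioned there are all made up for illustration, and it assumes a df that understands the POSIX -P and -k options.

```bash
#!/usr/bin/bash
# Sketch only: has_enough_space is a hypothetical name,
# not taken from the original program.

# Succeeds when the filesystem holding directory $1 has at least $2 KB free.
has_enough_space() {
    local dir=$1 min_kb=$2 avail_kb
    # df -P prints one line per filesystem in the POSIX layout,
    # -k reports 1024-byte blocks; column 4 is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "${avail_kb:-0}" -ge "$min_kb" ]
}

# Possible use at the top of the batch program
# (send_warning_mail is hypothetical as well):
# has_enough_space /appdir 1048576 || { send_warning_mail; exit 1; }
```

Parsing the fourth column of the POSIX output keeps the check independent of the human-readable units that df -h prints.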
(the present)
If I receive one more warning email from one of the test systems I am afraid I am going to kill someone.
From: Automatically.Generated.Message@acme.com [mailto:Automatically.Generated.Message@acme.com]
Sent: Tuesday, 7 February 2017 11:54
To: process_monitoring@acme.com;
Subject: Warning ... the end is near on acme-app1

An error has occurred, the copy app has not been launched because of insufficient disk space on following partition.

Filesystem                    size   used  avail capacity  Mounted on
acme-app1_dpool/app            85G    85G     0K   100%    /appdir

Corrective action is required immediately. The status of rest of machine is as follows.

Filesystem                    size   used  avail capacity  Mounted on
/                              10G   7.7G   2.3G    77%    /
/dev                           10G   7.7G   2.3G    77%    /dev
proc                            0K     0K     0K     0%    /proc
ctfs                            0K     0K     0K     0%    /system/contract
mnttab                          0K     0K     0K     0%    /etc/mnttab
objfs                           0K     0K     0K     0%    /system/object
swap                          140G   400K   140G     1%    /etc/svc/volatile
fd                              0K     0K     0K     0%    /dev/fd
swap                          8.0G   700M   7.3G     9%    /tmp
swap                          140G    40K   140G     1%    /var/run
acme-app1_dpool/app            85G    85G     0K   100%    /appdir
acme-app1_dpool/acme_home     1.0G   353K   1.0G     1%    /appdir/home/gast
acme-app1_dpool/acme_samba    2.0G    36K   2.0G     1%    /appdir/samba
acme-app1_dpool/acme_scripts  2.0G   249M   1.8G    13%    /appdir/scripts

This is an automatically generated message for informational purposes.
The idea seemed OK: when no disk space is left, send out an email. The underlying assumption was that someone in IT would deal with the problem.
Apparently the idiot users turned off half of the system about a week ago, but not every process. I came to work and found hundreds of emails clogging up my inbox. Looking through them, you could literally see the space filling up over time.
Well, hundreds of emails are annoying, but the general functionality is awesome. A combination of a bash script and sendmail allows me to capture the important facts about our system and send them to someone.
Just look at the script.
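#!/usr/bin/bash
# Sends a disk-space warning mail via sendmail.
# FILESYSTEM is not set in this excerpt; it must hold the partition to report on.
SUBJECT="warning the end is near"   # not used below; the Subject: header is written out in full
DF_Command=$(df -h ${FILESYSTEM})
FULL_DF=$(df -h)
TO=process_monitoring@acme.com
FROM="automatically generated message"
HOSTMACHINE=acme-app1

# Build the mail as a here-document; the shell expands the variables.
# sendmail -t reads the recipients from the To: header.
(
cat << !
To: ${TO}
From: ${FROM}
Subject: warning ... the end is near on $HOSTMACHINE

An error has occurred, the copy app has not been launched because of insufficient disk space on following partition.

${DF_Command}

Corrective action is required immediately. The status of rest of machine is as follows.

${FULL_DF}

This is an automatically generated message for informational purposes.
!
) | /usr/sbin/sendmail -t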
Fill the variables with anything from a single word up to many lines of text and then substitute them into your mail. Because the here-document delimiter is unquoted, the bash shell expands them before the mail is sent out.
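To watch the expansion happen without actually mailing anyone, here is a small sketch. build_mail is a hypothetical helper, not part of the script above; it prints the message to standard output where the real script pipes it to sendmail.

```bash
#!/usr/bin/bash
# Sketch only: build_mail is a hypothetical helper for illustration.

build_mail() {
    local to=$1 host=$2 body=$3
    # The delimiter (EOF) is unquoted, so ${...} is expanded inside
    # the here-document. Writing << 'EOF' would keep the text literal.
    cat << EOF
To: ${to}
Subject: warning ... the end is near on ${host}

${body}
EOF
}

# A multi-line value is substituted as-is, line breaks included.
body="first line of the report
second line of the report"
build_mail process_monitoring@acme.com acme-app1 "$body"
```

Piping the output of build_mail into /usr/sbin/sendmail -t would send it, just as the original script does with its subshell.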
I guess the moral of the story is that more logic should be used, because some idiot will inevitably trigger it on a non-production environment. Well, that or just get rid of the idiots ...
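For what it is worth, one possible bit of "more logic" is to warn only once per shortage instead of on every run. The sketch below is an assumption about how that could be done, not something our script does today: warn_once, clear_warning, the marker file path, and send_warning_mail are all hypothetical names.

```bash
#!/usr/bin/bash
# Sketch only: warn_once, clear_warning and the marker file are
# hypothetical, not part of the original script.

# Succeeds the first time it is called for a marker file, fails afterwards.
warn_once() {
    local marker=$1
    if [ -e "$marker" ]; then
        return 1            # a warning has already been sent
    fi
    : > "$marker"           # remember that the warning went out
}

# Call this once space is available again, so the next shortage warns anew.
clear_warning() {
    rm -f "$1"
}

# Possible use in the batch program (send_warning_mail is hypothetical):
#   disk full:  warn_once /var/tmp/diskwarn.sent && send_warning_mail
#   disk fine:  clear_warning /var/tmp/diskwarn.sent
```

With something like this in place, a full test system would produce one email per incident rather than hundreds.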