GWME-7.1.0-8 - NoMa fixes

Problem

NoMa has been found to have a few bugs affecting the reliable delivery of notifications.

  • GWMON-10653: Monarch seed data for host and service notify by NoMa commands are missing
  • GWMON-10961: The fas.executor.interrupt property should be present in our standard foundation.properties file.
  • GWMON-12478: Notifications are not displayed under Configuration > NoMa > Logs
  • GWMON-12574: Error in handling a socket file descriptor.
  • GWMON-12700: noma_daemon.pl cycles at 1-second intervals
  • GWMON-12790: NoMa drops notifications with a socket error.
  • GWMON-12857: Perl warning messages produced by the NoMa daemon.
  • GWMON-12863: Foundation may time out the alert_via_noma.pl script before it can even start.
  • GWMON-12997: NoMa does not escalate beyond the first run of rules, even when using rollover.
  • GWMON-13006: NoMa.yaml needs restricted permissions.
  • GWMON-13023: NoMa fails to recognize large notification-number counts
  • GWMON-13041: NoMa Contacts "Suppress multiple alerts" option is broken
  • GWMON-13042: NoMa sometimes sends out problem notifications for recovery alerts
  • GWMON-13051: NoMa nth-weekday-of-month calculations are probably bogus
  • GWMON-13074: NoMa alert script help message is incomplete
  • GWMON-13076: NoMa hostgroup / servicegroup filtering model is broken
  • GWMON-13080: NoMa contains bad Perl constructions
  • GWMON-13084: NoMa notifies on some recovery alerts when it should not
  • GWMON-13086: NoMa alert script requires a controlled -u option value for correct operation
  • GWMON-13099: NoMa voicecall bundling does not work
  • GWMON-13107: noma database could use additional indexes for efficient operation

Solution

This patch rolls up all the available NoMa-related fixes into one patch for the GWME 7.1.0 release. Some NoMa files are replaced, the Perl JSON::PP package is installed, the config/foundation.properties file is augmented with a new configuration option, and unique-ID values are adjusted in the noma database.

That new option (fas.executor.interrupt) controls how long the Java thread that runs the alert_via_noma.pl script for CloudHub-related notifications can run. Field experience shows that the historical hardcoded timeout has been too small for reliable operation in the context of NoMa. Exposing this parameter in the config file allows it to be adjusted if necessary. The default in the config file is now set an order of magnitude larger, which should be sufficient to prevent problems even on large, heavily-loaded systems.

Installing

  1. Download the patch file tar archive to, for example, the /tmp directory.
    Name Size Creator Creation Date Comment  
    ZIP Archive TB7.1.0-8.noma_fixes.tar.gz 142 kB Glenn Herteg Sep 07, 2017 14:26 MD5: a3edf629f9e51c41087fc5c41078a139  
  2. Unroll the downloaded tar archive. The patch files will appear in the TB7.1.0-8.noma_fixes/ subdirectory. Go there and run the install script.
    cd /tmp    # or wherever you downloaded the TB7.1.0-8.noma_fixes.tar.gz tarball to
    service groundwork stop noma
    tar xvfz TB7.1.0-8.noma_fixes.tar.gz
    cd TB7.1.0-8.noma_fixes
    ./TB7.1.0-8_install
    

    The original files which are affected by this patch are first backed up, then the changes are applied, and the patch directory is adjusted to reflect the application of this patch.

  3. Bounce NoMa, to run using the replacement files. Also bounce Foundation, to pick up the non-default setting for the fas.executor.interrupt parameter.
    service groundwork restart noma
    service groundwork restart gwservices
    

Uninstalling

Uninstalling this patch must force the loss of operational data
The changes to NoMa made in this patch switch it from using externally-generated unique-alert-ID values (mostly from Nagios) to using internally-generated unique-ID values. This is necessary for NoMa to fully support CloudHub and other agents that might send in alerts. Those agents have historically not provided proper ID values, and it was deemed to be the wrong approach to have them do so. Instead, we now centralize the tracking of alerted-upon states within NoMa itself.

One consequence of this shift is that we need to alter the operational data in the noma database when this patch is installed. Of necessity, this is essentially a one-way transition. It is done to prevent collisions between historical ID values and those which will henceforth be internally generated. No record is kept of the old values that would be of any use in later trying to revert the database content. And in any case, new alerts created while the patch is in play will be following a new regime that would not have entries in such a mapping.

The upshot is that if you decide to uninstall this patch, you will probably need to clear out all the operational tables in the noma database at the same time. This will leave an open field for further operation of NoMa.

Uninstalling this patch does not revert the config/foundation.properties file
The change made to the config/foundation.properties file for this patch is quite simple, and desirable anyway for other reasons. More importantly, other patches might also alter this file, and it would be unfortunate if uninstalling this patch inadvertently rolled back some of those changes. Therefore, we leave this file alone during an uninstall of this patch. The original file is still available in the backup tarball made during patch installation, so that is one place it can be found if there is some desperate need for it. If it finds it needs to modify the file, the install process also makes a sibling config/foundation.properties.pre-TB7.1.1-10-noma file that is more convenient to reference, and can be directly compared against the config/foundation.properties file to highlight any differences in the final file contents.
  1. Go back to the patch directory, and run the uninstall script.
    service groundwork stop noma
    cd TB7.1.0-8.noma_fixes
    ./TB7.1.0-8_uninstall
    

    The backup directory will be accessed to restore the original files, and the patch directory will be processed to reflect the restoration of those files.

  2. Clear out the noma database operational tables. After this step, general setup will be left intact, but all record of previous notifications will be gone. Enter the password for the noma database when requested.
    /usr/local/groundwork/postgresql/bin/psql -U noma -d noma
    delete from escalation_stati;
    delete from notification_logs;
    delete from notification_stati;
    delete from tmp_active;
    delete from tmp_commands;
    \q
    
  3. Bounce NoMa and Foundation, to revert back to the original files and the original setting for the fas.executor.interrupt parameter.
    service groundwork restart noma
    service groundwork restart gwservices
    

Configuration

The Nagios commands to send notifications to NoMa use the alert_via_noma.pl script. The flag used to pass the incident id (-u) has historically been mis-documented, and needs to be updated to refer to the host or service PROBLEMID instead of the NOTIFICATIONID. This change is necessary for subsequent notifications on the same incident to function properly, clocking the alert-counting logic so the right contacts get sent notifications as configured. The commands below reflect not only that change, but some other refinements as well.

That said, installation of this patch switches NoMa to generating incoming-alert IDs internally. (This is controlled by the new generate_IDs option in the noma/etc/NoMa.yaml config file, which will be automatically enabled when this patch is installed.) Switching to internal generation of ID values means that the -u option value will be ignored. So we document the clean construction here for a clear understanding of what would need to be in play if Nagios were the only alerting agent and internal generation of ID values was disabled. Regardless, you should put these definitions in place to avoid any future confusion.

These are the updated commands we will be using in future GWMEE releases. You can copy/paste these definitions into your own copies of these commands in Monarch (under Configuration > Commands > Modify).

  • host-notify-by-noma command line:

    /usr/local/groundwork/noma/notifier/alert_via_noma.pl -c h -s "$HOSTSTATE$" -H "$HOSTNAME$" -G "$HOSTGROUPNAMES$" -n "$NOTIFICATIONTYPE$" -i "$HOSTADDRESS$" -o "$HOSTOUTPUT$" -t "$TIMET$" -u "$$(( $HOSTPROBLEMID$ ? $HOSTPROBLEMID$ : $LASTHOSTPROBLEMID$ ))" -A "$$([ -n "$NOTIFICATIONAUTHORALIAS$" ] && echo "$NOTIFICATIONAUTHORALIAS$" || echo "$NOTIFICATIONAUTHOR$")" -C "$NOTIFICATIONCOMMENT$" -R "$NOTIFICATIONRECIPIENTS$"

  • service-notify-by-noma command line:

    /usr/local/groundwork/noma/notifier/alert_via_noma.pl -c s -s "$SERVICESTATE$" -H "$HOSTNAME$" -G "$HOSTGROUPNAMES$" -E "$SERVICEGROUPNAMES$" -S "$SERVICEDESC$" -o "$SERVICEOUTPUT$" -n "$NOTIFICATIONTYPE$" -a "$HOSTALIAS$" -i "$HOSTADDRESS$" -t "$TIMET$" -u "$$(( $SERVICEPROBLEMID$ ? $SERVICEPROBLEMID$ : $LASTSERVICEPROBLEMID$ ))" -A "$$([ -n "$NOTIFICATIONAUTHORALIAS$" ] && echo "$NOTIFICATIONAUTHORALIAS$" || echo "$NOTIFICATIONAUTHOR$")" -C "$NOTIFICATIONCOMMENT$" -R "$NOTIFICATIONRECIPIENTS$"

Labels

tb tb Delete
bulletins bulletins Delete
7-1-0 7-1-0 Delete
Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.