This patch applies to GroundWork Monitor 7.1.0.
The GroundWork Monitor 7.1.0 release did not fully carry over to the archive database some changes that had been made in the runtime database. Consequently, archiving is broken in the 7.1.0 release, both for fresh installs of 7.1.0 and for upgrades to 7.1.0 from a previous release. Because archiving fails, the normal purging of old records in the runtime database is skipped. Records then build up to an unacceptable level in the runtime database, which slows performance and causes other indirect problems.
A secondary issue, not related to the archive database structure issue just mentioned, is that archiving has been removing certain old data points from the runtime database which might be needed for availability graphing in the Status Viewer. Bundled into this package are upgraded archiving scripts that perform more selective delete operations in the runtime database, preserving the important data points even though they would otherwise have aged out. This fix means the archiving software will no longer remove useful data, but by itself it does not restore the important data points which have previously been deleted from the runtime database. A fix for that will come in a future patch now under development.
The TB7.1.0-2.archive-fixes.tar.gz tarball attached to this article provides replacement files for the 7.1.0 release to fully implement the archive database and get daily archiving back in order.
|Run these commands as the nagios user.|
Whether for a freshly-installed 7.1.0 or for a 7.1.0 which was upgraded from some previous release, run all the following steps while logged in as the nagios user.
Before you install the replacement files, back up the existing files:
Next, install the new files, here assuming that the tarball has been placed into the /tmp directory:
Then compare the old and new config files. Resolve differences between any previously localized settings and those that came in with this patch. The patch includes support for the auditlog, hostname, hostidentityid, and hostblacklistid tables to be archived, so you should keep that setup as-is in the new config files.
If your system was installed with 7.1.0 without upgrading that system from a previous release, follow these steps. This will re-create the archive database from scratch. Since archiving never worked in 7.1.0 before, and your 7.1.0 system was a fresh install not upgraded from a prior release, you won't be destroying any existing archive data by following these steps.
On the other hand, if you followed some process like installing a fresh 7.1.0 on a new server and then importing gwcollagedb and archive_gwcollagedb databases from an older server before starting production on the new server, you must instead follow the procedure in the "Additional steps for systems upgraded to 7.1.0 from a previous release" section, below.
Run the following commands in a bash shell. Each run of $psql will ask for a password. Respond with the administrative password of the PostgreSQL-database postgres user.
Do not worry about the
NOTICE: constraint "host_hostname_key" of relation "host" does not exist, skipping
message that appears when you run the Archive_GWCollageDB_extensions.sql script. It is only a NOTICE, not a WARNING or ERROR, and it is normal and expected.
If you either upgraded your present 7.1.0 server from a previous release of GroundWork Monitor, or you installed a fresh 7.1.0 release and then imported databases from a previous release of GroundWork Monitor, this section is for you.
Run the scripting needed to repair the structure and content of the archive database, in this order:
The conflicting_archive_service_rows.pl script, when run with the -m show option, will show you whether any archive-database rows need to be deleted to synchronize with the runtime database. You will be asked to type in the PostgreSQL administrative password to perform these steps. If you see this message in the results:
there will be no reason to run the script again with the -m remove option. However, if you see rows something like this:
then you will need to run the -m remove form of the command to clean up these rows. You should see:
Afterward, run the -m show form again to demonstrate that all of the required cleanup has been done.
Archiving is normally scheduled to run at 00:30 each morning, via a nagios-user cron job. You can check to see whether it succeeded or failed by looking at the archiving log files:
A quick check of the status of the last run can be made this way, without examining the whole files:
Archiving cycles cannot be run immediately back-to-back. The minimum_additional_hours_to_archive parameter in the config/log-archive-send.conf file enforces a minimum delay between cycles. Do not change this parameter in an attempt to speed up the transfer of data from the runtime database to the archive database; it won't work.
After the steps listed above, daily archiving should run without error. Because many records will have built up in the runtime database since the 7.1.0 release was installed, the first few archiving runs may take a fair amount of time; take that into account. Also, records are not deleted from the runtime database until that data has lain in the archive database for a few days; this is controlled by the post_archiving_retention_days_for_messages and post_archiving_retention_days_for_performance_data configuration parameters in the config/log-archive-send.conf file. This delay is intentional; you should allow it to happen without interference.
In the normal configuration, both of those parameters are set to 2 (days). You should pretend that value is 3, because the determination of a full "day" might depend on exact timing of the daily archiving runs, which can vary somewhat. Once that time period has passed, the archiving will have been seen to be working (via the log files mentioned above), the runtime database will have been pruned back, and your system should be operating more smoothly.
|The first run of log archiving after a hiatus is special.|
The initial run of archiving after you install the fixes described above might be best run by hand (see the instructions below).
The initial run of archiving on a production system or after a hiatus is special, because it needs to sweep up all the accumulated legacy data at one time. See the Initial application of archiving section below for the considerations to address before the first run. Subsequent runs will only archive data which has not yet been archived (subject to configured redundancy). If a run is skipped, there will be no loss of continuity; generally speaking, all the data which would have been archived in the skipped run will simply be picked up in the following run. A run could be missed because of either a major event (say, you have a power outage one night) or a minor event (say, you have a temporary failure in hostname resolution, so the scripting cannot connect to the archive database at the moment it first needs to do so).
The initial run of the scripting on a production system, or a run after a long hiatus, should have dump_days_maximum set to 10000, which is the default value for this parameter in the shipped send-side config file (config/log-archive-send.conf). This setting will archive all data from the Pleistocene era until midnight last night, in the local timezone. For a large site which has been running a long time with fine-grained performance data, this could require a lot of space in the filesystem, and a lot of CPU and disk activity. You should plan for that, perhaps by running the sending script manually the first time (as the nagios user) during a weekend maintenance window. To do that, you may need to temporarily disable the software by either commenting out the nagios-user cron job that runs the nightly archiving:
or via the enable_processing directive in the config/log-archive-send.conf config file.
|Have enough disk space available for staging data transfers|
When archiving is first put into play, or after a long hiatus, there will be a very large backlog of message and performance data to be transferred to the archive database. This data is staged in files, and the same data will appear in multiple sets of those files (for safety purposes). Make sure you have enough space in the filesystem on the source and target machines where these files will be stored. The locations are given by the log_archive_source_data_directory parameter in the config/log-archive-send.conf configuration file and the log_archive_target_data_directory parameter in the config/log-archive-receive.conf configuration file.
For purposes of estimating the amount of space needed, the sizes of the logmessage and logperformancedata tables in the gwcollagedb database are most critical:
While the details will vary from site to site, a reasonable first estimate is to allow 200 bytes per logmessage row, and 61 bytes per logperformancedata row, for each copy of these tables that will be retained in the filesystem. Multiply the space so determined for one copy by the source_dumpfile_retention_days parameter in the log-archive-send.conf config file on the source machine, for the space needed on the source machine. If your target machine is different from your source machine (this is not a common scenario), multiply the space so determined for one copy by the target_dumpfile_retention_days parameter in the log-archive-receive.conf config file on the target machine, for the space needed on the target machine.
Example: We have 768539 logmessage rows, and 118465 logperformancedata rows in our runtime database. Dumping out this data into files will take around:
With source_dumpfile_retention_days set to 11.5 (the default value, as shipped), this yields:
In other words, you would need to ensure that about 2 GB of space is available on the source machine, in the log_archive_source_data_directory filesystem (normally configured as /usr/local/groundwork/core/archive/log-archive), for the initial period of archiving this example database.
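The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described in this article, not part of the patch: the function names are made up for illustration, and the per-row sizes are the rough first-estimate figures given above, so treat the result as approximate for your own site.

```python
# Rough per-row sizes suggested in this article; actual sizes vary by site.
BYTES_PER_LOGMESSAGE_ROW = 200
BYTES_PER_LOGPERFORMANCEDATA_ROW = 61


def bytes_per_copy(logmessage_rows, logperformancedata_rows):
    """Estimated size in bytes of one dumped copy of both tables."""
    return (logmessage_rows * BYTES_PER_LOGMESSAGE_ROW
            + logperformancedata_rows * BYTES_PER_LOGPERFORMANCEDATA_ROW)


def staging_bytes(logmessage_rows, logperformancedata_rows, retention_days):
    """Estimated staging space: one copy times the dumpfile retention days."""
    one_copy = bytes_per_copy(logmessage_rows, logperformancedata_rows)
    return one_copy * retention_days


# Row counts from the example above; 11.5 is the shipped default for
# source_dumpfile_retention_days according to this article.
one_copy = bytes_per_copy(768539, 118465)
total = staging_bytes(768539, 118465, 11.5)

print(f"one copy: {one_copy:,} bytes")
print(f"source machine: about {total / 1e9:.2f} GB")
```

For the example row counts this works out to roughly 161 MB for one copy and roughly 1.85 GB on the source machine, which is where the "about 2 GB" figure comes from. If your target machine is separate, the same calculation applies with target_dumpfile_retention_days in place of the retention value.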
To run archiving manually, you must do so as the nagios user: