This page reviews automated backup and recovery for GroundWork Monitor.
Install Packages

The initial installation of the system, and many patches, are delivered as RPMs and Bitrock .bin packages, along with defined steps for applying them. The base installation of the system from bare metal can be repeated exactly. The original sources and a log book or run book detailing the specific configuration entries (IP address, name, DNS, etc.) are necessary for this level of recovery.
Databases

GroundWork Monitor uses PostgreSQL as the database engine for storage of configuration, control, and monitored status and events. Several databases are named below; the list is not exhaustive, as others may be added as more projects are integrated. Each of these databases is essential to the function of GroundWork Monitor, so all of them must be included in a backup and recovery strategy.
- GWCollageDB - GroundWork Foundation database
- Monarch - Configuration database
- Dashboard - Insight and Availability Reports database
- JBoss Portal - User Interface, web applications and permission databases
Files

GroundWork Monitor uses certain files which may be included in the RPMs or .bin packages for installing the product, as well as others which are added during customization. While the packages can rebuild most of them, tuning and localization must be specifically captured post-install. These files must therefore also be identified and addressed by the backup and recovery strategy.
Conclusion

The backup and recovery strategy must treat the initial installation, the databases, and the configuration files in different ways to ensure a successful operation.
We recommend the following methods for backup and restore. Depending on the size of the database(s), you may choose to make backups more frequently.
- Run a cron job to perform a complete backup of the PostgreSQL databases. This produces a point-in-time image of all the databases in a single timestamped backup file. The cron job should also prune the directory of images older than 5 days.
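A minimal sketch of such a cron job, assuming the GroundWork PostgreSQL binaries under /usr/local/groundwork and a postgres superuser; the backup directory is hypothetical, and the script degrades to a dry run (an empty placeholder file) when the binary is absent:

```shell
# Sketch of a full-backup cron job. Paths and the postgres user are
# assumptions -- adjust them to your installation.
BACKUP_DIR="${BACKUP_DIR:-/tmp/gw-backups}"   # should be a share off the monitored host
PG_DUMPALL="${PG_DUMPALL:-/usr/local/groundwork/postgresql/bin/pg_dumpall}"
mkdir -p "$BACKUP_DIR"

STAMP=$(date +%Y%m%d-%H%M%S)
DUMP_FILE="$BACKUP_DIR/gw-full-$STAMP.sql"

# Point-in-time image of every database in the cluster, in one file.
if [ -x "$PG_DUMPALL" ]; then
    "$PG_DUMPALL" -U postgres > "$DUMP_FILE"
else
    : > "$DUMP_FILE"   # dry-run placeholder when PostgreSQL is absent
fi

# Prune images older than 5 days.
find "$BACKUP_DIR" -name 'gw-full-*.sql' -mtime +5 -delete
```

A crontab entry such as `0 2 * * * /path/to/full-backup.sh` (path hypothetical) would run it nightly at 02:00.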
- Once a day, at a time when change is likely to be minimal, run a cron job that performs a separate backup of each of the critical databases listed above. Because each run overwrites the existing file, these backups are not persistent.
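A sketch of the per-database cron job; the database names shown are assumptions (verify yours with `psql -l`), the paths are hypothetical, and the script degrades to a dry run when the binary is absent:

```shell
# Per-database backups: fixed file names, so each run overwrites the
# previous dump and the backups are not persistent.
BACKUP_DIR="${BACKUP_DIR:-/tmp/gw-backups}"
PG_DUMP="${PG_DUMP:-/usr/local/groundwork/postgresql/bin/pg_dump}"
mkdir -p "$BACKUP_DIR"

# Database names are assumptions -- confirm them on your installation.
for db in gwcollagedb monarch dashboard jbossportal; do
    if [ -x "$PG_DUMP" ]; then
        "$PG_DUMP" -U postgres "$db" > "$BACKUP_DIR/$db.sql"
    else
        : > "$BACKUP_DIR/$db.sql"   # dry-run placeholder
    fi
done
```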
- Finally, in the normal course of operation when using Monarch, always choose to make a backup before performing a Commit. The backups created are functionally equivalent to those produced by the cron jobs above, and have the advantage of spanning all changes ever made (assuming you keep the entire set of backups).

The location of the backup target directory is an important choice. It should be a share on a resource that meets these criteria:

- It is not associated with the machine or machines being backed up.
- An off-site copy should be considered, produced from the dump files by writing to removable media.
- Regularly scheduled rotation of the off-site media should be set up in accordance with your company's data retention and recovery policy guidelines.
Note: Making a complete PostgreSQL dump can take several minutes, and during the backup the database may be unavailable for regular updates by the monitoring engine. It is therefore not practical to perform such backups except when the monitoring system has little or no activity, or when it can be guaranteed to be inactive.
You can mitigate the interaction between backup and system use. By stopping gwservices before making the backup and restarting them afterward, the GWCollageDB database is not open for updates; feeders are quiescent, and any messages that arrive during the backup are held in unprocessed log files and picked up afterward. Base this choice on experience with your installation: periodically review the log files (/usr/local/groundwork/foundation/container/logs) for the times when backups occur, to see whether connections to PostgreSQL are refused.
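One way to sketch this wrapper, assuming the Bitrock ctlscript.sh control script at its default location; the helper degrades to a dry run (it just reports the action) when the script is absent:

```shell
# Quiesce GroundWork services around the full PostgreSQL dump.
# The ctlscript.sh path is an assumption -- adjust to your installation.
GW="${GW:-/usr/local/groundwork}"

ctl() {
    if [ -x "$GW/ctlscript.sh" ]; then
        "$GW/ctlscript.sh" "$@"
    else
        echo "dry run: ctlscript.sh $*"
    fi
}

ctl stop gwservices    # GWCollageDB is no longer open for updates
# ... run the full PostgreSQL dump here ...
ctl start gwservices   # feeders then process the log files held during the backup
```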
An alternative is to use transaction logging in PostgreSQL. Transaction logging means the database engine writes every transaction to a log file as transactions are completed. The value of this is that long transactions, like the backup operation, can coexist with other activities, and that recovery of the database is possible up to a moment in time, which is useful in case of system failure. The combination of the most recent full backup (a PostgreSQL dump) and any log files produced since that backup can produce a clean recovery.
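A minimal sketch of enabling transaction (write-ahead) log archiving in postgresql.conf; the archive destination is hypothetical, and the setting names follow PostgreSQL 9.x:

```
# postgresql.conf -- archive every completed transaction (WAL) segment
wal_level = archive                         # use 'replica' on PostgreSQL 9.6 and later
archive_mode = on                           # takes effect after a server restart
archive_command = 'cp %p /backup/wal/%f'    # hypothetical archive destination
```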
The consequence to customers is that a system failure resulting in data loss on the monitoring server may compromise recovery of the running image and the PostgreSQL databases. If in-place recovery fails, the most recent backup image must be used. Monitoring data (status and availability) and configuration changes to Monarch, users, dashboards, and the portal will be lost for the period between the most recent backup and the time that recovery is complete.
Which backup image you use depends on the type of failure you have encountered. Assess the situation and choose the image that requires the least work to recover. The recovery steps follow.
- Bring up the new server or installation by recovering to the point of a running PostgreSQL with the correct filesystems in place.
- If there are any applications running (like gwservices) stop them.
- Unzip the latest available PostgreSQL dump file onto an accessible file share.
- Restore from the dump, where xxx is the file name of the dump file.
- Restart the PostgreSQL engine and associated GW services.
- Relaunch your browser and clear its cache.
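The restore step above can be sketched as follows, assuming a plain-SQL pg_dumpall image; xxx stands for the dump file name as in the steps, the paths are hypothetical, and the script degrades to a dry run when psql is absent:

```shell
# Restore a full-cluster dump. Feeding a pg_dumpall image to psql against
# the 'postgres' maintenance database recreates and reloads every database.
PSQL="${PSQL:-/usr/local/groundwork/postgresql/bin/psql}"
DUMP="${DUMP:-/backup/postgres/xxx}"   # xxx = file name of the unzipped dump

if [ -x "$PSQL" ]; then
    "$PSQL" -U postgres -f "$DUMP" postgres
else
    echo "dry run: psql -U postgres -f $DUMP postgres"
fi
```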
- To recover an individual database, perform the steps above, substituting for xxx the dump file associated with the recovery; for example, the gwcollagedb dump file.
- Restart the PostgreSQL engine and associated GW services.
- Relaunch your browser and clear its cache.
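A sketch of an individual-database restore, using gwcollagedb as the example; the database name and paths are assumptions, and the script degrades to a dry run when psql is absent:

```shell
DB=gwcollagedb                          # the database being recovered (example)
DUMP="/backup/postgres/$DB.sql"         # hypothetical per-database dump file
PSQL="${PSQL:-/usr/local/groundwork/postgresql/bin/psql}"

if [ -x "$PSQL" ]; then
    # Recreate the database empty, then replay the plain-SQL dump into it.
    "$PSQL" -U postgres -c "DROP DATABASE IF EXISTS $DB;"
    "$PSQL" -U postgres -c "CREATE DATABASE $DB;"
    "$PSQL" -U postgres -d "$DB" -f "$DUMP"
else
    echo "dry run: would restore $DUMP into $DB"
fi
```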
The installation includes configuration files and customizations, which you may have to replace in the event of loss due to accident or upgrade. Here is a recommended procedure to support their restoration.
Identify all the configuration files, by path and name, in a single change-control file at minimum, or in an organized scheme such as CFEngine or CVS. Produce a backup of these changed files on a secure resource, along with the database backups, for future recovery. Perform a scheduled review of the accuracy of the saved file copies, making regular updates.
Here is an example cron job that uses the simplistic change-file approach (we recommend that you set up a more professional regime, which is beyond the scope of a normal project).
Replace yyy with the path and name of the file containing the change-file list. Make sure that the backup files are on a share that is not part of the machine being backed up. Consider taking copies off site regularly.
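A sketch of that cron job; the demo setup at the top stands in for yyy and for your real change list, and all paths are hypothetical:

```shell
# --- demo setup: replace with your real change-control list ("yyy") ---
mkdir -p /tmp/gw-demo
LIST=/tmp/gw-demo/changed-files.list
echo "sample tuned setting" > /tmp/gw-demo/example.conf
printf '%s\n' /tmp/gw-demo/example.conf > "$LIST"   # one path per line
# ----------------------------------------------------------------------

BACKUP_DIR=/tmp/gw-demo/backup        # should be a share off this machine
mkdir -p "$BACKUP_DIR"
TARBALL="$BACKUP_DIR/gw-config-$(date +%Y%m%d).tar.gz"

# Archive every file named in the list into one dated tarball.
tar -czf "$TARBALL" -T "$LIST"
```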
Recovery may consist of replacing a few files that were overwritten in an upgrade, or an entire installation rebuilt from scratch after server loss. In either case, the recommendation is to untar the latest backup into a temporary directory (such as /tmp) and work through the recovered files one by one, comparing each with its counterpart before overwriting.
RRD files have value for long-term analysis. These files are updated frequently by the monitoring process during normal operation, so a successful backup means capturing the changes without including partially written files. The decision to capture them should be driven by your requirements.
The active and passive checks controlled by GroundWork Monitor produce RRD files. To back them up successfully, we recommend integrating the backup with the file-based performance data processing (for Nagios). This requires at least halting the generating process (Nagios), then performing the backup, and then restarting the appropriate daemon or cron job.
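A sketch of that sequence, assuming the Bitrock ctlscript.sh control script and the RRD directory under /usr/local/groundwork; both are assumptions, and the script degrades to a dry run when they are absent:

```shell
# Back up RRD files with the generating process (Nagios) halted.
GW="${GW:-/usr/local/groundwork}"
RRD_DIR="${RRD_DIR:-$GW/rrd}"            # assumed RRD location
BACKUP_DIR="${BACKUP_DIR:-/tmp/gw-backups}"
mkdir -p "$BACKUP_DIR"

ctl() {
    if [ -x "$GW/ctlscript.sh" ]; then
        "$GW/ctlscript.sh" "$@"
    else
        echo "dry run: ctlscript.sh $*"
    fi
}

ctl stop nagios                           # halt the generating process
if [ -d "$RRD_DIR" ]; then
    # Capture the RRDs only while no process is writing them.
    tar -czf "$BACKUP_DIR/rrd-$(date +%Y%m%d).tar.gz" -C "$RRD_DIR" .
else
    echo "dry run: would archive $RRD_DIR"
fi
ctl start nagios                          # resume performance processing
```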
As with making the backup, care must be exercised to avoid collisions between the generating process and the restore: stop the daemons, perform the restore, and restart the daemons.