Assessing Groundwork Monitoring Server Performance

h2. Contents
{toc}
h2. Introduction
This document is intended to help you assess your GroundWork Monitoring Server's performance. As the server's workload approaches the upper end of its capacity, certain functions begin to show problematic behavior.

h2. Typical First-Noticed Symptoms
# One typical symptom of an overloaded server is gaps in RRD Performance Graphs. As check latencies increase, data points may start falling outside the allowed tolerances and be dropped from the graphs.
# Changes made in the Monarch 'Configuration' Tab take noticeably longer to propagate from a Monarch 'commit' to the GroundWork Status Viewer pages.
h2. Nagios Engine Program-Wide Performance Information (Check Latency)
The first place to look for indications that your system is overloaded is the Nagios Performance CGI. From the GroundWork Portal, click the 'Monitoring Server' option. This brings you to a Nagios display of the Monitoring Server and its services. In the secondary menu, pick 'Performance Information'.

The metrics most likely to show performance load on the server are the Active Service Check Latency figures (Min, Max, Avg) in the top-right table. These are in seconds: 180 seconds is three minutes, which isn't bad, but 1800 seconds is thirty minutes, which _is_ bad.

Check Latency is the time difference between when a service check was scheduled to run and when it actually ran.

h2. Nagiosstats
[Nagiosstats|http://nagios.sourceforge.net/docs/2_0/nagiostats.html] is a command-line version of the CGI shown above. It is simpler and has fewer dependencies. It also reports slightly different information, such as 'Total Services' and 'Total Hosts'.
{code}
[root@lunias ~]# $GW_HOME/nagios/bin/nagiostats
Nagios Stats 2.5
Copyright (c) 2003-2005 Ethan Galstad (www.nagios.org)
Last Modified: 07-13-2006
License: GPL

CURRENT STATUS DATA
----------------------------------------------------
Status File: /usr/local/groundwork/nagios/var/status.log
Status File Age: 0d 0h 0m 5s
Status File Version: 2.5

Program Running Time: 3d 3h 20m 51s

Total Services: 1417
Services Checked: 1289
Services Scheduled: 1417
Active Service Checks: 1417
Passive Service Checks: 0
Total Service State Change: 0.000 / 80.660 / 0.214 %
Active Service Latency: 213286.359 / 215290.335 / 214727.280 %
Active Service Execution Time: 0.000 / 52.552 / 10.106 sec
Active Service State Change: 0.000 / 80.660 / 0.214 %
Active Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 187 / 23 / 668 / 539
Services Flapping: 0
Services In Downtime: 0

Total Hosts: 260
Hosts Checked: 260
Hosts Scheduled: 0
Active Host Checks: 260
Passive Host Checks: 0
Total Host State Change: 0.000 / 8.160 / 0.031 %
Active Host Latency: 0.000 / 0.000 / 0.000 %
Active Host Execution Time: 0.029 / 17.250 / 9.365 sec
Active Host State Change: 0.000 / 8.160 / 0.031 %
Active Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 24 / 153 / 83
Hosts Flapping: 0
Hosts In Downtime: 0
{code}
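If you only want to track a few of these figures over time (for example, to trend check latency itself), nagiostats also has an MRTG mode that prints just the values you ask for. The flags and variable names below may differ slightly between Nagios versions, so confirm them against nagiostats --help on your system.
{code}
# Print only the minimum, maximum and average active service check latency,
# one value per line, suitable for feeding to a trending tool or a cron job.
# Variable names vary by Nagios version; see 'nagiostats --help'.
# Add -c /usr/local/groundwork/nagios/etc/nagios.cfg if nagiostats cannot
# find the main configuration file on its own.
$GW_HOME/nagios/bin/nagiostats --mrtg --data=MINACTSVCLAT,MAXACTSVCLAT,AVGACTSVCLAT
{code}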

h2. Linux System Utilities

h3. The Linux Utility, [top|http://unixhelp.ed.ac.uk/CGI/man-cgi?top]
Another assessment of performance is CPU load. This metric expresses how many processes are in a runnable state, averaged over 1-, 5- and 15-minute intervals. If this number is, say, 4, then on average there were 4 runnable processes over the interval. For example, [top|http://unixhelp.ed.ac.uk/CGI/man-cgi?top] output showing load averages of 0.58, 0.63, 0.64 indicates that, on average, fewer than one process was runnable at any given time.
[Top|http://unixhelp.ed.ac.uk/CGI/man-cgi?top]'s main strength is showing you which processes are in which states and which are consuming the most resources. A typical scenario is that the Java listener, which accepts data feeds and posts them to the Collage database, and the MySQL daemon are the most active processes.
[Top|http://unixhelp.ed.ac.uk/CGI/man-cgi?top] also gives you Memory and Swap metrics, which can help pinpoint why a system is slowing. Load numbers above 3 (meaning that, on average, more than three processes were runnable, with only one running and the others waiting to run) indicate a slowing system.
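If you want to capture what top sees for later review, or from a script, its batch mode is handy. A minimal example follows; the exact column layout varies a little between top versions.
{code}
# One-shot snapshot in batch mode: header with load averages, memory and
# swap usage, followed by the process list sorted by CPU usage.
top -b -n 1 | head -20
{code}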

h3. The Linux Utility, [uptime|http://unixhelp.ed.ac.uk/CGI/man-cgi?uptime]
[Uptime|http://unixhelp.ed.ac.uk/CGI/man-cgi?uptime] is a smaller-footprint tool that you can use to get the same load averages that [top|http://unixhelp.ed.ac.uk/CGI/man-cgi?top] shows.
{code}# uptime
15:05:52 up 14 days, 22:55, 4 users, load average: 0.46, 0.52, 0.52
{code}
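Because uptime is so cheap to run, it is easy to wrap in a small script for a quick load check. The following is only a minimal sketch, not a GroundWork plugin; the field parsing assumes the 'load average:' output format shown above, and the threshold of 3 is just an example.
{code}
#!/bin/sh
# Extract the 5-minute load average from uptime and warn above an example
# threshold of 3. Parsing assumes the Linux 'load average: x, y, z' format.
LOAD5=$(uptime | awk -F'load average: ' '{ split($2, a, ", "); print a[2] }')
echo "5-minute load average: $LOAD5"
awk -v l="$LOAD5" 'BEGIN { if (l + 0 > 3) print "WARNING: sustained load above 3" }'
{code}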

h3. The Linux Utility, [vmstat|http://unixhelp.ed.ac.uk/CGI/man-cgi?vmstat]
[Vmstat|http://unixhelp.ed.ac.uk/CGI/man-cgi?vmstat] has many options, and each one can produce quite different types of metrics. One of the most useful is [vmstat|http://unixhelp.ed.ac.uk/CGI/man-cgi?vmstat] \-s.
{code}[root@lunias ~]# vmstat -s
1034160 total memory
1016344 used memory
639180 active memory
193732 inactive memory
17816 free memory
168028 buffer memory
298856 swap cache
2031608 total swap
118044 used swap
1913564 free swap
32897831 non-nice user cpu ticks
2581 nice user cpu ticks
9633122 system cpu ticks
79034644 idle cpu ticks
8014841 IO-wait cpu ticks
122537 IRQ cpu ticks
0 softirq cpu ticks
113646240 pages paged in
972813361 pages paged out
15889354 pages swapped in
184057520 pages swapped out
1347364576 interrupts
959223271 CPU context switches
1161043831 boot time
7213271 forks
{code}
Many useful metrics are here: free memory, free swap and pages paged/swapped in/out. These metrics tell you whether you have enough memory.
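The counters above are cumulative since boot, so a short interval sample is often more telling when the system is slow right now. A simple example:
{code}
# Sample every 5 seconds, 6 times. Watch the 'si'/'so' (swap in/out) and
# 'wa' (I/O wait) columns; sustained non-zero values point to memory or
# disk pressure.
vmstat 5 6
{code}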

h3. Linux utility: [sar|http://perso.orange.fr/sebastien.godard/man_sar.html]
[Sar|http://perso.orange.fr/sebastien.godard/man_sar.html] is the system activity reporter. By interpreting the reports that [sar|http://perso.orange.fr/sebastien.godard/man_sar.html] produces, you can locate system bottlenecks and identify possible solutions to performance problems. The Linux kernel maintains internal counters that keep track of requests, completion times, I/O block counts, etc. From this and other information, [sar|http://perso.orange.fr/sebastien.godard/man_sar.html] calculates rates and ratios that give insight into where the bottlenecks are.
The key to understanding [sar|http://perso.orange.fr/sebastien.godard/man_sar.html] is that it reports on system activity over a period of time. You must take care to collect [sar|http://perso.orange.fr/sebastien.godard/man_sar.html] data at an appropriate time (not at lunch time or on weekends, for example).
A good [sar|http://perso.orange.fr/sebastien.godard/man_sar.html] tutorial can be found at [http://perso.orange.fr/sebastien.godard/tutorial.html].
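A few sar invocations that are particularly useful on a monitoring server are shown below. The option set differs slightly between sysstat versions, so check the man page if one of these is rejected.
{code}
# CPU utilization, sampled every 5 seconds, 3 samples
sar -u 5 3
# Run queue length and load averages
sar -q 5 3
# Memory and swap utilization
sar -r 5 3
# Swapping activity (pages swapped in/out per second)
sar -W 5 3
{code}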

h4. Other [sar|http://perso.orange.fr/sebastien.godard/man_sar.html] Utilities
* [iostat|http://perso.orange.fr/sebastien.godard/man_iostat.html]
* [sadc|http://perso.orange.fr/sebastien.godard/man_sadc.html]
* [sa1|http://perso.orange.fr/sebastien.godard/man_sa1.html]
* [sadf|http://perso.orange.fr/sebastien.godard/man_sadf.html]
* [mpstat|http://perso.orange.fr/sebastien.godard/man_mpstat.html]

h4. Installing & Configuring [sar|http://perso.orange.fr/sebastien.godard/man_sar.html] (Sysstat)
* [http://www.linuxfromscratch.org/blfs/view/svn/general/sysstat.html]
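Once sysstat is installed, historical data collection is normally driven from cron. The entries below are a typical example only; the sa1/sa2 paths vary by distribution and by how sysstat was built, so adjust them to your installation.
{code}
# Example cron entries for historical collection (paths are illustrative):
# take a sample every 10 minutes
*/10 * * * * root /usr/lib/sa/sa1 1 1
# write the daily summary report shortly before midnight
53 23 * * * root /usr/lib/sa/sa2 -A
{code}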

h2. RRD Performance Data Collection
GroundWork can collect performance data using an eventhandler that runs after each check finishes, reads a database table, and logs the performance data. Performance data collection itself can contribute to a slow system.
The simplest way to gauge this is to count the RRD files in /usr/local/groundwork/rrd that have been modified that day.
{code}
# cd /usr/local/groundwork/rrd
# ls -chot *.rrd | egrep 'Oct 27' | wc -l
93
{code}
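A variation that does not depend on matching a literal date string is to ask find for files modified within the last 24 hours:
{code}
# Count RRD files updated within the last day
find /usr/local/groundwork/rrd -name '*.rrd' -mtime -1 | wc -l
{code}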

h2. GroundWork Subsystems
There are various log files and other audit aids in GroundWork Monitor that can give you hints about how well the server is performing.

h3. Process Performance Data Event Handler Log
For the purposes of this discussion, we focus on the eventhandler logfile, /usr/local/groundwork/nagios/var/log/process_service_perfdata_file.log.

Output to this logfile is controlled by the debug_level variable at the beginning of the perfdata properties file:
{code:title=/usr/local/groundwork/config/perfdata.properties}
# Possible debug_level values:
# 0 = no info of any kind printed, except for startup/shutdown
# messages and major errors
# 1 = print just error info and summary statistical data
# 2 = also print basic debug info
# 3 = print detailed debug info
debug_level = 1
{code}
This debug_level variable only needs to be set above 1 when you are actively debugging something in the performance data collection subsystem. Otherwise, it should be set to 0 or 1.
The contents of the log can be illuminating. This is a special-purpose eventhandler that runs after each service check that has performance data handling enabled (the default). Many checks, though, don't produce a metric, such as any check that simply returns true or false. To tune the system, you should disable performance data collection for services whose results aren't suited to graphing.
Also, just because you could plot the data on a graph doesn't mean that you should. Conservation of effort here will pay off. If no one is asking for the data, and you aren't particularly interested either, then turn it off.
To turn off performance data handling on a particular host or service, go into the Service record in the Monarch 'Configuration' Tab and, in the 'Service Detail' Tab, uncheck the 'Process Perf Data' property. In the 'Host Detail' Tab it is called the 'Process performance data' property.
Executing the Process-Perf-Data eventhandler only for services that are actually set up for performance data collection, and turning off logging when you are not actively tuning the system, will save computing resources.
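One rough way to find candidates for turning this off is to look at the Nagios status file: services whose last check produced no performance data gain nothing from the eventhandler. This is only a sketch; the exact field layout of status.log depends on your Nagios version.
{code}
# Checks whose last run reported some performance data
grep -c 'performance_data=.' /usr/local/groundwork/nagios/var/status.log
# Checks whose last run reported no performance data at all; these are
# candidates for unchecking 'Process Perf Data' in Monarch.
grep -c 'performance_data=$' /usr/local/groundwork/nagios/var/status.log
{code}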

h2. GroundWork Diagnostics Tool
GroundWork Customer Support has created a tool that methodically collects the above information and more. You can run the GroundWork Diagnostics Tool and send the output to Support, or use it in your own analysis. See the KB article [Running the gwdiags diagnostic tool] for more information.


h2. Conclusions
Having assessed your system and tuned it as well as possible, you might find that you still have a performance issue. Contact [Support|https://cases.groundworkopensource.com] or your GroundWork Account Representative to explore other options.