Design

Consistent data recording

After more than 25 years of commercial computing, we still lack consistent performance monitoring across operating systems: each system deploys its own type of monitoring and data collection. UNIX systems stay fairly close to one another, since they are all POSIX systems and follow similar industry standards, such as those of The Open Group. Other systems, like Windows, use different data collection techniques.

It is very difficult to record data consistently across many operating systems without purchasing additional software or installing third-party software. Moreover, the format of the recorded data varies from system to system, which makes collection and analysis difficult.

If we step back and look at how other industries handle this, we see a completely different picture, with efforts being made towards standardization and common ways to record and offer data for analysis:

Aerospace industry
Airplanes use data recorders, usually a device called a flight data recorder (FDR), built to store aircraft data parameters. Such a unit is found by default on many airplanes nowadays, and its usage is regulated by governments and federal administrations, for example the FAA in the United States. This device is sometimes referred to as the black box.
Shipbuilding industry
Ships, boats and other types of vessels use a recorder called a voyage data recorder (VDR) to store vessel data parameters. As in the aerospace industry, such devices are required when a vessel must comply with international standards such as the International Convention for the Safety of Life at Sea (SOLAS). Used mainly for accident investigation, the VDR can also serve preventive maintenance, performance efficiency monitoring, heavy weather damage analysis and accident avoidance. This device is sometimes referred to as the black box.
Auto industry
Automobiles use a device called an event data recorder (EDR) to store vehicle parameters. EDRs are not enforced by any standards organization and are not really required by law, so their usage varies from vendor to vendor. The National Highway Traffic Safety Administration (NHTSA) proposed a series of changes to standardize EDRs and make their installation and usage mandatory for vendors. By around 2010, over 85% of all vehicles in the US were estimated to have some sort of EDR installed.
IT industry
Computer systems have no such data recording device installed. Manufacturers are not interested in standardizing this effort, since they prefer selling additional software packages that perform such recording for an extra cost. The lack of standardization and agreement between vendors makes handling performance data complex and difficult. Currently, there are hundreds of performance monitoring solutions.

Looking at each operating system, we see a similar way to fetch and extract performance data through different interfaces, named differently from vendor to vendor and from implementation to implementation: Sun Solaris KSTAT, Linux /proc, HP-UX KSTAT, IBM AIX RSTAT, Microsoft Windows WMI. So what if we could have several standard data recorders, or agents, that fetch metrics from each system interface and export this data the same way, regardless of the implementation? To make things even simpler, we could use a very simple format for the exported data, for example a flat text file, which any reporting system can use for later analysis and visualization.

Similar to an FDR device, we could develop a simple data recording module that can be used for system troubleshooting, performance analysis and system crash analysis, and that can be enabled across a large number of hosts in a data center, regardless of the operating system used.

Data Recording

Raw Data

We call all recorded observations raw data. Raw data is produced by a monitoring agent running on each host we plan to record data from. This data set is not modified or altered in any way: it is exactly what we collected from the computer system. Its format is simple, as already mentioned, with the collected parameters separated by a character such as , or :. Each recorder writes and stores all collected parameters in such a raw data file for the entire duration of its execution.

By default, the SDR raw data file uses the extension sdrd, short for system data recorder datafile. Each sdrd file has the following format:

Raw Data File
timestamp: parameter: parameter: parameter: parameter: parameter: ... parameter
timestamp: parameter: parameter: parameter: parameter: parameter: ... parameter
...
timestamp: parameter: parameter: parameter: parameter: parameter: ... parameter
timestamp: parameter: parameter: parameter: parameter: parameter: ... parameter
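
As a minimal sketch of how a reporting tool might read such a file, the Python snippet below splits one record into its timestamp and parameters. It assumes a : separator, a numeric timestamp (for example Unix epoch seconds) and numeric parameters; the format description above does not confirm any of these, and the function name parse_sdrd_line and the sample values are hypothetical.

```python
from typing import List, Tuple


def parse_sdrd_line(line: str, sep: str = ":") -> Tuple[float, List[float]]:
    """Split one raw data record into (timestamp, parameters).

    Assumption: the timestamp and every parameter are numeric.
    """
    fields = [field.strip() for field in line.strip().split(sep)]
    if len(fields) < 2:
        raise ValueError("expected a timestamp and at least one parameter")
    timestamp = float(fields[0])
    parameters = [float(field) for field in fields[1:]]
    return timestamp, parameters


# Hypothetical record: a timestamp followed by three parameters.
ts, params = parse_sdrd_line("1700000000: 12.5: 3.2: 84.3")
print(ts, params)
```

Because every record carries its own timestamp, a reader of this kind can process a file line by line without keeping state between records.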

Time Series

All collected metrics are variables measured sequentially in time, called time series. These observations, collected over fixed sampling intervals, create a historical time series. To ease access to this data set, SDR simply records and stores the observations on commodity disk drives, compressed, in text format. These are the sdrd data files described above.

Time series let us understand what has happened in the past and look into the future, using various statistical models. In addition, having access to these historical time series helps us build a simple capacity planning model for our application or site.
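
To illustrate what a very simple capacity planning model could look like, the sketch below fits a straight line to a time series by ordinary least squares and estimates when the trend reaches a given limit. This is only one possible statistical model, chosen here for illustration; the function names linear_trend and time_to_reach and the sample disk usage figures are hypothetical and not part of SDR.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of values = slope * time + intercept."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples of equal length")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope, intercept


def time_to_reach(limit: float, slope: float, intercept: float) -> Optional[float]:
    """Time at which the fitted line reaches limit, or None if it is not growing."""
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical daily disk utilization samples, in percent.
days = [0, 1, 2, 3, 4]
used = [50.0, 52.0, 54.0, 56.0, 58.0]
slope, intercept = linear_trend(days, used)
print(slope, intercept, time_to_reach(90.0, slope, intercept))
```

A linear fit is only reasonable when growth is roughly steady; seasonal or bursty workloads would call for a different model.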

Agents

The recording process consists of a number of running agents: light probes, developed in a language like Perl5, Java or C, which talk directly to operating system interfaces and extract the parameters we are interested in. For example, on Linux-based systems we extract various metrics directly from the /proc interface. On Solaris systems we interact with the KSTAT interface to collect all needed parameters.
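
The idea of a light probe reading /proc can be sketched as follows. The snippet parses the aggregate cpu line of Linux /proc/stat, whose first four counters are user, nice, system and idle time, and formats them as one raw data record. It is written in Python for illustration rather than in the Perl5 used by the recorders, and it is not the actual SDR implementation: the function names, the choice of counters and the sample text are assumptions.

```python
from typing import Dict

# First four counters on the aggregate "cpu" line of Linux /proc/stat.
CPU_FIELDS = ("user", "nice", "system", "idle")


def parse_cpu_totals(proc_stat_text: str) -> Dict[str, int]:
    """Return the first four counters of the aggregate cpu line."""
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(CPU_FIELDS, (int(p) for p in parts[1:5])))
    raise ValueError("no aggregate cpu line found")


def format_record(timestamp: int, counters: Dict[str, int], sep: str = ":") -> str:
    """Join the timestamp and counters into one flat text record."""
    return sep.join([str(timestamp)] + [str(counters[name]) for name in CPU_FIELDS])


# Hypothetical sample of /proc/stat content; on a Linux host this text
# could come from open("/proc/stat").read() instead.
SAMPLE = (
    "cpu  4705 150 1120 16250 520 0 17 0 0 0\n"
    "cpu0 2300 75 560 8125 260 0 9 0 0 0\n"
)
print(format_record(1700000000, parse_cpu_totals(SAMPLE)))
```

Keeping the parsing step separate from the formatting step means that a probe for another interface, such as KSTAT, would only need to replace the parser.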

Monitoring each host as closely as possible means more accurate and complete data. SDR is an agent-based monitoring system that runs continuously on each host. If needed, SDR recorders can be used as an agentless system as well.


SDR

There are five main recorders: sysrec, cpurec, nicrec, diskrec and hdwrec. Each recorder runs as a separate Perl5 process without any relation to the others. Because the recorders are autonomous, operating them is very flexible. In addition, there are other recorders that collect other types of system or application data, such as netrec, jvmrec and webrec. See below for a complete list of all available recorders, or check our documentation.

Standard Agents

sysrec
overall system CPU, MEM, DISK, NIC utilization
cpurec
per-CPU statistics
nicrec
per-NIC statistics
diskrec
per-DISK statistics
hdwrec
hardware, software inventory

Specialized Agents

dbrec
Oracle, MariaDB, MySQL, PostgreSQL statistics
corerec
SPARC CMT T1, T2 processor statistics
netrec
UDP, TCP, IP statistics
jvmrec
Java Virtual Machine Garbage Collector statistics
procrec
per-process statistics
vmrec
VMware, Xen, KVM statistics
webrec
HTTP response time statistics
zonerec
Solaris zone statistics

Data Transport

All observations are recorded for a number of days on each computer system. However, we would like to send this data to a reporting backend where we can analyze and visualize it. There are currently two ways to transport sdrd raw data for analysis: instant mode and batch mode.

The first mode of transporting raw data to a reporting backend is instant mode. In this mode, the output of each data recorder is scanned by a special utility, sender, which is responsible for detecting changes and sending them over an SSH2 channel for analysis. Sender periodically scans all sdrd raw data files listed in an XML configuration and sends the changes securely to the reporting backend. This way we ensure that recorded data arrives for analysis as soon as it is produced.
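
The change detection step of instant mode can be sketched as below: the scanner remembers how far each file has been read and returns only the complete lines appended since the previous scan. This is a hedged illustration, not the sender utility itself. The class name ChangeScanner and the file name sysrec.sdrd are hypothetical, the SSH2 transport and XML configuration are left out, and file truncation or rotation is not handled.

```python
import os
import tempfile
from typing import Dict, List


class ChangeScanner:
    """Remember how far each file has been read and return only new lines."""

    def __init__(self) -> None:
        self._offsets: Dict[str, int] = {}

    def new_lines(self, path: str) -> List[str]:
        offset = self._offsets.get(path, 0)
        with open(path, "rb") as handle:
            handle.seek(offset)
            chunk = handle.read()
        # Consume complete lines only; a partially written trailing line
        # is left in place and picked up by the next scan.
        end = chunk.rfind(b"\n") + 1
        self._offsets[path] = offset + end
        return chunk[:end].decode("utf-8").splitlines()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sysrec.sdrd")  # hypothetical file name
        scanner = ChangeScanner()
        with open(path, "w") as out:
            out.write("1700000000:1.0:2.0\n")
        print(scanner.new_lines(path))  # first scan returns the first record
        with open(path, "a") as out:
            out.write("1700000060:1.5:2.5\n")
        print(scanner.new_lines(path))  # second scan returns only the new record
```

Tracking a byte offset per file keeps each scan proportional to the amount of new data rather than to the size of the whole file.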

The other mode, batch mode, is for cases where we want to see changes less often, for example every 24 hours. Here the sdrd raw data is transported to the reporting backend once every 24 hours using the raw2day utility. This utility simply transports all sdrd data recorded for one day using SSH2 or FTP.

Data Analysis

The reporting module is responsible for this part, but we will briefly describe how the recording part packages all collected data and prepares it for the reporting module.