TSR – The Server Room – Shownotes – Episode 41 – Infrastructure Monitoring Software

Infrastructure Monitoring Software: Nagios, Zabbix, OpenNMS

IT monitoring is the process of gathering metrics about the operations of an IT environment’s hardware and software to ensure everything functions as expected to support applications and services.

Basic monitoring is performed through device operation checks, while more advanced monitoring gives granular views on operational statuses, including average response times, number of application instances, error and request rates, CPU usage and application availability.


  • Hardware – Physical Health
  • Operating System – Utilization and depletion
  • Network – Bandwidth consumption and errors
  • Application – Performance and availability

IT monitoring covers three sections, called the foundation, software and interpretation.

Foundation. The infrastructure is the lowest layer of a software stack and includes physical or virtual devices, such as servers, CPUs and VMs.

Software. This part is sometimes referred to as the monitoring section and it analyzes what is working on the devices in the foundation, including CPU usage, load, memory and a running VM count.

Interpretation. Gathered metrics are presented through graphs or data charts, often on a GUI dashboard. This is often accomplished through integration with tools that specifically focus on data visualization.

IT monitoring can rely on agents or be agentless. Agents are independent programs that are installed on the monitored device to collect hardware or software performance data and report it to a management server. Agentless monitoring uses existing communication protocols to emulate an agent, with many of the same functionalities.

For example, to monitor server usage, an IT admin installs an agent on the server. A management server receives that data from the agent and displays it to the user via the IT monitoring system interface, often as a graph of performance over time. If the server stops working as intended, the tool alerts the administrator, who can repair, update or replace the item until it meets the standard for operation.
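
As a rough illustration of the agent side of that flow, here is a minimal Python sketch. It only collects a couple of metrics and serializes them; the field names and the JSON report format are assumptions made for this example, not the format of any particular monitoring product.

```python
import json
import os
import shutil
import socket
import time


def collect_metrics(path="/"):
    """Gather a few host metrics, the way a monitoring agent might."""
    disk = shutil.disk_usage(path)
    used_percent = 100 * disk.used / disk.total if disk.total else 0.0
    metrics = {
        "host": socket.gethostname(),
        "timestamp": int(time.time()),
        "disk_used_percent": round(used_percent, 1),
    }
    # os.getloadavg() only exists on Unix-like systems, so it is optional here.
    if hasattr(os, "getloadavg"):
        metrics["load_1min"] = os.getloadavg()[0]
    return metrics


def build_report(metrics):
    """Serialize the metrics as JSON, ready to be sent to a management server."""
    return json.dumps(metrics, sort_keys=True)


report = build_report(collect_metrics())
print(report)
```

A real agent would run this on a schedule and send the report over the network; the transport is left out to keep the sketch self-contained.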

Real-time vs. trends monitoring

Real-time monitoring is a technique whereby IT teams use systems to continuously collect and access data to determine the active and ongoing status of an IT environment. Measurements from real-time monitoring software depict data from the current IT environment, as well as the recent past, which enables IT managers to react quickly to current events in the IT ecosystem.

Historical monitoring data enables the IT manager to improve the environment or identify potential complications before they occur, because it reveals patterns or trends in data from a period of operation. Trend analysis takes a long-term view of an IT ecosystem to assess system uptime and service-level agreement adherence and to support capacity planning.
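
One simple form of trend analysis is checking historical availability against a service-level target. The sketch below assumes the history is just a list of pass/fail check results; the 99.9% target and the sample data are made up for illustration.

```python
def availability_percent(check_results):
    """Return the share of successful checks as a percentage."""
    if not check_results:
        raise ValueError("no check results to evaluate")
    up = sum(1 for ok in check_results if ok)
    return 100.0 * up / len(check_results)


def meets_sla(check_results, target_percent=99.9):
    """True when the measured availability reaches the target."""
    return availability_percent(check_results) >= target_percent


# Hypothetical history: one check per minute for a day, with 3 failed checks.
history = [True] * 1437 + [False] * 3
print(round(availability_percent(history), 3))
print(meets_sla(history))
```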

Two extensions of real-time monitoring are reactive monitoring and proactive monitoring. The key difference is that reactive monitoring is triggered by an event or problem, while proactive monitoring seeks to uncover abnormalities without relying on a trigger event. The proactive approach can enable an IT staff to take action to address an issue, such as a memory leak that could crash an application or server, before it becomes a problem.
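
To make the memory leak example more concrete, here is a small sketch of a proactive check: it fits a straight line to recent memory samples and estimates how long until a limit is reached. The sample values and the 1024 MB limit are invented, and a least-squares line is only one of many possible ways to spot such a trend.

```python
def linear_slope(samples):
    """Least-squares slope of evenly spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def intervals_until_limit(samples, limit):
    """Estimate how many intervals remain until the trend reaches the limit.

    Returns None when usage is flat or falling, so no exhaustion is predicted.
    """
    slope = linear_slope(samples)
    if slope <= 0:
        return None
    return max(0.0, (limit - samples[-1]) / slope)


# Hypothetical memory readings in MB, one per polling interval.
memory_mb = [512, 530, 551, 569, 590, 611]
print(intervals_until_limit(memory_mb, limit=1024))
```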

Point-in-time vs. time-series monitoring: Point-in-time analysis examines one specific event at a particular instant. It can be used to identify a problem that must be fixed immediately, such as a 100% full disk drive. Time-series analysis plots metrics over time to account for seasonal or cyclical events and more accurately recognize abnormal behavior. Point-in-time analysis relies on fixed thresholds, while time-series analysis employs variable thresholds to paint a broader picture and better detect and even predict anomalies.
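
The difference between fixed and variable thresholds can be sketched in a few lines of Python. The variable threshold here is a simple "mean plus or minus k standard deviations" rule over recent history, which is an assumption chosen for brevity; it does not model seasonal patterns the way a full time-series tool would.

```python
from statistics import mean, stdev


def fixed_threshold_alert(value, limit):
    """Point-in-time check: alert when a single reading reaches a fixed limit."""
    return value >= limit


def variable_threshold_alert(history, value, k=3.0):
    """Time-series check: alert when a reading deviates from the recent
    baseline by more than k standard deviations."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = stdev(history)
    return abs(value - baseline) > k * spread


# Hypothetical CPU usage history in percent.
cpu_history = [40, 42, 41, 39, 43, 40, 41]
print(fixed_threshold_alert(75, limit=90))
print(variable_threshold_alert(cpu_history, 75))
```

In this made-up case a reading of 75% stays under the fixed 90% limit but is far outside the usual range, so only the variable threshold flags it.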

IT infrastructure monitoring

IT infrastructure monitoring is a foundation-level process that collects and reviews metrics concerning the IT environment’s hardware and low-level software. Infrastructure monitoring provides a benchmark for ideal physical systems operation, which makes it easier to fine-tune systems and reduce downtime, and enables IT teams to detect problems such as an overheated server.

Server monitoring and system monitoring tools review and analyze metrics, such as server uptime, operations, performance and security.

As more organizations embrace cloud computing, cloud monitoring capabilities and options have expanded as well. Cloud customers can get visibility into certain metrics, such as CPU, memory and storage usage, to gauge how well their applications perform, but the nature of cloud infrastructure limits the view into the physical assets on which cloud workloads run.

Network monitoring seeks out issues caused by slow or failing network components or security breaches. Metrics include response time, uptime, status request failures and HTTP/HTTPS/SMTP checks.
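
A very basic uptime and response-time check can be done by timing a TCP connection. This is a minimal sketch using only the Python standard library; the function name and the 3 second timeout are choices made for this example, and a real HTTP or SMTP check would also speak the protocol rather than only open the port.

```python
import socket
import time


def tcp_check(host, port, timeout=3.0):
    """Try to open a TCP connection; return (is_up, response_time_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False, time.monotonic() - start
```

For example, `tcp_check("mail.example.org", 25)` would report whether the (hypothetical) mail server accepts connections on the SMTP port and how long the connection took.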

Security monitoring focuses on the detection and prevention of intrusions, typically at the network level. This includes monitoring for vulnerabilities, logging network access and identifying traffic patterns in real time to look for potential breaches.

Application performance monitoring

Application performance monitoring (APM) gathers software performance metrics based on both end user experience and computational resource consumption. Examples of APM-provided metrics include average response time under peak load, performance bottleneck data and load and response times.

Cloud providers largely support APM capabilities with their own native tools. Cloud customers can also choose from many third-party APM tools to see metrics on resource availability, response times and security.

Application monitoring is within the scope of application performance management, a concept that involves more broadly controlling an application’s performance levels.

IT monitoring tool options

Some APM vendors also offer IT infrastructure monitoring capabilities, and vice versa. Other tools are designed specifically to watch over the network or CPU performance and so on. Some monitoring tools incorporate AI capabilities.

The following lists show just some examples of various monitoring tool types. These lists are not comprehensive, however, and many tools incorporate capabilities typically seen in other segments, such as AI or the ability to track cloud and on-premises infrastructure.

APM tools. BMC TrueSight, Cisco AppDynamics, Datadog, Dynatrace, ManageEngine Applications Manager, Microsoft Azure Application Insights, New Relic and SolarWinds APM.

IT infrastructure tools. LogicMonitor, ManageEngine OpManager, Microsoft System Center Operations Manager (SCOM), Nagios XI, SolarWinds, VMware vRealize Operations and Zabbix.

Cloud monitoring tools. Amazon CloudWatch, Google Stackdriver (now folded into Google Cloud Console), Microsoft Azure Monitor, Cisco CloudCenter and Oracle Application Performance Monitoring Cloud Service.

Containers/microservices/distributed app monitoring tools. Confluent Kafka, Jaeger, LightStep and Prometheus.

AIops tools. BigPanda, Datadog, Dynatrace, Moogsoft and New Relic.

Log monitoring tools. Elastic Stack, Fluentd, Splunk and Sumo Logic.

Network security monitoring tools. Cisco DNA Analytics and Assurance, LiveAction LiveNX, LogRhythm and PRTG Network Monitor.

Nagios

Nagios, now known as Nagios Core, is a free and open-source computer-software application that monitors systems, networks and infrastructure. Nagios offers monitoring and alerting services for servers, switches, applications and services. It alerts users when things go wrong and alerts them a second time when the problem has been resolved.

Nagios was originally designed to run under Linux, but it also runs well on other Unix variants. It is free software licensed under the terms of the GNU General Public License version 2 as published by the Free Software Foundation.

Currently it provides:

  • Monitoring of network services (SMTP, POP3, HTTP, NNTP, ICMP, SNMP, FTP, SSH)
  • Monitoring of host resources (processor load, disk usage, system logs) on a majority of network operating systems, including Microsoft Windows, using monitoring agents.
  • Monitoring of any hardware (like probes for temperature, alarms, etc.) that can send collected data via a network to specifically written plugins
  • Monitoring via remotely run scripts via Nagios Remote Plugin Executor
  • Remote monitoring supported through SSH or SSL encrypted tunnels.
  • A simple plugin design that allows users to easily develop their own service checks depending on needs, by using their tools of choice (shell scripts, C++, Perl, Ruby, Python, PHP, C#, etc.)
  • Available data graphing plugins
  • Parallelized service checks
  • Flat-text formatted configuration files (integrates with many config editors)
  • The ability to define a network host hierarchy using ‘parent’ hosts, allowing the detection of and distinction between hosts that are down and hosts that are unreachable
  • Contact notifications when service or host problems occur and get resolved (via e-mail, pager, SMS, or any user-defined method through plugin system)
  • The ability to define event handlers to be run during service or host events for proactive problem resolution
  • Automatic log file rotation
  • Support for implementing redundant monitoring hosts
  • Support for implementing performance data graphing
  • Support for database backend (such as NDOUtils)
  • Push notifications
  • A web-interface for viewing current network status, notifications, problem history, log files, etc.
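
The simple plugin design mentioned in the list above boils down to a convention: a plugin prints one status line (optionally followed by `|` and performance data) and exits with 0 for OK, 1 for WARNING, 2 for CRITICAL or 3 for UNKNOWN. Below is a minimal sketch of a disk check in Python that follows that convention; the thresholds and the wording of the status line are assumptions for this example.

```python
import shutil

# Exit codes that Nagios interprets as service states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to an (exit_code, status_line) pair."""
    # Performance data after the pipe: label=value[unit];warn;crit
    perfdata = f"used={used_percent:.1f}%;{warn:g};{crit:g}"
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used | {perfdata}"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used | {perfdata}"
    return OK, f"DISK OK - {used_percent:.1f}% used | {perfdata}"


def main(path="/"):
    """Check one filesystem, print the status line and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
        used_percent = 100 * usage.used / usage.total
    except (OSError, ZeroDivisionError) as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, line = evaluate_disk(used_percent)
    print(line)
    return code


# When saved as a plugin script, the last line would be:
#     raise SystemExit(main())
```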

Installing Nagios Core on Ubuntu Server 16.04.1 for server monitoring (YouTube video, original title: Instalación de Nagios Core en Ubuntu server 16.04.1 para monitorización de servidores)

Nagios agents

NRPE

Nagios Remote Plugin Executor (NRPE) is a Nagios agent that allows remote system monitoring using scripts that are hosted on the remote systems. It allows for monitoring of resources such as disk usage, system load or the number of users currently logged in. Nagios periodically polls the agent on the remote system using the check_nrpe plugin.

NRPE allows you to remotely execute Nagios plugins on other Linux/Unix machines. This allows you to monitor remote machine metrics (disk usage, CPU load, etc.). NRPE can also communicate with some of the Windows agent add-ons, so you can execute scripts and check metrics on remote Windows machines, as well.

NRDP

Nagios Remote Data Processor (NRDP) is a Nagios agent with a flexible data transport mechanism and processor. It is designed with an architecture that allows it to be easily extended and customized. NRDP uses standard ports and protocols (HTTP and XML) and can be implemented as a replacement for Nagios Service Check Acceptor (NSCA).
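
To give an idea of what a passive check result sent over HTTP as XML might look like, here is a sketch that builds such a payload with the Python standard library. The element and attribute names follow my recollection of the NRDP check-result format and should be treated as assumptions to verify against the NRDP documentation; the host and service names are made up.

```python
import xml.etree.ElementTree as ET


def build_check_result(hostname, servicename, state, output):
    """Build an XML document describing one passive service check result."""
    root = ET.Element("checkresults")
    # checktype "1" is assumed here to mark the result as passive.
    result = ET.SubElement(root, "checkresult", type="service", checktype="1")
    ET.SubElement(result, "hostname").text = hostname
    ET.SubElement(result, "servicename").text = servicename
    ET.SubElement(result, "state").text = str(state)  # 0 = OK, 2 = CRITICAL
    ET.SubElement(result, "output").text = output
    return ET.tostring(root, encoding="unicode")


print(build_check_result("web01", "Disk Usage", 0, "DISK OK - 42.0% used"))
```

The resulting XML would then be posted to the NRDP endpoint together with an authentication token; that step is omitted here.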

NSClient++

NSClient++ is mainly used to monitor Windows machines. Once installed on a remote system, NSClient++ listens on TCP port 12489. The Nagios plugin that is used to collect information from this add-on is called check_nt. Like NRPE, NSClient++ makes it possible to monitor the so-called ‘private services’ (memory usage, CPU load, disk usage, running processes, etc.). Nagios is a host and service monitor which is designed to inform you of network problems.

NCPA

The Nagios Cross Platform Agent (NCPA) is an open source project maintained by Nagios Enterprises. NCPA installs on Windows, Linux and Mac OS X. It was created as a scalable API that allows flexibility and simplicity in monitoring hosts. NCPA allows multiple checks such as memory usage, CPU usage, disk usage, processes, services and network usage. Active checks are queried through the API of the “NCPA Listener” service, while passive checks are sent via the “NCPA Passive” service.

Nagios XI

Nagios XI is an extended interface, configuration manager and toolkit that uses Nagios Core as the back-end. It is written and maintained by the original author, Ethan Galstad, and Nagios Enterprises. It is an enterprise-class application that monitors systems, networks and infrastructure. It offers an extensive user interface, configuration editor, advanced reporting, monitoring wizards and an extensible front-end and back-end, along with many other additions over Nagios Core. CentOS and RHEL are the currently supported operating systems.

Nagios XI combines Nagios Core with other technologies. Its main database and the ndoutils module that is used alongside Nagios Core use MySQL. Prior to XI 5, PostgreSQL was used for one of its three databases; it is no longer used on new installs of Nagios XI. While the front-end of Nagios Core is mainly CGI with some PHP, most of the Nagios XI front-end and back-end is written in PHP, including the subsystem, event handlers and notifications. Python is used to create capacity planning reports and other reports. RRDtool and Highcharts are included to create customizable graphs that can be displayed in dashboards.

Nagios Version Upgrade Tests and Customization

Zabbix

Zabbix is an open-source monitoring software tool for diverse IT components, including networks, servers, virtual machines (VMs) and cloud services. Zabbix provides monitoring metrics such as network utilization, CPU load and disk space consumption. Monitoring configuration can be done using XML-based templates, which contain the elements to monitor. The software monitors operations on Linux, Hewlett Packard Unix (HP-UX), Mac OS X, Solaris and other operating systems (OSes); however, Windows monitoring is only possible through agents. Zabbix can use MySQL, MariaDB, PostgreSQL, SQLite, Oracle or IBM DB2 to store data. Its backend is written in C and the web frontend is written in PHP. Zabbix offers several monitoring options:

  • Simple checks can verify the availability and responsiveness of standard services such as SMTP or HTTP without installing any software on the monitored host.
  • A Zabbix agent can also be installed on UNIX and Windows hosts to monitor statistics such as CPU load, network utilization, disk space, etc.

As an alternative to installing an agent on hosts, Zabbix includes support for monitoring via SNMP, TCP and ICMP checks, as well as over IPMI, JMX, SSH, Telnet and using custom parameters. Zabbix supports a variety of near-real-time notification mechanisms, including XMPP.

Released under the terms of GNU General Public License version 2, Zabbix is free software.

Zabbix 3.4.0 dashboard, dark theme

OpenNMS



OpenNMS is an enterprise-grade, integrated, open-source platform to build network monitoring solutions. Goals include accelerating time to production by supporting industry standard network management protocols, agents, and a programmable provisioning system. The OpenNMS community helps to make interoperable network monitoring solutions.

An event-driven architecture allows flexible workflow integration in existing monitoring and management stacks. OpenNMS normalizes device- and vendor-specific messages and protocol-specific performance measurements. Based on open source technologies, the data are accessible through a powerful ReST API and can be used in high level management workflow applications.
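
As a sketch of what reading data through the ReST API could look like, the snippet below prepares (but does not send) an HTTP request for a list of alarms. The `/rest/alarms` path, the `limit` parameter, the use of HTTP Basic authentication and the example server URL are assumptions based on my understanding of OpenNMS and should be checked against the OpenNMS documentation for your version.

```python
import base64
import urllib.request


def build_alarm_request(base_url, username, password, limit=10):
    """Prepare a GET request for the most recent alarms as JSON."""
    url = f"{base_url.rstrip('/')}/rest/alarms?limit={limit}"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Hypothetical server and credentials, for illustration only.
request = build_alarm_request(
    "http://opennms.example.org:8980/opennms", "user", "pass", limit=5
)
print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen()` would perform the actual call against a running server.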

Notifications

Send alerts to on-call system engineers using a variety of implemented notification strategies. Extend the platform by using the native Java API or run scripts on the underlying operating system.

  • Use E-Mail with SMTP/s protocol
  • Slack and Mattermost (and other Slack-compatible) teams via outbound webhook integration
  • Jabber and XMPP as direct message or in XMPP chatrooms
  • Support for Microblog notifications via Identica / StatusNet / Twitter
  • Run external scripts
  • Extend the Java native notification strategy API
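
Slack-compatible webhooks accept a small JSON document with a `text` field, which is what makes the webhook option in the list above easy to script. The sketch below prepares such a request in Python without sending it; the message layout, the function name and the webhook URL are invented for this example and are not taken from OpenNMS itself.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, node, severity, message):
    """Prepare a POST request carrying an alert as a Slack-style JSON payload."""
    payload = {"text": f"[{severity}] {node}: {message}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical webhook URL and alert details.
request = build_webhook_request(
    "https://chat.example.org/hooks/abc123", "web01", "MAJOR", "HTTP service down"
)
print(request.data.decode("utf-8"))
```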

Ticket Integration

Building monitoring application stacks requires strong integration capabilities. With the OpenNMS platform there are several ways to forward monitoring information and integrate it into management workflows. Take advantage of open source by using pre-built integrations, or build your own ticketing integration.

  • Request Tracker (RT) integration
  • BMC Remedy integration
  • OTRS integration
  • IBM Tivoli Service Request Manager (TSRM) integration
  • Atlassian JIRA integration
  • Extensible Ticketing API

Northbound Integration

Underlying monitoring events can be used to generate high level alarms. Streams of normalized alarms can be forwarded to external applications to integrate in management workflows.

  • JMS Alarm Northbound implementation
  • AMQP Alarm Northbound implementation
  • Forward alarms into Elasticsearch for analysis
  • Send alarms via Syslog or SNMP trap protocol to legacy management solutions
  • Extensible Northbound API

Java Based Framework

The database schema is version controlled with Liquibase, which allows easier updates and maintenance. OpenNMS uses Hibernate as the data persistence layer for PostgreSQL.

With OpenNMS you have a choice of different time series databases:

  • RRDtool: for maximal compatibility and small and medium sized performance data collections
  • JRobin: Java based RRD storage for maximum platform independence and small and medium sized performance data collections
  • NewTS: For maximal scalability and medium to large performance data collections
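
RRDtool and JRobin are round-robin databases: they keep a fixed number of samples and overwrite the oldest ones, so storage never grows. The sketch below shows only that basic idea in Python; the class name is made up and it does not reflect the actual RRD file format or its consolidation functions.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size store that drops the oldest sample when it is full."""

    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)

    def update(self, timestamp, value):
        """Add a sample; the oldest one is discarded once capacity is reached."""
        self.samples.append((timestamp, value))

    def average(self):
        """Average of the values currently held, or None when empty."""
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)


archive = RoundRobinArchive(capacity=3)
for ts, value in enumerate([1.0, 2.0, 3.0, 4.0, 5.0]):
    archive.update(ts, value)
print(list(archive.samples))
print(archive.average())
```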

OpenNMS Out of The Box….

Links

Best Open Source Monitoring Software
https://geekflare.com/best-open-source-monitoring-software/
https://devopscube.com/best-opensource-monitoring-tools/

Zabbix
https://www.zabbix.com/
https://en.wikipedia.org/wiki/Zabbix

Nagios Vs Nagios XI
https://cdn2.hubspot.net/hubfs/3796979/Inbound%20Assets/Content%20Offers/Nagios%20XI%20Comparison%20Guide.pdf

OpenNMS 101
https://www.youtube.com/watch?v=GJzmkshdjiI&list=PLsXgBGH3nG7iZSlssmZB3xWsAJlst2j2z