Monitoring of TeskaLabs LogMan.io with Zabbix

This section provides recommendations for integrating Zabbix monitoring with a TeskaLabs LogMan.io deployment. It covers infrastructure-level monitoring of the cluster nodes and application-level monitoring of the TeskaLabs LogMan.io platform and its supporting services.

Scope and responsibilities

  • Zabbix is not deployed as part of the TeskaLabs LogMan.io product. The customer is responsible for deploying and maintaining their own Zabbix server infrastructure.
  • Installing a Zabbix agent on each cluster node of the TeskaLabs LogMan.io deployment is supported and permitted.
  • TeskaLabs provides recommendations on which metrics to collect and monitor. TeskaLabs does not provide pre-built Zabbix templates or dashboards.
  • The customer (or the implementation partner) is responsible for creating Zabbix dashboards, triggers, and alerting rules based on the recommendations in this document.

Infrastructure Description

| Component | Details |
|---|---|
| Application | TeskaLabs LogMan.io |
| Deployment | Multi-node cluster |
| Hardware | Supermicro or Dell servers |
| Operating system | Ubuntu Server 22.04 LTS |
| Containerization | Docker with Docker Compose |
| Telemetry DB | InfluxDB (deployed within the cluster) |
| Key services | Elasticsearch, Apache Kafka, ZooKeeper, MongoDB, NGINX, InfluxDB |

Zabbix Agent Installation

Prerequisites

  • SSH access to each cluster node (user tladmin with sudo privileges).
  • Network connectivity between the Zabbix server and each cluster node (TCP port 10050 for passive checks, TCP port 10051 for active checks).
  • The Zabbix server must be deployed and configured separately by the customer or the partner.
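
Before installing the agent, connectivity on the two Zabbix ports can be verified with a small script. This is a minimal sketch using bash's /dev/tcp; the NODE_IP and ZABBIX_SERVER_IP variables are placeholders to override for the actual deployment:

```shell
#!/usr/bin/env bash
# Pre-flight check of Zabbix agent connectivity.
# Override the placeholders, e.g.: NODE_IP=10.0.0.5 ZABBIX_SERVER_IP=10.0.0.1 ./check.sh
NODE_IP="${NODE_IP:-127.0.0.1}"
ZABBIX_SERVER_IP="${ZABBIX_SERVER_IP:-127.0.0.1}"

check_port() {  # check_port HOST PORT -> "HOST:PORT open|closed"
  if timeout 2 bash -c ">/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed"
  fi
}

check_port "$NODE_IP" 10050           # passive checks (server -> node)
check_port "$ZABBIX_SERVER_IP" 10051  # active checks (node -> server)
```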

Installation

Note

TeskaLabs LogMan.io runs on Linux Ubuntu Server 22.04 LTS.

Execute the following commands on each cluster node:

# Download and install the Zabbix repository package
wget https://repo.zabbix.com/zabbix/7.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_latest_7.0+ubuntu22.04_all.deb
sudo dpkg -i zabbix-release_latest_7.0+ubuntu22.04_all.deb
sudo apt update

# Install the Zabbix agent 2 (recommended for Docker and advanced monitoring)
sudo apt install -y zabbix-agent2 zabbix-agent2-plugin-*

# Enable and start the agent
sudo systemctl enable zabbix-agent2
sudo systemctl start zabbix-agent2

# Verify the agent is running
sudo systemctl status zabbix-agent2

Tip

Zabbix Agent 2 is recommended over the legacy Zabbix Agent because it has native support for Docker container monitoring and plugin-based extensibility.

Zabbix Agent 2 Configuration

Edit the main configuration file on each node:

sudo nano /etc/zabbix/zabbix_agent2.conf

Apply the following configuration (adjust values in angle brackets):

# /etc/zabbix/zabbix_agent2.conf

# --- Connection to Zabbix Server ---
Server=<ZABBIX_SERVER_IP>
ServerActive=<ZABBIX_SERVER_IP>
Hostname=<NODE_HOSTNAME>

# --- Logging ---
LogFile=/var/log/zabbix/zabbix_agent2.log
LogFileSize=10
DebugLevel=3

# --- Security ---
# Optionally restrict access with PSK encryption
# TLSConnect=psk
# TLSAccept=psk
# TLSPSKIdentity=<PSK_IDENTITY>
# TLSPSKFile=/etc/zabbix/zabbix_agent2.psk

# --- Timeouts ---
Timeout=10

# --- Docker monitoring plugin ---
# Zabbix Agent 2 includes a built-in Docker plugin.
# Ensure the zabbix user has access to the Docker socket.
Plugins.Docker.Endpoint=unix:///var/run/docker.sock

# --- User parameters for LogMan.io-specific checks ---
# Include directory for custom check scripts
Include=/etc/zabbix/zabbix_agent2.d/*.conf

After editing, restart the agent:

sudo systemctl restart zabbix-agent2

Docker Socket Access for Zabbix Agent 2

The Zabbix Agent 2 needs read access to the Docker socket to monitor containers:

sudo usermod -aG docker zabbix
sudo systemctl restart zabbix-agent2

Tip

Adding the zabbix user to the docker group grants it effective root-level access to the Docker daemon. If this is not acceptable per the security policy, use a Docker socket proxy with read-only access instead.
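
One way to avoid the group membership is a socket proxy. The sketch below uses the community tecnativa/docker-socket-proxy image with only read-only container endpoints enabled; the image choice and the local port are assumptions, not part of the LogMan.io deployment:

```yaml
# docker-compose.yml fragment (sketch): read-only Docker API proxy
services:
  docker-socket-proxy:
    image: tecnativa/docker-socket-proxy
    environment:
      CONTAINERS: 1   # allow GET /containers/* (read-only)
      POST: 0         # deny all mutating requests
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "127.0.0.1:2375:2375"
```

The agent's Docker plugin endpoint would then point at tcp://127.0.0.1:2375 instead of the socket path; verify that the plugin version in use supports TCP endpoints.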

System-Level Metrics (per Node)

These correspond to the "System Level Overview" section of the prophylactic check procedure.

Danger

Software RAID health is the single most important metric to monitor. A degraded RAID array means the node is operating without redundancy — a second disk failure will result in complete data loss. RAID alerts MUST trigger immediate action.

| Metric | Threshold / Condition | Severity |
|---|---|---|
| Software RAID | Any degraded array or failed/removed disk | Critical |
| Disk usage | Warning ≥ 65%, Critical ≥ 80% (except /boot: warning ≥ 95%) | High |
| CPU utilization | Warning ≥ 85% sustained over 15 minutes | High |
| System load | Warning ≥ 120% of available cores; the recommended maximum load equals the number of cores | Medium |
| IOWait | Warning ≥ 30%, Critical ≥ 50% | High |
| RAM usage | Warning ≥ 80%, Critical ≥ 90% sustained | High |
| Swap usage | Warning ≥ 50%, Critical ≥ 70% | Medium |
| Network errors | Any interface errors or drops > 0 sustained | Medium |
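
The RAID UserParameters later in this document parse /proc/mdstat; a degraded member shows up as an underscore inside the bracketed member status. The logic can be exercised against a sample fragment (illustrative device names and sizes):

```shell
# Illustrative /proc/mdstat fragment for a degraded RAID1 array
mdstat='md0 : active raid1 sdb1[1]
      976630464 blocks super 1.2 [2/1] [U_]'

# Healthy members read [UU]; "_" marks a failed/removed member.
# Count bracketed statuses that contain an underscore:
printf '%s\n' "$mdstat" | grep -Ec '\[[U_]*_[U_]*\]'
```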

Docker Container Metrics

Monitor the health and resource usage of all LogMan.io containers.

| Metric | What to monitor | Severity |
|---|---|---|
| Container state | All containers must be in the running state | Critical |
| Container restarts | An increasing restart count indicates instability | High |
| Container CPU usage | Per-container CPU consumption | Medium |
| Container memory usage | Per-container RSS / memory-limit ratio | Medium |
| Container network I/O | Bytes in/out, errors, drops per container | Low |

Elasticsearch Monitoring

These correspond to the "Elasticsearch Monitoring" section of the prophylactic check.

| Metric | Threshold / Condition | Severity |
|---|---|---|
| Cluster health | Must be green; yellow for more than 10 minutes = warning, red = critical | Critical |
| Inactive nodes | Must be 0 | Critical |
| Unassigned shards | Must be 0; any nonzero value for more than 10 minutes requires investigation | High |
| JVM heap usage | Warning ≥ 75%, Critical ≥ 85% | High |
| Shard count per node | Warning ≥ 800, Critical ≥ 1100 shards per node | Medium |
| Index size | Investigate any index exceeding 200 GB | Medium |
| ILM policy assignment | Indices without a numeric suffix are not managed by ILM | Medium |
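
The UserParameters later in this document extract these values from the /_cluster/health API with a small python3 one-liner. The extraction can be sanity-checked against a canned response (illustrative values):

```shell
# Sample /_cluster/health response (illustrative values)
sample='{"status":"yellow","number_of_nodes":3,"unassigned_shards":2}'

# Same extraction style as the Elasticsearch UserParameters
printf '%s' "$sample" | python3 -c \
  "import sys,json; d=json.load(sys.stdin); print(d.get('status','unknown'), d.get('unassigned_shards',0))"
```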

Apache Kafka Monitoring

These correspond to the "Kafka Lag Overview" section of the prophylactic check.

| Metric | What to monitor | Severity |
|---|---|---|
| Consumer group lag | Lag must not increase over time | Critical |
| Broker availability | All Kafka brokers must be reachable | Critical |
| Under-replicated partitions | Must be 0 | High |

Consumer groups to monitor: lmio parsec, lmio depositor, lmio baseliner, lmio correlator.
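
The per-group lag UserParameters later in this document sum the LAG column (the 6th field) of kafka-consumer-groups.sh --describe output. The awk step can be checked against sample output (illustrative offsets):

```shell
# Illustrative kafka-consumer-groups.sh --describe output
sample='GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
grp topic1 0 100 150 50 consumer-1
grp topic1 1 200 230 30 consumer-2'

# Sum the LAG column, skipping the header (NR>1)
printf '%s\n' "$sample" | awk 'NR>1 {sum+=$6} END {print sum+0}'
```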

Application Telemetry

TeskaLabs LogMan.io microservices produce telemetry to InfluxDB. Key application metrics to track:

| Metric category | Key metrics |
|---|---|
| Pipeline metrics | Events per second (EPS), mean and max over 7 days |
| Microservice memory | VmRSS, VmSwap, VmHWM per service (memory leaks, excessive usage) |
| Microservice uptime | Service availability and restart frequency |
| Disk metrics | Per-mount usage for /data/hdd and /data/ssd |
| Network metrics | Bytes sent/received, connection counts per microservice |
| Kernel metrics | Context switches, interrupts, fork rate |
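
As an illustration, a 7-day EPS overview could be pulled from InfluxDB with a Flux query along these lines. The bucket and measurement names ("telemetry", "eps") are placeholders; the actual names depend on the deployment:

```
from(bucket: "telemetry")                        // placeholder bucket name
  |> range(start: -7d)
  |> filter(fn: (r) => r._measurement == "eps")  // placeholder measurement
  |> aggregateWindow(every: 1h, fn: mean)
```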

Additional Services

| Service | Metric | Threshold / Condition |
|---|---|---|
| ZooKeeper | Node count, leader election, outstanding requests | All nodes up, leader elected |
| MongoDB | Replication lag, connection count, oplog window | Lag near 0, oplog window > 24 h |
| NGINX | Active connections, error rate (5xx), latency | 5xx rate near 0 |
| InfluxDB | Write throughput, query duration, disk usage | No write failures |
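
For NGINX, active-connection counts are commonly scraped from the stub_status module; enabling a /nginx_status location with the stub_status directive is an assumption here, not part of the default deployment. Parsing its output is straightforward:

```shell
# Illustrative stub_status output
sample='Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106'

# Extract the active connection count (3rd field of the first line)
printf '%s\n' "$sample" | awk '/^Active/ {print $3}'
```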

Zabbix Custom UserParameter Configuration

Create a configuration file for LogMan.io-specific checks on each node:

sudo nano /etc/zabbix/zabbix_agent2.d/logmanio.conf
# /etc/zabbix/zabbix_agent2.d/logmanio.conf
#
# Custom Zabbix UserParameters for TeskaLabs LogMan.io monitoring
#

# ==========================================================================
# --- Software RAID (mdadm) --- MOST CRITICAL METRIC
# ==========================================================================
#
# TeskaLabs LogMan.io nodes use Linux software RAID (mdadm) for disk
# redundancy. A degraded array means the node is running without protection
# against disk failure. These checks MUST trigger immediate alerting.
#
# How it works:
#   /proc/mdstat contains the live state of all MD arrays.
#   A healthy RAID1 array shows e.g. [UU] — all members Up.
#   A degraded array shows e.g. [U_] or [_U] — one member failed/removed.
#

# Number of degraded RAID arrays (0 = healthy, >0 = CRITICAL).
# A bracketed member status containing "_" (e.g. [U_]) marks a degraded
# array; matching the brackets avoids false hits on other underscores.
UserParameter=logmanio.raid.degraded_count,cat /proc/mdstat 2>/dev/null | grep -Ec '\[[U_]*_[U_]*\]'

# Overall RAID status: 1 = all arrays healthy, 0 = at least one degraded
UserParameter=logmanio.raid.healthy,cat /proc/mdstat 2>/dev/null | grep -Eq '\[[U_]*_[U_]*\]' && echo 0 || echo 1

# Total number of MD arrays present on this node
UserParameter=logmanio.raid.array_count,cat /proc/mdstat 2>/dev/null | grep -c '^md'

# Detailed state of a specific array (discovery parameter: md0, md1, etc.)
UserParameter=logmanio.raid.detail[*],sudo mdadm --detail /dev/$1 2>/dev/null | grep -E 'State|Active Devices|Failed Devices|Spare Devices' | tr '\n' '|' | sed 's/|$//'

# Number of failed devices across all arrays
UserParameter=logmanio.raid.failed_devices,sudo mdadm --detail /dev/md* 2>/dev/null | grep 'Failed Devices' | awk '{sum+=$NF} END {print sum+0}'

# Number of active sync actions (rebuild/resync in progress)
UserParameter=logmanio.raid.sync_action_count,cat /proc/mdstat 2>/dev/null | grep -c 'recovery\|resync\|reshape\|check'

# Rebuild/resync progress percentage (prints 100 when no rebuild is in
# progress; "grep ." fails on empty input, so the fallback actually fires)
UserParameter=logmanio.raid.sync_progress,grep -oP '\d+\.\d+(?=%)' /proc/mdstat 2>/dev/null | head -1 | grep . || echo 100

# Full /proc/mdstat output for diagnostics (text item)
UserParameter=logmanio.raid.mdstat,cat /proc/mdstat 2>/dev/null

# --- Disk Usage ---
# Monitor /data/hdd usage percentage
UserParameter=logmanio.disk.hdd.pct_used,df /data/hdd --output=pcent | tail -1 | tr -d ' %'

# Monitor /data/ssd usage percentage
UserParameter=logmanio.disk.ssd.pct_used,df /data/ssd --output=pcent | tail -1 | tr -d ' %'

# --- Docker Containers ---
# Count of running LogMan.io containers
UserParameter=logmanio.docker.running_count,docker ps --filter "status=running" --format '{{.Names}}' 2>/dev/null | wc -l

# Count of non-running (exited/restarting) containers
UserParameter=logmanio.docker.unhealthy_count,docker ps --filter "status=exited" --filter "status=restarting" --format '{{.Names}}' 2>/dev/null | wc -l

# List of non-running containers (for diagnostics)
UserParameter=logmanio.docker.unhealthy_list,docker ps --filter "status=exited" --filter "status=restarting" --format '{{.Names}} ({{.Status}})' 2>/dev/null | tr '\n' ',' | sed 's/,$//'

# Total container restart count (sum across all containers; awk avoids a
# dependency on bc and prints 0 when no containers exist)
UserParameter=logmanio.docker.restart_total,docker inspect --format '{{.RestartCount}}' $(docker ps -aq) 2>/dev/null | awk '{sum+=$1} END {print sum+0}'

# --- Elasticsearch ---
# Cluster health status (green/yellow/red)
UserParameter=logmanio.es.cluster_health,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"

# Number of active nodes in the cluster
UserParameter=logmanio.es.active_nodes,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('number_of_nodes',0))"

# Unassigned shard count
UserParameter=logmanio.es.unassigned_shards,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('unassigned_shards',0))"

# Active shards count
UserParameter=logmanio.es.active_shards,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('active_shards',0))"

# --- Kafka Consumer Lag ---
# Total lag for lmio parsec consumer group
UserParameter=logmanio.kafka.lag.parsec,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio parsec" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'

# Total lag for lmio depositor consumer group
UserParameter=logmanio.kafka.lag.depositor,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio depositor" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'

# Total lag for lmio baseliner consumer group
UserParameter=logmanio.kafka.lag.baseliner,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio baseliner" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'

# Total lag for lmio correlator consumer group
UserParameter=logmanio.kafka.lag.correlator,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio correlator" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'

# --- ZooKeeper ---
# Note: since ZooKeeper 3.5, four-letter-word commands such as "mntr" must
# be whitelisted (4lw.commands.whitelist=mntr) in the ZooKeeper config.
# These checks also require netcat (nc) on the node.

# ZooKeeper status (leader/follower/standalone)
UserParameter=logmanio.zk.status,echo mntr | nc localhost 2181 2>/dev/null | grep zk_server_state | awk '{print $2}'

# ZooKeeper outstanding requests
UserParameter=logmanio.zk.outstanding_requests,echo mntr | nc localhost 2181 2>/dev/null | grep zk_outstanding_requests | awk '{print $2}'

# --- IOWait ---
# Requires the sysstat package (sudo apt install -y sysstat).
# The second sample reflects the live interval; matching numeric lines
# avoids picking up the blank line at the end of iostat output.
UserParameter=logmanio.cpu.iowait,iostat -c 1 2 | awk '/^ *[0-9]/ {v=$4} END {print v+0}'

# --- Swap Usage ---
UserParameter=logmanio.swap.used_pct,free | awk '/Swap:/ {if ($2>0) printf "%.1f", $3/$2*100; else print 0}'

Restart the agent to load the new parameters:

sudo systemctl restart zabbix-agent2

Sudoers Configuration for RAID Monitoring

The mdadm --detail command requires root privileges. Grant the zabbix user passwordless access to mdadm only:

sudo visudo -f /etc/sudoers.d/zabbix-mdadm

Add the following line:

zabbix ALL=(root) NOPASSWD: /usr/sbin/mdadm --detail /dev/md*

Set the correct permissions:

sudo chmod 440 /etc/sudoers.d/zabbix-mdadm

Verify it works:

sudo -u zabbix sudo mdadm --detail /dev/md0

Verifying UserParameters

From the Zabbix server or proxy, test the parameters:

# Test from the Zabbix server
zabbix_get -s <NODE_IP> -k logmanio.es.cluster_health
zabbix_get -s <NODE_IP> -k logmanio.disk.hdd.pct_used
zabbix_get -s <NODE_IP> -k logmanio.docker.running_count
zabbix_get -s <NODE_IP> -k logmanio.kafka.lag.parsec
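
Each key can also be resolved locally on a node with the agent's built-in test mode (zabbix_agent2 -t <key>), which is useful when zabbix_get is not available. A guarded sketch:

```shell
# Local item test on a cluster node (prints the key and its resolved
# value); falls back to a notice where the agent is not installed.
if command -v zabbix_agent2 >/dev/null 2>&1; then
  zabbix_agent2 -t logmanio.raid.healthy
  zabbix_agent2 -t logmanio.disk.hdd.pct_used
else
  echo "zabbix_agent2 not installed"
fi
```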

Recommended Trigger Expressions

The following trigger expressions are provided as starting points. Adjust thresholds and evaluation periods according to the specific deployment.

# ==============================================================
# SOFTWARE RAID — HIGHEST PRIORITY TRIGGERS
# ==============================================================

# CRITICAL: Any RAID array is degraded (data loss risk)
{<HOST>:logmanio.raid.healthy.last()}=0

# CRITICAL: Number of degraded arrays
{<HOST>:logmanio.raid.degraded_count.last()}>0

# CRITICAL: Failed physical disks detected across arrays
{<HOST>:logmanio.raid.failed_devices.last()}>0

# WARNING: RAID rebuild/resync is in progress
{<HOST>:logmanio.raid.sync_action_count.last()}>0

# ==============================================================
# DISK, SYSTEM, AND APPLICATION TRIGGERS
# ==============================================================

# Disk usage on /data/hdd exceeds 80%
{<HOST>:logmanio.disk.hdd.pct_used.last()}>80

# Disk usage on /data/ssd exceeds 80%
{<HOST>:logmanio.disk.ssd.pct_used.last()}>80

# Elasticsearch cluster health is not green
{<HOST>:logmanio.es.cluster_health.str(green)}=0

# Elasticsearch has unassigned shards
{<HOST>:logmanio.es.unassigned_shards.last()}>0

# Kafka parsec consumer lag is increasing
{<HOST>:logmanio.kafka.lag.parsec.change()}>0 and {<HOST>:logmanio.kafka.lag.parsec.avg(30m)}>1000

# Any Docker container is in a non-running state
{<HOST>:logmanio.docker.unhealthy_count.last()}>0

# IOWait averages above 30% over 10 minutes (warning threshold)
{<HOST>:logmanio.cpu.iowait.avg(10m)}>30

# RAM usage exceeds 80%
{<HOST>:vm.memory.utilization.last()}>80

# CPU utilization exceeds 85% for 15 minutes
{<HOST>:system.cpu.util.avg(15m)}>85

# Swap is actively being used
{<HOST>:logmanio.swap.used_pct.last()}>5

Tip

Replace <HOST> with the actual Zabbix host name. These examples use the Zabbix classic trigger syntax. Adjust to the trigger expression format supported by your Zabbix version.
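
For reference, the same conditions in the expression syntax used by Zabbix 5.4 and later (including 7.0) look like this; <HOST> remains a placeholder:

```
# RAID degraded (critical)
last(/<HOST>/logmanio.raid.healthy)=0

# Disk usage on /data/hdd exceeds 80%
last(/<HOST>/logmanio.disk.hdd.pct_used)>80

# Elasticsearch cluster health is not green
last(/<HOST>/logmanio.es.cluster_health)<>"green"
```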

Hardware-Specific Monitoring (IPMI)

For Supermicro and Dell servers, hardware-level monitoring via IPMI is recommended.

Enabling IPMI Monitoring

On each node, install the IPMI tools:

sudo apt install -y ipmitool openipmi
sudo systemctl enable openipmi
sudo systemctl start openipmi
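
Local sensor readout can then be spot-checked with ipmitool (requires BMC access on the node and root privileges). A guarded sketch:

```shell
# Spot-check hardware sensors via the local BMC; the fallback keeps the
# script usable on machines without ipmitool installed.
if command -v ipmitool >/dev/null 2>&1; then
  sudo ipmitool sdr list              # all sensor readings
  sudo ipmitool sdr type Temperature  # thermal sensors only
  sudo ipmitool sel list              # system event log (hardware errors)
else
  echo "ipmitool not installed"
fi
```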

Zabbix polls IPMI sensors from the Zabbix server (or proxy), not through the agent: add an IPMI interface to each host in the Zabbix frontend and ensure the server or proxy is running with IPMI support enabled.

Key IPMI Metrics

| Metric | Description |
|---|---|
| CPU temperature | Per-CPU thermal readings |
| System inlet temperature | Ambient temperature at the server intake |
| Fan speed / status | Cooling fan RPM and operational status |
| Power supply status | PSU health and redundancy |
| Memory ECC errors | Correctable and uncorrectable memory errors |

Tip

Zabbix ships ready-made hardware templates for common vendors, including Supermicro (via IPMI) and Dell iDRAC (via SNMP). Import the appropriate vendor template on the Zabbix server and link it to each monitored host.

Network Firewall Rules

Ensure the following network connectivity is available:

| Source | Destination | Port | Protocol | Purpose |
|---|---|---|---|---|
| Zabbix server | Cluster nodes | 10050 | TCP | Zabbix passive checks |
| Cluster nodes | Zabbix server | 10051 | TCP | Zabbix active checks |
| Zabbix server | Cluster nodes | 623 | UDP | IPMI monitoring (if used) |