Monitoring of TeskaLabs LogMan.io with Zabbix

This section provides recommendations for integrating Zabbix monitoring with a TeskaLabs LogMan.io deployment. It covers infrastructure-level monitoring of the cluster nodes and application-level monitoring of the TeskaLabs LogMan.io platform and its supporting services.

Scope and responsibilities

  • Zabbix is not deployed as part of the TeskaLabs LogMan.io product. The customer is responsible for deploying and maintaining their own Zabbix server infrastructure.
  • Installing a Zabbix agent on each cluster node of the TeskaLabs LogMan.io deployment is supported and permitted.
  • TeskaLabs provides recommendations on which metrics to collect and monitor. TeskaLabs does not provide pre-built Zabbix templates or dashboards.
  • The customer (or the implementation partner) is responsible for creating Zabbix dashboards, triggers, and alerting rules based on the recommendations in this document.

Infrastructure Description

| Component | Details |
|---|---|
| Application | TeskaLabs LogMan.io |
| Deployment | Multi-node cluster |
| Hardware | Supermicro or Dell servers |
| Operating system | Ubuntu Server 22.04 LTS |
| Containerization | Docker with Docker Compose |
| Telemetry DB | InfluxDB (deployed within the cluster) |
| Key services | Elasticsearch, Apache Kafka, ZooKeeper, MongoDB, NGINX, InfluxDB |

Zabbix Agent Installation

Prerequisites

  • SSH access to each cluster node (user tladmin with sudo privileges).
  • Network connectivity between the Zabbix server and each cluster node (TCP port 10050 for passive checks, TCP port 10051 for active checks).
  • The Zabbix server must be deployed and configured separately by the customer or the partner.
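
Before installing the agent, connectivity on the two Zabbix ports can be verified with a small script. This is a minimal sketch using bash's /dev/tcp; the NODE_IP and ZABBIX_SERVER_IP variables are placeholders to override for the actual deployment:

```shell
#!/usr/bin/env bash
# Pre-flight check of Zabbix agent connectivity.
# Override the placeholders, e.g.: NODE_IP=10.0.0.5 ZABBIX_SERVER_IP=10.0.0.1 ./check.sh
NODE_IP="${NODE_IP:-127.0.0.1}"
ZABBIX_SERVER_IP="${ZABBIX_SERVER_IP:-127.0.0.1}"

check_port() {  # check_port HOST PORT -> "HOST:PORT open|closed"
  if timeout 2 bash -c ">/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed"
  fi
}

check_port "$NODE_IP" 10050           # passive checks (server -> node)
check_port "$ZABBIX_SERVER_IP" 10051  # active checks (node -> server)
```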

Installation

Note

TeskaLabs LogMan.io runs on Linux Ubuntu Server 22.04 LTS.

Execute the following commands on each cluster node:

# Download and install the Zabbix repository package
wget https://repo.zabbix.com/zabbix/7.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_latest_7.0+ubuntu22.04_all.deb
sudo dpkg -i zabbix-release_latest_7.0+ubuntu22.04_all.deb
sudo apt update

# Install the Zabbix agent 2 (recommended for Docker and advanced monitoring)
sudo apt install -y zabbix-agent2 zabbix-agent2-plugin-*

# Enable and start the agent
sudo systemctl enable zabbix-agent2
sudo systemctl start zabbix-agent2

# Verify the agent is running
sudo systemctl status zabbix-agent2

Tip

Zabbix Agent 2 is recommended over the legacy Zabbix Agent because it has native support for Docker container monitoring and plugin-based extensibility.

Zabbix Agent 2 Configuration

Edit the main configuration file on each node:

sudo nano /etc/zabbix/zabbix_agent2.conf

Apply the following configuration (adjust values in angle brackets):

# /etc/zabbix/zabbix_agent2.conf

# --- Connection to Zabbix Server ---
Server=<ZABBIX_SERVER_IP>
ServerActive=<ZABBIX_SERVER_IP>
Hostname=<NODE_HOSTNAME>

# --- Logging ---
LogFile=/var/log/zabbix/zabbix_agent2.log
LogFileSize=10
DebugLevel=3

# --- Security ---
# Optionally restrict access with PSK encryption
# TLSConnect=psk
# TLSAccept=psk
# TLSPSKIdentity=<PSK_IDENTITY>
# TLSPSKFile=/etc/zabbix/zabbix_agent2.psk

# --- Timeouts ---
Timeout=10

# --- Docker monitoring plugin ---
# Zabbix Agent 2 includes a built-in Docker plugin.
# Ensure the zabbix user has access to the Docker socket.
Plugins.Docker.Endpoint=unix:///var/run/docker.sock

# --- User parameters for LogMan.io-specific checks ---
# Include directory for custom check scripts
Include=/etc/zabbix/zabbix_agent2.d/*.conf

After editing, restart the agent:

sudo systemctl restart zabbix-agent2

Docker Socket Access for Zabbix Agent 2

The Zabbix Agent 2 needs read access to the Docker socket to monitor containers:

sudo usermod -aG docker zabbix
sudo systemctl restart zabbix-agent2

Tip

Adding the zabbix user to the docker group grants it effective root-level access to the Docker daemon. If this is not acceptable per the security policy, use a Docker socket proxy with read-only access instead.
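
One way to avoid the group membership is a socket proxy. The sketch below uses the community tecnativa/docker-socket-proxy image with only read-only container endpoints enabled; the image choice and the local port are assumptions, not part of the LogMan.io deployment:

```yaml
# docker-compose.yml fragment (sketch): read-only Docker API proxy
services:
  docker-socket-proxy:
    image: tecnativa/docker-socket-proxy
    environment:
      CONTAINERS: 1   # allow GET /containers/* (read-only)
      POST: 0         # deny all mutating requests
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "127.0.0.1:2375:2375"
```

The agent's Docker plugin endpoint would then point at tcp://127.0.0.1:2375 instead of the socket path; verify that the plugin version in use supports TCP endpoints.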

System-Level Metrics (per Node)

These correspond to the "System Level Overview" section of the prophylactic check procedure.

Danger

Software RAID health is the single most important metric to monitor. A degraded RAID array means the node is operating without redundancy — a second disk failure will result in complete data loss. RAID alerts MUST trigger immediate action.

| Metric | Threshold / Condition | Severity |
|---|---|---|
| Software RAID | Any degraded array or failed/removed disk | Critical |
| Disk usage | Warning ≥ 65%, Critical ≥ 80% (except /boot: warning ≥ 95%) | High |
| CPU utilization | Warning ≥ 85% sustained over 15 minutes | High |
| System load | Warning ≥ 120% of available cores; the recommended maximum load equals the number of cores | Medium |
| IOWait | Warning ≥ 30%, Critical ≥ 50% | High |
| RAM usage | Warning ≥ 80%, Critical ≥ 90% sustained | High |
| Swap usage | Warning ≥ 50%, Critical ≥ 70% | Medium |
| Network errors | Any interface errors or drops > 0 sustained | Medium |
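
The RAID UserParameters later in this document parse /proc/mdstat; a degraded member shows up as an underscore inside the bracketed member status. The logic can be exercised against a sample fragment (illustrative device names and sizes):

```shell
# Illustrative /proc/mdstat fragment for a degraded RAID1 array
mdstat='md0 : active raid1 sdb1[1]
      976630464 blocks super 1.2 [2/1] [U_]'

# Healthy members read [UU]; "_" marks a failed/removed member.
# Count bracketed statuses that contain an underscore:
printf '%s\n' "$mdstat" | grep -Ec '\[[U_]*_[U_]*\]'
```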

Docker Container Metrics

Monitor the health and resource usage of all LogMan.io containers.

| Metric | What to monitor | Severity |
|---|---|---|
| Container state | All containers must be in the running state | Critical |
| Container restarts | An increasing restart count indicates instability | High |
| Container CPU usage | Per-container CPU consumption | Medium |
| Container memory usage | Per-container RSS / memory-limit ratio | Medium |
| Container network I/O | Bytes in/out, errors, drops per container | Low |

Elasticsearch Monitoring

These correspond to the "Elasticsearch Monitoring" section of the prophylactic check.

| Metric | Threshold / Condition | Severity |
|---|---|---|
| Cluster health | Must be green; yellow for more than 10 minutes = warning, red = critical | Critical |
| Inactive nodes | Must be 0 | Critical |
| Unassigned shards | Must be 0; any nonzero value for more than 10 minutes requires investigation | High |
| JVM heap usage | Warning ≥ 75%, Critical ≥ 85% | High |
| Shard count per node | Warning ≥ 800, Critical ≥ 1100 shards per node | Medium |
| Index size | Investigate any index exceeding 200 GB | Medium |
| ILM policy assignment | Indices without a numeric suffix are not managed by ILM | Medium |
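
The UserParameters later in this document extract these values from the /_cluster/health API with a small python3 one-liner. The extraction can be sanity-checked against a canned response (illustrative values):

```shell
# Sample /_cluster/health response (illustrative values)
sample='{"status":"yellow","number_of_nodes":3,"unassigned_shards":2}'

# Same extraction style as the Elasticsearch UserParameters
printf '%s' "$sample" | python3 -c \
  "import sys,json; d=json.load(sys.stdin); print(d.get('status','unknown'), d.get('unassigned_shards',0))"
```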

Apache Kafka Monitoring

These correspond to the "Kafka Lag Overview" section of the prophylactic check.

| Metric | What to monitor | Severity |
|---|---|---|
| Consumer group lag | Lag must not increase over time | Critical |
| Broker availability | All Kafka brokers must be reachable | Critical |
| Under-replicated partitions | Must be 0 | High |

Consumer groups to monitor: lmio parsec, lmio depositor, lmio baseliner, lmio correlator.
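
The per-group lag UserParameters later in this document sum the LAG column (the 6th field) of kafka-consumer-groups.sh --describe output. The awk step can be checked against sample output (illustrative offsets):

```shell
# Illustrative kafka-consumer-groups.sh --describe output
sample='GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
grp topic1 0 100 150 50 consumer-1
grp topic1 1 200 230 30 consumer-2'

# Sum the LAG column, skipping the header (NR>1)
printf '%s\n' "$sample" | awk 'NR>1 {sum+=$6} END {print sum+0}'
```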

Application Telemetry

TeskaLabs LogMan.io microservices produce telemetry to InfluxDB. Key application metrics to track:

| Metric category | Key metrics |
|---|---|
| Pipeline metrics | Events per second (EPS), mean and max over 7 days |
| Microservice memory | VmRSS, VmSwap, VmHWM per service (memory leaks, excessive usage) |
| Microservice uptime | Service availability and restart frequency |
| Disk metrics | Per-mount usage for /data/hdd and /data/ssd |
| Network metrics | Bytes sent/received, connection counts per microservice |
| Kernel metrics | Context switches, interrupts, fork rate |
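
As an illustration, a 7-day EPS overview could be pulled from InfluxDB with a Flux query along these lines. The bucket and measurement names ("telemetry", "eps") are placeholders; the actual names depend on the deployment:

```
from(bucket: "telemetry")                        // placeholder bucket name
  |> range(start: -7d)
  |> filter(fn: (r) => r._measurement == "eps")  // placeholder measurement
  |> aggregateWindow(every: 1h, fn: mean)
```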

Additional Services

| Service | Metric | Threshold / Condition |
|---|---|---|
| ZooKeeper | Node count, leader election, outstanding requests | All nodes up, leader elected |
| MongoDB | Replication lag, connection count, oplog window | Lag near 0, oplog window > 24 h |
| NGINX | Active connections, error rate (5xx), latency | 5xx rate near 0 |
| InfluxDB | Write throughput, query duration, disk usage | No write failures |
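
For NGINX, active-connection counts are commonly scraped from the stub_status module; enabling a /nginx_status location with the stub_status directive is an assumption here, not part of the default deployment. Parsing its output is straightforward:

```shell
# Illustrative stub_status output
sample='Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106'

# Extract the active connection count (3rd field of the first line)
printf '%s\n' "$sample" | awk '/^Active/ {print $3}'
```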

Zabbix Custom UserParameter Configuration

Create a configuration file for LogMan.io-specific checks on each node:

sudo nano /etc/zabbix/zabbix_agent2.d/logmanio.conf
# /etc/zabbix/zabbix_agent2.d/logmanio.conf
#
# Custom Zabbix UserParameters for TeskaLabs LogMan.io monitoring
#

# ==========================================================================
# --- Software RAID (mdadm) --- MOST CRITICAL METRIC
# ==========================================================================
#
# TeskaLabs LogMan.io nodes use Linux software RAID (mdadm) for disk
# redundancy. A degraded array means the node is running without protection
# against disk failure. These checks MUST trigger immediate alerting.
#
# How it works:
#   /proc/mdstat contains the live state of all MD arrays.
#   A healthy RAID1 array shows e.g. [UU] — all members Up.
#   A degraded array shows e.g. [U_] or [_U] — one member failed/removed.
#

# Number of degraded RAID arrays (0 = healthy, >0 = CRITICAL).
# A bracketed member status containing "_" (e.g. [U_]) marks a degraded
# array; matching the brackets avoids false hits on other underscores.
UserParameter=logmanio.raid.degraded_count,cat /proc/mdstat 2>/dev/null | grep -Ec '\[[U_]*_[U_]*\]'

# Overall RAID status: 1 = all arrays healthy, 0 = at least one degraded
UserParameter=logmanio.raid.healthy,cat /proc/mdstat 2>/dev/null | grep -Eq '\[[U_]*_[U_]*\]' && echo 0 || echo 1

# Total number of MD arrays present on this node
UserParameter=logmanio.raid.array_count,cat /proc/mdstat 2>/dev/null | grep -c '^md'

# Detailed state of a specific array (discovery parameter: md0, md1, etc.)
UserParameter=logmanio.raid.detail[*],sudo mdadm --detail /dev/$1 2>/dev/null | grep -E 'State|Active Devices|Failed Devices|Spare Devices' | tr '\n' '|' | sed 's/|$//'

# Number of failed devices across all arrays
UserParameter=logmanio.raid.failed_devices,sudo mdadm --detail /dev/md* 2>/dev/null | grep 'Failed Devices' | awk '{sum+=$NF} END {print sum+0}'

# Number of active sync actions (rebuild/resync in progress)
UserParameter=logmanio.raid.sync_action_count,cat /proc/mdstat 2>/dev/null | grep -c 'recovery\|resync\|reshape\|check'

# Rebuild/resync progress percentage (prints 100 when no rebuild is in
# progress; "grep ." fails on empty input, so the fallback actually fires)
UserParameter=logmanio.raid.sync_progress,grep -oP '\d+\.\d+(?=%)' /proc/mdstat 2>/dev/null | head -1 | grep . || echo 100

# Full /proc/mdstat output for diagnostics (text item)
UserParameter=logmanio.raid.mdstat,cat /proc/mdstat 2>/dev/null

# --- Disk Usage ---
# Monitor /data/hdd usage percentage
UserParameter=logmanio.disk.hdd.pct_used,df /data/hdd --output=pcent | tail -1 | tr -d ' %'

# Monitor /data/ssd usage percentage
UserParameter=logmanio.disk.ssd.pct_used,df /data/ssd --output=pcent | tail -1 | tr -d ' %'

# --- Docker Containers ---
# Count of running LogMan.io containers
UserParameter=logmanio.docker.running_count,docker ps --filter "status=running" --format '{{.Names}}' 2>/dev/null | wc -l

# Count of non-running (exited/restarting) containers
UserParameter=logmanio.docker.unhealthy_count,docker ps --filter "status=exited" --filter "status=restarting" --format '{{.Names}}' 2>/dev/null | wc -l

# List of non-running containers (for diagnostics)
UserParameter=logmanio.docker.unhealthy_list,docker ps --filter "status=exited" --filter "status=restarting" --format '{{.Names}} ({{.Status}})' 2>/dev/null | tr '\n' ',' | sed 's/,$//'

# Total container restart count (sum across all containers; awk avoids a
# dependency on bc and prints 0 when no containers exist)
UserParameter=logmanio.docker.restart_total,docker inspect --format '{{.RestartCount}}' $(docker ps -aq) 2>/dev/null | awk '{sum+=$1} END {print sum+0}'

# --- Elasticsearch ---
# Cluster health status (green/yellow/red)
UserParameter=logmanio.es.cluster_health,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"

# Number of active nodes in the cluster
UserParameter=logmanio.es.active_nodes,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('number_of_nodes',0))"

# Unassigned shard count
UserParameter=logmanio.es.unassigned_shards,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('unassigned_shards',0))"

# Active shards count
UserParameter=logmanio.es.active_shards,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('active_shards',0))"

# --- Kafka Consumer Lag ---
# Total lag for lmio parsec consumer group
UserParameter=logmanio.kafka.lag.parsec,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio parsec" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'

# Total lag for lmio depositor consumer group
UserParameter=logmanio.kafka.lag.depositor,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio depositor" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'

# Total lag for lmio baseliner consumer group
UserParameter=logmanio.kafka.lag.baseliner,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio baseliner" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'

# Total lag for lmio correlator consumer group
UserParameter=logmanio.kafka.lag.correlator,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio correlator" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'

# --- ZooKeeper ---
# Note: since ZooKeeper 3.5, four-letter-word commands such as "mntr" must
# be whitelisted (4lw.commands.whitelist=mntr) in the ZooKeeper config.
# These checks also require netcat (nc) on the node.

# ZooKeeper status (leader/follower/standalone)
UserParameter=logmanio.zk.status,echo mntr | nc localhost 2181 2>/dev/null | grep zk_server_state | awk '{print $2}'

# ZooKeeper outstanding requests
UserParameter=logmanio.zk.outstanding_requests,echo mntr | nc localhost 2181 2>/dev/null | grep zk_outstanding_requests | awk '{print $2}'

# --- IOWait ---
# Requires the sysstat package (sudo apt install -y sysstat).
# The second sample reflects the live interval; matching numeric lines
# avoids picking up the blank line at the end of iostat output.
UserParameter=logmanio.cpu.iowait,iostat -c 1 2 | awk '/^ *[0-9]/ {v=$4} END {print v+0}'

# --- Swap Usage ---
UserParameter=logmanio.swap.used_pct,free | awk '/Swap:/ {if ($2>0) printf "%.1f", $3/$2*100; else print 0}'

Restart the agent to load the new parameters:

sudo systemctl restart zabbix-agent2

Sudoers Configuration for RAID Monitoring

The mdadm --detail command requires root privileges. Grant the zabbix user passwordless access to mdadm only:

sudo visudo -f /etc/sudoers.d/zabbix-mdadm

Add the following line:

zabbix ALL=(root) NOPASSWD: /usr/sbin/mdadm --detail /dev/md*

Set the correct permissions:

sudo chmod 440 /etc/sudoers.d/zabbix-mdadm

Verify it works:

sudo -u zabbix sudo mdadm --detail /dev/md0

Verifying UserParameters

From the Zabbix server or proxy, test the parameters:

# Test from the Zabbix server
zabbix_get -s <NODE_IP> -k logmanio.es.cluster_health
zabbix_get -s <NODE_IP> -k logmanio.disk.hdd.pct_used
zabbix_get -s <NODE_IP> -k logmanio.docker.running_count
zabbix_get -s <NODE_IP> -k logmanio.kafka.lag.parsec
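
Each key can also be resolved locally on a node with the agent's built-in test mode (zabbix_agent2 -t <key>), which is useful when zabbix_get is not available. A guarded sketch:

```shell
# Local item test on a cluster node (prints the key and its resolved
# value); falls back to a notice where the agent is not installed.
if command -v zabbix_agent2 >/dev/null 2>&1; then
  zabbix_agent2 -t logmanio.raid.healthy
  zabbix_agent2 -t logmanio.disk.hdd.pct_used
else
  echo "zabbix_agent2 not installed"
fi
```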

Recommended Trigger Expressions

The following trigger expressions are provided as starting points. Adjust thresholds and evaluation periods according to the specific deployment.

# ==============================================================
# SOFTWARE RAID — HIGHEST PRIORITY TRIGGERS
# ==============================================================

# CRITICAL: Any RAID array is degraded (data loss risk)
{<HOST>:logmanio.raid.healthy.last()}=0

# CRITICAL: Number of degraded arrays
{<HOST>:logmanio.raid.degraded_count.last()}>0

# CRITICAL: Failed physical disks detected across arrays
{<HOST>:logmanio.raid.failed_devices.last()}>0

# WARNING: RAID rebuild/resync is in progress
{<HOST>:logmanio.raid.sync_action_count.last()}>0

# ==============================================================
# DISK, SYSTEM, AND APPLICATION TRIGGERS
# ==============================================================

# Disk usage on /data/hdd exceeds 80%
{<HOST>:logmanio.disk.hdd.pct_used.last()}>80

# Disk usage on /data/ssd exceeds 80%
{<HOST>:logmanio.disk.ssd.pct_used.last()}>80

# Elasticsearch cluster health is not green
{<HOST>:logmanio.es.cluster_health.str(green)}=0

# Elasticsearch has unassigned shards
{<HOST>:logmanio.es.unassigned_shards.last()}>0

# Kafka parsec consumer lag is increasing
{<HOST>:logmanio.kafka.lag.parsec.change()}>0 and {<HOST>:logmanio.kafka.lag.parsec.avg(30m)}>1000

# Any Docker container is in a non-running state
{<HOST>:logmanio.docker.unhealthy_count.last()}>0

# IOWait averages above 30% over 10 minutes (warning threshold)
{<HOST>:logmanio.cpu.iowait.avg(10m)}>30

# RAM usage exceeds 80%
{<HOST>:vm.memory.utilization.last()}>80

# CPU utilization exceeds 85% for 15 minutes
{<HOST>:system.cpu.util.avg(15m)}>85

# Swap is actively being used
{<HOST>:logmanio.swap.used_pct.last()}>5

Tip

Replace <HOST> with the actual Zabbix host name. These examples use the Zabbix classic trigger syntax. Adjust to the trigger expression format supported by your Zabbix version.
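
For reference, the same conditions in the expression syntax used by Zabbix 5.4 and later (including 7.0) look like this; <HOST> remains a placeholder:

```
# RAID degraded (critical)
last(/<HOST>/logmanio.raid.healthy)=0

# Disk usage on /data/hdd exceeds 80%
last(/<HOST>/logmanio.disk.hdd.pct_used)>80

# Elasticsearch cluster health is not green
last(/<HOST>/logmanio.es.cluster_health)<>"green"
```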

Hardware-Specific Monitoring (IPMI)

For Supermicro and Dell servers, hardware-level monitoring via IPMI is recommended.

Enabling IPMI Monitoring

On each node, install the IPMI tools:

sudo apt install -y ipmitool openipmi
sudo systemctl enable openipmi
sudo systemctl start openipmi
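
Local sensor readout can then be spot-checked with ipmitool (requires BMC access on the node and root privileges). A guarded sketch:

```shell
# Spot-check hardware sensors via the local BMC; the fallback keeps the
# script usable on machines without ipmitool installed.
if command -v ipmitool >/dev/null 2>&1; then
  sudo ipmitool sdr list              # all sensor readings
  sudo ipmitool sdr type Temperature  # thermal sensors only
  sudo ipmitool sel list              # system event log (hardware errors)
else
  echo "ipmitool not installed"
fi
```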

Zabbix polls IPMI sensors from the Zabbix server (or proxy), not through the agent: add an IPMI interface to each host in the Zabbix frontend and ensure the server or proxy is running with IPMI support enabled.

Key IPMI Metrics

| Metric | Description |
|---|---|
| CPU temperature | Per-CPU thermal readings |
| System inlet temperature | Ambient temperature at the server intake |
| Fan speed / status | Cooling fan RPM and operational status |
| Power supply status | PSU health and redundancy |
| Memory ECC errors | Correctable and uncorrectable memory errors |

Tip

Zabbix ships ready-made hardware templates for common vendors, including Supermicro (via IPMI) and Dell iDRAC (via SNMP). Import the appropriate vendor template on the Zabbix server and link it to each monitored host.

Network Firewall Rules

Ensure the following network connectivity is available:

| Source | Destination | Port | Protocol | Purpose |
|---|---|---|---|---|
| Zabbix server | Cluster nodes | 10050 | TCP | Zabbix passive checks |
| Cluster nodes | Zabbix server | 10051 | TCP | Zabbix active checks |
| Zabbix server | Cluster nodes | 623 | UDP | IPMI monitoring (if used) |