Monitoring of TeskaLabs LogMan.io with Zabbix¶
This section provides recommendations for integrating Zabbix monitoring with a TeskaLabs LogMan.io deployment. It covers infrastructure-level monitoring of the cluster nodes and application-level monitoring of the TeskaLabs LogMan.io platform and its supporting services.
Scope and responsibilities¶
- Zabbix is not deployed as part of the TeskaLabs LogMan.io product. The customer is responsible for deploying and maintaining their own Zabbix server infrastructure.
- It is possible and permitted to install a Zabbix agent on each cluster node of the TeskaLabs LogMan.io deployment.
- TeskaLabs provides recommendations on which metrics to collect and monitor. TeskaLabs does not provide pre-built Zabbix templates or dashboards.
- The customer (or the implementation partner) is responsible for creating Zabbix dashboards, triggers, and alerting rules based on the recommendations in this document.
Infrastructure Description¶
| Component | Details |
|---|---|
| Application | TeskaLabs LogMan.io |
| Deployment | Multi-node cluster |
| Hardware | Supermicro servers or Dell servers |
| Operating System | Ubuntu Server 22.04 LTS |
| Containerization | Docker with Docker Compose |
| Telemetry DB | InfluxDB (deployed within the cluster) |
| Key services | Elasticsearch, Apache Kafka, ZooKeeper, MongoDB, NGINX, InfluxDB |
Zabbix Agent Installation¶
Prerequisites¶
- SSH access to each cluster node (user tladmin with sudo privileges).
- Network connectivity between the Zabbix server and each cluster node (TCP port 10050 for passive checks, TCP port 10051 for active checks).
- The Zabbix server must be deployed and configured separately by the customer or the partner.
Installation¶
Note
TeskaLabs LogMan.io runs on Linux Ubuntu Server 22.04 LTS.
Execute the following commands on each cluster node:
# Download and install the Zabbix repository package
wget https://repo.zabbix.com/zabbix/7.0/ubuntu/pool/main/z/zabbix-release/zabbix-release_latest_7.0+ubuntu22.04_all.deb
sudo dpkg -i zabbix-release_latest_7.0+ubuntu22.04_all.deb
sudo apt update
# Install the Zabbix agent 2 (recommended for Docker and advanced monitoring)
sudo apt install -y zabbix-agent2 zabbix-agent2-plugin-*
# Enable and start the agent
sudo systemctl enable zabbix-agent2
sudo systemctl start zabbix-agent2
# Verify the agent is running
sudo systemctl status zabbix-agent2
Tip
Zabbix Agent 2 is recommended over the legacy Zabbix Agent because it has native support for Docker container monitoring and plugin-based extensibility.
Zabbix Agent 2 Configuration¶
Edit the main configuration file on each node:
sudo nano /etc/zabbix/zabbix_agent2.conf
Apply the following configuration (adjust values in angle brackets):
# /etc/zabbix/zabbix_agent2.conf
# --- Connection to Zabbix Server ---
Server=<ZABBIX_SERVER_IP>
ServerActive=<ZABBIX_SERVER_IP>
Hostname=<NODE_HOSTNAME>
# --- Logging ---
LogFile=/var/log/zabbix/zabbix_agent2.log
LogFileSize=10
DebugLevel=3
# --- Security ---
# Optionally restrict access with PSK encryption
# TLSConnect=psk
# TLSAccept=psk
# TLSPSKIdentity=<PSK_IDENTITY>
# TLSPSKFile=/etc/zabbix/zabbix_agent2.psk
# --- Timeouts ---
Timeout=10
# --- Docker monitoring plugin ---
# Zabbix Agent 2 includes a built-in Docker plugin.
# Ensure the zabbix user has access to the Docker socket.
Plugins.Docker.Endpoint=unix:///var/run/docker.sock
# --- User parameters for LogMan.io-specific checks ---
# Include directory for custom check scripts
Include=/etc/zabbix/zabbix_agent2.d/*.conf
After editing, restart the agent:
sudo systemctl restart zabbix-agent2
Docker Socket Access for Zabbix Agent 2¶
The Zabbix Agent 2 needs read access to the Docker socket to monitor containers:
sudo usermod -aG docker zabbix
sudo systemctl restart zabbix-agent2
Tip
Adding the zabbix user to the docker group grants it effective root-level access to the Docker daemon. If this is not acceptable under your security policy, use a Docker socket proxy with read-only access instead.
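If the socket-proxy route is chosen, a minimal Docker Compose sketch could look like the following. The tecnativa/docker-socket-proxy image and the port mapping are illustrative assumptions, not part of the LogMan.io deployment:

```yaml
# docker-compose fragment — read-only Docker socket proxy (illustrative sketch)
services:
  docker-socket-proxy:
    image: tecnativa/docker-socket-proxy  # third-party image (assumption)
    environment:
      CONTAINERS: 1   # permit read-only container endpoints
      POST: 0         # deny all mutating API requests
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "127.0.0.1:2375:2375"
```

The agent would then be pointed at the proxy instead of the raw socket (e.g. Plugins.Docker.Endpoint=tcp://127.0.0.1:2375, provided the agent version in use supports TCP endpoints — verify this against your Zabbix documentation).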
Recommended Metrics and Thresholds¶
System-Level Metrics (per Node)¶
These correspond to the "System Level Overview" section of the prophylactic check procedure.
Danger
Software RAID health is the single most important metric to monitor. A degraded RAID array means the node is operating without redundancy — a second disk failure will result in complete data loss. RAID alerts MUST trigger immediate action.
| Metric | Threshold / Condition | Severity |
|---|---|---|
| Software RAID | Any degraded array or failed/removed disk is critical | Critical |
| Disk usage | Warning ≥ 65%, Critical ≥ 80% (except /boot: warn ≥ 95%) | High |
| CPU utilization | Warning ≥ 85% sustained over 15 minutes | High |
| System load | Warning ≥ 120% of the number of CPU cores (a load equal to the core count means the node is fully utilized) | Medium |
| IOWait | Warning ≥ 30%, Critical ≥ 50% | High |
| RAM usage | Warning ≥ 80%, Critical ≥ 90% sustained | High |
| Swap usage | Warning ≥ 50%, Critical ≥ 70% | Medium |
| Network errors | Any interface errors or drops > 0 sustained | Medium |
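The disk thresholds above can be spot-checked from the shell before any triggers are built. A minimal sketch — the classify_disk_usage helper name is illustrative, not part of LogMan.io:

```shell
#!/bin/sh
# Classify a disk-usage percentage against the recommended thresholds
# (warning >= 65%, critical >= 80%).
classify_disk_usage() {
    pct="$1"
    if [ "$pct" -ge 80 ]; then echo critical
    elif [ "$pct" -ge 65 ]; then echo warning
    else echo ok
    fi
}

# In production, feed the live value, e.g.:
#   classify_disk_usage "$(df /data/hdd --output=pcent | tail -1 | tr -d ' %')"
classify_disk_usage 82   # → critical
classify_disk_usage 70   # → warning
classify_disk_usage 40   # → ok
```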
Docker Container Metrics¶
Monitor the health and resource usage of all LogMan.io containers.
| Metric | What to monitor | Severity |
|---|---|---|
| Container state | All containers must be in the running state | Critical |
| Container restarts | Restart count increasing indicates instability | High |
| Container CPU usage | Per-container CPU consumption | Medium |
| Container memory usage | Per-container RSS / memory limit ratio | Medium |
| Container network I/O | Bytes in/out, errors, drops per container | Low |
Elasticsearch Monitoring¶
These correspond to the "Elasticsearch Monitoring" section of the prophylactic check.
| Metric | Threshold / Condition | Severity |
|---|---|---|
| Cluster health | Must be green; yellow for more than 10 minutes = warning, red = critical | Critical |
| Inactive nodes | Must be 0 | Critical |
| Unassigned shards | Must be 0; any nonzero value for more than 10 minutes requires investigation | High |
| JVM Heap usage | Warning ≥ 75%, Critical ≥ 85% | High |
| Shard count per node | Warning ≥ 800, Critical ≥ 1100 shards per node | Medium |
| Index size | Investigate any index exceeding 200 GB | Medium |
| ILM policy assignment | Indices without numeric suffix are not managed by ILM | Medium |
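These health fields can be extracted by hand the same way the UserParameters below do. The JSON document here is a fabricated sample; against a live node, replace it with the output of curl -s http://localhost:9200/_cluster/health:

```shell
#!/bin/sh
# Fabricated sample of an Elasticsearch _cluster/health response.
health='{"status":"yellow","number_of_nodes":3,"unassigned_shards":2}'

# Extract the fields the triggers rely on.
echo "$health" | python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"        # → yellow
echo "$health" | python3 -c "import sys,json; print(json.load(sys.stdin).get('unassigned_shards',0))"     # → 2
```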
Apache Kafka Monitoring¶
These correspond to the "Kafka Lag Overview" section of the prophylactic check.
| Metric | What to monitor | Severity |
|---|---|---|
| Consumer group lag | Lag must not increase over time | Critical |
| Consumer groups to monitor | lmio parsec, lmio depositor, lmio baseliner, lmio correlator | — |
| Broker availability | All Kafka brokers must be reachable | Critical |
| Under-replicated partitions | Must be 0 | High |
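Total lag for a consumer group is the sum of the per-partition LAG column in the kafka-consumer-groups.sh --describe output. The sample below is fabricated (group name included) to show the arithmetic; the live command is given in the UserParameter section:

```shell
#!/bin/sh
# Fabricated sample of `kafka-consumer-groups.sh --describe` output.
sample='GROUP    TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
group-a  received  0          100             150             50
group-a  received  1          200             230             30'

# Sum the LAG column (6th field), skipping the header row — total group lag.
echo "$sample" | awk 'NR>1 {sum+=$6} END {print sum+0}'   # → 80
```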
Application Telemetry¶
TeskaLabs LogMan.io microservices produce telemetry to InfluxDB. Key application metrics to track:
| Metric Category | Key Metrics |
|---|---|
| Pipeline metrics | Events per second (EPS) — mean and max over 7 days |
| Microservice memory | VmRSS, VmSwap, VmHWM per service (memory leaks, excessive usage) |
| Microservice uptime | Service availability and restart frequency |
| Disk metrics | Per-mount usage for /data/hdd and /data/ssd |
| Network metrics | Bytes sent/received, connection counts per microservice |
| Kernel metrics | Context switches, interrupts, fork rate |
Additional Services¶
| Service | Metric | Threshold / Condition |
|---|---|---|
| ZooKeeper | Node count, leader election, outstanding requests | All nodes up, leader elected |
| MongoDB | Replication lag, connection count, oplog window | Lag near 0, oplog > 24h |
| NGINX | Active connections, error rate (5xx), latency | 5xx rate near 0 |
| InfluxDB | Write throughput, query duration, disk usage | No write failures |
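The ZooKeeper checks rely on parsing the tab-separated mntr output. The sample below is fabricated; live data comes from echo mntr | nc localhost 2181, which requires mntr to be listed in 4lw.commands.whitelist on the ZooKeeper side:

```shell
#!/bin/sh
# Fabricated sample of ZooKeeper `mntr` output (key/value per line).
mntr='zk_version 3.8.4
zk_server_state leader
zk_outstanding_requests 0'

# Pull out the fields the Additional Services table asks for.
echo "$mntr" | awk '$1=="zk_server_state" {print $2}'           # → leader
echo "$mntr" | awk '$1=="zk_outstanding_requests" {print $2}'   # → 0
```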
Zabbix Custom UserParameter Configuration¶
Create a configuration file for LogMan.io-specific checks on each node:
sudo nano /etc/zabbix/zabbix_agent2.d/logmanio.conf
# /etc/zabbix/zabbix_agent2.d/logmanio.conf
#
# Custom Zabbix UserParameters for TeskaLabs LogMan.io monitoring
#
# ==========================================================================
# --- Software RAID (mdadm) --- MOST CRITICAL METRIC
# ==========================================================================
#
# TeskaLabs LogMan.io nodes use Linux software RAID (mdadm) for disk
# redundancy. A degraded array means the node is running without protection
# against disk failure. These checks MUST trigger immediate alerting.
#
# How it works:
# /proc/mdstat contains the live state of all MD arrays.
# A healthy RAID1 array shows e.g. [UU] — all members Up.
# A degraded array shows e.g. [U_] or [_U] — one member failed/removed.
#
# Number of degraded RAID arrays (0 = healthy, >0 = CRITICAL)
# Matches only the member-status brackets, e.g. [U_], not underscores elsewhere
UserParameter=logmanio.raid.degraded_count,cat /proc/mdstat 2>/dev/null | grep -Ec '\[[U_]*_[U_]*\]'
# Overall RAID status: 1 = all arrays healthy, 0 = at least one degraded
UserParameter=logmanio.raid.healthy,cat /proc/mdstat 2>/dev/null | grep -Eq '\[[U_]*_[U_]*\]' && echo 0 || echo 1
# Total number of MD arrays present on this node
UserParameter=logmanio.raid.array_count,cat /proc/mdstat 2>/dev/null | grep -c '^md'
# Detailed state of a specific array (discovery parameter: md0, md1, etc.)
UserParameter=logmanio.raid.detail[*],sudo mdadm --detail /dev/$1 2>/dev/null | grep -E 'State|Active Devices|Failed Devices|Spare Devices' | tr '\n' '|' | sed 's/|$//'
# Number of failed devices across all arrays
UserParameter=logmanio.raid.failed_devices,sudo mdadm --detail /dev/md* 2>/dev/null | grep 'Failed Devices' | awk '{sum+=$NF} END {print sum+0}'
# Number of active sync actions (rebuild/resync in progress)
UserParameter=logmanio.raid.sync_action_count,cat /proc/mdstat 2>/dev/null | grep -c 'recovery\|resync\|reshape\|check'
# Rebuild/resync progress percentage (returns 100 if no rebuild in progress).
# Note: a plain `grep | head || echo` fallback does not work here, because the
# pipeline exit status is head's, not grep's.
UserParameter=logmanio.raid.sync_progress,p=$(grep -oP '\d+\.\d+(?=%)' /proc/mdstat 2>/dev/null | head -1); echo "${p:-100}"
# Full /proc/mdstat output for diagnostics (text item)
UserParameter=logmanio.raid.mdstat,cat /proc/mdstat 2>/dev/null
# --- Disk Usage ---
# Monitor /data/hdd usage percentage
UserParameter=logmanio.disk.hdd.pct_used,df /data/hdd --output=pcent | tail -1 | tr -d ' %'
# Monitor /data/ssd usage percentage
UserParameter=logmanio.disk.ssd.pct_used,df /data/ssd --output=pcent | tail -1 | tr -d ' %'
# --- Docker Containers ---
# Count of running LogMan.io containers
UserParameter=logmanio.docker.running_count,docker ps --filter "status=running" --format '{{.Names}}' 2>/dev/null | wc -l
# Count of non-running (exited/restarting) containers
UserParameter=logmanio.docker.unhealthy_count,docker ps --filter "status=exited" --filter "status=restarting" --format '{{.Names}}' 2>/dev/null | wc -l
# List of non-running containers (for diagnostics)
UserParameter=logmanio.docker.unhealthy_list,docker ps --filter "status=exited" --filter "status=restarting" --format '{{.Names}} ({{.Status}})' 2>/dev/null | tr '\n' ',' | sed 's/,$//'
# Total container restart count (sum across all containers; awk avoids a bc
# dependency and handles the no-containers case gracefully)
UserParameter=logmanio.docker.restart_total,docker ps -aq 2>/dev/null | xargs -r docker inspect --format '{{.RestartCount}}' 2>/dev/null | awk '{sum+=$1} END {print sum+0}'
# --- Elasticsearch ---
# Cluster health status (green/yellow/red)
UserParameter=logmanio.es.cluster_health,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
# Number of active nodes in the cluster
UserParameter=logmanio.es.active_nodes,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('number_of_nodes',0))"
# Unassigned shard count
UserParameter=logmanio.es.unassigned_shards,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('unassigned_shards',0))"
# Active shards count
UserParameter=logmanio.es.active_shards,curl -s http://localhost:9200/_cluster/health 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('active_shards',0))"
# --- Kafka Consumer Lag ---
# Total lag for lmio parsec consumer group
UserParameter=logmanio.kafka.lag.parsec,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio parsec" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'
# Total lag for lmio depositor consumer group
UserParameter=logmanio.kafka.lag.depositor,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio depositor" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'
# Total lag for lmio baseliner consumer group
UserParameter=logmanio.kafka.lag.baseliner,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio baseliner" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'
# Total lag for lmio correlator consumer group
UserParameter=logmanio.kafka.lag.correlator,docker exec $(docker ps -qf "name=kafka" | head -1) kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group "lmio correlator" 2>/dev/null | awk 'NR>1 {sum+=$6} END {print sum+0}'
# --- ZooKeeper ---
# The "mntr" four-letter-word command must be whitelisted on the ZooKeeper side
# (4lw.commands.whitelist=mntr), otherwise these checks return nothing.
# ZooKeeper status (leader/follower/standalone)
UserParameter=logmanio.zk.status,echo mntr | nc localhost 2181 2>/dev/null | grep zk_server_state | awk '{print $2}'
# ZooKeeper outstanding requests
UserParameter=logmanio.zk.outstanding_requests,echo mntr | nc localhost 2181 2>/dev/null | grep zk_outstanding_requests | awk '{print $2}'
# --- IOWait ---
# Requires the sysstat package (sudo apt install -y sysstat).
# Take %iowait from the last (current-interval) report; the per-interval
# values line is the only iostat -c output line with exactly 6 fields.
UserParameter=logmanio.cpu.iowait,iostat -c 1 2 | awk 'NF==6 {v=$4} END {print v}'
# --- Swap Usage ---
UserParameter=logmanio.swap.used_pct,free | awk '/Swap:/ {if ($2>0) printf "%.1f", $3/$2*100; else print 0}'
Restart the agent to load the new parameters:
sudo systemctl restart zabbix-agent2
Sudoers Configuration for RAID Monitoring¶
The mdadm --detail command requires root privileges. Grant the zabbix user passwordless access to mdadm only:
sudo visudo -f /etc/sudoers.d/zabbix-mdadm
Add the following line:
zabbix ALL=(root) NOPASSWD: /usr/sbin/mdadm --detail /dev/md*
Set the correct permissions:
sudo chmod 440 /etc/sudoers.d/zabbix-mdadm
Verify it works:
sudo -u zabbix sudo mdadm --detail /dev/md0
Verifying UserParameters¶
From the Zabbix server or proxy, test the parameters:
# Test from the Zabbix server
zabbix_get -s <NODE_IP> -k logmanio.es.cluster_health
zabbix_get -s <NODE_IP> -k logmanio.disk.hdd.pct_used
zabbix_get -s <NODE_IP> -k logmanio.docker.running_count
zabbix_get -s <NODE_IP> -k logmanio.kafka.lag.parsec
Recommended Zabbix Trigger Examples¶
The following trigger expressions are provided as starting points. Adjust thresholds and evaluation periods according to the specific deployment.
# ==============================================================
# SOFTWARE RAID — HIGHEST PRIORITY TRIGGERS
# ==============================================================
# CRITICAL: Any RAID array is degraded (data loss risk)
last(/<HOST>/logmanio.raid.healthy)=0
# CRITICAL: Number of degraded arrays
last(/<HOST>/logmanio.raid.degraded_count)>0
# CRITICAL: Failed physical disks detected across arrays
last(/<HOST>/logmanio.raid.failed_devices)>0
# WARNING: RAID rebuild/resync is in progress
last(/<HOST>/logmanio.raid.sync_action_count)>0
# ==============================================================
# DISK, SYSTEM, AND APPLICATION TRIGGERS
# ==============================================================
# Disk usage on /data/hdd exceeds 80%
last(/<HOST>/logmanio.disk.hdd.pct_used)>80
# Disk usage on /data/ssd exceeds 80%
last(/<HOST>/logmanio.disk.ssd.pct_used)>80
# Elasticsearch cluster health is not green
last(/<HOST>/logmanio.es.cluster_health)<>"green"
# Elasticsearch has unassigned shards
last(/<HOST>/logmanio.es.unassigned_shards)>0
# Kafka parsec consumer lag is increasing
change(/<HOST>/logmanio.kafka.lag.parsec)>0 and avg(/<HOST>/logmanio.kafka.lag.parsec,30m)>1000
# Any Docker container is in a non-running state
last(/<HOST>/logmanio.docker.unhealthy_count)>0
# IOWait exceeds the 30% warning threshold
avg(/<HOST>/logmanio.cpu.iowait,10m)>30
# RAM usage exceeds 80%
last(/<HOST>/vm.memory.utilization)>80
# CPU utilization exceeds 85% for 15 minutes
avg(/<HOST>/system.cpu.util,15m)>85
# Swap is actively being used
last(/<HOST>/logmanio.swap.used_pct)>5
Tip
Replace <HOST> with the actual Zabbix host name. These expressions use the syntax introduced in Zabbix 5.4, which is the only syntax accepted by the Zabbix 7.0 installation described above.
Hardware-Specific Monitoring (IPMI)¶
For Supermicro and Dell servers, hardware-level monitoring via IPMI is recommended.
Enabling IPMI Monitoring¶
On each node, install the IPMI tools:
sudo apt install -y ipmitool openipmi
sudo systemctl enable openipmi
sudo systemctl start openipmi
IPMI monitoring in Zabbix is agentless: it is performed by the IPMI poller on the Zabbix server, not by the Zabbix agent. Configure it by adding an IPMI interface (with the BMC address and credentials) to each host in the Zabbix UI.
Key IPMI Metrics¶
| Metric | Description |
|---|---|
| CPU temperature | Per-CPU thermal readings |
| System inlet temperature | Ambient temperature at server intake |
| Fan speed / status | Cooling fan RPM and operational status |
| Power supply status | PSU health and redundancy |
| Memory ECC errors | Correctable and uncorrectable memory errors |
Tip
Zabbix provides built-in IPMI templates for both Supermicro and Dell iDRAC. Import the appropriate vendor template on the Zabbix server and link it to each monitored host.
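Independently of the Zabbix templates, sensor health can be spot-checked locally with ipmitool. The sensor listing below is a fabricated sample; on a live node, pipe the output of sudo ipmitool sensor through the same awk filter:

```shell
#!/bin/sh
# Fabricated sample of `ipmitool sensor` output: pipe-separated fields
# (name | reading | unit | status | thresholds...).
sensors='CPU1 Temp        | 45.000     | degrees C  | ok
FAN1             | 5400.000   | RPM        | ok
PS1 Status       | 0x1        | discrete   | 0x0100'

# Count sensors whose status field is not "ok" — candidates for an alert.
echo "$sensors" | awk -F'|' '{gsub(/ /,"",$4)} $4!="ok" {bad++} END {print bad+0}'   # → 1
```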
Network Firewall Rules¶
Ensure the following network connectivity is available:
| Source | Destination | Port | Protocol | Purpose |
|---|---|---|---|---|
| Zabbix Server | Cluster Nodes | 10050 | TCP | Zabbix passive checks |
| Cluster Nodes | Zabbix Server | 10051 | TCP | Zabbix active checks |
| Zabbix Server | Cluster Nodes | 623 | UDP | IPMI monitoring (if used) |
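On Ubuntu, if ufw is the active firewall (an assumption — your deployment may manage iptables or nftables directly), the rules above can be expressed as follows. The placeholders must be replaced with the actual addresses:

```shell
# On each cluster node: allow passive checks from the Zabbix server
sudo ufw allow from <ZABBIX_SERVER_IP> to any port 10050 proto tcp

# On the Zabbix server: allow active checks from the cluster nodes
sudo ufw allow from <CLUSTER_SUBNET> to any port 10051 proto tcp
```

Note that IPMI (UDP 623) targets the BMC interface, which typically sits on a separate management network; that rule belongs on the management-network firewall rather than the node OS firewall.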