Prophylactic check manual¶
A prophylactic check is a systematic preventative assessment to verify that a system is working properly, and to identify and mitigate potential issues before they escalate into more severe or critical problems. By performing regular prophylactic checks, you can proactively maintain the integrity, reliability, and efficiency of your TeskaLabs LogMan.io system, minimizing the risk of unexpected failures or disruptions that could arise if left unaddressed.
Support
If you need further information or support beyond what is covered here, reach out to your TeskaLabs LogMan.io support Slack channel, or send an e-mail to support@teskalabs.com. We will assist you promptly.
Performing prophylactic checks¶
Important
Conduct prophylactic checks at consistent intervals, ideally on the same day of the week and around the same time. Remember that the volume and timing of incoming events can fluctuate depending on the day of the week, working hours, and holidays.
During prophylactic checks, make sure to conduct a comprehensive review of all available tenants.
Examine each of the following components of your TeskaLabs LogMan.io installation according to our recommendations, and report issues as needed.
TeskaLabs LogMan.io functionalities¶
Location: TeskaLabs LogMan.io sidebar
Goal: Ensuring that every functionality of the TeskaLabs LogMan.io app works properly
Within the assigned tenant, thoroughly examine each component featured in the sidebar (Discover, Dashboards, Exports, Lookups, Reports, etc.) to ensure its proper operation. Report issues identified in this section to your TeskaLabs support channel. Such issues can include pop-up errors when opening a section from the sidebar, loss of availability of some tools, or, for example, an inability to open Dashboards.
Issue reporting: Utilize the support Slack channel for general reporting.
Log source monitoring¶
Location: TeskaLabs LogMan.io Discover screen or dedicated dashboard
Goal: Ensuring that each log source is active and works as expected, and that no anomalies are found (for example a drop-out, a peak, or anything unusual). This is also crucial for your log source visibility.
Note: Consider incorporating Baselines as another option for log source checks.
Log source monitoring can be achieved by individually reviewing each log source, or by creating an overview dashboard equipped with widgets for monitoring each log source's activity visually. We recommend creating a dashboard with line charts.
The examination should always cover a sample of the data collected since the previous prophylactic check.
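One way to script this review is to aggregate event counts per log source directly in Elasticsearch. The following is a minimal sketch, in which the Elasticsearch URL, credentials, index pattern, and the use of event.dataset to identify each log source are assumptions to adjust for your deployment.

```python
# A sketch of a per-log-source activity check. The Elasticsearch URL,
# credentials, index pattern, and the use of "event.dataset" to identify
# log sources are assumptions -- adjust them to your deployment.
import requests

ES_URL = "https://localhost:9200"
AUTH = ("elastic", "changeme")            # hypothetical credentials
INDEX = "lmio-mytenant-events-*"          # hypothetical index pattern
SINCE = "now-7d"                          # period since the previous prophylactic check

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": SINCE, "lte": "now"}}},
    "aggs": {
        "log_sources": {"terms": {"field": "event.dataset", "size": 200}},
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, auth=AUTH, verify=False)
resp.raise_for_status()

# A log source that is expected but missing here, or that shows an unusually
# low count, deserves further investigation.
for bucket in resp.json()["aggregations"]["log_sources"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["doc_count"]} events')
```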
Issue reporting: In case of an inactive log source, conduct further investigation and report to your TeskaLabs LogMan.io Slack support channel.
Log time zones¶
Location: TeskaLabs LogMan.io Discover screen
Goal: Ensuring that there are no discrepancies between your time zone and the time zone present in the logs
Investigate whether there are any logs with a @timestamp value set in the future. You can do so by filtering the time range from now to two or more hours from now.
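The same check can be scripted against Elasticsearch. The sketch below assumes the same hypothetical endpoint, credentials, and index pattern as in the previous example; any hits indicate events with a future @timestamp.

```python
# A sketch of a future-timestamp check; same hypothetical endpoint,
# credentials, and index pattern as in the previous example.
import requests

ES_URL = "https://localhost:9200"
AUTH = ("elastic", "changeme")
INDEX = "lmio-mytenant-events-*"

query = {
    "size": 5,                                              # show a few offending documents
    "query": {"range": {"@timestamp": {"gte": "now+2h"}}},  # two or more hours in the future
    "sort": [{"@timestamp": "desc"}],
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, auth=AUTH, verify=False)
resp.raise_for_status()
hits = resp.json()["hits"]
print("future-dated events:", hits["total"]["value"])       # should be zero
```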
Issue reporting: Utilize the project support Slack for general reporting.
If the issue appears to be linked to the logging device settings, please investigate this further within your own network.
Other events¶
Location: TeskaLabs LogMan.io Discover screen, lmio-others-events index
Goal: Ensuring all the events are parsed correctly using either Parsec or Parser.
In most installations, we collect error logs from the following areas:
- Parser
- Parsec
- Dispatcher
- Depositor
- Unstructured logs
Logs that are not parsed correctly go to the others index. Ideally, no logs should be present in the others index.
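A quick way to verify this is to count recent documents in the others index. In the sketch below, the index pattern, endpoint, and credentials are assumptions; the exact pattern may differ in your deployment.

```python
# A sketch counting recent documents in the others index; the index pattern,
# endpoint, and credentials are assumptions.
import requests

ES_URL = "https://localhost:9200"
AUTH = ("elastic", "changeme")
OTHERS_INDEX = "lmio-mytenant-others-*"   # hypothetical index pattern

resp = requests.post(
    f"{ES_URL}/{OTHERS_INDEX}/_count",
    json={"query": {"range": {"@timestamp": {"gte": "now-7d"}}}},
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()
print("unparsed events in the last week:", resp.json()["count"])   # ideally zero
```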
Issue reporting: If a few logs are found in the others index, such as those indicating incorrect parsing, it's generally not a severe problem requiring immediate attention. Investigate these logs further and report to your TeskaLabs LogMan.io support Slack channel.
System logs¶
Location: TeskaLabs LogMan.io - System tenant, index Events & Others.
Goal: Ensuring the system is working properly and there are no unusual or critical system logs that could signal any internal issue
Issue reporting: A multitude of log types may be found in this section. Reporting can be done either via your TeskaLabs LogMan.io Slack channel, or within your infrastructure.
Baseliner¶
Note
Baseliner is included only in advanced deployments of LogMan.io. If you would like to upgrade LogMan.io, contact support, and we'll be happy to assist you.
Location: TeskaLabs LogMan.io Discover screen, filtering for event.dataset:baseliner
Goal: Ensuring that the Baseliner functionality is working properly and is detecting deviations from a calculated activity baseline.
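One way to verify Baseliner activity is to count recent events carrying event.dataset:baseliner. A minimal sketch, assuming the same hypothetical endpoint, credentials, and index pattern as in the earlier examples, and that event.dataset is a keyword field:

```python
# A sketch counting recent Baseliner events; endpoint, credentials, and
# index pattern are assumptions, and event.dataset is assumed to be a
# keyword field.
import requests

ES_URL = "https://localhost:9200"
AUTH = ("elastic", "changeme")
INDEX = "lmio-mytenant-events-*"

query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"event.dataset": "baseliner"}},
                {"range": {"@timestamp": {"gte": "now-7d"}}},
            ],
        },
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_count", json=query, auth=AUTH, verify=False)
resp.raise_for_status()
print("baseliner events in the last week:", resp.json()["count"])  # zero suggests an inactive Baseliner
```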
Issue reporting: If the Baseliner is not active, report it to your TeskaLabs LogMan.io support Slack channel.
Elasticsearch¶
Location: Grafana, dedicated Elasticsearch dashboard
Goal: Ensuring that there are no malfunctions linked to Elasticsearch and services associated with it.
The assessment should always be based on a sample of data from the past 24 hours. This operational dashboard provides an indication of the proper functioning of Elasticsearch.
Key Indicators:
- Inactive Nodes should be at zero.
- System Health should be green. Any indication of yellow or red should be escalated to TeskaLabs LogMan.io Slack support channel immediately.
- Unassigned Shards should be at zero and marked as green. Any value in yellow or above warrants monitoring and reporting.
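The same indicators can also be read from the Elasticsearch cluster health API. A minimal sketch, assuming the hypothetical endpoint and credentials used in the earlier examples:

```python
# A sketch reading the same indicators from the _cluster/health API;
# endpoint and credentials are assumptions.
import requests

ES_URL = "https://localhost:9200"
AUTH = ("elastic", "changeme")

resp = requests.get(f"{ES_URL}/_cluster/health", auth=AUTH, verify=False)
resp.raise_for_status()
health = resp.json()

print("status:           ", health["status"])               # must be "green"
print("number of nodes:  ", health["number_of_nodes"])       # compare with the expected node count
print("unassigned shards:", health["unassigned_shards"])     # must be zero
```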
Issue reporting: If there are any issues detected, ensure prompt escalation. Further investigation of the Elastic cluster can be conducted in Kibana/Stack monitoring.
Nodes¶
Detailed information about node health can be found in Elasticsearch. JVM Heap monitors memory usage.
Overview¶
The current EPS (events per second) of the entire Elastic cluster is visible.
Index sizing & lifecycle monitoring¶
Location: Kibana, Stack monitoring or Stack management
Follow these steps to analyze indices for abnormal size:
- Access the "Indices" section.
- Sort the "Data" column from largest to smallest.
- Examine the indices to identify any that are significantly larger than the others.
The acceptable index size range is a topic for discussion, but generally, indices up to 200 GB are considered acceptable.
Any indices exceeding 200 GB in size should be reported.
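Oversized indices can be spotted with the _cat/indices API. A minimal sketch, assuming the same hypothetical endpoint and credentials as above:

```python
# A sketch that lists indices by size and flags any above 200 GB, using
# the _cat/indices API; endpoint and credentials are assumptions.
import requests

ES_URL = "https://localhost:9200"
AUTH = ("elastic", "changeme")
LIMIT = 200 * 1024**3                     # 200 GB in bytes

resp = requests.get(
    f"{ES_URL}/_cat/indices",
    params={"format": "json", "bytes": "b", "s": "store.size:desc"},
    auth=AUTH,
    verify=False,
)
resp.raise_for_status()

for index in resp.json():
    size = int(index["store.size"] or 0)
    if size > LIMIT:
        print(f'{index["index"]}: {size / 1024**3:.0f} GB -- report this index')
```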
In the case of indices associated with ILM (index lifecycle management), it's crucial to verify the index status. If an index lacks a string of numbers at the end of its name, it indicates that it is not linked to an ILM policy and may grow without automatic rollover. To confirm this, review the index's properties to check whether it falls under the hot, warm, or cold category. When indices are not connected to ILM, they tend to remain in a hot state or exhibit irregular shifts between hot, cold, and warm.
Please note that lookups do not have ILM and should always be considered in the hot state.
Issue reporting: Report to the dedicated project support Slack channel. Such reports should be treated with the utmost seriousness and escalated promptly.
System-Level Overview¶
Location: Grafana, dedicated System Level Overview dashboard
The assessment should always be based on a sample of data from the past 24 hours.
Key metrics to monitor:
- Disk usage: All values must not exceed 80%, except for /boot, which should not exceed 95% (a local check is sketched below).
- Load: Values must not exceed 40%, and the maximum load should align with the number of cores.
- IOWait: Indicates data processing and should only register as a small percentage, signifying that the device is waiting for data to load from the disk.
- RAM usage: Further considerations should be made regarding the establishment of high-value thresholds.
In the case of multiple servers, ensure values are checked for each.
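The disk-usage part of this check can also be scripted locally on each server. A minimal sketch, covering only the mount points named above; extend the threshold table with any others relevant to your hosts:

```python
# A sketch of the disk-usage part of this check, run locally on each server.
# The mount points listed here are only the ones named above; extend the
# table with any others relevant to your hosts.
import shutil

THRESHOLDS = {"/": 80, "/boot": 95}       # maximum acceptable usage in percent

for mount, limit in THRESHOLDS.items():
    usage = shutil.disk_usage(mount)
    used_pct = usage.used / usage.total * 100
    status = "OK" if used_pct <= limit else "REPORT"
    print(f"{mount}: {used_pct:.1f}% used (limit {limit}%) -> {status}")
```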
Issue reporting: Report to the dedicated project support Slack channel.
Burrow Consumer Lag¶
Location: Grafana, dedicated Burrow Consumer Lag dashboard
For Kafka monitoring, scrutinize this dashboard for each consumerGroup (a scripted check is sketched below), with a specific focus on:
- lmio dispatcher
- lmio depositor
- lmio baseliner
- lmio correlator
- lmio watcher
A lag value exhibiting an increasing trend over time indicates a problem that needs to be addressed immediately.
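Consumer group lag can also be read programmatically from Burrow's HTTP API. In the sketch below, the Burrow URL, the Kafka cluster name registered in Burrow, and the exact consumer group names are assumptions to adjust for your deployment.

```python
# A sketch reading consumer group lag from Burrow's HTTP API. The Burrow URL,
# the Kafka cluster name registered in Burrow, and the exact consumer group
# names are assumptions -- adjust them to your deployment.
import requests

BURROW_URL = "http://localhost:8000"
CLUSTER = "kafka"                          # hypothetical cluster name in Burrow
GROUPS = [                                 # hypothetical consumer group names
    "lmio-dispatcher",
    "lmio-depositor",
    "lmio-baseliner",
    "lmio-correlator",
    "lmio-watcher",
]

for group in GROUPS:
    resp = requests.get(f"{BURROW_URL}/v3/kafka/{CLUSTER}/consumer/{group}/lag")
    resp.raise_for_status()
    status = resp.json()["status"]
    # Compare the total lag with the value noted during the previous
    # prophylactic check; a rising trend must be reported immediately.
    print(f'{group}: status={status["status"]}, total lag={status["totallag"]}')
```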
Issue reporting: If lag increases compared to the previous week's prophylaxis, promptly report this on the support Slack channel.
Depositor Monitoring¶
Location: Grafana, dedicated Depositor dashboard.
Key metrics to monitor:
- Failed bulks - Must be green and equal to zero
- Output Queue Size of Bulks
- Duty Cycle
- EPS IN & OUT
- Successful Bulks
Issue reporting: Report to the dedicated project support Slack channel.