Table of Contents
Introduction
TeskaLabs LogMan.io documentation¶
Welcome to TeskaLabs LogMan.io documentation.
TeskaLabs LogMan.io¶
TeskaLabs LogMan.io™️ is a software product for log collection, log aggregation, log storage and retention, real-time log analysis and prompt incident response for an IT infrastructure, collectively known as log management.
TeskaLabs LogMan.io consists of a central infrastructure and log collectors that reside on monitored systems such as servers or network appliances. Log collectors collect various logs (operating system, applications, databases) and system metrics such as CPU usage, memory usage, disk space, and so on. Collected events are sent in real time to the central infrastructure for consolidation, orchestration, and storage. Thanks to its real-time nature, LogMan.io provides alerts for anomalous situations from the perspective of system operation (e.g. is disk space running low?), availability (e.g. is the application running?), business (e.g. is the number of transactions below normal?) or security (e.g. is there any unusual access to servers?).
TeskaLabs SIEM¶
TeskaLabs SIEM is a real-time Security Information and Event Management tool. TeskaLabs SIEM provides real-time analysis and correlation of security events and alerts processed by TeskaLabs LogMan.io. We designed TeskaLabs SIEM to enhance cyber security posture and regulatory compliance.
More components
TeskaLabs SIEM and TeskaLabs LogMan.io are standalone products. Thanks to their modular architecture, these products also include other TeskaLabs technologies:
- TeskaLabs SeaCat Auth for authentication and authorization, including user roles and fine-grained access control.
- TeskaLabs SP-Lang is an expression language used in many places in the product.
Made with ❤️ by TeskaLabs
TeskaLabs LogMan.io™️ is a product of TeskaLabs.
Features¶
TeskaLabs LogMan.io is a real-time SIEM with log management.
- Multitenancy: a single instance of TeskaLabs LogMan.io can serve multiple tenants (customers, departments).
- Multiuser: TeskaLabs LogMan.io can be used by an unlimited number of users simultaneously.
Technologies¶
Cryptography¶
- Transport layer: TLS 1.2, TLS 1.3 and better
- Symmetric cryptography: AES-128, AES-256
- Asymmetric cryptography: RSA, ECC
- Hash methods: SHA-256, SHA-384, SHA-512
- MAC functions: HMAC
- HSM: PKCS#11 interface
Note
TeskaLabs LogMan.io uses only strong cryptography, meaning we use only those ciphers, hashes and other algorithms that are recognized as secure by the cryptographic community and by organizations such as ENISA or NIST.
Supported Log Sources¶
TeskaLabs LogMan.io supports a variety of different technologies, which we have listed below.
Formats¶
- Syslog RFC 5424 (IETF)
- Syslog RFC 3164 (BSD)
- Syslog RFC 3195 (BEEP profile)
- Syslog RFC 6587 (Frames over TCP)
- Reliable Event Logging Protocol (RELP), including SSL
- Windows Event Log
- SNMP
- ArcSight CEF
- LEEF
- JSON
- XML
- YAML
- Avro
- Custom/raw log format
And many more.
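For illustration, a single event in Syslog RFC 5424 format looks like this (the host name, application name, and message text are made-up examples):

<165>1 2023-06-07T06:00:00.000Z host01 myapp 1234 ID47 - User "alice" logged in from 198.51.100.7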
Vendors and Products¶
Cisco¶
- Cisco Firepower Threat Defense (FTD)
- Cisco Adaptive Security Appliance (ASA)
- Cisco Identity Services Engine (ISE)
- Cisco Meraki (MX, MS, MR devices)
- Cisco Catalyst Switches
- Cisco IOS
- Cisco WLC
- Cisco ACS
- Cisco SMB
- Cisco UCS
- Cisco IronPort
- Cisco Nexus
- Cisco Routers
- Cisco VPN
- Cisco Umbrella
Palo Alto Networks¶
- Palo Alto Next-Generation Firewalls
- Palo Alto Panorama (Centralized Management)
- Palo Alto Traps (Endpoint Protection)
Fortinet¶
- FortiGate (Next-Generation Firewalls)
- FortiSwitch (Switches)
- FortiAnalyzer (Log Analytics)
- FortiMail (Email Security)
- FortiWeb (Web Application Firewall)
- FortiADC
- FortiDDoS
- FortiSandbox
Juniper Networks¶
- Juniper SRX Series (Firewalls)
- Juniper MX Series (Routers)
- Juniper EX Series (Switches)
Check Point Software Technologies¶
- Check Point Security Gateways
- Check Point SandBlast (Threat Prevention)
- Check Point CloudGuard (Cloud Security)
Microsoft¶
- Microsoft Windows (Operating System)
- Microsoft Azure (Cloud Platform)
- Microsoft SQL Server (Database)
- Microsoft IIS (Web Server)
- Microsoft Office 365
- Microsoft Exchange
- Microsoft SharePoint
Linux¶
- Ubuntu (Distribution)
- CentOS (Distribution)
- Debian (Distribution)
- Red Hat Enterprise Linux (Distribution)
- IPTables
- nftables
- Bash
- Cron
- Kernel (dmesg)
Oracle¶
- Oracle Database
- Oracle WebLogic Server (Application Server)
- Oracle Cloud
Amazon Web Services (AWS)¶
- Amazon EC2 (Virtual Servers)
- Amazon RDS (Database Service)
- AWS Lambda (Serverless Computing)
- Amazon S3 (Storage Service)
VMware¶
- VMware ESXi (Hypervisor)
- VMware vCenter Server (Management Platform)
F5 Networks¶
- F5 BIG-IP (Application Delivery Controllers)
- F5 Advanced Web Application Firewall (WAF)
Barracuda Networks¶
- Barracuda CloudGen Firewall
- Barracuda Web Application Firewall
- Barracuda Email Security Gateway
Sophos¶
- Sophos XG Firewall
- Sophos UTM (Unified Threat Management)
- Sophos Intercept X (Endpoint Protection)
Aruba Networks (HPE)¶
- Aruba Switches
- Aruba Wireless Access Points
- Aruba ClearPass (Network Access Control)
- Aruba Mobility Controller
HPE¶
- iLO
- IMC
- HPE StoreOnce
- HPE Primera Storage
- HPE 3PAR StoreServ
Trend Micro¶
- Trend Micro Deep Security
- Trend Micro Deep Discovery
- Trend Micro TippingPoint (Intrusion Prevention System)
- Trend Micro Endpoint Protection Manager
- Trend Micro Apex One
Zscaler¶
- Zscaler Internet Access (Secure Web Gateway)
- Zscaler Private Access (Remote Access)
Akamai¶
- Akamai (Content Delivery Network and Security)
- Akamai Kona Site Defender (Web Application Firewall)
- Akamai Web Application Protector
Imperva¶
- Imperva Web Application Firewall (WAF)
- Imperva Database Security (Database Monitoring)
SonicWall¶
- SonicWall Next-Generation Firewalls
- SonicWall Email Security
- SonicWall Secure Mobile Access
WatchGuard Technologies¶
- WatchGuard Firebox (Firewalls)
- WatchGuard XTM (Unified Threat Management)
- WatchGuard Dimension (Network Security Visibility)
Apple¶
- macOS (Operating System)
Apache¶
- Apache Cassandra (Database)
- Apache HTTP Server
- Apache Kafka
- Apache Tomcat
- Apache Zookeeper
NGINX¶
- NGINX (Web Server and Reverse Proxy Server)
Docker¶
- Docker (Container Platform)
Kubernetes¶
- Kubernetes (Container Orchestration)
Atlassian¶
- Jira (Issue and Project Tracking)
- Confluence (Collaboration Software)
- Bitbucket (Code Collaboration and Version Control)
Cloudflare¶
- Cloudflare (Content Delivery Network and Security)
SAP¶
- SAP HANA (Database)
Balabit¶
- syslog-ng
Open-source¶
- PostgreSQL (Database)
- MySQL (Database)
- OpenSSH (Remote access)
- Dropbear SSH (Remote access)
- Jenkins (Continuous Integration and Continuous Delivery)
- rsyslog
- GenieACS
- HAProxy
- SpamAssassin
- FreeRADIUS
- BIND
- DHCP
- Postfix
- Squid Cache
- Zabbix
- FileZilla
IBM¶
- IBM Db2 (Database)
- IBM AIX (Operating System)
- IBM i (Operating System)
Brocade¶
- Brocade Switches
Google¶
- Google Cloud
- Pub/Sub & BigQuery
Elastic¶
- Elasticsearch
Citrix¶
- Citrix Virtual Apps and Desktops (Virtualization)
- Citrix Hypervisor (Virtualization)
- Citrix ADC, NetScaler
- Citrix Gateway (Remote access)
- Citrix SD-WAN
- Citrix Endpoint Management (MDM, MAM)
Dell¶
- Dell EMC Isilon (network-attached storage)
- Dell PowerConnect Switches
- Dell W-Series (Access points)
- Dell iDRAC
- Dell Force10 Switches
FlowMon¶
- Flowmon Collector
- Flowmon Probe
- Flowmon ADS
- Flowmon FPI
- Flowmon APM
GreyCortex¶
- GreyCortex Mendel
Huawei¶
- Huawei Routers
- Huawei Switches
- Huawei Unified Security Gateway (USG)
Synology¶
- Synology NAS
- Synology SAN
- Synology NVR
- Synology Wi-Fi routers
Ubiquiti¶
- UniFi
Avast¶
- Avast Antivirus
Kaspersky¶
- Kaspersky Endpoint Security
- Kaspersky Security Center
Kerio¶
- Kerio Connect
- Kerio Control
- Kerio Clear Web
Symantec¶
- Symantec Endpoint Protection Manager
- Symantec Messaging Gateway
ESET¶
- ESET Antivirus
- ESET Remote Administrator
AVG¶
- AVG Antivirus
Extreme Networks¶
- ExtremeXOS
IceWarp¶
- IceWarp Mail Server
Mikrotik¶
- Mikrotik Routers
- Mikrotik Switches
Pulse Secure¶
- Pulse Connect Secure SSL VPN
QNAP¶
- QNAP NAS
Safetica¶
- Safetica DLP
Veeam¶
- Veeam Backup & Replication
SuperMicro¶
- IPMI
Mongo¶
- MongoDB
YSoft¶
- SafeQ
Bitdefender¶
- Bitdefender GravityZone
- Bitdefender Network Traffic Security Analytics (NTSA)
- Bitdefender Advanced Threat Intelligence
This list is not exhaustive, as there are many other vendors and products that can send logs to TeskaLabs LogMan.io using standard protocols such as Syslog. Please contact us if you would like a specific technology to be integrated.
SQL log extraction¶
TeskaLabs LogMan.io can extract logs from various SQL databases using ODBC (Open Database Connectivity).
Among supported databases are:
- PostgreSQL
- Oracle Database
- IBM Db2
- MySQL
- SQLite
- MariaDB
- SAP HANA
- Sybase ASE
- Informix
- Teradata
- Amazon RDS (Relational Database Service)
- Google Cloud SQL
- Azure SQL Database
- Snowflake
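As an illustration, ODBC-based extraction typically combines an ODBC connection string with a periodic SQL query; the driver, server, table, and column names below are hypothetical examples, not a prescribed LogMan.io configuration:

Driver={PostgreSQL Unicode};Server=db.example.com;Port=5432;Database=appdb;Uid=log_reader;Pwd=...;

SELECT event_id, event_time, severity, message
FROM audit_log
WHERE event_time > ?   -- bound to the timestamp of the last collected row
ORDER BY event_time ASC;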
TeskaLabs LogMan.io Architecture¶
lmio-collector¶
LogMan.io Collector receives log lines from various sources such as syslog-ng, files, Windows Event Forwarding, databases via ODBC connectors, and so on. The log lines may be further processed by a declarative processor and sent to LogMan.io Ingestor via WebSocket.
lmio-ingestor¶
LogMan.io Ingestor receives events via WebSocket, transforms them into a Kafka-readable format, and puts them into the collected- Kafka topic. There are multiple ingestors for different event formats, such as syslog, databases, XML, and so on.
lmio-parser¶
LogMan.io Parser runs in multiple instances to receive different formats of incoming events (different Kafka topics) and/or the same events (the instances then run in the same Kafka group to distribute events among them). LogMan.io Parser loads the LogMan.io Library via ZooKeeper or from files to obtain declarative parsers and enrichers from the configured parsing groups.
Events that are successfully parsed by the loaded parsers are put into the lmio-events Kafka topic; otherwise, they enter the lmio-others Kafka topic.
lmio-dispatcher¶
LogMan.io Dispatcher loads events from the lmio-events Kafka topic and sends them both to all LogMan.io Correlator instances subscribed via ZooKeeper and to Elasticsearch, into the appropriate index, where all events can be queried and visualized using Kibana. LogMan.io Dispatcher runs in multiple instances as well.
lmio-correlator¶
LogMan.io Correlator uses ZooKeeper to subscribe to all LogMan.io Dispatcher instances to receive parsed events (log lines etc.). Then LogMan.io Correlator loads the LogMan.io Library from ZooKeeper or from files to create correlators based on the declarative configuration. Events produced by correlators (Window Correlator, Match Correlator) are then handed down to LogMan.io Dispatcher and LogMan.io Watcher via Kafka.
lmio-watcher¶
LogMan.io Watcher observes changes in lookups used by LogMan.io Parser and LogMan.io Correlator instances. When a change occurs, all running components that use the LogMan.io Library are notified about the change via the lmio-lookups Kafka topic, and the lookup is updated in Elasticsearch, which serves as persistent storage for all lookups.
lmio-integ¶
LogMan.io Integ allows LogMan.io to be integrated with supported external systems using the expected message format and input/output protocol.
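A simplified sketch of the event flow described above, using only the component and topic names introduced in this section:

log source → lmio-collector → (WebSocket) → lmio-ingestor → collected- Kafka topic
→ lmio-parser → lmio-events Kafka topic (parsed) or lmio-others Kafka topic (unparsed)
→ lmio-dispatcher → Elasticsearch index and subscribed lmio-correlator instances
→ lmio-correlator → lmio-dispatcher and lmio-watcher (via Kafka)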
Support¶
Live help¶
Our team is available at our live support channel on Slack. You can message our internal experts directly, consult on your plans, problems, and challenges, and even get live online help over screen sharing, so you don't need to worry about major upgrades and similar operations. Access is provided to customers with an active support plan.
Email support¶
Contact us at: support@teskalabs.com
Support hours¶
The 5/8 support level is available on working days according to the Czech calendar, 09:00–18:00 Central European Time (Europe/Prague).
The 24/7 support level is also available, depending on your active support plan.
User Manual
Welcome¶
What's in the User Manual?
Here, you can learn how to use the TeskaLabs LogMan.io app. For information about setup, configuration, and maintenance, visit the Administration Manual or the Reference guide. If you can't find the help you need, contact Support.
Quickstart¶
Jump to:
- Get an overview of all events in your system (Home)
- Read incoming logs, and filter logs by field and time (Discover)
- View and filter your data as charts and graphs (Dashboards)
- View and print reports (Reports)
- Run, download, and manage exports (Export)
- Change your general or account settings
Some features are only visible to administrators, so you might not see all of the features that are included in the User Manual in your own version of TeskaLabs LogMan.io.
Administrator quickstart¶
Are you an administrator? Jump to:
- Add or edit files in the library, such as dashboards, reports, and exports (Library)
- Add or edit lookups (Lookups)
- Access external components that work with TeskaLabs LogMan.io (Tools)
- Change the configuration of your interface (Configuration)
- See microservices (Services)
- Manage user permissions (Auth)
Settings¶
Use these controls in the top right corner of your screen to change settings:
Tenants¶
A tenant is one entity collecting data from a group of sources. When you're using the program, you can only see the data belonging to the selected tenant. A tenant's data is completely separated from all other tenants' data in TeskaLabs LogMan.io (learn about multitenancy). Your company might have just one tenant, or possibly multiple tenants (for different departments, for example). If you're distributing or managing TeskaLabs LogMan.io for other clients, you have multiple tenants, at least one per client.
Tenants can be accessible by multiple users, and users can have access to multiple tenants. Learn more about tenancy here.
Tips¶
If you're new to log collection, click on the tip boxes to learn why you might want to use a feature.
Why use TeskaLabs LogMan.io?
TeskaLabs LogMan.io collects logs, which are records of every single event in your network system. This information can help you:
- Understand what's happening in your network
- Troubleshoot network problems
- Investigate security issues
Managing your account¶
Your account name is at the top right corner of your screen:
Changing your password¶
- Click on your account name.
- Click Change a password.
- Enter your current password and new password.
- Click Set password.
You should see confirmation of your password change. To return to the page you were on before changing your password, click Go back.
Changing account information¶
- Click on your account name.
- Click Manage.
- Here you can:
- Change your password
- Change your email address
- Change or add your phone number
- Log out
- Click on what you want to do, and make your changes. The changes won't be visible immediately - they'll be visible when you log out and log back in.
Seeing your access permissions¶
- Click on your account name.
- Click Access control, and you'll see what permissions you have.
Logging out¶
- Click on your account name.
- Click Logout.
You can also log out from the Manage screen.
Logging out from all devices¶
- Click on your account name.
- Click Manage.
- Click Logout from all devices.
When you log out, you'll be automatically taken to the login screen.
Using the Home page¶
The Home page gives you an overview of your data sources and critical incoming events. You'll be on the Home page by default when you log in, but you can also get to the Home page from the buttons on the left.
Viewing options¶
Chart and list view¶
To switch between chart and list view, click the list button.
Getting more details¶
Clicking on any portion of a chart takes you to Discover, where you then see the list of logs that make up this portion of the chart. From there, you can examine and filter these logs.
You can see here that Discover is automatically filtering for events from the selected dataset (from the chart on the Home page), event.dataset:devolutions.
Using Discover¶
Discover gives you an overview of all logs being collected in real time. Here, you can filter the data by time and field.
Navigating Discover¶
Terms¶
Total count: The total number of logs in the timeframe being shown.
Aggregated by: In the bar chart, each bar represents the count of logs collected within a time interval. Use Aggregated by to choose the time interval. For example, Aggregated by: 30m means that each bar in the bar chart shows the count of all of the logs collected in a 30 minute timeframe. If you change to Aggregated by: hour, then each bar represents one hour of logs. The available options change based on the overall timeframe you are viewing in Discover.
Filtering data¶
Change the timeframe from which logs appear, and filter logs by field.
Tip: Why filter data?
Logs contain a lot of information, more than you need to accomplish most tasks. When you filter data, you choose which information you see. This can help you learn more about your network, identify trends, and even hunt for threats.
Examples:
- You want to see login data from just one user, so you filter the data to show logs containing their username.
- You had a security event on Wednesday night, and you want to learn more about it, so you filter the data to show logs from that time period.
- You notice you don't see any data from one of your network devices. You can filter the data to see all the logs from just that device. Now, you can see when the data stopped coming, and what the last event was that might have caused the problem.
Changing the timeframe¶
You can view logs from a specified timeframe. Set the timeframe by choosing start and end points using this tool:
Remember: Once you change the timeframe, press the blue refresh button to update your page.
Using the time setting tool¶
Setting a relative start/end point¶
To set the start or end point to a time relative to now, use the Relative tab.
Quick time settings
Use the quick now- ("now minus") options to set the timeframe to a preset with one click. Selecting one of these options affects both the start and end point. For example, if you choose now-1 week, the start point will be one week ago, and the end point will be "now." Choosing a now- option from the end point does the same thing as choosing a now- option from the start point. (You can't use the now- options to set the end point to anything besides "now.")
Drop-down options
To set a relative time (such as 15 minutes ago) for the start or end point, use the relative time options below the quick setting options. Select your unit of time from the drop-down list, and type or click to set your desired number.
To confirm your choice, click Set relative time, and view the logs by clicking on the refresh button.
Example shown: This selection will show logs collected starting from one day ago until now.
Setting an exact start/end point¶
To choose the exact day and time for the start or end point, use the Absolute tab and select a date and time on the calendar.
To confirm your choice, click Set date.
Example shown: This selection will show logs collected starting from June 7, 2023 at 6:00 until now.
Auto refresh¶
To update the view automatically at a set time interval, choose a refresh rate:
Refresh¶
To reload the view with your changes, click the blue refresh button.
Note: Don't choose "Now" as your start point. Since the program can't show data newer than "now," it's not valid, so you'll see an error message.
Using the time selector¶
To select a more specific time period within the current timeframe, click and drag on the graph.
Filtering by field¶
In Discover, you can filter data by any field in multiple ways.
Using the field list¶
Use the search bar to find the field you want, or scroll through the list.
Isolating fields¶
To choose which fields you see in the log list, click the + symbol next to the field name. You can select multiple fields.
Seeing all occurring values in one field¶
To see a percentage breakdown of all the values from one field, click the magnifying glass next to the field name (the magnifying glass appears when you hover over the field name).
Tip: What does this mean?
This list of values from the field http.response.status_code compares how often users are getting certain http response codes. 51.4% of the time, users are getting a 404 code, meaning the webpage wasn't found. 43.3% of the time, users are getting a 200 code, which means that the request succeeded. The high percentage of "not found" response codes could inform a website administrator that one or more of their frequently clicked links are broken.
Viewing and filtering log details¶
To view the details of individual logs as a table or in JSON, click the arrow next to the timestamp. You can apply filters using the field names in the table view.
Filtering from the expanded table view¶
You can use controls in the table view to filter logs:
- Filter for logs that contain the same value in the selected field (update_item in action in the example)
- Filter for logs that do NOT contain the same value in the selected field (update_item in action in the example)
- Show a percentage breakdown of values in this field (the same function as the magnifying glass in the fields list on the left)
- Add to the list of displayed fields for all visible logs (the same function as in the fields list on the left)
Query bar¶
You can filter by field (not time) using the query bar. The query bar tells you which query language to use. The query language depends on your data source. Use Lucene Query Syntax for data stored using Elasticsearch.
After you type your query, set the timeframe and click the refresh button. Your filters will be applied to the visible incoming logs.
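For example, a Lucene query such as the following (the field values are illustrative) narrows the view to Microsoft Office 365 logs with a 404 response code:

event.dataset:"microsoft-office-365" AND http.response.status_code:404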
Investigating IP addresses¶
You can investigate IP addresses using external analysis tools. You might want to do this, for example, if you see multiple suspicious logins from one IP address.
Using external IP analysis tools
1. Click on the IP address you want to analyze.
2. Click on the tool you want to use. You'll be taken to the tool's website, where you can see the results of the IP address analysis.
Using Dashboards¶
A dashboard is a set of charts and graphs that represent data from your system. Dashboards allow you to quickly get a sense for what's going on in your network.
Your administrator sets up dashboards based on the data sources and fields that are most useful to you. For example, you might have a dashboard that shows graphs related only to email activity, or only to login attempts. You might have many dashboards for different purposes.
You can filter the data to change which data the dashboard shows within its preset constraints.
How can dashboards help me?
By having certain data arranged into a chart, table, or graph, you can get a visual overview of activity within your system and identify trends. In this example, you can see that a high volume of emails were sent and received on June 19th.
Navigating Dashboards¶
Opening a dashboard¶
To open a dashboard, click on its name.
Dashboard controls¶
Setting the timeframe¶
You can change the timeframe the dashboard represents. Find the time-setting guide here. To refresh the dashboard with your new timeframe, click on the refresh button.
Note: There is no auto-refresh rate in Dashboards.
Filtering dashboard data¶
To filter the data the dashboard shows, use the query bar. The query language you need to use depends on your data source. The query bar tells you which query language to use. Use Lucene Query Syntax for data stored using Elasticsearch.
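For instance, a Lucene query like the following (the dataset name is illustrative) limits every widget on the dashboard to Microsoft Office 365 logs:

event.dataset:"microsoft-office-365"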
Moving widgets¶
You can reposition and resize each widget. To move widgets, click on the dashboard menu button and select Edit.
To move a widget, click anywhere on the widget and drag. To resize a widget, click on the widget's bottom right corner and drag.
To save your changes, click the green save button. To cancel the changes, click the red cancel button.
Printing dashboards¶
To print a dashboard, click on the dashboard menu button and select Print. Your browser opens a window, and you can choose your print settings there.
Reports¶
Reports are printer-friendly visual representations of your data, like printable dashboards. Your administrator chooses what information goes into your reports based on your needs.
Find and print a report¶
- Select the report from your list, or use the search bar to find your report.
- Click Print. Your browser opens a print window where you can choose your print settings.
Using Export¶
Turn sets of logs into downloadable, sendable files in Export. You can keep these files on your computer, inspect them in another program, or send them via email.
What is an export?
An export is not a file, but a process that creates a file. The export contains and follows your instructions for which data to put in the file, what type of file to create, and what to do with the file. When you run the export, you create the file.
Why would I export logs?
Being able to see a group of logs in one file can help you inspect the data more closely. A few reasons you might want to export logs are:
- To investigate an event or attack
- To send data to an analyst
- To explore the data in a program outside TeskaLabs LogMan.io
Navigating Export¶
List of exports
The List of exports shows you all the exports that have been run.
From the list page, you can:
- See an export's details by clicking on the export's name
- Download the export by clicking on the cloud beside its name
- Delete the export by clicking on the trash can beside its name
- Search for exports using the search bar
Export status is color-coded:
- Green: Completed
- Yellow: In progress
- Blue: Scheduled
- Red: Failed
Jump to:¶
Run an export¶
Running an export adds the export to your List of exports, but it does not automatically download the export. See Download an export for instructions.
Run an export based on a preset¶
1. Click New on the List of exports page. Now you can see the preset exports:
2. To run a preset export, click the run button beside the export name.
OR
2. To edit the export before running, click on the edit button beside the export name. Make your changes, and then click Start. (Use this guide to learn about making changes.)
Once you run the export, you are automatically brought back to the list of exports, and your export appears at the top of the list.
Note
Export presets are created by administrators.
Run an export based on an export you've run before¶
You can re-run an export. Running an export again does not overwrite the original export.
1. On the List of exports page, click on the name of the export you want to run again.
2. Click Restart.
3. You can make changes here (see this guide) or run as-is.
4. Click Start.
Once you run the export, you are automatically brought back to the list of exports, and your new export appears at the top of the list.
Create a new export¶
Create an export from a blank form¶
1. In List of exports, click New, then click Custom.
2. Fill in the fields.
Note
The options in the drop down menus might change based on the selections you make.
Name
Name the export.
Data Source
Select your data source from the drop-down list.
Output
Choose the file type for your data. It can be:
- Raw: If you want to download the export and import the logs into different software, choose raw. If the data source is Elasticsearch, the raw file format is .json.
- .csv: Comma-separated values
- .xlsx: Microsoft Excel format
Compression
Choose to zip your export file, or leave it uncompressed. A zipped file is compressed, and therefore smaller, so it's easier to send and takes up less space on your computer.
Target
Choose the target for your file. It can be:
- Download: A file you can download to your computer.
- Email: Fill in the email fields. When you run the export, the email sends. You can still download the data file any time in the List of exports.
- Jupyter: Saves the file in the Jupyter notebook, which you can access through the Tools page. You need to have administrator permissions to access the Jupyter notebook, so only choose Jupyter as the target if you're an administrator.
Separator
If you select .csv as your output, choose what character will mark the separation between each value in each log. Even though CSV means comma-separated values, you can choose to use a different separator, such as a semicolon or space.
Schedule (optional)¶
To schedule the export, rather than running it immediately, click Add schedule.
- Schedule once: To run the export one time at a future time, type the desired date and time in YYYY-MM-DD HH:mm format, for example 2023-12-31 23:59 (December 31st, 2023, at 23:59).
- Schedule a recurring export: To set up the export to run automatically on a regular schedule, use cron syntax. You can learn more about cron from Wikipedia, and use this tool and these examples by Cronitor to help you write cron expressions.
- The Schedule field also supports random R usage and Vixie cron-style @ keyword expressions.
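A few illustrative cron expressions (the schedules themselves are only examples):

0 6 * * 1 — runs the export every Monday at 06:00
*/30 * * * * — runs the export every 30 minutes
@daily — Vixie cron-style shortcut that runs the export once a day at midnight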
Query
Type a query to filter for certain data. The query determines which data to export, including the timeframe of the logs.
Warning
You must include a query in every export. If you run an export without a query, all of the data stored in your program will be exported with no filter for time or content. This could create an extremely large file and put strain on data storage components, and the file likely won't be useful to you or to analysts.
If you accidentally run an export without a query, you can delete the export while it's still running in the List of exports by clicking on the trash can button.
TeskaLabs LogMan.io uses the Elasticsearch Query DSL (Domain Specific Language).
Here's the full guide to the Elasticsearch Query DSL.
Example of a query:
{
  "bool": {
    "filter": [
      {
        "range": {
          "@timestamp": {
            "gte": "now-1d/d",
            "lt": "now/d"
          }
        }
      },
      {
        "prefix": {
          "event.dataset": {
            "value": "microsoft-office-365"
          }
        }
      }
    ]
  }
}
Query breakdown:
bool: This tells us that the whole query is a Boolean query, which combines multiple conditions such as "must," "should," and "must not." Here, it uses filter to find characteristics the data must have to make it into the export. filter can have multiple conditions.
range is the first filter condition. Since it refers to the field below it, which is @timestamp, it filters logs based on a range of values in the timestamp field.
@timestamp tells us that the query is filtering by time, so it will export logs from a certain timeframe.
gte: This means "greater than or equal to," which is set to the value now-1d/d, meaning the earliest timestamp (the first log) will be from exactly one day ago at the moment you run the export.
lt means "less than," and it is set to now/d, so the latest timestamp (the last log) will be the newest at the moment you run the export ("now").
prefix is the second filter condition. It looks for logs where the value of a field, in this case event.dataset, starts with microsoft-office-365.
So, what does this query mean?
This export will show all logs from Microsoft Office 365 from the last 24 hours.
3. Add columns
For .csv and .xlsx files, you need to specify what columns you want to have in your document. Each column represents a data field. If you don't specify any columns, the resulting table will have all possible columns, so the table might be much bigger than you expect or need it to be.
You can see the list of all available data fields in Discover. To find which fields are relevant to the logs you're exporting, inspect an individual log in Discover.
- To add a column, click Add. Type the name of the column.
- To delete a column, click -.
- To reorder the columns, click and drag the arrows.
Warning
Pressing enter after typing a column name will run the export.
This example was downloaded from the export shown above as a .csv file, then separated into columns using the Microsoft Excel Convert Text to Columns Wizard. You can see that the columns here match the columns specified in the export.
4. Run the export by pressing Start.
Once you run the export, you are automatically brought back to the list of exports, and your export appears at the top of the list.
Download an export¶
1. On the List of exports page, click on the cloud button to download.
OR
1. On the List of exports page, click on the export's name.
2. Click Download.
Your browser should automatically start a download.
Delete an export¶
1. On the List of exports page, click on the trash can button.
OR
1. On the List of exports page, click on the export's name.
2. Click Remove.
The export should disappear from your list.
Add an export to your library¶
Note
This feature is only available to administrators.
If you like an export you've created or edited, you can save it to your library as a preset for future use.
1. On the List of exports page, click on the export's name.
2. Click Save to Library.
When you click on New from the List of exports page, your new export preset should be in the list.
All features
Home page¶
The Home page gives you an overview of your data sources and critical incoming events.
Viewing options¶
Chart and list view¶
To switch between chart and list view, click the list button.
Getting more details¶
Clicking on any portion of a chart takes you to Discover, where you then see the list of logs that make up this portion of the chart. From there, you can examine and filter these logs.
You can see here that Discover is automatically filtering for events from the selected dataset (from the chart on the Home page), event.dataset:devolutions.
Discover¶
Discover gives you an overview of all logs being collected in real time. Here, you can filter the data by time and field.
Navigating Discover¶
Terms¶
Total count: The total number of logs in the timeframe being shown.
Aggregated by: In the bar chart, each bar represents the count of logs collected within a time interval. Use Aggregated by to choose the time interval. For example, Aggregated by: 30m means that each bar in the bar chart shows the count of all of the logs collected in a 30 minute timeframe. If you change to Aggregated by: hour, then each bar represents one hour of logs. The available options change based on the overall timeframe you are viewing in Discover.
Filtering data¶
Change the timeframe from which logs appear, and filter logs by field.
Tip: Why filter data?
Logs contain a lot of information, more than you need to accomplish most tasks. When you filter data, you choose which information you see. This can help you learn more about your network, identify trends, and even hunt for threats.
Examples:
- You want to see login data from just one user, so you filter the data to show logs containing their username.
- You had a security event on Wednesday night, and you want to learn more about it, so you filter the data to show logs from that time period.
- You notice you don't see any data from one of your network devices. You can filter the data to see all the logs from just that device. Now, you can see when the data stopped coming, and what the last event was that might have caused the problem.
Changing the timeframe¶
You can view logs from a specified timeframe. Set the timeframe by choosing start and end points using this tool:
Remember: Once you change the timeframe, press the blue refresh button to update your page.
Using the time setting tool¶
Setting a relative start/end point¶
To set the start or end point to a time relative to now, use the Relative tab.
Quick time settings
Use the quick now- ("now minus") options to set the timeframe to a preset with one click. Selecting one of these options affects both the start and end point. For example, if you choose now-1 week, the start point will be one week ago, and the end point will be "now." Choosing a now- option from the end point does the same thing as choosing a now- option from the start point. (You can't use the now- options to set the end point to anything besides "now.")
Drop-down options
To set a relative time (such as 15 minutes ago) for the start or end point, use the relative time options below the quick setting options. Select your unit of time from the drop-down list, and type or click to set your desired number.
To confirm your choice, click Set relative time, and view the logs by clicking on the refresh button.
Example shown: This selection will show logs collected starting from one day ago until now.
Setting an exact start/end point¶
To choose the exact day and time for the start or end point, use the Absolute tab and select a date and time on the calendar.
To confirm your choice, click Set date.
Example shown: This selection will show logs collected starting from June 7, 2023 at 6:00 until now.
Auto refresh¶
To update the view automatically at a set time interval, choose a refresh rate:
Refresh¶
To reload the view with your changes, click the blue refresh button.
Note: Don't choose "Now" as your start point. Since the program can't show data newer than "now," it's not valid, so you'll see an error message.
Using the time selector¶
To select a more specific time period within the current timeframe, click and drag on the graph.
Filtering by field¶
In Discover, you can filter data by any field in multiple ways.
Using the field list¶
Use the search bar to find the field you want, or scroll through the list.
Isolating fields¶
To choose which fields you see in the log list, click the + symbol next to the field name. You can select multiple fields.
Seeing all occurring values in one field¶
To see a percentage breakdown of all the values from one field, click the magnifying glass next to the field name (the magnifying glass appears when you hover over the field name).
Tip: What does this mean?
This list of values from the field http.response.status_code compares how often users are getting certain http response codes. 51.4% of the time, users are getting a 404 code, meaning the webpage wasn't found. 43.3% of the time, users are getting a 200 code, which means that the request succeeded. The high percentage of "not found" response codes could inform a website administrator that one or more of their frequently clicked links are broken.
Viewing and filtering log details¶
To view the details of individual logs as a table or in JSON, click the arrow next to the timestamp. You can apply filters using the field names in the table view.
Filtering from the expanded table view¶
You can use controls in the table view to filter logs:
- Filter for logs that contain the same value in the selected field (update_item in action in the example)
- Filter for logs that do NOT contain the same value in the selected field (update_item in action in the example)
- Show a percentage breakdown of values in this field (the same function as the magnifying glass in the fields list on the left)
- Add to the list of displayed fields for all visible logs (the same function as in the fields list on the left)
Query bar¶
You can filter by field (not time) using the query bar. The query bar tells you which query language to use. The query language depends on your data source. Use Lucene Query Syntax for data stored using Elasticsearch.
After you type your query, set the timeframe and click the refresh button. Your filters will be applied to the visible incoming logs.
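For example, a Lucene query such as the following (the field values are illustrative) narrows the view to Microsoft Office 365 logs with a 404 response code:

event.dataset:"microsoft-office-365" AND http.response.status_code:404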
Investigating IP addresses¶
You can investigate IP addresses using external analysis tools. You might want to do this, for example, if you see multiple suspicious logins from one IP address.
Using external IP analysis tools
1. Click on the IP address you want to analyze.
2. Click on the tool you want to use. You'll be taken to the tool's website, where you can see the results of the IP address analysis.
Dashboards¶
A dashboard is a set of charts and graphs that represent data from your system. Dashboards allow you to quickly get a sense for what's going on in your network.
Your administrator sets up dashboards based on the data sources and fields that are most useful to you. For example, you might have a dashboard that shows graphs related only to email activity, or only to login attempts. You might have many dashboards for different purposes.
You can filter the data to change which data the dashboard shows within its preset constraints.
How can dashboards help me?
By having certain data arranged into a chart, table, or graph, you can get a visual overview of activity within your system and identify trends. In this example, you can see that a high volume of emails were sent and received on June 19th.
Navigating Dashboards¶
Opening a dashboard¶
To open a dashboard, click on its name.
Dashboard controls¶
Setting the timeframe¶
You can change the timeframe the dashboard represents. Find the time-setting guide here. To refresh the dashboard with your new timeframe, click on the refresh button.
Note: There is no auto-refresh rate in Dashboards.
Filtering dashboard data¶
To filter the data the dashboard shows, use the query bar. The query language you need to use depends on your data source. The query bar tells you which query language to use. Use Lucene Query Syntax for data stored using Elasticsearch.
The example above uses Lucene Query Syntax.
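For instance, a Lucene query like the following (the dataset name is illustrative) limits every widget on the dashboard to Microsoft Office 365 logs:

event.dataset:"microsoft-office-365"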
Moving widgets¶
You can reposition and resize each widget. To move widgets, click on the dashboard menu button and select Edit.
To move a widget, click anywhere on the widget and drag. To resize a widget, click on the widget's bottom right corner and drag.
To save your changes, click the green save button. To cancel the changes, click the red cancel button.
Printing dashboards¶
To print a dashboard, click on the dashboard menu button and select Print. Your browser opens a window, and you can choose your print settings there.
Reports¶
Reports are printer-friendly visual representations of your data, like printable dashboards. Your administrator chooses what information goes into your reports based on your needs.
Find and print a report¶
- Select the report from your list, or use the search bar to find your report.
- Click Print. Your browser opens a print window where you can choose your print settings.
Export¶
Turn sets of logs into downloadable, sendable files in Export. You can keep these files on your computer, inspect them in another program, or send them via email.
What is an export?
An export is not a file, but a process that creates a file. The export contains and follows your instructions for which data to put in the file, what type of file to create, and what to do with the file. When you run the export, you create the file.
Why would I export logs?
Being able to see a group of logs in one file can help you inspect the data more closely. A few reasons you might want to export logs are:
- To investigate an event or attack
- To send data to an analyst
- To explore the data in a program outside TeskaLabs LogMan.io
Navigating Export¶
List of exports
The List of exports shows you all the exports that have been run.
From the list page, you can:
- See an export's details by clicking on the export's name
- Download the export by clicking on the cloud beside its name
- Delete the export by clicking on the trash can beside its name
- Search for exports using the search bar
Export status is color-coded:
- Green: Completed
- Yellow: In progress
- Blue: Scheduled
- Red: Failed
Jump to:¶
Run an export¶
Running an export adds the export to your List of exports, but it does not automatically download the export. See Download an export for instructions.
Run an export based on a preset¶
1. Click New on the List of exports page. Now you can see the preset exports:
2. To run a preset export, click the run button beside the export name.
OR
2. To edit the export before running, click on the edit button beside the export name. Make your changes, and then click Start. (Use this guide to learn about making changes.)
Once you run the export, you are automatically brought back to the list of exports, and your export appears at the top of the list.
Note
Presets are created by administrators.
Run an export based on an export you've run before¶
You can re-run an export. Running an export again does not overwrite the original export.
1. On the List of exports page, click on the name of the export you want to run again.
2. Click Restart.
3. You can make changes here (see this guide) or run as-is.
4. Click Start.
Once you run the export, you are automatically brought back to the list of exports, and your new export appears at the top of the list.
Create a new export¶
Create an export from a blank form¶
1. In List of exports, click New, then click Custom.
2. Fill in the fields.
Note
The options in the drop down menus might change based on the selections you make.
Name
Name the export.
Data Source
Select your data source from the drop-down list.
Output
Choose the file type for your data. It can be:
- Raw: If you want to download the export and import the logs into different software, choose raw. If the data source is Elasticsearch, the raw file format is .json.
- .csv: Comma-separated values
- .xlsx: Microsoft Excel format
Compression
Choose to zip your export file, or leave it uncompressed. A zipped file is compressed, and therefore smaller, so it's easier to send and takes up less space on your computer.
Target
Choose the target for your file. It can be:
- Download: A file you can download to your computer.
- Email: Fill in the email fields. When you run the export, the email sends. You can still download the data file any time in the List of exports.
- Jupyter: Saves the file in the Jupyter notebook, which you can access through the Tools page. You need to have administrator permissions to access the Jupyter notebook, so only choose Jupyter as the target if you're an administrator.
Separator
If you select .csv as your output, choose what character will mark the separation between each value in each log. Even though CSV means comma-separated values, you can choose to use a different separator, such as a semicolon or space.
Schedule (optional)¶
To schedule the export, rather than running it immediately, click Add schedule.
- Schedule once: To run the export one time at a future time, type the desired date and time in YYYY-MM-DD HH:mm format, for example 2023-12-31 23:59 (December 31st, 2023, at 23:59).
- Schedule a recurring export: To set up the export to run automatically on a regular schedule, use cron syntax. You can learn more about cron from Wikipedia, and use this tool and these examples by Cronitor to help you write cron expressions.
- The Schedule field also supports random R usage and Vixie cron-style @ keyword expressions.
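A few illustrative cron expressions (the schedules themselves are only examples):

0 6 * * 1 — runs the export every Monday at 06:00
*/30 * * * * — runs the export every 30 minutes
@daily — Vixie cron-style shortcut that runs the export once a day at midnight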
Query
Type a query to filter for certain data. The query determines which data to export, including the timeframe of the logs.
Warning
You must include a query in every export. If you run an export without a query, all of the data stored in your program will be exported with no filter for time or content. This could create an extremely large file and put strain on data storage components, and the file likely won't be useful to you or to analysts.
If you accidentally run an export without a query, you can delete the export while it's still running in the List of exports by clicking on the trash can button.
TeskaLabs LogMan.io uses the Elasticsearch Query DSL (Domain Specific Language).
Here's the full guide to the Elasticsearch Query DSL.
Example of a query:
{
  "bool": {
    "filter": [
      {
        "range": {
          "@timestamp": {
            "gte": "now-1d/d",
            "lt": "now/d"
          }
        }
      },
      {
        "prefix": {
          "event.dataset": {
            "value": "microsoft-office-365"
          }
        }
      }
    ]
  }
}
Query breakdown:
bool: This tells us that the whole query is a Boolean query, which combines multiple conditions such as "must," "should," and "must not." Here, it uses filter to find characteristics the data must have to make it into the export. filter can have multiple conditions.
range is the first filter condition. Since it refers to the field below it, which is @timestamp, it filters logs based on a range of values in the timestamp field.
@timestamp tells us that the query is filtering by time, so it will export logs from a certain timeframe.
gte: This means "greater than or equal to," which is set to the value now-1d/d, meaning the earliest timestamp (the first log) will be from exactly one day ago at the moment you run the export.
lt means "less than," and it is set to now/d, so the latest timestamp (the last log) will be the newest at the moment you run the export ("now").
prefix is the second filter condition. It looks for logs where the value of a field, in this case event.dataset, starts with microsoft-office-365.
So, what does this query mean?
This export will show all logs from Microsoft Office 365 from the last 24 hours.
3. Add columns
For .csv and .xlsx files, you need to specify what columns you want to have in your document. Each column represents a data field. If you don't specify any columns, the resulting table will have all possible columns, so the table might be much bigger than you expect or need it to be.
You can see the list of all available data fields in Discover. To find which fields are relevant to the logs you're exporting, inspect an individual log in Discover.
- To add a column, click Add. Type the name of the column.
- To delete a column, click -.
- To reorder the columns, click and drag the arrows.
Warning
Pressing enter after typing a column name will run the export.
This example was downloaded from the export shown above as a .csv file, then separated into columns using the Microsoft Excel Convert Text to Columns Wizard. You can see that the columns here match the columns specified in the export.
4. Run the export by pressing Start.
Once you run the export, you are automatically brought back to the list of exports, and your export appears at the top of the list.
Download an export¶
1. On the List of exports page, click on the cloud button to download.
OR
1. On the List of exports page, click on the export's name.
2. Click Download.
Your browser should automatically start a download.
Delete an export¶
1. On the List of exports page, click on the trash can button.
OR
1. On the List of exports page, click on the export's name.
2. Click Remove.
The export should disappear from your list.
Add an export to your library¶
Note
This feature is only available to administrators.
If you like an export you've created or edited, you can save it to your library as a preset for future use.
1. On the List of exports page, click on the export's name.
2. Click Save to Library.
When you click on New from the List of exports page, your new export preset should be in the list.
Library¶
Administrator feature
The Library is an administrator feature. The Library has a significant impact on the way TeskaLabs LogMan.io works. Some users don't have access to the Library.
The Library holds items (files) that determine what you see when using TeskaLabs LogMan.io. The items in the Library determine, for example, your homepage, dashboards, reports, exports, and some SIEM functions.
When you receive TeskaLabs LogMan.io, the Library is already filled with files. You can change these according to your needs.
The Library supports these file types:
- .html
- .json
- .md
- .txt
- .yaml
- .yml
Warning
Changing items in the Library impacts how TeskaLabs LogMan.io and TeskaLabs SIEM work. If you are unsure about making changes in the Library, contact Support.
Navigating the Library¶
Some items have additional options in the upper right corner of the screen:
Locating items¶
To find an item, use the search bar, or click through the folders.
If you navigate to a folder in the Library and want to return to the search bar, click Library again.
Adding items to the Library¶
Warning
Do NOT attempt to add single items to the library with the Restore function. Restore is only for importing a whole library.
Creating items in a folder¶
You can create an item directly in certain folders. If adding an item is possible, you'll see a Create new item in (folder) button when you click on the folder.
- To add an item, click Create new item in (folder).
- Name the item, select the file extension from the dropdown, and click Create.
- If the item doesn't appear immediately, refresh the page, and your item should appear in the library.
Adding an item by duplicating an existing item¶
- Click on the item you want to duplicate.
- Click on the ... button near the top.
- Click Copy.
- Rename the item, choose the file extension from the dropdown, and click Copy.
- If the item doesn't appear immediately, refresh the page, and your item should appear in the library.
Editing an item in the Library¶
- Click on the item you want to edit.
- To edit the item, click Edit.
- To save your changes, click Save, or exit the editor without saving by clicking Cancel.
- If your edits don't display immediately, refresh the page, and your changes should be saved.
Removing an item from the Library¶
- Click on the item you want to remove.
- Click on the ... button near the top.
- Click Remove and confirm Yes if your browser prompts.
- If the item doesn't disappear immediately, refresh the page, and the removed item should be gone.
Disabling items¶
You can temporarily disable an item. It stays in your library, but its effect on your system is paused.
To disable an item, click on the item and click Disable.
You can re-enable the file any time by clicking Enable.
Note
You can't read the contents of an item while it's disabled.
Backing up the Library¶
You can back up your whole Library onto your computer or other external storage by exporting the Library.
To export and download the contents of the Library, click Actions, then click Backup. Your browser will start the download.
Restoring the library from backup¶
Warning
Using Restore means importing a whole library from your computer. Restore is intended to restore your library from a backup version, so it will overwrite (delete) the existing contents of your Library in TeskaLabs LogMan.io. ONLY restore the Library if you intend to replace the entire contents of the Library with the files you're importing.
Restoring¶
- Click Actions.
- Click Restore.
- Choose the file from your computer. You can only import tar.gz files.
- Click Import.
Remember, using Restore and Import overwrites your whole library.
Lookups¶
Administrator feature
Lookups are an administrator feature. Some users don't have access to Lookups.
You can use lookups to get and store additional information from external sources. The additional information enhances your data and adds relevant context. This makes your data more valuable because you can analyze the data more deeply. For example, you can store user names, active users, active VPNs, and suspicious IP addresses.
Tip
You can read more about Lookups here in the Reference guide.
Navigating Lookups¶
Creating a new lookup¶
To create a new lookup:
- Click Create lookup.
- Fill in the fields: Name, Short description, Detail description, and Key(s).
- To add another key, click on the +.
- Choose to add or not add an expiration.
- Click Save.
Finding a lookup¶
Use the search bar to find a specific lookup. Using the search bar does not search the contents of the lookups, only the lookup names. To view all the lookups again after using the search bar, clear the search bar and press Enter or Return.
Viewing and editing a lookup's details¶
Viewing a lookup's keys/items¶
To see a lookup's keys and values, or items, click on the ... button, and click Items.
Editing a lookup's keys/items¶
From the List of lookups, click on the ... button and click Items. This takes you to the individual lookup's page.
Adding: To add an item, click Add item.
Editing: To edit an existing item, click the ... button on the item line, and click Edit.
Deleting: To delete the item, click the ... button on the item line, and click Delete.
Remember to click Save after making changes.
Viewing a lookup's description¶
To see the detailed description of a lookup, click on the ... button on the List of lookups page, and click Info.
Editing a lookup's description¶
- Click on the ... button on the List of lookups page, and click Info. This takes you to the lookup's info page.
- Click Edit lookup at the bottom.
- After making changes, click Save, or click Cancel to exit editing mode.
Deleting a lookup¶
To delete a lookup:
-
Click on the ... button on the List of lookups page, and click Info. This takes you to the lookup's info page.
-
Click Delete lookup.
Tools¶
Administrator feature
Tools are an administrator feature. Changes you make when visiting external tools can have a significant impact on the way TeskaLabs LogMan.io works. Some users don't have access to the Tools page.
The Tools page gives you quick access to external programs that interact with or can be used alongside TeskaLabs LogMan.io.
Using external tools¶
To automatically log in securely to a tool, click on the tool's icon.
Warnings
- While tenants' data is separated in the TeskaLabs LogMan.io UI, tenants' data is not separated within these tools.
- Changes you make in Zookeeper, Kafka, and Kibana could damage your deployment of TeskaLabs LogMan.io.
Maintenance¶
Administrator feature
Maintenance is an administrator feature. What you do in Maintenance has a significant impact on the way TeskaLabs LogMan.io works. Some users don't have access to Maintenance.
The Maintenance section includes Configuration and Services.
Configuration¶
Configuration holds JSON files that determine some of the components you can see and use in TeskaLabs LogMan.io. For example, Configuration includes:
- The Discover page
- The sidebar
- Tenants
- The Tools page
Warning
Configuration files have a significant impact on the way TeskaLabs LogMan.io works. If you need help with your UI configuration, contact Support.
Basic and Advanced modes¶
You can switch between Basic and Advanced mode for configuration files.
Basic has fillable fields. Advanced shows the file in JSON. To choose a mode, click Basic or Advanced in the upper right corner.
Editing a configuration file¶
To edit a configuration file, click on the file name, choose your preferred mode, and make the changes. The file is always editable - you don't have to click anything to begin editing. Remember to click Save when you're finished.
Services¶
Services shows you all of the services and microservices ("mini programs") that make up the infrastructure of TeskaLabs LogMan.io.
Warning
Since TeskaLabs LogMan.io is made of microservices, interfering with the microservices could have a significant impact on the performance of the program. If you need help with microservices, contact Support.
Viewing service details¶
To view a service's details, click the arrow to the left of the service name.
Auth: Controlling user access¶
Administrator feature
Auth is an administrator feature. It has a significant impact on the people using TeskaLabs LogMan.io. Some users don't have access to the Auth pages.
The Auth (authorization) section includes all the controls administrators need to manage users and tenants.
Credentials¶
Credentials are users. From the Credentials screen, you can see:
- Name: The username that someone uses to log in
- Tenants: The tenants this user has access to
- Roles: The set of permissions this user has (see Roles)
Creating new credentials¶
1. To create a new user, click Create new credentials.
2. In the Create tab, enter a username. If you want to send the person an email inviting them to reset their password, enter their email address and check Send instructions to set password.
3. Click Create credentials.
The new credentials appear in the Credentials list. If you checked Send instructions to set password, the new user should receive an email.
Editing credentials¶
To edit a credential, click on a username, and click Edit in the section you want to change. Remember to click Save to save your changes, or click Cancel to exit the editor.
Tenants¶
A tenant is one entity collecting data from a group of sources. Each tenant has an isolated space to collect and manage its data. (Every tenant's data is completely separated from all other tenants' data in the UI.) One deployment of TeskaLabs LogMan.io can handle many tenants (multitenancy).
As a user, your company might be just one tenant, or you might have different tenants for different departments. If you're a distributor, each of your clients has at least one tenant.
One tenant can be accessible by multiple users, and users can have access to multiple tenants. You can control which users can access which tenants by assigning credentials to tenants or vice versa.
Resources¶
Resources are the most basic unit of authorization. They are single and specific access permissions.
Examples:
- Being able to access dashboards from a certain data source
- Being able to delete tenants
- Being able to make changes in the Library
Roles¶
A role is a container for resources. You can create a role to include any combination of resources, so a role is a set of permissions.
Clients¶
Clients are additional applications that access TeskaLabs LogMan.io to support its functioning.
Warning
Removing a client could interrupt essential program functions.
Sessions¶
Sessions are active login periods currently running.
Ways to end a session:
- Click on the red X on the session's line on the Sessions page.
- Click on the session's name, then click Terminate session.
- To terminate all sessions (logging all users out), click Terminate all on the Sessions page.
Tip
The Auth module uses TeskaLabs SeaCat Auth. To learn more, you can read its documentation or take a look at its repository on GitHub.
Ended: All features
Ended: User Manual
Analyst Manual ↵
Analyst Manual¶
The Analyst Manual
Cybersecurity and data analysts use the Analyst Manual to:
- Query data
- Create cybersecurity detections
- Create data visualizations
- Use and create other analytical tools
To learn how to use the TeskaLabs LogMan.io web app, visit the User Manual. For information about setup and installation, see the Administration Manual and the Reference guide.
Quickstart¶
- Queries: Writing queries to find and filter data
- Dashboards: Designing visualizations for data summaries and patterns
- Detections: Creating custom detections for activity and patterns
- Notifications: Sending messages via email from detections or alerts
Using Lucene Query Syntax¶
If you're storing data in Elasticsearch, you need to use Lucene Query Syntax to query data in TeskaLabs LogMan.io.
These are some quick tips for using Lucene Query Syntax, but you can also see the full documentation on the Elasticsearch website, or visit this tutorial.
You might use Lucene Query Syntax when creating dashboards, filtering data in dashboards, and when searching for logs in Discover.
Basic query expressions¶
Search for the field message with any value:
message:*
Search for the value delivered in the field message:
message:delivered
Search for the phrase not delivered in the field message:
message:"not delivered"
Search for any value in the field message, but NOT the value delivered:
message:* -message:delivered
Search for the text delivered anywhere in the value in the field message:
message:delivered*
For example, this query matches logs with message values such as:
message:delivered
message:not delivered
message:delivered with delay
Note
This query would not return the same results if the specified text (delivered in this example) was only part of a word or number, not separated by spaces or periods. Therefore, the query message:eliv, for example, would not return these results.
Search for the range of values 1 to 1000 in the field user.id:
user.id:[1 TO 1000]
Search for the open range of values 1 and higher in the field user.id:
user.id:[1 TO *]
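Tip
Standard Lucene Query Syntax also supports exclusive ranges with curly braces. As an illustrative sketch (not one of the examples above), this query matches user IDs greater than 1 and less than 1000, excluding both endpoints:
user.id:{1 TO 1000}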
Combining query expressions¶
Use boolean operators to combine expressions:
- AND: combines criteria; all of the combined criteria must be met
- OR: at least one of the criteria must be met
Using parentheses
Use parentheses when multiple items need to be grouped together to form an expression.
Examples of grouped expressions:
Search for logs from the dataset security, either with an IP address containing 123.456 and a message of failed login, or with an event action of deny and a delay greater than 10:
event.dataset:security AND (ip.address:123.456* AND message:"failed login") OR
(event.action:deny AND delay:[10 TO *])
Search a library's database for a book written by either Karel Čapek or Lucie Lukačovičová that has been translated to English, or a book in English that is at least 300 pages and in the genre science fiction:
language:English AND (author:"Karel Čapek" OR author:"Lucie Lukačovičová") OR
(page.count:[300 TO *] AND genre:"science fiction")
Dashboards¶
Dashboards are visualizations of incoming log data. While TeskaLabs LogMan.io comes with a library of preset dashboards, you can also create your own. View preset dashboards in the LogMan.io web app in Dashboards.
In order to create a dashboard, you need to write or copy a dashboard file in the Library.
Creating a dashboard file¶
Write dashboards in JSON.
Creating a blank dashboard
- In TeskaLabs LogMan.io, go to the Library.
- Click Dashboards.
- Click Create new item in Dashboards.
- Name the item, and click Create. If the new item doesn't appear immediately, refresh the page.
Copying an existing dashboard
- In TeskaLabs LogMan.io, go to the Library.
- Click Dashboards.
- Click on the item you want to duplicate, then click the icon near the top. Click Copy.
- Choose a new name for the item, and click Copy. If the new item doesn't appear immediately, refresh the page.
Dashboard structure¶
Write dashboards in JSON, and be aware that they're case-sensitive.
Dashboards have two parts:
- The dashboard base: A query bar, time selector, refresh button, and options button
- Widgets: The visualizations (chart, graph, list, etc.)
Dashboard base
Include this section exactly as-is to include the query bar, time selector, refresh button, and options.
{
    "Prompts": {
        "dateRangePicker": true,
        "filterInput": true,
        "submitButton": true
    },
Widgets¶
Widgets are made of datasource and widget pairs. When you write a widget, you need to include both a datasource section and a widget section.
JSON formatting tips:
- Separate every datasource and widget section with a closing brace and a comma (},), except for the final widget in the dashboard, which does not need a comma (see the full example and the sketch below)
- End every line with a comma (,), except the final item in a section
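As a minimal structural sketch (placeholder names and an arbitrary matchPhrase, assumed only for illustration, not a working dashboard), this is where the braces and commas go for two datasource and widget pairs:
{
    "Prompts": {
        "dateRangePicker": true,
        "filterInput": true,
        "submitButton": true
    },
    "datasource:example-first": {
        "type": "elasticsearch",
        "datetimeField": "@timestamp",
        "specification": "lmio-{{ tenant }}-events*",
        "size": 1,
        "matchPhrase": "event.dataset:example"
    },
    "widget:example-first": {
        "datasource": "datasource:example-first",
        "title": "First widget",
        "type": "Value",
        "field": "event.dataset"
    },
    "datasource:example-second": {
        "type": "elasticsearch",
        "datetimeField": "@timestamp",
        "specification": "lmio-{{ tenant }}-events*",
        "size": 1,
        "matchPhrase": "event.dataset:example"
    },
    "widget:example-second": {
        "datasource": "datasource:example-second",
        "title": "Second widget",
        "type": "Value",
        "field": "event.action"
    }
}
Note that the final widget section ends with a plain } and no comma before the closing brace of the dashboard.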
Widget positioning
Each widget has layout lines, which dictate the size and position of the widget. If you don't include layout lines when you write the widget, the dashboard generates them automatically.
- Include the layout lines with the suggested values from each widget template, OR don't include any layout lines. (If you don't include any layout lines, make sure the final item in each section does NOT end with a comma.)
- Go to Dashboards in LogMan.io and resize and move the widget.
- When you move the widget on the Dashboards page, the dashboard file in the Library automatically generates or adjusts the layout lines accordingly. If you're working in the dashboard file in the Library and repositioning the widgets in Dashboards at the same time, make sure to save and refresh both pages after making changes on either page.
The order of widgets in your dashboard file does not determine widget position, and the order does not change if you reposition the widgets in Dashboards.
Naming
We recommend agreeing on naming conventions for dashboards and widgets within your organization to avoid confusion.
matchPhrase filter
For Elasticsearch data sources, use Lucene query syntax for the matchPhrase value.
Colors
By default, pie chart and bar chart widgets use a blue color scheme. To change the color scheme, insert "color": "(color scheme)" directly before the layout lines.
- Blue: No extra lines necessary
- Purple: "color": "sunset"
- Yellow: "color": "warning"
- Red: "color": "danger"
Troubleshooting JSON
If you get an error message about JSON formatting when trying to save the file:
- Follow the recommendation of the error message specifying what the JSON is "expecting" - it might mean that you're missing a required key-value pair, or the punctuation is incorrect.
- If you can't find the error, double-check that your formatting is consistent with other functional dashboards.
If your widget does not display correctly:
- Make sure the value of datasource matches in both the data source and widget sections.
- Check for spelling errors or query structure issues in any fields referenced and in the fields specified in the matchPhrase query.
- Check for any other typos or inconsistencies.
- Check that the log source you are referencing is connected.
Use these examples as guides. The numbered annotations below each example explain what the corresponding lines mean.
Bar charts¶
A bar chart displays values with vertical bars on an x and y-axis. The length of each bar is proportional to the data it represents.
Bar chart JSON example:
"datasource:office365-email-aggregated": { #(1)
"type": "elasticsearch", #(2)
"datetimeField": "@timestamp", #(3)
"specification": "lmio-{{ tenant }}-events*", #(4)
"aggregateResult": true, #(5)
"matchPhrase": "event.dataset:microsoft-office-365 AND event.action:MessageTrace" #(6)
},
"widget:office365-email-aggregated": { #(7)
"datasource": "datasource:office365-email-aggregated", #(8)
"title": "Sent and received emails", #(9)
"type": "BarChart", #(10)
"xaxis": "@timestamp", #(11)
"yaxis": "o365.message.status", #(12)
"ylabel": "Count", #(13)
"table": true, #(14)
"layout:w": 6, #(15)
"layout:h": 4,
"layout:x": 0,
"layout:y": 0,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
},
1. datasource marks the beginning of the data source section as well as the name of the data source. The name doesn't affect the dashboard's function, but you need to refer to the name correctly in the widget section.
2. The type of data source. If you're using Elasticsearch, the value is "elasticsearch".
3. Indicates which field in the logs is the date and time field. For example, in Elasticsearch logs, which are parsed by the Elastic Common Schema (ECS), the date and time field is @timestamp.
4. Refers to the index from which to get data in Elasticsearch. The value lmio-{{ tenant }}-events* fits our index naming conventions in Elasticsearch, and {{ tenant }} is a placeholder for the active tenant. The asterisk * allows unspecified additional characters in the index name following events. The result: The widget displays data from the active tenant.
5. aggregateResult set to true performs aggregation on the data before displaying it in the dashboard. In this case, the sent and received emails are being counted (sum calculated).
6. The query that filters for specific logs using Lucene query syntax. In this case, any data displayed in the dashboard must be from the Microsoft Office 365 dataset and have the value MessageTrace in the field event.action.
7. widget marks the beginning of the widget section as well as the name of the widget. The name doesn't affect the dashboard's function.
8. Refers to the data source section above which populates it. Make sure the value here matches the name of the corresponding data source exactly. (This is how the widget knows where to get data from.)
9. Title of the widget that will display in the dashboard
10. Type of widget
11. The field from the logs whose values will be represented on the x axis
12. The field from the logs whose values will be represented on the y axis
13. Label for the y axis that will display in the dashboard
14. Setting table to true enables you to switch between chart view and table view on the widget in the dashboard. Choosing false disables the chart-to-table feature.
15. See the note above about widget positioning for information about layout lines.
Bar chart widget rendered:
Bar chart template:
To create a bar chart widget, copy and paste this template into a dashboard file in the Library and fill in the values. Recommended layout values, the values specifying an Elasticsearch data source, and the value that organizes the bar chart by time are already filled in.
"datasource:Name of datasource": {
"type": "elasticsearch",
"datetimeField": "@timestamp",
"specification": "lmio-{{ tenant }}-events*",
"aggregateResult": true,
"matchPhrase": " "
},
"widget:Name of widget": {
"datasource": "datasource:Name of datasource",
"title": "Widget display title",
"type": "BarChart",
"xaxis": "@timestamp",
"yaxis": " ",
"ylabel": " ",
"table": true,
"layout:w": 6,
"layout:h": 4,
"layout:x": 0,
"layout:y": 0,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
},
Pie charts¶
A pie chart is a circle divided into slices, in which each slice represents a percentage of the whole.
Pie chart JSON example:
"datasource:office365-email-status": { #(1)
"datetimeField": "@timestamp", #(2)
"groupBy": "o365.message.status", #(3)
"matchPhrase": "event.dataset:microsoft-office-365 AND event.action:MessageTrace", #(4)
"specification": "lmio-{{ tenant }}-events*", #(5)
"type": "elasticsearch", #(6)
"size": 20 #(7)
},
"widget:office365-email-status": { #(8)
"datasource": "datasource:office365-email-status", #(9)
"title": "Received Emails Status", #(10)
"type": "PieChart", #(11)
"tooltip": true, #(12)
"table": true, #(13)
"layout:w": 6, #(14)
"layout:h": 4,
"layout:x": 6,
"layout:y": 0,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
},
1. datasource marks the beginning of the data source section, as well as the name of the data source. The name doesn't affect the dashboard's function, but you need to refer to the name correctly in the widget section.
2. Indicates which field in the logs is the date and time field. For example, in Elasticsearch logs, which are parsed by the Elastic Common Schema (ECS), the date and time field is @timestamp.
3. The field whose values will represent each "slice" of the pie chart. In this example, the pie chart will separate logs by their message status. There will be a separate slice for each of Delivered, Expanded, Quarantined, etc. to show the percentage occurrence of each message status.
4. The query that filters for specific logs. In this case, only data from logs from the Microsoft Office 365 dataset with the value MessageTrace in the field event.action will be displayed.
5. Refers to the index from which to get data in Elasticsearch. The value lmio-{{ tenant }}-events* fits our index naming conventions in Elasticsearch, and {{ tenant }} is a placeholder for the active tenant. The asterisk * allows unspecified additional characters in the index name following events. The result: The widget displays data from the active tenant.
6. The type of data source. If you're using Elasticsearch, the value is "elasticsearch".
7. How many values you want to display. Since this pie chart is showing the statuses of received emails, a size of 20 displays the top 20 status types. (The pie chart can have a maximum of 20 slices.)
8. widget marks the beginning of the widget section as well as the name of the widget. The name doesn't affect the dashboard's function.
9. Refers to the data source section above which populates it. Make sure the value here matches the name of the corresponding data source exactly. (This is how the widget knows where to get data from.)
10. Title of the widget that will display in the dashboard
11. Type of widget
12. If tooltip is set to true: when you hover over each slice of the pie chart in the dashboard, a small informational window with the count of values in the slice pops up at your cursor. If tooltip is set to false: the count window appears in the top left corner of the widget.
13. Setting table to true enables you to switch between chart view and table view on the widget in the dashboard. Choosing false disables the chart-to-table feature.
14. See the note above about widget positioning for information about layout lines.
Pie chart template
To create a pie chart widget, copy and paste this template into a dashboard file in the Library and fill in the values. Recommended values as well as the values specifying an Elasticsearch data source are already filled in.
"datasource:Name of data source": {
"datetimeField": "@timestamp",
"groupBy": " ",
"matchPhrase": " ",
"specification": "lmio-{{ tenant }}-events*",
"type": "elasticsearch",
"size": 20
},
"widget:Name of widget": {
"datasource": "datasource:Name of data source",
"title": "Widget display title",
"type": "PieChart",
"tooltip": true,
"table": true,
"layout:w": 6,
"layout:h": 4,
"layout:x": 0,
"layout:y": 0,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
},
Tables¶
A table displays text and numeric values from data fields that you specify.
Table widget example
"datasource:office365-email-failed-or-quarantined": { #(1)
"type": "elasticsearch", #(2)
"datetimeField": "@timestamp", #(3)
"specification": "lmio-{{ tenant }}-events*", #(4)
"size": 100, #(5)
"matchPhrase": "event.dataset:microsoft-office-365 AND event.action:MessageTrace AND o365.message.status:(Failed OR Quarantined)" #(6)
},
"widget:office365-email-failed-or-quarantined": { #(7)
"datasource": "datasource:office365-email-failed-or-quarantined", #(8)
"field:1": "@timestamp", #(9)
"field:2": "o365.message.status",
"field:3": "sender.address",
"field:4": "recipient.address",
"field:5": "o365.message.subject",
"title": "Failed or quarantined emails", #(10)
"type": "Table", #(11)
"dataPerPage": 9, #(12)
"layout:w": 12, #(13)
"layout:h": 4,
"layout:x": 0,
"layout:y": 0,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
}
1. datasource marks the beginning of the data source section, as well as the name of the data source. The name doesn't affect the dashboard's function, but you need to refer to the name correctly in the widget section.
2. The type of data source. If you're using Elasticsearch, the value is "elasticsearch".
3. Indicates which field in the logs is the date and time field. For example, in Elasticsearch logs, which are parsed by the Elastic Common Schema (ECS), the date and time field is @timestamp.
4. Refers to the index from which to get data in Elasticsearch. The value lmio-{{ tenant }}-events* fits our index naming conventions in Elasticsearch, and {{ tenant }} is a placeholder for the active tenant. The asterisk * allows unspecified additional characters in the index name following events. The result: The widget displays data from the active tenant.
5. How many values you want to display. This table will have a maximum of 100 rows. You can set rows per page in dataPerPage below.
6. The query that filters for specific logs using Lucene query syntax. In this case, the widget displays data only from logs from the Microsoft Office 365 dataset with the value MessageTrace in the field event.action and a message status of Failed or Quarantined.
7. widget marks the beginning of the widget section as well as the name of the widget. The name doesn't affect the dashboard's function.
8. Refers to the data source section above which populates it. Make sure the value here matches the name of the corresponding data source exactly. (This is how the widget knows where to get data from.)
9. Each field is a column that will display in the table in the dashboard. In this example table of failed or quarantined emails, the table would display the timestamp, message status, sender address, recipient address, and the email subject for each log (which represents each email). Use as many fields as you want.
10. Title of the widget that will display in the dashboard
11. Type of widget
12. The number of items displayed per page (at once) in the table
13. See the note above about widget positioning for information about layout lines.
Table widget rendered:
Table widget template:
To create a table widget, copy and paste this template into a dashboard file in the Library and fill in the values. Recommended values as well as the values specifying an Elasticsearch data source are already filled in.
"datasource:Name of datasource": {
"type": "elasticsearch",
"datetimeField": "@timestamp",
"specification": "lmio-{{ tenant }}-events*",
"size": 100,
"matchPhrase": " "
},
"widget:Name of widget": {
"datasource": "datasource:Name of datasource",
"field:1": "@timestamp",
"field:2": " ",
"field:3": " ",
"field:4": " ",
"field:5": " ",
"title": "Widget title",
"type": "Table",
"dataPerPage": 9,
"layout:w": 12,
"layout:h": 4,
"layout:x": 0,
"layout:y": 0,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
}
Single values¶
A value widget displays the most recent single value from the data field you specify.
"datasource:microsoft-exchange1": { #(1)
"datetimeField": "@timestamp", #(2)
"matchPhrase": "event.dataset:microsoft-exchange AND email.from.address:* AND email.to.address:*", #(3)
"specification": "lmio-{{ tenant }}-events*", #(4)
"type": "elasticsearch", #(5)
"size": 1 #(6)
},
"widget:fortigate1": { #(7)
"datasource": "datasource:microsoft-exchange1", #(8)
"field": "email.from.address", #(9)
"title": "Last Active User", #(10)
"type": "Value", #(11)
"layout:w": 4, #(12)
"layout:h": 1,
"layout:x": 0,
"layout:y": 0,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
}
1. datasource marks the beginning of the data source section, as well as the name of the data source. The name doesn't affect the dashboard's function, but you need to refer to the name correctly in the widget section.
2. Indicates which field in the logs is the date and time field. For example, in Elasticsearch logs, which are parsed by the Elastic Common Schema (ECS), the date and time field is @timestamp.
3. The query that filters for specific logs using Lucene query syntax. In this case, the widget displays data only from logs from the Microsoft Exchange dataset with ANY value (*) in the email.from.address and email.to.address fields.
4. Refers to the index from which to get data in Elasticsearch. The value lmio-{{ tenant }}-events* fits our index naming conventions in Elasticsearch, and {{ tenant }} is a placeholder for the active tenant. The asterisk * allows unspecified additional characters in the index name following events. The result: The widget displays data from the active tenant.
5. The type of data source. If you're using Elasticsearch, the value is "elasticsearch".
6. How many values you want to display. Since a value widget only displays a single value, the size is 1.
7. widget marks the beginning of the widget section as well as the name of the widget. The name doesn't affect the dashboard's function.
8. Refers to the data source section above which populates it. Make sure the value here matches the name of the corresponding data source exactly. (This is how the widget knows where to get data from.)
9. Refers to the field (from the latest log) from which the value will be displayed.
10. Title of the widget that will display in the dashboard
11. Type of widget. The Value type displays a single value.
12. See the note above about widget positioning for information about layout lines.
Value widget rendered:
Value widget template:
To create a value widget, copy and paste this template into a dashboard file in the Library and fill in the values. Recommended values as well as the values specifying an Elasticsearch data source are already filled in.
"datasource:Name of datasource": {
"datetimeField": "@timestamp",
"matchPhrase": " ",
"specification": "lmio-{{ tenant }}-events*",
"type": "elasticsearch",
"size": 1
},
"widget:Name of widget": {
"datasource": "datasource:Name of datasource",
"field": " ",
"title": "Widget title",
"type": "Value",
"layout:w": 4,
"layout:h": 1,
"layout:x": 0,
"layout:y": 0,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
}
Dashboard example¶
This example is structured correctly:
{
"Prompts": {
"dateRangePicker": true,
"filterInput": true,
"submitButton": true
},
"datasource:access-log-combined HTTP Response": {
"type": "elasticsearch",
"datetimeField": "@timestamp",
"specification": "lmio-default-events*",
"size": 20,
"groupBy": "http.response.status_code",
"matchPhrase": "event.dataset: access-log-combined AND http.response.status_code:*"
},
"widget:access-log-combined HTTP Response": {
"datasource": "datasource:access-log-combined HTTP Response",
"title": "HTTP status codes",
"type": "PieChart",
"color": "warning",
"useGradientColors": true,
"table": true,
"tooltip": true,
"layout:w": 6,
"layout:h": 5,
"layout:x": 6,
"layout:y": 0,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
},
"datasource:access-log-combined Activity": {
"type": "elasticsearch",
"datetimeField": "@timestamp",
"specification": "lmio-default-events*",
"matchPhrase": "event.dataset:access-log-combined AND http.response.status_code:*",
"aggregateResult": true
},
"widget:access-log-combined Activity": {
"datasource": "datasource:access-log-combined Activity",
"title": "Activity",
"type": "BarChart",
"table": true,
"xaxis": "@timestamp",
"ylabel": "HTTP requests",
"yaxis": "http.response.status_code",
"color": "sunset",
"layout:w": 6,
"layout:h": 4,
"layout:x": 0,
"layout:y": 1,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
},
"datasource:Access-log-combined Last_http": {
"datetimeField": "@timestamp",
"matchPhrase": "event.dataset:access-log-combined AND http.response.status_code:*",
"specification": "lmio-default-events*",
"type": "elasticsearch",
"size": 1000
},
"widget:Access-log-combined Last_http": {
"datasource": "datasource:Access-log-combined Last_http",
"field": "http.response.status_code",
"title": "Last HTTP status code",
"type": "Value",
"layout:w": 6,
"layout:h": 1,
"layout:x": 0,
"layout:y": 0,
"layout:moved": false,
"layout:static": true,
"layout:isResizable": false
}
}
Note: The data is arbitrary. This example is meant only to help you format your dashboards correctly.
Dashboard rendered:
Parsing ↵
Parsing¶
Parsing is the process of analyzing the original log (which is typically in single/multiple-line string, JSON, or XML format) and transforming it into a list of key-value pairs that describe the log data (such as when the original event happened, the priority and severity of the log, information about the process that created the log, etc).
Every log that enters your TeskaLabs LogMan.io system needs to be parsed. The LogMan.io Parsec microservice is responsible for parsing logs. Parsec needs parsers, which are sets of declarations (YAML files), to know how to parse each type of log. LogMan.io comes with the LogMan.io Common Library, which already includes parsers for many common log types. However, if you need to create your own parsers, understanding the parsing key terms, learning about declarations, and following the parsing tutorial can help.
Basic parsing example
Parsing takes a raw log, such as this:
<30>2023:12:04-15:33:59 hostname3 ulogd[1620]: id="2001" severity="info" sys="SecureNet" sub="packetfilter" name="Packet dropped" action="drop" fwrule="60002" initf="eth2.3009" outitf="eth6" srcmac="e0:63:da:73:bb:3e" dstmac="7c:5a:1c:4c:da:0a" srcip="172.60.91.60" dstip="192.168.99.121" proto="17" length="168" tos="0x00" prec="0x00" ttl="63" srcport="47100" dstport="12017"
and produces a parsed event, a list of key-value pairs, like this:
@timestamp: 2023-12-04 15:33:59.033
destination.ip: 192.168.99.121
destination.mac: 7c:5a:1c:4c:da:0a
destination.port: 12017
device.model.identifier: SG230
dns.answers.ttl: 63
event.action: Packet dropped
event.created: 2023-12-04 15:33:59.033
event.dataset: sophos
event.id: 2001
event.ingested: 2023-12-04 15:39:10.039
event.original: <30>2023:12:04-15:33:59 hostname3 ulogd[1620]: id="2001" severity="info" sys="SecureNet" sub="packetfilter" name="Packet dropped" action="drop" fwrule="60002" initf="eth2.3009" outitf="eth6" srcmac="e0:63:da:73:bb:3e" dstmac="7c:5a:1c:4c:da:0a" srcip="172.60.91.60" dstip="192.168.99.121" proto="17" length="168" tos="0x00" prec="0x00" ttl="63" srcport="47100" dstport="12017"
host.hostname: hostname3
lmio.event.source.id: hostname3
lmio.parsing: parsec
lmio.source: mirage
log.syslog.facility.code: 3
log.syslog.facility.name: daemon
log.syslog.priority: 30
log.syslog.severity.code: 6
log.syslog.severity.name: information
message: id="2001" severity="info" sys="SecureNet" sub="packetfilter" name="Packet dropped" action="drop" fwrule="60002" initf="eth2.3009" outitf="eth6" srcmac="e0:63:da:73:bb:3e" dstmac="7c:5a:1c:4c:da:0a" srcip="172.60.91.60" dstip="192.168.99.121" proto="17" length="168" tos="0x00" prec="0x00" ttl="63" srcport="47100" dstport="12017"
observer.egress.interface.name: eth6
observer.ingress.interface.name: eth2.3009
process.name: ulogd
process.pid: 1620
sophos.action: drop
sophos.fw.rule.id: 60002
sophos.prec: 0x00
sophos.protocol: 17
sophos.sub: packetfilter
sophos.sys: SecureNet
sophos.tos: 0x00
source.bytes: 168
source.ip: 172.60.91.60
source.mac: e0:63:da:73:bb:3e
source.port: 47100
tags: lmio-parsec:v23.47
tenant: default
_id: e1a92529bab1f20e43ac8d6caf90aff49c782b3d6585e6f63ea7c9346c85a6f7
_prev_id: 10cc320c9796d024e8a6c7e90fd3ccaf31c661cf893b6633cb2868774c743e69
_s: DKNA
Parsing key terms¶
Important terms relevant to LogMan.io Parsec and the parsing process.
Event¶
A unit of data that moves through the parsing process is referred to as an event. An original event comes to LogMan.io Parsec as an input and is then parsed by the processors. If parsing succeeds, it produces a parsed event, and if parsing fails, it produces an error event.
Original event¶
An original event is the input that LogMan.io Parsec receives - in other words, an unparsed log. It can be represented by a raw (possibly encoded) string or a structure in JSON or XML format.
Parsed event¶
A parsed event is the output from successful parsing, formatted as an unordered list of key-value pairs serialized into JSON structure. A parsed event always contains a unique ID, the original event, and typically the information about when the event was created by the source and received by Apache Kafka.
Error event¶
An error event is the output from unsuccessful parsing, formatted as an unordered list of key-value pairs serialized into JSON structure. It is produced when parsing, mapping, or enrichment fails, or when another exception occurs in LogMan.io Parsec. It always contains the original event, the information about when the event was unsuccessfully parsed, and the error message describing the reason why the process of parsing failed. Despite unsuccessful parsing, the error event will always be in JSON format, key-value pairs.
Library¶
Your TeskaLabs LogMan.io Library holds all of your declaration files (as well as many other types of files). You can edit your declaration files in your Library via Zookeeper.
Declarations¶
Declarations describe how the event will be transformed. Declarations are YAML files that LogMan.io Parsec can interpret to create declarative processors. There are three types of declarations in LogMan.io Parsec: parsers, enrichers, and mappings. See Declarations for more.
Parser¶
A parser is the type of declaration that takes the original event or a specific field of a partially-parsed event as input, analyzes its individual parts, and then stores them as key-value pairs to the event.
Mapping¶
A mapping declaration is the type of declaration that takes a partially parsed event as input, renames the field names, and optionally converts the data types. It works together with a schema (ECS, CEF). It also works as a filter to leave out data that is not needed in the final parsed event.
Enricher¶
An enricher is the type of declaration that supplements a partially parsed event with additional data.
Declarations ↵
Declarations¶
Declarations describe how the event should be parsed. They are stored as YAML files in the Library. LogMan.io Parsec interprets these declarations and creates parsing processors.
There are three types of declarations:
- Parser declaration: A parser takes an original event or a specific field of a partially parsed event as input, analyzes its individual parts, and stores them as key-value pairs to the event.
- Mapping declaration: Mapping takes a partially parsed event as input, renames the field names, and optionally converts the data types. It works together with a schema (ECS, CEF).
- Enricher declaration: An enricher supplements a partially parsed event with extra data.
Data flow¶
A typical, recommended parsing sequence is a chain of declarations:
- The first main parser declaration begins the chain, and additional parsers (called sub-parsers) extract more detailed data from the fields created by the previous parser.
- Then, the (single) mapping declaration renames the keys of the parsed fields according to a schema and filters out fields that are not needed.
- Last, the enricher declaration supplements the event with additional data. While it's possible to use multiple enricher files, it's recommended to use just one.
Naming declarations¶
Important: Naming conventions
LogMan.io Parsec loads declarations alphabetically and creates the corresponding processors in the same order. Therefore, create the list of declaration files according to these rules:
- Begin all declaration file names with a numbered prefix: 10_parser.yaml, 20_parser_message.yaml, ..., 90_enricher.yaml. It is recommended to "leave some space" in your numbering for future declarations in case you want to add a new declaration between two existing ones (e.g., 25_new_parser.yaml).
- Include the type of declaration in file names: 20_parser_message.yaml rather than 10_message.yaml.
- Include the type of schema used in mapping file names: 40_mapping_ECS.yaml rather than 40_mapping.yaml.
Example:
/Parsers/MyParser/:
- 10_parser.yaml
- 20_parser_username.yaml
- 30_parser_message.yaml
- 40_mapping_ECS.yaml
- 50_enricher_lookup.yaml
- 60_enricher.yaml
Parser declarations¶
A parser declaration takes an original event or a specific field of a partially parsed event as input, analyzes its individual parts, and stores them as key-value pairs to the event.
LogMan.io Parsec currently supports three types of parser declarations:
- JSON parser
- Windows Event parser
- Parsec parser
Declaration structure¶
In order to determine the type of the declaration, you need to specify a define section.
define:
  type: <declaration_type>
For a parser declaration, specify the type as one of the parser types below (parser/json, parser/windows-event, or parser/parsec).
JSON parser¶
A JSON parser is used for parsing events with a JSON structure.
define:
  name: JSON parser
  type: parser/json
This is a complete JSON parser and will parse events from a JSON structure, separating the fields into key-value pairs.
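For illustration, a flat (non-nested) JSON event such as this hypothetical sample:
{
    "act": "user login",
    "ip": "178.2.1.20",
    "usr": "harry_potter"
}
is separated into the key-value pairs act, ip, and usr, with the same names and values as in the original event. Renaming those keys to schema field names is the job of a mapping declaration, described below.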
Warning
For now, LogMan.io Parsec does not support parsing of events with nested JSON format. For example, the event below cannot be parsed with JSON parser:
{
"key": {
"foo": 1,
"bar": 2
}
}
Windows Event parser¶
Windows Events parser is used for parsing events that are produced from Microsoft Windows. These events are in XML format.
define:
  name: Windows Events Parser
  type: parser/windows-event
This is a complete Windows Event parser and will parse events from Microsoft Windows, separating the fields into key-value pairs.
Parsec parser¶
A Parsec parser is used for parsing events in plain string format. It is based on SP-Lang Parsec expressions.
For parsing original events, use the following declaration:
define:
  name: My Parser
  type: parser/parsec

parse:
  !PARSE.KVLIST
  - ...
  - ...
  - ...
To parse a specific field of a partially parsed event (for example, in a sub-parser), add the field option:
define:
  name: My Parser
  type: parser/parsec
  field: <custom_field>

parse:
  !PARSE.KVLIST
  - ...
  - ...
  - ...
When field is specified, parsing is applied to that field; otherwise it is applied to the original event. Therefore, field must be present in every sub-parser declaration.
Examples of Parsec parser declarations¶
Example 1: Simple example
For the purpose of the example, let's say that we want to parse a collection of simple events:
Hello Miroslav from Prague!
Hi Kristýna from Pilsen.
And we want the output in the following format:
{
    "name": "Miroslav",
    "city": "Prague"
}
{
    "name": "Kristýna",
    "city": "Pilsen"
}
The declaration will be the following:
define:
  type: parser/parsec

parse:
  !PARSE.KVLIST
  - !PARSE.UNTIL " "        # skip the greeting ("Hello" / "Hi") up to and including the space
  - name: !PARSE.UNTIL " "  # "Miroslav" / "Kristýna"
  - !PARSE.EXACTLY "from "  # match the literal text "from "
  - city: !PARSE.LETTERS    # "Prague" / "Pilsen" (letters only, so the trailing "!" or "." is left out)
Example 2: More complex example
For the purpose of this example, let's say that we want to parse a collection of simple events:
Process cleaning[123] finished with code 0.
Process log-rotation finished with code 1.
Process cleaning[657] started.
And we want the output in the following format:
{
    "process.name": "cleaning",
    "process.pid": 123,
    "event.action": "process-finished",
    "return.code": 0
}
{
    "process.name": "log-rotation",
    "event.action": "process-finished",
    "return.code": 1
}
{
    "process.name": "cleaning",
    "process.pid": 657,
    "event.action": "process-started"
}
The declaration will be the following:
define:
  type: parser/parsec

parse:
  !PARSE.KVLIST
  - !PARSE.UNTIL " "
  - !TRY
    - !PARSE.KVLIST
      - process.name: !PARSE.UNTIL "["
      - process.pid: !PARSE.UNTIL "]"
      - !PARSE.SPACE
    - !PARSE.KVLIST
      - process.name: !PARSE.UNTIL " "
  - !TRY
    - !PARSE.KVLIST
      - !PARSE.EXACTLY "started."
      - event.action: "process-started"
    - !PARSE.KVLIST
      - !PARSE.EXACTLY "finished with code "
      - event.action: "process-finished"
      - return.code: !PARSE.DIGITS
Example 3: Parsing syslog events
For the purpose of the example, let's say that we want to parse a simple event in syslog format:
<189> Sep 22 10:31:39 server-abc server-check[1234]: User "harry potter" logged in from 198.20.65.68
We would like the output in the following format:
{
"PRI": 189,
"timestamp": 1695421899,
"server": "server-abc",
"process.name": "server-check",
"process.pid": 1234,
"user": "harry potter",
"action": "log-in",
"ip": "198.20.65.68"
}
We will create two parsers. First parser will parse the syslog header and the second will parse the message.
define:
  name: Syslog parser
  type: parser/parsec

parse:
  !PARSE.KVLIST
  - !PARSE.EXACTLY "<"
  - PRI: !PARSE.DIGITS
  - !PARSE.EXACTLY ">"
  - timestamp: ...
  - server: !PARSE.UNTIL " "
  - process.name: !PARSE.UNTIL "["
  - process.pid: !PARSE.UNTIL "]"
  - !PARSE.EXACTLY ":"
  - message: !PARSE.CHARS
This parser stores the rest of the event in the message field. The second parser then parses the message field:
define:
  type: parser/parsec
  field: message
  drop: yes

parse:
  !PARSE.KVLIST
  - !PARSE.UNTIL " "
  - user: !PARSE.BETWEEN { what: '"' }
  - !PARSE.EXACTLY " "
  - !PARSE.UNTIL " "
  - !PARSE.UNTIL " "
  - !PARSE.UNTIL " "
  - ip: !PARSE.CHARS
Mapping declarations¶
After all declared fields are obtained from parsers, the fields typically have to be renamed according to some schema (ECS, CEF) in a process called mapping.
Why is mapping necessary?
To store event data in Elasticsearch, it's essential that the field names in the logs align with the Elastic Common Schema (ECS), a standardized, open-source collection of field names that are compatible with Elasticsearch. The mapping process renames the fields of the parsed logs according to this schema. Mapping ensures that logs from various sources have unified, consistent field names, which enables Elasticsearch to interpret them accurately.
Important
By default, mapping works as a filter. Make sure to include all fields you want in the parsed output in the mapping declaration. Any field not specified in mapping will be removed from the event.
Writing a mapping declaration¶
Write mapping declarations in YAML. (Mapping declarations do not use SP-Lang expressions.)
define:
  type: parser/mapping
  schema: /Schemas/ECS.yaml

mapping:
  <original_key>: <new_key>
  <original_key>: <new_key>
  ...
Specify parser/mapping as the type in the define section. In the schema field, specify the filepath to the schema you're using. If you use Elasticsearch, use the Elastic Common Schema (ECS).
To rename the key and change the data type of the value:
mapping:
  <original_key>:
    field: <new_key>
    type: <new_type>
Find available data types here.
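For example, a sketch that renames a srcport key (like the one in the Sophos sample log) to source.port and converts its value to a numeric type; the type identifier int is an assumption made only for illustration, so check the data types reference for the exact names your schema uses:
mapping:
  srcport:
    field: source.port
    type: int    # 'int' is an assumed type name, for illustration only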
To rename the key without changing the data type of the value:
mapping:
  <original_key>: <new_key>
Example¶
Example
For the purpose of the example, let's say that we want to parse a simple event in JSON format:
{
"act": "user login",
"ip": "178.2.1.20",
"usr": "harry_potter",
"id": "6514-abb6-a5f2"
}
and we would like the final output look like this:
{
"event.action": "user login",
"source.ip": "178.2.1.20",
"user.name": "harry_potter"
}
Notice that the key names in the original event differ from the key names in the desired output.
For the initial parser declaration in this case, we can use a simple JSON parser:
define:
  type: parser/json
This parser will create a list of key-value pairs that are exactly the same as the original ones.
To change the names of individual fields, we create this mapping declaration file, 20_mapping_ECS.yaml, in which we describe what fields to map and how:
---
define:
  type: parser/mapping       # determine the type of declaration
  schema: /Schemas/ECS.yaml  # which schema is applied

mapping:
  act: 'event.action'
  ip: 'source.ip'
  usr: 'user.name'
This declaration will produce the desired output. (Data types have not been changed.) Note that the id field from the original event is not listed in the mapping, so it is filtered out of the parsed event.
Enricher declarations¶
Enrichers supplement the parsed event with extra data.
An enricher can:
- Create a new field in the event.
- Transform a field's values in some way (changing a letter case, performing a calculation, etc).
Enrichers are most commonly used to:
- Specify the dataset where the logs will be stored in Elasticsearch (add the field event.dataset).
- Obtain facility and severity from the syslog priority field.
define:
  type: parsec/enricher

enrich:
  event.dataset: <dataset_name>
  new.field: <expression>
  ...
- Write enrichers in YAML.
- Specify parsec/enricher as the type in the define section.
Example
The following example is an enricher used for events in syslog format. Suppose you have a parser for events of the form:
<14>1 2023-05-03 15:06:12 server pid: Username 'HarryPotter' logged in.
The parser produces these fields:
{
"log.syslog.priority": 14,
"user.name": "HarryPotter"
}
You want to obtain syslog severity and facility, which are computed in the standard way:
(facility * 8) + severity = priority
You would also like to lower the name HarryPotter
to harrypotter
in order to unify the users across various log sources.
Therefore, you create an enricher:
define:
  type: parsec/enricher

enrich:
  event.dataset: 'dataset_name'
  user.id: !LOWER { what: !GET {from: !ARG EVENT, what: user.name} }

  # facility and severity are computed from 'syslog.pri' in the standard way
  log.syslog.facility.code: !SHR
    what: !GET { from: !ARG EVENT, what: log.syslog.priority }
    by: 3
  log.syslog.severity.code: !AND [ !GET {from: !ARG EVENT, what: log.syslog.priority}, 7 ]
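As a worked check, using the priority 14 from the sample event above: !SHR shifts the priority right by 3 bits (integer division by 8), and !AND with 7 keeps the lowest three bits (the remainder):
priority = 14
log.syslog.facility.code = 14 >> 3 = 1    # facility 1 (user-level)
log.syslog.severity.code = 14 & 7  = 6    # severity 6 (informational)
check: (1 * 8) + 6 = 14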
Ended: Declarations
Parsing tutorial¶
The complete parsing process requires parser, mapping, and enricher declarations. This tutorial breaks down creating declarations step-by-step. Visit the LogMan.io Parsec documentation for more on the Parsec microservice.
Before you start
SP-Lang
Parsing declarations are written in TeskaLabs SP-Lang. For more details about parsing expressions, visit the SP-Lang documentation.
Declarations
For more information on specific types of declarations, see the Declarations chapter above.
Sample logs¶
This example uses this set of logs collected from various Sophos SG230 devices:
<181>2023:01:12-13:08:45 asgmtx httpd: 212.158.149.81 - - [12/Jan/2023:13:08:45 +0100] "POST /webadmin.plx HTTP/1.1" 200 2000
<38>2023:01:12-13:09:09 asgmtx sshd[17112]: Failed password for root from 218.92.0.190 port 56745 ssh2
<38>2023:01:12-13:09:20 asgmtx sshd[16281]: Did not receive identification string from 218.92.0.190
<38>2023:01:12-13:09:20 asgmtx aua[2350]: id="3005" severity="warn" sys="System" sub="auth" name="Authentication failed" srcip="43.139.111.88" host="" user="login" caller="sshd" reason="DENIED"
These logs are using the syslog format described in RFC 5424.
Logs can be typically separated into two parts: the header and the body. The header is anything preceding the first colon after the timestamp. The body is the rest of the log.
Parsing strategy¶
The Parsec interprets each declaration alphabetically by file name, so naming order matters. Within each declaration, the parsing process follows the expressions in the order you write them, like steps.
A parsing sequence can include multiple parser declarations, and also needs a mapping declaration and an enricher declaration. In this case, create these declarations:
- First parser declaration: Parse the syslog headers
- Second parser declaration: Parse the body of the logs as the message.
- Mapping declaration: Rename fields
- Enricher declaration: Add metadata (such as the dataset name) and compute syslog facility and severity from priority
As per naming conventions, name these files:
- 10_parser_header.yaml
- 20_parser_message.yaml
- 30_mapping_ECS.yaml
- 40_enricher.yaml
Remember that declarations are interpreted in alphabetical order, in this case by the increasing numeric prefix. Use prefixes such as 10, 20, 30, etc. so you can add a new declaration between existing ones later without renaming all of the files.
1. Parsing the header¶
This is the first parser declaration. The subsequent sections break down and explain each part of the declaration.
---
define:
  type: parser/parsec

parse:
  !PARSE.KVLIST

  # PRI part
  - '<'
  - PRI: !PARSE.DIGITS
  - '>'

  # Timestamp
  - TIMESTAMP: !PARSE.DATETIME
    - year: !PARSE.DIGITS                     # year: 2023
    - ':'
    - month: !PARSE.MONTH { what: 'number' }  # month: 01
    - ':'
    - day: !PARSE.DIGITS                      # day: 12
    - '-'
    - hour: !PARSE.DIGITS                     # hour: 13
    - ':'
    - minute: !PARSE.DIGITS                   # minute: 08
    - ':'
    - second: !PARSE.DIGITS                   # second: 45
  - !PARSE.UNTIL ' '

  # Hostname and process
  - HOSTNAME: !PARSE.UNTIL ' '                # asgmtx
  - PROCESS: !PARSE.UNTIL ':'

  # Message
  - !PARSE.SPACES
  - MESSAGE: !PARSE.CHARS
Log headers¶
The syslog headers are in the format:
<PRI>TIMESTAMP HOSTNAME PROCESS.NAME[PROCESS.PID]:
Important: Log variance
Notice that PROCESS.PID in the square brackets is not present in the first log's header. To accommodate the discrepancy, the parser will need a way to handle the possibility of PROCESS.PID being either present or absent. This is addressed later in the tutorial.
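For example, the header of the second sample log breaks down into these parts:
<38>2023:01:12-13:09:09 asgmtx sshd[17112]:
PRI: 38
TIMESTAMP: 2023:01:12-13:09:09
HOSTNAME: asgmtx
PROCESS.NAME: sshd
PROCESS.PID: 17112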
Parsing the PRI¶
First, parse the PRI, which is enclosed by < and > characters, with no space in between.
How to parse <PRI>, as seen in the first parser declaration:
!PARSE.KVLIST
- !PARSE.EXACTLY { what: '<' }
- PRI: !PARSE.DIGITS
- !PARSE.EXACTLY { what: '>' }
Expressions used:
- !PARSE.EXACTLY: parses the characters < and >
- !PARSE.DIGITS: parses the numbers (digits) of the PRI
!PARSE.EXACTLY shortcut
The !PARSE.EXACTLY expression has a syntactic shortcut because it is so commonly used. Instead of including the whole expression !PARSE.EXACTLY { what: '(character)' }, you can shorten it to '(character)'.
So, the above parser declaration can be shortened to:
!PARSE.KVLIST
- '<'
- PRI: !PARSE.DIGITS
- '>'
Parsing the timestamp¶
The unparsed timestamp format is:
yyyy:mm:dd-HH:MM:SS
2023:01:12-13:08:45
Parse the timestamp with the !PARSE.DATETIME expression.
As seen in the first parser declaration:
# 2023:01:12-13:08:45
- TIMESTAMP: !PARSE.DATETIME
  - year: !PARSE.DIGITS                     # year: 2023
  - ':'
  - month: !PARSE.MONTH { what: 'number' }  # month: 01
  - ':'
  - day: !PARSE.DIGITS                      # day: 12
  - '-'
  - hour: !PARSE.DIGITS                     # hour: 13
  - ':'
  - minute: !PARSE.DIGITS                   # minute: 08
  - ':'
  - second: !PARSE.DIGITS                   # second: 45
- !PARSE.UNTIL { what: ' ', stop: after }
Parsing the month:
The !PARSE.MONTH expression requires you to specify the format of the month in the what parameter. The options are:
- 'number' (used in this case), which accepts numbers 01-12
- 'short' for shortened month names (JAN, FEB, etc.)
- 'full' for full month names (JANUARY, FEBRUARY, etc.)
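For instance, a month written as a short name (like Sep in the syslog header from Example 3 of the parser examples) could be parsed with a line such as this sketch:
- month: !PARSE.MONTH { what: 'short' }   # accepts abbreviated month names (JAN, FEB, ..., SEP, ...)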
Parsing the space:
The space at the end of the timestamp also needs to be parsed. Using the !PARSE.UNTIL expression parses everything until the space character (' '), stopping after the space, as defined (stop: after).
!PARSE.UNTIL shortcuts and alternatives
!PARSE.UNTIL has a syntactic shortcut:
- !PARSE.UNTIL ' '
is equivalent to
- !PARSE.UNTIL { what: ' ', stop: after }
Alternatively, you can choose an expression that specifically parses one or multiple spaces, respectively:
- !PARSE.SPACE
or
- !PARSE.SPACES
At this point, the sequence of characters <181>2023:01:12-13:08:45
(including the space at the end) is parsed.
Parsing the hostname and process¶
Next, parse the hostname and process: asgmtx sshd[17112]:.
Remember, the first log's header is different than the rest. For a solution that accommodates this difference, create a parser declaration and a subparser declaration.
As seen in the first parser declaration:
# Hostname and process
- HOSTNAME: !PARSE.UNTIL ' ' # asgmtx
- PROCESS: !PARSE.UNTIL ':'
# Message
- !PARSE.SPACES
- MESSAGE: !PARSE.CHARS
- Parse the hostname: To parse the hostname, use the !PARSE.UNTIL expression, which parses everything until the single character specified inside ' ' (in this case a space) and stops after that character, without including the character in the output.
- Parse the process: Use !PARSE.UNTIL again for parsing until ':'. After the colon, the header is parsed.
- Parse the message: In this declaration, use !PARSE.SPACES to parse all spaces between the header and the message. Then, store the rest of the event in the MESSAGE field using the !PARSE.CHARS expression, which in this case parses all of the rest of the characters in the log. You will use additional declarations to parse the parts of the message.
1.5. Parsing for log variance¶
To address the issue of the first log not having a process PID, you need a second parser declaration, a subparser. In the other logs, the process PID is enclosed in square brackets ([ ]
).
Create a declaration called 15_parser_process.yaml. To accommodate the differences in the logs, create two "paths" or "branches" that the parser can use. The first branch will parse PROCESS.NAME, PROCESS.PID, and ':'. The second branch will parse only PROCESS.NAME.
Why do I need two branches?
For three of the logs, the process PID is enclosed in square brackets ([ ]). Thus, the expression that isolates the PID begins parsing at a square bracket [. However, in the first log, the PID field is not present. If you try to parse the first log using the same expression, the parser will try to find a square bracket in that log and will keep searching regardless of the character [ not being present in the header.
The result would be that whatever is inside the square brackets is parsed as PID, which in this case would be nonsensical, and would disrupt the rest of the parsing process for that log.
The second declaration:
---
define:
  type: parser/parsec
  field: PROCESS
  error: continue

parse:
  !PARSE.KVLIST
  - !TRY
    - !PARSE.KVLIST
      - PROCESS.NAME: !PARSE.UNTIL '['
      - PROCESS.PID: !PARSE.UNTIL ']'
    - !PARSE.KVLIST
      - PROCESS.NAME: !PARSE.CHARS
To achieve this, construct two little parsers under the combinator !PARSE.KVLIST using the !TRY expression.
The !TRY expression
The !TRY expression allows you to nest a list of expressions under it. !TRY begins by attempting to use the first expression, and if that first expression is unusable for the log, the process continues with the second nested expression, and so on, until an expression succeeds.
Under the !TRY expression:
The first branch:
1. The expression parses PROCESS.NAME and PROCESS.PID, expecting the square brackets [ and ] to be present in the event. After these are parsed, it also parses the : character.
2. If the log does not contain a [ character, the expression !PARSE.UNTIL '[' fails, and in that case the whole !PARSE.KVLIST expression in the first branch fails.
The second branch:
3. The !TRY expression will continue with the next parser, which does not require the character [ to be present in the event. It simply parses everything before : and stops after it.
4. If this second expression fails, the log goes to OTHERS.
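Applied to the sample logs, the PROCESS values produced by the header parser are parsed as follows (derived from the sample events above):
PROCESS: sshd[17112]  ->  PROCESS.NAME: sshd, PROCESS.PID: 17112   (first branch)
PROCESS: httpd        ->  PROCESS.NAME: httpd                      (second branch)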
2. Parsing the message¶
Consider again the events:
<181>2023:01:12-13:08:45 asgmtx httpd: 212.158.149.81 - - [12/Jan/2023:13:08:45 +0100] "POST /webadmin.plx HTTP/1.1" 200 2000
<38>2023:01:12-13:09:09 asgmtx sshd[17112]: Failed password for root from 218.92.0.190 port 56745 ssh2
<38>2023:01:12-13:09:20 asgmtx sshd[16281]: Did not receive identification string from 218.92.0.190
<38>2023:01:12-13:09:20 asgmtx aua[2350]: id="3005" severity="warn" sys="System" sub="auth" name="Authentication failed" srcip="43.139.111.88" host="" user="login" caller="sshd" reason="DENIED"
There are three different types of messages, depending on the process name:
- httpd: The message is in a structured format. We can extract data such as IPs and HTTP requests easily by using the standard parsing expressions.
- sshd: The message is a human-readable string. To extract data such as host IPs and ports, hardcode these messages in the parser and skip the words that are relevant to humans but not relevant for automatic parsing.
- aua: The message consists of structured data in the form of key-value pairs. Extract them as they are and rename them in the mapping according to the Elastic Common Schema (ECS).
For clarity, put each declaration into a separate YAML file and use the !INCLUDE expression to include them in one parser.
---
define:
  type: parser/parsec
  field: MESSAGE
  error: continue

parse:
  !MATCH
  what: !GET { from: !ARG EVENT, what: process.name, type: str }
  with:
    'httpd': !INCLUDE httpd.yaml
    'sshd': !INCLUDE sshd.yaml
    'aua': !INCLUDE aua.yaml
  else: !PARSE.KVLIST []
The !MATCH expression has three parameters. The what parameter specifies the field name whose value is matched against the cases specified in the with dictionary. If a match is successful, the corresponding expression is executed, in this case one of the !INCLUDE expressions. If none of the listed cases matches, the expression in else is executed. In this case, !PARSE.KVLIST is used with an empty list, which means nothing will be parsed from the message.
Parsing the structured message¶
First, look at the message from 'httpd' process.
212.158.149.81 - - [12/Jan/2023:13:08:45 +0100] "POST /webadmin.plx HTTP/1.1" 200 2000
Parse the IP address, the HTTP request method, the response status code, and the number of bytes in the response body to yield the output:
host.ip: '212.158.149.81'
http.request.method: 'POST'
http.response.status_code: '200'
http.response.body.bytes: '2000'
This is straightforward, assuming all the events will satisfy the same format as the one from the example:
!PARSE.KVLIST
- host.ip: !PARSE.UNTIL ' '
- !PARSE.UNTIL '"'
- http.request.method: !PARSE.UNTIL ' '
- !PARSE.UNTIL '"'
- !PARSE.SPACE
- http.response.status_code: !PARSE.DIGITS
- !PARSE.SPACE
- http.response.body.bytes: !PARSE.DIGITS
This case uses the ECS for naming. Alternatively, you can rename fields according to your needs in the mapping declaration.
Parsing the human-readable string¶
Let us continue with 'sshd' messages.
Failed password for root from 218.92.0.190 port 56745 ssh2
Did not receive identification string from 218.92.0.190
You can extract IP addresses from both events and the port from the first one. Additionally, you can store the condensed information about the event type in event.action
field.
event.action: 'password-failed'
user.name: 'root'
source.ip: '218.92.0.190'
source.port: '56745'
event.action: 'id-string-not-received'
source.ip: '218.92.0.190'
To differentiate between these two messages, notice that each of them starts with a different prefix. You can take advantage of this and use !PARSE.TRIE
expression.
!PARSE.TRIE
- 'Failed password for ': !PARSE.KVLIST
- event.action: 'password-failed'
- user.name: !PARSE.UNTIL ' '
- 'from '
- source.ip: !PARSE.UNTIL ' '
- 'port '
- source.port: !PARSE.DIGITS
- 'Did not receive identification string from ': !PARSE.KVLIST
- event.action: 'id-string-not-received'
- source.ip: !PARSE.CHARS
- '': !PARSE.KVLIST []
The !PARSE.TRIE expression tries to match the incoming string against the listed prefixes and performs the corresponding expression. The empty prefix '' is a fallback: if none of the listed prefixes match, the empty one is used.
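For comparison, the prefix-based dispatch can be sketched in plain Python; the regular expression below is only a stand-in for the parsing expressions, not the actual implementation:
import re

def parse_sshd(message):
    if message.startswith("Failed password for "):
        m = re.match(r"Failed password for (\S+) from (\S+) port (\d+)", message)
        return {
            "event.action": "password-failed",
            "user.name": m.group(1),
            "source.ip": m.group(2),
            "source.port": m.group(3),
        }
    if message.startswith("Did not receive identification string from "):
        return {
            "event.action": "id-string-not-received",
            "source.ip": message.rsplit(" ", 1)[-1],
        }
    return {}  # fallback, equivalent to the empty '' prefix

print(parse_sshd("Failed password for root from 218.92.0.190 port 56745 ssh2"))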
Parsing key-value pairs¶
Finally, aua
events have key-value pairs.
id="3005" severity="warn" sys="System" sub="auth" name="Authentication failed" srcip="43.139.111.88" host="" user="login" caller="sshd" reason="DENIED"
Desired output:
id: '3005'
severity: 'warn'
sys: 'System'
sub: 'auth'
name: 'Authentication failed'
srcip: '43.139.111.88'
host: ''
user: 'login'
caller: 'sshd'
reason: 'DENIED'
When encountering structured messages, you can use !PARSE.REPEAT
together with !PARSE.KV
.
The !PARSE.REPEAT expression performs the expression specified in the what parameter multiple times. In this case, repeat the following steps until it is no longer possible:
- Parse everything until '=' character and use it as a key.
- Parse everything between '"' characters and assign that value to the key.
- Optionally, omit spaces before the next key begins.
For that, we create the following expression:
!PARSE.KVLIST
- !PARSE.REPEAT
what: !PARSE.KV
- !PARSE.OPTIONAL { what: !PARSE.SPACE }
- key: !PARSE.UNTIL '='
- value: !PARSE.BETWEEN '"'
KV
in !PARSE.KV
stands for key-value. This expression takes a list of parsing expressions, including the keywords key
and value
.
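For illustration only, the same repeated key-value extraction can be approximated in Python with a regular expression; this is a sketch of the logic, not how the parser is implemented:
import re

def parse_kv(message):
    # key: everything up to '=', value: everything between double quotes
    return dict(re.findall(r'\s*([^=\s]+)="([^"]*)"', message))

sample = 'id="3005" severity="warn" sys="System" host="" user="login"'
print(parse_kv(sample))
# {'id': '3005', 'severity': 'warn', 'sys': 'System', 'host': '', 'user': 'login'}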
3. Mapping declaration¶
Mapping renames the keys so that they correspond to the ECS (Elastic Common Schema).
---
define:
type: parser/mapping
schema: /Schemas/ECS.yaml
mapping:
# 10_parser_header.yaml and 15_parser_process.yaml
'PRI': 'log.syslog.priority'
'TIMESTAMP': '@timestamp'
'HOSTNAME': 'host.hostname'
'PROCESS.NAME': 'process.name'
'PROCESS.PID': 'process.pid'
'MESSAGE': 'message'
# 20_parser_message.yaml
# httpd.yaml
'host.ip': 'host.ip'
'http.request.method': 'http.request.method'
'http.response.status_code': 'http.response.status_code'
'http.response.body.bytes': 'http.response.body.bytes'
# sshd.yaml
'event.action': 'event.action'
'user.name': 'user.name'
'source.ip': 'source.ip'
'source.port': 'source.port'
# aua.yaml
'sys': 'sophos.sys'
'host': 'sophos.host'
'user': 'sophos.user'
'caller': 'log.syslog.appname'
'reason': 'event.reason'
Mapping as a filter
Note that we must map fields from httpd.yaml
and sshd.yaml
files, although they are already in ECS format. The mapping processor also works as a filter. Any key you do not include in the mapping declaration is dropped from the event. This is the case for aua.yaml
, where some fields are not included in the mapping and are therefore dropped.
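To illustrate the rename-and-filter behavior, here is a minimal Python sketch that uses a subset of the mapping above:
MAPPING = {
    "PRI": "log.syslog.priority",
    "HOSTNAME": "host.hostname",
    "sys": "sophos.sys",
    "user": "sophos.user",
}

def apply_mapping(event):
    # Keys that are not present in MAPPING are dropped, so the mapping also acts as a filter.
    return {MAPPING[key]: value for key, value in event.items() if key in MAPPING}

event = {"PRI": "38", "HOSTNAME": "asgmtx", "sys": "System", "sub": "auth", "user": "login"}
print(apply_mapping(event))
# {'log.syslog.priority': '38', 'host.hostname': 'asgmtx', 'sophos.sys': 'System', 'sophos.user': 'login'}
# 'sub' is not listed in the mapping and is therefore dropped.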
4. Enricher declaration¶
The enricher will have this structure:
---
define:
type: parsec/enricher
enrich:
...
For the purpose of this example, the enricher will:
- Add the fields event.dataset and device.model.identifier, which will be "static" fields, always with the same value.
- Transform the field HOST.HOSTNAME to lowercase, host.hostname.
- Compute the syslog facility and severity from syslog priority, both with numeric and readable values.
Note that enrichers do not modify or delete existing fields unless you explicitly specify it in the declaration. You do so by declaring a field that already exists in the event; in that case, the field is simply replaced by the new value.
Enriching simple fields¶
To enrich the event with event.dataset
supplemented by device.model.identifier
:
event.dataset: "sophos"
device.model.identifier: "SG230"
For that, specify these fields in the enricher, and the fields will be added to the event every time.
---
define:
type: parsec/enricher
enrich:
event.dataset: "sophos"
device.model.identifier: "SG230"
Editing existing fields¶
You can perform some operations with already existing fields. In this case, the goal is to change HOST.HOSTNAME
to lowercase, host.hostname
. For that, use the following expression:
host.hostname: !LOWER
what: !GET {from: !ARG EVENT, what: host.hostname}
You can also change the field name. If you do it like this,
host.id: !LOWER
what: !GET {from: !ARG EVENT, what: host.hostname}
the output would include the original field host.hostname
as well as a new lowercase field host.id
.
Computing facility and severity from priority¶
Syslog severity and facility are computed from syslog priority by the formula:
PRIORITY = FACILITY * 8 + SEVERITY
There is a shortcut for faster computation that uses the fact that numbers are represented in binary format.
The shortcut allows the use of low level operations such as !SHR
(right shift) and !AND
.
Since 8 = 2^3, the integer quotient after dividing by 8 is obtained by shifting right by 3 bits. The integer 7 is 111 in binary, so applying the !AND operation with 7 gives the remainder after dividing by 8.
The expression is the following:
log.syslog.facility.code: !SHR { what: !GET { from: !ARG EVENT, what: log.syslog.priority }, by: 3 }
log.syslog.severity.code: !AND [ !GET { from: !ARG EVENT, what: log.syslog.priority }, 7 ]
Consider the number 38 to illustrate this concept. 38 is 100110 in binary representation. Dividing it by 8 is the same as shifting right by 3 places:
shr(100110, 3) = 000100
which is 4. So the value of FACILITY is 4, which corresponds to AUTH. Performing the !AND operation with 7 gives
and(100110, 000111) = 000110
which is 6. So the value of SEVERITY is 6, which corresponds to INFORMATIONAL.
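A quick, illustrative Python check of these bit operations (not part of the enricher declaration):
priority = 38
facility = priority >> 3   # integer division by 8
severity = priority & 7    # remainder after division by 8
print(facility, severity)  # 4 6 -> facility 'auth', severity 'information'
assert priority == facility * 8 + severity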
You can also match the numeric values of severity and facility with human-readable names using the !MATCH
expression. The complete declaration is the following:
---
define:
type: parsec/enricher
enrich:
# New fields
event.dataset: "sophos"
device.model.identifier: "SG230"
# Lowercasing the existing field
host.hostname: !LOWER
what: !GET {from: !ARG EVENT, what: host.hostname}
# SYSLOG FACILITY
log.syslog.facility.code: !SHR { what: !GET { from: !ARG EVENT, what: log.syslog.priority }, by: 3 }
log.syslog.facility.name: !MATCH
what: !GET { from: !ARG EVENT, what: log.syslog.facility.code }
with:
0: 'kern'
1: 'user'
2: 'mail'
3: 'daemon'
4: 'auth'
5: 'syslog'
6: 'lpr'
7: 'news'
8: 'uucp'
9: 'cron'
10: 'authpriv'
11: 'ftp'
16: 'local0'
17: 'local1'
18: 'local2'
19: 'local3'
20: 'local4'
21: 'local5'
22: 'local6'
23: 'local7'
# SYSLOG SEVERITY
log.syslog.severity.code: !AND [ !GET { from: !ARG EVENT, what: log.syslog.priority }, 7 ]
log.syslog.severity.name: !MATCH
what: !GET { from: !ARG EVENT, what: log.syslog.severity.code }
with:
0: 'emergency'
1: 'alert'
2: 'critical'
3: 'error'
4: 'warning'
5: 'notice'
6: 'information'
7: 'debug'
Ended: Parsing
Detections ↵
LogMan.io Correlator¶
TeskaLabs LogMan.io Correlator is a powerful, fast, scalable component of LogMan.io and TeskaLabs SIEM. As the Correlator makes detections possible, it is essential to effective cybersecurity.
The Correlator identifies specified activity, patterns, anomalies, and threats in real time as defined by detection rules. The Correlator works in your system's data stream, rather than on disk storage, making it a fast and uniquely scalable security mechanism.
What does the Correlator do?¶
The Correlator keeps track of events and when they happen in relation to a larger pattern or activity.
- First, you identify the pattern, threat, or anomaly you want the Correlator to monitor for. You write a detection that defines the activity, including which types of events (logs) are relevant and how many times an event needs to occur in a defined timeframe in order to trigger a response.
- The Correlator identifies the relevant incoming events, and organizes the events first by a specific attribute in the event (dimension), such as source IP address or user ID, then sorts the events into short time intervals so the number of events can be analyzed. The time intervals are also defined by the detection rule.
  Note: It's most common to use the Correlator's sum function to count events that occur in a specified timeframe. However, the Correlator can also analyze using other mathematical functions.
- The Correlator analyzes these dimensions and time intervals to see if the relevant events have happened in the desired timeframe. When the Correlator detects the activity, it triggers the response specified in the detection.
In other words, this microservice shares event statuses over time intervals and uses a sliding, or rolling, analysis window.
What is a sliding analysis window?
Using a sliding analysis window means that the Correlator can analyze multiple time intervals continuously. For example, when analyzing a period of 30 seconds, the Correlator shifts its analysis, which is a window of 30 seconds, to overlap previous analyses as time progresses.
This picture represents a single dimension, for example the analysis of events with the same source IP address. In a real detection rule, you'd have several rows of this table, one row for each IP address. More in the example below.
The sliding window makes it possible to analyze the overlapping 30-second timeframes 0:00-0:30
, 0:10-0:40
, 0:20-0:50
, and 0:30-0:60
, rather than just 0:00-0:30
and 0:30-0:60
.
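A minimal Python sketch of this idea, assuming 10-second intervals and a 30-second window as in the timeframes listed above (the event counts are invented sample data):
# Events counted per 10-second interval for a single dimension (e.g. one source IP)
intervals = [2, 5, 9, 8, 1, 0]   # 0:00-0:10, 0:10-0:20, 0:20-0:30, ...

window_size = 3                  # 3 intervals * 10 s = a 30-second analysis window
for start in range(len(intervals) - window_size + 1):
    total = sum(intervals[start:start + window_size])
    print(f"{start * 10}s-{start * 10 + 30}s: {total} events")
# Each analysis overlaps the previous one, because the window slides by one interval at a time.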
Example¶
Example scenario: You create a detection to alert you when 20 login attempts are made to the same user account within 30 seconds. Since this password entry rate is higher than most people could achieve on their own, this activity could indicate a brute force attack.
In order to detect this security threat, the Correlator needs to know two things:
- Which events are relevant. In this case, that means failed login attempts to the same user account.
- When the events (login attempts) happen in relation to each other.
Note: The following logs and images are heavily simplified to better illustrate the ideas.
1. These logs occur in the system:
What do these logs mean?
Each table you see above is a log for the event of a user having a single failed login attempt.
- log.ID: The unique log identifier, as seen in the table below
- timestamp: The time the event occurred
- username: The Correlator will analyze groups of logs from the same users, because it wouldn't be effective in this case to analyze login attempts across all users combined.
- event.message: The Correlator is only looking for failed logins, as would be defined by the detection rule.
2. The Correlator begins tracking the events in rows and columns:
- Username is the dimension, as defined by the detection rule, so each user has their own row.
- Log ID (A, B, C, etc.) is here in the table so you can see which logs are being counted.
- The number in each cell is how many events occurred in that time interval per username (dimension).
3. The Correlator continues keeping track of events:
You can see that one account is experiencing a higher volume of failed login attempts now.
4. At the same time, the Correlator is analyzing 30-second time periods with an analysis window:
The analysis window moves across the time intervals to count the total number of events in 30-second timeframes. You can see that when the analysis window reaches the 01:20-01:50
timeframe for the username anna.s.ample
, it will count more than 20 events. This would trigger a response from the Correlator, as defined by the detection (more on triggers here).
A gif to illustrate the analysis window moving
The 30-second analysis window "slides" or "rolls" along the time intervals, counting how many events occurred. When it finds 20 or more events in a single analysis, an action from the detection rule is triggered.
Memory and storage¶
The Correlator operates in the data stream, not in a database. This means that the Correlator is tracking events and performing analysis in real time as events occur, rather than pulling past collected events from a database to perform analysis.
In order to work in the data stream, the Correlator uses memory mapping, which allows it to function in the system's quickly accessible memory (RAM) rather than relying on disk storage.
Memory mapping provides significant benefits:
- Real-time detection: Data in RAM is more quickly accessible than data from a storage disk. This makes the Correlator very fast, allowing you to detect threats immediately.
- Simultaneous processing: Greater processing capacity allows the Correlator to run many parallel detections at once.
- Scalability: The volume of data in your log collection system will likely increase as your organization grows. The Correlator can keep up. Allocating additional RAM is faster and simpler than increasing disk storage.
- Persistence: If the system shuts down unexpectedly, the Correlator does not lose data. The Correlator's history is backed up to disk (SSD) often, so the data is available when the system restarts.
For more technical information, visit our Correlator reference documentation.
What is a detection?¶
A detection (sometimes called a correlation rule) defines and finds patterns and specific events in your data. A huge volume of event logs moves through your system, and detections help identify events and combinations of events that might be the result of a security breach or system error.
Important
- The possibilities for your detections depend on your Correlator configuration.
- All detections are written in TeskaLabs SP-Lang. There is a quick guide for SP-Lang in the window correlation example and additional detection guidelines.
What can detections do?¶
You can write detections to describe and find an endless combination of events and patterns, but these are common activities to monitor:
- Multiple failed login attempts: Numerous unsuccessful login attempts within a short period, often from the same IP address, to catch brute-force or password-spraying attacks.
- Unusual data transfer or exfiltration: Abnormal or large data transfers from inside the network to external locations.
- Port scanning: Attempts to identify open ports on network devices, which may be the precursor to an attack.
- Unusual hours of activity: User or system activities during non-business hours, which could indicate a compromised account or insider threat.
- Geographical anomalies: Logins or activities originating from unexpected geographical locations based on the user's typical behavior.
- Access to sensitive resources: Unauthorized or unusual attempts to access critical or sensitive files, databases, or services.
- Changes to critical system files: Unexpected changes to system and configuration files
- Suspicious email activity: Phishing emails, attachments with malware, or other types of malicious email content.
- Privilege escalation: Attempts to escalate privileges, such as a regular user trying to gain admin-level access.
Getting started¶
Plan your correlation rule carefully to avoid missing important events or drawing attention to irrelevant events. Answer the questions:
- What activity (events or patterns) do you want to detect?
- Which logs are relevant to this activity?
- What do you want to happen if the activity is detected?
To get started writing a detection, see this example of a window correlation and follow these additional guidelines.
Writing a window correlation-type detection rule¶
A window correlation rule is a highly versatile type of detection that can identify combinations of events over time. This example shows some of the techniques you can use when writing window correlations, but there are many more options, so this page gives you additional guidance.
Before you can write a new detection rule, you need to:
- Decide what activity you are looking for, and decide the timeframe in which this activity is notable.
- Identify which data source produces the logs that could trigger a positive detection, and identify what information those logs contain.
- Decide what you want to happen when the activity is detected.
Use TeskaLabs SP-Lang to write correlation rules.
Sections of a correlation rule¶
Include each of these sections in your rule:
- Define: Information that describes your rule.
- Predicate: The predicate section is a filter that identifies which logs to evaluate, and which logs to ignore.
- Evaluate: The evaluate section sorts or organizes data to be analyzed.
- Analyze: The analyze section defines and searches for the desired pattern in the data sorted by evaluate.
- Trigger: The trigger section defines what happens if there is a positive detection.
To better understand the structure of a window correlation rule, consult this example.
Comments
Include comments in your detection rules so that you and others can understand what each item in the detection rule does. Add comments on separate lines from code, and begin comments with hashtags #
.
Parentheses
Words in parentheses ()
are placeholders to show that there would normally be a value in this space. Correlation rules don't use parentheses.
Define¶
Always include in define
:
Item in the rule | How to include |
---|---|
name | Name the rule. While the name has no impact on the rule's functionality, it should still be a name that's clear and easy for you and others to understand. |
description | Describe the rule briefly and accurately. The description also has no impact on the rule's functionality, but it can help you and others understand what the rule is for. |
type: correlator/window | Include this line as-is. The type does impact the rule's functionality. The rule uses correlator/window to function as a window correlator. |
Predicate¶
The predicate
section is a filter. When you write the predicate
, you use SP-Lang expressions to structure conditions so that the filter "allows in" only logs that are relevant to the activity or pattern that the rule is detecting.
If a log meets the predicate's conditions, it gets analyzed in the next steps of the detection rule, alongside other related logs. If a log doesn't meet the predicate's conditions, the detection rule ignores the log.
See this guide to learn more about writing predicates.
Evaluate¶
Any log that passes through the filter in predicate
gets evaluated in evaluate
. The evaluate
section organizes the data so it can be analyzed. Usually, you can't spot a security threat (or other noteworthy patterns) based on just one event (for example, one failed login attempt), so you need to write detection rules to group events together to find patterns that point to security or operational issues.
The evaluate
section creates an invisible evaluation window - you can think of the window as a table. The table is what the analyze
section uses to detect the activity the detection rule is seeking.
You can see an example of the evaluate
and analyze
sections working together here.
Item in evaluate | How to include |
---|---|
dimension | dimension creates the rows in the table. In the table, the values of the specified fields are grouped into one row (see the table below). |
by | by creates the columns in the table. In most cases, @timestamp is the right choice because window correlation rules are based around time. So, each column in the table is an interval of time, which the resolution specifies. |
resolution | The resolution unit is seconds. Each time interval will be the number of seconds you specify. |
saturation | The saturation field sets how many times the trigger can be activated before the rule stops counting events in a single cell that caused the trigger (see the table below). With a recommended saturation of 1, relevant events that happen within the same specified timeframe (resolution) will stop being counted after one trigger. Setting the saturation to 1 prevents additional triggers for identical behavior in the same timeframe. |
Analyze¶
analyze
uses the table created by the evaluate
section to find out if the activity the detection rule is seeking has happened.
You can see an example of the evaluate
and analyze
sections working together here.
Item in analyze | How to include |
---|---|
window | The window analyzes a specified number of cells in the table created by the evaluate section, each of which represents logs in a specified timeframe. Hopping window: The window will count the values in cells, testing all adjacent combinations of cells to cover the specified time period, with overlap. A hopping window is recommended. Tumbling window: The window counts the values in cells, testing all adjacent combinations of cells to cover the specified time period, WITHOUT overlap. See the note below to learn more about hopping and tumbling windows. |
aggregate | The aggregate depends on the dimension. Use unique count to ensure that the rule won't count the same value of your specified field in dimension more than once. |
span | A span sets the number of cells in the table that will be analyzed at once. span multiplied by resolution is the timeframe in which the correlation rule looks for a pattern or behavior. (For example, 2*60 is a 2-minute timeframe.) |
test | The !GE expression means "greater than or equal to," and !ARG VALUE refers to the output value of the aggregate function. The value listed under !ARG VALUE is the number of unique occurrences of a value in a single analysis window that will trigger the correlation rule. |
Hopping vs. tumbling windows
This page about tumbling and hopping windows can help you understand the different types of analysis windows.
Trigger¶
After identifying the suspicious activity you specified, the rule can:
- Send the detection to Elasticsearch as a document. Then, you can see the detection as a log in TeskaLabs LogMan.io. You can create your own dashboard to display correlation rule detections, or find the logs in Discover.
- Send a notification via email
Visit the triggers page to learn about setting up triggers to create events, and go to the notifications page to learn about sending messages from detections.
Example of a window correlation detection rule¶
A window correlation rule is a type of detection that can identify combinations of events over time. Before using this example to write your own rule, visit these guidelines to better understand each part of the rule.
Like all detections, write window correlation rules in TeskaLabs SP-Lang.
Jump to: Define | Predicate | Evaluate | Analyze | Trigger
This detection rule is looking for a single external IP trying to access 25 or more unique internal IP addresses in 2 minutes. This activity could indicate an attacker trying to search the network infrastructure for vulnerabilities.
Note
Any line beginning with a hashtag (#) is a comment, not part of the detection rule. Add notes to your detection rules to help others understand the rules' purpose and function.
The complete detection rule using a window correlation:
define:
name: "Network T1046 Network Service Discovery"
description: "External IP accessing 25+ internal IPs in 2 minutes"
type: correlator/window
predicate:
!AND
- !OR
- !EQ
- !ITEM EVENT event.dataset
- "fortigate"
- !EQ
- !ITEM EVENT event.dataset
- "sophos"
- !OR
- !EQ
- !ITEM EVENT event.action
- "deny"
- !EQ
- !ITEM EVENT event.action
- "drop"
- !IN
what: source.ip
where: !EVENT
- !NOT
what:
!STARTSWITH
what: !ITEM EVENT source.ip
prefix: "193.145"
- !NE
- !ITEM EVENT source.ip
- "8.8.8.8"
- !IN
what: destination.ip
where: !EVENT
evaluate:
dimension: [tenant, source.ip]
by: "@timestamp"
resolution: 60
saturation: 1
analyze:
window: hopping
aggregate: unique count
dimension: destination.ip
span: 2
test:
!GE
- !ARG VALUE
- 25
trigger:
- event:
!DICT
type: "{str:any}"
with:
ecs.version: "1.10.0"
lmio.correlation.depth: 1
lmio.correlation.name: "Network T1046 Network Service Discovery"
# Events
events: !ARG EVENTS
# Threat description
# https://www.elastic.co/guide/en/ecs/master/ecs-threat.html
threat.framework: "MITRE ATT&CK"
threat.software.platforms: "Network"
threat.indicator.sightings: !ARG ANALYZE_RESULT
threat.indicator.confidence: "Medium"
threat.indicator.ip: !ITEM EVENT source.ip
threat.indicator.port: !ITEM EVENT source.port
threat.indicator.type: "ipv4-addr"
threat.tactic.id: "TA0007"
threat.tactic.name: "Discovery"
threat.tactic.reference: "https://attack.mitre.org/tactics/TA0007/"
threat.technique.id: "T1046"
threat.technique.name: "Network Service Discovery"
threat.technique.reference: "https://attack.mitre.org/techniques/T1046/"
# Identification
event.kind: "alert"
event.dataset: "correlation"
source.ip: !ITEM EVENT source.ip
Define¶
define:
name: "Network T1046 Network Service Discovery"
description: "External IP accessing 25+ internal IPs in 2 minutes"
type: correlator/window
Item in the rule | What does it mean? |
---|---|
name | This is the name of the rule. The name is for the users and has no impact on the rule itself. |
description | The description is also for the users. It describes what the rule does, but it has no impact on the rule itself. |
type: correlator/window | The type does impact the rule. The rule uses correlator/window to function as a window correlator. |
Predicate¶
predicate
is the filter that checks if an incoming log might be related to the event that the detection rule is searching for.
The predicate is made of SP-Lang expressions. The expressions create conditions. If the expression is "true," the condition is met. The filter checks the incoming log to see if the log makes the predicate's expressions "true" and therefore meets the conditions.
If a log meets the predicate's conditions, it gets analyzed in the next steps of the detection rule, alongside other related logs. If a log doesn't meet the predicate's conditions, the detection rule ignores the log.
You can find the full SP-Lang documentation here.
SP-Lang terms, in the order they appear in the predicate
Expression | Meaning |
---|---|
!AND |
ALL of the criteria nested under !AND must be met for the !AND to be true |
!OR |
At least ONE of the criteria nested under !OR must be met for the !OR to be true |
!EQ |
"Equal" to. Must be equal to, or match the value, to be true |
!ITEM EVENT |
Gets information from the content of the incoming logs (accesses the fields and values in the incoming logs) |
!IN |
Looks for a value in a set of values (what in where ) |
!NOT |
Seeks the opposite of the expression nested under the !NOT (following what ) |
!STARTSWITH |
The value of the field (what ) must start with the specified text (prefix ) to be true |
!NE |
"Not equal" to, or doesn't equal. Must NOT equal (must not match the value) to be true |
You can see that there are several expressions nested under !AND
. A log must meet ALL of the conditions nested under !AND
to pass through the filter.
As seen in rule | What does it mean? |
---|---|
|
This is the first !OR expression, and it has two !EQ expressions nested under it, so at least ONE !EQ condition nested under this !OR must be true. Remember, !ITEM EVENT gets the value of the field it specifies. If the incoming log has "fortigate" OR "sophos" in the field event.dataset , then the log meets the !OR condition.
This filter accepts events only from the FortiGate and Sophos data sources. FortiGate and Sophos provide security tools such as firewalls, so this rule is looking for events generated by security tools that might already be intercepting suspicious activity. |
|
This condition is structured the same way as the previous one. If the incoming log has the value "deny" OR "drop" in the field event.action , then the log meets this !OR condition.
The values "deny" and "drop" in a log both signal that a security device, such as a firewall, blocked attempted access based on authorization or security policies. |
|
If the field source.ip exists in the incoming log (!EVENT ), then the log meets this !IN condition.
The field source.ip is the IP address that is trying to gain access to another IP address. Since this rule is specifically about IP addresses, the log needs to have the source IP address in it to be relevant.
|
|
If the value of the field source.ip DOES NOT begin with "193.145," then this !NOT expression is true. 193.145 is the beginning of this network's internal IP addresses, so the !NOT expression filters out internal IP addresses. This is because internal IPs accessing many other internal IPs in a short timeframe would not be suspicious. If internal IPs were not filtered out, the rule would return false positives.
|
|
If the incoming log DOES NOT have the value "8.8.8.8" in the field source.ip , then the log meets this !NE condition.
The rule filters out 8.8.8.8 as a source IP address because it is a well-known and trusted DNS resolver operated by Google. 8.8.8.8 is not generally associated with malicious activity, so not excluding it would trigger false positives in the rule. |
|
If the field destination.ip exists in the incoming log, then the log meets this !IN condition.
The field destination.ip is the IP address that is being accessed. Since this rule is specifically about IP addresses, the log needs to have the destination IP address in it to be relevant.
|
If an incoming log meets EVERY condition shown above (nested under !AND
), then the log gets evaluated and analyzed in the next sections of the detection rule.
Evaluate¶
Any log that passes through the filter in predicate
gets evaluated in evaluate
. The evaluate
section organizes the data so it can be analyzed. Usually, you can't spot a security threat (or other noteworthy patterns) based on just one event (for example, one failed login attempt), so the detection rule groups events together to find patterns that point to security or operational issues.
The evaluate
section creates an invisible evaluation window - you can think of the window as a table. The table is what the analyze
section uses to detect the event the detection rule is seeking.
evaluate:
dimension: [tenant, source.ip]
by: "@timestamp"
resolution: 60
saturation: 1
As seen in rule | What does it mean? |
---|---|
dimension: [tenant, source.ip] | dimension creates the rows in the table. The rows are tenant and source.ip. In the final table, the values of tenant and source.ip are grouped into one row (see the table below). |
by: "@timestamp" | by creates the columns in the table. It refers to the field @timestamp because the values from that field enable the rule to compare the events over time. So, each column is an interval of time, which the resolution specifies. |
resolution: 60 | The resolution unit is seconds, so the value here is 60 seconds. Each time interval will be 60 seconds long. |
saturation: 1 | The saturation field sets how many times the trigger can be activated before the rule stops counting events in a single cell that caused the trigger (see the table below). Since the saturation is 1, this means that relevant events that happen within one minute of each other will stop being counted after one trigger. Setting the saturation to 1 prevents additional triggers for identical behavior in the same timeframe. In this example, the trigger would be activated only once if an external IP address tried to access any number of unique internal IPs above 25. |
This is an example of how the evaluate
section sorts logs that pass through the predicate
filter. (Click the table to enlarge.) The log data is heavily simplified for the sake of readability (for example, log IDs in the field _id
are letters rather than real log IDs, and the timestamps are shortened).
As specified by the dimension
field, the logs are grouped by tenant and source IP address, as you can see in cells A2-A5.
Since by
has the value timestamp
, and the resolution
is set to 60 seconds, the cells B1-E1 are time intervals, and the logs are sorted into the columns by their timestamp
value.
The number beside the list of log IDs in each cell (for example, 14 in cell C4) is the count of how many logs with the same source IP address passed through the filter in that timeframe. This becomes essential information in the analyze
section of the rule, since we're counting access attempts by external IPs.
Analyze¶
analyze
uses the table created by the evaluate
section to find out if the event the detection rule is seeking has happened.
analyze:
window: hopping
aggregate: unique count
dimension: destination.ip
span: 2
test:
!GE
- !ARG VALUE
- 25
As seen in rule | What does it mean? |
---|---|
window: hopping | The window type is hopping. The window analyzes a specified number of cells in the table created by the evaluate section, each of which represents logs in a timeframe of 60 seconds. Since the type is hopping, the window will count some cells twice to test any adjacent combination of a two-minute time period. Since the span is set to 2, the rule will analyze two minutes (cells) at a time, with overlap. |
aggregate: unique count, dimension: destination.ip | The aggregate depends on the dimension. Here, unique count applies to destination.ip. This ensures that the rule won't count the same destination IP address more than once. |
span: 2 | A span of 2 means that the cells in the table will be analyzed 2 at a time. |
test | The !GE expression means "greater than or equal to," and !ARG VALUE refers to the output value of the aggregate function. The value 25 is listed under !ARG VALUE, so this whole test expression is testing for 25 or more unique destination IP addresses in a single analysis window. |
The red window around cells C4 and D4 shows that the rule has detected what it's looking for - attempted connection to 25 unique IP addresses.
Analysis with a hopping window explained in a gif
This illustrates how the window analyzes the data two cells at a time. When the window gets to cells C4 and D4, it detects 25 unique destination IP addresses.
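The same analysis can be sketched in Python, assuming 60-second cells, a span of 2, and the unique count aggregate; the destination IP sets are invented sample data:
# Unique destination IPs observed per 60-second cell for one source IP (invented sample data)
cells = [
    {"10.0.0.1", "10.0.0.2"},                # minute 1: 2 destinations
    {f"10.0.0.{i}" for i in range(1, 15)},   # minute 2: 14 destinations
    {f"10.0.0.{i}" for i in range(10, 26)},  # minute 3: 16 destinations
]

span = 2        # analyze two adjacent cells (2 minutes) at a time
threshold = 25  # the !GE test value

for start in range(len(cells) - span + 1):
    unique_destinations = set().union(*cells[start:start + span])
    if len(unique_destinations) >= threshold:
        print(f"trigger: {len(unique_destinations)} unique destination IPs in cells {start} and {start + 1}")
# Only the overlapping window covering minutes 2 and 3 reaches 25 unique destinations.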
Trigger¶
The trigger
section defines what happens if the analyze
section detects the event that the detection rule is looking for. In this case, the trigger is activated when a single external IP address attempts to connect to 25 or more different internal IP addresses.
As seen in rule | What does it mean? |
---|---|
event | In the trigger, event means that the rule will create an event based on this positive detection and send it into the data pipeline via Elasticsearch, where it is stored as a document. Then, the event comes through to TeskaLabs LogMan.io, where you can see this event in Discover and in dashboards. |
!DICT, type, with | !DICT creates a dictionary of keys (fields) and values. type has "{str:any}" so that any type of value (numbers, words, etc.) can be a value in a key-value pair. with begins the list of key-value pairs, which you define. These are the fields and values that the event will be made of. |
To learn more about each field, click the icons. Since TeskaLabs LogMan.io uses Elasticsearch and the Elastic Common Schema (ECS), you can get more details about many of these fields in the ECS reference guide.
trigger:
- event:
!DICT
type: "{str:any}"
with:
ecs.version: "1.10.0" #(1)
lmio.correlation.depth: 1 #(2)
lmio.correlation.name: "Network T1046 Network Service Discovery" #(3)
# Events
events: !ARG EVENTS #(4)
# Threat description
# https://www.elastic.co/guide/en/ecs/master/ecs-threat.html
threat.framework: "MITRE ATT&CK" #(5)
threat.software.platforms: "Network" #(6)
threat.indicator.sightings: !ARG ANALYZE_RESULT #(7)
threat.indicator.confidence: "Medium" #(8)
threat.indicator.ip: !ITEM EVENT source.ip #(9)
threat.indicator.port: !ITEM EVENT source.port #(10)
threat.indicator.type: "ipv4-addr" #(11)
threat.tactic.id: "TA0007" #(12)
threat.tactic.name: "Discovery" #(13)
threat.tactic.reference: "https://attack.mitre.org/tactics/TA0007/" #(14)
threat.technique.id: "T1046" #(15)
threat.technique.name: "Network Service Discovery" #(16)
threat.technique.reference: "https://attack.mitre.org/techniques/T1046/" #(17)
# Identification
event.kind: "alert" #(18)
event.dataset: "correlation" #(19)
source.ip: !ITEM EVENT source.ip #(20)
- The version of the Elastic Common Schema that this event conforms to - required field that must exist in all events going to Elasticsearch.
- The correlation depth indicates if this rule depends on any other rules or is in a chain of rules. The value 1 means that it is either the first in a chain, or the only rule involved - it doesn't depend on any other rules.
- The name of the rule
- In SP-Lang,
!ARG EVENTS
accesses the original logs. So, this will list the IDs of all of the events that make up this positive detection, so that you can investigate each log individually. - Name of the threat framework used to further categorize and classify the tactic and technique of the reported threat. See ECS reference.
- The platforms of the software used by this threat to conduct behavior commonly modeled using MITRE ATT&CK®. See ECS reference.
- Number of times this indicator was observed conducting threat activity. See ECS reference.
- Identifies the vendor-neutral confidence rating using the None/Low/Medium/High scale defined in Appendix A of the STIX 2.1 framework. See ECS reference.
- Identifies a threat indicator as an IP address (irrespective of direction). See ECS reference.
- Identifies a threat indicator as a port number (irrespective of direction). See ECS reference.
- Type of indicator as represented by Cyber Observable in STIX 2.0. See ECS reference.
- The id of tactic used by this threat. See ECS reference.
- The name of the type of the tactic used by this threat. See ECS reference.
- The reference url of tactic used by this threat. See ECS reference.
- The id of technique used by this threat. See ECS reference.
- The name of technique used by this threat. See ECS reference.
- The reference url of technique used by this threat. See ECS reference.
- The type of event
- The dataset that this event will be grouped in.
- The source IP address associated with this event (the one that tried to access 25 internal IPs in two minutes)
Ended: Detections
Predicates¶
A predicate
is a filter made of conditions formed by SP-Lang expressions.
How to write predicates¶
Before you can create a filter, you need to know the possible fields and values of the logs you are looking for. To see what fields and values your logs have, go to Discover in the TeskaLabs LogMan.io web app.
SP-Lang expressions¶
Construct conditions for the filter using SP-Lang expressions. The filter checks the incoming log to see if the log makes the expressions "true" and therefore meets the conditions.
You can find the full SP-Lang documentation here.
Common SP-Lang expressions:
Expression | Meaning |
---|---|
!AND |
ALL of the criteria nested under !AND must be met for the !AND to be true |
!OR |
At least ONE of the criteria nested under !OR must be met for the !OR to be true |
!EQ |
"Equal" to. Must be equal to, or match the value, to be true |
!NE |
"Not equal" to, or doesn't equal. Must NOT equal (must not match the value) to be true |
!IN |
Looks for a value in a set of values (what in where ) |
!STARTSWITH |
The value of the field (what ) must start with the specified text (prefix ) to be true |
!ENDSWITH |
The value of the field (what ) must end with the specified text (postfix ) to be true |
!ITEM EVENT |
Gets information from the content of the incoming logs (allows the filter to access the fields and values in the incoming logs) |
!NOT |
Seeks the opposite of the expression nested under the !NOT (following what ) |
Conditions¶
Use this guide to structure your individual conditions correctly.
Parentheses
Words in parentheses ()
are placeholders to show where values go. SP-Lang does not use parentheses.
Filter for a log that: | SP-Lang |
---|---|
Has a specified value in a specified field |
|
Has a specified field |
|
Does NOT have a specified value in a specified field |
|
Has one of multiple possible values in a field |
|
Has a specified value that begins with a specified number or text (prefix), in a specified field |
|
Has a specified value that ends with a specified number or text (postfix), in a specified field |
|
Does NOT satisfy a condition or set of conditions |
|
Example¶
To learn what each expression means in the context of this example, click the icons.
!AND #(1)
- !OR #(2)
- !EQ
- !ITEM EVENT event.dataset
- "sophos"
- !EQ
- !ITEM EVENT event.dataset
- "vmware-vcenter"
- !OR #(3)
- !EQ
- !ITEM EVENT event.action
- "Authentication failed"
- !EQ
- !ITEM EVENT event.action
- "failed password"
- !EQ
- !ITEM EVENT event.action
- "unsuccessful login"
- !OR #(4)
- !IN
what: source.ip
where: !EVENT
- !IN
what: user.id
where: !EVENT
- !NOT #(5)
what:
!STARTSWITH
what: !ITEM EVENT user.id
prefix: "harry"
- Every expression nested under !AND must be true for a log to pass through this filter.
- In the log, in the field event.dataset, the value must be sophos or vmware-vcenter for this !OR to be true.
- In the log, in the field event.action, the value must be Authentication failed, failed password, or unsuccessful login for this !OR to be true.
- The log must contain the field source.ip or the field user.id for this !OR to be true.
- In the log, the field user.id must not begin with harry for this !NOT to be true.
This filters for logs that:
- Have the value sophos or vmware-vcenter in the field event.dataset AND
- Have the value Authentication failed, failed password, or unsuccessful login in the field event.action AND
- Include at least one of the fields source.ip or user.id AND
- Do not have a value that begins with harry in the field user.id
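For comparison only, the same filter logic could be sketched in ordinary Python (detections themselves must be written in SP-Lang):
def predicate(log):
    return (
        log.get("event.dataset") in ("sophos", "vmware-vcenter")
        and log.get("event.action") in ("Authentication failed", "failed password", "unsuccessful login")
        and ("source.ip" in log or "user.id" in log)
        and not str(log.get("user.id", "")).startswith("harry")
    )

print(predicate({"event.dataset": "sophos", "event.action": "failed password", "source.ip": "10.0.0.7"}))  # True
print(predicate({"event.dataset": "sophos", "event.action": "failed password", "user.id": "harry.h"}))      # False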
For more ideas and formatting tips, see this example in the context of a detection rule, including details about the predicate
section.
Triggers¶
A trigger, in an alert or detection, executes an action. For example, in a detection, the trigger
section can send an email when the specified activity is detected.
A trigger can:
- Trigger an event: Send an event to Elasticsearch where it is stored as a document. Then, you can see the event as a log in the TeskaLabs LogMan.io app. You can create your own dashboard to display correlation rule detections, or find the logs in Discover.
- Trigger a notification: Send a message via email
Trigger an event¶
You can trigger an event. The end result is that the trigger creates a log of the event, which you can see in TeskaLabs LogMan.io.
Item in trigger | How to include |
---|---|
event | In the trigger, event means that the rule will create an event based on this positive detection and send it into the data pipeline via Elasticsearch, where it is stored as a document. Then, the event comes through to TeskaLabs LogMan.io, where you can see this event in Discover and Dashboards. |
!DICT, type, with | !DICT creates a dictionary of keys (fields) and values. type has "{str:any}" so that any type of value (numbers, words, etc.) can be a value in a key-value pair. with begins the list of key-value pairs, which you define. These are the fields and values that the event will be made of. |
Following with
, make a list of the key-value pairs, or fields and values, that you want in the event.
!DICT
type: "{str:any}"
with:
key.1: "value"
key.2: "value"
key.3: "value"
key.4: "value"
If you're using Elasticsearch and therefore the Elastic Common Schema (ECS), you can read about standard fields in the ECS reference guide.
Trigger a notification¶
Notifications send messages. Currently, you can use notifications to send emails.
Learn more about writing notifications and creating email templates.
Notifications ↵
Notifications¶
Notifications send messages. You can add a notification
section anywhere that you want the output of a trigger
to be a message, such as in an alert or detection. In a detection, the notification
section can send a message when the specified activity (such as a potential threat) is detected.
TeskaLabs LogMan.io uses TeskaLabs ASAB Iris, a TeskaLabs microservice, to send messages.
Warning
To avoid notification spam, only use notifications for highly urgent and well-tested detection rules. Some detections are better suited to be sent as events through Elasticsearch and viewed in the LogMan.io web app.
Notification types¶
Currently, you can send messages via email.
Sending notifications via email¶
Write notifications in TeskaLabs SP-Lang. If you're writing a notification for a detection, write the email notification in the trigger
section.
Important
For notifications that send emails, you need to create an email template in the Library to connect with. This template includes the actual text that the recipient will see, with blank fields that change based on what the detected activity is (using Jinja templating), including which logs are involved in the detection, and any other information you choose. The notification section in the detection rule is what populates the blank fields in the email template. You can use a single email template for multiple detection rules.
Example:
Use this example as a guide. Click the icons to learn what each line means.
trigger: #(1)
- notification: #(2)
type: email #(3)
template: "/Templates/Email/Notification.md" #(4)
to: [email@example.com] #(5)
variables: #(6)
!DICT #(7)
type: "{str:any}" #(8)
with: #(9)
name: Notification from the detection X #(10)
events: !ARG EVENTS #(11)
address: !ITEM EVENT client.address #(12)
description: Detection of X by TeskaLabs LogMan.io #(13)
-
Indicates the beginning of the
trigger
section. -
Indicates the beginning of the
notification
section. -
To send an email, write email for
type
. -
This tells the notification where to get the email template from. You need to specify the filepath (or location) of the email template in the Library. In this example, the template is in the Library, in the Templates folder, in the Email subfolder, and it’s called Notification.md.
-
Write the email address where you want the email to go.
-
Begins the section that gives directions for how to fill the blank fields from the email template.
-
An SP-Lang expression that creates a dictionary so you can use key-value pairs in the notification. (The key is the first word, and the value is what follows.) Always include
!DICT
. -
Always make type "{str:any}" so that the values in the key-value pairs can be in any format (numbers, words, arrays, etc.).
-
Always include
with
, because it begins the list of fields from the email template. Everything nested underwith
is a field from the email template. -
The name of the detection rule, which should be understandable to the recipient
-
events
is the key, or field name, and!ARG EVENTS
is an SP-Lang expression that lists the logs that caused a positive detection from the detection rule. -
address
is the key, or field name, and!ITEM EVENT client.address
gets the value of the fieldclient.address
from each log that caused a positive detection from the detection rule. -
Your description of the event, which needs to be very clear and accurate
Populating the email template
name
, events
, address
, and description
are fields in the email template in this example. Always make sure that the keys you write in the with
section match the fields in your email template.
The fields name
and description
are static text values - they stay the same in every notification.
The fields events
and address
are dynamic values - they change based on which logs caused a positive detection from the detection rule. You can write dynamic fields using TeskaLabs SP-Lang.
Refer to our directions for creating email templates to write templates that work correctly as notifications.
Creating email templates¶
An email template is a document that works with a notification
to send an email, for example as a result of a positive detection in a detection rule. Jinja template fields allow the email template to have dynamic values that change based on variables such as events involved in a positive detection. (After you learn about creating email templates, learn how to use Jinja template fields.)
The email template provides the text that the recipient sees when they get an email from the notification. You can find email templates in your Library in the Templates folder.
When you write an email template to go with a notification, make sure that the template fields in each item match.
How do the notification and email template work together?
TeskaLabs ASAB Iris is a message-sending microservice that pairs the notification and the email template to send emails with populated placeholder fields.
Creating an email template¶
Create a new blank email template
- In the Library, click Templates, then click Email.
- To the right, click Create new item in Email.
- Name your template, choose the file type, and click Create. If the new item doesn't appear immediately, refresh the page.
- Now, you can write the template.
Copy an existing email template
- In the Library, click Templates, then click Email.
- Click on the existing template you'd like to copy. The copy you create will be placed in the same folder as the original.
- Click the icon at the top of the screen, and click Copy.
- Rename the file, choose the file type, and click Copy. If the new item doesn't appear immediately, refresh the page.
- Click Edit to make changes, and click Save to save your changes.
To exit editing mode, save by clicking Save or cancel by clicking Cancel.
Writing an email template¶
You can write email templates in Markdown or in HTML. Markdown is less complex, but HTML gives you more formatting options.
When you write the text, make sure to tell the recipient:
- Who the email is from
- Why they are receiving this email
- What the email/alert means
- How to investigate or follow up on the problem - include all of the relevant and useful information, such as log IDs or direct links to view selected logs
Simple template example using Markdown:
SUBJECT: {{ name }}
TeskaLabs LogMan.io has identified a noteworthy event in your IT infrastructure which might require your immediate attention.
Please review following summary of the event:
Event: {{name}}
Event description: {{description}}
This notification has been created based on the original log/logs:
{% for event in events %}
- {{event}}
{% endfor %}
The notification was generated for this address: {{address}}
We encourage you to review this incident promptly to determine the next appropriate course of action.
Remember, the effectiveness of any security program lies in a swift response.
Thank you for your attention to this matter.
Stay safe,
TeskaLabs LogMan.io
Made with <3 by [TeskaLabs](https://teskalabs.com)
The words in double braces (such as {{address}}
) are template fields, or placeholders. These are the Jinja template fields that pull information from the notification
section in a detection rule. Learn about Jinja templates here.
Testing an email template¶
You can test an email template using the Test template feature. Testing an email template means sending a real email to see if the format and fields are displaying correctly. This test does not interact with the detection rule at all.
Fill out the From, To, CC, BCC, and Subject fields the same way you would for any email (but it's best practice to send the email to yourself). You must always fill in, at minimum, the From and To fields.
Test parameters¶
You can populate the Jinja template fields for testing purposes using the Parameters tool. Write the parameters in JSON. JSON uses key-value pairs. Keys are the fields in the template, and values are what populate the fields.
In this example, the keys and values are highlighted to show that the keys in Parameters need to match the fields in the template, and the values will populate the fields in the resulting email:
Parameters has two editing modes: the text editor and the clickable JSON editor. To switch between modes, click the <···> or icon beside Parameters. You can switch between modes without losing your work.
Clickable editor¶
To switch to the clickable JSON editor, click the <···> icon beside Parameters. The clickable editor formats your parameters for you and tells you the value type for each item.
How to use the clickable editor:
In the clickable editor, edit, delete, and add icons appear when you hover over lines and items.
1. Add a key: When you hover over the top line (it says the number of keys you have, for example 0 items), an add icon appears. To add a parameter, click the add icon. It prompts you for the key name. Type the key name (the field name you want to test) and click the icon to save. Don't use quotation marks - the editor adds the quotation marks for you. The key name appears with the value NULL beside it.
2. Add a value: To edit the value, click the edit icon that appears when you hover beside NULL. Type the value (what you want to appear in place of the field/placeholder in the email you send), and save by clicking the icon.
3. To add more key-value pairs, click the add icon that appears when you hover over the top line.
4. To delete an item, click the delete icon that appears when you hover over the item. To edit an item, click the edit icon that appears when you hover over the item.
Text editor¶
To switch to the text editor, click the icon beside Parameters.
Example of parameter formatting:
{
"name":"Detection rule 1",
"description":"Description of Detection rule 1",
"events":["log-ID-1", "log-ID-2", "log-ID-3"],
"address":"Example address"
}
Quick JSON tips
- Begin and end the parameters with braces (curly brackets)
{}
- Write every item, both keys and values, in quotation marks
""
- Link keys to their values with a colon
:
(for example:"key":"value"
) - Separate key-value pairs with commas
,
. You can also use spaces and line breaks for your own readability - they'll be ignored in terms of function. - Type arrays in brackets
[]
and separate items with commas (the keyevents
might have multiple values, as the Jinjafor
expression allows for, so here it's written as an array)
The testing box tells you if the parameters are not in a valid JSON format.
Switching modes¶
You can switch modes and continue editing your parameters. The Parameters tool will automatically convert your work for the new mode.
Note about arrays
An array is a list of multiple values. To edit an array value in the clickable editor, you need to type at least two values manually in the text editor in the correct array format (see Quick JSON tips above). Then, you can switch to the clickable editor and add more items to the array.
Sending the test email¶
When you're ready to test the email, click Send. You should receive the email in the inbox of the addressee in the To field, where you can check the formatting of the template. If you don't see the email, check your spam folder.
Jinja templating¶
The notification
section of a detection rule works with an email template to send a message when the detection rule is triggered. The email template has placeholder fields, and the notification determines what fills those placeholder fields in the actual email that the recipient gets. This is possible because of Jinja templating. (Learn about writing email templates before you learn about Jinja fields.)
Format¶
Format all Jinja template fields with two braces (curly brackets) on each side of the field name in both Markdown and HTML email templates. You can use or not use a space on either side of the field name.
{{fieldname}}
OR {{ fieldname }}
For a more in-depth explanation of Jinja templating, visit this tutorial.
if
expression¶
You might want to use the same email template for multiple detection rules. Since different detection rules might have different data included, some parts of your email might only be relevant for some detection rules. You can use if
to include a section only if a certain key in the notification template has a value. This helps you avoid unpopulated template fields or nonsensical text in an email.
In this example, anything between if
and endif
is only included in the email if the key sender
has a value in the notification section of the detection rule. (If there is no value for sender
, this section won't appear in the email.)
{% if sender %}
The email address {{ sender }} has sent a suspicious number of emails.
{% endif %}
For more details, visit this tutorial.
for expression¶
Use for
when you might have multiple values from the same category that you want to appear as a list in your email.
In this example, events
is the actual template field that you'd see in the notification, and it might contain multiple values (in this case, multiple log IDs). Here, log
is just a temporary variable used only in this for
expression to represent one value that the notification sends from the field events
. (This temporary variable could be any word, as it refers only to itself in the email template.) The for
expression allows the template to display these multiple values as a bulleted list (multiple instances).
{% for log in events %}
- {{ log }}
{% endfor %}
For more details, visit this tutorial.
Link templating¶
Thanks to TeskaLabs ASAB Iris, you can include links in your emails that change based on tenant or events detected by the rule.
Link to a tenant's home page:
{{lmio_url}}/?tenant={{tenant}}#/
tenant
in your detection rule notification
section for the link to work.
Link to a specific log:
[{{event}}]({{lmio_url}}/?tenant={{tenant}}#/discover/lmio-{{tenant}}-events?aggby=minute&filter={{event}}&ts=now-2d&te=now&refresh=off&size=40)
tenant
or lmio_url
in your detection rule notification
section for the link to work.
Using Base64 images in HTML email templates¶
To hardcode an image into an email template written in HTML, use Base64. Converting an image to Base64 makes the image into a long string of text.
- Use an image converting tool (such as this one by Atatus) to convert your image to Base64.
- Using the image <img> tag with alt text (the alt attribute), copy and paste the Base64 string into your template like this:
<img alt="ALT TEXT HERE" src="PASTE HERE"/>
Note
The alt text is optional, but it is recommended in case your image doesn't load for any reason.
Ended: Notifications
Ended: Analyst Manual
Administration Manual ↵
TeskaLabs LogMan.io Administration Manual¶
Welcome to the Administration Manual. Use this guide to set up and configure LogMan.io for yourself or clients.
Installation¶
TeskaLabs LogMan.io can be installed manually on compute resources. Compute resources include physical servers, virtual servers, private and public cloud compute/VM instances, and so on.
Danger
TeskaLabs LogMan.io CANNOT BE operated under root
user (superuser). Violation of this rule may lead to significant cybersecurity risks.
Prerequisites¶
- Hardware (physical or virtualized server)
- OS Linux: Ubuntu 22.04 LTS and 20.04 LTS, RedHat 8 and 7, CentOS 7 and 8 (for others, kindly contact our support)
- Network connectivity with enabled outgoing access to the Internet (could be restricted after the installation); details are described here
- Credentials to SMTP server for outgoing emails
- DNS domain, even internal (needed for HTTPS setup)
- Credentials to "docker.teskalabs.com" (contact our support if you don't have one)
From Bare Metal server to the Operating system¶
Note
Skip this section if you are installing on a virtual machine, or on a host with the operating system already installed.
Prerequisites¶
- The server that conforms to prescribed data storage organisation.
- Bootable USB stick with Ubuntu Server 22.04 LTS; the most recent release.
- Access to the server equipped with a monitor and a keyboard; alternatively over IPMI or equivalent Out-of-band management.
- Network connectivity with enabled outgoing access to the Internet.
Note
These are additional prerequisites on top of the general prerequisites from above.
Steps¶
1) Boot the server using a bootable USB stick with Ubuntu Server.
Insert the bootable USB stick into the USB port of the server, then power on the server.
Use UEFI partition on the USB stick as a boot device.
Select "Try or Install Ubuntu Server" in a boot menu.
2) Select "English" as the language
3) Update to the new installer if needed
4) Select the english keyboard layout
5) Select the "Ubuntu Server" installation type
6) Configure the network connection
This is the network configuration for installation purposes; the final network configuration can be different.
If you are using DHCP server, the network configuration is automatic.
IMPORTANT: The Internet connectivity must be available.
Note the IP address of the server for a future use.
7) Skip or configure the proxy server
Skip (press "Done") the proxy server configuration.
8) Confirm selected mirror address
Confirm the selected mirror address by pressing "Done".
9) Select "Custom storage layout"
The custom storage layout of the system storage is as follows:
Mount | Size | FS | Part. | RAID / Part. | VG / LV |
---|---|---|---|---|---|
/boot/efi | 1G | fat32 | 1 | | |
SWAP | 64G | | 2 | | |
/boot | 2G | ext4 | 3 | md0 / 1 | |
/ | 50G | ext4 | 3 | md0 / 2 | systemvg / rootlv |
/var/log | 50G | ext4 | 3 | md0 / 2 | systemvg / loglv |
Unused | >100G | | 3 | md0 / 2 | systemvg |
Legend:
- FS: Filesystem
- Part.: GUID Partition
- RAID / Part.: MD RAID volume and a partition on the given RAID volume
- VG: LVM Volume Group
- LV: LVM Logical Volume
Note
Unused space will be used later in the installation, e.g. for Docker containers.
10) Identify two system storage drives
The two system storage drives are structured symmetrically to provide redundancy in case of a single system drive failure.
Note
The fast and slow storage is NOT configured here during the OS installation but later from the installed OS.
11) Set the first system storage as a primary boot device
This step will create a first GPT partition with UEFI, that is mounted at /boot/efi
.
The size of this partition is approximately 1GB.
12) Set the second system storage as a secondary boot device
Another UEFI partition is created on the second system storage.
13) Create SWAP partitions on both system storage drives
On each of the two drives, add a GPT partition with size 64G and format swap.
Select "free space" on respective system storage drive and then "Add GPT Partition"
Resulting layout is as follows:
14) Create the GPT partition for RAID1 on both system storage drives
On each of the two drives, add a GPT partition using all the remaining free space. The format is "Leave unformatted" because this partition will be added to the RAID1 array. You can leave "Size" blank to use all the remaining space on the device.
The result is "partition" entry instead of the "free space" on respective drives.
15) Create software RAID1
Select "Create software RAID (md)".
The name of the array is md0
(default).
RAID level is "1 (mirrored)".
Select two partitions from the above step, keep them marked as "active", and press "Create".
The layout of system storage drives is following:
16) Create a BOOT partition of the RAID1
Add a GPT partition onto the md0
RAID1 from the step above.
The size is 2G, format is ext4
and the mount is /boot
.
17) Setup LVM partition on the RAID1
The remaining space on the RAID1 will be managed by LVM.
Add a GPT partition onto the md0
RAID1, using "free space" entry under md0
device.
Use the maximum available space and set the format to "Leave unformatted". You can leave “Size” blank to use all the remaining space on the device.
18) Setup LVM system volume group
Select "Create volume group (LVM)".
The name of the volume group is systemvg
.
Select the available partition on the md0
that has been created above.
19) Create a root logical volume
Add a logical volume named rootlv
on the systemvg
(in "free space" entry), the size is 50G, format is ext4
and mount is /
.
20) Add a dedicated logical volume for system logs
Add a logical volume named loglv
on the systemvg
, the size is 50G, format is ext4
and mount is "Other" and /var/log
.
21) Confirm the layout of the system storage drives
Press "Done" on the bottom of the screen and eventually "Continue" to confirm application of actions on the system storage drives.
22) Profile setup
Your name: TeskaLabs Admin
Your server's name: lm01
(for example)
Pick a username: tladmin
Select a temporary password; it will be removed at the end of the installation.
23) SSH Setup
Select "Install OpenSSH server"
24) Skip the server snaps
Press "Done", no server snaps will be installed from this screen.
25) Wait till the server is installed
It takes approximately 10 minutes.
When the installation is finished, including security updates, select "Reboot Now".
26) When prompted, remove USB stick from the server
Press "Enter" to continue reboot process.
Note
You can skip this step if you are installing over IPMI.
27) Boot the server into the installed OS
Select "Ubuntu" in the GRUB screen or just wait 30 seconds.
28) Login as tladmin
29) Update the operating system
sudo apt update
sudo apt upgrade
sudo apt autoremove
30) Configure the slow data storage
Slow data storage (HDD) is mounted at /data/hdd
.
Assuming the server provides the following disk devices /dev/sdc
, /dev/sdd
, /dev/sde
, /dev/sdf
, /dev/sdg
and /dev/sdh
.
Create software RAID5 array at /dev/md1
with ext4
filesystem, mounted at /data/hdd
.
sudo mdadm --create /dev/md1 --level=5 --raid-devices=6 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
Note
For the RAID6 array, use --level=6
.
Create an EXT4 filesystem and the mount point:
sudo mkfs.ext4 -L data-hdd /dev/md1
sudo mkdir -p /data/hdd
Enter the following line to /etc/fstab
:
/dev/disk/by-label/data-hdd /data/hdd ext4 defaults,noatime 0 1
Danger
The noatime
flag is important for optimal storage performance.
Mount the drive:
sudo mount /data/hdd
Note
The RAID array construction can take a substantial amount of time. You can monitor the progress with cat /proc/mdstat
. Server reboots are safe during RAID array construction.
You can speed up the construction by increasing speed limits:
sudo sysctl -w dev.raid.speed_limit_min=5000000
sudo sysctl -w dev.raid.speed_limit_max=50000000
These speed limit settings will last till the next reboot.
31) Configure the fast data storage
Fast data storage (SSD) is mounted at /data/ssd
.
Assuming the server provides the following disk devices /dev/nvme0n1
and /dev/nvme1n1
.
Create software RAID1 array at /dev/md2
with ext4
filesystem, mounted at /data/ssd
.
sudo mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
sudo mkfs.ext4 -L data-ssd /dev/md2
sudo mkdir -p /data/ssd
Enter the following line to /etc/fstab
:
/dev/disk/by-label/data-ssd /data/ssd ext4 defaults,noatime 0 1
Danger
The noatime
flag is important for optimal storage performance.
Mount the drive:
sudo mount /data/ssd
32) Persist the RAID array configuration
Run:
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
Example of the output:
ARRAY /dev/md/2 metadata=1.2 name=lmd01:2 UUID=5ac64642:51677d00:20c5b5f9:7de93474
ARRAY /dev/md/1 metadata=1.2 name=lmd01:1 UUID=8b0c0872:b8c08564:1815e508:a3753449
Update the init ramdisk:
sudo update-initramfs -u
33) Disable periodic check of RAID
sudo systemctl disable mdcheck_continue
sudo systemctl disable mdcheck_start
34) Installation of the OS is completed
Reboot the server to verify the correctness of the OS installation.
sudo reboot
Here is a video, that recapitulates the installation process:
From the Operating system to the Docker¶
Prerequisites¶
- A running server with the operating system installed.
- Access to the server over SSH, as the user tladmin with permission to execute sudo.
- Slow storage mounted at /data/hdd.
- Fast storage mounted at /data/ssd.
Steps¶
1) Log in to the server over SSH as the user tladmin
ssh tladmin@<ip-of-the-server>
2) Configure SSH access
Install public SSH key(s) for tladmin
user:
cat > /home/tladmin/.ssh/authorized_keys
Restrict the access:
sudo vi /etc/ssh/sshd_config
Changes in the sshd_config
:
PermitRootLogin no
PubkeyAuthentication yes
PasswordAuthentication no
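After changing the configuration, you can validate and apply it. A minimal sketch (on Ubuntu the OpenSSH service is named ssh); keep your current SSH session open until you have verified that a new login still works:
sudo sshd -t                # validate the sshd_config syntax
sudo systemctl restart ssh  # apply the changes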
3) Configure Linux kernel parameters
Write the following content into the file /etc/sysctl.d/01-logman-io.conf:
vm.max_map_count=262144
net.ipv4.ip_unprivileged_port_start=80
The parameter vm.max_map_count
increases the maximum number of mmaps in the Virtual Memory subsystem of Linux.
It is needed by ElasticSearch.
The parameter net.ipv4.ip_unprivileged_port_start
enables unprivileged processes to listen on port 80 (and higher).
This allows NGINX to listen on this port without requiring elevated privileges.
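To apply these kernel parameters without waiting for a reboot, you can reload the sysctl configuration; a minimal check (the reboot in a later step applies them as well):
sudo sysctl --system                                          # reload all files from /etc/sysctl.d
sysctl vm.max_map_count net.ipv4.ip_unprivileged_port_start   # verify the resulting values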
4) Install Docker
Docker is necessary for deploying all LogMan.io microservices in containers, namely Apache Kafka, ElasticSearch, NGINX, individual streaming pumps, and so on.
Create dockerlv
logical volume with EXT4 filesystem:
sudo lvcreate -L100G -n dockerlv systemvg
sudo mkfs.ext4 /dev/systemvg/dockerlv
sudo mkdir /var/lib/docker
Enter the following line to /etc/fstab
:
/dev/systemvg/dockerlv /var/lib/docker ext4 defaults,noatime 0 1
Mount the volume:
sudo mount /var/lib/docker
Install the Docker package:
sudo apt-get install ca-certificates curl gnupg lsb-release
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
sudo usermod -aG docker tladmin
Re-login to the server to apply the group change.
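Optionally, verify that Docker works for the tladmin user after re-login; a minimal check (the hello-world image is pulled from Docker Hub, so Internet access is required):
docker version                # both client and server versions should be reported
docker run --rm hello-world   # pulls and runs a test container, then removes it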
5) Install git
sudo apt install git
6) Configure hostname resolution (optional)
The TeskaLabs LogMan.io cluster requires that each node can resolve the IP address of any other cluster node from its hostname.
If the configured DNS server doesn't provide this ability, node names and their IP addresses have to be inserted into /etc/hosts
.
sudo vi /etc/hosts
Example:
192.168.108.101 lma1
192.168.108.111 lmb1
192.168.108.121 lmx1
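You can verify the name resolution from each node; a minimal check using the example node names above:
getent hosts lma1 lmb1 lmx1   # each name should resolve to the expected IP address
ping -c 3 lmb1                # optional reachability check over the internal network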
7) Reboot the server
sudo reboot
This is important to apply all of the above configuration.
From a Docker to a running LogMan.io¶
Steps¶
1) Create a folder structure
sudo mkdir -p \
/data/ssd/zookeeper/data \
/data/ssd/zookeeper/log \
/data/ssd/kafka/kafka-1/data \
/data/ssd/elasticsearch/es-master/data \
/data/ssd/elasticsearch/es-hot01/data \
/data/ssd/elasticsearch/es-warm01/data \
/data/hdd/elasticsearch/es-cold01/data \
/data/ssd/influxdb/data \
/data/hdd/nginx/log
Change ownership to elasticsearch data folder:
sudo chown -R 1000:0 /data/ssd/elasticsearch
sudo chown -R 1000:0 /data/hdd/elasticsearch
2) Clone the site configuration files into the /opt
folder:
cd /opt
git clone https://gitlab.com/TeskaLabs/<PARTNER_GROUP>/<MY_CONFIG_REPO_PATH>
Login to docker.teskalabs.com.
cd <MY_CONFIG_REPO_PATH>
docker login docker.teskalabs.com
Enter the repository and deploy the server specific Docker Compose file:
docker compose -f docker-compose-<SERVER_ID>.yml pull
docker compose -f docker-compose-<SERVER_ID>.yml build
docker compose -f docker-compose-<SERVER_ID>.yml up -d
Check that all containers are running:
docker ps
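Optionally, inspect the state and recent logs of the deployed services; a minimal sketch using the same placeholders as above (<SERVICE_NAME> stands for any service defined in the Compose file):
docker compose -f docker-compose-<SERVER_ID>.yml ps                                # state of all services
docker compose -f docker-compose-<SERVER_ID>.yml logs --tail=100 <SERVICE_NAME>    # recent logs of one service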
Hardware for TeskaLabs LogMan.io¶
This is a hardware specification designed for vertical scalability. It is optimised for those who plan to build an initial TeskaLabs LogMan.io cluster at the lowest possible cost, with the possibility to add more hardware gradually as the cluster grows. This specification is also fully compatible with the horizontal scalability strategy, which means adding one or more new server nodes to the cluster.
Specifications¶
- Chassis: 2U
- Front HDD trays: 12 drive bays, 3.5", for Data HDDs, hot-swap
- Rear HDD trays: 2 drive bays, 2.5", for OS HDDs, hot-swap
- CPU: 1x AMD EPYC 32 Cores
- RAM: 256GB DDR4 3200, using 64GB modules
- Data SSD: 2x 4TB SSD NVMe, PCIe 3.0+
- Data SSD controller: NVMe PCIe 3.0+ riser card, no RAID; or use motherboard NVMe slots
- Data HDD: 3x 20TB SATA 2/3+ or SAS 1/2/3+, 6+ Gb/s, 7200 rpm
- Data HDD controller: HBA or IT mode card, SATA or SAS, JBOD, no RAID, hot-swap
- OS HDD: 2x 256GB+ SSD SATA 2/3+, HBA, no RAID, directly attached to motherboard SATA
- Network: 2x 1Gbps+ Ethernet NIC; or 1x dual port
- Power supply: Redundant 920W
- IPMI or equivalent
Note
RAID is implemented in software/OS.
Vertical scalability¶
- Add one more CPU (2 CPUs in total), a motherboard with 2 CPU slots is required for this option
- Add RAM up to 512GB
- Add up to 9 additional Data HDDs, maximum 220 TB space using 12x 20 TB HDDs in RAID5
Note
3U and 4U variants are also available, with 16 and 24 drive bays respectively.
Last update: Dec 2023
Data Storage¶
TeskaLabs LogMan.io operates with several different storage tiers in order to deliver optimal data isolation, performance, and cost.
Data storage structure¶
Schema: Recommended structure of the data storage.
Fast data storage¶
Fast data storage (also known as the 'hot' tier) contains the most recent logs and other events received into TeskaLabs LogMan.io. We recommend using the fastest possible storage class for the best throughput and search performance. The real-time component (Apache Kafka) also uses the fast data storage for stream persistency.
- Recommended time span: one day to one week
- Recommended size: 2TB - 4TB
- Recommended redundancy: RAID 1, additional redundancy is provided by the application layer
- Recommended hardware: NVMe SSD PCIe 4.0 and better
- Fast data storage physical devices MUST BE managed by mdadm
- Mount point:
/data/ssd
- Filesystem: EXT4,
noatime
flag is recommended to be set for optimum performance
Backup strategy¶
Incoming events (logs) are copied into the archive storage once they enter TeskaLabs LogMan.io. This means that there is always a way to "replay" events into TeskaLabs LogMan.io in case of need. Also, data are replicated to other nodes of the cluster immediately after arrival. For this reason, traditional backup is not recommended but possible.
The restoration is handled by the cluster components by replicating the data from other nodes of the cluster.
Example
/data/ssd/kafka-1
/data/ssd/elasticsearch/es-master
/data/ssd/elasticsearch/es-hot1
/data/ssd/zookeeper-1
/data/ssd/influxdb-2
...
Slow data storage¶
The slow storage contains data that do not have to be quickly accessed, usually older logs and events, such as warm and cold indices for ElasticSearch.
- Recommended redundancy: software RAID 6 or RAID 5; RAID 0 for virtualized/cloud instances with underlying storage redundancy
- Recommended hardware: Cost-effective hard drives, SATA 2/3+, SAS 1/2/3+
- Typical size: tens of TB, e.g. 18TB
- Controller card: SATA or HBA SAS (IT Mode)
- Slow data storage physical devices MUST BE managed by software RAID (mdadm)
- Mount point:
/data/hdd
- Filesystem: EXT4,
noatime
flag is recommended to be set for optimum performance
Calculation of the cluster capacity¶
This is the formula for calculating the total available cluster capacity on the slow data storage.
total = (disks-raid) * capacity * servers / replica
- disks is the number of slow data storage disks per server
- raid is the RAID overhead, 1 for RAID5 and 2 for RAID6
- capacity is the capacity of a single slow data storage disk
- servers is the number of servers
- replica is the replication factor in ElasticSearch
Example
(6[disks]-2[raid6]) * 18TB[capacity] * 3[servers] / 2[replica] = 108TB
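The same calculation can be scripted, for example as a small shell sketch with the values from the example above:
disks=6; raid=2; capacity=18; servers=3; replica=2
echo "$(( (disks - raid) * capacity * servers / replica )) TB"   # prints "108 TB"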
Backup strategy¶
The data stored on the slow data storage are ALWAYS replicated to other nodes of the cluster and also stored in the archive. For this reason, traditional backup is not recommended but possible (consider the huge size of the slow storage).
The restoration is handled by the cluster components by replicating the data from other nodes of the cluster.
Example
/data/hdd/elasticsearch/es-warm01
/data/hdd/elasticsearch/es-warm02
/data/hdd/elasticsearch/es-cold01
/data/hdd/mongo-2
/data/hdd/nginx-1
...
Large slow data storage strategy¶
If your slow data storage will be larger than 50 TB, we recommend employing HBA SAS controllers, SAS expanders, and JBOD as the optimal strategy for scaling slow data storage. SAS storage connectivity can be daisy-chained to enable a large number of drives to be connected. External JBOD chassis can also be connected using SAS to provide housing for additional drives.
RAID 6 vs RAID 5¶
RAID 6 and RAID 5 are both types of RAID (redundant array of independent disks) that use data striping and parity to provide data redundancy and increased performance.
RAID 5 uses striping across multiple disks, with a single parity block calculated across all the disks. If one disk fails, the data can still be reconstructed using the parity information. However, the data is lost if a second disk fails before the first one has been replaced.
RAID 6, on the other hand, uses striping and two independent parity blocks, which are stored on separate disks. If two disks fail, the data can still be reconstructed using the parity information. RAID 6 provides an additional level of data protection compared to RAID 5. However, RAID 6 also increases the overhead and reduces the storage capacity because of the two parity blocks.
Regarding slow data storage, RAID 5 is generally considered less secure than RAID 6 because the log data is usually vital, and two disk failures could cause data loss. RAID 6 is best in this scenario as it can survive two disk failures and provide more data protection.
In RAID 5, the usable capacity equals (N-1) disks, where N is the number of disks in the array. This is because the equivalent of one disk is used for parity information, which is used to reconstruct the data in case of a single disk failure. For example, if you want to create a RAID 5 array with 54 TB of usable storage, you would need at least four (4) disks with a capacity of at least 18 TB each.
In RAID 6, the usable capacity equals (N-2) disks. This is because it uses two sets of parity information stored on separate disks. As a result, RAID 6 can survive the failure of up to two disks before data is lost. For example, if you want to create a RAID 6 array with 54 TB of usable storage, you would need at least five (5) disks with a capacity of at least 18 TB each.
It's important to note that RAID 6 requires more disk space as it uses two parity blocks, while RAID5 uses only one. That's why RAID 6 requires additional disks as compared to RAID 5. However, RAID 6 provides extra protection and can survive two disk failures.
It is worth mentioning that the data in slow data storage are replicated across the cluster (if applicable) to provide additional data redundancy.
Tip
Use the Online RAID Calculator to calculate storage requirements.
System storage¶
The system storage is dedicated to the operating system, software installations, and configurations. No operational data are stored on the system storage. Installations on virtualization platforms commonly use available locally redundant disk space.
- Recommended size: 250 GB and more
- Recommended hardware: two (2) local SSD disks in software RAID 1 (mirror), SATA 2/3+, SAS 1/2/3+
If applicable, the following storage partitioning is recommended:
- EFI partition, mounted at
/boot/efi
, size 1 GB - Swap partition, 64 GB
- Software RAID1 (mdadm) over rest of the space
- Boot partition on RAID1, mounted at
/boot
, size 512 MB, ext4 filesystem - LVM partition on RAID1, rest of the available space with volume group
systemvg
- LVM logical volume
rootlv
, mounted at/
, size 50 GB, ext4 filesystem - LVM logical volume
loglv
, mounted at/var/log
, size 50 GB, ext4 filesystem - LVM logical volume
dockerlv
, mounted at/var/lib/docker
, size 100 GB, ext4 filesystem (if applicable)
Backup strategy for the system storage¶
It is recommended to periodically backup all filesystems on the system storage so that they could be used for restoring the installation when needed. The backup strategy is compatible with most common backup technologies in the market.
- Recovery Point Objective (RPO): full backup once per week or after major maintenance work, incremental backup once per day.
- Recovery Time Objective (RTO): 12 hours.
Note
RPO and RTO are recommended values, assuming a highly available setup of the LogMan.io cluster. This means three or more nodes, so that complete downtime of a single node doesn't impact service availability.
Archive data storage¶
Data archive storage is recommended but optional. It serves very long data retention periods and redundancy purposes. It also represents an economical way of long-term data storage. Data are not available online in the cluster; they have to be restored back when needed, which implies a certain "time-to-data" interval.
Data are compressed when copied into the archive; the typical compression ratio is in the range of 1:10 to 1:2, depending on the nature of the logs.
Data are replicated into the archive storage after initial consolidation on the fast data storage, practically immediately after ingestion into the cluster.
- Recommended technologies: SAN / NAS / Cloud cold storage (AWS S3, MS Azure Storage)
- Mount point:
/data/archive
(if applicable)
Note
Public clouds can be used as a data archive storage. Data encryption has to be enabled in such a case to protect data from unauthorised access.
Dedicated archive nodes¶
For large archives, dedicated archive nodes (servers) are recommended. These nodes should use HBA SAS drive connectivity and storage-oriented OS distributions such as Unraid or TrueNAS.
Data Storage DON'Ts¶
- We DON'T recommend use of NAS / SAN storage for data storages
- We DON'T recommend use of hardware RAID controllers etc. for data storages
The storage administration¶
This chapter provides a practical example of the storage configuration for TeskaLabs LogMan.io. You don't need to configure or manage the LogMan.io storage unless you have a specific reason for it; LogMan.io is delivered in a fully configured state.
Assuming the following hardware configuration:
- SSD drives for a fast data storage:
/dev/nvme0n1
,/dev/nvme1n1
- HDD drives for a slow data storage:
/dev/sde
,/dev/sdf
,/dev/sdg
Tip
Use lsblk
command to monitor the actual status of the storage devices.
Create a software RAID1 for a fast data storage¶
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.ext4 /dev/md2
mkdir -p /data/ssd
Add mount points into /etc/fstab
:
/dev/md2 /data/ssd ext4 defaults,noatime 0 2
Mount data storage filesystems:
mount /data/ssd
Tip
Use cat /proc/mdstat
to check the state of the software RAID.
Create a software RAID5 for a slow data storage¶
mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sde /dev/sdf /dev/sdg
mkfs.ext4 /dev/md1
mkdir -p /data/hdd
Note
For RAID6 use --level=6
.
Add mount points into /etc/fstab
:
/dev/md1 /data/hdd ext4 defaults,noatime 0 2
Mount data storage filesystems:
mount /data/hdd
Grow the size of a data storage¶
With ever-increasing data volumes, it is highly likely that you will need to grow (i.e. extend) the data storage, either the fast or the slow data storage. This is done by adding a new data volume (e.g. a physical disk or a virtual volume) to the machine, or, on some virtualized solutions, by growing an existing volume.
Note
The data storage can be extended without any downtime.
Slow data storage grow example¶
Assuming that you want to add a new disk /dev/sdh
to a slow data storage /dev/md1
:
mdadm --add /dev/md1 /dev/sdh
The new disk is added as a spare device.
You can check the state of the RAID array by:
cat /proc/mdstat
The (S) behind the device name means a spare device.
Then grow the RAID onto the spare device(s):
mdadm --grow --raid-devices=4 /dev/md1
Number 4
needs to be adjusted to reflect the actual RAID setup.
Grow the filesystem:
resize2fs /dev/md1
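You can verify the result of the grow operation; a minimal check (the reshape may still be running in the background):
cat /proc/mdstat    # shows the reshape progress of /dev/md1
df -h /data/hdd     # the filesystem reports the enlarged capacity once resize2fs finishes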
Networking¶
This documentation section is designed to guide you through the process of setting up and managing the networking of TeskaLabs LogMan.io. To ensure seamless functionality, it is important to follow the prescribed network configuration described below.
Schema: Network overview of the LogMan.io cluster.
Fronting network¶
The fronting network is a private L2 or L3 segment that serves for log collection. For that reason, it has to be accessible from all log sources.
Each node (server) has a dedicated IPv4 address on a fronting network. IPv6 is also supported.
Fronting network must be available at all locations of the LogMan.io cluster.
User network¶
The user network is a private L2 or L3 segment that serves for user access to the Web User Interface. For that reason, it has to be accessible to all users.
Each node (server) has a dedicated IPv4 address on a user network. IPv6 is also supported.
User network must be available at all locations of the LogMan.io cluster.
Internal network¶
The internal network is a private L2 or L3 segment that is used for private cluster communication. It MUST BE dedicated to TeskaLabs LogMan.io with no external access, to maintain the security envelope of the cluster. The internal network must provide encryption if it is operated in a shared environment (i.e. as a VLAN). This is a critical requirement for the security of the cluster.
Each node (server) has a dedicated IPv4 address on an internal network. IPv6 is also supported.
Internal network must be available at all locations of the LogMan.io cluster.
Containers running on the node use "network mode" set to "host" on the internal network. This means that the container's network stack is not isolated from the node (host) and the container does not get its own IP address.
Connectivity¶
Each node (aka server) has the following connectivity requirements:
Fronting network¶
- Minimal: 1Gbit NIC
- Recommended: 2x bonded 10Gbit NIC
User network¶
- Minimal: shared with the fronting network
- Recommended: 1Gbit NIC
Internal network¶
- Minimal: No NIC, internal only for a single node installations, 1Gbit
- Recommended: 2x bonded 10Gbit NIC
- IPMI if available at the server level
Internet connectivity (NAT, Firewalled, behind proxy server) using Fronting network OR Internal network.
SSL Server Certificate¶
The fronting network and the user network expose web interfaces over HTTPS on port TCP/443. For this reason, LogMan.io needs an SSL server certificate.
It could be either:
- self-signed SSL server certificate
- SSL server certificate issued by the Certificate Authority operated internally by the user
- SSL server certificate issued by a public (commercial) Certificate Authority
Tip
You can use XCA tool to generate or verify your SSL certificates.
Self-signed certificate¶
This option is suitable for very small deployments.
Users will get warnings from their browsers when accessing the LogMan.io Web interface.
Also, the insecure
flag needs to be used in collectors.
Create a self-signed SSL certificate using OpenSSL command-line
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 \
-keyout key.pem -out cert.pem -sha256 -days 3650 -nodes \
-subj "/CN=logman.int"
This command will create key.pem
(a private key) and cert.pem
(a certificate), for internal domain name logman.int
.
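You can inspect the generated certificate before deploying it; a minimal check:
openssl x509 -in cert.pem -noout -subject -dates -fingerprint -sha256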
Certificate from Certificate Authority¶
Parameters for the SSL Server certificate:
- Private key: EC 384 bit, curve secp384r1 (minimum), alternatively RSA 2048 (minimum)
- Subject Common name
CN
: Fully Qualified Domain Name of the LogMan.io user Web UI - X509v3 Subject Alternative Name: Fully Qualified Domain Name of the LogMan.io user Web UI set to "DNS"
- Type: End Entity, critical
- X509v3 Subject Key Identifier set
- X509v3 Authority Key Identifier set
- X509v3 Key Usage: Digital Signature, Non Repudiation, Key Encipherment, Key Agreement
- X509v3 Extended Key Usage: TLS Web Server Authentication
Example of SSL Server certificate for http://logman.example.com/
Certificate:
Data:
Version: 3 (0x2)
Serial Number: 6227131463912672678 (0x566b3712dc2c4da6)
Signature Algorithm: ecdsa-with-SHA256
Issuer: CN = logman.example.com
Validity
Not Before: Nov 16 11:17:00 2023 GMT
Not After : Nov 15 11:17:00 2024 GMT
Subject: CN = logman.example.com
Subject Public Key Info:
Public Key Algorithm: id-ecPublicKey
Public-Key: (384 bit)
pub:
04:79:e2:9f:69:cb:ac:f5:3f:93:43:56:a5:ac:d7:
cf:97:f9:ba:44:ee:9b:53:89:19:fd:91:02:0d:bd:
59:41:d6:ec:c6:2b:01:33:03:b6:3e:4a:1d:f4:e9:
2c:3f:af:49:92:79:9c:00:0b:0b:e3:28:7b:13:33:
b4:ac:88:d7:9c:0a:7b:95:90:09:a2:f7:aa:ce:7c:
51:3e:3a:94:af:a8:4b:65:4f:82:90:6a:2f:a9:57:
25:6f:5f:80:09:4c:cb
ASN1 OID: secp384r1
NIST CURVE: P-384
X509v3 extensions:
X509v3 Basic Constraints: critical
CA:FALSE
X509v3 Subject Key Identifier:
49:7A:34:F8:A6:EB:6D:8E:92:42:57:BB:EB:2D:B3:82:F4:98:9D:17
X509v3 Authority Key Identifier:
49:7A:34:F8:A6:EB:6D:8E:92:42:57:BB:EB:2D:B3:82:F4:98:9D:17
X509v3 Key Usage:
Digital Signature, Non Repudiation, Key Encipherment, Key Agreement
X509v3 Extended Key Usage:
TLS Web Server Authentication
X509v3 Subject Alternative Name:
DNS:logman.example.com
Signature Algorithm: ecdsa-with-SHA256
Signature Value:
30:64:02:30:16:09:95:f4:04:1b:99:f4:06:ef:1e:63:4e:aa:
1d:21:b0:b1:31:c1:84:9a:a9:55:c6:14:bd:a1:62:c5:14:14:
35:73:da:8b:a8:7b:f2:f6:4c:8c:b0:6b:72:79:5f:4c:02:30:
49:6f:ef:05:0f:dd:28:fb:26:f8:76:71:01:f3:e4:da:63:72:
17:db:96:fb:5c:09:43:f8:7b:3b:a1:b6:dc:23:31:66:5d:23:
18:94:0b:e4:af:8b:57:1e:c3:3d:93:6f
Generate a CSR¶
If the Certificate Authority requires a CSR to be submitted to receive an SSL certificate, follow this procedure:
1. Generate a private key:
openssl genpkey -algorithm EC -pkeyopt ec_paramgen_curve:prime256v1 -out key.pem
This command will create key.pem
with the private key.
2. Create CSR using generated private key:
openssl req -new -key key.pem -out csr.pem -subj "/CN=logman.example.com"
This command will produce csr.pem
file with the Certificate Signing Request.
Replace logman.example.com
with the FQDN (domain name) of the LogMan.io deployment.
3. Submit the CSR to a Certificate Authority
The Certificate Authority will generate a certificate; store it in a cert.pem
file in PEM format.
Cluster¶
TeskaLabs LogMan.io can be deployed on a single server (aka "node") or in a cluster setup. TeskaLabs LogMan.io also supports geo-clustering.
Geo-clustering¶
Geo-clustering is a technique used to provide redundancy against failures by replicating data and services across multiple geographic locations. This approach aims to minimize the impact of any unforeseen failures, disasters, or disruptions that may occur in one location, by ensuring that the system can continue to operate without interruption from another location.
Geo-clustering involves deploying multiple instances of the LogMan.io across different geographic regions or data centers, and configuring them to work together as a single logical entity. These instances are linked together using a dedicated network connection, which enables them to communicate and coordinate their actions in real-time.
One of the main benefits of geo-clustering is that it provides a high level of redundancy against failures. In the event of a failure in one location, the remaining instances of the system take over and continue to operate without disruption. This not only helps to ensure high availability (HA) and uptime, but also reduces the risk of data loss and downtime.
Another advantage of geo-clustering is that it can provide better performance and scalability by enabling load balancing and resource sharing across multiple locations. This means that resources can be dynamically allocated and adjusted to meet changing demands, ensuring that the system is always optimized for performance and efficiency.
Overall, geo-clustering is a powerful technique that helps to ensure high availability, resilience, and scalability for critical applications and services. By replicating resources across multiple geographic locations, organizations can minimize the impact of failures and disruptions, while also improving performance and efficiency.
Locations¶
Location "A"¶
Location "A" is the first location to be build. In the single node setup, it is also the only location.
Node lma1
is the first server to built of the cluster.
Nodes in this location are named "Node lmaX
". X
is a sequence number of the server (e.g. 1, 2, 3, 4, and so on).
If you run out of numbers, continue with small letters (e.g. a, b, c, and so on).
Please refer to the recommended hardware specification for details about nodes.
Location B, C, D and so on¶
Location B (and C, D and so on) are next locations of the cluster.
Nodes in these locations are named "Node lmLX
".
L
is a small letter that represents the location in alphabetical order (e.g. a, b, c).
X
is a sequence number of the server (e.g. 1, 2, 3, 4, and so on).
If you run out of numbers, continue with small letters (e.g. a, b, c, and so on).
Please refer to the recommended hardware specification for details about nodes.
Coordinating location "X"¶
The cluster MUST have an odd number of locations to avoid the split-brain problem.
For that reason, we recommend building a small, coordinating location with one node (Node lmx1
).
We recommend using a virtualisation platform for "Node x1
", not physical hardware.
No data (logs, events) are stored at this location.
Types of nodes¶
Core node¶
The first three nodes in the cluster are called core nodes. Core nodes form the consensus within the cluster, ensuring consistency and coordinating activities across the cluster.
Peripheral nodes¶
Peripheral nodes are those nodes that don't participate in the consensus of the cluster.
Cluster layouts¶
Schema: Example of the cluster layout.
Single node "cluster"¶
Node: lma1
(Location a, Server 1).
Two big and one small node¶
Nodes: lma1
, lmb1
and lmx1
.
Three nodes, three locations¶
Nodes: lma1
, lmb1
and lmc1
.
Four big and one small node¶
Nodes: lma1
, lma2
, lmb1
, lmb2
and lmx1
.
Six nodes, three locations¶
Nodes: lma1
, lma2
, lmb1
, lmb2
, lmc1
and lmc2
.
Bigger clusters¶
Bigger clusters typically introduce a specialization of nodes.
Data Lifecycle¶
Data (e.g. logs, events, metrics) are stored in several availability stages, basically in chronological order. This means that recent logs are stored on the fastest data storage and, as they age, they are moved to slower and cheaper data storage and eventually into the offline archive, or they are deleted.
Schema: Data life cycle in the TeskaLabs LogMan.io.
The lifecycle is controlled by an ElasticSearch feature called Index Lifecycle Management (ILM).
Index Lifecycle Management¶
Index Lifecycle Management (ILM) in ElasticSearch serves to automatically close or delete old indices (e.g. with data older than three months), so search performance is maintained and the data storage is able to store current data. The setting is defined in the so-called ILM policy.
The ILM should be set before the data are pumped into ElasticSearch, so the new index finds and associates itself with the proper ILM policy. For more information, please refer to the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html
LogMan.io components such as Dispatcher then use a specified ILM alias (lm_), and ElasticSearch automatically puts the data into the proper index associated with the ILM policy.
Hot-Warm-Cold architecture (HWC)¶
HWC is an extension of the standard index rotation provided by the ElasticSearch ILM and it is a good tool for managing time series data. HWC architecture enables us to allocate specific nodes to one of the phases. When used correctly, along with the cluster architecture, this will allow for maximum performance, using available hardware to its fullest potential.
Hot stage¶
There is usually some period of time (week, month, etc.), where we want to query the indexes heavily, aiming for speed, rather than memory (and other resources) conservation. That is where the “Hot” phase comes in handy, by allowing us to have the index with more replicas, spread out and accessible on more nodes for optimal user experience.
Hot nodes¶
Hot nodes should use the fastest parts of the available hardware, using the most CPUs and the fastest IO.
Warm stage¶
Once this period is over, and the indexes are no longer queried as often, we will benefit by moving them to the “Warm” phase, which allows us to reduce the number of nodes (or move to nodes with less resources available) and index replicas, lessening the hardware load, while still retaining the option to search the data reasonably fast.
Warm nodes¶
Warm nodes, as the name suggests, stand on the crossroads, between being solely for the storage purposes, while still retaining some CPU power to handle the occasional queries.
Cold stage¶
Sometimes, there are reasons to store data for extended periods of time (dictated by law, or some internal rule). The data are not expected to be queried, but at the same time, they cannot be deleted just yet.
Cold nodes¶
This is where the cold nodes come in. There may be few of them, with only little CPU resources; they have no need for SSD drives, being perfectly fine with slower (and optionally larger) storage.
The setting should be done in following way:
Archive stage¶
The archive stage is optional in the design. It is an offline long-term storage. The oldest data from the cold stage can be moved periodically to the archive stage instead of being deleted.
The standard archiving policies of the SIEM operating organization are applied. The archived data need to be encrypted.
It is also possible to forward certain logs directly from the warm stage into the archive stage.
Create the ILM policy¶
Kibana¶
Kibana version 7.x can be used to create ILM policy in ElasticSearch.
1.) Open Kibana
2.) Click Management in the left menu
3.) In the ElasticSearch section, click on Index Lifecycle Policies
4.) Click Create policy blue button
5.) Enter its name, which should be the same as the index prefix, e.g. lm_
6.) Set the max index size to the desired rollover size, e.g. 25 GB (size rollover)
7.) Set the maximum age of the index, e.g. 10 days (time rollover)
8.) Click the switch at the bottom of the screen at Delete phase, and enter the time after which the index should be deleted, e.g. 120 days from rollover
9.) Click on Save policy green button
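The same policy can also be created through the ElasticSearch API instead of Kibana; a minimal sketch, assuming ElasticSearch is reachable at localhost:9200 without authentication and using the example values from the steps above:
curl -X PUT "http://localhost:9200/_ilm/policy/lm_" -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "25gb", "max_age": "10d" }
        }
      },
      "delete": {
        "min_age": "120d",
        "actions": { "delete": {} }
      }
    }
  }
}'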
Use the policy in index template¶
Modify index template(s)¶
Add the following lines to the JSON index template:
"settings": {
"index": {
"lifecycle": {
"name": "lm_",
"rollover_alias": "lm_"
}
}
},
Kibana¶
Kibana version 7.x can be used to link ILM policy with ES index template.
1.) Open Kibana
2.) Click Management in the left menu
3.) In the ElasticSearch section, click on Index Management
4.) At the top, select Index Template
5.) Select your desired index template, e.g. lm_
6.) Click on Edit
7.) On the Settings screen, add:
{
"index": {
"lifecycle": {
"name": "lm_",
"rollover_alias": "lm_"
}
}
}
8.) Click on Save
Create a new index which will utilize the latest index template¶
Through Postman or Kibana, send the following HTTP request to the ElasticSearch instance you are using:
PUT lm_tenant-000001
{
"aliases": {
"lm_": {
"is_write_index": true
}
}
}
The alias is then going to be used by the ILM policy to distribute data to the proper ElasticSearch index, so pumps do not have to care about the number of the index.
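You can verify that the new index picked up the ILM policy and the rollover alias; a minimal check, assuming ElasticSearch is reachable at localhost:9200 without authentication:
curl -s "http://localhost:9200/lm_tenant-000001/_ilm/explain?pretty"   # shows the policy, phase and rollover alias of the index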
Warning
The prefix and the number of the index for ILM rollover must be separated with - (i.e. lm_tenant-000001), not _ (lm_tenant_000001)!
Note
Make sure there is no index prefix configuration in the source, like in ElasticSearchSink in the pipeline. The code configuration would replace the file configuration.
Elasticsearch backup and restore¶
Snapshots¶
Located under Stack Management -> Snapshot and Restore
. The snapshots are stored in the repository location. The structure is as follows: the snapshot itself is just a pointer to the indices that it contains. The indices themselves are stored in a separate directory, and they are stored incrementally. This basically means that if you create a snapshot every day, the older indices are just referenced again in the snapshot, while only the new indices are actually copied to the backup directory.
Repositories¶
First, the snapshot repository needs to be set up. Specify the location where the snapshot repository resides, /backup/elasticsearch
for instance. This path needs to be accessible from all nodes in the cluster. With Elasticsearch running in Docker, this includes mounting the path inside the Docker containers and restarting them.
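The repository can also be registered through the ElasticSearch API; a minimal sketch, assuming the path is listed under path.repo in elasticsearch.yml, ElasticSearch is reachable at localhost:9200 without authentication, and the repository name backup_repo is only an example:
curl -X PUT "http://localhost:9200/_snapshot/backup_repo" -H 'Content-Type: application/json' -d '
{
  "type": "fs",
  "settings": { "location": "/backup/elasticsearch" }
}'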
Policies¶
To begin taking snapshots, a policy needs to be created. The policy determines the naming prefix of the snapshots it creates and specifies the repository it will use for creating snapshots. It requires a schedule setting and the indices to include (defined using patterns or specific index names - lmio-mpsv-events-*
for instance).
Furthermore, the policy is able to specify whether to ignore unavailable indices, allow partial indices, and include the global state. Use of these options depends on the specific case in which the snapshot policy will be used, and they are not recommended by default. There is also a setting available to automatically delete snapshots and define their expiration. These also depend on the specific policy. The snapshots themselves, however, are very small (storage-wise) when they do not include the global state, which is to be expected since they are just pointers to a different place where the actual index data are stored.
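Such a policy can also be created via the snapshot lifecycle management (SLM) API; a minimal sketch with example values (the policy name, schedule, index pattern, repository name and retention are all illustrative, not prescribed):
curl -X PUT "http://localhost:9200/_slm/policy/daily-snapshots" -H 'Content-Type: application/json' -d '
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "backup_repo",
  "config": {
    "indices": ["lmio-mpsv-events-*"],
    "ignore_unavailable": false,
    "include_global_state": false
  },
  "retention": { "expire_after": "30d" }
}'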
Restoring a snapshot¶
To restore a snapshot, simply select the snapshot containing the index or indices you wish to bring back and select "Restore". You then need to specify whether you want to restore all indices contained in the snapshot or just a portion. You are able to rename the restored indices; you can also restore partially snapshotted indices and modify the index settings while restoring them, or reset them to defaults. The indices are then restored as specified back into the cluster.
Caveats¶
When deleting snapshots, bear in mind that you need to have the backed-up indices covered by a snapshot to be able to restore them. This means that when you, for example, clear some of the indices from the cluster and then delete the snapshot that contained the reference to these indices, you will be unable to restore them.
Continuity Plan¶
Risk matrix¶
The risk matrix defines the level of risk by considering the category of "Likelihood" of an incident occurring against the category of "Impact". Both categories are given a score between 1 and 5. By multiplying the scores for "Likelihood" and "Impact" together, a total risk score is produced.
Likelihood¶
Likelihood | Score |
---|---|
Rare | 1 |
Unlikely | 2 |
Possible | 3 |
Likely | 4 |
Almost certain | 5 |
Impact¶
Impact | Score | Description |
---|---|---|
Insignificant | 1 | The functionality is not impacted, performance is not reduced, downtime is not needed. |
Minor | 2 | The functionality is not impacted, the performance is not reduced, downtime of the impacted cluster node is needed. |
Moderate | 3 | The functionality is not impacted, the performance is reduced, downtime of the impacted cluster node is needed. |
Major | 4 | The functionality is impacted, the performance is significantly reduced, downtime of the cluster is needed. |
Catastrophic | 5 | Total loss of functionality. |
Incident scenarios¶
Complete system failure¶
Impact: Catastrophic (5)
Likelihood: Rare (1)
Risk level: medium-high
Risk mitigation:
- Geographically distributed cluster
- Active use of monitoring and alerting
- Prophylactic maintenance
- Strong cyber-security posture
Recovery:
- Contact the support and/or vendor and consult the strategy.
- Restore the hardware functionality.
- Restore the system from the backup of the site configuration.
- Restore the data from the offline backup (start with the most recent data and continue to the historical data).
Loss of the node in the cluster¶
Impact: Moderate (3)
Likelihood: Unlikely (2)
Risk level: medium-low
Risk mitigation:
- Geographically distributed cluster
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Contact the support and/or vendor and consult the strategy.
- Restore the hardware functionality.
- Restore the system from the backup of the site configuration.
- Restore the data from the offline backup (start with the most recent data and continue to the historical data).
Loss of the fast storage drive in one node of the cluster¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
Fast drives are in RAID 1 array so the loss of one drive is non-critical. Ensure quick replacement of the failed drive to prevent a second fast drive failure. A second fast drive failure will escalate to a "Loss of the node in the cluster".
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely replacement of the failed drive
Recovery:
- Turn off the impacted cluster node
- Replace failed fast storage drive ASAP
- Turn on the impacted cluster node
- Verify correct RAID1 array reconstruction
Note
Hot swap of the fast storage drive is supported on a specific customer request.
Fast storage space shortage¶
Impact: Moderate (3)
Likelihood: Possible (3)
Risk level: medium-high
This situation is problematic if it happens on multiple nodes of the cluster simultaneously. Use monitoring tools to identify this situation ahead of escalation.
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Remove unnecessary data from the fast storage space.
- Adjust the life cycle configuration so that the data are moved to slow storage space sooner.
Loss of the slow storage drive in one node of the cluster¶
Impact: Insignificant (1)
Likelihood: Likely (4)
Risk level: medium-low
Slow drives are in RAID 5 or RAID 6 array so the loss of one drive is non-critical. Ensure quick replacement of the failed drive to prevent another drive failure. A second drive failure in RAID 5 or third drive failure in RAID 6 will escalate to a "Loss of the node in the cluster".
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely replacement of the failed drive
Recovery:
- Replace failed slow storage drive ASAP (hot swap)
- Verify a correct slow storage RAID reconstruction
Slow storage space shortage¶
Impact: Moderate (3)
Likelihood: Likely (4)
Risk level: medium-high
This situation is problematic if it happens on multiple nodes of the cluster simultaneously. Use monitoring tools to identify this situation ahead of escalation.
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely extension of the slow data storage size
Recovery:
- Remove unnecessary data from the slow storage space.
- Adjust the life cycle configuration so that the data are removed from slow storage space sooner.
Loss of the system drive in one node of the cluster¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
System drives are in a RAID 1 array so the loss of one drive is non-critical. Ensure quick replacement of the failed drive to prevent a second system drive failure. A second system drive failure will escalate to a "Loss of the node in the cluster".
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely replacement of the failed drive
Recovery:
- Replace the failed system storage drive ASAP (hot swap)
- Verify correct RAID1 array reconstruction
System storage space shortage¶
Impact: Moderate (3)
Likelihood: Rare (1)
Risk level: low
Use monitoring tools to identify this situation ahead of escalation.
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Remove unnecessary data from the system storage space.
- Contact the support or the vendor.
Loss of the network connectivity in one node of the cluster¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Redundant network connectivity
Recovery:
- Restore the network connectivity
- Verify the proper cluster operational condition
Failure of the ElasticSearch cluster¶
Impact: Major (4)
Likelihood: Possible (3)
Risk level: medium-high
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating ElasticSearch cluster health
Recovery:
- Contact the support and/or vendor and consult the strategy.
Failure of the ElasticSearch node¶
Impact: Minor (2)
Likelihood: Likely (4)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating ElasticSearch cluster health
Recovery:
- Monitor an automatic ElasticSearch node rejoining to the cluster
- Contact the support / the vendor if the failure persists over several hours.
Failure of the Apache Kafka cluster¶
Impact: Major (4)
Likelihood: Rare (1)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache Kafka cluster health
Recovery:
- Contact the support and/or vendor and consult the strategy.
Failure of the Apache Kafka node¶
Impact: Minor (2)
Likelihood: Rare (1)
Risk level: low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache Kafka cluster
Recovery:
- Monitor an automatic Apache Kafka node rejoining to the cluster
- Contact the support / the vendor if the failure persists over several hours.
Failure of the Apache ZooKeeper cluster¶
Impact: Major (4)
Likelihood: Rare (1)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache ZooKeeper cluster
Recovery:
- Contact the support and/or vendor and consult the strategy.
Failure of the Apache ZooKeeper node¶
Impact: Insignificant (1)
Likelihood: Rare (1)
Risk level: low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache ZooKeeper cluster
Recovery:
- Monitor an automatic Apache ZooKeeper node rejoining to the cluster
- Contact the support / the vendor if the failure persists over several hours.
Failure of the stateless data path microservice (collector, parser, dispatcher, correlator, watcher)¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Restart the failed microservice.
Failure of the stateless support microservice (all others)¶
Impact: Insignificant (1)
Likelihood: Possible (3)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Restart the failed microservice.
Significant reduction of the system performance¶
Impact: Moderate (3)
Likelihood: Possible (3)
Risk level: medium-high
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Identify and remove the root cause of the reduction of the performance
- Contact the vendor or the support if help is needed
Backup and recovery strategy¶
Offline backup for the incoming logs¶
Incoming logs are duplicated to the offline backup storage that is not part of the active LogMan.io cluster (hence "offline"). The offline backup provides an option to restore logs to LogMan.io after a critical failure, etc.
Backup strategy for the fast data storage¶
Incoming events (logs) are copied into the archive storage once they enter LogMan.io. This means that there is always a way to "replay" events into TeskaLabs LogMan.io in case of need. Also, data are replicated to other nodes of the cluster immediately after arrival. For this reason, traditional backup is not recommended but possible.
The restoration is handled by the cluster components by replicating the data from other nodes of the cluster.
Backup strategy for the slow data storage¶
The data stored on the slow data storage are ALWAYS replicated to other nodes of the cluster and also stored in the archive. For this reason, traditional backup is not recommended but possible (consider the huge size of the slow storage).
The restoration is handled by the cluster components by replicating the data from other nodes of the cluster.
Backup strategy for the system storage¶
It is recommended to periodically backup all filesystems on the system storage so that they could be used for restoring the installation when needed. The backup strategy is compatible with most common backup technologies in the market.
- Recovery Point Objective (RPO): full backup once per week or after major maintenance work, incremental backup once per day.
- Recovery Time Objective (RTO): 12 hours.
Note
RPO and RTO are recommended values, assuming a highly available setup of the LogMan.io cluster. This means three or more nodes, so that complete downtime of a single node doesn't impact service availability.
Generic backup and recovery rules¶
-
Data Backup: Regularly back up to a secure location, such as a cloud-based storage service or backup tapes, to minimize data loss in case of failures.
-
Backup Scheduling: Establish a backup schedule that meets the needs of the organization, such as daily, weekly, or monthly backups.
-
Backup Verification: Verify the integrity of backup data regularly to ensure that it can be used for disaster recovery.
-
Restoration Testing: Test the restoration of backup data regularly to ensure that the backup and recovery process is working correctly and to identify and resolve any issues before they become critical.
-
Backup Retention: Establish a backup retention policy that balances the need for long-term data preservation with the cost of storing backup data.
Monitoring and alerting¶
Monitoring is an important component of a Continuity Plan as it helps to detect potential failures early, identify the cause of failures, and support decision-making during the recovery process.
LogMan.io microservices provide an OpenMetrics API and/or ship their telemetry into InfluxDB; Grafana is used as the monitoring tool.
- Monitoring Strategy: The OpenMetrics API is used to collect telemetry from all microservices in the cluster, the operating system and the hardware. Telemetry is collected once per minute. InfluxDB is used to store the telemetry data. Grafana is used as the web-based user interface for telemetry inspection.
- Alerting and Notification: The monitoring system is configured to generate alerts and notifications in case of potential failures, such as low disk space, high resource utilization, or increased error rates.
- Monitoring Dashboards: Monitoring dashboards are provided in Grafana that display the most important metrics for the system, such as resource utilization, error rates, and response times.
- Monitoring Configuration: The monitoring configuration is regularly reviewed and updated to ensure that it is effective and that it reflects changes in the system.
- Monitoring Training: Training is provided for the monitoring team and other relevant parties on the monitoring system and the monitoring dashboards in Grafana.
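As an illustration of this monitoring pipeline, the following Telegraf snippet scrapes an OpenMetrics/Prometheus endpoint once per minute and writes the telemetry into InfluxDB. The endpoint URL, database name and credentials are placeholders, not the actual LogMan.io configuration.
[agent]
  interval = "60s"              # telemetry is collected once per minute

[[inputs.prometheus]]
  # OpenMetrics/Prometheus endpoint of a microservice (placeholder URL)
  urls = ["http://<microservice_host>:<metrics_port>/metrics"]

[[outputs.influxdb]]
  urls = ["http://<influxdb_host>:8086"]
  database = "<your_db>"
  username = "telegraf"
  password = "<your_password>"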
High availability architecture¶
TeskaLabs LogMan.io is deployed in a highly available architecture (HA) with multiple nodes to reduce the risk of single points of failure.
High availability architecture is a design pattern that aims to ensure that a system remains operational and available, even in the event of failures or disruptions.
In a LogMan.io cluster, a high availability architecture includes the following components:
- Load Balancing: Distribution of incoming traffic among multiple instances of microservices, thereby improving the resilience of the system and reducing the impact of failures.
- Redundant Storage: Storing data redundantly across multiple storage nodes to prevent data loss in the event of a storage failure.
- Multiple Brokers: Use of multiple brokers in Apache Kafka to improve the resilience of the messaging system and reduce the impact of broker failures.
- Automatic Failover: Automatic failover mechanisms, such as leader election in Apache Kafka, to ensure that the system continues to function in the event of a cluster node failure.
- Monitoring and Alerting: Usage of monitoring and alerting components to detect potential failures and trigger automatic failover mechanisms when necessary.
- Rolling Upgrades: Upgrades to the system without disrupting its normal operation, by upgrading nodes one at a time, without downtime.
- Data Replication: Replication of logs across multiple cluster nodes to ensure that the system continues to function even if one or more nodes fail.
Communication plan¶
A clear and well-communicated plan for responding to failures and communicating with stakeholders helps to minimize the impact of failures and ensure that everyone is on the same page.
- Stakeholder Identification: Identify all stakeholders who may need to be informed during and after a disaster, such as employees, customers, vendors, and partners.
- Participating organisations: The LogMan.io operator, the integrating party and the vendor (TeskaLabs).
- Communication Channels: Communication channels that will be used during and after a disaster are Slack, email, phone and SMS.
- Escalation Plan: Specify an escalation plan to ensure that the right people are informed at the right time during a disaster, and that communication is coordinated and effective.
- Update and Maintenance: Regularly update and maintain the communication plan to ensure that it reflects changes in the organization, such as new stakeholders or communication channels.
Log Collector ↵
TeskaLabs LogMan.io Collector¶
This is the administration manual for the TeskaLabs LogMan.io Collector. It describes how to install the collector.
For more details about how to collect logs, continue to the reference manual.
Installation of TeskaLabs LogMan.io Collector¶
This short tutorial explains how to connect a new log collector running as a virtual machine.
Tip
If you are using a hardware TeskaLabs LogMan.io Collector, connect the monitor via HDMI and go straight to step 5.
1. Download the virtual machine image.
Here's the download link.
2. Import the downloaded image to your virtualization platform.
3. Configure the network settings of the new virtual machine.
Requirements:
- The virtual machine must be able to reach the TeskaLabs LogMan.io installation.
- The virtual machine must be reachable from devices that will ship logs into TeskaLabs LogMan.io.
4. Launch the virtual machine.
5. Determine the identity of the TeskaLabs LogMan.io Collector.
The identity consists of 16 letters and digits. Please save it for the following steps.
6. Open the LogMan.io web application in your browser.
Follow this link or navigate to "Collectors" and click on the "Provisioning" button.
7. Enter the collector identity from step 5 in the box.
Then, click Provision to connect the collector and start collecting logs.
8. The TeskaLabs LogMan.io Collector is successfully connected and collects logs.
Tip
The green circle on the left indicates that the log collector is online. The blue line indicates how many logs the collector has received in the last 24 hours.
Administration inside the VM¶
Administrative actions in the Virtual Machine of TeskaLabs LogMan.io Collector are available in the menu. Press "M" to access it. Use arrow keys and Enter to navigate and select actions.
Available options are:
- Power down
- Reboot
- Network configuration
Tip
We recommend using the Power down feature to safely turn off the collector's virtual machine.
Additional notes¶
You can connect an unlimited number of log collectors, e.g. to collect from different sources or to collect different types of logs.
Supported virtualization technologies¶
The TeskaLabs LogMan.io Collector supports the following virtualization technologies:
- VMWare
- Oracle VirtualBox
- Microsoft Hyper-V
- Qemu
Virtual Machine¶
TeskaLabs LogMan.io Collector can be manually installed into a virtual machine.
Specifications¶
- 1 vCPU
- OS: Linux, preferably Ubuntu Server 22.04.4 LTS; other mainstream distributions are also supported
- 4 GB RAM
- 500 GB disk (50 GB for OS; the rest is a buffer for collected logs)
- 1x NIC, preferably 1Gbps
The collector must be able to connect to a TeskaLabs LogMan.io installation over HTTPS (WebSocket) using its URL.
Note
For environments with higher loads, the virtual machine should be scaled up accordingly.
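To verify this connectivity from inside the virtual machine, a quick check against the installation URL can be run (logman.example.com is a placeholder for your actual LogMan.io URL):
curl -I https://logman.example.com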
Network¶
We recommend assigning a static IP address to the collector virtual machine, because it will be referenced in many log source configurations.
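For a manually installed collector on Ubuntu Server, the static address can be set e.g. with netplan. The interface name and addresses below are illustrative placeholders; the hardware/appliance image should use the built-in Network configuration menu instead.
# /etc/netplan/01-collector.yaml (illustrative values only)
network:
  version: 2
  ethernets:
    ens160:                      # interface name differs per environment
      dhcp4: false
      addresses: [192.168.1.50/24]
      routes:
        - to: default
          via: 192.168.1.1
      nameservers:
        addresses: [192.168.1.1]
Apply the configuration with sudo netplan apply.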
Ended: Log Collector
ElasticSearch Setting¶
Index Templates¶
Before data are loaded into ElasticSearch, an index template should be present, so that proper data types are assigned to every field.
This is especially needed for time-based fields, which would not work without an index template and could not be used for sorting and creating index patterns in Kibana.
The ElasticSearch index template should be present in the site- repository under the name es_index_template.json.
To insert the index template through PostMan or Kibana, create the following HTTP request to the instance of ElasticSearch you are using:
PUT _template/lmio-
{
//Deploy to <SPECIFY_WHERE_TO_DEPLOY_THE_TEMPLATE>
"index_patterns" : ["lmio-*"],
"version": 200721, // Increase this with every release
"order" : 9999998, // Decrease this with every release
"settings": {
"index": {
"lifecycle": {
"name": "lmio-",
"rollover_alias": "lmio-"
}
}
},
"mappings": {
"properties": {
"@timestamp": { "type": "date", "format": "strict_date_optional_time||epoch_millis" },
"rt": { "type": "date", "format": "strict_date_optional_time||epoch_second" },
...
}
}
}
The body of the request is the content of es_index_template.json.
Index Lifecycle Management¶
Index Lifecycle Management (ILM) in ElasticSearch serves to automatically close or delete old indices (e.g. with data older than three months), so that search performance is maintained and the data storage is able to store current data. The setting is defined in a so-called ILM policy.
The ILM policy should be set up before the data are pumped into ElasticSearch, so that a new index finds and associates itself with the proper ILM policy. For more information, please refer to the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html
LogMan.io components such as the Dispatcher then use a specified ILM alias (lm_) and ElasticSearch automatically puts the data into the proper index associated with the ILM policy.
The setup should be done in the following way:
Create the ILM policy¶
Kibana¶
Kibana version 7.x can be used to create the ILM policy in ElasticSearch.
1.) Open Kibana
2.) Click Management in the left menu
3.) In the ElasticSearch section, click on Index Lifecycle Policies
4.) Click the blue Create policy button
5.) Enter its name, which should be the same as the index prefix, e.g. lm_
6.) Set the maximum index size to the desired rollover size, e.g. 25 GB (size rollover)
7.) Set the maximum age of the index, e.g. 10 days (time rollover)
8.) Click the switch at the bottom of the screen at the Delete phase, and enter the time after which the index should be deleted, e.g. 120 days from rollover
9.) Click the green Save policy button
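Alternatively, the same policy can be created via the ILM API (e.g. in Kibana Dev Tools). The sketch below mirrors the example values above (25 GB / 10 day rollover, delete 120 days after rollover); the policy name lmio- is a placeholder and should match your index prefix:
PUT _ilm/policy/lmio-
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "25gb", "max_age": "10d" }
        }
      },
      "delete": {
        "min_age": "120d",
        "actions": { "delete": {} }
      }
    }
  }
}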
Use the policy in index template¶
Modify index template(s)¶
Add the following lines to the JSON index template:
"settings": {
"index": {
"lifecycle": {
"name": "lmio-",
"rollover_alias": "lmio-"
}
}
},
Kibana¶
Kibana version 7.x can be used to link ILM policy with ES index template.
1.) Open Kibana
2.) Click Management in the left menu
3.) In the ElasticSearch section, click on Index Management
4.) At the top, select Index Template
5.) Select your desired index template, e.g. lmio-
6.) Click on Edit
7.) On the Settings screen, add:
{
"index": {
"lifecycle": {
"name": "lmio-",
"rollover_alias": "lmio-"
}
}
}
8.) Click on Save
Create a new index which will utilize the latest index template¶
Through PostMan or Kibana, create the following HTTP request to the instance of ElasticSearch you are using:
PUT lmio-tenant-events-000001
{
"aliases": {
"lmio-tenant-events": {
"is_write_index": true
}
}
}
The alias is then going to be used by the ILM policy to distribute data to the proper ElasticSearch index, so pumps do not have to care about the number of the index.
Note: The prefix and the number of the index for ILM rollover must be separated with -000001, not _000001!
Configure other LogMan.io components¶
The pumps may now use the ILM policy through the created alias, which in the case above is lm_tenant. The configuration file should then look like this:
[pipeline:<PIPELINE>:ElasticSearchSink]
index_prefix=lm_tenant
doctype=_doc
The pump will always put data into the lm_tenant alias, where ILM takes care of the proper assignment to the index, e.g. lm_-000001.
Note: Make sure there is no index prefix configuration in the source code, such as in the ElasticSearchSink in the pipeline. The code configuration would override the file configuration.
Hot-Warm-Cold architecture (HWC)¶
HWC is an extension of the standard index rotation provided by the ElasticSearch ILM and it is a good tool for managing time series data. HWC architecture enables us to allocate specific nodes to one of the phases. When used correctly, along with the cluster architecture, this will allow for maximum performance, using available hardware to its fullest potential.
Hot¶
There is usually some period of time (week, month, etc.), where we want to query the indexes heavily, aiming for speed, rather than memory (and other resources) conservation. That is where the “Hot” phase comes in handy, by allowing us to have the index with more replicas, spread out and accessible on more nodes for optimal user experience.
Hot nodes¶
Hot nodes should use the fastest parts of the available hardware: the most CPUs and the fastest I/O.
Warm¶
Once this period is over, and the indexes are no longer queried as often, we will benefit from moving them to the "Warm" phase, which allows us to reduce the number of nodes (or move to nodes with fewer resources available) and index replicas, lessening the hardware load while still retaining the option to search the data reasonably fast.
Warm nodes¶
Warm nodes, as the name suggests, stand at a crossroads: they serve mostly storage purposes, while still retaining some CPU power to handle occasional queries.
Cold¶
Sometimes, there are reasons to store data for extended periods of time (dictated by law, or some internal rule). The data are not expected to be queried, but at the same time, they cannot be deleted just yet.
Cold nodes¶
This is where Cold nodes come in. There may be only a few of them, with little CPU, and they have no need for SSD drives; slower (and optionally larger) storage is perfectly fine.
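For illustration, a hot-warm-cold rollover policy could look like the sketch below. It assumes the data nodes carry a node attribute such as data: hot/warm/cold (as in the ElasticSearch deployment example later in this documentation); the phase timings and replica counts are placeholders to be tuned per deployment.
PUT _ilm/policy/lmio-hwc-example
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "25gb", "max_age": "10d" }
        }
      },
      "warm": {
        "min_age": "10d",
        "actions": {
          "allocate": { "require": { "data": "warm" }, "number_of_replicas": 1 }
        }
      },
      "cold": {
        "min_age": "60d",
        "actions": {
          "allocate": { "require": { "data": "cold" }, "number_of_replicas": 0 }
        }
      },
      "delete": {
        "min_age": "120d",
        "actions": { "delete": {} }
      }
    }
  }
}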
Conclusion¶
Using the HWC ILM feature to its full effect requires some preparation; it should be considered when building the production ElasticSearch cluster. The added value, however, can be very high, depending on the specific use case.
InfluxDB Setting¶
Docker-compose.yaml configuration for Influx v1.x¶
influxdb:
restart: on-failure:3
image: influxdb:1.8
ports:
- "8083:8083"
- "8086:8086"
- "8090:8090"
volumes:
- /<path_on_host>/<where_you_want_data>:/var/lib/influxdb
environment:
- INFLUXDB_DB=<your_db>
- INFLUXDB_USER=telegraf
- INFLUXDB_ADMIN_ENABLED=true
- INFLUXDB_ADMIN_USER=<your_user>
- INFLUXDB_ADMIN_PASSWORD=<your_password>
logging:
options:
max-size: 10m
Docker-compose.yaml configuration for Influx v2.x¶
influxdb:
image: influxdb:2.0.4
restart: 'always'
ports:
- "8086:8086"
volumes:
- /data/influxdb/data:/var/lib/influxdb2
environment:
- DOCKER_INFLUXDB_INIT_MODE=setup
- DOCKER_INFLUXDB_INIT_USERNAME=telegraf
- DOCKER_INFLUXDB_INIT_PASSWORD=my-password
- DOCKER_INFLUXDB_INIT_ORG=my-org
- DOCKER_INFLUXDB_INIT_BUCKET=my-bucket
- DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=my-super-secret-auth-token
Run InfluxDB container¶
docker-compose up -d
Use UI interface on:¶
http://localhost:8086/
How to write/delete data using CLI influx:¶
docker exec -it <influx-container> bash
influx write \
-b my-bucket \
-o my-org \
-p s \
'myMeasurement,host=myHost testField="testData" 1556896326' \
-t ${your-token}
influx delete \
--bucket my-bucket \
--org my-org \
--start 2001-03-01T00:00:00Z \
--stop 2021-04-14T00:00:00Z \
--token ${your-token}
Setting up retention policy¶
A retention policy controls how long you want to keep data in InfluxDB. You set up a name for your policy, which database is affected, how long the data will be kept, the replication factor and finally whether it is the default policy (DEFAULT in the case below). The DEFAULT policy is used for all sources that do not specify a retention policy when inserting data into InfluxDB.
docker exec <container_name> influx -execute 'CREATE RETENTION POLICY "<name_your_policy>" ON "<your_db>" DURATION 47h60m REPLICATION 1 DEFAULT'
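To verify that the policy exists, the retention policies of the database can be listed (same placeholders as above):
docker exec <container_name> influx -execute 'SHOW RETENTION POLICIES ON "<your_db>"'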
Altering an existing policy¶
docker exec <container_name> influx -execute 'ALTER RETENTION POLICY "autogen" ON "<affected_db>" DURATION 100d'
Deleting old data¶
Mind the quotation marks
delete from "<collection>" where "<field>" = '<value>'
Deleting old data in a specific field¶
When reconfiguring your sources, you may want to get rid of some old values in specific fields, so they do not clog your visualizations. You may do so using the following command:
docker exec <container_name> influx -execute "DROP SERIES WHERE \"<tag_key>\" = '<tag_value>'"
Downsampling¶
See https://docs.influxdata.com/influxdb/v1.8/guides/downsample_and_retain/. If you want to use multiple rules for different data sources, use a retention policy name other than DEFAULT and configure your sources accordingly; for example, in Telegraf use:
Specific retention policies example (telegraf)¶
Used when you want to set different retention on different sources.
[[outputs.influxdb]]
## Name of existing retention policy to write to. Empty string writes to
## the default retention policy. Only takes effect when using HTTP.
# retention_policy = "telegraf1"
docker exec <container_name> influx -execute 'CREATE RETENTION POLICY "telegraf1" ON "<your_db>" DURATION 47h60m REPLICATION 1'
Guide to deploying TeskaLabs LogMan.io for partners¶
Preimplementation analysis¶
Every delivery should begin with a preimplementation analysis, which lists all the log sources that should be connected to LogMan.io. The outcome of the analysis is a spreadsheet, where each row describes one log source, how the logs are gathered (reading files, log forwarding to a destination port etc.), who is responsible for the log source from the customer's perspective, and an estimate of when the log source should be connected. See the following picture:
In the picture, there are two more columns that are not part of the preimplementation analysis and that are filled in later when the implementation takes place (kafka topic & dataset). For more information, see the Event lanes section below.
It MUST be defined which domain (URL) will be used to host LogMan.io.
The customer or the partner themselves SHOULD provide appropriate HTTPS SSL certificates (see nginx below), e.g. using Let's Encrypt or another certification authority.
LogMan.io cluster and collector servers¶
Servers¶
By the end of the preimplementation analysis, it should be clear how big the volume of gathered logs (in events or log messages per second, EPS for short) will be. The logs are always gathered from the customer's infrastructure with at least one server dedicated to collecting logs (aka log collector).
When it comes to the LogMan.io cluster, there are the following deployment options:
- The LogMan.io cluster is deployed to the customer's infrastructure on physical and/or virtual machines (on-premise)
- The LogMan.io cluster is deployed at the partner's infrastructure and available to multiple customers, where each customer is assigned a single tenant (SOC, SaaS etc.)
See Hardware specification section for more information about the physical servers' configuration.
Cluster architecture¶
In either deployment option, there SHOULD be at least one server (for PoCs) or at least three servers (for production deployment) available for the LogMan.io cluster. If the cluster is deployed to the customer's infrastructure, the servers may also act as the collector servers, so there is no need to have a dedicated collector server in this case. The three-server architecture may consist of three similar physical servers, or two physical servers and one small arbiter virtual machine.
Smaller or non-critical deployments are possible in a single-machine configuration.
For more information about the LogMan.io cluster organization, see the Cluster architecture section.
Data storage¶
Every physical or non-arbiter server in the LogMan.io cluster should have enough available disk storage to hold the data for the requested time period from the preimplementation analysis.
There should be at least one fast data storage (for current or one-day log messages and Kafka topics) and one slower data storage (for older data, metadata and configurations), mapped to /data/ssd and /data/hdd.
Since all LogMan.io services run as Docker containers, the /var/lib/docker folder should also be mapped to one of these storages.
For detailed information about the disk storage organization, mount etc. please see Data Storage section.
Installation¶
The RECOMMENDED operating system is Linux Ubuntu 22.04 LTS or newer. Alternatives are Linux RedHat 8 and 7, CentOS 7.
The hostnames of the LogMan.io servers in the LogMan.io cluster should follow the notation lm01, lm11 etc.
If separate collector servers are used (see above) there is no requirement for their hostname naming.
If TeskaLabs is part of the delivery, there should be a tladmin user created with sudoer permissions.
On every server (both LogMan.io cluster and Collector), there should be git, docker and docker-compose installed.
Please refer to Manual installation for a comprehensive guide.
All services are then created and started via the docker-compose up -d command from the folder the site repository is cloned to (see the following section):
$ cd /opt/site-tenant-siterepository/lm11
$ docker-compose up -d
The Docker credentials are provided to the partner by TeskaLabs' team.
Site repository and configuration¶
Every partner is given access to TeskaLabs GitLab to manage the configurations for their deployments there, which is the recommended way to store configurations for future consultations with TeskaLabs. However, every partner may also use their own GitLab or any other Git repository and provide TeskaLabs' team with appropriate (at least read-only) access.
Every deployment for every customer should have a separate site repository, regardless of whether the entire LogMan.io cluster is installed or only collector servers are deployed. The structure of the site repository should look as follows:
Each server node (server) should have a separate subfolder at the top of the GitLab repository.
Next, there should be a folder with the LogMan.io library, which contains declarations of parsing, correlation etc. groups; a config folder, which contains the configuration of the Discover screen in the UI and dashboards; and an ecs folder with index templates for ElasticSearch.
Every partner is given access to a reference site repository with all the configurations including parsers and discover settings ready.
ElasticSearch¶
Each node in the LogMan.io Cluster should contain at least one ElasticSearch master node, one ElasticSearch data_hot node, one ElasticSearch data_warm node and one ElasticSearch data_cold node.
All the ElasticSearch nodes are deployed via Docker Compose and are part of the site/configuration repository.
Arbiter nodes in the cluster contain only one ElasticSearch master node.
If the one-server architecture is used, the replicas in ElasticSearch should be set to zero (this will also be provided after the consultation with TeskaLabs). For illustration, see the following snippet from a Docker Compose file showing how an ElasticSearch hot node is deployed:
lm21-es-hot01:
network_mode: host
image: docker.elastic.co/elasticsearch/elasticsearch:7.17.2
depends_on:
- lm21-es-master
environment:
- network.host=lm21
- node.attr.rack_id=lm21 # Or datacenter name. This is meant for ES to effectively and safely manage replicas
# For smaller installations -> a hostname is fine
- node.attr.data=hot
- node.name=lm21-es-hot01
- node.roles=data_hot,data_content,ingest
- cluster.name=lmio-es # Basically "name of the database"
- cluster.initial_master_nodes=lm01-es-master,lm11-es-master,lm21-es-master
- discovery.seed_hosts=lm01:9300,lm11:9300,lm21:9300
- http.port=9201
- transport.port=9301 # Internal communication among nodes
- "ES_JAVA_OPTS=-Xms16g -Xmx16g -Dlog4j2.formatMsgNoLookups=true"
# - path.repo=/usr/share/elasticsearch/repo # This option is enabled on demand after the installation; it is not part of the initial setup
- ELASTIC_PASSWORD=$ELASTIC_PASSWORD
- xpack.security.enabled=true
- xpack.security.transport.ssl.enabled=true
...
For more information about ElasticSearch including the explanation of hot (recent, one-day data on SSD), warm (older) and cold nodes, please refer to ElasticSearch Setting section.
ZooKeeper & Kafka¶
Each server node in the LogMan.io Cluster should contain at least one ZooKeeper and one Kafka node. ZooKeeper is a metadata storage available in the entire cluster, where Kafka stores information about topic consumers, topic names etc., and where LogMan.io stores the current library and config files (see below).
The Kafka and ZooKeeper setting can be copied from the reference site repository and consulted with TeskaLabs developers.
Services¶
The following services should be available on at least one of the LogMan.io nodes; they include:
- nginx (web server with HTTPS certificate, see the reference site repository)
- influxdb (metric storage, see InfluxDB Setting)
- mongo (database for user credentials, sessions etc.)
- telegraf (gathers telemetry metrics from the infrastructure, Burrow and ElasticSearch and sends them to InfluxDB; it should be installed on every server)
- burrow (gathers telemetry metrics from Kafka and sends them to InfluxDB)
- seacat-auth (TeskaLabs SeaCat Auth is an OAuth service that stores its data in mongo)
- asab-library (manages the library with declarations)
- asab-config (manages the config section)
- lmio-remote-control (monitors other microservices like asab-config)
- lmio-commander (uploads the library to ZooKeeper)
- lmio-dispatcher (dispatches data from the lmio-events and lmio-others Kafka topics to ElasticSearch; it should run in at least three instances on every server)
For more information about SeaCat Auth and its management part in LogMan.io UI, see TeskaLabs SeaCat Auth documentation.
For information on how to upload the library from the site repository to ZooKeeper, refer to the LogMan.io Commander guide.
UI¶
The following UIs should be deployed and made available via nginx. The first implementation should always be discussed with TeskaLabs' developers.
- LogMan.io UI (see LogMan.io User Interface)
- Kibana (discover screen, visualizations, dashboards and monitoring on top of ElasticSearch)
- Grafana (telemetry dashboards on top of data from InfluxDB)
- ZooKeeper UI (management of data stored in ZooKeeper)
The following picture shows the Parsers from the library imported to ZooKeeper in ZooKeeper UI:
LogMan.io UI Deployment¶
Deployment of the LogMan.io UI is a partially automated process when set up correctly. There are several steps to ensure a safe UI deployment:
- The deployment artifact of the UI should be pulled via the azure site repository provided to the partner by TeskaLabs' developers. Information about where the particular UI application is stored can be obtained from the CI/CD image of the application repository.
- It is recommended to use tagged versions, but there can be situations when the master version is desired. Information on how to set it up can be found in the docker-compose.yaml file of the reference site repository.
- UI applications have to be aligned with the services to ensure the best performance (usually the latest tag versions). If uncertain, contact TeskaLabs' developers.
Creating the tenant¶
Each customer is assigned one or more tenants.
Tenants are lowercase ASCII names that tag the data/logs belonging to the user; each tenant's data is stored in a separate ElasticSearch index.
All event lanes (see below) are also tenant specific.
Create the tenant in SeaCat Auth using LogMan.io UI¶
In order to create the tenant, log into the LogMan.io UI with the superuser role, which can be done through provisioning. For more information about provisioning, please refer to the Provisioning mode section of the SeaCat Auth documentation.
In the LogMan.io UI, navigate to the Auth section in the left menu and select Tenants.
Once there, click on the Create tenant option and write the name of the tenant there.
Then click on the blue button and the tenant should be created:
After that, go to Credentials and assign the newly created tenant to all relevant users.
ElasticSearch indices¶
In Kibana, every tenant should have index templates for the lmio-tenant-events and lmio-tenant-others indices, where tenant is the name of the tenant (refer to the reference site repository provided by TeskaLabs).
The index templates can be inserted via Kibana's Dev Tools from the left menu.
After the insertion of the index templates, the ILM (Index Lifecycle Management) policy and the first indices should be manually created, exactly as specified in the ElasticSearch Setting guide.
Kafka¶
There is no specific tenant creation setting in Kafka, except for the event lanes below.
However, always make sure the lmio-events and lmio-others topics are created properly.
The following commands should be run in the Kafka container (e.g.: docker exec -it lm11_kafka_1 bash):
# LogMan.io
/usr/bin/kafka-topics --zookeeper lm11:2181 --create --topic lmio-events --replication-factor 1 --partitions 6
/usr/bin/kafka-topics --zookeeper lm11:2181 --create --topic lmio-others --replication-factor 1 --partitions 6
/usr/bin/kafka-topics --zookeeper lm11:2181 --alter --topic lmio-events --config retention.ms=86400000
/usr/bin/kafka-topics --zookeeper lm11:2181 --alter --topic lmio-others --config retention.ms=86400000
# LogMan.io+ & SIEM
/usr/bin/kafka-topics --zookeeper lm11:2181 --create --topic lmio-events-complex --replication-factor 1 --partitions 6
/usr/bin/kafka-topics --zookeeper lm11:2181 --create --topic lmio-lookups --replication-factor 1 --partitions 6
Each Kafka topic should have at least 6 partitions (which can be automatically used for parallel consumption), which is the appropriate number for most deployments.
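To verify the partition count and the retention of a created topic, the topic can be described from within the same Kafka container (lmio-events is used as an example here):
/usr/bin/kafka-topics --zookeeper lm11:2181 --describe --topic lmio-events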
Important note¶
The following section describes the connection of event lanes to LogMan.io. Knowledge of the LogMan.io architecture from the documentation is mandatory.
Event lanes¶
Event lanes in LogMan.io define how logs are sent to the cluster. Each event lane is specific to the collected source, hence one row in the preimplementation analysis table should correspond to one event lane. Each event lane consists of one lmio-collector service, one lmio-ingestor service and one or more instances of the lmio-parser service.
Collector¶
LogMan.io Collector should run on the collector server, or on one or more LogMan.io servers if they are part of the same internal network. The configuration sample is part of the reference site repository.
LogMan.io Collector is able, via YAML configuration, to open a TCP/UDP port to obtain logs from, read files, open a WEC server, read from Kafka topics, Azure accounts and so on. The comprehensive documentation is available here: LogMan.io Collector
The following configuration sample opens the 12009/UDP port on the server the collector is installed on and redirects the collected data via WebSocket to the lm11 server on port 8600, where lmio-ingestor should be running:
input:Datagram:UDPInput:
address: 0.0.0.0:12009
output: WebSocketOutput
output:WebSocket:WebSocketOutput:
url: http://lm11:8600/ws
tenant: mytenant
debug: false
prepend_meta: false
The url is either the hostname of the server and port of the Ingestor, if the Collector and Ingestor are deployed to the same server, or a URL with https://, if a collector server outside of the internal network is used. In that case it is necessary to specify HTTPS certificates; please see the output:WebSocket section in the LogMan.io Collector Outputs guide for more information.
The tenant is the name of the tenant the logs belong to. The tenant name is then automatically propagated to the Ingestor and Parser.
Ingestor¶
LogMan.io Ingestor takes the log messages from the Collector along with metadata and stores them in Kafka in a topic that begins with the collected-tenant- prefix, where tenant is the tenant name the logs belong to and technology is the name of the technology the data are gathered from, such as microsoft-windows.
The following sections in the CONF files must always be set up differently for each event lane:
# Output
[pipeline:WSPipeline:KafkaSink]
topic=collected-tenant-technology
# Web API
[web]
listen=0.0.0.0 8600
The port in the listen section should match the port in the Collector YAML configuration (if the Collector is deployed to the same server) or the setting in nginx (if the data are collected from a collector server outside of the internal network). Please refer to the reference site repository provided by TeskaLabs' developers.
Parser¶
The parser should be deployed in multiple instances to scale the performance. It parses the data from the original bytes or strings into a dictionary in the specified schema such as ECS (Elastic Common Schema) or CEF (Common Event Format), while using a parsing group from the library loaded in ZooKeeper. It is important to specify the Kafka topic to read from, which is the same topic as specified in the Ingestor configuration:
[declarations]
library=zk://lm11:2181/lmio/library.lib
groups=Parsers/parsing-group
raw_event=log.original
# Pipeline
[pipeline:ParsersPipeline:KafkaSource]
topic=collected-tenant-technology
group_id=lmio_parser_collected
auto.offset.reset=smallest
Parsers/parsing-group is the location of the parsing group from the library loaded in ZooKeeper through LogMan.io Commander. It does not have to exist at the first try, because all data are then automatically sent to the lmio-tenant-others index. When the parsing group is ready, parsing takes place and the data can be seen in document format in the lmio-tenant-events index.
Kafka topics¶
Before all three services are started via the docker-compose up -d command, it is important to check the state of the specific collected-tenant-technology Kafka topic (where tenant is the name of the tenant and technology is the name of the connected technology/device type). In the Kafka container (e.g.: docker exec -it lm11_kafka_1 bash), the following commands should be run:
/usr/bin/kafka-topics --zookeeper lm11:2181 --create --topic collected-tenant-technology --replication-factor 1 --partitions 6
/usr/bin/kafka-topics --zookeeper lm11:2181 --alter --topic collected-tenant-technology --config retention.ms=86400000
Parsing groups¶
For most common technologies, TeskaLabs has already prepared parsing groups for the ECS schema. Please get in touch with TeskaLabs developers. Since all parsers are written in the declarative language, all parsing groups in the library can be easily adjusted. The name of the group should be the same as the name of the dataset attribute written in the parser group's declaration.
For more information about our declarative language, please refer to the official documentation: SP-Lang
After the parsing group is deployed via LogMan.io Commander, the appropriate Parser(s) should be restarted.
Deployment¶
On the LogMan.io servers, simply run the following command in the folder the site- repository is cloned to:
docker-compose up -d
The collection of logs can be then checked in the Kafka Docker container via Kafka's console consumer:
/usr/bin/kafka-console-consumer --bootstrap-server lm11:9092 --topic collected-tenant-technology --from-beginning
The data are pumped in the Parser from the collected-tenant-technology topic to the lmio-events or lmio-others topic, and then in the Dispatcher (lmio-dispatcher, see above) to the lmio-tenant-events or lmio-tenant-others index in ElasticSearch.
SIEM¶
The SIEM part should now always be discussed with TeskaLabs' developers, who will provide the first correlation rules and entries for the configuration files and Docker Compose. The SIEM part consists mainly of different lmio-correlators instances and lmio-watcher.
For more information, see the LogMan.io Correlator section.
Connecting a new log source to LogMan.io¶
Prerequisites¶
Tenant¶
Each customer is assigned one or more tenants.
The name of the tenant must be a lowercase ASCII name that tags the data/logs belonging to the user; each tenant's data is stored in a separate ElasticSearch index. All Event Lanes (see below) are also tenant specific.
Create the tenant in SeaCat Auth using LogMan.io UI¶
In order to create the tenant, log into the LogMan.io UI with the superuser role, which can be done through provisioning. For more information about provisioning, please refer to the Provisioning mode section of the SeaCat Auth documentation.
In the LogMan.io UI, navigate to the Auth section in the left menu and select Tenants.
Once there, click on the Create tenant option and write the name of the tenant there.
Then click on the blue button and the tenant should be created:
After that, go to Credentials and assign the newly created tenant to all relevant users.
ElasticSearch index templates¶
In Kibana, every tenant should have index templates for the lmio-tenant-events and lmio-tenant-others indices, where tenant is the name of the tenant (refer to the reference site- repository provided by TeskaLabs), so that proper data types are assigned to every field.
This is especially needed for time-based fields, which would not work without an index template and could not be used for sorting and creating index patterns in Kibana.
The ElasticSearch index template should be present in the site- repository under the name es_index_template.json.
The index templates can be inserted via Kibana's Dev Tools from the left menu.
ElasticSearch index lifecycle policy¶
Index Lifecycle Management (ILM) in ElasticSearch serves to automatically close or delete old indices (e.g. with data older than three months), so that search performance is maintained and the data storage is able to store current data. The setting is defined in a so-called ILM policy.
The ILM policy should be set up before the data are pumped into ElasticSearch, so that a new index finds and associates itself with the proper ILM policy. For more information, please refer to the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html
LogMan.io components such as the Dispatcher then use a specified ILM alias (lmio-) and ElasticSearch automatically puts the data into the proper index associated with the ILM policy.
The setup should be done in the following way:
Create the ILM policy¶
Kibana version 7.x can be used to create the ILM policy in ElasticSearch.
1.) Open Kibana
2.) Click Management in the left menu
3.) In the ElasticSearch section, click on Index Lifecycle Policies
4.) Click the blue Create policy button
5.) Enter its name, which should be the same as the index prefix, e.g. lmio-
6.) Set the maximum index size to the desired rollover size, e.g. 25 GB (size rollover)
7.) Set the maximum age of the index, e.g. 10 days (time rollover)
8.) Click the switch at the bottom of the screen at the Delete phase, and enter the time after which the index should be deleted, e.g. 120 days from rollover
9.) Click the green Save policy button
Use the policy in index template¶
Add the following lines to the JSON index template:
"settings": {
"index": {
"lifecycle": {
"name": "lmio-",
"rollover_alias": "lmio-"
}
}
},
ElasticSearch indices¶
Through PostMan or Kibana, create the following HTTP requests to the instance of ElasticSearch you are using.
1.) Create an index for parsed events/logs:
PUT lmio-tenant-events-000001
{
"aliases": {
"lmio-tenant-events": {
"is_write_index": true
}
}
}
2.) Create an index for unparsed and error events/logs:
PUT lmio-tenant-others-000001
{
"aliases": {
"lmio-tenant-others": {
"is_write_index": true
}
}
}
The alias is then going to be used by the ILM policy to distribute data to the proper ElasticSearch index, so pumps do not have to care about the number of the index.
Note: The prefix and the number of the index for ILM rollover must be separated with -000001, not _000001!
Event Lane¶
An Event Lane in LogMan.io defines how logs from a specific data source for a given tenant are sent to the cluster. Each event lane is specific to the collected source. Each event lane consists of one lmio-collector service, one lmio-ingestor service and one or more instances of the lmio-parser service.
Collector¶
LogMan.io Collector should run on the collector server, or on one or more LogMan.io servers if they are part of the same internal network. The configuration sample is part of the reference site- repository.
LogMan.io Collector is able, via YAML configuration, to open a TCP/UDP port to obtain logs from, read files, open a WEC server, read from Kafka topics, Azure accounts and so on. The comprehensive documentation is available here: LogMan.io Collector
The following configuration sample opens the 12009/UDP port on the server the collector is installed on and redirects the collected data via WebSocket to the lm11 server on port 8600, where lmio-ingestor should be running:
input:Datagram:UDPInput:
address: 0.0.0.0:12009
output: WebSocketOutput
output:WebSocket:WebSocketOutput:
url: http://lm11:8600/ws
tenant: mytenant
debug: false
prepend_meta: false
The url is either the hostname of the server and port of the Ingestor, if the Collector and Ingestor are deployed to the same server, or a URL with https://, if a collector server outside of the internal network is used. In that case it is necessary to specify HTTPS certificates; please see the output:WebSocket section in the LogMan.io Collector Outputs guide for more information.
The tenant is the name of the tenant the logs belong to. The tenant name is then automatically propagated to the Ingestor and Parser.
Ingestor¶
LogMan.io Ingestor takes the log messages from the Collector along with metadata and stores them in Kafka in a topic that begins with the collected-tenant- prefix, where tenant is the tenant name the logs belong to and technology is the name of the technology the data are gathered from, such as microsoft-windows.
The following sections in the CONF files must always be set up differently for each event lane:
# Output
[pipeline:WSPipeline:KafkaSink]
topic=collected-tenant-technology
# Web API
[web]
listen=0.0.0.0 8600
The port in the listen section should match the port in the Collector YAML configuration (if the Collector is deployed to the same server) or the setting in nginx (if the data are collected from a collector server outside of the internal network). Please refer to the reference site- repository provided by TeskaLabs' developers.
Parser¶
The parser should be deployed in multiple instances to scale the performance. It parses the data from the original bytes or strings into a dictionary in the specified schema such as ECS (Elastic Common Schema) or CEF (Common Event Format), while using a parsing group from the library loaded in ZooKeeper. It is important to specify the Kafka topic to read from, which is the same topic as specified in the Ingestor configuration:
[declarations]
library=zk://lm11:2181/lmio/library.lib
groups=Parsers/parsing-group
raw_event=log.original
# Pipeline
[pipeline:ParsersPipeline:KafkaSource]
topic=collected-tenant-technology
group_id=lmio_parser_collected
auto.offset.reset=smallest
Parsers/parsing-group is the location of the parsing group from the library loaded in ZooKeeper through LogMan.io Commander. It does not have to exist at the first try, because all data are then automatically sent to the lmio-tenant-others index. When the parsing group is ready, parsing takes place and the data can be seen in document format in the lmio-tenant-events index.
Kafka topics¶
Before all three services are started via the docker-compose up -d command, it is important to check the state of the specific collected-tenant-technology Kafka topic (where tenant is the name of the tenant and technology is the name of the connected technology/device type). In the Kafka container (e.g.: docker exec -it lm11_kafka_1 bash), the following commands should be run:
/usr/bin/kafka-topics --zookeeper lm11:2181 --create --topic collected-tenant-technology --replication-factor 1 --partitions 6
/usr/bin/kafka-topics --zookeeper lm11:2181 --alter --topic collected-tenant-technology --config retention.ms=86400000
Parsing groups¶
For most common technologies, TeskaLabs has already prepared parsing groups for the ECS schema. Please get in touch with TeskaLabs developers. Since all parsers are written in the declarative language, all parsing groups in the library can be easily adjusted. The name of the group should be the same as the name of the dataset attribute written in the parser group's declaration.
For more information about our declarative language, please refer to the official documentation: SP-Lang
After the parsing group is deployed via LogMan.io Commander, the appropriate Parser(s) should be restarted.
Deployment¶
On the LogMan.io servers, simply run the following command in the folder the site- repository is cloned to:
docker-compose up -d
The collection of logs can be then checked in the Kafka Docker container via Kafka's console consumer:
/usr/bin/kafka-console-consumer --bootstrap-server lm11:9092 --topic collected-tenant-technology --from-beginning
The data are pumped in the Parser from the collected-tenant-technology topic to the lmio-events or lmio-others topic, and then in the Dispatcher (lmio-dispatcher) to the lmio-tenant-events or lmio-tenant-others index in ElasticSearch.
Kafka ↵
Kafka¶
Apache Kafka serves as a queue to temporarily store events among the LogMan.io microservices. For more information, see Architecture.
Kafka within LogMan.io¶
Topic naming in event lanes¶
Each event lane has received, events and others topics specified.
Each topic name contains the name of the tenant and the event lane's stream in the following manner:
received.tenant.stream
events.tenant.stream
others.tenant
received.tenant.stream¶
The received topic stores the incoming logs for the given tenant and the event lane's stream.
events.tenant.stream¶
The events topic stores the parsed events for the given event lane defined by tenant and stream.
others.tenant¶
The others topic stores the unparsed events for the given tenant.
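For example, for a hypothetical tenant mytenant shipping a fortigate stream, the event lane would use the following topics:
received.mytenant.fortigate
events.mytenant.fortigate
others.mytenant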
Internal topics¶
There are the following internal topics for LogMan.io:
lmio-alerts¶
This topic stores the triggered alerts and is read by LogMan.io Alerts microservice.
lmio-notifications¶
This topic stores the triggered notifications and is read by ASAB IRIS microservice.
lmio-lookups¶
This topic stores the requested changes in lookups and is read by LogMan.io Watcher microservice.
Recommended setup for 3-node cluster¶
There are three instances of Apache Kafka, one on each node.
The number of partitions for each topic must be at least the same as the number of consumers (3) and divisible by 2, hence the recommended number of partitions is always 6.
The recommended replica count is 1.
Each topic must have a reasonable retention set based on the available size of SSD drives.
In the LogMan.io cluster environment, where average EPS is above 1000 events per second and SSD disk space is below 2 TB, the retention is usually 1 day (86400000 milliseconds). See the Commands section.
Hint
When the EPS is lower or there is more SSD space, it is recommended to set the retention for Kafka topics to higher values like 2 or more days, in order to give administrators more time to solve potential issues.
To create the partitions, replicas and retention properly, see the Commands section.
Commands¶
The following commands serve to create, alter and delete Kafka topics within the LogMan.io environment. All Kafka topics managed by LogMan.io, apart from the internal ones, are specified in event lane *.yaml declarations inside the /EventLanes folder in the library.
Prerequisites¶
All commands should be run from the Kafka Docker container, which can be accessed via the following command:
docker exec -it kafka_container bash
The command utilizes Kafka Command Line interface, which is documented here: Kafka Command-Line Interface (CLI) Tools
Create a topic¶
In order to create a topic, specify the topic name, number of partitions and replication factor. The replication factor should be set to 1 and partitions to 6, which is the default for LogMan.io Kafka topics.
/usr/bin/kafka-topics --zookeeper localhost:2181 --create --topic "events.tenant.fortigate" --replication-factor 1 --partitions 6
Replace events.tenant.fortigate with your topic name.
Configure a topic¶
Retention¶
The following command changes the data retention of a Kafka topic to 86400000 milliseconds, that is, 1 day. This means that data older than 1 day will be deleted from Kafka to spare storage space:
/usr/bin/kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name "events\.tenant\.fortigate" --alter --add-config retention.ms=86400000
Replace events\.tenant\.fortigate with your topic name.
Info
All Kafka topics in LogMan.io should have a retention for data set.
Info
When editing a topic setting in Kafka, special characters like the dot (.) should be escaped with a backslash (\).
Reseting a consumer group offset for a given topic¶
In order to reset the reading position, or the offset, for the given group ID (consumer group), use the following command:
/usr/bin/kafka-consumer-groups --bootstrap-server localhost:9092 --group "my-console-client" --topic "events\.tenant\.fortigate" --reset-offsets --to-datetime 2020-12-20T00:00:00.000 --execute
Replace events\.tenant\.fortigate with your topic name.
Replace my-console-client with the given group ID.
Replace 2020-12-20T00:00:00.000 with the time to reset the reading offset to.
Hint
To reset the group to the current offset, use --to-current instead of --to-datetime 2020-12-20T00:00:00.000.
Deleting a consumer group offset for a given topic¶
The offset for the given topic can be deleted from the consumer group, hence the consumer group would be effectively disconnected from the topic itself. Use the following command:
/usr/bin/kafka-consumer-groups --bootstrap-server localhost:9092 --group "my-console-client" --topic "events\.tenant\.fortigate" --delete-offsets
Replace events\.tenant\.fortigate with your topic name.
Replace my-console-client with the given group ID.
Deleting the consumer group¶
A consumer group for ALL topics can be deleted with its offset information using the following command:
/usr/bin/kafka-consumer-groups --bootstrap-server localhost:9092 --delete --group my-console-client
Replace my-console-client with the given group ID.
Alter a topic¶
Change the number of partitions¶
The following command increases the number of partitions within the given topic.
/usr/bin/kafka-topics --zookeeper localhost:2181 --alter --partitions 6 --topic "events\.tenant\.fortigate"
Replace events\.tenant\.fortigate with your topic name.
Specify ZooKeeper node
Kafka reads and alters data stored in ZooKeeper. If you've configured Kafka so that its files are stored in a specific ZooKeeper node, you will get this error:
Error while executing topic command : Topic 'events.tenant.fortigate' does not exist as expected
[2024-05-06 10:16:36,207] ERROR java.lang.IllegalArgumentException: Topic 'events.tenant.fortigate' does not exist as expected
at kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:539)
at kafka.admin.TopicCommand$ZookeeperTopicService.alterTopic(TopicCommand.scala:408)
at kafka.admin.TopicCommand$.main(TopicCommand.scala:66)
at kafka.admin.TopicCommand.main(TopicCommand.scala)
(kafka.admin.TopicCommand$)
Adjust the --zookeeper argument accordingly. E.g. Kafka data is stored in the kafka node of ZooKeeper:
/usr/bin/kafka-topics --zookeeper lm11:2181/kafka --alter --partitions 6 --topic 'events\.tenant\.fortigate'
Try to remove the escape characters (\) if the topic name is still not recognized.
Delete a topic¶
The topic can be deleted using the following command. Please keep in mind that Kafka topics are automatically created if new data are being produced/sent to it by any service.
/usr/bin/kafka-topics --zookeeper localhost:2181 --delete --topic "events\.tenant\.fortigate"
Replace events\.tenant\.fortigate with your topic name.
Troubleshooting¶
There are many logs in others and I cannot find the ones with "interface" attribute inside¶
The Kafka Console Consumer can be used to obtain events from multiple topics, here from all topics starting with the events. prefix. Next, it is possible to grep for the field in double quotes:
/usr/bin/kafka-console-consumer --bootstrap-server localhost:9092 --whitelist "events.*" | grep '"interface"'
This command gives you all incoming logs with the "interface" attribute from all events topics.
Kafka Partition Reassignment¶
When a new Kafka node is added, Kafka does not automatically perform the partition reassignment. The following steps are used to perform a manual reassignment of Kafka partitions for the specified topic(s):
1.) Go to Kafka container
docker exec -it kafka_container bash
2.) Create /tmp/topics.json with the topics whose partitions should be reassigned, in the following format:
cat << EOF | tee /tmp/topics.json
{
"topics": [
{"topic": "events.tenant.stream"},
],
"version": 1
}
EOF
3.) Generate reassignment JSON output from list of topics to be migrated, specify the broker IDs in the broker list:
/usr/bin/kafka-reassign-partitions --zookeeper localhost:2181 --broker-list "121,122,221,222" --generate --topics-to-move-json-file /tmp/topics.json
The result should be stored in /tmp/reassign.json and look as follows, with all topics and partitions having their new assignment specified:
[appuser@lm11 data]$ cat /tmp/reassign.json
{"version":1,"partitions":[{"topic":"events.tenant.stream","partition":0,"replicas":[122],"log_dirs":["any"]},{"topic":"events.tenant.stream","partition":1,"replicas":[221],"log_dirs":["any"]},{"topic":"events.tenant.stream","partition":2,"replicas":[222],"log_dirs":["any"]},{"topic":"events.tenant.stream","partition":3,"replicas":[121],"log_dirs":["any"]},{"topic":"events.tenant.stream","partition":4,"replicas":[122],"log_dirs":["any"]},{"topic":"events.tenant.stream","partition":5,"replicas":[221],"log_dirs":["any"]}]}
4.) Use the output from the previous command as input to the execution of the reassignment/rebalance:
/usr/bin/kafka-reassign-partitions --zookeeper localhost:2181 --execute --reassignment-json-file /tmp/reassign.json --additional --bootstrap-server localhost:9092
That's it! Kafka should now perform the partition reassignment within the following hours.
For more information, see Reassigning partitions in Apache Kafka Cluster .
Ended: Kafka
System monitoring ↵
System monitoring¶
The following tools and techniques can help you understand how your TeskaLabs LogMan.io system is performing and investigate any issues that arise.
Preset dashboards¶
LogMan.io includes preset diagnostic dashboards that give you insight into your system performance. This is the best place to start monitoring.
Prophylactic checks¶
Prophylactic checks are preventive checkups on your LogMan.io app and system performance. Visit our prophylactic check manual to learn how to perform regular prophylactic checks.
Metrics¶
Metrics are measurements regarding system performance. Investigating metrics can be useful if you already know what area of your system you need insight into, which you can discover through analyzing your preset dashboards or performing a prophylactic check.
Grafana dashboards for system diagnostics¶
Through TeskaLabs LogMan.io, you can access dashboards in Grafana that monitor your data pipelines. Use these dashboards for diagnostic purposes.
The first few months of your deployment of TeskaLabs LogMan.io are a stabilization period, in which you might see extreme values produced by these metrics. These dashboards are especially useful during stabilization and can help with system optimization. Once your system is stable, extreme values, in general, indicate a problem.
To access the dashboards:
1. In LogMan.io, go to Tools.
2. Click on Grafana. You are now securely logged in to Grafana with your LogMan.io user credentials.
3. Click the menu button, and go to Dashboards.
4. Select the dashboard you want to see.
Tips
- Hover over any graph to see details at specific time points.
- You can change the timeframe of any dashboard with the timeframe tools in the top right corner of the screen.
LogMan.io dashboard¶
The LogMan.io dashboard monitors all data pipelines in your installation of TeskaLabs LogMan.io. This dashboard can help you investigate if, for example, you're seeing fewer logs than expected in LogMan.io. See Pipeline metrics for deeper explanations.
Metrics included:
-
Event In/Out: The volume of events passing through each data pipeline measured in in/out operations per second (io/s). If the pipeline is running smoothly, the In and Out quantities are equal, and the Drop line is zero. This means that the same amount of events are entering and leaving the pipeline, and none are dropped. If you can see in the graph that the In quantity is greater than the Out quantity, and that the Drop line is greater than zero, then some events have been dropped, and there might be an issue.
-
Duty cycle: Displays the percentage of data being processed as compared to data waiting to be processed. If the pipeline is working as expected, the duty cycle is at 100%. If the duty cycle is lower than 100%, it means that somewhere in the pipeline, there is a delay or a throttle causing events to queue.
-
Time drift: Shows you the delay or lag in event processing, meaning how long after an event's arrival it is actually processed. A significant or increased delay impacts your cybersecurity because it inhibits your ability to respond to threats immediately. Time drift and duty cycle are related metrics. There is a greater time drift when the duty cycle is below 100%.
System-level overview dashboard¶
The System-level overview dashboard monitors the servers involved in your TeskaLabs LogMan.io installation. Each node of the installation has its own section in the dashboard. When you encounter a problem in your system, this dashboard helps you perform an initial assessment on your server by showing you if the issue is related to input/output, CPU usage, network, or disk space or usage. However, for a more specific analysis, pursue exploring specific metrics in Grafana or InfluxDB.
Metrics included:
- IOWait: Percentage of time the CPU remains idle while waiting for disk I/O (input/output) requests. In other words, IOWait tells you how much processing time is being wasted waiting for data. A high IOWait, especially if it's around or exceeds 20% (depending on your system), signals that the disk read/write speed is becoming a system bottleneck. A rising IOWait indicates that the disk's performance is limiting the system's ability to receive and store more logs, impacting overall system throughput and efficiency.
- Uptime: The amount of time the server has been running since it was last shut down or restarted.
- Load: Represents the average number of processes waiting in the queue for CPU time over the last 5 minutes. It's a direct indicator of how busy your system is. In systems with multiple CPU cores, this metric should be considered in relation to the total number of available cores. For instance, a load of 64 on a 64-core system might be acceptable, but above 100 indicates severe stress and unresponsiveness. The ideal load varies based on the specific configuration and use case but generally should not exceed 80% of the total number of CPU cores. Consistently high load values indicate that the system is struggling to process the incoming stream of logs efficiently.
- RAM usage: The percentage of the total memory currently being used by the system. Keeping RAM usage between 60-80% is generally optimal. Usage above 80% often leads to increased swap usage, which in turn can slow down the system and lead to instability. Monitoring RAM usage is crucial for ensuring that the system has enough memory to handle the workload efficiently without resorting to swap, which is significantly slower.
- CPU usage: An overview of the percentage of CPU capacity currently in use. It averages the utilization across all CPU cores, which means individual cores could be under or over-utilized. High CPU usage, particularly over 95%, suggests the system is facing CPU-bound challenges, where the CPU's processing capacity is the primary limitation. This dashboard metric helps differentiate between I/O-bound issues (where the bottleneck is data transfer) and CPU-bound issues. It's a critical tool for identifying processing bottlenecks, although it's important to interpret this metric alongside other system indicators for a more accurate diagnosis.
- Swap usage: How much of the swap space is being used. A swap partition is dedicated space on the disk used as a temporary substitute for RAM ("data overflow"). When RAM is full, the system temporarily stores data in swap space. High swap usage, above approximately 5-10%, indicates that the system is running low on memory, which can lead to degraded performance and instability. Persistent high swap usage is a sign that the system requires more RAM, as relying heavily on swap space can become a major performance bottleneck.
- Disk usage: Measures how much of the storage capacity is currently being used. In your log management system, it's crucial to keep disk usage below 90% and take action if it reaches 80%. Inadequate disk space is a common cause of system failures. Monitoring disk usage helps in proactive management of storage resources, ensuring that there is enough space for incoming data and system operations. Since most systems are configured to delete data after 18 months of storage, disk space usage can begin to stabilize after the system has been running for 18 months. Read more about the data lifecycle.
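If you want to cross-check the thresholds described above directly on a server, the same quantities can be read with the psutil Python library. This is a minimal sketch, assuming psutil is installed on a Linux host; the threshold comments mirror the guidance above and may need tuning for your deployment.
```python
import psutil

# Load: compare the 5-minute load average with the number of CPU cores.
load1, load5, load15 = psutil.getloadavg()
cores = psutil.cpu_count()
print(f"load5={load5:.1f} on {cores} cores ({load5 / cores:.0%} of core capacity)")

# RAM and swap usage: aim for RAM below ~80% and swap below ~5-10%.
print(f"RAM used: {psutil.virtual_memory().percent}%")
print(f"Swap used: {psutil.swap_memory().percent}%")

# Disk usage: take action at 80%, keep below 90%.
print(f"Disk used (/): {psutil.disk_usage('/').percent}%")

# IOWait: percentage of CPU time spent idle while waiting for disk I/O (Linux only).
print(f"IOWait: {psutil.cpu_times_percent(interval=1).iowait}%")
```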
Elasticsearch metrics dashboard¶
The Elasticsearch metrics dashboard monitors the health of the Elastic pipeline. (Most TeskaLabs LogMan.io users use the Elasticsearch database to store log data.)
Metrics included:
- Cluster health: Green is good; yellow and red indicate a problem.
- Number of nodes: A node is a single instance of Elasticsearch. The number of nodes is how many nodes are part of your LogMan.io Elasticsearch cluster.
- Shards
- Active shards: Number of total shards active. A shard is the unit at which Elasticsearch distributes data around a cluster.
- Unassigned shards: Number of shards that are not available. They might be in a node which is turned off.
- Relocating shards: Number of shards that are in the process of being moved to a different node. (You might want to turn off a node for maintenance, but you still want all of your data to be available, so you can move a shard to a different node. This metrics tells you if any shards are actively in this process and therefore can't provide data yet.)
- Used mem: Memory used. Used memory at 100% would mean that Elasticsearch is overloaded and requires investigation.
- Output queue: The number of tasks waiting to be processed in the output queue. A high number could indicate a significant backlog or bottleneck.
- Stored GB: The amount of disk space being used for storing data in the Elasticsearch cluster. Monitoring disk usage helps ensure that there's sufficient space available and to plan for capacity scaling as necessary.
- Docs count: The total number of documents stored within the Elasticsearch indices. Monitoring the document count can provide insights into data growth and index management requirements.
- Task max waiting in queue: The maximum time a task has waited in a queue to be processed. It’s useful for identifying delays in task processing which could impact system performance and throughput.
- Open file descriptors: File descriptors are handles that allow the system to manage and access files and network connections. Monitoring the number of open file descriptors is important to ensure that system resources are being managed effectively and to prevent potential file handle leaks which could lead to system instability
- Used cpu %: The percentage of CPU resources currently being used by Elasticsearch. Monitoring CPU usage helps you understand the system's performance and identify potential CPU bottlenecks.
- Indexing: The rate at which new documents are being indexed into Elasticsearch. A higher rate means your system can index more information more efficiently.
- Inserts: The number of new documents being added to the Elasticsearch indices. This line follows a regular pattern if you have a consistent number of inputs. If the line spikes or dips irregularly, there could be an issue in your data pipeline keeping events from reaching Elasticsearch.
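If you prefer a quick command-line check over the dashboard, several of these values are also exposed by the Elasticsearch cluster health API. A minimal sketch using Python's requests library; the URL and credentials are placeholders for your own cluster, and certificate verification is disabled here only for brevity.
```python
import requests

ES_URL = "https://localhost:9200"   # placeholder: your Elasticsearch node
AUTH = ("elastic", "changeme")      # placeholder: your credentials

health = requests.get(f"{ES_URL}/_cluster/health", auth=AUTH, verify=False).json()

# The same values the dashboard reports: status, node count, and shard counts.
print("status:", health["status"])                       # green / yellow / red
print("nodes:", health["number_of_nodes"])
print("active shards:", health["active_shards"])
print("unassigned shards:", health["unassigned_shards"])
print("relocating shards:", health["relocating_shards"])
```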
Burrow consumer lag dashboard¶
The Burrow dashboard monitors the consumers and partitions of Apache Kafka. Learn more about Burrow here.
Apache Kafka terms:
- Consumers: Consumers read data. They subscribe to one or more topics and read the data in the order in which it was produced.
- Consumer groups: Consumers are typically organized into consumer groups. Each consumer within a group reads from exclusive partitions of the topics they subscribe to, ensuring that each record is processed only once by the group, even if multiple consumers are reading.
- Partitions: Topics are split into partitions. This allows the data to be distributed across the cluster, allowing for concurrent read and write operations.
Metrics included:
- Group status: The overall health status of the consumer group. A status of OK means that the group is functioning normally, while a warning or error could indicate issues like connectivity problems, failed consumers, or misconfigurations.
- Total lag: In this case, lag can be thought of as a queue of tasks waiting to be processed by a microservice. The total lag metric represents the count of messages that have been produced to the topic but not yet consumed by a specific consumer or consumer group. If the lag is 0, everything is dispatched properly, and there is no queue. Because Apache Kafka tends to group data into batches, some amount of lag is often normal. However, an increasing lag, or a lag above approximately 300,000 (this number is dependent on your system capacity, configuration, and sensitivity) is cause for investigation.
- Partitions lag: The lag for individual partitions within a topic. Being able to see partitions' lags separated tells you if some partitions have a larger queue, or higher delay, than others, which might indicate uneven data distribution or other partition-specific issues.
- Partition status: The status of individual partitions. An OK status indicates the partition is operating normally. Warnings or errors can signify problems like a stalled consumer, which is not reading from the partition. This metric helps identify specific partition-level issues that might not be apparent when looking at the overall group status.
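For a spot check, consumer lag can also be computed directly against Kafka without Burrow. A minimal sketch using the kafka-python library; the bootstrap server, topic, and consumer group names are illustrative and must be adapted to your deployment.
```python
from kafka import KafkaConsumer, TopicPartition

TOPIC = "example-topic"            # illustrative topic name
GROUP = "example-consumer-group"   # illustrative consumer group name

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id=GROUP,
    enable_auto_commit=False,
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)   # latest produced offset per partition

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0      # last offset committed by the group
    lag = end_offsets[tp] - committed
    total_lag += lag
    print(f"partition {tp.partition}: lag={lag}")

print("total lag:", total_lag)
consumer.close()
```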
Prophylactic check manual¶
A prophylactic check is a systematic preventative assessment to verify that a system is working properly, and to identify and mitigate potential issues before they escalate into more severe or critical problems. By performing regular prophylactic checks, you can proactively maintain the integrity, reliability, and efficiency of your TeskaLabs LogMan.io system, minimizing the risk of unexpected failures or disruptions that could arise if left unaddressed.
Support
If you need any further information or support than what you see here, reach out to your TeskaLabs LogMan.io support Slack channel, or send an e-mail to support@teskalabs.com. We will assist you promptly.
Performing prophylactic checks¶
Important
Conduct prophylactic checks at consistent intervals, ideally on the same day of the week and around the same time. Remember that the volume and timing of incoming events can fluctuate depending on the day of the week, working hours, and holidays.
During prophylactic checks, make sure to conduct a comprehensive review of all available tenants.
Examine each of the following components of your TeskaLabs LogMan.io installation according to our recommendations, and report issues as needed.
TeskaLabs LogMan.io functionalities¶
Location: TeskaLabs LogMan.io sidebar
Goal: Ensuring that every functionality of the TeskaLabs LogMan.io app works properly
Within the assigned tenant, thoroughly examine each component featured in the sidebar (Discover, Dashboards, Exports, Lookups, Reports, etc.) to ensure their proper operation. Issues identified in this section should be reported to your TeskaLabs support channel. Such issues include pop-up errors when opening a section from the sidebar, loss of availability of some of the tools, or, for example, not being able to open Dashboards.
Issue reporting: Utilize the support Slack channel for general reporting.
Log source monitoring¶
Location: TeskaLabs LogMan.io Discover screen or dedicated dashboard
Goal: Ensuring that each log source is active and works as expected, and that no anomalies are found (for example, a dropout, a peak, or anything unusual). This is also crucial for your log source visibility.
Note: Consider incorporating Baselines as another option for log source checks.
Log source monitoring can be achieved by individually reviewing each log source, or by creating an overview dashboard equipped with widgets for monitoring each log source's activity visually. We recommend creating a dashboard with line charts.
The examination should always cover a sample of data from the period since the previous prophylactic check.
Issue reporting: In case of an inactive log source, conduct further investigation and report to your TeskaLabs LogMan.io Slack support channel.
Log time zones¶
Location: TeskaLabs LogMan.io Discover screen
Goal: Ensuring that there are no discrepancies between your time zone and the time zone present in the logs
Investigate whether there are any logs with a @timestamp value in the future. You can do so by filtering the time range from now to 2 (or more) hours from now.
Issue reporting: Utilize the project support Slack for general reporting.
If the issue appears to be linked to the logging device settings, please investigate this further within your own network.
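The same check can also be scripted against Elasticsearch. A minimal sketch that counts logs whose @timestamp lies in the future; the URL, credentials, and index pattern are placeholders, and certificate verification is disabled here only for brevity.
```python
import requests

ES_URL = "https://localhost:9200"   # placeholder: your Elasticsearch node
AUTH = ("elastic", "changeme")      # placeholder: your credentials
INDEX = "lmio-*-events*"            # placeholder: your event index pattern

# Count documents whose @timestamp is later than the current time.
query = {"query": {"range": {"@timestamp": {"gt": "now"}}}}
resp = requests.post(f"{ES_URL}/{INDEX}/_count", json=query, auth=AUTH, verify=False)

print(resp.json()["count"], "logs have a @timestamp in the future")
```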
Other events¶
Location: TeskaLabs LogMan.io Discover screen, lmio-others-events index
Goal: Ensuring all the events are parsed correctly using either Parsec or Parser.
In most installations, we collect error logs from the following areas:
- Parser
- Parsec
- Dispatcher
- Depositor
- Unstructured logs
Logs that are not parsed correctly go to the others index. Ideally, no logs should be present in the others index.
Issue reporting: If a few logs are found in the others index, such as those indicating incorrect parsing errors, it's generally not a severe problem requiring immediate attention. Investigate these logs further and report to your TeskaLabs LogMan.io support Slack channel.
System logs¶
Location: TeskaLabs LogMan.io - System tenant, index Events & Others.
Goal: Ensuring the system is working properly and there are no unusual or critical system logs that could signal any internal issue
Issue reporting: A multitude of log types may be found in this section. Reporting can be done either via your TeskaLabs LogMan.io Slack channel, or within your infrastructure.
Baseliner¶
Note
Baseliner is included only in advanced deployments of LogMan.io. If you would like to upgrade LogMan.io, contact support, and we'll be happy to assist you.
Location: TeskaLabs LogMan.io Discover screen filtering for event.dataset:baseliner
Goal: Ensuring that the Baseliner functionality is working properly and is detecting deviations from a calculated activity baseline.
Issue reporting: If the Baseliner is not active, report it to your TeskaLabs LogMan.io support Slack channel.
Elasticsearch¶
Location: Grafana, dedicated Elasticsearch dashboard
Goal: Ensuring that there are no malfunctions linked to Elasticsearch and services associated with it.
The assessment should always be based on a sample of data from the past 24 hours. This operational dashboard provides an indication of the proper functioning of Elasticsearch.
Key Indicators:
- Inactive Nodes should be at zero.
- System Health should be green. Any indication of yellow or red should be escalated to the TeskaLabs LogMan.io Slack support channel immediately.
- Unassigned Shards should be at zero and marked as green. Any value in yellow or above warrants monitoring and reporting.
Issue reporting: If there are any issues detected, ensure prompt escalation. Further investigation of the Elastic cluster can be conducted in Kibana/Stack monitoring.
Nodes¶
Detailed information about node health can be found in Elasticsearch. JVM Heap monitors memory usage.
Overview¶
The current EPS (events per second) of the entire Elastic cluster is visible.
Index sizing & lifecycle monitoring¶
Location: Kibana, Stack monitoring or Stack management
Follow these steps to analyze indices for abnormal size:
- Access the "Indices" section.
- Proceed to filter the "Data" column, arranging it from largest to smallest.
- Examine the indexes to identify any that exhibit a significantly larger size compared to the others.
The acceptable index size range is a topic for discussion, but generally, indices up to 200 GB are considered acceptable.
Any indices exceeding 200 GB in size should be reported.
In the case of indexes associated with ILM (index lifecycle management), it's crucial to verify the index status. If an index lacks a string of numbers at the end of its name, it indicates it is not linked to an ILM policy and may grow without automatic rollover. To confirm this, review the index's properties to check whether it falls under the hot, warm, or cold category. When indices are not connected to ILM, they tend to remain in a hot state or exhibit irregular shifts between hot, cold, and warm.
Please note that lookups do not have ILM and should always be considered in the hot state.
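To list the largest indices programmatically rather than through Kibana, the _cat/indices API can be sorted by store size. A minimal sketch; the URL and credentials are placeholders, and certificate verification is disabled only for brevity.
```python
import requests

ES_URL = "https://localhost:9200"   # placeholder: your Elasticsearch node
AUTH = ("elastic", "changeme")      # placeholder: your credentials

# List indices sorted by store size, largest first, with sizes reported in GB.
resp = requests.get(
    f"{ES_URL}/_cat/indices",
    params={"format": "json", "bytes": "gb", "s": "store.size:desc"},
    auth=AUTH,
    verify=False,
)

for index in resp.json():
    size_gb = int(index["store.size"] or 0)
    if size_gb > 200:               # the 200 GB guideline discussed above
        print(f"{index['index']}: {size_gb} GB")
```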
Issue reporting: Report to the dedicated project support Slack channel. Such reports should be treated with the utmost seriousness and escalated promptly.
System-Level Overview¶
Location: Grafana, dedicated System Level Overview dashboard
The assessment should always be based on a sample of data from the past 24 hours.
Key metrics to monitor:
- Disk usage: All values must not exceed 80%, except for /boot, which should not exceed 95%.
- Load: Values must not exceed 40%, and the maximum load should align with the number of cores.
- IOWait: Indicates data processing and should only register as a small percentage, signifying that the device is waiting for data to load from the disk.
- RAM usage: Further considerations should be made regarding the establishment of high-value thresholds.
In the case of multiple servers, ensure values are checked for each.
Issue reporting: Report to the dedicated project support Slack channel.
Burrow Consumer Lag¶
Location: Grafana, dedicated Burrow Consumer Lag dashboard
For Kafka monitoring, scrutinize this dashboard for each consumerGroup, with a specific focus on:
- lmio dispatcher
- lmio depositor
- lmio baseliner
- lmio correlator
- lmio watcher
The lag value exhibiting an increasing trend over time indicates a problem that needs to be addressed immediately.
Issue reporting: If lag increases compared to the previous week's prophylaxis, promptly report this on the support Slack channel.
Depositor Monitoring¶
Location: Grafana, dedicated Depositor dashboard.
Key metrics to monitor:
- Failed bulks: Must be green and equal to zero
- Output Queue Size of Bulks
- Duty Cycle
- EPS IN & OUT
- Successful Bulks
Issue reporting: Report to the dedicated project support Slack channel.
Metrics ↵
System monitoring metrics¶
When logs and events pass through TeskaLabs LogMan.io, they are processed by several TeskaLabs microservices as well as Apache Kafka, and most deployments store data in Elasticsearch. Since the microservices and other technologies handle a huge volume of events, it is not practical to monitor them with logs. Instead, metrics, or measurements, monitor the status and health of each microservice and other parts of your system.
You can access the metrics in Grafana and/or InfluxDB with preset or custom visualizations. Each metric for each microservice updates approximately once per minute.
Viewing metrics¶
To access system monitoring metrics, you can use Grafana and/or InfluxDB through the TeskaLabs LogMan.io web app Tools page.
Using Grafana to view metrics¶
Preset dashboards¶
We deploy TeskaLabs LogMan.io with a prepared set of monitoring and diagnostic dashboards - details and instructions for access here. These dashboards give you a broader overview of what's going on in your system. We recommend consulting these dashboards first if you don't know what specific metrics you want to investigate.
Using Grafana's Explore tool¶
1. In Grafana, click the (menu) button, and go to Explore.
2. Set data source to InfluxDB.
3. Use the clickable query builder:
Grafana query builder
FROM:
1. Measurement: Click on select measurement to choose a group of metrics. In this case, the metrics group is bspump.pipeline.
2. Tag: Click the plus sign beside WHERE to select a tag. Since this example shows metrics from a microservice, appclass::tag is selected.
3. Tag value: Click select tag value, and select a value. In this example, the query will show metrics from the Parsec microservice.
Optionally, you can add additional filters in the FROM section, such as pipeline and host.
SELECT:
4. Fields: Add fields to add specific metrics to the query.
5. Aggregation: You can choose the aggregation method for each metric. Be aware that Grafana cannot display a graph in which some values are aggregated and others are non-aggregated.
GROUP BY:
6. Fill: You can choose fill(null) or fill(none) to decide how to fill gaps between data points. fill(null) does not fill the gaps, so your resulting graph will be data points with space between. fill(none) connects data points with a line, so you can more easily see trends.
4. Adjust the timeframe as needed, and click Run query.
For more information about Grafana's Explore function, visit Grafana's documentation.
Using InfluxDB to view metrics¶
If you have access to InfluxDB, you can use it to explore data. InfluxDB provides a query builder that allows you to filter out which metrics you want to see, and get visualizations (graphs) of those metrics.
To access InfluxDB:
- In the LogMan.io web app, go to Tools.
- Click on InfluxDB, and log in.
Using the query builder:
This example guides you through investigating a metric that is specific to a microservice, such as a pipeline monitoring metric. If you're seeking a metric that does not involve a microservice, begin with the _measurement tag, then filter with additional relevant tags.
- In InfluxDB, in the left sidebar, click the icon to go to the Data Explorer. Now, you can see InfluxDB's visual query builder.
- In the first box, select a bucket. (Your metrics bucket is most likely either named metrics or named after your organization.)
- In the next filter, select appclass from the drop-down menu to see the list of microservices that produce metrics. Click on the microservice from which you want to see metrics.
- In the next filter, select _measurement from the drop-down menu to see the list of metrics groups. Select the group you want to see.
- In the next filter, select _field from the drop-down menu to see the list of metrics available. Select the metrics you want to see.
- A microservice can have multiple pipelines. To narrow your results to a specific pipeline, use an additional filter. Select pipeline from the drop-down menu, and select the pipeline(s) you want represented.
- Optionally, you can also select a host in the next filter. Without filtering, InfluxDB displays the data from all hosts available, but you likely have only one host. To select a host, choose host in the drop-down menu, and select a host.
- Change the timeframe if desired.
- To load the visualization, click Submit.
Visualization produced in this example:
For more information about InfluxDB's Data explorer function, visit InfluxDB's documentation.
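The same data can also be pulled programmatically with the influxdb-client Python library, using a Flux query roughly equivalent to the builder steps above. The URL, token, organization, bucket name, and appclass value are placeholders to adapt to your deployment.
```python
from influxdb_client import InfluxDBClient

# Flux query equivalent to the builder steps above; the bucket name and
# appclass value ("LmioParsecApplication") are placeholders.
flux = '''
from(bucket: "metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "bspump.pipeline")
  |> filter(fn: (r) => r["appclass"] == "LmioParsecApplication")
  |> filter(fn: (r) => r["_field"] == "event.in" or r["_field"] == "event.out")
'''

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    for table in client.query_api().query(flux):
        for record in table.records:
            print(record.get_time(), record.get_field(), record.get_value())
```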
Pipeline metrics¶
Pipeline metrics, or measurements, monitor the throughput of logs and events in the microservices' pipelines. You can use these pipeline metrics to understand the status and health of each microservice.
The data that moves through microservices is broken down to and measured in events. (Each event is one message in Kafka and will result in one entry in Elasticsearch.) Since events are countable, the metrics quantify the throughput, allowing you to assess pipeline status and health.
BSPump
Several TeskaLabs microservices are built on the technology BSPump, so the names of the metrics include bspump.
Microservices built on BSPump:
Microservice architecture
The internal architecture of each microservice differs and might affect your analysis of the metrics. Visit our Architecture page.
The microservices most likely to produce uneven event.in and event.out counter metrics without actually having an error are:
- Parser/Parsec - This is due to its internal architecture; the parser sends events into a different pipeline (Enricher), where the events are then not counted in event.out.
- Correlator - Since the correlator assesses events as they are involved in patterns, it often has a lower event.out count than event.in.
Metrics¶
Naming and tags in Grafana and InfluxDB
- Pipeline metrics groups are under the measurement tag.
- Pipeline metrics are produced for microservices (tag appclass) and can be further filtered with the additional tags host and pipeline.
- Each individual metric (for example, event.in) is a value in the field tag.
All metrics update automatically once per minute by default.
bspump.pipeline¶
event.in¶
Description: Counts the number of events entering the pipeline
Unit: Number (of events)
Interpretation: Observing event.in over time can show you patterns, spikes, and trends in how many events have been received by the microservice. If no events are coming in, event.in is a line at 0. If you are expecting throughput, and event.in is 0, there is a problem in the data pipeline.
event.out¶
Description: Counts the number of events leaving the pipeline successfully
Unit: Number (of events)
Interpretation: event.out should typically be the same as event.in, but there are exceptions. Some microservices are constructed to have either multiple outputs per input, or to divert data in such a way that the output is not detected by this metric.
event.drop¶
Description: Counts the number of events that have been dropped, or messages that have been lost, by a microservice.
Unit: Number (of events)
Interpretation: Since the microservices built on BSPump are generally not designed to drop messages, any drop is most likely an error.
When you hover over a graph in InfluxDB, you can see the values of each line at any point in time. In this graph, you can see that event.out is equal to event.in, and event.drop equals 0, which is the expected behavior of the microservice. The same number of events are leaving as are entering the pipeline, and no events are being dropped.
warning¶
Description: Counts the number of warnings produced in a pipeline.
Unit: Number (of warnings)
Interpretation: Warnings tell you that there is an issue with the data, but the pipeline was still able to process it. A warning is less severe than an error.
error¶
Description: Counts the number of errors in a pipeline.
Unit: Number (of errors)
Interpretation: Microservices might trigger errors for different reasons. The main reason for an error is that the data does not match the microservice's expectation, and the pipeline has failed to process that data.
bspump.pipeline.eps¶
EPS means events per second.
eps.in¶
Description: "Events per second in" - Rate of events successfully entering the pipeline
Unit: Events per second (rate)
Interpretation: eps.in should stay consistent over time. If a microservice's eps.in slows over time unexpectedly, there might be a problem in the data pipeline before the microservice.
eps.out¶
Description: "Events per second out" - Rate of events successfully leaving the pipeline
Unit: Events per second (rate)
Interpretation: Similar to event.in and event.out, eps.in and eps.out should typically be the same, but they could differ depending on the microservice. If events are entering the microservice much faster than they are leaving, and this is not the expected behavior of that pipeline, you might need to address an error causing a bottleneck in the microservice.
eps.drop¶
Description: "Events per second dropped" - rate of events being dropped in the pipeline
Unit: Events per second (rate)
Interpretation: See event.drop. If eps.drop rapidly increases, and it is not the expected behavior of the microservice, that indicates that events are being dropped, and there is a problem in the pipeline.
Similar to graphing event.in and event.out, the expected behavior of most microservices is for eps.out to equal eps.in with drop being equal to 0.
warning¶
Description: Counts the number of warnings produced in a pipeline in the specified timeframe.
Unit: Number (of warnings)
Interpretation: Warnings tell you that there is an issue with the data, but the pipeline was still able to process it. A warning is less severe than an error.
error¶
Description: Counts the number of errors in a pipeline in the specified timeframe.
Unit: Number (of errors)
Interpretation: Microservices might trigger errors for different reasons. The main reason for an error is that the data does not match the microservice's expectation, and the pipeline has failed to process that data.
bspump.pipeline.gauge¶
A gauge metric, percentage expressed as a number 0 to 1.
warning.ratio¶
Description: Ratio of events that generated warnings compared to the total number of successfully processed events.
Interpretation: If the warning ratio increases unexpectedly, investigate the pipeline for problems.
error.ratio¶
Description: Ratio of events that failed to process compared to the total number of successfully processed events.
Interpretation: If the error ratio increases unexpectedly, investigate the pipeline for problems. You could create a trigger to notify you when error.ratio exceeds, for example, 5%.
bspump.pipeline.dutycycle¶
The duty cycle (also called power cycle) describes if a pipeline is waiting for messages (ready, value 1) or unable to process new messages (busy, value 0).
In general:
- A value of 1 is acceptable because the pipeline can process new messages
- A value 0 indicates a problem, because the pipeline cannot process new messages.
Understanding the idea of duty cycle
We can use human productivity to explain the concept of the duty cycle. If a person is not busy at all and has nothing to do, they are just waiting for a task. Their duty cycle reading is at 100% - they are spending all of their time waiting and can take on more work. If a person is busy doing something and cannot take on any more tasks, their duty cycle is at 0%.
The above example (not taken from InfluxDB) shows what a change in duty cycle looks like on a very short time scale. In this example, the pipeline had two instances of being at 0, meaning not ready and unable to process new incoming events. Keep in mind that your system's duty cycle can fluctuate between 1 and 0 thousands of times per second; the duty cycle ready graphs you'll see in Grafana or InfluxDB will already be aggregated (more below).
ready¶
Description: ready aggregates (averages) the duty cycle values once per minute. While the duty cycle is expressed as 0 (false, busy) or 1 (true, waiting), the ready metric represents the percentage of time the duty cycle is at 0 or 1. Therefore, the value of ready is a percentage anywhere between 0 and 1, so the graph does not look like a typical duty cycle graph.
Unit: Percentage expressed as a number, 0 to 1
Interpretation: Monitoring the duty cycle is critical to understanding your system's capacity. While every system is different, in general, ready should stay above 70%. If ready goes below 70%, that means the duty cycle has dropped to 0 (busy) more than 30% of the time in that interval, indicating that the system is quite busy and requires some attention or adjustment.
The above graph shows that the majority of the time, the duty cycle was ready more than 90% of the time over the course of these two days. However, there are two points at which it dropped near and below 70%.
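To make the aggregation concrete, here is a minimal sketch of how a per-minute ready value can be derived from raw duty-cycle samples; the sample values are invented for illustration.
```python
# Raw duty-cycle samples taken within one minute: 1 = ready (waiting), 0 = busy.
# These values are invented for illustration.
samples = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

# The per-minute "ready" value is the average of the samples,
# i.e. the fraction of the minute the pipeline spent ready.
ready = sum(samples) / len(samples)
print(f"ready = {ready:.2f}")   # 0.80 -> ready 80% of the time; below 0.70 needs attention
```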
timedrift¶
The timedrift metric serves as a way to understand how much the timing of events' origins (usually @timestamp) varies from what the system considers to be the "current" time. This can be helpful for identifying issues like delays or inaccuracies in a microservice.
Each value is calculated once per minute by default:
avg¶
Average. This calculates the average time difference between when an event actually happened and when your system recorded it. If this number is high, it may indicate a consistent delay.
median¶
Median. This tells you the middle value of all timedrifts for a set interval, offering a more "typical" view of your system's timing accuracy. The median is less sensitive to outliers than the average, because a few extreme values do not shift the middle value the way they shift an average.
stddev¶
Standard deviation. This gives you an idea of how much the timedrift varies. A high standard deviation might mean that your timing is inconsistent, which could be problematic.
min¶
Minimum. This shows the smallest timedrift in your set of data. It's useful for understanding the best-case scenario in your system's timing accuracy.
max¶
Maximum. This indicates the largest time difference. This helps you understand the worst-case scenario, which is crucial for identifying the upper bounds of potential issues.
In this graph of time drift, you can see a spike in lag before the pipeline returns to normal.
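To illustrate how these statistics relate to one another, here is a minimal sketch that computes them from a handful of event timestamps; the timestamps are invented for illustration.
```python
import statistics
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)

# Invented @timestamp values: events whose origin time is 2-15 seconds in the past.
event_timestamps = [now - timedelta(seconds=s) for s in (2, 3, 3, 5, 8, 15)]

# Time drift = how far behind "now" each event's origin time is, in seconds.
drifts = [(now - ts).total_seconds() for ts in event_timestamps]

print("avg:   ", statistics.mean(drifts))
print("median:", statistics.median(drifts))
print("stddev:", statistics.stdev(drifts))
print("min:   ", min(drifts))
print("max:   ", max(drifts))
```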
commlink¶
The commlink is the communication link between LogMan.io Collector and LogMan.io Receiver. These metrics are specific to data sent from the Collector microservice to the Receiver microservice.
Tags: ActivityState, appclass (LogMan.io Receiver only), host, identity, tenant
- bytes.in: bytes that enter LogMan.io Receiver
- event.in: events that enter LogMan.io Receiver
logs¶
Count of logs that pass through microservices.
Tags: appclass, host, identity, instance_id, node_id, service_id, tenant
- critical: Count of critical logs
- errors: Count of error logs
- warnings: Count of warning logs
Disk usage metrics¶
Monitor your disk usage carefully to avoid a common cause of system failure.
disk¶
Metrics to monitor disk usage. See the InfluxDB Telegraf plugin documentation for more.
Tags: device, fstype (file system type), mode, node_id, path
- free: Total amount of free disk space available on the storage device, measured in bytes
- inodes_free: The number of free inodes (index nodes, the file system structures that describe files and directories), which limits how many new files can still be created on the file system.
- inodes_total: The total number of inodes that the file system supports.
- inodes_used: The number of inodes currently in use on the file system.
- total: Total capacity of the disk or storage device, measured in bytes.
- used: The amount of disk space currently in use, calculated in bytes.
- used_percent: The percentage of the disk space that is currently being used in relation to the total capacity.
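For a quick local cross-check of the same quantities, Python's standard library can report total, used, and free space for a given mount point. A minimal sketch:
```python
import shutil

# Report the same quantities as the "disk" metrics for the root mount point.
usage = shutil.disk_usage("/")
used_percent = usage.used / usage.total * 100

print(f"total: {usage.total} bytes")
print(f"used:  {usage.used} bytes")
print(f"free:  {usage.free} bytes")
print(f"used_percent: {used_percent:.1f}%")   # keep below 90%, take action at 80%
```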
diskio¶
Metrics to monitor disk traffic and timing. Consult the InfluxDB Telegraf plugin documentation for the definition of each metric.
Tags: name, node_id, wwid
- io_time
- iops_in_progress
- merged_reads
- merged_writes
- read_bytes
- read_time
- reads
- weighted_io_time
- write_bytes
- write_time
- writes
System performance metrics¶
cpu¶
Metrics to monitor system CPUs. See the InfluxDB Telegraf plugin documentation for more.
Tags: ActivityState, cpu, node_id
- time_active: Total time the CPU has been active, performing tasks excluding idle time.
- time_guest: Time spent running a virtual CPU for guest operating systems.
- time_guest_nice: Time the CPU spent running a niced guest (a guest with a positive niceness value).
- time_idle: Total time the CPU was not in use (idle).
- time_iowait: Time the CPU was idle while waiting for I/O operations to complete.
- time_irq: Time spent handling hardware interrupts.
- time_nice: Time the CPU spent processing user processes with a positive niceness value.
- time_softirq: Time spent handling software interrupts.
- time_steal: Time that a virtual CPU waited for a real CPU while the hypervisor was servicing another virtual processor.
- time_system: Time the CPU spent running system (kernel) processes.
- time_user: Time spent on executing user processes.
- usage_active: Percentage of time the CPU was active, performing tasks.
- usage_guest: Percentage of CPU time spent running virtual CPUs for guest OSes.
- usage_guest_nice: Percentage of CPU time spent running niced guests.
- usage_idle: Percentage of time the CPU was idle.
- usage_iowait: Percentage of time the CPU was idle due to waiting for I/O operations.
- usage_irq: Percentage of time spent handling hardware interrupts.
- usage_nice: Percentage of CPU time spent on processes with a positive niceness.
- usage_softirq: Percentage of time spent handling software interrupts.
- usage_steal: Percentage of time a virtual CPU waited for a real CPU while the hypervisor serviced another processor.
- usage_system: Percentage of CPU time spent on system (kernel) processes.
- usage_user: Percentage of CPU time spent executing user processes.
mdstat¶
Statistics about Linux MD RAID arrays configured on the host. RAID (redundant array of inexpensive or independent disks) combines multiple physical disks into one unit for the purpose of data redundancy (and therefore safety or protection against loss in the case of disk failure) as well as system performance (faster data access). Visit the InfluxDB Telegraf plugin documentation for more.
Tags: ActivityState (active or inactive), Devices, Name, _field, node_id
- BlocksSynced: The count of blocks that have been scanned if the array is rebuilding/checking
- BlocksSyncedFinishTime: Minutes remaining in the expected finish time of the rebuild scan
- BlocksSyncedPct: Percentage remaining of the rebuild scan
- BlocksSyncedSpeed: The current speed the rebuild is running at, listed in K/sec
- BlocksTotal: The count of total blocks in the array
- DisksActive: Number of disks in the array that are currently considered healthy
- DisksDown: Number of disks in the array that are currently down, or non-operational
- DisksFailed: Count of currently failed disks in the array
- DisksSpare: Count of spare disks in the array
- DisksTotal: Count of total disks in the array
processes¶
All processes, grouped by status. Find the InfluxDB Telegraf plugin documentation here.
Tags: node_id
- blocked: Number of processes in a blocked state, waiting for resource or event to become available.
- dead: Number of processes that have finished execution but still have an entry in the process table.
- idle: Number of processes in an idle state, typically indicating they are not actively doing any work.
- paging: Number of processes that are waiting for paging, either being swapped in from or out to disk.
- running: Number of processes that are currently executing or ready to execute.
- sleeping: Number of processes that are in a sleep state, inactive until certain conditions are met or events occur.
- stopped: Number of processes that are stopped, typically due to receiving a signal or being debugged.
- total: Total number of processes currently existing in the system.
- total_threads: The total number of threads across all processes, as processes can have multiple threads.
- unknown: Number of processes in an unknown state, where their state can't be determined.
- zombies: Number of zombie processes, which have completed execution but still have an entry in the process table due to the parent process not reading its exit status.
system¶
These metrics provide general information about the system load, uptime, and number of users logged in. Visit the InfluxDB Telegraf plugin for details.
Tags: node_id
- load1: The average system load over the last one minute, indicating the number of processes in the system's run queue.
- load15: The average system load over the last 15 minutes, providing a longer-term view of the recent system load.
- load5: The average system load over the last 5 minutes, offering a shorter-term perspective of the recent system load.
- n_cpus: The number of CPU cores available in the system.
- uptime: The total time in seconds that the system has been running since its last startup or reboot.
temp¶
Temperature readings as collected by system sensors. Visit the InfluxDB Telegraf plugin documentation for details.
Tags: node_id, sensor
- temp: Temperature
Network-specific metrics¶
net¶
Metrics for network interface and protocol usage for Linux systems. Monitoring the volume of data transfer and potential errors is important to understanding the network health and performance. Visit the InfluxDB Telegraf plugin documentation for details.
Tags: interface, node_id
bytes fields: Monitoring the volume of data transfer, which is important to bandwidth management and network capacity planning.
- bytes_recv: The total number of bytes received by the interface
- bytes_sent: The total number of bytes sent by the interface
drop fields: Dropped packets are often a sign of network congestion, hardware issues, or incorrect configurations. Dropped packets can lead to performance degradation.
- drop_in: The total number of received packets dropped by the interface
- drop_out: The total number of transmitted packets dropped by the interface
error fields: High error rates can signal issues with the network hardware, interference, or configuration problems.
- err_in: The total number of receive errors detected by the interface
- err_out: The total number of transmit errors detected by the interface
packet fields: The number of packets sent and received gives an indication of network traffic and can help identify if the network is under heavy load or if there are issues with packet transmission.
- packets_recv: The total number of packets received by the interface
- packets_sent: The total number of packets sent by the interface
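A quick way to sanity-check these counters on a host is psutil's per-interface network counters, which expose the same byte, packet, error, and drop fields. A minimal sketch, assuming psutil is installed:
```python
import psutil

# Per-interface counters: bytes, packets, errors, and drops in both directions.
for interface, counters in psutil.net_io_counters(pernic=True).items():
    print(
        f"{interface}: "
        f"bytes_recv={counters.bytes_recv}, bytes_sent={counters.bytes_sent}, "
        f"packets_recv={counters.packets_recv}, packets_sent={counters.packets_sent}, "
        f"err_in={counters.errin}, err_out={counters.errout}, "
        f"drop_in={counters.dropin}, drop_out={counters.dropout}"
    )
```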
nstat¶
Network metrics. Visit the InfluxDB Telegraf plugin documentation for more.
Tags: name, node_id
ICMP fields¶
ICMP (internet control message protocol) metrics are used for network diagnostics and control messages, like error reporting and operational queries. Visit this page for additional field definitions.
Key terms:
- Echo requests/replies (ping): Used to test reachability and round-trip time.
- Destination unreachable: Indicates that a destination is unreachable.
- Parameter problems: Signals issues with IP header parameters.
- Redirect messages: Instructs to use a different route.
- Time exceeded messages: Indicates that the time to live (TTL) for a packet has expired.
IP fields¶
IP (internet protocol) metrics monitor the core protocol for routing packets across the internet and local networks.
Visit this page for additional field definitions.
Key terms:
- Address errors: Errors related to incorrect or unreachable IP addresses.
- Header errors: Problems in the IP header, such as incorrect checksums or formatting issues.
- Delivered packets: Packets successfully delivered to their destination.
- Discarded packets: Packets discarded due to errors or lack of buffer space.
- Forwarded datagrams: Packets routed to their next hop towards the destination.
- Reassembly failures: Failure in reassembling fragmented IP packets.
- IPv6 multicast/broadcast packets: Packets sent to multiple destinations or all nodes in a network segment in IPv6.
TCP fields¶
These metrics monitor the TCP, or transmission control protocol, which provides reliable, ordered, and error-checked delivery of data between applications. Visit this page for additional field definitions.
Key terms:
- Connection opens: Initiating a new TCP connection.
- Segments: Units of data transmission in TCP.
- Reset segments (RST): Used to abruptly close a connection.
- Retransmissions: Resending data that was not successfully received.
- Active/passive connection openings: Connections initiated actively (outgoing) or passively (incoming).
- Checksum errors: Errors detected in the TCP segment checksum.
- Timeout retransmissions: Resending data after a timeout, indicating potential packet loss.
UDP fields¶
These metrics monitor the UDP, or user datagram protocol, which facilitates low-latency (low-delay) but less reliable data transmission compared to TCP. Visit this page for additional field definitions.
- Datagrams: Basic transfer units in UDP.
- Receive/send buffer errors: Errors due to insufficient buffer space for incoming/outgoing data.
- No ports: Datagrams sent to a port with no listener.
- Checksum errors: Errors in the checksum field of UDP datagrams.
All nstat fields
- Icmp6InCsumErrors
- Icmp6InDestUnreachs
- Icmp6InEchoReplies
- Icmp6InEchos
- Icmp6InErrors
- Icmp6InGroupMembQueries
- Icmp6InGroupMembReductions
- Icmp6InGroupMembResponses
- Icmp6InMLDv2Reports
- Icmp6InMsgs
- Icmp6InNeighborAdvertisements
- Icmp6InNeighborSolicits
- Icmp6InParmProblems
- Icmp6InPktTooBigs
- Icmp6InRedirects
- Icmp6InRouterAdvertisements
- Icmp6InRouterSolicits
- Icmp6InTimeExcds
- Icmp6OutDestUnreachs
- Icmp6OutEchoReplies
- Icmp6OutEchos
- Icmp6OutErrors
- Icmp6OutGroupMembQueries
- Icmp6OutGroupMembReductions
- Icmp6OutGroupMembResponses
- Icmp6OutMLDv2Reports
- Icmp6OutMsgs
- Icmp6OutNeighborAdvertisements
- Icmp6OutNeighborSolicits
- Icmp6OutParmProblems
- Icmp6OutPktTooBigs
- Icmp6OutRedirects
- Icmp6OutRouterAdvertisements
- Icmp6OutRouterSolicits
- Icmp6OutTimeExcds
- Icmp6OutType133
- Icmp6OutType135
- Icmp6OutType143
- IcmpInAddrMaskReps
- IcmpInAddrMasks
- IcmpInCsumErrors
- IcmpInDestUnreachs
- IcmpInEchoReps
- IcmpInEchos
- IcmpInErrors
- IcmpInMsgs
- IcmpInParmProbs
- IcmpInRedirects
- IcmpInSrcQuenchs
- IcmpInTimeExcds
- IcmpInTimestampReps
- IcmpInTimestamps
- IcmpMsgInType3
- IcmpMsgOutType3
- IcmpOutAddrMaskReps
- IcmpOutAddrMasks
- IcmpOutDestUnreachs
- IcmpOutEchoReps
- IcmpOutEchos
- IcmpOutErrors
- IcmpOutMsgs
- IcmpOutParmProbs
- IcmpOutRedirects
- IcmpOutSrcQuenchs
- IcmpOutTimeExcds
- IcmpOutTimestampReps
- IcmpOutTimestamps
- Ip6FragCreates
- Ip6FragFails
- Ip6FragOKs
- Ip6InAddrErrors
- Ip6InBcastOctets
- Ip6InCEPkts
- Ip6InDelivers
- Ip6InDiscards
- Ip6InECT0Pkts
- Ip6InECT1Pkts
- Ip6InHdrErrors
- Ip6InMcastOctets
- Ip6InMcastPkts
- Ip6InNoECTPkts
- Ip6InNoRoutes
- Ip6InOctets
- Ip6InReceives
- Ip6InTooBigErrors
- Ip6InTruncatedPkts
- Ip6InUnknownProtos
- Ip6OutBcastOctets
- Ip6OutDiscards
- Ip6OutForwDatagrams
- Ip6OutMcastOctets
- Ip6OutMcastPkts
- Ip6OutNoRoutes
- Ip6OutOctets
- Ip6OutRequests
- Ip6ReasmFails
- Ip6ReasmOKs
- Ip6ReasmReqds
- Ip6ReasmTimeout
- IpDefaultTTL
- IpExtInBcastOctets
- IpExtInBcastPkts
- IpExtInCEPkts
- IpExtInCsumErrors
- IpExtInECT0Pkts
- IpExtInECT1Pkts
- IpExtInMcastOctets
- IpExtInMcastPkts
- IpExtInNoECTPkts
- IpExtInNoRoutes
- IpExtInOctets
- IpExtInTruncatedPkts
- IpExtOutBcastOctets
- IpExtOutBcastPkts
- IpExtOutMcastOctets
- IpExtOutMcastPkts
- IpExtOutOctets
- IpForwDatagrams
- IpForwarding
- IpFragCreates
- IpFragFails
- IpFragOKs
- IpInAddrErrors
- IpInDelivers
- IpInDiscards
- IpInHdrErrors
- IpInReceives
- IpInUnknownProtos
- IpOutDiscards
- IpOutNoRoutes
- IpOutRequests
- IpReasmFails
- IpReasmOKs
- IpReasmReqds
- IpReasmTimeout
- TcpActiveOpens
- TcpAttemptFails
- TcpCurrEstab
- TcpEstabResets
- TcpExtArpFilter
- TcpExtBusyPollRxPackets
- TcpExtDelayedACKLocked
- TcpExtDelayedACKLost
- TcpExtDelayedACKs
- TcpExtEmbryonicRsts
- TcpExtIPReversePathFilter
- TcpExtListenDrops
- TcpExtListenOverflows
- TcpExtLockDroppedIcmps
- TcpExtOfoPruned
- TcpExtOutOfWindowIcmps
- TcpExtPAWSActive
- TcpExtPAWSEstab
- TcpExtPAWSPassive
- TcpExtPruneCalled
- TcpExtRcvPruned
- TcpExtSyncookiesFailed
- TcpExtSyncookiesRecv
- TcpExtSyncookiesSent
- TcpExtTCPACKSkippedChallenge
- TcpExtTCPACKSkippedFinWait2
- TcpExtTCPACKSkippedPAWS
- TcpExtTCPACKSkippedSeq
- TcpExtTCPACKSkippedSynRecv
- TcpExtTCPACKSkippedTimeWait
- TcpExtTCPAbortFailed
- TcpExtTCPAbortOnClose
- TcpExtTCPAbortOnData
- TcpExtTCPAbortOnLinger
- TcpExtTCPAbortOnMemory
- TcpExtTCPAbortOnTimeout
- TcpExtTCPAutoCorking
- TcpExtTCPBacklogDrop
- TcpExtTCPChallengeACK
- TcpExtTCPDSACKIgnoredNoUndo
- TcpExtTCPDSACKIgnoredOld
- TcpExtTCPDSACKOfoRecv
- TcpExtTCPDSACKOfoSent
- TcpExtTCPDSACKOldSent
- TcpExtTCPDSACKRecv
- TcpExtTCPDSACKUndo
- TcpExtTCPDeferAcceptDrop
- TcpExtTCPDirectCopyFromBacklog
- TcpExtTCPDirectCopyFromPrequeue
- TcpExtTCPFACKReorder
- TcpExtTCPFastOpenActive
- TcpExtTCPFastOpenActiveFail
- TcpExtTCPFastOpenCookieReqd
- TcpExtTCPFastOpenListenOverflow
- TcpExtTCPFastOpenPassive
- TcpExtTCPFastOpenPassiveFail
- TcpExtTCPFastRetrans
- TcpExtTCPForwardRetrans
- TcpExtTCPFromZeroWindowAdv
- TcpExtTCPFullUndo
- TcpExtTCPHPAcks
- TcpExtTCPHPHits
- TcpExtTCPHPHitsToUser
- TcpExtTCPHystartDelayCwnd
- TcpExtTCPHystartDelayDetect
- TcpExtTCPHystartTrainCwnd
- TcpExtTCPHystartTrainDetect
- TcpExtTCPKeepAlive
- TcpExtTCPLossFailures
- TcpExtTCPLossProbeRecovery
- TcpExtTCPLossProbes
- TcpExtTCPLossUndo
- TcpExtTCPLostRetransmit
- TcpExtTCPMD5NotFound
- TcpExtTCPMD5Unexpected
- TcpExtTCPMTUPFail
- TcpExtTCPMTUPSuccess
- TcpExtTCPMemoryPressures
- TcpExtTCPMinTTLDrop
- TcpExtTCPOFODrop
- TcpExtTCPOFOMerge
- TcpExtTCPOFOQueue
- TcpExtTCPOrigDataSent
- TcpExtTCPPartialUndo
- TcpExtTCPPrequeueDropped
- TcpExtTCPPrequeued
- TcpExtTCPPureAcks
- TcpExtTCPRcvCoalesce
- TcpExtTCPRcvCollapsed
- TcpExtTCPRenoFailures
- TcpExtTCPRenoRecovery
- TcpExtTCPRenoRecoveryFail
- TcpExtTCPRenoReorder
- TcpExtTCPReqQFullDoCookies
- TcpExtTCPReqQFullDrop
- TcpExtTCPRetransFail
- TcpExtTCPSACKDiscard
- TcpExtTCPSACKReneging
- TcpExtTCPSACKReorder
- TcpExtTCPSYNChallenge
- TcpExtTCPSackFailures
- TcpExtTCPSackMerged
- TcpExtTCPSackRecovery
- TcpExtTCPSackRecoveryFail
- TcpExtTCPSackShiftFallback
- TcpExtTCPSackShifted
- TcpExtTCPSchedulerFailed
- TcpExtTCPSlowStartRetrans
- TcpExtTCPSpuriousRTOs
- TcpExtTCPSpuriousRtxHostQueues
- TcpExtTCPSynRetrans
- TcpExtTCPTSReorder
- TcpExtTCPTimeWaitOverflow
- TcpExtTCPTimeouts
- TcpExtTCPToZeroWindowAdv
- TcpExtTCPWantZeroWindowAdv
- TcpExtTCPWinProbe
- TcpExtTW
- TcpExtTWKilled
- TcpExtTWRecycled
- TcpInCsumErrors
- TcpInErrs
- TcpInSegs
- TcpMaxConn
- TcpOutRsts
- TcpOutSegs
- TcpPassiveOpens
- TcpRetransSegs
- TcpRtoAlgorithm
- TcpRtoMax
- TcpRtoMin
- Udp6IgnoredMulti
- Udp6InCsumErrors
- Udp6InDatagrams
- Udp6InErrors
- Udp6NoPorts
- Udp6OutDatagrams
- Udp6RcvbufErrors
- Udp6SndbufErrors
- UdpIgnoredMulti
- UdpInCsumErrors
- UdpInDatagrams
- UdpInErrors
- UdpLite6InCsumErrors
- UdpLite6InDatagrams
- UdpLite6InErrors
- UdpLite6NoPorts
- UdpLite6OutDatagrams
- UdpLite6RcvbufErrors
- UdpLite6SndbufErrors
- UdpLiteIgnoredMulti
- UdpLiteInCsumErrors
- UdpLiteInDatagrams
- UdpLiteInErrors
- UdpLiteNoPorts
- UdpLiteOutDatagrams
- UdpLiteRcvbufErrors
- UdpLiteSndbufErrors
- UdpNoPorts
- UdpOutDatagrams
- UdpRcvbufErrors
- UdpSndbufErrors
Authorization-specific metrics¶
TeskaLabs SeaCat Auth (as seen in tag appclass) handles all LogMan.io authorization, including credentials, logins, and sessions.
credentials¶
Tags: appclass (SeaCat Auth only), host, instance_id, node_id, service_id
- default: The number of credentials (user accounts) existing in your deployment of TeskaLabs LogMan.io.
logins¶
Count of failed and successful logins via TeskaLabs SeaCat Auth.
Tags: appclass (SeaCat Auth only), host, instance_id, node_id, service_id
- failed: Counts failed login attempts. Reports at the time of the login.
- successful: Counts successful logins. Reports at the time of the login.
sessions¶
A session begins any time a user logs in to LogMan.io, so the sessions metric counts open sessions.
Tags: appclass (SeaCat Auth only), host, instance_id, node_id, service_id
- sessions: Number of sessions open at the time
Memory metrics¶
By monitoring memory usage metrics, you can understand how memory resources are being used. This, in turn, can provide insights into areas that may need optimization or adjustment.
memory and os.stat¶
Tags: appclass, host, identity, instance_id, node_id, service_id, tenant
VmPeak¶
Meaning: Peak virtual memory size. This is the peak total of virtual memory used by the microservice. Virtual memory includes both physical RAM and disk swap space (the sum of all virtual memory areas involved in the process).
Interpretation: Monitoring the peak can help you identify if a service is using more memory than expected, potentially indicating a memory leak or a requirement for optimization.
VmLck¶
Meaning: Locked memory size. This indicates the portion of memory that is locked in RAM and can't be swapped out to disk.
Interpretation: A high amount of locked memory could potentially reduce the system's flexibility in managing memory, which might lead to performance issues.
VmPin¶
Meaning: Pinned memory size. This is the portion of memory that is "pinned" in place; a memory page's physical location can't be changed within RAM automatically or swapped out to disk.
Interpretation: Like locked memory, pinned memory can't be moved, so a high value could also limit system flexibility.
VmHWM¶
Meaning: Peak resident set size ("high water mark"). This is the maximum amount of physical RAM that the microservice has used.
Interpretation: If this value is consistently high, it might indicate that the service needs optimization or that you need to allocate more physical RAM.
VmRSS¶
Meaning: Resident set size. This shows the portion of the microservice's memory that is held in RAM.
Interpretation: A high RSS value could mean your service is using a lot of RAM, potentially leading to performance issues if it starts to swap.
VmData, VmStk, VmExe¶
Meaning: Size of data, stack, and text segments. These values represent the sizes of different memory segments: data, stack, and executable code.
Interpretation: Monitoring these can help you understand the memory footprint of your service and can be useful for debugging or optimizing your code.
VmLib¶
Meaning: Shared library code size. This is the amount of memory used by shared library code in the process (executable pages, with VmExe subtracted).
Interpretation: If this is high, you may want to check whether all the libraries are necessary, as they add to the memory footprint.
VmPTE¶
Meaning: Page table entries size. This indicates the size of the page table, which maps virtual memory to physical memory.
Interpretation: A large size might signify that a lot of memory is being used, which could be an issue if it grows too much.
VmSize¶
Meaning: Total virtual memory size. This is the current total amount of virtual memory used by the process; VmPeak records the maximum this value has reached.
Interpretation: Like VmPeak, monitoring this size helps in identifying potential memory issues.
VmSwap¶
Meaning: Swapped-out virtual memory size. This indicates the amount of virtual memory that has been swapped out to disk. shmem swap is not included.
Interpretation: Frequent swapping is generally bad for performance; thus, if this metric is high, you may need to allocate more RAM or optimize your services.
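On Linux, all of the Vm* values above come from /proc/<pid>/status. A minimal sketch that reads them for the current process:
```python
# Read the Vm* fields (VmPeak, VmSize, VmRSS, VmSwap, ...) from /proc/<pid>/status.
# "self" refers to the current process; substitute a PID to inspect another process.
def read_vm_fields(pid: str = "self") -> dict:
    fields = {}
    with open(f"/proc/{pid}/status") as status_file:
        for line in status_file:
            if line.startswith("Vm"):
                key, _, value = line.partition(":")
                fields[key] = value.strip()   # e.g. "123456 kB"
    return fields

print(read_vm_fields())
```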
mem¶
Additional measurements regarding memory. Visit the InfluxDB Telegraf plugin documentation for details.
Tags: node_id
- active: Memory currently in use or very recently used, and thus not immediately available for eviction.
- available: The amount of memory that is readily available for new processes without swapping.
- available_percent: The percentage of total memory that is readily available for new processes.
- buffered: Memory used by the kernel for things like file system metadata, distinct from caching.
- cached: Memory used to store recently used data for quick access, not immediately freed when processes no longer require it.
- commit_limit: The total amount of memory that can be allocated to processes, including both RAM and swap space.
- committed_as: The total amount of memory currently allocated by processes, even if not used.
- dirty: Memory pages that have been modified but not yet written to their respective data location in storage.
- free: The amount of memory that is currently unoccupied and available for use.
- high_free: The amount of free memory in the system's high memory area (memory beyond direct kernel access).
- high_total: The total amount of system memory in the high memory area.
- huge_page_size: The size of each huge page (larger-than-standard memory pages used by the system).
- huge_pages_free: The number of huge pages that are not currently being used.
- huge_pages_total: The total number of huge pages available in the system.
- inactive: Memory that has not been used recently and can be made available for other processes or disk caching.
- low_free: The amount of free memory in the system's low memory area (memory directly accessible by the kernel).
- low_total: The total amount of system memory in the low memory area.
- mapped: Memory used for mapped files, such as libraries and executable files in memory.
- page_tables: Memory used by the kernel to keep track of virtual memory to physical memory mappings.
- shared: Memory used by multiple processes, or shared between processes and the kernel.
- slab: Memory used by the kernel for caching data structures.
- sreclaimable: Part of the slab memory that can be reclaimed, such as caches that can be freed if necessary.
- sunreclaim: Part of the slab memory that cannot be reclaimed under memory pressure.
- swap_cached: Memory that has been swapped out to disk but is still in RAM.
- swap_free: The amount of swap space currently not being used.
- swap_total: The total amount of swap space available.
- total: The total amount of physical RAM available in the system.
- used: The amount of memory that is currently being used by processes.
- used_percent: The percentage of total memory that is currently being used.
- vmalloc_chunk: The largest contiguous block of memory available in the kernel's vmalloc space.
- vmalloc_total: The total amount of memory available in the kernel's vmalloc space.
- vmalloc_used: The amount of memory currently used in the kernel's vmalloc space.
- write_back: Memory which is currently being written back to the disk.
- write_back_tmp: Temporary memory used during write-back operations.
Kernel-specific metrics¶
kernel
¶
Metrics to monitor the Linux kernel. Visit the InfluxDB Telegraf plugin documentation for more details.
Tags: node_id
- boot_time: The time when the system was last booted, measured in seconds since the Unix epoch (January 1, 1970). This tells you the system uptime and the time of the last restart. You can convert this number to a date using a Unix epoch time converter.
- context_switches: The number (count, integer) of context switches the kernel has performed. A context switch occurs when the CPU switches from one process or thread to another. A high number of context switches can indicate that many processes are competing for CPU time, which can be a sign of high system load.
- entropy_avail: The amount (integer) of available entropy (randomness that can be generated) in the system, which is essential for secure random number generation. Low entropy can affect cryptographic functions and secure communications. Entropy is consumed by various operations and replenished over time, so monitoring this metric is important for maintaining security.
- interrupts: The total number (count, integer) of interrupts processed since boot. An interrupt is a signal to the processor emitted by hardware or software indicating an event that needs immediate attention. High numbers of interrupts can indicate a busy or possibly overloaded system.
- processes_forked: The total number (count, integer) of processes that have been forked (created) since the system was booted. Tracking the rate of process creation can help in diagnosing system performance issues, especially in environments where processes are frequently started and stopped.
kernel_vmstat
¶
Kernel virtual memory statistics gathered via /proc/vmstat. Visit the InfluxDB Telegraf plugin documentation for more details.
Relevant terms
- Active pages: Pages currently in use or recently used.
- Inactive pages: Pages not recently used, and therefore more likely to be moved to swap space or reclaimed.
- Anonymous pages: Memory pages not backed by a file on disk; typically used for data that does not need to be persisted, such as program stacks.
- Bounce buffer: Temporary memory used to facilitate data transfers between devices that cannot directly address each other’s memory.
- Compaction: The process of rearranging pages in memory to create larger contiguous free spaces, often useful for allocating huge pages.
- Dirty pages: Pages that have been modified in memory but have not yet been written back to disk.
- Evict: The process of removing pages from physical memory, either by moving them to disk (swapping out) or discarding them if they are no longer needed.
- File-backed pages: Memory pages that are associated with files on the disk, such as executable files or data files.
- Free pages: Memory pages that are available for use and not currently allocated to any process or data.
- Huge pages: Large memory pages that can be used by processes, reducing the overhead of page tables.
- Interleave: The process of distributing memory pages across different memory nodes or zones, typically to optimize performance in systems with non-uniform memory access (NUMA).
- NUMA (non-uniform memory access): A memory design where a processor accesses its own local memory faster than non-local memory.
- Page allocation: The process of assigning free memory pages to fulfill a request by a process or the kernel.
- Page fault: An event that occurs when a program tries to access a page that is not in physical memory, requiring the OS to handle this by allocating a page or retrieving it from disk.
- Page table: Data structure used by the operating system to store the mapping between virtual addresses and physical memory addresses.
- Shared memory (shmem): Memory that can be accessed by multiple processes.
- Slab pages: Memory pages used by the kernel to store objects of fixed sizes, such as file structures or inode caches.
- Swap space: A space on the disk used to store memory pages that have been evicted from physical memory.
- THP (transparent huge pages): A feature that automatically manages the allocation of huge pages to improve performance without requiring changes to applications.
- Vmscan: A kernel process that scans memory pages and decides which pages to evict or swap out based on their usage.
- Writeback: The process of writing dirty pages back to disk.
Tags: node_id
- nr_free_pages: Number of free pages in the system.
- nr_inactive_anon: Number of inactive anonymous pages.
- nr_active_anon: Number of active anonymous pages.
- nr_inactive_file: Number of inactive file-backed pages.
- nr_active_file: Number of active file-backed pages.
- nr_unevictable: Number of pages that cannot be evicted from memory.
- nr_mlock: Number of pages locked into memory (mlock).
- nr_anon_pages: Number of anonymous pages.
- nr_mapped: Number of pages mapped into userspace.
- nr_file_pages: Number of file-backed pages.
- nr_dirty: Number of pages currently dirty.
- nr_writeback: Number of pages under writeback.
- nr_slab_reclaimable: Number of reclaimable slab pages.
- nr_slab_unreclaimable: Number of unreclaimable slab pages.
- nr_page_table_pages: Number of pages used for page tables.
- nr_kernel_stack: Amount of kernel stack pages.
- nr_unstable: Number of unstable pages.
- nr_bounce: Number of bounce buffer pages.
- nr_vmscan_write: Number of pages written by vmscan.
- nr_writeback_temp: Number of temporary writeback pages.
- nr_isolated_anon: Number of isolated anonymous pages.
- nr_isolated_file: Number of isolated file pages.
- nr_shmem: Number of shared memory pages.
- numa_hit: Number of pages allocated in the preferred node.
- numa_miss: Number of pages allocated in a non-preferred node.
- numa_foreign: Number of pages intended for another node.
- numa_interleave: Number of interleaved hit pages.
- numa_local: Number of pages allocated on the local node.
- numa_other: Number of pages allocated on other nodes.
- nr_anon_transparent_hugepages: Number of anonymous transparent huge pages.
- pgpgin: Number of kilobytes read from disk.
- pgpgout: Number of kilobytes written to disk.
- pswpin: Number of pages swapped in.
- pswpout: Number of pages swapped out.
- pgalloc_dma: Number of DMA zone pages allocated.
- pgalloc_dma32: Number of DMA32 zone pages allocated.
- pgalloc_normal: Number of normal zone pages allocated.
- pgalloc_movable: Number of movable zone pages allocated.
- pgfree: Number of pages freed.
- pgactivate: Number of inactive pages activated.
- pgdeactivate: Number of active pages deactivated.
- pgfault: Number of page faults.
- pgmajfault: Number of major page faults.
- pgrefill_dma: Number of DMA zone pages refilled.
- pgrefill_dma32: Number of DMA32 zone pages refilled.
- pgrefill_normal: Number of normal zone pages refilled.
- pgrefill_movable: Number of movable zone pages refilled.
- pgsteal_dma: Number of DMA zone pages reclaimed.
- pgsteal_dma32: Number of DMA32 zone pages reclaimed.
- pgsteal_normal: Number of normal zone pages reclaimed.
- pgsteal_movable: Number of movable zone pages reclaimed.
- pgscan_kswapd_dma: Number of DMA zone pages scanned by kswapd.
- pgscan_kswapd_dma32: Number of DMA32 zone pages scanned by kswapd.
- pgscan_kswapd_normal: Number of normal zone pages scanned by kswapd.
- pgscan_kswapd_movable: Number of movable zone pages scanned by kswapd.
- pgscan_direct_dma: Number of DMA zone pages directly scanned.
- pgscan_direct_dma32: Number of DMA32 zone pages directly scanned.
- pgscan_direct_normal: Number of normal zone pages directly scanned.
- pgscan_direct_movable: Number of movable zone pages directly scanned.
- zone_reclaim_failed: Number of failed zone reclaim attempts.
- pginodesteal: Number of inode pages reclaimed.
- slabs_scanned: Number of slab pages scanned.
- kswapd_steal: Number of pages reclaimed by kswapd.
- kswapd_inodesteal: Number of inode pages reclaimed by kswapd.
- kswapd_low_wmark_hit_quickly: Frequency of kswapd hitting low watermark quickly.
- kswapd_high_wmark_hit_quickly: Frequency of kswapd hitting high watermark quickly.
- kswapd_skip_congestion_wait: Number of times kswapd skipped wait due to congestion.
- pageoutrun: Number of pageout pages processed.
- allocstall: Number of times page allocation stalls.
- pgrotated: Number of pages rotated.
- compact_blocks_moved: Number of blocks moved during compaction.
- compact_pages_moved: Number of pages moved during compaction.
- compact_pagemigrate_failed: Number of page migrations failed during compaction.
- compact_stall: Number of stalls during compaction.
- compact_fail: Number of compaction failures.
- compact_success: Number of successful compactions.
- htlb_buddy_alloc_success: Number of successful HTLB buddy allocations.
- htlb_buddy_alloc_fail: Number of failed HTLB buddy allocations.
- unevictable_pgs_culled: Number of unevictable pages culled.
- unevictable_pgs_scanned: Number of unevictable pages scanned.
- unevictable_pgs_rescued: Number of unevictable pages rescued.
- unevictable_pgs_mlocked: Number of unevictable pages mlocked.
- unevictable_pgs_munlocked: Number of unevictable pages munlocked.
- unevictable_pgs_cleared: Number of unevictable pages cleared.
- unevictable_pgs_stranded: Number of unevictable pages stranded.
- unevictable_pgs_mlockfreed: Number of mlock-freed unevictable pages.
- thp_fault_alloc: Number of times a fault caused THP allocation.
- thp_fault_fallback: Number of times a fault fell back from THP.
- thp_collapse_alloc: Number of THP collapses allocated.
- thp_collapse_alloc_failed: Number of failed THP collapse allocations.
- thp_split: Number of THP splits.
Tenant metrics¶
You can investigate the health and status of microservices on a tenant-specific basis if you have multiple LogMan.io tenants in your system. Tenant metrics are specific to LogMan.io Parser, Dispatcher, Correlator, and Watcher microservices.
Naming and tags in Grafana and InfluxDB
- Tenant metrics groups are under the measurement tag.
- Tenant metrics are produced for select microservices (tag appclass) and can be further filtered with the additional tags host and pipeline.
- Each individual metric (for example, eps.in) is a value in the field tag.
The tags are pipeline (ID of the pipeline), host (hostname of the microservice) and tenant (the lowercase name of the tenant). Visit the Pipeline metrics page for more in-depth explanations and guides for interpreting each metric.
bspump.pipeline.tenant.eps
¶
A counter metric with the following values, updated once per minute:
- eps.in: The tenant's events per second entering the pipeline.
- eps.aggr: The tenant's aggregated events per second entering the pipeline (the number is multiplied by the cnt attribute in events).
- eps.drop: The tenant's events per second dropped in the pipeline.
- eps.out: The tenant's events per second successfully leaving the pipeline.
- warning: The number of warnings produced in the pipeline for the tenant in the specified time interval.
- error: The number of errors produced in the pipeline for the tenant in the specified time interval.
In LogMan.io Parser, the most relevant metrics come from ParsersPipeline
(when the data first enters the Parser and gets parsed via preprocessors and parsers) and EnrichersPipeline
. In LogMan.io Dispatcher, the most relevant metrics come from EventsPipeline
and OthersPipeline
.
bspump.pipeline.tenant.load
¶
A counter metric with the following values, updated once per minute:
- load.in: The tenant's byte size of all events entering the pipeline in the specified time interval.
- load.out: The tenant's byte size of all events leaving the pipeline in the specified time interval.
Correlator metrics¶
The following metrics are specific to the LogMan.io Correlator. Detections (also known as correlation rules) run on the Correlator microservice.
Naming and tags in Grafana and InfluxDB
- Correlator metrics groups are under the measurement tag.
- Correlator metrics are only produced for the Correlator microservice (tag appclass) and can be further filtered with the additional tags correlator (to isolate a single correlator) and host.
- Each individual metric (for example, in) is a value in the field tag.
correlator.predicate
¶
A counter metric that counts how many events went through the predicate section, or filter, of a detection. Each metric updates once per minute, so the time interval refers to a period of about one minute.
- in: Number of events entering the predicate in the time interval.
- hit: Number of events successfully matching the predicate (fulfilling the conditions of the filter) in the time interval.
- miss: Number of events missing the predicate (not fulfilling the conditions of the filter) in the time interval and thus leaving the Correlator.
- error: Number of errors in the predicate in the time interval.
correlator.trigger
¶
A counter metric that counts how many events went through the trigger section of the correlator. The trigger defines and carries out an action. Each metric updates once per minute, so the time interval refers to a period of about one minute.
- in: Number of events entering the trigger in the time interval.
- out: Number of events leaving the trigger in the time interval.
- error: Number of errors in the trigger in the time interval; should be equal to in minus out.
Ended: Metrics
Ended: System monitoring
Ended: Administration Manual
Reference ↵
TeskaLabs LogMan.io Reference¶
Welcome to the Reference Guide. You can find definitions and details of every LogMan.io component here.
Collector ↵
LogMan.io Collector¶
TeskaLabs LogMan.io Collector is a microservice responsible for collecting logs and other events from various inputs and sending them to LogMan.io Receiver.
- Before you proceed, see Configuration for setup instructions.
- For the setup of event collection from various log sources, see the Log sources subtopic.
- For the detailed configuration options, see Inputs, Transformations and Outputs.
- To mock logs, see Mirage.
- For the communication details between Collector and Receiver, see LogMan.io Receiver documentation.
LogMan.io Collector configuration¶
LogMan.io Collector configuration typically consists of two files.
- Collector configuration (/conf/lmio-collector.conf, INI format) specifies the path for pipeline configuration(s) and possibly other application-level configuration options.
- Pipeline configuration (/conf/lmio-collector.yaml, YAML format) specifies from which inputs the data is collected (inputs), how the data is transformed (transforms), and how the data is sent further (outputs).
Collector configuration¶
[config]
path=/conf/lmio-collector.yaml
Pipeline configuration¶
Pipeline configuration is in a YAML format. Multiple pipelines can be configured in the same pipeline configuration file.
Every section represents one component of the pipeline. It always starts with either input:
, transform:
, output:
or connection:
and has the form:
input|transform|output:<TYPE>:<ID>
where <TYPE>
determines the component type. <ID>
is used for reference and can be chosen in any way.
- Input specifies a source/input of logs.
- Output specifies where to ship logs.
- Connection specifies a connection that can be used by an output.
- Transform specifies a transformation action to be applied to logs (optional).
Typical pipeline configuration for LogMan.io Receiver:
# Connection to LogMan.io (central part)
connection:CommLink:commlink:
url: https://recv.logman.example.com/
# Input
input:Datagram:udp-10002-src:
address: 0.0.0.0 10002
output: udp-10002
# Output
output:CommLink:udp-10002: {}
For the detailed configuration options of each component, see Inputs, Transformations and Outputs chapters. See LogMan.io Receiver documentation for the CommLink connection details.
Docker Compose¶
version: '3'
services:
lmio-collector:
image: docker.teskalabs.com/lmio/lmio-collector
container_name: lmio-collector
volumes:
- ./lmio-collector/conf:/conf
- ./lmio-collector/var:/app/lmio-collector/var
network_mode: host
restart: always
LogMan.io Collector Inputs¶
Note
This chapter concerns setup for log sources collected over network, syslog, files, databases, etc. For the setup of event collection from various log sources, see the Log sources subtopic.
Network¶
Sections: input:TCP
, input:Stream
, input:UDP
, input:Datagram
These inputs listen on a given address using TCP, UDP or Unix Socket.
Tip
Logs should be collected over the TCP protocol. Use UDP only if TCP is not possible.
The configuration options for listening:
address: # Specify IPv4, IPv6 or UNIX file path to listen from
output: # Which output to send the incoming events to
Here are the possible forms of address:
- 8080 or *:8080: Listen on port 8080 on all available network interfaces, IPv4 and IPv6.
- 0.0.0.0:8080: Listen on port 8080 on all available IPv4 network interfaces.
- :::8080: Listen on port 8080 on all available IPv6 network interfaces.
- 1.2.3.4:8080: Listen on port 8080 on a specific IPv4 network interface (1.2.3.4).
- ::1:8080: Listen on port 8080 on a specific IPv6 network interface (::1).
- /tmp/unix.sock: Listen on the UNIX socket /tmp/unix.sock.
The following configuration options are available only for input:Datagram
:
max_packet_size: # (optional) Specify the maximum size of packets in bytes (default: 65536)
receiver_buffer_size: # (optional) Limit the receiver size of the buffer in bytes (default: 0)
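To complement the datagram (UDP) example in the Configuration chapter, a minimal sketch of a stream (TCP) input follows. The port number and the output name are hypothetical, and the CommLink output assumes a connection:CommLink:commlink: section as shown earlier.

# Hypothetical example: listen for TCP streams on port 10008
input:TCP:tcp-10008-src:
  address: 0.0.0.0:10008
  output: tcp-10008

# Forward collected events to LogMan.io Receiver via the CommLink connection
output:CommLink:tcp-10008: {}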
Warning
LogMan.io Collector runs inside a Docker container. Propagation of network ports must be enabled like this:
services:
lmio-collector-tenant:
network_mode: host
Note
TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are both protocols used for sending data over the network.
TCP is a Stream, as it provides a reliable, ordered, and error-checked delivery of a stream of data.
In contrast, UDP is a datagram protocol that sends packets independently, allowing faster transmission but with less reliability and no guarantee of order, much like individual, unrelated messages.
Tip
For troubleshooting, use tcpdump
to capture raw network traffic and then use Wireshark for deeper analysis.
An example of capturing the traffic on TCP port 10008:
$ sudo tcpdump -i any tcp port 10008 -s 0 -w /tmp/capture.pcap -v
When enough traffic is captured, press Ctrl-C and collect the file /tmp/capture.pcap
that contains the traffic capture.
This file can be opened in Wireshark.
Syslog¶
Sections: input:TCPBSDSyslogRFC6587
, input:TCPBSDSyslogNoFraming
Special cases of TCP input for parsing SysLog via TCP. For more information, see RFC 6587 and RFC 3164, section 4.1.1
The configuration options for listening on a given path:
address: # Specify IPv4, IPv6 or UNIX file path to listen from (f. e. 127.0.0.1:8888 or /data/mysocket)
output: # Which output to send the incoming events to
The following configuration options are available only for input:TCPBSDSyslogRFC6587
:
max_sane_msg_len: # (optional) Maximum size in bytes of SysLog message to be received (default: 10000)
The following configuration options are available only for input:TCPBSDSyslogNoFraming
:
buffer_size: # (optional) Maximum size in bytes of SysLog message to be received (default: 64 * 1024)
variant: # (optional) The variant of SysLog format of the incoming message, can be `auto`, `nopri` with no PRI number in the beginning and `standard` with PRI (default: auto)
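As an illustration, the following sketch listens for syslog messages framed per RFC 6587 on TCP port 10010. The port and the output name are hypothetical, and the CommLink output assumes the connection:CommLink:commlink: section from the Configuration chapter.

input:TCPBSDSyslogRFC6587:syslog-10010-src:
  address: 0.0.0.0:10010
  output: syslog-10010

output:CommLink:syslog-10010: {}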
Subprocess¶
Section: input:SubProcess
The SubProcess input runs a command as a subprocess of the LogMan.io collector, while
periodically checking for its output at stdout
(lines) and stderr
.
The configuration options include:
command: # Specify the command to be run as subprocess (f. e. tail -f /data/tail.log)
output: # Which output to send the incoming events to
line_len_limit: # (optional) The length limit of one read line (default: 1048576)
ok_return_codes: # (optional) Which return codes signify the running status of the command (default: 0)
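A minimal sketch of a SubProcess input follows; the tailed file path and the output name are hypothetical, and the CommLink output assumes a configured connection as in the Configuration chapter.

# Run `tail -f` as a subprocess and forward each printed line as an event
input:SubProcess:tail-auth-log:
  command: tail -f /var/log/auth.log
  output: auth-log

output:CommLink:auth-log: {}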
File tailing¶
Section: input:SmartFile
Smart File Input is used for collecting events from multiple files whose content may be dynamically modified, or which may be deleted altogether by another process, similarly to the tail -f shell command.
Smart File Input creates a monitored file object for every file path specified in the configuration in the path option.
The monitored file periodically checks for new lines in the file; when one occurs, the line is read as bytes and passed further to the pipeline, including meta information such as the file name and extracted parts of the file path (see the Extract parameters section).
Various protocols are used for reading from different log file formats:
- Line Protocol for line-oriented log files
- XML Protocol for XML-oriented log files
- W3C Extended Log File Protocol for log files in W3C Extended Log File Format
- W3C DHCP Server Protocol for DHCP Server log files
Required configuration options:
input:SmartFile:MyFile:
path: | # File paths separated by newlines
/first/path/to/log/files/*.log
/second/path/to/log/files/*.log
/another/path/*s
protocol: # Protocol to be used for reading
Optional configuration options:
recursive: # Recursive scanning of specified paths (default: True)
scan_period: # File scan period in seconds (default: 3 seconds)
preserve_newline: # Preserve new line character in the output (default: False)
last_position_storage: # Persistent storage for the current positions in read files (default: ./var/last_position_storage)
Tip
In a more complex setup, such as extracting logs from a Windows shared folder, you can use rsync to synchronize logs from the shared folder to a local folder on the collector machine. Then Smart File Input reads logs from the local folder.
Warning
Internally, the current position in the file is stored in the last position storage in a position variable. If the last position storage file is deleted or not specified, all files are read all over again after the LogMan.io Collector restarts; in other words, without persistence, reading is reset on restart.
You can configure path for last position storage:
last_position_storage: "./var/last_position_storage"
Warning
If the file size is lower than the previously remembered file size, the file is read again as a whole and sent to the pipeline split into lines.
File paths¶
File path globs are separated by newlines. They can contain wildcards (such as *, **
, etc.).
path: |
/first/path/*.log
/second/path/*.log
/another/path/*
By default, files are read recursively. You can disable recursive reading with:
recursive: False
Line Protocol¶
protocol: line
line/C_separator: # (optional) Character used for line separator. Default: '\n'.
Line Protocol is used for reading messages from line-oriented log files.
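For example, a SmartFile input reading line-oriented log files could look like the sketch below; the path and the output name are hypothetical.

input:SmartFile:LinuxLogFiles:
  path: |
    /var/log/app/*.log
  protocol: line
  output: app-logs

output:CommLink:app-logs: {}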
XML Protocol¶
protocol: xml
tag_separator: '</msg>' # (required) Tag for separator.
XML Protocol is used for reading messages from XML-oriented log files.
Parameter tag_separator
must be included in configuration.
Example
Example of XML log file:
...
<msg time='2024-04-16T05:47:39.814+02:00' org_id='orgid'>
<txt>Log message 1</txt>
</msg>
<msg time='2024-04-16T05:47:42.814+02:00' org_id='orgid'>
<txt>Log message 2</txt>
</msg>
<msg time='2024-04-16T05:47:43.018+02:00' org_id='orgid'>
<txt>Log message 3</txt>
</msg>
...
Example configuration:
input:SmartFile:Alert:
path: /xml-logs/*.xml
protocol: xml
tag_separator: "</msg>"
W3C Extended Log File Protocol¶
protocol: w3c_extended
W3C Extended Log File Protocol is used for collecting events from files in W3C Extended Log File Format and serializing them into JSON format.
Example of event collection from Microsoft Exchange Server
LogMan.io Collector Configuration example:
input:SmartFile:MSExchange:
path: /MicrosoftExchangeServer/*.log
protocol: w3c_extended
extract_source: file_path
extract_regex: ^(?P<file_path>.*)$
Example of log file content:
#Software: Microsoft Exchange Server
#Version: 15.02.1544.004
#Log-type: DNS log
#Date: 2024-04-14T00:02:48.540Z
#Fields: Timestamp,EventId,RequestId,Data
2024-04-14T00:02:38.254Z,,9666704,"SendToServer 122.120.99.11(1), AAAA exchange.bradavice.cz, (query id:46955)"
2024-04-14T00:02:38.254Z,,7204389,"SendToServer 122.120.99.11(1), AAAA exchange.bradavice.cz, (query id:11737)"
2024-04-14T00:02:38.254Z,,43150675,"Send completed. Error=Success; Details=id=46955; query=AAAA exchange.bradavice.cz; retryCount=0"
...
W3C DHCP Server Format¶
protocol: w3c_dhcp
W3C DHCP Protocol is used for collecting events from DHCP Server log files. It is very similar to W3C Extended Log File Format with the difference in log file header.
Table of W3C DHCP events identification
Event ID | Meaning |
---|---|
00 | The log was started. |
01 | The log was stopped. |
02 | The log was temporarily paused due to low disk space. |
10 | A new IP address was leased to a client. |
11 | A lease was renewed by a client. |
12 | A lease was released by a client. |
13 | An IP address was found to be in use on the network. |
14 | A lease request could not be satisfied because the scope's address pool was exhausted. |
15 | A lease was denied. |
16 | A lease was deleted. |
17 | A lease expired and DNS records for the expired lease have not been deleted. |
18 | A lease was expired and DNS records were deleted. |
20 | A BOOTP address was leased to a client. |
21 | A dynamic BOOTP address was leased to a client. |
22 | A BOOTP request could not be satisfied because the scope's address pool for BOOTP was exhausted. |
23 | A BOOTP IP address was deleted after checking to see it was not in use. |
24 | IP address cleanup operation has begun. |
25 | IP address cleanup statistics. |
30 | DNS update request to the named DNS server. |
31 | DNS update failed. |
32 | DNS update successful. |
33 | Packet dropped due to NAP policy. |
34 | DNS update request failed as the DNS update request queue limit exceeded. |
35 | DNS update request failed. |
36 | Packet dropped because the server is in failover standby role or the hash of the client ID does not match. |
50+ | Codes above 50 are used for Rogue Server Detection information. |
Example of event collection from DHCP Server
LogMan.io Collector Configuration example:
input:SmartFile:DHCP-Server-Input:
path: /DHCPServer/*.log
protocol: w3c_dhcp
extract_source: file_path
extract_regex: ^(?P<file_path>.*)$
Example of DHCP Server log file content:
DHCP Service Activity Log
Event ID Meaning
00 The log was started.
01 The log was stopped.
...
50+ Codes above 50 are used for Rogue Server Detection information.
ID,Date,Time,Description,IP Address,Host Name,MAC Address,User Name, TransactionID, ...
24,04/16/24,00:00:21,Database Cleanup Begin,,,,,0,6,,,,,,,,,0
24,04/16/24,00:00:22,Database Cleanup Begin,,,,,0,6,,,,,,,,,0
...
For instance, the ignore_older_than limit (see Ignore old changes below) for files being read can be set to ignore_older_than: 20d or ignore_older_than: 100s.
Extract parameters¶
There are also options for extracting information from the file name or file path using a regular expression.
The extracted parts are then stored as metadata (which implicitly includes a unique meta ID and the file name).
The configuration options start with extract_
prefix and include the following:
extract_source: # (optional) file_name or file_path (default: file_path)
extract_regex: # (optional) regex to extract field names from the extract source (disabled by default)
The extract_regex
must contain named groups. The group names are used as field keys for the extracted information.
Unnamed groups produce no data.
Example of extracting metadata from regex
Collecting from a file /data/myserver.xyz/tenant-1.log
The following configuration:
extract_regex: ^/data/(?P<dvchost>\w+)/(?P<tenant>\w+)\.log$
will produce metadata:
{
"meta": {
"dvchost": "myserver.xyz",
"tenant": "tenant-1"
}
}
The following is a working example of a SmartFile input configuration with extraction of attributes from the file name using a regex, and an associated File output:
input:SmartFile:SmartFileInput:
path: ./etc/tail.log
extract_source: file_name
extract_regex: ^(?P<dvchost>\w+).log$
output: FileOutput
output:File:FileOutput:
path: /data/my_path.txt
prepend_meta: true
debug: true
Prepending information¶
prepend_meta: true
Prepends the meta information, such as the extracted field names, to the log line/event as key-value pairs separated by spaces.
Ignore old changes¶
The following configuration option checks that the modification time of files being read is not older than the specified limit.
ignore_older_than: # (optional) Limit in days, hours, minutes or seconds to read only files modified after the limit (default: "", f. e. "1d", "1h", "1m", "1s")
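For instance, a SmartFile input that skips files untouched for more than a week might look like this sketch; the path, the limit value and the output name are illustrative.

input:SmartFile:RecentLogsOnly:
  path: |
    /data/logs/*.log
  protocol: line
  ignore_older_than: 7d   # read only files modified within the last 7 days
  output: recent-logs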
File¶
Section: input:File
, input:FileBlock
, input:XML
These inputs read specified files by lines (input:File
) or as a whole block (input:FileBlock
, input:XML
)
and pass their content further to the pipeline.
Depending on the mode, the files may then be renamed to <FILE_NAME>-processed; if more files are specified using a wildcard, the next file will be opened, read, and processed in the same way.
The available configuration options for opening, reading and processing the files include:
path: # Specify the file path(s), wildcards can be used as well (f. e. /data/lines/*)
chilldown_period: # If more files or wildcard is used in the path, specify how often in seconds to check for new files (default: 5)
output: # Which output to send the incoming events to
mode: # (optional) The mode by which the file is going to be read (default: 'rb')
newline: # (optional) File line separator (default is value of os.linesep)
post: # (optional) Specifies what should happen with the file after reading - delete (delete the file), noop (no renaming), move (rename to `<FILE_NAME>-processed`, default)
exclude: # (optional) Path of filenames that should be excluded (has precedence over 'include')
include: # (optional) Path of filenames that should be included
encoding: # (optional) Charset encoding of the file's content
move_destination: # (optional) Destination folder for post 'move', make sure it is outside of the path specified above
lines_per_event: # (optional) The number of lines after which the read method enters the idle state to allow other operations to perform their tasks (default: 10000)
event_idle_time: # (optional) The time in seconds for which the read method enters the idle state, see above (default: 0.01)
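A possible File input sketch is shown below; the paths, the post action and the output name are illustrative, and the CommLink output assumes a configured connection as in the Configuration chapter.

# Read exported CSV files line by line and delete them after processing
input:File:ExportedCSV:
  path: /data/export/*.csv
  post: delete
  output: csv-export

output:CommLink:csv-export: {}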
ODBC¶
Section: input:ODBC
Provides input via an ODBC driver connection to collect logs from various databases.
Configuration options related to the connection establishment:
host: # Hostname of the database server
port: # Port where the database server is running
user: # Username to log in to the database server (usually a technical/access account)
password: # Password for the user specified above
driver: # Pre-installed ODBC driver (see list below)
db: # Name of the database to access
connect_timeout: # (optional) Connection timeout in seconds for the ODBC pool (default: 1)
reconnect_delay: # (optional) Reconnection delay in seconds after timeout for the ODBC pool (default: 5.0)
output_queue_max_size: # (optional) Maximum size of the output queue, i. e. in-memory storage (default: 10)
max_bulk_size: # (optional) Maximum size of one bulk composed of the incoming records (default 2)
output: # Which output to send the incoming events to
Configuration options related to querying the database:
query: # Query to periodically call the database
chilldown_period: # Specify in seconds how often the query above will be called (default: 5)
last_value_enabled: # Enable the last value duplicate check (true/false)
last_value_table: # Specify table for SELECT max({}) from {};
last_value_column: # The column in the query used to obtain the last value
last_value_storage: # Persistent storage for the current last value (default: ./var/last_value_storage)
last_value_query: # (optional) To specify the last value query entirely (in case this option is set, last_value_table will not be considered)
last_value_start: # (optional) The first value to start from (default: 0)
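A hypothetical ODBC input sketch follows. The server address, credentials, driver name and query are assumptions and must be adapted to the actual database; the CommLink output assumes a configured connection as in the Configuration chapter.

input:ODBC:AuditDatabase:
  host: db.internal.example.com             # hypothetical database server
  port: 1433
  user: logreader
  password: <PASSWORD>
  driver: "ODBC Driver 18 for SQL Server"   # assumes this ODBC driver is pre-installed
  db: audit
  query: SELECT * FROM audit_log;           # illustrative query
  chilldown_period: 30
  output: audit-db

output:CommLink:audit-db: {}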
Apache Kafka¶
Section: input:Kafka
This option is available from version v22.32
Creates a Kafka consumer for the specified topic(s).
Configuration options related to the connection establishment:
bootstrap_servers: # Kafka nodes to read messages from (such as `kafka1:9092,kafka2:9092,kafka3:9092`)
Configuration options related to the Kafka Consumer setting:
topic: # Name of the topics to read messages from (such as `lmio-events` or `^lmio.*`)
group_id: # Name of the consumer group (such as: `collector_kafka_consumer`)
refresh_topics: # (optional) If more topics matching the topic name are expected to be created during consumption, this option specifies in seconds how often to refresh the topic subscriptions (such as: `300`)
The bootstrap_servers, topic and group_id options are always required.
topic can be a name, a list of names separated by spaces, or a simple regex (to match all available topics, use ^.*).
For more configuration options, please refer to https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md
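A minimal Kafka input sketch, using the example values from the comments above; the broker addresses, topic and output name are illustrative, and the CommLink output assumes a configured connection as in the Configuration chapter.

input:Kafka:KafkaInput:
  bootstrap_servers: kafka1:9092,kafka2:9092,kafka3:9092
  topic: lmio-events
  group_id: collector_kafka_consumer
  output: kafka-events

output:CommLink:kafka-events: {}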
LogMan.io Collector Outputs¶
The collector output is specified as follows:
output:<output-type>:<output-name>:
debug: false
...
Common output options¶
In every output, meta information can be specified as a dictionary in the meta attribute.
meta:
my_meta_tag: my_meta_tag_value # (optional) Custom meta information, that will be later available in LogMan.io Parser in event's context
The tenant meta information can be specified directly in the output's config.
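For example, a WebSocket output carrying both the tenant and a custom meta tag might look like this sketch; the URL, tenant name and meta tag value are hypothetical.

output:WebSocket:WebSocketOutput:
  url: http://lmio.example.com:8080/ws
  tenant: mytenant
  meta:
    site: branch-office-01   # custom meta information, available later in LogMan.io Parser in the event's context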
Debugging¶
debug
(optional)
Specify whether to also write the output to the log for debugging.
Default: false
Prepend the meta information¶
prepend_meta
(optional)
Prepend the meta information to the incoming event as key-value pairs separated by spaces.
Default: false
Note
Meta information includes the file name or information extracted from it (in the case of Smart File input), custom-defined fields (see below), etc.
TCP Output¶
Outputs events over TCP to a server specified by the IP address and port.
output:TCP:<output-name>:
address: <IP address>:<Port>
...
Address¶
address
The server address consists of the IP address and the port.
Hint
IPv4 and IPv6 addresses are supported.
Maximum size of packets¶
max_packet_size
(optional)
Specify the maximum size of packets in bytes.
Default: 65536
Receiver size of the buffer¶
receiver_buffer_size
(optional)
Limit the receiver size of the buffer in bytes.
Default: 0
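A concrete sketch of a TCP output follows; the destination address is hypothetical.

output:TCP:forward-to-siem:
  address: 192.0.2.10:10514   # hypothetical downstream TCP server
  max_packet_size: 65536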
UDP Output¶
Outputs events over UDP to the specified IP address and port.
output:UDP:<output-name>:
address: <IP address>:<Port>
...
Address¶
address
The server address consists of the IP address and the port.
Hint
IPv4 and IPv6 addresses are supported.
Maximum size of packets¶
max_packet_size
(optional)
Specify the maximum size of packets in bytes.
Default: 65536
Receiver size of the buffer¶
receiver_buffer_size
(optional)
Limit the receiver size of the buffer in bytes.
Default: 0
WebSocket Output¶
Outputs events over WebSocket to a specified URL.
output:WebSocket:<output-name>:
url: <Server URL>
...
URL¶
url
Specify WebSocket destination URL. For example http://example.com/ws
Tenant¶
tenant
Name of the tenant the LogMan.io Collector collects for; the tenant name is forwarded to the LogMan.io Parser and added to the event.
Inactive time¶
inactive_time
(optional)
Specify inactive time in seconds, after which idle Web Sockets will be closed.
Default: 60
Output queue size¶
output_queue_max_size
(optional)
Specify the in-memory outgoing queue size for every WebSocket connection.
Path to store persistent files¶
buffer
(optional)
Path to store persistent files in when the WebSocket connection is offline.
SSL configuration options¶
The following configuration options specify the SSL (HTTPS) connection:
- cert: Path to the client SSL certificate
- key: Path to the private key of the client SSL certificate
- password: Private key file password (optional, default: none)
- cafile: Path to a PEM file with CA certificate(s) to verify the SSL server (optional, default: none)
- capath: Path to a directory with CA certificate(s) to verify the SSL server (optional, default: none)
- ciphers: SSL ciphers (optional, default: none)
- dh_params: Diffie–Hellman (D-H) key exchange (TLS) parameters (optional, default: none)
- verify_mode: One of CERT_NONE, CERT_OPTIONAL or CERT_REQUIRED (optional); for more information, see: github.com/TeskaLabs/asab
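Putting it together, a WebSocket output over HTTPS with client certificates might be sketched as follows; the URL, tenant name and all certificate paths are hypothetical.

output:WebSocket:SecureWebSocketOutput:
  url: https://lmio.example.com:8443/ws
  tenant: mytenant
  cert: /conf/ssl/collector-cert.pem
  key: /conf/ssl/collector-key.pem
  cafile: /conf/ssl/ca.pem
  verify_mode: CERT_REQUIRED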
File Output¶
Outputs events into a specified file.
output:File:<output-name>:
path: /data/output.log
...
Path¶
path
Path of the output file.
Hint
Make sure the location of the output file is accessible within the Docker container when using Docker.
Flags¶
flags
(optional)
One of O_CREAT and O_EXCL, where the first one tells the output to create the file if it does not exist.
Default: O_CREAT
Mode¶
mode
(optional)
The mode by which the file is going to be written to.
Default: ab
(append bytes).
Unix Socket (datagram)¶
Outputs events into a datagram-oriented Unix Domain Socket.
output:UnixSocket:<output-name>:
address: <path>
...
Address¶
address
The Unix socket file path, e.g. /data/myunix.socket
.
Maximum size of packets¶
max_packet_size
(optional)
Specify the maximum size of packets in bytes.
Default: 65536
Unix Socket (stream)¶
Outputs events into a stream-oriented Unix Domain Socket.
output:UnixStreamSocket:<output-name>:
address: <path>
...
Address¶
address
The Unix socket file path, e.g. /data/myunix.socket
.
Maximum size of packets¶
max_packet_size
(optional)
Specify the maximum size of packets in bytes.
Default: 65536
Print Output¶
Helper output that prints events to the terminal.
output:Print:<output-name>:
...
Null Output¶
Helper output that discards events.
output:Null:<output-name>:
...
Log sources ↵
Collecting events from Apache Kafka¶
TeskaLabs LogMan.io Collector is able to collect events from Apache Kafka, namely its topics. The events stored in Kafka may contain any data encoded in bytes, such as logs about various user, admin, system, device and policy actions.
Prerequisites¶
In order to create a Kafka consumer, the bootstrap_servers (that is, the location of the Kafka nodes) needs to be known, as well as the topic to read the data from.
LogMan.io Collector Configuration¶
The LogMan.io Collector provides the input:Kafka: input section, which needs to be specified in the YAML configuration. The configuration looks as follows:
input:Kafka:KafkaInput:
bootstrap_servers: <BOOTSTRAP_SERVERS>
topic: <TOPIC>
group_id: <GROUP_ID>
...
The input creates a Kafka consumer for the specific topic(s).
Configuration options related to the connection establishment:
bootstrap_servers: # Kafka nodes to read messages from (such as `kafka1:9092,kafka2:9092,kafka3:9092`)
Configuration options related to the Kafka Consumer setting:
topic: # Name of the topics to read messages from (such as `lmio-events` or `^lmio.*`)
group_id: # Name of the consumer group (such as: `collector_kafka_consumer`)
refresh_topics: # (optional) If more topics matching the topic name are expected to be created during consumption, this option specifies in seconds how often to refresh the topic subscriptions (such as: `300`)
Options bootstrap_servers, topic and group_id are always required!
topic can be a name, a list of names separated by spaces, or a simple regex (to match all available topics, use ^.*).
For more configuration options, please refer to librdkafka configuration guide.
Collecting events from Google Cloud PubSub¶
Info
This option is available from version v23.27
onwards.
TeskaLabs LogMan.io Collector can collect events from Google Cloud PubSub using a native asynchronous consumer.
Google Cloud PubSub Documentation
Google Cloud Pull Subscription Explanation
Prerequisites¶
In Pub/Sub, the following information needs to be gathered:
1) The name of the project the messages are to be consumed from
How to create a topic in a project
2) The subscription name created in the topic the messages are to be consumed from
How to create a PubSub subscription
3) A service account file with a private key to authorize access to the given topic and subscription
How to create a service account
LogMan.io Collector Input setup¶
Google Cloud PubSub Input¶
The input named input:GoogleCloudPubSub:
needs to be provided in the LogMan.io Collector YAML configuration:
input:GoogleCloudPubSub:GoogleCloudPubSub:
subscription_name: <NAME_OF_THE_SUBSCRIPTION_IN_THE_GIVEN_TOPIC>
project_name: <NAME_OF_THE_PROJECT_TO_CONSUME_FROM>
service_account_file: <PATH_TO_THE_SERVICE_ACCOUNT_FILE>
output: <OUTPUT>
<NAME_OF_THE_SUBSCRIPTION_IN_THE_GIVEN_TOPIC>, <NAME_OF_THE_PROJECT_TO_CONSUME_FROM> and <PATH_TO_THE_SERVICE_ACCOUNT_FILE> must be provided from Google Cloud Pub/Sub.
The output is events as a byte stream with the following meta information: publish_time, message_id, project_name and subscription_name.
Commit¶
The commit/acknowledgement is done automatically after each individual bulk of messages is processed, so the same messages are not sent by PubSub repeatedly.
The default bulk is 5,000 messages and can be changed in the input configuration via max_messages
option:
max_messages: 10000
Collecting from Bitdefender¶
TeskaLabs LogMan.io can collect Bitdefender logs from requests made by Bitdefender as specified by the server API documentation.
LogMan.io Collector Configuration¶
On the LogMan.io server, where the logs are being forwarded to, run a LogMan.io Collector instance with the following configuration.
In the listen
section, set the appropriate port configured in the Log Forwarding in Bitdefender.
Bitdefender Server Configuration¶
input:Bitdefender:BitdefenderAPI:
listen: 0.0.0.0 <PORT_SET_IN_FORWARDING> ssl
cert: <PATH_TO_PEM_CERT>
key: <PATH_TO_PEM_KEY_CERT>
cafile: <PATH_TO_PEM_CA_CERT>
encoding: utf-8
output: <OUTPUT_ID>
output:xxxxxx:<OUTPUT_ID>:
...
Collecting from Cisco IOS based devices¶
This collecting method is designed to collect logs from Cisco products that operate IOS, such as the Cisco Catalyst 2960 switch or the Cisco ASR 9200 router.
Log configuration¶
Configure the remote address of a collector and the logging level:
CATALYST(config)# logging host <hostname or IP of the LogMan.io collector> transport tcp port <port-number>
CATALYST(config)# logging trap informational
CATALYST(config)# service timestamps log datetime year msec show-timezone
CATALYST(config)# logging origin-id <hostname>
The log format contains the following fields:
- timestamp in the UTC format with:
  - year, month, day
  - hour, minute, and second
  - millisecond
- hostname of the device
- log level set to informational
Example of the output
<189>36: CATALYST: Aug 22 2022 10:11:25.873 UTC: %SYS-5-CONFIG_I: Configured from console by admin on vty0 (10.0.0.44)
Time synchronization¶
It is important that Cisco device time is synchronized using NTP.
Prerequisites are:
- Internet connection (if you are using a public NTP server)
- Configured name-server option (for DNS query resolution)
LAB-CATALYST(config)# no clock timezone
LAB-CATALYST(config)# no ntp
LAB-CATALYST(config)# ntp server <hostname or IP of NTP server>
Example of the configuration with Google NTP server:
CATALYST(config)# no clock timezone
CATALYST(config)# no ntp
CATALYST(config)# do show ntp associations
%NTP is not enabled.
CATALYST(config)# ntp server time.google.com
CATALYST(config)# do show ntp associations
address ref clock st when poll reach delay offset disp
*~216.239.35.4 .GOOG. 1 58 64 377 15.2 0.58 0.4
* master (synced), # master (unsynced), + selected, - candidate, ~ configured
CATALYST(config)# do show clock
10:57:39.110 UTC Mon Aug 22 2022
Collecting from Citrix¶
TeskaLabs LogMan.io can collect Citrix logs using Syslog via log forwarding over TCP (recommended) or UDP communication.
Citrix ADC¶
If Citrix devices are being connected through ADC, there is the following guide on how to enable Syslog over TCP. Make sure you select the proper LogMan.io server and port to forward logs to.
F5 BIG-IP¶
If Citrix devices are connected to F5 BIG-IP, use the following guide. Make sure you select the proper LogMan.io server and port to forward logs to.
Configuring LogMan.io Collector¶
On the LogMan.io server, where the logs are being forwarded to, run a LogMan.io Collector instance with the following configuration.
Log Forwarding Via TCP¶
input:TCPBSDSyslogRFC6587:Citrix:
address: 0.0.0.0:<PORT_SET_IN_FORWARDING>
output: WebSocketOutput
output:WebSocket:WebSocketOutput:
url: http://<LMIO_SERVER>:<YOUR_PORT>/ws
tenant: <YOUR_TENANT>
debug: false
prepend_meta: false
Log Forwarding Via UDP¶
input:Datagram:Citrix:
address: 0.0.0.0:<PORT_SET_IN_FORWARDING>
output: WebSocketOutput
output:WebSocket:WebSocketOutput:
url: http://<LMIO_SERVER>:<YOUR_PORT>/ws
tenant: <YOUR_TENANT>
debug: false
prepend_meta: false
Collecting from Fortinet FortiGate¶
TeskaLabs LogMan.io can collect Fortinet FortiGate logs directly or through FortiAnalyzer via log forwarding over TCP (recommended) or UDP communication.
Forward logs to LogMan.io¶
Both in FortiGate and FortiAnalyzer, the Syslog
type must be selected along with the appropriate port.
For precise guides, see the following link:
LogMan.io Collector Configuration¶
On the LogMan.io server, where the logs are being forwarded to, run a LogMan.io Collector instance with the following configuration.
In the address
section, set the appropriate port configured in the Log Forwarding in FortiAnalyzer.
Log Forwarding Via TCP¶
input:TCPBSDSyslogRFC6587:Fortigate:
address: 0.0.0.0:<PORT_SET_IN_FORWARDING>
output: <OUTPUT_ID>
output:xxxxxxx:<OUTPUT_ID>:
...
Log Forwarding Via UDP¶
input:Datagram:Fortigate:
address: 0.0.0.0:<PORT_SET_IN_FORWARDING>
output: <OUTPUT_ID>
output:xxxxxxx:<OUTPUT_ID>:
...
Collecting events from Microsoft Azure Event Hub¶
This option is available from version v22.45
onwards
TeskaLabs LogMan.io Collector can collect events from Microsoft Azure Event Hub through a native client or Kafka. The events stored in Azure Event Hub may contain any data encoded in bytes, such as logs about various user, admin, system, device, and policy actions.
Microsoft Azure Event Hub Setting¶
The following credentials need to be obtained for LogMan.io Collector to read the events: connection string
, event hub name
and consumer group
.
Obtain connection string from Microsoft Azure Event Hub¶
1) Sign in to the Azure portal with admin privileges to the respective Azure Event Hubs Namespace.
The Azure Event Hubs Namespace is available in the Resources
section.
2) In the selected Azure Event Hubs Namespace, click on Shared access policies
in the Settings
section in the left menu.
Click on the Add button, enter the name of the policy (the recommended name is LogMan.io Collector), and a popup window with the policy details should appear on the right.
3) In the popup window, select the Listen
option to allow the policy to read from event hubs associated with the given namespace.
See the following picture.
4) Copy the Connection string-primary key
and click on Save
.
The policy should be visible in the table in the middle of the screen.
The connection string starts with Endpoint=sb://
prefix.
Obtain consumer group¶
5) In the Azure Event Hubs Namespace, select Event Hubs
option from the left menu.
6) Click on the event hub that contains events to be collected.
7) When in the event hub, click on the + Consumer group
button in the middle of the screen.
8) In the right popup window, enter the name of the consumer group (the recommended value is lmio_collector) and click on the Create button.
9) Repeat this procedure for all event hubs meant to be consumed.
10) Write down the consumer group's name and all event hubs for the eventual LogMan.io Collector configuration.
LogMan.io Collector Input setup¶
Azure Event Hub Input¶
The input named input:AzureEventHub:
needs to be provided in the LogMan.io Collector YAML configuration:
input:AzureEventHub:AzureEventHub:
connection_string: <CONNECTION_STRING>
eventhub_name: <EVENT_HUB_NAME>
consumer_group: <CONSUMER_GROUP>
output: <OUTPUT>
<CONNECTION_STRING>, <EVENT_HUB_NAME> and <CONSUMER_GROUP> are provided through the guide above.
The following meta options are available for the parser: azure_event_hub_offset
, azure_event_hub_sequence_number
, azure_event_hub_enqueued_time
, azure_event_hub_partition_id
, azure_event_hub_consumer_group
and azure_event_hub_eventhub_name
.
The output is events as a byte stream, similar to Kafka input.
Azure Monitor Through Event Hub Input¶
The Azure Monitor Through Event Hub Input loads events from Azure Event Hub, loads the Azure Monitor JSON log and breaks individual records into log lines, which are then sent to the defined output.
The input named input:AzureMonitorEventHub:
needs to be provided in the LogMan.io Collector YAML configuration:
input:AzureMonitorEventHub:AzureMonitorEventHub:
connection_string: <CONNECTION_STRING>
eventhub_name: <EVENT_HUB_NAME>
consumer_group: <CONSUMER_GROUP>
encoding: # default: utf-8
output: <OUTPUT>
<CONNECTION_STRING>, <EVENT_HUB_NAME> and <CONSUMER_GROUP> are provided through the guide above.
The following meta options are available for the parser: azure_event_hub_offset
, azure_event_hub_sequence_number
, azure_event_hub_enqueued_time
, azure_event_hub_partition_id
, azure_event_hub_consumer_group
and azure_event_hub_eventhub_name
.
The output is events as a byte stream, similar to Kafka input.
Alternative: Kafka Input¶
Azure Event Hub also provides a Kafka interface (excluding basic-tier users), so the standard LogMan.io Collector Kafka input can be used.
There are multiple authentication options in Kafka, including OAuth. However, for the purposes of this documentation and the reuse of the connection string, plain SASL authentication using the connection string from the guide above is preferred.
input:Kafka:KafkaInput:
bootstrap_servers: <NAMESPACE>.servicebus.windows.net:9093
topic: <EVENT_HUB_NAME>
group_id: <CONSUMER_GROUP>
security.protocol: SASL_SSL
sasl.mechanisms: PLAIN
sasl.username: "$ConnectionString"
sasl.password: <CONNECTION_STRING>
output: <OUTPUT>
<CONNECTION_STRING>, <EVENT_HUB_NAME> and <CONSUMER_GROUP> are provided through the guide above; <NAMESPACE> is the name of the Azure Event Hub resource (also mentioned in the guide above).
The following meta options are available for the parser: kafka_key
, kafka_headers
, _kafka_topic
, _kafka_partition
and _kafka_offset
.
The output is events as a byte stream.
Collecting logs from Microsoft 365¶
TeskaLabs LogMan.io can collect logs from Microsoft 365, formerly Microsoft Office 365.
There are the following classes of Microsoft 365 logs:
- Audit logs: They contain information about various user, admin, system, and policy actions and events from Azure Active Directory, Exchange and SharePoint.
- Message Trace: It provides insight into the e-mail traffic passing through the Microsoft Office 365 Exchange mail server.
Enable auditing of Microsoft 365¶
By default, audit logging is enabled for Microsoft 365 and Office 365 enterprise organizations. However, when setting up logging of a Microsoft 365 or Office 365 organization, you should verify the auditing status of Microsoft Office 365.
1) Go to https://compliance.microsoft.com/ and sign in
2) In the left navigation pane of the Microsoft 365 compliance center, click Audit
3) Click the Start recording user and admin activity banner
It may take up to 60 minutes for the change to take effect.
For more details, see Turn auditing on or off.
Configuration of Microsoft 365¶
Before you can collect logs from Microsoft 365, you must configure Microsoft 365. Be aware that configuration takes a significant amount of time.
1) Setup a subscription to Microsoft 365 and a subscription to Azure
You need a subscription to Microsoft 365 and a subscription to Azure that has been associated with your Microsoft 365 subscription.
You can use trial subscriptions to both Microsoft 365 and Azure to get started.
For more details, see Welcome to the Office 365 Developer Program.
2) Register your TeskaLabs LogMan.io collector in Azure AD
It allows you to establish an identity for TeskaLabs LogMan.io and assign specific permissions it needs to collect logs from Microsoft 365 API.
Sign in to the Azure portal, using the credential from your subscription to Microsoft 365 you wish to use.
3) Navigate to Azure Active Directory
4) On the Azure Active Directory page, select "App registrations" (1), and then select "New registration" (2)
5) Fill the registration form for TeskaLabs LogMan.io application
- Name: "TeskaLabs LogMan.io"
- Supported account types: "Account in this organizational directory only"
- Redirect URL: None
Press "Register" to complete the process.
6) Collect essential information
Store the following information from the registered application page in the Azure Portal:
- Application (client) ID, aka client_id
- Directory (tenant) ID, aka tenant_id
7) Create a client secret
The client secret is used for the safe authorization and access of TeskaLabs LogMan.io.
After the page for your app is displayed, select Certificates & secrets (1) in the left pane. Then select the "Client secrets" tab (2). On this tab, create a new client secret (3).
8) Fill in the information about a new client secret
- Description: "TeskaLabs LogMan.io Client Secret"
- Expires: 24 months
Press "Add" to continue.
9) Click the clipboard icon to copy the client secret value to the clipboard
Store the Value (not the Secret ID) for the configuration of TeskaLabs LogMan.io; it will be used as client_secret.
10) Specify the permissions for TeskaLabs LogMan.io to access the Microsoft 365 Management APIs
Go to App registrations > All applications in the Azure Portal and select "TeskaLabs LogMan.io".
11) Select API Permissions (1) in the left pane and then click Add a permission (2)
12) On the Microsoft APIs tab, select Microsoft 365 Management APIs
13) On the flyout page, select all types of permissions
- Delegated permissions
ActivityFeed.Read
ActivityFeed.ReadDlp
ServiceHealth.Read
- Application permissions
ActivityFeed.Read
ActivityFeed.ReadDlp
ServiceHealth.Read
Click "Add permissions" to finish.
14) Add "Microsoft Graph" permissions
- Delegated permissions
AuditLog.Read.All
- Application permissions
AuditLog.Read.All
Select "Microsoft Graph", "Delegated permissions", then seek and select "AuditLog.Read.All" in "Audit Log".
Then select again "Microsoft Graph", "Application permissions" then seek and select "AuditLog.Read.All" in "Audit Log".
15) Add "Office 365 Exchange online" permissions for collecting Message Trace reports
Click on "Add a permission" again.
Then go to "APIs my organization uses".
Type "Office 365 Exchange Online" to search bar.
Finally select "Office 365 Exchange Online" entry.
Select "Application permissions".
Type "ReportingWebService" into a search bar.
Check the "ReportingWebService.Read.All" select box.
Finally click on "Add permissions" button.
16) Grant admin consent
17) Navigate to Azure Active Directory
18) Navigate to Roles and administrators
19) Assign TeskaLabs LogMan.io to Global Reader role
Type "Global Reader" into a search bar.
Then click on "Global Reader" entry.
Select "Add assignments".
Type "TeskaLabs LogMan.io" into a search bar. Alternatively use "Application (client) ID" from previous steps.
Select "TeskaLabs LogMan.io" entry, the entry will appear in "Selected items".
Hit "Add" button.
Congratulations! Your Microsoft 365 is now ready for log collection.
Configuration of TeskaLabs LogMan.io¶
Example¶
connection:MSOffice365:MSOffice365Connection:
client_id: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
tenant_id: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
client_secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Collect Microsoft 365 Audit.General
input:MSOffice365:MSOffice365Source1:
connection: MSOffice365Connection
content_type: Audit.General
output: ms-office365-01
# Collect Microsoft 365 Audit.SharePoint
input:MSOffice365:MSOffice365Source2:
connection: MSOffice365Connection
content_type: Audit.SharePoint
output: ms-office365-01
# Collect Microsoft 365 Audit.Exchange
input:MSOffice365:MSOffice365Source3:
connection: MSOffice365Connection
content_type: Audit.Exchange
output: ms-office365-01
# Collect Microsoft 365 Audit.AzureActiveDirectory
input:MSOffice365:MSOffice365Source4:
connection: MSOffice365Connection
content_type: Audit.AzureActiveDirectory
output: ms-office365-01
# Collect Microsoft 365 DLP.All
input:MSOffice365:MSOffice365Source5:
connection: MSOffice365Connection
content_type: DLP.All
output: ms-office365-01
output:XXXXXX:ms-office365-01: {}
# Collect Microsoft 365 Message Trace logs
input:MSOffice365MessageTraceSource:MSOffice365MTSource1:
connection: MSOffice365Connection
output: ms-office365-message-trace-01
output:XXXXXX:ms-office365-message-trace-01: {}
Connection¶
The connection to Microsoft 365 must be configured first in the connection:MSOffice365:...
section.
connection:MSOffice365:MSOffice365Connection:
client_id: # Application (client) ID from Azure Portal
tenant_id: # Directory (tenant) ID from Azure Portal
client_secret: # Client secret value from Azure Portal
resources: # (optional) Resources to get data from, separated by a comma (,) (default: https://manage.office.com,https://outlook.office365.com)
Danger
Fields client_id, tenant_id and client_secret MUST be specified for a successful connection to Microsoft 365.
Collecting from Microsoft 365 activity logs¶
Configuration options to set up the collection for the Auditing logs (Audit.AzureActiveDirectory, Audit.SharePoint, Audit.Exchange, Audit.General and DLP.All):
input:MSOffice365:MSOffice365Source1:
connection: # ID of the MSOffice365 connection