Continuity Plan¶
Risk matrix¶
The risk matrix defines the level of risk by considering the "Likelihood" of an incident occurring against its "Impact". Both categories are given a score between 1 and 5. Multiplying the "Likelihood" and "Impact" scores together produces the total risk score.
Likelihood¶
| Likelihood | Score |
|---|---|
| Rare | 1 |
| Unlikely | 2 |
| Possible | 3 |
| Likely | 4 |
| Almost certain | 5 |
Impact¶
| Impact | Score | Description |
|---|---|---|
| Insignificant | 1 | The functionality is not impacted, performance is not reduced, downtime is not needed. |
| Minor | 2 | The functionality is not impacted, the performance is not reduced, downtime of the impacted cluster node is needed. |
| Moderate | 3 | The functionality is not impacted, the performance is reduced, downtime of the impacted cluster node is needed. |
| Major | 4 | The functionality is impacted, the performance is significantly reduced, downtime of the cluster is needed. |
| Catastrophic | 5 | Total loss of functionality. |
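To illustrate the scoring, a minimal sketch of the calculation follows. The mapping of total scores to the verbal risk levels used in the scenarios below is not defined by the tables, so the example only computes the numeric score.

```python
# Sketch of the risk score calculation defined above:
# total risk score = "Likelihood" score x "Impact" score.

LIKELIHOOD = {"Rare": 1, "Unlikely": 2, "Possible": 3, "Likely": 4, "Almost certain": 5}
IMPACT = {"Insignificant": 1, "Minor": 2, "Moderate": 3, "Major": 4, "Catastrophic": 5}


def risk_score(likelihood: str, impact: str) -> int:
    """Multiply the likelihood and impact scores together."""
    return LIKELIHOOD[likelihood] * IMPACT[impact]


# Example: the "Fast storage space shortage" scenario below
print(risk_score("Possible", "Moderate"))  # 3 * 3 = 9
```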
Incident scenarios¶
Complete system failure¶
Impact: Catastrophic (5)
Likelihood: Rare (1)
Risk level: medium-high
Risk mitigation:
- Geographically distributed cluster
- Active use of monitoring and alerting
- Prophylactic maintenance
- Strong cyber-security posture
Recovery:
- Contact the support and/or vendor and consult the strategy.
- Restore the hardware functionality.
- Restore the system from the backup of the site configuration.
- Restore the data from the offline backup (start with the most recent data and continue backward through the history).
Loss of the node in the cluster¶
Impact: Moderate (3)
Likelihood: Unlikely (2)
Risk level: medium-low
Risk mitigation:
- Geographically distributed cluster
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Contact the support and/or vendor and consult the strategy.
- Restore the hardware functionality.
- Restore the system from the backup of the site configuration.
- Restore the data from the offline backup (start with the most recent data and continue backward through the history).
Loss of the fast storage drive in one node of the cluster¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
Fast drives are in a RAID 1 array, so the loss of one drive is non-critical. Ensure quick replacement of the failed drive to prevent a second fast drive failure. A second fast drive failure will escalate to a "Loss of the node in the cluster".
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely replacement of the failed drive
Recovery:
- Turn off the impacted cluster node
- Replace failed fast storage drive ASAP
- Turn on the impacted cluster node
- Verify correct RAID1 array reconstruction
Note
Hot swap of the fast storage drive is supported upon specific customer request.
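A minimal sketch of the RAID 1 reconstruction check, assuming the node uses Linux software RAID (mdraid); installations with hardware RAID controllers need the respective vendor tooling instead.

```python
# Sketch: check the RAID rebuild state on a node, assuming Linux software
# RAID (mdraid). Hardware RAID controllers require vendor-specific tools.

def raid_status(mdstat_path: str = "/proc/mdstat") -> str:
    """Return the raw mdstat content: a 'recovery' line indicates an ongoing
    rebuild, '[UU]' indicates that both RAID 1 members are healthy."""
    with open(mdstat_path) as f:
        return f.read()


status = raid_status()
if "recovery" in status:
    print("RAID reconstruction still in progress")
elif "[UU]" in status:
    print("RAID 1 array fully reconstructed")
print(status)
```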
Fast storage space shortage¶
Impact: Moderate (3)
Likelihood: Possible (3)
Risk level: medium-high
This situation is problematic if it happens on multiple nodes of the cluster simultaneously. Use monitoring tools to identify this situation ahead of escalation.
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Remove unnecessary data from the fast storage space.
- Adjust the life cycle configuration so that the data are moved to slow storage space sooner.
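A minimal monitoring sketch for catching the shortage ahead of escalation; the mount point and the threshold are illustrative assumptions and must be adapted to the actual installation.

```python
import shutil

# Sketch: warn when the fast storage fills up. The mount point and the 80%
# threshold are illustrative assumptions, not values mandated by this plan.
FAST_STORAGE_MOUNT = "/data/ssd"  # hypothetical mount point
THRESHOLD = 0.80

usage = shutil.disk_usage(FAST_STORAGE_MOUNT)
used_ratio = usage.used / usage.total
if used_ratio > THRESHOLD:
    print(f"WARNING: fast storage at {used_ratio:.0%}; remove unnecessary data "
          "or tighten the life cycle configuration")
```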
Loss of the slow storage drive in one node of the cluster¶
Impact: Insignificant (1)
Likelihood: Likely (4)
Risk level: medium-low
Slow drives are in a RAID 5 or RAID 6 array, so the loss of one drive is non-critical. Ensure quick replacement of the failed drive to prevent another drive failure. A second drive failure in RAID 5 or a third drive failure in RAID 6 will escalate to a "Loss of the node in the cluster".
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely replacement of the failed drive
Recovery:
- Replace failed slow storage drive ASAP (hot swap)
- Verify correct slow storage RAID reconstruction
Slow storage space shortage¶
Impact: Moderate (3)
Likelihood: Likely (4)
Risk level: medium-high
This situation is problematic if it happens on multiple nodes of the cluster simultaneously. Use monitoring tools to identify this situation ahead of escalation.
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely extension of the slow data storage size
Recovery:
- Remove unnecessary data from the slow storage space.
- Adjust the life cycle configuration so that the data are removed from slow storage space sooner.
Loss of the system drive in one node of the cluster¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
System drives are in a RAID 1 array, so the loss of one drive is non-critical. Ensure quick replacement of the failed drive to prevent a second system drive failure. A second system drive failure will escalate to a "Loss of the node in the cluster".
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely replacement of the failed drive
Recovery:
- Replace the failed system drive ASAP (hot swap)
- Verify correct RAID1 array reconstruction
System storage space shortage¶
Impact: Moderate (3)
Likelihood: Rare (1)
Risk level: low
Use monitoring tools to identify this situation ahead of escalation.
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Remove unnecessary data from the system storage space.
- Contact the support or the vendor.
Loss of the network connectivity in one node of the cluster¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Redundant network connectivity
Recovery:
- Restore the network connectivity
- Verify the proper cluster operational condition
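A minimal sketch of the reachability check that could support the verification step above; the node hostnames and the port are assumptions.

```python
import socket

# Sketch: verify that the other cluster nodes are reachable again after the
# network connectivity is restored. Hostnames and the port are assumptions.
CLUSTER_NODES = ["lm01.example.com", "lm02.example.com", "lm03.example.com"]
PORT = 22  # any service port known to be open on every node

for node in CLUSTER_NODES:
    try:
        with socket.create_connection((node, PORT), timeout=5):
            print(f"{node}: reachable")
    except OSError as e:
        print(f"{node}: NOT reachable ({e})")
```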
Failure of the ElasticSearch cluster¶
Impact: Major (4)
Likelihood: Possible (3)
Risk level: medium-high
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating ElasticSearch cluster health
Recovery:
- Contact the support and/or vendor and consult the strategy.
Failure of the ElasticSearch node¶
Impact: Minor (2)
Likelihood: Likely (4)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating ElasticSearch cluster health
Recovery:
- Monitor the automatic rejoining of the ElasticSearch node to the cluster
- Contact the support and/or the vendor if the failure persists for several hours.
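The rejoining can be observed through the standard ElasticSearch `_cluster/health` API, as in the following sketch; the endpoint URL and the expected node count are assumptions, and secured clusters additionally require authentication.

```python
import json
import urllib.request

# Sketch: poll the standard ElasticSearch _cluster/health API to watch the
# failed node rejoin. The URL and the expected node count are assumptions.
ES_URL = "http://localhost:9200/_cluster/health"
EXPECTED_NODES = 3

with urllib.request.urlopen(ES_URL, timeout=10) as resp:
    health = json.load(resp)

print(f"status={health['status']}, nodes={health['number_of_nodes']}")
if health["status"] == "green" and health["number_of_nodes"] == EXPECTED_NODES:
    print("The node has rejoined and the cluster is fully recovered.")
```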
Failure of the Apache Kafka cluster¶
Impact: Major (4)
Likelihood: Rare (1)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache Kafka cluster health
Recovery:
- Contact the support and/or vendor and consult the strategy.
Failure of the Apache Kafka node¶
Impact: Minor (2)
Likelihood: Rare (1)
Risk level: low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache Kafka cluster health
Recovery:
- Monitor the automatic rejoining of the Apache Kafka node to the cluster
- Contact the support and/or the vendor if the failure persists for several hours.
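A minimal sketch of how the broker membership could be observed while waiting for the node to rejoin, assuming the confluent-kafka Python client is available; the bootstrap server address is an assumption.

```python
from confluent_kafka.admin import AdminClient

# Sketch: list the Kafka brokers currently visible in the cluster to observe
# the failed node rejoining. Assumes the confluent-kafka client; the
# bootstrap server address is an assumption.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})
metadata = admin.list_topics(timeout=10)

print(f"{len(metadata.brokers)} broker(s) visible:")
for broker_id, broker in metadata.brokers.items():
    print(f"  id={broker_id} {broker.host}:{broker.port}")
```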
Failure of the Apache ZooKeeper cluster¶
Impact: Major (4)
Likelihood: Rare (1)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache ZooKeeper cluster health
Recovery:
- Contact the support and/or vendor and consult the strategy.
Failure of the Apache ZooKeeper node¶
Impact: Insignificant (1)
Likelihood: Rare (1)
Risk level: low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
- Timely reaction to the deteriorating Apache ZooKeeper cluster health
Recovery:
- Monitor the automatic rejoining of the Apache ZooKeeper node to the cluster
- Contact the support and/or the vendor if the failure persists for several hours.
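A minimal sketch using the standard ZooKeeper four-letter `ruok` command to check each node; the host list is an assumption, and the command has to be allowed via `4lw.commands.whitelist` on the servers.

```python
import socket

# Sketch: query each ZooKeeper node with the standard 'ruok' four-letter
# command; a healthy node answers 'imok'. Hostnames are assumptions and the
# command must be whitelisted (4lw.commands.whitelist) on the servers.
ZK_NODES = [("zk1.example.com", 2181), ("zk2.example.com", 2181), ("zk3.example.com", 2181)]

for host, port in ZK_NODES:
    try:
        with socket.create_connection((host, port), timeout=5) as s:
            s.sendall(b"ruok")
            answer = s.recv(16).decode() or "no answer"
        print(f"{host}: {answer}")
    except OSError as e:
        print(f"{host}: NOT reachable ({e})")
```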
Failure of the stateless data path microservice (collector, parser, dispatcher, correlator, watcher)¶
Impact: Minor (2)
Likelihood: Possible (3)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Restart the failed microservice.
Failure of the stateless support microservice (all others)¶
Impact: Insignificant (1)
Likelihood: Possible (3)
Risk level: medium-low
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Restart the failed microservice.
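Assuming the microservices are deployed as Docker containers, a restart could look like the following sketch; the container name is hypothetical and has to be taken from the actual site configuration.

```python
import docker

# Sketch: restart a failed stateless microservice, assuming it runs as a
# Docker container. The container name is hypothetical.
client = docker.from_env()
container = client.containers.get("lmio-parser")  # hypothetical container name
container.restart()
container.reload()
print(f"{container.name}: {container.status}")
```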
Significant reduction of the system performance¶
Impact: Moderate (3)
Likelihood: Possible (3)
Risk level: medium-high
Risk mitigation:
- Active use of monitoring and alerting
- Prophylactic maintenance
Recovery:
- Identify and remove the root cause of the reduction of the performance
- Contact the vendor or the support if help is needed
Backup and recovery strategy¶
Offline backup for the incoming logs¶
Incoming logs are duplicated to the offline backup storage, which is not part of the active LogMan.io cluster (hence "offline"). The offline backup provides an option to restore logs to LogMan.io after a critical failure.
Backup strategy for the fast data storage¶
Incoming events (logs) are copied into the archive storage once they enter LogMan.io. This means there is always a way to "replay" events into TeskaLabs LogMan.io if needed. Data are also replicated to other nodes of the cluster immediately after arrival. For this reason, a traditional backup is possible but not recommended.
The restoration is handled by the cluster components by replicating the data from other nodes of the cluster.
Backup strategy for the slow data storage¶
The data stored on the slow data storage are ALWAYS replicated to other nodes of the cluster and also stored in the archive. For this reason, a traditional backup is possible but not recommended (consider the large size of the slow storage).
The restoration is handled by the cluster components by replicating the data from other nodes of the cluster.
Backup strategy for the system storage¶
It is recommended to periodically back up all filesystems on the system storage so that they can be used to restore the installation when needed. The backup strategy is compatible with the most common backup technologies on the market.
- Recovery Point Objective (RPO): full backup once per week or after major maintenance work, incremental backup once per day.
- Recovery Time Objective (RTO): 12 hours.
Note
The RPO and RTO values are recommendations, assuming a highly available setup of the LogMan.io cluster, i.e. three or more nodes, so that the complete downtime of a single node does not impact service availability.
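A minimal sketch of how compliance with the recommended RPO could be checked, assuming the backups are stored as timestamped files in a backup directory; the path is an assumption, and real deployments would read this information from the backup tooling.

```python
import os
import time

# Sketch: check that the newest system storage backup is not older than the
# recommended RPO (incremental backup once per day). The backup directory
# is a hypothetical location.
BACKUP_DIR = "/backup/system"
RPO_SECONDS = 24 * 3600

newest = max(
    (os.path.getmtime(os.path.join(BACKUP_DIR, name)) for name in os.listdir(BACKUP_DIR)),
    default=0.0,
)
age_seconds = time.time() - newest
if age_seconds > RPO_SECONDS:
    print(f"RPO violated: last backup is {age_seconds / 3600:.1f} hours old")
else:
    print(f"Last backup is {age_seconds / 3600:.1f} hours old, within the RPO")
```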
Generic backup and recovery rules¶
- Data Backup: Regularly back up data to a secure location, such as a cloud-based storage service or backup tapes, to minimize data loss in case of failures.
- Backup Scheduling: Establish a backup schedule that meets the needs of the organization, such as daily, weekly, or monthly backups.
- Backup Verification: Verify the integrity of backup data regularly to ensure that it can be used for disaster recovery (see the sketch after this list).
- Restoration Testing: Test the restoration of backup data regularly to ensure that the backup and recovery process is working correctly and to identify and resolve any issues before they become critical.
- Backup Retention: Establish a backup retention policy that balances the need for long-term data preservation with the cost of storing backup data.
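A minimal sketch of the backup verification rule above, comparing a stored SHA-256 checksum with a freshly computed one; the file paths are illustrative assumptions.

```python
import hashlib

# Sketch: verify backup integrity by comparing a stored SHA-256 checksum
# with a freshly computed one. File paths are illustrative assumptions.

def sha256_of(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


backup_file = "/backup/system/full-backup.tar.gz"  # hypothetical backup file
with open(backup_file + ".sha256") as f:
    expected = f.read().split()[0]

print("Backup OK" if sha256_of(backup_file) == expected else "Backup CORRUPTED")
```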
Monitoring and alerting¶
Monitoring is an important component of a Continuity Plan as it helps to detect potential failures early, identify the cause of failures, and support decision-making during the recovery process.
LogMan.io microservices provide an OpenMetrics API and/or ship their telemetry into InfluxDB; Grafana is used as the monitoring tool.
- Monitoring Strategy: The OpenMetrics API is used to collect telemetry from all microservices in the cluster, the operating system, and the hardware. Telemetry is collected once per minute. InfluxDB is used to store the telemetry data. Grafana is used as the web-based user interface for telemetry inspection (see the sketch after this list).
- Alerting and Notification: The monitoring system is configured to generate alerts and notifications in case of potential failures, such as low disk space, high resource utilization, or increased error rates.
- Monitoring Dashboards: Monitoring dashboards are provided in Grafana that display the most important metrics for the system, such as resource utilization, error rates, and response times.
- Monitoring Configuration: The monitoring configuration is regularly reviewed and updated to ensure that it remains effective and reflects changes in the system.
- Monitoring Training: Training is provided for the monitoring team and other relevant parties on the monitoring system and the monitoring dashboards in Grafana.
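The telemetry flow described in the monitoring strategy can be illustrated with the following sketch, which scrapes an OpenMetrics endpoint once per minute and writes the numeric samples into InfluxDB. It assumes InfluxDB 2.x with the influxdb-client library; all URLs, the token, the organization, and the bucket are illustrative assumptions, and the naive line parsing skips labelled samples for brevity. In production the collection is handled by the monitoring stack itself.

```python
import time
import urllib.request

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Sketch of the telemetry flow: scrape an OpenMetrics endpoint once per minute
# and write the numeric samples into InfluxDB. All URLs, the token, org and
# bucket are illustrative assumptions.
METRICS_URL = "http://localhost:8080/metrics"  # hypothetical microservice endpoint

client = InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="logman")
write_api = client.write_api(write_options=SYNCHRONOUS)

while True:
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        for line in resp.read().decode().splitlines():
            # Naive parsing for illustration: skip comments, blanks and
            # labelled samples.
            if not line.strip() or line.startswith("#") or "{" in line:
                continue
            parts = line.split()
            if len(parts) < 2:
                continue
            try:
                point = Point(parts[0]).field("value", float(parts[1]))
            except ValueError:
                continue  # skip non-numeric samples
            write_api.write(bucket="telemetry", record=point)
    time.sleep(60)  # collect once per minute
```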
High availability architecture¶
TeskaLabs LogMan.io is deployed in a highly available architecture (HA) with multiple nodes to reduce the risk of single points of failure.
High availability architecture is a design pattern that aims to ensure that a system remains operational and available, even in the event of failures or disruptions.
In a LogMan.io cluster, a high availability architecture includes the following components:
- Load Balancing: Distribution of incoming traffic among multiple instances of microservices, thereby improving the resilience of the system and reducing the impact of failures.
- Redundant Storage: Storing data redundantly across multiple storage nodes to prevent data loss in the event of a storage failure.
- Multiple Brokers: Use of multiple brokers in Apache Kafka to improve the resilience of the messaging system and reduce the impact of broker failures.
- Automatic Failover: Automatic failover mechanisms, such as leader election in Apache Kafka, to ensure that the system continues to function in the event of a cluster node failure.
- Monitoring and Alerting: Usage of monitoring and alerting components to detect potential failures and trigger automatic failover mechanisms when necessary.
- Rolling Upgrades: Upgrades to the system without disrupting its normal operation, by upgrading nodes one at a time, without downtime.
- Data Replication: Replication of logs across multiple cluster nodes to ensure that the system continues to function even if one or more nodes fail.
Communication plan¶
A clear and well-communicated plan for responding to failures and communicating with stakeholders helps to minimize the impact of failures and ensure that everyone is on the same page.
- Stakeholder Identification: Identify all stakeholders who may need to be informed during and after a disaster, such as employees, customers, vendors, and partners.
- Participating organisations: The LogMan.io operator, the integrating party, and the vendor (TeskaLabs).
- Communication Channels: The communication channels used during and after a disaster are Slack, email, phone, and SMS.
- Escalation Plan: Specify an escalation plan to ensure that the right people are informed at the right time during a disaster, and that communication is coordinated and effective.
- Update and Maintenance: Regularly update and maintain the communication plan to ensure that it reflects changes in the organization, such as new stakeholders or communication channels.