Parser preprocessor

The parser preprocessor preprocesses the input event with imperative code written in, e.g., Python, Cython or C.

Example

---
define:
  name: Demo of the built-in Syslog preprocessor
  type: parser/preprocessor
  tenant: Syslog_RFC5424.STRUCTURED_DATA.soc@0.tenant  # (optional)

function: lmiopar.preprocessor.Syslog_RFC5424

tenant (optional) specifies the attribute whose value is read and passed to context['tenant'], so that LogMan.io Dispatcher can distribute parsed and unparsed events to tenant-specific indices/storages.
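
As an illustration of how such a dotted path could be resolved, the sketch below walks nested dictionaries in the context. This is an assumption about the mechanism, not the documented LogMan.io implementation; the helper name resolve_tenant and the tenant value "acme" are hypothetical.

```python
def resolve_tenant(context, path):
    """Follow a dotted path through nested dicts; return None if a key is missing.

    Hypothetical helper: the real lookup in LogMan.io may differ.
    """
    value = context
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Context shaped like the output of the Syslog RFC5424 preprocessor.
context = {
    "Syslog_RFC5424": {
        "STRUCTURED_DATA": {"soc@0": {"tenant": "acme"}},  # "acme" is made up
    }
}

tenant = resolve_tenant(context, "Syslog_RFC5424.STRUCTURED_DATA.soc@0.tenant")
if tenant is not None:
    context["tenant"] = tenant
```

When the path cannot be resolved, the sketch leaves context['tenant'] unset instead of raising an error.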

Built-in preprocessors

The lmiopar.preprocessor module contains the following commonly used preprocessors. These preprocessors are optimized for high-performance deployments.

Syslog RFC5424 built-in preprocessor

function: lmiopar.preprocessor.Syslog_RFC5424

This is a preprocessor for the Syslog protocol (new) according to RFC5424.

The input for this preprocessor is a valid Syslog entry, e.g.:

<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog 10 ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] An application event log entry.

The output is the message part of the log in the event, and the parsed elements in context.Syslog_RFC5424.

event: An application event log entry.

context:
  Syslog_RFC5424:
    PRI: 165
    FACILITY: 20
    PRIORITY: 5
    VERSION: 1
    TIMESTAMP: 2003-10-11T22:14:15.003Z
    HOSTNAME: mymachine.example.com
    APP_NAME: evntslog
    PROCID: 10
    MSGID: ID47
    STRUCTURED_DATA:
      exampleSDID@32473:
        iut: 3
        eventSource: Application
        eventID: 1011
      ...
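
The FACILITY and PRIORITY values in the output follow from PRI: RFC5424 defines PRI as the facility multiplied by 8 plus the severity. A minimal sketch of that arithmetic is shown below; the function name split_pri is hypothetical and is not part of lmiopar.

```python
def split_pri(pri):
    """Split a Syslog PRI value into (facility, severity)."""
    # Facility is 0..23 and severity is 0..7, so PRI is at most 23 * 8 + 7 = 191.
    if not 0 <= pri <= 191:
        raise ValueError("PRI must be in the range 0..191")
    return divmod(pri, 8)


# <165> from the example entry gives FACILITY 20 and PRIORITY 5.
facility, priority = split_pri(165)
```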

Syslog RFC3164 built-in preprocessor

function: lmiopar.preprocessor.Syslog_RFC3164

This is a preprocessor for the BSD syslog Protocol (old) according to RFC3164.

The Syslog RFC3164 preprocessor can be configured in the define section:

define:
  type: parser/preprocessor
  year: 1999
  timezone: Europe/Prague

year specifies the year, as a number, that is applied to the timestamps of the logs, because RFC3164 timestamps do not carry a year. Alternatively, specify smart (the default) to select the year automatically based on the month.

timezone specifies the timezone of the logs; the default is UTC.
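
To make the effect of the two options concrete, the following sketch attaches a year and a timezone to an RFC3164 timestamp and renders the result in UTC. It is only an illustration built on the Python standard library, not the lmiopar implementation; the function name to_utc_iso is hypothetical, and a fixed +02:00 offset stands in for Europe/Prague summer time.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(timestamp_text, year, tz):
    """Attach a year and a timezone to an RFC3164 timestamp and render it in UTC."""
    # RFC3164 timestamps look like "Oct 11 22:14:15" and carry no year.
    naive = datetime.strptime(f"{year} {timestamp_text}", "%Y %b %d %H:%M:%S")
    aware = naive.replace(tzinfo=tz)
    utc = aware.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


# Fixed +02:00 offset as a stand-in for Europe/Prague in October 1999 (summer time).
# zoneinfo.ZoneInfo("Europe/Prague") could be passed instead on Python 3.9+.
cest = timezone(timedelta(hours=2))
timestamp = to_utc_iso("Oct 11 22:14:15", 1999, cest)
```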

The input for this preprocessor is a valid Syslog entry, e.g.:

<34>Oct 11 22:14:15 mymachine su[10]: 'su root' failed for lonvick on /dev/pts/8

The output is the message part of the log in the event, and the parsed elements in context.Syslog_RFC3164.

event: "'su root' failed for lonvick on /dev/pts/8"

context:
  Syslog_RFC3164:
    PRI: 34
    PRIORITY: 2
    FACILITY: 4
    TIMESTAMP: '2003-10-11T22:14:15.003Z'
    HOSTNAME: mymachine
    TAG: su
    PID: 10

TAG and PID are optional fields; they are present only when the input contains them.

CEF built-in preprocessor

function: lmiopar.preprocessor.CEF

This is a preprocessor for CEF (Common Event Format).

define:
  type: parser/preprocessor
  year: 1999
  timezone: Europe/Prague

year specifies the year, as a number, that is applied to the timestamps of the logs. Alternatively, specify smart (the default) to select the year automatically based on the month.

timezone specifies the timezone of the logs; the default is UTC.

The input for this preprocessor is a valid CEF entry, e.g.:

CEF:0|Vendor|Product|Version|foobar:1:2|Failed password|Medium| eventId=1234 app=ssh categorySignificance=/Informational/Warning categoryBehavior=/Authentication/Verify

The output is the message part of the log in the event, and the parsed elements in context.CEF:

context:
  CEF:
    Version: 0
    DeviceVendor: Vendor
    DeviceProduct: Product
    DeviceVersion: Version
    DeviceEventClassID: 'foobar:1:2'
    Name: Failed password
    Severity: Medium

    eventId: '1234'
    app: ssh
    categorySignificance: /Informational/Warning
    categoryBehavior: /Authentication/Verify
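
The mapping from the CEF entry to these fields can be sketched in a few lines of Python. This is a simplified illustration and not the lmiopar implementation: the function name parse_cef is hypothetical, escaped pipes and escaped equals signs are not handled, and all values are kept as strings.

```python
import re

# Names of the pipe-delimited header fields that follow the "CEF:<version>" prefix.
HEADER_FIELDS = (
    "DeviceVendor", "DeviceProduct", "DeviceVersion",
    "DeviceEventClassID", "Name", "Severity",
)

# A key=value pair; the value runs until the next " key=" or the end of the text.
EXTENSION_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_cef(line):
    """Parse a CEF entry into a flat dict (simplified: no escape handling)."""
    parts = line.split("|", 7)
    if len(parts) != 8 or not parts[0].startswith("CEF:"):
        raise ValueError("not a CEF entry")
    parsed = {"Version": parts[0][len("CEF:"):]}
    parsed.update(zip(HEADER_FIELDS, parts[1:7]))
    for key, value in EXTENSION_RE.findall(parts[7].strip()):
        parsed[key] = value
    return parsed


entry = parse_cef(
    "CEF:0|Vendor|Product|Version|foobar:1:2|Failed password|Medium| "
    "eventId=1234 app=ssh categorySignificance=/Informational/Warning "
    "categoryBehavior=/Authentication/Verify"
)
```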

A CEF entry can also contain a Syslog header. This is supported by chaining the relevant Syslog preprocessor with the CEF preprocessor. Please refer to the chapter on chaining of preprocessors for details.

Apache HTTP Server log formats built-in preprocessor

There are high-performance preprocessors for the common Apache HTTP Server access log formats.

function: lmiopar.preprocessor.Apache_Common_Log_Format

This is a preprocessor for the Apache Common Log Format.

function: lmiopar.preprocessor.Apache_Combined_Log_Format

This is a preprocessor for the Apache Combined Log Format.

Apache Common Log example

Input:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Output:

context:
  Apache_Access_Log:
    HOST: '127.0.0.1'
    IDENT: '-'
    USERID: 'frank'
    TIMESTAMP: '2000-10-10T20:55:36.000Z'
    METHOD: 'GET'
    RESOURCE: '/apache_pb.gif'
    PROTOCOL: 'HTTP/1.0'
    STATUS_CODE: 200
    DOWNLOAD_SIZE: 2326
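
The relation between the input line and the output fields, including the conversion of the timestamp to UTC, can be sketched with a regular expression. This is an illustration only, not the optimized lmiopar preprocessor; the function name parse_common_log is hypothetical, and representing a "-" download size as None is an assumption.

```python
import re
from datetime import datetime, timezone

COMMON_LOG_RE = re.compile(
    r'^(?P<HOST>\S+) (?P<IDENT>\S+) (?P<USERID>\S+) \[(?P<TIMESTAMP>[^\]]+)\] '
    r'"(?P<METHOD>\S+) (?P<RESOURCE>\S+) (?P<PROTOCOL>[^"]+)" '
    r'(?P<STATUS_CODE>\d{3}) (?P<DOWNLOAD_SIZE>\d+|-)'
)


def parse_common_log(line):
    """Parse one Apache Common Log Format line into a dict of fields."""
    match = COMMON_LOG_RE.match(line)
    if match is None:
        raise ValueError("not an Apache Common Log Format entry")
    fields = match.groupdict()
    # The log timestamp carries its own UTC offset; normalize it to UTC.
    local = datetime.strptime(fields["TIMESTAMP"], "%d/%b/%Y:%H:%M:%S %z")
    utc = local.astimezone(timezone.utc)
    fields["TIMESTAMP"] = utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    fields["STATUS_CODE"] = int(fields["STATUS_CODE"])
    size = fields["DOWNLOAD_SIZE"]
    fields["DOWNLOAD_SIZE"] = None if size == "-" else int(size)  # assumption for "-"
    return fields


entry = parse_common_log(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
```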

Apache Combined Log example

Input:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

Output:

context:
  Apache_Access_Log:
    HOST: '127.0.0.1'
    IDENT: '-'
    USERID: 'frank'
    TIMESTAMP: '2000-10-10T20:55:36.000Z'
    METHOD: 'GET'
    RESOURCE: '/apache_pb.gif'
    PROTOCOL: 'HTTP/1.0'
    STATUS_CODE: 200
    DOWNLOAD_SIZE: 2326
    REFERER: http://www.example.com/start.html
    USER_AGENT: Mozilla/4.08 [en] (Win98; I ;Nav)

JSON built-in preprocessor

function: lmiopar.preprocessor.JSON

This is a preprocessor for the JSON format. It expects the input in a binary or textual format; the output dictionary is placed in the event.

Hence, the input for this preprocessor is a valid JSON entry.
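
For orientation, the behaviour corresponds to what the Python standard library offers for JSON decoding, which accepts both binary and textual input. The sketch below only illustrates this equivalence; the function name parse_json and the sample document are hypothetical and are not taken from lmiopar.

```python
import json


def parse_json(data):
    """Decode a JSON document; json.loads accepts str, bytes and bytearray."""
    return json.loads(data)


# The same (made-up) document, once as text and once as bytes.
event_from_text = parse_json('{"user": "jack", "action": "login"}')
event_from_bytes = parse_json(b'{"user": "jack", "action": "login"}')
```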

XML built-in preprocessor

function: lmiopar.preprocessor.XML

This is a preprocessor for the XML format. It expects the input in a binary or textual format; the output dictionary is placed in the event.

Hence, the input for this preprocessor is a valid XML entry, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
   <System>
      <Provider Name="Schannel" Guid="{1f678132-5938-4686-9fdc-c8ff68f15c85}" />
      <EventID>36884</EventID>
      <Version>0</Version>
      <Level>2</Level>
      <Task>0</Task>
      <Opcode>0</Opcode>
      <Keywords>0x8000000000000000</Keywords>
      <TimeCreated SystemTime="2020-06-26T07:12:01.331577900Z" />
      <EventRecordID>30286</EventRecordID>
      <Correlation ActivityID="{8e20742a-4b06-0002-c274-208e064bd601}" />
      <Execution ProcessID="788" ThreadID="948" />
      <Channel>System</Channel>
      <Computer>XX</Computer>
      <Security UserID="S-1-5-21-1627182167-2524376360-74743131-1001" />
   </System>
   <UserData>
      <EventXML xmlns="LSA_NS">
         <Name>localhost</Name>
      </EventXML>
   </UserData>
   <RenderingInfo Culture="en-US">
      <Message>The certificate received from the remote server does not contain the expected name. It is therefore not possible to determine whether we are connecting to the correct server. The server name we were expecting is localhost. The TLS connection request has failed. The attached data contains the server certificate.</Message>
      <Level>Error</Level>
      <Task />
      <Opcode>Info</Opcode>
      <Channel>System</Channel>
      <Provider />
      <Keywords />
   </RenderingInfo>
</Event>

The output of the preprocessor in the event:

{
  "System.EventID": "36884",
  "System.Version": "0",
  "System.Level": "2",
  "System.Task": "0",
  "System.Opcode": "0",
  "System.Keywords": "0x8000000000000000",
  "System.EventRecordID": "30286",
  "System.Channel": "System",
  "System.Computer": "XX",
  "UserData.EventXML.Name": "localhost",
  "RenderingInfo.Message": "The certificate received from the remote server does not contain the expected name. It is therefore not possible to determine whether we are connecting to the correct server. The server name we were expecting is localhost. The TLS connection request has failed. The attached data contains the server certificate.",
  "RenderingInfo.Level": "Error",
  "RenderingInfo.Opcode": "Info",
  "RenderingInfo.Channel": "System"
}
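
Judging from this example, the output keeps only elements that carry text, joins nested element names with dots, and drops namespaces, attributes and the root element name. A minimal sketch of such a flattening with the Python standard library follows. It is inferred from the example and is not the lmiopar implementation; the function names are hypothetical, and repeated sibling elements would overwrite each other in this sketch.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten(element, prefix=""):
    """Collect the text of leaf elements under dotted keys."""
    flat = {}
    for child in element:
        key = prefix + local_name(child.tag)
        if len(child) > 0:
            flat.update(flatten(child, key + "."))
        elif child.text and child.text.strip():
            flat[key] = child.text.strip()
    return flat


def xml_to_flat_dict(data):
    """Parse XML given as bytes or str and flatten it below the root element."""
    return flatten(ET.fromstring(data))


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Schannel" />
    <EventID>36884</EventID>
    <Channel>System</Channel>
  </System>
  <UserData>
    <EventXML xmlns="LSA_NS">
      <Name>localhost</Name>
    </EventXML>
  </UserData>
</Event>"""

event = xml_to_flat_dict(sample)
```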

CSV built-in preprocessor

function: lmiopar.preprocessor.CSV

This is a preprocessor for the CSV format. It expects the input in a binary or textual format; the parsed lines are placed in context["CSV"].

Hence, the input for this preprocessor is a valid CSV entry, e.g.:

user,last_name\njack,black\njohn,doe

The output of the preprocessor in the context["CSV"]:

{
  "lines": [
    {"user": "jack", "last_name": "black"},
    {"user": "john", "last_name": "doe"}
  ]
}

Parameters

In the define section of the CSV preprocessor, the following parameters may be set for CSV reading:

delimiter: field delimiter (default: ",")
escapechar: escape character
doublequote: allow doublequote (default: true)
lineterminator: line terminator character, either \n or \r (default is the operating system line separator)
quotechar: default quote character (default: "\"")
quoting: type of quoting
skipinitialspace: skip initial space (default: false)
strict: strict mode (default: false)
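
The parameter names match the formatting parameters of the csv module in the Python standard library. Assuming they are interpreted the same way (an assumption, not something this document confirms), the example above can be reproduced with the following sketch; the function name parse_csv is hypothetical.

```python
import csv
import io


def parse_csv(data, **fmtparams):
    """Parse CSV text or bytes into {"lines": [...]}, using the first row as header."""
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    # newline="" lets the csv module handle line endings itself.
    reader = csv.DictReader(io.StringIO(data, newline=""), **fmtparams)
    return {"lines": [dict(row) for row in reader]}


default_result = parse_csv("user,last_name\njack,black\njohn,doe")

# Formatting parameters are passed through, e.g. a different delimiter.
semicolon_result = parse_csv(b"user;last_name\njack;black", delimiter=";")
```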

Custom preprocessors

Custom preprocessors can be called from the parser. The respective code has to be accessible to the parser microservice through the regular Python import mechanism.

---
define:
  name: Demo of the custom Python preprocessor
  type: parser/preprocessor

function: mypreprocessors.preprocessor

mypreprocessors is a module, or a package (a folder with __init__.py), that contains a function preprocessor().

The parser specifies the function to call in Python dotted notation, and the module is imported automatically.

The signature of the function:

def preprocessor(context, event):
  ...
  return event

The preprocessor may (1) modify the event (!EVENT) and/or (2) modify the context (!CONTEXT).

The output of the preprocessor function is passed to the subsequent parsers. A preprocessor parser doesn't produce parsed events directly. If the function returns None, the parsing of the event is silently terminated. If the function raises an exception, the exception is logged and the event is forwarded to the unparsed output.
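
A minimal custom preprocessor that exercises all of these behaviours could look as follows. The input format ("<source> <message>"), the context key "source" and the handling of bytes input are assumptions made for the sake of the example; only the signature and the meaning of the return value come from the description above.

```python
def preprocessor(context, event):
    """Split a hypothetical '<source> <message>' line into context and event."""
    # Assumption: the event may arrive as bytes or as text.
    if isinstance(event, bytes):
        event = event.decode("utf-8")

    event = event.strip()
    if not event:
        # Returning None silently terminates the parsing of this event.
        return None

    source, separator, message = event.partition(" ")
    if not separator:
        # A raised exception is logged and the event goes to the unparsed output.
        raise ValueError("expected '<source> <message>'")

    context["source"] = source  # (2) modify the context
    return message              # (1) the modified event goes to subsequent parsers


context = {}
message = preprocessor(context, b"fw01 link down")
```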

Chaining of preprocessors

Preprocessors can be chained in order to parse more complex input formats. The output (aka event) of the first preprocessor is fed as an input of the second preprocessor (and so on).

For example, the input is in the CEF format with a Syslog RFC3164 header:

<14>Jan 28 05:51:33 connector-test CEF_PARSED_LOG: CEF:0|Vendor|Product|Version|foobar:1:2|Failed password|Medium| eventId=1234 app=ssh categorySignificance=/Informational/Warning categoryBehavior=/Authentication/Verify

The pipeline contains two preprocessors:

p01_parser.yaml:

---
define:
  name: Preprocessor for Syslog RFC3164 part of the message
  type: parser/preprocessor

function: lmiopar.preprocessor.Syslog_RFC3164

p02_parser.yaml:

---
define:
  name: Preprocessor for CEF part of the message
  type: parser/preprocessor

function: lmiopar.preprocessor.CEF

and the final parser, p03_parser.yaml:

---
define:
  name: Finalize by parsing the event into a dictionary
  type: parser/cascade

parse:
  !DICT
  set:
    Syslog_RFC3164: !ITEM CONTEXT Syslog_RFC3164
    CEF: !ITEM CONTEXT CEF
    Message: !EVENT

Output example:

context:
  CEF:
    Version: 0
    DeviceVendor: Vendor
    DeviceProduct: Product
    DeviceVersion: Version
    DeviceEventClassID: 'foobar:1:2'
    Name: Failed password
    Severity: Medium

    eventId: '1234'
    app: ssh
    categorySignificance: /Informational/Warning
    categoryBehavior: /Authentication/Verify

  Syslog_RFC3164:
    PRI: 14
    FACILITY: 1
    PRIORITY: 6
    HOSTNAME: connector-test
    TAG: CEF_PARSED_LOG
    TIMESTAMP: '2020-01-28T05:51:33.000Z'

  Message: ''
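
The chaining behaviour described in this chapter can be pictured as a simple loop in which each preprocessor receives the event returned by the previous one while all of them share the same context. The sketch below is a conceptual illustration, not the LogMan.io parser engine; run_chain, strip_header and to_upper are hypothetical names.

```python
def run_chain(preprocessors, context, event):
    """Feed the event through the preprocessors in order; stop when one returns None."""
    for preprocess in preprocessors:
        event = preprocess(context, event)
        if event is None:
            return None
    return event


def strip_header(context, event):
    """First stage: move everything before ': ' into the context."""
    header, _, rest = event.partition(": ")
    context["header"] = header
    return rest


def to_upper(context, event):
    """Second stage: works only on what the first stage left in the event."""
    return event.upper()


context = {}
result = run_chain([strip_header, to_upper], context, "host app: hello")
```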