Applying GROK Patterns in ELK
Recently, I worked on filters in Logstash using plugins such as grok and date. Many of us find grok patterns complex to write, so I am writing this blog to make writing them easier.
Use Case
I had a use case in which I had to filter logs from catalina.out, which was difficult because the logs in catalina.out do not follow a fixed pattern. So I created multiple filters of my own to parse them.
Scope
How do we write filters in a Logstash config file using plugins like grok and date?
What is Logstash?
Logstash is an open source tool for collecting and parsing logs. It is written in JRuby and requires Java on your machine. Installation steps can be found here.
What is Filter?
A filter is an in-line processing mechanism that parses unstructured data and produces structured data as output. Currently, grok is the best way to structure crappy data in logs.
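To show where a filter sits in a pipeline, here is a minimal sketch of a config file that reads lines from standard input, runs them through a filter, and prints the structured event. The field names client and rest are only illustrative, and grok itself is explained in the next section.

[js]
# Read raw lines typed on the console.
input {
  stdin { }
}
# Turn the unstructured "message" field into named fields.
filter {
  grok {
    match => { "message" => "%{IP:client} %{GREEDYDATA:rest}" }
  }
}
# Print the resulting event so the extracted fields are visible.
output {
  stdout { codec => rubydebug }
}
[/js]

A setup like this is handy for trying out a pattern on a few sample lines before pointing it at real log files.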
What is GROK?
Grok matches any number of complex patterns against any number of inputs and produces customizable output, which lets us focus on ideas over syntax.
It ships with about 120 patterns by default, which eliminates repetition and brings in the idea of REUSABILITY.
The syntax of GROK is %{SYNTAX:SEMANTIC}, where:
SYNTAX: the name of the pattern that matches the text.
SEMANTIC: the name we give to the piece of text that was matched.
For Example:
We have Nginx-access logs like:
10.0.2.119 - - [09/Sep/2015:11:50:55 +0000] "GET /versionInfo.txt HTTP/1.1" 200 219 "-" "ELB-HealthChecker/1.0" 0.003 0.003
Now, GROK Pattern can be written as shown below:
[js]
filter {
  if [type] == "nginx-access" {
    grok {
      # match takes a hash of field => pattern.
      # Inner double quotes are escaped so they do not end the config string.
      match => { "message" => "%{IP:client} %{USERNAME} %{USERNAME} \[%{HTTPDATE:timestamp}\] (?:\"%{WORD:request} %{URIPATHPARAM:path} HTTP/%{NUMBER:version}\" %{NUMBER:response} %{NUMBER:bytes} \"%{USERNAME}\" %{GREEDYDATA:responseMessage})" }
    }
  }
}
[/js]
where,
%{IP}: matches an IP address, whether it is IPv4 or IPv6.
%{USERNAME}: matches letters, digits and a few special characters (dot, underscore, hyphen). Here it matches the "-" placeholders.
%{HTTPDATE}: matches the timestamp in the format used in these logs.
%{WORD}: matches a single word.
%{URIPATHPARAM}: matches the request path, including any query parameters.
%{NUMBER}: matches a number.
%{GREEDYDATA}: matches the rest of the message.
There are predefined patterns, which can be referred here, that we can combine to make our own filters and see logs in the format we want in Logstash.
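As a small extension of the Nginx example, the sketch below shows two things that usually follow a grok match: converting numeric captures with the :int suffix, and handing the captured timestamp to the date filter. The date format string is the one commonly used for HTTPDATE values; treat the whole block as an illustration rather than a drop-in config.

[js]
filter {
  if [type] == "nginx-access" {
    grok {
      # ":int" stores the captured text as an integer instead of a string.
      match => { "message" => "%{IP:client} %{USERNAME} %{USERNAME} \[%{HTTPDATE:timestamp}\] \"%{WORD:request} %{URIPATHPARAM:path} HTTP/%{NUMBER:version}\" %{NUMBER:response:int} %{NUMBER:bytes:int}" }
    }
    # Parse e.g. "09/Sep/2015:11:50:55 +0000" and use it as the event time.
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }
}
[/js]

With the date filter in place, the event's @timestamp reflects when the request was logged rather than when Logstash read the line.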
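When none of the shipped patterns fits, a custom one can be defined next to the match. The sketch below assumes a grok plugin version that supports the pattern_definitions option (older installations use a patterns directory instead). The pattern name SECONDS_FLOAT and the field names first_time and second_time are hypothetical, chosen only to capture the two trailing values in the sample Nginx line.

[js]
filter {
  if [type] == "nginx-access" {
    grok {
      # SECONDS_FLOAT is a made-up pattern name, defined inline for reuse.
      pattern_definitions => { "SECONDS_FLOAT" => "\d+\.\d{3}" }
      # Reuse the custom pattern twice, anchored at the end of the line.
      match => { "message" => "%{SECONDS_FLOAT:first_time} %{SECONDS_FLOAT:second_time}$" }
    }
  }
}
[/js]

Defining the pattern once and referring to it by name keeps the match expression readable, which is the same reusability idea behind the built-in patterns.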
GROK Pattern for Tomcat Logs
The filter given below is according to the use case defined above:
[js]
filter {
  if [type] == "catalina" {
    # Lines that do not match the pattern are appended to the previous event.
    multiline {
      pattern => "(^\s*|%{MONTH} %{MONTHDAY}, 20%{YEAR} %{HOUR}:?%{MINUTE}(?::?%{SECOND}) (?:AM|PM))|(^20%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:?%{MINUTE}(?::?%{SECOND}) %{ISO8601_TIMEZONE})"
      negate => true
      what => "previous"
    }
    if "_grokparsefailure" in [tags] {
      drop { }
    }
    # Patterns are tried in order; the first one that matches is used.
    grok {
      match => {
        "message" => [
          "(?:%{DATESTAMP:timestamp})(?:\s*)(?:%{SYSLOG5424SD})(?:\s*)(?:%{LOGLEVEL:log_type})(?:\s*)%{GREEDYDATA:Message}",
          "(?:\s*)%{LOGLEVEL:log_type}(?:\s*)%{GREEDYDATA:Message}",
          "(?:\s*)(?:%{GREEDYDATA:cache})"
        ]
      }
    }
    # Try each date format in turn against the extracted timestamp.
    date {
      match => [ "timestamp", "yyyy-MM-dd HH:mm:ss,SSS Z", "MMM dd, yyyy HH:mm:ss a", "yy-MM-dd HH:mm:ss,SSS" ]
    }
  }
}
[/js]
Since Tomcat logs are multiline, we have defined the date patterns inside the multiline block. With negate => true and what => "previous", any line that does not match the pattern is appended to the previous event. This is not a full filter; it is only meant to catch multiline logs from Tomcat or any other application that generates them.
On the first pass an event will be tagged with _grokparsefailure, because up to that point only the date has been matched. Once the date has matched, the event goes on to match one of the patterns defined in the grok block and no longer raises that error. Tomcat writes dates in several formats, so multiple date patterns are listed in the date block, and any one of them may match.
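On newer Logstash releases the multiline filter is, as far as I know, no longer bundled, and the usual replacement is the multiline codec on the input. The sketch below is an assumption-laden alternative, not the setup used above: the file path is a placeholder, and the pattern simply treats any line that starts with whitespace (such as a stack trace line) as a continuation of the previous event.

[js]
input {
  file {
    # Placeholder path; point this at the real catalina.out.
    path => "/path/to/catalina.out"
    type => "catalina"
    # Join lines that begin with whitespace onto the previous event.
    codec => multiline {
      pattern => "^\s"
      what => "previous"
    }
  }
}
[/js]

Joining lines at the input keeps each stack trace together as a single event before any grok or date filter runs.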