Monitoring sendgrid with Elasticsearch

If you are using sendgrid as a service for your outbound email, you will want to monitor it and be able to answer questions such as:

  • how much email are you sending
  • status of sent email – success, bounced, delayed, etc.
  • trends
  • etc.

We get questions all the time from $WORK customer support folks on whether an email sent to a customer got there (the customer claims they never got it). There could be any number of reasons why a customer does not see email sent from us.

  • our email is filtered into the customer's spam folder
  • the email is rejected or bounced by the customer's mail service
  • any number of network/server/service errors between us and the customer's mail service
  • the email address the customer provided is invalid (and the email bounced)

If we have access to event logs from sendgrid, we would be able to quickly answer these types of questions.

Luckily sendgrid offers an Event Webhook.

Quoting the sendgrid documentation verbatim:

SendGrid’s Event Webhook will notify a URL of your choice via HTTP POST with information about events that occur as SendGrid processes your email. Common uses of this data are to remove unsubscribes, react to spam reports, determine unengaged recipients, identify bounced email addresses, or create advanced analytics of your email program. With Unique Arguments and Category parameters, you can insert dynamic data that will help build a sharp, clear image of your mailings.
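
To make the quote concrete: each webhook POST body is a JSON array of event objects, with fields such as email, timestamp and event. Below is a minimal Python sketch of tallying one such batch by event type. The sample payload and the function name summarize_events are made up for illustration, and this is not how the listener described below is implemented.

```python
import json
from collections import Counter


def summarize_events(body: str) -> Counter:
    """Count events by type in one webhook POST body (a JSON array)."""
    events = json.loads(body)
    return Counter(e.get("event", "unknown") for e in events)


# Hypothetical payload, shaped like one Event Webhook batch.
sample_body = json.dumps([
    {"email": "alice@example.com", "timestamp": 1523400000, "event": "delivered"},
    {"email": "bob@example.com", "timestamp": 1523400005, "event": "bounce"},
    {"email": "carol@example.com", "timestamp": 1523400009, "event": "delivered"},
])

counts = summarize_events(sample_body)
print(dict(counts))
```

A per-type tally like this is the raw material for the "how much email" and "status of sent email" questions at the top of the post.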

Log in to your sendgrid account and click on Mail Settings.

Then click on Event Notification.

In HTTP Post URL, enter the URL of the service endpoint you are going to set up next.

For example, mine is (not a valid endpoint, but close enough): https://sendlog.mydomain.com/logger

I do not believe in re-inventing the wheel, and Adly Abdullah has already written a simple sendgrid event listener (Note: this is my forked version, which works with ES 6.x). It is a nodejs service that you can install via npm.

$ sudo npm install -g sendgrid-event-logger pm2

The command above also installs pm2 (nodejs Process Manager version 2), a very nice nodejs process manager.

Next, edit and configure sendgrid-event-logger (SEL for short). If the default config works for you, there is nothing to do. Check that elasticsearch_host points to where your ES host is located (mine is running on the same instance, hence localhost). I also left SEL listening on port 8080, as that port is available on this instance.

$ cat /etc/sendgrid-event-logger.json
{
    "elasticsearch_host": "localhost:9200",
    "port": 8080,
    "use_basicauth": true,
    "basicauth": {
        "user": "sendgridlogger",
        "password": "KLJSDG(#@%@!gBigSecret"
    },
    "use_https": false,
    "https": {
        "key_file": "",
        "cert_file": ""
    },
    "days_to_retain_log": 365
}

NOTE: I have use_https set to false because my nginx front-end is already using https.
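
The days_to_retain_log setting is worth a second look, since it decides how far back you can answer support questions. Here is a rough Python sketch of what a 365-day retention window means for daily mail-YYYY.MM.DD indices. How SEL actually enforces retention is not shown in this post, so treat the function expired_indices and its logic as an assumption, not as SEL's implementation.

```python
from datetime import date, datetime, timedelta


def expired_indices(names, today, days_to_retain):
    """Return the daily indices (mail-YYYY.MM.DD) older than the retention window."""
    cutoff = today - timedelta(days=days_to_retain)
    expired = []
    for name in names:
        try:
            day = datetime.strptime(name, "mail-%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily mail index; leave it alone
        if day < cutoff:
            expired.append(name)
    return expired


# Hypothetical index list; only the first one falls outside the window.
names = ["mail-2017.04.01", "mail-2018.03.28", "mail-2018.04.11", ".kibana"]
old = expired_indices(names, today=date(2018, 4, 12), days_to_retain=365)
print(old)
```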

Since SEL is listening on port 8080, which is an unprivileged port, you can run it as yourself rather than as root.

$ pm2 start sendgrid-event-logger -i 0 --name "sendgrid-event-logger"

Verify that SEL is running.

$ pm2 show 0

Describing process with id 0 - name sendgrid-event-logger
┌───────────────────┬──────────────────────────────────────────────────────┐
│ status            │ online                                               │
│ name              │ sendgrid-event-logger                                │
│ restarts          │ 0                                                    │
│ uptime            │ 11m                                                  │
│ script path       │ /usr/bin/sendgrid-event-logger                       │
│ script args       │ N/A                                                  │
│ error log path    │ $HOME/.pm2/logs/sendgrid-event-logger-error-0.log    │
│ out log path      │ $HOME/.pm2/logs/sendgrid-event-logger-out-0.log      │
│ pid path          │ $HOME/.pm2/pids/sendgrid-event-logger-0.pid          │
│ interpreter       │ node                                                 │
│ interpreter args  │ N/A                                                  │
│ script id         │ 0                                                    │
│ exec cwd          │ $HOME                                                │
│ exec mode         │ fork_mode                                            │
│ node.js version   │ 8.11.1                                               │
│ watch & reload .  │ ✘                                                    │
│ unstable restarts │ 0                                                    │
│ created at        │ 2018-02-14T23:36:06.705Z                             │
└───────────────────┴──────────────────────────────────────────────────────┘
Code metrics value
┌─────────────────┬────────┐
│ Loop delay .    │ 0.68ms │
│ Active requests │ 0      │
│ Active handles  │ 4      │
└─────────────────┴────────┘
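
Beyond pm2 show, you may want to push a test event at the listener yourself. The Python sketch below only builds such a request, with HTTP basic auth matching use_basicauth above. The /logger path is borrowed from the example URL earlier in this post, and the password and event payload are placeholders, so all of these are assumptions to adjust for your setup.

```python
import base64
import json
import urllib.request


def build_test_request(url, user, password):
    """Build (but do not send) a POST that mimics one webhook delivery."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    # Placeholder event; a real batch would come from the mail service.
    body = json.dumps([{"email": "test@example.com", "event": "processed",
                        "timestamp": 1518651366}]).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


req = build_test_request("http://localhost:8080/logger", "sendgridlogger", "secret")
# To actually send it: urllib.request.urlopen(req, timeout=10)
```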

I use nginx and here is my nginx config for SEL.

/etc/nginx/sites-available $ cat sendgrid-logger
upstream sendgrid_logger {
  server 127.0.0.1:8080;
}

server {
  server_name slog.mysite.org slog;
  listen 443 ssl ;

  include snippets/ssl.conf;
  access_log /var/log/nginx/slog/access.log;
  error_log /var/log/nginx/slog/error.log;
  proxy_connect_timeout 5m;
  proxy_send_timeout 5m;
  proxy_read_timeout 5m;

  location / {
    proxy_pass http://sendgrid_logger;
  }
}
$ sudo ln -s /etc/nginx/sites-available/sendgrid-logger /etc/nginx/sites-enabled/
$ sudo systemctl reload nginx

Make sure the sendgrid Event Webhook is turned on, and you should see events coming in. Check your Elasticsearch cluster for new indices.

$ curl -s localhost:9200/_cat/indices|grep mail
green open mail-2018.03.31 -g6Tw9b9RfqZnBVYLdrF-g 1 0 2967 0 1.4mb 1.4mb
green open mail-2018.03.28 GxTRx2PgR4yT5kiH0RKXrg 1 0 8673 0 4.2mb 4.2mb
green open mail-2018.04.06 2PO9YV1eS7eevZ1dfFrMGw 1 0 10216 0 4.9mb 4.9mb
green open mail-2018.04.11 _ZINqVPTSwW7b8wSgkTtTA 1 0 8774 0 4.3mb 4.3mb

etc.
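
If you want a quick number out of that listing without opening Kibana, the output is easy to parse. This is a small Python sketch that assumes the default _cat/indices column order shown above (health, status, index, uuid, pri, rep, docs.count, ...); the function name parse_cat_indices is mine.

```python
def parse_cat_indices(text):
    """Parse default `_cat/indices` output into {index_name: doc_count}."""
    counts = {}
    for line in text.splitlines():
        cols = line.split()
        # Columns: health status index uuid pri rep docs.count docs.deleted ...
        if len(cols) >= 7:
            counts[cols[2]] = int(cols[6])
    return counts


sample = """\
green open mail-2018.03.31 -g6Tw9b9RfqZnBVYLdrF-g 1 0 2967 0 1.4mb 1.4mb
green open mail-2018.03.28 GxTRx2PgR4yT5kiH0RKXrg 1 0 8673 0 4.2mb 4.2mb
green open mail-2018.04.06 2PO9YV1eS7eevZ1dfFrMGw 1 0 10216 0 4.9mb 4.9mb
green open mail-2018.04.11 _ZINqVPTSwW7b8wSgkTtTA 1 0 8774 0 4.3mb 4.3mb
"""

counts = parse_cat_indices(sample)
total = sum(counts.values())
print(total)
```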

Go to Kibana and set up an index pattern. In my case, it’s mail-*. Go to Discover, select the mail-* index pattern and play around.

Here is my simple report.  I see around 9am, something happened to cause a huge spike in mail events.

 

Next step is for you to create dashboards to fit your needs.

Enjoy!

 

Elasticsearch util to copy/reindex index(es)

Elasticsearch (and the entire ELK stack) is a pretty useful open source piece of software for analyzing large datasets. I manage a fairly large ELK infrastructure at work: around 90+ ES clusters, 300+ TB of data. One of the things I’ve found myself having to do is copying and/or reindexing one or more indices. Sometimes it is within the same ES cluster, sometimes it means moving indices to another cluster.

Regardless, it is something that is done often enough, yet in an ad-hoc manner. It’s not worth setting up a logstash config to do this and then tearing it down.

Here is an example logstash config to do something like this.

logstash config:

input {
 elasticsearch {
   hosts => [ "host1", "host2", ..., "hostN" ]
   index => "index"
 }
}
filter {
 ......
}
output {
 elasticsearch {
 .....
 }
}

This gets old fast when there are many indices. So I wrote a tool to do this in Go. I used the elastic go library from Olivere (https://github.com/olivere/elastic).

I call it espipe and put it on my Github repo — https://github.com/TinLe/tools.

You will need to download it and make sure you have a golang build environment set up. Then change into the source directory where espipe.go is located and type go build.

If you don’t have a golang build environment set up and just want the binary, you can download espipe (this is built for linux x86_64).

 

Simple usage:

$ ./espipe -h
Usage of ./espipe:
  -bulksize int
    	Number of docs to send to ES per chunk (default to 500) (default 500)
  -dst string
    	Destination ES cluster (default to http://localhost:9200) (default "http://localhost:9200")
  -sidx string
    	Source index(es) to copy (default to all '*') (default "logstash*")
  -src string
    	Source ES cluster (default to http://localhost:9200) (default "http://localhost:9200")
  -tidx string
    	Target index to copy (default to 'copyidx') (default "copyidx")

# the following copies all nginx-access-YYYY.MM.DD indices from anothercluster to
# the local cluster and consolidates them all into one index
$ ./espipe -dst http://localhost:9200 -src http://anothercluster:9200 -sidx 'nginx-access*' -tidx 'nginx-consolidated' -bulksize 1000
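
To give a feel for what the -bulksize flag controls, here is a Python sketch of the general copy technique: split the documents into chunks and render each chunk as an Elasticsearch _bulk body (newline-delimited JSON). This is not espipe's code (espipe is written in Go), the documents are invented, and on ES versions before 7 the action line would also need a _type unless it is given in the URL.

```python
import json
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_body(docs, target_index):
    """Render docs as an Elasticsearch _bulk request body (NDJSON)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": target_index, "_id": doc["_id"]}}))
        lines.append(json.dumps(doc["_source"]))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline


# Hypothetical documents, as a scroll over the source index might return them.
docs = [{"_id": str(i), "_source": {"status": 200, "n": i}} for i in range(5)]
bodies = [bulk_body(batch, "nginx-consolidated") for batch in chunked(docs, 2)]
print(len(bodies))
```

Larger chunks mean fewer round trips but bigger requests, which is the trade-off behind picking a bulk size.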

Monitoring Postfix and Dovecot logs in ELK

I’ve been using pflogsumm for the longest time to monitor my postfix logs.   When I used to manage hundreds of domains and many more mailing lists, it was important to keep an eye on my mail servers.

These days, it is just my own personal mail server for my dozens of domains. I don’t even need to run one, what with Google and other low cost email services. It’s for fun and to keep my skills sharp.

Since I have been working with the ELK stack a lot lately, I have been wanting to send all my logs (nginx, syslog and postfix maillog) into ELK. There are already grok patterns in logstash for nginx, apache and syslog, but none for postfix. So I did what I always do: sat down and dived in.

To be clear, I don’t believe in re-inventing the wheel, so I did due diligence and searched for what others have done first.   There were several places that posted their grok recipes for postfix.  But none were exactly plug-n-play for me.   I’ll list them here.

whyscream postfix grok pattern on github

antispin logstash postfix grok patterns

I ended up using a modified version of antispin’s patterns.   I don’t use Amavisd, but I do use Dovecot.   So I added new patterns and modified what was there for my particular installation.

My installation is

  • Fedora 21 (now 23) x86_64
  • Postfix 2.xx
  • Dovecot 2.xx
  • Elasticsearch v1.7.3
  • logstash v1.5.5
  • Kibana 4.1.3.
  • Hardware is:
    • Dell XPS1210 laptop (3.5GB RAM and 250GB HD)
    • ASUS Eee PC 900A (Atom N270, 2GB RAM and 4GB SSD, with 80GB external USB2 drive) – this one runs Fedora 21 x86 (32 bit).  Note that I have not seen any problems with mixing 32 and 64 bit systems wrt ELK data.

On Fedora, postfix and dovecot logs go to syslog and end up in /var/log/maillog.

I have logstash installed in /home/logstash. So I added a postfix pattern file in /home/logstash/config/patterns (the patterns_dir used in the logstash.conf further down) and called it (what else) postfix.

I also want to say that the grokdebug site really saved me a lot of time and headache.  Use it if you ever have to create new grok patterns!

Here is the content of that file.

# Syslog stuff
COMPONENT ([\w._\/%-]+)
COMPID postfix\/%{COMPONENT:component}(?:\[%{NUMBER:pid}\])?
POSTFIX (?:%{SYSLOGTIMESTAMP:timestamp}|%{TIMESTAMP_ISO8601:timestamp8601}) (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{COMPID}:
# POSTFIX %{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{COMPID}: %{QUEUEID:queueid}
# POSTFIX_MESSAGE %{SYSLOGTIMESTAMP:timestamp} %{IPORHOST:host} %{DATA:program}/%{DATA:subprog}\[%{NUMBER:pid}\]: %{POSTFIX_QUEUEID:queueid}:

# Milter
HELO (?:\[%{IP:helo}\]|%{HOST:helo}|%{DATA:helo})

MILTERCONNECT %{QUEUEID:qid}: milter-reject: CONNECT from %{RELAY:relay}: %{GREEDYDATA:milter_reason}; proto=%{WORD:proto}
MILTERUNKNOWN %{QUEUEID:qid}: milter-reject: UNKNOWN from %{RELAY:relay}: %{GREEDYDATA:milter_reason}; proto=%{WORD:proto}
MILTEREHLO %{QUEUEID:qid}: milter-reject: EHLO from %{RELAY:relay}: %{GREEDYDATA:milter_reason}; proto=%{WORD:proto} helo=<%{HELO}>
MILTERMAIL %{QUEUEID:qid}: milter-reject: MAIL from %{RELAY:relay}: %{GREEDYDATA:milter_reason}; from=<%{EMAILADDRESS:from}> proto=%{WORD:proto} helo=<%{HELO}>
MILTERHELO %{QUEUEID:qid}: milter-reject: HELO from %{RELAY:relay}: %{GREEDYDATA:milter_reason}; proto=%{WORD:proto} helo=<%{HELO}>
MILTERRCPT %{QUEUEID:qid}: milter-reject: RCPT from %{RELAY:relay}: %{GREEDYDATA:milter_reason}; from=<%{EMAILADDRESS:from}> to=<%{EMAILADDRESS:to}> proto=%{WORD:proto} helo=<%{HELO}>
MILTERENDOFMESSAGE %{QUEUEID:qid}: milter-reject: END-OF-MESSAGE from %{RELAY:relay}: %{GREEDYDATA:milter_reason}; from=<%{EMAILADDRESS:from}> to=<%{EMAILADDRESS:to}> proto=%{WORD:proto} helo=<%{HELO}>

# Postfix stuff
QUEUEID (?:[A-F0-9]+|NOQUEUE)
# hyphen placed last so it is a literal, not the range from + to =
EMAILADDRESSPART [a-zA-Z0-9_.+=:~-]+
EMAILADDRESS %{EMAILADDRESSPART:local}@%{EMAILADDRESSPART:remote}
RELAY (?:%{HOSTNAME:relayhost}(?:\[%{IP:relayip}\](?::[0-9]+(\.[0-9]+)?)?)?)
#RELAY (?:%{HOSTNAME:relayhost}(?:\[%{IP:relayip}\](?:%{POSREAL:relayport})))
POSREAL [0-9]+(\.[0-9]+)?
#DELAYS %{POSREAL:a}/%{POSREAL:b}/%{POSREAL:c}/%{POSREAL:d}
#DELAYS (%{POSREAL}[/]*)+
DSN %{NONNEGINT}\.%{NONNEGINT}\.%{NONNEGINT}
STATUS sent|deferred|bounced|expired
PERMERROR 5[0-9]{2}
MESSAGELEVEL reject|warning|error|fatal|panic

POSTFIXSMTPMESSAGE %{MESSAGELEVEL}: %{GREEDYDATA:reason}
POSTFIXACTION discard|dunno|filter|hold|ignore|info|prepend|redirect|replace|reject|warn

# postfix/smtp and postfix/lmtp, postfix/local and postfix/error
POSTFIXSMTP %{POSTFIXSMTPRELAY}|%{POSTFIXSMTPCONNECT}|%{POSTFIXSMTP5XX}|%{POSTFIXSMTPREFUSAL}|%{POSTFIXSMTPLOSTCONNECTION}|%{POSTFIXSMTPTIMEOUT}
# Jun 17 04:41:52 dir postfix/smtp[14434]: CE4FC560C0D: to=, relay=localhost[127.0.0.1]:2525, delay=0.32, delays=0.05/0.01/0.19/0.07, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 1B6864661B2F)
POSTFIXSMTPRELAY %{QUEUEID:qid}: to=<%{DATA:to}>,(?:\sorig_to=<%{DATA:orig_to}>,)? relay=%{RELAY},(?: delay=%{POSREAL:delay},)?(?: delays=%{DATA:delays}?,)?(?: conn_use=%{POSREAL:conn_use},)?( %{WORD}=%{DATA},)+? dsn=%{DSN:dsn}, status=%{STATUS:result} %{GREEDYDATA:reason}
POSTFIXSMTPCONNECT connect to %{RELAY}: %{GREEDYDATA:reason}
POSTFIXSMTP5XX %{QUEUEID:qid}: to=<%{EMAILADDRESS:to}>,(?:\sorig_to=<%{EMAILADDRESS:orig_to}>,)? relay=%{RELAY}, (%{WORD}=%{DATA},)+ dsn=%{DSN:dsn}, status=%{STATUS:result} \(host %{HOSTNAME}\[%{IP}\] said: %{PERMERROR:responsecode} %{DATA:smtp_response} \(in reply to %{DATA:command} command\)\)
POSTFIXSMTPREFUSAL %{QUEUEID:qid}: host %{RELAY} refused to talk to me: %{GREEDYDATA:reason}
POSTFIXSMTPLOSTCONNECTION %{QUEUEID:qid}: lost connection with %{RELAY} while %{GREEDYDATA:reason}
POSTFIXSMTPTIMEOUT %{QUEUEID:qid}: conversation with %{RELAY} timed out while %{GREEDYDATA:reason}


# postfix/smtpd
POSTFIXSMTPD %{POSTFIXSMTPDCONNECTS}|%{POSTFIXSMTPDMILTER}|%{POSTFIXSMTPDACTIONS}|%{POSTFIXSMTPDTIMEOUTS}|%{POSTFIXSMTPDLOGIN}|%{POSTFIXSMTPDCLIENT}|%{POSTFIXSMTPDNOQUEUE}|%{POSTFIXSMTPDWARNING}|%{POSTFIXSMTPDLOSTCONNECTION}
POSTFIXSMTPDCONNECTS (?:dis)?connect from %{RELAY}
POSTFIXSMTPDMILTER %{MILTERCONNECT}|%{MILTERUNKNOWN}|%{MILTEREHLO}|%{MILTERMAIL}|%{MILTERHELO}|%{MILTERRCPT}
POSTFIXSMTPDACTIONS %{QUEUEID:qid}: %{POSTFIXACTION:postfix_action}: %{DATA:command} from %{RELAY}: %{PERMERROR:responsecode} %{DSN:dsn} %{DATA}: %{DATA:reason}; from=<%{EMAILADDRESS:from}> to=<%{EMAILADDRESS:to}> proto=%{DATA:proto} helo=<%{HELO}>
#POSTFIXSMTPDACTIONS %{QUEUEID:qid}: %{POSTFIXACTION:postfix_action}: %{DATA:command} from %{RELAY}: %{DATA:smtp_response}: %{DATA:reason}; from=<%{EMAILADDRESS:from}> to=<%{EMAILADDRESS:to}> proto=%{DATA:proto} helo=<%{HELO}>
POSTFIXSMTPDTIMEOUTS timeout after %{DATA:command} from %{RELAY}
POSTFIXSMTPDLOGIN %{QUEUEID:qid}: client=%{DATA:client}, sasl_method=%{DATA:saslmethod}, sasl_username=%{GREEDYDATA:saslusername}
POSTFIXSMTPDCLIENT %{QUEUEID:qid}: client=%{GREEDYDATA:client}
POSTFIXSMTPDNOQUEUE NOQUEUE: %{POSTFIXACTION:postfix_action}: %{DATA:command} from %{RELAY}: %{GREEDYDATA:reason}
POSTFIXSMTPDWARNING warning:( %{IP}: | hostname %{HOSTNAME} )?%{GREEDYDATA:reason}
# Jun  3 16:40:28 dir postfix/smtpd[16526]: improper command pipelining after HELO from 41.254.8.1.ZTE.WiMAX.dynamic.ltt.ly[41.254.8.1]: QUIT\r\n
POSTFIXSMTPDLOSTCONNECTION (?:lost connection after %{DATA:smtp_response} from %{RELAY}|improper command pipelining after HELO from %{GREEDYDATA:reason})

# postfix/cleanup
POSTFIXCLEANUP %{POSTFIXCLEANUPMESSAGE}|%{POSTFIXCLEANUPMILTER}
POSTFIXCLEANUPMESSAGE %{QUEUEID:qid}: (resent-)?message-id=(<)?%{GREEDYDATA:messageid}(>)?
POSTFIXCLEANUPMILTER %{MILTERENDOFMESSAGE}

# postfix/bounce
POSTFIXBOUNCE %{QUEUEID:qid}: sender (non-)?delivery( status)? notification: %{QUEUEID:bouncequeueid}

# postfix/qmgr and postfix/pickup
# Jun 15 14:33:26 dir postfix/qmgr[1282]: 76A5C560C09: from=<[email protected]>, size=21928, nrcpt=1 (queue active)
POSTFIXQMGR %{QUEUEID:qid}: (?:removed|from=<(?:%{DATA:from})?>(?:, size=%{NUMBER:size}, nrcpt=%{NUMBER:nrcpt} \(%{GREEDYDATA:queuestatus}\))?)

# postfix/anvil
# May 19 19:33:17 dir postfix/scache[8102]: statistics: domain lookup hits=0 miss=1 success=0%
#POSTFIXANVIL statistics:( %{DATA:anvilstatistic})?( for %{DATA:remotehost})?( at )?%{SYSLOGTIMESTAMP:timestamp}
POSTFIXANVIL statistics: %{GREEDYDATA:reason}

# postfix/trivial-rewrite
POSTFIXREWRITE warning: do not list domain %{DATA:domain} in BOTH mydestination and virtual_alias_domains

# AMAVISD
USER_AGENT User-Agent|X-Mailer
RECIPIENTS <%{EMAILADDRESS:recipient}>(,<%{GREEDYDATA:recipientlist}>)?
ORIGIN (%{DATA:originating_net} )\[%{IP:relay}\](:%{NUMBER}) \[%{IP:originip}\]
AMAVIS %{SYSLOGBASE} \(%{DATA}\) %{WORD:action} %{WORD:ccat} \{%{GREEDYDATA:policybank}\}, %{ORIGIN} <(%{EMAILADDRESS:from})> -> %{GREEDYDATA}, Queue-ID: %{QUEUEID}, Message-ID: <%{DATA:messageid}>%{GREEDYDATA:rest_of_message}

#AMAVISDNEW %{SYSLOGBASE} \(%{DATA:amavisdid}\) %{WORD:action} %{WORD:ccat} %{GREEDYDATA:policybank}, (%{GREEDYDATA:origin_net}) \[%{IP:relayip}\](:%{POSINT}) \[%{IP:originip}\] <(%{EMAILADDRESS:from})?> -> %{RECIPIENTS:recipients}, Queue-ID:%{QUEUEID}, Message-ID: <%{DATA:messageid}>,( mail_id: %{DATA:mail_id},)? Hits: %{NUMBER:hits:float}, size: %{NUMBER:size:int},( queued_as: %{QUEUEID:qid},)? Subject: "%{DATA:subject}", From: %{DATA:from},( %{USER_AGENT}: %{DATA:user_agent},)? Tests: \[%{DATA:TESTS}\],( shortcircuit=%{WORD:shortcircuit},)?( autolearn=%{WORD:autolearn},)? %{POSINT:elapsedtime} ms

#AMAVISDNEW %{SYSLOGBASE} \(%{DATA:amavisdid}\) %{WORD:action} %{WORD:ccat} %{GREEDYDATA:policybank}, \[%{RELAY:relayip}\] \[%{IP:originip}\] <(%{EMAILADDRESS:from})?> -> %{RECIPIENTS:recipients}, Message-ID: <%{DATA:messageid}>,( mail_id: %{DATA:mail_id},)? Hits: %{NUMBER:hits:float}, size: %{NUMBER:size:int},( queued_as: %{QUEUEID:qid},)? Subject: "%{DATA:subject}", From: %{DATA:from},( %{USER_AGENT}: %{DATA:user_agent},)? Tests: \[%{DATA:TESTS}\],( shortcircuit=%{WORD:shortcircuit},)?( autolearn=%{WORD:autolearn},)? %{POSINT:elapsedtime} ms

# Dovecot
# Jun 17 21:30:16 dir dovecot: imap(tin): Disconnected: Logged out in=397 out=45702
# Jun 15 09:26:18 dir dovecot: imap(tin): Connection closed in=352 out=1726
# Jun 19 01:19:29 dir dovecot: imap(pnguyen): Connection closed in=0 out=362
#DOVEID dovecot: %{DATA:component}(?:\(%{DATA:user}\))?(:)?
DOVEIMAP imap\(%{DATA:user}\): %{DATA:reason} in=%{NUMBER:inbytes} out=%{NUMBER:outbytes}

# May 21 21:58:12 dir dovecot: master: Warning: /home/alex is no longer mounted. See http://wiki2.dovecot.org/Mountpoints
# Jun  5 16:13:31 dir dovecot: anvil: Warning: Killed with signal 15 (by pid=1 uid=0 code=kill)
DOVECMD anvil|auth|config|log|master
DOVEMISC %{DOVECMD:command}: %{GREEDYDATA:reason}
# DOVEMISC %{(anvil|auth|config|log|master):command}: %{GREEDYDATA:reason}

DOVELOGIN imap-login: %{DATA:action}:(?: user=<(%{DATA:user})?>, (method=%{DATA:loginmethod}, )?rip=%{IP:rip}, lip=%{IP:lip},( mpid=%{NUMBER:mpid},( %{DATA:sectype},)?| %{DATA:securesession},)? session=<%{DATA:session}>| %{GREEDYDATA:reason})

DOVELDA lda\((%{DATA:user})?\):( %{DATA:action}:)? msgid=(?:<%{DATA:mesgid}@%{DATA:domain}>|%{DATA:mesgid}):( saved mail to| stored mail into mailbox) .*?%{DATA:folder}.*?

DOVEAUTH auth-worker\(%{NUMBER:pid}\): pam\((?:%{USERNAME:user}|%{EMAILADDRESS:user}),%{IP:ip}\): %{GREEDYDATA:reason}

DOVECOT (?:%{SYSLOGTIMESTAMP:timestamp}|%{TIMESTAMP_ISO8601:timestamp8601}) (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} dovecot: (%{DOVEIMAP}|%{DOVELOGIN}|%{DOVELDA}|%{DOVEAUTH}|%{DOVEMISC})

#PF %{SYSLOGBASE} (%{POSTFIXSMTP}|%{POSTFIXANVIL}|%{POSTFIXQMGR}|%{POSTFIXBOUNCE}|%{POSTFIXCLEANUP}|%{POSTFIXSMTPD}|%{AMAVIS})
PF %{POSTFIX} (?:%{POSTFIXSMTP}|%{POSTFIXANVIL}|%{POSTFIXQMGR}|%{POSTFIXBOUNCE}|%{POSTFIXCLEANUP}|%{POSTFIXSMTPD}|%{POSTFIXREWRITE})

MAILLOG (%{PF}|%{DOVECOT})
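
If grok syntax is new to you, it may help to see one of these patterns as a plain regular expression. Below is a hand translation of DOVEIMAP into Python, run against one of the sample lines from the comments above. It is an approximation: %{DATA} becomes a lazy .*? and %{NUMBER} is narrowed to digits only, so it is not a drop-in equivalent of what logstash compiles.

```python
import re

# Hand-translated from DOVEIMAP: %{DATA} becomes a lazy .*? and %{NUMBER} is
# narrowed to \d+ (grok's NUMBER also accepts signs and decimals).
DOVEIMAP = re.compile(
    r"imap\((?P<user>.*?)\): (?P<reason>.*?) in=(?P<inbytes>\d+) out=(?P<outbytes>\d+)"
)

line = "Jun 17 21:30:16 dir dovecot: imap(tin): Disconnected: Logged out in=397 out=45702"
fields = DOVEIMAP.search(line).groupdict()
print(fields)
```

The named groups play the same role as the field names after the colon in %{DATA:user}: they become the fields you later see in Kibana.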

Here is the logstash.conf file, which uses the file input plugin and elasticsearch output plugin, along with the grok filter to make use of our patterns. Note that after analyzing the default mapping of incoming data, I decided to create my own customized template and override the default logstash mapping. You can leave the default mapping as is; I just happen to want more control over my data mappings. The custom mapping is included below.

input {
  file {
    path => "/var/log/maillog*"
    exclude => "*.gz"
    start_position => "beginning"
    type => "maillog"
  }
}
filter {
  if [type] == "maillog" {
    grok {
      patterns_dir => ["/home/logstash/config/patterns"]
      match => { "message" => ["%{PF}", "%{DOVECOT}" ] }
    }
    date {
      # syslog pads single-digit days with a space, so list both forms
      match => [ "timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
  # I wanted to monitor metrics and health of logstash
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}
output {
  if [type] == "maillog" {
    elasticsearch {
      index => "maillog-%{+YYYY.MM.dd}"
      host => "localhost"
      port => "9200"
      protocol => "http"
      flush_size => 1000
      ########################################################
      # the next 4 lines are for explicit index mapping
      manage_template => true
      template_overwrite => true
      template => "/home/logstash/config/templates/maillog.json"
      template_name => "maillog"
    }
  }
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "rate: %{events.rate_1m}"
      }
    }
  }
}
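
One detail of the date step is easy to miss: syslog timestamps carry no year and pad single-digit days with a space, so whatever parses them has to supply the year. The Python sketch below shows the same idea outside logstash. The helper name parse_syslog_timestamp is hypothetical, and it assumes the default C/English locale for month names.

```python
from datetime import datetime


def parse_syslog_timestamp(ts, year):
    """Parse a year-less syslog timestamp such as 'Jun  3 16:40:28'."""
    # Prefixing the year avoids strptime's default of 1900, which would
    # reject 'Feb 29'. Whitespace in the format matches one or more spaces,
    # so space-padded single-digit days parse too.
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")


print(parse_syslog_timestamp("Jun  3 16:40:28", 2015))
print(parse_syslog_timestamp("Jun 17 04:41:52", 2015))
```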

My customized mapping.

{
    "template" : "maillog-*",
    "order" : 1,
    "settings" : {
        "number_of_shards" : 2,
        "index.refresh_interval" : "90s"
    },
    "mappings" : {
        "maillog" : {
            "properties" : {
                "reason" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "saslusername" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "postfix_action" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "relayip" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "messageid" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "pid" : { "index": "not_analyzed", "doc_values": true, "type" : "long" },
                "remote" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "type" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "qid" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "local" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "result" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "path" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "file" : { "index": "not_analyzed", "type" : "string" },
                "queuestatus" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "smtp_response" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "@version" : { "type" : "string" },
                "host" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "client" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "from" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "timestamp" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "nrcpt" : { "index": "not_analyzed", "doc_values": true, "type" : "long" },
                "responsecode" : { "index": "not_analyzed", "doc_values": true, "type" : "long" },
                "offset" : { "index": "not_analyzed", "doc_values": true, "type" : "long" },
                "relayhost" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "logsource" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "message" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "orig_to" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "command" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "tags" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "helo" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "saslmethod" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "component" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "@timestamp" : { "format" : "dateOptionalTime", "type" : "date" },
                "remotehost" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "size" : { "index": "not_analyzed", "doc_values": true, "type" : "long" },
                "anvilstatistic" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "proto" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "bouncequeueid" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "to" : { "index": "not_analyzed", "doc_values": true, "type" : "string" },
                "dsn" : { "index": "not_analyzed", "doc_values": true, "type" : "string" }
            }
        }
    }
}
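
Since almost every field in that template uses the same settings, you could also generate the file instead of maintaining it by hand. Here is a Python sketch of that idea, using the same ES 1.x style mapping ("string" plus "not_analyzed"). The function build_template is hypothetical and only a subset of the fields is listed.

```python
import json

NOT_ANALYZED = {"index": "not_analyzed", "doc_values": True}


def build_template(string_fields, long_fields):
    """Assemble an ES 1.x-style index template for maillog-* indices."""
    props = {f: dict(NOT_ANALYZED, type="string") for f in string_fields}
    props.update({f: dict(NOT_ANALYZED, type="long") for f in long_fields})
    props["@timestamp"] = {"format": "dateOptionalTime", "type": "date"}
    return {
        "template": "maillog-*",
        "order": 1,
        "settings": {"number_of_shards": 2, "index.refresh_interval": "90s"},
        "mappings": {"maillog": {"properties": props}},
    }


# Only a subset of the fields, for brevity.
template = build_template(
    string_fields=["qid", "from", "to", "result", "reason"],
    long_fields=["pid", "size", "nrcpt"],
)
print(json.dumps(template, indent=4))
```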

ELK Operational Tips

I’ve been running ELK clusters for over a year now, and want to share tips and tricks that I’ve found to be useful.

Feel free to post questions and corrections. I’ll try to answer and update when possible.

Elasticsearch

  • Split brain – this is when more than one node in your cluster becomes master.
    • It is best to never let this happen.   Use the rule of thumb: if you have N master-eligible nodes, set discovery.zen.minimum_master_nodes to N/2 + 1 (rounding N/2 down), so that a master can only be elected by a majority.   Even better, set aside a dedicated pool of master nodes (I recommend a minimum of 3 master-capable nodes).
    • If split brain does happen, you want to stop one of the master nodes ASAP.   Depending on whether you have replicas or not, it could be an easy fix, or you might end up having to re-index if your indices have gotten out of sync because a replica was promoted to primary and new index data was sent to it.
  • Failed node(s) – one or more failed nodes.  There are many scenarios, from failing hardware to outages causing data corruption, etc.
  • Planned maintenance – several scenarios.
  • Indexing takes too long.
  • Recovery takes too long.
  • Search/query takes too long.
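
The N/2 + 1 rule of thumb for avoiding split brain is just a strict majority of the master-eligible nodes, which is the value you would put in discovery.zen.minimum_master_nodes on the ES versions that have that setting. A tiny Python sketch of the arithmetic (the function name is mine):

```python
def minimum_master_nodes(master_eligible):
    """Quorum size: a strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


for n in (1, 2, 3, 5):
    print(n, minimum_master_nodes(n))
```

Note that with 2 master-eligible nodes the quorum is 2, so losing either node stops master election. That is why 3 is the practical minimum.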

Logstash

Kibana

 

Adding CORS support to elasticsearch-head plugin

There are two vulnerabilities in Elasticsearch that I recently patched in my installations.

One is the ‘script’ vuln, mentioned here.

Fix by adding

script.disable_dynamic: true

to your elasticsearch.yml config file.

The other one has to do with CORS, which exposes data via REST endpoints.

Fix by adding

http.cors.allow-origin: "http://your.FQDN.domain.name"

to your elasticsearch.yml config file.

In fixing the second one (CORS), I ran into a problem: the fix broke my use of the elasticsearch-head plugin.  I use the plugin as a checked out git repo on my laptop and port forward to the actual ES server.   E.g. the URL I use is something like this

file:///Users/tinle/src/opensource/elasticsearch-head/index.html?base_uri=http://127.0.0.1:9200/

So I ended up having to patch elasticsearch-head to make it work with CORS.

diff --git a/dist/app.js b/dist/app.js
index 5bce2a3..7e58acb 100644
--- a/dist/app.js
+++ b/dist/app.js
@@ -1188,6 +1188,9 @@
                request: function( params ) {
                        return $.ajax( $.extend({
                                url: this.base_uri + params.path,
+      /**
+       * 2014/06/01 tinle
+       **/
                                dataType: "jsonp",
         crossDomain: true,
                                error: function(xhr, type, message) {
diff --git a/dist/vendor.js b/dist/vendor.js
index fb1a448..2b74180 100644
--- a/dist/vendor.js
+++ b/dist/vendor.js
@@ -6838,6 +6838,10 @@ jQuery.each( [ "get", "post" ], function( i, method ) {
                return jQuery.ajax({
                        type: method,
                        url: url,
+      /**
+       * HACK 2014/06/03 tinle
+       */
+      crossDomain: true,
                        data: data,
                        success: callback,
                        dataType: type
@@ -14439,4 +14443,4 @@ under the License.
                }
                throw "could not process value " + v;
        };
-})();
\ No newline at end of file
+})();

 

Updated: 6/4/2014 – I think the above patch should work.  I’ve been using it for the last few days and I am able to GET/PUT/POST, i.e. make changes to ES via elasticsearch-head.