Monitoring SendGrid with Elasticsearch

If you use SendGrid for your outbound email, you will want to monitor it and be able to answer questions such as:

  • how much email are you sending
  • status of sent email – success, bounced, delayed, etc.
  • trends
  • etc.

We get questions all the time from $WORK customer support folks on whether an email sent to a customer got there (the customer claims they never got it).   There could be any number of reasons why a customer does not see email sent from us:

  • our email is filtered into the customer's spam folder
  • the email is rejected or bounced by the customer's mail service
  • any number of network, server, or service errors occur between us and the customer's mail service
  • the email address the customer provided is invalid (and the email bounced)

With access to the event logs from SendGrid, we can quickly answer these types of questions.

Luckily, SendGrid offers the Event Webhook.

Verbatim quote from the Event Webhook documentation:

SendGrid’s Event Webhook will notify a URL of your choice via HTTP POST with information about events that occur as SendGrid processes your email. Common uses of this data are to remove unsubscribes, react to spam reports, determine unengaged recipients, identify bounced email addresses, or create advanced analytics of your email program. With Unique Arguments and Category parameters, you can insert dynamic data that will help build a sharp, clear image of your mailings.
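To make the quote concrete, here is a minimal sketch of what arrives at your endpoint: a JSON array of event objects. The field names (email, timestamp, event, reason) follow SendGrid's event format, but the sample values and both helper functions are made up for illustration and are not part of any SendGrid library.

```python
import json
from collections import Counter

# Hypothetical sample of a webhook POST body: a JSON array of events.
# The addresses, timestamps, and reason text are invented.
SAMPLE_BODY = json.dumps([
    {"email": "alice@example.com", "timestamp": 1523450000, "event": "delivered"},
    {"email": "bob@example.com", "timestamp": 1523450005, "event": "bounce",
     "reason": "550 mailbox not found"},
    {"email": "alice@example.com", "timestamp": 1523450100, "event": "open"},
])


def summarize_events(body):
    """Count events by type from a raw webhook request body."""
    events = json.loads(body)
    return Counter(e.get("event", "unknown") for e in events)


def events_for(body, email):
    """Return the event types recorded for one recipient, oldest first."""
    events = [e for e in json.loads(body) if e.get("email") == email]
    return [e["event"] for e in sorted(events, key=lambda e: e["timestamp"])]


if __name__ == "__main__":
    print(summarize_events(SAMPLE_BODY))
    print(events_for(SAMPLE_BODY, "alice@example.com"))
```

This is exactly the kind of per-recipient history that lets support answer "did the email get there?".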

Log in to your SendGrid account and click on Mail Settings.

Then click on Event Notification.

In HTTP Post URL, enter the DNS name of the service endpoint you are going to set up next.

For example, mine is (not a valid endpoint, but close enough): https://sendlog.mydomain.com/logger

I do not believe in reinventing the wheel, and Adly Abdullah has already written a simple SendGrid event listener (note: this is my forked version, which works with ES 6.x).   It is a Node.js service that you can install via npm.

$ sudo npm install -g sendgrid-event-logger pm2

This also installs pm2 (Process Manager 2), a very nice Node.js process manager.

Next, edit and configure sendgrid-event-logger (SEL for short).   If the default config works for you, there is nothing to do.  Check that it points to where your ES host is located (mine runs on the same instance, hence localhost).   I also left SEL listening on port 8080, as that port is available on this instance.

$ cat /etc/sendgrid-event-logger.json
{
    "elasticsearch_host": "localhost:9200",
    "port": 8080,
    "use_basicauth": true,
    "basicauth": {
        "user": "sendgridlogger",
        "password": "KLJSDG(#@%@!gBigSecret"
    },
    "use_https": false,
    "https": {
        "key_file": "",
        "cert_file": ""
    },
    "days_to_retain_log": 365
}

NOTE: I have use_https set to false because my nginx front-end is already using https.
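If you hand-edit the config, a quick sanity check can catch mismatched settings before you start the service. The sketch below is my own helper, not something SEL ships; it only uses the key names shown in the file above, and the password in the sample is a placeholder.

```python
import json


def check_config(text):
    """Return a list of human-readable problems found in an SEL config."""
    cfg = json.loads(text)
    problems = []
    if not cfg.get("elasticsearch_host"):
        problems.append("elasticsearch_host is not set")
    port = cfg.get("port")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append("port must be an integer between 1 and 65535")
    if cfg.get("use_basicauth"):
        auth = cfg.get("basicauth", {})
        if not auth.get("user") or not auth.get("password"):
            problems.append("use_basicauth is true but user/password is empty")
    if cfg.get("use_https"):
        tls = cfg.get("https", {})
        if not tls.get("key_file") or not tls.get("cert_file"):
            problems.append("use_https is true but key_file/cert_file is empty")
    return problems


# Hypothetical config mirroring the one above, with a placeholder password.
SAMPLE_CONFIG = """
{
    "elasticsearch_host": "localhost:9200",
    "port": 8080,
    "use_basicauth": true,
    "basicauth": {"user": "sendgridlogger", "password": "changeme"},
    "use_https": false,
    "https": {"key_file": "", "cert_file": ""},
    "days_to_retain_log": 365
}
"""

if __name__ == "__main__":
    print(check_config(SAMPLE_CONFIG) or "config looks sane")
```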

Since SEL listens on port 8080 (an unprivileged port), you can run it as yourself rather than as root.

$ pm2 start sendgrid-event-logger -i 0 --name "sendgrid-event-logger"

Verify that SEL is running.

$ pm2 show 0

Describing process with id 0 - name sendgrid-event-logger
┌───────────────────┬──────────────────────────────────────────────────────┐
│ status            │ online                                               │
│ name              │ sendgrid-event-logger                                │
│ restarts          │ 0                                                    │
│ uptime            │ 11m                                                  │
│ script path       │ /usr/bin/sendgrid-event-logger                       │
│ script args       │ N/A                                                  │
│ error log path    │ $HOME/.pm2/logs/sendgrid-event-logger-error-0.log    │
│ out log path      │ $HOME/.pm2/logs/sendgrid-event-logger-out-0.log      │
│ pid path          │ $HOME/.pm2/pids/sendgrid-event-logger-0.pid          │
│ interpreter       │ node                                                 │
│ interpreter args  │ N/A                                                  │
│ script id         │ 0                                                    │
│ exec cwd          │ $HOME                                                │
│ exec mode         │ fork_mode                                            │
│ node.js version   │ 8.11.1                                               │
│ watch & reload .  │ ✘                                                    │
│ unstable restarts │ 0                                                    │
│ created at        │ 2018-02-14T23:36:06.705Z                             │
└───────────────────┴──────────────────────────────────────────────────────┘
Code metrics value
┌─────────────────┬────────┐
│ Loop delay .    │ 0.68ms │
│ Active requests │ 0      │
│ Active handles  │ 4      │
└─────────────────┴────────┘

I use nginx and here is my nginx config for SEL.

/etc/nginx/sites-available $ cat sendgrid-logger
upstream sendgrid_logger {
  server 127.0.0.1:8080;
}

server {
  server_name slog.mysite.org slog;
  listen 443 ssl ;

  include snippets/ssl.conf;
  access_log /var/log/nginx/slog/access.log;
  error_log /var/log/nginx/slog/error.log;
  proxy_connect_timeout 5m;
  proxy_send_timeout 5m;
  proxy_read_timeout 5m;

  location / {
    proxy_pass http://sendgrid_logger;
  }
}

$ sudo ln -s /etc/nginx/sites-available/sendgrid-logger /etc/nginx/sites-enabled/
$ sudo systemctl reload nginx

Make sure the SendGrid Event Webhook is turned on, and you should see events coming in.   Check your Elasticsearch cluster for new indices.

$ curl -s localhost:9200/_cat/indices|grep mail
green open mail-2018.03.31 -g6Tw9b9RfqZnBVYLdrF-g 1 0 2967 0 1.4mb 1.4mb
green open mail-2018.03.28 GxTRx2PgR4yT5kiH0RKXrg 1 0 8673 0 4.2mb 4.2mb
green open mail-2018.04.06 2PO9YV1eS7eevZ1dfFrMGw 1 0 10216 0 4.9mb 4.9mb
green open mail-2018.04.11 _ZINqVPTSwW7b8wSgkTtTA 1 0 8774 0 4.3mb 4.3mb

etc.
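To get a rough count of how many events have been logged so far, the listing above can be summed up with a few lines of Python. This is a sketch that assumes the default _cat/indices column order (health, status, index, uuid, pri, rep, docs.count, ...), which matches the output shown; the function name is my own.

```python
# The four lines of _cat/indices output shown above.
SAMPLE = """\
green open mail-2018.03.31 -g6Tw9b9RfqZnBVYLdrF-g 1 0 2967 0 1.4mb 1.4mb
green open mail-2018.03.28 GxTRx2PgR4yT5kiH0RKXrg 1 0 8673 0 4.2mb 4.2mb
green open mail-2018.04.06 2PO9YV1eS7eevZ1dfFrMGw 1 0 10216 0 4.9mb 4.9mb
green open mail-2018.04.11 _ZINqVPTSwW7b8wSgkTtTA 1 0 8774 0 4.3mb 4.3mb
"""


def total_docs(cat_output, prefix="mail-"):
    """Sum the docs.count column for indices whose name starts with prefix."""
    total = 0
    for line in cat_output.splitlines():
        cols = line.split()
        # cols[2] is the index name, cols[6] is docs.count.
        if len(cols) >= 7 and cols[2].startswith(prefix):
            total += int(cols[6])
    return total


if __name__ == "__main__":
    print(total_docs(SAMPLE))  # 30630 for the four indices above
```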

Go to Kibana and set up an index pattern.  In my case, it’s mail-*.  Go to Discover, select the mail-* index pattern, and play around.

Here is my simple report.  I see around 9am, something happened to cause a huge spike in mail events.

 

Next step is for you to create dashboards to fit your needs.
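Dashboards aside, the support question from the top of this post (did this customer's email get there?) can also be answered with a direct search. The sketch below only builds the query body; it assumes SEL stores the SendGrid fields email and timestamp under those names, which I have not verified against its mapping, so adjust the field names to what you see in Discover.

```python
import json


def recipient_query(email, size=50):
    """Build a search body for the most recent events for one recipient."""
    return {
        "size": size,
        # unmapped_type keeps the sort from failing on an index that has
        # no timestamp field yet.
        "sort": [{"timestamp": {"order": "desc", "unmapped_type": "long"}}],
        "query": {
            "bool": {
                # match_phrase works whether email is mapped as text or keyword.
                "filter": [{"match_phrase": {"email": email}}]
            }
        },
    }


if __name__ == "__main__":
    # customer@example.com is a placeholder address.
    print(json.dumps(recipient_query("customer@example.com"), indent=2))
```

POST the printed body to mail-*/_search with a Content-Type of application/json, and the event field of each hit shows whether the message was delivered, bounced, or deferred.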

Enjoy!

 

Courier Fetch Error: unhandled courier request error: Authorization Exception in Chrome/Safari on Kibana 4.5.0

Getting this error in your Kibana?

You need to increase the maximum HTTP header size, as the default (from Netty) is only 8KB.   You can change the value in your elasticsearch.yml file.

Add the following line (or uncomment it if it is already there).

http.max_header_size: 32kb

 

Fixing ‘plugin:elasticsearch [document_already_exists_exception] [config][4.5.1]: document already exists’

Substitute the version you are upgrading to for ‘4.5.1’. So far I’ve seen this from Kibana 4.1.x through 4.5.1.

It seems that when you upgrade Kibana, there is a timing bug in how Kibana records its current version. You will get lots of these errors in the Kibana logs:

log [08:08:30.649] [error][status][plugin:elasticsearch] Status changed from green to red - [document_already_exists_exception] [config][4.5.1]: document already exists, with: {"shard":"0","index":".kibana"}

These came from my upgrade from 4.5.0 to 4.5.1. I’ve seen the same thing going from 4.1.4 to 4.5.0.

The fix is to delete the config record in your .kibana index. Don’t worry, it gets recreated. No loss as far as I know.

curl -XDELETE elasticsearchserver:9200/.kibana/config/4.5.1

The Kibana bug is documented here: kibana issues #5519.

If deleting the record does not work, you will also need to refresh your .kibana index, which makes the pending delete visible to searches:

curl -XPOST elasticsearchserver:9200/.kibana/_refresh
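Since the version in the URL has to match the one in the error, it can be pulled straight out of the log line. The helper below is a small sketch of my own; the regular expression is based only on the log line quoted in this post, and elasticsearchserver:9200 is the same placeholder host used in the curl commands.

```python
import re

# Matches "[document_already_exists_exception] [config][4.5.1]" and
# captures the version.
CONFIG_DOC = re.compile(
    r"\[document_already_exists_exception\] \[config\]\[([^\]]+)\]"
)


def delete_url(log_line, es_host="elasticsearchserver:9200"):
    """Return the URL of the config record to DELETE, or None if no match."""
    match = CONFIG_DOC.search(log_line)
    if match is None:
        return None
    return "http://{}/.kibana/config/{}".format(es_host, match.group(1))


# The log line quoted earlier in this post.
LOG_LINE = (
    'log [08:08:30.649] [error][status][plugin:elasticsearch] Status changed '
    'from green to red - [document_already_exists_exception] [config][4.5.1]: '
    'document already exists, with: {"shard":"0","index":".kibana"}'
)

if __name__ == "__main__":
    print(delete_url(LOG_LINE))
```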

Kibana 4 with tribe node MasterNotDiscoveredException

I use tribe nodes quite a lot at $work. It’s how we federate disparate ELK clusters and search across them. There are many reasons to have distinct ELK clusters in each data center and/or region.

Some of these are:

1. Elasticsearch does not work well with network latency, which is guaranteed when your nodes are located in geographically distant places. You could spend a lot of money on a fast network connection, or you can just have local clusters only. (Me? I pick saving money and avoiding headaches :-)).

2. It can get insanely expensive to create an ES cluster that spans data centers/regions. The network bandwidth requirement, the data charges, the care and feeding of such a latency-sensitive cluster…. OMG!

3. I don’t really think a 3rd reason is needed.

Although tribe nodes are great for federating ES clusters, there are some quirks in setting them up and caring for them (not as bad as ES clusters that span data centers, though).

One big gotcha for people setting up tribe nodes for the first time is that a tribe node cannot create an index. It can only update or modify an existing one. This means that if you point Kibana at a tribe node, you must first make sure your Kibana index already exists in one of the downstream ES clusters; if it does not, you have to create it yourself.

Otherwise, the first time you create an index pattern and try to save it, you will get an error similar to the subject of this post.

MasterNotDiscoveredException

The error message is wrong and misleading. It has nothing to do with the master node. It has everything to do with the tribe node not being able to create (PUT) a Kibana index.

Personally, I prefer to give the Kibana index that I use with a tribe its own unique name. So I run a dedicated Kibana instance pointing to the dedicated tribe (client) node.

Here are the steps I do to get a tribe node and its associated Kibana ready for use.

1. Configure the tribe node to know all the ES clusters I want to federate data from.

tribe.elasticsearch.yml:

cluster.name: toplevel_tribe
node.name: ${HOSTNAME}
node.master: false
node.data: false
tribe:
  DC1-appservice:
    cluster.name: logging-DC1
    network.host: 0.0.0.0
    network.publish_host: ${HOSTNAME}
    discovery.zen.ping.unicast.hosts:
      - dc1-app13225.prod.example.com
      - dc1-app13226.prod.example.com
      - dc1-app13227.prod.example.com
  DC2-appservice:
    cluster.name: logging-DC2
    network.host: 0.0.0.0
    network.publish_host: ${HOSTNAME}
    discovery.zen.ping.unicast.hosts:
      - dc2-app12281.prod.example.com
      - dc2-app12282.prod.example.com
      - dc2-app12283.prod.example.com
  # DC3 ... etc. to DCNN

  my-es-dedicated-config-cluster:
    cluster.name: es-config-CORP
    network.host: 0.0.0.0
    network.publish_host: ${HOSTNAME}
    discovery.zen.ping.unicast.hosts:
      - corp-app1234.example.com

  on_conflict: prefer_my-es-dedicated-config-cluster

2. Now pre-create the Kibana index in my-es-dedicated-config-cluster. This is a small cluster in my admin/corp data center that only houses configurations, Kibana dashboards, etc.

3. A simpler and more correct way is to temporarily point Kibana to the dedicated ES cluster (instead of the tribe).

Do this via this setting in your kibana.yml file:

# The Elasticsearch instance to use for all your queries.
elasticsearch.url: "http://ES-node:9200"

Start Kibana, let it create the index.  Then stop it, change the setting back to point to your tribe node.

Doing it this way ensures that your Kibana index is created correctly.

curl commands for pre-creating the Kibana (3 and 4) index:

# Kibana3
curl -s -XPUT "http://localhost:9200/kibana3-int/" -d '{ "settings" : { "number_of_shards" : 3, "number_of_replicas" : 2 },
"mappings" : { "temp" : { "properties" : { "dashboard" : { "type" : "string" }, "group" : { "type" : "string" }, "title" : { "type" : "string" }, "user" : { "type" : "string" } } }, "dashboard" : { "properties" : { "dashboard" : { "type" : "string" }, "group" : { "type" : "string" }, "title" : { "type" : "string" }, "user" : { "type" : "string" } } } } }'


# Kibana4
curl -s -XPUT "http://localhost:9200/TRIBENAME-kibana4" -d '{ "index.mapper.dynamic" : true, "settings" : { "number_of_shards" : 1, "number_of_replicas" : 0 },"mappings" : {"search" : {"_timestamp" : { },"properties" : {"columns" : {"type" : "string"},"description" : {"type" : "string"},"hits" : {"type" : "long"},"kibanaSavedObjectMeta" : {"properties" : {"searchSourceJSON" : {"type" : "string"}}},"sort" : {"type" : "string"},"title" : {"type" : "string"},"version" : {"type" : "long"}}},"dashboard" : {"_timestamp" : { },"properties" : {"description" : {"type" : "string"},"hits" : {"type" : "long"},"kibanaSavedObjectMeta" : {"properties" : {"searchSourceJSON" : {"type" : "string"}}},"optionsJSON" : {"type" : "string"},"panelsJSON" : {"type" : "string"},"timeRestore" : {"type" : "boolean"},"title" : {"type" : "string"},"uiStateJSON" : {"type" : "string"},"version" : {"type" : "long"}}},"visualization" : {"_timestamp" : { },"properties" : {"description" : {"type" : "string"},"kibanaSavedObjectMeta" : {"properties" : {"searchSourceJSON" : {"type" : "string"}}},"savedSearchId" : {"type" : "string"},"title" : {"type" : "string"},"uiStateJSON" : {"type" : "string"},"version" : {"type" : "long"},"visState" : {"type" : "string"}}},"config" : {"_timestamp" : { },"properties" : {"buildNum" : {"type" : "long"},"defaultIndex" : {"type" : "string"}}},"index-pattern" : {"_timestamp" : { },"properties" : {"customFormats" : {"type" : "string"},"fieldFormatMap" : {"type" : "string"},"fields" : {"type" : "string"},"intervalName" : {"type" : "string"},"timeFieldName" : {"type" : "string"},"title" : {"type" : "string"}}}}}'
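Before switching Kibana back to the tribe node, it is worth confirming that the index really exists in the downstream cluster. Elasticsearch answers a HEAD request on an index with 200 if it exists and 404 if it does not. The sketch below wraps that check; the function name, the opener parameter (there only so the check can be exercised without a live cluster), and the index name mytribe-kibana4 are all hypothetical.

```python
import urllib.error
import urllib.request


def index_exists(base_url, index, opener=urllib.request.urlopen):
    """Return True if the index exists, False if ES answers 404."""
    url = "{}/{}".format(base_url.rstrip("/"), index)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # auth failures, 5xx, etc. should not be read as "missing"


if __name__ == "__main__":
    # Placeholder host and index name; substitute your own.
    try:
        print(index_exists("http://localhost:9200", "mytribe-kibana4"))
    except OSError as err:
        print("could not reach Elasticsearch:", err)
```

Run it against the dedicated config cluster rather than the tribe node, since that cluster is where the index has to live.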

ELK Operational Tips

I’ve been running ELK clusters for over a year now, and want to share tips and tricks that I’ve found to be useful.

Feel free to post questions and corrections. I’ll try to answer and update when possible.

Elasticsearch

  • Split brain – this is when more than one node in your cluster becomes master.
    • It is best to never let this happen.   Use the quorum rule of thumb: if you have N master-eligible nodes, set discovery.zen.minimum_master_nodes to N/2 + 1.   Even better, set aside a dedicated pool of master nodes (I recommend a minimum of 3 master-capable nodes).
    • If split brain does happen, you want to stop one of the master nodes ASAP.   Depending on whether you have replicas or not, it could be an easy fix, or you might end up having to re-index if your indices have gotten out of sync because a replica was promoted to primary and new index data was sent to it.
  • Failed node(s) – one or more failed nodes.  There are many scenarios, from failing hardware to outages causing data corruption, etc.
  • Planned maintenance – several scenarios.
  • Indexing takes too long.
  • Recovery takes too long.
  • Search/query takes too long.
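The N/2 + 1 rule of thumb from the split-brain bullet can be worked through in a few lines. This sketch reads N as the number of master-eligible nodes and the result as the quorum needed to elect a master (with integer division), which is my interpretation of the rule; the function names are my own.

```python
def minimum_master_nodes(master_eligible):
    """Quorum of master-eligible nodes: N // 2 + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible):
    """How many master-eligible nodes can be lost while keeping a quorum."""
    return master_eligible - minimum_master_nodes(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, minimum_master_nodes(n), tolerated_failures(n))
```

With 2 master-eligible nodes the quorum is 2, so losing either one stops the cluster from electing a master; 3 is the smallest pool that survives the loss of a node, which is why 3 is the recommended minimum.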

Logstash

Kibana