In the world of Ops, it’s always good to learn from mistakes. It is not enough to solve the problem (*fix*); we must also do a post-mortem to understand what went wrong (*root cause*) and what we can do to prevent it in the future (*long-term solution*).
I am of the opinion that long-term solutions are preferable to short-term fixes (hacks!). But long-term solutions are not easy: they almost always require understanding the root cause, and that is not always obvious.
After any incident, crisis, or problem, whatever you want to call it, make sure you have a *blame-free* post-mortem. This is very important. We are not looking to blame anyone; we should focus on the root cause and how to prevent it from happening again. Going into a post-mortem with the right mindset also helps the process go much more smoothly. You will get better cooperation from the involved parties. It’s a team effort to improve everyone’s job.
The process should be something like this.
Assign an owner of the post-mortem process, usually the lead engineer involved in the incident. That person is empowered to call for help from anyone needed.
Assign a specific deadline by which the post-mortem must conclude. You do not want to let it drag on. Let’s get it done and move on. I recommend no more than two weeks from the date of the incident.
Communicate what is expected of the post-mortem output:
When — Timeline of incidents
What — specific details of alerts, failures, etc.
Communications during the incidents — within team, with other teams, internal and external (customers, press).
If you use SendGrid as the service for your outbound email, you will want to monitor it and be able to answer questions such as:
how much email you are sending
status of sent email: success, bounced, delayed, etc.
We get questions all the time from $WORK customer support folks on whether an email sent to a customer got there (the customer claims they never got it). There could be any number of reasons why a customer does not see email sent from us:
our email is filtered into the customer’s spam folder
the email is rejected/bounced by the customer’s mail service
any number of network/server/service-related errors between us and the customer’s mail service
the email address the customer provided is invalid (and the email bounced)
If we have access to event logs from SendGrid, we can quickly answer these types of questions.
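To make that concrete, here is a minimal sketch of how a logged event could be turned into an answer for customer support. It assumes each event is a dict with an `event` field (using SendGrid's documented event type names) and an optional `reason` field; the `EXPLANATIONS` table and the `explain` helper are hypothetical names of mine, and the wording of each answer is only a suggestion.

```python
# Hypothetical lookup table: SendGrid event type -> answer for support.
EXPLANATIONS = {
    "processed": "SendGrid accepted the message; delivery is not confirmed yet.",
    "delivered": "The customer's mail server accepted the message; "
                 "ask them to check their spam folder.",
    "deferred": "The customer's mail server temporarily refused the message; "
                "SendGrid will retry.",
    "bounce": "The customer's mail server rejected the message; "
              "the address may be invalid.",
    "dropped": "SendGrid did not attempt delivery "
               "(for example after an earlier bounce or spam report).",
    "spamreport": "The recipient marked the message as spam.",
}


def explain(event: dict) -> str:
    """Return a support-friendly explanation for one logged event."""
    kind = event.get("event", "")
    base = EXPLANATIONS.get(kind, f"Unrecognized event type: {kind!r}")
    reason = event.get("reason")  # usually present on bounce/dropped events
    return f"{base} (reason: {reason})" if reason else base
```

Note that a `delivered` event only tells you the receiving server accepted the message. Whether it then landed in the inbox or the spam folder is not visible from these logs.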
SendGrid’s Event Webhook will notify a URL of your choice via HTTP POST with information about events that occur as SendGrid processes your email. Common uses of this data are to remove unsubscribes, react to spam reports, determine unengaged recipients, identify bounced email addresses, or create advanced analytics of your email program. With Unique Arguments and Category parameters, you can insert dynamic data that will help build a sharp, clear image of your mailings.
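For a feel of what arrives at that URL, here is a small sketch that parses one webhook POST body and counts events by type. It assumes the body is a JSON array of event objects, each carrying an `event` field; the `count_events` helper and the sample payload are made up for illustration and are not part of any SendGrid library.

```python
import json
from collections import Counter


def count_events(body: bytes) -> Counter:
    """Count events by type in one Event Webhook POST body."""
    events = json.loads(body)
    if not isinstance(events, list):
        raise ValueError("expected a JSON array of events")
    # Events without an "event" field are counted as "unknown".
    return Counter(
        e.get("event", "unknown") for e in events if isinstance(e, dict)
    )


# Made-up sample payload, trimmed to a few fields.
sample = b"""[
  {"email": "a@example.com", "timestamp": 1522454400, "event": "delivered"},
  {"email": "b@example.com", "timestamp": 1522454401, "event": "bounce"},
  {"email": "c@example.com", "timestamp": 1522454402, "event": "delivered"}
]"""

counts = count_events(sample)
```

Counting per type like this is roughly what the Kibana report later in this post does, just without the storage and the charts.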
Log in to your SendGrid account and click on Mail Settings.
Then click on Event Notification.
In HTTP Post URL, enter the URL of the service endpoint you are going to set up next.
For example, mine is (not a valid endpoint, but close enough): https://sendlog.mydomain.com/logger
You will want to install pm2 (nodejs Process Manager version 2), a very nice process manager for Node.js.
Next, edit and configure sendgrid-event-logger (SEL for short). If the default config works for you, there is no need to change anything. Make sure it points to where your ES host is located (mine runs on the same instance, hence localhost). I also left SEL listening on port 8080, as that port is available on this instance.
Make sure the SendGrid Event Webhook is turned on, and you should see events coming in. Check your Elasticsearch cluster for new indices.
$ curl -s localhost:9200/_cat/indices|grep mail
green open mail-2018.03.31 -g6Tw9b9RfqZnBVYLdrF-g 1 0 2967 0 1.4mb 1.4mb
green open mail-2018.03.28 GxTRx2PgR4yT5kiH0RKXrg 1 0 8673 0 4.2mb 4.2mb
green open mail-2018.04.06 2PO9YV1eS7eevZ1dfFrMGw 1 0 10216 0 4.9mb 4.9mb
green open mail-2018.04.11 _ZINqVPTSwW7b8wSgkTtTA 1 0 8774 0 4.3mb 4.3mb
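If you want numbers out of that listing instead of eyeballing it, a few lines of Python will do. This is only a sketch: it assumes the default `_cat/indices` column order (health, status, index, uuid, pri, rep, docs.count, ...), and `parse_cat_indices` is a hypothetical helper name.

```python
def parse_cat_indices(text: str, prefix: str = "mail-") -> dict:
    """Map index name -> document count from `_cat/indices` output."""
    counts = {}
    for line in text.splitlines():
        cols = line.split()
        # Column 3 is the index name, column 7 is docs.count.
        if len(cols) < 7 or not cols[2].startswith(prefix):
            continue
        counts[cols[2]] = int(cols[6])
    return counts


# The listing from above, pasted in as-is.
listing = """\
green open mail-2018.03.31 -g6Tw9b9RfqZnBVYLdrF-g 1 0 2967 0 1.4mb 1.4mb
green open mail-2018.03.28 GxTRx2PgR4yT5kiH0RKXrg 1 0 8673 0 4.2mb 4.2mb
green open mail-2018.04.06 2PO9YV1eS7eevZ1dfFrMGw 1 0 10216 0 4.9mb 4.9mb
green open mail-2018.04.11 _ZINqVPTSwW7b8wSgkTtTA 1 0 8774 0 4.3mb 4.3mb
"""

docs = parse_cat_indices(listing)
total = sum(docs.values())  # 30630 events across the four daily indices
```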
Go to Kibana and set up an index pattern. In my case, it’s mail-*. Go to Discover, select the mail-* index pattern, and play around.
Here is my simple report. Around 9am, something happened that caused a huge spike in mail events.
Next step is for you to create dashboards to fit your needs.
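Dashboards aside, the customer support question ("did the email to this address get there?") can also be asked of Elasticsearch directly. Below is a rough sketch of a search body for that. The field names `email` and `timestamp` are assumptions based on the SendGrid event payload, so check them against your own index mapping; `recipient_search_body` is a hypothetical helper name.

```python
import json


def recipient_search_body(address: str, size: int = 20) -> dict:
    """Build a search body for the most recent events for one recipient."""
    return {
        "size": size,
        # Assumes "timestamp" is mapped as a sortable (date or numeric) field.
        "sort": [{"timestamp": {"order": "desc"}}],
        # Assumes the recipient address is stored in an "email" field.
        "query": {"match": {"email": {"query": address, "operator": "and"}}},
    }


body = json.dumps(recipient_search_body("customer@example.com"))
```

POST the resulting JSON to the `_search` endpoint of the mail-* indices (on my setup that would be localhost:9200/mail-*/_search) and read the `event` field of the hits from newest to oldest.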
You may have noticed that this blog has been mostly unavailable or showing 5xx errors lately. It’s because I am on AWS, and the recent Intel vulnerabilities have all the cloud vendors patching and rebooting their hypervisors. This is causing various issues with my infrastructure.
I don’t blame the vendors; they are doing what they are supposed to be doing :-). I am waiting for my turn: when the clouds are done with their patching, I have to patch my instances and reboot them too. Ugh, joy….