ELK Operational Tips

I’ve been running ELK clusters for over a year now, and I want to share some tips and tricks that I’ve found useful.

Feel free to post questions and corrections. I’ll try to answer and update when possible.

Elasticsearch

  • Split brain – this is when more than one node in your cluster believes it is the master.
    • It is best to avoid ever having this happen.   Use the rule of thumb: if you have N master-eligible nodes, set the minimum number of master-eligible nodes required to elect a master to N/2 + 1.   Even better, set aside a dedicated pool of master nodes (I recommend a minimum of 3 master-capable nodes).   See the config sketch after this list.
    • If split brain does happen, you want to stop one of the master nodes ASAP.   Depending on whether you have replicas or not, it could be an easy fix, or you might end up having to re-index if your indices have gotten out of sync by having a replica promoted to primary and new index data sent to it.
  • Failed node(s) – there are many scenarios, from failing hardware to outages causing data corruption.
  • Planned maintenance – several scenarios.
  • Indexing takes too long.
  • Recovery takes too long.
  • Search/query takes too long.
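
For the quorum rule and the dedicated-master pool, here is a minimal elasticsearch.yml sketch, assuming the 3 master-eligible nodes recommended above (these are the 1.x-era zen discovery settings):

# On every node: with 3 master-eligible nodes, quorum is 3/2 + 1 = 2
discovery.zen.minimum_master_nodes: 2

# On the 3 dedicated master nodes
node.master: true
node.data: false

# On the data-only nodes
node.master: false
node.data: true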

Logstash

Kibana


Online debugging/tutorial tools

In the course of my career, I’ve jumped from one platform, OS, or programming/scripting language to another.   I’ve found that what makes it easier to transition into a new “whatever” is the quality of the tutorial and debugging tools available to me.

Besides local tools, there are some awesome web sites that are set up to help with debugging various problems.   I am going to try to compile them here.

Please feel free to let me know of others that I’ve missed.

Go

Javascript

Python


Regular Expressions

logstash-forwarder TLS handshake errors

I started using logstash-forwarder to send logs from my cloud-hosted servers to my ELK server for analysis.   Since it’s just a simple setup, I used the self-generated cert as described on logstash-forwarder’s github page.

Unfortunately, the example generates a cert that is only good for 30 days.   So suddenly my Kibana graphs showed no data for my cloud servers…   After some digging, I found errors like this in the log:

 logstash-forwarder[4367]: 2014/07/01 23:24:08.559691 Failed to tls handshake with 172.25.28.52 x509: certificate has expired or is not yet valid

Running openssl x509 -in logstash-forwarder.crt -noout -text showed that the Validity period was only 30 days.  D’oh! 🙂

So I generated a new set, this time good for 10 years.  Why not? It’s for my own use, and if I am still using it 10 years from now…

openssl req -x509 -batch -nodes -newkey rsa:2048 -days 3650 -keyout logstash-forwarder.key -out logstash-forwarder.crt
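
A quick way to confirm the validity window of the new cert before deploying it (plain openssl usage, nothing specific to logstash-forwarder):

openssl x509 -in logstash-forwarder.crt -noout -dates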


Update 2014-07-28

Tried to bring up another server with logstash-forwarder, except this time I used the latest logstash-forwarder (git pull as of 2014/07/25) and started getting this error on startup:

Failed to tls handshake with 172.25.28.52 x509: certificate is valid for , not foo.bar.le.org

After a bit of debugging, comparing certs (exact same MD5 as the ones on working servers), I went googling and bingo!

https://github.com/elasticsearch/logstash-forwarder/issues/221

I see people blaming Go v1.3 TLS changes, but I am still using the same Go v1.2.1 that I used to build the currently working logstash-forwarder.   And as a matter of fact, copying the logstash-forwarder binary from an existing working server over to the new one works just fine!   So I do not think it’s Go, but something in the latest commits to logstash-forwarder that broke TLS.

Update 2014-08-17

Turned out to be my self-generated cert ;-P   I created a new one, using a properly filled-out openssl.cnf and a wildcard domain.  That works fine with the latest trunk, built using Go v1.2.1.   I’ll update to Go v1.3 soon.
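
For reference, here is a minimal openssl.cnf sketch along those lines; the wildcard name is my assumption, based on the foo.bar.le.org hostname in the error above (a wildcard only matches one label, so *.bar.le.org covers foo.bar.le.org):

[req]
prompt             = no
distinguished_name = req_distinguished_name
x509_extensions    = v3_req

[req_distinguished_name]
CN = *.bar.le.org

[v3_req]
subjectAltName = DNS:*.bar.le.org

Then generate the cert with the same command as before, pointing at the config:

openssl req -x509 -batch -nodes -newkey rsa:2048 -days 3650 -config openssl.cnf -keyout logstash-forwarder.key -out logstash-forwarder.crt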


adding CORS support to elasticsearch-head plugin

There are two vulnerabilities in Elasticsearch that I recently patched in my installations.

One is the ‘script’ vuln, mentioned here.

Fix by adding

script.disable_dynamic: true

to your elasticsearch.yml config file.

The other one has to do with CORS: an overly permissive allow-origin can expose data via the REST endpoints to any website you visit.

Fix by adding

http.cors.allow-origin: "http://your.FQDN.domain.name"

to your elasticsearch.yml config file.
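
Depending on your Elasticsearch version, you may also need to switch CORS on explicitly; a minimal sketch, with the origin being the same placeholder as above:

http.cors.enabled: true
http.cors.allow-origin: "http://your.FQDN.domain.name"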

In fixing the second one (CORS), I ran into a problem: it broke my usage of the elasticsearch-head plugin.  I use the plugin as a checked-out git repo on my laptop and port forward to the actual ES server.   E.g., the URL I use is something like this:

file:///Users/tinle/src/opensource/elasticsearch-head/index.html?base_uri=http://127.0.0.1:9200/
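
The port forward itself is just an SSH tunnel; a sketch, where the user and host names are placeholders:

ssh -L 9200:127.0.0.1:9200 user@your.es.server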

So I ended up having to patch elasticsearch-head to make it work with CORS.

diff --git a/dist/app.js b/dist/app.js
index 5bce2a3..7e58acb 100644
--- a/dist/app.js
+++ b/dist/app.js
@@ -1188,6 +1188,9 @@
                request: function( params ) {
                        return $.ajax( $.extend({
                                url: this.base_uri + params.path,
+      /**
+       * 2014/06/01 tinle
+       **/
                                dataType: "jsonp",
         crossDomain: true,
                                error: function(xhr, type, message) {
diff --git a/dist/vendor.js b/dist/vendor.js
index fb1a448..2b74180 100644
--- a/dist/vendor.js
+++ b/dist/vendor.js
@@ -6838,6 +6838,10 @@ jQuery.each( [ "get", "post" ], function( i, method ) {
                return jQuery.ajax({
                        type: method,
                        url: url,
+      /**
+       * HACK 2014/06/03 tinle
+       */
+      crossDomain: true,
                        data: data,
                        success: callback,
                        dataType: type
@@ -14439,4 +14443,4 @@ under the License.
                }
                throw "could not process value " + v;
        };
-})();
\ No newline at end of file
+})();


Updated: 6/4/2014 – I think the above patch should work.  I’ve been using it for the last few days and I am able to GET/PUT/POST, i.e. make changes to ES via elasticsearch-head.
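
A quick way to sanity-check the CORS config from the command line, independent of the plugin (the origin is the placeholder from the config above); if everything is set up correctly, the response headers should include Access-Control-Allow-Origin:

curl -i -H "Origin: http://your.FQDN.domain.name" http://127.0.0.1:9200/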