Interesting tech news today

The Wayback Machine (aka Internet Archive)

Watch a video on the history of the Internet Archive.

[archiveorg wayback-machine-1996 width=640 height=480 frameborder=0 webkitallowfullscreen=true mozallowfullscreen=true]

Massive Twitch hack

Source code and creator payouts part of massive hack.

https://www.theverge.com/2021/10/6/22712250/twitch-hack-leak-data-streamer-revenue-steam-competitor

Site Reliability Engineer Training

LinkedIn has open-sourced its SRE online training materials. This is a wonderful gesture from LinkedIn.

It will be useful both to those wanting to enter the SRE field and to those who want to learn more about it. I think it is also a nice way to round out the fuzzy areas for practicing SREs. What I mean is that we all have areas where we are domain experts, and areas where we can get by but are not comfortable.

The following section is lifted verbatim from LinkedIn.

There is a vast amount of resources scattered throughout the web on what the roles and responsibilities of SREs are, how to monitor site health, production incidents, define SLO/SLI etc. But there are very few resources out there guiding someone on the basic skill sets one has to acquire as a beginner. Because of the lack of these resources, we felt that individuals have a tough time getting into open positions in the industry. We created the School Of SRE as a starting point for anyone wanting to build their career as an SRE.

In this course, we are focusing on building strong foundational skills. The course is structured in a way to provide more real life examples and how learning each of these topics can play an important role in day to day SRE life. Currently we are covering the following topics under the School Of SRE:

HOW-TO customize Grafana legend/label

A question I’ve seen asked many times on the web is how to shorten a Grafana legend/label.

E.g. using the hostname as a legend will usually return the full FQDN, which can be too long if you have many hosts and makes a mess of your panel.

standard legend using FQDN

Lots of searching shows people with similar questions and a number of requests for enhancements. In my case, there is a simple solution that works. Here is how.

Use the function label_replace(). So

rate(nginx_http_requests_total{instance=~"$instance", host="$host"}[5m])

turns into

label_replace(rate(nginx_http_requests_total{instance=~"$instance", host="$host"}[5m]), "hname", "$1", "instance", "(.*).foo.bar.local")
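For reference, the signature of label_replace() in the Prometheus documentation is

label_replace(v instant-vector, dst_label string, replacement string, src_label string, regex string)

It matches regex against the value of src_label for each series in v and, when the regex matches, writes replacement (which can reference capture groups such as $1) into dst_label, leaving all other labels untouched.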


And the legend format changes from {{instance}}-{{status}} to {{hname}}-{{status}}.
Shorter legend
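If your hosts do not all share the .foo.bar.local suffix, a more generic pattern is to capture everything up to the first dot or colon (which also drops any :port). This is just a sketch built on the same query as above; adjust the regex to whatever your instance values actually look like:

label_replace(rate(nginx_http_requests_total{instance=~"$instance", host="$host"}[5m]), "hname", "$1", "instance", "([^.:]+).*")

The legend format stays {{hname}}-{{status}}.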

Learning from Operational mistakes

In the world of Ops, it’s always good to learn from mistakes. It is not enough that we solved a problem (*fix*); we must also do a post-mortem to understand what went wrong (*root cause*) and what we can do to prevent it in the future (*long-term solution*).

I am of the opinion that long-term solutions are preferable to short-term fixes (hacks!). But long-term solutions are not easy; they almost always require understanding the root cause, and that is not always obvious.

After any incident, crisis, problem, whatever you want to call it, make sure you have a *blame-free* post-mortem. This is very important. We are not looking to blame anyone; we should be focusing on the root cause and how it can be prevented from happening again. Going into a post-mortem with the right mindset also helps the process go much more smoothly, and you will get better cooperation from the parties involved. It’s a team effort to improve everyone’s job.

The process should be something like this.

  • Assign an owner of the post-mortem process, usually the lead engineer involved in the incident. That person is empowered to call for help from anyone needed.
  • Assign a specific time-frame by which the post-mortem must conclude. You do not want to let it drag on; get it done and move on. I recommend no more than two weeks from the date of the incident.
  • Communicate what is expected of the post-mortem output.
    • When — Timeline of incidents
    • What — specific details of alerts, failures, etc.
    • Communications during the incidents — within team, with other teams, internal and external (customers, press).
    • Root Cause Analysis
    • Prevention
      • Training – better training
      • Monitoring – better monitoring (add monitor, alerts)
      • Failure detection – missed failures
      • SPOF – Single Point of Failure. Add redundancy. Re-architect where needed.
      • etc.

It’s good if we can learn from past mistakes. It is even better if we can learn from others’ mistakes!

Here is the start of a list of Operational mistakes published on the web. I will be adding more as I find them. Feel free to submit any that I missed. Thanks!

Kubernetes


Very nice post from David Henke: