How to set limits on systemd services

This is a cookbook-style guide on how to set limits (ulimit style) on your custom services that are managed by systemd.

Use case

Why would you want to do something like this?

You might be running on a small server (or instance, if you are using cloud services) and want to prevent your application from affecting other services sharing that server (think of the noisy neighbor problem).

Generally, the Linux kernel scheduler does a good job of fairly sharing system resources, but that assumes you have a well-behaved application.

Sometimes you want to pack applications tightly and don't mind applications being less performant.

In summary, there are lots of reasons why you might want to tune the resources allocated to your applications.

Luckily, if you are using systemd as the controller, you can take advantage of its capabilities.

Note:

There are some caveats. You need to be using a fairly recent kernel and Linux distribution, such as a recent Ubuntu/Debian or CentOS/RedHat/Fedora.

What

I am going to show you how to get cloudquery running under systemd on Ubuntu 20.04 LTS. The reason I want to do this is that cloudquery will use as much memory as it can and trigger the Linux OOM killer.

How

There are 3 files needed (a sketch for checking your cgroup mode follows this list):

  • /etc/default/cloudquery
    • This file contains the definition of CQ_SERVICE_ACCOUNT_KEY_JSON, the value of which is the JSON content of your service account key file.
    • Example:
      • CQ_SERVICE_ACCOUNT_KEY_JSON='{ "type": "service_account", "project_id": "foobar", "private_key_id": "1a23b456cd134", "private_key": "-----BEGIN PRIVATE KEY-----\n.....vA8r\n-----END PRIVATE KEY-----\n", "client_email": "[email protected]", "client_id": "1234567890", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/foobar-sa%40foobar.iam.gserviceaccount.com" }'

  • /lib/systemd/system/cloudquery_limit.slice
    • [Unit]
      Description=Slice that limits memory for all my services

      [Slice]
      # MemoryHigh works only in "unified" cgroups mode, NOT in "hybrid" mode
      # Must add 'systemd.unified_cgroup_hierarchy=1' to GRUB_CMDLINE_LINUX_DEFAULT
      # in /etc/default/grub
      MemoryHigh=10240M
      # MemoryMax works in "hybrid" cgroups mode, too
      MemoryMax=10240M

  • /etc/systemd/system/cloudquery.service
    • [Unit]
      Description=Cloud Query
      Documentation=cloudquery.README.md
      After=network.target

      [Service]
      Slice=cloudquery_limit.slice
      EnvironmentFile=-/etc/default/cloudquery
      ExecStart=/usr/local/bin/cloudquery --config /data/cq/config.hcl fetch
      ExecReload=/bin/kill -HUP $MAINPID
      KillMode=process
      Restart=on-failure
      RestartPreventExitStatus=255
      Type=simple
      WorkingDirectory=/data/cq
      RuntimeDirectory=cq
      RuntimeDirectoryMode=0755
      LimitNOFILE=64000
      User=cloudquery
      Group=cloudquery

      [Install]
      WantedBy=multi-user.target
      Alias=cloudquery.service
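
The slice file's MemoryHigh directive only takes effect in unified cgroup (cgroup v2) mode, so it is worth verifying which mode your system is in before relying on it. A minimal sketch, assuming Ubuntu with GRUB (cgroup2fs means unified; anything else means you still need the grub change mentioned in the slice file's comments):

stat -fc %T /sys/fs/cgroup/
# if the output is not cgroup2fs: add systemd.unified_cgroup_hierarchy=1 to
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then
sudo update-grub
sudo reboot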

Once you have all 3 files in place and have edited the values to match your particular system, you need to tell systemd to reload its configuration and pick up the new units, by running

systemctl daemon-reload

Once you have done that, you can check whether systemd sees your new service by running

systemctl list-unit-files | grep query

Smoke Test

Test to see if everything works by starting your service.

systemctl start cloudquery
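
If you also want the service to start automatically at boot (which is what the [Install] section above is for), enable it:

systemctl enable cloudquery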

Check (and debug) the status of your new service via

systemctl status cloudquery

and journalctl -xe
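
To confirm that the limits actually landed on your service, you can also query the unit and the slice directly, or watch live usage with systemd-cgtop. The exact values shown depend on your systemd version and on memory accounting being enabled:

systemctl show cloudquery.service -p Slice -p MemoryCurrent
systemctl show cloudquery_limit.slice -p MemoryHigh -p MemoryMax
systemd-cgtop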

Thanks to the posts at https://unix.stackexchange.com/questions/436791/limit-total-memory-usage-for-multiple-instances-of-systemd-service for pointing me in the right direction.

Site instabilities due to Meltdown and Spectre (indirectly)

You may have noticed that this blog has been mostly unavailable or showing 5xx errors lately. It's because I am on AWS, and the recent Intel vulns have all the cloud vendors patching and rebooting their hypervisors. That is causing various issues with my infrastructure.

I don't blame the vendors; they are doing what they are supposed to be doing :-). I am waiting for my turn… when the clouds are done with their patching, I have to patch my instances and reboot them too. Ugh, joy…

Moving or copying files from one Google drive account to another

I have seen questions on the web about how to migrate (copy/move) files from one GDrive account to another. There are many reasons, such as migrating from one Google account (e.g., a company account) to your personal account.

WARNING: you may be violating your company policy by moving/copying files from your company Google account to a personal one. I advise you to consult your company security officer or equivalent before doing this.

There are other reasons for wanting to copy or move a large number of files from one GDrive to another. Mine is this: I shared a folder in my GDrive with my family as a central location for our family photos. My family members have Google accounts, and their own GDrives. It seems that Google makes it painful to copy files from one GDrive to another. Their suggestion is some form of downloading the files to your local drive first, and then uploading them to the other GDrive.

This is painful!!! There are so many reasons why it’s painful…. 😉

The solution I've used is to install the Google Drive app (it supports OS X, Windows, Linux, Android and iOS).

Link the Google Drive app to one Google account; you can then treat the files in it as if they were on your local drive and drag them from there to the GDrive account you want to copy to.

Summary of Overcoming RoR Performance Challenges Meetup on Wed 2/29/12

Overcoming RoR Performance Challenges Meetup

The talk was on best practices, with tips on looking for problems and how the panelists worked around them. There is no “magic” bullet in Ruby or RoR 🙂

Essentials:

  1. Watch out when using ActiveRecord. It makes it too easy to use the DB. It makes it too easy to use the DB. One more time, it makes it too easy to use the DB.

    Essentially, ActiveRecord and the DB are not always the right tools. Sometimes another tool works better for a particular problem.

    Things mentioned:

    • Using Redis as a queueing system to buffer writes, which later go to the DB (this is what Blitz and Bleacher Report use to increase their performance).
    • Use NoSQL (CouchBase, Mongo, and Cassandra were mentioned as being used by panelists).
    • Cache results as much as possible. Don’t hit DB all the time.
    • Hand-optimized queries might be needed. ActiveRecord is not the best at generating optimized DB calls.
  2. Cache as much as possible. Bleacher Report puts caching layers everywhere: memcached, front-end web cache, etc. They also have scripts that pre-warm their caches (“the goal is to never have users be the ones who trigger a cache request”).

    Use the cache in newer RoR (3.2).

  3. Write code in ways that make it easy to update to latest Ruby and RoR.

    Ruby EE has flags that allow you to use more memory for its internal cache. Sometimes it makes sense to test and try different memory configurations there (based on two panelists' experiences). RoR 3.2 has good Rack/Rails caching. Read the docs and use them.

  4. Background processes.
    • Use bg proc whenever possible.
    • Anytime you need to make calls to an external website (external API), use a bg process, so you don't tie up your RoR web process.
    • Blitz puts jobs into a Redis queue; a background server checks the queue for jobs, runs them, and puts partial results back into Redis; an Ajax call then checks and formats/displays the results to the web client.
    • Bleacher Report and Mixbook also do similar things. They use Redis as a job queueing system, among other things (see 1 above).
  5. They all mentioned using another web server for production (not WEBrick). The following were mentioned as being used by panelists:
    • Passenger
    • Thin
    • Unicorn
  6. Related to ActiveRecord above is the N+1 problem, where you add one line of code and the number of DB calls increases manifold.
    • The advice is essentially to develop and use coding best practices, and to train developers to look out for them.
    • There is a possible test that can be used to automate looking out for N+1 issues.
    • Solving the N+1 problem with special tests: http://en.oreilly.com/rails2009/public/schedule/detail/8615 (query testing – see page 75 of the PDF slides).
    • Panelists all recommended RSpec for automated testing.
  7. Monitoring for issues and performance.
    • All panelists pointed to New Relic as the tool they use all the time.
    • The host of the meeting, Blitz, also did a marketing spiel on their performance-testing tool (it looks really good, and is available as a plugin on Heroku). I am going to test it and see about using it for performance/load testing our site.
  8. For ease of scaling infrastructure, leverage AWS EC2, Heroku, Engine Yard and other cloud providers.

Migrating from Bamboo to Cedar (Celadon stack) on Heroku

We just migrated from Bamboo (bamboo-ree-1.8.7) to Cedar (Celadon, Rails 3.1 stack) on Heroku. Here are the steps I took to make it work. Note that I didn't do the code migration; my dev handled that, I did the server and infrastructure part.

Basically this means moving from Rails 2.x (and Ruby 1.8.7) to Rails 3.1 (Ruby 1.9.2). Some of the changes require upgrading gem packages. Assets such as images, CSS, and JS have to be put into an asset bundle.

On the site itself, I left the old site running at, e.g., oldsite.heroku.com, and created a new site at newsite.herokuapp.com.
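
For the record, creating the second app is just the usual create command (newsite is a placeholder name):

heroku create newsite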

We created a new git dev branch and pushed to newsite, e.g.

git push heroku dev:master
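
Since the old app and the new app both exist at this point, you will likely want a separate git remote per app rather than reusing the default heroku remote. The remote name newsite and the git URL below are just placeholders for whatever your setup uses:

git remote add newsite [email protected]:newsite.git
git push newsite dev:master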

Since we use SSL, I have to make sure the custom_domains addon is there. But…

heroku addons:add custom_domains:basic custom_domains:wildcard

I can't add the ssl addon until I am ready, because Heroku requires that I define the domains for the app first! Chicken and egg, as it means I have to take down the currently running production site.

So, make sure everything is running on the new site first, because the next steps mean the production site will be down during the changes.

0. Make sure your SSL cert is up-to-date and you have both parts, domain.crt and private.key. And most importantly, your key must be passphrase-less!

1. Make sure your DNS records are updated and have the shortest possible TTL, since you are about to change them.

2. Wait until the DNS changes (TTL) have settled, which could take a few hours. Then update your production site CNAME from *.yourdomain.com to newsite.herokuapp.com.
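
To check the current TTL and target of a record before and after the change, dig works well (www.yourdomain.com is a placeholder):

dig +noall +answer www.yourdomain.com CNAME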

3. Clear out the domains in your old Heroku site:

heroku domains:clear --app oldsite

4. Add domains to new site:

heroku domains:add '*.yourdomain.com' --app newsite
heroku domains:add 'www.yourdomain.com' --app newsite

5. Upload SSL cert to your new app:

heroku ssl:add yourdomain.crt private.key

6. Now you can add the ssl addon.

heroku addons:add ssl:hostname

7. You should get an email from Heroku with the hostname to point your SSL DNS record to. The name should be something like this:

appid1234567herokucom-1234567890.us-east-1.elb.amazonaws.com

Update your DNS CNAME for www.yourdomain.com to point to this name.

8. Test, test, and test your spanking-new site again. Add the New Relic addon if you haven't already (life saver!) and monitor traffic to your new site.
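
For reference, adding New Relic is just another addon; the plan name here (standard) is a guess, so substitute whichever plan you are on:

heroku addons:add newrelic:standard --app newsite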

9. Monitor the log of your new site:

heroku logs -t --app newsite

10. Check the processes:

heroku ps --app newsite

11. Wait a week; if everything looks good, you can take down the old site.
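
When you are ready to retire it, something like the following should do it (the --confirm flag skips the interactive prompt; double-check the app name first):

heroku apps:destroy --app oldsite --confirm oldsite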