How to perform an Elasticsearch Rolling Restart

Searches run directly against a traditional database often take a long time to respond. This is where search engines like Elasticsearch come into play: they store, retrieve, and manage data in a NoSQL document store. Even though the primary purpose of Elasticsearch is to keep data available at all times, there may come a time when you need to restart the cluster. Doing this the right way is important so that the cluster stays operational during maintenance, even when it is a heavily active production cluster.

To do this, Elasticsearch supports a rolling restart, in which each node in the cluster is stopped and started in turn. This allows the node undergoing maintenance to pick up configuration changes that require a restart while the other nodes keep the cluster available.

Reasons to restart a cluster

The most common reasons to restart a cluster are:

  1. Elasticsearch version upgrades, to make newly released features available on the stack.
  2. Maintenance of the server itself (such as an OS update, or hardware).
  3. Changes to the configuration file (elasticsearch.yml).
  4. Duplicate directory errors.

Whatever the reason, shutting down the entire cluster is rarely a practical option. It is generally better to perform a rolling restart, where the nodes are restarted one by one.

Steps to perform Elasticsearch Rolling Restart

By default, Elasticsearch ensures that your data is fully replicated and evenly balanced. So, if a single node is shut down for maintenance, the cluster will recognize the loss of the node and begin rebalancing. Very large shards take a long time to rebalance, so if the node's maintenance will be short-lived, letting the cluster rebalance on its own is wasted work; think of replicating 1 TB of data.

Hence, a planned restart of a node should include disabling shard allocation to avoid any unnecessary rebalancing.

Elasticsearch should be told to hold off on rebalancing on its own, since we know more about the state of the cluster than it does: the node is coming back shortly. Here are the steps to perform a rolling restart-

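Step 1- Stop indexing new data- This is not always possible (e.g. on a heavily active production cluster) but helps to speed up recovery. If you can, pause indexing in the clients that write to the cluster, then flush the target indices. Note that the flush does not stop indexing by itself; it only commits what has already been indexed to disk. To flush, run the following-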

curl -X POST "localhost:9200/<target>/_flush?pretty"

Step 2- Disable shard allocation- This prevents Elasticsearch from rebalancing missing shards until you tell it otherwise. As stated earlier, if the maintenance window is short, this is a good idea. You can disable allocation as follows-

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'{ "transient" :{"cluster.routing.allocation.enable" : "none"}}'
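Before stopping the node, it can be worth confirming that the setting was actually applied. The following is a minimal sketch, not part of the original procedure: `allocation_mode` and the `ES_URL` variable are hypothetical names, and the `sed` expression assumes the compact (non-pretty) JSON that the cluster settings API returns with `flat_settings=true`.

```shell
# ES_URL is an assumed variable; it defaults to the address used in this article.
ES_URL="${ES_URL:-localhost:9200}"

# Print the current value of cluster.routing.allocation.enable,
# or print nothing if the setting has not been set explicitly.
allocation_mode() {
  curl -s "$ES_URL/_cluster/settings?flat_settings=true" |
    sed -n 's/.*"cluster\.routing\.allocation\.enable" *: *"\([a-z_]*\)".*/\1/p'
}
```

Calling `allocation_mode` after Step 2 should print `none`; if it prints nothing, the setting is still at its default.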

Step 3- Shut down a single node using these commands-

  • With systemd:
sudo systemctl stop elasticsearch.service
  • With SysV init:
sudo -i service elasticsearch stop
  • For daemon- Manually kill the process running Elasticsearch

Step 4- Perform maintenance/upgrade on that node.

Step 5- Start the node-

  • If you are running Elasticsearch with systemd:
sudo systemctl start elasticsearch.service
  • If you are running Elasticsearch with SysV init:
sudo -i service elasticsearch start

Step 6- Check logs- Look at the logs in /var/log/elasticsearch to confirm that the node has rejoined the cluster.

Step 7- Reenable shard allocation-

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'{ "transient" :{"cluster.routing.allocation.enable" : "all"}}'

Step 8- Check cluster health- The status should become green and can be checked using-

curl -XGET 'localhost:9200/_cluster/health?pretty'
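Rather than re-running the health check by hand, the health API can be asked to block until the cluster reaches green. This is a small sketch under assumptions: `wait_for_green` and `ES_URL` are hypothetical names, while `wait_for_status` and `timeout` are parameters of the cluster health API.

```shell
# ES_URL is an assumed variable; it defaults to the address used in this article.
ES_URL="${ES_URL:-localhost:9200}"

# Succeed if the cluster reports green within the timeout (default 60s),
# fail if the request times out with the cluster still yellow or red.
wait_for_green() {
  curl -s "$ES_URL/_cluster/health?wait_for_status=green&timeout=${1:-60s}" |
    grep -q '"status" *: *"green"'
}
```

For example, `wait_for_green 5m || echo "cluster is not green yet"` gives the shards five minutes to recover before reporting a problem.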

Step 9- Repeat steps 2 through 8 for the other nodes in the cluster if required, waiting for the cluster to return to green before moving on to the next node.

Step 10- Resume indexing- Indexing can be safely resumed (if it was previously stopped) once the restarted nodes are back. Waiting until the cluster is fully balanced before resuming indexing will help recovery finish sooner.
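Steps 2 through 8 can be tied together in a loop over the nodes. The sketch below is one possible shape, not a definitive implementation. It rests on assumptions that do not come from the article: passwordless `ssh` to every node, Elasticsearch managed by systemd, node names identical to the host names passed in, and `ES_URL` reachable from where the script runs. All function names are hypothetical.

```shell
#!/bin/sh
# Sketch only; see the assumptions listed above.
ES_URL="${ES_URL:-localhost:9200}"

# Steps 2 and 7: set cluster.routing.allocation.enable to "none" or "all".
set_allocation() {
  curl -s -X PUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "{\"transient\":{\"cluster.routing.allocation.enable\":\"$1\"}}" >/dev/null
}

# Step 6: succeed once the named node is listed by _cat/nodes.
node_joined() {
  curl -s "$ES_URL/_cat/nodes?h=name" |
    awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Step 8: succeed once the cluster health status is green.
cluster_green() {
  curl -s "$ES_URL/_cluster/health" | grep -q '"status" *: *"green"'
}

# Restart every node given as an argument, one at a time.
rolling_restart() {
  for node in "$@"; do
    set_allocation none
    ssh "$node" sudo systemctl stop elasticsearch.service
    # Step 4: node-specific maintenance or upgrade commands would go here.
    ssh "$node" sudo systemctl start elasticsearch.service
    # Poll until the node is back; a failed request simply triggers a retry.
    until node_joined "$node"; do sleep 5; done
    set_allocation all
    until cluster_green; do sleep 5; done
  done
}
```

With hypothetical host names, a run would look like `rolling_restart es-node-1 es-node-2 es-node-3`. Polling in a loop, instead of failing on the first error, keeps the script going while the node that serves `ES_URL` is itself being restarted.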

Perform Rolling Restart using unSkript

Upgrading and maintaining a cluster should be a regular process in an organization, but it can be tricky to do it the right way and keep the nodes available so that indexing is not interrupted. unSkript provides an open-source runbook to perform an Elasticsearch rolling restart and standardize this automation as part of your organization's DevOps practice.

You can try it out on our open-source Awesome-CloudOps-Automation repository on GitHub.
