Automating Reindexing in Elasticsearch

Bashing out the reindexing process


In today’s data-driven landscape, Elasticsearch has emerged as a powerhouse for fast and efficient data storage and retrieval. However, as datasets grow and evolve, optimizing Elasticsearch indices becomes paramount to maintaining top-notch performance. In the quest for efficiency, automation often takes the spotlight, simplifying complex work and freeing up valuable time for more critical tasks, like enjoying a drink at the beach on a Tuesday.

In this post, we will embark on a journey into the world of automating Elasticsearch index reindexing. From the initial hurdles that motivated the search for an automated solution to the strategies, tools, and best practices that emerged, I’ll share my experiences and lessons learned in implementing a streamlined, fully automated approach to index reindexing.

So, whether you’re a seasoned Elasticsearch administrator, a developer getting into data management, or simply curious about the power of bash scripting and the logic that can be applied to enhance database performance, this blog post is for you. By the end, you’ll not only understand the importance of automation in the world of Elasticsearch but also be equipped with actionable insights to embark on your own path towards a more efficient and robust Elasticsearch indexing strategy.

What is Elasticsearch?

Elasticsearch, at its core, is more than just a database; it’s an open-source, distributed search and analytics engine designed to handle vast amounts of data while providing lightning-fast search capabilities. Under the hood, Elasticsearch is built on top of Apache Lucene, a powerful and robust full-text search library. However, Elasticsearch extends Lucene’s capabilities by adding features like schema-free JSON documents, real-time indexing, and distributed search.

Imagine a virtual treasure trove where data—structured or unstructured—can be stored, indexed, and searched effortlessly. Elasticsearch transforms raw data into actionable insights, making it a staple in applications ranging from e-commerce platforms to log analysis, monitoring systems, and beyond.

By its own definition, Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data for lightning-fast search, fine‑tuned relevancy, and powerful analytics that scale with ease.

As we venture further into our exploration of automated index reindexing, it’s important to recognize how Elasticsearch’s architecture plays a pivotal role. Its distributed nature allows data to be spread across nodes, enhancing fault tolerance and scalability. Furthermore, its RESTful API makes it accessible for various programming languages, enabling seamless integration with a multitude of applications and systems.
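
To make that RESTful API concrete, here’s what talking to Elasticsearch looks like. A minimal sketch against a hypothetical local cluster on port 9200, with a made-up "products" index:

# Index a JSON document into a hypothetical "products" index
curl -X PUT "http://localhost:9200/products/_doc/1" \
-H 'Content-Type: application/json' \
-d '{"name": "surfboard", "price": 299}'

# Search for it with a full-text query
curl -X GET "http://localhost:9200/products/_search?q=name:surfboard"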

Reindexing Indexes

The story begins in a project where Elasticsearch was one of the main components of the entire application’s backend, serving both to manage data and to consume data from other sources; in the end, we were talking about terabytes of data being managed and stored in Elasticsearch indexes. Data was organized on a monthly basis, so a new index was created for each month of the year, and then all the ETL processes, applications, business flows, and pretty much the entirety of Twitter were dumped into the previously mentioned indexes.

Time went by, and some of the indexes were no longer fully active. Because of that, we needed to manually run reindexes on those old indexes, as well as apply optimizations to them, such as changing the racks they were assigned to or the number of shards they had.

Elasticsearch reindexing is the practice of creating a new index and populating it with data from an existing index. This seemingly straightforward process holds immense power when it comes to data management, enabling organizations to restructure, cleanse, and enhance their data without disrupting ongoing operations.

The need for reindexing often arises from various scenarios. Imagine a situation where your mapping structure needs to evolve to accommodate new fields or modified data types. Or, perhaps, you’re consolidating multiple indices into a single, unified dataset. Reindexing can also address performance concerns by redistributing data across shards or updating outdated indexing techniques.

In this process, Elasticsearch reindexing allows you to take a proactive approach to data optimization. It ensures that your data remains aligned with your evolving business requirements, keeps your search performance at its peak, and allows for smoother data migrations during system upgrades or shifts.
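
In its simplest form, the whole operation boils down to a single call to the Reindex API. A minimal sketch, assuming a local cluster and hypothetical index names:

# Copy every document from old-index into new-index
curl -X POST "http://localhost:9200/_reindex" \
-H 'Content-Type: application/json' \
-d '{"source": {"index": "old-index"}, "dest": {"index": "new-index"}}'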

The Script

With all this in mind, we were able to create the following script, which we executed overnight; it triggers the reindex process given a few parameters, such as the old and new index names for the given dates and suffixes. One can simply abstract that logic and build on it to trigger the execution of these reindexes automatically.

#!/bin/bash
set -x
set -e

declare -A PREFIX
PREFIX["complete-mayor"]="complete"


scriptname=$0
function usage {
    echo ""
    echo "Runs ReIndexing"
    echo ""
    echo "usage: $scriptname --environ staging --prefix complete-mayor --year 2023 --month 03 --new-suffix v4 --old-suffix v3 --cluster-name ibiza  --batch-size 5000 --ip-address 10.34.4.44 "
    echo ""
    echo "  --environ  string       elasticsearch cluster environment"
    echo "                          (example: staging/prod)"
    echo "  --prefix string         prefix of the index"
    echo "                          (example: complete-mayor)"
    echo "  --year string           4 digit year of the index"
    echo "                          (example: 1969, 2023)"
    echo "  --month string          2 digit month of the index"
    echo "                          (example: 03)"
    echo "  --new-suffix  string     new suffix for the index"
    echo "                          (example: v4)"
    echo "  --old-suffix  string    old suffix for the index"
    echo "                          (example: v3)"
    echo "  --shards  string        number of shards for new index"
    echo "                          (example: 3)"
    echo "  --cluster-name string    dns prefix name of the es cluster"
    echo "                          (example: ibiza)"
    echo "  --ip-address string     ip address of the query node"
    echo "                          (example: 172.29.203.222)"
    echo "  --batch-size string      size of the indexing batch, defaults to 5000"
    echo "                          (example: 5000)"
    echo "  --index-slices string   size of the slices, defaults to 10"
    echo "                          (example: 5000)"
    echo "  --delete-index string   defaults to false"
    echo "                          (example: true/false)"
    echo ""
}

function die {
    printf "Script failed: %s\n\n" "$1"
    exit 1
}

# Parse "--flag value" pairs into shell variables named after the flag (dashes become underscores)
while [ $# -gt 0 ]; do
    if [[ $1 == "--help" ]]; then
        usage
        exit 0
    elif [[ $1 == "--"* ]]; then
        v=$(echo "${1/--/}" | tr '-' '_')
        declare "$v"="$2"
        shift
    fi
    shift
done

if [[ -z $environ ]]; then
    usage
    die "Missing parameter --environ"
elif [[ -z $prefix ]]; then
    usage
    die "Missing parameter --prefix"
elif [[ -z $year ]]; then
    usage
    die "Missing parameter --year"
elif [[ -z $month ]]; then
    usage
    die "Missing parameter --month"
elif [[ -z $new_suffix ]]; then
    usage
    die "Missing parameter --new-suffix"
elif [[ -z $old_suffix ]]; then
    usage
    die "Missing parameter --old-suffix"
elif [[ -z $cluster_name ]]; then
    usage
    die "Missing parameter --cluster-name"
elif [[ -z $ip_address ]]; then
    usage
    die "Missing parameter --ip-address"
fi

batch_size="${batch_size:-5000}"
index_slices="${index_slices:-10}"
shards="${shards:-3}"
delete_index="${delete_index:-false}"

NEW_INDEX="${prefix}-${year}-${month}-${new_suffix}"
OLD_INDEX="${prefix}-${year}-${month}-${old_suffix}"

# Patch the settings template in place with the desired shard count before creating the index
python3 -c "import json; data=json.load(open('${environ}-${PREFIX[$prefix]}.json')); data['settings']['number_of_shards'] = $shards; json.dump(data, open('${environ}-${PREFIX[$prefix]}.json', 'w'))"

# Create the new index from the patched settings template
curl --location --request PUT "http://${cluster_name}.elasticsearch.int.${environ}.notadevopsengineer.com/${NEW_INDEX}" \
-H 'Content-Type: application/json' \
--data-binary "@${environ}-${PREFIX[$prefix]}.json"

# Clear the default tier preference on the new index so allocation can be controlled explicitly
curl --location --request PUT "http://${cluster_name}.elasticsearch.int.${environ}.notadevopsengineer.com/${NEW_INDEX}/_settings" \
--header 'Content-Type: application/json' \
--data '{
    "index.routing.allocation.include._tier_preference": null
}'

# Pin the new index to a specific rack
curl --location --request PUT "http://${cluster_name}.elasticsearch.int.${environ}.notadevopsengineer.com/${NEW_INDEX}/_settings" \
--header 'Content-Type: application/json' \
--data '{
    "index.routing.allocation.include.rack_id": "us-east-1a"
}'

# Disable refreshes on the new index to speed up the bulk copy
curl --location --request PUT "http://${cluster_name}.elasticsearch.int.${environ}.notadevopsengineer.com/${NEW_INDEX}/_settings" \
-H 'Content-Type: application/json' \
-d '{
    "index.routing.allocation.include._tier_preference": null,
    "index.refresh_interval": "-1"
}'
echo "Updated Settings on New Index ${NEW_INDEX}"

# Make the old index read-only so no writes land while documents are copied
curl -H 'Content-Type: application/json' -XPUT "http://${cluster_name}.elasticsearch.int.${environ}.notadevopsengineer.com/${OLD_INDEX}/_settings" \
-d '{"index.blocks.read_only": "true"}'
echo "Updated Settings on Old Index ${OLD_INDEX}"

# Drop replicas on the new index during the copy; they can be restored afterwards
curl -H 'Content-Type: application/json' -XPUT "http://${cluster_name}.elasticsearch.int.${environ}.notadevopsengineer.com/${NEW_INDEX}/_settings" \
-d '{ "index": { "number_of_replicas": 0 }}'

# Trigger the reindex asynchronously, letting Elasticsearch split the work into slices
curl -H 'Content-Type: application/json' -XPOST "http://${ip_address}:9200/_reindex?wait_for_completion=false&slices=${index_slices}" \
-d '{"source": {"index": "'"${OLD_INDEX}"'","size": '"${batch_size}"'},"dest": {"index": "'"${NEW_INDEX}"'"}}'
echo "Triggered ReIndexing ${OLD_INDEX} ---> ${NEW_INDEX}"

# Poll both doc counts every 4 minutes until the new index catches up
while true; do
  old_count=$(curl -s "http://${cluster_name}.elasticsearch.int.${environ}.notadevopsengineer.com/_cat/indices?format=json&index=${OLD_INDEX}" | jq '.[] | ."docs.count"')
  new_count=$(curl -s "http://${cluster_name}.elasticsearch.int.${environ}.notadevopsengineer.com/_cat/indices?format=json&index=${NEW_INDEX}" | jq '.[] | ."docs.count"')
  if [ "${old_count}" = "${new_count}" ]; then
    echo "Old Index Count ${old_count} and New Index ${new_count} are the same";
    break;
  else
    echo "Old Index Count ${old_count} and New Index ${new_count} are different";
    sleep 240;
  fi
done

At its core, this script takes a given old and new index and executes every command needed for the reindexing: creating the new index, updating the settings of both the new and old index, and finally triggering the reindex operation, then waiting for completion by comparing the total number of docs in both indexes.
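
As a side note, since the reindex is triggered with wait_for_completion=false, Elasticsearch also returns a task ID, so progress could equally be tracked through the Tasks API instead of comparing doc counts. A quick sketch against the same query node:

# List running reindex tasks with per-slice progress details
curl -s "http://${ip_address}:9200/_tasks?actions=*reindex&detailed=true" | jq .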

Moving to containers

We leveraged Docker containers to execute multiple-reindexes.sh and let the container run until it finishes; at the same time, this makes it easy to check the logs and store them per container ID.

#multiple-reindexes.sh
#!/bin/bash
commands=(
  "bash reindexer.sh --environ prod --prefix complete-mayor --year 2022 --month 03 --new-suffix v2 --old-suffix v1 --cluster-name ibiza  --index-slices 20 --batch-size 5000 --shards 5 --ip-address 172.30.201.216"
  "bash reindexer.sh --environ prod --prefix complete-mayor --year 2022 --month 04 --new-suffix v2 --old-suffix v1 --cluster-name ibiza  --index-slices 20 --batch-size 5000 --shards 5 --ip-address 172.30.201.216"
  "bash reindexer.sh --environ prod --prefix complete-mayor --year 2022 --month 05 --new-suffix v2 --old-suffix v1 --cluster-name ibiza  --index-slices 20 --batch-size 5000 --shards 5 --ip-address 172.30.201.216"
  "bash reindexer.sh --environ prod --prefix complete-mayor --year 2022 --month 06 --new-suffix v2 --old-suffix v1 --cluster-name ibiza  --index-slices 20 --batch-size 5000 --shards 5 --ip-address 172.30.201.216"
  "bash reindexer.sh --environ prod --prefix complete-mayor --year 2022 --month 07 --new-suffix v2 --old-suffix v1 --cluster-name ibiza  --index-slices 20 --batch-size 5000 --shards 5 --ip-address 172.30.201.216"
)

# Execute the commands one by one
for cmd in "${commands[@]}"; do
  echo "Executing command: $cmd"
  eval "$cmd"
done
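
Since only the month changes between those invocations, the hard-coded array could also be generated with a loop; a minimal sketch assuming the same parameters:

# Equivalent loop: one reindexer.sh run per month, executed sequentially
for month in 03 04 05 06 07; do
  bash reindexer.sh --environ prod --prefix complete-mayor --year 2022 \
    --month "$month" --new-suffix v2 --old-suffix v1 --cluster-name ibiza \
    --index-slices 20 --batch-size 5000 --shards 5 --ip-address 172.30.201.216
done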

Now, in order to execute this in a Docker container:

1- Update multiple-reindexes.sh with all the indexes that need to be reindexed
2- Build the Docker image
3- Run the Docker image

docker build -t reindex-script .
docker run -d reindex-script
docker logs CONTAINER_ID --tail=100
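
Here, CONTAINER_ID comes from the output of docker run -d (or from docker ps). Capturing it in a variable makes following the logs a one-liner:

# docker run -d prints the new container's ID; keep it and follow the logs
cid=$(docker run -d reindex-script)
docker logs -f "$cid"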

And of course, the Dockerfile:

FROM ubuntu:latest

RUN apt-get update && apt-get install -y bash curl jq python3 python3-pip ncal
RUN pip3 install requests
RUN pip3 install awscli

WORKDIR /usr/src/app
COPY multiple-reindexes.sh .
COPY reindexer.sh .

RUN chmod +x multiple-reindexes.sh reindexer.sh

CMD ["/bin/bash", "multiple-reindexes.sh"]

Conclusion

After all this trial and error, we were able to fully automate the process and simply fire the script whenever we needed, for whichever indexes we needed, and this allowed me to go to the beach on a Tuesday evening to watch the sunset while the Elasticsearch indexes were being optimized.

We moved from understanding Elasticsearch’s core to unraveling the mechanics of reindexing, and now, we stand ready to wield the automation script as a tool of transformation. The script isn’t just code; it’s a roadmap to streamline complexities. It’s a way to orchestrate reindexing without disrupting operations.

Build On!