If you’re managing an OpenSearch cluster and encounter a FORBIDDEN/8 error during shard migration, it can seem confusing at first. Here’s a simple guide to help you understand the cause and how to address it.
This issue often appears while data is being migrated, especially with a blue/green deployment. During that process the old and new nodes run side by side, and while data is copied from the old nodes to the new ones, disk usage spikes temporarily. If any node’s disk usage crosses the flood-stage watermark (95% by default), OpenSearch blocks writes to prevent data corruption or loss. Every index with at least one shard on the affected node then has write operations blocked, which surfaces as the FORBIDDEN/8 error.
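To confirm which threshold is in effect, you can read the disk watermark settings back from the cluster. Below is a minimal sketch in Python using the requests library; the endpoint and credentials are placeholders, and you may need SigV4 signing instead of basic auth depending on how your domain is secured.

import requests

# Placeholder endpoint and credentials for illustration only.
ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"
AUTH = ("master_user", "master_password")

# Fetch effective cluster settings (including defaults) as flat keys and print
# the disk watermark thresholds, e.g. cluster.routing.allocation.disk.watermark.flood_stage.
resp = requests.get(
    f"{ENDPOINT}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
    auth=AUTH,
)
for scope, values in resp.json().items():
    for key, value in values.items():
        if "disk.watermark" in key:
            print(scope, key, value)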
In a typical setup with two data nodes and around 40 shards, even a brief spike can push a node over that threshold. This usually happens during the final shard transfer: the source node for that last shard is under the heaviest disk pressure at that moment, so the flood-stage watermark trips and the write block is applied.
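To see which node is closest to the limit while the migration is running, per-node disk usage and shard counts are available from the _cat/allocation API. A short sketch, with the same placeholder endpoint and credentials as above:

import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master_user", "master_password")                  # placeholder

# One row per node: shard count plus used/available disk and disk.percent.
# A node sitting near the flood-stage threshold is the likely source of the block.
resp = requests.get(
    f"{ENDPOINT}/_cat/allocation",
    params={"v": "true", "h": "node,shards,disk.percent,disk.used,disk.avail,disk.total"},
    auth=AUTH,
)
print(resp.text)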
When your migration stalls at a specific point (say, 39 out of 40 shards), the first step is to gather diagnostic information. Running these commands in order helps narrow down the cause; a small script that runs the same checks in sequence follows the list:
- Check why the cluster isn’t moving shards or allocating resources:
  GET _cluster/allocation/explain
- See the status of data recovery, focusing on active shards:
  GET _cat/recovery?active_only=true&v&h=index,shard,stage,bytes_percent,source_node,target_node
- Review disk space usage on each node:
  GET _nodes/stats/fs?pretty
- Check JVM heap and memory stats:
  GET _nodes/stats/jvm?pretty
- See if there are any pending tasks that could be delaying the process:
  GET _cluster/pending_tasks
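If you prefer to capture everything in one pass, the same five checks can be scripted. This is a sketch, again assuming a placeholder endpoint and basic-auth credentials:

import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master_user", "master_password")                  # placeholder

# The five diagnostic calls from the list above, run in order.
# Note: _cluster/allocation/explain returns HTTP 400 if no shard is currently unassigned.
CHECKS = [
    "_cluster/allocation/explain",
    "_cat/recovery?active_only=true&v&h=index,shard,stage,bytes_percent,source_node,target_node",
    "_nodes/stats/fs?pretty",
    "_nodes/stats/jvm?pretty",
    "_cluster/pending_tasks",
]

for path in CHECKS:
    resp = requests.get(f"{ENDPOINT}/{path}", auth=AUTH)
    print(f"==== GET /{path} ({resp.status_code}) ====")
    print(resp.text[:2000])  # trim long bodies; drop the slice to keep full output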
The information from these commands will tell you if the stall relates to disk space, JVM issues, or shard copying delays. This helps in identifying the root cause and planning your next steps.
Another common question is why the DryRun option reports only a “DynamicUpdate” while the change actually runs as a blue/green deployment. This discrepancy is a known limitation: DryRun evaluates the cluster’s state at the time of the request, but the actual deployment type is decided during execution. AWS can override the predicted outcome during operations such as EBS volume resizing, especially with gp3 volumes.
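If you want to see that prediction yourself before applying a change, a dry run can be requested through the UpdateDomainConfig API. Here is a sketch using boto3; the region, domain name, and instance settings are placeholders, and the returned result should be treated as advisory rather than a guarantee of the deployment type:

import boto3

client = boto3.client("opensearch", region_name="us-east-1")  # placeholder region

# Request a dry run of a cluster config change; nothing is actually applied.
response = client.update_domain_config(
    DomainName="my-domain",  # placeholder
    ClusterConfig={"InstanceType": "r6g.large.search", "InstanceCount": 2},
    DryRun=True,
    DryRunMode="Verbose",
)
# DryRunResults includes the predicted deployment type (e.g. "DynamicUpdate" or "Blue/Green").
print(response.get("DryRunResults", {}))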
Rather than relying on the DryRun prediction, monitor the cluster’s health during migrations by calling the DescribeDomainChangeProgress API immediately after initiating a scale change. If the total number of stages exceeds two, you are in a blue/green migration and should activate additional monitoring right away to catch problems early.
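A sketch of that monitoring loop with boto3, assuming a placeholder domain name and region:

import time
import boto3

client = boto3.client("opensearch", region_name="us-east-1")  # placeholder region

while True:
    status = client.describe_domain_change_progress(DomainName="my-domain")["ChangeProgressStatus"]
    total = status.get("TotalNumberOfStages", 0)
    print(status.get("Status"), "stages:", total)
    if total > 2:
        # More than two stages indicates a blue/green deployment is underway.
        print("Blue/green migration detected: step up disk and JVM monitoring now.")
    if status.get("Status") in ("COMPLETED", "FAILED"):
        break
    time.sleep(60)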
To prevent similar problems in the future, configure your autoscaling rules to trigger a scale change only when every node is below 50% disk usage. During a blue/green migration, disk usage can temporarily go beyond normal, especially on a two-node cluster, because of the additional data copying. Leaving extra headroom, such as keeping 80% of the disk free before starting, helps avoid hitting the disk watermark during the copy.
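One way to watch that headroom is the FreeStorageSpace CloudWatch metric (reported in megabytes; the Minimum statistic shows the least free space on any single node). A sketch, assuming the standard AWS/ES namespace and placeholder identifiers:

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="FreeStorageSpace",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},   # placeholder
        {"Name": "ClientId", "Value": "123456789012"},  # AWS account ID (placeholder)
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Minimum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"], "MB free")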
For more detailed best practices and guidance on managing blue/green deployments, consult the official AWS OpenSearch Service documentation.
This approach will help ensure smoother migrations and reduce the chances of hitting disk space limits that block your data operations.





