If you load 4 million rows into Aurora MySQL and then replicate the changes to S3 using DMS, you may notice that the ongoing change data capture (CDC) takes several hours to catch up. This lag is common with large data volumes and high transaction rates, especially during peak workloads of more than 20,000 concurrent transactions.
Here’s a straightforward approach to improve your setup and speed up the replication process:
Start by ensuring your DMS replication instance is appropriately sized. You mentioned using an r7i.2xlarge instance for testing, which is generally sufficient for small-scale tests. For CDC workloads with high transaction volumes, however, consider scaling up temporarily: a larger instance with more CPU and memory can work through changes faster and reduce lag.
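If you want to script that resize, a minimal boto3 sketch might look like the following; the ARN and the target instance class are placeholders to swap for your own, and note that changing the instance class can briefly interrupt replication, so plan it around your test window.

```python
import boto3

dms = boto3.client("dms")

# Scale the replication instance up for the CDC catch-up window; scale it
# back down the same way once latency has recovered.
dms.modify_replication_instance(
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE",  # placeholder ARN
    ReplicationInstanceClass="dms.r5.4xlarge",  # example larger class; pick one offered in your Region
    ApplyImmediately=True,
)
```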
Optimize your DMS task settings (a hedged sketch of these settings follows this list):
– Increase the batch size and memory limits for applying changes. For example, enabling batch apply and adjusting “BatchApplyTimeoutMax” and “BatchSplitSize” in the ChangeProcessingTuning section lets DMS apply larger chunks of changes at once.
– Enable parallel apply threads if your target type supports them, so multiple threads can push changes concurrently and shorten the catch-up phase.
– Adjust the CDC batch interval and minimum file size on the S3 target endpoint to balance latency against resource consumption and the number of files produced.
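As a rough illustration, here is a boto3 sketch of the task-settings sections involved. The numbers are illustrative starting points rather than recommendations, the task ARN is a placeholder, and a task normally has to be stopped before its settings can be modified. ParallelApplyThreads lives in the TargetMetadata section but is only honored for certain target types, so verify it against the current S3 target documentation before relying on it.

```python
import boto3
import json

dms = boto3.client("dms")

# Illustrative values only -- tune against your own change volume. The task
# must be in a stopped state before its settings can be modified.
task_settings = {
    "TargetMetadata": {
        "BatchApplyEnabled": True,      # apply CDC changes in batches instead of row by row
        # "ParallelApplyThreads": 8,    # only honored for certain target types; verify for S3 first
    },
    "ChangeProcessingTuning": {
        "BatchApplyTimeoutMin": 1,      # minimum seconds to accumulate a batch
        "BatchApplyTimeoutMax": 30,     # maximum seconds a batch may accumulate before it is applied
        "BatchApplyMemoryLimit": 1000,  # MB of memory a single batch may use
        "BatchSplitSize": 0,            # 0 = no cap on the number of changes per batch
        "MemoryLimitTotal": 2048,       # MB for transactions held in memory before spilling to disk
        "MemoryKeepTime": 120,          # seconds a transaction is kept in memory before spilling
    },
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",  # placeholder ARN
    ReplicationTaskSettings=json.dumps(task_settings),
)
```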
Review your target endpoint configuration (see the sketch after this list):
– Since you’re writing directly to S3 in Parquet format, GZIP compression is a good choice. Make sure “MaxFileSize” and “RowGroupLength” are sized for your workload so you aren’t producing a flood of tiny objects.
– Consider whether writing directly from DMS to S3 is the most efficient approach for near-real-time replication. Batching and buffering inside DMS can introduce delay, so an intermediate staging area or a different data pipeline might help if minimal lag is critical.
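A hedged sketch of those S3 endpoint settings via boto3, with a placeholder ARN and illustrative sizes (MaxFileSize and CdcMinFileSize are expressed in KB), could look like this:

```python
import boto3

dms = boto3.client("dms")

# Illustrative S3 target endpoint settings; tune the sizes against your own
# lag and file-count goals.
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET-EXAMPLE",  # placeholder ARN
    S3Settings={
        "DataFormat": "parquet",
        "CompressionType": "gzip",
        "ParquetVersion": "parquet-2-0",
        "MaxFileSize": 102400,        # KB; maximum size of a data file written to S3
        "RowGroupLength": 65536,      # rows per Parquet row group
        "CdcMaxBatchInterval": 60,    # seconds before DMS flushes a CDC batch to S3
        "CdcMinFileSize": 32000,      # KB; flush when the file reaches this size or the interval elapses
    },
)
```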
Monitor resource utilization (a sample latency check follows this list):
– Even though your test reported only about 2% CPU usage and 20 GB of RAM, actual performance can still be limited by disk I/O, network throughput, and internal database processes.
– Temporarily enable detailed task logging to pinpoint bottlenecks in the capture and apply phases.
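For the latency side specifically, the CDCLatencySource and CDCLatencyTarget metrics in the AWS/DMS CloudWatch namespace show whether lag is accumulating on the capture or the apply side. A small boto3 sketch, with placeholder instance and task identifiers, might be:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")

# Pull source- and target-side CDC latency for the last hour. The dimension
# values are placeholders; use your instance name and the task's resource ID
# (the suffix shown in the task ARN).
for metric in ("CDCLatencySource", "CDCLatencyTarget"):
    stats = cw.get_metric_statistics(
        Namespace="AWS/DMS",
        MetricName=metric,
        Dimensions=[
            {"Name": "ReplicationInstanceIdentifier", "Value": "my-dms-instance"},  # placeholder
            {"Name": "ReplicationTaskIdentifier", "Value": "my-cdc-task"},          # placeholder
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average", "Maximum"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    print(metric, [(p["Timestamp"].isoformat(), p["Maximum"]) for p in points])
```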
Tune your source and target endpoints (a sketch follows this list):
– Check connection parameters such as “EventsPollInterval” and the various timeout settings. Reducing the poll interval can help DMS pick up changes from the binary log sooner.
– Make sure your Aurora MySQL instance’s binary logging is configured for CDC: binlog_format must be ROW, and binlog retention should be long enough for the task to catch up after any interruption.
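A hedged example of the source-side tweaks: the endpoint ARN is a placeholder, EventsPollInterval is in seconds, and the Aurora-side statements are shown as comments because they run against the database, not through DMS.

```python
import boto3

dms = boto3.client("dms")

# Poll the binary log more frequently on the MySQL source endpoint.
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE-EXAMPLE",  # placeholder ARN
    MySQLSettings={
        "EventsPollInterval": 1,  # seconds between checks for new binlog events when the source is idle
    },
)

# On the Aurora MySQL side (run against the cluster, not through DMS):
#   - set binlog_format = ROW in the cluster parameter group (required for DMS CDC)
#   - keep enough binlog history for the task to catch up, for example:
#       CALL mysql.rds_set_configuration('binlog retention hours', 24);
```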
Lastly, consider the volume and concurrency limits:
– With peak transaction loads over 20k transactions per second, your replication setup might need additional tuning or higher-capacity resources.
– Limiting the number of concurrent change streams or batching changes can help manage load and reduce lag.
By adjusting your DMS instance size, optimizing task settings, and ensuring your environment is tuned for high-volume CDC, you can significantly reduce the lag and improve near real-time data replication from Aurora to S3. If delays persist, exploring alternative data pipeline solutions or incremental tuning based on specific bottlenecks would be the next step.



