I just signed up for a new Astronomer course that starts in a few days, and honestly, I think I’ll be trying to convince my company to migrate over to them.
We currently rely on AWS MWAA (Managed Workflows for Apache Airflow), but boy, it has been a rough experience.
Lately, I’ve been putting a lot of effort into optimizing our setup. I adapted and improved all our existing Airflow code, made a heavy bet on deferrable operators, moved our lightweight but long-running workloads (dbt calls in particular) into ECS, and meticulously prepared our migration to Airflow 3.0.6.
Development was ready. Everything seemed good to go.
But then the cloud reality check hit. We found out that MWAA’s 3.0.6 image had stability issues due to a mismatch between Python 3.12 and Celery. That issue stayed open for more than six months.
When they finally released Airflow 3.2.1 on May 19th, I upgraded immediately, hoping for a fix. Instead, we just traded one headache for another: now our workers are being recycled roughly every 65 minutes.
I’m currently tracking an AWS re:Post thread (https://repost.aws/questions/QUB7LC2vV2Rem-fGa7zJDbTw/mwaa-3-2-1-workers-being-recycled-approximately-every-65-minutes) where others are hitting the exact same 65-minute worker recycling loop, and I’ve opened a new support ticket to push for answers.
If there is one thing I am absolutely sure of right now, it’s that no cloud provider is perfect. Managed services are supposed to take away the operational pain, but sometimes they just block you from fixing the actual problem.
I’ll keep you posted on how the Astronomer course goes.