From OpenFAAS to Fly.io: My Migration Misadventures and How Gunicorn Saved the Day!

As a junior dev, sometimes the simplest problems are the hardest to solve. This was one of those times.

We were in the process of migrating our OpenFAAS function to another platform. Our initial choice was Google’s Cloud Functions because of:

  1. its feature-set overlap with OpenFAAS; and
  2. the services we already had running on the Google platform

It seemed like a straightforward migration.

Unfortunately, we later discovered that because the database our cloud function queries is hosted on Amazon RDS, we would need to provision a static IP address for the cloud function to connect to the database over a VPC. That made the setup needlessly complex and added a fixed monthly cost to something that should’ve been pay-as-you-go. Lest we forget, one of the primary benefits of using FaaS is the reduced complexity and cost, especially for infrequently run functions.

This led us to migrate to Fly.io instead of Google Cloud Functions. Aside from cost, we were also in the process of migrating some of our other services there, so this made sense. It did mean this was no longer strictly a serverless function, since we’d be deploying a full Python app. However, we could take advantage of Fly.io’s automatic start and stop of machines, so the pay-as-you-go benefit remains.
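For reference, that scale-to-zero behaviour lives in the app’s fly.toml. Here’s a minimal sketch of the relevant section; the internal port matches the 8080 gunicorn listens on, but treat the exact keys and values as something to double-check against the current Fly.io docs rather than gospel:

    # fly.toml (excerpt) - lets Fly.io stop idle machines and start them on demand
    [http_service]
      internal_port = 8080
      auto_stop_machines = true
      auto_start_machines = true
      min_machines_running = 0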

The actual changes needed to migrate from OpenFAAS to Fly.io were pretty straightforward, and Fly.io’s official docs were quite helpful here. When deploying a Python app on Fly.io, make sure to follow the steps as instructed, especially the “Before Deployment” section, which mentions installing gunicorn as a dependency. Your app will most likely use environment variables in some form, and Fly.io has a mechanism for those as well. Overall, refactoring the code and deploying went without a hitch.
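For context, the deployment pieces ended up looking roughly like this. It’s only a sketch: the app:app module and object names, the Flask dependency, and the DATABASE_URL secret are placeholders for whatever your project actually uses; the gunicorn version simply matches what later shows up in our logs:

    # requirements.txt (excerpt)
    flask
    gunicorn==21.2.0

    # Procfile
    web: gunicorn app:app

    # environment variables are injected as Fly.io secrets, e.g.
    fly secrets set DATABASE_URL=postgres://...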

It was when we were testing the endpoint that things got hairy. The endpoint just wouldn’t successfully return and kept timing out. You’ll know you may be running into this issue if your application logs look something like this:


2023-12-28T11:00:15.778 app[xyz] cdg [info] [2023-12-28 11:00:15 +0000] [306] [INFO] Starting gunicorn 21.2.0
2023-12-28T11:00:15.779 app[xyz] cdg [info] [2023-12-28 11:00:15 +0000] [306] [INFO] Listening at: http://0.0.0.0:8080 (306)
2023-12-28T11:00:15.779 app[xyz] cdg [info] [2023-12-28 11:00:15 +0000] [306] [INFO] Using worker: sync
2023-12-28T11:00:15.782 app[xyz] cdg [info] [2023-12-28 11:00:15 +0000] [322] [INFO] Booting worker with pid: 322
2023-12-28T11:04:25.070 app[xyz] cdg [info] [2023-12-28 11:04:25 +0000] [306] [CRITICAL] WORKER TIMEOUT (pid:322)
2023-12-28T11:04:25.071 app[xyz] cdg [info] [2023-12-28 11:04:25 +0000] [322] [INFO] Worker exiting (pid: 322)
2023-12-28T11:04:25.210 app[xyz] cdg [info] [2023-12-28 11:04:25 +0000] [306] [ERROR] Worker (pid:322) exited with code 1
2023-12-28T11:04:25.210 app[xyz] cdg [info] [2023-12-28 11:04:25 +0000] [306] [ERROR] Worker (pid:322) exited with code 1.
2023-12-28T11:04:25.211 app[xyz] cdg [info] [2023-12-28 11:04:25 +0000] [323] [INFO] Booting worker with pid: 323

Fortunately, we came across some community posts early on regarding Fly.io’s timeout issues.

Quoting the reply pertinent to our issue at hand:

There is a 60 second idle timeout. If no data is received or sent within 60 seconds, the connection will be closed.

So we tried out one of the suggestions mentioned: we refactored the code to take advantage of Flask’s streaming, making sure it sent something within 60 seconds. Nada, it still kept timing out.
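For the curious, the streaming attempt looked roughly like this. It’s a minimal, self-contained sketch rather than our actual code, and do_slow_work() is a made-up stand-in for the real data processing:

    import time

    from flask import Flask, Response, stream_with_context

    app = Flask(__name__)

    def do_slow_work():
        # stand-in for the real processing; each sleep mimics a slow step
        for i in range(3):
            time.sleep(2)
            yield f"chunk {i}"

    @app.route("/report")
    def report():
        def generate():
            # send something immediately so the connection is never idle
            yield "starting\n"
            for chunk in do_slow_work():
                yield f"{chunk}\n"
        # stream_with_context keeps the request context alive while yielding
        return Response(stream_with_context(generate()), mimetype="text/plain")

In hindsight this was never going to help, because (as we’d soon learn) Fly.io’s idle timeout wasn’t the thing killing the request in the first place.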

So we thought maybe it was a CPU or memory bottleneck. Maybe our OpenFAAS server was just better configured, which is why it wasn’t timing out when hosted there. So we bumped up our Fly.io machine to test this hypothesis. Nada. Still kept timing out.
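Scaling up is a quick experiment with flyctl; the sizes below are just illustrative examples, not what we actually settled on:

    fly scale vm shared-cpu-2x
    fly scale memory 1024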

We tried refactoring the code and doing all sorts of performance optimizations. We even tested offloading the data processing to our PostgreSQL database, relegating the Fly.io app to a thin client. Still nada. We were so confused.

It was while I was updating my seniors on the issue that the eureka moment happened:

Wil’s (Sakay’s wunderboy CTO, for the uninitiated) question about which part of the code was actually generating the WORKER TIMEOUT error made frustrated and headbanging-at-wall Me slow down and think things through. It was then that I realized this error wasn’t actually coming from Fly.io’s end…but Gunicorn’s. The solution was simply to update the timeout in the Procfile from Gunicorn’s default of 30 seconds to 60 seconds (web: gunicorn app:app --timeout 60 is the command). Bye bye, timeouts.
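For completeness, the whole fix was this one line in the Procfile (assuming, as in the sketch above, a Flask app object named app inside app.py):

    web: gunicorn app:app --timeout 60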

Conclusions and lessons learned

Some lessons learned from this experience:

  • Have you ever heard of or read the book “Thinking, Fast and Slow” by Daniel Kahneman? This experience is probably a good application of that book’s main thesis. When frustration is mounting, pause for a minute. Talk to your colleagues and seniors. Breathe. When you slow down, things may start moving faster.
  • When analyzing logs, make sure to take into account each part of the stack and slowly think through where each error is actually coming from. As the Filipino saying goes, “Maraming namamatay sa maling akala (many die because of false assumptions).”
  • Sometimes your seniors don’t even need to give you the answer. Just by asking the right questions, they can help you arrive at the necessary solution.

Finally, if this post makes things easier for even one person, it will have made my day. As the comic says, “Everything is hard until someone makes it easy”.