Why we moved away from Azure serverless

Sjoerd Smink
6 min read · May 16, 2020

I’m personally a big fan of serverless. No need to install updates anymore, it just works, while you keep (some) control over your security configuration. Who wouldn’t like serverless? On AWS my experience was that cold starts could be an issue, but on Azure I didn’t notice them. Everything else, however, I did notice: making it secure, performant and reliable proved to be impossible.

On Azure, serverless is called Function Apps (or simply Functions; Microsoft isn’t really consistent in naming its products). I’ve only worked with Node.js, so I can’t say anything about other languages. But these were my struggles over the past half year.

1. It’s not really serverless

Although advertised as such, only the Consumption plan is really serverless. For some people this could work well, although the cold starts can be a pain (I’m talking about seconds here, not milliseconds). The bigger downside is that it doesn’t support a Virtual Network (VNET), which limits security enormously and makes it hardly usable for a professional infrastructure.

The Premium Plan and App Service Plan do have VNET support, but require you to set a minimum number of instances (Premium) or to run on existing App Service machines. The Premium plan can scale and doesn’t have a cold start, but you have to decide on the minimum number of instances yourself. There isn’t much visual feedback on the load of your instances, no configuration for when to scale up, or any configuration whatsoever. So you basically end up defining the number of instances you want to use.
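For illustration, this is roughly what pinning the instance count looks like when you script it: a minimal sketch in Node.js around the Azure CLI, with hypothetical resource names, assuming the --min-instances and --max-burst parameters of az functionapp plan create are available in your CLI version.

```javascript
// Minimal sketch: create an Elastic Premium plan with a fixed number of pre-warmed instances.
// Resource names are hypothetical; assumes the Azure CLI (az) is installed and logged in.
const { execSync } = require('child_process');

const run = (cmd) => execSync(cmd, { stdio: 'inherit' });

run([
  'az functionapp plan create',
  '--resource-group my-rg',
  '--name my-premium-plan',
  '--location westeurope',
  '--sku EP1',          // Elastic Premium, needed for VNET support without an App Service Plan
  '--min-instances 2',  // the minimum you end up choosing yourself
  '--max-burst 10',     // upper scale-out limit
].join(' '));
```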

Another limitation is that it doesn’t seem to be available across multiple regions. So keep in mind that if you require a high-availability platform, you need to do quite a bit of manual labour.

2. VNET support is limited

Using a Virtual Network (VNET) is a necessity if you don’t want some of your backend systems accessible from the internet, and you really don’t want that for your database. But the documented support for VNETs is not 100% true. Functions rely on a Storage Account, but that Storage Account can’t be in a VNET (a known issue). When we contacted support about the fact that this Storage Account contains sensitive tokens that are accessible from the internet, they confirmed this can’t be changed.

We also noticed a bug where the Function suddenly has a non-VNET (public) IP address. Non-VNET IPs are blocked by our database and trigger an alert, which is how I found out the Function was using a public IP. This lasts for a couple of minutes, after which the IP reverts to the private VNET IP. Contacting support about this didn’t bring any solace.

3. Functions are by default internet accessible

I personally don’t think it’s a good idea to call your backend systems directly; instead, put something in front of them like API Management. That’s very well possible in Azure, but keep in mind that the Functions remain open to the public. This can be limited to the VNET with Access Restrictions, but then you lose access to some parts of the Azure portal interface. Our solution was to use a VPN with a static IP, and whitelist that IP in the Access Restrictions. The portal is then fully functional as long as you’re using the VPN.
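To give an idea, the Access Restriction scripting ends up looking something like this: a minimal sketch with hypothetical resource names and an example VPN IP, assuming the az functionapp config access-restriction commands in your CLI version.

```javascript
// Minimal sketch: restrict the Function to the VNET plus a static VPN IP.
// Resource names and the IP address are hypothetical.
const { execSync } = require('child_process');
const run = (cmd) => execSync(cmd, { stdio: 'inherit' });

// Allow traffic coming from the VNET subnet (e.g. where API Management lives)
run('az functionapp config access-restriction add -g my-rg -n my-func ' +
    '--rule-name allow-vnet --action Allow --vnet-name my-vnet --subnet backend-subnet --priority 100');

// Allow the static VPN IP, so the Azure portal features keep working while on the VPN
run('az functionapp config access-restriction add -g my-rg -n my-func ' +
    '--rule-name allow-vpn --action Allow --ip-address 203.0.113.10/32 --priority 200');
```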

4. It’s difficult to do deployments without downtime

Yes, there’s something called Deployment Slots. But then you have to decide which slot to deploy to, and swap it after the deployment. Come on, shouldn’t deployments without downtime be part of the serverless proposition?
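For completeness, the slot dance looks roughly like this: a minimal sketch with hypothetical names, assuming a staging slot already exists.

```javascript
// Minimal sketch: zip deploy to a staging slot, then swap it into production.
// Resource names and the zip path are hypothetical; a "staging" slot must already exist.
const { execSync } = require('child_process');
const run = (cmd) => execSync(cmd, { stdio: 'inherit' });

// Deploy the new build to the staging slot first
run('az functionapp deployment source config-zip -g my-rg -n my-func ' +
    '--slot staging --src ./dist/function.zip');

// Swap staging into production once the deployment has finished
run('az functionapp deployment slot swap -g my-rg -n my-func ' +
    '--slot staging --target-slot production');
```

So it can be automated, but it’s an extra step you have to script and babysit yourself.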

5. Making it secure was difficult

It’s possible to give every Function its own Storage Account. But as the hard limit for the number of Storage Accounts is 250, we reached that limit with many Functions across different environments (and a need for other Storage Accounts as well). Luckily it’s possible to use the same Storage Account for multiple Functions. We enabled Advanced Threat Protection for the Storage Accounts, including the Storage Account used by the Functions. However, when deploying many Functions at the same time, the Threat Protection kicked in and blocked the deployment. The only solution was to turn it off. That’s something you need to do before every big deployment, because Advanced Threat Protection seems to be turned on again automatically after a while.
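At least the toggle itself can be scripted: a minimal sketch with hypothetical names, assuming the az security atp storage commands are available in your CLI.

```javascript
// Minimal sketch: turn Advanced Threat Protection off around a big deployment, then back on.
// Resource names are hypothetical.
const { execSync } = require('child_process');
const run = (cmd) => execSync(cmd, { stdio: 'inherit' });

const setAtp = (enabled) =>
  run(`az security atp storage update -g my-rg --storage-account myfuncstorage --is-enabled ${enabled}`);

setAtp(false);
// ... run the zip deployments for all Functions here ...
setAtp(true); // don't rely on it re-enabling itself
```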

We used Key Vault to store and retrieve secrets. The Function can access them via environment variables. It’s possible to limit access to the Key Vault by assigning the Function’s Identity to the Key Vault. But then the Key Vault can’t be in a VNET (fixing this limitation has seemingly been planned for a while now).
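Wiring that up looks roughly like this: a minimal sketch with hypothetical names that assigns a system identity, grants it read access, and exposes a secret as an app setting via a Key Vault reference.

```javascript
// Minimal sketch: managed identity + Key Vault access policy + Key Vault reference.
// Resource names and the secret are hypothetical.
const { execSync } = require('child_process');
const az = (cmd) => execSync(`az ${cmd}`, { encoding: 'utf8' });

// Enable a system-assigned identity on the Function and capture its principalId
const identity = JSON.parse(az('functionapp identity assign -g my-rg -n my-func --output json'));

// Allow that identity to read secrets from the Key Vault
az(`keyvault set-policy --name my-keyvault --object-id ${identity.principalId} ` +
   '--secret-permissions get list');

// Expose the secret as an app setting; the Function reads it like any environment variable
az('functionapp config appsettings set -g my-rg -n my-func --settings ' +
   '"DB_PASSWORD=@Microsoft.KeyVault(SecretUri=https://my-keyvault.vault.azure.net/secrets/db-password/)"');
```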

When creating a Function, you also need to pay attention to some of the defaults, for example the (unsecured) FTP and (unsecured) HTTP configuration.
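Those defaults are easy to tighten once you know about them: a minimal sketch with hypothetical names.

```javascript
// Minimal sketch: tighten the insecure defaults of a freshly created Function.
// Resource names are hypothetical.
const { execSync } = require('child_process');
const run = (cmd) => execSync(cmd, { stdio: 'inherit' });

run('az functionapp config set -g my-rg -n my-func --ftps-state Disabled');   // no plain FTP
run('az functionapp config set -g my-rg -n my-func --min-tls-version 1.2');   // modern TLS only
run('az functionapp update -g my-rg -n my-func --set httpsOnly=true');        // redirect HTTP to HTTPS
```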

6. Deployments sometimes fail (although the response code is successful)

We used Bitbucket Pipelines with an official Docker image to deploy a zip. Sometimes this failed with the weirdest errors (some JSON that couldn’t be parsed), especially when deploying many Functions at the same time. Other times it failed seemingly at random, and a retry succeeded.

More annoying was that it sometimes showed a successful response while in reality the Function was still running an old version. The only solution was to restart the Function, and sometimes to deploy again. Constantly doubting whether a deployment succeeded makes the whole process very unreliable.
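What helps a bit is to stop trusting the deployment response and verify what is actually running. A minimal sketch, assuming a hypothetical /api/version endpoint in the Function and an EXPECTED_VERSION variable set by the pipeline, with hypothetical names:

```javascript
// Minimal sketch: verify after a "successful" deployment that the new version is really live.
// The /api/version endpoint, EXPECTED_VERSION variable and resource names are hypothetical.
const { execSync } = require('child_process');
const https = require('https');

const expected = process.env.EXPECTED_VERSION;

const fetchVersion = () =>
  new Promise((resolve, reject) => {
    https.get('https://my-func.azurewebsites.net/api/version', (res) => {
      let body = '';
      res.on('data', (chunk) => (body += chunk));
      res.on('end', () => resolve(body.trim()));
    }).on('error', reject);
  });

(async () => {
  let live = await fetchVersion();
  if (live !== expected) {
    // Deployment reported success but the old version is still running: restart and check again
    execSync('az functionapp restart -g my-rg -n my-func', { stdio: 'inherit' });
    await new Promise((resolve) => setTimeout(resolve, 30000));
    live = await fetchVersion();
  }
  if (live !== expected) {
    console.error(`Deployment did not take effect: expected ${expected}, got ${live}`);
    process.exit(1); // fail the pipeline so someone notices and redeploys
  }
})();
```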

7. There seem to be caching issues

Besides the apparently incorrect deployments, there are more indications that a corrupt cache could be the culprit of some Function issues. We once had an API with a response time of 6 seconds, and it had been like that for a couple of days. Locally everything was fine. After a restart of the Function, it was back to under 100 ms.

Another time a Function returned a 403 status with the message that the web app had stopped. Executing the Function from the Portal returned a normal response, however. A restart solved this issue as well.

And then there was the time I deleted a Function and recreated it with the same name. Every deployment failed, and I couldn’t fix it. So I deleted the Function again, waited a day, recreated it, and everything worked fine.

8. Other issues we had

First some background. Because I don’t believe you can click together a professional infrastructure, we wrote some (Node.js) scripts to set up a new environment. The Azure CLI and REST API were used to automate this, and to check all the (security) settings.
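Those scripts are mostly thin wrappers around the CLI. As an example of the checking part, a minimal sketch with hypothetical names that verifies a single security setting:

```javascript
// Minimal sketch: one of the checks in a setup script, verifying a security setting via the CLI.
// Resource names are hypothetical.
const { execSync } = require('child_process');

const az = (cmd) => JSON.parse(execSync(`az ${cmd} --output json`, { encoding: 'utf8' }));

const site = az('functionapp show -g my-rg -n my-func');
if (!site.httpsOnly) {
  throw new Error('my-func still accepts plain HTTP traffic');
}
```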

We use API Management in front of the Functions. To script that, you need the host key. Fetching that host key right after the Function has been created returns an error. The solution is to deploy some initial code; after that initial deployment the host key is retrievable.
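In script form that ordering looks like this: a minimal sketch with hypothetical names; the exact key-listing command and its output shape may differ per CLI version.

```javascript
// Minimal sketch: deploy placeholder code first, then fetch the host key for API Management.
// Resource names and the placeholder zip are hypothetical; the keys command may vary per CLI version.
const { execSync } = require('child_process');
const az = (cmd) => JSON.parse(execSync(`az ${cmd} --output json`, { encoding: 'utf8' }));

// Without an initial deployment, listing the keys returns an error
execSync('az functionapp deployment source config-zip -g my-rg -n my-func --src ./placeholder.zip',
         { stdio: 'inherit' });

const keys = az('functionapp keys list -g my-rg -n my-func');
const hostKey = (keys.functionKeys && keys.functionKeys.default) || keys.masterKey;
console.log('Host key to configure in API Management:', hostKey);
```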

Our monitoring is triggered on average once a week because some Function returns a 503 status code. After one or two retries it returns a success code again. I contacted Azure support about this in the past; they promised to roll out a fix soon, but that didn’t resolve the entire issue. There are weeks when the Functions work without problems, and weeks when the monitoring shows alerts multiple times a day.

During the script development, it also became clear that the order of things is important. And that some commands can’t be run in parallel. And that Azure sometimes needs 10 seconds to process it all, so it’s better to wait before continuing with the next step. And that a successful response doesn’t always mean the command was actually executed successfully (e.g. Access Restrictions that weren’t really applied). Scripting it all requires many workarounds.
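Those workarounds mostly boil down to a retry-and-wait wrapper around every call. A minimal sketch with hypothetical names and arbitrary retry counts and delays:

```javascript
// Minimal sketch: retry a CLI command and give Azure time to catch up between steps.
// Retry counts, delays and resource names are hypothetical.
const { execSync } = require('child_process');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runWithRetry(cmd, retries = 3, delayMs = 10000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      execSync(cmd, { stdio: 'inherit' });
      return;
    } catch (err) {
      if (attempt === retries) throw err;
      console.warn(`Attempt ${attempt} failed, waiting ${delayMs / 1000}s before retrying...`);
      await sleep(delayMs);
    }
  }
}

(async () => {
  // Steps have to run in order, not in parallel, and often need a pause in between
  await runWithRetry('az functionapp config access-restriction add -g my-rg -n my-func ' +
                     '--rule-name allow-vpn --action Allow --ip-address 203.0.113.10/32 --priority 200');
  await sleep(10000); // give Azure time to actually apply the change before the next step
})();
```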

All in all there are quite a few problems in actually using Azure Functions. Hopefully this article helps someone in fixing issues, or in making key decisions about their infrastructure. We’re moving to Kubernetes now. Let’s see if I’ll be writing a similar article about AKS in 6 months…

The above statements reflect my personal opinion and do not express the views or opinions of the company I performed this work for.
