Debugging phantom HTTP 400 responses

I recently spent an entire week debugging an unusual situation involving unexplained, intermittent 400 responses to seemingly well-formed HTTP requests. Throughout the week, I tried a variety of things to debug the situation, pulled out more than a little of my hair, and finally stumbled across the solution.

Although frustrating at times, it was a great opportunity to explore some new infrastructure and practice some network debugging skills. So much so that I thought I would summarise the problem, how I debugged it, and the solution in a blog post. If you ever have the same problem in the future, I hope your Google searches bring you straight to this page to save you all the trouble!

The problem

In essence, the problem was that I had a NodeJS Lambda sending HTTP POST requests to a Python server:

[Diagram: a NodeJS Lambda sending HTTP POST requests to a Python API server.]

About 60% of the time these requests would succeed, the API would receive the body of the request, and return a 200 response. However, the rest of the time, the API was receiving an empty body, and returning a 400 response.

Why was the API receiving an empty body? Was the Lambda not sending a payload? Was the API failing to decode it? Was some intervening network infrastructure truncating the body?

I had no idea! 🙃

What I tried 🔎

Alright, so here’s a list of things I tried, in roughly chronological order:

Adding logging to the API

I logged the length of the body, and whether or not any fields were present in it. After several attempts at this, I concluded that when the API returned a 400, it was because the body really was empty.

Why was the body empty? 🤔

Adding logging to the Lambda

I added logging progressively closer to the point where the HTTP request was being sent. As far as I could tell from the logs, the Lambda was sending a request. The Lambda was using the NodeJS https library to send requests, and was doing something like this:

const https = require('https');

const uri = new URL(apiAddress);

// The actual request body here.
const payload = {
  'spam': 'eggs',
};

const options = {
  hostname: uri.hostname,
  method: 'POST',
  path: uri.pathname,
  headers: {
    'Content-Type': 'application/json',
  },
};

const req = https.request(options, (res) => {
  // Code to handle the response here
});

req.write(JSON.stringify(payload));
req.end();

If the Lambda was definitely sending a body, and the API definitely wasn’t receiving it, what was going wrong?

Identifying a failing payload

If the Lambda was sending a body, and the API wasn’t receiving it, maybe there was something strange about particular payloads which caused the problem. I managed to identify a problematic payload, and recreate the issue by triggering the Lambda to process the same payload. Now I was getting somewhere! I could manually reconstruct the rest of the HTTP request, and try sending it myself.

Sending HTTP requests from within the API’s pod

My first attempt was to run a shell inside the API’s server container, so that I could send the request directly to the server process without anything getting in between:

kubectl exec -it pod-name -c container-name -- bash

So far so good, except there were no utilities (like curl) installed in the container image for me to send HTTP requests with. That’s a good thing really, since ideally server images like this one should be distroless, and not contain a shell at all.

Sending HTTP requests from an adjacent pod in the same namespace

Rather than sending the HTTP request from the same container, I tried sending it from an adjacent pod in the same namespace:

kubectl run -it andy-debug --image nicolaka/netshoot

nicolaka/netshoot is an excellent container image containing an array of useful network debugging utilities. In this case, I just needed curl. I sent the HTTP request that caused a 400 when sent from the Lambda, and…received a 200 response 🤔

Sending HTTP requests from my machine with curl

OK, so I got a 200 response when sending the request over the Kubernetes overlay network, but what about across the internet like the Lambda does?

In fact, at this stage, I took a quick diversion to find out how exactly the Lambda was sending requests to the API. The Lambda was using a public DNS address for the API; something like https://api.company.com. I took a look at the DNS record for this name using dig:

dig api.company.com

The result showed me that the DNS name was CNAMEd directly to an AWS load balancer, and the load balancer was essentially just routing requests straight through to the Kubernetes cluster hosting the API. This meant that whether the request originated from the Lambda or from my machine, it would follow the same network route:

[Diagram: requests from both the Lambda and my machine following the same route, via the DNS name and the AWS load balancer, to the Kubernetes cluster.]
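As an aside, you don’t strictly need dig for this; Node’s built-in dns module can resolve the CNAME chain too. A quick sketch, with api.company.com standing in for the real name:

const dns = require('dns').promises;

// Resolve the CNAME record for the API's public name. For a setup like the
// one described above, this prints the AWS load balancer's DNS name.
dns.resolveCname('api.company.com')
  .then(console.log)
  .catch(console.error);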

💡 I checked the AWS configuration, and noticed that the ALB was configured to use a WAF as well.

Anyway, sending the request from my local machine to the API was the same as sending it from my netshoot pod:

curl https://api.company.com/route \
     -v \
     -H 'Content-Type: application/json' \
     -d '
{
  "spam": "eggs"
}'

This resulted in a 200 response as well. So, I got a 200 response when:

- sending the request from a pod in the same namespace, with curl, and
- sending the request from my own machine, with curl.

But, I got a 400 response when:

- sending the request from the Lambda.

Sending HTTP requests from my machine with NodeJS

Next, I put together a small NodeJS program which sent the HTTP request from my machine, in the same way as the Lambda did. It works with curl, so it will work with NodeJS too, right?
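The program was essentially the Lambda’s request code with the endpoint hard-coded; something like this sketch, with https://api.company.com/route standing in for the real endpoint:

const https = require('https');

// The same kind of payload the Lambda sends.
const payload = JSON.stringify({ spam: 'eggs' });

const req = https.request({
  hostname: 'api.company.com',
  method: 'POST',
  path: '/route',
  headers: { 'Content-Type': 'application/json' },
}, (res) => {
  console.log('status:', res.statusCode);
});

req.write(payload);
req.end();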

Wrong! ⛔

When I sent the request from NodeJS, I got a 400 response, exactly as the Lambda did. But how can that be? I’d put in lots of logging; the Lambda definitely sends the request body, doesn’t it?! 😅

Intercepting requests with netcat

Now I wanted to verify exactly what the NodeJS program was sending across the network, so I used netcat to intercept the TCP traffic locally. I ran netcat in listen mode as follows:

netcat -l -p 8000

This listens for TCP connections on port 8000, and writes whatever it receives to stdout.

Then, I configured the NodeJS program to send the request to http://localhost:8000. Sure enough, the request had a body as expected.
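Retargeting the program at the listener only took a couple of changes: swap the https module for http, and point it at port 8000. Roughly:

const http = require('http');

// The same request as before, but over plain HTTP to the local netcat listener.
const payload = JSON.stringify({ spam: 'eggs' });

const req = http.request({
  hostname: 'localhost',
  port: 8000,
  method: 'POST',
  path: '/route',
  headers: { 'Content-Type': 'application/json' },
});

// netcat never sends back a valid HTTP response, so the request eventually
// errors out when the connection closes; the interesting part is what netcat
// prints to stdout.
req.on('error', () => {});

req.write(payload);
req.end();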

I tried the same with curl and netcat; the bodies were the same! What could the difference possibly be?

What worked 🎊

After much ruminating, complaining to colleagues, and long walks to clear my head, I finally spotted an important difference between the requests sent by curl and NodeJS: curl was setting the Content-Length header, whereas NodeJS wasn’t; instead, it was setting Transfer-Encoding: chunked. It turns out this is the default behaviour of Node’s HTTP library whenever a request body is written without a Content-Length header. Manually setting the Content-Length header as follows seemed to fix the problem:

const https = require('https');

const uri = new URL(apiAddress);

// The actual request body here.
const payload = {
  'spam': 'eggs',
};
const json_payload = JSON.stringify(payload);

const options = {
  hostname: uri.hostname,
  method: 'POST',
  path: uri.pathname,
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(json_payload),
  },
};

const req = https.request(options, (res) => {
  // Code to handle the response here
});

req.write(json_payload);
req.end();

With this change deployed to the Lambda, payloads which would previously have resulted in a 400 response now returned a 200. Success! 🎉

Why it worked 💡

But the question still remained: why did using the Transfer-Encoding: chunked header cause intermittent 400s? When I investigated the ALB’s access logs, they showed small request sizes when the payload was sent from JavaScript, and large sizes when it was sent from curl.

It turns out that Django (which was used for the HTTP server implementation in the API), and WSGI in Python generally, don’t support chunked transfer encoding out-of-the-box. My guess is that smaller payloads were transmitted in a single chunk, and were handled gracefully by the API, whereas larger bodies were split across multiple chunks, and the API wasn’t configured to reassemble the stream; hence the 400 for what looked like an empty body.
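If you want to see the difference for yourself, a small Node server makes it visible; a minimal sketch which logs how each request’s body was framed, and reassembles the body from the stream of chunks (roughly the step the API was missing):

const http = require('http');

// Log whether the client framed the body with Content-Length or with
// Transfer-Encoding: chunked, then reassemble the body from the stream.
http.createServer((req, res) => {
  console.log('content-length:', req.headers['content-length']);
  console.log('transfer-encoding:', req.headers['transfer-encoding']);

  const chunks = [];
  req.on('data', (chunk) => chunks.push(chunk));
  req.on('end', () => {
    console.log('body:', Buffer.concat(chunks).toString());
    res.end();
  });
}).listen(8000);

Sending the curl request and the NodeJS request from earlier to http://localhost:8000 shows the two framings side by side.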

The main learning for me was that sending HTTP requests from NodeJS isn’t as straightforward as it seems, and that the default behaviour of using chunked encoding is something to watch out for!