HTTP Content Length

A few weeks ago, I had a problem where JSON that I sent from one of my services to another could not be parsed properly as JSON. It was quite an interesting bug in the code, so I decided to write something about it.

Running the examples

I created a repo to demonstrate the bug, and some examples to explain it. Run the following steps if you would like to follow along. The example project uses "node": ">=16.13.0" and "npm": ">=9.1.2".

git clone https://github.com/stonefruit/example-content-length
cd example-content-length
npm ci
node index.js

The bug

Run this curl

curl localhost:3000/example-main-broken

and this error will appear

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>SyntaxError: Unexpected end of JSON input<br> &nbsp; &nbsp;at JSON.parse
...
...
...

Tracing through the code, we can see that some data is being stringified and sent by a POST request to /endpoint.

app.get('/example-main-broken', async (req, res) => {
  const data = JSON.stringify({ data: `stonefruit’s data` })
  const result = await request.post(data, 'application/json', data.length)
  res.send(result)
})

Function signature of request.post:

/**
 * @param {string} data
 * @param {'text/html' | 'application/json'} contentType
 * @param {number} contentLength
 * @returns {Promise<string>}
 */
const post = async (data, contentType, contentLength) => {...}

The data sent seems simple, but the receiving /endpoint does not seem to know how to parse this data. Why is this so?

Example 1: a simple text request

Run this curl

curl localhost:3000/example1

and this is the response

Parsed body: stonefruit's data
Received content-length header: 17

The code for example 1 is:

app.get('/example1', async (req, res) => {
  const data = `stonefruit's data`
  const result = await request.post(data, 'text/html', data.length)
  res.send(result)
})

It seems this request works! You might have noticed that I used Content-Type: text/html here. I do this for the examples leading up to the fix, for ease of explanation. Note that the content-length is 17, which is the same as the number of characters in "stonefruit's data".

Example 2: modifying the content-length

Run this curl

curl localhost:3000/example2

and this is the response

Parsed body: s
Received content-length header: 1

In this example, we have used 1 as the last parameter, which sets the content-length for the POST request.

const result = await request.post(data, 'text/html', 1)

Hence, only 1 byte of data is transmitted, and this ends up being the letter s.
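To make the effect of content-length concrete, here is a minimal sketch that mimics it without any network involved. The helper simulateReceive is hypothetical (it is not part of the example repo), and it assumes the receiver simply stops reading the body after content-length bytes.

```javascript
// Hypothetical helper: mimics a receiver that reads exactly
// `contentLength` bytes of the body and ignores the rest.
function simulateReceive(body, contentLength) {
  const bytes = Buffer.from(body, 'utf8')
  return bytes.subarray(0, contentLength).toString('utf8')
}

const data = "stonefruit's data"

console.log(simulateReceive(data, 1)) // s
console.log(simulateReceive(data, data.length)) // stonefruit's data
```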

Example 3: revealing the culprit

Run this curl

curl localhost:3000/example3

and this is the response

Parsed body: stonefruit’s da
Received content-length header: 17

The result seems to be missing some letters at the end, even though the content-length appears to match the number of characters.

This is the code used for example 3:

app.get('/example3', async (req, res) => {
  const data = `stonefruit’s data`
  const result = await request.post(data, 'text/html', data.length)
  res.send(result)
})

Looks the same as example 1, doesn't it? Can you spot the difference between the two?

app.get('/example1', async (req, res) => {
  const data = `stonefruit's data`
  const result = await request.post(data, 'text/html', data.length)
  res.send(result)
})

The difference is in stonefruit’s data vs stonefruit's data, specifically ’ in example 3 and ' in example 1. These apostrophes are actually different characters! The one used in example 1 is the typical single quote you find on your keyboard, and it is part of ASCII. Example 3 uses a lookalike, the right single quotation mark (U+2019), which is not an ASCII character and needs more than one byte when encoded as UTF-8.

ASCII has a small set of characters, and each character uses 1 byte. UTF-8 can encode a much larger set of characters, of which ASCII is a subset, and uses 1 to 4 bytes per character. The ’ in example 3 actually uses 3 bytes, so the correct content-length should be 19 instead of 17. With a content-length of 17, the string loses its last 2 characters, ta, as they are ASCII characters and use 1 byte each.
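The byte arithmetic can be checked directly in Node. This is a small sketch, not code from the example repo. It writes the apostrophe as the escape \u2019 so that the character is unambiguous, and the variable names are my own.

```javascript
const data = 'stonefruit\u2019s data'

// The curly apostrophe on its own: three bytes in UTF-8
const apostropheBytes = Buffer.from('\u2019', 'utf8')
console.log(apostropheBytes.toString('hex')) // e28099

// Bytes needed by each character of the string, in order
const bytesPerChar = [...data].map((ch) => Buffer.byteLength(ch, 'utf8'))
console.log(bytesPerChar.join(' ')) // 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1

// 16 one-byte characters plus one three-byte character
const totalBytes = bytesPerChar.reduce((sum, n) => sum + n, 0)
console.log(data.length, totalBytes) // 17 19
```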

Example 4: Using the correct content-length

Run this curl

curl localhost:3000/example4

and this is the response

Parsed body: stonefruit’s data
Received content-length header: 19

The data sent is the same as in example 3, but now we can see the full string and correct content-length.

What was the difference between the code that enabled this?

// Example 3
data.length

// Example 4
Buffer.byteLength(data)

In example 3, a naive count using data.length is used. This is only accurate if all the characters are ASCII, since they would use 1 byte each. However, the sneaky 3-byte apostrophe in the string still counts as 1 character, so data.length is less than the actual number of bytes. Buffer.byteLength(data) counts the actual number of bytes used by the string when it is encoded as UTF-8, which is its default encoding.
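As a rough illustration of how far the two counts can drift apart, the sketch below compares them for a few characters. The helper measure is hypothetical. Note that .length counts UTF-16 code units, so for some characters (such as emoji) it is not even the number of visible characters.

```javascript
// Hypothetical helper comparing the two ways of "measuring" a string.
function measure(str) {
  return {
    length: str.length, // UTF-16 code units
    bytes: Buffer.byteLength(str, 'utf8'), // bytes once encoded as UTF-8
  }
}

console.log(measure('a')) // { length: 1, bytes: 1 }
console.log(measure('\u00e9')) // é: { length: 1, bytes: 2 }
console.log(measure('\u2019')) // ’: { length: 1, bytes: 3 }
console.log(measure('\u{1F600}')) // 😀: { length: 2, bytes: 4 }

// Where Buffer is not available, TextEncoder gives the same byte count.
console.log(new TextEncoder().encode('\u2019').length) // 3
```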

The fix

Run this curl

curl localhost:3000/example-main-fixed

and this is the response

Parsed body: [object Object]
Received content-length header: 30

In the broken version, the request body goes through body-parser, where JSON.parse() is used to turn it into a JavaScript object. However, since we were originally using data.length, the last 2 bytes were missing, so the stringified JSON lacked its closing "}. That made it invalid JSON, hence the error.

After changing this to Buffer.byteLength(data), body-parser is able to do its job properly, and we do not get an error anymore.

This is what the code looks like:

app.get('/example-main-fixed', async (req, res) => {
  const data = JSON.stringify({ data: `stonefruit’s data` })
  const result = await request.post(
    data,
    'application/json',
    Buffer.byteLength(data)
  )
  res.send(result)
})
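To see why the broken version failed while this one works, the truncation can be reproduced without a server. This is a sketch under the assumption that the receiver parses only the first content-length bytes of the body. The exact error message from JSON.parse varies between Node versions, so the sketch only checks the error type.

```javascript
const data = JSON.stringify({ data: 'stonefruit\u2019s data' })
const bytes = Buffer.from(data, 'utf8')
console.log(data.length, bytes.length) // 28 30

// What the receiver sees when content-length is data.length (28)
const truncated = bytes.subarray(0, data.length).toString('utf8')
console.log(truncated) // {"data":"stonefruit’s data

let parseError = null
try {
  JSON.parse(truncated)
} catch (err) {
  parseError = err
}
console.log(parseError instanceof SyntaxError) // true

// With all Buffer.byteLength(data) bytes (30), parsing succeeds
const complete = bytes.subarray(0, Buffer.byteLength(data)).toString('utf8')
const parsed = JSON.parse(complete)
console.log(parsed.data === 'stonefruit\u2019s data') // true
```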

Takeaway

Some legacy code may not use convenient libraries like axios, which handle things like content-length for you. If you have to set it yourself, remember that content-length counts bytes, not characters. Your own code may only contain ASCII characters, but user input or data stored in a database that you forward may not, and that will cause the issue shown here.

The difficulty I had debugging this was not so much technical. It was that ASCII single quotes are not always what is used as apostrophes, and when people copy and paste text from the internet, a lookalike slipping in is probably a common occurrence.
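One way to make this mistake harder to repeat is to never pass a length by hand and derive it from the body instead. The sketch below is only an illustration: buildPostHeaders is a hypothetical helper, not the request.post function from the example repo, and it assumes the body is sent as UTF-8.

```javascript
// Hypothetical helper: derives Content-Length from the body itself,
// so callers cannot supply a character count by mistake.
function buildPostHeaders(body, contentType) {
  return {
    'Content-Type': contentType,
    'Content-Length': Buffer.byteLength(body, 'utf8'),
  }
}

const body = JSON.stringify({ data: 'stonefruit\u2019s data' })
const headers = buildPostHeaders(body, 'application/json')
console.log(headers['Content-Length']) // 30
```

The resulting object could then be passed as the headers option of Node's http.request, with the same body string written to the request.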