HTTP Content Length
A few weeks ago, I had a problem where JSON that I sent from one of my services to another could not be parsed properly as JSON. It was quite an interesting bug in the code, so I decided to write something about it.
Running the examples
I created a repo to demonstrate the bug, along with some examples to explain it. Run
the following steps if you would like to follow along. The example project uses
"node": ">=16.13.0" and "npm": ">=9.1.2".
git clone https://github.com/stonefruit/example-content-length
cd example-content-length
npm ci
node index.js
The bug
Run this curl
curl localhost:3000/example-main-broken
and this error will appear
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>SyntaxError: Unexpected end of JSON input<br> at JSON.parse
...
...
...
Tracing through the code, we can see that some data is being stringified and
sent by a POST request to /endpoint.
app.get('/example-main-broken', async (req, res) => {
  const data = JSON.stringify({ data: `stonefruit’s data` })
  const result = await request.post(data, 'application/json', data.length)
  res.send(result)
})
Function signature of request.post:
/**
 * @param {string} data
 * @param {'text/html' | 'application/json'} contentType
 * @param {number} contentLength
 * @returns {Promise<string>}
 */
const post = async (data, contentType, contentLength) => {...}
The data sent seems simple, but the receiving /endpoint does not seem to know
how to parse this data. Why is this so?
Example 1: a simple text request
Run this curl
curl localhost:3000/example1
and this is the response
Parsed body: stonefruit's data
Received content-length header: 17
The code for example 1 is:
app.get('/example1', async (req, res) => {
  const data = `stonefruit's data`
  const result = await request.post(data, 'text/html', data.length)
  res.send(result)
})
This request works! You might have noticed that I used Content-Type: text/html
here. The examples leading up to the fix all use it for ease of explanation.
Note that the content-length is 17, which is the same as the number of
characters in "stonefruit's data".
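A quick check in Node shows why this case works: for an ASCII-only string, the character count and the byte count agree. This is only a sketch to illustrate the point; the variable names are mine and do not come from the example repo.

```javascript
// ASCII-only string: every character encodes to exactly 1 byte in UTF-8.
const asciiData = "stonefruit's data"

const charCount = asciiData.length // what data.length reports
const byteCount = Buffer.byteLength(asciiData, 'utf8') // bytes in the body

console.log(charCount, byteCount) // 17 17
```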
Example 2: modifying the content-length
Run this curl
curl localhost:3000/example2
and this is the response
Parsed body: s
Received content-length header: 1
In this example, we have used 1 as the last parameter, which sets the
content-length for the POST request.
const result = await request.post(data, 'text/html', 1)
Hence, only 1 byte of data is transmitted, and this ends up being the letter s.
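The effect can be imitated without a server. The sketch below assumes a receiver that keeps exactly content-length bytes of the body and decodes them as UTF-8; readBody is a hypothetical helper of mine, not a function from the example repo or from body-parser.

```javascript
// Hypothetical helper: mimic a receiver that keeps only the first
// `contentLength` bytes of the body and decodes them as UTF-8.
const readBody = (body, contentLength) =>
  Buffer.from(body, 'utf8').subarray(0, contentLength).toString('utf8')

console.log(readBody("stonefruit's data", 1)) // s
console.log(readBody("stonefruit's data", 17)) // stonefruit's data
```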
Example 3: revealing the culprit
Run this curl
curl localhost:3000/example3
and this is the response
Parsed body: stonefruit’s da
Received content-length header: 17
The parsed body is missing some letters at the end, even though the
content-length matches the number of characters in the string.
This is the code used for example 3:
app.get('/example3', async (req, res) => {
  const data = `stonefruit’s data`
  const result = await request.post(data, 'text/html', data.length)
  res.send(result)
})
Looks the same as example 1, doesn't it? Can you spot the difference between the two?
app.get('/example1', async (req, res) => {
  const data = `stonefruit's data`
  const result = await request.post(data, 'text/html', data.length)
  res.send(result)
})
The difference is in stonefruit’s data vs stonefruit's data, specifically ’ in
example 3 and ' in example 1. These apostrophes are actually different
characters! The one used in example 1 is the typical single quote you find on
your keyboard, and it is part of ASCII. Example 3 uses a lookalike, the right
single quotation mark (U+2019), which is not an ASCII character but a Unicode
character outside the ASCII range.
ASCII has a small set of characters, and each character uses 1 byte. UTF-8 can
encode a much larger set of characters, of which ASCII is a subset, and uses 1
to 4 bytes per character. The ’ in example 3 takes 3 bytes, so the correct
content-length is 19 instead of 17. With a content-length of 17, the string
loses its last 2 characters, ta, since they are ASCII characters and use 1 byte
each.
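The byte arithmetic can be checked directly in Node. The snippet below is a sketch of mine, not code from the example repo; it writes the curly apostrophe as the escape \u2019 so that it cannot be confused with the ASCII one.

```javascript
const ascii = "'" // U+0027, the ASCII apostrophe
const curly = '\u2019' // U+2019, right single quotation mark

console.log(Buffer.from(ascii, 'utf8')) // <Buffer 27>
console.log(Buffer.from(curly, 'utf8')) // <Buffer e2 80 99>

const text = `stonefruit${curly}s data`
const bytes = Buffer.from(text, 'utf8')
console.log(text.length, bytes.length) // 17 19

// Keeping only the first 17 bytes drops the trailing "ta".
const truncated = bytes.subarray(0, 17).toString('utf8')
console.log(truncated) // stonefruit’s da
```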
Example 4: Using the correct content-length
Run this curl
curl localhost:3000/example4
and this is the response
Parsed body: stonefruit’s data
Received content-length header: 19
The data sent is the same as in example 3, but now we can see the full string
and correct content-length.
What was the difference between the code that enabled this?
// Example 3
data.length
// Example 4
Buffer.byteLength(data)
In example 3, a naive count using data.length is used. This is only accurate
if all the characters are ASCII, since they would use 1 byte each. However,
the sneaky 3-byte apostrophe in the string still counts as 1 character, so
data.length is less than the actual number of bytes. Using
Buffer.byteLength(data) counts the actual number of bytes used by the string.
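To make the comparison concrete, here is a small sketch that measures a few strings both ways. The measure helper is hypothetical and exists only for this illustration. TextEncoder is included as an alternative for environments without Buffer, and the last sample shows that data.length counts UTF-16 code units, so it can differ from the byte count in more than one way.

```javascript
// Hypothetical helper: measure a string three ways.
const measure = (s) => ({
  codeUnits: s.length, // UTF-16 code units, what data.length reports
  bytes: Buffer.byteLength(s, 'utf8'), // bytes once encoded as UTF-8
  viaEncoder: new TextEncoder().encode(s).length, // same count without Buffer
})

console.log(measure("stonefruit's data")) // { codeUnits: 17, bytes: 17, viaEncoder: 17 }
console.log(measure('stonefruit\u2019s data')) // { codeUnits: 17, bytes: 19, viaEncoder: 19 }
console.log(measure('\u{1F351}')) // { codeUnits: 2, bytes: 4, viaEncoder: 4 }
```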
The fix
Run this curl
curl localhost:3000/example-main-fixed
and this is the response
Parsed body: [object Object]
Received content-length header: 30
In the broken version, the request body goes through body-parser, where
JSON.parse() is used to turn it into a JavaScript object. However, since we
were originally using data.length, the last 2 bytes were cut off and the
stringified JSON lost its closing "}. That left it invalid to parse, hence
the error.
After changing this to Buffer.byteLength(data), body-parser is able to do
its job properly, and we do not get an error anymore.
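The failure and the fix can both be reproduced without a server. This is a sketch under the assumption that the receiver keeps exactly content-length bytes before parsing; tryParse is a hypothetical helper of mine and not part of body-parser.

```javascript
const json = JSON.stringify({ data: 'stonefruit\u2019s data' })
const jsonBytes = Buffer.from(json, 'utf8')
console.log(json.length, jsonBytes.length) // 28 30

// Hypothetical helper: keep `contentLength` bytes, then try to parse them.
const tryParse = (contentLength) => {
  const body = jsonBytes.subarray(0, contentLength).toString('utf8')
  try {
    return { ok: true, value: JSON.parse(body) }
  } catch (err) {
    return { ok: false, error: err.name, body }
  }
}

const broken = tryParse(json.length) // 28 bytes: the closing "} is cut off
const fixed = tryParse(jsonBytes.length) // 30 bytes: the whole document

console.log(broken.ok, broken.error) // false SyntaxError
console.log(fixed.ok, fixed.value.data) // true stonefruit’s data
```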
This is what the code looks like:
app.get('/example-main-fixed', async (req, res) => {
  const data = JSON.stringify({ data: `stonefruit’s data` })
  const result = await request.post(
    data,
    'application/json',
    Buffer.byteLength(data)
  )
  res.send(result)
})
Takeaway
Some legacy code may not use convenient libraries like axios, which handle
things like content-length for you. If you have to set it yourself, keep in
mind that the content may contain characters that take more than 1 byte each.
Your own code might only use ASCII characters, but user input or data stored
in the database may not, and forwarding it will cause the issue shown here.
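If you do have to write the sender by hand, one way to avoid the mismatch is to derive the header from the body instead of accepting it as a parameter. The following is a minimal sketch using Node's built-in http module, not the implementation from the example repo. buildHeaders and post are hypothetical names, and the host, port and path are placeholders borrowed from the examples above.

```javascript
const http = require('http')

// Hypothetical helper: compute the headers from the body itself, so the
// declared length always matches the number of bytes that will be sent.
const buildHeaders = (body, contentType) => ({
  'Content-Type': contentType,
  'Content-Length': Buffer.byteLength(body, 'utf8'),
})

// Hypothetical sender: resolves with the response body as a string.
const post = (body, contentType) =>
  new Promise((resolve, reject) => {
    const options = {
      host: 'localhost', // placeholder
      port: 3000, // placeholder
      path: '/endpoint',
      method: 'POST',
      headers: buildHeaders(body, contentType),
    }
    const req = http.request(options, (res) => {
      const chunks = []
      res.on('data', (chunk) => chunks.push(chunk))
      res.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')))
    })
    req.on('error', reject)
    req.end(body, 'utf8') // send the body with the same encoding we measured
  })
```

Because the caller never passes a length, there is no way to hand in data.length by mistake.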
The difficulty I had debugging this was not so much technical. It was realizing that apostrophes are not always ASCII single quotes, which is probably a common occurrence when people copy and paste text from the internet.