Syncing Mobile Bug

The debugging process was quite difficult as the issue could not be replicated consistently. However, we finally found the issue and a simple fix resolved the bug. Below are the tech involved in causing this bug:

How the sync works

  1. Android device stores data in SQLite during offline mode
  2. SQLite files are synced to server when device is online
  3. Server processes SQLite files (VACUUM)
  4. Server sends processed file (sqlite.db) to another folder for processed data
  5. Server updates main database
  6. Files used in (3) are removed

To persist data, AWS S3 was used. s3fs-fuse was used to link folders in the server to buckets in S3, effectively making the folder work like a network file system.

Debugging

I once worked on a project where asynchronous syncing was required between mobile devices and server. The synced data would then be made available on a web portal. One day, a client reported that they were unable to see some data on the web portal they had included in the most recent sync. Strangely, some data from that sync session was available. We were able to reproduce the issue, but it did not happen consistently. Thus began the search for the bug...

The involved tech

The debugging process was quite difficult as the issue could not be replicated consistently. However, we finally found the issue and a simple fix resolved the bug. Below are the tech involved in causing this bug:

How the sync works

  1. Android device stores data in SQLite during offline mode
  2. SQLite files are synced to server when device is online
  3. Server processes SQLite files (VACUUM)
  4. Server sends processed file (sqlite.db) to another folder for processed data
  5. Server updates main database
  6. Files used in (3) are removed

To persist data, AWS S3 was used. s3fs-fuse was used to link folders in the server to buckets in S3, effectively making the folder work like a network file system.

Debugging

First we had to make sure the issue was real. We checked the processed SQLite files we had and the records in our main database, but they seemed to match. So far, looks like it might be a user issue, maybe they imagined entering and syncing the data?

But we decided it was a good idea to test out the full flow of syncing, just to be sure.

Try 1... no issues

Try 2... no issues

Try 3... MISSING DATA!

Ok, so it looks like there might really be an issue with our syncing procedure.

We inspected the Android app code and server code, and it looked fine.

Next, we intercepted the files received by the server and ran the processing commands on them manually, and compared them to what the server was doing automatically. And... we found that in the instances where data got missing, it was indeed being sent from the Android app, so something in the server was not working properly!

But if the manual command was effectively doing the same thing as the server, why would there be an issue? Hmm...

The issue

It turns out that there was no real issue with the server. There was a mix of things that caused the issue, starting with Android 9. It turns out that when the Android app was developed, it was running on Android 8, and recently, we had updated all the devices to use Android 9. The change that affected us was that Android 9 made SQLite use WAL mode instead of a rollback journal by default.

The default for Android 8 would use a rollback journal, which means there would be sqlite.db and sqlite-journal.db files, where sqlite.db contains the latest full data, and sqlite-journal.db contains a previous version of data that would be used in case of rollback, should there were problems in the sqlite.db file.

The default for Android 9 would be to use a Write Ahead Log (WAL), which includes sqlite.db and sqlite-wal.db files. The difference between this and the rollback journal is that sqlite.db contains older data, while changes are written to sqlite-wal.db and combined to sqlite.db periodically.

But this change alone should not cause issues in the server, since a VACUUM command is executed, which combines the data, hence the sqlite.db would contain the full data. However, this was not the case in our earlier investigation, as some data was actually missing from the sqlite.db files, compared to what was coming from the Android app.

It turns out the main culprit was S3 and s3fs-fuse.

Since s3fs-fuse maps folders to buckets in S3, behind the scenes, it needs to query and write data with S3. Whenever data from these folders are accessed, e.g. ls, s3fs-fuse query S3 for the list of files to show, and similarly if a specific file is accessed.

Whenever sqlite.db, sqlite-journal.db or sqlite-wal.db was in the folder, S3 would have a get a copy. When VACUUM is done, sqlite.db is updated, and is written in S3. When sqlite.db is accessed again to update the main database, S3 is queried for sqlite.db. This seems fine, except for the fact that at the time, S3 had Eventual Consistency, which at that time meant that when data is updated, 2 out of 3 availability zones would be updated for it to be considered a success. When sqlite.db is requested the second time to update the main database, it queries S3 for the file. Due to eventual consistency, occasionally s3fs-fuse gets the data from the non-consistent S3. In this case, the sqlite.db which had not been merged with the write ahead log.

This was how the data went missing.

The fix

After figuring out the cause, the fix was simple. Instead of doing the VACUUM in the S3 linked folder, we first moved the files to a non S3-linked folder, VACUUM-ed and used sqlite.db from there.

Conclusion

Fortunately we were able to find and fix the issue. Data loss is definitely something that would make an application unusable. We must be very aware of timing issues such as race conditions and in this case, data consistency behaviors, which can cause bugs that are very difficult to mitigate. As a note, this issue would probably not occur at present since AWS announced automatic strong consistency for S3, which would have mitigated this issue.