[SOLVED] 2019-09-19 Schema and Content is Gone: Post Mortem

Sebastian · September 19, 2019, 8:44am

This morning multiple users reported that their schema and content were gone.

In theory this should not be possible, for the following reasons:

All changes to the system are stored as events and the state can always be derived from the events. Events cannot be deleted. Deletions are just other events, e.g. AppArchivedEvent. This concept is called event sourcing.
The MongoDB databases are hosted in a replica set with 3 members for higher availability.
We have daily backups of the Mongo database in case something goes wrong.

What actually happened: the schemas and content were still in the database, but they were not shown in the UI or returned by the API, because some indices were corrupt after an improvement to the indexing system.

About event sourcing

Because it is not very efficient to query all events for a particular entity like a schema or app, Squidex also stores snapshots of the current state in the database. These snapshots are mandatory for contents and assets, because they are used for queries, but they are optional for schemas, apps and rules. If there is no snapshot for one of these entities, we can still derive the state from the events on the fly.
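The fallback described above can be sketched roughly as follows. This is a minimal illustration and not the actual Squidex code: the event types, `SchemaState` and `SchemaLoader` are hypothetical names, and plain dictionaries stand in for the MongoDB collections.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stores standing in for the event and snapshot collections.
var events = new Dictionary<string, List<SchemaEvent>>
{
    ["blog"] = new()
    {
        new SchemaCreated("blog"),
        new FieldAdded("title"),
        new FieldAdded("text"),
    },
};
var snapshots = new Dictionary<string, SchemaState>(); // no snapshot stored for "blog"

var state = SchemaLoader.Load("blog", snapshots, events);
Console.WriteLine($"{state.Name}: {string.Join(", ", state.Fields)}"); // prints "blog: title, text"

public abstract record SchemaEvent;
public sealed record SchemaCreated(string Name) : SchemaEvent;
public sealed record FieldAdded(string Field) : SchemaEvent;

public sealed record SchemaState(string Name, IReadOnlyList<string> Fields)
{
    public static readonly SchemaState Empty = new("", Array.Empty<string>());

    // Applying an event returns a new state; unknown events leave the state unchanged.
    public SchemaState Apply(SchemaEvent @event) => @event switch
    {
        SchemaCreated created => this with { Name = created.Name },
        FieldAdded added => this with { Fields = new List<string>(Fields) { added.Field } },
        _ => this,
    };
}

public static class SchemaLoader
{
    public static SchemaState Load(
        string id,
        IReadOnlyDictionary<string, SchemaState> snapshots,
        IReadOnlyDictionary<string, List<SchemaEvent>> events)
    {
        // Fast path: use the snapshot when one exists.
        if (snapshots.TryGetValue(id, out var snapshot))
        {
            return snapshot;
        }

        // Slow path: replay every event of the entity to derive the current state.
        var state = SchemaState.Empty;
        if (events.TryGetValue(id, out var stream))
        {
            foreach (var e in stream)
            {
                state = state.Apply(e);
            }
        }
        return state;
    }
}
```

The point of the sketch is that a missing snapshot is not data loss: as long as the events exist, the state can be rebuilt.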

What happened?

1. Populating the indexes, first attempt

To populate the indexes after the improvement, a migration script was used that worked like this:

foreach (var app in appsFromDatabase)
{
	Index(app);
}

But this did not work properly. The reason was that some schemas and apps did not have a snapshot in the database, very likely because of an old migration where the snapshots were deleted after a major change.

So the index was not consistent after the migration and some schemas were not found.

2. The bug in the storage system

For this kind of error I have a tool that recreates all snapshots, but it did not work properly either, for the following reason:

Custom roles are stored in MongoDB as a document:

"roles": {
	"roleName": { ... }
}

But one user has a role whose name contains characters that are not allowed in MongoDB property names, so the tool stopped when this app threw an error.
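To illustrate the problem, here is a small sketch. It assumes the offending characters were a dot or a leading dollar sign, which MongoDB has historically restricted in field names; the post does not say which characters were actually involved. `RoleStorage`, `RoleEntry` and the permission strings are hypothetical. One way to avoid the restriction is to store roles as a list of entries, so the role name becomes a value instead of a property name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical roles; the permission strings are made up for the example.
var roles = new Dictionary<string, string[]>
{
    ["Editor"] = new[] { "contents.read" },
    ["team.lead"] = new[] { "contents.update" }, // the dot makes this unsafe as a property name
};

foreach (var name in roles.Keys)
{
    Console.WriteLine($"{name}: safe as property name = {RoleStorage.IsSafeKey(name)}");
}

// Storing the name as a value sidesteps the restriction entirely.
var entries = RoleStorage.ToEntries(roles);
Console.WriteLine($"{entries.Count} role entries");

public sealed record RoleEntry(string Name, string[] Permissions);

public static class RoleStorage
{
    // Assumed rule set: no empty names, no leading '$', no '.', no null character.
    public static bool IsSafeKey(string key) =>
        key.Length > 0 && !key.StartsWith('$') && !key.Contains('.') && !key.Contains('\0');

    // Converts the name-keyed map into a list of entries that carry the name as data.
    public static List<RoleEntry> ToEntries(IReadOnlyDictionary<string, string[]> roles) =>
        roles.Select(pair => new RoleEntry(pair.Key, pair.Value)).ToList();
}
```

Escaping the unsafe characters before writing would be an alternative, at the cost of having to unescape on every read.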

3. Populating the indexes, second attempt

I then improved the script that populates the indexes. Instead of querying the snapshots, it now goes over all events. This solves the problem that some snapshots might not be available.
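The idea can be sketched like this. It is only an illustration under assumptions: the event types, the shape of the index (app id to schema id to schema name) and `IndexBuilder` are hypothetical and much simpler than the real index.

```csharp
using System;
using System.Collections.Generic;

// A hypothetical event log; in the real system the events come from the event store.
var log = new IndexEvent[]
{
    new SchemaAdded("app-1", "s1", "blog"),
    new SchemaAdded("app-1", "s2", "pages"),
    new SchemaRemoved("app-1", "s2"),
};

var index = IndexBuilder.Rebuild(log);
Console.WriteLine(string.Join(", ", index["app-1"].Values)); // prints "blog"

public abstract record IndexEvent(string AppId, string SchemaId);
public sealed record SchemaAdded(string AppId, string SchemaId, string Name) : IndexEvent(AppId, SchemaId);
public sealed record SchemaRemoved(string AppId, string SchemaId) : IndexEvent(AppId, SchemaId);

public static class IndexBuilder
{
    // Result: app id -> (schema id -> schema name), built from events only.
    public static Dictionary<string, Dictionary<string, string>> Rebuild(IEnumerable<IndexEvent> events)
    {
        var index = new Dictionary<string, Dictionary<string, string>>();

        foreach (var e in events)
        {
            if (!index.TryGetValue(e.AppId, out var schemas))
            {
                schemas = new Dictionary<string, string>();
                index[e.AppId] = schemas;
            }

            switch (e)
            {
                case SchemaAdded added:
                    schemas[added.SchemaId] = added.Name;
                    break;
                case SchemaRemoved removed:
                    schemas.Remove(removed.SchemaId);
                    break;
            }
        }

        return index;
    }
}
```

Because the index is computed from the events alone, it no longer matters whether a snapshot exists for an app or schema.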

What will be changed?

Improve the tooling: The migrations should not stop when one app has a problem.
Improve the CI pipeline: The builds are very slow at the moment, which increased the time to fix the problem.
Automatically populate the snapshot: When a snapshot does not exist but an app, schema or rule is loaded successfully from memory, the system should try to recreate the snapshot when possible.
More testing: The migration was tested several times, but only with test data. Migrations must always be tested with the production database.
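For the tooling improvement, a migration loop that keeps going when one app fails could look roughly like this. It is a sketch, not the actual tool: `Migration.RunForAll` and the app names are hypothetical, and the per-app work is passed in as a delegate.

```csharp
using System;
using System.Collections.Generic;

var apps = new[] { "app-1", "broken-app", "app-3" };

var failures = Migration.RunForAll(apps, app =>
{
    // Simulated failure for one app, e.g. an invalid role name.
    if (app == "broken-app")
    {
        throw new InvalidOperationException("Invalid role name.");
    }
    Console.WriteLine($"Migrated {app}");
});

foreach (var (app, error) in failures)
{
    Console.WriteLine($"Failed {app}: {error.Message}");
}

public static class Migration
{
    // Runs the migration for every app and collects failures instead of stopping at the first one.
    public static List<(string App, Exception Error)> RunForAll(IEnumerable<string> apps, Action<string> migrate)
    {
        var failures = new List<(string App, Exception Error)>();

        foreach (var app in apps)
        {
            try
            {
                migrate(app);
            }
            catch (Exception ex)
            {
                failures.Add((app, ex));
            }
        }

        return failures;
    }
}
```

The failed apps are reported at the end, so a single broken app can be fixed separately while all other apps are already migrated.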