Replace MongoDB with time-travel DB

Would it be advantageous to replace MongoDB with a database that has native time-traveling, git-like features, such as Crux or Irmin? Or would a lot of the CQRS code stay unchanged, since much of it still revolves around events?

Perhaps? I have never heard of them. The biggest question is usually around query capabilities.

But the CQRS code is also about handling events. It is like an oplog in a database that is used to fill secondary indices, such as full-text Lucene indices.

I just saw this thread you started about versions: Content Revisions / Versions when I was looking for a way to query specific historic content versions.

This is very relevant.

So in that thread you’re getting into the data structure that would be needed to support versions. From the proposed description it looks quite complex. I think this is because a multidimensional structure (one piece of content with different versions) is being modeled on top of a flat structure (a plain Mongo document).

This can be solved more elegantly by using Irmin (I’m not affiliated, but I ran into this issue myself while trying to add version control for sheet music). With Irmin you would write all versions of a piece of content to the same document path. Every time you do that, you get back a commit hash (like in Git). Squidex would keep track of the commit hashes of each content item’s versions. The published “current version” of a piece of versioned content is then just a hash reference to a specific commit of the Irmin document, which you can then (batch) retrieve using GraphQL.
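Roughly what I have in mind, as a minimal OCaml sketch against Irmin’s in-memory store. The module and function names follow the Irmin 2.x tutorial and may differ in other versions, and the content path and JSON are made up:

```ocaml
(* Sketch: write two versions of a content item to the same path and
   record the resulting commit hashes. Irmin 2.x-style API, names may
   differ in other versions. *)
open Lwt.Infix

module Store = Irmin_mem.KV (Irmin.Contents.String)

let info msg () = Irmin.Info.v ~date:0L ~author:"squidex" msg

(* Write one version and return the hash of the commit it created. *)
let write_version t path json =
  Store.set_exn t ~info:(info "new content version") path json >>= fun () ->
  Store.Head.get t >|= Store.Commit.hash

let main =
  Store.Repo.v (Irmin_mem.config ()) >>= Store.master >>= fun t ->
  let path = [ "content"; "article-42" ] (* hypothetical content id *) in
  write_version t path {|{"title":"Draft"}|}         >>= fun v1 ->
  write_version t path {|{"title":"Draft, edited"}|} >>= fun v2 ->
  (* Squidex would persist v1/v2 and mark one of them as published. *)
  Lwt.return (v1, v2)

let () = ignore (Lwt_main.run main)
```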

That is also my response to your statement about query capabilities. The Irmin data can be queried using GraphQL, where you can specify the version (branch) it should use. See: https://irmin.org/tutorial/graphql.

On the technical side we will get two collections in MongoDB for content items. One collection will contain the data object of the last version and will be used for the Management API or if you want to get all content items, even if they are not published. The other collection will only contain published items.

Using a branchable time-travel DB like Irmin would avoid this collection duplication, since you only need a single document path to write each content item to. You can happily overwrite the content with new versions and track the commit hashes, which can then be displayed in the list from your mockup.

Another plus is that the fields of the content are deduplicated across versions/commits, because the values are stored as atomic entries in a plain key:value store, like in Datomic. This means storage size will not grow linearly when creating new versions of the same content.
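That deduplication is essentially content addressing: every value is keyed by its hash, so two versions that share a field value store it only once. A toy illustration of the idea (not Irmin code):

```ocaml
(* Toy content-addressed store: values are keyed by their hash, so the
   same field value appearing in many versions is stored exactly once. *)
let blobs : (string, string) Hashtbl.t = Hashtbl.create 16

(* Store a value and return its key (the hash). *)
let put value =
  let key = Digest.to_hex (Digest.string value) in
  if not (Hashtbl.mem blobs key) then Hashtbl.add blobs key value;
  key

let () =
  (* Two versions of the same content item; only the title changed. *)
  let v1 = [ put {|"My article"|}; put {|"A very long body ..."|} ] in
  let v2 = [ put {|"My article, edited"|}; put {|"A very long body ..."|} ] in
  (* The body blob is shared: 4 references, but only 3 stored values. *)
  assert (List.length v1 + List.length v2 = 4);
  assert (Hashtbl.length blobs = 3)
```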

This means that it will not be possible to get both the last published data object and the last data object for a content item.

When using Irmin it would be possible to get both, if you want. You would just query the same data path at each of the commit hashes you track.
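Continuing the write sketch above: given the commit hashes Squidex tracks, reading a historic version is just resolving the hash to a commit and reading the same path from it. Again this assumes the Irmin 2.x-style names, so treat the exact calls as an assumption:

```ocaml
(* Sketch: read the value of a path as it was at a given commit hash.
   Uses the same hypothetical Store module as in the write sketch above. *)
open Lwt.Infix

let read_version repo hash path =
  Store.Commit.of_hash repo hash >>= function
  | None -> Lwt.fail_with "unknown commit hash"
  | Some commit ->
      (* Pin a read-only view of the store to that commit. *)
      Store.of_commit commit >>= fun t ->
      Store.get t path
```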

What are the requirements for the query capability of a DB so it could be used in this context?

P.S. Awesome work with Squidex. I am evaluating all the (OSS & non-OSS) headless GraphQL CMSes, and Squidex is very impressive and underrated.

Hi,

In Squidex every item has versioning anyway because of CQRS, and you can very easily restore a specific state by applying all events up to a given point in time. This is also done when you want to retrieve an older version. The described feature would add another level of versioning: as a user you create new versions/revisions/branches explicitly.
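In other words, the replay is just a left fold over the event stream with a cut-off. This is only a sketch, not the actual Squidex code; the event and state types here are made up:

```ocaml
(* Sketch of CQRS-style point-in-time reconstruction: start from an empty
   state and apply every event up to the requested timestamp. The types
   and [apply] are hypothetical, not Squidex's real ones. *)
type event = { timestamp : float; payload : string }
type state = { fields : (string * string) list }

let empty = { fields = [] }

(* How a single event mutates the state; the details don't matter here. *)
let apply state event = { fields = ("last", event.payload) :: state.fields }

(* Restore the state of a content item as it was at [until]. *)
let restore_at ~until events =
  events
  |> List.filter (fun e -> e.timestamp <= until)
  |> List.fold_left apply empty
```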

The challenges are the following, and they are also the reason why I have not implemented it in this way:

  1. At the moment the content objects also live in memory because of the actor model. But the data of a content item can become really big; I have seen examples with 5 MB. Therefore an unlimited number of branches per content item would be a problem. This could be solved by “outsourcing” the data to another structure.

  2. The goal of the events is to derive as much information as possible from the events themselves. When you want to know which branch is the published one, you have a problem. Let’s assume you have multiple published branches. The newest of these branches is exposed as the default in the API, so when this branch gets deleted you have to find another published branch, and to know that you have to store all other published branches in an event consumer (see the sketch below). This is a performance issue for event consumers like secondary indexes.
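To make the second point concrete, here is roughly the state an event consumer would have to carry just to know which branch to expose after a delete. The event names are hypothetical, not real Squidex events:

```ocaml
(* Sketch: a consumer that wants to expose "the newest published branch"
   cannot derive it from a single event; it has to carry the whole set of
   published branches per content item. Event names are made up. *)
module StringSet = Set.Make (String)

type event =
  | BranchPublished of { content_id : string; branch : string }
  | BranchDeleted of { content_id : string; branch : string }

(* content_id -> set of currently published branches *)
type state = (string, StringSet.t) Hashtbl.t

let current (state : state) content_id =
  Option.value (Hashtbl.find_opt state content_id) ~default:StringSet.empty

let handle (state : state) = function
  | BranchPublished { content_id; branch } ->
      Hashtbl.replace state content_id
        (StringSet.add branch (current state content_id))
  | BranchDeleted { content_id; branch } ->
      (* Only because the full set is kept can the consumer pick the next
         published branch to expose as the default. *)
      Hashtbl.replace state content_id
        (StringSet.remove branch (current state content_id))
```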

The second reason is also why I store the full data object in the event. It would be more efficient to have only the diff, which could be implemented very easily, but then you need the previous events. This is a problem for secondary indexes, because it would require at least one additional call to the DB for each incoming event, which really hurts performance.
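The trade-off in a few lines (the event shapes and `load_previous` are hypothetical; the point is only which inputs the index handler needs):

```ocaml
(* Sketch of why full snapshots in events are friendlier to secondary
   indexes than diffs. *)

(* With a full snapshot in the event, the handler needs nothing but the
   event itself to update the index. *)
let on_snapshot_event ~index_write ~snapshot =
  index_write snapshot

(* With only a diff in the event, the handler first has to fetch the
   previous version: at least one extra DB call per incoming event. *)
let on_diff_event ~index_write ~load_previous ~apply_diff ~diff =
  let previous = load_previous () in   (* the extra round-trip *)
  index_write (apply_diff previous diff)
```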

This means that it will not be possible to get both the last published data object and the last data object for a content item.

Not without an extra call to the DB per item, which is also a performance issue.

What are the requirements for the query capability of a DB so it could be used in this context?

Everything you can do with filters: range queries (>=, >, <, <=) for numbers, dates, arrays and so on. I don’t see that with Irmin.

Because of all the different hosting scenarios, I am also a little more conservative about third-party dependencies. That is also a reason why I use Lucene and not Elasticsearch. In a pure SaaS solution I would just try it out, but some companies I know have approval processes: before using Irmin it would need to be approved first, and it would make deployment much harder for those teams.


Thanks for the comprehensive reply! Very insightful.
