Feedback Needed: The future of Orleans?

Sebastian · June 30, 2022, 12:56pm

The future of Orleans?

At the moment Squidex uses Microsoft Orleans for many parts of the architecture, but it makes deployment and operations more complicated. Therefore I am thinking about removing this dependency.

What is Orleans?

Orleans is an actor framework. An actor is more or less a class with methods where everything is handled single-threaded. When a method is called while another method is in progress, the call is queued and handled afterwards. Furthermore, these actors are distributed across the cluster automatically, and when the actor that you want to talk to lives on another node, the communication is established automatically.
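To make the queuing behavior concrete, here is a minimal sketch in Python with asyncio. It is not Orleans code and does not use the Orleans API; `CounterActor` and its mailbox are hypothetical names chosen for illustration. It only shows the idea that calls to one actor are processed strictly one after another, even when several callers invoke it at the same time.

```python
import asyncio


class CounterActor:
    """Hypothetical actor: every call goes through a mailbox and is handled one at a time."""

    def __init__(self):
        self._mailbox = asyncio.Queue()
        self._value = 0
        self.log = []
        self._worker = None

    def start(self):
        # A single worker task drains the mailbox, so calls never interleave.
        self._worker = asyncio.create_task(self._run())

    async def _run(self):
        while True:
            amount, reply = await self._mailbox.get()
            self.log.append(f"start {amount}")
            await asyncio.sleep(0.01)  # simulated work; the next call waits in the queue
            self._value += amount
            self.log.append(f"end {amount}")
            reply.set_result(self._value)

    async def add(self, amount):
        # Callers only enqueue a message and wait for the reply.
        reply = asyncio.get_running_loop().create_future()
        await self._mailbox.put((amount, reply))
        return await reply

    def stop(self):
        self._worker.cancel()


async def demo():
    actor = CounterActor()
    actor.start()
    # Three "parallel" calls are queued and handled in order.
    results = await asyncio.gather(actor.add(1), actor.add(2), actor.add(3))
    actor.stop()
    return results, actor.log


results, log = asyncio.run(demo())
print(results)
print(log)
```

In the log, every "start" is followed directly by its own "end", which is the single-threaded guarantee described above. The distribution across nodes is not part of this sketch.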

What are the benefits?

As mentioned before, method calls are queued. Usually when you update an object in the database you use a mechanism called optimistic concurrency. The idea is that you also store a version number in the database, and when you read a value to be updated you also get the version. Then you only make the update when the version that you have in memory has not been changed in the meantime. The problem is that you do not detect a conflicting update before you make the write to the database. So a lot of expensive operations might have happened already, like reading the value from the database.
With actors you have one instance of every domain object (content, assets and so on) and therefore you do not have this performance problem.
Because actors just live in the cluster you do not need special nodes for background operations. Squidex has a lot of background jobs that cannot run in parallel. You just tell the system that you only want one actor of each type to exist in the cluster.
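The optimistic concurrency mechanism described above can be sketched as follows. This is a simplified illustration with a hypothetical in-memory store, not the Squidex storage code; `InMemoryStore` and `VersionConflict` are made-up names. The point is that the conflict is only detected at write time, after the second writer has already done its read.

```python
class VersionConflict(Exception):
    """Raised when the stored version differs from the one the writer read."""


class InMemoryStore:
    """Hypothetical stand-in for a database collection with a version per document."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, data)

    def read(self, doc_id):
        return self._docs.get(doc_id, (0, None))

    def write(self, doc_id, expected_version, data):
        current_version, _ = self._docs.get(doc_id, (0, None))
        if current_version != expected_version:
            # Someone else wrote in the meantime; the work done so far is wasted.
            raise VersionConflict(f"expected {expected_version}, found {current_version}")
        self._docs[doc_id] = (current_version + 1, data)
        return current_version + 1


store = InMemoryStore()
store.write("content-1", 0, {"title": "v1"})

# Two writers read the same version ...
version_a, _ = store.read("content-1")
version_b, _ = store.read("content-1")

# ... the first write succeeds, the second one fails.
store.write("content-1", version_a, {"title": "from A"})
try:
    store.write("content-1", version_b, {"title": "from B"})
    conflict = False
except VersionConflict:
    conflict = True

print(conflict)
```

With an actor per domain object the two updates would simply be queued, so the second one would see the result of the first instead of failing.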

What are the downsides?

Cluster management is complicated. You have to ensure that the nodes can talk to each other and you need a stable cluster. This makes a lot of scenarios quite complex, for example auto scaling is not that easy.

Without Orleans

When we talk about an architecture without Orleans we first have to talk about how Orleans is used today.

Background Operations

These operations are only allowed to run once for the deployment. Typical tasks are:

Rule execution
Event processing
Content scheduling
Backup restore
Backup operations

Without Orleans you need special kinds of Squidex nodes (just a configuration) and in some cases a way for them to talk to each other, for example with PubSub mechanisms like Redis or Google PubSub.

=> Without Orleans these deployments become more complex.

Caching

Because the actor lives in memory it is very cheap to access data that is used very often. Typical examples are apps and schemas. The app instance is needed for almost every database call and often you want to have the newest version and not a cached instance.

=> Without Orleans you have way more calls to the database or more caching for some endpoints.

Domain Objects

As mentioned above you need to solve parallel updates on domain objects. Without Orleans you would solve it on the database level.

=> Without Orleans you have more database calls and also more exceptions for parallel updates.

Rules

Rules are a special case. They work on events. Let's talk about a concrete example: When you make an update to a content item, you create a new ContentUpdated event. Based on this event an enriched event is created in a background process that contains the new data and the old data of the content item. The new data is already part of the event that has been queried from the database. But the question is where we can get the old data from. Therefore the content actor keeps the previous data in memory for a little while. Usually the event is handled by the rule system more or less immediately, so it is very likely that the content actor is still alive and has the old data available. So this operation is basically free in most cases. If the old data is not available we have to query it. We could use a distributed cache for that if configured, but it is more expensive.
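The lookup of the old data can be sketched like this. It is only an illustration under assumptions: `PreviousDataCache`, `enrich`, the 60 second lifetime and the event fields are hypothetical and not taken from the Squidex code. The sketch keeps the previous data in memory for a short time and falls back to a (more expensive) query when it is gone.

```python
import time


class PreviousDataCache:
    """Hypothetical short-lived memory of the previous data per content item."""

    def __init__(self, ttl_seconds, query_old_data, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._query = query_old_data  # fallback, e.g. a database query
        self._clock = clock
        self._entries = {}  # content_id -> (expires_at, old_data)

    def remember(self, content_id, old_data):
        self._entries[content_id] = (self._clock() + self._ttl, old_data)

    def get(self, content_id):
        entry = self._entries.get(content_id)
        if entry is not None and entry[0] > self._clock():
            return entry[1], "memory"  # basically free
        self._entries.pop(content_id, None)
        return self._query(content_id), "database"  # more expensive


def enrich(event, cache):
    """Build an enriched event that carries the old data next to the new data."""
    old_data, source = cache.get(event["content_id"])
    return {**event, "old_data": old_data, "old_data_source": source}


# A fake clock makes the example deterministic.
now = [0.0]
cache = PreviousDataCache(
    ttl_seconds=60,
    query_old_data=lambda content_id: {"title": "old (queried)"},
    clock=lambda: now[0],
)

cache.remember("c1", {"title": "old"})
event = {"type": "ContentUpdated", "content_id": "c1", "data": {"title": "new"}}

fast = enrich(event, cache)  # handled immediately: old data still in memory
now[0] = 120.0
slow = enrich(event, cache)  # handled too late: old data has to be queried

print(fast["old_data_source"], slow["old_data_source"])
```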

Summary

With Orleans

PRO: A lot of implementations are easy if you understand Orleans.
PRO: Fewer database calls, because actors work like a cache, but are always consistent.
PRO: Easier to deploy because you have fewer requirements (like Redis).
PRO: Some operations are very fast, because they are in-memory.

CON: The architecture is very special, making it harder for new developers.
CON: Harder to deploy in some cases and harder in operations.

Without Orleans

PRO: Easier architecture
PRO: Easier to deploy to things like managed containers or even serverless deployments.

CON: More database calls, perhaps you shift some of the operational challenges to the database.
CON: Some operations are slower, because you have more database calls.

NeoMoritsiqu · July 1, 2022, 8:30am

Hi Sebastian,

Performance is the most important thing in Squidex. If this change does not adversely affect performance, there is no problem. Even when developing code, it’s great to reduce complexity. The other thing I’m wondering is how you plan to provide the advantages of Orleans without Orleans?

Cluster
Schedule Services
etc.

Thank you

Sebastian · July 1, 2022, 12:09pm

I would say it is the second most important thing. Number one would be stability for me. Handling Orleans is not so simple in this regard, I just had a 5 min offline period yesterday.

It would be a more traditional architecture with a worker node and API nodes and a queue between them.
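As a rough sketch of that shape, the following Python example uses an in-process `queue.Queue` as a stand-in for a real message queue. The function names and job names are made up for illustration, and a real deployment would run the API nodes and the worker as separate processes with a messaging library in between.

```python
import queue
import threading

jobs = queue.Queue()  # stand-in for the message queue between the nodes
processed = []


def api_node(name, job):
    """API nodes only enqueue background work; they never run it themselves."""
    jobs.put((name, job))


def worker_node():
    """The single worker handles the jobs one by one, so they never run in parallel."""
    while True:
        item = jobs.get()
        if item is None:  # sentinel: shut down
            break
        name, job = item
        processed.append(f"{job} (from {name})")


worker = threading.Thread(target=worker_node)
worker.start()

api_node("api-1", "run rule")
api_node("api-2", "schedule content")

jobs.put(None)
worker.join()
print(processed)
```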

About Performance

Some things will definitely be slower. Before we talk about concrete examples, just another introduction to objects (apps, schemas, contents, assets, asset folders, rules …)

To load an object you need two pieces of information, which means two database calls:

The snapshot
The events

The objects live in memory as long as they are used. If they are not alive yet, they get activated and then the two database calls are made. If the object lives on the same server as the current request, then it does not cost anything to get it. If the object lives on another server, you have the network costs and the serialization costs, which is similar to a database call but without the overhead and the I/O to load the object from the disk.

1. Updating a content item

1.1 With Orleans

To update a content item you need the following information:
1. The app
2. The schema.
3. The content itself.
To make the update you need 2 calls to write the snapshot and events.

Worst Case: 8 database calls
Best Case: 2 database calls
Realistic: 4 database calls (because app and schema are already loaded).

1.2 Without Orleans

In the update process you do not want to work with caching that much. Therefore you always have the 8 database calls.
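The numbers for the update case follow directly from the rules above: two calls to load each object that is not in memory yet (snapshot and events) plus two calls to write. A small sketch of that arithmetic, with a hypothetical helper function:

```python
LOAD_CALLS_PER_OBJECT = 2  # snapshot + events
WRITE_CALLS = 2            # write snapshot + write events
OBJECTS_NEEDED = 3         # app, schema, content


def update_calls(objects_already_in_memory):
    """Database calls for one content update, given how many objects are already loaded."""
    to_load = OBJECTS_NEEDED - objects_already_in_memory
    return to_load * LOAD_CALLS_PER_OBJECT + WRITE_CALLS


with_orleans_worst = update_calls(0)      # nothing in memory
with_orleans_best = update_calls(3)       # app, schema and content in memory
with_orleans_realistic = update_calls(2)  # app and schema in memory
without_orleans = update_calls(0)         # no caching on the update path

print(with_orleans_worst, with_orleans_best, with_orleans_realistic, without_orleans)
```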

2. Querying content items

2.1 With Orleans

To query content items you need at least the following information:

The app
The schemas (Could be multiple).
The contents, but they are always loaded from the database.

This means the following database calls.

Worst Case: 2 (app) + N * 2 (schemas) + 1
Best Case: 1
Realistic: 1 (because app and schema are already loaded).

2.2 Without Orleans

When you query content items you can live with caching for a short period of time (5 min or so).

Therefore it means

Worst Case: 2 (app) + N * 2 (schemas) + 1
Best Case: 1
Realistic: 1 + ( 2 (app) + N * 2 (schemas) ) * CACHE_RATE
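The formulas can be checked with a few lines of Python. One assumption is needed here: CACHE_RATE is read as the fraction of requests where the cached app and schemas are not available and have to be loaded from the database (a miss rate between 0 and 1). The function names are made up for this example.

```python
def query_calls_worst(n_schemas):
    """Nothing is in memory: load the app, load N schemas, then run the content query."""
    return 2 + n_schemas * 2 + 1


def query_calls_expected(n_schemas, cache_miss_rate):
    """Average calls per request: the content query plus the loads on a cache miss."""
    return 1 + (2 + n_schemas * 2) * cache_miss_rate


worst = query_calls_worst(2)                  # 2 + 2 * 2 + 1
always_cached = query_calls_expected(2, 0.0)  # equals the best case
never_cached = query_calls_expected(2, 1.0)   # equals the worst case
mostly_cached = query_calls_expected(2, 0.1)  # one miss in ten requests

print(worst, always_cached, never_cached, mostly_cached)
```

With two schemas and one miss in ten requests this gives about 1.6 calls per query instead of 1, which matches the statement that reads get a little bit slower.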

Summary

Performance will be slower on updates
Performance on reads is not a big difference, but a little bit slower.
It will mean more database calls (but these calls are on small collections with indexes).

NeoMoritsiqu · July 1, 2022, 2:36pm

While this development is being done, wouldn’t it be good to add a configurable option that caches the desired schemas in Redis and reads them from Redis?

Also, schemas containing big data (500k, 1m, etc.) slow down in queries. What can be done about this?

Thank you

Sebastian · July 1, 2022, 3:34pm

I am not sure if it will provide that much performance in general. The schema is a relatively small document and can be queried from MongoDB by the primary ID, therefore it is pretty fast (< 2ms). If you use Redis you still have the deserialization problem, so you do not gain that much. If you really want to get as much performance as possible you need a copy of the schema on every server, then you can get it in a few nanoseconds.

This is independent from the Orleans decision. There are 2 problems:

At the moment all documents are part of a single collection. The reason is the cloud with thousands of schemas and some queries that are across schemas (e.g. references, scheduling and so on). Therefore to get the number of total documents for pagination you have to go over the index which is not so fast for a lot of documents.
Because all documents are part of a single collection you cannot create custom indexes.

There are solutions:

Squidex 6.9.0 has a new cache for the total number of documents, which is used when there are more than 10,000 items per schema. This makes normal queries faster. Furthermore the frontend UI does not query the total items by default anymore.
There is an extension to the storage system where a separate collection per schema is created. To satisfy all queries 4 collections per schema are needed now, which increases the storage and insert costs in exchange for better query performance. This is WIP at the moment: https://github.com/Squidex/squidex/tree/storage
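The idea behind the first solution can be sketched as follows. This is not the Squidex implementation; `TotalCountCache`, the lifetime of 60 seconds and the fake counting function are assumptions for illustration. Only the threshold of 10,000 items is taken from the description above.

```python
import time


class TotalCountCache:
    """Hypothetical cache: totals are only cached for schemas with many documents."""

    def __init__(self, count_in_db, threshold=10_000, ttl_seconds=60.0, clock=time.monotonic):
        self._count_in_db = count_in_db  # the expensive count over the index
        self._threshold = threshold
        self._ttl = ttl_seconds          # assumed lifetime of a cached total
        self._clock = clock
        self._cached = {}                # schema_id -> (expires_at, total)

    def total(self, schema_id):
        entry = self._cached.get(schema_id)
        if entry is not None and entry[0] > self._clock():
            return entry[1]
        total = self._count_in_db(schema_id)
        if total > self._threshold:
            # Counting is slow for large schemas, so a slightly stale total is accepted.
            self._cached[schema_id] = (self._clock() + self._ttl, total)
        else:
            self._cached.pop(schema_id, None)
        return total


db_calls = []
sizes = {"big": 500_000, "small": 120}


def count_in_db(schema_id):
    db_calls.append(schema_id)
    return sizes[schema_id]


cache = TotalCountCache(count_in_db)
totals = [cache.total("big"), cache.total("big"), cache.total("small"), cache.total("small")]

print(totals)
print(db_calls)
```

The large schema is counted once and then served from the cache, while the small schema is counted on every request because the count is cheap there.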

Jiri · July 1, 2022, 8:05pm

Isn’t one option to try to provide feedback to the Orleans team? Shifting away from Orleans sounds like a lot of work (I may be wrong).

Sebastian · July 2, 2022, 8:29am

Yes, I do this all the time and also contribute to projects like the MongoDB provider and the Dashboard. But you have to configure Orleans properly, and it is a totally new principle for operations and developers and therefore not that easy. I mean every serious backend developer understands the pattern with worker nodes and API nodes, and hosting solutions work better with independent nodes. Sometimes you cannot even establish communication between your nodes.

I started with a POC and had something that compiles after a day. So it is okay I guess. I am evaluating different libraries for messaging right now.

Sebastian · July 11, 2022, 1:58pm

A new version: 7.0.0-RC1 is available that works without Orleans.

maxisam · August 4, 2022, 10:01pm

I wonder what you think about this: https://bartwullems.blogspot.com/2022/02/net-7microsoft-orleans.html
I wonder how many resources we should plan for MongoDB after this change?
Is there a way to provide an optional cache layer that uses Redis as a distributed cache?

Thanks!

Sebastian · August 5, 2022, 6:10am

Hi maxisam,

If you look at the contributors, you will see that the Orleans team is very small: https://github.com/dotnet/orleans/graphs/contributors. Apparently it has no priority at Microsoft.
I have no idea, but the majority of the costly queries are always content queries and the impact of Orleans on this part is relatively low.
Of course, if this is needed, but caching has its own issues.