[IMPLEMENTED] Permanent deletion

Hi Sebastian,

I want to delete all the contents of a schema and then bulk insert from a CSV file.

When I delete all the contents, I want every trace of them to be removed from the database, because I’ll be doing this every day and I don’t want the database to become bloated.

How should I handle this issue?

Thank you.

Right now, it is not possible, because Squidex does not delete data. Why are you doing this every day?

I import data from an external system via a daily scheduled task, and I insert about 60,000 records every day. That’s why I need to clean the old data out of the system and add the new data. I can’t update the existing data because there is no unique id to identify each record, so I have to delete the existing data and re-create it with the updated data.

I need two things here: one is bulk insert, and the other is to completely delete the old data in the system. If necessary, I can do this by deleting the schema completely and re-creating it from scratch in the background, but when I delete the schema or the data, I need to be sure that it is completely removed from the database.

Can you not use a compound id?

The system I export the data from cannot provide this; I would have to provide it when importing. I also do not have a unique field to detect changed data, so I have to handle this on the Squidex side.

I developed a custom plugin that imports the data. I need to be able to do bulk inserts and to completely clean out the contents before doing that.

When you use the bulk endpoint you can assign custom IDs. I think version 5.5 already has this feature. So you can use a custom ID like user123-2021-02-25 or whatever for your data and just make an upsert.
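For example, an upsert job with a custom ID could look roughly like this (a sketch that follows the bulk payload format shown further down in this thread; the field and values are made up):

POST `/api/content/{app}/{name}/bulk`
{
	"jobs": [{
		"type": "Upsert",
		"id": "user123-2021-02-25",
		"data": {
			"number": {
				"iv": 1
			}
		}
	}]
}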

I cannot change the data exporter on the other system to produce the special IDs that would be set in Squidex.

Let’s assume that the sample data I exported from the other system looks like this:

{
	"Name": "Sebastian",
	"Surname": "Stehle"
},
{
	"Name": "Neo",
	"Surname": "Moritsiqu"
}

And the data is constantly changing and there is no sequence.

I can give these records an id when adding them to Squidex, but on the other system I cannot produce an output that carries this id. That is actually why I thought of deleting the old data and adding the new data.

If you do not have a unique id and you cannot create one (like name + surname), then you have to delete the contents, that is true.

But right now Squidex does not provide good support for this.

Well, if I were to make an improvement to the product myself, where should I start for bulk inserting and for completely deleting all data? If you can point me in the right direction, maybe we can make it generic and add it to the product as a feature.

The bulk endpoint already exists, but not the deletion. Perhaps you can provide a PR that makes a hard delete as well?


If you give me a little direction on this, maybe we can add a small option on the schema side and activate permanent deletion when that option is enabled.

You have to add an abstract method to the base domain object.

And then implementations in the derived classes DomainObject and LogSnapshotDomainObject. Both have access to persistence.DeleteAsync() or something like that. Then you call this method when the command has a certain flag, and this flag needs to be populated through all layers.
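A very rough sketch of the shape this could take (type and method names here are illustrative, not the actual Squidex classes):

using System.Threading.Tasks;

// Sketch only: the names stand in for the classes mentioned above.
public interface IPersistence
{
    Task DeleteAsync(); // stands in for the persistence.DeleteAsync() mentioned above
}

public abstract class DomainObjectBase
{
    // New abstract hook that both derived classes would implement.
    protected abstract Task DeleteCoreAsync();

    // Called by the command handling code when the delete command carries the permanent flag.
    public Task DeletePermanentlyAsync()
    {
        return DeleteCoreAsync();
    }
}

public sealed class MyDomainObject : DomainObjectBase
{
    private readonly IPersistence persistence;

    public MyDomainObject(IPersistence persistence)
    {
        this.persistence = persistence;
    }

    protected override Task DeleteCoreAsync()
    {
        // The derived classes have access to the persistence layer,
        // so the implementation boils down to deleting the persisted state.
        return persistence.DeleteAsync();
    }
}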

Hi Sebastian,

There is no proper usage example for the bulk import and I do not understand exactly how to use it.

There is a PostContents(string app, string name, [FromBody] ImportContentsDto request) method in ContentsController.cs.

I think we should use this. However, there is no usage example of this method in squidex-samples. My thinking is: add a bool HardDeleteContents { get; set; } property to ImportContentsDto.

The CommandMiddleware that processes the command would, when this flag is true, also execute a command to delete all data in the corresponding schema. In fact, I would be very happy if you coded this flow as a sample, because you have much more control over and knowledge of the system. But if you show me the way, I will try to do it anyway.
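Roughly, what I have in mind is something like this (just a sketch of the proposal, not working code):

// Sketch of the proposal only, not the full DTO.
public sealed class ImportContentsDto
{
    // New flag: when true, all existing contents of the schema are hard-deleted
    // before the imported items are inserted.
    public bool HardDeleteContents { get; set; }
}

// A command middleware would check this flag and, when it is true, issue a
// "delete all contents of this schema" command before the import runs.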

Are you using the C# client SDK?

If yes, it provides methods for this; if not, please do not use that endpoint. The correct one is this:

How does it work?

You have to create a POST request like this:

POST `/api/content/{app}/{name}/bulk`
{
	"jobs": [{
		"type": "Upsert",
		"data": {
			"number": {
				"iv": 1
			}
		}
	}, {
		"type": "Delete",
		"id": "123..."
	}]
}
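If you are not using the client SDK, a minimal C# sketch with a plain HttpClient could look like this (the base URL, app and schema names and the access token are placeholders; the payload is the same as the example above):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

public static class BulkExample
{
    public static async Task RunAsync()
    {
        // Placeholders: base URL, app, schema and token are not from this thread.
        const string baseUrl = "https://your-squidex-host";
        const string app = "my-app";
        const string schema = "my-schema";
        const string accessToken = "<access token>";

        // Same payload as the example above: one upsert job and one delete job.
        const string payload = @"{
            ""jobs"": [{
                ""type"": ""Upsert"",
                ""data"": { ""number"": { ""iv"": 1 } }
            }, {
                ""type"": ""Delete"",
                ""id"": ""123...""
            }]
        }";

        using (var httpClient = new HttpClient { BaseAddress = new Uri(baseUrl) })
        {
            httpClient.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", accessToken);

            var content = new StringContent(payload, Encoding.UTF8, "application/json");
            var response = await httpClient.PostAsync($"/api/content/{app}/{schema}/bulk", content);

            response.EnsureSuccessStatusCode();
        }
    }
}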

What it does not provide is an endpoint to delete all contents of a schema, and there are several reasons for that. One of them is that it is harder to do with event sourcing: you not only need to delete the data in the content collections, which would in fact be a single DELETE command like DELETE CONTENTS WHERE schemaId = '...'.

You also have to create the deletion events.

This is the root of the problem. To work reliably, you have to keep the deletion events. Squidex works like a database in some ways: a database maintains a sorted list of operations (e.g. the oplog in MongoDB), which is used for synchronization. If you just delete everything, all the other systems, like indexes, usage counters and external systems the data might be synced to, do not get the information that the content has been deleted.

Therefore the delete command needs to be updated to allow a permanent flag. When the flag is true, the data has to be deleted (there is a persistence.DeleteAsync() method for that) and then the delete event needs to be published. It is not possible otherwise. We could add an optimization to delete these events after a month or so, but that is another story.
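As a rough sketch of the idea (the types and names below are illustrative, not the actual Squidex implementation):

using System.Threading.Tasks;

// Illustrative sketch of the idea described above, not the actual Squidex code.
public sealed class DeleteContent
{
    public string ContentId { get; set; }

    // The new flag that has to be populated through all layers (API -> command -> domain object).
    public bool Permanent { get; set; }
}

public sealed class ContentDeleted
{
    public string ContentId { get; set; }
}

public interface IPersistence
{
    Task DeleteAsync();
}

public interface IEventPublisher
{
    Task PublishAsync(object @event);
}

public sealed class ContentDomainObject
{
    private readonly IPersistence persistence;
    private readonly IEventPublisher eventPublisher;

    public ContentDomainObject(IPersistence persistence, IEventPublisher eventPublisher)
    {
        this.persistence = persistence;
        this.eventPublisher = eventPublisher;
    }

    public async Task DeleteAsync(DeleteContent command)
    {
        if (command.Permanent)
        {
            // Hard delete: wipe the persisted state for this content.
            await persistence.DeleteAsync();
        }

        // The deleted event must still be published, so that indexes, usage counters
        // and external systems the data is synced to learn about the deletion.
        await eventPublisher.PublishAsync(new ContentDeleted { ContentId = command.ContentId });
    }
}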

If this does not work for you, Squidex is probably not the best tool for you, at least not for this use case.

When a schema is deleted, are the events deleted as well? My only fear is that the database will fill up with garbage after a while if I do this many deletes and inserts…

No. Right now, nothing gets deleted. It is by design, because I think that for the kind of content Squidex is built for (manually created content in the range of < 1 million records) the content grows more slowly than disk sizes do.

Deleting 60,000 content items via the bulk update should be acceptably fast, around 1,000 items/sec or so, so I think this is not a big issue. I am open to allowing permanent deletions, as it also has advantages like GDPR compliance and so on, but at least the deleted event must stay in the system.

I am working on this at the moment.

Hi Sebastian,

I did the bulk insert job and tried it with 20,000 demo records. It took about 1 minute and it works properly. Before doing this, I was thinking about deleting all the existing data. I think we can solve this if you create a method on ContentRepository that clears all contents of a schema, and create an action/command named “truncate” on the contents side.

Expected behavior:
When the truncate command is run, or when a flag is set in the model before a bulk insert:

1- Delete all data belonging to the schema from the ContentsDatabase (Published, All).
2- Delete all events belonging to this schema from the Event2 collection.

Thus, when 60,000 records are created each time, garbage data will not accumulate in the system.
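As a very rough sketch of what I mean (the collection and field names below are placeholders, not the real Squidex MongoDB layout):

using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

// Sketch of the proposed "truncate" only. Collection and field names are placeholders.
public sealed class SchemaTruncator
{
    private readonly IMongoDatabase database;

    public SchemaTruncator(IMongoDatabase database)
    {
        this.database = database;
    }

    public async Task TruncateAsync(string schemaId)
    {
        var bySchema = Builders<BsonDocument>.Filter.Eq("schemaId", schemaId);

        // 1- Delete all content documents of the schema (all states and published).
        await database.GetCollection<BsonDocument>("Contents_All").DeleteManyAsync(bySchema);
        await database.GetCollection<BsonDocument>("Contents_Published").DeleteManyAsync(bySchema);

        // 2- Deleting the matching events from the events collection is the harder part
        //    and is not sketched here.
    }
}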

It would not help that much to do it on the content repository, because you also have to delete the events. If it only takes one minute, I think that is fast enough for now.