Content Fulltext Search Improvement

Hi Sebastian,

I have an idea for a feature that runs a full-text search across all content. Such an improvement would be very useful for CMS-driven sites and websites. It would not be an internal search, but a service that returns search results to clients.

Currently it is very difficult to build a site search for a CMS or website on Squidex.

Some settings could be adjusted in the properties section of the schemas.

This service must be located on the Content API.

The output of the service should look like this:

Search keyword: “sample”

[
  {
    "schemaId": "c64f56c3-458f-482b-bc5c-a292caa6cc94",
    "schema": "asd",
    "contentId": "ae5d0bce-a66d-4d87-b3e6-e83db66f72cb",
    "title": null,
    "url": null,
    "found": [
      {
        "field": "examplefield",
        "content": "this is a sample content"
      }
    ]
  },
  {
    "schemaId": "b34c0929-c24f-49b0-82e7-0b8b48eeac3d",
    "schema": "test",
    "contentId": "9076bb7b-9889-4ecb-8ee3-b2247ff43753",
    "title": "Test content title",
    "url": "test-content-title",
    "found": [
      {
        "field": "samplefield",
        "content": "hello world this is sample text"
      },
      {
        "field": "samplefield2",
        "content": "this is a sample text"
      }
    ]
  }
]

Of course, this idea can be improved. I’m curious about your comments.

Thank you.

Hi,
thanks for your mock-ups. Well done.

Nevertheless I think it is not the best idea.

Current State

Let me explain the current state in Squidex first. At the moment Squidex has a full-text implementation. It is very basic and has been migrated from a Lucene-backed implementation to an implementation using MongoDB or ElasticSearch. At the moment all text fields are added to an index, and the ElasticSearch implementation also considers the different languages for localized fields.

The Lucene implementation produced better results than the MongoDB implementation but it was buggy and hard to maintain. Therefore the implementation has been shifted.

What is the problem?

I am not sure how familiar you are with full-text indexes, but most implementations work with a so-called inverted index (sometimes called a reverse index). Let's say you have two documents.

| Id | Texts                   |
| 1  | Hello User              | 
| 2  | Hello, how is it going? |

The full-text index splits the text into its words (or word stems). Then it associates the ids with the words. Because some words, like "and", are used so often, they are put on a blacklist of so-called stopwords and are not added to the index.

So after this process is done you get the following result:

| Word   | Ids  |
| Hello  | 1, 2 |
| User   | 1    |
| Go     | 2    |

In this case "how", "is" and "it" are treated as stopwords and "going" is reduced to its stem "go".
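To make the idea concrete, here is a minimal Python sketch of such an index for the two documents above. The stopword set and the stem table are assumptions that are just large enough for this example; a real engine uses language-specific analyzers instead.

```python
import re
from collections import defaultdict

# Assumed stopword list and stem table, only big enough for this example.
STOPWORDS = {"how", "is", "it"}
STEMS = {"going": "go"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each stemmed, non-stopword term to the sorted ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            if word in STOPWORDS:
                continue  # stopwords never reach the index
            index[STEMS.get(word, word)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


documents = {1: "Hello User", 2: "Hello, how is it going?"}
index = build_index(documents)
print(index)  # {'hello': [1, 2], 'user': [1], 'go': [2]}
```

A search for a word is then a single dictionary lookup instead of a scan over all documents.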

Problem 1: A lot to configure

To make full-text search work well, a lot of settings have to be configured:

  1. Stemming and word splitting (tokenization) differ between languages, so per field you have to decide which language should be used.
  2. Stop words also differ per language and you have to define per field which stopwords should be used. But sometimes you also want to have custom stopwords. For example if you have a travel portal you wanna exclude words like hotel and flight from your results, because they are basically part of every content item and would produce the wrong results.
  3. Some fields are more important than others, for example the title is more important than the text, so you wanna add a weight to some fields.
  4. Some fields contain words that are hard to understand for the search engine. For example the IATA code for airports. You wanna treat these fields differently.
  5. You also wanna configure synonyms, so that the word does not have to match exactly, and fuzzy (approximate) search, so that typos are acceptable.
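A few of these settings can be sketched in plain Python to show how they interact. Everything here is a toy under stated assumptions: the field weights, the custom stopwords, the synonym table and the fuzzy threshold are made-up values, and real engines use far more sophisticated ranking than adding up weights.

```python
from difflib import SequenceMatcher

# All values below are assumptions chosen for the travel portal example.
FIELD_WEIGHTS = {"title": 3.0, "text": 1.0}   # title counts more than text
CUSTOM_STOPWORDS = {"hotel", "flight"}        # words found in nearly every item
SYNONYMS = {"cheap": {"budget"}}              # query word -> accepted synonyms
FUZZY_THRESHOLD = 0.8                         # minimum similarity for typos


def matches(query_word, word):
    """Exact match, synonym match, or a close enough spelling."""
    if word == query_word or word in SYNONYMS.get(query_word, set()):
        return True
    return SequenceMatcher(None, query_word, word).ratio() >= FUZZY_THRESHOLD


def score(query, item):
    """Add the field weight once for every query term found in that field."""
    terms = [w for w in query.lower().split() if w not in CUSTOM_STOPWORDS]
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = item.get(field, "").lower().split()
        for term in terms:
            if any(matches(term, w) for w in words):
                total += weight
    return total


offer = {"title": "Budget hotel in Berlin", "text": "Flight to Berlin included"}
print(score("cheap hotel berln", offer))  # 7.0
```

Here "hotel" is dropped as a custom stopword, "cheap" matches "budget" through the synonym table, and the typo "berln" still matches "berlin" in both fields.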

There are more settings and I am not an expert for full text search, but you can have a look at the documentation of Algolia or Elastic Search to understand all the different options.

Without deep understanding of these parameters you will not get good results. There is no automatic solution that does everything for you.

So when we really wanna integrate full text search into Squidex we have to make these parameters and configuration options available to the developer.

But we have to consider the different implementations and have to study what the different solutions offer. Just a list of servers and technologies:

  • Lucene based (a Java full text search engine)
    • Lucene itself
    • Elastic Search
    • Solr
  • Algolia (SaaS)
  • Databases (most databases have basic full text support today)
  • MeiliSearch

Then we have 2 options:

  1. Support all parameters that are available in at least one of these engines and make clear why a parameter cannot be used in the current installation.
  2. Support only common parameters.

We also have to implement processes like reindexing because it might be necessary to index all your content items again after you made a change to the stopwords.

Problem 2: Document transformation

The content items themselves are not the best representation for a full-text index. Let's come back to our original sample with the travel portal. A travel offer has a reference to the airport, the destination and the hotel. In the content these references are only represented as IDs and provide no value for the search engine. So we must also configure which fields of referenced items should be added to the document for the full-text search engine.
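As a rough illustration of this transformation, the following Python sketch replaces reference IDs with selected text fields of the referenced items. The field names, the IDs and the `reference_fields` configuration are hypothetical and not part of Squidex.

```python
def build_search_document(offer, lookup, reference_fields):
    """Replace reference ids with the configured text fields of the referenced items."""
    document = {}
    for field, value in offer.items():
        if field in reference_fields:
            referenced = lookup[value]  # resolve the id to the referenced item
            document[field] = " ".join(
                referenced[name] for name in reference_fields[field]
            )
        else:
            document[field] = value
    return document


# Hypothetical referenced items, keyed by their ids.
lookup = {
    "a1": {"name": "Berlin Brandenburg", "iata": "BER"},
    "h1": {"name": "Hotel Adlon"},
}
# Hypothetical configuration: which fields of each reference get indexed.
reference_fields = {"airport": ["name", "iata"], "hotel": ["name"]}

offer = {"title": "Weekend in Berlin", "airport": "a1", "hotel": "h1"}
document = build_search_document(offer, lookup, reference_fields)
print(document)
```

The resulting document contains searchable text such as the airport name and IATA code instead of opaque IDs. It would also have to be rebuilt whenever a referenced item changes.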

All in all, it is a lot of effort to build a very good full-text solution and I am not sure if it would provide enough value.


Thank you for the in-depth explanation. I have a much better understanding now. So how can we produce a similar solution? How can we produce a good search solution for websites? How can we improve this solution? I think it’s a topic worth thinking about. There is no good solution for this at the moment.

Thank you for all your support.

You can use rules to publish your changes to Algolia or Elastic and then query your content from there.

OR

We could make a simple integration in addition to that so that Algolia gets integrated into Squidex. So basically you would use Rules to publish to Algolia and the search engine integration to trigger the search only.

So what if we build this feature so that it searches the contentsRepo directly, rather than as a full-text search?

You mean a “contains” search? You can do that with a filter, but it is very slow, because it is always a table scan.
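To show why this is slow, here is a small Python sketch of what a "contains" search boils down to. It is only an illustration with made-up items, not how Squidex or MongoDB implement filters.

```python
def contains_search(items, field, needle):
    """Return ids of items whose field contains the needle (case-insensitive).

    Every item is inspected, so the cost grows linearly with the item count.
    """
    needle = needle.lower()
    return [
        item["id"]
        for item in items
        if needle in str(item.get(field, "")).lower()
    ]


# Made-up content items for the example.
items = [
    {"id": 1, "text": "this is a sample content"},
    {"id": 2, "text": "hello world"},
    {"id": 3, "text": "Samples are included"},
]
hits = contains_search(items, "text", "sample")
print(hits)  # [1, 3]
```

Each query has to read every item, whereas an index lookup only touches the entries for the searched word.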