Skip to content

Conversation

sekiyama58
Copy link
Contributor

  • add multi_byte_support_enabled option to organization settings
  • add ilike %...% to query search conditions when the option is enabled

What type of PR is this? (check all applicable)

  • Bug Fix

Description

Since postgres tsvector does not work well with multi-byte characters in CJK- (and some others) launguages , currently queries named in multi-byte characters are often fails to be searched as described #2622.

This fixes the issue by adding an option to enable multi-byte compliant search into organization settings, and if it is enabled, search for queries using <name or description> ilike %<search_term>% pattern in addition to tsvector.
This search method is slower than tsvector search, so it is recommended only when multi-byte search is more important than speed.

Related Tickets & Documents

This closes #2622.

Mobile & Desktop Screenshots/Recordings (if there are UI changes)

  • New organization settings for multi-byte search

スクリーンショット 2019-06-18 16 16 13

スクリーンショット 2019-06-18 16 17 03

  • The result of search result when multi-byte search is enabled

スクリーンショット 2019-06-18 16 18 06

(If the option is disabled, nothing is hit in ユーザー search results.)

* add multi_byte_support_enabled option to organization settings

* add `ilike %...%` to query search conditions when the option is enabled
@sekiyama58 sekiyama58 force-pushed the multi-byte-query-search branch from 503780a to 6179e43 Compare June 18, 2019 07:25
Copy link
Member

@arikfr arikfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking care of this. Please see comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it enough to match only name and description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is popular, but search_vector contains the query's SQL too, so we can find by some terms in SQL with this (only with small overhead). How do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, you can add the query_text column to the search instead. But I think it's better to leave it out, because for some search terms it can yield really bad results (like when you search for something that is also a table name). When using the regular search it has weights in place, so those are ranked lower.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your advise. I have removed search_vector for multi-byte search. It still provides reasonable search results for me.

@cypress
Copy link

cypress bot commented Jun 18, 2019



Test summary

29 0 0 0


Run details

Commit 6179e43
Started Jun 18, 2019 7:35 AM UTC
Ended Jun 18, 2019 9:44 AM UTC
Duration 09:07 💡
OS linux Debian - 8.10
Browser Electron 61.0.3163.100

View run in Cypress Dashboard ➡️


This comment has been generated by cypress-bot as a result of your project's GitHub integration settings. You can manage this integration in your project's settings in the Cypress Dashboard

@jezdez
Copy link
Contributor

jezdez commented Jun 24, 2019

@arikfr So I'm not sure if this should block this PR, I just wondered whether the "search backends" idea would still useful for this?

Looking back, I'm not super convinced the tsvector based search delivered what we hoped to get -- smooth deployment and improved results. Especially given the fact that tsvector is limited in many ways and excludes a huge part of the market for non-latin languages.

My suggestion would be to:

  1. adding a backend system (like the query runner registry) that can be extended with additional Python packages. It would need to have a CRUD interface, so could be pretty simplistic.
  2. ship the old ILIKE based search as the default backend
  3. implement more complex (tsvector or Elasticsearch) based backends as extensions (either by the Redash team/contributors or 3rd parties?)

@jezdez
Copy link
Contributor

jezdez commented Jun 24, 2019

Of course the additional search backends (e.g. tsvector or Elasticsearch) could even be shipped and just made optional (e.g. when a ES cluster is available).

@arikfr
Copy link
Member

arikfr commented Jul 1, 2019

@jezdez this is good feedback, but will be lost once this is merged. Let's move it to https://discuss.redash.io/c/development?

@jezdez
Copy link
Contributor

jezdez commented Jul 1, 2019

@jezdez this is good feedback, but will be lost once this is merged. Let's move it to https://discuss.redash.io/c/development?

Good idea, expanded on it in https://discuss.redash.io/t/search-backend-proposal/4080

@arikfr arikfr merged commit 261062d into getredash:master Jul 8, 2019
@arikfr
Copy link
Member

arikfr commented Jul 8, 2019

Merged 👍

cls.name.ilike(pattern),
cls.description.ilike(pattern)
)
).order_by(Query.id).limit(limit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed -- maybe we should order by created_at or updated_at timestamps?

harveyrendell pushed a commit to pushpay/redash that referenced this pull request Nov 14, 2019
…#3908)

* Support multi-byte search for query names and descriptions

* add multi_byte_support_enabled option to organization settings

* add `ilike %...%` to query search conditions when the option is enabled

* Improve description for multi_byte_search_enabled option

Co-Authored-By: Arik Fraimovich <[email protected]>

* Remove tsvector from search when multi_byte_search_enabled

* Add a multi-byte search test case
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Can't search query correctly with non-ASCII chars

3 participants