Text Analytics (Lucene Index)

Varada uses Apache Lucene indexing to accelerate text analytics queries and provide fast text filters, particularly with LIKE predicates. We do this using the KeywordAnalyzer, which provides full LIKE semantics support, to search for the exact appearance of a value in a filtered column.

For example, consider a table with a column city and a record in which city="New York, United States". When Lucene indexing is applied to the column, the following query will return this record:

select * from tbl where city LIKE '%New York%'

The following queries will not return the record:

select * from tbl where city LIKE '%NewYork%'
select * from tbl where city LIKE '%new york%'

👍

By default the LIKE filter is case-sensitive. From version 351.10, you can make the filter case-insensitive by changing the value of the regexp_options session property to 1 by executing set session regexp_options=1

Usage

Lucene indexing is recommended for:

  • Queries with LIKE predicates, prefix or suffix queries, or queries that use the START_WITH functions. Note that Lucene indexing will not accelerate queries with nested expressions like:
    select from table where starts_with(some_nested_method(col1), 'aaa')
  • Queries with Regexp_like predicates.
  • Range queries on string columns. A common use is dates that are stored as strings and have range predicates. For example, date_string>='yyyy-mm-dd'. In this case, setting the prefix index as date_string with a sufficient length will boost these queries.

Limitations

  • Lucene indexing is not supported with Varada as a connector.
  • The maximum supported string length is 50k characters.