Query Acceleration Engine

The Query Acceleration Engine uses a machine learning-based feedback loop that monitors which datasets and columns are used most frequently, and identifies which parts of the data lake need to be accelerated to meet the performance requirements of high-priority workloads.

Based on this information, the Query Acceleration Engine implements acceleration instructions that dynamically operationalize datasets within your data lake. The Query Acceleration Engine automatically caches and indexes a relevant subset of table columns to enable efficient reads from the data lake given the cluster's available storage capacity.

Selecting Elements for Acceleration

The best candidates for acceleration are those that minimize the data that needs to be read from the data lake.

For example, consider the following two queries on table t:

SELECT b FROM t WHERE b < 1
SELECT c FROM t WHERE c < 10

Assuming that:

  • Columns b and c are the same size.
  • Columns b and c are uniformly distributed between 0 and 100.
  • Both queries run the same number of times.

Because b < 1 matches roughly 1% of the rows and c < 10 matches roughly 10%, indexing column b reduces the amount of data that needs to be read by 99%, while indexing column c reduces it by 90%. The Query Acceleration Engine therefore gives column b higher priority.
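The arithmetic behind this ranking can be sketched in Python. The helper name read_reduction is hypothetical and is not part of any Varada API; it only assumes, as in the example, that column values are uniformly distributed between 0 and 100.

```python
def read_reduction(threshold: float, low: float = 0.0, high: float = 100.0) -> float:
    """Fraction of a column that a `col < threshold` filter can skip,
    assuming values are uniformly distributed over [low, high)."""
    clamped = min(max(threshold, low), high)
    selected = (clamped - low) / (high - low)  # fraction of rows that match
    return 1.0 - selected

# Filter thresholds taken from the two example queries.
predicates = {"b": 1, "c": 10}
reductions = {col: read_reduction(t) for col, t in predicates.items()}

# Rank columns by how much data an index would let a query skip.
ranked = sorted(reductions, key=reductions.get, reverse=True)
print(reductions)
print(ranked)
```

Under these assumptions column b ranks ahead of column c. A real prioritization would also weigh column size and query frequency, which the example holds equal.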

The role of the Query Acceleration Engine is to apply acceleration instructions to the columns that will have the most impact on the amount of data that needs to be read. It does this in two ways:

  • By performing default acceleration when a query hits a column.
  • By using a machine learning feedback loop to identify popular columns and tables. Based on this information, as well as the workload priorities you defined, the Query Acceleration Engine configures acceleration instructions that deliver the best balance of performance and cost.

Default Acceleration

When a query hits a column that is not being accelerated, the Query Acceleration Engine performs data and index materialization to accelerate the data in the column. This default acceleration, which is performed as long as space is available in your cluster storage, is designed to immediately improve performance for recently used datasets that were not analyzed by the Query Acceleration Engine.

Note: Default acceleration is not performed for SELECT * FROM <table_name> queries, which are commonly used to explore a table rather than to retrieve specific data.

Default acceleration has a lower priority than a configured acceleration strategy. As a result, the default acceleration applied to a table can be removed and replaced when an acceleration strategy is implemented by the Query Acceleration Engine.
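The conditions described in this section can be summarized as a small decision function. This is a sketch only: the function name, its parameters, and the regular expression used to detect exploratory queries are all assumptions made for illustration, and they do not reflect how the Query Acceleration Engine is actually implemented.

```python
import re

# Hypothetical pattern for exploratory "SELECT * FROM <table_name>" queries.
# A real engine would inspect the parsed query plan, not the SQL text.
SELECT_STAR = re.compile(r"\s*SELECT\s+\*\s+FROM\b", re.IGNORECASE)

def should_default_accelerate(sql: str,
                              column_is_accelerated: bool,
                              free_storage_gb: float) -> bool:
    """Return True when default acceleration would apply to the queried column."""
    if SELECT_STAR.match(sql):
        return False  # exploratory queries do not trigger default acceleration
    if column_is_accelerated:
        return False  # the column is already warmed
    return free_storage_gb > 0  # only while cluster storage has space

print(should_default_accelerate("SELECT b FROM t WHERE b < 1", False, 50.0))
print(should_default_accelerate("SELECT * FROM t", False, 50.0))
```

The first call meets all three conditions, while the second is rejected because it is a SELECT * query.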

Acceleration Strategies

The Query Acceleration Engine includes two main components: the Collector and the Accelerator.

  • The Collector continuously collects and summarizes query execution metadata. The summarized model is optimized for insight extraction and stored in columnar ORC format in an admin-defined S3 bucket. It includes, for example, query statistics, operators used during query execution, column usage statistics, and selectivity levels.

  • The Accelerator creates actionable insights based on historical query and data usage patterns from the Collector output. These insights are continuously revised based on real-time usage and performance, and translated into cache acceleration instructions and indexing acceleration instructions.

Acceleration Instructions

An acceleration instruction defines the type of materialization that Varada performs on a table column in the Varada Catalog to warm up the data in that column. The actual warmup takes place when a query hits the column.

  • Cache Instructions: Based on the frequency of data usage and its business priority, Varada uses SSD columnar nanoblock caching to speed up data access and improve performance.

  • Indexing Instructions: Based on the data type and the level of selectivity, Varada uses different indexing technologies to speed up data searches, filters, and joins.

Instructions are created automatically by the Query Acceleration Engine. You can also define acceleration instructions from the Varada Control Center or by using REST API commands.

For details, see Understanding Acceleration Instructions.

Cluster Storage Management

When the cluster's storage usage reaches the maximum usage threshold (warmup-demoter.max-usage-threshold-percentage), the cluster automatically deletes index and cache items from its SSDs until usage drops to the cleanup threshold (warmup-demoter.max-cleanup-threshold-percentage). Items are deleted in the following order:

  1. All items whose ttl property has expired.
  2. Items with the lowest values of the priority property.
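The deletion order can be illustrated with a short Python sketch. The WarmItem class, its fields, and the select_for_deletion function are hypothetical names introduced for this example. The sketch also assumes one reading of the rules above: all expired items are removed first, and the remaining items are then removed from lowest priority upward until usage is at or below the cleanup threshold.

```python
from dataclasses import dataclass

@dataclass
class WarmItem:
    name: str
    size_gb: float
    priority: float
    expires_at: float  # assumed representation of ttl expiry (epoch seconds)

def select_for_deletion(items, capacity_gb, max_usage_pct, cleanup_pct, now):
    """Return the names of items to delete, in deletion order."""
    used = sum(item.size_gb for item in items)
    if used < capacity_gb * max_usage_pct / 100:
        return []  # maximum usage threshold not reached
    target = capacity_gb * cleanup_pct / 100

    # 1. All items with an expired ttl go first.
    expired = [item for item in items if item.expires_at <= now]
    deleted = list(expired)
    used -= sum(item.size_gb for item in expired)

    # 2. Then the remaining items, lowest priority first, until the target is met.
    live = sorted((i for i in items if i.expires_at > now), key=lambda i: i.priority)
    for item in live:
        if used <= target:
            break
        deleted.append(item)
        used -= item.size_gb
    return [item.name for item in deleted]

items = [
    WarmItem("a", 40, priority=5, expires_at=2000),
    WarmItem("b", 30, priority=1, expires_at=2000),
    WarmItem("c", 10, priority=3, expires_at=500),   # ttl already expired
    WarmItem("d", 15, priority=2, expires_at=2000),
]
print(select_for_deletion(items, capacity_gb=100, max_usage_pct=90,
                          cleanup_pct=80, now=1000))
```

With 95 GB used out of 100 GB, the expired item c is removed first, and the lowest-priority item b follows because usage is still above the 80% cleanup threshold.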