[KYLIN-5564] Introduce Bloom Filter to optimize data scanning based on Spark - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 5.0-alpha
Fix Version/s: 5.0-beta
Component/s: Query Engine
Labels: None

Description

Currently, all data generated by Kylin is saved as Parquet files through Spark, but Kylin does not make full use of Parquet's features when scanning data. Among these features, BloomFilter deserves particular attention, because it is the most common tool for helping readers skip useless data.
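
To illustrate how a Bloom filter helps a reader skip data, here is a minimal sketch in Python. It is not Kylin's or Parquet's implementation: the `BloomFilter` class, the `build_row_group_filters` and `scan_equals` helpers, and the representation of a row group as a plain list are all assumptions made for illustration. The idea it shows is that one small filter is kept per row group, and an equality lookup only reads row groups whose filter reports that the value might be present.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_row_group_filters(row_groups):
    """Build one filter per row group (a row group here is just a list)."""
    filters = []
    for values in row_groups:
        bf = BloomFilter()
        for v in values:
            bf.add(v)
        filters.append(bf)
    return filters


def scan_equals(row_groups, filters, target):
    """Return (matching values, number of row groups actually read)."""
    matches, groups_read = [], 0
    for values, bf in zip(row_groups, filters):
        if not bf.might_contain(target):
            continue  # value is definitely absent: skip without reading
        groups_read += 1
        matches.extend(v for v in values if v == target)
    return matches, groups_read
```

Because a Bloom filter can return false positives, a reader may still open a row group that turns out not to contain the value, but it never skips a row group that does contain it.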

Therefore, we introduce an approach that builds BloomFilters automatically, conditionally, and smartly when constructing segments, on the desired columns, which are chosen mainly according to the query history.
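
The issue does not describe how columns are selected from the query history. Purely as an illustration of the idea, the sketch below assumes the history is available as a list of the columns each past query filtered on with `=` or `IN`, and picks the most frequently filtered ones. The function name `pick_bloom_filter_columns`, its parameters, and the input format are hypothetical, not Kylin's API.

```python
from collections import Counter


def pick_bloom_filter_columns(query_history, min_hits=2, max_columns=3):
    """Pick the columns that appear most often in equality/IN predicates.

    query_history: iterable of lists, each listing the columns that one
    past query filtered on (hypothetical input format).
    """
    hits = Counter()
    for filtered_columns in query_history:
        hits.update(set(filtered_columns))  # count each column once per query
    frequent = [(col, n) for col, n in hits.items() if n >= min_hits]
    # Most frequently filtered first; ties broken by name for determinism.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [col for col, _ in frequent[:max_columns]]
```

A threshold such as `min_hits` reflects the "conditionally" part of the description: columns that are rarely filtered on would not pay for the extra build time and storage of a filter.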

With BloomFilter in place, Spark sees a good performance improvement in most cases.

For benchmarks and performance tests, please read the attached PDF, which is the report of testing on SSB (Star Schema Benchmark).

Attachments

RowGroup BloomFilter 场景介绍和性能测试.pdf (RowGroup BloomFilter: Scenario Introduction and Performance Tests)
07/Jun/23 09:33
856 kB
Guangyuan Feng

People

Assignee: Guangyuan Feng

Reporter: Guangyuan Feng

Votes: 0

Watchers: 2

Dates

Created: 07/Jun/23 09:28

Updated: 14/Jun/23 02:26

Resolved: 14/Jun/23 02:26