WISK: A Workload-aware Learned Index for Spatial Keyword Queries

Slide Note

WISK, a workload-aware learned index that combines spatial and keyword queries to efficiently retrieve objects. It integrates spatial and textual indexes and considers query workload information.

Uploaded on Dec 21, 2023 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.


Presentation Transcript

  1. WISK: A Workload-aware Learned Index for Spatial Keyword Queries Yufan Sheng1, Xin Cao1*, Yixiang Fang2, Kaiqi Zhao3, Jianzhong Qi4, Gao Cong5, Wenjie Zhang1

  2. Introduction Geo-textual objects: stores, tourist attractions, business, etc. Geographical location (e.g. Latitude 33 51'S, Longitude 151 12'E) Text description (e.g. opera, live, car park) Spatial keyword query: retrieve the resultant object combining spatial query and keyword search. Spatial query: range query, k-NN query. Keyword search: Boolean search, ranking-based search. Spatial keyword index: integrate spatial index into textual index[1] Combination schema: spatial-first/textual-first integration, tight integration 1.Chen, Z., Chen, L., Cong, G., Jensen, C. S.: Location-and keyword-based querying of geo-textual data: a survey. In: The VLDB Journal, pp 603-640 (2021). 2

  3. Motivation However, traditional indexes only consider the distribution of the underlying data. Challenge 1: how to utilize the information from the query workload? Assume two queries (red) have different keywords (e.g. q1.kws = {Park, BBQ}, q2.kws = {Hotel, Pizza} (b) Query-aware (a) Data-driven Solution: query-driven partition + RL packing 3

  4. Motivation Multi-dimensional learned index cannot adapt to the high dimensionality scenario. Challenge 2: how to consider the spatial and textual information simultaneously? (b) Tight combination (a) Loose combination Solution: keyword-based CDF models 4

  5. Problem Definition SKR Query: Given a query q with a query region (i.e. q.area) and a set of keywords (i.e. q.kws), the result of q is a set of objects, which are within the region and at least contain one keyword. {Italian, Restaurant} O 4 {Chinese, Asian, Restaurant} O7 {Chinese, Restaurant} {India, Market} {Thailand, Restaurant} O5 Problem statement: Given a geo-textual dataset D and a set of historical SKR queries W, we aim to learn a spatial keyword index that can efficiently process the incoming SKR queries Q. O3 {Market} O2 {Italian, Restaurant} O6 {Chinese, Restaurant} O1 Note: We assume the distributions of W and Q are the same. 5

  6. Overview Two separate steps ML to solve the optimization problem RL to construct the hierarchical index 6

  7. Cost Formulation 7

  8. Bottom Partition Each bottom partition can be considered as a set of objects Problem: Optimal Partitioning Given a dataset D = {?1, ?2, . . . , ??} and a query workload W = {?1, ?2, . . . , ??}, we aim to find an optimal partition, i.e., a set of k clusters ? = {?1, ?2, . . . , ??}, where I. Each object belongs to exactly one cluster. II. The total cost is minimized. NP-hard problem: can be reduced from the MaxSkip partitioning problem [1, 2]. 8 1.Sun, L., Franklin, M. J., Krishnan, S., Xin, R. S.: Fine-grained partitioning for aggressive data skipping. In: SIGMOD, pp. 1115-1126 (2014). 2.Yang, Z., Chandramouli, B., Wang, C., Gehrke, J., Li, Y., Minhas, U. F., Larson, P. A., Kossmann, D., Acharya, R.: Qd-tree: Learning data layouts for big data analytics. In: SIGMOD, pp. 193-208 (2020).

  9. Greedy Partition Motivated by K-D-Tree, we propose a greedy bi-partitioning algorithm Update Lower cost Split Y X Ignore Higher cost 9

  10. Greedy SGD Partition Key problems in splitting Which dimension? Where? X x xb xu 10

  11. Bottom-up Packing The generated bottom clusters can be regarded as a flat index The pruning power can be further improved Definition: Optimization Criterion Query time directly reflect the pruning capability but impossible to use The number of accessed nodes proportional to the time cost and easy to calculate with query labels Problem: Bottom-up Packing Given a set of bottom clusters G = {c1, c2, . . . , ck} and a query workload W = {?1, ?2, . . . , ??}, we aim to construct a hierarchical index level by level, which can minimize the number of accessed nodes 11

  12. RL Packing We formulate the bottom-up packing as an MDP, which can be solved by RL State: upper node query labels + # of bottom nodes bottom node query labels Bottom nodes X Y Z X + 1 3 2 Action: add the bottom node into one upper node Transition: update the query label of the upper node Z + Z + 3 1 2 2 Y Y Y + + + 3 3 3 1 1 1 X Y Z + Z 2 2 2 Z Z Z Z Z Z + + + + + + 3 3 3 3 3 1 3 1 3 3 1 1 1 1 1 1 2 2 2 2 2 2 2 Reward: the difference of the average number of accessed nodes 12

  13. Experiments Datasets Baseline Traditional index: CDIR-Tree, SFC-Quad, ST2I, ST2D Learned index: TFI, LSTI, Flood-T Settings 13

  14. Experiments Varying the query distribution WISK can work well across different settings Varying the query region size Varying the number of keywords 14

  15. Experiments Scalability Robustness ST2I Flood-T WISK ST2I LSTI Flood-T WISK 250 3600 Query time (ms) Query time (ms) 200 2700 150 1800 100 900 0 50 1M 5M 10M 50M 100M 0 0.2 0.4 LAP ratio 0.6 0.8 1 No. of objects WISK can work well on large-scale dataset WISK can work well when query distribution changes slightly 15

  16. Conclusion Conclusion Data-driven index Query-aware index Spatial index R-Tree, RSMI, LISA Flood, Tsunami Geo-textual index WISK CDIR-Tree, SFC-Quad, LSTI Use ML algorithm to solve the optimal partitioning problem, which is shown to be NP-hard. Pack the generated clusters by using RL framework, which further improve the pruning capability. Report on extensive experiments on real-world datasets and synthetic workloads. 16

  17. Thank you! WISK: A Workload-aware Learned Index for Spatial Keyword Queries Yufan Sheng, Xin Cao*, Yixiang Fang, Kaiqi Zhao, Jianzhong Qi, Gao Cong, Wenjie Zhang yufan.sheng@unsw.edu.au 17