Efficient processing of top-k frequent spatial keyword queries

# Efficient processing of top-k frequent spatial keyword queries

To evaluate the performance of the proposed RCL-tree and tfSKQ algorithm, we conduct a series of comparative experiments with some existing methods using the actual STBD set. Later, after processing, we evaluated their effectiveness and efficiency, accordingly using tables and figures as presented below.

### Data preparation and preprocessing

This paper employs two spatial textual datasets to evaluate the proposed RCL-tree and tfSKQ algorithm. One is a real business dataset from “Yelp Open Dataset” (yelp.com/dataset), named “Yelp”, which contains about 192,609 businesses, including 8 fields such as “business_id”, “latitude”, “longitude”, “starts”, “review_count”, “is_open”, “attributes”, “categories”, etc. The other is a POI dataset from AutoNavi (www.amap.com), named “Amap”, which contains 483,991 business POIs in Shanghai, China. Because the concept lattice structure in the RCL-tree accepts the binary fields only, the above two raw dataset need to be preprocessed as the binary formal context with multi textual keywords.

For Yelp, we select some important fields from the business dataset and design a binary formal context with 41 columns divided into five categories, as shown in Table 1. The first 26 columns are from the “categories” field and cover the business dataset completely. In other words, every record of the business dataset satisfies one or more of them. Columns 27–29 from “review_count” discretize the number of reviews into three grades: Rc_low, Rc_middle, Rc_high based on the tri-sectional quantiles of “review_count”. Columns 30–32 discretize the “stars” into three grades: S_low, S_middle, S_high in [0,2], [2.5,3.5] [4, 5]. Columns 33 is from “is_open” and represents the operation status of object. Columns 34–41 selected from “attributes” include 8 common features of business that covered about 85% of total data records with one or more than 1 value, while other 15% records are all of 0 value in these 8 columns.

Then, $${\mathbb{D}}_{Yelp} = \{ d_{i} |1 \le i \le 192,609\}$$, the textual keywords set $$K = \{ \left\langle {k_{1} ,k_{2} , \ldots ,k_{j} , \ldots ,k_{41} } \right\rangle {|} k_{j} \in \left\{ {0,1} \right\},1 \le j \le 41\}$$, and the average keywords coverage is about 11% that means each spatial object has about 4.5 keywords on average.

For Amap, except for location and category information, it has no keywords suitable for the binary formal context. To ensure the comparability of experimental results, we also want to design 41 simulation textual keywords similar to Yelp to modify Amap. In addition, to present the effect of data complexity on retrieval performance, we set the average keywords coverage to 17%, about 7 keywords of per spatial object, to construct the Amap dataset. Then, $${\mathbb{D}}_{Amap} = \{ d_{i} |1 \le i \le 483,991\}$$, the textual keywords set $$K = \{ \left\langle {k_{1} ,k_{2} , \ldots ,k_{j} , \ldots ,k_{41} } \right\rangle {|} k_{j} \in \left\{ {0,1} \right\},1 \le j \le 41\}$$, and the average keywords coverage of Amap is 17%, about 7 keywords per spatial object.

All of experiments are performed on Python 3.7 with a computer equipped with Intel i5, 3.0 GHz CPU, 24 GB RAM, and 64bit Windows 10 operation system.

### RCL-tree evaluation

To initialize the RCL-tree index structure, Algorithm 1 (see in “Initialization algorithm” section) need to be conducted, and two thresholds, $$\theta$$ and $$\delta$$, need to be determined in advance. $$\theta$$ is the range of R-tree node entries, and $$\delta$$ is the range of data volume of R-tree node linked to concept lattice. In general, $$\theta$$ is designed to have a similar number of entries for nodes to balance the retrieval time. In addition, for RCL-tree, few node entries make simple node structure and is helpful to link to concept lattice efficiently. Therefore, let $$\theta_{Yelp} = \left[ {2,4} \right]$$ be the range of R-tree node entries in Yelp. The R-tree structure of RCL-tree in Yelp can be built, and 291,678 tree nodes are generated, including $$192,609$$ leaf nodes, $${\mathbb{R}}_{{{\varvec{Yelp}}}} = \{ n_{1} ,\left[ {2,4} \right],{ }\left\langle {n_{1} , n_{2} , \ldots ,n_{i} , \ldots ,n_{291,678} } \right\rangle |1 \le i \le 291,678\}$$, $${\mathbb{R}}_{{{\varvec{Yelp}}}} .root = n_{1}$$.

$$\delta$$ is an important factor to determine how many concept lattices should be built. Since tfSKQ is to retrieve the k objects by traversing concept lattices, we expect that the k query results can be obtained by traversing as few concept lattice structures as possible, in other word, we expect the k and the data volume of concept lattice have a similar value range. To achieve it, we explore the detailed statistical features of R-tree nodes in $${\mathbb{R}}_{{{\varvec{Yelp}}}}$$ in Yelp, and the results are shown in Fig. 4 and Table 2. In Fig. 4, the box diagrams of data volume of R-tree nodes in level 1–8 (the maximum level $${\mathbb{R}}$$ is 11) of are drawn based on the level of R-tree nodes. And the nodes of level 2–5 are in the range of [5, 500] of k, which is a widely recognized query range and often used in a variety of related literatures. We can create concept lattice structures linked with these R-tree nodes in level 2–5 one by one to meet the efficient tfSKQ. However, as you can see from Table 2, the number of nodes in level 2, 22,149, is too large to the initialization of RCL-tree, and the minimum value of nodes in level 2 is 4, which means that a considerable number of nodes in level 2 do not meet the query number k. Therefore, for the yelp business datasets, we employ these level 3–5 R-tree nodes to build concept lattices one by one and set $$\delta_{Yelp} = \left[ {9,413} \right]$$, covering all 11,142 tree nodes in levels 3–5. Then, 11,142 concept lattices are built, and $${\mathbb{L}}_{Yelp} = \{ L_{1} ,L_{2} , \ldots ,L_{i} , \ldots ,L_{11142} {|}1 \le i \le 11,142, L_{i} .F.size \in \left[ {9,413} \right]$$}, the RCL-tree of Yelp is initialized, $${\mathbb{I}}_{Yelp} = \left\langle {{\mathbb{R}}_{Yelp} ,\user2{ }{\mathbb{L}}_{Yelp} } \right\rangle$$.

We also conduct similar experiments on Amap, based on these same principles, the RCL-tree of Amap is created. Specifically, $$\theta_{Amap} = \left[ {2,4} \right]$$, 732,342 R-tree nodes are generated, and $${\mathbb{R}}_{Amap} = \{ n_{1} ,\left[ {2,4} \right],{ }\left\langle {n_{1} , n_{2} , \ldots ,n_{i} , \ldots ,n_{732,342} } \right\rangle |1 \le i \le 732,342\}$$. Concept lattices are created on 27,930 R-tree nodes in level 3–5, $$\delta_{Amap} = \left[ {8,458} \right]$$, and $${\mathbb{L}}_{Amap} = \{ L_{1} ,L_{2} , \ldots ,L_{i} , \ldots ,L_{27930} {|}1 \le i \le 27930, L_{i} .F.size \in \left[ {8,458} \right]$$, $${\mathbb{I}}_{Amap} = \left\langle {{\mathbb{R}}_{Amap} ,\user2{ }{\mathbb{L}}_{Amap} } \right\rangle$$.

Table 3 shows the details of the initialized RCL-tree on $${\mathbb{D}}_{Yelp}$$ and $${\mathbb{D}}_{Amap}$$. Only 3.8% R-tree nodes need to link to concept lattice, thus saving storage space and improving initiation efficiency. In addition, the number of concepts in concept lattice is greater than the number of objects, which represents the complexity of textual keywords. The more the complexity in the textual keywords of objects, the more the concepts in concept lattices.

To evaluate the efficiency of RCL-tree initialization process (Algorithm 1), the influences of data volume on $${\mathbb{D}}_{Yelp}$$ are demonstrated by Fig. 5. As shown in Fig. 5a, dark colour rectangles represent the initialization time of $${\mathbb{R}}_{Yelp}$$ in $${\mathbb{I}}_{Yelp}$$, and light colour rectangles represent the initialization time of $${\mathbb{L}}_{Yelp}$$, and the initialization time of $${\mathbb{I}}_{Yelp}$$ is the sum of them. Obviously, $${\mathbb{R}}_{Yelp}$$ time is always less than $${\mathbb{L}}_{Yelp}$$ time. And with the increase of data volume, the initialization time of $${\mathbb{I}}_{Yelp}$$ increases linearly. For $${\mathbb{D}}_{Yelp}$$, included 192,609 spatial textual objects, the time of $${\mathbb{I}}_{Yelp}$$, $${\mathbb{R}}_{Yelp}$$, and $${\mathbb{L}}_{Yelp}$$ is about 175 s, 69 s, and 106 s.

In addition, we analyse the quantitative relationship between $${\mathbb{L}}_{Yelp}$$ and $${\mathbb{R}}_{Yelp}$$. Let $$\rho_{Yelp} = 100 \times {\mathbb{L}}_{Yelp} .size/{\mathbb{R}}_{Yelp} .{\text{size}}$$ be the ratio of the number of concept lattices in $${\mathbb{L}}_{Yelp}$$ to the number of nodes in $${\mathbb{R}}_{Yelp}$$. Figure 5b shows the trends of $$\rho_{Yelp}$$ with different data volumes. As you can see, $$\rho_{Yelp}$$ always fluctuates around 3.8. Therefore, we can think that the setting of $$\delta_{Yelp}$$ is reasonable and adequate. Similar conclusions, $$\rho_{Amap} \approx 3.8$$, can also be obtained on Amap and will not be repeated here, since the same setting of $$\delta_{Yelp}$$ and $$\delta_{Amap}$$ that they all build concept lattices at level 3 to 5.

### The evaluation and comparison of tfSKQ

Based on the RCL-tree, the proposed tfSKQ algorithm takes spatial point $${\mathcalligra{p}}$$ and textual keywords $${\mathcal{K}}$$ as the query conditions to retrieve the $$k$$ most frequent and nearest items on $${\mathbb{D}}_{Yelp}$$ and $${\mathbb{D}}_{Amap}$$. Different with the common top-k spatial keyword query (TkSKQ), tfSKQ can not only express spatial proximity but also reveal the textual keyword aggregation features of spatial objects to present the frequent items and its frequency.

To evaluate the performance of the proposed tfSKQ algorithm shown in Algorithm 2–5, a similar algorithm proposed by Ahmed et al.10 is employed. Ahmed proposes a hybrid index structure with a R-tree and some top-k sorted term lists (STLs), and develops algorithms to efficiently answer the top-k frequent spatiotemporal terms (kFST) query. Similar with IR-tree1, STLs index structure employ inverted structure to store sorted keyword lists in tree nodes and leaf nodes of the R-tree structure, but the difference is that STLs maintain the frequency of each keyword in nodes. To make the STLs index and RCL-tree comparable, we use the parameter $$\delta$$ of RCL-tree to limit tree nodes linked to sorted term lists in STLs index, that is to say, in STLs index, only the level 3 to 5 R-tree nodes connect with sorted term lists. We call this variant of the SLTs index as δSTLs. Note that, since δSTLs only stores single keyword’s frequency in STLs, it can only answer the frequency with the 0 textual keyword, i.e. $${\mathcal{K}} = \{ \}$$, and cannot analyse the frequency of complex multiple keywords combinations.

We also compare tfSKQ with two classical frequent items algorithms Apriori17 and FP-Growth18. Apriori algorithm employs the support degree as the criterion of judging frequent items to find the largest multiple frequent items. FP-Growth algorithm constructs a frequent pattern tree (FP-tree), maps data to the tree, and finds all frequent FP-tree items. Based on them, we develop two baseline index schemas to compare with RCL-tree and tfSKQ algorithm.

One is the combination of a R-tree structure and some frequent item tables generated by Apriori algorithm, named A-frequent. It employs a R-tree structure to index the spatial information and employs some frequent item tables generated by Apriori algorithm to store the frequent items of the textual keyword information of each R-tree node. Each record in the frequent item table includes two columns $$\left\langle {frequent item, frequency} \right\rangle$$, i.e. the frequent item and its frequency. A-frequent method can retrieve the k most frequent items to answer tfSKQ by the query conditions and the minimum support degree parameter. The second is the hybrid of R-tree and FP-tree, named F-frequent. It employs a R-tree structure and some FP-tree structures to index spatial information and textual keywords of each R-tree node respectively. The tfSKQ can be solved by the given query conditions and the minimum support parameter.

Like RCL-tree, A-frequent and F-frequent are both limited by $$\delta$$, i.e. frequent item tables in A-frequent and FP-tree structures in F-frequent are both built in level 3 to5 R-tree nodes. In addition, in A-frequent and F-frequent methods, the minimum support degree for querying frequent items is set to 0.1%.

Then, the RCL-tree is compared with the above three methods, δSTLs, A-frequent, and F-frequent, in Yelp and Amap dataset, and the results are as follows.

Figure 6 shows the comparisons of initialization time. In Fig. 6a,b, since δSTLs only stores single keyword’s frequency, it has the shortest total initialization time 129 s in Yelp and 395 s in Amap, while the other three methods need longer time to maintain all frequent information including the frequency information of single keyword and multiple keywords. Except for δSTLs, others increase dramatically from Yelp to Amap, A-frequent increases by about 76 times, F-frequent 19 times, and RCL-tree 25 times, while the increase of data volume is about 2,5 times. It means that the maintenance cost of frequent items is affected by data volume, and is more related to the complexity of data itself. These differences are also shown in Fig. 6c,d, with the increase of data volume, the initialization time gaps between them remain unchanged. In addition, since A-frequent employ table structure to maintain frequent information, there are many table-based traversal operations and a large number of data insertions and update in the initialization of A-frequent method, A-frequent always has the much longer initialization time in $${\mathbb{I}}_{Yelp}$$ and $${\mathbb{I}}_{Amap}$$ than others. Compared with A-frequent, F-frequent uses tree structure to do it and RCL-tree uses lattice structure. Among the three methods that store multiple keywords frequent information, as shown in Fig. 6a–d, RCL-tree always has the shortest initialization time.

Comparative results of storage space are given in Fig. 6e,f. In Yelp, with a R-tree structure 41 MB and some frequent item tables 1177 MB, A-frequent has the maximum storage space, 1218 MB. δSTLs has the minimum storage space of 72 MB with a R-tree structure 41 MB and some STLs 31 MB, because only the frequent information about single keyword is stored in it. And F-frequent, RCL-tree are 797 MB, 345 MB with FP-tree set 756 MB and concept lattices 304 MB respectively. Similar differences of them are also show in Fig. 6f, with the same R-tree 130 MB, the other component of A-frequent has the maximum 26237 MB, followed by F-frequent 18081 MB and RCL-tree 3596 MB, and δSTLs 97 MB at least. It indicates that these four index structures have the same R-tree component, and when multi keyword frequent information is stored, the concept lattices component in RCL-tree is the most compact and efficient storage structure than FP-tree of F-frequent, and the frequent item tables of A-frequent.

Next, the comparison of retrieval time of tfSKQ are conducted by three aspects: data volume, the number of query results, and the number of query keywords, are as below. Note that, because of the uneven distribution of spatial objects, random query points of tfSKQ often bring different query results, which gives difficult to objectively present the algorithm performance. To avoid it, the results of each query experiments are the average of 100 experiments under the same query conditions.

Firstly, the effects of data volume on retrieval time are given in Fig. 7. Under the different number of query keywords and k = 10, the tfSKQ results of these four methods are significantly different. Because δSTLs can only be applied to tfSKQ with empty keyword query condition, i.e. $${\mathcal{K}}.size = 0$$ or $${\mathcal{K}} = \{ \}$$, δSTLs only participates in the comparative experiments of $${\mathcal{K}}.size = 0$$. Shown in Fig. 7a,b, STLs has the best performance than others, RCL-tree has the worse retrieval time in some cases, and the retrieval time of A-frequent and F-frequent dose not grow steadily with the increase of data volume. In Fig. 7c–f, the query keyword set $${\mathcal{K}}$$ is not an empty set, the results are reversed, the retrieval time of RCL-tree is significantly better than that of A-frequent and F-frequent in both of $${\mathbb{I}}_{Yelp}$$ and $${\mathbb{I}}_{Amap}$$. And these gaps are more pronounced in $${\mathbb{I}}_{Amap}$$. That is because the frequent items stored by δSTLs, A-frequent, and F-frequent are ordered and the frequency of single keyword is easier to retrieve, while the frequent items stored by RCL-tree are generalized as concepts, and the frequency of keyword need to be deduced from concept lattice. In addition, it can be seen that the retrieval time of A-frequent and F-frequent are unstable in all three cases, and they grow leaps and bounds with the increase of data volume, while the retrieval time of RCL-tree always increases linearly with the increase of data volume. It indicates that RCL-tree has better robustness and adaptability than other methods in complex tfSKQ.

Figure 8 shows the effect of k and the number of query keywords on retrieval time with the full data set. In Fig. 8a–d, we still employ the number of query keywords as a factor to observe the performance of these four methods. Figure 8a,b show the effect of k with $${\mathcal{K}}.size = 0$$ in $${\mathbb{I}}_{Yelp}$$ and $${\mathbb{I}}_{Amap}$$. We can see that δSTLs is still the best method, and RCL-tree is the worst one in most cases. This situation is changed when $${\mathcal{K}}.size = 1$$. As shown in Fig. 8c,d, A-frequent and F-frequent have the same trends with the increase of k, the performance of RCL-tree is great better than that of A-frequent and F-frequent, and the gap between them grows with the increase of k. When k = 500, the retrieval time of $${\mathbb{I}}_{Yelp}$$ is 35.6 ms, which is about 1/5 of A-frequent 173.7 ms and F-frequent 181.1 ms, and $${\mathbb{I}}_{Amap}$$ is 35.0 ms, which is about 1/15 of A-frequent 517.0 ms and 1/12 of F-frequent 400.0 ms.

Obviously, RCL-tree has more advantages than other methods when $${\mathcal{K}}$$ is not an empty set. The detailed analysis about the effect of $${\mathcal{K}}$$ on retrieval time with k = 10 and the full data set are shown in Fig. 8e,f. We can see that as the number of query keywords increases, the process of tfSKQ becomes more complex, and the advantages of RCL-tree is more obvious. When the number of query keywords is 5, the retrieval time of $${\mathbb{I}}_{Yelp}$$ is 98.8 ms, which is about 1/5 of F-frequent 452.0 ms and A-frequent 466.9 ms. Similarly, $${\mathbb{I}}_{Amap}$$ is 148 ms, and F-frequent 550.0 ms, A-frequent 524.0 ms.

In this section, we employ two datasets, Yelp and Amap, to compare the performance of RCL-tree with other three methods, δSTLs, A-frequent, and F-frequent, in initialization and tfSKQ. Although δSTLs performs well in keyword free query, it cannot directly achieve tfSKQ of multi keyword query due to its own structure. There is no doubt that in the case of multi keyword query, RCL-tree has the best efficiency in initialization and tfSKQ, its retrieval performance is at least 5 times of A-frequent and F-frequent, and its storage occupy is at least 2/5 of F-frequent and 1/4 of A-frequent. It is worth mentioning that, on two dataset Yelp and Amap, the RCL-tree has the stable performance and its retrieval efficiency is always better than other baseline methods.

Banking

Banking

Banking

Banking

Banking