CN110362652B - Space keyword Top-K query method based on space-semantic-numerical correlation - Google Patents
- Publication number
- CN110362652B (application number CN201910657221.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/387—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention discloses a spatial keyword Top-K query method based on spatial-semantic-numerical relevance, comprising the following steps: performing semantic expansion on the user's initial query using word embedding; constructing a spatial-semantic-numerical hybrid index structure, the AKR-tree; and computing the spatial-semantic-numerical relevance. The invention first uses word embedding to expand the user's initial query into a series of query keywords that are semantically related to the initial keywords. It then constructs the AKR-tree hybrid index, which supports both text matching and semantic matching of query keywords and handles numerical attributes with the Skyline method. Finally, it uses the index to quickly match objects that are semantically related to the spatial keyword query and ranks them by their combined spatial-semantic-numerical relevance. Compared with existing methods of the same kind, the method yields query results with higher user satisfaction, and the index structure gives shorter query response times.
Description
Technical Field
The invention belongs to the technical field of space keyword query and word embedding, and particularly relates to a space keyword Top-K query method based on space-semantic-numerical relevance.
Background
Existing spatial keyword query processing falls mainly into two types: top-k range queries and top-k nearest-neighbor queries. Both construct a result scoring function from the text similarity and location proximity between a spatial object and the spatial keyword query, and then use a hybrid text-spatial index to improve query efficiency. Existing hybrid indexes over spatial data and text information mainly include the IR-tree and the IR²-tree; text retrieval indexing techniques mainly include inverted files, signature files and bitmap indexes. However, these spatial-text index structures focus on the location proximity and text similarity between a spatial object and the query, and rarely consider the semantic relevance of the query results. A few recent works study semantic matching for spatial keyword queries, but spatial objects also carry numerical attributes such as price and user rating in addition to location and text information. Existing methods must discretize the numerical attributes and then treat them as text attributes, which cannot effectively compare numerical magnitudes or the containment relations between numerical intervals; in practice, processing numerical information differs greatly from text matching. To the best of our knowledge, no existing work considers the combined relevance of a spatial object to a spatial keyword query in terms of location, semantics and numerical values, and no hybrid index structure supporting such a combined query has been proposed.
In addition, with the widespread use of GPS and the rapid growth of spatial Web objects, spatial keyword queries are widely used in location-based services (LBS). Most existing spatial keyword query processing methods support only location proximity and text similarity matching, and cannot return semantically related objects to users. Furthermore, current spatial-text hybrid index structures (e.g., IR-tree, IRS-tree, bR-tree, quad-tree) cannot yet handle the numerical attributes of spatial objects.
Disclosure of Invention
In view of the defects of the prior art, the technical problem to be solved by the invention is to provide a spatial keyword Top-K query method based on spatial-semantic-numerical relevance that establishes a spatial keyword query processing mode fusing location, semantic and numerical information, improves query efficiency through an effective hybrid index structure, and substantially improves both the user satisfaction with query results and the query response time.
To solve this technical problem, the invention adopts the following technical solution.
The invention provides a spatial keyword Top-K query method based on spatial-semantic-numerical relevance, comprising the following steps:
Step 1: for each word in the text information of the spatial objects, generate a word embedding vector representation, then calculate the semantic similarity between each pair of words and store it in a word semantic similarity table used to expand the semantics of the query keywords;
Step 2: construct the AKR-tree hybrid index structure;
Step 3: for the spatial keyword query given by the user, first find semantically related words in the semantic similarity table of step 1 and expand the set of query keywords; then use the constructed AKR-tree hybrid index to quickly match query results;
Step 4: during matching, check whether each branch node satisfies the spatial constraint of the query and, if it does, check whether the TextFile of the node contains the query keywords;
for the matched nodes, calculate the location proximity, semantic relevance and numerical proximity between each spatial object in the Skyline set and the query, obtain the combined score of each matching result, and select the top-k final results according to the combined score.
Preferably, the specific steps of step 1 are as follows:
Step 1.1: extract the words from the text information of all spatial objects, remove stop words, form a dictionary from all distinct words, and generate an embedded vector representation of each distinct word in the dictionary using word embedding;
Step 1.2: based on the embedded vector representations, calculate the semantic similarity between each pair of words using the cosine similarity method, for use in the semantic expansion of query keywords in the online query stage.
The specific steps of step 2 are as follows:
Step 2.1: generate the AKR-tree on the basis of the R-tree, where the information of each AKR-tree node is divided into three parts: the first two parts are two pointers, one to the file containing all keywords of the node and one to the numerical attribute file, and the third part is the set of entries in the node;
Step 2.2: generate the Skyline set of the numerical attribute tuples of the spatial objects under each intermediate node of the AKR-tree.
Further, the specific steps of step 3 are as follows:
Step 3.1: expand the spatial keyword query, use the AKR-tree to obtain the nodes that match the query, and take the spatial objects in the Skyline sets of the matched nodes as the candidate result set;
Step 3.2: for each spatial object in the candidate result set, calculate its location proximity, semantic relevance and numerical proximity to the query;
Step 3.3: calculate the combined relevance score of the spatial object and the query, and select the top-k final results according to the score.
Optionally, the rule for generating the Skyline set of numerical attribute tuples in step 2.2 is as follows:
Suppose the set $D$ of numerical attribute tuples corresponding to all spatial objects under a node of the AKR-tree contains $n$ tuples over $m$ numerical attributes, let $A = \{A_1, A_2, \ldots, A_m\}$, and let $t[A_i]$ be the value of attribute $A_i$ in tuple $t$;
Assume that, for each attribute, the values are partially ordered under the dominance relation. A tuple $t \in D$ dominates another tuple $t' \in D$, denoted $t \succ t'$, if and only if $\forall A_i \in A,\ t[A_i] \geq t'[A_i]$ and $\exists A_j \in A,\ t[A_j] > t'[A_j]$, where $\geq$ and $>$ are taken with respect to the partial order on each attribute;
A tuple $t \in D$ is incomparable with another tuple $t' \in D$, denoted $t \sim t'$, if and only if $t \nsucc t'$ and $t' \nsucc t$.
Optionally, the method for calculating the position closeness, semantic relatedness and numerical closeness between the query and the spatial object in step 3.2 is as follows:
(1) Location proximity between the query and a spatial object:
$$S_{Loc}(o,q) = 1 - \frac{D(q.\lambda, o.\lambda)}{MaxD}$$
where $D(q.\lambda, o.\lambda)$ is the Euclidean distance between the location of the spatial object and the query location, and $MaxD$ is the maximum Euclidean distance among all spatial objects;
(2) Semantic relevance between the query and a spatial object:
The text information of the spatial object and the keywords of the query are first vectorized. The dimension of the vectors, denoted $m$, is the number of distinct words contained in the text information of the spatial object and in the query together. The words are arranged in a fixed order to form a sequence; for each word of the sequence, the corresponding position of the vector $V_o$ (respectively $V_q$) is set to 1 if the word is contained in the spatial object (respectively the query) and to 0 otherwise. The similarity between the two vectors is then calculated with the cosine similarity method:
$$S_{Text}(o,q) = \frac{\sum_{i=1}^{m} V_o[i] \cdot V_q[i]}{\sqrt{\sum_{i=1}^{m} V_o[i]^2} \cdot \sqrt{\sum_{i=1}^{m} V_q[i]^2}}$$
where $V_o[i]$ and $V_q[i]$ are the $i$-th elements of the vectors $V_o$ and $V_q$, respectively;
(3) Numerical proximity between a spatial object and the query:
$$S_A(o,q) = 1 - \sum_{i=1}^{|q.W|} q.w_i \cdot o.a_i$$
where $q.W$ is the set of weights of the different numerical attributes and represents the user's preference for those attributes, with $q.w_i \geq 0$ ($i = 1, \ldots, |q.W|$) and $\sum_{i=1}^{|q.W|} q.w_i = 1$; $o.a_i$ is the value of numerical attribute $A_i$ of object $o$, normalized to $[0,1]$. Smaller values of these numerical attributes are assumed to be better (e.g., lower noise, lower price); if a larger value is better (e.g., ambience, rating), the value can be converted by $a_i = 1 - a_i$;
(4) The location-semantic similarity between a spatial object and the query is calculated as:
$$S_{LT}(o,q) = \alpha \cdot S_{Loc}(o,q) + (1-\alpha) \cdot S_{Text}(o,q)$$
where $\alpha$ is a tuning parameter.
Optionally, in step 3.3, the final combined score of the spatial object and the query is calculated as:
$$Score(o,q) = \beta \cdot S_{LT}(o,q) + (1-\beta) \cdot S_A(o,q)$$
where $\beta$ is a tuning parameter.
Thus, the spatial keyword Top-K query method based on spatial-semantic-numerical relevance uses word embedding to semantically expand spatial keyword queries, and improves query efficiency and support for text and numerical queries by constructing the AKR-tree hybrid index and the Skyline sets of numerical attribute tuples. Experimental results show that the proposed algorithm supports semantically approximate spatial keyword queries, handles numerical attributes, and substantially improves both the user satisfaction with query results and the query efficiency.
The foregoing is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features and advantages of the invention are easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.
FIG. 1 is a block diagram of the spatial keyword Top-K query method based on spatial-semantic-numerical relevance according to the present invention;
FIG. 2 is a diagram of an AKR-tree index structure in an embodiment of the present invention;
FIG. 3 is a structural diagram of an AKR-tree index constructed by using the data in Table 1 in the embodiment of the present invention;
FIG. 4 is a comparison graph of query response times of the IR-tree, IRS-tree and AKR-tree used when the number k of query results is different in the Yelp data set according to the embodiment of the present invention;
FIG. 5 is a comparison graph of query response times used by an IR-tree, an IRS-tree, and an AKR-tree, when the number of numerical attributes is different in the Yelp data set in the embodiment of the present invention;
FIG. 6 is a comparison chart of query response times used by IR-tree, IRS-tree and AKR-tree when the number of query keywords is different in the Yelp data set in the embodiment of the present invention;
FIG. 7 is a comparison chart of the index construction times of the IR-tree, IRS-tree and AKR-tree on Yelp data sets of different sizes according to the embodiment of the present invention;
FIG. 8 is a comparison chart of query accuracy rates obtained by using an IR-tree, an IRS-tree, and an AKR-tree when the number k of query results is different in the Yelp data set in the embodiment of the present invention.
Detailed Description
Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.
The invention relates to a space keyword Top-K query method based on space-semantic-numerical correlation, which is mainly applied to the fields of current popular Location Based Service (LBS) systems and space interest point recommendation, and the overall processing flow is as follows:
(1) Semantic expansion is performed on the user's initial query using word embedding, generating a series of query keywords that are semantically related to the initial query keywords.
(2) A spatial-semantic-numerical hybrid index structure, the AKR-tree, is constructed. The index supports both text matching and semantic matching of query keywords, and numerical attributes are handled with the Skyline method.
(3) The index structure is used to quickly match objects that are semantically related to the spatial keyword query, and the results are ranked by their combined spatial-semantic-numerical relevance.
The block diagram of the top-k semantic approximate query method for spatial keywords is shown in FIG. 1. The specific implementation of the invention and the results of each significant phase are described below using the data and query of Table 1.
TABLE 1 location, semantic, and numerical information of spatial objects and examples of spatial keyword queries
Some relevant definitions to which the method of the invention relates are as follows:
Given a spatial data set $O = \{o_1, o_2, \ldots, o_n\}$, each spatial object $o_i$ is a triplet $(\lambda, K, A)$, where $o_i.\lambda$ is the location of $o_i$ (a two-dimensional spatial object is usually represented by latitude and longitude), $o_i.K$ is the set of text keywords of $o_i$, and $o_i.A$ is the set of numerical attributes of $o_i$. Note that each value $o.a_i$ in $o_i.A$ is normalized to $[0,1]$ and that smaller values of these numerical attributes are assumed to be better (e.g., lower noise, lower price); if a larger value of a numerical attribute is better (e.g., ambience, rating), it can be converted by $a_i = 1 - a_i$. A spatial keyword query $q$ is a triplet $(\lambda, K, W)$, where $q.\lambda$ is the query location, $q.K$ is the set of query keywords, and $q.W$ is the set of the user's preference weights for the different numerical attributes ($q.w_i \geq 0$ for $i = 1, \ldots, |q.W|$, and $\sum_{i=1}^{|q.W|} q.w_i = 1$).
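The definitions above can be sketched as plain data types. This is only an illustration: the class names, the `normalize` helper and the min-max scaling scheme are assumptions (the text states that values are normalized to [0, 1] but not how), and the check that the weights sum to 1 is likewise an assumed constraint.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class SpatialObject:
    loc: Tuple[float, float]   # o.lambda: (latitude, longitude)
    keywords: Set[str]         # o.K: text keywords
    attrs: List[float]         # o.A: values in [0, 1], smaller is better


@dataclass
class SpatialKeywordQuery:
    loc: Tuple[float, float]   # q.lambda: query location
    keywords: Set[str]         # q.K: query keywords
    weights: List[float]       # q.W: preference weights per numerical attribute

    def __post_init__(self) -> None:
        # Assumed constraint: weights are non-negative and sum to 1.
        if any(w < 0 for w in self.weights):
            raise ValueError("weights must be non-negative")
        if abs(sum(self.weights) - 1.0) > 1e-9:
            raise ValueError("weights must sum to 1")


def normalize(values: List[float], higher_is_better: bool = False) -> List[float]:
    """Min-max scale to [0, 1] (assumed scheme); apply a = 1 - a when higher is better."""
    lo, hi = min(values), max(values)
    span = hi - lo
    scaled = [(v - lo) / span if span else 0.0 for v in values]
    return [1.0 - s for s in scaled] if higher_is_better else scaled


prices = normalize([10.0, 20.0, 30.0])                       # lower price is better
ratings = normalize([3.0, 4.0, 5.0], higher_is_better=True)  # higher rating is better
```

After conversion both lists follow the same convention: the cheapest item and the best-rated item each map to 0.0.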
Step 1: generate a word embedding vector representation for each word in the text information of the spatial objects, then calculate the semantic similarity between each pair of words and store it in a word semantic similarity table for the semantic expansion of query keywords.
Step 1.1: generate an embedded vector representation (embedding vector) for each word using word embedding. For example, Table 1 shows sample data from the Yelp data set; applying the method of the invention to this sample, the embedded vector representations of the words restaurants and tea are computed as follows:
Restaurants:
Tea:
In the word embedding generation method of step 1.1, all words are read from the text descriptions of all spatial objects, a dictionary (vocabulary) is built by removing stop words and keeping the words with the highest frequency, and an <UNK> token is added to the dictionary to represent words that do not appear in it. The invention implements word embedding in PyTorch, generating an embedded vector representation (word embedding) for each word in the dictionary. The hyperparameters are set as follows: the number of negative samples K is 100, the number of context words C is 3, the number of training epochs NUM_EPOCH is 2, the batch size BATCH_SIZE is 128, the vocabulary size VOCAB_SIZE is 30000, the learning rate LEARNING_RATE is 1e-4, and the word vector dimension EMBEDDING_SIZE is 100. Adam is used as the optimizer.
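The vocabulary construction described above can be sketched in plain Python. The stop-word list, the whitespace tokenizer and the function names `build_vocabulary` and `encode` are illustrative assumptions; only the hyperparameter values in `HYPERPARAMS` come from the text, and the embedding training itself is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hyperparameter values as stated in the text.
HYPERPARAMS = {
    "K": 100,                # negative samples
    "C": 3,                  # context words
    "NUM_EPOCH": 2,
    "BATCH_SIZE": 128,
    "VOCAB_SIZE": 30000,
    "LEARNING_RATE": 1e-4,
    "EMBEDDING_SIZE": 100,
}

# Illustrative stop-word list (an assumption, not the list used by the invention).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}


def build_vocabulary(texts: Iterable[str], vocab_size: int) -> Dict[str, int]:
    """Keep the most frequent non-stop words and reserve one slot for <UNK>."""
    counts = Counter(
        w for text in texts for w in text.lower().split() if w not in STOP_WORDS
    )
    words = [w for w, _ in counts.most_common(vocab_size - 1)]
    vocab = {w: i for i, w in enumerate(words)}
    vocab["<UNK>"] = len(vocab)
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Map words to ids; words outside the vocabulary map to <UNK>."""
    unk = vocab["<UNK>"]
    return [vocab.get(w, unk) for w in text.lower().split() if w not in STOP_WORDS]


texts = ["the tea house", "tea and coffee", "coffee bar in town"]
vocab = build_vocabulary(texts, vocab_size=3)
ids = encode("the tea bar", vocab)  # "bar" is out of vocabulary here
```

With `vocab_size=3`, only the two most frequent words ("tea" and "coffee") are kept, so "bar" is encoded as `<UNK>`.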
Step 1.2: calculate the semantic similarity between each pair of words using the cosine similarity method. For example, the similarity computed for restaurant and tea is 1.1146.
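A minimal sketch of the similarity table and of keyword expansion follows. The toy two-dimensional embeddings, the function names and the similarity threshold used to select related words are assumptions; the text does not state how related words are chosen from the table.

```python
import math
from itertools import combinations
from typing import Dict, List, Set, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_table(
    embeddings: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Store the similarity of every unordered word pair once."""
    return {
        (w1, w2): cosine(embeddings[w1], embeddings[w2])
        for w1, w2 in combinations(sorted(embeddings), 2)
    }


def expand_keywords(
    keywords: Set[str], table: Dict[Tuple[str, str], float], threshold: float
) -> Set[str]:
    """Add every word whose similarity to a query keyword reaches the threshold."""
    expanded = set(keywords)
    for (w1, w2), sim in table.items():
        if sim >= threshold:
            if w1 in keywords:
                expanded.add(w2)
            if w2 in keywords:
                expanded.add(w1)
    return expanded


# Toy embeddings for illustration only.
embeddings = {"tea": [1.0, 0.0], "coffee": [0.9, 0.1], "hotel": [0.0, 1.0]}
table = build_similarity_table(embeddings)
expanded = expand_keywords({"tea"}, table, threshold=0.8)
```

Here "coffee" is added to the query because its vector is close to that of "tea", while "hotel" is not.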
Step 2: construct the AKR-tree hybrid index structure, as shown in FIG. 2.
Step 2.1: generate the AKR-tree on the basis of the R-tree. The information of each AKR-tree node is divided into three parts: the first two parts are two pointers, one to the file containing all keywords of the node (TextFile) and one to the numerical attribute file (AttrFile), and the third part is the set of entries (Entries) in the node. The AKR-tree is built bottom-up. For a leaf node, each entry is a quadruple of the form <o, rect, o.tid, o.aid>, where o is a spatial object, rect is the minimum bounding rectangle (MBR) of the object, o.tid is the identifier of the object's text information, and o.aid is the identifier of the object's numerical attribute tuple. For a non-leaf node, each entry is also a quadruple, of the form <pN, rect, n.pid, n.aid>, where pN is the address of a child node N, rect is the minimum bounding rectangle (MBR) that contains all child nodes under the node, n.pid is the document identifier of the node (the document contains a summary of the text information of all child nodes under the node, i.e., the extracted set of text keywords), and n.aid is the identifier of the node's numerical attribute information, which contains the Skyline set of the numerical attribute tuples of all child nodes under the node.
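The node layout described in step 2.1 can be sketched with dataclasses. This is an in-memory simplification under stated assumptions: the node holds the contents of its TextFile and AttrFile directly rather than pointers to files, the `make_parent` helper and all identifier values are hypothetical, and the Skyline set of the parent is passed in rather than computed.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple, Union

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass
class LeafEntry:          # <o, rect, o.tid, o.aid>
    obj_id: str
    rect: Rect
    tid: int
    aid: int


@dataclass
class NodeEntry:          # <pN, rect, n.pid, n.aid>
    child: "AKRNode"
    rect: Rect
    pid: int
    aid: int


@dataclass
class AKRNode:
    text_file: Set[str]                  # keywords under this node (TextFile)
    attr_file: List[Tuple[float, ...]]   # Skyline tuples under this node (AttrFile)
    entries: List[Union[LeafEntry, NodeEntry]] = field(default_factory=list)


def union_rect(rects: List[Rect]) -> Rect:
    """Minimum bounding rectangle of a list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def make_parent(children: List[Tuple[AKRNode, Rect]],
                attr_file: List[Tuple[float, ...]]) -> Tuple[AKRNode, Rect]:
    """Bottom-up step: the parent summarizes the keywords and MBRs of its children."""
    text = set().union(*(node.text_file for node, _ in children))
    entries = [NodeEntry(node, rect, pid=i, aid=i)
               for i, (node, rect) in enumerate(children)]
    return AKRNode(text, attr_file, entries), union_rect([r for _, r in children])


leaf_a = AKRNode({"tea", "cafe"}, [(0.06, 0.58)],
                 [LeafEntry("o5", (0.0, 0.0, 1.0, 1.0), tid=5, aid=5)])
leaf_b = AKRNode({"hotel"}, [(0.17, 0.0)],
                 [LeafEntry("o11", (2.0, -1.0, 3.0, 0.5), tid=11, aid=11)])
root, root_rect = make_parent(
    [(leaf_a, (0.0, 0.0, 1.0, 1.0)), (leaf_b, (2.0, -1.0, 3.0, 0.5))],
    attr_file=[(0.06, 0.58), (0.17, 0.0)],
)
```

Keeping the keyword summary and the Skyline tuples on every node is what lets a search discard a whole subtree after inspecting only its parent entry.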
The AKR-tree built from the data in Table 1 is shown in FIG. 3. Node $N_1$ contains $o_{18}, o_5, o_{11}, o_2, o_9, o_{14}, o_{24}, o_{21}, o_{20}, o_{16}, o_{26}, o_{25}, o_{13}, o_6$. Node $N_2$ contains $o_7, o_8, o_{27}, o_{38}, o_{10}, o_{19}, o_{17}, o_{37}, o_3, o_{32}, o_{23}, o_1, o_4, o_{40}, o_{12}, o_{22}, o_{42}, o_{44}, o_{46}, o_{47}, o_{49}, o_{50}$. Node $N_3$ contains $o_{29}, o_{41}, o_{36}, o_{30}, o_{39}, o_{35}, o_{31}, o_{34}, o_{28}, o_{33}, o_{43}, o_{45}, o_{48}$. The AttrFile of $N_1$ is {[0.06, 0.58], [0.17, 0.0]}, the AttrFile of $N_2$ is {[0.07, 0.92], [0.13, 0.21], [0.15, 0.18]}, the AttrFile of $N_3$ is {[0.03, 0.01]}, and the AttrFile of $N_4$ is {[0.03, 0.01], [0.17, 0.01]}.
Step 2.2: the Skyline set is generated as follows. Suppose the set $D$ of numerical attribute tuples corresponding to all spatial objects under a node of the AKR-tree contains $n$ tuples over $m$ numerical attributes, let $A = \{A_1, A_2, \ldots, A_m\}$, and let $t[A_i]$ be the value of attribute $A_i$ in tuple $t$. Assume that, for each attribute, the values are partially ordered under the dominance relation (e.g., $a \succ b$ indicates that value $a$ is better than value $b$). A tuple $t \in D$ dominates another tuple $t' \in D$, denoted $t \succ t'$, if and only if $\forall A_i \in A,\ t[A_i] \geq t'[A_i]$ and $\exists A_j \in A,\ t[A_j] > t'[A_j]$, where $\geq$ and $>$ are taken with respect to the partial order on each attribute. In addition, a tuple $t \in D$ is incomparable with another tuple $t' \in D$, denoted $t \sim t'$, if and only if $t \nsucc t'$ and $t' \nsucc t$.
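The dominance test and Skyline computation can be sketched as follows, assuming the smaller-is-better convention for normalized attribute values, so that "better or equal" becomes `<=` on the raw numbers. The simple block-nested-loop strategy and the function names are illustrative choices, not necessarily those of the invention. The first two tuples are the AttrFile of $N_1$ given in the text; the last two are hypothetical tuples added to show dominated entries being dropped.

```python
from typing import List, Tuple

AttrTuple = Tuple[float, ...]


def dominates(t: AttrTuple, u: AttrTuple) -> bool:
    """t dominates u: no worse on every attribute, strictly better on at least one.

    Values are assumed normalized so that smaller is better.
    """
    return (all(a <= b for a, b in zip(t, u))
            and any(a < b for a, b in zip(t, u)))


def skyline(tuples: List[AttrTuple]) -> List[AttrTuple]:
    """Return the tuples that are not dominated by any other tuple."""
    result: List[AttrTuple] = []
    for t in tuples:
        if any(dominates(s, t) for s in result):
            continue  # t is dominated by a current Skyline member
        # Drop members that t dominates, then keep t (skipping exact duplicates).
        result = [s for s in result if not dominates(t, s)]
        if t not in result:
            result.append(t)
    return result


tuples = [
    (0.06, 0.58),  # in the AttrFile of N1
    (0.17, 0.0),   # in the AttrFile of N1
    (0.20, 0.60),  # hypothetical, dominated by (0.06, 0.58)
    (0.17, 0.30),  # hypothetical, dominated by (0.17, 0.0)
]
sky = skyline(tuples)
```

The two surviving tuples are incomparable: one is better on the first attribute and the other on the second.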
Based on the data in Table 1, the Skyline sets of numerical attribute tuples generated for each intermediate node of the AKR-tree are shown in Table 2:
Table 2. Skyline set of each AKR-tree node, generated from the data in Table 1
Step 3: for the spatial keyword query given by the user, first find semantically related words in the semantic similarity table of step 1 and expand the set of query keywords; then use the constructed AKR-tree hybrid index to quickly match query results. During matching, check whether each branch node satisfies the spatial constraint of the query and, if it does, check whether the TextFile of the node contains the query keywords. For the matched nodes, calculate the location proximity, semantic relevance and numerical proximity between each spatial object in the Skyline set and the query, obtain the combined score of each matching result, and select the top-k final results according to the combined score. The specific steps are as follows:
step 3.1: expanding the query condition of the spatial keywords, obtaining nodes matched with the query condition by using the AKR-tree, and obtaining spatial objects in the Skyline set in the matched nodes as a candidate result set. For tablesQuery q in 1, the top 10 match results in table 1 are: o 11 ,o 13 ,o 16 ,o 5 ,o 26 ,o 15 ,o 21 ,o 9 ,o 20 ,o 18 。
Step 3.2: for each spatial object in the candidate result set, calculate its location proximity, semantic relevance and numerical proximity to the query q. These three measures between the query q and a spatial object o are calculated as follows:
(1) Location proximity between query q and spatial object o:
$$S_{Loc}(o,q) = 1 - \frac{D(q.\lambda, o.\lambda)}{MaxD} \qquad (1)$$
where $D(q.\lambda, o.\lambda)$ is the Euclidean distance between the location of the spatial object o and the location of the query q, and $MaxD$ is the maximum Euclidean distance among all spatial objects.
(2) Semantic relevance between query q and spatial object o:
The basic idea is as follows. The text information of the spatial object o and the keywords of the query q (including the expanded words) are first vectorized. The dimension of the vectors, denoted $m$, is the number of distinct words contained in the text information of o and in q together. The words are arranged in a fixed order to form a sequence; for each word of the sequence, the corresponding position of the vector $V_o$ (respectively $V_q$) is set to 1 if the word is contained in o (respectively q) and to 0 otherwise. The similarity between the two vectors is then calculated with the cosine similarity method:
$$S_{Text}(o,q) = \frac{\sum_{i=1}^{m} V_o[i] \cdot V_q[i]}{\sqrt{\sum_{i=1}^{m} V_o[i]^2} \cdot \sqrt{\sum_{i=1}^{m} V_q[i]^2}} \qquad (2)$$
where $V_o[i]$ and $V_q[i]$ are the $i$-th elements of the vectors $V_o$ and $V_q$, respectively;
(3) Numerical proximity between spatial object o and query q:
$$S_A(o,q) = 1 - \sum_{i=1}^{|q.W|} q.w_i \cdot o.a_i \qquad (3)$$
where $q.W$ is the set of weights of the different numerical attributes and represents the user's preference for those attributes, with $q.w_i \geq 0$ ($i = 1, \ldots, |q.W|$) and $\sum_{i=1}^{|q.W|} q.w_i = 1$; $o.a_i$ is the value of numerical attribute $A_i$ of object o, normalized to $[0,1]$.
(4) The location-semantic similarity between the spatial object o and the query q is calculated as:
$$S_{LT}(o,q) = \alpha \cdot S_{Loc}(o,q) + (1-\alpha) \cdot S_{Text}(o,q) \qquad (4)$$
where $\alpha$ is a tuning parameter, set to 0.5.
For example, for the matching result $o_{11}$ of query q in Table 1, the location proximity, semantic relevance and numerical proximity to q computed with equations (1) to (3) are 0.9374, 0.1889 and 0.5454, respectively.
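The three component measures can be sketched as small functions. The cosine form of the semantic relevance follows the description of the binary vectors $V_o$ and $V_q$; the forms $1 - D/MaxD$ for location proximity and $1 - \sum_i q.w_i \cdot o.a_i$ for numerical proximity are assumptions consistent with the stated definitions of $D$, $MaxD$, the weights and the smaller-is-better convention. The function names and the inputs below are toy values, not the data of Table 1.

```python
import math
from typing import Sequence, Set, Tuple


def s_loc(q_loc: Tuple[float, float], o_loc: Tuple[float, float],
          max_d: float) -> float:
    """Location proximity, assuming the form 1 - D / MaxD."""
    return 1.0 - math.dist(q_loc, o_loc) / max_d


def s_text(o_words: Set[str], q_words: Set[str]) -> float:
    """Cosine similarity of the binary word vectors V_o and V_q."""
    vocab = sorted(o_words | q_words)          # fixed word order, m = len(vocab)
    v_o = [1 if w in o_words else 0 for w in vocab]
    v_q = [1 if w in q_words else 0 for w in vocab]
    dot = sum(a * b for a, b in zip(v_o, v_q))
    norm = (math.sqrt(sum(a * a for a in v_o))
            * math.sqrt(sum(b * b for b in v_q)))
    return dot / norm if norm else 0.0


def s_attr(weights: Sequence[float], attrs: Sequence[float]) -> float:
    """Numerical proximity, assuming the form 1 - sum(w_i * a_i)."""
    if len(weights) != len(attrs):
        raise ValueError("one weight per numerical attribute is required")
    return 1.0 - sum(w * a for w, a in zip(weights, attrs))


loc = s_loc((0.0, 0.0), (3.0, 4.0), max_d=10.0)            # distance 5 of at most 10
text = s_text({"tea", "coffee", "bar"}, {"tea", "cafe"})   # one shared word
attr = s_attr([0.5, 0.5], [0.06, 0.58])
```

All three values fall in [0, 1] when the inputs respect the stated normalization, which is what allows them to be combined linearly in equations (4) and (5).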
Step 3.3: calculate the combined relevance score of the spatial object o and the query q, and select the top-k final results according to the score. The final combined score of o and q is calculated as:
$$Score(o,q) = \beta \cdot S_{LT}(o,q) + (1-\beta) \cdot S_A(o,q) \qquad (5)$$
where $\beta$ is a tuning parameter.
For example, based on the data in Table 1, the combined relevance score of the spatial object $o_{11}$ and the query q obtained with equation (5) is 0.5607, where $\beta$ in equation (5) is set to 0.86.
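The arithmetic of this example can be checked with a short sketch of equations (4) and (5), using the component values reported for $o_{11}$ (0.9374, 0.1889, 0.5454), $\alpha = 0.5$ and $\beta = 0.86$. The function names and the use of `heapq.nlargest` for the top-k selection are illustrative choices, and the scores attached to $o_{13}$ and $o_{16}$ are hypothetical placeholders.

```python
import heapq
from typing import List, Tuple


def s_lt(s_loc: float, s_text: float, alpha: float = 0.5) -> float:
    """Equation (4): location-semantic similarity."""
    return alpha * s_loc + (1 - alpha) * s_text


def score(s_lt_value: float, s_a: float, beta: float) -> float:
    """Equation (5): combined relevance score."""
    return beta * s_lt_value + (1 - beta) * s_a


def top_k(scored: List[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Return the k (object, score) pairs with the highest scores."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


lt_o11 = s_lt(0.9374, 0.1889, alpha=0.5)       # 0.5 * 0.9374 + 0.5 * 0.1889
score_o11 = score(lt_o11, 0.5454, beta=0.86)   # 0.86 * S_LT + 0.14 * 0.5454
print(round(score_o11, 4))                     # 0.5607

# Hypothetical scores for two other candidates, for illustration only.
ranked = top_k([("o13", 0.52), ("o11", score_o11), ("o16", 0.49)], k=2)
```

The intermediate value is $S_{LT} = 0.56315$, and $0.86 \cdot 0.56315 + 0.14 \cdot 0.5454 = 0.560665$, which rounds to the reported 0.5607.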
To further test the effectiveness and performance of the method, this example uses the Foursquare and Yelp data sets to demonstrate the query accuracy and query efficiency (i.e., query response time) of the method. Yelp is a well-known business review website in the United States that contains business information for restaurants, shopping centers, hotels and other fields in many regions, together with user reviews, check-in times and similar information. The real POI data were processed into 174567 points of interest, each with an ID, location information (latitude and longitude), text information and numerical attribute information: the location is used as the spatial information; the user reviews, name, city and category are used as the text information; and five randomly generated numbers between 0 and 1 are used as the numerical attribute information.
After data cleaning, the data set contains 215614 spatial objects associated with geographic locations, together with a keyword list describing each object and the normalized values of its numerical attributes; that is, each spatial object has latitude and longitude information, keyword information and four numerical attributes (price, environment, service and rating).
The following are the test results on the Yelp and Foursquare datasets for query efficiency and query accuracy using the method of the present invention. The default values for the various parameters of the process of the present invention are given in table 3. In the experimental process, the influence of a certain parameter on the experimental result is researched by changing the value of the parameter and fixing the values of other parameters. All experiments are realized by adopting Python, and a computer is configured to be a CPU 2.5GHz i7-4710HQ, a RAM 8GB and a Windows10 operating system. The method (AKR-tree) is compared with the existing IR-tree and IRS-tree in the prior art in the aspects of query efficiency and query effect.
IR-tree index structure: a combination of the spatial index R-tree and the inverted index InvertedFile, used for indexing spatial information and text keywords. The method of the invention differs from the IR-tree in that a keyword file TextFile and a numerical attribute file AttrFile are added, and a Skyline set of the numerical attribute tuples under each intermediate node is calculated, so that queries on numerical attributes are processed effectively. TextFile differs from InvertedFile as follows: the TextFile of a non-leaf node contains the set of all text keywords of its child nodes, and the TextFile of a leaf node contains the text keywords of its space objects; InvertedFile, by contrast, needs to index each keyword and mark all of its occurrences in the documents in order to speed up the query. Because TextFile does not require inverted files to be constructed for all nodes, some time can be saved in the index construction stage.
IRS-tree index structure: a hybrid index structure combining InvertedFile with a Synopses tree, which can search on several different numerical attributes simultaneously. However, IRS-tree-based search algorithms require an exact range of values to be provided, and exact matching on numerical attributes may result in few or no query results being returned.
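The two additions that distinguish the AKR-tree from the IR-tree, a per-node keyword set and a per-node Skyline of numerical attribute tuples, can be illustrated with a minimal Python sketch. It assumes that larger attribute values are preferred and uses a naive pairwise comparison; the function names and the toy data are hypothetical and are not taken from the patent's implementation.

```python
def dominates(t, u):
    """t dominates u: t >= u on every attribute and t > u on at least one."""
    return all(a >= b for a, b in zip(t, u)) and any(a > b for a, b in zip(t, u))


def skyline(tuples):
    """Keep the tuples that no other tuple dominates (naive O(n^2) scan)."""
    return [t for t in tuples if not any(dominates(u, t) for u in tuples)]


def node_textfile(child_keyword_sets):
    """Keyword set of a non-leaf node: the union of its children's keyword sets."""
    merged = set()
    for keywords in child_keyword_sets:
        merged |= keywords
    return merged


# Toy numerical attribute tuples, normalized to [0, 1] (made-up values).
attrs = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.8)]
node_skyline = skyline(attrs)          # (0.4, 0.4) is dominated by (0.5, 0.5)
node_keywords = node_textfile([{"tea", "cafe"}, {"cafe", "hotel"}])
print(node_skyline)
print(sorted(node_keywords))
```

Storing only the Skyline at each intermediate node means a query inspects the non-dominated tuples rather than every tuple in the subtree, at the cost of the pairwise comparisons needed to build it.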
TABLE 3 Default values for the parameters of the invention
It should be noted that, since the experimental results on Foursquare and Yelp do not differ much, this example only shows the experimental results on the Yelp data set. The experimental performance tests are mainly carried out from the following aspects:
(1) Influence of parameter k on query efficiency: the experiment sets k to 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 in turn to observe the influence of the number of query results on the query response time. FIG. 4 compares the query response times of the IR-tree, IRS-tree and AKR-tree on the Yelp data set for different numbers k of query results. As can be seen from FIG. 4, the query response time of all three algorithms grows as k increases, because more candidate objects must be retrieved through the index. The IR-tree has the shortest query response time because it considers neither numerical attributes nor keywords that are semantically related to the query keywords. The AKR-tree has a slightly longer query response time because the initial query keywords are semantically expanded and the numerical attributes are processed using the Skyline method. The IRS-tree index structure has the longest query response time, because it needs to be combined with other indexes to complete the query, which increases the query cost.
(2) Influence of |o.a| on query time: this experiment verifies the influence of the numerical attributes of space objects on the query response time by varying the number of numerical attributes. FIG. 5 compares the query response times of the IR-tree, IRS-tree and AKR-tree on the Yelp data set as the number of numerical attributes increases from 1 to 5. As can be seen from FIG. 5, the query response time gradually increases with the number of numerical attributes. This is because the AKR-tree structure needs to perform the Skyline calculation on the numerical attribute tuples in the query result, and in the worst case the Skyline method compares almost every element of every tuple, so the more numerical attributes there are, the more time is consumed. The IRS-tree is more time-consuming than the AKR-tree because it needs to consider the exact range of a numerical attribute when handling it, which can be computationally expensive if the value range of the attribute is large. Since the IR-tree cannot process numerical attributes, it is not compared here.
(3) Influence of |q.k| on query time: this experiment observes the effect on query response time as the number of query keywords grows from 1 to 8. FIG. 6 compares the query response times of the IR-tree, IRS-tree and AKR-tree on the Yelp data set for different numbers of query keywords. As can be seen from FIG. 6, the query response time increases in proportion to the number of query keywords. The reason is that, in any index structure, the more query keywords there are, the more objects containing those keywords need to be retrieved, and thus the query time increases. However, the query response times of the AKR-tree and the IR-tree are far shorter than that of the IRS-tree, and the query response time of the method of the invention is not much different from that of the IR-tree even though the method processes text and numerical attributes at the same time.
(4) Relationship between |D| and index construction time: this experiment compares the time the three algorithms take to construct their indexes. FIG. 7 compares the index construction times of the IR-tree, IRS-tree and AKR-tree on Yelp data sets of different sizes. As can be seen from FIG. 7, the index construction time is proportional to the size of the data set. The IR-tree index structure takes the least time to build because, unlike the AKR-tree and IRS-tree index structures, it does not need to build AttrFile files or a Synopses tree. The index construction time of the AKR-tree, however, is not much different from that of the IR-tree. The IRS-tree needs to combine the Synopses tree with other indexes to complete index construction, so its index construction time is the longest.
(5) Evaluation of the query effect:
FIG. 8 compares the user satisfaction with the query results obtained using the IR-tree, IRS-tree and AKR-tree on the Yelp data set for different numbers k of query results. The evaluation formula is as follows:
where I(q) is the set of the top-k ideal objects labeled by users as most relevant to the query q, and R(q) is the result set returned by each algorithm. As can be seen from FIG. 8, the accuracy of the AKR-tree of the present invention is 9.05% and 20.19% higher than that of the IR-tree and the IRS-tree, respectively.
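The evaluation formula itself is not reproduced in this text. One measure that is consistent with the definitions of I(q) and R(q) is the overlap ratio |I(q) ∩ R(q)| / |I(q)|; the Python sketch below uses that form purely as an assumption, and the object identifiers are made up for illustration.

```python
def overlap_precision(ideal, returned):
    """Fraction of the user-labeled ideal set I(q) that also appears in R(q).

    This overlap form is an assumed stand-in for the patent's evaluation formula.
    """
    ideal, returned = set(ideal), set(returned)
    if not ideal:
        return 0.0
    return len(ideal & returned) / len(ideal)


ideal_set = ["o1", "o3", "o5", "o7"]    # hypothetical user-labeled top-k objects
result_set = ["o1", "o2", "o3", "o7"]   # hypothetical results returned by an index
precision = overlap_precision(ideal_set, result_set)
print(precision)
```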
The invention uses word embedding technology to semantically expand the user's initial query, generating a series of query keywords that are semantically related to the initial query keywords. A spatial-semantic-numerical hybrid index structure, the AKR-tree, is then constructed; it supports text and semantic matching of query keywords simultaneously and processes numerical attributes using the Skyline method. Finally, the proposed index structure is used to quickly match objects that are semantically related to the spatial keyword query conditions, and the results are ranked by their comprehensive spatial-semantic-numerical relevance. Experimental analysis and results show that, compared with existing similar methods, the method yields higher user satisfaction with the query results, and the index structure has a fast query response time.
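The semantic expansion step summarized above can be sketched in Python as follows. The embedding vectors, the similarity threshold of 0.8 and the function names are all hypothetical; in practice the vectors would come from a trained word embedding model, and this passage does not specify how many related words are kept.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(query_keywords, embeddings, threshold=0.8):
    """Add every dictionary word whose similarity to a query keyword meets the threshold."""
    expanded = set(query_keywords)
    for kw in query_keywords:
        if kw not in embeddings:
            continue  # keyword has no embedding; keep it but do not expand it
        for word, vec in embeddings.items():
            if word != kw and cosine(embeddings[kw], vec) >= threshold:
                expanded.add(word)
    return expanded


# Toy two-dimensional embeddings (made-up values, for illustration only).
toy_embeddings = {"coffee": [1.0, 0.1], "tea": [0.9, 0.3], "hotel": [0.0, 1.0]}
expanded = expand_keywords(["coffee"], toy_embeddings)
print(sorted(expanded))
```

Precomputing the pairwise similarities into a table, as the method describes, would replace the inner loop with a lookup at query time.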
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
Claims (5)
1. A spatial keyword Top-K query method based on spatial-semantic-numerical relevance, characterized by comprising the following steps:
step 1: for each word in the text information of the space objects, generating a corresponding word embedding vector representation, then calculating the semantic similarity between each pair of words and storing it in a word semantic similarity table, which is used to expand the semantics of the query keywords;
step 2: constructing an AKR-tree hybrid index structure, with the following specific steps: step 2.1: generating an AKR-tree using the R-tree, wherein the information of each node of the AKR-tree is divided into three parts: the first two parts are two pointers, which point respectively to a file containing all keywords of the node and to a numerical attribute file, and the third part is the set of entries in the node;
step 2.2: generating a Skyline set of the numerical attribute tuples of the space objects under each intermediate node of the AKR-tree;
the rules for generating the Skyline set of numerical attribute tuples in step 2.2 are as follows:
suppose that the numerical attribute tuple set D corresponding to all space objects under a certain node in the AKR-tree has n tuples and m + 1 values; let A = {A_1, A_2, ..., A_m}, and let t[A_i] be the value of attribute A_i on tuple t;
assuming that, for each attribute, the values have a partial ordering relation under the dominance relation, one tuple t ∈ D dominates another tuple t' ∈ D, denoted t ≻ t', if and only if ∀A_i ∈ A, t[A_i] ≥ t'[A_i] and ∃A_j ∈ A, t[A_j] > t'[A_j];
one tuple t ∈ D is not comparable to another tuple t' ∈ D, denoted t ∼ t', if and only if neither t ≻ t' nor t' ≻ t holds;
step 3: for the spatial keyword query condition given by the user, first finding semantically related words in the semantic similarity table of step 1 to expand the range of query keywords; then quickly matching query results using the constructed AKR-tree hybrid index;
step 4: in the matching process, checking whether each branch node satisfies the spatial constraint of the query condition and, if the spatial constraint is satisfied, checking whether the TextFile of the node contains the query keywords;
and, for the matched nodes, calculating respectively the position similarity, semantic relevance and numerical proximity between the space objects in the Skyline set and the query condition, finally obtaining the comprehensive score of each matched result, and selecting the top-k final results according to the comprehensive score.
2. The spatial keyword Top-K query method based on spatio-semantic-numerical relevance as claimed in claim 1, wherein the specific steps of step 1 are as follows:
step 1.1: extracting the words in the text information of all space objects, removing stop words, forming a dictionary from all distinct words, and generating an embedded vector representation of each distinct word in the dictionary using word embedding technology;
step 1.2: based on the embedded vector representations of the words, calculating the semantic similarity between each pair of words using the Cosine similarity method, for use in the semantic expansion of the query keywords in the online query stage.
3. The spatial keyword Top-K query method based on spatio-semantic-numerical relevance as claimed in claim 1, wherein the specific steps of step 3 are as follows:
step 3.1: expanding a space keyword query condition, obtaining a node matched with the query condition by using an AKR-tree, and obtaining a space object in a Skyline set in the matched node as a candidate result set;
step 3.2: for each space object in the candidate result set, respectively calculating the position proximity, semantic relevance and numerical proximity of the space object to the query;
step 3.3: and calculating the comprehensive relevance score of the space object and the query, and selecting top-k final results according to the score.
4. The spatial key word Top-K query method based on spatio-semantic-numerical relevance according to claim 3, characterized in that the methods of calculating the positional proximity, semantic relevance and numerical proximity between query and spatial object in step 3.2 are respectively as follows:
(1) Location proximity between query and spatial object:
where d(q.λ, o.λ) is the Euclidean distance between the space object and the query in spatial position, and MaxD is the maximum Euclidean distance among all the space objects;
(2) Semantic relatedness between query and spatial object:
first, the text information of the space object and the keywords of the query are vectorized; the dimensionality m of the vectors is the number of all distinct words contained in the text information of the space object and in the query; these words are arranged in a fixed order to form a sequence, and for each word in the sequence, the corresponding position of the vector V_o (respectively V_q) is set to 1 if the word is contained in the space object (respectively the query), and to 0 otherwise; then, the Cosine similarity method is used to calculate the similarity between the two vectors, with the following calculation formula:
where V_o[i] and V_q[i] represent the i-th elements of the vectors V_o and V_q, respectively;
(3) The method for calculating the correlation degree of the numerical attribute between the space object and the query comprises the following steps:
where q.W is a set of weights for the different numerical attributes, representing the user's preference for those numerical attributes, with q.w_i ≥ 0 (i = 1, ..., |q.W|); o.a_i represents the value of the numerical attribute A_i of object o, normalized to [0,1];
(4) The position-semantic similarity calculation formula of the space object and the query is as follows:
S_LT(o,q) = α * S_Loc(o,q) + (1-α) * S_Text(o,q)
where α is an adjusting parameter.
5. The spatial keyword Top-K query method based on spatio-semantic-numerical relevance as claimed in claim 4, wherein the final composite score calculation formula of the spatial object and the query in the step 3.3 is as follows:
Score(o,q) = β * S_LT(o,q) + (1-β) * S_A(o,q)
where β is an adjusting parameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910657221.9A CN110362652B (en) | 2019-07-19 | 2019-07-19 | Space keyword Top-K query method based on space-semantic-numerical correlation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110362652A (en) | 2019-10-22 |
CN110362652B (en) | 2022-11-22 |
Family
ID=68221090
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910657221.9A Expired - Fee Related CN110362652B (en) | 2019-07-19 | 2019-07-19 | Space keyword Top-K query method based on space-semantic-numerical correlation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110362652B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112270199A (en) * | 2020-11-03 | 2021-01-26 | 辽宁工程技术大学 | CGAN (Carrier-grade network Access network) method based personalized semantic space keyword Top-K query method |
CN113158087B (en) * | 2021-04-09 | 2024-07-09 | 深圳前海微众银行股份有限公司 | Space text query method and device |
CN116341567B (en) * | 2023-05-29 | 2023-08-29 | 山东省工业技术研究院 | Interest point semantic labeling method and system based on space and semantic neighbor information |
CN117171802B (en) * | 2023-11-03 | 2024-01-12 | 中国科学技术信息研究所 | Strong privacy protection method and system for space keyword query |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104156548A (en) * | 2014-09-03 | 2014-11-19 | 哈尔滨华夏矿安科技有限公司 | Circuit switching method with CAD graph of downhole ventilation, fire avoiding and water conveying line as data source |
CN104731860A (en) * | 2015-02-04 | 2015-06-24 | 北京邮电大学 | Space keyword query method protecting privacy |
CN106610972A (en) * | 2015-10-21 | 2017-05-03 | 阿里巴巴集团控股有限公司 | Query rewriting method and apparatus |
CN108647213A (en) * | 2018-05-21 | 2018-10-12 | 辽宁工程技术大学 | A kind of composite key semantic relevancy appraisal procedure based on coupled relation analysis |
CN108804551A (en) * | 2018-05-21 | 2018-11-13 | 辽宁工程技术大学 | It is a kind of to take into account diversity and personalized space point of interest recommendation method |
CN109947904A (en) * | 2019-03-22 | 2019-06-28 | 东北大学 | A kind of preference space S kyline inquiry processing method based on Spark environment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7842926B2 (en) * | 2003-11-12 | 2010-11-30 | Micronic Laser Systems Ab | Method and device for correcting SLM stamp image imperfections |
Non-Patent Citations (3)
Title |
---|
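Towards Why-Not Spatial Keyword Top-k Queries: A Direction-Aware Approach; Lei Chen et al.; IEEE Transactions on Knowledge and Data Engineering; 2018-04-30; Vol. 30, No. 4; 796-809 * |
Top-k query and ranking method for spatial objects based on location-text relationship (基于位置-文本关系的空间对象top-k查询与排序方法); Meng Xiangfu et al.; 《智能系统学报》; 2020-03-31; Vol. 15, No. 2; 235-242 * |
Research on Skyline query methods in spatial databases (空间数据库中的Skyline查询方法研究); Li Shuang; 《中国优秀硕士学位论文全文数据库 信息科技辑》; 2019-01-15; I138-1803 * |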
Also Published As
Publication number | Publication date |
---|---|
CN110362652A (en) | 2019-10-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20221122 |