CN110362652B - Space keyword Top-K query method based on space-semantic-numerical correlation - Google Patents

Space keyword Top-K query method based on space-semantic-numerical correlation

Info

Publication number
CN110362652B
Authority
CN
China
Prior art keywords
query
semantic
space
numerical
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910657221.9A
Other languages
Chinese (zh)
Other versions
CN110362652A (en
Inventor
张霄雁
李盼
孙劲光
孟祥福
殷臣
杨昕悦
齐雪月
王丹丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University filed Critical Liaoning Technical University
Priority to CN201910657221.9A priority Critical patent/CN110362652B/en
Publication of CN110362652A publication Critical patent/CN110362652A/en
Application granted granted Critical
Publication of CN110362652B publication Critical patent/CN110362652B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/322Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/387Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a spatial keyword Top-K query method based on space-semantic-numerical relevance, which comprises the following steps: (1) semantically expanding the user's initial query using word embedding; (2) constructing a hybrid space-semantic-numerical index structure, the AKR-tree; and (3) calculating the space-semantic-numerical relevance. Word embedding is first used to expand the user's initial query into a series of query keywords that are semantically related to the initial keywords. The AKR-tree is then constructed; this index structure supports both textual and semantic matching of query keywords and handles numerical attributes with the Skyline method. Finally, the index is used to quickly match objects that are semantically related to the spatial keyword query, and the matches are ranked by their combined space-semantic-numerical relevance. Compared with existing methods of the same kind, the method yields query results with higher user satisfaction, and the index structure gives faster query response times.

Description

Space keyword Top-K query method based on space-semantic-numerical correlation
Technical Field
The invention belongs to the technical field of space keyword query and word embedding, and particularly relates to a space keyword Top-K query method based on space-semantic-numerical relevance.
Background
Existing spatial keyword query processing mainly takes two forms: top-k range queries and top-k nearest-neighbor queries. Both construct a result scoring function from the text similarity and location similarity between a spatial object and the spatial keyword query, and then use a hybrid text-spatial index to improve query efficiency. Existing hybrid indexes over spatial data and text information include the IR-tree and the IR²-tree, while index techniques for text search include inverted files, signature files, and bitmap indexes. However, these spatial-text index structures focus on the location proximity and text similarity between the spatial object and the query, and rarely consider the semantic relevance of the query results. A few recent works do study semantic matching for spatial keyword queries. Besides location and text information, however, spatial objects also carry numerical attributes such as price and user rating. Existing methods discretize these numerical attributes and then treat them as text attributes, but this cannot effectively compare numerical magnitudes or the containment relations between numerical intervals; processing numerical information in fact differs greatly from text matching. To the best of our knowledge, no existing work considers the combined relevance of a spatial object and a spatial keyword query in terms of location, semantics, and numerical values, and no hybrid index structure has been proposed that supports such combined queries.
In addition, with the widespread use of GPS and the rapid growth of spatial Web objects, spatial keyword queries are widely used in location-based services (LBS). Most existing spatial keyword query processing supports only location proximity and text similarity matching, and cannot return semantically related objects to users. Furthermore, current spatial-text hybrid index structures (e.g., IR-tree, IRS-tree, bR-tree, quad-tree) cannot yet handle the numerical attributes of spatial objects.
Disclosure of Invention
In view of the shortcomings of the prior art, the technical problem to be solved by the invention is to provide a spatial keyword Top-K query method based on space-semantic-numerical relevance that establishes a spatial keyword query processing mode fusing location, semantic, and numerical information, improves query efficiency through an effective hybrid index structure, and substantially improves both the user satisfaction of query results and the query response time.
In order to solve the technical problem, the invention is realized by the following technical scheme:
The invention provides a spatial keyword Top-K query method based on space-semantic-numerical relevance, which comprises the following steps:
Step 1: for each word in the text information of the spatial objects, generate a word embedding vector; then calculate the semantic similarity between each pair of words and store it in a word semantic similarity table, which is used to semantically expand the query keywords;
Step 2: construct the AKR-tree hybrid index structure;
Step 3: for the spatial keyword query given by the user, first find semantically related words in the semantic similarity table of step 1 to expand the set of query keywords; then use the constructed AKR-tree hybrid index to quickly match query results;
Step 4: during matching, check whether each branch node satisfies the spatial constraint of the query and, if it does, check whether the node's TextFile contains the query keywords;
for each matched node, calculate the location similarity, semantic relevance, and numerical proximity between the spatial objects in its Skyline set and the query, obtain the composite score of each matched result, and select the top-k final results by composite score.
Preferably, the specific steps of step 1 are as follows:
Step 1.1: extract the words from the text information of all spatial objects, remove stop words, build a dictionary from all distinct words, and use word embedding to generate an embedding vector for each distinct word in the dictionary;
Step 1.2: based on the embedding vectors, calculate the semantic similarity between each pair of words using cosine similarity; these similarities are used to semantically expand the query keywords in the online query stage.
The specific steps of step 2 are as follows:
Step 2.1: generate the AKR-tree on the basis of the R-tree, where the information in each AKR-tree node is divided into three parts: the first two parts are pointers to the file containing all keywords of the node and to the numerical attribute file, respectively, and the third part is the set of entries in the node;
Step 2.2: generate the Skyline set of the numerical attribute tuples of the spatial objects under each intermediate node of the AKR-tree.
Further, the specific steps of step 3 are as follows:
Step 3.1: expand the spatial keyword query, use the AKR-tree to obtain the nodes that match the query, and take the spatial objects in the Skyline sets of the matched nodes as the candidate result set;
Step 3.2: for each spatial object in the candidate result set, calculate its location proximity, semantic relevance, and numerical proximity to the query;
Step 3.3: calculate the composite relevance score of each spatial object with respect to the query, and select the top-k final results by score.
Optionally, the rule for generating the Skyline set of the value attribute tuples in step 2.2 is as follows:
Suppose that the set D of numerical attribute tuples corresponding to all spatial objects under a node of the AKR-tree has n tuples and m + 1 values. Let $A = \{A_1, A_2, \dots, A_m\}$, and let $t[A_i]$ denote the value of attribute $A_i$ in tuple $t$;
assume that, for each attribute, the values are partially ordered under the dominance relation. A tuple $t \in D$ dominates another tuple $t' \in D$, denoted $t \succ t'$, if and only if
$$\forall A_i \in A:\ t[A_i] \ge t'[A_i] \quad \text{and} \quad \exists A_i \in A:\ t[A_i] > t'[A_i];$$
a tuple $t \in D$ is incomparable with another tuple $t' \in D$, denoted $t \sim t'$, if and only if
$$t \not\succ t' \quad \text{and} \quad t' \not\succ t.$$
Optionally, the method for calculating the position closeness, semantic relatedness and numerical closeness between the query and the spatial object in step 3.2 is as follows:
(1) Location proximity between the query and a spatial object:
$$S_{Loc}(o,q) = 1 - \frac{D(q.\lambda, o.\lambda)}{MaxD}$$
where $D(q.\lambda, o.\lambda)$ is the Euclidean distance between the location of the spatial object and the query location, and MaxD is the maximum Euclidean distance among all spatial objects;
(2) Semantic relevance between the query and a spatial object:
first, the text information of the spatial object and the keywords of the query are vectorized. The vector dimension m is the number of distinct words contained in the text information of the spatial object and the query together. These words are arranged in a fixed order to form a sequence; for each word in the sequence, the corresponding position of $V_o$ (respectively $V_q$) is set to 1 if the spatial object (respectively the query) contains that word, and to 0 otherwise. The similarity between the two vectors is then calculated with cosine similarity:
$$S_{Text}(o,q) = \frac{\sum_{i=1}^{m} V_o[i]\, V_q[i]}{\sqrt{\sum_{i=1}^{m} V_o[i]^2}\ \sqrt{\sum_{i=1}^{m} V_q[i]^2}}$$
where $V_o[i]$ and $V_q[i]$ denote the i-th elements of the vectors $V_o$ and $V_q$, respectively;
(3) Numerical attribute relevance between the spatial object and the query:
[The formula for $S_A(o,q)$ is given as an image in the original publication and is not reproduced here.]
where q.W is the set of weights for the different numerical attributes, representing the user's preference for those attributes, with $q.w_i \ge 0$ ($i = 1, \dots, |q.W|$) and $\sum_{i=1}^{|q.W|} q.w_i = 1$; $o.a_i$ denotes the value of numerical attribute $A_i$ of object o, normalized to [0, 1]. It is assumed that smaller values of these numerical attributes are better (e.g., lower noise, lower price); if a higher value is better (e.g., ambience, rating), it can be converted by $a_i = 1 - a_i$;
(4) Position-semantic similarity between the spatial object and the query:
$$S_{LT}(o,q) = \alpha \cdot S_{Loc}(o,q) + (1-\alpha) \cdot S_{Text}(o,q)$$
where α is an adjustment parameter.
Optionally, in step 3.3, the final composite score of the spatial object with respect to the query is calculated as:
$$Score(o,q) = \beta \cdot S_{LT}(o,q) + (1-\beta) \cdot S_A(o,q)$$
where β is an adjustment parameter.
Therefore, the spatial keyword Top-K query method based on the spatial-semantic-numerical relevance realizes semantic expansion of spatial keyword query by utilizing Word embedding (Word embedding) technology, and improves query efficiency and support for text and numerical query by constructing AKR-tree mixed indexes and Skyline sets of numerical attribute tuples. Experimental results show that the algorithm provided by the invention can support semantic approximate query of space keywords, can process numerical attributes, has higher query efficiency, and improves the user satisfaction degree and the query efficiency of query results to a great extent.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following detailed description is given in conjunction with the preferred embodiments, together with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.
FIG. 1 is a block diagram of the spatial keyword Top-K query method based on spatial-semantic-numerical relevance according to the present invention;
FIG. 2 is a diagram of an AKR-tree index structure in an embodiment of the present invention;
FIG. 3 is a structural diagram of an AKR-tree index constructed by using the data in Table 1 in the embodiment of the present invention;
FIG. 4 is a comparison graph of query response times of the IR-tree, IRS-tree and AKR-tree used when the number k of query results is different in the Yelp data set according to the embodiment of the present invention;
FIG. 5 is a comparison graph of query response times used by an IR-tree, an IRS-tree, and an AKR-tree, when the number of numerical attributes is different in the Yelp data set in the embodiment of the present invention;
FIG. 6 is a comparison chart of query response times used by IR-tree, IRS-tree and AKR-tree when the number of query keywords is different in the Yelp data set in the embodiment of the present invention;
FIG. 7 is a time comparison chart of index structures of IR-tree, IRS-tree and AKR-tree constructed on Yelp data sets with different data sizes according to the embodiment of the present invention;
FIG. 8 is a comparison chart of query accuracy rates obtained by using an IR-tree, an IRS-tree, and an AKR-tree when the number k of query results is different in the Yelp data set in the embodiment of the present invention.
Detailed Description
Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.
The invention relates to a space keyword Top-K query method based on space-semantic-numerical correlation, which is mainly applied to the fields of current popular Location Based Service (LBS) systems and space interest point recommendation, and the overall processing flow is as follows:
(1) Semantic expansion is carried out on the initial query of the user by using Word Embedding technology, and a series of query keywords related to the initial query keyword semantics are generated.
(2) A space-semantic-numerical value mixed index structure AKR-tree is constructed, the index structure can simultaneously support text and semantic matching of query keywords, and numerical attributes are processed by using a Skyline method.
(3) And quickly matching objects related to the space keyword query condition semanteme by using the provided index structure, and sequencing according to the comprehensive relevance of space-semanteme-numerical value.
The block diagram of the solution of the semantic approximate query method for the space keywords top-k is shown in FIG. 1. The specific implementation of the present invention and the results of each significant phase are described below in conjunction with the data and queries of Table 1.
Table 1. Location, semantic, and numerical information of the spatial objects, with an example spatial keyword query
[Table 1 is provided as images in the original publication and is not reproduced here.]
Some relevant definitions to which the method of the invention relates are as follows:
Given a spatial data set $O = \{o_1, o_2, \dots, o_n\}$, each spatial object $o_i$ is a triple (λ, K, A), where $o_i.\lambda$ is the location of $o_i$ (a two-dimensional spatial object is usually represented by latitude and longitude), $o_i.K$ is the set of text keywords of $o_i$, and $o_i.A$ is the set of numerical attributes of $o_i$. Note that each value $o.a_i$ in $o_i.A$ is normalized to [0, 1], and smaller values of these numerical attributes are assumed to be better (e.g., lower noise, lower price); if a higher value is better (e.g., ambience, rating), it can be converted by $a_i = 1 - a_i$. A spatial keyword query q is a triple (λ, K, W), where q.λ is the query location, q.K is the set of query keywords, and q.W is the set of the user's preference weights for the different numerical attributes ($q.w_i \ge 0$ for $i = 1, \dots, |q.W|$, and $\sum_{i=1}^{|q.W|} q.w_i = 1$).
Step 1: For each word in the text information of the spatial objects, generate a word embedding vector; then calculate the semantic similarity between each pair of words and store it in a word semantic similarity table for semantic expansion of the query keywords.
Step 1.1: Generate an embedding vector for each word using word embedding. For example, Table 1 shows sample data from the Yelp dataset; applying the method of the invention to this sample yields embedding vectors for the words "restaurants" and "tea".
[The embedding vectors for "Restaurants" and "Tea" are provided as images in the original publication and are not reproduced here.]
In the word embedding generation of step 1.1, all words are read from the text descriptions of all spatial objects, and a dictionary (vocabulary) is created by removing stop words and keeping the words with higher frequency; an <UNK> token is added to the dictionary to represent words that do not appear in it. The invention implements word embedding with PyTorch to generate the embedding vector of each word in the dictionary, with the following hyperparameters: the number of negative samples K is 100, the number of surrounding (context) words C is 3, the number of training epochs NUM_EPOCH is 2, the batch size BATCH_SIZE is 128, the vocabulary size VOCAB_SIZE is 30000, the learning rate LEARNING_RATE is 1e-4, and the word vector dimension EMBEDDING_SIZE is 100. The model is optimized with the Adam optimizer.
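The dictionary construction described above can be sketched in plain Python. This is only an illustration: the stop-word list, the whitespace tokenizer, and the names `build_vocabulary` and `encode` are assumptions and do not come from the patent, which does not publish its preprocessing code.

```python
from collections import Counter

# Hypothetical stop-word list; the source does not say which list it uses.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "with"}
UNK = "<UNK>"


def build_vocabulary(texts, vocab_size):
    """Map the most frequent non-stop words, plus <UNK>, to integer indices."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in STOP_WORDS
    )
    # Reserve one slot for <UNK>, which stands for every out-of-vocabulary word.
    kept = [word for word, _ in counts.most_common(vocab_size - 1)]
    return {word: index for index, word in enumerate(kept + [UNK])}


def encode(text, vocab):
    """Turn a text into word indices, falling back to <UNK> for unknown words."""
    return [
        vocab.get(word, vocab[UNK])
        for word in text.lower().split()
        if word not in STOP_WORDS
    ]
```

The resulting index sequences are what an embedding model (for example a skip-gram model with negative sampling, which the listed hyperparameters K and C suggest) would consume during training.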
Step 1.2: Calculate the semantic similarity between each pair of words using cosine similarity. For example, for restaurant and tea, the similarity is 1.1146.
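A minimal sketch of the similarity table and of keyword expansion follows. The patent does not state how related words are selected from the table, so the similarity `threshold` used here is an assumption, as are the function names and the toy two-dimensional embeddings in the usage note.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_similarity_table(embeddings):
    """Precompute the similarity of every unordered word pair (offline stage)."""
    words = sorted(embeddings)
    table = {}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            table[(w1, w2)] = cosine(embeddings[w1], embeddings[w2])
    return table


def expand_keywords(keywords, table, threshold):
    """Add every word whose similarity to a query keyword reaches the threshold."""
    expanded = set(keywords)
    for (w1, w2), similarity in table.items():
        if similarity >= threshold:
            if w1 in keywords:
                expanded.add(w2)
            if w2 in keywords:
                expanded.add(w1)
    return expanded
```

With hypothetical embeddings `{"tea": [1.0, 0.0], "coffee": [0.9, 0.1], "hotel": [0.0, 1.0]}` and a threshold of 0.8, the query keyword set `{"tea"}` expands to `{"tea", "coffee"}`.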
Step 2: Construct the AKR-tree hybrid index structure, as shown in FIG. 2.
Step 2.1: Generate the AKR-tree on the basis of the R-tree. The information in each AKR-tree node is divided into three parts: the first two parts are pointers to the file containing all keywords of the node (TextFile) and to the numerical attribute file (AttrFile), respectively, and the third part is the set of entries (Entries) in the node. The AKR-tree is built bottom-up. In a leaf node, each entry is a quadruple <o, rect, o.tid, o.aid>, where o is a spatial object, rect is the minimum bounding rectangle (MBR) of the object, o.tid is the identifier of the object's text information, and o.aid is the identifier of the object's numerical attribute tuple. In a non-leaf node, each entry is also a quadruple, <pN, rect, N.pid, N.aid>, where pN is the address of a child node N, rect is the MBR that encloses all child nodes under the node, N.pid is the document identifier of the node, whose document contains a summary of the text information (i.e., the extracted set of text keywords) of all child nodes under the node, and N.aid is the identifier of the node's numerical attribute information, which contains the Skyline set of the numerical attribute tuples of all child nodes under the node.
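The node layout described in step 2.1 can be pictured with a few Python dataclasses. This is a sketch only: the class and field names are hypothetical, and the TextFile and AttrFile, which the patent describes as files reached through pointers, are modeled here as in-memory collections for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple, Union

# Rectangle as (min_x, min_y, max_x, max_y); this layout is an assumption.
Rect = Tuple[float, float, float, float]


@dataclass
class LeafEntry:
    """Leaf entry <o, rect, o.tid, o.aid>."""
    obj_id: str
    rect: Rect
    tid: int  # identifier of the object's text information
    aid: int  # identifier of the object's numerical attribute tuple


@dataclass
class NodeEntry:
    """Non-leaf entry <pN, rect, N.pid, N.aid>."""
    child: "Node"
    rect: Rect
    pid: int  # identifier of the keyword summary document of the child
    aid: int  # identifier of the child's Skyline set


@dataclass
class Node:
    """AKR-tree node: TextFile, AttrFile, and the entry set."""
    text_file: Set[str] = field(default_factory=set)
    attr_file: List[Tuple[float, ...]] = field(default_factory=list)
    entries: List[Union[LeafEntry, NodeEntry]] = field(default_factory=list)


def union_rect(rects):
    """Smallest rectangle enclosing all given rectangles (the parent's MBR)."""
    min_xs, min_ys, max_xs, max_ys = zip(*rects)
    return (min(min_xs), min(min_ys), max(max_xs), max(max_ys))
```

When the tree is built bottom-up, a parent entry's `rect` would be `union_rect` over the rectangles of the child's entries, and the parent's `text_file` would be the union of the keyword sets below it.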
The AKR-tree built from the data of Table 1 is shown in FIG. 3. Node $N_1$ contains $o_{18}, o_5, o_{11}, o_2, o_9, o_{14}, o_{24}, o_{21}, o_{20}, o_{16}, o_{26}, o_{25}, o_{13}, o_6$. Node $N_2$ contains $o_7, o_8, o_{27}, o_{38}, o_{10}, o_{19}, o_{17}, o_{37}, o_3, o_{32}, o_{23}, o_1, o_4, o_{40}, o_{12}, o_{22}, o_{42}, o_{44}, o_{46}, o_{47}, o_{49}, o_{50}$. Node $N_3$ contains $o_{29}, o_{41}, o_{36}, o_{30}, o_{39}, o_{35}, o_{31}, o_{34}, o_{28}, o_{33}, o_{43}, o_{45}, o_{48}$. The AttrFile of $N_1$ is {[0.06, 0.58], [0.17, 0.0]}, the AttrFile of $N_2$ is {[0.07, 0.92], [0.13, 0.21], [0.15, 0.18]}, the AttrFile of $N_3$ is {[0.03, 0.01]}, and the AttrFile of $N_4$ is {[0.03, 0.01], [0.17, 0.01]}.
Step 2.2: Generate the Skyline sets. Suppose that the set D of numerical attribute tuples corresponding to all spatial objects under a node of the AKR-tree has n tuples and m + 1 values. Let $A = \{A_1, A_2, \dots, A_m\}$, and let $t[A_i]$ denote the value of attribute $A_i$ in tuple $t$. Assume that, for each attribute, the values are partially ordered under the dominance relation (e.g., $a \succ b$ indicates that value a is better than value b). A tuple $t \in D$ dominates another tuple $t' \in D$, denoted $t \succ t'$, if and only if
$$\forall A_i \in A:\ t[A_i] \ge t'[A_i] \quad \text{and} \quad \exists A_i \in A:\ t[A_i] > t'[A_i].$$
In addition, a tuple $t \in D$ is incomparable with another tuple $t' \in D$, denoted $t \sim t'$, if and only if
$$t \not\succ t' \quad \text{and} \quad t' \not\succ t.$$
Based on the data in Table 1, the Skyline sets of numerical attribute tuples generated for the intermediate nodes of the AKR-tree are shown in Table 2.
Table 2. Skyline set corresponding to each AKR-tree node generated from the data in Table 1
[Table 2 is provided as an image in the original publication and is not reproduced here.]
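The Skyline set of a node can be computed with a simple pairwise dominance scan. The sketch below assumes that smaller normalized values are better on every attribute, as the definitions state, so "at least as good" becomes `<=` on the raw numbers; the quadratic scan and the function names are illustrative choices, not the patent's algorithm.

```python
def dominates(t, u):
    """True if t dominates u: no worse on every attribute and strictly better
    on at least one, where a smaller normalized value counts as better."""
    no_worse = all(a <= b for a, b in zip(t, u))
    strictly_better = any(a < b for a, b in zip(t, u))
    return no_worse and strictly_better


def skyline(tuples):
    """Return the tuples that no other tuple dominates (naive O(n^2) scan)."""
    return [t for t in tuples if not any(dominates(u, t) for u in tuples)]
```

For the hypothetical input `[(0.06, 0.58), (0.17, 0.0), (0.20, 0.60), (0.17, 0.30)]`, the last two tuples are dominated and the result is `[(0.06, 0.58), (0.17, 0.0)]`. These two tuples are mutually incomparable, which is why both remain in the set.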
Step 3: For the spatial keyword query given by the user, first find semantically related words in the semantic similarity table of step 1 to expand the set of query keywords; then use the constructed AKR-tree hybrid index to quickly match query results. During matching, check whether each branch node satisfies the spatial constraint of the query and, if it does, check whether the node's TextFile contains the query keywords. For each matched node, calculate the location similarity, semantic relevance, and numerical proximity between the spatial objects in its Skyline set and the query, obtain the composite score of each matched result, and select the top-k final results by composite score. The specific steps are as follows:
Step 3.1: Expand the spatial keyword query, use the AKR-tree to obtain the nodes that match the query, and take the spatial objects in the Skyline sets of the matched nodes as the candidate result set. For query q in Table 1, the top 10 matching results in Table 1 are: $o_{11}, o_{13}, o_{16}, o_5, o_{26}, o_{15}, o_{21}, o_9, o_{20}, o_{18}$.
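The matching pass can be sketched as a recursive descent that prunes on the spatial check first and the keyword check second. Several details are assumptions, because the patent does not spell them out: the spatial constraint is modeled as a query rectangle, a node matches when its TextFile shares at least one word with the expanded keyword set, candidates are collected from the Skyline sets of matched leaf-level nodes, and nodes are plain dictionaries with hypothetical keys.

```python
def intersects(a, b):
    """True if rectangles (min_x, min_y, max_x, max_y) a and b overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def match(node, query_rect, keywords):
    """Collect candidate objects from the Skyline sets of matching nodes."""
    # Prune the whole subtree if the node lies outside the query region.
    if not intersects(node["rect"], query_rect):
        return []
    # Prune if the node's keyword summary shares no word with the query.
    if not (node["text_file"] & keywords):
        return []
    if not node["children"]:
        return list(node["skyline"])
    candidates = []
    for child in node["children"]:
        candidates.extend(match(child, query_rect, keywords))
    return candidates
```

Because a parent's TextFile summarizes all keywords below it, a failed keyword check at a branch node rules out every descendant without visiting it.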
Step 3.2: for each spatial object in the candidate result set, its position proximity, semantic relevance, and numerical proximity to the query q are calculated separately. The methods of calculating the positional, semantic, and numerical closeness between the query q and the spatial object o are as follows, respectively:
(1) Location proximity between query q and spatial object o:
$$S_{Loc}(o,q) = 1 - \frac{D(q.\lambda, o.\lambda)}{MaxD} \tag{1}$$
where $D(q.\lambda, o.\lambda)$ is the Euclidean distance between the location of spatial object o and the location of query q, and MaxD is the maximum Euclidean distance among all spatial objects.
(2) Semantic relevance between query q and spatial object o:
The basic idea is to first vectorize the text information of spatial object o and the keywords of query q (including the expanded words). The vector dimension m is the number of distinct words contained in the text information of o and in q together. These words are arranged in a fixed order to form a sequence; for each word in the sequence, the corresponding position of $V_o$ (respectively $V_q$) is set to 1 if o (respectively q) contains that word, and to 0 otherwise. The similarity between the two vectors is then calculated with cosine similarity:
$$S_{Text}(o,q) = \frac{\sum_{i=1}^{m} V_o[i]\, V_q[i]}{\sqrt{\sum_{i=1}^{m} V_o[i]^2}\ \sqrt{\sum_{i=1}^{m} V_q[i]^2}} \tag{2}$$
where $V_o[i]$ and $V_q[i]$ denote the i-th elements of the vectors $V_o$ and $V_q$, respectively;
(3) Numerical attribute relevance between spatial object o and query q:
[Equation (3), the formula for $S_A(o,q)$, is given as an image in the original publication and is not reproduced here.]
where q.W is the set of weights for the different numerical attributes, representing the user's preference for those attributes, with $q.w_i \ge 0$ ($i = 1, \dots, |q.W|$) and $\sum_{i=1}^{|q.W|} q.w_i = 1$; $o.a_i$ denotes the value of numerical attribute $A_i$ of object o, normalized to [0, 1].
(4) Position-semantic similarity between spatial object o and query q:
$$S_{LT}(o,q) = \alpha \cdot S_{Loc}(o,q) + (1-\alpha) \cdot S_{Text}(o,q) \tag{4}$$
where α is an adjustment parameter, set to 0.5.
For example, for the matching result $o_{11}$ of query q in Table 1, the location proximity, semantic relevance, and numerical proximity to q computed by equations (1) to (3) are 0.9374, 0.1889, and 0.5454, respectively.
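The first two measures can be sketched as follows. The location proximity is written in the common form 1 - D/MaxD, which is an assumption here; the text relevance follows the binary-vector cosine described in item (2). Function names and the sample inputs in the usage note are hypothetical, and the numerical measure is omitted because its formula is not available in this text.

```python
import math


def location_proximity(q_loc, o_loc, max_d):
    """Assumed form 1 - D(q, o) / MaxD, with D the Euclidean distance."""
    return 1.0 - math.dist(q_loc, o_loc) / max_d


def text_relevance(object_words, query_words):
    """Cosine similarity of the binary word-presence vectors V_o and V_q."""
    object_words, query_words = set(object_words), set(query_words)
    # Fix an order over all distinct words of the object and the query.
    sequence = sorted(object_words | query_words)
    v_o = [1 if word in object_words else 0 for word in sequence]
    v_q = [1 if word in query_words else 0 for word in sequence]
    dot = sum(a * b for a, b in zip(v_o, v_q))
    # For 0/1 vectors the sum of squares equals the plain sum.
    norm = math.sqrt(sum(v_o)) * math.sqrt(sum(v_q))
    return dot / norm if norm else 0.0
```

For instance, an object described by `{"tea", "cafe", "cheap"}` and an expanded query `{"tea", "coffee"}` share one word out of vector norms sqrt(3) and sqrt(2), giving a relevance of 1/sqrt(6), about 0.408.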
Step 3.3: Calculate the composite relevance score of spatial object o with respect to query q, and select the top-k final results by score. The final composite score of o with respect to q is calculated as:
$$Score(o,q) = \beta \cdot S_{LT}(o,q) + (1-\beta) \cdot S_A(o,q) \tag{5}$$
where β is an adjustment parameter.
For example, based on the data in Table 1, the composite relevance score of spatial object $o_{11}$ with respect to query q obtained by equation (5) is 0.5607, with β in equation (5) set to 0.86.
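The arithmetic of this example can be checked directly from the three component values reported for the same object (0.9374, 0.1889, 0.5454) with α = 0.5 and β = 0.86. The function names and the `top_k` helper are illustrative, not part of the patent.

```python
import heapq


def position_semantic_score(s_loc, s_text, alpha=0.5):
    """Equation (4): weighted mix of location proximity and text relevance."""
    return alpha * s_loc + (1 - alpha) * s_text


def composite_score(s_loc, s_text, s_attr, alpha=0.5, beta=0.86):
    """Equation (5): weighted mix of the position-semantic and numerical scores."""
    s_lt = position_semantic_score(s_loc, s_text, alpha)
    return beta * s_lt + (1 - beta) * s_attr


def top_k(scored_objects, k):
    """Return the k (object, score) pairs with the highest composite scores."""
    return heapq.nlargest(k, scored_objects, key=lambda pair: pair[1])


score = composite_score(0.9374, 0.1889, 0.5454)
print(round(score, 4))  # 0.5607
```

Step by step: S_LT = 0.5 × 0.9374 + 0.5 × 0.1889 = 0.56315, and Score = 0.86 × 0.56315 + 0.14 × 0.5454 = 0.560665, which rounds to the reported 0.5607.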
In order to further test the effect and performance of the method, a Foursquare data set and a Yelp data set are selected in the example, which shows the query accuracy and the query efficiency (namely the query response time) of the method. The Yelp is a famous merchant commenting website in the United states, and the website comprises merchant information in various fields such as restaurants, shopping centers and hotels in various regions and information such as user evaluation and check-in time; the real POI data are processed into 174567 points of interest, each point of interest has an ID, position information (represented in the form of latitude and longitude), text information and numerical attribute information, the position information is used as spatial information, user comment information, name, city and category are used as text information, and randomly generated 5 random numbers between 0 and 1 are used as numerical attribute information.
After data cleaning, the data set contains 215614 spatial objects associated with geographic locations, together with a keyword list describing each spatial object and the normalized values of its numerical attributes; that is, each spatial object has longitude and latitude information, keyword information, and four numerical attributes (price, environment, service, and rating).
The test results of the method of the present invention on the Yelp and Foursquare data sets, in terms of query efficiency and query accuracy, are given below. Table 3 gives the default values of the parameters of the method. In each experiment, the influence of one parameter on the results is studied by varying that parameter while keeping the others fixed. All experiments are implemented in Python on a computer with a 2.5 GHz i7-4710HQ CPU, 8 GB of RAM, and the Windows 10 operating system. The method of the invention (AKR-tree) is compared with the existing IR-tree and IRS-tree in terms of query efficiency and query effectiveness.
IR-tree index structure: a combination of the spatial index R-tree and the inverted index InvertedFile, used to index spatial information and text keywords. The method of the invention differs from the IR-tree in that it adds a keyword file TextFile and a numerical attribute file AttrFile, and computes the Skyline set of the numerical-attribute tuples under each intermediate node, so that queries on numerical attributes are processed effectively. TextFile differs from InvertedFile as follows: the TextFile of a non-leaf node contains the set of all text keywords of its child nodes, and the TextFile of a leaf node contains the text keywords of its spatial objects; an InvertedFile must index every keyword and record all of its occurrences in the documents in order to speed up queries. Because TextFile does not require inverted files to be built for all nodes, some time is saved in the index construction stage.
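The way a TextFile aggregates keywords up the tree can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, its field names, and the `may_contain` pruning helper are hypothetical and model nothing but the keyword part of a node; the actual on-disk layout of TextFile is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; only the keyword-related parts are modelled."""
    children: list = field(default_factory=list)         # child nodes (empty for a leaf)
    object_keywords: list = field(default_factory=list)  # keyword sets of objects in a leaf
    text_file: set = field(default_factory=set)          # stands in for the node's TextFile


def build_text_file(node):
    """Fill text_file bottom-up: a leaf collects its objects' keywords,
    a non-leaf node takes the union of its children's keyword sets."""
    if not node.children:
        node.text_file = set().union(*node.object_keywords)
    else:
        node.text_file = set().union(*(build_text_file(c) for c in node.children))
    return node.text_file


def may_contain(node, query_keywords):
    """A subtree can be skipped when its TextFile shares no keyword with the query."""
    return bool(node.text_file & set(query_keywords))


leaf_a = Node(object_keywords=[{"coffee", "bakery"}, {"coffee", "wifi"}])
leaf_b = Node(object_keywords=[{"pizza"}])
root = Node(children=[leaf_a, leaf_b])
build_text_file(root)
print(may_contain(leaf_b, ["coffee"]))  # False
print(may_contain(root, ["coffee"]))    # True
```

Because each node stores only a keyword set rather than per-keyword posting lists, construction is a single bottom-up union, which is consistent with the time saving claimed for the index construction stage.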
IRS-tree index structure: a hybrid index structure that combines InvertedFile with a Synopses tree and can search several different numerical attributes at the same time. However, IRS-tree-based search algorithms require an exact value range to be provided, and exact matching on numerical attributes may cause few or no query results to be returned.
TABLE 3 Default values for the parameters of the invention
Figure BDA0002137218780000161
It should be noted that, because the experimental results on Foursquare and Yelp do not differ greatly, this example shows only the experimental results on the Yelp data set. The performance tests cover the following aspects:
(1) Influence of parameter k on query efficiency: in this experiment, k is set to 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 in turn to observe the influence of the number of query results on the query response time. FIG. 4 compares the query response times of the IR-tree, IRS-tree, and AKR-tree on the Yelp data set for different numbers of query results k. As can be seen from FIG. 4, the query response time of all three algorithms grows as k increases, because more candidates must be retrieved from the index. The IR-tree has the shortest query response time because it considers neither numerical attributes nor keywords that are semantically related to the query keywords. The AKR-tree has a slightly longer query response time because the initial query keywords are semantically expanded and the numerical attributes are processed with the Skyline method. The IRS-tree index structure has the longest query response time, because it must be combined with other indexes to complete the query, which increases the query cost.
(2) Influence of |o.a| on query time: this experiment verifies the influence of the numerical attributes of spatial objects on the query response time by varying the number of numerical attributes per spatial object. FIG. 5 compares the query response times of the IRS-tree and AKR-tree on the Yelp data set as the number of numerical attributes increases from 1 to 5. As can be seen from FIG. 5, the query response time gradually increases with the number of numerical attributes. This is because the AKR-tree structure must perform a Skyline computation on the numerical-attribute tuples in the query result, and in the worst case the Skyline method compares almost every element of every tuple, so more numerical attributes mean more time. The IRS-tree is more time-consuming than the AKR-tree because it must consider the exact range of each numerical attribute when processing it, which is computationally expensive when the value range of the attribute is large. Since the IR-tree cannot process numerical attributes, it is not included in this comparison.
(3) Influence of |q.k| on query time: this experiment increases the number of query keywords from 1 to 8 and observes the effect on the query response time. FIG. 6 compares the query response times of the IR-tree, IRS-tree, and AKR-tree on the Yelp data set for different numbers of query keywords. As can be seen from FIG. 6, the query response time increases in proportion to the number of query keywords. The reason is that, for any index structure, more query keywords mean that more objects containing those keywords must be retrieved, so the query time increases. However, the query response times of the AKR-tree and the IR-tree are far shorter than that of the IRS-tree, and the query response time of the method of the invention differs little from that of the IR-tree even though the method processes text and numerical attributes at the same time.
(4) Relationship between |D| and index construction time: this experiment compares the time the three algorithms take to construct their indexes. FIG. 7 compares the index construction times of the IR-tree, IRS-tree, and AKR-tree on Yelp data sets of different sizes. As can be seen from FIG. 7, the index construction time is proportional to the size of the data set. The IR-tree index structure takes the least time to build because, unlike the AKR-tree and IRS-tree index structures, it does not need to build AttrFile files or synopses; the index construction time of the AKR-tree, however, differs little from that of the IR-tree. The IRS-tree has the longest index construction time because it must combine the Synopses tree with other indexes to complete index construction.
(5) Evaluation of query effectiveness:
FIG. 8 compares the user satisfaction with the query results obtained using the IR-tree, IRS-tree, and AKR-tree on the Yelp data set for different values of the number of query results k. The evaluation formula is as follows:
Figure BDA0002137218780000181
where I(q) is the set of top-k ideal objects labeled by users as most relevant to the query q, and R(q) is the result set returned by each algorithm. As can be seen from FIG. 8, the accuracy of the AKR-tree of the present invention is 9.05% and 20.19% higher than that of the IR-tree and the IRS-tree, respectively.
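An overlap-based measure of this kind can be sketched in Python as follows. The exact evaluation formula is available only as an image in this text, so the form used here, $|I(q) \cap R(q)| / |R(q)|$, is an assumption chosen to match the definitions of I(q) and R(q); the function name and the object identifiers are hypothetical.

```python
def result_precision(ideal, returned):
    """Fraction of returned objects that also appear in the user-labelled ideal set.

    Assumed form: |I(q) & R(q)| / |R(q)|. This is an illustrative choice,
    not necessarily the formula used in the evaluation.
    """
    returned = set(returned)
    if not returned:
        return 0.0
    return len(set(ideal) & returned) / len(returned)


ideal_set = {"o1", "o2", "o3", "o4"}   # I(q): objects labelled most relevant
result_set = {"o2", "o3", "o7", "o9"}  # R(q): objects returned by an algorithm
print(result_precision(ideal_set, result_set))  # 0.5
```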
The invention uses Word Embedding technology to semantically expand the user's initial query, generating a series of query keywords that are semantically related to the initial query keywords. It then constructs a spatial-semantic-numerical hybrid index structure, the AKR-tree, which supports both text matching and semantic matching of query keywords and processes numerical attributes with the Skyline method. Finally, it uses the proposed index structure to quickly match objects that are semantically related to the spatial keyword query condition and ranks them by their composite spatial-semantic-numerical relevance. Experimental analysis and results show that, compared with existing similar methods, the method achieves higher user satisfaction with the query results, and the index structure achieves a faster query response time.
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (5)

1. A spatial keyword Top-K query method based on spatial-semantic-numerical relevance, characterized by comprising the following steps:
step 1: for each word in the text information of the spatial objects, generating a corresponding word embedding vector representation, then calculating the semantic similarity between each pair of words, storing it in a word semantic similarity table, and using it to semantically expand the query keywords;
step 2: constructing an AKR-tree hybrid index structure, with the following specific steps:
step 2.1: generating the AKR-tree from an R-tree, wherein the information of each node of the AKR-tree is divided into three parts: the first two parts are two pointers, which point respectively to a file containing all keywords of the node and to a numerical attribute file, and the third part is the set of entries in the node;
step 2.2: generating the Skyline set of the numerical-attribute tuples of the spatial objects under each intermediate node of the AKR-tree;
the rules for generating the Skyline set of numerical-attribute tuples in step 2.2 are as follows:
suppose that the set D of numerical-attribute tuples corresponding to all spatial objects under a given node of the AKR-tree has n tuples and m + 1 values; let $A = \{A_1, A_2, \ldots, A_m\}$, and let $t[A_i]$ denote the value of tuple t on attribute $A_i$;
assuming that, for each attribute, the values are partially ordered under the dominance relation, a tuple t ∈ D dominates another tuple t' ∈ D, denoted $t \succ t'$, if and only if $\forall A_i \in A,\ t[A_i] \ge t'[A_i]$ and $\exists A_i \in A,\ t[A_i] > t'[A_i]$;
a tuple t ∈ D is incomparable with another tuple t' ∈ D, denoted $t \sim t'$, if and only if $t \nsucc t'$ and $t' \nsucc t$;
step 3: for the spatial keyword query condition given by the user, first finding semantically related words in the semantic similarity table of step 1 to expand the range of query keywords, and then using the constructed AKR-tree hybrid index to quickly match query results;
step 4: during matching, checking whether each branch node satisfies the spatial constraint of the query condition and, if the spatial constraint is satisfied, checking whether the TextFile of the node contains the query keywords;
for the matched nodes, calculating the position proximity, semantic relevance, and numerical proximity between the spatial objects in the Skyline set and the query condition, obtaining the composite score of each matching result, and selecting the top-k final results according to the composite score.
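The dominance rule of step 2.2 can be illustrated with a short Python sketch. It is a naive pairwise implementation written for clarity rather than efficiency, and it assumes that larger attribute values are preferred, as the ≥ and > comparisons in the rule indicate; the function names and sample tuples are hypothetical.

```python
def dominates(t, u):
    """t dominates u: t is >= u on every attribute and > u on at least one."""
    return all(a >= b for a, b in zip(t, u)) and any(a > b for a, b in zip(t, u))


def skyline(tuples):
    """Return the tuples that no other tuple dominates (naive pairwise comparison)."""
    return [t for t in tuples if not any(dominates(u, t) for u in tuples)]


# Made-up numerical-attribute tuples with two attributes each.
attrs = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9), (0.9, 0.1)]
print(skyline(attrs))  # [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9)]
```

Here (0.4, 0.4) is dominated by (0.5, 0.5) and (0.9, 0.1) is dominated by (0.9, 0.2), while the three remaining tuples are pairwise incomparable.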
2. The spatial keyword Top-K query method based on spatio-semantic-numerical relevance as claimed in claim 1, wherein the specific steps of step 1 are as follows:
step 1.1: extracting the words in the text information of all spatial objects, removing stop words, forming a dictionary from all distinct words, and generating an embedding vector representation for each distinct word in the dictionary using Word Embedding technology;
step 1.2: based on the embedding vector representations of the words, calculating the semantic similarity between each pair of words using the Cosine similarity method, for use in the semantic expansion of query keywords in the online query stage.
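Steps 1.1 and 1.2 and the later expansion of query keywords can be sketched as follows. This is a minimal illustration, not the patented implementation: the two-dimensional vectors are toy stand-ins for learned word embeddings, and the similarity `threshold` used to decide which words are added is a hypothetical parameter that the text does not specify.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_table(embeddings):
    """Offline step: cosine similarity for every unordered pair of dictionary words."""
    table = {}
    for w1, w2 in combinations(sorted(embeddings), 2):
        table[(w1, w2)] = cosine(embeddings[w1], embeddings[w2])
    return table


def expand_query(keywords, table, threshold=0.8):
    """Online step: add words whose similarity to a query keyword reaches the threshold."""
    keywords = set(keywords)
    expanded = set(keywords)
    for (w1, w2), sim in table.items():
        if sim >= threshold:
            if w1 in keywords:
                expanded.add(w2)
            if w2 in keywords:
                expanded.add(w1)
    return expanded


# Toy vectors standing in for learned word embeddings.
toy = {"cafe": (1.0, 0.1), "coffee": (0.9, 0.2), "gym": (0.1, 1.0)}
table = build_similarity_table(toy)
print(sorted(expand_query({"coffee"}, table)))  # ['cafe', 'coffee']
```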
3. The spatial keyword Top-K query method based on spatio-semantic-numerical relevance as claimed in claim 1, wherein the specific steps of step 3 are as follows:
step 3.1: expanding the spatial keyword query condition, using the AKR-tree to obtain the nodes that match the query condition, and taking the spatial objects in the Skyline sets of the matched nodes as the candidate result set;
step 3.2: for each spatial object in the candidate result set, calculating its position proximity, semantic relevance, and numerical proximity to the query;
step 3.3: calculating the composite relevance score of the spatial object and the query, and selecting the top-k final results according to the score.
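The final selection in step 3.3 amounts to keeping the k candidates with the highest composite score. A minimal Python sketch using the standard-library `heapq.nlargest` is shown below; the candidate triples and their scores are made up for illustration, and the value β = 0.86 is the one used in the worked example of the description.

```python
import heapq


def top_k(candidates, score_fn, k):
    """Return the k candidates with the highest score, best first."""
    return heapq.nlargest(k, candidates, key=score_fn)


# Hypothetical candidates: (object id, S_LT, S_A) triples with made-up values.
candidates = [("o1", 0.40, 0.90), ("o2", 0.70, 0.20), ("o3", 0.55, 0.60)]
beta = 0.86
ranked = top_k(candidates, lambda c: beta * c[1] + (1 - beta) * c[2], k=2)
print([c[0] for c in ranked])  # ['o2', 'o3']
```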
4. The spatial keyword Top-K query method based on spatio-semantic-numerical relevance according to claim 3, characterized in that the position proximity, semantic relevance, and numerical proximity between the query and a spatial object in step 3.2 are calculated respectively as follows:
(1) Location proximity between query and spatial object:
Figure FDA0003849101570000031
where d(q.λ, o.λ) is the Euclidean distance between the spatial positions of the spatial object and the query, and MaxD is the maximum Euclidean distance among all the spatial objects;
(2) Semantic relatedness between query and spatial object:
first, the text information of the spatial object and the keywords of the query are vectorized, wherein the dimension of the vectors is the number of all distinct words contained in the text information of the spatial object and in the query, denoted m; the words are arranged in a fixed order to form a sequence, and for each word in the sequence the corresponding position of vector $V_o$ (respectively $V_q$) is set to 1 if the word is contained in the spatial object (respectively the query), and to 0 otherwise; then the similarity between the two vectors is calculated with the Cosine similarity method, using the following formula:
Figure FDA0003849101570000032
where $V_o[i]$ and $V_q[i]$ denote the i-th elements of vectors $V_o$ and $V_q$, respectively;
(3) Numerical proximity between query and spatial object:
Figure FDA0003849101570000041
where q.W is a set of weights for the different numerical attributes, representing the user's preference for those numerical attributes,
Figure FDA0003849101570000042
$q.w_i \ge 0$ ($i = 1, \ldots, |q.W|$) and
Figure FDA0003849101570000043
$o.a_i$ denotes the value of object o on numerical attribute $A_i$, normalized to the interval [0, 1];
(4) The position-semantic similarity between the spatial object and the query is calculated as follows:
$S_{LT}(o,q) = \alpha \cdot S_{Loc}(o,q) + (1-\alpha) \cdot S_{Text}(o,q)$
where α is an adjustment parameter.
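The semantic relevance of item (2) can be illustrated with a short Python sketch that builds the two 0/1 vectors over the joint vocabulary and takes their cosine similarity. The function name and the sample word sets are hypothetical; sorting the vocabulary is just one way to fix the word order that the text requires.

```python
import math


def semantic_relevance(object_words, query_words):
    """Cosine similarity of the 0/1 vectors V_o and V_q built over the joint vocabulary."""
    obj, qry = set(object_words), set(query_words)
    vocab = sorted(obj | qry)             # fixed word order; its length is the dimension m
    v_o = [int(w in obj) for w in vocab]  # 1 where the object contains the word
    v_q = [int(w in qry) for w in vocab]  # 1 where the query contains the word
    dot = sum(a * b for a, b in zip(v_o, v_q))
    # For 0/1 vectors the sum of squares equals the plain sum.
    norm = math.sqrt(sum(v_o)) * math.sqrt(sum(v_q))
    return dot / norm if norm else 0.0


obj_text = {"coffee", "bakery", "wifi", "cafe"}
query = {"coffee", "cafe"}
print(round(semantic_relevance(obj_text, query), 4))  # 0.7071
```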
5. The spatial keyword Top-K query method based on spatio-semantic-numerical relevance as claimed in claim 4, wherein the final composite score of the spatial object and the query in step 3.3 is calculated as follows:
$Score(o,q) = \beta \cdot S_{LT}(o,q) + (1-\beta) \cdot S_{A}(o,q)$
where β is an adjustment parameter.
CN201910657221.9A 2019-07-19 2019-07-19 Space keyword Top-K query method based on space-semantic-numerical correlation Expired - Fee Related CN110362652B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910657221.9A CN110362652B (en) 2019-07-19 2019-07-19 Space keyword Top-K query method based on space-semantic-numerical correlation

Publications (2)

Publication Number Publication Date
CN110362652A CN110362652A (en) 2019-10-22
CN110362652B true CN110362652B (en) 2022-11-22

Family

ID=68221090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910657221.9A Expired - Fee Related CN110362652B (en) 2019-07-19 2019-07-19 Space keyword Top-K query method based on space-semantic-numerical correlation

Country Status (1)

Country Link
CN (1) CN110362652B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270199A (en) * 2020-11-03 2021-01-26 辽宁工程技术大学 CGAN-based personalized semantic space keyword Top-K query method
CN113158087B (en) * 2021-04-09 2024-07-09 深圳前海微众银行股份有限公司 Space text query method and device
CN116341567B (en) * 2023-05-29 2023-08-29 山东省工业技术研究院 Interest point semantic labeling method and system based on space and semantic neighbor information
CN117171802B (en) * 2023-11-03 2024-01-12 中国科学技术信息研究所 Strong privacy protection method and system for space keyword query

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104156548A (en) * 2014-09-03 2014-11-19 哈尔滨华夏矿安科技有限公司 Circuit switching method with CAD graph of downhole ventilation, fire avoiding and water conveying line as data source
CN104731860A (en) * 2015-02-04 2015-06-24 北京邮电大学 Space keyword query method protecting privacy
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN108647213A (en) * 2018-05-21 2018-10-12 辽宁工程技术大学 A kind of composite key semantic relevancy appraisal procedure based on coupled relation analysis
CN108804551A (en) * 2018-05-21 2018-11-13 辽宁工程技术大学 It is a kind of to take into account diversity and personalized space point of interest recommendation method
CN109947904A (en) * 2019-03-22 2019-06-28 东北大学 A kind of preference space Skyline inquiry processing method based on Spark environment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7842926B2 (en) * 2003-11-12 2010-11-30 Micronic Laser Systems Ab Method and device for correcting SLM stamp image imperfections

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Towards Why-Not Spatial Keyword Top-k Queries: A Direction-Aware Approach;Lei Chen et al.;《IEEE Transactions on Knowledge and Data Engineering》;20180430;第30卷(第4期);796-809 *
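Top-k query and ranking method for spatial objects based on location-text relationship;孟祥福 et al.;《智能系统学报》;20200331;Vol. 15 (No. 2);235-242 *
Research on Skyline query methods in spatial databases;李爽;《中国优秀硕士学位论文全文数据库 信息科技辑》;20190115;I138-1803 *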

Also Published As

Publication number Publication date
CN110362652A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110362652B (en) Space keyword Top-K query method based on space-semantic-numerical correlation
CN111274811B (en) Address text similarity determining method and address searching method
JP5316158B2 (en) Information processing apparatus, full-text search method, full-text search program, and recording medium
CN103593425B (en) Intelligent retrieval method and system based on preference
CN106980648B (en) Personalized recommendation method based on probability matrix decomposition and combined with similarity
CN110795527B (en) Candidate entity ordering method, training method and related device
CN110704743A (en) Semantic search method and device based on knowledge graph
JP6722615B2 (en) Query clustering device, method, and program
CN103440314A (en) Semantic retrieval method based on Ontology
CN112100396B (en) Data processing method and device
CN107145545A (en) Top k zone users text data recommends method in a kind of location-based social networks
CN112000776B (en) Topic matching method, device, equipment and storage medium based on voice semantics
CN110377684A (en) A kind of spatial key personalization semantic query method based on user feedback
US12067061B2 (en) Systems and methods for automated information retrieval
CN102915381A (en) Multi-dimensional semantic based visualized network retrieval rendering system and rendering control method
CN102567421A (en) Document retrieval method and device
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
CN117708270A (en) Enterprise data query method, device, equipment and storage medium
Kannadasan et al. Personalized query auto-completion through a lightweight representation of the user context
JP6495206B2 (en) Document concept base generation device, document concept search device, method, and program
CN112270199A (en) CGAN-based personalized semantic space keyword Top-K query method
CN107368525B (en) Method and device for searching related words, storage medium and terminal equipment
CN114491056A (en) Method and system for improving POI (Point of interest) search in digital police scene
CN116401356B (en) Knowledge graph multi-round question-answering method and system based on historical information tracking
CN116881437B (en) Data processing system for acquiring text set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20221122)