CN110955827B

CN110955827B - Method and system for solving the SKQ why-not problem using the AI³ index

Info

Publication number: CN110955827B
Application number: CN201911128644.8A
Authority: CN
Inventors: 李艳红; 冯禹鹤; 张望
Original assignee: South Central University for Nationalities
Current assignee: South Central Minzu University
Priority date: 2019-11-18
Filing date: 2019-11-18
Publication date: 2022-09-30
Anticipated expiration: 2039-11-18
Also published as: CN110955827A

Abstract

The invention discloses a method and a system for solving the SKQ why-not problem using the AI³ index, relating to the technical field of spatial keyword query. The numerical attributes of an object are expressed as a Boolean expression, which is closer to practical application scenarios. An AI³ index is designed to organize object information, together with a corresponding query strategy, so that a refined query q' with the minimum modification cost makes all missing objects appear in the query result, thereby solving the why-not problem in spatial keyword queries.

Description

Method and system for solving the SKQ why-not problem using the AI³ index

Technical Field

The invention relates to the technical field of spatial keyword query, and in particular to a method and a system for solving the SKQ why-not problem using the AI³ index.

Background

Spatial keyword queries (SKQ) have been proposed and extensively studied as more and more objects are associated with geographic locations and textual descriptions. In real life, objects typically have additional numerical attributes, such as average price, rating, and popularity. If the constraints on these attributes are not taken into account in the query, it is often difficult or impossible to obtain the results the user wants. Therefore, to satisfy the user's constraints on these attributes and to refine the query process, spatial keyword queries need to take numerical attributes into account.

This document is primarily directed to top-k enhanced spatial keyword queries. When searching for the top-k objects, the query first retrieves the objects that meet the numerical attribute requirements of the query q, and then ranks them by a combined score of the spatial distance between the query point and the object and their text similarity. FIG. 1 shows an example of an enhanced spatial keyword query, and Table 1 shows the text information and related attribute information of the objects.

Table 1: information about objects in FIG. 1

As shown in FIG. 1, a user issues a query with the keyword "cafe", requiring an average price below $42, a rating higher than 4.3, and a popularity greater than 700. These enhanced requirements can be expressed by a Boolean expression: (avg-price < 42 ∧ Rating > 4.3 ∧ Popularity > 700). First, objects o₃, o₅, and o₈ satisfy the enhanced query requirements; then, according to the degree of textual and spatial matching between o₃, o₅, o₈ and the query q, the top three objects can be returned using the selected ranking function. In addition, o₁ is ignored because it shares no keyword with q, and o₂, o₄, o₆, and o₇ are ignored because none of them meets the attribute requirements of the query.

However, when the objects a user expects do not appear in the query result set, the user may wonder why they are missing and how to bring them into the result set. For example, after issuing the query and obtaining the result containing o₃, o₅, and o₈, the user may want to know why the familiar objects o₁ and o₆ are not in the result set. Are o₃, o₅, and o₈ really better than o₁ and o₆? How can the familiar objects o₁ and o₆ be made to appear in the query result set?

After obtaining the query results, users may find that some objects they want are not in the result set, and may therefore question the entire query result. The problem of explaining why these desired objects are missing and how to efficiently obtain a query that returns them is known as the why-not problem. However, no existing technique solves the why-not problem in the enhanced spatial keyword top-k query. Therefore, a technical scheme capable of solving the why-not problem in the enhanced spatial keyword top-k query is needed.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a method and a system for solving the SKQ why-not problem using the AI³ index, which effectively solve the why-not problem in spatial keyword queries.

To achieve the above purpose, the technical scheme adopted by the invention is as follows: a method for solving the SKQ why-not problem using the AI³ index, comprising the following steps:

obtaining all objects o and constructing the AI³ index;

obtaining an initial query q = (q.loc, q.doc₀, q.B, k, α) and a missing object set M; constructing a candidate keyword list CKS in descending order of the frequency of the keywords of the missing objects, and constructing a candidate attribute-value pair list CAS in descending order of the similarity scores of the missing objects; setting the keyword set q'.doc and the attribute-value pairs q'.B' of the refined query q' to q.doc₀ and q.B, respectively;

extracting, in order, the keywords in the CKS and the attribute-value pairs in the CAS, and adding them to the keyword set q'.doc and the attribute-value pairs q'.B' of the query q', respectively, to form a new refined query q'; processing each refined query q' to find the best refined query, until both the CKS and the CAS are empty;

processing each refined query q' respectively, specifically including:

calculating the modification cost p' of q', and filtering out any query q' with p' ≥ p_c, where p_c is the modification cost of the basic refined query q_b, which preserves the initial query keywords and attributes and in whose result all missing objects appear;

for each query q' with p' < p_c, determining, according to the frequency of each keyword, whether each keyword of the query q' is a frequent keyword:

if the keyword is a frequent keyword, adding the root node of the quadtree in the header file to the queue of non-leaf nodes to be processed, selecting the qualified leaf nodes according to a preset screening rule, and adding them to the queue of qualified leaf nodes; for each object in the disk page pointed to by a qualified leaf node, judging in turn whether the attribute-value pairs q'.B' of the query q' and the attribute-value pairs of the object satisfy attribute matching, adding each matching object to the set of objects meeting the requirements of the query q', and calculating the similarity score between the query q' and the object;

if the keyword is an infrequent keyword, examining each object in the corresponding disk page and, if the attribute-value pairs q'.B' of the query q' and the attribute-value pairs of the object satisfy attribute matching, adding the object to the set of objects meeting the requirements of the query q', and calculating the similarity score between the query q' and the object;

ranking all objects in the set of objects meeting the requirements of the query q' from high to low by similarity score until all original result objects and all missing objects appear, obtaining k' objects;

if k' ≤ k_m, where k_m is the size of the result set when the initial query keywords and attributes are preserved and all missing objects appear in the query result, calculating the modification cost p' of q', and if p' < p_c, taking the query q' as the current best refined query.

On the basis of the above scheme, obtaining all objects o and constructing the AI³ index specifically comprises the following steps:

hierarchically dividing the data space into cells using a quadtree structure; taking the keyword cell as the basic storage unit, which stores the spatial position and attribute information of the objects containing the keyword;

creating three components: a lookup table used as the entry point, a header file containing summary information of the dense keyword cells, and a data file storing the keyword cell tuples of all inverted lists;

storing the attribute information of the basic keyword cells of frequent keywords in the leaf nodes of the quadtree;

each non-leaf node R_i of the quadtree contains three attributes: R_i.id, R_i.S, and R_i.Address, where R_i.id is the node id, R_i.Address is the address list of all child nodes of R_i, and R_i.S is the union of the attribute-value pairs of all child nodes of R_i;

each leaf node R_i of the quadtree contains three attributes: R_i.id, R_i.S, and R_i.Address, where R_i.id is the node id, R_i.Address is the address of the disk page to which the node is linked, and R_i.S is the union of the attribute-value pairs of all objects in that disk page.

On the basis of the above scheme, B is a Boolean expression:

B = p₁ ∧ p₂ ∧ … ∧ pₙ

where {p_i} is a predicate set, i ∈ [1, n], i ∈ N*.

On the basis of the above scheme, the modification cost p' of q' is calculated by the following formula:

$$cost(q, q') = \beta_1 \cdot \frac{k' - k_0}{k_m - k_0} + \beta_2 \cdot \frac{\Delta doc}{|q.doc_0 \cup M.doc|} + \beta_3 \cdot \frac{\Delta A_n}{|q.B \cup M.B|} + \beta_4 \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{|v_i' - v_i|}{\Delta v_i}$$

where β₁, β₂, β₃, β₄ are the weights of the k value, the keywords, the attribute types, and the attribute values in the cost function, respectively, with β_i ≥ 0 and β₁ + β₂ + β₃ + β₄ = 1;

k' is the size of the query result set of the refined query q', k₀ is the size of the result set of the initial query q, and k_m is the size of the result set when the initial query keywords and attributes are preserved and all missing objects appear in the query result; k' - k₀ is normalized by k_m - k₀; Δdoc is the number of keywords that need to be changed to adjust q.doc₀ to q'.doc, and is normalized by |q.doc₀ ∪ M.doc|, where M.doc = m₁.doc ∪ m₂.doc ∪ … ∪ m_j.doc and the missing object set is M = {m₁, m₂, ..., m_j}; ΔA_n is the number of attribute types that need to be changed to adjust the initial query to the refined query, and is normalized by |q.B ∪ M.B|;

n is the total number of attributes contained in q.B and M.B; Δv_i is the maximum difference between the values of attribute A_i over all objects that contain A_i; |v_i' - v_i| is the absolute difference between the current query value v_i' and the initial query value v_i of attribute A_i, with |v_i' - v_i| ≤ Δv_i, and is normalized by Δv_i.

On the basis of the scheme, the similarity score between the query q and the object o is calculated by the following formula:

where α is a variable between 0 and 1 that defines the relative importance of distance proximity and text relevance, d(q.loc, o.loc) denotes the Euclidean distance between the query q and the object o, and d_max(q.loc, o.loc) denotes the maximum distance from the query point q to the objects in the object set O, taken specifically as the maximum distance between any two objects in O.

On the basis of the above scheme, if the keyword is a frequent keyword, adding the root node of the quadtree in the header file to the queue of non-leaf nodes to be processed, selecting the qualified leaf nodes according to the preset screening rule, and adding them to the queue of qualified leaf nodes specifically comprises the following steps:

if the keyword is a frequent keyword, adding the root node of the quadtree in the header file to the queue of non-leaf nodes to be processed;

judging whether each child node of the current node in the queue of non-leaf nodes to be processed is a qualified node;

if not, filtering out the child node; if so, judging whether the child node is a non-leaf node or a leaf node;

if it is a non-leaf node, adding it to the queue of non-leaf nodes to be processed; if it is a leaf node, adding it to the queue of qualified leaf nodes.

On the basis of the above scheme, whether a child node of the current node in the queue of non-leaf nodes to be processed is a qualified node is judged by the following criteria:

a) all attribute types of the query q' are present in the child node;

b) each attribute value range of the query q' intersects the corresponding attribute value range of the child node.
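As a non-limiting illustration, the two criteria above can be sketched in Python. The sketch assumes that each attribute constraint of q' and each node summary is represented as a closed interval; the function name is_qualified, the dictionary layout, and the sample values are hypothetical and are not part of the scheme itself. Treating strict bounds as closed is conservative: it never discards a node that could contain a matching object.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # closed interval [low, high] (assumed representation)


def is_qualified(query_ranges: Dict[str, Range], node_summary: Dict[str, Range]) -> bool:
    """Return True if a child node may contain objects matching the query."""
    for attr, (q_lo, q_hi) in query_ranges.items():
        # Criterion a): every attribute type of the query must be present in the node.
        if attr not in node_summary:
            return False
        n_lo, n_hi = node_summary[attr]
        # Criterion b): the query range must intersect the node range.
        if q_hi < n_lo or n_hi < q_lo:
            return False
    return True


INF = float("inf")
# Hypothetical query: avg-price up to 42, Rating from 4.3, Popularity from 700.
query = {"avg-price": (-INF, 42), "Rating": (4.3, INF), "Popularity": (700, INF)}
node_full = {"avg-price": (35, 38), "Rating": (4.3, 4.6), "Popularity": (700, 1600)}
node_no_popularity = {"avg-price": (35, 37), "Rating": (4.5, 4.6)}

print(is_qualified(query, node_full))           # True
print(is_qualified(query, node_no_popularity))  # False
```

The second node is rejected under criterion a) because it carries no Popularity information at all.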

The invention also provides a system for solving the SKQ why-not problem using the AI³ index, comprising:

an AI³ index construction module, configured to: obtain all objects o and construct the AI³ index;

a candidate list construction module, configured to: obtain an initial query q = (q.loc, q.doc₀, q.B, k, α) and a missing object set M; construct a candidate keyword list CKS in descending order of the frequency of the keywords of the missing objects, and construct a candidate attribute-value pair list CAS in descending order of the similarity scores of the missing objects; set the keyword set q'.doc and the attribute-value pairs q'.B' of the refined query q' to q.doc₀ and q.B, respectively;

a refined query module, configured to: extract, in order, the keywords in the CKS and the attribute-value pairs in the CAS, and add them to the keyword set q'.doc and the attribute-value pairs q'.B' of the query q', respectively, to form a new refined query q'; process each refined query q' to find the best refined query, until both the CKS and the CAS are empty; processing each refined query q' specifically includes:

for each query q' with p' < p_c, determining, according to the frequency of each keyword, whether each keyword of the query q' is a frequent keyword:

if the keyword is a frequent keyword, adding the root node of the quadtree in the header file to the queue of non-leaf nodes to be processed, selecting the qualified leaf nodes according to a preset screening rule, and adding them to the queue of qualified leaf nodes; for each object in the disk page pointed to by a qualified leaf node, judging in turn whether the attribute-value pairs q'.B' of the query q' and the attribute-value pairs of the object satisfy attribute matching, adding each matching object to the set of objects meeting the requirements of the query q', and calculating the similarity score between the query q' and the object;

if the keyword is an infrequent keyword, examining each object in the corresponding disk page and, if the attribute-value pairs q'.B' of the query q' and the attribute-value pairs of the object satisfy attribute matching, adding the object to the set of objects meeting the requirements of the query q', and calculating the similarity score between the query q' and the object;

if k' ≤ k_m, where k_m is the size of the result set when the initial query keywords and attributes are preserved and all missing objects appear in the query result, calculating the modification cost p' of q', and if p' < p_c, taking the query q' as the current best refined query.

On the basis of the above scheme, the AI³ index construction module is specifically configured to:

create three components: a lookup table used as the entry point, a header file containing summary information of the dense keyword cells, and a data file storing the keyword cell tuples of all inverted lists;

On the basis of the above scheme, B is a Boolean expression:

B = p₁ ∧ p₂ ∧ … ∧ pₙ

where {p_i} is a predicate set, i ∈ [1, n], i ∈ N*.

On the basis of the above scheme, when the keyword is a frequent keyword, the refined query module adds the root node of the quadtree in the header file to the queue of non-leaf nodes to be processed, selects the qualified leaf nodes according to the preset screening rule, and adds them to the queue of qualified leaf nodes, which specifically comprises the following steps:

On the basis of the above scheme, the refined query module judges whether a child node of the current node in the queue of non-leaf nodes to be processed is a qualified node by the following criteria:

a) all attribute types of the query q' are present in the child node;

Compared with the prior art, the invention has the advantages that:

the numerical attributes of an object are expressed as a Boolean expression, which is closer to real application scenarios; an AI³ index is designed to organize object information, together with a corresponding query strategy, so that a refined query q' with the minimum modification cost makes all missing objects appear in the query result, thereby solving the why-not problem in spatial keyword queries.

Drawings

FIG. 1 is a diagram of an example set of objects of the background art;

FIG. 2 is a schematic diagram of how the AI³ index of an embodiment of the invention partitions objects;

FIG. 3 is a schematic diagram of the structure of an AI³ index instance of an embodiment of the invention;

FIG. 4 is a schematic diagram of the AI³ index-based algorithm of an embodiment of the invention.

Detailed Description

An embodiment of the invention provides a method for solving the SKQ why-not problem using the AI³ index, comprising the following steps:

obtaining all objects o and constructing the AI³ index;

obtaining an initial query q = (q.loc, q.doc₀, q.B, k, α) and a missing object set M, where q.loc is the location of the query q, q.doc₀ is the keyword set of the query q, q.B is a Boolean expression representing the attribute-value pairs, k indicates that the top k ranked objects are returned, and α is a variable between 0 and 1 defining the relative importance of distance proximity and text relevance; constructing a candidate keyword list CKS in descending order of the frequency of the keywords of the missing objects, and constructing a candidate attribute-value pair list CAS in descending order of the similarity scores of the missing objects; setting the keyword set q'.doc and the attribute-value pairs q'.B' of the refined query q' to q.doc₀ and q.B, respectively;

processing each refined query q' respectively, specifically including:

if the keyword is a frequent keyword, adding the root node of the quadtree in the header file to the queue of non-leaf nodes to be processed, selecting the qualified leaf nodes according to a preset screening rule, and adding them to the queue of qualified leaf nodes; for each object in the disk page pointed to by a qualified leaf node, judging in turn whether the attribute-value pairs q'.B' of the query q' and the attribute-value pairs of the object satisfy attribute matching, adding each matching object to the set of objects meeting the requirements of the query q', and calculating the similarity score between the query q' and the object;

if k' ≤ k_m, where k_m is the size of the result set when the initial query keywords and attributes are preserved and all missing objects appear in the query result, calculating the modification cost p' of q', and if p' < p_c, taking the query q' as the current best refined query.

Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.

I. Definition of the enhanced spatial keyword top-k query

Predicates are the basic components of a Boolean expression. Given a quadruple (A, f_opt, f_opd, x), where A is an attribute, f_opt is an operator, f_opd is an operand, and x is the input value, a predicate can be defined conveniently.

Definition 1: Predicate.

If a mapping function p satisfies

p: (A, f_opt, f_opd, x) → {0, 1},

then p(A, f_opt, f_opd, x) is a predicate: if the input value x lies within the range specified by the predicate, the mapping function returns 1; otherwise, it returns 0.
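By way of illustration only, a predicate over the quadruple (A, f_opt, f_opd, x) can be sketched in Python as a function that returns 1 or 0. The function name predicate, the OPERATORS table, and the set of supported operators are assumptions made for this sketch.

```python
import operator
from typing import Callable, Dict

# Supported comparison operators (an assumed, illustrative selection).
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}


def predicate(attribute: str, op: str, operand: float, x: float) -> int:
    """Return 1 if the input value x satisfies `attribute op operand`, else 0.

    `attribute` only labels the predicate; the comparison uses op, operand and x.
    """
    return 1 if OPERATORS[op](x, operand) else 0


print(predicate("avg-price", "<", 42, 37))  # 1
print(predicate("Rating", ">", 4.3, 4.3))   # 0
```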

Definition 2: Boolean expression.

Given a predicate set {p_i}, where i ∈ [1, n] and i ∈ N*, the Boolean expression B can be defined as follows:

B = p₁ ∧ p₂ ∧ … ∧ pₙ

Definition 3: Text-spatial object.

Given a spatial point o.loc, a keyword set o.doc, and a set of attribute-value pairs <A₁, v₁>, ..., <A_i, v_i>, ..., <A_n, v_n>, a text-spatial object o can be represented as follows:

o = <o.loc, o.doc, o.S>, where o.S = {(A₁ = v₁) ∧ (A₂ = v₂) ∧ … ∧ (A_n = v_n)}

Definition 4: enhanced spatial keyword query.

Given a spatial point q.loc, a keyword set q.doc₀, and a Boolean expression q.B, an enhanced spatial keyword query q can be expressed as:

q = <q.loc, q.doc₀, q.B>

Definition 5: Keyword matching.

For a query q and an object o, q and o are said to be a keyword match if and only if q.doc and o.doc share at least one keyword, i.e., q.doc ∩ o.doc ≠ ∅.

As used herein

Representing keyword matches

Definition 6: Attribute matching.

For a query q and an object o, if and only if the following two conditions are satisfied: a) all attributes in q.B are contained in o.S; b)

(assuming that attribute A_i in q.B and attribute A_i' in o.S are the same attribute),

wherein:

(A_i' = v_i') ∈ o.S, then the query q and the object o are an attribute match.

Use of

Representing attribute matching

Definition 7: Comprehensive matching.

An enhanced spatial keyword query q and a text-spatial object o are a comprehensive match if and only if they satisfy both keyword matching and attribute matching, that is:

as used herein

Representation synthesis matching

Now a Rank function is defined to measure the similarity score between query q and object o:

where α is a variable between 0 and 1 that defines the relative importance of distance proximity and text relevance, d(q.loc, o.loc) denotes the Euclidean distance between the query q and the object o, and d_max(q.loc, o.loc) denotes the maximum distance from the query point q to the objects in the object set O, taken specifically as the maximum distance between any two objects in O.
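For illustration only, a ranking function of this kind can be sketched in Python. The text above fixes only the roles of α, d, and d_max. The sketch therefore makes two explicit assumptions that are not taken from the source: the score is combined as α·(1 - d/d_max) + (1 - α)·(text relevance), and the text relevance is the Jaccard similarity of the two keyword sets. The function name rank and the sample values are hypothetical.

```python
import math
from typing import Set, Tuple

Point = Tuple[float, float]


def rank(q_loc: Point, q_doc: Set[str], o_loc: Point, o_doc: Set[str],
         alpha: float, d_max: float) -> float:
    """Combine spatial proximity and text relevance into one score in [0, 1]."""
    # Spatial proximity: 1 when the object is at the query point, 0 at distance d_max.
    spatial = 1.0 - math.dist(q_loc, o_loc) / d_max
    # ASSUMPTION: text relevance is measured by Jaccard similarity of keyword sets.
    union = q_doc | o_doc
    textual = len(q_doc & o_doc) / len(union) if union else 0.0
    return alpha * spatial + (1.0 - alpha) * textual


score = rank((0.0, 0.0), {"cafe"}, (3.0, 4.0), {"cafe", "cat"}, alpha=0.5, d_max=10.0)
print(score)  # 0.5
```

With α close to 1 the ranking is dominated by distance; with α close to 0 it is dominated by keyword overlap.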

Definition 8: Enhanced spatial keyword top-k query.

Given a set of objects O, the enhanced spatial keyword top-k query q = (loc, doc₀, B, k, α) retrieves a set of objects O' ⊆ O that satisfies |O'| = k and, for every o ∈ O' and every o' ∈ O - O', Rank(q, o) > Rank(q, o').
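As a minimal, brute-force sketch of this definition (not the index-based processing described later), the top-k result can be obtained by filtering the objects that match the query and keeping the k highest-scoring ones. The helper name top_k, the matches and score callables, and the sample scores are hypothetical.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def top_k(objects: Iterable[T], matches: Callable[[T], bool],
          score: Callable[[T], float], k: int) -> List[T]:
    """Return the k matching objects with the highest scores, best first."""
    candidates = (o for o in objects if matches(o))
    return heapq.nlargest(k, candidates, key=score)


# Hypothetical objects: "ok" marks a comprehensive match, "s" is the Rank score.
objs = [
    {"id": "o3", "ok": True, "s": 0.8},
    {"id": "o5", "ok": True, "s": 0.6},
    {"id": "o1", "ok": False, "s": 0.9},
    {"id": "o8", "ok": True, "s": 0.7},
]
result = top_k(objs, lambda o: o["ok"], lambda o: o["s"], k=2)
print([o["id"] for o in result])  # ['o3', 'o8']
```

The object o1 is excluded despite its high score because it does not match the query.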

II. The why-not problem in the enhanced spatial keyword top-k query

When a user issues an enhanced spatial keyword top-k query q = (loc, doc₀, B, k, α), unreasonable settings of query parameters such as the text description, the query attributes, the k value, and α may cause one or more objects desired by the user to be unexpectedly missing. Such objects are called missing objects and are denoted by M = {m₁, m₂, ..., m_j}. The user then raises a why-not question on the missing object set M = {m₁, m₂, ..., m_j}, asking why these desired objects are missing and seeking a refined query q' = (loc, doc', B', k', α) whose result set contains all the missing objects. Since the location of the query is usually fixed, the initial query can be refined by changing the query keyword set, the Boolean expression, the k value, and the α value.

For the result set of the refined query q' to contain all missing objects, q'.doc contains, in addition to the original keyword set, some or all of the keywords of the missing objects. CKS is the ordered list of the keywords of the missing objects, sorted by keyword frequency, and the function Out_List(CKS) removes the first keyword from the CKS and returns it. For example, in the example of FIG. 1, the query q filters out o₁, o₂, o₄, o₆, and o₇. Suppose o₄ and o₆ are the missing objects and the keyword "center" has a higher frequency than the keyword "Cosmic"; then "center" is placed before "Cosmic" in the CKS, so that CKS = {"center", "Cosmic"}. Similarly, q'.B' must satisfy, in addition to the original set of attribute-value pairs, the attribute-value pairs of all missing objects. CAS is the ordered list of the attribute-value pairs of the missing objects, sorted by object similarity score, and the function Out_List(CAS) removes the first attribute-value pair from the CAS and returns it. Continuing the example, assume that the similarity score of o₄ is higher than that of o₆, so the attribute-value pairs of o₄ are ranked ahead of those of o₆. This is because higher-scoring objects are generally more desirable to users, so their attribute values better match the users' needs. Therefore, giving priority to the attribute-value pairs of o₄ yields:
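As an illustrative sketch, the two candidate lists and the Out_List function could be built as follows in Python. Several points are assumptions of this sketch rather than statements of the source: keyword frequency is counted over the missing objects, keywords already in the initial query are skipped, ties are broken alphabetically, and the keyword sets and similarity scores of o₄ and o₆ are invented sample data.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def build_cks(missing_docs: Dict[str, Set[str]], initial_doc: Set[str]) -> List[str]:
    """Candidate keywords of the missing objects, most frequent first."""
    counts = Counter(kw for doc in missing_docs.values()
                     for kw in doc if kw not in initial_doc)
    # Descending frequency; alphabetical order makes ties deterministic.
    return [kw for kw, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def build_cas(missing_attrs: Dict[str, Dict[str, float]],
              scores: Dict[str, float]) -> List[Tuple[str, Dict[str, float]]]:
    """Attribute-value pairs of the missing objects, highest similarity score first."""
    order = sorted(missing_attrs, key=lambda oid: -scores[oid])
    return [(oid, missing_attrs[oid]) for oid in order]


def out_list(candidates: list):
    """Remove and return the first element of a candidate list."""
    return candidates.pop(0)


# Hypothetical sample data for the missing objects o4 and o6.
docs = {"o4": {"cafe", "center"}, "o6": {"center", "Cosmic"}}
attrs = {"o4": {"avg-price": 42, "Rating": 4.4, "Popularity": 900},
         "o6": {"avg-price": 35, "Rating": 4.6}}
cks = build_cks(docs, initial_doc={"cafe"})
cas = build_cas(attrs, scores={"o4": 0.7, "o6": 0.5})
print(cks)               # ['center', 'Cosmic']
print(out_list(cas)[0])  # o4
```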

q'.B' = q.B ∪ Out_List(CAS) = q.B ∪ o₄.S

= (avg-price ≤ 42) ∧ (Rating > 4.3) ∧ (Popularity > 700)

where q.B = (avg-price < 42 ∧ Rating > 4.3 ∧ Popularity > 700) and o₄.S = (avg-price = 42 ∧ Rating = 4.4 ∧ Popularity = 900).

Since o₆ still does not satisfy this refined query, its attribute-value pairs, i.e., o₆.S = (avg-price = 35 ∧ Rating = 4.6 ∧ Popularity = NULL), are considered next, so that q'.B' = (avg-price ≤ 42) ∧ (Rating > 4.3).
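The worked example above can be reproduced with a small Python sketch. The merge rule used here is an interpretation inferred from the example and is an assumption: a predicate that the missing object already satisfies is kept, a predicate it violates is widened to the object's value with a non-strict operator, and a predicate on an attribute the object lacks is dropped. The function name relax and the data layout are hypothetical.

```python
import operator
from typing import Dict, Tuple

Predicate = Tuple[str, float]  # (operator symbol, bound)

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
# A violated bound is widened to the object's value using the non-strict operator.
NON_STRICT = {"<": "<=", "<=": "<=", ">": ">=", ">=": ">="}


def relax(expr: Dict[str, Predicate], obj_attrs: Dict[str, float]) -> Dict[str, Predicate]:
    """Relax a conjunctive expression so that the given object satisfies it."""
    relaxed: Dict[str, Predicate] = {}
    for attr, (op, bound) in expr.items():
        if attr not in obj_attrs:
            continue  # the object has no value for this attribute: drop the predicate
        value = obj_attrs[attr]
        if OPS[op](value, bound):
            relaxed[attr] = (op, bound)              # already satisfied: keep as is
        else:
            relaxed[attr] = (NON_STRICT[op], value)  # widen the bound to the value
    return relaxed


q_b = {"avg-price": ("<", 42), "Rating": (">", 4.3), "Popularity": (">", 700)}
o4 = {"avg-price": 42, "Rating": 4.4, "Popularity": 900}
o6 = {"avg-price": 35, "Rating": 4.6}

step1 = relax(q_b, o4)
step2 = relax(step1, o6)
print(step1)  # {'avg-price': ('<=', 42), 'Rating': ('>', 4.3), 'Popularity': ('>', 700)}
print(step2)  # {'avg-price': ('<=', 42), 'Rating': ('>', 4.3)}
```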

Considering that changing different query parameters affects the refinement of the query differently, the modification cost between the refined query q' and the initial query q is defined as follows:

$$cost(q, q') = \beta_1 \cdot \frac{k' - k_0}{k_m - k_0} + \beta_2 \cdot \frac{\Delta doc}{|q.doc_0 \cup M.doc|} + \beta_3 \cdot \frac{\Delta A_n}{|q.B \cup M.B|} + \beta_4 \cdot \frac{1}{n} \sum_{i=1}^{n} \frac{|v_i' - v_i|}{\Delta v_i} \tag{2}$$

where β₁, β₂, β₃, β₄ are the weights of the k value, the keywords, the attribute types, and the attribute values in the cost function, respectively, with β_i ≥ 0 and β₁ + β₂ + β₃ + β₄ = 1.

k' is the size of the query result set of the refined query q', and k₀ is the size of the result set of the initial query q; k' - k₀ is normalized by k_m - k₀. This is because many previous studies obtain a basic refined query q_b by preserving the initial query keywords and attributes and increasing k from k₀ to k_m until all missing objects appear in the query result set. In contrast, a better refined query may achieve a lower modification cost by modifying the k value, the keywords, the attribute types, and the attribute values together, with k' - k₀ ≤ k_m - k₀. Δdoc is the number of keywords that need to be changed to adjust q.doc₀ to q'.doc, and is normalized by |q.doc₀ ∪ M.doc|, where M.doc = m₁.doc ∪ m₂.doc ∪ … ∪ m_j.doc and the missing object set is M = {m₁, m₂, ..., m_j}. ΔA_n is the number of attribute types that need to be changed to adjust the initial query to the refined query, and is normalized by |q.B ∪ M.B|.

n is the total number of attributes contained in q.B and M.B. Δv_i is the maximum difference between the values of attribute A_i over all objects that contain A_i. |v_i' - v_i| is the absolute difference between the current query value v_i' and the initial query value v_i of attribute A_i, with |v_i' - v_i| ≤ Δv_i, and is normalized by Δv_i.

ΔA_n and Δdoc can be calculated using the edit distance. In the example of FIG. 1, if the initial query q is modified to a refined query q' with q'.doc = {"cat", "cafe"} and q'.B' = (avg-price < 42) ∧ (Rating > 4.5) ∧ (Popularity > 700), then ΔA_n = 1 and Δdoc = 1.
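To make the cost function concrete, the following Python sketch evaluates it for given inputs. It assumes the cost is the weighted sum of the four normalized terms described above and that the attribute-value term is averaged over the n attributes; the averaging, the equal default weights, the function name modification_cost, and all sample numbers are assumptions of this sketch.

```python
from typing import Sequence, Tuple


def modification_cost(
    k_new: int, k0: int, k_m: int,
    delta_doc: int, doc_union_size: int,
    delta_attr: int, attr_union_size: int,
    value_changes: Sequence[Tuple[float, float]],
    betas: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Weighted sum of the normalized changes in k, keywords, attribute types, values.

    value_changes holds one (v_new - v_old, delta_v) pair per attribute, where
    delta_v is the maximum spread of that attribute over all objects.
    """
    b1, b2, b3, b4 = betas
    k_term = (k_new - k0) / (k_m - k0) if k_m > k0 else 0.0
    doc_term = delta_doc / doc_union_size
    attr_term = delta_attr / attr_union_size
    if value_changes:
        # ASSUMPTION: the per-attribute value changes are averaged over n attributes.
        value_term = sum(abs(diff) / span for diff, span in value_changes) / len(value_changes)
    else:
        value_term = 0.0
    return b1 * k_term + b2 * doc_term + b3 * attr_term + b4 * value_term


# Hypothetical numbers: k unchanged, one keyword added out of four candidates,
# one attribute type changed out of four, one value moved by 0.5 over a spread of 2.
cost = modification_cost(k_new=3, k0=3, k_m=7,
                         delta_doc=1, doc_union_size=4,
                         delta_attr=1, attr_union_size=4,
                         value_changes=[(0.5, 2.0), (0.0, 10.0)])
print(cost)  # 0.15625
```

Because every term is normalized to [0, 1] and the weights sum to 1, the cost itself stays in [0, 1], which makes costs of different refined queries directly comparable.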

III. Solving the why-not problem in spatial keyword queries using the AI³ index

Depending on whether a query keyword is frequent or infrequent, the AI³ index is designed to improve query efficiency and solve the why-not problem of the enhanced spatial keyword top-k query. The AI³ index is based on the I³ index, which uses a quadtree structure to hierarchically divide the data space into cells and handles spatial-textual information. The index takes the keyword cell as its basic storage unit; a keyword cell captures the spatial position and attribute information of the objects containing the keyword.

FIG. 2 shows the keyword cells of the two keywords "cat" and "cafe" in FIG. 1. A cell in which the number of objects does not exceed a given threshold is called a basic keyword cell; otherwise, it is called a dense keyword cell. Assume the threshold is 2 objects per cell, so each basic keyword cell of a keyword contains at most two objects with this keyword. Among the cells for the keyword "cat" in FIG. 2, C₁, C₂, C₄₂, C₄₃, and C₄₄ are basic cells, and C₄ is a dense cell.
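The distinction between basic and dense keyword cells can be illustrated with a short Python sketch that recursively splits a cell into four quadrants while it holds more objects than the threshold. The quadrant numbering (1 top-left, 2 top-right, 3 bottom-left, 4 bottom-right), the omission of empty cells, the depth limit, and the sample coordinates are all assumptions of this sketch and are not taken from FIG. 2.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Bounds = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def split_cells(points: List[Point], bounds: Bounds, threshold: int = 2,
                cell_id: str = "C", max_depth: int = 8) -> Dict[str, str]:
    """Label each non-empty cell as 'basic' or 'dense', splitting dense cells."""
    if not points:
        return {}  # empty cells are not recorded (assumption)
    kinds = {cell_id: "basic" if len(points) <= threshold else "dense"}
    if kinds[cell_id] == "dense" and max_depth > 0:
        x0, y0, x1, y1 = bounds
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        quadrants = [(x0, ym, xm, y1), (xm, ym, x1, y1),
                     (x0, y0, xm, ym), (xm, y0, x1, ym)]
        for i, (qx0, qy0, qx1, qy1) in enumerate(quadrants, start=1):
            # Half-open intervals keep every interior point in exactly one quadrant.
            inside = [p for p in points if qx0 <= p[0] < qx1 and qy0 <= p[1] < qy1]
            kinds.update(split_cells(inside, (qx0, qy0, qx1, qy1),
                                     threshold, f"{cell_id}{i}", max_depth - 1))
    return kinds


# Hypothetical object locations for one keyword in an 8 x 8 data space.
pts = [(1, 6), (6, 6), (7, 5), (5, 3), (7, 3), (5, 1)]
kinds = split_cells(pts, (0, 0, 8, 8), threshold=2)
print(kinds["C4"], kinds["C42"])  # dense basic
```

In this sample, the bottom-right quadrant C4 holds three objects and is therefore dense, while each of its occupied sub-cells holds one object and is basic.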

Like I³, the AI³ index of an embodiment of the invention includes three main components: a lookup table that serves as the entry point, a header file that contains summary information of the dense keyword cells, and a data file that stores the keyword cell tuples of all inverted lists. Unlike I³, AI³ uses not only text and spatial information to retrieve the objects a user wants, but also attribute information in the header file to improve pruning efficiency. Specifically, AI³ introduces attribute information into the node summaries of the quadtree. If the attribute information of a non-leaf node and the query attributes are not an attribute match, the node and all of its child nodes are pruned.

Each non-leaf node R_i of the quadtree contains a triple (R_i.id, R_i.S, R_i.Address), where R_i.id is the node id, R_i.Address is the address list of all child nodes of R_i, and R_i.S is the union of the attribute-value pairs of all child nodes of R_i. Since the header file is stored in memory, to save query time, especially the time spent accessing disk pages, the attribute information of the basic keyword cells of frequent keywords is stored in the leaf nodes of the quadtree rather than in the disk pages. When the attribute information of a leaf node does not match the query attributes, the corresponding disk page is skipped. Each leaf node R_i of the quadtree also contains a triple (R_i.id, R_i.S, R_i.Address), where R_i.id is the node id, R_i.Address is the address of the disk page to which the node is linked, and R_i.S is the union of the attribute-value pairs of all objects in that disk page.

Continuing with the example in FIG. 1, where

o₅.S = (avg-price = 37 ∧ Rating = 4.5 ∧ Popularity = 1400),

o₆.S = (avg-price = 35 ∧ Rating = 4.6), o₇.S = (Rating = 4.3 ∧ Popularity = 700),

o₈.S = (avg-price = 38 ∧ Rating = 4.6 ∧ Popularity = 1600),

as shown in FIG. 3, R₇ contains the objects o₅ and o₆, and therefore

R₇.S = Cover(o₅.S, o₆.S) = (avg-price ∈ [35, 37]) ∧ (Rating ∈ [4.5, 4.6]) ∧ (Popularity = 1400).

Here the function Cover(o_i.S, o_j.S) returns a list of value ranges: for each attribute A_k ∈ o_i.S ∪ o_j.S, the range covers the attribute values from o_i.S.A_k to o_j.S.A_k. Note that Cover() also applies to two non-leaf nodes R_i and R_j and to cases with more arguments. Then:

R₅.S = Cover(R₇.S, R₈.S, R₉.S) = (avg-price ∈ [35, 38]) ∧ (Rating ∈ [4.3, 4.6]) ∧ (Popularity ∈ [700, 1600])
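The Cover function can be sketched in Python as a per-attribute minimum and maximum over any number of summaries. Representing a summary as a dictionary of closed intervals, and the helper names cover and as_summary, are assumptions of this sketch; the attribute values are those of o₅ to o₈ given above.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # closed interval [low, high]


def as_summary(attrs: Dict[str, float]) -> Dict[str, Range]:
    """Turn an object's attribute-value pairs into degenerate ranges [v, v]."""
    return {attr: (value, value) for attr, value in attrs.items()}


def cover(*summaries: Dict[str, Range]) -> Dict[str, Range]:
    """Smallest per-attribute ranges that cover all the given summaries."""
    result: Dict[str, Range] = {}
    for summary in summaries:
        for attr, (lo, hi) in summary.items():
            if attr in result:
                cur_lo, cur_hi = result[attr]
                result[attr] = (min(cur_lo, lo), max(cur_hi, hi))
            else:
                result[attr] = (lo, hi)
    return result


o5 = as_summary({"avg-price": 37, "Rating": 4.5, "Popularity": 1400})
o6 = as_summary({"avg-price": 35, "Rating": 4.6})
o7 = as_summary({"Rating": 4.3, "Popularity": 700})
o8 = as_summary({"avg-price": 38, "Rating": 4.6, "Popularity": 1600})

r7 = cover(o5, o6)
print(r7)  # {'avg-price': (35, 37), 'Rating': (4.5, 4.6), 'Popularity': (1400, 1400)}
print(cover(r7, o7, o8))
# {'avg-price': (35, 38), 'Rating': (4.3, 4.6), 'Popularity': (700, 1600)}
```

Because taking minima and maxima is associative, applying cover to node summaries gives the same result as applying it to all underlying objects, which is why the same function serves leaf and non-leaf nodes.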

In I³, objects from different basic keyword cells may be stored in the same disk page to improve storage utilization. However, this means that when the objects of one basic keyword cell in a disk page are loaded into memory for processing, the other basic keyword cells in that page are loaded as well, which wastes time. In contrast, to improve query efficiency, a disk page in AI³ stores only the objects of a single basic keyword cell, so that no extraneous objects appear when the disk page is loaded into memory.

FIG. 3 illustrates the AI³ index constructed for the objects of FIG. 1. In FIG. 3, the keywords "cat" and "cafe" are both stored in the lookup table, and each of these two frequent keywords is linked to a quadtree in the header file. The attribute-value pairs of each quadtree node are used to prune unqualified tree branches. Each leaf node of a quadtree in the header file is linked to the related disk page of the data file, and the attribute-value pairs of the leaf nodes determine which disk pages need to be accessed.
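The pruning described above can be sketched as a breadth-first traversal of a quadtree held in memory. The Node class, its field names, the closed-interval representation, and the assignment of o₇ and o₈ to the leaves R8 and R9 are assumptions made for this sketch. As in the description, the root is enqueued without being checked, so a tree that consists of a single leaf would need separate handling.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Range = Tuple[float, float]


@dataclass
class Node:
    node_id: str
    summary: Dict[str, Range]                              # R_i.S as closed ranges
    children: List["Node"] = field(default_factory=list)   # empty for a leaf
    page: Optional[str] = None                             # disk page address of a leaf


def qualifies(query: Dict[str, Range], summary: Dict[str, Range]) -> bool:
    """All query attributes are present and every range pair intersects."""
    return all(attr in summary
               and not (hi < summary[attr][0] or summary[attr][1] < lo)
               for attr, (lo, hi) in query.items())


def qualifying_leaves(root: Node, query: Dict[str, Range]) -> List[Node]:
    pending = deque([root])    # queue of non-leaf nodes to be processed
    leaves: List[Node] = []    # queue of qualified leaf nodes
    while pending:
        node = pending.popleft()
        for child in node.children:
            if not qualifies(query, child.summary):
                continue       # prune the child together with its whole subtree
            if child.children:
                pending.append(child)
            else:
                leaves.append(child)
    return leaves


INF = float("inf")
r7 = Node("R7", {"avg-price": (35, 37), "Rating": (4.5, 4.6), "Popularity": (1400, 1400)}, page="p7")
r8 = Node("R8", {"Rating": (4.3, 4.3), "Popularity": (700, 700)}, page="p8")
r9 = Node("R9", {"avg-price": (38, 38), "Rating": (4.6, 4.6), "Popularity": (1600, 1600)}, page="p9")
r5 = Node("R5", {"avg-price": (35, 38), "Rating": (4.3, 4.6), "Popularity": (700, 1600)}, [r7, r8, r9])
root = Node("R1", dict(r5.summary), [r5])

query = {"avg-price": (-INF, 42), "Rating": (4.3, INF), "Popularity": (700, INF)}
print([leaf.node_id for leaf in qualifying_leaves(root, query)])  # ['R7', 'R9']
```

Only the disk pages of the returned leaves need to be read; R8 is skipped without any disk access because its summary has no avg-price attribute.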

Referring to FIG. 4, the algorithm illustrates the use of AI ³ The detailed implementation of the enhanced why-not space keyword top-k query processing method. The method comprises the step of subjecting AI ³ Index, initial query q, missing object set M, candidate keyword list CKS, candidate attribute value pair list CAS, basic optimization query q _b Penalty of (1) _c ，q _b Query result object k in _m As an input. The output is the best refined query q'.

Specifically, CKS is an ordered list of the keywords of the missing objects, arranged in descending order of keyword frequency, while CAS is an ordered list of the attribute-value pairs of the missing objects, arranged in descending order of the similarity scores of the missing objects. The two lists are constructed in advance, and the order in which the candidate keywords and candidate attribute-value pairs are processed plays an important role in obtaining the refined query. P_c is equal to cost(q, q_b) calculated using equation (2), where q_b is the basic refined query discussed previously.
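A minimal sketch of how the two candidate lists might be built is given below. The input representations, the tie-breaking rules and the removal of duplicate pairs are assumptions made so that the example is deterministic; the text above only fixes the two sort orders.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def build_cks(missing_docs: List[Set[str]]) -> List[str]:
    """Keywords of the missing objects in descending order of frequency.
    Ties are broken alphabetically (an assumption)."""
    freq = Counter(kw for doc in missing_docs for kw in doc)
    return sorted(freq, key=lambda kw: (-freq[kw], kw))


def build_cas(missing: List[Tuple[float, Dict[str, float]]]) -> List[Tuple[str, float]]:
    """Attribute-value pairs of the missing objects, taken object by object
    in descending order of each object's similarity score.
    Each input item is (similarity score, attribute -> value)."""
    ordered = sorted(missing, key=lambda item: -item[0])
    cas: List[Tuple[str, float]] = []
    for _, attrs in ordered:
        for pair in sorted(attrs.items()):
            if pair not in cas:          # skip duplicate pairs (an assumption)
                cas.append(pair)
    return cas


# Hypothetical missing objects.
cks = build_cks([{"cafe", "wifi"}, {"cafe", "cake"}])
cas = build_cas([(0.6, {"Rating": 4.3}),
                 (0.8, {"Rating": 4.6, "avg-price": 35})])
```

With these inputs, "cafe" comes first in CKS because it occurs in both missing objects, and the pairs of the higher-scoring missing object come first in CAS.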

The queue D, queue D', queue W, pointer TWord, pointer TNode and set RRS are initialized to empty (line 4). They hold, respectively, the eligible non-leaf nodes of the quadtree in the header file, the eligible leaf nodes, the keywords of the refined query, the keyword being processed, the quadtree node of the header file being accessed, and the set of objects satisfying the refined query. Next, q'.doc and q'.B' are set to q.doc₀ and q.B, respectively (line 5). The keywords in CKS and the attribute-value pairs in CAS are then fetched in order and added to q'.doc and q'.B', respectively, to form a new refined query, which is processed to find the best refined query; this continues until both CKS and CAS are empty.

Lines 7-38 show the processing steps for each refined query q'. First, a refined query q' is obtained by parameter modification. Specifically, the first keyword in CKS and the first attribute-value pair in CAS are taken out and added to q'.doc and q'.B', respectively (lines 7-8); here, the function Out_list(CKS) removes the first keyword from CKS and returns it, and Out_list(CAS) works in the same way; k' is set to k₀. The cost p' of q' is then calculated according to equation (2), so that refined queries costlier than q_b are filtered out as early as possible. If p' ≥ P_c, the loop is terminated (lines 10-11). Otherwise, the keywords of the refined query are enqueued to queue W (line 12) and the processing of q' continues (lines 13-29): each keyword in queue W is classified as frequent or infrequent according to its frequency, and the two cases are processed separately.
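The early-termination test can be illustrated with a hedged sketch of the cost computation. Equation (2) is not reproduced in this passage, so the weighted-sum form below is an assumption assembled from the four normalized terms the document names (k value, keywords, attribute types, attribute values); the averaging of the value term over n, the clamping of k' − k₀ at zero, the zero-denominator handling and all parameter names are likewise illustrative.

```python
from typing import Sequence, Tuple


def ratio(numerator: float, denominator: float) -> float:
    """Normalized term; treated as 0 when the normalizer is 0 (an assumption)."""
    return numerator / denominator if denominator else 0.0


def modification_cost(k_prime: int, k0: int, k_m: int,
                      delta_doc: int, doc_union_size: int,
                      delta_attr_types: int, attr_union_size: int,
                      value_changes: Sequence[Tuple[float, float]],
                      n: int,
                      betas: Tuple[float, float, float, float]) -> float:
    """Assumed form of the modification cost of a refined query.
    value_changes holds (|v_i' - v_i|, delta_v_i) for each changed attribute."""
    b1, b2, b3, b4 = betas
    k_term = ratio(max(0, k_prime - k0), k_m - k0)
    doc_term = ratio(delta_doc, doc_union_size)
    type_term = ratio(delta_attr_types, attr_union_size)
    value_term = ratio(sum(ratio(d, dv) for d, dv in value_changes), n)
    return b1 * k_term + b2 * doc_term + b3 * type_term + b4 * value_term


# Hypothetical numbers for one refined query.
p_prime = modification_cost(k_prime=8, k0=5, k_m=11,
                            delta_doc=1, doc_union_size=4,
                            delta_attr_types=1, attr_union_size=2,
                            value_changes=[(0.2, 1.0)], n=2,
                            betas=(0.25, 0.25, 0.25, 0.25))
P_c = 0.4
terminate_loop = p_prime >= P_c   # the test applied on lines 10-11
```

Under these assumptions the cost of the hypothetical query is below P_c, so the loop would continue and q' would be processed further.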

Lines 15-29 show the processing steps for frequent keywords. For the frequent keyword pointed to by TWord, the root node of its quadtree in the header file is pushed into queue D (line 16), and the eligible non-leaf nodes below it are then pushed into queue D for processing. This yields queue D', which retains the eligible leaf nodes for further processing.

When queue D is not empty, the elements in D are processed in the following order: 1) pop the head element of D and point to it with TNode (line 18); 2) for each child node n_s of TNode, check whether it meets the following requirements: a) all attribute categories of the refined query q' can be found in n_s.S; b) each attribute value range of q' intersects the corresponding attribute value range of n_s. A node meeting both requirements may contain a result object and is "eligible", so it requires further processing (line 20). Then, if n_s is a non-leaf node, n_s is enqueued to queue D; if n_s is a leaf node, n_s is enqueued to queue D' (lines 21-24).
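The traversal of queues D and D' can be sketched as a breadth-first search with pruning. The Node class, the representation of q'.B' as a dictionary of value ranges, and the sample tree are assumptions for illustration; only the two eligibility tests and the queue handling follow the description above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple

Range = Tuple[float, float]


@dataclass
class Node:
    id: str
    S: Dict[str, Range]                               # attribute summary
    children: List["Node"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children


def eligible(node: Node, query_B: Dict[str, Range]) -> bool:
    """a) every query attribute occurs in node.S;
    b) every query range intersects the node's range."""
    for attr, (q_lo, q_hi) in query_B.items():
        if attr not in node.S:
            return False
        n_lo, n_hi = node.S[attr]
        if q_hi < n_lo or n_hi < q_lo:
            return False
    return True


def collect_eligible_leaves(root: Node, query_B: Dict[str, Range]) -> Deque[Node]:
    D: Deque[Node] = deque([root])     # non-leaf nodes waiting to be expanded
    D_prime: Deque[Node] = deque()     # eligible leaf nodes
    while D:
        t_node = D.popleft()
        for n_s in t_node.children:
            if not eligible(n_s, query_B):
                continue               # prune the whole branch
            (D_prime if n_s.is_leaf else D).append(n_s)
    return D_prime


# Hypothetical quadtree: root -> {a, b}, b -> {c, d}.
c = Node("c", {"Rating": (4.3, 4.6)})
d = Node("d", {"Rating": (3.0, 3.5)})
a = Node("a", {"Rating": (4.0, 4.2)})
b = Node("b", {"Rating": (3.0, 4.6)}, [c, d])
root = Node("root", {"Rating": (3.0, 4.6)}, [a, b])

leaves = collect_eligible_leaves(root, {"Rating": (4.4, 5.0)})
```

In this example nodes a and d are pruned because their Rating ranges do not intersect [4.4, 5.0], so only the disk page linked from leaf c would need to be read.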

Next, queue D' is processed to obtain the result of query q'. When queue D' is not empty, the elements in D' are processed in the following order: 1) pop the head element (node) of D' and point to it with TNode (line 26); 2) for each object o_i of TNode, if the refined query attributes q'.B' and o_i.S satisfy attribute matching, the similarity score of object o_i is calculated according to equation (1) and the object is added to RRS (lines 27-29).

Lines 31-33 show the processing steps for infrequent keywords. For each object o_i in the disk page linked by TWord, if the refined query attributes q'.B' and o_i.S satisfy attribute matching, the similarity score of object o_i is calculated according to equation (1) and the object is added to RRS (lines 31-33).
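The per-object step shared by the frequent and infrequent cases (attribute matching, scoring, insertion into RRS) might look like the sketch below. Equation (1) is not reproduced in this passage; the linear combination of normalized distance proximity and text relevance weighted by α is an assumed form, and the text relevance value is supplied directly as a hypothetical input because its definition is not given here.

```python
import math
from typing import Dict, Tuple

Range = Tuple[float, float]
Point = Tuple[float, float]


def attributes_match(query_B: Dict[str, Range], obj_S: Dict[str, float]) -> bool:
    """True if the object has every query attribute with a value in range."""
    return all(attr in obj_S and lo <= obj_S[attr] <= hi
               for attr, (lo, hi) in query_B.items())


def similarity(q_loc: Point, o_loc: Point, d_max: float,
               text_relevance: float, alpha: float) -> float:
    """Assumed score: alpha weighs distance proximity against text relevance."""
    proximity = 1.0 - math.dist(q_loc, o_loc) / d_max
    return alpha * proximity + (1.0 - alpha) * text_relevance


# Hypothetical objects from one disk page.
page = [
    {"id": "o1", "loc": (3.0, 4.0), "S": {"Rating": 4.5}, "tr": 0.5},
    {"id": "o2", "loc": (1.0, 1.0), "S": {"Rating": 3.9}, "tr": 1.0},
]
query_B = {"Rating": (4.0, 5.0)}

RRS: Dict[str, float] = {}
for o in page:
    if attributes_match(query_B, o["S"]):
        RRS[o["id"]] = similarity((0.0, 0.0), o["loc"], d_max=10.0,
                                  text_relevance=o["tr"], alpha=0.5)
```

Object o2 is closer and textually more relevant, but it is excluded from RRS because its Rating falls outside the range required by the refined query.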

Next, all the objects in RRS are ranked by their similarity scores. Objects are taken from the top of the ranking until all original result objects and all missing objects have appeared, which gives the k' highest-scoring objects (line 34). If k' ≤ k_m, the cost p' of q' is calculated (line 36); if p' < P_c, P_c is updated to p' (lines 37-38). After all refined queries have been processed, the best refined query is obtained.
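The ranking step and the update of the best cost can be sketched as follows. The function names, the tie-breaking by object identifier and the sample scores are assumptions for illustration; the cost p' is passed in as a number because its formula is defined elsewhere in the document.

```python
from typing import Dict, Optional, Set, Tuple


def smallest_k_prime(scores: Dict[str, float], must_appear: Set[str]) -> Optional[int]:
    """Smallest k' such that the top-k' objects by score contain every
    original result object and every missing object; None if some required
    object is not in the scored set at all."""
    if not must_appear <= scores.keys():
        return None
    ranked = sorted(scores, key=lambda o: (-scores[o], o))  # ties by id (assumption)
    remaining = set(must_appear)
    for rank, obj in enumerate(ranked, start=1):
        remaining.discard(obj)
        if not remaining:
            return rank
    return 0  # reached only when there are no scored objects and nothing is required


def update_best(k_prime: Optional[int], k_m: int,
                p_prime: float, p_c: float) -> Tuple[float, bool]:
    """Return (new P_c, whether q' becomes the current best refined query)."""
    if k_prime is not None and k_prime <= k_m and p_prime < p_c:
        return p_prime, True
    return p_c, False


# Hypothetical scores: o1, o2 are original results; m1, m2 are missing objects.
scores = {"o1": 0.9, "o2": 0.8, "m1": 0.7, "o3": 0.6, "m2": 0.5}
k_prime = smallest_k_prime(scores, {"o1", "o2", "m1", "m2"})
new_p_c, accepted = update_best(k_prime, k_m=6, p_prime=0.3, p_c=0.4)
```

Here the last required object, m2, is ranked fifth, so k' is 5; since 5 ≤ k_m and the hypothetical cost 0.3 is below the current P_c of 0.4, the refined query is accepted and P_c is lowered.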

The embodiment of the invention also provides a system for solving the SKQ why-not problem using AI³, comprising:

a refined query module to: extract, in order, the keywords in CKS and the attribute-value pairs in CAS, and add them to the keyword set q'.doc and the attribute-value pairs q'.B' of the query q', respectively, to form a new refined query q'; process each refined query q' to find the best refined query until both CKS and CAS are empty; the processing of each refined query q' specifically including:

As a preferred embodiment, the AI³ index building module is specifically configured to:

each non-leaf node R_i of the quadtree contains three attributes: R_i.id, R_i.S and R_i.Address, wherein R_i.id is the node id, R_i.Address is the address list of all child nodes of R_i, and R_i.S is the union of the attribute-value pairs of all child nodes of R_i;

As a preferred embodiment, B is a boolean expression:

is a set of predicates, where i ∈ [1, n], i ∈ N*.

As a preferred embodiment, if the keyword is a frequent keyword, the refined query module adds the root node of the quadtree in the header file to a to-be-processed non-leaf node queue, selects a leaf node meeting the condition according to a preset screening rule, and adds the leaf node to the leaf node queue meeting the condition, specifically including the following steps:

judging whether a sub node of a current node in a non-leaf node queue to be processed is a qualified node or not;

As a preferred embodiment, the refined query module determines whether a child node of a current node in the to-be-processed non-leaf node queue is a qualified node, where the determination criterion is:

a) all attribute classes of query q' are on this child node;

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for solving the SKQ why-not problem using AI³, characterized by comprising the following steps:

obtaining all objects o and constructing the AI³ index;

obtaining an initial query q = (q.loc, q.doc₀, q.B, k, α) and a missing object set M; constructing a candidate keyword list CKS in descending order of the frequency of the keywords of the missing objects, and constructing a candidate attribute-value pair list CAS in descending order of the similarity scores of the missing objects; setting the keyword set q'.doc and the attribute-value pairs q'.B' of the refined query q' to q.doc₀ and q.B, respectively;

extracting, in order, the keywords in CKS and the attribute-value pairs in CAS, and adding them to the keyword set q'.doc and the attribute-value pairs q'.B' of the query q', respectively, to form a new refined query q'; processing each refined query q' to find the best refined query until both CKS and CAS are empty;

processing each refined query q' respectively, specifically including:

if k' ≤ k_m, where k_m is the size of the result set when the initial query keywords and attributes are preserved and all missing objects appear in the query result, computing the modification cost p' of q'; and if p' < p_c, taking the query q' as the current best refined query.

2. The method of claim 1, wherein obtaining all objects o and constructing the AI³ index specifically comprises the following steps:

storing the attribute information of the basic keyword units of the frequent keywords in the leaf nodes of the quadtree;

each non-leaf node R_i of the quadtree contains three attributes: R_i.id, R_i.S and R_i.Address, wherein R_i.id is the node id, R_i.Address is the address list of all child nodes of R_i, and R_i.S is the union of the attribute-value pairs of all child nodes of R_i;

each leaf node R_i of the quadtree contains three attributes: R_i.id, R_i.S and R_i.Address, wherein R_i.id is the node id, R_i.Address is the address of the disk page to which it is linked, and R_i.S is the union of the attribute-value pairs of all objects in the disk page to which it is linked.

3. The method of claim 1, wherein: b is a Boolean expression:

is a predicate set, where i ∈ [1, n], i ∈ N*.

4. The method of claim 1, wherein the modification cost p' of q' is calculated by the following formula:

wherein β₁, β₂, β₃ and β₄ respectively represent the weights of the k value, the keywords, the attribute types and the attribute values in the cost function; β_i ≥ 0 and

k' is the size of the query result set of the refined query q', k₀ is the size of the result set of the initial query q, and k_m is the size of the result set when the initial query keywords and attributes are preserved and all missing objects appear in the query result; k_m − k₀ is used to normalize k' − k₀; Δdoc is the number of keywords that need to be changed to adjust q.doc₀ to q'.doc,

wherein the missing object set M = {m₁, m₂, ..., m_j}, and |q.doc₀ ∪ M.doc| is used to normalize Δdoc; ΔA_n is the number of attribute types that need to be changed to adjust from the initial query to the refined query, and |q.B ∪ M.B| is used to normalize ΔA_n;

n is the total number of attributes contained in q.B and M.B; Δv_i is the maximum difference between the values of attribute A_i over all objects containing that attribute; |v_i' − v_i| is the absolute value of the difference between the current query value v_i' and the initial query value v_i of attribute A_i, with |v_i' − v_i| ≤ Δv_i, and Δv_i is used to normalize |v_i' − v_i|.

5. The method of claim 1, wherein: calculating a similarity score between the query q and the object o, wherein the calculation formula is as follows:

where α is a variable between 0 and 1 defining the relative importance of distance proximity and text relevance, d(q.loc, o.loc) denotes the Euclidean distance between query q and object o, and d_max(q.loc, O.loc) denotes the maximum distance from the query point q to the objects in the object set O, that is, the largest of the distances between q and all objects in O.

6. The method of claim 2, wherein: if the keyword is a frequent keyword, adding a root node of the quadtree in the header file into a to-be-processed non-leaf node queue, selecting a leaf node meeting the conditions according to a preset screening rule, and adding the leaf node meeting the conditions into the leaf node queue, wherein the method specifically comprises the following steps of:

if the node is a non-leaf node, adding the non-leaf node into a to-be-processed non-leaf node queue to wait for processing; if the leaf node is the leaf node, adding the leaf node into the leaf node queue meeting the conditions.

7. The method of claim 6, wherein: judging whether the sub-node of the current node in the non-leaf node queue to be processed is a qualified node or not, wherein the judgment standard is as follows:

a) all attribute classes of query q' are on this child node;

8. A system for solving the SKQ why-not problem using AI³, characterized by comprising:

a candidate list construction module to: obtain an initial query q = (q.loc, q.doc₀, q.B, k, α) and a missing object set M; construct a candidate keyword list CKS in descending order of the frequency of the keywords of the missing objects, and construct a candidate attribute-value pair list CAS in descending order of the similarity scores of the missing objects; and set the keyword set q'.doc and the attribute-value pairs q'.B' of the refined query q' to q.doc₀ and q.B, respectively;

a refined query module to: extract, in order, the keywords in CKS and the attribute-value pairs in CAS, and add them to the keyword set q'.doc and the attribute-value pairs q'.B' of the query q', respectively, to form a new refined query q'; process each refined query q' to find the best refined query until both CKS and CAS are empty; the processing of each refined query q' specifically including:

if k' ≤ k_m, where k_m is the size of the result set when the initial query keywords and attributes are preserved and all missing objects appear in the query result, computing the modification cost p' of q'; and if p' < p_c, taking the query q' as the current best refined query.

9. The system of claim 8, wherein the AI³ index building module is specifically configured to:

each leaf node R_i of the quadtree contains three attributes: R_i.id, R_i.S and R_i.Address, wherein R_i.id is the node id, R_i.Address is the address of the disk page to which it is linked, and R_i.S is the union of the attribute-value pairs of all objects in the disk page to which it is linked.

10. The system of claim 9, wherein: b is a Boolean expression:

is a predicate set, where i ∈ [1, n], i ∈ N*.

11. The system of claim 9, wherein: if the keyword is a frequent keyword, the refined query module adds the root node of the quadtree in the header file to a to-be-processed non-leaf node queue, selects the leaf nodes meeting the conditions according to a preset screening rule, and adds them to the queue of leaf nodes meeting the conditions, which specifically comprises the following steps:

if the keywords are frequent keywords, adding the root nodes of the quadtree in the header file into a to-be-processed non-leaf node queue;

12. The system of claim 11, wherein the refined query module judges whether a child node of the current node in the to-be-processed non-leaf node queue is a qualified node, and the judgment criteria are as follows:

a) all attribute classes of query q' are on this child node;