CN110134758A

CN110134758A - An indexing method for continuous spatial-fuzzy keyword queries

Info

Publication number: CN110134758A
Application number: CN201910346372.2A
Authority: CN
Inventors: 邓泽; 王力哲; 褚军德; 陈云亮; 陈小岛
Original assignee: China University of Geosciences
Current assignee: China University of Geosciences
Priority date: 2019-04-26
Filing date: 2019-04-26
Publication date: 2019-08-16

Abstract

The present invention provides an indexing method for continuous spatial-fuzzy keyword queries. A query statement is entered in a search window, an index structure is created from the query statement, and the spatial-text stream data is filtered according to the index structure. While the spatial-text stream data is being filtered, one-permutation hash signature processing is applied to the query statement in parallel across multiple threads, and similar tags corresponding to the query statement are obtained from the spatial-text stream data. The similar tags are embedded in the adaptive index structure AP-tree and distinguished according to the computed one-permutation hash signatures to obtain an optimal tag, and the data corresponding to the optimal tag is output as the final search result. A big-kernel communication strategy is added to the data communication between the GPU and the CPU to accelerate the search; the data communication is divided into four stages: prefetch address generation, data assembly, data transfer, and kernel computation. The beneficial effects of the invention are that the space overhead and time overhead of the index structure are reduced and the search time is shortened.

Description

An indexing method for continuous spatial-fuzzy keyword queries

Technical field

The present invention relates to the technical field of index queries over spatial-text stream data, and in particular to an indexing method for continuous spatial-fuzzy keyword queries.

Background art

With the arrival and rapid development of the mobile internet era, the number of application software products based on LBS (Location-Based Services) has increased significantly, and their use produces massive volumes of spatial-text stream data. Processing this data with efficient analysis techniques can bring great convenience to people's lives. However, massive spatial-text stream data also brings many challenges: huge data volume, growing query latency, and data redundancy.

Traditional index query methods for spatial-text stream data fall roughly into three classes: text-first indexing methods (RQ-tree, etc.), space-first indexing methods (IQ-tree, Rt-tree, etc.), and adaptive indexing methods based on both location information and text information (AP-tree). However, existing indexing methods face two problems in current applications. First, these index structures lack support for approximate text queries: when searching for spatial-text objects, a user may enter inaccurate text because of input errors or for other reasons, and fuzzy keyword queries then become particularly important. Second, the traditional search algorithms for spatial-text stream data described above are implemented on the CPU; as the data scale grows, they can no longer meet software's demand for real-time, efficient data processing.

Therefore, a new indexing method needs to be studied, one that achieves efficient indexing while meeting query performance requirements and that reduces query time and space overhead as far as possible.

Summary of the invention

To solve the problems that, as the scale of spatial-text stream data grows, prior-art query times become too long to meet software's demand for real-time, efficient data processing, and that fuzzy text queries are not supported, the present invention provides an indexing method for continuous spatial-fuzzy keyword queries, which mainly comprises the following steps:

S101: a query statement is entered in a search window, an index structure is created from the query statement, and the spatial-text stream data is filtered according to the index structure;

S102: while the spatial-text stream data is being filtered, one-permutation hash signature processing is applied to the query statement in parallel across multiple threads, and similar tags corresponding to the query statement are obtained from the spatial-text stream data;

S103: the similar tags are embedded in the adaptive index structure AP-tree and distinguished according to the computed one-permutation hash signatures; the similar tag most similar to the query statement is taken as the optimal tag, and the data in the spatial-text stream data corresponding to the optimal tag is output as the final search result.

Further, while the spatial-text stream data is being filtered, data communication takes place between the GPU and the CPU, and a big-kernel communication strategy is added to this communication to obtain the information relevant to the query statement, thereby accelerating the filtering;

The data communication is divided into four stages:

(1) Prefetch address generation: the GPU side allocates a block of memory and generates the prefetch addresses; this block serves as the address buffer. The address buffer stores the addresses of the query statements for which the GPU threads need to compute one-permutation hash signatures;

(2) Data assembly: using the addresses of the data to be prefetched that are stored in the address buffer, the query statements are located and assembled into the prefetch buffer on the CPU side;

(3) Data transfer: the query statements in the CPU-side prefetch buffer are transferred to the data buffer on the GPU side through the DMA mechanism;

(4) Kernel computation: the GPU threads compute the one-permutation hash signatures from the query statements.
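The four stages above can be pictured with a small CPU-only simulation. This is only a sketch under stated assumptions: it runs no GPU code, the function names (`generate_prefetch_addresses`, `assemble_prefetch_buffer`, `transfer_to_device`, `compute`, `big_kernel_pipeline`) are hypothetical, the DMA transfer is stood in for by a list copy, and the built-in `len` is used as a placeholder kernel where the real method would compute a one-permutation hash signature.

```python
def generate_prefetch_addresses(num_threads, batch_start):
    """Stage 1: the (simulated) GPU side fills an address buffer, one slot per thread."""
    return [batch_start + t for t in range(num_threads)]


def assemble_prefetch_buffer(address_buffer, host_queries):
    """Stage 2: the CPU side gathers the addressed queries into a contiguous buffer."""
    return [host_queries[a] for a in address_buffer if a < len(host_queries)]


def transfer_to_device(prefetch_buffer):
    """Stage 3: stand-in for the DMA copy into the GPU-side data buffer."""
    return list(prefetch_buffer)


def compute(data_buffer, kernel):
    """Stage 4: each simulated GPU thread applies the kernel to its own item."""
    return [kernel(item) for item in data_buffer]


def big_kernel_pipeline(host_queries, num_threads, kernel):
    """Run the four stages batch by batch until every query has been processed."""
    results = []
    for start in range(0, len(host_queries), num_threads):
        addresses = generate_prefetch_addresses(num_threads, start)
        staged = assemble_prefetch_buffer(addresses, host_queries)
        device_data = transfer_to_device(staged)
        results.extend(compute(device_data, kernel))
    return results


# Hypothetical query statements; `len` is only a placeholder for the signature kernel.
queries = ["coffee shop", "cofee shop", "book store", "bok store", "gas station"]
print(big_kernel_pipeline(queries, num_threads=2, kernel=len))
```

In a real GPU implementation the stages of successive batches would overlap, which is where the acceleration comes from; the sequential loop here only shows the order in which data moves.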

Further, each thread computes the one-permutation hash signature of the text information of one spatial-text object.

Further, the adaptive index structure AP-tree is created by efficient heuristic keyword-partitioning and space-partitioning algorithms combined with a cost model; the cost model quantitatively measures the matching cost of the two partitioning methods, and the partitioning method with the lower cost is then selected.
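The text does not give the cost model itself, so the following is only a sketch of the selection step under an assumed model: the cost of a partition is taken to be the expected number of queries that must be checked, i.e. the sum over buckets of the bucket's hit probability times its size. The function names, the query dictionary layout, and the probability tables are all hypothetical.

```python
from collections import defaultdict


def partition_cost(buckets, hit_probability):
    """Assumed cost: sum over buckets of P(bucket is hit) * number of queries in it."""
    return sum(hit_probability[key] * len(qs) for key, qs in buckets.items())


def keyword_partition(queries, offset):
    """Bucket queries by their keyword at position `offset`."""
    buckets = defaultdict(list)
    for q in queries:
        buckets[q["keywords"][offset]].append(q)
    return buckets


def space_partition(queries):
    """Bucket queries by the spatial cell they fall in."""
    buckets = defaultdict(list)
    for q in queries:
        buckets[q["cell"]].append(q)
    return buckets


def choose_partition(queries, offset, keyword_prob, cell_prob):
    """Return the cheaper partitioning method together with its cost."""
    ck = partition_cost(keyword_partition(queries, offset), keyword_prob)
    cs = partition_cost(space_partition(queries), cell_prob)
    return ("keyword", ck) if ck <= cs else ("space", cs)


# Hypothetical queries: three share a keyword but are spread over four cells.
queries = [
    {"keywords": ["cafe"], "cell": "C1"},
    {"keywords": ["cafe"], "cell": "C2"},
    {"keywords": ["cafe"], "cell": "C3"},
    {"keywords": ["bar"], "cell": "C4"},
]
keyword_prob = {"cafe": 0.5, "bar": 0.5}
cell_prob = {"C1": 0.25, "C2": 0.25, "C3": 0.25, "C4": 0.25}
print(choose_partition(queries, 0, keyword_prob, cell_prob))
```

With these made-up numbers the keyword partition costs 2.0 and the space partition costs 1.0, so the space partition is selected; a skewed spatial distribution would tip the choice the other way.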

Further, a depth-first search strategy, accessed through recursive calls, is used to distinguish the similar tags and thereby obtain the final search result.

The technical solution provided by the invention has the following beneficial effects: the space overhead and time overhead of the index structure are reduced, and the search time is shortened.

Brief description of the drawings

The invention is further described below with reference to the accompanying drawings and embodiments. In the drawings:

Fig. 1 is a flow chart of an indexing method for continuous spatial-fuzzy keyword queries in an embodiment of the invention;

Fig. 2 is a schematic diagram of the big-kernel strategy in an embodiment of the invention;

Fig. 3 is a schematic diagram of the one-permutation hashing method in an embodiment of the invention;

Fig. 4 is a structure diagram of the spatial-fuzzy keyword query index in an embodiment of the invention.

Detailed description of the embodiments

For a clearer understanding of the technical features, objects, and effects of the invention, specific embodiments of the invention are now described in detail with reference to the drawings.

An embodiment of the invention provides an indexing method for continuous spatial-fuzzy keyword queries.

Referring to Fig. 1, which is a flow chart of an indexing method for continuous spatial-fuzzy keyword queries in an embodiment of the invention, the method specifically comprises the following steps:

S102: while the spatial-text stream data is being filtered, one-permutation hash signature (a permutation-based hash tag) processing is applied to the query statement in parallel across multiple threads, and similar tags corresponding to the query statement are obtained from the spatial-text stream data;

As shown in Fig. 2, while the spatial-text stream data is being filtered, data communication takes place between the GPU and the CPU, and a big-kernel communication strategy is added to this communication to obtain the information relevant to the query statement, thereby accelerating the filtering. The data communication is divided into four stages:

(4) Kernel computation: the GPU threads compute the one-permutation hash signatures from the query statements;

The text information in each spatial-text object is first extracted, and each thread then computes the one-permutation hash signature of the text information of one spatial-text object. As shown in Fig. 3, assume there are two keyword sets V₁ and V₂, where the original index of V₁ is π(V₁) = {0, 5, 8} and the original index of V₂ is π(V₂) = {1, 6, 8}. A binary D-dimensional matrix is set up: its first column lists the features, and in the second and third columns a 1 indicates that the original index contains the feature in the first column, while a 0 indicates that it does not.

The columns of the D-dimensional matrix are divided evenly into k parts (bins). In this embodiment k = 3, so the matrix is divided evenly into three parts: bin1, bin2, and bin3. The first non-zero entry of the second column and of the third column within each of bin1, bin2, and bin3 is marked, forming the new indices π(V₁) = {0, 2, 2} and π(V₂) = {1, 3, 2}. The new index is computed as follows: for each bin, take the original index of the first non-zero entry in that bin and subtract the product of the total number of bins and the number of the bin to which that entry belongs, i.e. π(V₁) = [0 - 3×0, 5 - 3×1, 8 - 3×2] = [0, 2, 2] and π(V₂) = [1 - 3×0, 6 - 3×1, 8 - 3×2] = [1, 3, 2];

The similarity of π(V₁) and π(V₂) is the number of bins in which the two new indices agree divided by the total number of bins k; in this embodiment, the similarity of π(V₁) and π(V₂) is 1/3.
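The worked example can be reproduced with a few lines of Python. This is a minimal sketch, not the patented GPU implementation: the dimension D is not stated in the text, so D = 12 (three bins of four positions each) is assumed because it is consistent with the bin assignments in the example, and the function names are hypothetical. The offset follows the formula given above (index minus k times the bin number).

```python
from fractions import Fraction


def oph_signature(indices, dim, k):
    """Return one value per bin: the first non-zero index in that bin,
    reduced by k * bin_number as in the worked example (None if the bin is empty)."""
    width = dim // k  # positions per bin; dim is assumed to be divisible by k
    signature = [None] * k
    for idx in sorted(indices):
        b = idx // width  # bin number, counted from 0
        if signature[b] is None:  # keep only the first non-zero entry per bin
            signature[b] = idx - k * b
    return signature


def oph_similarity(sig_a, sig_b):
    """Fraction of bins in which the two signatures agree."""
    k = len(sig_a)
    same = sum(1 for a, b in zip(sig_a, sig_b) if a is not None and a == b)
    return Fraction(same, k)


# dim=12 is an assumption; the index sets are the ones from the example.
sig1 = oph_signature({0, 5, 8}, dim=12, k=3)
sig2 = oph_signature({1, 6, 8}, dim=12, k=3)
print(sig1, sig2, oph_similarity(sig1, sig2))  # [0, 2, 2] [1, 3, 2] 1/3
```

Only the third bin holds the same value in both signatures, which gives the similarity of 1/3 stated in the embodiment.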

S103: the similar tags are embedded in the adaptive index structure AP-tree and distinguished according to the computed one-permutation hash signatures; the similar tag most similar to the query statement is taken as the optimal tag, and the data in the spatial-text stream data corresponding to the optimal tag is output as the final search result. The adaptive index structure AP-tree is created by efficient heuristic keyword-partitioning and space-partitioning algorithms combined with a cost model; the cost model quantitatively measures the matching cost of the two partitioning methods, and the partitioning method with the lower cost is selected. The original text information of the keyword nodes, query nodes, and space nodes is replaced throughout by one-permutation hash signatures;

As shown in Fig. 4, if the number of query keywords Q into which the query statement is split does not exceed a preset threshold, or the query keywords Q cannot be divided further by keyword partition or space partition, all of the query keywords Q remain in the q node;

If a query keyword Q can be divided further by keyword partition or space partition, the query keywords Q passed down from the parent node can be set as a query node q, a keyword node k, or a space node s. The index structure created from the query statement is a tree structure comprising query nodes q, text nodes k, and space nodes s: the query nodes q include q₁, q₂, ..., q₁₀, the text nodes k include k₁-node and k₂-node, and the space nodes s include s₁-node and s₂-node. S₁, S₂, ..., S₈ denote the hash tags produced by one-permutation hash signature processing. C₁, C₂, C₃, and C₄ denote the four spatial regions of each space node s in the index structure.

If the division is made by keyword partition, the offset l of the keyword partition is assigned to the keyword partition cost Ck in the q node, where offset l means that the l-th keyword of the queries in the q node is used for keyword partitioning, and the cost of using the keyword partition method is recorded. Similarly, if the division is made by space partition, the offset m of the space partition is assigned to the space partition cost Cs in the q node, where offset m means that the m-th keyword of the queries in the q node is used for space partitioning, and the cost of using the space partition method is recorded. The current node N is then built into an s node or a k node, and the queries in the q node are moved to the relevant child nodes and processed further by the method above. A depth-first search strategy, accessed through recursive calls, is used to distinguish the similar tags and obtain the final search result.
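The recursive depth-first traversal can be sketched as follows. This is an illustration under stated assumptions rather than the AP-tree algorithm itself: the `Node` class, its fields, and the `best_tag` function are hypothetical, signatures are plain lists compared bin by bin, and the traversal visits every node without the keyword or spatial pruning a real index would apply.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                        # "q", "k" or "s"
    children: list = field(default_factory=list)     # child nodes (k and s nodes)
    signatures: dict = field(default_factory=dict)   # tag -> signature (q nodes only)


def similarity(sig_a, sig_b):
    """Fraction of positions in which two equal-length signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def best_tag(node, target):
    """Depth-first, recursive search; returns (similarity, tag) of the best match."""
    best = (-1.0, None)
    if node.kind == "q":
        for tag, sig in node.signatures.items():
            best = max(best, (similarity(sig, target), tag), key=lambda p: p[0])
    for child in node.children:
        best = max(best, best_tag(child, target), key=lambda p: p[0])
    return best


# Hypothetical tree: a k node whose children are an s node (with one q leaf) and a q leaf.
leaf_a = Node("q", signatures={"S1": [0, 2, 2], "S2": [1, 3, 2]})
leaf_b = Node("q", signatures={"S3": [1, 3, 0]})
root = Node("k", children=[Node("s", children=[leaf_a]), leaf_b])
print(best_tag(root, [1, 3, 2]))
```

The tag returned for the target signature plays the role of the optimal tag; the data associated with it would then be output as the search result.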

The beneficial effects of the invention are that the space overhead and time overhead of the index structure are reduced and the search time is shortened.

The foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims

1. An indexing method for continuous spatial-fuzzy keyword queries, characterized by comprising the following steps:

S101: a query statement is entered in a search window, an index structure is created from the query statement, and the spatial-text stream data is filtered according to the index structure;

S102: while the spatial-text stream data is being filtered, one-permutation hash signature processing is applied to the query statement in parallel across multiple threads, and similar tags corresponding to the query statement are obtained from the spatial-text stream data;

2. The indexing method for continuous spatial-fuzzy keyword queries according to claim 1, characterized in that: in step S102, while the spatial-text stream data is being filtered, data communication takes place between the GPU and the CPU, and a big-kernel communication strategy is added to this communication to obtain the information relevant to the query statement, thereby accelerating the filtering;

The data communication is divided into four stages:

(1) Prefetch address generation: the GPU side allocates a block of memory and generates the prefetch addresses; this block serves as the address buffer. The address buffer stores the addresses of the query statements for which the GPU threads need to compute one-permutation hash signatures;

(2) Data assembly: using the addresses of the data to be prefetched that are stored in the address buffer, the query statements are located and assembled into the prefetch buffer on the CPU side;

(3) Data transfer: the query statements in the CPU-side prefetch buffer are transferred to the data buffer on the GPU side through the DMA mechanism;

(4) Kernel computation: the GPU threads compute the one-permutation hash signatures from the query statements.

3. The indexing method for continuous spatial-fuzzy keyword queries according to claim 2, characterized in that: each thread computes the one-permutation hash signature of the text information of one spatial-text object.

4. The indexing method for continuous spatial-fuzzy keyword queries according to claim 1, characterized in that: in step S103, the adaptive index structure AP-tree is created by efficient heuristic keyword-partitioning and space-partitioning algorithms combined with a cost model; the cost model quantitatively measures the matching cost of the two partitioning methods, and the partitioning method with the lower cost is then selected.

5. The indexing method for continuous spatial-fuzzy keyword queries according to claim 1, characterized in that: in step S103, a depth-first search strategy, accessed through recursive calls, is used to distinguish the similar tags and thereby obtain the final search result.