CN110134758A - An indexing method for continuous spatial-fuzzy keyword queries - Google Patents

Publication number
CN110134758A
Authority
CN
China
Prior art keywords
space
data
query statement
index structure
indexing method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910346372.2A
Other languages
Chinese (zh)
Inventor
邓泽
王力哲
褚军德
陈云亮
陈小岛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences
Priority to CN201910346372.2A
Publication of CN110134758A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation

Abstract

The present invention provides an indexing method for continuous spatial-fuzzy keyword queries. A query statement is entered in a search window, an index structure is created from the query statement, and spatio-textual stream data is filtered according to the index structure. While the spatio-textual stream data is being filtered, one-permutation hash signature processing is applied to the query statement in parallel across multiple threads, and similar tags corresponding to the query statement are obtained from the spatio-textual stream data. The similar tags are embedded in the adaptive index structure AP-tree and distinguished on the basis of the one-permutation hash signature computation to obtain an optimal tag, and the data corresponding to the optimal tag is output as the final search result. A big-kernel communication strategy is added to the data communication between the GPU and the CPU to accelerate the search; the data communication is divided into four stages: prefetch address generation, data assembly, data transfer, and kernel computation. The beneficial effects of the invention are that the space and time overhead of the index structure is reduced and search time is shortened.

Description

An indexing method for continuous spatial-fuzzy keyword queries
Technical field
The present invention relates to the technical field of index queries over spatio-textual stream data, and in particular to an indexing method for continuous spatial-fuzzy keyword queries.
Background
With the arrival and rapid development of the mobile internet era, the number of application software products based on LBS (Location-Based Services) has grown significantly. These applications generate massive amounts of spatio-textual stream data, and processing this data with efficient analysis techniques can bring great convenience to people's lives. However, massive spatio-textual stream data also brings many challenges: huge data volume, growing query latency, and data redundancy.
Traditional index query methods for spatio-textual stream data fall roughly into three categories: text-first indexing methods (RQ-tree, etc.), space-first indexing methods (IQ-tree, Rt-tree, etc.), and adaptive indexing methods based on both location information and text information (AP-tree). However, existing indexing methods face two problems in current applications. First, these index structures lack support for approximate text queries: when searching for spatio-textual objects, a user may enter text inaccurately because of typing errors or for other reasons, and in such cases fuzzy keyword queries become particularly important. Second, the traditional spatio-textual stream data search algorithms above are implemented on the CPU; as the data scale grows, traditional query methods can no longer meet software requirements for real-time and efficient data processing.
Therefore, a new indexing method needs to be studied that achieves efficient indexing while satisfying query performance, and that reduces query time and space overhead as far as possible.
Summary of the invention
To solve the problems that, in the prior art, query time becomes too long as the scale of spatio-textual stream data grows, that software requirements for real-time and efficient data processing cannot be met, and that fuzzy text queries are not supported, the present invention provides an indexing method for continuous spatial-fuzzy keyword queries, which mainly comprises the following steps:
S101: a query statement is entered in a search window, an index structure is created from the query statement, and the spatio-textual stream data is filtered according to the index structure;
S102: while the spatio-textual stream data is being filtered, one-permutation hash signature processing is applied to the query statement in parallel across multiple threads, and similar tags corresponding to the query statement are obtained from the spatio-textual stream data;
S103: the similar tags are embedded in the adaptive index structure AP-tree; based on the one-permutation hash signature computation, the similar tags are distinguished and the similar tag most similar to the query statement is taken as the optimal tag; the data in the spatio-textual stream data corresponding to the optimal tag is then output as the final search result.
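The signature step in S102 can be illustrated with a minimal Python sketch. It follows the general one-permutation hashing idea (one fixed permutation, k equal-width bins, smallest position kept per bin); the function names, the `EMPTY` marker, the seeded permutation and the choice to ignore bins that are empty in both signatures are assumptions made for illustration, not details taken from the patent. Python threads only show the one-object-per-task structure; the patent runs this work on GPU threads.

```python
import random
from concurrent.futures import ThreadPoolExecutor

EMPTY = None  # hypothetical marker for a bin that holds no feature


def make_permutation(dim, seed=0):
    """Return one fixed random permutation of the feature ids 0..dim-1."""
    perm = list(range(dim))
    random.Random(seed).shuffle(perm)
    return perm


def oph_signature(features, perm, k):
    """Keep, for each of k equal-width bins, the smallest permuted position
    that falls in the bin, stored as an offset inside that bin."""
    dim = len(perm)
    if dim % k != 0:
        raise ValueError("dim must be divisible by k")
    width = dim // k
    sig = [EMPTY] * k
    for f in features:
        b, offset = divmod(perm[f], width)
        if sig[b] is EMPTY or offset < sig[b]:
            sig[b] = offset
    return sig


def estimate_similarity(sig_a, sig_b):
    """Matching bins divided by the bins that are non-empty in at least one
    signature (assumption: jointly empty bins carry no information)."""
    pairs = list(zip(sig_a, sig_b))
    matches = sum(1 for a, b in pairs if a is not EMPTY and a == b)
    both_empty = sum(1 for a, b in pairs if a is EMPTY and b is EMPTY)
    usable = len(pairs) - both_empty
    return matches / usable if usable else 0.0


def sign_all(feature_sets, perm, k, workers=4):
    """Compute one signature per feature set, one task per set."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda fs: oph_signature(fs, perm, k), feature_sets))
```

Because every object is signed with the same permutation, two signatures can be compared bin by bin without revisiting the original keywords.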
Further, data communication takes place between the GPU and the CPU while the spatio-textual stream data is being filtered, and a big-kernel communication strategy is added to this data communication to speed up filtering and obtain information relevant to the query statement;
The data communication is divided into four stages:
(1) Prefetch address generation: the GPU side allocates a block of memory and generates prefetch addresses, which serves as the address buffer; the address buffer stores the addresses of the query statements for which GPU threads need to compute the one-permutation hash signature;
(2) Data assembly: using the prefetch addresses stored in the address buffer, the query statements are located and assembled into the prefetch buffer on the CPU side;
(3) Data transfer: the query statements in the prefetch buffer on the CPU side are transferred to the data buffer on the GPU side by the DMA mechanism;
(4) Kernel computation: GPU threads compute the one-permutation hash signature of the query statements.
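The four stages above can be mimicked on the host with a short Python sketch. Everything here is a stand-in: list indexes play the role of addresses, a list copy plays the role of the DMA transfer, and `signature_fn` is a hypothetical placeholder for the signature kernel. The sketch also runs the stages one after another, whereas a pipelined implementation would overlap them across chunks.

```python
def prefetch_addresses(start, stop):
    """Stage 1: the addresses (here, list indexes) the kernel will need."""
    return list(range(start, stop))


def assemble(host_store, addresses):
    """Stage 2: gather the addressed items into a contiguous prefetch buffer."""
    return [host_store[a] for a in addresses]


def transfer(prefetch_buffer):
    """Stage 3: stand-in for the DMA copy into the device-side data buffer."""
    return list(prefetch_buffer)


def kernel(data_buffer, signature_fn):
    """Stage 4: compute one result per buffered item."""
    return [signature_fn(item) for item in data_buffer]


def run_in_chunks(host_store, chunk_size, signature_fn):
    """Push the whole store through the four stages, one chunk at a time."""
    results = []
    for start in range(0, len(host_store), chunk_size):
        stop = min(start + chunk_size, len(host_store))
        addresses = prefetch_addresses(start, stop)
        prefetch_buffer = assemble(host_store, addresses)
        data_buffer = transfer(prefetch_buffer)
        results.extend(kernel(data_buffer, signature_fn))
    return results
```

Only the items named in the address buffer are gathered and copied, which is the point of generating the addresses before any data moves.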
Further, each thread computes the one-permutation hash signature of the text information of one spatio-textual object.
Further, the adaptive index structure AP-tree is created using efficient heuristic keyword-partition and space-partition algorithms combined with a cost model; the cost model quantitatively measures the matching cost of the two partition methods, and the partition method with the smaller cost is selected for the division.
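The patent does not spell out the cost model, so the following Python sketch uses a hypothetical one to show the selection step only: the cost of a partition is taken to be the expected number of queries an incoming object must be checked against, i.e. the sum over buckets of hit probability times query count. The bucket representation and both function names are assumptions.

```python
def matching_cost(buckets):
    """Expected number of queries checked per object.

    buckets: iterable of (hit_probability, query_count) pairs,
    one pair per keyword bucket or spatial cell.
    """
    return sum(p * n for p, n in buckets)


def choose_partition(keyword_buckets, space_buckets):
    """Return the cheaper partition method together with its cost."""
    ck = matching_cost(keyword_buckets)  # keyword-partition cost
    cs = matching_cost(space_buckets)    # space-partition cost
    return ("keyword", ck) if ck <= cs else ("space", cs)
```

For example, keyword buckets `[(0.5, 10), (0.25, 4)]` cost 6.0 while spatial cells `[(0.5, 8), (0.5, 8)]` cost 8.0, so the keyword partition would be chosen for that node.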
Further, a depth-first search strategy, implemented by recursive calls, is used to distinguish the similar tags and obtain the final search result.
The technical solution provided by the invention has the following beneficial effects: the space and time overhead of the index structure is reduced and search time is shortened.
Brief description of the drawings
The invention is further described below with reference to the drawings and embodiments, in which:
Fig. 1 is a flowchart of an indexing method for continuous spatial-fuzzy keyword queries in an embodiment of the invention;
Fig. 2 is a schematic diagram of the principle of the big-kernel strategy in an embodiment of the invention;
Fig. 3 is a schematic diagram of the one-permutation hashing method in an embodiment of the invention;
Fig. 4 is a diagram of the spatial-fuzzy keyword search index structure in an embodiment of the invention.
Detailed description of the embodiments
For a clearer understanding of the technical features, objects and effects of the invention, specific embodiments of the invention are now described in detail with reference to the drawings.
An embodiment of the invention provides an indexing method for continuous spatial-fuzzy keyword queries.
Referring to Fig. 1, which is a flowchart of an indexing method for continuous spatial-fuzzy keyword queries in an embodiment of the invention, the method specifically comprises the following steps:
S101: a query statement is entered in a search window, an index structure is created from the query statement, and the spatio-textual stream data is filtered according to the index structure;
S102: while the spatio-textual stream data is being filtered, one-permutation hash signature processing (a form of permutation hashing) is applied to the query statement in parallel across multiple threads, and similar tags corresponding to the query statement are obtained from the spatio-textual stream data;
As shown in Fig. 2, data communication takes place between the GPU and the CPU while the spatio-textual stream data is being filtered, and a big-kernel communication strategy is added to this data communication to speed up filtering and obtain information relevant to the query statement. The data communication is divided into four stages:
(1) Prefetch address generation: the GPU side allocates a block of memory and generates prefetch addresses, which serves as the address buffer; the address buffer stores the addresses of the query statements for which GPU threads need to compute the one-permutation hash signature;
(2) Data assembly: using the prefetch addresses stored in the address buffer, the query statements are located and assembled into the prefetch buffer on the CPU side;
(3) Data transfer: the query statements in the prefetch buffer on the CPU side are transferred to the data buffer on the GPU side by the DMA mechanism;
(4) Kernel computation: GPU threads compute the one-permutation hash signature of the query statements.
The text information in each spatio-textual object is first extracted, and each thread then computes the one-permutation hash signature of the text information of one spatio-textual object. As shown in Fig. 3, suppose there are two keyword sets V1 and V2, where the original index of V1 is π(V1) = {0, 5, 8} and the original index of V2 is π(V2) = {1, 6, 8}. A binary D-dimensional matrix is set up: its first column lists the features, and in the second and third columns a 1 indicates that the original index contains the feature in the first column, while a 0 indicates that it does not. The columns of the D-dimensional matrix are divided evenly into k parts (bins); in this embodiment k = 3, i.e. the columns are divided evenly into three parts: bin1, bin2 and bin3. The first non-zero entries of the second and third columns in bin1, bin2 and bin3 are marked to form the new indexes π(V1) = {0, 2, 2} and π(V2) = {1, 3, 2}. The new index is computed as follows: each entry of the new index equals the index of the first non-zero entry in the corresponding bin of the original index, minus the total number of bins multiplied by the number of the bin to which that entry belongs (bins being numbered from 0 in this calculation), i.e. π(V1) = [0 - 3 × 0, 5 - 3 × 1, 8 - 3 × 2] = [0, 2, 2] and π(V2) = [1 - 3 × 0, 6 - 3 × 1, 8 - 3 × 2] = [1, 3, 2];
The similarity of π(V1) and π(V2) is the number of bins whose index entries are identical divided by the total number of bins k; in this embodiment the similarity of π(V1) and π(V2) is 1/3.
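The arithmetic of this example can be checked with a few lines of Python. The sketch applies the stated rule literally to the per-bin first non-zero indexes given in the embodiment; the function names are hypothetical, and the multiplier 3 and the zero-based bin numbers are read off the embodiment's own calculation.

```python
from fractions import Fraction


def rebase(first_nonzero, multiplier):
    """Subtract multiplier * bin number from the first non-zero index of each bin."""
    return [idx - multiplier * bin_no for bin_no, idx in enumerate(first_nonzero)]


def similarity(sig_a, sig_b):
    """Number of bins with identical entries divided by the number of bins k."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same number of bins")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return Fraction(matches, len(sig_a))


sig_v1 = rebase([0, 5, 8], 3)  # [0, 2, 2]
sig_v2 = rebase([1, 6, 8], 3)  # [1, 3, 2]
print(sig_v1, sig_v2, similarity(sig_v1, sig_v2))
```

Only the third bin agrees (both entries are 2), which gives the similarity of 1/3 stated in the embodiment.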
S103: the similar tags are embedded in the adaptive index structure AP-tree; based on the one-permutation hash signature computation, the similar tags are distinguished and the similar tag most similar to the query statement is taken as the optimal tag; the data in the spatio-textual stream data corresponding to the optimal tag is then output as the final search result. The adaptive index structure AP-tree is created using efficient heuristic keyword-partition and space-partition algorithms combined with a cost model; the cost model quantitatively measures the matching cost of the two partition methods, and the partition method with the smaller cost is selected for the division. The original text information of keyword nodes, query nodes and space nodes is all replaced by one-permutation hash signatures;
As shown in Fig. 4, if the number of query keywords Q obtained by splitting the query statement does not exceed a preset threshold, or the query keywords Q cannot be further divided by keyword or space partitioning, all query keywords Q are kept in the q node;
If the query keywords Q can be further divided by keyword or space partitioning, the query keywords Q passed down from the parent node may be set as a query node q, a keyword node k or a space node s. The index structure created from the query statement is a tree structure comprising query nodes q, text nodes k and space nodes s: the query nodes q include q1, q2, ..., q10, the text nodes k include k1-node and k2-node, and the space nodes s include s1-node and s2-node. S1, S2, ..., S8 denote the hash tags produced by one-permutation hash signature processing. C1, C2, C3 and C4 denote the four spatial regions of each space node s in the index structure. If the division is made by keyword partitioning, the offset l of the keyword partition is assigned to the keyword-partition cost Ck in the q node, where the offset l indicates that the l-th keyword of the queries in the q node is used for keyword partitioning, and the cost of using the keyword partition method is recorded. Similarly, if the division is made by space partitioning, the offset m of the space partition is assigned to the space-partition cost Cs in the q node, where the offset m indicates that the m-th keyword of the queries in the q node is used for space partitioning, and the cost of using the space partition method is recorded. The current node N is then built as an s node or a k node, and the queries in the q node are moved to the relevant child nodes and processed further by the method above. A depth-first search strategy, implemented by recursive calls, is used to distinguish the similar tags and obtain the final search result.
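The recursive depth-first traversal over q, k and s nodes can be sketched in Python as follows. The `Node` class, the 2 x 2 grid used to map a point to C1..C4, and the use of plain strings as hash tags are all assumptions made for illustration; the sketch only collects candidate queries and leaves out the final similarity check against each candidate.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                      # "q", "k" or "s"
    queries: list = field(default_factory=list)    # used by q nodes only
    children: dict = field(default_factory=dict)   # k: tag -> Node, s: cell -> Node


def cell_of(point, split):
    """Hypothetical 2 x 2 grid: pick C1..C4 from the side of the split point."""
    x, y = point
    sx, sy = split
    return ("C1", "C2", "C3", "C4")[(x >= sx) + 2 * (y >= sy)]


def search(node, tags, point, split, out):
    """Depth-first, recursive collection of the queries an object may match."""
    if node.kind == "q":
        out.extend(node.queries)
    elif node.kind == "k":
        # Follow every child whose hash tag occurs in the object's signature.
        for tag in tags:
            child = node.children.get(tag)
            if child is not None:
                search(child, tags, point, split, out)
    elif node.kind == "s":
        # Follow the single cell that contains the object's location.
        child = node.children.get(cell_of(point, split))
        if child is not None:
            search(child, tags, point, split, out)
    return out


root = Node("k", children={
    "S1": Node("s", children={
        "C1": Node("q", queries=["q1"]),
        "C4": Node("q", queries=["q2"]),
    }),
    "S2": Node("q", queries=["q3"]),
})
```

Calling `search(root, ["S1", "S2"], (9, 9), (5, 5), [])` descends through the S1 branch into cell C4 and then through the S2 branch, so only the q nodes reachable by both the object's tags and its location are visited.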
The beneficial effects of the invention are that the space and time overhead of the index structure is reduced and search time is shortened.
The above are merely preferred embodiments of the invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (5)

1. An indexing method for continuous spatial-fuzzy keyword queries, characterized by comprising the following steps:
S101: a query statement is entered in a search window, an index structure is created from the query statement, and the spatio-textual stream data is filtered according to the index structure;
S102: while the spatio-textual stream data is being filtered, one-permutation hash signature processing is applied to the query statement in parallel across multiple threads, and similar tags corresponding to the query statement are obtained from the spatio-textual stream data;
S103: the similar tags are embedded in the adaptive index structure AP-tree; based on the one-permutation hash signature computation, the similar tags are distinguished and the similar tag most similar to the query statement is taken as the optimal tag; the data in the spatio-textual stream data corresponding to the optimal tag is then output as the final search result.
2. The indexing method for continuous spatial-fuzzy keyword queries according to claim 1, characterized in that: in step S102, data communication takes place between the GPU and the CPU while the spatio-textual stream data is being filtered, and a big-kernel communication strategy is added to this data communication to speed up filtering and obtain information relevant to the query statement;
The data communication is divided into four stages:
(1) Prefetch address generation: the GPU side allocates a block of memory and generates prefetch addresses, which serves as the address buffer; the address buffer stores the addresses of the query statements for which GPU threads need to compute the one-permutation hash signature;
(2) Data assembly: using the prefetch addresses stored in the address buffer, the query statements are located and assembled into the prefetch buffer on the CPU side;
(3) Data transfer: the query statements in the prefetch buffer on the CPU side are transferred to the data buffer on the GPU side by the DMA mechanism;
(4) Kernel computation: GPU threads compute the one-permutation hash signature of the query statements.
3. The indexing method for continuous spatial-fuzzy keyword queries according to claim 2, characterized in that: each thread computes the one-permutation hash signature of the text information of one spatio-textual object.
4. The indexing method for continuous spatial-fuzzy keyword queries according to claim 1, characterized in that: in step S103, the adaptive index structure AP-tree is created using efficient heuristic keyword-partition and space-partition algorithms combined with a cost model; the cost model quantitatively measures the matching cost of the two partition methods, and the partition method with the smaller cost is selected for the division.
5. The indexing method for continuous spatial-fuzzy keyword queries according to claim 1, characterized in that: in step S103, a depth-first search strategy, implemented by recursive calls, is used to distinguish the similar tags and obtain the final search result.
CN201910346372.2A 2019-04-26 2019-04-26 An indexing method for continuous spatial-fuzzy keyword queries Pending CN110134758A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910346372.2A CN110134758A (en) 2019-04-26 2019-04-26 A kind of indexing means inquired towards continuous space-fuzzy keyword

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910346372.2A CN110134758A (en) 2019-04-26 2019-04-26 A kind of indexing means inquired towards continuous space-fuzzy keyword

Publications (1)

Publication Number Publication Date
CN110134758A (en) 2019-08-16

Family

ID=67575209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910346372.2A Pending CN110134758A (en) 2019-04-26 2019-04-26 A kind of indexing means inquired towards continuous space-fuzzy keyword

Country Status (1)

Country Link
CN (1) CN110134758A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5848404A (en) * 1997-03-24 1998-12-08 International Business Machines Corporation Fast query search in large dimension database
CN102084363A (en) * 2008-07-03 2011-06-01 加利福尼亚大学董事会 A method for efficiently supporting interactive, fuzzy search on structured data
CN109271560A (en) * 2018-09-05 2019-01-25 东南大学 A kind of link data critical word querying method based on tree template

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
PING LI et al.: "One Permutation Hashing for Efficient Search and Learning", MATHEMATICS *
REZA MOKHTARI et al.: "BigKernel—High Performance CPU-GPU Communication Pipelining for Big Data-style Applications", 2014 IEEE 28TH INTERNATIONAL PARALLEL & DISTRIBUTED PROCESSING SYMPOSIUM *
ZE DENG et al.: "An Efficient Indexing Approach for Continuous Spatial Approximate Keyword Queries over Geo-Textual Streaming Data", INTERNATIONAL JOURNAL OF GEO-INFORMATION *
ZE DENG et al.: "An Indexing Approach for Efficient Supporting of Continuous Spatial Approximate Keyword Queries", 2018 IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190816
