CN110134758A - An indexing method for continuous spatial-fuzzy keyword queries - Google Patents
- Publication number
- CN110134758A (application CN201910346372.2A)
- Authority
- CN
- China
- Prior art keywords
- space
- data
- query statement
- index structure
- indexing means
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
Abstract
The present invention provides an indexing method for continuous spatial-fuzzy keyword queries. A query is entered in a search window, an index structure is created from the query, and spatio-textual stream data is filtered according to the index structure. While the stream data is being filtered, multiple threads compute one-permutation hash signatures for the query in parallel, and similar labels corresponding to the query are obtained from the spatio-textual stream data. The similar labels are embedded in the adaptive index structure AP-tree; the similar labels are then distinguished according to the computed one-permutation hash signatures to obtain the optimal label, and the data corresponding to the optimal label is output as the final search result. A big-kernel communication strategy is added to the data communication between the GPU and the CPU to accelerate the search; the data communication is divided into four stages: prefetch address generation, data assembly, data transfer, and kernel computation. The beneficial effects of the present invention are that the space overhead and time overhead of the index structure are reduced and search time is shortened.
Description
Technical field
The present invention relates to the technical field of index-based querying of spatio-textual stream data, and more particularly to an indexing method for continuous spatial-fuzzy keyword queries.
Background art
With the arrival and rapid development of the mobile Internet era, application software based on LBS (Location-Based Services) has increased significantly, and the use of this software produces massive amounts of spatio-textual stream data. Processing this data with efficient analysis techniques can bring great convenience to people's lives. However, massive spatio-textual stream data also brings many challenges: huge data volume, growing query latency, and data redundancy.
Traditional index-based query methods for spatio-textual stream data fall roughly into three classes: text-first indexing methods (RQ-tree, etc.), space-first indexing methods (IQ-tree, Rt-tree, etc.), and adaptive indexing methods based on both location information and text information (AP-tree). In current applications, however, existing indexing methods face two problems. First, these index structures do not support approximate text queries; when searching for spatio-textual objects, a user may enter inaccurate text because of typing errors or for other reasons, and in that case fuzzy keyword queries become particularly important. Second, the traditional query algorithms above are implemented on the CPU; as data scale grows, they can no longer meet software requirements for real-time, efficient processing of data.
Therefore, a new indexing method is needed that indexes efficiently while meeting query performance requirements, and that reduces query time and space overhead as far as possible.
Summary of the invention
To solve the problems that, as the scale of spatio-textual stream data grows, query time in the prior art becomes too long, software requirements for real-time and efficient data processing cannot be met, and fuzzy text queries are not supported, the present invention provides an indexing method for continuous spatial-fuzzy keyword queries, which mainly comprises the following steps:
S101: a query is entered in a search window, an index structure is created from the query, and spatio-textual stream data is filtered according to the index structure;
S102: while the spatio-textual stream data is being filtered, multiple threads process the query in parallel to compute one-permutation hash signatures, and similar labels corresponding to the query are obtained from the spatio-textual stream data;
S103: the similar labels are embedded in the adaptive index structure AP-tree; the similar labels are distinguished according to the computed one-permutation hash signatures, and the similar label most similar to the query is taken as the optimal label; the data in the spatio-textual stream data corresponding to the optimal label is then output as the final search result.
Further, while the spatio-textual stream data is being filtered, data is communicated between the GPU and the CPU, and a big-kernel communication strategy is added to this data communication so as to accelerate filtering and obtain the information relevant to the query;
The data communication is divided into four stages:
(1) Prefetch address generation: the GPU side allocates a block of memory and generates prefetch addresses, and this block serves as the address buffer; the address buffer stores the addresses of the queries for which GPU threads need to compute one-permutation hash signatures;
(2) Data assembly: the queries are located using the prefetch addresses stored in the address buffer and are assembled into the prefetch buffer on the CPU side;
(3) Data transfer: the queries in the CPU-side prefetch buffer are transferred to the data buffer on the GPU side by the DMA mechanism;
(4) Kernel computation: GPU threads compute the one-permutation hash signatures from the queries.
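By way of illustration only, the four stages above can be mimicked on the CPU with plain Python lists. This is a minimal sketch, not the patented implementation: no real GPU memory or DMA transfer is involved, the function names (`generate_prefetch_addresses`, `assemble`, `transfer`, `kernel`) are hypothetical, list indices stand in for memory addresses, and `toy_signature` is a placeholder rather than a one-permutation hash signature.

```python
def generate_prefetch_addresses(needed_ids):
    """Stage 1: build the address buffer listing the queries the GPU threads will need."""
    return list(needed_ids)


def assemble(host_queries, address_buffer):
    """Stage 2: gather the addressed queries into a contiguous CPU-side prefetch buffer."""
    return [host_queries[addr] for addr in address_buffer]


def transfer(prefetch_buffer):
    """Stage 3: stand-in for the DMA copy into the GPU-side data buffer."""
    return list(prefetch_buffer)


def kernel(data_buffer, signature_fn):
    """Stage 4: one logical thread per query computes its signature."""
    return [signature_fn(query) for query in data_buffer]


def run_pipeline(host_queries, needed_ids, signature_fn):
    """Chain the four stages in the order given in the text."""
    address_buffer = generate_prefetch_addresses(needed_ids)
    prefetch_buffer = assemble(host_queries, address_buffer)
    data_buffer = transfer(prefetch_buffer)
    return kernel(data_buffer, signature_fn)


def toy_signature(query):
    # Placeholder only: NOT a one-permutation hash signature.
    return tuple(sorted(query.lower().split()))


host_queries = ["coffee shop", "book store", "gas station"]
signatures = run_pipeline(host_queries, needed_ids=[2, 0], signature_fn=toy_signature)
print(signatures)
```

Only the queries named in the address buffer are gathered and copied, which is the point of assembling data before the transfer: the GPU-side buffer receives a compact block rather than the whole host-side collection.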
Further, each thread computes the one-permutation hash signature of the text information of one spatio-textual object.
Further, the adaptive index structure AP-tree is created by efficient heuristic keyword-partitioning and space-partitioning algorithms combined with a cost model; the cost model quantitatively measures the matching cost of the two partitioning methods, and the partitioning method with the lower cost is then selected to perform the division.
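The selection between the two partitioning methods can be sketched as follows. The text does not give the cost formula, so the formula used here is an assumption made purely for illustration: the cost of a partition is taken to be the sum, over its buckets, of the number of queries in the bucket multiplied by the probability that an incoming object falls into that bucket. All names and sample probabilities are hypothetical.

```python
def partition_cost(buckets, hit_prob):
    """Assumed cost: expected number of queries examined per incoming object."""
    return sum(len(queries) * hit_prob[key] for key, queries in buckets.items())


def choose_partition(keyword_buckets, keyword_prob, space_buckets, cell_prob):
    """Return the name and cost of the partitioning method with the lower cost."""
    ck = partition_cost(keyword_buckets, keyword_prob)  # keyword partition cost
    cs = partition_cost(space_buckets, cell_prob)       # space partition cost
    return ("keyword", ck) if ck <= cs else ("space", cs)


# Four queries grouped either by one keyword or by the spatial cell they fall in.
keyword_buckets = {"cafe": ["q1", "q2"], "bank": ["q3"], "park": ["q4"]}
keyword_prob = {"cafe": 0.5, "bank": 0.25, "park": 0.25}
space_buckets = {"C1": ["q1", "q2", "q3"], "C2": ["q4"]}
cell_prob = {"C1": 0.75, "C2": 0.25}

choice = choose_partition(keyword_buckets, keyword_prob, space_buckets, cell_prob)
print(choice)
```

Under these sample numbers the keyword partition spreads the queries more evenly across buckets, so its assumed cost is lower and it is the method selected.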
Further, a depth-first search strategy, implemented by recursive calls, is used to distinguish the similar labels and thereby obtain the final search result.
The technical solution provided by the present invention has the following beneficial effects: the space overhead and time overhead of the index structure are reduced, and search time is shortened.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawings and embodiments. In the drawings:
Fig. 1 is a flowchart of an indexing method for continuous spatial-fuzzy keyword queries in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the big-kernel strategy in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the one-permutation hashing method in an embodiment of the present invention;
Fig. 4 is a diagram of the spatial-fuzzy keyword search index structure in an embodiment of the present invention.
Detailed description of the embodiments
For a clearer understanding of the technical features, objects, and effects of the present invention, specific embodiments of the invention are now described in detail with reference to the accompanying drawings.
An embodiment of the present invention provides an indexing method for continuous spatial-fuzzy keyword queries.
Referring to Fig. 1, Fig. 1 is a flowchart of an indexing method for continuous spatial-fuzzy keyword queries in an embodiment of the present invention, which specifically comprises the following steps:
S101: a query is entered in a search window, an index structure is created from the query, and spatio-textual stream data is filtered according to the index structure;
S102: while the spatio-textual stream data is being filtered, multiple threads process the query in parallel to compute one-permutation hash signatures (that is, hash labels obtained with a single permutation), and similar labels corresponding to the query are obtained from the spatio-textual stream data;
As shown in Fig. 2, while the spatio-textual stream data is being filtered, data is communicated between the GPU and the CPU, and a big-kernel communication strategy is added to this data communication so as to accelerate filtering and obtain the information relevant to the query. The data communication is divided into four stages:
(1) Prefetch address generation: the GPU side allocates a block of memory and generates prefetch addresses, and this block serves as the address buffer; the address buffer stores the addresses of the queries for which GPU threads need to compute one-permutation hash signatures;
(2) Data assembly: the queries are located using the prefetch addresses stored in the address buffer and are assembled into the prefetch buffer on the CPU side;
(3) Data transfer: the queries in the CPU-side prefetch buffer are transferred to the data buffer on the GPU side by the DMA mechanism;
(4) Kernel computation: GPU threads compute the one-permutation hash signatures from the queries.
The text information of each spatio-textual object is first extracted, and each thread then computes the one-permutation hash signature of the text information of one spatio-textual object. As shown in Fig. 3, assume there are two keyword sets V1 and V2, where the original index of V1 is π(V1) = {0, 5, 8} and the original index of V2 is π(V2) = {1, 6, 8}. A binary D-dimensional matrix is set up: its first column lists the features, and in the second and third columns a 1 indicates that the original index contains the feature given in the first column, while a 0 indicates that it does not. The columns of the D-dimensional matrix are divided evenly into k parts (bins); in this embodiment k = 3, so the columns are divided evenly into three parts: bin1, bin2, and bin3. The first nonzero entry of the second column and of the third column within each of bin1, bin2, and bin3 is marked, forming the new indexes π(V1) = {0, 2, 2} and π(V2) = {1, 3, 2}. The new index is computed as follows: for each bin, the new index equals the original index of the first nonzero entry in that bin minus the product of the total number of bins and the number of the bin to which that entry belongs, i.e. π(V1) = [0 - 3 × 0, 5 - 3 × 1, 8 - 3 × 2] = [0, 2, 2] and π(V2) = [1 - 3 × 0, 6 - 3 × 1, 8 - 3 × 2] = [1, 3, 2];
The similarity of π(V1) and π(V2) is the number of bins in which the two new indexes are identical, divided by the total number of bins k; in this embodiment, the similarity of π(V1) and π(V2) is 1/3.
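The arithmetic of this worked example can be reproduced with a short sketch. It makes several assumptions that go beyond the text: the first nonzero entry of each bin is supplied directly as input rather than being located in a matrix, every bin is taken to contain exactly one such entry, bins are numbered 0, 1, 2, and the subtraction uses k multiplied by the bin number exactly as the text states it. The function names are hypothetical.

```python
from fractions import Fraction


def new_index(first_nonzero, k):
    """Re-index each bin's first nonzero entry as: original index - k * bin number.

    first_nonzero[b] is the original index of the first nonzero entry in bin b,
    with bins numbered from 0 (an assumption matching the worked example).
    """
    return [idx - k * b for b, idx in enumerate(first_nonzero)]


def similarity(sig_a, sig_b):
    """Number of bins whose new indexes agree, divided by the number of bins."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return Fraction(matches, len(sig_a))


k = 3
sig_v1 = new_index([0, 5, 8], k)  # original index of V1
sig_v2 = new_index([1, 6, 8], k)  # original index of V2

print(sig_v1, sig_v2, similarity(sig_v1, sig_v2))
```

The two signatures agree only in the third bin, which gives the similarity of 1/3 stated in the embodiment. `Fraction` is used so that the result is exact rather than a rounded float.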
S103: the similar labels are embedded in the adaptive index structure AP-tree; the similar labels are distinguished according to the computed one-permutation hash signatures, and the similar label most similar to the query is taken as the optimal label; the data in the spatio-textual stream data corresponding to the optimal label is then output as the final search result. The adaptive index structure AP-tree is created by efficient heuristic keyword-partitioning and space-partitioning algorithms combined with a cost model; the cost model quantitatively measures the matching cost of the two partitioning methods, and the partitioning method with the lower cost is then selected to perform the division. The original text information of the keyword nodes, query nodes, and space nodes is all replaced by one-permutation hash signatures;
As shown in Fig. 4, if the number of query keywords Q into which the query is split does not exceed a preset threshold, or the query keywords Q cannot be divided further by keyword partitioning or space partitioning, all query keywords Q are kept in a q node. If a query keyword Q can be divided further by keyword partitioning or space partitioning, the query keywords Q passed down from the parent node can be set as a query node q, a keyword node k, or a space node s. The index structure created from the query is a tree structure comprising query nodes q, text nodes k, and space nodes s; the query nodes q include q1, q2, ..., q10, the text nodes k include k1-node and k2-node, and the space nodes s include s1-node and s2-node. S1, S2, ..., S8 each denote a hash label produced by one-permutation hash signature processing. C1, C2, C3, and C4 denote the four spatial regions of each space node s in the index structure. If the division is made by keyword partitioning, the offset l of the keyword partition is assigned to the keyword partition cost Ck in the q node, where the offset l indicates that the l-th keyword of the queries in the q node is used for keyword partitioning, and the cost of using the keyword partitioning method is recorded. Similarly, if the division is made by space partitioning, the offset m of the space partition is assigned to the space partition cost Cs in the q node, where the offset m indicates that the m-th keyword of the queries in the q node is used for space partitioning, and the cost of using the space partitioning method is recorded. The current node N is then built as an s node or a k node, and the queries in the q node are moved to the relevant child nodes and processed further by the above method. A depth-first search strategy, implemented by recursive calls, is used to distinguish the similar labels and obtain the final search result.
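The recursive depth-first traversal of such a tree can be sketched as below. This is an illustrative sketch under stated assumptions, not the structure of Fig. 4: the `Node` layout, the routing rule (an s node follows the single matching cell, a k node follows every child whose label is similar enough), the `threshold` parameter, and the sample signatures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                     # "q", "k" or "s"
    queries: list = field(default_factory=list)   # q node: (query_id, signature) pairs
    children: dict = field(default_factory=dict)  # k node: label -> Node; s node: cell -> Node


def similarity(sig_a, sig_b):
    """Fraction of positions at which two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def search(node, obj_sig, obj_cell, threshold):
    """Depth-first recursive search returning (query_id, similarity) pairs."""
    if node.kind == "q":
        scored = [(qid, similarity(sig, obj_sig)) for qid, sig in node.queries]
        return [(qid, s) for qid, s in scored if s >= threshold]
    results = []
    if node.kind == "s":
        child = node.children.get(obj_cell)       # follow only the object's cell
        if child is not None:
            results += search(child, obj_sig, obj_cell, threshold)
    else:                                         # k node: follow similar labels
        for label, child in node.children.items():
            if similarity(label, obj_sig) >= threshold:
                results += search(child, obj_sig, obj_cell, threshold)
    return results


tree = Node("s", children={
    "C1": Node("k", children={
        (0, 2, 2): Node("q", queries=[("q1", (0, 2, 2))]),
        (1, 3, 2): Node("q", queries=[("q2", (1, 3, 2))]),
    }),
    "C2": Node("q", queries=[("q3", (0, 2, 2))]),
})

results = search(tree, obj_sig=(0, 2, 2), obj_cell="C1", threshold=0.3)
best = max(results, key=lambda pair: pair[1])     # most similar label wins
print(results, best)
```

Taking the maximum over the collected pairs corresponds to selecting the label most similar to the query as the optimal label; q3 is never visited because the object does not fall in cell C2.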
The beneficial effects of the present invention are that the space overhead and time overhead of the index structure are reduced and search time is shortened.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (5)
1. An indexing method for continuous spatial-fuzzy keyword queries, characterized by comprising the following steps:
S101: a query is entered in a search window, an index structure is created from the query, and spatio-textual stream data is filtered according to the index structure;
S102: while the spatio-textual stream data is being filtered, multiple threads process the query in parallel to compute one-permutation hash signatures, and similar labels corresponding to the query are obtained from the spatio-textual stream data;
S103: the similar labels are embedded in the adaptive index structure AP-tree; the similar labels are distinguished according to the computed one-permutation hash signatures, and the similar label most similar to the query is taken as the optimal label; the data in the spatio-textual stream data corresponding to the optimal label is then output as the final search result.
2. The indexing method for continuous spatial-fuzzy keyword queries according to claim 1, characterized in that: in step S102, while the spatio-textual stream data is being filtered, data is communicated between the GPU and the CPU, and a big-kernel communication strategy is added to this data communication so as to accelerate filtering and obtain the information relevant to the query;
The data communication is divided into four stages:
(1) Prefetch address generation: the GPU side allocates a block of memory and generates prefetch addresses, and this block serves as the address buffer; the address buffer stores the addresses of the queries for which GPU threads need to compute one-permutation hash signatures;
(2) Data assembly: the queries are located using the prefetch addresses stored in the address buffer and are assembled into the prefetch buffer on the CPU side;
(3) Data transfer: the queries in the CPU-side prefetch buffer are transferred to the data buffer on the GPU side by the DMA mechanism;
(4) Kernel computation: GPU threads compute the one-permutation hash signatures from the queries.
3. The indexing method for continuous spatial-fuzzy keyword queries according to claim 2, characterized in that: each thread computes the one-permutation hash signature of the text information of one spatio-textual object.
4. The indexing method for continuous spatial-fuzzy keyword queries according to claim 1, characterized in that: in step S103, the adaptive index structure AP-tree is created by efficient heuristic keyword-partitioning and space-partitioning algorithms combined with a cost model; the cost model quantitatively measures the matching cost of the two partitioning methods, and the partitioning method with the lower cost is then selected to perform the division.
5. The indexing method for continuous spatial-fuzzy keyword queries according to claim 1, characterized in that: in step S103, a depth-first search strategy, implemented by recursive calls, is used to distinguish the similar labels and thereby obtain the final search result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910346372.2A CN110134758A (en) | 2019-04-26 | 2019-04-26 | An indexing method for continuous spatial-fuzzy keyword queries |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910346372.2A CN110134758A (en) | 2019-04-26 | 2019-04-26 | An indexing method for continuous spatial-fuzzy keyword queries |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110134758A (en) | 2019-08-16 |
Family
ID=67575209
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910346372.2A Pending CN110134758A (en) | An indexing method for continuous spatial-fuzzy keyword queries | 2019-04-26 | 2019-04-26 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110134758A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5848404A (en) * | 1997-03-24 | 1998-12-08 | International Business Machines Corporation | Fast query search in large dimension database |
CN102084363A (en) * | 2008-07-03 | 2011-06-01 | 加利福尼亚大学董事会 | A method for efficiently supporting interactive, fuzzy search on structured data |
CN109271560A (en) * | 2018-09-05 | 2019-01-25 | 东南大学 | A kind of link data critical word querying method based on tree template |
Non-Patent Citations (4)
Title |
---|
PING LI等: "One Permutation Hashing for Efficient Search and Learning", 《MATHEMATICS》 * |
REZA MOKHTARI等: "BigKernel—High Performance CPU-GPU Communication Pipelining for Big Data-style Applications", 《2014 IEEE 28TH INTERNATIONAL PARALLEL & DISTRIBUTED PROCESSING SYMPOSIUM》 * |
ZE DENG等: "An Efficient Indexing Approach for Continuous Spatial Approximate Keyword Queries over Geo-Textual Streaming Data", 《INTERNATIONAL JOURNAL OF GEO-INFORMATION》 * |
ZE DENG等: "An Indexing Approach for Efficient Supporting of Continuous Spatial Approximate Keyword Queries", 《2018 IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kraska et al. | The case for learned index structures | |
CN100501746C (en) | Web page collecting method and web page collecting server | |
JP5407043B2 (en) | Efficient piecewise update of binary encoded XML data | |
CN101561814B (en) | Topic crawler system based on social labels | |
US7433886B2 (en) | SQL language extensions for modifying collection-valued and scalar valued columns in a single statement | |
CN102521334B (en) | Data storage and query method based on classification characteristics and balanced binary tree | |
CN102436513A (en) | Distributed search method and system | |
CN105956183A (en) | Method and system for multi-stage optimization storage of a lot of small files in distributed database | |
US20070016605A1 (en) | Mechanism for computing structural summaries of XML document collections in a database system | |
CN104391908B (en) | Multiple key indexing means based on local sensitivity Hash on a kind of figure | |
US8015195B2 (en) | Modifying entry names in directory server | |
US20130159278A1 (en) | Techniques for efficiently supporting xquery update facility in sql/xml | |
CN109033314A (en) | The Query method in real time and system of extensive knowledge mapping in the case of memory-limited | |
Kucukyilmaz et al. | A machine learning approach for result caching in web search engines | |
CN106033428B (en) | The selection method of uniform resource locator and the selection device of uniform resource locator | |
CN100397397C (en) | XML data storage and access method based on relational database | |
US20230315727A1 (en) | Cost-based query optimization for untyped fields in database systems | |
US7454436B2 (en) | Generational global name table | |
CN104915388B (en) | It is a kind of that method is recommended based on spectral clustering and the book labels of mass-rent technology | |
JP2009512950A (en) | Architecture and method for efficiently bulk loading Patricia Tri | |
US8756246B2 (en) | Method and system for caching lexical mappings for RDF data | |
CN107704585A (en) | One kind inquiry HDFS data methods and system | |
Barkhordari et al. | Atrak: a MapReduce-based data warehouse for big data | |
CN110134758A (en) | An indexing method for continuous spatial-fuzzy keyword queries | |
Sharma et al. | Federated learning based caching in fog computing for future smart cities |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190816 |