CN103902699B

CN103902699B - Data space retrieval method applied to big data environments and supporting multi-format feature

Info

Publication number: CN103902699B
Application number: CN201410125840.0A
Authority: CN
Inventors: 周连科; 王洪滨; 王念滨; 祝官文
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2014-03-31
Filing date: 2014-03-31
Publication date: 2017-04-12
Anticipated expiration: 2034-03-31
Also published as: CN103902699A

Abstract

The invention relates to a data space retrieval method for big data environments that supports multi-format features. The method comprises: the user inputs the query content; the query type is determined; the built multi-level index is read by prefix scanning; linked lists are merged; the query is rewritten; the multi-level index is traversed; the keywords are pushed onto a stack in reverse order; the two top elements are popped from the stack first; the built multi-level index is read; further top elements are popped and joined according to the right-join scheme for index linked lists; and all elements that satisfy the condition are output. Because the multi-level index combines a B-tree index with a secondary index, the method resolves the excessive cost of index joins that path queries incur on the main index in big data environments.

Description

A data space retrieval method for big data environments that supports multi-format features

Technical field

The present invention relates to a data space retrieval method for big data environments that supports multi-format features.

Background

Data in a data space is diverse: it includes structured, semi-structured, and unstructured data such as relational tuples, XML, Word documents, e-mail, video, and audio. This diversity creates an urgent need to support multiple kinds of queries, which makes the indexing technique particularly critical. On the one hand, unlike search engines and traditional data integration techniques, data space indexing must index multiple types of data together rather than build a separate index for each type. On the other hand, unlike traditional search engines, XML engines, and database queries, it no longer focuses on querying a single type of data but must flexibly support queries of varying degrees of structure, such as keyword queries, predicate queries, and path queries.

With the rapid development of the Internet, the volume of data is growing explosively, with more than one hundred million TB of new data produced every year. In such a big data environment, index efficiency directly determines the performance of data space queries, so the index efficiency of data space data is crucial. Current data space indexing techniques mainly include the Hybrid-ATIL index, graph indexes, and full-text index plus copy. Although these techniques index multiple types of data well, they struggle with the low efficiency of index joins over data spaces in big data environments. To address this problem, the present invention applies the idea of a multi-level index to design an efficient data space indexing technique for big data environments, thereby improving query performance.

Summary of the invention

The object of the invention is to provide an efficient data space retrieval method for big data environments that supports multi-format features, supports multiple query modes, indexes multi-format data, and eliminates a large number of join operations.

The object of the invention is achieved as follows:

A data space retrieval method for big data environments that supports multi-format features, comprising:

1) The user inputs the query content;

2) Determine the user's query type. If it is a keyword query Q = {k_i}, where k_i is a keyword, execute step 3); if it is a predicate query Q = (v, {k_i}), where v is an attribute and k_i is a keyword, go to step 5); if it is a path query Q = k₁/.../k_i/..., where '/' denotes the hierarchical path, go to step 7);

3) Read the built multi-level index by prefix scanning to obtain k_i*, where k_i* denotes the index entries that begin with keyword k_i. The corresponding linked-list results are recorded separately; the j-th result is the document list, i.e. the posting, of the j-th index entry containing k_i. If the query type is a keyword query, go to step 4); if the query type is a path query, go to step 7);

4) Merge the linked lists: first take the union of the postings of all index entries that begin with k_i, then intersect the union results over all keywords k_i, which yields the list of documents that contain all of the keywords;

5) Rewrite the query as {k_i//v};

6) Traverse the multi-level index built in step 1) to obtain the entries corresponding to k_i//v, where L_{k_i//v} denotes the posting of index entry k_i//v and L denotes the list of all documents in which the attribute contains all of the keywords k_i;

7) Push k₁ through k_n onto the stack in reverse order;

8) First pop the two elements at the top of the stack, denoted k₁ and k₂;

9) Read the multi-level index built in step 1) to obtain the B-tree index of k₁ and the H index of k₂. The former is the B-tree index whose elements are the numbers of the resource views corresponding to keyword k₁; the latter is the H index corresponding to keyword k₂;

10) Join the B-tree index of k₁ with the H index of k₂ according to the right-join scheme for index linked lists, and denote the result B_temp, a temporarily generated B-tree that is initially empty. That is, for each key in the B-tree index of k₁, if the key can be found in the H index of k₂, all elements of the corresponding entry C = {c_i} are inserted into B_temp;

11) If the stack is not empty, go to step 12); otherwise, go to step 14);

12) Pop the top element k_i, read the multi-level index built in step 1) to obtain the H index of k_i, join it with B_temp by the method of step 10), and record the result;

13) Go to step 11);

14) Traverse B_temp or L and output all elements that satisfy the condition.
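The keyword branch (steps 3 and 4) and the predicate branch (steps 5 and 6) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation described by the invention: the index is assumed to be a plain dictionary from index entry to a set of resource view ids, and the names `INDEX`, `prefix_scan`, `keyword_query`, and `predicate_query` are hypothetical.

```python
# Toy index (assumed layout): index entry -> posting, i.e. a set of resource view ids.
# Entries written "k//a" are extended keywords: keyword k under attribute a.
INDEX = {
    "data": {1, 2, 5},
    "database": {3},
    "space": {2, 3, 5},
    "data//title": {2},
    "space//title": {2, 5},
}


def prefix_scan(index, keyword):
    """Step 3: postings of every index entry that begins with `keyword`."""
    return [posting for entry, posting in index.items() if entry.startswith(keyword)]


def keyword_query(index, keywords):
    """Step 4: union the postings per keyword, then intersect across keywords."""
    per_keyword = [set().union(*prefix_scan(index, k)) for k in keywords]
    return set.intersection(*per_keyword) if per_keyword else set()


def predicate_query(index, attribute, keywords):
    """Steps 5-6: rewrite each keyword k to k//v, then intersect the postings."""
    postings = [index.get(f"{k}//{attribute}", set()) for k in keywords]
    return set.intersection(*postings) if postings else set()
```

With this toy index, `keyword_query(INDEX, ["data", "space"])` yields the views 2, 3, and 5, and `predicate_query(INDEX, "title", ["data", "space"])` yields view 2 only. A linear scan over the dictionary stands in for a real prefix scan over a sorted index.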

The multi-level index read in step 3) is built by the following steps:

A. Build the multi-level index over the data space in the big data environment. The construction comprises building the main index, which uses an extended inverted index, and building the auxiliary index, which combines a B-tree with a secondary index. The extended inverted index supports keyword queries and predicate queries over big data; the B-tree index and the secondary index support path queries over big data.

The main index is built by indexing the different components of each resource view with the extended inverted index structure:

(1) Load the data space data;

(2) For the name component of each resource view V_i, whose name is keyword, add keyword to the keyword column and add the item (V_i, {P_i^k}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i and {P_i^k} denotes the set of identifiers of all V_k such that V_k→V_i, that is, the identifiers of all parent nodes of V_i;

(3) If the tuple component of the resource view is not empty, go to step (4); if the content component is not empty, go to step (5);

(4) For the tuple component τ = (w, t) of the resource view, w denotes the schema and t is a tuple that conforms to schema w; w = a_j, j = 1, 2, ..., k is an attribute sequence, where a_j is an attribute name; t = v_j, j = 1, 2, ..., k is a value sequence, where v_j is a value. This step comprises two sub-steps (4-1) and (4-2);

(4-1) Add the attribute name a_j to the keyword column, and add the item (V_i, {V_i}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i;

(4-2) Let <a, k> be an attribute-value pair of (w, t). For each <a, k>, add k//a to the keyword column and add an item (V_i, {V_i}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i;

(5) For each keyword keyword in the content component, add keyword to the keyword column and add an item (V_i, {V_i}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i;
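Steps (1) to (5) of the main index construction can be sketched as follows. This is a simplified illustration under assumptions, not the invention's implementation: the `ResourceView` class, its field names, and `build_main_index` are hypothetical, each attribute value is treated as a single keyword, and the content component is split on whitespace.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ResourceView:
    """Hypothetical resource view with name, tuple, and content components."""
    vid: int
    name: str
    parents: set = field(default_factory=set)    # ids of all parent views
    tuple_: dict = field(default_factory=dict)   # attribute name -> value
    content: str = ""


def build_main_index(views):
    """Extended inverted index: keyword -> list of (view id, id set) items."""
    index = defaultdict(list)
    for v in views:
        # (2) name component: the item carries the ids of all parent views.
        index[v.name].append((v.vid, set(v.parents)))
        # (4) tuple component.
        for attr, value in v.tuple_.items():
            index[attr].append((v.vid, {v.vid}))                # (4-1) attribute name
            index[f"{value}//{attr}"].append((v.vid, {v.vid}))  # (4-2) extended keyword k//a
        # (5) content component: one item per distinct word.
        for word in sorted(set(v.content.split())):
            index[word].append((v.vid, {v.vid}))
    return dict(index)


views = [
    ResourceView(1, "papers"),
    ResourceView(2, "index.doc", parents={1},
                 tuple_={"author": "zhou"}, content="data space"),
]
main_index = build_main_index(views)
```

In this sketch the entry for `"index.doc"` holds the item `(2, {1})`, because view 1 is the parent of view 2, while the entries produced from the tuple and content components hold `(2, {2})`.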

B. The auxiliary index mainly addresses the excessive cost of index joins for path queries on the main index in big data environments. The auxiliary index consists of a B-tree index and a secondary index, and is built by the following steps:

(1) Read the main index;

(2) For each keyword keyword1, obtain its corresponding entry term_i = ..., <V_i, {P_i^k}>, ..., where P_i^k is a parent resource view of V_i;

(3) If keyword1 is not an extended keyword, i.e. not of the form a//k, carry out the following two steps:

(3-1) Let S = {V_i}, where V_i is the left half of each element of entry i, i.e. S is the set of all resource views that contain keyword keyword1; then build a B-tree index over S;

(3-2) For each P_i^k in each element <V_i, {P_i^k}> of entry i: if the parent view vector does not contain P_i^k, add P_i^k to the parent view vector; then add V_i to the linked list corresponding to P_i^k, forming the H index.
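The auxiliary index construction in steps (1) to (3-2) can be sketched as follows. The sketch rests on several assumptions: the main index is a dictionary from keyword to a list of `(view id, parent id set)` items, a sorted Python list stands in for the B-tree (the standard library has no B-tree), the H index is a dictionary from parent view id to a list of child view ids, and the name `build_auxiliary_index` is hypothetical.

```python
def build_auxiliary_index(main_index):
    """Return (btree, h_index), each keyed by non-extended keyword."""
    btree, h_index = {}, {}
    for keyword, items in main_index.items():
        # (3) extended keywords of the form k//a get no auxiliary index.
        if "//" in keyword:
            continue
        # (3-1) S = all views containing the keyword; sorted list as B-tree stand-in.
        btree[keyword] = sorted({vid for vid, _ in items})
        # (3-2) parent view vector: parent id -> linked list of child view ids.
        parent_to_views = {}
        for vid, parents in items:
            for parent in sorted(parents):
                parent_to_views.setdefault(parent, []).append(vid)
        h_index[keyword] = parent_to_views
    return btree, h_index


# Toy main index (assumed layout): keyword -> [(view id, parent ids), ...].
MAIN_INDEX = {
    "papers": [(1, set())],
    "index.doc": [(2, {1}), (3, {1}), (4, {7})],
    "zhou//author": [(2, {2})],
}
btree, h_index = build_auxiliary_index(MAIN_INDEX)
```

Here `btree["index.doc"]` is `[2, 3, 4]` and `h_index["index.doc"]` is `{1: [2, 3], 7: [4]}`, so the views named `index.doc` can be reached directly from the id of a parent view without scanning the whole posting.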

The multi-level index technique processes the different query types using the right-join rule.
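The path query branch (steps 7 to 14) and the right join it relies on can be sketched as follows. This is an illustrative reading of the steps rather than the invention's implementation: sorted lists stand in for the B-tree index and for B_temp, the H index is assumed to map a parent view id to a list of child view ids, the path is assumed to contain at least two keywords, and the names `right_join`, `path_query`, `BTREE`, and `H_INDEX` are hypothetical.

```python
def right_join(keys, h_index):
    """Step 10: for every key found in the H index, collect its linked-list elements."""
    b_temp = set()                       # B_temp starts empty
    for key in keys:
        b_temp.update(h_index.get(key, ()))
    return sorted(b_temp)


def path_query(path, btree, h_index):
    """Evaluate a path query k1/k2/.../kn with a stack of keywords."""
    stack = list(reversed(path.split("/")))   # step 7: reverse order, so k1 ends on top
    k1, k2 = stack.pop(), stack.pop()         # step 8: pop the two top elements
    # Steps 9-10: join the B-tree index of k1 with the H index of k2.
    b_temp = right_join(btree.get(k1, []), h_index.get(k2, {}))
    while stack:                              # steps 11-13: repeat until the stack is empty
        k = stack.pop()
        b_temp = right_join(b_temp, h_index.get(k, {}))
    return b_temp                             # step 14: the elements to output


# Toy auxiliary index (assumed layout).
BTREE = {"papers": [1], "2014": [2, 6], "index.doc": [3, 4]}
H_INDEX = {
    "papers": {},
    "2014": {1: [2], 5: [6]},
    "index.doc": {2: [3], 6: [4]},
}
```

With this data, `path_query("papers/2014/index.doc", BTREE, H_INDEX)` returns `[3]`: view 2 is the only `2014` view whose parent is named `papers`, and view 3 is the only `index.doc` view under view 2. Each join step only looks up keys in the H index of the next keyword instead of joining full postings, which is the saving the auxiliary index is meant to provide.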

The beneficial effects of the invention are as follows:

In the method of the invention, the B-tree index and the secondary index together form the multi-level index, which resolves the excessive cost of index joins for path queries on the main index in big data environments.

Brief description of the drawings

Fig. 1 shows the query method based on the right-join rule over the main and auxiliary indexes;

Fig. 2 shows the right-join scheme for index linked lists;

Fig. 3 shows the auxiliary index method that combines a B-tree with a secondary index;

Fig. 4 shows example B-tree and H-tree structures for the keywords keyword1 and keyword2.

Specific embodiment

The present invention is described further below in conjunction with the accompanying drawings.

1) The user inputs the query content;

2) Determine the user's query type. If it is a keyword query Q = {k_i}, go to step 3); if it is a predicate query Q = (v, {k_i}), go to step 5); if it is a path query Q = k₁/.../k_i/..., go to step 7).

3) Read the built multi-level index by prefix scanning to obtain k_i*, and record the corresponding linked-list results separately. If the query type is a keyword query, go to step 4); if the query type is a path query, go to step 7);

4) Merge the linked lists;

5) Rewrite the query as {k_i//v};

6) Traverse the multi-level index built in step 1) to obtain the entries corresponding to k_i//v;

7) Push k₁ through k_n onto the stack in reverse order;

9) Read the multi-level index built in step 1) to obtain the B-tree index of k₁ and the H index of k₂;

10) Join the B-tree index of k₁ with the H index of k₂ according to the right-join scheme for index linked lists, and denote the result B_temp, which is initially empty. That is, for each key in the B-tree index of k₁, if the key can be found in the H index of k₂, all elements of the corresponding entry C = {c_i} are inserted into B_temp;

13) Go to step 11);

14) Traverse B_temp or L and output all elements that satisfy the condition.

This embodiment is described with reference to Figs. 1 to 3. The data space retrieval method for big data environments that supports multi-format features comprises the following steps (as shown in Fig. 1):

1) The user inputs the query content;

3) Read the built multi-level index by prefix scanning to obtain k_i*, and record the corresponding linked-list results separately. If the query type is a keyword query, go to step 4); if the query type is a path query, go to step 7);

4) Merge the linked lists;

5) Rewrite the query as {k_i//v};

7) Push k₁ through k_n onto the stack in reverse order;

10) Join the B-tree index of k₁ with the H index of k₂ according to the right-join scheme for index linked lists (as shown in Fig. 2), and denote the result B_temp, which is initially empty. That is, for each key in the B-tree index of k₁, if the key can be found in the H index of k₂, all elements of the corresponding entry C = {c_i} are inserted into B_temp;

13) Go to step 11);

14) Traverse B_temp or L and output all elements that satisfy the condition.

The multi-level index read in step 3) is built by the following steps A and B:

A. Build the multi-level index over the data space in the big data environment. The construction comprises building the main index, which uses an extended inverted index, and building the auxiliary index, which combines a B-tree with a secondary index. The extended inverted index supports keyword queries and predicate queries over big data; the B-tree index and the secondary index support path queries over big data.

The main index is built mainly by indexing the different components of each resource view with the extended inverted index. The process is as follows:

(1) Load the data space data;

(2) For the name component of each resource view V_i, whose name is keyword, add keyword to the keyword column and add the item (V_i, {P_i^k}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i and {P_i^k} denotes the set of identifiers of all V_k such that V_k→V_i (P_i^k is a parent resource view of V_i), that is, the identifiers of all parent nodes of V_i;

(4) For the tuple component τ = (w, t) of the resource view, w denotes the schema and t is a tuple that conforms to schema w; w = a_j, j = 1, 2, ..., k is an attribute sequence, where a_j is an attribute name; t = v_j, j = 1, 2, ..., k is a value sequence, where v_j is a value. This step comprises two sub-steps (4-1) and (4-2);

(4-1) Add a_j to the keyword column, and add the item (V_i, {V_i}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i;

(4-2) Let <a, k> be an attribute-value pair of (w, t). For each <a, k>, add k//a to the keyword column and add an item (V_i, {V_i}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i;

B. The auxiliary index mainly addresses the excessive cost of index joins for path queries on the main index in big data environments. The auxiliary index consists of a B-tree index and a secondary index. With reference to Fig. 3, it is built by the following steps:

(1) Read the main index;

(2) For each keyword keyword1, obtain its corresponding entry term_i = ..., <V_i, {P_i^k}>, ...;

Claims

1. A data space retrieval method for big data environments that supports multi-format features, characterized in that:

1) The user inputs the query content;

2) Determine the user's query type. If it is a keyword query Q = {k_i}, where k_i is a keyword, execute step 3); if it is a predicate query Q = (v, {k_i}), where v is an attribute and k_i is a keyword, go to step 5); if it is a path query Q = k₁/.../k_i/..., where '/' denotes the hierarchical path, go to step 7);

3) Read the built multi-level index by prefix scanning to obtain k_i*, where k_i* denotes the index entries that begin with keyword k_i. The corresponding linked-list results are recorded separately; the j-th result is the document list, i.e. the posting, of the j-th index entry containing k_i. If the query type is a keyword query, go to step 4); if the query type is a path query, go to step 7);

4) Merge the linked lists: first take the union of the postings of all index entries that begin with k_i, then intersect the union results over all keywords k_i, which yields the list of documents that contain all of the keywords;

5) Rewrite the query as {k_i//v};

6) Traverse the multi-level index built in step 3) to obtain the entries corresponding to k_i//v, where L_{k_i//v} denotes the posting of index entry k_i//v and L denotes the list of all documents in which the attribute contains all of the keywords k_i;

7) Push k₁ through k_n onto the stack in reverse order;

9) Read the multi-level index built in step 3) to obtain the B-tree index of k₁ and the H index of k₂. The former is the B-tree index whose elements are the numbers of the resource views corresponding to keyword k₁; the latter is the H index corresponding to keyword k₂;

10) Join the B-tree index of k₁ with the H index of k₂ according to the right-join scheme for index linked lists, and denote the result B_temp, a temporarily generated B-tree that is initially empty. That is, for each key in the B-tree index of k₁, if the key can be found in the H index of k₂, all elements of the corresponding entry C = {c_i} are inserted into B_temp;

12) Pop the top element k_i, read the multi-level index built in step 3) to obtain the H index of k_i, join it with B_temp by the method of step 10), and record the result;

13) Go to step 11);

14) Traverse B_temp or the list of all documents L from step 6), and output all elements that satisfy the condition.

2. The data space retrieval method for big data environments that supports multi-format features according to claim 1, characterized in that the multi-level index read in step 3) is built by the following steps:

A. Build the multi-level index over the data space in the big data environment. The construction comprises building the main index, which uses an extended inverted index, and building the auxiliary index, which combines a B-tree with a secondary index. The extended inverted index supports keyword queries and predicate queries over big data; the B-tree index and the secondary index support path queries over big data.

The main index is built by indexing the different components of each resource view with the extended inverted index:

(1) Load the data space data;

(2) For the name component of each resource view V_i, whose name is keyword, add keyword to the keyword column and add the item (V_i, {P_i^k}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i and {P_i^k} denotes the set of identifiers of all V_k such that V_k→V_i, that is, the identifiers of all parent nodes of V_i;

(3) If the tuple component of the resource view is not empty, go to step (4); if the content component is not empty, go to step (5);

(4) For the tuple component τ = (w, t) of the resource view, w denotes the schema and t is a tuple that conforms to schema w; w = a_j, j = 1, 2, ..., k is an attribute sequence, where a_j is an attribute name; t = v_j, j = 1, 2, ..., k is a value sequence, where v_j is a value. This step comprises two sub-steps (4-1) and (4-2);

(4-1) Add the attribute name a_j to the keyword column, and add the item (V_i, {V_i}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i;

(4-2) Let <a, k> be an attribute-value pair of (w, t). For each <a, k>, add k//a to the keyword column and add an item (V_i, {V_i}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i;

(5) For each keyword keyword in the content component, add keyword to the keyword column and add an item (V_i, {V_i}) to the corresponding linked list, where V_i denotes the unique identifier of resource view V_i;

B. The auxiliary index mainly addresses the excessive cost of index joins for path queries on the main index in big data environments. The auxiliary index consists of a B-tree index and a secondary index, and is built by the following steps:

(1) Read the main index;

(3-1) Let S = {V_i}, where V_i is the left half of each element of entry i, i.e. S is the set of all resource views that contain keyword keyword1; then build a B-tree index over S;

(3-2) For each P_i^k in each element <V_i, {P_i^k}> of entry i: if the parent view vector does not contain P_i^k, add P_i^k to the parent view vector; then add V_i to the linked list corresponding to P_i^k, forming the H index.

3. The data space retrieval method for big data environments that supports multi-format features according to claim 1, characterized in that the multi-level index technique processes the different query types using the right-join rule.