CN110019204A

CN110019204A - Split-internal indexing method and apparatus for HDFS

Info

Publication number: CN110019204A
Application number: CN201711023820.2A
Authority: CN
Inventors: 唐凌; 林文辉
Original assignee: Aisino Corp
Current assignee: Aisino Corp
Priority date: 2017-10-27
Filing date: 2017-10-27
Publication date: 2019-07-16

Abstract

Embodiments of the present invention provide a split-internal indexing method and apparatus for HDFS, belonging to the field of data retrieval. The method comprises: receiving a query request; determining index attributes according to the query request; determining a split, according to the index attribute values of the index attributes, by means of a pre-established clustered index or nonclustered index; and loading the determined split to obtain the data corresponding to the query request. With this technical solution, the pre-established clustered or nonclustered index is used to determine from the query request which splits need to be loaded, which reduces the disk I/O caused by unnecessary data scanning and improves HDFS query speed.

Description

Split-internal indexing method and apparatus for HDFS

Technical field

The present invention relates to the field of data retrieval, and more particularly to a split-internal indexing method and apparatus for HDFS.

Background art

HDFS (Hadoop Distributed File System), the foundation of the Hadoop ecosystem, is typically used to store offline data and is combined with Map/Reduce to process analytical queries. For selective and interactive queries with strict response-time requirements, however, its performance falls short.

In traditional database management, the most common way to speed up query processing is indexing. An index quickly filters out data that does not satisfy the query, which greatly reduces I/O, narrows the search range, and shortens response time. However, traditional indexing techniques cannot be applied directly to HDFS queries.

In HDFS, a table file is divided into multiple splits for processing, and each split contains a large number of records. If every record is scanned during a query, a large amount of disk I/O is generated and query efficiency drops.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a split-internal indexing method and apparatus for HDFS that address the problem of high I/O overhead.

To achieve this purpose, an embodiment of the present invention provides a split-internal indexing method for HDFS, the method comprising: receiving a query request; determining index attributes according to the query request; determining a split, according to the index attribute values of the index attributes, by means of a pre-established clustered index or nonclustered index; and loading the determined split to obtain the data corresponding to the query request.

Preferably, determining the split by means of the pre-established clustered index or nonclustered index according to the index attribute values of the index attributes comprises: determining the split by means of the clustered index when the index attributes include only one attribute; and determining the split by means of the nonclustered index when the index attributes include multiple attributes.

Preferably, the clustered index is established as follows: the index attribute values of one index attribute are sorted, and the clustered index is built on the sorted index attribute values.

Preferably, the nonclustered index is established as follows: the index attribute values of a first attribute among multiple index attributes are sorted, and a clustered index is built on the sorted index attribute values; and a nonclustered index is built for the attributes other than the first attribute among the multiple index attributes.

Preferably, the method further comprises: comparing the range of index attribute values of the index attribute determined according to the query request with the range of index attribute values of the corresponding index attribute in the nonclustered index, and judging whether the two ranges intersect; if they intersect, loading the data of the split that corresponds to the intersecting part of the index attribute values; and if they do not intersect, discarding the split that corresponds to the index attribute values of the corresponding index attribute in the nonclustered index.

Correspondingly, an embodiment of the present invention provides a split-internal indexing apparatus for HDFS, the apparatus comprising: a receiving module for receiving a query request; a processing module for determining index attributes according to the query request and for determining a split, according to the index attributes, by means of a pre-established clustered index or nonclustered index; and a loading module for loading the determined split to obtain the data corresponding to the query request.

Preferably, the processing module is further configured to: determine the split by means of the clustered index when the index attributes include only one attribute; and determine the split by means of the nonclustered index when the index attributes include multiple attributes.

Preferably, the processing module is further configured to: compare the range of index attribute values of the index attribute determined according to the query request with the range of index attribute values of the corresponding index attribute in the nonclustered index, and judge whether the two ranges intersect. The loading module is further configured to: if the ranges intersect, load the data of the split that corresponds to the intersecting part of the index attribute values; and if they do not intersect, discard the split that corresponds to the index attribute values of the corresponding index attribute in the nonclustered index.

With the above technical solution, the present invention uses the pre-established clustered or nonclustered index to determine from the query request which splits need to be loaded, which reduces the disk I/O caused by unnecessary data scanning and improves HDFS query speed.

Other features and advantages of the embodiments of the present invention are described in the detailed description that follows.

Brief description of the drawings

The accompanying drawings provide a further understanding of the embodiments of the present invention and constitute a part of the specification. Together with the detailed description below, they serve to explain the embodiments of the present invention but do not limit them. In the drawings:

Fig. 1 is a flow chart of the split-internal indexing method for HDFS provided by the present invention;

Fig. 2 is a diagram of the clustered index structure provided by the present invention;

Fig. 3 is a flow chart of establishing the clustered index provided by the present invention;

Fig. 4 is a flow chart of query processing with the clustered index provided by the present invention;

Fig. 5 is a diagram of the nonclustered index structure provided by the present invention;

Fig. 6 is a flow chart of query processing with the nonclustered index provided by the present invention; and

Fig. 7 is a block diagram of the split-internal indexing apparatus for HDFS provided by the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to illustrate and explain the embodiments of the present invention and are not intended to limit them.

Fig. 1 is a flow chart of the split-internal indexing method for HDFS provided by the present invention. As shown in Fig. 1, the method includes:

Step 101: receive a query request.

Step 102: determine index attributes according to the query request.

Step 103: determine a split, according to the index attribute values of the index attributes, by means of a pre-established clustered index or nonclustered index.

Step 104: load the determined split to obtain the data corresponding to the query request.

Both the clustered index and the nonclustered index are established in advance. The clustered index handles the case in which the index attributes determined according to the query request include only one attribute; the nonclustered index handles the case in which they include multiple attributes. Accordingly, determining the split in step 103 by means of the pre-established clustered index or nonclustered index comprises: determining the split by means of the clustered index when the index attributes include only one attribute; and determining the split by means of the nonclustered index when the index attributes include multiple attributes.
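The selection rule in step 103 can be sketched in a few lines of Python. This is only an illustration: the function name `choose_index` and the string labels it returns are hypothetical and do not appear in the patent.

```python
def choose_index(query_attributes):
    """Pick the pre-built index kind for a query (hypothetical helper).

    One index attribute in the query -> clustered index.
    Several index attributes          -> nonclustered index.
    """
    attrs = list(query_attributes)
    if not attrs:
        raise ValueError("the query must reference at least one index attribute")
    return "clustered" if len(attrs) == 1 else "nonclustered"
```

A caller would pass the attributes extracted from the query condition in step 102 and dispatch to the matching query procedure.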

The clustered index is established as follows: the index attribute values of one index attribute are sorted, and the clustered index is built on the sorted values. When a table file stored on HDFS is processed, it is divided into individual splits. When the clustered index is established, the data in each split is sorted by the index attribute value of the index attribute, and the index attribute values in the clustered index are sorted by the same ordering rule. In other words, for the clustered index, the data in the split and the index attribute values follow the same sort order, which is what building the clustered index on the sorted index attribute values means. The resulting clustered index is then stored after the split data and saved in HDFS.

Fig. 2 is a diagram of the clustered index structure provided by the present invention. In Fig. 2:

Split data is the data of the split.

Trojan index is the clustered index established on the index attribute of the split; it contains the index attribute values of the split data and their offsets.

Header is the metadata of the clustered index and contains five fields: DataSize, IndexSize, Max, Min and RecordNum. DataSize is the size of the split data, IndexSize is the size of the clustered index itself, Max is the maximum value of the index attribute in the split, Min is the minimum value of the index attribute in the split, and RecordNum is the number of records in the split.

Footer holds the information of the newly repartitioned split: the original data and the generated clustered index are placed together in one new split, and the fields SplitSize and FooterSize in the Footer give the size of the split and the size of the Footer, respectively.

Fig. 3 is a flow chart of establishing the clustered index provided by the present invention. As shown in Fig. 3, the process includes:

Step 301: select the index attribute, that is, choose a field of the table file on HDFS as the index attribute on which the clustered index is built.

Step 302: sort the data in each split by the index attribute value of the index attribute, for example in ascending order. Ascending order is the usual choice, but those skilled in the art may instead sort in descending order or by another rule as the situation requires; this embodiment gives only one example.

Step 303: take the result of the ascending sort as the value of Split data.

Step 304: take the index attribute values and offsets of Split data as the value of Trojan index.

Step 305: compute the values of Header and Footer from Split data and Trojan index.

Step 306: generate the clustered index.
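Steps 301 to 306 can be illustrated with a minimal Python sketch. The patent does not specify a serialization, so several choices here are assumptions: records are dictionaries written as one JSON line each, offsets are byte offsets into the sorted Split data, and the function name `build_clustered_split` is hypothetical. The Footer is left out for brevity, and the input is assumed to be non-empty.

```python
import json

def build_clustered_split(records, key):
    """Sort one split's records by `key` and derive Trojan index and Header.

    Assumed layout (not from the patent): one JSON line per record,
    byte offsets measured from the start of the sorted split data.
    """
    ordered = sorted(records, key=lambda r: r[key])      # step 302, ascending
    split_data = b""                                     # step 303
    trojan_index = []                                    # step 304: (value, offset)
    for rec in ordered:
        trojan_index.append((rec[key], len(split_data)))
        split_data += json.dumps(rec).encode("utf-8") + b"\n"
    index_bytes = json.dumps(trojan_index).encode("utf-8")
    header = {                                           # step 305, Header only
        "DataSize": len(split_data),
        "IndexSize": len(index_bytes),
        "Max": ordered[-1][key],
        "Min": ordered[0][key],
        "RecordNum": len(ordered),
    }
    return split_data, trojan_index, header
```

Because the data and the index share one sort order, the offsets in `trojan_index` increase monotonically, which is what lets a range query read one contiguous region of the split.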

Fig. 4 is a flow chart of query processing with the clustered index provided by the present invention. As shown in Fig. 4, the process includes:

Step 401: receive a query request.

Step 402: determine the index attribute and the corresponding index attribute value according to the query request. This index attribute value is the condition referred to in step 405; this embodiment retrieves the data that corresponds to the index attribute value determined from the query request. The index attribute value may be a range; for the clustered index, the range of index attribute values of a split in the table file is contained within the range of index attribute values being queried.

Step 403: starting from the last Footer field at the end of the table file, read the SplitSize field in each Footer in turn, moving toward the start of the file, to delimit each split. Those skilled in the art will understand that the splits in a table file are stored contiguously, so each split has to be delimited by its size; because that size is stored in the SplitSize field, the splits are delimited according to SplitSize.

Step 404: read the Header field of each split to obtain the metadata of the index, such as the index size.

Step 405: scan the index to determine the offsets of the data that satisfies the condition; for the condition, see the explanation in step 402.

Step 406: read the data that satisfies the condition, that is, read the corresponding data at the offsets obtained by scanning the index, which amounts to loading the corresponding split.
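Steps 404 to 406 can be sketched as follows for a range condition. The sketch assumes the Trojan index is held in memory as a list of `(value, offset)` pairs in ascending value order and that the Header is a dictionary with `Min` and `Max`; both representations and the name `clustered_lookup` are hypothetical. The patent describes scanning the index; the binary search used here is a substitution that the sorted order makes possible, not the patent's stated procedure.

```python
from bisect import bisect_left, bisect_right

def clustered_lookup(trojan_index, header, low, high):
    """Return the offsets of records whose index value lies in [low, high]."""
    # Step 404: the Header alone can rule out the whole split.
    if high < header["Min"] or low > header["Max"]:
        return []
    # Step 405: locate the matching slice of the sorted index.
    keys = [value for value, _ in trojan_index]
    start = bisect_left(keys, low)
    stop = bisect_right(keys, high)
    # Step 406 would read the records at these offsets.
    return [offset for _, offset in trojan_index[start:stop]]
```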

The clustered index described above suits queries whose condition involves only one attribute. If the query condition involves multiple attributes, a nonclustered index is needed to improve query efficiency, so a nonclustered index has to be established in advance. A nonclustered index is generally built on top of a clustered index; a table file can have one clustered index and one or more nonclustered indexes at the same time to support different query needs.

The nonclustered index is established as follows: the index attribute values of a first attribute among multiple index attributes are sorted, and a clustered index is built on the sorted index attribute values; and a nonclustered index is built for the attributes other than the first attribute among the multiple index attributes. In other words, establishing a nonclustered index first requires establishing a clustered index and then the nonclustered index. Note that the clustered index is built on sorted index attribute values, whereas the nonclustered index follows no ordering rule: the clustered index is ordered and the nonclustered index is unordered.

Regarding the establishment of the nonclustered index, note that the multiple index attributes mentioned there must be distinguished from the index attributes determined according to the query request. The nonclustered index is established before any query request is received, so at that time it is not known which of the targeted index attributes, if any, will coincide with the index attributes determined from a later query request.

Suppose, for example, that the nonclustered index targets three attributes: a first attribute, a second attribute and a third attribute. A clustered index can then be established on the index attribute values of the first attribute, and nonclustered indexes on the second and third attributes. The labels first, second and third serve only to tell the three attributes apart in the description and are not limiting.

Fig. 5 is a diagram of the nonclustered index structure provided by the present invention. In Fig. 5:

Split data is the data of the split.

Header is the metadata of the clustered index; see the description of the clustered index structure given with Fig. 2.

Non-Clustered Index is the nonclustered index; it mainly stores the offsets of the unsorted nonclustered index attribute.

Non-Clustered Header is the metadata of the nonclustered index.

File Header mainly records which attributes of the split have an index built on them.

Footer holds the information of the newly repartitioned split; see the description of the clustered index structure given with Fig. 2.

There may be multiple Non-Clustered Index and Non-Clustered Header sections, that is, several nonclustered indexes can be established at the same time.

Because the data in the split is sorted by the clustered index's attribute when the clustered index is established, Split data is already sorted data. To establish a nonclustered index, it therefore suffices to take the selected nonclustered index attribute and extract from Split data, for each record, that attribute's value and the record's offset as the Non-Clustered Index.
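The extraction just described can be sketched in Python under the same illustrative assumptions as before: Split data is a byte string holding one JSON line per record, and offsets are byte offsets into it. The name `build_nonclustered_index` and the `Attribute`, `Min` and `Max` fields of the returned header are hypothetical; the patent says only that the Non-Clustered Header holds the metadata of the nonclustered index. The input is assumed to be non-empty.

```python
import json

def build_nonclustered_index(split_data, attribute):
    """Collect (value, byte offset) pairs for `attribute` in stored order.

    The entries are deliberately not sorted: the split is ordered by the
    clustered attribute, so this attribute's values appear in arbitrary order.
    """
    entries = []
    offset = 0
    for line in split_data.splitlines(keepends=True):
        record = json.loads(line)
        entries.append((record[attribute], offset))
        offset += len(line)
    values = [value for value, _ in entries]
    header = {"Attribute": attribute, "Min": min(values), "Max": max(values)}
    return entries, header
```

Recording the value range in the header is what would allow a later query to skip the split without touching its data.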

Because the data is sorted by the clustered index attribute and not by the nonclustered index attribute, the data is unordered from the nonclustered index's point of view. The query process for a nonclustered index therefore differs from that for a clustered index.

For the nonclustered index, the query process includes: comparing the range of index attribute values of the index attribute determined according to the query request with the range of index attribute values of the corresponding index attribute in the nonclustered index, and judging whether the two ranges intersect; if they intersect, loading the data of the split that corresponds to the intersecting part of the index attribute values; and if they do not intersect, discarding the split that corresponds to the index attribute values of the corresponding index attribute in the nonclustered index.

When the ranges intersect, the data of the split that corresponds to the non-intersecting part of the index attribute values of the corresponding index attribute in the nonclustered index may or may not be loaded; to speed up the query, it is generally not loaded.

Query processing with the nonclustered index is now described with reference to the nonclustered index structure. Fig. 6 is a flow chart of query processing with the nonclustered index provided by the present invention. As shown in Fig. 6, the process includes:

Step 601: receive a query request.

Step 602: determine the index attributes and the corresponding index attribute values according to the query request.

Step 603: starting from the last Footer field at the end of the table file, read the SplitSize field in each Footer in turn, moving toward the start of the file, to delimit each split.

Step 604: read the Header field of each split to determine the offset of the nonclustered index's index attribute.

Step 605: read the Non-Clustered Header to obtain the metadata of the nonclustered index, and determine the scanning strategy from the index attribute values of the split recorded there and the index attribute values determined according to the query request. The scanning strategy is described in detail below.

Step 606: scan the split data according to the determined scanning strategy and return the result.
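The backward walk over Footers in steps 403 and 603 can be sketched as follows. The patent does not give a binary layout, so this sketch assumes one: each Footer is two unsigned 64-bit big-endian integers, SplitSize then FooterSize, and SplitSize counts the whole split including its Footer. The name `split_boundaries` is hypothetical, and the file is treated as an in-memory byte string rather than read through an HDFS client.

```python
import struct

# Hypothetical footer layout: SplitSize (whole split, footer included)
# followed by FooterSize, both unsigned 64-bit big-endian integers.
FOOTER = struct.Struct(">QQ")

def split_boundaries(blob):
    """Walk the Footers from the end of the file; return (start, end) per split."""
    bounds = []
    end = len(blob)
    while end > 0:
        if end < FOOTER.size:
            raise ValueError("truncated footer before byte %d" % end)
        split_size, footer_size = FOOTER.unpack_from(blob, end - FOOTER.size)
        if footer_size != FOOTER.size or not FOOTER.size <= split_size <= end:
            raise ValueError("corrupt footer before byte %d" % end)
        bounds.append((end - split_size, end))
        end -= split_size          # the previous split ends where this one starts
    bounds.reverse()               # report splits in file order
    return bounds
```

Reading from the end works because the splits are stored contiguously and each Footer sits at a known position relative to the end of its split.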

The scanning strategy is as follows. Assume that the range of index attribute values of a split in the table file is [c, d] and that the range of index attribute values of the index attribute determined according to the query request is [a, b]. The strategy for loading the split is derived from [c, d] and [a, b]:

(1) When c ≤ a ≤ d and b ≥ d, the scan starts at the minimum offset of all values in the interval [a, d] and ends at the maximum offset of all values in [a, d]. That is, the data of the split corresponding to [a, d] is loaded, which is the data corresponding to the intersecting part of the index attribute values.

(2) When c ≤ a ≤ d and c ≤ b ≤ d, the scan starts at the minimum offset of all values in the interval [a, b] and ends at the maximum offset of all values in [a, b]. That is, the data of the split corresponding to [a, b] is loaded, which is the data corresponding to the intersecting part of the index attribute values.

(3) When a ≤ c and c ≤ b ≤ d, the scan starts at the minimum offset of all values in the interval [c, b] and ends at the maximum offset of all values in [c, b]. That is, the data of the split corresponding to [c, b] is loaded, which is the data corresponding to the intersecting part of the index attribute values.

(4) When a ≤ c and b ≥ d, the entire split is scanned, that is, all data of the split is loaded.

(5) When a and b satisfy none of the four cases above, the entire split is discarded.

For the clustered index, only case (4) above arises.

In addition, those skilled in the art will understand that traversal may be used when querying or scanning data.
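Cases (1) to (4) all reduce to scanning the intersection [max(a, c), min(b, d)], and case (5) is the empty intersection. A minimal Python sketch of that observation follows, assuming the nonclustered index is an unsorted list of `(value, offset)` pairs; the name `scan_window` is hypothetical. Returning `None` when the ranges overlap but no indexed value falls inside the overlap is an addition of this sketch, not something the patent states.

```python
def scan_window(index_entries, c, d, a, b):
    """Return (start_offset, end_offset) to scan, or None to discard the split.

    [c, d] is the value range stored for the split, [a, b] the queried range.
    """
    low, high = max(a, c), min(b, d)       # intersection of the two ranges
    if low > high:
        return None                        # case (5): no intersection
    offsets = [off for value, off in index_entries if low <= value <= high]
    if not offsets:
        return None                        # overlap, but no record inside it
    # Cases (1)-(4): scan from the smallest to the largest matching offset.
    return min(offsets), max(offsets)
```

Because the nonclustered index is unsorted, the matching offsets need not be adjacent, so the window between the smallest and largest offset may also cover records that do not match and must be filtered while scanning.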

Fig. 7 is a block diagram of the split-internal indexing apparatus for HDFS provided by the present invention. As shown in Fig. 7, the apparatus includes a receiving module 701, a processing module 702 and a loading module 703. The receiving module 701 receives a query request. The processing module 702 determines index attributes according to the query request and determines a split, according to the index attributes, by means of a pre-established clustered index or nonclustered index. The loading module 703 loads the determined split to obtain the data corresponding to the query request.

The details and benefits of the split-internal indexing apparatus for HDFS provided by the present invention are similar to those of the split-internal indexing method for HDFS provided by the present invention and are not repeated here.

Optional implementations of the embodiments of the present invention have been described in detail above with reference to the accompanying drawings. The embodiments are not, however, limited to the specific details of those implementations. Within the scope of the technical concept of the embodiments, many simple variations can be made to the technical solutions, and all such variations fall within the protection scope of the embodiments of the present invention.

The technical solution provided by the present invention optimizes indexing inside the split: at query execution time, the internal index is consulted so that only the data satisfying the query condition is read, which greatly reduces the disk I/O caused by unnecessary data scanning. The present invention can also be combined with optimization methods at other levels to further improve HDFS query speed.

It should further be noted that the specific technical features described in the above embodiments may be combined in any suitable manner provided they do not contradict one another. To avoid unnecessary repetition, the embodiments of the present invention do not describe the possible combinations separately.

In addition, the various embodiments of the present invention may be combined with one another in any way; as long as a combination does not depart from the idea of the embodiments of the present invention, it should likewise be regarded as part of their disclosure.

Claims

1. A split-internal indexing method for HDFS, characterized in that the method comprises:

receiving a query request;

determining index attributes according to the query request;

determining a split, according to the index attribute values of the index attributes, by means of a pre-established clustered index or nonclustered index; and

loading the determined split to obtain the data corresponding to the query request.

2. The method according to claim 1, characterized in that determining the split by means of the pre-established clustered index or nonclustered index according to the index attribute values of the index attributes comprises:

determining the split by means of the clustered index when the index attributes include only one attribute; and

determining the split by means of the nonclustered index when the index attributes include multiple attributes.

3. The method according to claim 1, characterized in that the clustered index is established as follows:

the index attribute values of one index attribute are sorted, and the clustered index is built on the sorted index attribute values.

4. The method according to claim 1, characterized in that the nonclustered index is established as follows:

the index attribute values of a first attribute among multiple index attributes are sorted, and a clustered index is built on the sorted index attribute values; and

a nonclustered index is built for the attributes other than the first attribute among the multiple index attributes.

5. The method according to claim 4, characterized in that the method further comprises:

comparing the range of index attribute values of the index attribute determined according to the query request with the range of index attribute values of the corresponding index attribute in the nonclustered index, and judging whether the two ranges intersect;

if they intersect, loading the data of the split that corresponds to the intersecting part of the index attribute values; and

if they do not intersect, discarding the split that corresponds to the index attribute values of the corresponding index attribute in the nonclustered index.

6. A split-internal indexing apparatus for HDFS, characterized in that the apparatus comprises:

a receiving module for receiving a query request;

a processing module for determining index attributes according to the query request and for determining a split, according to the index attributes, by means of a pre-established clustered index or nonclustered index; and

a loading module for loading the determined split to obtain the data corresponding to the query request.

7. The apparatus according to claim 6, characterized in that the processing module is further configured to:

8. The apparatus according to claim 6, characterized in that the clustered index is established as follows:

9. The apparatus according to claim 6, characterized in that the nonclustered index is established as follows:

10. The apparatus according to claim 9, characterized in that the processing module is further configured to: compare the range of index attribute values of the index attribute determined according to the query request with the range of index attribute values of the corresponding index attribute in the nonclustered index, and judge whether the two ranges intersect; and

the loading module is further configured to: if the ranges intersect, load the data of the split that corresponds to the intersecting part of the index attribute values; and if they do not intersect, discard the split that corresponds to the index attribute values of the corresponding index attribute in the nonclustered index.