CN105574093B

CN105574093B - Method for establishing an index in an HDFS-based spark-sql big data processing system

Info

Publication number: CN105574093B
Application number: CN201510918956.4A
Authority: CN
Inventors: 张鋆; 冯骏
Original assignee: Shenzhen Huaxun Fangzhou Software Technology Co Ltd; Shenzhen Huaxun Ark Technology Co Ltd
Current assignee: Shenzhen Huaxun Ark Photoelectric Technology Co ltd; Shenzhen Huaxun Fangzhou Software Technology Co ltd
Priority date: 2015-12-10
Filing date: 2015-12-10
Publication date: 2019-09-10
Anticipated expiration: 2035-12-10
Also published as: CN105574093A; WO2017096939A1

Abstract

The invention discloses a method for establishing an index in an HDFS-based spark-sql big data processing system. Through SQL statements, indexes are added and deleted and data is inserted and deleted in the HDFS-based spark-sql big data processing system. When data is queried, the system automatically determines whether an index exists on the queried column; if it does, the system looks up the file blocks referenced by the index and filters out the file blocks that do not need to be queried. Adding an index function to spark-sql can effectively increase query speed. For example, consider a typical spark-sql data table of 1000 GB stored as 1 GB files, i.e., 1000 files in total. To query a single record, the original approach must scan all 1000 files; once an index is established, only 1 file needs to be scanned, a 1000-fold improvement in efficiency. As a general estimate, drawing on experience with traditional relational databases, a spark-sql database with an index executes SQL queries 100 to 10000 times faster, or more, than one without an index.

Description

Method for establishing an index in an HDFS-based spark-sql big data processing system

Technical field

The present invention relates to a method for establishing an index in a data processing system, and more particularly to a method for establishing an index in an HDFS-based spark-sql big data processing system.

Background art

Spark-sql adds support for standard SQL query statements on top of the big data processing platform Spark.

The Spark big data processing platform is a general-purpose parallel framework, similar to Hadoop MapReduce, that was open-sourced by Berkeley, and it has the advantages of Hadoop MapReduce. However, writing Spark big data processing programs requires mastery of the Scala language and coding against the open API function interfaces, which is tedious and complicated, and the SQL language that large numbers of traditional database developers have mastered cannot be used on Spark. The advent of spark-sql solved this problem. It applies the traditional database table concept to the Spark processing framework: users can create and query tables with SQL statements just as they would operate on traditional database tables, and spark-sql automatically converts the corresponding operations into Spark internal operations, hiding the complicated processing details.

However, owing to the particular nature of the Spark big data processing platform, spark-sql does not support establishing an index on a data table; that is, it does not support index-creation statements like those of traditional databases, such as:

create index myindex on table t(b);

This means: create an ordinary index named myindex on column b of table t.

After receiving the above command, a traditional relational database starts building an index on column b of the table.

There are many types of index, such as B-tree indexes, Hash indexes, GiST indexes, and GIN indexes. Taking the B-tree index as an example, a relational database builds an index as follows:

The database sets aside a separate storage area for storing the index tree.

A B-tree is generated from the values in the column to be indexed (the column named b in the example) and saved to the designated storage area. Each node of the B-tree corresponds to an element of column b and also contains a pointer that records where the corresponding element is stored in the database file.

When a new element is inserted into column b, it is also inserted into the B-tree (the B-tree adjusts itself automatically), and the tree node records the position of the element in the database file.

When an element is deleted from column b, it is also deleted from the B-tree (the B-tree adjusts itself automatically).

A database index is based on the physical files in which the database is stored, i.e., the "position in the database file" mentioned above. The database file format can be user-defined as needed, so the position of data in the file can be expressed in different ways. The general idea, however, is always to record the exact location of an element in the file, so that later lookups do not need to traverse the file and can instead locate the record quickly from the recorded position, thereby speeding up the search.

As shown in Fig. 1, the data file corresponds to table t, which has two columns, a and b, with an index established on column b. The tree structure on the right is the index; the element in each index node serves as a pointer to the position of the corresponding element in the data file, and the index itself is also stored as a file.

When querying data, for example with the query statement

Select * from t where b=22;

all rows of table t whose b column equals 22 are queried. The database first parses the SQL statement and then finds that an index exists on column b.

Then the element 22 is found quickly and directly in the index tree. Following the pointer of element 22, the database locates the row whose b column has the value 22, at address 0x90, fetches that row directly by its address, and returns the result "5 22".
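The lookup just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not how any particular database implements it: a sorted list searched with `bisect` stands in for the B-tree, byte offsets in an in-memory buffer stand in for addresses such as 0x90, and all rows other than "5 22" are made up.

```python
import bisect
import io

# Rows of table t as (a, b). Only (5, 22) comes from the example; the rest are invented.
rows = [(1, 15), (2, 40), (5, 22), (7, 8)]

# Write fixed-width rows to a "data file" and remember where each row starts.
data_file = io.BytesIO()
offsets = []
for a, b in rows:
    offsets.append(data_file.tell())
    data_file.write(f"{a:>4}{b:>4}\n".encode())

# Sorted (b value, offset) pairs play the role of the B-tree on column b.
index = sorted((b, off) for (_, b), off in zip(rows, offsets))
keys = [k for k, _ in index]

def lookup(b_value):
    """Return rows whose b column equals b_value, reading only indexed offsets."""
    matches = []
    i = bisect.bisect_left(keys, b_value)
    while i < len(keys) and keys[i] == b_value:
        data_file.seek(index[i][1])          # jump straight to the recorded position
        line = data_file.readline().decode()
        matches.append((int(line[0:4]), int(line[4:8])))
        i += 1
    return matches

print(lookup(22))  # prints [(5, 22)]
```

The data file is never traversed: the only read is at the offset stored next to the key, which is the point of a position-level index.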

When an element is inserted into table t, the corresponding index tree is modified dynamically according to the value of column b; correspondingly, when an element is deleted from table t, the corresponding content in the index tree is also deleted dynamically.

The index technology of traditional relational databases is relatively mature. What is described here is its general principle; implementations vary and their form is not fixed, but the basic principles are the same. The construction methods and implementation steps of other indexes are not described one by one here.

For spark-sql, the underlying file storage differs from that of a traditional relational database (HDFS is generally used rather than an ordinary Linux or Windows file system), and a table is usually very large and may even be associated with thousands of physical files. For these reasons, spark-sql was designed without an index-creation function. Its design emphasizes the concurrency of data processing and neglects processing efficiency.

When spark-sql performs a data query, it normally searches the entire data table. The amount of data spark-sql handles is usually very large, and one database may correspond to multiple physical files; spark-sql searches all of the files using concurrency.

As shown in Fig. 2, taking the same SQL query statement as an example:

Select * from t where b=22;

Spark-sql first parses the SQL statement and then locates the data files of table t (4 files here). These files are then split into multiple blocks of a particular size and assigned to different worker processes. The worker processes sequentially scan all the file blocks of the whole table to find the rows whose b column value is 22 and, once found, return the result.

As can be seen, without an index spark-sql searches a table in a fairly simple but inefficient way, since it must scan every row of every data file.

Beyond simple select statements, traditional relational databases apply index technology wherever a query is involved, including complex queries, subqueries, and nested queries, to reduce the amount of data queried and speed up the query. Spark-sql has no such mechanism.

In summary, because current spark-sql has no data indexing mechanism, it cannot achieve optimal query speed and is inefficient compared with traditional relational databases.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for establishing an index in an HDFS-based spark-sql big data processing system. The method allows spark-sql to fit more, and more flexible, application scenarios, speeds up the execution of SQL queries in spark-sql, improves the execution efficiency of spark-sql, and more fully exploits spark-sql's capability for processing big data.

To solve the above technical problem, the present invention provides a method for establishing an index in an HDFS-based spark-sql big data processing system. Through SQL statements, indexes are added and deleted and data is inserted and deleted in the HDFS-based spark-sql big data processing system. When data is queried, the system automatically determines whether an index exists on the queried column; if it does, the system looks up the file blocks referenced by the index and filters out the file blocks that do not need to be queried.

When adding an index, a new index file is first created. The format of the index file can be set according to configuration and other instructions; commonly used formats include B-tree and Hash indexes. All records in the original table are then traversed. For each record, the location in HDFS or another file system of the value of the column to be indexed is determined, and the column value of the record and the corresponding file information are written into the index tree structure. After all records have been traversed, the index structure is saved as a file. Finally, the table metadata is updated by writing the new index information into the table's metadata for use in subsequent queries.
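The add-index steps above can be sketched as follows. This is a minimal illustration under assumptions, not the patented implementation: a plain dictionary stands in for the B-tree or Hash structure, a local JSON file stands in for the index file on HDFS, and the names `create_index`, `table_files`, and the layout of `metadata` are hypothetical.

```python
import json
import os
import tempfile

def create_index(table_files, column, index_name, metadata, index_dir):
    """Build a value -> file names index for one column and register it."""
    index = {}
    # Traverse every record of the original table, file by file, and note
    # which file each value of the indexed column lives in.
    for file_name, rows in table_files.items():
        for row in rows:
            index.setdefault(row[column], set()).add(file_name)
    # Save the index structure as a file (JSON here, so keys become strings).
    index_path = os.path.join(index_dir, index_name + ".idx.json")
    with open(index_path, "w") as f:
        json.dump({str(k): sorted(v) for k, v in index.items()}, f)
    # Finally, write the new index information into the table metadata.
    metadata.setdefault("indexes", {})[index_name] = {
        "column": column,
        "path": index_path,
    }
    return index_path

# Hypothetical table contents: file name -> rows held in that file.
table_files = {
    "t1-p1": [{"a": 1, "b": 15}, {"a": 2, "b": 40}],
    "t1-p7": [{"a": 5, "b": 22}],
}
metadata = {"table": "t"}
index_dir = tempfile.mkdtemp()
path = create_index(table_files, "b", "myindex", metadata, index_dir)
with open(path) as f:
    saved = json.load(f)
```

Note that each entry stores file names only, never a position inside a file; that is the granularity the method relies on.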

When deleting the index of a table column, it is only necessary to locate the corresponding index file and delete it, and to update the table metadata by deleting the index information from the metadata.

After a piece of data is inserted, it is determined whether the inserted data involves an index. If it does, the corresponding index structure must be adjusted by adding this piece of data and its associated file information to the index structure.

Throughout the flow, the table's data insertion flow remains the same as the original flow; the only addition is that, after the insertion completes, the name of the file containing the data is recorded and an index node is constructed from the returned file name.

After a piece of data is deleted, it is determined whether the deleted data involves an index. If it does, the corresponding index structure must be adjusted by deleting the index node associated with this piece of data.

Throughout the flow, the table's data deletion flow remains the same as the original flow; the only addition is that, after the deletion completes, the index information corresponding to the deleted data is deleted.

When querying data, the corresponding node element is looked up in the index file according to the column value, and the element value is read to obtain the name of the file that contains the data. The original query flow then continues: the original flow would end up reading and querying all data files of the table, but here the unneeded files are filtered out according to the file names obtained in the previous step, and the query flow proceeds on the remaining files only. The SQL operation is then executed on the queried data, and the query result is returned.

The index on a field of a record points to a file; that is, it records which file contains the record. When the record is later looked up, the index leads directly to that file, and there is no need to scan all the files of the table.

Compared with the prior art, the method of the present invention for establishing an index in an HDFS-based spark-sql big data processing system has the following beneficial effects.

Adding an index function to spark-sql can effectively increase query speed. For example, consider a typical spark-sql data table of 1000 GB stored as 1 GB files, i.e., 1000 files in total. To query a single record, the original approach must scan all 1000 files; once an index is established, only 1 file needs to be scanned, a 1000-fold improvement in efficiency. As a general estimate, drawing on experience with traditional relational databases, a spark-sql database with an index executes SQL queries 100 to 10000 times faster, or more, than one without an index.

Brief description of the drawings

The method of the present invention for establishing an index in an HDFS-based spark-sql big data processing system is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a schematic diagram of an ordinary data table and index tree structure in the prior art.

Fig. 2 is a diagram of the query principle in the prior art when an HDFS-based spark-sql big data processing system has no index.

Fig. 3 is a flow chart of adding an index according to the present invention.

Fig. 4 is a flow chart of deleting an index according to the present invention.

Fig. 5 is a flow chart of inserting data according to the present invention.

Fig. 6 is a flow chart of deleting data according to the present invention.

Fig. 7 is a flow chart of querying data according to the present invention.

Fig. 8 is a schematic structural diagram of the HDFS distributed storage system.

Fig. 9 is a schematic diagram of the data table and index tree structure in the distributed storage system of the present invention.

Detailed description of the embodiments

As shown in Figs. 3 to 9, in this embodiment the method for establishing an index in an HDFS-based spark-sql big data processing system adds index support to spark-sql. As in a traditional relational database, indexes can be added and deleted and data can be inserted and deleted through SQL statements. When data is queried, the system automatically determines whether an index exists on the queried column; if it does, it looks up the file blocks referenced by the index and filters out the file blocks that do not need to be queried, thereby speeding up the query.

1) As shown in Fig. 3, the add-index flow.

Adding an index means adding an index on a column of an existing data table, so that subsequent queries on that column can be accelerated by the index.

2) As shown in Fig. 4, the delete-index flow.

The flow for deleting the index of a table column is relatively simple: it is only necessary to locate the corresponding index file and delete it, and to update the table metadata by deleting the index information from the metadata.
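A minimal sketch of this delete-index flow is given below. It assumes, purely for illustration, that the table metadata is a dictionary with an "indexes" entry holding each index's column and file path, and that the index file is a local file rather than a file on HDFS; `drop_index` is a hypothetical name.

```python
import os
import tempfile

def drop_index(index_name, metadata):
    """Delete the index file and remove its entry from the table metadata."""
    info = metadata["indexes"].pop(index_name)  # raises KeyError if no such index
    if os.path.exists(info["path"]):
        os.remove(info["path"])                 # locate and delete the index file
    return info["path"]

# Set up a throwaway index file and a matching metadata entry.
fd, index_path = tempfile.mkstemp(suffix=".idx.json")
os.close(fd)
metadata = {
    "table": "t",
    "indexes": {"myindex": {"column": "b", "path": index_path}},
}

removed = drop_index("myindex", metadata)
```

The data files of the table are not touched at all, which is why this flow is the simplest of the five.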

3) As shown in Fig. 5, the insert-data flow (table has an index).

After a piece of data is inserted (including batch insertion, which in practice is a series of consecutive single insertions), it is determined whether the inserted data involves an index. If it does, the corresponding index structure must be adjusted by adding this piece of data and its associated file information to the index structure.

Throughout the flow, the table's data insertion flow remains the same as the original flow; the only addition is that, after the insertion completes, the name of the file containing the data is recorded and an index node is constructed from the returned file name.
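The insert flow can be sketched as below, as an illustration only. `write_row` is a hypothetical stand-in for the unchanged original insert flow, and a dictionary from column value to a set of file names stands in for the index tree.

```python
def write_row(table_files, row, target_file):
    """Stand-in for the unchanged insert flow; returns the file written to."""
    table_files.setdefault(target_file, []).append(row)
    return target_file

def insert_row(table_files, index, indexed_column, row, target_file):
    """Insert a row, then record value -> file name if the row involves the index."""
    file_name = write_row(table_files, row, target_file)
    if indexed_column in row:
        # Added step: build the index node from the returned file name.
        index.setdefault(row[indexed_column], set()).add(file_name)
    return file_name

table_files = {"t1-p1": [{"a": 1, "b": 15}]}
index = {15: {"t1-p1"}}
insert_row(table_files, index, "b", {"a": 5, "b": 22}, "t1-p7")
```

A batch insert would simply call `insert_row` once per row, matching the description of batch insertion as consecutive single insertions.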

4) As shown in Fig. 6, the delete-data flow (table has an index).

After a piece of data is deleted (including batch deletion, which in practice is a series of consecutive single deletions), it is determined whether the deleted data involves an index. If it does, the corresponding index structure must be adjusted by deleting the index node associated with this piece of data.

Throughout the flow, the table's data deletion flow remains the same as the original flow; the only addition is that, after the deletion completes, the index information corresponding to the deleted data is deleted.
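The delete flow can be sketched in the same style. The check that the file no longer holds any other row with the same value before removing the index entry is an assumption added here to keep a file-level index correct when values repeat; the source only states that the index information for the deleted data is removed. `delete_row` and the dictionary-based index are hypothetical.

```python
def delete_row(table_files, index, indexed_column, file_name, row):
    """Delete one row from a file, then remove index information that is now stale."""
    # Original delete flow (stand-in): remove the row from its data file.
    table_files[file_name].remove(row)
    if indexed_column not in row:
        return
    value = row[indexed_column]
    # Assumption: keep the entry if another row in this file still has the value.
    still_present = any(r.get(indexed_column) == value for r in table_files[file_name])
    if not still_present:
        files = index.get(value, set())
        files.discard(file_name)
        if not files:
            index.pop(value, None)  # no file holds this value any more

table_files = {
    "t1-p1": [{"a": 1, "b": 15}],
    "t1-p7": [{"a": 5, "b": 22}],
}
index = {15: {"t1-p1"}, 22: {"t1-p7"}}
delete_row(table_files, index, "b", "t1-p7", {"a": 5, "b": 22})
```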

5) As shown in Fig. 7, the query-data flow (table has an index).

When querying data, the corresponding node element is looked up in the index file according to the column value, and the element value is read to obtain the name of the file that contains the data. The original query flow then continues: the original flow would end up reading and querying all data files of the table, but here the unneeded files are filtered out according to the file names obtained in the previous step, and the query flow proceeds on the remaining files only. Filtering greatly reduces the number of files and lightens the query load. The SQL operation is then executed on the queried data, and the query result is returned.

It should be emphasized here that the index of this embodiment is based on spark-sql and differs from the index of a traditional database in that it is designed for handling large data volumes. Taking a traditional data storage capacity of 10 GB as an example, spark-sql can handle 1 PB, i.e., 100,000 times the usual traditional data storage capacity.

In an ordinary database, one data table generally corresponds to a few physical files in the file system. Spark-sql, by contrast, is typically deployed in combination with HDFS, which stores files in a distributed manner, and one data table may correspond to thousands or even tens of thousands of files stored on HDFS, as shown in Fig. 8.

A spark-sql deployment usually consists of several Spark nodes, with the HDFS distributed storage system as the underlying storage; that is, the data files reside in HDFS. In the figure, t1-p1 denotes part 1 of table t1 and is a physical file; likewise, t1-p2 denotes the part 2 file of table t1. The whole table t1 consists of 7 files, p1 to p7; similarly, table t2 consists of 3 files.

The original query flow scans all the files of the table when executing an SQL query.

For example: Select * from t where b=22

Spark-sql parses the above SQL statement and then looks up the database files of table t. The result is t1-p1, t1-p2, t1-p3, t1-p4, t1-p5, t1-p6, and t1-p7, 7 files in total. Leaving aside the splitting of oversized files, spark-sql creates 7 query tasks, one per file, which start scanning. All files are scanned until the qualifying records are found.
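The unindexed flow above, with one query task per file, can be imitated in a few lines. This is a toy model and not Spark code: a thread pool stands in for Spark's worker tasks, an in-memory dictionary stands in for the files on HDFS, and every row except (5, 22) is invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical contents of the 7 files of the table; each row is (a, b).
table_files = {
    "t1-p1": [(1, 15), (2, 40)],
    "t1-p2": [(3, 7)],
    "t1-p3": [(4, 31)],
    "t1-p4": [(6, 18)],
    "t1-p5": [(8, 64)],
    "t1-p6": [(9, 3)],
    "t1-p7": [(5, 22)],
}

def scan_file(file_name, b_value):
    """One query task: scan every row of a single file."""
    return [row for row in table_files[file_name] if row[1] == b_value]

def full_scan(b_value):
    """Scan all files of the table concurrently, one task per file."""
    names = list(table_files)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        parts = list(pool.map(scan_file, names, [b_value] * len(names)))
    rows = [row for part in parts for row in part]
    return rows, len(names)

print(full_scan(22))  # prints ([(5, 22)], 7)
```

Concurrency shortens the wall-clock time, but all 7 files are still read to return a single row.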

The present invention draws on the principle of ordinary indexes and improves on it to suit the storage characteristics of spark-sql.

The index granularity of the present invention differs from that of a traditional database index. A traditional database index generally points to the address of a record within a file. Because a spark-sql database table usually consists of many files, the approach taken by this method is to have the index on a field of a record point to a file, that is, to record which file contains the record. When the record is later looked up, the index leads directly to that file, and there is no need to scan all the files of the table.

Referring to the above example, table t consists of 7 files in total (t1-p1, t1-p2, t1-p3, t1-p4, t1-p5, t1-p6, and t1-p7) and has 2 fields, a and b, with an index established on field b. Assuming the table contains several records (not all records are shown here), the index is established as shown in Fig. 9.

The table records are the original records inserted into the table; as they are inserted, a B-tree index is built on column b. Each node in the index tree records the value of the node's corresponding database record and the physical file in the HDFS file system where that record resides.

When querying data, for example with the query statement

Select * from t where b=22;

the element 22 is found quickly and directly in the index tree. Following the pointer of element 22, the physical file containing the b column value 22 is identified as t1-p7. Only this file is then read and searched, and the record is returned once found.
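The indexed query in this example can be sketched as follows. It is an illustration under assumptions: a dictionary from b value to file names stands in for the B-tree of Fig. 9, the file contents other than the row (5, 22) in t1-p7 are invented, and `query_b_equals` is a hypothetical name.

```python
# Hypothetical contents of the 7 files of table t; each row is (a, b).
table_files = {
    "t1-p1": [(1, 15), (2, 40)],
    "t1-p2": [(3, 7)],
    "t1-p3": [(4, 31)],
    "t1-p4": [(6, 18)],
    "t1-p5": [(8, 64)],
    "t1-p6": [(9, 3)],
    "t1-p7": [(5, 22)],
}

# File-level index on column b: value -> names of the files that contain it.
index = {}
for name, rows in table_files.items():
    for _, b in rows:
        index.setdefault(b, set()).add(name)

def query_b_equals(b_value):
    """Run 'select * from t where b = b_value', scanning only indexed files."""
    all_files = list(table_files)            # what the original flow would read
    wanted = index.get(b_value, set())       # file names obtained from the index
    files_to_scan = [n for n in all_files if n in wanted]  # filter the rest out
    rows = [row for n in files_to_scan for row in table_files[n] if row[1] == b_value]
    return rows, files_to_scan

print(query_b_equals(22))  # prints ([(5, 22)], ['t1-p7'])
```

The selected file is still scanned row by row; the index only narrows which files are read, which is the file-level granularity described above.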

When an element is inserted into table t, the corresponding index tree is modified dynamically according to the value of column b; correspondingly, when an element is deleted from table t, the corresponding content in the index tree is also deleted dynamically.

As can be seen, although the index concept of the present invention in spark-sql is similar to that of a traditional database, there is a fundamental difference: in view of the big data processing characteristics of spark-sql, the present invention changes the index granularity from a position within a file, as in a traditional database, to a file in the spark database, thereby avoiding scans of large numbers of irrelevant files and the waste of system resources.

The index of the present invention applies to all SQL statements. That is, in any SQL query, simple or complex, whenever a query operation on an indexed column is involved, the files are first located according to the index and the SQL query operation is then carried out in the located files. This is fundamentally different from the approach of traditional relational databases.

Key points of the invention

1. Add a mechanism for supporting indexes on spark-sql, for example by supporting the following SQL statements:

Establish an index: create index myindex on t (b); where the keywords are create index on

Show indexes: show index from t; where the keywords are show index from

Delete an index: drop index myindex on t; where the keywords are drop index on
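To make the three statement forms concrete, the sketch below recognizes them with regular expressions. It is a hypothetical illustration and is not taken from spark-sql's parser: the function name, the returned tuples, and the choice of regular expressions are all assumptions.

```python
import re

# One pattern per statement form listed above; matching is case-insensitive.
PATTERNS = [
    ("create_index",
     re.compile(r"^create\s+index\s+(\w+)\s+on\s+(\w+)\s*\(\s*(\w+)\s*\)\s*;?$", re.I)),
    ("show_index",
     re.compile(r"^show\s+index\s+from\s+(\w+)\s*;?$", re.I)),
    ("drop_index",
     re.compile(r"^drop\s+index\s+(\w+)\s+on\s+(\w+)\s*;?$", re.I)),
]

def parse_index_statement(sql):
    """Return (kind, *names) for an index statement, or None for anything else."""
    text = sql.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return (kind,) + match.groups()
    return None

print(parse_index_statement("create index myindex on t (b);"))
# prints ('create_index', 'myindex', 't', 'b')
```

A statement that returns None would be handed to the existing SQL processing unchanged.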

2. File-based indexing mechanism

Spark-sql differs from a traditional relational database. One key point of the invention is that the index is built at the file level: an index entry points to a specific file in HDFS or another file system rather than to content within a file, so its granularity is coarser than that of a traditional database. Once an index is established on a database table according to the present invention, files irrelevant to a query can be effectively filtered out and the range of files to be queried is narrowed, thereby improving query efficiency.

3. The indexes established include but are not limited to unique indexes, primary key indexes, multi-attribute indexes, partial indexes, and expression indexes. These index types are consistent with the corresponding concepts in traditional databases. The data structures used to build the index include but are not limited to B-tree, Hash, GiST, and GIN; these data structures are consistent with the corresponding concepts in traditional databases.

The advantages of the present invention are as follows.

At present there is no publicly disclosed implementation scheme or method that supports index technology for spark-sql.

Consequently, in the publicly available technology, database tables established in spark-sql have no indexes, and query speed and query efficiency are limited. By establishing an indexing mechanism for spark-sql, query speed can be improved by several orders of magnitude, so that even with massive data, query efficiency and query speed are comparable to those of traditional relational databases.

It should be noted that the embodiments described above with reference to the drawings serve only to illustrate the present invention and not to limit its scope. Those of ordinary skill in the art should understand that modifications or equivalent replacements made to the present invention without departing from its spirit and scope shall all fall within the scope of the present invention. In addition, unless the context indicates otherwise, words appearing in the singular include the plural, and vice versa. Furthermore, unless specifically stated otherwise, all or part of any embodiment may be used in combination with all or part of any other embodiment.

Claims

1. A method for establishing an index in an HDFS-based spark-sql big data processing system, characterized in that: spark-sql adds support for standard SQL query statements on top of the big data processing platform Spark; through SQL statements, indexes are added and deleted and data is inserted and deleted in the HDFS-based spark-sql big data processing system; when data is queried, it is automatically determined whether an index exists on the queried column, and if it does, the file blocks referenced by the index are looked up and the file blocks that do not need to be queried are filtered out;

wherein adding the index means that the index on a field of a record points to a file, that is, records which file contains the record, so that when the record is later looked up it is only necessary to go directly to that file according to the index, without scanning all the files of the table;

wherein, when querying data, the corresponding node element is looked up in the index file according to the column value, and the element value is read to obtain the name of the file that contains the data; the original query flow then continues, in which the original query flow would end up reading and querying all data files of the table, but the unneeded files are filtered out according to the file names obtained in the previous step and the query flow proceeds on the remaining files; the SQL operation is then executed on the queried data, and the query result is returned.

2. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: when adding an index, a new index file is first created, the format of which can be set according to configuration; all records in the original table are then traversed, the location in the HDFS file system of the value of the column to be indexed is determined for each record, and the column value of the record and the corresponding file information are written into the index tree; after all records have been traversed, the index structure is saved as a file; finally, the table metadata is updated by writing the new index information into the table's metadata for use in subsequent queries.

3. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: when deleting the index of a table column, it is only necessary to locate the corresponding index file and delete it, and to update the table metadata by deleting the index information from the metadata.

4. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: after a piece of data is inserted, it is determined whether the inserted data involves an index; if it does, the corresponding index structure must be adjusted by adding this piece of data and its associated file information to the index structure.

5. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 4, characterized in that: throughout the flow, the table's data insertion flow remains the same as the original flow; the only addition is that, after the insertion completes, the name of the file containing the data is recorded and an index node is constructed from the returned file name.

6. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: after a piece of data is deleted, it is determined whether the deleted data involves an index; if it does, the corresponding index structure must be adjusted by deleting the index node associated with this piece of data.

7. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 6, characterized in that: throughout the flow, the table's data deletion flow remains the same as the original flow; the only addition is that, after the deletion completes, the index information corresponding to the deleted data is deleted.