CN105574093A

CN105574093A - Method for establishing index in HDFS based spark-sql big data processing system

Info

Publication number: CN105574093A
Application number: CN201510918956.4A
Authority: CN
Inventors: 张鋆; 冯骏
Original assignee: Shenzhen Huaxun Fangzhou Software Technology Co Ltd; Shenzhen Huaxun Ark Technology Co Ltd
Current assignee: Shenzhen Huaxun Ark Photoelectric Technology Co ltd; Shenzhen Huaxun Fangzhou Software Technology Co ltd
Priority date: 2015-12-10
Filing date: 2015-12-10
Publication date: 2016-05-11
Anticipated expiration: 2035-12-10
Also published as: CN105574093B; WO2017096939A1

Abstract

The invention discloses a method for establishing an index in an HDFS-based spark-sql big data processing system. In such a system, SQL statements are used to add an index, delete an index, insert data, and delete data; during a data query, the system automatically determines whether an index exists on the queried column and, if so, looks up the file blocks recorded in the index and filters out the file blocks that do not need to be queried. Giving spark-sql an index function can effectively increase query speed. For example, a typical spark-sql data table of 1,000 GB stored as 1 GB files comprises 1,000 files; querying a single record requires scanning all 1,000 files with the conventional method, whereas with an index only one file needs to be scanned, a 1,000-fold improvement in efficiency. Based on experience with conventional relational databases, it is estimated that, in general, SQL queries on a spark-sql database with an index run 100 to 10,000 times faster, or more, than on one without an index.

Description

A method for establishing an index in an HDFS-based spark-sql big data processing system

Technical field

The present invention relates to a method for establishing an index in a data processing system, and in particular to a method for establishing an index in an HDFS-based spark-sql big data processing system.

Background art

Spark-sql adds support for standard SQL query statements on top of the big data processing platform Spark.

Spark is a general-purpose parallel framework similar to Hadoop MapReduce, open-sourced by UC Berkeley, and it shares the advantages of Hadoop MapReduce. However, writing a Spark big data program requires mastery of the Scala language and coding against the open API function interfaces, which is tedious and complicated, and the SQL language that a large number of traditional database developers have mastered cannot be used on Spark. Spark-sql was created to solve these problems: it applies the traditional database table concept to the Spark processing framework, so that users can create tables and run queries with SQL statements just as they would with a traditional database table. Spark-sql automatically converts the corresponding operations into Spark built-in functions, hiding the complex processing details.

However, owing to the particular nature of the Spark big data processing platform, spark-sql does not support building an index on a data table; that is, it does not support an index-creation statement like that of a traditional database, for example:

create index myindex on table t(b);

This means: create an ordinary index named myindex on column b of table t.

After receiving this command, a traditional relational database begins building the index on column b of the table.

There are many types of index, such as B-tree indexes, hash indexes, GiST indexes, and GIN indexes. Taking the B-tree index as an example, a relational database builds an index as follows:

The database sets aside a separate storage area for storing the index tree.

A B-tree is generated from the fields in the column to be indexed (the column named b in the example) and saved to the designated storage area. Each node of the B-tree corresponds to one element of column b, and each node also contains a pointer that records where the corresponding element is stored in the database file.

When a new element is inserted into column b, it must also be inserted into the B-tree (which adjusts itself automatically), and the tree node records the element's position in the database file.

When an element is deleted from column b, it must also be deleted from the B-tree (which again adjusts itself automatically).

A database index is built on the physical files in which the database is stored, which is the "position in the database file" mentioned above. A database file may use a custom format as required, so the position of data within the file can be represented in different ways. The general idea, however, is always to record the exact position of an element in the file, so that a later search for that element does not need to traverse the file but can locate the record quickly from the recorded position, thereby speeding up the search.

As shown in Figure 1, the data file corresponds to table t, which has two columns, a and b, with an index built on column b. The index is the tree structure on the right; the element in each index node holds a pointer to the position of the corresponding element in the data file, and the index itself is also stored as a file.

When querying data, take for example the query statement

Select * from t where b = 22;

This selects all rows of table t whose column b equals 22. The database first parses the SQL statement and then finds that an index exists on column b.

It then quickly finds element 22 directly in the index tree and follows that element's pointer to the row whose column b value is 22, at address 0x90. It reads that row directly from the address and returns the result "5 22".
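The lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not how any particular database implements it: a sorted key list searched with bisect stands in for the B-tree, the record layout (two 4-byte integers per row) and all sample rows are hypothetical, and the values in column b are assumed to be unique.

```python
import bisect
import io

# Hypothetical data file for table t(a, b): each row is two 4-byte integers.
RECORD_SIZE = 8
rows = [(1, 40), (5, 22), (7, 13)]

data_file = io.BytesIO()
position_of = {}  # value of column b -> byte offset of its row in the data file
for a, b in rows:
    position_of[b] = data_file.tell()
    data_file.write(a.to_bytes(4, "big") + b.to_bytes(4, "big"))

# A sorted key list stands in for the B-tree (assumption: values of b are unique).
sorted_keys = sorted(position_of)


def lookup(b_value):
    """Return the row whose column b equals b_value, or None, without scanning the file."""
    i = bisect.bisect_left(sorted_keys, b_value)
    if i == len(sorted_keys) or sorted_keys[i] != b_value:
        return None
    data_file.seek(position_of[b_value])  # follow the "pointer" stored in the index
    record = data_file.read(RECORD_SIZE)
    return int.from_bytes(record[:4], "big"), int.from_bytes(record[4:], "big")
```

Calling lookup(22) seeks straight to the recorded offset and reads one record, which is the point of a position-level index: the cost of the read does not depend on the size of the data file.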

When an element is inserted into table t, the corresponding index tree is modified dynamically according to the value of column b; likewise, when an element is deleted from table t, the corresponding content in the index tree is deleted dynamically.

Index technology in traditional relational databases is relatively mature. What is described here is only the general principle; implementations vary widely and take no fixed form, but the basic principle is the same, so the construction methods and implementation steps of other index types are not repeated here.

For spark-sql, the underlying file storage differs from that of a traditional relational database (it usually uses HDFS rather than an ordinary Linux or Windows file system), and a table is usually very large and may even be associated with thousands of physical files, so spark-sql was designed without an index-creation function. Its design emphasizes the concurrency of data processing and neglects the efficiency of processing.

When spark-sql runs a query, it normally searches the entire data table. The volume of data that spark-sql handles is usually very large, and one database may correspond to many physical files; spark-sql searches all of the files using concurrency.

As shown in Figure 2, for the same SQL query statement:

Select * from t where b = 22;

Spark-sql first parses the SQL statement and then locates the data files of table t (four files here). These files are split into blocks of a specified size and distributed to different worker processes. The worker processes scan, in order, all of the file blocks that make up the table to find the rows whose column b value is 22, and return the results once they are found.
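The index-free scan can be illustrated with a small sketch. It is only a model of the behaviour described above, not spark-sql code: the file names and rows are hypothetical, each file is treated as a single block, and a thread pool stands in for the worker processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical contents of the four data files of table t; each row is (a, b).
table_files = {
    "t-part1": [(1, 40), (2, 17)],
    "t-part2": [(3, 8), (4, 31)],
    "t-part3": [(5, 22), (6, 90)],
    "t-part4": [(7, 13), (8, 64)],
}


def scan_file(file_name, b_value):
    """Scan every row of one file and keep the rows whose column b matches."""
    return [row for row in table_files[file_name] if row[1] == b_value]


def full_scan(b_value):
    """Query without an index: every file is scanned, one task per file."""
    names = list(table_files)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        per_file = list(pool.map(scan_file, names, [b_value] * len(names)))
    matches = [row for rows in per_file for row in rows]
    return matches, len(names)
```

Even though only one file holds a matching row, full_scan(22) reads all four files; concurrency shortens the wall-clock time but does not reduce the total work.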

As can be seen, without an index spark-sql searches a table in a relatively simple way that is inefficient, since every row of every data file must be scanned.

Beyond simple select statements, a traditional relational database can apply index technology wherever a query is involved, including complex queries, subqueries, and nested queries, to reduce the amount of data queried and speed up the query. Spark-sql has no such mechanism.

In summary, because current spark-sql lacks a data indexing mechanism, it cannot achieve optimal query speed and is inefficient compared with a traditional relational database.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for establishing an index in an HDFS-based spark-sql big data processing system. The method allows spark-sql to adapt to more, and more flexible, application scenarios, speeds up the SQL queries that spark-sql executes, improves the execution efficiency of spark-sql, and makes fuller use of spark-sql's capacity for processing big data.

To solve the above technical problem, the present invention provides a method for establishing an index in an HDFS-based spark-sql big data processing system: in the HDFS-based spark-sql big data processing system, SQL statements are used to add an index, delete an index, insert data, and delete data; when data is queried, the system automatically determines whether an index exists on the queried column and, if so, looks up the file blocks recorded in the index and filters out the file blocks that do not need to be queried.

When an index is added, a new index file is first created. The format of the index file can be set by configuration or other instructions; common formats include the B-tree and the hash index. All records in the original table are then traversed. For each record, the location in HDFS or another file system of the value of the column to be indexed is determined, and the column value of the record and the corresponding file information are recorded and written into the index tree structure. After all records have been traversed, the index structure is saved as a file. Finally, the table metadata is updated by writing the new index information into the table's metadata for use by subsequent queries.
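The index-creation steps can be sketched roughly as below. This is an illustration under assumptions rather than the patented implementation: a plain dictionary (hash-style) stands in for the B-tree or hash index, a JSON string stands in for the index file saved on HDFS, and the function name create_index, the metadata layout, and the sample rows are all hypothetical.

```python
import json


def create_index(table_files, metadata, index_name, column):
    """Build a value -> file-names index for one column and register it in the metadata."""
    index = {}
    for file_name, rows in table_files.items():  # traverse all records of the table
        for row in rows:
            index.setdefault(row[column], set()).add(file_name)
    # Serialise the index; this string stands in for the index file saved on HDFS.
    index_file = json.dumps({str(v): sorted(f) for v, f in index.items()}, sort_keys=True)
    # Finally record the new index in the table metadata for later queries.
    metadata.setdefault("indexes", {})[index_name] = {"column": column}
    return index, index_file


table_files = {
    "t1-p1": [{"a": 1, "b": 40}, {"a": 2, "b": 17}],
    "t1-p2": [{"a": 5, "b": 22}, {"a": 6, "b": 40}],
}
metadata = {"table": "t"}
index, index_file = create_index(table_files, metadata, "myindex", "b")
```

Note that a value may occur in more than one file (40 above), so each index entry holds a set of file names rather than a single one.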

To delete the index on a column of a table, it is only necessary to locate the corresponding index file and delete it, and at the same time update the table metadata by deleting the index information from it.
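A minimal sketch of index deletion follows. The local file system stands in for HDFS, and the metadata layout (an "indexes" entry holding the column and the index file path) is a hypothetical choice made only for this example.

```python
import os
import tempfile


def drop_index(metadata, index_name):
    """Delete the index file and remove the index entry from the table metadata."""
    entry = metadata["indexes"].pop(index_name)  # update the table metadata
    if os.path.exists(entry["path"]):
        os.remove(entry["path"])  # delete the index file itself
    return entry["path"]


# Set up a throwaway index file so that the sketch can run locally.
workdir = tempfile.mkdtemp()
index_path = os.path.join(workdir, "myindex.idx")
with open(index_path, "w") as f:
    f.write("{}")
metadata = {"table": "t", "indexes": {"myindex": {"column": "b", "path": index_path}}}

removed = drop_index(metadata, "myindex")
```

Because the data files are never touched, dropping an index costs the same regardless of the size of the table.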

After data is inserted, it is determined whether the inserted data involves an index. If it does, the corresponding index structure must be adjusted by adding the data and its associated file information to the index structure.

Throughout this flow, the existing data-insertion flow of the data table is reused unchanged; only after the insertion completes is the name of the file holding the data recorded, and an index node is constructed from the returned file name.
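Index maintenance on insertion might look like the following sketch. The helper write_row is a hypothetical stand-in for the unchanged insert flow (it simply appends to an in-memory list and returns the file name), and the index layout, column name to value to set of file names, is an assumption made for illustration.

```python
def write_row(table_files, row, target_file):
    """Stand-in for the unchanged insert flow; returns the name of the file written."""
    table_files.setdefault(target_file, []).append(row)
    return target_file


def insert_row(table_files, indexes, row, target_file):
    """Insert a row, then add (value, file) to every index defined on the table."""
    file_name = write_row(table_files, row, target_file)  # original flow, unchanged
    for column, index in indexes.items():  # extra step: maintain the indexes
        if column in row:
            index.setdefault(row[column], set()).add(file_name)
    return file_name


table_files = {"t1-p1": [{"a": 1, "b": 40}]}
indexes = {"b": {40: {"t1-p1"}}}  # column name -> (value -> file names)
insert_row(table_files, indexes, {"a": 5, "b": 22}, "t1-p7")
```

The index update needs only the file name returned by the write, which is why the original insert flow can stay as it is.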

After data is deleted, it is determined whether the deleted data involves an index. If it does, the corresponding index structure must be adjusted by deleting the index node associated with the data.

Throughout this flow, the existing data-deletion flow of the data table is reused unchanged; only after the deletion completes is a step added that deletes the index information corresponding to the deleted data.
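Deletion is slightly subtler than insertion, because a file must stay in the index for as long as any remaining row in that file still holds the value. The sketch below handles this by keeping a per-file row count in each index entry; that counting scheme is an assumption introduced for this example and is not described in the source, which only states that the associated index node is deleted.

```python
def delete_rows(table_files, index, column, predicate):
    """Delete rows matching predicate, then remove their entries from the index on column."""
    removed = []  # (indexed value, file name) for every deleted row
    for file_name, rows in list(table_files.items()):  # original delete flow
        table_files[file_name] = [r for r in rows if not predicate(r)]
        removed += [(r[column], file_name) for r in rows if predicate(r)]
    for value, file_name in removed:  # extra step: adjust the index
        counts = index[value]
        counts[file_name] -= 1
        if counts[file_name] == 0:
            del counts[file_name]  # no row with this value is left in the file
        if not counts:
            del index[value]  # no row with this value is left in the table
    return len(removed)


table_files = {
    "t1-p1": [{"a": 1, "b": 40}, {"a": 2, "b": 40}],
    "t1-p7": [{"a": 5, "b": 22}],
}
# value of b -> {file name: number of rows in that file holding the value}
index = {40: {"t1-p1": 2}, 22: {"t1-p7": 1}}

n_first = delete_rows(table_files, index, "b", lambda r: r["a"] == 1)
n_second = delete_rows(table_files, index, "b", lambda r: r["b"] == 22)
```

After the first call the entry for 40 still points at t1-p1, since one row with that value remains; after the second call the entry for 22 disappears entirely.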

When data is queried, the corresponding node element is looked up in the index file according to the column value, and the element value is read to obtain the name of the file that holds the data. The original query flow then continues. The original flow would ultimately read all data files of the table; before it does so, invalid files are filtered out using the file names obtained in the previous step, and the query flow continues on the remaining files. The SQL operation is then performed on the data retrieved, and the query result is returned.
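The query flow with file filtering can be sketched as follows. This is a simplified model with hypothetical names and data: the metadata key "indexed_columns", the in-memory index layout, and the sample rows are assumptions, and only an equality predicate on a single column is handled.

```python
def query(table_files, metadata, indexes, column, value):
    """Run `select * from t where column = value`, pruning files when an index exists."""
    candidates = list(table_files)  # original flow: all files of the table
    if column in metadata.get("indexed_columns", ()):
        wanted = indexes[column].get(value, set())  # look the value up in the index
        candidates = [f for f in candidates if f in wanted]  # filter out invalid files
    # Original query flow, now run only over the remaining files.
    rows = [r for f in candidates for r in table_files[f] if r[column] == value]
    return rows, candidates


table_files = {
    "t1-p1": [{"a": 1, "b": 40}],
    "t1-p2": [{"a": 3, "b": 8}],
    "t1-p7": [{"a": 5, "b": 22}],
}
metadata = {"indexed_columns": ["b"]}
indexes = {"b": {40: {"t1-p1"}, 8: {"t1-p2"}, 22: {"t1-p7"}}}

rows_b, scanned_b = query(table_files, metadata, indexes, "b", 22)
rows_a, scanned_a = query(table_files, metadata, indexes, "a", 5)
```

Both queries return the same row, but the query on the indexed column b reads one file, while the query on the unindexed column a falls back to reading all three.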

The index on a field of a record points to a file; that is, it records which file contains the record. When the record is later searched for, the index leads directly to that file, and there is no need to scan all the files that make up the table.

Compared with the prior art, the method of the present invention for establishing an index in an HDFS-based spark-sql big data processing system has the following beneficial effects.

Adding an index function to spark-sql can effectively increase query speed. For example, a typical spark-sql data table of 1,000 GB stored as 1 GB files is divided into 1,000 files. Querying a single record requires scanning 1,000 files in the original approach, but only 1 file once an index is built, a 1,000-fold improvement in efficiency. Estimated for the general case and based on experience with traditional relational databases, SQL queries on a spark-sql database with an index run 100 to 10,000 times faster, or more, than without an index.
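The arithmetic behind the 1,000-fold figure can be written out as a short calculation. It rests on two simplifying assumptions that are not stated in the source: query cost is proportional to the number of files scanned, and the cost of the index lookup itself is negligible. The function names are hypothetical.

```python
def files_in_table(table_gb, file_gb):
    """Number of files when a table is stored as equal-sized files (rounded up)."""
    return -(-table_gb // file_gb)


def scan_reduction(total_files, files_after_pruning):
    """Factor by which the number of files scanned drops when the index prunes files."""
    return total_files / files_after_pruning


total = files_in_table(1000, 1)    # 1,000 GB table stored as 1 GB files
factor = scan_reduction(total, 1)  # a single-record query touches one file
```

Under the same assumptions, a query whose matching rows are spread over more files gains proportionally less, so the factor depends on how the queried values are distributed across files.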

Brief description of the drawings

The method of the present invention for establishing an index in an HDFS-based spark-sql big data processing system is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a schematic diagram of an ordinary data table and index tree structure in the prior art.

Fig. 2 is a schematic diagram of a query without an index in an HDFS-based spark-sql big data processing system in the prior art.

Fig. 3 is a flowchart of adding an index according to the present invention.

Fig. 4 is a flowchart of deleting an index according to the present invention.

Fig. 5 is a flowchart of adding data according to the present invention.

Fig. 6 is a flowchart of deleting data according to the present invention.

Fig. 7 is a flowchart of querying data according to the present invention.

Fig. 8 is a schematic structural diagram of the HDFS distributed storage system.

Fig. 9 is a schematic diagram of the data table and index tree structure in the distributed storage system of the present invention.

Embodiment

As shown in Figs. 3 to 9, the method of this embodiment for establishing an index in an HDFS-based spark-sql big data processing system adds index support to spark-sql. As in a traditional relational database, SQL statements are used to add an index, delete an index, insert data, and delete data; when data is queried, the system automatically determines whether an index exists on the queried column and, if so, looks up the file blocks recorded in the index and filters out the file blocks that do not need to be queried, thereby speeding up the query.

1) Index-addition flow, as shown in Fig. 3.

Adding an index means adding an index for a given column of an existing data table, so that subsequent queries on that column can be accelerated by the index.

2) Index-deletion flow, as shown in Fig. 4.

The flow for deleting the index on a column of a table is relatively simple: it is only necessary to locate the corresponding index file and delete it, and at the same time update the table metadata by deleting the index information from it.

3) Data-insertion flow (the table already has an index), as shown in Fig. 5.

After data is inserted (including batch insertion, which in practice is a sequence of single-record insertions), it is determined whether the inserted data involves an index. If it does, the corresponding index structure must be adjusted by adding the data and its associated file information to the index structure.

Throughout this flow, the existing data-insertion flow of the data table is reused unchanged; only after the insertion completes is the name of the file holding the data recorded, and an index node is constructed from the returned file name.

4) Data-deletion flow (the table already has an index), as shown in Fig. 6.

After data is deleted (including batch deletion, which in practice is a sequence of single-record deletions), it is determined whether the deleted data involves an index. If it does, the corresponding index structure must be adjusted by deleting the index node associated with the data.

Throughout this flow, the existing data-deletion flow of the data table is reused unchanged; only after the deletion completes is a step added that deletes the index information corresponding to the deleted data.

5) Data-query flow (the table has an index), as shown in Fig. 7.

When data is queried, the corresponding node element is looked up in the index file according to the column value, and the element value is read to obtain the name of the file that holds the data. The original query flow then continues. The original flow would ultimately read all data files of the table; before it does so, invalid files are filtered out using the file names obtained in the previous step, and the query flow continues on the remaining files. Filtering greatly reduces the number of files and lightens the query load. The SQL operation is then performed on the data retrieved, and the query result is returned.

It should be emphasized here that the spark-sql-based index of this embodiment differs from the index of a traditional database in that it is designed for processing large volumes of data. A traditional database typically stores about 10 GB, whereas spark-sql can handle 1 PB, that is, 100,000 times the storage capacity of a typical traditional database.

A data table in an ordinary database generally corresponds to a few physical files in the file system. Spark-sql, by contrast, is typically deployed together with HDFS and stores files in a distributed manner; one data table may correspond to thousands or even tens of thousands of files stored on HDFS, as shown in Fig. 8.

A spark-sql deployment usually consists of several Spark nodes, and its underlying storage uses the HDFS distributed storage system; that is, the data files reside in HDFS. In the figure, t1-p1 denotes part 1 of table t1 and is a physical file; likewise, t1-p2 denotes the part 2 file of t1. The whole table t1 consists of seven files, p1 to p7; similarly, table t2 consists of three files.

When executing an SQL query, the original query flow scans all files of the table.

For example: Select * from t where b = 22;

Spark-sql parses the above SQL statement and then looks up the database files corresponding to table t. The result is seven files in total: t1-p1, t1-p2, t1-p3, t1-p4, t1-p5, t1-p6, and t1-p7. Leaving aside the splitting of oversized files, spark-sql creates seven query tasks, one per file, and each starts scanning its file; all files are scanned until the rows that satisfy the condition are found.

On this basis, the present invention draws on the principle of ordinary indexes and improves on it to suit the storage characteristics of spark-sql.

The index granularity of the present invention differs from that of a traditional database index. A traditional database index generally points to the address of a record within a file. Because a spark-sql table usually consists of many files, the approach taken by this method is for the index on a field of a record to point to a file, that is, to record which file contains the record. When the record is later searched for, the index leads directly to that file, and there is no need to scan all the files that make up the table.

Continuing the above example, table t consists of seven files in total (t1-p1, t1-p2, t1-p3, t1-p4, t1-p5, t1-p6, and t1-p7) and has two fields, a and b, with an index built on field b. Assuming the table contains some records (not all records are shown here), the index is built as shown in Fig. 9.

The table records are the raw records inserted into the table; as they are inserted, a B-tree index is built on column b. Each node of the index tree records the value in the corresponding database record and the physical file in the HDFS file system that holds that record.

When querying data, take for example the query statement

Select * from t where b = 22;

Element 22 is found quickly and directly in the index tree, and its pointer shows that the physical file holding the rows whose column b value is 22 is t1-p7. Only the content of that file is then read and searched, and the record is returned once found.
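The file-level lookup in this example can be sketched with a sorted list standing in for the B-tree. Only the mapping of 22 to t1-p7 comes from the example above; the other index entries are made up for illustration, and the function name files_to_scan is hypothetical.

```python
import bisect

# Hypothetical index entries for column b, sorted by value: (value, file name).
# Only the entry 22 -> t1-p7 comes from the example; the rest are made up.
entries = [(8, "t1-p2"), (13, "t1-p4"), (22, "t1-p7"), (40, "t1-p1"), (64, "t1-p5")]
keys = [value for value, _ in entries]
all_files = ["t1-p%d" % i for i in range(1, 8)]  # the seven files of table t


def files_to_scan(b_value):
    """Return the files that a query `where b = b_value` has to read."""
    lo = bisect.bisect_left(keys, b_value)
    hi = bisect.bisect_right(keys, b_value)
    return sorted({name for _, name in entries[lo:hi]})
```

For b = 22 the planner would create one scan task for t1-p7 instead of seven tasks, one per entry of all_files; a value absent from the index yields an empty file list, so no data file needs to be read at all.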

When an element is inserted into table t, the corresponding index tree is modified dynamically according to the value of column b; correspondingly, when an element is deleted from table t, the corresponding content in the index tree is deleted dynamically.

As can be seen, although the index concept for spark-sql in the present invention is similar to that of a traditional database, there is a fundamental difference. In view of the big data that spark-sql processes, the present invention changes the index granularity from a position within a file, as in a traditional database, to a file in the spark database, thereby avoiding scans of large numbers of invalid files and the resulting waste of system resources.

The index of the present invention applies to all SQL statements. Whether the SQL query is simple or complex, any query operation that involves an indexed column first uses the index to locate the files and then performs the SQL query within the located files; this differs fundamentally from the approach of traditional relational databases.

Key points of the present invention

1. A mechanism supporting indexes is added to spark-sql, for example support for the following SQL statements:

Create an index: create index myindex on t (b); the keywords are create index ... on

Show indexes: show index from t; the keywords are show index from

Drop an index: drop index myindex on t; the keywords are drop index ... on
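To make the three statement forms concrete, the sketch below recognises them with regular expressions. This is purely illustrative: the source does not say how the statements are parsed, a real implementation would presumably extend the spark-sql grammar instead, and the function name parse_index_statement and the returned tuple shape are hypothetical.

```python
import re

# Hypothetical patterns for the three statements; a real parser would extend the
# spark-sql grammar rather than use regular expressions.
PATTERNS = [
    ("create", re.compile(r"create\s+index\s+(\w+)\s+on\s+(\w+)\s*\(\s*(\w+)\s*\)\s*;?\s*$", re.I)),
    ("show", re.compile(r"show\s+index\s+from\s+(\w+)\s*;?\s*$", re.I)),
    ("drop", re.compile(r"drop\s+index\s+(\w+)\s+on\s+(\w+)\s*;?\s*$", re.I)),
]


def parse_index_statement(sql):
    """Return (action, captured names) for an index statement, or None if it is not one."""
    for action, pattern in PATTERNS:
        match = pattern.match(sql.strip())
        if match:
            return action, match.groups()
    return None
```

For instance, parse_index_statement("create index myindex on t (b);") yields the action "create" together with the index name, table name, and column name, while an ordinary select statement yields None and would be passed on to the normal query path.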

2. A file-based indexing mechanism

Spark-sql differs from a traditional relational database. One of the key points of the present invention is that the index is built at the file level: the index points to a specific file in HDFS or another file system rather than to content within a file, so its granularity is coarser than that of a traditional database. Once an index has been built on a database table according to the present invention, invalid files can be filtered out of a query effectively and the range of files queried is reduced, thereby improving query efficiency.

3. The indexes that can be built include, but are not limited to, unique indexes, primary key indexes, multi-column indexes, partial indexes, and expression indexes; these index types have the same meaning as in a traditional database. The data structures used to build an index include, but are not limited to, B-tree, hash, GiST, and GIN; these data structures have the same meaning as in a traditional database.

The advantages of the present invention are as follows.

There is currently no publicly disclosed implementation scheme or method for supporting index technology in spark-sql.

Consequently, in the currently disclosed technology, database tables created in spark-sql have no indexes, and their query speed and efficiency are limited. Building an indexing mechanism for spark-sql can improve query speed by several orders of magnitude, so that even with massive data volumes, query efficiency and speed are on a par with those of a traditional relational database.

It should be noted that the embodiments described above with reference to the drawings serve only to illustrate the present invention and not to limit its scope. Those of ordinary skill in the art will understand that modifications or equivalent substitutions made to the present invention without departing from its spirit and scope shall all fall within the scope of the present invention. Furthermore, unless the context indicates otherwise, words appearing in the singular include the plural, and vice versa. In addition, unless specifically stated otherwise, all or part of any embodiment may be used in combination with all or part of any other embodiment.

Claims

1. A method for establishing an index in an HDFS-based spark-sql big data processing system, characterized in that: in the HDFS-based spark-sql big data processing system, SQL statements are used to add an index, delete an index, insert data, and delete data; when data is queried, it is automatically determined whether an index exists on the queried column, and if so, the file blocks recorded in the index are looked up and the file blocks that do not need to be queried are filtered out.

2. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: when an index is added, a new index file is first created, the format of which can be set by configuration or other instructions, common formats including the B-tree and the hash index; all records in the original table are then traversed; for each record, the location in HDFS or another file system of the value of the column to be indexed is determined, and the column value of the record and the corresponding file information are recorded and written into the index tree structure; after all records have been traversed, the index structure is saved as a file; finally, the table metadata is updated by writing the new index information into the table's metadata for use by subsequent queries.

3. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: to delete the index on a column of a table, it is only necessary to locate the corresponding index file and delete it, and at the same time to update the table metadata by deleting the index information from it.

4. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: after data is inserted, it is determined whether the inserted data involves an index; if it does, the corresponding index structure is adjusted by adding the data and its associated file information to the index structure.

5. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: throughout the flow, the existing data-insertion flow of the data table is reused unchanged; only after the insertion completes is the name of the file holding the data recorded, and an index node is constructed from the returned file name.

6. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: after data is deleted, it is determined whether the deleted data involves an index; if it does, the corresponding index structure is adjusted by deleting the index node associated with the data.

7. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: throughout the flow, the existing data-deletion flow of the data table is reused unchanged; only after the deletion completes is a step added that deletes the index information corresponding to the deleted data.

8. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: when data is queried, the corresponding node element is looked up in the index file according to the column value, and the element value is read to obtain the name of the file that holds the data; the original query flow then continues; the original flow would ultimately read all data files of the table, and before it does so, invalid files are filtered out using the file names obtained in the previous step, and the query flow continues on the remaining files; the SQL operation is then performed on the data retrieved, and the query result is returned.

9. The method for establishing an index in an HDFS-based spark-sql big data processing system according to claim 1, characterized in that: the index on a field of a record points to a file, that is, it records which file contains the record; when the record is later searched for, the index leads directly to that file, and there is no need to scan all the files that make up the table.