CN110489403A

CN110489403A - Method for the preprocessing and storage of high-volume data

Info

Publication number: CN110489403A
Application number: CN201910794841.7A
Authority: CN
Inventors: 赵伟; 康磊
Original assignee: Jiangsu Huaku Data Technology Co Ltd
Current assignee: Jiangsu Huaku Data Technology Co Ltd
Priority date: 2019-08-27
Filing date: 2019-08-27
Publication date: 2019-11-22

Abstract

A method for the preprocessing and storage of high-volume data, comprising the following steps. Step 1: the first process is data preprocessing; its input is the source data file to be loaded, and its output is a set of cluster storage-layer data files that the second process can use directly. Step 2: the second process is data submission (commit), in which the output of the preprocessing process is committed to the cluster database system. In this method, the data preprocessing process of a load does not need to hold the write lock on the load table and can run concurrently with other DML on the load table. The data preprocessing process is relatively independent, so it can be placed on a cluster node with more idle resources, which achieves load balancing.

Description

Method for the preprocessing and storage of high-volume data

Technical field

The invention belongs to the field of databases, and in particular relates to a method for the preprocessing and storage of high-volume data.

Background art

Fig. 1 shows the data load and storage process of a traditional large-scale MPP cluster. A traditional cluster data load must hold the write lock on the load table for the entire duration of the load, so every other DML write operation on that table is blocked until the load finishes. When the volume of data to load is large, the load monopolizes the table's write lock for a long time, and other DML workloads on the load table are blocked for just as long. In fact, the earlier stages of the load (reading the data, distributing it, parsing it, and converting it to the storage-layer format) do not need to hold the write lock on the load table, because these stages do not affect the read/write version. Only the commit stage affects the read/write version of the data.

Summary of the invention

The problem to be solved by the present invention is to provide a method for the preprocessing and storage of high-volume data.

To achieve the above object, the invention provides the following technical solution: a method for the preprocessing and storage of high-volume data, comprising the following steps. Step 1: the first process is data preprocessing; its input is the source data file to be loaded, and its output is a set of cluster storage-layer data files that the second process can use directly.

Step 2: the second process is data submission (commit), in which the output of the preprocessing process is committed to the cluster database system.

Preferably, in step 1 the data preprocessing proceeds as follows. An independent thread reads the source data from the loading machine and splits it into several subsets according to the corresponding distribution rule; each subset corresponds to the data subset that the traditional approach would distribute to one cluster node. Each subset is then validated and converted to the storage-layer format by multiple threads. Another independent thread distributes the finished storage-layer format files, subset by subset, to the corresponding data nodes, where the second process (data commit) uses them. These stages run as a pipeline: source data is read, validated and converted to the storage-layer format, and the finished storage-layer format files are distributed to the corresponding data nodes, all at the same time. Once all data has been processed and distributed, the main process initiates the commit and the load moves on to the second process.
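The three pipelined stages described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the row format (`key,value`), the CRC-based distribution rule `node_for`, the constants `NUM_NODES` and `NUM_WORKERS`, and the use of in-memory lists in place of storage-layer format files are all assumptions made for the example.

```python
import queue
import threading
import zlib

NUM_NODES = 3      # hypothetical number of data nodes
NUM_WORKERS = 4    # hypothetical number of conversion threads
_STOP = object()   # sentinel that shuts a pipeline stage down


def node_for(key: str) -> int:
    """Hypothetical distribution rule: stable hash of the key modulo node count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


def preprocess(lines):
    """Run the three pipelined stages; return {node_id: [converted rows]}."""
    parsed_q = queue.Queue(maxsize=64)   # reader -> converters
    ready_q = queue.Queue(maxsize=64)    # converters -> distributor
    per_node = {n: [] for n in range(NUM_NODES)}

    def reader():
        # Stage 1: one thread reads source rows and assigns each to a subset.
        for line in lines:
            key = line.split(",", 1)[0]
            parsed_q.put((node_for(key), line))
        for _ in range(NUM_WORKERS):
            parsed_q.put(_STOP)

    def converter():
        # Stage 2: several threads validate rows and convert them.
        while True:
            item = parsed_q.get()
            if item is _STOP:
                ready_q.put(_STOP)
                return
            node, line = item
            fields = line.split(",")
            if len(fields) == 2 and fields[1].isdigit():   # validation
                ready_q.put((node, (fields[0], int(fields[1]))))

    def distributor():
        # Stage 3: one thread hands converted rows to their target node.
        finished = 0
        while finished < NUM_WORKERS:
            item = ready_q.get()
            if item is _STOP:
                finished += 1
            else:
                node, row = item
                per_node[node].append(row)

    threads = [threading.Thread(target=reader),
               threading.Thread(target=distributor)]
    threads += [threading.Thread(target=converter) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return per_node
```

The bounded queues let all three stages run at once while keeping memory use limited; rows that fail validation are simply dropped in this sketch, whereas a real loader would presumably report them.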

Further preferably, the data preprocessing process in step 1 does not need to hold the write lock on the load table. Data preprocessing can therefore run concurrently with other DML on the load table, several loads into the same load table can run concurrently, and the data preprocessing process can be placed on any node of the cluster according to the cluster's load.

Preferably, the data commit process in step 2 requires the write lock on the load table.
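The difference in lock scope between the two processes can be illustrated with a small sketch. Everything here is a stand-in chosen for the example: `table_write_lock` models the load table's write lock, `table_rows` models the committed table data, and upper-casing models the preprocessing work.

```python
import threading

table_write_lock = threading.Lock()   # hypothetical stand-in for the table's write lock
table_rows = []                        # hypothetical stand-in for committed table data
events = []                            # records the order in which writes complete


def dml_insert(row):
    """An ordinary DML write: takes the write lock only briefly."""
    with table_write_lock:
        table_rows.append(row)
        events.append("dml")


def load(source_rows, during_preprocess=None):
    # Process 1: preprocessing. No table lock is held here.
    staged = [r.upper() for r in source_rows]
    if during_preprocess is not None:
        during_preprocess()            # other DML can run at this point
    # Process 2: commit. The write lock is held only for this short step.
    with table_write_lock:
        table_rows.extend(staged)
        events.append("commit")
```

Calling `load(["a", "b"], during_preprocess=lambda: dml_insert("x"))` lets the DML write finish before the commit, because the lock is free while preprocessing runs. Had `load` taken the lock at its start, as the traditional approach does, the same call would block.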

Preferably, in step 2 the storage-layer data files of the load table may have changed, and because the load converts data into storage-layer format files on multiple cluster nodes, unsaturated storage-layer data files are produced.

Preferably, when the loading method generates storage-layer data files, it gives them special file names that distinguish them from the storage-layer data files of the load table. Only when the commit process obtains the cluster's permission for this load to commit, that is, once it has acquired the write lock on the load table, does the storage layer merge the unsaturated data files, rename them according to the storage layer's naming rule, append them to the write version of the load table's storage-layer data files, and switch the read/write version to complete the data commit process.

Preferably, the data handled by the present invention are formatted text or binary files organized in units of data rows.

Compared with the prior art, the beneficial effects of the present invention are as follows:

1. The data preprocessing process of a load does not need to hold the write lock on the load table and can run concurrently with other DML on the load table.

2. The data preprocessing process is relatively independent, so it can be placed on a cluster node with more idle resources, which achieves load balancing.

Brief description of the drawings

Fig. 1 is a diagram of the data load and storage process of the traditional large-scale MPP cluster described in the background art.

Fig. 2 is a process diagram of the method for the preprocessing and storage of high-volume data in an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Embodiment

Referring to Fig. 2, the method for the preprocessing and storage of high-volume data of this embodiment divides the entire load into two relatively independent processes.

The first process is data preprocessing, which does not need to hold the write lock on the load table. It merges the stages of the traditional MPP data load (reading the source data file, distributing the data, parsing the data, and converting it to the storage-layer format) into one relatively independent process. An independent thread reads the source data from the loading machine and splits it into several subsets according to the corresponding distribution rule; each subset corresponds to the data subset that the traditional approach would distribute to one cluster node. Each subset is then validated and converted to the storage-layer format by multiple threads. Another independent thread distributes the finished storage-layer format files, subset by subset, to the corresponding data nodes, where the second process (data commit) uses them. These three stages run as a pipeline: source data is read, validated and converted to the storage-layer format, and the finished storage-layer format files are distributed to the corresponding data nodes, all at the same time. Once all data has been processed and distributed, the main process initiates the commit and the load moves on to the second process.

Because preprocessing does not need to hold the write lock on the load table, it can run concurrently with other DML on the load table, several loads into the same load table can run concurrently, and the data preprocessing process can be placed on any node of the cluster according to the cluster's load. The loading method of this patent performs data distribution after parsing and storage-layer format conversion, so reading the source data is independent of the data's final distribution. The source data can therefore be split evenly among several data preprocessing nodes, which balances the load of the preprocessing process. This avoids a situation that occurs in traditional MPP database loads, where source data that is skewed under the distribution rule drives the load of one cluster node too high.

Compared with the existing MPP cluster load described in the background art, the loading machine no longer needs to partition the data according to the distribution rule; it only provides a basic file reading service. Because the distribution-rule computation is removed, the throughput of the loading machine can improve to some extent. In addition, the entire data preprocessing process is parallel and can make full use of the hardware resources of the cluster nodes. There is one difference on the cluster-node side: in the existing MPP cluster load, a cluster node only processes the source data distributed to it, whereas in the loading method of this embodiment a cluster node must process all the data it reads and then distribute the results to the different nodes according to the distribution rule.
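The contrast between splitting source data evenly and splitting it by the distribution rule can be made concrete with a short sketch. The CRC-based `distribution_node` rule and the skewed sample data are assumptions made for the example; the patent does not specify a distribution rule.

```python
from collections import Counter
import zlib


def distribution_node(key, num_nodes):
    """Hypothetical distribution rule: stable hash of the key modulo node count."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def split_evenly(rows, num_parts):
    """Cut rows into contiguous chunks whose sizes differ by at most one.

    The split ignores the distribution key, so a skewed key cannot overload
    a single preprocessing node.
    """
    base, extra = divmod(len(rows), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


# Skewed source: 90 of 100 rows share one distribution key.
rows = [("hot", i) for i in range(90)] + [(f"k{i}", i) for i in range(10)]

# Work per node if preprocessing followed the distribution rule.
by_rule = Counter(distribution_node(key, 4) for key, _ in rows)

# Work per preprocessing node with an even cut of the source data.
even_sizes = [len(chunk) for chunk in split_evenly(rows, 4)]
```

With this sample, one node receives at least 90 rows under the distribution rule, while the even cut gives each of the four preprocessing nodes 25 rows. The final placement of the converted data still follows the distribution rule; only the preprocessing work is balanced.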

The second process is data submission (commit), which requires holding the write lock on the load table. The commit process of this embodiment differs from that of the existing MPP data load described in the background art. The existing MPP data load, while holding the write lock, appends the processed underlying files to the write version of the load table's storage-layer data files. The data load method of this embodiment does not hold the write lock on the load table before the commit, so the storage-layer data files of the load table may change during the load. Moreover, in the loading method proposed by this embodiment, multiple cluster nodes convert data into storage-layer format files, so unsaturated storage-layer data files are produced. For these reasons, the proposed loading method generates its storage-layer data files under special file names that distinguish them from the storage-layer data files of the load table. Only when the commit process obtains the cluster's permission for this load to commit, that is, once it has acquired the write lock on the load table, does the storage layer merge the unsaturated data files, rename them according to the storage layer's naming rule, append them to the write version of the load table's storage-layer data files, and switch the read/write version to complete the data commit process.
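The commit sequence described above (merge unsaturated files, rename, append to the write version, switch versions) might look like the following sketch. The file naming scheme (`_load_<id>_...` for staged files, `data_NNNN.dat` for table files), the `FILE_CAPACITY` threshold that defines a saturated file, and the in-memory version map are all hypothetical; the patent leaves these details to the storage layer.

```python
import os
import tempfile
import threading

FILE_CAPACITY = 4               # hypothetical row count of a "saturated" file
write_lock = threading.Lock()   # stand-in for the load table's write lock


def stage(directory, load_id, part, rows):
    """Preprocessing side: write a specially named file; no lock is needed."""
    path = os.path.join(directory, f"_load_{load_id}_{part}.tmp")
    with open(path, "w") as fh:
        fh.write("\n".join(rows) + "\n")


class Table:
    def __init__(self, directory):
        self.directory = directory
        self.versions = {0: []}   # version number -> list of data file names
        self.read_version = 0

    def commit_load(self, load_id):
        """Merge, rename, and append one load's staged files under the lock."""
        prefix = f"_load_{load_id}_"
        with write_lock:
            ready, leftovers = [], []
            for name in sorted(os.listdir(self.directory)):
                if not name.startswith(prefix):
                    continue
                path = os.path.join(self.directory, name)
                with open(path) as fh:
                    rows = fh.read().splitlines()
                if len(rows) >= FILE_CAPACITY:
                    ready.append(path)          # saturated: keep as is
                else:
                    leftovers.extend(rows)      # unsaturated: merge below
                    os.remove(path)
            for i in range(0, len(leftovers), FILE_CAPACITY):
                path = os.path.join(self.directory, f"{prefix}merged_{i}.tmp")
                with open(path, "w") as fh:
                    fh.write("\n".join(leftovers[i:i + FILE_CAPACITY]) + "\n")
                ready.append(path)
            # Rename to the table's naming rule and append to a new version.
            current = self.versions[self.read_version]
            new_files = []
            for number, path in enumerate(ready, start=len(current)):
                name = f"data_{number:04d}.dat"
                os.rename(path, os.path.join(self.directory, name))
                new_files.append(name)
            self.versions[self.read_version + 1] = current + new_files
            self.read_version += 1              # version switch
        return new_files
```

A usage example: after `stage(d, 7, 0, ["a", "b", "c", "d"])`, `stage(d, 7, 1, ["e"])`, and `stage(d, 7, 2, ["f", "g"])` in a fresh directory `d`, calling `Table(d).commit_load(7)` keeps the full file, merges the two partial files into one, and publishes both under the table's naming rule. Readers that hold the old version number never see the staged files.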

Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, replacements, and variations can be made to these embodiments without departing from the principles and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims

1. A method for the preprocessing and storage of high-volume data, characterized in that the method comprises the following steps. Step 1: the first process is data preprocessing; its input is the source data file to be loaded, and its output is a set of cluster storage-layer data files that the second process can use directly.

Step 2: the second process is data submission (commit), in which the output of the preprocessing process is committed to the cluster database system.

2. the method for the pretreatment and storage of a kind of high-volume data according to claim 1, it is characterised in that: described The data handled in step 1 are text that is formatted, being organized as unit of data line or binary file.

3. the method for the pretreatment and storage of a kind of high-volume data according to claim 1, it is characterised in that: described The process of data prediction is to take source data and according to corresponding distribution from load is machine-readable by an independent thread in step 1 Rule splits data into several subsets, and each subset is equivalent to the data subset that clustered node is distributed in traditional approach, so Carry out the conversion process of verification with storage layer format to each data subset by the way of multithreading afterwards, then by an independent line Journey, by the storage layer format file distributing handled well to corresponding back end, is loaded according to different data subsets for second Process data submits process to use, and the above-mentioned stage is that the mode of assembly line carries out, and source data, side verification and accumulation layer are read in side The conversion of format, while by the storage layer format file distributing handled well to corresponding back end, until entire data are all located Reason initiates second process that submission switchs to load by host process with after the completion of distribution.

4. the method for the pretreatment and storage of a kind of high-volume data according to claim 1, it is characterised in that: described Do not need the lock of writing for holding load table in step 1 in process of data preprocessing, thus the preprocessing process of data be can with add The concurrent progress of other DML of table is carried, the same load table can also concurrently be loaded, in addition can be loaded according to cluster Process of data preprocessing is placed on cluster on arbitrary node by situation.

5. the method for the pretreatment and storage of a kind of high-volume data according to claim 1, it is characterised in that: described During data are submitted in step 2, data submit during be need load table write lock.

6. the method for the pretreatment and storage of a kind of high-volume data according to claim 1, it is characterised in that: described The accumulation layer data file of two load tables can change, and loading procedure is to convert data to accumulation layer by multiple clustered nodes Formatted file generates so having unsaturated accumulation layer data file.

7. the method for the pretreatment and storage of a kind of high-volume data according to claim 1, it is characterised in that: described Loading method is when generating accumulation layer data file with the storage layer data text of the document form of special name and load table Part distinguishes, only when submission process obtains cluster, and this load is allowed to submit, be take load table write lock When, accumulation layer is responsible for merging unsaturated data file, then further according to additional after the naming rule renaming of accumulation layer The accumulation layer data file for writing load table is write in version, and is written and read version switching and completes data submission process.