CN110489403A - A method for preprocessing and storing high-volume data - Google Patents

A method for preprocessing and storing high-volume data

Info

Publication number
CN110489403A
Authority
CN
China
Prior art keywords
data
storage
load
file
pretreatment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910794841.7A
Other languages
Chinese (zh)
Inventor
赵伟
康磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Huaku Data Technology Co Ltd
Original Assignee
Jiangsu Huaku Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Huaku Data Technology Co Ltd
Priority to CN201910794841.7A
Publication of CN110489403A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/219: Managing data history or versioning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/258: Data format conversion from or to a database

Abstract

A method for preprocessing and storing high-volume data, comprising the following steps. Step 1: the first process is data preprocessing; its input is the source data file to be loaded, and its output is cluster storage-layer data files that the second process can use directly. Step 2: the second process is the data commit, in which the output of the preprocessing process is committed to the clustered database system. In this method, the data preprocessing process of a load does not need to hold the write lock of the load table and can run concurrently with other DML on the load table. The data preprocessing process is relatively independent, so it can be placed on cluster nodes with more idle resources, which achieves load balancing.

Description

A method for preprocessing and storing high-volume data
Technical field
The invention belongs to the field of databases, and in particular relates to a method for preprocessing and storing high-volume data.
Background
The data loading and storage process of a traditional large-scale MPP cluster is shown in Fig. 1. Traditional cluster data loading must hold the write lock of the load table for the entire duration of the load, so every other DML write operation on that table is blocked until the load finishes. When the volume of data to be loaded is large, the load monopolizes the table's write lock for a long time, and other DML business on the load table is blocked for a long time. In fact, the earlier phases of the loading process (data reading, distribution, parsing, and storage-layer format conversion) do not need to hold the write lock of the load table, because these phases do not affect the read/write versions; only the commit phase affects the read/write versions of the data.
Summary of the invention
The problem to be solved by the present invention is to provide a method for preprocessing and storing high-volume data.
To achieve the above object, the present invention provides the following technical solution: a method for preprocessing and storing high-volume data, comprising the following steps. Step 1: the first process is data preprocessing; its input is the source data file to be loaded, and its output is cluster storage-layer data files that the second process can use directly;
Step 2: the second process is the data commit, in which the output of the preprocessing process is committed to the clustered database system.
Preferably, the data preprocessing process in Step 1 works as follows. An independent thread reads the source data from the load machine and splits it into several subsets according to the corresponding distribution rule; each subset corresponds to the data subset that the traditional approach distributes to one cluster node. Multiple threads then validate each data subset and convert it to the storage-layer format. Another independent thread then distributes the finished storage-layer format files, by data subset, to the corresponding data nodes for use by the second process, the data commit. These stages run as a pipeline: source data is read while earlier data is being validated and converted to the storage-layer format and while finished storage-layer format files are being distributed to the corresponding data nodes. Once all the data has been processed and distributed, the main process initiates the commit and the load moves on to the second process.
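The three pipelined stages described above can be sketched in Python as one reader thread, several conversion threads, and one distributor thread connected by bounded queues. This is only an illustrative sketch, not the patent's implementation: the names `node_for`, `preprocess`, `NUM_NODES`, and `NUM_WORKERS`, the CRC32-based distribution rule, the two-field row format, and the in-memory lists that stand in for data nodes are all assumptions.

```python
import queue
import threading
import zlib
from collections import defaultdict

NUM_NODES = 3      # hypothetical number of data nodes
NUM_WORKERS = 4    # hypothetical number of conversion threads
_DONE = object()   # sentinel marking the end of a stream


def node_for(row: str) -> int:
    """Hypothetical distribution rule: hash the first column of the row."""
    key = row.split(",", 1)[0]
    return zlib.crc32(key.encode()) % NUM_NODES


def reader(rows, to_convert):
    """Single reader thread: read source rows and tag each with its subset."""
    for row in rows:
        to_convert.put((node_for(row), row))
    for _ in range(NUM_WORKERS):          # one sentinel per conversion thread
        to_convert.put(_DONE)


def converter(to_convert, to_distribute):
    """Conversion thread: validate a row and convert it to a stand-in format."""
    while True:
        item = to_convert.get()
        if item is _DONE:
            to_distribute.put(_DONE)
            return
        node, row = item
        fields = row.split(",")
        if len(fields) != 2:              # minimal validation: drop bad rows
            continue
        to_distribute.put((node, tuple(fields)))


def distributor(to_distribute, staged):
    """Single distributor thread: route converted records to their data node."""
    finished = 0
    while finished < NUM_WORKERS:
        item = to_distribute.get()
        if item is _DONE:
            finished += 1
            continue
        node, record = item
        staged[node].append(record)


def preprocess(rows):
    """Run the three stages concurrently and return the staged data per node."""
    to_convert = queue.Queue(maxsize=64)
    to_distribute = queue.Queue(maxsize=64)
    staged = defaultdict(list)
    threads = [threading.Thread(target=reader, args=(rows, to_convert))]
    threads += [threading.Thread(target=converter, args=(to_convert, to_distribute))
                for _ in range(NUM_WORKERS)]
    threads.append(threading.Thread(target=distributor, args=(to_distribute, staged)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return staged
```

The bounded queues are what make the stages overlap: the reader can run ahead of the converters only by the queue size, so reading, conversion, and distribution proceed side by side rather than one after another.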
Further preferably, the data preprocessing process in Step 1 does not need to hold the write lock of the load table, so preprocessing can run concurrently with other DML on the load table, and several loads into the same load table can run concurrently. In addition, the data preprocessing process can be placed on any node of the cluster according to the cluster's load.
Preferably, the data commit process in Step 2 needs to hold the write lock of the load table.
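The locking behaviour of the two processes can be illustrated with a small Python sketch in which the write lock is taken only for the commit. Everything here is hypothetical scaffolding for illustration: the `Table` class, `load`, `run_demo`, and the two `threading.Event` objects exist only to make the interleaving deterministic, and upper-casing a string stands in for validation and format conversion.

```python
import threading


class Table:
    """Toy load table guarded by a single write lock."""

    def __init__(self):
        self.write_lock = threading.Lock()
        self.rows = []
        self.events = []

    def dml_insert(self, row):
        """An ordinary DML write; it needs the write lock only briefly."""
        with self.write_lock:
            self.rows.append(row)
            self.events.append("dml committed")


def load(table, source, preprocessing_started, dml_finished):
    """Two-process load: lock-free preprocessing, then a short locked commit."""
    preprocessing_started.set()                  # process 1 begins, no lock held
    staged = [line.upper() for line in source]   # stand-in for validate + convert
    dml_finished.wait(timeout=5)                 # demo only: stay in process 1 until the DML is done
    with table.write_lock:                       # process 2: commit under the write lock
        table.rows.extend(staged)
        table.events.append("load committed")


def run_demo():
    table = Table()
    started, dml_finished = threading.Event(), threading.Event()
    loader = threading.Thread(target=load, args=(table, ["a", "b"], started, dml_finished))
    loader.start()
    started.wait(timeout=5)
    table.dml_insert("x")     # not blocked: the loader is still preprocessing
    dml_finished.set()
    loader.join()
    return table.events, table.rows
```

In this sketch the concurrent insert completes while the load is still preprocessing, which is the behaviour the method aims for; a load that held the lock from the start would force the insert to wait.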
Preferably, in Step 2 the storage-layer data files of the load table may have changed; moreover, because the loading process has multiple cluster nodes convert the data into storage-layer format files, unsaturated (that is, not full) storage-layer data files are produced.
Preferably, when generating storage-layer data files, the loading method gives them special file names to distinguish them from the storage-layer data files of the load table. Only when the commit process has obtained the cluster's permission to commit this load, that is, when it has acquired the write lock of the load table, does the storage layer merge the unsaturated data files, rename them according to the storage layer's naming rules, append them to the write version of the load table's storage-layer data files, and switch the read/write versions to complete the data commit process.
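A minimal Python sketch of the naming and merging idea follows. It assumes a file-per-chunk layout in one directory; the `_load_*.staged` and `data_NNNNNN.dat` naming patterns, the `FILE_CAPACITY` limit, and the functions `stage`, `visible_files`, `commit`, and `demo` are all invented for illustration and are not taken from the patent. For simplicity the sketch re-packs all staged rows into full-size files rather than merging only the unsaturated ones.

```python
import os
import tempfile
import threading
from pathlib import Path

FILE_CAPACITY = 4               # hypothetical number of rows in a "saturated" file
write_lock = threading.Lock()   # stand-in for the load table's write lock


def stage(table_dir: Path, load_id: str, part: int, rows: list) -> Path:
    """Preprocessing: write a specially named file that table readers ignore."""
    path = table_dir / f"_load_{load_id}_{part}.staged"
    path.write_text("".join(r + "\n" for r in rows))
    return path


def visible_files(table_dir: Path) -> list:
    """Files that belong to the table proper (hypothetical naming rule)."""
    return sorted(table_dir.glob("data_*.dat"))


def commit(table_dir: Path, load_id: str) -> list:
    """Commit: under the write lock, merge staged files and rename them into the table."""
    with write_lock:
        staged = sorted(table_dir.glob(f"_load_{load_id}_*.staged"))
        rows = [r for p in staged for r in p.read_text().splitlines()]
        next_no = len(visible_files(table_dir))
        created = []
        for start in range(0, len(rows), FILE_CAPACITY):
            chunk = rows[start:start + FILE_CAPACITY]
            tmp = table_dir / f"_load_{load_id}_merged_{start}.tmp"
            tmp.write_text("".join(r + "\n" for r in chunk))
            final = table_dir / f"data_{next_no:06d}.dat"
            os.replace(tmp, final)        # rename into the table's naming scheme
            created.append(final)
            next_no += 1
        for p in staged:                  # staged inputs are no longer needed
            p.unlink()
        return created


def demo():
    """Stage three under-filled files, commit them, and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        table_dir = Path(tmp)
        stage(table_dir, "L1", 0, ["a", "b", "c"])
        stage(table_dir, "L1", 1, ["d", "e"])
        stage(table_dir, "L1", 2, ["f"])
        visible_before = len(visible_files(table_dir))
        created = commit(table_dir, "L1")
        sizes = [len(p.read_text().splitlines()) for p in created]
        leftovers = len(list(table_dir.glob("_load_*")))
        return visible_before, sizes, leftovers
```

Until `commit` runs, `visible_files` returns nothing from this load, so readers of the table cannot see half-loaded data even though the staged files already sit on the data node.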
Preferably, the data handled by the present invention are formatted text or binary files organized in units of data rows.
Compared with the prior art, the beneficial effects of the present invention are: 1. the data preprocessing process of a load does not need to hold the write lock of the load table and can run concurrently with other DML on the load table;
2. the data preprocessing process is relatively independent and can be placed on cluster nodes with more idle resources, which achieves load balancing.
Brief description of the drawings
Fig. 1 is a diagram of the data loading and storage process of the traditional large-scale MPP cluster described in the background;
Fig. 2 is a process diagram of the method for preprocessing and storing high-volume data in an embodiment of the present invention.
Detailed description of embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.
Embodiment
Referring to Fig. 2, the method for preprocessing and storing high-volume data of this embodiment divides the entire loading process into two relatively independent processes.
The first process is data preprocessing, which does not need to hold the write lock of the load table. It merges the source-file reading phase, the data distribution phase, and the data parsing and storage-layer format conversion phase of the traditional MPP data load into one relatively independent process. An independent thread reads the source data from the load machine and splits it into several subsets according to the corresponding distribution rule; each subset corresponds to the data subset that the traditional approach distributes to one cluster node. Multiple threads then validate each data subset and convert it to the storage-layer format. Another independent thread then distributes the finished storage-layer format files, by data subset, to the corresponding data nodes for use by the second process, the data commit. These three stages run as a pipeline: source data is read while earlier data is being validated and converted to the storage-layer format and while finished storage-layer format files are being distributed to the corresponding data nodes. Once all the data has been processed and distributed, the main process initiates the commit and the load moves on to the second process.
Because preprocessing does not need to hold the write lock of the load table, it can run concurrently with other DML on the load table, and several loads into the same load table can run concurrently. In addition, the data preprocessing process can be placed on any node of the cluster according to the cluster's load.
The loading method of this patent places data distribution after parsing and storage-layer format conversion, so the reading of the source data is unrelated to the final distribution of the data. The source data can therefore be split evenly among several data preprocessing nodes, which balances the load of the preprocessing process. This avoids the situation in traditional MPP database loading where source data that is skewed under the distribution rule drives the load of one cluster node too high.
Compared with the existing MPP cluster load described in the background, the load machine no longer needs to divide the data according to the distribution rule and only provides a basic file-reading service. Because the distribution-rule computation is removed, the throughput of the load machine can improve to some extent. In addition, the entire data preprocessing process is parallel and can make full use of the hardware resources of the cluster nodes. A cluster node in the existing MPP cluster data load only has to process the source data distributed to it, whereas a cluster node in the loading method of this embodiment has to process all the data it reads and then distribute it to the different nodes according to the distribution rule.
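The load-balancing argument can be made concrete with a short Python comparison of how much parsing work each node receives under the two schemes. The function names, the CRC32-based distribution rule, and the sample data with one "hot" key are assumptions chosen for illustration only.

```python
import zlib
from collections import Counter


def work_by_distribution_rule(rows, num_nodes):
    """Traditional load: each node parses the rows that hash to it."""
    work = Counter()
    for row in rows:
        key = row.split(",", 1)[0]
        work[zlib.crc32(key.encode()) % num_nodes] += 1
    return [work[n] for n in range(num_nodes)]


def work_by_even_split(rows, num_nodes):
    """Proposed load: source rows are cut evenly, independent of the key."""
    base, extra = divmod(len(rows), num_nodes)
    return [base + (1 if n < extra else 0) for n in range(num_nodes)]


# Skewed sample: 90 of 100 rows share the same distribution key.
sample = [f"hot,{i}" for i in range(90)] + [f"k{i},{i}" for i in range(10)]
skewed = work_by_distribution_rule(sample, 4)
even = work_by_even_split(sample, 4)
```

With this sample, one node in `skewed` receives at least 90 rows to parse, while `even` assigns 25 rows to each of the four nodes. The skew still exists in where the data finally lives, but it no longer dictates which node does the expensive validation and format conversion.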
The second process is the data commit, which needs to hold the write lock of the load table. The commit process of this embodiment differs from that of the existing MPP data load described in the background. The existing MPP data load, while holding the write lock, appends the processed underlying files to the write version of the load table's storage-layer data files. The data loading method of this embodiment does not hold the write lock of the load table before the commit, so the storage-layer data files of the load table may change during the load. Moreover, in the loading method proposed in this embodiment, multiple cluster nodes convert the data into storage-layer format files, so unsaturated storage-layer data files are produced. When generating storage-layer data files, the proposed loading method gives them special file names to distinguish them from the storage-layer data files of the load table. Only when the commit process has obtained the cluster's permission to commit this load, that is, when it has acquired the write lock of the load table, does the storage layer merge the unsaturated data files, rename them according to the storage layer's naming rules, append them to the write version of the load table's storage-layer data files, and switch the read/write versions to complete the data commit process.
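The read/write version switch at the end of the commit can be pictured with the following Python sketch. The patent does not specify how versions are represented, so the `TableVersions` class, its immutable tuple of file names, and the method names are purely hypothetical.

```python
import threading


class TableVersions:
    """Toy model of a table whose readers see an immutable list of files."""

    def __init__(self, files=()):
        self._read_version = tuple(files)     # what queries currently see
        self._write_lock = threading.Lock()   # stand-in for the table write lock

    def snapshot(self):
        """Readers take the current read version without taking the write lock."""
        return self._read_version

    def commit_append(self, new_files):
        """Append files to a write version, then switch it in as the read version."""
        with self._write_lock:
            write_version = self._read_version + tuple(new_files)
            self._read_version = write_version    # the version switch
            return write_version
```

A query that took its snapshot before the commit keeps reading the old file list, while queries that start after the switch see the appended files; the lock is held only for the append and the switch.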
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims.

Claims (7)

1. A method for preprocessing and storing high-volume data, characterized in that the method comprises the following steps. Step 1: the first process is data preprocessing; its input is the source data file to be loaded, and its output is cluster storage-layer data files that the second process can use directly;
Step 2: the second process is the data commit, in which the output of the preprocessing process is committed to the clustered database system.
2. The method for preprocessing and storing high-volume data according to claim 1, characterized in that the data processed in Step 1 are formatted text or binary files organized in units of data rows.
3. The method for preprocessing and storing high-volume data according to claim 1, characterized in that, in the data preprocessing process of Step 1, an independent thread reads the source data from the load machine and splits it into several subsets according to the corresponding distribution rule, each subset corresponding to the data subset that the traditional approach distributes to one cluster node; multiple threads then validate each data subset and convert it to the storage-layer format; another independent thread then distributes the finished storage-layer format files, by data subset, to the corresponding data nodes for use by the second process, the data commit; these stages run as a pipeline, in which source data is read while earlier data is being validated and converted to the storage-layer format and while finished storage-layer format files are being distributed to the corresponding data nodes; and once all the data has been processed and distributed, the main process initiates the commit and the load moves on to the second process.
4. The method for preprocessing and storing high-volume data according to claim 1, characterized in that the data preprocessing process in Step 1 does not need to hold the write lock of the load table, so that preprocessing can run concurrently with other DML on the load table and several loads into the same load table can run concurrently, and the data preprocessing process can be placed on any node of the cluster according to the cluster's load.
5. The method for preprocessing and storing high-volume data according to claim 1, characterized in that the data commit process in Step 2 needs to hold the write lock of the load table.
6. The method for preprocessing and storing high-volume data according to claim 1, characterized in that, in Step 2, the storage-layer data files of the load table may have changed, and because the loading process has multiple cluster nodes convert the data into storage-layer format files, unsaturated storage-layer data files are produced.
7. The method for preprocessing and storing high-volume data according to claim 1, characterized in that, when generating storage-layer data files, the loading method gives them special file names to distinguish them from the storage-layer data files of the load table; only when the commit process has obtained the cluster's permission to commit this load, that is, when it has acquired the write lock of the load table, does the storage layer merge the unsaturated data files, rename them according to the storage layer's naming rules, append them to the write version of the load table's storage-layer data files, and switch the read/write versions to complete the data commit process.
CN201910794841.7A 2019-08-27 2019-08-27 A kind of method of the pretreatment and storage of high-volume data Pending CN110489403A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910794841.7A CN110489403A (en) 2019-08-27 2019-08-27 A kind of method of the pretreatment and storage of high-volume data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910794841.7A CN110489403A (en) 2019-08-27 2019-08-27 A kind of method of the pretreatment and storage of high-volume data

Publications (1)

Publication Number Publication Date
CN110489403A true CN110489403A (en) 2019-11-22

Family

ID=68554383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910794841.7A Pending CN110489403A (en) 2019-08-27 2019-08-27 A kind of method of the pretreatment and storage of high-volume data

Country Status (1)

Country Link
CN (1) CN110489403A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573068A (en) * 2015-01-23 2015-04-29 四川中科腾信科技有限公司 Information processing method based on megadata
CN106126601A (en) * 2016-06-20 2016-11-16 华南理工大学 A kind of social security distributed preprocess method of big data and system
WO2018028797A1 (en) * 2016-08-12 2018-02-15 Telefonaktiebolaget Lm Ericsson (Publ) Methods and systems for bulk loading of data into a distributed database

Similar Documents

Publication Publication Date Title
US9720992B2 (en) DML replication with logical log shipping
EP2191370B1 (en) Transaction aggregation to increase transaction processing throughput
US11169978B2 (en) Distributed pipeline optimization for data preparation
Wang et al. Multi-query optimization in mapreduce framework
CA2997061C (en) Method and system for parallelization of ingestion of large data sets
EP1544753A1 (en) Partitioned database system
US11461304B2 (en) Signature-based cache optimization for data preparation
CN102640151A (en) High throughput, reliable replication of transformed data in information systems
CN107180113B (en) Big data retrieval platform
Liu et al. ETLMR: a highly scalable dimensional ETL framework based on MapReduce
CN103678556A (en) Method for processing column-oriented database and processing equipment
US20120254173A1 (en) Grouping data
US11321228B1 (en) Method and system for optimal allocation, processing and caching of big data sets in a computer hardware accelerator
US10664460B2 (en) Index B-tree maintenance for linear sequential insertion
WO2008068114A1 (en) Workflow processing system and method with federated database system support
Ferraro Petrillo et al. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics
Liu et al. ETLMR: a highly scalable dimensional ETL framework based on mapreduce
US10740316B2 (en) Cache optimization for data preparation
CN103995827A (en) High-performance ordering method for MapReduce calculation frame
CN105677915A (en) Distributed service data access method based on engine
US10650021B2 (en) Managing data operations in an integrated database system
CN110489403A (en) A kind of method of the pretreatment and storage of high-volume data
US8229946B1 (en) Business rules application parallel processing system
Perwej et al. An extensive investigate the mapreduce technology
CN111090638B (en) Comparison method and device for transaction functions in database migration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191122