CN104376082B - A method of the data in data source file are imported into database - Google Patents

A method of the data in data source file are imported into database Download PDF

Info

Publication number
CN104376082B
CN104376082B CN201410658208.2A CN201410658208A CN104376082B CN 104376082 B CN104376082 B CN 104376082B CN 201410658208 A CN201410658208 A CN 201410658208A CN 104376082 B CN104376082 B CN 104376082B
Authority
CN
China
Prior art keywords
data
queue
database
pretreatment
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410658208.2A
Other languages
Chinese (zh)
Other versions
CN104376082A (en
Inventor
王懋成
刘迪
吴文青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN201410658208.2A priority Critical patent/CN104376082B/en
Publication of CN104376082A publication Critical patent/CN104376082A/en
Application granted granted Critical
Publication of CN104376082B publication Critical patent/CN104376082B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a kind of data in data source file to imported into the method in database, comprising the following steps: A. reads the data in the data source file;B. the data are stored in line by line into initial data queue;C. multiple data processing inbound threads are set;D. the data in the initial data queue are pre-processed by the multiple data processing inbound thread, and the pretreated data parallel is written in the database by the multiple data processing inbound thread.The present invention realizes that data parallel imports by multithreading, takes full advantage of hardware resource, so that storage efficiency is improved, so that the speed that data import greatly improves.

Description

A method of the data in data source file are imported into database
Technical field
The present invention relates to field of data storage, in particular it relates to which a kind of data in data source file import Method into database.
Background technique
With the continuous growth of application system quantity, a large amount of data import demand and give birth to therewith, and existing database (example Such as, INFORMIX, MYSQL etc.) simple regular derivative tool in order is only provided, not to the utmost such as people in efficiency and scalability Meaning.
Database used in current each bank is mostly INFORMIX database, and the frequently-used data that official provides is led Entering method has load and two kinds of dbload, but all there is significant limitations in terms of data processing for both methods:
Firstly, the boot speed of both methods is slower.By taking the data file that the every row 27 of 1,000,000 rows arranges as an example: passing through When load mode imported into the data file in database, reporting an error in any importing process can all be caused because data volume is excessive Lead to Database lock resource exhaustion and rollback.When the data file being imported into database by dbload mode, need It is 184440 milliseconds time-consuming.
It is imported secondly, both methods can only implement linear read to text file.Load and dbload is due to being official The tool of offer, therefore not expansible, can only by every row order linear carry out data importing.
The wrong output format of third, both methods not can customize.The error message of load method is output at screen It on curtain, and once malfunctions, data are accordingly to be regarded as malfunctioning by the gross, and program does not continue to processing data and imports.Although dbload Can with customized error log name, but format be it is fixed, cannot achieve customized.
Based on the above reasons, it is badly in need of a kind of with higher storage efficiency, and drawbacks described above can be overcome, data are led Enter method, to meet the needs of big data quantity importing.
Summary of the invention
With the enterprise application system of big data era, Data Migration or processing become more frequently, traditional data Lead-in mode is no longer satisfied existing demand.The importing of millions data volume needs to take several hours easily, therefore number According to importing efficiency it is in urgent need to be improved.In consideration of it, the present invention provides a kind of data in data source file to imported into database In method.
The invention discloses a kind of data in data source file to imported into the method in database, including following step Rapid: A. reads the data in the data source file;B. the data are stored in line by line into initial data queue;C. it sets more A data processing inbound thread;D. by the multiple data processing inbound thread to the data in the initial data queue into Row pretreatment, and the pretreated data parallel is written to by the multiple data processing inbound thread by the number According in library.
Particularly, the step D is specifically included: each data processing inbound thread of a. is respectively from the initial data The data of predetermined quantity are read in queue, and are pre-processed one by one;B. each data processing inbound thread is respectively institute Pretreated data are stated to be stored in into respective pretreatment queue;C. each data processing inbound thread is the pre- place It is written in the database with managing the data parallel in queue;D. each data processing inbound thread empties the pre- place Manage the data in queue;E. each data processing inbound thread confirms whether there are also not in the initial data queue respectively Processed data: if there is untreated data, a is thened follow the steps.
Particularly, specifically include in the step c: I. successively extracts the data from the pretreatment queue;II. really The data being extracted are recognized with the presence or absence of mistake;When there is an error, implementation steps:
Vi. position of the wrong data in the pretreatment queue is determined;
Vii. the data before the position are written in the database;
Viii. the object information of the wrong data on the position is written in error queue;
Ix. the data on the position with before are deleted in the pretreatment queue;
X. step I is executed;
When there is no mistake: the data of extraction are written in the database.
Particularly, the object information includes: location information, content, error reason.
Particularly, it further comprises the steps of: through error handling processing thread and handles the error queue.
Particularly, the information in the error queue is exported according to user-defined format.
Particularly, the depth of the initial data queue and the depth of the pretreatment queue are configurableization.
Particularly, in step a, the predetermined quantity is determined based on the depth of the pretreatment queue.
Particularly, the depth of the pretreatment queue is 50.
Particularly, the database is the database for meeting JDBC specification.
Particularly, the database is INFORMIX database or MYSQL database.
In conclusion disclosed method has the advantages that according to the present invention
1) present invention realizes that data parallel imports by multithreading, takes full advantage of hardware resource, to improve storage effect Rate, so that the speed that data import greatly improves.By taking the data file that the every row 27 of 1,000,000 rows arranges as an example, institute is public through the invention The method opened, which completes data importing, only needs 59473 milliseconds;
2) all databases for meeting JDBC specification be present invention can be suitably applied to, to realize that the unified data of integration across database are led Enter processing;
3) present invention personalized on demand can be handled every data line, get rid of the reading by column that database carries tool Take non-expandable drawback;And the present invention can customize output as a result, for example direct output screen, preservation file, preservation number According to library etc..
Detailed description of the invention
By reading a detailed description of non-restrictive embodiments in the light of the attached drawings below, of the invention other Feature, objects and advantages will become more apparent:
Fig. 1 shows the method stream data in data source file being imported into database disclosed according to the present invention Cheng Tu;
Fig. 2 shows the processes that the data in data source file are imported into database disclosed according to the present invention to show It is intended to;
Fig. 3 shows data processing threads disclosed according to the present invention and pretreated data is written in database Schematic diagram;And
Fig. 4 shows the method flow diagram that processing data disclosed according to the present invention import error.
Specific embodiment
In the following detailed description of the preferred embodiment, reference is constituted to the appended attached drawing of present invention a part.Institute Attached attached drawing, which has been illustrated by way of example, can be realized specific embodiment.Exemplary embodiment is not intended to Exhaustive all embodiments according to the present invention.It is appreciated that without departing from the scope of the present invention, can use other Embodiment can also carry out the modification of structure or logic.Therefore, it is below specific descriptions and it is unrestricted, and this The range of invention is defined by the claims appended hereto.
Below with reference to attached drawing 1-4, illustrate that a kind of data in data source file imported into the method in database.It needs It is noted that although the step of describing method in attached drawing with particular order, this does not require that or implies must be by These operations are executed according to the particular order, or have to carry out shown in whole operation be just able to achieve it is desired as a result, on the contrary, The step of describing in flow chart, which can change, executes sequence.Additionally or alternatively, it is convenient to omit certain steps, by multiple steps It merges into a step to execute, and/or a step is decomposed into execution of multiple steps.
As shown in Figure 1, 2, in step 110, file in file read module reads thread 210 and passes through reads line by line Mode reads data from data source file 200.Wherein, since file reading speed in this step is far longer than subsequent The speed of data processing and storage in step, therefore be configured to only a file in this programme and read thread to read data source Data in file 200.
In the step 120, file reads thread 210 and the data read from data source file 200 is written to simultaneously In the initial data queue 220 of starting.The depth of the queue 220 can be customized by the user.In a preferred embodiment, The depth of the initial data queue 220, which can be according to the hardware configuration of data source file and gatherer, to be determined.Work as number According to the comparison that when source file is larger and/or the memory of gatherer is larger, the depth of the initial data queue 220 can be set It is deep.In a specific embodiment, the depth of the initial data queue 220 is 5000, it may be assumed that the initial data queue can be protected Deposit 5000 row data.
It, should after the data in initial data queue 220 are read by subsequent thread (data processing inbound thread 230) Data will be deleted, and file reading thread 210 will obtain new data from data source file 200 and be put into initial data In queue 220, which will be implemented repeatedly until the data of data source file 200 are finished by all readings.
In step 130, entering to set multiple data processing inbound threads 230 in library module in batches, at multiple data It is all identical for managing the function of inbound thread 230.The data processing inbound thread 230 from initial data queue 220 for reading Access evidence, and data are written to after the pre-treatment in database 240, which will be implemented repeatedly until initial data queue Data are all imported into and finish in 220.In the present invention, which is the database for meeting JDBC specification, such as: INFORMIX database or MYSQL database.
Referring to Fig. 3, a pretreatment queue 330 is respectively equipped in each data processing inbound thread 230.At data The processing speed for managing inbound thread 230 is slower, therefore the depth for pre-processing queue 330 is much smaller than the depth of initial data queue 220 Degree.
In a preferred embodiment, the depth of queue 330 and/or the number of data processing inbound thread 230 are pre-processed Amount can be configured according to system performance.For example, the depth and/or data processing inbound thread 230 of pretreatment queue 330 Quantity can be set according to the data processing speed of data processing inbound thread 230.When the number of data processing inbound thread 230 When very fast according to processing speed, the depth for pre-processing queue 330 is deeper, the negligible amounts of data processing inbound thread 230;Work as data When the data processing speed of processing inbound thread 230 is slower, the depth for pre-processing queue 330 is shallower, data processing inbound thread 230 quantity is more.
It is to be noted that there is no specific sequencing between step 110 and step 130, the two can be real simultaneously It applies, can also successively implement.
In step 140, each data processing inbound thread 230 reads predetermined number from initial data queue 220 respectively The data of amount, and pre-processed one by one.Wherein, the predetermined quantity can be determined according to the depth of pretreatment queue 330, The predetermined quantity is equal to the depth of pretreatment queue 330 in the present embodiment.The pretreatment includes according to pretreatment interface The data read from initial data queue 220 are converted to required format.In the present embodiment, the pretreatment can be with Data are assembled into JDBC parameter list required for PreparedStatement.Pretreated data will be more readily It is directed into JDBC database.
In step 150, each data processing inbound thread 230 is respectively stored in pretreated data to respective pre- It handles in queue 330.And after the pretreatment queue 330 has been expired, in a step 160,230 meeting of data processing inbound thread Data in the pretreatment queue are written in the database 240.Wherein, multiple data processing inbound threads 230 it Between be that concurrently pretreated data are written in database 240, in such a way that this is written in parallel to, data processing and The time of write-in is compressed, to improve the efficiency of data importing.
In step 170, after the data in respective pretreatment queue 330 are imported into database 240, data processing Inbound thread 230 empties the data in the pretreatment queue 330 respectively.And in step 180, data processing inbound thread Whether there are also untreated data in 230 confirmation initial data queues 220: if there is untreated data, then repeating Step 140 is executed to 180, until the data in initial data queue 220 are all directed in database 240.
In a step 160, when each data processing inbound thread 230 writes data into database 240, sometimes There is a situation where write-ins to malfunction.For the situation, Fig. 4 shows the side that processing data disclosed according to the present invention import error Method flow chart, the step 160 specifically includes the following steps:
In the step 161, each data processing inbound thread 230 can be extracted from respective pretreatment queue 330 It pre-processes and needs to be written into the data in database 240.
Confirm the data with the presence or absence of mistake in step 162.
It is implemented the steps of when finding that the data being extracted have mistake: in step 163, confirming the mistake pre- Handle the position in queue 330;In step 164, the data not malfunctioned before the position are written to database 240 In;In step 165, the object information of the error data is written in error queue 340, wherein the object information includes: Location information (such as: the data in data source file 200 line number or the wrong data pretreatment queue 330 in Location number), content, any information relevant to the wrong data of error reason etc.;In step 166, recorded mistake Data dump, it may be assumed that the data in pretreatment queue 330 on deletion error position with before.It finally returns into step 161, To continue to extract remaining data.
In a specific embodiment, which is handled by subsequent error handling processing thread, the processing Mode includes: that the object information is written in specified file, database or JMS, or according to user-defined format Export the information in the error queue.
When finding implementation steps 167: the data of extraction there is no when mistake, are then written to data by the data being extracted In library 240.
Finally, the result of the data write-in and the result of error message can be exported by customized mode, such as Direct output screen, preservation file, preservation database etc..
Illustrate processing of the data processing inbound thread to wrong data below by way of a specific example: for example, one In the pretreatment queue that a depth is 50, there are mistakes for the data on the 20th position in the queue, then data processing is put in storage Thread can successively extract data from pretreatment queue, and when extracting the 20th data, the discovery of data processing inbound thread should There are mistakes for data, then the thread can stop continuing to extract data, but the data on the position 1-19 before are written to In database, and the object information of the data on the 20th position is written to error queue, then on the 1-20 position Data delete.Data processing inbound thread continues to extract data since on the 21st position later, until all data mention It takes and finishes and be written in database.
It is obvious to a person skilled in the art that invention is not limited to the details of the above exemplary embodiments, Er Qie In the case where without departing substantially from spirit or essential attributes of the invention, the present invention can be realized in other specific forms.Therefore, no matter How from the point of view of, the present embodiments are to be considered as illustrative and not restrictive.In addition, it will be evident that one word of " comprising " not Exclude other elements and steps, and wording "one" be not excluded for plural number.The multiple element stated in device claim can also To be implemented by one element.The first, the second equal words are used to indicate names, and are not indicated any particular order.

Claims (10)

1. a kind of data in data source file imported into the method in database, comprising the following steps:
A. thread is read via a file, reads the data in the data source file;
B. the data are stored in line by line into initial data queue;
C. the identical data processing inbound thread of multiple functions is set;
D. each data processing inbound thread respectively from the initial data queue read predetermined quantity data, and by Item is pre-processed, and then the pretreated data is stored in into respective pretreatment queue, then respectively described pre- It is written in the database with handling the data parallel in queue, wherein the depth of the pretreatment queue is much smaller than the original The depth of beginning data queue, and the pretreatment team is set according to the data processing speed of the data processing inbound thread The depth of column;And
It is read, deletes described original by the multiple data processing inbound thread in response to the data in the initial data queue The data in data queue;Thread, which is read, via file reads the new data in the data source file;
Wherein the data parallel in the pretreatment queue is written in the database described in the step D and is specifically wrapped It includes:
I. data successively are extracted from the pretreatment queue, wherein the data being extracted are by the data processing inbound thread Deposit is into the pretreatment queue;
II. the data being extracted described in confirmation are with the presence or absence of mistake;
When there is an error, implementation steps:
I. position of the wrong data in the pretreatment queue is determined;
Ii. the data before the position are written in the database;
Iii. the object information of the wrong data on the position is written in error queue;
Iv. the data on the position with before are deleted in the pretreatment queue;
V. step I is executed;
When there is no mistake: the data of extraction are written in the database.
2. according to the method described in claim 1, wherein, the step D is also specifically included:
Each data processing inbound thread empties the data in the pretreatment queue;And each data processing enters Library thread confirms whether there are also untreated data in the initial data queue respectively: if there is untreated number According to, then each data processing inbound thread respectively from the initial data queue read predetermined quantity data, and by Item is pre-processed.
3. according to the method described in claim 1, wherein, the object information includes: location information, content, error reason.
4. according to the method described in claim 1, wherein, further comprising the steps of: through error handling processing thread and handling the error Queue.
5. according to the method described in claim 1, wherein, exporting the letter in the error queue according to user-defined format Breath.
6. according to the method described in claim 1, wherein, the depth of the depth of the initial data queue and the pretreatment queue Degree is configurableization.
7. according to the method described in claim 6, wherein reading predetermined number from the initial data queue in the step D The data of amount, and carry out pretreatment one by one and include:
The predetermined quantity is determined based on the depth of the pretreatment queue.
8. according to the method described in claim 6, wherein, the depth of the pretreatment queue is 50.
9. the method according to claim 1, wherein the database is the database for meeting JDBC specification.
10. according to the method described in claim 9, it is characterized in that, the database is INFORMIX database or MYSQL Database.
CN201410658208.2A 2014-11-18 2014-11-18 A method of the data in data source file are imported into database Active CN104376082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410658208.2A CN104376082B (en) 2014-11-18 2014-11-18 A method of the data in data source file are imported into database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410658208.2A CN104376082B (en) 2014-11-18 2014-11-18 A method of the data in data source file are imported into database

Publications (2)

Publication Number Publication Date
CN104376082A CN104376082A (en) 2015-02-25
CN104376082B true CN104376082B (en) 2019-06-18

Family

ID=52554989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410658208.2A Active CN104376082B (en) 2014-11-18 2014-11-18 A method of the data in data source file are imported into database

Country Status (1)

Country Link
CN (1) CN104376082B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469152A (en) * 2015-08-14 2017-03-01 阿里巴巴集团控股有限公司 A kind of document handling method based on ETL and system
CN105045917B (en) 2015-08-20 2019-06-18 北京百度网讯科技有限公司 A kind of the distributed data restoration methods and device of Case-based Reasoning
CN105279261B (en) * 2015-10-23 2019-07-02 北京京东尚科信息技术有限公司 Dynamic scalable database filing method and system
CN105468705A (en) * 2015-11-18 2016-04-06 广东南方通信建设有限公司 Mobile communication background data file importing method
CN106909554B (en) * 2015-12-22 2020-08-04 亿阳信通股份有限公司 Method and device for loading database text table data
CN107665195A (en) * 2016-07-27 2018-02-06 北京京东尚科信息技术有限公司 Import the method, apparatus and system of mass data
CN106354788A (en) * 2016-08-23 2017-01-25 无锡天脉聚源传媒科技有限公司 File reading method and device
CN106599222B (en) * 2016-12-19 2020-09-04 广州四三九九信息科技有限公司 Method and equipment for processing logs in streaming parallel
CN108959292B (en) * 2017-05-19 2021-03-30 北京京东尚科信息技术有限公司 Data uploading method, system and computer readable storage medium
CN107357868A (en) * 2017-07-03 2017-11-17 华通信安(北京)科技发展有限公司 A kind of fast conversion method and device of matlab data formats
CN107862095B (en) * 2017-12-07 2021-06-01 中国银行股份有限公司 Data processing method and device
CN108228730A (en) * 2017-12-11 2018-06-29 深圳市买买提信息科技有限公司 Data lead-in method, device, computer equipment and readable storage medium storing program for executing
CN109033184B (en) * 2018-06-27 2021-08-17 中国建设银行股份有限公司 Data processing method and device
CN109857832A (en) * 2019-01-03 2019-06-07 中国银行股份有限公司 A kind of preprocess method and device of payment data
CN110362617B (en) * 2019-06-24 2023-06-13 北京人大金仓信息技术股份有限公司 Method and system for rapidly exporting batch data from database based on multiple concurrency technologies
CN112328542A (en) * 2020-11-25 2021-02-05 天津凯发电气股份有限公司 Method for importing data in heterogeneous data file into database
CN115510020A (en) * 2021-06-23 2022-12-23 比亚迪股份有限公司 Data archiving method, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6477535B1 (en) * 1998-11-25 2002-11-05 Computer Associates Think Inc. Method and apparatus for concurrent DBMS table operations
CN101515291A (en) * 2009-03-26 2009-08-26 北京泰合佳通信息技术有限公司 Method for leading data into database in a batch way and system thereof
CN103049533A (en) * 2012-12-23 2013-04-17 北京人大金仓信息技术股份有限公司 Method for quickly loading data into database
CN103092840A (en) * 2011-10-28 2013-05-08 上海邮电设计咨询研究院有限公司 Method for acquiring self-increment mass data files from multiple sources

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6477535B1 (en) * 1998-11-25 2002-11-05 Computer Associates Think Inc. Method and apparatus for concurrent DBMS table operations
CN101515291A (en) * 2009-03-26 2009-08-26 北京泰合佳通信息技术有限公司 Method for leading data into database in a batch way and system thereof
CN103092840A (en) * 2011-10-28 2013-05-08 上海邮电设计咨询研究院有限公司 Method for acquiring self-increment mass data files from multiple sources
CN103049533A (en) * 2012-12-23 2013-04-17 北京人大金仓信息技术股份有限公司 Method for quickly loading data into database

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Oracle中如何定位报错的行";lianjie1229;《Linux公社》;20120805;第1-2页 *

Also Published As

Publication number Publication date
CN104376082A (en) 2015-02-25

Similar Documents

Publication Publication Date Title
CN104376082B (en) A method of the data in data source file are imported into database
US11328003B2 (en) Data relationships storage platform
Zhao et al. Schema conversion model of SQL database to NoSQL
CN106844682B (en) Method for interchanging data, apparatus and system
CN103886011B (en) Social-relation network creation and retrieval system and method based on index files
CN111712809A (en) Learning ETL rules by example
US9037525B2 (en) Correlating data from multiple business processes to a business process scenario
EP3182304A1 (en) Computer-implemented method for storing unlimited amount of data as a mind map in relational database systems
CN106164865A (en) Affairs batch processing for the dependency perception that data replicate
CN105260464B (en) The conversion method and device of data store organisation
CN106547918B (en) Statistical data integration method and system
CN106354817B (en) Log processing method and device
CN103778239B (en) Multi-database data management method and system
CN109635024A (en) A kind of data migration method and system
CN107784026A (en) A kind of ETL data processing methods and device
CN103092997B (en) For interlock inquiry system and the interlock querying method of statement analysis
CN103440265A (en) MapReduce-based CDC (Change Data Capture) method of MYSQL database
CN104391891B (en) A kind of database isomery clone method
US20140280218A1 (en) Techniques for data integration
JP2016530646A (en) Metadata automation system
CN108073705B (en) Distributed mass data aggregation acquisition method
US20150178367A1 (en) System and method for implementing online analytical processing (olap) solution using mapreduce
CN109376154B (en) Data reading and writing method and data reading and writing system
US20180196858A1 (en) Api driven etl for complex data lakes
US10289711B2 (en) Integrated data analysis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant