CN103092840A - Method for acquiring self-increment mass data files from multiple sources - Google Patents

Publication number
CN103092840A
Authority
CN
China
Prior art keywords
data
file
time
source
data file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103348516A
Other languages
Chinese (zh)
Other versions
CN103092840B (en)
Inventor
王志海
麦菁
辛炜博
徐卸土
王智博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Posts & Telecommunication Designing Consulting Institute Co Ltd
Original Assignee
Shanghai Posts & Telecommunication Designing Consulting Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Posts & Telecommunication Designing Consulting Institute Co Ltd
Priority to CN201110334851.6A
Publication of CN103092840A
Application granted
Publication of CN103092840B
Legal status: Active
Anticipated expiration

Abstract

The invention relates to a method for acquiring self-incrementing mass data files from multiple sources. The method uses multithreading to acquire self-incrementing data files from multiple data sources in parallel, and uses file slicing and resumable transfer to slice each self-incrementing data file by time, so that only the increment of the file is acquired on each pass. The acquisition interval is set according to the data generation cycle, the estimated data file size, and business requirements. At that interval, the current-cycle data file on each server data source is checked by regular polling; the incremental data is collected using file slicing and resumable transfer and stored locally as small data files, and the file's byte size at that moment is recorded as the starting position for the next polling acquisition. By acquiring only the increment each time, the method achieves real-time acquisition of self-incrementing mass data files, and addresses the prior-art problems of long delay, poor real-time performance, and adverse effects on server load and stability when collecting telecom data.

Description

Method for real-time acquisition of multi-source self-incrementing massive data files
Technical field:
The present invention relates to the field of physics, and in particular to massive-data acquisition technology in computer application systems, specifically a method for real-time acquisition of multi-source self-incrementing massive data files.
Background art:
The data volume involved in telecommunication services is enormous. In large-scale telecom application systems, multiple data sources typically provide massive, self-incrementing data files in real time at the same time, and the application system needs to acquire tens to hundreds of gigabytes of data every day, such as PCMD and ROP data. Such data is stored as files on multiple server data sources. Each data source generally generates one file per fixed cycle, for example one data file per hour or one per day. The file grows in real time within the cycle; when the next cycle begins, a new data file is created automatically and grows in real time in turn. How to collect these massive data files accurately, completely, and as early as possible for the application system has become a technical challenge.
In the prior art, a data file is acquired and loaded into the database only after it has been completely written and no longer grows. This brings two drawbacks. First, the data delay is long and real-time performance is poor. The data file of the previous cycle cannot be acquired until the next cycle begins, so data from the start of the previous cycle may be delayed by a full cycle before it is acquired, and the acquisition itself also takes a considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. Acquiring and loading a massive amount of data in one pass concentrates server processing into one long stretch of time; if the loading process fails, the cost of rollback is very high, and client query access to the server can also be seriously slowed.
Summary of the invention:
The object of the present invention is to provide a method for real-time acquisition of multi-source self-incrementing massive data files that solves the technical problems of the prior art in acquiring telecom data: long delay, poor real-time performance, and adverse effects on server load and stability.
The method of the present invention for real-time acquisition of multi-source self-incrementing massive data files comprises a process of acquiring self-incrementing data files from one or more server data sources. In this process, multithreading is used to acquire the self-incrementing data files on the one or more server data sources in parallel, and file slicing and resumable transfer are used to slice each self-incrementing data file by time, so that only the incremental portion of the file is acquired each time.
Further, the process of acquiring self-incrementing data files from one or more server data sources comprises the following steps:
Step 1: determine the data generation cycle, naming rule, and acquisition mode, and estimate the size of each data file.
Step 2: set the acquisition interval according to the data generation cycle, the estimated data file size, and business requirements.
Step 3: at the set acquisition interval, check the current-cycle data file on the server data source by regular polling; acquire the incremental data using file slicing and resumable transfer; store it locally as a small data file named according to the naming rule set in step 1; and record the file's byte size at that moment as the starting position for the next polling acquisition. The first poll acquires the data from byte position 0 to the byte position of the data file at the moment of the first poll.
Step 4: acquire the data from the byte position recorded at the previous poll to the byte position of the data file at the moment of the current poll, and repeat this read in a loop until the data file of the next cycle is generated.
Step 5: at the moment the next-cycle data file described in step 4 is generated, perform the last polling acquisition.
Step 6: store the acquired files as small data files in a designated directory according to the set naming rule, and load them directly into the database or back them up to a server.
Step 7: for N server data sources, use multithreading to perform steps 3 to 6 in parallel.
Step 8: for multiple data categories, use multithreading or multi-process techniques to perform steps 1 to 7 in parallel.
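Steps 3 to 5 can be read as a small state machine per data source. The following Python sketch illustrates that reading and is not the patented implementation. It assumes the source files are visible as a local directory and that file names sort chronologically; `SourceState`, `read_from`, and `poll` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple


@dataclass
class SourceState:
    """Per-source bookkeeping: which cycle file is followed and how far."""
    current: Optional[Path] = None
    offset: int = 0


def read_from(path: Path, offset: int) -> bytes:
    """Return every byte of `path` from `offset` to its current end."""
    with path.open("rb") as f:
        f.seek(offset)
        return f.read()


def poll(state: SourceState, source_dir: Path) -> List[Tuple[str, bytes]]:
    """Run one polling pass and return (source file name, increment) pairs.

    Assumption: names sort chronologically, so the last name is the
    current cycle's file.
    """
    files = sorted(p for p in source_dir.iterdir() if p.is_file())
    if not files:
        return []
    newest = files[-1]
    slices: List[Tuple[str, bytes]] = []
    if state.current is not None and newest != state.current:
        # Step 5: the next cycle's file exists, so drain the old file once more.
        tail = read_from(state.current, state.offset)
        if tail:
            slices.append((state.current.name, tail))
        state.current, state.offset = None, 0
    if state.current is None:
        # First poll of a cycle starts at byte position 0 (step 3).
        state.current, state.offset = newest, 0
    data = read_from(state.current, state.offset)  # steps 3 and 4
    if data:
        slices.append((state.current.name, data))
        state.offset += len(data)  # starting position for the next poll
    return slices
```

Calling `poll` once per acquisition interval yields one increment per pass. On the pass in which a new cycle file first appears, it yields the final increment of the old file followed by the first increment of the new one. A collector that was offline for more than one cycle would need extra handling that this sketch omits.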
Further, the data sources in step 1 have the following characteristics:
There are N data sources, corresponding to N servers;
The data is stored as files on the N servers respectively;
One data file is generated per cycle;
The data file is written to and grows in real time within the cycle, until the data file of the next cycle is created;
The data file name follows a unique identification rule and is named in the format YYYYMMDDHHMMSS.XXXX, where YYYYMMDDHHMMSS is the time-cycle feature and XXXX is the data category feature.
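As an illustration of the source naming rule, the Python sketch below splits a name of the form YYYYMMDDHHMMSS.XXXX into its two features. It assumes a strict 14-digit stamp; sources that use a shorter stamp would need a different pattern. `SourceName` and `parse_source_name` are hypothetical names used only here.

```python
from datetime import datetime
from typing import NamedTuple


class SourceName(NamedTuple):
    period: datetime   # time-cycle feature: start of the generation cycle
    category: str      # data category feature, e.g. "PCMD"


def parse_source_name(name: str) -> SourceName:
    """Split 'YYYYMMDDHHMMSS.XXXX' into its cycle timestamp and category."""
    stamp, sep, category = name.rpartition(".")
    if not sep or not category:
        raise ValueError(f"no category suffix in {name!r}")
    if len(stamp) != 14 or not stamp.isdigit():
        raise ValueError(f"expected a 14-digit stamp in {name!r}")
    # strptime also rejects impossible dates such as month 13.
    return SourceName(datetime.strptime(stamp, "%Y%m%d%H%M%S"), category)
```

For example, `parse_source_name("20111028100000.PCMD")` gives a cycle start of 2011-10-28 10:00:00 and the category `PCMD`.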
Further, the granularity of the file slices is defined by the acquisition interval in step 2.
Further, the file slicing technique in step 3 cuts the file into successive slices at equal time intervals, and the byte position of each cut is recorded as the starting byte position for acquiring the next slice. The resumable transfer technique acquires, at each cut time point, the data slice from the byte position recorded by the previous acquisition to the maximum byte position of the file at the current moment.
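The slice-and-resume read can be sketched in a few lines of Python. This is a minimal sketch that assumes the growing file is readable through the local file system; a remote source would instead need a transfer protocol that can restart at a byte offset. `read_increment` is a hypothetical helper name.

```python
import os
from typing import Tuple


def read_increment(path: str, start: int) -> Tuple[bytes, int]:
    """Return the bytes in [start, current size) and the new start offset."""
    end = os.path.getsize(path)      # "maximum byte position" at this moment
    if end < start:
        raise ValueError("file shrank; recorded offset is no longer valid")
    with open(path, "rb") as f:
        f.seek(start)                # resume where the previous slice ended
        data = f.read(end - start)   # stop at the snapshot, not at later EOF
    return data, start + len(data)
```

The returned offset is what the text calls the recorded byte position: passing it back in on the next poll yields exactly the bytes appended in between, with no overlap and no gap.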
Further, the naming rule in step 6 is as follows: the data file name contains uniquely identifying elements and is named in the format YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data time-cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the acquisition time feature. The data time-cycle feature and the data category feature are taken from the data source.
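A slice file name under this rule can be assembled from the source file name, a source identifier, and the acquisition time. The Python sketch below is one possible reading of the format; `slice_name` and the sample identifier `OMP3` are hypothetical and stand in for whatever value fills the ECPN position in a real deployment.

```python
from datetime import datetime


def slice_name(source_name: str, source_id: str, collected_at: datetime) -> str:
    """Build '<cycle stamp><source id>_<HHMMSS>.<category>' for one slice."""
    # The cycle stamp and the category are taken from the source file name.
    stamp, sep, category = source_name.rpartition(".")
    if not sep:
        raise ValueError(f"no category suffix in {source_name!r}")
    return f"{stamp}{source_id}_{collected_at:%H%M%S}.{category}"
```

With a source file `20111028100000.PCMD`, source identifier `OMP3`, and an acquisition at 10:01:00, the slice would be named `20111028100000OMP3_100100.PCMD`. Because the acquisition time changes with every poll, slices of the same source file never collide as long as polls are at least one second apart.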
Compared with the prior art, the effect of the present invention is positive and evident. By using multithreading to acquire data files on multiple data servers simultaneously, and using file slicing and resumable transfer to slice each data file by time and acquire only the incremental portion each time, the invention achieves real-time acquisition of multi-source self-incrementing massive data files, and solves the prior-art problems of long delay, poor real-time performance, and adverse effects on server load and stability when acquiring telecom data.
Description of drawings:
Fig. 1 is a schematic diagram of the method of the present invention for real-time acquisition of multi-source self-incrementing massive data files.
Detailed description of embodiments:
Embodiment 1:
As shown in Fig. 1, the method of the present invention for real-time acquisition of multi-source self-incrementing massive data files comprises a process of acquiring self-incrementing data files from one or more server data sources. In this process, multithreading is used to acquire the self-incrementing data files on the one or more server data sources in parallel, and file slicing and resumable transfer are used to slice each self-incrementing data file by time, so that only the incremental portion of the file is acquired each time.
Further, the process of acquiring self-incrementing data files from one or more server data sources comprises the following steps:
Step 1: determine the data generation cycle, naming rule, and acquisition mode, and estimate the size of each data file. The main characteristics of the data sources are:
A. There are N data sources, corresponding to N servers;
B. The data is stored as files on the N servers respectively;
C. One data file is generated per cycle (T), for example every hour or every day;
D. The data file is written to and grows in real time within the cycle, until the data file of the next cycle is created;
E. The data file name follows a unique identification rule and is named in the format YYYYMMDDHHMMSS.XXXX. For example, in 10040711.PCMD, 10040711 is the time-cycle feature and PCMD is the data category feature.
Step 2: set the acquisition interval according to the data generation cycle, the estimated data file size, and business requirements.
Step 3: at the set acquisition interval, check the current-cycle data file on the server data source by regular polling; acquire the incremental data using file slicing and resumable transfer; store it locally as a small data file named according to the naming rule set in step 1; and record the file's byte size at that moment as the starting position for the next polling acquisition. The first poll acquires the data from byte position 0 to the byte position of the data file at the moment of the first poll.
Step 4: acquire the data from the byte position recorded at the previous poll to the byte position of the data file at the moment of the current poll, and repeat this read in a loop until the data file of the next cycle is generated.
Step 5: at the moment the next-cycle data file described in step 4 is generated, perform the last polling acquisition. The last polling acquisition must take place after the next-cycle data file has been generated, and immediately after it is generated, to ensure both the completeness of the acquisition of the previous cycle's data file and the timeliness of the next acquisition.
Step 6: store the acquired files as small data files in a designated directory according to the set naming rule, and load them directly into the database or back them up to a server.
Step 7: for N server data sources, use multithreading to perform steps 3 to 6 in parallel.
Step 8: for multiple data categories, use multithreading or multi-process techniques to perform steps 1 to 7 in parallel.
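The parallel acquisition in step 7 maps naturally onto a thread pool with one worker per data source. The Python sketch below shows only that dispatch. `collect_all` is a hypothetical name, and `collect_one` stands for whatever per-source routine performs steps 3 to 6; here it is assumed to return the number of bytes acquired in the pass.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def collect_all(sources: List[str],
                collect_one: Callable[[str], int]) -> Dict[str, int]:
    """Run `collect_one` for every source, one worker thread per source.

    Returns a mapping from source name to the value `collect_one` returned.
    """
    if not sources:
        return {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # map() preserves input order, so results line up with `sources`.
        results = list(pool.map(collect_one, sources))
    return dict(zip(sources, results))
```

Threads suit this workload because each worker spends most of its time waiting on file or network I/O. For step 8, the same function could be invoked once per data category, or each category could run in its own process.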
Further, the granularity of the file slices is defined by the acquisition interval in step 2.
Further, the file slicing technique in step 3 cuts the file into successive slices at equal time intervals, and the byte position of each cut is recorded as the starting byte position for acquiring the next slice. The resumable transfer technique acquires, at each cut time point, the data slice from the byte position recorded by the previous acquisition to the maximum byte position of the file at the current moment.
Further, the naming rule in step 6 is as follows: the data file name contains uniquely identifying elements and is named in the format YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data time-cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the acquisition time feature. The data time-cycle feature and the data category feature are taken from the data source.
In one embodiment of the invention, PCMD data is stored as files on 7 OMP server data sources. Each data source generates one data file per hour; the file is written in real time during the current hour, until the data file for the next hour is created and written in real time in turn. Each hourly file contains on the order of millions of records.
The traditional approach acquires a PCMD data file only after it has been completely written, which brings two serious drawbacks. First, the data delay is long and real-time performance is poor. The data file of the previous cycle cannot be acquired until the next cycle begins (the data file for 10:00 to 11:00 can only be acquired and loaded after 11:00), so data from the start of the previous cycle may be delayed by a full cycle before it is acquired, and the acquisition itself also takes a considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. Acquiring and loading a massive amount of data in one pass concentrates server processing into one long stretch of time; if the loading process fails, the cost of rollback is very high, and client query access to the server can also be seriously slowed.
When the present invention is used to acquire PCMD data, the acquisition interval is set to 1 minute, so each hourly file is cut into 60 acquisitions, each of which acquires only the increment, and the 7 OMP servers are acquired in parallel. This achieves timely and complete acquisition and loading of massive PCMD data. Compared with the traditional approach, the advantages are as follows. First, it overcomes the long data delay of the traditional approach: the data delay is reduced from 1 hour to 1 minute, improving the timeliness of data acquisition. Second, it helps balance server load and avoids the stability risk of acquiring and processing one large data file within a concentrated period, improving the stability of PCMD data acquisition and processing.
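The figures in this embodiment follow from simple arithmetic on the cycle length and the acquisition interval. The Python sketch below makes that arithmetic explicit. `slicing_plan` is a hypothetical helper, the requirement that the interval divide the cycle evenly is an assumption of this sketch, and the 600 MiB file size in the usage note is an invented sample value, not a figure from the patent.

```python
def slicing_plan(period_s: int, interval_s: int, sources: int,
                 file_bytes: int) -> dict:
    """Derive slice counts and sizes from the cycle length and poll interval."""
    if period_s % interval_s:
        raise ValueError("interval must divide the generation cycle evenly")
    slices_per_file = period_s // interval_s
    return {
        "slices_per_file": slices_per_file,
        "slice_files_per_cycle": slices_per_file * sources,
        "avg_slice_bytes": file_bytes // slices_per_file,
        # Newly written data waits at most one interval for the next poll
        # (transfer and loading time are not counted here).
        "max_wait_s": interval_s,
    }
```

For a one-hour cycle, a one-minute interval, and 7 sources, `slicing_plan(3600, 60, 7, 600 * 1024 * 1024)` gives 60 slices per file, 420 slice files per hour across all sources, an average slice of 10 MiB for the assumed file size, and a maximum wait of 60 seconds, compared with up to 3600 seconds when a file is acquired only after its cycle ends.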

Claims (6)

1. A method for real-time acquisition of multi-source self-incrementing massive data files, comprising a process of acquiring self-incrementing data files from one or more server data sources, characterized in that: in said process, multithreading is used to acquire the self-incrementing data files on the one or more server data sources in parallel, and file slicing and resumable transfer are used to slice each self-incrementing data file by time, so that only the incremental portion of the file is acquired each time.
2. The method for real-time acquisition of multi-source self-incrementing massive data files as claimed in claim 1, characterized in that the process of acquiring self-incrementing data files from one or more server data sources comprises the following steps:
Step 1: determine the data generation cycle, naming rule, and acquisition mode, and estimate the size of each data file.
Step 2: set the acquisition interval according to the data generation cycle, the estimated data file size, and business requirements.
Step 3: at the set acquisition interval, check the current-cycle data file on the server data source by regular polling; acquire the incremental data using file slicing and resumable transfer; store it locally as a small data file named according to the naming rule set in step 1; and record the file's byte size at that moment as the starting position for the next polling acquisition. The first poll acquires the data from byte position 0 to the byte position of the data file at the moment of the first poll.
Step 4: acquire the data from the byte position recorded at the previous poll to the byte position of the data file at the moment of the current poll, and repeat this read in a loop until the data file of the next cycle is generated.
Step 5: at the moment the next-cycle data file described in step 4 is generated, perform the last polling acquisition.
Step 6: store the acquired files as small data files in a designated directory according to the set naming rule, and load them directly into the database or back them up to a server.
Step 7: for N server data sources, use multithreading to perform steps 3 to 6 in parallel.
Step 8: for multiple data categories, use multithreading or multi-process techniques to perform steps 1 to 7 in parallel.
3. The method for real-time acquisition of multi-source self-incrementing massive data files as claimed in claim 2, characterized in that:
The data sources in step 1 have the following characteristics:
There are N data sources, corresponding to N servers;
The data is stored as files on the N servers respectively;
One data file is generated per cycle;
The data file is written to and grows in real time within the cycle, until the data file of the next cycle is created;
The data file name follows a unique identification rule and is named in the format YYYYMMDDHHMMSS.XXXX, where YYYYMMDDHHMMSS is the time-cycle feature and XXXX is the data category feature.
4. The method for real-time acquisition of multi-source self-incrementing massive data files as claimed in claim 2, characterized in that the granularity of the file slices is defined by the acquisition interval in step 2.
5. The method for real-time acquisition of multi-source self-incrementing massive data files as claimed in claim 2, characterized in that: the file slicing technique in step 3 cuts the file into successive slices at equal time intervals, and the byte position of each cut is recorded as the starting byte position for acquiring the next slice; the resumable transfer technique acquires, at each cut time point, the data slice from the byte position recorded by the previous acquisition to the maximum byte position of the file at the current moment.
6. The method for real-time acquisition of multi-source self-incrementing massive data files as claimed in claim 2, characterized in that:
The naming rule in step 6 is as follows:
The data file name contains uniquely identifying elements and is named in the format YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data time-cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the acquisition time feature. The data time-cycle feature and the data category feature are taken from the data source.
CN201110334851.6A 2011-10-28 2011-10-28 Method for real-time acquisition of multi-source self-incrementing massive data files Active CN103092840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110334851.6A CN103092840B (en) 2011-10-28 2011-10-28 Method for real-time acquisition of multi-source self-incrementing massive data files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110334851.6A CN103092840B (en) 2011-10-28 2011-10-28 Method for real-time acquisition of multi-source self-incrementing massive data files

Publications (2)

Publication Number Publication Date
CN103092840A (en) 2013-05-08
CN103092840B (en) 2015-09-16

Family

ID=48205423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110334851.6A Active CN103092840B (en) 2011-10-28 2011-10-28 Method for real-time acquisition of multi-source self-incrementing massive data files

Country Status (1)

Country Link
CN (1) CN103092840B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103685559A (en) * 2013-12-27 2014-03-26 乐视网信息技术(北京)股份有限公司 Method and system for processing data in server
CN103678699A (en) * 2013-12-27 2014-03-26 乐视网信息技术(北京)股份有限公司 Method and system for merging data in server
CN103701907A (en) * 2013-12-27 2014-04-02 乐视网信息技术(北京)股份有限公司 Processing method and system for continuing to transmit data in server
CN103699666A (en) * 2013-12-27 2014-04-02 乐视网信息技术(北京)股份有限公司 Transmission method and transmission device for splitting data
CN104111983A (en) * 2014-06-30 2014-10-22 中国科学院信息工程研究所 Open-type multi-source data collection system and method
CN104376082A (en) * 2014-11-18 2015-02-25 中国建设银行股份有限公司 Method for importing data in data source file to database
CN105183585A (en) * 2015-08-27 2015-12-23 北京金山安全软件有限公司 Data backup method and device
CN105843935A (en) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Data acquisition method and ETL (Extraction-Transformation-Loading) assembly
CN105893529A (en) * 2016-03-30 2016-08-24 乐视控股(北京)有限公司 Data collecting method and ETL assembly
CN106164867A (en) * 2014-04-01 2016-11-23 谷歌公司 The increment parallel processing of data
CN107993696A (en) * 2017-12-25 2018-05-04 东软集团股份有限公司 A kind of collecting method, device, client and system
CN110347661A (en) * 2019-07-05 2019-10-18 北京红山信息科技研究院有限公司 Method, apparatus, server and the storage medium that data source is quasi real time put in storage
CN111159118A (en) * 2019-12-20 2020-05-15 东软集团股份有限公司 Polling monitoring method and device, storage medium and electronic equipment
CN112669148A (en) * 2020-12-22 2021-04-16 深圳市富途网络科技有限公司 Order processing method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1756108A (en) * 2004-09-29 2006-04-05 华为技术有限公司 Master/backup system data synchronizing method
US20080010322A1 (en) * 2006-07-06 2008-01-10 Data Domain, Inc. File system replication
CN101719143A (en) * 2009-12-01 2010-06-02 北京中科创元科技有限公司 Method for parallel processing compare increment data extraction
CN102110121A (en) * 2009-12-24 2011-06-29 阿里巴巴集团控股有限公司 Method and system for processing data

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678699A (en) * 2013-12-27 2014-03-26 乐视网信息技术(北京)股份有限公司 Method and system for merging data in server
CN103701907A (en) * 2013-12-27 2014-04-02 乐视网信息技术(北京)股份有限公司 Processing method and system for continuing to transmit data in server
CN103699666A (en) * 2013-12-27 2014-04-02 乐视网信息技术(北京)股份有限公司 Transmission method and transmission device for splitting data
CN103685559A (en) * 2013-12-27 2014-03-26 乐视网信息技术(北京)股份有限公司 Method and system for processing data in server
CN106164867A (en) * 2014-04-01 2016-11-23 谷歌公司 The increment parallel processing of data
US10628212B2 (en) 2014-04-01 2020-04-21 Google Llc Incremental parallel processing of data
CN106164867B (en) * 2014-04-01 2020-01-14 谷歌有限责任公司 Incremental parallel processing of data
CN104111983B (en) * 2014-06-30 2017-12-19 中国科学院信息工程研究所 A kind of open multi-source data acquiring system and method
CN104111983A (en) * 2014-06-30 2014-10-22 中国科学院信息工程研究所 Open-type multi-source data collection system and method
CN104376082B (en) * 2014-11-18 2019-06-18 中国建设银行股份有限公司 A method of the data in data source file are imported into database
CN104376082A (en) * 2014-11-18 2015-02-25 中国建设银行股份有限公司 Method for importing data in data source file to database
CN105183585A (en) * 2015-08-27 2015-12-23 北京金山安全软件有限公司 Data backup method and device
CN105183585B (en) * 2015-08-27 2019-03-26 北京金山安全软件有限公司 Data backup method and device
CN105893529A (en) * 2016-03-30 2016-08-24 乐视控股(北京)有限公司 Data collecting method and ETL assembly
CN105843935A (en) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Data acquisition method and ETL (Extraction-Transformation-Loading) assembly
CN107993696A (en) * 2017-12-25 2018-05-04 东软集团股份有限公司 A kind of collecting method, device, client and system
CN110347661A (en) * 2019-07-05 2019-10-18 北京红山信息科技研究院有限公司 Method, apparatus, server and the storage medium that data source is quasi real time put in storage
CN111159118A (en) * 2019-12-20 2020-05-15 东软集团股份有限公司 Polling monitoring method and device, storage medium and electronic equipment
CN111159118B (en) * 2019-12-20 2024-01-26 东软集团股份有限公司 Polling monitoring method and device, storage medium and electronic equipment
CN112669148A (en) * 2020-12-22 2021-04-16 深圳市富途网络科技有限公司 Order processing method and device

Also Published As

Publication number Publication date
CN103092840B (en) 2015-09-16

Similar Documents

Publication Publication Date Title
CN103092840B (en) Method for real-time acquisition of multi-source self-incrementing massive data files
CN102982085B (en) Data mover system and method
CA2871313C (en) Method and system for managing power grid data
CN103678042B (en) A kind of backup policy information generating method based on data analysis
CN101436207A (en) Data restoring and synchronizing method based on log snapshot
CN106502868B (en) Dynamic monitoring frequency adjusting method suitable for cloud computing
CN104111996A (en) Health insurance outpatient clinic big data extraction system and method based on hadoop platform
CN104182506A (en) Log management method
CN105117402B (en) Daily record data sharding method and device
CN108268565B (en) Method and system for processing user browsing behavior data based on data warehouse
CN104090889A (en) Method and system for data processing
CN102426609A (en) Index generation method and index generation device based on MapReduce programming architecture
CN103914485A (en) System and method for remotely collecting, retrieving and displaying application system logs
CN104036029A (en) Big data consistency comparison method and system
CN105490854A (en) Real-time log collection method and system, and application server cluster
CN102722584B (en) Data storage system and method
CN104572856A (en) Converged storage method of service source data
CN105808653A (en) User label system-based data processing method and device
CN113946575A (en) Space-time trajectory data processing method and device, electronic equipment and storage medium
CN102231673A (en) System and method for monitoring business server
CN105787058A (en) User label system and data pushing system based on same
CN105787090A (en) Index building method and system of OLAP system of electric data
CN105205189A (en) BIM based on container and integrated method of high-speed data collecting system
CN103020169A (en) Effectiveness and uniqueness processing method for electric data
Murugesan et al. Audit log management in MongoDB

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant