CN103092840A - Method for acquiring self-increment mass data files from multiple sources - Google Patents
- Publication number: CN103092840A (application CN201110334851.6)
- Authority: CN (China)
- Prior art keywords: data, file, time, source, data file
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention relates to a method for collecting self-growing mass data files from multiple sources in real time. The method uses multithreading to collect self-growing data files from the data sources in parallel, and uses file slicing and resumable file transfer to slice each self-growing data file by time, so that only the file's increment is collected on each pass. The collection interval is set according to the data generation cycle, the estimated data file size, and business requirements. At that interval, the current-cycle data file on each server data source is checked by periodic polling; the incremental data is collected using file slicing and resumable file transfer and stored locally as small data files, and the file's byte size at that moment is recorded as the starting position for the next polling collection. By collecting only the increment each time, the method achieves real-time collection of self-growing mass data files and solves the problems of the prior art in collecting telecom data: long delay, poor real-time performance, and adverse effects on server load and stability.
Description
Technical field:
The present invention relates to the field of physics, in particular to mass data collection technology in computer application systems, and specifically to a method for real-time collection of self-growing mass data files from multiple sources.
Background art:
The volume of data involved in telecom services is enormous. In a large-scale telecom application system, multiple data sources typically provide massive self-growing data files simultaneously and in real time, and the application system must collect tens to hundreds of gigabytes of data every day, such as PCMD and ROP data. This data is stored as files on multiple server data sources. Each data source generally generates one file per fixed cycle, for example one data file per hour or one per day. The file grows in real time within its cycle; when the next cycle begins, the corresponding new data file is created automatically and in turn grows in real time. Ensuring that these massive data files are collected promptly, accurately, and completely and delivered to the application system has become a technical challenge.
The prior art collects and loads a data file only after it has been completely written and no longer grows. This has two drawbacks. First, the data delay is long and real-time performance is poor. The data file of the previous cycle cannot be collected until the next cycle begins, so data from the early part of a cycle may be delayed by a full cycle before it is collected, and the collection itself also takes considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. Collecting and loading a massive amount of data in one pass concentrates server processing into one long stretch of time; if the loading process fails, the cost of rollback is high, and clients' query access to the server can be seriously slowed.
Summary of the invention:
The object of the present invention is to provide a method for real-time collection of self-growing mass data files from multiple sources that solves the technical problems of the prior art in collecting telecom data: long delay, poor real-time performance, and adverse effects on server load and stability.
The method of the present invention comprises a process of collecting self-growing data files from one or more server data sources, wherein, in that process, multithreading is used to collect the self-growing data files on the one or more server data sources in parallel, and file slicing and resumable file transfer are used to slice the self-growing data files by time, so that each collection retrieves only the incremental portion of the self-growing data file.
Further, the process of collecting self-growing data files from the one or more server data sources comprises the following steps:
Step 1: determine the data generation cycle, the naming rule, and the collection mode, and estimate the size of each data file.
Step 2: set the collection interval according to the data generation cycle, the estimated data file size, and business requirements.
Step 3: at the set collection interval, check the current-cycle data file on the server data source by periodic polling; collect the incremental data using file slicing and resumable file transfer; store it locally as a small data file named according to the naming rule set in step 1; and record the file's byte size at that moment as the starting position for the next polling collection. The first poll collects the data from byte position 0 to the byte position of the data file at the time of the first poll.
Step 4: collect the data from the byte position recorded by the previous poll to the byte position of the data file at the time of the current poll, and repeat this read in a loop until the data file of the next cycle is generated.
Step 5: at the moment the data file of the next cycle described in step 4 is generated, perform the final polling collection.
Step 6: store the collected data as small data files in a designated directory according to the set naming rule, and either load them directly into the database or back them up to a server.
Step 7: for N server data sources, use multithreading to perform steps 3 to 6 in parallel.
Step 8: for multiple data categories, use multithreading or multiprocessing to perform steps 1 to 7 in parallel.
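The bookkeeping in steps 3 to 5 can be sketched as a small planning function. This is only an illustrative sketch, not the patented implementation: the function name `plan_poll`, the `listing` dictionary (file name to current size, as one poll of a source might report it), and the sample file names are all assumptions. It relies on the file names starting with a timestamp, so that sorting by name gives chronological order.

```python
from typing import Dict, List, Optional, Tuple

# (file name, start byte, end byte) of one slice to collect
Range = Tuple[str, int, int]


def plan_poll(current: Optional[str], offset: int,
              listing: Dict[str, int]) -> Tuple[List[Range], Optional[str], int]:
    """Decide which byte ranges to collect in one polling round.

    current / offset: file and byte position recorded by the previous poll
                      (None / 0 before the first poll).
    listing:          hypothetical snapshot {file name: size in bytes} of the
                      cycle files visible on the source at this poll.
    Returns the ranges to collect plus the new (current, offset) state.
    """
    ranges: List[Range] = []
    for name in sorted(listing):  # timestamp prefix => chronological order
        if current is not None and name < current:
            continue  # earlier cycles were already collected in full
        # Resume from the recorded offset for the file in progress,
        # start from byte 0 for a newly created cycle file.
        start = offset if name == current else 0
        end = listing[name]
        if end > start:
            ranges.append((name, start, end))
        current, offset = name, max(start, end)
    return ranges, current, offset
```

When the next cycle's file appears in the listing, the same call yields the final range of the old file followed by the first range of the new one, which corresponds to the final polling collection of step 5.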
Further, the data sources in step 1 have the following characteristics:
There are N data sources, corresponding to N servers;
The data is stored as files on the N servers respectively;
Each data source generates one data file per cycle;
The data file grows through real-time writes within its cycle, until the data file of the next cycle is created;
The data file name follows a unique identification rule and takes the form YYYYMMDDHHMMSS.XXXX, where YYYYMMDDHHMMSS is the time-cycle feature and XXXX is the data category feature.
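As an illustration of the source-file naming rule, the following sketch splits a name of the form YYYYMMDDHHMMSS.XXXX into its time-cycle and category parts. The function name `parse_source_name` and the sample file name are assumptions made for this example only.

```python
from datetime import datetime
from typing import Tuple


def parse_source_name(name: str) -> Tuple[datetime, str]:
    """Split 'YYYYMMDDHHMMSS.XXXX' into (cycle start time, data category)."""
    stamp, sep, category = name.partition(".")
    if not sep or not category:
        raise ValueError(f"missing category suffix: {name!r}")
    if len(stamp) != 14 or not stamp.isdigit():
        raise ValueError(f"expected a 14-digit timestamp: {name!r}")
    # Raises ValueError as well if the digits are not a valid date/time.
    cycle_start = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return cycle_start, category
```

For example, `parse_source_name("20111028100000.PCMD")` gives the cycle start 2011-10-28 10:00:00 and the category `PCMD`.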
Further, the granularity of the file slices is defined by the collection interval set in step 2.
Further, the file slicing in step 3 cuts the file into successive slices at equal time intervals, and the byte position of each cut is recorded as the starting byte position of the next slice to be collected. Resumable file transfer means that, at each cut time, the data slice from the byte position recorded by the previous collection to the file's maximum byte position at the current time is collected.
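A minimal sketch of one slice read is shown below. It assumes the growing file is reachable as a local path (for example through a mounted share); the patent does not specify the transfer mechanism, and the name `read_increment` is hypothetical.

```python
import os
from typing import Tuple


def read_increment(path: str, offset: int) -> Tuple[bytes, int]:
    """Return the bytes appended since `offset` and the new offset.

    The file size is sampled once, so the slice ends at the byte position
    the file had at poll time even if the writer keeps appending.
    """
    size = os.path.getsize(path)
    if size <= offset:
        return b"", offset  # nothing new since the previous poll
    with open(path, "rb") as f:
        f.seek(offset)                 # resume where the last slice ended
        data = f.read(size - offset)   # read only the increment
    return data, offset + len(data)
```

The returned offset is what a caller would persist as the starting position of the next poll; on the first poll the caller passes 0.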
Further, the naming rule in step 6 is as follows: the data file name contains uniquely identifying elements and takes the form YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data time-cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the collection time feature. The data time-cycle feature and the data category feature are taken from the data source.
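The slice naming rule can be illustrated with a short formatting helper. The function name `slice_name` and the source identifier `OMP1` used in the example are assumptions; the patent only states that ECPN stands for the data source feature.

```python
from datetime import datetime


def slice_name(cycle_start: datetime, source_id: str,
               collected_at: datetime, category: str) -> str:
    """Build 'YYYYMMDDHHMMSS<source>_HHMMSS.<category>' for one slice."""
    return "{}{}_{}.{}".format(
        cycle_start.strftime("%Y%m%d%H%M%S"),  # data time-cycle feature
        source_id,                             # data source feature (ECPN)
        collected_at.strftime("%H%M%S"),       # collection time feature
        category,                              # data category feature
    )
```

With a cycle starting at 2011-10-28 10:00:00, a hypothetical source `OMP1`, and a collection at 10:05:00, this yields `20111028100000OMP1_100500.PCMD`.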
Compared with the prior art, the present invention has positive and evident effects. By using multithreading to collect data files on multiple data servers simultaneously, and by using file slicing and resumable file transfer to slice each data file by time and collect only the incremental portion each time, the invention achieves real-time collection of self-growing mass data files from multiple sources and solves the technical problems of the prior art in collecting telecom data: long delay, poor real-time performance, and adverse effects on server load and stability.
Description of drawings:
Fig. 1 is a schematic diagram of the method of the present invention for real-time collection of self-growing mass data files from multiple sources.
Embodiment:
Embodiment 1:
As shown in Fig. 1, the method of the present invention for real-time collection of self-growing mass data files from multiple sources comprises a process of collecting self-growing data files from one or more server data sources, wherein, in that process, multithreading is used to collect the self-growing data files on the one or more server data sources in parallel, and file slicing and resumable file transfer are used to slice the self-growing data files by time, so that each collection retrieves only the incremental portion of the self-growing data file.
Further, the process of collecting self-growing data files from the one or more server data sources comprises the following steps:
Step 1: determine the data generation cycle, the naming rule, and the collection mode, and estimate the size of each data file. The main characteristics of the data sources are:
A. There are N data sources, corresponding to N servers;
B. The data is stored as files on the N servers respectively;
C. Each data source generates one data file per cycle T (for example, 1 hour or 1 day);
D. The data file grows through real-time writes within its cycle, until the data file of the next cycle is created;
E. The data file name follows a unique identification rule and takes the form YYYYMMDDHHMMSS.XXXX. For example, in 10040711.PCMD, 10040711 is the time-cycle feature and PCMD is the data category feature.
Step 2: set the collection interval according to the data generation cycle, the estimated data file size, and business requirements.
Step 3: at the set collection interval, check the current-cycle data file on the server data source by periodic polling; collect the incremental data using file slicing and resumable file transfer; store it locally as a small data file named according to the naming rule set in step 1; and record the file's byte size at that moment as the starting position for the next polling collection. The first poll collects the data from byte position 0 to the byte position of the data file at the time of the first poll.
Step 4: collect the data from the byte position recorded by the previous poll to the byte position of the data file at the time of the current poll, and repeat this read in a loop until the data file of the next cycle is generated.
Step 5: at the moment the data file of the next cycle described in step 4 is generated, perform the final polling collection. The final poll must take place after the next cycle's data file has been generated, and immediately after its generation, to ensure both the completeness of the collection of the previous cycle's data file and the timeliness of collection of the next one.
Step 6: store the collected data as small data files in a designated directory according to the set naming rule, and either load them directly into the database or back them up to a server.
Step 7: for N server data sources, use multithreading to perform steps 3 to 6 in parallel.
Step 8: for multiple data categories, use multithreading or multiprocessing to perform steps 1 to 7 in parallel.
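The parallel collection of step 7 could be organized with a thread pool, one worker per data source. The sketch below is an assumption-laden illustration: `collect_all` and the `collect_one` callback (standing in for steps 3 to 6 on a single source) are hypothetical names, and the patent does not prescribe a particular threading library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def collect_all(sources: Iterable[str],
                collect_one: Callable[[str], int]) -> Dict[str, object]:
    """Run collect_one(source) for every source in parallel.

    Returns {source: result}, where the result is whatever collect_one
    returned, or the exception it raised, so that a failure on one
    source does not stop collection from the others.
    """
    sources = list(sources)
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {pool.submit(collect_one, s): s for s in sources}
        for fut in as_completed(futures):
            src = futures[fut]
            try:
                results[src] = fut.result()
            except Exception as exc:  # keep the error, continue with others
                results[src] = exc
    return results
```

Threads suit this workload because each worker spends most of its time waiting on file or network I/O; for step 8, the same pattern could be repeated per data category, or replaced by separate processes.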
Further, the granularity of the file slices is defined by the collection interval set in step 2.
Further, the file slicing in step 3 cuts the file into successive slices at equal time intervals, and the byte position of each cut is recorded as the starting byte position of the next slice to be collected. Resumable file transfer means that, at each cut time, the data slice from the byte position recorded by the previous collection to the file's maximum byte position at the current time is collected.
Further, the naming rule in step 6 is as follows: the data file name contains uniquely identifying elements and takes the form YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data time-cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the collection time feature. The data time-cycle feature and the data category feature are taken from the data source.
In one embodiment of the invention, PCMD data is stored as files on 7 OMP server data sources. Each data source generates one data file per hour; the file is written in real time during the current hour, until the data file of the next hour is created and written in real time. The number of records in each hourly file is on the order of millions.
The traditional method collects a PCMD data file only after it has been completely written. This has two serious drawbacks. First, the data delay is long and real-time performance is poor. The data file of the previous cycle cannot be collected until the next cycle begins (the data file covering 10:00 to 11:00 cannot be collected and loaded until after 11:00), so data from the early part of a cycle may be delayed by a full cycle before it is collected, and the collection itself also takes considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. Collecting and loading a massive amount of data in one pass concentrates server processing into one long stretch of time; if the loading process fails, the cost of rollback is high, and clients' query access to the server can be seriously slowed.
When the present invention is used to collect the PCMD data, the collection interval is set to 1 minute, so each hourly file is cut into 60 collections, each retrieving only the increment, and the 7 OMP servers are collected in parallel, achieving timely and complete collection and loading of the massive PCMD data. Compared with the traditional method, the advantages are as follows. First, the long data delay of the traditional method is overcome: the data delay is reduced from 1 hour to 1 minute, improving the real-time performance of data collection. Second, server load is better balanced: the stability risk of collecting and processing a large data file within one concentrated stretch of time is avoided, improving the stability of PCMD data collection and processing.
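The arithmetic behind these figures can be written out as a small helper. The function name `slicing_plan` is hypothetical, and the 600 MB hourly file size used in the example is an assumed value for illustration only; the source does not state a per-file size.

```python
import math
from typing import Dict


def slicing_plan(cycle_s: int, interval_s: int, est_file_bytes: int) -> Dict[str, int]:
    """Derive slice count, average slice size, and delay from the interval."""
    if cycle_s <= 0 or interval_s <= 0:
        raise ValueError("cycle and interval must be positive")
    slices = math.ceil(cycle_s / interval_s)
    return {
        "slices_per_cycle": slices,
        "avg_slice_bytes": est_file_bytes // slices,
        # Data waits at most about one interval before it is collected,
        # ignoring the time taken by the transfer itself.
        "max_delay_s": interval_s,
    }
```

For the embodiment's 1-hour cycle and 1-minute interval, `slicing_plan(3600, 60, 600_000_000)` gives 60 slices per cycle, an average slice of about 10 MB under the assumed file size, and a delay bound of 60 seconds instead of a full hour.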
Claims (6)
1. A method for real-time collection of self-growing mass data files from multiple sources, comprising a process of collecting self-growing data files from one or more server data sources, characterized in that: in the process of collecting self-growing data files from the one or more server data sources, multithreading is used to collect the self-growing data files on the one or more server data sources in parallel, and file slicing and resumable file transfer are used to slice the self-growing data files by time, so that each collection retrieves only the incremental portion of the self-growing data file.
2. The method for real-time collection of self-growing mass data files from multiple sources according to claim 1, characterized in that the process of collecting self-growing data files from the one or more server data sources comprises the following steps:
Step 1: determine the data generation cycle, the naming rule, and the collection mode, and estimate the size of each data file.
Step 2: set the collection interval according to the data generation cycle, the estimated data file size, and business requirements.
Step 3: at the set collection interval, check the current-cycle data file on the server data source by periodic polling; collect the incremental data using file slicing and resumable file transfer; store it locally as a small data file named according to the naming rule set in step 1; and record the file's byte size at that moment as the starting position for the next polling collection. The first poll collects the data from byte position 0 to the byte position of the data file at the time of the first poll.
Step 4: collect the data from the byte position recorded by the previous poll to the byte position of the data file at the time of the current poll, and repeat this read in a loop until the data file of the next cycle is generated.
Step 5: at the moment the data file of the next cycle described in step 4 is generated, perform the final polling collection.
Step 6: store the collected data as small data files in a designated directory according to the set naming rule, and either load them directly into the database or back them up to a server.
Step 7: for N server data sources, use multithreading to perform steps 3 to 6 in parallel.
Step 8: for multiple data categories, use multithreading or multiprocessing to perform steps 1 to 7 in parallel.
3. The method for real-time collection of self-growing mass data files from multiple sources according to claim 2, characterized in that:
The data sources in step 1 have the following characteristics:
There are N data sources, corresponding to N servers;
The data is stored as files on the N servers respectively;
Each data source generates one data file per cycle;
The data file grows through real-time writes within its cycle, until the data file of the next cycle is created;
The data file name follows a unique identification rule and takes the form YYYYMMDDHHMMSS.XXXX, where YYYYMMDDHHMMSS is the time-cycle feature and XXXX is the data category feature.
4. The method for real-time collection of self-growing mass data files from multiple sources according to claim 2, characterized in that the granularity of the file slices is defined by the collection interval set in step 2.
5. The method for real-time collection of self-growing mass data files from multiple sources according to claim 2, characterized in that: the file slicing in step 3 cuts the file into successive slices at equal time intervals, and the byte position of each cut is recorded as the starting byte position of the next slice to be collected; resumable file transfer means that, at each cut time, the data slice from the byte position recorded by the previous collection to the file's maximum byte position at the current time is collected.
6. The method for real-time collection of self-growing mass data files from multiple sources according to claim 2, characterized in that:
The naming rule in step 6 is as follows:
The data file name contains uniquely identifying elements and takes the form YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data time-cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the collection time feature. The data time-cycle feature and the data category feature are taken from the data source.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110334851.6A CN103092840B (en) | 2011-10-28 | 2011-10-28 | Multi-source is from increasing massive data files real-time collecting method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103092840A | 2013-05-08 |
CN103092840B | 2015-09-16 |
Family
ID=48205423
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110334851.6A Active CN103092840B (en) | 2011-10-28 | 2011-10-28 | Multi-source is from increasing massive data files real-time collecting method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103092840B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |