CN103092840B - Real-time collection method for multi-source self-growing massive data files - Google Patents
- Publication number
- CN103092840B (application number CN201110334851.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- file
- source
- time
- data file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A real-time collection method for multi-source self-growing massive data files. Multiple threads collect the self-growing data files of the data sources in parallel, and a file slicing technique combined with a resumable transfer technique slices each self-growing data file by time, so that each collection pass fetches only the incremental portion of the file. A collection interval is set according to the data generation period, the estimated data file size, and business requirements. At that interval, the current-period data file on each server data source is checked by periodic polling; the incremental data is collected using the file slicing and resumable transfer techniques and stored locally as small data files, and the file size in bytes at the current moment is recorded as the start position for the next poll. Because only the increment is collected each time, the invention achieves real-time collection of multi-source self-growing massive data files and solves the prior-art problems of long delay and poor real-time performance in remote data collection and the resulting impact on server load and stability.
Description
Technical field:
The invention relates to the field of physics, and in particular to mass-data collection technology in computer application systems, specifically a real-time collection method for multi-source self-growing massive data files.
Background:
Telecommunication services involve very large volumes of data. In a large-scale telecom application system, multiple data sources typically provide massive data files that grow in real time, and the application system must collect tens to hundreds of gigabytes of such data every day, for example PCMD and ROP data. The data is stored as files on multiple server data sources. Each data source generally generates one file per fixed period, for example one data file per hour or one per day. The file grows in real time within its period; when the next period begins, a new data file is created automatically and in turn grows in real time. Collecting these massive data files accurately and completely, and delivering them to the application system as early as possible, is a technical challenge.
In the prior art, a data file is collected and loaded into the database only after it has been fully written and has stopped growing. This has two drawbacks. First, the data delay is long and real-time performance is poor. The data file of one period cannot be collected until the next period begins, so data from the start of a period may be delayed by a full period before it is collected, and the collection itself takes a considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. When a massive amount of data is collected and loaded in one pass, server processing is concentrated in one long stretch of time; if the loading process fails, the cost of rollback is very high, and client query access to the server can also be severely slowed.
Summary of the invention:
The object of the invention is to provide a real-time collection method for multi-source self-growing massive data files that solves the prior-art problems of long delay and poor real-time performance in remote data collection and the resulting impact on server load and stability.
The method comprises a process of collecting self-growing data files from one or more server data sources. In this process, multiple threads collect the self-growing data files on the server data sources in parallel, and a file slicing technique and a resumable transfer technique slice each self-growing data file by time, so that each collection pass fetches the incremental portion of the file.
Further, the process of collecting self-growing data files from the one or more server data sources comprises the following steps:
Step 1: determine the data generation period, the naming rule, and the collection mode, and estimate the size of each data file.
Step 2: set the collection interval according to the data generation period, the estimated data file size, and business requirements.
Step 3: at the set collection interval, check the current-period data file on the server data source by periodic polling; collect the incremental data using the file slicing and resumable transfer techniques and store it locally as a small data file named according to the naming rule set in step 1; and record the file size in bytes at the current moment as the start position for the next poll. The first poll collects the data from byte 0 to the byte position of the end of the data file at the moment of that poll.
Step 4: collect the data from the byte position recorded by the previous poll to the byte position of the data file at the moment of the current poll, and repeat this read cyclically until the data file of the next period is generated.
Step 5: at the moment the next-period data file of step 4 is generated, perform the final poll collection of the current file.
Step 6: store the collected files as small data files in a designated directory according to the set naming rule, and either load them directly into the database or back them up to a server.
Step 7: for N server data sources, use multithreading to perform steps 3 to 6 in parallel.
Step 8: for multiple data categories, use multithreading or multiprocessing to perform steps 1 to 7 in parallel.
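The parallel collection of step 7 can be sketched as one worker thread per data source. This is a minimal illustration, not the patent's implementation: `collect_source` is a hypothetical placeholder for steps 3 to 6 on a single source, and the source names passed in are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_source(source: str) -> str:
    """Hypothetical placeholder for steps 3 to 6 on one data source."""
    return f"{source}: collected"


def collect_all(sources: list[str]) -> dict[str, str]:
    """Run one collector per data source in parallel (step 7)."""
    # One thread per source; max(1, ...) keeps the pool valid for an empty list.
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # pool.map returns results in the same order as the inputs.
        results = pool.map(collect_source, sources)
        return dict(zip(sources, results))
```

Threads suit this workload because each collector spends most of its time waiting on remote I/O. For step 8, the same pattern could be applied per data category, or replaced with processes if the per-category work is CPU-bound.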
Further, the data sources in step 1 have the following characteristics:
There are N data sources, corresponding to N servers.
The data is stored as files on the N servers.
One data file is generated per period.
The data file grows in real time as it is written within the period, until the data file of the next period is created.
The data file name follows a unique identification rule in the format YYYYMMDDHHMMSS.XXXX, where YYYYMMDDHHMMSS is the time period feature and XXXX is the data category feature.
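The source naming rule above can be illustrated with a small parser. This is a sketch that assumes names follow the YYYYMMDDHHMMSS.XXXX format exactly; the function name `parse_source_name`, the `SourceName` type, and the sample file name in the usage line are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class SourceName(NamedTuple):
    period_start: datetime  # time period feature (YYYYMMDDHHMMSS)
    category: str           # data category feature (XXXX)


def parse_source_name(name: str) -> SourceName:
    """Split a source file name into its period and category features."""
    stamp, sep, category = name.partition(".")
    if not sep or not category:
        raise ValueError(f"not in YYYYMMDDHHMMSS.XXXX form: {name!r}")
    # strptime raises ValueError if the timestamp part is malformed.
    return SourceName(datetime.strptime(stamp, "%Y%m%d%H%M%S"), category)
```

For example, `parse_source_name("20111028100000.PCMD")` yields a period start of 2011-10-28 10:00:00 and the category `PCMD`.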
Further, the granularity of the file slices is defined by the collection interval set in step 2.
Further, the file slicing technique in step 3 cuts the file into successive slices at equal time intervals, and the byte position recorded at each cut serves as the start byte position for collecting the next slice. The resumable transfer technique collects, at each cut time, the data slice from the byte position recorded by the previous collection to the largest byte position of the file at the current moment.
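The slicing and resumable-read behaviour described above amounts to reading from a remembered byte offset to the current end of the file. The sketch below assumes the growing file is reachable as a local path; the patent's sources are remote servers and it does not name a transfer protocol, so the local read and the function name `read_slice` are assumptions for illustration.

```python
import os


def read_slice(path: str, start: int) -> tuple[bytes, int]:
    """Return the bytes from `start` to the current end of file, plus the new offset."""
    size = os.path.getsize(path)  # file size at this poll moment
    if size <= start:
        return b"", start         # nothing new since the last poll
    with open(path, "rb") as f:
        f.seek(start)             # resume from the recorded byte position
        data = f.read(size - start)
    # The new offset is the start position for the next poll.
    return data, start + len(data)
```

The first poll calls `read_slice(path, 0)`; every later poll passes the offset returned by the previous call, so no byte is collected twice and none is skipped.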
Further, the naming rule in step 6 is as follows: the data file name contains uniquely identifying elements and follows the format YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data period feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the collection time feature. The data period feature and the data category feature are taken from the data source.
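As an illustration of this rule, a slice name can be assembled from the source file name, a data source identifier, and the collection time. The sketch treats ECPN as an opaque source identifier supplied by the caller; the function name `slice_name` and the identifier `OMP1` in the usage line are hypothetical.

```python
from datetime import datetime


def slice_name(source_name: str, source_id: str, collected_at: datetime) -> str:
    """Build a slice name of the form YYYYMMDDHHMMSSECPN_HHMMSS.XXXX."""
    # The period and category features come from the source file name.
    period, sep, category = source_name.partition(".")
    if not sep or not category:
        raise ValueError(f"not in YYYYMMDDHHMMSS.XXXX form: {source_name!r}")
    # source_id stands in for the ECPN data source feature.
    return f"{period}{source_id}_{collected_at:%H%M%S}.{category}"
```

For example, `slice_name("20111028100000.PCMD", "OMP1", datetime(2011, 10, 28, 10, 1, 0))` returns `20111028100000OMP1_100100.PCMD`. Including the collection time keeps successive slices of the same period file from overwriting each other.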
Compared with the prior art, the invention has clear beneficial effects. By using multithreading to collect the data files on multiple data servers simultaneously, and by using the file slicing and resumable transfer techniques to slice each data file by time and collect only the incremental portion each time, the invention achieves real-time collection of multi-source self-growing massive data files and solves the prior-art problems of long delay and poor real-time performance in remote data collection and the resulting impact on server load and stability.
Brief description of the drawings:
Fig. 1 is a schematic diagram of the real-time collection method for multi-source self-growing massive data files according to the invention.
Detailed description:
Embodiment 1:
As shown in Fig. 1, the real-time collection method for multi-source self-growing massive data files comprises a process of collecting self-growing data files from one or more server data sources. In this process, multiple threads collect the self-growing data files on the server data sources in parallel, and a file slicing technique and a resumable transfer technique slice each self-growing data file by time, so that each collection pass fetches the incremental portion of the file.
Further, the process of collecting self-growing data files from the one or more server data sources comprises the following steps:
Step 1: determine the data generation period, the naming rule, and the collection mode, and estimate the size of each data file. The main characteristics of the data sources are:
A. There are N data sources, corresponding to N servers.
B. The data is stored as files on the N servers.
C. One data file is generated per period T (for example, 1 hour or 1 day).
D. The data file grows in real time as it is written within the period, until the data file of the next period is created.
E. The data file name follows a unique identification rule in the format YYYYMMDDHHMMSS.XXXX. For example, in 10040711.PCMD, 10040711 is the time period feature and PCMD is the data category feature.
Step 2: set the collection interval according to the data generation period, the estimated data file size, and business requirements.
Step 3: at the set collection interval, check the current-period data file on the server data source by periodic polling; collect the incremental data using the file slicing and resumable transfer techniques and store it locally as a small data file named according to the naming rule set in step 1; and record the file size in bytes at the current moment as the start position for the next poll. The first poll collects the data from byte 0 to the byte position of the end of the data file at the moment of that poll.
Step 4: collect the data from the byte position recorded by the previous poll to the byte position of the data file at the moment of the current poll, and repeat this read cyclically until the data file of the next period is generated.
Step 5: at the moment the next-period data file of step 4 is generated, perform the final poll collection. The final poll must take place after the next-period data file has been generated, and immediately after its generation, to ensure both the completeness of the collection of the previous period's data file and the timeliness of the collection of the next period's data.
Step 6: store the collected files as small data files in a designated directory according to the set naming rule, and either load them directly into the database or back them up to a server.
Step 7: for N server data sources, use multithreading to perform steps 3 to 6 in parallel.
Step 8: for multiple data categories, use multithreading or multiprocessing to perform steps 1 to 7 in parallel.
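The period rollover in steps 4 and 5 can be sketched as a pure planning function: given the file currently being followed, the recorded offset, and a listing of file sizes on the source, it decides which byte ranges to read on this poll. This is one possible reading of the steps, not the patent's stated implementation. The function name `plan_poll` and the listing format are hypothetical, and the sketch assumes all names in the listing share one source and category, so that lexicographic order matches chronological order.

```python
def plan_poll(current: str, offset: int, sizes: dict[str, int]):
    """Return (reads, next_file, next_offset) for one poll.

    Each read is a tuple (file name, start byte, end byte).
    """
    reads = []
    size = sizes.get(current, offset)
    if size > offset:
        # Step 4: collect the increment of the current-period file.
        reads.append((current, offset, size))
        offset = size
    newer = sorted(name for name in sizes if name > current)
    if not newer:
        return reads, current, offset
    # Step 5: a next-period file exists, so the read above was the final
    # poll of the old file; start following the new file from byte 0.
    nxt = newer[0]
    if sizes[nxt] > 0:
        reads.append((nxt, 0, sizes[nxt]))
    return reads, nxt, sizes[nxt]
```

Reading the old file's tail and the new file's head in the same poll reflects the requirement that the final poll happen right after the next file appears. If a source can still append to the old file after creating the new one, an extra trailing poll of the old file would be needed; the patent text does not address that case.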
Further, the granularity of the file slices is defined by the collection interval set in step 2.
Further, the file slicing technique in step 3 cuts the file into successive slices at equal time intervals, and the byte position recorded at each cut serves as the start byte position for collecting the next slice. The resumable transfer technique collects, at each cut time, the data slice from the byte position recorded by the previous collection to the largest byte position of the file at the current moment.
Further, the naming rule in step 6 is as follows: the data file name contains uniquely identifying elements and follows the format YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data period feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the collection time feature. The data period feature and the data category feature are taken from the data source.
In one embodiment of the invention, PCMD data is stored as files on 7 OMP server data sources. Each data source generates one data file per hour; the file is written in real time during the current hour until the data file for the next hour is created and written in turn. Each hourly file contains on the order of millions of records.
The traditional method collects a PCMD data file only after it has been fully written, which has two serious shortcomings. First, the data delay is long and real-time performance is poor. The data file of one period cannot be collected until the next period begins (for example, the data file for 10:00 to 11:00 can be collected and loaded only after 11:00), so data from the start of a period may be delayed by a full period before it is collected, and the collection itself takes a considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. When a massive amount of data is collected and loaded in one pass, server processing is concentrated in one long stretch of time; if the loading process fails, the cost of rollback is very high, and client query access to the server can also be severely slowed.
When the invention is used to collect PCMD data with a collection interval of 1 minute, each hourly file is cut into 60 collection passes, each pass collects only the increment, and the 7 OMP sources are collected in parallel, so the massive PCMD data is collected and loaded promptly and completely. Compared with the traditional method, the advantages are as follows. First, it overcomes the long data delay of the traditional method: the delay is reduced from 1 hour to 1 minute, improving the timeliness of data collection. Second, it helps balance server load and avoids the stability risk of collecting and processing a large data file in one concentrated period, improving the stability of PCMD data collection and processing.
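The figures in this embodiment follow from dividing the generation period by the collection interval. The short calculation below reproduces them; the function name `slices_per_period` is hypothetical, and the total of 420 slice files per hour is a derived figure that the patent does not state.

```python
import math


def slices_per_period(period_s: float, interval_s: float) -> int:
    """Number of collection passes needed to cover one generation period."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return math.ceil(period_s / interval_s)


hourly = slices_per_period(3600, 60)  # 60 passes per hourly file
total = hourly * 7                    # 420 slice files per hour across 7 sources
```

The worst-case delay before a record is collected is roughly one collection interval, which is where the reduction from 1 hour to 1 minute comes from. A shorter interval lowers the delay further but produces more, smaller slice files.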
Claims (5)
1. A real-time collection method for multi-source self-growing massive data files, comprising a process of collecting self-growing data files from one or more server data sources, characterized in that: in said process, multiple threads collect the self-growing data files on the one or more server data sources in parallel, and a file slicing technique and a resumable transfer technique slice each self-growing data file by time, so that each collection pass fetches the incremental portion of the self-growing data file; and the process of collecting self-growing data files from the one or more server data sources comprises the following steps:
Step 1: determine the data generation period, the naming rule, and the collection mode, and estimate the size of each data file.
Step 2: set the collection interval according to the data generation period, the estimated data file size, and business requirements.
Step 3: at the set collection interval, check the current-period data file on the server data source by periodic polling; collect the incremental data using the file slicing and resumable transfer techniques and store it locally as a small data file named according to the naming rule set in step 1; and record the file size in bytes at the current moment as the start position for the next poll. The first poll collects the data from byte 0 to the byte position of the end of the data file at the moment of that poll.
Step 4: collect the data from the byte position recorded by the previous poll to the byte position of the data file at the moment of the current poll, and repeat this read cyclically until the data file of the next period is generated.
Step 5: at the moment the next-period data file of step 4 is generated, perform the final poll collection of the current file.
Step 6: store the collected files as small data files in a designated directory according to the set naming rule, and either load them directly into the database or back them up to a server.
Step 7: for N server data sources, use multithreading to perform steps 3 to 6 in parallel.
Step 8: for multiple data categories, use multithreading or multiprocessing to perform steps 1 to 7 in parallel.
2. The real-time collection method for multi-source self-growing massive data files as claimed in claim 1, characterized in that the data sources have the following characteristics:
There are N data sources, corresponding to N servers.
The data is stored as files on the N servers.
One data file is generated per period.
The data file grows in real time as it is written within the period, until the data file of the next period is created.
The data file name follows a unique identification rule in the format YYYYMMDDHHMMSS.XXXX, where YYYYMMDDHHMMSS is the time period feature and XXXX is the data category feature.
3. The real-time collection method for multi-source self-growing massive data files as claimed in claim 1, characterized in that the granularity of the file slices is defined by the collection interval set in step 2.
4. The real-time collection method for multi-source self-growing massive data files as claimed in claim 1, characterized in that the file slicing technique in step 3 cuts the file into successive slices at equal time intervals, the byte position recorded at each cut serving as the start byte position for collecting the next slice, and the resumable transfer technique collects, at each cut time, the data slice from the byte position recorded by the previous collection to the largest byte position of the file at the current moment.
5. The real-time collection method for multi-source self-growing massive data files as claimed in claim 1, characterized in that the naming rule in step 6 is as follows:
The data file name contains uniquely identifying elements and follows the format YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data period feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the collection time feature. The data period feature and the data category feature are taken from the data source.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110334851.6A CN103092840B (en) | 2011-10-28 | 2011-10-28 | Real-time collection method for multi-source self-growing massive data files |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110334851.6A CN103092840B (en) | 2011-10-28 | 2011-10-28 | Real-time collection method for multi-source self-growing massive data files |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103092840A (en) | 2013-05-08 |
CN103092840B (en) | 2015-09-16 |
Family
ID=48205423
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110334851.6A (Active, published as CN103092840B) | 2011-10-28 | 2011-10-28 | Real-time collection method for multi-source self-growing massive data files |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103092840B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103699666A (en) * | 2013-12-27 | 2014-04-02 | 乐视网信息技术(北京)股份有限公司 | Transmission method and transmission device for splitting data |
CN103685559A (en) * | 2013-12-27 | 2014-03-26 | 乐视网信息技术(北京)股份有限公司 | Method and system for processing data in server |
CN103678699A (en) * | 2013-12-27 | 2014-03-26 | 乐视网信息技术(北京)股份有限公司 | Method and system for merging data in server |
CN103701907A (en) * | 2013-12-27 | 2014-04-02 | 乐视网信息技术(北京)股份有限公司 | Processing method and system for continuing to transmit data in server |
US9268597B2 (en) | 2014-04-01 | 2016-02-23 | Google Inc. | Incremental parallel processing of data |
CN104111983B (en) * | 2014-06-30 | 2017-12-19 | 中国科学院信息工程研究所 | A kind of open multi-source data acquiring system and method |
CN104376082B (en) * | 2014-11-18 | 2019-06-18 | 中国建设银行股份有限公司 | A method of the data in data source file are imported into database |
CN105183585B (en) * | 2015-08-27 | 2019-03-26 | 北京金山安全软件有限公司 | Data backup method and device |
CN105893529A (en) * | 2016-03-30 | 2016-08-24 | 乐视控股(北京)有限公司 | Data collecting method and ETL assembly |
CN105843935A (en) * | 2016-03-30 | 2016-08-10 | 乐视控股(北京)有限公司 | Data acquisition method and ETL (Extraction-Transformation-Loading) assembly |
CN107993696B (en) * | 2017-12-25 | 2020-11-17 | 东软集团股份有限公司 | Data acquisition method, device, client and system |
CN110347661A (en) * | 2019-07-05 | 2019-10-18 | 北京红山信息科技研究院有限公司 | Method, apparatus, server and the storage medium that data source is quasi real time put in storage |
CN111159118B (en) * | 2019-12-20 | 2024-01-26 | 东软集团股份有限公司 | Polling monitoring method and device, storage medium and electronic equipment |
CN112669148A (en) * | 2020-12-22 | 2021-04-16 | 深圳市富途网络科技有限公司 | Order processing method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1756108A (en) * | 2004-09-29 | 2006-04-05 | 华为技术有限公司 | Master/backup system data synchronizing method |
CN101719143A (en) * | 2009-12-01 | 2010-06-02 | 北京中科创元科技有限公司 | Method for parallel processing compare increment data extraction |
CN102110121A (en) * | 2009-12-24 | 2011-06-29 | 阿里巴巴集团控股有限公司 | Method and system for processing data |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7882064B2 (en) * | 2006-07-06 | 2011-02-01 | Emc Corporation | File system replication |
Also Published As
Publication number | Publication date |
---|---|
CN103092840A (en) | 2013-05-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |