CN103092840B

CN103092840B - Real-time collection method for multi-source, self-growing massive data files

Info

Publication number: CN103092840B
Application number: CN201110334851.6A
Authority: CN
Inventors: 王志海; 麦菁; 辛炜博; 徐卸土; 王智博
Original assignee: Shanghai Posts & Telecommunication Designing Consulting Institute Co Ltd
Current assignee: Shanghai Posts & Telecommunication Designing Consulting Institute Co Ltd
Priority date: 2011-10-28
Filing date: 2011-10-28
Publication date: 2015-09-16
Anticipated expiration: 2031-10-28
Also published as: CN103092840A

Abstract

A real-time collection method for multi-source, self-growing massive data files. Multiple threads collect the self-growing data files on the data sources in parallel, and file slicing and file resume techniques slice each self-growing data file by time, so that each collection fetches only the incremental portion of the file. An acquisition interval is set according to the data generation period, the estimated data file size, and business requirements. At that interval, the current-period data file on each server data source is checked by periodic polling; the incremental data is collected using file slicing and file resume, stored locally as small data files, and the byte size of the file at the current moment is recorded as the starting position for the next poll. Because only the incremental portion is collected each time, the invention achieves real-time collection of multi-source, self-growing massive data files and solves the prior-art problems of long delay and poor real-time performance in remote data collection, and of the resulting impact on server load and stability.

Description

Real-time collection method for multi-source, self-growing massive data files

Technical field:

The present invention relates to the field of physics, in particular to massive data collection technology in computer application systems, and specifically to a real-time collection method for multi-source, self-growing massive data files.

Background technology:

The data volume involved in telecommunication services is enormous. In large-scale telecom application systems, multiple data sources typically provide massive self-growing data files in real time, and the application system must collect tens to hundreds of GB of data every day, such as PCMD and ROP data. This data is stored as files on multiple server data sources. Each data source generally generates one file per fixed period, for example one data file per hour or one per day. The file grows in real time within the period until the next period, when the corresponding data file is created automatically and in turn grows in real time. Ensuring that massive data files are collected accurately and completely and delivered to the application system as early as possible has become a technical challenge.

The prior art collects and loads a data file only after it has been fully written and no longer grows, which has two drawbacks. First, the data delay is long and real-time performance is poor. The data file of the previous period cannot be collected until the next period starts, so data from the beginning of the previous period may be delayed by a full period before it is collected, and the collection itself takes a considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. Massive data is collected and loaded in one pass, so server processing is concentrated in one long stretch of time; if the loading process fails, the cost of rollback is very high, and the speed of client queries against the server is also seriously affected.

Summary of the invention:

The object of the present invention is to provide a real-time collection method for multi-source, self-growing massive data files that solves the prior-art problems of long delay and poor real-time performance in remote data collection, and of the resulting impact on server load and stability.

The real-time collection method for multi-source, self-growing massive data files of the present invention comprises a process of collecting self-growing data files from one or more server data sources. In this process, multiple threads collect the self-growing data files on the one or more server data sources in parallel, and file slicing and file resume techniques slice the self-growing data files by time, so that each collection fetches the incremental portion of the self-growing data file.

Further, the process of collecting self-growing data files from the one or more server data sources comprises the following steps:

Step 1: determine the data generation period, the naming rule, and the acquisition mode, and estimate the size of each data file.

Step 2: set the acquisition interval according to the data generation period, the estimated data file size, and business requirements.

Step 3: at the set acquisition interval, check the current-period data file on the server data source by periodic polling; collect the incremental data using the file slicing and file resume techniques and store it locally as small data files named according to the naming rule set in step 1; and record the byte size of the file at the current moment of this collection as the starting position for the next poll. The first poll collects the data from byte position 0 to the byte position of the data file at the moment of the first poll.

Step 4: collect the data from the byte position recorded by the previous poll to the byte position of the data file at the moment of the current poll, and repeat in a loop until the data file of the next period is generated.

Step 5: at the moment the next-period data file described in step 4 is generated, perform the final poll collection.

Step 6: store the collected files as small data files in the designated directory according to the set naming rule, and load them directly into the database or back them up to a server.

Step 7: for N server data sources, use multithreading to collect in parallel according to steps 3 to 6.

Step 8: for multiple data categories, use multithreading or multi-process techniques to collect in parallel according to steps 1 to 7.
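As an illustration of steps 3, 4 and 7, the following is a minimal sketch of one polling round run in parallel over several sources. It assumes, purely for illustration, that each source file is reachable as a local path; the patent does not specify a transfer protocol, and the names `read_increment`, `poll_all`, `sources` and `offsets` are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_increment(path, offset):
    """Read the bytes between the recorded offset and the current file size."""
    size = os.path.getsize(path)  # byte size at the moment of this poll
    if size <= offset:
        return b"", offset  # nothing new since the previous poll
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size - offset)
    return data, offset + len(data)


def poll_all(sources, offsets):
    """Run one polling round over all sources, one worker thread per source.

    sources: dict mapping a source id to the path of its current-period file
    offsets: dict mapping a source id to the byte offset recorded last round
             (missing entries start at byte 0, as in the first poll)
    Returns a dict mapping each source id to the bytes collected this round.
    """
    def work(item):
        source_id, path = item
        data, new_offset = read_increment(path, offsets.get(source_id, 0))
        return source_id, data, new_offset

    collected = {}
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        for source_id, data, new_offset in pool.map(work, sources.items()):
            offsets[source_id] = new_offset  # start position for the next poll
            collected[source_id] = data
    return collected
```

Calling `poll_all` once per acquisition interval (for example from a timer loop) would give the periodic polling described in step 3; the offsets are updated only in the calling thread, so no locking is needed in this sketch.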

Further, the data sources in step 1 have the following characteristics:

There are N data sources corresponding to N servers,

The data is stored as files on the N servers respectively,

The data generates one data file per period,

The data file is written to and grows in real time within the period, until the data file of the next period is created,

The data file name contains a unique identification rule and follows the YYYYMMDDHHMMSS.XXXX format, where YYYYMMDDHHMMSS is the time period feature and XXXX is the data category feature.
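As a small illustration of the source naming rule, the sketch below splits a name of the form YYYYMMDDHHMMSS.XXXX into its time period feature and its data category feature. The function name `parse_source_name` and the sample file name are hypothetical and not taken from the patent.

```python
from datetime import datetime


def parse_source_name(file_name):
    """Split 'YYYYMMDDHHMMSS.XXXX' into (period start, data category)."""
    period_text, _, category = file_name.rpartition(".")
    # strptime raises ValueError if the period part is not 14 valid digits
    period_start = datetime.strptime(period_text, "%Y%m%d%H%M%S")
    return period_start, category


# Hypothetical example: the hourly PCMD file whose period starts at 10:00:00
period_start, category = parse_source_name("20111028100000.PCMD")
```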

Further, the granularity of the file slices is determined by the acquisition interval set in step 2.

Further, the file slicing technique in step 3 cuts the file into successive file slices at equal time intervals, and the byte position of each cut is recorded as the starting byte position for collecting the next slice. The file resume technique, at each point in time at which the file is cut, collects the data slice from the byte position recorded by the previous collection to the maximum byte position of the file at the current moment.
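The slice-and-resume idea can be sketched as a single function that reads from the previously recorded byte position to the current end of the file and returns the new position. This is a minimal sketch that assumes the growing file can be opened as a local path; `collect_increment` and its parameters are hypothetical names.

```python
import os


def collect_increment(path, start_offset):
    """Collect one slice of a growing file.

    Reads from start_offset (the position recorded by the previous poll)
    up to the file size observed at this poll, and returns
    (data, new_offset), where new_offset is the start of the next slice.
    """
    end_offset = os.path.getsize(path)  # maximum byte position right now
    if end_offset <= start_offset:
        return b"", start_offset  # the file has not grown
    with open(path, "rb") as f:
        f.seek(start_offset)
        data = f.read(end_offset - start_offset)
    return data, start_offset + len(data)
```

The first poll would pass `start_offset=0`; every later poll passes the offset returned by the previous call, so no byte is collected twice and none is skipped.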

Further, the naming rule in step 6 comprises: the data file name contains uniquely identifying elements and follows the YYYYMMDDHHMMSSECPN_HHMMSS.XXXX format, where YYYYMMDDHHMMSS is the data period feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the acquisition time feature; the data period feature and the data category feature are taken from the data source.
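A minimal sketch of how a slice file name could be assembled under this rule. The function name, its parameters, and the source identifier "OMP1" are hypothetical; the patent only states that ECPN is the data source feature.

```python
from datetime import datetime


def slice_file_name(period_feature, source_feature, collected_at, category):
    """Build '<period><source>_<HHMMSS>.<category>' for one collected slice.

    period_feature: the YYYYMMDDHHMMSS part copied from the source file name
    source_feature: identifier of the data source (the ECPN position)
    collected_at:   datetime of this poll, rendered as HHMMSS
    category:       the XXXX part copied from the source file name
    """
    return f"{period_feature}{source_feature}_{collected_at:%H%M%S}.{category}"


# Hypothetical example: slice of the 10:00 PCMD file collected at 10:05:00
name = slice_file_name("20111028100000", "OMP1",
                       datetime(2011, 10, 28, 10, 5, 0), "PCMD")
```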

Compared with the prior art, the effect of the present invention is positive and evident. By using multithreading to collect the data files on multiple data servers simultaneously, and by using the file slicing and file resume techniques to slice each data file by time and collect only the incremental portion each time, the invention achieves real-time collection of multi-source, self-growing massive data files and solves the prior-art problems of long delay and poor real-time performance in remote data collection, and of the resulting impact on server load and stability.

Brief description of the drawings:

Fig. 1 is a schematic diagram of the real-time collection method for multi-source, self-growing massive data files of the present invention.

Embodiment:

Embodiment 1:

As shown in Fig. 1, the real-time collection method for multi-source, self-growing massive data files of the present invention comprises a process of collecting self-growing data files from one or more server data sources. In this process, multiple threads collect the self-growing data files on the one or more server data sources in parallel, and file slicing and file resume techniques slice the self-growing data files by time, so that each collection fetches the incremental portion of the self-growing data file.

Step 1: determine the data generation period, the naming rule, and the acquisition mode, and estimate the size of each data file. The main characteristics of the data sources are:

A. There are N data sources corresponding to N servers;

B. The data is stored as files on the N servers respectively;

C. The data generates one data file per period (T), for example 1 hour or 1 day;

D. The data file is written to and grows in real time within the period, until the data file of the next period is created;

E. The data file name contains a unique identification rule and follows the YYYYMMDDHHMMSS.XXXX format, for example 10040711.PCMD, where 10040711 is the time period feature and PCMD is the data category feature.

Step 5: at the moment the next-period data file described in step 4 is generated, perform the final poll collection. The final poll must take place after the next-period data file has been generated, and immediately after its generation, to ensure both the completeness of collection of the previous-period data file and the timeliness of collection of the next data.

In one embodiment of the invention, PCMD data is stored as files on 7 OMP server data sources. Each data source generates one data file per hour; the file is written in real time during the current hour, until the data file of the next hour is created and written in real time. The number of records in each hourly file is on the order of millions.

The traditional method collects a PCMD data file only after it has been fully written, which has two serious shortcomings. First, the data delay is long and real-time performance is poor. The data file of the previous period cannot be collected until the next period starts (the data file for 10:00 to 11:00 can be collected and loaded only after 11:00), so data from the beginning of the previous period may be delayed by a full period before it is collected, and the collection itself takes a considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. Massive data is collected and loaded in one pass, so server processing is concentrated in one long stretch of time; if the loading process fails, the cost of rollback is very high, and the speed of client queries against the server is also seriously affected.

When the present invention is used to collect PCMD data, the acquisition interval is set to 1 minute, so each hourly file is split into 60 collections, each collecting only the increment, with the 7 OMP sources collected in parallel. This achieves timely and complete collection and loading of massive PCMD data. Compared with the traditional method, the advantages are as follows. First, the long data delay of the traditional method is overcome: the data delay is reduced from 1 hour to 1 minute, which improves the real-time performance of data collection. Second, server load is better balanced: the stability risk of collecting and processing a large data file within one concentrated stretch of time is avoided, which improves the stability of PCMD data collection and processing.
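The arithmetic behind this embodiment can be checked with a short sketch. The function name `slices_per_period` is hypothetical; the figures (1-hour period, 1-minute interval) come from the embodiment above.

```python
def slices_per_period(period_seconds, interval_seconds):
    """Number of polls needed to cover one period (ceiling division)."""
    return -(-period_seconds // interval_seconds)


# Embodiment values: hourly PCMD files polled every minute
hourly_slices = slices_per_period(3600, 60)
```

With these values each hourly file is covered by 60 collections, and the longest time a record waits before being collected is about one acquisition interval rather than one full period.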

Claims

1. A real-time collection method for multi-source, self-growing massive data files, comprising a process of collecting self-growing data files from one or more server data sources, characterized in that: in the process of collecting self-growing data files from the one or more server data sources, multiple threads collect the self-growing data files on the one or more server data sources in parallel; file slicing and file resume techniques slice said self-growing data files by time; each collection fetches the incremental portion of the self-growing data file; and the process of collecting self-growing data files from the one or more server data sources comprises the following steps:

2. The real-time collection method for multi-source, self-growing massive data files as claimed in claim 1, characterized in that said data sources have the following characteristics:

There are N data sources corresponding to N servers,

The data is stored as files on the N servers respectively,

The data generates one data file per period,

3. The real-time collection method for multi-source, self-growing massive data files as claimed in claim 1, characterized in that the granularity of the file slices is determined by the acquisition interval set in said step 2.

4. The real-time collection method for multi-source, self-growing massive data files as claimed in claim 1, characterized in that the file slicing technique in said step 3 cuts the file into successive file slices at equal time intervals, and the byte position of each cut is recorded as the starting byte position for collecting the next slice; and the file resume technique, at each point in time at which the file is cut, collects the data slice from the byte position recorded by the previous collection to the maximum byte position of the file at the current moment.

5. The real-time collection method for multi-source, self-growing massive data files as claimed in claim 1, characterized in that the naming rule in said step 6 comprises:

The data file name contains uniquely identifying elements and follows the YYYYMMDDHHMMSSECPN_HHMMSS.XXXX format, where YYYYMMDDHHMMSS is the data period feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the acquisition time feature; the data period feature and the data category feature are taken from the data source.