CN103092840A - Method for acquiring self-increment mass data files from multiple sources

Info

Publication number: CN103092840A
Application number: CN2011103348516A
Authority: CN
Inventors: 王志海; 麦菁; 辛炜博; 徐卸土; 王智博
Original assignee: Shanghai Posts & Telecommunication Designing Consulting Institute Co Ltd
Current assignee: Shanghai Posts & Telecommunication Designing Consulting Institute Co Ltd
Priority date: 2011-10-28
Filing date: 2011-10-28
Publication date: 2013-05-08
Anticipated expiration: 2031-10-28
Also published as: CN103092840B

Abstract

The invention relates to a method for collecting self-growing massive data files from multiple sources in real time. The method uses multithreading to collect self-growing data files from multiple data sources in parallel, and uses file slicing and resumable file transfer to slice each self-growing data file by time, so that only the increment of the file is collected on each pass. The acquisition interval is set according to the data generation cycle, the estimated data file size, and business requirements. At that interval, the current-cycle data file on each server data source is checked by periodic polling; the incremental data is collected using file slicing and resumable file transfer and stored locally as small data files, and the file size in bytes at the current moment is recorded as the start position for the next polling acquisition. By collecting only the increment each time, the method achieves collection of self-growing massive data files and solves the problems of the prior art in collecting telecom data: long delay, poor real-time performance, and adverse effects on server load and stability.

Description

Real-time collection method for multi-source self-growing massive data files

Technical field:

The present invention relates to the field of physics, in particular to massive data collection technology in computer application systems, and specifically to a real-time collection method for multi-source self-growing massive data files.

Background technology:

The data volume involved in telecommunication services is enormous. In a large-scale telecom application system, multiple data sources typically provide massive, real-time, self-growing data files at the same time, and the application system needs to collect tens to hundreds of gigabytes of data every day, such as PCMD and ROP data. Such data is stored as files on multiple server data sources. Each data source generally generates one file per fixed cycle, for example one data file per hour or one per day. The file grows in real time within the cycle; when the next cycle begins, a new data file is created automatically and grows in real time in turn. Ensuring that massive data files are collected promptly, accurately, and completely and delivered to the application system has become a technical challenge.

The prior art collects a data file and loads it into the database only after the file has been completely written and no longer grows. This has two drawbacks. First, the data delay is long and real-time performance is poor. The data file of the previous cycle cannot be collected until the next cycle begins, so data from the early part of the previous cycle may be delayed by a full cycle before it can be collected, and the collection itself also takes a considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. When massive data is collected and loaded in one pass, server processing is concentrated in one long period; if the loading process fails, the cost of rollback is very high, and client query access to the server can be seriously slowed.

Summary of the invention:

The object of the present invention is to provide a real-time collection method for multi-source self-growing massive data files that solves the technical problems of the prior art in collecting telecom data: long delay, poor real-time performance, and adverse effects on server load and stability.

The real-time collection method for multi-source self-growing massive data files of the present invention comprises a process of collecting self-growing data files from one or more server data sources, wherein, in said process, multithreading is used to collect the self-growing data files on the one or more server data sources in parallel, and file slicing and resumable file transfer are used to slice the self-growing data files by time, so that only the incremental portion of each self-growing data file is collected each time.

Further, the process of collecting self-growing data files from one or more server data sources comprises the following steps:

Step 1: determine the data generation cycle, the naming rule, and the acquisition mode, and estimate the size of each data file,

Step 2: set the acquisition interval according to the data generation cycle, the estimated data file size, and business requirements,

Step 3: at the set acquisition interval, check the current-cycle data file on the server data source by periodic polling; collect the incremental data using file slicing and resumable file transfer; store it locally as a small data file named according to the naming rule set in step 1; and record the file size in bytes at the current moment of this acquisition as the start position for the next polling acquisition. The first poll collects the data from byte position 0 to the byte position of the data file at the moment of the first poll,

Step 4: collect the data from the byte position recorded at the previous poll to the byte position of the data file at the moment of the current poll, and repeat this read in a loop until the data file of the next cycle is generated,

Step 5: at the moment the next-cycle data file described in step 4 is generated, perform the final polling acquisition,

Step 6: store the collected data as small data files in a specified directory according to the set naming rule, and either load them directly into the database or back them up to a server,

Step 7: for N server data sources, use multithreading to carry out steps 3 to 6 in parallel,

Step 8: for multiple data categories, use multithreading or multi-process techniques to carry out steps 1 to 7 in parallel.
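
By way of non-limiting illustration, one polling round over several sources (steps 3, 4, and 7) might be sketched in Python as follows. The sketch assumes that each server data source is visible as a local file path, which the source text does not specify; the function names `collect_increment` and `poll_all_sources` are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def collect_increment(path, start):
    """Read the bytes appended to `path` since byte position `start`."""
    end = os.path.getsize(path)              # file size at this polling moment
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(max(0, end - start))   # never read past the snapshot
    return data, start + len(data)


def poll_all_sources(paths, offsets):
    """Run one polling round over all sources in parallel (one thread each).

    `offsets` maps each path to the byte position recorded at the previous
    poll and is updated in place with the new positions.
    """
    starts = [offsets[p] for p in paths]
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        results = list(pool.map(collect_increment, paths, starts))
    increments = {}
    for path, (data, new_offset) in zip(paths, results):
        increments[path] = data
        offsets[path] = new_offset           # start position for the next poll
    return increments
```

Calling `poll_all_sources` once per acquisition interval, with every offset initialised to 0, reproduces the behaviour described above: the first poll collects from byte 0, and each later poll collects only what has been appended since the previous one.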

Further, the data source in step 1 has the following characteristics:

N data sources correspond to N servers,

The data is stored as files on the N servers respectively,

One data file is generated per cycle,

The data file is written to and grows in real time within the cycle, until the data file of the next cycle is created,

The data file name follows a unique identification rule and takes the form YYYYMMDDHHMMSS.XXXX, where YYYYMMDDHHMMSS is the time cycle feature and XXXX is the data category feature.
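
As an illustration of the source naming rule above, the following Python sketch splits a name of the form YYYYMMDDHHMMSS.XXXX into its time cycle feature and data category feature. The function name `parse_source_name` and the sample file name are hypothetical, and the sketch accepts only the full 14-digit time form.

```python
from datetime import datetime


def parse_source_name(name):
    """Split 'YYYYMMDDHHMMSS.XXXX' into (cycle start time, data category)."""
    stamp, sep, category = name.partition(".")
    if not sep or not category:
        raise ValueError("missing data category in %r" % name)
    if len(stamp) != 14 or not stamp.isdigit():
        raise ValueError("expected 14 digits before the dot in %r" % name)
    # strptime also rejects impossible dates such as month 13.
    cycle_start = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return cycle_start, category


cycle_start, category = parse_source_name("20111028100000.PCMD")
```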

Further, the granularity of the file slices is defined by the acquisition interval set in step 2.

Further, the file slicing in step 3 cuts the file into successive slices at equal time intervals, and the byte position recorded at each cut serves as the start byte position for collecting the next slice. The resumable file transfer collects, at each cutting time point, the data slice from the byte position recorded at the previous acquisition to the maximum byte position of the file at the current moment.
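
The relationship between recorded byte positions and slices can be shown with a small worked example. The Python sketch below is illustrative only: given the file sizes observed at successive polls, it derives the byte range of each slice. The function name `slice_ranges` and the sample sizes are hypothetical.

```python
def slice_ranges(sizes_at_polls):
    """Turn the file sizes seen at successive polls into (start, end) byte ranges."""
    ranges = []
    start = 0                          # the first poll collects from byte 0
    for size in sizes_at_polls:
        if size < start:
            raise ValueError("file shrank between polls")
        ranges.append((start, size))   # slice ends at the current file size
        start = size                   # recorded position: next slice starts here
    return ranges


ranges = slice_ranges([100, 250, 250, 400])
# ranges == [(0, 100), (100, 250), (250, 250), (250, 400)]
```

The slices are contiguous and non-overlapping, so concatenating them in order reproduces the file as it stood at the last poll. A poll during which nothing was written yields an empty slice, as in the third range.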

Further, the naming rule in step 6 is as follows: the data file name contains uniquely identifying elements and takes the form YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data time cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the acquisition time feature. The data time cycle feature and the data category feature are taken from the data source.
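
A minimal Python sketch of this slice naming rule follows. It assumes that the data source feature is a short identifier string; the value "ECP1", the sample file name, and the function name `build_slice_name` are hypothetical and not taken from the source.

```python
from datetime import time


def build_slice_name(source_name, source_id, acquired_at):
    """Compose 'YYYYMMDDHHMMSS<source>_HHMMSS.XXXX' for one collected slice.

    The time cycle feature and the data category feature are copied from the
    source file name; the source id and acquisition time are added locally.
    """
    stamp, _, category = source_name.partition(".")
    return "%s%s_%s.%s" % (stamp, source_id, acquired_at.strftime("%H%M%S"), category)


name = build_slice_name("20111028100000.PCMD", "ECP1", time(10, 1, 0))
# name == "20111028100000ECP1_100100.PCMD"
```

Because the acquisition time is part of the name, every slice of the same source file receives a distinct local file name as long as polls are at least one second apart.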

Compared with the prior art, the present invention has positive and evident effects. By using multithreading to collect data files on multiple data servers simultaneously, and by using file slicing and resumable file transfer to slice each data file by time and collect only the incremental portion each time, the invention achieves real-time collection of multi-source self-growing massive data files and solves the technical problems of the prior art in collecting telecom data: long delay, poor real-time performance, and adverse effects on server load and stability.

Description of drawings:

Fig. 1 is a schematic diagram of the real-time collection method for multi-source self-growing massive data files of the present invention.

Embodiment:

Embodiment 1:

As shown in Fig. 1, the real-time collection method for multi-source self-growing massive data files of the present invention comprises a process of collecting self-growing data files from one or more server data sources, wherein, in said process, multithreading is used to collect the self-growing data files on the one or more server data sources in parallel, and file slicing and resumable file transfer are used to slice the self-growing data files by time, so that only the incremental portion of each self-growing data file is collected each time.

Step 1: determine the data generation cycle, the naming rule, and the acquisition mode, and estimate the size of each data file. The characteristics of the data source mainly include:

A. N data sources correspond to N servers;

B. The data is stored as files on the N servers respectively;

C. One data file is generated per cycle (T), for example per hour or per day;

D. The data file is written to and grows in real time within the cycle, until the data file of the next cycle is created;

E. The data file name follows a unique identification rule and takes the form YYYYMMDDHHMMSS.XXXX. For example, in 10040711.PCMD, 10040711 is the time cycle feature and PCMD is the data category feature.

Step 5: at the moment the next-cycle data file described in step 4 is generated, perform the final polling acquisition. The final polling acquisition must take place after the next-cycle data file has been generated, and as soon as possible after it has been generated, so as to guarantee both the completeness of the collection of the previous cycle's data file and the timeliness of the next data acquisition.
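
The final polling acquisition at the cycle boundary might be sketched in Python as follows. The sketch is illustrative only and rests on stated assumptions: the directory holds the files of a single data category, file names sort in cycle order, and the polling interval is shorter than the cycle, so that at most one new file appears between polls. The function names `read_from` and `poll_with_rollover` are hypothetical.

```python
import os
import tempfile


def read_from(path, start):
    """Return the bytes of `path` from byte position `start` to its current end."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read()


def poll_with_rollover(directory, active, offset):
    """Perform one poll and return (collected, active, offset).

    `collected` is a list of (file name, bytes) pairs gathered by this poll.
    """
    names = sorted(os.listdir(directory))   # time-prefixed names sort by cycle
    if not names:
        return [], active, offset
    newest = names[-1]
    collected = []
    if newest != active:
        if active is not None:
            # The next-cycle file now exists, so the old file is complete:
            # drain its remaining bytes in one final acquisition.
            tail = read_from(os.path.join(directory, active), offset)
            collected.append((active, tail))
        active, offset = newest, 0           # the new file is read from byte 0
    data = read_from(os.path.join(directory, active), offset)
    collected.append((active, data))
    return collected, active, offset + len(data)
```

Draining the old file only after the new one has appeared is what guarantees completeness, since the source stops writing to a file once its successor is created.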

In one embodiment of the invention, PCMD data is stored as files on 7 OMP server data sources. Each data source generates one data file per hour; the file is written to in real time during the current hour, until the data file of the next hour is created and written to in real time. The number of records in each hourly file reaches the order of millions.

The traditional method collects a PCMD data file only after it has been completely written, which has two serious drawbacks. First, the data delay is long and real-time performance is poor. The data file of the previous cycle cannot be collected until the next cycle begins (the data file for 10:00 to 11:00 can only be collected and loaded after 11:00), so data from the early part of the previous cycle may be delayed by a full cycle before it can be collected, and the collection itself also takes a considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. When massive data is collected and loaded in one pass, server processing is concentrated in one long period; if the loading process fails, the cost of rollback is very high, and client query access to the server can be seriously slowed.

When the present invention is used to collect PCMD data, the acquisition interval is set to 1 minute, so each hourly file is cut into 60 acquisitions, each collecting only the increment, and the 7 OMP servers are collected in parallel. This achieves timely and complete collection and loading of massive PCMD data. Compared with the traditional method, the advantages are as follows. First, it overcomes the long data delay of the traditional method: the data delay is reduced from 1 hour to 1 minute, improving the timeliness of data acquisition. Second, it helps balance server load and avoids the stability risk of collecting and processing one large data file in a single concentrated period, improving the stability of PCMD data acquisition and processing.
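
The figures in this embodiment follow from simple arithmetic, shown here as a short Python sketch. The function name `slices_per_cycle` is hypothetical; the cycle length, interval, and number of servers are those stated above.

```python
def slices_per_cycle(cycle_seconds, interval_seconds):
    """Number of acquisitions needed to cover one data generation cycle."""
    if interval_seconds <= 0 or cycle_seconds % interval_seconds != 0:
        raise ValueError("interval must be positive and divide the cycle evenly")
    return cycle_seconds // interval_seconds


per_file = slices_per_cycle(3600, 60)   # 1-hour cycle, 1-minute interval: 60
per_hour_all_sources = 7 * per_file     # 7 OMP servers in parallel: 420
```

With 60 acquisitions per hourly file, each acquisition handles on average one sixtieth of the file, which is what spreads the load that the traditional method concentrates at the end of the cycle.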

Claims

1. A real-time collection method for multi-source self-growing massive data files, comprising a process of collecting self-growing data files from one or more server data sources, characterized in that: in said process, multithreading is used to collect the self-growing data files on the one or more server data sources in parallel, and file slicing and resumable file transfer are used to slice the self-growing data files by time, so that only the incremental portion of each self-growing data file is collected each time.

2. The real-time collection method for multi-source self-growing massive data files as claimed in claim 1, characterized in that: the process of collecting self-growing data files from one or more server data sources comprises the following steps:

3. The real-time collection method for multi-source self-growing massive data files as claimed in claim 2, characterized in that:

The data source in step 1 has the following characteristics:

N data sources correspond to N servers,

The data is stored as files on the N servers respectively,

One data file is generated per cycle,

4. The real-time collection method for multi-source self-growing massive data files as claimed in claim 2, characterized in that: the granularity of the file slices is defined by the acquisition interval set in step 2.

5. The real-time collection method for multi-source self-growing massive data files as claimed in claim 2, characterized in that: the file slicing in step 3 cuts the file into successive slices at equal time intervals, and the byte position recorded at each cut serves as the start byte position for collecting the next slice; the resumable file transfer collects, at each cutting time point, the data slice from the byte position recorded at the previous acquisition to the maximum byte position of the file at the current moment.

6. The real-time collection method for multi-source self-growing massive data files as claimed in claim 2, characterized in that:

The naming rule in step 6 is as follows:

The data file name contains uniquely identifying elements and takes the form YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data time cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the acquisition time feature; the data time cycle feature and the data category feature are taken from the data source.