CN103092840B - Real-time collection method for multi-source self-growing massive data files - Google Patents

Real-time collection method for multi-source self-growing massive data files

Info

Publication number
CN103092840B
Authority
CN
China
Prior art keywords
data
file
source
time
data file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110334851.6A
Other languages
Chinese (zh)
Other versions
CN103092840A (en)
Inventor
王志海
麦菁
辛炜博
徐卸土
王智博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Posts & Telecommunication Designing Consulting Institute Co Ltd
Original Assignee
Shanghai Posts & Telecommunication Designing Consulting Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2011-10-28
Filing date
2011-10-28
Publication date
2015-09-16
Application filed by Shanghai Posts & Telecommunication Designing Consulting Institute Co Ltd
Priority to CN201110334851.6A
Publication of CN103092840A
Application granted
Publication of CN103092840B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A real-time collection method for multi-source self-growing massive data files. Multiple threads collect the self-growing data files on the data sources in parallel, and file slicing and file resume (resumable transfer) techniques are used to slice each self-growing data file by time, so that each collection fetches only the incremental portion of the file. A collection interval is set according to the data generation cycle, the estimated data file size, and business requirements. At that interval, the current-cycle data file on each server data source is checked by periodic polling; the incremental data is collected using file slicing and file resume, stored locally as small data files, and the byte size of the file at the current moment is recorded as the start position for the next poll. Because only the incremental portion is collected each time, the invention achieves real-time collection of multi-source self-growing massive data files and solves the technical problems of the prior art: long delay and poor real-time performance when collecting remote data, and adverse effects on server load and stability.

Description

Real-time collection method for multi-source self-growing massive data files
Technical field:
The present invention relates to the field of physics, in particular to mass data collection technology in computer application systems, and specifically to a real-time collection method for multi-source self-growing massive data files.
Background art:
Telecommunication services involve very large volumes of data. In a large-scale telecom application system, multiple data sources typically provide massive, self-growing data files in real time, and the application system must collect tens to hundreds of GB of data every day, such as PCMD and ROP data. This data is stored as files on multiple server data sources. Each data source generally generates one file per fixed time cycle, for example one data file per hour or one per day. The file grows in real time within its cycle; when the next cycle begins, the corresponding new data file is created automatically and grows in real time in turn. Ensuring that massive data files are collected accurately and completely, and delivered to the application system as early as possible, has become a technical challenge.
In the prior art, a data file is collected and loaded into the database only after it has been fully written and has stopped growing. This has two drawbacks. First, data latency is long and real-time performance is poor. The data file of the previous cycle cannot be collected until the next cycle begins, so data from the early part of the previous cycle may be delayed by a full cycle before it is collected; collection itself also takes a considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. When a massive amount of data is collected and loaded in one pass, server processing is concentrated in one long period; if an exception occurs during loading, the cost of rollback is very high, and the speed of client queries against the server can be seriously affected.
Summary of the invention:
The object of the present invention is to provide a real-time collection method for multi-source self-growing massive data files that solves the technical problems of the prior art: long delay and poor real-time performance when collecting remote data, and adverse effects on server load and stability.
The real-time collection method for multi-source self-growing massive data files of the present invention comprises a process of collecting self-growing data files from one or more server data sources. In this process, multiple threads collect the self-growing data files on the one or more server data sources in parallel, and file slicing and file resume techniques are used to slice each self-growing data file by time, so that each collection fetches the incremental portion of the self-growing data file.
Further, the process of collecting self-growing data files from the one or more server data sources comprises the following steps:
Step 1, determine the data generation cycle, the naming rule, and the collection mode, and estimate the size of each data file.
Step 2, set the collection interval according to the data generation cycle, the estimated data file size, and business requirements.
Step 3, at the set collection interval, check the current-cycle data file on the server data source by periodic polling; collect the incremental data using the file slicing and file resume techniques; store it locally as small data files named according to the naming rule set in step 1; and record the byte size of the file at the current moment of this collection as the start position for the next poll. The first poll collects the data from byte position 0 to the byte position of the data file at the moment of the first poll.
Step 4, collect the data from the byte position recorded by the previous poll to the byte position of the data file at the moment of the current poll, and repeat this read in a loop until the next-cycle data file is generated.
Step 5, at the moment the next-cycle data file mentioned in step 4 is generated, perform the final poll collection.
Step 6, store the collected files as small data files in the designated directory according to the set naming rule, and load them directly into the database or back them up to a server.
Step 7, for N server data sources, use multiple threads to collect in parallel according to steps 3 to 6.
Step 8, for multiple data categories, use multithreading or multi-process techniques to collect in parallel according to steps 1 to 7.
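The incremental read at the heart of steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the growing file is reachable as a local path (the patent collects from remote servers), and the function name collect_increment is hypothetical.

```python
import os
import tempfile


def collect_increment(path, start):
    """Read the bytes appended to `path` since byte offset `start`.

    Returns (data, new_offset); new_offset is the start position for the next poll.
    """
    size = os.path.getsize(path)      # file size at this polling moment
    if size <= start:                 # nothing new since the last poll
        return b"", start
    with open(path, "rb") as f:
        f.seek(start)                 # resume from the recorded position
        data = f.read(size - start)   # read only up to the size observed above
    return data, start + len(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "source.dat")   # hypothetical growing file
        offset = 0                                  # the first poll starts at byte 0
        for poll, chunk in enumerate([b"aaa\n", b"bb\n", b""]):
            with open(source, "ab") as f:           # simulate the producer appending
                f.write(chunk)
            data, offset = collect_increment(source, offset)
            print(poll, len(data), offset)
```

Bounding the read by the size observed at the polling moment gives each slice a well-defined end position even if the producer keeps appending while the read is in progress; whatever arrives later is picked up by the next poll.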
Further, the data source in step 1 has the following characteristics:
There are N data sources, corresponding to N servers;
The data is stored as files on the N servers respectively;
The data generates one data file per cycle;
The data file is written to and grows in real time within the cycle, until the next-cycle data file is created;
The data file name follows a unique identification rule and is named in the format YYYYMMDDHHMMSS.XXXX, where YYYYMMDDHHMMSS is the time-cycle feature and XXXX is the data category feature.
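As an illustration of the source naming format, the sketch below splits a name of the form YYYYMMDDHHMMSS.XXXX into its two features. The function and type names are hypothetical, and the sample name in the usage note is made up for the example.

```python
from datetime import datetime
from typing import NamedTuple


class SourceFileName(NamedTuple):
    cycle_start: datetime   # time-cycle feature (YYYYMMDDHHMMSS)
    category: str           # data category feature (XXXX)


def parse_source_name(name):
    """Split 'YYYYMMDDHHMMSS.XXXX' into its cycle timestamp and category."""
    stamp, sep, category = name.partition(".")
    if not sep or not category:
        raise ValueError(f"missing category suffix: {name!r}")
    if len(stamp) != 14 or not stamp.isdigit():
        raise ValueError(f"bad cycle stamp: {name!r}")
    # strptime also rejects impossible dates such as month 13
    return SourceFileName(datetime.strptime(stamp, "%Y%m%d%H%M%S"), category)
```

For example, parse_source_name("20100407110000.PCMD") yields a cycle start of 2010-04-07 11:00:00 and the category "PCMD".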
Further, the granularity of the file slices is determined by the collection interval set in step 2.
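A rough way to see how the interval fixes the slice granularity is to compute the number of slices per cycle and the average slice size. The helper below is a sketch; the name slice_plan and the 600 MiB file size in the usage note are assumptions for illustration, not figures from the patent.

```python
def slice_plan(cycle_seconds, interval_seconds, estimated_file_bytes):
    """Return (slices_per_cycle, average_bytes_per_slice) for one data file."""
    if interval_seconds <= 0 or cycle_seconds < interval_seconds:
        raise ValueError("interval must be positive and no longer than the cycle")
    slices = -(-cycle_seconds // interval_seconds)   # ceiling division
    return slices, estimated_file_bytes / slices
```

With a one-hour cycle, a one-minute interval, and an assumed 600 MiB file, slice_plan(3600, 60, 600 * 1024**2) gives 60 slices of about 10 MiB each. A shorter interval lowers latency but produces more, smaller files.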
Further, the file slicing technique in step 3 cuts the file into successive file slices at equal time intervals, and each time records the byte position of the cut as the start byte position for collecting the next slice. The file resume technique, at each time point at which the file is cut, collects the data slice from the byte position recorded by the previous collection to the maximum byte position of the file at the current moment.
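The relationship between cut positions and slices can be sketched as a pure function: given the file sizes observed at successive polls, it derives the half-open byte ranges of the slices. The name slice_ranges is hypothetical, and raising an error on a shrinking file is an assumption of this sketch.

```python
def slice_ranges(sizes_at_polls):
    """Turn the file sizes seen at successive polls into [start, end) byte ranges."""
    ranges = []
    start = 0                       # the first slice starts at byte 0
    for size in sizes_at_polls:
        if size < start:
            raise ValueError("file shrank; a self-growing file should only grow")
        if size > start:            # skip polls where nothing was appended
            ranges.append((start, size))
        start = size                # the cut position starts the next slice
    return ranges
```

Because each slice starts exactly where the previous one ended, the slices neither overlap nor leave gaps, so concatenating them in order reproduces the file as it stood at the last poll.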
Further, the naming rule in step 6 is as follows: the data file name contains the elements that uniquely identify it and is named in the format YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the collection time feature. The data cycle feature and the data category feature are taken from the data source.
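A small sketch of how a local slice name could be assembled from these four features. The function name and the source identifier "OMP1" in the usage note are hypothetical; the text does not specify what values the ECPN field takes.

```python
from datetime import datetime


def slice_file_name(cycle_start, source_id, collected_at, category):
    """Build 'YYYYMMDDHHMMSS' + source + '_HHMMSS' + '.' + category."""
    return (
        f"{cycle_start:%Y%m%d%H%M%S}"   # data cycle feature, from the source file
        f"{source_id}"                  # data source feature (ECPN in the text)
        f"_{collected_at:%H%M%S}"       # collection time feature
        f".{category}"                  # data category feature, from the source file
    )
```

For instance, a slice of the 11:00 cycle collected from a source labelled "OMP1" at 11:05:00 would be named 20100407110000OMP1_110500.PCMD. Including the source and the collection time keeps slices from different servers and different polls from colliding in one directory.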
Compared with the prior art, the present invention has positive and evident effects. By using multiple threads, the invention collects the data files on multiple data servers simultaneously; by using file slicing and file resume techniques, it slices each data file by time and collects only the incremental portion each time. This achieves real-time collection of multi-source self-growing massive data files and solves the technical problems of the prior art: long delay and poor real-time performance when collecting remote data, and adverse effects on server load and stability.
Brief description of the drawings:
Fig. 1 is a schematic diagram of the real-time collection method for multi-source self-growing massive data files of the present invention.
Detailed description:
Embodiment 1:
As shown in Fig. 1, the real-time collection method for multi-source self-growing massive data files of the present invention comprises a process of collecting self-growing data files from one or more server data sources. In this process, multiple threads collect the self-growing data files on the one or more server data sources in parallel, and file slicing and file resume techniques are used to slice each self-growing data file by time, so that each collection fetches the incremental portion of the self-growing data file.
Further, the process of collecting self-growing data files from the one or more server data sources comprises the following steps:
Step 1, determine the data generation cycle, the naming rule, and the collection mode, and estimate the size of each data file. The main characteristics of the data source are:
A. There are N data sources, corresponding to N servers;
B. The data is stored as files on the N servers respectively;
C. The data generates one data file per cycle (T), for example per hour or per day;
D. The data file is written to and grows in real time within the cycle, until the next-cycle data file is created;
E. The data file name follows a unique identification rule and is named in the format YYYYMMDDHHMMSS.XXXX. For example, in 10040711.PCMD, 10040711 is the time-cycle feature and PCMD is the data category feature.
Step 2, set the collection interval according to the data generation cycle, the estimated data file size, and business requirements.
Step 3, at the set collection interval, check the current-cycle data file on the server data source by periodic polling; collect the incremental data using the file slicing and file resume techniques; store it locally as small data files named according to the naming rule set in step 1; and record the byte size of the file at the current moment of this collection as the start position for the next poll. The first poll collects the data from byte position 0 to the byte position of the data file at the moment of the first poll.
Step 4, collect the data from the byte position recorded by the previous poll to the byte position of the data file at the moment of the current poll, and repeat this read in a loop until the next-cycle data file is generated.
Step 5, at the moment the next-cycle data file mentioned in step 4 is generated, perform the final poll collection. The final poll must take place after the next-cycle data file has been generated, and immediately after its generation, so as to ensure both the completeness of the collection of the previous cycle's data file and the timeliness of the next cycle's data collection.
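The ordering requirement in step 5 can be sketched as follows. The sketch assumes local paths and assumes, as the data source characteristics suggest, that the producer stops appending to the current file once the next-cycle file exists; the function name poll_with_rollover is hypothetical.

```python
import os


def poll_with_rollover(current_path, next_path, offset):
    """One polling step that also handles the end of a cycle.

    Returns (data, new_offset, finished). `finished` is True once the
    next-cycle file exists, meaning this read was the final one for the
    current file.
    """
    # Check for the next-cycle file BEFORE reading, so that everything the
    # producer wrote up to the rollover is included in this final read.
    finished = os.path.exists(next_path)
    with open(current_path, "rb") as f:
        f.seek(offset)               # resume from the recorded position
        data = f.read()              # read to the current end of file
    return data, offset + len(data), finished
```

When finished is True, a collector would switch to the next-cycle file and restart from byte position 0. Checking for the new file after the read instead would risk missing bytes appended between the read and the check.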
Step 6, store the collected files as small data files in the designated directory according to the set naming rule, and load them directly into the database or back them up to a server.
Step 7, for N server data sources, use multiple threads to collect in parallel according to steps 3 to 6.
Step 8, for multiple data categories, use multithreading or multi-process techniques to collect in parallel according to steps 1 to 7.
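The per-source parallelism of step 7 can be sketched with a thread pool, one worker per data source. This is an illustration only: collect_all and the collect_one callback are hypothetical names, and the patent does not prescribe a particular threading API.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_all(sources, collect_one):
    """Run `collect_one(source)` for every source in its own worker thread.

    Returns a dict mapping each source to its result.
    """
    if not sources:
        return {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        results = pool.map(collect_one, sources)   # preserves input order
        return dict(zip(sources, results))
```

Here collect_one would wrap one polling pass (steps 3 to 6) for a single server. Threads suit this workload because each worker spends most of its time waiting on file or network I/O. For step 8, the same pattern could be repeated per data category, or separate processes could be used.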
Further, the granularity of the file slices is determined by the collection interval set in step 2.
Further, the file slicing technique in step 3 cuts the file into successive file slices at equal time intervals, and each time records the byte position of the cut as the start byte position for collecting the next slice. The file resume technique, at each time point at which the file is cut, collects the data slice from the byte position recorded by the previous collection to the maximum byte position of the file at the current moment.
Further, the naming rule in step 6 is as follows: the data file name contains the elements that uniquely identify it and is named in the format YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the collection time feature. The data cycle feature and the data category feature are taken from the data source.
In one embodiment of the present invention, PCMD data is stored as files on 7 OMP server data sources. Each data source generates one data file per hour; the file is written in real time during the current hour, until the data file for the next hour is created and written in real time. The number of records in each hourly file reaches the order of millions.
The traditional method collects a PCMD data file only after it has been fully written, which has two serious drawbacks. First, data latency is long and real-time performance is poor. The data file of the previous cycle cannot be collected until the next cycle begins (the data file for 10:00 to 11:00 cannot be collected and loaded until after 11:00), so data from the early part of the previous cycle may be delayed by a full cycle before it is collected; collection itself also takes a considerable time, which greatly reduces the timeliness of the data. Second, server load is unbalanced and stability is poor. When a massive amount of data is collected and loaded in one pass, server processing is concentrated in one long period; if an exception occurs during loading, the cost of rollback is very high, and the speed of client queries against the server can be seriously affected.
When the present invention is used to collect PCMD data, the collection interval is set to 1 minute, so each hourly file is split into 60 collections, each of which fetches only the increment, and the 7 OMP sources are collected in parallel. This achieves timely and complete collection and loading of the massive PCMD data. Compared with the traditional method, the advantages are as follows. First, it overcomes the long data latency of the traditional method: data latency is reduced from 1 hour to 1 minute, improving the real-time performance of data collection. Second, it helps balance server load and avoids the stability risk of collecting and processing large data files within one concentrated period, improving the stability of PCMD data collection and processing.
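The figures in this embodiment can be checked with a short calculation. The sketch below is only arithmetic under simplifying assumptions: it ignores transfer and loading time, and it assumes every poll on every source yields one slice file. The function name is hypothetical.

```python
def latency_comparison(cycle_seconds, interval_seconds, sources):
    """Worst-case wait before a record is collected, and slice files per cycle."""
    return {
        # a record written at the start of a cycle waits a full cycle
        "batch_wait_s": cycle_seconds,
        # with polling, a record waits at most one collection interval
        "incremental_wait_s": interval_seconds,
        # one slice file per poll per source
        "slice_files_per_cycle": sources * (cycle_seconds // interval_seconds),
    }
```

For the embodiment's values, latency_comparison(3600, 60, 7) gives a worst-case wait of 3600 seconds for the traditional method against 60 seconds for incremental polling, and 420 slice files per hour across the 7 OMP sources.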

Claims (5)

1. A real-time collection method for multi-source self-growing massive data files, comprising a process of collecting self-growing data files from one or more server data sources, characterized in that: in the process of collecting self-growing data files from the one or more server data sources, multiple threads collect the self-growing data files on the one or more server data sources in parallel, and file slicing and file resume techniques are used to slice the self-growing data files by time, each collection fetching the incremental portion of the self-growing data file; the process of collecting self-growing data files from the one or more server data sources comprises the following steps:
Step 1, determine the data generation cycle, the naming rule, and the collection mode, and estimate the size of each data file.
Step 2, set the collection interval according to the data generation cycle, the estimated data file size, and business requirements.
Step 3, at the set collection interval, check the current-cycle data file on the server data source by periodic polling; collect the incremental data using the file slicing and file resume techniques; store it locally as small data files named according to the naming rule set in step 1; and record the byte size of the file at the current moment of this collection as the start position for the next poll. The first poll collects the data from byte position 0 to the byte position of the data file at the moment of the first poll.
Step 4, collect the data from the byte position recorded by the previous poll to the byte position of the data file at the moment of the current poll, and repeat this read in a loop until the next-cycle data file is generated.
Step 5, at the moment the next-cycle data file mentioned in step 4 is generated, perform the final poll collection.
Step 6, store the collected files as small data files in the designated directory according to the set naming rule, and load them directly into the database or back them up to a server.
Step 7, for N server data sources, use multiple threads to collect in parallel according to steps 3 to 6.
Step 8, for multiple data categories, use multithreading or multi-process techniques to collect in parallel according to steps 1 to 7.
2. The real-time collection method for multi-source self-growing massive data files as claimed in claim 1, characterized in that the data source has the following characteristics:
There are N data sources, corresponding to N servers;
The data is stored as files on the N servers respectively;
The data generates one data file per cycle;
The data file is written to and grows in real time within the cycle, until the next-cycle data file is created;
The data file name follows a unique identification rule and is named in the format YYYYMMDDHHMMSS.XXXX, where YYYYMMDDHHMMSS is the time-cycle feature and XXXX is the data category feature.
3. The real-time collection method for multi-source self-growing massive data files as claimed in claim 1, characterized in that the granularity of the file slices is determined by the collection interval set in step 2.
4. The real-time collection method for multi-source self-growing massive data files as claimed in claim 1, characterized in that the file slicing technique in step 3 cuts the file into successive file slices at equal time intervals, and each time records the byte position of the cut as the start byte position for collecting the next slice; the file resume technique, at each time point at which the file is cut, collects the data slice from the byte position recorded by the previous collection to the maximum byte position of the file at the current moment.
5. The real-time collection method for multi-source self-growing massive data files as claimed in claim 1, characterized in that the naming rule in step 6 is as follows:
The data file name contains the elements that uniquely identify it and is named in the format YYYYMMDDHHMMSSECPN_HHMMSS.XXXX, where YYYYMMDDHHMMSS is the data cycle feature, XXXX is the data category feature, ECPN is the data source feature, and _HHMMSS is the collection time feature; the data cycle feature and the data category feature are taken from the data source.
CN201110334851.6A 2011-10-28 2011-10-28 Real-time collection method for multi-source self-growing massive data files Active CN103092840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110334851.6A CN103092840B (en) 2011-10-28 2011-10-28 Multi-source is from increasing massive data files real-time collecting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110334851.6A CN103092840B (en) 2011-10-28 2011-10-28 Multi-source is from increasing massive data files real-time collecting method

Publications (2)

Publication Number Publication Date
CN103092840A (en) 2013-05-08
CN103092840B (en) 2015-09-16

Family

ID=48205423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110334851.6A Active CN103092840B (en) 2011-10-28 2011-10-28 Real-time collection method for multi-source self-growing massive data files

Country Status (1)

Country Link
CN (1) CN103092840B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1756108A (en) * 2004-09-29 2006-04-05 华为技术有限公司 Master/backup system data synchronizing method
CN101719143A (en) * 2009-12-01 2010-06-02 北京中科创元科技有限公司 Method for parallel processing compare increment data extraction
CN102110121A (en) * 2009-12-24 2011-06-29 阿里巴巴集团控股有限公司 Method and system for processing data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7882064B2 (en) * 2006-07-06 2011-02-01 Emc Corporation File system replication

Also Published As

Publication number Publication date
CN103092840A (en) 2013-05-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant