CN100407203C - Method for processing mass data - Google Patents

Method for processing mass data Download PDF

Info

Publication number
CN100407203C
CN100407203C · CN2005101214668A · CN200510121466A
Authority
CN
China
Prior art keywords
data
source
source data
queue
mass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2005101214668A
Other languages
Chinese (zh)
Other versions
CN1858735A (en)
Inventor
Chen Hui (陈辉)
Zhou Xiang (周翔)
Wang Jian (王健)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zuo Qiyi
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN2005101214668A priority Critical patent/CN100407203C/en
Publication of CN1858735A publication Critical patent/CN1858735A/en
Application granted
Publication of CN100407203C publication Critical patent/CN100407203C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Abstract

The present invention belongs to the field of data processing and provides a method for processing mass data. The method comprises the following steps: A, receiving source data for which processing is requested, partitioning the source data into blocks, and storing the blocks in a source data queue; B, processing the data blocks in the source data queue to convert the source data into target data; C, storing the target data in a database. By optimizing the application processing logic and the database storage operations during mass data processing, the present invention increases the throughput of the whole application system for mass data, and is simple to implement and low in cost.

Description

Method for processing mass data
Technical field
The invention belongs to the field of data processing, and in particular relates to a method for storing and retrieving mass data.
Background technology
In technical fields such as communications, meteorology, and finance, there is a need to analyze and search mass data. Because the data volume itself is huge, processing efficiency and space overhead are unavoidable technical bottlenecks when storing, retrieving, and analyzing such data.
In the prior art, one approach is to design an efficient parallel algorithm and use a high-performance parallel processing mechanism, such as a mainframe system or a distributed computing system, to store, retrieve, and analyze the data; the performance bottleneck of mass data processing is overcome mainly by high-performance hardware combined with a good parallel distributed algorithm. Although this approach is fairly simple to implement and its technical risk is relatively low, large-scale parallel processors and distributed computing systems are very expensive to acquire and complex to maintain, and are out of reach for users with low-end equipment. In addition, unpredictable growth in the data processing load forces hardware upgrades, which continually increases the cost of the system.
A second approach is to subjectively determine a standard before processing the mass data, and then nondimensionalize and normalize the data according to this standard. Through preprocessing, the data can be typed and serialized, and the raw data can even be transformed by a specific algorithm, so as to reduce the scale and complexity of data processing. This approach has a high level of abstraction and offers good flexibility and processing power, but the results usually contain many subjective factors, and the originality of the data is altered before computation. In addition, the computation involved is relatively complex and usually difficult for those skilled in the art to master. At the same time, abstracting most mass data objects is difficult and places very high demands on data modeling.
A third approach is to use data warehouse and data mining technology, which to some extent encompasses the second approach: a subject-oriented, integrated, stable data collection that changes over time. This approach can solve the problem of mass data processing well, but it is difficult to popularize, because the IT infrastructure of the deployment environment is often imperfect, data warehouses are expensive, management mechanisms and skilled technicians are lacking, data accumulation is insufficient, and data mining tools themselves are immature.
In summary, when processing mass data, the prior art generally suffers from shortcomings such as high implementation cost and complex implementation, and has difficulty meeting users' practical needs.
Summary of the invention
The object of the present invention is to provide a data processing method that solves the problems of high implementation cost and complex implementation in mass data processing.
The object of the invention is achieved as follows: a method for processing mass data, the method comprising the steps of:
A. receiving source data for which processing is requested, partitioning the source data into blocks, and storing the blocks in a source data queue;
B. processing the data blocks in the source data queue to convert the source data into target data;
C. storing the target data in a target data queue;
D. storing the data in the target data queue in a database.
Step A further comprises:
extracting raw information of the source data and partitioning the source data into blocks according to the raw information, the raw information comprising additional information such as the size, time, or frame header of the source data packets.
Step A further comprises:
when the source data queue reaches its high-water mark, caching the overflowing data, and when the source data queue has free space, replenishing the source data queue with the cached data.
The step of caching the overflowing data further comprises:
placing the overflowing data in a buffer set up in memory, and writing the data to an external storage medium for caching when the buffer is full.
Step B further comprises:
polling and processing the source data queue in parallel by a plurality of threads, each thread taking a source data block as its unit of work and converting the source data blocks obtained by polling into target data.
Step B further comprises:
in the process of converting source data into target data, compressing and packing multiple source data items into one target data item.
Step D further comprises:
setting up data tables that store the target data, and an index table that indexes the data tables.
The method further comprises:
E. periodically configuring and maintaining the database.
Step E further comprises:
periodically merging and deleting data tables according to their time, and periodically cleaning the database logs.
Step D further comprises:
setting the number of records displayed per page, and when a user queries data, outputting the query records to the user page by page according to the configured number of records per page and the total number of records matched by the query.
By optimizing the application processing logic and the database storage operations during mass data processing, the present invention increases the throughput of the whole application system for mass data, and is simple to implement and low in cost.
Description of drawings
Fig. 1 is a flow chart of the data processing method provided by the present invention;
Fig. 2 is a schematic diagram of the compression and packing ratio of source data to target data in the present invention.
Embodiments
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to illustrate the present invention and not to limit it.
The present invention improves the efficiency of mass data processing by batch-processing the source data for which processing is requested, including queue buffering and parallel logic processing, and at the same time improves the retrieval and query efficiency of mass data by optimizing the storage structure of the database with measures such as table partitioning and paged output.
Fig. 1 shows the implementation flow of the data processing method provided by the present invention, detailed as follows:
1. Partition the source data into blocks and put them into the source data queue
In a real mass-data environment, source data always arrives in continuous batches. Because the hardware resources of the whole system are limited, it is impractical to start a new thread to process every batch of data that arrives. At the same time, to avoid losing data and degrading system efficiency, blocked data must be managed. Therefore, in the present invention, the source data for which processing is requested is not analyzed immediately upon arrival. Instead, the raw information of the source data is extracted, for example additional information such as the time, size, and frame header of the source data packets; the source data is grouped into blocks according to this raw information; and the source data blocks are then put into the source data queue, so as to improve the efficiency of the subsequent application processing logic.
Buffering the source data in the source data queue has two benefits. On the one hand, the requested source data can be organized into blocks, with each thread performing logic processing one block at a time; by tuning the block size and the thread count, the utilization of CPU and memory can be maximized. On the other hand, the queue's management of the request data, including blocking control and cache control, effectively prevents data loss, and the queue's own well-defined enqueue and dequeue control also improves system efficiency to a great extent. Especially for applications with strict sequential-processing requirements on the request data, a queue is a perfectly fitting pre-storage container.
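As a minimal illustrative sketch (not part of the patent text), this blocking step can be modeled with a bounded queue; the block size, queue capacity, and all names here are assumptions chosen for the example:

```python
from queue import Queue

BLOCK_SIZE = 4                   # records per source data block (illustrative)
source_queue = Queue(maxsize=8)  # maxsize acts as the queue's high-water mark

def enqueue_source(records):
    """Group incoming records into fixed-size blocks and enqueue each block."""
    for i in range(0, len(records), BLOCK_SIZE):
        source_queue.put(records[i:i + BLOCK_SIZE])  # blocks at the high-water mark

enqueue_source(list(range(10)))  # 10 records -> 3 blocks: 4 + 4 + 2
```

Each dequeued item is then one unit of work for a processing thread, which is what lets the block size and thread count be tuned against each other.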
2. Caching and replenishing overflow data
As a preferred embodiment of the present invention, if the source data queue has already reached its high-water mark (the queue is full) when source data arrives, then in order to prevent data loss, the overflowing data is cached on an external storage medium such as a disk or tape; when the source data queue has free space, the data is loaded back from the external storage medium into the source data queue for processing.
When reading and writing the overflow cache on the external medium, in order to improve the efficiency of the external storage medium, the data overflowing from the source data queue is not written directly to the external medium. It is first written to a buffer set up in memory, whose size is configurable; for example, to achieve optimal read-write efficiency, it is roughly the same size as a source data block. Only when the buffer is full is the source data written to the external storage medium. Correspondingly, when the source data queue has free space, the source data on the external storage medium is written back into memory block by block.
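A hedged sketch of this overflow path, assuming a JSON-lines spill file as the "external storage medium" and a small buffer threshold; all sizes and names are illustrative, not prescribed by the patent:

```python
import json
import os
import tempfile
from queue import Queue, Full

source_queue = Queue(maxsize=2)  # deliberately small high-water mark
BUFFER_SIZE = 3                  # flush to disk once this many blocks overflow
overflow_buffer = []             # in-memory buffer for overflowing blocks
spill_path = os.path.join(tempfile.mkdtemp(), "overflow.jsonl")

def offer_block(block):
    """Try to enqueue a block; on overflow, buffer it and spill to disk when full."""
    try:
        source_queue.put_nowait(block)
    except Full:
        overflow_buffer.append(block)
        if len(overflow_buffer) >= BUFFER_SIZE:
            with open(spill_path, "a") as f:      # the external storage medium
                for b in overflow_buffer:
                    f.write(json.dumps(b) + "\n")
            overflow_buffer.clear()

def replenish():
    """When the queue has free space, load spilled blocks back block by block."""
    if not os.path.exists(spill_path):
        return
    with open(spill_path) as f:
        blocks = [json.loads(line) for line in f]
    os.remove(spill_path)
    for b in blocks:
        offer_block(b)  # anything still overflowing is simply re-buffered

for i in range(5):
    offer_block([i])    # queue fills with [0], [1]; [2], [3], [4] spill to disk
source_queue.get()
source_queue.get()      # a consumer drains the queue
replenish()             # spilled blocks flow back into the queue
```

Writing the overflow through the memory buffer first is what keeps the external medium doing few, large writes instead of many small ones.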
3. Process the source data queue with parallel threads and convert the source data into target data
In the present invention, each thread is an independent instance of the data processing application logic, and each thread performs logic processing one source data block at a time. Each thread polls the source data queue in real time, takes one source data block from the queue, processes it, and converts it into target data.
The number of threads can be flexibly and reasonably configured together with the size of the source data queue and the actual resources of the whole system, so as to bring out the system's maximum efficiency. In general, the number of threads is inversely proportional to the source data block size: the larger the block, the fewer threads are needed; conversely, the smaller the block, the more threads are needed. At the same time, the relation between the thread count and the block size depends on the hardware environment; it is a trade-off between space and efficiency. If the hardware executes efficiently (the CPU is fast), more threads can be opened to save memory space; if CPU performance is low, space must be traded for efficiency and fewer threads opened.
When a thread polls the source data queue, if it polls no data, it does nothing. If the data it polls does not fill a whole block, the thread waits up to a timeout; if the timeout expires and a full block still cannot be gathered, the thread processes whatever data it has acquired.
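The polling workers described above can be sketched as follows; the conversion function, the poll timeout, and the decision to stop on an empty poll (a real worker would keep looping) are all assumptions made for a self-contained example:

```python
import threading
from queue import Queue, Empty

source_queue = Queue()
target_queue = Queue()
POLL_TIMEOUT = 0.1  # seconds a thread waits for a block before giving up

def convert(block):
    """Stand-in for the application logic that turns a source block into target data."""
    return [x * 10 for x in block]

def worker():
    while True:
        try:
            block = source_queue.get(timeout=POLL_TIMEOUT)
        except Empty:
            break  # nothing polled within the timeout: stop this worker
        target_queue.put(convert(block))
        source_queue.task_done()

for b in ([1, 2], [3, 4], [5]):
    source_queue.put(b)

threads = [threading.Thread(target=worker) for _ in range(3)]  # parallel polling
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because `queue.Queue` is thread-safe, the three workers can poll concurrently without any extra locking, which is the property the scheme relies on.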
4. Put the target data into the target data queue
As an embodiment of the present invention, in order to reduce the number of peripheral input/output (I/O) operations and the system overhead, the target data is first buffered in the target data queue before being stored in the database.
The target data queue does not have a one-to-one relationship with the source data queue. In the present invention, during the conversion of source data blocks into target data, the source data is compressed and assembled at a certain ratio: several source data items are packed into one target data item after logic processing, for example by converting the source data into batched SQL INSERT statements. This reduces the overhead of subsequently storing the target data, so that a single disk access can store more data, and to some extent it also relieves contention for access resources such as database connections.
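A small sketch of this packing step, using sqlite3 and `executemany` to stand in for batched INSERTs; the pack ratio, table schema, and function names are illustrative assumptions:

```python
import sqlite3

PACK_RATIO = 3  # n source records packed into one target item (illustrative)

def pack(source_records):
    """Pack every PACK_RATIO processed source records into one batched target item."""
    targets = []
    for i in range(0, len(source_records), PACK_RATIO):
        targets.append(source_records[i:i + PACK_RATIO])  # one target = one batch
    return targets

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val INTEGER)")

source = [(v,) for v in range(7)]
for batch in pack(source):
    # one database access stores a whole batch of rows
    conn.executemany("INSERT INTO t (val) VALUES (?)", batch)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

Seven source records become three target items, so the database is touched three times instead of seven, which is the contention relief the text describes.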
The target data queue likewise has high- and low-water marks. Its high-water mark determines, to some extent, the instantaneous data carrying capacity of the whole system; that capacity can be managed by assembling and compressing the source data into target data and by setting the water levels of the target data queue reasonably.
Fig. 2 shows a schematic of the compression and packing ratio of source data to target data. The value of n depends on the actual situation; it is generally appropriate for it to equal the size of one source data block (that is, the amount of source data processed by one thread). The high-water mark of the target data queue should be determined from the actual number of logic processing threads; 3 to 5 times the thread count is generally suitable. Setting it too high wastes memory space, while setting it too low easily blocks the data path inside the system.
5. Store the target data in the database
As an embodiment of the present invention, the target data is stored in the database using table partitioning. First, a storage cap is determined for a database table; once the cap is reached, a new table is created for further storage. Correspondingly, because the data is spread across multiple data tables, an index table is set up to index all the data tables so that the data table containing the desired data can be found more quickly. The index table can be built on various parameters. For example, using a time field, a time-based index table can be built over the data tables. This table design strictly constrains what each data table may hold: with the time-based index above, the time field of every record in a data table must fall between the start and end times of the corresponding index table record, and data from other time periods is not allowed to be stored in that table. Through the index table, a database query can be narrowed down to one or a few data tables rather than the whole set of data tables, thereby improving query speed.
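The index-table lookup can be sketched in a few lines; the quarterly table names and ISO date strings are assumptions for the example (ISO dates compare correctly as strings):

```python
# Index table: one row per data table, with the time range it is allowed to hold.
index_table = [
    ("data_2005_q1", "2005-01-01", "2005-03-31"),
    ("data_2005_q2", "2005-04-01", "2005-06-30"),
    ("data_2005_q3", "2005-07-01", "2005-09-30"),
]

def tables_for_range(start, end):
    """Use the index table to narrow a query to the data tables it can touch."""
    return [name for name, lo, hi in index_table if lo <= end and start <= hi]

hit = tables_for_range("2005-02-15", "2005-05-01")  # overlaps q1 and q2 only
```

A query spanning February to May only ever opens two of the three tables, which is exactly the narrowing the index table exists to provide.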
At the same time, for the fields in the data tables, multi-level database indexes can be created according to the actual query and retrieval scenarios; for key fields that are frequently queried, this can significantly improve retrieval efficiency within certain limits.
In actual mass data retrieval, the volume of data returned by a single query is often large and is not suitable for being shown to the user all at once. As a preferred embodiment of the present invention, the number of records displayed per page is configured first. After the database receives a query request, the scale of the query (the total number of records) is first computed in the database; then, from the number of records matched and the configured number of records per page, the number of pages the query will be divided into can be calculated. Only one page of data from the query is shown to the user at a time, while the start and end positions of the query are remembered for use in subsequent page-up and page-down queries. This paging method reduces the data scale of a single query, lowers the performance requirements on the system, and provides flexible query means; the system does not need to incur extra overhead to cache the full query result.
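The paging arithmetic above amounts to a ceiling division plus an offset computation; a sketch, with the page size and function name as illustrative assumptions:

```python
import math

PAGE_SIZE = 20  # configured number of records displayed per page

def page_plan(total_records, page_number):
    """Compute the page count and the record start/end positions for one page."""
    pages = math.ceil(total_records / PAGE_SIZE)
    start = (page_number - 1) * PAGE_SIZE        # remembered start position
    end = min(start + PAGE_SIZE, total_records)  # remembered end position
    return pages, start, end

pages, start, end = page_plan(total_records=105, page_number=6)
```

A query matching 105 records splits into 6 pages, the last one holding only the 5 remaining records.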
6. Periodically configure and maintain the database
Because the database stores a huge volume of data, the present invention maintains the database periodically in order to improve its storage efficiency. For example, data tables are periodically merged and deleted: some older data tables can be merged to reduce the table-access overhead of data queries, while data tables that have exceeded their retention period are deleted. In addition, the database logs can be cleaned periodically to free database space and keep the database running smoothly.
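A hedged sketch of this merge-and-delete maintenance pass using sqlite3; the retention period, fixed "today", monthly table names, and merge-into-one-table policy are all assumptions for illustration:

```python
import sqlite3
from datetime import date

RETENTION_DAYS = 365
today = date(2006, 12, 30)  # "now" for the illustration

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_2005_12 (val INTEGER);
CREATE TABLE data_2006_01 (val INTEGER);
CREATE TABLE data_2006_02 (val INTEGER);
INSERT INTO data_2006_01 VALUES (1);
INSERT INTO data_2006_02 VALUES (2);
""")

def maintain(tables):
    """Drop tables past retention; merge the remaining old tables into one."""
    kept = []
    for name, created in tables:
        if (today - created).days > RETENTION_DAYS:
            conn.execute(f"DROP TABLE {name}")  # expired: delete
        else:
            kept.append(name)
    conn.execute("CREATE TABLE data_merged (val INTEGER)")
    for name in kept:  # merging cuts the number of tables a query must touch
        conn.execute(f"INSERT INTO data_merged SELECT val FROM {name}")
        conn.execute(f"DROP TABLE {name}")

maintain([("data_2005_12", date(2005, 12, 1)),
          ("data_2006_01", date(2006, 1, 1)),
          ("data_2006_02", date(2006, 2, 1))])
```

After the pass, the expired December 2005 table is gone and the two surviving monthly tables have collapsed into one, so subsequent queries open a single table.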
In summary, the present invention combines several optimizations of program modeling, data modeling, and database usage into a solution for mass data storage and retrieval with low cost and low complexity. It can serve as a substitute for other expensive, highly complex solutions such as data warehouses, is particularly suitable for handling mass data in small and medium-sized enterprise applications, and can serve as a design reference model for mass data storage and retrieval.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A method for processing mass data, characterized in that the method comprises the steps of:
A. receiving source data for which processing is requested, partitioning the source data into blocks, and storing the blocks in a source data queue;
B. processing the data blocks in the source data queue to convert the source data into target data;
C. storing the target data in a target data queue;
D. storing the data in the target data queue in a database.
2. The method for processing mass data according to claim 1, characterized in that step A further comprises:
extracting raw information of the source data and partitioning the source data into blocks according to the raw information, the raw information comprising additional information such as the size, time, or frame header of the source data packets.
3. The method for processing mass data according to claim 1, characterized in that step A further comprises:
when the source data queue reaches its high-water mark, caching the overflowing data, and when the source data queue has free space, replenishing the source data queue with the cached data.
4. The method for processing mass data according to claim 3, characterized in that the step of caching the overflowing data further comprises:
placing the overflowing data in a buffer set up in memory, and writing the data to an external storage medium for caching when the buffer is full.
5. The method for processing mass data according to claim 1, characterized in that step B further comprises:
polling and processing the source data queue in parallel by a plurality of threads, each thread taking a source data block as its unit of work and converting the source data blocks obtained by polling into target data.
6. The method for processing mass data according to claim 1, characterized in that step B further comprises:
in the process of converting source data into target data, compressing and packing multiple source data items into one target data item.
7. The method for processing mass data according to claim 1, characterized in that step D further comprises:
setting up data tables that store the target data, and an index table that indexes the data tables.
8. The method for processing mass data according to claim 7, characterized in that the method further comprises:
E. periodically configuring and maintaining the database.
9. The method for processing mass data according to claim 8, characterized in that step E further comprises:
periodically merging and deleting data tables according to their time, and periodically cleaning the database logs.
10. The method for processing mass data according to claim 1, characterized in that step D further comprises:
setting the number of records displayed per page, and when a user queries data, outputting the query records to the user page by page according to the configured number of records per page and the total number of records matched by the query.
CN2005101214668A 2005-12-30 2005-12-30 Method for processing mass data Expired - Fee Related CN100407203C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2005101214668A CN100407203C (en) 2005-12-30 2005-12-30 Method for processing mass data


Publications (2)

Publication Number Publication Date
CN1858735A CN1858735A (en) 2006-11-08
CN100407203C true CN100407203C (en) 2008-07-30

Family

ID=37297644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2005101214668A Expired - Fee Related CN100407203C (en) 2005-12-30 2005-12-30 Method for processing mass data

Country Status (1)

Country Link
CN (1) CN100407203C (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251861B (en) * 2008-03-18 2010-04-21 北京锐安科技有限公司 Method for loading and inquiring magnanimity data
CN101996067B (en) * 2009-08-19 2014-07-16 阿里巴巴集团控股有限公司 Data export method and device
CN102023978B (en) * 2009-09-15 2015-04-15 腾讯科技(深圳)有限公司 Mass data processing method and system
CN102117308B (en) * 2010-01-06 2012-11-14 卓望数码技术(深圳)有限公司 Data processing method and data processing system
CN101799832A (en) * 2010-03-31 2010-08-11 山东高效能服务器和存储研究院 Method for rapidly writing mass data into database once
CN102479211B (en) * 2010-11-26 2014-06-25 中国银联股份有限公司 Mass data processing system and method on basis of database
CN103034705A (en) * 2012-12-07 2013-04-10 中联重科股份有限公司 Multi-fault-data display device and method of engineering mechanical equipment
CN103106270B (en) * 2013-02-02 2016-06-29 深圳先进技术研究院 cloud data fusion method and system
CN103970792A (en) * 2013-02-04 2014-08-06 中国银联股份有限公司 Index-based file comparison method and device
CN104252503B (en) * 2013-06-29 2017-08-04 北京新媒传信科技有限公司 A kind of method and apparatus for the index for storing dynamic message
CN103345514B (en) * 2013-07-09 2016-06-08 焦点科技股份有限公司 Streaming data processing method under big data environment
CN104572862A (en) * 2014-12-19 2015-04-29 阳珍秀 Mass data storage access method and system
CN104598563B (en) * 2015-01-08 2018-09-04 北京京东尚科信息技术有限公司 High concurrent date storage method and device
CN106293634A (en) * 2015-05-13 2017-01-04 阿里巴巴集团控股有限公司 The method and system that data process
CN105302909B (en) * 2015-11-06 2019-03-26 西安交大捷普网络科技有限公司 Network security log system big data search method based on subregion calculations of offset
CN105550523A (en) * 2015-12-28 2016-05-04 上海携程商务有限公司 Data processing method
CN106201917B (en) * 2016-07-08 2019-03-15 苏州华元世纪科技发展有限公司 A kind of data processing system and method
CN106227739B (en) * 2016-07-13 2020-01-21 克拉玛依红有软件有限责任公司 Method for realizing data request based on multiple tasks
CN106257874A (en) * 2016-08-01 2016-12-28 上海海事大学 Big data quantity remote collection method and system in a kind of unstable network
CN106528306A (en) * 2016-11-08 2017-03-22 天津海量信息技术股份有限公司 Message queue storage method
CN107784085B (en) * 2017-09-30 2021-04-13 平安科技(深圳)有限公司 Data list exporting method and terminal thereof
CN110020359B (en) * 2017-11-08 2024-04-05 亿阳信通股份有限公司 Data processing method, device and storage medium applied to front end of webpage
CN109597795B (en) * 2018-12-06 2020-10-16 南京天辰礼达电子科技有限公司 High-efficiency processing system for roadbed compaction construction data
CN110990163A (en) * 2019-10-29 2020-04-10 北京左江科技股份有限公司 High-concurrency method for multi-application data processing process
CN113448933B (en) * 2021-06-29 2024-04-02 平安证券股份有限公司 Service data processing method, device, electronic equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5742818A (en) * 1995-12-15 1998-04-21 Microsoft Corporation Method and system of converting data from a source file system to a target file system
US20020049776A1 (en) * 2000-02-11 2002-04-25 Aronoff Eyal M. System and method for reconciling transactions between a replication system and a recovered database
US20020194357A1 (en) * 2001-06-15 2002-12-19 International Business Machines Corporation Method for allowing simple interoperation between backend database systems
US20040213104A1 (en) * 2002-12-24 2004-10-28 Yu-Sheng Chou Duplicator for recording medium and method for duplicating recording medium
US20050021567A1 (en) * 2003-06-30 2005-01-27 Holenstein Paul J. Method for ensuring referential integrity in multi-threaded replication engines
CN1704936A (en) * 2004-05-31 2005-12-07 深圳宇龙计算机通信科技(深圳)有限公司 Data conversion or synchronization method between data bases


Also Published As

Publication number Publication date
CN1858735A (en) 2006-11-08

Similar Documents

Publication Publication Date Title
CN100407203C (en) Method for processing mass data
CN202058147U (en) Distribution type real-time database management system
US11741053B2 (en) Data management system, method, terminal and medium based on hybrid storage
EP3254210B1 (en) Big data statistics at data-block level
CN110166282B (en) Resource allocation method, device, computer equipment and storage medium
EP2973018B1 (en) A method to accelerate queries using dynamically generated alternate data formats in flash cache
CN100521623C (en) High-performance Syslog processing and storage method
US8521923B2 (en) Storage-side storage request management
CN111124679A (en) Time-limited automatic processing method for multi-source heterogeneous mass data
CN107038162A (en) Real time data querying method and system based on database journal
CN102420814A (en) Data access method and device, and server
CN106502875A (en) A kind of daily record generation method and system based on cloud computing
CN101373445A (en) Method and apparatus for scheduling memory
CN106156047A (en) A kind of SNAPSHOT INFO processing method and processing device
CN102779138A (en) Hard disk access method of real time data
CN112632025A (en) Power grid enterprise management decision support application system based on PAAS platform
Otoo et al. Disk cache replacement algorithm for storage resource managers in data grids
CN106201918B (en) A kind of method and system based on big data quantity and extensive caching quick release
CN109446167A (en) A kind of storage of daily record data, extracting method and device
US8700822B2 (en) Parallel aggregation system
CN105847329B (en) Management equipment and method based on stock data server
Otoo et al. Accurate modeling of cache replacement policies in a data grid
CN101158976A (en) Method and system for data-base recording enquire preprocess
CN116089414A (en) Time sequence database writing performance optimization method and device based on mass data scene
CN109669811B (en) Data processing method and terminal capable of reliable access

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170919

Address after: The walls of the city of Zhangjiakou in 075000 in Hebei province Yangyuan County Township Village 336 walls

Patentee after: Zuo Qiyi

Address before: Huawei headquarters office building, Bantian, Longgang District, Shenzhen, Guangdong, 518129

Patentee before: Huawei Technologies Co., Ltd.

TR01 Transfer of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080730

Termination date: 20171230

CF01 Termination of patent right due to non-payment of annual fee