CN100407203C - Method for processing mass data - Google Patents

Method for processing mass data Download PDF

Info

Publication number
CN100407203C
CN100407203C · CN2005101214668A · CN200510121466A
Authority
CN
China
Prior art keywords
data
source
source data
queue
mass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2005101214668A
Other languages
Chinese (zh)
Other versions
CN1858735A (en)
Inventor
Chen Hui (陈辉)
Zhou Xiang (周翔)
Wang Jian (王健)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zuo Qiyi
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN2005101214668A priority Critical patent/CN100407203C/en
Publication of CN1858735A publication Critical patent/CN1858735A/en
Application granted
Publication of CN100407203C publication Critical patent/CN100407203C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Abstract

The present invention belongs to the field of data processing and provides a method for processing mass data. The method comprises the following steps: A, receiving source data for which processing is requested, partitioning the source data into blocks, and storing the blocks in a source data queue; B, processing the data blocks in the source data queue to convert the source data into target data; C, storing the target data in a database. By optimizing the application processing logic and the database storage operations during mass data processing, the present invention increases the throughput of the whole application system for mass data, and is simple to implement and low in cost.

Description

Method for processing mass data
Technical field
The invention belongs to the field of data processing, and in particular relates to a method for storing and retrieving mass data.
Background technology
In technical fields such as communications, meteorology, and finance, there is a need to analyze and search mass data. Because the data volume itself is huge, processing efficiency and space overhead are unavoidable technical bottlenecks when storing, retrieving, and analyzing such data.
In the prior art, one approach is to design an efficient parallel algorithm and use a high-performance parallel processing mechanism, such as a mainframe system or a distributed computing system, to store, retrieve, and analyze the data; the performance bottleneck of mass data processing is overcome mainly by high-performance hardware combined with a good parallel distributed algorithm. Although this approach is fairly simple to implement and its technical risk is relatively low, large-scale parallel processors and distributed computing systems are very expensive to acquire and complex to maintain, and are out of reach for users with low-end equipment. In addition, unpredictable growth in the data processing load forces hardware upgrades, which continually increases the cost of the system.
A second approach is to subjectively determine a standard before processing the mass data, and then nondimensionalize and normalize the data according to this standard. Through preprocessing, the data can be typed and serialized, and the raw data can even be transformed by a specific algorithm, so as to reduce the scale and complexity of data processing. This approach has a high level of abstraction and offers good flexibility and processing power, but the results usually contain many subjective factors, and the originality of the data is altered before computation. In addition, the computation involved is relatively complex and usually difficult for those skilled in the art to master. At the same time, abstracting most mass data objects is difficult and places very high demands on data modeling.
A third approach is to use data warehouse and data mining technology, which to some extent encompasses the second approach: a subject-oriented, integrated, stable data collection that changes over time. This approach can solve the problem of mass data processing well, but it is difficult to popularize, because the IT infrastructure of the deployment environment is often imperfect, data warehouses are expensive, management mechanisms and skilled technicians are lacking, data accumulation is insufficient, and data mining tools themselves are immature.
In summary, when processing mass data, the prior art generally suffers from shortcomings such as high implementation cost and complex implementation, and has difficulty meeting users' practical needs.
Summary of the invention
The object of the present invention is to provide a data processing method that solves the problems of high implementation cost and complex implementation in mass data processing.
The object of the invention is achieved as follows: a method for processing mass data, the method comprising the steps of:
A. receiving source data for which processing is requested, partitioning the source data into blocks, and storing the blocks in a source data queue;
B. processing the data blocks in the source data queue to convert the source data into target data;
C. storing the target data in a target data queue;
D. storing the data in the target data queue in a database.
Step A further comprises:
extracting raw information of the source data and partitioning the source data into blocks according to the raw information, the raw information comprising additional information such as the size, time, or frame header of the source data packets.
Step A further comprises:
when the source data queue reaches its high-water mark, caching the overflowing data, and when the source data queue has free space, replenishing the source data queue with the cached data.
The step of caching the overflowing data further comprises:
placing the overflowing data in a buffer set up in memory, and writing the data to an external storage medium for caching when the buffer is full.
Step B further comprises:
polling and processing the source data queue in parallel by a plurality of threads, each thread taking a source data block as its unit of work and converting the source data blocks obtained by polling into target data.
Step B further comprises:
in the process of converting source data into target data, compressing and packing multiple source data items into one target data item.
Step D further comprises:
setting up data tables that store the target data, and an index table that indexes the data tables.
The method further comprises:
E. periodically configuring and maintaining the database.
Step E further comprises:
periodically merging and deleting data tables according to their time, and periodically cleaning the database logs.
Step D further comprises:
setting the number of records displayed per page, and when a user queries data, outputting the query records to the user page by page according to the configured number of records per page and the total number of records matched by the query.
By optimizing the application processing logic and the database storage operations during mass data processing, the present invention increases the throughput of the whole application system for mass data, and is simple to implement and low in cost.
Description of drawings
Fig. 1 is a flow chart of the data processing method provided by the present invention;
Fig. 2 is a schematic diagram of the compression and packing ratio of source data to target data in the present invention.
Embodiments
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to illustrate the present invention and not to limit it.
The present invention improves the efficiency of mass data processing by batch-processing the source data for which processing is requested, including queue buffering and parallel logic processing, and at the same time improves the retrieval and query efficiency of mass data by optimizing the storage structure of the database with measures such as table partitioning and paged output.
Fig. 1 shows the implementation flow of the data processing method provided by the present invention, detailed as follows:
1. Partition the source data into blocks and put them into the source data queue
In a real mass-data environment, source data always arrives in continuous batches. Because the hardware resources of the whole system are limited, it is impractical to start a new thread to process every batch of data that arrives. At the same time, to avoid losing data and degrading system efficiency, blocked data must be managed. Therefore, in the present invention, the source data for which processing is requested is not analyzed immediately upon arrival. Instead, the raw information of the source data is extracted, for example additional information such as the time, size, and frame header of the source data packets; the source data is grouped into blocks according to this raw information; and the source data blocks are then put into the source data queue, so as to improve the efficiency of the subsequent application processing logic.
Buffering the source data in the source data queue has two benefits. On the one hand, the requested source data can be organized into blocks, with each thread performing logic processing one block at a time; by tuning the block size and the thread count, the utilization of CPU and memory can be maximized. On the other hand, the queue's management of the request data, including blocking control and cache control, effectively prevents data loss, and the queue's own well-defined enqueue and dequeue control also improves system efficiency to a great extent. Especially for applications with strict sequential-processing requirements on the request data, a queue is a perfectly fitting pre-storage container.
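As a minimal illustrative sketch (not part of the patent text), this blocking step can be modeled with a bounded queue; the block size, queue capacity, and all names here are assumptions chosen for the example:

```python
from queue import Queue

BLOCK_SIZE = 4                   # records per source data block (illustrative)
source_queue = Queue(maxsize=8)  # maxsize acts as the queue's high-water mark

def enqueue_source(records):
    """Group incoming records into fixed-size blocks and enqueue each block."""
    for i in range(0, len(records), BLOCK_SIZE):
        source_queue.put(records[i:i + BLOCK_SIZE])  # blocks at the high-water mark

enqueue_source(list(range(10)))  # 10 records -> 3 blocks: 4 + 4 + 2
```

Each dequeued item is then one unit of work for a processing thread, which is what lets the block size and thread count be tuned against each other.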
2. Caching and replenishing overflow data
As a preferred embodiment of the present invention, if the source data queue has already reached its high-water mark (the queue is full) when source data arrives, then in order to prevent data loss, the overflowing data is cached on an external storage medium such as a disk or tape; when the source data queue has free space, the data is loaded back from the external storage medium into the source data queue for processing.
When reading and writing the overflow cache on the external medium, in order to improve the efficiency of the external storage medium, the data overflowing from the source data queue is not written directly to the external medium. It is first written to a buffer set up in memory, whose size is configurable; for example, to achieve optimal read-write efficiency, it is roughly the same size as a source data block. Only when the buffer is full is the source data written to the external storage medium. Correspondingly, when the source data queue has free space, the source data on the external storage medium is written back into memory block by block.
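A hedged sketch of this overflow path, assuming a JSON-lines spill file as the "external storage medium" and a small buffer threshold; all sizes and names are illustrative, not prescribed by the patent:

```python
import json
import os
import tempfile
from queue import Queue, Full

source_queue = Queue(maxsize=2)  # deliberately small high-water mark
BUFFER_SIZE = 3                  # flush to disk once this many blocks overflow
overflow_buffer = []             # in-memory buffer for overflowing blocks
spill_path = os.path.join(tempfile.mkdtemp(), "overflow.jsonl")

def offer_block(block):
    """Try to enqueue a block; on overflow, buffer it and spill to disk when full."""
    try:
        source_queue.put_nowait(block)
    except Full:
        overflow_buffer.append(block)
        if len(overflow_buffer) >= BUFFER_SIZE:
            with open(spill_path, "a") as f:      # the external storage medium
                for b in overflow_buffer:
                    f.write(json.dumps(b) + "\n")
            overflow_buffer.clear()

def replenish():
    """When the queue has free space, load spilled blocks back block by block."""
    if not os.path.exists(spill_path):
        return
    with open(spill_path) as f:
        blocks = [json.loads(line) for line in f]
    os.remove(spill_path)
    for b in blocks:
        offer_block(b)  # anything still overflowing is simply re-buffered

for i in range(5):
    offer_block([i])    # queue fills with [0], [1]; [2], [3], [4] spill to disk
source_queue.get()
source_queue.get()      # a consumer drains the queue
replenish()             # spilled blocks flow back into the queue
```

Writing the overflow through the memory buffer first is what keeps the external medium doing few, large writes instead of many small ones.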
3. Process the source data queue with parallel threads and convert the source data into target data
In the present invention, each thread is an independent instance of the data processing application logic, and each thread performs logic processing one source data block at a time. Each thread polls the source data queue in real time, takes one source data block from the queue, processes it, and converts it into target data.
The number of threads can be flexibly and reasonably configured together with the size of the source data queue and the actual resources of the whole system, so as to bring out the system's maximum efficiency. In general, the number of threads is inversely proportional to the source data block size: the larger the block, the fewer threads are needed; conversely, the smaller the block, the more threads are needed. At the same time, the relation between the thread count and the block size depends on the hardware environment; it is a trade-off between space and efficiency. If the hardware executes efficiently (the CPU is fast), more threads can be opened to save memory space; if CPU performance is low, space must be traded for efficiency and fewer threads opened.
When a thread polls the source data queue, if it polls no data, it does nothing. If the data it polls does not fill a whole block, the thread waits up to a timeout; if the timeout expires and a full block still cannot be gathered, the thread processes whatever data it has acquired.
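The polling workers described above can be sketched as follows; the conversion function, the poll timeout, and the decision to stop on an empty poll (a real worker would keep looping) are all assumptions made for a self-contained example:

```python
import threading
from queue import Queue, Empty

source_queue = Queue()
target_queue = Queue()
POLL_TIMEOUT = 0.1  # seconds a thread waits for a block before giving up

def convert(block):
    """Stand-in for the application logic that turns a source block into target data."""
    return [x * 10 for x in block]

def worker():
    while True:
        try:
            block = source_queue.get(timeout=POLL_TIMEOUT)
        except Empty:
            break  # nothing polled within the timeout: stop this worker
        target_queue.put(convert(block))
        source_queue.task_done()

for b in ([1, 2], [3, 4], [5]):
    source_queue.put(b)

threads = [threading.Thread(target=worker) for _ in range(3)]  # parallel polling
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because `queue.Queue` is thread-safe, the three workers can poll concurrently without any extra locking, which is the property the scheme relies on.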
4. Put the target data into the target data queue
As an embodiment of the present invention, in order to reduce the number of peripheral input/output (I/O) operations and the system overhead, the target data is first buffered in the target data queue before being stored in the database.
The target data queue does not have a one-to-one relationship with the source data queue. In the present invention, during the conversion of source data blocks into target data, the source data is compressed and assembled at a certain ratio: several source data items are packed into one target data item after logic processing, for example by converting the source data into batched SQL INSERT statements. This reduces the overhead of subsequently storing the target data, so that a single disk access can store more data, and to some extent it also relieves contention for access resources such as database connections.
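A small sketch of this packing step, using sqlite3 and `executemany` to stand in for batched INSERTs; the pack ratio, table schema, and function names are illustrative assumptions:

```python
import sqlite3

PACK_RATIO = 3  # n source records packed into one target item (illustrative)

def pack(source_records):
    """Pack every PACK_RATIO processed source records into one batched target item."""
    targets = []
    for i in range(0, len(source_records), PACK_RATIO):
        targets.append(source_records[i:i + PACK_RATIO])  # one target = one batch
    return targets

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val INTEGER)")

source = [(v,) for v in range(7)]
for batch in pack(source):
    # one database access stores a whole batch of rows
    conn.executemany("INSERT INTO t (val) VALUES (?)", batch)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

Seven source records become three target items, so the database is touched three times instead of seven, which is the contention relief the text describes.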
The target data queue likewise has high- and low-water marks. Its high-water mark determines, to some extent, the instantaneous data carrying capacity of the whole system; that capacity can be managed by assembling and compressing the source data into target data and by setting the water levels of the target data queue reasonably.
Fig. 2 shows a schematic of the compression and packing ratio of source data to target data. The value of n depends on the actual situation; it is generally appropriate for it to equal the size of one source data block (that is, the amount of source data processed by one thread). The high-water mark of the target data queue should be determined from the actual number of logic processing threads; 3 to 5 times the thread count is generally suitable. Setting it too high wastes memory space, while setting it too low easily blocks the data path inside the system.
5. Store the target data in the database
As an embodiment of the present invention, the target data is stored in the database using table partitioning. First, a storage cap is determined for a database table; once the cap is reached, a new table is created for further storage. Correspondingly, because the data is spread across multiple data tables, an index table is set up to index all the data tables so that the data table containing the desired data can be found more quickly. The index table can be built on various parameters. For example, using a time field, a time-based index table can be built over the data tables. This table design strictly constrains what each data table may hold: with the time-based index above, the time field of every record in a data table must fall between the start and end times of the corresponding index table record, and data from other time periods is not allowed to be stored in that table. Through the index table, a database query can be narrowed down to one or a few data tables rather than the whole set of data tables, thereby improving query speed.
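The index-table lookup can be sketched in a few lines; the quarterly table names and ISO date strings are assumptions for the example (ISO dates compare correctly as strings):

```python
# Index table: one row per data table, with the time range it is allowed to hold.
index_table = [
    ("data_2005_q1", "2005-01-01", "2005-03-31"),
    ("data_2005_q2", "2005-04-01", "2005-06-30"),
    ("data_2005_q3", "2005-07-01", "2005-09-30"),
]

def tables_for_range(start, end):
    """Use the index table to narrow a query to the data tables it can touch."""
    return [name for name, lo, hi in index_table if lo <= end and start <= hi]

hit = tables_for_range("2005-02-15", "2005-05-01")  # overlaps q1 and q2 only
```

A query spanning February to May only ever opens two of the three tables, which is exactly the narrowing the index table exists to provide.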
At the same time, for the fields in the data tables, multi-level database indexes can be created according to the actual query and retrieval scenarios; for key fields that are frequently queried, this can significantly improve retrieval efficiency within certain limits.
In actual mass data retrieval, the volume of data returned by a single query is often large and is not suitable for being shown to the user all at once. As a preferred embodiment of the present invention, the number of records displayed per page is configured first. After the database receives a query request, the scale of the query (the total number of records) is first computed in the database; then, from the number of records matched and the configured number of records per page, the number of pages the query will be divided into can be calculated. Only one page of data from the query is shown to the user at a time, while the start and end positions of the query are remembered for use in subsequent page-up and page-down queries. This paging method reduces the data scale of a single query, lowers the performance requirements on the system, and provides flexible query means; the system does not need to incur extra overhead to cache the full query result.
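The paging arithmetic above amounts to a ceiling division plus an offset computation; a sketch, with the page size and function name as illustrative assumptions:

```python
import math

PAGE_SIZE = 20  # configured number of records displayed per page

def page_plan(total_records, page_number):
    """Compute the page count and the record start/end positions for one page."""
    pages = math.ceil(total_records / PAGE_SIZE)
    start = (page_number - 1) * PAGE_SIZE        # remembered start position
    end = min(start + PAGE_SIZE, total_records)  # remembered end position
    return pages, start, end

pages, start, end = page_plan(total_records=105, page_number=6)
```

A query matching 105 records splits into 6 pages, the last one holding only the 5 remaining records.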
6. Periodically configure and maintain the database
Because the database stores a huge volume of data, the present invention maintains the database periodically in order to improve its storage efficiency. For example, data tables are periodically merged and deleted: some older data tables can be merged to reduce the table-access overhead of data queries, while data tables that have exceeded their retention period are deleted. In addition, the database logs can be cleaned periodically to free database space and keep the database running smoothly.
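A hedged sketch of this merge-and-delete maintenance pass using sqlite3; the retention period, fixed "today", monthly table names, and merge-into-one-table policy are all assumptions for illustration:

```python
import sqlite3
from datetime import date

RETENTION_DAYS = 365
today = date(2006, 12, 30)  # "now" for the illustration

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_2005_12 (val INTEGER);
CREATE TABLE data_2006_01 (val INTEGER);
CREATE TABLE data_2006_02 (val INTEGER);
INSERT INTO data_2006_01 VALUES (1);
INSERT INTO data_2006_02 VALUES (2);
""")

def maintain(tables):
    """Drop tables past retention; merge the remaining old tables into one."""
    kept = []
    for name, created in tables:
        if (today - created).days > RETENTION_DAYS:
            conn.execute(f"DROP TABLE {name}")  # expired: delete
        else:
            kept.append(name)
    conn.execute("CREATE TABLE data_merged (val INTEGER)")
    for name in kept:  # merging cuts the number of tables a query must touch
        conn.execute(f"INSERT INTO data_merged SELECT val FROM {name}")
        conn.execute(f"DROP TABLE {name}")

maintain([("data_2005_12", date(2005, 12, 1)),
          ("data_2006_01", date(2006, 1, 1)),
          ("data_2006_02", date(2006, 2, 1))])
```

After the pass, the expired December 2005 table is gone and the two surviving monthly tables have collapsed into one, so subsequent queries open a single table.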
In summary, the present invention combines several optimizations of program modeling, data modeling, and database usage into a solution for mass data storage and retrieval with low cost and low complexity. It can serve as a substitute for other expensive, highly complex solutions such as data warehouses, is particularly suitable for handling mass data in small and medium-sized enterprise applications, and can serve as a design reference model for mass data storage and retrieval.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A method for processing mass data, characterized in that the method comprises the steps of:
A. receiving source data for which processing is requested, partitioning the source data into blocks, and storing the blocks in a source data queue;
B. processing the data blocks in the source data queue to convert the source data into target data;
C. storing the target data in a target data queue;
D. storing the data in the target data queue in a database.
2. The method for processing mass data according to claim 1, characterized in that step A further comprises:
extracting raw information of the source data and partitioning the source data into blocks according to the raw information, the raw information comprising additional information such as the size, time, or frame header of the source data packets.
3. The method for processing mass data according to claim 1, characterized in that step A further comprises:
when the source data queue reaches its high-water mark, caching the overflowing data, and when the source data queue has free space, replenishing the source data queue with the cached data.
4. The method for processing mass data according to claim 3, characterized in that the step of caching the overflowing data further comprises:
placing the overflowing data in a buffer set up in memory, and writing the data to an external storage medium for caching when the buffer is full.
5. The method for processing mass data according to claim 1, characterized in that step B further comprises:
polling and processing the source data queue in parallel by a plurality of threads, each thread taking a source data block as its unit of work and converting the source data blocks obtained by polling into target data.
6. The method for processing mass data according to claim 1, characterized in that step B further comprises:
in the process of converting source data into target data, compressing and packing multiple source data items into one target data item.
7. The method for processing mass data according to claim 1, characterized in that step D further comprises:
setting up data tables that store the target data, and an index table that indexes the data tables.
8. The method for processing mass data according to claim 7, characterized in that the method further comprises:
E. periodically configuring and maintaining the database.
9. The method for processing mass data according to claim 8, characterized in that step E further comprises:
periodically merging and deleting data tables according to their time, and periodically cleaning the database logs.
10. The method for processing mass data according to claim 1, characterized in that step D further comprises:
setting the number of records displayed per page, and when a user queries data, outputting the query records to the user page by page according to the configured number of records per page and the total number of records matched by the query.
CN2005101214668A 2005-12-30 2005-12-30 Method for processing mass data Expired - Fee Related CN100407203C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2005101214668A CN100407203C (en) 2005-12-30 2005-12-30 Method for processing mass data


Publications (2)

Publication Number Publication Date
CN1858735A CN1858735A (en) 2006-11-08
CN100407203C true CN100407203C (en) 2008-07-30

Family

ID=37297644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2005101214668A Expired - Fee Related CN100407203C (en) 2005-12-30 2005-12-30 Method for processing mass data

Country Status (1)

Country Link
CN (1) CN100407203C (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251861B (en) * 2008-03-18 2010-04-21 北京锐安科技有限公司 Method for loading and inquiring magnanimity data
CN101996067B (en) * 2009-08-19 2014-07-16 阿里巴巴集团控股有限公司 Data export method and device
CN102023978B (en) * 2009-09-15 2015-04-15 腾讯科技(深圳)有限公司 Mass data processing method and system
CN102117308B (en) * 2010-01-06 2012-11-14 卓望数码技术(深圳)有限公司 Data processing method and data processing system
CN101799832A (en) * 2010-03-31 2010-08-11 山东高效能服务器和存储研究院 Method for rapidly writing mass data into database once
CN102479211B (en) * 2010-11-26 2014-06-25 中国银联股份有限公司 Mass data processing system and method on basis of database
CN103034705A (en) * 2012-12-07 2013-04-10 中联重科股份有限公司 Multi-fault-data display device and method of engineering mechanical equipment
CN103106270B (en) * 2013-02-02 2016-06-29 深圳先进技术研究院 cloud data fusion method and system
CN103970792A (en) * 2013-02-04 2014-08-06 中国银联股份有限公司 Index-based file comparison method and device
CN104252503B (en) * 2013-06-29 2017-08-04 北京新媒传信科技有限公司 A kind of method and apparatus for the index for storing dynamic message
CN103345514B (en) * 2013-07-09 2016-06-08 焦点科技股份有限公司 Streaming data processing method under big data environment
CN104572862A (en) * 2014-12-19 2015-04-29 阳珍秀 Mass data storage access method and system
CN104598563B (en) * 2015-01-08 2018-09-04 北京京东尚科信息技术有限公司 High concurrent date storage method and device
CN106293634A (en) * 2015-05-13 2017-01-04 阿里巴巴集团控股有限公司 The method and system that data process
CN105302909B (en) * 2015-11-06 2019-03-26 西安交大捷普网络科技有限公司 Network security log system big data search method based on subregion calculations of offset
CN105550523A (en) * 2015-12-28 2016-05-04 上海携程商务有限公司 Data processing method
CN106201917B (en) * 2016-07-08 2019-03-15 苏州华元世纪科技发展有限公司 A kind of data processing system and method
CN106227739B (en) * 2016-07-13 2020-01-21 克拉玛依红有软件有限责任公司 Method for realizing data request based on multiple tasks
CN106257874A (en) * 2016-08-01 2016-12-28 上海海事大学 Big data quantity remote collection method and system in a kind of unstable network
CN106528306A (en) * 2016-11-08 2017-03-22 天津海量信息技术股份有限公司 Message queue storage method
CN107784085B (en) * 2017-09-30 2021-04-13 平安科技(深圳)有限公司 Data list exporting method and terminal thereof
CN110020359B (en) * 2017-11-08 2024-04-05 亿阳信通股份有限公司 Data processing method, device and storage medium applied to front end of webpage
CN109597795B (en) * 2018-12-06 2020-10-16 南京天辰礼达电子科技有限公司 High-efficiency processing system for roadbed compaction construction data
CN110990163A (en) * 2019-10-29 2020-04-10 北京左江科技股份有限公司 High-concurrency method for multi-application data processing process
CN113448933B (en) * 2021-06-29 2024-04-02 平安证券股份有限公司 Service data processing method, device, electronic equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5742818A (en) * 1995-12-15 1998-04-21 Microsoft Corporation Method and system of converting data from a source file system to a target file system
US20020049776A1 (en) * 2000-02-11 2002-04-25 Aronoff Eyal M. System and method for reconciling transactions between a replication system and a recovered database
US20020194357A1 (en) * 2001-06-15 2002-12-19 International Business Machines Corporation Method for allowing simple interoperation between backend database systems
US20040213104A1 (en) * 2002-12-24 2004-10-28 Yu-Sheng Chou Duplicator for recording medium and method for duplicating recording medium
US20050021567A1 (en) * 2003-06-30 2005-01-27 Holenstein Paul J. Method for ensuring referential integrity in multi-threaded replication engines
CN1704936A (en) * 2004-05-31 2005-12-07 深圳宇龙计算机通信科技(深圳)有限公司 Data conversion or synchronization method between data bases


Also Published As

Publication number Publication date
CN1858735A (en) 2006-11-08

Similar Documents

Publication Publication Date Title
CN100407203C (en) Method for processing mass data
CN202058147U (en) Distribution type real-time database management system
US11741053B2 (en) Data management system, method, terminal and medium based on hybrid storage
EP3254210B1 (en) Big data statistics at data-block level
CN110166282B (en) Resource allocation method, device, computer equipment and storage medium
EP2973018B1 (en) A method to accelerate queries using dynamically generated alternate data formats in flash cache
CN100521623C (en) High-performance Syslog processing and storage method
US8521923B2 (en) Storage-side storage request management
CN111124679A (en) Time-limited automatic processing method for multi-source heterogeneous mass data
CN107038162A (en) Real time data querying method and system based on database journal
CN102420814A (en) Data access method and device, and server
CN106502875A (en) A kind of daily record generation method and system based on cloud computing
CN101373445A (en) Method and apparatus for scheduling memory
CN106156047A (en) A kind of SNAPSHOT INFO processing method and processing device
CN102779138A (en) Hard disk access method of real time data
CN112632025A (en) Power grid enterprise management decision support application system based on PAAS platform
Otoo et al. Disk cache replacement algorithm for storage resource managers in data grids
CN106201918B (en) A kind of method and system based on big data quantity and extensive caching quick release
CN109446167A (en) A kind of storage of daily record data, extracting method and device
US8700822B2 (en) Parallel aggregation system
CN105847329B (en) Management equipment and method based on stock data server
Otoo et al. Accurate modeling of cache replacement policies in a data grid
CN101158976A (en) Method and system for data-base recording enquire preprocess
CN116089414A (en) Time sequence database writing performance optimization method and device based on mass data scene
CN109669811B (en) Data processing method and terminal capable of reliable access

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170919

Address after: The walls of the city of Zhangjiakou in 075000 in Hebei province Yangyuan County Township Village 336 walls

Patentee after: Zuo Qiyi

Address before: Huawei headquarters office building, Bantian, Longgang District, Shenzhen, Guangdong, 518129

Patentee before: Huawei Technologies Co., Ltd.

TR01 Transfer of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080730

Termination date: 20171230

CF01 Termination of patent right due to non-payment of annual fee