CN102214236A

CN102214236A - Method and system for processing mass data

Info

Publication number: CN102214236A
Application number: CN201110182296XA
Authority: CN
Inventors: 祝博立
Original assignee: Beijing Feinno Communication Technology Co Ltd
Current assignee: Beijing Feinno Communication Technology Co Ltd
Priority date: 2011-06-30
Filing date: 2011-06-30
Publication date: 2011-10-12
Anticipated expiration: 2031-06-30
Also published as: CN102214236B

Abstract

The invention discloses a method for processing mass data. The method comprises the following steps that: a scheduling module judges whether to call a data warehouse operation statement (HQL) according to acquired current service information and a predetermined scheduling strategy, acquires a calling sequence according to the acquired current service information and the predetermined scheduling strategy if the HQL is called, and calls the HQL to a data warehouse platform according to the calling sequence; and the data warehouse platform reads configuration information which corresponds to a data warehouse from a relational database, triggers the HQL to perform operation on data stored in a distributed platform according to the calling sequence, generates result data and stores the result data into the distributed platform. The invention also discloses a system for processing the mass data. By the method and the system provided by the invention, the flexibility of processing of the mass data can be improved.

Description

A mass data processing method and system

Technical field

The present invention relates to data processing technology, and in particular to a mass data processing method and system.

Background technology

With the rapid development of Internet technology, the number of Internet users has increased sharply, and so has the demand for data processing such as the collection, cleaning, statistics, and analysis of Internet user data. At the same time, the volume of Internet user data is growing explosively, which further increases the pressure on such data processing.

At present, mass Internet user data is processed by combining distributed platform (Hadoop) technology with data warehouse platform (Hive) technology. The mass data is stored on the distributed platform, and data warehouse operation statements (HQL) are invoked through console commands to perform statistics, analysis, and other processing on the stored data. Because each command must be issued from the console, this method offers poor flexibility.

Summary of the invention

The present invention provides a mass data processing method that can increase the flexibility of mass data processing.

The present invention also provides a mass data processing system that can increase the flexibility of mass data processing.

To achieve the above objects, the technical solution of the present invention is implemented as follows:

The invention discloses a mass data processing method, comprising:

A scheduler module judges, according to acquired current business information and a preset scheduling strategy, whether to call a data warehouse operation statement; when the judgment is yes, it obtains a calling order according to the acquired current business information and the preset scheduling strategy;

The scheduler module invokes the data warehouse operation statement on a data warehouse platform according to the calling order;

The data warehouse platform reads configuration information corresponding to the data warehouse operation statement from a relational database;

The data warehouse platform triggers, according to the calling order, the data warehouse operation statement to perform computation on data stored on a distributed platform, generates result data, and stores the result data on the distributed platform.

After the result data is generated and stored on the distributed platform, the method further comprises:

The scheduler module controls the distributed platform to import the result data into the relational database;

The scheduler module controls a cache module to extract commonly used result data from the relational database according to a preset presentation strategy;

A data presentation platform reads the commonly used result data from the cache module and presents it.

After the data presentation platform reads the commonly used result data from the cache module and presents it, the method further comprises:

The data presentation platform reads the result data from the relational database and presents it.

Before the scheduler module judges, according to the acquired current business information and the preset scheduling strategy, whether to call the data warehouse operation statement, the method further comprises:

A data access platform transfers data to the distributed platform at least once;

Each time a transfer is completed, the data access platform sends a data-transfer-complete message to a message interface module;

The scheduler module obtains the at least one data-transfer-complete message from the message interface module as the current business information.

The data access platform sending the data-transfer-complete message to the message interface module comprises:

The data access platform sends the data-transfer-complete message to the message interface module using Google's message transmission scheme, the protoBuffer communication mode.

The invention discloses a mass data processing system, comprising:

A scheduler module, configured to judge, according to acquired current business information and a preset scheduling strategy, whether to call a data warehouse operation statement; when the judgment is yes, to obtain a calling order according to the acquired current business information and the preset scheduling strategy; and to invoke the data warehouse operation statement on a data warehouse platform according to the calling order;

The data warehouse platform, configured to read configuration information corresponding to the data warehouse operation statement from a relational database, to trigger, according to the calling order, the data warehouse operation statement to perform computation on data stored on a distributed platform, and to generate result data and store it on the distributed platform;

The relational database, configured to store the configuration information corresponding to the data warehouse operation statement;

The distributed platform, configured to store the data and the result data.

The scheduler module is further configured to control the distributed platform to import the result data into the relational database, and to control a cache module to extract commonly used result data from the relational database according to a preset presentation strategy;

The system further comprises:

The cache module, configured to cache the commonly used result data;

A data presentation platform, configured to read the commonly used result data from the cache module and present it.

The data presentation platform is further configured to read the result data from the relational database and present it.

The system further comprises:

A data access platform, configured to transfer data to the distributed platform at least once and, each time a transfer is completed, to send a data-transfer-complete message to a message interface module;

The message interface module, configured to receive the data-transfer-complete message;

The scheduler module is further configured to obtain the at least one data-transfer-complete message from the message interface module as the current business information.

The data access platform is specifically configured to send the data-transfer-complete message to the message interface module using Google's message transmission scheme, the protoBuffer communication mode.

As can be seen from the above summary, a scheduler module is added to the mass data processing system. According to the current business information and the preset scheduling strategy, this module determines whether to call the data warehouse operation statement and in what calling order, and the data processing is completed under its control. This avoids issuing commands one by one from the console, as in existing mass data processing systems. Because processing is controlled by the scheduler module, the scheduling strategy and calling order can be configured flexibly according to the logic of the business to be implemented, which increases the flexibility of mass data processing.

Description of drawings

Fig. 1 is a flowchart of the mass data processing method of embodiment one of the present invention;

Fig. 2 is a flowchart of the mass data processing method of embodiment two of the present invention;

Fig. 3 is a schematic structural diagram of the mass data processing system of embodiment three of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.

The basic idea of the present invention is to add a scheduler module to the mass data processing system. According to the current business information and the preset scheduling strategy, this module determines whether to call the data warehouse operation statement and in what calling order, and the data processing is completed under the control of the scheduler module.

Fig. 1 is a flowchart of the mass data processing method of embodiment one of the present invention. As shown in Fig. 1, the method comprises at least the following steps.

Step 101: The scheduler module judges, according to the acquired current business information and the preset scheduling strategy, whether to call the data warehouse operation statement; when the judgment is yes, it obtains the calling order according to the acquired current business information and the preset scheduling strategy.

Step 102: The scheduler module invokes the data warehouse operation statement on the data warehouse platform according to the calling order.

Step 103: The data warehouse platform reads the configuration information corresponding to the data warehouse operation statement from the relational database (MySQL).

Step 104: The data warehouse platform triggers, according to the calling order, the data warehouse operation statement to perform computation on the data stored on the distributed platform, generates result data, and stores it on the distributed platform.
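Steps 101 to 104 can be illustrated with a minimal sketch. This is not the patented implementation: the class names `Scheduler` and `FakeWarehouse`, the two policy callables, and the string "results" are all hypothetical stand-ins chosen only to show the control flow (decide, obtain the calling order, read configuration, compute and store).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeWarehouse:
    """Hypothetical stand-in for the data warehouse platform."""
    config: Dict[str, str]                        # plays the role of the relational database
    results: Dict[str, str] = field(default_factory=dict)  # plays the role of the distributed platform

    def execute(self, statement: str) -> str:
        table = self.config[statement]            # step 103: read the statement's configuration
        result = f"result of {statement} on {table}"  # step 104: "compute" (placeholder)
        self.results[statement] = result          # step 104: store the result data
        return result


@dataclass
class Scheduler:
    """Hypothetical scheduler module driven by two preset policies."""
    should_call: Callable[[List[str]], bool]      # scheduling strategy: call HQL or not
    pick_order: Callable[[List[str]], List[str]]  # scheduling strategy: calling order

    def run(self, business_info: List[str], warehouse: FakeWarehouse) -> List[str]:
        if not self.should_call(business_info):   # step 101: judge
            return []
        order = self.pick_order(business_info)    # step 101: obtain the calling order
        return [warehouse.execute(s) for s in order]  # steps 102 to 104
```

The point of the design is that the console operator is replaced by two configurable policies, so changing the business logic means changing the policies rather than retyping commands.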

Fig. 2 is a flowchart of the mass data processing method of embodiment two of the present invention. As shown in Fig. 2, the method comprises the following steps.

Step 201: The data access platform transfers data to the distributed platform at least once.

In this step, in a preferred implementation, the data access platform periodically transfers the data it has received to the distributed platform. The distributed platform supports receiving data from peripheral systems, organizing it, computing on it, and distributing the computation results to functions such as the reporting system. Specifically, the distributed platform is a data storage platform under the open-source Apache foundation, composed of components such as the distributed file system (HDFS) and distributed file processing. Of these, HDFS and distributed file processing are the two most basic and most important components.

HDFS is the open-source counterpart of the Google File System (GFS). It is a highly fault-tolerant distributed file system that provides high-throughput data access and is suited to storing massive numbers of large files, for example PB-scale storage of files larger than 64 MB. A large file is split into N small files that are distributed across different machines, and the number of backups can be configured, so the system continues to work normally when some machines fail.

Distributed file processing is a powerful tool for large-scale computation, for example on TB-scale data, and comprises a distributed data extraction (Map) module and a distributed data processing (Reduce) module. The distributed data extraction module is responsible for scattering the data; the distributed data processing module is responsible for gathering it. A user only needs to implement the two interfaces, distributed data extraction and distributed data processing, to complete computation on TB-scale data. Distributed file processing can be applied to data analysis such as log analysis and data mining, and also to scientific computation such as calculating the constant pi.
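The scatter-then-gather idea behind distributed data extraction (Map) and distributed data processing (Reduce), and the splitting of a large file into fixed-size pieces, can be sketched on a single machine as follows. This is an illustration only and does not use Hadoop; the function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Split one large file into fixed-size pieces, as a file system might before distributing them."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: scatter each input line into (key, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: gather the pairs by key and sum their values."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

For example, `reduce_phase(map_phase(["a b a", "b c"]))` counts each word across both lines; on a real cluster the map and reduce calls would run on different machines over different pieces of the data.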

Step 202: Each time a transfer is completed, the data access platform sends a data-transfer-complete message to the message interface module.

In this step, each time the data access platform finishes transferring data to the distributed platform, it sends a data-transfer-complete message to the message interface module; this message synchronizes the fact that the transfer has completed to the application system of the data platform. In a preferred implementation, the data access platform sends the data-transfer-complete message to the message interface module using one of Google's message transmission schemes, the protoBuffer communication mode.
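A data-transfer-complete message and the message interface module might look like the sketch below. The source does not give the message fields, so `dataset`, `hdfs_path`, and `record_count` are hypothetical. To keep the example runnable with the standard library only, JSON is used as a stand-in for the protoBuffer encoding named in the text, and an in-process queue stands in for the network transport.

```python
import json
import queue
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TransferComplete:
    """Hypothetical message fields; the source does not specify them."""
    dataset: str
    hdfs_path: str
    record_count: int


class MessageInterface:
    """Stand-in for the message interface module, backed by an in-process queue."""

    def __init__(self) -> None:
        self._queue: "queue.Queue[bytes]" = queue.Queue()

    def send(self, message: TransferComplete) -> None:
        # JSON replaces protoBuffer serialization here purely for illustration.
        self._queue.put(json.dumps(asdict(message)).encode("utf-8"))

    def drain(self) -> List[TransferComplete]:
        """Return all pending messages; the scheduler would treat them as current business information."""
        messages = []
        while not self._queue.empty():
            messages.append(TransferComplete(**json.loads(self._queue.get())))
        return messages
```

The data access platform would call `send` once per completed transfer, and the scheduler module would call `drain` to collect the messages.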

Step 203: The scheduler module obtains at least one data-transfer-complete message from the message interface module as the current business information.

In this step, for example, if the data access platform has transferred data to the distributed platform three times, the scheduler module correspondingly obtains three data-transfer-complete messages from the message interface module and uses these three messages as the current business information.

Step 204: The scheduler module judges, according to the acquired current business information and the preset scheduling strategy, whether to call the data warehouse operation statement. When the judgment is yes, step 205 is executed; when the judgment is no, the method returns to step 201.

In this step, the scheduling strategy is preset in the scheduler module. The scheduling strategy specifies the trigger condition for calling the data warehouse operation statement: if the current business information satisfies the condition defined by the scheduling strategy, the scheduler module decides to call the data warehouse operation statement; if it does not, the scheduler module decides not to call it. For example, suppose the data received by the data access platform covers several aspects and is imported into the distributed platform in several batches. The scheduler module correspondingly obtains several data-transfer-complete messages from the message interface module and judges from them whether to call the data warehouse operation statement. Under the scheduling strategy, the statement is not called while only some of the data-transfer-complete messages have been received. Only after the data for all of these aspects has been fully imported into the distributed platform, and all of the data-transfer-complete messages have been received, does the scheduler module decide to start calling the data warehouse operation statement to perform the data computation.
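The example policy described for step 204, calling the data warehouse operation statement only once every expected transfer has completed, can be sketched as follows. The class name and the dataset names are hypothetical; a real scheduling strategy could use any trigger condition.

```python
from typing import Iterable, Set


class AllTransfersPolicy:
    """Hypothetical scheduling strategy: trigger only when every required dataset has arrived."""

    def __init__(self, required: Iterable[str]) -> None:
        self.required = frozenset(required)
        self.seen: Set[str] = set()

    def on_message(self, dataset: str) -> bool:
        """Record one data-transfer-complete message and report whether to call HQL now."""
        self.seen.add(dataset)
        return self.ready()

    def ready(self) -> bool:
        # True only when the set of received datasets covers the required set.
        return self.required <= self.seen
```

With `AllTransfersPolicy(["login", "payment", "profile"])`, the first two messages return `False` and the statement is not called; the third returns `True` and computation can begin.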

Step 205: The scheduler module obtains the calling order according to the acquired current business information and the preset scheduling strategy.

In this step, data computation comprises many steps; some have no logical connection to one another, while others must be executed in a particular order. The data warehouse operation statements are therefore called in a particular calling order to perform the computation. The calling order is preset in the scheduler module. Multiple calling orders can be preset in the scheduler module, and the scheduler module selects the corresponding calling order according to the acquired current business information and the preset scheduling strategy.
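Selecting one of several preset calling orders can be sketched as below. The statement names, the `PRESET_ORDERS` and `DEPENDS_ON` tables, and the dependency check are assumptions added for illustration; the source only states that the orders are preset and that some steps must precede others.

```python
from typing import Dict, List, Set

# Hypothetical preset calling orders, keyed by a business type derived from the business information.
PRESET_ORDERS: Dict[str, List[str]] = {
    "daily": ["clean_logs", "count_users", "count_sessions", "build_report"],
    "hourly": ["clean_logs", "count_users"],
}

# Hypothetical ordering constraints: each statement lists the statements that must run before it.
DEPENDS_ON: Dict[str, Set[str]] = {
    "count_users": {"clean_logs"},
    "count_sessions": {"clean_logs"},
    "build_report": {"count_users", "count_sessions"},
}


def respects_dependencies(order: List[str], depends_on: Dict[str, Set[str]]) -> bool:
    """Check that every statement appears after all statements it depends on."""
    done: Set[str] = set()
    for statement in order:
        if not depends_on.get(statement, set()) <= done:
            return False
        done.add(statement)
    return True


def select_order(business_type: str) -> List[str]:
    """Pick the preset calling order for this business type, rejecting an invalid preset."""
    order = PRESET_ORDERS[business_type]
    if not respects_dependencies(order, DEPENDS_ON):
        raise ValueError(f"preset order for {business_type!r} violates a dependency")
    return order
```

Note that `count_users` and `count_sessions` have no logical connection to each other, so either may come first, whereas `build_report` must follow both.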

Step 206: The scheduler module invokes the data warehouse operation statement on the data warehouse platform according to the calling order.

Step 207: The data warehouse platform reads the configuration information corresponding to the data warehouse operation statement from the relational database.

In this step, the data warehouse platform is a Structured Query Language (SQL) parsing engine: it translates SQL statements into distributed data extraction/distributed data processing jobs, which are then executed on the distributed platform, so as to enable rapid development. A table stored in the data warehouse platform is a directory on the distributed platform. Specifically, by default the data warehouse platform stores tables under the data warehouse directory of the current working directory, with one folder per table, named after the table; if partitioned tables are in use, each partition value is a subfolder. This data can be used directly in other distributed data extraction/distributed data processing jobs. The data warehouse platform can be associated with a relational database. The file or directory on which a data warehouse operation statement operates is mapped to table name information stored in the relational database, and the fields in the file are mapped to field information of the table to be operated on, which is also stored in the relational database; the table name information and field information obtained by this mapping serve as the configuration information of the data warehouse operation statement. When the data warehouse platform receives a command to call a data warehouse operation statement for computation, it parses the received command, reads the configuration information related to the called statement from the relational database, and, according to this configuration information, translates the statement into a distributed data extraction/distributed data processing procedure to perform the statistical computation.
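The mapping from a table to a directory, from a partition to a subfolder, and from file columns to field names can be sketched as follows. The `TableConfig` type, the warehouse path, the `dt=...` partition folder name, and the tab separator are assumptions made for the example; they are not taken from the source.

```python
import posixpath
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class TableConfig:
    """Hypothetical configuration record, as it might be stored in the relational database."""
    table_name: str
    field_names: Tuple[str, ...]


def table_dir(warehouse_dir: str, cfg: TableConfig, partition: Optional[str] = None) -> str:
    """One folder per table; a partition, if any, is a subfolder of the table folder."""
    path = posixpath.join(warehouse_dir, cfg.table_name)
    if partition is not None:
        path = posixpath.join(path, partition)
    return path


def parse_row(line: str, cfg: TableConfig, sep: str = "\t") -> Dict[str, str]:
    """Map the columns of one file line onto the configured field names (separator is assumed)."""
    return dict(zip(cfg.field_names, line.rstrip("\n").split(sep)))
```

With this configuration in hand, a statement that names the table `user_login` can be resolved to a directory of files and a list of fields before it is translated into a distributed job.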

Step 208: The data warehouse platform triggers, according to the calling order, the data warehouse operation statement to perform computation on the data stored on the distributed platform, generates result data, and stores it on the distributed platform.

Step 209: The scheduler module controls the distributed platform to import the result data into the relational database.

In this step, specifically, the scheduler module uses an import algorithm to read from the distributed platform the result data generated by the data warehouse computation; this result data may be stored in the form of a result file. The scheduler module then imports the result data into multiple data tables in the relational database according to business requirements.
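Importing a result file into a relational table can be sketched as below. The source names MySQL; `sqlite3` is used here only so that the example runs with the standard library, and the tab-separated file layout, table name, and column names are assumptions.

```python
import sqlite3
from typing import Iterable, Sequence


def import_results(conn: sqlite3.Connection, table: str,
                   columns: Sequence[str], lines: Iterable[str], sep: str = "\t") -> int:
    """Load delimited result lines into a relational table and return the number of rows inserted.

    Table and column names are interpolated into SQL, so they must come from
    trusted configuration, never from user input.
    """
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_list})")
    # Skip blank lines; split each remaining line into one value per column.
    rows = [line.rstrip("\n").split(sep) for line in lines if line.strip()]
    conn.executemany(f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)
```

Calling this once per target table, with the lines of the corresponding result file, mirrors the import of result data into multiple data tables according to business requirements.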

Step 210: The scheduler module controls the cache module to extract commonly used result data from the relational database according to the preset presentation strategy.

In this step, the presentation strategy is preset in the scheduler module and indicates which data is commonly used by the presentation platform. According to this strategy, the scheduler module pulls into the cache module those result data stored in the relational database that are commonly used by the presentation platform. Specifically, the cache module may use memory cache (memcache) technology, a high-performance distributed in-memory object caching system that stores data of various formats, including images, video, files, and database query results, by maintaining a single large hash table in memory. Because the cache module is distributed, it can be accessed simultaneously by multiple users on different hosts, which not only overcomes the drawback that shared memory is limited to a single machine, but also reduces the load of database retrieval and increases the speed of data access.
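Pulling commonly used result data into the cache can be sketched as follows. Plain dictionaries stand in for both the memcache-style cache and the relational database, and the presentation strategy is reduced to a hypothetical list of keys that the presentation platform uses often.

```python
from typing import Dict, Iterable


def warm_cache(cache: Dict[str, str], db_rows: Dict[str, str], common_keys: Iterable[str]) -> int:
    """Copy the results named by the presentation strategy from the database into the cache.

    Returns the number of entries loaded; keys missing from the database are skipped.
    """
    loaded = 0
    for key in common_keys:
        if key in db_rows:
            cache[key] = db_rows[key]
            loaded += 1
    return loaded
```

Only the keys listed by the strategy are copied, so rarely used results stay in the database and do not occupy cache memory.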

Step 211: The data presentation platform reads the commonly used result data from the cache module and presents it.

In this step, the data presentation platform obtains the data it commonly uses by reading the result data from the cache module, and presents the commonly used result data once obtained. Data that the data presentation platform does not commonly use cannot be read from the cache module, so execution continues with step 212 below.

Step 212: The data presentation platform reads the result data from the relational database and presents it.

In this step, data that the data presentation platform does not commonly use, for example data requiring dynamic transformation or queries, is obtained by reading the result data from the relational database and is presented once obtained.
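The read path of steps 211 and 212, cache first and relational database second, can be sketched as below. Dictionaries again stand in for the cache module and the relational database, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def read_result(key: str, cache: Dict[str, str], db: Dict[str, str]) -> Tuple[str, str]:
    """Return (value, source): read from the cache when possible, otherwise from the database."""
    if key in cache:             # step 211: commonly used data is served from the cache
        return cache[key], "cache"
    return db[key], "database"   # step 212: other data is read from the relational database
```

A key that is in neither store raises `KeyError`, which a presentation layer would need to handle; the source does not describe that case.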

Fig. 3 is a schematic structural diagram of the mass data processing system of embodiment three of the present invention. As shown in Fig. 3, the mass data processing system comprises at least: a scheduler module 31, a data warehouse platform 32, a relational database 33, and a distributed platform 34. On this basis, it may further comprise: a data access platform 35, a message interface module 36, a cache module 37, and a data presentation platform 38. The message interface module 36 and the scheduler module 31 may both be located in the application system. For the processing and flow performed by each component, refer to the descriptions of embodiment one and embodiment two of the present invention.

The scheduler module 31 judges, according to the acquired current business information and the preset scheduling strategy, whether to call the data warehouse operation statement; when the judgment is yes, it obtains the calling order according to the acquired current business information and the preset scheduling strategy, and invokes the data warehouse operation statement on the data warehouse platform 32 according to the calling order.

The data warehouse platform 32 reads the configuration information corresponding to the data warehouse operation statement from the relational database 33, triggers, according to the calling order, the data warehouse operation statement to perform computation on the data stored on the distributed platform 34, generates result data, and stores it on the distributed platform 34.

The relational database 33 stores the configuration information corresponding to the data warehouse operation statement.

The distributed platform 34 stores the above data and the above result data.

On the basis of the above technical solution, when the system comprises the data access platform 35 and the message interface module 36, the data access platform 35 transfers data to the distributed platform 34 at least once and, each time a transfer is completed, sends a data-transfer-complete message to the message interface module 36. The message interface module 36 receives the data-transfer-complete message. The scheduler module 31 obtains at least one data-transfer-complete message from the message interface module 36 as the current business information. Specifically, the data access platform 35 may send the data-transfer-complete message to the message interface module 36 using one of Google's message transmission schemes, for example the protoBuffer communication mode. The data access platform 35 handles data access for peripheral systems and supports access through real-time interfaces. The data received by the data access platform 35 is written to text files in a specified format, for example files in txt format, and the data access platform 35 periodically transfers these text files to the HDFS file system of the distributed platform 34.

On the basis of the above technical solution, when the system comprises the cache module 37, the scheduler module 31 also controls the distributed platform 34 to import the result data into the relational database 33, and controls the cache module 37 to extract commonly used result data from the relational database 33 according to the preset presentation strategy. The cache module 37 caches the commonly used result data.

The data presentation platform 38 is the interface that presents the final organized result data of this data processing system. The data presentation platform 38 has two data sources: first, the cache module 37; second, the relational database 33. Specifically, the data presentation platform 38 reads the commonly used result data from the cache module 37 and presents it, and also reads result data from the relational database 33 and presents it.

As can be seen from the above embodiments, a scheduler module is added to the mass data processing system. According to the current business information and the preset scheduling strategy, this module determines whether to call the data warehouse operation statement and in what calling order, and the data processing is completed under its control. This avoids issuing commands one by one from the console, as in existing mass data processing systems. Because processing is controlled by the scheduler module, the scheduling strategy and calling order can be configured flexibly according to the logic of the business to be implemented, which increases the flexibility of mass data processing. In addition, the cache module stores the commonly used result data, and the data presentation platform reads result data from the cache module first and presents it; only when the required result data is not stored in the cache module does the data presentation platform read from the database. Adding the cache module thus reduces the pressure that a large number of accesses places on the data presentation platform.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A mass data processing method, characterized by comprising:

The data warehouse platform reads the configuration information corresponding to the data warehouse from a relational database;

2. The mass data processing method according to claim 1, characterized in that, after the result data is generated and stored on the distributed platform, the method further comprises:

3. The mass data processing method according to claim 2, characterized in that, after the data presentation platform reads the commonly used result data from the cache module and presents it, the method further comprises:

4. The mass data processing method according to any one of claims 1 to 3, characterized in that, before the scheduler module judges, according to the acquired current business information and the preset scheduling strategy, whether to call the data warehouse operation statement, the method further comprises:

The data access platform transfers data to the distributed platform at least once;

5. The mass data processing method according to claim 4, characterized in that the data access platform sending the data-transfer-complete message to the message interface module comprises:

6. A mass data processing system, characterized by comprising:

The data warehouse platform, configured to read the configuration information corresponding to the data warehouse from a relational database, to trigger, according to the calling order, the data warehouse operation statement to perform computation on data stored on a distributed platform, and to generate result data and store it on the distributed platform;

The distributed platform, configured to store the data and the result data.

7. The mass data processing system according to claim 6, characterized in that,

The system further comprises:

8. The mass data processing system according to claim 7, characterized in that,

9. The mass data processing system according to any one of claim 6 or 8, characterized in that the system further comprises:

10. The mass data processing system according to claim 9, characterized in that,