CN104750749B - Data processing method and device

Info

Publication number: CN104750749B
Application number: CN201310751401.6A
Authority: CN
Inventors: 刘健男
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2013-12-31
Filing date: 2013-12-31
Publication date: 2018-04-03
Anticipated expiration: 2033-12-31
Also published as: CN104750749A

Abstract

This application provides a data processing method and device. The method includes: performing stream data processing on received stream data by one or more compute nodes; storing the results of the stream data processing as intermediate data in a primary data table and a secondary data table of a database; and, when the one or more compute nodes restart, loading for each compute node the intermediate data corresponding to that node from the secondary data table according to the node identifier of the compute node, so that stream data processing continues on subsequently received stream data based on the intermediate data. With this technical solution, a distributed stream computing system can look up the intermediate data corresponding to each compute node more quickly at startup, which speeds up data loading and in turn speeds up the startup of the distributed stream computing system.

Description

Data processing method and device

Technical field

This application relates to the field of data processing, and in particular to a data processing method and device for a distributed stream computing system.

Background

A distributed stream computing device holds a large amount of intermediate calculation data while it runs, usually in memory. This data is essential for computing the final result data, so it is typically persisted to disk during operation in case the device is interrupted for any reason and has to restart. A traditional relational database is one option for storing the intermediate calculation data of distributed stream computing. However, traditional relational databases are not suited to storing massive amounts of data: once the stored data volume exceeds 100 million, the query performance of most traditional relational databases degrades noticeably and can no longer meet application requirements. In recent years NoSQL (non-relational database) technology has emerged in the big data field. Its most important feature is fast querying of massive data, so when the data volume is very large, a NoSQL database is a well-suited choice for storing the calculation results of a distributed stream computing product. Mainstream NoSQL databases currently include HBase and Cassandra.

There are generally three ways to query data in a NoSQL database: (1) Key-Value lookup, in which a single record is retrieved by a globally unique key. This is the most efficient mode, taking on the order of tens of milliseconds. (2) Range scan, in which a start position and an end position are specified on the key index and multiple records are retrieved. This mode is also very efficient, on the order of milliseconds. (3) Full table scan, in which every record of the table must be scanned to obtain the desired records. This mode is inefficient; for a data volume above 100 million it takes on the order of hours.
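The three query modes can be illustrated with a minimal Python sketch. The `SortedKV` class is a hypothetical in-memory stand-in for a table with an ordered key index; it is not the API of HBase, Cassandra, or any other real database, and it says nothing about their actual latencies.

```python
from bisect import bisect_left


class SortedKV:
    """Hypothetical in-memory stand-in for a table with an ordered key index."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.keys = sorted(self.rows)  # ordered index over the keys

    def get(self, key):
        # (1) Key-Value lookup: one record by its unique key.
        return self.rows.get(key)

    def range_scan(self, start, end):
        # (2) Range scan: binary-search the index, then read [start, end).
        lo = bisect_left(self.keys, start)
        hi = bisect_left(self.keys, end)
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]

    def full_scan(self, predicate):
        # (3) Full table scan: every record is visited and tested.
        return [(k, self.rows[k]) for k in self.keys if predicate(k, self.rows[k])]


table = SortedKV({"a1": 1, "a2": 2, "b1": 3})
```

Only the full scan touches every record, which is why its cost grows with the size of the whole table rather than with the size of the result.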

At present, a distributed stream computing device is usually combined with a NoSQL database to form a system for real-time computation. When the system has to stop and restart for some reason during operation, the distributed stream computing device sometimes needs to load a large amount of intermediate calculation data from the NoSQL database.

Fig. 1 is a structural diagram of an existing distributed stream computing system. As shown in Fig. 1, the system consists of N compute nodes 110-1, ..., 110-i, ..., 110-N distributed across a network, and a NoSQL database 120. The intermediate calculation data of each compute node 110 is independent, and the data of different nodes does not overlap. When the system restarts, each compute node 110 needs to load the portion of the intermediate calculation data that relates to itself.

However, real-time data consumers mainly access the data stored in NoSQL database 120 by Key-Value lookup. The data in database 120 is therefore usually stored under a key that the real-time data consumer can identify and that relates to the business data. A compute node 110 cannot identify keys that relate to business data, so it cannot load its own intermediate calculation data by Key-Value lookup or by range scan and can only use a full table scan. In other words, each compute node 110 has to scan all the data to determine which records belong to it and then load them. Once the data volume of the table exceeds 100 million, a full table scan becomes very slow, which slows the startup of the real-time computing system and, in severe cases, may prevent the system from starting at all.

One existing solution to this problem is lazy loading of the intermediate calculation data. When a message stream enters the distributed stream computing system, the system checks whether the intermediate calculation data for that message stream can be found in memory. If it can, the subsequent computation uses that data. If it cannot, the system checks whether the intermediate calculation data for the message stream can be found in the NoSQL database. If it can, the data found is loaded into memory and used for the subsequent computation. If it cannot, the message stream is determined to be a new stream in business terms; intermediate calculation data for the message stream is added in memory and used for the subsequent computation.

With lazy loading, the intermediate calculation data does not need to be loaded from the NoSQL database at startup, and real-time computation can begin immediately. However, this approach is ill-suited to some application scenarios, for example when most message streams are new streams in business terms. When the intermediate calculation data for a message stream cannot be found in memory, one more lookup in the NoSQL database is needed before the stream can be determined to be new. If most of the message streams from a message source are new, the stream computing program queries the NoSQL database frequently, which generates heavy disk I/O and seriously degrades performance.
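The lazy-loading flow and its cost for new streams can be sketched as follows. This is a minimal illustration only: the `LazyState` class, the `{"count": 0}` initial state, and the use of a plain dictionary in place of the NoSQL database are all assumptions made for the example.

```python
class LazyState:
    """Hypothetical lazy loader: memory first, then the database, else new stream."""

    def __init__(self, database):
        self.memory = {}          # intermediate data already loaded
        self.database = database  # dict standing in for the NoSQL database
        self.db_lookups = 0       # how many times the database was queried

    def state_for(self, stream_id):
        if stream_id in self.memory:
            return self.memory[stream_id]
        # Not in memory: one database lookup is needed to decide
        # whether this is an existing stream or a new one.
        self.db_lookups += 1
        state = self.database.get(stream_id)
        if state is None:
            state = {"count": 0}  # new stream: create fresh intermediate data
        self.memory[stream_id] = state
        return state


lazy = LazyState({"old": {"count": 7}})
lazy.state_for("old")          # found in the database
for i in range(3):
    lazy.state_for(f"new-{i}") # each new stream still costs one lookup
lazy.state_for("new-0")        # now served from memory
```

Every stream seen for the first time costs one database query even when the query finds nothing, which is the overhead described above for workloads dominated by new streams.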

In summary, a solution is needed that is more widely applicable and that improves the data loading speed when a distributed stream computing system starts.

Summary of the invention

The main purpose of this application is to provide a data processing method and device that solve the prior-art problem of slow startup of a distributed stream computing system caused by slow data loading at startup, wherein:

This application provides a data processing method, including: performing stream data processing on received stream data by one or more compute nodes; storing the results of the stream data processing as intermediate data in a primary data table and a secondary data table of a database, where the intermediate data is stored in the primary data table under a key related to the data identifier of the intermediate data, and in the secondary data table under a key related to the node identifier of the compute node corresponding to the intermediate data; and, when the one or more compute nodes restart, loading for each compute node the intermediate data corresponding to that node from the secondary data table according to the node identifier of the compute node, so as to continue performing the stream data processing on subsequently received stream data based on the intermediate data.

Another aspect of this application provides a data processing device, including: a processing module, configured to perform stream data processing on received stream data by one or more compute nodes; a storage module, configured to store the results of the stream data processing as intermediate data in a primary data table and a secondary data table of a database, where the intermediate data is stored in the primary data table under a key related to the data identifier of the intermediate data, and in the secondary data table under a key related to the node identifier of the compute node corresponding to the intermediate data; and a loading module, configured to, when the one or more compute nodes restart, load for each compute node the intermediate data corresponding to that node from the secondary data table according to the node identifier of the compute node, so as to continue performing the stream data processing on subsequently received stream data based on the intermediate data.

Compared with the prior art, the technical solution of this application allows a distributed stream computing system to look up the intermediate data corresponding to each compute node more quickly at startup, which speeds up data loading and in turn speeds up the startup of the distributed stream computing system.

Brief description of the drawings

The drawings described here provide a further understanding of this application and form part of it. The illustrative embodiments of this application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:

Fig. 1 is a structural diagram of a distributed stream computing system in the prior art;

Fig. 2 is a flowchart of the data processing method of an embodiment of this application;

Fig. 3 is a flowchart of loading, from the secondary data table, the intermediate data corresponding to a compute node according to the node identifier of the compute node, in an embodiment of this application;

Fig. 4 is a detailed flowchart of the step, in an embodiment of this application, of obtaining the corresponding intermediate data from the primary data table according to a query request and returning it as the query result;

Fig. 5 is a structural block diagram of the data processing device of an embodiment of this application; and

Fig. 6 is a structural diagram of the distributed stream computing system to which the technical solution of this application is directed.

Detailed description

The main idea of this application is that, in a distributed stream computing system, the intermediate data produced by each compute node is written under different keys into both a primary data table and a secondary data table of a database. When the distributed system restarts, the intermediate data corresponding to each node can be looked up in the secondary data table by the corresponding key and loaded, which speeds up data loading. In addition, under this scheme the intermediate data corresponding to each compute node can be loaded immediately when the system starts, so the scheme is widely applicable and is not restricted by the application scenario.

The technical solution of this application can be applied to a distributed stream computing system. Referring to Fig. 6, the distributed stream computing system 600 can include one or more compute nodes 610-1, ..., 610-i, ..., 610-N and a database 620, and the database 620 includes a primary data table 621 and a secondary data table 622. During data processing, the intermediate data produced by each compute node 610-i is written under different keys into the primary data table 621 and the secondary data table 622. For convenience, the figure shows only the relationship between one compute node 610-i and the primary data table 621 and secondary data table 622. It should be understood that every other compute node has a similar relationship with the primary data table 621 and the secondary data table 622.

To make the purpose, technical solution, and advantages of this application clearer, the technical solution is described clearly and completely below with reference to specific embodiments of this application and the corresponding drawings. The described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application without creative effort fall within the scope of protection of this application.

According to an embodiment of this application, a data processing method is provided.

The data processing method of this application can be used to process data in a distributed stream computing system, where the distributed stream computing system can include one or more compute nodes and a database for storing the intermediate data corresponding to the one or more compute nodes. The intermediate data is the intermediate calculation data produced by the computation. A distributed stream computing system can be a real-time system, that is, one that runs continuously. In a real-time system, all computed data is intermediate calculation data. For a real-time data consumer, however, the intermediate calculation data obtained from the distributed stream computing system at a given point in time (the real-time calculation data computed at that point in time) can be regarded as final result data, that is, the result data for that point in time. The database can be a non-relational (NoSQL) database.

The intermediate calculation data of each compute node in the distributed stream computing system is independent, and there is no necessary association between the data of different nodes. When the system restarts, each node only needs to load the portion of the intermediate calculation data that relates to itself.

Fig. 2 is a flowchart of the data processing method of an embodiment of this application.

In step S201, stream data processing is performed on the received stream data by one or more compute nodes.

The distributed stream computing system can be a real-time system into which data is input continuously. The distributed system assigns the received stream data to one or more compute nodes, and each compute node performs stream data processing on the stream data it receives. Data that is generated in real time and relates to a requirement of a real-time data consumer is input into the distributed stream computing system in the form of stream data. Within the distributed stream computing system, the processing logic for the stream data can be determined according to the requirement, and the input stream data is assigned to the corresponding compute node for the corresponding data processing. The compute node that performs stream data processing on a piece of stream data can be determined according to the data identifier and/or the data processing logic of the stream data. For example, each compute node can handle stream data for a different processing logic, or each compute node can handle the stream data corresponding to one or more data identifiers.
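One way to assign stream data to compute nodes by data identifier is a stable hash of the identifier. The text does not prescribe any particular assignment rule, so the following is only a sketch under that assumption; `route_to_node` is a hypothetical helper, and nodes are numbered 1 to N.

```python
import zlib


def route_to_node(data_id: str, node_count: int) -> int:
    """Map a data identifier to a node identifier in 1..node_count.

    Uses CRC32 rather than Python's built-in hash() so that the mapping
    stays the same across process restarts.
    """
    return zlib.crc32(data_id.encode("utf-8")) % node_count + 1


node_for_abc = route_to_node("abc", 400)
```

A mapping that is stable across restarts matters here because a node that reloads its intermediate data must keep receiving the same data identifiers afterwards.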

A real-time data consumer can be an application that serves user requirements. For example, seller users of an online shopping platform typically want to see, in real time, the transaction information, traffic information, and so on of their own shops. To meet these requirements, the requirement-related data that seller users generate in real time on the online shopping platform is input into the distributed stream computing system. That is, as soon as a seller user generates new data related to a requirement, the data is input into the distributed stream computing system, which performs the requirement-related processing on the stream data.

In step S202, the results of the stream data processing are stored as intermediate data in the primary data table and the secondary data table of the database.

The intermediate data is stored in the primary data table under a key related to the data identifier of the intermediate data. Because a real-time data consumer can identify the key related to the data identifier of the intermediate data, the consumer can query the intermediate data from the primary data table by that key.

According to one embodiment of this application, the data identifier of the intermediate data can be the object identifier of the intermediate data, that is, the identifier of the data object to which the intermediate data belongs. For example, it can be the user identifier of the user corresponding to the intermediate data, such as the account of a seller user on an online shopping platform. If the account of a seller user is "abc", then "abc" can be used as the key of the intermediate data corresponding to that user. When a real-time data consumer wants to obtain the intermediate data corresponding to the seller user, it can query the index of the primary data table with the key "abc" at predetermined intervals (for example, every 5 seconds) to find the storage location of the intermediate data, and thereby obtain the result data for the seller user in real time. The result data can be the real-time transaction information or real-time traffic information that the seller user requires, and this information can also be displayed to the seller user.

The intermediate data is stored in the secondary data table under a key related to the node identifier of the compute node corresponding to the intermediate data, so the key related to each node identifier can be identified. When the system restarts, the intermediate data corresponding to a compute node can therefore be found in the secondary data table by that key and loaded for the compute node, so that stream data processing can continue.

For example, if the N compute nodes of the distributed stream computing system are identified by the node identifiers 1, 2, ..., i, ..., N, the node identifier of each compute node can be used as the key of the intermediate data of that compute node.

According to one embodiment of this application, the key of the intermediate data in the secondary data table can also include the key of the intermediate data in the primary data table. Specifically, the key of the intermediate data in the secondary data table can include the characters related to the node identifier of the compute node corresponding to the intermediate data, together with the key of the intermediate data in the primary data table.

In a specific embodiment, the key of the intermediate data in the secondary data table can consist of the node identifier of the compute node corresponding to the intermediate data, a separator, and the key of the intermediate data in the primary data table.

For example, if the node identifier of a compute node is 18 and the key of that node's intermediate data in the primary data table is "abc", then the node identifier "18" and the primary-table key "abc" can be combined into the key "18 abc", and the intermediate data is written to the secondary data table under the key "18 abc". Any separator can be used between the characters related to the node identifier of the compute node corresponding to the intermediate data and the key of the intermediate data in the primary data table; in this example the separator is a space.
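The key composition and the write to both tables can be sketched as follows. The function names and the use of plain dictionaries for the two tables are assumptions made for illustration; the space separator and the values 18 and "abc" come from the example above.

```python
SEPARATOR = " "  # the example in the text uses a space


def secondary_key(node_id, primary_key, sep=SEPARATOR):
    """Compose the secondary-table key: node identifier + separator + primary key."""
    return f"{node_id}{sep}{primary_key}"


def store(primary_table, secondary_table, node_id, primary_key, value):
    """Write one piece of intermediate data to both tables under different keys."""
    primary_table[primary_key] = value  # read by real-time data consumers
    secondary_table[secondary_key(node_id, primary_key)] = value  # read at restart


primary_table, secondary_table = {}, {}
store(primary_table, secondary_table, 18, "abc", {"orders": 3})
```

Each result is written twice, which is the extra runtime cost that the scheme trades for faster loading at restart.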

In step S203, when the one or more compute nodes restart, the intermediate data corresponding to each compute node is loaded for that compute node from the secondary data table according to the node identifier of the compute node, so that the stream data processing continues on subsequently received stream data based on the intermediate data.

Fig. 3 is a flowchart of loading, for a compute node, the intermediate data corresponding to that compute node from the secondary data table according to the node identifier of the compute node.

In step S301, the key related to the node identifier is looked up in the index of the secondary data table according to the node identifier of the compute node.

Specifically, the intermediate data of each of the one or more compute nodes is stored in the secondary data table with data related to the node identifier of that compute node as the key. The key related to a node identifier can therefore be looked up in the index of the secondary data table according to the node identifier of each compute node.

According to one embodiment of this application, the key related to the node identifier of a compute node can be looked up in the index of the secondary data table of the database by a range scan, that is, by specifying a start position and an end position for the lookup and searching the index of the secondary data table for the keys related to the node identifier of the compute node.

This application can use a NoSQL database to store the intermediate data. The index created in a NoSQL database is ordered, so it is only necessary to specify a start position and an end position in order to find the keys of the intermediate data corresponding to each compute node in the index of the secondary data table by a range scan.

The range scan can be set to be left-closed and right-open: the scan of the index begins at the start position and ends when it reaches the end position, and the data at the end position is not scanned.

For example, the key of every piece of intermediate data of node 18 in the secondary data table has the form "18 + separator + key in the primary data table". If the key of a piece of intermediate data of node 18 in the primary data table is "abc", then its key in the secondary data table is "18 abc", where a space is used as the separator between the node identifier of the compute node corresponding to the intermediate data and the key of the intermediate data in the primary data table. When the keys of the intermediate data corresponding to node 18 are looked up in the index of the secondary data table, the start position and end position of the lookup can be set to:

Start position: "18 ". Note that 18 is followed by the separator (a space); that is, "18" + separator.

End position: "19 ". Note that 19 is followed by the separator (a space); that is, "19" + separator.

The separator here is the same as the separator between the node identifier of the compute node corresponding to the intermediate data and the key of the intermediate data in the primary data table; that is, like the separator in "18 abc", it is " " (a space).

Because "19" + separator is the end position, the scan covers the keys that begin with "18" (that is, the storage locations of all intermediate data corresponding to node 18) and stops when it reaches a key that begins with "19"; keys that begin with "19" are not scanned.
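The left-closed, right-open scan for node 18 can be sketched over a sorted list of keys, which stands in for the ordered index of the secondary data table. The `scan_node` helper and the sample keys are hypothetical. The sketch also assumes numeric node identifiers that all have the same number of digits: with plain string ordering, a key such as "180 abc" would sort between "18 " and "19 ", so identifiers of mixed width would need padding or a different end position.

```python
from bisect import bisect_left


def scan_node(sorted_keys, node_id, sep=" "):
    """Return the keys in [node_id + sep, (node_id + 1) + sep) from a sorted index."""
    start = f"{node_id}{sep}"      # e.g. "18 "
    end = f"{node_id + 1}{sep}"    # e.g. "19 ", excluded from the scan
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


# Hypothetical secondary-table index with two-digit node identifiers.
index = sorted(["17 xyz", "18 abc", "18 def", "19 abc", "20 ghi"])
node_18_keys = scan_node(index, 18)
```

The scan reads only the contiguous slice of the index that belongs to node 18 and never visits the keys of the other nodes.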

In step S302, the storage location of the intermediate data corresponding to the compute node is determined according to the key related to the node identifier. That is, the storage location of the intermediate data corresponding to the compute node is determined from the key, found in the index of the secondary data table, that relates to the node identifier of the compute node.

It should be understood that the above describes one implementation of looking up, by key, the intermediate data corresponding to the one or more compute nodes in the secondary data table of the database, for the case in which the key of the intermediate data in the secondary data table consists of the characters related to the node identifier of the corresponding compute node, a separator, and the key of the intermediate data in the primary data table. Depending on the structure of the key of the intermediate data in the secondary data table, any other suitable lookup method can also be used.

In addition, it should be understood that the way in which the intermediate data corresponding to the one or more compute nodes is looked up by key in the secondary data table of the database is not limited to the embodiment above; any other suitable way of looking up the corresponding intermediate data by key in the secondary data table can also be used.

In step S303, the corresponding intermediate data is loaded from the storage location. That is, after the storage location of the intermediate data corresponding to the compute node has been found in the secondary data table, the intermediate data found in the secondary data table is loaded to the corresponding compute node; in other words, the intermediate data corresponding to the compute node is loaded into the memory of that compute node.

After the corresponding intermediate data has been loaded from the secondary data table to the compute node according to the node identifier of the compute node through steps S301 to S303 above, the stream data processing can continue on subsequently received stream data based on the intermediate data.

According to one embodiment of this application, the key of the intermediate data in the primary data table can be parsed out of the key of the intermediate data in the secondary data table, so that the mapping between the primary-table key of the intermediate data and the intermediate data can be stored for use in subsequent data processing.

Specifically, during parsing, the characters related to the node identifier of the compute node corresponding to the intermediate data, and the separator between those characters and the key of the intermediate data in the primary data table, can be removed from the key of the intermediate data in the secondary data table, leaving the key of the intermediate data in the primary data table. For example, if a key of node 18 in the secondary data table is "18 abc", then "18" and the separator " " (a space) can be removed to obtain the key "abc" of the intermediate data in the primary data table.
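The parsing step can be sketched as splitting the secondary-table key at the first separator. Both helper names are hypothetical, and splitting at the first separator assumes that the node identifier itself never contains the separator character.

```python
def primary_key_from(secondary_key, sep=" "):
    """Split 'node_id + sep + primary_key' into (node_id, primary_key)."""
    node_part, found, primary_key = secondary_key.partition(sep)
    if not found:
        raise ValueError(f"no separator in secondary key: {secondary_key!r}")
    return node_part, primary_key


def rebuild_memory(loaded_rows):
    """Map primary-table keys to intermediate data for rows loaded at restart."""
    return {primary_key_from(key)[1]: value for key, value in loaded_rows}


memory = rebuild_memory([("18 abc", {"orders": 3}), ("18 def", {"orders": 5})])
```

Because only the first separator is consumed, a primary-table key that itself contains a space is recovered intact.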

After the distributed stream computing system has started, the key of the intermediate data in the primary data table can be the data identifier of the intermediate data. In subsequent data processing, each compute node can therefore use the data identifier of the stream data assigned to it (that is, the key of the intermediate data in the primary data table) to look up the intermediate data corresponding to that stream data in its own memory, and use that intermediate data to continue the relevant stream data processing on subsequently received stream data.

According to one embodiment of this application, the method can also include a step of, in response to a query request for the results of the stream data processing, obtaining the corresponding intermediate data from the primary data table according to the data identifier of the intermediate data that constitutes the results of the stream data processing, and returning the intermediate data as the query result. This step is described in detail below with reference to Fig. 4.

As shown in Fig. 4, in step S401, the key related to the data identifier is looked up in the index of the primary data table according to the data identifier included in the query request. For example, for a query request in which a real-time data consumer asks for the transaction volume of the seller user whose account is "abc", the key related to "abc" is looked up in the index of the primary data table according to the account "abc".

Next, in step S402, the storage location of the intermediate data corresponding to the data identifier is determined according to the key related to the data identifier. That is, after the key related to the data identifier has been found in the index of the primary data table, the storage location of the intermediate data corresponding to the data identifier is determined from the index.

Then, in step S403, the corresponding intermediate data is obtained from the storage location and returned as the query result. That is, the intermediate data is obtained from the storage location of the intermediate data corresponding to the data identifier and returned to the real-time data consumer as the result of the query.

The data processing method according to the embodiments of this application has now been described with reference to Fig. 1 to Fig. 4. With the technical solution of this application, for a distributed stream computing system with N concurrent programs, the time spent loading intermediate data at startup is about 1/N of the time a full table scan would take. Suppose a distributed stream computing system has 400 concurrent compute nodes and that immediate loading at startup by full table scan takes 2 hours; in theory, the loading strategy of this application then takes about 18 seconds. The technical solution of this application therefore accepts a small amount of additional runtime overhead in order to solve the problem that a distributed stream computing system using a non-relational database as its storage tool takes too long to start, or cannot start at all, when it loads data immediately at startup.
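The 18-second figure follows from dividing the full-scan time by the number of nodes. The small calculation below reproduces it; the function name is hypothetical, and the model assumes that the intermediate data is spread evenly over the nodes and that scan time is proportional to the amount of data read.

```python
def estimated_load_seconds(full_scan_seconds, node_count):
    """Theoretical per-node load time: each node reads about 1/N of the table."""
    return full_scan_seconds / node_count


full_scan = 2 * 60 * 60  # 2 hours, in seconds
estimate = estimated_load_seconds(full_scan, 400)
```

With 7200 seconds for the full scan and 400 nodes, the estimate is 7200 / 400 = 18 seconds per node.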

Correspondingly, an embodiment of this application also provides a data processing device.

Fig. 5 schematically shows a structural block diagram of a data processing device 500 according to one embodiment of this application. The device 500 can include a processing module 510, a storage module 520, and a loading module 530.

The processing module 510 can be used to perform stream data processing on the received stream data by one or more compute nodes.

The storage module 520 can be used to store the results of the stream data processing as intermediate data in the primary data table and the secondary data table of the database, where the intermediate data is stored in the primary data table under a key related to the data identifier of the intermediate data, and in the secondary data table under a key related to the node identifier of the compute node corresponding to the intermediate data.

The loading module 530 can be used to, when the one or more compute nodes restart, load for each compute node the intermediate data corresponding to that compute node from the secondary data table according to the node identifier of the compute node, so as to continue performing the stream data processing on subsequently received stream data based on the intermediate data.

According to one embodiment of the present application, the device 500 may further include a query module. The query module may be configured to, in response to a query request for a result of the stream data processing, obtain the corresponding intermediate data from the primary data table according to the data identifier of the intermediate data that constitutes the result, and return that intermediate data as the query result.

The query module may further include a lookup submodule, a determination submodule, and an acquisition submodule.

The lookup submodule may be configured to look up, in the index of the primary data table, the key related to the data identifier contained in the query request.

The determination submodule may be configured to determine, according to the key related to the data identifier, the storage location of the intermediate data corresponding to the data identifier.

The acquisition submodule may be configured to obtain the corresponding intermediate data from the storage location and return it as the query result.

According to one embodiment of the present application, the loading module 530 may further include a lookup submodule, a determination submodule, and a loading submodule.

The lookup submodule may be configured to look up, in the index of the secondary data table, the key related to the node identifier of the compute node.

The determination submodule may be configured to determine, according to the key related to the node identifier, the storage location of the intermediate data corresponding to the compute node.

The loading submodule may be configured to load the corresponding intermediate data from the storage location.
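
The three loading steps (index lookup, storage location, load) can be sketched as a prefix range scan over a sorted key index. This is one possible reading, not the application's stated implementation: the sorted list `sorted_keys`, the `locations` mapping, the `storage` list, and the `"0002|"` key format are all assumptions made for illustration.

```python
import bisect


def load_for_node(sorted_keys, locations, storage, node_prefix):
    """Load all data whose secondary-table key starts with node_prefix."""
    # Step 1: find the first key in the sorted index at or after the prefix.
    i = bisect.bisect_left(sorted_keys, node_prefix)
    loaded = []
    # Keys sharing a prefix are contiguous in sorted order, so stop at the
    # first key that no longer matches instead of scanning the whole table.
    while i < len(sorted_keys) and sorted_keys[i].startswith(node_prefix):
        offset = locations[sorted_keys[i]]  # Step 2: key -> storage location
        loaded.append(storage[offset])      # Step 3: load from that location
        i += 1
    return loaded


# Hypothetical table contents: keys are "<node id>|<primary-table key>".
sorted_keys = ["0001|user:a", "0001|user:b", "0002|user:c", "0003|user:d"]
locations = {key: offset for offset, key in enumerate(sorted_keys)}
storage = ["state-a", "state-b", "state-c", "state-d"]

print(load_for_node(sorted_keys, locations, storage, "0002|"))  # ['state-c']
```

Only the rows belonging to the requested node are visited, in contrast to a full table scan that would read every row and filter afterwards.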

According to one embodiment of the present application, the data identifier of the intermediate data may include an object identifier of the intermediate data.

According to one embodiment of the present application, the key of the intermediate data in the secondary data table may include the key of the intermediate data in the primary data table.

According to one embodiment of the present application, the key of the intermediate data in the secondary data table includes characters related to the node identifier of the compute node corresponding to the intermediate data, together with the key of the intermediate data in the primary data table.
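
A composite key of this kind might be built as follows. The fixed-width, zero-padded numeric node prefix and the `|` separator are assumptions for illustration only, as are the function names; the application does not specify a concrete encoding.

```python
NODE_ID_WIDTH = 4  # assumed fixed width, so keys of one node sort together
SEPARATOR = "|"    # assumed separator between node prefix and primary key


def make_secondary_key(node_id: int, primary_key: str) -> str:
    """Prefix the primary-table key with characters derived from the node id."""
    return f"{node_id:0{NODE_ID_WIDTH}d}{SEPARATOR}{primary_key}"


def split_secondary_key(secondary_key: str):
    """Recover the node id and the primary-table key from a secondary key."""
    prefix, primary_key = secondary_key.split(SEPARATOR, 1)
    return int(prefix), primary_key


key = make_secondary_key(7, "user:42")
print(key)                       # 0007|user:42
print(split_secondary_key(key))  # (7, 'user:42')
```

Because the primary-table key is embedded in the secondary key, a row found through the secondary data table can be traced back to its entry in the primary data table without a separate mapping.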

The functions implemented by the device of this embodiment essentially correspond to the method embodiments shown in Figs. 1 to 4. For details not covered in the description of this embodiment, refer to the related descriptions in the preceding embodiments; they are not repeated here.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include computer-readable media in the form of volatile memory, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprise" and "include", and any other variants thereof, are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The foregoing describes only embodiments of the present application and is not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims

1. A data processing method, characterized by comprising:

performing stream data processing on received stream data through one or more compute nodes;

storing the results of the stream data processing as intermediate data in a primary data table and a secondary data table of a database, wherein the intermediate data is stored in the primary data table under a key related to the data identifier of the intermediate data, and the intermediate data is stored in the secondary data table under a key related to the node identifier of the compute node corresponding to the intermediate data; and

when the one or more compute nodes are restarted, loading, for the compute node, the intermediate data corresponding to the compute node from the secondary data table according to the node identifier of the compute node, so as to continue performing the stream data processing on subsequently received stream data based on the intermediate data.
2. The method according to claim 1, characterized by further comprising:

in response to a query request for a result of the stream data processing, obtaining the corresponding intermediate data from the primary data table according to the data identifier of the intermediate data that constitutes the result of the stream data processing, and returning the intermediate data as the query result.
3. The method according to claim 2, characterized in that, in response to a query request for a result of the stream data processing, obtaining the corresponding intermediate data from the primary data table according to the data identifier of the intermediate data that constitutes the result of the stream data processing, and returning the intermediate data as the query result, further comprises:

looking up, in the index of the primary data table, the key related to the data identifier contained in the query request;

determining, according to the key related to the data identifier, the storage location of the intermediate data corresponding to the data identifier; and

obtaining the corresponding intermediate data from the storage location and returning it as the query result.
4. The method according to claim 1, wherein loading, for the compute node, the intermediate data corresponding to the compute node from the secondary data table according to the node identifier of the compute node further comprises:

looking up, in the index of the secondary data table, the key related to the node identifier according to the node identifier of the compute node;

determining, according to the key related to the node identifier, the storage location of the intermediate data corresponding to the compute node; and

loading the corresponding intermediate data from the storage location.
5. The method according to claim 1, wherein the data identifier of the intermediate data includes an object identifier of the intermediate data.
6. The method according to any one of claims 1-5, wherein the key of the intermediate data in the secondary data table includes the key of the intermediate data in the primary data table.
7. The method according to any one of claims 1-5, wherein the key of the intermediate data in the secondary data table includes the node identifier of the compute node corresponding to the intermediate data and the key of the intermediate data in the primary data table.
8. A data processing device, characterized by comprising:

a processing module, configured to perform stream data processing on received stream data through one or more compute nodes;

a storage module, configured to store the results of the stream data processing as intermediate data in a primary data table and a secondary data table of a database, wherein the intermediate data is stored in the primary data table under a key related to the data identifier of the intermediate data, and the intermediate data is stored in the secondary data table under a key related to the node identifier of the compute node corresponding to the intermediate data; and

a loading module, configured to, when the one or more compute nodes are restarted, load, for the compute node, the intermediate data corresponding to the compute node from the secondary data table according to the node identifier of the compute node, so that the one or more compute nodes continue performing the stream data processing on subsequently received stream data based on the intermediate data.
9. The device according to claim 8, characterized by further comprising:

a query module, configured to, in response to a query request for a result of the stream data processing, obtain the corresponding intermediate data from the primary data table according to the data identifier of the intermediate data that constitutes the result of the stream data processing, and return the intermediate data as the query result.
10. The device according to claim 9, characterized in that the query module comprises:

a lookup submodule, configured to look up, in the index of the primary data table, the key related to the data identifier contained in the query request;

a determination submodule, configured to determine, according to the key related to the data identifier, the storage location of the intermediate data corresponding to the data identifier; and

an acquisition submodule, configured to obtain the corresponding intermediate data from the storage location and return it as the query result.