CN105227601A

CN105227601A - Data processing method, device and system in a stream processing system

Info

Publication number: CN105227601A
Application number: CN201410270571.7A
Authority: CN
Inventors: 赫彩凤; 张晓飞; 范伟
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2014-06-17
Filing date: 2014-06-17
Publication date: 2016-01-06

Abstract

An embodiment of the invention discloses a data processing method, device and system in a stream processing system. The embodiment monitors the inflow of data flows into the stream processing system to obtain data flow information, which comprises the data source identifier, time identifier and data flow size of the data flow; obtains the number of computing nodes in the stream processing system; distributes the data flow to the computing nodes according to a uniform distribution algorithm, based on the data flow information and the number of computing nodes, to establish a data structure; and carries out the correlation query (join) on the data flow based on that data structure. Because the data flow is distributed evenly across the computing nodes, the scheme avoids the uneven distribution of the prior art, obtains the largest possible speed-up under load balancing, and makes full use of system resources, which both reduces the system overhead of correlation queries and improves processing efficiency.

Description

Data processing method, device and system in a stream processing system

Technical field

The present invention relates to the field of communication technology, and in particular to a data processing method, device and system in a stream processing system.

Background art

With the large-scale deployment of sensor data acquisition devices, fields such as the Internet of Things, traffic, remote sensing monitoring and finance have a growing demand for processing massive stream data, and how to analyse and query stream data efficiently in real time has become a problem of wide concern.

The correlation query (join) is the most commonly used operation for data query and statistics in a stream processing system. It typically means searching multiple data sources for data that satisfies a particular relation, such as equal to, greater than or less than, and outputting that data. When processing correlation queries, most existing stream processing systems use data warehouse technology to store the inflowing data. For θ correlation queries, whose correlation function is more complex, the data structure built with data warehouse technology is unsuitable. The prior art has therefore also proposed correlation query technology based on a time window, where the time window is a time interval (for example 30 seconds) determined by the stream processing system or by the application itself. A correlation query based on this time window only needs to return, taking the arrival of the currently inflowing data as the time point, all historical data in the preceding time window that can be correlated with the currently inflowing data. Because the data volume flowing in within a given time window is limited, most prior-art schemes build an in-memory index to summarise the data of the past time window; that is, when data flows in, data with the same memory index is stored on the same computing node to facilitate querying.

In researching and practising the prior art, the inventors found that, because existing schemes store data with the same memory index on the same computing node, some computing nodes may be heavily loaded while others sit idle, so system resources cannot be fully used. Moreover, because some computing nodes are heavily loaded, the system overhead of those nodes during a correlation query is also large, and processing efficiency is low.

Summary of the invention

Embodiments of the present invention provide a data processing method, device and system in a stream processing system that can make full use of system resources, both reducing the system overhead of correlation queries and improving processing efficiency.

In a first aspect, an embodiment of the present invention provides a data processing method in a stream processing system, comprising:

Monitoring the inflow of data flows in the stream processing system to obtain data flow information, the data flow information comprising the data source identifier, time identifier and data flow size of the data flow;

Obtaining the number of computing nodes in the stream processing system;

Distributing the data flow to the computing nodes according to a uniform distribution algorithm, based on the data flow information and the number of computing nodes, to establish a data structure;

Carrying out the correlation query on the data flow based on the data structure.

In a first possible implementation, with reference to the first aspect, distributing the data flow to the computing nodes according to a uniform distribution algorithm, based on the data flow information and the number of computing nodes, to establish a data structure comprises:

Determining, according to the data flow information, the data volume received from each data source up to the current time, to obtain the first data volume corresponding to each data source;

Determining, according to the number of computing nodes and the first data volumes, the number of groups into which each first data volume needs to be divided, to obtain the first group numbers;

Determining the data volume that each computing node can be assigned according to each first data volume and the corresponding first group number;

Distributing the inflowing data flow to the computing nodes according to the data volume that each computing node can be assigned, to establish the data structure.

In a second possible implementation, with reference to the first possible implementation of the first aspect, the method may further comprise:

When the data volume that each computing node can be assigned is a power of u, redistributing the inflowing data flow to establish the data structure, where u is the sum of 1 and a first quotient, the first quotient is the quotient of 3 times ∈ and 2 times [...], ∈ is the load excess that a computing node can tolerate, and n is the number of computing nodes in the stream processing system.

In a third possible implementation, with reference to the second possible implementation of the first aspect, redistributing the inflowing data flow comprises:

Updating the current time;

Determining, according to the data flow information, the data volume received from each data source up to the updated current time, to obtain the second data volume corresponding to each data source;

Determining, according to the number of computing nodes and the second data volumes, the number of groups into which each second data volume needs to be divided, to obtain the second group numbers;

Redetermining the data volume that each computing node can be assigned according to each second data volume and the corresponding second group number;

Distributing the inflowing data flow to the computing nodes according to the redetermined data volume that each computing node can be assigned, to establish the data structure.

In a fourth possible implementation, with reference to the second or third possible implementation of the first aspect, when the inflowing data flow is redistributed, the method may further comprise:

Migrating the data that needs to be migrated in each computing node.

In a fifth possible implementation, with reference to the fourth possible implementation of the first aspect, migrating the data that needs to be migrated in each computing node comprises:

Sending a redistribution notice to the computing node that needs to carry out data migration, so that the computing node updates the time identifier of the data it stores, uses the updated time identifier as the time identifier of the data to be migrated, and migrates the data to be migrated to a target computing node.

In a sixth possible implementation, with reference to the fifth possible implementation of the first aspect, migrating the data that needs to be migrated in each computing node comprises:

Sending a redistribution notice to the computing node that needs to carry out data migration, so that the computing node updates the time identifier of the data it stores and receives data migrated from other computing nodes. When the computing node determines that the time indicated by the time identifier carried by the migrated data is equal to the time indicated by its own updated time identifier, it stores the migrated data. When it determines that the time indicated by the time identifier carried by the migrated data is later than the time indicated by its own updated time identifier, it buffers the migrated data to wait for a new redistribution notice. When it determines that the time indicated by the time identifier carried by the migrated data is earlier than the time indicated by its own updated time identifier, it replaces the time identifier carried by the migrated data with its own updated time identifier and sends the migrated data to the other computing nodes.

In a second aspect, an embodiment of the present invention provides a data processing device in a stream processing system, comprising a monitoring unit, an acquiring unit, an establishing unit and a query unit, as follows:

The monitoring unit is configured to monitor the inflow of data flows in the stream processing system to obtain data flow information, the data flow information comprising the data source identifier, time identifier and data flow size of the data flow;

The acquiring unit is configured to obtain the number of computing nodes in the stream processing system;

The establishing unit is configured to distribute the data flow to the computing nodes according to a uniform distribution algorithm, based on the data flow information and the number of computing nodes, to establish a data structure;

The query unit is configured to carry out the correlation query on the data flow based on the data structure.

In a first possible implementation, with reference to the second aspect, the establishing unit may comprise a data volume determining subunit, a group number determining subunit and a distributing subunit;

The data volume determining subunit is configured to determine, according to the data flow information, the data volume received from each data source up to the current time, to obtain the first data volume corresponding to each data source;

The group number determining subunit is configured to determine, according to the number of computing nodes and the first data volumes, the number of groups into which each first data volume needs to be divided, to obtain the first group numbers;

The distributing subunit is configured to determine the data volume that each computing node can be assigned according to each first data volume and the corresponding first group number, and to distribute the inflowing data flow to the computing nodes according to the data volume that each computing node can be assigned, to establish the data structure.

In a second possible implementation, with reference to the first possible implementation of the second aspect, the establishing unit further comprises a redistributing subunit;

The redistributing subunit is configured to redistribute the inflowing data flow, to establish the data structure, when the data volume that each computing node can be assigned is a power of u, where u is the sum of 1 and a first quotient, the first quotient is the quotient of 3 times ∈ and 2 times [...], ∈ is the load excess that a computing node can tolerate, and n is the number of computing nodes in the stream processing system.

In a third possible implementation, with reference to the second possible implementation of the second aspect:

The redistributing subunit is specifically configured to: update the current time when the data volume that each computing node can be assigned is a power of u; determine, according to the data flow information, the data volume received from each data source up to the updated current time, to obtain the second data volume corresponding to each data source; determine, according to the number of computing nodes and the second data volumes, the number of groups into which each second data volume needs to be divided, to obtain the second group numbers; redetermine the data volume that each computing node can be assigned according to each second data volume and the corresponding second group number; and distribute the inflowing data flow to the computing nodes according to the redetermined data volume that each computing node can be assigned, to establish the data structure.

In a fourth possible implementation, with reference to the second or third possible implementation of the second aspect, the establishing unit further comprises a migrating subunit;

The migrating subunit is configured to migrate the data that needs to be migrated in each computing node when the redistributing subunit redistributes the inflowing data flow according to the group numbers and the data volume that each computing node can be assigned.

In a fifth possible implementation, with reference to the fourth possible implementation of the second aspect:

The migrating subunit is specifically configured to send a redistribution notice to the computing node that needs to carry out data migration, so that the computing node updates the time identifier of the data it stores, uses the updated time identifier as the time identifier of the data to be migrated, and migrates the data to be migrated to a target computing node.

In a sixth possible implementation, with reference to the fourth possible implementation of the second aspect:

The migrating subunit is specifically configured to send a redistribution notice to the computing node that needs to carry out data migration, so that the computing node updates the time identifier of the data it stores and receives data migrated from other computing nodes. When the computing node determines that the time indicated by the time identifier carried by the migrated data is equal to the time indicated by its own updated time identifier, it stores the migrated data. When it determines that the time indicated by the time identifier carried by the migrated data is later than the time indicated by its own updated time identifier, it buffers the migrated data to wait for a new redistribution notice. When it determines that the time indicated by the time identifier carried by the migrated data is earlier than the time indicated by its own updated time identifier, it replaces the time identifier carried by the migrated data with its own updated time identifier and sends the migrated data to the other computing nodes.

In a third aspect, an embodiment of the present invention further provides a stream processing system comprising any one of the data processing devices in a stream processing system provided by the embodiments of the present invention.

Embodiments of the present invention monitor the inflow of data flows in the stream processing system to obtain data flow information, which comprises the data source identifier, time identifier and data flow size of the data flow; obtain the number of computing nodes in the stream processing system; distribute the data flow to the computing nodes according to a uniform distribution algorithm, based on the data flow information and the number of computing nodes, to establish a data structure; and carry out the correlation query on the data flow based on that data structure. Because the data flow is distributed evenly across the computing nodes, the scheme avoids the uneven distribution of the prior art, obtains the largest possible speed-up under load balancing, and makes full use of system resources, which both reduces the system overhead of correlation queries and improves processing efficiency.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow chart of the data processing method in a stream processing system provided by an embodiment of the present invention;

Fig. 2a is a schematic diagram of the network architecture in which the stream processing system provided by an embodiment of the present invention is located;

Fig. 2b is another flow chart of the data processing method in a stream processing system provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the data processing device in a stream processing system provided by an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the network device provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiments of the present invention provide a data processing method, device and system in a stream processing system, each of which is described in detail below.

Embodiment one

This embodiment is described from the perspective of the data processing device in the stream processing system. The data processing device can specifically be integrated in the central coordinator node of the stream processing system.

A data processing method in a stream processing system comprises: monitoring the inflow of data flows in the stream processing system to obtain data flow information; obtaining the number of computing nodes in the stream processing system; distributing the data flow to the computing nodes according to a uniform distribution algorithm, based on the data flow information and the number of computing nodes, to establish a data structure; and carrying out the correlation query on the data flow based on the data structure.

As shown in Fig. 1, the flow of the data processing method in the stream processing system can specifically be as follows:

101. Monitor the inflow of data flows in the stream processing system to obtain data flow information.

The data flow information comprises information such as the data source identifier, time identifier and data flow size of the data flow. The data source identifier identifies the origin of the data flow, i.e. the data source, and the time identifier identifies the time at which the data flow was received.

102. Obtain the number of computing nodes in the stream processing system.

A stream processing system generally comprises multiple computing nodes. The exact number can be set according to the requirements of the practical application and is not described further here.

103. Distribute the data flow to each computing node according to a uniform distribution algorithm, based on the data flow information and the number of computing nodes, to establish a data structure. For example, this can specifically be as follows:

(A1) Determine, according to the data flow information, the data volume received from each data source up to the current time, to obtain the data volume corresponding to each data source. For convenience, this is called the first data volume in the embodiments of the present invention.

For example, take two data sources S and R that feed data into the stream processing system in real time. If the current time is t, then the data volume received from data source R up to time t can be expressed as |R|_t, i.e. the first data volume corresponding to data source R is |R|_t, and the data volume received from data source S can be expressed as |S|_t, i.e. the first data volume corresponding to data source S is |S|_t.
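Step (A1) can be illustrated with a minimal Python sketch. The names `FlowInfo` and `volumes_up_to` are hypothetical and do not come from the patent; the sketch only assumes that each monitored record carries the three fields listed in step 101.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowInfo:
    """One monitored record (hypothetical layout of the data flow information)."""
    source_id: str    # data source identifier, e.g. "R" or "S"
    timestamp: float  # time identifier: when the data was received
    size: int         # data flow size


def volumes_up_to(records, t):
    """Accumulate, per data source, the volume of records received up to time t."""
    totals = defaultdict(int)
    for rec in records:
        if rec.timestamp <= t:
            totals[rec.source_id] += rec.size
    return dict(totals)


records = [
    FlowInfo("R", 1.0, 40),
    FlowInfo("S", 2.0, 10),
    FlowInfo("R", 3.0, 60),
    FlowInfo("S", 9.0, 5),   # arrives after t, so it is not counted
]
first_volumes = volumes_up_to(records, t=5.0)
```

Here `first_volumes["R"]` plays the role of |R|_t and `first_volumes["S"]` the role of |S|_t.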

(A2) Determine, according to the number of computing nodes and the first data volumes, the number of groups into which each first data volume needs to be divided, to obtain the first group numbers.

For example, again taking data sources S and R, if the stream processing system has n computing nodes, the data from data source R can be divided evenly into x groups and the data from data source S can be divided evenly into y groups, where:

x = [\sqrt{\frac{{| R |}_{t} \times n}{{| S |}_{t}}}]

y = [\sqrt{\frac{{| S |}_{t} \times n}{{| R |}_{t}}}]

That is, the first group number x corresponding to data source R is obtained by multiplying the first data volume corresponding to data source R by the number n of computing nodes, dividing this product by the first data volume corresponding to data source S, taking the square root of the result, and rounding.

The first group number y corresponding to data source S is obtained by multiplying the first data volume corresponding to data source S by the number n of computing nodes, dividing this product by the first data volume corresponding to data source R, taking the square root of the result, and rounding.

The x groups can each be numbered, for example x_1, x_2, ..., x_i, and so on, and the y groups can each be numbered, for example y_1, y_2, ..., y_j, and so on.
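The two formulas can be evaluated with a short Python sketch. The function name `group_counts` is hypothetical. The text only says the square root is rounded, so round-to-nearest and the lower bound of one group are assumptions made here for illustration; both data volumes are assumed to be positive.

```python
import math


def group_counts(vol_r, vol_s, n):
    """Return (x, y): the group numbers for data sources R and S on n nodes.

    Assumptions: both volumes are positive, rounding is to the nearest
    integer, and at least one group is always used.
    """
    x = max(1, round(math.sqrt(vol_r * n / vol_s)))
    y = max(1, round(math.sqrt(vol_s * n / vol_r)))
    return x, y


# R has four times the volume of S, so R is split into more groups.
x, y = group_counts(vol_r=400, vol_s=100, n=16)
```

Before rounding, the product of the two square roots is exactly n, so the x by y groups line up with the n computing nodes whenever the roots are integers, as in this example.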

(A3) Determine the data volume that each computing node can be assigned according to each first data volume and the corresponding first group number.

(A4) Distribute the inflowing data flow to each computing node according to the data volume that each computing node can be assigned, to establish the data structure.

In this case, each computing node only needs to store all the data from data source R that belongs to group x_i and all the data from data source S that belongs to group y_j, which achieves load balancing across the computing nodes.
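One way to picture this layout is as an x by y grid of computing nodes, where node (i, j) holds group x_i of R and group y_j of S. The Python sketch below is an illustration under that assumption; the class name `GridPartitioner` and the round-robin choice of group are hypothetical and are not stated in the patent.

```python
import itertools


class GridPartitioner:
    """Sketch: node (i, j) stores group i of source R and group j of source S."""

    def __init__(self, x, y):
        self.x, self.y = x, y
        self._next_r = itertools.cycle(range(x))  # round-robin over R groups
        self._next_s = itertools.cycle(range(y))  # round-robin over S groups
        self.store = {(i, j): {"R": [], "S": []}
                      for i in range(x) for j in range(y)}

    def add(self, source, item):
        """Place an item in one group and store it on every node of that group."""
        if source == "R":
            i = next(self._next_r)
            targets = [(i, j) for j in range(self.y)]
        else:
            j = next(self._next_s)
            targets = [(i, j) for i in range(self.x)]
        for node in targets:
            self.store[node][source].append(item)


grid = GridPartitioner(x=2, y=2)
for k in range(4):
    grid.add("R", f"r{k}")
    grid.add("S", f"s{k}")
```

With this layout every pair of one R item and one S item meets on exactly one node, so each node can evaluate the correlation condition on its own share of the pairs, and the stored volume per node is |R|/x plus |S|/y.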

It can be proved that this data distribution minimises the data volume and the computational load on each computing node in the stream processing system. The data volume T on any one computing node then satisfies:

T = \frac{{| R |}_{t}}{x} + \frac{{| S |}_{t}}{y} \leq \frac{3 ({| R |}_{t} + {| S |}_{t})}{2 \sqrt{n}}
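As a brief check of this bound, the following derivation substitutes the expressions for x and y while ignoring rounding. It is a sketch, not the proof referred to in the text; the reading that the factor 3/2 leaves room for rounding x and y to integers is an assumption.

```latex
% Ignoring the rounding of x and y:
\frac{|R|_t}{x} = |R|_t \sqrt{\frac{|S|_t}{|R|_t \, n}} = \sqrt{\frac{|R|_t \, |S|_t}{n}},
\qquad
\frac{|S|_t}{y} = \sqrt{\frac{|R|_t \, |S|_t}{n}}

% Summing both terms and applying 2\sqrt{ab} \leq a + b:
T = 2 \sqrt{\frac{|R|_t \, |S|_t}{n}}
  \leq \frac{|R|_t + |S|_t}{\sqrt{n}}
  \leq \frac{3 \left( |R|_t + |S|_t \right)}{2 \sqrt{n}}
```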

It should be noted that the optimal distribution of a data flow can be determined when the data flow enters the distributed system, but as the inflowing data changes, the data distribution does not stay optimal in real time, so the data flow then needs to be redistributed.

Suppose that at a certain moment t₁ the data is ideally distributed, i.e. it meets the condition in step 1, and the data volume on any one computing node is less than [...]. The load imbalance that the stream processing system can tolerate is that the data load of any computing node at that moment does not exceed [...], where ∈ is the load excess that a computing node can tolerate, and the value of ∈ can be set according to the requirements of the practical application. Then, letting u be [...], the data needs to be redistributed whenever the total data volume T injected into the stream processing system is a power of u, i.e. T = u⁰, u¹, ..., uⁱ, .... That is, the data processing method in the stream processing system may further comprise:

When the data volume that each computing node can be assigned is a power of u, redistributing the inflowing data flow to establish the data structure.
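The trigger condition can be sketched in Python as follows. The class name `RedistributionTrigger` is hypothetical, and u is taken as a plain parameter because its defining formula is not reproduced here. The text requires the monitored volume to equal a power of u; since a discrete volume can step over an exact power, the sketch fires when the volume reaches or passes the next power, which is an assumption.

```python
class RedistributionTrigger:
    """Sketch: signal a redistribution each time the volume passes a power of u."""

    def __init__(self, u):
        if u <= 1:
            raise ValueError("u must be greater than 1")
        self.u = u
        self.next_threshold = 1.0  # u ** 0

    def should_redistribute(self, volume):
        """Return True if volume has reached or passed the next power of u."""
        fired = False
        while volume >= self.next_threshold:
            self.next_threshold *= self.u  # advance to the next power of u
            fired = True
        return fired


trigger = RedistributionTrigger(u=2)
decisions = [trigger.should_redistribute(v) for v in (1, 1.5, 2, 3, 9)]
```

Because the thresholds grow geometrically, the number of redistributions grows only logarithmically with the monitored volume.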

For example, the step of "redistributing the inflowing data flow" can specifically be as follows:

(B1) Update the current time.

For example, update moment t₁ to moment t₂.

(B2) Determine, according to the data flow information, the data volume received from each data source up to the updated current time, to obtain the data volume corresponding to each data source. For convenience, this data volume is called the second data volume in the embodiments of the present invention.

The second data volume is determined in the same way as the first data volume. For example, take two data sources S and R that feed data into the stream processing system in real time. If the updated current time is t₂, then the data volume received from data source R up to moment t₂ can be expressed as |R|_t2, i.e. the second data volume corresponding to data source R is |R|_t2, and the data volume received from data source S can be expressed as |S|_t2, i.e. the second data volume corresponding to data source S is |S|_t2.

(B3) Determine, according to the number of computing nodes and the second data volumes, the number of groups into which each second data volume needs to be divided, to obtain the second group numbers.

For example, again taking data sources S and R, if the stream processing system has n computing nodes, the data from data source R can be divided evenly into x' groups and the data from data source S can be divided evenly into y' groups, where:

x' = [\sqrt{\frac{{| R |}_{t2} \times n}{{| S |}_{t2}}}]

y' = [\sqrt{\frac{{| S |}_{t2} \times n}{{| R |}_{t2}}}]

That is, the second group number x' corresponding to data source R is obtained by multiplying the second data volume corresponding to data source R by the number n of computing nodes, dividing this product by the second data volume corresponding to data source S, taking the square root of the result, and rounding.

The second group number y' corresponding to data source S is obtained by multiplying the second data volume corresponding to data source S by the number n of computing nodes, dividing this product by the second data volume corresponding to data source R, taking the square root of the result, and rounding.

The x' groups can each be numbered, for example x'_1, x'_2, ..., x'_i, and so on, and the y' groups can each be numbered, for example y'_1, y'_2, ..., y'_j, and so on.

(B4) Redetermine the data volume that each computing node can be assigned according to each second data volume and the corresponding second group number;

(B5) Distribute the inflowing data flow to the computing nodes according to the redetermined data volume that each computing node can be assigned, to establish the data structure.

In addition to redistributing newly inflowing data as described above, the data already held in each computing node can also be migrated to balance the data distribution, as follows:

Suppose data sources R and S are divided into x groups and y groups respectively at the initial distribution, and into x' groups and y' groups after redistribution. The redistribution strategy is described taking as an example the data interaction between any group y_j and all the computing nodes of the x' groups (that is, data that originally belonged to group y_j belongs to group x'_i after redistribution), as follows:

Divide the computing nodes in group y_j into [...] groups. Each group has x' computing nodes, numbered 1, 2, ..., x'. Computing nodes with the same number in any two of the r groups can then exchange data, so that the data volume from R on each computing node changes from [...] to [...]. The redistribution strategy for the other data source and for the other groups is the same and is not repeated here.

That is, when the inflowing data flow is redistributed according to the group numbers and the data volume that each computing node can be assigned, the data processing method may further comprise:

Migrating the data that needs to be migrated in each computing node, which for example can specifically be as follows:

Send a redistribution notice to the computing node that needs to carry out data migration, so that the computing node updates the time identifier of the data it stores, uses the updated time identifier as the time identifier of the data to be migrated, and migrates the data to be migrated to a target computing node.

Of course, besides migrating data to other computing nodes, a computing node can also receive data migrated from other computing nodes.

In addition, to avoid correlation queries being denied service because of data migration, a policy can be set so that when new data flows in during the above data migration, i.e. when the correlation query result needs to be updated, the new data can also take part in the query. That is, when the inflowing data flow is redistributed according to the group numbers and the data volume that each computing node can be assigned, the data processing method may further comprise:

Send a redistribution notice to the computing node that needs to carry out data migration, so that the computing node updates the time identifier of the data it stores and receives data migrated from other computing nodes. When the computing node determines that the time indicated by the time identifier carried by the migrated data is equal to the time indicated by its own updated time identifier, it stores the migrated data. When it determines that the time indicated by the time identifier carried by the migrated data is later than the time indicated by its own updated time identifier, it buffers the migrated data to wait for a new redistribution notice. When it determines that the time indicated by the time identifier carried by the migrated data is earlier than the time indicated by its own updated time identifier, it replaces the time identifier carried by the migrated data with its own updated time identifier and sends the migrated data to the other computing nodes.

It can be seen that, in the above data migration process, when new data flows in, i.e. when the correlation query result needs to be updated, the new data can be queried against all buffered data whose time identifier is the same as its own, and the query result is then returned. This ensures that all query results can eventually be returned correctly while avoiding the denial of service that data migration would otherwise cause.

104, the correlation inquiry of this data flow is carried out based on this data structure.

As can be seen from the above, this embodiment monitors the inflow of data flows in the stream processing system to obtain traffic flow information, where the traffic flow information comprises the data source identification, time marking and data stream size of each data flow; obtains the quantity of computing nodes in the stream processing system; distributes the data flow to the computing nodes with a uniform distribution algorithm according to the traffic flow information and the quantity of computing nodes, so as to set up a data structure; and carries out the correlation inquiry of the data flow based on this data structure. Because the data flow is distributed evenly across the computing nodes, the scheme, compared with the uneven distribution of the prior art, achieves load balancing with the largest possible speed-up and makes full use of system resources, which both reduces the system overhead of correlation inquiries and improves processing efficiency.

Embodiment two,

Based on the method described in embodiment one, a further detailed description is given below by way of example.

In this embodiment, the description takes as an example the case where the data processing equipment is integrated in the center coordinator node of the stream processing system.

As shown in Figure 2a, the stream processing system may comprise a center coordinator node and multiple computing nodes, for example n computing nodes, as follows:

(1) center coordinator node;

The center coordinator node is used to monitor the inflow of data flows in the stream processing system to obtain traffic flow information, obtain the quantity of computing nodes in the stream processing system, distribute the data flow to the computing nodes with a uniform distribution algorithm according to the traffic flow information and the quantity of computing nodes so as to set up a data structure, and carry out the correlation inquiry of the data flow based on this data structure.

In addition, the center coordinator node may also be used to redistribute the inflowing data flow, so as to set up the data structure, when the data volume that each computing node can be assigned is a power of u.

(2) computing node;

The computing node is used to receive the data flow distributed by the center coordinator node and to process the received data flow, for example to store it, and/or to perform the computation on the queried data during a correlation inquiry, and/or to migrate the data that needs to be migrated during redistribution.

In addition, as shown in Figure 2a, the network architecture may also comprise multiple source nodes, which are mainly used to send data flows to the stream processing system. For convenience, the embodiments of the present invention are described with source node R and source node S as examples.

Based on the above network architecture, the data processing flow is described in detail below.

As shown in Figure 2 b, the data processing method in a kind of stream processing system, idiographic flow can be as follows:

201, the center coordinator node monitors the inflow of data flows in the stream processing system to obtain traffic flow information.

For example, the center coordinator node may receive the data flow of each data source, such as the data flows of data source R and data source S, and record the data source identification, time of reception and data stream size of each data flow.

202, the center coordinator node obtains the quantity of computing nodes in the stream processing system.

A stream processing system generally comprises multiple computing nodes, and the specific number can be set according to the demands of the practical application. For convenience, this embodiment is described with the quantity of computing nodes being n.

203, according to the traffic flow information, the center coordinator node determines, for each data source, the data volume received from that source up to the current time, obtaining the first data volume corresponding to each data source.
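Steps 201 and 203 can be illustrated with a minimal sketch. It assumes that each recorded entry holds exactly the three fields named in step 201 and that the first data volume is the sum of the recorded sizes up to the current time; the names `FlowRecord` and `data_volume_until` are hypothetical and do not come from the original text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    source_id: str      # data source identification, e.g. "R" or "S"
    received_at: float  # time marking (time of reception)
    size: int           # data stream size


def data_volume_until(records, now):
    """Sum the data received from each source up to and including `now`."""
    volume = defaultdict(int)
    for rec in records:
        if rec.received_at <= now:
            volume[rec.source_id] += rec.size
    return dict(volume)


records = [
    FlowRecord("R", 1.0, 400),
    FlowRecord("S", 1.5, 100),
    FlowRecord("R", 2.0, 400),
    FlowRecord("S", 9.0, 50),  # arrives after the cut-off used below
]
first_volume = data_volume_until(records, now=2.0)
```

In this sketch the entry that arrives after the cut-off is simply ignored until the current time is updated.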

204, the center coordinator node determines, according to the quantity of computing nodes and the first data volumes, the group number into which each first data volume needs to be divided, obtaining the first group numbers.

x = [\sqrt{\frac{{| R |}_{t} \times n}{{| S |}_{t}}}]

y = [\sqrt{\frac{{| S |}_{t} \times n}{{| R |}_{t}}}]

Each of the x groups can be numbered, for example x₁, x₂, ..., x_i, and each of the y groups can be numbered, for example y₁, y₂, ..., y_j.
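The two formulas above can be evaluated as in the following sketch. The square brackets are interpreted here as rounding to the nearest integer with a minimum of one group, which is an assumption, since the text does not define them; `group_numbers` is a hypothetical helper name. Without rounding, the product of x and y equals n.

```python
import math


def group_numbers(r_volume, s_volume, n):
    """Return (x, y): group numbers for sources R and S on n computing nodes.

    Both volumes must be positive. Rounding to the nearest integer is an
    assumed reading of the square brackets in the formulas.
    """
    x = max(1, round(math.sqrt(r_volume * n / s_volume)))
    y = max(1, round(math.sqrt(s_volume * n / r_volume)))
    return x, y


balanced = group_numbers(100, 100, 16)  # equal volumes on 16 nodes
skewed = group_numbers(800, 100, 8)     # R is eight times larger than S
```

With equal volumes the nodes form a square arrangement, while a much larger R volume shifts almost all of the grouping to R.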

205, the center coordinator node determines, according to each first data volume and the corresponding first group number, the data volume that each computing node can be assigned.

206, the center coordinator node distributes the inflowing data flow to the computing nodes according to the data volume that each computing node can be assigned, so as to set up a data structure.

That is, each computing node only needs to store all data from data source R that falls in group x_i and all data from data source S that falls in group y_j, which achieves load balancing across the computing nodes.
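The storage rule above can be pictured as a grid of x by y computing nodes, where node (i, j) holds R group x_i and S group y_j. The sketch below assumes that each incoming item is put into a uniformly random group of its source; the text only speaks of a uniform distribution algorithm, so the random choice and the names `build_grid` and `place` are illustrative assumptions.

```python
import random


def build_grid(x, y):
    """Label the x * y computing nodes by (R group index, S group index)."""
    return {(i, j): {"R": [], "S": []} for i in range(x) for j in range(y)}


def place(grid, x, y, source, item, rng):
    """Store an item on every node that holds the group chosen for it."""
    if source == "R":
        i = rng.randrange(x)  # assumed: uniform choice of R group
        for j in range(y):
            grid[(i, j)]["R"].append(item)
    else:
        j = rng.randrange(y)  # assumed: uniform choice of S group
        for i in range(x):
            grid[(i, j)]["S"].append(item)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
grid = build_grid(2, 3)
for k in range(10):
    place(grid, 2, 3, "R", ("r", k), rng)
for k in range(6):
    place(grid, 2, 3, "S", ("s", k), rng)

# Every (R item, S item) pair meets on exactly one node.
pairs = [(r, s) for node in grid.values() for r in node["R"] for s in node["S"]]
```

Under this arrangement every pair of one R item and one S item is co-located on exactly one node, so a correlation inquiry can be answered from node-local data.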

It should be noted that the optimal distribution of the data flow can be determined when the data flow enters the distributed system. However, as the inflowing data changes, the data distribution cannot remain optimal in real time, so the data flow needs to be redistributed; that is, step 207 may also be performed.

207, when the data volume that each computing node can be assigned is a power of u, that is, when T = u⁰, u¹, ..., uⁱ, ..., the center coordinator node redistributes the inflowing data flow. For example, this can be as follows:

(1) The current time is updated.

Here, ε is the load excess that a computing node can tolerate, and the value of ε can be set according to the demands of the practical application.

(2) According to the traffic flow information, the center coordinator node determines, for each data source, the data volume received from that source up to the updated current time, obtaining the second data volume corresponding to each data source.

(3) The center coordinator node determines, according to the quantity of computing nodes and the second data volumes, the group number into which each second data volume needs to be divided, obtaining the second group numbers.

(4) The center coordinator node redetermines, according to each second data volume and the corresponding second group number, the data volume that each computing node can be assigned.

(5) The center coordinator node distributes the inflowing data flow to the computing nodes according to the redetermined data volume that each computing node can be assigned, so as to set up the data structure.

In addition, besides the above redistribution of newly inflowing data, the data already held in each computing node can also be migrated to balance the data distribution; that is, step 208 may also be performed, as follows:

208, the center coordinator node sends a fast resampling notice to each computing node that needs to migrate data.

209, after receiving the fast resampling notice, the computing node updates the time marking of the data it stores and migrates data, which can be as follows:

The computing node uses the updated time marking as the time marking of the data to be migrated and migrates that data to the target computing node; or it receives data migrated from other computing nodes and handles it as follows:

When the time indicated by the time marking carried by the migrated data equals the time indicated by the computing node's updated time marking, the computing node stores the migrated data;

When the time indicated by the time marking carried by the migrated data is greater than the time indicated by the computing node's updated time marking, the computing node buffers the migrated data and waits for a new fast resampling notice;

When the time indicated by the time marking carried by the migrated data is less than the time indicated by the computing node's updated time marking, the computing node replaces the time marking carried by the migrated data with its own updated time marking and sends the migrated data back to the other computing node.

For example, suppose the current time marking of computing node A is t₃ and it receives from computing node B migrated data whose time marking is t₄. When t₄ = t₃, computing node A stores the received data. When t₄ > t₃, system delay has kept computing node A from receiving the fast resampling notice so far, so computing node A buffers the received data and waits for the system's fast resampling notice. When t₄ < t₃, the received data was sent by some computing node during the first fast resampling while the second fast resampling has already started, so computing node A updates the time marking of the received data to t₃ and returns the data to the sender, namely computing node B, after which a new round of fast resampling notice is awaited. The other computing nodes operate in the same way, which is not described again here.
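The three cases in the example can be written as a small decision function. This is only a sketch: time markings are assumed to be comparable numbers, and the names `Action` and `handle_migrated` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    STORE = "store"
    BUFFER = "buffer"
    RETURN_TO_SENDER = "return_to_sender"


def handle_migrated(node_marking, data_marking):
    """Decide what a node does with migrated data.

    Returns the action and the time marking the data carries afterwards.
    """
    if data_marking == node_marking:
        return Action.STORE, data_marking
    if data_marking > node_marking:
        # The node has not yet received the newer notice: hold the data.
        return Action.BUFFER, data_marking
    # The data belongs to an earlier round: restamp it and send it back.
    return Action.RETURN_TO_SENDER, node_marking


same_round = handle_migrated(node_marking=3, data_marking=3)
early_data = handle_migrated(node_marking=3, data_marking=4)
stale_data = handle_migrated(node_marking=3, data_marking=2)
```

Only the stale case changes the time marking carried by the data, which matches the replacement described in step 209.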

210, the correlation inquiry of this data flow is carried out based on the data structure of above-mentioned foundation.

For example, after receiving a correlation inquiry request, the center coordinator node may send the request to each computing node; each computing node processes it against the data it stores and returns its result to the center coordinator node. Because the data has already been distributed evenly across the computing nodes, the load of the computing nodes is balanced and the data volume each needs to process is comparable. The resources of the whole stream processing system can therefore be fully utilized, which greatly improves processing efficiency compared with the prior art, where data processing can rely only on a subset of the computing nodes.

For the specific manner of the correlation inquiry, reference may be made to existing querying methods, which are not described again here.
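As a rough illustration of step 210, the sketch below has a coordinator hand the inquiry to every node and merge the partial results. The text leaves the query method open, so the equality match on the first field, the use of threads to stand in for computing nodes, and the names `local_join` and `correlation_inquiry` are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def local_join(node):
    """Match the R and S items stored on one node on equal keys (assumed)."""
    return [(r, s) for r in node["R"] for s in node["S"] if r[0] == s[0]]


def correlation_inquiry(nodes):
    """Send the inquiry to every node and concatenate the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_join, nodes))
    return [pair for part in partials for pair in part]


# Two nodes: each holds one R group and the single S group.
nodes = [
    {"R": [(1, "r1")], "S": [(1, "s1"), (2, "s2")]},
    {"R": [(2, "r2")], "S": [(1, "s1"), (2, "s2")]},
]
result = correlation_inquiry(nodes)
```

Each node works only on the data it stores, so with evenly distributed data the nodes finish in comparable time.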

Further, this scheme can also accept correlation inquiries during data migration and output correct results, avoiding the denial of service that data migration would otherwise cause.

Embodiment three,

To better implement the above method, the embodiment of the present invention also provides data processing equipment in a stream processing system. As shown in Figure 3, the data processing equipment comprises monitoring unit 301, acquiring unit 302, establishing unit 303 and query unit 304, as follows:

Monitoring unit 301 is used to monitor the inflow of data flows in the stream processing system to obtain traffic flow information.

Acquiring unit 302 is used to obtain the quantity of computing nodes in the stream processing system.

Establishing unit 303 is used to distribute the data flow to the computing nodes with a uniform distribution algorithm according to the traffic flow information and the quantity of computing nodes, so as to set up a data structure.

Query unit 304 is used to carry out the correlation inquiry of the data flow based on this data structure.

For example, establishing unit 303 may comprise a data volume determination subunit, a group number determination subunit and a distribution subunit, as follows:

The data volume determination subunit is used to determine, according to the traffic flow information and for each data source, the data volume received from that source up to the current time, obtaining the first data volume corresponding to each data source.

The group number determination subunit is used to determine, according to the quantity of computing nodes and the first data volumes, the group number into which each first data volume needs to be divided, obtaining the first group numbers.

x = [\sqrt{\frac{{| R |}_{t} \times n}{{| S |}_{t}}}]

y = [\sqrt{\frac{{| S |}_{t} \times n}{{| R |}_{t}}}]

The distribution subunit is used to determine, according to each first data volume and the corresponding first group number, the data volume that each computing node can be assigned, and to distribute the inflowing data flow to the computing nodes according to that data volume, so as to set up the data structure. That is, in this case each computing node only needs to store all data from data source R that falls in group x_i and all data from data source S that falls in group y_j, which achieves load balancing across the computing nodes; see the embodiments above for details, which are not repeated here.

It should be noted that the optimal distribution of the data flow can be determined when the data flow enters the distributed system. However, as the inflowing data changes, the data distribution cannot remain optimal in real time, so the data flow needs to be redistributed; that is, establishing unit 303 may also comprise a redistribution subunit, as follows:

The redistribution subunit is used to redistribute the inflowing data flow, so as to set up the data structure, when the data volume that each computing node can be assigned is a power of u.

Here, u is the sum of 1 and a first quotient, the first quotient being 3 times ε divided by 2 times √n, which is expressed by the formula:

u = 1 + \frac{3 \epsilon}{2 \sqrt{n}}

Here, ε is the load excess that a computing node can tolerate, n is the quantity of computing nodes in the stream processing system, and the values of ε and n can be set according to the demands of the practical application.
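The formula for u and the resulting sequence of trigger points can be sketched as follows. Treating the condition as the data volume reaching the next power of u is an interpretation, since measured volumes rarely equal a power exactly; `growth_factor` and `next_threshold` are hypothetical names.

```python
import math


def growth_factor(epsilon, n):
    """u = 1 + 3 * epsilon / (2 * sqrt(n))."""
    return 1 + 3 * epsilon / (2 * math.sqrt(n))


def next_threshold(volume, u):
    """Smallest power of u (u**0, u**1, ...) that is >= volume."""
    if u <= 1:
        raise ValueError("u must exceed 1, which requires epsilon > 0")
    threshold = 1.0
    while threshold < volume:
        threshold *= u
    return threshold


u = growth_factor(epsilon=0.5, n=9)    # 1 + 1.5 / 6
trigger = next_threshold(2.0, u)       # first power of u at or above 2.0
```

A larger ε or a smaller n gives a larger u, so the trigger points are spaced further apart and redistribution happens less often.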

For example, the redistribution subunit may be used to: update the current time when the data volume that each computing node can be assigned is a power of u; determine, according to the traffic flow information and for each data source, the data volume received from that source up to the updated current time, obtaining the second data volume corresponding to each data source; determine, according to the quantity of computing nodes and the second data volumes, the group number into which each second data volume needs to be divided, obtaining the second group numbers; redetermine, according to each second data volume and the corresponding second group number, the data volume that each computing node can be assigned; and distribute the inflowing data flow to the computing nodes according to the redetermined data volume, so as to set up the data structure.

In addition, besides the above redistribution of newly inflowing data, the data already held in each computing node can also be migrated to balance the data distribution; that is, establishing unit 303 may also comprise a migration subunit, as follows:

The migration subunit may be used to migrate the data that needs to be migrated in each computing node when the redistribution subunit redistributes the inflowing data flow according to the group number and the data volume that each computing node can be assigned. For example, this can be as follows:

The migration subunit may be used to send a fast resampling notice to each computing node that needs to migrate data, so that the node updates the time marking of the data it stores, uses the updated time marking as the time marking of the data to be migrated, and migrates that data to the target computing node.

Of course, besides migrating data to other computing nodes, a computing node can also receive data migrated from other computing nodes, that is:

The migration subunit may be used to send a fast resampling notice to each computing node that needs to migrate data, so that the node updates the time marking of the data it stores and then receives data migrated from other computing nodes. When the time indicated by the time marking carried by the migrated data equals the time indicated by the node's updated time marking, the node stores the migrated data. When that time is greater than the time indicated by the node's updated time marking, the node buffers the migrated data and waits for a new fast resampling notice. When that time is less than the time indicated by the node's updated time marking, the node replaces the time marking carried by the migrated data with its own updated time marking and sends the migrated data back to the other computing node.

In a specific implementation, the above units may each be realized as an independent entity, or may be combined arbitrarily and realized as one or several entities. For the specific implementation of the above units, see the method embodiments above, which are not repeated here.

The data processing equipment may be integrated in the center coordinator node of the stream processing system.

As can be seen from the above, monitoring unit 301 of this embodiment monitors the inflow of data flows in the stream processing system to obtain traffic flow information, where the traffic flow information comprises the data source identification, time marking and data stream size of each data flow. Acquiring unit 302 then obtains the quantity of computing nodes in the stream processing system, establishing unit 303 distributes the data flow to the computing nodes with a uniform distribution algorithm according to the traffic flow information and the quantity of computing nodes so as to set up a data structure, and finally query unit 304 carries out the correlation inquiry of the data flow based on this data structure. Because the data flow is distributed evenly across the computing nodes, the scheme, compared with the uneven distribution of the prior art, achieves load balancing with the largest possible speed-up and makes full use of system resources, which both reduces the system overhead of correlation inquiries and improves processing efficiency.

Embodiment four,

Accordingly, the embodiment of the present invention also provides a stream processing system comprising any data processing equipment in a stream processing system provided by the embodiments of the present invention; see embodiment three for details. For example, it can be as follows:

The data processing equipment is used to monitor the inflow of data flows in the stream processing system to obtain traffic flow information, obtain the quantity of computing nodes in the stream processing system, distribute the data flow to the computing nodes with a uniform distribution algorithm according to the traffic flow information and the quantity of computing nodes so as to set up a data structure, and carry out the correlation inquiry of the data flow based on the data structure.

A stream processing system generally comprises multiple computing nodes, and the specific number can be set according to the demands of the practical application.

The data flow can be distributed to the computing nodes with the uniform distribution algorithm, according to the traffic flow information and the quantity of computing nodes, in a number of ways. For example, it can be as follows:

The data processing equipment may be used to: determine, according to the traffic flow information and for each data source, the data volume received from that source up to the current time, obtaining the first data volume corresponding to each data source; determine, according to the quantity of computing nodes and the first data volumes, the group number into which each first data volume needs to be divided, obtaining the first group numbers; determine, according to each first data volume and the corresponding first group number, the data volume that each computing node can be assigned; and distribute the inflowing data flow to the computing nodes according to that data volume, so as to set up the data structure.

It should be noted that the optimal distribution of the data flow can be determined when the data flow enters the distributed system. However, as the inflowing data changes, the data distribution cannot remain optimal in real time, so the data flow needs to be redistributed; see the embodiments above for details, which are not repeated here.

In addition, the stream processing system may also comprise other equipment, for example multiple computing nodes, each of which can perform the following operation:

The computing node is used to receive the data flow distributed by the center coordinator node and to process the received data flow, for example to store it, and/or to perform the computation on the queried data during a correlation inquiry, and/or to migrate the data that needs to be migrated during redistribution, and so on.

Because the stream processing system can comprise any data processing equipment in a stream processing system provided by the embodiments of the present invention, it can achieve the beneficial effects achieved by that data processing equipment, which are not repeated here.

Embodiment five,

In addition, the embodiment of the present invention also provides a network device, which can serve as any data processing equipment in a stream processing system provided by the embodiments of the present invention. As shown in Figure 4, the network device may comprise memory 401 for storing data, transceiver interface 402 for sending and receiving data, and processor 403, wherein:

Processor 403 may be used to monitor the inflow of data flows in the stream processing system to obtain traffic flow information, obtain the quantity of computing nodes in the stream processing system, distribute the data flow to the computing nodes with a uniform distribution algorithm according to the traffic flow information and the quantity of computing nodes so as to set up a data structure, and carry out the correlation inquiry of the data flow based on the data structure.

Processor 403 may be used to: determine, according to the traffic flow information and for each data source, the data volume received from that source up to the current time, obtaining the first data volume corresponding to each data source; determine, according to the quantity of computing nodes and the first data volumes, the group number into which each first data volume needs to be divided, obtaining the first group numbers; determine, according to each first data volume and the corresponding first group number, the data volume that each computing node can be assigned; and distribute the inflowing data flow to the computing nodes according to that data volume, so as to set up the data structure.

It should be noted that the optimal distribution of the data flow can be determined when the data flow enters the distributed system. However, as the inflowing data changes, the data distribution cannot remain optimal in real time, so the data flow needs to be redistributed, that is:

Processor 403 may also be used to redistribute the inflowing data flow, so as to set up the data structure, when the data volume that each computing node can be assigned is a power of u; see the embodiments above for details, which are not repeated here.

For the specific implementation of each of the above operations, see the embodiments above, which are not repeated here.

As can be seen from the above, the network device of this embodiment monitors the inflow of data flows in the stream processing system to obtain traffic flow information, where the traffic flow information comprises the data source identification, time marking and data stream size of each data flow; obtains the quantity of computing nodes in the stream processing system; distributes the data flow to the computing nodes with a uniform distribution algorithm according to the traffic flow information and the quantity of computing nodes, so as to set up a data structure; and carries out the correlation inquiry of the data flow based on this data structure. Because the data flow is distributed evenly across the computing nodes, the scheme, compared with the uneven distribution of the prior art, achieves load balancing with the largest possible speed-up and makes full use of system resources, which both reduces the system overhead of correlation inquiries and improves processing efficiency.

A person of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may include: read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk, an optical disc, and the like.

The data processing method, device and system in a stream processing system provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to set forth the principle and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims

1. A data processing method in a stream processing system, characterized by comprising:

Obtaining the quantity of computing nodes in the stream processing system;

2. The method according to claim 1, characterized in that distributing the data flow to the computing nodes with the uniform distribution algorithm according to the traffic flow information and the quantity of computing nodes, so as to set up the data structure, comprises:

3. The method according to claim 2, characterized by further comprising:

4. The method according to claim 3, characterized in that redistributing the inflowing data flow comprises:

Updating the current time;

5. The method according to claim 3 or 4, characterized in that, when the inflowing data flow is redistributed, the method further comprises:

Migrating the data that needs to be migrated in each computing node.

6. The method according to claim 5, characterized in that migrating the data that needs to be migrated in each computing node comprises:

7. The method according to claim 5, characterized in that migrating the data that needs to be migrated in each computing node comprises:

8. Data processing equipment in a stream processing system, characterized by comprising:

9. The data processing equipment according to claim 8, characterized in that the establishing unit comprises a data volume determination subunit, a group number determination subunit and a distribution subunit;

10. The data processing equipment according to claim 9, characterized in that the establishing unit further comprises a redistribution subunit;

11. The data processing equipment according to claim 10, characterized in that,

12. The data processing equipment according to claim 10 or 11, characterized in that the establishing unit further comprises a migration subunit;

13. The data processing equipment according to claim 12, characterized in that,

14. The data processing equipment according to claim 12, characterized in that,

15. A stream processing system, characterized by comprising the data processing equipment in a stream processing system according to any one of claims 8 to 14.