CN111367951A

CN111367951A - Method and device for processing stream data

Info

Publication number: CN111367951A
Application number: CN202010131762.0A
Authority: CN
Inventors: 康雪丹; 姜黎明; 王大飞; 江旻
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2020-02-29
Filing date: 2020-02-29
Publication date: 2020-07-03

Abstract

The invention discloses a method and a device for processing stream data. The method comprises: obtaining, from monitored stream data, various types of service data that meet a screening rule; for each type of service data, extracting the service data according to a preset structure of the service data to obtain service data of set latitudes; grouping the various types of service data of the set latitudes according to a preset grouping rule; and processing the service data of the set latitudes in each group according to a preset operator of that group. Because the service data is first extracted according to its preset structure and, after grouping, is processed by the preset operator of each group, the real-time calculation is split into stages whose calculation logic is not excessively coupled, and the preset operator of each group can be reused by other calculation models, so that stream data is processed more efficiently.

Description

Method and device for processing stream data

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for processing stream data.

Background

In recent years, with the rapid development of information technology, data volumes have grown rapidly. The processing capacity of a single computer is far from sufficient for massive data, which has driven the research and development of distributed systems. How to rapidly analyze massive data and extract useful information from it is currently a research hotspot in the field of distributed computing, and stream computing has emerged in response.

Unlike traditional data stored on a disk or in memory, stream data in these application scenarios has two characteristics. Real-time performance: the data stream is generated in real time, and analysis results must be obtained in real time. Continuity: the data stream is generated continuously and flows without end.

Stream computing is widely used because it suits data with these characteristics. Typical existing distributed stream computing frameworks include Storm, Spark Streaming, and Flink. These frameworks offer good real-time performance and fault tolerance in a distributed environment, but they are too tightly coupled to specific service scenarios, which increases development and maintenance costs. The stream computing logic is opaque to service personnel, and because the online operation of a product changes rapidly, every change to the calculation logic must be redeveloped by developers. This hinders rapid service expansion, fails to meet service requirements, and leads to low code reuse and a certain waste of system resources. In stream computing scenarios, a general-purpose stream computing framework is therefore relatively heavyweight, highly coupled, and offers little support for heterogeneity.

Disclosure of Invention

The application provides a method and a device for processing stream data, which are used for solving the problem of how to conveniently and efficiently process stream data.

In a first aspect, an embodiment of the present application provides a method for stream data processing, including:

acquiring various service data which accord with the screening rule from the monitored stream data;

extracting the service data according to a preset structure of the service data aiming at each type of service data to obtain service data of a set latitude; the preset structure comprises at least one set latitude;

grouping various service data with set latitude according to a preset grouping rule;

and processing the service data of the set latitude in the packet according to a preset operator of each packet.

According to this scheme, the service data is extracted according to its preset structure to obtain service data of the set latitude, and after grouping, the service data of the set latitude in each group is processed according to the preset operator of that group. The real-time calculation is thereby split into stages whose calculation logic is not excessively coupled, and the preset operator of each group can be reused and flexibly combined by other calculation models, so that stream data is processed more efficiently.

Optionally, the screening rule includes at least one of: a set data source, set category service data, and a set time window.

According to the scheme, data screening is carried out by setting the data source and the category or the time window of the service data, the data format is unified, the useless data are filtered, and the calculation is more efficient.

Optionally, the extracting the service data according to the preset structure of the service data to obtain the service data with the set latitude includes:

according to the preset structure of the service data, constructing a data matrix for the service data in the same time window; each service data corresponds to one row in the data matrix, and the same set latitude of each service data corresponds to one column in the data matrix.

According to the scheme, the screened data is constructed into the matrix, so that the data in the same column corresponds to the same set latitude, and the processing of the streaming data is more convenient and efficient.

Optionally, a latitude primary key of each group is set in the grouping rule, and the set latitude includes the latitude primary key;

grouping various service data with set latitudes according to a preset grouping rule, comprising the following steps:

and obtaining the business data of the set latitude of each group aiming at the latitude main key of each group, wherein the business data in each group conforms to the mode of the data matrix.

According to the scheme, the data are grouped through the latitude main key, and the calculation efficiency and the accuracy are improved.

Optionally, the preset operator includes a latitude index and an operator for calculating the latitude index, and the set latitude includes the latitude index;

processing the service data of the set latitude in the packet according to the preset operator of each packet, comprising:

and calling the operator to process the business data with the set latitude in the packet according to the latitude index of the packet to obtain the calculation result of the packet in the latitude index.

According to the scheme, the operators are abstracted and calculated, and are flexibly combined and configured to be reused by other calculation models, so that the flow processing capacity of mass data is realized.

Optionally, the invoking the operator to process the service data of the set latitude in the packet includes:

and calling the operator to process the column data in the grouped data matrix.

Optionally, after the processing the service data of the set latitude in the packet, the method further includes:

and outputting the processed calculation result according to a preset output template.

According to the scheme, different stream data processing results are butted with the database through the preset output template, so that the processing process is more efficient.

In a second aspect, an embodiment of the present application provides an apparatus for stream data processing, where the apparatus includes:

the acquisition module is used for acquiring various service data which accord with the screening rule from the monitored stream data;

the processing module is used for extracting the service data according to a preset structure of the service data aiming at each type of service data to obtain the service data with the set latitude; the preset structure comprises at least one set latitude;

the processing module is further used for grouping various service data with set latitudes according to a preset grouping rule;

and the processing module is further used for processing the service data of the set latitude in the packet according to the preset operator of each packet.

Optionally, the screening rule includes at least one of: a set data source, set category service data, and a set time window.

Optionally, latitude primary keys of the groups are set in the grouping rule, and the set latitude includes the latitude primary key;

the processing module is specifically configured to:

obtain, for the latitude primary key of each group, the service data of the set latitude of that group, where the service data in each group conforms to the format of the data matrix.

Optionally, the preset operator includes a latitude index and an operator for calculating the latitude index, and the set latitude includes the latitude index;

the processing module is specifically configured to:

call the operator, according to the latitude index of the group, to process the service data of the set latitude in the group and obtain the calculation result of the group on the latitude index.

Optionally, the processing module is specifically configured to:

and calling the operator to process the column data in the grouped data matrix.

Optionally, the processing module is further configured to:

and after the service data of the set latitude in the group is processed, outputting the processed calculation result according to a preset output template.

Correspondingly, an embodiment of the present invention further provides a computing device, including:

a memory for storing program instructions;

and the processor is used for calling the program instructions stored in the memory and executing the streaming data processing method according to the obtained program.

Accordingly, embodiments of the present invention also provide a computer-readable non-volatile storage medium, which includes computer-readable instructions, and when the computer-readable instructions are read and executed by a computer, the computer is caused to execute the above-mentioned method for processing streaming data.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a system framework of a method for processing streaming data according to an embodiment of the present invention;

fig. 2 is a schematic flow chart of a method for processing stream data according to an embodiment of the present invention;

fig. 3 is a schematic diagram of a method for processing stream data according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of an apparatus for processing stream data according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to solve the problems in the prior art, an embodiment of the present invention provides a method for processing streaming data, and the method for processing streaming data provided in the embodiment of the present invention may be applied to a system architecture as shown in fig. 1, where the system architecture includes a streaming data collection device 100 and a service processing device 200.

The stream data acquisition device 100 sends the acquired stream data to the service processing device 200, and the service processing device 200 processes the stream data.

It should be noted that fig. 1 is only an example of a system architecture according to an embodiment of the present application, and the present application is not limited to this specifically.

Based on the system architecture illustrated in fig. 1, fig. 2 is a schematic flowchart corresponding to a method for processing stream data according to an embodiment of the present invention. The flow may be executed by a device for stream data processing, which may be the service processing device 200 described above. As shown in fig. 2, the method includes:

step 201, acquiring various service data meeting the screening rule from the monitored stream data;

step 202, extracting the service data according to a preset structure of the service data for each type of service data to obtain the service data of the set latitude.

It should be noted that the preset structure includes at least one set latitude, that is, at least one field (dimension) of the service data that is to be extracted.

Step 203, grouping the various types of service data of the set latitude according to a preset grouping rule.

Step 204, processing the service data of the set latitude in each group according to the preset operator of that group.

In a possible implementation, the method for stream data processing is performed based on a stream computation framework Spark Streaming.

Before specifically describing the scheme of the present application, first, a brief description is given of Spark Streaming:

Spark Streaming decomposes stream computing into a series of short batch processing jobs. That is, the input data of Spark Streaming is divided into segments (a discretized stream) at a preset time interval (for example, 1 second): Spark Streaming ingests data from a real-time data stream and divides it into small batches, which are then processed by the Spark engine.
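The micro-batching idea can be illustrated with a minimal plain-Python sketch. It does not use Spark and is not how Spark Streaming is implemented internally; the function name `split_into_batches` and the `(timestamp, payload)` event shape are assumptions made only for illustration.

```python
from collections import defaultdict


def split_into_batches(events, interval_seconds):
    """Group (timestamp, payload) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Integer division maps each timestamp to the index of its interval.
        batch_index = int(timestamp // interval_seconds)
        batches[batch_index].append(payload)
    # Return the non-empty batches in time order.
    return [batches[index] for index in sorted(batches)]


# Hypothetical events: (seconds since start, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
batches = split_into_batches(events, interval_seconds=1)
```

With a one-second interval, the first two events fall into the same batch and the remaining two each form their own batch.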

Based on this, in step 201, the following steps are first performed to acquire various types of service data that meet the screening rule.

S2011, a time window is set.

For example, the set time window may be every minute, every five minutes, every half hour, every day, every week, etc., which is not specifically limited in this application.

As another example, when the set time window is one minute, every 60 s of data forms a batch.

S2012, a data source is set.

It should be noted that the scheme of the present application supports extracting data from multiple data sources, such as RMB, Kafka, Flume, ZeroMQ, Kinesis, and the like.

S2013, a service data category is set.

For example, WeChat loans, installment payments, and the like.

It should be noted that the above order is only a typical one; for example, S2013 may precede S2012, and this is not specifically limited in this application.

As can be seen from the above, the stream data is filtered, and the filtering rule includes: a set data source, set category service data, and a set time window. By screening the time range of the streaming data and the content category, the streaming data of each small batch can be uniformly and pertinently processed later. The processing flow is described in detail below.
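As a hedged illustration of such a screening rule, the sketch below keeps only events that match a set data source, a set category, and a set time window. The class `ScreeningRule`, the function `screen`, and the event keys `source`, `category`, `ts`, and `body` are hypothetical names that do not come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningRule:
    data_source: str     # set data source, e.g. "RMB"
    category: str        # set service data category
    window_seconds: int  # length of the set time window


def screen(events, rule, window_start):
    """Return the events that match the rule inside [window_start, window_end)."""
    window_end = window_start + rule.window_seconds
    return [
        event
        for event in events
        if event["source"] == rule.data_source
        and event["category"] == rule.category
        and window_start <= event["ts"] < window_end
    ]


# Hypothetical events; only the first one satisfies all three conditions.
events = [
    {"source": "RMB", "category": "WCD", "ts": 3, "body": "x"},
    {"source": "Kafka", "category": "WCD", "ts": 4, "body": "y"},
    {"source": "RMB", "category": "OTHER", "ts": 10, "body": "z"},
    {"source": "RMB", "category": "WCD", "ts": 61, "body": "w"},
]
rule = ScreeningRule(data_source="RMB", category="WCD", window_seconds=60)
selected = screen(events, rule, window_start=0)
```

Keeping the rule as plain data rather than code is one way to let the screening conditions change without redeveloping the processing logic.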

In step 202, according to the preset structure of the service data, a data matrix is constructed from the service data in the same time window.

It should be noted that each piece of service data corresponds to one row in the data matrix, and the same set latitude of each piece of service data corresponds to one column in the data matrix.

For example, the following two structures are defined:

the account opening structure is as follows:

Business scenario	Customer ID	Account opening status	Account opening time	Channel
Opening an account	ID_NO	Success	2020-01-01	Mobile phone

Borrowing structure:

Business scenario	Customer ID	Borrowing time	Borrowing amount	Borrowing status
Borrowing money	ID_NO	2020-01-01	100.0	Success

For example, the set time window is 5 seconds; all service data received within those 5 seconds is accumulated and forms a data matrix as follows:

time1 account opening data (account opening structure)

time2 borrowing data (borrowing structure)

time3 account opening data (account opening structure)

time4 account opening data (account opening structure)

In the embodiment of the present application, before the data matrix is constructed, the screened stream data is parsed in a preset manner, for example by splitting each record on a separator.
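The parsing and matrix construction can be sketched as follows. This is a minimal illustration under stated assumptions: the separator `|`, the scenario codes `OPEN` and `LOAN`, and every field name other than `BIZ_TYPE` and `ID_NO` are hypothetical, and a dimension that a record does not carry is stored as `None`.

```python
# Preset structures: the ordered fields of each service scenario (assumed names).
STRUCTURES = {
    "OPEN": ["BIZ_TYPE", "ID_NO", "OPEN_STATUS", "OPEN_TIME", "CHANNEL"],
    "LOAN": ["BIZ_TYPE", "ID_NO", "LOAN_TIME", "LOAN_AMOUNT", "LOAN_STATUS"],
}


def parse_record(line, separator="|"):
    """Split one raw record and label its values using the preset structure."""
    values = line.split(separator)
    fields = STRUCTURES[values[0]]  # the first value selects the structure
    if len(values) != len(fields):
        raise ValueError(f"record does not match its preset structure: {line!r}")
    return dict(zip(fields, values))


def build_matrix(lines, columns):
    """One row per record, one column per set dimension (None when absent)."""
    rows = [parse_record(line) for line in lines]
    return [[row.get(column) for column in columns] for row in rows]


lines = [
    "OPEN|ID_1|SUCCESS|2020-01-01|MOBILE",
    "LOAN|ID_2|2020-01-01|100.0|SUCCESS",
]
matrix = build_matrix(lines, columns=["BIZ_TYPE", "ID_NO", "LOAN_AMOUNT"])
```

Each record becomes one row, and the same dimension of every record lands in the same column, which is what later allows an operator to work on a whole column at once.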

In a possible implementation manner, after the various types of service data meeting the screening rule are obtained, preliminary data filtering may be performed.

After the service data has been extracted, the range and type checks for the input fields are defined through a query engine, for example an SQL component. Stream data that does not satisfy the SQL condition, or does not match the predefined types, is filtered out directly.

For example, "BIZ_TYPE = 'LOAN' and ID_NO is not null" indicates that data whose service scenario is LOAN and whose ID is not null is selected, and the remaining data that does not meet the condition is filtered out.
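The effect of such a condition can be tried out with Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in for the SQL component; the table name `batch`, the column `AMOUNT`, the scenario values, and the sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical batch: (BIZ_TYPE, ID_NO, AMOUNT).
rows = [
    ("LOAN", "ID_2", 100.0),
    ("OPEN", "ID_1", None),
    ("LOAN", None, 50.0),  # missing ID, so it must be filtered out
]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE batch (BIZ_TYPE TEXT, ID_NO TEXT, AMOUNT REAL)")
connection.executemany("INSERT INTO batch VALUES (?, ?, ?)", rows)

# Keep only rows whose scenario is LOAN and whose ID is present.
kept = connection.execute(
    "SELECT BIZ_TYPE, ID_NO, AMOUNT FROM batch "
    "WHERE BIZ_TYPE = 'LOAN' AND ID_NO IS NOT NULL"
).fetchall()
connection.close()
```

Only the first row survives: the second has a different scenario and the third has a null ID.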

Based on this, the detailed procedure of grouping various types of service data with set latitudes in step 203 is described in detail below.

In the embodiment of the application, the latitude primary key of each group is set in the grouping rule, and the set latitude includes the latitude primary key.

Based on this, the service data of the set latitude of each group is obtained according to the latitude primary key of that group.

It should be noted that the service data within each group conforms to the format of the data matrix.

Combining the above steps, the service groups are created after the data source and the service data category have been selected.
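A minimal sketch of grouping by a primary key is shown below. It assumes each row of the data matrix is held as a dictionary and that the key is formed from one or more columns; the function name `group_by_primary_key` and the column `AMOUNT` are hypothetical.

```python
from collections import defaultdict


def group_by_primary_key(rows, key_fields):
    """Partition rows so that rows sharing the same key values form one group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        groups[key].append(row)  # rows keep their matrix format inside the group
    return dict(groups)


rows = [
    {"BIZ_TYPE": "LOAN", "ID_NO": "ID_2", "AMOUNT": 100.0},
    {"BIZ_TYPE": "LOAN", "ID_NO": "ID_3", "AMOUNT": 200.0},
    {"BIZ_TYPE": "LOAN", "ID_NO": "ID_2", "AMOUNT": 200.0},
]
groups = group_by_primary_key(rows, key_fields=("BIZ_TYPE", "ID_NO"))
```

Because the rows inside a group are left unchanged, each group is itself a smaller data matrix that the later stages can process column by column.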

Further, in step 204, the preset operator includes a latitude index and an operator for calculating the latitude index, and the set latitude includes the latitude index;

In the embodiment of the present application, the latitude index indicates the latitude along which a group of indexes is calculated, such as the customer latitude, the merchant latitude, or the mobile phone number latitude. The field corresponding to the statistical latitude is selected to form the latitude index.

In the embodiment of the application, the latitude indexes are predefined and correspond to the index IDs one by one.

For example, a latitude index CUST_PAY_SUCCESS is defined, corresponding to the number of orders successfully placed by a client. The latitude index is defined over "client ID" and "order placed successfully", and the operator that calculates the latitude index is a sum.

Specifically, according to the latitude index of the group, the operator is called to process the service data of the set latitude in the group, and the calculation result of the group on the latitude index is obtained.

Further, the operator is called to process the column data in the data matrix of the group.
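The column-wise call can be sketched as follows, assuming each index definition names the column it reads and the operator applied to that column. The index IDs `CUST_LOAN_AMOUNT` and `CUST_LOAN_COUNT`, the column `LOAN_AMOUNT`, and the helper names are hypothetical.

```python
# Index ID -> (column the operator reads, operator applied to that column).
INDEX_DEFINITIONS = {
    "CUST_LOAN_AMOUNT": ("LOAN_AMOUNT", sum),
    "CUST_LOAN_COUNT": ("LOAN_AMOUNT", len),
}


def column(rows, name):
    """Extract one column of a group's data matrix."""
    return [row[name] for row in rows]


def compute_indexes(rows, definitions):
    """Apply every configured operator to its column and collect the results."""
    return {
        index_id: operator_fn(column(rows, column_name))
        for index_id, (column_name, operator_fn) in definitions.items()
    }


# One group: the successful borrowings of a single customer.
group_rows = [
    {"ID_NO": "ID_2", "LOAN_AMOUNT": 100.0},
    {"ID_NO": "ID_2", "LOAN_AMOUNT": 200.0},
]
results = compute_indexes(group_rows, INDEX_DEFINITIONS)
```

Since an operator only ever sees a list of column values, the same operator can be attached to a different column or a different group without modification.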

The above description introduces specific grouping and calculation processes, and the following description describes a specific index calculation definition method.

In the embodiment of the application, the index calculation definition comprises an index name and a calculation model.

Specifically, the index name includes an index name and an index description.

In the embodiment of the application, when the calculation model customizes a single index, a predefined calculation mode can be selected. The available modes are mainly intermediate operators of the real-time framework (hbase store, hbase query, hbase deduplication, spark SQL query) or general operators.

The general operators include statistical calculation (20 operators such as SUM/COUNT/DIS_COUNT/DETAIL_LIST/late), judgment calculation (for example >, <, and the like), and logical calculation (and, or, not, and the like).
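One possible way to organize such general operators is a set of lookup tables, sketched below. Only operators whose meaning is evident from their names are included; the exact semantics assigned to `DIS_COUNT` (distinct count) and `DETAIL_LIST` (list of values), and the helper `evaluate`, are assumptions.

```python
import operator

# Statistical operators reduce a column of values to a single result.
STATISTICAL = {
    "SUM": sum,
    "COUNT": len,
    "DIS_COUNT": lambda values: len(set(values)),  # assumed: distinct count
    "DETAIL_LIST": list,                           # assumed: list of the values
}

# Judgment operators compare a result with a threshold.
JUDGMENT = {">": operator.gt, "<": operator.lt}

# Logical operators combine boolean results.
LOGICAL = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "not": lambda a: not a,
}


def evaluate(values, stat, comparison, threshold):
    """Apply a statistical operator to a column, then a judgment operator."""
    return JUDGMENT[comparison](STATISTICAL[stat](values), threshold)


amounts = [100.0, 200.0, 200.0]
total_over_limit = evaluate(amounts, "SUM", ">", 400.0)
few_distinct = evaluate(amounts, "DIS_COUNT", "<", 3)
both = LOGICAL["and"](total_over_limit, few_distinct)
```

Selecting operators by name makes it possible to express an index calculation as configuration, which matches the stated goal of reusing and combining operators across calculation models.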

In order to better explain the invention, a specific example is described below in connection with fig. 3.

As shown in fig. 3:

First, the data source RMB is monitored. It carries a plurality of events, such as input1, input2, ..., inputN in fig. 3, which form a dynamic event stream.

In the embodiment of the application, the configuration is cached and reloaded on a timer: the configuration information is loaded every 5 minutes, and it is loaded according to the event ID. For example, the event ID is RMB_WCD_LOAN, where RMB is the set data source, WCD is the set service data category, and LOAN is the specific service scenario.
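A cache of this kind might look like the sketch below. It is an illustration only: the class `ConfigCache` refreshes an entry lazily when it is older than the time-to-live, which is a simplification of a timer-driven reload, and the configuration returned by the hypothetical `load_config` is just the event ID split into its three parts.

```python
import time


class ConfigCache:
    """Caches per-event configuration and reloads entries older than the TTL."""

    def __init__(self, loader, ttl_seconds=300, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # event ID -> (load time, configuration)

    def get(self, event_id):
        now = self._clock()
        entry = self._entries.get(event_id)
        if entry is None or now - entry[0] >= self._ttl:
            entry = (now, self._loader(event_id))
            self._entries[event_id] = entry
        return entry[1]


def parse_event_id(event_id):
    """Split an event ID into data source, category, and service scenario."""
    source, category, scenario = event_id.split("_", 2)
    return {"source": source, "category": category, "scenario": scenario}


# A fake clock lets the reload be observed without waiting five minutes.
fake_now = [0.0]
load_calls = []


def load_config(event_id):
    load_calls.append(event_id)
    return parse_event_id(event_id)


cache = ConfigCache(load_config, ttl_seconds=300, clock=lambda: fake_now[0])
first = cache.get("RMB_WCD_LOAN")
cache.get("RMB_WCD_LOAN")  # served from the cache, no reload
fake_now[0] = 301.0
cache.get("RMB_WCD_LOAN")  # older than five minutes, so it is reloaded
```

Injecting the clock keeps the sketch deterministic; a production system would more likely refresh the cache from a background timer.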

Specifically, all events of the product a are reported to a data source, and operations such as login, account opening, borrowing and loan placing are included.

In the embodiment of the present application, service scenarios, such as the aforementioned LOAN, are distinguished according to the BIZ_TYPE keyword, and a data structure is defined for each single service scenario. Specifically, struct1, struct2, ..., structN in fig. 3 correspond to structures such as the account opening structure and the borrowing structure defined above.

Then, the time window of the data source is set to 5 seconds, and all service data are accumulated in 5 seconds to form a data matrix.

The method comprises the following specific steps:

Opening an account	ID_1	Success	2020-01-01	Mobile phone
Borrowing money	ID_2	2020-01-01	100.0	Success
Borrowing money	ID_3	2020-02-01	200.0	Success
Borrowing money	ID_2	2020-01-01	200.0	Success
Opening an account	ID_2	Success	2020-02-01	Mobile phone

Further, a time window accumulates a batch of data, and the following process is performed.

S301, the corresponding configuration is loaded according to the latitude primary key (BIZ_TYPE), and the batch of data (5 s) is grouped by the latitude primary key to obtain different groups, such as the data groups shown in fig. 3, specifically as follows:

Group one: the latitude primary key is a borrowing event, and the client ID is ID_2;

the latitude indexes are the latest borrowing time of the client, the borrowing amount of the client, and the number of borrowings of the client.

Group two: the latitude primary key is a borrowing event, and the client ID is ID_3;

As can be seen from the above, the grouping condition is that the event is a loan, the borrowing status is a success, and the borrowing success indicator is counted according to the client ID.

S302, according to the configuration and the data flow, service grouping information is obtained in real time and comprises information such as service date, latitude main keys and latitude indexes.

Based on the above, the obtained service grouping information is as follows:

After the grouping is completed, Group1, Group2, ..., GroupN in fig. 3 are formed, and group one is as follows:

Borrowing money	ID_2	2020-01-01	100.0	Success
Borrowing money	ID_2	2020-01-01	200.0	Success

Meanwhile, grouping two is as follows:

Borrowing money	ID_3	2020-02-01	200.0	Success

It should be noted that, in the embodiment of the present application, the data of different groups is computed in parallel, while the indexes of the same group are computed serially. One latitude index corresponds to one or more operators, such as operator 1, operator 2, ..., operator N in fig. 3.
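The combination of parallel groups and serial indexes can be sketched with a thread pool, using the two borrowing groups from the example. The index names, the column names `LOAN_TIME` and `LOAN_AMOUNT`, and the use of `ThreadPoolExecutor` are assumptions; the application does not state which concurrency mechanism is used.

```python
from concurrent.futures import ThreadPoolExecutor

# The two groups of the example, keyed by (scenario, customer ID).
groups = {
    ("LOAN", "ID_2"): [
        {"LOAN_TIME": "2020-01-01", "LOAN_AMOUNT": 100.0},
        {"LOAN_TIME": "2020-01-01", "LOAN_AMOUNT": 200.0},
    ],
    ("LOAN", "ID_3"): [
        {"LOAN_TIME": "2020-02-01", "LOAN_AMOUNT": 200.0},
    ],
}

# Ordered (index name, operator) pairs; ISO dates compare correctly as strings.
INDEXES = [
    ("LATEST_LOAN_TIME", lambda rows: max(row["LOAN_TIME"] for row in rows)),
    ("LOAN_AMOUNT_SUM", lambda rows: sum(row["LOAN_AMOUNT"] for row in rows)),
    ("LOAN_COUNT", len),
]


def compute_group(rows):
    """Compute the indexes of one group one after another (serially)."""
    result = {}
    for name, operator_fn in INDEXES:
        result[name] = operator_fn(rows)
    return result


# Different groups are submitted to the pool and may run in parallel.
keys = list(groups)
with ThreadPoolExecutor() as pool:
    computed = pool.map(compute_group, [groups[key] for key in keys])
    results = dict(zip(keys, computed))
```

Groups share no state, so they can be processed independently, whereas keeping the indexes of one group in a fixed order leaves room for a later index to depend on an earlier one.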

In the above, the latitude index includes a plurality of single index calculations, and the configuration of the single index calculation is briefly described as follows:

It should be noted that the calculation range includes a cycle range, which supports minutes/hours/days/weeks/months/years, and a numerical range, which is used for range checking.

The operators in the calculation model may be intermediate operators of the real-time framework (hbase store, hbase query, hbase deduplication, spark SQL query) or general operators.

In the embodiment of the application, after the service data of the set latitude in the group has been processed, the calculation result is output according to a preset output template.

It should be noted that, the present solution may associate multiple static data sources and tables, and the output mode may be the following two types:

Timed output: driven by a timer, which updates the index table, for example by querying and outputting the indexes every five minutes.

Immediate output: event-driven; the index is synchronized immediately after it is updated, and the result is written directly to the index library once a single group has been processed.
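The two output modes can be contrasted with a small sketch. The comma-separated `OUTPUT_TEMPLATE`, the class `IndexStore`, and the record keys are hypothetical, and the in-memory list `written` merely stands in for the index library.

```python
from string import Template

# Hypothetical output template: one comma-separated line per index value.
OUTPUT_TEMPLATE = Template("$biz_date,$primary_key,$index_id,$value")


class IndexStore:
    """Collects index results and writes them in timed or immediate mode."""

    def __init__(self, mode):
        if mode not in ("timed", "immediate"):
            raise ValueError(f"unknown output mode: {mode}")
        self.mode = mode
        self.pending = []  # results waiting for the next timer tick
        self.written = []  # stand-in for the index library

    def update(self, record):
        line = OUTPUT_TEMPLATE.substitute(record)
        if self.mode == "immediate":
            self.written.append(line)  # synchronized as soon as it is updated
        else:
            self.pending.append(line)  # held back until flush() is called

    def flush(self):
        """Called by a timer in timed mode to write everything accumulated."""
        self.written.extend(self.pending)
        self.pending.clear()


record = {
    "biz_date": "2020-01-01",
    "primary_key": "ID_2",
    "index_id": "LOAN_AMOUNT_SUM",
    "value": 300.0,
}

immediate_store = IndexStore("immediate")
immediate_store.update(record)

timed_store = IndexStore("timed")
timed_store.update(record)
written_before_flush = list(timed_store.written)
timed_store.flush()
```

In immediate mode the line is available right after the update, while in timed mode nothing is written until the timer triggers the flush.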

Based on the same inventive concept, fig. 4 exemplarily illustrates a stream data processing apparatus provided by an embodiment of the present invention, which may execute the flow of the stream data processing method.

The apparatus for stream data processing includes:

an obtaining module 401, configured to obtain various types of service data that meet a screening rule from monitored stream data;

a processing module 402, configured to extract, for each type of service data, the service data according to a preset structure of the service data, so as to obtain service data of a set latitude; the preset structure comprises at least one set latitude;

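Optionally, latitude primary keys of the groups are set in the grouping rule, and the set latitude includes the latitude primary key;

the processing module 402 is specifically configured to:

obtain, for the latitude primary key of each group, the service data of the set latitude of that group, where the service data in each group conforms to the format of the data matrix.

Optionally, the preset operator includes a latitude index and an operator for calculating the latitude index, and the set latitude includes the latitude index;

the processing module 402 is specifically configured to:

call the operator, according to the latitude index of the group, to process the service data of the set latitude in the group and obtain the calculation result of the group on the latitude index.

Optionally, the processing module 402 is specifically configured to:

call the operator to process the column data in the data matrix of the group.

Optionally, the processing module 402 is further configured to:

output the processed calculation result according to a preset output template after the service data of the set latitude in the group has been processed.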

Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including:

a memory for storing program instructions;

Based on the same inventive concept, the embodiment of the present invention also provides a computer-readable non-volatile storage medium, which includes computer-readable instructions, and when the computer reads and executes the computer-readable instructions, the computer is enabled to execute the method for processing the stream data.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method of stream data processing, comprising:

2. The method of claim 1, wherein the filtering rule comprises at least one of: a set data source, set category service data, and a set time window.

3. The method of claim 1, wherein the extracting the service data according to the preset structure of the service data to obtain the service data of the set latitude comprises:

4. The method of claim 1, wherein the grouping rule is set with a primary latitude key for each group, and the set latitude comprises the primary latitude key;

5. The method of any one of claims 1 to 4, wherein the preset operator comprises a latitude index and an operator that calculates the latitude index, and the set latitude comprises the latitude index;

6. The method of claim 5, wherein invoking the operator to process the set latitude of business data within the packet comprises:

and calling the operator to process the column data in the grouped data matrix.

7. The method of claim 5, wherein after processing the set latitude of traffic data within a packet, further comprising:

8. An apparatus for stream data processing, comprising:

9. A computing device, comprising:

a memory for storing program instructions;

a processor for calling program instructions stored in said memory to perform the method of any of claims 1 to 7 in accordance with the obtained program.

10. A computer-readable non-transitory storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 1 to 7.