CN116049190A - Kafka-based data processing method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN116049190A
CN116049190A (application CN202310096169.0A)
Authority
CN
China
Prior art keywords
data
data table
output
target input
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310096169.0A
Other languages
Chinese (zh)
Other versions
CN116049190B (en)
Inventor
沈彬彬
袁阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongdian Jinxin Software Co Ltd
Original Assignee
Zhongdian Jinxin Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongdian Jinxin Software Co Ltd filed Critical Zhongdian Jinxin Software Co Ltd
Priority to CN202310096169.0A priority Critical patent/CN116049190B/en
Publication of CN116049190A publication Critical patent/CN116049190A/en
Application granted granted Critical
Publication of CN116049190B publication Critical patent/CN116049190B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/2282 - Tablespace storage structures; Management thereof
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/242 - Query formulation
    • G06F16/2433 - Query languages
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24553 - Query execution of query operations
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a Kafka-based data processing method, apparatus, computer device and storage medium. The method includes the following steps: determining, based on a first command line, a target input data table corresponding to a data processing rule, and determining a target input message topic having a binding relationship with the target input data table; acquiring the target input data table from the target input message topic; reading input data from the target input data table based on a second command line; obtaining output data through the data processing rule; when no output data table and output message topic corresponding to the output data exist, creating the output data table and an output message topic bound to it, and storing the output data in the output message topic; or, when the output data table and the output message topic already exist, updating the output data table in the output message topic. The method can resolve the data-dependence problem in real-time data processing and improve the efficiency of publishing data processing rules.

Description

Kafka-based data processing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of real-time data processing, and in particular, to a data processing method, apparatus, computer device, and storage medium based on Kafka.
Background
In data processing there is often a data-dependence problem: if data A depends on data B, data A can only be computed after data B has been generated.
In the prior art, data dependence is usually handled by micro-batch scheduling, i.e., data is processed in batches at intervals of several hours or a day. In a real-time data processing environment, however, handling data dependence through micro-batch scheduling reduces the timeliness of data processing and cannot meet service requirements.
Disclosure of Invention
Based on this, in view of the above technical problems, it is necessary to provide a Kafka-based data processing method, apparatus, computer device and storage medium.
In a first aspect, the present application provides a data processing method based on Kafka. The method comprises the following steps:
determining a target input data table corresponding to a data processing rule based on a first command line in the data processing rule, and determining a target input message topic having a binding relationship with the target input data table;
acquiring the target input data table from the target input message topic of Kafka;
reading input data from the target input data table based on a second command line in the data processing rule;
processing the input data through the data processing rule to obtain output data;
when no output data table and output message topic corresponding to the output data exist, creating the output data table, instructing Kafka to create the output message topic bound to the output data table, storing the output data in the output data table, and storing the output data table in the output message topic;
or, when the output data table and the output message topic corresponding to the output data exist, updating the output data table in the output message topic based on the output data.
In one embodiment, determining the target input message topic having a binding relationship with the target input data table includes:
acquiring metadata of the target input data table, and determining from that metadata the target input message topic bound to the target input data table;
and instructing Kafka to create the output message topic bound to the output data table includes:
instructing Kafka to create the output message topic;
and recording the binding relationship between the output data table and the output message topic in the metadata of the output data table.
In one embodiment, the data processing rule includes a first data processing rule, the target input data table includes an original data table, and processing the input data through the data processing rule to obtain output data includes:
according to a statistical-dimension field name in the first data processing rule, obtaining the target statistical dimension corresponding to that field name from the dimension table associated with the target field of the original data table, and determining each target statistical interval from the target statistical dimension, where the dimension table includes at least one statistical dimension and each statistical dimension includes at least one statistical interval;
and separately counting, through the first data processing rule, the data in the target field corresponding to each target statistical interval, to obtain output data.
In one embodiment, before determining the target input data table corresponding to the data processing rule based on the first command line in the data processing rule, the method further includes:
acquiring the dimension table corresponding to each field of the original data table;
and associating the dimension table of any field of the original data table with the original data table.
In one embodiment, before determining the target input data table corresponding to the data processing rule based on the first command line in the data processing rule, the method further includes:
acquiring the original data table from a data source;
and, when no message topic corresponding to the original data table exists, creating an original-data message topic, storing the original data table in the original-data message topic, and recording that the original data table is bound to the original-data message topic.
In one embodiment, after storing the output data table in the output message topic, the method further includes:
for each target output data corresponding to a data screening rule, reading that target output data from the output message topic bound to its output data table;
for any target output data, determining hit data from it according to the data threshold corresponding to that target output data in the data screening rule;
reading the original data table from the original-data message topic;
and determining the target data corresponding to each hit data from the original data table, and storing the target data in a database.
In one embodiment, after storing the output data table in the output message topic, the method further includes:
storing the output data table in a database;
and acquiring the target input data table from the target input message topic includes:
reading the target input data table from the target input message topic when the metadata of the data table in the target input message topic is the same as the metadata of the target input data table; or, reading the target input data table from the database when the metadata of the data table in the target input message topic differs from the metadata of the target input data table.
In a second aspect, the present application also provides a Kafka-based data processing apparatus. The apparatus includes:
a first determining module, configured to determine a target input data table corresponding to a data processing rule based on a first command line in the data processing rule, and determine a target input message topic having a binding relationship with the target input data table;
a first acquisition module, configured to acquire the target input data table from the target input message topic of Kafka;
a first reading module, configured to read input data from the target input data table based on a second command line in the data processing rule;
a processing module, configured to process the input data through the data processing rule to obtain output data;
and a first creation module, configured to, when no output data table and output message topic corresponding to the output data exist, create the output data table, instruct Kafka to create the output message topic bound to the output data table, store the output data in the output data table, and store the output data table in the output message topic;
or, when the output data table and the output message topic corresponding to the output data exist, update the output data table in the output message topic based on the output data.
In one embodiment, the first determining module is further configured to:
acquire metadata of the target input data table, and determine from that metadata the target input message topic bound to the target input data table;
and the first creation module is further configured to:
instruct Kafka to create the output message topic;
and record the binding relationship between the output data table and the output message topic in the metadata of the output data table.
In one embodiment, the data processing rule includes a first data processing rule, the target input data table includes an original data table, and the processing module is further configured to:
according to a statistical-dimension field name in the first data processing rule, obtain the target statistical dimension corresponding to that field name from the dimension table associated with the target field of the original data table, and determine each target statistical interval from the target statistical dimension, where the dimension table includes at least one statistical dimension and each statistical dimension includes at least one statistical interval;
and separately count, through the first data processing rule, the data in the target field corresponding to each target statistical interval, to obtain output data.
In one embodiment, the apparatus further includes:
a second acquisition module, configured to acquire the dimension table corresponding to each field of the original data table;
and an association module, configured to associate the dimension table of any field of the original data table with the original data table.
In one embodiment, the apparatus further includes:
a third acquisition module, configured to acquire the original data table from a data source;
and a second creation module, configured to, when no message topic corresponding to the original data table exists, create an original-data message topic, store the original data table in the original-data message topic, and record that the original data table is bound to the original-data message topic.
In one embodiment, the apparatus further includes:
a second reading module, configured to, for each target output data corresponding to a data screening rule, read that target output data from the output message topic bound to its output data table;
a second determining module, configured to determine hit data from any target output data according to the data threshold corresponding to that target output data in the data screening rule;
a third reading module, configured to read the original data table from the original-data message topic;
and a fourth determining module, configured to determine the target data corresponding to each hit data from the original data table, and store the target data in a database.
In one embodiment, the apparatus further includes:
a storage module, configured to store the output data table in a database;
and the first acquisition module is further configured to:
read the target input data table from the target input message topic when the metadata of the data table in the target input message topic is the same as the metadata of the target input data table; or,
read the target input data table from the database when the metadata of the data table in the target input message topic differs from the metadata of the target input data table.
In a third aspect, the present application also provides a computer device. The computer device includes a memory storing a computer program and a processor that implements any of the methods above when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. A computer program is stored on the computer-readable storage medium and, when executed by a processor, implements any of the methods above.
In a fifth aspect, the present application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, implements any of the methods above.
According to the above Kafka-based data processing method, apparatus, computer device and storage medium, each data table has a corresponding message topic in Kafka, so a data processing rule can automatically locate the target input message topic corresponding to its target input data table from the first command line and read the input data of that table from the topic. After the input data has been processed into output data, if the output data table and its message topic do not yet exist, the output message topic can be created and the output data table stored in it, so that any other data processing rule that needs the output data table can automatically find the topic and read the table from it. Because the resulting output data table is stored in Kafka and data is transmitted through Kafka, the timeliness of the data is preserved and each rule computes only after the input data it depends on has been generated, which resolves the data-dependence problem in real-time stream processing. Meanwhile, because the target input message topic corresponding to the target input data table is determined automatically from the first command line of the data processing rule, and the input data is acquired from that topic, the user does not need to set the rule's data source manually, which improves the efficiency of publishing data processing rules and, in turn, the efficiency of real-time stream processing.
Drawings
FIG. 1 is a flow diagram of a data processing method based on Kafka in one embodiment;
FIG. 2 is a flow diagram of a data processing method based on Kafka in one embodiment;
FIG. 3 is a flow chart of step 108 in one embodiment;
FIG. 4 is a schematic diagram of a dimension table in one embodiment;
FIG. 5 is a flow diagram of a data processing method based on Kafka in one embodiment;
FIG. 6 is a flow diagram of a data processing method based on Kafka in one embodiment;
FIG. 7 is a schematic diagram of a Kafka-based data processing method in one embodiment;
FIG. 8 is a schematic diagram of a Kafka-based data processing method in one embodiment;
FIG. 9 is a schematic diagram of a Kafka-based data processing method in one embodiment;
FIG. 10 is a block diagram of a data processing apparatus based on Kafka in one embodiment;
FIG. 11 is an internal block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in FIG. 1, a Kafka-based data processing method is provided. For illustration, this embodiment is described as applied to a terminal; it should be understood that the method may also be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between them. In this embodiment, the method includes the following steps:
step 102, determining a target input data table corresponding to the data processing rule based on the first command line in the data processing rule, and determining a target input message theme with a binding relation with the target input data table.
In this embodiment of the application, a data processing rule processes input data to obtain output data, and may be a program script generated from Flink SQL. The user sets a specific rule for processing the input data; the terminal wraps the user's rule with the generic parts of a Flink SQL script to form the data processing rule and uploads it to a Flink cluster to run. For example, if the user's rule is "compute the daily average of data A", the terminal may take data A as the input data, construct a command that computes the daily average of data A using Flink SQL's averaging function, and combine that command with the generic commands of the Flink SQL script to form the data processing rule.
The first command line of a data processing rule is the command line in the program script that indicates the target input data table in which the input data resides. The first command line carries the specific indicator "FROM": when "FROM" is detected, "FROM" and the data that follows it together constitute the first command line, and the data after "FROM" identifies the target input data table. For example, if data A is the data in some field of table A, whose name is TableA, the user can designate TableA as the target input data table when data A needs to be read. When the Flink SQL script is generated, the first command line is "FROM TableA".
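Illustratively, locating the target input data table from the first command line amounts to scanning the rule script for the "FROM" indicator and the table name that follows it. The following Python sketch is hypothetical; the helper name and the rule text are illustrative and not part of the disclosed implementation:

```python
import re

def target_input_tables(rule_script: str) -> list:
    # The "first command line" is the FROM indicator plus the data table
    # name that follows it; collect every such table name in the script.
    return re.findall(r"\bFROM\s+(\w+)", rule_script, flags=re.IGNORECASE)

print(target_input_tables("SELECT FieldA FROM TableA"))  # ['TableA']
```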
A message topic (Topic) is a class into which data is divided in the message middleware Kafka; different data can be stored under different message topics. The original data table collected directly from the data source, and every data table derived from it through the various data processing rules, each have a corresponding message topic in Kafka. That is, the correspondence between message topics and data tables is one-to-one. When a field of a data table serves as the input data of a data processing rule, the message topic corresponding to that data table is called the target input message topic. The terminal can read the target input data table of a data processing rule from the rule's first command line and determine the target input message topic corresponding to that table, so that the rule can obtain its input data from the target input message topic.
In one embodiment, determining the target input message topic having a binding relationship with the target input data table includes:
acquiring metadata of the target input data table, and determining from that metadata the target input message topic bound to the target input data table.
In this embodiment, the message topic corresponding to each data table may be recorded in the table's metadata. When the target input data table needs to be acquired, its metadata can be looked up by the table's name, and the target input message topic corresponding to the table can be obtained from that metadata.
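Illustratively, the metadata lookup described above can be sketched as follows; the in-memory dictionary standing in for the metadata store, and all names in it, are hypothetical:

```python
# A plain dictionary stands in for the metadata store: each data table's
# metadata records the message topic it is bound to.
table_metadata = {
    "TableA": {"bound_topic": "topic_table_a", "fields": ["FieldA"]},
}

def bound_topic(table_name: str) -> str:
    # Look up the table's metadata by name and return the bound topic.
    return table_metadata[table_name]["bound_topic"]

print(bound_topic("TableA"))  # topic_table_a
```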
Step 104: acquiring the target input data table from the target input message topic of Kafka.
In this embodiment, once the target input message topic has been determined, the target input data table can be obtained from that topic of Kafka, because the correspondence between message topics and data tables is one-to-one; the input data is then read from the target input data table.
Step 106: reading the input data from the target input data table based on the second command line in the data processing rule.
In this embodiment, the second command line is the command line in the program script that indicates which input data the data processing rule needs to read. The second command line carries the specific indicator "SELECT": when "SELECT" is detected, "SELECT" and the data that follows it together constitute the second command line, and the data after "SELECT" identifies the input data. For example, if data A is the data under field A of table A, and field A is named FieldA in TableA, the second command line may be "SELECT FieldA". The input data is the data under a field of the target input data table. For example, if the target input data table has several fields such as "user id", "timestamp", "transaction amount", "commodity id" and "area id", and the user needs to count the total transaction amount of each user, the fields to be used are the "user id" field and the "transaction amount" field, and the required input data are the user ids under the "user id" field and the transaction amounts under the "transaction amount" field.
Because a data source table (source) must be created for a Flink SQL script before data can be read through it, the data source table can first be created from the target input data table and the target input message topic, using Flink SQL's ability to create a source table over a Kafka message topic. The terminal can then read the input data from the target input message topic through the data source table.
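Illustratively, a hypothetical helper could assemble the DDL for such a Kafka-backed source table. The connector option names follow the Flink SQL Kafka connector documentation; the table name, topic name, server address and column types are placeholders:

```python
def kafka_source_ddl(table: str, topic: str, columns: dict) -> str:
    # Assemble a Flink SQL CREATE TABLE statement that exposes a Kafka
    # message topic as a data source table (source) for the rule script.
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE TABLE {table} (\n  {cols}\n) WITH (\n"
        "  'connector' = 'kafka',\n"
        f"  'topic' = '{topic}',\n"
        "  'properties.bootstrap.servers' = 'localhost:9092',\n"
        "  'scan.startup.mode' = 'earliest-offset',\n"
        "  'format' = 'json'\n"
        ")"
    )

ddl = kafka_source_ddl("TableA", "topic_table_a", {"FieldA": "STRING"})
print(ddl)
```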
Step 108: processing the input data through the data processing rule to obtain the output data.
In this embodiment, the input data is processed by the data processing rule, and the output data is the data produced by that processing. For example, if the data processing rule counts the total transaction amount of each user, the output data is the sum of the transaction amounts corresponding to each user id.
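Illustratively, the "total transaction amount per user" rule can be sketched in plain Python; the field names are illustrative stand-ins for the fields described above:

```python
from collections import defaultdict

def total_per_user(rows: list) -> dict:
    # Group the rows by the "user id" field and sum the "transaction
    # amount" field, mirroring the example rule in the text.
    totals = defaultdict(float)
    for row in rows:
        totals[row["user_id"]] += row["amount"]
    return dict(totals)

rows = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u2", "amount": 5.0},
    {"user_id": "u1", "amount": 2.5},
]
print(total_per_user(rows))  # {'u1': 12.5, 'u2': 5.0}
```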
Step 110: when no output data table and output message topic corresponding to the output data exist, creating the output data table, instructing Kafka to create the output message topic bound to the output data table, storing the output data in the output data table, and storing the output data table in the output message topic;
or, when the output data table and the output message topic corresponding to the output data exist, updating the output data table in the output message topic based on the output data.
In this embodiment, the output data table is the data result table (sink) of the Flink SQL script, through which the output data is saved. Continuing the example above, recording the total transaction amount of each user requires two fields in the data table: one for the user id and one for the total transaction amount corresponding to that user id, so the output data table may contain a "user id" field and a "total transaction amount" field.
The user may specify the table name of the output data table in the data processing rule. When the output data table already exists, it has a corresponding message topic in Kafka, and the output data table in that topic can be updated by appending. When the output data table does not exist, it can be created under the specified table name, Kafka can be instructed to create the output message topic corresponding to it, the output data can be stored in the output data table, and the output data table can be stored in the output message topic, so that other data processing rules that use the output data table can read its data from the output message topic and process it.
Illustratively, when the output message topic does not exist, Kafka can be instructed to create it by specifying the topic's name in Flink. The principle is that when a user specifies in Flink that a data table is to be stored under a message topic with a given name, and no topic of that name exists in Kafka, Kafka automatically creates the topic with that name. After instructing Kafka to create the output message topic, the output data table can be stored in it.
Illustratively, since each data table has a corresponding data identifier in Flink, a message topic whose name is derived from the data identifier of the output data table can be created, which makes the topics easier to manage in Kafka.
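Illustratively, the create-if-missing behavior and the naming of topics after data identifiers can be sketched as follows. The set and dictionary merely stand in for Kafka's topic list and the metadata store, and all names are hypothetical; in practice Kafka auto-creates the topic when a table is first written under a name that does not yet exist:

```python
existing_topics = set()   # stands in for Kafka's list of topics
table_metadata = {}       # stands in for the metadata store

def ensure_output_topic(table_name: str, data_id: str) -> str:
    # Derive the topic name from the table's data identifier so that
    # topics in Kafka can be traced back to the table they store.
    topic = f"topic_{data_id}"
    if topic not in existing_topics:
        # Kafka would auto-create the topic on first write; modeled here
        # by adding it to the stand-in topic list.
        existing_topics.add(topic)
    # Record the binding in the output data table's metadata.
    table_metadata.setdefault(table_name, {})["bound_topic"] = topic
    return topic

print(ensure_output_topic("TableB", "tbl_b_001"))  # topic_tbl_b_001
```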
In one embodiment, instructing Kafka to create the output message topic bound to the output data table includes:
instructing Kafka to create the output message topic;
and recording, in the metadata of the output data table, the binding relationship between the output data table and the output message topic.
In this embodiment, when no output data table and corresponding output message topic exist, after the output data table has been created and Kafka has been instructed to create the output message topic, the binding between the two is recorded in the metadata of the output data table. When another data processing rule later needs to process the output data table, it can read the corresponding message topic from that metadata and read the output data from the topic.
The process by which one data processing rule generates an output data table that other rules then read and process can be described briefly as follows. Suppose there are data processing rules A and B: the input data of rule A is field a of data table A and its output data table is data table B; the input data of rule B is field b of data table B and its output data table is data table C. Rule A continuously listens to message topic A corresponding to data table A, and rule B continuously listens to message topic B corresponding to data table B. When data table A in topic A is updated, rule A consumes the updated data in field a of data table A, generates output data, and stores it in topic B corresponding to data table B. Because rule A has stored data in topic B, rule B, which is listening to topic B, consumes the updated data in field b of data table B from topic B, generates data table C, and stores data table C in topic C corresponding to data table C.
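Illustratively, the chaining of rules A and B through topics A, B and C can be simulated as follows; the doubling and increment computations are arbitrary stand-ins for the two rules, and the dictionary of lists stands in for Kafka's topics:

```python
from collections import defaultdict

topics = defaultdict(list)  # topic name -> records published to it

def rule_a(record):
    # Rule A consumes field a of data table A and stores its output
    # (a row of data table B) in topic B.
    out = {"b": record["a"] * 2}   # arbitrary stand-in computation
    topics["topic_B"].append(out)
    return out

def rule_b(record):
    # Rule B, listening on topic B, consumes field b of data table B
    # and stores its output (a row of data table C) in topic C.
    out = {"c": record["b"] + 1}   # arbitrary stand-in computation
    topics["topic_C"].append(out)
    return out

topics["topic_A"].append({"a": 3})   # data table A is updated
for rec in topics["topic_A"]:
    rule_b(rule_a(rec))              # rule B fires because A wrote to topic B
print(topics["topic_C"])  # [{'c': 7}]
```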
In one embodiment, as shown in fig. 2, after storing the output data table in the output message topic in step 110, the method further includes:
Step 202: according to each target output data item corresponding to the data filtering rule, read each target output data item from the output message topic bound to the output data table to which that item belongs.
Step 204: for any target output data item, determine hit data from it according to the data threshold corresponding to that item in the data filtering rule.
Step 206: read the original data table from the original data message topic.
Step 208: determine the target data corresponding to each piece of hit data from the original data table, and store the target data in a database.
In this embodiment of the present application, the data filtering rule is a rule that determines target data from the original data table according to the original data table and each target output data item. The target output data is the data the filtering rule needs to read, and may be part or all of the output data. For example, if the data filtering rule is to filter out clients whose daily transaction count exceeds 3, the target output data corresponding to the rule is the daily transaction count of each client calculated by the data processing rule. After the clients whose daily transaction count exceeds 3 (i.e., the hit data) are filtered out, the data records corresponding to those clients (the target data) can be determined from the original data table and stored in a database such as HBase, MySQL, or ClickHouse for subsequent analysis.
The data filtering rule may also be a Flink SQL script automatically generated by the terminal. The terminal can display the data available for filtering to the user according to all existing metadata; after the user specifies the data required for filtering and the data threshold corresponding to each item, the terminal automatically generates the corresponding Flink SQL script.
A data filtering rule may filter based on multiple target output data items. For example, clients with more than 3 daily transactions and a single transaction amount greater than 50,000 yuan may be screened at the same time. According to the target output data "client daily transaction count" and its data threshold "3", group A, the clients with more than 3 transactions (i.e., the hit data for "client daily transaction count"), can be screened out; according to the target output data "transaction amount" and its data threshold "50,000", group B, the clients with a single transaction amount greater than 50,000 yuan (i.e., the hit data for "transaction amount"), can be screened out. Then, for group C, the clients belonging to both group A and group B, the data records corresponding to group C (the target data) can be obtained from the original data table and stored in the database.
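The combined screening step above can be sketched as follows; the client identifiers, counts, and amounts are made-up example values, not data from the patent:

```python
# Target output data: daily transaction count and max single transaction
# amount per client (hypothetical values).
daily_counts = {"c1": 5, "c2": 2, "c3": 4}
max_amounts = {"c1": 80000, "c2": 90000, "c3": 30000}

# Hit data for each target output data item against its threshold.
group_a = {c for c, n in daily_counts.items() if n > 3}          # count > 3
group_b = {c for c, amt in max_amounts.items() if amt > 50000}   # amount > 50,000

# Group C: clients hitting both rules at once.
group_c = group_a & group_b

# Target data: the matching records pulled back from the original data table.
original_table = [
    {"client": "c1", "record": "..."},
    {"client": "c2", "record": "..."},
    {"client": "c3", "record": "..."},
]
target_data = [row for row in original_table if row["client"] in group_c]
```

In production this intersection would be expressed in the generated Flink SQL rather than in Python; the sketch only shows the set logic.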
According to the data processing method provided by the embodiment of the application, each data table has a corresponding message topic in Kafka, so the data processing rule automatically finds the target input message topic corresponding to the target input data table according to the first command line, and reads the input data under the target input data table from that topic. After the input data is processed into output data, if the output data table and its message topic do not yet exist, the output message topic can be created and the output data table stored in it, so that when other data processing rules need the output data table, they can automatically find the output message topic and read the table from it. Because the resulting output data table is stored in Kafka and data is transmitted through Kafka, the real-time performance of the data is guaranteed, and each rule is guaranteed to compute only after the input data it depends on has been generated, which solves the data dependence problem in real-time data stream processing. Meanwhile, since the target input message topic is determined automatically through the first command line of the data processing rule and the input data is acquired from it, the user does not need to manually set the data source of the rule, which improves the release efficiency of data processing rules and hence the efficiency of real-time data stream processing.
In one embodiment, as shown in fig. 3, the data processing rule includes a first data processing rule, the target input data table includes an original data table, and in step 108, processing the input data by the data processing rule to obtain output data includes:
Step 302: according to the statistical dimension field name in the first data processing rule, obtain the target statistical dimension corresponding to that field name from the dimension table corresponding to the target field of the original data table, and determine each target statistical interval from the target statistical dimension, wherein the dimension table comprises at least one statistical dimension and each statistical dimension comprises at least one statistical interval.
Step 304: through the first data processing rule, separately count the data corresponding to each target statistical interval in the target field to obtain the output data.
In this embodiment of the present application, the first data processing rule is a rule for processing the original data table, where the original data table is data obtained from a data source that has not yet been processed by any data processing rule. Illustratively, the original data table may be a transaction record table obtained from a business system.
Since the original data table generally does not contain the statistical dimensions required for data statistics, statistics need to be performed through the dimension table associated with the original data table. Referring to fig. 4, suppose the original data table has a "city identifier" field whose data is an identifier representing the city in which the user is located. The different statistical dimensions required for counting the "city identifier" field may be recorded in the dimension table associated with that field: for example, a "city" dimension, a "province" dimension and a "region" dimension. The data in a statistical dimension are its statistical intervals; for example, the data in the "city" dimension may be city names, each city name being a statistical interval in the "city" dimension, and the data in the "province" dimension may be province names, each province name likewise being a statistical interval in the "province" dimension. The dimension table records the correspondence between city identifiers and the statistical intervals under the different statistical dimensions. Taking city identifier "1" for "Shijiazhuang", "2" for "Zhangjiakou" and "3" for "Tianjin" as an example: in the "city" dimension, "1" corresponds to "Shijiazhuang", "2" to "Zhangjiakou" and "3" to "Tianjin"; in the "province" dimension, "1" and "2" both correspond to "Hebei" and "3" corresponds to "Tianjin"; and in the "region" dimension, "1", "2" and "3" all correspond to "North China".
The first data processing rule may be a rule that counts the data corresponding to a certain statistical interval in the original data table. For example, if the first data processing rule is to count the total consumption of users in Hebei province, the statistical dimension field name in the rule may be "province". The target statistical dimension corresponding to the "province" field name is obtained from the dimension table, the target statistical interval "Hebei" is determined from the statistical intervals under the "province" dimension, and the city identifiers corresponding to "Hebei" are determined to be "1" and "2". Further, the data records whose city identifier is "1" or "2" can be screened from the original data table, and the consumption amounts in those records summed to obtain the output data.
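The dimension-table lookup just described can be sketched as follows; the city identifiers and province/region mappings follow the example in the text, while the amounts and field names are illustrative:

```python
# Dimension table: city identifier -> statistical interval under each
# statistical dimension (values taken from the example above).
dimension_table = {
    "1": {"city": "Shijiazhuang", "province": "Hebei", "region": "North China"},
    "2": {"city": "Zhangjiakou", "province": "Hebei", "region": "North China"},
    "3": {"city": "Tianjin", "province": "Tianjin", "region": "North China"},
}

# Original data table with a "city_id" field and hypothetical amounts.
original_table = [
    {"city_id": "1", "amount": 100},
    {"city_id": "2", "amount": 250},
    {"city_id": "3", "amount": 400},
]

def total_for_interval(dimension, interval):
    """Sum 'amount' over records whose city identifier maps to the target
    statistical interval under the given statistical dimension."""
    ids = {cid for cid, dims in dimension_table.items()
           if dims[dimension] == interval}
    return sum(r["amount"] for r in original_table if r["city_id"] in ids)
```

For the "Hebei" interval this resolves identifiers "1" and "2" first and then aggregates only their records, exactly the two-step lookup the paragraph describes.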
According to the data processing method provided by the embodiment of the application, the target field of the original data table can be counted through the first data processing rule, so that the output data required for further processing can be obtained from the target field; moreover, multi-dimensional statistics can be performed on the target field, enriching the amount of data available during data processing.
In one embodiment, as shown in fig. 5, before determining the target input data table corresponding to the data processing rule based on the first command line in the data processing rule in step 102, the method further includes:
Step 502, obtaining a dimension table corresponding to each field of the original data table.
Step 504, for any field of the original data table, associating the dimension table of the field with the original data table.
In this embodiment of the present application, the dimension table may be stored in an external database such as HBase or MySQL, and may be associated with the corresponding field of the original data table through the Flink SQL facility for joining a dimension table with a data table, so that the statistical dimensions required for counting that field can be obtained through the dimension table.
According to the data processing method based on Kafka, the dimension table corresponding to the field of the original data table can be associated with the original data table, so that when the original data table is counted through the first data processing rule, the original data table can be counted from multiple dimensions, and the available data volume during subsequent data processing is enriched.
In one embodiment, as shown in fig. 6, before determining the target input data table corresponding to the data processing rule based on the first command line in the data processing rule in step 102, the method further includes:
step 602, obtaining an original data table from a data source.
Step 604: when no message topic corresponding to the original data table exists, create an original data message topic, store the original data table in it, and record that the original data table is bound to the original data message topic.
In this embodiment of the present application, the data source may be an external service system. The original data table may be obtained from the data source through a Flink CDC script or another third-party tool and stored in the original data message topic corresponding to it. For the specific process of creating the original data message topic, refer to the related description of creating the output message topic in the foregoing embodiment; it is not repeated here.
The Flink CDC script may be automatically generated by the terminal. The user can specify the data table to be acquired from the data source and the fields to be acquired from that table, and the terminal automatically generates, according to the user-specified table and fields and the original data message topic corresponding to the original data table, the Flink CDC script that acquires the original data table from the data source and stores it in the original data message topic.
After the original data table is collected, it may also be preprocessed. Because the original data table is not necessarily in the standard data table format (for example, a data table may be nested inside it, or a field may contain an array), the collected original data table may be preprocessed through a corresponding Flink SQL preprocessing script to convert it into the standard data table format.
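A minimal sketch of such a preprocessing step; the record layout (a nested customer table plus an item array) is a hypothetical example, and in practice this flattening would be done in the Flink SQL preprocessing script:

```python
# A raw record that is not in standard flat-table format: it nests a
# customer sub-table and an array of items (all names are assumptions).
raw_record = {
    "order_id": 7,
    "customer": {"id": "c1", "city_id": "2"},
    "items": [{"sku": "a", "qty": 1}, {"sku": "b", "qty": 3}],
}

def flatten(record):
    """Expand the nested customer fields and the item array into one flat
    standard-format row per item."""
    rows = []
    for item in record["items"]:
        rows.append({
            "order_id": record["order_id"],
            "customer_id": record["customer"]["id"],
            "city_id": record["customer"]["city_id"],
            "sku": item["sku"],
            "qty": item["qty"],
        })
    return rows
```

One nested record thus becomes two flat rows that downstream rules can consume like any other standard data table.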
According to the Kafka-based data processing method, the original data table can be obtained from an external data source and stored in the original data message topic corresponding to it, so that when the original data table is processed through the first data processing rule, the original data message topic can be found directly and the original data table read from it, which improves the release efficiency of the first data processing rule.
According to the Kafka-based data processing method, target output data can be screened according to the data screening rule, and the target data corresponding to the hit data obtained by screening is determined from the original data table, so that the target data can be applied subsequently. Since the data screening rule can automatically find the output message topic corresponding to the output data and read the data from it, the release efficiency of the data screening rule is improved, which further improves the efficiency of real-time data stream processing.
In one embodiment, after storing the output data table in the output message topic in step 110, the method further includes:
the output data table is stored in a database.
In step 104, obtaining the target input data table from the target input message topic includes:
reading the target input data table from the target input message topic when the metadata of the data table in the target input message topic is the same as the metadata of the target input data table; or,
reading the target input data table from the database when the metadata of the data table in the target input message topic is different from the metadata of the target input data table.
In this embodiment of the present application, after the output data table is obtained, it may be stored not only in the Kafka message topic corresponding to the output data but also in a database such as HBase, so that the output data table is preserved long-term.
When the input data is read from the target input message topic through the data processing rule, the metadata of the target input data table can be compared with the metadata of the data table in the target input message topic. If they are consistent, the data table in the target input message topic is the target input data table, and the topic can be used as the data source from which the target input data table is read. If they are inconsistent, the data table in the topic may have been changed, and the target input data table is instead read from the database, ensuring that the data fed to the data processing rule is indeed the target input data table.
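The metadata comparison and database fallback can be sketched as follows; the structures standing in for the topic and database are illustrative assumptions:

```python
# Stand-ins for the Kafka topic (which carries the metadata of the table it
# currently holds) and the backup database.
topic = {"metadata": {"fields": ["a", "b"]}, "table": "table_from_topic"}
database = {"table": "table_from_db"}

def read_target_table(topic, database, expected_meta):
    """Prefer the Kafka message topic as the data source; fall back to the
    database when the topic's table metadata does not match expectations."""
    if topic["metadata"] == expected_meta:
        return topic["table"]
    return database["table"]
```

The check is cheap because only metadata is compared, not the table contents, so the happy path still reads straight from Kafka.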
According to the Kafka-based data processing method, the output data table can be backed up in the database, so that when the input data cannot be obtained from the target input message topic, it can be obtained from the database instead; this prevents a data processing rule from failing to obtain its input data and thereby causing faults in other rules that depend on its output.
In order for those skilled in the art to better understand the embodiments of the present application, the embodiments of the present application are described below by way of specific examples.
In this embodiment, referring to figs. 7, 8 and 9, all data processing rules may be classified into 5 types: data collection rules, data preprocessing rules, data processing rules, data screening rules and data application rules. Except for the data preprocessing rules, the Flink SQL script for each part may be automatically generated by the terminal according to the user's configuration.
In the data collection rule, the user may specify the data table to be collected from the data source, the fields to be collected from the data table, and the message subject corresponding to the collected original data table in Kafka. The terminal can automatically generate the Flink cdc script for collecting the original data table according to the configuration and the universal part in the Flink cdc script, and bind the original data table with the original data message theme corresponding to the original data table.
In the data preprocessing rules, for each type of irregular data, the irregular data can be converted into the standard data table format through a manually written preprocessing script.
The data processing rules can be divided into three specific types: feature attribute rules, basic index rules and derived index rules. A feature attribute rule caches a certain field of one original data table for use when processing other original data tables. For example, suppose original data table A contains a field recording the user's click actions and original data table B is the user's transaction records; if the transaction records need to be filtered according to the user's click actions, the click-action field in table A must first be cached, and when table B is obtained, it is processed according to the cached click-action field together with table B itself.
A basic index rule is a data processing rule that processes the original data table directly. According to the field the user selects from the original data table for statistics, the statistical interval the user selects from the dimension table associated with that field, and the statistical rule set by the user, the terminal can automatically generate in the Flink SQL script the statement that creates the Flink source table from the original data message topic corresponding to the original data table; it can likewise automatically generate the statement that creates the Flink result table according to the output data table specified in the statistical rule and the fields corresponding to it. These statements, the data processing rule and the fixed parts of the Flink SQL script then together form the complete Flink SQL script. The data output by a basic index rule is called a basic index.
The data output by a derived index rule is called a derived index. A derived index rule may process basic indexes and/or other derived indexes. The terminal can automatically generate its Flink SQL script according to the basic indexes and/or derived indexes selected by the user and the statistical rule the user sets.
After each data processing rule generates output data, the other rules that monitor the message topic corresponding to the output data table read the output data from that topic and process it further. As shown in fig. 9, a basic index E can be obtained from the original data stream X and further processed to obtain a derived index B; a basic index F can be obtained from the original data stream Z and further processed to obtain a derived index D; the derived indexes D and B can then be fed into a data screening rule for screening, yielding the target data A. When each index is generated, in addition to being stored in Kafka, it is stored in a database such as HBase as a backup. The data stored in HBase can be backed up daily.
The data screening rules and data application rules are rules for screening target data from the original data table according to the output data of the data processing rules and for applying that target data. For the data screening rules, refer to the related descriptions of the foregoing embodiments; they are not repeated here. Besides storing the target data in a database, a data screening rule may store it in the Kafka message topic to which the target data is bound. A data application rule applies the target data output by the data screening rule according to the application job content configured by the user (for example, template processing or product recommendation based on the target data). The data application rule may also read the target data from the Kafka message topic to which it is bound and apply it.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiments of the present application also provide a Kafka-based data processing apparatus for implementing the above-mentioned Kafka-based data processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so specific limitations in one or more embodiments of the data processing device based on Kafka provided below may be referred to above for limitations of the data processing method based on Kafka, and will not be repeated here.
In one embodiment, as shown in fig. 10, there is provided a Kafka-based data processing apparatus, comprising: a first determination module 1002, a first acquisition module 1004, a first reading module 1006, a processing module 1008 and a first creation module 1010, wherein:
a first determining module 1002, configured to determine, based on a first command line in a data processing rule, a target input data table corresponding to the data processing rule, and determine a target input message topic having a binding relationship with the target input data table;
a first obtaining module 1004, configured to obtain the target input data table from the target input message topic of Kafka;
a first reading module 1006, configured to read input data from the target input data table based on a second command line in the data processing rule;
A processing module 1008, configured to process the input data according to the data processing rule, so as to obtain output data;
a first new modeling block 1010, configured to, in the absence of an output data table and an output message topic corresponding to the output data, create the output message topic bound to the output data table, store the output data in the output data table, and store the output data table in the output message topic;
or when the output data table and the output message theme corresponding to the output data exist, updating the output data table in the output message theme based on the output data.
According to the Kafka-based data processing apparatus, each data table has a corresponding message topic in Kafka, so the data processing rule can automatically find the target input message topic corresponding to the target input data table according to the first command line and read the input data under the target input data table from that topic. After the input data is processed into output data, if the output data table and its message topic do not yet exist, the output message topic can be created and the output data table stored in it, so that when other data processing rules need the output data table, they can automatically find the output message topic and read the table from it. Because the resulting output data table is stored in Kafka and data is transmitted through Kafka, the real-time performance of the data is guaranteed, and each rule is guaranteed to compute only after the input data it depends on has been generated, which solves the data dependence problem in real-time data stream processing. Meanwhile, since the target input message topic is determined automatically through the first command line of the data processing rule and the input data is acquired from it, the user does not need to manually set the data source of the rule, which improves the release efficiency of data processing rules and hence the efficiency of real-time data stream processing.
In one embodiment, the first determining module 1002 is further configured to:
acquiring metadata of the target input data table, and determining the target input message topic bound to the target input data table from that metadata;
the first creation module 1010 is further configured to:
instruct Kafka to create the output message topic;
and record the binding relation between the output data table and the output message topic in the metadata of the output data table.
In one embodiment, the data processing rule includes a first data processing rule, the target input data table includes an original data table, and the processing module is further configured to:
according to the statistical dimension field names in the first data processing rule, obtaining target statistical dimensions corresponding to the statistical dimension field names from a dimension table corresponding to target fields of the original data table, and determining each target statistical interval from the target statistical dimensions, wherein the dimension table comprises at least one statistical dimension, and the statistical dimension comprises at least one statistical interval;
and respectively counting the data corresponding to each target statistical interval in the target field through the first data processing rule to obtain output data.
In one embodiment, the apparatus further comprises:
the second acquisition module is used for acquiring the dimension tables corresponding to the fields of the original data table;
and the association module is used for carrying out association processing on the dimension table of any field of the original data table and the original data table.
In one embodiment, the apparatus further comprises:
the third acquisition module is used for acquiring the original data table from a data source;
and the second creation module is used for creating an original data message topic when no message topic corresponding to the original data table exists, storing the original data table in the original data message topic, and recording that the original data table is bound to the original data message topic.
In one embodiment, the apparatus further comprises:
the second reading module is used for reading each target output data item corresponding to the data screening rule from the output message topic bound to the output data table corresponding to that item;
the second determining module is used for, for any target output data item, determining hit data from it according to the data threshold corresponding to that item in the data screening rule;
a third reading module, configured to read the original data table from the original data message topic;
and a fourth determining module, configured to determine the target data corresponding to each piece of hit data from the original data table, and store the target data in a database.
In one embodiment, the apparatus further comprises:
the storage module is used for storing the output data table into a database;
the first obtaining module 1004 is further configured to:
reading the target input data table from the target input message topic when the metadata of the data table in the target input message topic is the same as the metadata of the target input data table; or,
reading the target input data table from the database when the metadata of the data table in the target input message topic is different from the metadata of the target input data table.
The respective modules in the above-described Kafka-based data processing apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 11. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a Kafka-based data processing method.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by all parties.
Those skilled in the art will appreciate that all or part of the processes of the methods described above may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum-computing-based data processing logic units, and the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered within the scope of this specification.
The above embodiments represent only a few implementations of the present application, and their descriptions are relatively specific and detailed, but they are not to be construed as limiting the scope of the patent. It should be noted that various modifications and improvements can be made by those of ordinary skill in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A Kafka-based data processing method, applied to the data processing engine Flink, comprising:
determining a target input data table corresponding to a data processing rule based on a first command line in the data processing rule, and determining a target input message topic having a binding relationship with the target input data table;
acquiring the target input data table from the target input message topic of Kafka;
reading input data from the target input data table based on a second command line in the data processing rule;
processing the input data through the data processing rule to obtain output data;
in a case where no output data table and output message topic corresponding to the output data exist, creating the output data table, instructing Kafka to create the output message topic bound to the output data table, storing the output data into the output data table, and storing the output data table into the output message topic;
or, in a case where the output data table and the output message topic corresponding to the output data exist, updating the output data table in the output message topic based on the output data.
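To make the create-or-update branch of claim 1 concrete, the following is a minimal, hypothetical sketch in plain Python. It only simulates the claimed control flow; the `topics` and `bindings` dictionaries and the `publish_output` function are illustrative stand-ins, not part of the patent or of Kafka's actual API.

```python
topics = {}    # message topic name -> stored data table (list of rows)
bindings = {}  # data table name -> bound message topic name

def publish_output(table_name, output_rows):
    """Create the output table/topic pair on first write; update it afterwards."""
    if table_name not in bindings:
        topic = f"topic_{table_name}"      # "instruct Kafka" to create the topic
        bindings[table_name] = topic       # record the table-topic binding
        topics[topic] = list(output_rows)  # store the new output data table
    else:
        topic = bindings[table_name]
        topics[topic].extend(output_rows)  # update the existing output data table
    return bindings[table_name]

# First write creates the binding; the second updates the table in place.
publish_output("daily_stats", [{"bucket": "0-18", "count": 3}])
publish_output("daily_stats", [{"bucket": "18-60", "count": 7}])
```

In a real deployment the topic creation would go through Kafka's admin interface and the table writes through Flink's Kafka connector; the sketch only mirrors the branching described in the claim.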
2. The method of claim 1, wherein the determining a target input message topic having a binding relationship with the target input data table comprises:
acquiring metadata of the target input data table, and determining, from the metadata of the target input data table, the target input message topic bound to the target input data table;
and the instructing Kafka to create the output message topic bound to the output data table comprises:
instructing Kafka to create the output message topic;
and recording the binding relationship between the output data table and the output message topic in the metadata of the output data table.
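Claim 2 locates the table-topic binding in the data table's metadata. A hedged sketch of that bookkeeping, with an assumed metadata layout (the `bound_topic` key and the table names are invented for illustration):

```python
# Per-table metadata; the "bound_topic" entry records the table-topic binding.
table_metadata = {
    "loan_records": {"schema": ["id", "amount"], "bound_topic": "topic_loan_records"},
}

def topic_for_table(table_name):
    """Resolve the input topic for a table from the table's own metadata."""
    return table_metadata[table_name]["bound_topic"]

def bind_new_output_table(table_name, schema):
    """Create a topic for a new output table and record the binding in metadata."""
    topic = f"topic_{table_name}"  # stand-in for asking Kafka to create the topic
    table_metadata[table_name] = {"schema": schema, "bound_topic": topic}
    return topic
```

Keeping the binding inside the table metadata, as the claim describes, means a rule only ever needs the table name; the topic is derived rather than configured separately.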
3. The method of claim 1, wherein the data processing rule comprises a first data processing rule, the target input data table comprises an original data table, and the processing the input data through the data processing rule to obtain output data comprises:
acquiring, according to a statistical dimension field name in the first data processing rule, a target statistical dimension corresponding to the statistical dimension field name from a dimension table corresponding to a target field of the original data table, and determining each target statistical interval from the target statistical dimension, wherein the dimension table comprises at least one statistical dimension, and each statistical dimension comprises at least one statistical interval;
and separately counting, through the first data processing rule, the data corresponding to each target statistical interval in the target field to obtain the output data.
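The interval statistics of claim 3 amount to binning a target field by the intervals a dimension table defines and counting rows per bin. A minimal sketch under assumed names (the `age` field and its intervals are purely illustrative):

```python
# Hypothetical dimension table: statistical intervals per field,
# each interval as a half-open [lo, hi) pair.
dimension_table = {"age": [(0, 18), (18, 60), (60, 200)]}

def count_by_interval(rows, field, dim_field):
    """Count how many rows fall into each statistical interval of the dimension."""
    intervals = dimension_table[dim_field]
    counts = {iv: 0 for iv in intervals}
    for row in rows:
        value = row[field]
        for lo, hi in intervals:
            if lo <= value < hi:
                counts[(lo, hi)] += 1
                break
    return counts

rows = [{"age": 10}, {"age": 30}, {"age": 70}, {"age": 25}]
result = count_by_interval(rows, "age", "age")
```

In the patented flow this counting would be expressed as a Flink rule over the original data table; the sketch only shows the per-interval aggregation itself.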
4. The method of claim 3, wherein, before the determining a target input data table corresponding to a data processing rule based on a first command line in the data processing rule, the method further comprises:
acquiring a dimension table corresponding to each field of the original data table;
and associating the dimension table of any field of the original data table with the original data table.
5. The method of claim 3, wherein, before the determining a target input data table corresponding to a data processing rule based on a first command line in the data processing rule, the method further comprises:
acquiring the original data table from a data source;
and in a case where no message topic corresponding to the original data table exists, creating an original data message topic, storing the original data table into the original data message topic, and recording that the original data table is bound to the original data message topic.
6. The method of claim 5, wherein, after the storing the output data table into the output message topic, the method further comprises:
for each target output data corresponding to a data screening rule, reading the target output data from the output message topic bound to the output data table corresponding to the target output data;
determining hit data from any one of the target output data according to a data threshold corresponding to the target output data in the data screening rule;
reading the original data table from the original data message topic;
and determining target data corresponding to each hit data from the original data table, and storing the target data into a database.
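The screening of claim 6 can be sketched as two small steps: filter output rows against a threshold to get "hit data", then pull the matching rows from the original data table. The threshold value, the `count` field, and the `bucket` join key below are assumptions made for illustration only:

```python
def screen_hits(output_rows, threshold, value_key="count"):
    """Keep output rows whose value meets the data threshold (the 'hit data')."""
    return [row for row in output_rows if row[value_key] >= threshold]

def resolve_targets(hits, original_rows, key="bucket"):
    """Find the original-table rows corresponding to each hit."""
    hit_keys = {hit[key] for hit in hits}
    return [row for row in original_rows if row[key] in hit_keys]

output_rows = [{"bucket": "a", "count": 5}, {"bucket": "b", "count": 2}]
original_rows = [{"bucket": "a", "id": 1}, {"bucket": "b", "id": 2}]
hits = screen_hits(output_rows, threshold=3)
targets = resolve_targets(hits, original_rows)
```

The claim's final step, persisting `targets` into a database, is omitted here since it is storage-specific.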
7. The method according to any one of claims 1 to 6, wherein, after the storing the output data table into the output message topic, the method further comprises:
storing the output data table into a database;
and the acquiring the target input data table from the target input message topic comprises:
reading the target input data table from the target input message topic in a case where the metadata of the data table in the target input message topic is the same as the metadata of the target input data table; or,
reading the target input data table from the database in a case where the metadata of the data table in the target input message topic is different from the metadata of the target input data table.
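The metadata-guarded fallback read of claim 7 can be sketched as follows. `topic_store` and `db_store` are stand-ins for the Kafka topic and the database copy of the table, and the `version` metadata key is an invented example of what might be compared:

```python
# Stand-in stores: the topic holds one copy of the table, the database another.
topic_store = {"topic_a": {"metadata": {"version": 2}, "rows": [1, 2, 3]}}
db_store = {"table_a": {"metadata": {"version": 3}, "rows": [1, 2, 3, 4]}}

def read_input_table(topic, table, expected_metadata):
    """Read from the topic if its table metadata matches; else fall back to the DB."""
    entry = topic_store[topic]
    if entry["metadata"] == expected_metadata:  # metadata matches: topic copy is valid
        return entry["rows"]
    return db_store[table]["rows"]              # otherwise read the database copy
```

The design rationale is that the database copy (written after every output, per the first step of the claim) serves as the authoritative fallback whenever the topic's copy has drifted from the expected table metadata.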
8. A Kafka-based data processing apparatus, the apparatus comprising:
a first determining module, configured to determine a target input data table corresponding to a data processing rule based on a first command line in the data processing rule, and determine a target input message topic having a binding relationship with the target input data table;
a first acquiring module, configured to acquire the target input data table from the target input message topic of Kafka;
a first reading module, configured to read input data from the target input data table based on a second command line in the data processing rule;
a processing module, configured to process the input data through the data processing rule to obtain output data;
and a first creating module, configured to, in a case where no output data table and output message topic corresponding to the output data exist, create the output data table, instruct Kafka to create the output message topic bound to the output data table, store the output data into the output data table, and store the output data table into the output message topic;
or, in a case where the output data table and the output message topic corresponding to the output data exist, update the output data table in the output message topic based on the output data.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
CN202310096169.0A 2023-01-18 2023-01-18 Kafka-based data processing method, device, computer equipment and storage medium Active CN116049190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310096169.0A CN116049190B (en) 2023-01-18 2023-01-18 Kafka-based data processing method, device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN116049190A true CN116049190A (en) 2023-05-02
CN116049190B CN116049190B (en) 2024-07-23

Family

ID=86118041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310096169.0A Active CN116049190B (en) 2023-01-18 2023-01-18 Kafka-based data processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116049190B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190102415A1 (en) * 2017-09-29 2019-04-04 Oracle International Corporation Recreating an oltp table and reapplying database transactions for real-time analytics
CN110688383A (en) * 2019-09-26 2020-01-14 中国银行股份有限公司 Data acquisition method and system
CN110704479A (en) * 2019-09-12 2020-01-17 新华三大数据技术有限公司 Task processing method and device, electronic equipment and storage medium
CN110795257A (en) * 2019-09-19 2020-02-14 平安科技(深圳)有限公司 Method, device and equipment for processing multi-cluster operation records and storage medium
US20210157834A1 (en) * 2019-11-27 2021-05-27 Amazon Technologies, Inc. Diagnostics capabilities for customer contact services
CN112948450A (en) * 2021-02-25 2021-06-11 苏宁金融科技(南京)有限公司 Method and device for Flink streaming processing engine for real-time recommendation and computer equipment
CN113256355A (en) * 2021-07-14 2021-08-13 北京宇信科技集团股份有限公司 Method, device, medium, equipment and system for determining integral rights and interests in real time
CN114265883A (en) * 2021-12-27 2022-04-01 浪潮卓数大数据产业发展有限公司 Method, equipment and storage medium for real-time data management
CN114722119A (en) * 2022-03-30 2022-07-08 上海幻电信息科技有限公司 Data synchronization method and system
CN115391361A (en) * 2022-08-24 2022-11-25 国任财产保险股份有限公司 Real-time data processing method and device based on distributed database
US20220398254A1 (en) * 2020-12-25 2022-12-15 Boe Technology Group Co., Ltd. Data processing method, platform, computer-readable storage medium and electronic device
CN115577041A (en) * 2022-09-09 2023-01-06 平凯星辰(北京)科技有限公司 Database synchronization method and device, electronic equipment and readable storage medium

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190102415A1 (en) * 2017-09-29 2019-04-04 Oracle International Corporation Recreating an oltp table and reapplying database transactions for real-time analytics
CN110704479A (en) * 2019-09-12 2020-01-17 新华三大数据技术有限公司 Task processing method and device, electronic equipment and storage medium
CN110795257A (en) * 2019-09-19 2020-02-14 平安科技(深圳)有限公司 Method, device and equipment for processing multi-cluster operation records and storage medium
WO2021051531A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Method and apparatus for processing multi-cluster job record, and device and storage medium
CN110688383A (en) * 2019-09-26 2020-01-14 中国银行股份有限公司 Data acquisition method and system
US20210157834A1 (en) * 2019-11-27 2021-05-27 Amazon Technologies, Inc. Diagnostics capabilities for customer contact services
US20220398254A1 (en) * 2020-12-25 2022-12-15 Boe Technology Group Co., Ltd. Data processing method, platform, computer-readable storage medium and electronic device
CN112948450A (en) * 2021-02-25 2021-06-11 苏宁金融科技(南京)有限公司 Method and device for Flink streaming processing engine for real-time recommendation and computer equipment
CN113256355A (en) * 2021-07-14 2021-08-13 北京宇信科技集团股份有限公司 Method, device, medium, equipment and system for determining integral rights and interests in real time
CN114265883A (en) * 2021-12-27 2022-04-01 浪潮卓数大数据产业发展有限公司 Method, equipment and storage medium for real-time data management
CN114722119A (en) * 2022-03-30 2022-07-08 上海幻电信息科技有限公司 Data synchronization method and system
CN115391361A (en) * 2022-08-24 2022-11-25 国任财产保险股份有限公司 Real-time data processing method and device based on distributed database
CN115577041A (en) * 2022-09-09 2023-01-06 平凯星辰(北京)科技有限公司 Database synchronization method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN116049190B (en) 2024-07-23

Similar Documents

Publication Publication Date Title
CN107273482A (en) Alarm data storage method and device based on HBase
CN111061758B (en) Data storage method, device and storage medium
CN111400288A (en) Data quality inspection method and system
CN115269515B (en) Processing method for searching specified target document data
CN114218218A (en) Data processing method, device and equipment based on data warehouse and storage medium
Suriarachchi et al. Big provenance stream processing for data intensive computations
CN116719822B (en) Method and system for storing massive structured data
CN111125045B (en) Lightweight ETL processing platform
CN116049190B (en) Kafka-based data processing method, device, computer equipment and storage medium
CN115809311B (en) Knowledge graph data processing method and device and computer equipment
CN111680072A (en) Social information data-based partitioning system and method
CN116089417A (en) Information acquisition method, information acquisition device, storage medium and computer equipment
CN115858471A (en) Service data change recording method, device, computer equipment and medium
CN115994830A (en) Method for constructing fetch model, method for collecting data and related device
CN113778996A (en) Large data stream data processing method and device, electronic equipment and storage medium
CN109063201B (en) Impala online interactive query method based on mixed storage scheme
CN112667859A (en) Data processing method and device based on memory
CN107861956B (en) Method and device for inquiring data record of bayonet passing vehicle
CN117707857B (en) Chip research and development data backup method, device, computer equipment and storage medium
CN114238258B (en) Database data processing method, device, computer equipment and storage medium
CN117312283A (en) Database and table data verification method and device, computer equipment and storage medium
CN118377782A (en) Data processing method, apparatus, computer device, storage medium, and program product
CN117591585A (en) Target data acquisition method and device and computer equipment
CN118245535A (en) Granularity integration method and device for business data, computer equipment and storage medium
CN117149313A (en) Program execution plan synchronization method, program execution plan synchronization device, computer device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant