CN114579614A

CN114579614A - Real-time data full-scale acquisition method and device and computer equipment

Info

Publication number: CN114579614A
Application number: CN202210128596.8A
Authority: CN
Inventors: 王祖正; 汪健; 吴凡
Original assignee: Wuhan Wuyi Yuntong Network Technology Co ltd
Current assignee: Wuhan Wuyi Yuntong Network Technology Co ltd
Priority date: 2022-02-11
Filing date: 2022-02-11
Publication date: 2022-06-03

Abstract

The invention provides a method, a device, and computer equipment for acquiring the full volume of real-time data. The method comprises the following steps: monitoring and collecting binlog data of a MySQL database with the Flink CDC (flinkcdc) component; assigning a global sort id to the collected binlog data and, when writing to Kafka, partitioning by the primary key of the table, with the primary key value hash-partitioned so that records of the same table with the same primary key land in the same partition; creating a Doris wide table in which every field other than the primary key is declared as REPLACE_IF_NOT_NULL; creating a plurality of Flink stream tasks that write multiple streams of dwd-layer or dws-layer data from Kafka into the Doris wide table by stream load, so that data with the same primary key in the wide table is overwritten on update; and querying the Doris wide table as required to obtain the full data. The beneficial effects of the invention are that it mainly solves the problem of MySQL cross-database analysis, at low cost.

Description

Real-time data full-scale acquisition method and device and computer equipment

Technical Field

The invention relates to the field of big data, in particular to a method and a device for acquiring full real-time data and computer equipment.

Background

As the development of the internet enters its second half, the timeliness of data matters more and more to the fine-grained operation of enterprises. The marketplace is like a battlefield: being able to mine valuable information in real time from the massive data generated every day greatly helps an enterprise adjust its decisions and operating strategy.

From the perspective of business intelligence, data results represent user feedback, so the timeliness of those results is particularly important. Fast data feedback helps decision makers decide more quickly and iterate software products more effectively, and the real-time data warehouse plays an irreplaceable role in this process.

Typically, a data warehouse is expected to hold data from the first day a new business goes online up to the present. Real-time stream processing, however, emphasizes the current processing state. The two are somewhat at odds, so the data timeliness of current offline data warehouses is very low.

Specifically:

(1) At present, real-time wide tables (in the data warehouse) are built with Flink + ClickHouse: Flink joins the individual tables and writes the result into ClickHouse, which is only responsible for queries. ClickHouse's OLAP support for wide tables is insufficient, and its performance for table joins and similar operations is poor.

(2) When Flink generates a wide table, it works with a notion of time: a time window must be added and a time range set, otherwise the task runs into problems because the state grows without bound and an OOM occurs. Delayed data may then fail to be joined, so the data results are inaccurate; this problem objectively exists.

(3) The Flink client collects the binlog and writes it to Kafka. Because Flink can set a parallelism greater than one to improve efficiency, and Kafka can also be partitioned, multiple threads collecting several binlog entries for the same primary key must preserve the order of the data; otherwise the disorder makes the data that finally lands in storage wrong.

Take a simple example: a business flow of placing an order and then paying.

The order is one message and the payment is another, and associating the two is a Flink SQL join. But the join has a time range: a payment made within half an hour can be matched, while a later payment is treated as failed. If the payment arrives a month or a year later, the order message would have to wait in memory the whole time; with large data volumes kept in memory indefinitely, a join over the full historical data cannot be performed.

In summary, Flink + ClickHouse cannot handle associations between data separated by a large time span, which is a real pain point.
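By way of non-limiting illustration, the time-bounded join described above can be sketched in plain Python. This is only a simulation of the idea, not Flink code; the `Event` type, the `interval_join` function, and the sample timestamps are hypothetical names and values introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    order_id: int
    ts_minutes: int  # event time in minutes, relative to an arbitrary start


def interval_join(orders, payments, max_wait_minutes):
    """Match a payment to its order only if it arrives within the time bound."""
    orders_by_id = {o.order_id: o for o in orders}
    matched, unmatched = [], []
    for p in payments:
        o = orders_by_id.get(p.order_id)
        in_range = o is not None and 0 <= p.ts_minutes - o.ts_minutes <= max_wait_minutes
        if in_range:
            matched.append((o.order_id, o.ts_minutes, p.ts_minutes))
        else:
            # The order state has already been dropped, so the payment cannot be joined.
            unmatched.append(p.order_id)
    return matched, unmatched


orders = [Event(1, 0), Event(2, 0)]
# Order 1 is paid after 10 minutes, order 2 after roughly one month.
payments = [Event(1, 10), Event(2, 60 * 24 * 30)]
matched, unmatched = interval_join(orders, payments, max_wait_minutes=30)
print(matched)
print(unmatched)
```

Raising `max_wait_minutes` far enough to catch the late payment would mean keeping every unpaid order in state for that long, which is the memory problem described above.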

Disclosure of Invention

In view of this, the present invention provides a method for acquiring the full volume of real-time data, based on data warehouse construction and real-time stream data processing technology.

To achieve the above purpose, the present invention provides a real-time data full-scale acquisition method, which is based on flinkcdc + doris and comprises the following steps:

S101: accessing real-time binary log (binlog) data: monitoring and collecting binlog data of the MySQL database with the Flink CDC component;

S102: writing the binlog data to Kafka: assigning a global sort id to the collected binlog data and, when writing to Kafka, partitioning by the primary key of the table, with the primary key value hash-partitioned so that records of the same table with the same primary key land in the same partition;

S103: creating the Doris wide table: creating a Doris wide table in which every field other than the primary key is declared as REPLACE_IF_NOT_NULL;

S104: writing data to the Doris wide table: creating a plurality of Flink stream tasks that write multiple streams of dwd-layer or dws-layer data from Kafka into the Doris wide table by stream load, so that data with the same primary key in the wide table is overwritten on update;

S105: acquiring the full data: querying the Doris wide table as required to obtain the full data.

Further, in step S105, when the Doris wide table is queried, a cross-dimension association query is performed by using the fields it shares with other tables as query conditions.

Further, after data has been written to the Doris wide table, a new field can be added to the wide table at any time in an asynchronous manner. The specific process is as follows:

S201: creating a new field d in the Doris wide table; once field d has been created, it begins to receive the incremental data of the original task a;

S202: creating a new offline task b, which imports into field d the full data that existed in the Doris wide table before field d was created;

S203: during the import, field d receives incremental data and historical full data at the same time, and the two arrive out of order; the incremental data keeps overwriting the historical full data of the same primary key until the import is finished;

S204: creating a new temporary Flink task c that consumes the data; temporary task c and the original task a write data into field d at the same time; after temporary task c has run for a period of time T, temporary task c is closed and the Doris wide table has completed the field update.

A real-time data full-scale acquisition apparatus, the apparatus comprising:

a binlog data acquisition unit: accessing real-time binary log (binlog) data by monitoring and collecting binlog data of the MySQL database with the Flink CDC component;

a data partitioning unit: assigning a global sort id to the collected binlog data and, when writing to Kafka, partitioning by the primary key of the table, with the primary key value hash-partitioned so that records of the same table with the same primary key land in the same partition;

a Doris wide table creation unit: creating a Doris wide table in which every field other than the primary key is declared as REPLACE_IF_NOT_NULL;

a Doris wide table data filling unit: creating a plurality of Flink stream tasks that write multiple streams of dwd-layer or dws-layer data from Kafka into the Doris wide table by stream load, so that data with the same primary key in the wide table is overwritten on update;

a full data acquisition unit: querying the Doris wide table as required to obtain the full data.

A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of any of said real-time data full-scale acquisition methods when executing said computer program.

A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the real-time data full-scale acquisition methods.

The beneficial effects provided by the invention are as follows:

1. the problem of MySQL cross-database analysis is mainly solved, and the scheme is low in cost;

2. a full-scale real-time wide table (a wide table holding the full data) is generated based on Flink + Doris, and real-time statistics are run directly on the wide table, which optimizes query speed and improves efficiency of use;

3. the problem of synchronizing downstream data when the original table is modified during real-time statistics is solved at minimum cost;

4. the problem of generating a wide table from historical full data is solved based on Doris.

Drawings

FIG. 1 is a schematic flow chart of a real-time data full-scale acquisition method according to the present invention;

FIG. 2 is a simplified example of creating a Doris wide table;

FIG. 3 shows the process of writing data to the wide table;

FIG. 4 is a schematic diagram of data update.

Detailed Description

To make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.

First, the related terms are explained in a unified way as follows:

Flink: an Apache top-level open source project and a computation engine for real-time processing;

flinkcdc: the flink-cdc-connectors component developed by the Flink community, a source component that can read full data and incremental change data directly from databases such as MySQL and PostgreSQL. CDC (change data capture) monitors and captures database changes, records them completely in the order in which they occur, and writes them to message middleware for other services to subscribe to and consume;

MySQL binlog: a log file in binary format that records the changes MySQL makes to the database;

Kafka: a high-throughput distributed publish-subscribe messaging system;

Doris: an MPP-based OLAP system that provides high-performance analysis and report query functions on large data sets at lower cost.

The invention provides a real-time data full-scale acquisition method whose basic idea is shown in FIG. 1, which is a flow chart of the method of the present invention.

It should be noted that the Flink CDC component monitors the binlog of the MySQL database and writes the log data to the message middleware Kafka, after which an application program is developed to consume and process the data. Alternatively, other existing tools can be used, such as the logail tool from Alibaba, followed by a custom consumer program.

It is easy to understand that, compared with the conventional approach of querying and exporting data, acquiring real-time changes by monitoring the binlog log file of the MySQL database does not affect database performance and solves both the performance and the timeliness problems of data synchronization.

It should be noted that the binlog data of the MySQL database is written to Kafka in real time. In this process the binlog is ordered at second granularity, but if data within the same second is out of order, the final data will be wrong.

Therefore, when the binlog is collected, a global sort id is assigned to it. Further, when writing to Kafka, the primary key of the table is used as the partitioning strategy and the primary key value is hash-partitioned, so that records of the same table with the same primary key are guaranteed to be in the same partition. This guarantees the ordering of the data, including when the data is written to and updated in Doris.

It should further be noted that the data in Kafka is divided into ods-layer, dwd-layer, and dws-layer data; it is processed by ETL and parsed into a standard format.
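By way of non-limiting illustration, the global sort id and the primary-key hash partitioning can be sketched as follows. This is a plain Python simulation rather than Kafka client code: `zlib.crc32` stands in for whatever hash the actual partitioner uses, and the record layout (`table`, `pk`, `op`) and function names are assumptions made for the example.

```python
import itertools
import zlib

# Monotonically increasing counter standing in for the global sort id.
_global_ids = itertools.count(1)


def tag_with_global_id(record):
    """Return a copy of the record carrying the next global sort id."""
    return {**record, "global_id": next(_global_ids)}


def partition_for(table, primary_key, num_partitions):
    """Deterministically map (table, primary key) to a partition number."""
    key_bytes = f"{table}:{primary_key}".encode("utf-8")
    return zlib.crc32(key_bytes) % num_partitions


def route(binlog_records, num_partitions):
    """Tag each record and append it to the partition chosen by its primary key."""
    partitions = {p: [] for p in range(num_partitions)}
    for rec in binlog_records:
        tagged = tag_with_global_id(rec)
        p = partition_for(tagged["table"], tagged["pk"], num_partitions)
        partitions[p].append(tagged)
    return partitions


records = [
    {"table": "orders", "pk": 42, "op": "insert", "status": "created"},
    {"table": "orders", "pk": 7, "op": "insert", "status": "created"},
    {"table": "orders", "pk": 42, "op": "update", "status": "paid"},
]
routed = route(records, num_partitions=4)
```

Because both changes to primary key 42 hash to the same partition, a consumer of that partition sees the insert before the update, and the global sort id can be used to check or restore that order downstream.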

S103: creating the Doris wide table: creating a Doris wide table in which every field other than the primary key is declared as REPLACE_IF_NOT_NULL;

Referring to FIG. 2, FIG. 2 is a simplified example of creating a Doris wide table.

In FIG. 2, a wide table named "test" is created. It contains the keys key1 and key2, and all fields other than the keys are REPLACE_IF_NOT_NULL.

In the present application, with this arrangement, once modification or deletion data from the data warehouse has been ingested, the data of the corresponding primary key can be overwritten.
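By way of non-limiting illustration, a table definition of the kind described for FIG. 2 might be rendered as below. The figure itself is not reproduced here, so the column types, the value column names, and the bucket count are assumptions; the helper `build_wide_table_ddl` is a hypothetical function that only builds the statement text.

```python
def build_wide_table_ddl(table, key_columns, value_columns, buckets=8):
    """Render a Doris aggregate-model CREATE TABLE statement as a string.

    key_columns and value_columns are lists of (name, type) tuples.
    Every value column is declared REPLACE_IF_NOT_NULL and nullable.
    """
    lines = [f"    {name} {ctype}" for name, ctype in key_columns]
    lines += [
        f"    {name} {ctype} REPLACE_IF_NOT_NULL NULL"
        for name, ctype in value_columns
    ]
    key_names = ", ".join(name for name, _ in key_columns)
    return (
        f"CREATE TABLE {table} (\n"
        + ",\n".join(lines)
        + "\n)\n"
        + f"AGGREGATE KEY({key_names})\n"
        + f"DISTRIBUTED BY HASH({key_names}) BUCKETS {buckets};"
    )


ddl = build_wide_table_ddl(
    "test",
    key_columns=[("key1", "BIGINT"), ("key2", "BIGINT")],
    value_columns=[("value1", "VARCHAR(64)"), ("value2", "VARCHAR(64)")],
)
print(ddl)
```

Declaring the value columns nullable matters here: a load that omits a column leaves it NULL, and REPLACE_IF_NOT_NULL then keeps the previously stored value.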

It should be noted that, when multiple Flink data streams are written into the Doris wide table at the same time, multiple Flink tasks are created, and multiple streams of dwd-layer or dws-layer data are written into Doris by stream load, with data of the same primary key being overwritten on update.

Referring to FIG. 3, FIG. 3 illustrates the process of writing data into the wide table.
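By way of non-limiting illustration, a stream load is an HTTP PUT against the Doris front end. The sketch below only assembles the request and does not send it; the host name, database name, and credentials are placeholders, and the helper `build_stream_load_request` is a hypothetical function introduced for the example.

```python
import base64
import json
import uuid


def build_stream_load_request(fe_host, db, table, rows, user, password, http_port=8030):
    """Assemble (without sending) a Doris stream load request for JSON rows."""
    url = f"http://{fe_host}:{http_port}/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        # A unique label lets Doris reject an accidental duplicate load.
        "label": f"{table}-{uuid.uuid4().hex}",
        "format": "json",
        # The body is a JSON array; each element is one row.
        "strip_outer_array": "true",
    }
    body = json.dumps(rows).encode("utf-8")
    return {"method": "PUT", "url": url, "headers": headers, "body": body}


rows = [{"key1": 1, "key2": 10, "value1": "a", "value2": "b"}]
request = build_stream_load_request(
    "fe.example.internal", "demo", "test", rows, user="user", password="secret"
)
print(request["url"])
```

An actual client would send this request with an HTTP library that follows the redirect from the front end to a back-end node; a Flink task would more likely use a Doris sink connector that performs the same call internally.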

For example, in one embodiment there are three data streams in total:

The first stream: key1, key2, value1, value2

The second stream: key1, key2, value3, value4

The third stream: key2, value5, value6

The three streams are written into the wide table simultaneously.
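By way of non-limiting illustration, the effect of REPLACE_IF_NOT_NULL when several streams write different columns of the same row can be simulated in plain Python. This mimics the column semantics only and is not how Doris is implemented; the sample values are invented, and the third stream is assumed to have already had its missing key1 filled in.

```python
def apply_replace_if_not_null(table, row, key_columns=("key1", "key2")):
    """Merge one incoming row into the table, keyed by the primary key.

    A non-null value replaces the stored one; a null or absent value
    leaves the stored one untouched.
    """
    key = tuple(row[k] for k in key_columns)
    merged = table.setdefault(key, {})
    for column, value in row.items():
        if column in key_columns:
            continue
        if value is not None:
            merged[column] = value
    return table


wide = {}
stream1 = {"key1": 1, "key2": 10, "value1": "a", "value2": "b"}
stream2 = {"key1": 1, "key2": 10, "value3": "c", "value4": "d"}
stream3 = {"key1": 1, "key2": 10, "value5": "e", "value6": "f"}  # key1 assumed completed
# A later change from the first stream: value1 changes, value2 is not supplied.
update = {"key1": 1, "key2": 10, "value1": "a2", "value2": None}

for row in (stream1, stream2, stream3, update):
    apply_replace_if_not_null(wide, row)

print(wide[(1, 10)])
```

Each stream fills in only its own columns, so the three streams together assemble one complete wide row without any stream needing to join against the others.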

In step S105, when the Doris wide table is queried, a cross-dimension association query is performed by using the fields it shares with other tables as query conditions.

It should be noted that, when the Doris wide table is used for queries, it can be treated as one large table and joined with other tables (cross-dimension association query). The wide table can also be optimized on its own as a single table, in a manner similar to a rollup, to improve query speed.
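By way of non-limiting illustration, a cross-dimension association query joins the wide table to a dimension table on a shared field. The sketch below uses an in-memory SQLite database purely as a stand-in so that it can run anywhere; the table `dim_shop`, its `region` column, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE wide (
        key1 INTEGER, key2 INTEGER, value1 TEXT, value5 TEXT,
        PRIMARY KEY (key1, key2)
    );
    CREATE TABLE dim_shop (key2 INTEGER PRIMARY KEY, region TEXT);
    INSERT INTO wide VALUES (1, 10, 'a', 'e'), (2, 20, 'b', 'f');
    INSERT INTO dim_shop VALUES (10, 'east'), (20, 'west');
    """
)

# Use the shared field key2 as the association condition.
rows = conn.execute(
    """
    SELECT w.key1, w.value1, d.region
    FROM wide AS w
    JOIN dim_shop AS d ON d.key2 = w.key2
    WHERE d.region = ?
    ORDER BY w.key1
    """,
    ("east",),
).fetchall()
print(rows)
conn.close()
```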

After data has been written to the Doris wide table, a new field can be added to the wide table at any time in an asynchronous manner. The specific process is as follows:

S202: creating a new offline task b, which imports into field d the full data that existed in the Doris wide table before field d was created;

Briefly, the above process is: if the historical values of a field of the wide table need to be modified, a new Doris table field is added and the change is executed asynchronously.

After the new field is added, the historical data is first imported into the newly added Doris field. A temporary task is then created to consume the incremental data and, together with the original task, write it into both the new and the old Doris fields. After it has run for a period of time (when the configured time is up, a script is called to kill the task), the temporary task is closed. The original incremental task is then modified to write the new field and restarted, after which it resumes consuming the data source, so that data updates and synchronization are guaranteed.
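By way of non-limiting illustration, the following plain Python simulation shows one possible reading of why the temporary task is useful. The reading assumed here is that a historical import arriving after a live change can overwrite the newer value of the new field, and that replaying the incremental data for a period of time restores it. The function names, primary keys, and values are hypothetical.

```python
def write_field(table, pk, value):
    """Write one value of the new field; the most recent non-null write wins."""
    if value is not None:
        table[pk] = value


def run(arrivals):
    """Apply (primary key, value) writes in arrival order and return the field state."""
    table = {}
    for pk, value in arrivals:
        write_field(table, pk, value)
    return table


history = [(1, "old-1"), (2, "old-2")]  # offline import of historical data
increments = [(1, "new-1")]             # live change from the original task

# The live change for key 1 happens to arrive before the historical import.
interleaved = increments + history
stale = run(interleaved)

# The temporary task replays the incremental data after the import.
repaired = run(interleaved + increments)

print(stale)
print(repaired)
```

Under this reading, the running time of the temporary task only needs to cover the window during which the historical import and the live changes overlapped.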

For a better explanation, refer to FIG. 4, which is a schematic diagram of data update.

As before, in one embodiment there are three data streams in total:

The first stream: key1, key2, value1, value2

The second stream: key1, key2, value3, value4

The third stream: key2, value5, value6

Here the third stream has only key2. To update its data, the missing key can be supplemented from a dimension table (cross-dimension association query); see the upper right part of FIG. 4.

That is, the third stream is joined with the dimension table to complete the key, becoming:

key1, key2, value5, value6. The process used is also the one described in steps S201 to S204. Through this process, the missing key1 field of the third stream is supplemented and the data is finally overwritten on update.
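By way of non-limiting illustration, completing the missing key from a dimension table can be sketched as a lookup. The dimension mapping from key2 to key1, the function `complete_key`, and the sample values are assumptions made for the example; FIG. 4 is not reproduced here.

```python
def complete_key(row, dimension, lookup_column="key2", missing_column="key1"):
    """Fill in the missing key of a row from a dimension mapping.

    Returns a new row with the key completed, or None when the
    dimension table has no entry for the lookup value.
    """
    if row.get(missing_column) is not None:
        return dict(row)
    found = dimension.get(row[lookup_column])
    if found is None:
        return None
    return {**row, missing_column: found}


# Dimension table: key2 -> key1.
dimension = {10: 1, 20: 2}

third_stream_row = {"key2": 10, "value5": "e", "value6": "f"}
completed = complete_key(third_stream_row, dimension)
print(completed)
```

Once the row carries both keys, it can be written to the wide table like the rows of the first and second streams.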

A real-time data full acquisition apparatus, the apparatus comprising:

a binlog data acquisition unit: accessing real-time binary log (binlog) data by monitoring and collecting binlog data of the MySQL database with the Flink CDC component;

a Doris wide table creation unit: creating a Doris wide table in which every field other than the primary key is declared as REPLACE_IF_NOT_NULL;

The beneficial effects of the implementation of the invention are as follows:

The above-described embodiments of the invention, and the features within those embodiments, may be combined with each other where no conflict arises.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A real-time data full-scale acquisition method, characterized in that the method comprises the following steps:

S102: writing the binlog data to Kafka: assigning a global sort id to the collected binlog data and, when writing to Kafka, partitioning by the primary key of the table, with the primary key value hash-partitioned so that records of the same table with the same primary key land in the same partition;

2. The real-time data full-scale acquisition method according to claim 1, wherein: in step S105, when the Doris wide table is queried, a cross-dimension association query is performed by using the fields it shares with other tables as query conditions.

3. The real-time data full-scale acquisition method according to claim 1, wherein: after data has been written to the Doris wide table, a new field can be added to the wide table at any time in an asynchronous manner, the specific process being as follows:

S203: during the import, field d receives incremental data and historical full data at the same time, and the two arrive out of order; the incremental data keeps overwriting the historical full data of the same primary key until the import is finished;

4. A real-time data full-scale acquisition apparatus, the apparatus comprising:

a full data acquisition unit: querying the Doris wide table as required to obtain the full data.

5. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the real-time data full-scale acquisition method according to any one of claims 1 to 3 when executing the computer program.

6. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the real-time data full size acquisition method according to any one of claims 1 to 3.