CN114265875A

CN114265875A - Method for establishing wide table in real time based on stream data

Info

Publication number: CN114265875A
Application number: CN202210204367.XA
Authority: CN
Inventors: 唐建法; 肖贝贝; 黄佳钦; 欧阳世雄
Original assignee: Shenzhen Titanium Platinum Data Co ltd
Current assignee: Shenzhen Titanium Platinum Data Co ltd
Priority date: 2022-03-03
Filing date: 2022-03-03
Publication date: 2022-04-01
Anticipated expiration: 2042-03-03
Also published as: CN114265875B

Abstract

The application discloses a real-time broad list establishing method based on stream data, which at least comprises the following steps: the data engine collects data and stores the data in a database; converting the collected data into structured data; saving the structured data to a data cache library; and the model calculation engine receives a data updating event, extracts the structured data associated with the target main table according to the fact whether the structured data and the target model have a mapping relation, and updates the structured data into the target main table. The method has the characteristics of high instantaneity, high flexibility, quick response, global model association and cross-library disorder support.

Description

Method for establishing wide table in real time based on stream data

Technical Field

The application relates to a method for establishing a broad list in real time based on stream data, belonging to the field of data processing.

Background

Whether traditional warehouse construction or modern data-driven application business, most data development work needs to construct new data tables to serve various analysis models or business models. Especially, because the data volume of the internet company is generally large, the multi-table association mode is not adopted. In this case, it is a very mainstream data development work to construct a wide table for supporting various business queries.

Traditional modeling and table building are all completed based on SQL. The SQL-based approach has these limitations:

1. disjointing the target model table and the original table data:

SQL is a way to perform query calculations based on a fixed data set and output to a target table, and is suitable for regular batch operations. If the original table involved is large, this operation often needs to be performed for several minutes or even hours, which may cause the target data to fail to reflect the current real state.

2. Concurrent task performance bottleneck:

due to the full-table computation model of traditional modeling, the number of tasks that can be performed simultaneously in several bins cannot substantially exceed 2-3. This severely limited the ability of conventional data platform batch modeling.

Disclosure of Invention

According to one aspect of the application, a method for building a wide table in real time based on streaming data is provided, and the method has the characteristics of high real-time performance, high flexibility, quick response, global model association and cross-library disorder support.

The method for establishing the width table in real time based on the stream data at least comprises the following steps:

the data engine collects data and stores the data in a database;

converting the collected data into structured data;

saving the structured data to a data cache library;

and the model calculation engine receives a data updating event, extracts the structured data associated with the target main table according to the fact whether the structured data and the target model have a mapping relation, and updates the structured data into the target main table.

Optionally, the data cache library is a MongoDB.

Optionally, the structured data is stored in a unified data buffer layer of the data cache library.

Optionally, the unified data buffer layer is an FDM layer.

Optionally, when the data engine collects data, a log collector forms a data log and stores the data log in a log storage center of the data cache library;

and the log storage center synchronizes the data log with the task collector, so that the data log and the user target database are shared.

Optionally, the model computation engine receives a data update event, including:

updating logs by data received by the model computing engine;

sending a database sharing association instruction;

if the data log and the user target database are successfully shared, gradually judging whether the data updating log comprises a log acquisition task or not, whether the log acquisition task comprises a required table or not, and whether the initial acquisition time of the log acquisition task is earlier than the initial time of the last synchronous task or not, if so, reading the data log through the log storage center to serve as an incremental data log;

and if the data log is not successfully shared with the user target database, or the data updating log does not contain a log acquisition task, or the log acquisition task does not contain a required table, or the initial acquisition time of the log acquisition task is later than the initial time of the last synchronous task, the model computing engine directly reads the data log in the database to be used as an incremental data log.

Optionally, for the structured data having a mapping relationship with the target model, querying a table, i.e., a sub-table, containing the structured data from the data cache library, merging the sub-table and the target main table data, and updating the target main table;

and for the structured data which does not have a mapping relation with the target model, writing the structured data into a sub-table cache table of the data cache library, and updating the sub-table into the target main table according to the newly established mapping relation.

The beneficial effects that this application can produce include:

1) the method for building the wide table in real time based on the streaming data provided by the application monitors the incremental change of the source data in real time, can achieve the time delay of a sub-second level, and has higher real-time performance compared with the traditional timing polling mode (based on SQL statements);

2) the method for establishing the broad table in real time based on the streaming data has the multi-model storage capacity, supports the modification of the model at any time, and is more timely in response to the requirement of the change of the model and more global flexibility compared with the traditional relational model (the model is created in advance and the modification cost is higher).

3) The method for establishing the broad table in real time based on the streaming data, provided by the application, is used for simply configuring based on the existing data model and establishing a data synchronization task, so that data modeling can be realized; compared with the prior art that the existing data processing platform (Flink) needs the cooperation of code coding, the development period is longer, the method has the capability of low code data development and even zero code, and the purpose of quickly responding to the service is achieved.

4) According to the method for establishing the wide table in real time based on the streaming data, a unified data cache layer is arranged in the synchronization process, so that the associated global data is supported, and compared with a method that a Flink supports time window time cache, the method is stronger in function.

5) The method for establishing the broad table in real time based on the streaming data has the advantages that all data sources are converged to a uniform data cache layer; when the sub-table events are firstly input, the sub-table events are firstly written into the data cache layer, and the subsequent main table events can inquire the associated sub-table data from the data cache layer to perform model combination calculation.

Drawings

FIG. 1 is a schematic flow chart of one embodiment of the method of the present application;

FIG. 2 is a schematic flow chart of a sharing mode in the method of the present application;

fig. 3 is a schematic flow chart of an update event according to the present application.

Detailed Description

The present application will be described in detail with reference to examples, but the present application is not limited to these examples.

The application relates to a method for establishing a width table in real time based on stream data, the flow of which is shown in figure 1, and the method comprises the following steps:

the data engine collects data and stores the data in a database;

as shown in fig. 2, when the data engine collects data, it needs to mine and share log data, specifically:

a log collector forms a data log and stores the data log to a log storage center of the data cache library;

Converting the collected data into structured data;

saving the structured data to a data cache library;

As shown in FIG. 3, the model computation engine receives data update events, including:

updating logs by data received by the model computing engine;

sending a database sharing association instruction;

For the structured data which has a mapping relation with a target model, inquiring a table containing the structured data, namely a sub-table, from the data cache library, merging the sub-table and the target main table data, and updating the target main table;

The method extracts the logs of the source database into a data cache library (MongoDB), and the subsequent tasks serve as consumers, directly acquire the required data from the data cache library and then transmit the required data to the subsequent task processing. At present, the mining sharing mode of log data can be used for Oracle, Mysql, PostgreSQL, Sqlserver, MongoDB and the like. The advantages of MongoDB as a data cache library are as follows: the high performance of the memory database can be persisted like MySQL. Because the log data is cached in the MongoDB (indirectly on the hard disk), the problem that the source cannot be traced after a large transaction scene, a specific transaction and a synchronous fault can be quickly solved.

Under the scene of processing a large transaction, the mining sharing of log data takes the cache MongoDB as a carrier to record all data changes in the large transaction, when the large transaction is finished, all the data changes are directly taken out from the MongoDB for consumption, other transactions irrelevant to the large transaction are normally processed, and the data changes and the other transactions are not interfered with each other, so that the influence of the large transaction on the synchronization performance is reduced. In the large transaction scene, the amount of data change generated is large generally, and because MongoDB is used as an intermediate carrier, the synchronization time cannot be increased due to the characteristic of millisecond-level insertion/query speed, and the synchronization time is actually measured to be 3.2 times larger than the maximum transaction processed by Oracle OGG.

In the case of specific transactions (partial rollback, violating constraint transactions, etc.), because the various database implementations are different, the transactions of the source library often have very complex non-conventional scenarios, such as partial rollback, transactions violating database constraints, etc. Directly processing these complex scenes in the source library requires consuming a large amount of source library resources, which affects the service. After the shared log based on the cache is used, the accumulated transactions can be analyzed and matched more comprehensively only aiming at various complex scenes, and corresponding processing is carried out.

The non-cached mode logs are typically cleared or archived quickly. After caching, due to the persistence capability, the date and the size of the log cache can be dynamically adjusted as required. Therefore, the data are traced when the data are inconsistent, historical change data are provided, and the tracing after the synchronous fault is more convenient.

In a log data mining sharing mode, for example, in a synchronous scene in which Oracle is a source database, the system can obtain all target data and real-time changes thereof in the database only by 1 logminer process (note: when one logminer process is started in Oracle, the memory, disk and cpu loss of the source database is increased, the contention of log level is generated, and the influence on the performance of the source database is generated when the logminer task is more than 5), that is, the influence of a data increment updating process on the source database is reduced.

The log mining of each database has small difference, Tapdata is combined with a sharing mining mode to optimize the mining of each database, and by taking Oracle as an example, through a large number of internal experiments and verification of actual scenes of guests, the optimized items of logminers are summarized, and the difference in performance is 50-100 times. For the precondition of mining, a fixed configuration item is formed, and the fixed configuration item comprises the optimized setting of a source end Oracle, so that the performance loss of the source end is reduced, and the performance of a Loginer is improved. In the excavation process, in different synchronization scenes, a plurality of key parameters are set, so that the current synchronization efficiency and the synchronization state can be identified, the synchronization parameters are dynamically adjusted, and the optimal synchronization efficiency is ensured all the time. For example, a parameter cursor fetch size, which determines the number of rows of the Oracle sending the log to the system each time, in a large transaction scenario, the parameter needs to be set to a larger value, for example, 1000, at this time, the synchronization efficiency has a difference of 50-100 times, but is not favorable for the synchronization of normal transactions, because the Oracle captures a sufficient number of fetch sizes based on the setting of the value and then pushes the captured number to the system, if the set value is 1000, it is necessary to capture 1000 events and then push the captured events, which is not favorable for the synchronization of small transactions, when small transactions frequently occur, because the fetch sizes are too large, synchronization delay is caused, and at this time, the system dynamically adjusts the parameter according to the synchronization efficiency, without manual intervention, and thus, the optimal synchronization efficiency can be maintained.

Example of a scene

When a customer policy view (CustomerPolicy) is constructed, the specific technical implementation process is as follows:

a. a client table (Customer) and a Policy table (Policy), wherein a client has a plurality of policies and belongs to the association relationship of one (client) to a plurality (Policy), and the client ID (ID) of the client table is associated with the client ID (Customer _ ID) of the Policy table;

b. in the actual data acquisition process, a timing problem can be encountered, and the following is a timing problem scene and a processing mode:

i. an old client creates a policy newly, a client (client) record is read first during data acquisition, and client data is written into a target (client policy) according to client id (id); and secondly, reading Policy (Policy) data again, performing associated writing according to the client ID (customer _ ID) in the Policy information and the client ID (ID) of the target (customer Policy), and combining into an embedded array.

A new client creates a Policy, when the client data and Policy data are submitted in the same transaction, the data collection may read Policy information first, but the client information does not exist in the target (Customerpolicy), so the Policy information is cached in a temporary table; when the client (Customer) information is read later, the policy information corresponding to the Customer is found in the temporary table according to the Customer ID (id), and the policy information is synchronized to the target table (Customer policy).

Although the present application has been described with reference to a few embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

Claims

1. A method for real-time table widening based on streaming data, characterized in that the method at least comprises the following steps:

the data engine collects data and stores the data in a database;

converting the collected data into structured data;

saving the structured data to a data cache library;

2. The method of claim 1, wherein the data cache library is montodb.

3. The method of claim 1, wherein the structured data is stored in a unified data buffer layer of the data cache library.

4. The method of claim 3, wherein the uniform data buffer layer is an FDM layer.

5. The method of claim 1, wherein a log collector forms a data log while the data engine collects data, and stores the data log in a log storage center of the data cache;

6. The method of claim 5, wherein the model computation engine receives a data update event comprising:

updating logs by data received by the model computing engine;

sending a database sharing association instruction;

7. The method of claim 5, wherein for the structured data having a mapping relation with the target model, a table containing the structured data, i.e. a sub-table, is queried from the data cache library, and the sub-table is merged with the target main table data to update the target main table;