CN114265875A - Method for establishing wide table in real time based on stream data - Google Patents

Method for establishing wide table in real time based on stream data Download PDF

Info

Publication number
CN114265875A
CN114265875A CN202210204367.XA CN202210204367A CN114265875A CN 114265875 A CN114265875 A CN 114265875A CN 202210204367 A CN202210204367 A CN 202210204367A CN 114265875 A CN114265875 A CN 114265875A
Authority
CN
China
Prior art keywords
data
log
structured
target
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210204367.XA
Other languages
Chinese (zh)
Other versions
CN114265875B (en
Inventor
唐建法
肖贝贝
黄佳钦
欧阳世雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Titanium Platinum Data Co ltd
Original Assignee
Shenzhen Titanium Platinum Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Titanium Platinum Data Co ltd filed Critical Shenzhen Titanium Platinum Data Co ltd
Priority to CN202210204367.XA priority Critical patent/CN114265875B/en
Publication of CN114265875A publication Critical patent/CN114265875A/en
Application granted granted Critical
Publication of CN114265875B publication Critical patent/CN114265875B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a real-time broad list establishing method based on stream data, which at least comprises the following steps: the data engine collects data and stores the data in a database; converting the collected data into structured data; saving the structured data to a data cache library; and the model calculation engine receives a data updating event, extracts the structured data associated with the target main table according to the fact whether the structured data and the target model have a mapping relation, and updates the structured data into the target main table. The method has the characteristics of high instantaneity, high flexibility, quick response, global model association and cross-library disorder support.

Description

Method for establishing wide table in real time based on stream data
Technical Field
The application relates to a method for establishing a broad list in real time based on stream data, belonging to the field of data processing.
Background
Whether traditional warehouse construction or modern data-driven application business, most data development work needs to construct new data tables to serve various analysis models or business models. Especially, because the data volume of the internet company is generally large, the multi-table association mode is not adopted. In this case, it is a very mainstream data development work to construct a wide table for supporting various business queries.
Traditional modeling and table building are all completed based on SQL. The SQL-based approach has these limitations:
1. disjointing the target model table and the original table data:
SQL is a way to perform query calculations based on a fixed data set and output to a target table, and is suitable for regular batch operations. If the original table involved is large, this operation often needs to be performed for several minutes or even hours, which may cause the target data to fail to reflect the current real state.
2. Concurrent task performance bottleneck:
due to the full-table computation model of traditional modeling, the number of tasks that can be performed simultaneously in several bins cannot substantially exceed 2-3. This severely limited the ability of conventional data platform batch modeling.
Disclosure of Invention
According to one aspect of the application, a method for building a wide table in real time based on streaming data is provided, and the method has the characteristics of high real-time performance, high flexibility, quick response, global model association and cross-library disorder support.
The method for establishing the width table in real time based on the stream data at least comprises the following steps:
the data engine collects data and stores the data in a database;
converting the collected data into structured data;
saving the structured data to a data cache library;
and the model calculation engine receives a data updating event, extracts the structured data associated with the target main table according to the fact whether the structured data and the target model have a mapping relation, and updates the structured data into the target main table.
Optionally, the data cache library is a MongoDB.
Optionally, the structured data is stored in a unified data buffer layer of the data cache library.
Optionally, the unified data buffer layer is an FDM layer.
Optionally, when the data engine collects data, a log collector forms a data log and stores the data log in a log storage center of the data cache library;
and the log storage center synchronizes the data log with the task collector, so that the data log and the user target database are shared.
Optionally, the model computation engine receives a data update event, including:
updating logs by data received by the model computing engine;
sending a database sharing association instruction;
if the data log and the user target database are successfully shared, gradually judging whether the data updating log comprises a log acquisition task or not, whether the log acquisition task comprises a required table or not, and whether the initial acquisition time of the log acquisition task is earlier than the initial time of the last synchronous task or not, if so, reading the data log through the log storage center to serve as an incremental data log;
and if the data log is not successfully shared with the user target database, or the data updating log does not contain a log acquisition task, or the log acquisition task does not contain a required table, or the initial acquisition time of the log acquisition task is later than the initial time of the last synchronous task, the model computing engine directly reads the data log in the database to be used as an incremental data log.
Optionally, for the structured data having a mapping relationship with the target model, querying a table, i.e., a sub-table, containing the structured data from the data cache library, merging the sub-table and the target main table data, and updating the target main table;
and for the structured data which does not have a mapping relation with the target model, writing the structured data into a sub-table cache table of the data cache library, and updating the sub-table into the target main table according to the newly established mapping relation.
The beneficial effects that this application can produce include:
1) the method for building the wide table in real time based on the streaming data provided by the application monitors the incremental change of the source data in real time, can achieve the time delay of a sub-second level, and has higher real-time performance compared with the traditional timing polling mode (based on SQL statements);
2) the method for establishing the broad table in real time based on the streaming data has the multi-model storage capacity, supports the modification of the model at any time, and is more timely in response to the requirement of the change of the model and more global flexibility compared with the traditional relational model (the model is created in advance and the modification cost is higher).
3) The method for establishing the broad table in real time based on the streaming data, provided by the application, is used for simply configuring based on the existing data model and establishing a data synchronization task, so that data modeling can be realized; compared with the prior art that the existing data processing platform (Flink) needs the cooperation of code coding, the development period is longer, the method has the capability of low code data development and even zero code, and the purpose of quickly responding to the service is achieved.
4) According to the method for establishing the wide table in real time based on the streaming data, a unified data cache layer is arranged in the synchronization process, so that the associated global data is supported, and compared with a method that a Flink supports time window time cache, the method is stronger in function.
5) The method for establishing the broad table in real time based on the streaming data has the advantages that all data sources are converged to a uniform data cache layer; when the sub-table events are firstly input, the sub-table events are firstly written into the data cache layer, and the subsequent main table events can inquire the associated sub-table data from the data cache layer to perform model combination calculation.
Drawings
FIG. 1 is a schematic flow chart of one embodiment of the method of the present application;
FIG. 2 is a schematic flow chart of a sharing mode in the method of the present application;
fig. 3 is a schematic flow chart of an update event according to the present application.
Detailed Description
The present application will be described in detail with reference to examples, but the present application is not limited to these examples.
The application relates to a method for establishing a width table in real time based on stream data, the flow of which is shown in figure 1, and the method comprises the following steps:
the data engine collects data and stores the data in a database;
as shown in fig. 2, when the data engine collects data, it needs to mine and share log data, specifically:
a log collector forms a data log and stores the data log to a log storage center of the data cache library;
and the log storage center synchronizes the data log with the task collector, so that the data log and the user target database are shared.
Converting the collected data into structured data;
saving the structured data to a data cache library;
and the model calculation engine receives a data updating event, extracts the structured data associated with the target main table according to the fact whether the structured data and the target model have a mapping relation, and updates the structured data into the target main table.
As shown in FIG. 3, the model computation engine receives data update events, including:
updating logs by data received by the model computing engine;
sending a database sharing association instruction;
if the data log and the user target database are successfully shared, gradually judging whether the data updating log comprises a log acquisition task or not, whether the log acquisition task comprises a required table or not, and whether the initial acquisition time of the log acquisition task is earlier than the initial time of the last synchronous task or not, if so, reading the data log through the log storage center to serve as an incremental data log;
and if the data log is not successfully shared with the user target database, or the data updating log does not contain a log acquisition task, or the log acquisition task does not contain a required table, or the initial acquisition time of the log acquisition task is later than the initial time of the last synchronous task, the model computing engine directly reads the data log in the database to be used as an incremental data log.
For the structured data which has a mapping relation with a target model, inquiring a table containing the structured data, namely a sub-table, from the data cache library, merging the sub-table and the target main table data, and updating the target main table;
and for the structured data which does not have a mapping relation with the target model, writing the structured data into a sub-table cache table of the data cache library, and updating the sub-table into the target main table according to the newly established mapping relation.
The method extracts the logs of the source database into a data cache library (MongoDB), and the subsequent tasks serve as consumers, directly acquire the required data from the data cache library and then transmit the required data to the subsequent task processing. At present, the mining sharing mode of log data can be used for Oracle, Mysql, PostgreSQL, Sqlserver, MongoDB and the like. The advantages of MongoDB as a data cache library are as follows: the high performance of the memory database can be persisted like MySQL. Because the log data is cached in the MongoDB (indirectly on the hard disk), the problem that the source cannot be traced after a large transaction scene, a specific transaction and a synchronous fault can be quickly solved.
Under the scene of processing a large transaction, the mining sharing of log data takes the cache MongoDB as a carrier to record all data changes in the large transaction, when the large transaction is finished, all the data changes are directly taken out from the MongoDB for consumption, other transactions irrelevant to the large transaction are normally processed, and the data changes and the other transactions are not interfered with each other, so that the influence of the large transaction on the synchronization performance is reduced. In the large transaction scene, the amount of data change generated is large generally, and because MongoDB is used as an intermediate carrier, the synchronization time cannot be increased due to the characteristic of millisecond-level insertion/query speed, and the synchronization time is actually measured to be 3.2 times larger than the maximum transaction processed by Oracle OGG.
In the case of specific transactions (partial rollback, violating constraint transactions, etc.), because the various database implementations are different, the transactions of the source library often have very complex non-conventional scenarios, such as partial rollback, transactions violating database constraints, etc. Directly processing these complex scenes in the source library requires consuming a large amount of source library resources, which affects the service. After the shared log based on the cache is used, the accumulated transactions can be analyzed and matched more comprehensively only aiming at various complex scenes, and corresponding processing is carried out.
The non-cached mode logs are typically cleared or archived quickly. After caching, due to the persistence capability, the date and the size of the log cache can be dynamically adjusted as required. Therefore, the data are traced when the data are inconsistent, historical change data are provided, and the tracing after the synchronous fault is more convenient.
In a log data mining sharing mode, for example, in a synchronous scene in which Oracle is a source database, the system can obtain all target data and real-time changes thereof in the database only by 1 logminer process (note: when one logminer process is started in Oracle, the memory, disk and cpu loss of the source database is increased, the contention of log level is generated, and the influence on the performance of the source database is generated when the logminer task is more than 5), that is, the influence of a data increment updating process on the source database is reduced.
The log mining of each database has small difference, Tapdata is combined with a sharing mining mode to optimize the mining of each database, and by taking Oracle as an example, through a large number of internal experiments and verification of actual scenes of guests, the optimized items of logminers are summarized, and the difference in performance is 50-100 times. For the precondition of mining, a fixed configuration item is formed, and the fixed configuration item comprises the optimized setting of a source end Oracle, so that the performance loss of the source end is reduced, and the performance of a Loginer is improved. In the excavation process, in different synchronization scenes, a plurality of key parameters are set, so that the current synchronization efficiency and the synchronization state can be identified, the synchronization parameters are dynamically adjusted, and the optimal synchronization efficiency is ensured all the time. For example, a parameter cursor fetch size, which determines the number of rows of the Oracle sending the log to the system each time, in a large transaction scenario, the parameter needs to be set to a larger value, for example, 1000, at this time, the synchronization efficiency has a difference of 50-100 times, but is not favorable for the synchronization of normal transactions, because the Oracle captures a sufficient number of fetch sizes based on the setting of the value and then pushes the captured number to the system, if the set value is 1000, it is necessary to capture 1000 events and then push the captured events, which is not favorable for the synchronization of small transactions, when small transactions frequently occur, because the fetch sizes are too large, synchronization delay is caused, and at this time, the system dynamically adjusts the parameter according to the synchronization efficiency, without manual intervention, and thus, the optimal synchronization efficiency can be maintained.
Example of a scene
When a customer policy view (CustomerPolicy) is constructed, the specific technical implementation process is as follows:
a. a client table (Customer) and a Policy table (Policy), wherein a client has a plurality of policies and belongs to the association relationship of one (client) to a plurality (Policy), and the client ID (ID) of the client table is associated with the client ID (Customer _ ID) of the Policy table;
b. in the actual data acquisition process, a timing problem can be encountered, and the following is a timing problem scene and a processing mode:
i. an old client creates a policy newly, a client (client) record is read first during data acquisition, and client data is written into a target (client policy) according to client id (id); and secondly, reading Policy (Policy) data again, performing associated writing according to the client ID (customer _ ID) in the Policy information and the client ID (ID) of the target (customer Policy), and combining into an embedded array.
A new client creates a Policy, when the client data and Policy data are submitted in the same transaction, the data collection may read Policy information first, but the client information does not exist in the target (Customerpolicy), so the Policy information is cached in a temporary table; when the client (Customer) information is read later, the policy information corresponding to the Customer is found in the temporary table according to the Customer ID (id), and the policy information is synchronized to the target table (Customer policy).
Although the present application has been described with reference to a few embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

Claims (7)

1. A method for real-time table widening based on streaming data, characterized in that the method at least comprises the following steps:
the data engine collects data and stores the data in a database;
converting the collected data into structured data;
saving the structured data to a data cache library;
and the model calculation engine receives a data updating event, extracts the structured data associated with the target main table according to the fact whether the structured data and the target model have a mapping relation, and updates the structured data into the target main table.
2. The method of claim 1, wherein the data cache library is montodb.
3. The method of claim 1, wherein the structured data is stored in a unified data buffer layer of the data cache library.
4. The method of claim 3, wherein the uniform data buffer layer is an FDM layer.
5. The method of claim 1, wherein a log collector forms a data log while the data engine collects data, and stores the data log in a log storage center of the data cache;
and the log storage center synchronizes the data log with the task collector, so that the data log and the user target database are shared.
6. The method of claim 5, wherein the model computation engine receives a data update event comprising:
updating logs by data received by the model computing engine;
sending a database sharing association instruction;
if the data log and the user target database are successfully shared, gradually judging whether the data updating log comprises a log acquisition task or not, whether the log acquisition task comprises a required table or not, and whether the initial acquisition time of the log acquisition task is earlier than the initial time of the last synchronous task or not, if so, reading the data log through the log storage center to serve as an incremental data log;
and if the data log is not successfully shared with the user target database, or the data updating log does not contain a log acquisition task, or the log acquisition task does not contain a required table, or the initial acquisition time of the log acquisition task is later than the initial time of the last synchronous task, the model computing engine directly reads the data log in the database to be used as an incremental data log.
7. The method of claim 5, wherein for the structured data having a mapping relation with the target model, a table containing the structured data, i.e. a sub-table, is queried from the data cache library, and the sub-table is merged with the target main table data to update the target main table;
and for the structured data which does not have a mapping relation with the target model, writing the structured data into a sub-table cache table of the data cache library, and updating the sub-table into the target main table according to the newly established mapping relation.
CN202210204367.XA 2022-03-03 2022-03-03 Method for establishing wide table in real time based on stream data Active CN114265875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210204367.XA CN114265875B (en) 2022-03-03 2022-03-03 Method for establishing wide table in real time based on stream data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210204367.XA CN114265875B (en) 2022-03-03 2022-03-03 Method for establishing wide table in real time based on stream data

Publications (2)

Publication Number Publication Date
CN114265875A true CN114265875A (en) 2022-04-01
CN114265875B CN114265875B (en) 2022-07-22

Family

ID=80834024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210204367.XA Active CN114265875B (en) 2022-03-03 2022-03-03 Method for establishing wide table in real time based on stream data

Country Status (1)

Country Link
CN (1) CN114265875B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501715A (en) * 2023-04-28 2023-07-28 成都赛力斯科技有限公司 Real-time association updating method and device for multi-table full data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347662A (en) * 2019-07-12 2019-10-18 之江实验室 A kind of multicenter medical data construction standard system based on generic data model
CN110362632A (en) * 2019-07-22 2019-10-22 无限极(中国)有限公司 A kind of method of data synchronization, device, equipment and computer readable storage medium
WO2021046750A1 (en) * 2019-09-11 2021-03-18 华为技术有限公司 Data redistribution method, device, and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347662A (en) * 2019-07-12 2019-10-18 之江实验室 A kind of multicenter medical data construction standard system based on generic data model
CN110362632A (en) * 2019-07-22 2019-10-22 无限极(中国)有限公司 A kind of method of data synchronization, device, equipment and computer readable storage medium
WO2021046750A1 (en) * 2019-09-11 2021-03-18 华为技术有限公司 Data redistribution method, device, and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
江勇等: "基于变化数据捕获机制的分布式缓存一致性策略", 《计算机系统应用》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501715A (en) * 2023-04-28 2023-07-28 成都赛力斯科技有限公司 Real-time association updating method and device for multi-table full data
CN116501715B (en) * 2023-04-28 2024-03-12 重庆赛力斯凤凰智创科技有限公司 Real-time association updating method and device for multi-table full data

Also Published As

Publication number Publication date
CN114265875B (en) 2022-07-22

Similar Documents

Publication Publication Date Title
US10180946B2 (en) Consistent execution of partial queries in hybrid DBMS
EP3968175B1 (en) Data replication method and apparatus, and computer device and storage medium
CN112445863B (en) Data real-time synchronization method and system
CN110837585B (en) Multi-source heterogeneous data association query method and system
CN112286941B (en) Big data synchronization method and device based on Binlog + HBase + Hive
CN111563102A (en) Cache updating method, server, system and storage medium
CN111125260A (en) Data synchronization method and system based on SQL Server
CN114579614A (en) Real-time data full-scale acquisition method and device and computer equipment
CN113535777A (en) Database query method, device and system
US11061899B2 (en) Query optimization in hybrid DBMS
CN110309233A (en) Method, apparatus, server and the storage medium of data storage
KR20190063835A (en) System for processing real-time data modification of in-memory database
CN114265875B (en) Method for establishing wide table in real time based on stream data
CN114153809A (en) Parallel real-time incremental statistic method based on database logs
CN118331962A (en) Data processing method, device, storage medium and database based on time schedule
US11449521B2 (en) Database management system
CN113961546A (en) Real-time query library design method supporting online analysis statistics
CN115098486A (en) Real-time data acquisition method based on customs service big data
CN114003660B (en) Method and device for efficiently synchronizing real-time data to click House based on flash
CN111258977A (en) Tax big data storage and analysis platform
Liu et al. Optimizing ETL by a two-level data staging method
CN118410068B (en) Database SQL query optimization method, terminal and storage medium
US11914655B2 (en) Mutation-responsive documentation generation based on knowledge base
CN115952200B (en) MPP architecture-based multi-source heterogeneous data aggregation query method and device
US11669535B1 (en) Maintaining at a target database system a copy of a source table of a source database system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant