CN113553320B - Data quality monitoring method and device

Info

Publication number: CN113553320B
Application number: CN202110866720.6A
Authority: CN
Inventors: 张明磊; 喻兆靖; 张杨; 郑志升
Original assignee: Shanghai Bilibili Technology Co Ltd
Current assignee: Shanghai Bilibili Technology Co Ltd
Priority date: 2021-07-29
Filing date: 2021-07-29
Publication date: 2022-09-02
Anticipated expiration: 2041-07-29
Also published as: CN113553320A

Abstract

The embodiment of the application provides a data quality monitoring method, which comprises the following steps: judging whether the number of pieces of source data synchronously output at a given transaction time is the same as the number of pieces of data synchronously input at the same transaction time; if the numbers are the same, acquiring, at intervals of a first preset time, first data synchronously output at a first transaction time from the initial data storage node and second data synchronized to the HUDI at the first transaction time from the HUDI; determining a first monitoring result according to the first data and the second data; extracting data with simulation identification information from the HUDI at intervals of a second preset time; determining a second monitoring result according to the extracted data and all simulation data pre-inserted into the initial data storage node; and determining a final data quality monitoring result according to the first monitoring result and the second monitoring result. The method can improve data quality.

Description

Data quality monitoring method and device

Technical Field

The embodiment of the application relates to the technical field of data processing, in particular to a data quality monitoring method and device.

Background

With the rapid development of network technology, many enterprises and groups have addressed the problems of data sharing and isolated data islands by building workflow engines that synchronize the various types of data collected each day by various application systems into a data lake (HUDI). In the prior art, a workflow engine generally comprises a plurality of data computing nodes and a plurality of data storage nodes, and the various types of source data stored in a starting data storage node are synchronized into the data lake (HUDI) after being processed by the data computing nodes and data storage nodes in the workflow engine.

In order to improve the data quality in a data lake (HUDI), in the prior art, after the data lake acquires source data from the various application systems, the data is governed inside the data lake by cleaning and integrating the source data. However, this approach has a disadvantage: the volume of source data is generally very large while the computing resources available for cleaning and integration in the data lake are limited, so cleaning and integration are inefficient and cannot meet the demand of processing the large data volume in the data lake.

Therefore, in order to improve the data quality in the data lake, a scheme is urgently needed that audits data quality while the data is entering the data lake, so that problems arising when data enters the lake are found promptly and the quality of the data entering the lake is improved.

Disclosure of Invention

The embodiment of the application aims to provide a data quality monitoring method that can solve the problem in the prior art that data quality problems arising when data enters a lake cannot be found in time.

One aspect of the embodiments of the present application provides a data quality monitoring method, which is applied to a workflow engine, where the workflow engine is configured to synchronize source data stored in a starting data storage node into a data lake HUDI, and the data quality monitoring method includes:

judging whether the number of source data synchronously output from the initial data storage node at the same transaction time is the same as the number of data synchronously input into the data lake HUDI;

if the number of data is determined to be the same, acquiring, at intervals of a first preset time, first data synchronously output at a first transaction time from the initial data storage node and second data synchronized to the data lake HUDI at the first transaction time from the data lake HUDI;

determining a first data quality monitoring result according to the first data and the second data;

extracting data with simulation identification information from the data lake HUDI at intervals of a second preset time;

determining a second data quality monitoring result according to the extracted data and all simulation data pre-inserted into the initial data storage node;

and determining a final data quality monitoring result according to the first data quality monitoring result and the second data quality monitoring result.

Optionally, the method further includes:

in the process of synchronizing the source data to the data lake HUDI, counting the number of pieces of source data synchronously output from the initial data storage node and the number of pieces of data synchronously input to the data lake HUDI at the same transaction time, wherein the source data comprises simulation data with simulation identification information pre-inserted into the initial data storage node.

Optionally, the workflow engine includes at least one data computing node and at least one data storage node, the data storage nodes correspond to the data computing nodes one to one, the starting data storage node is a first data storage node in the workflow engine, and in the process of synchronizing the source data to the data lake HUDI, counting the number of source data synchronously output from the starting data storage node at the same transaction time and the number of data synchronously input to the data lake HUDI includes:

in the process of synchronizing the source data to the data lake HUDI, counting the number of input data and output data of each data computing node in the same transaction time;

the judging whether the number of source data synchronously output from the initial data storage node and the number of data synchronously input to the data lake HUDI at the same transaction time are the same comprises:

and respectively judging whether the number of input data and the number of output data of each data computing node are the same at the same transaction time.

Optionally, the first preset time includes a plurality of first transaction times, and the acquiring, every first preset time, first data synchronously output at the first transaction time from the initial data storage node and second data synchronized to the data lake HUDI at the first transaction time from the data lake HUDI includes:

randomly acquiring, every first preset time, first data synchronously output at a second transaction time from the initial data storage node and second data synchronized to the data lake HUDI at the second transaction time from the data lake HUDI, wherein the second transaction time is one of the plurality of first transaction times.

Optionally, the method further includes:

and inserting a preset number of pieces of simulation data into the initial data storage node every third preset time, wherein the simulation data have simulation identification information.

Optionally, before the step of inserting a preset number of pieces of simulation data into the initial data storage node every third preset time, the method further includes:

and simulating and generating the simulation data by adopting a preset data simulator.

Optionally, the determining a final data quality monitoring result according to the first data quality monitoring result and the second data quality monitoring result includes:

and when the first data quality monitoring result is that the first data is the same as the second data and the second data quality monitoring result is that the extracted data is the same as all the simulation data pre-inserted into the initial data storage node, determining that the final data quality monitoring result is that no problem exists in the data quality.

The present application further provides a data quality monitoring apparatus applied to a workflow engine, where the workflow engine is configured to synchronize source data stored in a starting data storage node to a data lake HUDI, and the data quality monitoring apparatus includes:

the judging module is used for judging whether the number of the source data synchronously output from the initial data storage node and the number of the data synchronously input into the data lake HUDI are the same at the same transaction time;

the acquisition module is used for, if the number of the data is judged to be the same, acquiring, every first preset time, first data synchronously output at a first transaction time from the initial data storage node and second data synchronized to the data lake HUDI at the first transaction time from the data lake HUDI;

the first determining module is used for determining a first data quality monitoring result according to the first data and the second data;

the extraction module is used for extracting data with simulation identification information from the data lake HUDI at intervals of second preset time;

the second determining module is used for determining a second data quality monitoring result according to the extracted data and all simulation data which are inserted into the initial data storage node in advance;

and the third determining module is used for determining a final data quality monitoring result according to the first data quality monitoring result and the second data quality monitoring result.

In the data quality monitoring method provided by the embodiment of the application, in the process of writing source data into a data lake, it is judged whether the number of pieces of source data synchronously output from the initial data storage node and the number of pieces of data synchronously input into the data lake HUDI at the same transaction time are the same. When the numbers are judged to be the same, first data synchronously output at a first transaction time is acquired from the initial data storage node every first preset time, second data synchronized to the data lake HUDI at the first transaction time is acquired from the data lake HUDI, the first data and the second data are compared, and a first data quality monitoring result is output according to this first comparison result. Because the data is monitored during the lake entering process, data quality problems arising while the data enters the lake can be found promptly and accurately, which avoids subsequent cleaning and integration of the data in the data lake and improves the data quality in the data lake.

Drawings

Fig. 1 schematically illustrates an application environment of a data quality monitoring method according to an embodiment of the present application;

Fig. 2 schematically illustrates a flow chart of a data quality monitoring method according to an embodiment of the present application;

Fig. 3 schematically illustrates an architecture of a workflow engine in an embodiment of the present application;

Fig. 4 schematically illustrates a program block diagram of a data quality monitoring apparatus according to an embodiment of the present application;

Fig. 5 schematically illustrates a hardware architecture diagram of a computer device suitable for implementing the data quality monitoring method according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the description relating to "first", "second", etc. in the present invention is for descriptive purposes only and is not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.

Fig. 1 schematically shows an architecture diagram of an application environment of a data quality monitoring method according to an embodiment of the present application. In an exemplary embodiment, a system of the application environment may include the following parts: a workflow engine 1, a HUDI 2, and a data quality monitoring device 3 (Data Quality Center).

The workflow engine 1, alternatively referred to as a data flow calculation engine, may be Airflow, a platform for programmatically authoring, scheduling and monitoring workflows. Based on a Directed Acyclic Graph (DAG), Airflow can define a group of tasks with dependencies, which are executed in sequence according to those dependencies.

HUDI 2 is commonly described as a data lake or a database, although strictly speaking it is neither. Hive was designed as a computing framework, but in practice Spark is now often used to compute over files in HDFS based on the schema information and metadata provided by Hive, so Hive is increasingly overlooked as a computing engine and is mostly regarded as a "database" (although it is not one). Hudi complements part of Hive's functions and can even provide near real-time data ingestion and query.

Hudi is used to manage files in HDFS or other storage systems and creates corresponding tables for them, which can likewise be computed over using Hive or Spark, while problems such as Hadoop's small files and slow queries are solved.

Hudi has the following characteristics: fast upserts with pluggable indexes, atomic publication of data with rollback support, snapshot isolation between writers and queries, savepoints for data recovery, management of file sizes and layout using statistics, asynchronous compaction of row and columnar data, and timeline metadata for tracking lineage.

The Data Quality monitoring device 3 may be a Data Quality Center (DQC) for monitoring Data Quality, and may automatically monitor Data Quality in a Data processing task process by configuring a Data Quality verification rule.

The DQC mainly has two functions: data monitoring and data cleaning. Data monitoring means that data quality is monitored and an alarm is raised; data monitoring does not process the data output, and the alarm recipient needs to judge and decide how to handle the data. Data cleaning means cleaning the data that does not conform to the established rules, so as to ensure that the final data output does not contain dirty data; data cleaning does not trigger an alarm.

Fig. 2 schematically shows a flowchart of a data quality monitoring method according to an embodiment of the present application. The method is applied to a workflow engine for synchronizing source data stored in a starting data storage node into a data lake HUDI.

Referring to Fig. 2, the data quality monitoring method includes:

Step S21, determining whether the number of source data synchronously output from the initial data storage node and the number of data synchronously input to the data lake HUDI are the same at the same transaction time.

Specifically, the workflow engine synchronizes the source data stored in the starting data storage node into the data lake HUDI in units of one transaction time.

Wherein the transaction time is a time point (alternatively referred to as commit time) on a time axis (Timeline) of all operations performed by data lake HUDI on the data set. Timeline is made up of a set of Instant objects that act on a table. The Instant indicates that the table is operated at a certain Time point, so as to achieve the representation of a certain State, so the Instant includes three contents, namely an Instant Action, an Instant Time and an Instant State, and the meanings of the three contents are as follows:

Instant Action: the type of operation performed on the Hudi table; there are currently 6 operation types: COMMITS, CLEANS, DELTA_COMMIT, COMPACTION, ROLLBACK, and SAVEPOINT.

Instant Time: represents a timestamp that must be monotonically increasing in the chronological order in which the Instant Action started execution.

Instant State: indicating the state of the Hudi table after the execution of an operation (Instant Action) on the Hudi table at a specified point in Time (Instant Time), which currently includes 3 states of REQUESTED (scheduled but not initialized), INFLIGHT (currently executing), COMPLETED (operation execution complete).
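The three parts of an Instant described above can be pictured with a small data model. The following is only an illustrative Python sketch, not Hudi's own classes: the names `Instant`, `InstantState` and `completed_commit_times`, the lowercase action strings and the timestamp format are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class InstantState(Enum):
    REQUESTED = "requested"   # scheduled but not yet initialized
    INFLIGHT = "inflight"     # currently executing
    COMPLETED = "completed"   # operation finished


@dataclass(frozen=True)
class Instant:
    action: str          # hypothetical labels, e.g. "commit", "clean"
    time: str            # sortable timestamp string (format assumed)
    state: InstantState


def completed_commit_times(timeline):
    """Return the instant times of completed commits, oldest first."""
    return sorted(
        i.time
        for i in timeline
        if i.action in ("commit", "delta_commit")
        and i.state is InstantState.COMPLETED
    )


timeline = [
    Instant("commit", "20210729100500", InstantState.COMPLETED),
    Instant("clean", "20210729100600", InstantState.COMPLETED),
    Instant("commit", "20210729101000", InstantState.INFLIGHT),
]
print(completed_commit_times(timeline))
```

In this sketch only completed commits would be treated as transaction times that are safe to compare; an in-flight commit is skipped because its data may not be fully written yet.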

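In an exemplary embodiment, the method further comprises:

in the process of synchronizing the source data to the data lake HUDI, counting the number of pieces of source data synchronously output from the initial data storage node and the number of pieces of data synchronously input to the data lake HUDI at the same transaction time, wherein the source data comprises simulation data with simulation identification information pre-inserted into the initial data storage node.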

Specifically, the source data comprises various service data collected by various application systems and stored in a starting data storage node in a workflow engine, and simulation data with simulation identification information and pre-inserted into the starting data storage node. The simulation identification information may be used to distinguish real service data from simulated service data.

It will be appreciated that the initial data storage node for storing the source data may comprise one or more data storage systems, which may be MySQL, Kafka, Redis, Hive, or the like. In this embodiment, the initial data storage node is preferably a MySQL database.

In this embodiment, in the process of synchronizing the source data to data lake HUDI, each data has a transaction time when synchronized to data lake HUDI, and assuming that 200 transaction times can be divided in one day, the data synchronized at different times has one of the 200 transaction times. Therefore, in the embodiment, during the process of synchronizing data to the data lake, the number of pieces of source data synchronously output from the starting data storage node and the number of pieces of data synchronously input to the data lake HUDI at the same transaction time may be counted, for example, the number of pieces of source data synchronously output from the starting data storage node within a transaction time 10:05 is 20000, and the number of pieces of data synchronously input to the data lake HUDI within a transaction time 10:05 is also 20000.
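The count comparison per transaction time could be sketched as follows. This is a minimal illustration in Python under stated assumptions: records are plain dictionaries carrying a hypothetical `transaction_time` field, and the function names are invented for the example; the patent does not prescribe a record layout or an API.

```python
from collections import Counter


def count_by_transaction_time(records):
    """Count how many records carry each transaction time."""
    return Counter(r["transaction_time"] for r in records)


def mismatched_transaction_times(source_records, lake_records):
    """Map each transaction time whose counts differ to (source, lake) counts."""
    src = count_by_transaction_time(source_records)
    dst = count_by_transaction_time(lake_records)
    return {
        t: (src.get(t, 0), dst.get(t, 0))
        for t in sorted(set(src) | set(dst))
        if src.get(t, 0) != dst.get(t, 0)
    }


# Three records leave the source at 10:05; only two reach the lake.
source = [{"id": i, "transaction_time": "10:05"} for i in range(3)]
lake = source[:2]
print(mismatched_transaction_times(source, lake))
```

An empty result means the counts agree for every transaction time, which is the precondition for the record-level comparison in the later steps.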

In an exemplary embodiment, the workflow engine includes at least one data computing node and at least one data storage node, the data storage nodes are in one-to-one correspondence with the data computing nodes, the workflow engine is configured to synchronize source data stored in a starting data storage node into a data lake HUDI, and the starting data storage node is a first data storage node in the workflow engine.

In this embodiment, each data computing node is configured to acquire data from a corresponding data storage node, and process the acquired data according to a preset rule to obtain a data processing result. The data computing node may be Flink or Spark.

Flink is an open-source stream processing framework developed by the Apache Software Foundation. Its core is a distributed streaming dataflow engine written in Java and Scala that can perform stateful computations on unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.

Apache Spark is a fast, general-purpose computing engine designed specifically for large-scale data processing. Spark is a general-purpose parallel framework similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab. Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, intermediate job output can be kept in memory, so reading and writing HDFS is no longer needed; Spark is therefore better suited to MapReduce-style algorithms that require iteration, such as those used in data mining and machine learning. Spark is an open-source cluster computing environment similar to Hadoop, but differences between the two make Spark perform better on certain workloads: Spark enables in-memory distributed datasets, which, in addition to supporting interactive queries, can optimize iterative workloads. Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, so Scala can manipulate distributed datasets as easily as local collection objects.

It should be noted that the Flink or Spark generally provides data computing services externally by way of a Flink cluster or Spark cluster.

In this embodiment, the data storage nodes correspond to the data computing nodes one to one, and are used for storing data. The data storage node may be MySQL, Kafka, Redis, Hive, or the like.

MySQL is a relational database management system that stores data in separate tables instead of putting all data in one large repository, which increases speed and flexibility. The SQL language used by MySQL is the most commonly used standardized language for accessing databases. MySQL software adopts a dual-licensing policy and is available in a community edition and a commercial edition. Because of its small size, high speed, low total cost of ownership and, in particular, its open source code, MySQL is generally selected as the website database for the development of small and medium-sized websites.

Kafka is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all of the user activity stream data of a website, where activity stream data includes web browsing, searching, and other user behavior data.

Redis is a key-value storage system. Similar to Memcached, it caches data in memory to ensure efficiency, but it supports more stored value types, including string, list, set, zset (sorted set), and hash. These data types support push/pop, add/remove, intersection, union and difference, and richer operations, all of which are atomic. On this basis, Redis supports several different ways of sorting. The difference from Memcached is that Redis periodically writes updated data to disk or appends each modification operation to a log file, and implements master-slave synchronization on this basis.

Hive is a data warehouse tool based on Hadoop that is used for data extraction, transformation and loading, and provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop. The Hive data warehouse tool can map structured data files to database tables, provide SQL query functions, and convert SQL statements into MapReduce tasks for execution. Hive has a low learning cost: rapid MapReduce statistics can be implemented through SQL-like statements, which makes MapReduce simpler to use, and no dedicated MapReduce application needs to be developed. Hive is well suited to statistical analysis in data warehouses.

By way of example, referring to Fig. 3, the workflow engine includes MySQL 30 (the first data storage node, which is also the initial data storage node), CDC Flink 31 (the first data computing node), Kafka 32 (the second data storage node), and HUDI Flink 33 (the second data computing node).

In this embodiment, in order to improve data quality, counting the number of source data synchronously output from the initial data storage node and the number of data synchronously input to the data lake HUDI at the same transaction time in the process of synchronizing the source data to the data lake HUDI may include:

in the process of synchronizing the source data to the data lake HUDI, the number of input and output data of each data computing node at the same transaction time is counted.

Specifically, if the workflow engine includes 2 data computing nodes, in the process of statistics, the number of input and output data of the two data computing nodes in the same transaction time may be counted respectively.

In this embodiment, after obtaining the number of source data synchronously output from the initial data storage node and the number of data synchronously input to the data lake HUDI at the same transaction time, the two numbers may be compared to determine whether the two numbers are the same.

In another embodiment, if the number of input and output data of a plurality of data computing nodes is counted, the determining whether the number of source data synchronously output from the initial data storage node and the number of data synchronously input to the data lake HUDI at the same transaction time are the same may include:

respectively judging whether the number of input data and the number of output data of each data computing node are the same at the same transaction time.

Specifically, in the determination, the number of input and output data of each data computing node needs to be determined, and similarly, assuming that the workflow engine includes two data computing nodes, the number of input and output data of the two data computing nodes needs to be determined, and only when the number of input and output data of the two data computing nodes are the same, the number of source data synchronously output from the initial data storage node and the number of data synchronously input to the data lake HUDI at the same transaction time can be determined to be the same.
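The per-node judgment can be illustrated with a short Python sketch. It assumes that the input and output counts of each data computing node for one transaction time have already been collected into a dictionary; the node names `cdc_flink` and `hudi_flink` are hypothetical labels for the two computing nodes and the function names are invented for the example.

```python
def inconsistent_nodes(node_counts):
    """Return the names of nodes whose input and output counts differ.

    node_counts maps a node name to (input_count, output_count)
    for a single transaction time.
    """
    return [name for name, (n_in, n_out) in node_counts.items() if n_in != n_out]


def counts_match(node_counts):
    """True only when every computing node passed through all of its input."""
    return not inconsistent_nodes(node_counts)


ok = {"cdc_flink": (20000, 20000), "hudi_flink": (20000, 20000)}
bad = {"cdc_flink": (20000, 20000), "hudi_flink": (20000, 19990)}
print(counts_match(ok), inconsistent_nodes(bad))
```

Checking every node rather than only the two endpoints has the practical benefit of pointing at the stage where records were lost or duplicated.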

Step S22, if it is determined that the number of data is the same, acquiring, every first preset time, first data synchronously output at a first transaction time from the initial data storage node and second data synchronized to the data lake HUDI at the first transaction time from the data lake HUDI.

Specifically, the first preset time is preset, and may be set and modified according to an actual situation, for example, the first preset time is 5 minutes. The first transaction time is a transaction time included in the first preset time, for example, if there are 5 first transaction times in 5 minutes, the first transaction time may be a transaction time selected according to a preset rule.

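In an exemplary embodiment, the first preset time includes a plurality of first transaction times, and the acquiring, every first preset time, first data synchronously output at the first transaction time from the initial data storage node and second data synchronized to the data lake HUDI at the first transaction time from the data lake HUDI may include:

randomly acquiring, every first preset time, first data synchronously output at a second transaction time from the initial data storage node and second data synchronized to the data lake HUDI at the second transaction time from the data lake HUDI, wherein the second transaction time is one of the plurality of first transaction times.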

As an example, assume that the first preset time includes 5 transaction times, namely transaction time A, transaction time B, transaction time C, transaction time D, and transaction time E. When acquiring data, one of the 5 transaction times may be selected at random as the second transaction time. Assuming that the randomly selected transaction time is transaction time E, then when actually acquiring data, first data synchronously output at transaction time E may be acquired from the initial data storage node, and second data synchronized to the data lake HUDI at transaction time E may be acquired from the data lake HUDI.
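The random selection of one transaction time per interval might look like the following Python sketch. The record layout (a `transaction_time` field) and the function names are assumptions for illustration; in a real deployment the two record sets would be fetched from the initial data storage node and from the data lake rather than held in memory.

```python
import random


def pick_transaction_time(transaction_times, rng=None):
    """Choose one transaction time at random from those in the interval."""
    rng = rng or random.Random()
    return rng.choice(list(transaction_times))


def records_at(records, transaction_time):
    """Keep only the records synchronized at the given transaction time."""
    return [r for r in records if r["transaction_time"] == transaction_time]


times = ["A", "B", "C", "D", "E"]
source = [{"id": i, "transaction_time": t} for i, t in enumerate(times)]
lake = list(source)

chosen = pick_transaction_time(times)
first_data = records_at(source, chosen)   # sample from the initial storage node
second_data = records_at(lake, chosen)    # sample from the data lake
print(chosen, first_data == second_data)
```

Sampling a single transaction time keeps the record-level comparison cheap while still covering, over many intervals, all parts of the synchronized data.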

In an exemplary embodiment, the method further comprises:

and if the number of the data is judged to be different, outputting alarm information.

Specifically, the alarm information may be sent to the administrator in the form of a short message, a mail, or the like, so that the administrator can learn of the alarm condition in time.

Step S23, determining a first data quality monitoring result according to the first data and the second data.

Specifically, after the first data and the second data are obtained, they may be compared one by one to determine whether they are the same. When the first data and the second data are not the same, the differing data may be output as the first data quality monitoring result; when they are the same, preset information, for example "current data quality is good", may be output as the first data quality monitoring result.

In an embodiment, when the comparison shows that the first data and the second data differ, the differing data need not be output directly; instead, the number and the ratio of the differing data may be output. For example, if 1000 pieces of data are compared and 10 of them differ, the number of differing data, 10, and the ratio, 1%, may be used as the first data quality monitoring result.
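The one-by-one comparison and the count/ratio form of the result could be sketched in Python as below. Matching records by a hypothetical `id` key and the shape of the returned dictionary are assumptions for this example, not part of the described method.

```python
def compare_records(first_data, second_data, key="id"):
    """Compare two record sets by key and summarize the differences."""
    first = {r[key]: r for r in first_data}
    second = {r[key]: r for r in second_data}
    all_keys = first.keys() | second.keys()
    # A key counts as differing if it is missing on one side
    # or if the two records are not equal.
    differing = sorted(k for k in all_keys if first.get(k) != second.get(k))
    ratio = len(differing) / len(all_keys) if all_keys else 0.0
    return {"differing_keys": differing, "count": len(differing), "ratio": ratio}


first_data = [{"id": i, "value": i} for i in range(1000)]
# Corrupt the first 10 records on the lake side.
second_data = [
    {"id": i, "value": -1 if i < 10 else i} for i in range(1000)
]
result = compare_records(first_data, second_data)
print(result["count"], result["ratio"])  # prints: 10 0.01
```

Reporting only the count and ratio keeps the monitoring result compact, while `differing_keys` remains available when the differing data itself needs to be output.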

Step S24, extracting the data with the simulation identification information from the data lake HUDI every second preset time.

Specifically, the second preset time may be a checkpoint time; for example, if a checkpoint is performed once every 5 minutes, the second preset time is 5 minutes. Of course, the second preset time may also be another set time, which is not limited in this embodiment.

In this embodiment, the data with the simulation identification information may be extracted from the data lake each time a checkpoint is performed.

Step S25, determining a second data quality monitoring result according to the extracted data and all the simulation data pre-inserted into the initial data storage node.

Specifically, after data is extracted from the data lake, the extracted data and the simulation data inserted into the initial data storage node may be compared one by one to obtain a second comparison result, and the second data quality monitoring result may be determined according to this comparison result.

The second data quality monitoring result may take the same form as the first data quality monitoring result: when the data differ, the number and the ratio of the differing data are used as the second data quality monitoring result.
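The simulation-data check can be illustrated with the following Python sketch. It assumes that the simulation identification information is a boolean field named `is_simulated` and that records are matched by an `id` key; both names, and the categories in the returned summary, are hypothetical.

```python
def extract_simulated(lake_records, flag="is_simulated"):
    """Select the lake records that carry the simulation identification."""
    return [r for r in lake_records if r.get(flag)]


def check_simulated(inserted, lake_records, key="id", flag="is_simulated"):
    """Compare every inserted simulation record with what reached the lake."""
    expected = {r[key]: r for r in inserted}
    extracted = {r[key]: r for r in extract_simulated(lake_records, flag)}
    missing = sorted(expected.keys() - extracted.keys())
    altered = sorted(
        k for k in expected.keys() & extracted.keys()
        if expected[k] != extracted[k]
    )
    bad = len(missing) + len(altered)
    ratio = bad / len(expected) if expected else 0.0
    return {"missing": missing, "altered": altered, "ratio": ratio}


inserted = [{"id": f"sim-{i}", "amount": i, "is_simulated": True} for i in range(4)]
real = [{"id": "order-1", "amount": 7}]
# One simulation record is lost and one is altered on the way to the lake.
lake = real + [inserted[0], dict(inserted[1], amount=99), inserted[3]]
print(check_simulated(inserted, lake))
```

Because the full set of inserted simulation data is known in advance, this check is exhaustive over the simulation records, which complements the sampled comparison of real data.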

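In an exemplary embodiment, the method further comprises:

inserting a preset number of pieces of simulation data into the initial data storage node every third preset time, wherein the simulation data has simulation identification information.

Specifically, the third preset time is preset, and may be set and modified according to an actual situation, for example, the third preset time is 5 minutes.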

The preset number is also preset, and can be set and modified according to actual conditions, for example, the preset number is 1000.

The simulation data may be various types of service data, and the specific type of data is not limited in this embodiment. Each piece of simulation data has simulation identification information that can be used to distinguish real service data from simulated service data.

It should be noted that the real service data refers to service data actually generated on the line, and the simulation data is service data generated by simulation through a simulator.

As an example, 1000 pieces of simulation data may be inserted into the starting data storage node every 5 minutes.

In an exemplary embodiment, to obtain the simulation data, the method further comprises:

generating the simulation data by simulation using a preset data simulator.

The data simulator may be any of various existing data simulators, and is not limited in this embodiment.
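As a rough illustration of what a data simulator might produce, the Python sketch below generates a batch of simulated records, each tagged with simulation identification information. The field names (`id`, `amount`, `is_simulated`), the `sim-` prefix and the function name are assumptions made for this example; scheduling the insertion every third preset time and writing into the storage node are left out.

```python
import random


def generate_simulation_batch(n, start_id=1, seed=None, flag="is_simulated"):
    """Create n simulated service records, each marked with the flag."""
    rng = random.Random(seed)
    return [
        {
            "id": f"sim-{start_id + k}",      # unique within and across batches
            "amount": rng.randint(1, 100),   # placeholder business field
            flag: True,                      # simulation identification information
        }
        for k in range(n)
    ]


batch = generate_simulation_batch(1000, seed=42)
next_batch = generate_simulation_batch(1000, start_id=1001, seed=43)
print(len(batch), batch[0]["id"], next_batch[0]["id"])  # prints: 1000 sim-1 sim-1001
```

Keeping a copy of each generated batch is what later allows the extracted lake data to be compared against all simulation data that was inserted.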

Step S26, determining a final data quality monitoring result according to the first data quality monitoring result and the second data quality monitoring result.

Specifically, corresponding weight values may be set in advance for the first data quality monitoring result and the second data quality monitoring result. For example, if the weight values of the first data quality monitoring result and the second data quality monitoring result are set to 0.7 and 0.3, respectively, then the final data quality monitoring result = 0.7 × the first data quality monitoring result + 0.3 × the second data quality monitoring result.

It can be understood that, when the first data quality monitoring result is that the first data is the same as the second data, and the second data quality monitoring result is that the extracted data is the same as all the simulation data previously inserted into the initial data storage node, it is determined that the final data quality monitoring result is that there is no problem in data quality.

That is, when the first data quality monitoring result and the second data quality monitoring result both indicate that the data are the same, the final data quality monitoring result is still that there is no data quality problem, even though the two results carry corresponding weight values.
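The weighted combination can be sketched as follows. This example assumes that each monitoring result is expressed as a mismatch ratio between 0 and 1 (0 meaning the compared data are the same); the embodiment does not fix the numeric form of the results, so this representation and the function names are illustrative only.

```python
def final_monitoring_result(first_ratio, second_ratio, w_first=0.7, w_second=0.3):
    """Weighted sum of two mismatch ratios (assumed to lie in [0, 1])."""
    for ratio in (first_ratio, second_ratio):
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("mismatch ratios must lie in [0, 1]")
    return w_first * first_ratio + w_second * second_ratio


def has_quality_problem(final_ratio):
    """A final ratio of zero means neither check found any difference."""
    return final_ratio > 0.0
```

When both ratios are zero the weighted sum is zero regardless of the weights, which matches the statement that the final result is "no data quality problem" when both checks find the data to be the same.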

In the embodiment, the data quality of the data synchronized to the data lake is judged by combining the data sampling detection and the data full-scale detection, so that the problems existing when the data enter the lake can be more accurately found, and the quality of the data entering the lake is improved.

The data quality monitoring method provided by the embodiment of the application monitors the quality of the source data while it is being synchronized into the data lake. The method determines whether the number of pieces of source data synchronously output from the initial data storage node and the number of pieces of data synchronously input to the data lake HUDI are the same at the same transaction time. If the numbers are determined to be the same, the method acquires, every first preset time, first data synchronously output at a first transaction time from the initial data storage node and second data synchronized to the data lake HUDI at the first transaction time from the data lake HUDI, and compares the first data with the second data. The method also compares the data with simulation identification information extracted from the data lake HUDI with all the simulation data pre-inserted into the initial data storage node, and finally determines a final data quality monitoring result according to the two comparison results. In the embodiment, the data is monitored in the process of entering the lake, so that data quality problems arising in the process of entering the lake can be found in a timely and accurate manner, subsequent cleaning and integration of the data in the data lake are avoided, and the data quality in the data lake is improved.

Fig. 4 is a program module diagram of a data quality monitoring apparatus 400 according to an embodiment of the present application. The data quality monitoring apparatus 400 is applied to a workflow engine that includes at least one data computing node and at least one data storage node, the data storage nodes corresponding to the data computing nodes one to one. The workflow engine is configured to synchronize source data stored in an initial data storage node, which is the first data storage node in the workflow engine, into a data lake HUDI. The data quality monitoring apparatus 400 may be divided into one or more program modules, and the one or more program modules are stored in a storage medium and executed by one or more processors to complete the embodiment of the present application. The program modules referred to in the embodiments of the present application are series of computer program instruction segments capable of performing specific functions; the functions of the program modules in this embodiment are described below. As shown in Fig. 4, the data quality monitoring apparatus 400 may include: a judging module 401, an obtaining module 402, a first determining module 403, an extracting module 404, a second determining module 405, and a third determining module 406.

A judging module 401, configured to determine whether the number of pieces of source data synchronously output from the initial data storage node and the number of pieces of data synchronously input to the data lake HUDI are the same at the same transaction time;

an obtaining module 402, configured to, if it is determined that the numbers of pieces of data are the same, obtain, every first preset time, first data synchronously output at a first transaction time from the initial data storage node and obtain, from the data lake HUDI, second data synchronized to the data lake HUDI at the first transaction time;

a first determining module 403, configured to determine a first data quality monitoring result according to the first data and the second data;

an extracting module 404, configured to extract data with simulation identification information from the data lake HUDI every second preset time;

a second determining module 405, configured to determine a second data quality monitoring result according to the extracted data and all simulation data pre-inserted into the initial data storage node;

a third determining module 406, configured to determine a final data quality monitoring result according to the first data quality monitoring result and the second data quality monitoring result.
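The comparison carried out by the extracting module and the second determining module can be illustrated with a short Python sketch. It assumes that every record has a unique key field (here the hypothetical `record_id`) and that the differing records are reported as a count and a proportion; these names and the result layout are assumptions, not part of the embodiment.

```python
def compare_simulation_data(inserted, extracted, key="record_id"):
    """Compare pre-inserted simulation records with those extracted from the lake.

    Returns whether the two sets are the same, plus the number and the
    proportion of records that are missing, unexpected, or changed.
    """
    inserted_by_key = {rec[key]: rec for rec in inserted}
    extracted_by_key = {rec[key]: rec for rec in extracted}
    all_keys = inserted_by_key.keys() | extracted_by_key.keys()
    differing = sorted(
        k for k in all_keys if inserted_by_key.get(k) != extracted_by_key.get(k)
    )
    ratio = len(differing) / len(all_keys) if all_keys else 0.0
    return {"same": not differing, "diff_count": len(differing), "diff_ratio": ratio}


inserted = [{"record_id": k, "v": v} for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]]
# In the lake, "b" has been altered and "d" has been lost.
extracted = [{"record_id": "a", "v": 1}, {"record_id": "b", "v": 99}, {"record_id": "c", "v": 3}]
second_result = compare_simulation_data(inserted, extracted)
```

Since the simulation data is fully known in advance, every inserted record can be checked, which is why this comparison is exhaustive rather than sampled.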

In an exemplary embodiment, the workflow engine includes at least one data computing node and at least one data storage node, the data storage nodes correspond to the data computing nodes one to one, the initial data storage node is the first data storage node in the workflow engine, and the counting module is further configured to count the numbers of pieces of data input to and output from each data computing node at the same transaction time in the process of synchronizing the source data to the data lake HUDI.

The judging module 401 is further configured to determine, for each data computing node, whether the number of pieces of input data and the number of pieces of output data are the same at the same transaction time.
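A per-node count check of this kind might look like the following sketch. The nested dictionary layout (node name, then transaction time, then an input/output count pair) and the function name are assumptions made for illustration; the embodiment does not prescribe how the counts are stored.

```python
def nodes_with_count_mismatch(node_counts):
    """Return (node, transaction_time, n_in, n_out) for every count mismatch.

    node_counts maps node name -> {transaction_time: (n_in, n_out)}.
    """
    mismatches = []
    for node, per_txn in node_counts.items():
        for txn_time, (n_in, n_out) in per_txn.items():
            if n_in != n_out:
                mismatches.append((node, txn_time, n_in, n_out))
    return mismatches


counts = {
    "node_a": {"t1": (100, 100), "t2": (80, 79)},  # one record lost at t2
    "node_b": {"t1": (100, 100), "t2": (79, 79)},
}
problems = nodes_with_count_mismatch(counts)
```

Checking each computing node separately makes it possible to locate the node at which records were dropped or duplicated, instead of only detecting an end-to-end difference.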

In an exemplary embodiment, the data quality monitoring apparatus 400 further includes a counting module.

The counting module is configured to count, in the process of synchronizing the source data to the data lake HUDI, the number of pieces of source data synchronously output from the initial data storage node and the number of pieces of data synchronously input to the data lake HUDI at the same transaction time, where the source data includes simulation data with simulation identification information pre-inserted into the initial data storage node.

In an exemplary embodiment, the first preset time includes a plurality of first transaction times, and the obtaining module 402 is further configured to randomly obtain, every first preset time, first data synchronously output at a second transaction time from the initial data storage node and second data synchronized to the data lake HUDI at the second transaction time from the data lake HUDI, where the second transaction time is one of the plurality of first transaction times.
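The random selection of one transaction time within the first preset time, followed by a comparison of the data on both sides, can be sketched as below. The dictionaries keyed by transaction time, the tuple record format, and the function name are hypothetical; a real implementation would query the initial data storage node and the data lake instead.

```python
import random


def sample_and_compare(source_by_txn, lake_by_txn, window_txn_times, rng=None):
    """Pick one transaction time from the window at random and compare the data.

    Returns the chosen transaction time and whether the first data (source)
    and the second data (lake) contain the same records.
    """
    rng = rng or random.Random()
    txn_time = rng.choice(window_txn_times)
    first_data = source_by_txn.get(txn_time, [])
    second_data = lake_by_txn.get(txn_time, [])
    # Order-insensitive comparison of the two record lists.
    return txn_time, sorted(first_data) == sorted(second_data)


window = ["t1", "t2", "t3"]
source = {"t1": [("a", 1)], "t2": [("b", 2), ("c", 3)], "t3": [("d", 4)]}
lake = {"t1": [("a", 1)], "t2": [("c", 3), ("b", 20)], "t3": [("d", 4)]}
chosen, same = sample_and_compare(source, lake, window, rng=random.Random(7))
```

Sampling a single transaction time per window keeps the cost of the record-level comparison low, at the price of only detecting a given discrepancy with some probability.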

In an exemplary embodiment, the data quality monitoring apparatus 400 further includes an inserting module.

The inserting module is configured to insert a preset number of pieces of simulation data into the initial data storage node every third preset time, wherein the simulation data has simulation identification information.

In an exemplary embodiment, the data quality monitoring apparatus 400 further includes a generating module.

The generating module is configured to generate the simulation data by simulation using a preset data simulator.

In an exemplary embodiment, the third determining module 406 is further configured to determine that the final data quality monitoring result is that there is no data quality problem when the first data quality monitoring result indicates that the first data is the same as the second data and the second data quality monitoring result indicates that the extracted data is the same as all simulation data pre-inserted into the initial data storage node.

In the data quality monitoring device provided in the embodiment of the application, in the process of writing source data into the data lake, it is determined whether the number of pieces of source data synchronously output from the initial data storage node and the number of pieces of data synchronously input into the data lake HUDI are the same at the same transaction time. When the numbers are determined to be the same, first data synchronously output at a first transaction time is acquired from the initial data storage node every first preset time, second data synchronized to the data lake HUDI at the first transaction time is acquired from the data lake HUDI, the first data is compared with the second data, and a first data quality monitoring result is output according to the comparison result. In the embodiment, the data is monitored in the process of entering the lake, so that data quality problems arising in the process of entering the lake can be found in a timely and accurate manner, subsequent cleaning and integration of the data in the data lake are avoided, and the data quality in the data lake is improved.

Fig. 5 schematically shows a hardware architecture diagram of a computer device suitable for implementing the data quality monitoring method according to an embodiment of the present application. In the present embodiment, the computer device 20 is a device capable of automatically performing numerical calculation and/or information processing according to instructions that are set or stored in advance. For example, it may be a data forwarding device such as a gateway. As shown in Fig. 5, the computer device 20 includes at least, but is not limited to, a memory 21, a processor 22, and a network interface 23, which may be communicatively coupled to each other by a system bus. Wherein:

The memory 21 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 21 may be an internal storage module of the computer device 20, such as a hard disk or a memory of the computer device 20. In other embodiments, the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card provided on the computer device 20. Of course, the memory 21 may also include both an internal storage module and an external storage device of the computer device 20. In this embodiment, the memory 21 is generally used for storing an operating system installed in the computer device 20 and various types of application software, such as the program code of the data quality monitoring method. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.

Processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is generally configured to control the overall operation of the computer device 20, such as performing control and processing related to data interaction or communication with the computer device 20. In this embodiment, the processor 22 is configured to execute the program code stored in the memory 21 or process data.

The network interface 23 may comprise a wireless network interface or a wired network interface, and the network interface 23 is typically used to establish a communication connection between the computer device 20 and other computer devices. For example, the network interface 23 is used to connect the computer device 20 with an external terminal through a network, and to establish a data quality monitoring channel and a communication connection between the computer device 20 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.

It is noted that Fig. 5 only shows a computer device with components 21-23, but it is understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead.

In this embodiment, the data quality monitoring method stored in the memory 21 may be further divided into one or more program modules and executed by one or more processors (in this embodiment, the processor 22) to complete the present invention.

The present embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data quality monitoring method in the embodiments.

In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the computer-readable storage medium may also include both internal and external storage devices of the computer device. In this embodiment, the computer-readable storage medium is generally used to store an operating system and various types of application software installed in the computer device, for example, the program code of the data quality monitoring method in the embodiment, and the like. Further, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.

It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A data quality monitoring method, applied to a workflow engine, wherein the workflow engine is used for synchronizing source data stored in an initial data storage node into a data lake HUDI, the data quality monitoring method comprising:

judging whether the number of source data synchronously output from the initial data storage node and the number of data synchronously input into the data lake HUDI at the same transaction time are the same or not;

extracting data with simulation identification information from the data lake HUDI at intervals of second preset time, wherein the simulation identification information is used for distinguishing real service data from simulated service data;

2. The data quality monitoring method of claim 1, further comprising:

3. The data quality monitoring method according to claim 2, wherein the workflow engine includes at least one data computing node and at least one data storage node, the data storage nodes are in one-to-one correspondence with the data computing nodes, the initial data storage node is a first data storage node in the workflow engine, and the counting the number of pieces of source data synchronously output from the initial data storage node and the number of pieces of data synchronously input to the data lake HUDI at the same transaction time in the process of synchronizing the source data to the data lake HUDI includes:

the judging whether the number of source data synchronously output from the initial data storage node and the number of data synchronously input to the data lake HUDI are the same at the same transaction time includes:

and respectively judging whether the number of input and output data of each data computing node is the same in the same transaction time.

4. The data quality monitoring method according to claim 1, wherein the first preset time includes a plurality of first transaction times, and the acquiring, from the initial data storage node, the first data synchronously output at the first transaction time and the second data synchronously output at the first transaction time from the data lake HUDI every first preset time includes:

5. The data quality monitoring method according to any one of claims 1 to 4, characterized in that the method further comprises:

6. The data quality monitoring method according to claim 5, wherein before the step of inserting a predetermined number of pieces of simulation data into the initial data storage node every third predetermined time, the method further comprises:

7. The data quality monitoring method of claim 1, wherein the determining a final data quality monitoring result according to the first data quality monitoring result and the second data quality monitoring result comprises:

8. A data quality monitoring apparatus for use in a workflow engine for synchronizing source data stored in a starting data storage node to a data lake HUDI, the data quality monitoring apparatus comprising:

the extraction module is used for extracting data with simulation identification information from the data lake HUDI at intervals of second preset time, wherein the simulation identification information is used for distinguishing real service data from simulated service data;

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, carries out the steps of the data quality monitoring method according to any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, carries out the steps of the data quality monitoring method according to any one of claims 1 to 7.