CN113220530A - Data quality monitoring method and platform - Google Patents


Info

Publication number
CN113220530A
Authority
CN
China
Prior art keywords
data
computing
calculation
node
data quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110529402.0A
Other languages
Chinese (zh)
Other versions
CN113220530B (en)
Inventor
张杨
刘方奇
郑志升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bilibili Technology Co Ltd filed Critical Shanghai Bilibili Technology Co Ltd
Priority to CN202110529402.0A priority Critical patent/CN113220530B/en
Publication of CN113220530A publication Critical patent/CN113220530A/en
Application granted granted Critical
Publication of CN113220530B publication Critical patent/CN113220530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the application provides a data quality monitoring platform, which includes: a workflow engine comprising at least one data computing node and at least one data storage node, wherein each data computing node is used for acquiring data from a data source and computing the acquired data according to a preset computing rule to obtain a first computing result, and the data storage nodes correspond to the data computing nodes one to one and are used for storing the first computing results of the data computing nodes; a data storage system used for storing second calculation results of a preset type contained in the first calculation results acquired from each data computing node; and a data quality monitoring system used for consuming a plurality of second calculation results from the data storage system and carrying out data quality analysis on each consumed second calculation result to obtain a data quality analysis result. The platform can improve troubleshooting efficiency.

Description

Data quality monitoring method and platform
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a data quality monitoring method and a data quality monitoring platform.
Background
With the rapid development of network technology, many enterprises and groups analyze various types of data collected each day by building workflow engines. In the prior art, a workflow engine generally includes a plurality of data computing nodes and a plurality of data storage nodes, and computes various types of data through the data computing nodes and stores the computed data results into the data storage nodes.
However, the inventor finds that, because a workflow engine contains a plurality of data computing nodes, when one of them encounters a problem while computing data, it is very difficult to determine which data computing node in the workflow engine is at fault. Generally, the data computing nodes need to be checked one by one, which takes a lot of time and makes troubleshooting very inefficient.
Disclosure of Invention
The embodiment of the application aims to provide a data quality monitoring platform that can solve the problem in the prior art that, when a data computing node in a workflow engine fails, a large amount of time is needed to find out which data computing node has failed, so that troubleshooting efficiency is very low.
One aspect of the embodiments of the present application provides a data quality monitoring platform, including:
a workflow engine comprising at least one data computing node and at least one data storage node, wherein each data computing node is used for acquiring data from a data source and computing the acquired data according to a preset computing rule to obtain a first computing result, the data storage nodes are in one-to-one correspondence with the data computing nodes and are used for storing the first computing results of the data computing nodes, the first computing result of at least one first data computing node is used as input data of a second data computing node, and the first data computing node and the second data computing node are each one of the data computing nodes in the workflow engine;
a data storage system used for storing second calculation results of a preset type contained in the first calculation results acquired from each data computing node;
and a data quality monitoring system used for acquiring a plurality of second calculation results from the data storage system and carrying out data quality analysis on each acquired second calculation result to obtain a data quality analysis result.
Optionally, the data quality monitoring platform further includes:
and the data analysis system is used for storing the data quality analysis result so that a user can inquire and analyze the data quality analysis result.
Optionally, the workflow engine is further configured to set a data type of the data output in a side output manner in each data calculation node, so as to serve as a data type of the second calculation result.
Optionally, the data quality monitoring platform is further configured to set a data quality check rule corresponding to each data computing node.
Optionally, the data quality monitoring platform is further configured to:
when a plurality of second calculation results are acquired from the data storage device, determining a data calculation node corresponding to each second calculation result;
acquiring a data quality check rule corresponding to each determined data calculation node;
judging whether each second calculation result meets the corresponding data quality check rule or not;
and if the current second calculation result does not accord with the data quality check rule, outputting alarm information.
Optionally, the data quality monitoring platform is further configured to:
and if the current second calculation result does not accord with the data quality check rule, performing data cleaning processing on the current second calculation result.
Optionally, the data quality verification rule includes whether the second calculation result exceeds a preset alarm threshold, and the data quality monitoring platform is further configured to:
and if the current second calculation result exceeds the corresponding alarm threshold, outputting alarm information.
Optionally, the alarm threshold includes at least one of the following:
the average value of the second calculation results in the preset time period, the maximum value of the second calculation results in the preset time period and the minimum value of the second calculation results in the preset time period.
The application also provides a data quality monitoring method, which is applied to a data quality monitoring platform comprising a workflow engine, a data storage system and a data quality monitoring system, and the method comprises the following steps:
creating at least one data computing node and at least one data storage node in the workflow engine, wherein each data computing node is used for acquiring data from a data source and computing the acquired data according to a preset computing rule to obtain a first computing result, the data storage nodes are in one-to-one correspondence with the data computing nodes and are used for storing the first computing results of the data computing nodes, the first computing result of at least one first data computing node is used as input data of a second data computing node, and the first data computing node and the second data computing node are each one of the data computing nodes in the workflow engine;
storing a preset type of second calculation result contained in the first calculation result acquired from each data calculation node through the data storage system;
and acquiring a plurality of second calculation results from the data storage system through the data quality monitoring system, and performing data quality analysis on each acquired second calculation result to obtain a data quality analysis result.
Optionally, the data quality monitoring platform further includes a data analysis system, and the method further includes: storing the data quality analysis result through the data analysis system, so that a user can query and analyze the data quality analysis result.
According to the data quality monitoring platform provided by the embodiment of the application, each data computing node outputs data to the data storage system in a side output manner, the data quality monitoring system acquires the data from the data storage system, and data quality analysis is performed on the acquired data to obtain a data quality analysis result. In the application, because the data stored in the data storage system comes from each data computing node, when the analysis shows that data is abnormal, the data computing node that has the problem can be determined directly and found in time, which improves troubleshooting efficiency.
Drawings
FIG. 1 schematically illustrates an architecture diagram of a data quality monitoring platform in an embodiment of the present application;
FIG. 2 schematically illustrates a block diagram of a data quality monitoring platform of an embodiment of the present application;
FIG. 3 schematically illustrates an architecture of a workflow engine in an embodiment of the present application;
FIG. 4 schematically illustrates a block diagram of a data quality monitoring platform of another embodiment of the present application;
FIG. 5 schematically illustrates a flow chart of a data quality monitoring method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the descriptions relating to "first", "second", etc. in the present application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that the combination can be realized by a person skilled in the art; when the technical solutions are contradictory or cannot be realized, such a combination should be considered not to exist and is not within the protection scope of the present application.
FIG. 1 schematically shows an architecture diagram of a data quality monitoring platform in an embodiment of the present application. In an exemplary embodiment, the data quality monitoring platform may include the following parts: a workflow engine 1, a data storage system 2, and a data quality monitoring system 3 (Data Quality Center).
The workflow engine 1 may be Airflow, a platform for programmatically authoring, scheduling, and monitoring workflows. Based on a Directed Acyclic Graph (DAG), Airflow can define a group of tasks with dependencies, which are executed in sequence according to those dependencies.
The data storage system 2 is a database used for storing data, which may be ES, Hive, Kafka, HDFS, HBase, etc. In this embodiment, the database is preferably Kafka.
The Data Quality monitoring system 3, which is alternatively referred to as a Data Quality Center (DQC), is configured to monitor Data Quality, and can automatically monitor Data Quality during a Data processing task by configuring a Data Quality verification rule.
The DQC mainly has two functions: data monitoring and data cleaning. Data monitoring means that data quality is monitored and an alarm is raised; data monitoring does not process the data output, and the alarm receiver needs to judge and decide how to handle the data. Data cleaning means cleaning data that does not conform to the established rules, so as to ensure that the final data output contains no dirty data; data cleaning does not trigger an alarm.
To help understand the working principle of the data quality monitoring platform, the data quality monitoring service it provides is described as follows. The client reports the various data sources to be analyzed to the corresponding data computing nodes through protocols such as HTTP and RPC, and the data computing nodes perform the corresponding computation on them. After the computation is completed, the computation results are stored in the corresponding data storage nodes; at the same time, each data computing node outputs the data in its computation results that meets the preset rules to the data storage system 2 in a side output manner. The data quality monitoring system 3 then acquires the data from the data storage system 2 and analyzes it to obtain a data quality analysis result, such as alarm information.
FIG. 2 schematically shows a block diagram of a data quality monitoring platform according to an embodiment of the present application. As shown in FIG. 2, the data quality monitoring platform may include a workflow engine 20, a data storage system 21, and a data quality monitoring system 22, wherein:
the workflow engine 20 includes at least one data computing node 201 and at least one data storage node 202, wherein fig. 2 exemplifies one data computing node 201 and one data storage node 202.
Each data computing node 201 is configured to obtain data from a data source, and compute the obtained data according to a preset computation rule to obtain a first computation result, where the first computation result of at least one first data computing node is used as input data of a second data computing node, and the first data computing node and the second data computing node are both one of the data computing nodes in the workflow engine.
Specifically, the data source is the service data to be analyzed, and the client may report the data to the corresponding data computing node through protocols such as HTTP and RPC so that the data computing node can compute on it. In a video scenario, the data source may be the viewing duration, video name, video type, and similar information reported by a video application (app) through protocols such as HTTP and RPC while a user watches a video, or it may be the user's behavior data while watching the video, such as favoriting or liking the video.
In this embodiment, the data computing node 201 may be Flink or Spark. Flink is an open-source stream processing framework developed by the Apache Software Foundation; its core is a distributed streaming dataflow engine written in Java and Scala that can perform stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a general-purpose parallel framework similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMPLab. Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, intermediate job output can be kept in memory, so that reading and writing HDFS is not needed; Spark is therefore better suited to iterative algorithms such as those used in data mining and machine learning. Spark is an open-source cluster computing environment similar to Hadoop, but the differences between the two make Spark perform better on some workloads: Spark provides in-memory distributed datasets, which can optimize iterative workloads in addition to supporting interactive queries. Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so that Scala can manipulate distributed datasets as easily as local collection objects.
It should be noted that the Flink or Spark generally provides data computing services externally in a Flink cluster or Spark cluster manner.
The calculation rule is a rule preset by a user for calculating the acquired data, for example, if the acquired data is the playing time length data of a video, the calculation rule may be to count the total time length of the user watching a certain type of video.
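As an illustration only, the calculation rule in the example above can be sketched in a few lines of Python. The record layout, the sample values, and the function name total_watch_time are assumptions made for this sketch; the application does not specify them, and a real data computing node would run such logic inside Flink or Spark rather than in plain Python.

```python
from collections import defaultdict

# Hypothetical playback records reported by a client:
# (user_id, video_type, seconds_watched).
records = [
    ("u1", "anime", 1200),
    ("u1", "music", 300),
    ("u2", "anime", 600),
    ("u1", "anime", 900),
]


def total_watch_time(records, video_type):
    """Calculation rule: total seconds each user spent on one video type."""
    totals = defaultdict(int)
    for user_id, vtype, seconds in records:
        if vtype == video_type:
            totals[user_id] += seconds
    return dict(totals)


# The "first calculation result" of this hypothetical node.
first_result = total_watch_time(records, "anime")
```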
The data storage nodes 202 are in one-to-one correspondence with the data computing nodes 201, and are used for storing the first computation results of the data computing nodes 201.
In this embodiment, the data storage node may be Kafka, Redis, or Hive.
Among them, Kafka is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of consumers on a website, where action stream data includes web browsing, searching, and other behavioral data.
Redis is a key-value storage system. Similar to Memcached, it caches data in memory to ensure efficiency, but it supports more value types, including string, list, set, zset (sorted set), and hash. These data types support push/pop, add/remove, intersection, union, difference, and richer operations, all of which are atomic. On this basis, Redis supports sorting in various ways. Unlike Memcached, Redis periodically writes updated data to disk or appends modification operations to a log file, and implements master-slave synchronization on that basis.
Hive is a data warehouse tool based on Hadoop, which is used for data extraction, transformation, and loading, and provides a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. The Hive data warehouse tool can map structured data files to database tables, provide SQL query functions, and convert SQL statements into MapReduce tasks for execution. The advantages of Hive are that its learning cost is low and that MapReduce statistics can be implemented quickly through SQL-like statements, which makes MapReduce simpler to use and removes the need to develop dedicated MapReduce applications. Hive is very suitable for statistical analysis in data warehouses.
In an exemplary embodiment, referring to FIG. 3, the workflow engine 20 may include: a third data computing node (Flink A); a first data storage node (Kafka A) for storing the computation result of the third data computing node; a fourth data computing node (Spark); a second data storage node (Redis A) for storing the computation result of the fourth data computing node; a fifth data computing node (Flink B) for obtaining data from the first data storage node and the second data storage node and performing data stream merging and computation on the obtained data; a third data storage node (Kafka B) for storing the computation result of the fifth data computing node; a sixth data computing node (Flink C); a fourth data storage node (Redis B) for storing the computation result of the sixth data computing node; a seventh data computing node (Flink D) for acquiring data from the third data storage node and the fourth data storage node and performing data stream merging and computation on the acquired data; and a fifth data storage node (Hive) for storing the computation result of the seventh data computing node.
The third, fourth, and sixth data computing nodes each compute different data.
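The dependencies among the nodes described for FIG. 3 form a directed acyclic graph. The following sketch, given for illustration only, encodes those dependencies with the Python standard library (graphlib, available from Python 3.9) and derives one valid execution order. The node labels follow the parenthesized names in the description of FIG. 3; the code is not Airflow's API and makes no claim about how the workflow engine is actually implemented.

```python
from graphlib import TopologicalSorter

# Each key depends on (reads from) every node in its value set.
dependencies = {
    "Kafka A": {"Flink A"},
    "Redis A": {"Spark"},
    "Flink B": {"Kafka A", "Redis A"},   # merges the two upstream streams
    "Kafka B": {"Flink B"},
    "Redis B": {"Flink C"},
    "Flink D": {"Kafka B", "Redis B"},   # merges the two upstream streams
    "Hive": {"Flink D"},
}

# One valid order in which the nodes can run so that every node
# starts only after all of its upstream nodes.
order = list(TopologicalSorter(dependencies).static_order())
position = {name: index for index, name in enumerate(order)}
```

Because every other node lies upstream of Hive, Hive is always last in any order produced this way.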
And the data storage system 21 is used for storing preset types of second calculation results contained in the first calculation results acquired from each data calculation node.
Specifically, the second calculation result included in the first calculation result may be acquired from each of the data calculation nodes in a side-output (side-output) manner.
Side output (side-output) is a splitting mechanism that splits a data stream without copying it. The data type of a side output stream does not need to be consistent with the type of the main data stream, and different side output streams may have different types.
The preset type is the type of data which needs to be output in a side output mode preset by a user.
In this embodiment, in order to obtain the second calculation result of the preset type from the first calculation result of each data computing node, the data type of the data to be output in the side output manner needs to be set in each data computing node by the workflow engine 20 as the data type of the second calculation result. Specifically, an OutputTag may be defined, which is used to identify the data type of the data to be output through a side output stream.
In this embodiment, once the data type of the data to be output in the side output manner has been defined in a data computing node, the node computes the data to obtain a first calculation result and then matches that result against the data type set for the side output. If the data type of the first calculation result matches the data type defined for the side output, the first calculation result is taken as a second calculation result and output to the data storage system 21.
It should be noted that, in order to facilitate the subsequent distinction of the second calculation results output by each data calculation node in the side output manner, when the second calculation results are output to the data storage system, each second calculation result carries identification information of the current data calculation node, where the identification information is used to uniquely distinguish different data calculation nodes.
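The type matching and node tagging described above can be sketched as follows. This is a plain-Python simulation given for illustration only, not Flink's side output API: the class names ComputeNode and SideRecord, the node identifiers, and the use of Python lists as stand-ins for the storage node and the data storage system are all assumptions of the sketch.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SideRecord:
    node_id: str  # identification information of the emitting compute node
    value: Any    # the second calculation result itself


class ComputeNode:
    def __init__(self, node_id, side_type, main_sink, side_sink):
        self.node_id = node_id
        self.side_type = side_type  # preset type to emit as side output
        self.main_sink = main_sink  # stands in for the node's own storage node
        self.side_sink = side_sink  # stands in for the shared data storage system

    def emit(self, first_result):
        # Every first calculation result goes to the node's storage node.
        self.main_sink.append(first_result)
        # Results of the preset type are additionally emitted as second
        # calculation results, tagged with this node's identifier.
        if isinstance(first_result, self.side_type):
            self.side_sink.append(SideRecord(self.node_id, first_result))


shared_store = []
store_a, store_b = [], []
node_a = ComputeNode("flink-a", float, store_a, shared_store)
node_b = ComputeNode("spark", int, store_b, shared_store)

for result in [3.5, "log line", 2.0]:
    node_a.emit(result)
for result in [7, "other"]:
    node_b.emit(result)
```

After this runs, shared_store holds only the float results of the first node and the integer result of the second, each carrying the identifier of the node that produced it.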
In this embodiment, the data storage system is preferably a Kafka cluster.
The data quality monitoring system 22 is configured to obtain a plurality of second calculation results from the data storage system 21, and perform data quality analysis on each obtained second calculation result to obtain a data quality analysis result.
Specifically, in order to facilitate data quality analysis on each second calculation result, the data quality monitoring system 22 may preset a data quality check rule corresponding to each data calculation node 201. In this embodiment, each data computing node 201 may set one data quality check rule, or may set multiple data quality check rules, and the number of the specifically set data quality check rules may be determined according to an actual situation, which is not limited in this embodiment.
The data quality check rule is a rule for checking a second calculation result. In this embodiment, the data quality check rule may include a primary key monitoring rule, a table data volume monitoring rule, a fluctuation monitoring rule, a non-null monitoring rule for important fields, a discrete value monitoring rule for important enumerated fields, an index value fluctuation monitoring rule, a business rule monitoring rule, and the like.
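A few of the rule kinds listed above can be sketched as simple predicates. This is a minimal illustration under stated assumptions: the function names, the row layout, and the interpretation of "fluctuation" as a relative change against a previous value are choices made for this sketch and are not defined by the application.

```python
def primary_key_unique(rows, key):
    """Primary key monitoring: no two rows share the same key value."""
    keys = [row[key] for row in rows]
    return len(keys) == len(set(keys))


def non_null(rows, field):
    """Non-null monitoring of an important field."""
    return all(row.get(field) is not None for row in rows)


def within_enum(rows, field, allowed):
    """Discrete value monitoring of an enumerated field."""
    return all(row[field] in allowed for row in rows)


def fluctuation_ok(current, previous, max_ratio):
    """Fluctuation monitoring: relative change stays within max_ratio."""
    if previous == 0:
        return current == 0
    return abs(current - previous) / abs(previous) <= max_ratio


# Hypothetical second calculation results from one compute node.
rows = [
    {"id": 1, "video_type": "anime", "seconds": 1200},
    {"id": 2, "video_type": "music", "seconds": None},
    {"id": 2, "video_type": "unknown", "seconds": 30},
]

report = {
    "primary_key": primary_key_unique(rows, "id"),
    "non_null": non_null(rows, "seconds"),
    "enum": within_enum(rows, "video_type", {"anime", "music"}),
    # Table data volume: 3 rows now versus a hypothetical 4 rows before.
    "fluctuation": fluctuation_ok(len(rows), 4, 0.5),
}
```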
The data quality analysis result may be output alarm information, or data cleaning operation required for the current second calculation result, and the like.
In an exemplary embodiment, the data quality monitoring system 22 is further configured to: when a plurality of second calculation results are obtained from the data storage system 21, determine the data computing node corresponding to each second calculation result; acquire the data quality check rule corresponding to each determined data computing node; judge whether each second calculation result meets the corresponding data quality check rule; and if the current second calculation result does not conform to the data quality check rule, output alarm information.
Specifically, since different second calculation results may come from different data computing nodes, when each second calculation result is checked, it is first necessary to determine which data computing node it comes from, and then the data quality check rule corresponding to that node can be obtained. For example, if the data quality check rule corresponding to data computing node A is rule 1 and the rules corresponding to data computing node B are rule 2 and rule 3, then a current second calculation result that comes from data computing node B is checked using rule 2 and rule 3.
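The per-node rule lookup in the example above can be sketched as a small registry keyed by node identifier. The concrete predicates attached to rule 1, rule 2, and rule 3, the record layout (node identifier, value), and the wording of the alarm message are hypothetical; the application does not define them.

```python
# Hypothetical rule registry: node A has rule 1, node B has rules 2 and 3.
rules_by_node = {
    "A": [("rule 1", lambda v: v >= 0)],
    "B": [("rule 2", lambda v: v >= 0), ("rule 3", lambda v: v <= 100)],
}


def check(record):
    """Check one second calculation result against its node's rules.

    record is a (node_id, value) pair; returns a list of alarm messages.
    """
    node_id, value = record
    alarms = []
    for name, predicate in rules_by_node.get(node_id, []):
        if not predicate(value):
            alarms.append(f"node {node_id}: value {value!r} violates {name}")
    return alarms


# Results consumed from the data storage system, each tagged with its node.
consumed = [("A", 5), ("B", 250), ("A", -1)]
alarms = [message for record in consumed for message in check(record)]
```

Because each alarm names the node whose rule was violated, the faulty data computing node can be read directly from the message.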
In this embodiment, when the current second calculation result is checked according to the data quality check rule, and the current second calculation result is found not to conform to the data quality check rule, an alarm message may be output to notify the user of which data calculation node has a problem.
In an exemplary embodiment, the data quality monitoring system 22 is further configured to: if the current second calculation result does not conform to the data quality check rule, perform data cleaning processing on the current second calculation result.
Specifically, before data cleaning, a data cleaning rule needs to be configured, so that when data cleaning is performed, a pre-configured data cleaning rule can be called to clean data meeting the data cleaning rule.
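A minimal sketch of rule-driven cleaning follows, assuming that each cleaning rule is a predicate that returns True for dirty data. The rule set, the record layout, and the function name clean are assumptions of the sketch. Consistent with the description of data cleaning, the sketch removes matching records and raises no alarm.

```python
def clean(results, cleaning_rules):
    """Split results into (kept, dropped) using pre-configured rules.

    A result is dropped if any cleaning rule matches it.
    """
    kept, dropped = [], []
    for result in results:
        is_dirty = any(rule(result) for rule in cleaning_rules)
        (dropped if is_dirty else kept).append(result)
    return kept, dropped


# Hypothetical pre-configured cleaning rules.
cleaning_rules = [
    lambda r: r["seconds"] is None,                            # missing value
    lambda r: r["seconds"] is not None and r["seconds"] < 0,   # impossible value
]

results = [
    {"id": 1, "seconds": 1200},
    {"id": 2, "seconds": None},
    {"id": 3, "seconds": -5},
]

kept, dropped = clean(results, cleaning_rules)
```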
In an exemplary embodiment, the data quality check rule may include whether the second calculation result exceeds a preset alarm threshold, where the alarm threshold may be the mean value of the second calculation results within a preset time period, the maximum value of the second calculation results within the preset time period, or the minimum value of the second calculation results within the preset time period. The preset time period may be set according to actual conditions, for example, the last day, the last week, or the last month.
It should be noted that the alarm threshold corresponding to second calculation results from a given data computing node is determined according to the second calculation results received from that data computing node. For example, the alarm threshold corresponding to data computing node A is determined according to the second calculation results received from data computing node A in the last 30 days: the average value, the maximum value, or the minimum value of the second calculation results received in those 30 days is used as the alarm threshold.
In this embodiment, the data quality monitoring system 22 outputs alarm information only if the current second calculation result exceeds the corresponding alarm threshold; otherwise, no alarm information is output.
As an example, it is assumed that the current second calculation result is playing time length data of a video, and the alarm threshold corresponding to the current second calculation result is that the playing time length is 3 hours. At this time, if the current second calculation result is that the playing time length is 3.5 hours, it may be determined that the current second calculation result exceeds the alarm threshold, and the alarm information is output to the user.
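The threshold logic can be sketched as follows. The history values are hypothetical, and reading "exceeds" as strictly greater than the threshold is an assumption of this sketch; the application does not state how a result equal to the threshold is treated.

```python
from statistics import mean


def alarm_threshold(history, mode):
    """Derive a node's alarm threshold from its recent second results."""
    if mode == "mean":
        return mean(history)
    if mode == "max":
        return max(history)
    if mode == "min":
        return min(history)
    raise ValueError(f"unknown threshold mode: {mode}")


def exceeds(current, threshold):
    """Assumed reading of 'exceeds': strictly greater than the threshold."""
    return current > threshold


# Hypothetical playing durations (hours) received from one node
# during the preset time period.
history_hours = [2.0, 3.0, 2.5, 1.5]

threshold = alarm_threshold(history_hours, "max")  # 3.0 hours
alarm = exceeds(3.5, threshold)                    # 3.5 hours triggers an alarm
```

With a maximum-based threshold of 3 hours, a current result of 3.5 hours triggers an alarm, matching the example above.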
In an exemplary embodiment, referring to FIG. 4, the data quality monitoring platform comprises: a workflow engine 40, a data storage system 41, a data quality monitoring system 42, and a data analysis system 43.
The workflow engine 40, the data storage system 41, and the data quality monitoring system 42 are the same as the workflow engine 20, the data storage system 21, and the data quality monitoring system 22 in the above embodiments, and are not described again in this embodiment.
The data analysis system 43 is configured to store the data quality analysis result, so that a user can query and analyze the data quality analysis result.
In particular, the data analysis system 43 may be a ClickHouse database.
ClickHouse is a column-oriented database for real-time data analysis, and it processes data 100 to 1000 times faster than traditional methods. ClickHouse's performance exceeds that of comparable column-oriented DBMSs currently on the market, processing hundreds of millions to billions of rows and tens of gigabytes of data per server per second. Designed for OLAP workloads, ClickHouse uses a purpose-built, efficient columnar storage engine and provides features such as ordered data storage, primary key indexes, sparse indexes, data sharding, data partitioning, TTL, and primary-replica replication.
In this embodiment, the data quality analysis result is stored in ClickHouse, so that the user can conveniently query and analyze the data quality analysis result. In an embodiment, the stored data quality analysis result can also be used to draw a monitoring panel, so as to monitor each data computing node.
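As a rough illustration of storing analysis results and querying them for a monitoring panel, the sketch below uses Python's built-in `sqlite3` module as a stand-in for ClickHouse, purely so that the example is self-contained. The table name, column names, and sample rows are hypothetical and are not taken from the application.

```python
import sqlite3

# In-memory SQLite database standing in for the data analysis system.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE quality_results (
        node_id    TEXT NOT NULL,     -- data computing node that produced the value
        checked_at TEXT NOT NULL,     -- time of the quality check
        value      REAL NOT NULL,     -- the second calculation result
        passed     INTEGER NOT NULL   -- 1 if the check rule was satisfied, else 0
    )
    """
)

# Hypothetical data quality analysis results.
rows = [
    ("node_a", "2021-05-14T10:00:00", 2.5, 1),
    ("node_a", "2021-05-14T11:00:00", 3.5, 0),
    ("node_b", "2021-05-14T10:00:00", 1.0, 1),
]
conn.executemany("INSERT INTO quality_results VALUES (?, ?, ?, ?)", rows)

# Per-node summary of the kind a monitoring panel might display.
panel = conn.execute(
    """
    SELECT node_id, COUNT(*) AS checks, SUM(1 - passed) AS failures
    FROM quality_results
    GROUP BY node_id
    ORDER BY node_id
    """
).fetchall()
```

Grouping by node keeps the panel aligned with the goal of the platform, which is to attribute each abnormality to a specific data computing node.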
According to the data quality monitoring platform provided by this embodiment of the application, each data computing node outputs data to the data storage system by means of side output; the data quality monitoring platform acquires the data from the data storage system and performs data quality analysis on the acquired data to obtain a data quality analysis result. Because the data stored in the data storage system comes from each individual data computing node, when the analysis result indicates a data abnormality, the data computing node in which the problem occurred can be determined directly. Problematic data computing nodes can therefore be found promptly, which improves troubleshooting efficiency.
Fig. 5 is a schematic flow chart of a data quality monitoring method according to an embodiment of the present application. The method is applied to a data quality monitoring platform comprising a workflow engine, a data storage system and a data quality monitoring system, wherein the data quality monitoring platform is the data quality monitoring platform in the above embodiment, and is not described in detail in this embodiment.
In this embodiment, the method includes:
Step S50, creating at least one data computing node and at least one data storage node in the workflow engine, where each data computing node is configured to obtain data from a data source and calculate the obtained data according to a preset calculation rule to obtain a first calculation result, the data storage nodes are in one-to-one correspondence with the data computing nodes and are configured to store the first calculation result of the corresponding data computing node, the first calculation result of at least one first data computing node is used as input data of a second data computing node, and the first data computing node and the second data computing node are each one of the data computing nodes in the workflow engine.
Step S51, storing, by the data storage system, a preset type of second calculation result included in the first calculation result acquired from each data calculation node.
Step S52, obtaining a plurality of second calculation results from the data storage device through the data quality monitoring system, and performing data quality analysis on each obtained second calculation result to obtain a data quality analysis result.
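Steps S50 to S52 can be sketched in a few lines. This is a simplified, in-memory illustration under several assumptions: the `ComputeNode` class, the `monitor` function, the dictionary keys, and the non-negativity check are all hypothetical, and a `deque` stands in for the data storage system.

```python
from collections import deque


class ComputeNode:
    """A data computing node paired one-to-one with its own storage slot."""

    def __init__(self, node_id, rule, side_output_key):
        self.node_id = node_id
        self.rule = rule                        # preset calculation rule
        self.side_output_key = side_output_key  # preset type selected for side output
        self.stored_result = None               # stands in for the paired data storage node

    def run(self, data, side_channel):
        first_result = self.rule(data)
        self.stored_result = first_result
        # Side output: only the preset type (the second calculation result) is
        # emitted, tagged with the node that produced it.
        side_channel.append((self.node_id, first_result[self.side_output_key]))
        return first_result


def monitor(side_channel):
    """Consume side-output records and attach a simple quality verdict to each."""
    results = []
    while side_channel:
        node_id, value = side_channel.popleft()
        results.append({"node": node_id, "value": value, "ok": value >= 0})
    return results


side_channel = deque()  # stand-in for the data storage system
node_a = ComputeNode(
    "node_a",
    lambda rows: {"total_minutes": sum(rows), "count": len(rows)},
    "total_minutes",
)
node_b = ComputeNode(
    "node_b",
    lambda r: {"avg_minutes": r["total_minutes"] / r["count"]},
    "avg_minutes",
)

source = [30, 45, 60]                        # hypothetical data source
first_a = node_a.run(source, side_channel)   # first data computing node
first_b = node_b.run(first_a, side_channel)  # second node consumes the first node's result
analysis = monitor(side_channel)
```

Because every side-output record carries its node identifier, an abnormal value in `analysis` points directly at the node that produced it.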
In an exemplary embodiment, the data quality monitoring platform further comprises a data analysis system, and the method further comprises: storing the data quality analysis result through the data analysis system so that a user can query and analyze the data quality analysis result.
In an exemplary embodiment, the method further comprises:
setting, by the workflow engine, a data type of the data output by the side output manner in each data calculation node as a data type of the second calculation result.
In an exemplary embodiment, the method further comprises: setting, through the data quality monitoring platform, a data quality check rule corresponding to each data computing node.
In an exemplary embodiment, obtaining a plurality of second calculation results from the data storage device through the data quality monitoring system and performing data quality analysis on each obtained second calculation result to obtain the data quality analysis result includes: when the plurality of second calculation results are consumed from the data storage device, determining the data computing node corresponding to each second calculation result; acquiring the data quality check rule corresponding to each determined data computing node; judging whether each second calculation result satisfies the corresponding data quality check rule; and, if the current second calculation result does not satisfy the data quality check rule, outputting alarm information.
In an exemplary embodiment, the obtaining, by the data quality monitoring system, a plurality of second calculation results from the data storage device, and performing data quality analysis on each obtained second calculation result, and obtaining the data quality analysis result further includes:
and if the current second calculation result does not accord with the data quality check rule, performing data cleaning processing on the current second calculation result.
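The per-node rule lookup, alarm, and cleaning flow described in the two embodiments above might look like the following sketch. The application does not specify what a rule or a cleaning step looks like, so the rule predicates, the cleaning functions, and all names here are assumptions made for illustration.

```python
def analyze(records, rules, cleaners):
    """Check each record against its node's rule; alarm on and clean any failure.

    records:  iterable of (node_id, value) pairs consumed from storage.
    rules:    node_id -> predicate returning True when the value is acceptable.
    cleaners: node_id -> function producing a cleaned replacement value.
    """
    alarms, cleaned = [], []
    for node_id, value in records:
        rule = rules[node_id]  # look up the rule for the producing node
        if rule(value):
            cleaned.append((node_id, value))
            continue
        alarms.append(f"ALARM node={node_id}: value {value} violates its check rule")
        cleaned.append((node_id, cleaners[node_id](value)))
    return alarms, cleaned


# Hypothetical rules: node_a values must be non-negative, node_b values must
# fall within 0 to 24 (for example, hours in a day).
rules = {
    "node_a": lambda v: v >= 0,
    "node_b": lambda v: 0 <= v <= 24,
}
# Hypothetical cleaning: reset invalid node_a values to 0.0, clamp node_b values.
cleaners = {
    "node_a": lambda v: 0.0,
    "node_b": lambda v: min(max(v, 0.0), 24.0),
}

records = [("node_a", 12.0), ("node_b", 30.0), ("node_a", -1.0)]
alarms, cleaned = analyze(records, rules, cleaners)
```

Keeping rules and cleaners in per-node mappings mirrors the idea that each data computing node has its own data quality check rule.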
In an exemplary embodiment, the data quality check rule includes whether the second calculation result exceeds a preset alarm threshold, and if the current second calculation result does not meet the data quality check rule, outputting the alarm information includes:
and if the current second calculation result exceeds the corresponding alarm threshold, outputting alarm information.
According to the data quality monitoring method provided by this embodiment of the application, each data computing node outputs data to the data storage system by means of side output; the data quality monitoring platform acquires the data from the data storage system and performs data quality analysis on the acquired data to obtain a data quality analysis result. Because the data stored in the data storage system comes from each individual data computing node, when the analysis result indicates a data abnormality, the data computing node in which the problem occurred can be determined directly. Problematic data computing nodes can therefore be found promptly, which improves troubleshooting efficiency.
The present embodiments also provide a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the steps of:
creating at least one data computing node and at least one data storage node in a workflow engine, wherein each data computing node is used for acquiring data from a data source and computing the acquired data according to a preset computing rule to obtain a first computing result, the data storage nodes are in one-to-one correspondence with the data computing nodes and are used for storing the first computing result of the corresponding data computing node, the first computing result of at least one first data computing node is used as input data of a second data computing node, and the first data computing node and the second data computing node are each one of the data computing nodes in the workflow engine;
storing a preset type of second calculation result contained in the first calculation result acquired from each data calculation node through a data storage system;
and acquiring a plurality of second calculation results from the data storage device through a data quality monitoring system, and performing data quality analysis on each acquired second calculation result to obtain a data quality analysis result.
In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the computer-readable storage medium may also include both internal and external storage devices of the computer device. In this embodiment, the computer-readable storage medium is generally used for storing an operating system and various types of application software installed in a computer device, for example, the program code of the data quality monitoring method implemented by the data quality monitoring platform in the embodiment, and the like. Further, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device. In some cases, the steps shown or described may be performed in an order different from that described herein, or they may be fabricated separately into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A data quality monitoring platform, the data quality monitoring platform comprising:
the workflow engine comprises at least one data computing node and at least one data storage node, wherein each data computing node is used for acquiring data from a data source and computing the acquired data according to a preset computing rule to obtain a first computing result, and the data storage nodes correspond to the data computing nodes one to one and are used for storing the first computing results of the data computing nodes; taking a first calculation result of at least one first data calculation node as input data of a second data calculation node, wherein the first data calculation node and the second data calculation node are both one data calculation node in the workflow engine;
the data storage system is used for storing a preset type of second calculation result contained in the first calculation result acquired from each data calculation node;
and the data quality monitoring system is used for acquiring a plurality of second calculation results from the data storage device and carrying out data quality analysis on each acquired second calculation result to obtain a data quality analysis result.
2. The data quality monitoring platform of claim 1, further comprising:
and the data analysis system is used for storing the data quality analysis result so that a user can inquire and analyze the data quality analysis result.
3. The data quality monitoring platform according to claim 1, wherein the workflow engine is further configured to set a data type of the data output in a side output manner in each data computing node as the data type of the second computing result.
4. The data quality monitoring platform according to claim 1, further configured to set a data quality check rule corresponding to each data computing node.
5. The data quality monitoring platform of claim 4, further configured to:
determining a data computing node corresponding to each second computing result when the plurality of second computing results are consumed from the data storage device;
acquiring a data quality check rule corresponding to each determined data calculation node;
judging whether each second calculation result meets the corresponding data quality check rule or not;
and if the current second calculation result does not accord with the data quality check rule, outputting alarm information.
6. The data quality monitoring platform of claim 5, further configured to:
and if the current second calculation result does not accord with the data quality check rule, performing data cleaning processing on the current second calculation result.
7. The data quality monitoring platform according to claim 5, wherein the data quality check rule includes whether the second calculation result exceeds a preset alarm threshold, and the data quality monitoring platform is further configured to:
and if the current second calculation result exceeds the corresponding alarm threshold, outputting alarm information.
8. The data quality monitoring platform of claim 7, wherein the alarm threshold comprises at least one of:
the average value of the second calculation results in the preset time period, the maximum value of the second calculation results in the preset time period and the minimum value of the second calculation results in the preset time period.
9. A data quality monitoring method is applied to a data quality monitoring platform comprising a workflow engine, a data storage system and a data quality monitoring system, and is characterized by comprising the following steps:
creating at least one data computing node and at least one data storage node in the workflow engine, wherein each data computing node is used for acquiring data from a data source and computing the acquired data according to a preset computing rule to obtain a first computing result, and the data storage nodes correspond to the data computing nodes one to one and are used for storing the first computing results of the data computing nodes; taking a first calculation result of at least one first data calculation node as input data of a second data calculation node, wherein the first data calculation node and the second data calculation node are both one data calculation node in the workflow engine;
storing a preset type of second calculation result contained in the first calculation result acquired from each data calculation node through the data storage system;
and acquiring a plurality of second calculation results from the data storage device through the data quality monitoring system, and performing data quality analysis on each acquired second calculation result to obtain a data quality analysis result.
10. The data quality monitoring method of claim 9, wherein the data quality monitoring platform further comprises a data analysis system, the method further comprising: storing the data quality analysis result through the data analysis system so that a user can query and analyze the data quality analysis result.
CN202110529402.0A 2021-05-14 2021-05-14 Data quality monitoring method and platform Active CN113220530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110529402.0A CN113220530B (en) 2021-05-14 2021-05-14 Data quality monitoring method and platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110529402.0A CN113220530B (en) 2021-05-14 2021-05-14 Data quality monitoring method and platform

Publications (2)

Publication Number Publication Date
CN113220530A true CN113220530A (en) 2021-08-06
CN113220530B CN113220530B (en) 2022-07-19

Family

ID=77092018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110529402.0A Active CN113220530B (en) 2021-05-14 2021-05-14 Data quality monitoring method and platform

Country Status (1)

Country Link
CN (1) CN113220530B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113641420A (en) * 2021-08-16 2021-11-12 北京明略昭辉科技有限公司 Flink-based workflow engine implementation method, system, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9317805B1 (en) * 2013-03-12 2016-04-19 Ubs Ag System and method of performing modular quantitative analysis with nodes that have contextual labels
CN108270618A (en) * 2017-12-30 2018-07-10 杭州华为数字技术有限公司 Alert the method, apparatus and warning system of judgement
CN110908883A (en) * 2019-11-15 2020-03-24 江苏满运软件科技有限公司 User portrait data monitoring method, system, equipment and storage medium
CN111459986A (en) * 2020-04-07 2020-07-28 中国建设银行股份有限公司 Data computing system and method
CN111563103A (en) * 2020-04-28 2020-08-21 厦门市美亚柏科信息股份有限公司 Method and system for detecting data blood margin
CN112529528A (en) * 2020-12-16 2021-03-19 中国南方电网有限责任公司 Workflow monitoring and warning method, device and system based on big data flow calculation

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
卜尧等: "BDAP――一个基于Spark的数据挖掘工具平台", 《中国科学技术大学学报》 *
曹舒扬等: "基于大数据的广播电视节目互联网舆情分析系统设计", 《广播电视信息》 *
柯文等: "信息化综合运维管理系统的设计与实现", 《铁路计算机应用》 *
潘卫军等: "民航空管大数据处理平台架构研究", 《计算机应用与软件》 *
王兴等: "基于物联网的林产品可追溯系统设计", 《森林工程》 *

Also Published As

Publication number Publication date
CN113220530B (en) 2022-07-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant