CN112241443B - Data quality monitoring method, device, computing equipment and computer storage medium

Info

Publication number: CN112241443B
Application number: CN201910642686.7A
Authority: CN
Inventors: 金崇超; 孙新华; 刘坤
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Zhejiang Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Zhejiang Co Ltd
Priority date: 2019-07-16
Filing date: 2019-07-16
Publication date: 2023-11-21
Anticipated expiration: 2039-07-16
Also published as: CN112241443A

Abstract

Embodiments of the invention relate to the technical field of quality monitoring and disclose a data quality monitoring method, a device, a computing device and a computer storage medium. The method comprises: acquiring a data flow map among the data nodes; acquiring real-time business data in a production system; acquiring the data flow characteristics of each data node in the business data according to the data flow map; and comparing the data flow characteristics of each data node in the business data against the data flow map for consistency, outputting an abnormal data detection result and locating the node at which the abnormal data occurred. In this way, the embodiments use the association relationships among the data of the production system to automatically work out the data flow patterns under different scenarios and services, monitor data quality across all services, all scenarios and the whole process, and identify the point at which a data anomaly occurs, which improves data quality monitoring efficiency and reduces labor cost.

Description

Data quality monitoring method, device, computing equipment and computer storage medium

Technical Field

The embodiment of the invention relates to the technical field of quality monitoring, in particular to a data quality monitoring method, a data quality monitoring device, computing equipment and a computer storage medium.

Background

With the growth of the telecommunications market, strong competition and technological change are pushing operators away from extensive operation that attracts users through price wars and toward refined, customer-centered operation that meets diversified customer demands through business innovation. In this process, the variety of service types and the flexibility of tariff packages make the business support system increasingly complex, which inevitably raises the risk of data quality problems caused by abnormal data in production. Once a data quality problem occurs in the production system, it can affect normal business handling as well as the results of downstream data work and processing.

The existing way of handling data quality problems is periodic manual audit comparison, which can be roughly divided into three steps: 1) establishing data audit points, set up separately according to different service conditions and experience; 2) manually combing the audit flow, that is, using a system interaction panoramic framework to manually trace the business scenario flow related to each audit point and the data tables it passes through; 3) periodic checking, in which personnel are assigned to perform data quality comparison audits on each audit point step by step according to the audit flow.

In carrying out embodiments of the present invention, the inventors found the following. To meet diversified customer demands and keep up with a continuously changing market environment, operators need to launch innovative businesses promptly, so under the traditional data quality audit comparison method the audit points and audit flows must be updated continuously and in step. Because the combing of business flows and data flows is done entirely by hand, it is time-consuming, labor-intensive and extremely inefficient. In addition, where businesses and scenarios interact, the data derived in the production environment is complex and changeable, and manually established audit points often cover the data quality of only some businesses and scenarios, so audit coverage is limited. Finally, given the complicated specific processes of different services and scenarios, the traditional manual audit comparison focuses only on end-to-end data quality; it cannot locate the intermediate link in which a problem arises, and so cannot monitor where a data quality problem occurs.

Disclosure of Invention

In view of the foregoing, embodiments of the present invention provide a data quality monitoring method, apparatus, computing device, and computer storage medium, which overcome or at least partially solve the foregoing problems.

According to an aspect of an embodiment of the present invention, there is provided a data quality monitoring method, the method including: acquiring a data flow map among all data nodes; acquiring real-time business data in a production system; acquiring the data flow characteristics of each data node in the service data according to the data flow map; and carrying out consistency comparison on the data flow characteristics of each data node in the business data and the data flow map, outputting an abnormal data detection result and positioning an abnormal occurrence node of the abnormal data.

In an optional manner, the acquiring a data flow map between the data nodes includes: carrying out data characteristic analysis on historical service data in a production system to obtain a dependency relationship among data nodes; and obtaining the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow map.

In an optional manner, the performing data feature analysis on the historical service data to obtain a dependency relationship between data nodes includes: acquiring historical service data in a production system, and establishing a training data node table; collecting field characteristics according to the data node table, and acquiring the general field combinations of each data node; for any two data nodes, obtaining the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations; and judging the dependency relationship between any two data nodes in the optimal field combination according to the time field in the training data node table.

In an optional manner, the obtaining, for any two data nodes, the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations includes: extracting a preset number of business data records in the general field combination of one data node, matching them against the general field combination of another data node, and counting the retention rate and the expansion rate; taking the field combination with the highest retention rate; when several field combinations share the highest retention rate, eliminating those with an abnormal expansion rate to form a special field combination; splicing the special field combination with the general field combination of the other data node to form a new inter-table field combination; and repeatedly iterating over the inter-table field combination and the special field combination until no new special field combination is generated between tables, thereby obtaining the optimal field combination.

In an optional manner, the obtaining the optimal flow direction between the data nodes according to the dependency relationship between the data nodes to form a data flow map includes: acquiring historical service data in a production system, and establishing a service data temporary table; selecting an association table of any data node from an association table set associated with an initial table of the initial data node according to the dependency relationship among the data nodes; judging a basic association flow direction according to the initial table and the time field of the association table; judging the internal association flow direction of association tables of any two data nodes in the association table set; and obtaining an optimal flow direction according to the basic association flow direction and the internal association flow direction, and forming a data flow map.

In an optional manner, the obtaining the optimal flow direction according to the basic association flow direction and the internal association flow direction to form a data flow map includes: if a first basic association flow direction can be realized through a second basic association flow direction and an internal association flow direction, retaining the second basic association flow direction and the internal association flow direction, and deleting the first basic association flow direction; traversing the basic association flow directions and the internal association flow directions, and taking those that finally remain as the optimal flow directions; and forming a data flow map according to the optimal flow directions among the data nodes.

In an alternative manner, the data flow characteristics include a start node data amount and an end node data amount; the step of carrying out consistency comparison on the data flow characteristics of each data node in the business data and the data flow map, outputting an abnormal data detection result and positioning the abnormal occurrence node of the abnormal data comprises the following steps: for each flow direction in the data flow map, comparing the data quantities of the data nodes in the business data, calculating the missing data quantity, marking the data node positions, and outputting a data missing level list and a data missing detail list; and outputting an abnormal data detection result according to the data missing level list and the data missing detail list, and positioning an abnormal occurrence node of the abnormal data.

According to another aspect of an embodiment of the present invention, there is provided a data quality monitoring apparatus, the apparatus including: the flow acquisition unit is used for acquiring a data flow map among the data nodes; the data acquisition unit is used for acquiring real-time service data in the production system; the feature extraction unit is used for acquiring the data flow features of each data node in the service data according to the data flow map; and the anomaly detection unit is used for carrying out consistency comparison on the data flow characteristics of each data node in the business data and the data flow map, outputting an anomaly data detection result and positioning an anomaly occurrence node of the anomaly data.

According to another aspect of an embodiment of the present invention, there is provided a computing device including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the steps of the data quality monitoring method.

According to yet another aspect of the embodiments of the present invention, there is provided a computer storage medium having stored therein at least one executable instruction for causing the processor to perform the steps of the above-described data quality monitoring method.

According to the embodiments of the invention, a data flow map among the data nodes is acquired; real-time business data in a production system is acquired; the data flow characteristics of each data node in the business data are acquired according to the data flow map; and the data flow characteristics of each data node in the business data are compared against the data flow map for consistency, an abnormal data detection result is output and the node at which the abnormal data occurred is located. The association relationships among the data of the production system are thereby used to automatically comb the data flow patterns under different scenarios and services, data quality is monitored and compared across all services and the whole process, and the point at which a data anomaly occurs is identified, which improves data quality monitoring efficiency and reduces labor cost.

The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the specification, specific embodiments of the present invention are set out below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

Fig. 1 shows a flow chart of a data quality monitoring method according to an embodiment of the present invention;

Fig. 2 shows a flow chart of another data quality monitoring method according to an embodiment of the present invention;

Fig. 3 shows a schematic flow chart of another data quality monitoring method according to an embodiment of the present invention;

Fig. 4 shows a schematic diagram of the retention rate and expansion rate of a data quality monitoring method according to an embodiment of the present invention;

Fig. 5 shows a schematic diagram of obtaining the optimal flow direction in a data quality monitoring method according to an embodiment of the present invention;

Fig. 6 shows a schematic structural diagram of a data quality monitoring device according to an embodiment of the present invention;

Fig. 7 shows a schematic structural diagram of a computing device according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Fig. 1 shows a flow chart of a data quality monitoring method according to an embodiment of the present invention. As shown in fig. 1, the data quality monitoring method includes:

Step S11: acquiring a data flow map among the data nodes.

In this embodiment, either a data flow map trained in advance is obtained, or the data flow map is obtained by training on historical business data in the production system.

When the data flow map is obtained by training on historical business data in the production system, step S11 includes the following steps, as shown in Fig. 2:

Step S111: performing data feature analysis on the historical business data in the production system to obtain the dependency relationships among the data nodes.

In this embodiment, the historical business data in the production system is used to create a training node set table, a feature framework based on measures such as the data expansion rate and the retention rate is constructed, the optimal field combination of each data node is obtained from the expansion rate and the retention rate, and the dependency relationships among the data nodes are obtained from the explicit (dominant) and implicit (recessive) relationships among the node tables.

As shown in fig. 3, step S111 includes the steps of:

Step S115: acquiring historical business data in the production system, and establishing a training data node table.

The historical business data of the training nodes in the production system is acquired, a corresponding temporary training node table is created for subsequent training, and a training data set is established, thereby creating the training data node table.

Step S116: collecting field characteristics according to the data node table, and acquiring a plurality of general field combinations of each data node.

Specifically, for each data node in the training data node table, the fields of character or numeric type are extracted and analyzed to obtain a field analysis result; tag fields are then extracted from the field analysis result, and fields whose null values exceed a threshold are removed, forming a plurality of general field combinations of each data node.

For the created training data node table, the fields of character or numeric type are extracted, and the field characteristics of the relevant fields of each data table are counted in turn, such as the record count, the deduplicated record count, the null record count, the value length shared by the most records and the number of records having that length, forming the data field analysis result table shown in Table 1.

Table 1 Data field analysis result table

Field name	Logical name	Data type
NODE_ID	Node ID	NUMBER(10)
NODE_CODE	Node code	VARCHAR2(30)
COLUMN_NAME	Field code	VARCHAR2(30)
COLUMN_TYPE	Field type	VARCHAR2(30)
ALL_CNT	Record count of the node	NUMBER(20)
COL_CNT	Record count of the field after deduplication	NUMBER(20)
NULL_CNT	Null record count of the field	NUMBER(20)
MAX_LENGTH	Value length shared by the most records of the field	NUMBER(4)
MAX_LENGTH_CNT	Number of records of the field having that length	NUMBER(20)

Based on the counted field characteristics, the fields of the suspected tag set are extracted for analyzing field diversity, and the data characteristic field results are written. The field extraction rules of the suspected tag set are as follows:

Fields with too many null values are then eliminated according to preset rules, and the field features are screened to form the general field combinations that meet the conditions; preferably, fields whose null values exceed a preset value are eliminated to form the general field combinations of each data node.
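The field statistics of Table 1 and the null-value screening can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the embodiment: the column-oriented `table` dictionary, the function names `profile_field` and `select_general_fields`, the treatment of empty strings as nulls, and the 0.5 null-ratio threshold are all assumptions introduced here, since the description only speaks of a preset value.

```python
from collections import Counter


def profile_field(values):
    """Collect the per-field statistics named in Table 1 for one column."""
    # Assumption: None and the empty string both count as null values.
    non_null = [v for v in values if v is not None and v != ""]
    lengths = Counter(len(str(v)) for v in non_null)
    # Value length shared by the most records, and how many records have it.
    max_length, max_length_cnt = lengths.most_common(1)[0] if lengths else (0, 0)
    return {
        "ALL_CNT": len(values),
        "COL_CNT": len(set(non_null)),
        "NULL_CNT": len(values) - len(non_null),
        "MAX_LENGTH": max_length,
        "MAX_LENGTH_CNT": max_length_cnt,
    }


def select_general_fields(table, max_null_ratio=0.5):
    """Keep the fields whose null ratio does not exceed the assumed threshold."""
    kept = []
    for name, values in table.items():
        stats = profile_field(values)
        if stats["ALL_CNT"] and stats["NULL_CNT"] / stats["ALL_CNT"] <= max_null_ratio:
            kept.append(name)
    return kept


# Hypothetical node table: one list of values per field.
table = {
    "user_id": ["u1", "u2", "u3", "u3"],
    "remark": [None, None, None, "x"],
}
general_fields = select_general_fields(table)
```

Here `remark` is dropped because three of its four values are null, while `user_id` is kept as a candidate for the general field combinations.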

Step S117: for any two data nodes, obtaining the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations.

Specifically, a preset number of business data records in the general field combination of one data node are extracted and matched against the general field combination of another data node, and the retention rate and the expansion rate are counted; the field combination with the highest retention rate is taken; when several field combinations share the highest retention rate, those with an abnormal expansion rate are removed, forming a special field combination. In this embodiment, as shown in Fig. 4, taking any two associated data nodes A and B as an example, 10000 records in the general field combination of data node A are extracted to form table A, and the general field combination of data node B forms the full table B. The field combinations are matched against the associated full table B, and the data retention rate and expansion rate are counted in turn. The retention rate is the amount of data in table A that intersects the full table B, divided by 10000; the expansion rate is the amount of data in the full table B that intersects table A, divided by the amount of data in table A that intersects the full table B. For example, referring to Fig. 4, table A includes an α field, table B includes a β field and a γ field, and the two tables are associated by the field combination α=β, resulting in q=2, p=3. If the retention rate of a field combination between the associated tables meets the threshold, the field combination with the highest retention rate is taken; otherwise the field combination is removed. If several field combinations meet these conditions, those with an abnormal expansion rate, for example an expansion rate >100, are eliminated.

After a special field combination is obtained, it is spliced with the general field combination of the other data node to form a new inter-table field combination; the inter-table field combination and the special field combination are iterated repeatedly until no new special field combination is generated between tables, thereby obtaining the optimal field combination.
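The retention rate and expansion rate used in step S117 can be computed as in the following Python sketch. The row representation (lists of dictionaries), the function names, and the 0.9 retention threshold are hypothetical; only the two ratio definitions and the example expansion cutoff of 100 are taken from the description.

```python
from collections import Counter


def retention_and_expansion(sample_a, full_b, fields_a, fields_b):
    """Return (retention rate, expansion rate) for one candidate field combination."""
    def key(row, fields):
        return tuple(row[f] for f in fields)

    b_counts = Counter(key(r, fields_b) for r in full_b)
    a_keys = {key(r, fields_a) for r in sample_a}
    # Rows of the sample table A that find a match in the full table B.
    matched_a = sum(1 for r in sample_a if key(r, fields_a) in b_counts)
    # Rows of the full table B that find a match in the sample table A.
    matched_b = sum(cnt for k, cnt in b_counts.items() if k in a_keys)
    retention = matched_a / len(sample_a) if sample_a else 0.0
    expansion = matched_b / matched_a if matched_a else 0.0
    return retention, expansion


def pick_special_combinations(stats, min_retention=0.9, max_expansion=100):
    """stats maps a field combination to its (retention, expansion) pair."""
    passing = {c: s for c, s in stats.items() if s[0] >= min_retention}
    if not passing:
        return []
    best = max(s[0] for s in passing.values())
    top = [c for c, s in passing.items() if s[0] == best]
    # Only when several combinations tie is the expansion rate used as a filter.
    if len(top) > 1:
        top = [c for c in top if passing[c][1] <= max_expansion]
    return top


# Toy data: two of the four sampled A rows match three rows of B.
sample_a = [{"alpha": 1}, {"alpha": 2}, {"alpha": 3}, {"alpha": 4}]
full_b = [{"beta": 1}, {"beta": 1}, {"beta": 2}, {"beta": 9}]
rates = retention_and_expansion(sample_a, full_b, ["alpha"], ["beta"])
```

With this toy data the retention rate is 2/4 and the expansion rate is 3/2. In the embodiment the sample size would be the preset number of records (10000 in the example), and the selected combination would then be spliced and re-evaluated iteratively, which this sketch does not attempt.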

Step S118: judging the dependency relationship between any two data nodes in the optimal field combination according to the time field in the training data node table.

Whether the relationship between associated tables is dominant (explicit) or recessive (implicit) is judged from the time fields in the training data node table: if the time fields of the two tables stand in a time-sequence relationship, the relationship is dominant; otherwise it is recessive. The dependency relationships among the data nodes are then output, forming the association table of dependency relationships shown in Table 2.

Table 2 Association table of dependency relationships between data nodes

Step S112: obtaining the optimal flow direction among the data nodes according to the dependency relationships among the data nodes to form a data flow map.

On the basis of the dependency relationships among the data nodes, basic and internal association flow directions are formed, and the data flow map is finally established automatically, completing the data characteristic dependency model analysis.

In this embodiment, historical business data in the production system is acquired and a business data temporary table is established. Specifically, business data of the data nodes is acquired according to the configuration table, and a new related business data temporary table is created from the training data node table according to the business caliber of the training business entry. Each data node is then extracted, and an association table of any data node is selected from the association table set associated with the initial table of the initial data node according to the dependency relationships among the data nodes; the basic association flow direction is judged from the time fields of the initial table and the association table; and the internal association flow direction between the association tables of any two data nodes in the association table set is judged. For example, any association table b_i is selected from the association table set {B} of the initial table a, and the following characteristics are combed out: node a, the association field of node a, node b_i, the association field of node b_i, and the association type. Temporary association result tables are generated in turn based on node a and node b_i, the time fields of the initial table a and of the association table b_i are taken and spliced in, and the initial data amounts of the initial table a and of the association table b_i are further counted, completing the judgment of the basic association flow direction: if the confidence of the rule a→b_i in the time-sequence association exceeds the threshold, the business is considered to flow through table b_i; otherwise it does not. Traversing the association table set {B} yields the basic association flow direction a→{B'}.
The association tables in the association table set {B} that have an association relationship are compared in pairs to judge the flow direction: the confidences of {B_i'}→{B_j'} (i≠j) and {B_j'}→{B_i'} (i≠j) are calculated respectively, and if the confidence exceeds the threshold, the internal association flow direction with the higher confidence is taken.
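The threshold test on the time-sequence association can be illustrated with the Python sketch below. The description does not define how the confidence is computed, so the sketch assumes one plausible reading: the confidence of a→b is the share of associated record pairs whose time in a is not later than their time in b. The function names, the pair representation and the 0.8 threshold are likewise hypothetical.

```python
def flow_confidence(pairs):
    """pairs: (time_in_a, time_in_b) tuples for records joined on the association fields.

    Assumed definition: fraction of pairs in which a's time is not later than b's.
    """
    if not pairs:
        return 0.0
    return sum(1 for ta, tb in pairs if ta <= tb) / len(pairs)


def basic_flows(start, joined, threshold=0.8):
    """Return the basic association flow directions start -> b_i above the threshold."""
    return [(start, b) for b, pairs in joined.items()
            if flow_confidence(pairs) > threshold]


def internal_flow(bi, bj, pairs, threshold=0.8):
    """Pick the direction between two association tables with the higher confidence."""
    forward = flow_confidence(pairs)
    backward = flow_confidence([(tb, ta) for ta, tb in pairs])
    if max(forward, backward) <= threshold:
        return None  # neither direction is confident enough
    return (bi, bj) if forward >= backward else (bj, bi)


# Toy timestamps (any comparable values work, e.g. datetime objects).
joined = {
    "b1": [(1, 2), (3, 4), (5, 6)],   # a always precedes b1
    "b2": [(5, 1), (6, 2), (1, 9)],   # a precedes b2 in only one of three pairs
}
flows = basic_flows("a", joined)
```

Under these assumptions only a→b1 is kept as a basic association flow direction, and `internal_flow` returns a direction between two association tables only when one of the two confidences exceeds the threshold.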

In this embodiment, the optimal flow direction is further obtained from the basic association flow directions and the internal association flow directions, and a data flow map is formed, as follows: if a first basic association flow direction can be realized through a second basic association flow direction and an internal association flow direction, the second basic association flow direction and the internal association flow direction are retained and the first basic association flow direction is deleted; the basic and internal association flow directions are traversed, and those that finally remain are taken as the optimal flow directions; and a data flow map is formed according to the optimal flow directions among the data nodes. For example, referring to Fig. 5, if there is a basic association flow direction a→b2 and there are also basic and internal association flow directions a→b1 and b1→b2, so that the start node can reach the end node through other nodes, the original basic association flow direction a→b2 is removed, and the other basic and internal association flow directions are combined and retained as the optimal flow direction. A business process is formed according to the optimal flow directions among the data nodes, with the result shown in Table 3, and the data flow map is then generated.
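The pruning rule in the Fig. 5 example amounts to removing every flow direction whose end node is still reachable from its start node through the other retained flow directions. A minimal Python sketch follows; the edge representation and function names are assumptions, and the flow directions are assumed to contain no cycles.

```python
def is_reachable(edges, start, end):
    """Depth-first search: can `end` be reached from `start` using `edges`?"""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        for u, v in edges:
            if u != node or v in seen:
                continue
            if v == end:
                return True
            seen.add(v)
            stack.append(v)
    return False


def prune_redundant_flows(edges):
    """Drop each flow direction that can be realized through the remaining ones."""
    kept = set(edges)
    for edge in sorted(edges):
        others = kept - {edge}
        if is_reachable(others, edge[0], edge[1]):
            kept = others
    return kept


# The example from the description: a->b2 is implied by a->b1 and b1->b2.
flows = {("a", "b1"), ("b1", "b2"), ("a", "b2")}
optimal = prune_redundant_flows(flows)
```

For this input the direct flow a→b2 is removed and a→b1, b1→b2 remain as the optimal flow directions, matching the example above.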

Table 3 Business process result table

Field name	Logical name	Data type
JOB_ID	Training item ID	NUMBER(10)
TASK_ID	Training task ID	NUMBER(10)
FLOW_ID	Flow ID	NUMBER(10)
START_NODE_ID	Start node ID	NUMBER(10)
START_NODE_CODE	Start node encoding	VARCHAR2(30)
END_NODE_ID	End node ID	NUMBER(10)
END_NODE_CODE	End node encoding	VARCHAR2(30)

By constructing a feature framework based on measures such as the data expansion rate and the retention rate and forming the basic and internal association flow directions, this embodiment establishes the data flow map automatically. The association relationships among the production system data can then be used in the subsequent steps so that automatic monitoring replaces purely manual auditing, which improves data quality audit efficiency and reduces labor cost.

Step S12: acquiring real-time business data in the production system.

Specifically, business data of the data nodes is acquired according to the configuration table; real-time business data in the production system is acquired from the training data node table according to the business caliber of the business entry, and a new related business data temporary table is created.

Step S13: acquiring the data flow characteristics of each data node in the business data according to the data flow map.

Specifically, according to the data flow map formed by training, the data flow characteristics of each data node are counted, where the data flow characteristics include a start node data quantity and an end node data quantity.

Step S14: comparing the data flow characteristics of each data node in the business data against the data flow map for consistency, outputting an abnormal data detection result and locating the node at which the abnormal data occurred.

For each flow direction in the data flow map, the data quantities of the data nodes in the business data are compared, the missing data quantity is calculated, the data node positions are marked, and a data missing level list and a data missing detail list are output, as shown in Tables 4 and 5 respectively.

Table 4 Data missing level list

Field name	Logical name	Data type
JOB_ID	Training item ID	NUMBER(10)
TM_INTRVL_ID	Training task ID	VARCHAR2(30)
FLOW_ID	Belonging to Process ID	NUMBER(10)
START_NODE_CODE	Start node encoding	VARCHAR2(30)
END_NODE_CODE	End node encoding	VARCHAR2(30)
V_START_CNT	Initial node data volume	NUMBER(10)
V_END_CNT	End node data volume	NUMBER(10)
GRP_EXPR	Magnitude of difference	VARCHAR2(1000)

Table 5 Data missing detail list

Further, outputting an abnormal data detection result according to the data missing level list and the data missing detail list, and positioning an abnormal occurrence node of the abnormal data. The recording form of the abnormal data detection result is shown in table 6.

Table 6 Schematic table of abnormal link detection results

Field name	Logical name	Data type
USER_ID	User ID	VARCHAR2(30)
JOB_ID	Training item ID	NUMBER(10)
TASK_ID	Training task ID	NUMBER(10)
NODE_ID	Data node ID	NUMBER(10)
NODE_CODE	Data node encoding	VARCHAR2(30)
GRP_ROWNO	Number of user records	NUMBER(10)
GRP_EXPR	Recording type of user	VARCHAR2(1000)
ERR_FLAG	Whether the user is abnormal at the node	NUMBER(1)
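The consistency comparison of step S14 can be sketched in Python as follows. This is an illustration under stated assumptions, not the embodiment's implementation: each node is represented by the set of user IDs observed there, `MISSING_CNT` is a hypothetical column that does not appear in Table 4, and a user seen at the start node of a flow direction but absent from its end node is assumed to be flagged as abnormal at the end node.

```python
def compare_flows(flow_edges, node_users):
    """Compare node data volumes along each flow direction of the data flow map.

    flow_edges: list of (start_node_code, end_node_code) pairs.
    node_users: dict mapping a node code to the set of user IDs seen at that node.
    """
    level_rows, abnormal_rows = [], []
    for start, end in flow_edges:
        start_users = node_users.get(start, set())
        end_users = node_users.get(end, set())
        missing = sorted(start_users - end_users)
        level_rows.append({
            "START_NODE_CODE": start,
            "END_NODE_CODE": end,
            "V_START_CNT": len(start_users),
            "V_END_CNT": len(end_users),
            "MISSING_CNT": len(missing),  # hypothetical column, not in Table 4
        })
        # Assumption: the anomaly is attributed to the node the data failed to reach.
        abnormal_rows.extend(
            {"USER_ID": user, "NODE_CODE": end, "ERR_FLAG": 1} for user in missing
        )
    return level_rows, abnormal_rows


# Toy example with hypothetical node codes.
flow = [("order", "billing"), ("billing", "invoice")]
users = {
    "order": {"u1", "u2", "u3"},
    "billing": {"u1", "u2"},
    "invoice": {"u1", "u2"},
}
levels, abnormal = compare_flows(flow, users)
```

In this toy run user `u3` appears at `order` but never reaches `billing`, so the first flow direction reports one missing record and `u3` is flagged at `billing`; the second flow direction is consistent.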

This embodiment breaks through the limits of manpower and complex flows in the traditional approach. By automatically combing the data flow patterns under different scenarios and services, it monitors and compares data quality across all services, all scenarios and the whole process, and is not limited to manually established audit points. It automatically detects abnormal business operations in production from abnormal data flow conditions, and can quickly locate the abnormal link and give the reason a business operation completed abnormally, that is, the point at which the data anomaly occurred.

Fig. 6 shows a schematic structural diagram of a data quality monitoring apparatus according to an embodiment of the present invention. As shown in fig. 6, the data quality monitoring apparatus includes: a flow acquisition unit 601, a data acquisition unit 602, a feature extraction unit 603, and an abnormality detection unit 604. Wherein:

the flow obtaining unit 601 is configured to obtain a data flow map between each data node; the data acquisition unit 602 is configured to acquire real-time service data in the production system; the feature extraction unit 603 is configured to obtain a data flow feature of each data node in the service data according to the data flow map; the anomaly detection unit 604 is configured to perform consistency comparison on the business data and the data flow characteristics of each data node of the data flow graph, output an anomaly data detection result, and locate an anomaly occurrence node of the anomaly data.

In an alternative manner, the flow obtaining unit 601 is configured to: carrying out data characteristic analysis on historical service data in a production system to obtain a dependency relationship among data nodes; and obtaining the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow map.

In an alternative manner, the flow obtaining unit 601 is configured to: acquire historical service data in a production system, and establish a training data node table; collect field characteristics according to the data node table, and acquire the general field combinations of each data node; for any two data nodes, obtain the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations; and judge the dependency relationship between any two data nodes in the optimal field combination according to the time field in the training data node table.

In an alternative manner, the flow acquisition unit 601 is further configured to: extract a preset number of service data records under the general field combination of one data node, match them against the general field combination of another data node, and count the retention rate and the expansion rate; take the field combination with the highest retention rate; when several field combinations share the highest retention rate, eliminate those with an abnormal expansion rate to form a special field combination; splice the special field combination with the general field combination of the other data node to form a new inter-table field combination; and iterate over the inter-table field combination and the special field combination until no new special field combination is generated between the tables, thereby obtaining the optimal field combination.
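A single round of this selection may be sketched as follows. The definitions used here are assumptions, since this passage does not define the two rates: the retention rate is taken as the share of sampled records that find at least one match in the other table, and the expansion rate as the average number of matches per retained record. The names `rates` and `pick_combos` and the threshold `max_expansion` are hypothetical, and the iterative splicing step is not shown.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]
Combo = Tuple[str, ...]


def rates(sample: Sequence[Row], other: Sequence[Row], combo: Combo) -> Tuple[float, float]:
    """Return (retention rate, expansion rate) for one field combination.

    Assumes a non-empty sample and that every field in combo exists in both tables.
    """
    def key(row: Row) -> tuple:
        return tuple(row[field] for field in combo)

    other_counts = Counter(key(row) for row in other)
    matches = [other_counts[key(row)] for row in sample]
    retained = sum(1 for m in matches if m > 0)
    retention = retained / len(sample)
    expansion = sum(matches) / retained if retained else 0.0
    return retention, expansion


def pick_combos(sample, other, combos: List[Combo], max_expansion: float = 1.5) -> List[Combo]:
    """Keep the combinations with the highest retention; on a tie, drop those
    whose expansion rate exceeds the assumed threshold."""
    scored = {combo: rates(sample, other, combo) for combo in combos}
    best = max(retention for retention, _ in scored.values())
    top = [combo for combo in combos if scored[combo][0] == best]
    if len(top) > 1:
        top = [combo for combo in top if scored[combo][1] <= max_expansion]
    return top


orders = [
    {"user": "u1", "order": "o1"},
    {"user": "u1", "order": "o2"},
    {"user": "u2", "order": "o3"},
]
bills = [
    {"user": "u1", "order": "o1"},
    {"user": "u1", "order": "o2"},
    {"user": "u2", "order": "o3"},
]
chosen = pick_combos(orders, bills, [("user",), ("user", "order")])
```

In this toy data both combinations retain every sampled record, but matching on `user` alone fans each `u1` order out to two bills, so only the two-field combination survives.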

In an alternative manner, the flow acquisition unit 601 is further configured to: acquire historical service data in the production system and establish a temporary service data table; select, according to the dependency relationships among the data nodes, the association table of any data node from the set of association tables associated with the initial table of the initial data node; determine a basic association flow direction according to the time fields of the initial table and the association table; determine the internal association flow direction between the association tables of any two data nodes in the association table set; and obtain the optimal flow direction according to the basic association flow directions and the internal association flow directions to form the data flow map.

In an alternative manner, the flow acquisition unit 601 is further configured to: if a first basic association flow direction can be realized through a second basic association flow direction together with an internal association flow direction, retain the second basic association flow direction and the internal association flow direction and delete the first basic association flow direction; traverse the basic association flow directions and the internal association flow directions, and take those finally retained as the optimal flow directions; and form the data flow map according to the optimal flow directions among the data nodes.
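The pruning rule resembles a transitive reduction of a directed graph, and a minimal sketch under that reading is given below. The names `reachable` and `prune_flows` are hypothetical; the sketch generalizes the rule to detours of any length and assumes the flow directions contain no cycles, neither of which is stated in the embodiment.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (source table, target table)


def reachable(edges: Set[Edge], start: str, goal: str) -> bool:
    """Depth-first search: can goal be reached from start using edges?"""
    adjacency: Dict[str, List[str]] = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def prune_flows(edges: Iterable[Edge]) -> Set[Edge]:
    """Delete every flow direction that the remaining ones can already realize."""
    kept = set(edges)
    for edge in sorted(kept):
        rest = kept - {edge}
        if reachable(rest, edge[0], edge[1]):
            kept = rest
    return kept


basic_flows = [("start", "a"), ("start", "b")]   # initial table -> association tables
internal_flows = [("a", "b")]                    # between association tables
optimal = prune_flows(basic_flows + internal_flows)
```

Here the direct flow from `start` to `b` is dropped because it can be realized through `start` to `a` followed by `a` to `b`.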

In an alternative manner, the data flow characteristics include a start node data amount and an end node data amount; the anomaly detection unit 604 is configured to: compare the node data amounts of the data nodes on each flow direction in the service data with the data flow map, calculate the amount of missing data, mark the positions of the data nodes, and output a data missing level list and a data missing detail list; and output an abnormal data detection result according to the data missing level list and the data missing detail list, and locate the node at which the abnormal data occurs.
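One possible form of this comparison is sketched below, assuming that each data node exposes the set of record keys it has received. The function `find_missing`, the dictionary layouts of the two lists and the sample keys are hypothetical and serve only to make the start-node versus end-node comparison concrete.

```python
from typing import Dict, List, Set, Tuple


def find_missing(
    flow_map: List[Tuple[str, str]],
    keys_by_node: Dict[str, Set[str]],
) -> Tuple[List[dict], List[dict]]:
    """Compare start and end node data along each flow direction.

    Returns (level_list, detail_list): one summary row per flow direction that
    lost records, and one detail row per missing record, marked with the end
    node at which it failed to arrive.
    """
    level_list: List[dict] = []
    detail_list: List[dict] = []
    for start, end in flow_map:
        missing = keys_by_node[start] - keys_by_node[end]
        if missing:
            level_list.append({
                "start": start,
                "end": end,
                "start_count": len(keys_by_node[start]),
                "end_count": len(keys_by_node[end]),
                "missing": len(missing),
            })
            detail_list.extend({"node": end, "key": key} for key in sorted(missing))
    return level_list, detail_list


flow_map = [("order", "billing"), ("billing", "invoice")]
keys_by_node = {
    "order": {"o1", "o2", "o3"},
    "billing": {"o1", "o2", "o3"},
    "invoice": {"o1", "o3"},
}
level_list, detail_list = find_missing(flow_map, keys_by_node)
```

With this data the loss is attributed to the `invoice` node, since record `o2` reached `billing` but not `invoice`.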

Embodiments of the present invention provide a non-volatile computer storage medium having stored thereon at least one executable instruction for performing the data quality monitoring method of any of the method embodiments described above.

The executable instructions may be specifically used to cause a processor to perform the following operations:

acquiring a data flow map among all data nodes;

acquiring real-time business data in a production system;

acquiring the data flow characteristics of each data node in the service data according to the data flow map;

and carrying out consistency comparison on the data flow characteristics of each data node in the business data and the data flow map, outputting an abnormal data detection result and positioning an abnormal occurrence node of the abnormal data.

In one alternative, the executable instructions cause the processor to perform the following operations:

carrying out data characteristic analysis on historical service data in a production system to obtain a dependency relationship among data nodes;

and obtaining the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow map.

In one alternative, the executable instructions cause the processor to perform the following operations:

acquiring historical service data in a production system, and establishing a training data node table;

collecting field characteristics according to the data node table respectively, and acquiring general field combinations of all data nodes;

for any two data nodes, obtaining the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations;

and judging the dependency relationship between any two data nodes in the optimal field combination according to the time field in the training data node table.

In one alternative, the executable instructions cause the processor to perform the following operations:

extracting a preset number of business data in the general field combination of any data node to be matched with the general field combination of another data node, and counting the retention rate and the expansion rate;

taking the field combination with the highest retention rate;

when a plurality of field combinations with highest retention rate exist, eliminating the field combinations with abnormal expansion rate to form special field combinations;

splicing the special field combination and the general field combination of the other data node to form a new inter-table field combination;

and repeatedly iterating the inter-table field combination and the special field combination until no new special field combination is generated between tables, thereby obtaining the optimal field combination.

In one alternative, the executable instructions cause the processor to perform the following operations:

acquiring historical service data in a production system, and establishing a service data temporary table;

selecting an association table of any data node from an association table set associated with an initial table of the initial data node according to the dependency relationship among the data nodes;

judging a basic association flow direction according to the initial table and the time field of the association table;

judging the internal association flow direction of association tables of any two data nodes in the association table set;

and obtaining an optimal flow direction according to the basic association flow direction and the internal association flow direction, and forming a data flow map.

In one alternative, the executable instructions cause the processor to perform the following operations:

if the first basic association flow direction can be realized through the second basic association flow direction and the internal association flow direction, reserving the second basic association flow direction and the internal association flow direction, and deleting the first basic association flow direction;

traversing the basic association flow direction and the internal association flow direction, and finally reserving the basic association flow direction and the internal association flow direction as optimal flow directions;

and forming a data flow map according to the optimal flow direction among the data nodes.

In an alternative manner, the data flow characteristics include a start node data amount and an end node data amount; the executable instructions cause the processor to perform the following operations:

comparing the node data amounts of the data nodes on each flow direction in the business data with the data flow map, calculating the amount of missing data, marking the positions of the data nodes, and outputting a data missing level list and a data missing detail list;

and outputting an abnormal data detection result according to the data missing level list and the data missing detail list, and positioning an abnormal occurrence node of the abnormal data.

An embodiment of the present invention provides a computer program product comprising a computer program stored on a computer storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the data quality monitoring method of any of the method embodiments described above.

FIG. 7 shows a schematic structural diagram of a computing device according to an embodiment of the invention; the specific embodiments of the invention do not limit the particular implementation of the computing device.

As shown in fig. 7, the computing device may include: a processor 702, a communication interface 704, a memory 706, and a communication bus 708.

Wherein: the processor 702, the communication interface 704, and the memory 706 communicate with each other via the communication bus 708. The communication interface 704 is used to communicate with network elements of other devices, such as clients or other servers. The processor 702 is configured to execute the program 710, and may specifically perform the relevant steps of the data quality monitoring method embodiments described above.

In particular, program 710 may include program code including computer-operating instructions.

The processor 702 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.

The memory 706 is used to store the program 710. The memory 706 may comprise high-speed RAM, and may further comprise non-volatile memory, such as at least one disk memory.

The program 710 may be specifically configured to cause the processor 702 to perform the operations of the data quality monitoring method embodiments described above, including:

acquiring a data flow map among all data nodes; and

acquiring real-time business data in a production system.

The alternative manners described above for the executable instructions, such as taking the field combination with the highest retention rate, and the manner in which the data flow characteristics include a start node data amount and an end node data amount, apply equally to the program 710.

The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is provided to disclose the best mode of the present invention.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims

1. A method of data quality monitoring, the method comprising:

acquiring a data flow map among the data nodes, including: carrying out data characteristic analysis on historical service data in a production system to obtain a dependency relationship among data nodes; acquiring the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow map;

the step of carrying out data characteristic analysis on the historical service data to obtain the dependency relationship among the data nodes comprises the following steps: acquiring historical service data in a production system, and establishing a training data node table; collecting field characteristics according to the data node table respectively, and acquiring general field combinations of all data nodes; for any two data nodes, obtaining the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations; judging the dependency relationship between any two data nodes in the optimal field combination according to the time field in the training data node table;

the step of obtaining the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow map comprises the following steps: acquiring historical service data in a production system, and establishing a service data temporary table; selecting an association table of any data node from an association table set associated with an initial table of the initial data node according to the dependency relationship among the data nodes; judging a basic association flow direction according to the initial table and the time field of the association table; judging the internal association flow direction of association tables of any two data nodes in the association table set; acquiring an optimal flow direction according to the basic association flow direction and the internal association flow direction, and forming a data flow map;

acquiring real-time business data in a production system;

2. The method according to claim 1, wherein said obtaining, for any two of said data nodes, the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations comprises:

taking the field combination with the highest retention rate;

3. The method of claim 1, wherein the obtaining the optimal flow direction according to the base associated flow direction and the internal associated flow direction to form a data flow map comprises:

4. The method of claim 1, wherein the data flow characteristics include a start node data amount and an end node data amount;

the step of carrying out consistency comparison on the data flow characteristics of each data node in the business data and the data flow map, outputting abnormal data and positioning the abnormal occurrence node of the abnormal data comprises the following steps:

5. A data quality monitoring device, the device comprising:

the flow acquisition unit is used for acquiring a data flow map among the data nodes, including: carrying out data characteristic analysis on historical service data in a production system to obtain a dependency relationship among data nodes; acquiring the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow map;

the data acquisition unit is used for acquiring real-time service data in the production system;

the feature extraction unit is used for acquiring the data flow features of each data node in the service data according to the data flow map;

and the anomaly detection unit is used for carrying out consistency comparison on the data flow characteristics of each data node in the business data and the data flow map, outputting an anomaly data detection result and positioning an anomaly occurrence node of the anomaly data.

6. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with each other through the communication bus;

the memory is configured to hold at least one executable instruction that causes the processor to perform the steps of the data quality monitoring method according to any one of claims 1-4.

7. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform the steps of the data quality monitoring method according to any one of claims 1-4.