CN112241443A

CN112241443A - Data quality monitoring method and device, computing equipment and computer storage medium

Info

Publication number: CN112241443A
Application number: CN201910642686.7A
Authority: CN
Inventors: 金崇超; 孙新华; 刘坤
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Zhejiang Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Zhejiang Co Ltd
Priority date: 2019-07-16
Filing date: 2019-07-16
Publication date: 2021-01-19
Anticipated expiration: 2039-07-16
Also published as: CN112241443B

Abstract

The embodiments of the invention relate to the technical field of quality monitoring and disclose a data quality monitoring method, a data quality monitoring device, computing equipment and a computer storage medium. The method comprises: acquiring a data flow map among data nodes; acquiring real-time service data in a production system; acquiring data flow characteristics of each data node in the service data according to the data flow map; and comparing the data flow characteristics of the data nodes in the service data with the data flow map for consistency, outputting an abnormal data detection result and locating the node at which the abnormal data occurred. In this way, the embodiments use the association relationships among the data of the production system to automatically sort out how data flows under different scenarios and services, monitor and compare data quality across all services, all scenarios and the whole process, identify the link at which a data anomaly occurs, improve data quality monitoring efficiency and reduce labor cost.

Description

Data quality monitoring method and device, computing equipment and computer storage medium

Technical Field

The embodiment of the invention relates to the technical field of quality monitoring, in particular to a data quality monitoring method, a data quality monitoring device, computing equipment and a computer storage medium.

Background

As the telecommunication market develops, and driven by intense market competition and by technology, operators have gradually shifted from extensive operation, in which price wars attract users, to refined, customer-centered operation, in which diversified customer requirements are met through business innovation. In this process, the business support system becomes more and more complex because of the many service types and flexible price packages, and the risk of data quality problems caused by anomalies in the production process inevitably increases. Once a data quality problem occurs in the production system, it can affect the normal handling of business on the one hand and the results of later data-related work on the other.

For data quality problems, the conventional approach is to perform manual audit comparison at regular intervals, which can be roughly divided into three steps: 1) establishing data audit points, each set up according to the service conditions and experience; 2) manually sorting out the audit flows, that is, using a system-interaction panoramic framework to manually sort out the business scenario flows related to each audit point and the data tables those flows pass through; 3) regular audit verification, in which personnel are assigned to audit each audit point periodically, comparing data quality step by step according to the audit flow.

In the process of implementing the embodiments of the present invention, the inventors found the following. To meet diversified customer requirements and keep up with a continuously changing market environment, operators need to launch innovative services in a timely manner. Under the traditional data quality audit comparison approach, the audit points and audit flows therefore have to be updated continuously and in step, and the sorting out of service flows and data flows is done entirely by hand, which is time-consuming, labor-intensive and extremely inefficient. In addition, when services and scenarios interact in the production environment, the derived data conditions are complex and variable; manually established audit points often cover the data quality audit of only some services and scenarios, so the audit coverage is limited. Furthermore, the specific flows differ by service and scenario and are complicated, and the traditional manual audit comparison approach focuses only on end-to-end data quality problems: it cannot pinpoint the intermediate link of the service at which a problem arises, and cannot monitor the link at which a data quality problem occurs.

Disclosure of Invention

In view of the above problems, embodiments of the present invention provide a data quality monitoring method, apparatus, computing device and computer storage medium, which overcome or at least partially solve the above problems.

According to an aspect of an embodiment of the present invention, there is provided a data quality monitoring method, including: acquiring a data flow map among data nodes; acquiring real-time service data in a production system; acquiring data flow characteristics of each data node in the service data according to the data flow map; and comparing the data flow characteristics of the data nodes in the service data with the data flow map for consistency, outputting an abnormal data detection result and locating the abnormal occurrence node of the abnormal data.

In an optional manner, the obtaining a data flow graph between data nodes includes: performing data characteristic analysis on historical service data in the production system to obtain the dependency relationship among data nodes; and obtaining the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow map.

In an optional manner, the performing data feature analysis on the historical service data to obtain a dependency relationship between data nodes includes: acquiring historical service data in a production system, and establishing a training data node table; collecting field characteristics for each data node according to the training data node table, and acquiring the general field combinations of each data node; for any two data nodes, obtaining the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations; and judging the dependency relationship between the two data nodes in the optimal field combination according to the time field in the training data node table.

In an optional manner, the obtaining, for any two of the data nodes, the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations includes: extracting a preset amount of service data in the general field combination of one data node, matching it against the general field combination of the other data node, and counting the retention rate and the expansion rate; selecting the field combination with the highest retention rate; when several field combinations share the highest retention rate, rejecting those with abnormal expansion rates to form a special field combination; splicing the special field combination with the general field combination of the other data node to form a new inter-table field combination; and iterating over the inter-table field combination and the special field combination until no new special field combination is generated between the tables, thereby obtaining the optimal field combination.

In an optional manner, the obtaining an optimal flow direction between data nodes according to a dependency relationship between the data nodes to form a data flow graph includes: acquiring historical service data in a production system, and establishing a temporary service data table; selecting an association table of any data node from an association table set associated with an initial table of the initial data node according to the dependency relationship among the data nodes; judging a basic association flow direction according to the time fields of the initial table and the association table; judging the internal association flow direction of the association tables of any two data nodes in the association table set; and obtaining the optimal flow direction according to the basic associated flow direction and the internal associated flow direction to form a data flow map.

In an optional manner, the obtaining an optimal flow direction according to the basic association flow direction and the internal association flow direction to form a data flow map includes: if a first basic association flow direction can be realized through a second basic association flow direction together with an internal association flow direction, retaining the second basic association flow direction and the internal association flow direction, and deleting the first basic association flow direction; traversing the basic association flow directions and the internal association flow directions, and taking those finally retained as the optimal flow direction; and forming the data flow map according to the optimal flow direction among the data nodes.

In an optional manner, the data flow characteristics include a start node data volume and an end node data volume; the comparing the data flow characteristics of the data nodes in the service data with the data flow map for consistency, outputting an abnormal data detection result and locating the abnormal occurrence node of the abnormal data includes: comparing the node data volumes of the data nodes in the service data along the flows of the data flow map, calculating the missing data volume, marking the positions of the data nodes, and outputting a data missing magnitude list and a data missing detail list; and outputting the abnormal data detection result according to the data missing magnitude list and the data missing detail list, and locating the abnormal occurrence node of the abnormal data.

According to another aspect of the embodiments of the present invention, there is provided a data quality monitoring apparatus, including: the flow acquiring unit is used for acquiring a data flow map among the data nodes; the data acquisition unit is used for acquiring real-time service data in the production system; the characteristic extraction unit is used for acquiring the data flow characteristics of each data node in the service data according to the data flow map; and the abnormal detection unit is used for comparing the consistency of the data flow characteristics of the data nodes in the service data and the data flow maps, outputting an abnormal data detection result and positioning an abnormal occurrence node of the abnormal data.

According to another aspect of embodiments of the present invention, there is provided a computing device including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the steps of the data quality monitoring method.

According to a further aspect of the embodiments of the present invention, there is provided a computer storage medium having at least one executable instruction stored therein, the executable instruction causing the processor to execute the steps of the data quality monitoring method described above.

The embodiments of the invention acquire the data flow map among the data nodes; acquire real-time service data in a production system; acquire data flow characteristics of each data node in the service data according to the data flow map; and compare the data flow characteristics of the data nodes in the service data with the data flow map for consistency, output an abnormal data detection result and locate the abnormal occurrence node of the abnormal data. By using the association relationships among the data of the production system, the embodiments automatically sort out how data flows under different scenarios and services, monitor and compare data quality across all services, all scenarios and the whole process, identify the link at which a data anomaly occurs, improve data quality monitoring efficiency and reduce labor cost.

The foregoing is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented according to the content of the description, and so that the foregoing and other objects, features and advantages of the embodiments are more readily apparent, specific embodiments of the present invention are set out below.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a schematic flow chart of a data quality monitoring method provided by an embodiment of the present invention;

FIG. 2 is a flow chart of another data quality monitoring method provided by an embodiment of the invention;

FIG. 3 is a flow chart of another data quality monitoring method provided by the embodiment of the invention;

FIG. 4 is a graph illustrating retention and expansion rates of a data quality monitoring method provided by an embodiment of the invention;

FIG. 5 is a schematic diagram illustrating an optimal flow direction acquisition of a data quality monitoring method according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a data quality monitoring apparatus provided in an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a computing device provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Fig. 1 is a schematic flow chart illustrating a data quality monitoring method according to an embodiment of the present invention. As shown in fig. 1, the data quality monitoring method includes:

Step S11: acquiring a data flow map among the data nodes.

In the embodiment of the invention, a pre-trained data flow map among the data nodes may be obtained directly; alternatively, the data flow map among the data nodes may be obtained by training on historical service data in the production system.

When the historical service data in the production system is trained to obtain the data flow graph between the data nodes, as shown in fig. 2, step S11 includes:

Step S111: performing data feature analysis on historical service data in the production system to obtain the dependency relationship among the data nodes.

In the embodiment of the invention, historical service data in the production system is used to create a training node set table, a feature framework based on the data expansion rate, the retention rate and the like is constructed, the optimal field combination of each data node is obtained according to the expansion rate and the retention rate, and the dependency relationship among the data nodes is obtained according to the explicit or implicit relationship among the node tables.

As shown in fig. 3, step S111 includes the steps of:

Step S115: acquiring historical service data in the production system, and establishing a training data node table.

Historical service data of the training node tables in the production system is collected, and a corresponding temporary training node table is created for subsequent training; in this way a training data set is established and the training data node table is created.

Step S116: collecting field characteristics for each data node according to the training data node table, and acquiring a plurality of general field combinations of each data node.

Specifically, for each data node in the training data node table, the fields whose type is character or numeric are extracted and analyzed to obtain a field analysis result; label fields are then extracted from the field analysis result, and fields whose null values exceed a threshold are eliminated, forming a plurality of general field combinations for each data node.

For the created training data node table, the fields of character or numeric type are extracted, and field characteristics of the relevant fields of each data table are counted in turn, such as the number of records, the number of records after deduplication, the number of null-value records, the value length that occurs most often and the number of records having that length, to form the data field analysis result table shown in table 1.

Table 1 Data field analysis result table

Field name	Logical name	Data type
NODE_ID	Node ID	NUMBER(10)
NODE_CODE	Node code	VARCHAR2(30)
COLUMN_NAME	Field code	VARCHAR2(30)
COLUMN_TYPE	Field type	VARCHAR2(30)
ALL_CNT	Number of records in the node	NUMBER(20)
COL_CNT	Number of records of the field after deduplication	NUMBER(20)
NULL_CNT	Number of null-value records of the field	NUMBER(20)
MAX_LENGTH	Value length with the most records for the field	NUMBER(4)
MAX_LENGTH_CNT	Number of records of the field having that length	NUMBER(20)

Fields of a suspected label set are extracted according to the collected field characteristics, for analyzing field diversity and writing the data feature field results. The suspected label set field extraction rule is as follows:

Fields with excessive null values are further eliminated according to a preset rule, and the field features are screened to form general field combinations that meet the conditions.
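The field statistics of table 1 and the null-value screening can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names `profile_field` and `screen_fields`, the in-memory list-of-dicts table representation and the 0.5 null-ratio threshold are all assumptions, and the suspected label set rule is not modelled because the text does not state it.

```python
from collections import Counter

def profile_field(rows, column):
    """Collect Table 1 style statistics for one column of a node table."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None and v != ""]
    length_counts = Counter(len(str(v)) for v in non_null)
    # Most frequent value length and the number of records having it.
    max_length, max_length_cnt = (
        length_counts.most_common(1)[0] if length_counts else (0, 0)
    )
    return {
        "COLUMN_NAME": column,
        "ALL_CNT": len(values),             # records in the node
        "COL_CNT": len(set(non_null)),      # records after deduplication
        "NULL_CNT": len(values) - len(non_null),
        "MAX_LENGTH": max_length,
        "MAX_LENGTH_CNT": max_length_cnt,
    }

def screen_fields(rows, columns, max_null_ratio=0.5):
    """Keep columns whose null ratio stays within the assumed threshold."""
    kept = []
    for column in columns:
        stats = profile_field(rows, column)
        if stats["ALL_CNT"] and stats["NULL_CNT"] / stats["ALL_CNT"] <= max_null_ratio:
            kept.append(column)
    return kept

# Hypothetical sample data for one node table.
rows = [
    {"user_id": "u001", "remark": None},
    {"user_id": "u002", "remark": None},
    {"user_id": "u002", "remark": "ok"},
]
stats = profile_field(rows, "user_id")
candidates = screen_fields(rows, ["user_id", "remark"])
```

In this sample, `remark` is null in two of three records and is therefore dropped, while `user_id` remains a candidate for the general field combinations.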

Step S117: for any two data nodes, obtaining the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations.

Specifically, a preset number of service data records in the general field combination of one data node are extracted and matched against the general field combination of another data node, and the retention rate and the expansion rate are counted; the field combination with the highest retention rate is selected; and when several field combinations share the highest retention rate, those with abnormal expansion rates are rejected, leaving the special field combination. In the embodiment of the present invention, as shown in fig. 4, taking any two associated data nodes A and B as an example, 10000 records in the general field combination of data node A are extracted to form table A, and the general field combination of data node B forms the full table B. Table A is matched against the associated full table B on each candidate field combination, and the data retention rate and the expansion rate are counted in turn. The retention rate is (the data volume of table A that intersects the full table B) / 10000, and the expansion rate is (the data volume of table B that intersects table A) / (the data volume of table A that intersects the full table B). For example, referring to fig. 4, table A includes an α field, table B includes a β field and a γ field, and the two tables are associated on the field combination α = β, resulting in Q = 2 and P = 3. If the retention rate of a field combination between the associated tables meets a threshold, the field combination with the highest retention rate is selected; otherwise the field combination is rejected. If several field combinations meet the condition, those with abnormal expansion rates, that is, expansion rates greater than 100, are rejected.
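The two ratios can be illustrated with a small sketch. It takes one possible reading of the definitions above: the retention rate is the share of sampled table A records whose key also appears in the full table B, and the expansion rate is the number of table B records carrying those keys divided by the number of matched table A records. The function names, the `min_retention` default and the sample data are hypothetical; only the expansion limit of 100 comes from the text.

```python
from collections import Counter

def retention_and_expansion(sample_a, full_b, field_a, field_b):
    """Match a sample of table A against the full table B on one field pair."""
    b_counts = Counter(
        row[field_b] for row in full_b if row.get(field_b) is not None
    )
    # Table A records whose key value also occurs in table B.
    matched_a = [row for row in sample_a if row.get(field_a) in b_counts]
    matched_keys = {row[field_a] for row in matched_a}
    # Table B records that carry one of the matched key values.
    matched_b = sum(b_counts[key] for key in matched_keys)
    retention = len(matched_a) / len(sample_a) if sample_a else 0.0
    expansion = matched_b / len(matched_a) if matched_a else 0.0
    return retention, expansion

def is_usable(retention, expansion, min_retention=0.9, max_expansion=100):
    """Assumed acceptance test: retention meets a threshold, expansion is not abnormal."""
    return retention >= min_retention and expansion <= max_expansion

# Hypothetical sample: table A has field alpha, table B has field beta.
sample_a = [{"alpha": 1}, {"alpha": 2}, {"alpha": 3}, {"alpha": 4}]
full_b = [{"beta": 1}, {"beta": 1}, {"beta": 2}, {"beta": 9}]
retention, expansion = retention_and_expansion(sample_a, full_b, "alpha", "beta")
```

Here two of the four sampled table A records find a match, and three table B records carry the matched keys, so the retention rate is 0.5 and the expansion rate is 1.5.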

After a special field combination is obtained, it is spliced with the general field combination of the other data node to form a new inter-table field combination; the inter-table field combination and the special field combination are then iterated over repeatedly until no new special field combination is generated between the tables, at which point the optimal field combination is obtained.

Step S118: judging the dependency relationship between any two data nodes in the optimal field combination according to the time field in the training data node table.

Whether two associated tables are in an explicit or an implicit relationship is judged according to the time fields in the training data node table: if the time fields of the two tables show a time sequence relationship, the relationship is explicit; otherwise it is implicit. The dependency relationships between the data nodes are output, forming the association table of dependencies between data nodes shown in table 2.

Table 2 Association table of dependencies between data nodes

Step S112: obtaining the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow map.

Basic and internal association flow directions are formed on the basis of the dependency relationship among the data nodes, and the data flow map is finally established automatically, which completes the analysis of the data feature dependency model.

In the embodiment of the invention, historical service data in the production system is obtained, and a temporary service data table is established. Specifically, the service data of the entry data node is taken according to the configuration table, and a new temporary table of the related service data is created from the training data node table according to the service caliber of the training service entry. Each data node is then extracted, and an association table of any data node is selected from the association table set associated with the initial table of the initial data node according to the dependency relationship among the data nodes; a basic association flow direction is judged according to the time fields of the initial table and the association table; and the internal association flow direction between the association tables of any two data nodes in the association table set is judged. For example, any association table b_i is selected from the association table set {B} of the initial table a and sorted out, yielding the following characteristics: node a, the association field of node a, node b_i, the association field of node b_i, and the association type. Based on node a and node b_i, a temporary table of association results is generated in turn, the time fields of the initial table a and of the association table b_i are spliced in, and the initial data volume of the initial table a and the data volume of its association set with the association table b_i are counted to judge the basic association flow direction: if the confidence of the rule a → b_i in the timing relationship exceeds a threshold, the business flow is considered to pass through table b_i; otherwise it does not. The association table set {B} is traversed to obtain the basic association flow direction a → {B'}.

The association tables in the association table set {B} that have an association relationship are compared pairwise to judge the flow direction: the confidences of {B_i'} → {B_j'} (i ≠ j) and {B_j'} → {B_i'} (i ≠ j) are calculated respectively, and a direction whose confidence exceeds the threshold is taken as a high-confidence internal association flow direction.
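The text does not give a formula for the confidence of a rule such as a → b_i. One plausible reading, used in the sketch below purely as an assumption, is the share of records in the initial table that are followed in time by a record with the same association key in the association table. The names `flow_confidence` and `flows_through`, the fields `user_id` and `ts`, the integer timestamps and the 0.6 threshold are all hypothetical.

```python
def flow_confidence(start_rows, assoc_rows, key="user_id", time_field="ts"):
    """Share of start-table records followed in time by a matching record."""
    # Latest timestamp seen per key in the association table.
    latest = {}
    for row in assoc_rows:
        k, t = row[key], row[time_field]
        if k not in latest or t > latest[k]:
            latest[k] = t
    if not start_rows:
        return 0.0
    followed = sum(
        1
        for row in start_rows
        if row[key] in latest and latest[row[key]] >= row[time_field]
    )
    return followed / len(start_rows)

def flows_through(start_rows, assoc_rows, threshold=0.6):
    """Assumed decision rule: the flow a -> b_i holds if confidence exceeds the threshold."""
    return flow_confidence(start_rows, assoc_rows) > threshold

# Hypothetical tables: integer timestamps, user_id as the association field.
table_a = [{"user_id": u, "ts": 1} for u in ("u1", "u2", "u3", "u4")]
table_b = [
    {"user_id": "u1", "ts": 2},
    {"user_id": "u2", "ts": 3},
    {"user_id": "u3", "ts": 5},
]
conf_a_to_b = flow_confidence(table_a, table_b)
conf_b_to_a = flow_confidence(table_b, table_a)
```

With this data three of the four records in table a are later seen in table b, while no record in table b is followed by one in table a, so only the direction a → b would be kept. The same function could be applied pairwise to the association tables to judge the internal association flow directions.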

In the embodiment of the invention, an optimal flow direction is further obtained according to the basic association flow direction and the internal association flow direction, so as to form a data flow map. Specifically: if a first basic association flow direction can be realized through a second basic association flow direction together with an internal association flow direction, the second basic association flow direction and the internal association flow direction are retained and the first basic association flow direction is deleted; the basic association flow directions and the internal association flow directions are traversed, and those finally retained are taken as the optimal flow direction; and the data flow map is formed according to the optimal flow direction among the data nodes. For example, referring to fig. 5, if there is a basic association flow direction such as a → b2, and the start node can also reach the end node through other nodes along basic and internal association flow directions, such as a → b1 and b1 → b2, then the original basic association flow direction a → b2 is eliminated, and the remaining basic and internal association flow directions are merged and retained as the optimal flow direction. A business process is formed according to the optimal flow direction among the data nodes, with the result shown in table 3, and the data flow map is then generated.

Table 3 Business process result table

Field name	Logical name	Data type
JOB_ID	Training item ID	NUMBER(10)
TASK_ID	Training task ID	NUMBER(10)
FLOW_ID	Process ID	NUMBER(10)
START_NODE_ID	Start node ID	NUMBER(10)
START_NODE_CODE	Start node code	VARCHAR2(30)
END_NODE_CODE	End node code	VARCHAR2(30)
END_NODE_ID	End node ID	NUMBER(10)
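The pruning of redundant basic association flow directions described with reference to fig. 5 resembles a transitive reduction restricted to the basic edges. The sketch below is one way to express it under that interpretation; the function names and the edge-set representation are assumptions, and the order in which redundant edges are examined is a design choice the text does not specify.

```python
def reachable(edges, source, target, skip_edge):
    """Depth-first search: can target be reached from source without skip_edge?"""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        for u, v in edges:
            if u != node or (u, v) == skip_edge or v in seen:
                continue
            if v == target:
                return True
            seen.add(v)
            stack.append(v)
    return False

def optimal_flows(base_edges, internal_edges):
    """Drop a basic flow direction when another retained path already realizes it."""
    kept = set(base_edges) | set(internal_edges)
    for edge in sorted(base_edges):
        if reachable(kept, edge[0], edge[1], skip_edge=edge):
            kept.discard(edge)
    return kept

# The example of fig. 5: a -> b2 is implied by a -> b1 and b1 -> b2.
base = {("a", "b1"), ("a", "b2")}
internal = {("b1", "b2")}
flows = optimal_flows(base, internal)
```

Each retained pair corresponds to one row of the business process result table, with the first element as the start node and the second as the end node.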

According to the embodiment of the invention, a feature framework based on the data expansion rate, the retention rate and the like is constructed to form the basic and internal association flow directions, and the data flow map is finally established automatically. The association relationships among the production system data can thus be used in the subsequent steps, automatic monitoring replaces purely manual auditing, data quality audit efficiency is improved, and labor cost is reduced.

Step S12: acquiring real-time service data in the production system.

Specifically, the service data of the entry data node is taken according to the configuration table, the real-time service data in the production system is obtained from the training data node table according to the service caliber of the service entry, and a new temporary table of the related service data is created.

Step S13: acquiring the data flow characteristics of each data node in the service data according to the data flow map.

Specifically, the data flow characteristics of each data node are counted according to the data flow map formed by training, wherein the data flow characteristics comprise the data volume of the start node and the data volume of the end node.

Step S14: comparing the data flow characteristics of the data nodes in the service data with the data flow map for consistency, outputting an abnormal data detection result and locating the abnormal occurrence node of the abnormal data.

The node data volumes of the data nodes in the service data are compared along the flows of the data flow map, the missing data volume is calculated, the positions of the data nodes are marked, and a data missing magnitude list and a data missing detail list are output, as shown in table 4 and table 5 respectively.

Table 4 Data missing magnitude list

Field name	Logical name	Data type
JOB_ID	Training item ID	NUMBER(10)
TM_INTRVL_ID	Training task ID	VARCHAR2(30)
FLOW_ID	Process ID	NUMBER(10)
START_NODE_CODE	Start node code	VARCHAR2(30)
END_NODE_CODE	End node code	VARCHAR2(30)
V_START_CNT	Start node data volume	NUMBER(10)
V_END_CNT	End node data volume	NUMBER(10)
GRP_EXPR	Magnitude of difference	VARCHAR2(1000)

Table 5 Data missing detail list

Further, an abnormal data detection result is output according to the data missing magnitude list and the data missing detail list, and the abnormal occurrence node of the abnormal data is located. The abnormal data detection results are recorded in the form shown in table 6.

Table 6 Abnormal link detection results

Field name	Logical name	Data type
USER_ID	User ID	VARCHAR2(30)
JOB_ID	Training item ID	NUMBER(10)
TASK_ID	Training task ID	NUMBER(10)
NODE_ID	Data node ID	NUMBER(10)
NODE_CODE	Data node code	VARCHAR2(30)
GRP_ROWNO	Number of user records	NUMBER(10)
GRP_EXPR	Record type of user	VARCHAR2(1000)
ERR_FLAG	Whether the user is abnormal at the node	NUMBER(1)
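The comparison of step S14 can be sketched as follows, assuming that each data node is represented by the set of user identifiers observed at it. The functions `missing_magnitude` and `locate_missing`, the `MISSING` key and the sample data are hypothetical; the other keys follow the field names of table 4.

```python
def missing_magnitude(flows, node_keys):
    """Per flow, compare start and end node data volumes (Table 4 style)."""
    report = []
    for start, end in flows:
        start_cnt = len(node_keys.get(start, set()))
        end_cnt = len(node_keys.get(end, set()))
        report.append({
            "START_NODE_CODE": start,
            "END_NODE_CODE": end,
            "V_START_CNT": start_cnt,
            "V_END_CNT": end_cnt,
            # Hypothetical field: records that did not arrive at the end node.
            "MISSING": max(start_cnt - end_cnt, 0),
        })
    return report

def locate_missing(flows, node_keys):
    """Per flow, list the keys seen at the start node but absent at the end node."""
    details = {}
    for start, end in flows:
        lost = node_keys.get(start, set()) - node_keys.get(end, set())
        if lost:
            details[(start, end)] = sorted(lost)
    return details

# Hypothetical real-time data: user u2 never reaches node b2.
flows = [("a", "b1"), ("b1", "b2")]
node_keys = {
    "a": {"u1", "u2", "u3"},
    "b1": {"u1", "u2", "u3"},
    "b2": {"u1", "u3"},
}
magnitude = missing_magnitude(flows, node_keys)
details = locate_missing(flows, node_keys)
```

In this sample the flow a → b1 is intact, while b1 → b2 loses one record, so the anomaly for user u2 is located at the link between b1 and b2, which is the kind of result that would be recorded with an abnormal flag for that user at that node.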

The embodiment of the invention overcomes the limitations that manpower and complex flows impose on the traditional approach. By automatically sorting out how data flows under different scenarios and services, it monitors and compares data quality across all services, all scenarios and the whole process without being restricted to manually established audit points. It automatically checks for abnormal service operation in the production process on the basis of anomalies in the data flow, can quickly locate the abnormal link, and provides the cause of the abnormal service operation, namely the link at which the data anomaly occurred.

Fig. 6 shows a schematic structural diagram of a data quality monitoring device according to an embodiment of the present invention. As shown in fig. 6, the data quality monitoring apparatus includes: a flow acquisition unit 601, a data acquisition unit 602, a feature extraction unit 603, and an abnormality detection unit 604. Wherein:

the flow acquiring unit 601 is configured to acquire a data flow map between data nodes; the data obtaining unit 602 is configured to obtain real-time service data in the production system; the feature extraction unit 603 is configured to obtain data flow features of each data node in the service data according to the data flow graph; the anomaly detection unit 604 is configured to compare consistency of the data flow characteristics of the data nodes in the service data and the data flow graph, output an anomaly data detection result, and locate an anomaly occurrence node of the anomaly data.

In an optional manner, the flow acquiring unit 601 is configured to: performing data characteristic analysis on historical service data in the production system to obtain the dependency relationship among data nodes; and obtaining the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow map.

In an optional manner, the flow acquiring unit 601 is configured to: acquiring historical service data in a production system, and establishing a training data node table; respectively collecting field characteristics according to the data node table, and acquiring a general field combination of each data node; aiming at any two data nodes, acquiring the optimal field combination of any two data nodes according to the general field combination application expansion rate and retention rate; and judging the dependency relationship between any two data nodes in the optimal field combination according to the time field in the training data node table.

In an optional manner, the flow acquiring unit 601 is further configured to: extract a preset amount of service data under the general field combination of one data node, match it against the general field combination of another data node, and count the retention rate and the expansion rate; select the field combination with the highest retention rate; when several field combinations share the highest retention rate, remove those with abnormal expansion rates to form a special field combination; splice the special field combination with the general field combination of the other data node to form a new inter-table field combination; and iterate over the inter-table field combination and the special field combination until no new special field combination is generated between the tables, thereby obtaining the optimal field combination.
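One round of this selection can be sketched as below. The definitions used here are assumptions, since this passage does not define the two rates: the retention rate is taken as the share of sampled rows that find at least one match in the other table, and the expansion rate as the number of joined rows per matched sample row. The names `match_rates`, `pick_combination` and the `max_expansion` threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def match_rates(sample: List[Row], other: List[Row], fields: Sequence[str]) -> Tuple[float, float]:
    """Join `sample` to `other` on `fields`; return (retention rate, expansion rate)."""
    key_counts = Counter(tuple(r.get(f) for f in fields) for r in other)
    matches = [key_counts[tuple(r.get(f) for f in fields)] for r in sample]
    retained = sum(1 for m in matches if m > 0)
    joined = sum(matches)
    retention = retained / len(sample) if sample else 0.0
    expansion = joined / retained if retained else 0.0
    return retention, expansion


def pick_combination(
    sample: List[Row],
    other: List[Row],
    candidates: List[Sequence[str]],
    max_expansion: float = 1.5,  # assumed cut-off for an "abnormal" expansion rate
) -> List[Tuple[str, ...]]:
    """Keep the highest-retention candidates; on a tie, drop abnormal expansion rates."""
    rates = {tuple(c): match_rates(sample, other, c) for c in candidates}
    best = max(r[0] for r in rates.values())
    top = [c for c, r in rates.items() if r[0] == best]
    if len(top) > 1:
        top = [c for c in top if rates[c][1] <= max_expansion] or top
    return top
```

In this sketch a field such as a region code may match every row yet fan out to many rows, so it is discarded in favour of a field such as an order identifier that matches one to one.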

In an optional manner, the flow acquiring unit 601 is further configured to: acquire historical service data in the production system and establish a temporary service data table; select, according to the dependency relationships among the data nodes, an association table of any data node from the set of association tables associated with the initial table of the initial data node; determine a basic associated flow direction according to the time fields of the initial table and the association table; determine the internal associated flow direction between the association tables of any two data nodes in the association table set; and obtain the optimal flow directions according to the basic associated flow directions and the internal associated flow directions to form the data flow map.

In an optional manner, the flow acquiring unit 601 is further configured to: if a first basic associated flow direction can be realized through a second basic associated flow direction together with an internal associated flow direction, retain the second basic associated flow direction and the internal associated flow direction and delete the first basic associated flow direction; traverse the basic associated flow directions and the internal associated flow directions, and take those finally retained as the optimal flow directions; and form the data flow map according to the optimal flow directions among the data nodes.
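The pruning rule above resembles removing redundant edges from a directed graph. A minimal sketch follows, assuming that flow directions are stored as (source, target) pairs and that the internal associated flow directions contain no cycles. The function name `prune_basic_flows` is hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (source table, target table)


def prune_basic_flows(basic: Set[Edge], internal: Set[Edge]) -> Set[Edge]:
    """Drop a basic flow (start -> y) when another basic flow (start -> x)
    followed by an internal flow (x -> y) reaches the same table.
    Assumes the internal flows are acyclic."""
    kept = set(basic)
    for start, target in sorted(basic):
        reachable_via_other = any(
            (start, mid) in basic and mid != target
            for mid, end in internal
            if end == target
        )
        if reachable_via_other:
            kept.discard((start, target))
    # The retained basic flows plus all internal flows form the data flow map.
    return kept | internal
```

With an initial table S and association tables X, Y and Z, the direct flows S to Y and S to Z are dropped once X to Y and Y to Z are known, leaving the chain S, X, Y, Z.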

In an optional manner, the data flow characteristics include a start node data volume and an end node data volume; the anomaly detection unit 604 is configured to: compare the node data volumes of the data nodes along each flow in the service data with the data flow map, calculate the amount of missing data, mark the positions of the data nodes concerned, and output a data missing magnitude list and a data missing detail list; and output an abnormal data detection result according to the data missing magnitude list and the data missing detail list, and locate the node at which the abnormal data occurs.
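The comparison can be sketched as follows. The sketch assumes that every record carries a key that identifies it at each node and that every record seen at the start node of a flow is expected at the end node. The function name `compare_flow` and the shapes of the two returned lists are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (start node, end node)


def compare_flow(
    flow_map: List[Edge], keys_at: Dict[str, Set[str]]
) -> Tuple[List[Tuple[Edge, int, int, int]], Dict[Edge, List[str]]]:
    """For each flow, compare start-node and end-node data volume and list lost keys.

    Returns (magnitude list, detail list): the magnitude list holds
    (edge, start volume, end volume, missing count); the detail list maps
    each abnormal edge to the keys that never reached the end node.
    """
    magnitude: List[Tuple[Edge, int, int, int]] = []
    detail: Dict[Edge, List[str]] = {}
    for start, end in flow_map:
        sent = keys_at.get(start, set())
        arrived = sent & keys_at.get(end, set())
        lost = sorted(sent - arrived)
        if lost:
            magnitude.append(((start, end), len(sent), len(arrived), len(lost)))
            detail[(start, end)] = lost
    return magnitude, detail
```

The first edge in flow order that appears in the magnitude list marks where the data was lost, which is how the sketch locates the abnormal node.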

The embodiment of the invention provides a non-volatile computer storage medium storing at least one executable instruction which, when executed, performs the data quality monitoring method of any of the above method embodiments.

The executable instructions may be specifically configured to cause the processor to:

acquiring a data flow map among data nodes;

acquiring real-time service data in a production system;

acquiring data flow characteristics of each data node in the service data according to the data flow graph;

and comparing the consistency of the data flow characteristics of the data nodes in the service data and the data flow maps, outputting an abnormal data detection result and positioning an abnormal occurrence node of the abnormal data.

In an alternative, the executable instructions cause the processor to:

performing data characteristic analysis on historical service data in the production system to obtain the dependency relationship among data nodes;

and obtaining the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow map.

In an alternative, the executable instructions cause the processor to:

acquiring historical service data in a production system, and establishing a training data node table;

respectively collecting field characteristics according to the data node table, and acquiring a general field combination of each data node;

for any two data nodes, acquiring the optimal field combination of the two data nodes by applying the expansion rate and the retention rate to the general field combinations;

and determining the dependency relationship between the two data nodes under the optimal field combination according to the time field in the training data node table.

In an alternative, the executable instructions cause the processor to:

extracting preset amount of service data in the general field combination of any data node to match with the general field combination of another data node, and counting retention rate and expansion rate;

selecting the field combination with the highest retention rate;

when a plurality of field combinations with the highest retention rate exist, the field combinations with abnormal expansion rates are removed to form special field combinations;

splicing the special field combination with the general field combination of the other data node to form a new inter-table field combination;

and repeating iteration on the field combination between tables and the special field combination until no new special field combination is generated between the tables, and obtaining the optimal field combination.

In an alternative, the executable instructions cause the processor to:

acquiring historical service data in a production system, and establishing a temporary service data table;

selecting an association table of any data node from an association table set associated with an initial table of the initial data node according to the dependency relationship among the data nodes;

judging a basic association flow direction according to the time fields of the initial table and the association table;

judging the internal association flow direction of the association tables of any two data nodes in the association table set;

and obtaining the optimal flow direction according to the basic associated flow direction and the internal associated flow direction to form a data flow map.

In an alternative, the executable instructions cause the processor to:

if a first basic associated flow direction can be realized through a second basic associated flow direction together with an internal associated flow direction, retaining the second basic associated flow direction and the internal associated flow direction, and deleting the first basic associated flow direction;

traversing the basic associated flow directions and the internal associated flow directions, and taking those finally retained as the optimal flow directions;

and forming the data flow map according to the optimal flow directions among the data nodes.

In an optional manner, the data flow characteristics include a start node data volume and an end node data volume; the executable instructions cause the processor to:

comparing the node data volumes of the data nodes along each flow in the service data with the data flow map, calculating the amount of missing data, marking the positions of the data nodes concerned, and outputting a data missing magnitude list and a data missing detail list;

and outputting an abnormal data detection result according to the data missing magnitude list and the data missing detail list, and positioning an abnormal occurrence node of the abnormal data.

Embodiments of the present invention provide a computer program product comprising a computer program stored on a computer storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the data quality monitoring method of any of the above-mentioned method embodiments.

Fig. 7 is a schematic structural diagram of a computing device according to an embodiment of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the computing device.

As shown in fig. 7, the computing device may include: a processor 702, a communication interface 704, a memory 706, and a communication bus 708.

Wherein: the processor 702, the communication interface 704, and the memory 706 communicate with each other via the communication bus 708. The communication interface 704 is used for communicating with network elements of other devices, such as clients or other servers. The processor 702 is configured to execute the program 710, and may specifically execute the relevant steps in the above-described data quality monitoring method embodiments.

In particular, the program 710 may include program code that includes computer operating instructions.

The processor 702 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement an embodiment of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.

The memory 706 stores a program 710. The memory 706 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.

The program 710 may specifically be used to cause the processor 702 to perform the following operations:

acquiring a data flow map among data nodes;

acquiring real-time service data in a production system;

acquiring data flow characteristics of each data node in the service data according to the data flow map;

and comparing the data flow characteristics of the data nodes in the service data with the data flow map for consistency, outputting an abnormal data detection result and locating the node at which the abnormal data occurs.

In the optional manners, the program 710 causes the processor 702 to perform the same optional operations as those set out above for the executable instructions of the computer storage medium.

The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims

1. A method for monitoring data quality, the method comprising:

acquiring a data flow map among data nodes;

acquiring real-time service data in a production system;

2. The method of claim 1, wherein obtaining the dataflow graph between data nodes includes:

3. The method according to claim 2, wherein the performing data characteristic analysis on the historical service data to obtain the dependency relationship between the data nodes comprises:

4. The method according to claim 3, wherein the obtaining an optimal field combination for any two of the data nodes according to the general field combination application inflation rate and retention rate comprises:

combining fields with the highest retention rate;

5. The method according to claim 2, wherein the obtaining the optimal flow direction among the data nodes according to the dependency relationship among the data nodes to form a data flow graph comprises:

6. The method of claim 5, wherein obtaining an optimal flow direction from the base associative flow direction and the internal associative flow direction, forming a data flow graph, comprises:

7. The method of claim 1, wherein the data flow characteristics include a starting node data volume and an ending node data volume;

the consistency comparison of the data flow characteristics of the data nodes in the service data and the data flow graph is performed, abnormal data is output, and an abnormal occurrence node of the abnormal data is located, including:

8. A data quality monitoring apparatus, the apparatus comprising:

the flow acquiring unit is used for acquiring a data flow map among the data nodes;

the data acquisition unit is used for acquiring real-time service data in the production system;

the characteristic extraction unit is used for acquiring the data flow characteristics of each data node in the service data according to the data flow map;

and the anomaly detection unit is used for comparing the data flow characteristics of the data nodes in the service data with the data flow map for consistency, outputting an abnormal data detection result and locating the node at which the abnormal data occurs.

9. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is configured to store at least one executable instruction that causes the processor to perform the steps of the data quality monitoring method according to any one of claims 1-7.

10. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform the steps of the data quality monitoring method according to any one of claims 1-7.