CN110083639B - Method and device for intelligent data lineage tracing based on cluster analysis

Info

Publication number: CN110083639B
Application number: CN201910337129.4A
Authority: CN
Inventors: 王鹏; 陈昊; 于会游; 姜玉峰; 滕姿; 李栋; 杜浩; 饶定远; 唐丽娜; 靳翼; 闵圣捷; 陈丽婷; 童昊; 许亚洋
Original assignee: Cetc Kehuayun Information Technology Co ltd; Zhongdianke Jiaxing Novel Wisdom City Technology Development Co ltd
Current assignee: Cetc Kehuayun Information Technology Co ltd; Zhongdianke Jiaxing Novel Wisdom City Technology Development Co ltd
Priority date: 2019-04-25
Filing date: 2019-04-25
Publication date: 2023-03-10
Anticipated expiration: 2039-04-25
Also published as: CN110083639A

Abstract

A method for intelligently tracing data lineage based on cluster analysis comprises the following steps: step 1, reading the table structures and data, and deriving data features for each field by means of data engineering; step 2, taking each field as a sample and its set of data features as the features, learning the samples with a cluster analysis algorithm from machine learning; step 3, repeating the cluster analysis of step 2 until the optimal number of classes and the optimal classification are found; step 4, under the optimal classification, automatically judging data fields in the same class to be fields that may have a lineage relationship; step 5, for each lineage relationship, inferring its direction, that is, which field is the source and which is the target, from the order of the creation times of the tables to which the related fields belong, and marking the relationship as invalid if its fields come from the same table; and step 6, computing the table-level lineage relationships from the valid field-level lineage relationships.

Description

Method and device for intelligent data lineage tracing based on cluster analysis

Technical Field

The invention belongs to the technical field of big data, and in particular relates to a method and a device for intelligently tracing data lineage based on cluster analysis.

Background

With the development and spread of big data and machine learning technology, data analysis software handles ever larger volumes of data and depends ever more heavily on the format, content and quantity of that data, and almost all data analysis systems need to extract, clean, convert and desensitize data before running. The complexity of these services means that data processing involves many stages, long flows and complex methods. Judging the reliability of data, the impact of analyzed data and the source of erroneous data requires tracing back the lineage of the data. Establishing a data lineage chain in a big data context has therefore become an important issue for big data technology in supporting business applications and system maintenance.

Existing data lineage management techniques suffer from various drawbacks. Traditionally, data lineage is entered manually, which is inefficient, costly and error-prone. Moreover, because data is processed frequently, the lineage must be continuously updated by hand, so the accuracy and timeliness of maintenance are difficult to guarantee. As the amount of data grows, manual maintenance becomes unsustainable.

To address the shortcomings of manually maintained data lineage, automatic lineage analysis based on a data dictionary was developed, in which the matching relationships between fields in a database are analyzed according to the data dictionary. However, this technique requires a complete data dictionary to be constructed in advance. It suits traditional information management systems well, but with the development of new data analysis services it is difficult and too costly to construct a data dictionary for every intermediate result on a data analysis chain, so the method can only fit specific service scenarios and is poorly reusable. Moreover, even if the dictionary is complete, the technique can only analyze lineage for the data defined in the dictionary and is difficult to adapt broadly to varied application scenarios. Overall, the technique is difficult and costly to apply, scales and adapts poorly, and cannot meet the requirements of new data analysis tools.

There is also data management software, such as Atlas, that records data lineage by means of database plug-ins. This principle means that only data relationships within a single database can be recorded; relationships across databases, or even across database types (such as data conversion from MySQL to Oracle), cannot be tracked. Moreover, plug-in based methods require lineage to be recorded at the moment data is generated; if recording fails at that moment (for example, because the network is temporarily unreachable), the lineage cannot be recovered. The plug-in approach also affects the performance of software accessing the data, so some software with strict real-time requirements cannot use plug-ins. Finally, this method cannot handle lineage at the field level, so the granularity of lineage tracing is insufficient.

Disclosure of Invention

The invention provides a method and a device for intelligently tracing data lineage based on cluster analysis, to overcome the shortcomings of prior-art data tracing methods.

In one embodiment of the present invention, a method for intelligently tracing data lineage based on cluster analysis includes the following steps:

step 1, reading the table structures and data, and deriving data features for each field by means of data engineering;

step 2, taking each field as a sample and its set of data features as the features, learning the samples with a cluster analysis algorithm from machine learning;

step 3, repeating the cluster analysis of step 2 until the optimal number of classes and the optimal classification are found;

step 4, under the optimal classification, automatically judging data fields in the same class to be fields that may have a lineage relationship;

step 5, for each lineage relationship, inferring its direction, that is, which field is the source and which is the target, from the order of the creation times of the tables to which the related fields belong, and marking the relationship as invalid if its fields come from the same table;

step 6, computing the table-level lineage relationships from the valid field-level lineage relationships;

step 7, correcting the lineage relationships by checking the data of each table in the relationship chain, and adjusting the criterion used to search for the optimal classification.

The beneficial effects of the invention include:

1. The invention achieves fully automatic analysis, establishment and maintenance of data lineage for complex processing flows over massive data.

2. The lineage analysis is based on a machine learning algorithm, does not depend on the system defining the data in advance, and can adapt to various data types and services. Because the lineage analysis does not depend on the data processing flow, lineage can be created while data is being processed, and historical data can also be analyzed to create lineage.

3. The invention supports data tracing at both field level and table level. Moreover, because the machine learning algorithm used is unsupervised, lineage analysis can be carried out without labeled sample data, and the accuracy and efficiency of the analysis can be improved through manual intervention after the system has produced its classification.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

Fig. 1 is a schematic flow chart of a method for intelligently tracing data lineage based on cluster analysis according to one embodiment of the present invention.

Detailed Description

According to one or more embodiments, as shown in Fig. 1, a method for intelligently tracing data lineage based on cluster analysis includes the following steps:

Step 1: reading the table structures and data, and deriving data features for each field by means of data engineering, specifically as follows:

Step 1.1: parsing the data characteristics of the raw data into structured sample data, including field type, field length, field content pattern and the like;

Step 1.2: combining existing features in the sample data to form high-dimensional features;

Step 1.3: analyzing the high-dimensional features to form new dimensions, and ranking the new dimensions by influence;

Step 1.4: reducing the dimensionality of the sample data according to the new dimensions, using the fewest dimensions that keep the distortion rate of the sample data below a set value;

Step 1.5: normalizing the sample data in the new dimensions.
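As an illustration of steps 1.1 and 1.5 only, the following minimal Python sketch profiles one column of raw values into numeric features and then min-max normalizes the features across fields. The specific features (`mean_length`, `max_length`, `digit_ratio`, `distinct_patterns`), the pattern encoding and the choice of min-max scaling are assumptions made for the example; the text does not prescribe them. The feature combination and dimension reduction of steps 1.2 to 1.4 are deliberately left out.

```python
import re
from statistics import mean


def content_pattern(value: str) -> str:
    """Collapse a value into a coarse content pattern (hypothetical encoding)."""
    pattern = re.sub(r"[0-9]", "9", value)       # every digit becomes 9
    pattern = re.sub(r"[A-Za-z]", "A", pattern)  # every letter becomes A
    return re.sub(r"(.)\1+", r"\1", pattern)     # collapse repeated symbols


def profile_field(values: list[str]) -> dict[str, float]:
    """Step 1.1 (sketch): turn one field's raw values into numeric features."""
    non_empty = [v for v in values if v != ""]
    lengths = [len(v) for v in non_empty] or [0]
    digit_chars = sum(ch.isdigit() for v in non_empty for ch in v)
    total_chars = sum(lengths) or 1
    return {
        "mean_length": float(mean(lengths)),
        "max_length": float(max(lengths)),
        "digit_ratio": digit_chars / total_chars,
        "distinct_patterns": float(len({content_pattern(v) for v in non_empty})),
    }


def min_max_normalize(rows: list[dict[str, float]]) -> list[list[float]]:
    """Step 1.5 (sketch): scale each feature to [0, 1] across all fields."""
    keys = sorted(rows[0])  # fixed feature order
    lows = {k: min(r[k] for r in rows) for k in keys}
    highs = {k: max(r[k] for r in rows) for k in keys}
    return [
        [
            (r[k] - lows[k]) / (highs[k] - lows[k]) if highs[k] > lows[k] else 0.0
            for k in keys
        ]
        for r in rows
    ]
```

Each field yields one feature vector, so a date-like column and a name-like column end up far apart after normalization, which is what the later clustering step relies on.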

Step 2: taking each field as a sample and its set of data features as the features, learning the samples with a cluster analysis algorithm from machine learning;
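The text does not name a particular clustering algorithm. As one possible instance, the sketch below uses plain k-means, with the sum of squared distances to the cluster centers serving as the loss value. The function names, the deterministic farthest-point initialization and the iteration cap are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def k_means(points, k, iterations=50):
    """Cluster field feature vectors into k classes; return (labels, loss)."""
    # Assumption: deterministic farthest-point initialization for reproducibility.
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centers)
        )
        centers.append(list(farthest))

    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers

    labels = [nearest(p, centers) for p in points]
    loss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, loss
```

With this choice of loss, a single class gives the largest loss and the loss shrinks as the number of classes grows, which matches the behavior the search in step 3 depends on.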

Step 3: repeating the cluster analysis of step 2 multiple times to find the optimal number of classes and the optimal classification, specifically as follows:

Step 3.1: setting the number of classes to M (M is initially 1, that is, all data belong to one class) and executing step 2 to obtain the corresponding loss value, which is the maximum loss value of the system;

Step 3.2: setting the number of classes to N (N is initially the number of data tables minus one, that is, apart from the two most similar tables, every remaining table forms a class of its own) and executing step 2 to obtain the corresponding minimum loss value;

Step 3.3: setting the number of classes to the arithmetic mean (M + N)/2 of the numbers of classes used in step 3.1 and step 3.2, and executing step 2 to obtain a loss value T;

Step 3.4: if the loss value T is greater than the target loss value, setting M = (M + N)/2 and repeating steps 3.1 to 3.3;

Step 3.5: if the loss value T is less than the target loss value, setting N = (M + N)/2 and repeating steps 3.1 to 3.3;

Step 3.6: if the loss value T is approximately equal to the target loss value, recording the current number of classes as the optimal number of classes and the current classification as the optimal classification;
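The search described in step 3 is a bisection over the number of classes. A minimal sketch follows, assuming the loss never increases as the number of classes grows. The callable `loss_for` (which would run step 2 for a given number of classes), the `tolerance` used for "approximately equal" and the fallback rule applied when the interval closes without a match are hypothetical details not fixed by the text.

```python
from typing import Callable


def find_class_count(
    loss_for: Callable[[int], float],
    upper: int,
    target_loss: float,
    tolerance: float = 1e-6,
) -> int:
    """Bisect between M = 1 and N = upper for a class count whose loss hits the target."""
    low, high = 1, upper  # M and N in the text
    while high - low > 1:
        mid = (low + high) // 2  # integer version of (M + N) / 2
        loss = loss_for(mid)
        if abs(loss - target_loss) <= tolerance:
            return mid  # loss approximately equals the target
        if loss > target_loss:
            low = mid  # loss too high: move M up, try more classes
        else:
            high = mid  # loss below target: move N down, try fewer classes
    # Assumption: with no exact hit, take the smallest count that meets the target.
    return low if loss_for(low) <= target_loss else high
```

Because each round halves the interval, the number of clustering runs grows only logarithmically with the initial upper bound N.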

Step 4: under the optimal classification, data fields in the same class are automatically judged to be fields that may have a lineage relationship;

Step 5: for each lineage relationship, the direction of the relationship, that is, which field is the source and which is the target, is inferred from the order of the creation times of the tables to which the related fields belong. If the fields of a relationship come from the same table, the relationship is marked as invalid.
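Steps 4 and 5 can be sketched as follows: fields that share a class label are paired, pairs within one table are set aside as invalid, and the remaining pairs are oriented from the earlier-created table to the later-created one. The `Field` type, the input dictionaries and the handling of equal creation times (the first field of the pair is treated as the source) are assumptions for the example.

```python
from datetime import datetime
from itertools import combinations
from typing import NamedTuple


class Field(NamedTuple):
    table: str
    name: str


def field_lineage(labels: dict[Field, int], created: dict[str, datetime]):
    """Return (valid, invalid) lists of field pairs; valid pairs are (source, target)."""
    # Step 4: group fields by the class assigned during clustering.
    classes: dict[int, list[Field]] = {}
    for field, label in labels.items():
        classes.setdefault(label, []).append(field)

    valid, invalid = [], []
    for members in classes.values():
        for a, b in combinations(members, 2):
            if a.table == b.table:
                invalid.append((a, b))  # same table: marked invalid
                continue
            # Step 5: the field of the earlier-created table is the source.
            if created[a.table] <= created[b.table]:
                valid.append((a, b))
            else:
                valid.append((b, a))
    return valid, invalid
```

Only the table creation times are needed to orient a pair, so no record of the actual processing flow is required at this stage.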

Step 6: computing the table-level lineage relationships from the valid field-level lineage relationships, specifically as follows:

Step 6.1: for a valid field-level lineage relationship, if the two tables involved do not yet have any direct or indirect lineage relationship, recording that the two tables have a direct lineage relationship;

Step 6.2: recording that all tables having a direct or indirect lineage relationship with either of the two tables have an indirect lineage relationship with each other;

Step 6.3: repeating step 6.1 and step 6.2 until all field-level lineage relationships have been processed;
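One way to read step 6 is as an incremental transitive closure over tables. The sketch below processes the valid field-level pairs in order: a pair adds a direct table relationship only if its two tables are not yet related, after which every upstream table of the source is recorded as indirectly related to every downstream table of the target. Treating the table relationships as directed (source table to target table) is an assumption; the text does not say whether table-level relationships carry a direction.

```python
def table_lineage(field_pairs):
    """Derive (direct, indirect) table relationships from valid field pairs.

    field_pairs: iterable of ((src_table, src_field), (dst_table, dst_field)).
    """
    direct, indirect = set(), set()
    for (src, _), (dst, _) in field_pairs:
        related = direct | indirect
        if (src, dst) in related:
            continue  # step 6.1: tables already related, nothing new to record
        direct.add((src, dst))
        # Step 6.2: tables already upstream of src and downstream of dst
        # become indirectly related to each other.
        upstream = {a for a, b in related if b == src} | {src}
        downstream = {b for a, b in related if a == dst} | {dst}
        for a in upstream:
            for b in downstream:
                if a != b and (a, b) not in direct:
                    indirect.add((a, b))
    return direct, indirect
```

Under this reading the result depends on processing order: a pair between two tables that are already indirectly related stays indirect rather than being promoted to direct, which follows the condition stated in step 6.1.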

Step 7: for the table-level lineage relationships inferred by the algorithm, the data of each table in the relationship chain may be checked manually, and the final lineage relationships may be corrected manually where necessary;

Step 8: according to the manually corrected lineage relationships, adjusting the criterion used to search for the optimal number of classes, specifically as follows:

Step 8.1: where an inferred lineage relationship has been manually deleted, that is, it has been manually confirmed that no lineage relationship exists between the two tables, appropriately increasing the target loss value used to search for the optimal number of classes;

Step 8.2: where a new lineage relationship has been manually created, that is, it has been manually confirmed that a lineage relationship exists between the two tables, appropriately reducing the target loss value.
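The feedback rule of step 8 can be sketched as a small update to the target loss value. The text says only that the value is "appropriately" increased or reduced, so the multiplicative update and the 5% `step` below are purely illustrative assumptions.

```python
def adjust_target_loss(
    target_loss: float, deleted: int, added: int, step: float = 0.05
) -> float:
    """Update the target loss after manual review (hypothetical update rule).

    deleted: number of inferred relationships removed by a reviewer (step 8.1)
    added:   number of relationships created by a reviewer (step 8.2)
    """
    factor = (1 + step) ** deleted * (1 - step) ** added
    return target_loss * factor
```

The adjusted value would then be passed to the next search for the optimal number of classes in step 3.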

According to one or more embodiments, an apparatus for intelligently tracing data lineage based on cluster analysis comprises a memory and a processor coupled to the memory, the processor being configured to execute instructions stored in the memory so as to perform the steps of the method described above.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for intelligently tracing data lineage based on cluster analysis, characterized by comprising the following steps:

step 1, reading the table structures and data, and deriving data features for each field by means of data engineering;

step 2, taking each field as a sample and its set of data features as the features, learning the samples with a cluster analysis algorithm from machine learning;

step 3, repeating the cluster analysis of step 2 until the optimal number of classes and the optimal classification are found;

step 4, under the optimal classification, automatically judging data fields in the same class to be fields that may have a lineage relationship;

step 5, for each lineage relationship, inferring its direction, that is, which field is the source and which is the target, from the order of the creation times of the tables to which the related fields belong,

if the fields of the relationship come from the same table, marking the relationship as invalid;

step 6, computing the table-level lineage relationships from the valid field-level lineage relationships,

the step 1 further comprises:

step 1.1, parsing the data characteristics of the raw data into structured sample data, wherein the structured sample data comprises a field type, a field length and a field content pattern;

step 1.2, combining existing features in the sample data to form high-dimensional features;

step 1.3, analyzing the high-dimensional features to form new dimensions, and ranking the new dimensions by influence;

step 1.4, reducing the dimensionality of the sample data according to the new dimensions, using the fewest dimensions that keep the distortion rate of the sample data below a set value;

step 1.5, normalizing the sample data in the new dimensions,

the step 3 further comprises:

step 3.1, setting the number of classes to M, wherein M is initially 1, that is, all data belong to one class, and executing step 2 to obtain the corresponding loss value, which is the maximum loss value of the system;

step 3.2, setting the number of classes to N, wherein N is initially the number of data tables minus one, that is, apart from the two most similar tables, every remaining table forms a class of its own, and executing step 2 to obtain the corresponding minimum loss value;

step 3.3, setting the number of classes to the arithmetic mean (M + N)/2 of the numbers of classes used in step 3.1 and step 3.2, and executing step 2 to obtain a loss value T;

step 3.4, if the loss value T is greater than the target loss value, setting M = (M + N)/2 and repeating steps 3.1 to 3.3;

step 3.5, if the loss value T is less than the target loss value, setting N = (M + N)/2 and repeating steps 3.1 to 3.3;

step 3.6, if the loss value T is equal to the target loss value, recording the current number of classes as the optimal number of classes and the current classification as the optimal classification,

the step 6 further comprises:

step 6.1, for a valid field-level lineage relationship, if the two tables involved do not yet have any direct or indirect lineage relationship, recording that the two tables have a direct lineage relationship;

step 6.2, recording that all tables having a direct or indirect lineage relationship with either of the two tables have an indirect lineage relationship with each other;

step 6.3, repeating step 6.1 and step 6.2 until all field-level lineage relationships have been processed.

2. The method for intelligently tracing data lineage based on cluster analysis as claimed in claim 1, further comprising:

correcting the lineage relationships by checking the data of each table in the relationship chain, and adjusting the criterion used to search for the optimal number of classes.