CN110083639A

CN110083639A - Method and device for intelligent data lineage tracing based on cluster analysis

Info

Publication number: CN110083639A
Application number: CN201910337129.4A
Authority: CN
Inventors: 王鹏; 陈昊; 于会游; 姜玉峰; 滕姿; 李栋; 杜浩; 饶定远; 唐丽娜; 靳翼; 闵圣捷; 陈丽婷; 童昊; 许亚洋
Original assignee: CLP SECTION HUAYUN INFORMATION TECHNOLOGY Co Ltd; Clp Jiaxing New Intelligent City Science And Technology Development Co Ltd
Current assignee: CLP SECTION HUAYUN INFORMATION TECHNOLOGY Co Ltd; Clp Jiaxing New Intelligent City Science And Technology Development Co Ltd
Priority date: 2019-04-25
Filing date: 2019-04-25
Publication date: 2019-08-02
Anticipated expiration: 2039-04-25
Also published as: CN110083639B

Abstract

A method for intelligent data lineage tracing based on cluster analysis, comprising the following steps: Step 1, read the table structures and data, and derive the data features of each field by data engineering means; Step 2, taking the field as the unit and the field data feature set as the features, learn from the data samples using a cluster analysis algorithm from machine learning; Step 3, repeat the cluster analysis of Step 2 until the optimal number of classes and the optimal classification are found; Step 4, under the optimal classification, automatically identify the data fields that fall into the same class as fields that may have a lineage relationship; Step 5, for each lineage relationship, infer its direction, that is, which field is the source and which is the target, from the order of the creation times of the tables the relationship points to, and mark the relationship as invalid if its fields come from the same table; Step 6, compute the table-level lineage relationships from the valid field-level lineage relationships.

Description

Method and device for intelligent data lineage tracing based on cluster analysis

Technical field

The invention belongs to the technical field of big data, and in particular relates to a method and device for intelligent data lineage tracing based on cluster analysis.

Background art

With the development and spread of big data and machine learning technologies, data analysis software uses, manages, and generates ever larger volumes of data and depends ever more heavily on the format, content, and quantity of that data. Almost all data must undergo extraction, cleaning, conversion, desensitization, and similar operations before the analysis system runs. The complexity of these operations means that data processing involves many steps, long pipelines, and complicated methods. Only by tracing back the data lineage is it possible to judge the credibility of the data, analyze its influence, and analyze and handle the sources of erroneous data. Establishing data lineage chains has therefore become an important problem for big data technology in supporting business applications and system maintenance.

Existing data lineage management techniques have various shortcomings. Traditionally, data lineage is entered manually, which is inefficient, costly, and error-prone. Moreover, because data is processed frequently, the lineage must be updated by hand continually, so neither the accuracy nor the timeliness of maintenance can be guaranteed. As data volumes grow, manual maintenance becomes unsustainable.

To address the shortcomings of manual maintenance, automatic lineage analysis techniques based on a data dictionary were developed; they analyze the matching relationships between database fields according to the data dictionary. However, this technique requires a complete data dictionary to be built in advance. It suits traditional information management systems well, but with the development of new kinds of data analysis business, building a data dictionary for every intermediate result along the analysis chain has become very difficult and too costly, so the method fits only specific business scenarios and is poorly reusable. Even when the dictionary is very complete, the technique can analyze lineage only for the data defined in the dictionary and adapts poorly to a wide range of application scenarios. Overall, the technique is difficult, costly, and poor in scalability and adaptability, and falls short of the needs of new data analysis tools.

There are also data management tools, such as Atlas, that record data lineage by means of database plug-ins. By design they can record only lineage within a single data store; they cannot track lineage across databases, let alone across database types (for example, a data conversion from MySQL to Oracle). In addition, plug-in based methods require lineage to be recorded at the moment the data is generated: if recording fails at that moment (for example, because the network is temporarily unreachable), the lineage cannot be recovered afterwards. Plug-ins also affect the performance of the software's data access, so some software with strict real-time requirements cannot use them. Finally, this approach cannot handle field-level lineage, so the granularity of tracing is too coarse.

Summary of the invention

The present invention provides a method and device for intelligent data lineage tracing based on cluster analysis, to overcome the drawbacks of the data tracing methods in the prior art.

In one embodiment of the present invention, a method for intelligent data lineage tracing based on cluster analysis comprises the following steps:

Step 1, read the table structures and data, and derive the data features of each field by data engineering means;

Step 2, taking the field as the unit and the field data feature set as the features, learn from the data samples using a cluster analysis algorithm from machine learning;

Step 3, repeat the cluster analysis of Step 2 until the optimal number of classes and the optimal classification are found;

Step 4, under the optimal classification, automatically identify the data fields that fall into the same class as fields that may have a lineage relationship;

Step 5, for each lineage relationship, infer its direction, that is, which field is the source and which is the target, from the order of the creation times of the tables the relationship points to; if the fields of the relationship come from the same table, mark the relationship as invalid;

Step 6, compute the table-level lineage relationships from the valid field-level lineage relationships;

Step 7, inspect the data of each table in the relationship chain, correct the lineage relationships, and adjust the criterion used to find the optimal number of classes.

The beneficial effects of the present invention include:

1. Fully automated analysis, establishment, and maintenance of data lineage for complex processing pipelines over massive data.

2. The lineage analysis of the invention is based on a machine learning algorithm and does not depend on the data being defined in advance, so it can adapt to various data types and businesses. The lineage analysis of this scheme also does not depend on the data processing flow: lineage can be created while the data is being processed, or created afterwards by analyzing historical data.

3. The invention supports both field-level and table-level data tracing. Moreover, because the machine learning algorithm used is an unsupervised learning algorithm, lineage analysis can be carried out without labeled sample data, and after the system completes a classification, manual intervention can improve the accuracy and efficiency of the lineage analysis algorithm.

Brief description of the drawings

The above and other objects, features, and advantages of the exemplary embodiments of the present invention will become easier to understand by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the invention are shown by way of example and not limitation, in which:

Fig. 1 is a schematic flowchart of a method for intelligent data lineage tracing based on cluster analysis according to one embodiment of the present invention.

Detailed description of the embodiments

According to one or more embodiments, as shown in Fig. 1, a method for intelligent data lineage tracing based on cluster analysis comprises the following steps:

Step 1: Read the table structures and data, and derive the data features of each field by data engineering means, as follows:

Step 1.1: Parse the data characteristics of the raw data into structured sample data, including field type, field length, field content pattern, and so on;

Step 1.2: Combine the features already present in the sample data to form high-dimensional features;

Step 1.3: Analyze the high-dimensional features, form new dimensions, and rank the new dimensions by their influence;

Step 1.4: Reduce the dimensionality of the sample data according to the new dimensions, using the smallest number of dimensions that keeps the distortion rate of the sample data below a set value;

Step 1.5: Normalize the sample data in the new dimensions.
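As an illustration of Steps 1.1, 1.4, and 1.5, the following minimal sketch derives a few features per field, picks the smallest number of ranked dimensions under a distortion limit, and normalizes the result. The patent does not name concrete features, a dimensionality reduction technique, or a normalization scheme; the function names, the four features, the reading of "distortion rate" as the share of discarded variance, and min-max scaling are all assumptions made for this example.

```python
import re


def field_features(values, is_numeric_type):
    """Return [type flag, mean length, digit-only share, letters-only share].

    The choice of features is hypothetical; the text only lists field type,
    field length, and field content pattern as examples.
    """
    texts = [str(v) for v in values if v is not None]
    count = len(texts) or 1  # avoid division by zero for empty fields
    return [
        1.0 if is_numeric_type else 0.0,
        sum(len(t) for t in texts) / count,
        sum(t.isdigit() for t in texts) / count,
        sum(bool(re.fullmatch(r"[A-Za-z ]+", t)) for t in texts) / count,
    ]


def smallest_dimension_count(variances, max_distortion):
    """Smallest number of ranked dimensions whose discarded variance share
    stays below max_distortion (an assumed meaning of 'distortion rate').
    Assumes at least one variance is positive."""
    ranked = sorted(variances, reverse=True)
    total = sum(ranked)
    kept = 0.0
    for count, variance in enumerate(ranked, start=1):
        kept += variance
        if 1 - kept / total < max_distortion:
            return count
    return len(ranked)


def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

The per-dimension variances passed to `smallest_dimension_count` would come from whatever analysis forms and ranks the new dimensions in Step 1.3; that analysis is not sketched here.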

Step 2: Taking the field as the unit and the field data feature set as the features, learn from the data samples using a cluster analysis algorithm from machine learning;
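The text does not say which clustering algorithm is used. As one possible instance, the sketch below runs plain k-means over the per-field feature vectors and returns both the class label of each field and a loss value (the sum of squared distances to the assigned centroid). The choice of k-means, the loss definition, and the initialization from the first k points are assumptions for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Cluster points into k classes; return (labels, loss).

    Centroids start at the first k points, which keeps the sketch
    deterministic but is not a robust initialization.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, centroids already match them
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    loss = sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))
    return labels, loss
```

Each point here stands for one field; the loss returned is the quantity that the search over the number of classes would compare against a target value.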

Step 3: Repeat the cluster analysis computation of Step 2 several times to find the optimal number of classes and the optimal classification, as follows:

Step 3.1: Set the number of classes to M (M is initially 1, that is, all data belong to a single class), execute Step 2, and obtain the corresponding loss value, which is the maximum loss value of the system;

Step 3.2: Set the number of classes to N (N is initially the number of data tables minus one, that is, apart from the two most similar tables, every remaining table forms a class of its own), execute Step 2, and obtain the corresponding minimum loss value;

Step 3.3: Set the number of classes to the arithmetic mean (M+N)/2 of the class counts used in Step 3.1 and Step 3.2, execute Step 2, and obtain the loss value T;

Step 3.4: If the loss value T is greater than the target loss value, set M = (M+N)/2 and repeat Steps 3.1 to 3.3;

Step 3.5: If the loss value T is less than the target loss value, set N = (M+N)/2 and repeat Steps 3.1 to 3.3;

Step 3.6: If the loss value T is approximately equal to the target loss value, record this number of classes as the current optimal number of classes and record this classification as the current optimal classification;
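Step 3 amounts to a bisection over the number of classes. The sketch below shows that search under stated assumptions: `loss_for` is a hypothetical callable that runs the clustering for a given class count and returns its loss, the loss is assumed not to increase as the class count grows, and `tolerance` is a hypothetical parameter standing in for "approximately equal". Unlike the text, which sets the bound to (M+N)/2 itself, the sketch moves the bound one past the midpoint so that the loop is guaranteed to terminate on integers.

```python
def find_class_count(loss_for, n_max, target_loss, tolerance):
    """Bisect the class count until the loss is close to target_loss.

    low and high play the roles of M and N in the text.
    Returns (class count, loss) of the closest candidate seen.
    """
    low, high = 1, n_max
    best_k, best_loss = low, loss_for(low)
    while low <= high:
        mid = (low + high) // 2
        loss = loss_for(mid)
        if abs(loss - target_loss) < abs(best_loss - target_loss):
            best_k, best_loss = mid, loss
        if abs(loss - target_loss) <= tolerance:
            break  # approximately equal to the target loss
        if loss > target_loss:
            low = mid + 1   # too coarse: more classes are needed
        else:
            high = mid - 1  # too fine: fewer classes are enough
    return best_k, best_loss
```

In a full pipeline `loss_for` would wrap the clustering of Step 2; here it can be any function of the class count, which keeps the search logic testable on its own.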

Step 4: Under the optimal classification, automatically identify the data fields that fall into the same class as fields that may have a lineage relationship;
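A minimal sketch of Step 4, assuming the optimal classification is available as a mapping from a field, written as a hypothetical `(table, field)` tuple, to its class id: every pair of fields that share a class becomes a candidate lineage relationship.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(labels):
    """labels maps (table, field) -> class id; return all same-class pairs."""
    classes = defaultdict(list)
    for field, class_id in labels.items():
        classes[class_id].append(field)
    pairs = []
    for members in classes.values():
        # Every unordered pair inside one class is a candidate relationship.
        pairs.extend(combinations(sorted(members), 2))
    return sorted(pairs)
```

The pairs are still undirected at this point, and pairs from the same table have not yet been filtered out.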

Step 5: For each lineage relationship, infer its direction, that is, which field is the source and which is the target, from the order of the creation times of the tables the relationship points to. If the fields of the relationship come from the same table, mark the relationship as invalid.
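The direction rule of Step 5 can be sketched as follows, assuming each field is a hypothetical `(table, field)` tuple and `created_at` maps table names to creation times. The text does not say what happens when two tables share a creation time; keeping the given order in that case is an assumption of this example.

```python
from datetime import datetime


def orient_relationship(pair, created_at):
    """Return (source_field, target_field), or None for an invalid pair."""
    first, second = pair
    if first[0] == second[0]:
        return None  # both fields come from the same table: invalid
    # The field of the earlier-created table is taken as the source.
    if created_at[first[0]] <= created_at[second[0]]:
        return (first, second)
    return (second, first)
```

For example, if `users` was created before `orders`, the pair `(orders.user_id, users.id)` is oriented as `users.id` to `orders.user_id`.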

Step 6: Compute the table-level lineage relationships from the valid field-level lineage relationships, as follows:

Step 6.1: For a valid field-level lineage relationship, if no direct or indirect lineage relationship yet exists between the two tables concerned, record a direct lineage relationship between the two tables;

Step 6.2: All tables that have a direct or indirect lineage relationship with these two tables have an indirect lineage relationship with one another;

Step 6.3: Repeat Step 6.1 and Step 6.2 until all field-level lineage relationships have been processed;
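One way to read Step 6 is as an incremental transitive closure over tables. The sketch below follows that reading: each valid, already oriented field-level relationship contributes a direct table edge unless the two tables are already related, and every table upstream of the source is then linked indirectly to every table downstream of the target. The data layout (`((table, field), (table, field))` tuples) and the upstream/downstream interpretation of the indirect rule are assumptions.

```python
def table_lineage(field_relations):
    """field_relations: iterable of ((src_table, src_field), (dst_table, dst_field)).

    Returns (direct, indirect) as sets of (source_table, target_table).
    """
    direct, indirect = set(), set()
    for (src_table, _), (dst_table, _) in field_relations:
        pair = (src_table, dst_table)
        if src_table == dst_table or pair in direct or pair in indirect:
            continue  # invalid, or the two tables are already related
        direct.add(pair)
        known = direct | indirect
        upstream = {a for a, b in known if b == src_table} | {src_table}
        downstream = {b for a, b in known if a == dst_table} | {dst_table}
        for a in upstream:
            for b in downstream:
                if a != b and (a, b) not in direct:
                    indirect.add((a, b))
    return direct, indirect
```

Because a direct edge is recorded only when no relationship exists yet, the result depends on the order in which field-level relationships are processed, as it does in the step as written.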

Step 7: For the table-level lineage relationships inferred by the algorithm, allow manual inspection of the data of each table in the relationship chain; where necessary, the final lineage relationships can be corrected manually;

Step 8: For lineage relationships that have been manually corrected, adjust the criterion for finding the optimal number of classes according to the kind of correction, as follows:

Step 8.1: Where an inferred lineage relationship was deleted manually, that is, it was manually confirmed that the two tables have no lineage relationship, increase the target loss value used to find the optimal number of classes by an appropriate amount;

Step 8.2: Where a new lineage relationship was created manually, that is, it was manually confirmed that the two tables have a lineage relationship, decrease the target loss value by an appropriate amount.
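The feedback rule of Step 8 can be sketched as a simple update of the target loss value. The text says only that the value is raised or lowered "appropriately"; the multiplicative update and the `rate` parameter below are hypothetical choices for illustration.

```python
def adjust_target_loss(target_loss, deleted, created, rate=0.05):
    """Raise the target loss once per manually deleted relationship and
    lower it once per manually created relationship.

    deleted / created are counts of manual corrections; rate is an
    assumed step size, not a value given in the text.
    """
    return target_loss * (1 + rate) ** deleted * (1 - rate) ** created
```

The adjusted value would then serve as the target loss the next time the optimal number of classes is searched for.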

According to one or more embodiments, a device for intelligent data lineage tracing based on cluster analysis comprises a memory and a processor coupled to the memory, the processor being configured to execute instructions stored in the memory, whereby the processor performs the following operations:

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. The device embodiments described above are merely illustrative: for example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any person skilled in the art can readily conceive of equivalent modifications or replacements within the technical scope disclosed by the present invention, and these shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A method for intelligent data lineage tracing based on cluster analysis, characterized by comprising the following steps:

Step 2, taking the field as the unit and the field data feature set as the features, learn from the data samples using a cluster analysis algorithm from machine learning;

Step 4, under the optimal classification, automatically identify the data fields that fall into the same class as fields that may have a lineage relationship;

Step 5, for each lineage relationship, infer its direction, that is, which field is the source and which is the target, from the order of the creation times of the tables the relationship points to; if the fields of the relationship come from the same table, mark the relationship as invalid;

2. The method for intelligent data lineage tracing based on cluster analysis according to claim 1, characterized by further comprising:

inspecting the data of each table in the relationship chain, correcting the lineage relationships, and adjusting the criterion used to find the optimal number of classes.

3. The method for intelligent data lineage tracing based on cluster analysis according to claim 1, characterized in that Step 1 further comprises:

Step 1.1, parse the data characteristics of the raw data into structured sample data, including field type, field length, and field content pattern;

Step 1.2, combine the features already present in the sample data to form high-dimensional features;

Step 1.3, analyze the high-dimensional features, form new dimensions, and rank the new dimensions by their influence;

Step 1.4, reduce the dimensionality of the sample data according to the new dimensions, using the smallest number of dimensions that keeps the distortion rate of the sample data below a set value;

Step 1.5, normalize the sample data in the new dimensions.

4. The method for intelligent data lineage tracing based on cluster analysis according to claim 3, characterized in that Step 3 further comprises:

Step 3.1, set the number of classes to M, execute Step 2, and obtain the corresponding loss value, which is the maximum loss value of the system;

Step 3.2, set the number of classes to N, execute Step 2, and obtain the corresponding minimum loss value;

Step 3.3, set the number of classes to the arithmetic mean (M+N)/2 of the class counts used in Step 3.1 and Step 3.2, execute Step 2, and obtain the loss value T;

Step 3.4, if the loss value T is greater than the target loss value, set M = (M+N)/2 and repeat Steps 3.1 to 3.3;

Step 3.5, if the loss value T is less than the target loss value, set N = (M+N)/2 and repeat Steps 3.1 to 3.3;

Step 3.6, if the loss value T is approximately equal to the target loss value, record this number of classes as the current optimal number of classes and record this classification as the current optimal classification.

5. The method for intelligent data lineage tracing based on cluster analysis according to claim 4, characterized in that Step 6 further comprises:

Step 6.1, for a valid field-level lineage relationship, if no direct or indirect lineage relationship yet exists between the two tables concerned, record a direct lineage relationship between the two tables;

Step 6.2, all tables that have a direct or indirect lineage relationship with these two tables have an indirect lineage relationship with one another;

Step 6.3, repeat Step 6.1 and Step 6.2 until all field-level lineage relationships have been processed.

6. A device for intelligent data lineage tracing based on cluster analysis, characterized in that the device comprises a memory; and

a processor coupled to the memory, the processor being configured to execute instructions stored in the memory, whereby the processor performs the following operations: