CN108256113A

CN108256113A - Method and device for mining data lineage relationships

Info

Publication number: CN108256113A
Application number: CN201810135449.7A
Authority: CN
Inventors: 杨宇; 吴洋; 王小冬; 兰杰; 朱兴
Original assignee: Koubei Shanghai Information Technology Co Ltd
Current assignee: Koubei Shanghai Information Technology Co Ltd
Priority date: 2018-02-09
Filing date: 2018-02-09
Publication date: 2018-07-06
Anticipated expiration: 2038-02-09
Also published as: CN108256113B

Abstract

The invention discloses a method and device for mining data lineage relationships. The method includes: collecting statistics on the data distribution information of each column field in multiple data tables; comparing the data distribution information of a column field in one data table with the data distribution information of one or more column fields in other data tables; and determining, according to the comparison result, whether the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables. By comparing the distribution information of the data in each field across multiple data tables, the invention determines the data lineage relationships between the fields of different data tables. It is convenient and inexpensive to implement, depends on few intermediate factors, and is applicable to data lineage mining across many types of databases.

Description

Method and device for mining data lineage relationships

Technical field

The present invention relates to the field of software, and in particular to a method and device for mining data lineage relationships.

Background

Data lineage describes, for a single invocation of a business interface, which systems' databases, files or other storage the data involved in that business is written to; a data lineage relationship exists between those storage locations. For example, in actual production the order system records the amount field of an order, and the transaction system also stores an amount field; the amount fields in these two systems have a data lineage relationship. Specifically, the amount field may be stored in the following three systems: table A of the order system records the order amount a; after the discount decision system has made its decision, table B of the transaction system records the payment amount b; and when the funds are actually paid, table C of the payment system records the total amount c received by the payee account. When the systems run normally, the data in the three fields a, b and c should be consistent for the same order. During a system failure, however (for example a decision error in the discount decision system, or an exception in the payment-method consultation logic of the payment system), part of the data in fields a, b and c may become inconsistent. In that case, online verification rules must be run to check, on the basis of the data lineage relationship, the consistency of the fields that should agree. Data lineage relationships also ensure that corrected data remains consistent when data is revised, avoiding failures such as abnormal bills and reports caused by inconsistent data.

In the prior art, data lineage relationships are mainly obtained by having an architect trace them manually or from documents retained at system design time. This requires substantial labor, and maintaining the data lineage relationships later is also costly. Alternatively, the prior art uses abstract syntax tree techniques to statically scan system code, such as Java project code, and traces the data links according to the code logic to obtain the data lineage relationships. This approach, however, depends on the system code, is affected by the differing coding styles of different code projects, and is further affected by multiple intermediate factors such as the code project structure and the ORM framework. It is therefore subject to too many constraints, cannot be applied generally to arbitrary system code, and is difficult to implement.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method and device for mining data lineage relationships that overcome the above problems or at least partially solve them.

According to one aspect of the invention, a method for mining data lineage relationships is provided, including:

Collecting statistics on the data distribution information of each column field in multiple data tables;

Comparing the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables;

Determining, according to the comparison result, whether the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.

Optionally, collecting statistics on the data distribution information of each column field in the multiple data tables further comprises:

Calculating, from the data distribution information of each column field in the multiple data tables, a first distribution probability density of the data in each column field;

Comparing the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables further comprises:

Calculating the similarity between the first distribution probability density of the data of the column field in the one data table and the first distribution probability density of the data of the one or more column fields in the other data tables;

Determining, according to the comparison result, whether the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables further comprises:

If the first distribution probability density similarity is greater than a first preset threshold, determining that the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.

De-duplicating the repeated data in each column field in the multiple data tables, and calculating a second distribution probability density of the de-duplicated data in each column field;

Calculating the similarity between the second distribution probability density of the data of the column field in the one data table and the second distribution probability density of the data of the one or more column fields in the other data tables;

If the second distribution probability density similarity is greater than a second preset threshold, determining that the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.

Optionally, comparing the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables further comprises:

Calculating the degree of overlap between the data of the column field in the one data table and the data of the one or more column fields in the other data tables;

Judging whether the degree of overlap satisfies a preset overlap threshold; if so, determining that the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.

Optionally, the method further includes:

Obtaining a column field in one data table and one or more column fields in other data tables that have been labeled in advance as having a data lineage relationship.

Optionally, the method further includes:

According to the column field in the one data table and the one or more column fields in the other data tables that have a data lineage relationship, performing an outer join of the one data table with the other data tables, and calculating the similarity between the data of other column fields in the one data table and the data of one or more other column fields in the other data tables;

If the similarity is greater than a preset similarity threshold, determining that the other column fields in the one data table have a data lineage relationship with the one or more other column fields in the other data tables.

Optionally, the method further includes:

Pre-processing the data of each column field in the multiple data tables in advance, where the pre-processing includes one or more of the following: splitting large-field data into multiple single-field data items, converting Boolean field data into numeric data, and applying default handling to null-valued field data.

According to another aspect of the present invention, a device for mining data lineage relationships is provided, including:

A statistics module, adapted to collect statistics on the data distribution information of each column field in multiple data tables;

A comparison module, adapted to compare the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables;

A determining module, adapted to determine, according to the comparison result, whether the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.

Optionally, statistical module is further adapted for：

Comparison module is further adapted for：

Determining module is further adapted for：

Optionally, statistical module is further adapted for：

Comparison module is further adapted for：

Determining module is further adapted for：

Optionally, comparison module is further adapted for：

Determining module is further adapted for：

Optionally, the device further includes:

An acquisition module, adapted to obtain a column field in one data table and one or more column fields in other data tables that have been labeled in advance as having a data lineage relationship.

Optionally, the device further includes:

An outer join module, adapted to perform, according to the column field in the one data table and the one or more column fields in the other data tables that have a data lineage relationship, an outer join of the one data table with the other data tables, and to calculate the similarity between the data of other column fields in the one data table and the data of one or more other column fields in the other data tables; and, if the similarity is greater than a preset similarity threshold, to determine that the other column fields in the one data table have a data lineage relationship with the one or more other column fields in the other data tables.

Optionally, the device further includes:

A pre-processing module, adapted to pre-process the data of each column field in the multiple data tables in advance, where the pre-processing includes one or more of the following: splitting large-field data into multiple single-field data items, converting Boolean field data into numeric data, and applying default handling to null-valued field data.

According to another aspect of the invention, an electronic device is provided, including a processor, a memory, a communication interface and a communication bus, where the processor, the memory and the communication interface communicate with one another over the communication bus;

The memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the above method for mining data lineage relationships.

According to a further aspect of the present invention, a computer storage medium is provided, in which at least one executable instruction is stored, the executable instruction causing a processor to perform the operations corresponding to the above method for mining data lineage relationships.

According to the method and device for mining data lineage relationships provided by the invention, statistics are collected on the data distribution information of each column field in multiple data tables; the data distribution information of a column field in one of the data tables is compared with the data distribution information of one or more column fields in other data tables; and whether the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables is determined according to the comparison result. By comparing the distribution information of the data in each field across multiple data tables, the invention determines the data lineage relationships between the fields of different data tables. It is convenient and inexpensive to implement, depends on few intermediate factors, and is applicable to data lineage mining across many types of databases.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention are more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:

Fig. 1 shows a flow chart of a method for mining data lineage relationships according to one embodiment of the invention;

Fig. 2 shows a flow chart of a method for mining data lineage relationships according to another embodiment of the invention;

Fig. 3 shows a functional block diagram of a device for mining data lineage relationships according to one embodiment of the invention;

Fig. 4 shows a functional block diagram of a device for mining data lineage relationships according to another embodiment of the invention;

Fig. 5 shows a structural diagram of an electronic device according to one embodiment of the invention.

Detailed description of the embodiments

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be conveyed completely to those skilled in the art.

Fig. 1 shows a flow chart of a method for mining data lineage relationships according to one embodiment of the invention. As shown in Fig. 1, the method specifically comprises the following steps:

Step S101: collect statistics on the data distribution information of each column field in multiple data tables.

When column fields in multiple data tables have a data lineage relationship, that is, when these column fields belong to the same data generation link in the course of completing a business operation, the data in these column fields shares characteristics such as identical values. In this embodiment, data lineage mining is therefore based mainly on the data in each column field of the multiple data tables.

Statistics on the data distribution information of each column field can be collected from the data in each column field of the multiple data tables. Specifically, the data in each column field can be regarded as a random variable, and a partition into intervals is derived from the distribution of all the data in the column fields. Statistical methods then yield the frequency distribution of each random variable (that is, the data in each column field) over the different intervals, which serves as the probability distribution of the data in the column field, that is, the data distribution information of the column field. The interval boundaries are set according to the actual data, and the statistical implementation is not limited here.
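As an illustration only, the interval-based frequency statistics described above might be sketched in Python as follows. The function name column_distribution, the use of equal-width intervals, and the sample values are assumptions made for this sketch; the embodiment leaves the interval definition and the statistical implementation open.

```python
from collections import Counter

def column_distribution(values, lo, hi, bins=10):
    """Relative frequency of `values` over `bins` equal-width intervals on [lo, hi].

    Assumes every value lies within [lo, hi]; equal-width intervals are only
    one possible choice of interval partition.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    if not values:
        return [0.0] * bins
    width = (hi - lo) / bins
    counts = Counter()
    for v in values:
        # Clamp the right edge so that v == hi falls into the last interval.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    return [counts[i] / total for i in range(bins)]

# Hypothetical order-amount column.
order_amount = [10.0, 12.5, 20.0, 20.0, 35.0, 50.0]
dist = column_distribution(order_amount, lo=0.0, hi=50.0, bins=5)
```

To make two columns comparable, the same lo, hi and bins would be used for both, so that their frequency vectors refer to the same intervals.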

Optionally, a first distribution probability density of the data in each column field can be calculated from the data distribution information of each column field in the multiple data tables. Alternatively, the data in a column field may contain repeats. For example, if a column field holds commodity prices and the same commodity has the same price in every color, the column field contains multiple identical price values. In actual business, some data tables may need to record the price of each color of the same commodity, while other data tables may only need to record a single price per commodity. A distribution probability density calculated from the multiple identical price values may then deviate from one calculated from another column field whose data is not repeated, or is repeated less often. In view of this, the repeated price values can be de-duplicated: the repeated data in each column field in the multiple data tables is de-duplicated, and a second distribution probability density is calculated from the de-duplicated data in each column field.
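The effect of de-duplication on the distribution can be sketched as follows. This is a minimal illustration with hypothetical price data; the helper name histogram and the equal-width intervals are assumptions, not part of the disclosed method.

```python
def histogram(values, lo, hi, bins):
    """Relative frequencies over equal-width intervals (assumes lo <= v <= hi)."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]

# Hypothetical price column in which one commodity is listed once per color.
prices = [99.0, 99.0, 99.0, 150.0, 300.0]

# First distribution: computed on the raw column, repeats included.
first = histogram(prices, 0.0, 400.0, 4)

# Second distribution: computed after removing repeated values.
second = histogram(sorted(set(prices)), 0.0, 400.0, 4)
```

In this sample the raw column puts 60% of its mass in the lowest interval, whereas the de-duplicated column puts one third there, which is the kind of deviation the de-duplication step is meant to remove.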

Step S102: compare the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables.

The distribution information of a column field in one of the above multiple data tables is compared with the data distribution information of one or more column fields in the other data tables to obtain a comparison result.

Specifically, taking the first distribution probability density as an example, the similarity between the first distribution probability density of the data of a column field in one data table and the first distribution probability density of the data of one or more column fields in other data tables can be calculated. The similarity can be obtained by calculating the KL distance (Kullback-Leibler divergence) between the two first distribution probability densities. When the two first distribution probability densities are identical, the KL distance is 0; that is, the smaller the KL distance, the greater the similarity of the first distribution probability densities. The KL distance can be calculated in any existing manner. Besides the KL distance, other similarity measures can also be used to calculate the similarity of the first distribution probability densities; no limitation is imposed here.
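A minimal sketch of a KL-based similarity between two frequency vectors is given below. The additive smoothing constant eps and the mapping 1 / (1 + KL) from distance to similarity are assumptions introduced for this illustration; the text only states that a smaller KL distance means a greater similarity.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two frequency vectors defined over the same intervals.

    A small eps is added to every entry and both vectors are renormalized so
    that empty intervals do not cause division by zero or log(0).
    """
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))

def kl_similarity(p, q):
    """Map the KL distance into (0, 1]; identical distributions give 1.0."""
    return 1.0 / (1.0 + kl_divergence(p, q))

# Hypothetical frequency vectors for two column fields.
p = [0.5, 0.5, 0.0]
q = [0.1, 0.2, 0.7]
same = kl_similarity(p, p)
different = kl_similarity(p, q)
```

Note that the KL divergence is not symmetric, so KL(p || q) and KL(q || p) generally differ; a symmetric variant could be substituted if the comparison should not depend on the order of the two columns.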

Alternatively, taking the second distribution probability density as an example, the similarity between the second distribution probability density of the data of a column field in one data table and the second distribution probability density of the data of one or more column fields in other data tables can be calculated. The similarity of the second distribution probability densities can be calculated in the same way as for the first distribution probability densities, which is not repeated here.

Alternatively, in actual business the data of a column field in one data table may contain the data of a column field in another data table. Compare, for example, the order-amount data in an order data table with the payment-amount data in a payment data table. When all orders have been paid, the order-amount data in the order data table should be exactly the same as the payment-amount data in the payment data table. When some orders are unpaid, however, the two are not exactly the same: the order-amount data in the order data table contains all of the payment-amount data in the payment data table and holds more entries than it. In view of such practical business considerations, for the data distribution information of each column field in the multiple data tables, the degree of overlap between the data of a column field in one of the data tables and the data of one or more column fields in other data tables can also be calculated, which yields the containment relationship between the data of the column field in the one data table and the data of the one or more column fields in the other data tables.
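One possible way to quantify the degree of overlap is sketched below. Defining it as the share of distinct values of one column that also occur in the other is an assumption of this sketch; the text does not fix a formula. The function name containment and the sample data are hypothetical.

```python
def containment(candidate, reference):
    """Share of distinct values in `candidate` that also occur in `reference`."""
    cand, ref = set(candidate), set(reference)
    if not cand:
        return 0.0
    return len(cand & ref) / len(cand)

# Hypothetical columns: one order is still unpaid, so the order table holds
# an amount that is absent from the payment table.
order_amounts = [10.0, 25.0, 40.0, 55.0]
paid_amounts = [10.0, 40.0, 55.0]

paid_in_orders = containment(paid_amounts, order_amounts)
orders_in_paid = containment(order_amounts, paid_amounts)
```

Because the measure is directional, a value of 1.0 in one direction together with a lower value in the other indicates that one column is contained in the other rather than equal to it.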

Step S103: determine, according to the comparison result, whether the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.

Specifically, if the comparison result is the first distribution probability density similarity and that similarity is greater than the first preset threshold, it is determined that the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.

Alternatively, if the comparison result is the second distribution probability density similarity and that similarity is greater than the second preset threshold, it is determined that the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.

Alternatively, if the comparison result is the degree of overlap of the data, it is judged whether the degree of overlap satisfies the preset overlap threshold; if so, it is determined that the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.

The first preset threshold, the second preset threshold and/or the preset overlap threshold are set according to the specific implementation and are not limited here. One or more of the above approaches can be selected in an implementation, depending on the circumstances; no limitation is imposed here. When several approaches are selected, their results can be considered together, for example by assigning a weight to each approach, to finally determine whether the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.
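Combining several approaches by weight could look like the following sketch. The weighted average, the weight values, the score names and the decision threshold of 0.85 are all assumptions chosen for illustration; the text only mentions that weights may be set for the different approaches.

```python
def combined_score(scores, weights):
    """Weighted average of per-approach scores, each assumed to lie in [0, 1]."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

# Hypothetical results of the three comparisons for one pair of column fields.
scores = {"first_density": 0.9, "second_density": 0.8, "overlap": 1.0}
weights = {"first_density": 0.25, "second_density": 0.25, "overlap": 0.5}

score = combined_score(scores, weights)
has_lineage = score >= 0.85  # hypothetical decision threshold
```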

According to the method for mining data lineage relationships provided by the invention, statistics are collected on the data distribution information of each column field in multiple data tables; the data distribution information of a column field in one of the data tables is compared with the data distribution information of one or more column fields in other data tables; and whether the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables is determined according to the comparison result. By comparing the distribution information of the data in each field across multiple data tables, the invention determines the data lineage relationships between the fields of different data tables. It is convenient and inexpensive to implement, depends on few intermediate factors, and is applicable to data lineage mining across many types of databases.

Fig. 2 shows a flow chart of a method for mining data lineage relationships according to another embodiment of the invention. As shown in Fig. 2, the method specifically comprises the following steps:

Step S201: pre-process the data of each column field in multiple data tables in advance.

Pre-processing covers several kinds of handling. First, a column field in a data table may be a large field that contains the information of several fields. For example, a store information field may contain the store name, store address, store type and so on, so that distribution statistics cannot be collected on the single large field holding multiple pieces of information. Pre-processing splits the large-field data into multiple single-field data items: the store information field is split into a store name field, a store address field, a store type field and so on. Each resulting field records its own data, the large field from which it was split is recorded, and the corresponding data lineage relationship is recorded for each. Second, a column field in a data table may be a Boolean field whose data is generally true or false, a data type that is inconvenient for distribution statistics. Pre-processing can convert the Boolean field data into numeric data, for example true to 1 and false to 0, which makes it convenient to collect statistics on the data distribution of the Boolean field. Third, when the data of a column field in a data table is empty or null, default handling is applied to these null-valued fields. For example, if the column field has a default value, the null value can be updated to that default value, or the null value can be changed to suitable data according to the type of the column field itself, which makes it convenient to collect statistics on the data distribution of the null-valued field. Where necessary, discrete character variables can also be converted to numeric values, and so on.
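The three kinds of pre-processing can be sketched on a single record as follows. The representation of a large field as a nested dictionary, the dotted naming of the split fields, and the per-field defaults mapping are assumptions of this sketch; how a large field is actually encoded and split depends on the data source.

```python
def preprocess_row(row, defaults):
    """Apply the three pre-processing rules to one record (a dict).

    - A large field (modeled here as a nested dict) is split into single
      fields named "<large_field>.<sub_field>", which keeps a record of the
      large field each one came from.
    - Boolean values are converted to 1 / 0.
    - None is replaced by the field's default from `defaults`, if any.
    """
    out = {}
    for name, value in row.items():
        if isinstance(value, dict):
            for sub, sub_value in value.items():
                out[f"{name}.{sub}"] = sub_value
        elif isinstance(value, bool):
            out[name] = 1 if value else 0
        elif value is None:
            out[name] = defaults.get(name)
        else:
            out[name] = value
    return out

# Hypothetical record and defaults.
row = {
    "store_info": {"name": "Shop A", "address": "1 Example Road"},
    "is_open": True,
    "discount": None,
    "amount": 12.5,
}
clean = preprocess_row(row, defaults={"discount": 0.0})
```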

Step S202: collect statistics on the data distribution information of each column field in the multiple data tables.

Step S203: compare the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables.

Step S204: determine, according to the comparison result, whether the column field in the one data table has a data lineage relationship with the one or more column fields in the other data tables.

For the above steps, refer to steps S101 to S103 of the Fig. 1 embodiment; the details are not repeated here.

Step S205: obtain a column field in one data table and one or more column fields in other data tables that have been labeled in advance as having a data lineage relationship.

Some data lineage relationships between a column field in one data table and one or more column fields in other data tables are already known. For example, when a data table is created or newly added, the relevant information about its data lineage relationships may already be available. The column fields that have a data lineage relationship can then be labeled in advance, for example by manual labeling. The labels can be recorded by means such as database storage or file storage, which is not limited here. When needed, the column field in the one data table and the one or more column fields in the other data tables that were labeled in advance as having a data lineage relationship can be obtained directly.

Step S206: according to the column field in the one data table and the one or more column fields in the other data tables that have a data lineage relationship, perform an outer join of the one data table with the other data tables, and calculate the similarity between the data of other column fields in the one data table and the data of one or more other column fields in the other data tables.

Step S207: if the similarity is greater than a preset similarity threshold, determine that the other column fields in the one data table have a data lineage relationship with the one or more other column fields in the other data tables.

Steps S204 and S205 yield some of the column fields in one data table and the one or more column fields in other data tables that have a data lineage relationship. According to these fields, the one data table can be outer-joined with the other data tables. The join condition is that the data of the fields having the data lineage relationship is equal. For example, if field a of data table A and field b of data table B are found to have a data lineage relationship, data table A is outer-joined with data table B: select A.a1, A.a2, B.b1, B.b2 from A left outer join B on A.a = B.b. This further retrieves the data of the fields other than a in data table A and of the fields other than b in data table B, which makes it convenient, given that a data lineage relationship already exists between column fields of the two data tables, to mine whether further column fields with a lineage relationship exist.
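The outer join in the example above can be tried out with Python's built-in sqlite3 module, as in the sketch below. The table contents are hypothetical sample data; only the table names A and B, the join fields a and b, and the selected fields a1, a2, b1, b2 come from the example in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE A (a INTEGER, a1 REAL, a2 TEXT);
    CREATE TABLE B (b INTEGER, b1 REAL, b2 TEXT);
    INSERT INTO A VALUES (1, 10.0, 'x'), (2, 20.0, 'y'), (3, 30.0, 'z');
    INSERT INTO B VALUES (1, 10.0, 'x'), (2, 21.0, 'y');
    """
)

# Join on the fields already known to have a lineage relationship (A.a = B.b)
# and fetch the remaining fields of both tables side by side.
rows = conn.execute(
    "SELECT A.a1, A.a2, B.b1, B.b2 "
    "FROM A LEFT OUTER JOIN B ON A.a = B.b "
    "ORDER BY A.a"
).fetchall()
conn.close()
```

Rows of A without a match in B come back with None in the B columns, so each result row pairs the values that belong to the same business record where such a pairing exists.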

The outer join of the one data table with the other data tables yields the data of the other column fields in the one data table and the data of the one or more other column fields in the other data tables. From this data, the similarity between the data of the other column fields in the one data table and the data of the one or more other column fields in the other data tables can be calculated, for example by using a cosine similarity algorithm on the column field data. If the calculated similarity is greater than the similarity threshold, it can be determined that the other column fields in the one data table have a data lineage relationship with the one or more other column fields in the other data tables.
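Cosine similarity between two joined columns can be sketched as follows. Treating each column as a vector aligned by joined row and skipping rows where either side is None (unmatched rows of the outer join) are assumptions of this sketch; the sample columns are hypothetical.

```python
import math

def cosine_similarity(xs, ys):
    """Cosine similarity of two numeric columns aligned row by row.

    Rows in which either value is None are skipped.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    dot = sum(x * y for x, y in pairs)
    norm_x = math.sqrt(sum(x * x for x, _ in pairs))
    norm_y = math.sqrt(sum(y * y for _, y in pairs))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical joined columns A.a1 and B.b1; the last row has no match in B.
a1 = [10.0, 20.0, 30.0]
b1 = [10.0, 21.0, None]
sim = cosine_similarity(a1, b1)
```

Cosine similarity ignores overall scale, so two columns that differ by a constant factor would also score close to 1; whether that is desirable depends on the business meaning of the fields.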

Combining steps S204, S205 and S207 yields the data lineage relationships between a column field in one data table and one or more column fields in other data tables among the multiple data tables. The multiple data tables may reside in several different business systems, for example in an order system, a payment system, a discount decision system and a commodity system respectively. This embodiment can thus mine data lineage relationships across systems and provide the data lineage relationships between the systems more conveniently, so that the business systems can better carry out operations such as reconciliation and reporting on the basis of the data.

According to the method for mining data lineage relationships provided by the invention, the data lineage relationships between a column field in one data table and one or more column fields in other data tables are mined from several angles. In addition, according to the column field in the one data table and the one or more column fields in the other data tables already found to have a data lineage relationship, the one data table is outer-joined with the other data tables, so that it can further be mined whether other column fields in the one data table have a lineage relationship with one or more other column fields in the other data tables. The invention is grounded in business practice and mines from several angles, so that accurate and comprehensive data lineage relationships can be obtained. It is also convenient and inexpensive to implement, depends on few intermediate factors, and is applicable to data lineage mining across many types of databases.

Fig. 3 shows a functional block diagram of a mining device of data genetic connection according to an embodiment of the present invention. As shown in Fig. 3, the mining device of data genetic connection includes the following modules:

Statistical module 310, adapted to count the data distribution information in each column field in multiple data tables.

The statistical module 310 can count the data distribution information in each column field in the multiple data tables based on the data in each column field. Specifically, the statistical module 310 can treat the data in each column field as a random variable and derive an interval division from the distribution of all data in each column field. Using statistical methods, it can then obtain the frequency distribution of each random variable (that is, the data in each column field) over the different intervals. This frequency distribution serves as the probability distribution of the data in the column field, that is, as the data distribution information of the column field. The intervals are set according to the magnitude of the actual data, and the specific statistical implementation is not limited here.
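A minimal sketch of such a frequency statistic is given below for illustration. The function name, the interval edges and the sample amounts are assumptions, and values falling outside the interval division are simply ignored in this sketch.

```python
def frequency_distribution(values, edges):
    """Relative frequency of `values` over the intervals defined by `edges`.

    Intervals are [edges[i], edges[i+1]), except the last one, which is also
    closed on the right so that the maximum edge value is counted.
    Values outside all intervals are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

# Hypothetical order amounts and a hypothetical interval division.
order_amounts = [5, 12, 18, 25, 25, 40]
edges = [0, 10, 20, 30, 40]
dist = frequency_distribution(order_amounts, edges)
```

For two column fields to be comparable, the same interval division would have to be applied to both.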

Optionally, the statistical module 310 can calculate a first distribution probability density of the data in each column field from the data distribution information in each column field in the multiple data tables. Alternatively, the data in a column field may be repeated. For example, if a column field is the commodity price and the same commodity has the same price in every color, the column field contains multiple identical price values. In actual business, some data tables may need to record the price of each color of the same commodity, while others may only need to record one price per commodity. In that case, a distribution probability density calculated from the multiple identical price values may deviate from the one calculated for a column field in another data table that has no or fewer repeated values. To handle this situation, the statistical module 310 can deduplicate the repeated price values; that is, the statistical module 310 deduplicates the repeated data in each column field in the multiple data tables and calculates a second distribution probability density from the deduplicated data in each column field.
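The effect of deduplication can be illustrated with a small sketch. It uses a simplified discrete distribution over distinct values rather than an interval-based density, and the function name and price data are hypothetical.

```python
from collections import Counter

def value_distribution(values):
    """Relative frequency of each distinct value (a simple discrete distribution)."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

# Hypothetical data: table A stores one price row per color of a commodity,
# table B stores a single price row per commodity.
prices_a = [99, 99, 99, 150]
prices_b = [99, 150]

first_a = value_distribution(prices_a)        # without deduplication
second_a = value_distribution(set(prices_a))  # after deduplication
second_b = value_distribution(set(prices_b))
```

Without deduplication the repeated price dominates the distribution of table A; after deduplication the two columns have the same distribution.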

Comparison module 320, adapted to compare the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables.

Specifically, for the first distribution probability density, the comparison module 320 can calculate the similarity between the first distribution probability density of the data of a column field in one data table and the first distribution probability density of the data of one or more column fields in other data tables. The similarity can be obtained by calculating the KL distance (Kullback-Leibler divergence) between the two first distribution probability densities. When the two first distribution probability densities are identical, the KL distance is 0; that is, the smaller the KL distance, the greater the similarity of the first distribution probability densities. The KL distance can be calculated in any existing way. Besides the KL distance, the comparison module 320 can also use other similarity calculation methods to calculate the similarity of the first distribution probability densities, which is not limited here.
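For illustration, a KL distance between two discrete distributions over the same interval division might be computed as sketched below. The `eps` floor for empty intervals and the mapping from distance to similarity are assumptions of this sketch and are not specified by the embodiment.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) of two discrete distributions over the same intervals.

    `eps` (an assumption of this sketch) avoids division by zero when an
    interval is empty in q but not in p.
    """
    if len(p) != len(q):
        raise ValueError("distributions must use the same interval division")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log(pi / max(qi, eps))
    return total

def similarity_from_kl(p, q):
    """Hypothetical mapping of KL distance to (0, 1]; identical inputs give 1.0."""
    return 1.0 / (1.0 + kl_divergence(p, q))

same = kl_divergence([0.5, 0.5], [0.5, 0.5])
different = kl_divergence([0.9, 0.1], [0.5, 0.5])
```

Note that the KL divergence is not symmetric, so D(p || q) and D(q || p) generally differ; an implementation would need to fix the direction or use a symmetric variant.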

Alternatively, for the second distribution probability density, the comparison module 320 can calculate the similarity between the second distribution probability density of the data of a column field in one data table and the second distribution probability density of the data of one or more column fields in other data tables. The comparison module 320 can calculate this similarity in the same way as for the first distribution probability density, which is not repeated here.

Alternatively, in actual business, the data of a column field in one data table may contain the data of a column field in another data table. Take the order amount data in an order data table and the payment amount data in a payment data table. When all orders have been paid, the order amount data in the order data table should be completely consistent with the payment amount data in the payment data table. When some orders are unpaid, however, the two are not completely consistent: the order amount data in the order data table contains all of the payment amount data in the payment data table and has more entries than the payment amount data. Based on these business considerations, for the data distribution information in each column field in the multiple data tables, the comparison module 320 can also calculate the degree of overlap between the data of a column field in one of the data tables and the data of one or more column fields in other data tables, and thereby obtain the inclusion relationship between them.
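The embodiment does not fix a formula for the degree of overlap. One possible definition, assumed here purely for illustration, is the fraction of distinct values of one column that also appear in the other; the function name and amounts are hypothetical.

```python
def overlap_degree(column_a, column_b):
    """Fraction of the distinct values of column_b that also appear in column_a."""
    set_a, set_b = set(column_a), set(column_b)
    if not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_b)

# Hypothetical data: one order is still unpaid.
order_amounts = [10.0, 25.5, 40.0, 99.9]
payment_amounts = [10.0, 25.5, 40.0]

payments_in_orders = overlap_degree(order_amounts, payment_amounts)
orders_in_payments = overlap_degree(payment_amounts, order_amounts)
```

With this definition the measure is directional: all payment amounts are contained in the order amounts, while only part of the order amounts is contained in the payment amounts, which reflects the inclusion relationship described above.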

Determining module 330, adapted to determine, according to the comparison result, whether a column field in one data table has a data genetic connection with one or more column fields in other data tables.

The determining module 330 makes its determination according to the comparison result. Specifically, if the comparison result is the similarity of the first distribution probability densities and that similarity is greater than a first preset threshold, the determining module 330 determines that the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables.

Alternatively, if the comparison result is the similarity of the second distribution probability densities and that similarity is greater than a second preset threshold, the determining module 330 determines that the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables.

Alternatively, if the comparison result is the degree of overlap of the data, the determining module 330 judges whether the degree of overlap meets a preset overlap threshold; if so, the determining module 330 determines that the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables.

The first preset threshold, the second preset threshold and/or the preset overlap threshold are set according to the implementation conditions, which is not limited here. One or more of the above approaches can be selected in implementation, depending on the specific situation, which is not limited here. When several approaches are selected, the determining module 330 can assign each of them a weight or the like and consider their results together to finally determine whether the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables.
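One way to combine several approaches by weight is sketched below. The weighted average, the score names, the weights and the single combined threshold are all assumptions for illustration; the embodiment only states that weights or similar forms may be used.

```python
def combined_decision(scores, weights, threshold):
    """Weighted average of per-approach scores, compared with one threshold."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = sum(scores[name] * weights[name] for name in scores) / total_weight
    return combined, combined > threshold

# Hypothetical scores in [0, 1] from the three approaches described above.
scores = {"first_density": 0.9, "second_density": 0.8, "overlap": 1.0}
weights = {"first_density": 1.0, "second_density": 1.0, "overlap": 2.0}
combined, has_connection = combined_decision(scores, weights, threshold=0.85)
```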

According to the mining device of data genetic connection provided by the present invention, the data distribution information in each column field in multiple data tables is counted; the data distribution information of a column field in one of the data tables is compared with the data distribution information of one or more column fields in other data tables; and whether the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables is determined according to the comparison result. The present invention compares the distribution information of the data in each field of the multiple data tables to determine the data genetic connections of the fields between data tables. It is convenient, low-cost and easy to implement, relies on few intermediate factors, and is generally applicable to mining data genetic connections in many types of databases.

Fig. 4 shows a functional block diagram of a mining device of data genetic connection according to another embodiment of the present invention. As shown in Fig. 4, compared with Fig. 3, the mining device of data genetic connection further includes the following modules:

Acquisition module 340, adapted to obtain a column field in one data table and one or more column fields in other data tables that have been marked in advance as having a data genetic connection.

For a column field in one data table that is already known to have a data genetic connection with one or more column fields in other data tables, for example when the relevant data genetic connection information is already known at the time a data table is created or added, the column fields having the data genetic connection can be marked in advance, for example by manual annotation. The annotations can be recorded by means such as database storage or file storage, which is not limited here. When needed, the acquisition module 340 can directly obtain the column field in one data table and the one or more column fields in other data tables that have been marked in advance as having a data genetic connection.

Outer join module 350, adapted to outer-join one data table with other data tables according to a column field in the one data table and one or more column fields in the other data tables that have a data genetic connection, and to calculate the similarity between the data of other column fields in the one data table and the data of other one or more column fields in the other data tables; if the similarity is greater than a preset similarity threshold, it is determined that the other column fields in the one data table have a data genetic connection with the other one or more column fields in the other data tables.

Through the determining module 330 and the acquisition module 340 described above, a number of column fields in one data table and column fields in other data tables that have a data genetic connection can be obtained. Based on such a column field in one data table and one or more column fields in other data tables, the outer join module 350 can outer-join the one data table with the other data tables. The join condition is that the data of the fields having the data genetic connection are equal. For example, if field a of data table A and field b of data table B are found to have a data genetic connection, the outer join module 350 outer-joins data table A with data table B: select A.a1, A.a2, B.b1, B.b2 from A left outer join B on A.a=B.b. In this way, the data of the other fields in data table A besides field a and of the other fields in data table B besides field b can be obtained, so that, given that a data genetic connection already exists between column fields of the two data tables, the outer join module 350 can further mine whether other column fields also have a data genetic connection.
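The outer join statement above can be tried out with an in-memory SQLite database, as in the following sketch. The table contents and column types are hypothetical; only the table names A and B, the field names and the join condition come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a INTEGER, a1 REAL, a2 TEXT);
    CREATE TABLE B (b INTEGER, b1 REAL, b2 TEXT);
    INSERT INTO A VALUES (1, 10.0, 'x'), (2, 20.0, 'y'), (3, 30.0, 'z');
    INSERT INTO B VALUES (1, 10.0, 'x'), (2, 20.5, 'y');
""")

# Join on the fields already known to have a data genetic connection (A.a, B.b)
# and fetch the remaining fields of both tables for further comparison.
rows = conn.execute(
    "SELECT A.a1, A.a2, B.b1, B.b2 "
    "FROM A LEFT OUTER JOIN B ON A.a = B.b "
    "ORDER BY A.a"
).fetchall()
conn.close()
```

Rows of A without a match in B are kept, with the B fields returned as NULL (None in Python), so the subsequent similarity calculation has to decide how to treat such rows.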

By outer-joining the one data table with the other data tables, the outer join module 350 can obtain the data of the other column fields in the one data table and the data of the other one or more column fields in the other data tables. From these data, the outer join module 350 can calculate the similarity between the data of the other column fields in the one data table and the data of the other one or more column fields in the other data tables, for example with a cosine similarity algorithm. If the similarity calculated by the outer join module 350 is greater than the similarity threshold, it can be determined that the other column fields in the one data table have a data genetic connection with the other one or more column fields in the other data tables.

Taken together, the determining module 330, the acquisition module 340 and the outer join module 350 yield the data genetic connections between a column field in one data table and one or more column fields in other data tables among the multiple data tables. The multiple data tables may reside in several different business systems, such as an order system, a payment system, a preferential decision system and a commodity system. This embodiment can therefore mine data genetic connections across systems and makes the data genetic connections between the systems more readily available, so that the business systems can better perform operations such as account reconciliation and reporting based on the data.

Preprocessing module 360, adapted to preprocess the data of each column field in the multiple data tables in advance.

The preprocessing performed by the preprocessing module 360 includes several kinds of processing. For example, a column field in a data table may be a big field that contains multiple pieces of field information. If the big field is a shop information field, it contains information such as the shop name, the shop address and the shop type, so the distribution information of data that packs multiple pieces of information into one big field cannot be counted. The preprocessing module 360 needs to split the big field data into multiple single-field data; that is, the preprocessing module 360 splits the shop information field into a shop name field, a shop address field, a shop type field and so on. Each resulting single field records its own data, the big field it was split from is recorded, and the corresponding data genetic connection is recorded for each.

Alternatively, a column field in a data table may be a Boolean field whose data is generally true or false. This data type is inconvenient to count when collecting distribution information, so the preprocessing module 360 converts the data of the Boolean field to numeric data, for example true to 1 and false to 0, which makes it convenient to count the data distribution of the Boolean field.

Alternatively, when the data of a column field in a data table is empty, null or another null value, the preprocessing module 360 applies default processing to these null-value fields. For example, if the column field has a default value configured, the preprocessing module 360 can update the null values to that default value, or replace the null values with corresponding data according to the type of the column field itself, which makes it convenient to count the data distribution of the null-value field. Where necessary, discrete character variables can also be converted to numeric values, and so on.
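The three kinds of preprocessing can be sketched on a single record as follows. Representing the big field as a nested dictionary, the dotted naming of the split fields, and the field names themselves are assumptions made for this illustration.

```python
def preprocess_row(row, defaults):
    """Return a copy of `row` with big fields split, Booleans made numeric
    and null values replaced by per-field defaults."""
    out = {}
    for name, value in row.items():
        if isinstance(value, dict):
            # Big field: split into single fields; the prefix records which
            # big field each single field came from.
            for sub_name, sub_value in value.items():
                out[f"{name}.{sub_name}"] = sub_value
        elif isinstance(value, bool):
            # Boolean field: true -> 1, false -> 0.
            out[name] = 1 if value else 0
        elif value is None or value == "":
            # Null-value field: fall back to the configured default, if any.
            out[name] = defaults.get(name)
        else:
            out[name] = value
    return out

# Hypothetical record with a big field, a Boolean field and a null value.
row = {
    "shop_info": {"name": "Cafe", "address": "1 Main St", "type": "food"},
    "is_paid": True,
    "amount": None,
}
clean = preprocess_row(row, defaults={"amount": 0.0})
```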

According to the mining device of data genetic connection provided by the present invention, the data genetic connections between a column field in one data table and one or more column fields in other data tables among multiple data tables are mined from several aspects. In addition, based on a column field in one data table and one or more column fields in other data tables that have already been found to have a data genetic connection, the one data table is outer-joined with the other data tables to further mine whether other column fields in the one data table have a data genetic connection with other one or more column fields in the other data tables. The present invention is grounded in business practice and mines from several aspects, so accurate and comprehensive data genetic connections can be obtained. It is also convenient, low-cost and easy to implement, relies on few intermediate factors, and is generally applicable to mining data genetic connections in many types of databases.

The present application also provides a non-volatile computer storage medium. The computer storage medium stores at least one executable instruction, and the computer-executable instruction can perform the mining method of data genetic connection in any of the above method embodiments.

Fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the electronic device.

As shown in Fig. 5, the electronic device may include: a processor 502, a communications interface 504, a memory 506 and a communication bus 508.

Wherein:

The processor 502, the communication interface 504 and the memory 506 communicate with one another through the communication bus 508.

The communication interface 504 is used for communicating with network elements of other devices, such as clients or other servers.

The processor 502 is used for executing a program 510, and can specifically perform the relevant steps in the above embodiments of the mining method of data genetic connection.

Specifically, the program 510 may include program code, and the program code includes computer operation instructions.

The processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the electronic device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.

The memory 506 is used for storing the program 510. The memory 506 may include high-speed RAM, and may also include non-volatile memory, for example at least one disk memory.

The program 510 can specifically be used to cause the processor 502 to perform the mining method of data genetic connection in any of the above method embodiments. For the specific implementation of each step in the program 510, reference may be made to the corresponding descriptions of the corresponding steps and units in the above embodiments of mining data genetic connection, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may refer to the corresponding process descriptions in the foregoing method embodiments, which are not repeated here.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the above description, the structure required to construct such a system is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented using various programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments of the invention. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.

Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the mining device of data genetic connection according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order. These words can be interpreted as names.

Claims

1. A mining method of data genetic connection, comprising:

counting data distribution information in each column field in multiple data tables;

comparing the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables;

determining, according to the comparison result, whether the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables.

2. The method according to claim 1, wherein counting the data distribution information in each column field in the multiple data tables further comprises:

calculating a first distribution probability density of the data in each column field according to the data distribution information in each column field in the multiple data tables;

the comparing of the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables further comprises:

calculating the similarity between the first distribution probability density of the data of the column field in the one data table and the first distribution probability density of the data of the one or more column fields in the other data tables;

the determining, according to the comparison result, of whether the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables further comprises:

if the similarity of the first distribution probability densities is greater than a first preset threshold, determining that the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables.

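3. The method according to claim 1, wherein counting the data distribution information in each column field in the multiple data tables further comprises:

deduplicating the repeated data in each column field in the multiple data tables, and calculating a second distribution probability density of the deduplicated data in each column field;

the comparing of the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables further comprises:

calculating the similarity between the second distribution probability density of the data of the column field in the one data table and the second distribution probability density of the data of the one or more column fields in the other data tables;

the determining, according to the comparison result, of whether the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables further comprises:

if the similarity of the second distribution probability densities is greater than a second preset threshold, determining that the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables.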

4. The method according to claim 1, wherein comparing the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables further comprises:

calculating the degree of overlap between the data of the column field in the one data table and the data of the one or more column fields in the other data tables;

judging whether the degree of overlap meets a preset overlap threshold; if so, determining that the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables.

5. The method according to claim 1, wherein the method further comprises:

obtaining a column field in one data table and one or more column fields in other data tables that have been marked in advance as having a data genetic connection.

6. The method according to any one of claims 1-5, wherein the method further comprises:

outer-joining one data table with other data tables according to a column field in the one data table and one or more column fields in the other data tables that have a data genetic connection, and calculating the similarity between the data of other column fields in the one data table and the data of other one or more column fields in the other data tables;

if the similarity is greater than a preset similarity threshold, determining that the other column fields in the one data table have a data genetic connection with the other one or more column fields in the other data tables.

7. The method according to any one of claims 1-4, wherein the method further comprises:

preprocessing the data of each column field in the multiple data tables in advance; wherein the preprocessing includes one or more of the following: splitting big field data into multiple single-field data, converting Boolean field data to numeric data, and default processing of null-value field data.

8. A mining device of data genetic connection, comprising:

a statistical module, adapted to count data distribution information in each column field in multiple data tables;

a comparison module, adapted to compare the data distribution information of a column field in one of the data tables with the data distribution information of one or more column fields in other data tables;

a determining module, adapted to determine, according to the comparison result, whether the column field in the one data table has a data genetic connection with the one or more column fields in the other data tables.

9. An electronic device, comprising: a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the mining method of data genetic connection according to any one of claims 1-7.

10. A computer storage medium, wherein at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to perform operations corresponding to the mining method of data genetic connection according to any one of claims 1-7.