CN106844338B

CN106844338B - Method for detecting entity column of network table based on dependency relationship between attributes

Info

Publication number: CN106844338B
Application number: CN201710002389.7A
Authority: CN
Inventors: 王宁; 张丽方
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2017-01-03
Filing date: 2017-01-03
Publication date: 2019-12-10
Anticipated expiration: 2037-01-03
Also published as: CN106844338A

Abstract

The invention provides a method for detecting the entity columns of a network table based on dependency relationships among attributes. For a given network table, the method calculates the approximate functional dependency probability between any two columns from the functional dependencies between their column values, and obtains a candidate functional dependency set from these probabilities. It then deletes the noisy functional dependencies from the candidate set according to the characteristics of network tables, yielding an approximate functional dependency set. Finally, it performs 3NF normalization on the approximate functional dependency set and takes the primary key set produced by the normalization as the entity columns of the network table. The method expresses the inherent functional dependencies among attributes more accurately, and it applies both to network tables with a single entity column and to tables with multiple entity columns.

Description

Method for detecting entity column of network table based on dependency relationship between attributes

Technical Field

The invention relates to the technical field of network information processing, and in particular to a method for detecting the entity columns of a network table based on dependency relationships among attributes.

Background

With the development of information technology, resources on the internet have become increasingly abundant. Besides unstructured data, a large number of network tables exist, and because they are better structured than free text they have attracted considerable attention. Helping machines better understand the semantics of network tables has become a significant challenge in improving the coverage and accuracy of table search. The entity column identifies the entities described by a network table, and its column label describes the subject of the whole table, from which the semantic information of the table can be determined. If the entity column of a network table is detected accurately, a machine's understanding of the table's semantics can be greatly improved.

One prior-art entity column discovery algorithm is the evidence-based algorithm proposed by Wang et al. Using Probase as its knowledge base, the algorithm discovers the entity column of a network table by relying on two pieces of evidence: first, all entities in an entity column describe the same concept; second, a concept-attribute relationship exists between the concept expressed by the entity column and the concepts expressed by the other, non-entity columns.

In the evidence-based entity column discovery algorithm, for each candidate pattern s of a network table, one column col is selected as the candidate entity column and the remaining columns are treated as its attributes. A score is calculated for every candidate entity column, and the candidate with the highest score is selected as the entity column of the network table. The objective function is as follows:

where $SC_A$ is the set of all possible concept-attribute relationships for attribute set $A$, $c_i$ is the concept described by attribute set $A_i$, and $sa_i$ is the confidence that attribute set $A$ is an attribute of concept $c_i$; $SC_E$ is the set of all possible concept-entity relationships for entity set $E$, $c_i$ is the subject concept of entity set $E_i$, and $se_i$ is the confidence that entity set $E$ belongs to concept $c_i$; $A^{col}$ denotes all attribute sets in candidate pattern s other than column col; and $E^{col}$ denotes the set of all column values in column col.

The prior-art entity column discovery algorithm has the following disadvantages. First, the method relies on the header of the network table and on a knowledge base, which incurs a large computational overhead. The knowledge base does cover many entities, attributes, concepts, and the relationships among them, but it can hardly cover all of those found on the network. Moreover, network tables often lack header information, and it is difficult to recover the headers accurately using a knowledge base alone, especially the labels of columns such as numbers and dates. As a result, the recall and accuracy of the evidence-based entity column discovery algorithm are low. Second, the evidence-based method can only discover the entity column of a network table with a single entity column and ignores network tables with multiple entity columns. Because many tables on the network have more than one entity column, the algorithm is limited in scope.

Disclosure of Invention

The embodiments of the invention provide a method for detecting the entity columns of a network table based on dependency relationships among attributes, so as to discover the entity columns of the network table effectively.

In order to achieve the purpose, the invention adopts the following technical scheme.

A method for detecting the entity columns of a network table based on dependency relationships among attributes comprises the following steps:

for a network table, calculating the approximate functional dependency probability between any two columns according to the functional dependencies between the column values, and obtaining a candidate functional dependency set according to the approximate functional dependency probabilities;

deleting the noisy functional dependencies from the candidate functional dependency set according to the characteristics of the network table to obtain an approximate functional dependency set;

and performing 3NF normalization on the approximate functional dependency set, and taking the primary key set generated by the 3NF normalization as the entity columns of the network table.

Further, calculating, for a network table, the approximate functional dependency probability between any two columns according to the functional dependencies between the column values, and obtaining the candidate functional dependency set according to the approximate functional dependency probabilities, includes:

Let X be an attribute in the network table T and A be an attribute in T different from X. When the (X, A) attribute value pairs of some of the tuples in T satisfy X → A, X is said to approximately functionally determine A, or A to approximately functionally depend on X; the approximate functional dependency probability expresses the likelihood that X → A holds on T. The data in the (X, A) attribute value pairs for which X → A holds are called consistent data, and the remaining data are called inconsistent data;

In the network table T, for the tuples whose X attribute value is $v_x$, the A attribute column may take different values; let the set of these different values be $V_A$.

If the most frequent value in the set $V_A$ is not unique, each of the most frequent values is taken in turn as a class center, the sum of the similarities between the other values and that class center value is calculated, and the class center value $v_a$ with the largest sum is selected as the consistent data. The specific calculation is given by formula (1).

For any class center value $v_j$.
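The selection of the consistent value described above can be sketched as follows. This is a minimal illustration, not the patent's formula (1): the similarity measure is not named in the text, so `difflib.SequenceMatcher` is used as a stand-in, and the function names `similarity` and `pick_consistent_value` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Assumed similarity in [0, 1]; the source does not specify the measure."""
    return SequenceMatcher(None, a, b).ratio()


def pick_consistent_value(values: list[str]) -> str:
    """Return the value v_a treated as consistent data for one group V_A."""
    counts = Counter(values)
    top = max(counts.values())
    centers = [v for v, c in counts.items() if c == top]
    if len(centers) == 1:
        return centers[0]  # a unique most frequent value needs no tie-break

    # Tie: score each class center by its summed similarity to the other values.
    def score(center: str) -> float:
        return sum(similarity(center, v) for v in values if v != center)

    return max(centers, key=score)


# "Beijing" and "Shanghai" both occur twice; the misspelling "Bejing"
# is close to "Beijing", so "Beijing" is chosen.
print(pick_consistent_value(["Beijing", "Beijing", "Bejing", "Shanghai", "Shanghai"]))
```

Counting each occurrence of the other values in the sum is one possible reading; summing over distinct values only would be an equally plausible variant.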

For all tuples whose value in X is $v_x$, the support $S_c(X \to A, V_X, V_A')$ of the consistent data $v_a$ for X → A is calculated by formula (2);

wherein:

$V_X = \{X.r \mid X.r = v_x\}$

$V_A' = \{A.r \mid X.r = v_x \wedge A.r = v_a\}$

$|V_X, V_A'| = |\{\langle X.r, A.r \rangle \mid X.r = v_x \wedge A.r = v_a\}|$

$V_A'$ is the set of consistent data in column A for the rows in which column X takes the value $v_x$; $X.r$ is the value of the cell in row r of column X, and $A.r$ is the value of the cell in row r of column A;

The support $S_{nc}(X \to A, V_X, V_A^*)$ of the inconsistent data for X → A is calculated by formula (3);

The support of the set $V_X$ for X → A is expressed as the weighted sum of the supports of the consistent data and the inconsistent data for X → A, and is calculated by formula (5):

where $\omega_1 + \omega_2 = 1$;

The supports of all distinct values in X are averaged, and the average is taken as the probability that X → A holds in the network table T, calculated by formula (6):

where $|D_X|$ denotes the number of distinct $V_X$ in X;

This value is the probability of the approximate functional dependency of A on X in the network table T; the candidate functional dependency set contains all possible approximate functional dependencies in the network table T.
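The formulas themselves are not reproduced in this text. Read literally, the surrounding definitions suggest forms like the following for formulas (2), (5), and (6). This is a hedged reconstruction under stated assumptions, not the patent's exact notation: the symbols $S(X \to A, V_X)$ for the combined support and $P(X \to A)$ for the probability are assumed names. Formula (3) depends on a similarity measure that the text does not specify, so it is not sketched here.

```latex
% (2) share of the tuples with X = v_x whose A value is the consistent value v_a
S_c(X \to A, V_X, V_A') = \frac{|V_X, V_A'|}{|V_X|}

% (5) weighted combination of consistent and inconsistent support
S(X \to A, V_X) = \omega_1 \, S_c(X \to A, V_X, V_A')
                + \omega_2 \, S_{nc}(X \to A, V_X, V_A^*),
\qquad \omega_1 + \omega_2 = 1

% (6) average over the |D_X| distinct values of X
P(X \to A) = \frac{1}{|D_X|} \sum_{V_X \in D_X} S(X \to A, V_X)
```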

Further, deleting the noisy functional dependencies from the candidate functional dependency set according to the characteristics of the network table to obtain the approximate functional dependency set includes:

If an approximate functional dependency of A on X in the candidate functional dependency set satisfies any of the following 3 rules, it is eliminated from the candidate approximate functional dependency set:

Rule 1: the type of the attribute values of column X is date, floating point, or Boolean;

Rule 2: there is an attribute column Y in the network table T such that [formula omitted] holds;

Rule 3: in the candidate approximate functional dependency set, there are attribute columns X and A such that [formula omitted] and [formula omitted].

Further, performing 3NF normalization on the approximate functional dependency set and taking the primary key set generated by the 3NF normalization as the entity columns of the network table includes:

Mapping the approximate functional dependencies in the approximate functional dependency set to a relation matrix FD[m][n], and mapping the approximate functional dependencies among the decision attributes to a relation matrix KK[m][m], where m is the number of attributes on the left-hand side of the approximate functional dependencies, i.e. the number of decision attributes, and n is the number of all attribute columns in the network table:

(1) The elements of FD[m][n] are generated as follows:

Let α ∈ {decision attribute set}, β ∈ {set of all attribute columns}

1) If α = β, then FD[α][β] := 2;

2) If the approximate functional dependency of β on α is in the approximate functional dependency set, then FD[α][β] := 1;

3) Otherwise, FD[α][β] := 0;

(2) The elements of KK[m][m] are generated as follows:

Let α, γ ∈ {decision attribute set}

1) If α = γ or the approximate functional dependency of γ on α is in the approximate functional dependency set, then KK[α][γ] := 1;

2) Otherwise, KK[α][γ] := -1;

Definition: in the network table T, if X approximately functionally determines Y and Y approximately functionally determines Z, then Z is said to be approximately transitively dependent on X, where Y is the intermediary key of the approximate transitive functional dependency;

Determining the approximate functional dependency set closure DC[m][n] from the relation matrices FD[m][n] and KK[m][m], determining from DC[m][n] the decision attributes that exist only in direct approximate functional dependencies and the intermediary keys, and outputting these decision attributes and intermediary keys as the entity columns of the network table.

Further, determining the approximate functional dependency set closure DC[m][n] from the relation matrix FD[m][n] and the relation matrix KK[m][m] includes:

Step 1: copy the elements of FD[m][n] to DC[m][n]; i := 0, where i denotes the i-th approximate functional dependency in KK[m][m];

Step 2: i := 1;

Step 3: judge whether [formula omitted] exists in KK[m][m] and [formula omitted] exists in DC[m][n]; if so, DC[m][n] := $\beta_i$ and execute Step 4; otherwise, execute Step 4 directly;

Step 4: judge whether the (i+1)-th approximate functional dependency exists in KK[m][m]; if so, execute Step 5; otherwise, execute Step 6 directly;

Step 5: i := i + 1, and return to Step 3;

Step 6: judge whether DC[m][n] has changed; if so, return to Step 2; otherwise, output DC[m][n] and end the process.
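The fixed-point loop in Steps 1 to 6 can be sketched as below. This is one plausible reading rather than the patent's exact procedure. It assumes that the value written into DC for a transitive dependency is the intermediary key, and that the matrices are dictionaries keyed by attribute name instead of integer indices. The function name `closure` and the film example are hypothetical.

```python
def closure(fd, kk):
    """Propagate dependencies until DC stops changing (Steps 1-6)."""
    dc = {a: dict(row) for a, row in fd.items()}  # Step 1: copy FD into DC
    changed = True
    while changed:  # Step 6: repeat while DC keeps changing
        changed = False
        for a, row in kk.items():  # Steps 2-5: scan every pair in KK
            for g, flag in row.items():
                if flag != 1 or a == g:
                    continue  # only a -> g between distinct decision attributes
                for b, code in dc[g].items():
                    # a -> g and g -> b: record g as intermediary key for a -> b.
                    # Only cells still holding 0 or 1 are rewritten, so the loop ends.
                    if code != 0 and b != g and b != a and dc[a][b] in (0, 1):
                        dc[a][b] = g
                        changed = True
    return dc


# Film -> Director, Film -> Year, Film -> Birthplace, Director -> Birthplace
fd = {
    "Film": {"Film": 2, "Director": 1, "Year": 1, "Birthplace": 1},
    "Director": {"Film": 0, "Director": 2, "Year": 0, "Birthplace": 1},
}
kk = {
    "Film": {"Film": 1, "Director": 1},
    "Director": {"Film": -1, "Director": 1},
}
dc = closure(fd, kk)
print(dc["Film"]["Birthplace"])  # the intermediary key "Director"
```

Overwriting a direct entry (1) with the intermediary key is a design choice made here so that a dependency reachable through another decision attribute is flagged as transitive.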

Further, determining from the approximate functional dependency set closure DC[m][n] the decision attributes that exist only in direct approximate functional dependencies and the intermediary keys includes:

Step 1: input DC[m][n] and FD[m][n];

Step 2: i := 0, j := 0, where i and j denote the row number and column number of DC[m][n];

Step 3: judge whether DC[i][j] != 1 && FD[i][j] == 1 && FD[j][i] == 1 holds; if so, DC[i][j] := 1 and execute Step 4; otherwise, execute Step 4 directly;

Step 4: judge whether the traversal is complete; if so, set i := 0 and j := 0 and execute Step 5; otherwise, take the next DC[i][j] and execute Step 3;

Step 5: judge whether DC[i][j] ∉ {0, 1, 2} holds; if so, Entity{} := DC[i][j] and execute Step 7; otherwise, execute Step 6;

Step 6: judge whether DC[i][j] == 1 && i != j holds; if so, add the decision attribute of row i to the Entity set and execute Step 7; otherwise, execute Step 7 directly;

Step 7: judge whether the traversal is complete; if so, output the Entity set and end the process; otherwise, take the next DC[i][j] and continue with Step 5.
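Steps 1 to 7 can be sketched as two passes over DC. This is an illustrative reading under assumptions: DC and FD are dictionaries keyed by attribute name, and any DC entry outside {0, 1, 2} is taken to be the name of an intermediary key. The function name `entity_columns` and the example data are hypothetical.

```python
def entity_columns(dc, fd):
    """Correct mutual dependencies, then collect the Entity set (Steps 1-7)."""
    dc = {a: dict(row) for a, row in dc.items()}  # work on a copy
    # Steps 2-4: two attributes that directly determine each other are
    # relabeled as a direct dependency, not a transitive one.
    for i, row in dc.items():
        for j in row:
            if row[j] != 1 and fd[i][j] == 1 and fd.get(j, {}).get(i) == 1:
                row[j] = 1
    # Steps 5-7: gather intermediary keys and decision attributes
    # that still have a direct dependency.
    entity = set()
    for i, row in dc.items():
        for j, code in row.items():
            if code not in (0, 1, 2):
                entity.add(code)  # Step 5: intermediary key stored in DC
            elif code == 1 and i != j:
                entity.add(i)     # Step 6: decision attribute of row i
    return entity


fd = {
    "Film": {"Film": 2, "Director": 1, "Year": 1, "Birthplace": 1},
    "Director": {"Film": 0, "Director": 2, "Year": 0, "Birthplace": 1},
}
dc = {
    "Film": {"Film": 2, "Director": 1, "Year": 1, "Birthplace": "Director"},
    "Director": {"Film": 0, "Director": 2, "Year": 0, "Birthplace": 1},
}
print(sorted(entity_columns(dc, fd)))  # ['Director', 'Film']
```

In this example both "Film" and "Director" are reported, which matches the intent of handling tables with more than one entity column.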

As can be seen from the technical solutions above, the approximate functional dependency detection method provided by the embodiments of the invention is adapted to the characteristics of network tables and expresses the inherent functional dependencies among attributes more accurately. Because the approximate functional dependency is calculated from the supports of both the consistent data and the inconsistent data for the functional dependency, the algorithm is markedly resistant to noise. The method applies not only to network tables with a header, but also to network tables without a header and to network tables whose complete header cannot be recovered by semantic recovery techniques.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a processing flowchart of a method for detecting the entity columns of a network table based on dependency relationships between attributes according to an embodiment of the present invention;

FIG. 2 is a flowchart of the process for obtaining a candidate dependency set according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the process for finding the approximate functional dependency set closure from the approximate functional dependency set according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for obtaining entity columns using the third normal form according to an embodiment of the present invention;

FIG. 5 is a schematic comparison of the AFD_Model algorithm, the PFD_Model algorithm, and the evidence-based method (ED_Model) in terms of entity column detection accuracy, recall, F-value, and time efficiency on single-entity-column tables according to an embodiment of the present invention;

FIG. 6 is a schematic comparison of the effectiveness of the AFD_Model algorithm and the PFD_Model algorithm in multi-entity-column discovery according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.

To solve the technical problems of existing entity column detection algorithms, the invention provides an entity column detection algorithm that has a low computational cost, does not depend on headers or a knowledge base, and is suitable for network tables with multiple entity columns. The invention overcomes the reliance of traditional algorithms on the header of the network table and on a knowledge base, as well as their inability to discover multiple entity columns. By introducing the concept of approximate functional dependency, it improves the noise resistance of the method while producing high-quality entity column discovery results.

The processing flow of the method for detecting the entity column of the network table based on the dependency relationship among the attributes provided by the embodiment of the invention is shown in fig. 1, and comprises the following processing steps:

Step 1, obtaining a candidate function dependency set according to the approximate function dependency probability among the column values of the network table.

For a network table, if one or more columns can identify the entities described by the table, those columns are defined as entity columns, and the remaining columns are defined as attribute columns.

For each table, the invention calculates the approximate function dependence probability between any two columns according to the function dependence relationship between column values. Here we introduce the support of consistent data and inconsistent data, taking into account the presence of noise in the table.

Definition 1: Let X be an attribute column in the network table T and A be an attribute column in T other than X. When the (X, A) attribute value pairs of some of the tuples in T satisfy X → A, X is said to approximately functionally determine A, or A to approximately functionally depend on X; the approximate functional dependency probability indicates the likelihood that X → A holds on T. The data in the (X, A) attribute value pairs for which X → A holds are called consistent data, and the remaining data are called inconsistent data.

For any class center value $v_j$.

Because column values in a network table may be miswritten, the supports of the consistent data and the inconsistent data for the functional dependency are combined to calculate the approximate functional dependency probability between any two columns and thereby obtain the candidate functional dependency set.

FIG. 2 is a flowchart of the process for obtaining the candidate dependency set according to an embodiment of the present invention. The specific process is as follows. First, the larger the proportion of consistent data, the greater the probability that X → A holds, that is, the higher the support of the consistent data for X → A; likewise, the larger this proportion, the more likely the data are truly consistent. For all tuples whose value in X is $v_x$, both the support of the consistent data $v_a$ for X → A and the reliability of the consistent data are calculated by formula (2).

Wherein:

$V_X = \{X.r \mid X.r = v_x\}$

$V_A' = \{A.r \mid X.r = v_x \wedge A.r = v_a\}$

$|V_X, V_A'| = |\{\langle X.r, A.r \rangle \mid X.r = v_x \wedge A.r = v_a\}|$

$V_A'$ is the set of consistent data in column A for the rows in which column X takes the value $v_x$; $X.r$ is the value of the cell in row r of column X, and $A.r$ is the value of the cell in row r of column A.

Next, the more similar the inconsistent data are to the consistent data, and the greater the reliability of the consistent data, the greater the support of the inconsistent data for X → A, as shown in formula (3).

where $V_A^* = \{A.r \mid X.r = v_x \wedge A.r \neq v_a\}$.

The support of the set $V_X$ for X → A can be expressed as the weighted sum of the supports of the consistent data and the inconsistent data for X → A, as shown in formula (5).

where $\omega_1 + \omega_2 = 1$.

Finally, the supports of all distinct values in X are averaged, and the average is taken as the probability that X → A holds in the network table T, calculated by formula (6):

where $|D_X|$ denotes the number of distinct $V_X$ in X.

Formula (6) gives the probability that X → A holds in table T. All possible approximate functional dependencies in T are included in the candidate functional dependency set, and the probability that each of them holds is calculated according to formula (6).
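The overall calculation can be sketched in a few lines. This is a minimal illustration under several assumptions, not the patented formulas: the consistent support is taken as the share of consistent rows, the inconsistent support is assumed to be that share scaled by the mean string similarity to the consistent value (the text only states that it grows with both), a group with no inconsistent data is given similarity 1, and ties for the most frequent value are resolved by first occurrence rather than by formula (1). The function name `afd_probability` and the weights 0.7 and 0.3 are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def afd_probability(x_col, a_col, w1=0.7, w2=0.3):
    """Estimate the probability that X -> A holds; cell values are strings."""
    groups = defaultdict(list)  # A values grouped by the X value of their row
    for x, a in zip(x_col, a_col):
        groups[x].append(a)

    supports = []
    for a_values in groups.values():
        # Consistent value v_a: the most frequent A value in this group.
        v_a, n_consistent = Counter(a_values).most_common(1)[0]
        s_c = n_consistent / len(a_values)  # share of consistent data
        others = [a for a in a_values if a != v_a]
        if others:
            mean_sim = sum(
                SequenceMatcher(None, v_a, a).ratio() for a in others
            ) / len(others)
        else:
            mean_sim = 1.0  # assumption: nothing contradicts v_a
        s_nc = s_c * mean_sim  # assumed form of the inconsistent support
        supports.append(w1 * s_c + w2 * s_nc)  # weighted sum, w1 + w2 = 1
    return sum(supports) / len(supports)  # average over distinct X values


country = ["CN", "CN", "CN", "FR"]
capital = ["Beijing", "Beijing", "Bejing", "Paris"]
print(afd_probability(country, capital))
```

A misspelled value such as "Bejing" lowers the probability only slightly, whereas unrelated values in the same group lower it sharply, which is the noise tolerance the description aims for.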

If X approximately functionally determines A, X is called the decision attribute of this approximate functional dependency. All the decision attributes in the approximate functional dependency set form the decision attribute set, and the number of elements in the decision attribute set is the number of decision attributes, namely m.

Step 2: delete the noisy functional dependencies from the candidate functional dependency set according to the characteristics of the network table to obtain the approximate functional dependency set.

Noisy functional dependencies are deleted mainly to obtain a more accurate functional dependency set, which lays the foundation for obtaining the entity columns in the next step. The specific pruning rules are as follows:

If an approximate functional dependency of A on X satisfies any of the following 3 rules, it is removed from the candidate approximate functional dependency set.

Rule 1: the type of the attribute values of column X is date, floating point, or Boolean.

Rule 2: there is an attribute column Y in T such that [formula omitted] holds.

According to these pruning rules, the noisy functional dependencies are deleted from the candidate functional dependency set to obtain the approximate functional dependency set.
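Rule 1 lends itself to a small type check on the cells of column X. The sketch below is illustrative only: the list of date formats, the Boolean tokens, and the function name `violates_rule_1` are assumptions, since the text does not say how column types are detected. Rules 2 and 3 are not sketched because their conditions are not recoverable from this text.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")  # hypothetical formats
BOOLEAN_TOKENS = {"true", "false", "yes", "no"}      # hypothetical tokens


def _is_date(cell):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(cell, fmt)
            return True
        except ValueError:
            pass
    return False


def _is_float(cell):
    # Floating point here means: parses as float but not as a plain integer.
    try:
        float(cell)
    except ValueError:
        return False
    try:
        int(cell)
        return False
    except ValueError:
        return True


def _is_boolean(cell):
    return cell.lower() in BOOLEAN_TOKENS


def violates_rule_1(column):
    """True if every non-empty cell is a date, a float, or a Boolean."""
    cells = [c.strip() for c in column if c.strip()]
    if not cells:
        return False
    return any(all(check(c) for c in cells)
               for check in (_is_date, _is_float, _is_boolean))


print(violates_rule_1(["2017-01-03", "2019-12-10"]))  # True
print(violates_rule_1(["Beijing", "Paris"]))          # False
```

Integer columns are deliberately not flagged, because the rule as stated names only date, floating-point, and Boolean types.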

Step 3: obtain the entity columns based on the idea of normalization.

The attribute columns in a network table approximately functionally depend on the entity columns they describe. Following the normalization principles of relational database theory, 3NF normalization is performed on the approximate functional dependency set, and the primary key set produced by the 3NF normalization gives the required entity columns of the network table.

The process of 3NF normalization of the approximate functional dependency set includes:

Mapping the dependencies in the approximate functional dependency set to a relation matrix FD[m][n], and mapping the approximate functional dependencies between decision attributes to a relation matrix KK[m][m], where m is the number of attributes on the left-hand side of the approximate functional dependencies, i.e. the number of decision attributes, and n is the number of all attribute columns in the network table. For convenience, different relationships between attributes are represented by different numbers, and the elements of the matrices are generated as follows:

(1) The elements of FD[m][n] are generated as follows:

1) If α = β, then FD[α][β] := 2;

2) If the approximate functional dependency of β on α is in the approximate functional dependency set, then FD[α][β] := 1;

3) Otherwise, FD[α][β] := 0;

(2) The elements of KK[m][m] are generated as follows:

Let α, γ ∈ {decision attribute set}

1) If α = γ or the approximate functional dependency of γ on α is in the approximate functional dependency set, then KK[α][γ] := 1;

2) Otherwise, KK[α][γ] := -1;
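The two matrices can be built directly from a list of approximate functional dependencies. The sketch below uses dictionaries keyed by attribute name instead of integer-indexed arrays, and assumes the value -1 for unrelated pairs in KK. The function name `build_matrices` and the example table are hypothetical.

```python
def build_matrices(columns, afds):
    """columns: all attribute names; afds: set of (decision, dependent) pairs."""
    # Decision attributes are the columns that appear on a left-hand side.
    decision = [c for c in columns if any(x == c for x, _ in afds)]
    # FD: 2 on the diagonal, 1 for a dependency in the set, 0 otherwise.
    fd = {
        a: {b: 2 if a == b else 1 if (a, b) in afds else 0 for b in columns}
        for a in decision
    }
    # KK: 1 for identical or dependent decision attributes, -1 otherwise (assumed).
    kk = {
        a: {g: 1 if a == g or (a, g) in afds else -1 for g in decision}
        for a in decision
    }
    return decision, fd, kk


columns = ["Film", "Director", "Year", "Birthplace"]
afds = {
    ("Film", "Director"),
    ("Film", "Year"),
    ("Film", "Birthplace"),
    ("Director", "Birthplace"),
}
decision, fd, kk = build_matrices(columns, afds)
print(decision)        # ['Film', 'Director']
print(fd["Director"])  # {'Film': 0, 'Director': 2, 'Year': 0, 'Birthplace': 1}
```

Here m is `len(decision)` and n is `len(columns)`, matching the dimensions FD[m][n] and KK[m][m].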

For ease of description, Definition 3 defines the approximate transitive functional dependency as follows:

Definition 3: in the network table T, if X approximately functionally determines Y and Y approximately functionally determines Z, then Z is said to be approximately transitively dependent on X, where Y is the intermediary key of the approximate transitive functional dependency.

FIG. 3 is a schematic diagram of the process for finding the approximate functional dependency set closure DC[m][n] from the approximate functional dependency set; DC[m][n] is determined from FD[m][n] and KK[m][m] as follows.

Step 1: copy the elements of FD[m][n] to DC[m][n]; i := 0, where i denotes the i-th approximate functional dependency in KK[m][m];

Step 2: i := 1;

Step 3: judge whether [formula omitted] exists in KK[m][m] and [formula omitted] exists in DC[m][n]; if so, DC[m][n] := $\beta_i$ and execute Step 4; otherwise, execute Step 4 directly;

Step 5: i := i + 1, and return to Step 3.

FIG. 4 is a flowchart of the method for obtaining entity columns using the third normal form, in which mislabeled approximate transitive dependencies are corrected according to the approximate functional dependency set closure DC[m][n] above. Finally, the decision attributes that exist only in direct approximate functional dependencies, together with the intermediary keys, are output as entity columns. The process of finding these decision attributes and intermediary keys comprises the following steps:

Step 1: input DC[m][n] and FD[m][n];

In summary, the approximate functional dependency detection method provided by the embodiments of the present invention is adapted to the characteristics of network tables and expresses the inherent functional dependencies among attributes more accurately. Because the approximate functional dependency is calculated from the supports of both the consistent data and the inconsistent data for the functional dependency, the algorithm is markedly resistant to noise.

The entity column discovery algorithm based on the approximation function dependence and the normalization provided by the embodiment of the invention can discover entity columns in more scenes. The method is not only suitable for the network table of a single entity column, but also suitable for the table of a plurality of entity columns; the method is not only suitable for the network table with the header, but also suitable for the network table without the header or the network table which can not recover the complete header by utilizing the semantic recovery technology.

compared with the prior art, the method has the advantages of high entity column discovery quality and capability of discovering multiple entity columns. To verify the above advantages, we performed a number of experiments with experimental data from two data sources: one is an open-source Wiki Table dataset and the other is a network Table we crawl from the network, which we call a Web Table dataset. The collected network table is divided into a large table data set (more than 100 rows), L data set for short, and a small table data set (less than 100 rows), S data set for short according to the number of rows. To facilitate experimental validation of single-entity column and multi-entity column discovery, we split the L data set into L single-entity sets (WiKi _ LS and Web _ LS) and L multi-entity sets (WiKi _ LM and Web _ LM); the S data set is divided into S single entity sets (WiKi _ SS and Web _ SS) and S multiple entity sets (WiKi _ SM and Web _ SM).

The method discovers entity columns based on the function dependence relationship among column values, does not depend on header information or a knowledge base, and thereby improves the quality of entity column discovery. To verify the noise-reduction effectiveness of the algorithm of the embodiment of the present invention (AFD_Model), a PFD_Model algorithm was also implemented, which is identical to the AFD_Model algorithm except that table noise is not considered. FIG. 3 compares the entity column detection accuracy, recall, F-value, and time efficiency of AFD_Model, PFD_Model, and the evidence-based method (ED_Model) on single-entity-column tables. FIG. 5 shows that the AFD_Model algorithm of the present invention outperforms the ED_Model and PFD_Model algorithms overall. In terms of accuracy, the ED_Model algorithm requires the header of a network table to have a concept-attribute relationship in the Probase library, so the quality of the header and the coverage of the knowledge base both affect its accuracy, whereas the AFD_Model algorithm depends on neither header information nor a knowledge base and therefore achieves high accuracy. The AFD_Model algorithm also takes the characteristics of network tables into account and has a certain noise-filtering capability, so its entity detection accuracy is higher than that of the PFD_Model algorithm. In terms of recall, the AFD_Model algorithm is higher than both the ED_Model algorithm and the PFD_Model algorithm. The AFD_Model algorithm does not require a network table to have a header, does not require the entity column and the non-entity columns of the table to have an attribute relationship, does not require the concept-attribute relationship to exist in the Probase library, and has a certain noise-filtering capability, so it is more widely applicable. The F-value measures the overall quality of an algorithm, and here too the AFD_Model algorithm has a clear advantage. In terms of running time, the time cost of the ED_Model algorithm is significantly greater than that of the AFD_Model and PFD_Model algorithms, because the ED_Model algorithm must use the Probase library to determine the concept-attribute relationship of the header, or of the semantically recovered header, in order to determine the entity column, whereas the time complexity of the AFD_Model and PFD_Model algorithms depends only on the size of the table.
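
As an illustration of how the accuracy, recall, and F-value discussed above could be computed for entity column detection, the following is a minimal sketch. The function name `evaluate_entity_columns`, the input layout (a mapping from a table identifier to a set of column names), and the micro-averaging over tables are all assumptions made for illustration; the embodiment does not state how the measures are aggregated.

```python
from typing import Dict, Set, Tuple


def evaluate_entity_columns(
    predicted: Dict[str, Set[str]],
    labeled: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Return (accuracy, recall, F-value), micro-averaged over all tables.

    Assumption: a detected column counts as correct when it appears in the
    labeled entity-column set of the same table.
    """
    correct = detected = expected = 0
    for table_id, gold in labeled.items():
        found = predicted.get(table_id, set())
        correct += len(found & gold)   # detected columns that are labeled
        detected += len(found)         # all detected columns
        expected += len(gold)          # all labeled entity columns
    accuracy = correct / detected if detected else 0.0
    recall = correct / expected if expected else 0.0
    total = accuracy + recall
    f_value = 2 * accuracy * recall / total if total else 0.0
    return accuracy, recall, f_value


# Hypothetical data: the second table has two entity columns, one is missed.
labeled = {"t1": {"Country"}, "t2": {"Player", "Team"}}
predicted = {"t1": {"Country"}, "t2": {"Player"}}
accuracy, recall, f_value = evaluate_entity_columns(predicted, labeled)
```

Counting columns rather than whole tables lets the same function serve both the single-entity and the multi-entity data sets.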

The method is also suitable for tables with multiple entity columns, which markedly broadens its applicability. Because the ED_Model algorithm cannot perform multi-entity-column discovery, the method of the invention is compared only with the PFD_Model algorithm. FIG. 6 is a schematic diagram comparing the effectiveness of the AFD_Model and PFD_Model algorithms in multi-entity-column discovery according to the embodiment of the present invention. FIG. 6 shows that the AFD_Model algorithm performs better than the PFD_Model algorithm in accuracy, recall, and F-value alike, because the AFD_Model algorithm takes the effect of noisy data into account when computing the approximate function dependence between attributes.
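
To make the idea of a noise-tolerant dependence probability more concrete, the following is a minimal sketch of how the probability that X → A holds might be computed for two columns. Only the overall shape follows the method: group the rows by the value of X, choose the most frequent A value as consistency data (breaking ties by summed similarity), combine the supports of consistency and inconsistency data with weights ω₁ + ω₂ = 1, and average over the distinct X values. The function names, the use of `difflib.SequenceMatcher` as the similarity measure, the default weights 0.8 and 0.2, and the exact form of the two support terms are assumptions made for illustration and are not taken from the formulas of the embodiment.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from typing import List, Sequence


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for the unspecified measure."""
    return SequenceMatcher(None, a, b).ratio()


def pick_consistency_value(values: Sequence[str]) -> str:
    """Most frequent value; ties go to the value most similar to the others."""
    counts = Counter(values)
    top = max(counts.values())
    centers = [v for v, c in counts.items() if c == top]
    return max(
        centers,
        key=lambda c: sum(similarity(c, v) for v in values if v != c),
    )


def afd_probability(
    x_col: Sequence[str],
    a_col: Sequence[str],
    w1: float = 0.8,
    w2: float = 0.2,
) -> float:
    """Average, over distinct X values, of the weighted support for X -> A."""
    groups = defaultdict(list)
    for x, a in zip(x_col, a_col):
        groups[x].append(a)

    supports: List[float] = []
    for values in groups.values():
        v_a = pick_consistency_value(values)
        inconsistent = [v for v in values if v != v_a]
        # Assumed support of consistency data: share of rows equal to v_a.
        s_c = (len(values) - len(inconsistent)) / len(values)
        # Assumed support of inconsistency data: similarity-weighted share.
        s_i = sum(similarity(v, v_a) for v in inconsistent) / len(values)
        supports.append(w1 * s_c + w2 * s_i)
    return sum(supports) / len(supports) if supports else 0.0


# Hypothetical columns: "Chinaa" is a noisy variant of "China".
city = ["Beijing", "Beijing", "Beijing", "Paris", "Paris"]
country = ["China", "China", "Chinaa", "France", "France"]
probability = afd_probability(city, country)
```

Under these assumed formulas a dependence without any inconsistent data scores exactly ω₁, so probabilities are only comparable when computed with the same weights.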

Those of ordinary skill in the art will understand that the figures are merely schematic representations of one embodiment, and that the blocks or flows in the figures are not necessarily required to practice the present invention.

From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the embodiments or in some parts of the embodiments.

The embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others. In particular, since the apparatus or system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant points can be found in the corresponding description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.

The above description covers only the preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto; any change or substitution that can readily be conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims

1. A method for detecting an entity column of a network table based on dependency relationships among attributes, characterized by comprising the following steps:

performing 3NF normalization on the approximate function dependence set, and using the primary key set generated after 3NF normalization as the entity columns of the network table;

wherein calculating, for a network table, the approximate function dependence probability between any two columns according to the function dependence relationship between column values, and acquiring a candidate function dependence set according to the approximate function dependence probability, comprises:

letting X be an attribute in the network table T and A be an attribute in T different from X; when the attribute value pairs (X, A) of some of the tuples in T make X → A hold, X is said to approximately functionally determine A, or A to approximately functionally depend on X; the probability that X → A holds on T is called the approximate function dependence probability; the data in the attribute value pairs (X, A) for which X → A holds are referred to as consistency data, and the remaining data as inconsistency data;

in the network table T, for the tuples whose X attribute value is v_x, the A attribute column may take different values, and the set of these different values is denoted V_A;

if the most frequent value in the set V_A is not unique, each of the most frequent values is taken in turn as a class center, the sum of the similarities between the other values and the class center value is calculated, and the class center value v_a for which the sum is largest is selected as the consistency data; the specific calculation is shown in formula 1;

for any class center value v_j;

for all tuples in which X takes the value v_x, the support degree S_c(X→A, V_X, V_A') of the consistency data v_a for X → A holding is calculated by formula 2, where V_A' is the consistency data in column A corresponding to the X column taking the value v_x;

wherein:

V_X = {X.r | X.r = v_x}

V_A' = {A.r | X.r = v_x & A.r = v_a}

|V_X, V_A'| = |{<X.r, A.r> | X.r = v_x & A.r = v_a}|

X.r is the value of the cell in row r of column X, and A.r is the value of the cell in row r of column A;

the support degree of the inconsistency data for X → A holding is calculated by formula 3;

the support degree of the set V_X for X → A holding is expressed as the weighted sum of the support degrees of the consistency data and of the inconsistency data for X → A holding, and is calculated by formula 5:

wherein ω₁ + ω₂ = 1;

the support degrees of all distinct values in X are taken, and their average is used as the probability that X → A holds in the network table T, calculated by formula 6:

wherein |D_X| denotes the number of distinct V_X in X;

the result represents the probability that the approximate function dependence holds in the network table T, and the candidate function dependence set comprises all possible approximate function dependences in the network table T;

wherein performing 3NF normalization on the approximate function dependence set, and using the primary key set generated after 3NF normalization as the entity columns of the network table, comprises:

(1) the elements of FD[m][n] are generated as follows:

1) if α = β, then FD[α][β] := 2;

2) if α → β is in the approximate function dependence set, then FD[α][β] := 1;

3) otherwise, FD[α][β] := 0;

(2) the elements of KK[m][m] are generated as follows:

let α, γ ∈ {decision attribute set};

1) if α = γ or α → γ holds, then KK[α][γ] := 1;

2) otherwise, KK[α][γ] := -1;
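
The construction of the two relation matrices can be illustrated with the following minimal sketch. It assumes single-attribute decision attributes (determinants), represents the approximate function dependence set as a set of hypothetical `(determinant, dependent)` pairs, and stores each matrix as a nested dictionary keyed by attribute name. Treating "the dependence appears in the set" as the condition for the value 1 in both matrices is an assumption of this sketch, as are the function names.

```python
from typing import Dict, List, Set, Tuple

Dependence = Tuple[str, str]  # (determinant, dependent attribute)


def build_fd_matrix(
    determinants: List[str],
    attributes: List[str],
    deps: Set[Dependence],
) -> Dict[str, Dict[str, int]]:
    """FD[a][b] is 2 if a == b, 1 if a -> b is in deps, and 0 otherwise."""
    fd: Dict[str, Dict[str, int]] = {}
    for alpha in determinants:
        fd[alpha] = {}
        for beta in attributes:
            if alpha == beta:
                fd[alpha][beta] = 2
            elif (alpha, beta) in deps:
                fd[alpha][beta] = 1
            else:
                fd[alpha][beta] = 0
    return fd


def build_kk_matrix(
    determinants: List[str],
    deps: Set[Dependence],
) -> Dict[str, Dict[str, int]]:
    """KK[a][g] is 1 if a == g or a -> g is in deps, and -1 otherwise."""
    kk: Dict[str, Dict[str, int]] = {}
    for alpha in determinants:
        kk[alpha] = {}
        for gamma in determinants:
            related = alpha == gamma or (alpha, gamma) in deps
            kk[alpha][gamma] = 1 if related else -1
    return kk


# Hypothetical table with three columns and three dependences.
attributes = ["Country", "Capital", "Population"]
deps = {
    ("Country", "Capital"),
    ("Country", "Population"),
    ("Capital", "Population"),
}
determinants = sorted({left for left, _ in deps})
fd = build_fd_matrix(determinants, attributes, deps)
kk = build_kk_matrix(determinants, deps)
```

Nested dictionaries keep the sketch readable; an implementation working on large tables would more likely index attributes by integer position.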

2. The method of claim 1, wherein pruning the noise function dependences in the candidate function dependence set according to the characteristics of the network table to obtain an approximate function dependence set comprises:

3. The method according to claim 1, wherein determining the closure DC[m][n] of the approximate function dependence set from the relation matrix FD[m][n] and the relation matrix KK[m][m] comprises:

Step 2: i := 1;

Step 3: judging whether … is present in KK[m][m] and … in DC[m][n]; if so, then DC[m][n] := β_i, and performing step 4; otherwise, directly performing step 4;

Step 5: i := i + 1, and returning to step 3;

4. The method according to claim 3, wherein determining, from the closure DC[m][n] of the approximate function dependence set, the decision attributes and intermediary keys for which only direct approximate function dependences exist comprises:

Step 1: inputting DC[m][n] and FD[m][n];