CN106844338B - method for detecting entity column of network table based on dependency relationship between attributes - Google Patents

Info

Publication number
CN106844338B
CN106844338B CN201710002389.7A CN201710002389A CN106844338B CN 106844338 B CN106844338 B CN 106844338B CN 201710002389 A CN201710002389 A CN 201710002389A CN 106844338 B CN106844338 B CN 106844338B
Authority
CN
China
Prior art keywords
column
network table
function
dependence
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710002389.7A
Other languages
Chinese (zh)
Other versions
CN106844338A (en)
Inventor
王宁
张丽方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Jiaotong University
Priority to CN201710002389.7A
Publication of CN106844338A
Application granted
Publication of CN106844338B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for detecting the entity columns of a network table based on the dependency relationships among attributes. For a network table, the approximate functional dependency probability between any two columns is calculated from the functional dependency relationship between their column values, and a candidate functional dependency set is obtained from these probabilities; noisy functional dependencies are deleted from the candidate set according to the characteristics of network tables, yielding an approximate functional dependency set; 3NF normalization is then applied to the approximate functional dependency set, and the set of primary keys produced by the normalization is taken as the entity columns of the network table. The method expresses the inherent functional dependencies among attributes more accurately, and it applies both to network tables with a single entity column and to tables with multiple entity columns.

Description

Method for detecting entity column of network table based on dependency relationship between attributes
Technical Field
The invention relates to the technical field of network information processing, and in particular to a method for detecting the entity columns of a network table based on the dependency relationships among attributes.
Background
With the development of information technology, resources on the Internet have become increasingly abundant. Besides unstructured data, a large number of network tables exist, and because they are better structured than text they have attracted wide attention. Helping machines understand the semantics of network tables is a significant challenge in improving the coverage and accuracy of table search. The entity column identifies the entities described by a network table, and its column label describes the subject of the whole table, so the semantic information of the table can be determined through it. If the entity column of a network table is detected accurately, a machine's understanding of the table's semantics improves greatly.
One entity column discovery algorithm in the prior art is the evidence-based algorithm proposed by Wang et al. It uses a knowledge base and relies on two pieces of evidence to discover the entity column of a network table: first, all entities in an entity column describe the same concept; second, a concept-attribute relationship exists between the concept expressed by the entity column and the concepts expressed by the non-entity columns.
In the evidence-based entity column discovery algorithm, for each candidate pattern s of a network table, one column col is selected as the entity column and the remaining columns are treated as its attributes. The scores of all candidate entity columns are calculated, and the candidate with the highest score is selected as the entity column of the network table. The objective function is as follows:
[objective function omitted in source]
where SC_A is the set of all possible concept-attribute relationships of attribute set A, c_i is the concept described by attribute set A_i, and sa_i is the confidence that attribute set A is an attribute of concept c_i; SC_E is the set of all possible concept-entity relationships of entity set E, c_i is the subject concept of entity set E_i, and se_i is the confidence that entity set E belongs to concept c_i; A_col is the set of all attributes in candidate pattern s other than column col; and E_col is the set of all values in column col.
The prior-art entity column discovery algorithm has the following disadvantages. First, it relies on the header of the network table and on a knowledge base, which incurs a large computational overhead. A knowledge base covers many entities, attributes, concepts, and relationships among them, but it can hardly cover all of those found on the network. Network tables also often lack header information, and a knowledge base alone rarely recovers the header accurately, especially the labels of numeric and date columns. As a result, the recall and precision of the evidence-based algorithm are low. Second, the evidence-based method can only discover the entity column of a network table with a single entity column and ignores network tables with multiple entity columns. Many tables on the network have more than one entity column, so the algorithm is limited.
Disclosure of Invention
The embodiments of the invention provide a method for detecting the entity columns of a network table based on the dependency relationships among attributes, so that the entity columns of a network table can be discovered effectively.
To achieve this purpose, the invention adopts the following technical scheme.
A method for detecting the entity columns of a network table based on the dependency relationships among attributes comprises the following steps:
for a network table, calculating the approximate functional dependency probability between any two columns from the functional dependency relationship between their column values, and obtaining a candidate functional dependency set from these probabilities;
deleting noisy functional dependencies from the candidate functional dependency set according to the characteristics of network tables, to obtain an approximate functional dependency set;
applying 3NF normalization to the approximate functional dependency set, and taking the set of primary keys produced by the normalization as the entity columns of the network table.
Further, calculating, for a network table, the approximate functional dependency probability between any two columns from the functional dependency relationship between their column values, and obtaining the candidate functional dependency set from these probabilities, comprises:
Let X be an attribute of the network table T and A an attribute of T different from X. When the (X, A) attribute value pairs of some of the tuples in T satisfy X → A, X is said to approximately functionally determine A, or A to approximately functionally depend on X, written [notation omitted in source]; the associated probability expresses how likely X → A is to hold on T and is called the approximate functional dependency probability. The (X, A) attribute value pairs for which X → A holds are called consistent data, and the rest are called inconsistent data;
In the network table T, the tuples whose X attribute value is v_x may take different values in attribute column A; let the set of these values be V_A.
If the most frequent value in V_A is not unique, each of the most frequent values is taken in turn as a class center, the sum of the similarities between the other values and that class center is calculated, and the class center v_a with the largest sum is selected as the consistent data (when the most frequent value is unique, it serves as the consistent data directly). The calculation is given by formula (1).
For any class center value v_j:
[formula (1) omitted in source]
For all tuples whose X value is v_x, the support S_c(X → A, V_X, V_A') of the consistent data v_a for X → A is calculated by formula (2);
[formula (2) omitted in source]
where:
V_X = {X.r | X.r = v_x}
V_A' = {A.r | X.r = v_x & A.r = v_a}
|V_X, V_A'| = |{<X.r, A.r> | X.r = v_x & A.r = v_a}|
V_A' is the set of consistent data in column A when column X takes the value v_x; X.r is the value of the cell in row r of column X, and A.r is the value of the cell in row r of column A;
The support S_nc(X → A, V_X, V_A*) of the inconsistent data for X → A is calculated by formula (3);
[formula (3) omitted in source]
The support of the set V_X for X → A is expressed as the weighted sum of the supports of the consistent data and the inconsistent data for X → A, and is calculated by formula (5):
[formula (5) omitted in source]
where ω_1 + ω_2 = 1;
The supports of all distinct values in X are averaged, and the average is taken as the probability that X → A holds in the network table T, calculated by formula (6):
[formula (6) omitted in source]
where |D_X| is the number of distinct V_X in X;
The candidate functional dependency set contains all possible approximate functional dependencies in the network table T.
Further, deleting noisy functional dependencies from the candidate functional dependency set according to the characteristics of network tables, to obtain the approximate functional dependency set, comprises:
If an approximate functional dependency of A on X in the candidate functional dependency set satisfies any of the following 3 rules, it is removed from the candidate set:
Rule 1: the attribute values of column X are of date, floating-point, or Boolean type;
Rule 2: there is an attribute column Y in the network table T such that [condition omitted in source] holds;
Rule 3: in the candidate functional dependency set, there are attribute columns X and A such that [condition omitted in source] and [condition omitted in source].
Further, applying 3NF normalization to the approximate functional dependency set and taking the set of primary keys produced by the normalization as the entity columns of the network table comprises:
Mapping the approximate functional dependencies in the approximate functional dependency set to a relation matrix FD[m][n], and mapping the approximate functional dependencies among the decision attributes to a relation matrix KK[m][m], where m is the number of attributes on the left-hand side of the approximate functional dependencies, that is, the number of decision attributes, and n is the number of all attribute columns in the network table:
(1) The elements of FD[m][n] are generated as follows:
Let α ∈ {decision attribute set} and β ∈ {set of all column attributes}
1) If α = β, then FD[α][β] := 2;
2) If α approximately functionally determines β, then FD[α][β] := 1;
3) Otherwise, FD[α][β] := 0;
(2) The elements of KK[m][m] are generated as follows:
Let α, γ ∈ {decision attribute set}
1) If α = γ or α approximately functionally determines γ, then KK[α][γ] := 1;
2) Otherwise, KK[α][γ] := -1;
Definition: in the network table T, if [condition omitted in source], then Z is said to be approximately transitively functionally dependent on X, written [notation omitted in source], where Y is the intermediary key of the approximate transitive functional dependency;
Determining the closure DC[m][n] of the approximate functional dependency set from the relation matrices FD[m][n] and KK[m][m], determining from DC[m][n] the decision attributes that occur only in direct approximate functional dependencies together with the intermediary keys, and outputting these decision attributes and intermediary keys as the entity columns of the network table.
Further, determining the closure DC[m][n] of the approximate functional dependency set from the relation matrices FD[m][n] and KK[m][m] comprises:
Step 1: copy the elements of FD[m][n] into DC[m][n]; set i := 0, where i denotes the i-th approximate functional dependency in KK[m][m];
Step 2: set i := 1;
Step 3: determine whether [dependency omitted in source] exists in KK[m][m] and [dependency omitted in source] exists in DC[m][n]; if so, set DC[m][n] := β_i and go to step 4; otherwise go directly to step 4;
Step 4: determine whether an (i+1)-th approximate functional dependency exists in KK[m][m]; if so, go to step 5; otherwise go directly to step 6;
Step 5: set i := i + 1 and return to step 3;
Step 6: determine whether DC[m][n] has changed; if so, return to step 2; otherwise output DC[m][n] and end the process.
Further, determining from the closure DC[m][n] the decision attributes that occur only in direct approximate functional dependencies together with the intermediary keys comprises:
Step 1: input DC[m][n] and FD[m][n];
Step 2: set i := 0 and j := 0, where i and j are the row and column indices of DC[m][n];
Step 3: determine whether DC[i][j] != 1 && FD[i][j] == 1 && FD[j][i] == 1 holds; if so, set DC[i][j] := 1 and go to step 4; otherwise go directly to step 4;
Step 4: determine whether the traversal is complete; if so, reset i := 0 and j := 0 and go to step 5; otherwise take the next DC[i][j] and go to step 3;
Step 5: determine whether DC[i][j] ∉ {0, 1, 2} holds; if so, add DC[i][j] to the Entity set and go to step 7; otherwise go to step 6;
Step 6: determine whether DC[i][j] == 1 && i != j holds; if so, add the decision attribute of row i to the Entity set and go to step 7; otherwise go directly to step 7;
Step 7: determine whether the traversal is complete; if so, output the Entity set and end the process; otherwise take the next DC[i][j] and continue with step 5.
As the above technical scheme shows, the approximate functional dependency detection method provided by the embodiments of the invention is adapted to the characteristics of network tables and expresses the inherent functional dependencies among attributes more accurately. Because the approximate functional dependency is calculated from the supports of both the consistent and the inconsistent data, the algorithm is clearly robust to noise. The method applies not only to network tables with a header, but also to network tables without a header or whose complete header cannot be recovered by semantic recovery techniques.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a processing flowchart of a method for detecting the entity columns of a network table based on the dependency relationships among attributes according to an embodiment of the present invention;
FIG. 2 is a flowchart of obtaining the candidate functional dependency set according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the process of computing the closure of the approximate functional dependency set according to an embodiment of the present invention;
FIG. 4 is a flowchart of obtaining entity columns using the third normal form (3NF) according to an embodiment of the present invention;
FIG. 5 is a diagram comparing the AFD_Model algorithm, the PFD_Model algorithm, and the evidence-based method (ED_Model) on single-entity-column tables in terms of entity column detection precision, coverage, F-value, and time efficiency, according to an embodiment of the present invention;
FIG. 6 is a schematic diagram comparing the effectiveness of the AFD_Model algorithm and the PFD_Model algorithm in multi-entity-column discovery according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
To solve the technical problems of existing entity column detection algorithms, the invention provides an entity column detection algorithm that has a low computational cost, does not depend on the header or on a knowledge base, and is suitable for network tables with multiple entity columns. The invention addresses the dependence of traditional algorithms on the table header and a knowledge base and their inability to discover multiple entity columns; by introducing the concept of approximate functional dependency it improves robustness to noise while producing high-quality entity column discovery results.
The processing flow of the method for detecting the entity column of the network table based on the dependency relationship among the attributes provided by the embodiment of the invention is shown in fig. 1, and comprises the following processing steps:
Step 1, obtaining a candidate function dependency set according to the approximate function dependency probability among the column values of the network table.
For a network table, if one or more columns can identify the entities described by the table, those columns are defined as entity columns, and the remaining columns are defined as attribute columns.
For each table, the invention calculates the approximate functional dependency probability between any two columns from the functional dependency relationship between their column values. Because the table may contain noise, the supports of both the consistent data and the inconsistent data are introduced.
Definition 1: Let X be an attribute column of the network table T and A an attribute column of T other than X. When the (X, A) attribute value pairs of some of the tuples in T satisfy X → A, X is said to approximately functionally determine A, or A to approximately functionally depend on X, written [notation omitted in source]; the associated probability indicates how likely X → A is to hold on T and is called the approximate functional dependency probability. The (X, A) attribute value pairs for which X → A holds are called consistent data, and the rest are called inconsistent data.
In the network table T, the tuples whose X attribute value is v_x may take different values in attribute column A; let the set of these values be V_A.
If the most frequent value in V_A is not unique, each of the most frequent values is taken in turn as a class center, the sum of the similarities between the other values and that class center is calculated, and the class center v_a with the largest sum is selected as the consistent data (when the most frequent value is unique, it serves as the consistent data directly). The calculation is given by formula (1).
For any class center value v_j: [formula (1) omitted in source]
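The selection of the consistent value can be illustrated with a minimal Python sketch. Formula (1) itself is not available in this text, so the code follows only the prose description: pick the most frequent value, and break ties by the largest summed similarity to the other values. The function names and the choice of `difflib.SequenceMatcher` as the similarity measure are assumptions made for illustration; the patent does not name a similarity function.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Assumed string similarity in [0, 1]; the source names no specific measure."""
    return SequenceMatcher(None, a, b).ratio()


def pick_consistent_value(values: list[str]) -> str:
    """Return the value v_a treated as consistent data for one group V_A."""
    counts = Counter(values)
    top = max(counts.values())
    centers = [v for v, c in counts.items() if c == top]
    if len(centers) == 1:
        return centers[0]  # unique most frequent value

    # Tie: score each class center by its summed similarity to the other values.
    def score(center: str) -> float:
        return sum(similarity(center, v) for v in values if v != center)

    return max(centers, key=score)
```

For example, with the values `["Beijing", "Beijing", "Tokyo", "Tokyo", "Beijng"]` the two most frequent values tie, and the misspelled `"Beijng"` tips the choice toward `"Beijing"`.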
Because column values in a network table may contain spelling errors, the supports of the consistent data and the inconsistent data for a functional dependency are combined to calculate the approximate functional dependency probability between any two columns, from which the candidate functional dependency set is obtained.
FIG. 2 is a flowchart of obtaining the candidate functional dependency set according to an embodiment of the present invention. The specific process is as follows. First, the larger the proportion of consistent data, the more likely X → A is to hold, that is, the higher the support of the consistent data for X → A; likewise, the larger that proportion, the more likely the consistent data is truly consistent. For all tuples whose X value is v_x, both the support of the consistent data v_a for X → A and the confidence of the consistent data are calculated by formula (2).
[formula (2) omitted in source]
where:
V_X = {X.r | X.r = v_x}
V_A' = {A.r | X.r = v_x & A.r = v_a}
|V_X, V_A'| = |{<X.r, A.r> | X.r = v_x & A.r = v_a}|
V_A' is the set of consistent data in column A when column X takes the value v_x; X.r is the value of the cell in row r of column X, and A.r is the value of the cell in row r of column A.
Next, the more similar the inconsistent data is to the consistent data, and the higher the confidence of the consistent data, the greater the support of the inconsistent data for X → A, as given by formula (3).
[formula (3) omitted in source]
where V_A* = {A.r | X.r = v_x & A.r ≠ v_a}.
The support of the set V_X for X → A can be expressed as the weighted sum of the supports of the consistent data and the inconsistent data for X → A, as given by formula (5).
[formula (5) omitted in source]
where ω_1 + ω_2 = 1.
Finally, the supports of all distinct values in X are averaged, and the average is taken as the probability that X → A holds in the network table T, calculated by formula (6):
[formula (6) omitted in source]
where |D_X| is the number of distinct V_X in X.
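The display formulas (5) and (6) do not survive in this text. Going only by the surrounding wording (a weighted sum of the two supports with weights summing to 1, then an average over the distinct values of X), one consistent reading is sketched below. The symbols S and P are assumed names introduced here; S_c and S_nc are the supports named in the text, and D_X is taken to be the collection of distinct V_X.

```latex
S(X \to A, V_X) = \omega_1 \, S_c(X \to A, V_X, V_A')
                + \omega_2 \, S_{nc}(X \to A, V_X, V_A^{*}),
\qquad \omega_1 + \omega_2 = 1

P(X \to A) = \frac{1}{|D_X|} \sum_{V_X \in D_X} S(X \to A, V_X)
```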
Formula (6) gives the probability that X → A holds in table T. The candidate functional dependency set contains all possible approximate functional dependencies in T, and the probability of each is calculated by formula (6).
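The overall calculation can be sketched in Python as follows. Since formulas (2) and (3) are not available in this text, the sketch rests on loudly stated assumptions: the consistent support is the share of rows in a group that carry the majority A value; the inconsistent support is that share scaled by the mean similarity of the remaining values to the majority value; and a group with no inconsistent data gets an inconsistent support of 1. The function name, the default weight `w1 = 0.7`, and the use of `difflib.SequenceMatcher` are hypothetical choices, and the tie-breaking of formula (1) is left out for brevity.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def afd_probability(x_col, a_col, w1=0.7):
    """Estimate how strongly column X approximately determines column A.

    All formulas here are assumptions standing in for formulas (2), (3),
    (5) and (6), which are not reproduced in the source text.
    """
    groups = defaultdict(list)
    for x, a in zip(x_col, a_col):
        groups[x].append(a)  # V_A for each distinct X value v_x

    supports = []
    for a_values in groups.values():
        v_a, n_consistent = Counter(a_values).most_common(1)[0]
        s_c = n_consistent / len(a_values)           # assumed consistent support
        rest = [a for a in a_values if a != v_a]     # inconsistent data V_A*
        if rest:
            mean_sim = sum(SequenceMatcher(None, v_a, a).ratio() for a in rest) / len(rest)
            s_nc = s_c * mean_sim                    # assumed inconsistent support
        else:
            s_nc = 1.0                               # nothing contradicts X -> A
        supports.append(w1 * s_c + (1 - w1) * s_nc)  # weighted sum, w1 + w2 = 1

    # Average over the distinct values of X.
    return sum(supports) / len(supports)
```

Under these assumptions an exact functional dependency scores 1, and a group containing a completely dissimilar outlier lowers the score. Cell values are expected to be strings and the columns non-empty.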
When A approximately functionally depends on X, X is called the decision attribute of that approximate functional dependency. All decision attributes in the approximate functional dependency set form the decision attribute set, and the number of elements in this set is the number of decision attributes, m.
Step 2: delete the noisy functional dependencies from the candidate functional dependency set according to the characteristics of network tables, to obtain the approximate functional dependency set.
Noisy functional dependencies are deleted to obtain a more accurate functional dependency set, which lays the foundation for obtaining the entity columns in the next step. The pruning rules are as follows:
If an approximate functional dependency of A on X satisfies any of the following 3 rules, it is removed from the candidate functional dependency set.
Rule 1: the attribute values of column X are of date, floating-point, or Boolean type;
Rule 2: there is an attribute column Y in T such that [condition omitted in source] holds;
Rule 3: in the candidate functional dependency set, there are attribute columns X and A such that [condition omitted in source] and [condition omitted in source].
Applying these pruning rules to the candidate functional dependency set removes the noisy functional dependencies and yields the approximate functional dependency set.
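Of the three rules, only Rule 1 is fully stated in this text, so the sketch below covers Rule 1 alone. It flags a column as an unsuitable decision attribute when most of its cells look like dates, floating-point numbers, or Booleans. The function names, the 0.8 threshold, the accepted date formats, and the accepted Boolean spellings are all illustrative assumptions; the patent does not say how column types are detected.

```python
from datetime import datetime


def _is_float(text: str) -> bool:
    try:
        float(text)
    except ValueError:
        return False
    return "." in text  # assumption: only values with a decimal point count as floats


def _is_bool(text: str) -> bool:
    return text.strip().lower() in {"true", "false", "yes", "no"}


def _is_date(text: str, formats=("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")) -> bool:
    for fmt in formats:
        try:
            datetime.strptime(text.strip(), fmt)
            return True
        except ValueError:
            continue
    return False


def violates_rule_1(column: list[str], threshold: float = 0.8) -> bool:
    """True if most cells in the column look like dates, floats, or Booleans."""
    if not column:
        return False
    for check in (_is_date, _is_float, _is_bool):
        if sum(check(cell) for cell in column) / len(column) >= threshold:
            return True
    return False
```

Any candidate dependency whose left-hand column makes `violates_rule_1` return `True` would be dropped from the candidate set under this reading.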
Step 3: obtain the entity columns by normalization.
The attribute columns of a network table approximately functionally depend on the entity columns they describe. Following the normalization principles of relational database theory, 3NF normalization is applied to the approximate functional dependency set, and the set of primary keys produced by the normalization gives the entity columns of the network table.
The 3NF normalization of the approximate functional dependency set proceeds as follows:
The dependencies in the approximate functional dependency set are mapped to a relation matrix FD[m][n], and the approximate functional dependencies among the decision attributes are mapped to a relation matrix KK[m][m], where m is the number of attributes on the left-hand side of the approximate functional dependencies, that is, the number of decision attributes, and n is the number of all attribute columns in the network table. For convenience, different relationships between attributes are represented by different numbers, and the elements of the matrices are generated as follows:
(1) The elements of FD[m][n] are generated as follows:
Let α ∈ {decision attribute set} and β ∈ {set of all column attributes}
1) If α = β, then FD[α][β] := 2;
2) If α approximately functionally determines β, then FD[α][β] := 1;
3) Otherwise, FD[α][β] := 0;
(2) The elements of KK[m][m] are generated as follows:
Let α, γ ∈ {decision attribute set}
1) If α = γ or α approximately functionally determines γ, then KK[α][γ] := 1;
2) Otherwise, KK[α][γ] := -1;
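The construction of the two matrices can be sketched in Python. For readability the sketch uses nested dictionaries keyed by attribute name instead of integer-indexed arrays, which is a presentational assumption. The function name `build_matrices` is hypothetical, and the values 1 for "approximately determines" and -1 for the "otherwise" case of KK follow one reading of the partly garbled source text.

```python
def build_matrices(columns, afds):
    """Build FD and KK as nested dicts keyed by attribute name.

    columns: all attribute columns of the table (n of them).
    afds: set of (X, A) pairs meaning "X approximately determines A".
    """
    # Decision attributes are the columns that appear on a left-hand side (m of them).
    decision = [c for c in columns if any(x == c for x, _ in afds)]

    # FD[alpha][beta]: 2 on the diagonal, 1 for a dependency, 0 otherwise.
    fd = {a: {b: 2 if a == b else 1 if (a, b) in afds else 0 for b in columns}
          for a in decision}

    # KK[alpha][gamma]: 1 if equal or dependent, -1 otherwise (assumed reading).
    kk = {a: {g: 1 if a == g or (a, g) in afds else -1 for g in decision}
          for a in decision}
    return decision, fd, kk
```

For a table with columns City, Country, and Continent and the dependencies City → Country, Country → Continent, and City → Continent, the decision attributes are City and Country, so FD has two rows and three columns and KK is two by two.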
For ease of description, Definition 3 defines the approximate transitive functional dependency as follows:
Definition 3: In the network table T, if [condition omitted in source], then Z is said to be approximately transitively functionally dependent on X, written [notation omitted in source], where Y is the intermediary key of the approximate transitive functional dependency.
FIG. 3 is a schematic diagram of the process of computing the closure DC[m][n] of the approximate functional dependency set. DC[m][n] is determined from FD[m][n] and KK[m][m] as follows.
Step 1: copy the elements of FD[m][n] into DC[m][n]; set i := 0, where i denotes the i-th approximate functional dependency in KK[m][m];
Step 2: set i := 1;
Step 3: determine whether [dependency omitted in source] exists in KK[m][m] and [dependency omitted in source] exists in DC[m][n]; if so, set DC[m][n] := β_i and go to step 4; otherwise go directly to step 4;
Step 4: determine whether an (i+1)-th approximate functional dependency exists in KK[m][m]; if so, go to step 5; otherwise go directly to step 6;
Step 5: set i := i + 1 and return to step 3;
Step 6: determine whether DC[m][n] has changed; if so, return to step 2; otherwise output DC[m][n] and end the process.
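The closure loop can be sketched in Python under one possible reading of the steps above, since the dependencies tested in step 3 are not legible in this text. The assumed reading is: whenever α determines γ in KK and γ determines β in DC, the cell DC[α][β] is marked as transitive by storing the intermediary key γ in it, and the passes repeat until nothing changes. Storing the intermediary key's name in the cell, and never overwriting a cell that already holds one, are assumptions chosen so that the loop terminates and so that later steps can tell transitive cells from the values 0, 1, and 2. Matrices are nested dictionaries keyed by attribute name.

```python
def closure(fd, kk):
    """Return DC: a copy of FD in which transitive cells hold their intermediary key."""
    dc = {a: dict(row) for a, row in fd.items()}   # step 1: copy FD into DC
    changed = True
    while changed:                                  # step 6: repeat until stable
        changed = False
        for alpha, row in kk.items():               # steps 2 to 5: scan KK
            for gamma, flag in row.items():
                if flag != 1 or gamma == alpha:
                    continue                        # not a dependency alpha -> gamma
                for beta, value in dc[gamma].items():
                    if beta in (alpha, gamma) or value == 0:
                        continue                    # gamma does not determine beta
                    if dc[alpha][beta] in (0, 1):
                        dc[alpha][beta] = gamma     # alpha -> gamma -> beta
                        changed = True
    return dc
```

With City → Country and Country → Continent, the cell for City and Continent ends up holding `"Country"`, which marks Country as the intermediary key.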
FIG. 4 is a flow chart of the method for obtaining entity columns using the third normal form (3NF). Mislabeled approximate transitive dependences are first corrected according to the approximate function dependence set closure DC[m][n] described above. Finally, the decision attributes that exist only in direct approximate function dependences, together with the intermediary keys, are output as entity columns. They are found by the following steps:
Step 1: input DC[m][n] and FD[m][n];
Step 2: i := 0, j := 0, where i and j denote the row number and column number of DC[m][n];
Step 3: determine whether DC[i][j] != 1 && FD[i][j] == 1 && FD[j][i] == 1 holds; if so, set DC[i][j] := 1 and perform Step 4; otherwise, perform Step 4 directly;
Step 4: determine whether the traversal is complete; if so, set i := 0 and j := 0 and perform Step 5; otherwise, take the next DC[i][j] and perform Step 3;
Step 5: determine whether DC[i][j] ∉ {0, 1, 2} holds; if so, add DC[i][j] to the Entity set and perform Step 7; otherwise, perform Step 6;
Step 6: determine whether DC[i][j] == 1 && i != j holds; if so, add the decision attribute of row i to the Entity set and perform Step 7; otherwise, perform Step 7 directly;
Step 7: determine whether the traversal is complete; if so, output the Entity set and end the process; otherwise, take the next DC[i][j] and continue with Step 5.
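The two passes of FIG. 4 can be sketched in Python as follows. This is a hedged illustration rather than the patented implementation: the function name `entity_columns`, the dictionary-of-dictionaries matrix layout, and the sample data are assumptions, and the Step 3 condition is read here as DC[i][j] != 1 && FD[i][j] == 1 && FD[j][i] == 1, which is a reconstruction of a partly illegible expression.

```python
def entity_columns(dc, fd):
    """Return entity columns from the closure DC and the direct matrix FD.

    Both matrices are dicts of dicts indexed by column name. Entries are
    0 (no dependence), 1 (direct dependence), 2 (same column), or the name
    of an intermediary key for a transitive dependence.
    """
    dc = {a: dict(row) for a, row in dc.items()}   # work on a copy of DC

    # Steps 2 to 4: restore direct dependences between mutually dependent
    # decision attributes that the closure relabelled as transitive.
    for i in dc:
        for j in dc[i]:
            mutual = fd[i].get(j) == 1 and fd.get(j, {}).get(i) == 1
            if dc[i][j] != 1 and mutual:
                dc[i][j] = 1

    # Steps 5 to 7: collect intermediary keys and decision attributes
    # that take part in a direct dependence.
    entity = []
    for i in dc:
        for j, mark in dc[i].items():
            if mark not in (0, 1, 2):
                candidate = mark          # Step 5: entry holds an intermediary key
            elif mark == 1 and i != j:
                candidate = i             # Step 6: decision attribute of row i
            else:
                continue
            if candidate not in entity:
                entity.append(candidate)
    return entity


# Hypothetical table with columns city, country, continent.
fd_example = {
    "city": {"city": 2, "country": 1, "continent": 1},
    "country": {"city": 0, "country": 2, "continent": 1},
}
dc_example = {
    "city": {"city": 2, "country": 1, "continent": "country"},
    "country": {"city": 0, "country": 2, "continent": 1},
}
print(entity_columns(dc_example, fd_example))   # ['city', 'country']
```

In this example both `city` and `country` are reported, which corresponds to the multiple-entity-column case discussed in the description.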
In summary, the approximate function dependence detection method provided by the embodiment of the present invention is adapted to the characteristics of network tables and can express the inherent function dependence relationships between attributes more accurately. Because the approximate function dependence is calculated from the support degrees of both the consistency data and the inconsistency data for the function dependence, the algorithm is markedly resistant to noise.
The entity column discovery algorithm based on approximate function dependence and normalization provided by the embodiment of the invention can discover entity columns in more scenarios. It is suitable not only for network tables with a single entity column but also for tables with multiple entity columns, and not only for network tables with a header but also for network tables that have no header or whose complete header cannot be recovered by semantic recovery techniques.
Compared with the prior art, the method discovers entity columns with high quality and can discover multiple entity columns. To verify these advantages, we performed a number of experiments on data from two sources: the open-source Wiki Table dataset, and network tables that we crawled from the web, which we call the Web Table dataset. The collected network tables are divided by row count into a large-table data set (more than 100 rows), the L data set for short, and a small-table data set (fewer than 100 rows), the S data set for short. To facilitate experimental validation of single-entity column and multi-entity column discovery, we split the L data set into L single-entity sets (WiKi_LS and Web_LS) and L multi-entity sets (WiKi_LM and Web_LM), and the S data set into S single-entity sets (WiKi_SS and Web_SS) and S multi-entity sets (WiKi_SM and Web_SM).
The method of the invention discovers entity columns based on the function dependence relationships among column values, does not depend on header information or a knowledge base, and improves the quality of entity column discovery. To verify the effectiveness of the algorithm of the embodiment of the present invention (AFD_Model) in noise reduction, a PFD_Model algorithm was also implemented; it is identical to the AFD_Model algorithm except that it does not take table noise into account. FIG. 5 compares the entity column detection accuracy, recall, F-value, and time efficiency of AFD_Model, PFD_Model, and the evidence-based method (ED_Model) on tables with a single entity column. FIG. 5 shows that the AFD_Model algorithm of the present invention is superior overall to the ED_Model and PFD_Model algorithms. In terms of accuracy, the ED_Model algorithm requires that the header of a network table have a concept-attribute relationship in the Probase library, so the quality of the header and the coverage of the knowledge base affect its accuracy; the AFD_Model algorithm depends on neither header information nor a knowledge base, so its accuracy is high. The AFD_Model algorithm also takes the characteristics of network tables into account and has a certain noise filtering capability, so its entity detection accuracy is higher than that of the PFD_Model algorithm. In terms of recall, the AFD_Model algorithm is higher than both the ED_Model and PFD_Model algorithms: it does not require a network table to have a header, does not require the entity column and the non-entity columns in the table to have an attribute relationship, and does not require the concept-attribute relationship to exist in the Probase library, and it has a certain noise filtering capability, so it is more adaptable. The F-value measures the quality of the algorithm as a whole, and here too the AFD_Model algorithm has a clear advantage. In terms of running time, the time cost of the ED_Model algorithm is significantly greater than that of the AFD_Model and PFD_Model algorithms, because the ED_Model algorithm must use the Probase library to determine the concept-attribute relationship of the table header, or of the semantically recovered header, in order to determine the entity column, whereas the time complexity of the AFD_Model and PFD_Model algorithms depends only on the size of the table.
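The text does not state how the accuracy, recall, and F-value are computed. As a hedged illustration only, the sketch below assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), and the F-value as their harmonic mean), counted over entity columns across all tables. The function name `evaluate` and the sample tables are hypothetical.

```python
def evaluate(predicted, truth):
    """Micro-averaged precision, recall, and F-value for entity column detection.

    predicted and truth map a table identifier to a set of entity column names.
    """
    tp = fp = fn = 0
    for table_id, gold in truth.items():
        found = predicted.get(table_id, set())
        tp += len(found & gold)   # entity columns correctly detected
        fp += len(found - gold)   # columns wrongly reported as entity columns
        fn += len(gold - found)   # entity columns that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Hypothetical ground truth and detector output for three tables.
truth = {"t1": {"city"}, "t2": {"name"}, "t3": {"title", "author"}}
predicted = {"t1": {"city"}, "t2": {"id"}, "t3": {"title"}}
precision, recall, f_value = evaluate(predicted, truth)
print(round(precision, 3), round(recall, 3), round(f_value, 3))   # 0.667 0.5 0.571
```

Counting per column rather than per table lets the same function score both the single-entity and the multi-entity data sets.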
The method is also suitable for tables with multiple entity columns, which markedly broadens its applicability. Because the ED_Model algorithm cannot discover multiple entity columns, the method of the invention is compared only with the PFD_Model algorithm here. FIG. 6 is a schematic diagram comparing the effectiveness of the AFD_Model and PFD_Model algorithms in multi-entity column discovery according to the embodiment of the present invention. FIG. 6 shows that the AFD_Model algorithm outperforms the PFD_Model algorithm in accuracy, recall, and F-value, because the AFD_Model algorithm takes the effect of noise data into account when computing the approximate function dependence between attributes.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the embodiments or in some parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for the parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, because the apparatus or system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference may be made to the corresponding parts of the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description covers only the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any change or substitution that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention falls within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (4)

1. A method for detecting an entity column of a network table based on dependency relationships between attributes, characterized by comprising the following steps:
for a network table, calculating the approximate function dependence probability between any two columns according to the function dependence relationship between the column values, and acquiring a candidate function dependence set according to the approximate function dependence probability;
according to the characteristics of the network table, deleting the noise function dependence in the candidate function dependence set to obtain an approximate function dependence set;
performing 3NF normalization on the approximate function dependence set, and taking the primary key set generated after 3NF normalization as the entity column of the network table;
wherein calculating, for a network table, the approximate function dependence probability between any two columns according to the function dependence relationship between the column values, and acquiring a candidate function dependence set according to the approximate function dependence probability, comprises:
let X be an attribute in the network table T and A be an attribute in T different from X; when the attribute value pairs (X, A) of some of the tuples in T make X → A hold, X is said to approximately function-determine A, or A to be approximately function-dependent on X, and the probability that X → A holds on T is called the approximate function dependence probability; the data in the attribute value pairs (X, A) for which X → A holds are called consistency data, and the rest are called inconsistency data;
in the network table T, for the tuples whose X attribute value is $v_x$, the A attribute column may take different values, and the set of these different values is denoted $V_A$;
if the most frequent value in the set $V_A$ is not unique, each of the most frequent values is taken in turn as a class center, the sum of the similarities between the other values and the class center value is calculated, and the class center value $v_a$ with the largest sum is selected as the consistency data; the specific calculation is given by formula 1, which is evaluated for each class center value $v_j$;
for all tuples in which X takes the value $v_x$, the support degree $S_c(X \to A, V_X, V_{A'})$ of the consistency data $v_a$ for the establishment of X → A is calculated by formula 2, where $V_{A'}$ is the consistency data in column A corresponding to the tuples in which column X takes the value $v_x$;
wherein:
$V_X = \{X.r \mid X.r = v_x\}$
$V_{A'} = \{A.r \mid X.r = v_x \land A.r = v_a\}$
$|V_X, V_{A'}| = |\{\langle X.r, A.r \rangle \mid X.r = v_x \land A.r = v_a\}|$
X.r is the value of the cell in row r of column X, and A.r is the value of the cell in row r of column A;
the support degree of the inconsistency data for the establishment of X → A is calculated by formula 3;
the support degree of the set $V_X$ for the establishment of X → A is expressed as the weighted sum of the support degrees of the consistency data and the inconsistency data for X → A, and is calculated by formula 5, wherein $\omega_1 + \omega_2 = 1$;
the support degrees of all distinct values in X are taken, and their average is used as the probability that X → A holds in the network table T, calculated by formula 6, wherein $|D_X|$ denotes the number of distinct $V_X$ in X;
this average is the probability that the approximate function dependence holds in the network table T, and the candidate function dependence set comprises all possible approximate function dependences in the network table T;
wherein performing 3NF normalization on the approximate function dependence set and taking the primary key set generated after 3NF normalization as the entity column of the network table comprises:
mapping the approximate function dependence relationships in the approximate function dependence set to a relation matrix FD[m][n], and mapping the approximate function dependence relationships among the decision attributes to a relation matrix KK[m][m], wherein m is the number of attributes on the left side of an approximate function dependence arrow, namely the number of decision attributes, and n is the number of all attribute columns in the network table;
(1) the elements of FD[m][n] are generated as follows:
let α ∈ {decision attribute set} and β ∈ {set of all column attributes};
1) if α = β, then FD[α][β] := 2;
2) if the approximate function dependence of β on α belongs to the approximate function dependence set, then FD[α][β] := 1;
3) otherwise, FD[α][β] := 0;
(2) the elements of KK[m][m] are generated as follows:
let α, γ ∈ {decision attribute set};
1) if α = γ, or the approximate function dependence of γ on α belongs to the approximate function dependence set, then KK[α][γ] := 1;
2) otherwise, KK[α][γ] := -1;
definition: in the network table T, if Y is approximately function-dependent on X and Z is approximately function-dependent on Y, then Z is said to have an approximate transitive function dependence on X, wherein Y is the intermediary key of the approximate transitive function dependence;
determining an approximate function dependency set closure DC [ m ] [ n ] according to the relation matrix FD [ m ] [ n ] and the relation matrix KK [ m ] [ m ], determining decision attributes and intermediate keys only existing in direct approximate function dependency according to the approximate function dependency set closure DC [ m ] [ n ], and outputting the decision attributes only existing in the direct approximate function dependency and the intermediate keys as entity columns of the network table.
2. The method of claim 1, wherein deleting the noise function dependence in the candidate function dependence set according to the characteristics of the network table to obtain an approximate function dependence set comprises:
if an approximate function dependence of an attribute A on an attribute X in the candidate function dependence set satisfies any of the following 3 rules, it is eliminated from the candidate function dependence set:
Rule 1: the attribute values of column X are of date type, floating-point type, or Boolean type;
Rule 2: there is an attribute column Y in the network table T such that [formula missing in source] holds;
Rule 3: in the candidate function dependence set, there are attribute columns X and A such that [formula missing in source] and [formula missing in source].
3. The method according to claim 1, wherein determining the approximate function dependence set closure DC[m][n] from the relation matrix FD[m][n] and the relation matrix KK[m][m] comprises:
Step 1: copy the elements of FD[m][n] to DC[m][n]; i := 0, where i denotes the i-th approximate function dependence in KK[m][m];
Step 2: i := 1;
Step 3: determine whether the i-th approximate function dependence exists in KK[m][m] and the approximate function dependence that continues it exists in DC[m][n]; if so, set DC[m][n] := β_i and perform Step 4; otherwise, perform Step 4 directly;
Step 4: determine whether an (i+1)-th approximate function dependence exists in KK[m][m]; if so, perform Step 5; otherwise, perform Step 6 directly;
Step 5: i := i + 1, and return to Step 3;
Step 6: determine whether DC[m][n] has changed; if so, return to Step 2; otherwise, output DC[m][n] and end the process.
4. The method according to claim 3, wherein determining, according to the approximate function dependence set closure DC[m][n], the decision attributes existing only in direct approximate function dependences and the intermediary keys comprises:
Step 1: input DC[m][n] and FD[m][n];
Step 2: i := 0, j := 0, where i and j denote the row number and column number of DC[m][n];
Step 3: determine whether DC[i][j] != 1 && FD[i][j] == 1 && FD[j][i] == 1 holds; if so, set DC[i][j] := 1 and perform Step 4; otherwise, perform Step 4 directly;
Step 4: determine whether the traversal is complete; if so, set i := 0 and j := 0 and perform Step 5; otherwise, take the next DC[i][j] and perform Step 3;
Step 5: determine whether DC[i][j] ∉ {0, 1, 2} holds; if so, add DC[i][j] to the Entity set and perform Step 7; otherwise, perform Step 6;
Step 6: determine whether DC[i][j] == 1 && i != j holds; if so, add the decision attribute of row i to the Entity set and perform Step 7; otherwise, perform Step 7 directly;
Step 7: determine whether the traversal is complete; if so, output the Entity set and end the process; otherwise, take the next DC[i][j] and continue with Step 5.
CN201710002389.7A 2017-01-03 2017-01-03 method for detecting entity column of network table based on dependency relationship between attributes Expired - Fee Related CN106844338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710002389.7A CN106844338B (en) 2017-01-03 2017-01-03 method for detecting entity column of network table based on dependency relationship between attributes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710002389.7A CN106844338B (en) 2017-01-03 2017-01-03 method for detecting entity column of network table based on dependency relationship between attributes

Publications (2)

Publication Number Publication Date
CN106844338A CN106844338A (en) 2017-06-13
CN106844338B true CN106844338B (en) 2019-12-10

Family

ID=59117509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710002389.7A Expired - Fee Related CN106844338B (en) 2017-01-03 2017-01-03 method for detecting entity column of network table based on dependency relationship between attributes

Country Status (1)

Country Link
CN (1) CN106844338B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595624A (en) * 2018-04-23 2018-09-28 南京大学 A kind of large-scale distributed functional dependence discovery method
CN109472013B (en) * 2018-10-25 2020-06-16 北京交通大学 Foreign key relation detection method between network tables based on distribution fitting
CN111061923B (en) * 2019-12-13 2022-08-02 北京航空航天大学 Graph data entity recognition system based on graph dependence rule and supervised learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077181A (en) * 2012-11-20 2013-05-01 深圳市华傲数据技术有限公司 Method for automatically generating approximate functional dependency rule
CN104281563A (en) * 2013-07-01 2015-01-14 国际商业机器公司 Method and system for discovering relationships in tabular data
CN104794222A (en) * 2015-04-29 2015-07-22 北京交通大学 Network table semantic recovery method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014074873A1 (en) * 2012-11-09 2014-05-15 Kla-Tencor Corporation Reducing algorithmic inaccuracy in scatterometry overlay metrology

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Functional Dependency Generation and Applications in Pay-as-You-Go data Integration Systems;WANG D G 等;《Proceedings of the 12th International Workshop on the Web and Databases》;20091231;第1654-1655页 *
基于函数依赖的导出关系候选码计算;黎章海 等;《计算机工程》;20160531;第42卷(第5期);第60-65页 *
网络表格的实体列发现与标识;任向冉;《中国优秀硕士学位论文全文数据库 信息科技辑》;20151015(第10期);第I138-606页 *

Also Published As

Publication number Publication date
CN106844338A (en) 2017-06-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20191210

Termination date: 20210103
