CN106844338A - Method for detecting entity columns of web tables based on dependencies between attributes

Info

Publication number
CN106844338A
Authority
CN
China
Prior art keywords
functional dependencies
approximate
column
web table
attribute
Prior art date
Legal status
Granted
Application number
CN201710002389.7A
Other languages
Chinese (zh)
Other versions
CN106844338B (en)
Inventor
王宁
张丽方
Current Assignee
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Application filed by Beijing Jiaotong University
Priority to CN201710002389.7A
Publication of CN106844338A
Application granted
Publication of CN106844338B
Legal status: Expired - Fee Related


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for detecting the entity columns of a web table based on dependencies between attributes. For a web table, the approximate functional dependency probability between any two columns is computed from the functional dependencies between column values, and a candidate functional dependency set is obtained from these probabilities. According to the characteristics of web tables, the noisy functional dependencies are deleted from the candidate functional dependency set to obtain an approximate functional dependency set. The approximate functional dependency set is normalized to 3NF, and the set of primary keys produced by the 3NF normalization is taken as the entity columns of the web table. The method expresses the functional dependencies between attributes more accurately. Because approximate functional dependencies are computed from the support that both consistent data and inconsistent data give to a functional dependency, the algorithm has clear noise resistance. The method applies not only to web tables with a single entity column but also to tables with multiple entity columns.

Description

Method for detecting entity columns of web tables based on dependencies between attributes
Technical field
The present invention relates to the technical field of web information processing, and in particular to a method for detecting the entity columns of a web table based on dependencies between attributes.
Background
With the development of information technology, the resources on the Internet have become increasingly rich. Besides unstructured data, a large number of web tables exist; compared with text, web tables have better structural characteristics and have therefore attracted great attention. How to make machines better understand the semantics of web tables has become a major challenge for improving the coverage and accuracy of table search. The entity column identifies the entities described by a web table, and its column label describes the topic of the whole table, from which the semantic information of the table can be determined. If the entity columns of web tables can be detected accurately, machine understanding of web table semantics can be greatly improved.
One prior-art entity column discovery algorithm is the evidence-based algorithm proposed by Wang et al. It uses Probase as the knowledge base and relies on two pieces of evidence to discover the entity column of a web table. First, all entities in the entity column describe the same concept. Second, a concept-attribute relationship exists between the concept expressed by the entity column and the concepts expressed by the other, non-entity columns.
In the evidence-based entity column discovery algorithm, for each candidate schema $s$ of a web table, when one column $col$ is selected as the entity column, the remaining columns are treated as attributes of the entity column. The scores of all candidate entity columns are computed, and the candidate with the highest score is selected as the entity column of the web table. The objective function is as follows:
where $SC_A$ is the set of all possible concept-attribute relations of attribute set $A$, $c_i$ is the concept described by attribute set $A_i$, and $sa_i$ is the confidence that attribute set $A$ is an attribute of concept $c_i$; $SC_E$ is the set of all possible concept-entity relations of entity set $E$, $c_i$ is the concept to which entity set $E_i$ belongs, and $se_i$ is the confidence that entity set $E$ belongs to concept $c_i$; $A_{col}$ denotes the set of all attributes in candidate schema $s$ except column $col$; $E_{col}$ denotes the set of all column values in $col$ except the table header.
The drawbacks of the above prior-art algorithm are as follows. First, the method depends on the headers of web tables and on a knowledge base, and it incurs a large computational overhead. A knowledge base does cover many entities, attributes, concepts and the relations among them, but it can hardly cover all the entities, attributes, concepts and relations on the web. Meanwhile, web tables often lack header information, and it is difficult to recover a header accurately from a knowledge base alone, especially the labels of numeric and date columns. As a result, the recall and precision of the evidence-based algorithm are relatively low. Second, the evidence-based method can only discover the entity column of web tables with a single entity column and ignores web tables with multiple entity columns. Many tables on the web have more than one entity column, so the algorithm is limited.
Summary of the invention
Embodiments of the present invention provide a method for detecting the entity columns of a web table based on dependencies between attributes, so that the entity columns of web tables can be found effectively.
To achieve this, the invention adopts the following technical solution.
A method for detecting the entity columns of a web table based on dependencies between attributes, comprising:
for a web table, computing the approximate functional dependency probability between any two columns from the functional dependencies between column values, and obtaining a candidate functional dependency set from these probabilities;
according to the characteristics of web tables, deleting the noisy functional dependencies from the candidate functional dependency set to obtain an approximate functional dependency set;
normalizing the approximate functional dependency set to 3NF, and taking the set of primary keys produced by the 3NF normalization as the entity columns of the web table.
Further, computing the approximate functional dependency probability between any two columns of a web table from the functional dependencies between column values, and obtaining the candidate functional dependency set from these probabilities, comprises:
Let $X$ be an attribute of web table $T$ and $A$ an attribute of $T$ different from $X$. When the $(X, A)$ attribute value pairs of some of the tuples in $T$ make $X \to A$ hold, $X$ is said to approximately functionally determine $A$, or $A$ is said to be approximately functionally dependent on $X$. The likelihood that $X \to A$ holds on $T$ is the approximate functional dependency probability. Among the $(X, A)$ attribute value pairs, the data that make $X \to A$ hold are called consistent data, and the rest are called inconsistent data;
For the tuples of web table $T$ whose $X$ attribute value is $v_x$, the $A$ attribute column may contain different values; let the set of these different values be $V_A$;
If the most frequent value in $V_A$ is unique, that value is taken as the consistent data. If the most frequent value is not unique, each of the most frequent values is taken as a class center, the sum of the similarities between the other values and the class center is computed, and the class center $v_a$ with the largest sum is selected as the consistent data. The calculation is given by formula (1).
for any class center $v_j$;
For all tuples whose $X$ value is $v_x$, the support $S_c(X \to A, V_X, V_A')$ that the consistent data $v_a$ gives to $X \to A$ is computed by formula (2);
where:
$V_X = \{X.r \mid X.r = v_x\}$
$V_A' = \{A.r \mid X.r = v_x \;\&\; A.r = v_a\}$
$|V_X, V_A'| = |\{\langle X.r, A.r \rangle \mid X.r = v_x \;\&\; A.r = v_a\}|$
$V_A'$ is the set of consistent data in column $A$ when column $X$ takes the value $v_x$; $X.r$ is the value of the cell in row $r$ of column $X$, and $A.r$ is the value of the cell in row $r$ of column $A$;
The support $S_{nc}(X \to A, V_X, V_A^*)$ that the inconsistent data give to $X \to A$ is computed by formula (3);
The support that the set $V_X$ gives to $X \to A$ is expressed as the weighted sum of the supports that the consistent data and the inconsistent data give to $X \to A$, and is computed by formula (5):
where $\omega_1 + \omega_2 = 1$;
The supports of all distinct values in $X$ are taken, and their average is used as the probability that $X \to A$ holds in web table $T$, computed by formula (6):
where $|D_X|$ is the number of distinct $V_X$ in $X$;
Each approximate functional dependency in web table $T$ thus has a probability of holding, and the candidate functional dependency set contains all possible approximate functional dependencies in $T$.
Further, deleting the noisy functional dependencies from the candidate functional dependency set according to the characteristics of web tables, to obtain the approximate functional dependency set, comprises:
If an approximate functional dependency in the candidate functional dependency set satisfies any one of the following three rules, it is deleted from the candidate approximate functional dependency set:
Rule 1: the attribute values of column $X$ are of date, floating-point or Boolean type;
Rule 2: there exists an attribute column $Y$ in web table $T$ such that [...] holds;
Rule 3: in the candidate approximate functional dependency set, there exist attribute columns $X$ and $A$ such that [...] and [...].
Further, normalizing the approximate functional dependency set to 3NF and taking the set of primary keys produced by the 3NF normalization as the entity columns of the web table comprises:
The approximate functional dependencies in the approximate functional dependency set are mapped to a relation matrix FD[m][n], and the approximate functional dependencies between determinant attributes are mapped to a relation matrix KK[m][m], where m is the number of attributes that appear on the left-hand side of an approximate functional dependency, i.e. the number of determinant attributes, and n is the number of all attribute columns in the web table:
(1) The elements of FD[m][n] are produced as follows:
Let α ∈ {determinant attribute set}, β ∈ {set of all column attributes}
1) If α = β, then FD[α][β] := 2;
2) If β is approximately functionally dependent on α, then FD[α][β] := 1;
3) Otherwise, FD[α][β] := 0;
(2) The elements of KK[m][m] are produced as follows:
Let α, γ ∈ {determinant attribute set}
1) If α = γ, or an approximate functional dependency holds between α and γ, then KK[α][γ] := 1;
2) Otherwise, KK[α][γ] := -1;
Definition: in web table T, if [...], Z is said to be approximately transitively functionally dependent on X, where Y is the intermediary key of the approximate transitive functional dependency;
The closure DC[m][n] of the approximate functional dependency set is determined from the relation matrices FD[m][n] and KK[m][m]. The determinant attributes that occur only in direct approximate functional dependencies, together with the intermediary keys, are determined from the closure DC[m][n] and are output as the entity columns of the web table.
Further, determining the closure DC[m][n] of the approximate functional dependency set from the relation matrices FD[m][n] and KK[m][m] comprises:
Step 1: copy the elements of FD[m][n] into DC[m][n]; i := 0, where i denotes the i-th approximate functional dependency in KK[m][m];
Step 2: i := 1;
Step 3: determine whether [...] exists in KK[m][m] and [...] exists in DC[m][n]; if so, set DC[m][n] := β_i and go to Step 4; otherwise, go directly to Step 4;
Step 4: determine whether an (i+1)-th approximate functional dependency exists in KK[m][m]; if so, go to Step 5; otherwise, go directly to Step 6;
Step 5: i := i+1, and return to Step 3;
Step 6: determine whether DC[m][n] has changed; if it has, return to Step 2; otherwise, output DC[m][n] and end the procedure.
Further, determining from the closure DC[m][n] the intermediary keys and the determinant attributes that occur only in direct approximate functional dependencies comprises:
Step 1: input DC[m][n] and FD[m][n];
Step 2: i := 0, j := 0, where i and j are the row index and column index of DC[m][n];
Step 3: determine whether DC[i][j] != {0,1,2} && FD[i][j] = 1 && FD[j][i] = 1 holds; if so, set DC[i][j] := 1 and go to Step 4; otherwise, go directly to Step 4;
Step 4: determine whether the traversal is complete; if all elements have been traversed, set i := 0, j := 0 and go to Step 5; otherwise, take the next DC[i][j] and go to Step 3;
Step 5: determine whether DC[i][j] != {0,1,2} holds; if so, add DC[i][j] to the Entity set and go to Step 7; otherwise, go to Step 6;
Step 6: determine whether DC[i][j] = 1 && i != j holds; if so, add the determinant attribute of row i to the Entity set and go to Step 7; otherwise, go directly to Step 7;
Step 7: determine whether the traversal is complete; if it is, output the Entity set and end the procedure; otherwise, take the next DC[i][j] and continue with Step 5.
As can be seen from the technical solution provided by the above embodiments, the approximate functional dependency detection method adapted to the characteristics of web tables expresses the functional dependencies between attributes more accurately. Because approximate functional dependencies are computed from the support that both consistent data and inconsistent data give to a functional dependency, the algorithm has clear noise resistance. Entity columns can be discovered in more scenarios: the method applies not only to web tables with a single entity column but also to tables with multiple entity columns, and not only to web tables with headers but also to web tables that have no header or whose complete header cannot be recovered even with semantic recovery techniques.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, and will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a processing flowchart of a method for detecting the entity columns of a web table based on dependencies between attributes, provided by an embodiment of the invention;
Fig. 2 is a processing flowchart for obtaining the candidate dependency set, provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of the process of finding the closure of the approximate functional dependency set from the approximate functional dependency set, provided by an embodiment of the invention;
Fig. 4 is a flowchart of obtaining the entity columns using third normal form, provided by an embodiment of the invention;
Fig. 5 is a schematic comparison of the AFD_Model algorithm provided by an embodiment of the invention with the PFD_Model algorithm and the evidence-based method (ED_Model) in terms of entity column detection precision, coverage, F-value and time efficiency on tables with a single entity column;
Fig. 6 is a schematic comparison of the effectiveness of the AFD_Model algorithm provided by an embodiment of the invention and the PFD_Model algorithm in discovering multiple entity columns.
Detailed description of the embodiments
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting the claims.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in this specification refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the prior art and, unless defined as herein, will not be interpreted in an idealized or overly formal sense.
To facilitate understanding of the embodiments of the invention, several specific embodiments are further explained below with reference to the drawings; none of these embodiments limits the embodiments of the invention.
To solve the technical problems of the existing entity column detection algorithms described above, the invention provides an entity column detection algorithm that has low computational overhead, does not rely on table headers or a knowledge base, and is suitable for web tables with multiple entity columns. The invention addresses the problems that traditional algorithms depend on the headers of web tables and on a knowledge base and cannot discover multiple entity columns. By introducing the concept of approximate functional dependency, it improves the noise resistance of the method while obtaining high-quality entity column discovery results.
The processing flow of the method for detecting the entity columns of a web table based on dependencies between attributes, provided by an embodiment of the invention, is shown in Fig. 1 and comprises the following steps:
Step 1: obtain the candidate functional dependency set from the approximate functional dependency probabilities between the column values of the web table.
For a web table, if one or more of its columns identify the entities described by the table, that column or those columns are defined as entity columns, and the columns other than the entity columns are defined as attribute columns.
For each table, the invention computes the approximate functional dependency probability between any two columns from the functional dependencies between column values. Because tables may contain noise, the support of both consistent data and inconsistent data is introduced.
Definition 1: Let $X$ be an attribute column of web table $T$ and $A$ an attribute column of $T$ different from $X$. When the $(X, A)$ attribute value pairs of some of the tuples in $T$ make $X \to A$ hold, $X$ is said to approximately functionally determine $A$, or $A$ is said to be approximately functionally dependent on $X$. The likelihood that $X \to A$ holds on $T$ is the approximate functional dependency probability. Among the $(X, A)$ attribute value pairs, the data that make $X \to A$ hold are called consistent data, and the rest are called inconsistent data.
For the tuples of web table $T$ whose $X$ attribute value is $v_x$, the $A$ attribute column may contain different values; let the set of these different values be $V_A$.
If the most frequent value in $V_A$ is unique, that value is taken as the consistent data. If the most frequent value is not unique, each of the most frequent values is taken as a class center, the sum of the similarities between the other values and the class center is computed, and the class center $v_a$ with the largest sum is selected as the consistent data. The calculation is given by formula (1).
for any class center $v_j$.
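The selection of the consistent value described above can be sketched as follows. This is a minimal illustration, not the patented formula: formula (1) and its similarity measure are not reproduced in the text, so `similarity` below uses `difflib.SequenceMatcher` purely as an assumed stand-in, and the function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure; the source does not specify the one in formula (1)."""
    return SequenceMatcher(None, a, b).ratio()


def pick_consistent_value(a_values):
    """Return v_a for the A-values of all tuples that share one X-value v_x."""
    counts = Counter(a_values)
    top = max(counts.values())
    centers = [v for v, c in counts.items() if c == top]
    if len(centers) == 1:
        # The most frequent value is unique: it is the consistent value.
        return centers[0]

    # Tie: score each class center by its summed similarity to the other
    # distinct values, and keep the center with the largest sum.
    def score(center):
        return sum(similarity(center, v) for v in counts if v != center)

    return max(centers, key=score)
```

For example, with the values `["New York", "New York", "NYC", "NYC", "New York City"]` the two most frequent values tie, and the sketch keeps `"New York"` because it is closer to the remaining values under the assumed measure.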
Because the column values of a web table may contain erroneous entries, the supports that consistent data and inconsistent data give to a functional dependency are combined to compute the approximate functional dependency probability between any two columns, which yields the candidate functional dependency set.
Fig. 2 is a processing flowchart for obtaining the candidate dependency set, provided by an embodiment of the invention. The procedure is as follows. First, the larger the proportion of consistent data, the more likely it is that $X \to A$ holds, i.e. the higher the support that the consistent data give to $X \to A$; at the same time, the larger the proportion of consistent data, the more likely it is that the consistent data are truly consistent. For all tuples whose $X$ value is $v_x$, the support that the consistent data $v_a$ gives to $X \to A$, together with the reliability of the consistent data, is computed by formula (2).
where:
$V_X = \{X.r \mid X.r = v_x\}$
$V_A' = \{A.r \mid X.r = v_x \;\&\; A.r = v_a\}$
$|V_X, V_A'| = |\{\langle X.r, A.r \rangle \mid X.r = v_x \;\&\; A.r = v_a\}|$
$V_A'$ is the set of consistent data in column $A$ when column $X$ takes the value $v_x$; $X.r$ is the value of the cell in row $r$ of column $X$, and $A.r$ is the value of the cell in row $r$ of column $A$.
Second, the more similar the inconsistent data are to the consistent data, and the more reliable the consistent data are, the larger the support that the inconsistent data give to $X \to A$. The calculation is given by formula (3).
where $V_A^* = \{A.r \mid X.r = v_x \;\&\; A.r \neq v_a\}$.
The support that the set $V_X$ gives to $X \to A$ can be expressed as the weighted sum of the supports that the consistent data and the inconsistent data give to $X \to A$, as shown in formula (5).
where $\omega_1 + \omega_2 = 1$.
Finally, the supports of all distinct values in $X$ are taken, and their average is used as the probability that $X \to A$ holds in web table $T$, computed by formula (6):
where $|D_X|$ is the number of distinct $V_X$ in $X$.
Formula (6) gives the probability that $X \to A$ holds in table $T$. All possible approximate functional dependencies in $T$ are included in the candidate functional dependency set, and the probability that each of them holds is computed by formula (6).
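The overall shape of this computation (a per-value weighted sum, then an average over the distinct values of X) can be sketched as below. Formulas (2) and (3) are not reproduced in the text, so the two support terms here are assumptions chosen only to respect the stated monotonicity: the consistent support is taken as the share of tuples carrying v_a, and the inconsistent support as the mean similarity of the inconsistent values to v_a (1.0 when there are none). The name `afd_probability`, the weights 0.8 and 0.2, and the use of `difflib` are likewise hypothetical, and ties for the most frequent value are not broken by similarity here.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def afd_probability(rows, x, a, w1=0.8, w2=0.2):
    """Estimate how likely column x approximately determines column a.

    rows is a list of dicts; w1 + w2 must equal 1.
    """
    # Group the A-values by the X-value of their tuple.
    groups = defaultdict(list)
    for row in rows:
        groups[row[x]].append(str(row[a]))

    supports = []
    for a_values in groups.values():
        v_a, n_consistent = Counter(a_values).most_common(1)[0]
        # Assumed S_c: share of tuples in this group that carry v_a.
        s_c = n_consistent / len(a_values)
        # Assumed S_nc: mean similarity of the inconsistent values to v_a.
        others = [v for v in a_values if v != v_a]
        if others:
            s_nc = sum(SequenceMatcher(None, v, v_a).ratio() for v in others) / len(others)
        else:
            s_nc = 1.0
        # Weighted sum of the two supports for this X-value.
        supports.append(w1 * s_c + w2 * s_nc)

    # Average over the distinct X-values.
    return sum(supports) / len(supports)
```

Under these assumptions a column pair with no conflicting values scores 1.0, and each conflicting tuple lowers the score by an amount that shrinks as the conflicting value gets closer to the majority value.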
If $X$ approximately functionally determines $A$, $X$ is called the determinant attribute of that approximate functional dependency. All determinant attributes in the approximate functional dependency set form the determinant attribute set, and the number of elements in the determinant attribute set is the number of determinant attributes, i.e. m.
Step 2: according to the characteristics of web tables, delete the noisy functional dependencies from the candidate functional dependency set to obtain the approximate functional dependency set.
Noisy functional dependencies are deleted mainly to obtain a more accurate functional dependency set, which lays the foundation for obtaining the entity columns in the next step. The deletion rules are as follows:
If an approximate functional dependency satisfies any one of the following three rules, it is deleted from the candidate approximate functional dependency set.
Rule 1: the attribute values of column $X$ are of date, floating-point or Boolean type.
Rule 2: there exists an attribute column $Y$ in $T$ such that [...] holds;
Rule 3: in the candidate approximate functional dependency set, there exist attribute columns $X$ and $A$ such that [...] and [...].
After the noisy functional dependencies have been deleted from the candidate functional dependency set according to these rules, the approximate functional dependency set is obtained.
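Of the three rules, only Rule 1 is fully stated in the text, so the sketch below covers Rule 1 alone. The type checks are heuristic assumptions (the recognised Boolean spellings, the date formats and the "contains a decimal point" test are all illustrative choices), and the function names are hypothetical.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y")  # assumed formats


def _is_bool(cell):
    return cell.lower() in {"true", "false", "yes", "no"}


def _is_float(cell):
    try:
        float(cell)
    except ValueError:
        return False
    return "." in cell  # plain integers are not treated as floating-point


def _is_date(cell):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(cell, fmt)
            return True
        except ValueError:
            continue
    return False


def is_noisy_determinant(values):
    """True if every non-empty cell is Boolean, floating-point, or a date."""
    cells = [str(v).strip() for v in values if str(v).strip()]
    if not cells:
        return False
    return any(all(check(c) for c in cells) for check in (_is_bool, _is_float, _is_date))


def apply_rule_1(candidates, table):
    """Keep only candidate dependencies (x, a) whose determinant x passes Rule 1.

    table maps a column name to its list of cell values.
    """
    return [(x, a) for x, a in candidates if not is_noisy_determinant(table[x])]
```

A date or measurement column tends to have many distinct values, so it would trivially "determine" other columns; filtering on the determinant's type removes such dependencies before normalization.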
Step 3: obtain the entity columns by applying the idea of normalization.
In a web table, the attribute columns are approximately functionally dependent on the entity columns they describe. Following the normalization principles of relational database theory, the approximate functional dependency set is normalized to 3NF, and the set of primary keys produced by the 3NF normalization is exactly the desired set of entity columns of the web table.
The 3NF normalization of the approximate functional dependency set comprises:
The dependencies in the approximate functional dependency set are mapped to a relation matrix FD[m][n], and the approximate functional dependencies between determinant attributes are mapped to a relation matrix KK[m][m], where m is the number of attributes that appear on the left-hand side of an approximate functional dependency, i.e. the number of determinant attributes, and n is the number of all attribute columns in the web table. For convenience, different numbers represent the different relations between attributes, and the matrix elements are produced as follows:
(1) The elements of FD[m][n] are produced as follows:
Let α ∈ {determinant attribute set}, β ∈ {set of all column attributes}
1) If α = β, then FD[α][β] := 2;
2) If β is approximately functionally dependent on α, then FD[α][β] := 1;
3) Otherwise, FD[α][β] := 0;
(2) The elements of KK[m][m] are produced as follows:
Let α, γ ∈ {determinant attribute set}
1) If α = γ, or an approximate functional dependency holds between α and γ, then KK[α][γ] := 1;
2) Otherwise, KK[α][γ] := -1;
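The construction of the two matrices can be sketched as follows. The representation is an assumption: dependencies are given as a set of (determinant, dependent) column-name pairs, and KK[α][γ] is set to 1 when α determines γ, which is one reading of the rule above since the direction is not recoverable from the text. `build_matrices` is a hypothetical name.

```python
def build_matrices(afds, columns):
    """Build FD[m][n] and KK[m][m] from a set of (determinant, dependent) pairs."""
    # Determinant attributes: columns that appear on a left-hand side,
    # kept in table order.
    determinants = [c for c in columns if any(x == c for x, _ in afds)]

    # FD: 2 on the attribute itself, 1 for a dependency, 0 otherwise.
    fd = [
        [2 if alpha == beta else 1 if (alpha, beta) in afds else 0 for beta in columns]
        for alpha in determinants
    ]
    # KK: 1 on the attribute itself or for a dependency between two
    # determinant attributes (assumed direction alpha -> gamma), -1 otherwise.
    kk = [
        [1 if alpha == gamma or (alpha, gamma) in afds else -1 for gamma in determinants]
        for alpha in determinants
    ]
    return determinants, fd, kk
```

For the columns `["id", "name", "city"]` with dependencies id → name and name → city, the determinant attributes are id and name, FD is `[[2, 1, 0], [0, 2, 1]]`, and KK is `[[1, 1], [-1, 1]]`.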
For ease of description, Definition 3 defines approximate transitive functional dependency as follows:
Definition 3: in web table T, if [...], Z is said to be approximately transitively functionally dependent on X, where Y is the intermediary key of the approximate transitive functional dependency.
Fig. 3 is a schematic diagram of the process of finding the closure DC[m][n] of the approximate functional dependency set from the approximate functional dependency set. Determining DC[m][n] from FD[m][n] and KK[m][m] comprises:
Step 1: copy the elements of FD[m][n] into DC[m][n]; i := 0, where i denotes the i-th approximate functional dependency in KK[m][m];
Step 2: i := 1;
Step 3: determine whether [...] exists in KK[m][m] and [...] exists in DC[m][n]; if so, set DC[m][n] := β_i and go to Step 4; otherwise, go directly to Step 4;
Step 4: determine whether an (i+1)-th approximate functional dependency exists in KK[m][m]; if so, go to Step 5; otherwise, go directly to Step 6;
Step 5: i := i+1, and return to Step 3.
Step 6: determine whether DC[m][n] has changed; if it has, return to Step 2; otherwise, output DC[m][n] and end the procedure.
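The fixed-point structure of Steps 1 to 6 can be sketched as below. The exact conditions of Step 3 are not recoverable from the text, so the marking rule here is an assumption: when determinant α determines determinant γ (KK entry 1) and γ reaches column β in DC, the cell DC[α][β] is overwritten with the name of γ, recording γ as the intermediary key. Storing the key as a string, so that it cannot be confused with the codes 0, 1 and 2, is also an illustrative choice, and `closure` is a hypothetical name.

```python
def closure(determinants, columns, fd, kk):
    """Return DC, a copy of FD in which transitive dependencies are marked
    with the name of their intermediary key (assumed encoding)."""
    dc = [row[:] for row in fd]  # Step 1: copy FD into DC
    changed = True
    while changed:  # Step 6: repeat until DC no longer changes
        changed = False
        for a, alpha in enumerate(determinants):
            for g, gamma in enumerate(determinants):
                if a == g or kk[a][g] != 1:
                    continue  # no dependency alpha -> gamma between determinants
                for b, beta in enumerate(columns):
                    if beta in (alpha, gamma):
                        continue
                    # gamma reaches beta, and alpha -> beta is not yet marked
                    # as transitive: record gamma as the intermediary key.
                    if dc[g][b] != 0 and dc[a][b] in (0, 1):
                        dc[a][b] = gamma
                        changed = True
    return dc
```

Each cell can change at most once from a numeric code to a key name, so the loop terminates. With id → name, name → city and id → city, the cell for (id, city) becomes `"name"`.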
Fig. 4 is a flow chart of obtaining entity columns using the third normal form. Approximate transitive dependencies that were mis-marked are corrected according to the approximate functional dependency set closure DC[m][n] described above. Finally, the intermediary keys and the determinant attributes that appear only in direct approximate functional dependencies are output as entity columns. The procedure for finding these determinant attributes and intermediary keys comprises:
Step 1: Input DC[m][n] and FD[m][n];
Step 2: i := 0, j := 0, where i and j denote the row and column indices of DC[m][n];
Step 3: Determine whether DC[i][j] != {0, 1, 2} && FD[i][j] = 1 && FD[j][i] = 1 holds. If it holds, DC[i][j] := 1, and go to step 4; otherwise, go to step 4;
Step 4: Determine whether the traversal is complete. If it is, set i := 0, j := 0, and go to step 5; otherwise, take the next DC[i][j] and go to step 3;
Step 5: Determine whether DC[i][j] != {0, 1, 2} holds. If it holds, Entity{} := DC[i][j], and go to step 7; otherwise, go to step 6;
Step 6: Determine whether DC[i][j] = 1 && i != j holds. If it holds, add the determinant attribute of row i to the Entity set, and go to step 7; otherwise, go directly to step 7;
Step 7: Determine whether the traversal is complete. If it is, output the Entity set, and the procedure ends; otherwise, take the next DC[i][j] and continue with step 5.
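The two passes of Fig. 4 can be sketched in Python as follows. The sketch is illustrative only and rests on assumptions: it reads the test DC[i][j] != {0, 1, 2} as "the entry holds an intermediary key rather than one of the codes 0, 1, 2", it stores intermediary keys as attribute names, and it keys both matrices by attribute name so that FD[j][i] is well defined. The function name `entity_columns` and the sample data are hypothetical.

```python
def entity_columns(dc, fd):
    """Collect entity columns from the closure DC and the matrix FD."""
    codes = (0, 1, 2)
    dc = {i: dict(row) for i, row in dc.items()}   # work on a copy
    # Steps 2-4: reset entries mis-marked as transitive when i and j
    # directly determine each other.
    for i, row in dc.items():
        for j, value in list(row.items()):
            mutual = fd.get(i, {}).get(j) == 1 and fd.get(j, {}).get(i) == 1
            if value not in codes and mutual:
                row[j] = 1
    # Steps 5-7: gather intermediary keys and direct determinant attributes.
    entities = []
    for i, row in dc.items():
        for j, value in row.items():
            if value not in codes:
                candidate = value                  # step 5: intermediary key
            elif value == 1 and i != j:
                candidate = i                      # step 6: determinant of row i
            else:
                continue
            if candidate not in entities:
                entities.append(candidate)
    return entities


# Hypothetical closure: City -> Continent is transitive via Country.
fd = {
    "City": {"City": 2, "Country": 1, "Continent": 1},
    "Country": {"City": 0, "Country": 2, "Continent": 1},
}
dc = {
    "City": {"City": 2, "Country": 1, "Continent": "Country"},
    "Country": {"City": 0, "Country": 2, "Continent": 1},
}
result = entity_columns(dc, fd)

# Hypothetical mis-marked entry: Code and Name determine each other directly.
fd2 = {"Code": {"Code": 2, "Name": 1}, "Name": {"Code": 1, "Name": 2}}
dc2 = {"Code": {"Code": 2, "Name": "Other"}, "Name": {"Code": 1, "Name": 2}}
repaired = entity_columns(dc2, fd2)
```

Using a list with a membership check keeps the output order stable, which makes the result easier to compare across runs than an unordered set would be.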
In summary, the approximate functional dependency detection method provided by the embodiments of the present invention is adapted to the characteristics of network tables and can express the functional dependencies between attributes more accurately. Because the approximate functional dependencies are calculated from the support that both consistent data and inconsistent data give to a functional dependency, the algorithm has clear noise resistance.
The entity column discovery algorithm based on approximate functional dependencies and normalization provided by the embodiments of the present invention can discover entity columns more accurately. The method applies not only to network tables with a single entity column but also to tables with multiple entity columns. It applies not only to network tables that have a header, but also to network tables that have no header or whose complete header cannot be recovered with semantic recovery techniques.
Compared with the prior art, the method of the present invention achieves high-quality entity column discovery and can discover multiple entity columns. To verify these advantages, we carried out a number of experiments. The experimental data come from two sources: one is the open-source Wiki Table data set, and the other consists of network tables that we crawled from the web, which we call the Web Table data set. We divided the collected network tables by number of rows into a large-table data set (more than 100 rows), abbreviated as the L data set, and a small-table data set (fewer than 100 rows), abbreviated as the S data set. To facilitate experimental verification of single-entity-column and multi-entity-column discovery, we divided the L data set into the L single-entity sets (WiKi_LS and Web_LS) and the L multi-entity sets (WiKi_LM and Web_LM), and the S data set into the S single-entity sets (WiKi_SS and Web_SS) and the S multi-entity sets (WiKi_SM and Web_SM).
The present invention discovers entity columns based on the functional dependencies between column values and does not rely on header or knowledge base information, which improves the quality of entity column discovery. To verify the effectiveness of the algorithm of the embodiments of the present invention (AFD_Model) in terms of noise reduction, the PFD_Model algorithm was implemented specifically for comparison; it is identical to AFD_Model except that it does not take table noise into account. Fig. 5 compares AFD_Model, PFD_Model and the evidence-based method (ED_Model) on tables with a single entity column in terms of entity column detection precision, recall, F-measure and running time. Fig. 5 shows that AFD_Model is better overall than ED_Model and PFD_Model. In terms of precision, ED_Model requires that the header of the network table have a concept-attribute relationship in the Probase knowledge base, so both the quality of the header and the coverage of the knowledge base affect its precision, whereas AFD_Model does not rely on any header information or knowledge base and therefore achieves higher precision. Because AFD_Model takes the characteristics of network tables into account and has some noise filtering capability, its entity detection precision is also higher than that of PFD_Model.
In terms of recall, AFD_Model is higher than both ED_Model and PFD_Model, because AFD_Model does not require the network table to have a header, does not require a concept-attribute relationship between the entity columns and the non-entity columns, and does not require that such a concept-attribute relationship exist in Probase; it also has some noise filtering capability, so the algorithm is more adaptable. The F-measure reflects the overall quality of an algorithm, and on this measure the algorithm of the present invention has a clear advantage. In terms of running time, ED_Model takes significantly longer than AFD_Model and PFD_Model, because ED_Model must use Probase to judge the concept-attribute relationships of the table header, or of the semantically recovered header, before it can determine the entity columns, whereas the time complexity of AFD_Model and PFD_Model depends only on the size of the table.
The method of the present invention is applicable to tables with multiple entity columns, which significantly broadens its applicability. Because ED_Model cannot discover multiple entity columns, the method of the present invention is compared only with PFD_Model. Fig. 6 is a schematic diagram comparing the effectiveness of the AFD_Model algorithm provided by the embodiments of the present invention with that of the PFD_Model algorithm in multi-entity-column discovery. Fig. 6 shows that AFD_Model outperforms PFD_Model in precision, recall and F-measure, because AFD_Model takes the influence of noisy data into account when calculating the approximate functional dependencies between attributes.
Those of ordinary skill in the art will understand that the accompanying drawings are schematic diagrams of one embodiment, and that the modules or flows in the drawings are not necessarily required to implement the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, magnetic disk or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device or system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The device and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims (6)

1. A method for detecting entity columns of a network table based on dependencies between attributes, characterized by comprising:
for a network table, calculating the approximate functional dependency probability between any two columns according to the functional dependencies between column values, and obtaining a candidate functional dependency set according to the approximate functional dependency probability;
according to the characteristics of the network table, deleting the noise functional dependencies from the candidate functional dependency set to obtain an approximate functional dependency set;
performing 3NF normalization on the approximate functional dependency set, and taking the set of primary keys produced by the 3NF normalization as the entity columns of the network table.
2. The method according to claim 1, characterized in that, for a network table, calculating the approximate functional dependency probability between any two columns according to the functional dependencies between column values, and obtaining a candidate functional dependency set according to the approximate functional dependency probability, comprises:
Let X be an attribute of network table T and let A be an attribute of T different from X. If there exist (X, A) attribute value pairs in some of the tuples of T such that X → A holds, then X is said to approximately functionally determine A, or A is said to be approximately functionally dependent on X, denoted [formula], where [formula] represents the approximate functional dependency probability that X → A holds on T. Among the (X, A) attribute value pairs, the data for which X → A holds are called consistent data, and the rest are called inconsistent data;
For the tuples of network table T whose X attribute value is v_x, the A attribute column may contain different values; let the set of these different values be V_A;
If the most frequent value in V_A is unique, that value is taken as the consistent data. If the most frequent value is not unique, each of the most frequent values is taken as a class center, the sum of the similarities between the other values and each class center is calculated, and the class center v_a with the largest sum is selected as the consistent data. The specific calculation is given by formula (1);
For any class center value v_j: [formula]
For all tuples whose X value is v_x, the support S_c(X → A, V_X, V_A') that the consistent data v_a give to X → A is calculated by formula (2);
where:
V_X = {X.r | X.r = v_x}
V_A' = {A.r | X.r = v_x & A.r = v_a}
|V_X, V_A'| = |{<X.r, A.r> | X.r = v_x & A.r = v_a}|
V_A' is the set of consistent data in the corresponding A column when the X column takes the value v_x; X.r is the value of the cell in row r of column X, and A.r is the value of the cell in row r of column A;
The support S_nc(X → A, V_X, V_A*) that the inconsistent data give to X → A is calculated by formula (3);
The support that the set V_X gives to X → A is expressed as the weighted sum of the supports that the consistent data and the inconsistent data give to X → A, and is calculated by formula (5):
where ω1 + ω2 = 1;
The supports of all distinct tuples in X are taken, and their average value is used as the probability that X → A holds in network table T; it is calculated by formula (6):
where |D_X| denotes the number of distinct V_X in X;
[formula] denotes an approximate functional dependency in network table T together with the probability that it holds; the candidate functional dependency set contains all possible approximate functional dependencies in network table T.
3. The method according to claim 2, characterized in that deleting the noise functional dependencies from the candidate functional dependency set according to the characteristics of the network table to obtain the approximate functional dependency set comprises:
If an approximate functional dependency [formula] in the candidate functional dependency set satisfies any one of the following three rules, [formula] is deleted from the candidate approximate functional dependency set:
Rule 1: If the type of the attribute values of column X is a date type, floating-point type or Boolean type:
Rule 2: If there exists an attribute column Y in network table T such that [formula] holds;
Rule 3: If, in the candidate approximate functional dependency set, there exist attribute columns X and A such that [formula] and [formula].
4. The method according to claim 3, characterized in that performing 3NF normalization on the approximate functional dependency set and taking the set of primary keys produced by the 3NF normalization as the entity columns of the network table comprises:
Mapping the approximate functional dependencies in the approximate functional dependency set into a relational matrix FD[m][n], and mapping the approximate functional dependencies between determinant attributes into a relational matrix KK[m][m], where m is the number of attributes that appear on the left side of an approximate functional dependency, that is, the number of determinant attributes, and n is the number of all attribute columns in the network table:
(1) The elements of FD[m][n] are produced as follows:
Let α ∈ {determinant attribute set}, β ∈ {set of all column attributes}
1) If α = β, then FD[α][β] := 2;
2) If [formula], then FD[α][β] := 1;
3) Otherwise, FD[α][β] := 0;
(2) The elements of KK[m][m] are produced as follows:
Let α, γ ∈ {determinant attribute set}
1) If α = γ or [formula], then KK[α][γ] := 1;
2) Otherwise, KK[α][γ] := -1;
Definition: in network table T, if [formula], then Z is said to be approximately transitively functionally dependent on X, denoted [formula], where Y is the intermediary key of the approximate transitive functional dependency;
Determining the approximate functional dependency set closure DC[m][n] according to the relational matrix FD[m][n] and the relational matrix KK[m][m], determining, according to the closure DC[m][n], the intermediary keys and the determinant attributes that appear only in direct approximate functional dependencies, and outputting these determinant attributes and intermediary keys as the entity columns of the network table.
5. The method according to claim 4, characterized in that determining the approximate functional dependency set closure DC[m][n] according to the relational matrix FD[m][n] and the relational matrix KK[m][m] comprises:
Step 1: Copy the elements of FD[m][n] into DC[m][n]; i := 0, where i denotes the i-th approximate functional dependency in KK[m][m];
Step 2: i := 1;
Step 3: Determine whether [formula] exists in KK[m][m] and [formula] exists in DC[m][n]. If so, DC[m][n] := βi, and go to step 4; otherwise, go directly to step 4;
Step 4: Determine whether an (i+1)-th approximate functional dependency exists in KK[m][m]. If so, go to step 5; otherwise, go directly to step 6;
Step 5: i := i+1, and return to step 3;
Step 6: Determine whether DC[m][n] has changed. If it has, return to step 2; otherwise, output DC[m][n], and the procedure ends.
6. The method according to claim 5, characterized in that determining, according to the approximate functional dependency set closure DC[m][n], the intermediary keys and the determinant attributes that appear only in direct approximate functional dependencies comprises:
Step 1: Input DC[m][n] and FD[m][n];
Step 2: i := 0, j := 0, where i and j denote the row and column indices of DC[m][n];
Step 3: Determine whether DC[i][j] != {0, 1, 2} && FD[i][j] = 1 && FD[j][i] = 1 holds. If it holds, DC[i][j] := 1, and go to step 4; otherwise, go to step 4;
Step 4: Determine whether the traversal is complete. If it is, set i := 0, j := 0, and go to step 5; otherwise, take the next DC[i][j] and go to step 3;
Step 5: Determine whether DC[i][j] != {0, 1, 2} holds. If it holds, Entity{} := DC[i][j], and go to step 7; otherwise, go to step 6;
Step 6: Determine whether DC[i][j] = 1 && i != j holds. If it holds, add the determinant attribute of row i to the Entity set, and go to step 7; otherwise, go directly to step 7;
Step 7: Determine whether the traversal is complete. If it is, output the Entity set, and the procedure ends; otherwise, take the next DC[i][j] and continue with step 5.
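The probability calculation recited in claim 2 can be illustrated with the following Python sketch. Formulas (1) to (6) are not reproduced in the text, so the concrete forms used here are assumptions and not the claimed formulas: the consistent support is taken as the fraction of tuples with X = v_x whose A value equals v_a, the inconsistent support as the summed similarity of the remaining A values to v_a divided by the same tuple count, and the similarity as `difflib.SequenceMatcher.ratio`. The function names and the sample rows are hypothetical. Only the majority-value rule, the tie-break by similarity sum, the weighting with ω1 + ω2 = 1 and the averaging over distinct X values follow the claim.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Assumed string similarity in [0, 1]; the claim does not name one."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


def consistent_value(values, sim=similarity):
    """Pick v_a: the most frequent value, ties broken by similarity sum."""
    counts = Counter(values)
    top = max(counts.values())
    centers = [v for v, c in counts.items() if c == top]
    return max(centers, key=lambda c: sum(sim(c, o) for o in values if o != c))


def afd_probability(pairs, w1, w2, sim=similarity):
    """Average, over distinct X values, of the weighted support for X -> A."""
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("w1 + w2 must equal 1")
    groups = defaultdict(list)
    for x, a in pairs:                     # group A values by X value
        groups[x].append(a)
    if not groups:
        return 0.0
    supports = []
    for values in groups.values():
        v_a = consistent_value(values, sim)
        n = len(values)
        s_c = values.count(v_a) / n        # assumed form of formula (2)
        s_nc = sum(sim(o, v_a) for o in values if o != v_a) / n  # assumed (3)
        supports.append(w1 * s_c + w2 * s_nc)
    return sum(supports) / len(supports)   # average over distinct X values


# Hypothetical rows of (X, A) = (city, country) with one misspelled value.
rows = [("Paris", "France"), ("Paris", "France"),
        ("Paris", "Frence"), ("Rome", "Italy")]
strict = afd_probability(rows, 1.0, 0.0)   # consistent data only
tolerant = afd_probability(rows, 0.8, 0.2)  # misspelling earns partial credit
tie = consistent_value(["aa", "ab", "zz", "zz", "aa"])
```

Under these assumed forms a perfectly consistent column pair scores ω1 rather than 1, which is one reason to treat the sketch as a stand-in for the claimed formulas and not as a reproduction of them.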
CN201710002389.7A 2017-01-03 2017-01-03 method for detecting entity column of network table based on dependency relationship between attributes Expired - Fee Related CN106844338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710002389.7A CN106844338B (en) 2017-01-03 2017-01-03 method for detecting entity column of network table based on dependency relationship between attributes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710002389.7A CN106844338B (en) 2017-01-03 2017-01-03 method for detecting entity column of network table based on dependency relationship between attributes

Publications (2)

Publication Number Publication Date
CN106844338A true CN106844338A (en) 2017-06-13
CN106844338B CN106844338B (en) 2019-12-10

Family

ID=59117509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710002389.7A Expired - Fee Related CN106844338B (en) 2017-01-03 2017-01-03 method for detecting entity column of network table based on dependency relationship between attributes

Country Status (1)

Country Link
CN (1) CN106844338B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077181A (en) * 2012-11-20 2013-05-01 深圳市华傲数据技术有限公司 Method for automatically generating approximate functional dependency rule
CN104281563A (en) * 2013-07-01 2015-01-14 国际商业机器公司 Method and system for discovering relationships in tabular data
CN104794222A (en) * 2015-04-29 2015-07-22 北京交通大学 Network table semantic recovery method
US20150233705A1 (en) * 2012-11-09 2015-08-20 Kla-Tencor Corporation Reducing algorithmic inaccuracy in scatterometry overlay metrology

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WANG D G 等: "Functional Dependency Generation and Applications in Pay-as-You-Go data Integration Systems", 《PROCEEDINGS OF THE 12TH INTERNATIONAL WORKSHOP ON THE WEB AND DATABASES》 *
任向冉: "网络表格的实体列发现与标识", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
黎章海 等: "基于函数依赖的导出关系候选码计算", 《计算机工程》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595624A (en) * 2018-04-23 2018-09-28 南京大学 A kind of large-scale distributed functional dependence discovery method
CN109472013A (en) * 2018-10-25 2019-03-15 北京交通大学 The foreign key relationship detection method of net list compartment based on fitting of distribution
CN111061923A (en) * 2019-12-13 2020-04-24 北京航空航天大学 Graph data entity identification method and system based on graph dependence rule and supervised learning
CN111061923B (en) * 2019-12-13 2022-08-02 北京航空航天大学 Graph data entity recognition system based on graph dependence rule and supervised learning

Also Published As

Publication number Publication date
CN106844338B (en) 2019-12-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20191210

Termination date: 20210103
