CN106844338A - Method for detecting entity columns of web tables based on dependencies between attributes - Google Patents
- Publication number
- CN106844338A (application CN201710002389.7A)
- Authority
- CN
- China
- Prior art keywords
- functional dependencies
- approximate
- row
- network form
- attribute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/177—Editing, e.g. inserting or deleting of tables; using ruled lines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for detecting the entity columns of a web table based on dependencies between attributes. For a web table, the approximate functional dependency probability between every two columns is calculated from the functional dependencies between column values, and a candidate functional dependency set is obtained from these probabilities. Noise functional dependencies are deleted from the candidate set according to the characteristics of web tables, yielding the approximate functional dependency set. The approximate functional dependency set is then normalized to third normal form (3NF), and the set of primary keys produced by the 3NF normalization is taken as the entity columns of the web table. The method expresses the functional dependencies among attributes more accurately. Because approximate functional dependencies are calculated from the support that both consistent and inconsistent data give to a functional dependency, the algorithm is markedly robust to noise. The method applies not only to web tables with a single entity column but also to tables with multiple entity columns.
Description
Technical field
The present invention relates to the technical field of network information processing, and in particular to a method for detecting the entity columns of a web table based on dependencies between attributes.
Background
With the development of information technology, resources on the Internet have become increasingly abundant. Besides unstructured data, a large number of web tables exist. Compared with text, web tables have better structural features and have therefore attracted great attention. Enabling machines to better understand the semantics of web tables has become a major challenge for improving the coverage and accuracy of table search. An entity column identifies the entities described by a web table, and its column label describes the subject of the whole table, from which the semantic information of the table can be determined. If the entity columns of a web table can be detected accurately, machine understanding of web table semantics can be greatly improved.
One existing entity column discovery algorithm is the evidence-based algorithm proposed by Wang et al. It uses Probase as a knowledge base and relies on two pieces of evidence to discover the entity column of a web table. First, all entities in the entity column describe the same concept. Second, a concept-attribute relation exists between the concept expressed by the entity column and the concepts expressed by the other, non-entity columns.
In the evidence-based algorithm, for each candidate schema s of a web table, when one column col is selected as the entity column, the remaining columns are treated as attributes of that entity column. The scores of all candidate entity columns are calculated, and the highest-scoring candidate is selected as the entity column of the web table. The objective function is as follows:
where $S_{CA}$ is the set of all possible concept-attribute relations of attribute set A, $c_i$ is the concept described by attribute set $A_i$, and $s_{a_i}$ is the confidence that attribute set A is an attribute of concept $c_i$; $S_{CE}$ is the set of all possible concept-entity relations of entity set E, $c_i$ is the concept to which entity set $E_i$ belongs, and $s_{e_i}$ is the confidence that entity set E belongs to concept $c_i$; $A_{col}$ denotes the set of all attributes in candidate schema s except column col; $E_{col}$ denotes the set of all values in column col except the header.
The above prior-art entity column discovery algorithm has the following shortcomings. First, the method depends on the header of the web table and on a knowledge base, and it incurs a large computational overhead. A knowledge base does cover many entities, attributes, concepts, and the relations between them, but it can hardly cover all of those found on the web. Moreover, web tables often lack header information, and it is difficult to recover a header accurately from a knowledge base alone, especially the labels of numeric or date columns. The recall and accuracy of the evidence-based algorithm are therefore relatively low. Second, the evidence-based method can only discover the entity column of a web table with a single entity column and ignores web tables with multiple entity columns. Many tables on the web have more than one entity column, so the algorithm is limited.
Summary of the invention
The embodiments of the present invention provide a method for detecting the entity columns of a web table based on dependencies between attributes, so that the entity columns of web tables can be discovered effectively.
To achieve this, the invention adopts the following technical solution.
A method for detecting the entity columns of a web table based on dependencies between attributes, comprising:
for a web table, calculating the approximate functional dependency probability between every two columns from the functional dependencies between column values, and obtaining a candidate functional dependency set from these probabilities;
deleting the noise functional dependencies from the candidate functional dependency set according to the characteristics of web tables, obtaining the approximate functional dependency set;
performing 3NF normalization on the approximate functional dependency set, and taking the set of primary keys produced by the 3NF normalization as the entity columns of the web table.
Further, calculating, for a web table, the approximate functional dependency probability between every two columns from the functional dependencies between column values, and obtaining the candidate functional dependency set from these probabilities, comprises:
Let X be an attribute in web table T and A an attribute in T different from X. When the (X, A) attribute value pairs of some of the tuples in T make X → A hold, X is said to approximately determine A, or A is said to be approximately functionally dependent on X. The approximate functional dependency probability is the probability that X → A holds on T. The (X, A) attribute value pairs for which X → A holds are called consistent data, and the rest are called inconsistent data;
For the tuples in web table T whose X value is $v_x$, column A may contain different values; let $V_A$ be the set of these values.
If the most frequent value in $V_A$ is unique, that value is taken as the consistent data. If it is not unique, each of the most frequent values is treated as a class center, the sum of the similarities between the other values and each class center is calculated, and the class center $v_a$ with the largest sum is selected as the consistent data. The calculation is given by formula (1),
where $v_j$ denotes any class center.
For all tuples whose X value is $v_x$, the support $S_c(X \to A, V_X, V_A')$ that the consistent data $v_a$ gives to X → A is calculated by formula (2);
where:
$V_X = \{X.r \mid X.r = v_x\}$
$V_A' = \{A.r \mid X.r = v_x \wedge A.r = v_a\}$
$|V_X, V_A'| = |\{\langle X.r, A.r \rangle \mid X.r = v_x \wedge A.r = v_a\}|$
$V_A'$ is the set of consistent data in column A when column X takes the value $v_x$, X.r is the value of the cell in row r of column X, and A.r is the value of the cell in row r of column A;
The support $S_{nc}(X \to A, V_X, V_A^*)$ that the inconsistent data gives to X → A is calculated by formula (3);
The support that set $V_X$ gives to X → A is expressed as the weighted sum of the supports that the consistent data and the inconsistent data give to X → A, and is calculated by formula (5):
where $\omega_1 + \omega_2 = 1$;
The supports of all distinct values in X are averaged, and the average is taken as the probability that X → A holds in web table T, calculated by formula (6):
where $|D_X|$ denotes the number of distinct $V_X$ in X;
This value is the probability that the approximate functional dependency of A on X holds in web table T. The candidate functional dependency set contains all possible approximate functional dependencies in web table T.
Further, deleting the noise functional dependencies from the candidate functional dependency set according to the characteristics of web tables to obtain the approximate functional dependency set comprises:
If an approximate functional dependency in the candidate functional dependency set satisfies any one of the following three rules, it is deleted from the candidate approximate functional dependency set:
Rule 1: If the attribute values of column X are of date, floating-point, or Boolean type;
Rule 2:If there is attribute column Y in network form T so thatSet up;
Rule 3:If being concentrated in candidate's approximate functional dependencies, there is such attribute column X and A so thatAnd
Further, performing 3NF normalization on the approximate functional dependency set and taking the set of primary keys produced by the 3NF normalization as the entity columns of the web table comprises:
mapping the approximate functional dependencies in the approximate functional dependency set to a relation matrix FD[m][n], and mapping the approximate functional dependencies between determining attributes to a relation matrix KK[m][m], where m is the number of attributes that appear on the left-hand side of an approximate functional dependency, i.e. the number of determining attributes, and n is the number of attribute columns in the web table:
(1) The elements of FD[m][n] are generated as follows:
Let α ∈ {determining attribute set}, β ∈ {set of all column attributes}
1) if α = β, then FD[α][β] := 2;
2) if α approximately determines β, then FD[α][β] := 1;
3) otherwise, FD[α][β] := 0;
(2) The elements of KK[m][m] are generated as follows:
Let α, γ ∈ {determining attribute set}
1) if α = γ or α approximately determines γ, then KK[α][γ] := 1;
2) otherwise, KK[α][γ] := -1;
It is defined that, in web table T, when Z depends on X through an intermediate attribute Y, Z is said to be approximately transitively functionally dependent on X, where Y is the intermediary key of the approximate transitive functional dependency;
The closure DC[m][n] of the approximate functional dependency set is determined from relation matrices FD[m][n] and KK[m][m]. The determining attributes that occur only in direct approximate functional dependencies, together with the intermediary keys, are determined from the closure DC[m][n] and output as the entity columns of the web table.
Further, determining the closure DC[m][n] of the approximate functional dependency set from relation matrices FD[m][n] and KK[m][m] comprises:
Step 1: copy the elements of FD[m][n] into DC[m][n]; i := 0, where i denotes the i-th approximate functional dependency in KK[m][m];
Step 2: i := 1;
Step 3:Judge whetherExist in KK [m] [m], andExist in DC [m] [n], if it is,
Then DC [m] [n]:=βiAnd perform step 4;Otherwise, step 4 is directly performed;
Step 4: judge whether an (i+1)-th approximate functional dependency exists in KK[m][m]; if so, go to step 5; otherwise, go directly to step 6;
Step 5: i := i + 1, return to step 3;
Step 6: judge whether DC[m][n] has changed; if it has, return to step 2; otherwise, output DC[m][n] and end.
Further, determining from the closure DC[m][n] of the approximate functional dependency set the determining attributes that occur only in direct approximate functional dependencies and the intermediary keys comprises:
Step 1: input DC[m][n] and FD[m][n];
Step 2: i := 0, j := 0, where i and j denote the row and column indices of DC[m][n];
Step 3: judge whether DC[i][j] ∉ {0, 1, 2} && FD[i][j] = 1 && FD[j][i] = 1 holds; if so, set DC[i][j] := 1 and go to step 4; otherwise, go directly to step 4;
Step 4: judge whether the traversal is complete; if so, set i := 0, j := 0 and go to step 5; otherwise, take the next DC[i][j] and go to step 3;
Step 5: judge whether DC[i][j] ∉ {0, 1, 2} holds; if so, add DC[i][j] to the set Entity and go to step 7; otherwise, go to step 6;
Step 6: judge whether DC[i][j] = 1 && i ≠ j holds; if so, add the determining attribute of row i to the set Entity and go to step 7; otherwise, go directly to step 7;
Step 7: judge whether the traversal is complete; if so, output the set Entity and end; otherwise, take the next DC[i][j] and return to step 5.
As can be seen from the technical solution provided by the above embodiments, the approximate functional dependency detection method adapted to the characteristics of web tables expresses the functional dependencies among attributes more accurately. Because approximate functional dependencies are calculated from the support that both consistent and inconsistent data give to a functional dependency, the algorithm is markedly robust to noise. Entity columns can be discovered in more scenarios: the method applies not only to web tables with a single entity column but also to tables with multiple entity columns, and not only to web tables with headers but also to web tables that have no header or whose complete header cannot be recovered even with semantic recovery techniques.
Additional aspects and advantages of the invention will be set forth in part in the description that follows; they will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a process flowchart of a method for detecting the entity columns of a web table based on dependencies between attributes, provided by an embodiment of the invention;
Fig. 2 is a process flowchart for obtaining the candidate dependency set, provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of the process of finding the closure of the approximate functional dependency set from the approximate functional dependency set, provided by an embodiment of the invention;
Fig. 4 is a flowchart of obtaining entity columns using third normal form, provided by an embodiment of the invention;
Fig. 5 compares the entity column detection precision, coverage, F-value, and time efficiency of the AFD_Model algorithm provided by an embodiment of the invention, the PFD_Model algorithm, and the evidence-based method (ED_Model) on single-entity-column tables;
Fig. 6 compares the effectiveness of the AFD_Model algorithm provided by an embodiment of the invention and the PFD_Model algorithm in multi-entity-column discovery.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting the claims.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the specification of the invention refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or coupling. The wording "and/or" used herein includes any unit and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art and, unless defined as herein, are not to be interpreted in an idealized or overly formal sense.
To facilitate understanding of the embodiments of the invention, several specific embodiments are further explained below as examples with reference to the drawings; these embodiments do not limit the embodiments of the invention.
To solve the technical problems of the existing entity column detection algorithms described above, the invention provides an entity column detection algorithm that has a small computational overhead, does not depend on headers or a knowledge base, and is applicable to web tables with multiple entity columns. The invention solves the problems that traditional algorithms rely on the header of the web table and on a knowledge base and cannot discover multiple entity columns. By introducing the concept of approximate functional dependency, it improves the noise resistance of the method while producing high-quality entity column discovery results.
The processing flow of the method for detecting the entity columns of a web table based on dependencies between attributes, provided by an embodiment of the invention, is shown in Fig. 1 and includes the following steps:
Step 1: obtain the candidate functional dependency set from the approximate functional dependency probabilities between the column values of the web table.
For a web table, if one or more of its columns identify the entities described by the table, that column or those columns are defined as entity columns, and the other columns are defined as attribute columns.
For each table, the invention calculates the approximate functional dependency probability between every two columns from the functional dependencies between column values. Because tables are assumed to contain noise, the supports of consistent data and inconsistent data are introduced.
Definition 1: Let X be an attribute column in web table T and A an attribute column in T different from X. When the (X, A) attribute value pairs of some of the tuples in T make X → A hold, X is said to approximately determine A, or A is said to be approximately functionally dependent on X. The approximate functional dependency probability is the likelihood that X → A holds on T. The (X, A) attribute value pairs for which X → A holds are called consistent data, and the rest are called inconsistent data.
For the tuples in web table T whose X value is $v_x$, column A may contain different values; let $V_A$ be the set of these values.
If the most frequent value in $V_A$ is unique, that value is taken as the consistent data. If it is not unique, each of the most frequent values is treated as a class center, the sum of the similarities between the other values and each class center is calculated, and the class center $v_a$ with the largest sum is selected as the consistent data. The calculation is given by formula (1),
where $v_j$ denotes any class center.
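The selection of the consistent value described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the similarity measure of formula (1) is not reproduced in this text, so `difflib.SequenceMatcher` from the standard library is used as a stand-in, and the function names `similarity` and `choose_consistent_value` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; NOT the measure of formula (1)."""
    return SequenceMatcher(None, a, b).ratio()


def choose_consistent_value(values: list[str]) -> str:
    """Pick the consistent value v_a among the A-values seen with one X-value."""
    counts = Counter(values)
    top = max(counts.values())
    centers = [v for v, c in counts.items() if c == top]
    if len(centers) == 1:
        # The most frequent value is unique, so it is the consistent value.
        return centers[0]

    # Tie: treat each most-frequent value as a class center and keep the one
    # whose summed similarity to all other values is largest.
    def total_similarity(center: str) -> float:
        return sum(similarity(center, v) for v in values if v != center)

    return max(centers, key=total_similarity)
```

In the tie case, a misspelled variant such as "Bejing" pulls the choice toward "Beijing" rather than toward an unrelated value of equal frequency.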
Because the column values of a web table may contain errors, the supports that consistent and inconsistent data give to a functional dependency are combined to calculate the approximate functional dependency probability between every two columns, which yields the candidate functional dependency set.
Fig. 2 is a process flowchart for obtaining the candidate dependency set provided by an embodiment of the invention. The processing is as follows. First, the larger the proportion of consistent data, the more likely it is that X → A holds, i.e. the higher the support that the consistent data gives to X → A; a larger proportion also makes it more likely that the consistent data is truly consistent. For all tuples whose X value is $v_x$, the support that the consistent data $v_a$ gives to X → A and the reliability of the consistent data are calculated by formula (2).
where:
$V_X = \{X.r \mid X.r = v_x\}$
$V_A' = \{A.r \mid X.r = v_x \wedge A.r = v_a\}$
$|V_X, V_A'| = |\{\langle X.r, A.r \rangle \mid X.r = v_x \wedge A.r = v_a\}|$
$V_A'$ is the set of consistent data in column A when column X takes the value $v_x$, X.r is the value of the cell in row r of column X, and A.r is the value of the cell in row r of column A.
Second, the more similar the inconsistent data is to the consistent data and the more reliable the consistent data, the greater the support that the inconsistent data gives to X → A; the calculation is given by formula (3).
where $V_A^* = \{A.r \mid X.r = v_x \wedge A.r \neq v_a\}$.
The support that set $V_X$ gives to X → A is expressed as the weighted sum of the supports that the consistent data and the inconsistent data give to X → A, as shown in formula (5).
where $\omega_1 + \omega_2 = 1$.
Finally, the supports of all distinct values in X are averaged, and the average is taken as the probability that X → A holds in web table T, calculated by formula (6):
where $|D_X|$ denotes the number of distinct $V_X$ in X.
Formula (6) gives the probability that X → A holds in table T. All possible approximate functional dependencies in T are included in the candidate functional dependency set, and the probability that each of them holds is calculated by formula (6).
If X approximately determines A, X is called the determining attribute of that approximate functional dependency. All determining attributes in the approximate functional dependency set form the determining attribute set, and the number of elements in that set is the number of determining attributes, m.
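The overall shape of the probability calculation described above can be sketched as follows. Formulas (2), (3), (5), and (6) are not reproduced in this text, so every concrete expression in the sketch is an assumption made for illustration: the consistent-data support is taken to be the proportion of consistent tuples, the inconsistent-data support is that proportion scaled by the mean similarity of the inconsistent values to $v_a$, ties for the most frequent value are broken arbitrarily, and `afd_probability`, `similarity`, and the default weights are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity; the patent's own measure is not reproduced here."""
    return SequenceMatcher(None, a, b).ratio()


def afd_probability(rows, x, a, w1=0.5, w2=0.5):
    """Estimate how likely it is that column x approximately determines column a.

    rows is a list of dicts mapping column names to cell strings; w1 + w2 = 1.
    """
    # Group the A-values by the X-value they co-occur with (one group per v_x).
    groups = defaultdict(list)
    for row in rows:
        groups[row[x]].append(row[a])

    supports = []
    for a_values in groups.values():
        v_a, n_consistent = Counter(a_values).most_common(1)[0]
        # Assumed consistent-data support: share of tuples carrying v_a.
        s_c = n_consistent / len(a_values)
        inconsistent = [v for v in a_values if v != v_a]
        # Assumed inconsistent-data support: grows with similarity to v_a
        # and with the reliability (share) of the consistent data.
        mean_sim = (sum(similarity(v, v_a) for v in inconsistent) / len(inconsistent)
                    if inconsistent else 1.0)
        s_nc = s_c * mean_sim
        supports.append(w1 * s_c + w2 * s_nc)

    # Average the per-value supports over all distinct X-values.
    return sum(supports) / len(supports)
```

Under these assumptions an exact functional dependency scores 1, and a near-miss spelling in the dependent column lowers the score less than an unrelated value would.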
Step 2: delete the noise functional dependencies from the candidate functional dependency set according to the characteristics of web tables, obtaining the approximate functional dependency set.
Noise functional dependencies are deleted mainly to obtain a more accurate functional dependency set, which lays the foundation for obtaining the entity columns in the next step. The deletion rules are as follows:
If an approximate functional dependency satisfies any one of the following three rules, it is deleted from the candidate approximate functional dependency set.
Rule 1: If the attribute values of column X are of date, floating-point, or Boolean type.
Rule 2:If there is attribute column Y in T so thatSet up;
Rule 3:If being concentrated in candidate's approximate functional dependencies, there is such attribute column X and A so thatAnd
After the noise functional dependencies are deleted from the candidate functional dependency set according to the above rules, the approximate functional dependency set is obtained.
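Rule 1 can be illustrated with a small type check on the determining column. The sketch below covers Rule 1 only, since the conditions of Rules 2 and 3 are not fully legible in this text. The date formats, the Boolean spellings, the choice to let plain integers through, and the names `violates_rule_1` and `drop_rule_1` are all assumptions made for the example.

```python
from datetime import datetime

DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y")  # illustrative, not exhaustive
BOOLEAN_WORDS = {"true", "false", "yes", "no"}        # illustrative spellings


def is_date(value: str) -> bool:
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def is_float(value: str) -> bool:
    try:
        int(value)
        return False  # plain integers are not treated as floating point here
    except ValueError:
        pass
    try:
        float(value)
        return True
    except ValueError:
        return False


def is_boolean(value: str) -> bool:
    return value.lower() in BOOLEAN_WORDS


def violates_rule_1(column_values) -> bool:
    """True if every non-empty cell is a date, or a float, or a Boolean."""
    values = [v.strip() for v in column_values if v.strip()]
    checks = (is_date, is_float, is_boolean)
    return bool(values) and any(all(check(v) for v in values) for check in checks)


def drop_rule_1(candidates, table):
    """Keep only candidate pairs (x, a) whose determining column x passes Rule 1."""
    return {(x, a) for (x, a) in candidates if not violates_rule_1(table[x])}
```

Here `table` maps each column name to its list of cell strings, and `candidates` is a set of (determining column, dependent column) pairs.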
Step 3: obtain the entity columns by applying the idea of normalization.
In a web table, the attribute columns are approximately functionally dependent on the entity columns that they describe. Following the normalization principles of relational database theory, 3NF normalization is performed on the approximate functional dependency set, and the set of primary keys produced by the 3NF normalization is exactly the desired set of entity columns of the web table.
The process of performing 3NF normalization on the approximate functional dependency set includes:
The dependencies in the approximate functional dependency set are mapped to a relation matrix FD[m][n], and the approximate functional dependencies between determining attributes are mapped to a relation matrix KK[m][m], where m is the number of attributes that appear on the left-hand side of an approximate functional dependency, i.e. the number of determining attributes, and n is the number of attribute columns in the web table. For convenience, different numbers represent the different relations between attributes, and the matrix elements are generated as follows:
(1) The elements of FD[m][n] are generated as follows:
Let α ∈ {determining attribute set}, β ∈ {set of all column attributes}
1) if α = β, then FD[α][β] := 2;
2) if α approximately determines β, then FD[α][β] := 1;
3) otherwise, FD[α][β] := 0;
(2) The elements of KK[m][m] are generated as follows:
Let α, γ ∈ {determining attribute set}
1) if α = γ or α approximately determines γ, then KK[α][γ] := 1;
2) otherwise, KK[α][γ] := -1;
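The construction of the two relation matrices can be sketched as below. For readability the matrices are held as nested dictionaries keyed by column name rather than as integer-indexed arrays; that representation, the function name `build_matrices`, and the reading that an entry of 1 means "the row attribute approximately determines the column attribute" are assumptions of this example.

```python
def build_matrices(afds, columns):
    """Build FD and KK from a set of (determining, dependent) column-name pairs.

    columns lists all column names of the table in table order.
    """
    # Determining attributes: columns that occur on a left-hand side.
    determinants = [c for c in columns if any(x == c for x, _ in afds)]

    # FD: 2 on the "diagonal", 1 for a dependency in the set, 0 otherwise.
    fd = {
        alpha: {
            beta: 2 if alpha == beta else (1 if (alpha, beta) in afds else 0)
            for beta in columns
        }
        for alpha in determinants
    }

    # KK: 1 if the attributes are equal or the row determines the column, else -1.
    kk = {
        alpha: {
            gamma: 1 if alpha == gamma or (alpha, gamma) in afds else -1
            for gamma in determinants
        }
        for alpha in determinants
    }
    return determinants, fd, kk
```

With dependencies country → capital and capital → mayor, the determining attributes are country and capital, so FD has two rows and three columns and KK is two by two.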
For convenience of description, Definition 3 gives the definition of approximate transitive functional dependency:
Definition 3: In web table T, when Z depends on X through an intermediate attribute Y, Z is said to be approximately transitively functionally dependent on X, where Y is the intermediary key of the approximate transitive functional dependency.
Fig. 3 is a schematic diagram of the process of finding the closure DC[m][n] of the approximate functional dependency set from the approximate functional dependency set. The procedure for determining DC[m][n] from FD[m][n] and KK[m][m] includes:
Step 1: copy the elements of FD[m][n] into DC[m][n]; i := 0, where i denotes the i-th approximate functional dependency in KK[m][m];
Step 2: i := 1;
Step 3:Judge whetherExist in KK [m] [m],
AndExist in DC [m] [n], if it is, DC [m] [n]:=βi, and perform step 4;Otherwise, directly
Connect execution step 4;
Step 4: judge whether an (i+1)-th approximate functional dependency exists in KK[m][m]; if so, go to step 5; otherwise, go directly to step 6;
Step 5: i := i + 1, return to step 3.
Step 6: judge whether DC[m][n] has changed; if it has, return to step 2; otherwise, output DC[m][n] and end.
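The loop structure of Steps 1 to 6 is a fixed-point iteration: DC starts as a copy of FD and is updated until a full pass changes nothing. The sketch below shows that structure only. The exact test performed in Step 3 is not legible in this text, so the update rule used here (mark DC[α][β] with the intermediary key γ whenever α determines γ and γ determines β) is an assumption, as are the dictionary representation and the name `closure`.

```python
import copy


def closure(fd, kk):
    """Fixed-point sketch of the closure DC, using nested dicts keyed by column name.

    Entries 0, 1, 2 keep their FD meaning; any other entry is the name of the
    intermediary key through which the dependency is transitive (assumed encoding).
    """
    dc = copy.deepcopy(fd)  # Step 1: start from a copy of FD
    changed = True
    while changed:          # Step 6: repeat until a pass changes nothing
        changed = False
        for alpha, row in kk.items():
            for gamma, flag in row.items():
                if flag != 1 or alpha == gamma:
                    continue  # only dependencies between distinct determinants
                for beta, value in dc[gamma].items():
                    if value == 0 or beta in (alpha, gamma):
                        continue
                    if dc[alpha][beta] in (0, 1):
                        # alpha reaches beta through gamma: record the intermediary.
                        dc[alpha][beta] = gamma
                        changed = True
    return dc
```

Each entry can change at most once, from 0 or 1 to an intermediary key, so the iteration terminates.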
Fig. 4 is a flowchart of obtaining entity columns using third normal form. Mis-marked approximate transitive dependencies are corrected according to the closure DC[m][n] of the approximate functional dependency set. Finally, the intermediary keys and the determining attributes that occur only in direct approximate functional dependencies are output as entity columns. The procedure for finding these determining attributes and intermediary keys includes:
Step 1: input DC[m][n] and FD[m][n];
Step 2: i := 0, j := 0, where i and j denote the row and column indices of DC[m][n];
Step 3: judge whether DC[i][j] ∉ {0, 1, 2} && FD[i][j] = 1 && FD[j][i] = 1 holds; if so, set DC[i][j] := 1 and go to step 4; otherwise, go directly to step 4;
Step 4: judge whether the traversal is complete; if so, set i := 0, j := 0 and go to step 5; otherwise, take the next DC[i][j] and go to step 3;
Step 5: judge whether DC[i][j] ∉ {0, 1, 2} holds; if so, add DC[i][j] to the set Entity and go to step 7; otherwise, go to step 6;
Step 6: judge whether DC[i][j] = 1 && i ≠ j holds; if so, add the determining attribute of row i to the set Entity and go to step 7; otherwise, go directly to step 7;
Step 7: judge whether the traversal is complete; if so, output the set Entity and end; otherwise, take the next DC[i][j] and return to step 5.
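A fairly literal reading of Steps 1 to 7 is sketched below as two passes over DC. It assumes the same nested-dictionary representation keyed by column name, assumes that any DC entry other than 0, 1, or 2 holds the name of an intermediary key, and treats FD[j][i] as 0 when j is not a determining attribute. The function name `entity_columns` is hypothetical.

```python
def entity_columns(dc, fd):
    """Collect entity columns from the closure DC and the relation matrix FD."""
    dc = {alpha: dict(row) for alpha, row in dc.items()}  # work on a copy

    # Pass 1 (Steps 2-4): an entry marked as transitive between two attributes
    # that directly determine each other is reset to a direct dependency.
    for i, row in dc.items():
        for j, value in row.items():
            mutual = fd[i][j] == 1 and fd.get(j, {}).get(i, 0) == 1
            if value not in (0, 1, 2) and mutual:
                row[j] = 1

    # Pass 2 (Steps 5-7): gather intermediary keys and determining attributes
    # that take part in a direct dependency.
    entities = []
    for i, row in dc.items():
        for j, value in row.items():
            if value not in (0, 1, 2):
                candidate = value      # Step 5: the intermediary key
            elif value == 1 and i != j:
                candidate = i          # Step 6: determining attribute of row i
            else:
                continue
            if candidate not in entities:
                entities.append(candidate)
    return entities
```

For a table where country determines capital and capital determines mayor, this reading returns both country and capital, which matches the multi-entity-column case; for a table where country alone determines the other columns, it returns only country.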
In summary, the approximate functional dependency detection method adapted to the characteristics of web tables, provided by the embodiments of the invention, expresses the functional dependencies among attributes more accurately. Because approximate functional dependencies are calculated from the support that both consistent and inconsistent data give to a functional dependency, the algorithm is markedly robust to noise.
The entity column discovery algorithm based on approximate functional dependencies and normalization, provided by the embodiments of the invention, can discover entity columns in more scenarios. The method applies not only to web tables with a single entity column but also to tables with multiple entity columns, and not only to web tables with headers but also to web tables that have no header or whose complete header cannot be recovered even with semantic recovery techniques.
Compared with the prior art, the method of the present invention offers high-quality entity column discovery and can discover multiple entity columns. To verify these advantages we ran a number of experiments on data from two sources: the open-source Wiki Table data set, and web tables that we crawled from the Web, which we call the Web Table data set. We divided the collected web tables by row count into a large-table data set (more than 100 rows), abbreviated as the L data set, and a small-table data set (below 100 rows), abbreviated as the S data set. To make it easier to validate single-entity-column and multi-entity-column discovery, we split the L data set into an L single-entity set (WiKi_LS and Web_LS) and an L multi-entity set (WiKi_LM and Web_LM), and the S data set into an S single-entity set (WiKi_SS and Web_SS) and an S multi-entity set (WiKi_SM and Web_SM).
The present invention discovers entity columns from the functional dependencies between column values and does not depend on header or knowledge-base information, which improves the quality of entity column discovery. To verify the effectiveness of the algorithm of the embodiment of the present invention (AFD_Model) at noise reduction, we also implemented a PFD_Model algorithm, which is identical to AFD_Model except that it does not take table noise into account. Fig. 3 gives a comparison of AFD_Model, PFD_Model and the evidence-based method (ED_Model) on single-entity-column tables in terms of entity column detection precision, coverage (recall), F-value and time efficiency. Fig. 5 shows that the algorithm of the invention, AFD_Model, is better overall than ED_Model and PFD_Model. In terms of precision, ED_Model requires that the header of the web table have a concept-attribute relation in the Probase knowledge base, so both the quality of the header and the coverage of the knowledge base affect its precision; AFD_Model does not depend on any header information or knowledge base, and its precision is therefore higher. Because AFD_Model takes the characteristics of web tables into account and has some noise-filtering ability, its entity detection precision is also higher than that of PFD_Model. In terms of recall, AFD_Model is higher than both ED_Model and PFD_Model. This is because AFD_Model does not require the web table to have a header, does not require an attribute relation between the entity column and the non-entity columns, and does not require such a concept-attribute relation to exist in Probase, while also having some noise-filtering ability; the algorithm is therefore more adaptable. The F-measure gauges the overall quality of an algorithm, and on this measure the algorithm of the invention has a clear advantage. In terms of running time, ED_Model takes significantly longer than AFD_Model and PFD_Model, because ED_Model must use Probase to look up the concept-attribute relations of the table header, or of a semantically recovered header, before it can determine the entity column, whereas the time complexity of AFD_Model and PFD_Model depends only on the size of the table.
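The comparison above is stated in terms of precision, recall and F-value. The text does not give the exact formulas used in the experiments, so the Python sketch below assumes the standard set-based definitions applied to the detected and the true entity columns of one table; the function name `precision_recall_f1` is hypothetical.

```python
from typing import Iterable, Tuple


def precision_recall_f1(
    detected: Iterable[str], truth: Iterable[str]
) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 over sets of column names (assumed)."""
    detected_set, truth_set = set(detected), set(truth)
    hits = len(detected_set & truth_set)  # correctly detected entity columns

    precision = hits / len(detected_set) if detected_set else 0.0
    recall = hits / len(truth_set) if truth_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under these assumed definitions, an algorithm that reports an extra, wrong column loses precision but not recall, while one that misses a true entity column of a multi-entity table loses recall.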
The method of the present invention also applies to tables with multiple entity columns, which significantly broadens its applicability. Because ED_Model cannot discover multiple entity columns, the method of the present invention is compared only with PFD_Model. Fig. 6 is a schematic comparison of the effectiveness of the AFD_Model algorithm provided by the embodiment of the present invention and the PFD_Model algorithm in multi-entity-column discovery. Fig. 6 shows that AFD_Model outperforms PFD_Model in precision, recall and F-value alike, because AFD_Model takes the influence of noisy data into account when it computes the approximate functional dependencies between attributes.
One of ordinary skill in the art will appreciate that the accompanying drawings are schematic diagrams of one embodiment, and that the modules or flows in the drawings are not necessarily required for implementing the present invention.
From the description of the embodiments above, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.
The embodiments in this specification are described progressively. Identical or similar parts of the embodiments may be referred to one another, and each embodiment emphasizes its differences from the other embodiments. In particular, the device and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant parts, refer to the description of the method embodiments. The device and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
The above is only a preferred specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be defined by the scope of the claims.
Claims (6)
1. A method for detecting the entity columns of a web table based on the dependencies between attributes, characterized by comprising:
for a web table, computing the approximate functional dependency probability between any two columns from the functional dependencies between column values, and obtaining a candidate functional dependency set from the approximate functional dependency probabilities;
deleting the noise functional dependencies from the candidate functional dependency set according to the characteristics of web tables, to obtain an approximate functional dependency set;
applying 3NF normalization to the approximate functional dependency set, and taking the set of primary keys produced by the 3NF normalization as the entity columns of the web table.
2. The method according to claim 1, characterized in that computing, for a web table, the approximate functional dependency probability between any two columns from the functional dependencies between column values, and obtaining the candidate functional dependency set from the approximate functional dependency probabilities, comprises:
Let X be an attribute of web table T and A an attribute of T different from X. If X → A holds for the (X, A) attribute value pairs of some of the tuples in T, then X is said to approximately functionally determine A, or A is said to be approximately functionally dependent on X. The approximate functional dependency probability is the probability that X → A holds on T. The (X, A) attribute value pairs for which X → A holds are called consistent data, and the remaining pairs are called inconsistent data;
For the tuples of web table T whose X attribute value is v_x, the A attribute column may contain different values; let the set of these different values be V_A;
If the most frequent value in V_A is unique, that value is taken as the consistent data. If the most frequent value is not unique, each of the most frequent values is taken as a class center, the sum of the similarities between the other values and each class center value is computed, and the class center value v_a with the largest sum is selected as the consistent data. The computation is given by formula (1), in which v_j is any class center value;
Among all tuples whose X value is v_x, the support S_c(X → A, V_X, V_A') that the consistent data v_a gives to X → A is computed by formula (2), where:
V_X = {X.r | X.r = v_x}
V_A' = {A.r | X.r = v_x & A.r = v_a}
|V_X, V_A'| = |{<X.r, A.r> | X.r = v_x & A.r = v_a}|
V_A' is the set of consistent data in column A when column X takes the value v_x; X.r is the value of the cell in row r of column X, and A.r is the value of the cell in row r of column A;
The support S_nc(X → A, V_X, V_A*) that the inconsistent data give to X → A is computed by formula (3);
The support that the set V_X gives to X → A is expressed as the weighted sum of the supports given by the consistent data and the inconsistent data, and is computed by formula (5), where ω1 + ω2 = 1;
The supports of all distinct values of X are averaged, and this average is taken as the probability that X → A holds in web table T; it is computed by formula (6), where |D_X| denotes the number of distinct V_X in X;
This probability expresses how likely an approximate functional dependency is to hold in web table T, and the candidate functional dependency set contains all possible approximate functional dependencies in web table T.
3. The method according to claim 2, characterized in that deleting the noise functional dependencies from the candidate functional dependency set according to the characteristics of web tables, to obtain the approximate functional dependency set, comprises:
If an approximate functional dependency in the candidate functional dependency set satisfies any one of the following 3 rules, it is deleted from the candidate approximate functional dependency set:
Rule 1: the type of the attribute values of column X is a date type, a floating-point type or a Boolean type;
Rule 2:If there is attribute column Y in network form T so thatSet up;
Rule 3:If being concentrated in candidate's approximate functional dependencies, there is such attribute column X and A so thatAnd
4. The method according to claim 3, characterized in that applying 3NF normalization to the approximate functional dependency set, and taking the set of primary keys produced by the 3NF normalization as the entity columns of the web table, comprises:
Mapping the approximate functional dependencies in the approximate functional dependency set to a relation matrix FD[m][n], and mapping the approximate functional dependencies between determinant attributes to a relation matrix KK[m][m], where m is the number of attributes that appear on the left-hand side of an approximate functional dependency, that is, the number of determinant attributes, and n is the number of all attribute columns in the web table:
(1) The elements of FD[m][n] are produced as follows:
Let α ∈ {set of determinant attributes} and β ∈ {set of all column attributes};
1) if α = β, then FD[α][β] := 2;
2) if β is approximately functionally dependent on α, then FD[α][β] := 1;
3) otherwise, FD[α][β] := 0;
(2) The elements of KK[m][m] are produced as follows:
Let α, γ ∈ {set of determinant attributes};
1) if α = γ, or an approximate functional dependency holds between α and γ, then KK[α][γ] := 1;
2) otherwise, KK[α][γ] := -1;
Definition: in web table T, if Z is approximately functionally dependent on X by way of an attribute Y, then Z is said to be approximately transitively functionally dependent on X, and Y is the intermediary key of that approximate transitive functional dependency;
The approximate functional dependency set closure DC[m][n] is determined from the relation matrix FD[m][n] and the relation matrix KK[m][m]; the determinant attributes that occur only in direct approximate functional dependencies, and the intermediary keys, are determined from the closure DC[m][n]; and these determinant attributes and intermediary keys are output as the entity columns of the web table.
5. The method according to claim 4, characterized in that determining the approximate functional dependency set closure DC[m][n] from the relation matrix FD[m][n] and the relation matrix KK[m][m] comprises:
Step 1: copy the elements of FD[m][n] into DC[m][n]; set i := 0, where i denotes the i-th approximate functional dependency in KK[m][m];
Step 2: set i := 1;
Step 3:Judge whetherExist in KK [m] [m], andExist in DC [m] [n], if it is, DC
[m][n]:=βiAnd perform step 4;Otherwise, step 4 is directly performed;
Step 4: test whether an (i+1)-th approximate functional dependency exists in KK[m][m]. If it exists, go to Step 5; otherwise go directly to Step 6;
Step 5: set i := i+1 and return to Step 3;
Step 6: test whether DC[m][n] has changed. If it has changed, return to Step 2; otherwise output DC[m][n] and end.
6. The method according to claim 5, characterized in that determining, from the approximate functional dependency set closure DC[m][n], the determinant attributes that occur only in direct approximate functional dependencies and the intermediary keys comprises:
Step 1: Input DC[m][n] and FD[m][n];
Step 2: Set i := 0, j := 0, where i and j are the row and column indices of DC[m][n];
Step 3: Test whether DC[i][j] ∉ {0, 1, 2} && FD[i][j] = 1 && FD[j][i] = 1 holds. If it holds, set DC[i][j] := 1 and go to Step 4; otherwise go directly to Step 4;
Step 4: Test whether the traversal is complete. If every element has been visited, set i := 0, j := 0 and go to Step 5; otherwise take the next DC[i][j] and go to Step 3;
Step 5: Test whether DC[i][j] ∉ {0, 1, 2} holds. If it holds, set Entity{} := DC[i][j] (that is, add DC[i][j] to the Entity set) and go to Step 7; otherwise go to Step 6;
Step 6: Test whether DC[i][j] = 1 && i != j holds. If it holds, add the determinant attribute of row i to the Entity set and go to Step 7; otherwise go directly to Step 7;
Step 7: Test whether the traversal is complete. If it is, output the Entity set and end; otherwise take the next DC[i][j] and continue with Step 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710002389.7A CN106844338B (en) | 2017-01-03 | 2017-01-03 | method for detecting entity column of network table based on dependency relationship between attributes |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106844338A (en) | 2017-06-13 |
CN106844338B (en) | 2019-12-10 |
Family
ID=59117509
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710002389.7A Expired - Fee Related CN106844338B (en) | 2017-01-03 | 2017-01-03 | method for detecting entity column of network table based on dependency relationship between attributes |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106844338B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595624A (en) * | 2018-04-23 | 2018-09-28 | 南京大学 | A kind of large-scale distributed functional dependence discovery method |
CN109472013A (en) * | 2018-10-25 | 2019-03-15 | 北京交通大学 | The foreign key relationship detection method of net list compartment based on fitting of distribution |
CN111061923A (en) * | 2019-12-13 | 2020-04-24 | 北京航空航天大学 | Graph data entity identification method and system based on graph dependence rule and supervised learning |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103077181A (en) * | 2012-11-20 | 2013-05-01 | 深圳市华傲数据技术有限公司 | Method for automatically generating approximate functional dependency rule |
CN104281563A (en) * | 2013-07-01 | 2015-01-14 | 国际商业机器公司 | Method and system for discovering relationships in tabular data |
CN104794222A (en) * | 2015-04-29 | 2015-07-22 | 北京交通大学 | Network table semantic recovery method |
US20150233705A1 (en) * | 2012-11-09 | 2015-08-20 | Kla-Tencor Corporation | Reducing algorithmic inaccuracy in scatterometry overlay metrology |
Non-Patent Citations (3)
Title |
---|
WANG D G 等: "Functional Dependency Generation and Applications in Pay-as-You-Go data Integration Systems", 《PROCEEDINGS OF THE 12TH INTERNATIONAL WORKSHOP ON THE WEB AND DATABASES》 * |
任向冉: "网络表格的实体列发现与标识", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
黎章海 等: "基于函数依赖的导出关系候选码计算", 《计算机工程》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108595624A (en) * | 2018-04-23 | 2018-09-28 | 南京大学 | A kind of large-scale distributed functional dependence discovery method |
CN109472013A (en) * | 2018-10-25 | 2019-03-15 | 北京交通大学 | The foreign key relationship detection method of net list compartment based on fitting of distribution |
CN111061923A (en) * | 2019-12-13 | 2020-04-24 | 北京航空航天大学 | Graph data entity identification method and system based on graph dependence rule and supervised learning |
CN111061923B (en) * | 2019-12-13 | 2022-08-02 | 北京航空航天大学 | Graph data entity recognition system based on graph dependence rule and supervised learning |
Also Published As
Publication number | Publication date |
---|---|
CN106844338B (en) | 2019-12-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20191210; Termination date: 20210103 |