CN106250412A - Knowledge graph construction method based on multi-source entity fusion - Google Patents


Publication number
CN106250412A
Authority
CN
China
Prior art keywords
page
synonym
similarity
edge
title
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610583823.0A
Other languages
Chinese (zh)
Other versions
CN106250412B (en)
Inventor
鲁伟明
戴豪
庄越挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201610583823.0A
Publication of CN106250412A
Application granted
Publication of CN106250412B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a knowledge graph construction method based on multi-source entity fusion. The method first crawls the three major Chinese encyclopedias (Baidu Baike, Hudong Baike, and Wikipedia) and preprocesses the data, including title synonym extraction, disambiguation page extraction, candidate set extraction, and text word segmentation. Then, for the pages in the same candidate set, it computes the pairwise features between pages, trains a classifier to compute the similarity between pages, and builds a weighted graph from the similarities. Finally, a mixed linear programming model constrains the relations between vertices in the weighted graph; maximizing the objective function yields the connectivity between vertices, and each connected component is taken as one entity, which gives all pages that describe the same entity. By introducing candidate sets, the invention greatly reduces the scale of the problem; the mixed linear programming model in turn improves the accuracy of entity fusion.

Description

Knowledge graph construction method based on multi-source entity fusion
Technical field
The present invention relates to text similarity computation methods, and in particular to a knowledge graph construction method based on multi-source entity fusion.
Background
With the rapid development of the Internet, people obtain information and knowledge through increasingly diverse channels, but massive amounts of data are scattered across every corner of the Internet, which is a major obstacle for users acquiring knowledge. Building a systematic and complete knowledge base is therefore urgent.
Many knowledge bases already exist. DBpedia, for example, is a representative Semantic Web application: it extracts structured data from Wikipedia entries to enhance Wikipedia's search capability and links other datasets to Wikipedia. Freebase is a large collaborative knowledge base that integrates many resources on the web. Entries in Freebase, like those in DBpedia, take the form of structured data. Accessing its data shows that almost all content is formatted and is stored and displayed as triples. This schema is fixed, and entries of the same type all contain the same attributes. As a result, data of the same kind can be linked together easily, which facilitates information queries. Freebase contains a very large number of topics and thousands of types and attributes. However, these knowledge bases are all in English, and there is currently no large-scale, complete knowledge base for Chinese.
Traditional entity matching algorithms for knowledge bases are mainly based on matching pairs of entities and formalize the problem as a classification problem. Most of these algorithms, however, depend heavily on the quality of the data templates. Web data is not presented in a unified triple format, and data from different sources differs considerably in its form of expression, so such methods are poorly suited to the problem addressed here.
Other matching algorithms also include the structural information of pages among the features. In entity matching between Chinese and English Wikipedia, for example, a considerable share of pages carry cross-language links, so this information can serve as prior knowledge. There are no links between the multi-source data considered here, however, so structural features of pages cannot be included among the features.
The Jaccard coefficient can be used to compute features between two sets. It is mainly used to compute the similarity between individuals described by symbolic or Boolean measures. Because the attributes of the individuals are identified by symbols or Boolean values, the magnitude of a difference cannot be measured; only whether two values are identical can be determined. The Jaccard coefficient is therefore concerned only with whether the attributes shared by the individuals are the same. To compare the Jaccard similarity coefficient of X and Y, only the identical elements of $X_n$ and $Y_n$ are counted.
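As a small worked illustration of the coefficient described above (the sets X and Y are made up for this example and do not come from the patent data):

```latex
J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}, \qquad
X = \{a, b, c\},\; Y = \{b, c, d\}
\;\Rightarrow\;
J(X, Y) = \frac{|\{b, c\}|}{|\{a, b, c, d\}|} = \frac{2}{4} = 0.5
```

Only membership matters here: the value does not change if an element occurs several times in the underlying text.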
Many algorithms are applied in feature similarity computation. The simplest compute the Euclidean distance or the cosine distance directly. Alternatively, a classifier can be trained on the features and then used to compute similarity. Random forest is a well-performing classifier that can be used for feature similarity computation. It trains on and predicts samples with many decision trees, and its output class is the mode of the classes output by the individual trees. Random forest has many advantages: for example, it can maintain high accuracy when features are missing, and it is comparatively resistant to overfitting.
Summary of the invention
To integrate multi-source encyclopedic knowledge and build a unified knowledge base, the present invention provides a knowledge graph construction method based on multi-source entity fusion. Encyclopedias from different sources usually contain multiple pages that describe the same entity; multi-source entity fusion can find these pages in massive data and map them to the same entity.
The technical solution adopted by the present invention to solve its technical problem is as follows: a knowledge graph construction method based on multi-source entity fusion, comprising the following steps:
1) Preprocess the encyclopedia pages: extract the synonyms of encyclopedia titles, extract the disambiguation pages, and use the transitivity of synonymy to build synonym groups; all synonym groups form the synonym group set. The pages corresponding to each synonym group in the set form a candidate set. Segment the text of the encyclopedia pages into words with a word segmentation tool.
2) Using the word segmentation results of step 1), compute the pairwise features between pages in the same candidate set, train a classifier to assign a different weight to each feature dimension, and use this classifier to compute the similarity between pages.
3) Build the weighted graph of the candidate set from the page similarities computed in step 2), define the objective function of a mixed linear programming model, and compute the maximum of the objective function to obtain the connectivity between vertices. Take each connected component of the weighted graph as one entity, thereby obtaining all pages that describe the same entity.
Further, step 1) comprises:
1.1) Extract the synonyms of encyclopedia titles, in the following two ways:
a) Template matching: match specific templates against the beginning of each page or the first sentence of its abstract; a successful match yields a synonym pair. The templates are defined manually and cover most of the patterns in which synonym pairs occur.
b) Link redirection: a hyperlink in a page jumps to another page; if the title of the target page differs from the text of the hyperlink, the two terms are considered synonyms.
1.2) Extract the disambiguation pages: the k-th encyclopedia is denoted $\varepsilon_k = \{a_1, \ldots, a_n\}$, with k at most 3, where $a_i$ denotes a page and n the total number of pages. From all pages listed on a disambiguation page, a disambiguation page set M can be extracted; no two pages in M can represent the same entity.
$$M = \{\, a_i \in \varepsilon_k \mid a_i \neq a_j,\ \forall\, a_i, a_j \in M,\ i \neq j \,\}$$
1.3) Extract the candidate sets: by the transitivity of synonymy, if A and B are synonyms and A and C are synonyms, then B and C are also synonyms. In this way the synonym groups $S_t$ are obtained; all synonym groups $S_t$ form the synonym group set, and any two elements of a synonym group are synonyms of each other.
Given $S_t$, find the pages from all encyclopedia sources whose titles belong to $S_t$; these pages form the candidate set $P_t$:
$$P_t = \{\, a \in \varepsilon_{1,\ldots,K} \mid a.\mathrm{Title} \in S_t \,\}$$
where K is the total number of encyclopedias and a.Title is the title of page a.
1.4) Segment the text of the encyclopedia pages: segment 5 fields of each page (abstract, infobox keys and values, links, table of contents, and user tags), and remove stop words and words of length less than 2.
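The grouping in step 1.3) can be sketched with a union-find structure, which merges synonym pairs transitively. This is a minimal illustration, not the patent's implementation: the function names `build_synonym_groups` and `candidate_set`, the page dictionaries, and the example titles are all assumptions made for the sketch.

```python
def build_synonym_groups(pairs):
    """Merge synonym pairs transitively into synonym groups S_t (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for word in list(parent):
        groups.setdefault(find(word), set()).add(word)
    return list(groups.values())


def candidate_set(group, pages):
    """Pages from any encyclopedia whose title belongs to the synonym group."""
    return [page for page in pages if page["title"] in group]


# Hypothetical data: synonym pairs and pages from different sources.
pairs = [("土豆", "马铃薯"), ("土豆", "洋芋"), ("番茄", "西红柿")]
pages = [
    {"source": "baidu", "title": "马铃薯"},
    {"source": "hudong", "title": "土豆"},
    {"source": "wiki", "title": "番茄"},
]
for group in build_synonym_groups(pairs):
    print(sorted(group), [p["source"] for p in candidate_set(group, pages)])
```

Union-find is one convenient choice because each new synonym pair is merged in near-constant time; any method that computes the transitive closure of the pairs would give the same groups.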
Further, step 2) comprises:
2.1) Define the 6 fields that a page contains: title T, abstract A, infobox I, table of contents C, user tags G, and links L. A page is represented as a 6-tuple:
$$a = \{T, A, I, C, G, L\}$$
The infobox is represented as key-value pairs, so $I = \{P, V\}$, where P denotes the attributes and V the attribute values.
For two pages in the same candidate set, if they describe the same entity, their text overlap will be high, so the following 7 features are defined:
1) Abstract feature
$$f_a(a_i, a_j) = \frac{|S_w(a_i.A) \cap S_w(a_j.A)|}{|S_w(a_i.A) \cup S_w(a_j.A)|}$$
2) Infobox attribute feature
$$f_p(a_i, a_j) = \frac{|S_w(a_i.I.P) \cap S_w(a_j.I.P)|}{|S_w(a_i.I.P) \cup S_w(a_j.I.P)|}$$
3) Infobox attribute value feature
$$f_v(a_i, a_j) = \frac{|S_w(a_i.I.V) \cap S_w(a_j.I.V)|}{|S_w(a_i.I.V) \cup S_w(a_j.I.V)|}$$
4) Table of contents feature
$$f_C(a_i, a_j) = \frac{|S_w(a_i.C) \cap S_w(a_j.C)|}{|S_w(a_i.C) \cup S_w(a_j.C)|}$$
5) User tag feature
$$f_g(a_i, a_j) = \frac{|S_w(a_i.G) \cap S_w(a_j.G)|}{|S_w(a_i.G) \cup S_w(a_j.G)|}$$
6) Link feature
$$f_l(a_i, a_j) = \frac{|S_w(a_i.L) \cap S_w(a_j.L)|}{|S_w(a_i.L) \cup S_w(a_j.L)|}$$
7) Global feature, where S denotes the string concatenation of the 6-tuple {T, A, I, C, G, L}
$$f_{all}(a_i, a_j) = \frac{|S_w(a_i.S) \cap S_w(a_j.S)|}{|S_w(a_i.S) \cup S_w(a_j.S)|}$$
$S_w(X)$ denotes the set of words obtained by segmenting the string X.
2.2) Take the 7 features obtained in step 2.1) as the input of the classifier, train a binary classifier with the RandomForest algorithm in the Weka package, and use this binary classifier to predict the similarity between two pages.
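The seven features of step 2.1) are all Jaccard coefficients over segmented fields, so they can be sketched in a few lines. The sketch below assumes each page is a dictionary from a field name (`T`, `A`, `P`, `V`, `C`, `G`, `L`) to an already segmented token list; that layout, the function names, and the choice to score two empty fields as 0.0 are assumptions, not details given in the source. The global feature pools the tokens of all fields, which approximates segmenting the concatenated string S.

```python
# Field keys: abstract, infobox attributes, infobox values, contents, tags, links.
FIELDS = ("A", "P", "V", "C", "G", "L")


def jaccard(x, y):
    """Jaccard coefficient |x ∩ y| / |x ∪ y| of two token collections."""
    x, y = set(x), set(y)
    union = x | y
    # Assumption: two empty fields score 0.0 (the source does not define this case).
    return len(x & y) / len(union) if union else 0.0


def pair_features(page_i, page_j):
    """Return [f_a, f_p, f_v, f_C, f_g, f_l, f_all] for two pages."""
    features = [jaccard(page_i.get(f, ()), page_j.get(f, ())) for f in FIELDS]
    # Global feature: tokens of every field, title included, pooled together.
    all_i = {t for f in ("T",) + FIELDS for t in page_i.get(f, ())}
    all_j = {t for f in ("T",) + FIELDS for t in page_j.get(f, ())}
    features.append(jaccard(all_i, all_j))
    return features


# Hypothetical pages with toy tokens.
page_1 = {"T": ["potato"], "A": ["x", "y"], "P": ["k1"], "V": ["v1"], "G": ["g"], "L": ["l1", "l2"]}
page_2 = {"T": ["spud"], "A": ["y", "z"], "P": ["k1"], "V": ["v2"], "G": ["g"], "L": ["l2"]}
print(pair_features(page_1, page_2))
```

The resulting 7-element vector is what step 2.2) feeds to the classifier.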
Further, step 3) specifically comprises the following steps:
3.1) Build the weighted graph of the candidate set from the page similarities computed in step 2); the weight of the edge between two nodes is their similarity. The original problem is thus converted into an edge selection problem. Let $y_{ij}$ indicate whether there is an edge between two nodes:
$$y_{ij} = \begin{cases} 1, & \text{if there is an edge between } a_i \text{ and } a_j \\ 0, & \text{otherwise} \end{cases}$$
Further penalty terms and constraints are introduced to build the mixed linear programming model:
Penalty term 1:
If $a_i$ and $a_j$ share an edge, and $a_i$ and $a_k$ share an edge, then there should also be an edge between $a_j$ and $a_k$; otherwise a penalty term φ is added, multiplied by a coefficient u as a tuning parameter. φ therefore satisfies the following constraints:
$$y_{ij} + y_{ik} \le 1 + y_{jk} + \phi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$
$$\phi_{jk} \ge 0$$
Penalty term 2:
The higher the similarity between $a_i$ and $a_j$, the more likely there is an edge between them. For $a_i$ and $a_j$ with very low similarity, the penalty is large if there is an edge between them; if their similarity is high, the penalty is small. $\psi_{ij}$ denotes the penalty term and λ the tuning parameter; the penalty is constrained as follows:
$$\lambda \, |y_{ij} - \mathrm{sim}(a_i, a_j)| \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$
$$\psi_{ij} \ge 0$$
where $\mathrm{sim}(a_i, a_j)$ is the weight between $a_i$ and $a_j$.
Penalty term 3:
For $a_i$ and $a_j$ that appear together in a disambiguation page set M, $y_{ij} = 1$ indicates a matching error, so a penalty term $\zeta_{ij}$ is used to enforce that there is no edge between $a_i$ and $a_j$. This constraint is expressed as follows:
$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n,\ n = 1, 2, \ldots, N$$
$$\zeta_{ij} \ge 0$$
where N is the number of disambiguation page sets.
In addition, a threshold τ is set on the similarity: only pages $a_i$ and $a_j$ whose similarity exceeds τ may share an edge.
Combining the penalty terms and the threshold above gives the following objective function:
$$\text{maximize} \sum_{a_i, a_j \in P_t} \left( y_{ij} \cdot \mathrm{sim}(a_i, a_j) - u \cdot \phi_{ij} - \psi_{ij} \right) - \sum_{n=1}^{N} \sum_{a_i, a_j \in M_n} \zeta_{ij}$$
$$\text{s.t.} \quad y_{ij} \in \{0, 1\}, \quad \phi_{ij}, \psi_{ij}, \zeta_{ij} \ge 0$$
$$y_{ij} + y_{ik} \le 1 + y_{jk} + \phi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$
$$\lambda \, |y_{ij} - \mathrm{sim}(a_i, a_j)| \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$
$$\mathrm{sim}(a_i, a_j) > y_{ij} \cdot \tau, \quad \forall a_i, a_j \in P_t$$
$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n,\ n = 1, 2, \ldots, N$$
Solve for the maximum of this objective function to obtain the edge parameters $y_{ij}$ corresponding to the maximum.
3.2) Take each connected component of the weighted graph as one entity, obtaining all pages that describe that entity.
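For a very small candidate set, the objective of step 3.1) can be explored by brute force, which makes the interplay of the penalty terms easy to see. The sketch below is not a linear programming solver and is not the patent's implementation; it enumerates every 0/1 assignment of $y_{ij}$ and, for each one, uses the smallest slack values the constraints allow. Several points are assumptions of this sketch: the name `solve_edges`, the default values of u, λ and τ, reading the threshold as "an edge needs sim > τ", and charging $\zeta_{ij}$ as 1 per selected conflicting pair (the infimum permitted by the strict inequality $y_{ij} < \zeta_{ij}$).

```python
from itertools import combinations, product


def solve_edges(n, sim, u=1.0, lam=0.5, tau=0.5, disamb_sets=()):
    """Exhaustively maximize the objective over all edge assignments y.

    n: number of pages in the candidate set, indexed 0..n-1
    sim: dict mapping each pair (i, j) with i < j to sim(a_i, a_j)
    disamb_sets: iterable of sets of page indices (the sets M_n)
    """
    pairs = list(combinations(range(n), 2))
    # Pairs that appear together in some disambiguation page set M_n.
    conflict = {p for m in disamb_sets for p in combinations(sorted(m), 2)}
    best_value, best_edges = float("-inf"), set()
    for bits in product((0, 1), repeat=len(pairs)):
        y = dict(zip(pairs, bits))
        # Threshold: an edge is allowed only if its similarity exceeds tau.
        if any(y[p] == 1 and sim[p] <= tau for p in pairs):
            continue
        value = 0.0
        for j, k in pairs:
            # phi_jk: smallest slack with y_ij + y_ik <= 1 + y_jk + phi_jk for all i.
            phi = 0
            for i in range(n):
                if i not in (j, k):
                    y_ij = y[tuple(sorted((i, j)))]
                    y_ik = y[tuple(sorted((i, k)))]
                    phi = max(phi, y_ij + y_ik - 1 - y[(j, k)])
            # psi_jk: smallest value with lam * |y_jk - sim| <= psi_jk.
            psi = lam * abs(y[(j, k)] - sim[(j, k)])
            value += y[(j, k)] * sim[(j, k)] - u * phi - psi
        # zeta (assumption): cost 1 for every selected edge inside a set M_n.
        value -= sum(y[p] for p in conflict)
        if value > best_value:
            best_value, best_edges = value, {p for p in pairs if y[p] == 1}
    return best_edges, best_value


# Hypothetical similarities for three pages.
sim = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.2}
print(solve_edges(3, sim))
print(solve_edges(3, sim, disamb_sets=[{0, 2}]))
```

Enumeration grows as 2 to the power of the number of page pairs, so it is only useful for inspecting toy cases; candidate sets of realistic size call for an actual mixed linear programming solver.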
Compared with the prior art, the method of the invention has the following advantages:
1. The method uses title synonyms to obtain title candidate sets, derives page candidate sets from the title candidate sets, and computes page similarity within each page candidate set. This greatly reduces the scale of the problem and makes the subsequent algorithms simpler to implement.
2. Based on the page structure, the method extracts the Jaccard coefficients of 7 text features and uses the random forest algorithm to compute the similarity between pages; this similarity accurately reflects how similar the pages are.
3. The method models the similarity between pages on a graph and uses a mixed linear programming model to obtain the relations between vertices of the graph, that is, the relations between pages. These relations define an undirected graph, from which all pages describing one entity can be obtained accurately.
Brief description of the drawings
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the flowchart of step 2);
Fig. 3 is the flowchart of step 3);
Fig. 4 is the flowchart of step 4).
Detailed description of the invention
The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.
As shown in Figs. 1 to 4, the steps of the knowledge graph construction method based on multi-source entity fusion are as follows:
1) Preprocess the encyclopedia pages: extract the synonyms of encyclopedia titles, extract the disambiguation pages, and use the transitivity of synonymy to build synonym groups; all synonym groups form the synonym group set. The pages corresponding to each synonym group in the set form a candidate set. Segment the text of the encyclopedia pages into words with a word segmentation tool.
2) Using the word segmentation results of step 1), compute the pairwise features between pages in the same candidate set, train a classifier to assign a different weight to each feature dimension, and use this classifier to compute the similarity between pages.
3) Build the weighted graph of the candidate set from the page similarities computed in step 2), define the objective function of a mixed linear programming model, and compute the maximum of the objective function to obtain the connectivity between vertices. Take each connected component of the weighted graph as one entity, thereby obtaining all pages that describe the same entity.
Step 1) is as follows:
1.1) Extract the synonyms of encyclopedia titles, in the following two ways:
a) Template matching: match specific templates against the beginning of each page or the first sentence of its abstract; a successful match yields a synonym pair. The templates are defined manually and cover most of the patterns in which synonym pairs occur. For example, on a page for a term that has synonyms, the beginning of the page or the first sentence of the abstract usually contains strings such as "A is also named B", "A is also called B", or "A is a synonym of B"; matching these with regular expressions yields part of the synonym pairs.
b) Link redirection: a hyperlink in a page jumps to another page; if the title of the target page differs from the text of the hyperlink, the two terms are considered synonyms.
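Both extraction routes in step 1.1) can be sketched briefly. The regular expressions below are illustrative guesses at how templates such as "A is also named B" might look in Chinese text (using 又名, 又称, and 是…的同义词); the actual templates used by the inventors are not given in the source, and the function names are hypothetical.

```python
import re

# Hypothetical templates; the patent does not list its actual patterns.
TEMPLATES = [
    re.compile(r"^(?P<a>[^,。;]+?)(?:又名|又称)(?P<b>[^,。;]+)"),
    re.compile(r"^(?P<a>[^,。;]+?)是(?P<b>[^,。;]+?)的同义词"),
]


def synonym_from_sentence(sentence):
    """Return a synonym pair (A, B) if the first sentence matches a template."""
    for template in TEMPLATES:
        match = template.match(sentence)
        if match:
            return match.group("a"), match.group("b")
    return None


def synonym_from_redirect(link_text, target_title):
    """A hyperlink whose text differs from the target page title yields a pair."""
    if link_text != target_title:
        return link_text, target_title
    return None


print(synonym_from_sentence("马铃薯又名土豆,是茄科植物。"))
print(synonym_from_redirect("洋芋", "马铃薯"))
```

The pairs produced by both routes would then be merged into synonym groups through the transitivity rule of step 1.3).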
1.2) Extract the disambiguation pages: the k-th encyclopedia is denoted $\varepsilon_k = \{a_1, \ldots, a_n\}$, with k at most 3, where $a_i$ denotes a page and n the total number of pages. From all pages listed on a disambiguation page, a disambiguation page set M can be extracted; no two pages in M can represent the same entity.
$$M = \{\, a_i \in \varepsilon_k \mid a_i \neq a_j,\ \forall\, a_i, a_j \in M,\ i \neq j \,\}$$
1.3) Extract the candidate sets: by the transitivity of synonymy, if A and B are synonyms and A and C are synonyms, then B and C are also synonyms. In this way the synonym groups $S_t$ are obtained; all synonym groups $S_t$ form the synonym group set, and any two elements of a synonym group are synonyms of each other.
Given $S_t$, find the pages from all encyclopedia sources whose titles belong to $S_t$; these pages form the candidate set $P_t$:
$$P_t = \{\, a \in \varepsilon_{1,\ldots,K} \mid a.\mathrm{Title} \in S_t \,\}$$
where K is the total number of encyclopedias and a.Title is the title of page a.
1.4) Segment the text of the encyclopedia pages: segment 5 fields of each page (abstract, infobox keys and values, links, table of contents, and user tags), and remove stop words and words of length less than 2.
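The filtering part of step 1.4) is simple to illustrate. The sketch assumes the field has already been segmented by some word segmentation tool (the source does not name one) and that length is counted in characters; the function name and the stop-word list are hypothetical.

```python
def clean_tokens(tokens, stop_words):
    """Drop stop words and tokens shorter than 2 characters."""
    return [t for t in tokens if len(t) >= 2 and t not in stop_words]


# Hypothetical segmented abstract and stop-word list.
tokens = ["马铃薯", "是", "一种", "的", "植物"]
print(clean_tokens(tokens, stop_words={"一种"}))
```

The cleaned token lists of the five fields are the inputs from which the sets $S_w(\cdot)$ in step 2.1) are built.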
Step 2) comprises:
2.1) Define the 6 fields that a page contains: title T, abstract A, infobox I, table of contents C, user tags G, and links L. A page is represented as a 6-tuple:
$$a = \{T, A, I, C, G, L\}$$
The infobox is represented as key-value pairs, so $I = \{P, V\}$, where P denotes the attributes and V the attribute values.
For two pages in the same candidate set, if they describe the same entity, their text overlap will be high, so the following 7 features are defined:
1) Abstract feature
$$f_a(a_i, a_j) = \frac{|S_w(a_i.A) \cap S_w(a_j.A)|}{|S_w(a_i.A) \cup S_w(a_j.A)|}$$
2) Infobox attribute feature
$$f_p(a_i, a_j) = \frac{|S_w(a_i.I.P) \cap S_w(a_j.I.P)|}{|S_w(a_i.I.P) \cup S_w(a_j.I.P)|}$$
3) Infobox attribute value feature
$$f_v(a_i, a_j) = \frac{|S_w(a_i.I.V) \cap S_w(a_j.I.V)|}{|S_w(a_i.I.V) \cup S_w(a_j.I.V)|}$$
4) Table of contents feature
$$f_C(a_i, a_j) = \frac{|S_w(a_i.C) \cap S_w(a_j.C)|}{|S_w(a_i.C) \cup S_w(a_j.C)|}$$
5) User tag feature
$$f_g(a_i, a_j) = \frac{|S_w(a_i.G) \cap S_w(a_j.G)|}{|S_w(a_i.G) \cup S_w(a_j.G)|}$$
6) Link feature
$$f_l(a_i, a_j) = \frac{|S_w(a_i.L) \cap S_w(a_j.L)|}{|S_w(a_i.L) \cup S_w(a_j.L)|}$$
7) Global feature, where S denotes the string concatenation of the 6-tuple {T, A, I, C, G, L}
$$f_{all}(a_i, a_j) = \frac{|S_w(a_i.S) \cap S_w(a_j.S)|}{|S_w(a_i.S) \cup S_w(a_j.S)|}$$
$S_w(X)$ denotes the set of words obtained by segmenting the string X.
2.2) Take the 7 features obtained in step 2.1) as the input of the classifier, train a binary classifier with the RandomForest algorithm in the Weka package, and use this binary classifier to predict the similarity between two pages.
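One plausible way a tree ensemble turns the 7 features into a similarity is to report the fraction of trees that vote "same entity". The sketch below illustrates only that voting mechanism with hand-written one-level trees (stumps); it does not train anything and does not call Weka, and the stump thresholds, the function name, and the reading of similarity as a vote fraction are all assumptions rather than details from the source.

```python
def vote_similarity(features, stumps):
    """Fraction of stumps voting 'same entity' for a 7-dimensional feature vector.

    Each stump is (feature_index, threshold) and votes 1 when the feature
    exceeds the threshold. A trained forest would learn its trees from
    labeled page pairs instead of using fixed stumps.
    """
    votes = [1 if features[index] > threshold else 0 for index, threshold in stumps]
    return sum(votes) / len(votes)


# Hypothetical stumps over f_a (index 0), f_v (index 2) and f_all (index 6).
stumps = [(0, 0.3), (6, 0.2), (2, 0.5)]
features = [0.5, 0.0, 0.1, 0.0, 0.0, 0.0, 0.4]
score = vote_similarity(features, stumps)
print(score, "same entity" if score > 0.5 else "different entities")
```

The majority label corresponds to the mode of the tree outputs, while the vote fraction gives a graded score that can serve as the edge weight in step 3).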
Step 3) specifically comprises the following steps:
3.1) Build the weighted graph of the candidate set from the page similarities computed in step 2); the weight of the edge between two nodes is their similarity. The original problem is thus converted into an edge selection problem. Let $y_{ij}$ indicate whether there is an edge between two nodes:
$$y_{ij} = \begin{cases} 1, & \text{if there is an edge between } a_i \text{ and } a_j \\ 0, & \text{otherwise} \end{cases}$$
Further penalty terms and constraints are introduced to build the mixed linear programming model:
Penalty term 1:
If $a_i$ and $a_j$ share an edge, and $a_i$ and $a_k$ share an edge, then there should also be an edge between $a_j$ and $a_k$; otherwise a penalty term φ is added, multiplied by a coefficient u as a tuning parameter. φ therefore satisfies the following constraints:
$$y_{ij} + y_{ik} \le 1 + y_{jk} + \phi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$
$$\phi_{jk} \ge 0$$
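As a worked reading of the constraint on φ (the values of y are chosen here only for illustration), consider the two cases in which $a_i$ is linked to both $a_j$ and $a_k$:

```latex
\begin{aligned}
y_{ij} = 1,\; y_{ik} = 1,\; y_{jk} = 0
  &\;\Rightarrow\; 1 + 1 \le 1 + 0 + \phi_{jk}
  \;\Rightarrow\; \phi_{jk} \ge 1 \\
y_{ij} = 1,\; y_{ik} = 1,\; y_{jk} = 1
  &\;\Rightarrow\; 1 + 1 \le 1 + 1 + \phi_{jk}
  \;\Rightarrow\; \phi_{jk} \ge 0
\end{aligned}
```

In the first case the missing edge between $a_j$ and $a_k$ forces a slack of at least 1, which costs u in the objective; in the second case the triangle is closed and no penalty is incurred.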
Penalty term 2:
The higher the similarity between $a_i$ and $a_j$, the more likely there is an edge between them. For $a_i$ and $a_j$ with very low similarity, the penalty is large if there is an edge between them; if their similarity is high, the penalty is small. $\psi_{ij}$ denotes the penalty term and λ the tuning parameter; the penalty is constrained as follows:
$$\lambda \, |y_{ij} - \mathrm{sim}(a_i, a_j)| \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$
$$\psi_{ij} \ge 0$$
where $\mathrm{sim}(a_i, a_j)$ is the weight between $a_i$ and $a_j$.
Penalty term 3:
For $a_i$ and $a_j$ that appear together in a disambiguation page set M, $y_{ij} = 1$ indicates a matching error, so a penalty term $\zeta_{ij}$ is used to enforce that there is no edge between $a_i$ and $a_j$. This constraint is expressed as follows:
$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n,\ n = 1, 2, \ldots, N$$
$$\zeta_{ij} \ge 0$$
where N is the number of disambiguation page sets.
In addition, a threshold τ is set on the similarity: only pages $a_i$ and $a_j$ whose similarity exceeds τ may share an edge.
Combining the penalty terms and the threshold above gives the following objective function:
$$\text{maximize} \sum_{a_i, a_j \in P_t} \left( y_{ij} \cdot \mathrm{sim}(a_i, a_j) - u \cdot \phi_{ij} - \psi_{ij} \right) - \sum_{n=1}^{N} \sum_{a_i, a_j \in M_n} \zeta_{ij}$$
$$\text{s.t.} \quad y_{ij} \in \{0, 1\}, \quad \phi_{ij}, \psi_{ij}, \zeta_{ij} \ge 0$$
$$y_{ij} + y_{ik} \le 1 + y_{jk} + \phi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$
$$\lambda \, |y_{ij} - \mathrm{sim}(a_i, a_j)| \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$
$$\mathrm{sim}(a_i, a_j) > y_{ij} \cdot \tau, \quad \forall a_i, a_j \in P_t$$
$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n,\ n = 1, 2, \ldots, N$$
Solve for the maximum of this objective function to obtain the edge parameters $y_{ij}$ corresponding to the maximum.
3.2) Take each connected component of the weighted graph as one entity, obtaining all pages that describe that entity.
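Step 3.2) reduces to finding the connected components of the graph formed by the selected edges. A minimal breadth-first sketch follows; the function name and the integer page indices are assumptions made for the example.

```python
from collections import deque


def connected_components(n, edges):
    """Group page indices 0..n-1 into connected components of the selected edges."""
    adjacency = {v: set() for v in range(n)}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adjacency[v] - seen:
                seen.add(w)
                queue.append(w)
        components.append(sorted(component))
    return components


# Hypothetical result of the edge selection: pages 0-2 are linked, as are 3-4.
print(connected_components(5, {(0, 1), (1, 2), (3, 4)}))
```

Each returned list is read as one entity; a page with no selected edge forms an entity on its own.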
Embodiment
The implementation steps of an example of the present invention are given in detail below:
(1) The dataset used in the example comes from Baidu Baike and Hudong Baike; Baidu Baike contributes 10,143,321 pages and Hudong Baike 6,618,544 pages.
(2) For all pages in (1), analyze the page structure, extract information such as the title, abstract, table of contents, categories, links, and infobox, and store this information in a Lucene index. Every field except the title may be empty.
(3) From all pages in (1), extract title synonyms. The synonym extraction methods mainly include template matching and link redirection. From the extracted synonym pairs, the title synonym sets are then obtained. These title synonym sets are matched against the page titles in (1) to obtain the candidate set pages.
(4) Within the candidate set pages obtained in (3), extract the pairwise features between pages and use these features as input to train the random forest classifier. This step requires a manually labeled training set.
(5) Based on the similarity matrix obtained in step (4), build the mixed linear programming model. The model yields the relations between vertices, where 1 indicates an edge between two vertices and 0 indicates no edge. With these vertices and edges as input, an undirected graph can be built. Extract each connected component of the undirected graph; the pages in each connected component represent one entity.
Results of this example:
For similarity computation, 5 methods were compared, and the random forest classifier was found to perform best. The method used here (SCM) was compared with other methods, namely greedy matching (GA), hierarchical clustering (AC), minimum spanning tree clustering (MSTC), and correlation clustering (CC), on four evaluation metrics: Precision, Recall, F1, and Accuracy. The results are shown in the following table:
Method  Precision  Recall  F1     Accuracy
GA      78.3%      76.1%   77.2%  91.6%
AC      73.0%      79.0%   75.9%  91.5%
MSTC    63.4%      80.5%   71%    88.8%
CC      62.4%      65.5%   63.9%  87.4%
SCM     75.8%      82.5%   79.0%  92.5%
The comparison in the table shows that this method outperforms the other methods on F1 and Accuracy. The method therefore has good practical value and application prospects for entity matching.
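The F1 column can be cross-checked against the Precision and Recall columns, since F1 is their harmonic mean. The short sketch below recomputes it from the reported percentages; the dictionary layout and names are just for this check.

```python
# (precision, recall, reported F1) in percent, copied from the table above.
REPORTED = {
    "GA": (78.3, 76.1, 77.2),
    "AC": (73.0, 79.0, 75.9),
    "MSTC": (63.4, 80.5, 71.0),
    "CC": (62.4, 65.5, 63.9),
    "SCM": (75.8, 82.5, 79.0),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for method, (precision, recall, reported) in REPORTED.items():
    print(f"{method}: recomputed {f1_score(precision, recall):.1f}, reported {reported}")
```

The recomputed values agree with the reported column to within rounding, which confirms that SCM has the highest F1 of the five methods while GA keeps the highest Precision.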

Claims (4)

1. A knowledge graph construction method based on multi-source entity fusion, characterized by comprising the following steps:
1) Preprocess the encyclopedia pages: extract the synonyms of encyclopedia titles, extract the disambiguation pages, and use the transitivity of synonymy to build synonym groups; all synonym groups form the synonym group set. The pages corresponding to each synonym group in the set form a candidate set. Segment the text of the encyclopedia pages into words with a word segmentation tool.
2) Using the word segmentation results of step 1), compute the pairwise features between pages in the same candidate set, train a classifier to assign a different weight to each feature dimension, and use this classifier to compute the similarity between pages.
3) Build the weighted graph of the candidate set from the page similarities computed in step 2), define the objective function of a mixed linear programming model, and compute the maximum of the objective function to obtain the connectivity between vertices. Take each connected component of the weighted graph as one entity, thereby obtaining all pages that describe the same entity.
2. The knowledge graph construction method based on multi-source entity fusion according to claim 1, characterized in that step 1) comprises:
1.1) Extract the synonyms of encyclopedia titles, in the following two ways:
a) Template matching: match specific templates against the beginning of each page or the first sentence of its abstract; a successful match yields a synonym pair. The templates are defined manually and cover most of the patterns in which synonym pairs occur.
b) Link redirection: a hyperlink in a page jumps to another page; if the title of the target page differs from the text of the hyperlink, the two terms are considered synonyms.
1.2) Extract the disambiguation pages: the k-th encyclopedia is denoted $\varepsilon_k = \{a_1, \ldots, a_n\}$, with k at most 3, where $a_i$ denotes a page and n the total number of pages. From all pages listed on a disambiguation page, a disambiguation page set M can be extracted; no two pages in M can represent the same entity.
$$M = \{\, a_i \in \varepsilon_k \mid a_i \neq a_j,\ \forall\, a_i, a_j \in M,\ i \neq j \,\}$$
1.3) Extract the candidate sets: by the transitivity of synonymy, if A and B are synonyms and A and C are synonyms, then B and C are also synonyms. In this way the synonym groups $S_t$ are obtained; all synonym groups $S_t$ form the synonym group set, and any two elements of a synonym group are synonyms of each other.
Given $S_t$, find the pages from all encyclopedia sources whose titles belong to $S_t$; these pages form the candidate set $P_t$:
$$P_t = \{\, a \in \varepsilon_{1,\ldots,K} \mid a.\mathrm{Title} \in S_t \,\}$$
where K is the total number of encyclopedias and a.Title is the title of page a.
1.4) Segment the text of the encyclopedia pages: segment 5 fields of each page (abstract, infobox keys and values, links, table of contents, and user tags), and remove stop words and words of length less than 2.
3. according to a kind of knowledge mapping construction method merged based on many source entities described in claim 1, it is characterised in that Described step 2) including:
2.1) 6 territories that one page of definition is comprised, including title T, make a summary A, message box I, catalogue C, user tag G and chain Meet L, represent a page by 6 tuples:
A={T, A, I, C, G, L}
Wherein message box is expressed as key-value pair, therefore I={P, V}, and wherein P represents that attribute, V represent property value;
For belonging to 2 pages of same Candidate Set, if what they described is an entity, then their text is overlapping Rate can be bigger, therefore following 7 features of definition, as follows:
1) Abstract feature
$$f_a(a_i, a_j) = \frac{|S_w(a_i.A) \cap S_w(a_j.A)|}{|S_w(a_i.A) \cup S_w(a_j.A)|}$$
2) Infobox attribute feature
$$f_p(a_i, a_j) = \frac{|S_w(a_i.I.P) \cap S_w(a_j.I.P)|}{|S_w(a_i.I.P) \cup S_w(a_j.I.P)|}$$
3) Infobox attribute value feature
$$f_v(a_i, a_j) = \frac{|S_w(a_i.I.V) \cap S_w(a_j.I.V)|}{|S_w(a_i.I.V) \cup S_w(a_j.I.V)|}$$
4) Catalog feature
$$f_C(a_i, a_j) = \frac{|S_w(a_i.C) \cap S_w(a_j.C)|}{|S_w(a_i.C) \cup S_w(a_j.C)|}$$
5) User tag feature
$$f_g(a_i, a_j) = \frac{|S_w(a_i.G) \cap S_w(a_j.G)|}{|S_w(a_i.G) \cup S_w(a_j.G)|}$$
6) Link feature
$$f_l(a_i, a_j) = \frac{|S_w(a_i.L) \cap S_w(a_j.L)|}{|S_w(a_i.L) \cup S_w(a_j.L)|}$$
7) Global feature, where S denotes the string concatenation of the 6-tuple {T, A, I, C, G, L}
$$f_{all}(a_i, a_j) = \frac{|S_w(a_i.S) \cap S_w(a_j.S)|}{|S_w(a_i.S) \cup S_w(a_j.S)|}$$
S_w(X) denotes the set of tokens obtained by segmenting the string X.
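All seven features share the same form, the size of the intersection of two token sets divided by the size of their union (the Jaccard coefficient). The sketch below computes them for a pair of pages under stated assumptions: each field is already a token set (that is, S_w has been applied), the global feature is approximated by the union of the field token sets rather than by re-segmenting a concatenated string, and an empty union yields 0.0, a case the source does not define. The page contents are hypothetical.

```python
def jaccard(x, y):
    """|x & y| / |x | y|; returns 0.0 when both sets are empty (assumption)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


# Field keys: A abstract, P infobox attributes, V infobox values,
# C catalog, G user tags, L links. T (title) is used only by the global feature.
FIELDS = ["A", "P", "V", "C", "G", "L"]


def pair_features(ai, aj):
    """Return [f_a, f_p, f_v, f_C, f_g, f_l, f_all] for pages ai and aj."""
    feats = [jaccard(ai[f], aj[f]) for f in FIELDS]
    all_i = set().union(ai["T"], *(ai[f] for f in FIELDS))
    all_j = set().union(aj["T"], *(aj[f] for f in FIELDS))
    feats.append(jaccard(all_i, all_j))
    return feats


ai = {
    "T": {"hangzhou"}, "A": {"capital", "zhejiang"}, "P": {"country"},
    "V": {"china"}, "C": {"history"}, "G": {"city"}, "L": {"west", "lake"},
}
aj = {
    "T": {"hangzhou"}, "A": {"capital", "city"}, "P": {"country", "area"},
    "V": {"china"}, "C": set(), "G": {"city"}, "L": {"lake"},
}
feats = pair_features(ai, aj)
```

The resulting seven-element vector is what a classifier would receive as input for one page pair.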
2.2) Use the seven features obtained in step 2.1) as the input of a classifier: train a binary classifier with the RandomForest algorithm in the Weka toolkit, then use this binary classifier to predict the similarity between two pages.
4. The knowledge graph construction method based on multi-source entity fusion according to claim 1, characterized in that said step 3) specifically comprises the following steps:
3.1) Build a weighted graph for the candidate set from the page similarities computed in step 2), where the weight of the edge between two nodes is their similarity. The original problem is thereby converted into an edge selection problem. Let y_ij indicate whether there is an edge between two nodes: y_ij = 1 if there is an edge between a_i and a_j, and y_ij = 0 otherwise.
Further penalty terms and constraints are then introduced to build a mixed integer linear programming model:
Penalty term 1:
If a_i and a_j have an edge and a_i and a_k have an edge, then a_j and a_k should also have an edge; otherwise a penalty term φ is added, multiplied by a coefficient u that serves as a tuning parameter. φ is therefore subject to the following constraints:
$$y_{ij} + y_{ik} \le 1 + y_{jk} + \phi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$
$$\phi_{jk} \ge 0$$
Penalty term 2:
The higher the similarity between a_i and a_j, the more likely there is an edge between them. For a_i and a_j with low similarity, an edge between them incurs a large penalty; if their similarity is high, the penalty is small. ψ_ij denotes this penalty term and λ the tuning parameter, and the penalty is constrained by:
$$\lambda \, |y_{ij} - sim(a_i, a_j)| \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$
$$\psi_{ij} \ge 0$$
where sim(a_i, a_j) is the weight between a_i and a_j.
Penalty term 3:
For a_i and a_j that appear in a disambiguation page set M, y_ij equal to 1 indicates a matching error, so a penalty term ζ_ij is used to constrain a_i and a_j to have no edge between them. This constraint is expressed as:
$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \dots, N$$
$$\zeta_{ij} \ge 0$$
where N is the number of disambiguation page sets.
In addition, a threshold τ is set on the similarity: only pages a_i and a_j whose similarity exceeds τ may have an edge between them.
Combining the penalty terms and the threshold above gives the following objective function:
$$\text{maximize} \sum_{a_i, a_j \in P_t} \left( y_{ij} \cdot sim(a_i, a_j) - u \cdot \phi_{ij} - \psi_{ij} \right) - \sum_{n=1}^{N} \sum_{a_i, a_j \in M_n} \zeta_{ij}$$
$$\text{s.t.} \quad y_{ij} \in \{0, 1\}, \quad \phi_{ij}, \psi_{ij}, \zeta_{ij} \ge 0$$
$$y_{ij} + y_{ik} \le 1 + y_{jk} + \phi_{jk}, \quad \forall a_i, a_j, a_k \in P_t$$
$$\lambda \, |y_{ij} - sim(a_i, a_j)| \le \psi_{ij}, \quad \forall a_i, a_j \in P_t$$
$$sim(a_i, a_j) > y_{ij} \cdot \tau, \quad \forall a_i, a_j \in P_t$$
$$y_{ij} < \zeta_{ij}, \quad \forall a_i, a_j \in M_n, \; n = 1, 2, \dots, N$$
Solve for the maximum of this objective function to obtain the edge parameters y_ij corresponding to that maximum.
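To make the model concrete, the sketch below evaluates the objective by brute-force enumeration of all edge assignments for a very small candidate set. It is an illustration only, not the solver the patent relies on: the enumeration is exponential in the number of page pairs, each slack variable is set to the smallest value its constraint allows, the strict inequality on ζ_ij is relaxed to ζ_ij = y_ij, and the parameter values (τ, u, λ) and similarities are hypothetical.

```python
from itertools import combinations, product


def pair_key(i, j):
    """Normalize an unordered pair to the (small, large) dictionary key."""
    return (i, j) if i < j else (j, i)


def best_edges(sim, n, tau=0.5, u=1.0, lam=0.5, disambig_sets=()):
    """Enumerate every y in {0,1}^pairs and return the best (y, objective)."""
    pairs = list(combinations(range(n), 2))
    # Pairs that co-occur in some disambiguation page set M_n.
    banned = {pair_key(i, j)
              for m in disambig_sets
              for i, j in combinations(sorted(m), 2)}
    best_value, best_y = float("-inf"), None
    for bits in product((0, 1), repeat=len(pairs)):
        y = dict(zip(pairs, bits))
        # Threshold constraint: an edge is allowed only if sim > tau.
        if any(y[p] and sim[p] <= tau for p in pairs):
            continue
        value = 0.0
        for (j, k) in pairs:
            # Smallest phi_jk with y_ij + y_ik <= 1 + y_jk + phi_jk for all i.
            phi = max([0] + [y[pair_key(i, j)] + y[pair_key(i, k)] - 1 - y[(j, k)]
                             for i in range(n) if i not in (j, k)])
            # Smallest psi_jk with lam * |y_jk - sim_jk| <= psi_jk.
            psi = lam * abs(y[(j, k)] - sim[(j, k)])
            value += y[(j, k)] * sim[(j, k)] - u * phi - psi
            if (j, k) in banned:
                value -= y[(j, k)]  # zeta_jk, relaxed to equal y_jk
        if value > best_value:
            best_value, best_y = value, y
    return best_y, best_value


# Hypothetical similarities for three pages 0, 1, 2.
sim = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.3}
y, value = best_edges(sim, 3)
y_dis, value_dis = best_edges(sim, 3, disambig_sets=[{0, 2}])
```

With these numbers the unconstrained run keeps the edges (0, 1) and (0, 2) and pays the transitivity penalty for the missing edge (1, 2); once pages 0 and 2 are declared to share a disambiguation page set, the edge (0, 2) is dropped. For realistic candidate sets a mixed integer programming solver would replace the enumeration.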
3.2) Treat each connected component of this weighted graph as one entity, thereby obtaining all the pages that describe that entity.
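Step 3.2) amounts to a connected-component search over the selected edges. The following sketch uses breadth-first search; the page identifiers and the edge list are hypothetical, and the edges are assumed to be those with y_ij = 1 after optimization.

```python
from collections import deque


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adjacency = {v: set() for v in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(component))
    return components


# Hypothetical pages from three sources; the edges are the selected y_ij = 1.
pages = ["enc1:A", "enc2:A", "enc3:A", "enc3:B"]
edges = [("enc1:A", "enc2:A"), ("enc2:A", "enc3:A")]
entities = connected_components(pages, edges)
```

Each list in `entities` collects the pages that describe one entity; a page with no selected edge forms an entity on its own.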
CN201610583823.0A 2016-07-22 2016-07-22 Knowledge mapping construction method based on the fusion of multi-source entity Active CN106250412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610583823.0A CN106250412B (en) 2016-07-22 2016-07-22 Knowledge mapping construction method based on the fusion of multi-source entity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610583823.0A CN106250412B (en) 2016-07-22 2016-07-22 Knowledge mapping construction method based on the fusion of multi-source entity

Publications (2)

Publication Number Publication Date
CN106250412A true CN106250412A (en) 2016-12-21
CN106250412B CN106250412B (en) 2019-04-23

Family

ID=57604424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610583823.0A Active CN106250412B (en) 2016-07-22 2016-07-22 Knowledge mapping construction method based on the fusion of multi-source entity

Country Status (1)

Country Link
CN (1) CN106250412B (en)

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106777331A (en) * 2017-01-11 2017-05-31 北京航空航天大学 Knowledge mapping generation method and device
CN106844658A (en) * 2017-01-23 2017-06-13 中山大学 A kind of Chinese text knowledge mapping method for auto constructing and system
CN106909643A (en) * 2017-02-20 2017-06-30 同济大学 The social media big data motif discovery method of knowledge based collection of illustrative plates
CN107038257A (en) * 2017-05-10 2017-08-11 浙江大学 A kind of city Internet of Things data analytical framework of knowledge based collection of illustrative plates
CN107220386A (en) * 2017-06-29 2017-09-29 北京百度网讯科技有限公司 Information-pushing method and device
CN107423820A (en) * 2016-05-24 2017-12-01 清华大学 The knowledge mapping of binding entity stratigraphic classification represents learning method
CN108182295A (en) * 2018-02-09 2018-06-19 重庆誉存大数据科技有限公司 A kind of Company Knowledge collection of illustrative plates attribute extraction method and system
CN108399180A (en) * 2017-02-08 2018-08-14 腾讯科技(深圳)有限公司 A kind of knowledge mapping construction method, device and server
CN108694177A (en) * 2017-04-06 2018-10-23 北大方正集团有限公司 Knowledge mapping construction method and system
CN108777635A (en) * 2018-05-24 2018-11-09 梧州井儿铺贸易有限公司 A kind of Enterprise Equipment Management System
CN109033129A (en) * 2018-06-04 2018-12-18 桂林电子科技大学 Multi-source Information Fusion knowledge mapping based on adaptive weighting indicates learning method
CN109284394A (en) * 2018-09-12 2019-01-29 青岛大学 A method of Company Knowledge map is constructed from multi-source data integration visual angle
CN109522547A (en) * 2018-10-23 2019-03-26 浙江大学 Chinese synonym iteration abstracting method based on pattern learning
CN109657069A (en) * 2018-12-11 2019-04-19 北京百度网讯科技有限公司 The generation method and its device of knowledge mapping
CN109857872A (en) * 2019-02-18 2019-06-07 浪潮软件集团有限公司 The information recommendation method and device of knowledge based map
CN109902144A (en) * 2019-01-11 2019-06-18 杭州电子科技大学 A kind of entity alignment schemes based on improvement WMD algorithm
CN110209839A (en) * 2019-06-18 2019-09-06 卓尔智联(武汉)研究院有限公司 Agricultural knowledge map construction device, method and computer readable storage medium
CN110245198A (en) * 2019-06-18 2019-09-17 北京百度网讯科技有限公司 Multi-source ticketing data managing method and system, server and computer-readable medium
CN110377747A (en) * 2019-06-10 2019-10-25 河海大学 A kind of knowledge base fusion method towards encyclopaedia website
CN110427612A (en) * 2019-07-02 2019-11-08 平安科技(深圳)有限公司 Based on multilingual entity disambiguation method, device, equipment and storage medium
CN111708891A (en) * 2019-03-01 2020-09-25 九阳股份有限公司 Food material entity linking method and device among multi-source food material data
CN111813962A (en) * 2020-09-07 2020-10-23 北京富通东方科技有限公司 Entity similarity calculation method for knowledge graph fusion
CN111881290A (en) * 2020-06-17 2020-11-03 国家电网有限公司 Distribution network multi-source grid entity fusion method based on weighted semantic similarity
CN112115328A (en) * 2020-08-24 2020-12-22 苏宁金融科技(南京)有限公司 Page flow map construction method and device and computer readable storage medium
CN112163094A (en) * 2020-08-25 2021-01-01 中国科学院计算机网络信息中心 Scientific and technological resource convergence and continuous service method and device
CN112328812A (en) * 2021-01-05 2021-02-05 成都数联铭品科技有限公司 Domain knowledge extraction method and system based on self-adjusting parameters and electronic equipment
CN113139050A (en) * 2021-05-10 2021-07-20 桂林电子科技大学 Text abstract generation method based on named entity identification additional label and priori knowledge
CN113157861A (en) * 2021-04-12 2021-07-23 山东新一代信息产业技术研究院有限公司 Entity alignment method fusing Wikipedia
CN113326686A (en) * 2020-02-28 2021-08-31 株式会社斯库林集团 Similarity calculation device, recording medium, and similarity calculation method
CN113392220A (en) * 2020-10-23 2021-09-14 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN114153839A (en) * 2021-10-29 2022-03-08 杭州未名信科科技有限公司 Integration method, device, equipment and storage medium of multi-source heterogeneous data
US11487832B2 (en) * 2018-09-27 2022-11-01 Google Llc Analyzing web pages to facilitate automatic navigation
CN113326686B (en) * 2020-02-28 2024-05-10 株式会社斯库林集团 Similarity calculation device, recording medium, and similarity calculation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049569A (en) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Text similarity matching method on basis of vector space model
CN103729343A (en) * 2013-10-10 2014-04-16 上海交通大学 Semantic ambiguity eliminating method based on encyclopedia link co-occurrence
CN105787105A (en) * 2016-03-21 2016-07-20 浙江大学 Iterative-model-based establishment method of Chinese encyclopedic knowledge graph classification system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
楼仁杰: "基于中文百科的知识图谱分类体系构建研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
王龙甫: "基于中文百科的概念知识库构建", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423820A (en) * 2016-05-24 2017-12-01 清华大学 The knowledge mapping of binding entity stratigraphic classification represents learning method
CN106777331A (en) * 2017-01-11 2017-05-31 北京航空航天大学 Knowledge mapping generation method and device
CN106844658A (en) * 2017-01-23 2017-06-13 中山大学 A kind of Chinese text knowledge mapping method for auto constructing and system
CN106844658B (en) * 2017-01-23 2019-12-13 中山大学 Automatic construction method and system of Chinese text knowledge graph
CN108399180A (en) * 2017-02-08 2018-08-14 腾讯科技(深圳)有限公司 A kind of knowledge mapping construction method, device and server
CN108399180B (en) * 2017-02-08 2021-11-26 腾讯科技(深圳)有限公司 Knowledge graph construction method and device and server
CN106909643A (en) * 2017-02-20 2017-06-30 同济大学 The social media big data motif discovery method of knowledge based collection of illustrative plates
CN106909643B (en) * 2017-02-20 2020-08-14 同济大学 Knowledge graph-based social media big data topic discovery method
CN108694177A (en) * 2017-04-06 2018-10-23 北大方正集团有限公司 Knowledge mapping construction method and system
CN107038257A (en) * 2017-05-10 2017-08-11 浙江大学 A kind of city Internet of Things data analytical framework of knowledge based collection of illustrative plates
CN107220386A (en) * 2017-06-29 2017-09-29 北京百度网讯科技有限公司 Information-pushing method and device
CN107220386B (en) * 2017-06-29 2020-10-02 北京百度网讯科技有限公司 Information pushing method and device
CN108182295A (en) * 2018-02-09 2018-06-19 重庆誉存大数据科技有限公司 A kind of Company Knowledge collection of illustrative plates attribute extraction method and system
CN108182295B (en) * 2018-02-09 2021-09-10 重庆电信系统集成有限公司 Enterprise knowledge graph attribute extraction method and system
CN108777635A (en) * 2018-05-24 2018-11-09 梧州井儿铺贸易有限公司 A kind of Enterprise Equipment Management System
CN109033129B (en) * 2018-06-04 2021-08-03 桂林电子科技大学 Multi-source information fusion knowledge graph representation learning method based on self-adaptive weight
CN109033129A (en) * 2018-06-04 2018-12-18 桂林电子科技大学 Multi-source Information Fusion knowledge mapping based on adaptive weighting indicates learning method
CN109284394A (en) * 2018-09-12 2019-01-29 青岛大学 A method of Company Knowledge map is constructed from multi-source data integration visual angle
US11971936B2 (en) 2018-09-27 2024-04-30 Google Llc Analyzing web pages to facilitate automatic navigation
US11487832B2 (en) * 2018-09-27 2022-11-01 Google Llc Analyzing web pages to facilitate automatic navigation
CN109522547A (en) * 2018-10-23 2019-03-26 浙江大学 Chinese synonym iteration abstracting method based on pattern learning
CN109657069A (en) * 2018-12-11 2019-04-19 北京百度网讯科技有限公司 The generation method and its device of knowledge mapping
CN109902144A (en) * 2019-01-11 2019-06-18 杭州电子科技大学 A kind of entity alignment schemes based on improvement WMD algorithm
CN109902144B (en) * 2019-01-11 2020-01-31 杭州电子科技大学 entity alignment method based on improved WMD algorithm
CN109857872A (en) * 2019-02-18 2019-06-07 浪潮软件集团有限公司 The information recommendation method and device of knowledge based map
CN111708891A (en) * 2019-03-01 2020-09-25 九阳股份有限公司 Food material entity linking method and device among multi-source food material data
CN111708891B (en) * 2019-03-01 2023-12-08 九阳股份有限公司 Food material entity linking method and device between multi-source food material data
CN110377747A (en) * 2019-06-10 2019-10-25 河海大学 A kind of knowledge base fusion method towards encyclopaedia website
CN110377747B (en) * 2019-06-10 2021-12-07 河海大学 Knowledge base fusion method for encyclopedic website
CN110209839A (en) * 2019-06-18 2019-09-06 卓尔智联(武汉)研究院有限公司 Agricultural knowledge map construction device, method and computer readable storage medium
CN110245198A (en) * 2019-06-18 2019-09-17 北京百度网讯科技有限公司 Multi-source ticketing data managing method and system, server and computer-readable medium
CN110427612A (en) * 2019-07-02 2019-11-08 平安科技(深圳)有限公司 Based on multilingual entity disambiguation method, device, equipment and storage medium
CN113326686B (en) * 2020-02-28 2024-05-10 株式会社斯库林集团 Similarity calculation device, recording medium, and similarity calculation method
CN113326686A (en) * 2020-02-28 2021-08-31 株式会社斯库林集团 Similarity calculation device, recording medium, and similarity calculation method
CN111881290A (en) * 2020-06-17 2020-11-03 国家电网有限公司 Distribution network multi-source grid entity fusion method based on weighted semantic similarity
CN112115328B (en) * 2020-08-24 2022-08-19 苏宁金融科技(南京)有限公司 Page flow map construction method and device and computer readable storage medium
CN112115328A (en) * 2020-08-24 2020-12-22 苏宁金融科技(南京)有限公司 Page flow map construction method and device and computer readable storage medium
CN112163094A (en) * 2020-08-25 2021-01-01 中国科学院计算机网络信息中心 Scientific and technological resource convergence and continuous service method and device
CN111813962A (en) * 2020-09-07 2020-10-23 北京富通东方科技有限公司 Entity similarity calculation method for knowledge graph fusion
CN111813962B (en) * 2020-09-07 2020-12-18 北京富通东方科技有限公司 Entity similarity calculation method for knowledge graph fusion
CN113392220A (en) * 2020-10-23 2021-09-14 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN113392220B (en) * 2020-10-23 2024-03-26 腾讯科技(深圳)有限公司 Knowledge graph generation method and device, computer equipment and storage medium
CN112328812A (en) * 2021-01-05 2021-02-05 成都数联铭品科技有限公司 Domain knowledge extraction method and system based on self-adjusting parameters and electronic equipment
CN113157861B (en) * 2021-04-12 2022-05-24 山东浪潮科学研究院有限公司 Entity alignment method fusing Wikipedia
CN113157861A (en) * 2021-04-12 2021-07-23 山东新一代信息产业技术研究院有限公司 Entity alignment method fusing Wikipedia
CN113139050A (en) * 2021-05-10 2021-07-20 桂林电子科技大学 Text abstract generation method based on named entity identification additional label and priori knowledge
CN114153839A (en) * 2021-10-29 2022-03-08 杭州未名信科科技有限公司 Integration method, device, equipment and storage medium of multi-source heterogeneous data

Also Published As

Publication number Publication date
CN106250412B (en) 2019-04-23

Similar Documents

Publication Publication Date Title
CN106250412A (en) The knowledge mapping construction method merged based on many source entities
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN106294593B (en) In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN106055675B (en) A kind of Relation extraction method based on convolutional neural networks and apart from supervision
CN103473283B (en) Method for matching textual cases
CN110598000A (en) Relationship extraction and knowledge graph construction method based on deep learning model
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN104991905B (en) A kind of mathematic(al) representation search method based on level index
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN105653706A (en) Multilayer quotation recommendation method based on literature content mapping knowledge domain
CN110674252A (en) High-precision semantic search system for judicial domain
CN107239512B (en) A kind of microblogging comment spam recognition methods of combination comment relational network figure
CN107122349A (en) A kind of feature word of text extracting method based on word2vec LDA models
CN102117281A (en) Method for constructing domain ontology
US9146988B2 (en) Hierarchal clustering method for large XML data
CN110175334A (en) Text knowledge's extraction system and method based on customized knowledge slot structure
CN104317838A (en) Cross-media Hash index method based on coupling differential dictionary
CN112487190A (en) Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN114997288A (en) Design resource association method
CN104794209B (en) Chinese microblogging mood sorting technique based on Markov logical network and system
CN115391553A (en) Method for automatically searching time sequence knowledge graph complement model
CN103064907A (en) System and method for topic meta search based on unsupervised entity relation extraction
CN103699568A (en) Method for extracting hyponymy relation of field terms from wikipedia

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20161221

Assignee: TONGDUN HOLDINGS Co.,Ltd.

Assignor: ZHEJIANG University

Contract record no.: X2021990000612

Denomination of invention: Construction method of knowledge map based on multi-source entity fusion

Granted publication date: 20190423

License type: Common License

Record date: 20211012