CN106250412A

CN106250412A - Knowledge graph construction method based on multi-source entity fusion

Info

Publication number: CN106250412A
Application number: CN201610583823.0A
Authority: CN
Inventors: 鲁伟明; 戴豪; 庄越挺
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2016-07-22
Filing date: 2016-07-22
Publication date: 2016-12-21
Anticipated expiration: 2036-07-22
Also published as: CN106250412B

Abstract

The invention discloses a knowledge graph construction method based on multi-source entity fusion. First, the three major Chinese online encyclopedias (Baidu Baike, Hudong Baike, and Wikipedia) are crawled and the data are preprocessed, including title synonym extraction, disambiguation page extraction, candidate set extraction, and word segmentation of the text. Then, for the pages in the same candidate set, features between every pair of pages are computed, a classifier is trained to compute the similarity between pages, and a weighted graph is built from the similarities. Finally, a mixed linear programming model constrains the relations between the vertices of the weighted graph; maximizing its objective function yields the connectivity between vertices, and each connected component is taken as one entity, which gives all pages describing the same entity. By introducing candidate sets, the invention greatly reduces the scale of the problem; by using the mixed linear programming model, it also improves the accuracy of entity fusion.

Description

Knowledge graph construction method based on multi-source entity fusion

Technical field

The present invention relates to text similarity computation methods, and in particular to a knowledge graph construction method based on multi-source entity fusion.

Background art

With the rapid development of the Internet, people obtain information and knowledge through increasingly diverse channels, but massive amounts of data are scattered across every corner of the Internet, which is a major obstacle for users seeking knowledge. Building a unified and complete knowledge base is therefore urgent.

Many knowledge bases already exist. DBpedia, for example, is a notable Semantic Web application that extracts structured data from Wikipedia entries in order to enhance Wikipedia's search capabilities and to link other data sets to Wikipedia. Freebase is a large collaborative knowledge base that integrates many resources from the web. Entries in Freebase are similar to those in DBpedia in that both use structured data. Accessing its data shows that nearly all content is formatted and is stored and presented as triples. The schema is fixed, and entries of the same type all contain the same attributes. Consequently, data of the same kind can easily be linked together, which facilitates information queries. Freebase contains a very large number of topics and thousands of types and attributes. However, these knowledge bases are all in English, and there is currently no large-scale, complete knowledge base for Chinese.

Traditional entity matching algorithms for knowledge bases are mainly based on matching pairs of entities and formalize the problem as a classification problem. However, most of these algorithms depend heavily on the quality of data templates. Web data are not presented in a uniform triple format, and data from different sources also differ considerably in their form of expression, so such methods are poorly suited to the present problem.

Other matching algorithms also include structural information of the page among the features. For example, in entity matching between Chinese and English Wikipedia, a considerable share of pages carry cross-language links, so this information can serve as prior knowledge. However, there are no links between our multi-source data, so structural features of the pages cannot be included among the features.

For computing features between two sets, the Jaccard coefficient can be used. It is mainly used to compute the similarity between individuals described by symbolic or Boolean measures. Because the attributes of the individuals are symbolic or Boolean, the magnitude of a difference cannot be measured; only whether two values are identical can be determined. The Jaccard coefficient is therefore concerned only with whether the individuals share the same characteristics. To compare X and Y with the Jaccard similarity coefficient, one only counts the elements x_n and y_n that are identical.

Many algorithms can be applied to compute similarity from features. The simplest is to compute the Euclidean distance or cosine distance directly. Alternatively, a classifier can be trained on the features and then used to compute the similarity. Random forest is a well-performing classifier that can be used for this purpose. It is a classifier that uses many decision trees to train on and predict samples, and its output class is the mode of the classes output by the individual trees. Random forest has many advantages: for example, it maintains high accuracy when features are missing, and it is not prone to overfitting.

Summary of the invention

To integrate multi-source encyclopedic knowledge and build a unified knowledge base, the present invention provides a knowledge graph construction method based on multi-source entity fusion. Encyclopedias from different sources usually contain multiple pages describing the same entity; multi-source entity fusion finds these pages in massive data and maps them to the same entity.

The technical solution adopted by the present invention to solve its technical problem is as follows: a knowledge graph construction method based on multi-source entity fusion, comprising the following steps:

1) Preprocess the encyclopedia pages: extract synonyms of encyclopedia titles; extract disambiguation pages; build synonym groups using the transitivity of synonymy, all synonym groups forming a synonym group set; form a candidate set from the pages corresponding to each synonym group in the set; and segment the text of the encyclopedia pages with a word segmentation tool.

2) Using the segmentation results of step 1), compute features between every pair of pages in the same candidate set, train a classifier to assign a different weight to each feature dimension, and use the classifier to compute the similarity between pages.

3) Build the weighted graph of the candidate set from the page similarities computed in step 2); define the objective function of a mixed linear programming model and compute its maximum to obtain the connectivity between vertices. Each connected component of the weighted graph is taken as one entity, which gives all pages describing the same entity.

Further, step 1) comprises:

1.1) Extract synonyms of encyclopedia titles in the following two ways:

a) Template matching: specific templates are matched against the first sentence of each page and of its abstract; a successful match yields a synonym pair. The templates are defined manually and cover most patterns in which synonym pairs occur.

b) Link redirection: a hyperlink in a page jumps to another page; if the title of that page differs from the text of the hyperlink, the two terms are considered synonyms.

1.2) Extract disambiguation pages: the k-th encyclopedia is expressed as ε_k = \{a_1, \ldots, a_n\}, with k at most 3, where a_i denotes a page and n the total number of pages. From all pages that appear on a disambiguation page, a disambiguation page set M can be extracted; no two pages in M can represent the same entity.

M = \{ a_i \in ε_k \mid a_i \in M \neq a_j \in M \}

1.3) Extract candidate sets: by the transitivity of synonymy, if A and B are synonyms and A and C are synonyms, then B and C are also synonyms. In this way synonym groups S_t are obtained; all synonym groups S_t form the synonym group set, and any two elements within one synonym group of this set are synonyms of each other.

Given S_t, the pages whose titles belong to S_t are collected from all encyclopedia sources; together these pages constitute the candidate set P_t.

P_t = \{ a \in ε_{1, \ldots, K} \mid a.\mathrm{Title} \in S_t \}

K is the total number of encyclopedias; a.Title is the title of page a.
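The grouping in step 1.3) can be sketched with a union-find structure. This is only a minimal illustration under assumptions not stated in the source: pages are represented as hypothetical (source, title) tuples, and the function names are invented for the example.

```python
from collections import defaultdict


def build_synonym_groups(synonym_pairs):
    """Group titles into synonym groups S_t via the transitive closure of the pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in synonym_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for title in list(parent):
        groups[find(title)].add(title)
    return list(groups.values())


def build_candidate_sets(groups, pages):
    """For each group S_t, collect the pages P_t whose title belongs to S_t.

    pages is an iterable of (source, title) tuples (a hypothetical layout).
    """
    title_to_group = {t: idx for idx, g in enumerate(groups) for t in g}
    candidates = defaultdict(list)
    for source, title in pages:
        if title in title_to_group:
            candidates[title_to_group[title]].append((source, title))
    return [candidates[idx] for idx in range(len(groups))]


groups = build_synonym_groups([("A", "B"), ("A", "C"), ("D", "E")])
pages = [("baidu", "A"), ("hudong", "C"), ("wiki", "E"), ("baidu", "Z")]
print(build_candidate_sets(groups, pages))
```

In this sketch a title that occurs in no synonym pair (such as "Z") ends up in no candidate set; a fuller version would presumably treat it as a synonym group of its own.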

1.4) Segment the text of the encyclopedia pages: 5 fields of each page are segmented, namely the abstract, the infobox (keys and values), the links, the table of contents, and the user tags; stop words and words shorter than 2 characters are removed.

Further, step 2) comprises:

2.1) Define the 6 fields contained in a page, namely title T, abstract A, infobox I, table of contents C, user tags G, and links L, and represent a page a as a 6-tuple:

a = \{T, A, I, C, G, L\}

The infobox is expressed as key-value pairs, so I = \{P, V\}, where P denotes the attributes and V the attribute values;

For 2 pages belonging to the same candidate set, if they describe the same entity, the overlap of their text will be relatively high. The following 7 features are therefore defined:

1) Abstract feature

f_{a}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.A) \cap S_{w}(a_{j}.A)|}{|S_{w}(a_{i}.A) \cup S_{w}(a_{j}.A)|}

2) Infobox attribute feature

f_{p}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.I.P) \cap S_{w}(a_{j}.I.P)|}{|S_{w}(a_{i}.I.P) \cup S_{w}(a_{j}.I.P)|}

3) Infobox attribute value feature

f_{v}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.I.V) \cap S_{w}(a_{j}.I.V)|}{|S_{w}(a_{i}.I.V) \cup S_{w}(a_{j}.I.V)|}

4) Table of contents feature

f_{C}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.C) \cap S_{w}(a_{j}.C)|}{|S_{w}(a_{i}.C) \cup S_{w}(a_{j}.C)|}

5) User tag feature

f_{g}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.G) \cap S_{w}(a_{j}.G)|}{|S_{w}(a_{i}.G) \cup S_{w}(a_{j}.G)|}

6) Link feature

f_{l}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.L) \cap S_{w}(a_{j}.L)|}{|S_{w}(a_{i}.L) \cup S_{w}(a_{j}.L)|}

7) Global feature, where S denotes the string concatenation of the 6-tuple \{T, A, I, C, G, L\}

f_{all}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.S) \cap S_{w}(a_{j}.S)|}{|S_{w}(a_{i}.S) \cup S_{w}(a_{j}.S)|}

S_w(X) denotes the set of words obtained by segmenting the string X.
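As an illustration of the seven features, the following sketch computes them for two pages whose fields have already been segmented. The dictionary layout (keys "T", "A", "P", "V", "C", "G", "L" mapping to word lists) is an assumption made for the example, as is the choice to return 0.0 when both fields are empty, a case the text does not address. The global feature is approximated by pooling the words of all fields instead of segmenting the concatenated string.

```python
def jaccard(words_a, words_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two word collections."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0  # both fields empty: treated as no overlap (assumption)
    return len(set_a & set_b) / len(union)


# Abstract, infobox attributes, infobox values, contents, user tags, links.
FIELDS = ("A", "P", "V", "C", "G", "L")


def pair_features(page_i, page_j):
    """Return [f_a, f_p, f_v, f_C, f_g, f_l, f_all] for two segmented pages."""
    per_field = [jaccard(page_i[f], page_j[f]) for f in FIELDS]
    # Global feature: pool the words of the title and all other fields.
    words_i = [w for f in ("T",) + FIELDS for w in page_i[f]]
    words_j = [w for f in ("T",) + FIELDS for w in page_j[f]]
    return per_field + [jaccard(words_i, words_j)]


page_1 = {"T": ["x"], "A": ["a", "b"], "P": ["p"], "V": [],
          "C": ["c"], "G": [], "L": ["l1", "l2"]}
page_2 = {"T": ["x"], "A": ["b", "c"], "P": ["p"], "V": [],
          "C": ["d"], "G": [], "L": ["l1"]}
print(pair_features(page_1, page_2))
```

The resulting 7-dimensional vector is what step 2.2) would feed to the classifier for each page pair.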

2.2) The 7 features obtained in step 2.1) are used as the input of a classifier: a binary classifier is trained with the RandomForest algorithm from the Weka toolkit and is then used to predict the similarity between two pages.

Further, step 3) specifically comprises the following steps:

3.1) Build the weighted graph of the candidate set from the page similarities computed in step 2); the weight of the edge between two nodes is given by their similarity. The original problem is thereby turned into an edge selection problem. y_ij denotes whether there is an edge between two nodes:

y_{ij} = \begin{cases} 1, & \text{if there is an edge between } a_i \text{ and } a_j \\ 0, & \text{otherwise} \end{cases}

Further penalty terms and constraints are introduced to build the mixed linear programming model:

Penalty term 1:

If a_i and a_j are connected by an edge and a_i and a_k are connected by an edge, then a_j and a_k should also be connected by an edge; otherwise a penalty term φ is added, multiplied by a coefficient u that serves as a tuning parameter. φ is therefore subject to the following constraint:

y_{ij} + y_{ik} \leq 1 + y_{jk} + φ_{jk}, \quad \forall a_i, a_j, a_k \in P_t

φ_{jk} \geq 0

Penalty term 2:

The higher the similarity between a_i and a_j, the more likely it is that there is an edge between them. For a_i and a_j with very low similarity, the penalty is large if there is an edge between them; if their similarity is high, the penalty is small. The penalty term is denoted ψ_ij and the tuning parameter λ, and the penalty term is constrained as follows:

λ |y_{ij} - \mathrm{sim}(a_i, a_j)| \leq ψ_{ij}, \quad \forall a_i, a_j \in P_t

ψ_{ij} \geq 0

sim(a_i, a_j) is the weight between a_i and a_j;
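The absolute value in the constraint on ψ_ij is not linear as written. A standard way to express it in a linear program, which the text does not spell out and which assumes λ ≥ 0, is to split it into two linear inequalities:

```latex
\lambda \, \bigl( y_{ij} - \mathrm{sim}(a_i, a_j) \bigr) \leq \psi_{ij},
\qquad
\lambda \, \bigl( \mathrm{sim}(a_i, a_j) - y_{ij} \bigr) \leq \psi_{ij}
```

Because ψ_ij is subtracted in a maximized objective, the optimum drives it down to ψ_ij = λ |y_ij - sim(a_i, a_j)|. For instance, selecting an edge (y_ij = 1) costs 0.1λ when sim(a_i, a_j) = 0.9 but 0.8λ when sim(a_i, a_j) = 0.2.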

Penalty term 3:

For a_i and a_j that appear in the same disambiguation page set M, y_ij = 1 indicates a matching error, so a penalty term ζ_ij is used to constrain a_i and a_j to have no edge between them. This constraint is expressed as:

y_{ij} < ζ_{ij}, \quad \forall a_i, a_j \in M_n, \ n = 1, 2, \ldots, N

ζ_{ij} \geq 0

N is the number of disambiguation page sets;

In addition, a threshold τ is set on the similarity: only pages a_i and a_j whose similarity exceeds τ may be connected by an edge.

Combining the penalty terms and the threshold above gives the following objective function:

\text{maximize} \quad \sum_{a_i, a_j \in P_t} (y_{ij} \cdot \mathrm{sim}(a_i, a_j) - u \cdot φ_{ij} - ψ_{ij}) - \sum_{n=1}^{N} \sum_{a_i, a_j \in M_n} ζ_{ij}

\text{s.t.} \quad y_{ij} \in \{0, 1\}, \quad φ_{ij}, ψ_{ij}, ζ_{ij} \geq 0

y_{ij} + y_{ik} \leq 1 + y_{jk} + φ_{jk}, \quad \forall a_i, a_j, a_k \in P_t

λ |y_{ij} - \mathrm{sim}(a_i, a_j)| \leq ψ_{ij}, \quad \forall a_i, a_j \in P_t

\mathrm{sim}(a_i, a_j) > y_{ij} \cdot τ, \quad \forall a_i, a_j \in P_t

y_{ij} < ζ_{ij}, \quad \forall a_i, a_j \in M_n, \ n = 1, 2, \ldots, N

Solving for the maximum of this objective function gives the edge variables y_ij that correspond to that maximum.
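To make the model concrete, the sketch below solves a tiny instance by enumerating every edge assignment instead of calling a linear programming solver, which is only feasible for a handful of pages. It relies on several assumptions: each penalty variable takes the smallest value its constraint allows, the strict inequality on ζ_ij is relaxed so that ζ_ij = y_ij, and the function name, argument layout, and default parameter values are invented for the example.

```python
from itertools import combinations, product


def solve_by_enumeration(n, sim, cannot_link=(), u=1.0, lam=1.0, tau=0.5):
    """Maximise the objective over all edge assignments y for n pages.

    sim maps each pair (i, j) with i < j to a similarity in [0, 1];
    cannot_link lists pairs that appear in the same disambiguation page set.
    """
    pairs = list(combinations(range(n), 2))
    forbidden = {tuple(sorted(p)) for p in cannot_link}
    # Threshold constraint: y_ij may be 1 only if sim(a_i, a_j) > tau.
    allowed = [p for p in pairs if sim[p] > tau]

    def edge(y, a, b):
        return y.get((a, b) if a < b else (b, a), 0)

    best_value, best_y = float("-inf"), None
    for bits in product((0, 1), repeat=len(allowed)):
        y = dict(zip(allowed, bits))
        value = 0.0
        for j, k in pairs:
            y_jk = edge(y, j, k)
            # Smallest phi_jk with y_ij + y_ik <= 1 + y_jk + phi_jk for all i.
            phi = max([0] + [edge(y, i, j) + edge(y, i, k) - 1 - y_jk
                             for i in range(n) if i not in (j, k)])
            # Smallest psi_jk with lam * |y_jk - sim| <= psi_jk.
            psi = lam * abs(y_jk - sim[(j, k)])
            value += y_jk * sim[(j, k)] - u * phi - psi
            if (j, k) in forbidden:
                value -= y_jk  # zeta_jk taken as y_jk (relaxing "<" to "<=")
        if value > best_value:
            best_value, best_y = value, y
    return {p for p, bit in best_y.items() if bit}


similarities = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.3}
print(solve_by_enumeration(3, similarities))
print(solve_by_enumeration(3, similarities, cannot_link=[(0, 2)]))
```

In the first call both high-similarity edges are kept even though the pair (1, 2) falls below the threshold and incurs a transitivity penalty; in the second call the disambiguation penalty on (0, 2) makes it cheaper to drop that edge.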

3.2) Each connected component of the weighted graph is taken as one entity, which gives all pages describing that entity.
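Extracting the connected components in step 3.2) can be sketched with a breadth-first search. The representation used here (integer page identifiers and a set of selected edges with y_ij = 1) is an assumption for illustration.

```python
from collections import defaultdict, deque


def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(component))
    return components


# Pages 0, 1, 2 are linked by selected edges; pages 3 and 4 stay on their own.
print(connected_components(range(5), {(0, 1), (0, 2)}))
```

Each returned list corresponds to one entity; a page without any selected edge forms an entity by itself.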

Compared with the prior art, the method of the invention has the following advantages:

1. The method uses title synonyms to obtain title candidate sets, derives page candidate sets from them, and computes page similarity only within each page candidate set. This greatly reduces the scale of the problem and makes the subsequent algorithms simpler to implement.

2. Based on the page structure, the method extracts the Jaccard coefficients of 7 text features and uses the random forest algorithm to compute the similarity between pages; this similarity accurately reflects how similar the pages are.

3. The method models page similarity on a graph and uses the mixed linear programming model to determine the relations between the vertices of the graph, i.e. the relations between pages. From these relations an undirected graph is built, from which all pages describing one entity can be obtained accurately.

Brief description of the drawings

Fig. 1 is the overall flow chart of the present invention;

Fig. 2 is the flow chart of step 2);

Fig. 3 is the flow chart of step 3);

Fig. 4 is the flow chart of step 4).

Detailed description of the invention

The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

As shown in Figs. 1 to 4, the steps of the knowledge graph construction method based on multi-source entity fusion are as follows:

Steps 1) to 3) are carried out as described in the summary of the invention. The template matching of step 1) is illustrated as follows:

a) Template matching: specific templates are matched against the first sentence of each page and of its abstract; a successful match yields a synonym pair. The templates are defined manually and cover most patterns in which synonym pairs occur. For example, for a page about a term that has synonyms, the first sentence of the page or of its abstract usually contains strings such as "A is also known as B", "A is also called B", or "A is a synonym of B"; matching these with regular expressions yields part of the synonym pairs.
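A minimal sketch of such regular-expression templates follows. The Chinese trigger phrases (又称, 又名, 的同义词) and the character class used for a term are assumptions chosen for illustration; the actual hand-written templates are not listed in the text.

```python
import re

# A term: a run of CJK characters, Latin letters, or digits (assumed definition).
WORD = r"[\u4e00-\u9fffA-Za-z0-9]+"

# Hypothetical templates for "A is also known as B" and "A is a synonym of B".
TEMPLATES = [
    re.compile(rf"^(?P<a>{WORD}?)(?:又称|又名)(?P<b>{WORD})"),
    re.compile(rf"^(?P<a>{WORD}?)是(?P<b>{WORD}?)的同义词"),
]


def extract_synonym_pair(first_sentence):
    """Return (A, B) if the sentence matches a template, otherwise None."""
    for template in TEMPLATES:
        match = template.search(first_sentence)
        if match:
            return match.group("a"), match.group("b")
    return None


print(extract_synonym_pair("马铃薯又称土豆，是茄科植物。"))
print(extract_synonym_pair("西红柿是番茄的同义词。"))
print(extract_synonym_pair("苹果是一种水果。"))
```

The term before the trigger phrase is matched lazily so that it stops at the first trigger; the term after it ends at the first punctuation mark, which is a simplification that real templates would need to refine.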


Embodiment

The implementation steps of the present invention are described in detail below by way of an example:

(1) The data set used in the example comes from Baidu Baike and Hudong Baike; Baidu Baike contributes 10143321 pages and Hudong Baike 6618544 pages.

(2) For all pages in (1), the page structure is parsed and information such as the title, abstract, table of contents, categories, links, and infobox is extracted and stored in a Lucene index. Apart from the title, any field may be empty.

(3) For all pages in (1), title synonyms are extracted. The extraction methods are mainly template matching and link redirection. From the extracted synonym pairs, title synonym sets are then derived. These title synonym sets are matched against the page titles in (1) to obtain the candidate set pages.

(4) Within the candidate set pages obtained in (3), features between every pair of pages are extracted and used as input to train a random forest classifier. This step requires a manually labeled training set.

(5) Based on the similarity matrix obtained in step (4), the mixed linear programming model is built. The model yields the relations between vertices: 1 means that there is an edge between two vertices and 0 means that there is none. With these vertices and edges as input, an undirected graph is built. Each connected component of the undirected graph is extracted, and the pages in one connected component represent one entity.

Results of this example:

For the similarity computation, 5 methods were compared, and the random forest classifier performed best. The method used here (SCM) was compared with other methods, namely greedy matching (GA), hierarchical clustering (AC), minimum spanning tree clustering (MSTC), and correlation clustering (CC), on four evaluation metrics: Precision, Recall, F1, and Accuracy. The results are shown in the following table:

Method	Precision	Recall	F1	Accuracy
GA	78.3%	76.1%	77.2%	91.6%
AC	73.0%	79.0%	75.9%	91.5%
MSTC	63.4%	80.5%	71.0%	88.8%
CC	62.4%	65.5%	63.9%	87.4%
SCM	75.8%	82.5%	79.0%	92.5%

The table shows that this method outperforms the other methods on F1 and Accuracy. The method therefore has good practical value and application prospects for entity matching.
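As a consistency check on the table, and assuming the F1 column is the usual harmonic mean of Precision and Recall (the text does not define it), the reported F1 values can be recomputed from the other two columns:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall, reported F1) in percent, copied from the table above.
REPORTED = {
    "GA": (78.3, 76.1, 77.2),
    "AC": (73.0, 79.0, 75.9),
    "MSTC": (63.4, 80.5, 71.0),
    "CC": (62.4, 65.5, 63.9),
    "SCM": (75.8, 82.5, 79.0),
}

for method, (precision, recall, reported_f1) in REPORTED.items():
    recomputed = f1_score(precision, recall)
    print(f"{method}: recomputed {recomputed:.2f}, reported {reported_f1}")
```

Under that assumption every recomputed value agrees with the reported one to within 0.1 percentage points.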

Claims

1. A knowledge graph construction method based on multi-source entity fusion, characterized by comprising the following steps:

1) Preprocessing encyclopedia pages: extracting synonyms of encyclopedia titles; extracting disambiguation pages; building synonym groups using the transitivity of synonymy, all synonym groups forming a synonym group set; forming a candidate set from the pages corresponding to each synonym group in the set; and segmenting the text of the encyclopedia pages with a word segmentation tool.

2) Using the segmentation results of step 1), computing features between every pair of pages in the same candidate set, training a classifier to assign a different weight to each feature dimension, and using the classifier to compute the similarity between pages.

3) Building the weighted graph of the candidate set from the page similarities computed in step 2); defining the objective function of a mixed linear programming model and computing its maximum to obtain the connectivity between vertices; and taking each connected component of the weighted graph as one entity, thereby obtaining all pages describing the same entity.

2. The knowledge graph construction method based on multi-source entity fusion according to claim 1, characterized in that step 1) comprises:

1.1) Extracting synonyms of encyclopedia titles in the following two ways:

a) Template matching: specific templates are matched against the first sentence of each page and of its abstract; a successful match yields a synonym pair. The templates are defined manually and cover most patterns in which synonym pairs occur.

b) Link redirection: a hyperlink in a page jumps to another page; if the title of that page differs from the text of the hyperlink, the two terms are considered synonyms.

1.2) Extracting disambiguation pages: the k-th encyclopedia is expressed as ε_k = \{a_1, \ldots, a_n\}, with k at most 3, where a_i denotes a page and n the total number of pages. From all pages that appear on a disambiguation page, a disambiguation page set M can be extracted; no two pages in M can represent the same entity.

M = \{ a_i \in ε_k \mid a_i \in M \neq a_j \in M \}

1.3) Extracting candidate sets: by the transitivity of synonymy, if A and B are synonyms and A and C are synonyms, then B and C are also synonyms. In this way synonym groups S_t are obtained; all synonym groups S_t form the synonym group set, and any two elements within one synonym group of this set are synonyms of each other.

Given S_t, the pages whose titles belong to S_t are collected from all encyclopedia sources; together these pages constitute the candidate set P_t.

P_t = \{ a \in ε_{1, \ldots, K} \mid a.\mathrm{Title} \in S_t \}

K is the total number of encyclopedias; a.Title is the title of page a.

1.4) Segmenting the text of the encyclopedia pages: 5 fields of each page are segmented, namely the abstract, the infobox (keys and values), the links, the table of contents, and the user tags; stop words and words shorter than 2 characters are removed.

3. The knowledge graph construction method based on multi-source entity fusion according to claim 1, characterized in that step 2) comprises:

2.1) Defining the 6 fields contained in a page, namely title T, abstract A, infobox I, table of contents C, user tags G, and links L, and representing a page a as a 6-tuple:

a = \{T, A, I, C, G, L\}

The infobox is expressed as key-value pairs, so I = \{P, V\}, where P denotes the attributes and V the attribute values;

For 2 pages belonging to the same candidate set, if they describe the same entity, the overlap of their text will be relatively high. The following 7 features are therefore defined:

1) Abstract feature

f_{a}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.A) \cap S_{w}(a_{j}.A)|}{|S_{w}(a_{i}.A) \cup S_{w}(a_{j}.A)|}

2) Infobox attribute feature

f_{p}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.I.P) \cap S_{w}(a_{j}.I.P)|}{|S_{w}(a_{i}.I.P) \cup S_{w}(a_{j}.I.P)|}

3) Infobox attribute value feature

f_{v}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.I.V) \cap S_{w}(a_{j}.I.V)|}{|S_{w}(a_{i}.I.V) \cup S_{w}(a_{j}.I.V)|}

4) Table of contents feature

f_{C}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.C) \cap S_{w}(a_{j}.C)|}{|S_{w}(a_{i}.C) \cup S_{w}(a_{j}.C)|}

5) User tag feature

f_{g}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.G) \cap S_{w}(a_{j}.G)|}{|S_{w}(a_{i}.G) \cup S_{w}(a_{j}.G)|}

6) Link feature

f_{l}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.L) \cap S_{w}(a_{j}.L)|}{|S_{w}(a_{i}.L) \cup S_{w}(a_{j}.L)|}

7) Global feature, where S denotes the string concatenation of the 6-tuple \{T, A, I, C, G, L\}

f_{all}(a_{i}, a_{j}) = \frac{|S_{w}(a_{i}.S) \cap S_{w}(a_{j}.S)|}{|S_{w}(a_{i}.S) \cup S_{w}(a_{j}.S)|}

S_w(X) denotes the set of words obtained by segmenting the string X.

2.2) The 7 features obtained in step 2.1) are used as the input of a classifier: a binary classifier is trained with the RandomForest algorithm from the Weka toolkit and is then used to predict the similarity between two pages.

4. The knowledge graph construction method based on multi-source entity fusion according to claim 1, characterized in that step 3) specifically comprises the following steps:

3.1) Building the weighted graph of the candidate set from the page similarities computed in step 2), the weight of the edge between two nodes being given by their similarity. The original problem is thereby turned into an edge selection problem. y_ij denotes whether there is an edge between two nodes:

y_{ij} = \begin{cases} 1, & \text{if there is an edge between } a_i \text{ and } a_j \\ 0, & \text{otherwise} \end{cases}

Further penalty terms and constraints are introduced to build the mixed linear programming model:

Penalty term 1:

If a_i and a_j are connected by an edge and a_i and a_k are connected by an edge, then a_j and a_k should also be connected by an edge; otherwise a penalty term φ is added, multiplied by a coefficient u that serves as a tuning parameter. φ is therefore subject to the following constraint:

y_{ij} + y_{ik} \leq 1 + y_{jk} + φ_{jk}, \quad \forall a_i, a_j, a_k \in P_t

φ_{jk} \geq 0

Penalty term 2:

The higher the similarity between a_i and a_j, the more likely it is that there is an edge between them. For a_i and a_j with very low similarity, the penalty is large if there is an edge between them; if their similarity is high, the penalty is small. The penalty term is denoted ψ_ij and the tuning parameter λ, and the penalty term is constrained as follows:

λ |y_{ij} - \mathrm{sim}(a_i, a_j)| \leq ψ_{ij}, \quad \forall a_i, a_j \in P_t

ψ_{ij} \geq 0

sim(a_i, a_j) is the weight between a_i and a_j;

Penalty term 3:

For a_i and a_j that appear in the same disambiguation page set M, y_ij = 1 indicates a matching error, so a penalty term ζ_ij is used to constrain a_i and a_j to have no edge between them. This constraint is expressed as:

y_{ij} < ζ_{ij}, \quad \forall a_i, a_j \in M_n, \ n = 1, 2, \ldots, N

ζ_{ij} \geq 0

N is the number of disambiguation page sets;

In addition, a threshold τ is set on the similarity: only pages a_i and a_j whose similarity exceeds τ may be connected by an edge.

Combining the penalty terms and the threshold above gives the following objective function:

\text{maximize} \quad \sum_{a_i, a_j \in P_t} (y_{ij} \cdot \mathrm{sim}(a_i, a_j) - u \cdot φ_{ij} - ψ_{ij}) - \sum_{n=1}^{N} \sum_{a_i, a_j \in M_n} ζ_{ij}

\text{s.t.} \quad y_{ij} \in \{0, 1\}, \quad φ_{ij}, ψ_{ij}, ζ_{ij} \geq 0

y_{ij} + y_{ik} \leq 1 + y_{jk} + φ_{jk}, \quad \forall a_i, a_j, a_k \in P_t

λ |y_{ij} - \mathrm{sim}(a_i, a_j)| \leq ψ_{ij}, \quad \forall a_i, a_j \in P_t

\mathrm{sim}(a_i, a_j) > y_{ij} \cdot τ, \quad \forall a_i, a_j \in P_t

y_{ij} < ζ_{ij}, \quad \forall a_i, a_j \in M_n, \ n = 1, 2, \ldots, N

Solving for the maximum of this objective function gives the edge variables y_ij that correspond to that maximum.

3.2) Each connected component of the weighted graph is taken as one entity, which gives all pages describing that entity.