CN106570191B - Chinese-English cross-language entity matching method based on Wikipedia - Google Patents

Chinese-English cross-language entity matching method based on Wikipedia

Info

Publication number
CN106570191B
CN106570191B (application CN201611005395.XA)
Authority
CN
China
Prior art keywords
page, chinese, english, language, cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611005395.XA
Other languages
Chinese (zh)
Other versions
CN106570191A (en)
Inventor
鲁伟明 (Lu Weiming)
戴豪 (Dai Hao)
庄越挺 (Zhuang Yueting)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201611005395.XA
Publication of CN106570191A
Application granted
Publication of CN106570191B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558: Details of hyperlinks; Management of linked annotations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/3332: Query translation
    • G06F16/3337: Translation of the query language, e.g. Chinese to English

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese-English cross-language entity matching method based on Wikipedia. The method first obtains Chinese and English Wikipedia page data through wikidump and preprocesses the data, specifically: extracting the title, abstract, table of contents, full text, out-links, in-links, and category information of each page; translating the titles of Chinese wiki pages into English; segmenting the abstract, table of contents, and full text; and extracting the existing page cross-language links and category cross-language links in the Chinese and English Wikipedia. Then, for each Chinese wiki page, an English cross-language page candidate set is obtained from its out-link information. Next, the features between the Chinese page and each page in its English cross-language candidate set are computed. Finally, a ranking model is constructed to rank the English cross-language page candidate set of the current Chinese page by similarity, and the candidate page with the highest similarity is taken as the cross-language link of the current Chinese page.

Description

Chinese-English cross-language entity matching method based on Wikipedia
Technical Field
The invention relates to topic models, deep learning, text similarity calculation, and related techniques, and in particular to a Chinese-English cross-language entity matching method based on Wikipedia.
Background
With the development of machine learning and deep learning, knowledge base construction has steadily matured. Many knowledge bases exist. DBpedia, a well-known semantic web application, extracts structured data from Wikipedia entries to enhance Wikipedia's search capability and to link other data sets to Wikipedia. Freebase is a large collaborative knowledge base that integrates many resources on the web; its entries, like DBpedia's, take the form of structured data. Inspecting its data shows that all content is formatted, stored, and presented as triples. Because the schema is fixed and items of the same type share the same attributes, data of the same kind can easily be associated, which is convenient for information query. Freebase contains tens of millions of topics and thousands of types and attributes. In the Chinese domain, many knowledge bases have also been constructed, such as Zhishi.me.
In Wikipedia, over 400,000 Chinese-English cross-language links already exist, including cross-language links between ordinary pages and cross-language links between category pages. With this cross-language link information, together with the rich textual and structural information of the wiki, many interesting applications become possible, such as machine translation and cross-language information retrieval. Philipp Sorg and Philipp Cimiano used Wikipedia's cross-language links and the hierarchical relations between category pages and ordinary pages to propose a cross-language explicit semantic analysis model for English-German cross-language information retrieval. Jie Tang et al. proposed a probabilistic topic model to solve cross-source, cross-language patent retrieval. The construction of cross-language links is therefore a crucial step in building a cross-language knowledge base.
The Chinese knowledge base we constructed provides rich corpus information: it contains about 12 million entities, the relationships among entities are very rich, and each entity contains 53.8 links on average. However, our knowledge base is not yet associated with authoritative knowledge bases in other languages, such as DBpedia, which is a significant loss of knowledge completeness and reliability and does not conform to the open principles of Linked Open Data (LOD). Considering that some cross-language links already exist between the Chinese and English Wikipedia, and that the Chinese knowledge base was built from three Chinese encyclopedias including Chinese Wikipedia, new cross-language links can be learned from this existing knowledge.
The invention mainly uses topic models and deep learning. A topic model assumes that each word of a document is generated by first choosing a topic with some probability and then choosing a word from that topic with some probability; consequently, two documents describing the same entity have similar topic distributions, and topic similarity can be computed from those distributions. doc2vec is a deep learning technique that, unlike topic models, encodes the context of words into the document vector, thereby expressing the semantic relationships between documents more objectively.
Disclosure of Invention
The invention provides a Chinese and English cross-language entity matching method based on Wikipedia for searching cross-language links and constructing a cross-language knowledge base.
The technical scheme adopted by the invention to solve the technical problem is as follows. A Wikipedia-based Chinese-English cross-language entity matching method comprises the following steps:
1) preprocessing the Chinese and English wiki pages:
1.1) extracting the title, abstract, table of contents, full text, out-links, in-links, and category information of each page;
1.2) translating the title of the Chinese wiki page into English;
1.3) segmenting the abstract, the table of contents, and the full text;
1.4) extracting the existing ordinary-page cross-language links and category-page cross-language links in the Chinese and English Wikipedia.
2) Extracting the English cross-language page candidate set of each Chinese wiki page using the page out-links extracted in step 1) and the existing ordinary-page cross-language links.
3) For each Chinese wiki page from step 2), computing the features between the Chinese wiki page and each page in its English cross-language page candidate set, including handcrafted features, topic features, and document vector features.
4) Taking the data obtained in step 3) as the data set, constructing a ranking model, ranking the English cross-language page candidate set of the current Chinese page by similarity, and taking the page with the highest similarity in the candidate set as the cross-language link of the current Chinese page.
Further, the step 1) comprises:
1.1) extracting the Wikipedia page information: each Wikipedia page can be represented by a 7-tuple a = {T, A, B, C, G, I, O}, where T, A, and B are the title, abstract, and full text, C and G are two sets representing the table of contents and category information of the current page, and I and O are the in-links and out-links of the page. If a_i contains a link pointing to a_j, then a_i is an in-link of a_j, and a_j is an out-link of a_i. Thus, the Chinese Wikipedia can be expressed as
ε_cn = {a_1, a_2, …, a_n}
English Wikipedia can be expressed as
ε_en = {b_1, b_2, …, b_m}
n and m are the total number of Chinese and English pages respectively.
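For illustration only, the 7-tuple can be carried in a small data structure; the following Python sketch is not part of the patent, and the field names simply mirror the notation above (the translate field anticipates step 1.2)).

    from dataclasses import dataclass, field
    from typing import Set

    @dataclass
    class WikiPage:
        """Page 7-tuple a = {T, A, B, C, G, I, O} from step 1.1)."""
        T: str = ""                                # title
        A: str = ""                                # abstract
        B: str = ""                                # full text
        C: Set[str] = field(default_factory=set)   # table of contents entries
        G: Set[str] = field(default_factory=set)   # categories
        I: Set[str] = field(default_factory=set)   # in-link titles
        O: Set[str] = field(default_factory=set)   # out-link titles
        translate: str = ""                        # English translation of T (step 1.2)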
1.2) Chinese wiki page title translation: according to the Chinese page title information extracted in step 1.1), all Chinese titles are translated into English using the Baidu translation API, and the translated result is denoted a.translate.
1.3) page information word segmentation: according to the page information extracted in step 1.1), for a Chinese page, the abstract, table of contents, and full text are segmented with a Chinese word segmentation tool; for an English page, stop words are removed with the Stanford tokenizer and the abstract, table of contents, and full text are segmented into words.
1.4) extracting existing cross-language links: a number of cross-language links already exist in Wikipedia; ordinary-page cross-language links and category-page cross-language links can be extracted separately by distinguishing the namespace fields in wikidump. If a_i ∈ ε_cn, b_j ∈ ε_en, and a_i and b_j form a cross-language link, then a_i = b_j. All existing cross-language links constitute the set:
Eq = {{a_i, b_j} | a_i ∈ ε_cn, b_j ∈ ε_en, a_i = b_j}
further, the step 2) comprises:
2.1) Define the out-link equivalence relation:
The following structure is defined as an equivalent out-link, where a, a_i ∈ ε_cn and b, b_i ∈ ε_en; that is, page a and page b have equivalent out-links, and such a and b stand in an out-link equivalence relation:
a → a_i, b → b_i, a_i = b_i
2.2) extracting the English cross-language page candidate set of a Chinese page: by the definition in 2.1), all English pages b that stand in an out-link equivalence relation with Chinese page a constitute Candidate(a). The more equivalent out-links exist between a and b, the higher the probability that they form a cross-language link. The English pages in Candidate(a) are sorted in descending order by the number of equivalent out-links, and the top 1000 are taken as the English cross-language page candidate set of Chinese page a, denoted Candidate1000(a), giving the final candidate set.
Further, the step 3) specifically comprises the following steps:
3.1) handcrafted feature extraction: for a Chinese page a, the features between it and each English cross-language candidate page b ∈ Candidate1000(a) are extracted as follows:
1) title similarity
F1=Dist(a.translate,b.T)
Dist calculates the edit distance between a.translate and b.T.
2) Number of equivalent out-links
F2=|{(a′,b′)|a′=b′,a′∈a.O,b′∈b.O}|
This feature represents the number of equivalent out-links between pages a and b.
3) Out-link Jaccard coefficient
F3 = F2 / (|a.O| + |b.O| - F2)
This feature represents the Jaccard coefficient of the equivalent out-links between pages a and b.
4) Normalized equivalent out-links
For a Chinese page a, its normalization factor M can be expressed by the following equation:
M = max |{(a′, b′) | a′ = b′, a′ ∈ a.O, b′ ∈ c.O, c ∈ Candidate1000(a)}|
Feature 4 is the result of feature 2 normalized by M, as follows:
F4 = F2 / M
5) Number of equivalent in-links
F5=|{(a′,b′)|a′=b′,a′∈a.I,b′∈b.I}|
This feature represents the number of equivalent in-links between pages a and b.
6) In-link Jaccard coefficient
F6 = F5 / (|a.I| + |b.I| - F5)
This feature represents the Jaccard coefficient of the equivalent in-links between pages a and b.
7) Number of equivalent categories
F7=|{(a′,b′)|a′=b′,a′∈a.G,b′∈b.G}|
This feature represents the number of equivalent categories between pages a and b.
8) Category Jaccard coefficient
F8 = F7 / (|a.G| + |b.G| - F7)
This feature represents the Jaccard coefficient of the equivalent categories between pages a and b.
3.2) topic features:
A page a is re-represented by the result a′ of segmenting its abstract and table of contents:
a′ = seg(a.A) ∪ seg(a.C)
where seg(a.A) denotes the result of segmenting the abstract of page a, and seg(a.C) the result of segmenting its table of contents.
for Chinese and English pages with equivalent relation, combining the results after word segmentation to form a new document. And training a cross-language latent Dirichlet distribution model through the new document, and mapping the theme distribution of different language pages into the same space. For a new page, the trained model is used to predict its topic distribution. And (4) performing theme distribution splicing on the Chinese and English pages to obtain a new characteristic F9.
3.3) document vector features:
doc2vec models the context of words in the document vector. Using the Chinese and English wiki texts as corpora, 100-dimensional document vectors are constructed for each corpus. A model is then built to map the document vectors of the two spaces into the same space, so that the similarity between two document vectors can be computed indirectly.
First, a Chinese document vector is denoted Q (query), an equivalent English document vector P (positive), and a non-equivalent English document vector N (negative). A mapping function f_zh is defined for the Chinese space and a mapping function f_en for the English space. Through these two mapping functions, vectors from different spaces are mapped into the same space, and the similarity between two documents can be measured by the Euclidean distance between them. The similarity between Q and P is expressed as:
D(f_zh(Q), f_en(P)) = ||f_zh(Q) - f_en(P)||
where the function D represents the Euclidean distance between the two vectors. The smaller the value of D, the more similar the two vectors, and the higher the probability that the two documents describe the same entity. For a triple (Q, P, N), clearly:
D(f_zh(Q), f_en(P)) < D(f_zh(Q), f_en(N))
The problem thus turns into solving for f_zh and f_en; the two mappings are trained by a deep learning method. The input of the deep learning network is a 300-dimensional vector composed of (Q, P, N). Q is passed through three fully-connected hidden layers and a normalization layer in the middle of the neural network to obtain the vector f_zh(Q); because P and N are both English documents, they are passed through the same three fully-connected hidden layers and normalization layer to obtain the vectors f_en(P) and f_en(N). Finally, f_zh(Q), f_en(P), and f_en(N) are defined in the same vector space V_d, i.e., f_en(.) ∈ V_d and f_zh(.) ∈ V_d; d is taken equal to 8. The loss function is defined as follows:
loss(Q, P, N) = max{0, g + D(f_zh(Q), f_en(P)) - D(f_zh(Q), f_en(N))} + λ||ω||²
where g is a gap parameter used to adjust the distance between two documents, λ is the regularization coefficient, and ω represents all parameters of the hidden layers of the neural network. By training the neural network, f_zh and f_en are obtained; these two functions map the document vectors of the Chinese and English wiki pages into the same space. For a pair of Chinese and English wiki pages, two new vectors are obtained through the neural network, and their concatenation serves as the new feature F10.
Further, the step 4) specifically comprises the following steps:
After feature extraction, 10 features are obtained in total, and concatenating them yields a long vector. The long vector is used as the input of a learning-to-rank algorithm; a ranking model is trained and used to rank Candidate1000(a) for a Chinese page a, so that the most similar English page b ∈ Candidate1000(a) is taken as the cross-language link of a.
Compared with the prior art, the method has the following beneficial effects:
1. The method uses page out-links and existing cross-language links to limit the size of the English cross-language candidate set, which greatly reduces the scale of the problem and makes the subsequent algorithm simpler to implement.
2. The method extracts 3 types of features (handcrafted features, topic features, and document vector features) from the structural and textual information of a page; together these features reflect the information contained in a page from both a semantic and a structural perspective.
3. The method casts cross-language link discovery as a ranking problem, i.e., retrieving English documents with the Chinese document as the query, and applies learning-to-rank to find cross-language links, thereby obtaining better results.
Drawings
FIG. 1 is an overall framework of the present invention;
FIG. 2 is a diagram of the neural network structure in step 3).
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
As shown in FIG. 1 and FIG. 2, the Wikipedia-based Chinese-English cross-language entity matching method provided by the invention comprises the following steps:
1) preprocessing the Chinese and English wiki pages: 1.1) extracting the title, abstract, table of contents, full text, out-links, in-links, and category information of each page; 1.2) translating the titles of Chinese wiki pages into English; 1.3) segmenting the abstract, the table of contents, and the full text; 1.4) extracting the existing ordinary-page cross-language links and category-page cross-language links in the Chinese and English Wikipedia.
2) Extracting the English cross-language page candidate set of each Chinese wiki page using the page out-links extracted in step 1) and the existing ordinary-page cross-language links.
3) For each Chinese wiki page from step 2), computing the features between the Chinese wiki page and each page in its English cross-language page candidate set, including handcrafted features, topic features, and document vector features.
4) Taking the data obtained in step 3) as the data set, constructing a ranking model, ranking the English cross-language page candidate set of the current Chinese page by similarity, and taking the page with the highest similarity in the candidate set as the cross-language link of the current Chinese page.
The step 1) comprises the following steps:
1.1) extracting the Wikipedia page information: each Wikipedia page can be represented by a 7-tuple a = {T, A, B, C, G, I, O}, where T, A, and B are the title, abstract, and full text, C and G are two sets representing the table of contents and category information of the current page, and I and O are the in-links and out-links of the page. If a_i contains a link pointing to a_j, then a_i is an in-link of a_j, and a_j is an out-link of a_i. Thus, the Chinese Wikipedia can be expressed as
ε_cn = {a_1, a_2, …, a_n}
English Wikipedia can be expressed as
ε_en = {b_1, b_2, …, b_m}
n and m are the total number of Chinese and English pages respectively.
1.2) Chinese wiki page title translation: according to the Chinese page title information extracted in step 1.1), all Chinese titles are translated into English using the Baidu translation API, and the translated result is denoted a.translate.
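As a sketch of step 1.2), the titles could be translated with a short script. The endpoint and MD5 signing scheme below follow the publicly documented Baidu Fanyi general-translation API; appid and key are placeholder credentials, and the exact API details should be checked against the current documentation.

    import hashlib
    import random
    import requests

    def baidu_translate(query, appid, key, src="zh", dst="en"):
        """Translate one title via the Baidu Fanyi general-translation API (sketch)."""
        salt = str(random.randint(32768, 65536))
        # sign = MD5(appid + q + salt + secret key), per the public API documentation
        sign = hashlib.md5((appid + query + salt + key).encode("utf-8")).hexdigest()
        resp = requests.get(
            "https://fanyi-api.baidu.com/api/trans/vip/translate",
            params={"q": query, "from": src, "to": dst,
                    "appid": appid, "salt": salt, "sign": sign},
            timeout=10,
        )
        return resp.json()["trans_result"][0]["dst"]

    # a.translate = baidu_translate(a.T, appid="<appid>", key="<secret>")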
1.3) page information word segmentation: according to the page information extracted in step 1.1), for a Chinese page, the abstract, table of contents, and full text are segmented with a Chinese word segmentation tool; for an English page, stop words are removed with the Stanford tokenizer and each field is segmented into words.
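A possible implementation of step 1.3) is sketched below. The Chinese segmenter's name is garbled in the translated text, so jieba stands in for it, and NLTK stands in for the Stanford tokenizer; both substitutions are assumptions.

    import jieba                       # stand-in for the (garbled) Chinese segmenter
    from nltk.corpus import stopwords  # NLTK stands in for the Stanford tokenizer
    from nltk.tokenize import word_tokenize

    EN_STOP = set(stopwords.words("english"))

    def seg_zh(text):
        """Segment Chinese text into a list of words."""
        return [w for w in jieba.cut(text) if w.strip()]

    def seg_en(text):
        """Tokenize English text and drop stop words."""
        return [w.lower() for w in word_tokenize(text)
                if w.isalpha() and w.lower() not in EN_STOP]

    # both functions are applied to the abstract, table of contents, and full text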
1.4) extracting existing cross-language links: a number of cross-language links already exist in Wikipedia; ordinary-page cross-language links and category-page cross-language links can be extracted separately by distinguishing the namespace fields in wikidump. If a_i ∈ ε_cn, b_j ∈ ε_en, and a_i and b_j form a cross-language link, then a_i = b_j. All existing cross-language links constitute the set:
Eq = {{a_i, b_j} | a_i ∈ ε_cn, b_j ∈ ε_en, a_i = b_j}
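Extraction of Eq can be sketched as follows, assuming the zhwiki langlinks dump has already been parsed into (page_id, lang, target_title) rows in the standard MediaWiki langlinks layout; the function returns the zh-to-en title mapping used in later steps.

    def build_existing_links(langlink_rows, id2title):
        """Collect existing zh -> en cross-language links (the set Eq)."""
        zh2en = {}
        for page_id, lang, target_title in langlink_rows:
            if lang == "en" and page_id in id2title:
                zh2en[id2title[page_id]] = target_title   # a_i = b_j
        return zh2en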
the step 2) comprises the following steps:
2.1) Define the out-link equivalence relation:
First, the following structure is defined as an equivalent out-link, where a, a_i ∈ ε_cn and b, b_i ∈ ε_en; that is, page a and page b have equivalent out-links, and such a and b stand in an out-link equivalence relation:
a → a_i, b → b_i, a_i = b_i
2.2) extracting the English cross-language page candidate set of a Chinese page: by the definition in 2.1), all English pages b that stand in an out-link equivalence relation with Chinese page a constitute Candidate(a). Empirically, the more equivalent out-links exist between a and b, the higher the probability that they form a cross-language link. Since Candidate(a) contains a large number of pages, it must be screened further: the English pages in Candidate(a) are sorted in descending order by the number of equivalent out-links, and the top 1000 are taken as the English cross-language page candidate set of Chinese page a, denoted Candidate1000(a), giving the final candidate set.
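Step 2.2) then amounts to counting, for every English page b, how many of a's out-links map through Eq onto out-links of b. A minimal sketch, where zh2en is the mapping extracted above and en_inlinks[t] is assumed to give the set of English pages whose out-links include title t:

    from collections import Counter

    def candidate1000(a, zh2en, en_inlinks, k=1000):
        """Top-k English candidates of Chinese page a by equivalent out-link count."""
        counts = Counter()
        for a_i in a.O:                        # out-link a -> a_i
            b_i = zh2en.get(a_i)               # existing cross-language link a_i = b_i
            if b_i is None:
                continue
            for b in en_inlinks.get(b_i, ()):  # English pages b with out-link b -> b_i
                counts[b] += 1                 # one more equivalent out-link for b
        return [b for b, _ in counts.most_common(k)]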
The step 3) comprises the following steps:
3.1) handcrafted feature extraction: for a Chinese page a, the features between it and each English cross-language candidate page b ∈ Candidate1000(a) are extracted as follows:
1) title similarity
F1=Dist(a.translate,b.T)
Dist calculates the edit distance between a.translate and b.T.
2) Number of equivalent out-links
F2=|{(a′,b′)|a′=b′,a′∈a.O,b′∈b.O}|
This feature represents the number of equivalent out-links between pages a and b.
3) Out-link Jaccard coefficient
F3 = F2 / (|a.O| + |b.O| - F2)
This feature represents the Jaccard coefficient of the equivalent out-links between pages a and b.
4) Normalized equivalent out-links
For a Chinese page a, its normalization factor M can be expressed by the following equation:
M = max |{(a′, b′) | a′ = b′, a′ ∈ a.O, b′ ∈ c.O, c ∈ Candidate1000(a)}|
Feature 4 is the result of feature 2 normalized by M, as follows:
F4 = F2 / M
5) Number of equivalent in-links
F5=|{(a′,b′)|a′=b′,a′∈a.I,b′∈b.I}|
This feature represents the number of equivalent in-links between pages a and b.
6) In-link Jaccard coefficient
F6 = F5 / (|a.I| + |b.I| - F5)
This feature represents the Jaccard coefficient of the equivalent in-links between pages a and b.
7) Number of equivalent categories
F7=|{(a′,b′)|a′=b′,a′∈a.G,b′∈b.G}|
This feature represents the number of equivalent categories between pages a and b.
8) Category Jaccard coefficient
F8 = F7 / (|a.G| + |b.G| - F7)
This feature represents the Jaccard coefficient of the equivalent categories between pages a and b.
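Features F1-F8 follow directly from the page tuples once a's link and category sets are projected into English titles via zh2en; the sketch below uses nltk's edit_distance as a stand-in for Dist, and the projection step is an implementation assumption.

    from nltk import edit_distance

    def jaccard(x, y):
        union = x | y
        return len(x & y) / len(union) if union else 0.0

    def handcrafted_features(a, b, zh2en, M):
        """Compute F1-F8 for Chinese page a and English candidate page b."""
        # project a's link and category sets into English titles via existing links
        aO = {zh2en[t] for t in a.O if t in zh2en}
        aI = {zh2en[t] for t in a.I if t in zh2en}
        aG = {zh2en[t] for t in a.G if t in zh2en}
        F1 = edit_distance(a.translate, b.T)    # title similarity (edit distance)
        F2 = len(aO & b.O)                      # number of equivalent out-links
        F3 = jaccard(aO, b.O)                   # out-link Jaccard coefficient
        F4 = F2 / M if M else 0.0               # normalized equivalent out-links
        F5 = len(aI & b.I)                      # number of equivalent in-links
        F6 = jaccard(aI, b.I)                   # in-link Jaccard coefficient
        F7 = len(aG & b.G)                      # number of equivalent categories
        F8 = jaccard(aG, b.G)                   # category Jaccard coefficient
        return [F1, F2, F3, F4, F5, F6, F7, F8]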
3.2) topic features:
The latent Dirichlet allocation model is usually used to compute the topic distribution of a page within a page set; for cross-language topic distributions, however, the topic distribution vectors of different languages are defined in different spaces and are not directly comparable.
By observation, we find that the words in a page's abstract and table of contents express the page topic well: the abstract outlines the content the page intends to convey, and the table of contents defines the page's descriptive framework. Therefore, in this section we re-represent a page a by the segmented abstract and table of contents, denoted a′; the segmentation was already done during data preprocessing.
a′ = seg(a.A) ∪ seg(a.C)
where seg(a.A) denotes the result of segmenting the abstract of page a, and seg(a.C) the result of segmenting its table of contents.
for Chinese and English pages with equivalent relation, combining the results after word segmentation to form a new document. Through the new documents, a cross-language latent Dirichlet distribution model is trained, and the theme distributions of different language pages are mapped into the same space. For a new page, we use this trained model to predict its topic distribution. And (4) performing theme distribution splicing on the Chinese and English pages to obtain a new characteristic F9.
3.3) document vector features:
doc2vec models the context of words in the document vector, expressing the semantic relationships between documents objectively. Using the Chinese and English wiki texts as corpora, 100-dimensional document vectors are constructed for each corpus. However, since the document vectors of the two languages are constructed in different spaces, their relationship cannot be obtained directly. We therefore construct a model that maps the document vectors of the two spaces into the same space, so that the similarity between two document vectors can be computed indirectly.
First, we denote a Chinese document vector Q (query), an equivalent English document vector P (positive), and a non-equivalent English document vector N (negative). As mentioned above, the similarities between Q and P and between Q and N cannot be computed directly, so we define a mapping function f_zh for the Chinese space and a mapping function f_en for the English space. With these two mapping functions, vectors from different spaces are mapped into the same space, and the similarity between two documents can be measured by the Euclidean distance between them. Taking Q and P as an example:
D(f_zh(Q), f_en(P)) = ||f_zh(Q) - f_en(P)||
where the function D represents the Euclidean distance between the two vectors. The smaller the value of D, the more similar the two vectors, and the higher the probability that the two documents describe the same entity. For a triple (Q, P, N), clearly:
D(f_zh(Q), f_en(P)) < D(f_zh(Q), f_en(N))
The problem thus turns into solving for f_zh and f_en.
We adopt a deep learning method to train these two mappings; the network structure is shown in FIG. 2. The input of the network is a 300-dimensional vector composed of (Q, P, N). Q passes through the hidden layers in the middle of the neural network, and the output is f_zh(Q); since P and N are both English documents, they share the network structure on the right side of FIG. 2, and the outputs are f_en(P) and f_en(N). The hidden-layer structures of the left and right parts of FIG. 2 are identical: three fully-connected layers followed by a normalization layer. The final outputs are defined in the same vector space V_d, i.e., f_en(.) ∈ V_d and f_zh(.) ∈ V_d; in this embodiment we take d = 8. The loss function is defined as follows:
loss(Q, P, N) = max{0, g + D(f_zh(Q), f_en(P)) - D(f_zh(Q), f_en(N))} + λ||ω||²
where g is a gap parameter used to adjust the distance between two documents, λ is the regularization coefficient, and ω represents all parameters of the hidden layers in FIG. 2. By training the neural network, f_zh and f_en are obtained; these two functions map the document vectors of the Chinese and English wiki pages into the same space. For a pair of Chinese and English wiki pages, two new vectors are obtained through the neural network, and their concatenation serves as the new feature F10.
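A PyTorch sketch of the mapping network of FIG. 2: two towers of three fully-connected layers each, followed by a normalization layer, trained with the margin loss above. The hidden width and gap g are assumptions, and the λ||ω||² term is delegated to the optimizer's weight decay.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Mapper(nn.Module):
        """Three fully-connected layers plus a normalization layer, mapping a
        100-d doc2vec vector into the shared 8-d space V_d."""
        def __init__(self, in_dim=100, hidden=64, out_dim=8):  # hidden width assumed
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim),
            )

        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)           # normalization layer

    f_zh, f_en = Mapper(), Mapper()   # f_en is shared by P and N (right side of FIG. 2)

    def triplet_loss(Q, P, N, g=0.5):                         # gap g assumed
        d_pos = torch.norm(f_zh(Q) - f_en(P), dim=-1)
        d_neg = torch.norm(f_zh(Q) - f_en(N), dim=-1)
        return torch.clamp(g + d_pos - d_neg, min=0).mean()

    # the lambda * ||w||^2 term is realized as weight decay in the optimizer
    optimizer = torch.optim.Adam(
        list(f_zh.parameters()) + list(f_en.parameters()),
        lr=1e-3, weight_decay=1e-4)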
Further, the step 4) comprises:
After feature extraction, 10 features are obtained in total, and concatenating them yields a long vector. The long vector is used as the input of a learning-to-rank algorithm; a ranking model is trained and used to rank Candidate1000(a) for a Chinese page a, so that the most similar English page b ∈ Candidate1000(a) is taken as the cross-language link of a.
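The patent does not name a specific learning-to-rank algorithm; as one possibility, LightGBM's LambdaRank objective fits this setting, grouping the feature rows of each Chinese page's candidates:

    import numpy as np
    import lightgbm as lgb

    def train_ranker(X_train, y_train, group_train):
        """X_train: one feature row per (Chinese page, candidate) pair; y_train: 1 for
        the true cross-language link, else 0; group_train[i]: candidate count of page i."""
        ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=200)
        ranker.fit(X_train, y_train, group=group_train)
        return ranker

    def best_candidate(ranker, X_a, candidates_a):
        """Return the English page predicted as a's cross-language link."""
        scores = ranker.predict(X_a)          # one row per page in Candidate1000(a)
        return candidates_a[int(np.argmax(scores))]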
Examples
The following provides an example to illustrate the implementation steps of the present invention:
(1) The example uses data sets from the Chinese Wikipedia and the English Wikipedia, where the Chinese Wikipedia contains 1020863 pages and the English Wikipedia contains 6144107 pages. The information structure of each page is analyzed; the title, abstract, table of contents, categories, in-links, out-links, full text, and other information are extracted and stored in a Lucene index, where fields other than the title may be empty.
(2) 3000 pages with existing cross-language links are randomly sampled from the Chinese Wikipedia of step (1), and the English cross-language page candidate sets of these 3000 Chinese pages are extracted using out-links and existing cross-language links.
(3) Training data are constructed from the existing cross-language links to train the parameters of the latent Dirichlet allocation model; doc2vec is used to train 100-dimensional document vectors for the Chinese and English Wikipedia pages respectively.
(4) For the English cross-language page candidate sets extracted in step (2), the features between each Chinese page and every page of its English candidate set are computed, including handcrafted features, topic features, and document vector features.
(5) Part of the data obtained in step (4) is used as the training set to build a ranking model; the English cross-language page candidate set of the current Chinese page is ranked by similarity, and the candidate page with the highest similarity is taken as the cross-language link of the current Chinese page.
The results of running this example are as follows. For Chinese-English cross-language entity matching, the proposed method (SR-Topic-Doc2vec) is compared with 5 other methods on three evaluation metrics (Precision, Recall, and F1): exact title matching (TM), title similarity matching (TS), Jaccard feature mean matching (SA1), Jaccard feature and title similarity mean matching (SA2), and learning-to-rank over handcrafted features (SR). The proposed method performs best:
method of producing a composite material Precision Recall F1-score
TM 97.65 26.03 41.10
TS 56.03 56.03 56.03
SA1 34.73 34.73 34.73
SA2 65.60 65.60 65.60
SR 68.75 68.75 68.75
SR-Topic-doc2vec 70.28 70.28 70.28
As the comparison shows, the method outperforms the other methods on all three metrics on a data set close to the actual situation. The method therefore has good practical value and application prospects for Chinese-English cross-language entity matching.

Claims (4)

1. A Chinese-English cross-language entity matching method based on Wikipedia is characterized by comprising the following steps:
1) preprocessing the Chinese and English wiki pages:
1.1) extracting the title, abstract, table of contents, full text, out-links, in-links, and category information of each page;
1.2) translating the title of the Chinese wiki page into English;
1.3) segmenting the abstract, the table of contents, and the full text;
1.4) extracting the existing ordinary-page cross-language links and category-page cross-language links in the Chinese and English Wikipedia;
2) extracting the English cross-language page candidate set of each Chinese wiki page using the page out-links extracted in step 1) and the existing ordinary-page cross-language links;
3) for each Chinese wiki page from step 2), calculating the features between the Chinese wiki page and each page in its English cross-language page candidate set, the features comprising handcrafted features, topic features, and document vector features; step 3) specifically comprises:
3.1) handcrafted feature extraction: for a Chinese page a, extracting the features between it and each English cross-language candidate page b ∈ Candidate1000(a), defined as follows:
① title similarity
F1=Dist(a.translate,b.T)
Dist calculates the edit distance between a.translate and b.T;
② number of equivalent out-links
F2=|{(a′,b′)|a′=b′,a′∈a.O,b′∈b.O}|
This feature represents the number of equivalent out links between pages a and b;
③ out-link Jaccard coefficient
F3 = F2 / (|a.O| + |b.O| - F2)
This feature represents the Jaccard coefficient of the equivalent out-links between pages a and b;
④ normalized equivalent out-links
For a Chinese page a, its normalization factor M can be expressed by the following equation:
M = max |{(a′, b′) | a′ = b′, a′ ∈ a.O, b′ ∈ c.O, c ∈ Candidate1000(a)}|
feature 4 is the result of feature 2 normalized by M, as follows:
F4 = F2 / M
⑤ number of equivalent in-links
F5=|{(a′,b′)|a′=b′,a′∈a.I,b′∈b.I}|
This feature represents the number of equivalent in-links between pages a and b;
⑥ in-link Jaccard coefficient
F6 = F5 / (|a.I| + |b.I| - F5)
This feature represents the Jaccard coefficient of the equivalent in-links between pages a and b;
⑦ number of equivalent categories
F7=|{(a′,b′)|a′=b′,a′∈a.G,b′∈b.G}|
This feature represents the number of equivalent categories between pages a and b;
⑧ category Jaccard coefficient
F8 = F7 / (|a.G| + |b.G| - F7)
This feature represents the Jaccard coefficient of the equivalent categories between pages a and b;
3.2) topic features:
re-representing a page a by the result a′ of segmenting its abstract and table of contents:
a′ = seg(a.A) ∪ seg(a.C)
where seg(a.A) denotes the result of segmenting the abstract of page a, and seg(a.C) the result of segmenting its table of contents;
for Chinese and English pages connected by an existing cross-language link, merging the segmented results to form a new document; training a cross-language latent Dirichlet allocation model on the new documents, and mapping the topic distributions of pages in different languages into the same space; for a new page, using the trained model to predict its topic distribution; concatenating the topic distributions of the Chinese and English pages to obtain the new feature F9;
3.3) document vector features:
doc2vec models the context of words in the document vector; using the Chinese and English wiki texts as corpora, 100-dimensional document vectors are constructed for each corpus; a model is built to map the document vectors of the two spaces into the same space, so that the similarity between two document vectors can be computed indirectly;
first, a Chinese document vector is denoted Q (query), an equivalent English document vector P (positive), and a non-equivalent English document vector N (negative); a mapping function f_zh is defined for the Chinese space and a mapping function f_en for the English space; vectors in different spaces are mapped into the same space through the two mapping functions; the similarity between two documents can be measured by the Euclidean distance between them; the similarity between Q and P is expressed as:
D(f_zh(Q), f_en(P)) = ||f_zh(Q) - f_en(P)||
wherein the function D represents the Euclidean distance between the two vectors; the smaller the value of D, the more similar the two vectors, and the higher the probability that the two documents describe the same entity; for a triple (Q, P, N), clearly:
D(f_zh(Q), f_en(P)) < D(f_zh(Q), f_en(N))
the problem thus turns into solving for f_zh and f_en; the two mappings are trained by a deep learning method; the input of the deep learning network is a 300-dimensional vector composed of (Q, P, N); Q is passed through three fully-connected hidden layers and a normalization layer in the middle of the neural network to obtain the vector f_zh(Q); P and N are both English documents and are passed through the same three fully-connected hidden layers and normalization layer to obtain the vectors f_en(P) and f_en(N); finally, f_zh(Q), f_en(P), and f_en(N) are defined in the same vector space V_d, i.e., f_en(.) ∈ V_d and f_zh(.) ∈ V_d; d is taken equal to 8; the loss function is defined as follows:
loss(Q, P, N) = max{0, g + D(f_zh(Q), f_en(P)) - D(f_zh(Q), f_en(N))} + λ||ω||²
wherein g is a gap parameter used to adjust the distance between two documents, λ is the regularization coefficient, and ω represents all parameters of the hidden layers of the neural network; by training the neural network, f_zh and f_en are obtained, and the two functions map the document vectors of the Chinese and English wiki pages into the same space; for a pair of Chinese and English wiki pages, two new vectors are obtained through the neural network, and the result of concatenating them is used as the new feature F10;
4) taking the data obtained in step 3) as the data set, constructing a ranking model, ranking the English cross-language page candidate set of the current Chinese page by similarity, and taking the page with the highest similarity in the candidate set as the cross-language link of the current Chinese page.
2. The Wikipedia-based Chinese-English cross-language entity matching method according to claim 1, wherein step 1) comprises:
1.1) extracting the Wikipedia page information: each Wikipedia page can be represented by a 7-tuple a = {T, A, B, C, G, I, O}, where T, A, and B are the title, abstract, and full text, C and G are two sets representing the table of contents and category information of the current page, and I and O are the in-links and out-links of the page; if a_i contains a link pointing to a_j, then a_i is an in-link of a_j, and a_j is an out-link of a_i; thus, the Chinese Wikipedia can be expressed as
ε_cn = {a_1, a_2, …, a_n}
English Wikipedia can be expressed as
ε_en = {b_1, b_2, …, b_m}
n and m are the total number of Chinese and English pages respectively;
1.2) Chinese wiki page title translation: according to the Chinese page title information extracted in the step 1.1), all Chinese titles are translated into English by using Baidu translation api, and the translated result is expressed by a.translate;
1.3) page information word segmentation: according to the page information extracted in step 1.1), for a Chinese page, segmenting the abstract, table of contents, and full text with a Chinese word segmentation tool; for an English page, removing stop words with the Stanford tokenizer and segmenting the abstract, table of contents, and full text into words;
1.4) extracting existing cross-language links: a number of cross-language links already exist in Wikipedia; ordinary-page cross-language links and category-page cross-language links can be extracted separately by distinguishing the namespace fields in wikidump; if a_i ∈ ε_cn, b_j ∈ ε_en, and a_i and b_j form a cross-language link, then a_i = b_j; all existing cross-language links constitute the set:
Eq = {{a_i, b_j} | a_i ∈ ε_cn, b_j ∈ ε_en, a_i = b_j}.
3. The Wikipedia-based Chinese-English cross-language entity matching method according to claim 2, wherein step 2) comprises:
2.1) defining the out-link equivalence relation:
the following structure is defined as an equivalent out-link, where a, a_i ∈ ε_cn and b, b_i ∈ ε_en; that is, page a and page b have equivalent out-links, and such a and b stand in an out-link equivalence relation:
a → a_i, b → b_i, a_i = b_i
2.2) extracting the English cross-language page candidate set of a Chinese page: by the definition in 2.1), all English pages b that stand in an out-link equivalence relation with Chinese page a constitute Candidate(a); the more equivalent out-links exist between a and b, the higher the probability that they form a cross-language link; the English pages in Candidate(a) are sorted in descending order by the number of equivalent out-links, and the top 1000 are taken as the English cross-language page candidate set of Chinese page a, denoted Candidate1000(a), giving the final candidate set.
4. The Wikipedia-based Chinese-English cross-language entity matching method according to claim 1, wherein step 4) specifically comprises:
after feature extraction, 10 features are obtained in total, and concatenating the 10 features yields a long vector; the long vector is used as the input of a learning-to-rank algorithm, a ranking model is trained, and the ranking model is used to rank Candidate1000(a) for a Chinese page a, so that the most similar English page b ∈ Candidate1000(a) is taken as the cross-language link of a.
CN201611005395.XA 2016-11-11 2016-11-11 Chinese-English cross-language entity matching method based on Wikipedia Active CN106570191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611005395.XA CN106570191B (en) 2016-11-11 2016-11-11 Chinese-English cross-language entity matching method based on Wikipedia

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611005395.XA CN106570191B (en) 2016-11-11 2016-11-11 Chinese-English cross-language entity matching method based on Wikipedia

Publications (2)

Publication Number Publication Date
CN106570191A CN106570191A (en) 2017-04-19
CN106570191B true CN106570191B (en) 2020-05-26

Family

ID=58542310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611005395.XA Active CN106570191B (en) 2016-11-11 2016-11-11 Chinese-English cross-language entity matching method based on Wikipedia

Country Status (1)

Country Link
CN (1) CN106570191B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220317B (en) * 2017-05-17 2020-12-18 北京百度网讯科技有限公司 Matching degree evaluation method, device, equipment and storage medium based on artificial intelligence
CN110019810B (en) * 2018-01-02 2021-05-04 中国移动通信有限公司研究院 Ontology matching method and device and computer storage medium
CN108197290B (en) * 2018-01-19 2021-08-03 桂林电子科技大学 Knowledge graph representation learning method integrating entity description and relationship description
CN108304552B (en) * 2018-02-01 2021-01-08 浙江大学 Named entity linking method based on knowledge base feature extraction
CN108563766A (en) * 2018-04-19 2018-09-21 天津科技大学 The method and device of food retrieval
CN109255121A (en) * 2018-07-27 2019-01-22 中山大学 A kind of across language biomedicine class academic paper information recommendation method based on theme class
CN109582756B (en) * 2018-10-30 2022-10-14 长春理工大学 Cloud autonomous logic filing method for unstructured source data
CN109710923B (en) * 2018-12-06 2020-09-01 浙江大学 Cross-language entity matching method based on cross-media information
CN111814465A (en) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 Information extraction method and device based on machine learning, computer equipment and medium
CN112632282B (en) * 2020-12-30 2021-11-19 中科院计算技术研究所大数据研究院 Chinese and English thesis data classification and query method
CN113032558B (en) * 2021-03-11 2023-08-29 昆明理工大学 Variable semi-supervised hundred degree encyclopedia classification method integrating wiki knowledge
CN113076398B (en) * 2021-03-30 2022-07-29 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336852A (en) * 2013-07-24 2013-10-02 清华大学 Cross-language ontology construction method and device
CN105260483A (en) * 2015-11-16 2016-01-20 金陵科技学院 Microblog-text-oriented cross-language topic detection device and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336852A (en) * 2013-07-24 2013-10-02 清华大学 Cross-language ontology construction method and device
CN105260483A (en) * 2015-11-16 2016-01-20 金陵科技学院 Microblog-text-oriented cross-language topic detection device and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Cross-lingual knowledge linking across wiki knowledge bases; Zhichun Wang et al.; Proceedings of the ACM WWW Conference; 2012-04-20; pp. 459-468 *
Cross-language entity linking and semantic query based on knowledge graphs (in Chinese); Su Yonghao (苏永浩); China Master's Theses Full-text Database, Information Science and Technology; 2016-10-15; vol. 2016, no. 10; pp. 3, 5 and section 3.5 *
Named entity recognition and translation in cross-language information retrieval (in Chinese); Wu Dan (吴丹) et al.; Documentation, Information & Knowledge; 2012-06-30; vol. 2012, no. 3; pp. 13-19 *

Also Published As

Publication number Publication date
CN106570191A (en) 2017-04-19

Similar Documents

Publication Publication Date Title
CN106570191B (en) Chinese-English cross-language entity matching method based on Wikipedia
CN110717047B (en) Web service classification method based on graph convolution neural network
CN108573411B (en) Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments
CN112001185B (en) Emotion classification method combining Chinese syntax and graph convolution neural network
RU2662688C1 (en) Extraction of information from sanitary blocks of documents using micromodels on basis of ontology
CN111680173A (en) CMR model for uniformly retrieving cross-media information
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
Zhang et al. A data processing method based on sequence labeling and syntactic analysis for extracting new sentiment words from product reviews
Vuong et al. SM-BERT-CR: a deep learning approach for case law retrieval with supporting model
Anoop et al. A topic modeling guided approach for semantic knowledge discovery in e-commerce
CN111563374B (en) Personnel social relationship extraction method based on judicial official documents
Emami et al. A pattern-matching method for extracting personal information in farsi content
CN108427769B (en) Character interest tag extraction method based on social network
Xia et al. Content-irrelevant tag cleansing via bi-layer clustering and peer cooperation
El-Halees et al. Ontology based Arabic opinion mining
CN116244497A (en) Cross-domain paper recommendation method based on heterogeneous data embedding
Fan et al. Mining collective knowledge: inferring functional labels from online review for business
Liu et al. Keyword extraction using PageRank on synonym networks
CN114238617A (en) Industry hotspot recommendation method and system
CN107861924A (en) A kind of eBook content method for expressing based on Partial Reconstruction model
Xia et al. Research on feature-based opinion mining using topic maps
Tran et al. A named entity recognition approach for tweet streams using active learning
Meng et al. An attention network based on feature sequences for cross-domain sentiment classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant