CN106570191B - Chinese-English cross-language entity matching method based on Wikipedia - Google Patents

Chinese-English cross-language entity matching method based on Wikipedia

Info

Publication number
CN106570191B
CN106570191B (application CN201611005395.XA)
Authority
CN
China
Prior art keywords
page, chinese, english, language, cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611005395.XA
Other languages
Chinese (zh)
Other versions
CN106570191A (en)
Inventor
鲁伟明 (Lu Weiming)
戴豪 (Dai Hao)
庄越挺 (Zhuang Yueting)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201611005395.XA
Publication of CN106570191A
Application granted
Publication of CN106570191B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558: Details of hyperlinks; Management of linked annotations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/3332: Query translation
    • G06F16/3337: Translation of the query language, e.g. Chinese to English

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese-English cross-language entity matching method based on Wikipedia. The method first obtains Chinese and English Wikipedia page data through wikidump and preprocesses the data, specifically: extracting the title, abstract, table of contents, full text, out-links, in-links, and category information of each page; translating the titles of Chinese wiki pages into English; segmenting the abstract, table of contents, and full text; and extracting the existing page cross-language links and category cross-language links in the Chinese and English Wikipedia. Then, for each Chinese wiki page, an English cross-language page candidate set is obtained from its out-link information. Next, the features between the Chinese page and each page in its English cross-language candidate set are computed. Finally, a ranking model is constructed to rank the English cross-language page candidate set of the current Chinese page by similarity, and the candidate page with the highest similarity is taken as the cross-language link of the current Chinese page.

Description

Chinese-English cross-language entity matching method based on Wikipedia
Technical Field
The invention relates to topic models, deep learning, text similarity calculation, and related techniques, and in particular to a Chinese-English cross-language entity matching method based on Wikipedia.
Background
With the development of machine learning and deep learning, knowledge base construction has steadily matured. Many knowledge bases exist. DBpedia, a well-known semantic web application, extracts structured data from Wikipedia entries to enhance Wikipedia's search capability and to link other data sets to Wikipedia. Freebase is a large collaborative knowledge base that integrates many resources on the web; its entries, like DBpedia's, take the form of structured data. Inspecting its data shows that all content is formatted, stored, and presented as triples. Because the schema is fixed and items of the same type share the same attributes, data of the same kind can easily be associated, which is convenient for information query. Freebase contains tens of millions of topics and thousands of types and attributes. In the Chinese domain, many knowledge bases have also been constructed, such as Zhishi.me.
In Wikipedia, over 400,000 Chinese-English cross-language links already exist, including cross-language links between ordinary pages and cross-language links between category pages. With this cross-language link information, together with the rich textual and structural information of the wiki, many interesting applications become possible, such as machine translation and cross-language information retrieval. Philipp Sorg and Philipp Cimiano used Wikipedia's cross-language links and the hierarchical relations between category pages and ordinary pages to propose a cross-language explicit semantic analysis model for English-German cross-language information retrieval. Jie Tang et al. proposed a probabilistic topic model to solve cross-source, cross-language patent retrieval. The construction of cross-language links is therefore a crucial step in building a cross-language knowledge base.
The Chinese knowledge base we constructed provides rich corpus information: it contains about 12 million entities, the relationships among entities are very rich, and each entity contains 53.8 links on average. However, our knowledge base is not yet associated with authoritative knowledge bases in other languages, such as DBpedia, which is a significant loss of knowledge completeness and reliability and does not conform to the open principles of Linked Open Data (LOD). Considering that some cross-language links already exist between the Chinese and English Wikipedia, and that the Chinese knowledge base was built from three Chinese encyclopedias including Chinese Wikipedia, new cross-language links can be learned from this existing knowledge.
The invention mainly uses topic models and deep learning. A topic model assumes that each word of a document is generated by first choosing a topic with some probability and then choosing a word from that topic with some probability; consequently, two documents describing the same entity have similar topic distributions, and topic similarity can be computed from those distributions. doc2vec is a deep learning technique that, unlike topic models, encodes the context of words into the document vector, thereby expressing the semantic relationships between documents more objectively.
Disclosure of Invention
The invention provides a Chinese and English cross-language entity matching method based on Wikipedia for searching cross-language links and constructing a cross-language knowledge base.
The technical scheme adopted by the invention to solve the technical problem is as follows. A Wikipedia-based Chinese-English cross-language entity matching method comprises the following steps:
1) preprocessing the Chinese and English wiki pages:
1.1) extracting the title, abstract, table of contents, full text, out-links, in-links, and category information of each page;
1.2) translating the title of the Chinese wiki page into English;
1.3) segmenting the abstract, the table of contents, and the full text;
1.4) extracting the existing ordinary-page cross-language links and category-page cross-language links in the Chinese and English Wikipedia.
2) Extracting the English cross-language page candidate set of each Chinese wiki page using the page out-links extracted in step 1) and the existing ordinary-page cross-language links.
3) For each Chinese wiki page from step 2), computing the features between the Chinese wiki page and each page in its English cross-language page candidate set, including handcrafted features, topic features, and document vector features.
4) Taking the data obtained in step 3) as the data set, constructing a ranking model, ranking the English cross-language page candidate set of the current Chinese page by similarity, and taking the page with the highest similarity in the candidate set as the cross-language link of the current Chinese page.
Further, the step 1) comprises:
1.1) extracting the Wikipedia page information: each Wikipedia page can be represented by a 7-tuple a = {T, A, B, C, G, I, O}, where T, A, and B are the title, abstract, and full text, C and G are two sets representing the table of contents and category information of the current page, and I and O are the in-links and out-links of the page. If a_i contains a link pointing to a_j, then a_i is an in-link of a_j, and a_j is an out-link of a_i. Thus, the Chinese Wikipedia can be expressed as
ε_cn = {a_1, a_2, …, a_n}
English Wikipedia can be expressed as
ε_en = {b_1, b_2, …, b_m}
n and m are the total number of Chinese and English pages respectively.
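For illustration only, the 7-tuple can be carried in a small data structure; the following Python sketch is not part of the patent, and the field names simply mirror the notation above (the translate field anticipates step 1.2)).

    from dataclasses import dataclass, field
    from typing import Set

    @dataclass
    class WikiPage:
        """Page 7-tuple a = {T, A, B, C, G, I, O} from step 1.1)."""
        T: str = ""                                # title
        A: str = ""                                # abstract
        B: str = ""                                # full text
        C: Set[str] = field(default_factory=set)   # table of contents entries
        G: Set[str] = field(default_factory=set)   # categories
        I: Set[str] = field(default_factory=set)   # in-link titles
        O: Set[str] = field(default_factory=set)   # out-link titles
        translate: str = ""                        # English translation of T (step 1.2)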
1.2) Chinese wiki page title translation: according to the Chinese page title information extracted in step 1.1), all Chinese titles are translated into English using the Baidu translation API, and the translated result is denoted a.translate.
1.3) page information word segmentation: according to the page information extracted in step 1.1), for a Chinese page, the abstract, table of contents, and full text are segmented with a Chinese word segmentation tool; for an English page, stop words are removed with the Stanford tokenizer and the abstract, table of contents, and full text are segmented into words.
1.4) extracting existing cross-language links: a number of cross-language links already exist in Wikipedia; ordinary-page cross-language links and category-page cross-language links can be extracted separately by distinguishing the namespace fields in wikidump. If a_i ∈ ε_cn, b_j ∈ ε_en, and a_i and b_j form a cross-language link, then a_i = b_j. All existing cross-language links constitute the set:
Eq = {{a_i, b_j} | a_i ∈ ε_cn, b_j ∈ ε_en, a_i = b_j}
further, the step 2) comprises:
2.1) Define the out-link equivalence relation:
The following structure is defined as an equivalent out-link, where a, a_i ∈ ε_cn and b, b_i ∈ ε_en; that is, page a and page b have equivalent out-links, and such a and b stand in an out-link equivalence relation:
a → a_i, b → b_i, a_i = b_i
2.2) extracting the English cross-language page candidate set of a Chinese page: by the definition in 2.1), all English pages b that stand in an out-link equivalence relation with Chinese page a constitute Candidate(a). The more equivalent out-links exist between a and b, the higher the probability that they form a cross-language link. The English pages in Candidate(a) are sorted in descending order by the number of equivalent out-links, and the top 1000 are taken as the English cross-language page candidate set of Chinese page a, denoted Candidate1000(a), giving the final candidate set.
Further, the step 3) specifically comprises the following steps:
3.1) handcrafted feature extraction: for a Chinese page a, the features between it and each English cross-language candidate page b ∈ Candidate1000(a) are extracted as follows:
1) title similarity
F1=Dist(a.translate,b.T)
Dist calculates the edit distance between a.translate and b.T.
2) Number of equivalent out-links
F2=|{(a′,b′)|a′=b′,a′∈a.O,b′∈b.O}|
This feature represents the number of equivalent out-links between pages a and b.
3) Out-link Jaccard coefficient
F3 = F2 / (|a.O| + |b.O| - F2)
This feature represents the Jaccard coefficient of the equivalent out-links between pages a and b.
4) Normalized equivalent out-links
For a Chinese page a, its normalization factor M can be expressed by the following equation:
M = max |{(a′, b′) | a′ = b′, a′ ∈ a.O, b′ ∈ c.O, c ∈ Candidate1000(a)}|
Feature 4 is the result of feature 2 normalized by M, as follows:
F4 = F2 / M
5) Number of equivalent in-links
F5=|{(a′,b′)|a′=b′,a′∈a.I,b′∈b.I}|
This feature represents the number of equivalent in-links between pages a and b.
6) In-link Jaccard coefficient
F6 = F5 / (|a.I| + |b.I| - F5)
This feature represents the Jaccard coefficient of the equivalent in-links between pages a and b.
7) Number of equivalent categories
F7=|{(a′,b′)|a′=b′,a′∈a.G,b′∈b.G}|
This feature represents the number of equivalent categories between pages a and b.
8) Category Jaccard coefficient
F8 = F7 / (|a.G| + |b.G| - F7)
This feature represents the Jaccard coefficient of the equivalent categories between pages a and b.
3.2) topic features:
A page a is re-represented by the result a′ of segmenting its abstract and table of contents:
a′ = seg(a.A) ∪ seg(a.C)
where seg(a.A) denotes the result of segmenting the abstract of page a, and seg(a.C) the result of segmenting its table of contents.
for Chinese and English pages with equivalent relation, combining the results after word segmentation to form a new document. And training a cross-language latent Dirichlet distribution model through the new document, and mapping the theme distribution of different language pages into the same space. For a new page, the trained model is used to predict its topic distribution. And (4) performing theme distribution splicing on the Chinese and English pages to obtain a new characteristic F9.
3.3) document vector features:
doc2vec models the context of words in the document vector. Using the Chinese and English wiki texts as corpora, 100-dimensional document vectors are constructed for each corpus. A model is then built to map the document vectors of the two spaces into the same space, so that the similarity between two document vectors can be computed indirectly.
First, a Chinese document vector is denoted Q (query), an equivalent English document vector P (positive), and a non-equivalent English document vector N (negative). A mapping function f_zh is defined for the Chinese space and a mapping function f_en for the English space. Through these two mapping functions, vectors from different spaces are mapped into the same space, and the similarity between two documents can be measured by the Euclidean distance between them. The similarity between Q and P is expressed as:
D(f_zh(Q), f_en(P)) = ||f_zh(Q) - f_en(P)||
where the function D represents the Euclidean distance between the two vectors. The smaller the value of D, the more similar the two vectors, and the higher the probability that the two documents describe the same entity. For a triple (Q, P, N), clearly:
D(f_zh(Q), f_en(P)) < D(f_zh(Q), f_en(N))
The problem thus turns into solving for f_zh and f_en; the two mappings are trained by a deep learning method. The input of the deep learning network is a 300-dimensional vector composed of (Q, P, N). Q is passed through three fully-connected hidden layers and a normalization layer in the middle of the neural network to obtain the vector f_zh(Q); because P and N are both English documents, they are passed through the same three fully-connected hidden layers and normalization layer to obtain the vectors f_en(P) and f_en(N). Finally, f_zh(Q), f_en(P), and f_en(N) are defined in the same vector space V_d, i.e., f_en(.) ∈ V_d and f_zh(.) ∈ V_d; d is taken equal to 8. The loss function is defined as follows:
loss(Q, P, N) = max{0, g + D(f_zh(Q), f_en(P)) - D(f_zh(Q), f_en(N))} + λ||ω||²
where g is a gap parameter used to adjust the distance between two documents, λ is the regularization coefficient, and ω represents all parameters of the hidden layers of the neural network. By training the neural network, f_zh and f_en are obtained; these two functions map the document vectors of the Chinese and English wiki pages into the same space. For a pair of Chinese and English wiki pages, two new vectors are obtained through the neural network, and their concatenation serves as the new feature F10.
Further, the step 4) specifically comprises the following steps:
After feature extraction, 10 features are obtained in total, and concatenating them yields a long vector. The long vector is used as the input of a learning-to-rank algorithm; a ranking model is trained and used to rank Candidate1000(a) for a Chinese page a, so that the most similar English page b ∈ Candidate1000(a) is taken as the cross-language link of a.
Compared with the prior art, the method has the following beneficial effects:
1. The method uses page out-links and existing cross-language links to limit the size of the English cross-language candidate set, which greatly reduces the scale of the problem and makes the subsequent algorithm simpler to implement.
2. The method extracts 3 types of features (handcrafted features, topic features, and document vector features) from the structural and textual information of a page; together these features reflect the information contained in a page from both a semantic and a structural perspective.
3. The method casts cross-language link discovery as a ranking problem, i.e., retrieving English documents with the Chinese document as the query, and applies learning-to-rank to find cross-language links, thereby obtaining better results.
Drawings
FIG. 1 is an overall framework of the present invention;
FIG. 2 is a diagram of the neural network structure in step 3).
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
As shown in FIG. 1 and FIG. 2, the Wikipedia-based Chinese-English cross-language entity matching method provided by the invention comprises the following steps:
1) preprocessing the Chinese and English wiki pages: 1.1) extracting the title, abstract, table of contents, full text, out-links, in-links, and category information of each page; 1.2) translating the titles of Chinese wiki pages into English; 1.3) segmenting the abstract, the table of contents, and the full text; 1.4) extracting the existing ordinary-page cross-language links and category-page cross-language links in the Chinese and English Wikipedia.
2) Extracting the English cross-language page candidate set of each Chinese wiki page using the page out-links extracted in step 1) and the existing ordinary-page cross-language links.
3) For each Chinese wiki page from step 2), computing the features between the Chinese wiki page and each page in its English cross-language page candidate set, including handcrafted features, topic features, and document vector features.
4) Taking the data obtained in step 3) as the data set, constructing a ranking model, ranking the English cross-language page candidate set of the current Chinese page by similarity, and taking the page with the highest similarity in the candidate set as the cross-language link of the current Chinese page.
The step 1) comprises the following steps:
1.1) extracting the Wikipedia page information: each Wikipedia page can be represented by a 7-tuple a = {T, A, B, C, G, I, O}, where T, A, and B are the title, abstract, and full text, C and G are two sets representing the table of contents and category information of the current page, and I and O are the in-links and out-links of the page. If a_i contains a link pointing to a_j, then a_i is an in-link of a_j, and a_j is an out-link of a_i. Thus, the Chinese Wikipedia can be expressed as
ε_cn = {a_1, a_2, …, a_n}
English Wikipedia can be expressed as
ε_en = {b_1, b_2, …, b_m}
n and m are the total number of Chinese and English pages respectively.
1.2) Chinese wiki page title translation: according to the Chinese page title information extracted in step 1.1), all Chinese titles are translated into English using the Baidu translation API, and the translated result is denoted a.translate.
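As a sketch of step 1.2), the titles could be translated with a short script. The endpoint and MD5 signing scheme below follow the publicly documented Baidu Fanyi general-translation API; appid and key are placeholder credentials, and the exact API details should be checked against the current documentation.

    import hashlib
    import random
    import requests

    def baidu_translate(query, appid, key, src="zh", dst="en"):
        """Translate one title via the Baidu Fanyi general-translation API (sketch)."""
        salt = str(random.randint(32768, 65536))
        # sign = MD5(appid + q + salt + secret key), per the public API documentation
        sign = hashlib.md5((appid + query + salt + key).encode("utf-8")).hexdigest()
        resp = requests.get(
            "https://fanyi-api.baidu.com/api/trans/vip/translate",
            params={"q": query, "from": src, "to": dst,
                    "appid": appid, "salt": salt, "sign": sign},
            timeout=10,
        )
        return resp.json()["trans_result"][0]["dst"]

    # a.translate = baidu_translate(a.T, appid="<appid>", key="<secret>")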
1.3) page information word segmentation: according to the page information extracted in step 1.1), for a Chinese page, the abstract, table of contents, and full text are segmented with a Chinese word segmentation tool; for an English page, stop words are removed with the Stanford tokenizer and each field is segmented into words.
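A possible implementation of step 1.3) is sketched below. The Chinese segmenter's name is garbled in the translated text, so jieba stands in for it, and NLTK stands in for the Stanford tokenizer; both substitutions are assumptions.

    import jieba                       # stand-in for the (garbled) Chinese segmenter
    from nltk.corpus import stopwords  # NLTK stands in for the Stanford tokenizer
    from nltk.tokenize import word_tokenize

    EN_STOP = set(stopwords.words("english"))

    def seg_zh(text):
        """Segment Chinese text into a list of words."""
        return [w for w in jieba.cut(text) if w.strip()]

    def seg_en(text):
        """Tokenize English text and drop stop words."""
        return [w.lower() for w in word_tokenize(text)
                if w.isalpha() and w.lower() not in EN_STOP]

    # both functions are applied to the abstract, table of contents, and full text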
1.4) extracting existing cross-language links: a number of cross-language links already exist in Wikipedia; ordinary-page cross-language links and category-page cross-language links can be extracted separately by distinguishing the namespace fields in wikidump. If a_i ∈ ε_cn, b_j ∈ ε_en, and a_i and b_j form a cross-language link, then a_i = b_j. All existing cross-language links constitute the set:
Eq = {{a_i, b_j} | a_i ∈ ε_cn, b_j ∈ ε_en, a_i = b_j}
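Extraction of Eq can be sketched as follows, assuming the zhwiki langlinks dump has already been parsed into (page_id, lang, target_title) rows in the standard MediaWiki langlinks layout; the function returns the zh-to-en title mapping used in later steps.

    def build_existing_links(langlink_rows, id2title):
        """Collect existing zh -> en cross-language links (the set Eq)."""
        zh2en = {}
        for page_id, lang, target_title in langlink_rows:
            if lang == "en" and page_id in id2title:
                zh2en[id2title[page_id]] = target_title   # a_i = b_j
        return zh2en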
the step 2) comprises the following steps:
2.1) Define the out-link equivalence relation:
First, the following structure is defined as an equivalent out-link, where a, a_i ∈ ε_cn and b, b_i ∈ ε_en; that is, page a and page b have equivalent out-links, and such a and b stand in an out-link equivalence relation:
a → a_i, b → b_i, a_i = b_i
2.2) extracting the English cross-language page candidate set of a Chinese page: by the definition in 2.1), all English pages b that stand in an out-link equivalence relation with Chinese page a constitute Candidate(a). Empirically, the more equivalent out-links exist between a and b, the higher the probability that they form a cross-language link. Since Candidate(a) contains a large number of pages, it must be screened further: the English pages in Candidate(a) are sorted in descending order by the number of equivalent out-links, and the top 1000 are taken as the English cross-language page candidate set of Chinese page a, denoted Candidate1000(a), giving the final candidate set.
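Step 2.2) then amounts to counting, for every English page b, how many of a's out-links map through Eq onto out-links of b. A minimal sketch, where zh2en is the mapping extracted above and en_inlinks[t] is assumed to give the set of English pages whose out-links include title t:

    from collections import Counter

    def candidate1000(a, zh2en, en_inlinks, k=1000):
        """Top-k English candidates of Chinese page a by equivalent out-link count."""
        counts = Counter()
        for a_i in a.O:                        # out-link a -> a_i
            b_i = zh2en.get(a_i)               # existing cross-language link a_i = b_i
            if b_i is None:
                continue
            for b in en_inlinks.get(b_i, ()):  # English pages b with out-link b -> b_i
                counts[b] += 1                 # one more equivalent out-link for b
        return [b for b, _ in counts.most_common(k)]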
The step 3) comprises the following steps:
3.1) handcrafted feature extraction: for a Chinese page a, the features between it and each English cross-language candidate page b ∈ Candidate1000(a) are extracted as follows:
1) title similarity
F1=Dist(a.translate,b.T)
Dist calculates the edit distance between a.translate and b.T.
2) Number of equivalent out-links
F2=|{(a′,b′)|a′=b′,a′∈a.O,b′∈b.O}|
This feature represents the number of equivalent out-links between pages a and b.
3) Out-link Jaccard coefficient
F3 = F2 / (|a.O| + |b.O| - F2)
This feature represents the Jaccard coefficient of the equivalent out-links between pages a and b.
4) Normalized equivalent out-links
For a Chinese page a, its normalization factor M can be expressed by the following equation:
M = max |{(a′, b′) | a′ = b′, a′ ∈ a.O, b′ ∈ c.O, c ∈ Candidate1000(a)}|
Feature 4 is the result of feature 2 normalized by M, as follows:
F4 = F2 / M
5) Number of equivalent in-links
F5=|{(a′,b′)|a′=b′,a′∈a.I,b′∈b.I}|
This feature represents the number of equivalent in-links between pages a and b.
6) In-link Jaccard coefficient
F6 = F5 / (|a.I| + |b.I| - F5)
This feature represents the Jaccard coefficient of the equivalent in-links between pages a and b.
7) Number of equivalent categories
F7=|{(a′,b′)|a′=b′,a′∈a.G,b′∈b.G}|
This feature represents the number of equivalent categories between pages a and b.
8) Category Jaccard coefficient
F8 = F7 / (|a.G| + |b.G| - F7)
This feature represents the Jaccard coefficient of the equivalent categories between pages a and b.
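Features F1-F8 follow directly from the page tuples once a's link and category sets are projected into English titles via zh2en; the sketch below uses nltk's edit_distance as a stand-in for Dist, and the projection step is an implementation assumption.

    from nltk import edit_distance

    def jaccard(x, y):
        union = x | y
        return len(x & y) / len(union) if union else 0.0

    def handcrafted_features(a, b, zh2en, M):
        """Compute F1-F8 for Chinese page a and English candidate page b."""
        # project a's link and category sets into English titles via existing links
        aO = {zh2en[t] for t in a.O if t in zh2en}
        aI = {zh2en[t] for t in a.I if t in zh2en}
        aG = {zh2en[t] for t in a.G if t in zh2en}
        F1 = edit_distance(a.translate, b.T)    # title similarity (edit distance)
        F2 = len(aO & b.O)                      # number of equivalent out-links
        F3 = jaccard(aO, b.O)                   # out-link Jaccard coefficient
        F4 = F2 / M if M else 0.0               # normalized equivalent out-links
        F5 = len(aI & b.I)                      # number of equivalent in-links
        F6 = jaccard(aI, b.I)                   # in-link Jaccard coefficient
        F7 = len(aG & b.G)                      # number of equivalent categories
        F8 = jaccard(aG, b.G)                   # category Jaccard coefficient
        return [F1, F2, F3, F4, F5, F6, F7, F8]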
3.2) topic features:
The latent Dirichlet allocation model is usually used to compute the topic distribution of a page within a page set; for cross-language topic distributions, however, the topic distribution vectors of different languages are defined in different spaces and are not directly comparable.
By observation, we find that the words in a page's abstract and table of contents express the page topic well: the abstract outlines the content the page intends to convey, and the table of contents defines the page's descriptive framework. Therefore, in this section we re-represent a page a by the segmented abstract and table of contents, denoted a′; the segmentation was already done during data preprocessing.
a′ = seg(a.A) ∪ seg(a.C)
where seg(a.A) denotes the result of segmenting the abstract of page a, and seg(a.C) the result of segmenting its table of contents.
for Chinese and English pages with equivalent relation, combining the results after word segmentation to form a new document. Through the new documents, a cross-language latent Dirichlet distribution model is trained, and the theme distributions of different language pages are mapped into the same space. For a new page, we use this trained model to predict its topic distribution. And (4) performing theme distribution splicing on the Chinese and English pages to obtain a new characteristic F9.
3.3) document vector features:
doc2vec models the context of words in the document vector, expressing the semantic relationships between documents objectively. Using the Chinese and English wiki texts as corpora, 100-dimensional document vectors are constructed for each corpus. However, since the document vectors of the two languages are constructed in different spaces, their relationship cannot be obtained directly. We therefore construct a model that maps the document vectors of the two spaces into the same space, so that the similarity between two document vectors can be computed indirectly.
First, we denote a Chinese document vector Q (query), an equivalent English document vector P (positive), and a non-equivalent English document vector N (negative). As mentioned above, the similarities between Q and P and between Q and N cannot be computed directly, so we define a mapping function f_zh for the Chinese space and a mapping function f_en for the English space. With these two mapping functions, vectors from different spaces are mapped into the same space, and the similarity between two documents can be measured by the Euclidean distance between them. Taking Q and P as an example:
D(f_zh(Q), f_en(P)) = ||f_zh(Q) - f_en(P)||
where the function D represents the Euclidean distance between the two vectors. The smaller the value of D, the more similar the two vectors, and the higher the probability that the two documents describe the same entity. For a triple (Q, P, N), clearly:
D(f_zh(Q), f_en(P)) < D(f_zh(Q), f_en(N))
The problem thus turns into solving for f_zh and f_en.
We adopt a deep learning method to train these two mappings; the network structure is shown in FIG. 2. The input of the network is a 300-dimensional vector composed of (Q, P, N). Q passes through the hidden layers in the middle of the neural network, and the output is f_zh(Q); since P and N are both English documents, they share the network structure on the right side of FIG. 2, and the outputs are f_en(P) and f_en(N). The hidden-layer structures of the left and right parts of FIG. 2 are identical: three fully-connected layers followed by a normalization layer. The final outputs are defined in the same vector space V_d, i.e., f_en(.) ∈ V_d and f_zh(.) ∈ V_d; in this embodiment we take d = 8. The loss function is defined as follows:
loss(Q, P, N) = max{0, g + D(f_zh(Q), f_en(P)) - D(f_zh(Q), f_en(N))} + λ||ω||²
where g is a gap parameter used to adjust the distance between two documents, λ is the regularization coefficient, and ω represents all parameters of the hidden layers in FIG. 2. By training the neural network, f_zh and f_en are obtained; these two functions map the document vectors of the Chinese and English wiki pages into the same space. For a pair of Chinese and English wiki pages, two new vectors are obtained through the neural network, and their concatenation serves as the new feature F10.
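A PyTorch sketch of the mapping network of FIG. 2: two towers of three fully-connected layers each, followed by a normalization layer, trained with the margin loss above. The hidden width and gap g are assumptions, and the λ||ω||² term is delegated to the optimizer's weight decay.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Mapper(nn.Module):
        """Three fully-connected layers plus a normalization layer, mapping a
        100-d doc2vec vector into the shared 8-d space V_d."""
        def __init__(self, in_dim=100, hidden=64, out_dim=8):  # hidden width assumed
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim),
            )

        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)           # normalization layer

    f_zh, f_en = Mapper(), Mapper()   # f_en is shared by P and N (right side of FIG. 2)

    def triplet_loss(Q, P, N, g=0.5):                         # gap g assumed
        d_pos = torch.norm(f_zh(Q) - f_en(P), dim=-1)
        d_neg = torch.norm(f_zh(Q) - f_en(N), dim=-1)
        return torch.clamp(g + d_pos - d_neg, min=0).mean()

    # the lambda * ||w||^2 term is realized as weight decay in the optimizer
    optimizer = torch.optim.Adam(
        list(f_zh.parameters()) + list(f_en.parameters()),
        lr=1e-3, weight_decay=1e-4)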
Further, the step 4) comprises:
After feature extraction, 10 features are obtained in total, and concatenating them yields a long vector. The long vector is used as the input of a learning-to-rank algorithm; a ranking model is trained and used to rank Candidate1000(a) for a Chinese page a, so that the most similar English page b ∈ Candidate1000(a) is taken as the cross-language link of a.
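The patent does not name a specific learning-to-rank algorithm; as one possibility, LightGBM's LambdaRank objective fits this setting, grouping the feature rows of each Chinese page's candidates:

    import numpy as np
    import lightgbm as lgb

    def train_ranker(X_train, y_train, group_train):
        """X_train: one feature row per (Chinese page, candidate) pair; y_train: 1 for
        the true cross-language link, else 0; group_train[i]: candidate count of page i."""
        ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=200)
        ranker.fit(X_train, y_train, group=group_train)
        return ranker

    def best_candidate(ranker, X_a, candidates_a):
        """Return the English page predicted as a's cross-language link."""
        scores = ranker.predict(X_a)          # one row per page in Candidate1000(a)
        return candidates_a[int(np.argmax(scores))]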
Examples
The following provides an example to illustrate the implementation steps of the present invention:
(1) The example uses data sets from the Chinese Wikipedia and the English Wikipedia, where the Chinese Wikipedia contains 1020863 pages and the English Wikipedia contains 6144107 pages. The information structure of each page is analyzed; the title, abstract, table of contents, categories, in-links, out-links, full text, and other information are extracted and stored in a Lucene index, where fields other than the title may be empty.
(2) 3000 pages with existing cross-language links are randomly sampled from the Chinese Wikipedia of step (1), and the English cross-language page candidate sets of these 3000 Chinese pages are extracted using out-links and existing cross-language links.
(3) Training data are constructed from the existing cross-language links to train the parameters of the latent Dirichlet allocation model; doc2vec is used to train 100-dimensional document vectors for the Chinese and English Wikipedia pages respectively.
(4) For the English cross-language page candidate sets extracted in step (2), the features between each Chinese page and every page of its English candidate set are computed, including handcrafted features, topic features, and document vector features.
(5) Part of the data obtained in step (4) is used as the training set to build a ranking model; the English cross-language page candidate set of the current Chinese page is ranked by similarity, and the candidate page with the highest similarity is taken as the cross-language link of the current Chinese page.
The results of running this example are as follows. For Chinese-English cross-language entity matching, the proposed method (SR-Topic-Doc2vec) is compared with 5 other methods on three evaluation metrics (Precision, Recall, and F1): exact title matching (TM), title similarity matching (TS), Jaccard feature mean matching (SA1), Jaccard feature and title similarity mean matching (SA2), and learning-to-rank over handcrafted features (SR). The proposed method performs best:
method of producing a composite material Precision Recall F1-score
TM 97.65 26.03 41.10
TS 56.03 56.03 56.03
SA1 34.73 34.73 34.73
SA2 65.60 65.60 65.60
SR 68.75 68.75 68.75
SR-Topic-doc2vec 70.28 70.28 70.28
As the comparison shows, the method outperforms the other methods on all three metrics on a data set close to the actual situation. The method therefore has good practical value and application prospects for Chinese-English cross-language entity matching.

Claims (4)

1. A Chinese-English cross-language entity matching method based on Wikipedia is characterized by comprising the following steps:
1) preprocessing the Chinese and English wiki pages:
1.1) extracting the title, abstract, table of contents, full text, out-links, in-links, and category information of each page;
1.2) translating the title of the Chinese wiki page into English;
1.3) segmenting the abstract, the table of contents, and the full text;
1.4) extracting the existing ordinary-page cross-language links and category-page cross-language links in the Chinese and English Wikipedia;
2) extracting the English cross-language page candidate set of each Chinese wiki page using the page out-links extracted in step 1) and the existing ordinary-page cross-language links;
3) for each Chinese wiki page from step 2), calculating the features between the Chinese wiki page and each page in its English cross-language page candidate set, the features comprising handcrafted features, topic features, and document vector features; step 3) specifically comprises:
3.1) handcrafted feature extraction: for a Chinese page a, extracting the features between it and each English cross-language candidate page b ∈ Candidate1000(a), defined as follows:
① title similarity
F1=Dist(a.translate,b.T)
Dist calculates the edit distance between a.translate and b.T;
② number of equivalent out-links
F2=|{(a′,b′)|a′=b′,a′∈a.O,b′∈b.O}|
This feature represents the number of equivalent out links between pages a and b;
③ out-link Jaccard coefficient
F3 = F2 / (|a.O| + |b.O| - F2)
This feature represents the Jaccard coefficient of the equivalent out-links between pages a and b;
④ normalized equivalent out-links
For a Chinese page a, its normalization factor M can be expressed by the following equation:
M = max |{(a′, b′) | a′ = b′, a′ ∈ a.O, b′ ∈ c.O, c ∈ Candidate1000(a)}|
feature 4 is the result of feature 2 normalized by M, as follows:
F4 = F2 / M
⑤ number of equivalent in-links
F5=|{(a′,b′)|a′=b′,a′∈a.I,b′∈b.I}|
This feature represents the number of equivalent in-links between pages a and b;
⑥ in-link Jaccard coefficient
F6 = F5 / (|a.I| + |b.I| - F5)
This feature represents the Jaccard coefficient of the equivalent in-links between pages a and b;
⑦ number of equivalent categories
F7=|{(a′,b′)|a′=b′,a′∈a.G,b′∈b.G}|
This feature represents the number of equivalent categories between pages a and b;
⑧ category Jaccard coefficient
F8 = F7 / (|a.G| + |b.G| - F7)
This feature represents the Jaccard coefficient of the equivalent categories between pages a and b;
3.2) topic features:
re-representing a page a by the result a′ of segmenting its abstract and table of contents:
a′ = seg(a.A) ∪ seg(a.C)
where seg(a.A) denotes the result of segmenting the abstract of page a, and seg(a.C) the result of segmenting its table of contents;
for Chinese and English pages connected by an existing cross-language link, merging the segmented results to form a new document; training a cross-language latent Dirichlet allocation model on the new documents, and mapping the topic distributions of pages in different languages into the same space; for a new page, using the trained model to predict its topic distribution; concatenating the topic distributions of the Chinese and English pages to obtain the new feature F9;
3.3) document vector features:
doc2vec models the context of words in the document vector; using the Chinese and English wiki texts as corpora, 100-dimensional document vectors are constructed for each corpus; a model is built to map the document vectors of the two spaces into the same space, so that the similarity between two document vectors can be computed indirectly;
first, a Chinese document vector is denoted Q (query), an equivalent English document vector P (positive), and a non-equivalent English document vector N (negative); a mapping function f_zh is defined for the Chinese space and a mapping function f_en for the English space; vectors in different spaces are mapped into the same space through the two mapping functions; the similarity between two documents can be measured by the Euclidean distance between them; the similarity between Q and P is expressed as:
D(f_zh(Q), f_en(P)) = ||f_zh(Q) - f_en(P)||
wherein the function D represents the Euclidean distance between the two vectors; the smaller the value of D, the more similar the two vectors, and the higher the probability that the two documents describe the same entity; for a triple (Q, P, N), clearly:
D(f_zh(Q), f_en(P)) < D(f_zh(Q), f_en(N))
the problem thus turns into solving for f_zh and f_en; the two mappings are trained by a deep learning method; the input of the deep learning network is a 300-dimensional vector composed of (Q, P, N); Q is passed through three fully-connected hidden layers and a normalization layer in the middle of the neural network to obtain the vector f_zh(Q); P and N are both English documents and are passed through the same three fully-connected hidden layers and normalization layer to obtain the vectors f_en(P) and f_en(N); finally, f_zh(Q), f_en(P), and f_en(N) are defined in the same vector space V_d, i.e., f_en(.) ∈ V_d and f_zh(.) ∈ V_d; d is taken equal to 8; the loss function is defined as follows:
loss(Q, P, N) = max{0, g + D(f_zh(Q), f_en(P)) - D(f_zh(Q), f_en(N))} + λ||ω||²
wherein g is a gap parameter used to adjust the distance between two documents, λ is the regularization coefficient, and ω represents all parameters of the hidden layers of the neural network; by training the neural network, f_zh and f_en are obtained, and the two functions map the document vectors of the Chinese and English wiki pages into the same space; for a pair of Chinese and English wiki pages, two new vectors are obtained through the neural network, and the result of concatenating them is used as the new feature F10;
4) taking the data obtained in step 3) as the data set, constructing a ranking model, ranking the English cross-language page candidate set of the current Chinese page by similarity, and taking the page with the highest similarity in the candidate set as the cross-language link of the current Chinese page.
2. The Wikipedia-based Chinese-English cross-language entity matching method according to claim 1, wherein step 1) comprises:
1.1) extracting the Wikipedia page information: each Wikipedia page can be represented by a 7-tuple a = {T, A, B, C, G, I, O}, where T, A, and B are the title, abstract, and full text, C and G are two sets representing the table of contents and category information of the current page, and I and O are the in-links and out-links of the page; if a_i contains a link pointing to a_j, then a_i is an in-link of a_j, and a_j is an out-link of a_i; thus, the Chinese Wikipedia can be expressed as
ε_cn = {a_1, a_2, …, a_n}
English Wikipedia can be expressed as
ε_en = {b_1, b_2, …, b_m}
n and m are the total number of Chinese and English pages respectively;
1.2) Chinese wiki page title translation: according to the Chinese page title information extracted in the step 1.1), all Chinese titles are translated into English by using Baidu translation api, and the translated result is expressed by a.translate;
1.3) page information word segmentation: according to the page information extracted in step 1.1), for a Chinese page, segmenting the abstract, table of contents, and full text with a Chinese word segmentation tool; for an English page, removing stop words with the Stanford tokenizer and segmenting the abstract, table of contents, and full text into words;
1.4) extracting existing cross-language links: a number of cross-language links already exist in Wikipedia; ordinary-page cross-language links and category-page cross-language links can be extracted separately by distinguishing the namespace fields in wikidump; if a_i ∈ ε_cn, b_j ∈ ε_en, and a_i and b_j form a cross-language link, then a_i = b_j; all existing cross-language links constitute the set:
Eq = {{a_i, b_j} | a_i ∈ ε_cn, b_j ∈ ε_en, a_i = b_j}.
3. The Wikipedia-based Chinese-English cross-language entity matching method according to claim 2, wherein step 2) comprises:
2.1) defining the out-link equivalence relation:
the following structure is defined as an equivalent out-link, where a, a_i ∈ ε_cn and b, b_i ∈ ε_en; that is, page a and page b have equivalent out-links, and such a and b stand in an out-link equivalence relation:
a → a_i, b → b_i, a_i = b_i
2.2) extracting the English cross-language page candidate set of a Chinese page: by the definition in 2.1), all English pages b that stand in an out-link equivalence relation with Chinese page a constitute Candidate(a); the more equivalent out-links exist between a and b, the higher the probability that they form a cross-language link; the English pages in Candidate(a) are sorted in descending order by the number of equivalent out-links, and the top 1000 are taken as the English cross-language page candidate set of Chinese page a, denoted Candidate1000(a), giving the final candidate set.
4. The Wikipedia-based Chinese-English cross-language entity matching method according to claim 1, wherein step 4) specifically comprises:
after feature extraction, 10 features are obtained in total, and concatenating the 10 features yields a long vector; the long vector is used as the input of a learning-to-rank algorithm, a ranking model is trained, and the ranking model is used to rank Candidate1000(a) for a Chinese page a, so that the most similar English page b ∈ Candidate1000(a) is taken as the cross-language link of a.
CN201611005395.XA 2016-11-11 2016-11-11 Chinese-English cross-language entity matching method based on Wikipedia Active CN106570191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611005395.XA CN106570191B (en) 2016-11-11 2016-11-11 Chinese-English cross-language entity matching method based on Wikipedia

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611005395.XA CN106570191B (en) 2016-11-11 2016-11-11 Chinese-English cross-language entity matching method based on Wikipedia

Publications (2)

Publication Number Publication Date
CN106570191A CN106570191A (en) 2017-04-19
CN106570191B true CN106570191B (en) 2020-05-26

Family

ID=58542310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611005395.XA Active CN106570191B (en) 2016-11-11 2016-11-11 Chinese-English cross-language entity matching method based on Wikipedia

Country Status (1)

Country Link
CN (1) CN106570191B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220317B (en) * 2017-05-17 2020-12-18 北京百度网讯科技有限公司 Matching degree evaluation method, device, equipment and storage medium based on artificial intelligence
CN110019810B (en) * 2018-01-02 2021-05-04 中国移动通信有限公司研究院 Ontology matching method and device and computer storage medium
CN108197290B (en) * 2018-01-19 2021-08-03 桂林电子科技大学 Knowledge graph representation learning method integrating entity description and relationship description
CN108304552B (en) * 2018-02-01 2021-01-08 浙江大学 Named entity linking method based on knowledge base feature extraction
CN108563766A (en) * 2018-04-19 2018-09-21 天津科技大学 The method and device of food retrieval
CN109255121A (en) * 2018-07-27 2019-01-22 中山大学 A kind of across language biomedicine class academic paper information recommendation method based on theme class
CN109582756B (en) * 2018-10-30 2022-10-14 长春理工大学 Cloud autonomous logic filing method for unstructured source data
CN109710923B (en) * 2018-12-06 2020-09-01 浙江大学 Cross-language entity matching method based on cross-media information
CN111814465A (en) * 2020-06-17 2020-10-23 平安科技(深圳)有限公司 Information extraction method and device based on machine learning, computer equipment and medium
CN112632282B (en) * 2020-12-30 2021-11-19 中科院计算技术研究所大数据研究院 Chinese and English thesis data classification and query method
CN113032558B (en) * 2021-03-11 2023-08-29 昆明理工大学 Variable semi-supervised hundred degree encyclopedia classification method integrating wiki knowledge
CN113076398B (en) * 2021-03-30 2022-07-29 昆明理工大学 Cross-language information retrieval method based on bilingual dictionary mapping guidance

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336852A (en) * 2013-07-24 2013-10-02 清华大学 Cross-language ontology construction method and device
CN105260483A (en) * 2015-11-16 2016-01-20 金陵科技学院 Microblog-text-oriented cross-language topic detection device and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336852A (en) * 2013-07-24 2013-10-02 清华大学 Cross-language ontology construction method and device
CN105260483A (en) * 2015-11-16 2016-01-20 金陵科技学院 Microblog-text-oriented cross-language topic detection device and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Cross-lingual knowledge linking across wiki knowledge bases; Zhichun Wang et al.; Proceedings of the ACM WWW Conference; 2012-04-20; pp. 459-468 *
Cross-language entity linking and semantic query based on knowledge graphs (in Chinese); Su Yonghao (苏永浩); China Master's Theses Full-text Database, Information Science and Technology; 2016-10-15; vol. 2016, no. 10; pp. 3, 5 and section 3.5 *
Named entity recognition and translation in cross-language information retrieval (in Chinese); Wu Dan (吴丹) et al.; Documentation, Information & Knowledge; 2012-06-30; vol. 2012, no. 3; pp. 13-19 *

Also Published As

Publication number Publication date
CN106570191A (en) 2017-04-19

Similar Documents

Publication Publication Date Title
CN106570191B (en) Chinese-English cross-language entity matching method based on Wikipedia
CN110717047B (en) Web service classification method based on graph convolution neural network
CN108573411B (en) Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments
CN112001185B (en) Emotion classification method combining Chinese syntax and graph convolution neural network
RU2662688C1 (en) Extraction of information from sanitary blocks of documents using micromodels on basis of ontology
CN111680173A (en) CMR model for uniformly retrieving cross-media information
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
Zhang et al. A data processing method based on sequence labeling and syntactic analysis for extracting new sentiment words from product reviews
Vuong et al. SM-BERT-CR: a deep learning approach for case law retrieval with supporting model
Anoop et al. A topic modeling guided approach for semantic knowledge discovery in e-commerce
CN111563374B (en) Personnel social relationship extraction method based on judicial official documents
Emami et al. A pattern-matching method for extracting personal information in farsi content
CN108427769B (en) Character interest tag extraction method based on social network
Xia et al. Content-irrelevant tag cleansing via bi-layer clustering and peer cooperation
El-Halees et al. Ontology based Arabic opinion mining
CN116244497A (en) Cross-domain paper recommendation method based on heterogeneous data embedding
Fan et al. Mining collective knowledge: inferring functional labels from online review for business
Liu et al. Keyword extraction using PageRank on synonym networks
CN114238617A (en) Industry hotspot recommendation method and system
CN107861924A (en) A kind of eBook content method for expressing based on Partial Reconstruction model
Xia et al. Research on feature-based opinion mining using topic maps
Tran et al. A named entity recognition approach for tweet streams using active learning
Meng et al. An attention network based on feature sequences for cross-domain sentiment classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant