CN115470393A - Event pre-training method for Chinese-Vietnamese cross-language event retrieval - Google Patents

Event pre-training method for Chinese-Vietnamese cross-language event retrieval

Info

Publication number
CN115470393A
CN115470393A (application number CN202211029783.7A)
Authority
CN
China
Prior art keywords
language
event
training
cross
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211029783.7A
Other languages
Chinese (zh)
Inventor
余正涛 (Yu Zhengtao)
吴少扬 (Wu Shaoyang)
朱恩昌 (Zhu Enchang)
线岩团 (Xian Yantuan)
黄于欣 (Huang Yuxin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202211029783.7A priority Critical patent/CN115470393A/en
Publication of CN115470393A publication Critical patent/CN115470393A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an event pre-training method for Chinese-Vietnamese cross-language event retrieval, and belongs to the technical field of natural language processing. The method applies two additional pre-training tasks to mBERT. Event element mask pre-training first injects event knowledge into the model and improves its representation of events in the extremely low-resource Vietnamese setting; cross-language contrastive learning then pulls sentences with similar meanings in different languages closer together in the representation space. The resulting Chinese-Vietnamese cross-language event pre-training model is fine-tuned so that downstream tasks achieve better performance. The effectiveness of the method is demonstrated on a self-built Chinese-Vietnamese bilingual news event retrieval data set.

Description

Event pre-training method for Chinese-Vietnamese cross-language event retrieval
Technical Field
The invention relates to an event pre-training method for Chinese-Vietnamese cross-language event retrieval, and belongs to the technical field of natural language processing.
Background
Chinese-Vietnamese cross-language event retrieval refers to the task of taking a Chinese event query phrase as input and retrieving a set of relevant Vietnamese news texts. Accurately retrieving news articles about a particular event facilitates follow-up tasks such as public-opinion event monitoring, news recommendation, and event tracking.
In recent years, cross-language retrieval has attracted extensive research and made substantial progress. Existing approaches fall mainly into three categories: methods based on machine translation, methods based on cross-language word embeddings, and methods based on multilingual pre-trained language models. Machine-translation-based methods map queries and documents into the same semantic space with neural machine translation and then perform monolingual retrieval; the translation can be applied to the query, to the documents, or through an intermediate language. These methods depend heavily on the accuracy of neural machine translation and are prone to word-mismatch and translation-ambiguity problems; for distant low-resource pairs such as Chinese-Vietnamese, translation errors directly degrade the retrieval results. To address these problems, researchers proposed cross-language information retrieval based on pre-trained cross-language word vectors, whose core idea is to map the text semantics of different languages into a single semantic space and then train a neural ranking model. However, because word order and context are ignored, cross-language word vectors yield inaccurate semantic representations of the query or the documents to be retrieved, and the mapping between the semantic spaces of different languages easily propagates errors, which hurts the performance of the retrieval model.
With the introduction of multilingual pre-trained language models such as mBERT and XLM-R, strong results have been achieved on cross-language understanding tasks, showing that such models implicitly encode multilingually aligned semantic knowledge. However, the training tasks of existing multilingual pre-trained models focus mainly on word-level and sentence-level alignment, so applying them directly to cross-language retrieval still aligns a source-language query with a target-language long text poorly. To address this, Yu et al. proposed a pre-training method better suited to cross-language retrieval and obtained good results on four languages. Even so, their method falls short for cross-language event retrieval, because events usually carry more complex semantic information, and event representations derived only from word-level semantics make texts that describe different events look semantically similar. For example, "A visits Vietnam" and "B visits Vietnam" are two different visit events, yet the two unrelated events still receive similar vector representations because the news texts describing them share many overlapping words.
Therefore, in order to integrate Chinese-Vietnamese bilingually aligned event knowledge into the multilingual pre-trained language model, the invention proposes two pre-training tasks: event element mask pre-training and cross-language event contrastive learning. The aim is to remedy the lack of event knowledge in cross-language pre-trained models and their poor alignment in a low-resource setting.
Disclosure of Invention
The invention provides an event pre-training method for Chinese-Vietnamese cross-language event retrieval, addressing two problems of the low-resource setting: Chinese-Vietnamese cross-language event retrieval lacks large-scale labeled data, and existing cross-language pre-trained models cannot adequately represent the rich Chinese-Vietnamese aligned event knowledge in text.
The technical scheme of the invention is as follows: the event pre-training method for Chinese-Vietnamese cross-language event retrieval comprises the following specific steps:
Step1, construction of the experimental data sets: crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages with a crawler, and build the data sets required by the experiments through manual annotation, namely an event element mask pre-training data set, a cross-language contrastive learning data set, and a Chinese-Vietnamese cross-language event retrieval data set;
Step2, construction of the Chinese-Vietnamese cross-language event pre-training model: train the model with event element mask pre-training and cross-language contrastive learning, improving the Chinese-Vietnamese bilingual alignment representation of the multilingual pre-trained model and fusing event knowledge into it;
Step3, construction of the cross-language event retrieval model: fine-tune the Chinese-Vietnamese cross-language event pre-training model obtained in Step2 to produce the Chinese-Vietnamese cross-language event retrieval results.
As a preferable scheme of the invention, Step1 comprises the following specific steps (an illustrative sketch of the similarity filter follows these steps):
Step1.1, crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages, concatenate each news sample with its date, and add the hyperlinks in the samples to the event element set as event elements; translate each news item into the corresponding target language with Google online translation, obtain cross-language word embeddings from the aligned word vectors provided by fastText, and finally add to the event element mask pre-training data set those pairs whose cosine similarity exceeds 0.4;
Step1.2, find the page corresponding to each event element of the event element set in Wikidata and judge whether the same event element exists in the target language; if so, use the source-language description of the event element as a query and the first paragraph of the linked target-language page as a positive example for that query, forming the cross-language contrastive learning event element data set, while also selecting part of the aligned data to annotate relevance for fine-tuning.
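For illustration only, the cosine-similarity filter of Step1.1 can be sketched with aligned fastText word vectors as follows; this is a minimal sketch rather than the claimed implementation, and the vector file names, the vocabulary limit, and the whole-sentence token averaging are assumptions.

import io
import numpy as np

def load_vectors(path, limit=200000):
    # Load aligned fastText vectors (e.g. the "wiki.zh.align.vec" release).
    vecs = {}
    with io.open(path, "r", encoding="utf-8", errors="ignore") as f:
        next(f)  # skip the "<count> <dim>" header line
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

def sentence_vector(tokens, vecs):
    known = [vecs[t] for t in tokens if t in vecs]
    return np.mean(known, axis=0) if known else None

def keep_pair(zh_tokens, vi_tokens, zh_vecs, vi_vecs, threshold=0.4):
    # Keep a Chinese sentence and its machine translation as a pseudo-parallel
    # pair only when their cross-language cosine similarity exceeds 0.4.
    a = sentence_vector(zh_tokens, zh_vecs)
    b = sentence_vector(vi_tokens, vi_vecs)
    if a is None or b is None:
        return False
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos > threshold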
As a preferred embodiment of the present invention, in Step2 the mBERT model is further pre-trained through event element mask pre-training (EMLM) and cross-language contrastive learning (CCL). The specific steps of Step2 are as follows:
Step2.1, given a Chinese event sentence Sentence_zh whose event elements are el_l (l = 1, 2, 3, ...), each el_l is first replaced with the [MASK] marker, and the result is concatenated with the Vietnamese pseudo-parallel event sentence Sentence_vi, so that the final input is a sequence with special markers: input_emlm = [CLS] + Sentence_zh + [SEP] + Sentence_vi + [SEP]. This input is passed through an embedding layer and k Transformer layers to obtain the contextual representation H^(k) ∈ R^(N×dim), where N denotes the maximum sequence length and dim the hidden-layer dimension. The sequence representation output by the last layer is fed to a subsequent linear layer, which yields a probability for each masked event element. For each position in Sentence_zh replaced by the [MASK] marker, the final representation corresponding to el_l is denoted H_l. The specific calculation process is as follows:
H^(0) = Embedding(input_emlm)
H^(k) = Transformers(H^(k-1))
In event element mask pre-training, only Sentence_zh is masked; this encourages the model to recover the replaced parts from the semantic information of the Vietnamese pseudo-parallel sentence and thereby learn cross-language features. The loss function of event element mask pre-training is the negative log-likelihood of the masked event elements:
L_EMLM = - Σ_l log P(el_l | H_l)
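A minimal sketch of this masking-and-loss computation with a Hugging Face mBERT checkpoint is given below; the naive subword-span search, the truncation length, and the use of BertForMaskedLM's built-in cross-entropy are simplifying assumptions, not the claimed implementation.

import torch
from transformers import BertTokenizerFast, BertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

def emlm_loss(zh_sentence, vi_sentence, event_elements):
    # Build [CLS] + Sentence_zh + [SEP] + Sentence_vi + [SEP].
    enc = tokenizer(zh_sentence, vi_sentence, return_tensors="pt",
                    truncation=True, max_length=256)
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)  # -100: ignored by the loss

    for el in event_elements:
        el_ids = tokenizer(el, add_special_tokens=False)["input_ids"]
        seq = input_ids[0].tolist()
        # Naive search for the element's subword span; in practice only the
        # Chinese segment is masked, and the first match suffices here.
        for i in range(len(seq) - len(el_ids) + 1):
            if seq[i:i + len(el_ids)] == el_ids:
                labels[0, i:i + len(el_ids)] = input_ids[0, i:i + len(el_ids)]
                input_ids[0, i:i + len(el_ids)] = tokenizer.mask_token_id
                break

    out = model(input_ids=input_ids,
                attention_mask=enc["attention_mask"],
                token_type_ids=enc["token_type_ids"],
                labels=labels)
    return out.loss  # cross-entropy over the masked event-element positions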
Step2.2, given a Chinese query phrase Q_zh, let its relevant document be d+ and its irrelevant documents be {d-_1, d-_2, ..., d-_n}. The encoder produces the corresponding representations e_q, e_d+ and e_d-_j for the query and the documents. The model is trained to maximize the similarity between e_q and e_d+ and to minimize the similarity between e_q and each e_d-_j. The specific calculation process is as follows:
L_CCL = - log [ exp(sim(e_q, e_d+)) / ( exp(sim(e_q, e_d+)) + Σ_{j=1}^{n} exp(sim(e_q, e_d-_j)) ) ]
where sim(·) can be any similarity function, such as cosine similarity or dot-product similarity. We extend this training goal to the case where the query and the documents belong to different languages.
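The contrastive objective above can be sketched as an InfoNCE-style loss; the temperature value and the choice of cosine similarity here are assumptions, since the description allows any similarity function.

import torch
import torch.nn.functional as F

def ccl_loss(e_q, e_pos, e_negs, tau=0.05):
    # e_q: (dim,) query; e_pos: (dim,) relevant doc d+; e_negs: (n, dim) d-.
    e_q = F.normalize(e_q, dim=-1)
    e_pos = F.normalize(e_pos, dim=-1)
    e_negs = F.normalize(e_negs, dim=-1)
    pos = torch.dot(e_q, e_pos) / tau            # sim(e_q, e_d+)
    negs = (e_negs @ e_q) / tau                  # sim(e_q, e_d-_j) for each j
    logits = torch.cat([pos.unsqueeze(0), negs]).unsqueeze(0)
    target = torch.zeros(1, dtype=torch.long)    # the positive sits at index 0
    return F.cross_entropy(logits, target)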
As a preferable scheme of the invention, Step3 comprises the following specific steps:
Step3.1, given a Chinese query phrase Q_zh, the query is first segmented, on the basis of the cross-language event pre-training model (emBERT), into the sequence {q_1, q_2, ..., q_n}, where n denotes the length of the query and q_t (t = 1, 2, 3, ...) denotes each token of the query. Unlike ColBERT, no dedicated marker is added to identify the query; instead the special marker [CLS] is prepended directly to Q_zh so that the model learns to distinguish queries and documents in different languages. emBERT then contextualizes the query sequence Q_zh = {q_1, q_2, ..., q_n}, and the final output at [CLS] serves as the contextual representation e_q of the query. The specific encoding formula for the query is as follows:
e_q = Normalize(emBERT([CLS] q_1 q_2 ... q_n))
Step3.2, analogously to the query encoder, a Vietnamese news document is represented as D_vi = {d_1, d_2, ..., d_m}, where m denotes the document length and d_j (j = 1, 2, 3, ...) denotes a token of the document. The contextual representation e_d of the document is obtained through emBERT; specifically, the encoding formula for the document is as follows:
e_d = Normalize(emBERT([CLS] d_1 d_2 ... d_m))
Step3.3, after the given query and document are encoded by emBERT into the corresponding representations e_q and e_d, the relevance score of the query and the document is computed through a late-interaction mechanism, taking as Score_(q,d) the sum of the scores obtained with the MaxSim operator. The specific calculation process is as follows:
Score_(q,d) = Σ_i max_j ( e_(q_i) · e_(d_j) )
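A sketch of the encoding and scoring steps follows. Because the MaxSim operator is token-level, the sketch keeps all token representations rather than only the [CLS] vector; a stock mBERT checkpoint stands in for the fine-tuned emBERT, so the model name and pooling choice are assumptions.

import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
encoder = BertModel.from_pretrained("bert-base-multilingual-cased")

def encode(text, max_length=128):
    # [CLS] is prepended automatically; Normalize(...) maps each token
    # representation onto the unit sphere, as in the encoding formulas above.
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=max_length)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]  # (seq_len, dim)
    return F.normalize(hidden, dim=-1)

def maxsim_score(E_q, E_d):
    # Score_(q,d): sum over query tokens of the best-matching document token.
    return (E_q @ E_d.T).max(dim=1).values.sum().item()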
the invention has the beneficial effects that:
1. according to the method, the mask of the event element is predicted by utilizing an event element mask pre-training method, so that the model pays attention to the main part of the event, and the problem that the existing cross-language pre-training model has insufficient event knowledge is solved;
2. the invention uses a cross-language contrast learning pre-training method to enable sentences with similar meanings among different languages to have closer distance in a pointer space so as to solve the problem of poor alignment effect under the situation of low resources of Chinese;
3. the invention provides an event pre-training method for Chinese cross-language event retrieval, which trains and tests a model in a self-built Chinese cross-language event retrieval data set. Experimental results show that the method provided by the invention can effectively improve the cross-language event retrieval effect of the low-resource language of the Chinese.
Drawings
FIG. 1 is a diagram of the Chinese-Vietnamese cross-language event retrieval model of the invention;
FIG. 2 is a diagram of the event element mask pre-training model of the invention;
FIG. 3 is a diagram of the cross-language contrastive learning pre-training model of the invention;
FIG. 4 is a block diagram of the process of the invention.
Detailed Description
Example 1: as shown in FIGS. 1-4, the event pre-training method for Chinese-Vietnamese cross-language event retrieval specifically comprises the following steps:
Step1, construction of the experimental data sets: Chinese-Vietnamese bilingual news data are crawled from Wikipedia news pages with a crawler, and the data sets required by the experiments are built through manual annotation, namely an event element mask pre-training data set, a cross-language contrastive learning data set, and a Chinese-Vietnamese cross-language event retrieval data set.
As a preferable scheme of the invention, Step1 comprises the following specific steps:
Step1.1, crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages, concatenate each news sample with its date, and add the hyperlinks in the samples to the event element set as event elements; translate each news item into the corresponding target language with Google online translation, obtain cross-language word embeddings from the aligned word vectors provided by fastText, and finally add to the event element mask pre-training data set those pairs whose cosine similarity exceeds 0.4;
Step1.2, find the page corresponding to each event element of the event element set in Wikidata and judge whether the same event element exists in the target language; if so, use the source-language description of the event element as a query and the first paragraph of the linked target-language page as a positive example for that query, forming the cross-language contrastive learning event element data set, while also selecting part of the aligned data to annotate relevance for fine-tuning (an illustrative sketch of this Wikidata check follows Table 1). Table 1 gives the experimental data statistics.
Table 1. Data set statistics (table image not reproduced).
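For illustration, the Step1.2 check for a target-language counterpart of an event element can be performed against the public Wikidata API roughly as follows; the use of zhwiki/viwiki sitelinks to match pages is an assumption about the procedure.

import requests

def vietnamese_counterpart(zh_title):
    # Look up the Wikidata item behind a Chinese Wikipedia title and return
    # the title of its Vietnamese sitelink, if one exists.
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "sites": "zhwiki",
                "titles": zh_title, "props": "sitelinks", "format": "json"},
        timeout=10,
    ).json()
    for entity in resp.get("entities", {}).values():
        links = entity.get("sitelinks", {})
        if "viwiki" in links:
            return links["viwiki"]["title"]
    return None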
Step2, construction of the Chinese-Vietnamese cross-language event pre-training model: the model is trained with event element mask pre-training and cross-language contrastive learning, which improves the Chinese-Vietnamese bilingual alignment representation of the multilingual pre-trained model and fuses event knowledge into it.
As a preferable scheme of the invention, Step2 comprises the following specific steps:
Step2.1, given a Chinese event sentence Sentence_zh whose event elements are el_l (l = 1, 2, 3, ...), each el_l is first replaced with the [MASK] marker, and the result is concatenated with the Vietnamese pseudo-parallel event sentence Sentence_vi, so that the final input is a sequence with special markers: input_emlm = [CLS] + Sentence_zh + [SEP] + Sentence_vi + [SEP]. This input is passed through an embedding layer and k Transformer layers to obtain the contextual representation H^(k) ∈ R^(N×dim), where N denotes the maximum sequence length and dim the hidden-layer dimension. The sequence representation output by the last layer is fed to a subsequent linear layer to obtain a probability for each masked event element. For each position in Sentence_zh replaced by the [MASK] marker, the final representation corresponding to el_l is denoted H_l. The specific calculation process is as follows:
H^(0) = Embedding(input_emlm)
H^(k) = Transformers(H^(k-1))
In event element mask pre-training, we mask only Sentence_zh; this encourages the model to recover the replaced parts from the semantic information of the Vietnamese pseudo-parallel sentence and thereby learn cross-language features. The loss function of event element mask pre-training is the negative log-likelihood of the masked event elements:
L_EMLM = - Σ_l log P(el_l | H_l)
Step2.2, given a Chinese query phrase Q_zh, let its relevant document be d+ and its irrelevant documents be {d-_1, d-_2, ..., d-_n}. The encoder produces the corresponding representations e_q, e_d+ and e_d-_j for the query and the documents. The model is trained to maximize the similarity between e_q and e_d+ and to minimize the similarity between e_q and each e_d-_j. The specific calculation process is as follows:
L_CCL = - log [ exp(sim(e_q, e_d+)) / ( exp(sim(e_q, e_d+)) + Σ_{j=1}^{n} exp(sim(e_q, e_d-_j)) ) ]
where sim(·) can be any similarity function, such as cosine similarity or dot-product similarity. We extend this training goal to the case where the query and the documents belong to different languages.
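Putting the two objectives together, one continued-pre-training step might look like the following sketch, reusing the emlm_loss and ccl_loss functions sketched earlier under Step2.1 and Step2.2; the equal loss weighting, the batch format, and the optimizer choice are assumptions, not values given in the description.

def train_step(batch, optimizer, ccl_weight=1.0):
    # batch["zh"], batch["vi"]: pseudo-parallel sentences with event elements;
    # batch["e_q"], batch["e_pos"], batch["e_negs"]: representations produced
    # by the shared encoder for the query, its relevant document, and its
    # irrelevant documents.
    loss = emlm_loss(batch["zh"], batch["vi"], batch["event_elements"])
    loss = loss + ccl_weight * ccl_loss(batch["e_q"], batch["e_pos"],
                                        batch["e_negs"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()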
Step3, construction of the cross-language event retrieval model: the Chinese-Vietnamese cross-language event pre-training model obtained in Step2 is fine-tuned to produce the Chinese-Vietnamese cross-language event retrieval results.
As a preferred embodiment of the present invention, Step3 comprises the following specific steps:
Step3.1, given a Chinese query phrase Q_zh, the query is first segmented, on the basis of the cross-language event pre-training model (emBERT), into the sequence {q_1, q_2, ..., q_n}, where n denotes the length of the query and q_t (t = 1, 2, 3, ...) denotes each token of the query. Unlike ColBERT, the invention adds no dedicated marker to identify the query; the special marker [CLS] is instead prepended directly to Q_zh so that the model learns to distinguish queries and documents in different languages. emBERT then contextualizes the query sequence Q_zh = {q_1, q_2, ..., q_n}, and the final output at [CLS] serves as the contextual representation e_q of the query. The specific encoding formula for the query is as follows:
e_q = Normalize(emBERT([CLS] q_1 q_2 ... q_n))
Step3.2, analogously to the query encoder, a Vietnamese news document is represented as D_vi = {d_1, d_2, ..., d_m}, where m denotes the document length and d_j (j = 1, 2, 3, ...) denotes a token of the document. The contextual representation e_d of the document is obtained through emBERT; specifically, the encoding formula for the document is as follows:
e_d = Normalize(emBERT([CLS] d_1 d_2 ... d_m))
Step3.3, after the given query and document are encoded by emBERT into the corresponding representations e_q and e_d, the relevance score of the query and the document is computed through a late-interaction mechanism, taking as Score_(q,d) the sum of the scores obtained with the MaxSim operator. The specific calculation process is as follows:
Score_(q,d) = Σ_i max_j ( e_(q_i) · e_(d_j) )
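As a usage sketch, ranking Vietnamese candidate documents for one Chinese query with the encode and maxsim_score helpers sketched under Step3 reduces to the following; the query string and candidate texts are toy placeholders.

E_q = encode("某领导人访问越南")  # an example Chinese event query phrase
candidates = ["Thủ tướng thăm Việt Nam ...", "Một trận bóng đá ..."]  # toy docs
ranked = sorted(candidates,
                key=lambda doc: maxsim_score(E_q, encode(doc)),
                reverse=True)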
in order to illustrate the search effect of the Chinese cross-language event, the results of the baseline method and the method of the invention are compared, specifically, the results are compared with the following cross-language search method.
BM25-QT: BM25 is an unsupervised ranking model based on IDF-weighted term matching. In the experiment, Chinese queries are first translated into Vietnamese with Google online translation, and the BM25 retrieval model is then implemented with the Anserini toolkit, with all hyperparameters left at their default values.
CLE-CLER: an unsupervised method that first aggregates the embeddings of the terms composing the query and the document, and then ranks by the cosine similarity of the query embedding and the document embedding.
XLM-R: using the checkpoint published on Hugging Face; the model is pre-trained with masked language modeling on a CommonCrawl corpus covering more than 100 languages.
DPR-mBERT: the encoder of DPR is replaced directly with the original mBERT, and the result is used for comparison.
ColBERT-X: a multilingual extension of ColBERT, trained on MS MARCO with a cross-language transfer technique.
The results of the Chinese-Vietnamese cross-language event retrieval experiment are shown in Table 2:
Table 2. Chinese-Vietnamese cross-language event retrieval results (%) (table image not reproduced).
As can be seen from the cross-language event retrieval results in Table 2, the model of the invention improves on the baseline models; deep-learning methods generally outperform traditional cross-language retrieval methods, the proposed model improves substantially on the query-translation and word-embedding methods, and it also improves to different degrees on the deep neural matching models.
(1) Compared with the traditional query-translation and word-embedding methods, the method of the invention improves markedly. Inaccurate Chinese word segmentation and errors introduced by query translation affect the retrieval results, so the query-translation method performs poorly on the Chinese-Vietnamese cross-language event retrieval data set. The cross-language word-embedding method is likewise unsatisfactory because static word embeddings can hardly capture the contextual semantic relationship between the query and the document.
(2) Compared with representative cross-language retrieval methods that use the original XLM-RoBERTa as the query and document encoder, the proposed model improves by 0.0905, 0.1021 and 0.1438 on NDCG@1, NDCG@5 and NDCG@10, and by 0.095 on MAP, indicating that a cross-language pre-trained model still needs task-specific fine-tuning to improve the representations it generates. Compared with the two best-performing baselines, the method of the invention improves by 1%-3% on every evaluation index, showing that it is better suited to Chinese-Vietnamese cross-language event retrieval: after training on the two pre-training tasks and task-specific fine-tuning, the model both represents the Chinese-Vietnamese bilingual text better and learns discriminative representations of events.
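For reference, NDCG@k as reported here follows the standard definition (DCG with a log2 position discount, normalised by the ideal ordering); the linear-gain variant of that definition can be computed as in the sketch below.

import math

def ndcg_at_k(relevances, k):
    # relevances: graded relevance labels of one query's results, in ranked order.
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0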
In order to verify the influence of each pre-training task on model performance, three groups of experiments are designed and compared with mBERT, where EMLM denotes the event element mask pre-training task and CCL denotes the cross-language contrastive learning task. The experimental results are shown in Table 3.
Table 3. Ablation of the pre-training tasks on model performance (table image not reproduced).
Analysis of the results in Table 3 shows that both proposed pre-training tasks benefit semantic representation. The cross-language contrastive learning task brings the larger gain, which is expected since its training objective is closer to the downstream task. EMLM uses comparable corpora and therefore also has a positive influence, and the best effect is achieved when the two tasks are combined, indicating that the two pre-training tasks are complementary.

Claims (4)

1. An event pre-training method for Chinese-Vietnamese cross-language event retrieval, characterized by comprising the following specific steps:
Step1, construction of the experimental data sets: crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages with a crawler, and build the data sets required by the experiments through manual annotation, namely an event element mask pre-training data set, a cross-language contrastive learning data set, and a Chinese-Vietnamese cross-language event retrieval data set;
Step2, construction of the Chinese-Vietnamese cross-language event pre-training model: train the model with event element mask pre-training and cross-language contrastive learning, improving the Chinese-Vietnamese bilingual alignment representation of the multilingual pre-trained model and fusing event knowledge into it;
Step3, construction of the cross-language event retrieval model: fine-tune the Chinese-Vietnamese cross-language event pre-training model obtained in Step2 to produce the Chinese-Vietnamese cross-language event retrieval results.
2. The event pre-training method for Chinese-Vietnamese cross-language event retrieval of claim 1, characterized in that Step1 comprises the following specific steps:
Step1.1, crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages, concatenate each news sample with its date, and add the hyperlinks in the samples to the event element set as event elements; translate each news item into the corresponding target language with Google online translation, obtain cross-language word embeddings from the aligned word vectors provided by fastText, and finally add to the event element mask pre-training data set those pairs whose cosine similarity exceeds 0.4;
Step1.2, find the page corresponding to each event element of the event element set in Wikidata and judge whether the same event element exists in the target language; if so, use the source-language description of the event element as a query and the first paragraph of the linked target-language page as a positive example for that query, forming the cross-language contrastive learning event element data set, while also selecting part of the aligned data to annotate relevance for fine-tuning.
3. The event pre-training method for Chinese-Vietnamese cross-language event retrieval of claim 1, characterized in that in Step2 the mBERT model is further pre-trained through event element mask pre-training EMLM and cross-language contrastive learning CCL, with the following specific steps:
Step2.1, given a Chinese event sentence Sentence_zh whose event elements are el_l (l = 1, 2, 3, ...), each el_l is first replaced with the [MASK] marker, and the result is concatenated with the Vietnamese pseudo-parallel event sentence Sentence_vi, so that the final input is a sequence with special markers: input_emlm = [CLS] + Sentence_zh + [SEP] + Sentence_vi + [SEP]; this input is passed through an embedding layer and k Transformer layers to obtain the contextual representation H^(k) ∈ R^(N×dim), where N denotes the maximum sequence length and dim the hidden-layer dimension; the sequence representation output by the last layer is fed to a subsequent linear layer to obtain a probability for each masked event element; for each position in Sentence_zh replaced by the [MASK] marker, the final representation corresponding to el_l is denoted H_l; the specific calculation process is as follows:
H^(0) = Embedding(input_emlm)
H^(k) = Transformers(H^(k-1))
in event element mask pre-training, only Sentence_zh is masked, which encourages the model to recover the replaced parts from the semantic information of the Vietnamese pseudo-parallel sentence while learning cross-language features; the loss function of event element mask pre-training is:
L_EMLM = - Σ_l log P(el_l | H_l)
Step2.2, given a Chinese query phrase Q_zh, let its relevant document be d+ and its irrelevant documents be {d-_1, d-_2, ..., d-_n}; the encoder produces the corresponding representations e_q, e_d+ and e_d-_j for the query and the documents; the model is trained to maximize the similarity between e_q and e_d+ and to minimize the similarity between e_q and each e_d-_j; the specific calculation process is as follows:
L_CCL = - log [ exp(sim(e_q, e_d+)) / ( exp(sim(e_q, e_d+)) + Σ_{j=1}^{n} exp(sim(e_q, e_d-_j)) ) ]
where sim(·) is any similarity function; this training goal is extended to the case where the query and the documents belong to different languages.
4. The event pre-training method for Chinese-Vietnamese cross-language event retrieval of claim 1, characterized in that Step3 comprises the following specific steps:
Step3.1, given a Chinese query phrase Q_zh, the query is first segmented, on the basis of the cross-language event pre-training model emBERT, into the sequence {q_1, q_2, ..., q_n}, where n denotes the length of the query and q_t (t = 1, 2, 3, ...) denotes each token of the query; unlike ColBERT, no dedicated marker is added to identify the query, and the special marker [CLS] is instead prepended directly to Q_zh so that the model learns to distinguish queries and documents in different languages; emBERT then contextualizes the query sequence Q_zh = {q_1, q_2, ..., q_n}, and the final output at [CLS] serves as the contextual representation e_q of the query; the specific encoding formula for the query is as follows:
e_q = Normalize(emBERT([CLS] q_1 q_2 ... q_n))
Step3.2, analogously to the query encoder, a Vietnamese news document is represented as D_vi = {d_1, d_2, ..., d_m}, where m denotes the document length and d_j (j = 1, 2, 3, ...) denotes a token of the document, whose contextual representation e_d is obtained through emBERT; specifically, the encoding formula for the document is as follows:
e_d = Normalize(emBERT([CLS] d_1 d_2 ... d_m))
Step3.3, after the given query and document are encoded by emBERT into the corresponding representations e_q and e_d, the relevance score of the query and the document is computed through a late-interaction mechanism, taking as Score_(q,d) the sum of the scores obtained with the MaxSim operator; the specific calculation process is as follows:
Score_(q,d) = Σ_i max_j ( e_(q_i) · e_(d_j) )
CN202211029783.7A 2022-08-25 2022-08-25 Event pre-training method for Chinese-Vietnamese cross-language event retrieval Pending CN115470393A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211029783.7A CN115470393A (en) 2022-08-25 2022-08-25 Event pre-training method for Chinese-Vietnamese cross-language event retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211029783.7A CN115470393A (en) 2022-08-25 2022-08-25 Event pre-training method for Chinese-Vietnamese cross-language event retrieval

Publications (1)

Publication Number Publication Date
CN115470393A 2022-12-13

Family

ID=84368944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211029783.7A Pending CN115470393A (en) Event pre-training method for Chinese-Vietnamese cross-language event retrieval

Country Status (1)

Country Link
CN (1) CN115470393A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116028631A (en) * 2023-03-30 2023-04-28 粤港澳大湾区数字经济研究院(福田) Multi-event detection method and related equipment
CN116028631B (en) * 2023-03-30 2023-07-14 粤港澳大湾区数字经济研究院(福田) Multi-event detection method and related equipment
CN116822495A (en) * 2023-08-31 2023-09-29 小语智能信息科技(云南)有限公司 Chinese-Lao-Thai parallel sentence pair extraction method and device based on contrastive learning
CN116822495B (en) * 2023-08-31 2023-11-03 小语智能信息科技(云南)有限公司 Chinese-Lao-Thai parallel sentence pair extraction method and device based on contrastive learning

Similar Documents

Publication Publication Date Title
Vulić et al. On the role of seed lexicons in learning bilingual word embeddings
Yu et al. An attention mechanism and multi-granularity-based Bi-LSTM model for Chinese Q&A system
CN103473280B (en) Method for mining comparable network language materials
CN115470393A (en) Event pre-training method for Chinese-Vietnamese cross-language event retrieval
US20090182547A1 (en) Adaptive Web Mining of Bilingual Lexicon for Query Translation
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN102779135B (en) Method and device for obtaining cross-linguistic search resources and corresponding search method and device
CN117076653A (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN103034627B (en) Calculate the method and apparatus of sentence similarity and the method and apparatus of machine translation
CN107895000A (en) A kind of cross-cutting semantic information retrieval method based on convolutional neural networks
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
Gough et al. Robust large-scale EBMT with marker-based segmentation
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112883199A (en) Collaborative disambiguation method based on deep semantic neighbor and multi-entity association
CN117574858A (en) Automatic generation method of class case retrieval report based on large language model
CN115017884A (en) Text parallel sentence pair extraction method based on image-text multi-mode gating enhancement
CN117891948A (en) Small sample news classification method based on internal knowledge extraction and contrast learning
Xiong et al. Pinyin-to-Chinese conversion on sentence-level for domain-specific applications using self-attention model
CN112749566B (en) Semantic matching method and device for English writing assistance
CN106776590A (en) A kind of method and system for obtaining entry translation
CN116561594A (en) Legal document similarity analysis method based on Word2vec
Mara English-Wolaytta Machine Translation using Statistical Approach
Li et al. Optimizing automatic evaluation of machine translation with the ListMLE approach
Zhang Research on English machine translation system based on the internet
CN110019814A (en) A kind of news information polymerization based on data mining and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination