CN115470393A - Event pre-training method for Chinese-Vietnamese cross-language event retrieval - Google Patents

Event pre-training method for Chinese-Vietnamese cross-language event retrieval

Info

Publication number
CN115470393A
CN115470393A (application number CN202211029783.7A)
Authority
CN
China
Prior art keywords
language
event
training
cross
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211029783.7A
Other languages
Chinese (zh)
Inventor
余正涛 (Yu Zhengtao)
吴少扬 (Wu Shaoyang)
朱恩昌 (Zhu Enchang)
线岩团 (Xian Yantuan)
黄于欣 (Huang Yuxin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202211029783.7A priority Critical patent/CN115470393A/en
Publication of CN115470393A publication Critical patent/CN115470393A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Semantic analysis
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an event pre-training method for Chinese-Vietnamese cross-language event retrieval, and belongs to the technical field of natural language processing. The method applies two additional pre-training tasks to mBERT. Event element mask pre-training first injects event knowledge into the model and improves its representation of events in the extremely low-resource Vietnamese setting; cross-language contrastive learning then pulls sentences with similar meanings in different languages closer together in the representation space. The resulting Chinese-Vietnamese cross-language event pre-training model is fine-tuned so that downstream tasks achieve better performance. The effectiveness of the method is demonstrated on a self-built Chinese-Vietnamese bilingual news event retrieval data set.

Description

Event pre-training method for Chinese-Vietnamese cross-language event retrieval
Technical Field
The invention relates to an event pre-training method for Chinese-Vietnamese cross-language event retrieval, and belongs to the technical field of natural language processing.
Background
Chinese-Vietnamese cross-language event retrieval refers to the task of taking a Chinese event query phrase as input and retrieving a set of relevant Vietnamese news texts. Accurately retrieving news articles about a particular event facilitates follow-up tasks such as public-opinion event monitoring, news recommendation, and event tracking.
In recent years, cross-language retrieval has attracted extensive research and made substantial progress. Existing approaches fall mainly into three categories: methods based on machine translation, methods based on cross-language word embeddings, and methods based on multilingual pre-trained language models. Machine-translation-based methods map queries and documents into the same semantic space with neural machine translation and then perform monolingual retrieval; the translation can be applied to the query, to the documents, or through an intermediate language. These methods depend heavily on the accuracy of neural machine translation and are prone to word-mismatch and translation-ambiguity problems; for distant low-resource pairs such as Chinese-Vietnamese, translation errors directly degrade the retrieval results. To address these problems, researchers proposed cross-language information retrieval based on pre-trained cross-language word vectors, whose core idea is to map the text semantics of different languages into a single semantic space and then train a neural ranking model. However, because word order and context are ignored, cross-language word vectors yield inaccurate semantic representations of the query or the documents to be retrieved, and the mapping between the semantic spaces of different languages easily propagates errors, which hurts the performance of the retrieval model.
With the introduction of multilingual pre-trained language models such as mBERT and XLM-R, strong results have been achieved on cross-language understanding tasks, showing that such models implicitly encode multilingually aligned semantic knowledge. However, the training tasks of existing multilingual pre-trained models focus mainly on word-level and sentence-level alignment, so applying them directly to cross-language retrieval still aligns a source-language query with a target-language long text poorly. To address this, Yu et al. proposed a pre-training method better suited to cross-language retrieval and obtained good results on four languages. Even so, their method falls short for cross-language event retrieval, because events usually carry more complex semantic information, and event representations derived only from word-level semantics make texts that describe different events look semantically similar. For example, "A visits Vietnam" and "B visits Vietnam" are two different visit events, yet the two unrelated events still receive similar vector representations because the news texts describing them share many overlapping words.
Therefore, in order to integrate Chinese-Vietnamese bilingually aligned event knowledge into the multilingual pre-trained language model, the invention proposes two pre-training tasks: event element mask pre-training and cross-language event contrastive learning. The aim is to remedy the lack of event knowledge in cross-language pre-trained models and their poor alignment in a low-resource setting.
Disclosure of Invention
The invention provides an event pre-training method for Chinese-Vietnamese cross-language event retrieval, addressing two problems of the low-resource setting: Chinese-Vietnamese cross-language event retrieval lacks large-scale labeled data, and existing cross-language pre-trained models cannot adequately represent the rich Chinese-Vietnamese aligned event knowledge in text.
The technical scheme of the invention is as follows: the event pre-training method for Chinese-Vietnamese cross-language event retrieval comprises the following specific steps:
Step1, construction of the experimental data sets: crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages with a crawler, and build the data sets required by the experiments through manual annotation, namely an event element mask pre-training data set, a cross-language contrastive learning data set, and a Chinese-Vietnamese cross-language event retrieval data set;
Step2, construction of the Chinese-Vietnamese cross-language event pre-training model: train the model with event element mask pre-training and cross-language contrastive learning, improving the Chinese-Vietnamese bilingual alignment representation of the multilingual pre-trained model and fusing event knowledge into it;
Step3, construction of the cross-language event retrieval model: fine-tune the Chinese-Vietnamese cross-language event pre-training model obtained in Step2 to produce the Chinese-Vietnamese cross-language event retrieval results.
As a preferable scheme of the invention, Step1 comprises the following specific steps (an illustrative sketch of the similarity filter follows these steps):
Step1.1, crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages, concatenate each news sample with its date, and add the hyperlinks in the samples to the event element set as event elements; translate each news item into the corresponding target language with Google online translation, obtain cross-language word embeddings from the aligned word vectors provided by fastText, and finally add to the event element mask pre-training data set those pairs whose cosine similarity exceeds 0.4;
Step1.2, find the page corresponding to each event element of the event element set in Wikidata and judge whether the same event element exists in the target language; if so, use the source-language description of the event element as a query and the first paragraph of the linked target-language page as a positive example for that query, forming the cross-language contrastive learning event element data set, while also selecting part of the aligned data to annotate relevance for fine-tuning.
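For illustration only, the cosine-similarity filter of Step1.1 can be sketched with aligned fastText word vectors as follows; this is a minimal sketch rather than the claimed implementation, and the vector file names, the vocabulary limit, and the whole-sentence token averaging are assumptions.

import io
import numpy as np

def load_vectors(path, limit=200000):
    # Load aligned fastText vectors (e.g. the "wiki.zh.align.vec" release).
    vecs = {}
    with io.open(path, "r", encoding="utf-8", errors="ignore") as f:
        next(f)  # skip the "<count> <dim>" header line
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

def sentence_vector(tokens, vecs):
    known = [vecs[t] for t in tokens if t in vecs]
    return np.mean(known, axis=0) if known else None

def keep_pair(zh_tokens, vi_tokens, zh_vecs, vi_vecs, threshold=0.4):
    # Keep a Chinese sentence and its machine translation as a pseudo-parallel
    # pair only when their cross-language cosine similarity exceeds 0.4.
    a = sentence_vector(zh_tokens, zh_vecs)
    b = sentence_vector(vi_tokens, vi_vecs)
    if a is None or b is None:
        return False
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos > threshold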
As a preferred embodiment of the present invention, in Step2 the mBERT model is further pre-trained through event element mask pre-training (EMLM) and cross-language contrastive learning (CCL). The specific steps of Step2 are as follows:
Step2.1, given a Chinese event sentence Sentence_zh whose event elements are el_l (l = 1, 2, 3, ...), each el_l is first replaced with the [MASK] marker, and the result is concatenated with the Vietnamese pseudo-parallel event sentence Sentence_vi, so that the final input is a sequence with special markers: input_emlm = [CLS] + Sentence_zh + [SEP] + Sentence_vi + [SEP]. This input is passed through an embedding layer and k Transformer layers to obtain the contextual representation H^(k) ∈ R^(N×dim), where N denotes the maximum sequence length and dim the hidden-layer dimension. The sequence representation output by the last layer is fed to a subsequent linear layer, which yields a probability for each masked event element. For each position in Sentence_zh replaced by the [MASK] marker, the final representation corresponding to el_l is denoted H_l. The specific calculation process is as follows:
H^(0) = Embedding(input_emlm)
H^(k) = Transformers(H^(k-1))
In event element mask pre-training, only Sentence_zh is masked; this encourages the model to recover the replaced parts from the semantic information of the Vietnamese pseudo-parallel sentence and thereby learn cross-language features. The loss function of event element mask pre-training is the negative log-likelihood of the masked event elements:
L_EMLM = - Σ_l log P(el_l | H_l)
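A minimal sketch of this masking-and-loss computation with a Hugging Face mBERT checkpoint is given below; the naive subword-span search, the truncation length, and the use of BertForMaskedLM's built-in cross-entropy are simplifying assumptions, not the claimed implementation.

import torch
from transformers import BertTokenizerFast, BertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

def emlm_loss(zh_sentence, vi_sentence, event_elements):
    # Build [CLS] + Sentence_zh + [SEP] + Sentence_vi + [SEP].
    enc = tokenizer(zh_sentence, vi_sentence, return_tensors="pt",
                    truncation=True, max_length=256)
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)  # -100: ignored by the loss

    for el in event_elements:
        el_ids = tokenizer(el, add_special_tokens=False)["input_ids"]
        seq = input_ids[0].tolist()
        # Naive search for the element's subword span; in practice only the
        # Chinese segment is masked, and the first match suffices here.
        for i in range(len(seq) - len(el_ids) + 1):
            if seq[i:i + len(el_ids)] == el_ids:
                labels[0, i:i + len(el_ids)] = input_ids[0, i:i + len(el_ids)]
                input_ids[0, i:i + len(el_ids)] = tokenizer.mask_token_id
                break

    out = model(input_ids=input_ids,
                attention_mask=enc["attention_mask"],
                token_type_ids=enc["token_type_ids"],
                labels=labels)
    return out.loss  # cross-entropy over the masked event-element positions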
Step2.2, given a Chinese query phrase Q_zh, let its relevant document be d+ and its irrelevant documents be {d-_1, d-_2, ..., d-_n}. The encoder produces the corresponding representations e_q, e_d+ and e_d-_j for the query and the documents. The model is trained to maximize the similarity between e_q and e_d+ and to minimize the similarity between e_q and each e_d-_j. The specific calculation process is as follows:
L_CCL = - log [ exp(sim(e_q, e_d+)) / ( exp(sim(e_q, e_d+)) + Σ_{j=1}^{n} exp(sim(e_q, e_d-_j)) ) ]
where sim(·) can be any similarity function, such as cosine similarity or dot-product similarity. We extend this training goal to the case where the query and the documents belong to different languages.
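The contrastive objective above can be sketched as an InfoNCE-style loss; the temperature value and the choice of cosine similarity here are assumptions, since the description allows any similarity function.

import torch
import torch.nn.functional as F

def ccl_loss(e_q, e_pos, e_negs, tau=0.05):
    # e_q: (dim,) query; e_pos: (dim,) relevant doc d+; e_negs: (n, dim) d-.
    e_q = F.normalize(e_q, dim=-1)
    e_pos = F.normalize(e_pos, dim=-1)
    e_negs = F.normalize(e_negs, dim=-1)
    pos = torch.dot(e_q, e_pos) / tau            # sim(e_q, e_d+)
    negs = (e_negs @ e_q) / tau                  # sim(e_q, e_d-_j) for each j
    logits = torch.cat([pos.unsqueeze(0), negs]).unsqueeze(0)
    target = torch.zeros(1, dtype=torch.long)    # the positive sits at index 0
    return F.cross_entropy(logits, target)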
As a preferable scheme of the invention, Step3 comprises the following specific steps:
Step3.1, given a Chinese query phrase Q_zh, the query is first segmented, on the basis of the cross-language event pre-training model (emBERT), into the sequence {q_1, q_2, ..., q_n}, where n denotes the length of the query and q_t (t = 1, 2, 3, ...) denotes each token of the query. Unlike ColBERT, no dedicated marker is added to identify the query; instead the special marker [CLS] is prepended directly to Q_zh so that the model learns to distinguish queries and documents in different languages. emBERT then contextualizes the query sequence Q_zh = {q_1, q_2, ..., q_n}, and the final output at [CLS] serves as the contextual representation e_q of the query. The specific encoding formula for the query is as follows:
e_q = Normalize(emBERT([CLS] q_1 q_2 ... q_n))
Step3.2, analogously to the query encoder, a Vietnamese news document is represented as D_vi = {d_1, d_2, ..., d_m}, where m denotes the document length and d_j (j = 1, 2, 3, ...) denotes a token of the document. The contextual representation e_d of the document is obtained through emBERT; specifically, the encoding formula for the document is as follows:
e_d = Normalize(emBERT([CLS] d_1 d_2 ... d_m))
Step3.3, after the given query and document are encoded by emBERT into the corresponding representations e_q and e_d, the relevance score of the query and the document is computed through a late-interaction mechanism, taking as Score_(q,d) the sum of the scores obtained with the MaxSim operator. The specific calculation process is as follows:
Score_(q,d) = Σ_i max_j ( e_(q_i) · e_(d_j) )
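A sketch of the encoding and scoring steps follows. Because the MaxSim operator is token-level, the sketch keeps all token representations rather than only the [CLS] vector; a stock mBERT checkpoint stands in for the fine-tuned emBERT, so the model name and pooling choice are assumptions.

import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
encoder = BertModel.from_pretrained("bert-base-multilingual-cased")

def encode(text, max_length=128):
    # [CLS] is prepended automatically; Normalize(...) maps each token
    # representation onto the unit sphere, as in the encoding formulas above.
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=max_length)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]  # (seq_len, dim)
    return F.normalize(hidden, dim=-1)

def maxsim_score(E_q, E_d):
    # Score_(q,d): sum over query tokens of the best-matching document token.
    return (E_q @ E_d.T).max(dim=1).values.sum().item()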
the invention has the beneficial effects that:
1. according to the method, the mask of the event element is predicted by utilizing an event element mask pre-training method, so that the model pays attention to the main part of the event, and the problem that the existing cross-language pre-training model has insufficient event knowledge is solved;
2. the invention uses a cross-language contrast learning pre-training method to enable sentences with similar meanings among different languages to have closer distance in a pointer space so as to solve the problem of poor alignment effect under the situation of low resources of Chinese;
3. the invention provides an event pre-training method for Chinese cross-language event retrieval, which trains and tests a model in a self-built Chinese cross-language event retrieval data set. Experimental results show that the method provided by the invention can effectively improve the cross-language event retrieval effect of the low-resource language of the Chinese.
Drawings
FIG. 1 is a diagram of the Chinese-Vietnamese cross-language event retrieval model of the invention;
FIG. 2 is a diagram of the event element mask pre-training model of the invention;
FIG. 3 is a diagram of the cross-language contrastive learning pre-training model of the invention;
FIG. 4 is a block diagram of the process of the invention.
Detailed Description
Example 1: as shown in FIGS. 1-4, the event pre-training method for Chinese-Vietnamese cross-language event retrieval specifically comprises the following steps:
Step1, construction of the experimental data sets: Chinese-Vietnamese bilingual news data are crawled from Wikipedia news pages with a crawler, and the data sets required by the experiments are built through manual annotation, namely an event element mask pre-training data set, a cross-language contrastive learning data set, and a Chinese-Vietnamese cross-language event retrieval data set.
As a preferable scheme of the invention, Step1 comprises the following specific steps:
Step1.1, crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages, concatenate each news sample with its date, and add the hyperlinks in the samples to the event element set as event elements; translate each news item into the corresponding target language with Google online translation, obtain cross-language word embeddings from the aligned word vectors provided by fastText, and finally add to the event element mask pre-training data set those pairs whose cosine similarity exceeds 0.4;
Step1.2, find the page corresponding to each event element of the event element set in Wikidata and judge whether the same event element exists in the target language; if so, use the source-language description of the event element as a query and the first paragraph of the linked target-language page as a positive example for that query, forming the cross-language contrastive learning event element data set, while also selecting part of the aligned data to annotate relevance for fine-tuning (an illustrative sketch of this Wikidata check follows Table 1). Table 1 gives the experimental data statistics.
Table 1. Data set statistics (table image not reproduced).
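For illustration, the Step1.2 check for a target-language counterpart of an event element can be performed against the public Wikidata API roughly as follows; the use of zhwiki/viwiki sitelinks to match pages is an assumption about the procedure.

import requests

def vietnamese_counterpart(zh_title):
    # Look up the Wikidata item behind a Chinese Wikipedia title and return
    # the title of its Vietnamese sitelink, if one exists.
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "sites": "zhwiki",
                "titles": zh_title, "props": "sitelinks", "format": "json"},
        timeout=10,
    ).json()
    for entity in resp.get("entities", {}).values():
        links = entity.get("sitelinks", {})
        if "viwiki" in links:
            return links["viwiki"]["title"]
    return None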
Step2, construction of the Chinese-Vietnamese cross-language event pre-training model: the model is trained with event element mask pre-training and cross-language contrastive learning, which improves the Chinese-Vietnamese bilingual alignment representation of the multilingual pre-trained model and fuses event knowledge into it.
As a preferable scheme of the invention, Step2 comprises the following specific steps:
Step2.1, given a Chinese event sentence Sentence_zh whose event elements are el_l (l = 1, 2, 3, ...), each el_l is first replaced with the [MASK] marker, and the result is concatenated with the Vietnamese pseudo-parallel event sentence Sentence_vi, so that the final input is a sequence with special markers: input_emlm = [CLS] + Sentence_zh + [SEP] + Sentence_vi + [SEP]. This input is passed through an embedding layer and k Transformer layers to obtain the contextual representation H^(k) ∈ R^(N×dim), where N denotes the maximum sequence length and dim the hidden-layer dimension. The sequence representation output by the last layer is fed to a subsequent linear layer to obtain a probability for each masked event element. For each position in Sentence_zh replaced by the [MASK] marker, the final representation corresponding to el_l is denoted H_l. The specific calculation process is as follows:
H^(0) = Embedding(input_emlm)
H^(k) = Transformers(H^(k-1))
In event element mask pre-training, we mask only Sentence_zh; this encourages the model to recover the replaced parts from the semantic information of the Vietnamese pseudo-parallel sentence and thereby learn cross-language features. The loss function of event element mask pre-training is the negative log-likelihood of the masked event elements:
L_EMLM = - Σ_l log P(el_l | H_l)
Step2.2, given a Chinese query phrase Q_zh, let its relevant document be d+ and its irrelevant documents be {d-_1, d-_2, ..., d-_n}. The encoder produces the corresponding representations e_q, e_d+ and e_d-_j for the query and the documents. The model is trained to maximize the similarity between e_q and e_d+ and to minimize the similarity between e_q and each e_d-_j. The specific calculation process is as follows:
L_CCL = - log [ exp(sim(e_q, e_d+)) / ( exp(sim(e_q, e_d+)) + Σ_{j=1}^{n} exp(sim(e_q, e_d-_j)) ) ]
where sim(·) can be any similarity function, such as cosine similarity or dot-product similarity. We extend this training goal to the case where the query and the documents belong to different languages.
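Putting the two objectives together, one continued-pre-training step might look like the following sketch, reusing the emlm_loss and ccl_loss functions sketched earlier under Step2.1 and Step2.2; the equal loss weighting, the batch format, and the optimizer choice are assumptions, not values given in the description.

def train_step(batch, optimizer, ccl_weight=1.0):
    # batch["zh"], batch["vi"]: pseudo-parallel sentences with event elements;
    # batch["e_q"], batch["e_pos"], batch["e_negs"]: representations produced
    # by the shared encoder for the query, its relevant document, and its
    # irrelevant documents.
    loss = emlm_loss(batch["zh"], batch["vi"], batch["event_elements"])
    loss = loss + ccl_weight * ccl_loss(batch["e_q"], batch["e_pos"],
                                        batch["e_negs"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()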
Step3, construction of the cross-language event retrieval model: the Chinese-Vietnamese cross-language event pre-training model obtained in Step2 is fine-tuned to produce the Chinese-Vietnamese cross-language event retrieval results.
As a preferred embodiment of the present invention, Step3 comprises the following specific steps:
Step3.1, given a Chinese query phrase Q_zh, the query is first segmented, on the basis of the cross-language event pre-training model (emBERT), into the sequence {q_1, q_2, ..., q_n}, where n denotes the length of the query and q_t (t = 1, 2, 3, ...) denotes each token of the query. Unlike ColBERT, the invention adds no dedicated marker to identify the query; the special marker [CLS] is instead prepended directly to Q_zh so that the model learns to distinguish queries and documents in different languages. emBERT then contextualizes the query sequence Q_zh = {q_1, q_2, ..., q_n}, and the final output at [CLS] serves as the contextual representation e_q of the query. The specific encoding formula for the query is as follows:
e_q = Normalize(emBERT([CLS] q_1 q_2 ... q_n))
Step3.2, analogously to the query encoder, a Vietnamese news document is represented as D_vi = {d_1, d_2, ..., d_m}, where m denotes the document length and d_j (j = 1, 2, 3, ...) denotes a token of the document. The contextual representation e_d of the document is obtained through emBERT; specifically, the encoding formula for the document is as follows:
e_d = Normalize(emBERT([CLS] d_1 d_2 ... d_m))
Step3.3, after the given query and document are encoded by emBERT into the corresponding representations e_q and e_d, the relevance score of the query and the document is computed through a late-interaction mechanism, taking as Score_(q,d) the sum of the scores obtained with the MaxSim operator. The specific calculation process is as follows:
Score_(q,d) = Σ_i max_j ( e_(q_i) · e_(d_j) )
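As a usage sketch, ranking Vietnamese candidate documents for one Chinese query with the encode and maxsim_score helpers sketched under Step3 reduces to the following; the query string and candidate texts are toy placeholders.

E_q = encode("某领导人访问越南")  # an example Chinese event query phrase
candidates = ["Thủ tướng thăm Việt Nam ...", "Một trận bóng đá ..."]  # toy docs
ranked = sorted(candidates,
                key=lambda doc: maxsim_score(E_q, encode(doc)),
                reverse=True)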
in order to illustrate the search effect of the Chinese cross-language event, the results of the baseline method and the method of the invention are compared, specifically, the results are compared with the following cross-language search method.
BM25-QT: BM25 is an unsupervised ranking model based on IDF-weighted term matching. In the experiment, Chinese queries are first translated into Vietnamese with Google online translation, and the BM25 retrieval model is then implemented with the Anserini toolkit, with all hyperparameters left at their default values.
CLE-CLER: an unsupervised method that first aggregates the embeddings of the terms composing the query and the document, and then ranks by the cosine similarity of the query embedding and the document embedding.
XLM-R: using the checkpoint published on Hugging Face; the model is pre-trained with masked language modeling on a CommonCrawl corpus covering more than 100 languages.
DPR-mBERT: the encoder of DPR is replaced directly with the original mBERT, and the result is used for comparison.
ColBERT-X: a multilingual extension of ColBERT, trained on MS MARCO with a cross-language transfer technique.
The results of the Chinese-Vietnamese cross-language event retrieval experiment are shown in Table 2:
Table 2. Chinese-Vietnamese cross-language event retrieval results (%) (table image not reproduced).
As can be seen from the cross-language event retrieval results in Table 2, the model of the invention improves on the baseline models; deep-learning methods generally outperform traditional cross-language retrieval methods, the proposed model improves substantially on the query-translation and word-embedding methods, and it also improves to different degrees on the deep neural matching models.
(1) Compared with the traditional query-translation and word-embedding methods, the method of the invention improves markedly. Inaccurate Chinese word segmentation and errors introduced by query translation affect the retrieval results, so the query-translation method performs poorly on the Chinese-Vietnamese cross-language event retrieval data set. The cross-language word-embedding method is likewise unsatisfactory because static word embeddings can hardly capture the contextual semantic relationship between the query and the document.
(2) Compared with representative cross-language retrieval methods that use the original XLM-RoBERTa as the query and document encoder, the proposed model improves by 0.0905, 0.1021 and 0.1438 on NDCG@1, NDCG@5 and NDCG@10, and by 0.095 on MAP, indicating that a cross-language pre-trained model still needs task-specific fine-tuning to improve the representations it generates. Compared with the two best-performing baselines, the method of the invention improves by 1%-3% on every evaluation index, showing that it is better suited to Chinese-Vietnamese cross-language event retrieval: after training on the two pre-training tasks and task-specific fine-tuning, the model both represents the Chinese-Vietnamese bilingual text better and learns discriminative representations of events.
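For reference, NDCG@k as reported here follows the standard definition (DCG with a log2 position discount, normalised by the ideal ordering); the linear-gain variant of that definition can be computed as in the sketch below.

import math

def ndcg_at_k(relevances, k):
    # relevances: graded relevance labels of one query's results, in ranked order.
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0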
In order to verify the influence of each pre-training task on model performance, three groups of experiments are designed and compared with mBERT, where EMLM denotes the event element mask pre-training task and CCL denotes the cross-language contrastive learning task. The experimental results are shown in Table 3.
Table 3. Ablation of the pre-training tasks on model performance (table image not reproduced).
Analysis of the results in Table 3 shows that both proposed pre-training tasks benefit semantic representation. The cross-language contrastive learning task brings the larger gain, which is expected since its training objective is closer to the downstream task. EMLM uses comparable corpora and therefore also has a positive influence, and the best effect is achieved when the two tasks are combined, indicating that the two pre-training tasks are complementary.

Claims (4)

1. An event pre-training method for Chinese-Vietnamese cross-language event retrieval, characterized by comprising the following specific steps:
Step1, construction of the experimental data sets: crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages with a crawler, and build the data sets required by the experiments through manual annotation, namely an event element mask pre-training data set, a cross-language contrastive learning data set, and a Chinese-Vietnamese cross-language event retrieval data set;
Step2, construction of the Chinese-Vietnamese cross-language event pre-training model: train the model with event element mask pre-training and cross-language contrastive learning, improving the Chinese-Vietnamese bilingual alignment representation of the multilingual pre-trained model and fusing event knowledge into it;
Step3, construction of the cross-language event retrieval model: fine-tune the Chinese-Vietnamese cross-language event pre-training model obtained in Step2 to produce the Chinese-Vietnamese cross-language event retrieval results.
2. The event pre-training method for Chinese-Vietnamese cross-language event retrieval of claim 1, characterized in that Step1 comprises the following specific steps:
Step1.1, crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages, concatenate each news sample with its date, and add the hyperlinks in the samples to the event element set as event elements; translate each news item into the corresponding target language with Google online translation, obtain cross-language word embeddings from the aligned word vectors provided by fastText, and finally add to the event element mask pre-training data set those pairs whose cosine similarity exceeds 0.4;
Step1.2, find the page corresponding to each event element of the event element set in Wikidata and judge whether the same event element exists in the target language; if so, use the source-language description of the event element as a query and the first paragraph of the linked target-language page as a positive example for that query, forming the cross-language contrastive learning event element data set, while also selecting part of the aligned data to annotate relevance for fine-tuning.
3. The event pre-training method for Chinese-Vietnamese cross-language event retrieval of claim 1, characterized in that in Step2 the mBERT model is further pre-trained through event element mask pre-training EMLM and cross-language contrastive learning CCL, with the following specific steps:
Step2.1, given a Chinese event sentence Sentence_zh whose event elements are el_l (l = 1, 2, 3, ...), each el_l is first replaced with the [MASK] marker, and the result is concatenated with the Vietnamese pseudo-parallel event sentence Sentence_vi, so that the final input is a sequence with special markers: input_emlm = [CLS] + Sentence_zh + [SEP] + Sentence_vi + [SEP]; this input is passed through an embedding layer and k Transformer layers to obtain the contextual representation H^(k) ∈ R^(N×dim), where N denotes the maximum sequence length and dim the hidden-layer dimension; the sequence representation output by the last layer is fed to a subsequent linear layer to obtain a probability for each masked event element; for each position in Sentence_zh replaced by the [MASK] marker, the final representation corresponding to el_l is denoted H_l; the specific calculation process is as follows:
H^(0) = Embedding(input_emlm)
H^(k) = Transformers(H^(k-1))
in event element mask pre-training, only Sentence_zh is masked, which encourages the model to recover the replaced parts from the semantic information of the Vietnamese pseudo-parallel sentence while learning cross-language features; the loss function of event element mask pre-training is:
L_EMLM = - Σ_l log P(el_l | H_l)
Step2.2, given a Chinese query phrase Q_zh, let its relevant document be d+ and its irrelevant documents be {d-_1, d-_2, ..., d-_n}; the encoder produces the corresponding representations e_q, e_d+ and e_d-_j for the query and the documents; the model is trained to maximize the similarity between e_q and e_d+ and to minimize the similarity between e_q and each e_d-_j; the specific calculation process is as follows:
L_CCL = - log [ exp(sim(e_q, e_d+)) / ( exp(sim(e_q, e_d+)) + Σ_{j=1}^{n} exp(sim(e_q, e_d-_j)) ) ]
where sim(·) is any similarity function; this training goal is extended to the case where the query and the documents belong to different languages.
4. The event pre-training method for Chinese-Vietnamese cross-language event retrieval of claim 1, characterized in that Step3 comprises the following specific steps:
Step3.1, given a Chinese query phrase Q_zh, the query is first segmented, on the basis of the cross-language event pre-training model emBERT, into the sequence {q_1, q_2, ..., q_n}, where n denotes the length of the query and q_t (t = 1, 2, 3, ...) denotes each token of the query; unlike ColBERT, no dedicated marker is added to identify the query, and the special marker [CLS] is instead prepended directly to Q_zh so that the model learns to distinguish queries and documents in different languages; emBERT then contextualizes the query sequence Q_zh = {q_1, q_2, ..., q_n}, and the final output at [CLS] serves as the contextual representation e_q of the query; the specific encoding formula for the query is as follows:
e_q = Normalize(emBERT([CLS] q_1 q_2 ... q_n))
Step3.2, analogously to the query encoder, a Vietnamese news document is represented as D_vi = {d_1, d_2, ..., d_m}, where m denotes the document length and d_j (j = 1, 2, 3, ...) denotes a token of the document, whose contextual representation e_d is obtained through emBERT; specifically, the encoding formula for the document is as follows:
e_d = Normalize(emBERT([CLS] d_1 d_2 ... d_m))
Step3.3, after the given query and document are encoded by emBERT into the corresponding representations e_q and e_d, the relevance score of the query and the document is computed through a late-interaction mechanism, taking as Score_(q,d) the sum of the scores obtained with the MaxSim operator; the specific calculation process is as follows:
Score_(q,d) = Σ_i max_j ( e_(q_i) · e_(d_j) )
CN202211029783.7A 2022-08-25 2022-08-25 Event pre-training method for Chinese-Vietnamese cross-language event retrieval Pending CN115470393A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211029783.7A CN115470393A (en) 2022-08-25 2022-08-25 Event pre-training method for Chinese-Vietnamese cross-language event retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211029783.7A CN115470393A (en) 2022-08-25 2022-08-25 Event pre-training method for Chinese-Vietnamese cross-language event retrieval

Publications (1)

Publication Number Publication Date
CN115470393A 2022-12-13

Family

ID=84368944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211029783.7A Pending CN115470393A (en) Event pre-training method for Chinese-Vietnamese cross-language event retrieval

Country Status (1)

Country Link
CN (1) CN115470393A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116028631A (en) * 2023-03-30 2023-04-28 粤港澳大湾区数字经济研究院(福田) Multi-event detection method and related equipment
CN116028631B (en) * 2023-03-30 2023-07-14 粤港澳大湾区数字经济研究院(福田) Multi-event detection method and related equipment
CN116822495A (en) * 2023-08-31 2023-09-29 小语智能信息科技(云南)有限公司 Chinese-Lao-Thai parallel sentence pair extraction method and device based on contrastive learning
CN116822495B (en) * 2023-08-31 2023-11-03 小语智能信息科技(云南)有限公司 Chinese-Lao-Thai parallel sentence pair extraction method and device based on contrastive learning

Similar Documents

Publication Publication Date Title
Vulić et al. On the role of seed lexicons in learning bilingual word embeddings
Yu et al. An attention mechanism and multi-granularity-based Bi-LSTM model for Chinese Q&A system
CN103473280B (en) Method for mining comparable network language materials
CN115470393A (en) Event pre-training method for Chinese-Vietnamese cross-language event retrieval
US20090182547A1 (en) Adaptive Web Mining of Bilingual Lexicon for Query Translation
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN102779135B (en) Method and device for obtaining cross-linguistic search resources and corresponding search method and device
CN117076653A (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN103034627B (en) Calculate the method and apparatus of sentence similarity and the method and apparatus of machine translation
CN107895000A (en) A kind of cross-cutting semantic information retrieval method based on convolutional neural networks
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
Gough et al. Robust large-scale EBMT with marker-based segmentation
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112883199A (en) Collaborative disambiguation method based on deep semantic neighbor and multi-entity association
CN117574858A (en) Automatic generation method of class case retrieval report based on large language model
CN115017884A (en) Text parallel sentence pair extraction method based on image-text multi-mode gating enhancement
CN117891948A (en) Small sample news classification method based on internal knowledge extraction and contrast learning
Xiong et al. Pinyin-to-Chinese conversion on sentence-level for domain-specific applications using self-attention model
CN112749566B (en) Semantic matching method and device for English writing assistance
CN106776590A (en) A kind of method and system for obtaining entry translation
CN116561594A (en) Legal document similarity analysis method based on Word2vec
Mara English-Wolaytta Machine Translation using Statistical Approach
Li et al. Optimizing automatic evaluation of machine translation with the ListMLE approach
Zhang Research on English machine translation system based on the internet
CN110019814A (en) A kind of news information polymerization based on data mining and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination