CN115470393A - Event pre-training method for Chinese-Vietnamese cross-language event retrieval - Google Patents
Event pre-training method for Chinese-Vietnamese cross-language event retrieval

- Publication number: CN115470393A (application CN202211029783.7A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; no legal analysis has been performed and no representation is made as to the accuracy of the status listed)
Classifications

- G06F16/951: Indexing; web crawling techniques
- G06F16/953: Querying, e.g. by the use of web search engines
- G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/30: Semantic analysis
- G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- G06N20/00: Machine learning
Abstract
The invention relates to an event pre-training method for Chinese-Vietnamese cross-language event retrieval, and belongs to the technical field of natural language processing. The method applies two additional pre-training tasks to mBERT. First, event element mask pre-training injects event knowledge into the model and improves its representation of events in an extremely low-resource setting. Second, cross-language contrastive learning pulls sentences with similar meanings in different languages closer together in the representation space. The resulting Chinese-Vietnamese cross-language event pre-training model is then fine-tuned so that downstream tasks achieve better performance. The effectiveness of the method is demonstrated on a self-built Chinese-Vietnamese bilingual news event retrieval data set.
Description
Technical Field
The invention relates to an event pre-training method for Chinese-Vietnamese cross-language event retrieval, and belongs to the technical field of natural language processing.
Background
Chinese-Vietnamese cross-language event retrieval is the task of taking a Chinese event query phrase as input and retrieving the set of relevant Vietnamese news texts. Accurate retrieval of news articles about a particular event facilitates follow-up tasks such as public-opinion event monitoring, news recommendation, and event tracking.
In recent years, cross-language retrieval has attracted substantial research and made steady progress. Existing approaches fall mainly into three categories: methods based on machine translation, methods based on cross-language word embeddings, and methods based on multilingual pre-trained language models. Machine-translation-based methods use neural machine translation to map queries and documents into the same semantic space and then perform monolingual retrieval; the translation can be applied to the query, to the documents, or through an intermediate language. Such methods depend heavily on the accuracy of neural machine translation and easily suffer from word mismatch and translation ambiguity; for distant low-resource language pairs such as Chinese and Vietnamese, translation errors directly affect the retrieval results. To address these problems, researchers proposed cross-language information retrieval based on pre-trained cross-language word vectors, whose core idea is to map the semantics of texts in different languages into the same semantic space via cross-language word embeddings and then train a neural ranking model. However, because cross-language word vectors ignore word order and context, the semantic representations of the query or the texts to be retrieved are inaccurate, and errors propagate when mapping the semantic representation spaces of different languages, which degrades retrieval performance.
Multilingual pre-trained language models such as mBERT and XLM-R achieve strong results on cross-language understanding tasks, showing that they implicitly encode multilingual alignment knowledge. However, their pre-training tasks focus mainly on word-level and sentence-level multilingual alignment, so applying them directly to cross-language retrieval still yields poor alignment between a source-language query and a long target-language document. To address this, Yu et al. proposed a pre-training method better suited to cross-language retrieval and obtained improved results on four languages. Even so, their method falls short for cross-language event retrieval, because events usually carry more complex semantic information, and event representations built only from word-level semantics cause texts describing different events to share very similar representations. For example, "A visits Vietnam" and "B visits Vietnam" are two different visit events, yet because the news texts describing them contain many overlapping words, the two unrelated events end up with similar vector representations.
Therefore, to integrate Chinese-Vietnamese bilingual aligned event knowledge into the multilingual pre-trained language model, the invention proposes two pre-training tasks: event element mask pre-training and cross-language event contrastive learning. The aim is to remedy the lack of event knowledge in cross-language pre-trained models and their poor alignment in low-resource settings.
Disclosure of Invention
The invention provides an event pre-training method for Chinese-Vietnamese cross-language event retrieval, addressing two problems of the low-resource setting: Chinese-Vietnamese cross-language event retrieval lacks large-scale labeled data, and existing cross-language pre-trained models cannot adequately represent the rich Chinese-Vietnamese aligned event knowledge in text.
The technical scheme of the invention is as follows: the event pre-training method for Chinese-Vietnamese cross-language event retrieval comprises the following specific steps:

Step1, construction of an experimental data set: crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages with a web crawler, and construct the data sets required for the experiments through manual annotation, comprising an event element mask pre-training data set, a cross-language contrastive learning data set, and a Chinese-Vietnamese cross-language event retrieval data set;

Step2, construction of the Chinese-Vietnamese cross-language event pre-training model: train the Chinese-Vietnamese cross-language event pre-training model with event element mask pre-training and cross-language contrastive learning, improving the Chinese-Vietnamese bilingual alignment representation of the multilingual pre-trained model and fusing event knowledge into the model;

Step3, construction of the cross-language event retrieval model: fine-tune the Chinese-Vietnamese cross-language event pre-training model obtained in Step2 to produce Chinese-Vietnamese cross-language event retrieval results.
As a preferable scheme of the invention, Step1 comprises the following specific steps:

Step1.1, crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages with a web crawler, concatenate each news sample with its corresponding date, and add the hyperlinks in each sample to the event element set as event elements; translate each news item into the corresponding target language with Google online translation, obtain cross-language word embeddings from the aligned word vectors provided by fastText, and finally add sentence pairs whose cosine similarity is greater than 0.4 to the event element mask pre-training data set;

Step1.2, for each event element in the event element set, find the corresponding page in Wikidata and judge whether the same event element exists in the target language; if so, the event element description in the source language is used as a query and the first paragraph of the linked target-language page as a positive example for that query, forming the cross-language contrastive learning event element data set; at the same time, a portion of the aligned data is annotated with relevance labels for fine-tuning.
As a preferred embodiment of the present invention, in Step2 the mBERT model is continuously pre-trained through event element mask pre-training (EMLM) and cross-language contrastive learning (CCL); the specific steps of Step2 are as follows:
Step2.1, given a Chinese event sentence $Sentence_{zh}$ whose event elements are $el_l$ ($l = 1, 2, 3, \ldots$), each $el_l$ is first replaced with the $[MASK]$ token, and the result is concatenated with the Vietnamese pseudo-parallel event sentence $Sentence_{vi}$, so that the final input is the sequence $input_{emlm} = [CLS] + Sentence_{zh} + [SEP] + Sentence_{vi} + [SEP]$. The input is then passed through an embedding layer and $k$ Transformer layers to obtain the contextual representation $H^{(k)} \in \mathbb{R}^{N \times dim}$, where $N$ denotes the maximum sequence length and $dim$ the hidden-layer dimension. The sequence representation output by the last layer is fed to a subsequent linear layer, which yields a probability for each masked event element. Each position in $Sentence_{zh}$ where an $el_l$ was replaced by $[MASK]$ is finally represented as $H_l$. The specific calculation process is as follows:

$$H^{(0)} = \mathrm{Embedding}(input_{emlm})$$

$$H^{(k)} = \mathrm{Transformer}(H^{(k-1)})$$
In event element mask pre-training, only $Sentence_{zh}$ is masked; the reason is to encourage the model to recover the replaced elements from the semantic information of the Vietnamese pseudo-parallel sentence while learning cross-language features. The loss function of event element mask pre-training is as follows:

Step2.2, given a Chinese query phrase $Q_{zh}$, let $D^{+}$ denote its relevant document and $D^{-}$ the irrelevant documents. An encoder produces the corresponding representations of the query and the documents, and the model is trained to maximize the similarity between the representations of $Q_{zh}$ and $D^{+}$ while minimizing the similarity between the representations of $Q_{zh}$ and $D^{-}$. The specific calculation process is as follows:

where $\mathrm{sim}(\cdot)$ can be any similarity function, such as cosine similarity or dot-product similarity. We extend this training objective to the case where the query and the documents belong to different languages.
As a preferable scheme of the invention, Step3 comprises the following specific steps:
Step3.1, given a Chinese query phrase $Q_{zh}$, the query is first tokenized by the cross-language event pre-training model (emBERT) into the sequence $\{q_1, q_2, \ldots, q_n\}$, where $n$ denotes the length of the query and $q_t$ ($t = 1, 2, 3, \ldots$) denotes each token of the query. Unlike ColBERT, no special token is added to identify the query; instead the special token $[CLS]$ is prepended directly to the query $Q_{zh}$, so that the model learns to distinguish queries and documents in different languages. emBERT is then used to produce contextual representations of the query sequence $Q_{zh} = \{q_1, q_2, \ldots, q_n\}$, and the final output at $[CLS]$ serves as the contextual representation $e_q$ of the query. The specific encoding formula for the query is as follows:

$$e_q = \mathrm{Normalize}(\mathrm{emBERT}([CLS]\, q_1 q_2 \ldots q_n))$$

Step3.2, similar to the query encoder, a Vietnamese news document is represented as $D_{vi} = \{d_1, d_2, \ldots, d_m\}$, where $m$ denotes the document length and $d_j$ ($j = 1, 2, 3, \ldots$) denotes each token of the document. The contextual representation $e_d$ of the document is obtained through emBERT; the specific encoding formula for the document is as follows:

$$e_d = \mathrm{Normalize}(\mathrm{emBERT}([CLS]\, d_1 d_2 \ldots d_m))$$

Step3.3, after the given query and document are encoded by emBERT into the corresponding representations $e_q$ and $e_d$, the relevance score of the query and the document is computed through a late-interaction mechanism, and the sum of the scores obtained with the MaxSim operator gives $Score_{q,d}$. The specific calculation process is as follows:
the invention has the beneficial effects that:
1. The event element mask pre-training method predicts masked event elements, making the model attend to the core parts of an event and remedying the insufficient event knowledge of existing cross-language pre-trained models;

2. The cross-language contrastive learning pre-training method brings sentences with similar meanings in different languages closer together in the representation space, addressing the poor alignment in the Chinese-Vietnamese low-resource setting;

3. The invention provides an event pre-training method for Chinese-Vietnamese cross-language event retrieval, and the model is trained and tested on a self-built Chinese-Vietnamese cross-language event retrieval data set. Experimental results show that the proposed method effectively improves cross-language event retrieval for the low-resource Chinese-Vietnamese language pair.
Drawings
FIG. 1 is a diagram of the Chinese-Vietnamese cross-language event retrieval model of the invention;
FIG. 2 is a diagram of the event element mask pre-training model of the invention;
FIG. 3 is a diagram of the cross-language contrastive learning pre-training model of the invention;
FIG. 4 is a flow diagram of the method of the invention.
Detailed Description
Example 1: as shown in FIGs. 1-4, the event pre-training method for Chinese-Vietnamese cross-language event retrieval comprises the following specific steps:

Step1, construction of an experimental data set: crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages with a web crawler, and construct the data sets required for the experiments through manual annotation, comprising an event element mask pre-training data set, a cross-language contrastive learning data set, and a Chinese-Vietnamese cross-language event retrieval data set.

As a preferable scheme of the invention, Step1 comprises the following specific steps:

Step1.1, crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages with a web crawler, concatenate each news sample with its corresponding date, and add the hyperlinks in each sample to the event element set as event elements; translate each news item into the corresponding target language with Google online translation, obtain cross-language word embeddings from the aligned word vectors provided by fastText, and finally add sentence pairs whose cosine similarity is greater than 0.4 to the event element mask pre-training data set. A sketch of this similarity filter is given below.
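To make the Step1.1 filter concrete, the following minimal sketch scores a Chinese sentence against its Vietnamese translation using fastText aligned word vectors and keeps pairs whose cosine similarity exceeds 0.4. It is an illustrative sketch, not the patented implementation: the vector file names, the mean-pooling of word vectors into sentence vectors, and the pre-tokenized inputs are assumptions.

```python
import numpy as np

def load_vectors(path, limit=200000):
    """Load fastText aligned word vectors (assumed files such as
    wiki.zh.align.vec / wiki.vi.align.vec)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "count dim" header line of the .vec format
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def sentence_vector(tokens, vectors, dim=300):
    """Mean-pool the word vectors of a tokenized sentence (assumed pooling)."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

zh_vectors = load_vectors("wiki.zh.align.vec")  # hypothetical local paths
vi_vectors = load_vectors("wiki.vi.align.vec")

def keep_pair(zh_tokens, vi_tokens, threshold=0.4):
    """Keep a (Chinese, Vietnamese-translation) sentence pair for the event
    element mask pre-training set if cosine similarity exceeds the threshold."""
    return cosine(sentence_vector(zh_tokens, zh_vectors),
                  sentence_vector(vi_tokens, vi_vectors)) > threshold
```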
Step1.2, for each event element in the event element set, find the corresponding page in Wikidata and judge whether the same event element exists in the target language; if so, the event element description in the source language is used as a query and the first paragraph of the linked target-language page as a positive example for that query, forming the cross-language contrastive learning event element data set; at the same time, a portion of the aligned data is annotated with relevance labels for fine-tuning. A sketch of the Wikidata lookup follows; Table 1 lists the experimental data statistics.
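The target-language lookup in Step1.2 can be illustrated with the public Wikidata API. This is a hedged sketch: the helper name and the zhwiki/viwiki site pair are assumptions consistent with the Chinese-Vietnamese setting described above.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def target_language_title(zh_title, target_site="viwiki"):
    """Look up the Wikidata item of a Chinese Wikipedia page and return the
    title of the corresponding page on the target-language wiki, if any."""
    params = {
        "action": "wbgetentities",
        "sites": "zhwiki",
        "titles": zh_title,
        "props": "sitelinks",
        "format": "json",
    }
    data = requests.get(WIKIDATA_API, params=params, timeout=10).json()
    for entity in data.get("entities", {}).values():
        link = entity.get("sitelinks", {}).get(target_site)
        if link:
            return link["title"]  # positive-example page exists
    return None  # no aligned event element in the target language
```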
Table 1 data set statistics
Step2, construction of the Chinese-Vietnamese cross-language event pre-training model: train the Chinese-Vietnamese cross-language event pre-training model with event element mask pre-training and cross-language contrastive learning, improving the Chinese-Vietnamese bilingual alignment representation of the multilingual pre-trained model and fusing event knowledge into the model.

As a preferable scheme of the invention, Step2 comprises the following specific steps:
Step2.1, given a Chinese event sentence $Sentence_{zh}$ whose event elements are $el_l$ ($l = 1, 2, 3, \ldots$), each $el_l$ is first replaced with the $[MASK]$ token, and the result is concatenated with the Vietnamese pseudo-parallel event sentence $Sentence_{vi}$, so that the final input is the sequence $input_{emlm} = [CLS] + Sentence_{zh} + [SEP] + Sentence_{vi} + [SEP]$. The input is then passed through an embedding layer and $k$ Transformer layers to obtain the contextual representation $H^{(k)} \in \mathbb{R}^{N \times dim}$, where $N$ denotes the maximum sequence length and $dim$ the hidden-layer dimension. The sequence representation output by the last layer is fed to a subsequent linear layer, which yields a probability for each masked event element. Each position in $Sentence_{zh}$ where an $el_l$ was replaced by $[MASK]$ is finally represented as $H_l$. The specific calculation process is as follows:

$$H^{(0)} = \mathrm{Embedding}(input_{emlm})$$

$$H^{(k)} = \mathrm{Transformer}(H^{(k-1)})$$
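As an illustration of how $input_{emlm}$ can be assembled, the following sketch masks the event elements of a Chinese sentence and pairs it with its Vietnamese pseudo-parallel sentence using the Hugging Face tokenizer for mBERT. It is a minimal sketch under stated assumptions, not the patented implementation: the single-[MASK] substitution per element, the maximum length, and the example strings are assumptions.

```python
from transformers import AutoTokenizer

# The patent continues pre-training mBERT, so its tokenizer is assumed here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def build_emlm_input(sentence_zh, event_elements, sentence_vi):
    """Build input_emlm = [CLS] + Sentence_zh + [SEP] + Sentence_vi + [SEP]
    with every event element in the Chinese sentence replaced by [MASK].
    One [MASK] per element is a simplification; an element spanning several
    subwords would need one mask per subword."""
    masked_zh = sentence_zh
    for el in event_elements:
        masked_zh = masked_zh.replace(el, tokenizer.mask_token)
    # Passing a sentence pair yields [CLS] A [SEP] B [SEP] automatically.
    return tokenizer(masked_zh, sentence_vi, return_tensors="pt",
                     truncation=True, max_length=256)

encoded = build_emlm_input("A访问越南", ["A"], "A thăm Việt Nam")
print(tokenizer.decode(encoded["input_ids"][0]))
```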
In event element mask pre-training, we mask only $Sentence_{zh}$; the reason is to encourage the model to recover the replaced elements from the semantic information of the Vietnamese pseudo-parallel sentence while learning cross-language features. The loss function for event element mask pre-training is as follows:
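The published loss formula is not reproduced in this text. A standard masked-language-model objective consistent with the description above, offered only as an assumed reconstruction, is the cross-entropy over the masked event elements:

$$\mathcal{L}_{emlm} = -\sum_{l} \log P\left(el_l \mid input_{emlm}\right)$$

where $P(el_l \mid input_{emlm})$ is the probability that the linear layer assigns to the true element $el_l$ at its masked position $H_l$.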
Step2.2, given a Chinese query phrase $Q_{zh}$, let $D^{+}$ denote its relevant document and $D^{-}$ the irrelevant documents. An encoder produces the corresponding representations of the query and the documents, and the model is trained to maximize the similarity between the representations of $Q_{zh}$ and $D^{+}$ while minimizing the similarity between the representations of $Q_{zh}$ and $D^{-}$. The specific calculation process is as follows:
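The published formula is likewise not reproduced here. A standard contrastive objective matching this description, assumed for illustration, is an InfoNCE-style loss with temperature $\tau$ (the temperature is an assumption of this sketch):

$$\mathcal{L}_{ccl} = -\log \frac{\exp\left(\mathrm{sim}(e_{Q_{zh}}, e_{D^{+}})/\tau\right)}{\exp\left(\mathrm{sim}(e_{Q_{zh}}, e_{D^{+}})/\tau\right) + \sum_{D^{-}} \exp\left(\mathrm{sim}(e_{Q_{zh}}, e_{D^{-}})/\tau\right)}$$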
where $\mathrm{sim}(\cdot)$ can be any similarity function, such as cosine similarity or dot-product similarity. We extend this training objective to the case where the query and the documents belong to different languages.
Step3, construction of the cross-language event retrieval model: fine-tune the Chinese-Vietnamese cross-language event pre-training model obtained in Step2 to produce Chinese-Vietnamese cross-language event retrieval results.

As a preferred embodiment of the present invention, Step3 comprises the following specific steps:
Step3.1, given a Chinese query phrase $Q_{zh}$, the query is first tokenized by the cross-language event pre-training model (emBERT) into the sequence $\{q_1, q_2, \ldots, q_n\}$, where $n$ denotes the length of the query and $q_t$ ($t = 1, 2, 3, \ldots$) denotes each token of the query. Unlike ColBERT, the invention adds no special token to identify the query; instead the special token $[CLS]$ is prepended directly to the query $Q_{zh}$, so that the model learns to distinguish queries and documents in different languages. emBERT is then used to produce contextual representations of the query sequence $Q_{zh} = \{q_1, q_2, \ldots, q_n\}$, and the final output at $[CLS]$ serves as the contextual representation $e_q$ of the query. The specific encoding formula for the query is as follows:

$$e_q = \mathrm{Normalize}(\mathrm{emBERT}([CLS]\, q_1 q_2 \ldots q_n))$$

Step3.2, similar to the query encoder, a Vietnamese news document is represented as $D_{vi} = \{d_1, d_2, \ldots, d_m\}$, where $m$ denotes the document length and $d_j$ ($j = 1, 2, 3, \ldots$) denotes each token of the document. The contextual representation $e_d$ of the document is obtained through emBERT; the specific encoding formula for the document is as follows:

$$e_d = \mathrm{Normalize}(\mathrm{emBERT}([CLS]\, d_1 d_2 \ldots d_m))$$
Step3.3, after the given query and document are encoded by emBERT into the corresponding representations $e_q$ and $e_d$, the relevance score of the query and the document is computed through a late-interaction mechanism, and the sum of the scores obtained with the MaxSim operator gives $Score_{q,d}$. The specific calculation formula is as follows:
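The published scoring formula is not reproduced in this text. The standard ColBERT-style MaxSim late-interaction score, assumed here as a reconstruction consistent with the description, sums over query positions the maximum similarity against all document positions:

$$Score_{q,d} = \sum_{i=1}^{n} \max_{j=1,\ldots,m} e_{q_i} \cdot e_{d_j}^{\top}$$

A minimal PyTorch sketch of this late-interaction scoring follows; it assumes token-level normalized representations for the query and the document, and the function and variable names are illustrative rather than taken from the patent.

```python
import torch

def maxsim_score(eq, ed):
    """ColBERT-style MaxSim late interaction.

    eq: (n, dim) L2-normalized query token representations
    ed: (m, dim) L2-normalized document token representations
    Returns the scalar relevance score Score_{q,d}.
    """
    sim = eq @ ed.T                      # (n, m) cosine similarities
    return sim.max(dim=1).values.sum()   # max over doc tokens, sum over query

# Example with random representations (dim = 768, as in mBERT):
eq = torch.nn.functional.normalize(torch.randn(8, 768), dim=-1)
ed = torch.nn.functional.normalize(torch.randn(120, 768), dim=-1)
print(maxsim_score(eq, ed))
```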
in order to illustrate the search effect of the Chinese cross-language event, the results of the baseline method and the method of the invention are compared, specifically, the results are compared with the following cross-language search method.
BM25-QT: BM25 is an unsupervised ranking model based on IDF-weighted term matching. In the experiments, Chinese queries are first translated into Vietnamese with Google online translation, and the BM25 retrieval model is implemented with the Anserini toolkit, with all hyperparameters set to their default values.

CLE-CLER: an unsupervised method that aggregates the word embeddings of the terms forming a query and a document, and ranks documents by the cosine similarity between the query embedding and the document embedding.

XLM-R: the checkpoint released on Hugging Face is used; the model is pre-trained with masked language modeling on a CommonCrawl corpus covering more than 100 languages.

DPR-mBERT: the encoder of DPR is replaced directly with the original mBERT, and the result is compared.

ColBERT-X: a multilingual extension of ColBERT, trained on MS MARCO with a cross-language transfer technique.
The results of the Chinese-Vietnamese cross-language event retrieval experiments are shown in Table 2:
Table 2. Chinese-Vietnamese cross-language event retrieval experimental results (%)
As the cross-language event retrieval results in Table 2 show, the proposed model improves over the baseline models: deep learning methods generally outperform traditional cross-language retrieval methods, the proposed model improves substantially over the query-translation and word-embedding methods, and it also improves to varying degrees over the deep neural matching models.

(1) Compared with the traditional query-translation and word-embedding methods, the proposed method improves substantially. The query-translation method performs poorly on the Chinese-Vietnamese cross-language event retrieval data set because Chinese word segmentation is inaccurate and errors introduced by query translation affect the retrieval results. The cross-language word-embedding method is also unsatisfactory because static word embeddings can hardly capture the contextual semantic relationship between a query and a document.

(2) Compared with a representative cross-language retrieval method that uses the original XLM-RoBERTa as the query and document encoder to obtain representations for retrieval, the proposed cross-language pre-training model improves by 0.0905, 0.1021, and 0.1438 on NDCG@1, NDCG@5, and NDCG@10, and by 0.095 on MAP, indicating that a cross-language pre-trained model still needs task-specific fine-tuning to improve the representations it generates. Compared with the two best-performing baseline methods, the proposed method improves by 1%-3% on every evaluation metric, demonstrating that it is better suited to Chinese-Vietnamese cross-language event retrieval: after training on the two pre-training tasks followed by task-specific fine-tuning, the model both represents the Chinese-Vietnamese bilingual pair better and learns discriminative representations of events.
To verify the influence of each pre-training task on model performance, three groups of experiments are designed and compared against mBERT, where EMLM denotes the event element mask pre-training task and CCL denotes the cross-language contrastive learning task. The experimental results are shown in Table 3.

Table 3. Ablation experiments on the influence of the pre-training tasks

Analysis of the experimental results in Table 3 shows that both proposed pre-training tasks benefit semantic representation. The cross-language contrastive learning task brings the larger gain, which is expected given that its training objective is closer to the downstream task. In EMLM, the invention uses comparable corpora, which also benefits the task; the best effect is achieved when the two tasks are combined, indicating that the two pre-training tasks are complementary.
Claims (4)
1. An event pre-training method for Chinese-Vietnamese cross-language event retrieval, characterized in that the method comprises the following specific steps:

Step1, construction of an experimental data set: crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages with a web crawler, and construct the data sets required for the experiments through manual annotation, comprising an event element mask pre-training data set, a cross-language contrastive learning data set, and a Chinese-Vietnamese cross-language event retrieval data set;

Step2, construction of the Chinese-Vietnamese cross-language event pre-training model: train the Chinese-Vietnamese cross-language event pre-training model with event element mask pre-training and cross-language contrastive learning, improving the Chinese-Vietnamese bilingual alignment representation of the multilingual pre-trained model and fusing event knowledge into the model;

Step3, construction of the cross-language event retrieval model: fine-tune the Chinese-Vietnamese cross-language event pre-training model obtained in Step2 to produce Chinese-Vietnamese cross-language event retrieval results.
2. The method of claim 1, characterized in that the specific steps of Step1 are as follows:

Step1.1, crawl Chinese-Vietnamese bilingual news data from Wikipedia news pages with a web crawler, concatenate each news sample with its corresponding date, and add the hyperlinks in each sample to the event element set as event elements; translate each news item into the corresponding target language with Google online translation, obtain cross-language word embeddings from the aligned word vectors provided by fastText, and finally add sentence pairs whose cosine similarity is greater than 0.4 to the event element mask pre-training data set;

Step1.2, for each event element in the event element set, find the corresponding page in Wikidata and judge whether the same event element exists in the target language; if so, use the event element description in the source language as a query and the first paragraph of the linked target-language page as a positive example for that query, forming the cross-language contrastive learning event element data set, and at the same time annotate a portion of the aligned data with relevance labels for fine-tuning.
3. The method of claim 1, characterized in that in Step2 the mBERT model is continuously pre-trained with event element mask pre-training (EMLM) and cross-language contrastive learning (CCL), the specific steps being as follows:
Step2.1, given a Chinese event sentence $Sentence_{zh}$ whose event elements are $el_l$ ($l = 1, 2, 3, \ldots$), each $el_l$ is first replaced with the $[MASK]$ token, and the result is concatenated with the Vietnamese pseudo-parallel event sentence $Sentence_{vi}$, so that the final input is the sequence $input_{emlm} = [CLS] + Sentence_{zh} + [SEP] + Sentence_{vi} + [SEP]$; the input is then passed through an embedding layer and $k$ Transformer layers to obtain the contextual representation $H^{(k)} \in \mathbb{R}^{N \times dim}$, where $N$ represents the maximum sequence length and $dim$ the hidden-layer dimension; the sequence representation output by the last layer is sent to a subsequent linear layer to obtain the probability of each masked event element; each position in $Sentence_{zh}$ where an $el_l$ was replaced by $[MASK]$ is finally represented as $H_l$, with the following specific calculation process:

$$H^{(0)} = \mathrm{Embedding}(input_{emlm})$$

$$H^{(k)} = \mathrm{Transformer}(H^{(k-1)})$$
In event element mask pre-training, only $Sentence_{zh}$ is masked, so as to encourage the model to recover the replaced elements from the semantic information of the Vietnamese pseudo-parallel sentence while learning cross-language features; the loss function of event element mask pre-training is as follows:

Step2.2, given a Chinese query phrase $Q_{zh}$, let $D^{+}$ denote its relevant document and $D^{-}$ the irrelevant documents; the encoder produces the corresponding representations of the query and the documents, and the model is trained to maximize the similarity between the representations of $Q_{zh}$ and $D^{+}$ while minimizing the similarity between the representations of $Q_{zh}$ and $D^{-}$; the specific calculation process is as follows:

where $\mathrm{sim}(\cdot)$ is any similarity function, and this training objective is extended to the case where the query and the documents belong to different languages.
4. The method of claim 1, characterized in that the specific steps of Step3 are as follows:

Step3.1, given a Chinese query phrase $Q_{zh}$, tokenize the query with the cross-language event pre-training model emBERT into the sequence $\{q_1, q_2, \ldots, q_n\}$, where $n$ denotes the length of the query and $q_t$ ($t = 1, 2, 3, \ldots$) denotes each token of the query; unlike ColBERT, no special token is added to identify the query, and instead the special token $[CLS]$ is prepended directly to the query $Q_{zh}$ so that the model learns to distinguish queries and documents in different languages; emBERT is then used to produce contextual representations of the query sequence $Q_{zh} = \{q_1, q_2, \ldots, q_n\}$, and the final output at $[CLS]$ serves as the contextual representation $e_q$ of the query; the specific encoding formula for the query is as follows:

$$e_q = \mathrm{Normalize}(\mathrm{emBERT}([CLS]\, q_1 q_2 \ldots q_n))$$

Step3.2, similar to the query encoder, represent a Vietnamese news document as $D_{vi} = \{d_1, d_2, \ldots, d_m\}$, where $m$ denotes the document length and $d_j$ ($j = 1, 2, 3, \ldots$) denotes each token of the document, and obtain the contextual representation $e_d$ of the document through emBERT; the specific encoding formula for the document is as follows:

$$e_d = \mathrm{Normalize}(\mathrm{emBERT}([CLS]\, d_1 d_2 \ldots d_m))$$

Step3.3, after the given query and document are encoded by emBERT into the corresponding representations $e_q$ and $e_d$, compute the relevance score of the query and the document through a late-interaction mechanism, the sum of the scores obtained with the MaxSim operator being $Score_{q,d}$; the specific calculation process is as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211029783.7A CN115470393A (en) | 2022-08-25 | 2022-08-25 | Event pre-training method for Chinese-Vietnamese cross-language event retrieval
Publications (1)
Publication Number | Publication Date |
---|---|
CN115470393A true CN115470393A (en) | 2022-12-13 |
Family
ID=84368944
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211029783.7A Pending CN115470393A (en) | 2022-08-25 | 2022-08-25 | Event pre-training method for Chinese-Vietnamese cross-language event retrieval
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115470393A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116028631A (en) * | 2023-03-30 | 2023-04-28 | 粤港澳大湾区数字经济研究院(福田) | Multi-event detection method and related equipment |
CN116028631B (en) * | 2023-03-30 | 2023-07-14 | 粤港澳大湾区数字经济研究院(福田) | Multi-event detection method and related equipment |
CN116822495A (en) * | 2023-08-31 | 2023-09-29 | 小语智能信息科技(云南)有限公司 | Chinese-old and Tai parallel sentence pair extraction method and device based on contrast learning |
CN116822495B (en) * | 2023-08-31 | 2023-11-03 | 小语智能信息科技(云南)有限公司 | Chinese-old and Tai parallel sentence pair extraction method and device based on contrast learning |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |