CN114861631B - Context-based cross-language sentence embedding method - Google Patents

Context-based cross-language sentence embedding method

Info

Publication number
CN114861631B
CN114861631B (application CN202210544674.2A)
Authority
CN
China
Prior art keywords
cross
chinese
context
model
embedding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210544674.2A
Other languages
Chinese (zh)
Other versions
CN114861631A (en)
Inventor
Huang Yuxin (黄于欣)
Wu Zhaoyuan (武照渊)
Yu Zhengtao (余正涛)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210544674.2A priority Critical patent/CN114861631B/en
Publication of CN114861631A publication Critical patent/CN114861631A/en
Application granted granted Critical
Publication of CN114861631B publication Critical patent/CN114861631B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a context-based cross-language sentence embedding method and belongs to the field of natural language processing. First, a training data set is constructed; the mBERT model is then used to obtain the Chinese-Vietnamese context cross-language sentence embeddings corresponding to the training data, a linear fine-tuning layer based on a Siamese network structure is built to reconstruct the obtained Chinese-Vietnamese context cross-language sentence embeddings, and a contrastive loss is constructed to optimize the fine-tuning layer through back-propagation. By reconstructing the Chinese-Vietnamese context cross-language sentence embeddings obtained from the mBERT model with a linear fine-tuning layer that incorporates a Siamese network structure, the method effectively alleviates the poor alignment of Chinese-Vietnamese cross-language sentence embeddings in the mBERT model caused by the scarcity of Chinese-Vietnamese sentence-level parallel corpora and the large grammatical differences between the two languages. Experimental results show that the method greatly improves accuracy, increases the overlap between the Chinese and Vietnamese embedding distributions, and improves the semantic alignment of cross-language sentence embeddings in the low-resource Chinese-Vietnamese setting.

Description

Context-based cross-language sentence embedding method
Technical Field
The invention relates to a context-based cross-language sentence embedding method, belonging to the technical field of natural language processing.
Background
The cross-language sentence embedding task aims to map the semantic information of sentences in different languages, after encoding, into a shared language-agnostic embedding space, so that sentences with similar semantics in different languages have similar vector representations and semantic information can be transferred across languages. Cross-language sentence embedding can be used to solve more complex cross-language tasks, such as cross-language document matching and cross-language summary extraction, and therefore has important application value.
Because multilingual pre-training models can capture the syntactic and semantic features of sequences in different languages well, they are often the first-choice tool for extracting features for downstream cross-language tasks and are also the mainstream method for obtaining cross-language sentence embeddings at present. However, Chinese-Vietnamese is a low-resource language pair and the available sentence-level parallel corpora are scarce, so the multilingual pre-training model is trained only on Chinese and Vietnamese monolingual corpora and lacks an explicit cross-language supervision signal. In addition, Chinese and Vietnamese differ greatly in grammar and word structure; for example, the Vietnamese sentence "Toi la nguoi Trung Quoc" translates word by word into Chinese with the order "I am person China", which violates Chinese grammar rules. This makes it difficult for an encoder trained on Chinese and Vietnamese monolingual corpora to learn high-quality Chinese-Vietnamese context cross-language sentence embeddings, which in turn degrades semantic similarity calculation between Chinese-Vietnamese context cross-language sentence embeddings.
Disclosure of Invention
The invention provides a context-based cross-language sentence embedding method to address the problem that, in the low-resource Chinese-Vietnamese setting, the scarcity of sentence-level parallel corpora and the large grammatical differences cause the Chinese-Vietnamese cross-language sentence embeddings learned by a multilingual pre-training model to be poorly aligned semantically.
The technical scheme of the invention is as follows: a context-based Chinese-Vietnamese cross-language sentence embedding method comprising the following specific steps:
Step1, using comparable corpora of Chinese and Vietnamese on the same topics from Wikipedia, construct a small-scale Chinese-Vietnamese parallel sentence data set and a non-parallel sentence data set as positive and negative examples, and perform the corresponding preprocessing operations for training the Siamese network linear fine-tuning layer.
Step2, obtain the Chinese-Vietnamese context cross-language sentence embeddings corresponding to the training set based on the mBERT model, and build a linear fine-tuning layer that incorporates a Siamese network structure to reconstruct the Chinese-Vietnamese context cross-language sentence embeddings obtained from the mBERT model, so that the embeddings of positive examples become more similar and the embeddings of negative examples become less similar, and construct a contrastive loss to optimize the linear fine-tuning layer through back-propagation.
Step3, combine the mBERT model with the optimized linear fine-tuning layer to obtain the context-based cross-language sentence embedding model mBERT-SF, which is used to obtain high-quality Chinese-Vietnamese context cross-language sentence embeddings.
As a preferable scheme of the invention, the Step1 specifically comprises the following steps:
Step1.1, extract about 20,000 Chinese-Vietnamese pseudo-parallel sentence pairs under entries on the same topics from Wikipedia data;
Step1.2, remove Chinese sentences and Vietnamese sentences containing fewer than 5 words;
Step1.3, remove erroneous sentence pairs containing special characters using regular expressions;
Step1.4, from the remaining roughly 7,000 sentence pairs, manually select the 2,016 Chinese-Vietnamese parallel sentence pairs with the highest semantic similarity as positive examples, and manually annotate a further 448 Chinese-Vietnamese parallel sentence pairs as the test set, where parallel sentence pairs carry the data label l=0;
Step1.5, for each Chinese sentence in the positive examples, randomly draw a Vietnamese sentence that belongs to the positive-example data set but whose semantics do not correspond, to form negative examples, and mix them with the positive examples to jointly form the training set; the negative examples have the same scale as the positive examples, 2,016 pairs, and carry the data label l=1.
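For illustration, the construction of positive and negative training pairs described in Step1.4 and Step1.5 can be sketched as follows; this minimal Python sketch is not the inventors' code, and the function name and data structures are assumptions.

# Illustrative sketch of the training-pair construction in Step1.4-Step1.5.
# Not the inventors' code; the function name and data structures are assumptions.
import random

def build_training_set(parallel_pairs, seed=42):
    """parallel_pairs: list of (zh_sentence, vi_sentence) positive examples (label l=0)."""
    random.seed(seed)
    positives = [(zh, vi, 0) for zh, vi in parallel_pairs]
    vi_sentences = [vi for _, vi in parallel_pairs]
    negatives = []
    for zh, vi in parallel_pairs:
        # Randomly draw a Vietnamese sentence from the positive set whose
        # semantics do not correspond to this Chinese sentence (label l=1).
        candidate = random.choice(vi_sentences)
        while candidate == vi:
            candidate = random.choice(vi_sentences)
        negatives.append((zh, candidate, 1))
    training_set = positives + negatives   # 2,016 positive + 2,016 negative pairs
    random.shuffle(training_set)
    return training_set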
As a preferable scheme of the invention, the Step2 specifically comprises the following steps:
Step2.1, based on the multilingual pre-training model mBERT, obtain the Chinese-Vietnamese context cross-language sentence embeddings CLS_S and CLS_T corresponding to the sentence pairs in the training set;
Step2.2, construct two sub-networks Network1 and Network2 with identical structure to form a linear reconstruction layer, and reconstruct the context cross-language sentence embeddings CLS_S and CLS_T corresponding to the Chinese-Vietnamese input sentence pair respectively, so that Chinese-Vietnamese cross-language sentence embeddings with the same semantics have similar vector representations in the shared embedding space. Each sub-network consists of a fully connected layer and a Dropout layer; the fully connected layer has 768 dimensions and is responsible for extracting features from the original context cross-language sentence embeddings output by the mBERT model. To further improve the generalization ability of the model, a Dropout layer is added after the fully connected layer fc, which randomly drops neurons of the fully connected layer with probability p to prevent the model from overfitting. The feature extraction process of the two sub-networks Network1 and Network2 is shown in formula 1; because the two networks have the same structure and share weights, x is used to denote the pre-fine-tuning cross-language sentence embedding CLS_S or CLS_T of a Chinese or Vietnamese sentence, and the same formula describes the operation of both networks.
y=pf(Wx) (1)
In formula 1, y represents the output after reconstruction by the sub-networks Network1 and Network2, pf(Wx) represents the output of the Dropout layer, p is the random drop probability of the neurons, and W is the weight matrix of the fully connected layer fc. The final result y represents the reconstructed Chinese and Vietnamese context cross-language sentence embeddings E_zh and E_vi.
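A minimal PyTorch-style sketch of one such sub-network is shown below for illustration; PyTorch and the class name are assumptions, since the text only specifies a 768-dimensional fully connected layer followed by a Dropout layer with drop probability p (p=0.2 follows the parameter settings given in the embodiment).

# Illustrative sketch of one Siamese sub-network (fc + Dropout) per Step2.2.
# PyTorch and the class name are assumptions.
import torch
import torch.nn as nn

class ReconstructionSubNetwork(nn.Module):
    def __init__(self, dim: int = 768, p: float = 0.2):
        super().__init__()
        self.fc = nn.Linear(dim, dim)      # fully connected layer fc, 768 -> 768
        self.dropout = nn.Dropout(p)       # randomly drops neurons with probability p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: pre-fine-tuning [CLS] embedding (CLS_S or CLS_T) from mBERT
        return self.dropout(self.fc(x))    # y in formula 1

Because Network1 and Network2 share weights, a single instance of this module can be applied to both the Chinese and the Vietnamese embedding.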
Step2.3, use a contrastive loss to construct a matching layer that fine-tunes the two sub-networks through back-propagation, so that the Chinese-Vietnamese context cross-language sentence embeddings of positive examples are as similar as possible and the embedding similarity between negative examples is as low as possible, as shown in formula 2.
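Formula 2 itself is not reproduced in this text. For reference, the standard form of the contrastive loss that is consistent with the description around formulas 2 and 3 (label l, Euclidean distance D, maximum margin m) would be, as an assumption rather than the original rendering:

L = (1 - l)·D(E_zh, E_vi)² + l·[max(0, m - D(E_zh, E_vi))]²    (2)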
D(E_zh, E_vi) = ||E_zh - E_vi||_2    (3)
E_zh and E_vi are the Chinese and Vietnamese context cross-language sentence embeddings after reconstruction by the fine-tuning layer; D(E_zh, E_vi) denotes the Euclidean distance between the two embeddings, as shown in formula 3; l denotes the label corresponding to the input Chinese-Vietnamese sentence pair, with l=0 when the input is a positive example constructed from a parallel sentence pair and l=1 when the input is a negative example constructed from a non-parallel sentence pair; m is the preset maximum margin value (margin), and applying the m - D(E_zh, E_vi) operation makes negative sentence pairs whose Euclidean distance already exceeds the maximum margin produce only a small loss, which matches the optimization objective of the model.
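A sketch of the matching layer in the same PyTorch-style notation follows; it is an assumed implementation of the contrastive loss and the Euclidean distance of formula 3, not the original code.

# Illustrative sketch of the contrastive-loss matching layer (Step2.3); PyTorch
# is an assumption. Label l=0 marks parallel pairs and l=1 non-parallel pairs.
import torch

def contrastive_loss(e_zh: torch.Tensor, e_vi: torch.Tensor,
                     label: torch.Tensor, margin: float = 2.0) -> torch.Tensor:
    distance = torch.norm(e_zh - e_vi, p=2, dim=-1)                        # formula 3
    positive_term = (1 - label) * distance.pow(2)                          # pull parallel pairs together
    negative_term = label * torch.clamp(margin - distance, min=0).pow(2)   # push non-parallel pairs apart
    return (positive_term + negative_term).mean()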
As a preferable scheme of the invention, the Step3 specifically comprises the following steps:
Step3.1, fuse the mBERT model with the optimized linear fine-tuning layer to form the mBERT-SF model.
Step3.2, when a new Chinese or Vietnamese sentence is input, the model first obtains its corresponding Chinese or Vietnamese context cross-language sentence embedding from the mBERT model, and then reconstructs it through the linear fine-tuning layer that incorporates the Siamese network structure, so that Chinese-Vietnamese context cross-language sentence embeddings with similar semantics have more similar vector representations in the shared embedding space. This effectively alleviates the semantic alignment errors in the multilingual pre-training model caused by the scarcity of Chinese-Vietnamese sentence-level parallel corpora and the large differences between the two languages.
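An end-to-end inference sketch of mBERT-SF is given below for illustration; the use of the HuggingFace Transformers library and the checkpoint name are assumptions, and ReconstructionSubNetwork refers to the sub-network sketch above.

# Illustrative end-to-end sketch of mBERT-SF at inference time (Step3.2).
# HuggingFace Transformers and the checkpoint name are assumptions; the text
# only specifies that mBERT produces the context embedding, which the trained
# linear fine-tuning layer then reconstructs before similarity is computed.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")
finetune_layer = ReconstructionSubNetwork()   # trained shared sub-network from the sketch above

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        cls = mbert(**inputs).last_hidden_state[:, 0]   # [CLS] context embedding
        return finetune_layer(cls)

zh = embed("我是中国人")
vi = embed("Toi la nguoi Trung Quoc")
similarity = torch.nn.functional.cosine_similarity(zh, vi)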
The beneficial effects of the invention are as follows:
1. On the basis of the mBERT model, the method incorporates a Siamese network structure to realize the context cross-language sentence embedding model mBERT-SF for the low-resource Chinese-Vietnamese bilingual task. Using only a small-scale Chinese-Vietnamese sentence-pair data set, it reconstructs the Chinese-Vietnamese context cross-language sentence embeddings of the multilingual pre-training model, effectively alleviates the semantic alignment deviation of cross-language sentence embeddings caused by the scarcity of sentence-level parallel corpora and the large grammatical differences in the low-resource Chinese-Vietnamese scenario, and improves the accuracy of Chinese-Vietnamese context cross-language sentence embeddings in semantic similarity calculation.
2. Visualized distribution diagrams of Chinese-Vietnamese context cross-language sentence embeddings verify that the fine-tuning method incorporating the Siamese network can increase the overlap of the Chinese-Vietnamese context cross-language sentence embedding distributions of the original multilingual pre-training model, so that embeddings with similar semantics are closer in the shared space and embeddings with dissimilar semantics are farther apart, yielding a more reasonable distribution.
Drawings
FIG. 1 is a flow chart of a context-based cross-language sentence embedding method in accordance with the present invention;
FIG. 2 is a schematic diagram of a specific structure of a context-based cross-language sentence embedding method according to the present invention;
FIG. 3 is a schematic diagram of the structure of the linear fine-tuning layer incorporating the Siamese network structure in the context-based cross-language sentence embedding method;
FIG. 4 shows the distributions of Chinese-Vietnamese context cross-language sentence embeddings in the model of the present invention and in different multilingual pre-training models; wherein (a) is the embedding distribution of the XLM model; (b) the embedding distribution of the XLM-R model; (c) the embedding distribution of the mBERT model; (d) the embedding distribution after mBERT-SF fine-tuning;
FIG. 5 shows the designs of different Siamese sub-network structures in the context-based cross-language sentence embedding method according to the present invention;
FIG. 6 shows the loss and accuracy of the mBERT-SF model at different learning rates in the context-based cross-language sentence embedding method of the present invention;
FIG. 7 shows the loss and accuracy of the mBERT-SF model at different margin values in the context-based cross-language sentence embedding method according to the present invention;
FIG. 8 shows the effect of 30 rounds of iterative model training under the optimal hyperparameter settings in the context-based cross-language sentence embedding method.
Detailed Description
Example 1: as shown in fig. 1-8, a context-based cross-language sentence embedding method comprises the following specific steps:
Step1, using comparable corpora of Chinese and Vietnamese on the same topics from Wikipedia, construct a small-scale Chinese-Vietnamese parallel sentence data set and a non-parallel sentence data set as positive and negative examples, and perform the corresponding preprocessing operations for training the Siamese network fine-tuning layer.
Step1.1, extract about 20,000 Chinese-Vietnamese pseudo-parallel sentence pairs under entries on the same topics from Wikipedia data;
Step1.2, remove Chinese sentences and Vietnamese sentences containing fewer than 5 words;
Step1.3, remove erroneous sentence pairs containing special characters using regular expressions;
Step1.4, from the remaining roughly 7,000 sentence pairs, manually select the 2,016 Chinese-Vietnamese parallel sentence pairs with the highest semantic similarity as positive examples, and manually annotate a further 448 Chinese-Vietnamese parallel sentence pairs as the test set, where parallel sentence pairs carry the data label l=0; the constructed data format is shown in Table 1:
Table 1 Examples of Chinese-Vietnamese parallel sentence pairs
Step1.5, for each Chinese sentence in the positive examples, randomly draw a Vietnamese sentence that belongs to the positive-example data set but whose semantics do not correspond, to construct non-parallel sentence pairs as negative examples, and mix them with the positive examples to jointly form the training set; the negative examples have the same scale as the positive examples, 2,016 pairs, and carry the data label l=1. The constructed data format is shown in Table 2:
Table 2 Training set examples with Chinese-Vietnamese non-parallel sentence pairs added
The data set sizes are shown in table 3:
Table 3 data set size
Step2, obtain the Chinese-Vietnamese context cross-language sentence embeddings corresponding to the training set based on the mBERT model, and build a linear fine-tuning layer that incorporates a Siamese network structure to reconstruct the Chinese-Vietnamese context cross-language sentence embeddings obtained from the mBERT model, so that the embeddings of positive examples become more similar and the embeddings of negative examples become less similar, and construct a contrastive loss to optimize the fine-tuning layer through back-propagation.
Step2.1, based on the multilingual pre-training model mBERT, obtain the Chinese-Vietnamese context cross-language sentence embeddings CLS_S and CLS_T corresponding to the sentence pairs in the training set.
Step2.2, construct two sub-networks Network1 and Network2 with identical structure to form a linear reconstruction layer, and reconstruct the context cross-language sentence embeddings CLS_S and CLS_T corresponding to the Chinese-Vietnamese input sentence pair respectively, so that Chinese-Vietnamese cross-language sentence embeddings with the same semantics have similar vector representations in the shared embedding space. Each sub-network consists of a fully connected layer and a Dropout layer; the fully connected layer has 768 dimensions and is responsible for extracting features from the original context cross-language sentence embeddings output by the mBERT model. To further improve the generalization ability of the model, a Dropout layer is added after the fully connected layer fc, which randomly drops neurons of the fully connected layer with probability p to prevent the model from overfitting. The feature extraction process of the two sub-networks Network1 and Network2 is shown in formula 1; because the two networks have the same structure and share weights, x is used to denote the pre-fine-tuning cross-language sentence embedding CLS_S or CLS_T of a Chinese or Vietnamese sentence, and the same formula describes the operation of both networks.
y=pf(Wx) (1)
In formula 1, y represents the output after reconstruction by the sub-networks Network1 and Network2, pf(Wx) represents the output of the Dropout layer, p is the random drop probability of the neurons, and W is the weight matrix of the fully connected layer fc. The final result y represents the reconstructed Chinese and Vietnamese context cross-language sentence embeddings E_zh and E_vi.
Step2.3, use a contrastive loss to construct a matching layer that fine-tunes the two sub-networks through back-propagation, so that the Chinese-Vietnamese context cross-language sentence embeddings of positive examples are as similar as possible and the embedding similarity between negative examples is as low as possible, as shown in formula 2.
D(E_zh, E_vi) = ||E_zh - E_vi||_2    (3)
E_zh and E_vi are the Chinese and Vietnamese context cross-language sentence embeddings after reconstruction by the fine-tuning layer; D(E_zh, E_vi) denotes the Euclidean distance between the two embeddings, as shown in formula 3; l denotes the label corresponding to the input Chinese-Vietnamese sentence pair, with l=0 when the input is a positive example constructed from a parallel sentence pair and l=1 when the input is a negative example constructed from a non-parallel sentence pair; m is the preset maximum margin value (margin), and applying the m - D(E_zh, E_vi) operation makes negative sentence pairs whose Euclidean distance already exceeds the maximum margin produce only a small loss, which matches the optimization objective of the model.
Step3, combine the mBERT model with the optimized linear fine-tuning layer to obtain the context-based cross-language sentence embedding model mBERT-SF, which is used to obtain high-quality Chinese-Vietnamese context cross-language sentence embeddings.
Step3.1, fuse the mBERT model with the optimized linear fine-tuning layer to form the mBERT-SF model.
Step3.2, when a new Chinese or Vietnamese sentence is input, the model first obtains its corresponding Chinese or Vietnamese context cross-language sentence embedding from the mBERT model, and then reconstructs it through the linear fine-tuning layer that incorporates the Siamese network structure, so that Chinese-Vietnamese context cross-language sentence embeddings with similar semantics have more similar vector representations in the shared embedding space. This effectively alleviates the semantic alignment errors in the multilingual pre-training model caused by the scarcity of Chinese-Vietnamese sentence-level parallel corpora and the large differences between the two languages.
To illustrate the effectiveness of the present invention, three comparative experiments and one set of ablation experiments were set up. The first group of experiments verifies the improvement in semantic alignment accuracy of the mBERT-SF model on the Chinese-Vietnamese cross-language sentence meaning matching task, the second group of visualization experiments verifies the improvement of the mBERT-SF model on the distribution of Chinese-Vietnamese context cross-language sentence embeddings, the third group of experiments examines the influence of different sub-network structures on model performance, and the last group of ablation experiments explores the optimal parameter settings of the model.
The parameters of the fine-tuning layer incorporating the Siamese network are set as follows: the input Chinese-Vietnamese context cross-language sentence embedding dimension is 768, and the output embedding dimension after fine-tuning is unchanged; for the two sub-networks of the Siamese network, the random drop probability p of the Dropout layer neurons is 0.2; in the matching layer formed by the contrastive loss, the maximum margin value m is set to 2.0; the model is optimized with the Adam optimizer, the learning rate lr is 1e-5, the training batch size batch_size is 64, and the number of training epochs is 30; the normalization parameters applied to the fine-tuned output Chinese-Vietnamese context cross-language sentence embeddings are ['unit', 'center', 'unit'], where 'unit' denotes length normalization and 'center' denotes a centering operation.
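A training-loop sketch under the hyperparameters listed above follows for illustration; PyTorch, the DataLoader usage and the variable names are assumptions, and only the values p=0.2, m=2.0, lr=1e-5, batch_size=64 and epochs=30 come from the text.

# Illustrative training-loop sketch; reuses ReconstructionSubNetwork and
# contrastive_loss from the sketches above. Only the hyperparameter values are
# taken from the text; everything else is an assumption.
import torch
from torch.utils.data import DataLoader

model = ReconstructionSubNetwork(dim=768, p=0.2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def train(dataset, epochs=30, batch_size=64, margin=2.0):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for cls_s, cls_t, label in loader:               # mBERT [CLS] embeddings + label l
            e_zh, e_vi = model(cls_s), model(cls_t)       # shared-weight reconstruction
            loss = contrastive_loss(e_zh, e_vi, label, margin)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()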
To better compare with existing work, the model uses cosine similarity to measure the semantic similarity between two context cross-language sentence embeddings. The sentence alignment accuracy P@N (the semantic alignment accuracy when N Vietnamese sentences are selected as candidate sentences) is taken as the metric for the semantic alignment effect of the model, and the specific calculation process is shown in formula 4:
where T denotes the number of Chinese-Vietnamese sentence pairs in the test set, and C(E_zh) denotes the list of Vietnamese candidate sentences retrieved by cosine similarity for the Chinese context cross-language sentence embedding E_zh; the indicator takes the value 1 if the candidate sentence set C(E_zh) contains the correct Vietnamese translation and 0 otherwise.
The cosine similarity calculation process is shown in formula 5, where E_vi is the Vietnamese context cross-language sentence embedding in the test set:
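Formulas 4 and 5 are not reproduced in this text; the sketch below implements the P@N evaluation they describe, under the assumption that formula 4 averages an indicator function over the test set and formula 5 is the standard cosine similarity.

# Illustrative sketch of the P@N evaluation (formulas 4 and 5 as described);
# the exact formula renderings are assumptions based on the surrounding text.
import torch
import torch.nn.functional as F

def precision_at_n(zh_embs: torch.Tensor, vi_embs: torch.Tensor, n: int) -> float:
    """zh_embs, vi_embs: aligned (T, 768) test-set embeddings; row i of each forms a pair."""
    # Cosine similarity between every Chinese and every Vietnamese embedding (formula 5)
    sims = F.cosine_similarity(zh_embs.unsqueeze(1), vi_embs.unsqueeze(0), dim=-1)
    hits = 0
    for i in range(sims.size(0)):
        top_n = torch.topk(sims[i], k=n).indices   # candidate set C(E_zh)
        hits += int((top_n == i).any())            # correct Vietnamese translation retrieved?
    return hits / sims.size(0)                     # P@N over the test set (formula 4)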
(1) Improvement of the model on the cross-language sentence meaning matching task
To highlight the effectiveness of the mBERT-SF model in alleviating the semantic alignment deviation between context cross-language sentence embeddings caused by the scarcity of Chinese-Vietnamese sentence-level parallel corpora and by language differences, three current mainstream multilingual pre-training models are selected as baselines, described as follows:
1) mBERT model: the Devlin et al developed a multilingual BERT model (Multilingual BERT, mbbert) based on the BERT model, and pretrained on wikipedia data composed of 104 languages using a masking language modeling (Masked language modeling, MLM) method.
2) XLM model: lample et al, in addition to using the MLM method in the mBERT model, introduced two additional pre-training methods, causal language modeling (Causal language modeling, CLM) and translational language modeling (Translation language modeling, TLM), respectively. The XLM model is also trained on Wikipedia corpus covering more than 100 languages, but the XLM model has more parameters and the shared vocabulary is larger in scale than the mBERT model.
3) XLM-R model: conneau et al propose an XLM-Roberta Large model based on an XLM model, which uses whole-network data with larger data volume than wikipedia data volume as a large-scale multilingual pre-training model obtained by training corpus.
In the cross-language sentence meaning matching task, the context cross-language sentence embeddings corresponding to the Chinese-Vietnamese sentence pairs in the test set are first normalized; then, for each Chinese sentence embedding, cosine similarity is used to retrieve from the test set the N Vietnamese cross-language sentence embeddings that best match its semantics as candidates. The semantic matching accuracies P@N of the different models over N candidate sentences are shown in Table 4:
TABLE 4 Effect of different pre-training models on the Chinese-Vietnamese sentence semantic matching task
Analysis of the experimental results in the table shows that, compared with the original mainstream multilingual pre-training models such as mBERT, XLM and XLM-R, the Chinese-Vietnamese context cross-language sentence embeddings obtained through the mBERT-SF model are greatly improved on the @1 and @5 cross-language sentence meaning matching tasks. This fully demonstrates that the mBERT-SF method, using only small-scale parallel sentence pairs, can effectively improve the accuracy of Chinese-Vietnamese context cross-language sentence embeddings in semantic similarity calculation and alleviate the semantic alignment deviation caused by the scarcity of Chinese-Vietnamese sentence-level parallel corpora and by language differences in the mBERT model. Meanwhile, comparing the alignment accuracies of the different multilingual pre-training models on the Chinese-Vietnamese cross-language sentence meaning matching task, mBERT outperforms XLM, which outperforms XLM-R; that is, the effect gradually decreases as the model data volume and parameter count increase. This is presumed to be caused by the imbalance of the data volumes used in model training and by language differences: Vietnamese, as a low-resource language, accounts for a significantly lower proportion of the pre-training corpus than resource-rich languages such as English, so the cross-language knowledge learned by the model is biased toward Indo-European languages, and transfer to languages such as Vietnamese, with larger language differences and smaller corpora, is poor. This also proves that the Chinese-Vietnamese context cross-language sentence embeddings reconstructed by the Siamese network fine-tuning layer retain the original semantic information while, to a certain extent, eliminating the influence of language differences on semantic similarity calculation.
(2) Visual comparison of the Chinese-Vietnamese context cross-language sentence embedding distributions of different models
To display more intuitively the embedding distribution deviation caused by the Chinese-Vietnamese language differences in the multilingual pre-training models and the change in the distribution of Chinese-Vietnamese context cross-language sentence embeddings before and after fine-tuning, this experiment reduces the context cross-language sentence embeddings of the parallel sentence pairs in the test set, as learned by the multilingual pre-training models mBERT, XLM and XLM-R, to 2-dimensional embeddings and converts them into visual embedding distribution diagrams using the matplotlib tool. FIG. 4 shows the visualized Chinese-Vietnamese context cross-language sentence embedding distributions of the different multilingual pre-training models and of the mBERT-SF model, where light dots represent Chinese sentence embeddings and dark dots represent Vietnamese sentence embeddings.
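For illustration, the visualization pipeline can be sketched as follows; the dimensionality-reduction method is not specified in the text, so PCA via scikit-learn is an assumption, while matplotlib is named in the text.

# Illustrative sketch of the 2-D embedding visualization behind FIG. 4.
# PCA / scikit-learn is an assumption; matplotlib is named in the text.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_embeddings(zh_embs: np.ndarray, vi_embs: np.ndarray, title: str) -> None:
    """zh_embs, vi_embs: (T, 768) arrays of Chinese / Vietnamese sentence embeddings."""
    points = PCA(n_components=2).fit_transform(np.vstack([zh_embs, vi_embs]))
    n = len(zh_embs)
    plt.scatter(points[:n, 0], points[:n, 1], c="lightgray", label="Chinese")    # light dots
    plt.scatter(points[n:, 0], points[n:, 1], c="dimgray", label="Vietnamese")   # dark dots
    plt.title(title)
    plt.legend()
    plt.show()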
In graphs (a), (b) and (c), the Chinese and Vietnamese sentence embedding distributions of the three baseline models show an obvious deviation: affected by the Chinese-Vietnamese language differences, their embedding subspaces do not overlap. In graph (d), after reconstruction by the Siamese network fine-tuning layer, the overlap between the Chinese and Vietnamese sentence embedding subspaces in the mBERT-SF model is significantly improved and the distribution is more uniform, which makes semantic similarity calculation between sentence embeddings easier and improves the cross-language semantic alignment accuracy of the Chinese-Vietnamese context cross-language sentence embeddings.
(3) Influence of different sub-network structures on model performance
To explore the optimal fine-tuning layer architecture, the third group of experiments constructs three sub-networks with different structures as the reconstruction layer in the fine-tuning layer; the three structures are shown in FIG. 5.
Diagram (a) is a single-layer linear reconstruction network built on a linear design, consisting of a fully connected layer fc1 and a Dropout layer; diagram (b) is a single-layer nonlinear network structure built by adding an activation function layer to the single-layer linear reconstruction network, where the tanh function was finally selected as the optimal setting for the activation layer after comparing the model performance with the ReLU and tanh functions; diagram (c) is a double-layer nonlinear reconstruction network formed by adding a further fully connected layer fc2 to the structure in diagram (b). Under the same optimal parameter settings, after 10 rounds of iterative training of the three sub-network structures, the best results on the @1 and @5 cross-language sentence meaning matching tasks on the test set are shown in Table 5:
TABLE 5 Alignment accuracy of the model under different sub-network structures
According to the data in the table, the single-layer linear structure achieves the best alignment effect compared with the nonlinear structures, so the model finally adopts the linear structure as the architecture of the two sub-networks in the reconstruction layer. Among the nonlinear structures, the tanh activation function achieves a better semantic alignment effect than the ReLU function, presumably because the network is shallow and sentence-level semantic information is sparse, so using ReLU as the activation function causes part of the critical semantic information in the cross-language sentence embeddings to be lost during reconstruction. In addition, the alignment accuracy of the model decreases as the complexity of the sub-network increases; considering that the mapping through two fully connected layers may lose semantic information from the original context cross-language sentence embeddings, whereas a single-layer structure can largely retain the semantic information of the original embedding while reconstructing it, the single-layer linear structure is finally chosen as the sub-network structure of the reconstruction layer in the mBERT-SF method.
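For illustration, the three candidate sub-network structures of FIG. 5 can be sketched as follows; PyTorch and the exact layer ordering are assumptions, since the text specifies only the fc1/fc2 layers, the Dropout layer and the tanh/ReLU activation choices.

# Illustrative sketch of the three sub-network structures compared in Table 5.
# Layer ordering is an assumption; only the components are named in the text.
import torch.nn as nn

def make_subnetwork(kind: str, dim: int = 768, p: float = 0.2) -> nn.Sequential:
    if kind == "linear":        # (a) single-layer linear: fc1 + Dropout (the chosen design)
        return nn.Sequential(nn.Linear(dim, dim), nn.Dropout(p))
    if kind == "nonlinear":     # (b) single-layer nonlinear: fc1 + tanh + Dropout
        return nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Dropout(p))
    if kind == "two_layer":     # (c) double-layer nonlinear: adds fc2
        return nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                             nn.Linear(dim, dim), nn.Dropout(p))
    raise ValueError(kind)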
(4) Ablation experiments
In the fourth group of experiments, to achieve the best model performance, ablation experiments are performed on parameters such as the learning rate (lr) of the model, the random drop rate (p) of the Dropout layer neurons and the maximum margin value (m); the number of training epochs is uniformly set to 10 and the data batch size (batch_size) to 64, and the training loss (train_loss), test loss (test_loss) and accuracy (acc) of the model on the @1 semantic alignment task on the test set are examined under different hyperparameter settings.
1. In the first group of hyperparameter experiments, the random drop probability p and the maximum margin value m of the model are fixed to 0.0 and 1.0 respectively; the training behaviour of the model at different learning rates is shown in FIG. 6, where graphs (a), (b) and (c) correspond to learning rates of 1e-4, 1e-5 and 1e-6 respectively, the dark dotted line represents the training loss of the model, the light dotted line the test loss, and the solid line the accuracy of the model on the @1 semantic alignment task on the test set.
As can be seen from graph (a), when the learning rate is 1e-4, the training loss and test loss of the model are low and gradually level off, but the accuracy of the model on the semantic alignment task shows an oscillating downward trend. In graph (c), when the learning rate is 1e-6, the training loss and test loss both decrease, but the test loss remains too large, convergence is slow, and the accuracy on the semantic alignment task is low. Compared with the previous two graphs, when the learning rate in graph (b) is 1e-5, the training loss and test loss of the model decrease steadily and remain at a reasonable magnitude, and the accuracy of the model gradually improves with training time, achieving the best effect so far; therefore the optimal learning rate is chosen as 1e-5.
2. In the second group of hyperparameter experiments, the learning rate of the model is fixed at 1e-5 and the random drop probability p of the Dropout layer at 0.0, and the model performance is examined when the maximum margin value m is 1.0, 2.0 and 3.0 respectively to obtain the optimal margin setting; the experimental results are shown in FIG. 7, where the dark dotted line represents the training loss of the model, the light dotted line the test loss, and the solid line the accuracy of the model on the @1 semantic alignment task on the test set:
As can be seen from a comparison of graphs (a), (b) and (c) in FIG. 7, under the optimal learning rate setting the training loss and test loss of the model decrease steadily for all three maximum margin values. Judging by the semantic alignment accuracy of the model, the best effect is achieved when the margin value is set to 2.0, so the maximum margin value is uniformly set to 2.0.
3. Under the optimal hyperparameter settings, the loss and accuracy of the mBERT-SF model over 30 rounds of iterative training are shown in FIG. 8:
The dark dotted line and the light dotted line respectively represent the training and test loss of the model under the optimal parameter settings, and the lower and upper solid black lines respectively represent the semantic alignment accuracy of the model on the @1 and @5 tasks of the cross-language sentence meaning matching task on the test set. Analysis of FIG. 8 shows that the training and test loss of the model decrease steadily as the number of training rounds increases, and the accuracies on the @1 and @5 tasks reach their best values, 41.07% and 60.04% respectively, at the 26th iteration.
While the present invention has been described in detail with reference to the drawings, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (2)

1. A context-based cross-language sentence embedding method, characterized by comprising the following specific steps:
Step1, using comparable corpora of Chinese and Vietnamese on the same topics, construct a Chinese-Vietnamese parallel sentence data set and a non-parallel sentence data set as positive and negative examples, and perform the corresponding preprocessing operations for training a Siamese network linear fine-tuning layer;
Step2, obtain the Chinese-Vietnamese context cross-language sentence embeddings corresponding to the training set based on the mBERT model, build a linear fine-tuning layer that incorporates a Siamese network structure to reconstruct the Chinese-Vietnamese context cross-language sentence embeddings obtained from the mBERT model, and construct a contrastive loss to optimize the linear fine-tuning layer through back-propagation;
Step3, combine the mBERT model with the optimized linear fine-tuning layer to obtain the context-based cross-language sentence embedding model mBERT-SF, which is used to obtain high-quality Chinese-Vietnamese context cross-language sentence embeddings;
The Step2 specifically comprises the following steps:
Step2.1, based on the multilingual pre-training model mBERT, obtain the Chinese-Vietnamese context cross-language sentence embeddings CLS_S and CLS_T corresponding to the sentence pairs in the training set;
Step2.2, construct two sub-networks Network1 and Network2 with identical structure to form a linear reconstruction layer, and reconstruct the context cross-language sentence embeddings CLS_S and CLS_T corresponding to the Chinese-Vietnamese input sentence pair respectively, so that Chinese-Vietnamese cross-language sentence embeddings with the same semantics have similar vector representations in the shared embedding space; each sub-network consists of a fully connected layer and a Dropout layer, the fully connected layer has 768 dimensions and is responsible for extracting features from the original context cross-language sentence embeddings output by the mBERT model; to further improve the generalization ability of the model, a Dropout layer is added after the fully connected layer fc, which randomly drops neurons of the fully connected layer with probability p to prevent the model from overfitting; the feature extraction process of the two sub-networks Network1 and Network2 is shown in formula 1, and because the two networks have the same structure and share weights, x is used to denote the pre-fine-tuning cross-language sentence embedding CLS_S or CLS_T of a Chinese or Vietnamese sentence, and the same formula describes the operation of both networks;
y=pf(Wx) (1)
in formula 1, y represents the output after reconstruction by the sub-networks Network1 and Network2, pf(Wx) represents the output of the Dropout layer, p is the random drop probability of the neurons, and W is the weight matrix of the fully connected layer fc; the final result y represents the Chinese and Vietnamese context cross-language sentence embeddings E_zh and E_vi reconstructed by the fine-tuning layer;
Step2.3, use a contrastive loss to construct a matching layer that fine-tunes the two sub-networks through back-propagation, so that the Chinese-Vietnamese context cross-language sentence embeddings of positive examples are as similar as possible and the embedding similarity between negative examples is as low as possible, as shown in formula 2;
D(E_zh, E_vi) = ||E_zh - E_vi||_2    (3)
E_zh and E_vi are the Chinese and Vietnamese context cross-language sentence embeddings after reconstruction by the fine-tuning layer; D(E_zh, E_vi) denotes the Euclidean distance between the two embeddings, as shown in formula 3; l denotes the label corresponding to the input Chinese-Vietnamese sentence pair, with l=0 when the input is a positive example constructed from a parallel sentence pair and l=1 when the input is a negative example constructed from a non-parallel sentence pair; m is the preset maximum margin value margin, and applying the m - D(E_zh, E_vi) operation makes negative sentence pairs whose Euclidean distance exceeds the maximum margin produce only a small loss, so as to meet the optimization objective of the model;
the Step3 specifically comprises the following steps:
Step3.1, fuse the mBERT model with the optimized linear fine-tuning layer to form the mBERT-SF model;
Step3.2, when a new Chinese or Vietnamese sentence is input, first obtain its corresponding Chinese or Vietnamese context cross-language sentence embedding based on the mBERT model, and then reconstruct it through the linear fine-tuning layer incorporating the Siamese network structure, so that Chinese-Vietnamese context cross-language sentence embeddings with similar semantics have more similar vector representations in the shared embedding space, effectively alleviating the semantic alignment errors in the multilingual pre-training model caused by the scarcity of Chinese-Vietnamese sentence-level parallel corpora and the large differences between the two languages.
2. The context-based cross-language sentence embedding method of claim 1, wherein: the Step1 specifically comprises the following steps:
Step1.1, extract Chinese-Vietnamese pseudo-parallel sentence pairs under entries on the same topics from Wikipedia data;
Step1.2, remove Chinese sentences and Vietnamese sentences containing fewer than 5 words;
Step1.3, remove erroneous sentence pairs containing special characters using regular expressions;
Step1.4, manually select from the remaining sentence pairs several Chinese-Vietnamese parallel sentence pairs with the highest semantic similarity as positive examples, and manually annotate several Chinese-Vietnamese parallel sentence pairs as the test set, where parallel sentence pairs carry the data label l=0;
Step1.5, for each Chinese sentence in the positive examples, randomly draw a Vietnamese sentence that belongs to the positive-example data set but whose semantics do not correspond, as negative examples, and mix them with the positive examples to jointly form the training set; the negative examples have the same scale as the positive examples and carry the data label l=1.
CN202210544674.2A 2022-05-19 2022-05-19 Context-based cross-language sentence embedding method Active CN114861631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210544674.2A CN114861631B (en) 2022-05-19 2022-05-19 Context-based cross-language sentence embedding method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210544674.2A CN114861631B (en) 2022-05-19 2022-05-19 Context-based cross-language sentence embedding method

Publications (2)

Publication Number Publication Date
CN114861631A CN114861631A (en) 2022-08-05
CN114861631B true CN114861631B (en) 2024-06-21

Family

ID=82639375

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210544674.2A Active CN114861631B (en) 2022-05-19 2022-05-19 Context-based cross-language sentence embedding method

Country Status (1)

Country Link
CN (1) CN114861631B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235532B (en) * 2023-11-09 2024-01-26 西南民族大学 Training and detecting method for malicious website detection model based on M-Bert

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445920A (en) * 2016-09-29 2017-02-22 北京理工大学 Sentence similarity calculation method based on sentence meaning structure characteristics
CN108460464A (en) * 2017-02-22 2018-08-28 中兴通讯股份有限公司 Deep learning training method and device
KR102540774B1 (en) * 2018-12-04 2023-06-08 한국전자통신연구원 Sentence embedding method and apparatus using subword embedding and skip-thought model
CN112287695A (en) * 2020-09-18 2021-01-29 昆明理工大学 Cross-language bilingual pre-training and Bi-LSTM-based Chinese-character-cross parallel sentence pair extraction method
US20220093088A1 (en) * 2020-09-24 2022-03-24 Apple Inc. Contextual sentence embeddings for natural language processing applications
CN112380835B (en) * 2020-10-10 2024-02-20 中国科学院信息工程研究所 Question answer extraction method integrating entity and sentence reasoning information and electronic device
CN114461774A (en) * 2022-01-31 2022-05-10 一贯智服(杭州)技术有限公司 Question-answering system search matching method based on semantic similarity and application thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cross-lingual sentence embedding for low-resource Chinese-Vietnamese based on contrastive learning; Yuxin Huang et al.; ACM Transactions on Asian and Low-Resource Language Information Processing; 20230616; Vol. 22, No. 6; 1-18 *
Research on cross-language word embedding and sentence embedding methods for low-resource Chinese-Vietnamese (面向汉越低资源跨语言词嵌入及句嵌入方法研究); Wu Zhaoyuan (武照渊); China Master's Theses Full-text Database, Information Science & Technology; 20240430; 1-80 *

Also Published As

Publication number Publication date
CN114861631A (en) 2022-08-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant