CN113051936A - Method for enhancing Hanyue neural machine translation based on low-frequency word representation - Google Patents

Method for enhancing Hanyue neural machine translation based on low-frequency word representation Download PDF

Info

Publication number
CN113051936A
CN113051936A
Authority
CN
China
Prior art keywords
low-frequency
chinese
word
vietnamese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110280508.1A
Other languages
Chinese (zh)
Inventor
余正涛
杨福岸
高盛祥
王振晗
朱俊国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202110280508.1A priority Critical patent/CN113051936A/en
Publication of CN113051936A publication Critical patent/CN113051936A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/53Processing of non-Latin text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for enhancing Hanyue (Chinese-Vietnamese) neural machine translation based on low-frequency word representation, and belongs to the field of natural language processing. Low-frequency words are a key factor affecting the performance of neural machine translation models. Because low-frequency words appear only rarely in the data set, the representations learned for them during training are not accurate enough, and this problem is more pronounced in low-resource neural machine translation tasks. The method learns the probability distribution of low-frequency words from monolingual context information, recomputes the word embeddings of the low-frequency words according to that distribution, and then retrains the Transformer model on the resulting word embeddings, effectively alleviating the inaccurate representation of low-frequency words. Experiments on the two low-resource translation tasks, Chinese-Vietnamese and Vietnamese-Chinese, show that the proposed method improves over the baseline model by 8.58% and 6.06% respectively.

Description

Method for enhancing Hanyue neural machine translation based on low-frequency word representation
Technical Field
The invention relates to a method for enhancing Hanyue neural machine translation based on low-frequency word representation, and belongs to the technical field of natural language processing.
Background
The core of word representation enhancement is learning a more accurate representation for each word, and the difficulty lies in representing low-frequency words. In general, there are roughly two approaches to word representation enhancement: (1) methods based on integrating external knowledge, which blend in prior knowledge so that words carry richer meanings; (2) methods based on internal knowledge enhancement, which relearn word representations from monolingual data so that they contain richer translation information and become more accurate. Both approaches can enhance word representations to some extent, making the enhanced meanings more consistent with the sentences they appear in, but neither specifically enhances the representation of low-frequency words, so the problem of poorly translated low-frequency words remains unsolved.
Disclosure of Invention
The invention provides a method for enhancing Hanyue neural machine translation based on low-frequency word representation, which solves the problem that low-frequency words are not well represented in neural machine translation by introducing a language model and a low-frequency word dictionary into a Transformer translation model.
The technical scheme of the invention is as follows: a method of Hanyue neural machine translation enhancement based on low-frequency word representation, comprising:
Step1, collecting a Chinese-Vietnamese bilingual corpus, and preprocessing the collected corpus;
step2, learning the probability distribution of each word through a language model;
Step3, constructing Chinese and Vietnamese low-frequency word dictionaries;
Step4, judging the low-frequency words in the translation model input by using the Chinese-Vietnamese low-frequency word dictionaries constructed in Step3, and updating the representations of the original low-frequency words with the Step2 probability distributions, thereby obtaining a new representation form of the translation model input;
and Step5, retraining the Transformer translation model on the basis of the characterization form obtained in Step4.
As a further scheme of the present invention, the Step1 specifically comprises the following steps:
Step1.1, having linguistic experts translate the English side of the public IWSLT English-Vietnamese bilingual parallel corpus into Chinese to obtain a Chinese-Vietnamese parallel corpus;
Step1.2, cleaning and word-segmenting the corpus, finally obtaining 127,481 Chinese-Vietnamese parallel sentence pairs;
Step1.3, segmenting Chinese sentences with the jieba segmentation tool, and separating punctuation in Vietnamese with a tokenizer.
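The Vietnamese side of Step1.3 can be sketched as follows. Vietnamese already separates syllables with spaces, so the tokenizer only needs to detach punctuation from words; the helper below is a minimal, hypothetical stand-in for that step (the patent names only "a tokenizer", not a specific tool).

```python
import re

def tokenize_vietnamese(sentence: str) -> list[str]:
    """Cut punctuation apart from words, then split on whitespace.

    A minimal stand-in for the punctuation-cutting tokenizer of Step1.3.
    """
    spaced = re.sub(r'([.,!?;:"()])', r" \1 ", sentence)
    return spaced.split()

print(tokenize_vietnamese("Xin chào, thế giới!"))
```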
As a further scheme of the present invention, the Step2 specifically comprises the following steps:
step2.1, for any word w in the dictionary, the probability distribution is:
P(w) = (P_1(w), P_2(w), P_3(w), ..., P_|V|(w))    (1)
satisfies the following conditions:
Σ_{j=1}^{|V|} P_j(w) = 1    (2)
Step2.2, a language model computes the conditional probability of every word in V given the preceding context; for the t-th word x_t in a sentence:
P_j(x_t) = LM(w_j | x_{<t})    (3).
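Equation (3) says that at each position t the language model yields a full distribution over the vocabulary conditioned on the prefix x_{<t}. The sketch below illustrates this shape with a softmax over per-word scores; the scoring function is an arbitrary placeholder, not the patent's 6-layer Transformer decoder.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def lm_distribution(prefix, vocab_size, score):
    """P_j(x_t) = LM(w_j | x_{<t}) for j = 1..|V| (eq. 3), as a softmax
    over per-word scores. `score` stands in for the decoder's logit head."""
    return softmax([score(prefix, j) for j in range(vocab_size)])

# Toy score: prefer the word whose id equals the prefix length.
dist = lm_distribution([3, 1], vocab_size=5,
                       score=lambda p, j: 1.0 if j == len(p) else 0.0)
```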
as a further scheme of the present invention, the Step3 specifically comprises the following steps:
step3.1, respectively counting word frequencies of Chinese and Vietnamese;
Step3.2, defining low-frequency words according to the word frequency distribution: the word grade is determined by the maximum-value method, i.e. words are ranked by occurrence count from high to low, a word's grade is its word order value k, and the dictionary d_k collects the words with word order value k;
Step3.3, constructing the low-frequency word dictionary D_K:
D_K = d_1 ∪ d_2 ∪ ... ∪ d_K    (4).
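Steps 3.1-3.3 amount to a frequency count followed by a union of rank classes. A sketch, under the reading (consistent with the experiments, where "frequency ≤ K" defines the K-class dictionary) that d_k collects the words occurring exactly k times and D_K = d_1 ∪ ... ∪ d_K:

```python
from collections import Counter

def build_lowfreq_dict(tokens, K):
    """Return the per-class dictionaries d_k (words occurring exactly k
    times) and D_K, the union d_1 ∪ ... ∪ d_K (eq. 4)."""
    freq = Counter(tokens)
    d = {k: {w for w, c in freq.items() if c == k} for k in range(1, K + 1)}
    D_K = set().union(*d.values()) if d else set()
    return d, D_K

d, D2 = build_lowfreq_dict("a a a b b c".split(), K=2)
```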
As a further scheme of the present invention, the Step4 specifically comprises the following steps:
Step4.1, using the constructed low-frequency word dictionary D_K to judge which words in the input sentence are low-frequency words: Y if x_t ∈ D_K, otherwise N;
Step4.2, if Y, updating the corresponding x_t with the P(x_t) trained by the language model; if N, keeping it unchanged, thereby obtaining a new source sequence X';
step4.3, multiplying the obtained new source end sequence X' with a word embedding matrix E of a dictionary V to obtain the input of a translation model:
input=X'E (5)。
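In equation (5), each row of X' is either a one-hot vector (ordinary word) or the language-model distribution (low-frequency word), so X'E yields either the word's own embedding or a probability-weighted mixture of all embeddings. A small sketch with plain lists; the dimensions and values are illustrative only:

```python
def model_input(rows, E):
    """Compute input = X'E (eq. 5): each row is a distribution over the
    vocabulary, E is the |V| x d word embedding matrix."""
    dim = len(E[0])
    return [[sum(r[j] * E[j][k] for j in range(len(E))) for k in range(dim)]
            for r in rows]

E = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # |V| = 3, embedding dim d = 2
one_hot = [0.0, 1.0, 0.0]                 # ordinary word -> its own embedding
smoothed = [0.5, 0.5, 0.0]                # low-frequency word -> mixture
inp = model_input([one_hot, smoothed], E)
```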
As a further scheme of the present invention, Step5 further includes:
step5.1, finally obtaining a translation result through a translation model Transformer:
output=Transformer(input,Y) (6)。
the invention has the beneficial effects that:
1. according to the invention, a language model and a low-frequency word dictionary are introduced into a Transformer model, so that the problem that the low-frequency words are not well represented in neural machine translation can be effectively solved.
2. The method can further improve the performance of a machine translation model on a classical Transformer model and a Transformer + LM model without distinguishing word frequency.
3. The experimental results show that the proposed low-frequency-word-representation-enhanced Chinese-Vietnamese neural machine translation method improves the BLEU4 score over the baseline model by 8.58% and 6.06% on the Chinese-Vietnamese and Vietnamese-Chinese low-resource translation tasks respectively.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a translation model architecture of the present invention;
FIG. 3 is a diagram illustrating the influence of the K-class low-frequency word dictionary on the Chinese-Vietnamese model according to the present invention;
FIG. 4 is a diagram illustrating the influence of the K-class low-frequency word dictionary on the Vietnamese-Chinese model according to the present invention.
Detailed Description
For the purpose of describing the invention in more detail and facilitating understanding for those skilled in the art, the present invention will be further described with reference to the accompanying drawings and examples, which are provided for illustration and understanding of the present invention and are not intended to limit the present invention.
Example 1: as shown in FIGS. 1-4, a method for enhanced Hanyue (Chinese-Vietnamese) neural machine translation based on low-frequency word representation comprises the following steps:
Step1, collecting a Chinese-Vietnamese bilingual corpus, and preprocessing the collected corpus;
Step1.1, having linguistic experts translate the English side of the public IWSLT English-Vietnamese bilingual parallel corpus into Chinese to obtain a Chinese-Vietnamese parallel corpus;
Step1.2, cleaning and word-segmenting the corpus, finally obtaining 127,481 Chinese-Vietnamese parallel sentence pairs;
Step1.3, segmenting Chinese sentences with the jieba segmentation tool, and separating punctuation in Vietnamese with a tokenizer.
Step2, learning the probability distribution of each word through a language model;
the language model learns the probability distribution of the low-frequency words by utilizing the monolingual data context information, namely, for a given source end and target end sentence pair, the probability distribution of each word is obtained through the language model;
the purpose of the language model is to obtain the probability distribution of each low frequency word in a lexicon of vocabulary size | V |, which for any low frequency word w is:
P(w) = (P_1(w), P_2(w), P_3(w), ..., P_|V|(w))    (1)
satisfies the following conditions:
Σ_{j=1}^{|V|} P_j(w) = 1    (2)
the probability distribution P (w) of the low-frequency words w can be calculated by various methods, the invention utilizes a pre-trained 6-layer Transformer decoder as a language model to calculate the conditional probability of all words before P (w) and V, and the t x th word in a sentence istThe words, there are:
Pj(xt)=LM(wj|x<t) (3)
wherein LM (w)j|x<t) Representing the probability of the jth word after it appears in the dictionary, the probability distribution computed by the language model can be seen as a smooth approximation of one-hot since they have the same vocabulary size since they are trained using the same corpus as the translation model.
Step3, constructing Chinese and Vietnamese low-frequency word dictionaries;
as a further scheme of the present invention, the Step3 specifically comprises the following steps:
step3.1, respectively counting word frequencies of Chinese and Vietnamese;
Step3.2, defining low-frequency words according to the word frequency distribution: the word grade is determined by the maximum-value method, i.e. words are ranked by occurrence count from high to low, a word's grade is its word order value k, and the dictionary d_k collects the words with word order value k;
Step3.3, constructing the low-frequency word dictionary D_K:
D_K = d_1 ∪ d_2 ∪ ... ∪ d_K    (4).
Specifically, Chinese and Vietnamese low-frequency word dictionaries are constructed separately by statistics over the Chinese and Vietnamese training sets. The dictionary of words whose word order value equals k is the word order value k-class low-frequency word dictionary d_k, and the dictionary of words whose word order value k is less than or equal to K is the K-class low-frequency word dictionary D_K. For each word order value k (k = 1 to 10), the k-class and K-class low-frequency word dictionaries are constructed. The dictionary coverage rate of each low-frequency word dictionary is then computed as the ratio of the size of the low-frequency word dictionary to the size of the total dictionary, where the total dictionary is obtained from the training set.
The Chinese dictionary vocabulary has a size of 47356 and the training set contains 2275526 words in total. The word order value k classes (k = 1 to 10) contain 18496, 6656, 3787, 2508, 1812, 1397, 1067, 832, 719 and 593 words respectively. The Vietnamese dictionary vocabulary has a size of 22732 and the training set contains 3189350 words in total; the corresponding k classes contain 9428, 3188, 1667, 1006, 718, 514, 393, 340, 188 and 223 words respectively.
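The dictionary coverage rate defined above is simply |D_K| / |V|. Using the Chinese class sizes just listed, the cumulative coverage at K = 5 can be checked directly; it lands near the 70.25% quoted later for the best Chinese-Vietnamese setting (small rounding differences are expected):

```python
zh_class_sizes = [18496, 6656, 3787, 2508, 1812, 1397, 1067, 832, 719, 593]
zh_vocab = 47356

def coverage(class_sizes, K, vocab_size):
    """Coverage rate of the K-class dictionary: |D_K| / |V|."""
    return sum(class_sizes[:K]) / vocab_size

cov5 = coverage(zh_class_sizes, 5, zh_vocab)
print(f"{cov5:.2%}")  # roughly 70%
```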
Step4, judging the low-frequency words in the translation model input by using the Chinese-Vietnamese low-frequency word dictionaries constructed in Step3, and updating the representations of the original low-frequency words with the Step2 probability distributions, thereby obtaining a new representation form of the translation model input;
as a further scheme of the present invention, the Step4 specifically comprises the following steps:
Step4.1, using the constructed low-frequency word dictionary D_K to judge which words in the input sentence are low-frequency words: Y if x_t ∈ D_K, otherwise N;
Step4.2, if Y, updating the corresponding x_t with the P(x_t) trained by the language model; if N, keeping it unchanged, thereby obtaining a new source sequence X';
step4.3, multiplying the obtained new source end sequence X' with a word embedding matrix E of a dictionary V to obtain the input of a translation model:
input=X'E (5)。
and Step5, retraining the Transformer translation model on the basis of the characterization form obtained in Step4. Finally, obtaining a translation result through a translation model Transformer:
output=Transformer(input,Y) (6)。
To train and validate the models effectively, 2,000 sentence pairs each were randomly drawn from the Chinese-Vietnamese bilingual parallel data as the test set and the validation set, with the remainder used as the training set; the specific data information is shown in Table 1:
TABLE 1 data size and data set partitioning
In the Chinese-Vietnamese translation task, a Transformer Decoder is adopted as a Chinese language model. The training set and the verification set of the Chinese language model are derived from Chinese corpus in the translation model, and the scales of the training set and the verification set are 127,481 and 2,000 Chinese monolingual data respectively; in the Vietnamese-Chinese translation task, the structure of a language model is the same as that of a model in the Chinese-Vietnamese translation task, and a training set and a verification set of the Vietnamese language model are derived from Vietnamese monolingual corpus in the translation model, and the Vietnamese monolingual data are 127,481 and 2,000 pieces of Vietnamese monolingual data respectively.
Low-frequency words perform poorly in low-resource Chinese-Vietnamese neural machine translation. In order to distinguish low-frequency words from other words and enhance their representations, a low-frequency word dictionary is constructed. The Chinese-Vietnamese word order value k-class low-frequency words are shown in Table 2:
TABLE 2 Chinese-Vietnamese word order value k-class low-frequency words
In the invention, the vocabulary size of the Chinese dictionary is 47,356 and that of the Vietnamese dictionary is 22,732; the maximum batch size is 2048 tokens, the maximum sentence length is 128, the maximum number of epochs is 100, dropout is set to 0.1, the word embedding dimension is 512, and the hidden layer dimension is 512. All models were trained with the Adam optimizer with an initial learning rate of 10^-4.
After the language model training is completed, the optimal parameters of the model are saved and held fixed while the translation model is trained. The proposed method is verified on the Chinese-Vietnamese and Vietnamese-Chinese tasks using the Chinese-Vietnamese parallel data. Bootstrap ("self-help") resampling with 1000 resamples is adopted, and the BLEU4 value on the test set is used as the evaluation index at a significance level of p < 0.05.
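The bootstrap resampling used for evaluation can be sketched as follows: resample the sentence-level scores with replacement 1000 times and read off a percentile interval for the mean. The BLEU4 scoring itself is assumed to be computed elsewhere; plain numbers stand in here.

```python
import random

def bootstrap_interval(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of sentence-level scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n
                   for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_interval([0.31, 0.28, 0.35, 0.30, 0.33])
```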
The present invention employs the following two baseline models. One is the classical Transformer model (Transformer): the Transformer_base model is used in both the Chinese-Vietnamese and Vietnamese-Chinese translation tasks. The other adds a language model on top of the Transformer (Transformer + LM): the translation model input is randomly replaced using the language model's output with replacement probability γ = 0.15 (the optimal setting used in document [2]), and experiments are performed on both translation tasks.
To verify the effectiveness of the method, the experiments compare against the two baseline models, the classical Transformer model and the Transformer + LM model used in the prior art (corpus scale: 127,481 Chinese-Vietnamese parallel sentence pairs). Experiments were carried out on the Chinese-Vietnamese and Vietnamese-Chinese translation tasks; the results are the BLEU4 scores of each translation model, as shown in Table 3.
TABLE 3 experimental results of Chinese-Vietnamese, Vietnamese-Chinese
As can be seen from the above table, in both the Chinese-Vietnamese and Vietnamese-Chinese translation tasks the Transformer + LM model improves the BLEU4 value over the classical Transformer model by 0.87 and 0.59 respectively, and the proposed method improves over the Transformer + LM model by a further 0.84 and 0.68. These results show that the proposed method outperforms both the Transformer and the Transformer + LM models in both directions, demonstrating that low-frequency word representation enhancement is effective for Chinese-Vietnamese and Vietnamese-Chinese translation. The Transformer + LM model is better than the classical Transformer because it randomly introduces word context information through the language model, so the randomly selected words obtain richer information. The proposed method improves further still because it takes the information of low-frequency words into account and performs contextual probability estimation only for low-frequency words; failing to distinguish low-frequency from non-low-frequency words degrades performance. The experimental results show that the method alleviates the poor translation of low-frequency words and has obvious advantages on both translation tasks.
In order to analyze the influence of the occurrence frequency of low-frequency words on the method, following the procedure of FIG. 1, model performance was tested on the Chinese-Vietnamese and Vietnamese-Chinese translation tasks for words whose occurrence frequency is less than or equal to K (K = 1, 2, ..., 10). The results are shown in FIGS. 3 and 4.
As can be seen from FIGS. 3 and 4, in the Chinese-Vietnamese and Vietnamese-Chinese translation tasks the overall trend first rises and then falls as K increases; when K is 5 and 6 respectively, i.e. low-frequency words are defined as those appearing in the training set with frequency less than or equal to 5 and 6 (70.25% and 70.66% of the vocabulary size respectively), the BLEU4 value reaches its maximum. K = 0 corresponds to the classical Transformer model, and for K = 1, 2, ..., 10 the model outperforms the classical Transformer. On the rising part of the curve, the method's model surpasses the Transformer + LM model once K reaches 3; on the falling part, at K = 9 and 10 the Transformer + LM model is slightly better than the method.
As shown in FIG. 3, when K = 0 (the classical Transformer model), the Transformer + LM model is superior because it introduces the context information of random words. When K ≤ 5, the model effect increases steadily: the words in the K-class low-frequency word dictionary occur rarely and cannot be well represented in the translation model, so replacing their representations with their contextual distributions enriches the low-frequency words with context semantic information, and performance grows steadily. When K > 5, words with occurrence frequency above 5 are added to the low-frequency word dictionary; such words can already be trained well, and their trained representations are better than the enhanced form provided by the language model, so the newly added words no longer improve translation performance. Hence the translation effect keeps decreasing once K exceeds 5.
TABLE 4 Han-Yue translation example analysis
TABLE 5 analysis of examples of Yuehan-Han translation
As can be seen from Table 4, the method translates low-frequency words better. In the Chinese-Vietnamese example, the low-frequency word "prerequisite" appears only 5 times in the Chinese training set; the method translates it into a Vietnamese phrase (rendered as an image in the original publication) whose meaning matches the reference translation and fits the context of the sentence. The baseline translation, taken from the Transformer + LM model, renders it as "nhu ..." (image in the original), which fails to convey the meaning of the low-frequency word "prerequisite". As shown in Table 5, in the Vietnamese-Chinese example, the word "ng-n" falls in the low-frequency word dictionary, appearing 6 times in the Vietnamese training set; both the reference translation and the translation of the present method render it as "depressed", while the baseline Transformer + LM model translates it as "urgent", which is far from the true translation and not acceptable. The method therefore represents and translates low-frequency words adequately and better captures their meaning in the sentence.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. A method for enhancing Hanyue (Chinese-Vietnamese) neural machine translation based on low-frequency word representation, characterized by comprising:
Step1, collecting a Chinese-Vietnamese bilingual corpus, and preprocessing the collected corpus;
Step2, learning the probability distribution of each word through a language model;
Step3, constructing Chinese and Vietnamese low-frequency word dictionaries;
Step4, using the Chinese-Vietnamese low-frequency word dictionaries constructed in Step3 to judge the low-frequency words in the translation model input, and updating the representations of the original low-frequency words with the Step2 probability distributions, thereby obtaining a new representation form of the translation model input;
Step5, retraining the Transformer translation model on the basis of the representations obtained in Step4.
2. The method according to claim 1, characterized in that Step1 specifically comprises:
Step1.1, having linguistic experts translate the English side of the public IWSLT English-Vietnamese bilingual parallel corpus into Chinese to obtain a Chinese-Vietnamese parallel corpus;
Step1.2, cleaning and word-segmenting the corpus, finally obtaining 127,481 Chinese-Vietnamese parallel sentence pairs;
Step1.3, segmenting Chinese sentences with the jieba segmentation tool, and separating punctuation in Vietnamese with a tokenizer.
3. The method according to claim 1, characterized in that Step2 specifically comprises:
Step2.1, for any word w in the dictionary, the probability distribution is:
P(w) = (P_1(w), P_2(w), P_3(w), ..., P_|V|(w))    (1)
satisfying:
Σ_{j=1}^{|V|} P_j(w) = 1    (2)
Step2.2, a language model computes the conditional probability of every word in V given the preceding context; for the t-th word x_t in a sentence:
P_j(x_t) = LM(w_j | x_{<t})    (3).
4. The method according to claim 1, characterized in that Step3 specifically comprises:
Step3.1, counting the word frequencies of Chinese and Vietnamese separately;
Step3.2, defining low-frequency words according to the word frequency distribution: the word grade is determined by the maximum-value method, i.e. words are ranked by occurrence count from high to low, a word's grade is its word order value k, and the dictionary d_k collects the words with word order value k;
Step3.3, constructing the low-frequency word dictionary D_K:
D_K = d_1 ∪ d_2 ∪ ... ∪ d_K    (4).
5. The method according to claim 1, characterized in that Step4 specifically comprises:
Step4.1, using the constructed low-frequency word dictionary D_K to judge which words in the input sentence are low-frequency words: Y if x_t ∈ D_K, otherwise N;
Step4.2, if Y, updating the corresponding x_t with the P(x_t) trained by the language model; if N, keeping it unchanged, thereby obtaining a new source sequence X';
Step4.3, multiplying the new source sequence X' by the word embedding matrix E of the dictionary V to obtain the translation model input:
input = X'E    (5).
6. The method according to claim 1, characterized in that Step5 further comprises:
Step5.1, finally obtaining the translation result through the Transformer translation model:
output = Transformer(input, Y)    (6).
CN202110280508.1A 2021-03-16 2021-03-16 Method for enhancing Hanyue neural machine translation based on low-frequency word representation Pending CN113051936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110280508.1A CN113051936A (en) 2021-03-16 2021-03-16 Method for enhancing Hanyue neural machine translation based on low-frequency word representation


Publications (1)

Publication Number Publication Date
CN113051936A 2021-06-29

Family

ID=76512520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110280508.1A Pending CN113051936A (en) 2021-03-16 2021-03-16 Method for enhancing Hanyue neural machine translation based on low-frequency word representation

Country Status (1)

Country Link
CN (1) CN113051936A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038725A (en) * 2017-12-04 2018-05-15 China Jiliang University A machine-learning-based analysis method for e-commerce product customer satisfaction
CN109117480A (en) * 2018-08-17 2019-01-01 Tencent Technology (Shenzhen) Co., Ltd. Word prediction method, apparatus, computer device and storage medium
CN111428518A (en) * 2019-01-09 2020-07-17 iFLYTEK Co., Ltd. Low-frequency word translation method and device
CN112215017A (en) * 2020-10-22 2021-01-12 Inner Mongolia University of Technology A Mongolian-Chinese machine translation method based on pseudo-parallel corpus construction


Similar Documents

Publication Publication Date Title
CN110442760B (en) A synonym mining method and device for question answering retrieval system
CN108399163B (en) Text similarity measurement method combining word aggregation and word combination semantic features
CN110543639B (en) An English sentence simplification algorithm based on pre-trained Transformer language model
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN106980609A A named entity recognition method using conditional random fields based on word vector representation
CN108073571B (en) Multi-language text quality evaluation method and system and intelligent text processing system
CN113032550B (en) An opinion summary evaluation system based on pre-trained language model
CN112541364A Chinese-Vietnamese neural machine translation method fusing multi-level linguistic feature knowledge
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN111291558A (en) Image description automatic evaluation method based on non-paired learning
CN118333067B Lao-Chinese neural machine translation method based on code-transcription-enhanced word embedding transfer
CN111832281A (en) Composition scoring method, device, computer equipment and computer-readable storage medium
CN109298796A A word association method and device
CN116976361A (en) RLHF-based self-adaptive machine translation method and storage medium
CN115033753A (en) Training corpus construction method, text processing method and device
Peng Applied mathematics and nonlinear sciences
CN113012685B (en) Audio recognition method, device, electronic device and storage medium
CN107451116A A statistical analysis method for raw big data in mobile applications
CN111815426B (en) Data processing method and terminal related to financial investment and research
CN111859915A (en) An English text category recognition method and system based on word frequency saliency level
CN116757188A (en) A cross-language information retrieval training method based on aligned query entity pairs
CN113051936A (en) Method for enhancing Hanyue neural machine translation based on low-frequency word representation
CN111626318A (en) Multi-language harmful information feature intelligent mining method based on deep learning
CN112149405A (en) Program compiling error information feature extraction method based on convolutional neural network
Yao et al. Chinese long text summarization using improved sequence-to-sequence LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210629)