CN111680520A - Chinese-Vietnamese neural machine translation method based on synonym data augmentation - Google Patents

Chinese-Vietnamese neural machine translation method based on synonym data augmentation

Info

Publication number
CN111680520A
CN111680520A
Authority
CN
China
Prior art keywords
low
synonym
sentences
words
neural machine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010366635.9A
Other languages
Chinese (zh)
Inventor
高盛祥
尤丛丛
余正涛
毛存礼
潘润海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202010366635.9A priority Critical patent/CN111680520A/en
Publication of CN111680520A publication Critical patent/CN111680520A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a Chinese-Vietnamese neural machine translation method based on synonym data augmentation, and belongs to the technical field of natural language processing. The scarcity of Chinese-Vietnamese parallel corpus resources greatly limits the performance of Chinese-Vietnamese machine translation. The method first obtains a synonym list for the low-frequency words of one language by learning monolingual word vectors, then performs synonym replacement on the low-frequency words, screens the replaced sentences with a language model, and finally pairs the screened sentences with the corresponding sentences of the other language to obtain an expanded parallel corpus. The invention provides strong support for expanding the Chinese-Vietnamese parallel corpus and alleviates the poor translation performance of Chinese-Vietnamese neural machine translation caused by scarce corpus resources.

Description

Chinese-Vietnamese neural machine translation method based on synonym data augmentation
Technical Field
The invention relates to a Chinese-Vietnamese neural machine translation method based on synonym data augmentation, and belongs to the technical field of natural language processing.
Background
In the field of Chinese and Vietnamese natural language processing, the construction of a high-quality Chinese-Vietnamese parallel corpus is the foundation and prerequisite of Chinese-Vietnamese neural machine translation. The quality and scale of the parallel corpus directly affect translation performance, so a high-quality parallel corpus must be constructed. At present, two types of data augmentation methods dominate the machine translation field: vocabulary replacement and back-translation. Because Vietnamese is a low-resource language, a bilingual dictionary is difficult to obtain, yet cross-lingual word replacement requires dictionary resources of a certain scale and is therefore hard to realize. Chinese and Vietnamese, however, both have large amounts of monolingual data, from which word vectors and related words are easy to obtain through monolingual training. The invention therefore provides a bilingual parallel corpus data augmentation method based on synonym replacement of low-frequency words: word vectors are built from monolingual data, synonyms of low-frequency words are retrieved from them, and an expanded parallel corpus is obtained through synonym replacement.
Disclosure of Invention
The invention provides a Chinese-Vietnamese neural machine translation method based on synonym data augmentation, which solves the problem of low neural machine translation performance caused by the scarcity of Chinese-Vietnamese parallel corpus resources.
The technical scheme of the invention is as follows: the Chinese-Vietnamese neural machine translation method based on synonym data augmentation comprises the following specific steps:
Step1, constructing a low-frequency word list V_R for one language from the original corpus;
Step2, searching synonyms of the low-frequency words through monolingual word vectors;
Step3, carrying out synonym replacement on the low-frequency words;
Step4, screening the replaced sentences with a language model;
and Step5, pairing the screened sentences with the corresponding sentences of the other language to obtain the expanded parallel corpus.
Further, the specific steps of Step1 are as follows:
Step1.1, firstly, constructing a vocabulary V over the source-language side of the training corpus;
Step1.2, adding words whose frequency N in V satisfies 1 ≤ N < 5 to the low-frequency word list as low-frequency words, thereby forming the low-frequency word list V_R.
Further, the specific steps of Step2 are as follows:
Step2.1, training feature vector representations of monolingual words;
Step2.2, judging the similarity between two words by calculating the cosine value between their word vectors, and thereby retrieving the synonyms of the low-frequency words;
Step2.3, constructing a synonym list of the low-frequency words.
Further, the specific steps of Step3 are as follows:
Step3.1, given a source sentence S and a position i in S;
Step3.2, where S_i is a low-frequency word, replacing S_i in S with one of its synonyms s_i', thereby generating a synonymous sentence S' of S;
and Step3.3, adding the replaced sentence to the synonym candidate list of S.
Further, the specific steps of Step4 are as follows:
Step4.1, scoring the sentences in the synonym candidate list of a given source sentence S with a KenLM language model;
and Step4.2, retaining the top M highest-scoring sentences, where M is a positive integer.
Further, for a sentence pair (S, T): since synonym replacement is performed only on the source-language side, the source sentence S and the augmented sentence S' are synonymous; Step5 therefore leaves the target sentence unchanged, and the sentence pair (S', T) is added directly to the training set to participate in model training.
The invention has the beneficial effects that: it provides strong support for expanding the Chinese-Vietnamese parallel corpus; it alleviates the poor Chinese-Vietnamese neural machine translation performance caused by scarce corpus resources, obtains a better BLEU value, and helps the translation model better recognize low-frequency words on the source-language side, thereby improving translation quality.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is an exemplary diagram of synonym substitution in the present invention;
FIG. 3 is a diagram illustrating target-sentence matching in the present invention.
Detailed Description
Example 1: as shown in FIGS. 1 to 3, the Chinese-Vietnamese neural machine translation method based on synonym data augmentation comprises the following specific steps:
Step1, firstly, a vocabulary V is constructed over the source-language side of the training corpus, and then words whose frequency N in V satisfies 1 ≤ N < 5 are added to the low-frequency word list as low-frequency words, thereby forming the low-frequency word list V_R.
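Step1 can be sketched as follows. This is a minimal Python illustration, not the patent's implementation; the function name and the toy corpus are hypothetical.

```python
from collections import Counter

def build_low_freq_list(sentences, lo=1, hi=5):
    """Step1: build the vocabulary V from tokenized source sentences and
    collect words whose frequency N satisfies lo <= N < hi into V_R."""
    vocab = Counter(w for s in sentences for w in s)
    v_r = {w for w, n in vocab.items() if lo <= n < hi}
    return vocab, v_r

# toy tokenized source-side corpus (illustrative only)
corpus = [
    ["我", "喜欢", "苹果"],
    ["我", "喜欢", "香蕉"],
    ["他", "讨厌", "香蕉"],
]
vocab, v_r = build_low_freq_list(corpus)
# every word occurs fewer than 5 times here, so all of them enter V_R
```

In a real setting only a small fraction of V falls into V_R; the 1 ≤ N < 5 band matches Step1.2 above.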
Step2, finding synonyms of low-frequency words through monolingual word vectors: the similarity between two words can be judged by calculating their distance in the monolingual semantic space. The method therefore uses large-scale Chinese and Vietnamese monolingual corpora to train feature vectors for monolingual words, judges the similarity between two words by the cosine value between their vectors, and thereby retrieves the synonyms of the low-frequency words to construct the synonym list of low-frequency words;
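The cosine ranking in Step2 can be sketched as below. The two-dimensional vectors are toy stand-ins for trained monolingual embeddings; the function names are illustrative, not from the patent.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def top_synonyms(word, vectors, k=1):
    """Rank all other words by cosine similarity to `word` and
    return the k most similar ones (the synonym candidates)."""
    ranked = sorted(
        ((cosine(vectors[word], u), w) for w, u in vectors.items() if w != word),
        reverse=True,
    )
    return [w for _, w in ranked[:k]]

# toy 2-d vectors standing in for trained monolingual embeddings
vecs = {
    "苹果": [1.0, 0.1],   # "apple"
    "香蕉": [0.9, 0.2],   # "banana" - points in a similar direction
    "讨厌": [-0.2, 1.0],  # "dislike" - nearly orthogonal
}
top_synonyms("苹果", vecs, k=1)  # -> ["香蕉"]
```

With real embeddings the k best-scoring words per low-frequency word form its row in the synonym list of Step2.3.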
Step3, synonym replacement of low-frequency words: given a source sentence S and a position i in S at which S_i is a low-frequency word, S_i is replaced in S with one of its synonyms s_i', thereby generating a synonymous sentence S' of S; the replaced sentence is added to the synonym candidate list of S. FIG. 2 shows a synonym replacement example.
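The replacement in Step3 amounts to substituting each synonym at each low-frequency position. A minimal sketch (names and example data are illustrative):

```python
def replacement_candidates(sentence, low_freq, synonyms):
    """Step3: for every position i where S_i is a low-frequency word,
    substitute each synonym s_i' to generate synonymous sentences S'."""
    candidates = []
    for i, word in enumerate(sentence):
        if word in low_freq:
            for syn in synonyms.get(word, []):
                candidates.append(sentence[:i] + [syn] + sentence[i + 1:])
    return candidates

s = ["我", "喜欢", "苹果"]
replacement_candidates(s, {"苹果"}, {"苹果": ["香蕉"]})
# -> [["我", "喜欢", "香蕉"]]
```

Each candidate differs from S in exactly one position, which keeps S and S' synonymous at the sentence level.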
Step4, screening the replaced sentences with the language model: the sentences in the synonym candidate list of a given source sentence S are scored with a KenLM language model, and the top M highest-scoring sentences are retained, where M is a positive integer.
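The screening step can be sketched with a tiny add-one-smoothed trigram scorer standing in for KenLM (the patent's KenLM model also uses n = 3); the class and the toy training data are illustrative, not the patent's implementation.

```python
from collections import Counter
from math import log

class TrigramLM:
    """Tiny add-one-smoothed trigram scorer; a stand-in for the
    KenLM model (n = 3) used in Step4."""
    def __init__(self, sentences):
        self.tri, self.bi = Counter(), Counter()
        words = set()
        for s in sentences:
            words.update(s)
            toks = ["<s>", "<s>"] + s + ["</s>"]
            for i in range(len(toks) - 2):
                self.bi[(toks[i], toks[i + 1])] += 1
                self.tri[(toks[i], toks[i + 1], toks[i + 2])] += 1
        self.v = len(words) + 1  # +1 for </s>

    def score(self, s):
        """Log-probability of a tokenized sentence under the model."""
        toks = ["<s>", "<s>"] + s + ["</s>"]
        lp = 0.0
        for i in range(len(toks) - 2):
            num = self.tri[(toks[i], toks[i + 1], toks[i + 2])] + 1
            den = self.bi[(toks[i], toks[i + 1])] + self.v
            lp += log(num / den)
        return lp

def filter_top_m(candidates, lm, m):
    """Step4.2: keep only the M highest-scoring candidate sentences."""
    return sorted(candidates, key=lm.score, reverse=True)[:m]

lm = TrigramLM([["我", "喜欢", "苹果"], ["我", "喜欢", "香蕉"]])
candidates = [["我", "喜欢", "香蕉"], ["香蕉", "我", "喜欢"]]
filter_top_m(candidates, lm, 1)  # the fluent word order scores higher
```

With the actual KenLM Python bindings, the scorer would instead be something like `kenlm.Model(path).score(" ".join(tokens))` on an ARPA/binary model trained from the monolingual corpus (assumption about the bindings' usage, not stated in the patent).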
Step5, pairing the screened sentences with the sentences of the other language to obtain the expanded parallel corpus: for a sentence pair (S, T), since synonym replacement is performed only on the source-language side, the source sentence S and the augmented sentence S' are synonymous; Step5 therefore leaves the target sentence unchanged, and the sentence pair (S', T) is added directly to the training set to participate in model training. FIG. 3 shows target-sentence matching.
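Step5 reduces to pairing every screened source sentence S' with the unchanged target T. A minimal sketch, where `candidates_for` is a hypothetical callable standing in for Steps 2-4 (replacement plus language-model screening):

```python
def augment_corpus(pairs, candidates_for):
    """Step5: the target sentence T stays unchanged; every screened
    synonymous source sentence S' is paired with T and the pair
    (S', T) joins the training set."""
    augmented = list(pairs)
    for s, t in pairs:
        for s_prime in candidates_for(s):
            augmented.append((s_prime, t))
    return augmented

pairs = [(["我", "喜欢", "苹果"], ["tôi", "thích", "táo"])]
# `candidates_for` stands in for Steps 2-4 (illustrative lambda)
aug = augment_corpus(pairs, lambda s: [["我", "喜欢", "香蕉"]])
# -> the original pair plus (S', T) sharing the same target sentence
```

Because T is reused verbatim, the target side of the expanded corpus contains only genuine sentences, which Experiment two below identifies as the reason unidirectional replacement works best.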
The BLEU value evaluates the translation accuracy of a neural machine translation model and can be used to measure model quality; the invention adopts the BLEU value as the evaluation standard of the synonym-data-augmentation-based Chinese-Vietnamese neural machine translation model.
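For reference, BLEU can be sketched as the geometric mean of modified n-gram precisions times a brevity penalty. This is a minimal single-sentence, single-reference illustration, not the corpus-level implementation used in the experiments below:

```python
from collections import Counter
from math import exp, log

def bleu(candidate, reference, max_n=4):
    """Minimal single-sentence, single-reference BLEU sketch: geometric
    mean of modified (clipped) n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(sum(cand.values()), 1))
    if min(precisions) == 0.0:
        return 0.0  # some n-gram order has no overlap at all
    bp = 1.0 if len(candidate) >= len(reference) else \
        exp(1.0 - len(reference) / len(candidate))
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

An identical candidate and reference score 1.0; production toolkits add corpus-level aggregation and smoothing on top of this core.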
To verify the effect of the synonym data augmentation method, 200,000 Chinese-Vietnamese parallel sentence pairs were collected as training data before augmentation, and a machine translation experiment was set up to compare results before and after augmentation, using a Transformer as the neural machine translation model. The hidden layer size of the model is 512 dimensions, the batch size is 64, and training runs for 20 epochs. In all experiments, the translation model vocabulary was restricted to the most common 30k words in each of Chinese and Vietnamese. Note that data augmentation does not introduce new words into the vocabulary. The KenLM language model was trained on 5 GB of Chinese text from Wikipedia; it is an n-gram language model with n set to 3.
In order to verify the effectiveness of the method, three groups of experiments are designed and evaluated through a BLEU value.
Experiment one: to achieve the best effect, synonym replacement is carried out on the source-language side with the target-language side matched, and a comparison test of experimental parameter selection is performed to evaluate the influence of (a) the method used to construct the low-frequency-word synonym list and (b) the number N of synonyms per low-frequency word on the translation model. Two construction methods are compared: 1) the Synonyms toolkit is used to build the synonym list of source-side low-frequency words; 2) a Chinese BERT model generates word vectors for source-side words, the cosine similarity between words in the vocabulary V and the low-frequency word list V_R is computed, and the synonym list is constructed in descending order of similarity. In the experiment, the number N of synonyms per low-frequency word is set from 1 to 5, raising the frequency of each low-frequency word to 5. Note that different synonym counts N do not change the corpus size: the augmented Chinese-Vietnamese parallel sentence pairs number 310k in every setting. See Table 1.
TABLE 1 comparison of BLEU values for translation Performance for Chinese-Vietnamese with different parameters
(Table 1 is provided as an image in the original publication; its values are discussed below.)
Table 1 shows that building the synonym list of source-side low-frequency words with BERT achieves higher BLEU values than building it with Synonyms. This is because the underlying technique of Synonyms is word2vec, and BERT, unlike word2vec, takes context information into account. A synonym list built with BERT therefore more readily yields highly parallel Chinese-Vietnamese sentence pairs after low-frequency word replacement, so the method of the invention adopts BERT to construct the synonym list of source-side low-frequency words.
The BERT results in Table 1 show that the translation model performs best when the number N of synonyms per source-side low-frequency word is 1, reaching a BLEU value of 11.4. The BLEU value decreases as the number of synonyms increases, reaching its lowest value of 8.6 at N = 5. This is because the synonym list is ordered by descending cosine similarity: the larger N is, the higher the probability that sentence pairs of low parallelism survive language-model screening and target-sentence matching, and the resulting corpus noise lowers the BLEU value. The method of the invention therefore sets the number N of synonyms per low-frequency word to 1.
Experiment two: to select the best realization of the idea, translation performance (BLEU) is compared between unidirectional word replacement (replacing words on the source-language side and matching the target-language side) and bidirectional word replacement (additionally replacing words on the target-language side and matching the source-language side on top of the unidirectional replacement). See Table 2.
TABLE 2 Effect of unidirectional and bidirectional word substitution on translation Performance BLEU values
(Table 2 is provided as an image in the original publication; its values are discussed below.)
TDA(C) denotes the unidirectional word replacement method, and TDA(C-V) the bidirectional one. Note that when target-side words are replaced and the source side is matched, the synonym list of target-side low-frequency words is constructed using the Vietnamese word vectors trained by E. Grave et al. with fastText on Wikipedia; the other experimental parameters are consistent with TDA(C).
The data in Table 2 show that unidirectional word replacement improves translation performance more than bidirectional word replacement. Analysis of Table 2 shows that bidirectional replacement adds 160k sentence pairs over the base corpus, of which 110k come from unidirectional (source-side) replacement and 50k from replacing target-side words and matching the source side; this is because the source-language vocabulary of the original corpus is twice the size of the target vocabulary. Unidirectional replacement works better because translation quality depends on the fluency of the target language rather than the source language: the target sentence must be a genuine sentence for the translation result to be fluent and accurate. The 50k sentence pairs obtained by replacing target-side words therefore reduce the translation effect, and the invention adopts unidirectional word replacement.
Experiment three: to demonstrate the advantage of the method, it is compared with a baseline and with back-translation. The size of the corpus directly affects the quality of a translation model, so to compare the method with back-translation on an equal footing, the size of the back-translation-augmented corpus was also set to 310k. Since Vietnamese is a low-resource language, the quality of the augmented corpus would drop noticeably if a self-trained translation model were used for back-translation; back-translation was therefore performed with Google Translate. To avoid introducing new words after augmentation, back-translation was based on the training corpus: the target side of the training corpus was translated into the source language by Google Translate to construct a Chinese-Vietnamese parallel corpus, from which 110k parallel sentence pairs were randomly extracted and added to model training. See Table 3.
TABLE 3 comparison of BLEU values for translation Performance for Chinese-Vietnamese in different methods
Model             Data   BLEU
Baseline          200k   9.6
Back-translation  310k   10.3
TDA(C)            310k   11.4
Baseline denotes the unaugmented baseline, Back-translation the back-translation method, and TDA(C) the method of the present invention.
The data in Table 3 show that both the proposed method and back-translation effectively improve translation quality, gaining 1.8 and 0.7 BLEU over the baseline respectively. We attribute the best BLEU value to the fact that TDA(C) helps the Chinese-Vietnamese neural machine translation model better recognize low-frequency words on the source-language side, thereby improving translation quality.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. A Chinese-Vietnamese neural machine translation method based on synonym data augmentation, characterized by comprising the following specific steps:
Step1, constructing a low-frequency word list V_R for one language from the original corpus;
Step2, searching synonyms of the low-frequency words through monolingual word vectors;
Step3, carrying out synonym replacement on the low-frequency words;
Step4, screening the replaced sentences with a language model;
and Step5, pairing the screened sentences with the corresponding sentences of the other language to obtain the expanded parallel corpus.
2. The Chinese-Vietnamese neural machine translation method based on synonym data augmentation of claim 1, wherein the specific steps of Step1 are as follows:
Step1.1, firstly, constructing a vocabulary V over the source-language side of the training corpus;
Step1.2, adding words whose frequency N in V satisfies 1 ≤ N < 5 to the low-frequency word list as low-frequency words, thereby forming the low-frequency word list V_R.
3. The Chinese-Vietnamese neural machine translation method based on synonym data augmentation of claim 1, wherein the specific steps of Step2 are as follows:
Step2.1, training feature vector representations of monolingual words;
Step2.2, judging the similarity between two words by calculating the cosine value between their word vectors, and thereby retrieving the synonyms of the low-frequency words;
Step2.3, constructing a synonym list of the low-frequency words.
4. The Chinese-Vietnamese neural machine translation method based on synonym data augmentation of claim 1, wherein the specific steps of Step3 are as follows:
Step3.1, given a source sentence S and a position i in S;
Step3.2, where S_i is a low-frequency word, replacing S_i in S with one of its synonyms s_i', thereby generating a synonymous sentence S' of S;
and Step3.3, adding the replaced sentence to the synonym candidate list of S.
5. The Chinese-Vietnamese neural machine translation method based on synonym data augmentation of claim 1, wherein the specific steps of Step4 are as follows:
Step4.1, scoring the sentences in the synonym candidate list of a given source sentence S with a KenLM language model;
and Step4.2, retaining the top M highest-scoring sentences, where M is a positive integer.
6. The Chinese-Vietnamese neural machine translation method based on synonym data augmentation of claim 1, wherein: for a sentence pair (S, T), since synonym replacement is performed only on the source-language side, the source sentence S and the augmented sentence S' are synonymous; Step5 leaves the target sentence unchanged, and the sentence pair (S', T) is added directly to the training set to participate in model training.
CN202010366635.9A 2020-04-30 2020-04-30 Synonym data enhancement-based Hanyue neural machine translation method Pending CN111680520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010366635.9A CN111680520A (en) 2020-04-30 2020-04-30 Synonym data enhancement-based Hanyue neural machine translation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010366635.9A CN111680520A (en) 2020-04-30 2020-04-30 Synonym data enhancement-based Hanyue neural machine translation method

Publications (1)

Publication Number Publication Date
CN111680520A true CN111680520A (en) 2020-09-18

Family

ID=72452220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010366635.9A Pending CN111680520A (en) 2020-04-30 2020-04-30 Synonym data enhancement-based Hanyue neural machine translation method

Country Status (1)

Country Link
CN (1) CN111680520A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215017A (en) * 2020-10-22 2021-01-12 内蒙古工业大学 Mongolian Chinese machine translation method based on pseudo parallel corpus construction
CN112668325A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Machine translation enhancing method, system, terminal and storage medium
CN112686028A (en) * 2020-12-25 2021-04-20 掌阅科技股份有限公司 Text translation method based on similar words, computing equipment and computer storage medium
WO2023116655A1 (en) * 2021-12-20 2023-06-29 华为技术有限公司 Communication method and apparatus

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739395A (en) * 2009-12-31 2010-06-16 程光远 Machine translation method and system
CN108920473A (en) * 2018-07-04 2018-11-30 中译语通科技股份有限公司 A kind of data enhancing machine translation method based on similar word and synonym replacement
CN110472252A (en) * 2019-08-15 2019-11-19 昆明理工大学 The method of the more neural machine translation of the Chinese based on transfer learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739395A (en) * 2009-12-31 2010-06-16 程光远 Machine translation method and system
CN108920473A (en) * 2018-07-04 2018-11-30 中译语通科技股份有限公司 A kind of data enhancing machine translation method based on similar word and synonym replacement
CN110472252A (en) * 2019-08-15 2019-11-19 昆明理工大学 The method of the more neural machine translation of the Chinese based on transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Marzieh Fadaee: "Data Augmentation for Low-Resource Neural Machine Translation", Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215017A (en) * 2020-10-22 2021-01-12 内蒙古工业大学 Mongolian Chinese machine translation method based on pseudo parallel corpus construction
CN112215017B (en) * 2020-10-22 2022-04-29 内蒙古工业大学 Mongolian Chinese machine translation method based on pseudo parallel corpus construction
CN112668325A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Machine translation enhancing method, system, terminal and storage medium
CN112668325B (en) * 2020-12-18 2024-05-10 平安科技(深圳)有限公司 Machine translation enhancement method, system, terminal and storage medium
CN112686028A (en) * 2020-12-25 2021-04-20 掌阅科技股份有限公司 Text translation method based on similar words, computing equipment and computer storage medium
WO2023116655A1 (en) * 2021-12-20 2023-06-29 华为技术有限公司 Communication method and apparatus

Similar Documents

Publication Publication Date Title
CN108399163B (en) Text similarity measurement method combining word aggregation and word combination semantic features
Deng et al. Structure-grounded pretraining for text-to-SQL
CN111680520A (en) Synonym data enhancement-based Hanyue neural machine translation method
US10268685B2 (en) Statistics-based machine translation method, apparatus and electronic device
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
KR101923650B1 (en) System and Method for Sentence Embedding and Similar Question Retrieving
CN108021555A (en) A kind of Question sentence parsing measure based on depth convolutional neural networks
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
He et al. Question answering over linked data using first-order logic
CN108491399B (en) Chinese-English machine translation method based on context iterative analysis
Chakravarthi et al. Bilingual lexicon induction across orthographically-distinct under-resourced Dravidian languages
Xiong et al. HANSpeller: a unified framework for Chinese spelling correction
CN114388141A (en) Medicine relation extraction method based on medicine entity word mask and Insert-BERT structure
Zhang et al. Chinese-English mixed text normalization
CN116757188A (en) Cross-language information retrieval training method based on alignment query entity pairs
CN111325015A (en) Document duplicate checking method and system based on semantic analysis
Rytting et al. Spelling correction for dialectal Arabic dictionary lookup
Wang et al. Chinese text error correction suggestion generation based on SoundShape code
Chen et al. Semi-supervised lexicon learning for wide-coverage semantic parsing
KR102341563B1 (en) Method for extracting professional text data using mediating text data topics
CN111709245A (en) Chinese-Yuan pseudo parallel sentence pair extraction method based on semantic self-adaptive coding
Lei et al. An enhanced computational feature selection method for medical synonym identification via bilingualism and multi-corpus training
CN115034239B (en) Machine translation method of Han-Yue nerve based on noise reduction prototype sequence
Trieu et al. A New Feature to Improve Moore's Sentence Alignment Method
CN113946666B (en) Domain-awareness-based simple question knowledge base question-answering method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200918