WO2023240839A1 - Machine translation method and apparatus, and computer device and storage medium - Google Patents

Machine translation method and apparatus, and computer device and storage medium

Info

Publication number
WO2023240839A1
Authority
WO
WIPO (PCT)
Prior art keywords
translation
matched
translated
word
language data
Prior art date
Application number
PCT/CN2022/122036
Other languages
French (fr)
Chinese (zh)
Inventor
贺傲飞
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2023240839A1 publication Critical patent/WO2023240839A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • This application relates to the technical fields of artificial intelligence and speech processing, and in particular to a machine translation method, device, computer equipment and storage medium.
  • Machine translation refers to the process of using computers to convert one natural language (source language) into another natural language (target language).
  • The core of neural-network-based machine translation is a deep neural network with a large number of nodes (neurons), which can automatically learn translation knowledge from a corpus. After a sentence in one language is vectorized, it is passed layer by layer through the network and converted into a representation that the computer can "understand". Then, through multiple layers of complex operations, a translation in another language is generated, realizing an "understand the language, then generate the translation" approach.
  • Machine translation usually uses an encoder-decoder structure to model variable-length input sentences.
  • The encoder "understands" the source language sentence and forms a floating-point vector of a specific dimension.
  • The decoder then generates the target language translation word by word based on this vector.
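  • This application provides a machine translation method, which includes:
  • obtaining the source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in it; inputting the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, where the target machine translation model is obtained by training on sample data; and replacing the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.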
  • this application also provides a machine translation device, which includes:
  • The acquisition module is used to obtain the source language data to be translated. The matching module is used to perform forward maximum matching on the source language data to be translated and determine the domain proper nouns in it. The translation module is used to input the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and to input the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, where the target machine translation model is obtained by training on sample data.
  • The replacement module is used to replace the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.
  • this application also provides a computer device.
  • the computer device includes a memory and a processor.
  • the memory stores a computer program.
  • When the processor executes the computer program, it implements the following steps:
  • the target machine translation model is obtained by training on sample data; the corresponding translation results in the translation target language data are replaced with the proper noun translation results to obtain the machine translation result.
  • this application also provides a computer-readable storage medium.
  • the computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by the processor, the following steps are implemented:
  • the target machine translation model is obtained by training on sample data; the corresponding translation results in the translation target language data are replaced with the proper noun translation results to obtain the machine translation result.
  • this application also provides a computer program product.
  • the computer program product includes a computer program that implements the following steps when executed by a processor:
  • the target machine translation model is obtained by training on sample data; the corresponding translation results in the translation target language data are replaced with the proper noun translation results to obtain the machine translation result.
  • With the above machine translation method, apparatus, computer device, storage medium and computer program product, the source language data to be translated is obtained and forward maximum matching is performed on it, so that the domain proper nouns in the source language data to be translated can be determined. The domain proper nouns are input into the target machine translation model for translation to obtain the proper noun translation results, and the source language data to be translated is input into the target machine translation model for translation to obtain the translation target language data. Replacing the corresponding translation results in the translation target language data with the proper noun translation results improves the accuracy with which the target machine translation model translates domain proper nouns, so that accurate machine translation results are obtained.
  • Figure 1 is a schematic flowchart of a machine translation method in one embodiment
  • Figure 2 is a schematic flow chart of a machine translation method in another embodiment
  • Figure 3 is a schematic flowchart of a machine translation method in yet another embodiment
  • Figure 4 is a structural block diagram of a machine translation device in one embodiment
  • Figure 5 is an internal structure diagram of a computer device in one embodiment.
  • a machine translation method is provided.
  • This embodiment is illustrated by applying the method to a terminal. It can be understood that the method can also be applied to a server, or to a system that includes a terminal and a server and is implemented through interaction between the terminal and the server.
  • the terminal can be, but is not limited to, various personal computers, laptops, smartphones, tablets, Internet of Things devices and portable wearable devices.
  • the Internet of Things devices can be smart speakers, smart TVs, smart air conditioners, smart vehicle-mounted devices, etc.
  • Portable wearable devices can be smart watches, smart bracelets, head-mounted devices, etc.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers. In this embodiment, the method includes the following steps:
  • Step 102 Obtain the source language data to be translated.
  • The source language data to be translated refers to the data that needs to be translated.
  • For example, the source language data to be translated may be Chinese.
  • As another example, the source language data to be translated may be English.
  • The terminal obtains the source language data to be translated.
  • Step 104 Perform forward maximum matching on the source language data to be translated, and determine the domain-specific nouns in the source language data to be translated.
  • Forward maximum matching refers to extracting, by successive extension, the longest phrase in the source language data to be translated that matches the preset proper noun dictionary.
  • Domain proper nouns are nouns that are specific to a field. For example, in the medical field, domain proper nouns can be disease names, drug names, etc.
  • The terminal segments the source language data to be translated to obtain its words, uses these words as the words to be matched, and performs forward maximum matching on the words to be matched against the preset proper noun dictionary to obtain the domain proper nouns corresponding to the words to be matched. Based on these, it determines the domain proper nouns in the source language data to be translated.
  • The preset proper noun dictionary is a dictionary, set up in advance, that is composed of the proper nouns of a field.
  • For example, in the medical field, the preset proper noun dictionary may be a dictionary composed of proper nouns such as disease names and drug names.
  • Step 106 Input the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and input the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, where the target machine translation model is obtained by training on sample data.
  • the target machine translation model refers to a model obtained by training sample data and can be used for machine translation, and can translate the source language data to be translated into the translation target language data.
  • The sample data may specifically be a set of sample translation sentence pairs.
  • the sample translation sentence pairs refer to sentence pairs including sample source language data and sample target language data.
  • the sample target language data is the translation result of the sample source language data.
  • the translation results of proper nouns refer to the translation results of domain proper nouns output by the target machine translation model.
  • The translation target language data refers to the translation result that the target machine translation model outputs for the source language data to be translated.
  • The terminal marks the domain proper nouns in the source language data to be translated to obtain the annotation results, inputs the domain proper nouns into the target machine translation model, which outputs the proper noun translation results, and inputs the source language data to be translated into the target machine translation model for translation to obtain the translation target language data.
  • The target machine translation model can include at least two sub-machine translation models; that is, the terminal can train multiple sub-machine translation models with different random loss rates and then use them to translate the source language data to be translated.
  • For each sub-machine translation model, the terminal inputs the source language data to be translated into that model to obtain the translation result corresponding to it.
  • The translation result includes, for each word in the source language data to be translated, the word probability of the predicted corresponding word.
  • The terminal sorts the word probabilities that the sub-machine translation models output for the same word and, based on the ranking, determines the optimal prediction result for that word, that is, its optimal translation result.
  • The optimal translation results of all words together form the translation target language data. After sorting, the terminal determines the maximum word probability for each word and uses the word corresponding to the maximum word probability as the optimal prediction result.
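  • The selection of the optimal prediction among several sub-machine translation models can be sketched as follows. This is a minimal illustration, not the application's implementation: it assumes each sub-model returns one (word, probability) prediction per output position and that all outputs have the same length. The names `Prediction` and `pick_best_words` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical output format: each sub-model returns one
# (word, probability) prediction per position of the translation.
Prediction = Tuple[str, float]


def pick_best_words(sub_model_outputs: List[List[Prediction]]) -> List[str]:
    """For each position, keep the word with the highest word probability."""
    if not sub_model_outputs:
        return []
    length = len(sub_model_outputs[0])
    if any(len(out) != length for out in sub_model_outputs):
        raise ValueError("this sketch assumes equal-length outputs")
    best = []
    for position in range(length):
        candidates = [out[position] for out in sub_model_outputs]
        # Sort by probability, highest first, and take the top candidate.
        ranked = sorted(candidates, key=lambda p: p[1], reverse=True)
        best.append(ranked[0][0])
    return best


outputs = [
    [("the", 0.91), ("patient", 0.62), ("has", 0.80), ("diabetes", 0.55)],
    [("the", 0.88), ("patient", 0.70), ("got", 0.41), ("diabetes", 0.93)],
]
result = pick_best_words(outputs)
```

  • In a real decoder the sub-models may produce outputs of different lengths, so a position-wise comparison like this would need an alignment step that the text does not describe.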
  • Step 108 Replace the corresponding translation result in the translation target language data with the proper noun translation result to obtain the machine translation result.
  • After obtaining the proper noun translation results and the translation target language data, the terminal replaces the corresponding translation results in the translation target language data with the proper noun translation results, based on the annotation results of the source language data to be translated, and obtains the machine translation result.
  • In the above machine translation method, the source language data to be translated is obtained and forward maximum matching is performed on it, so that the domain proper nouns in the source language data to be translated can be determined. The domain proper nouns are input into the target machine translation model for translation to obtain the proper noun translation results, and the source language data to be translated is input into the target machine translation model for translation to obtain the translation target language data. Replacing the corresponding translation results in the translation target language data with the proper noun translation results improves the accuracy with which the target machine translation model translates domain proper nouns, so that accurate machine translation results are obtained.
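  • Steps 102 to 108 can be sketched end to end as below. The text does not say how the annotation results locate the "corresponding translation result" in the output, so this sketch makes a strong assumption: each domain proper noun is wrapped in a tag pair such as `<t0>…</t0>` that the model copies into its output. The tag format, `mark_terms`, `translate_with_terms` and the lookup-table `toy_translate` are all hypothetical stand-ins.

```python
import re
from typing import Callable, List


def mark_terms(source: str, terms: List[str]) -> str:
    """Wrap each domain proper noun in <tN>...</tN> (hypothetical annotation)."""
    marked = source
    for i, term in enumerate(terms):
        marked = marked.replace(term, f"<t{i}>{term}</t{i}>")
    return marked


def translate_with_terms(
    source: str, terms: List[str], translate: Callable[[str], str]
) -> str:
    # Translate each proper noun on its own.
    term_translations = [translate(term) for term in terms]
    # Translate the whole (annotated) sentence.
    translated = translate(mark_terms(source, terms))
    # Overwrite each tagged span with the stand-alone term translation.
    for i, term_translation in enumerate(term_translations):
        pattern = re.compile(rf"<t{i}>.*?</t{i}>")
        translated = pattern.sub(lambda _m, t=term_translation: t, translated)
    return translated


def toy_translate(text: str) -> str:
    """Stand-in for the target machine translation model (a lookup table)."""
    table = {
        "患者 患有 <t0>二型糖尿病</t0>": "the patient has <t0>two type sugar disease</t0>",
        "二型糖尿病": "type 2 diabetes",
    }
    return table[text]


machine_translation = translate_with_terms(
    "患者 患有 二型糖尿病", ["二型糖尿病"], toy_translate
)
```

  • The sketch does not handle overlapping or nested terms, and a real model may not preserve such tags, in which case a different alignment between source annotations and output spans would be needed.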
  • Performing forward maximum matching on the source language data to be translated and determining the domain proper nouns in it includes: using the words in the source language data to be translated as the words to be matched; performing forward maximum matching on the words to be matched to obtain the domain proper nouns corresponding to the words to be matched; and determining the domain proper nouns in the source language data to be translated based on the domain proper nouns corresponding to the words to be matched.
  • The terminal segments the source language data to be translated to obtain its words, uses these words as the words to be matched, and compares each word to be matched with the preset proper noun dictionary to determine whether the dictionary contains a matching word corresponding to the word to be matched. When it does, the terminal obtains the next word after the word to be matched in the source language data to be translated, combines the word to be matched with that next word to obtain a phrase to be matched, and continues forward maximum matching by comparing the phrase to be matched with the preset proper noun dictionary, thereby obtaining the domain proper noun corresponding to the word to be matched.
  • After obtaining the domain proper nouns corresponding to the words to be matched, the terminal deduplicates them to obtain the domain proper nouns in the source language data to be translated.
  • In this embodiment, by using the words in the source language data to be translated as the words to be matched and performing forward maximum matching on them, the domain proper nouns corresponding to the words to be matched can be obtained, and the domain proper nouns in the source language data to be translated can then be determined from them.
  • Performing forward maximum matching on the word to be matched and obtaining the domain proper noun corresponding to the word to be matched includes: when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtaining the next word corresponding to the word to be matched in the source language data to be translated; combining the word to be matched and the next word corresponding to it to obtain a phrase to be matched; when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtaining the next word corresponding to the phrase to be matched in the source language data to be translated; combining the phrase to be matched and the next word corresponding to it to obtain a new phrase to be matched, and returning to the step of obtaining, when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, the next word corresponding to the phrase to be matched in the source language data to be translated, until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary; and deleting the last appended next word from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched.
  • When performing forward maximum matching on a word to be matched, the terminal compares the word to be matched with the preset proper noun dictionary. When there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, the terminal obtains the next word corresponding to the word to be matched in the source language data to be translated, that is, the word immediately after it, combines the two to obtain a phrase to be matched, and continues by comparing the phrase to be matched with the preset proper noun dictionary.
  • When there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, the terminal obtains the next word corresponding to the phrase to be matched in the source language data to be translated, that is, the word immediately after the phrase, and combines the phrase to be matched with that word to obtain a new phrase to be matched. It then returns to the step of obtaining the next word corresponding to the phrase to be matched when the preset proper noun dictionary contains a matching word for it.
  • This is repeated until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary. The terminal then deletes the last appended next word from the latest phrase to be matched, and the result is the domain proper noun corresponding to the word to be matched.
  • In this embodiment, the word to be matched and the next word corresponding to it are combined to obtain the phrase to be matched, and the phrase to be matched continues to be compared with the preset proper noun dictionary, so that the domain proper nouns corresponding to the words to be matched can be obtained through forward maximum matching.
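  • The forward maximum matching procedure described above can be sketched as follows. The sketch assumes the input is already segmented into words and that a "matching word" for a partial phrase means some dictionary entry starts with that phrase; the text does not spell this out. Dictionary entries are stored as tuples of words, and `build_prefixes` and `forward_maximum_match` are hypothetical names.

```python
from typing import List, Set, Tuple

Entry = Tuple[str, ...]


def build_prefixes(dictionary: Set[Entry]) -> Set[Entry]:
    """Collect every leading word sequence of every dictionary entry."""
    prefixes = set()
    for entry in dictionary:
        for end in range(1, len(entry) + 1):
            prefixes.add(entry[:end])
    return prefixes


def forward_maximum_match(words: List[str], dictionary: Set[Entry]) -> List[str]:
    """Return the deduplicated domain proper nouns found in `words`."""
    prefixes = build_prefixes(dictionary)
    found: List[str] = []
    i = 0
    while i < len(words):
        end = i
        longest = 0  # length of the longest full entry starting at i
        # Keep appending the next word while the phrase can still match.
        while end < len(words) and tuple(words[i:end + 1]) in prefixes:
            end += 1
            if tuple(words[i:end]) in dictionary:
                longest = end - i
        if longest:
            term = " ".join(words[i:i + longest])
            if term not in found:  # deduplicate
                found.append(term)
            i += longest
        else:
            i += 1
    return found


dictionary = {("type", "2", "diabetes"), ("diabetes",), ("metformin",)}
words = "patient has type 2 diabetes and takes metformin".split()
terms = forward_maximum_match(words, dictionary)
```

  • Unlike the description, which treats every word as a word to be matched and deduplicates afterwards, this sketch skips past a matched term, so "diabetes" inside "type 2 diabetes" is not reported separately.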
  • The machine translation method further includes: obtaining a sample translation sentence pair set and an initial machine translation model; calculating the word number ratio of each sample translation sentence pair in the sample translation sentence pair set, where the word number ratio is the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filtering the sample translation sentence pair set according to the word number ratio to obtain a filtered sample translation sentence pair set; and training the initial machine translation model based on the filtered sample translation sentence pair set to obtain the target machine translation model.
  • the initial machine translation model refers to a machine translation model that has not yet undergone parameter training.
  • The number of source language words refers to the total number of source language words in the sample translation sentence pair, and the number of target language words refers to the total number of target language words in the sample translation sentence pair.
  • For example, the number of source language words may be the total number of Chinese words in the sample translation sentence pair, and the number of target language words the total number of English words in the sample translation sentence pair.
  • As another example, the number of source language words may be the total number of English words in the sample translation sentence pair, and the number of target language words the total number of Chinese words in the sample translation sentence pair.
  • The sample translation sentence pairs include real translation sentence pairs and back-translation sentence pairs.
  • Real translation sentence pairs are the translation sentence pairs obtained by translating the original source language data into the corresponding original target language data.
  • Back-translation sentence pairs are the translation sentence pairs obtained by translating the original target language data.
  • Training with real translation sentence pairs and back-translation sentence pairs at the same time can improve the accuracy of the model.
  • Before machine translation is performed, the target machine translation model needs to be trained. The terminal obtains the sample translation sentence pair set and the initial machine translation model, and calculates the word number ratio of each sample translation sentence pair in the set. From the word number ratios it obtains the corresponding data distribution, uses the data distribution to filter the sample translation sentence pairs in the set to obtain the filtered sample translation sentence pair set, and uses the filtered set to train the initial machine translation model to obtain the target machine translation model.
  • In this embodiment, filtering the sample translation sentence pair set by the word number ratio removes deviating samples, improves the quality of model training, and reduces irrelevant data noise. Training the initial machine translation model on the filtered sample translation sentence pair set then yields a target machine translation model that can support accurate translation.
  • Obtaining the sample translation sentence pair set includes: obtaining an original translation sentence pair set, which includes original translation sentence pairs; performing word segmentation on the original source language data in the original translation sentence pairs to obtain word segmentation results, and counting the character length of each target language word in the original target language data in the original translation sentence pairs; filtering the original translation sentence pair set based on the word segmentation results and the character lengths; and using the filtered original translation sentence pair set as the sample translation sentence pair set.
  • The original translation sentence pairs include real translation sentence pairs and back-translation sentence pairs.
  • When obtaining the sample translation sentence pair set, the terminal first obtains the original translation sentence pair set, performs word segmentation on the original source language data in the original translation sentence pairs to obtain the word segmentation results, and counts the character length of each target language word in the original target language data in the original translation sentence pairs. Based on the word segmentation results, it filters out the original translation sentence pairs whose original source language data has a sentence length greater than the preset sentence length threshold and/or a number of words greater than the preset word number threshold. It also filters out the original translation sentence pairs whose original target language data contains a word with a character length greater than the preset character length threshold. The filtered original translation sentence pair set is used as the sample translation sentence pair set.
  • the preset sentence length threshold, the preset word number threshold, and the preset character length threshold can all be set as needed, and are not specifically limited in this embodiment.
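  • The three length filters can be sketched as below. The threshold values, the whitespace-based `segment` callback and the name `filter_by_length` are assumptions made for illustration; the text leaves the thresholds open and does not name a word segmenter.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (original source language data, original target language data)


def filter_by_length(
    pairs: List[Pair],
    segment: Callable[[str], List[str]],
    max_sentence_chars: int,
    max_source_words: int,
    max_target_word_chars: int,
) -> List[Pair]:
    """Drop pairs whose source is too long or whose target has an overlong word."""
    kept = []
    for source, target in pairs:
        source_words = segment(source)
        if len(source) > max_sentence_chars:
            continue  # sentence length above the threshold
        if len(source_words) > max_source_words:
            continue  # too many source words
        if any(len(word) > max_target_word_chars for word in target.split()):
            continue  # a target word with too many characters
        kept.append((source, target))
    return kept


pairs = [
    ("患者 患有 糖尿病", "the patient has diabetes"),
    ("好", "ok " + "a" * 40),
    ("一 二 三 四 五 六", "one two three four five six"),
]
# The example sources are pre-segmented with spaces, so str.split stands in
# for a real word segmenter here.
kept = filter_by_length(pairs, str.split, 50, 5, 20)
```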
  • When obtaining the original translation sentence pair set, the terminal first obtains the unintegrated real translation sentence pairs and back-translation sentence pairs, and integrates them through a deduplication operation to obtain the original translation sentence pair set.
  • The simHash algorithm can be used to deduplicate sentences. Its core idea is to compute a simHash value for each text to be deduplicated, split the simHash value into segments to build an inverted index, and run the deduplication in parallel over the hash values of each segment.
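  • A minimal, single-process sketch of simHash deduplication with a segmented inverted index follows. The 64-bit fingerprint, the four 16-bit segments, the MD5 token hash, whitespace tokenization and the Hamming-distance threshold of 3 are all assumptions chosen for illustration, and the parallelization mentioned in the text is omitted.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

BITS = 64
BANDS = 4  # the fingerprint is split into 4 segments of 16 bits


def simhash(text: str) -> int:
    """Compute a 64-bit simHash fingerprint over whitespace tokens."""
    weights = [0] * BITS
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    value = 0
    for bit in range(BITS):
        if weights[bit] > 0:
            value |= 1 << bit
    return value


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def deduplicate(texts: List[str], max_distance: int = 3) -> List[str]:
    """Keep the first text of every group of near-duplicates."""
    # Inverted index: (segment number, segment value) -> fingerprints seen.
    index: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    band_bits = BITS // BANDS
    mask = (1 << band_bits) - 1
    kept = []
    for text in texts:
        fp = simhash(text)
        keys = [(b, (fp >> (b * band_bits)) & mask) for b in range(BANDS)]
        # Only fingerprints sharing at least one segment are compared.
        candidates = {c for key in keys for c in index[key]}
        if any(hamming(fp, c) <= max_distance for c in candidates):
            continue
        kept.append(text)
        for key in keys:
            index[key].append(fp)
    return kept


unique = deduplicate(["a b c", "a b c", "x y z"])
```

  • With four segments and a distance threshold of 3, two fingerprints that differ in at most three bits must agree on at least one whole segment, which is why looking up candidates by segment does not miss near-duplicates.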
  • In this embodiment, the original source language data in the original translation sentence pairs is segmented to obtain the word segmentation results, the character length of each target language word in the original target language data is counted, and the original translation sentence pair set is filtered based on the word segmentation results and the character lengths. This filters out deviating samples, improves the quality of model training, and reduces irrelevant data noise.
  • Filtering the sample translation sentence pair set according to the word number ratio to obtain the filtered sample translation sentence pair set includes: performing statistics on the word number ratios to obtain the data distribution corresponding to the word number ratio; and filtering the sample translation sentence pair set according to the data distribution to obtain the filtered sample translation sentence pair set.
  • By counting the word number ratios, the terminal obtains the data distribution corresponding to the word number ratio, and can then filter the sample translation sentence pair set according to the data distribution and a preset ratio threshold to obtain the filtered sample translation sentence pair set.
  • The preset ratio threshold can be set as needed and is not specifically limited in this embodiment. Further, the preset ratio threshold may include a first ratio threshold and a second ratio threshold, where the first ratio threshold is used to filter out sample translation sentence pairs with a smaller word number ratio, and the second ratio threshold is used to filter out sample translation sentence pairs with a larger word number ratio.
  • In this embodiment, statistics are performed on the word number ratios to obtain the corresponding data distribution, and the sample translation sentence pair set is filtered according to the data distribution to obtain the filtered sample translation sentence pair set. This filters out deviating samples, improves the quality of model training, and reduces irrelevant data noise.
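  • The word number ratio filter with a first (lower) and second (upper) ratio threshold can be sketched as below. The text does not say how the thresholds follow from the data distribution, so deriving them as mean ± k standard deviations in `ratio_thresholds` is purely an assumption. Whitespace-separated words and the function names are also hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence), pre-segmented


def word_ratio(pair: Pair) -> float:
    """Number of source language words divided by number of target language words."""
    source, target = pair
    target_words = len(target.split())
    if target_words == 0:
        return float("inf")  # an empty target is always out of range
    return len(source.split()) / target_words


def ratio_thresholds(pairs: List[Pair], k: float = 2.0) -> Tuple[float, float]:
    """Assumed rule: thresholds are mean +/- k standard deviations of the ratios."""
    ratios = [word_ratio(p) for p in pairs]
    center, spread = mean(ratios), pstdev(ratios)
    return center - k * spread, center + k * spread


def filter_by_ratio(pairs: List[Pair], low: float, high: float) -> List[Pair]:
    """Keep pairs whose word number ratio lies within [low, high]."""
    return [p for p in pairs if low <= word_ratio(p) <= high]


pairs = [
    ("a b c d", "w x y z"),      # ratio 1.0
    ("a", "v w x y z"),          # ratio 0.2, too small
    ("a b c d e f", "x y"),      # ratio 3.0, too large
    ("a b c", "x y"),            # ratio 1.5
]
filtered = filter_by_ratio(pairs, low=0.5, high=2.0)
low, high = ratio_thresholds(pairs, k=1.0)
```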
  • Training the initial machine translation model according to the filtered sample translation sentence pair set to obtain the target machine translation model includes: training the initial machine translation model according to the filtered sample translation sentence pair set to obtain a machine translation model to be optimized; obtaining a translation evaluation source language data set, and translating the translation evaluation source language in the translation evaluation source language data set through the machine translation model to be optimized to obtain a translation evaluation target language data set; obtaining a translation evaluation translation sentence pair set based on the translation evaluation source language data set and the translation evaluation target language data set; and training the machine translation model to be optimized based on the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set to obtain the target machine translation model.
  • the translation evaluation source language data set refers to the data set used to evaluate the translation model.
  • the translation evaluation source language data set may specifically refer to the evaluation set of the International Machine Translation Competition.
  • After the terminal trains the initial machine translation model on the filtered sample translation sentence pair set, it obtains the machine translation model to be optimized, which still needs to be optimized to obtain the target machine translation model.
  • The terminal first obtains the translation evaluation source language data set, uses the machine translation model to be optimized to translate the translation evaluation source language in that data set to obtain the translation evaluation target language data set, and then uses the translation evaluation source language data set and the translation evaluation target language data set together as the translation evaluation translation sentence pair set.
  • The filtered sample translation sentence pair set and the translation evaluation translation sentence pair set are then used to train the machine translation model to be optimized to obtain the target machine translation model.
  • The terminal first filters the translation evaluation translation sentence pair set, and then uses the filtered sample translation sentence pair set and the filtered translation evaluation translation sentence pair set to train the machine translation model to be optimized, obtaining a machine translation model to be updated. It uses the machine translation model to be updated to translate the translation evaluation source language in the filtered translation evaluation translation sentence pair set, obtaining the translation evaluation target language corresponding to the translation evaluation source language.
  • It then uses this translation evaluation target language to update the filtered translation evaluation translation sentence pair set, that is, to replace the translation results corresponding to the translation evaluation source language in that set, and uses the filtered sample translation sentence pair set and the updated translation evaluation translation sentence pair set to train the machine translation model to be updated, obtaining the target machine translation model.
  • the method used when filtering the set of translation evaluation sentences is the same as the method used when filtering the set of original translation sentences and the set of sample translation sentences. This embodiment will not be described here.
  • The terminal can obtain the target machine translation model through iterative training. That is, the terminal uses the filtered sample translation sentence pair set and the updated translation evaluation translation sentence pair set to train the machine translation model to be updated, obtaining a new machine translation model to be updated, and then returns to the step of using the machine translation model to be updated to translate the translation evaluation source language in the filtered translation evaluation translation sentence pair set. This repeats until the number of iterations reaches the preset iteration threshold, after which the target machine translation model is obtained from the latest machine translation model to be updated.
  • the terminal will also obtain professional corpus in the field, and use the professional corpus in the field to train the latest machine translation model to be updated to obtain the target machine translation model.
  • In this embodiment, the initial machine translation model is trained on the filtered sample translation sentence pair set to obtain the machine translation model to be optimized. A translation evaluation source language data set is obtained, and the translation evaluation source language in it is translated by the machine translation model to be optimized to obtain the translation evaluation target language data set. A translation evaluation translation sentence pair set is obtained from the translation evaluation source language data set and the translation evaluation target language data set. The machine translation model to be optimized can then be trained on the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set to obtain the target machine translation model.
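  • The iterative optimization described above can be sketched as a simple loop. This is only an illustrative outline: `train` and `translate` are hypothetical callables standing in for the real training and inference routines, the filtering of the evaluation pairs is omitted for brevity, and the optional final training on professional corpus in the field is not shown.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Pair = Tuple[str, str]  # (source sentence, target sentence)


def iterative_optimization(
    model: Model,
    sample_pairs: List[Pair],
    eval_sources: List[str],
    train: Callable[[Model, List[Pair]], Model],   # hypothetical training routine
    translate: Callable[[Model, str], str],        # hypothetical inference routine
    max_iterations: int,                           # the preset iteration threshold
) -> Model:
    """Alternate between re-translating the evaluation sources and retraining."""
    for _ in range(max_iterations):
        # Translate the evaluation source sentences with the current model to
        # build (or refresh) the translation evaluation sentence pairs.
        eval_pairs = [(src, translate(model, src)) for src in eval_sources]
        # Retrain on the filtered sample pairs plus the refreshed evaluation pairs.
        model = train(model, sample_pairs + eval_pairs)
    return model
```

  • Passing the training and translation routines in as parameters keeps the sketch independent of any particular machine translation toolkit.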
  • The machine translation method further includes: performing proper noun recognition on the source language data to be translated using a pre-trained proper noun recognition model, and expanding a preset proper noun dictionary based on the recognition results.
  • When performing machine translation, the terminal performs proper noun recognition on the source language data to be translated with the pre-trained proper noun recognition model, and expands the preset proper noun dictionary based on the recognition results so that more proper nouns can be identified during matching.
  • The pre-trained proper noun recognition model is obtained by training on a sample proper noun set carrying sequence annotations.
  • The pre-trained proper noun recognition model can be a BERT (Bidirectional Encoder Representations from Transformers) + CRF (Conditional Random Field) model.
  • Through the BERT model, the source language data to be translated can be sequence-labeled and the proper nouns in it identified.
  • A CRF model can be connected after the BERT model to determine whether the identified proper nouns are accurate. For example, when the label of a certain noun is recognized as BIII, the CRF model determines whether that label is accurate, that is, whether it is indeed BIII, and the recognition of proper nouns is thereby achieved.
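  • To make the BIII example concrete, the following sketch decodes a tag string into proper nouns. It assumes the common B/I/O labeling convention (B marks the first token of a proper noun, I a continuation, O anything else), which the text does not spell out. The hand-written `is_valid_bio` check only stands in for the kind of label-sequence constraint a CRF layer learns; it is not a CRF, and both function names are illustrative.

```python
from typing import List


def is_valid_bio(tags: str) -> bool:
    """Return True if no 'I' tag appears directly after an 'O' or at the start."""
    previous = "O"
    for tag in tags:
        if tag == "I" and previous == "O":
            return False
        previous = tag
    return True


def extract_entities(tokens: List[str], tags: str) -> List[str]:
    """Collect every span labeled B followed by zero or more I."""
    entities: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new entity starts; close any entity that was still open.
            if current:
                entities.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities
```

  • For instance, a four-character drug name labeled BIII followed by two O-labeled characters yields a single extracted proper noun, which could then be added to the preset proper noun dictionary.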
  • As shown in FIG. 2, a schematic flow chart is used to illustrate the machine translation method of the present application.
  • the machine translation method specifically includes the following steps:
  • Step 202: Obtain a set of original translated sentence pairs, which includes original translated sentence pairs;
  • Step 204: Perform word segmentation on the original source language data in the original translated sentence pairs to obtain word segmentation results, and count the character length of each target language word in the original target language data in the original translated sentence pairs;
  • Step 206: Filter the set of original translated sentence pairs according to the word segmentation results and the character lengths;
  • Step 208: Use the filtered set of original translated sentence pairs as the set of sample translated sentence pairs;
  • Step 210: Obtain the initial machine translation model;
  • Step 212: Calculate the word number ratio of each sample translated sentence pair in the set of sample translated sentence pairs, where the word number ratio is the ratio of the number of source language words to the number of target language words in the sample translated sentence pair;
  • Step 214: Perform statistics on the word number ratios to obtain the data distribution corresponding to the word number ratio;
  • Step 216: Filter the set of sample translated sentence pairs according to the data distribution to obtain the set of filtered sample translated sentence pairs;
  • Step 218: Train the initial machine translation model on the set of filtered sample translated sentence pairs to obtain the machine translation model to be optimized;
  • Step 220: Obtain the translation evaluation source language data set, and use the machine translation model to be optimized to translate the translation evaluation source language in that data set to obtain the translation evaluation target language data set;
  • Step 222: Obtain the set of translation evaluation translation sentence pairs from the translation evaluation source language data set and the translation evaluation target language data set;
  • Step 224: Train the machine translation model to be optimized on the set of filtered sample translated sentence pairs and the set of translation evaluation translation sentence pairs to obtain the target machine translation model;
  • Step 226: Obtain the source language data to be translated;
  • Step 228: Use the words in the source language data to be translated as words to be matched;
  • Step 230: When there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtain the next word corresponding to the word to be matched in the source language data to be translated;
  • Step 232: Combine the word to be matched and its next word to obtain the phrase to be matched;
  • Step 234: When there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtain the next word corresponding to the phrase to be matched in the source language data to be translated;
  • Step 236: Combine the phrase to be matched and its next word to obtain a new phrase to be matched, and return to Step 234;
  • Step 238: When there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, delete the last-added next word from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched;
  • Step 240: Determine the domain proper nouns in the source language data to be translated based on the domain proper nouns corresponding to the words to be matched;
  • Step 242: Input the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and input the source language data to be translated into the target machine translation model for translation to obtain the translation target language data;
  • Step 244: Replace the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.
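  • Steps 228 to 240 can be sketched as follows. This is a minimal illustration rather than the application's implementation: the function name, the `sep` parameter used to join words, and the choice to resume scanning after a matched span are assumptions. As in the steps above, a phrase is extended one word at a time and extension stops as soon as the extended phrase is absent from the dictionary.

```python
from typing import List, Set, Tuple


def forward_max_match(
    words: List[str], dictionary: Set[str], sep: str = ""
) -> List[Tuple[int, int, str]]:
    """Return (start, end, phrase) for each domain proper noun found.

    `end` is exclusive. `sep` joins words into a phrase (empty for Chinese).
    """
    found: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(words):
        # Step 230: only words present in the dictionary start a match.
        if words[i] not in dictionary:
            i += 1
            continue
        phrase, end = words[i], i + 1
        # Steps 232-236: keep appending the next word while the phrase matches.
        while end < len(words):
            candidate = phrase + sep + words[end]
            if candidate not in dictionary:
                # Step 238: drop the word that broke the match.
                break
            phrase, end = candidate, end + 1
        found.append((i, end, phrase))
        i = end  # assumption: continue scanning after the matched span
    return found
```

  • With the segmented input `["患者", "出现", "急性", "心肌", "梗死"]` and a dictionary containing `急性`, `急性心肌` and `急性心肌梗死`, the function returns the single span covering `急性心肌梗死`. Note that this incremental variant finds a long term only if each of its shorter prefixes is also in the dictionary.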
  • The machine translation method of the present application is further explained below. As shown in Figure 3, the machine translation method specifically includes the following steps:
  • The terminal obtains real translated sentence pairs (i.e., Chinese-English sentence pairs).
  • The terminal uses the pre-trained back-translation model (i.e., an English-Chinese machine translation model) to back-translate the real translated sentence pairs, obtaining back-translated sentence pairs, and uses the real translated sentence pairs and the back-translated sentence pairs together as the set of original translated sentence pairs.
  • The terminal inputs the English data of each real (Chinese-English) translated sentence pair into the pre-trained back-translation model to obtain the Chinese translation corresponding to the English data, and uses the English data and the Chinese translation as the back-translated sentence pair corresponding to the real translated sentence pair.
  • the accuracy of the model can be improved to a certain extent through data back-translation.
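  • The construction of the set of original translated sentence pairs by back-translation can be sketched as below. `back_translate` is a hypothetical callable standing in for the pre-trained English-Chinese model, and the (Chinese, English) tuple ordering is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (Chinese source, English target)


def build_original_pair_set(
    real_pairs: List[Pair],
    back_translate: Callable[[str], str],  # hypothetical English -> Chinese model
) -> List[Pair]:
    """Augment real sentence pairs with back-translated ones."""
    # Translate each English target back into Chinese and pair the synthetic
    # Chinese sentence with the original English sentence.
    back_translated = [
        (back_translate(english), english) for _chinese, english in real_pairs
    ]
    # The original set contains both the real and the back-translated pairs.
    return list(real_pairs) + back_translated
```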
  • When pre-training the back-translation model, the terminal can perform data processing on the real translated sentence pairs to obtain back-translation sample pairs for training, and then use the back-translation sample pairs to train the English-Chinese machine translation model.
  • The data processing can be as follows: use the source language data (i.e., Chinese) in the real translated sentence pairs as the target language data and the target language data (i.e., English) as the source language data to obtain the translation sample pairs that need to be filtered, and filter these translation sample pairs to obtain the back-translation sample pairs.
  • The untrained back-translation model can be based on the transformer-big model.
  • During training, the untrained back-translation model converts the input words into word vectors through two layers, token embedding and position embedding, and the encoded word vectors flow into the two-layer network in the encoder. Finally, the relevance of the text is learned through matrix transformation training, and the back-translation model is obtained. It should be noted that the translation sample pairs that need to be filtered are filtered in the same way as the original translated sentence pairs and the sample translated sentence pairs in the above embodiment, which is not repeated here.
  • The terminal can use the set of original translated sentence pairs for model training to obtain the machine translation model to be optimized, that is, Chinese-English machine translation model training.
  • Before model training, the terminal also needs to perform data processing (i.e., filtering) on the real translated sentence pairs (i.e., Chinese-English sentence pairs) in the set of original translated sentence pairs to obtain the set of filtered sample translated sentence pairs for training.
  • The specific filtering method can be as follows. The terminal performs word segmentation on the original Chinese data in the set of original translated sentence pairs and filters out the original translated sentence pairs whose original Chinese data has a sentence length greater than 200 or a word count greater than 150. It then counts the character length of each English word in the original English data of the once-filtered set and filters out the original translated sentence pairs whose original English data has a maximum character length greater than 40, obtaining the set of sample translated sentence pairs.
  • Next, the terminal calculates the word number ratio of each sample translated sentence pair in the set, that is, the value of (number of source Chinese words / number of target English words), and performs statistical analysis with a Gaussian distribution to obtain the data distribution corresponding to the word number ratio.
  • According to the data distribution, the set of sample translated sentence pairs is filtered: sample translated sentence pairs whose word number ratio is less than the first proportion threshold or greater than the second proportion threshold are filtered out, giving the set of filtered sample translated sentence pairs.
  • In this way, outlying values can be filtered out, which improves the quality of translation model training and reduces irrelevant data noise.
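  • The two filtering stages above can be sketched as follows. The length limits (200, 150, 40) come from the text. Everything else is an assumption for illustration: `segment` is a hypothetical Chinese word segmenter passed in by the caller, English words are split on whitespace, and the two proportion thresholds are taken as the mean minus and plus `k` standard deviations of the word number ratio, since the text does not say how they are derived from the distribution.

```python
import statistics
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (Chinese source, English target)
Segmenter = Callable[[str], List[str]]


def length_filter(pairs: List[Pair], segment: Segmenter, max_chars: int = 200,
                  max_words: int = 150, max_word_len: int = 40) -> List[Pair]:
    """Drop pairs with an over-long source sentence or an over-long target word."""
    kept = []
    for src, tgt in pairs:
        if len(src) > max_chars or len(segment(src)) > max_words:
            continue
        tgt_words = tgt.split()
        if tgt_words and max(len(w) for w in tgt_words) > max_word_len:
            continue
        kept.append((src, tgt))
    return kept


def ratio_filter(pairs: List[Pair], segment: Segmenter, k: float = 2.0) -> List[Pair]:
    """Drop pairs whose source/target word number ratio is an outlier."""
    if len(pairs) < 2:
        return list(pairs)
    ratios = [len(segment(s)) / max(len(t.split()), 1) for s, t in pairs]
    mean, std = statistics.mean(ratios), statistics.stdev(ratios)
    # Assumed thresholds: first = mean - k*std, second = mean + k*std.
    low, high = mean - k * std, mean + k * std
    return [p for p, r in zip(pairs, ratios) if low <= r <= high]
```

  • In practice the length filter would run first and the ratio filter second, matching the order described above, so that extreme sentences do not distort the estimated distribution.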
  • After obtaining the set of filtered sample translated sentence pairs, the terminal trains the initial machine translation model on it and tunes suitable values for the learning rate, batch size, step size and other related parameters to obtain the machine translation model to be optimized, thereby achieving Chinese-English machine translation model training.
  • After obtaining the machine translation model to be optimized, the terminal obtains the filtered evaluation set (in-field data) of the medical field in the International Machine Translation Competition, that is, the translation evaluation source language data set, and uses the translation evaluation source language data set to fine-tune the machine translation model to be optimized.
  • Model fine-tuning means freezing a series of parameters, such as the related losses and parameter weights from the previous large-batch model training, and then conducting small-batch model training based on these parameters. It should be noted that the evaluation set of the medical field in the International Machine Translation Competition is filtered in the same way as the original translated sentence pairs and the sample translated sentence pairs in the above embodiment, which is not repeated here.
  • When using the translation evaluation source language data set to fine-tune the machine translation model to be optimized, the terminal first translates the translation evaluation Chinese in the translation evaluation Chinese set (that is, data translation of monolingual Chinese data) with the machine translation model to be optimized to obtain the translation evaluation English set, obtains the translation evaluation translation sentence pair set from the translation evaluation Chinese set and the translation evaluation English set, filters the translation evaluation translation sentence pair set, and trains the machine translation model to be optimized on the filtered sample translation sentence pair set and the filtered translation evaluation translation sentence pair set to obtain the target machine translation model.
  • The translation evaluation translation sentence pair set is filtered in the same way as the original translated sentence pairs and the sample translated sentence pairs in the above embodiment, which is not repeated here.
  • Training runs for one million steps with a batch size of three thousand.
  • The terminal first trains the machine translation model to be optimized to obtain the machine translation model to be updated, uses the machine translation model to be updated to translate the translation evaluation source language in the filtered translation evaluation translation sentence pair set, and updates the translation results corresponding to the translation evaluation source language in that set. It then uses the filtered sample translation sentence pair set and the updated translation evaluation translation sentence pair set to train the machine translation model to be updated, obtaining the target machine translation model, which is the medical field machine translation model.
  • The terminal can obtain the target machine translation model through iterative training. That is, the terminal uses the filtered sample translation sentence pair set and the updated translation evaluation translation sentence pair set to train the machine translation model to be updated, and then returns to the step of using the machine translation model to be updated to translate the translation evaluation source language in the filtered translation evaluation translation sentence pair set, until the number of iterations (i.e., N in Figure 3) reaches the preset iteration threshold, at which point the latest machine translation model to be updated is obtained. The terminal then obtains professional corpus in the field (i.e., medical field data) and uses it to train the latest machine translation model to be updated (i.e., fine-tunes the model with medical field data), obtaining the target machine translation model (i.e., the medical field machine translation model).
  • After obtaining the target machine translation model, the terminal obtains the Chinese to be translated, uses the words in the Chinese to be translated as words to be matched, and performs forward maximum matching with the medical data professional dictionary to obtain the domain proper nouns corresponding to the words to be matched. That is, when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary (i.e., the medical data professional dictionary), the next word corresponding to the word to be matched in the Chinese to be translated is obtained, and the word to be matched and that next word are combined to obtain the phrase to be matched.
  • When there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, the next word corresponding to the phrase to be matched in the Chinese to be translated is obtained, and the phrase to be matched and that next word are combined to obtain a new phrase to be matched.
  • The step of obtaining the next word corresponding to the phrase to be matched is repeated until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary. The last-added next word is then deleted from the latest phrase to be matched, giving the domain proper noun corresponding to the word to be matched.
  • After obtaining the domain proper nouns corresponding to the words to be matched, the terminal determines the domain proper nouns in the Chinese to be translated from them, inputs the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, inputs the Chinese to be translated into the target machine translation model for translation to obtain the translation target language data, and replaces the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result (that is, the translation result output).
  • The terminal can build the medical data professional dictionary through entity recognition. Specifically, the terminal obtains a sample proper noun set carrying sequence annotations and trains on it to obtain the pre-trained proper noun recognition model. When performing machine translation, the terminal can then perform proper noun recognition on the source language data to be translated with the pre-trained proper noun recognition model and expand the preset proper noun dictionary based on the recognition results, so that more proper nouns can be identified during matching.
  • The pre-trained proper noun recognition model can be a BERT (Bidirectional Encoder Representations from Transformers) + CRF (Conditional Random Field) model.
  • Through the BERT model, the source language data to be translated can be sequence-labeled and the proper nouns in it identified.
  • A CRF model can be connected after the BERT model to determine whether the identified proper nouns are accurate. For example, when the label of a certain noun is recognized as BIII, the CRF model determines whether that label is accurate, that is, whether it is indeed BIII, and the recognition of proper nouns is thereby achieved.
  • The terminal can use multi-model fusion to obtain the translation target language data.
  • The target machine translation model can include at least two sub-machine translation models. That is, the terminal can train multiple sub-machine translation models with different random drop (dropout) rates and use them to translate the source language data to be translated.
  • When translating the source language data to be translated, the terminal inputs it into each sub-machine translation model to obtain the translation result corresponding to that sub-machine translation model.
  • The translation result includes the word probability obtained by predicting each word in the source language data to be translated. After obtaining the word probabilities, the terminal sorts the word probabilities of the same word in the translation results output by the sub-machine translation models, determines the optimal prediction result (that is, the optimal translation result) for each word from the sorting result, and obtains the translation target language data from the optimal translation results of all words.
  • The terminal determines the maximum word probability for each word and uses the word corresponding to the maximum word probability as the optimal prediction result.
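  • The fusion rule above, which keeps the prediction with the maximum word probability for each word, can be sketched as follows. The sketch assumes that every sub-machine translation model returns the same number of (word, probability) predictions and that the predictions are aligned position by position; the text does not describe how alignment is obtained, and the function name is illustrative.

```python
from typing import List, Tuple

Prediction = Tuple[str, float]  # (predicted target word, word probability)


def fuse_translations(model_outputs: List[List[Prediction]]) -> List[str]:
    """Pick, for each position, the word with the highest probability."""
    fused: List[str] = []
    # zip(*...) groups the predictions of all sub-models for the same position.
    for candidates in zip(*model_outputs):
        best_word, _best_prob = max(candidates, key=lambda c: c[1])
        fused.append(best_word)
    return fused
```

  • Taking the maximum directly is equivalent to sorting the candidate probabilities and selecting the first entry, so no explicit sort is needed in the sketch.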
  • embodiments of the present application also provide a machine translation device for implementing the above-mentioned machine translation method.
  • The solution provided by this device is similar to that described for the above method. Therefore, for the specific limitations in the one or more machine translation device embodiments provided below, refer to the above limitations on the machine translation method, which are not repeated here.
  • a machine translation device including: an acquisition module 402, a matching module 404, a translation module 406 and a replacement module 408, wherein:
  • the acquisition module 402 is used to acquire the source language data to be translated
  • the matching module 404 is used to perform forward maximum matching on the source language data to be translated and determine the domain-specific nouns in the source language data to be translated;
  • the translation module 406 is used to input the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and input the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, where the target machine translation model is obtained by training on sample data;
  • the replacement module 408 is used to replace the corresponding translation results in the translation target language data with the proper noun translation results to obtain a machine translation result.
  • By obtaining the source language data to be translated, performing forward maximum matching on it to determine the domain proper nouns it contains, inputting the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, inputting the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, and replacing the corresponding translation results in the translation target language data with the proper noun translation results, the above machine translation device can improve the accuracy with which the target machine translation model translates domain proper nouns and obtain accurate machine translation results.
  • The matching module is also used to use the words in the source language data to be translated as words to be matched, perform forward maximum matching on the words to be matched to obtain the domain proper nouns corresponding to the words to be matched, and determine the domain proper nouns in the source language data to be translated from the domain proper nouns corresponding to the words to be matched.
  • The matching module is also used to: when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtain the next word corresponding to the word to be matched in the source language data to be translated, and combine the word to be matched and that next word to obtain the phrase to be matched; when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtain the next word corresponding to the phrase to be matched in the source language data to be translated, and combine the phrase to be matched and that next word to obtain a new phrase to be matched; and repeat the step of obtaining the next word corresponding to the phrase to be matched until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, then delete the last-added next word from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched.
  • The machine translation device also includes a model training module.
  • The model training module is used to obtain a sample translation sentence pair set and an initial machine translation model; calculate the word number ratio of each sample translation sentence pair in the sample translation sentence pair set, where the word number ratio is the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filter the sample translation sentence pair set according to the word number ratio to obtain the filtered sample translation sentence pair set; and train the initial machine translation model on the filtered sample translation sentence pair set to obtain the target machine translation model.
  • The model training module is also used to obtain a set of original translated sentence pairs, which includes original translated sentence pairs; perform word segmentation on the original source language data in the original translated sentence pairs to obtain word segmentation results; count the character length of each target language word in the original target language data in the original translated sentence pairs; filter the set of original translated sentence pairs based on the word segmentation results and the character lengths; and use the filtered set of original translated sentence pairs as the sample translated sentence pair set.
  • The model training module is also used to perform statistics on the word number ratios to obtain the data distribution corresponding to the word number ratio, and filter the set of sample translated sentence pairs according to the data distribution to obtain the set of filtered sample translated sentence pairs.
  • The model training module is also used to train the initial machine translation model on the filtered sample translation sentence pair set to obtain the machine translation model to be optimized; obtain the translation evaluation source language data set and use the machine translation model to be optimized to translate the translation evaluation source language in it to obtain the translation evaluation target language data set; obtain the translation evaluation translation sentence pair set from the translation evaluation source language data set and the translation evaluation target language data set; and train the machine translation model to be optimized on the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set to obtain the target machine translation model.
  • Each module in the above machine translation device can be implemented in whole or in part by software, hardware and combinations thereof.
  • Each of the above modules may be embedded in or independent of the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a terminal, and its internal structure diagram may be as shown in Figure 5.
  • the computer device includes a processor, memory, input/output interface, communication interface, display unit and input device.
  • the processor, memory and input/output interface are connected through the system bus, and the communication interface, display unit and input device are connected to the system bus through the input/output interface.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes non-volatile storage media and internal memory.
  • the non-volatile storage medium stores operating systems and computer programs. This internal memory provides an environment for the execution of operating systems and computer programs in non-volatile storage media.
  • the input/output interface of the computer device is used to exchange information between the processor and external devices.
  • the communication interface of the computer device is used for wired or wireless communication with external terminals.
  • the wireless mode can be implemented through WIFI, mobile cellular network, NFC (Near Field Communication) or other technologies.
  • the computer program when executed by the processor, implements a machine translation method.
  • The display unit of the computer device is used to form a visible picture and can be a display screen, a projection device or a virtual reality imaging device.
  • The display screen can be a liquid crystal display screen or an electronic ink display screen.
  • The input device of the computer device can be a touch layer covering the display screen, or buttons, a trackball or a touch pad provided on the housing of the computer device, or an external keyboard, touch pad or mouse, etc.
  • The structure shown in FIG. 5 is only a block diagram of a partial structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer parts than shown, combine certain parts, or have a different arrangement of parts.
  • A computer device is provided, including a memory and a processor.
  • A computer program is stored in the memory.
  • When the processor executes the computer program, it implements the following steps: obtaining the source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in the source language data to be translated; inputting the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, where the target machine translation model is obtained by training on sample data; and replacing the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.
  • the processor when the processor executes the computer program, it also implements the following steps: using words in the source language data to be translated as words to be matched, performing forward maximum matching on the words to be matched, and obtaining domain-specific nouns corresponding to the words to be matched. , based on the domain proper nouns corresponding to the words to be matched, determine the domain proper nouns in the source language data to be translated.
  • the processor also implements the following steps when executing the computer program: when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtaining the next word corresponding to the word to be matched in the source language data to be translated. word, combine the word to be matched and the next word corresponding to the word to be matched, to obtain the phrase to be matched, and when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtain the phrase to be matched in the source language data to be translated The corresponding next word is combined with the phrase to be matched and the next word corresponding to the phrase to be matched to obtain a new phrase to be matched.
  • the processor also implements the following steps when executing the computer program: obtaining a set of sample translated sentence pairs and an initial machine translation model, calculating a word number ratio of the sample translated sentence pairs in the sample translated sentence pair set, and the word number ratio is the sample The ratio of the number of words in the source language to the number of words in the target language in the translated sentence pair. Filter the set of sample translated sentence pairs according to the ratio of the number of words to obtain a set of filtered sample translated sentence pairs. Based on the set of filtered sample translated sentence pairs, the initial machine The translation model is trained to obtain the target translation machine model.
  • the processor also implements the following steps when executing the computer program: obtaining a set of original translated sentence pairs, which includes original translated sentence pairs, performing word segmentation on the original source language data in the original translated sentence pairs, and obtaining the word segmentation results, and count the character length of each target language word in the original target language data in the original translation sentence pair, filter the original translation sentence pair set according to the word segmentation result and character length, and use the filtered original translation sentence pair set as A collection of sample translated sentence pairs.
  • the processor also implements the following steps when executing the computer program: performing statistics based on the word number ratio to obtain a data distribution corresponding to the word number ratio; filtering the sample translation sentence set according to the data distribution to obtain the filtered A collection of sample translated sentence pairs.
  • the processor also implements the following steps when executing the computer program: training the initial machine translation model according to the set of filtered sample translation sentence pairs to obtain the machine translation model to be optimized, obtaining the translation evaluation source language data set, and passing The machine translation model to be optimized translates the translation evaluation source language in the translation evaluation source language data set to obtain the translation evaluation target language data set. Based on the translation evaluation source language data set and the translation evaluation target language data set, a translation evaluation translation sentence pair set is obtained. Based on the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set, the machine translation model to be optimized is trained to obtain the target machine translation model.
  • a computer-readable storage medium is provided, with a computer program stored thereon.
  • when the computer program is executed by a processor, the following steps are implemented: obtaining the source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in it; inputting the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain the translation target language data.
  • the target machine translation model is obtained by training on sample data; the corresponding translation results in the translation target language data are replaced with the proper noun translation results to obtain the machine translation results.
  • the following steps are also implemented: using words in the source language data to be translated as words to be matched, performing forward maximum matching on the words to be matched to obtain the domain proper nouns corresponding to the words to be matched, and determining the domain proper nouns in the source language data to be translated based on the domain proper nouns corresponding to the words to be matched.
  • the following steps are also implemented: when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtaining the next word after the word to be matched in the source language data to be translated and combining the two to obtain a phrase to be matched; when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtaining the next word after the phrase to be matched in the source language data to be translated and combining the two to obtain a new phrase to be matched.
  • the following steps are also implemented: obtaining a set of sample translation sentence pairs and an initial machine translation model; calculating a word number ratio for each sample translation sentence pair in the set, the word number ratio being the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filtering the set of sample translation sentence pairs according to the word number ratio to obtain a filtered set of sample translation sentence pairs; and training the initial machine translation model on the filtered set to obtain the target machine translation model.
  • the following steps are also implemented: obtaining a set of original translation sentence pairs, which includes original translation sentence pairs; performing word segmentation on the original source language data in the original translation sentence pairs to obtain word segmentation results, and counting the character length of each target language word in the original target language data in the original translation sentence pairs; filtering the set of original translation sentence pairs according to the word segmentation results and the character lengths, and using the filtered set of original translation sentence pairs as the set of sample translation sentence pairs.
  • the following steps are also implemented: performing statistics on the word number ratios to obtain the data distribution of the word number ratio; filtering the set of sample translation sentence pairs according to the data distribution to obtain the filtered set of sample translation sentence pairs.
  • the following steps are also implemented: training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain the machine translation model to be optimized, and obtaining the translation evaluation source language data set.
  • the machine translation model to be optimized is used to translate the translation evaluation source language in the translation evaluation source language data set to obtain the translation evaluation target language data set.
  • a translation evaluation translation sentence pair set is obtained.
  • the machine translation model to be optimized is trained to obtain the target machine translation model.
  • the computer-readable storage medium may be non-volatile or volatile.
  • a computer program product including a computer program.
  • when the computer program is executed by a processor, the following steps are implemented: obtaining the source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in it; inputting the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, the target machine translation model being obtained by training on sample data; and replacing the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation results.
  • the following steps are also implemented: using words in the source language data to be translated as words to be matched, performing forward maximum matching on the words to be matched to obtain the domain proper nouns corresponding to the words to be matched, and determining the domain proper nouns in the source language data to be translated based on the domain proper nouns corresponding to the words to be matched.
  • the following steps are also implemented: when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtaining the next word after the word to be matched in the source language data to be translated and combining the two to obtain a phrase to be matched; when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtaining the next word after the phrase to be matched in the source language data to be translated and combining the two to obtain a new phrase to be matched.
  • the following steps are also implemented: obtaining a set of sample translation sentence pairs and an initial machine translation model; calculating a word number ratio for each sample translation sentence pair in the set, the word number ratio being the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filtering the set of sample translation sentence pairs according to the word number ratio to obtain a filtered set of sample translation sentence pairs; and training the initial machine translation model on the filtered set to obtain the target machine translation model.
  • the following steps are also implemented: obtaining a set of original translation sentence pairs, which includes original translation sentence pairs; performing word segmentation on the original source language data in the original translation sentence pairs to obtain word segmentation results, and counting the character length of each target language word in the original target language data in the original translation sentence pairs; filtering the set of original translation sentence pairs according to the word segmentation results and the character lengths, and using the filtered set of original translation sentence pairs as the set of sample translation sentence pairs.
  • the following steps are also implemented: performing statistics on the word number ratios to obtain the data distribution of the word number ratio; filtering the set of sample translation sentence pairs according to the data distribution to obtain the filtered set of sample translation sentence pairs.
  • the following steps are also implemented: training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain the machine translation model to be optimized, and obtaining the translation evaluation source language data set.
  • the machine translation model to be optimized is used to translate the translation evaluation source language in the translation evaluation source language data set to obtain the translation evaluation target language data set.
  • a translation evaluation translation sentence pair set is obtained.
  • the machine translation model to be optimized is trained to obtain the target machine translation model.
  • data involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
  • the computer program can be stored in a non-volatile computer-readable storage medium.
  • the computer program when executed, may include the processes of the above method embodiments.
  • Any reference to memory, database or other media used in the embodiments provided in this application may include at least one of non-volatile and volatile memory.
  • Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc.
  • Volatile memory may include random access memory (Random Access Memory, RAM) or external cache memory, etc.
  • the databases involved in the various embodiments provided in this application may include at least one of a relational database and a non-relational database.
  • Non-relational databases may include blockchain-based distributed databases, etc., but are not limited thereto.
  • the processors involved in the various embodiments provided in this application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to this.
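The word-number-ratio filtering described in the embodiments above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `word_ratio` and `filter_by_ratio`, whitespace tokenization, and the rule of keeping pairs within `k` standard deviations of the mean ratio are all assumptions, since the text only says the set is filtered according to the data distribution of the ratio.

```python
from statistics import mean, pstdev

def word_ratio(src, tgt):
    """Ratio of source word count to target word count (whitespace tokens, assumed)."""
    return len(src.split()) / max(len(tgt.split()), 1)

def filter_by_ratio(pairs, k=2.0):
    """Keep sentence pairs whose ratio lies within k standard deviations of the mean.

    The mean +/- k*sigma rule is an assumed way of using the data distribution.
    """
    if not pairs:
        return []
    ratios = [word_ratio(s, t) for s, t in pairs]
    mu, sigma = mean(ratios), pstdev(ratios)
    low, high = mu - k * sigma, mu + k * sigma
    return [p for p, r in zip(pairs, ratios) if low <= r <= high]

# Hypothetical data: five balanced pairs and one badly misaligned pair.
pairs = [("a b", "x y")] * 5 + [("a b c d e f g h i j", "x")]
kept = filter_by_ratio(pairs)
```

A pair whose source is ten times longer than its target is likely misaligned, so dropping it before training is the kind of cleanup this step is aimed at.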

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical fields of artificial intelligence and speech processing. Provided are a machine translation method and apparatus, and a computer device and a storage medium. The method comprises: acquiring source language data to be translated; performing forward maximum matching on said source language data, and determining a field proper noun in said source language data; inputting the field proper noun into a target machine translation model for translation, so as to obtain a proper noun translation result, and inputting said source language data into the target machine translation model for translation, so as to obtain translation target language data, wherein the target machine translation model is obtained by means of performing training by using sample data; and replacing a corresponding translation result in the translation target language data with the proper noun translation result, so as to obtain a machine translation result. By using the method, the accuracy of translating a field proper noun by means of a target machine translation model can be improved, thereby obtaining a machine translation result with accurate translation.

Description

Machine translation method and apparatus, computer device, and storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on June 14, 2022, with application number 202210667744.3 and entitled "Machine Translation Method, Apparatus, Computer Device and Storage Medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the technical fields of artificial intelligence and speech processing, and in particular to a machine translation method and apparatus, a computer device, and a storage medium.
Background
With the development of artificial intelligence, neural machine translation has emerged. Machine translation is the process of using a computer to convert one natural language (the source language) into another natural language (the target language). At the core of neural machine translation is a deep neural network with a very large number of nodes (neurons), which can automatically learn translation knowledge from a corpus. A sentence in one language is vectorized and passed through the network layer by layer, where it is converted into a representation the computer can "understand"; after multiple layers of complex computation, a translation in another language is generated. This realizes translation by "understanding the language, then generating the translation".
The inventor realized that, in conventional technology, machine translation usually uses an encoder-decoder structure to model variable-length input sentences. The encoder "understands" the source language sentence and produces a floating-point vector of a specific dimension, and the decoder then generates the target language translation word by word from this vector.
However, when the conventional method is applied to specialized fields that contain domain proper nouns, the translation is inaccurate.
Summary of the invention
In view of the above technical problem, it is necessary to provide a machine translation method and apparatus, a computer device, a computer-readable storage medium, and a computer program product that can achieve accurate translation.
In a first aspect, this application provides a machine translation method, the method comprising:
obtaining source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in it; inputting the domain proper nouns into a target machine translation model for translation to obtain proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain translation target language data, the target machine translation model being obtained by training on sample data; and replacing the corresponding translation results in the translation target language data with the proper noun translation results to obtain a machine translation result.
In a second aspect, this application also provides a machine translation apparatus, the apparatus comprising:
an acquisition module, configured to obtain source language data to be translated; a matching module, configured to perform forward maximum matching on the source language data to be translated to determine the domain proper nouns in it; a translation module, configured to input the domain proper nouns into a target machine translation model for translation to obtain proper noun translation results, and to input the source language data to be translated into the target machine translation model for translation to obtain translation target language data, the target machine translation model being obtained by training on sample data; and a replacement module, configured to replace the corresponding translation results in the translation target language data with the proper noun translation results to obtain a machine translation result.
In a third aspect, this application also provides a computer device. The computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:
obtaining source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in it; inputting the domain proper nouns into a target machine translation model for translation to obtain proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain translation target language data, the target machine translation model being obtained by training on sample data; and replacing the corresponding translation results in the translation target language data with the proper noun translation results to obtain a machine translation result.
In a fourth aspect, this application also provides a computer-readable storage medium. A computer program is stored on the computer-readable storage medium, and the following steps are implemented when the computer program is executed by a processor:
obtaining source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in it; inputting the domain proper nouns into a target machine translation model for translation to obtain proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain translation target language data, the target machine translation model being obtained by training on sample data; and replacing the corresponding translation results in the translation target language data with the proper noun translation results to obtain a machine translation result.
In a fifth aspect, this application also provides a computer program product. The computer program product includes a computer program, and the following steps are implemented when the computer program is executed by a processor:
obtaining source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in it; inputting the domain proper nouns into a target machine translation model for translation to obtain proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain translation target language data, the target machine translation model being obtained by training on sample data; and replacing the corresponding translation results in the translation target language data with the proper noun translation results to obtain a machine translation result.
With the above machine translation method and apparatus, computer device, storage medium, and computer program product, the source language data to be translated is obtained and forward maximum matching is performed on it, so that the domain proper nouns in the source language data to be translated can be determined. The domain proper nouns are input into the target machine translation model to obtain the proper noun translation results, and the source language data to be translated is input into the target machine translation model to obtain the translation target language data. Replacing the corresponding translation results in the translation target language data with the proper noun translation results improves the accuracy with which the target machine translation model translates domain proper nouns and yields an accurate machine translation result.
Description of the drawings
Figure 1 is a schematic flowchart of a machine translation method in one embodiment;
Figure 2 is a schematic flowchart of a machine translation method in another embodiment;
Figure 3 is a schematic flowchart of a machine translation method in yet another embodiment;
Figure 4 is a structural block diagram of a machine translation apparatus in one embodiment;
Figure 5 is an internal structure diagram of a computer device in one embodiment.
Detailed description
To make the purpose, technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
In one embodiment, as shown in Figure 1, a machine translation method is provided. This embodiment is described using the example of the method being applied to a terminal. It can be understood that the method can also be applied to a server, or to a system that includes a terminal and a server, where it is implemented through interaction between the terminal and the server. The terminal can be, but is not limited to, a personal computer, laptop, smartphone, tablet, Internet of Things device, or portable wearable device. The Internet of Things device can be a smart speaker, smart TV, smart air conditioner, smart in-vehicle device, etc. The portable wearable device can be a smart watch, smart bracelet, head-mounted device, etc. The server can be implemented as an independent server or as a server cluster composed of multiple servers. In this embodiment, the method includes the following steps:
Step 102: obtain the source language data to be translated.
The source language data to be translated is the data that needs to be translated. For example, in machine translation from Chinese to English, the source language data to be translated is Chinese text; in machine translation from English to Chinese, it is English text.
Specifically, when machine translation is required, the terminal obtains the source language data to be translated.
Step 104: perform forward maximum matching on the source language data to be translated, and determine the domain proper nouns in the source language data to be translated.
Forward maximum matching means scanning the source language data to be translated in order and, at each position, taking the longest phrase that matches an entry in the preset proper noun dictionary. Domain proper nouns are nouns that are specific to a field. For example, in the medical field, domain proper nouns can be disease names, drug names, and so on.
Specifically, the terminal segments the source language data to be translated to obtain the words in it, uses those words as the words to be matched, and uses the preset proper noun dictionary to perform forward maximum matching on the words to be matched, obtaining the domain proper nouns corresponding to the words to be matched. From these, it determines the domain proper nouns in the source language data to be translated. The preset proper noun dictionary is a dictionary, set up in advance, composed of proper nouns in the field. For example, in the medical field, the preset proper noun dictionary is a dictionary composed of proper nouns such as disease names and drug names.
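The matching step above can be illustrated with a short sketch of forward maximum matching over an already segmented word list. It is an assumption-laden illustration rather than the applicant's implementation: the function name `forward_max_match`, the `sep` and `max_len` parameters, and the sample dictionary are hypothetical, and this version tries the longest candidate first (the classic formulation), whereas the embodiments describe growing the phrase one word at a time.

```python
def forward_max_match(words, term_dict, sep=" ", max_len=None):
    """Return (start, end, term) spans of the longest dictionary matches, left to right.

    words: segmented source words; term_dict: set of proper nouns (phrases joined by sep).
    """
    if max_len is None:
        # Longest dictionary entry, measured in words.
        max_len = max((len(t.split(sep)) for t in term_dict), default=1)
    spans = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest window starting at i first, then shrink it.
        for j in range(min(len(words), i + max_len), i, -1):
            candidate = sep.join(words[i:j])
            if candidate in term_dict:
                match = (i, j, candidate)
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched phrase
        else:
            i += 1
    return spans

# Hypothetical medical dictionary and sentence.
terms = {"type 2 diabetes", "type 2 diabetes mellitus", "hypertension"}
sentence = "patient has type 2 diabetes mellitus and hypertension".split()
spans = forward_max_match(sentence, terms)
```

Because the longest window is tried first, "type 2 diabetes mellitus" is preferred over the shorter entry "type 2 diabetes". For text without spaces between words, such as Chinese, `sep=""` would be the natural choice.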
步骤106,将领域专有名词输入目标机器翻译模型进行翻译,得到专有名词翻译结果,并将待翻译源语言数据输入目标机器翻译模型进行翻译,得到翻译目标语言数据,目标机器翻译模型通过对样本数据训练得到。Step 106: Enter the domain proper nouns into the target machine translation model for translation to obtain the translation results of the proper nouns, and input the source language data to be translated into the target machine translation model for translation to obtain the translation target language data. The target machine translation model passes the Obtained by training on sample data.
Here, the target machine translation model is a model, obtained by training on sample data, that can be used for machine translation and can translate the source language data to be translated into translated target language data. The sample data may specifically be a set of sample translation sentence pairs. A sample translation sentence pair consists of sample source language data and sample target language data, where the sample target language data is the translation of the sample source language data. The proper noun translation result is the translation of a domain proper noun output by the target machine translation model. The translated target language data is the translation of the source language data to be translated output by the target machine translation model.
Specifically, the terminal marks the domain proper nouns in the source language data to be translated to obtain an annotation result, and inputs the domain proper nouns into the target machine translation model, which outputs the proper noun translation result. The terminal also inputs the source language data to be translated into the target machine translation model to obtain the translated target language data.
Further, the target machine translation model may include at least two sub machine translation models; that is, the terminal may train multiple sub machine translation models with different dropout rates and use them to translate the source language data to be translated. During translation, the terminal inputs the source language data to be translated into each sub machine translation model and obtains the translation result corresponding to that sub-model. Each translation result includes, for every word in the source language data to be translated, the word probability of the predicted corresponding word. After obtaining these word probabilities, the terminal sorts the word probabilities for the same word across the translation results output by the sub-models, determines from the sorted result the optimal prediction (that is, the optimal translation) for that word, and assembles the translated target language data from the optimal translation of each word. After sorting, the terminal takes the maximum word probability for each word and uses the word corresponding to that maximum as the optimal prediction.
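The selection step described above can be sketched as follows. This is a minimal illustration, not the implementation from the application: it assumes each sub-model returns one `(word, probability)` candidate per output position and that all sub-models return the same number of positions. The function name `pick_best_words` and the data layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, float]  # (predicted word, word probability)


def pick_best_words(sub_model_outputs: Sequence[Sequence[Candidate]]) -> List[str]:
    """For each position, keep the word with the highest probability.

    Assumption: every sub-model output has the same length, so position i
    in one output corresponds to position i in the others.
    """
    best = []
    for candidates in zip(*sub_model_outputs):
        # Taking the maximum is equivalent to sorting by probability
        # and keeping the top entry.
        word, _prob = max(candidates, key=lambda c: c[1])
        best.append(word)
    return best


# Two hypothetical sub-models trained with different dropout rates.
outputs = [
    [("the", 0.6), ("cat", 0.4)],
    [("a", 0.7), ("cat", 0.9)],
]
print(pick_best_words(outputs))
```

Taking the per-position maximum is only one way to combine sub-models; the text does not say how outputs of different lengths are aligned, so that case is left out here.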
Step 108: replace the corresponding translation in the translated target language data with the proper noun translation result to obtain the machine translation result.
Specifically, after obtaining the proper noun translation result and the translated target language data, the terminal uses the annotation result for the source language data to be translated to replace the corresponding translation in the translated target language data with the proper noun translation result, thereby obtaining the machine translation result.
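A minimal sketch of the replacement step follows. The text does not specify how the annotation result locates a term's translation inside the translated sentence, so this example assumes the annotation has already been resolved to non-overlapping character spans in the target text; the `replace_terms` helper and the span format are hypothetical.

```python
from typing import List, Tuple

# (start offset, end offset, proper noun translation); spans must not overlap.
Span = Tuple[int, int, str]


def replace_terms(translation: str, spans: List[Span]) -> str:
    """Replace each span of the sentence translation with the term translation.

    Spans are applied from right to left so that earlier offsets stay valid
    even when the replacement has a different length.
    """
    for start, end, term_translation in sorted(spans, key=lambda s: s[0], reverse=True):
        translation = translation[:start] + term_translation + translation[end:]
    return translation


sentence = "The patient has heart attack."
start = sentence.index("heart attack")
fixed = replace_terms(sentence, [(start, start + len("heart attack"), "myocardial infarction")])
print(fixed)
```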
The machine translation method described above obtains the source language data to be translated and performs forward maximum matching on it, which makes it possible to determine the domain proper nouns in that data. The domain proper nouns are input into the target machine translation model to obtain the proper noun translation result, the source language data to be translated is input into the target machine translation model to obtain the translated target language data, and the corresponding translation in the translated target language data is replaced with the proper noun translation result. This improves the accuracy with which the target machine translation model translates domain proper nouns and yields an accurate machine translation result.
In one embodiment, performing forward maximum matching on the words in the source language data to be translated and determining the domain proper nouns in the source language data to be translated includes: taking the words in the source language data to be translated as words to be matched; performing forward maximum matching on the words to be matched to obtain the domain proper nouns corresponding to the words to be matched; and determining the domain proper nouns in the source language data to be translated according to the domain proper nouns corresponding to the words to be matched.
Specifically, the terminal segments the source language data to be translated into words, takes those words as words to be matched, and compares each word to be matched against the preset proper noun dictionary to determine whether the dictionary contains a matching entry. When the dictionary contains an entry matching the word to be matched, the terminal obtains the next word after the word to be matched in the source language data to be translated, joins the word to be matched with that next word to obtain a phrase to be matched, and continues forward maximum matching by comparing the phrase to be matched against the preset proper noun dictionary, obtaining the domain proper noun corresponding to the word to be matched.
Specifically, because the domain proper nouns corresponding to different words to be matched may repeat, after obtaining the domain proper nouns corresponding to the words to be matched, the terminal removes duplicates among them to obtain the domain proper nouns in the source language data to be translated.
In this embodiment, by taking the words in the source language data to be translated as words to be matched and performing forward maximum matching on them, the domain proper nouns corresponding to the words to be matched can be obtained, and the domain proper nouns in the source language data to be translated can then be determined from them.
In one embodiment, performing forward maximum matching on the word to be matched to obtain the domain proper noun corresponding to the word to be matched includes: when the preset proper noun dictionary contains an entry matching the word to be matched, obtaining the next word after the word to be matched in the source language data to be translated; joining the word to be matched with that next word to obtain a phrase to be matched; when the preset proper noun dictionary contains an entry matching the phrase to be matched, obtaining the next word after the phrase to be matched in the source language data to be translated; joining the phrase to be matched with that next word to obtain a new phrase to be matched, and returning to the step of obtaining the next word after the phrase to be matched when the preset proper noun dictionary contains an entry matching the phrase to be matched; and, once the preset proper noun dictionary contains no entry matching the latest phrase to be matched, deleting the most recently appended word from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched.
Specifically, when performing forward maximum matching on a word to be matched, the terminal matches the word against the preset proper noun dictionary. When the dictionary contains an entry matching the word to be matched, the terminal obtains the next word, that is, the word positioned immediately after the word to be matched in the source language data to be translated, joins the two to obtain a phrase to be matched, and compares the phrase to be matched against the dictionary. When the dictionary contains an entry matching the phrase to be matched, the terminal obtains the word positioned immediately after the phrase to be matched, joins it to the phrase to obtain a new phrase to be matched, and returns to the step of obtaining the next word when the dictionary contains an entry matching the phrase to be matched. This repeats until the dictionary contains no entry matching the latest phrase to be matched, at which point the terminal deletes the most recently appended word from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched.
In this embodiment, when the preset proper noun dictionary contains an entry matching the word to be matched, the word to be matched is joined with the next word to obtain a phrase to be matched, and matching of the phrase against the preset proper noun dictionary continues. In this way, forward maximum matching yields the domain proper noun corresponding to the word to be matched.
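The matching loop and the subsequent deduplication can be sketched as below. This is an illustrative reading of the steps above, not the applicant's code: the function name `find_domain_terms`, the use of a Python `set` as the dictionary, and the `joiner` parameter (empty for Chinese, a space for English) are assumptions. As in the description, a phrase keeps growing only while each extended phrase is itself found in the dictionary.

```python
from typing import Iterable, List, Set


def find_domain_terms(words: List[str], dictionary: Set[str], joiner: str = "") -> List[str]:
    """Forward maximum matching over segmented words, followed by deduplication."""
    found = []
    for start, word in enumerate(words):
        if word not in dictionary:
            continue  # this word does not start any dictionary entry
        phrase = word
        end = start + 1
        while end < len(words):
            candidate = phrase + joiner + words[end]
            if candidate not in dictionary:
                break  # drop the word just appended; keep the previous phrase
            phrase = candidate
            end += 1
        found.append(phrase)
    # Different starting words may yield the same term; keep first occurrences.
    return list(dict.fromkeys(found))


terms = find_domain_terms(
    ["acute", "myocardial", "infarction", "patient"],
    {"acute", "acute myocardial", "acute myocardial infarction", "infarction"},
    joiner=" ",
)
print(terms)
```

Note that a shorter term starting inside a longer match (here `infarction`) is also reported, because every word is treated as a word to be matched; whether such nested terms should be dropped is not stated in the text.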
In one embodiment, the machine translation method further includes: obtaining a set of sample translation sentence pairs and an initial machine translation model; calculating the word count ratio of each sample translation sentence pair in the set, the word count ratio being the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filtering the set of sample translation sentence pairs according to the word count ratio to obtain a filtered set of sample translation sentence pairs; and training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain the target machine translation model.
Here, the initial machine translation model is a machine translation model whose parameters have not yet been trained. The number of source language words is the total number of source language words in a sample translation sentence pair, and the number of target language words is the total number of target language words in that pair. For example, in a sample translation sentence pair for Chinese-to-English translation, the number of source language words is the total number of Chinese words in the pair, and the number of target language words is the total number of English words. Conversely, in a sample translation sentence pair for English-to-Chinese translation, the number of source language words is the total number of English words, and the number of target language words is the total number of Chinese words. It should be noted that the sample translation sentence pairs include real translation sentence pairs and back-translation sentence pairs. A real translation sentence pair is obtained by translating original source language data into the corresponding original target language data. A back-translation sentence pair is obtained by translating original target language data into the corresponding original source language data. Training on real translation sentence pairs and back-translation sentence pairs together can improve the accuracy of the model.
Specifically, the target machine translation model must be trained before machine translation is performed. During model training, the terminal obtains the set of sample translation sentence pairs and the initial machine translation model, calculates the word count ratio of each sample translation sentence pair in the set, derives the data distribution of the word count ratios, filters the sample translation sentence pairs in the set using that distribution to obtain the filtered set of sample translation sentence pairs, and trains the initial machine translation model on the filtered set to obtain the target machine translation model.
In this embodiment, calculating the word count ratio of each sample translation sentence pair makes it possible to filter the set of sample translation sentence pairs by that ratio. Filtering out outlier samples improves the quality of translation model training and reduces irrelevant data noise. Training the initial machine translation model on the filtered set of sample translation sentence pairs yields a target machine translation model that supports accurate translation.
In one embodiment, obtaining the set of sample translation sentence pairs includes: obtaining a set of original translation sentence pairs, the set including original translation sentence pairs; segmenting the original source language data in each original translation sentence pair to obtain a word segmentation result, and counting the character length of each target language word in the original target language data of the pair; filtering the set of original translation sentence pairs according to the word segmentation result and the character lengths; and using the filtered set of original translation sentence pairs as the set of sample translation sentence pairs.
Here, the original translation sentence pairs include real translation sentence pairs and back-translation sentence pairs.
Specifically, when obtaining the set of sample translation sentence pairs, the terminal first obtains the set of original translation sentence pairs, segments the original source language data in each pair to obtain a word segmentation result, and counts the character length of each target language word in the original target language data of the pair. It filters out the original translation sentence pairs whose original source language data has a sentence length greater than a preset sentence length threshold and/or a word count greater than a preset word count threshold, and filters out the original translation sentence pairs whose original target language data contains a word with a character length greater than a preset character length threshold. The filtered set of original translation sentence pairs is used as the set of sample translation sentence pairs. The preset sentence length threshold, the preset word count threshold, and the preset character length threshold can each be set as needed and are not specifically limited in this embodiment.
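The length-based filter above might look like the following sketch. The function name `keep_pair` and all threshold values are hypothetical, since the text leaves the thresholds open; the sketch also treats the "and/or" condition as "or", which is one possible reading.

```python
from typing import List


def keep_pair(
    src_sentence: str,
    src_words: List[str],
    tgt_words: List[str],
    max_sentence_len: int,
    max_word_count: int,
    max_word_chars: int,
) -> bool:
    """Return True if an original translation sentence pair passes the filter."""
    # Source side: sentence length and segmented word count.
    if len(src_sentence) > max_sentence_len or len(src_words) > max_word_count:
        return False
    # Target side: no single word may exceed the character length threshold.
    if any(len(word) > max_word_chars for word in tgt_words):
        return False
    return True


# Illustrative thresholds only.
ok = keep_pair("患者有发热症状", ["患者", "有", "发热", "症状"],
               ["The", "patient", "has", "a", "fever"], 200, 100, 40)
print(ok)
```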
Further, when obtaining the set of original translation sentence pairs, the terminal first obtains the unmerged real translation sentence pairs and back-translation sentence pairs and merges them through a deduplication operation to obtain the set of original translation sentence pairs. The simHash algorithm can be used to deduplicate sentences. Its core idea is to compute a simHash value for each text to be deduplicated, split the simHash value into segments and build an inverted index on them, and run the deduplication operation in parallel over the hash values of each segment.
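The simHash idea can be sketched as follows. This is a simplified, single-threaded illustration under several assumptions not taken from the text: tokens are obtained by whitespace splitting, fingerprints are 64 bits built from MD5 digests, the fingerprint is split into 4 segments, and two sentences count as duplicates when their fingerprints differ in at most 3 bits. The parallel execution mentioned above is omitted.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Compute a simHash fingerprint from per-token hashes (unit weights)."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def dedup(sentences: List[str], max_distance: int = 3, segments: int = 4, bits: int = 64) -> List[str]:
    """Keep the first sentence of every near-duplicate group.

    With 4 segments and a distance of at most 3, two near-duplicates must
    share at least one identical segment, so the inverted index on
    (segment number, segment value) finds every candidate.
    """
    seg_bits = bits // segments
    mask = (1 << seg_bits) - 1
    index: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    kept = []
    for sentence in sentences:
        fp = simhash(sentence.split(), bits)
        keys = [(s, (fp >> (s * seg_bits)) & mask) for s in range(segments)]
        candidates = {c for key in keys for c in index[key]}
        if any(hamming(fp, c) <= max_distance for c in candidates):
            continue  # near-duplicate of a sentence already kept
        kept.append(sentence)
        for key in keys:
            index[key].append(fp)
    return kept


print(dedup(["the patient has a fever", "the patient has a fever"]))
```

For Chinese text, the whitespace split would need to be replaced by a word segmenter or character n-grams; that choice is outside what the text specifies.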
In this embodiment, the set of original translation sentence pairs is obtained, the original source language data in each pair is segmented to obtain a word segmentation result, the character length of each target language word in the original target language data of each pair is counted, and the set of original translation sentence pairs is filtered according to the word segmentation result and the character lengths. This filters out outlier samples, improves the quality of translation model training, and reduces irrelevant data noise.
In one embodiment, filtering the set of sample translation sentence pairs according to the word count ratio to obtain the filtered set of sample translation sentence pairs includes: computing statistics over the word count ratios to obtain the data distribution of the word count ratios; and filtering the set of sample translation sentence pairs according to the data distribution to obtain the filtered set of sample translation sentence pairs.
Specifically, by computing statistics over the word count ratios, the terminal obtains the data distribution of the word count ratios and can then filter the set of sample translation sentence pairs according to the data distribution and preset ratio thresholds to obtain the filtered set of sample translation sentence pairs. The preset ratio thresholds can be set as needed and are not specifically limited in this embodiment. Further, the preset ratio thresholds may include a first ratio threshold and a second ratio threshold, where the first ratio threshold is used to filter out sample translation sentence pairs whose word count ratio is too small, and the second ratio threshold is used to filter out sample translation sentence pairs whose word count ratio is too large.
In this embodiment, statistics over the word count ratios give the data distribution of the word count ratios, and the set of sample translation sentence pairs is filtered according to that distribution to obtain the filtered set of sample translation sentence pairs. This filters out outlier samples, improves the quality of translation model training, and reduces irrelevant data noise.
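The ratio-based filter can be sketched as below. The names `word_count_ratio` and `filter_by_ratio` and the threshold values are illustrative assumptions; in practice the two thresholds would be chosen after inspecting the distribution of ratios, which the text leaves to the implementer.

```python
from typing import List, Tuple

Pair = Tuple[List[str], List[str]]  # (source words, target words)


def word_count_ratio(src_words: List[str], tgt_words: List[str]) -> float:
    """Ratio of source language word count to target language word count."""
    return len(src_words) / len(tgt_words)


def filter_by_ratio(pairs: List[Pair], first_threshold: float, second_threshold: float) -> List[Pair]:
    """Drop pairs whose ratio is below the first or above the second threshold."""
    kept = []
    for src_words, tgt_words in pairs:
        if not tgt_words:
            continue  # an empty target side has no defined ratio
        ratio = word_count_ratio(src_words, tgt_words)
        if first_threshold <= ratio <= second_threshold:
            kept.append((src_words, tgt_words))
    return kept


pairs = [
    (["患者", "发热"], ["The", "patient", "has", "a", "fever"]),  # ratio 0.4
    (["好"], ["This", "is", "a", "very", "long", "unrelated", "target", "sentence", "indeed", "yes"]),  # ratio 0.1
]
print(len(filter_by_ratio(pairs, first_threshold=0.3, second_threshold=3.0)))
```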
In one embodiment, training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain the target machine translation model includes: training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain a machine translation model to be optimized; obtaining a translation evaluation source language data set, and translating the translation evaluation source language data in that set with the machine translation model to be optimized to obtain a translation evaluation target language data set; obtaining a set of translation evaluation sentence pairs from the translation evaluation source language data set and the translation evaluation target language data set; and training the machine translation model to be optimized on the filtered set of sample translation sentence pairs and the set of translation evaluation sentence pairs to obtain the target machine translation model.
Here, the translation evaluation source language data set is a data set used to evaluate translation models. For example, it may specifically be the evaluation set of an international machine translation competition.
Specifically, after training the initial machine translation model on the filtered set of sample translation sentence pairs, the terminal obtains a machine translation model to be optimized, which must still be optimized to obtain the target machine translation model. During optimization, the terminal first obtains the translation evaluation source language data set and translates the translation evaluation source language data in it with the machine translation model to be optimized to obtain the translation evaluation target language data set. It then pairs the translation evaluation source language data set with the translation evaluation target language data set to form the set of translation evaluation sentence pairs, and trains the machine translation model to be optimized on the filtered set of sample translation sentence pairs and the set of translation evaluation sentence pairs to obtain the target machine translation model.
Further, after obtaining the set of translation evaluation sentence pairs, the terminal first filters that set, then trains the machine translation model to be optimized on the filtered set of sample translation sentence pairs and the filtered set of translation evaluation sentence pairs to obtain a machine translation model to be updated. It uses the machine translation model to be updated to translate the translation evaluation source language data in the filtered set of translation evaluation sentence pairs, obtaining the corresponding translation evaluation target language data, and uses that data to update the filtered set of translation evaluation sentence pairs, that is, to replace the translations paired with the translation evaluation source language data in the set. It then trains the machine translation model to be updated on the filtered set of sample translation sentence pairs and the updated set of translation evaluation sentence pairs to obtain the target machine translation model.
Further, the set of translation evaluation sentence pairs is filtered in the same way as the set of original translation sentence pairs and the set of sample translation sentence pairs, so the details are not repeated here. When training the machine translation model to be updated on the filtered set of sample translation sentence pairs and the updated set of translation evaluation sentence pairs, the terminal may obtain the target machine translation model through iterative training. That is, the terminal trains the machine translation model to be updated on the filtered set of sample translation sentence pairs and the updated set of translation evaluation sentence pairs to obtain a new machine translation model to be updated, then returns to the step of using the machine translation model to be updated to translate the translation evaluation source language data in the filtered set of translation evaluation sentence pairs. This repeats until the number of iterations reaches a preset iteration threshold, and the target machine translation model is obtained from the latest machine translation model to be updated.
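The iterative procedure above can be outlined as follows. This is only a structural sketch: `train` and `translate` stand in for the real training and inference routines and are hypothetical callables supplied by the caller, the filtering of the evaluation pairs is omitted, and the function name `iterative_refine` is an assumption.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Pair = Tuple[str, str]  # (source sentence, target sentence)


def iterative_refine(
    model: Model,
    sample_pairs: List[Pair],
    eval_sources: List[str],
    train: Callable[[Model, List[Pair]], Model],
    translate: Callable[[Model, str], str],
    iterations: int,
) -> Model:
    """Alternate between re-translating the evaluation sources and retraining."""
    # Initial evaluation pairs come from the model to be optimized.
    eval_pairs = [(src, translate(model, src)) for src in eval_sources]
    for _ in range(iterations):
        # Train on the sample pairs plus the current evaluation pairs.
        model = train(model, sample_pairs + eval_pairs)
        # Refresh the evaluation targets with the newly trained model.
        eval_pairs = [(src, translate(model, src)) for src in eval_sources]
    return model


# Stub callables that only count training rounds, to show the control flow.
final = iterative_refine(
    model=0,
    sample_pairs=[("你好", "hello")],
    eval_sources=["谢谢"],
    train=lambda m, pairs: m + 1,
    translate=lambda m, s: f"{s}-v{m}",
    iterations=3,
)
print(final)
```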
Further, after obtaining the latest machine translation model to be updated, the terminal also obtains an in-domain professional corpus and trains the latest machine translation model to be updated on it to obtain the target machine translation model.
In this embodiment, the initial machine translation model is trained on the filtered set of sample translation sentence pairs to obtain the machine translation model to be optimized; the translation evaluation source language data set is obtained and translated with the machine translation model to be optimized to obtain the translation evaluation target language data set; and the set of translation evaluation sentence pairs is obtained from the translation evaluation source language data set and the translation evaluation target language data set. The machine translation model to be optimized can then be further trained on the filtered set of sample translation sentence pairs and the set of translation evaluation sentence pairs to obtain the target machine translation model.
In one embodiment, the machine translation method further includes: performing proper noun recognition on the source language data to be translated with a pre-trained proper noun recognition model, and expanding the preset proper noun dictionary according to the recognition result.
Specifically, because the number of proper nouns in the preset proper noun dictionary is limited, during machine translation the terminal performs proper noun recognition on the source language data to be translated with the pre-trained proper noun recognition model and expands the preset proper noun dictionary according to the recognition result, so that more proper nouns can be identified during matching. The pre-trained proper noun recognition model is obtained by training on a set of sample proper nouns carrying sequence labels.
具体的,预训练专有名词识别模型具体可以为BERT(Bidirectional Encoder Representation from Transformers,基于转换器的双向编码表征)+CRF(Conditional Random Field,条件随机场)模型,在输入待翻译源语言数据时,其会根据序列条件来将翻译的词进行条件概率的打散分布,通过BERT模型可实现对待翻译源语言数据的标注,识别出专有名词,在识别出来之后,通过接入CRF模型,可判断所识别出的专有名词是否准确。比如,当识别出某名词的标签为BIII时,若CRF模型可判断该名词的标签是否准确,即是否确实为BIII,从而可以实现对专有名词的识别。Specifically, the pre-trained proper noun recognition model can be BERT (Bidirectional Encoder Representation from Transformers, bidirectional encoding representation based on transformers) + CRF (Conditional Random Field, conditional random field) model. When inputting the source language data to be translated , which will break up the conditional probability distribution of the translated words according to the sequence conditions. Through the BERT model, the source language data to be translated can be annotated and the proper nouns can be identified. After being identified, the CRF model can be accessed. Determine whether the identified proper nouns are accurate. For example, when it is recognized that the label of a certain noun is BIII, if the CRF model can determine whether the label of the noun is accurate, that is, whether it is indeed BIII, the recognition of proper nouns can be achieved.
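The dictionary expansion described above can be illustrated with a minimal Python sketch. It assumes the recognition model has already produced one label per token and that the labels follow a B/I/O scheme (the text only mentions the sequence BIII, so the "O" label, the function names and the sample sentence are assumptions made for illustration). The BERT + CRF model itself is not reproduced here.

```python
def extract_terms(tokens, tags):
    """Collect token runs labelled B, I, I, ... as candidate proper nouns."""
    terms, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new term starts; close any term that was still open.
            if current:
                terms.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # Any other label (assumed "O") ends the current term.
            if current:
                terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


def expand_dictionary(dictionary, tokens, tags):
    """Add every recognized term to the (hypothetical) proper noun set."""
    dictionary.update(extract_terms(tokens, tags))
    return dictionary


tokens = list("急性心肌梗死患者")
tags = ["O", "O", "B", "I", "I", "I", "O", "O"]  # "心肌梗死" is labelled BIII
proper_nouns = expand_dictionary({"高血压"}, tokens, tags)
```

Tokens are joined without a separator because the example is Chinese; a space-separated language would need a different join.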
In one embodiment, as shown in FIG. 2, the machine translation method of the present application is illustrated by a flow chart. The method specifically includes the following steps:
Step 202: obtain an original translation sentence pair set, the original translation sentence pair set including original translation sentence pairs;
Step 204: perform word segmentation on the original source language data in each original translation sentence pair to obtain a word segmentation result, and count the character length of each target language word in the original target language data of the original translation sentence pair;
Step 206: filter the original translation sentence pair set according to the word segmentation result and the character length;
Step 208: take the filtered original translation sentence pair set as the sample translation sentence pair set;
Step 210: obtain an initial machine translation model;
Step 212: calculate the word count ratio of each sample translation sentence pair in the sample translation sentence pair set, the word count ratio being the ratio of the number of source language words to the number of target language words in the sample translation sentence pair;
Step 214: perform statistics on the word count ratios to obtain the data distribution corresponding to the word count ratios;
Step 216: filter the sample translation sentence pair set according to the data distribution to obtain a filtered sample translation sentence pair set;
Step 218: train the initial machine translation model on the filtered sample translation sentence pair set to obtain a machine translation model to be optimized;
Step 220: obtain a translation evaluation source language data set, and translate the translation evaluation source language in the translation evaluation source language data set with the machine translation model to be optimized to obtain a translation evaluation target language data set;
Step 222: obtain a translation evaluation sentence pair set from the translation evaluation source language data set and the translation evaluation target language data set;
Step 224: train the machine translation model to be optimized on the filtered sample translation sentence pair set and the translation evaluation sentence pair set to obtain the target machine translation model;
Step 226: obtain the source language data to be translated;
Step 228: take a word in the source language data to be translated as the word to be matched;
Step 230: when a matching word corresponding to the word to be matched exists in the preset proper noun dictionary, obtain the next word following the word to be matched in the source language data to be translated;
Step 232: combine the word to be matched with that next word to obtain a phrase to be matched;
Step 234: when a matching word corresponding to the phrase to be matched exists in the preset proper noun dictionary, obtain the next word following the phrase to be matched in the source language data to be translated;
Step 236: combine the phrase to be matched with that next word to obtain a new phrase to be matched, and return to step 234;
Step 238: once no matching word corresponding to the latest phrase to be matched exists in the preset proper noun dictionary, delete from the latest phrase to be matched the word that was appended last, to obtain the domain proper noun corresponding to the word to be matched;
Step 240: determine the domain proper nouns in the source language data to be translated according to the domain proper nouns corresponding to the words to be matched;
Step 242: input the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, and input the source language data to be translated into the target machine translation model for translation to obtain translated target language data;
Step 244: replace the corresponding translation results in the translated target language data with the proper noun translation results to obtain the machine translation result.
In one embodiment, the machine translation method of the present application is described by taking its application to Chinese-to-English translation in the medical field as an example. As shown in FIG. 3, the method specifically includes the following steps:
First, the terminal obtains real translation sentence pairs (that is, Chinese-English sentence pairs). After obtaining them, the terminal uses a pre-trained back-translation model (that is, an English-to-Chinese machine translation model) to back-translate the real translation sentence pairs and obtain back-translated sentence pairs, and takes the real translation sentence pairs and the back-translated sentence pairs together as the original translation sentence pair set. Here, the terminal inputs the English data of a real translation sentence pair into the pre-trained back-translation model to obtain a Chinese translation of that English data, and takes the English data and the Chinese translation as the back-translated sentence pair corresponding to the real translation sentence pair. Back-translating the data can improve the accuracy of the model to a certain extent. When pre-training the back-translation model, the terminal may process the real translation sentence pairs to obtain back-translation sample pairs for training, and then train the English-to-Chinese machine translation model on those sample pairs. The data processing may be as follows: the source language data (Chinese) of a real translation sentence pair is taken as the target language data and the target language data (English) is taken as the source language data, giving translation samples to be filtered; these samples are then filtered to obtain the back-translation samples. For example, the untrained back-translation model may be based on the transformer-big model. During training, the untrained back-translation model converts the input words into word vectors through two layers, token embedding and position embedding; the encoded word vectors then flow into the two-layer network inside the encoder, and the relevance of the text is finally obtained through matrix transformation training, yielding the back-translation model. It should be noted that the translation sample pairs to be filtered are filtered in the same way as the original translation sentence pairs and the sample translation sentence pairs in the above embodiments, which is not repeated here.
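As a rough illustration of the data flow in this paragraph, the following Python sketch swaps the two sides of each real pair to build training samples for the reverse model and then appends back-translated pairs to the real ones. The names `make_reverse_samples`, `back_translate` and `reverse_translate` are hypothetical, the reverse model is represented by a plain callable, and the orientation of the synthetic pairs (synthetic Chinese as source, real English as target) is an assumption, since the text does not spell it out.

```python
def make_reverse_samples(real_pairs):
    """Swap (Chinese, English) into (English, Chinese) for training the reverse model."""
    return [(en, zh) for zh, en in real_pairs]


def back_translate(real_pairs, reverse_translate):
    """Return the real pairs plus one back-translated pair per real pair.

    `reverse_translate` stands in for the trained English-to-Chinese model.
    """
    synthetic = [(reverse_translate(en), en) for _, en in real_pairs]
    return list(real_pairs) + synthetic


real_pairs = [("患者发热", "The patient has a fever")]
reverse_samples = make_reverse_samples(real_pairs)
# A placeholder "model" that only marks its input, for demonstration purposes.
original_set = back_translate(real_pairs, lambda en: "<zh:" + en + ">")
```

The filtering that the paragraph applies to the swapped samples is left out of this sketch.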
After obtaining the original translation sentence pair set, the terminal can use the original translation sentence pairs for model training to obtain the machine translation model to be optimized, that is, Chinese-to-English machine translation model training. Before model training, the terminal also needs to process (that is, filter) the real translation sentence pairs (Chinese-English sentence pairs) in the original translation sentence pair set to obtain the filtered sample translation sentence pair set used for training. The filtering may proceed as follows. The terminal performs word segmentation on the original Chinese data in the original translation sentence pair set and filters out the original translation sentence pairs whose original Chinese data has a sentence length greater than 200 or a word count greater than 150. It then counts the character length of each English word in the original English data of the pairs remaining after this first filtering, and filters out the pairs whose original English data contains a word longer than 40 characters, obtaining the sample translation sentence pair set. Next, it calculates the word count ratio of each sample translation sentence pair, that is, the value of (number of source Chinese words / number of target English words), performs statistical analysis with a Gaussian distribution to obtain the data distribution corresponding to the word count ratios, and, according to that distribution, filters out the sample translation sentence pairs whose word count ratio is smaller than a first ratio threshold or larger than a second ratio threshold, obtaining the filtered sample translation sentence pair set. This multi-stage filtering removes outliers and irrelevant data noise and thereby improves the quality of translation model training.
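The filtering stages above can be sketched in Python as follows. The limits 200, 150 and 40 come from the text; everything else is an assumption made for illustration: the Chinese side is taken to be already segmented into a word list, English words are split on whitespace, pairs with an empty English side are dropped, and the two ratio thresholds are set to the mean plus or minus `k` standard deviations with `k = 2.0`, since the text does not give their values.

```python
import statistics

MAX_SRC_CHARS = 200      # sentence length limit from the text
MAX_SRC_WORDS = 150      # word count limit from the text
MAX_TGT_WORD_CHARS = 40  # longest allowed English word, from the text


def length_filter(pairs):
    """Drop pairs whose source is too long or whose target has an overlong word."""
    kept = []
    for src_words, tgt_text in pairs:
        src_chars = sum(len(word) for word in src_words)
        tgt_words = tgt_text.split()
        if src_chars > MAX_SRC_CHARS or len(src_words) > MAX_SRC_WORDS:
            continue
        # Assumption: a pair with no target words is useless and is dropped.
        if not tgt_words or max(len(w) for w in tgt_words) > MAX_TGT_WORD_CHARS:
            continue
        kept.append((src_words, tgt_text))
    return kept


def ratio_filter(pairs, k=2.0):
    """Keep pairs whose source/target word count ratio lies within mean +/- k * std.

    Expects the output of length_filter, so every target has at least one word.
    """
    if not pairs:
        return []
    ratios = [len(src) / len(tgt.split()) for src, tgt in pairs]
    mean = statistics.mean(ratios)
    std = statistics.pstdev(ratios)
    low, high = mean - k * std, mean + k * std  # first and second ratio thresholds
    return [pair for pair, ratio in zip(pairs, ratios) if low <= ratio <= high]


def filter_pairs(pairs, k=2.0):
    """Apply the length filter, then the ratio filter."""
    return ratio_filter(length_filter(pairs), k)
```

A call such as `filter_pairs(pairs)` returns the pairs that survive both stages; a smaller `k` filters more aggressively.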
After obtaining the filtered sample translation sentence pair set, the terminal trains the initial machine translation model on it, tuning the learning rate, the batch size, the number of steps and other related parameters to suitable values, and obtains the machine translation model to be optimized, thereby completing the Chinese-to-English machine translation model training.
After obtaining the machine translation model to be optimized, the terminal obtains the filtered medical-field evaluation set (in-domain data) of the international machine translation competition, that is, the translation evaluation source language data set, and uses it to fine-tune the machine translation model to be optimized so as to optimize it. Here, fine-tuning means freezing a series of parameters from the previous large-batch model training, such as the related losses and parameter weights, and then performing small-batch model training on the basis of these parameters. It should be noted that the medical-field evaluation set of the international machine translation competition is filtered in the same way as the original translation sentence pairs and the sample translation sentence pairs in the above embodiments, which is not repeated here.
When using the translation evaluation source language data set to fine-tune the machine translation model to be optimized, the terminal first uses the machine translation model to be optimized to translate the translation evaluation Chinese in the translation evaluation Chinese set (that is, data translation of monolingual Chinese data) to obtain a translation evaluation English set. It obtains the translation evaluation sentence pair set from the translation evaluation Chinese set and the translation evaluation English set, filters the translation evaluation sentence pair set, and trains the machine translation model to be optimized on the filtered sample translation sentence pair set and the filtered translation evaluation sentence pair set to obtain the target machine translation model. The translation evaluation sentence pair set is filtered in the same way as the original translation sentence pairs and the sample translation sentence pairs in the above embodiments, which is not repeated here. During training, preferably, the number of training steps is one million and the batch size is three thousand.
Further, when training the machine translation model to be optimized to obtain the target machine translation model, the terminal first obtains a machine translation model to be updated by training the machine translation model to be optimized. It then uses the machine translation model to be updated to translate the translation evaluation source language in the filtered translation evaluation sentence pair set, obtaining the translation evaluation target language corresponding to the translation evaluation source language, and uses this translation evaluation target language to update the filtered translation evaluation sentence pair set, that is, to replace the translation results corresponding to the translation evaluation source language in that set. Finally, it trains the machine translation model to be updated on the filtered sample translation sentence pair set and the updated translation evaluation sentence pair set to obtain the target machine translation model, that is, the medical-field machine translation model.
Further, when training the machine translation model to be updated on the filtered sample translation sentence pair set and the updated translation evaluation sentence pair set, the terminal may obtain the target machine translation model through iterative training. That is, the terminal trains the machine translation model to be updated on the filtered sample translation sentence pair set and the updated translation evaluation sentence pair set to obtain a new machine translation model to be updated, and then returns to the step of using the machine translation model to be updated to translate the translation evaluation source language in the filtered translation evaluation sentence pair set, until the number of iterations (N in FIG. 3) reaches a preset iteration threshold, which yields the latest machine translation model to be updated. The terminal then obtains an in-domain professional corpus (medical-field data) and trains the latest machine translation model to be updated on it (that is, fine-tunes the model with medical-field data) to obtain the target machine translation model (the medical-field machine translation model).
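The iteration described above can be outlined with a short Python sketch. It is only a schematic of the control flow under stated assumptions: `train_fn` and `translate_fn` are hypothetical callables standing in for model training and model inference, the model can be any object those callables accept, and the function name and argument names are invented for illustration.

```python
def iterative_self_training(model, sample_pairs, eval_sources,
                            train_fn, translate_fn, n_iters,
                            domain_pairs=None):
    """Re-translate the evaluation sources and retrain for n_iters rounds."""
    for _ in range(n_iters):
        # Refresh the target side of the evaluation pairs with the current model.
        eval_pairs = [(src, translate_fn(model, src)) for src in eval_sources]
        # Retrain on the filtered samples plus the refreshed evaluation pairs.
        model = train_fn(model, list(sample_pairs) + eval_pairs)
    if domain_pairs:
        # Final fine-tuning on the in-domain professional corpus.
        model = train_fn(model, list(domain_pairs))
    return model
```

In this outline `n_iters` plays the role of the preset iteration threshold N; how the callables train or translate is deliberately left open.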
After obtaining the target machine translation model, the terminal obtains the Chinese text to be translated, takes the words in it as words to be matched, and performs forward maximum matching against the medical data professional dictionary to obtain the domain proper noun corresponding to each word to be matched. That is, when a matching word corresponding to the word to be matched exists in the preset proper noun dictionary (the medical data professional dictionary), the terminal obtains the next word following the word to be matched in the Chinese text to be translated and combines the two to obtain a phrase to be matched. When a matching word corresponding to the phrase to be matched exists in the preset proper noun dictionary, the terminal obtains the next word following the phrase to be matched and combines the phrase with that word to obtain a new phrase to be matched, and then returns to the step of checking whether a matching word corresponding to the phrase to be matched exists in the preset proper noun dictionary. This continues until no matching word corresponding to the latest phrase to be matched exists in the preset proper noun dictionary, at which point the terminal deletes from the latest phrase to be matched the word that was appended last, obtaining the domain proper noun corresponding to the word to be matched.
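The matching loop can be sketched in Python as below. This is a minimal reading of the procedure, with several assumptions: the text is already segmented into a word list, "a matching word exists" is interpreted as exact membership in a set of dictionary entries, combined words are joined without a separator (suitable for Chinese), and scanning resumes after a matched term. The function names and the sample dictionary are hypothetical.

```python
def match_term(words, start, dictionary):
    """Return how many words the term starting at `start` spans (0 if none)."""
    phrase = words[start]
    if phrase not in dictionary:
        return 0
    length = 1
    while start + length < len(words):
        candidate = phrase + words[start + length]
        if candidate not in dictionary:
            # Same effect as appending the next word and then deleting it.
            break
        phrase = candidate
        length += 1
    return length


def find_domain_terms(words, dictionary):
    """Scan left to right and collect the longest dictionary match at each position."""
    terms, i = [], 0
    while i < len(words):
        n = match_term(words, i, dictionary)
        if n:
            terms.append("".join(words[i:i + n]))
            i += n  # assumption: continue after the matched term
        else:
            i += 1
    return terms


words = ["患者", "患有", "心肌", "梗死"]
terms = find_domain_terms(words, {"心肌", "心肌梗死"})
```

Under this literal reading a multi-word term is only found if each of its shorter prefixes is also in the dictionary; a prefix tree would remove that requirement, but that is a different technique from the one described.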
After obtaining the domain proper nouns corresponding to the words to be matched, the terminal determines from them the domain proper nouns in the Chinese text to be translated. It inputs the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, inputs the Chinese text to be translated into the target machine translation model for translation to obtain translated target language data, and replaces the corresponding translation results in the translated target language data with the proper noun translation results to obtain the machine translation result (that is, the translation result output).
Further, the terminal may obtain the medical data professional dictionary by means of entity recognition. Specifically, the terminal obtains a sample proper noun set carrying sequence annotations and trains on it to obtain the pre-trained proper noun recognition model. When performing machine translation, the terminal can then use the pre-trained proper noun recognition model to recognize proper nouns in the source language data to be translated and expand the preset proper noun dictionary according to the recognition result, so that more proper nouns can be identified during matching. Specifically, the pre-trained proper noun recognition model may be a BERT (Bidirectional Encoder Representations from Transformers) + CRF (Conditional Random Field) model. When the source language data to be translated is input, the model distributes conditional probabilities over the words according to the sequence conditions. The BERT model labels the source language data to be translated and identifies the proper nouns; the CRF model connected after it then judges whether the identified proper nouns are accurate. For example, when the label sequence recognized for a noun is BIII, the CRF model judges whether that label sequence is accurate, that is, whether it really is BIII, thereby realizing the recognition of proper nouns.
Further, the terminal may obtain the translated target language data through multi-model fusion. In this case, the target machine translation model may include at least two sub machine translation models; that is, the terminal may train multiple sub machine translation models with different dropout rates and use them to translate the source language data to be translated. During translation, the terminal inputs the source language data to be translated into each sub machine translation model and obtains the translation result corresponding to that sub machine translation model. Each translation result includes, for every word in the source language data to be translated, the word probability of the predicted corresponding word. After obtaining these word probabilities, the terminal sorts the word probabilities that the sub machine translation models output for the same word, determines from the sorting result the optimal prediction result for that word, that is, the optimal translation result, and obtains the translated target language data from the optimal translation results of all words. After sorting, the terminal determines the maximum word probability for each word and takes the word corresponding to that maximum probability as the optimal prediction result.
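The fusion rule can be illustrated with a few lines of Python. The sketch assumes that every sub-model returns a list of (word, probability) tuples of the same length and that entries at the same index refer to the same source word; this alignment, the data layout and the function name `fuse_translations` are assumptions for illustration, not details given in the text.

```python
def fuse_translations(model_outputs):
    """Pick, position by position, the word with the highest probability.

    model_outputs: one list per sub-model, each a list of (word, probability).
    """
    fused = []
    for candidates in zip(*model_outputs):
        # Choosing the maximum is equivalent to sorting and taking the top entry.
        best_word, _ = max(candidates, key=lambda item: item[1])
        fused.append(best_word)
    return fused


outputs = [
    [("the", 0.9), ("sick", 0.4)],     # sub-model trained with one dropout rate
    [("a", 0.5), ("patient", 0.8)],    # sub-model trained with another
]
fused = fuse_translations(outputs)
```

If the sub-models produce outputs of different lengths, `zip` silently truncates to the shortest one, so a real system would need an explicit alignment step.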
It should be understood that although the steps in the flow charts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flow charts of the above embodiments may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different moments, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a machine translation apparatus for implementing the machine translation method described above. The solution provided by the apparatus is similar to the one described for the method, so for the specific limitations in the one or more machine translation apparatus embodiments provided below, reference may be made to the limitations on the machine translation method above, which are not repeated here.
In one embodiment, as shown in FIG. 4, a machine translation apparatus is provided, including an acquisition module 402, a matching module 404, a translation module 406 and a replacement module 408, wherein:
the acquisition module 402 is configured to obtain the source language data to be translated;
the matching module 404 is configured to perform forward maximum matching on the source language data to be translated and determine the domain proper nouns in the source language data to be translated;
the translation module 406 is configured to input the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, and to input the source language data to be translated into the target machine translation model for translation to obtain translated target language data, the target machine translation model being obtained by training on sample data;
the replacement module 408 is configured to replace the corresponding translation results in the translated target language data with the proper noun translation results to obtain the machine translation result.
The above machine translation apparatus obtains the source language data to be translated and performs forward maximum matching on it, which makes it possible to determine the domain proper nouns in the source language data to be translated. It inputs the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, inputs the source language data to be translated into the target machine translation model for translation to obtain translated target language data, and replaces the corresponding translation results in the translated target language data with the proper noun translation results. This improves the accuracy with which the target machine translation model translates domain proper nouns and yields an accurate machine translation result.
在一个实施例中,匹配模块还用于将待翻译源语言数据中单词作为待匹配单词,对待匹配单词进行正向最大匹配,得到与待匹配单词对应的领域专有名词,根据与待匹配单词对应的领域专有名词,确定待翻译源语言数据中的领域专有名词。In one embodiment, the matching module is also used to use words in the source language data to be translated as words to be matched, perform forward maximum matching on the words to be matched, and obtain domain proper nouns corresponding to the words to be matched. According to the words to be matched, Corresponding domain proper nouns determine the domain proper nouns in the source language data to be translated.
在一个实施例中,匹配模块还用于当预设专有名词词典中存在与待匹配单词对应的匹配单词时,获取待翻译源语言数据中待匹配单词对应的下一单词,联合待匹配单词和待匹配单词对应的下一单词,得到待匹配词组,当预设专有名词词典中存在与待匹配词组对应的匹配单词时,获取待翻译源语言数据中待匹配词组对应的下一单词,联合待匹配词组和待匹配词组对应的下一单词,得到新的待匹配词组,返回当预设专有名词词典中存在与待匹配词组对应的匹配单词时,获取待翻译源语言数据中待匹配词组对应的下一单词的步骤,直到预设专有名词词典中不存在与最新的待匹配词组对应的匹配单词为止,从最新的待匹配词组中删除最新的待匹配词组对应的下一单词,得到与待匹配单词对应的领域专有名词。In one embodiment, the matching module is also used to obtain the next word corresponding to the word to be matched in the source language data to be translated when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, and combine the words to be matched The next word corresponding to the word to be matched is obtained to obtain the phrase to be matched. When there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, the next word corresponding to the phrase to be matched in the source language data to be translated is obtained. Combine the phrase to be matched and the next word corresponding to the phrase to be matched to obtain a new phrase to be matched. Return when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary and obtain the matching word in the source language data to be translated. The step of the next word corresponding to the phrase is until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, and then the next word corresponding to the latest phrase to be matched is deleted from the latest phrase to be matched, Get the domain-specific noun corresponding to the word to be matched.
In one embodiment, the machine translation apparatus further includes a model training module. The model training module is configured to obtain a set of sample translation sentence pairs and an initial machine translation model; calculate the word count ratio of each sample translation sentence pair in the set, the word count ratio being the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filter the set of sample translation sentence pairs according to the word count ratio to obtain a filtered set of sample translation sentence pairs; and train the initial machine translation model on the filtered set to obtain the target machine translation model.
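As a rough sketch of the word count ratio filter, the snippet below computes source words divided by target words for each pair and keeps pairs whose ratio falls inside a range. The bounds `low` and `high`, the handling of empty targets, and the function names are assumptions made for illustration; this paragraph does not state specific thresholds.

```python
from typing import List, Tuple

TokenPair = Tuple[List[str], List[str]]  # (source words, target words)


def word_count_ratio(pair: TokenPair) -> float:
    """Number of source language words divided by number of target language words."""
    src, tgt = pair
    return len(src) / len(tgt)


def filter_by_ratio(pairs: List[TokenPair], low: float, high: float) -> List[TokenPair]:
    """Keep pairs whose ratio lies in [low, high] (hypothetical bounds)."""
    kept: List[TokenPair] = []
    for pair in pairs:
        if not pair[1]:
            # Assumption: a pair with an empty target side is discarded.
            continue
        if low <= word_count_ratio(pair) <= high:
            kept.append(pair)
    return kept
```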
In one embodiment, the model training module is further configured to obtain a set of original translation sentence pairs, the set including original translation sentence pairs; perform word segmentation on the original source language data in each original translation sentence pair to obtain a word segmentation result, and count the character length of each target language word in the original target language data of the pair; filter the set of original translation sentence pairs according to the word segmentation result and the character lengths; and use the filtered set of original translation sentence pairs as the set of sample translation sentence pairs.
In one embodiment, the model training module is further configured to compile statistics on the word count ratios to obtain the data distribution of the word count ratio, and to filter the set of sample translation sentence pairs according to that data distribution to obtain the filtered set of sample translation sentence pairs.
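One possible way to derive filter bounds from the distribution of ratios is shown below. It assumes, purely for illustration, that pairs within `k` standard deviations of the mean ratio are kept; this paragraph does not specify which statistic is used, and `ratio_bounds` and `keep_typical` are hypothetical names.

```python
import statistics
from typing import List, Tuple


def ratio_bounds(ratios: List[float], k: float = 2.0) -> Tuple[float, float]:
    """Bounds of mean +/- k population standard deviations (an assumed rule)."""
    mean = statistics.mean(ratios)
    spread = statistics.pstdev(ratios)
    return mean - k * spread, mean + k * spread


def keep_typical(ratios: List[float], k: float = 2.0) -> List[int]:
    """Indices of sentence pairs whose ratio lies within the bounds."""
    low, high = ratio_bounds(ratios, k)
    return [i for i, r in enumerate(ratios) if low <= r <= high]
```

Percentile cut-offs would be an equally plausible reading and are less sensitive to extreme outliers.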
In one embodiment, the model training module is further configured to train the initial machine translation model on the filtered set of sample translation sentence pairs to obtain a machine translation model to be optimized; obtain a translation evaluation source language dataset; translate the translation evaluation source language data in that dataset with the machine translation model to be optimized to obtain a translation evaluation target language dataset; obtain a set of translation evaluation sentence pairs from the translation evaluation source language dataset and the translation evaluation target language dataset; and train the machine translation model to be optimized on the filtered set of sample translation sentence pairs together with the set of translation evaluation sentence pairs to obtain the target machine translation model.
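The two training stages above can be outlined as a short data-flow sketch. The model is represented as a plain callable from source text to target text, and `train` is a hypothetical caller-supplied function; neither corresponds to a specific library API.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]           # (source sentence, target sentence)
Model = Callable[[str], str]     # translates one source sentence
TrainFn = Callable[[Model, List[Pair]], Model]  # hypothetical training routine


def two_stage_training(
    initial_model: Model,
    filtered_pairs: List[Pair],
    eval_sources: List[str],
    train: TrainFn,
) -> Model:
    # Stage 1: train on the filtered sample pairs.
    to_optimize = train(initial_model, filtered_pairs)
    # Translate the evaluation source data with the intermediate model
    # and pair each source sentence with its translation.
    eval_pairs = [(src, to_optimize(src)) for src in eval_sources]
    # Stage 2: continue training on both sets of pairs.
    return train(to_optimize, filtered_pairs + eval_pairs)
```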
Each module in the above machine translation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in Figure 5. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface, the display unit, and the input device are connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with an external terminal; the wireless communication may be implemented through WIFI, a mobile cellular network, NFC (near field communication), or other technologies. The computer program, when executed by the processor, implements a machine translation method. The display unit of the computer device is used to form a visible image and may be a display screen, a projection device, or a virtual reality imaging device; the display screen may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, mouse, or the like.
Those skilled in the art will understand that the structure shown in Figure 5 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program. When executing the computer program, the processor implements the following steps: obtaining source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in the source language data to be translated; inputting the domain proper nouns into a target machine translation model for translation to obtain proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain translated target language data, the target machine translation model being obtained by training on sample data; and replacing the corresponding translation results in the translated target language data with the proper noun translation results to obtain the machine translation result.
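The overall flow of these steps can be sketched as below. The sketch leaves open how the in-sentence rendering of a proper noun is located in the translated output: `locate` is a hypothetical callback supplied by the caller, `translate` stands in for the target machine translation model, and the terms are assumed to come from the forward maximum matching step.

```python
from typing import Callable, List, Optional, Tuple

Translate = Callable[[str], str]
# Hypothetical: given the translated sentence and a source term, return the
# (start, end) character span of that term's rendering, or None if not found.
Locate = Callable[[str, str], Optional[Tuple[int, int]]]


def translate_with_terms(
    source: str,
    terms: List[str],
    translate: Translate,
    locate: Locate,
) -> str:
    output = translate(source)  # translate the whole sentence
    for term in terms:
        term_translation = translate(term)  # translate the proper noun alone
        span = locate(output, term)
        if span is None:
            continue  # nothing to replace for this term
        start, end = span
        # Replace the in-sentence rendering with the standalone translation.
        output = output[:start] + term_translation + output[end:]
    return output
```

Spans are looked up again after each replacement, so earlier substitutions that change the string length do not invalidate later ones.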
In one embodiment, when executing the computer program, the processor further implements the following steps: taking each word in the source language data to be translated as a word to be matched; performing forward maximum matching on the word to be matched to obtain the domain proper noun corresponding to that word; and determining the domain proper nouns in the source language data to be translated from the domain proper nouns corresponding to the words to be matched.
In one embodiment, when executing the computer program, the processor further implements the following steps: when the preset proper noun dictionary contains a matching word corresponding to the word to be matched, obtaining the next word after the word to be matched in the source language data to be translated, and joining the word to be matched with that next word to obtain a phrase to be matched; when the preset proper noun dictionary contains a matching word corresponding to the phrase to be matched, obtaining the next word after the phrase to be matched in the source language data to be translated, and joining the phrase to be matched with that next word to obtain a new phrase to be matched; returning to the step of obtaining the next word after the phrase to be matched when the preset proper noun dictionary contains a matching word corresponding to the phrase to be matched, until the preset proper noun dictionary contains no matching word corresponding to the latest phrase to be matched; and deleting from the latest phrase to be matched the next word most recently appended to it, to obtain the domain proper noun corresponding to the word to be matched.
In one embodiment, when executing the computer program, the processor further implements the following steps: obtaining a set of sample translation sentence pairs and an initial machine translation model; calculating the word count ratio of each sample translation sentence pair in the set, the word count ratio being the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filtering the set of sample translation sentence pairs according to the word count ratio to obtain a filtered set of sample translation sentence pairs; and training the initial machine translation model on the filtered set to obtain the target machine translation model.
In one embodiment, when executing the computer program, the processor further implements the following steps: obtaining a set of original translation sentence pairs, the set including original translation sentence pairs; performing word segmentation on the original source language data in each original translation sentence pair to obtain a word segmentation result, and counting the character length of each target language word in the original target language data of the pair; filtering the set of original translation sentence pairs according to the word segmentation result and the character lengths; and using the filtered set of original translation sentence pairs as the set of sample translation sentence pairs.
In one embodiment, when executing the computer program, the processor further implements the following steps: compiling statistics on the word count ratios to obtain the data distribution of the word count ratio; and filtering the set of sample translation sentence pairs according to that data distribution to obtain the filtered set of sample translation sentence pairs.
In one embodiment, when executing the computer program, the processor further implements the following steps: training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain a machine translation model to be optimized; obtaining a translation evaluation source language dataset; translating the translation evaluation source language data in that dataset with the machine translation model to be optimized to obtain a translation evaluation target language dataset; obtaining a set of translation evaluation sentence pairs from the translation evaluation source language dataset and the translation evaluation target language dataset; and training the machine translation model to be optimized on the filtered set of sample translation sentence pairs together with the set of translation evaluation sentence pairs to obtain the target machine translation model.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps: obtaining source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in the source language data to be translated; inputting the domain proper nouns into a target machine translation model for translation to obtain proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain translated target language data, the target machine translation model being obtained by training on sample data; and replacing the corresponding translation results in the translated target language data with the proper noun translation results to obtain the machine translation result.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: taking each word in the source language data to be translated as a word to be matched; performing forward maximum matching on the word to be matched to obtain the domain proper noun corresponding to that word; and determining the domain proper nouns in the source language data to be translated from the domain proper nouns corresponding to the words to be matched.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: when the preset proper noun dictionary contains a matching word corresponding to the word to be matched, obtaining the next word after the word to be matched in the source language data to be translated, and joining the word to be matched with that next word to obtain a phrase to be matched; when the preset proper noun dictionary contains a matching word corresponding to the phrase to be matched, obtaining the next word after the phrase to be matched in the source language data to be translated, and joining the phrase to be matched with that next word to obtain a new phrase to be matched; returning to the step of obtaining the next word after the phrase to be matched when the preset proper noun dictionary contains a matching word corresponding to the phrase to be matched, until the preset proper noun dictionary contains no matching word corresponding to the latest phrase to be matched; and deleting from the latest phrase to be matched the next word most recently appended to it, to obtain the domain proper noun corresponding to the word to be matched.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: obtaining a set of sample translation sentence pairs and an initial machine translation model; calculating the word count ratio of each sample translation sentence pair in the set, the word count ratio being the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filtering the set of sample translation sentence pairs according to the word count ratio to obtain a filtered set of sample translation sentence pairs; and training the initial machine translation model on the filtered set to obtain the target machine translation model.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: obtaining a set of original translation sentence pairs, the set including original translation sentence pairs; performing word segmentation on the original source language data in each original translation sentence pair to obtain a word segmentation result, and counting the character length of each target language word in the original target language data of the pair; filtering the set of original translation sentence pairs according to the word segmentation result and the character lengths; and using the filtered set of original translation sentence pairs as the set of sample translation sentence pairs.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: compiling statistics on the word count ratios to obtain the data distribution of the word count ratio; and filtering the set of sample translation sentence pairs according to that data distribution to obtain the filtered set of sample translation sentence pairs.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain a machine translation model to be optimized; obtaining a translation evaluation source language dataset; translating the translation evaluation source language data in that dataset with the machine translation model to be optimized to obtain a translation evaluation target language dataset; obtaining a set of translation evaluation sentence pairs from the translation evaluation source language dataset and the translation evaluation target language dataset; and training the machine translation model to be optimized on the filtered set of sample translation sentence pairs together with the set of translation evaluation sentence pairs to obtain the target machine translation model.
Specifically, the computer-readable storage medium may be non-volatile or volatile.
In one embodiment, a computer program product is provided, including a computer program. When executed by a processor, the computer program implements the following steps: obtaining source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in the source language data to be translated; inputting the domain proper nouns into a target machine translation model for translation to obtain proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain translated target language data, the target machine translation model being obtained by training on sample data; and replacing the corresponding translation results in the translated target language data with the proper noun translation results to obtain the machine translation result.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: taking each word in the source language data to be translated as a word to be matched; performing forward maximum matching on the word to be matched to obtain the domain proper noun corresponding to that word; and determining the domain proper nouns in the source language data to be translated from the domain proper nouns corresponding to the words to be matched.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: when the preset proper noun dictionary contains a matching word corresponding to the word to be matched, obtaining the next word after the word to be matched in the source language data to be translated, and joining the word to be matched with that next word to obtain a phrase to be matched; when the preset proper noun dictionary contains a matching word corresponding to the phrase to be matched, obtaining the next word after the phrase to be matched in the source language data to be translated, and joining the phrase to be matched with that next word to obtain a new phrase to be matched; returning to the step of obtaining the next word after the phrase to be matched when the preset proper noun dictionary contains a matching word corresponding to the phrase to be matched, until the preset proper noun dictionary contains no matching word corresponding to the latest phrase to be matched; and deleting from the latest phrase to be matched the next word most recently appended to it, to obtain the domain proper noun corresponding to the word to be matched.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: obtaining a set of sample translation sentence pairs and an initial machine translation model; calculating the word count ratio of each sample translation sentence pair in the set, the word count ratio being the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filtering the set of sample translation sentence pairs according to the word count ratio to obtain a filtered set of sample translation sentence pairs; and training the initial machine translation model on the filtered set to obtain the target machine translation model.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: obtaining a set of original translation sentence pairs, the set including original translation sentence pairs; performing word segmentation on the original source language data in each original translation sentence pair to obtain a word segmentation result, and counting the character length of each target language word in the original target language data of the pair; filtering the set of original translation sentence pairs according to the word segmentation result and the character lengths; and using the filtered set of original translation sentence pairs as the set of sample translation sentence pairs.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: compiling statistics on the word count ratios to obtain the data distribution of the word count ratio; and filtering the set of sample translation sentence pairs according to that data distribution to obtain the filtered set of sample translation sentence pairs.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain a machine translation model to be optimized; obtaining a translation evaluation source language dataset; translating the translation evaluation source language data in that dataset with the machine translation model to be optimized to obtain a translation evaluation target language dataset; obtaining a set of translation evaluation sentence pairs from the translation evaluation source language dataset and the translation evaluation target language dataset; and training the machine translation model to be optimized on the filtered set of sample translation sentence pairs together with the set of translation evaluation sentence pairs to obtain the target machine translation model.
It should be noted that the data involved in the present application (including but not limited to data used for analysis) is information and data authorized by the user or fully authorized by all parties, and that the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to a memory, database, or other medium used in the embodiments provided in the present application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided in the present application may include at least one of a relational database and a non-relational database. Non-relational databases may include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments provided in the present application may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
The technical features of the above embodiments may be combined in any manner. For brevity, not every possible combination of these technical features is described; however, any combination of them that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make a number of variations and improvements without departing from the concept of the present application, and all of these fall within its scope of protection. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (20)

  1. A machine translation method, wherein the method comprises:
    obtaining source-language data to be translated;
    performing forward maximum matching on the source-language data to be translated to determine domain-specific proper nouns in the source-language data to be translated;
    inputting the domain-specific proper nouns into a target machine translation model for translation to obtain a proper-noun translation result, and inputting the source-language data to be translated into the target machine translation model for translation to obtain translated target-language data, the target machine translation model being obtained by training on sample data;
    replacing the corresponding translation result in the translated target-language data with the proper-noun translation result to obtain a machine translation result.
  2. The method according to claim 1, wherein performing forward maximum matching on words in the source-language data to be translated to determine the domain-specific proper nouns in the source-language data to be translated comprises:
    taking a word in the source-language data to be translated as a word to be matched;
    performing forward maximum matching on the word to be matched to obtain a domain-specific proper noun corresponding to the word to be matched;
    determining the domain-specific proper nouns in the source-language data to be translated according to the domain-specific proper noun corresponding to the word to be matched.
  3. The method according to claim 2, wherein performing forward maximum matching on the word to be matched to obtain the domain-specific proper noun corresponding to the word to be matched comprises:
    when a preset proper-noun dictionary contains a matching word corresponding to the word to be matched, obtaining the next word corresponding to the word to be matched in the source-language data to be translated;
    combining the word to be matched with the next word corresponding to the word to be matched to obtain a phrase to be matched;
    when the preset proper-noun dictionary contains a matching word corresponding to the phrase to be matched, obtaining the next word corresponding to the phrase to be matched in the source-language data to be translated;
    combining the phrase to be matched with the next word corresponding to the phrase to be matched to obtain a new phrase to be matched, and returning to the step of obtaining, when the preset proper-noun dictionary contains a matching word corresponding to the phrase to be matched, the next word corresponding to the phrase to be matched in the source-language data to be translated;
    until the preset proper-noun dictionary contains no matching word corresponding to the latest phrase to be matched, deleting from the latest phrase to be matched the next word corresponding to the latest phrase to be matched, to obtain the domain-specific proper noun corresponding to the word to be matched.
  4. The method according to claim 1, wherein the method further comprises:
    obtaining a set of sample translation sentence pairs and an initial machine translation model;
    calculating a word-count ratio for the sample translation sentence pairs in the set of sample translation sentence pairs, the word-count ratio being the ratio of the number of source-language words to the number of target-language words in a sample translation sentence pair;
    filtering the set of sample translation sentence pairs according to the word-count ratio to obtain a filtered set of sample translation sentence pairs;
    training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain the target machine translation model.
  5. The method according to claim 4, wherein obtaining the set of sample translation sentence pairs comprises:
    obtaining a set of original translation sentence pairs, the set of original translation sentence pairs comprising original translation sentence pairs;
    performing word segmentation on the original source-language data in the original translation sentence pair to obtain a word segmentation result, and counting the character length of each target-language word in the original target-language data in the original translation sentence pair;
    filtering the set of original translation sentence pairs according to the word segmentation result and the character length;
    taking the filtered set of original translation sentence pairs as the set of sample translation sentence pairs.
  6. The method according to claim 4, wherein filtering the set of sample translation sentence pairs according to the word-count ratio to obtain the filtered set of sample translation sentence pairs comprises:
    compiling statistics on the word-count ratio to obtain a data distribution corresponding to the word-count ratio;
    filtering the set of sample translation sentence pairs according to the data distribution to obtain the filtered set of sample translation sentence pairs.
  7. The method according to claim 4, wherein training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain the target machine translation model comprises:
    training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain a machine translation model to be optimized;
    obtaining a translation-evaluation source-language data set, and translating the translation-evaluation source-language data in the translation-evaluation source-language data set with the machine translation model to be optimized to obtain a translation-evaluation target-language data set;
    obtaining a set of translation-evaluation sentence pairs from the translation-evaluation source-language data set and the translation-evaluation target-language data set;
    training the machine translation model to be optimized on the filtered set of sample translation sentence pairs and the set of translation-evaluation sentence pairs to obtain the target machine translation model.
  8. A machine translation apparatus, wherein the apparatus comprises:
    an acquisition module, configured to obtain source-language data to be translated;
    a matching module, configured to perform forward maximum matching on the source-language data to be translated to determine domain-specific proper nouns in the source-language data to be translated;
    a translation module, configured to input the domain-specific proper nouns into a target machine translation model for translation to obtain a proper-noun translation result, and to input the source-language data to be translated into the target machine translation model for translation to obtain translated target-language data, the target machine translation model being obtained by training on sample data;
    a replacement module, configured to replace the corresponding translation result in the translated target-language data with the proper-noun translation result to obtain a machine translation result.
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements a machine translation method comprising:
    obtaining source-language data to be translated;
    performing forward maximum matching on the source-language data to be translated to determine domain-specific proper nouns in the source-language data to be translated;
    inputting the domain-specific proper nouns into a target machine translation model for translation to obtain a proper-noun translation result, and inputting the source-language data to be translated into the target machine translation model for translation to obtain translated target-language data, the target machine translation model being obtained by training on sample data;
    replacing the corresponding translation result in the translated target-language data with the proper-noun translation result to obtain a machine translation result.
  10. The computer device according to claim 9, wherein, when the processor executes the computer-readable instructions, performing forward maximum matching on words in the source-language data to be translated to determine the domain-specific proper nouns in the source-language data to be translated comprises:
    taking a word in the source-language data to be translated as a word to be matched;
    performing forward maximum matching on the word to be matched to obtain a domain-specific proper noun corresponding to the word to be matched;
    determining the domain-specific proper nouns in the source-language data to be translated according to the domain-specific proper noun corresponding to the word to be matched.
  11. The computer device according to claim 10, wherein, when the processor executes the computer-readable instructions, performing forward maximum matching on the word to be matched to obtain the domain-specific proper noun corresponding to the word to be matched comprises:
    when a preset proper-noun dictionary contains a matching word corresponding to the word to be matched, obtaining the next word corresponding to the word to be matched in the source-language data to be translated;
    combining the word to be matched with the next word corresponding to the word to be matched to obtain a phrase to be matched;
    when the preset proper-noun dictionary contains a matching word corresponding to the phrase to be matched, obtaining the next word corresponding to the phrase to be matched in the source-language data to be translated;
    combining the phrase to be matched with the next word corresponding to the phrase to be matched to obtain a new phrase to be matched, and returning to the step of obtaining, when the preset proper-noun dictionary contains a matching word corresponding to the phrase to be matched, the next word corresponding to the phrase to be matched in the source-language data to be translated;
    until the preset proper-noun dictionary contains no matching word corresponding to the latest phrase to be matched, deleting from the latest phrase to be matched the next word corresponding to the latest phrase to be matched, to obtain the domain-specific proper noun corresponding to the word to be matched.
  12. The computer device according to claim 9, wherein the machine translation method implemented when the processor executes the computer-readable instructions further comprises:
    obtaining a set of sample translation sentence pairs and an initial machine translation model;
    calculating a word-count ratio for the sample translation sentence pairs in the set of sample translation sentence pairs, the word-count ratio being the ratio of the number of source-language words to the number of target-language words in a sample translation sentence pair;
    filtering the set of sample translation sentence pairs according to the word-count ratio to obtain a filtered set of sample translation sentence pairs;
    training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain the target machine translation model.
  13. The computer device according to claim 12, wherein, when the processor executes the computer-readable instructions, obtaining the set of sample translation sentence pairs comprises:
    obtaining a set of original translation sentence pairs, the set of original translation sentence pairs comprising original translation sentence pairs;
    performing word segmentation on the original source-language data in the original translation sentence pair to obtain a word segmentation result, and counting the character length of each target-language word in the original target-language data in the original translation sentence pair;
    filtering the set of original translation sentence pairs according to the word segmentation result and the character length;
    taking the filtered set of original translation sentence pairs as the set of sample translation sentence pairs.
  14. The computer device according to claim 12, wherein, when the processor executes the computer-readable instructions, filtering the set of sample translation sentence pairs according to the word-count ratio to obtain the filtered set of sample translation sentence pairs comprises:
    compiling statistics on the word-count ratio to obtain a data distribution corresponding to the word-count ratio;
    filtering the set of sample translation sentence pairs according to the data distribution to obtain the filtered set of sample translation sentence pairs.
  15. A computer-readable storage medium having a computer program stored thereon, wherein the computer-readable instructions, when executed by a processor, implement a machine translation method comprising:
    obtaining source-language data to be translated;
    performing forward maximum matching on the source-language data to be translated to determine domain-specific proper nouns in the source-language data to be translated;
    inputting the domain-specific proper nouns into a target machine translation model for translation to obtain a proper-noun translation result, and inputting the source-language data to be translated into the target machine translation model for translation to obtain translated target-language data, the target machine translation model being obtained by training on sample data;
    replacing the corresponding translation result in the translated target-language data with the proper-noun translation result to obtain a machine translation result.
  16. The computer-readable storage medium according to claim 15, wherein, when the computer-readable instructions are executed by a processor, performing forward maximum matching on words in the source-language data to be translated to determine the domain-specific proper nouns in the source-language data to be translated comprises:
    taking a word in the source-language data to be translated as a word to be matched;
    performing forward maximum matching on the word to be matched to obtain a domain-specific proper noun corresponding to the word to be matched;
    determining the domain-specific proper nouns in the source-language data to be translated according to the domain-specific proper noun corresponding to the word to be matched.
  17. The computer-readable storage medium according to claim 16, wherein, when the computer-readable instructions are executed by a processor, performing forward maximum matching on the word to be matched to obtain the domain-specific proper noun corresponding to the word to be matched comprises:
    when a preset proper-noun dictionary contains a matching word corresponding to the word to be matched, obtaining the next word corresponding to the word to be matched in the source-language data to be translated;
    combining the word to be matched with the next word corresponding to the word to be matched to obtain a phrase to be matched;
    when the preset proper-noun dictionary contains a matching word corresponding to the phrase to be matched, obtaining the next word corresponding to the phrase to be matched in the source-language data to be translated;
    combining the phrase to be matched with the next word corresponding to the phrase to be matched to obtain a new phrase to be matched, and returning to the step of obtaining, when the preset proper-noun dictionary contains a matching word corresponding to the phrase to be matched, the next word corresponding to the phrase to be matched in the source-language data to be translated;
    until the preset proper-noun dictionary contains no matching word corresponding to the latest phrase to be matched, deleting from the latest phrase to be matched the next word corresponding to the latest phrase to be matched, to obtain the domain-specific proper noun corresponding to the word to be matched.
  18. The computer-readable storage medium according to claim 15, wherein the machine translation method implemented when the computer-readable instructions are executed by a processor further comprises:
    obtaining a set of sample translation sentence pairs and an initial machine translation model;
    calculating a word-count ratio for the sample translation sentence pairs in the set of sample translation sentence pairs, the word-count ratio being the ratio of the number of source-language words to the number of target-language words in a sample translation sentence pair;
    filtering the set of sample translation sentence pairs according to the word-count ratio to obtain a filtered set of sample translation sentence pairs;
    training the initial machine translation model on the filtered set of sample translation sentence pairs to obtain the target machine translation model.
  19. The computer-readable storage medium according to claim 18, wherein, when the computer-readable instructions are executed by a processor, obtaining the set of sample translation sentence pairs comprises:
    obtaining a set of original translation sentence pairs, the set of original translation sentence pairs comprising original translation sentence pairs;
    performing word segmentation on the original source-language data in the original translation sentence pair to obtain a word segmentation result, and counting the character length of each target-language word in the original target-language data in the original translation sentence pair;
    filtering the set of original translation sentence pairs according to the word segmentation result and the character length;
    taking the filtered set of original translation sentence pairs as the set of sample translation sentence pairs.
  20. The computer-readable storage medium according to claim 18, wherein, when the computer-readable instructions are executed by a processor, filtering the set of sample translation sentence pairs according to the word-count ratio to obtain the filtered set of sample translation sentence pairs comprises:
    compiling statistics on the word-count ratio to obtain a data distribution corresponding to the word-count ratio;
    filtering the set of sample translation sentence pairs according to the data distribution to obtain the filtered set of sample translation sentence pairs.
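
The word-by-word dictionary extension recited in claims 3, 11 and 17 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the applicant's implementation: the function name `find_domain_terms`, the whitespace tokenization, and the toy dictionary are all hypothetical, and the dictionary is assumed to list every prefix of a multi-word term (otherwise the extension stops before the full term is reached). The sketch covers only term detection; translating the detected terms and substituting them into the translated sentence are not shown.

```python
from typing import List, Set, Tuple


def find_domain_terms(tokens: List[str], term_dict: Set[str]) -> List[Tuple[int, int]]:
    """Scan left to right and return (start, end) token spans of dictionary terms.

    Hypothetical helper: entries in term_dict are space-joined token strings.
    """
    spans = []
    i = 0
    while i < len(tokens):
        # A match can only start at a word that is itself a dictionary entry.
        if tokens[i] not in term_dict:
            i += 1
            continue
        end = i + 1  # exclusive end of the current phrase
        # Append the next word while the extended phrase is still an entry.
        while end < len(tokens) and " ".join(tokens[i:end + 1]) in term_dict:
            end += 1
        # The extension that failed is discarded; tokens[i:end] is the term.
        spans.append((i, end))
        i = end  # resume scanning after the matched term
    return spans


# Assumed toy dictionary; prefixes of the multi-word term are listed explicitly.
term_dict = {"net", "net present", "net present value", "bond"}
tokens = "the net present value of the bond".split()
terms = [" ".join(tokens[s:e]) for s, e in find_domain_terms(tokens, term_dict)]
print(terms)
```

For the toy input, the detected spans cover "net present value" and "bond". Source text without spaces between words, such as Chinese, would first need a word segmenter to produce `tokens`; that step is outside this sketch.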
PCT/CN2022/122036 2022-06-14 2022-09-28 Machine translation method and apparatus, and computer device and storage medium WO2023240839A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210667744.3 2022-06-14
CN202210667744.3A CN114997190A (en) 2022-06-14 2022-06-14 Machine translation method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023240839A1 true WO2023240839A1 (en) 2023-12-21

Family

ID=83035859

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/122036 WO2023240839A1 (en) 2022-06-14 2022-09-28 Machine translation method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN114997190A (en)
WO (1) WO2023240839A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997190A (en) * 2022-06-14 2022-09-02 平安科技(深圳)有限公司 Machine translation method, device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5161105A (en) * 1989-06-30 1992-11-03 Sharp Corporation Machine translation apparatus having a process function for proper nouns with acronyms
WO2009002141A1 (en) * 2007-06-27 2008-12-31 Mimos Berhad A system amd method of language translation
CN110543644A (en) * 2019-09-04 2019-12-06 语联网(武汉)信息技术有限公司 Machine translation method and device containing term translation and electronic equipment
CN112329482A (en) * 2020-10-28 2021-02-05 北京嘀嘀无限科技发展有限公司 Machine translation method, device, electronic equipment and readable storage medium
CN114997190A (en) * 2022-06-14 2022-09-02 平安科技(深圳)有限公司 Machine translation method, device, computer equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100082324A1 (en) * 2008-09-30 2010-04-01 Microsoft Corporation Replacing terms in machine translation
CN114330375A (en) * 2021-11-12 2022-04-12 中译语通科技股份有限公司 Term translation method and system based on fixed paradigm
CN114462427A (en) * 2022-01-26 2022-05-10 四川语言桥信息技术有限公司 Machine translation method and device based on term protection

Also Published As

Publication number Publication date
CN114997190A (en) 2022-09-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22946511

Country of ref document: EP

Kind code of ref document: A1