WO2023240839A1

WO2023240839A1 - Machine translation method and apparatus, and computer device and storage medium

Info

Publication number: WO2023240839A1
Application number: PCT/CN2022/122036
Authority: WO
Inventors: 贺傲飞
Original assignee: 平安科技（深圳）有限公司
Priority date: 2022-06-14
Filing date: 2022-09-28
Publication date: 2023-12-21
Also published as: CN114997190A

Abstract

The present application relates to the technical fields of artificial intelligence and speech processing. Provided are a machine translation method and apparatus, and a computer device and a storage medium. The method comprises: acquiring source language data to be translated; performing forward maximum matching on said source language data, and determining a field proper noun in said source language data; inputting the field proper noun into a target machine translation model for translation, so as to obtain a proper noun translation result, and inputting said source language data into the target machine translation model for translation, so as to obtain translation target language data, wherein the target machine translation model is obtained by means of performing training by using sample data; and replacing a corresponding translation result in the translation target language data with the proper noun translation result, so as to obtain a machine translation result. By using the method, the accuracy of translating a field proper noun by means of a target machine translation model can be improved, thereby obtaining a machine translation result with accurate translation.

Description

Machine translation methods, devices, computer equipment and storage media

This application claims priority to the Chinese patent application filed with the China Patent Office on June 14, 2022, with application number 202210667744.3 and the title "Machine Translation Method, Device, Computer Equipment and Storage Medium", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the technical fields of artificial intelligence and speech processing, and in particular to a machine translation method, device, computer equipment and storage medium.

Background

With the development of artificial intelligence technology, machine translation technology based on neural networks has emerged. Machine translation refers to the process of using computers to convert one natural language (the source language) into another natural language (the target language). The core of neural machine translation is a deep neural network with a large number of nodes (neurons), which can automatically learn translation knowledge from a corpus. After a sentence in one language is vectorized, it is passed layer by layer through the network and converted into a representation that the computer can "understand". Then, through multiple layers of complex transformations, a translation in another language is generated, realizing an "understand the language, then generate the translation" approach to translation.

The inventor realized that in traditional technology, machine translation usually uses an encoder-decoder structure to model variable-length input sentences. The encoder realizes the "understanding" of the source language sentences and forms a floating-point number vector of a specific dimension. The decoder then generates a translation of the target language word by word based on this vector.

However, the traditional method has the problem of inaccurate translation when applied to professional fields where domain proper nouns exist.

Summary of the invention

Based on this, and in response to the above technical problem, it is necessary to provide a machine translation method, device, computer equipment, computer-readable storage medium and computer program product that can achieve accurate translation.

In a first aspect, this application provides a machine translation method, which method includes:

Obtain the source language data to be translated; perform forward maximum matching on the source language data to be translated, and determine the domain proper nouns in the source language data to be translated; input the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and input the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, where the target machine translation model is obtained by training on sample data; replace the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.

In a second aspect, this application also provides a machine translation device, which includes:

The acquisition module is used to obtain the source language data to be translated; the matching module is used to perform forward maximum matching on the source language data to be translated and determine the domain proper nouns in the source language data to be translated; the translation module is used to input the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and to input the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, where the target machine translation model is obtained by training on sample data; the replacement module is used to replace the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.

In a third aspect, this application also provides a computer device. The computer device includes a memory and a processor. The memory stores a computer program. When the processor executes the computer program, it implements the following steps:

In a fourth aspect, this application also provides a computer-readable storage medium. The computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by the processor, the following steps are implemented:

In a fifth aspect, this application also provides a computer program product. The computer program product includes a computer program that implements the following steps when executed by a processor:

The above machine translation method, device, computer equipment, storage medium and computer program product obtain the source language data to be translated and perform forward maximum matching on it, which makes it possible to determine the domain proper nouns in the source language data to be translated. The domain proper nouns are input into the target machine translation model for translation to obtain the proper noun translation results, and the source language data to be translated is input into the target machine translation model for translation to obtain the translation target language data. Replacing the corresponding translation results in the translation target language data with the proper noun translation results improves the accuracy with which the target machine translation model translates domain proper nouns and yields an accurate machine translation result.

Description of the drawings

Figure 1 is a schematic flowchart of a machine translation method in one embodiment;

Figure 2 is a schematic flow chart of a machine translation method in another embodiment;

Figure 3 is a schematic flowchart of a machine translation method in yet another embodiment;

Figure 4 is a structural block diagram of a machine translation device in one embodiment;

Figure 5 is an internal structure diagram of a computer device in one embodiment.

Detailed description

In order to make the purpose, technical solutions and advantages of the present application more clear, the present application will be further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not used to limit the present application.

In one embodiment, as shown in Figure 1, a machine translation method is provided. This embodiment is illustrated by applying the method to a terminal. It can be understood that the method can also be applied to a server, or to a system that includes a terminal and a server and is implemented through interaction between the terminal and the server. The terminal can be, but is not limited to, various personal computers, laptops, smartphones, tablets, Internet of Things devices and portable wearable devices. The Internet of Things devices can be smart speakers, smart TVs, smart air conditioners, smart vehicle-mounted devices, etc. Portable wearable devices can be smart watches, smart bracelets, head-mounted devices, etc. The server can be implemented as an independent server or as a server cluster composed of multiple servers. In this embodiment, the method includes the following steps:

Step 102: Obtain the source language data to be translated.

Among them, the source language data to be translated refers to the data that needs to be translated. For example, in machine translation from Chinese to English, the source language data to be translated refers to Chinese. For another example, in machine translation from English to Chinese, the source language data to be translated refers to English.

Specifically, when machine translation is required, the terminal will obtain the source language data to be translated.

Step 104: Perform forward maximum matching on the source language data to be translated, and determine the domain-specific nouns in the source language data to be translated.

Here, forward maximum matching refers to extracting, from the source language data to be translated, the longest phrase that matches an entry in the preset proper noun dictionary, by successively extending the candidate phrase. Domain proper nouns are nouns that are specific to a field. For example, in the medical field, domain proper nouns can be disease names, drug names, etc.

Specifically, the terminal will segment the source language data to be translated to obtain the words in it, use those words as the words to be matched, and use the preset proper noun dictionary to perform forward maximum matching on the words to be matched, obtaining the domain proper nouns corresponding to the words to be matched. Based on these, it determines the domain proper nouns in the source language data to be translated. The preset proper noun dictionary is a dictionary composed in advance of the proper nouns of a field. For example, in the medical field, the preset proper noun dictionary is a dictionary composed of proper nouns such as disease names and drug names.

Step 106: Input the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and input the source language data to be translated into the target machine translation model for translation to obtain the translation target language data. The target machine translation model is obtained by training on sample data.

Here, the target machine translation model is a model that is obtained by training on sample data and can be used for machine translation; it translates the source language data to be translated into the translation target language data. The sample data may specifically be a set of sample translation sentence pairs. A sample translation sentence pair is a sentence pair consisting of sample source language data and sample target language data, where the sample target language data is the translation of the sample source language data. The proper noun translation results are the translations of the domain proper nouns output by the target machine translation model. The translation target language data is the translation output by the target machine translation model for the source language data to be translated.

Specifically, the terminal will mark the domain proper nouns in the source language data to be translated to obtain the annotation results, input the domain proper nouns into the target machine translation model, which outputs the proper noun translation results, and input the source language data to be translated into the target machine translation model to obtain the translation target language data.

Further, the target machine translation model can include at least two sub-machine translation models; that is, the terminal can train multiple sub-machine translation models with different random loss rates and use them together to translate the source language data to be translated. When translating, the terminal inputs the source language data to be translated into each sub-machine translation model to obtain the translation result corresponding to that sub-model. Each translation result includes, for each word in the source language data to be translated, the word probability of the predicted corresponding word. After obtaining the word probabilities, the terminal sorts the word probabilities predicted for the same word across the translation results output by the sub-machine translation models and, based on the ranking, determines the optimal prediction result for that word, that is, the optimal translation result. The translation target language data is then assembled from the optimal translation results of all words. After sorting, the terminal determines the maximum word probability for each word and takes the predicted word with the maximum word probability as the optimal prediction result.
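The selection rule described above can be sketched as follows. This is a minimal illustration and not the applicant's implementation: it assumes each sub-model returns one (word, probability) prediction per position and that all sub-model outputs have the same length, which a real decoder would not guarantee. The name `pick_best_words` and the data layout are hypothetical.

```python
from typing import List, Sequence, Tuple

# One prediction is (predicted word, word probability). Hypothetical layout.
Prediction = Tuple[str, float]


def pick_best_words(model_outputs: Sequence[Sequence[Prediction]]) -> List[str]:
    """For each position, keep the word with the highest probability
    among the predictions of all sub-models.

    Assumption: every sub-model emits exactly one prediction per position
    and all outputs are position-aligned.
    """
    if not model_outputs:
        return []
    length = len(model_outputs[0])
    if any(len(output) != length for output in model_outputs):
        raise ValueError("all sub-model outputs must have the same length")

    best_words = []
    # zip(*...) groups the predictions of all sub-models for one position.
    for candidates in zip(*model_outputs):
        word, _probability = max(candidates, key=lambda p: p[1])
        best_words.append(word)
    return best_words


outputs = [
    [("the", 0.9), ("medicine", 0.4)],  # sub-model trained with one rate
    [("a", 0.5), ("drug", 0.7)],        # sub-model trained with another rate
]
print(pick_best_words(outputs))
```

Taking the maximum directly is equivalent to sorting the probabilities and picking the top entry, which is how the text describes the step.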

Step 108: Replace the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.

Specifically, after obtaining the proper noun translation results and the translation target language data, the terminal will, based on the annotation results of the source language data to be translated, replace the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.
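A minimal sketch of the replacement step is given below. The text does not specify how the annotation results locate the span to be replaced in the translated sentence, so this sketch simply assumes they are available as pairs of (the rendering of the term inside the full-sentence translation, the translation of the term on its own). Both that representation and the name `replace_terms` are hypothetical.

```python
from typing import List, Tuple


def replace_terms(target_text: str, annotations: List[Tuple[str, str]]) -> str:
    """Substitute the stand-alone translation of each domain proper noun
    for its rendering inside the full-sentence translation.

    Assumption: `annotations` holds (in_context_rendering,
    proper_noun_translation) pairs derived from the annotation results.
    """
    result = target_text
    for in_context_rendering, proper_noun_translation in annotations:
        if in_context_rendering:  # skip empty spans, which would match everywhere
            result = result.replace(in_context_rendering, proper_noun_translation)
    return result


sentence = "The patient takes two armor double guanidine daily."
print(replace_terms(sentence, [("two armor double guanidine", "metformin")]))
```

Plain string replacement is the simplest possible choice here; a production system would more likely replace by character or token offsets so that unrelated occurrences of the same string are left alone.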

The above machine translation method obtains the source language data to be translated and performs forward maximum matching on it, which makes it possible to determine the domain proper nouns in the source language data to be translated. The domain proper nouns are input into the target machine translation model for translation to obtain the proper noun translation results, and the source language data to be translated is input into the target machine translation model for translation to obtain the translation target language data. Replacing the corresponding translation results in the translation target language data with the proper noun translation results improves the accuracy with which the target machine translation model translates domain proper nouns and yields an accurate machine translation result.

In one embodiment, performing forward maximum matching on the source language data to be translated and determining the domain proper nouns in it includes: taking the words in the source language data to be translated as the words to be matched; performing forward maximum matching on the words to be matched to obtain the domain proper nouns corresponding to the words to be matched; and determining the domain proper nouns in the source language data to be translated based on the domain proper nouns corresponding to the words to be matched.

Specifically, the terminal will segment the source language data to be translated to obtain the words in it, use those words as the words to be matched, and compare each word to be matched with the preset proper noun dictionary to determine whether the dictionary contains a matching word corresponding to the word to be matched. When the preset proper noun dictionary contains a matching word corresponding to the word to be matched, the terminal obtains the next word after the word to be matched in the source language data to be translated, combines the word to be matched with that next word to obtain a phrase to be matched, and continues forward maximum matching by comparing the phrase to be matched with the preset proper noun dictionary, thereby obtaining the domain proper noun corresponding to the word to be matched.

Since the domain proper nouns corresponding to different words to be matched may overlap, after obtaining the domain proper nouns corresponding to the words to be matched, the terminal will deduplicate them to obtain the domain proper nouns in the source language data to be translated.

In this embodiment, by using the words in the source language data to be translated as the words to be matched and performing forward maximum matching on them, the domain proper nouns corresponding to the words to be matched can be obtained, so that the domain proper nouns in the source language data to be translated can be determined from them.

In one embodiment, performing forward maximum matching on the word to be matched and obtaining the domain proper noun corresponding to the word to be matched includes: when the preset proper noun dictionary contains a matching word corresponding to the word to be matched, obtaining the next word after the word to be matched in the source language data to be translated; combining the word to be matched with that next word to obtain a phrase to be matched; when the preset proper noun dictionary contains a matching word corresponding to the phrase to be matched, obtaining the next word after the phrase to be matched in the source language data to be translated; combining the phrase to be matched with that next word to obtain a new phrase to be matched, and returning to the step of obtaining the next word after the phrase to be matched when the preset proper noun dictionary contains a matching word corresponding to the phrase to be matched; and, once the preset proper noun dictionary contains no matching word corresponding to the latest phrase to be matched, deleting the most recently appended next word from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched.

Specifically, when performing forward maximum matching on the word to be matched, the terminal will compare the word to be matched with the preset proper noun dictionary. When the dictionary contains a matching word corresponding to the word to be matched, the terminal obtains the next word after the word to be matched in the source language data to be translated and combines the two to obtain a phrase to be matched. It then compares the phrase to be matched with the preset proper noun dictionary. When the dictionary contains a matching word corresponding to the phrase to be matched, the terminal obtains the next word after the phrase to be matched in the source language data to be translated and combines the phrase with that word to obtain a new phrase to be matched. This step is repeated for as long as the dictionary contains a matching word corresponding to the phrase to be matched. Once the dictionary contains no matching word corresponding to the latest phrase to be matched, the most recently appended next word is deleted from the latest phrase to be matched, and the remaining phrase is the domain proper noun corresponding to the word to be matched.

In this embodiment, when the preset proper noun dictionary contains a matching word corresponding to the word to be matched, the word to be matched is combined with the next word to obtain a phrase to be matched, and the phrase to be matched continues to be compared with the preset proper noun dictionary. In this way, the domain proper noun corresponding to the word to be matched can be obtained through forward maximum matching.
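The extension loop described in this embodiment can be sketched as below. The sketch follows the procedure as written: a phrase is only extended while the extended phrase is itself found in the dictionary, so it assumes the dictionary contains every intermediate prefix needed to reach a longer term. Representing dictionary entries as tuples of words, and the name `forward_max_match`, are illustrative choices and not taken from the application.

```python
from typing import List, Sequence, Set, Tuple

Phrase = Tuple[str, ...]


def forward_max_match(tokens: Sequence[str], dictionary: Set[Phrase]) -> List[Phrase]:
    """Return the deduplicated domain proper nouns found in `tokens`.

    Each token is tried as a word to be matched. While the current phrase
    plus the next token is still a dictionary entry, the phrase is extended;
    the last phrase that matched is kept as the domain proper noun.
    """
    found: List[Phrase] = []
    seen: Set[Phrase] = set()
    for start in range(len(tokens)):
        phrase: Phrase = (tokens[start],)
        if phrase not in dictionary:
            continue  # this word starts no dictionary entry
        end = start + 1
        while end < len(tokens) and phrase + (tokens[end],) in dictionary:
            phrase = phrase + (tokens[end],)
            end += 1
        if phrase not in seen:  # deduplicate repeated terms
            seen.add(phrase)
            found.append(phrase)
    return found


# Hypothetical dictionary; prefixes of the long term are included on purpose.
terms = {("type",), ("type", "2"), ("type", "2", "diabetes"), ("metformin",)}
words = ["type", "2", "diabetes", "patients", "take", "metformin"]
print(forward_max_match(words, terms))
```

Checking the extended phrase before committing to it has the same effect as appending the next word and deleting it again when the lookup fails, which is how the text phrases the final step.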

In one embodiment, the machine translation method further includes: obtaining a sample translation sentence pair set and an initial machine translation model; calculating the word number ratio of each sample translation sentence pair in the sample translation sentence pair set, where the word number ratio is the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filtering the sample translation sentence pair set according to the word number ratio to obtain a filtered sample translation sentence pair set; and training the initial machine translation model on the filtered sample translation sentence pair set to obtain the target machine translation model.

Here, the initial machine translation model is a machine translation model whose parameters have not yet been trained. The number of source language words is the total number of source language words in the sample translation sentence pair, and the number of target language words is the total number of target language words in the sample translation sentence pair. For example, in a sample translation sentence pair translated from Chinese to English, the number of source language words is the total number of Chinese words in the pair, and the number of target language words is the total number of English words in the pair. For another example, in a sample translation sentence pair translated from English to Chinese, the number of source language words is the total number of English words in the pair, and the number of target language words is the total number of Chinese words in the pair. It should be noted that the sample translation sentence pairs include real translation sentence pairs and back-translation sentence pairs. Real translation sentence pairs are the pairs obtained by translating original source language data into the corresponding original target language data. Back-translation sentence pairs are the pairs obtained by translating original target language data back into the corresponding original source language data. Training on real translation sentence pairs and back-translation sentence pairs at the same time can improve the accuracy of the model.

Specifically, before performing machine translation, the target machine translation model needs to be trained. During model training, the terminal will obtain the sample translation sentence pair set and the initial machine translation model, calculate the word number ratio of each sample translation sentence pair in the set, derive the data distribution of the word number ratios, use the data distribution to filter the sample translation sentence pairs in the set to obtain the filtered sample translation sentence pair set, and train the initial machine translation model on the filtered sample translation sentence pair set to obtain the target machine translation model.

In this embodiment, by calculating the word number ratio of the sample translation sentence pairs in the sample translation sentence pair set, the word number ratio can be used to filter the set, removing deviating samples, improving the quality of translation model training, and reducing irrelevant data noise. Training the initial machine translation model on the filtered sample translation sentence pair set then yields a target machine translation model that can support accurate translation.

In one embodiment, obtaining a set of sample translation sentence pairs includes: obtaining a set of original translation sentence pairs, which includes original translation sentence pairs; performing word segmentation on the original source language data in the original translation sentence pairs to obtain a word segmentation result, and counting the character length of each target language word in the original target language data in the original translation sentence pairs; filtering the original translation sentence pair set according to the word segmentation results and the character lengths; and using the filtered original translation sentence pair set as the sample translation sentence pair set.

Here, the original translation sentence pairs include real translation sentence pairs and back-translation sentence pairs.

Specifically, when obtaining the set of sample translation sentence pairs, the terminal will first obtain the set of original translation sentence pairs, perform word segmentation on the original source language data in the original translation sentence pairs to obtain the word segmentation results, and count the character length of each target language word in the original target language data in the original translation sentence pairs. It then filters out the original translation sentence pairs whose original source language data has a sentence length greater than the preset sentence length threshold and/or a number of words greater than the preset word number threshold, and filters out the original translation sentence pairs whose original target language data contains a word with a character length greater than the preset character length threshold. The original translation sentence pairs that remain after filtering are used as the sample translation sentence pair set. The preset sentence length threshold, the preset word number threshold, and the preset character length threshold can all be set as needed and are not specifically limited in this embodiment.
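The three threshold checks can be illustrated with the short sketch below. The threshold values are arbitrary placeholders (the text leaves them open), the word segmenter is passed in as a callable because the application does not name one, and target words are assumed to be whitespace-separated. The name `filter_by_length` is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (original source sentence, original target sentence)


def filter_by_length(
    pairs: Sequence[Pair],
    segment: Callable[[str], Sequence[str]],
    max_sentence_chars: int = 200,    # placeholder sentence length threshold
    max_source_words: int = 100,      # placeholder word number threshold
    max_target_word_chars: int = 40,  # placeholder character length threshold
) -> List[Pair]:
    """Drop pairs whose source is too long or has too many words, and pairs
    whose target contains an implausibly long word."""
    kept = []
    for source, target in pairs:
        source_words = segment(source)
        if len(source) > max_sentence_chars or len(source_words) > max_source_words:
            continue
        # Assumption: target language words are separated by whitespace.
        if any(len(word) > max_target_word_chars for word in target.split()):
            continue
        kept.append((source, target))
    return kept


corpus = [
    ("患者服用二甲双胍", "The patient takes metformin"),
    ("好", "x" * 50),  # target contains a 50-character "word"
]
# `list` splits a string into characters and stands in for a real segmenter.
print(filter_by_length(corpus, segment=list))
```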

Further, when obtaining the original translation sentence pair set, the terminal first needs to obtain the real translation sentence pairs and the back-translation sentence pairs, which have not yet been merged, and merge them through a deduplication operation to obtain the original translation sentence pair set. The simHash algorithm can be used to deduplicate sentences. The core idea is: compute a simHash value for each text to be deduplicated, split the simHash value into segments to build an inverted index, and run the deduplication operation in parallel over the hash values of each segment.
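A compact, single-threaded sketch of simHash deduplication with a segmented inverted index follows. Several details are assumptions made only for illustration: whitespace tokenization, 64-bit fingerprints derived from MD5, four 16-bit segments, unweighted tokens, and a Hamming distance limit of 3. The parallel processing of segments mentioned in the text is omitted.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BITS = 64   # fingerprint width (assumed)
BANDS = 4   # number of segments the fingerprint is split into (assumed)


def simhash(tokens: Iterable[str]) -> int:
    """Compute an unweighted simHash fingerprint of a token sequence."""
    votes = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(BITS):
            votes[bit] += 1 if (token_hash >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if votes[bit] > 0)


def deduplicate(texts: List[str], max_distance: int = 3) -> List[str]:
    """Keep a text only if no earlier kept text is within `max_distance` bits.

    With 4 segments and a distance of at most 3, two near-duplicates must
    share at least one identical segment, so the inverted index from
    (segment number, segment value) to fingerprints finds every candidate.
    """
    band_bits = BITS // BANDS
    mask = (1 << band_bits) - 1
    index: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    kept: List[str] = []
    for text in texts:
        fingerprint = simhash(text.split())  # whitespace tokens (assumed)
        keys = [(band, (fingerprint >> (band * band_bits)) & mask)
                for band in range(BANDS)]
        candidates = {other for key in keys for other in index[key]}
        if any(bin(fingerprint ^ other).count("1") <= max_distance
               for other in candidates):
            continue  # near-duplicate of a text that was already kept
        kept.append(text)
        for key in keys:
            index[key].append(fingerprint)
    return kept


print(deduplicate(["take metformin daily", "take metformin daily", "rest well"]))
```

The segments are independent of each other, which is what would allow the lookups to be distributed across workers in a larger system.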

In this embodiment, a set of original translation sentence pairs is obtained, the original source language data in the original translation sentence pairs is segmented to obtain the word segmentation results, and the character length of each target language word in the original target language data in the original translation sentence pairs is counted. Filtering the original translation sentence pair set based on the word segmentation results and the character lengths removes deviating samples, improves the quality of translation model training, and reduces irrelevant data noise.

In one embodiment, filtering the set of sample translated sentence pairs according to the word number ratio to obtain the filtered sample translated sentence pair set includes: performing statistics according to the word number ratio to obtain a data distribution corresponding to the word number ratio; according to the data distribution, Filter the set of sample translation sentence pairs to obtain a set of filtered sample translation sentence pairs.

Specifically, the terminal can obtain the data distribution corresponding to the word number ratio by computing statistics over the word number ratios, so that it can filter the sample translation sentence pair set according to the data distribution and the preset proportion threshold and obtain the filtered sample translation sentence pair set. The preset proportion threshold can be set as needed and is not specifically limited in this embodiment. Further, the preset proportion threshold may include a first proportion threshold and a second proportion threshold, where the first proportion threshold is used to filter out sample translation sentence pairs with a smaller word number ratio, and the second proportion threshold is used to filter out sample translation sentence pairs with a larger word number ratio.

In this embodiment, by performing statistics based on the ratio of the number of words, a data distribution corresponding to the ratio of the number of words is obtained. According to the data distribution, the set of sample translated sentence pairs is filtered to obtain a set of filtered sample translated sentence pairs, which can filter out deviating samples. Improve the quality of model translation training and reduce irrelevant data noise.
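One way to read the two-threshold filter is sketched below. How the thresholds are derived from the data distribution is not specified in the text; taking them at fixed positions in the sorted list of ratios is an assumption made for illustration, as are the function names. Sentence pairs are given as already segmented word lists.

```python
from typing import List, Sequence, Tuple

TokenPair = Tuple[Sequence[str], Sequence[str]]  # (source words, target words)


def word_ratio(pair: TokenPair) -> float:
    """Number of source language words divided by number of target language words."""
    source_words, target_words = pair
    if not target_words:
        return float("inf")  # an empty target can never pass the filter
    return len(source_words) / len(target_words)


def ratio_thresholds(ratios: Sequence[float],
                     lower_fraction: float = 0.05,
                     upper_fraction: float = 0.95) -> Tuple[float, float]:
    """Derive the first and second thresholds from the sorted ratios.

    Assumption: the thresholds sit at fixed relative positions of the
    empirical distribution; `ratios` must not be empty.
    """
    ordered = sorted(ratios)
    last = len(ordered) - 1
    return ordered[int(last * lower_fraction)], ordered[int(last * upper_fraction)]


def filter_by_ratio(pairs: Sequence[TokenPair], low: float, high: float) -> List[TokenPair]:
    """Keep pairs whose word number ratio lies within [low, high]."""
    return [pair for pair in pairs if low <= word_ratio(pair) <= high]


samples = [
    (["a", "b"], ["x", "y"]),                 # ratio 1.0
    (["a"], ["x", "y", "z", "w", "v"]),       # ratio 0.2, source too short
    (["a", "b", "c", "d", "e", "f"], ["x"]),  # ratio 6.0, source too long
]
print(filter_by_ratio(samples, low=0.5, high=2.0))
```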

In one embodiment, training the initial machine translation model on the filtered sample translation sentence pair set to obtain the target machine translation model includes: training the initial machine translation model on the filtered sample translation sentence pair set to obtain a machine translation model to be optimized; obtaining a translation evaluation source language data set, and using the machine translation model to be optimized to translate the translation evaluation source language data in the set to obtain a translation evaluation target language data set; obtaining a translation evaluation translation sentence pair set from the translation evaluation source language data set and the translation evaluation target language data set; and training the machine translation model to be optimized on the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set to obtain the target machine translation model.

Here, the translation evaluation source language data set refers to a data set used to evaluate translation models. For example, it may specifically be the evaluation set of the International Machine Translation Competition.

Specifically, after the terminal trains the initial machine translation model on the filtered sample translation sentence pair set, it obtains the machine translation model to be optimized, which must then be optimized to obtain the target machine translation model. When optimizing, the terminal first obtains the translation evaluation source language data set and uses the machine translation model to be optimized to translate the translation evaluation source language in that data set, obtaining the translation evaluation target language data set. It then uses the translation evaluation source language data set and the translation evaluation target language data set as the translation evaluation translation sentence pair set, and trains the machine translation model to be optimized with the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set to obtain the target machine translation model.

Further, after obtaining the translation evaluation translation sentence pair set, the terminal first filters it, and then trains the machine translation model to be optimized using the filtered sample translation sentence pair set and the filtered translation evaluation translation sentence pair set to obtain a machine translation model to be updated. The terminal uses the machine translation model to be updated to translate the translation evaluation source language in the filtered translation evaluation translation sentence pair set, obtaining the translation evaluation target language corresponding to the translation evaluation source language. It then uses this translation evaluation target language to update the filtered translation evaluation translation sentence pair set, that is, to replace the translation results corresponding to the translation evaluation source language in that set, and trains the machine translation model to be updated using the filtered sample translation sentence pair set and the updated translation evaluation translation sentence pair set to obtain the target machine translation model.

Furthermore, the translation evaluation translation sentence pair set is filtered in the same way as the original translation sentence pair set and the sample translation sentence pair set, which is not repeated here. When training the machine translation model to be updated with the filtered sample translation sentence pair set and the updated translation evaluation translation sentence pair set, the terminal can obtain the target machine translation model through iterative training. That is, the terminal trains the machine translation model to be updated on the filtered sample translation sentence pair set and the updated translation evaluation translation sentence pair set to obtain a new machine translation model to be updated, and then returns to the step of using the machine translation model to be updated to translate the translation evaluation source language in the filtered translation evaluation translation sentence pair set. This repeats until the number of iterations reaches a preset iteration threshold, after which the target machine translation model is obtained from the latest machine translation model to be updated.

Furthermore, after obtaining the latest machine translation model to be updated, the terminal also obtains a professional corpus in the field and uses it to train the latest machine translation model to be updated to obtain the target machine translation model.

In this embodiment, the initial machine translation model is trained on the filtered sample translation sentence pair set to obtain the machine translation model to be optimized; the translation evaluation source language data set is obtained, and the translation evaluation source language in it is translated by the machine translation model to be optimized to obtain the translation evaluation target language data set; and the translation evaluation translation sentence pair set is obtained from the translation evaluation source language data set and the translation evaluation target language data set. The machine translation model to be optimized can then be optimized by training on the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set to obtain the target machine translation model.
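The iterative optimization described in this embodiment can be outlined as a short loop. This is a sketch under assumptions, not the actual training procedure: `train` and `translate` are hypothetical callables supplied by the caller, `rounds` stands in for the preset iteration threshold, and the filtering of the translation evaluation pairs and the final domain corpus training are omitted for brevity.

```python
def self_train(model, filtered_pairs, eval_sources, train, translate, rounds=3):
    """Alternate between translating the evaluation sources and retraining.

    train(model, pairs) -> model and translate(model, sentence) -> str are
    hypothetical callables; rounds stands in for the preset iteration threshold.
    """
    for _ in range(rounds):
        # Pair each evaluation source sentence with the current model's output,
        # replacing the translations produced in the previous round.
        eval_pairs = [(src, translate(model, src)) for src in eval_sources]
        # Retrain on the filtered samples plus the freshly translated pairs.
        model = train(model, filtered_pairs + eval_pairs)
    return model
```

Passing the training and translation routines as arguments keeps the loop independent of any particular translation toolkit.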

In one embodiment, the machine translation method further includes: performing proper noun recognition on the source language data to be translated through a pre-trained proper noun recognition model, and expanding the preset proper noun dictionary based on the recognition results.

Specifically, since the number of proper nouns in the preset proper noun dictionary is limited, when performing machine translation the terminal performs proper noun recognition on the source language data to be translated based on the pre-trained proper noun recognition model, and expands the preset proper noun dictionary based on the recognition results so that more proper nouns can be identified during matching. The pre-trained proper noun recognition model is obtained by training on a sample proper noun set carrying sequence annotations.

Specifically, the pre-trained proper noun recognition model can be a BERT (Bidirectional Encoder Representations from Transformers) + CRF (Conditional Random Field) model. When the source language data to be translated is input, the model decomposes the conditional probability distribution of the words to be translated according to the sequence conditions. The BERT model annotates the source language data to be translated and identifies the proper nouns; after identification, the CRF model is used to determine whether the identified proper nouns are accurate. For example, when a certain noun is recognized with the label BIII, the CRF model determines whether that label is accurate, that is, whether it is indeed BIII, thereby achieving recognition of proper nouns.
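To make the BIII example concrete, the following sketch shows how a label sequence can be turned back into proper nouns once the model has produced it. It does not implement BERT or CRF; it assumes one label per character, with B marking the first character of a proper noun, I a continuation, and O (an assumed label taken from the usual BIO convention) everything else. `decode_spans` is a hypothetical helper.

```python
def decode_spans(tokens, tags):
    """Collect token runs labelled B followed by zero or more I."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new proper noun starts; close any span still open.
            if current:
                spans.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # O, or a stray I with no preceding B, ends the open span.
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans
```

The recovered strings are what would be added to the preset proper noun dictionary.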

In one embodiment, as shown in Figure 2, a schematic flow chart is used to illustrate the machine translation method of the present application. The machine translation method specifically includes the following steps:

Step 202: Obtain a set of original translated sentence pairs, which includes original translated sentence pairs;

Step 204: Perform word segmentation on the original source language data in the original translation sentence pair, obtain the word segmentation results, and count the character length of each target language word in the original target language data in the original translation sentence pair;

Step 206: Filter the set of original translated sentence pairs according to the word segmentation results and character length;

Step 208: Use the filtered set of original translated sentence pairs as a set of sample translated sentence pairs;

Step 210: Obtain the initial machine translation model;

Step 212: Calculate the word number ratio of each sample translation sentence pair in the sample translation sentence pair set, where the word number ratio is the ratio of the number of source language words to the number of target language words in the sample translation sentence pair;

Step 214: Perform statistics based on the word number ratio to obtain the data distribution corresponding to the word number ratio;

Step 216: Filter the set of sample translated sentence pairs according to the data distribution to obtain a set of filtered sample translated sentence pairs;

Step 218: Train the initial machine translation model based on the filtered sample translation sentence pair set to obtain the machine translation model to be optimized;

Step 220: Obtain the translation evaluation source language data set, and use the machine translation model to be optimized to translate the translation evaluation source language in the translation evaluation source language data set to obtain the translation evaluation target language data set;

Step 222: Obtain a set of translation evaluation translation sentence pairs based on the translation evaluation source language data set and the translation evaluation target language data set;

Step 224: Train the machine translation model to be optimized based on the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set to obtain the target machine translation model;

Step 226: Obtain the source language data to be translated;

Step 228: Use words in the source language data to be translated as words to be matched;

Step 230: When there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtain the next word corresponding to the word to be matched in the source language data to be translated;

Step 232: Combine the word to be matched and the next word corresponding to the word to be matched to obtain the phrase to be matched;

Step 234: When there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtain the next word corresponding to the phrase to be matched in the source language data to be translated;

Step 236: Combine the phrase to be matched and the next word corresponding to the phrase to be matched to obtain a new phrase to be matched, and return to step 234;

Step 238: When there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, delete the next word corresponding to the latest phrase to be matched from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched;

Step 240: Determine the domain proper nouns in the source language data to be translated based on the domain proper nouns corresponding to the words to be matched;

Step 242: Input the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and input the source language data to be translated into the target machine translation model for translation to obtain the translation target language data;

Step 244: Replace the corresponding translation result in the translation target language data with the proper noun translation result to obtain a machine translation result.
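Steps 228 to 240 can be sketched as follows. This is a minimal illustration that assumes the sentence is already segmented into words and that the preset proper noun dictionary is a plain set of strings; `find_terms` and `sep` are hypothetical names. Steps 242 and 244 are not covered, because the text does not specify how the corresponding span in the translated output is located.

```python
def find_terms(words, dictionary, sep=""):
    """Greedy forward matching over a pre-segmented sentence.

    A match starts only at a word that is itself in the dictionary and is
    extended one word at a time while the combined phrase is still found.
    """
    terms, i = [], 0
    while i < len(words):
        if words[i] not in dictionary:
            i += 1
            continue
        j = i + 1  # words[i:j] is the longest match so far
        while j < len(words) and sep.join(words[i:j + 1]) in dictionary:
            j += 1
        # The last extension failed (or the sentence ended), so keep the
        # phrase without the word that broke the match.
        terms.append(sep.join(words[i:j]))
        i = j
    return terms
```

For whitespace-delimited languages, `sep=" "` would be used so that multi-word dictionary entries are joined with spaces.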

In one embodiment, taking the application of the above machine translation method in Chinese-English translation in the medical field as an example, the machine translation method of the present application is explained. As shown in Figure 3, the machine translation method specifically includes the following steps:

First, the terminal obtains real translated sentence pairs (i.e., Chinese-English sentence pairs). After obtaining them, the terminal uses a pre-trained back-translation model (i.e., an English-Chinese machine translation model) to back-translate the real translated sentence pairs to obtain back-translated sentence pairs, and uses the real translated sentence pairs and the back-translated sentence pairs as the set of original translated sentence pairs. Here, the terminal inputs the English data of the real translated sentence pair into the pre-trained back-translation model to obtain the Chinese translation corresponding to the English data, and uses the English data and the Chinese translation as the back-translated sentence pair corresponding to the real translated sentence pair. Data back-translation can improve the accuracy of the model to a certain extent. When pre-training the back-translation model, the terminal can perform data processing on the real translated sentence pairs to obtain back-translation sample pairs, and then use them to train the English-Chinese machine translation model. The data processing can be as follows: use the source language data (i.e., Chinese) in the real translated sentence pairs as the target language data and the target language data (i.e., English) as the source language data to obtain translation sample pairs to be filtered, and filter those pairs to obtain the back-translation sample pairs. For example, the untrained back-translation model can be based on the transformer-big model. During training, the untrained back-translation model converts the input words into word vectors through two layers, token embedding and position embedding, and the encoded word vectors flow to the two-layer network in the encoder. Finally, the relevance of the text is learned through matrix transformation training, yielding the back-translation model. It should be noted that the translation sample pairs to be filtered are filtered in the same way as the original translated sentence pairs and the sample translated sentence pairs in the above embodiments, which is not repeated here.
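The data handling around back-translation can be sketched as below. The translation model itself is not shown: `en_to_zh` is a hypothetical callable standing in for the pre-trained English-Chinese model, and storing every pair in (Chinese, English) order is an assumption made for the illustration.

```python
def swap_direction(real_pairs):
    """Turn (zh, en) pairs into (en, zh) samples for the English-Chinese model."""
    return [(en, zh) for zh, en in real_pairs]

def build_original_set(real_pairs, en_to_zh):
    """Real pairs plus back-translated pairs, all stored as (zh, en).

    en_to_zh is a hypothetical callable wrapping the back-translation model.
    """
    back_translated = [(en_to_zh(en), en) for _zh, en in real_pairs]
    return list(real_pairs) + back_translated
```

The swapped pairs would still pass through the same filtering as the other sentence pairs before being used for training.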

After obtaining the set of original translated sentence pairs, the terminal can use the original translated sentence pairs for model training to obtain the machine translation model to be optimized, that is, to train the Chinese-English machine translation model. Before model training, the terminal also needs to perform data processing (i.e., filtering) on the real translated sentence pairs (i.e., Chinese-English sentence pairs) in the original translated sentence pair set to obtain the filtered sample translated sentence pair set used for training. The specific filtering method can be as follows: the terminal performs word segmentation on the original Chinese data in the original translated sentence pair set and filters out the original translated sentence pairs whose original Chinese data has a sentence length greater than 200 or a word count greater than 150; it then counts the character length of each English word in the original English data of the pairs remaining after this first filtering and filters out the original translated sentence pairs whose original English data has a maximum character length greater than 40, obtaining the sample translated sentence pair set. Next, it calculates the word number ratio of each sample translated sentence pair in the set, that is, the value of (number of source Chinese words / number of target English words), performs statistical analysis with a Gaussian distribution to obtain the data distribution corresponding to the word number ratio, and filters the sample translated sentence pair set according to the data distribution, removing sample translated sentence pairs whose word number ratio is less than the first proportion threshold or greater than the second proportion threshold, to obtain the filtered sample translated sentence pair set. Through multiple rounds of filtering, deviating values can be filtered out, which improves the quality of model training and reduces irrelevant data noise.
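The two length filters in this example can be sketched as a single predicate. The limits 200, 150 and 40 come from the text; treating "sentence length" as a character count, passing in the segmented Chinese words from an external segmenter, and splitting English on whitespace are assumptions, and `keep_pair` is a hypothetical name.

```python
def keep_pair(src_sentence, src_words, tgt_sentence,
              max_src_chars=200, max_src_words=150, max_tgt_word_chars=40):
    """Return True if a Chinese-English pair passes both length filters.

    src_words is the segmented form of src_sentence (segmenter not shown).
    """
    # First filter: overly long Chinese sentences, by characters or by words.
    if len(src_sentence) > max_src_chars or len(src_words) > max_src_words:
        return False
    # Second filter: any English word longer than the character limit.
    longest = max((len(w) for w in tgt_sentence.split()), default=0)
    return longest <= max_tgt_word_chars
```

Pairs that pass this predicate form the sample set to which the word number ratio filter is then applied.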

After obtaining the filtered sample translation sentence pair set, the terminal trains the initial machine translation model on it, tuning the learning rate, batch size, step size and other related parameters to suitable values, to obtain the machine translation model to be optimized, thereby completing the training of the Chinese-English machine translation model.

After obtaining the machine translation model to be optimized, the terminal obtains the filtered evaluation set (in-domain data) for the medical field from the International Machine Translation Competition, that is, the translation evaluation source language data set, and uses it to fine-tune the machine translation model to be optimized. Here, model fine-tuning means freezing a series of parameters, such as the related losses and parameter weights from the previous large-batch model training, and then conducting small-batch model training based on these parameters. It should be noted that the medical-field evaluation set of the International Machine Translation Competition is filtered in the same way as the original translation sentence pairs and the sample translation sentence pairs in the above embodiments, which is not repeated here.

When using the translation evaluation source language data set to fine-tune the machine translation model to be optimized, the terminal first uses the machine translation model to be optimized to translate the translation evaluation Chinese in the translation evaluation Chinese set (that is, data translation of monolingual Chinese data) to obtain the translation evaluation English set, and obtains the translation evaluation translation sentence pair set from the translation evaluation Chinese set and the translation evaluation English set. It then filters the translation evaluation translation sentence pair set and trains the machine translation model to be optimized based on the filtered sample translation sentence pair set and the filtered translation evaluation translation sentence pair set to obtain the target machine translation model. The translation evaluation translation sentence pair set is filtered in the same way as the original translation sentence pairs and the sample translation sentence pairs in the above embodiments, which is not repeated here. During training, preferably, the number of training steps is one million and the batch size is three thousand.

Further, when training the machine translation model to be optimized to obtain the target machine translation model, the terminal first obtains the machine translation model to be updated by training the machine translation model to be optimized, and uses the machine translation model to be updated to translate the translation evaluation source language in the filtered translation evaluation translation sentence pair set, obtaining the translation evaluation target language corresponding to the translation evaluation source language. It uses this translation evaluation target language to update the filtered translation evaluation translation sentence pair set, that is, to replace the translation results corresponding to the translation evaluation source language in that set, and then trains the machine translation model to be updated with the filtered sample translation sentence pair set and the updated translation evaluation translation sentence pair set to obtain the target machine translation model, which is the medical field machine translation model.

Further, when training the machine translation model to be updated with the filtered sample translation sentence pair set and the updated translation evaluation translation sentence pair set, the terminal can obtain the target machine translation model through iterative training. That is, the terminal trains the machine translation model to be updated on the filtered sample translation sentence pair set and the updated translation evaluation translation sentence pair set to obtain a new machine translation model to be updated, and then returns to the step of using the machine translation model to be updated to translate the translation evaluation source language in the filtered translation evaluation translation sentence pair set. When the number of iterations (i.e., N in Figure 3) reaches the preset iteration threshold, the terminal takes the latest machine translation model to be updated, obtains a professional corpus in the field (i.e., medical field data), and uses it to train the latest machine translation model to be updated (i.e., fine-tunes the model with medical field data) to obtain the target machine translation model (i.e., the medical field machine translation model).

After obtaining the target machine translation model, the terminal obtains the Chinese to be translated, uses the words in the Chinese to be translated as the words to be matched, and performs forward maximum matching with the medical data professional dictionary to obtain the domain proper nouns corresponding to the words to be matched. That is, when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary (i.e., the medical data professional dictionary), the terminal obtains the next word corresponding to the word to be matched in the Chinese to be translated and combines the two to obtain the phrase to be matched. When there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, the terminal obtains the next word corresponding to the phrase to be matched in the Chinese to be translated, combines the phrase to be matched and that next word to obtain a new phrase to be matched, and returns to the step of checking whether the preset proper noun dictionary contains a matching word corresponding to the phrase to be matched. This continues until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, at which point the terminal deletes the next word corresponding to the latest phrase to be matched from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched.

After obtaining the domain proper nouns corresponding to the words to be matched, the terminal determines the domain proper nouns in the Chinese to be translated based on them, inputs the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and inputs the Chinese to be translated into the target machine translation model for translation to obtain the translation target language data. It then replaces the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result (that is, the translation result output).

Furthermore, the terminal can obtain the medical data professional dictionary through entity recognition. Specifically, the terminal obtains a sample proper noun set carrying sequence annotations and trains on it to obtain a pre-trained proper noun recognition model, so that when performing machine translation the terminal can perform proper noun recognition on the source language data to be translated through the pre-trained proper noun recognition model and expand the preset proper noun dictionary based on the recognition results, allowing more proper nouns to be identified during matching. Specifically, the pre-trained proper noun recognition model can be a BERT (Bidirectional Encoder Representations from Transformers) + CRF (Conditional Random Field) model. When the source language data to be translated is input, the model decomposes the conditional probability distribution of the words to be translated according to the sequence conditions. The BERT model annotates the source language data to be translated and identifies the proper nouns; after identification, the CRF model is used to determine whether the identified proper nouns are accurate. For example, when a certain noun is recognized with the label BIII, the CRF model determines whether that label is accurate, that is, whether it is indeed BIII, thereby achieving recognition of proper nouns.

Further, the terminal can use multi-model fusion to obtain the translation target language data. In this case, the target machine translation model can include at least two sub-machine translation models; that is, the terminal can train multiple sub-machine translation models with different random dropout rates and use them to translate the source language data to be translated. When translating, the terminal inputs the source language data to be translated into each sub-machine translation model to obtain the translation result corresponding to that sub-machine translation model. The translation result includes the word probability obtained by predicting each word in the source language data to be translated. After obtaining the word probabilities, the terminal sorts the word probabilities of the same word in the translation results output by the sub-machine translation models, determines the optimal prediction result for that word (that is, the optimal translation result) based on the sorting results, and obtains the corresponding translation target language data from the optimal translation result of each word. After sorting, the terminal determines the maximum word probability for each word and uses the word corresponding to the maximum word probability as the optimal prediction result.
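The selection rule in the multi-model fusion can be sketched as follows. The sketch assumes each sub-machine translation model returns one (word, probability) prediction per position and that all outputs are aligned and of equal length; real decoder outputs may differ in length, which this illustration does not handle. `fuse` is a hypothetical name.

```python
def fuse(model_outputs):
    """Pick, at each position, the word with the highest probability.

    model_outputs: one list per sub-model, each a list of (word, probability)
    tuples aligned position by position (an assumption of this sketch).
    """
    fused = []
    for candidates in zip(*model_outputs):
        # Equivalent to sorting by probability and taking the top entry.
        word, _prob = max(candidates, key=lambda c: c[1])
        fused.append(word)
    return fused
```

For example, with one model predicting `("the", 0.9), ("patient", 0.4)` and another predicting `("a", 0.5), ("sufferer", 0.7)`, the fused output takes "the" from the first model and "sufferer" from the second.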

It should be understood that although the steps in the flowcharts involved in the above-mentioned embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated in this article, there is no strict order restriction on the execution of these steps, and these steps can be executed in other orders. Moreover, at least some of the steps in the flowcharts involved in the above embodiments may include multiple steps or stages. These steps or stages are not necessarily executed at the same time, but may be completed at different times. The execution order of these steps or stages is not necessarily sequential, but may be performed in turn or alternately with other steps or at least part of the steps or stages in other steps.

Based on the same inventive concept, embodiments of the present application also provide a machine translation device for implementing the above-mentioned machine translation method. The solution provided by this device is similar to the solution recorded in the above method. Therefore, for the specific limitations in the one or more machine translation device embodiments provided below, please refer to the above limitations on the machine translation method; details are not repeated here.

In one embodiment, as shown in Figure 4, a machine translation device is provided, including: an acquisition module 402, a matching module 404, a translation module 406 and a replacement module 408, wherein:

The acquisition module 402 is used to acquire the source language data to be translated;

The matching module 404 is used to perform forward maximum matching on the source language data to be translated and determine the domain-specific nouns in the source language data to be translated;

The translation module 406 is used to input the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, and to input the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, where the target machine translation model is obtained by training with sample data;

The replacement module 408 is used to replace the corresponding translation result in the translation target language data with the proper noun translation result to obtain a machine translation result.

The above-mentioned machine translation device acquires the source language data to be translated and performs forward maximum matching on it to determine the domain proper nouns in the source language data to be translated. It inputs the domain proper nouns into the target machine translation model for translation to obtain the proper noun translation results, inputs the source language data to be translated into the target machine translation model for translation to obtain the translation target language data, and replaces the corresponding translation results in the translation target language data with the proper noun translation results. This can improve the accuracy with which the target machine translation model translates domain proper nouns, yielding an accurate machine translation result.

In one embodiment, the matching module is also used to take words in the source language data to be translated as words to be matched, perform forward maximum matching on the words to be matched to obtain the domain proper nouns corresponding to the words to be matched, and determine the domain proper nouns in the source language data to be translated according to the domain proper nouns corresponding to the words to be matched.

In one embodiment, the matching module is also used to: when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtain the next word corresponding to the word to be matched in the source language data to be translated, and combine the word to be matched and that next word to obtain the phrase to be matched; when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtain the next word corresponding to the phrase to be matched in the source language data to be translated, combine the phrase to be matched and that next word to obtain a new phrase to be matched, and return to the step of obtaining the next word corresponding to the phrase to be matched when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary; and, once there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, delete the next word corresponding to the latest phrase to be matched from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched.

In one embodiment, the machine translation device also includes a model training module. The model training module is used to obtain a sample translation sentence pair set and an initial machine translation model; calculate the word number ratio of each sample translation sentence pair in the sample translation sentence pair set, where the word number ratio is the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filter the sample translation sentence pair set according to the word number ratio to obtain the filtered sample translation sentence pair set; and train the initial machine translation model according to the filtered sample translation sentence pair set to obtain the target machine translation model.

In one embodiment, the model training module is also used to obtain an original translated sentence pair set, which includes original translated sentence pairs; perform word segmentation on the original source language data in each original translated sentence pair to obtain a word segmentation result, and count the character length of each target language word in the original target language data of that pair; filter the original translated sentence pair set according to the word segmentation result and the character length; and use the filtered original translated sentence pair set as the sample translation sentence pair set.
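
One possible reading of this pre-filtering step is sketched below in Python. The passage does not say which conditions on the word segmentation result or on the character length cause a pair to be discarded, so the two limits used here (`max_source_words` and `max_target_word_chars`) are purely hypothetical, as is the `segment` callable that stands in for a real word segmenter.

```python
def prefilter_pairs(original_pairs, segment,
                    max_source_words=100, max_target_word_chars=30):
    """Drop pairs with an over-long source side or an over-long target word."""
    kept = []
    for source_text, target_text in original_pairs:
        source_words = segment(source_text)              # word segmentation result
        target_lengths = [len(w) for w in target_text.split()]  # character lengths
        if not source_words or not target_lengths:
            continue  # discard pairs with an empty side
        if len(source_words) > max_source_words:
            continue  # assumed rule: too many source words
        if max(target_lengths) > max_target_word_chars:
            continue  # assumed rule: an implausibly long target word
        kept.append((source_text, target_text))
    return kept


# Hypothetical data; whitespace splitting stands in for a real segmenter.
original_pairs = [
    ("a b c", "x y z"),
    ("a b", "x" * 40),  # one 40-character target "word"
]
sample_pairs = prefilter_pairs(original_pairs, segment=str.split)
print(sample_pairs)
```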

In one embodiment, the model training module is also used to perform statistics on the word number ratios to obtain the data distribution corresponding to the word number ratio, and to filter the sample translation sentence pair set according to the data distribution to obtain the filtered sample translation sentence pair set.
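
The following Python sketch shows one way the data distribution could drive the filter: it summarizes the ratios by their mean and standard deviation and keeps pairs within `k` standard deviations of the mean. This rule is an assumption chosen for illustration; the passage only states that the filtering follows the data distribution. The function name and the parameter `k` are likewise hypothetical, and every target side is assumed to be non-empty.

```python
import statistics


def filter_by_distribution(pairs, k=2.0):
    """Keep pairs whose word number ratio is within k standard deviations of the mean."""
    ratios = [len(source) / len(target) for source, target in pairs]
    mean = statistics.mean(ratios)
    std = statistics.pstdev(ratios)  # population standard deviation of the ratios
    low, high = mean - k * std, mean + k * std
    return [pair for pair, ratio in zip(pairs, ratios) if low <= ratio <= high]


# Hypothetical data: four balanced pairs (ratio 1.0) and one outlier (ratio 5.0).
balanced = (["a", "b"], ["x", "y"])
outlier = (["a", "b", "c", "d", "e"], ["x"])
pairs = [balanced, balanced, balanced, balanced, outlier]
print(len(filter_by_distribution(pairs, k=1.0)))
```

Deriving the bounds from the corpus itself avoids hand-tuning thresholds for each language pair, since the typical ratio differs between languages.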

In one embodiment, the model training module is also used to train the initial machine translation model according to the filtered sample translation sentence pair set to obtain a machine translation model to be optimized; obtain a translation evaluation source language data set, and translate the translation evaluation source language data in that set with the machine translation model to be optimized to obtain a translation evaluation target language data set; obtain a translation evaluation translation sentence pair set according to the translation evaluation source language data set and the translation evaluation target language data set; and train the machine translation model to be optimized according to the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set to obtain the target machine translation model.
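
The two training stages can be outlined with the Python sketch below. The `train` and `translate` callables are placeholders for whatever training and inference routines are actually used, and the toy dictionary "model" in the usage example exists only to make the sketch runnable; none of these names come from the application.

```python
def two_stage_training(initial_model, filtered_pairs, eval_sources, train, translate):
    """Train, translate an evaluation set with the result, then train again."""
    # Stage 1: train on the filtered sample pairs.
    model_to_optimize = train(initial_model, filtered_pairs)
    # Translate the evaluation source data with the stage-1 model.
    eval_targets = [translate(model_to_optimize, s) for s in eval_sources]
    eval_pairs = list(zip(eval_sources, eval_targets))
    # Stage 2: train on the filtered pairs plus the model-generated pairs.
    return train(model_to_optimize, list(filtered_pairs) + eval_pairs)


# Toy stand-ins: the "model" is a lookup table, unseen input is upper-cased.
def toy_train(model, pairs):
    updated = dict(model)
    updated.update(pairs)
    return updated


def toy_translate(model, source):
    return model.get(source, source.upper())


target_model = two_stage_training(
    {}, [("hello", "bonjour")], ["hello", "world"], toy_train, toy_translate
)
print(target_model)
```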

Each module in the above machine translation device can be implemented in whole or in part by software, hardware and combinations thereof. Each of the above modules may be embedded in or independent of the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in Figure 5. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for wired or wireless communication with external terminals; the wireless mode can be implemented through WIFI, a mobile cellular network, NFC (Near Field Communication) or other technologies. The computer program, when executed by the processor, implements a machine translation method. The display unit of the computer device is used to form a visible picture and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device can be a touch layer covering the display screen, or buttons, a trackball or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad or mouse.

Those skilled in the art can understand that the structure shown in Figure 5 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer parts than shown, combine certain parts, or have a different arrangement of parts.

In one embodiment, a computer device is provided, including a memory and a processor. A computer program is stored in the memory. When the processor executes the computer program, it implements the following steps: obtaining source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in the source language data to be translated; inputting the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain translation target language data, where the target machine translation model is obtained by training on sample data; and replacing the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.
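
The overall flow of these steps might look like the Python sketch below. It assumes the domain proper nouns have already been found by forward maximum matching and are passed in as `terms`. How the "corresponding translation result" is located inside the translated sentence is not described in this passage, so that step is delegated to a hypothetical `find_rendering` callable; `translate` likewise stands in for the target machine translation model.

```python
def translate_with_term_replacement(source, terms, translate, find_rendering):
    """Translate `source`, then overwrite each term's in-sentence rendering."""
    output = translate(source)                    # translation target language data
    for term in terms:
        term_translation = translate(term)        # proper noun translation result
        rendering = find_rendering(term, output)  # hypothetical alignment step
        if rendering and rendering in output:
            # Replace the first occurrence of the in-context rendering.
            output = output.replace(rendering, term_translation, 1)
    return output


# Toy stand-ins so the sketch runs; a real system would call the trained model.
toy_model = {
    "foo bar": "FOO-BAR",
    "use foo bar now": "use fooish bar now",
}
result = translate_with_term_replacement(
    "use foo bar now",
    ["foo bar"],
    translate=toy_model.__getitem__,
    find_rendering=lambda term, output: "fooish bar",
)
print(result)
```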

In one embodiment, when the processor executes the computer program, it also implements the following steps: using words in the source language data to be translated as words to be matched; performing forward maximum matching on the words to be matched to obtain the domain proper nouns corresponding to the words to be matched; and determining the domain proper nouns in the source language data to be translated based on the domain proper nouns corresponding to the words to be matched.

In one embodiment, the processor also implements the following steps when executing the computer program: when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtaining the next word corresponding to the word to be matched in the source language data to be translated; combining the word to be matched and its next word to obtain a phrase to be matched; when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtaining the next word corresponding to the phrase to be matched in the source language data to be translated; combining the phrase to be matched and its next word to obtain a new phrase to be matched; and returning to the step of obtaining the next word corresponding to the phrase to be matched when there is a matching word for that phrase in the preset proper noun dictionary, until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, and then deleting from the latest phrase to be matched the next word most recently added to it, to obtain the domain proper noun corresponding to the word to be matched.

In one embodiment, the processor also implements the following steps when executing the computer program: obtaining a sample translation sentence pair set and an initial machine translation model; calculating the word number ratio of each sample translation sentence pair in the sample translation sentence pair set, where the word number ratio is the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filtering the sample translation sentence pair set according to the word number ratio to obtain a filtered sample translation sentence pair set; and training the initial machine translation model according to the filtered sample translation sentence pair set to obtain the target machine translation model.

In one embodiment, the processor also implements the following steps when executing the computer program: obtaining an original translated sentence pair set, which includes original translated sentence pairs; performing word segmentation on the original source language data in each original translated sentence pair to obtain a word segmentation result, and counting the character length of each target language word in the original target language data of that pair; filtering the original translated sentence pair set according to the word segmentation result and the character length; and using the filtered original translated sentence pair set as the sample translation sentence pair set.

In one embodiment, the processor also implements the following steps when executing the computer program: performing statistics on the word number ratios to obtain the data distribution corresponding to the word number ratio; and filtering the sample translation sentence pair set according to the data distribution to obtain the filtered sample translation sentence pair set.

In one embodiment, the processor also implements the following steps when executing the computer program: training the initial machine translation model according to the filtered sample translation sentence pair set to obtain a machine translation model to be optimized; obtaining a translation evaluation source language data set, and translating the translation evaluation source language data in that set with the machine translation model to be optimized to obtain a translation evaluation target language data set; obtaining a translation evaluation translation sentence pair set according to the translation evaluation source language data set and the translation evaluation target language data set; and training the machine translation model to be optimized according to the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set to obtain the target machine translation model.

In one embodiment, a computer-readable storage medium is provided, with a computer program stored thereon. When the computer program is executed by a processor, the following steps are implemented: obtaining source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in the source language data to be translated; inputting the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain translation target language data, where the target machine translation model is obtained by training on sample data; and replacing the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.

In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: using words in the source language data to be translated as words to be matched; performing forward maximum matching on the words to be matched to obtain the domain proper nouns corresponding to the words to be matched; and determining the domain proper nouns in the source language data to be translated based on the domain proper nouns corresponding to the words to be matched.

In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: when there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtaining the next word corresponding to the word to be matched in the source language data to be translated; combining the word to be matched and its next word to obtain a phrase to be matched; when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtaining the next word corresponding to the phrase to be matched in the source language data to be translated; combining the phrase to be matched and its next word to obtain a new phrase to be matched; and returning to the step of obtaining the next word corresponding to the phrase to be matched when there is a matching word for that phrase in the preset proper noun dictionary, until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, and then deleting from the latest phrase to be matched the next word most recently added to it, to obtain the domain proper noun corresponding to the word to be matched.

In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: obtaining a sample translation sentence pair set and an initial machine translation model; calculating the word number ratio of each sample translation sentence pair in the sample translation sentence pair set, where the word number ratio is the ratio of the number of source language words to the number of target language words in the sample translation sentence pair; filtering the sample translation sentence pair set according to the word number ratio to obtain a filtered sample translation sentence pair set; and training the initial machine translation model according to the filtered sample translation sentence pair set to obtain the target machine translation model.

In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: obtaining an original translated sentence pair set, which includes original translated sentence pairs; performing word segmentation on the original source language data in each original translated sentence pair to obtain a word segmentation result, and counting the character length of each target language word in the original target language data of that pair; filtering the original translated sentence pair set according to the word segmentation result and the character length; and using the filtered original translated sentence pair set as the sample translation sentence pair set.

In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: performing statistics on the word number ratios to obtain the data distribution corresponding to the word number ratio; and filtering the sample translation sentence pair set according to the data distribution to obtain the filtered sample translation sentence pair set.

In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: training the initial machine translation model according to the filtered sample translation sentence pair set to obtain a machine translation model to be optimized; obtaining a translation evaluation source language data set, and translating the translation evaluation source language data in that set with the machine translation model to be optimized to obtain a translation evaluation target language data set; obtaining a translation evaluation translation sentence pair set according to the translation evaluation source language data set and the translation evaluation target language data set; and training the machine translation model to be optimized according to the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set to obtain the target machine translation model.

Specifically, the computer-readable storage medium may be non-volatile or volatile.

In one embodiment, a computer program product is provided, including a computer program. When executed by a processor, the computer program implements the following steps: obtaining source language data to be translated; performing forward maximum matching on the source language data to be translated to determine the domain proper nouns in the source language data to be translated; inputting the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, and inputting the source language data to be translated into the target machine translation model for translation to obtain translation target language data, where the target machine translation model is obtained by training on sample data; and replacing the corresponding translation results in the translation target language data with the proper noun translation results to obtain the machine translation result.

It should be noted that the data involved in this application (including but not limited to data used for analysis, etc.) are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, the computer program may include the processes of the above method embodiments. Any reference to memory, database or other media used in the embodiments provided in this application may include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (Read-Only Memory, ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive memory (ReRAM), magnetoresistive memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, etc. Volatile memory may include random access memory (Random Access Memory, RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases involved in the various embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include blockchain-based distributed databases, etc., but are not limited thereto. The processors involved in the various embodiments provided in this application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited thereto.

The technical features of the above embodiments can be combined in any way. To simplify the description, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, all such combinations should be considered to be within the scope of this specification.

The above-described embodiments only express several implementation modes of the present application, and their descriptions are relatively specific and detailed, but should not be construed as limiting the patent scope of the present application. It should be noted that, for those of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the scope of protection of this application should be determined by the appended claims.

Claims

A machine translation method, wherein the method includes:

Obtain the source language data to be translated;

Perform forward maximum matching on the source language data to be translated, and determine the domain proper nouns in the source language data to be translated;

Input the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, and input the source language data to be translated into the target machine translation model for translation to obtain translation target language data, wherein the target machine translation model is obtained by training on sample data;

Replace the corresponding translation result in the translation target language data with the proper noun translation result to obtain a machine translation result.
The method according to claim 1, wherein performing forward maximum matching on the source language data to be translated and determining the domain proper nouns in the source language data to be translated includes:

Use words in the source language data to be translated as words to be matched;

Perform forward maximum matching on the word to be matched to obtain the domain proper noun corresponding to the word to be matched;

Determine the domain proper nouns in the source language data to be translated according to the domain proper nouns corresponding to the words to be matched.
The method according to claim 2, wherein performing forward maximum matching on the words to be matched to obtain domain proper nouns corresponding to the words to be matched includes:

When there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtain the next word corresponding to the word to be matched in the source language data to be translated;

Combine the word to be matched and the next word corresponding to the word to be matched to obtain the phrase to be matched;

When there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtain the next word corresponding to the phrase to be matched in the source language data to be translated;

Combine the phrase to be matched and the next word corresponding to the phrase to be matched to obtain a new phrase to be matched, and return to the step of obtaining, when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, the next word corresponding to the phrase to be matched in the source language data to be translated;

Until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, then delete the next word corresponding to the latest phrase to be matched from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched.
The method of claim 1, further comprising:

Obtain the sample translation sentence pair set and the initial machine translation model;

Calculate the word number ratio of the sample translation sentence pairs in the sample translation sentence pair set, where the word number ratio is the ratio of the number of source language words to the target language word number in the sample translation sentence pair;

Filter the sample translation sentence pair set according to the word number ratio to obtain a filtered sample translation sentence pair set;

Train the initial machine translation model according to the filtered sample translation sentence pair set to obtain a target machine translation model.
The method according to claim 4, wherein said obtaining a set of sample translation sentence pairs includes:

Obtain a set of original translated sentence pairs, where the set of original translated sentence pairs includes original translated sentence pairs;

Perform word segmentation on the original source language data in the original translation sentence pair, obtain the word segmentation results, and count the character length of each target language word in the original target language data in the original translation sentence pair;

Filter the set of original translated sentence pairs according to the word segmentation result and the character length;

The filtered set of original translated sentence pairs is used as a set of sample translated sentence pairs.
The method according to claim 4, wherein filtering the sample translation sentence pair set according to the word number ratio, and obtaining the filtered sample translation sentence pair set includes:

Perform statistics according to the ratio of the number of words to obtain a data distribution corresponding to the ratio of the number of words;

According to the data distribution, the sample translation sentence pair set is filtered to obtain a filtered sample translation sentence pair set.
The method according to claim 4, wherein training the initial machine translation model according to the filtered sample translation sentence pair set to obtain the target machine translation model includes:

Train the initial machine translation model according to the set of filtered sample translation sentence pairs to obtain a machine translation model to be optimized;

Obtain the translation evaluation source language data set, and use the to-be-optimized machine translation model to translate the translation evaluation source language in the translation evaluation source language data set to obtain the translation evaluation target language data set;

Obtain a set of translation evaluation translation sentence pairs according to the translation evaluation source language data set and the translation evaluation target language data set;

According to the filtered sample translation sentence pair set and the translation evaluation translation sentence pair set, the machine translation model to be optimized is trained to obtain a target machine translation model.
A machine translation device, wherein the device includes:

The acquisition module is used to obtain the source language data to be translated;

A matching module, used to perform forward maximum matching on the source language data to be translated, and determine the domain proper nouns in the source language data to be translated;

The translation module is used to input the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, and to input the source language data to be translated into the target machine translation model for translation to obtain translation target language data, wherein the target machine translation model is obtained by training on sample data;

A replacement module is used to replace the corresponding translation result in the translation target language data with the proper noun translation result to obtain a machine translation result.
A computer device, including a memory and a processor, the memory stores a computer program, wherein the processor implements a machine translation method when executing the computer program, including:

Obtain the source language data to be translated;

Perform forward maximum matching on the source language data to be translated, and determine the domain proper nouns in the source language data to be translated;

Input the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, and input the source language data to be translated into the target machine translation model for translation to obtain translation target language data, wherein the target machine translation model is obtained by training on sample data;

Replace the corresponding translation result in the translation target language data with the proper noun translation result to obtain a machine translation result.
The computer device according to claim 9, wherein, when the processor executes the computer-readable instructions, performing forward maximum matching on the source language data to be translated and determining the domain proper nouns in the source language data to be translated includes:

Use words in the source language data to be translated as words to be matched;

Perform forward maximum matching on the word to be matched to obtain the domain proper noun corresponding to the word to be matched;

Determine the domain proper nouns in the source language data to be translated according to the domain proper nouns corresponding to the words to be matched.
The computer device according to claim 10, wherein, when the processor executes the computer-readable instructions, performing forward maximum matching on the word to be matched to obtain the domain proper noun corresponding to the word to be matched includes:

When there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtain the next word corresponding to the word to be matched in the source language data to be translated;

Combine the word to be matched and the next word corresponding to the word to be matched to obtain the phrase to be matched;

When there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtain the next word corresponding to the phrase to be matched in the source language data to be translated;

Combine the phrase to be matched and the next word corresponding to the phrase to be matched to obtain a new phrase to be matched, and return to the step of obtaining, when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, the next word corresponding to the phrase to be matched in the source language data to be translated;

Until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, then delete the next word corresponding to the latest phrase to be matched from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched.
The computer device of claim 9, wherein implementing the machine translation method when the processor executes the computer readable instructions further includes:

Obtain the sample translation sentence pair set and the initial machine translation model;

Calculate the word number ratio of the sample translation sentence pairs in the sample translation sentence pair set, where the word number ratio is the ratio of the number of source language words to the target language word number in the sample translation sentence pair;

Filter the sample translation sentence pair set according to the word number ratio to obtain a filtered sample translation sentence pair set;

Train the initial machine translation model according to the filtered sample translation sentence pair set to obtain a target machine translation model.
The computer device according to claim 12, wherein when the processor executes the computer-readable instructions, implementing the acquisition of the set of sample translation sentence pairs includes:

Obtain a set of original translated sentence pairs, where the set of original translated sentence pairs includes original translated sentence pairs;

Perform word segmentation on the original source language data in the original translation sentence pair, obtain the word segmentation results, and count the character length of each target language word in the original target language data in the original translation sentence pair;

Filter the set of original translated sentence pairs according to the word segmentation result and the character length;

The filtered set of original translated sentence pairs is used as a set of sample translated sentence pairs.
The computer device according to claim 12, wherein, when the processor executes the computer-readable instructions, filtering the sample translation sentence pair set according to the word number ratio to obtain the filtered sample translation sentence pair set includes:

Perform statistics on the word number ratio to obtain a data distribution corresponding to the word number ratio;

According to the data distribution, the sample translation sentence pair set is filtered to obtain a filtered sample translation sentence pair set.
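The word number ratio filtering can be sketched as follows. The claims do not state how the data distribution is turned into a filtering rule, so this sketch assumes one simple possibility: keep only the pairs whose ratio lies within `k` standard deviations of the mean ratio. The function names, the `Pair` type, and the default `k = 2.0` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple

# (source words, target words), both already segmented into words
Pair = Tuple[List[str], List[str]]


def word_ratio(pair: Pair) -> float:
    """Number of source language words divided by number of target language words."""
    src, tgt = pair
    return len(src) / len(tgt)


def filter_by_ratio_distribution(pairs: List[Pair], k: float = 2.0) -> List[Pair]:
    """Keep pairs whose word number ratio is within k standard deviations of the mean."""
    # Pairs with an empty side have no meaningful ratio and are dropped first.
    usable = [p for p in pairs if p[0] and p[1]]
    if not usable:
        return []
    ratios = [word_ratio(p) for p in usable]
    mu, sigma = mean(ratios), pstdev(ratios)
    low, high = mu - k * sigma, mu + k * sigma  # assumed acceptance band
    return [p for p, r in zip(usable, ratios) if low <= r <= high]
```

A percentile band or a fixed ratio range would fit the same claim language equally well; the mean and standard deviation rule is only one way to read "according to the data distribution".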
A computer-readable storage medium with computer-readable instructions stored thereon, wherein the computer-readable instructions implement a machine translation method when executed by a processor, including:

Obtain the source language data to be translated;

Perform forward maximum matching on the source language data to be translated, and determine the domain proper nouns in the source language data to be translated;

Input the domain proper nouns into the target machine translation model for translation to obtain proper noun translation results, and input the source language data to be translated into the target machine translation model for translation to obtain translation target language data, wherein the target machine translation model is obtained by training with sample data;

Replace the corresponding translation result in the translation target language data with the proper noun translation result to obtain a machine translation result.
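The translate-then-replace flow of the method can be sketched as below. The claims do not specify how the "corresponding translation result" is located inside the translation target language data, so the sketch delegates that step to a caller-supplied `locate` function. Both `translate` and `locate` are hypothetical stand-ins, and `translate_with_proper_nouns` is an illustrative name rather than anything defined in the application.

```python
from typing import Callable, List, Optional

# Stand-in for the target machine translation model: source text -> target text.
Translate = Callable[[str], str]
# Hypothetical helper: given a source proper noun and the full translation, return
# the substring of the translation that corresponds to the noun, or None.
Locate = Callable[[str, str], Optional[str]]


def translate_with_proper_nouns(
    source: str,
    proper_nouns: List[str],
    translate: Translate,
    locate: Locate,
) -> str:
    """Translate the whole input, then overwrite each proper noun's in-context translation."""
    target = translate(source)  # translation target language data
    for noun in proper_nouns:
        noun_translation = translate(noun)  # proper noun translated on its own
        span = locate(noun, target)  # its translation inside the full output
        if span and span != noun_translation:
            # Replace only the first occurrence of the located span.
            target = target.replace(span, noun_translation, 1)
    return target
```

In practice `locate` could be backed by word alignment or attention weights; that choice lies outside what the claims recite.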
The computer-readable storage medium according to claim 15, wherein, when the computer-readable instructions are executed by a processor, performing forward maximum matching on the source language data to be translated and determining the domain proper nouns in the source language data to be translated includes:

Use words in the source language data to be translated as words to be matched;

Perform forward maximum matching on the word to be matched to obtain the domain proper noun corresponding to the word to be matched;

Determine the domain proper nouns in the source language data to be translated according to the domain proper nouns corresponding to the words to be matched.
The computer-readable storage medium according to claim 16, wherein, when the computer-readable instructions are executed by a processor, performing forward maximum matching on the word to be matched to obtain the domain proper noun corresponding to the word to be matched includes:

When there is a matching word corresponding to the word to be matched in the preset proper noun dictionary, obtain the next word corresponding to the word to be matched in the source language data to be translated;

Combine the word to be matched and the next word corresponding to the word to be matched to obtain the phrase to be matched;

When there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, obtain the next word corresponding to the phrase to be matched in the source language data to be translated;

Combine the phrase to be matched and the next word corresponding to the phrase to be matched to obtain a new phrase to be matched, and return to the step of obtaining, when there is a matching word corresponding to the phrase to be matched in the preset proper noun dictionary, the next word corresponding to the phrase to be matched in the source language data to be translated;

Repeat until there is no matching word corresponding to the latest phrase to be matched in the preset proper noun dictionary, then delete the next word corresponding to the latest phrase to be matched from the latest phrase to be matched to obtain the domain proper noun corresponding to the word to be matched.
The computer-readable storage medium according to claim 15, wherein when the computer-readable instructions are executed by a processor, implementing the machine translation method further includes:

Obtain the sample translation sentence pair set and the initial machine translation model;

Calculate the word number ratio of each sample translation sentence pair in the sample translation sentence pair set, where the word number ratio is the ratio of the number of source language words to the number of target language words in the sample translation sentence pair;

Filter the sample translation sentence pair set according to the word number ratio to obtain a filtered sample translation sentence pair set;

Train the initial machine translation model according to the filtered sample translation sentence pair set to obtain the target machine translation model.
The computer-readable storage medium according to claim 18, wherein, when the computer-readable instructions are executed by a processor, obtaining the sample translation sentence pair set includes:

Obtain a set of original translated sentence pairs, where the set of original translated sentence pairs includes original translated sentence pairs;

Perform word segmentation on the original source language data in the original translation sentence pair, obtain the word segmentation results, and count the character length of each target language word in the original target language data in the original translation sentence pair;

Filter the set of original translated sentence pairs according to the word segmentation result and the character length;

The filtered set of original translated sentence pairs is used as a set of sample translated sentence pairs.
The computer-readable storage medium according to claim 18, wherein, when the computer-readable instructions are executed by a processor, filtering the sample translation sentence pair set according to the word number ratio to obtain the filtered sample translation sentence pair set includes:

Perform statistics on the word number ratio to obtain a data distribution corresponding to the word number ratio;

According to the data distribution, the sample translation sentence pair set is filtered to obtain a filtered sample translation sentence pair set.