CN116205242A - Translation method, translation device, translation apparatus, translation medium, and translation program product


Info

Publication number
CN116205242A
Authority
CN
China
Prior art keywords
word
sequence
translation
words
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211714333.1A
Other languages
Chinese (zh)
Inventor
史庭训
薛征山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202211714333.1A
Publication of CN116205242A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses a translation method, a translation device, translation equipment, a translation medium and a translation program product, and belongs to the technical field of natural language processing. The method comprises the following steps: performing word segmentation processing on a sentence to be translated in at least two word segmentation modes to obtain at least two word sequences; performing embedded encoding on the at least two word sequences through a translation model to obtain at least two encoding vectors; decoding the at least two encoding vectors through the translation model to obtain at least two decoding results; and determining a translation result from the at least two decoding results. The method achieves more accurate word segmentation and thereby more accurate translation results.

Description

Translation method, translation device, translation apparatus, translation medium, and translation program product
Technical Field
The embodiments of the present application relate to the technical field of natural language processing, and in particular to a translation method, a translation device, translation equipment, a translation medium and a translation program product.
Background
Word segmentation is a fundamental underlying technology in natural language processing (NLP). For example, in sentence translation, the sentence to be translated must first be split into words, a word sequence is generated from the split words, the word sequence is then encoded and decoded, and finally the translated sentence is output.
Common Chinese word segmentation tools include the jieba, pkuseg and SCWS tools. For the same sentence, different word segmentation tools generate results that are similar overall but differ in detail. In particular, for new words and proper nouns, different tools exhibit different segmentation behavior depending on their training corpora.
Disclosure of Invention
The embodiment of the application provides a translation method, a translation device, translation equipment, translation media and a translation program product. The technical scheme is as follows:
according to an aspect of the present application, there is provided a translation method, the method including:
performing word segmentation processing on sentences to be translated in at least two word segmentation modes to obtain at least two word sequences;
embedding and coding the at least two word sequences through a translation model to obtain at least two coding vectors;
decoding the at least two coding vectors through the translation model to obtain at least two decoding results;
and determining a translation result from the at least two decoding results.
According to another aspect of the present application, there is provided a translation apparatus, the apparatus including:
the word segmentation module is used for carrying out word segmentation processing on sentences to be translated in at least two word segmentation modes to obtain at least two word sequences;
The coding module is used for carrying out embedded coding on the at least two word sequences through a translation model to obtain at least two coding vectors;
the decoding module is used for decoding the at least two coding vectors through the translation model to obtain at least two decoding results;
and the output module is used for determining a translation result from the at least two decoding results.
According to another aspect of the present application, there is provided a computer device including a processor and a memory coupled to the processor, the memory storing program instructions which, when executed by the processor, implement the translation method provided by the various aspects of the present application.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein program instructions which, when executed by a processor, implement a translation method as provided in various aspects of the present application.
According to another aspect of the present application, there is provided a computer program product (or computer program) comprising computer instructions stored in a computer readable storage medium; a processor of a computer device reads the computer instructions from the computer readable storage medium, which when executed implements a translation method as provided by various aspects of the present application.
According to another aspect of the present application, there is provided a chip comprising programmable logic circuitry and/or program instructions for implementing the translation method as provided in the various aspects of the present application when the chip is running.
The technical solutions provided in the embodiments of the present application may include the following beneficial effects:
In the translation method, word segmentation processing is performed on the sentence to be translated in multiple word segmentation modes to obtain multiple word sequences; the multiple word sequences are then encoded and decoded by a translation model to obtain multiple decoding results, i.e., multiple candidate translation results, and one translation result is determined from the candidates. By adopting multiple word segmentation modes, the method alleviates the problem that a single segmentation mode may segment new words and technical terms inaccurately; at the same time, translating the word sequences produced by the various segmentation modes allows a more accurate translation result to be selected, improving model robustness.
Drawings
In order to more clearly describe the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 illustrates a flow chart of a translation method provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of a translation method provided by another exemplary embodiment of the present application;
FIG. 3 illustrates a flow chart of a translation method provided by another exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of a model training method provided by an exemplary embodiment of the present application;
FIG. 5 illustrates a block diagram of a translation apparatus provided in an exemplary embodiment of the present application;
fig. 6 shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the capabilities of perception, reasoning and decision-making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
Among them, natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics; research in this field involves natural language, i.e., the language people use daily, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering and knowledge graph techniques.
Since the advent of machine translation technology, it has undergone multiple technological iterations, and has now entered the era of the Transformer network architecture, with the self-attention mechanism at its core. On various standard data sets, this model achieves significantly better results than the previous generation of models based on recurrent neural networks (RNNs), and it is widely used in industry. However, the Transformer model has a complex structure, a large scale and numerous parameters, and prior research has shown that it is fragile: a small perturbation in the input can produce a greatly different result, degrading model performance. In low-resource scenarios (fewer than 1 million training examples), the number of model parameters significantly exceeds the amount of training data, making overfitting more likely and further weakening the model's generalization ability.
On the other hand, machine translation models typically require an embedding matrix to convert discrete input tokens into dense vectors before subsequent computation. The number of elements in the embedding matrix is typically the vocabulary size |V| times the vector dimension d. If the vocabulary is too large, the embedding matrix becomes very large, adding to the computational and storage burden; if words in the vocabulary are too sparse, no meaningful vector representations can be learned for them. Chinese is a writing system without word boundaries, and feeding raw Chinese sentences into a model yields a large number of sparse tokens. How to efficiently encode Chinese into word vectors is therefore an important issue in machine translation, especially low-resource machine translation.
Word segmentation is a technique for preprocessing the raw input sentences of machine translation. For Chinese text, the following segmentation strategy is generally adopted: for normal text, the Chinese input is split into words by a strategy based on conditional random fields (CRFs); for example, "I have an apple" can be split by the jieba word segmentation tool into "I / have / an / apple". In this way, the common word "apple" in the two sentences "I have an apple" and "the apple is green" can be identified and encoded uniformly.
However, because language itself evolves rapidly in the information age, there are differences between spoken and written language, and certain domain terms, new words, hot words, proper nouns and colloquialisms are not well recognized by word segmentation tools; jieba, for example, may place word boundaries in a colloquial neologism differently from the user's direct intuition. In addition, the vocabulary produced by Chinese word segmentation tends to have a more pronounced long tail than that of Western languages, potentially yielding more words that appear only once or twice. As mentioned above, the presence of a large number of sparse words also harms the effectiveness of word representation learning.
One way to reduce the number of sparse words is to further segment words into sequences of "subwords" based on statistical information; common methods are Byte-Pair Encoding (BPE) and SentencePiece. The basic idea is to break each word into a letter sequence, then repeatedly merge the highest-frequency bigram to generate new atomic units (the "subwords"), stopping after a designated number of merges. In this way, low-frequency words are decomposed into several high-frequency subwords, reducing the incidence of sparse words. Subwords typically carry a special marker so that word boundaries can be restored. For example, segmenting the word "translation" with BPE may yield the three subwords "trans@@ la@@ tion"; removing the special marker "@@" appended after each subword restores the word itself. This marker is indispensable for Western languages with word boundaries, but for Chinese it creates an obstacle: after the word "international trade" is split into subwords, a sequence like "international@@ trade" may be obtained, where the subword "international@@" and the standalone word "international" have the same meaning; yet because their surface forms differ, they are assigned two different identifiers (IDs) during encoding, and the model cannot recognize the relationship between them. This is equivalent to introducing two representations for the same concept with no direct interaction between them.
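As an illustration of the merge procedure just described, the following minimal Python sketch implements the classic BPE learning loop (an illustrative sketch only; the toy vocabulary, the merge count, and the omission of end-of-word handling are simplifying assumptions, not part of this disclosure):

    import re
    import collections

    def get_pair_stats(vocab):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_pair(pair, vocab):
        # Merge every occurrence of the chosen pair into one new symbol (a "subword").
        merged = {}
        bigram = re.escape(' '.join(pair))
        pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        for word, freq in vocab.items():
            merged[pattern.sub(''.join(pair), word)] = freq
        return merged

    # Words are represented as space-separated symbol sequences (single letters at first).
    vocab = {'t r a n s l a t i o n': 5, 't r a n s f e r': 3, 's t a t i o n': 4}
    for _ in range(10):  # stop after a designated number of merges
        stats = get_pair_stats(vocab)
        if not stats:
            break
        best = max(stats, key=stats.get)  # the highest-frequency bigram
        vocab = merge_pair(best, vocab)

After enough merges, low-frequency words end up expressed as sequences of higher-frequency subwords, which is exactly the effect used here to reduce sparse words.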
With the development of word segmentation technology, many different segmentation tools have appeared. For the same sentence, different tools produce results that are similar overall but differ in detail. Especially for new words and proper nouns, different tools exhibit different segmentation behavior depending on their training corpora.
This application provides a translation method that adopts multiple word segmentation modes, increasing the number of ways a sentence can be segmented. On the one hand this raises the chance that difficult words are segmented correctly; on the other hand it expands the number of samples while introducing a moderate amount of noise, which can further improve model robustness for low-resource tasks. For detailed implementations of the translation method, please refer to the following embodiments.
Fig. 1 is a flowchart of a translation method according to an exemplary embodiment of the present application. The method is applied to a computer device, which may be, for example, a terminal or a server. The method includes:
step 110, word segmentation processing is performed on the sentence to be translated in at least two word segmentation modes, so as to obtain at least two word sequences.
Illustratively, the translation model comprises a word segmentation module; the sentence to be translated is input into the word segmentation module, which performs word segmentation processing on the sentence in at least two word segmentation modes to obtain at least two word sequences.
Optionally, the at least two word segmentation modes include a first word segmentation mode and a second word segmentation mode. The computer device splits the sentence to be translated into characters and words in the first word segmentation mode to generate a first word sequence, and splits the sentence to be translated into characters and words in the second word segmentation mode to generate a second word sequence.
Illustratively, the same input is segmented using multiple different segmentation tools (e.g., jieba, pkuseg, scws) simultaneously; different tools adopt different word segmentation modes. For each segmentation mode, a distinct tag is added at the head of the segmented word sequence to indicate which mode the sequence used. For example, for "I want to sit on the bus to go to school today", the three tools above produce the following tagged results (the three word sequences differ in where the word boundaries fall):
<jieba> I want to sit on the bus to go to school today;
<pkuseg> I want to sit on the bus to go to school today;
<scws> I want to sit on the bus to go to school today.
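A minimal sketch of this multi-tool tagging step, assuming the jieba and pkuseg Python packages (scws is omitted here because it is commonly used through C/PHP bindings; the input sentence is illustrative):

    import jieba    # pip install jieba
    import pkuseg   # pip install pkuseg

    pku = pkuseg.pkuseg()  # default pkuseg model

    def segment_with_tags(sentence):
        # One tagged word sequence per word segmentation mode; the tag at the
        # head of the sequence indicates which segmenter produced it.
        sequences = {
            '<jieba>': jieba.lcut(sentence),
            '<pkuseg>': pku.cut(sentence),
        }
        return [[tag] + tokens for tag, tokens in sequences.items()]

    for seq in segment_with_tags('我今天想坐公交车去上学'):  # illustrative input
        print(' '.join(seq))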
Because different word segmentation modes differ in segmentation accuracy and granularity, performing word segmentation on the sentence to be translated in multiple modes increases the chance of introducing a correct segmentation, while the expanded sample data mitigates the influence of noise.
Optionally, the computer device splits the sentence to be translated into characters and words in the first word segmentation mode to generate a first intermediate sequence, and splits the sentence to be translated into characters and words in the second word segmentation mode to generate a second intermediate sequence; taking the first intermediate sequence and the second intermediate sequence as a whole, the whole words in the sequences are split into subwords, yielding a first subword sequence corresponding to the first intermediate sequence and a second subword sequence corresponding to the second intermediate sequence; the first subword sequence is determined as the first word sequence, and the second subword sequence is determined as the second word sequence. Alternatively, the computer device splits the sentence to be translated into characters and words in the first word segmentation mode to generate a first intermediate sequence, and splits the whole words in the first intermediate sequence into subwords to obtain a first subword sequence; it splits the sentence to be translated into characters and words in the second word segmentation mode to generate a second intermediate sequence, and splits the whole words in the second intermediate sequence into subwords to obtain a second subword sequence; the first subword sequence is determined as the first word sequence, and the second subword sequence is determined as the second word sequence.
Optionally, the computer device splits the sentence to be translated into characters and words in the first word segmentation mode to generate a first intermediate sequence, and splits the sentence to be translated into characters and words in the second word segmentation mode to generate a second intermediate sequence; taking the two intermediate sequences as a whole, the whole words in the sequences are split into subwords, yielding a first subword sequence and a second subword sequence; then, taking the two subword sequences as a whole, the subwords matching the target word frequency are split into characters, yielding a first word sequence corresponding to the first subword sequence and a second word sequence corresponding to the second subword sequence. Alternatively, the computer device splits the sentence to be translated into characters and words in the first word segmentation mode to generate a first intermediate sequence, splits the whole words in the first intermediate sequence into subwords to obtain a first subword sequence, and splits the subwords matching the target word frequency in the first subword sequence into characters to obtain the first word sequence; it then splits the sentence to be translated into characters and words in the second word segmentation mode to generate a second intermediate sequence, splits the whole words in the second intermediate sequence into subwords to obtain a second subword sequence, and splits the subwords matching the target word frequency in the second subword sequence into characters to obtain the second word sequence.
When the sentence to be translated is a Chinese sentence, the markers introduced after subwords during subword segmentation are also removed; for example, the marker "@@" appended after each subword by the segmentation is removed, so that the subword and the whole word learn a shared embedded representation.
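A minimal sketch of this marker removal, assuming the conventional "@@" continuation marker produced by subword-nmt-style BPE (the token list is illustrative):

    def strip_bpe_markers(tokens):
        # Remove the trailing "@@" so that a subword such as "国际@@" shares the
        # same surface form, and hence the same embedding, as the word "国际".
        # Unlike the usual BPE post-processing, the tokens remain split.
        return [t[:-2] if t.endswith('@@') else t for t in tokens]

    print(strip_bpe_markers(['国际@@', '贸易', 'trans@@', 'la@@', 'tion']))
    # ['国际', '贸易', 'trans', 'la', 'tion']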
Illustratively, the target word frequency is a word frequency range, for example the range of frequencies below a word frequency threshold. The computer device counts the frequency of each subword in the first subword sequence and the second subword sequence, and splits the subwords whose frequency is below the threshold into characters, obtaining a first word element sequence and a second word element sequence. Alternatively, the computer device counts the frequency of each subword in the first subword sequence and splits the subwords below the threshold into characters to obtain a first word element sequence (i.e., the first word sequence); it then counts the frequency of each subword in the second subword sequence and splits the subwords below the threshold into characters to obtain a second word element sequence (i.e., the second word sequence).
That is, to further reduce the vocabulary size and to mitigate the oversized model volume and slow decoding caused by an oversized embedding matrix, word frequency statistics are computed over the word elements produced by subword segmentation, and word elements whose frequency is below the word frequency threshold (for example, a threshold of 10) are further broken into characters. Beyond model compression and decoding acceleration, there is another reason: in low-resource tasks, the information a model can learn about low-frequency words is limited, and good representations of such words cannot be obtained; for example, the vector representations learned for low-frequency words are insufficient to distinguish them. For Chinese, individual characters themselves carry rich semantic information; in many cases the meaning of a word can be obtained by combining the Chinese characters that compose it, and individual characters usually occur significantly more frequently than whole words. In this case, letting the model learn good single-character representations and then compose word representations from context information is more effective than using low-frequency words that are insufficiently learned.
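The frequency-based fallback to characters might be sketched as follows (an illustrative sketch; the threshold of 10 follows the example above, and corpus_sequences is an assumed list of already-segmented token sequences):

    from collections import Counter

    def split_low_frequency_tokens(corpus_sequences, freq_threshold=10):
        # Count word-element frequencies over the whole corpus first.
        freq = Counter(tok for seq in corpus_sequences for tok in seq)
        # Break every token below the threshold into single characters; iterating
        # over a string yields its characters, while [tok] keeps the token whole.
        return [
            [piece for tok in seq
                   for piece in (tok if freq[tok] < freq_threshold else [tok])]
            for seq in corpus_sequences
        ]

Characters produced this way occur far more frequently than the low-frequency word elements they replace, so the model can learn well-trained representations for them.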
Step 120, performing embedded encoding on the at least two word sequences through a translation model to obtain at least two encoding vectors.
Illustratively, the translation model includes an encoding module; the computer device performs embedded encoding on the at least two word sequences through the encoding module to obtain the at least two encoding vectors. For example, the computer device encodes the at least two word sequences sequentially through the encoding module; or it encodes them simultaneously, for example by splicing the at least two word sequences and performing embedded encoding on the spliced word sequence.
Step 130, decoding the at least two encoding vectors through the translation model to obtain at least two decoding results.
Illustratively, the translation model includes a decoding module; the computer device decodes the at least two encoding vectors through the decoding module to obtain the at least two decoding results. For example, the computer device decodes the at least two encoding vectors sequentially through the decoding module; or it decodes them simultaneously, for example by splicing the at least two encoding vectors and decoding the spliced vector, or by decoding the encoding vector corresponding to the spliced word sequence.
Step 140, determining a translation result from the at least two decoding results.
Optionally, the computer device scores the at least two decoding results through a scoring algorithm or a language model to obtain at least two scores, and determines the decoding result corresponding to the highest score as the output result.
For example, the decoding results are divided into n parts according to the sentence-head tags (i.e., the tags corresponding to the word segmentation modes), where n is the number of word segmentation modes; an automatic scoring tool evaluates the different decoding results separately, and the best-performing (highest-scoring) decoding result is selected and output.
Alternatively, a language model is used to score the different outputs for the same input, and for each sentence the output with the highest language model score is selected as the final output. To shorten the evaluation time, the statistics-based language model KenLM may be used as the scoring tool. To counteract the segmentation bias introduced by the different word segmentation methods, the corpus used to train KenLM is segmented entirely into single-character sequences.
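A hedged sketch of this selection step, assuming the kenlm Python module and an n-gram model trained on character-segmented text (the model path is illustrative):

    import kenlm  # Python bindings of the KenLM toolkit

    lm = kenlm.Model('zh_char.arpa')  # illustrative path: an LM trained on
                                      # corpora segmented entirely into characters

    def pick_best(candidate_translations):
        # Score every candidate at the character level, so that no word
        # segmentation scheme is favored, and return the highest-scoring one.
        def char_level_score(sentence):
            chars = ' '.join(sentence.replace(' ', ''))
            return lm.score(chars, bos=True, eos=True)
        return max(candidate_translations, key=char_level_score)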
For example, consider English-to-Chinese translation: Chinese appears at the target side, and there are 3 different word segmentation schemes for the target sentence. To translate "I want to go to the park by bus", three inputs can be constructed at translation time:
<char>I want to go to the park by bus;
<jieba>I want to go to the park by bus;
<pkuseg>I want to go to the park by bus.
The three outputs are:
<char> I want to go to the park to sit on public buses;
<jieba> I want to walk to the park on the bus;
<pkuseg> I want to sit on a bus to the park.
Because different word segmentation tools are specified, the model generates different translation results; the differences are not merely differences in segmentation, but differences in the generated content itself. The best output is then selected from these candidates, namely "I want to sit on the bus to go to the park".
In summary, in the translation method provided by this embodiment, word segmentation is performed on the sentence to be translated in multiple word segmentation modes to obtain multiple word sequences; the word sequences are then encoded and decoded by a translation model to obtain multiple decoding results, i.e., multiple candidate translation results, and one translation result is determined from the candidates. By adopting multiple word segmentation modes, the method alleviates the problem that a single segmentation mode may segment new words and technical terms inaccurately; at the same time, translating the word sequences produced by the various segmentation modes allows a more accurate translation result to be selected, improving model robustness.
Fig. 2 is a flowchart of a translation method according to another exemplary embodiment of the present application. The method is applied to a computer device, which may be, for example, a terminal or a server. The method includes:
step 210, performing word segmentation processing on the sentence to be translated in at least two word segmentation modes to obtain at least two word sequences.
Illustratively, the translation model comprises a word segmentation module; the sentence to be translated is input into the word segmentation module, which performs word segmentation processing on the sentence in at least two word segmentation modes to obtain at least two word sequences.
Optionally, the at least two word segmentation modes include a first word segmentation mode and a second word segmentation mode. The computer device splits the sentence to be translated into characters and words in the first word segmentation mode to generate a first word sequence, and splits the sentence to be translated into characters and words in the second word segmentation mode to generate a second word sequence.
Illustratively, the same input is segmented using multiple different segmentation tools (e.g., jieba, pkuseg, scws) simultaneously; different tools adopt different word segmentation modes. For each segmentation mode, a distinct tag is added at the head of the segmented word sequence to indicate which mode the sequence used. Because different segmentation modes differ in accuracy and granularity, performing word segmentation on the sentence to be translated in multiple modes increases the chance of introducing a correct segmentation, while the expanded sample data mitigates the influence of noise.
Optionally, the computer device splits the sentence to be translated into characters and words in the first word segmentation mode to generate a first intermediate sequence, and splits the sentence to be translated into characters and words in the second word segmentation mode to generate a second intermediate sequence; taking the first intermediate sequence and the second intermediate sequence as a whole, the whole words in the sequences are split into subwords, yielding a first subword sequence corresponding to the first intermediate sequence and a second subword sequence corresponding to the second intermediate sequence; the first subword sequence is determined as the first word sequence, and the second subword sequence is determined as the second word sequence. Alternatively, the computer device splits the sentence to be translated into characters and words in the first word segmentation mode to generate a first intermediate sequence, and splits the whole words in the first intermediate sequence into subwords to obtain a first subword sequence; it splits the sentence to be translated into characters and words in the second word segmentation mode to generate a second intermediate sequence, and splits the whole words in the second intermediate sequence into subwords to obtain a second subword sequence; the first subword sequence is determined as the first word sequence, and the second subword sequence is determined as the second word sequence.
Optionally, the computer device splits the sentence to be translated into characters and words in the first word segmentation mode to generate a first intermediate sequence, and splits the sentence to be translated into characters and words in the second word segmentation mode to generate a second intermediate sequence; taking the two intermediate sequences as a whole, the whole words in the sequences are split into subwords, yielding a first subword sequence and a second subword sequence; then, taking the two subword sequences as a whole, the subwords matching the target word frequency are split into characters, yielding a first word sequence corresponding to the first subword sequence and a second word sequence corresponding to the second subword sequence. Alternatively, the computer device splits the sentence to be translated into characters and words in the first word segmentation mode to generate a first intermediate sequence, splits the whole words in the first intermediate sequence into subwords to obtain a first subword sequence, and splits the subwords matching the target word frequency in the first subword sequence into characters to obtain the first word sequence; it then splits the sentence to be translated into characters and words in the second word segmentation mode to generate a second intermediate sequence, splits the whole words in the second intermediate sequence into subwords to obtain a second subword sequence, and splits the subwords matching the target word frequency in the second subword sequence into characters to obtain the second word sequence.
When the sentence to be translated is a Chinese sentence, the markers introduced after subwords during subword segmentation are also removed; for example, the marker "@@" appended after each subword by the segmentation is removed, so that the subword and the whole word learn a shared embedded representation.
Illustratively, the target word frequency is a word frequency range, for example the range of frequencies below a word frequency threshold. The computer device counts the frequency of each subword in the first subword sequence and the second subword sequence, and splits the subwords whose frequency is below the threshold into characters, obtaining a first word element sequence and a second word element sequence. Alternatively, the computer device counts the frequency of each subword in the first subword sequence and splits the subwords below the threshold into characters to obtain a first word element sequence (i.e., the first word sequence); it then counts the frequency of each subword in the second subword sequence and splits the subwords below the threshold into characters to obtain a second word element sequence (i.e., the second word sequence).
That is, to further reduce the vocabulary size and to mitigate the oversized model volume and slow decoding caused by an oversized embedding matrix, word frequency statistics are computed over the word elements produced by subword segmentation, and word elements whose frequency is below the word frequency threshold (for example, a threshold of 10) are further broken into characters. Beyond model compression and decoding acceleration, there is another reason: in low-resource tasks, the information a model can learn about low-frequency words is limited, and good representations of such words cannot be obtained; for example, the vector representations learned for low-frequency words are insufficient to distinguish them. For Chinese, individual characters themselves carry rich semantic information; in many cases the meaning of a word can be obtained by combining the Chinese characters that compose it, and individual characters usually occur significantly more frequently than whole words. In this case, letting the model learn good single-character representations and then compose word representations from context information is more effective than using low-frequency words that are insufficiently learned.
Step 220, splitting the sentence to be translated into single characters to generate a character sequence.
To better let the model learn representations of Chinese characters, each piece of text data is additionally broken up into a sequence consisting entirely of single characters. Through joint multi-task learning over the character sequence and the word sequences, the model can apply the character representations learned from the character sequence to the word sequences, learning better representations for the word sequences.
By way of example, still taking the sentence "I want to sit on the bus to go to school today", the preprocessing of step 210 and step 220 yields the following four sequences:
<jieba> I want to sit on the bus to go to school today;
<pkuseg> I want to sit on the bus to go to school today;
<scws> I want to sit on the bus to go to school today;
<char> I want to sit on the bus to go to school today.
When Chinese is the target language (the language of the translation result), the non-Chinese input text does not need to be modified; only a method tag is placed at the beginning of the sentence (<char> corresponds to the character sequence, <pkuseg> to text segmented with pkuseg, <jieba> to text segmented with jieba, and <scws> to text segmented with scws). In the inference stage, the decoder automatically generates a correspondingly segmented decoding result according to the tag prompt.
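A small sketch of how such tag-prompted inputs could be constructed when the source text is English (the tag set follows the example above; the helper name is illustrative):

    METHOD_TAGS = ['<char>', '<jieba>', '<pkuseg>', '<scws>']

    def build_tagged_inputs(source_sentence):
        # The non-Chinese source text itself is unchanged; each copy only gains
        # a method tag at the head of the sentence, prompting the decoder to
        # emit a correspondingly segmented Chinese output.
        return [f'{tag} {source_sentence}' for tag in METHOD_TAGS]

    for line in build_tagged_inputs('I want to go to the park by bus'):
        print(line)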
Step 230, taking the at least two word sequences and the character sequence as input data, and inputting the input data into the translation model for embedded encoding to obtain at least three encoding vectors.
The number of the at least three encoding vectors equals the number of the at least two word sequences plus the character sequence: if the number of word sequences is denoted n and the number of encoding vectors m, then m = n + 1.
Illustratively, the translation model includes an encoding module; the computer device performs embedded encoding on the at least three sequences through the encoding module to obtain the at least three encoding vectors. For example, the computer device encodes the at least three sequences sequentially through the encoding module; or it encodes them simultaneously, for example by splicing the at least three sequences and performing embedded encoding on the spliced sequence.
Step 240, decoding the at least three encoding vectors through the translation model to obtain at least three decoding results.
Illustratively, the translation model includes a decoding module; the computer device decodes the at least three encoding vectors through the decoding module to obtain the at least three decoding results. For example, the computer device decodes the at least three encoding vectors sequentially through the decoding module; or it decodes them simultaneously, for example by splicing the at least three encoding vectors and decoding the spliced vector, or by decoding the encoding vectors corresponding to the spliced sequence.
Step 250, determining a translation result from at least three decoding results.
Optionally, the computer device scores the at least three decoding results through a scoring algorithm or a language model to obtain at least three scores, and determines the decoding result corresponding to the highest score as the output result.
Illustratively, the decoding results are divided into m parts according to the sentence-head tags; an automatic scoring tool evaluates the different decoding results separately, and the best-performing (highest-scoring) decoding result is selected and output.
Alternatively, a language model is used to score the different outputs for the same input, and for each sentence the output with the highest language model score is selected as the final output. To shorten the evaluation time, the statistics-based language model KenLM may be used as the scoring tool. To counteract the segmentation bias introduced by the different word segmentation methods, the corpus used to train KenLM is segmented entirely into single-character sequences.
The overall flow of this embodiment is shown in Fig. 3. After the sentence to be translated is input into the translation model, it is split into characters to generate a character sequence; meanwhile, word segmentation is performed on the sentence in multiple word segmentation modes to obtain multiple word sequences, which are then further segmented with BPE to obtain multiple subword sequences. The BPE markers in the subword sequences are removed, subword frequencies are counted, and low-frequency subwords are split into characters, finally generating multiple word element sequences. The character sequence and the multiple word element sequences are then used as input data and fed into the model for encoding and decoding; KenLM scores and ranks the multiple decoding results, and the final translation result is output.
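Putting the stages of Fig. 3 together, the inference flow might be sketched as follows (an illustrative sketch reusing the helpers strip_bpe_markers and the frequency fallback from the earlier snippets; translate stands in for the encoder-decoder call, whose internals are not specified here, and freq is a Counter built on the training corpus):

    def translate_sentence(sentence, segmenters, apply_bpe, freq, lm, translate,
                           freq_threshold=10):
        # 1. One character sequence plus one tagged word sequence per mode.
        sequences = [['<char>'] + list(sentence)]
        for tag, segment in segmenters.items():    # e.g. {'<jieba>': jieba.lcut}
            sequences.append([tag] + segment(sentence))
        # 2. BPE subword splitting, "@@" marker removal, and the low-frequency
        #    fallback to characters.
        processed = []
        for seq in sequences:
            toks = strip_bpe_markers(apply_bpe(seq[1:]))
            toks = [p for t in toks
                      for p in (t if freq[t] < freq_threshold else [t])]
            processed.append(seq[:1] + toks)
        # 3. Encode/decode each sequence, then rank the candidates with KenLM.
        candidates = [translate(seq) for seq in processed]
        return max(candidates,
                   key=lambda s: lm.score(' '.join(s.replace(' ', '')),
                                          bos=True, eos=True))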
In summary, in the translation method provided by this embodiment, word segmentation is performed on the sentence to be translated in multiple word segmentation modes to obtain multiple word sequences; the word sequences are then encoded and decoded by a translation model to obtain multiple decoding results, i.e., multiple candidate translation results, and one translation result is determined from the candidates. By adopting multiple word segmentation modes, the method alleviates the problem that a single segmentation mode may segment new words and technical terms inaccurately; at the same time, translating the word sequences produced by the various segmentation modes allows a more accurate translation result to be selected, improving model robustness.
This scheme shows a clear effect on a Tibetan-to-Chinese translation task data set from a machine translation conference. Compared with the character-based model, the effect improves by 2.5 BLEU points on the development set and by nearly 2 points on the test set; there is also a 1.3-point improvement on the development set over the model based on pkuseg segmentation. The specific improvement contributed by each component, according to ablation experiments, is shown in Table 1 below, where BLEU is the Bilingual Evaluation Understudy metric.
TABLE 1
[Table 1, showing the per-component BLEU improvements from the ablation experiments, is provided as an image in the original publication.]
In summary, in the translation method provided by this application, the multi-segmentation approach counteracts the negative effects of the segmentation errors introduced by any single word segmentation method, adds training data, and improves robustness. In accordance with the characteristics of written Chinese, BPE markers are removed and low-frequency subwords are broken into characters, which improves the robustness of the model's representation learning, reduces the size of the model vocabulary, reduces the disk space occupied and the video memory required at run time, and increases decoding speed. The method further uses KenLM to rank the output results, which further improves the translation effect without significantly increasing the model's decoding time.
Fig. 4 shows a flowchart of a model training method according to an exemplary embodiment of the present application. The method is applied to a computer device, which may be, for example, a terminal or a server. The method includes:
step 310, obtaining a reference translation result of the sample sentence corresponding to the sample sentence.
Illustratively, the sample sentence is stored in pairs with the reference translation result in a local memory or database; the computer equipment obtains a reference translation result of a sample sentence corresponding to the sample sentence from a local memory or a database.
Step 320, performing word segmentation processing on the sample sentence by at least two word segmentation modes to obtain at least two sample word sequences.
Illustratively, the translation model comprises a word segmentation module; the sample sentence is input into the word segmentation module, which performs word segmentation on the sample sentence in at least two word segmentation modes to obtain at least two sample word sequences.
Optionally, the at least two word segmentation modes include a first word segmentation mode and a second word segmentation mode. The computer device splits the sample sentence into characters and words in the first word segmentation mode to generate a first sample word sequence, and splits the sample sentence into characters and words in the second word segmentation mode to generate a second sample word sequence.
Optionally, the computer device splits the sample sentence into characters and words in the first word segmentation mode to generate a first intermediate sample sequence, and splits the sample sentence into characters and words in the second word segmentation mode to generate a second intermediate sample sequence; taking the two intermediate sample sequences as a whole, the whole words in the sequences are split into subwords, yielding a first sample subword sequence corresponding to the first intermediate sample sequence and a second sample subword sequence corresponding to the second intermediate sample sequence; the first sample subword sequence is determined as the first sample word sequence, and the second sample subword sequence is determined as the second sample word sequence. Alternatively, the computer device splits the sample sentence into characters and words in the first word segmentation mode to generate a first intermediate sample sequence, and splits the whole words in the first intermediate sample sequence into subwords to obtain a first sample subword sequence; it splits the sample sentence into characters and words in the second word segmentation mode to generate a second intermediate sample sequence, and splits the whole words in the second intermediate sample sequence into subwords to obtain a second sample subword sequence; the first sample subword sequence is determined as the first sample word sequence, and the second sample subword sequence is determined as the second sample word sequence.
Optionally, the computer device splits the sample sentence into characters and words in the first word segmentation mode to generate a first intermediate sample sequence, and splits the sample sentence into characters and words in the second word segmentation mode to generate a second intermediate sample sequence; taking the two intermediate sample sequences as a whole, the whole words in the sequences are split into subwords, yielding a first sample subword sequence and a second sample subword sequence; then, taking the two sample subword sequences as a whole, the subwords matching the target word frequency are split into characters, yielding a first sample word sequence corresponding to the first sample subword sequence and a second sample word sequence corresponding to the second sample subword sequence.
Alternatively, the computer device splits the sample sentence into characters and words in the first word segmentation mode to generate a first intermediate sample sequence, splits the whole words in the first intermediate sample sequence into subwords to obtain a first sample subword sequence, and splits the subwords matching the target word frequency in the first sample subword sequence into characters to obtain the first sample word sequence; it then splits the sample sentence into characters and words in the second word segmentation mode to generate a second intermediate sample sequence, splits the whole words in the second intermediate sample sequence into subwords to obtain a second sample subword sequence, and splits the subwords matching the target word frequency in the second sample subword sequence into characters to obtain the second sample word sequence.
When the sample sentence is a Chinese sentence, the markers introduced after subwords are removed; for example, the marker "@@" appended after each subword by the segmentation is removed, so that the subword and the whole word learn a shared embedded representation.
Illustratively, the target word frequency is a word frequency range, for example the range of frequencies below a word frequency threshold. The computer device counts the frequency of each subword in the first sample subword sequence and the second sample subword sequence, and splits the subwords whose frequency is below the threshold into characters, obtaining a first sample word element sequence and a second sample word element sequence. Alternatively, the computer device counts the frequency of each subword in the first sample subword sequence and splits the subwords below the threshold into characters to obtain a first sample word element sequence (i.e., the first sample word sequence); it then counts the frequency of each subword in the second sample subword sequence and splits the subwords below the threshold into characters to obtain a second sample word element sequence (i.e., the second sample word sequence).
Step 330, performing embedded encoding on the at least two sample word sequences through the translation model to obtain at least two sample encoding vectors.
Illustratively, the translation model includes an encoding module; the computer device performs embedded encoding on the at least two sample word sequences through the encoding module to obtain the at least two sample encoding vectors. For example, the computer device encodes the at least two sample word sequences sequentially through the encoding module; or it encodes them simultaneously, for example by splicing the at least two sample word sequences and performing embedded encoding on the spliced sample word sequence.
Step 340, decoding the at least two sample encoding vectors through the translation model to obtain at least two sample decoding results.
Illustratively, the translation model includes a decoding module; the computer device decodes the at least two sample encoding vectors through the decoding module to obtain the at least two sample decoding results. For example, the computer device decodes the at least two sample encoding vectors sequentially through the decoding module; or it decodes them simultaneously, for example by splicing the at least two sample encoding vectors and decoding the spliced vector, or by decoding the sample encoding vector corresponding to the spliced sample word sequence.
Step 350, determining a sample translation result from the at least two sample decoding results.
Optionally, the computer device scores the at least two sample decoding results through a scoring algorithm or a language model to obtain at least two scores, and determines the sample decoding result corresponding to the highest score as the output result.
For example, the sample decoding results are divided into n parts according to the sentence-head tags, where n is the number of word segmentation modes; an automatic scoring tool evaluates the different sample decoding results separately, and the best-performing (highest-scoring) sample decoding result is selected and output.
Alternatively, a language model is used to score the different outputs for the same input, and for each sentence the output with the highest language model score is selected as the final output. To shorten the evaluation time, the statistics-based language model KenLM may be used as the scoring tool. To counteract the segmentation bias introduced by the different word segmentation methods, the corpus used to train KenLM is segmented entirely into single-character sequences.
Step 360, adjusting model parameters of the translation model based on the translation error between the sample translation result and the reference translation result.
The computer device calculates the cross-entropy loss between the sample translation result and the reference translation result, and performs back-propagation training on the translation model using the cross-entropy loss as the translation error, thereby adjusting the model parameters of the translation model.
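A hedged sketch of this parameter update in PyTorch (this application does not specify a framework; model is assumed to return per-position vocabulary logits under teacher forcing, and all names are illustrative):

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, src_ids, ref_ids, pad_id=0):
        # Teacher forcing: predict each reference token from the source and the
        # preceding reference tokens; logits shape: (batch, tgt_len - 1, vocab).
        logits = model(src_ids, ref_ids[:, :-1])
        # The cross-entropy between the predictions and the reference
        # translation serves as the translation error.
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            ref_ids[:, 1:].reshape(-1),
            ignore_index=pad_id,    # padding positions do not contribute
        )
        optimizer.zero_grad()
        loss.backward()             # back-propagate the translation error
        optimizer.step()            # adjust the translation model's parameters
        return loss.item()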
In some embodiments, the computer device further splits the sample sentence into single characters, generating a sample single word sequence; takes the at least two sample word sequences and the sample single word sequence as input data and inputs them into the translation model for embedded coding, obtaining at least three sample coding vectors, the number of which equals the number of the at least two sample word sequences plus the one sample single word sequence; decodes the at least three sample coding vectors through the translation model to obtain at least three sample decoding results; and determines the sample translation result from the at least three sample decoding results. Error calculation is then performed based on the sample translation result, and the model is trained based on that error.
In summary, the model training method provided by this application addresses the segmentation errors that any single word segmentation method introduces: using multiple word segmentation methods offsets the negative effects of those errors, enlarges the training data, and improves robustness. In line with the characteristics of Chinese text, BPE marker symbols are removed and low-frequency subwords are broken into characters, which improves the robustness of the model's representation learning, shrinks the model vocabulary, reduces the disk space occupied and the GPU memory required at run time, and speeds up decoding. The method further uses KenLM to rank the output results, improving translation quality without noticeably increasing the model's decoding time.
FIG. 5 shows a block diagram of a translation apparatus provided in an exemplary embodiment of the present application. The apparatus may be implemented, in software, hardware, or a combination of the two, as part or all of a computer device, and comprises:
the word segmentation module 410 is configured to perform word segmentation on the sentence to be translated in at least two word segmentation modes to obtain at least two word sequences;
the encoding module 420 is configured to perform embedded encoding on at least two word sequences through a translation model to obtain at least two encoding vectors;
a decoding module 430, configured to decode the at least two encoded vectors through the translation model to obtain at least two decoding results;
and an output module 440, configured to determine a translation result from at least two decoding results.
In some embodiments, the at least two word segmentation modes include a first word segmentation mode and a second word segmentation mode;
the word segmentation processing performed on the sentence to be translated in the at least two word segmentation modes to obtain the at least two word sequences includes:
splitting a sentence to be translated into characters and words in a first word segmentation mode to generate a first word sequence;
and splitting the sentence to be translated into characters and words in a second word splitting mode to generate a second word sequence.
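No concrete segmentation tools are prescribed herein; purely as an illustration, jieba's precise mode and search-engine mode could serve as two different word segmentation modes producing two word sequences:

```python
import jieba

sentence = "机器翻译需要更准确的分词"
first_word_sequence = jieba.lcut(sentence)              # first mode: precise
second_word_sequence = jieba.lcut_for_search(sentence)  # second mode: finer grain
```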
In some embodiments, splitting the sentence to be translated into characters and words in the first word segmentation mode to generate the first word sequence, and splitting the sentence to be translated into characters and words in the second word segmentation mode to generate the second word sequence, includes:
splitting a sentence to be translated into characters and words in a first word segmentation mode to generate a first intermediate sequence;
splitting a sentence to be translated into characters and words in a second word splitting mode to generate a second intermediate sequence;
taking the first intermediate sequence and the second intermediate sequence as a whole, splitting the whole words in the sequences into subwords, and obtaining a first subword sequence corresponding to the first intermediate sequence and a second subword sequence corresponding to the second intermediate sequence;
and taking the first subword sequence and the second subword sequence as a whole, splitting the subwords corresponding to the target word frequency in the sequences into words, to obtain a first word sequence corresponding to the first subword sequence and a second word sequence corresponding to the second subword sequence.
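A hedged sketch of the whole-word-to-subword step follows, treating both intermediate sequences as one training corpus so that a single shared subword model is learned; sentencepiece with a BPE model type is used as a stand-in toolkit, and the file names and vocabulary size are assumptions:

```python
import sentencepiece as spm

# Train one BPE model on a file holding both intermediate sequences.
spm.SentencePieceTrainer.train(
    input="intermediate_sequences.txt", model_prefix="shared_bpe",
    vocab_size=8000, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="shared_bpe.model")
# Apply the shared model to each intermediate sequence separately.
first_subword_sequence = sp.encode("机器翻译 需要 分词", out_type=str)
second_subword_sequence = sp.encode("机器 翻译 需要 分 词", out_type=str)
```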
In some embodiments, taking the first subword sequence and the second subword sequence as a whole and splitting the subwords corresponding to the target word frequency in the sequences into words, to obtain the first word sequence corresponding to the first subword sequence and the second word sequence corresponding to the second subword sequence, includes:
counting the word frequency of each subword across the first subword sequence and the second subword sequence;
and splitting the subwords whose word frequency is lower than the word frequency threshold into words to obtain the first word sequence and the second word sequence.
In some embodiments, the apparatus is further configured to:
splitting the sentence to be translated into single characters to generate a single word sequence, as sketched below;
taking the at least two word sequences and the single word sequence as input data and inputting the input data into the translation model for embedded coding to obtain at least three coding vectors, the number of which equals the number of the at least two word sequences plus the one single word sequence;
decoding at least three coding vectors through a translation model to obtain at least three decoding results;
and determining a translation result from at least three decoding results.
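Continuing the jieba sketch above, the single word sequence is simply the sentence split into individual characters, appended as one more model input:

```python
import jieba

sentence = "机器翻译需要分词"
word_sequences = [jieba.lcut(sentence), jieba.lcut_for_search(sentence)]
single_word_sequence = list(sentence)        # one character per element
model_inputs = word_sequences + [single_word_sequence]
# The translation model then produces len(model_inputs) coding vectors
# and decoding results, i.e. at least three of each.
```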
In some embodiments, determining a translation result from at least two decoding results comprises:
scoring the at least two decoding results by a scoring algorithm or a language model to obtain at least two scores;
and determining a decoding result corresponding to the highest score in the at least two scores as an output result.
In some embodiments, the training process of the translation model includes:
acquiring a sample sentence and a reference translation result corresponding to the sample sentence;
performing word segmentation processing on the sample sentence in at least two word segmentation modes to obtain at least two sample word sequences;
embedding and encoding at least two sample word sequences through a translation model to obtain at least two sample encoding vectors;
decoding the at least two sample coding vectors through a translation model to obtain at least two sample decoding results;
determining a sample translation result from at least two sample decoding results;
and adjusting model parameters of the translation model based on the translation error between the sample translation result and the reference translation result.
Fig. 6 shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application. The computer device may be a terminal or a server that performs the translation method and/or the model training method provided herein.
The computer apparatus 600 includes a central processing unit (CPU, Central Processing Unit) 601, a system memory 604 including a random access memory (RAM, Random Access Memory) 602 and a read-only memory (ROM, Read Only Memory) 603, and a system bus 605 connecting the system memory 604 and the central processing unit 601. The computer device 600 also includes a basic input/output system (I/O system, Input Output System) 606, which helps to transfer information between the various devices within the computer, and a mass storage device 607 for storing an operating system 613, application programs 614, and other program modules 615.
The basic input/output system 606 includes a display 608 for displaying information and an input device 609, such as a mouse or a keyboard, through which a user inputs information. The display 608 and the input device 609 are both connected to the central processing unit 601 via an input/output controller 610 coupled to the system bus 605. The basic input/output system 606 may also include the input/output controller 610 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 610 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 607 is connected to the central processing unit 601 through a mass storage controller (not shown) connected to the system bus 605. The mass storage device 607 and its associated computer-readable media provide non-volatile storage for the computer device 600. That is, the mass storage device 607 may include a computer readable medium (not shown) such as a hard disk or compact disc read only memory (CD-ROM, compact Disc Read Only Memory) drive.
Computer readable media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (EPROM, erasable Programmable Read Only Memory), electrically erasable programmable read-only memory (EEPROM, electrically Erasable Programmable Read Only Memory), flash memory or other solid state memory technology, CD-ROM, digital versatile disks (DVD, digital Versatile Disc) or solid state disks (SSD, solid State Drives), other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. The random access memory may include resistive random access memory (ReRAM, resistance Random Access Memory) and dynamic random access memory (DRAM, dynamic Random Access Memory), among others. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 604 and mass storage device 607 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 600 may also operate by connecting to a remote computer on a network, such as the Internet. That is, the computer device 600 may be connected to the network 612 through a network interface unit 611 connected to the system bus 605, or the network interface unit 611 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs, the one or more programs being stored in the memory and configured to be executed by the CPU to implement the translation method and/or the model training method described above.
Embodiments of the present application also provide a computer readable storage medium storing at least one instruction that is loaded and executed by a processor to implement the translation method and/or model training method described in the above embodiments.
Alternatively, the computer-readable storage medium may include: a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), a solid state drive (SSD, Solid State Drive), an optical disc, or the like. The random access memory may include a resistive random access memory (ReRAM, Resistance Random Access Memory) and a dynamic random access memory (DRAM, Dynamic Random Access Memory).
It should be noted that the apparatus provided in the foregoing embodiments is illustrated only by the division into the functional modules described above; in practical applications, these functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; their specific implementation is detailed in the method embodiments and is not repeated here.
The above embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is merely illustrative of the possible embodiments of the present application and is not intended to limit the present application, but any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (11)

1. A method of translation, the method comprising:
performing word segmentation processing on sentences to be translated in at least two word segmentation modes to obtain at least two word sequences;
embedding and coding the at least two word sequences through a translation model to obtain at least two coding vectors;
decoding the at least two coding vectors through the translation model to obtain at least two decoding results;
and determining a translation result from the at least two decoding results.
2. The method of claim 1, wherein the at least two word segmentation modes include a first word segmentation mode and a second word segmentation mode;
the word segmentation processing is performed on the sentence to be translated in at least two word segmentation modes to obtain at least two word sequences, including:
splitting the sentence to be translated into characters and words in the first word segmentation mode to generate a first word sequence;
and splitting the sentence to be translated into characters and words in the second word splitting mode to generate a second word sequence.
3. The method according to claim 2, wherein splitting the sentence to be translated into characters and words in the first word segmentation mode to generate the first word sequence, and splitting the sentence to be translated into characters and words in the second word segmentation mode to generate the second word sequence, includes:
splitting the sentence to be translated into characters and words in the first word segmentation mode to generate a first intermediate sequence;
splitting the sentence to be translated into characters and words in the second word segmentation mode to generate a second intermediate sequence;
taking the first intermediate sequence and the second intermediate sequence as a whole, splitting whole words in the sequences into subwords, and obtaining a first subword sequence corresponding to the first intermediate sequence and a second subword sequence corresponding to the second intermediate sequence;
and taking the first subword sequence and the second subword sequence as a whole, splitting the subwords corresponding to the target word frequency in the sequences into words, to obtain the first word sequence corresponding to the first subword sequence and the second word sequence corresponding to the second subword sequence.
4. The method of claim 3, wherein taking the first subword sequence and the second subword sequence as a whole, splitting the subwords corresponding to the target word frequency in the sequences into words, to obtain the first word sequence corresponding to the first subword sequence and the second word sequence corresponding to the second subword sequence, includes:
counting the word frequency of each subword in the first subword sequence and the second subword sequence;
and splitting the subwords whose word frequency is lower than the word frequency threshold into words to obtain the first word sequence and the second word sequence.
5. The method according to any one of claims 1 to 4, further comprising:
splitting the sentence to be translated into single words to generate a single word sequence;
the at least two word sequences and the single word sequence are used as input data, the input data are input into the translation model for embedded coding, and at least three coding vectors are obtained, wherein the number of the at least three coding vectors is equal to the sum of the number of the at least two word sequences and the single word sequence;
decoding the at least three coding vectors through the translation model to obtain at least three decoding results;
and determining a translation result from the at least three decoding results.
6. The method according to any one of claims 1 to 4, wherein determining a translation result from the at least two decoding results comprises:
scoring the at least two decoding results by a scoring algorithm or a language model to obtain at least two scores;
and determining the decoding result corresponding to the highest of the at least two scores as the translation result.
7. The method of any one of claims 1 to 4, wherein the training process of the translation model comprises:
acquiring a sample sentence and a reference translation result corresponding to the sample sentence;
performing word segmentation processing on the sample sentence in the at least two word segmentation modes to obtain at least two sample word sequences;
embedding and encoding the at least two sample word sequences through a translation model to obtain at least two sample encoding vectors;
decoding the at least two sample coding vectors through the translation model to obtain at least two sample decoding results;
determining a sample translation result from the at least two sample decoding results;
and adjusting model parameters of the translation model based on the translation error between the sample translation result and the reference translation result.
8. A translation apparatus, the apparatus comprising:
the word segmentation module is used for carrying out word segmentation processing on sentences to be translated in at least two word segmentation modes to obtain at least two word sequences;
the coding module is used for carrying out embedded coding on the at least two word sequences through a translation model to obtain at least two coding vectors;
The decoding module is used for decoding the at least two coding vectors through the translation model to obtain at least two decoding results;
and the output module is used for determining a translation result from the at least two decoding results.
9. A computer device comprising a processor and a memory coupled to the processor, the memory storing program instructions which, when executed by the processor, implement the translation method according to any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein program instructions which, when executed by a processor, implement the translation method according to any of claims 1 to 7.
11. A computer program product, the computer program product comprising computer instructions stored in a computer readable storage medium; a processor of a computer device reads the computer instructions from the computer readable storage medium, the processor executing the computer instructions, causing the computer device to perform the translation method of any one of claims 1 to 7.
CN202211714333.1A 2022-12-29 2022-12-29 Translation method, translation device, translation apparatus, translation medium, and translation program product Pending CN116205242A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211714333.1A CN116205242A (en) 2022-12-29 2022-12-29 Translation method, translation device, translation apparatus, translation medium, and translation program product

Publications (1)

Publication Number Publication Date
CN116205242A true CN116205242A (en) 2023-06-02

Family

ID=86508605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211714333.1A Pending CN116205242A (en) 2022-12-29 2022-12-29 Translation method, translation device, translation apparatus, translation medium, and translation program product

Country Status (1)

Country Link
CN (1) CN116205242A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination