CN113591493B - Translation model training method and translation model device

Info

Publication number: CN113591493B
Application number: CN202110125748.4A
Authority: CN (China)
Prior art keywords: language, corpus, word, training, text
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113591493A (en)
Inventors: 张映雪, 孟凡东
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110125748.4A
Publication of CN113591493A
Application granted
Publication of CN113591493B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to the field of artificial intelligence services, and more particularly, to a training method of a translation model, an apparatus including the translation model, an electronic device, and a computer-readable storage medium. The method comprises the steps of obtaining a third corpus pair sample set based on a first corpus pair sample set and a second corpus pair sample set, wherein each third corpus pair sample in the third corpus pair sample set is a text pair composed of a mixed-language text and a second-language text with the same semantics, and the mixed-language text comprises one or more first-language words and one or more third-language words; and training the translation model using the third corpus pair sample set. The training method enables the translation model to learn enough information related to low-resource languages, thereby improving the translation effect for low-resource languages.

Description

Translation model training method and translation model device
Technical Field
The present disclosure relates to the field of artificial intelligence services, and more particularly, to a training method of a translation model, an apparatus including the translation model, an electronic device, and a computer-readable storage medium.
Background
Translation models typically require a large amount of training data to achieve a high level of accuracy. In a machine translation scenario, this training data typically consists of aligned corpus pairs in two languages. For example, a Chinese sentence meaning "I want to watch a movie" and the English sentence "I want to watch a movie" form a Chinese-English corpus pair. Because a sufficiently large amount of Chinese-English corpus pair data is available, a translation model can normally perform Chinese-English inter-translation well. A language for which a sufficiently large amount of corpus pair data exists is hereinafter referred to as a high-resource language.
However, it is difficult to obtain a large amount of corpus pair data for a small language. For example, most text found on the internet is English-based, whereas the amount of text available in small languages such as Gujarati is limited. As a result, translation models are often not very effective at translating small languages. A language for which only a small amount of corpus pair data exists is hereinafter referred to as a low-resource language.
A common current approach is as follows: the translation model is first pre-trained using high-resource corpus pairs and then fine-tuned using low-resource corpus pairs, thereby improving the accuracy of the translation model on small-language translation.
However, such an approach results in insufficient learning of the low-resource language corpus by the translation model: the translation model cannot extract enough information from the low-resource language corpus, so the translation results are still not accurate enough. Therefore, further improvements to the translation model are needed to enable it to return more accurate translation results for low-resource languages.
Disclosure of Invention
Embodiments of the present disclosure provide a training method of a translation model, an apparatus for multilingual translation, an electronic device, and a computer-readable storage medium.
The embodiment of the disclosure provides a training method of a translation model, which comprises the following steps: acquiring a first corpus pair sample set, wherein each first corpus pair sample in the first corpus pair sample set is a text pair consisting of a first-language text and a second-language text with the same semantics; acquiring a second corpus pair sample set, wherein each second corpus pair sample in the second corpus pair sample set is a text pair consisting of a third-language text and a second-language text with the same semantics; acquiring a third corpus pair sample set based on the first corpus pair sample set and the second corpus pair sample set, wherein each third corpus pair sample in the third corpus pair sample set is a text pair consisting of a mixed-language text and a second-language text with the same semantics, and the mixed-language text comprises one or more first-language words and one or more third-language words; and training the translation model using the third corpus pair sample set.
For example, training the translation model using the third corpus pair sample set further comprises: in a first training stage, training the translation model using the first corpus pair sample set to obtain a first trained translation model; in a second training stage, training the first trained translation model using the third corpus pair sample set to obtain a second trained translation model; and in a third training stage, training the second trained translation model using the second corpus pair sample set.
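By way of illustration only, the following is a minimal Python sketch of how the three training stages described above could be organized. The function train_one_epoch, the sample-set objects, and the number of epochs per stage are assumptions made for this example rather than part of the claimed method.

def train_three_stages(model, first_set, second_set, build_third_set,
                       train_one_epoch, epochs_per_stage=(5, 5, 5)):
    """Sketch of the three-stage schedule: pre-train on high-resource (first)
    corpus pairs, transition on mixed-language (third) corpus pairs, then
    fine-tune on low-resource (second) corpus pairs."""
    # First training stage: first-language / second-language pairs.
    for _ in range(epochs_per_stage[0]):
        train_one_epoch(model, first_set)          # yields the first trained translation model

    # Second training stage: mixed-language / second-language pairs.
    # build_third_set() may generate third corpus pair samples on the fly.
    for _ in range(epochs_per_stage[1]):
        train_one_epoch(model, build_third_set())  # yields the second trained translation model

    # Third training stage: third-language / second-language pairs.
    for _ in range(epochs_per_stage[2]):
        train_one_epoch(model, second_set)
    return model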
For example, the obtaining the third corpus pair sample set based on the first corpus pair sample set and the second corpus pair sample set further includes: extracting first-language words and second-language words with the same semantics from all or part of the first corpus pair samples in the first corpus pair sample set to form a first vocabulary; extracting third-language words and second-language words with the same semantics from all or part of the second corpus pair samples in the second corpus pair sample set to form a second vocabulary; acquiring a third vocabulary based on the first vocabulary and the second vocabulary, wherein each word pair in the third vocabulary comprises a first-language word and a third-language word with the same semantics, and the first-language word and the third-language word with the same semantics correspond to the same second-language word; and acquiring the third corpus pair sample set based on the third vocabulary.
For example, obtaining the third corpus pair sample set based on the third vocabulary further includes: for each third corpus pair sample in the third corpus pair sample set, selecting a first corpus pair sample from the first corpus pair sample set, and acquiring a plurality of first-language words from the first-language text in the first corpus pair sample; obtaining a transition probability specific to the third corpus pair sample, and determining, based on the transition probability, whether to replace each first-language word of the plurality of first-language words with a third-language word; for each first-language word to be replaced, replacing the first-language word with a third-language word based on the third vocabulary so as to obtain a mixed-language text; and combining the mixed-language text with the second-language text in the first corpus pair sample to obtain the third corpus pair sample.
For example, the obtaining a transition probability specific to the third corpus pair sample further includes: obtaining, based on the training step number in the second training stage, a transition probability specific to the third corpus pair sample, wherein the transition probability has a positive correlation with the training step number. The training the first trained translation model using the third corpus pair sample set to obtain a second trained translation model further comprises: in the second training stage, when the training step number is reached, determining a third corpus pair sample set based on the transition probability, and training the first trained translation model using the determined third corpus pair sample set to obtain a second trained translation model.
For example, the relationship between the transition probability and the number of training steps of the second training stage is: p = (1/N) · ⌈t·N/T⌉, wherein p is the transition probability, t is the current training step number, T is the total number of training steps of the second training stage, N is the preset number of segments of the second training stage, and ⌈·⌉ is the round-up (ceiling) operator.
For example, the relationship between the transition probability and the number of training steps of the second training stage may alternatively be: p = p0 + (1 - p0) · t/T, wherein p is the transition probability, t is the current training step number, p0 is the initial probability, and T is the total number of training steps of the second training stage.
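The following is a minimal Python sketch of the two transition-probability schedules given above; the function names are assumptions made for this example.

import math

def stepwise_transition_probability(t, T, N):
    # p = (1/N) * ceil(t*N/T): the probability increases by 1/N after each of N equal segments.
    return (1.0 / N) * math.ceil(t * N / T)

def linear_transition_probability(t, T, p0):
    # p = p0 + (1 - p0) * t/T: the probability ramps from p0 at step 0 to 1 at step T.
    return p0 + (1.0 - p0) * t / T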
For example, the replacing the first-language word with the third-language word further includes: replacing a plurality of consecutive first-language words in the first-language text in the first corpus pair sample with a plurality of consecutive third-language words, wherein the consecutive first-language words form a first-language phrase, the consecutive third-language words form a third-language phrase, and the first-language phrase and the third-language phrase have the same semantics.
For example, the translation model includes a first word embedding layer, a second word embedding layer, an encoder, a decoder, and an output word embedding layer, wherein the first word embedding layer is configured to convert first-language words in a first-language text and a mixed-language text into first-language word vectors and output the first-language word vectors to the encoder; the second word embedding layer is configured to convert third-language words in the mixed-language text and a third-language text into third-language word vectors and output the third-language word vectors to the encoder; the encoder is connected to the first word embedding layer and the second word embedding layer, and is configured to encode the first-language word vectors and the third-language word vectors to obtain encoded hidden vectors and output the encoded hidden vectors to the decoder; the decoder is connected to the encoder and is configured to decode the encoded hidden vectors to obtain decoded hidden vectors and output the decoded hidden vectors to the output word embedding layer; and the output word embedding layer is configured to convert the decoded hidden vectors into second-language words.
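As an illustration of this structure, the following is a minimal PyTorch-style sketch of a model with two source word embedding layers feeding a shared encoder. The use of Transformer encoder/decoder blocks, the layer sizes, and all names are assumptions made for this example and do not represent a definitive implementation of the model described above.

import torch
import torch.nn as nn

class DualEmbeddingTranslationModel(nn.Module):
    """Sketch: separate embedding tables for first-language and third-language
    tokens feed one shared encoder; a decoder and an output projection produce
    second-language words."""
    def __init__(self, first_vocab_size, third_vocab_size, second_vocab_size,
                 d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.first_embed = nn.Embedding(first_vocab_size, d_model)    # first word embedding layer
        self.third_embed = nn.Embedding(third_vocab_size, d_model)    # second word embedding layer
        self.second_embed = nn.Embedding(second_vocab_size, d_model)  # decoder input embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.output_embed = nn.Linear(d_model, second_vocab_size)     # output word embedding layer

    def forward(self, first_ids, third_ids, use_third, tgt_ids):
        # first_ids / third_ids: (batch, seq) indices into the respective tables;
        # positions that do not belong to a table carry a dummy index 0.
        # use_third: (batch, seq) boolean mask, True where the token is a third-language word.
        src = torch.where(use_third.unsqueeze(-1),
                          self.third_embed(third_ids),
                          self.first_embed(first_ids))
        memory = self.encoder(src)                                  # encoded hidden vectors
        hidden = self.decoder(self.second_embed(tgt_ids), memory)   # decoded hidden vectors (masks omitted)
        return self.output_embed(hidden)                            # scores over second-language words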
For example, the first word embedding layer is further configured to: in a first training stage, sequentially converting first language words in a first corpus pair sample into first language word vectors, and outputting the first language word vectors to the encoder; in a second training stage, sequentially converting the first language words in the third corpus pair samples into first language word vectors, and outputting the first language word vectors to the encoder; the second word embedding layer is configured to: in a second training stage, sequentially converting third language words in a third corpus pair sample into third language word vectors, and outputting the third language word vectors to the encoder; and in a third training stage, sequentially converting third language words in the second corpus pair samples into third language word vectors, and outputting the third language word vectors to the encoder.
For example, the encoder is configured to: in a first training stage, the first language word vectors are coded in sequence to obtain coded hidden vectors; in a second training stage, the first language word vector and the third language word vector are coded in sequence to obtain a coded hidden vector; and in a third training stage, the third language word vectors are sequentially encoded to obtain encoded hidden vectors.
Embodiments of the present disclosure provide an apparatus comprising a translation model, wherein the translation model comprises: the system comprises a first word embedding layer, a second word embedding layer, an encoder, a decoder and an output word embedding layer, wherein the first word embedding layer is configured to convert a first language word into a first language word vector and output the first language word vector to the encoder; the second word embedding layer is configured to convert the third language word into a third language word vector and output the third language word vector to the encoder; the encoder is connected to the first word embedding layer and the second word embedding layer, and is configured to encode a first language word vector or a third language word vector to obtain an encoded hidden vector and output the encoded hidden vector to the decoder; the decoder is connected to the encoder, and is configured to decode the encoded hidden vector to obtain a decoded hidden vector and output the decoded hidden vector to the output word embedding layer, and the output word embedding layer is configured to convert the decoded hidden vector into a second language word.
For example, the first word embedding layer is further configured to: in a first training stage, sequentially converting first language words in a first corpus pair sample into first language word vectors, and outputting the first language word vectors to the encoder; in a second training stage, sequentially converting the first language words in the third corpus pair samples into first language word vectors, and outputting the first language word vectors to the encoder; the second word embedding layer is configured to: in a second training stage, sequentially converting third language words in a third corpus pair sample into third language word vectors, and outputting the third language word vectors to the encoder; in a third training stage, sequentially converting third language words in the second corpus pair samples into third language word vectors, and outputting the third language word vectors to the encoder; the first corpus pair sample is a text pair consisting of a first language text and a second language text with the same semantics, the second corpus pair sample is a text pair consisting of a third language text and a second language text with the same semantics, and the third corpus pair sample is a text pair consisting of a mixed language text and a second language text with the same semantics, wherein the mixed language text comprises one or more first language words and one or more third language words.
The embodiment of the disclosure discloses an electronic device, comprising: one or more processors; and one or more memories, wherein the memories have stored therein a computer executable program that, when executed by the processor, performs the method described above.
Embodiments of the present disclosure provide a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable medium and executes the computer instructions to cause the computer device to perform the aspects described above or methods provided in various alternative implementations of the aspects described above.
The embodiment of the disclosure provides a training method of a translation model, which utilizes corpus data of high-resource languages to generate an extended training sample mixed with the high-resource language data and the low-resource language data, so that the translation model can learn enough information related to the low-resource languages by utilizing the extended training sample, and further the translation effect of the low-resource languages is improved.
For example, by mixing the extended training samples with the high-resource language data and the low-resource language data, the translation model can learn context information, alignment information, word information, and the like for the low-resource language from the extended training samples.
The embodiment of the disclosure also performs transition by inserting another intermediate training stage between the pre-training stage of training the translation model by using the high-resource corpus pair and the fine-tuning stage of training the pre-trained translation model by using the low-resource corpus pair, and trains the translation model in the intermediate training stage by using the extended training sample, thereby greatly improving the final translation effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. The drawings in the following description are only exemplary embodiments of the present disclosure.
FIG. 1 is an example schematic diagram illustrating a scenario in which a translation model performs reasoning in accordance with an embodiment of the present disclosure.
Fig. 2A is a flowchart illustrating a method of training a translation model according to an embodiment of the present disclosure.
Fig. 2B is a schematic diagram illustrating a first corpus-to-sample set, a second corpus-to-sample set, and a third corpus-to-sample set according to an embodiment of the present disclosure.
Fig. 2C is a schematic diagram illustrating a training method of a translation model according to an embodiment of the present disclosure.
Fig. 3A is an example flowchart illustrating the step of obtaining a third corpus pair sample set according to an embodiment of the disclosure.
Fig. 3B is an example schematic diagram illustrating the step of obtaining a third corpus pair sample set according to an embodiment of the disclosure.
Fig. 3C is a schematic diagram illustrating a relationship of training step number and transition probability according to an embodiment of the present disclosure.
Fig. 4 is a block diagram illustrating a translation model according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of an electronic device according to an embodiment of the disclosure.
Fig. 6 illustrates a schematic diagram of an architecture of an exemplary computing device, according to an embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
In the present specification and drawings, steps and elements that are substantially the same or similar are denoted by the same or similar reference numerals, and repeated descriptions of these steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first," "second," and the like are used merely to distinguish the descriptions, and are not to be construed as indicating or implying relative importance or order.
For purposes of describing the present disclosure, the following presents concepts related to the present disclosure.
The translation model of the present disclosure may be based on artificial intelligence (Artificial Intelligence, AI). Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a way similar to human intelligence. For example, the translation model of the present disclosure is capable of translating text in a plurality of different languages in a manner similar to a human reading and understanding those languages. By researching the design principles and implementation methods of various intelligent machines, artificial intelligence is given the ability to understand text in a plurality of different languages and translate it into another language.
Artificial intelligence technology relates to a wide range of technology, both hardware-level and software-level. The artificial intelligence software technology mainly comprises the directions of computer vision technology, natural language processing, machine learning/deep learning and the like.
Optionally, the translation model in the present disclosure employs natural language processing (Natural Language Processing, NLP) techniques. Natural language processing technology is an important direction in the fields of computer science and artificial intelligence, covering the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, based on natural language processing technology, the translation model of the present disclosure can analyze a plurality of different input source languages and analyze the semantics of the source languages so as to obtain better target-language output.
Optionally, the natural language processing techniques employed by embodiments of the present disclosure may also be based on machine learning (Machine Learning, ML) and deep learning. Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. Natural language processing techniques use machine learning to study how a computer simulates or implements human language-learning behavior, analyzing existing, categorized text data to acquire new knowledge or skills and reorganizing existing knowledge structures to continuously improve its performance. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
Alternatively, the translation models that may be used in embodiments of the present disclosure below may be artificial intelligence models, particularly neural network models based on artificial intelligence. Typically, artificial intelligence based neural network models are implemented as loop-free graphs, in which neurons are arranged in different layers. Typically, the neural network model includes an input layer and an output layer, which are separated by at least one hidden layer. The hidden layer transforms the input received by the input layer into a representation useful for generating an output in the output layer. The network nodes are fully connected to nodes in adjacent layers via edges, and there are no edges between nodes within each layer. Data received at a node of an input layer of the neural network is propagated to a node of an output layer via any one of a hidden layer, an active layer, a pooling layer, a convolutional layer, and the like. The input and output of the neural network model may take various forms, which is not limited by the present disclosure.
Embodiments of the present disclosure provide solutions related to techniques such as artificial intelligence, natural language processing, and machine learning, and are specifically described by the following embodiments.
The translation model of the embodiments of the present disclosure may be integrated in an electronic device, which may be a terminal or a server, or the like. For example, the translation model may be integrated in the terminal. The terminal may be, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal computer (PC, Personal Computer), a smart speaker, a smart watch, or the like. For another example, the translation model may be integrated in a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the disclosure is not limited herein.
It is understood that the device that applies the translation model of the embodiments of the present disclosure to make reasoning may be either a terminal, a server, or a system composed of a terminal and a server.
It will be appreciated that the method of training the translation model of the embodiments of the present disclosure may be performed on the terminal, may be performed on the server, or may be performed by both the terminal and the server.
The translation model provided by the embodiments of the present disclosure may also relate to artificial intelligence cloud services in the field of cloud technology. Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network to realize the calculation, storage, processing and sharing of data. Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology and the like based on the cloud computing business model; it can form a resource pool, which is flexible and convenient and used on demand. Cloud computing technology will become an important support. Background services of technical network systems require a large amount of computing and storage resources, such as video websites, picture websites and other portal websites. With the rapid development and application of the internet industry, each article may have its own identification mark in the future, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong system backing support, which can only be achieved through cloud computing.
Among them, artificial intelligence cloud services are also generally referred to as AIaaS (AI as a Service). This is currently the mainstream service mode of artificial intelligence platforms; specifically, an AIaaS platform splits several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed mall: all developers can access one or more of the artificial intelligence services provided by the platform through application programming interfaces (API, Application Programming Interface), and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services.
FIG. 1 is an example schematic diagram illustrating a scenario 100 of reasoning by a translation model according to an embodiment of the present disclosure.
Currently, there are a variety of translation applications. The user may enter the source language text (e.g., the first language text and the third language text in fig. 1) to be translated in a translation application installed in his user terminal. The user terminal may then transmit a translation request over the network to a server of the application for translating the source language text to be translated.
After receiving the source language text to be translated, the server translates the source language text by using the translation model to obtain a target language text, and then feeds back the target language text (for example, the second language text translated from the first language text or the second language text translated from the third language text in fig. 1) to the user.
For example, the first-language text may be Chinese text and the second-language text may be English text. At present, a large amount of Chinese-English inter-translated text exists on the network, so the translation model on the server can provide good English translation results for Chinese. In this case, Chinese and English are also called high-resource languages.
For example, the third-language text may be Gujarati text. At present, there is not enough Gujarati-English inter-translated text on the network, so the translation model on the server cannot yet provide good English translation results for Gujarati. In this case, Gujarati is also called a low-resource language.
If the user confirms that the translation of the target language text is correct and feeds back the confirmation to the server, the server may take the source language text-target language text pair as a corpus pair sample for training the translation model in real time.
If the user believes that the translation of the target language text is incorrect and provides the correct target language text to the server, the server may use the source language text-corrected target language text pair as a corpus pair sample for training the translation model in real-time.
Of course, the server may also use other means to obtain corpus-versus-sample for training the translation model. For example, the server may capture corpus pairs of two languages that already exist in the current internet environment for inter-translation, and then use such corpus pairs to train the translation model.
For example, referring to FIG. 1, a server may obtain corpus pairs for multi-lingual intercoranslation from a database and then use for training of translation models.
The server may obtain a first corpus-pair sample set from the database, where each first corpus-pair sample in the first corpus-pair sample set is a text pair composed of a first language text and a second language text having the same semantics.
The server may further obtain a second corpus pair sample set from the database, wherein each second corpus pair sample in the second corpus pair sample set is a text pair composed of a third language text and a second language text having the same semantics.
Optionally, the number of samples of the first corpus-pair sample set is greater than the number of samples of the second corpus-pair sample set, where the first corpus-pair is also referred to as a high-resource corpus pair and the second corpus-pair is also referred to as a low-resource corpus pair.
A conventional translation model may employ the following two methods to train with a first corpus-to-sample set and a second corpus-to-sample set.
Method one: researchers such as Zoph B., Yuret D. and May J. proposed, in "Transfer Learning for Low-Resource Neural Machine Translation", pre-training a translation model using the first corpus pair sample set and then fine-tuning the pre-trained translation model using the second corpus pair sample set. In this method, a parent model is pre-trained using the first corpus pair sample set, and a child model is then obtained by training the parent model using the second corpus pair sample set, which greatly improves the translation effect for low-resource languages.
However, this method does not build a relation or mapping between the first language and the third language, so it cannot improve the inter-translation effect between the first language and the third language. Moreover, the learning of the third language by the translation model is based only on the second corpus pair sample set. Since the second corpus pair sample set has a small number of samples, the translation model cannot extract enough information from it to encode the third language, resulting in an insufficient translation effect for the third language.
Method two: researchers such as Kocmi T. and Bojar O. proposed, in "Trivial Transfer Learning for Low-Resource Neural Machine Translation", extracting a shared vocabulary (vocab) during pre-training of the translation model to improve the translation effect. First, the first corpus pair sample set and the second corpus pair sample set are combined into a combined corpus pair sample set. Then, a number of the most frequently occurring inter-translated word pairs are extracted from the combined corpus pair sample set to form a shared vocabulary. These inter-translated word pairs include first-language word / second-language word pairs having the same semantics and third-language word / second-language word pairs having the same semantics. The shared vocabulary is then used to train the translation model.
However, the shared vocabulary obtained by method two is limited and problematic. Because the number of samples of the first corpus pair sample set is much greater than the number of samples of the second corpus pair sample set, the number of third-language word / second-language word pairs will be much smaller than the number of first-language word / second-language word pairs. Moreover, the shared vocabulary space between different languages, especially between different language families, is limited. In this case, only a few third-language word / second-language word pairs have the same or similar semantics as first-language word / second-language word pairs. Therefore, method two cannot significantly improve the inter-translation effect between the first language and the third language.
The present disclosure provides a training method for a translation model, which uses corpus data of high-resource languages to generate an extended training sample mixed with the high-resource language data and the low-resource language data, so that the translation model can learn enough information related to the low-resource languages by using the extended training sample, and further improve the translation effect on the low-resource languages.
For example, by mixing the extended training samples with the high-resource language data and the low-resource language data, the translation model can learn context information, alignment information, word sense information, and the like for the low-resource language from the extended training samples.
The embodiment of the disclosure also performs transition by inserting another intermediate training stage between the pre-training stage of training the translation model by using the high-resource corpus pair and the fine-tuning stage of training the pre-trained translation model by using the low-resource corpus pair, and trains the translation model in the intermediate training stage by using the extended training sample, thereby greatly improving the final translation effect.
Fig. 2A is a flowchart illustrating a method 200 of training a translation model according to an embodiment of the present disclosure. Fig. 2B is a schematic diagram illustrating a first corpus-to-sample set, a second corpus-to-sample set, and a third corpus-to-sample set according to an embodiment of the present disclosure. Fig. 2C is a schematic diagram illustrating a training method 200 of a translation model according to an embodiment of the present disclosure.
Translation models according to embodiments of the present disclosure may be convolutional neural networks (CNN, Convolutional Neural Network), recurrent neural networks (RNN, Recurrent Neural Network), long short-term memory networks (LSTM, Long Short-Term Memory), bi-directional recurrent neural networks (BiRNN, Bidirectional Recurrent Neural Network), and the like. It should be noted that the above examples should not be construed as limiting the type of the translation model.
Referring to fig. 2A, in step S201, a first corpus-pair sample set is obtained, wherein each first corpus-pair sample in the first corpus-pair sample set is a text pair composed of a first-language text and a second-language text having the same semantics.
Optionally, referring to fig. 2B, the first language may be Chinese, the second language may be English, and the third language may be Gujarati.
The first corpus pair sample set includes a plurality of first corpus pair samples. For example, a first corpus pair sample may be the Chinese sentence for "I love you" paired with the English sentence "I love you", where the Chinese sentence is the first-language text and "I love you" is the second-language text, and the two texts have the same semantics. That is, the first corpus pair sample is an aligned Chinese-English corpus pair. The present disclosure does not limit the manner of obtaining the first corpus pair sample set; for example, as described with reference to fig. 1, the first corpus pair sample set may be obtained through interaction between the user terminal and the server, may be obtained by the server crawling the network, or may be obtained by the server directly from a database.
In step S202, a second corpus-pair sample set is obtained, where each second corpus-pair sample in the second corpus-pair sample set is a text pair composed of a third language text and a second language text having the same semantics.
The second corpus pair sample set includes a plurality of second corpus pair samples. Optionally, the number of samples in the second corpus pair sample set is less, or substantially less, than the number of samples in the first corpus pair sample set; in this case, the third language is a low-resource language. Of course, those skilled in the art will appreciate that the number of samples in the second corpus pair sample set may also be greater than or equal to the number of samples in the first corpus pair sample set, which is not limited by the present disclosure.
For example, a second corpus pair sample may be a Gujarati sentence meaning "I love you" paired with the English sentence "I love you", where the Gujarati sentence is the third-language text and "I love you" is the second-language text, and the two texts have the same semantics. That is, the second corpus pair sample is an aligned Gujarati-English corpus pair. The present disclosure does not limit the manner of obtaining the second corpus pair sample set; for example, as described with reference to fig. 1, the second corpus pair sample set may be obtained through interaction between the user terminal and the server, may be obtained by the server crawling the network, or may be obtained by the server directly from a database.
In step S203, a third corpus pair sample set is obtained based on the first corpus pair sample set and the second corpus pair sample set, where each third corpus pair sample in the third corpus pair sample set is a text pair composed of a mixed-language text and a second-language text with the same semantics, and the mixed-language text includes one or more first-language words and one or more third-language words.
The third corpus pair sample set includes a plurality of third corpus pair samples. For example, a third corpus pair sample may be "I [Gujarati word for "love"] you" - "I love you", where "I [Gujarati word] you" is the mixed-language text and "I love you" is the second-language text, and the mixed-language text and the second-language text have the same semantics. In the mixed-language text, "I" and "you" are first-language words, and the Gujarati word is a third-language word that has the same semantics as the first-language word "love".
The third corpus pair sample set is obtained based on the first corpus pair sample set and the second corpus pair sample set. For example, assume that the first corpus pair sample "I love you" - "I love you" exists in the first corpus pair sample set, and that the second corpus pair samples [Gujarati sentence] - "I love you" and [Gujarati sentence] - "I love book" exist in the second corpus pair sample set. In this case, the server may determine that a certain Gujarati fragment has the same semantics as "I love", and then replace "I love" in "I love you" with that Gujarati fragment. For another example, a semantic vector of each first-language word in each first corpus pair sample in the first corpus pair sample set may be extracted, and a semantic vector of each third-language word in each second corpus pair sample in the second corpus pair sample set may be extracted; if the distance between the semantic vector of a first-language word and the semantic vector of a third-language word is less than a predetermined threshold, the first-language word in the first corpus pair sample may be replaced with the third-language word to form a corpus pair of mixed-language text and second-language text, or the third-language word in the second corpus pair sample may be replaced with the first-language word to form a corpus pair of mixed-language text and second-language text.
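The semantic-vector variant just described can be sketched as follows. This is a minimal illustration assuming word vectors are already available (for example, from pretrained embeddings) and using cosine distance; the function names, the candidate dictionary and the threshold value are assumptions made for this example.

import numpy as np

def maybe_replace(first_word, first_vec, third_candidates, threshold=0.2):
    """Replace first_word with a third-language word whose semantic vector is
    within `threshold` cosine distance of first_word's vector, if one exists.
    `third_candidates` maps third-language words to their vectors (assumed given)."""
    def cosine_distance(a, b):
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    best_word, best_dist = first_word, threshold
    for third_word, third_vec in third_candidates.items():
        d = cosine_distance(first_vec, third_vec)
        if d < best_dist:
            best_word, best_dist = third_word, d
    return best_word   # the original word is kept if no candidate is close enough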
It will be appreciated by those skilled in the art that the present disclosure is not limited to how to obtain the third corpus based on the first corpus-to-sample set and the second corpus-to-sample set, so long as the mixed-language text and the second-language text have the same semantics. An example of how to obtain the third corpus-pair sample set based on the first corpus-pair sample set and the second corpus-pair sample set will be described later in fig. 3, and will not be described here again.
In step S204, the translation model is trained using the third corpus pair sample set.
Optionally, training the translation model includes three training stages, and the server may train the translation model with the third corpus pair sample set in the intermediate stage.
Referring to fig. 2C, in a first training stage, the translation model is trained using the first corpus pair sample set to obtain a first trained translation model.
For example, training the translation model using the first corpus pair sample set may include: inputting the first-language text in each first corpus pair sample in the first corpus pair sample set into the translation model to obtain translated second-language text; then calculating the difference/loss between the translated second-language text and the second-language text in the first corpus pair sample; and adjusting parameters in the translation model to minimize the difference/loss.
For example, for each first corpus pair sample in the first corpus pair sample set (where each first corpus pair sample is a text pair composed of a first language text and a second language text having the same semantics), the translation model may first segment the first language text into a plurality of first language words. The translation model then converts the plurality of first-language words into first-language word vectors by way of word embedding (word embedding) using its first word embedding (Embedding) layer, respectively. The translation model may then concatenate the word vectors as a numerical vector corresponding to the first language text. Similarly, the translation model may segment the second-language text into a plurality of second-language terms. And then the translation model respectively converts the plurality of words in the second language into the vectors of the words in the second language by utilizing another embedding layer of the translation model. The translation model may then concatenate the word vectors as a numerical vector corresponding to the second language text.
Then, the translation model can calculate the corresponding numerical vector of the first language text by utilizing a plurality of built-in hidden layers to obtain the translated corresponding numerical vector of the first language text.
Then, the translation model can also calculate the distance between the numerical vector corresponding to the translated first-language text and the numerical vector corresponding to the second-language text, and repeatedly adjust the parameters of the embedding layers and the hidden layers so as to minimize this distance. Over the first corpus pair sample set, the parameters in the translation model are adjusted repeatedly until the average distance between the numerical vectors corresponding to the translated first-language texts and the numerical vectors corresponding to the second-language texts converges. At this point, the first trained translation model is obtained.
It will be appreciated by those skilled in the art that the training process of the first training stage described above is only one example, and the present disclosure is not limited to how the first corpus is utilized to train the translation model to obtain a first trained translation model.
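As one such example (and only one of many possible implementations), the first training stage could be organized as in the following sketch, assuming a PyTorch model that maps the numerical vector of a first-language text to a translated numerical vector, an encode_text helper that produces the numerical vector of a text, and mean squared distance as the loss; all of these names and choices are assumptions made for this example.

import torch

def first_stage_training(model, first_pairs, encode_text, epochs=1, lr=1e-4):
    # Sketch of the first training stage: minimise the distance between the
    # vector produced by the model for the first-language text and the vector
    # of the paired second-language text.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for first_text, second_text in first_pairs:
            translated_vec = model(encode_text(first_text))      # translated numerical vector
            target_vec = encode_text(second_text).detach()       # second-language numerical vector
            loss = torch.nn.functional.mse_loss(translated_vec, target_vec)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model                                                  # first trained translation model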
In a second training stage, the first trained translation model is trained using the third corpus pair sample set to obtain a second trained translation model.
For example, training the first trained translation model using the third corpus pair sample set may include: inputting the mixed-language text in each third corpus pair sample in the third corpus pair sample set into the translation model to obtain translated second-language text; then calculating the difference/loss between the translated second-language text and the second-language text in the third corpus pair sample; and adjusting parameters in the translation model to minimize the difference/loss.
For example, for each third corpus pair sample in the third corpus pair sample set (where each third corpus pair sample is a text pair composed of mixed-language text and second-language text having the same semantics), the translation model may first segment the mixed-language text into a plurality of first-language words and a plurality of third-language words. And then the translation model respectively converts a plurality of words in the first language into word vectors in the first language by utilizing a first word embedding layer of the translation model. The translation model utilizes the second word embedding layer to respectively convert a plurality of third-language words into third-language word vectors in a word embedding mode. The translation model may then stitch the word vectors together as numerical vectors corresponding to the mixed-language text. Similarly, the translation model may segment the second-language text into a plurality of second-language terms. And then the translation model respectively converts the plurality of words in the second language into the vectors of the words in the second language by utilizing another embedding layer of the translation model. The translation model may then concatenate the word vectors as a numerical vector corresponding to the second language text.
Then, the translation model can calculate the corresponding numerical vector of the mixed-language text by utilizing a plurality of built-in hidden layers, and the translated corresponding numerical vector of the mixed-language text is obtained.
Then, the translation model can also calculate the distance between the numerical vector corresponding to the translated mixed-language text and the numerical vector corresponding to the second-language text, and repeatedly adjust the parameters of the embedding layers and the hidden layers so as to minimize this distance. Over the third corpus pair sample set, the parameters in the translation model are adjusted repeatedly until the average distance between the numerical vectors corresponding to the translated mixed-language texts and the numerical vectors corresponding to the second-language texts converges. At this point, the second trained translation model is obtained.
It will be appreciated by those skilled in the art that the training process of the second training stage described above is only one example, and the present disclosure is not limited as to how the first trained translation model is trained using the third corpus pair sample set to obtain the second trained translation model.
In the second training stage, the mixed-language text comprises the first-language word and the third-language word, so that in the training process, the translation model can assist in understanding the third-language word through the first-language word, and the translation model can effectively learn semantic information, context information and the like of the third-language word.
In a third training stage, the second trained translation model is trained using the second corpus pair sample set.
For example, training the second trained translation model using the second corpus pair sample set may include: inputting the third-language text in each second corpus pair sample in the second corpus pair sample set into the translation model to obtain translated second-language text; then calculating the difference/loss between the translated second-language text and the second-language text in the second corpus pair sample; and adjusting parameters in the translation model to minimize the difference/loss. The training method is similar to that of the first training stage and the second training stage, and will not be described again here.
Optionally, the translation model according to the embodiments of the present disclosure may have an encoder-decoder neural network structure; an example training method for a translation model with an encoder-decoder neural network structure will be described later with reference to fig. 4 and will not be repeated here.
Therefore, the training method 200 of the translation model provided in the embodiments of the present disclosure uses the corpus data of the high-resource language to generate extended training samples (i.e., the third corpus pair sample set) that mix high-resource language data and low-resource language data, so that the translation model can learn enough information related to the low-resource language from the extended training samples, thereby improving the translation effect for the low-resource language.
For example, with the third corpus pair sample set, the translation model can learn context information, alignment information, word sense information, and the like for the low-resource language from the extended training samples.
The embodiment of the present disclosure also performs transition by inserting another intermediate training stage (i.e., the second training stage described above) between the pre-training stage (i.e., the first training stage described above) that trains the translation model using the high-resource corpus pair and the fine-tuning stage (i.e., the third training stage described above) that trains the pre-training translation model using the low-resource corpus pair, and training the translation model in the intermediate training stage using the extended training sample described above, greatly improving the final translation effect.
According to the embodiment of the disclosure, through the second training stage, the inter-translation information between the first language and the third language can be effectively learned, and enough information can be extracted in the stage to encode the third language, so that the final translation effect is greatly improved.
Fig. 3A is an example flowchart illustrating step S203 of acquiring a third corpus-to-sample set according to an embodiment of the present disclosure. Fig. 3B is an example schematic diagram illustrating step S203 of acquiring a third corpus-to-sample set according to an embodiment of the present disclosure. Fig. 3C is a schematic diagram illustrating a relationship of training step number and transition probability according to an embodiment of the present disclosure.
As shown in fig. 3A, step S203 of acquiring a third corpus-pair sample set according to an embodiment of the present disclosure further includes the following steps S301 to S304.
In step S301, a first language word and a second language word having the same semantic meaning are extracted from all or part of the first corpus pair samples in the first corpus pair sample set to form a first vocabulary.
Referring to fig. 3B, by performing word segmentation and alignment on the first-language texts and second-language texts of all or part of the first corpus pair samples in the first corpus pair sample set, a plurality of first-language words and second-language words having the same semantics can be extracted. These first-language word / second-language word pairs may form the first vocabulary. For example, the first vocabulary may include pairs of first-language and second-language words as shown in fig. 3B: "I" - "I", "true" - "ver", "love" - "love", and so on.
Optionally, the process of extracting the first vocabulary may be unsupervised or supervised. For example, in an unsupervised process, the fast_align tool may be used to extract the first vocabulary. The present disclosure does not limit the manner in which the first vocabulary is extracted from the first corpus pair sample set without supervision.
For example, in a supervised process, embodiments of the present disclosure may also use an existing dictionary of the first and second languages available on the network to construct the first vocabulary. For example, constructing the first vocabulary may further include: performing word segmentation on the first-language texts of part of the first corpus pair samples; counting the first-language words with the highest occurrence frequencies; looking up the dictionary to determine the second-language words having the same semantics as these first-language words; and forming the first vocabulary from the resulting first-language word / second-language word pairs. The present disclosure does not limit the manner in which the first vocabulary is extracted from the first corpus pair sample set in a supervised way.
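A minimal sketch of this supervised construction is given below; the segment function, the bilingual dictionary mapping and the cut-off top_k are assumptions made for this example.

from collections import Counter

def build_first_vocab(first_corpus_pairs, segment, dictionary, top_k=10000):
    """Sketch of the supervised construction of the first vocabulary: segment
    the first-language texts, keep the most frequent words, and look up their
    second-language translations in an existing bilingual dictionary."""
    counter = Counter()
    for first_text, _second_text in first_corpus_pairs:
        counter.update(segment(first_text))
    first_vocab = {}
    for word, _freq in counter.most_common(top_k):
        if word in dictionary:              # dictionary: first-language word -> second-language word
            first_vocab[word] = dictionary[word]
    return first_vocab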
In step S302, third language words and second language words having the same semantics are extracted from all or part of the second corpus pair samples in the second corpus pair sample set to form a second vocabulary.
Similar to step S301, by segmenting and aligning the third language text and the second language text of all or part of the second corpus pair samples in the second corpus pair sample set, a plurality of third language words and second language words having the same semantics can be extracted. These third-language word and second-language word pairs may form the second vocabulary. For example, the second vocabulary may include the third language word and second language word pairs shown in fig. 3B, such as a third language word paired with "love". Similarly, the second vocabulary may also be extracted in a supervised or unsupervised manner, which is not limited by the present disclosure.
In step S303, a third vocabulary is obtained based on the first vocabulary and the second vocabulary, where each word pair in the third vocabulary includes a first language word and a third language word having the same semantic meaning, and the first language word and the third language word having the same semantic meaning correspond to the same second language word.
The second language word is used as a pivot to obtain the third vocabulary. For example, if the server finds that the first language word "I" and a certain third language word (shown in fig. 3B) both correspond to the same second language word "I", it can be determined that this first language word and this third language word have the same semantics, so that the pair consisting of the first language word "I" and that third language word is taken as a word pair in the third vocabulary.
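The pivoting step can be illustrated with a short Python sketch; the dictionary-style data structures, function name, and placeholder words are assumptions made only for illustration.

```python
def build_third_vocab(first_vocab, second_vocab):
    """Pivot through the shared second-language word.

    first_vocab:  {first_language_word: second_language_word}
    second_vocab: {third_language_word: second_language_word}
    Returns {first_language_word: third_language_word} for words whose
    second-language translations coincide.
    """
    # Invert the second vocabulary: second-language word -> third-language word.
    pivot = {}
    for third_word, second_word in second_vocab.items():
        pivot.setdefault(second_word, third_word)

    third_vocab = {}
    for first_word, second_word in first_vocab.items():
        if second_word in pivot:
            third_vocab[first_word] = pivot[second_word]
    return third_vocab

# Illustrative usage with placeholder third-language words.
first_vocab = {"I": "I", "love": "love", "true": "very"}
second_vocab = {"<third word A>": "I", "<third word B>": "love"}
third_vocab = build_third_vocab(first_vocab, second_vocab)
# {'I': '<third word A>', 'love': '<third word B>'}
```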
In step S304, the third corpus pair sample set is obtained based on the third vocabulary.
With continued reference to fig. 3B, step S304 may further include the following steps: for each third corpus pair sample in the third corpus pair sample set, selecting a first corpus pair sample from the first corpus pair sample set, and acquiring a plurality of first language words from the first language text in the first corpus pair sample; obtaining a transition probability specific to the third corpus pair sample, and determining, based on the transition probability, whether to replace each first language word of the plurality of first language words with a third language word; for each first language word to be replaced, replacing the first language word with a third language word based on the third vocabulary so as to obtain a mixed-language text; and combining the mixed-language text with the second language text in the first corpus pair sample to obtain the third corpus pair sample.
For example, the process of obtaining the third corpus pair sample set may take place during the second training stage: whenever a third corpus pair sample is obtained, it is directly input to the translation model for training. Thus, the transition probability may be associated with the number of training steps, and as the number of training steps increases, the transition probability increases.
Further, obtaining the transition probability specific to the third corpus pair sample further includes: obtaining, based on the number of training steps in the second training stage, a transition probability specific to the third corpus pair sample, wherein the transition probability is positively correlated with the number of training steps.
For example, referring to fig. 3C, the relationship between the transition probability and the number of training steps is p = ⌈tN/T⌉/N, wherein p is the transition probability, t is the current number of training steps, T is the total number of training steps of the second training stage, N is the preset number of segments of the second training stage, and ⌈·⌉ is the round-up operator. For example, in fig. 3C, N=4 and T=20 (ten thousand steps), so the transition probability increases by 0.25 every 5 ten thousand steps. In this way, the transition probability can be calculated easily, and the training speed can be improved.
For another example, the relationship between the transition probability and the number of training steps may take a form such as p = p0 + (1 − p0)·t/T, wherein p is the transition probability, t is the current number of training steps, p0 is the initial probability, and T is the total number of training steps of the second training stage. In this way, the transition probability can better adapt to the number of training steps, although this may reduce the training speed.
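Both schedules can be sketched as follows. The stepwise formula mirrors the example of fig. 3C; the linear variant is only one plausible reading of the second schedule, and the default initial probability p0 used below is an assumption.

```python
import math

def stepwise_transition_probability(t, total_steps, num_segments):
    """Stepwise schedule: p = ceil(t * N / T) / N.

    With N = 4 and T = 200000 this yields 0.25, 0.5, 0.75, 1.0 for successive
    quarters of the training run, matching the example in the text.
    """
    return math.ceil(t * num_segments / total_steps) / num_segments

def linear_transition_probability(t, total_steps, p0=0.1):
    """Assumed smooth alternative: interpolate from the initial probability p0 to 1."""
    return p0 + (1.0 - p0) * t / total_steps

# At 70,000 of 200,000 steps with 4 segments, the stepwise probability is 0.5.
assert stepwise_transition_probability(70000, 200000, 4) == 0.5
```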
For example, as shown in fig. 3B, assume that the current number of training steps is 7 ten thousand, and that the first corpus pair sample "blue true beauty"-"blue is so beautiful" is selected at the current training step for forming a third corpus pair sample. Referring to fig. 3C, at the current number of training steps, the transition probability specific to this third corpus pair sample is calculated to be 0.5.
At this time, a random number generator may be used to determine, based on the transition probability, whether to replace each of the plurality of first language words with a third language word. The random number generator generates random numbers distributed between 0 and 1, and the average value of the random numbers it generates is 0.5. For the three words "blue", "true", and "beauty" in the first language text "blue true beauty", the random number generator generates three random values, respectively.
Assume that these three random values are 0.8, 0.4, and 0.1, respectively. The server will mark a first language word for conversion into a third language word if its random number is greater than 0.5, thereby obtaining the table shown in fig. 3B: the first language word "blue" is to be converted into its corresponding third language word, while the first language words "true" and "beauty" are not converted. The third corpus pair sample can thus be determined as the mixed-language text, in which "blue" is replaced by its third language equivalent, paired with the second language text "blue is so beautiful". The third corpus pair sample is then input into the translation model for training.
Of course, since the number of word pairs in the third vocabulary is often smaller than the number of word pairs in the first vocabulary, a corresponding third language word may not be found for some first language words. For example, for the first language word "beauty", the corresponding third language word may not be found in the third vocabulary. In this case, as shown in fig. 3B, the value corresponding to the third language word is "NULL" (or any identifier used to identify a null value). In this case, even if it is determined based on the transition probability that the first language word "beauty" is to be replaced with a third language word, the server does not perform the replacement.
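The word-level replacement described above can be sketched in Python as follows. The sketch replaces each word with probability equal to the transition probability and keeps words whose third-language entry is missing (the NULL case); the comparison direction, the function name, and the placeholder vocabulary are illustrative simplifications of the example above.

```python
import random

def make_mixed_text(first_language_words, third_vocab, transition_prob, rng=random):
    """Replace each first-language word with its third-language counterpart with
    probability `transition_prob`; words without an entry (NULL) are always kept."""
    mixed = []
    for word in first_language_words:
        third_word = third_vocab.get(word)          # None plays the role of NULL
        if third_word is not None and rng.random() < transition_prob:
            mixed.append(third_word)
        else:
            mixed.append(word)
    return mixed

# Illustrative usage: only "blue" has a third-language counterpart here.
third_vocab = {"blue": "<third-language word for blue>"}
mixed_text = make_mixed_text(["blue", "true", "beauty"], third_vocab, 0.5)
```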
In the second training stage, the first trained translation model may be trained with the third corpus pair samples determined, based on the transition probability, at each training step, so as to obtain the second trained translation model. Further, the granularity of the above replacement process is not limited to the word level; phrase-level or sentence-level replacement is also possible.
For example, for phrase-level replacement, replacing all or part of the first language words in the first language text of the first corpus pair sample with third language words further includes: replacing a plurality of consecutive first language words in the first language text of the first corpus pair sample with a plurality of consecutive third language words, wherein the consecutive first language words form a first language phrase, the consecutive third language words form a third language phrase, and the first language phrase and the third language phrase have the same semantics.
Corresponding to phrase-level replacement, the first vocabulary, the second vocabulary, and the third vocabulary may also be phrase-level vocabularies. For example, the first vocabulary may include first-language-phrase and second-language-phrase pairs such as "at school"-"at the school". The manner of extracting the first, second, and third vocabularies at phrase granularity is similar to that of extracting them at word granularity, and is not repeated here.
Of course, phrase-level replacement may also be performed using the first, second, and third vocabularies at word granularity. For example, the random number generator described above may be made to output the same random number for all of the consecutive first language words. For the first corpus pair sample "I study at school"-"I study at the school", the random number generator may output the same random number for both the words "at" and "school", thereby ensuring that the first language words "at" and "school" are either replaced together or kept together, as shown in the sketch below.
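A Python sketch of this phrase-level variant, in which every word inside a marked span shares a single random draw, might look as follows; the span representation and function name are assumptions made for illustration.

```python
import random

def make_mixed_text_with_phrases(words, phrase_spans, third_vocab, transition_prob, rng=random):
    """Phrase-level variant: all words inside a marked span share one random draw,
    so a phrase such as ("at", "school") is replaced entirely or kept entirely.

    `phrase_spans` is a list of (start, end) index pairs (end exclusive) covering the phrases.
    """
    draws = [rng.random() for _ in words]
    for start, end in phrase_spans:
        shared = draws[start]
        for i in range(start, end):
            draws[i] = shared                      # one decision for the whole span
    mixed = []
    for word, draw in zip(words, draws):
        third_word = third_vocab.get(word)         # words without an entry are kept
        mixed.append(third_word if third_word is not None and draw < transition_prob else word)
    return mixed

# Illustrative usage: the span (2, 4) covers the words "at" and "school".
mixed = make_mixed_text_with_phrases(
    ["I", "study", "at", "school"], [(2, 4)],
    {"at": "<third word>", "school": "<third word>"}, 0.5)
```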
The present disclosure does not limit the manner of replacing all or part of the first language words in the first language text of the first corpus pair sample with third language words, as long as the third corpus pair sample can be obtained.
Thus, embodiments of the present disclosure greatly improve the final translation effect by inserting an intermediate training stage between the pre-training stage, in which the translation model is trained using high-resource corpus pairs, and the fine-tuning stage, in which the pre-trained translation model is trained using low-resource corpus pairs, and by training the translation model in the intermediate stage with the extended training samples described above.
According to the embodiment of the disclosure, the transition probability is gradually increased during the intermediate stage, so that the inter-translation information between the first language and the third language can be effectively learned, and enough information can be extracted in this stage to encode the third language, thereby greatly improving the final translation effect.
Fig. 4 is a block diagram illustrating a translation model according to an embodiment of the present disclosure.
As shown in fig. 4, the translation model includes a first word embedding layer, a second word embedding layer, an encoder, a decoder, and an output word embedding layer.
The first word embedding layer is configured to convert a first language word in a first language text into a first language word vector and output the first language word vector to the encoder.
Optionally, the first word embedding layer is further configured to: in the first training stage, the first language words in the first corpus pair samples are sequentially converted into first language word vectors, and the first language word vectors are output to the encoder. For example, suppose the first corpus vs. the first language text in the sample is "I love you". The first language text may be segmented into "me", "love", and "you". "me", "love", and "you" will be converted to first-language word vectors in numerical form by the first word embedding layer, respectively. These first language word vectors are concatenated to form a first language text corresponding numerical vector.
Optionally, the first word embedding layer is further configured to: in the second training stage, the first language words in the third corpus pair samples are sequentially converted into first language word vectors, and the first language word vectors are output to the encoder. For example, assume that the third corpus pairs have mixed-language text in the sample as "Love you). The mixed-language text may be segmented into words"Love", and "you". "love" and "you" will be converted to first-language word vectors in numerical form by the first word embedding layer, respectively.
The second word embedding layer is configured to convert third-language words in the mixed-language text or the third-language text into third-language word vectors, and output the third-language word vectors to the encoder.
Optionally, the second word embedding layer is further configured to: and in the second training stage, sequentially converting the third language words in the third corpus pair samples into third language word vectors, and outputting the third language word vectors to the encoder. For example, assume that the third corpus pairs have mixed-language text in the sample as "Love you). The mixed-language text may be segmented into words"Love", and "you". /(I)And converting the third language word vector into a numerical form through the second word embedding layer.
Optionally, the second word embedding layer is further configured to: and in a third training stage, sequentially converting third language words in the second corpus pair samples into third language word vectors, and outputting the third language word vectors to the encoder. For example, assume that the second corpus is specific to the third language text in the sampleMixed-language text may be segmented into/>And/>These segmentations will be converted into third language word vectors in numerical form by the second word embedding layer, respectively.
In the second training stage, the first language word vectors and the third language word vectors in numerical form output by the first word embedding layer and the second word embedding layer are concatenated to form the numerical vector corresponding to the mixed-language text.
The encoder is connected to the first word embedding layer and the second word embedding layer. The encoder is configured to encode the first language word vectors or the third language word vectors to obtain encoded hidden vectors and output the encoded hidden vectors to the decoder.
In some embodiments, the encoder may be implemented as an encoding network. Exemplary encoding networks include Long Short Term Memory (LSTM) networks. It will be appreciated that the encoding network may also be implemented as any machine learning model capable of encoding word vectors.
For example, with the first language word vectors corresponding to "I", "love", and "you" as inputs, the encoder may output encoded hidden vectors corresponding to the respective first language word vectors. The number of encoded hidden vectors and the number of first language word vectors may be the same or different.
For example, the encoder is further configured to: in the first training stage, sequentially encode the first language word vectors to obtain encoded hidden vectors; in the second training stage, sequentially encode the first language word vectors and the third language word vectors to obtain encoded hidden vectors; and in the third training stage, sequentially encode the third language word vectors to obtain encoded hidden vectors.
The decoder is connected to the encoder. The decoder is configured to decode the encoded hidden vectors to obtain decoded hidden vectors and output the decoded hidden vectors to the output word embedding layer.
For example, the decoded hidden vectors may be concatenated in sequence to serve as the numerical vector corresponding to the translation of the first language text.
In some embodiments, the decoder may be implemented as a decoding network. An exemplary decoding network includes a long and short term memory network. It will be appreciated that the decoding network may also be implemented as any machine learning model capable of decoding the output of the encoding network.
In some embodiments, the encoding and decoding networks described above may be combined into a sequence-to-sequence model (Sequence to Sequence, Seq2Seq), which converts one input sequence (such as a first language text) into another output sequence (such as the corresponding second language text "I love you"), i.e., translates the first language text into the second language text.
The output word embedding layer is configured to convert the decoded hidden vector into a second language word.
For example, the output word embedding layer may sequentially convert each decoded hidden vector into a second language word and concatenate the second language words to form text translated into the second language.
Optionally, the output word embedding layer may also correspondingly convert the second language word into a second language word vector. In each of the training stages described above, the output word embedding layer may be configured to convert a plurality of second-language words in the second-language text into a second-language word vector. The translation model may then concatenate the word vectors as a numerical vector corresponding to the second language text.
For example, in the first training stage, the output word embedding layer may be configured to obtain the numerical vector corresponding to the second language text in the first corpus pair sample. The translation model may also calculate the distance between the numerical vector corresponding to the translated first language text and the numerical vector corresponding to the second language text, and iteratively adjust the parameters of the first word embedding layer, the encoder, the decoder, and the output word embedding layer to minimize this distance. Over the first corpus pair sample set, the parameters in the translation model are adjusted repeatedly until the average distance between the numerical vectors corresponding to the translated first language texts and the numerical vectors corresponding to the second language texts converges. At this point, the first trained translation model is obtained.
For example, in the second training stage, the output word embedding layer may be configured to obtain the numerical vector corresponding to the second language text in the third corpus pair sample. The translation model may also calculate the distance between the numerical vector corresponding to the translated mixed-language text and the numerical vector corresponding to the second language text, and iteratively adjust the parameters of the first word embedding layer, the second word embedding layer, the encoder, the decoder, and the output word embedding layer to minimize this distance. Over the third corpus pair sample set, the parameters in the translation model are adjusted repeatedly until the average distance between the numerical vectors corresponding to the translated mixed-language texts and the numerical vectors corresponding to the second language texts converges. At this point, the second trained translation model is obtained.
For example, in the third training stage, the output word embedding layer may be configured to obtain the numerical vector corresponding to the second language text in the second corpus pair sample. The translation model may also calculate the distance between the numerical vector corresponding to the translated third language text and the numerical vector corresponding to the second language text, and iteratively adjust the parameters of the second word embedding layer, the encoder, the decoder, and the output word embedding layer to minimize this distance. Over the second corpus pair sample set, the parameters in the translation model are adjusted repeatedly until the average distance between the numerical vectors corresponding to the translated third language texts and the numerical vectors corresponding to the second language texts converges. At this point, the trained translation model is obtained.
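For illustration only, the following PyTorch-style sketch shows one possible arrangement of the two source-side word embedding layers, a shared LSTM encoder and decoder, and an output projection over second-language words. The framework, layer sizes, id handling, and the score-based output head are assumptions; in particular, the sketch does not reproduce the vector-distance objective described above, where a standard sequence loss (e.g., cross-entropy over the returned scores) would commonly stand in.

```python
import torch
import torch.nn as nn

class TranslationModel(nn.Module):
    """Sketch: two source-side embedding tables (first/third language), a shared
    LSTM encoder and decoder, and an output projection over second-language words."""

    def __init__(self, first_vocab_size, third_vocab_size, second_vocab_size,
                 emb_dim=256, hidden_dim=512):
        super().__init__()
        self.first_embed = nn.Embedding(first_vocab_size, emb_dim)    # first word embedding layer
        self.third_embed = nn.Embedding(third_vocab_size, emb_dim)    # second word embedding layer
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.second_embed = nn.Embedding(second_vocab_size, emb_dim)  # output word embedding layer
        self.output_proj = nn.Linear(hidden_dim, second_vocab_size)

    def forward(self, src_ids, src_is_third, tgt_ids):
        # Embed each source token with the table of its own language and merge the
        # vectors back into sentence order, handling mixed-language input.
        batch, seq_len = src_ids.shape
        src_vecs = torch.zeros(batch, seq_len, self.first_embed.embedding_dim,
                               device=src_ids.device)
        first_mask = ~src_is_third
        src_vecs[first_mask] = self.first_embed(src_ids[first_mask])
        src_vecs[src_is_third] = self.third_embed(src_ids[src_is_third])

        _, state = self.encoder(src_vecs)                              # encoded hidden vectors
        dec_out, _ = self.decoder(self.second_embed(tgt_ids), state)   # decoded hidden vectors
        return self.output_proj(dec_out)                               # scores over second-language words
```

In such a sketch, the three stages described above would simply iterate over the first, third, and second corpus pair sample sets in turn while updating the relevant parameters.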
Thus, according to one aspect of the present disclosure, there is provided an apparatus comprising a translation model, wherein the translation model comprises: the system comprises a first word embedding layer, a second word embedding layer, an encoder, a decoder and an output word embedding layer, wherein the first word embedding layer is configured to convert a first language word into a first language word vector and output the first language word vector to the encoder; the second word embedding layer is configured to convert the third language word into a third language word vector and output the third language word vector to the encoder; the encoder is connected to the first word embedding layer and the second word embedding layer, and is configured to encode a first language word vector or a third language word vector to obtain an encoded hidden vector and output the encoded hidden vector to the decoder; the decoder is connected to the encoder, and is configured to decode the encoded hidden vector to obtain a decoded hidden vector and output the decoded hidden vector to the output word embedding layer, and the output word embedding layer is configured to convert the decoded hidden vector into a second language word.
In the above apparatus, the first word embedding layer is further configured to: in a first training stage, sequentially converting first language words in a first corpus pair sample into first language word vectors, and outputting the first language word vectors to the encoder; in the second training stage, the first language words in the third corpus pair samples are sequentially converted into first language word vectors, and the first language word vectors are output to the encoder.
In the above apparatus, the second word embedding layer is configured to: in a second training stage, sequentially converting third language words in a third corpus pair sample into third language word vectors, and outputting the third language word vectors to the encoder; and in a third training stage, sequentially converting third language words in the second corpus pair samples into third language word vectors, and outputting the third language word vectors to the encoder.
The first corpus pair sample is a text pair consisting of a first language text and a second language text with the same semantics, the second corpus pair sample is a text pair consisting of a third language text and a second language text with the same semantics, and the third corpus pair sample is a text pair consisting of a mixed language text and a second language text with the same semantics, wherein the mixed language text comprises one or more first language words and one or more third language words.
By translating the third language with the apparatus comprising the translation model, the translation effect is greatly improved.
As an example, experiments were performed with 60,000 pairs of aligned zebra-English corpora as the first corpus pair sample set and 160,000 pairs of aligned Gujarati-English corpora as the second corpus pair sample set; the translation effect scores are shown in the table below.
Table: comparison of translation effect (BLEU scores; table content not reproduced in this text)
The translation effect score is indicated by the BLEU (bilingual evaluation understudy) score. BLEU is an algorithm for evaluating the quality of natural language text translated by a machine. The closer the machine translation is to a professional human translation, the higher the BLEU score and the better the translation effect.
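As an illustration of how such a score can be computed, the snippet below uses the sacreBLEU library; the library choice and the toy sentences are assumptions and do not represent the evaluation setup used in the experiments.

```python
import sacrebleu  # one common BLEU implementation; not necessarily the one used here

hypotheses = ["the cat sat on the mat"]            # system translations
references = [["the cat is sitting on the mat"]]   # one reference stream, aligned with hypotheses

score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {score.score:.2f}")
```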
Therefore, the translation model of the present disclosure is trained using extended training samples that mix high-resource language data and low-resource language data, so that it learns enough information related to the low-resource language, thereby improving the translation effect on the low-resource language.
According to yet another aspect of the present disclosure, there is also provided an electronic device for implementing the method according to the embodiments of the present disclosure. Fig. 5 shows a schematic diagram of an electronic device 2000 in accordance with an embodiment of the present disclosure.
As shown in fig. 5, the electronic device 2000 may include one or more processors 2010 and one or more memories 2020, wherein the memory 2020 stores computer readable code which, when executed by the one or more processors 2010, can perform the translation model training method described above.
The processor in embodiments of the present disclosure may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like, and may be of the X86 architecture or the ARM architecture.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
For example, a method or apparatus according to embodiments of the present disclosure may also be implemented by means of the architecture of computing device 3000 shown in fig. 6. As shown in fig. 6, computing device 3000 may include a bus 3010, one or more CPUs 3020, a read-only memory (ROM) 3030, a random access memory (RAM) 3040, a communication port 3050 connected to a network, an input/output component 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as the ROM 3030 or the hard disk 3070, may store various data or files used for processing and/or communication of the methods provided by the present disclosure, as well as program instructions executed by the CPU. The computing device 3000 may also include a user interface 3080. Of course, the architecture shown in fig. 6 is merely exemplary, and when implementing different devices, one or more components of the computing device shown in fig. 6 may be omitted according to practical needs.
According to yet another aspect of the present disclosure, a computer-readable storage medium is also provided. Fig. 7 shows a schematic diagram of a storage medium 4000 according to the present disclosure.
As shown in fig. 7, the computer storage medium 4020 has stored thereon computer readable instructions 4010. When the computer readable instructions 4010 are executed by a processor, a method according to an embodiment of the disclosure described with reference to the above figures may be performed. The computer-readable storage medium in embodiments of the present disclosure may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, the processor executing the computer instructions, causing the computer device to perform a method according to an embodiment of the present disclosure.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The exemplary embodiments of the present disclosure described in detail above are illustrative only and are not limiting. Those skilled in the art will understand that various modifications and combinations of these embodiments or features thereof may be made without departing from the principles and spirit of the disclosure, and such modifications should fall within the scope of the disclosure.

Claims (14)

1. A method of training a translation model, comprising:
acquiring a first corpus pair sample set, wherein each first corpus pair sample in the first corpus pair sample set is a text pair consisting of a first language text and a second language text with the same semantic meaning;
acquiring a second corpus pair sample set, wherein each second corpus pair sample in the second corpus pair sample set is a text pair composed of a third language text and a second language text with the same semantic meaning;
Acquiring a third corpus pair sample set based on the first corpus pair sample set and the second corpus pair sample set, wherein each third corpus pair sample in the third corpus pair sample set is a text pair consisting of a mixed-language text and a second-language text with the same semantics, and the mixed-language text comprises one or more first-language words and one or more third-language words; and
Training the translation model using the third corpus to a sample set,
Wherein training the translation model using the third corpus to sample set further comprises:
In a first training stage, training the translation model by using the first corpus to a sample set so as to obtain a first trained translation model;
in a second training stage, training the first trained translation model with the third corpus on a sample set to obtain a second trained translation model; and
In a third training stage, the second trained translation model is trained using the second corpus to sample set.
2. The training method of claim 1, wherein the obtaining a third corpus-pair sample set based on the first corpus-pair sample set and the second corpus-pair sample set further comprises:
extracting first language words and second language words with the same semantics from all or part of the first corpus pair samples in the first corpus pair sample set to form a first word list;
extracting third language words and second language words with the same semantics from all or part of second corpus pair samples in the second corpus pair sample set to form a second word list;
Acquiring a third vocabulary based on the first vocabulary and the second vocabulary, wherein each word pair in the third vocabulary comprises a first-language word and a third-language word with the same semantic, and the first-language word and the third-language word with the same semantic correspond to the same second-language word; and
And acquiring the third corpus pair sample set based on the third word list.
3. The training method of claim 2, wherein the obtaining the third corpus pair sample set based on a third vocabulary further comprises:
for each third corpus pair sample in the third corpus pair sample set,
Selecting a first corpus pair sample from the first corpus pair sample set, and acquiring a plurality of first language words from the first language text in the first corpus pair sample;
obtaining a transition probability specific to the sample for the third corpus, and determining whether to replace each first-language word of the plurality of first-language words with a third-language word based on the transition probability;
for each first language word to be replaced in the plurality of first language words, replacing the first language word with a third language word based on the third word list so as to obtain a mixed language text;
and combining the mixed-language text with the second-language text in the first corpus pair sample to obtain the third corpus pair sample.
4. The training method of claim 3, wherein,
The obtaining a transition probability specific to the sample for the third corpus further includes:
Based on the training step number in the second training stage, obtaining a transition probability specific to the sample for the third corpus, wherein the transition probability has a positive correlation with the training step number;
The training the first trained translation model with the third corpus on a sample set to obtain a second trained translation model further comprises:
In the second training phase, at the moment the training step number is reached, a third corpus-pair sample set is determined based on the transition probability, and the first trained translation model is trained by using the determined third corpus-pair sample set to obtain a second trained translation model.
5. The training method of claim 3 or 4, wherein the relationship between the transition probability and the number of training steps of the second training stage is: p = ⌈tN/T⌉/N, wherein p is the transition probability, t is the current number of training steps, T is the total number of training steps of the second training stage, N is the preset number of segments of the second training stage, and ⌈·⌉ is the round-up operator.
6. The training method of claim 3 or 4, wherein the relationship between the transition probability and the number of training steps of the second training stage is: p = p0 + (1 − p0)·t/T, wherein p is the transition probability, t is the current number of training steps, p0 is the initial probability, and T is the total number of training steps of the second training stage.
7. The training method of claim 3, wherein said replacing the first language term with a third language term further comprises:
And replacing the continuous multiple first language words in the first language text in the first corpus pair sample with the continuous multiple third language words, wherein the continuous multiple first language words form a first language phrase, the continuous multiple third language words form a third language phrase, and the first language phrase and the third language phrase have the same semantic meaning.
8. The training method of claim 1, wherein the translation model comprises a first word embedding layer, a second word embedding layer, an encoder, a decoder, and an output word embedding layer, wherein,
The first word embedding layer is configured to convert a first language word in a first language text and a mixed language text into a first language word vector and output the first language word vector to the encoder;
The second word embedding layer is configured to convert third-language words in the mixed-language text and the third-language text into third-language word vectors, and output the third-language word vectors to the encoder;
The encoder is connected to the first word embedding layer and the second word embedding layer, and is configured to encode a first language word vector and a third language word vector to obtain an encoded hidden vector and output the encoded hidden vector to the decoder;
The decoder is connected to the encoder and is configured to decode the coded hidden vector to obtain a decoded hidden vector and output the decoded hidden vector to the output word embedding layer;
The output word embedding layer is configured to convert the decoded hidden vector into a second language word.
9. The training method of claim 8, wherein,
The first word embedding layer is further configured to:
In a first training stage, sequentially converting first language words in a first corpus pair sample into first language word vectors, and outputting the first language word vectors to the encoder;
in a second training stage, sequentially converting the first language words in the third corpus pair samples into first language word vectors, and outputting the first language word vectors to the encoder;
the second word embedding layer is configured to:
In a second training stage, sequentially converting third language words in a third corpus pair sample into third language word vectors, and outputting the third language word vectors to the encoder;
And in a third training stage, sequentially converting third language words in the second corpus pair samples into third language word vectors, and outputting the third language word vectors to the encoder.
10. The training method of claim 9, wherein,
The encoder is configured to:
In a first training stage, the first language word vectors are coded in sequence to obtain coded hidden vectors;
In a second training stage, the first language word vector and the third language word vector are coded in sequence to obtain a coded hidden vector;
and in a third training stage, the third language word vectors are sequentially encoded to obtain encoded hidden vectors.
11. An apparatus comprising a translation model,
Wherein the translation model is trained by the method of any one of claims 1-10,
Wherein the translation model comprises: a first word embedding layer, a second word embedding layer, an encoder, a decoder, and an output word embedding layer, wherein,
The first word embedding layer is configured to convert the first language word into a first language word vector and output the first language word vector to the encoder;
The second word embedding layer is configured to convert the third language word into a third language word vector and output the third language word vector to the encoder;
The encoder is connected to the first word embedding layer and the second word embedding layer, and is configured to encode a first language word vector and a third language word vector to obtain an encoded hidden vector and output the encoded hidden vector to the decoder;
The decoder is connected to the encoder and is configured to decode the coded hidden vector to obtain a decoded hidden vector and output the decoded hidden vector to the output word embedding layer;
The output word embedding layer is configured to convert the decoded hidden vector into a second language word.
12. The apparatus of claim 11, wherein,
The first word embedding layer is further configured to:
In a first training stage, sequentially converting first language words in a first corpus pair sample into first language word vectors, and outputting the first language word vectors to the encoder;
in a second training stage, sequentially converting the first language words in the third corpus pair samples into first language word vectors, and outputting the first language word vectors to the encoder;
the second word embedding layer is configured to:
In a second training stage, sequentially converting third language words in a third corpus pair sample into third language word vectors, and outputting the third language word vectors to the encoder;
In a third training stage, sequentially converting third language words in the second corpus pair samples into third language word vectors, and outputting the third language word vectors to the encoder;
The first corpus pair sample is a text pair consisting of a first language text and a second language text with the same semantics, the second corpus pair sample is a text pair consisting of a third language text and a second language text with the same semantics, and the third corpus pair sample is a text pair consisting of a mixed language text and a second language text with the same semantics, wherein the mixed language text comprises one or more first language words and one or more third language words.
13. An electronic device, comprising:
One or more processors; and
One or more memories, wherein the memories have stored therein a computer executable program which, when executed by the processor, performs the method of any of claims 1-10.
14. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any of claims 1-10.
CN202110125748.4A 2021-01-29 2021-01-29 Translation model training method and translation model device Active CN113591493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110125748.4A CN113591493B (en) 2021-01-29 2021-01-29 Translation model training method and translation model device

Publications (2)

Publication Number Publication Date
CN113591493A CN113591493A (en) 2021-11-02
CN113591493B true CN113591493B (en) 2024-06-07

Family

ID=78238038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110125748.4A Active CN113591493B (en) 2021-01-29 2021-01-29 Translation model training method and translation model device

Country Status (1)

Country Link
CN (1) CN113591493B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115238708B (en) * 2022-08-17 2024-02-27 腾讯科技(深圳)有限公司 Text semantic recognition method, device, equipment, storage medium and program product

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11037028B2 (en) * 2018-12-31 2021-06-15 Charles University Faculty of Mathematics and Physics Computer-implemented method of creating a translation model for low resource language pairs and a machine translation system using this translation model

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021560A (en) * 2017-12-07 2018-05-11 苏州大学 A kind of data enhancement methods, system, device and computer-readable recording medium
CN109117483A (en) * 2018-07-27 2019-01-01 清华大学 The training method and device of neural network machine translation model
CN110428818A (en) * 2019-08-09 2019-11-08 中国科学院自动化研究所 The multilingual speech recognition modeling of low-resource, audio recognition method
CN110543643A (en) * 2019-08-21 2019-12-06 语联网(武汉)信息技术有限公司 Training method and device of text translation model
CN111178094A (en) * 2019-12-20 2020-05-19 沈阳雅译网络技术有限公司 Pre-training-based scarce resource neural machine translation training method
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium
CN111860001A (en) * 2020-07-31 2020-10-30 北京小米松果电子有限公司 Machine translation method and device, electronic equipment and storage medium
CN112101047A (en) * 2020-08-07 2020-12-18 江苏金陵科技集团有限公司 Machine translation method for matching language-oriented precise terms
CN112257459A (en) * 2020-10-16 2021-01-22 北京有竹居网络技术有限公司 Language translation model training method, translation method, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Human Quality Evaluation of Machine-Translated Poetry;S. Seljan;《43rd International Convention on Information, Communication and Electronic Technology (MIPRO)》;20211106;1040-1045 *

Also Published As

Publication number Publication date
CN113591493A (en) 2021-11-02

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40054512

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant