WO2022228041A1 - Training method, apparatus, device and storage medium for a translation model - Google Patents

Training method, apparatus, device and storage medium for a translation model

Info

Publication number
WO2022228041A1
Authority
WO
WIPO (PCT)
Prior art keywords
corpus
pseudo
translation model
original
parallel
Prior art date
Application number
PCT/CN2022/084963
Other languages
English (en)
French (fr)
Inventor
潘骁
王明轩
吴礼蔚
李磊
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2022228041A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/42 - Data-driven translation
    • G06F 40/47 - Machine-assisted translation, e.g. using translation memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/237 - Lexical tools
    • G06F 40/247 - Thesauruses; Synonyms

Definitions

  • the embodiments of the present disclosure relate to the field of computer technology, for example, to a method, apparatus, device, and storage medium for training a translation model.
  • the language translation model of translation software is usually established by training on a parallel corpus centered on a common language, and is used to realize translation between the common language and other languages (taking English as the common language, for example, realizing English-to-French translation and the like).
  • however, such translation software has lower translation accuracy on other non-universal language pairs (e.g., German to French).
  • the present disclosure provides a training method, apparatus, device and storage medium for a translation model, so as to improve the translation accuracy of the translation model in various scenarios.
  • Embodiments of the present disclosure provide a method for training a translation model, including:
  • acquiring at least one original corpus;
  • aligning and replacing at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning, to obtain a replacement corpus corresponding to the original corpus, wherein the original vocabulary and the target vocabulary are in different languages;
  • constructing a pseudo-parallel corpus based on the original corpus and the replacement corpus, and training a preset basic translation model by using the pseudo-parallel corpus to obtain a target translation model.
  • An embodiment of the present disclosure provides a training device for a translation model, including:
  • an acquisition module, configured to acquire at least one original corpus;
  • a replacement module, configured to align and replace at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning, so as to obtain a replacement corpus corresponding to the original corpus; wherein the original vocabulary and the target vocabulary are in different languages;
  • a construction module configured to construct a pseudo-parallel corpus based on the original corpus and the replacement corpus;
  • a training module configured to use the pseudo-parallel corpus to train a preset basic translation model to obtain a target translation model.
  • An embodiment of the present disclosure provides a training device for a translation model, including a memory and a processor, where the memory stores a computer program, and the processor implements the above-mentioned training method for a translation model when the computer program is executed.
  • An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements the above-mentioned training method for a translation model.
  • FIG. 1 is a schematic flowchart of a method for training a translation model according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of the principle of a construction process of a pseudo-parallel corpus provided by an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of the construction process of another pseudo-parallel corpus provided by an embodiment of the present disclosure;
  • FIG. 4 is a schematic flowchart of another method for training a translation model according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram of the principle of a training process of a translation model according to an embodiment of the present disclosure
  • FIG. 6 is a schematic structural diagram of a training device for a translation model according to an embodiment of the present disclosure
  • FIG. 7 is a schematic structural diagram of a training device for a translation model according to an embodiment of the present disclosure.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • the execution subject of the following method embodiments may be a training device for a translation model, and the device may be implemented as part or all of an electronic device through software, hardware, or a combination of software and hardware.
  • the electronic device may be a client, including but not limited to a smart phone, a tablet computer, an e-book reader, a vehicle terminal, and the like.
  • the electronic device may also be an independent server or a server cluster, and the embodiment of the present disclosure does not limit the form of the electronic device.
  • the following method embodiments are described by taking the execution subject being an electronic device as an example.
  • FIG. 1 is a schematic flowchart of a method for training a translation model according to an embodiment of the present disclosure. This embodiment relates to a process of how an electronic device trains a multilingual translation model. As shown in Figure 1, the method may include:
  • the original corpus is used as training data to train the basic translation model.
  • the modality of the original corpus may be at least one of image, text, video or audio.
  • the language contained in the original corpus may be a single language or multiple languages, which is not limited in this embodiment.
  • the original corpus may include monolingual corpus and/or parallel corpus.
  • the parallel corpus includes a pair of source corpus and target corpus.
  • the source-end corpus can be understood as the corpus before translation
  • the target-end corpus can be understood as the translated corpus of the source-end corpus.
  • taking a text parallel corpus as an example, a Chinese-English parallel corpus includes a Chinese document and a corresponding English document; if the Chinese-to-English translation operation is performed through the translation model, the Chinese document is the source-end corpus and the English document is the target-end corpus.
  • Monolingual corpus can be source corpus or target corpus, lacking corresponding parallel corpus.
  • for example, in the field of traditional Chinese medicine, a large amount of Chinese corpus and English corpus can be obtained, but it is difficult to obtain parallel corpora in which the Chinese and English correspond to each other.
  • the electronic device can directly acquire at least one original corpus from the corpus database.
  • the target vocabulary is a synonym of the original vocabulary, and the original vocabulary is in a different language than the target vocabulary. Since a multilingual synonym dictionary is relatively easy to obtain, the electronic device can, based on the multilingual synonym dictionary, replace at least one original word of the source-end corpus in the original corpus with a synonym in any other language, so as to obtain a replacement corpus corresponding to the original corpus.
  • the languages of each target vocabulary in the replacement corpus are at least partially different.
  • when the original corpus is a parallel corpus, some words of the source-end corpus in the parallel corpus can be aligned and replaced with synonymous words in any other language, while the target-end corpus remains unchanged.
  • exemplarily, as shown in FIG. 2, suppose the source-end corpus of the parallel corpus is "I like sing and dance", whose language is English, and the target-end corpus is "J'adore chanter et danser", whose language is French.
  • the electronic device can align and replace the word "sing" in the source-end corpus with the Chinese word "唱歌" of the same meaning, and replace the word "dance" with the Chinese word "跳舞" of the same meaning, thereby forming the replacement corpus "I like 唱歌 and 跳舞" corresponding to the source-end corpus; the target-end corpus is not replaced and remains unchanged.
  • the languages of the aligned and replaced target vocabularies can be the same or different; that is, the above word "sing" can be aligned and replaced with a German word of the same meaning, while the above word "dance" is aligned and replaced with a Chinese word of the same meaning.
  • when the original corpus is a monolingual corpus, the monolingual corpus is the above-mentioned source-end corpus and there is no corresponding translation corpus.
  • some words in the monolingual corpus can be aligned and replaced with synonyms of any other language.
  • the languages of the replaced synonyms are at least partially different, that is, the replaced synonyms may be of the same language or different languages.
  • exemplarily, as shown in FIG. 3, suppose the monolingual corpus is the Chinese sentence "你喜欢哪种类型的音乐呢". In order to use a large amount of monolingual corpus to train the translation model, the electronic device can align and replace the word "喜欢" with the English word "like" of the same meaning, the word "哪种" with the French word "quel" of the same meaning, and the word "音乐" with the German word "Musik" of the same meaning, thereby forming the replacement corpus "你 like quel 类型的 Musik 呢" corresponding to the monolingual corpus.
  • a parallel corpus can be constructed based on the original corpus and the corresponding replacement corpus. Since the parallel corpus is not a standard corpus, but obtained after synonym alignment and replacement, the parallel corpus is marked as a pseudo-parallel corpus.
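  • as a rough illustration of the construction step above (not the patented implementation; the toy synonym dictionary, the replacement probability and the helper names are assumptions made for the example), pseudo-parallel pairs could be built as follows:

```python
import random

# Toy multilingual synonym dictionary: source word -> same-meaning words in other languages.
# A real system would use a large multilingual thesaurus; these entries are illustrative only.
SYNONYMS = {
    "sing": ["唱歌", "singen", "chanter"],
    "dance": ["跳舞", "tanzen", "danser"],
    "music": ["Musik", "musique", "音乐"],
}

def make_replacement(sentence, replace_prob=0.5, seed=None):
    """Align-and-replace some source words with same-meaning words from other languages."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        candidates = SYNONYMS.get(word.lower())
        if candidates and rng.random() < replace_prob:
            out.append(rng.choice(candidates))   # swap in a synonym from another language
        else:
            out.append(word)
    return " ".join(out)

def build_pseudo_parallel(original):
    """Build a (pseudo-source, pseudo-target) pair from an original corpus entry.

    original: ("parallel", src, tgt) for a parallel corpus, or ("mono", src) for a monolingual corpus.
    """
    if original[0] == "parallel":
        _, src, tgt = original
        return make_replacement(src), tgt        # the target side stays unchanged
    _, src = original
    return make_replacement(src), src            # monolingual: the pseudo-target is the sentence itself

# Example usage
print(build_pseudo_parallel(("parallel", "I like sing and dance", "J'adore chanter et danser")))
print(build_pseudo_parallel(("mono", "I like music")))
```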
  • when the original corpus is a parallel corpus, the replacement corpus corresponding to the source-end corpus in the parallel corpus can be used as the pseudo-source-end corpus, and the target-end corpus in the parallel corpus can be used as the pseudo-target-end corpus, to form a pseudo-parallel corpus.
  • the target corpus remains unchanged, which can ensure that the translation model learns the correct translation result.
  • the replacement corpus "I like singing and dancing” is used as the pseudo-source corpus, and the target-end corpus "J , adore chanter et danser" in the parallel corpus continues to be the pseudo-target corpus, thereby forming a pseudo-parallel corpus.
  • when the original corpus is a monolingual corpus, the replacement corpus corresponding to the monolingual corpus can be used as the pseudo-source-end corpus and the monolingual corpus itself as the pseudo-target-end corpus, to form a pseudo-parallel corpus.
  • the replacement corpus "Do you like quel-type Musik" is used as the pseudo-source corpus
  • the monolingual corpus itself "What type of music do you like” is used as the pseudo-target corpus, thus forming a pseudo-parallel corpus.
  • the electronic device can use the pseudo-parallel corpus to train a preset basic translation model to obtain a target translation model.
  • for a monolingual corpus, after a simple synonym alignment and replacement, the monolingual corpus can be directly applied to the training process of the translation model without the need for "back-translation" technology, which greatly shortens the process of training the translation model with monolingual corpus and thereby improves the training efficiency of the translation model.
  • the above-mentioned basic translation model may include a sequence-to-sequence (seq2seq) model, which is a neural network with an encoder-decoder structure whose input is a sequence and whose output is also a sequence;
  • in the encoder, the variable-length input sequence is converted into a fixed-length vector representation, and the decoder converts this fixed-length vector representation into a variable-length target signal sequence, thereby realizing variable-length input to variable-length output.
  • the sequence-to-sequence model can be of various types, for example, a seq2seq model based on a recurrent neural network (RNN) or a seq2seq model based on convolution operations (CONV); this embodiment does not limit the type of the basic translation model.
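  • to make the encoder-decoder idea above concrete, the following is a minimal RNN-based seq2seq sketch in PyTorch; it is an illustration under assumed hyperparameters (a GRU encoder and decoder, embedding size 256, hidden size 512), not the model used in the embodiments:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal RNN encoder-decoder: variable-length input -> fixed vector -> variable-length output."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def encode(self, src_ids):
        _, h = self.encoder(self.src_emb(src_ids))   # h: fixed-length sentence representation
        return h

    def forward(self, src_ids, tgt_ids):
        h = self.encode(src_ids)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)  # teacher forcing with shifted targets
        return self.out(dec_out)                              # logits over the target vocabulary

# Example: batch of 2 sentences, source length 5, target length 6
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 5)), torch.randint(0, 1000, (2, 6)))
print(logits.shape)  # torch.Size([2, 6, 1000])
```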
  • by replacing some words of the source-end corpus in the original corpus with synonymous words in other arbitrary languages, the constructed pseudo-parallel corpus contains other languages that do not appear in the original corpus.
  • training the basic translation model on this pseudo-parallel corpus enables the basic translation model to learn the grammatical structure and lexical associations between other languages, which enhances the translation accuracy of the translation model on other non-universal language pairs.
  • the translation model is usually trained on English-centered parallel corpus, so the translation effect on other non-English language pairs cannot meet the expected requirements.
  • for this reason, continuing to refer to FIG. 2, the embodiment of the present disclosure aligns and replaces the English word "sing" of the source-end corpus in the original corpus with the Chinese word "唱歌", while the target-end corpus remains unchanged.
  • by training the basic translation model with the pseudo-parallel corpus constructed from the replacement corpus and the target-end corpus, the basic translation model learns that the Chinese word "唱歌" and the French word "chanter" have the same meaning, and can learn the grammatical structure and lexical association between the sentence "I like 唱歌 and 跳舞" and the sentence "J'adore chanter et danser", thereby realizing translation between Chinese and French and ensuring the accuracy of that translation.
  • the training method for a translation model includes acquiring at least one original corpus, aligning and replacing at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning to obtain a replacement corpus corresponding to the original corpus, wherein the original vocabulary and the target vocabulary are in different languages; constructing a pseudo-parallel corpus based on the original corpus and the replacement corpus, and using the pseudo-parallel corpus to train a preset basic translation model to obtain a target translation model.
  • that is, after at least one original vocabulary of the source-end corpus is aligned and replaced with a synonymous word in any other language, a large number of pseudo-parallel corpora containing other arbitrary languages can be constructed; training the translation model with such pseudo-parallel corpora enables it to learn the grammatical structure and lexical associations between other arbitrary languages, thereby improving the translation accuracy of the translation model on other non-universal language pairs.
  • the base translation model described above includes an encoder and a decoder.
  • the encoder can perform feature extraction on the input sequence to obtain feature vectors.
  • in order to improve the accuracy of the encoder's feature extraction on the input sequence, in the above-mentioned S103 the process of using the pseudo-parallel corpus to train the preset basic translation model to obtain the target translation model may be: using the pseudo-parallel corpus to train the encoder of the basic translation model through a first loss function.
  • the first loss function is a contrastive learning loss function, which is used to update the parameters of the encoder. Since the pseudo-parallel corpus is constructed from the original corpus and its corresponding replacement corpus, the replacement corpus can be regarded as a synonymous sentence of the original corpus. Therefore, in order to improve translation accuracy, when using the pseudo-parallel corpus to train the basic translation model, the contrastive loss function can be used to train the encoder of the basic translation model, so as to pull closer the high-dimensional representations of synonymous sentences after encoding by the encoder, while pushing apart the high-dimensional representations of irrelevant sentences after encoding by the encoder.
  • that is, after the encoder is trained with the contrastive learning loss function, two input corpora that are originally similar still have similar features in the feature space after being encoded by the encoder, and two input corpora that are originally dissimilar still have dissimilar features in the feature space after being encoded by the encoder.
  • in this way, when the basic translation model is trained using the pseudo-parallel corpus constructed from the parallel corpus and/or the monolingual corpus, the translation effect of the target translation model obtained by training can be guaranteed.
  • as shown in FIG. 4, the above-mentioned process of using the pseudo-parallel corpus to train the encoder of the basic translation model through the first loss function may be as follows:
  • S401. Construct a positive example corpus and a negative example corpus of the pseudo-source-end corpus in the pseudo-parallel corpus.
  • Pseudo-parallel corpus can include paired pseudo-source corpus and pseudo-target corpus.
  • the pseudo-source corpus can be understood as the corpus before translation, and the pseudo-target corpus can be understood as the translated corpus of the pseudo-source corpus.
  • the above-mentioned positive example corpus refers to a corpus whose match degree with the pseudo-source corpus is greater than the first preset value, that is, the two are similar corpora. When the two corpora are completely similar, the matching degree of the two is 1.
  • the above negative example corpus refers to the corpus whose matching degree with the pseudo-source corpus is less than the second preset value, that is, the two are irrelevant corpus. When the two corpora are completely unrelated, the matching degree of the two is 0.
  • the first preset value is greater than the second preset value.
  • the above-mentioned pseudo-source-end corpus is usually the replacement corpus corresponding to the source-end corpus in the original corpus; therefore, the above-mentioned positive example corpus may be the pseudo-target-end corpus in the current pseudo-parallel corpus, and the above-mentioned negative example corpus may be the pseudo-target-end corpus in other pseudo-parallel corpora.
  • the pseudo-source corpus in the above pseudo-parallel corpus can be the replacement corpus corresponding to the monolingual corpus, and the pseudo-target corpus can be the monolingual corpus itself.
  • in this way, the electronic device can use the monolingual corpus itself as the positive example corpus of the pseudo-source-end corpus, and use the pseudo-target-end corpora in other pseudo-parallel corpora as the negative example corpora of the pseudo-source-end corpus.
  • the pseudo-source corpus in the above pseudo-parallel corpus can be the replacement corpus corresponding to the source corpus in the parallel corpus
  • the pseudo-target corpus can be the target-end corpus in the parallel corpus.
  • in this way, the electronic device can regard the target-end corpus in the current parallel corpus as the positive example corpus of the pseudo-source-end corpus, and regard the pseudo-target-end corpora in other pseudo-parallel corpora as the negative example corpora of the pseudo-source-end corpus.
  • the pseudo-source-end corpus in the pseudo-parallel corpus is "I love you” and the pseudo-target-end corpus is "Jet , aime”, and the pseudo-source-end corpus is the source-end corpus in the original corpus after The replacement corpus after the replacement of synonyms in any other language.
  • the pseudo-target corpus "Jet , aime” can be used as the positive example corpus of the replacement corpus "I love you", and the pseudo-target corpus in other pseudo-parallel corpus can be selected as Replace the negative example corpus of the corpus "I love you” (the English corpus "It , s sunny” in Figure 5, the French corpus "C , est la vie” and the Chinese corpus "Who are you”).
  • multiple negative example corpora can be selected to train the encoder, thereby improving the training efficiency of the translation model.
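  • a small sketch of how such anchor/positive/negative triples might be drawn from a batch of pseudo-parallel pairs is given below; the batch contents and the sampling strategy are illustrative assumptions, with the pseudo-source sentences shown as invented mixed-language replacements rather than examples from the disclosure:

```python
import random

def build_contrastive_examples(batch, num_negatives=3, seed=0):
    """For each pseudo-parallel pair, use its own pseudo-target as the positive example
    and pseudo-targets sampled from *other* pairs as negative examples."""
    rng = random.Random(seed)
    examples = []
    for i, (pseudo_src, pseudo_tgt) in enumerate(batch):
        others = [t for j, (_, t) in enumerate(batch) if j != i]
        negatives = rng.sample(others, min(num_negatives, len(others)))
        examples.append({"anchor": pseudo_src, "positive": pseudo_tgt, "negatives": negatives})
    return examples

# Hypothetical mini-batch of (pseudo-source, pseudo-target) pairs
batch = [
    ("I love you", "Je t'aime"),
    ("今天 sunny", "It's sunny"),
    ("这 is life", "C'est la vie"),
    ("Who 是 you", "你是谁"),
]
for ex in build_contrastive_examples(batch):
    print(ex["anchor"], "->", ex["positive"], "| negatives:", ex["negatives"])
```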
  • S402. Use the pseudo-source corpus, the positive example corpus, and the negative example corpus to train the encoder through a first loss function.
  • the electronic device can use the pseudo-source corpus, the positive example corpus and the negative example corpus as the training data of the encoder, and use the contrastive learning loss function to repeatedly train the encoder , to continuously update the parameters of the encoder until the training target is reached.
  • the training objective is to maximize the similarity between the vector representations of the pseudo-source corpus and the positive corpus, and minimize the similarity between the vector representations of the pseudo-source corpus and the negative corpus.
  • continuing to refer to FIG. 5, the electronic device uses the pseudo-target-end corpus "Je t'aime" as the positive example corpus of the anchor "I love you", and uses the English corpus "It's sunny", the French corpus "C'est la vie" and the Chinese corpus "你是谁", sampled from the pseudo-target-end corpora of other pseudo-parallel corpora, as the negative example corpora of the anchor "I love you"; the contrastive loss function L_ctl is used to train the encoder of the basic translation model, which enables the encoder to pull closer the high-dimensional representations of synonymous sentences after encoding while pushing apart the high-dimensional representations of irrelevant sentences after encoding.
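  • the objective described above (pull the anchor's encoded representation toward the positive example and push it away from the negatives) is commonly written as an InfoNCE-style loss; the sketch below is one assumed formulation of such a contrastive loss, with the temperature and cosine-similarity choices taken as assumptions rather than the exact definition of L_ctl in the embodiments:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss.

    anchor:    (batch, dim)   encoder output for the pseudo-source sentence
    positive:  (batch, dim)   encoder output for its positive example
    negatives: (batch, k, dim) encoder outputs for k negative examples
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor * positive).sum(-1, keepdim=True)          # (batch, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)      # (batch, k)

    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # the positive is class 0
    labels = torch.zeros(anchor.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Example with random features: batch of 4 sentences, 3 negatives each, 512-dim representations
loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 3, 512))
print(loss.item())
```

  • minimizing this loss raises the anchor-positive similarity relative to the anchor-negative similarities, which matches the stated training objective of maximizing the former and minimizing the latter.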
  • the above process of S402 may include the following steps:
  • the encoder is used to extract the feature vector of the input corpus. Therefore, the pseudo-source-end corpus, the positive example corpus of the pseudo-source-end corpus, and the negative example corpus of the pseudo-source-end corpus are respectively input into the encoder, and the encoder encodes the pseudo-source-end corpus, the positive example corpus, and the negative example corpus to obtain the first vector representation corresponding to the pseudo-source-end corpus, the second vector representation corresponding to the positive example corpus, and the third vector representation corresponding to the negative example corpus.
  • the first loss function is a contrastive learning loss function. Its optimization goal is that when the input corpora are similar, the vector representations corresponding to the two input corpora after encoding by the encoder should also be similar, and when the input corpora are dissimilar, the corresponding vector representations after encoding should also be dissimilar. Therefore, the electronic device may determine the first loss value of the contrastive learning loss function based on the first vector representation and the second vector representation, which reflect the matching degree between the pseudo-source-end corpus and the positive example corpus.
  • when the first loss value does not satisfy the convergence condition, the parameters of the encoder are updated, the updated encoder is used as the encoder in S4021, and the above-mentioned S4021 continues to be executed until the first loss value of the contrastive loss function satisfies the convergence condition.
  • similarly, the electronic device may determine the second loss value of the contrastive learning loss function based on the first vector representation and the third vector representation, which reflect the matching degree between the pseudo-source-end corpus and the negative example corpus.
  • when the second loss value does not satisfy the convergence condition, the parameters of the encoder are updated, the updated encoder is used as the encoder in S4021, and the above-mentioned S4021 continues to be executed until the second loss value of the contrastive loss function satisfies the convergence condition.
  • the training of the encoder can be superimposed on the training process of the entire basic translation model, so that the encoder can share the parameters of multi-task training.
  • the pseudo-parallel corpus is used to train the preset basic translation model, and the process of obtaining the target translation model may include: using the pseudo-parallel corpus, through The first loss function and the second loss function perform multi-task training on the basic translation model to obtain a target translation model.
  • the first loss function is a contrastive learning loss function, which is used to update the parameters of the encoder;
  • the second loss function is used to update the parameters of the encoder and the decoder; that is, the second loss function is used to train the entire basic translation model.
  • the second loss function may be a cross-entropy loss function.
  • the above multi-task may at least include a training task for the encoder and a training task for the encoder and the decoder, and the encoder may share the parameters updated by the multi-task training.
  • one of the tasks is to use the anchor corpus, the positive example corpus and the negative example corpus to train the encoder through the contrastive loss function, so that the trained encoder can pull closer the high-dimensional representations of synonymous sentences after encoding while pushing apart the high-dimensional representations of irrelevant sentences after encoding; the other task is to use the pseudo-parallel corpus to train the basic translation model through the second loss function, so that the basic translation model can learn the grammatical structure and lexical associations between the arbitrary languages contained in the pseudo-parallel corpus, thereby realizing mutual translation between multiple languages in zero-resource and unsupervised scenarios.
  • the multi-task training is performed by adding the contrastive learning loss function, so that the trained target translation model can support translation between multiple languages in any direction and the accuracy of the translation results is ensured.
  • exemplarily, the electronic device takes the pseudo-source-end corpus (i.e., the anchor) "I love you" in the pseudo-parallel corpus as the input of the basic translation model and the pseudo-target-end corpus "Je t'aime" in the pseudo-parallel corpus as the expected output, and trains the encoder and decoder in the basic translation model with the second loss function L_mt.
  • at the same time, the electronic device inputs the anchor "I love you", the positive example corpus of the anchor such as "Je t'aime", and the negative example corpora of the anchor such as "It's sunny", "C'est la vie" and "你是谁" into the encoder; after encoding by the encoder, the corresponding first vector representation, second vector representation and third vector representation are obtained, and based on the first vector representation, the second vector representation and the third vector representation, the encoder is trained with the contrastive loss function until the convergence conditions of the contrastive loss function L_ctl and the second loss function L_mt are reached.
  • a contrastive loss function is added to train the encoder of the basic translation model, so that the trained encoder can pull closer the high-dimensional representations of synonymous sentences after encoding while pushing apart the high-dimensional representations of irrelevant sentences after encoding.
  • at the same time, the basic translation model is trained using the pseudo-parallel corpus constructed from the replacement corpus obtained after synonym alignment and replacement with other arbitrary languages, so that the trained target translation model can support translation between multiple languages in any direction, and the accuracy of the translation results is ensured.
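  • a rough sketch of one multi-task training step is shown below, reusing the Seq2Seq and contrastive_loss sketches above; the loss weighting, the manner of summing the two objectives into a single loss, and the use of the encoder's final hidden state as the sentence representation are assumptions for illustration, since the disclosure only specifies that L_mt updates the encoder and decoder while L_ctl updates the encoder:

```python
import torch
import torch.nn.functional as F

def multitask_step(model, batch, optimizer, ctl_weight=1.0):
    """One training step: translation loss (L_mt) plus contrastive loss (L_ctl) on a shared encoder."""
    src, tgt, pos, negs = batch  # token-id tensors for one mini-batch of pseudo-parallel data

    # Task 1: standard translation loss over the full encoder-decoder.
    logits = model(src, tgt[:, :-1])                              # predict the next target tokens
    l_mt = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))

    # Task 2: contrastive loss on the encoder's sentence representations (shared encoder parameters).
    def sent_repr(ids):
        return model.encode(ids).squeeze(0)                       # (batch, hidden) from the GRU sketch above
    l_ctl = contrastive_loss(sent_repr(src), sent_repr(pos),
                             torch.stack([sent_repr(n) for n in negs], dim=1))

    loss = l_mt + ctl_weight * l_ctl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return l_mt.item(), l_ctl.item()
```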
  • FIG. 6 is a schematic structural diagram of an apparatus for training a translation model according to an embodiment of the present disclosure.
  • the apparatus may include: an acquisition module 601 , a replacement module 602 , a construction module 603 and a training module 604 .
  • the acquisition module 601 is configured to acquire at least one original corpus; the replacement module 602 is configured to align and replace at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning, so as to obtain the replacement corpus corresponding to the original corpus, wherein the original vocabulary and the target vocabulary are in different languages; the construction module 603 is configured to construct a pseudo-parallel corpus based on the original corpus and the replacement corpus; and the training module 604 is configured to train the preset basic translation model by using the pseudo-parallel corpus to obtain the target translation model.
  • the training device for a translation model acquires at least one original corpus, aligns and replaces at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning to obtain a replacement corpus corresponding to the original corpus, wherein the original vocabulary and the target vocabulary are in different languages; constructs a pseudo-parallel corpus based on the original corpus and the replacement corpus, and uses the pseudo-parallel corpus to train a preset basic translation model to obtain a target translation model.
  • that is, after at least one original vocabulary of the source-end corpus is aligned and replaced with a synonymous word in any other language, a large number of pseudo-parallel corpora containing other arbitrary languages can be constructed; training the translation model with such pseudo-parallel corpora enables it to learn the grammatical structure and lexical associations between other arbitrary languages, thereby improving the translation accuracy of the translation model on other non-universal language pairs.
  • the original corpus includes monolingual corpus and/or parallel corpus; wherein, the monolingual corpus is the source corpus, and the parallel corpus includes a pair of source corpus and target corpus.
  • the basic translation model includes an encoder and a decoder;
  • the above training module 604 includes a first training unit; the first training unit is configured to use the pseudo-parallel corpus to train the encoder of the basic translation model through a first loss function, wherein the first loss function is a contrastive learning loss function for updating the parameters of the encoder.
  • the training module 604 further includes a second training unit; the second training unit is configured to use the pseudo-parallel corpus to perform multi-task training on the basic translation model through the first loss function and a second loss function to obtain the target translation model, wherein the first loss function is a contrastive learning loss function used to update the parameters of the encoder, and the second loss function is used to update the parameters of the encoder and the decoder.
  • the first training unit is configured to construct a positive example corpus and a negative example corpus of the pseudo-source-end corpus in the pseudo-parallel corpus, and to train the encoder through the first loss function using the pseudo-source-end corpus, the positive example corpus and the negative example corpus, wherein the training objective is to maximize the similarity between the vector representations of the pseudo-source-end corpus and the positive example corpus, and to minimize the similarity between the vector representations of the pseudo-source-end corpus and the negative example corpus.
  • the first training unit is configured to input the pseudo-source-end corpus, the positive example corpus and the negative example corpus into the encoder to obtain the first vector representation corresponding to the pseudo-source-end corpus, the second vector representation corresponding to the positive example corpus and the third vector representation corresponding to the negative example corpus; to determine the first loss value of the first loss function according to the first vector representation and the second vector representation, and update the parameters of the encoder based on the first loss value until the first loss value of the first loss function satisfies the convergence condition; and to determine the second loss value of the first loss function according to the first vector representation and the third vector representation, and update the parameters of the encoder based on the second loss value until the second loss value of the first loss function satisfies the convergence condition.
  • when the original corpus is a monolingual corpus, the construction module 603 is configured to use the replacement corpus corresponding to the monolingual corpus as the pseudo-source-end corpus and the monolingual corpus as the pseudo-target-end corpus to form a pseudo-parallel corpus.
  • when the original corpus is a parallel corpus, the construction module 603 is configured to use the replacement corpus corresponding to the source-end corpus in the parallel corpus as the pseudo-source-end corpus, and the target-end corpus in the parallel corpus as the pseudo-target-end corpus, to form a pseudo-parallel corpus.
  • the positive example corpus is the pseudo-target corpus.
  • the negative example corpus is a pseudo-target-end corpus among other pseudo-parallel corpora.
  • the language types of each target vocabulary in the replacement corpus are at least partially different.
  • referring to FIG. 7, it shows a schematic structural diagram of an electronic device 700 (i.e., a training device for a translation model) suitable for implementing an embodiment of the present disclosure.
  • the electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital televisions (TVs), desktop computers, and the like.
  • the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703.
  • various programs and data necessary for the operation of the electronic device 700 are also stored.
  • the processing device 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704.
  • An Input/Output (I/O) interface 705 is also connected to the bus 704 .
  • the following devices can be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 707 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709.
  • Communication means 709 may allow electronic device 700 to communicate wirelessly or by wire with other devices to exchange data.
  • although FIG. 7 shows an electronic device 700 having various means, it is not required to implement or have all of the illustrated means; more or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 709, or from the storage device 708, or from the ROM 702.
  • when the computer program is executed by the processing device 701, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Examples of computer-readable storage media may include, but are not limited to, an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • the program code embodied on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: electric wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above.
  • clients and servers can communicate using any currently known or future developed network protocols, such as HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium.
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device is caused to: acquire at least one original corpus; align and replace at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning to obtain a replacement corpus corresponding to the original corpus, wherein the original vocabulary and the target vocabulary are in different languages; construct a pseudo-parallel corpus based on the original corpus and the replacement corpus, and use the pseudo-parallel corpus to train a preset basic translation model to obtain a target translation model.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (eg, using an Internet service provider to connect through the Internet).
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented in dedicated hardware-based systems that perform the specified functions or operations , or can be implemented in a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in a software manner, and may also be implemented in a hardware manner.
  • the name of the unit does not constitute a limitation of the unit itself in one case, for example, the first obtaining unit may also be described as "a unit that obtains at least two Internet Protocol addresses".
  • exemplary types of hardware logic components that can be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. Examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM or flash memory, an optical fiber, a CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a training device for a translation model, comprising a memory and a processor, wherein the memory stores a computer program, and when executing the computer program the processor implements: acquiring at least one original corpus; aligning and replacing at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning to obtain a replacement corpus corresponding to the original corpus, wherein the original vocabulary and the target vocabulary are in different languages; and constructing a pseudo-parallel corpus based on the original corpus and the replacement corpus, and training a preset basic translation model by using the pseudo-parallel corpus to obtain a target translation model.
  • the original corpus includes monolingual corpus and/or parallel corpus; wherein, the monolingual corpus is the source corpus, and the parallel corpus includes a pair of source corpus and target corpus.
  • the basic translation model includes an encoder and a decoder
  • when the processor executes the computer program, the processor further implements: using the pseudo-parallel corpus to train the encoder of the basic translation model through a first loss function, wherein the first loss function is a contrastive learning loss function for updating the parameters of the encoder.
  • when the processor executes the computer program, the processor further implements: using the pseudo-parallel corpus to perform multi-task training on the basic translation model through the first loss function and a second loss function to obtain the target translation model.
  • the first loss function is a contrastive learning loss function, which is used to update the parameters of the encoder
  • the second loss function is used to update the parameters of the encoder and the decoder.
  • when the processor executes the computer program, the processor further implements: constructing a positive example corpus and a negative example corpus of the pseudo-source-end corpus in the pseudo-parallel corpus; and training the encoder through the first loss function using the pseudo-source-end corpus, the positive example corpus and the negative example corpus, wherein the training objective is to maximize the similarity between the vector representations of the pseudo-source-end corpus and the positive example corpus, and to minimize the similarity between the vector representations of the pseudo-source-end corpus and the negative example corpus.
  • when the processor executes the computer program, the processor further implements: inputting the pseudo-source-end corpus, the positive example corpus and the negative example corpus into the encoder to obtain a first vector representation corresponding to the pseudo-source-end corpus, a second vector representation corresponding to the positive example corpus, and a third vector representation corresponding to the negative example corpus; determining a first loss value of the first loss function according to the first vector representation and the second vector representation, and updating the parameters of the encoder based on the first loss value until the first loss value of the first loss function satisfies the convergence condition; and determining a second loss value of the first loss function according to the first vector representation and the third vector representation, and updating the parameters of the encoder based on the second loss value until the second loss value of the first loss function satisfies the convergence condition.
  • when the original corpus is a monolingual corpus and the processor executes the computer program, the processor further implements: using the replacement corpus corresponding to the monolingual corpus as the pseudo-source-end corpus and using the monolingual corpus as the pseudo-target-end corpus to form a pseudo-parallel corpus.
  • when the original corpus is a parallel corpus and the processor executes the computer program, the processor further implements: using the replacement corpus corresponding to the source-end corpus in the parallel corpus as the pseudo-source-end corpus, and the target-end corpus in the parallel corpus as the pseudo-target-end corpus, to form a pseudo-parallel corpus.
  • the positive example corpus is the pseudo-target corpus.
  • the negative example corpus is a pseudo-target-end corpus among other pseudo-parallel corpora.
  • the language types of each target vocabulary in the replacement corpus are at least partially different.
  • a computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the following is realized: acquiring at least one original corpus; aligning and replacing at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning to obtain a replacement corpus corresponding to the original corpus, wherein the original vocabulary and the target vocabulary are in different languages; and constructing a pseudo-parallel corpus based on the original corpus and the replacement corpus, and training a preset basic translation model by using the pseudo-parallel corpus to obtain a target translation model.
  • the translation model training apparatus, device, and storage medium provided in the foregoing embodiments can execute the translation model training method provided by any embodiment of the present disclosure, and have corresponding functional modules and effects for executing the method; for technical details not described in detail above, reference may be made to the training method of the translation model provided by any embodiment of the present disclosure.
  • a training method for a translation model is provided, including: acquiring at least one original corpus; aligning and replacing at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning to obtain a replacement corpus corresponding to the original corpus, wherein the original vocabulary and the target vocabulary are in different languages; and constructing a pseudo-parallel corpus based on the original corpus and the replacement corpus, and training a preset basic translation model by using the pseudo-parallel corpus to obtain a target translation model.
  • the original corpus includes monolingual corpus and/or parallel corpus; wherein, the monolingual corpus is the source corpus, and the parallel corpus includes a pair of source corpus and target corpus.
  • the basic translation model includes an encoder and a decoder
  • the above translation model training method further comprises: using the pseudo-parallel corpus to train the encoder of the basic translation model through a first loss function, wherein the first loss function is a contrastive learning loss function for updating the parameters of the encoder.
  • the above translation model training method further comprises: using the pseudo-parallel corpus to perform multi-task training on the basic translation model through the first loss function and a second loss function to obtain the target translation model, wherein the first loss function is a contrastive learning loss function used to update the parameters of the encoder, and the second loss function is used to update the parameters of the encoder and the decoder.
  • the above translation model training method further comprises: constructing a positive example corpus and a negative example corpus of the pseudo-source-end corpus in the pseudo-parallel corpus; and training the encoder through the first loss function using the pseudo-source-end corpus, the positive example corpus and the negative example corpus, wherein the training objective is to maximize the similarity between the vector representations of the pseudo-source-end corpus and the positive example corpus, and to minimize the similarity between the vector representations of the pseudo-source-end corpus and the negative example corpus.
  • the above translation model training method further comprises: inputting the pseudo-source-end corpus, the positive example corpus and the negative example corpus into the encoder to obtain a first vector representation corresponding to the pseudo-source-end corpus, a second vector representation corresponding to the positive example corpus, and a third vector representation corresponding to the negative example corpus; determining a first loss value of the first loss function according to the first vector representation and the second vector representation, and updating the parameters of the encoder based on the first loss value until the first loss value of the first loss function satisfies the convergence condition; and determining a second loss value of the first loss function according to the first vector representation and the third vector representation, and updating the parameters of the encoder based on the second loss value until the second loss value of the first loss function satisfies the convergence condition.
  • the above translation model training method further comprises: when the original corpus is a monolingual corpus, using the replacement corpus corresponding to the monolingual corpus as the pseudo-source-end corpus and using the monolingual corpus as the pseudo-target-end corpus to form a pseudo-parallel corpus.
  • the above translation model training method further comprises: when the original corpus is a parallel corpus, using the replacement corpus corresponding to the source-end corpus in the parallel corpus as the pseudo-source-end corpus and using the target-end corpus in the parallel corpus as the pseudo-target-end corpus to form a pseudo-parallel corpus.
  • the positive example corpus is the pseudo-target corpus.
  • the negative example corpus is a pseudo-target-end corpus among other pseudo-parallel corpora.
  • the language types of each target vocabulary in the replacement corpus are at least partially different.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A training method, apparatus, device and storage medium for a translation model. The training method for a translation model includes: acquiring at least one original corpus (S101); aligning and replacing at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning to obtain a replacement corpus corresponding to the original corpus (S102), wherein the original vocabulary and the target vocabulary are in different languages; and constructing a pseudo-parallel corpus based on the original corpus and the replacement corpus, and training a preset basic translation model by using the pseudo-parallel corpus to obtain a target translation model (S103).

Description

Training method, apparatus, device and storage medium for a translation model
This application claims priority to the Chinese patent application No. 202110454958.8 filed with the Chinese Patent Office on April 26, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, for example, to a training method, apparatus, device and storage medium for a translation model.
Background
With the continuous development of computer technology, a variety of translation software has emerged and become an important channel for people to obtain external information.
The language translation model of translation software is usually established by training on a parallel corpus centered on a common language, and is used to realize translation between the common language and other languages (taking English as the common language, for example, realizing English-to-French translation and the like). However, such translation software has lower translation accuracy on other non-universal language pairs (for example, German to French).
Summary
The present disclosure provides a training method, apparatus, device and storage medium for a translation model, so as to improve the translation accuracy of the translation model in various scenarios.
An embodiment of the present disclosure provides a training method for a translation model, including:
acquiring at least one original corpus;
aligning and replacing at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning, to obtain a replacement corpus corresponding to the original corpus, wherein the original vocabulary and the target vocabulary are in different languages; and
constructing a pseudo-parallel corpus based on the original corpus and the replacement corpus, and training a preset basic translation model by using the pseudo-parallel corpus to obtain a target translation model.
An embodiment of the present disclosure provides a training apparatus for a translation model, including:
an acquisition module, configured to acquire at least one original corpus;
a replacement module, configured to align and replace at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning, to obtain a replacement corpus corresponding to the original corpus, wherein the original vocabulary and the target vocabulary are in different languages;
a construction module, configured to construct a pseudo-parallel corpus based on the original corpus and the replacement corpus; and
a training module, configured to train a preset basic translation model by using the pseudo-parallel corpus to obtain a target translation model.
An embodiment of the present disclosure provides a training device for a translation model, including a memory and a processor, where the memory stores a computer program, and the processor implements the above training method for a translation model when executing the computer program.
An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the above training method for a translation model is implemented.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a training method for a translation model provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of the construction process of a pseudo-parallel corpus provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the construction process of another pseudo-parallel corpus provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of another training method for a translation model provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the training process of a translation model provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a training apparatus for a translation model provided by an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a training device for a translation model provided by an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, the present disclosure may be implemented in various forms, and these embodiments are provided for understanding the present disclosure. The drawings and embodiments of the present disclosure are for exemplary purposes only.
The steps described in the method implementations of the present disclosure may be executed in different orders and/or in parallel. In addition, the method implementations may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended inclusions, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
Concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules or units, and are not used to limit the order of the functions performed by these apparatuses, modules or units or their interdependence.
The modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise indicated in the context, they should be understood as "one or more".
The names of messages or information exchanged between multiple apparatuses in the implementations of the present disclosure are for illustrative purposes only and are not used to limit the scope of these messages or information.
The embodiments of the present disclosure will be described below with reference to the accompanying drawings. It should be noted that, in the case of no conflict, the embodiments of the present disclosure and the features in the embodiments may be arbitrarily combined with each other.
The execution subject of the following method embodiments may be a training apparatus for a translation model, and the apparatus may be implemented as part or all of an electronic device through software, hardware, or a combination of software and hardware. Optionally, the electronic device may be a client, including but not limited to a smart phone, a tablet computer, an e-book reader, a vehicle-mounted terminal, and the like. The electronic device may also be an independent server or a server cluster, and the embodiments of the present disclosure do not limit the form of the electronic device. The following method embodiments are described by taking the execution subject being an electronic device as an example.
FIG. 1 is a schematic flowchart of a training method for a translation model provided by an embodiment of the present disclosure. This embodiment relates to the process of how an electronic device trains a multilingual translation model. As shown in FIG. 1, the method may include:
S101. Acquire at least one original corpus.
The original corpus is used as training data for training the basic translation model. The modality of the original corpus may be at least one of image, text, video or audio. The language contained in the original corpus may be a single language or multiple languages, which is not limited in this embodiment.
Optionally, the original corpus may include a monolingual corpus and/or a parallel corpus. The parallel corpus includes a pair of source-end corpus and target-end corpus. The source-end corpus can be understood as the corpus before translation, and the target-end corpus can be understood as the translated corpus of the source-end corpus. Taking a text parallel corpus as an example, a Chinese-English parallel corpus includes a Chinese document and a corresponding English document; if a Chinese-to-English translation operation is performed through the translation model, the Chinese document is the source-end corpus and the English document is the target-end corpus.
A monolingual corpus may be a source-end corpus or a target-end corpus and lacks a corresponding parallel corpus. For example, in the field of traditional Chinese medicine, a large amount of Chinese corpus and English corpus can be obtained, but it is difficult to obtain parallel corpora in which the Chinese and English correspond to each other.
Usually, a large number of original corpora are stored in a corpus database; therefore, the electronic device can directly acquire at least one original corpus from the corpus database.
S102. Align and replace at least one original vocabulary of the source-end corpus in the original corpus with a target vocabulary of the same meaning, to obtain a replacement corpus corresponding to the original corpus.
The target vocabulary is a synonym of the original vocabulary, and the original vocabulary and the target vocabulary are in different languages. Since a multilingual synonym dictionary is relatively easy to obtain, the electronic device can, based on the multilingual synonym dictionary, replace at least one original word of the source-end corpus in the original corpus with a synonym in any other language, so as to obtain the replacement corpus corresponding to the original corpus.
In order to realize translation between multiple languages, optionally, the languages of the target vocabularies in the replacement corpus are at least partially different.
The process of obtaining the above replacement corpus is introduced below based on parallel corpora and monolingual corpora:
When the original corpus is a parallel corpus, some words of the source-end corpus in the parallel corpus can be aligned and replaced with synonymous words in any other language, while the target-end corpus remains unchanged. Exemplarily, as shown in FIG. 2, suppose the source-end corpus of the parallel corpus is "I like sing and dance", whose language is English, and the target-end corpus is "J'adore chanter et danser", whose language is French. The electronic device can align and replace the word "sing" in the source-end corpus with the Chinese word "唱歌" of the same meaning, and the word "dance" with the Chinese word "跳舞" of the same meaning, thereby forming the replacement corpus "I like唱歌and跳舞" corresponding to the source-end corpus; the target-end corpus is not replaced and remains unchanged. The languages of the aligned and replaced target vocabularies can be the same or different; that is, the above word "sing" can be aligned and replaced with a German word of the same meaning while the above word "dance" is aligned and replaced with a Chinese word of the same meaning.
When the original corpus is a monolingual corpus, the monolingual corpus is the above-mentioned source-end corpus and has no corresponding translation corpus; in this case, some words in the monolingual corpus can be aligned and replaced with synonymous words in any other language. The languages of the replaced synonymous words are at least partially different, that is, the replaced synonymous words may be in the same language or in different languages. Exemplarily, as shown in FIG. 3, suppose the monolingual corpus is the Chinese corpus "你喜欢哪种类型的音乐呢". In order to be able to use a large amount of monolingual corpus to train the translation model, the electronic device can align and replace the word "喜欢" in the above monolingual corpus with the English word "like" of the same meaning, the word "哪种" with the French word "quel" of the same meaning, and the word "音乐" with the German word "Musik" of the same meaning, thereby forming the replacement corpus "你like quel类型的Musik呢" corresponding to the monolingual corpus.
S103. Construct a pseudo-parallel corpus based on the original corpus and the replacement corpus, and train a preset basic translation model by using the pseudo-parallel corpus to obtain a target translation model.
After the replacement corpus corresponding to the original corpus is obtained, a parallel corpus can be constructed based on the original corpus and the corresponding replacement corpus. Since this parallel corpus is not a standard corpus but is obtained through synonym alignment and replacement, it is marked as a pseudo-parallel corpus.
The construction process of the pseudo-parallel corpus is introduced below for different original corpora:
When the original corpus is a parallel corpus, after some words of the source-end corpus in the parallel corpus are aligned and replaced with synonymous words in any other language to obtain the replacement corpus corresponding to the source-end corpus, the replacement corpus corresponding to the source-end corpus in the parallel corpus can be used as the pseudo-source-end corpus, and the target-end corpus in the parallel corpus can be used as the pseudo-target-end corpus, to form a pseudo-parallel corpus. The target-end corpus remains unchanged, which can ensure that the translation model learns the correct translation result. Continuing to refer to FIG. 2, the replacement corpus "I like唱歌and跳舞" is used as the pseudo-source-end corpus, and the target-end corpus "J'adore chanter et danser" in the parallel corpus continues to be used as the pseudo-target-end corpus, thereby forming a pseudo-parallel corpus.
When the original corpus is a monolingual corpus, after some words in the monolingual corpus are aligned and replaced with synonymous words in any other language to obtain the replacement corpus corresponding to the monolingual corpus, the replacement corpus corresponding to the monolingual corpus can be used as the pseudo-source-end corpus and the monolingual corpus itself as the pseudo-target-end corpus, to form a pseudo-parallel corpus. Continuing to refer to FIG. 3, the replacement corpus "你like quel类型的Musik呢" is used as the pseudo-source-end corpus, and the monolingual corpus itself "你喜欢哪种类型的音乐呢" is used as the pseudo-target-end corpus, thereby forming a pseudo-parallel corpus.
After a large number of pseudo-parallel corpora are obtained, the electronic device can use the pseudo-parallel corpora to train the preset basic translation model to obtain the target translation model. For a monolingual corpus, after a simple synonym alignment and replacement, the monolingual corpus can be directly applied to the training process of the translation model without the need for "back-translation" technology, which greatly shortens the process of training the translation model with monolingual corpus and thereby improves the training efficiency of the translation model.
可选地,上述基础翻译模型可以包括序列到序列(sequence to sequence,seq2seq)模型,是一种编码(Encoder)-解码(Decoder)结构的神经网络,输入是一个序列(Sequence),输出也是一个序列;在Encoder中,将可变长度的序列转变为固定长度的向量表示,Decoder将这个固定长度的向量表示转换为可变长度的目标信号序列,进而实现不定长的输入到不定长的输出。序列到序列模型可以包括多种类型,例如,基于循环神经网络(Recurrent Neural Network,RNN)的seq2seq模型和基于卷积运算(Convolution,CONV)的seq2seq模型等,本实施例中对基础翻译模型的类型不做限定。
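作为参考,下面给出一个基于RNN(GRU)的极简seq2seq编码-解码结构的PyTorch草图(仅为示意:省略了注意力机制、填充掩码等细节,超参数均为本文假设,本公开并不限定具体的模型结构):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """把可变长度的输入序列编码为固定长度的向量表示。"""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_ids):                      # src_ids: (batch, src_len)
        emb = self.embedding(src_ids)
        outputs, hidden = self.rnn(emb)               # hidden: (1, batch, hid_dim)
        return outputs, hidden                        # hidden 即固定长度的向量表示

class Decoder(nn.Module):
    """以编码得到的向量表示为初始状态,逐步生成可变长度的目标序列。"""
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt_ids, hidden):               # tgt_ids: (batch, tgt_len)
        emb = self.embedding(tgt_ids)
        outputs, hidden = self.rnn(emb, hidden)
        return self.out(outputs), hidden               # logits: (batch, tgt_len, vocab_size)

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.encoder = Encoder(vocab_size)
        self.decoder = Decoder(vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, hidden = self.encoder(src_ids)
        logits, _ = self.decoder(tgt_ids, hidden)
        return logits
```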
通过将原始语料中源端语料的部分词汇替换为其它任意语种的同义词汇,使得所构造的伪平行语料中包含了原始语料中没有出现的其它语种,进而通过所构造的伪平行语料对基础翻译模型进行训练,使得基础翻译模型能够学习到其它语种之间的语法结构以及词汇关联,增强了翻译模型在其它非通用语种对上的翻译准确性。例如,翻译模型通常是基于以英语为中心的平行语料训练得到的,那么在其它非英语语对上的翻译效果达不到期望要求,为此,继续参见图2,本公开实施例将原始语料中的源端语料的英文词汇"sing"对齐替换为中文词汇"唱歌",目标端语料保持不变,使用替换语料和平行语料中目标端语料所构造的伪平行语料训练基础翻译模型,可以使得基础翻译模型学习到中文词汇"唱歌"与法语词汇"chanter"是同一含义,并能学习到句子"I like唱歌and跳舞"与句子"J'adore chanter et danser"之间的语法结构和词汇关联,从而实现了中文到法语这一语对的翻译,且保证了中文到法语之间的翻译准确性。
本公开实施例提供的翻译模型的训练方法,获取至少一个原始语料,将原始语料中源端语料的至少一个原始词汇对齐替换为同含义的目标词汇,得到原始语料对应的替换语料,且原始词汇与目标词汇的语种不同;基于原始语料与替换语料构造伪平行语料,并使用伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型。也就是说,通过将原始语料中源端语料的至少一个原始词汇对齐替换为其它任意语种的同义词汇后,能够构造出大量包含其它任意语种的伪平行语料,使用该伪平行语料训练翻译模型,使得翻译模型能够学习到其它任意语种之间的语法结构以及词汇关联,从而提高了翻译模型在其它非通用语种对上的翻译准确性。
在一个实施例中,上述基础翻译模型包括编码器和解码器。编码器可以对输入序列进行特征提取,得到特征向量。为了提高编码器对输入序列特征提取的准确性,在上述实施例的基础上,可选地,上述S103中使用伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型的过程可以为:使用所述伪平行语料,通过第一损失函数对所述基础翻译模型的编码器进行训练。
第一损失函数为对比学习损失函数,用于更新编码器的参数。由于伪平行语料是基于原始语料对应的替换语料与原始语料所构造的,即替换语料可以认为是原始语料的同义句,因此,为了提高翻译准确性,在使用伪平行语料训练基础翻译模型时,可以使用对比损失函数对基础翻译模型的编码器进行训练,从而拉近同义句经过编码器编码后的高维表达,同时拉远不相关句子经过编码器编码后的高维表达。也就是说,通过对比学习损失函数对编码器进行训练后,原本相似的两个输入语料,在经过编码器编码后,在特征空间中,两个输入语料的特征仍旧相似;而原本不相似的两个输入语料,在经过编码器编码后,在特征空间中,两个输入语料的特征也仍旧不相似。这样,在使用由平行语料和/或单语语料所构造的伪平行语料训练基础翻译模型时,能够保证训练得到的目标翻译模型的翻译效果。
可选地,如图4所示,上述使用所述伪平行语料,通过第一损失函数对所述基础翻译模型的编码器进行训练的过程可以为:
S401、构建所述伪平行语料中的伪源端语料的正例语料和负例语料。
伪平行语料可以包括成对的伪源端语料和伪目标端语料,伪源端语料可以理解为翻译之前的语料,伪目标端语料可以理解为伪源端语料经过翻译后的语料。上述正例语料是指与伪源端语料匹配度大于第一预设值的语料,即两者为相似语料,当两个语料完全相似时,两者的匹配度为1。上述负例语料是指与伪源端语料匹配度小于第二预设值的语料,即两者为不相关语料,当两个语料完全不相关时,两者的匹配度为0。上述第一预设值大于第二预设值。
可选地,上述伪源端语料通常为原始语料中源端语料对应的替换语料,因此,上述正例语料可以为当前伪平行语料中的伪目标端语料,上述负例语料为其它伪平行语料中的伪目标端语料。
当原始语料为单语语料时,上述伪平行语料中伪源端语料可以为单语语料对应的替换语料,伪目标端语料可以为单语语料本身,这样,电子设备可以将单语语料本身作为伪源端语料的正例语料,将其它伪平行语料中的伪目标端作为伪源端语料的负例语料。
当原始语料为平行语料时,上述伪平行语料中伪源端语料可以为平行语料中源端语料对应的替换语料,伪目标端语料可以为平行语料中目标端语料,这样,电子设备可以将当前平行语料中目标端语料作为伪源端语料的正例语料,将其它伪平行语料中的伪目标端语料作为伪源端语料的负例语料。
示例性的,参见图5,假设伪平行语料中的伪源端语料为"I love you",伪目标端语料为"Je t'aime",该伪源端语料是原始语料中源端语料经过其它任意语种同义词替换后的替换语料,此时,可以将伪目标端语料"Je t'aime"作为替换语料"I love you"的正例语料,选择其它伪平行语料中的伪目标端语料作为替换语料"I love you"的负例语料(如图5中的英语语料"It's sunny",法语语料"C'est la vie"以及中文语料"你是谁")。另外,可以选取多个负例语料对编码器进行训练,从而提高翻译模型的训练效率。
S402、使用所述伪源端语料、所述正例语料以及所述负例语料,通过第一损失函数对所述编码器进行训练。
在得到伪源端语料的正例语料以及负例语料后,电子设备可以将伪源端语料、正例语料以及负例语料作为编码器的训练数据,采用对比学习损失函数对编码器进行反复训练,以不断更新编码器的参数,直至达到训练目标。训练目标为最大化伪源端语料和正例语料的向量表示之间的相似度,最小化伪源端语料和负例语料的向量表示之间的相似度。
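上述训练目标可以用一种常见的对比学习损失(例如InfoNCE形式)来示意性地刻画。以下公式仅为一种可能的形式化,其中的相似度函数sim、温度系数τ等记号均为本文为说明而引入的假设,并非本公开限定的定义:

```latex
L_{ctl} = -\log \frac{\exp\left(\mathrm{sim}(h, h^{+})/\tau\right)}
                     {\exp\left(\mathrm{sim}(h, h^{+})/\tau\right) + \sum_{h^{-}} \exp\left(\mathrm{sim}(h, h^{-})/\tau\right)}
```

其中h、h⁺、h⁻分别为伪源端语料(锚点)、正例语料与负例语料经编码器编码后的向量表示,sim可取余弦相似度。最小化该损失即等价于最大化锚点与正例向量表示之间的相似度,同时最小化锚点与各负例向量表示之间的相似度。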
继续参见图5,电子设备将伪目标端语料"Je t'aime"作为锚点"I love you"的正例语料,将从其它伪平行语料的伪目标端语料中采样的英语语料"It's sunny",法语语料"C'est la vie"以及中文语料"你是谁"作为锚点"I love you"的负例语料,采用对比损失函数L_ctl对基础翻译模型的编码器进行训练,使得编码器能够拉近同义句编码后的高维表达,同时拉远不相关句子编码后的高维表达。
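结合图5的例子,对比学习部分可以示意为如下的PyTorch草图(采用批内其它样本的伪目标端作为负例;句向量的获取方式、温度系数等均为本文假设的简化处理):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_vecs, target_vecs, temperature=0.1):
    """批内对比损失:anchor_vecs[i] 与 target_vecs[i] 互为正例,
    与 target_vecs[j](j != i)互为负例。两者形状均为 (batch, hid_dim)。"""
    anchor = F.normalize(anchor_vecs, dim=-1)
    target = F.normalize(target_vecs, dim=-1)
    logits = anchor @ target.t() / temperature            # (batch, batch) 余弦相似度矩阵
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)                # 拉近对角线(正例),拉远其余位置(负例)

# 用法示意:anchor_vecs 为伪源端语料(锚点)经编码器编码后的句向量,
# target_vecs 为对应伪目标端语料(正例)的句向量
```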
作为一种可选地实施方式,上述S402的过程可以包括以下步骤:
S4021、将所述伪源端语料、所述正例语料以及所述负例语料输入至所述编码器中,得到所述伪源端语料对应的第一向量表示、所述正例语料对应的第二向量表示以及所述负例语料对应的第三向量表示。
编码器用于提取输入语料的特征向量,因此,将伪源端语料、伪源端语料的正例语料以及伪源端语料的负例语料分别输入至编码器中,经过编码器对伪源端语料、正例语料以及负例语料进行编码,从而得到伪源端语料对应的第一向量表示、正例语料对应的第二向量表示以及负例语料对应的第三向量表示。
S4022、根据所述第一向量表示和所述第二向量表示,确定第一损失函数的第一损失值,并基于所述第一损失值更新所述编码器的参数,直至所述第一损失函数的第一损失值满足收敛条件。
S4023、根据所述第一向量表示和所述第三向量表示,确定所述第一损失函数的第二损失值,并基于所述第二损失值更新所述编码器的参数,直至所述第一损失函数的第二损失值满足收敛条件。
第一损失函数为对比学习损失函数,其优化目标是当输入语料相似时,希望经过编码器编码后两个输入语料对应的向量表示也相似,当输入语料不相似时,希望经过编码器编码后两个输入语料对应的向量表示也不相似。因此,电子设备可以基于第一向量表示与第二向量表示之间的匹配度(即伪源端语料与正例语料之间的匹配度),确定对比学习损失函数的第一损失值。当该第一损失值不满足收敛条件时,对编码器的参数进行更新,并将更新后的编码器作为S4021中的编码器,继续执行上述S4021,直至对比损失函数的第一损失值满足收敛条件。
同理,电子设备可以基于第一向量表示与第三向量表示之间的匹配度(即伪源端语料与负例语料之间的匹配度),确定对比学习损失函数的第二损失值。当该第二损失值不满足收敛条件时,对编码器的参数进行更新,并将更新后的编码器作为S4021中的编码器,继续执行上述S4021,直至对比损失函数的第二损失值满足收敛条件。
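上述反复计算损失值并更新编码器参数、直至满足收敛条件的迭代过程,可以示意为如下的训练循环草图(沿用前文Encoder与contrastive_loss的示意实现;用平均池化得到句向量、以损失变化小于阈值作为收敛条件等均为本文假设的简化处理):

```python
def train_encoder(encoder, optimizer, batches, loss_fn, tol=1e-4):
    """反复计算对比损失并更新编码器参数,直至损失变化小于阈值(一种简化的收敛条件)。"""
    prev_loss = float("inf")
    for anchor_ids, positive_ids in batches:                  # 伪源端与正例(伪目标端)的 token id 张量
        anchor_vecs = encoder(anchor_ids)[0].mean(dim=1)      # 简化:对编码器输出做平均池化得到句向量
        positive_vecs = encoder(positive_ids)[0].mean(dim=1)
        loss = loss_fn(anchor_vecs, positive_vecs)            # 例如前文的 contrastive_loss(批内负例)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if abs(prev_loss - loss.item()) < tol:                # 损失基本不再变化,视为满足收敛条件
            break
        prev_loss = loss.item()
    return encoder

# 用法示意:optimizer 可取 torch.optim.Adam(encoder.parameters(), lr=1e-4)
```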
在实际应用中,对编码器的训练可以叠加在整个基础翻译模型的训练过程中,使得编码器能够共享多任务训练的参数。为此,在上述实施例的基础上,可选地,上述S103中使用所述伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型的过程可以包括:使用伪平行语料,通过第一损失函数和第二损失函数对所述基础翻译模型进行多任务训练,以获取目标翻译模型。
所述第一损失函数为对比学习损失函数,用于更新编码器的参数,所述第二损失函数用于更新编码器和解码器的参数,即第二损失函数用于对整个基础翻译模型进行训练。可选地,第二损失函数可以为交叉熵损失函数。上述多任务至少可以包括对编码器的训练任务以及对编码器和解码器的训练任务,编码器可以共享多任务训练所更新的参数。其中一个任务是使用锚点语料、正例语 料以及负例语料,通过对比损失函数训练编码器,使得训练后的编码器能够拉近同义句编码后的高维表达,同时拉远不相关句子编码后的高维表达;另一个任务是使用伪平行语料,通过第二损失函数训练基础翻译模型,使得基础翻译模型学习到伪平行语料中所包含的其它任意语种之间的语法结构以及词汇关联,从而实现在零资源场景下以及无监督场景下的多语种之间的相互翻译。即在使用伪平行语料,通过第二损失函数训练基础翻译模型的基础上,加入对比学习损失函数进行多任务训练,使得训练后的目标翻译模型能够支持任意方向的多语种之间的翻译,且确保了翻译结果的准确性。
继续参见图5,电子设备将伪平行语料中伪源端语料(即锚点)"I love you"作为基础翻译模型的输入,将伪平行语料中伪目标端语料"Je t'aime"作为期望输出,采用第二损失函数L_mt对基础翻译模型中的编码器和解码器进行训练。同时,电子设备将锚点如"I love you"、锚点的正例语料如"Je t'aime"以及锚点的负例语料如"It's sunny"、"C'est la vie"以及"你是谁"等,输入至编码器中,经过编码器编码后得到对应的第一向量表示、第二向量表示以及第三向量表示,基于第一向量表示、第二向量表示以及第三向量表示采用对比损失函数对编码器进行训练,直至达到对比损失函数L_ctl和第二损失函数L_mt的收敛条件。
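将第二损失函数L_mt与对比损失函数L_ctl叠加的多任务训练,可以示意为如下的单步训练草图(沿用前文Seq2Seq与contrastive_loss的示意实现;加权系数lambda_ctl为本文假设的超参数,并假设各语种共享同一词表,省略了padding掩码等细节):

```python
import torch.nn.functional as F

def multitask_step(model, batch, optimizer, lambda_ctl=1.0):
    """一步多任务训练:翻译损失 L_mt 更新编码器和解码器,对比损失 L_ctl 更新编码器。"""
    src_ids, tgt_in_ids, tgt_out_ids = batch               # 伪源端、解码器输入、解码器期望输出

    # 任务一:翻译任务,交叉熵损失 L_mt(作用于整个编码器-解码器)
    logits = model(src_ids, tgt_in_ids)                     # (batch, tgt_len, vocab)
    loss_mt = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt_out_ids.reshape(-1))

    # 任务二:对比学习,损失 L_ctl(只涉及编码器,编码器参数在两个任务间共享)
    anchor_vecs = model.encoder(src_ids)[0].mean(dim=1)     # 伪源端(锚点)句向量
    positive_vecs = model.encoder(tgt_out_ids)[0].mean(dim=1)  # 伪目标端(正例)句向量
    loss_ctl = contrastive_loss(anchor_vecs, positive_vecs)    # 沿用前文草图中的 contrastive_loss

    loss = loss_mt + lambda_ctl * loss_ctl                  # 两个任务的损失加权后联合优化
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_mt.item(), loss_ctl.item()
```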
在本实施例中,在基础翻译模型的训练过程中,加入对比损失函数对基础翻译模型的编码器进行训练,使得训练后的编码器能够拉近同义句编码后的高维表达,同时拉远不相关句子编码后的高维表达。这样,使用其它任意语种同义词对齐替换后的替换语料所构造的伪平行语料训练基础翻译模型,使得训练后的目标翻译模型能够支持任意方向的多语种之间的翻译,且确保了翻译结果的准确性。
图6为本公开实施例提供的一种翻译模型的训练装置的结构示意图。如图6所示,该装置可以包括:获取模块601、替换模块602、构造模块603和训练模块604。
获取模块601设置为获取至少一个原始语料;替换模块602设置为将所述原始语料中源端语料的至少一个原始词汇对齐替换为同含义的目标词汇,得到所述原始语料对应的替换语料;其中,所述原始词汇与所述目标词汇的语种不同;构造模块603设置为基于所述原始语料与所述替换语料构造伪平行语料;训练模块604设置为使用所述伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型。
本公开实施例提供的翻译模型的训练装置,获取至少一个原始语料,将原始语料中源端语料的至少一个原始词汇对齐替换为同含义的目标词汇,得到原始语料对应的替换语料,且原始词汇与目标词汇的语种不同;基于原始语料与 替换语料构造伪平行语料,并使用伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型。也就是说,通过将原始语料中源端语料的至少一个原始词汇对齐替换为其它任意语种的同义词汇后,能够构造出大量包含其它任意语种的伪平行语料,使用该伪平行语料训练翻译模型,使得翻译模型能够学习到其它任意语种之间的语法结构以及词汇关联,从而提高了翻译模型在其它非通用语种对上的翻译准确性。
可选地,所述原始语料包括单语语料和/或平行语料;其中,所述单语语料为所述源端语料,所述平行语料包括成对的源端语料和目标端语料。
在上述实施例的基础上,可选地,所述基础翻译模型包括编码器和解码器;上述训练模块604包括:第一训练单元;第一训练单元设置为使用所述伪平行语料,通过第一损失函数对所述基础翻译模型的编码器进行训练;其中,所述第一损失函数为对比学习损失函数,用于更新所述编码器的参数。
在上述实施例的基础上,可选地,训练模块604还包括:第二训练单元;第二训练单元设置为使用所述伪平行语料,通过第一损失函数和第二损失函数对所述基础翻译模型进行多任务训练,以获取目标翻译模型;其中,所述第一损失函数为对比学习损失函数,用于更新所述编码器的参数,所述第二损失函数用于更新所述编码器和所述解码器的参数。
在上述实施例的基础上,可选地,第一训练单元设置为构建所述伪平行语料中的伪源端语料的正例语料和负例语料;使用所述伪源端语料、所述正例语料以及所述负例语料,通过第一损失函数对所述编码器进行训练;其中,训练目标为最大化所述伪源端语料和所述正例语料的向量表示之间的相似度,最小化所述伪源端语料和所述负例语料的向量表示之间的相似度。
在上述实施例的基础上,可选地,第一训练单元设置为将所述伪源端语料、所述正例语料以及所述负例语料输入至所述编码器中,得到所述伪源端语料对应的第一向量表示、所述正例语料对应的第二向量表示以及所述负例语料对应的第三向量表示;根据所述第一向量表示和所述第二向量表示,确定第一损失函数的第一损失值,并基于所述第一损失值更新所述编码器的参数,直至所述第一损失函数的第一损失值满足收敛条件;根据所述第一向量表示和所述第三向量表示,确定所述第一损失函数的第二损失值,并基于所述第二损失值更新所述编码器的参数,直至所述第一损失函数的第二损失值满足收敛条件。
在上述实施例的基础上,可选地,当所述原始语料为单语语料时,构造模块603设置为将所述单语语料对应的替换语料作为伪源端语料以及将所述单语语料作为伪目标端语料,组成伪平行语料。
在上述实施例的基础上,可选地,当所述原始语料为平行语料时,构造模块603设置为将所述平行语料中源端语料对应的替换语料作为伪源端语料,以及将所述平行语料中目标端语料作为伪目标端语料,组成伪平行语料。
可选地,所述正例语料为所述伪目标端语料。
可选地,所述负例语料为其它伪平行语料中的伪目标端语料。
可选地,所述替换语料中的各目标词汇的语种至少部分不同。
下面参考图7,其示出了适于用来实现本公开实施例的电子设备700(即翻译模型的训练设备)的结构示意图。本公开实施例中的电子设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、个人数字助理(Personal Digital Assistant,PDA)、平板电脑(PAD)、便携式多媒体播放器(Portable Media Player,PMP)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字电视(Television,TV)、台式计算机等等的固定终端。图7示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图7所示,电子设备700可以包括处理装置(例如中央处理器、图形处理器等)701,其可以根据存储在只读存储器(Read-Only Memory,ROM)702中的程序或者从存储装置708加载到随机访问存储器(Random Access Memory,RAM)703中的程序而执行各种适当的动作和处理。在RAM703中,还存储有电子设备700操作所需的各种程序和数据。处理装置701、ROM 702以及RAM 703通过总线704彼此相连。输入/输出(Input/Output,I/O)接口705也连接至总线704。
通常,以下装置可以连接至I/O接口705:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置706;包括例如液晶显示器(Liquid Crystal Display,LCD)、扬声器、振动器等的输出装置707;包括例如磁带、硬盘等的存储装置708;以及通信装置709。通信装置709可以允许电子设备700与其他设备进行无线或有线通信以交换数据。虽然图7示出了具有各种装置的电子设备700,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。
根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置709从网络上被下载和安装,或者从存储装置708被安装,或者从ROM 702被安装。在该计算机程序被处理装置701执行时,执行本公开实施例的方法中限定的上 述功能。
本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、RAM、ROM、可擦式可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、射频(Radio Frequency,RF)等等,或者上述的任意合适的组合。
在一些实施方式中,客户端、服务器可以利用诸如超文本传输协议(HyperText Transfer Protocol,HTTP)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(Local Area Network,LAN),广域网(Wide Area Network,WAN),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:获取至少一个原始语料;将所述原始语料中源端语料的至少一个原始词汇对齐替换为同含义的目标词汇,得到所述原始语料对应的替换语料;其中,所述原始词汇与所述目标词汇的语种不同;基于所述原始语料与所述替换语料构造伪平行语料,并使用所述伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的 计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括LAN或WAN—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,单元的名称在一种情况下并不构成对该单元本身的限定,例如,第一获取单元还可以被描述为“获取至少两个网际协议地址的单元”。
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(Field Programmable Gate Array,FPGA)、专用集成电路(Application Specific Integrated Circuit,ASIC)、专用标准产品(Application Specific Standard Parts,ASSP)、片上系统(System on Chip,SOC)、复杂可编程逻辑设备(Complex Programmable Logic Device,CPLD)等等。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、RAM、 ROM、EPROM或快闪存储器、光纤、CD-ROM、光学储存设备、磁储存设备、或上述内容的任何合适组合。
在一个实施例中,提供了一种翻译模型的训练设备,包括存储器和处理器,存储器存储有计算机程序,该处理器执行计算机程序时实现:
获取至少一个原始语料;
将所述原始语料中源端语料的至少一个原始词汇对齐替换为同含义的目标词汇,得到所述原始语料对应的替换语料;其中,所述原始词汇与所述目标词汇的语种不同;
基于所述原始语料与所述替换语料构造伪平行语料,并使用所述伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型。
可选地,所述原始语料包括单语语料和/或平行语料;其中,所述单语语料为所述源端语料,所述平行语料包括成对的源端语料和目标端语料。
可选地,所述基础翻译模型包括编码器和解码器;
在一个实施例中,处理器执行计算机程序时还实现:使用所述伪平行语料,通过第一损失函数对所述基础翻译模型的编码器进行训练;其中,所述第一损失函数为对比学习损失函数,用于更新所述编码器的参数。
在一个实施例中,处理器执行计算机程序时还实现:使用所述伪平行语料,通过第一损失函数和第二损失函数对所述基础翻译模型进行多任务训练,以获取目标翻译模型;其中,所述第一损失函数为对比学习损失函数,用于更新所述编码器的参数,所述第二损失函数用于更新所述编码器和所述解码器的参数。
在一个实施例中,处理器执行计算机程序时还实现:构建所述伪平行语料中的伪源端语料的正例语料和负例语料;使用所述伪源端语料、所述正例语料以及所述负例语料,通过第一损失函数对所述编码器进行训练;其中,训练目标为最大化所述伪源端语料和所述正例语料的向量表示之间的相似度,最小化所述伪源端语料和所述负例语料的向量表示之间的相似度。
在一个实施例中,处理器执行计算机程序时还实现:将所述伪源端语料、所述正例语料以及所述负例语料输入至所述编码器中,得到所述伪源端语料对应的第一向量表示、所述正例语料对应的第二向量表示以及所述负例语料对应的第三向量表示;根据所述第一向量表示和所述第二向量表示,确定第一损失函数的第一损失值,并基于所述第一损失值更新所述编码器的参数,直至所述第一损失函数的第一损失值满足收敛条件;根据所述第一向量表示和所述第三向量表示,确定所述第一损失函数的第二损失值,并基于所述第二损失值更新所述编码器的参数,直至所述第一损失函数的第二损失值满足收敛条件。
在一个实施例中,当所述原始语料为单语语料时,处理器执行计算机程序时还实现:将所述单语语料对应的替换语料作为伪源端语料以及将所述单语语料作为伪目标端语料,组成伪平行语料。
在一个实施例中,当所述原始语料为平行语料时,处理器执行计算机程序时还实现:将所述平行语料中源端语料对应的替换语料作为伪源端语料,以及将所述平行语料中目标端语料作为伪目标端语料,组成伪平行语料。
可选地,所述正例语料为所述伪目标端语料。
可选地,所述负例语料为其它伪平行语料中的伪目标端语料。
可选地,所述替换语料中的各目标词汇的语种至少部分不同。
在一个实施例中,还提供了一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时实现:
获取至少一个原始语料;
将所述原始语料中源端语料的至少一个原始词汇对齐替换为同含义的目标词汇,得到所述原始语料对应的替换语料;其中,所述原始词汇与所述目标词汇的语种不同;
基于所述原始语料与所述替换语料构造伪平行语料,并使用所述伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型。
上述实施例中提供的翻译模型的训练装置、设备以及存储介质可执行本公开任意实施例所提供的翻译模型的训练方法,具备执行该方法相应的功能模块和效果。未在上述实施例中详尽描述的技术细节,可参见本公开任意实施例所提供的翻译模型的训练方法。
根据本公开的一个或多个实施例,提供一种翻译模型的训练方法,包括:
获取至少一个原始语料;
将所述原始语料中源端语料的至少一个原始词汇对齐替换为同含义的目标词汇,得到所述原始语料对应的替换语料;其中,所述原始词汇与所述目标词汇的语种不同;
基于所述原始语料与所述替换语料构造伪平行语料,并使用所述伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型。
可选地,所述原始语料包括单语语料和/或平行语料;其中,所述单语语料为所述源端语料,所述平行语料包括成对的源端语料和目标端语料。
可选地,所述基础翻译模型包括编码器和解码器;
根据本公开的一个或多个实施例,提供了如上的翻译模型的训练方法,还包括:使用所述伪平行语料,通过第一损失函数对所述基础翻译模型的编码器进行训练;其中,所述第一损失函数为对比学习损失函数,用于更新所述编码器的参数。
根据本公开的一个或多个实施例,提供了如上的翻译模型的训练方法,还包括:使用所述伪平行语料,通过第一损失函数和第二损失函数对所述基础翻译模型进行多任务训练,以获取目标翻译模型;其中,所述第一损失函数为对比学习损失函数,用于更新所述编码器的参数,所述第二损失函数用于更新所述编码器和所述解码器的参数。
根据本公开的一个或多个实施例,提供了如上的翻译模型的训练方法,还包括:构建所述伪平行语料中的伪源端语料的正例语料和负例语料;使用所述伪源端语料、所述正例语料以及所述负例语料,通过第一损失函数对所述编码器进行训练;其中,训练目标为最大化所述伪源端语料和所述正例语料的向量表示之间的相似度,最小化所述伪源端语料和所述负例语料的向量表示之间的相似度。
根据本公开的一个或多个实施例,提供了如上的翻译模型的训练方法,还包括:将所述伪源端语料、所述正例语料以及所述负例语料输入至所述编码器中,得到所述伪源端语料对应的第一向量表示、所述正例语料对应的第二向量表示以及所述负例语料对应的第三向量表示;根据所述第一向量表示和所述第二向量表示,确定第一损失函数的第一损失值,并基于所述第一损失值更新所述编码器的参数,直至所述第一损失函数的第一损失值满足收敛条件;根据所述第一向量表示和所述第三向量表示,确定所述第一损失函数的第二损失值,并基于所述第二损失值更新所述编码器的参数,直至所述第一损失函数的第二损失值满足收敛条件。
根据本公开的一个或多个实施例,提供了如上的翻译模型的训练方法,还包括:当所述原始语料为单语语料时,将所述单语语料对应的替换语料作为伪源端语料以及将所述单语语料作为伪目标端语料,组成伪平行语料。
根据本公开的一个或多个实施例,提供了如上的翻译模型的训练方法,还包括:当所述原始语料为平行语料时,将所述平行语料中源端语料对应的替换语料作为伪源端语料,以及将所述平行语料中目标端语料作为伪目标端语料,组成伪平行语料。
可选地,所述正例语料为所述伪目标端语料。
可选地,所述负例语料为其它伪平行语料中的伪目标端语料。
可选地,所述替换语料中的各目标词汇的语种至少部分不同。
此外,虽然采用特定次序描绘了各操作,但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序来执行。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了多个实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的一些特征还可以组合地实现在单个实施例中。相反地,在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。

Claims (14)

  1. 一种翻译模型的训练方法,包括:
    获取至少一个原始语料;
    将所述原始语料中源端语料的至少一个原始词汇对齐替换为同含义的目标词汇,得到所述原始语料对应的替换语料;其中,所述原始词汇与所述目标词汇的语种不同;
    基于所述原始语料与所述替换语料构造伪平行语料,并使用所述伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型。
  2. 根据权利要求1所述的方法,其中,所述原始语料包括单语语料和平行语料中的至少之一;其中,所述单语语料为所述源端语料,所述平行语料包括成对的源端语料和目标端语料。
  3. 根据权利要求1所述的方法,其中,所述基础翻译模型包括编码器和解码器;
    所述使用所述伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型,包括:
    使用所述伪平行语料,通过第一损失函数对所述基础翻译模型的编码器进行训练;其中,所述第一损失函数为对比学习损失函数,用于更新所述编码器的参数。
  4. 根据权利要求3所述的方法,其中,所述使用所述伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型,包括:
    使用所述伪平行语料,通过所述第一损失函数和第二损失函数对所述基础翻译模型进行多任务训练,以获取所述目标翻译模型;其中,所述第二损失函数用于更新所述编码器和所述解码器的参数。
  5. 根据权利要求3所述的方法,其中,所述使用所述伪平行语料,通过第一损失函数对所述基础翻译模型的编码器进行训练,包括:
    构建所述伪平行语料中的伪源端语料的正例语料和负例语料;
    使用所述伪源端语料、所述正例语料以及所述负例语料,通过所述第一损失函数对所述编码器进行训练;其中,训练目标为最大化所述伪源端语料和所述正例语料的向量表示之间的相似度,最小化所述伪源端语料和所述负例语料的向量表示之间的相似度。
  6. 根据权利要求5所述的方法,其中,所述使用所述伪源端语料、所述正例语料以及所述负例语料,通过所述第一损失函数对所述编码器进行训练,包括:
    将所述伪源端语料、所述正例语料以及所述负例语料输入至所述编码器中,得到所述伪源端语料对应的第一向量表示、所述正例语料对应的第二向量表示以及所述负例语料对应的第三向量表示;
    根据所述第一向量表示和所述第二向量表示,确定所述第一损失函数的第一损失值,并基于所述第一损失值更新所述编码器的参数,直至所述第一损失函数的第一损失值满足收敛条件;
    根据所述第一向量表示和所述第三向量表示,确定所述第一损失函数的第二损失值,并基于所述第二损失值更新所述编码器的参数,直至所述第一损失函数的第二损失值满足收敛条件。
  7. 根据权利要求5所述的方法,其中,在所述原始语料为单语语料的情况下,所述基于所述原始语料与所述替换语料构造伪平行语料,包括:
    将所述单语语料对应的替换语料作为所述伪源端语料以及将所述单语语料作为伪目标端语料,组成所述伪平行语料。
  8. 根据权利要求5所述的方法,其中,在所述原始语料为平行语料的情况下,所述基于所述原始语料与所述替换语料构造伪平行语料,包括:
    将所述平行语料中源端语料对应的替换语料作为所述伪源端语料,以及将所述平行语料中目标端语料作为伪目标端语料,组成所述伪平行语料。
  9. 根据权利要求7或8所述的方法,其中,所述正例语料为所述伪目标端语料。
  10. 根据权利要求7或8所述的方法,其中,所述负例语料为其它伪平行语料中的伪目标端语料。
  11. 根据权利要求1至8中任一项所述的方法,其中,所述替换语料中的各目标词汇的语种至少部分不同。
  12. 一种翻译模型的训练装置,包括:
    获取模块,设置为获取至少一个原始语料;
    替换模块,设置为将所述原始语料中源端语料的至少一个原始词汇对齐替换为同含义的目标词汇,得到所述原始语料对应的替换语料;其中,所述原始词汇与所述目标词汇的语种不同;
    构造模块,设置为基于所述原始语料与所述替换语料构造伪平行语料;
    训练模块,用于使用所述伪平行语料对预设的基础翻译模型进行训练,以获取目标翻译模型。
  13. 一种翻译模型的训练设备,包括存储器和处理器,所述存储器存储有计算机程序,其中,所述处理器执行所述计算机程序时实现权利要求1至11中任一项所述的翻译模型的训练方法。
  14. 一种计算机可读存储介质,存储有计算机程序,其中,所述计算机程序被处理器执行时实现权利要求1至11中任一项所述的翻译模型的训练方法。