CN116167386A - Training method for speech translation model, speech translation method and storage medium - Google Patents


Info

Publication number
CN116167386A
CN116167386A
Authority
CN
China
Prior art keywords: translation, data, network, source language, speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310351451.9A
Other languages
Chinese (zh)
Inventor
雷易锟
薛征山
熊德意
李丙春
刘静
林晓东
史庭训
张晓雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202310351451.9A priority Critical patent/CN116167386A/en
Publication of CN116167386A publication Critical patent/CN116167386A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application is applicable to the technical field of information processing, and provides a training method for a speech translation model, a speech translation method and a storage medium. According to the method, a training set is obtained, and each training sample comprises source language speech data and corresponding source language text data; the speech translation model is trained based on a plurality of training samples until the speech translation model converges; by constructing at least one loss function based on the similarity of the first translation result and the second translation result, the gap between the quality of text translation and that of speech translation can be narrowed according to the loss function in the optimization process, so that the semantic extraction capability in text translation is transferred to speech translation, speech translation can continuously learn the semantic extraction capability of text translation based on the same corpus of each training sample, and the speech translation quality is improved.

Description

Training method for speech translation model, speech translation method and storage medium
Technical Field
The application belongs to the technical field of information processing, and particularly relates to a training method of a speech translation model, a speech translation method and a storage medium.
Background
With the rapid development of the economy and society, demand for business and leisure travel has increased rapidly, and language barriers often arise in cross-regional travel. Although text-oriented translation technology is well developed and mature, it lacks flexibility in scenarios such as conversations and conferences, so a translation technology that translates speech in a source language into text in a target language (speech translation, Speech Translation, ST) is required.
At present, the semantic complexity of spoken expression in daily communication is generally greater than that of written expression, so there is a large modality gap between speech and text; the translation results produced by speech translation technology therefore have difficulty reproducing the original semantics of the spoken expression, which affects the speech translation quality. Therefore, how to improve the quality of speech translation is an urgent problem to be solved.
Disclosure of Invention
In view of this, the embodiments of the present application provide a training method of a speech translation model, a speech translation method, and a storage medium, so as to solve the problem of poor quality of the existing speech translation.
A first aspect of an embodiment of the present application provides a method for training a speech translation model, including:
Acquiring a training set, wherein the training set comprises a plurality of training samples, and the training samples comprise source language voice data and corresponding source language text data;
training the speech translation model based on a plurality of training samples until the speech translation model converges;
in the training process, the input data of the voice translation model comprises source language voice data and corresponding source language text data, the output data of the voice translation model comprises a first translation result and a second translation result which face a target language, the first translation result is a result output by the voice translation model based on the input source language voice data, and the second translation result is a result output by the voice translation model based on the input source language text data; at least one loss function of the voice translation model is constructed based on the similarity of the first translation result and the second translation result.
According to the training method for a speech translation model provided by the first aspect of the embodiments of the present application, a training set is obtained, and each training sample comprises source language voice data and corresponding source language text data; the speech translation model is trained based on a plurality of training samples until the speech translation model converges; by constructing at least one loss function based on the similarity of the first translation result and the second translation result, the gap between the quality of text translation and that of speech translation can be narrowed according to the loss function in the optimization process, so that the semantic extraction capability in text translation is transferred to speech translation, speech translation can continuously learn the semantic extraction capability of text translation based on the same corpus of each training sample, and the speech translation quality is improved.
A second aspect of an embodiment of the present application provides a speech translation method based on a speech translation model, including:
inputting the source language voice data into a voice translation model to obtain voice translation data of a target language;
the speech translation model comprises a speech coding network and a translation network, and the speech coding network and the translation network are trained based on the training method provided by the first aspect.
According to the voice translation method based on the voice translation model provided by the second aspect, voice translation is carried out through the trained voice translation model; benefiting from the targeted training of the voice translation model on semantic extraction capability and the balanced migration of target words and non-target words, the semantic extraction capability and translation accuracy of the voice translation model are greatly improved, and voice translation data with high semantic fidelity and linguistic accuracy can be obtained.
A third aspect of the embodiments of the present application provides a computer readable storage medium storing a computer program, which when executed by a processor, implements the steps of the training method provided in the first aspect or the speech translation method provided in the second aspect of the embodiments of the present application.
It will be appreciated that the advantages of the third aspect may be found in the relevant description of the first or second aspects, and are not described in detail herein.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the following description briefly introduces the drawings that are needed in the embodiments or in the description of the prior art. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;
fig. 2 is a schematic architecture diagram of an operating system running in a terminal device according to an embodiment of the present application;
FIG. 3 is a first flowchart of a training method of a speech translation model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a first architecture of a speech translation model in training provided by embodiments of the present application;
FIG. 5 is a schematic diagram of a second architecture of a speech translation model in training provided by embodiments of the present application;
FIG. 6 is a second flowchart of a training method of a speech translation model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a third architecture of a speech translation model in training provided by embodiments of the present application;
FIG. 8 is a schematic diagram of an architecture of a trained speech translation model provided by embodiments of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted as "when..once" or "in response to a determination" or "in response to detection" depending on the context. Similarly, the phrase "if a determination" or "if a [ described condition or event ] is detected" may be interpreted in the context of meaning "upon determination" or "in response to determination" or "upon detection of a [ described condition or event ]" or "in response to detection of a [ described condition or event ]".
In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In application, the semantic complexity of spoken expression in daily communication is generally greater than that of written expression, so there is a large modality gap between speech and text; the translation results produced by speech translation technology therefore have difficulty reproducing the original semantics of the spoken expression, which affects the speech translation quality. Therefore, how to improve the quality of speech translation is an urgent problem to be solved.
Aiming at the above technical problems, the embodiments of the present application provide a training method for a speech translation model: a training set is acquired, wherein each training sample comprises source language speech data and corresponding source language text data; the speech translation model is trained based on a plurality of training samples until the speech translation model converges; by constructing at least one loss function based on the similarity of the first translation result and the second translation result, the gap between the quality of text translation and that of speech translation can be narrowed according to the loss function in the optimization process, so that the semantic extraction capability in text translation is transferred to speech translation, speech translation can continuously learn the semantic extraction capability of text translation based on the same corpus of each training sample, and the speech translation quality is improved.
The training method of the speech translation model provided by the embodiment of the application can be applied to terminal equipment. The terminal device may be a cell phone, a tablet, a wearable device, a vehicle-mounted device, an augmented reality (Augmented Reality, AR)/virtual reality (Virtual Reality, VR) device, a notebook, an ultra-mobile personal computer (Ultra-Mobile Personal Computer, UMPC), a netbook, a personal digital assistant (Personal Digital Assistant, PDA), etc. The embodiment of the application does not limit the specific type of the terminal equipment.
Fig. 1 exemplarily shows a schematic structure of a terminal device 1, and the terminal device 1 includes a processor 10, a memory 20, an audio module 30, a camera module 40, a sensor module 50, an input module 60, a display module 70, a wireless communication module 80, a power module 90, and the like. The audio module 30 may include a speaker 31, a microphone 32, and the like, the camera module 40 may include a short-focus camera 41, a long-focus camera 42, a flash 43, and the like, the sensor module 50 may include an infrared sensor 51, an acceleration sensor 52, a position sensor 53, a fingerprint sensor 54, an iris sensor 55, and the like, the input module 60 may include a touch panel 61, an external input unit 62, and the like, and the Wireless communication module 80 may include Wireless communication units such as bluetooth, optical Wireless communication (Optical Wireless), mobile communication (Mobile Communications), wireless local area network (Wireless Local Area Network, WLAN), near field communication (Near Field Communication, NFC), and ZigBee (ZigBee).
In application, the processor 10 may be a central processing unit (Central Processing Unit, CPU), and may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
In applications, the memory 20 may in some embodiments be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory 20 may in other embodiments also be an external storage device of the terminal device, such as a plug-in hard disk provided on the terminal device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like. Further, the memory 20 may also include both an internal storage unit of the terminal device and an external storage device. The memory 20 is used for storing a computer program 21 such as an operating system, an application program, a boot loader (BootLoader), and the like. The memory 20 may also be used to temporarily store data that has been output or is to be output.
In application, the display module 70 may be a straight screen, a curved screen, or a flexible screen, and in particular may be a folded screen, where the folded screen may include at least one flexible screen, or the folded screen may include at least one flexible screen and at least one straight screen or curved screen, which in the embodiment of the present application does not limit the specific type of the display module 70.
It is to be understood that the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the terminal device 1. In other embodiments of the present application, the terminal device 1 may comprise more or less components than shown, or may combine certain components, or different components, e.g. may also comprise a graphics processor or the like. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
As shown in fig. 2, the architecture of the operating system running in the terminal device 1 provided in the embodiment of the present application may be a layered architecture, where the operating system may be divided into multiple layers, and the layers communicate with each other through a software interface. In some embodiments, the operating system may be, from top to bottom, an application layer 100, a framework layer 200, a system runtime layer 300, a hardware abstraction layer 400, and a Linux kernel layer 500.
It should be noted that the type of the operating system may be an Android system, or an operating system that is customized and developed based on the Android system, or other operating systems of different types, and the specific type of the operating system is not limited in this embodiment of the present application.
In application, the following description is presented in terms of five layers of the operating system:
the application layer 100 may include a built-in application 120 built in the system and an upper application 110 provided by a third party application provider, and an application program of the application layer 100 may directly interact with a user to implement different functions provided by the application program; for example, the application layer 100 may specifically include built-in applications such as mail, phone, calendar, camera, contacts, and bluetooth, and may include upper-layer applications such as map location, food takeaway, and video playback;
the framework layer (Application Framwork) 200 may include an application programming interface (Application Programming Interface, API) and a programming framework, which may be used to provide an interface for developing application programs for developers of an operating system, and may also be used to invoke corresponding basic services when implementing different functions for applications of the application layer; for example, the framework layer may include different types of application programming interfaces for window managers, content providers, telephony managers, location managers, view systems, and the like.
The window manager is used for managing window programs, can be used for acquiring the size of a window, and can be used for judging whether a status bar exists, whether a screen is locked, whether the screen is intercepted or not and the like;
the content provider is used for storing and acquiring data, and enabling the data to be accessed by an application program, wherein the data can comprise video, images, audio, call records, contacts, browsing history, bookmarks and the like;
the telephone manager is used for providing communication functions of the terminal device 1, such as management of call states including call on and call off, etc.;
the location manager is configured to obtain location information of the terminal device 1, and may include satellite location information (acquired through a global positioning system), network location information (acquired through a network location service), and converged location information (acquired through a converged location service) and other different types of location information;
the view system is used to provide visual controls, such as controls to display text and controls to display pictures, and the like, as well as to build applications to display the application layer 100. The view system may run multiple visual controls simultaneously, causing the terminal device to display multiple views simultaneously, e.g., a view of text and a view of a picture simultaneously.
The system runtime layer (Native) 300 may include a C/C++ library 310 and an Android runtime 320, where the C/C++ library 310 may include a drawing function library, a font engine, a rendering engine, a multimedia library, a database engine, and the like. The drawing function library can be OpenGL for Embedded Systems 3D (a 3D open graphics library running on embedded systems); the font engine is used for providing different fonts, and can be FreeType (a portable font engine); the rendering engine is used for rendering two-dimensional graphics or three-dimensional graphics, and can be Skia Graphics Library (an engine for rendering two-dimensional graphics); the multimedia library (Media Framework) is used for supporting the playing, recording and playback of audio and video in different formats; the database engine is used for providing the storage function of different types of databases, so that different types of data can be stored in different types of databases according to actual needs, or different types of data can be uniformly stored in one database.
The Android Runtime 320 may include a Core library (Core library) and an Android Runtime environment (ART), wherein the Dalvik virtual machine is replaced with ART after the Android 5.0 system. The core library provides most of the functionality of the Java language core library so that developers can write android applications using Java language. In contrast to Java virtual machines (Java Virtual Machine, JVM), dalvik virtual machines are custom made specifically for mobile devices, allowing multiple instances of virtual machines to run simultaneously in limited memory, and each Dalvik application executes as a separate Linux process. The independent process can prevent all programs from being closed when the virtual machine crashes. And the mechanism of ART to replace Dalvik virtual machines is different from Dalvik. Under Dalvik, the bytecode needs to be converted into machine code by a just-in-time compiler every time the application runs, which can slow down the running efficiency of the application, while in the ART environment, the bytecode is pre-compiled into machine code when the application is first installed, so that the bytecode becomes a real local application.
The hardware abstraction layer (Hardware Abstraction Layer, HAL) 400 is an interface layer between the operating system kernel and the hardware circuitry, which aims at abstracting the hardware, and conceals the hardware interface details of a specific platform in order to protect the intellectual property rights of hardware manufacturers, thereby providing a virtual hardware platform for the operating system, so that the virtual hardware platform has hardware independence and can be transplanted on various platforms. From the perspective of software and hardware testing, the software and hardware testing can be completed based on the hardware abstraction layer, so that the software and hardware testing can be performed in parallel.
The Linux kernel layer 500 may be used to provide system services of an operating system, including security services of the operating system, memory management, process management, network protocol stack, and driving model.
It will be appreciated that FIG. 2 merely illustrates one example architecture of the operating system; the operating system may also adopt a four-layer or six-layer architecture, and the embodiment of the present application does not limit the number of architecture layers or the specific architecture of the operating system.
As shown in fig. 3, the training method of the speech translation model provided in the embodiment of the present application includes the following steps S301 to S302:
step 301, a training set is obtained, wherein the training set comprises a plurality of training samples, and the training samples comprise source language voice data and corresponding source language text data.
In the application, before the speech translation model is trained, an input training set of the speech translation model is obtained. The training set can comprise a plurality of training samples; each training sample can consist of source language voice data and source language text data for a particular segment of corpus, and the corpora corresponding to different training samples can be different. It should be noted that, when the training set is obtained, the training samples may be screened, specifically according to the noise level of the source language voice data of each training sample: the source language voice data of each training sample is input into a trained noise discrimination model, the noise discrimination model outputs the noise level of the corresponding source language voice data, and if the noise level reaches a preset noise level, the corresponding training sample is discarded, so as to improve the usability of the training samples and the training effect of the speech translation model.
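By way of illustration only, the sample-screening step described above might look like the following sketch; the noise_model interface, the dictionary keys and the threshold value are assumptions made for the example rather than details taken from the embodiment.

```python
import torch

def filter_training_set(samples, noise_model, noise_threshold=0.8):
    """Keep only samples whose speech is below a preset noise level.

    `samples` is assumed to be a list of dicts with keys "speech"
    (waveform/feature tensor) and "text" (source-language transcript);
    `noise_model` is a hypothetical trained noise discrimination model
    that maps speech to a scalar noise level.
    """
    kept = []
    with torch.no_grad():
        for sample in samples:
            noise_level = noise_model(sample["speech"]).item()
            if noise_level < noise_threshold:   # discard overly noisy samples
                kept.append(sample)
    return kept
```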
Step S302, training a speech translation model based on a plurality of training samples until the speech translation model converges;
In the training process, input data of the voice translation model comprises source language voice data and corresponding source language text data, output data of the voice translation model comprises a first translation result and a second translation result which face a target language, the first translation result is a result output by the voice translation model based on the input source language voice data, and the second translation result is a result output by the voice translation model based on the input source language text data; at least one loss function of the voice translation model is constructed based on the similarity of the first translation result and the second translation result.
In the application, in the training stage, the voice translation model is used for translating the source language voice data toward the target language and outputting a first translation result, and for translating the source language text data toward the target language and outputting a second translation result; that is, the voice translation model performs both speech translation and text translation. It should be noted that, compared with speech translation, text translation suffers less noise interference and has lower semantic complexity, and semantic information is easily extracted from the source language text data, so the quality of text translation is generally better than that of speech translation.
In the application, at least one loss function is constructed based on the similarity of the first translation result and the second translation result, and the gap between the quality of text translation and that of speech translation can be narrowed according to the loss function in the optimization process, so that the semantic extraction capability in text translation is transferred to speech translation, speech translation can continuously learn the semantic extraction capability of text translation based on the same corpus, and the speech translation quality is improved.
In one embodiment, the Speech translation model includes a Speech encoding network (Speech Encoder), a Text encoding network (Text Encoder), and a translation network;
the input data of the voice coding network comprises source language voice data, and the output data of the voice coding network comprises source language voice vector data;
the input data of the text encoding network comprises source language text data, and the output data of the text encoding network comprises source language text vector data;
the input data of the translation network comprises source language voice vector data and source language text vector data, and the output data of the translation network comprises a first translation result and a second translation result which face a target language; the first translation result is a result output by the translation network based on the input source language voice vector data, and the second translation result is a result output by the translation network based on the input source language text vector data.
Fig. 4 illustrates an architectural diagram when a speech translation model includes a speech coding network, a text coding network, and a translation network.
In application, the specific architecture and input-output relationship of the speech translation model may be shown in fig. 4, where the speech coding network is used to code the source language speech data into source language speech vector data (the vector data is a machine language that can be read and processed by the terminal device); similarly, the text encoding network is used for encoding the source language text data into the source language text vector data, so that the translation network can conveniently translate the source language speech vector data and the source language text vector data.
In application, the speech coding network, the text coding network and the translation network can be built from network architectures such as convolutional neural networks (Convolutional Neural Networks, CNN) and recurrent neural networks (Recurrent Neural Network, RNN). Specifically, the speech coding network can be built from a speech recognition algorithm and/or CNN/RNN, and in particular from the speech recognition pre-training algorithm wav2vec 2.0 followed by two layers of one-dimensional CNNs; the text coding network can be built from a natural language processing (Natural Language Processing, NLP) algorithm and/or CNN/RNN, and in particular from the word embedding algorithm Word Embedding (one branch of NLP algorithms); the translation network can be built from NLP and/or CNN/RNN, and in particular from a multi-layer Transformer translation algorithm (one branch of NLP algorithms).
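For illustration, a minimal PyTorch-style sketch of the two coding networks is given below; the feature dimension, model width, and the use of a plain convolutional stack in place of a full wav2vec 2.0 front end are assumptions made for the example, not the patented implementation.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Maps source-language speech features to speech vector data.
    In the described design this would sit on top of wav2vec 2.0; here a plain
    convolutional stack stands in for that pre-trained front end."""
    def __init__(self, in_dim=80, d_model=512):
        super().__init__()
        self.conv = nn.Sequential(       # two one-dimensional CNN layers, as in the description
            nn.Conv1d(in_dim, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
        )

    def forward(self, feats):                    # feats: (batch, frames, in_dim)
        x = self.conv(feats.transpose(1, 2))     # -> (batch, d_model, frames')
        return x.transpose(1, 2)                 # source-language speech vector data

class TextEncoder(nn.Module):
    """Maps source-language text (token ids) to text vector data via word embeddings."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids):                # token_ids: (batch, tokens)
        return self.embed(token_ids)             # source-language text vector data
```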
In one embodiment, the translation network includes a common encoding sub-network (Shared Encoder) and a common decoding sub-network (Shared Decoder);
the input data of the general coding sub-network comprises source language voice vector data and source language text vector data, and the output data of the general coding sub-network comprises source language voice semantic data and source language text semantic data; the source language voice semantic data is a result output by the universal coding sub-network based on the input source language voice vector data, and the source language text semantic data is a result output by the universal coding sub-network based on the input source language text vector data;
the input data of the general decoding sub-network comprises source language voice semantic data and source language text semantic data, and the output data of the general decoding sub-network comprises a first translation result and a second translation result which face a target language; the first translation result is a result output by the general decoding sub-network based on the input source language voice semantic data, and the second translation result is a result output by the general decoding sub-network based on the input source language text semantic data.
Fig. 5 illustrates an architectural diagram of a translation network including a generic encoding sub-network and a generic decoding sub-network.
In application, the architecture and input-output relationships when the translation network includes a generic encoding sub-network and a generic decoding sub-network may be as shown with reference to fig. 5. The universal coding sub-network is used for extracting semantic information in the source language voice vector data, and integrating the semantic information with the source language voice vector data to obtain source language voice semantic data (still being vector form data); and the method is also used for extracting semantic information in the source language text vector data and integrating the semantic information with the source language text vector data to obtain the source language text semantic data (still in vector form data).
In application, the general coding sub-network can be built up by a multi-layer translation coding algorithm Transformer Encoder; the generic decoding subnetwork may be built up by a multi-layer translation decoding algorithm Transformer Decoder.
The network architecture of each network can be determined according to the actual algorithm precision requirement and the platform computational load, and the specific network architectures of the voice coding network, the text coding network, the general coding sub-network and the general decoding sub-network are not limited in the embodiment of the application.
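Continuing the sketch above, the translation network with its shared (universal) coding and decoding sub-networks could be approximated as follows; the layer counts, head counts and vocabulary size are illustrative assumptions rather than values specified by the embodiment.

```python
import torch
import torch.nn as nn

class TranslationNetwork(nn.Module):
    """Shared, modality-agnostic Transformer encoder-decoder. It consumes either
    speech vector data or text vector data and produces target-language logits."""
    def __init__(self, d_model=512, nhead=8, num_layers=6, tgt_vocab=32000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(enc_layer, num_layers)  # extracts semantic data
        self.shared_decoder = nn.TransformerDecoder(dec_layer, num_layers)  # generates the translation
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.out_proj = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_vectors, tgt_ids):
        memory = self.shared_encoder(src_vectors)          # speech/text semantic data
        tgt = self.tgt_embed(tgt_ids)
        seq_len = tgt_ids.size(1)
        causal_mask = torch.triu(                          # prevent attending to future words
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=tgt_ids.device),
            diagonal=1,
        )
        hidden = self.shared_decoder(tgt, memory, tgt_mask=causal_mask)
        return self.out_proj(hidden)                       # logits over the target vocabulary
```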
In the application, a training set is acquired, and each training sample comprises source language voice data and corresponding source language text data; the speech translation model is trained based on a plurality of training samples until the speech translation model converges; by constructing at least one loss function based on the similarity of the first translation result and the second translation result, the gap between the quality of text translation and that of speech translation can be narrowed according to the loss function in the optimization process, so that the semantic extraction capability in text translation is transferred to speech translation, speech translation can continuously learn the semantic extraction capability of text translation based on the same corpus of each training sample, and the speech translation quality is improved.
As shown in fig. 6, in one embodiment, based on the embodiment corresponding to fig. 3, the following steps S601 to S605 are included:
step S601, a training set is obtained, wherein the training set comprises a plurality of training samples, and the training samples comprise source language voice data and corresponding source language text data.
In application, the training method provided in step S601 may refer to the training method provided in step S301, which is not described herein.
Step S602, obtaining a target translation result of each training sample in a target language.
In the application, before training the speech translation model, a target translation result of each training sample facing the target language can be obtained and used as a true value of the speech translation model to be compared with the first translation result and the second translation result.
Step S603, constructing a first loss function based on the similarity of the first translation result and the target translation result;
step S604, constructing a second loss function based on the similarity of the second translation result and the target translation result;
step S605, optimizing the speech translation model according to the first loss function and the second loss function until the speech translation model converges.
In the application, the first loss function may be constructed based on the similarity between the first translation result and the target translation result, or the first prediction correct probability of the first translation result may be determined based on the target translation result, and the first loss function may be constructed, where a calculation formula of the first loss function is:
L_ST = -∑_{i=1}^{|y|} log p(y_i | y_{<i}, S)

wherein L_ST denotes the first loss function, y denotes the target translation result, S denotes the first translation result, |y| denotes the number of words in the corresponding training sample, i denotes the position of any word, and p(y_i | y_{<i}, S) denotes the first predicted correct probability of the word at the i-th position, i.e. its similarity to the target translation result.
In the application, the second loss function may be constructed based on the similarity between the second translation result and the target translation result, or the second predicted correct probability of the second translation result may be determined based on the target translation result and the second loss function may be constructed, where a calculation formula of the second loss function is as follows:
L_MT = -∑_{i=1}^{|y|} log p(y_i | y_{<i}, T)

wherein L_MT denotes the second loss function, y denotes the target translation result, T denotes the second translation result, |y| denotes the number of words in the corresponding training sample, i denotes the position of any word, and p(y_i | y_{<i}, T) denotes the second predicted correct probability of the word at the i-th position, i.e. its similarity to the target translation result.
In the application, the voice translation model is optimized according to the first loss function and the second loss function, so that the first translation result and the second translation result continuously approach the target translation result; the speech translation capability and the text translation capability of the voice translation model are thereby improved at the same time, and, together with the continuously improved semantic extraction capability, the speech translation quality can be further improved.
In one embodiment, step S605 includes:
optimizing the speech coding network and the translation network according to the first loss function;
the text encoding network and the translation network are optimized according to the second penalty function.
In application, when the speech translation model comprises a speech coding network, a text coding network and a translation network, the speech coding network and the translation network can be optimized according to a first loss function, and the coding performance of the speech coding network and the speech translation capability of the translation network are improved in a targeted manner; and the text coding network and the translation network can be optimized according to the second loss function, so that the coding performance of the text coding network and the text translation capacity of the translation network are improved in a targeted manner.
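A minimal sketch of one joint training step covering steps S603 to S605 is shown below, reusing the SpeechEncoder, TextEncoder and TranslationNetwork sketches above. Because each loss only touches its own encoder branch plus the shared translation network, back-propagating their sum realizes the per-network optimization described above; the batch layout and the padding index are assumptions.

```python
import torch.nn.functional as F

def training_step(batch, speech_enc, text_enc, translator, optimizer):
    # batch is assumed to hold speech features, source text ids and target translation ids
    speech_vec = speech_enc(batch["speech"])            # source-language speech vector data
    text_vec = text_enc(batch["src_text"])              # source-language text vector data

    tgt_in, tgt_out = batch["tgt"][:, :-1], batch["tgt"][:, 1:]
    logits_speech = translator(speech_vec, tgt_in)      # first translation result (speech branch)
    logits_text = translator(text_vec, tgt_in)          # second translation result (text branch)

    # L_ST and L_MT: negative log-likelihood against the target translation result
    loss_st = F.cross_entropy(logits_speech.reshape(-1, logits_speech.size(-1)),
                              tgt_out.reshape(-1), ignore_index=0)
    loss_mt = F.cross_entropy(logits_text.reshape(-1, logits_text.size(-1)),
                              tgt_out.reshape(-1), ignore_index=0)

    loss = loss_st + loss_mt                            # optimize both branches jointly
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_st.item(), loss_mt.item()
```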
In one embodiment, a third loss function of the speech translation model is constructed based on the output probabilities of the positive and negative pairs of samples; the third loss function is used for optimizing the translation network;
any positive sample pair comprises a group of first translation result and second translation result with similarity larger than a preset similarity threshold value, and any negative sample pair comprises a group of first translation result and second translation result with similarity not larger than the preset similarity threshold value.
In the application, after all training samples are input into a voice translation model, the similarity of the corresponding first translation result and all second translation results is obtained for each first translation result, if the similarity is larger than a preset similarity threshold value, the corresponding first translation result and the corresponding second translation result are judged to be positive sample pairs, and if the similarity is not larger than the preset similarity threshold value, the corresponding first translation result and the corresponding second translation result are judged to be negative sample pairs.
In the application, after the first training of the speech translation model is completed, the current performance of the speech translation model can be evaluated before each training is started, specifically, the output probability of the speech translation model for outputting a positive sample pair and the output probability of the speech translation model for outputting a negative sample pair can be obtained, and a third loss function is constructed, wherein the calculation formula of the third loss function is as follows:
L_CCRD = -( E_{p(T,S)}[log h(T,S)] + N·E_{p(T)p(S)}[log(1 - h(T,S))] )

wherein L_CCRD denotes the third loss function, p(T,S) characterizes a positive sample pair, p(T)p(S) characterizes a negative sample pair, h(T,S) is used to evaluate the similarity of the first translation result and the second translation result, E_{p(T,S)}[log h(T,S)] characterizes the output probability of the positive sample pairs, and N·E_{p(T)p(S)}[log(1 - h(T,S))] characterizes the output probability of the negative sample pairs.
In the application, the third loss function is used for training the voice translation model, so that the performance similarity of the current voice translation model in voice translation and text translation can be effectively quantized, and the efficiency of migrating the semantic extraction capability in text translation to voice translation is improved.
In application, a lower threshold of the similarity between the first translation result and the second translation result can be set as an optimization target of the third loss function, and a calculation formula of the similarity between the first translation result and the second translation result is as follows:
MI(T,S) ≥ log(N) + E_{p(T,S)}[log h(T,S)] + N·E_{p(T)p(S)}[log(1 - h(T,S))]

wherein MI(T,S) characterizes the similarity of the first translation result and the second translation result, and N characterizes the number of negative sample pairs.
In application, the lower threshold of the similarity is determined through the similarity calculation formula, so that the optimization target of the performance similarity of the voice translation model in voice translation and text translation can be further quantized on the basis of effectively quantizing the performance similarity of the current voice translation model in voice translation and text translation, the performance of the voice translation model is accurately trained and accurately controlled, and the quality of voice translation is further improved.
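The third, contrastive-style loss could be sketched as follows. For simplicity this example treats the two branches of the same training sample as a positive pair and all cross-sample combinations within a batch as negative pairs, instead of the threshold-based pairing described above; the sigmoid critic for h(T,S) and the use of pooled branch representations are likewise assumptions.

```python
import torch

def contrastive_loss(speech_repr, text_repr):
    """speech_repr / text_repr: (batch, d_model) pooled representations of the
    first and second translation branches; row i of each tensor comes from the
    same training sample. Assumes batch size > 1 so negatives exist."""
    logits = speech_repr @ text_repr.t()                    # pairwise similarity scores
    h = torch.sigmoid(logits)                               # h(T, S) in [0, 1]

    batch = h.size(0)
    pos_mask = torch.eye(batch, dtype=torch.bool, device=h.device)

    pos_term = torch.log(h[pos_mask] + 1e-8).mean()         # E_{p(T,S)}[log h(T,S)]
    neg_term = torch.log(1 - h[~pos_mask] + 1e-8).mean()    # E_{p(T)p(S)}[log(1 - h(T,S))]
    n_neg = batch - 1                                       # negatives per sample

    return -(pos_term + n_neg * neg_term)                   # L_CCRD
```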
In one embodiment, a fourth penalty function of the speech translation model is constructed based on the predicted transition weights of the target words and the predicted transition weights of the non-target words; the fourth loss function is used for optimizing the translation network;
wherein, in the training process, a word set is provided; the word set includes a set of target words and records the correspondence between semantics and target words, a target word being the best translation result for the corresponding semantics. For example, a given semantics may correspond to four candidate words: "you", "me", "she" and "he"; among these, the semantics corresponds to the target word "me" and to the non-target words "you", "she" and "he". The predicted migration weight of the target word is determined according to the confidence/output probability of the target word in the first translation result and the second translation result; the predicted migration weight of the non-target words is determined according to the confidence/output probability of each non-target word in the first translation result and the second translation result.
In the application, after the first training of the speech translation model is completed, the current performance of the speech translation model can be evaluated before each training is started, specifically, the predicted migration weight of the target word and the predicted migration weight of the non-target word can be obtained, and a fourth loss function is constructed, wherein the calculation formula of the fourth loss function is as follows:
L_KD = p_t^T · log(p_t^T / p_t^S) + ∑_{i≠t} p_i^T · log(p_i^T / p_i^S)

wherein L_KD denotes the fourth loss function, t denotes the target word, and i (i ≠ t) indexes the non-target words (this i refers to the position of a non-target word and is therefore different from the i in the preceding formulas); p_t^T denotes the confidence/output probability of the target word in the second translation result, p_t^S denotes the confidence/output probability of the target word in the first translation result, p_i^T denotes the confidence/output probability of the non-target word at the i-th position in the second translation result, and p_i^S denotes the confidence/output probability of the non-target word at the i-th position in the first translation result; the first term characterizes the predicted migration weight of the target word, and the second term characterizes the predicted migration weight of the non-target words.
In the application, the voice translation model is trained through the fourth loss function, so that translation logic of target words and non-target words can be effectively transferred from text translation to voice translation, and the learning efficiency of voice translation is improved.
In an application, the training objective of the speech translation model for the fourth loss function may be: the predicted migration weight of the non-target word reaches the expected migration weight, and the calculation formula of the training target can be:
L_SDKD = TCK + β·NCK

wherein TCK refers to the predicted migration weight of the target word, NCK refers to the predicted migration weight of the non-target word, and β represents the expected migration weight; β may be, for example, 0.2, 0.3 or 0.4.
In the application, the higher the confidence of the target word in the second translation result, or the lower the confidence of the target word in the first translation result, the larger the predicted migration weight of the target word, and the migration of non-target words is correspondingly suppressed. Although the target word is the best translation result for the corresponding semantics, it has been found in the development of deep learning that slightly increasing the migration weight of non-target words can greatly improve model performance in specific scenarios. Therefore, by raising the migration weight of the non-target words to the expected migration weight, the migration weights of the target word and the non-target words can be balanced and the suppression of non-target word migration avoided, thereby effectively improving the accuracy of the speech translation model in both speech translation and text translation.
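A rough sketch of a word-level, decoupled distillation loss in the spirit of the fourth loss function and the L_SDKD objective is given below; the softmax-derived probabilities, the KL-style per-word terms and the value of β are assumptions made for the example rather than the patented formula.

```python
import torch
import torch.nn.functional as F

def migration_loss(logits_speech, logits_text, target_ids, beta=0.3, eps=1e-8):
    """logits_speech / logits_text: (batch, seq, vocab) outputs of the first and
    second translation branches; target_ids: (batch, seq) target translation."""
    p_s = F.softmax(logits_speech, dim=-1)        # probabilities of the first translation result
    p_t = F.softmax(logits_text, dim=-1)          # probabilities of the second translation result

    tgt_mask = F.one_hot(target_ids, p_s.size(-1)).bool()   # marks the target word per position

    kl = p_t * (torch.log(p_t + eps) - torch.log(p_s + eps))  # per-word KL contributions

    tck = kl[tgt_mask].sum()                      # target-word term (TCK)
    nck = kl[~tgt_mask].sum()                     # non-target-word term (NCK)

    n_tokens = target_ids.numel()
    return (tck + beta * nck) / n_tokens          # L_SDKD = TCK + beta * NCK (token-averaged)
```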
FIG. 7 illustrates a schematic diagram of a scenario in which a speech translation model is optimized.
In conjunction with the trained speech translation model shown in fig. 8, the speech translation method based on the speech translation model provided in the embodiment of the application includes:
step S801, inputting source language voice data into a voice translation model to obtain voice translation data of a target language;
the speech translation model comprises a speech coding network and a translation network, and the speech coding network and the translation network are trained based on the training method.
In the application, speech translation is performed by the trained voice translation model; benefiting from the targeted training of the voice translation model on semantic extraction capability and the balanced migration of target words and non-target words, the semantic extraction capability and translation accuracy of the voice translation model are greatly improved, so that voice translation data with high semantic fidelity and linguistic accuracy can be obtained.
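For completeness, a minimal greedy-decoding sketch of step S801 is shown below, reusing the SpeechEncoder and TranslationNetwork sketches above; the BOS/EOS token ids and the maximum output length are assumptions.

```python
import torch

@torch.no_grad()
def speech_translate(speech_feats, speech_enc, translator,
                     bos_id=1, eos_id=2, max_len=128):
    """speech_feats: (1, frames, feat_dim) source-language speech features.
    Returns a list of target-language token ids (the speech translation data)."""
    speech_vec = speech_enc(speech_feats)                    # speech vector data
    tokens = torch.tensor([[bos_id]], device=speech_feats.device)
    for _ in range(max_len):
        logits = translator(speech_vec, tokens)              # target-language logits
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True) # greedy choice of the next word
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return tokens[0, 1:].tolist()                            # drop the BOS token
```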
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above. The functional modules in the embodiment may be integrated in one processing module, or each module may exist alone physically, or two or more modules may be integrated in one module, where the integrated modules may be implemented in a form of hardware or a form of software functional modules. In addition, the specific names of the functional modules are only for distinguishing from each other, and are not used for limiting the protection scope of the application. For the specific working process of the modules in the above system, reference may be made to the corresponding process in the above training method or the above embodiment of the speech translation method, which is not described herein again.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements steps in embodiments of the training method or the speech translation method described above.
The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the present application implements all or part of the flow of the methods of the above embodiments, which may be accomplished by instructing the related hardware through a computer program; the computer program may be stored in a computer readable storage medium, and the computer program, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable storage medium may include at least: any entity or device capable of carrying the computer program code to the terminal device, a recording medium, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, a magnetic disk or an optical disk.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed terminal device and method may be implemented in other manners. For example, the above-described embodiments of the terminal device are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division in actual implementation, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or modules, which may be in electrical, mechanical or other forms.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A method for training a speech translation model, comprising:
acquiring a training set, wherein the training set comprises a plurality of training samples, and the training samples comprise source language voice data and corresponding source language text data;
training the speech translation model based on a plurality of training samples until the speech translation model converges;
in the training process, the input data of the speech translation model comprises the source language voice data and the corresponding source language text data, and the output data of the speech translation model comprises a first translation result and a second translation result facing a target language, the first translation result being a result output by the speech translation model based on the input source language voice data, and the second translation result being a result output by the speech translation model based on the input source language text data; at least one loss function of the speech translation model is constructed based on the similarity of the first translation result and the second translation result.
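A minimal sketch of one way the similarity-based loss of claim 1 could be realized, assuming the model exposes token-level output distributions for both branches; the function name and the tensors `speech_logits` and `text_logits` are hypothetical and not taken from the patent.

```python
import torch
import torch.nn.functional as F

def consistency_loss(speech_logits: torch.Tensor,
                     text_logits: torch.Tensor) -> torch.Tensor:
    """Measure how far the first translation result (speech branch) is from
    the second translation result (text branch) over the target vocabulary.

    Both inputs have shape (batch, target_len, vocab_size).
    """
    speech_log_probs = F.log_softmax(speech_logits, dim=-1)
    text_probs = F.softmax(text_logits, dim=-1)
    # KL(text branch || speech branch); a symmetric measure such as
    # Jensen-Shannon divergence would equally fit the claim wording.
    return F.kl_div(speech_log_probs, text_probs, reduction="batchmean")
```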
2. The training method of claim 1 wherein the speech translation model comprises a speech coding network, a text coding network, and a translation network;
the input data of the speech coding network comprises the source language voice data, and the output data of the speech coding network comprises source language voice vector data;
the input data of the text encoding network comprises source language text data, and the output data of the text encoding network comprises source language text vector data;
the input data of the translation network comprises the source language voice vector data and the source language text vector data, and the output data of the translation network comprises a first translation result and a second translation result which face a target language; the first translation result is a result output by the translation network based on the input source language voice vector data, and the second translation result is a result output by the translation network based on the input source language text vector data.
3. The training method of claim 2, wherein the translation network comprises a generic encoding sub-network and a generic decoding sub-network;
the input data of the generic encoding sub-network comprises the source language voice vector data and the source language text vector data, and the output data of the generic encoding sub-network comprises source language voice semantic data and source language text semantic data; the source language voice semantic data is a result output by the generic encoding sub-network based on the input source language voice vector data, and the source language text semantic data is a result output by the generic encoding sub-network based on the input source language text vector data;
the input data of the generic decoding sub-network comprises the source language voice semantic data and the source language text semantic data, and the output data of the generic decoding sub-network comprises the first translation result and the second translation result facing the target language; the first translation result is a result output by the generic decoding sub-network based on the input source language voice semantic data, and the second translation result is a result output by the generic decoding sub-network based on the input source language text semantic data.
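A hedged PyTorch sketch of the layout described in claims 2 and 3: separate speech and text coding networks feed a shared "generic" encoder-decoder that produces both translation results. Layer types, sizes and module names are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class SpeechTranslationModel(nn.Module):
    """Illustrative layout: two modality-specific encoders feed one shared
    (generic) encoder-decoder that emits both translation results."""

    def __init__(self, vocab_size=32000, d_model=512, n_mels=80, nhead=8, num_layers=6):
        super().__init__()
        # Speech coding network: acoustic features -> source language voice vector data.
        self.speech_encoder = nn.Sequential(
            nn.Linear(n_mels, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        # Text coding network: source-language tokens -> source language text vector data.
        self.text_encoder = nn.Embedding(vocab_size, d_model)
        # Generic encoding / decoding sub-networks shared by both branches.
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=nhead, batch_first=True),
            num_layers=num_layers)
        self.shared_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=nhead, batch_first=True),
            num_layers=num_layers)
        self.tgt_embedding = nn.Embedding(vocab_size, d_model)
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, src_tokens, tgt_tokens):
        tgt_embeds = self.tgt_embedding(tgt_tokens)
        # First branch: voice vector data -> voice semantic data -> first translation result.
        speech_semantic = self.shared_encoder(self.speech_encoder(speech_feats))
        first_result = self.output_proj(self.shared_decoder(tgt_embeds, speech_semantic))
        # Second branch: text vector data -> text semantic data -> second translation result.
        text_semantic = self.shared_encoder(self.text_encoder(src_tokens))
        second_result = self.output_proj(self.shared_decoder(tgt_embeds, text_semantic))
        return first_result, second_result
```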
4. A training method as claimed in claim 2 or 3, wherein the method further comprises:
obtaining a target translation result facing the target language for each training sample;
and the training of the speech translation model based on the plurality of training samples comprises:
constructing a first loss function based on the similarity of the first translation result and the target translation result;
constructing a second loss function based on the similarity of the second translation result and the target translation result;
and optimizing the speech translation model according to the first loss function and the second loss function.
5. The training method of claim 4, wherein the optimizing the speech translation model according to the first loss function and the second loss function comprises:
optimizing the speech coding network and the translation network according to the first loss function;
and optimizing the text coding network and the translation network according to the second loss function.
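A minimal sketch of claims 4 and 5 under the assumption that the first and second loss functions are token-level cross-entropy against the target translation result; the names `branch_losses` and `pad_id`, and the combined backward pass, are illustrative.

```python
import torch.nn.functional as F

def branch_losses(first_result, second_result, target_ids, pad_id=0):
    """first_result / second_result: (batch, tgt_len, vocab) logits;
    target_ids: (batch, tgt_len) reference translation token ids."""
    first_loss = F.cross_entropy(
        first_result.transpose(1, 2), target_ids, ignore_index=pad_id)
    second_loss = F.cross_entropy(
        second_result.transpose(1, 2), target_ids, ignore_index=pad_id)
    return first_loss, second_loss

# One plausible optimization step for claim 5: summing the two losses routes
# gradients of the first loss to the speech coding network and the shared
# translation network, and gradients of the second loss to the text coding
# network and the same shared translation network.
#   first_loss, second_loss = branch_losses(first_result, second_result, target_ids)
#   (first_loss + second_loss).backward()
#   optimizer.step()
```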
6. A training method as claimed in claim 2 or 3, characterized in that a third loss function of the speech translation model is constructed based on the output probabilities of positive sample pairs and negative sample pairs; the third loss function is used for optimizing the translation network;
wherein any positive sample pair comprises a first translation result and a second translation result whose similarity is greater than a preset similarity threshold, and any negative sample pair comprises a first translation result and a second translation result whose similarity is not greater than the preset similarity threshold.
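A hedged sketch of the third loss in claim 6, assuming positive and negative sample pairs are selected by a cosine-similarity threshold over pooled representations of the two translation results; the threshold, temperature and pooling are assumptions rather than values from the patent.

```python
import torch.nn.functional as F

def contrastive_loss(first_repr, second_repr, sim_threshold=0.5, temperature=0.1):
    """first_repr, second_repr: (batch, dim) pooled representations of the
    first and second translation results for the same batch of samples."""
    # Pairwise cosine similarity between every speech-branch result and every
    # text-branch result in the batch: shape (batch, batch).
    sim = F.cosine_similarity(first_repr.unsqueeze(1), second_repr.unsqueeze(0), dim=-1)
    positive_mask = (sim > sim_threshold).float()
    log_probs = F.log_softmax(sim / temperature, dim=-1)
    # Raise the output probability of positive pairs relative to negative pairs.
    per_anchor = -(log_probs * positive_mask).sum(dim=-1) / positive_mask.sum(dim=-1).clamp(min=1)
    return per_anchor.mean()
```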
7. A training method as claimed in claim 2 or 3, wherein a fourth loss function of the speech translation model is constructed based on predicted transition weights of target words and predicted transition weights of non-target words; the fourth loss function is used for optimizing the translation network;
wherein a target word set is provided in the training process, and the target word set comprises a plurality of target words; the predicted transition weight of the target words is determined according to the output probability of each target word in the first translation result and the second translation result; and the predicted transition weight of the non-target words is determined according to the output probability of each non-target word in the first translation result and the second translation result.
8. The training method of claim 7, wherein the training objective of the speech translation model for the fourth loss function is that the predicted transition weight of the non-target words reaches an expected transition weight.
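One possible reading of the fourth loss in claims 7 and 8, in which the "predicted transition weight" of a word set is taken to be its summed output probability in the two translation results, and the expected weight for non-target words is an assumed hyper-parameter; function and variable names are hypothetical.

```python
import torch

def transition_weight_loss(first_probs, second_probs, target_word_ids,
                           expected_non_target_weight=0.1):
    """first_probs / second_probs: (batch, tgt_len, vocab) softmax outputs of
    the two translation results; target_word_ids: 1-D LongTensor of vocabulary
    indices belonging to the target word set."""
    vocab_size = first_probs.size(-1)
    target_mask = torch.zeros(vocab_size, dtype=torch.bool, device=first_probs.device)
    target_mask[target_word_ids] = True
    avg_probs = (first_probs + second_probs) / 2
    # Predicted transition weight of non-target words: total probability mass
    # assigned outside the target word set.
    non_target_weight = avg_probs[..., ~target_mask].sum(dim=-1).mean()
    # Claim 8: drive this weight toward the expected transition weight.
    return (non_target_weight - expected_non_target_weight).pow(2)
```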
9. A speech translation method based on a speech translation model, comprising:
inputting source language voice data into the speech translation model to obtain speech translation data in a target language;
wherein the speech translation model comprises a speech coding network and a translation network, and the speech coding network and the translation network are trained based on the training method of any one of claims 1 to 7.
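A minimal inference sketch for claim 9, reusing the hypothetical `SpeechTranslationModel` shown after claim 3: only the speech coding network and the trained translation network are used at translation time, and greedy decoding and the helper names are assumptions.

```python
import torch

@torch.no_grad()
def translate_speech(model, speech_feats, bos_id=1, eos_id=2, max_len=128):
    """Greedy decoding that uses only the speech branch; the text coding
    network is not needed at inference time."""
    memory = model.shared_encoder(model.speech_encoder(speech_feats))
    tokens = torch.full((speech_feats.size(0), 1), bos_id, dtype=torch.long,
                        device=speech_feats.device)
    for _ in range(max_len):
        logits = model.output_proj(
            model.shared_decoder(model.tgt_embedding(tokens), memory))
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
        if (next_token == eos_id).all():
            break
    return tokens
```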
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the training method according to any one of claims 1 to 8 or of the speech translation method according to claim 9.
CN202310351451.9A 2023-03-31 2023-03-31 Training method for speech translation model, speech translation method and storage medium Pending CN116167386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310351451.9A CN116167386A (en) 2023-03-31 2023-03-31 Training method for speech translation model, speech translation method and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310351451.9A CN116167386A (en) 2023-03-31 2023-03-31 Training method for speech translation model, speech translation method and storage medium

Publications (1)

Publication Number Publication Date
CN116167386A true CN116167386A (en) 2023-05-26

Family

ID=86413402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310351451.9A Pending CN116167386A (en) 2023-03-31 2023-03-31 Training method for speech translation model, speech translation method and storage medium

Country Status (1)

Country Link
CN (1) CN116167386A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116611459A (en) * 2023-07-19 2023-08-18 腾讯科技(深圳)有限公司 Translation model training method and device, electronic equipment and storage medium
CN116611459B (en) * 2023-07-19 2024-03-15 腾讯科技(深圳)有限公司 Translation model training method and device, electronic equipment and storage medium
CN116957056A (en) * 2023-09-18 2023-10-27 天津汇智星源信息技术有限公司 Feedback-based model training method, keyword extraction method and related equipment
CN116957056B (en) * 2023-09-18 2023-12-08 天津汇智星源信息技术有限公司 Feedback-based model training method, keyword extraction method and related equipment

Similar Documents

Publication Publication Date Title
Akuzawa et al. Adversarial invariant feature learning with accuracy constraint for domain generalization
CN116167386A (en) Training method for speech translation model, speech translation method and storage medium
CN112200062B (en) Target detection method and device based on neural network, machine readable medium and equipment
US11861315B2 (en) Continuous learning for natural-language understanding models for assistant systems
CN116415594A (en) Question-answer pair generation method and electronic equipment
WO2023045605A1 (en) Data processing method and apparatus, computer device, and storage medium
CN113704460B (en) Text classification method and device, electronic equipment and storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN112990440A (en) Data quantization method for neural network model, readable medium, and electronic device
WO2023029397A1 (en) Training data acquisition method, abnormal behavior recognition network training method and apparatus, computer device, storage medium, computer program and computer program product
CN115909374B (en) Information identification method, device, equipment, storage medium and program product
CN114419514B (en) Data processing method, device, computer equipment and storage medium
US20230214707A1 (en) Enhanced lexicon-based classifier models with tunable error-rate tradeoffs
CN116958852A (en) Video and text matching method and device, electronic equipment and storage medium
CN113486260A (en) Interactive information generation method and device, computer equipment and storage medium
CN113836297A (en) Training method and device for text emotion analysis model
CN113822084A (en) Statement translation method and device, computer equipment and storage medium
CN117671314A (en) Optimization method of image recognition model and computer readable storage medium
CN113095072B (en) Text processing method and device
CN112347795B (en) Machine translation quality assessment method, device, equipment and medium
US20240062751A1 (en) Automatic ontology generation for world building in an extended reality environment
CN115690552A (en) Multi-intention recognition method and device, computer equipment and storage medium
CN117807993A (en) Word segmentation method, word segmentation device, computer equipment and storage medium
CN116992876A (en) Entity labeling method, entity labeling model training method, equipment and medium
CN117708698A (en) Class determination method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination