CN114495977B - Speech translation and model training method, device, electronic equipment and storage medium - Google Patents

Speech translation and model training method, device, electronic equipment and storage medium

Info

Publication number
CN114495977B
Authority
CN
China
Prior art keywords
data
sample
sequence data
feature vector
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210110163.XA
Other languages
Chinese (zh)
Other versions
CN114495977A
Inventor
梁芸铭
赵情恩
熊新雷
陈蓉
张银辉
周羊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210110163.XA
Publication of CN114495977A
Priority to PCT/CN2022/113695 (WO2023142454A1)
Application granted
Publication of CN114495977B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/16 Vocoder architecture (under G10L19/04, using predictive techniques)
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique, using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a speech translation method, a model training method, an apparatus, an electronic device and a storage medium, and relates to the technical field of artificial intelligence, in particular to the technical fields of speech translation, speech synthesis and deep learning. The specific implementation scheme is as follows: determining source spectrum sequence data corresponding to source language speech data, wherein the source spectrum sequence data comprises at least one source spectrum data; performing feature extraction on the source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, wherein the first position coding sequence data comprises position codes corresponding to the at least one source spectrum data; processing the target feature vector sequence and second position coding sequence data to obtain target spectrum sequence data, wherein the second position coding sequence data comprises position codes corresponding to the target feature vector sequence; and processing the target spectrum sequence data to obtain target language speech data corresponding to the source language speech data.

Description

Speech translation and model training method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to speech translation, speech synthesis and deep learning techniques. Specifically, it relates to a speech translation method, a model training method, an apparatus, an electronic device and a storage medium.
Background
With the development of artificial intelligence, the technology has been widely applied in various fields. For example, speech translation is widely used in the field of speech technology, a branch of artificial intelligence.
Speech translation refers to translating source language speech data into target language speech data, where the source language and the target language are different languages.
Disclosure of Invention
The present disclosure provides a speech translation method, a model training method, an apparatus, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a speech translation method including: determining source spectrum sequence data corresponding to source language voice data, wherein the source spectrum sequence data comprises at least one source spectrum data; performing feature extraction on the source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, wherein the first position coding sequence data comprises position codes corresponding to the at least one source spectrum data; processing the target feature vector sequence and second position coding sequence data to obtain target spectrum sequence data, wherein the second position coding sequence data comprises position codes corresponding to the target feature vector sequence; and processing the target spectrum sequence data to obtain target language voice data corresponding to the source language voice data.
According to another aspect of the present disclosure, there is provided a model training method including: respectively determining source sample spectrum sequence data corresponding to source sample language voice data and real spectrum sequence data corresponding to target sample language voice data, wherein the source sample spectrum sequence data comprises at least one source sample spectrum data, and the target sample language voice data is obtained by translating the source sample language voice data; performing feature extraction on the source sample spectrum sequence data and first sample position coding sequence data to obtain a sample feature vector sequence, wherein the first sample position coding sequence data comprises sample position codes corresponding to the at least one source sample spectrum data; processing the sample feature vector sequence and second sample position coding sequence data to obtain predicted spectrum sequence data, wherein the second sample position coding sequence data comprises sample position codes corresponding to the sample feature vector sequence; and training a predetermined model by using the real spectrum sequence data and the predicted spectrum sequence data to obtain a speech translation model.
According to another aspect of the present disclosure, there is provided a speech translation apparatus including: a first determining module, configured to determine source spectrum sequence data corresponding to source language speech data, where the source spectrum sequence data includes at least one source spectrum data; the first obtaining module is used for extracting features of the source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, wherein the first position coding sequence data comprises position codes corresponding to the at least one source spectrum data; the second obtaining module is used for processing the target feature vector sequence and second position coding sequence data to obtain target spectrum sequence data, wherein the second position coding sequence data comprises position codes corresponding to the target feature vector sequence; and a third obtaining module, configured to process the target spectrum sequence data to obtain target language voice data corresponding to the source language voice data.
According to another aspect of the present disclosure, there is provided a model training apparatus including: the second determining module is configured to determine source sample spectrum sequence data corresponding to source sample language voice data and real spectrum sequence data corresponding to target sample language voice data, where the source sample spectrum sequence data includes at least one source sample spectrum data, and the target sample language voice data is obtained by translating the source sample language voice data; a fourth obtaining module, configured to perform feature extraction on the source sample spectrum sequence data and first sample position code sequence data to obtain a sample feature vector sequence, where the first sample position code sequence data includes a sample position code corresponding to the at least one source sample spectrum data; a fifth obtaining module, configured to process the sample feature vector sequence and second sample position code sequence data to obtain predicted spectrum sequence data, where the second sample position code sequence data includes a sample position code corresponding to the sample feature vector sequence; and a sixth obtaining module, configured to train a predetermined model by using the real spectrum sequence data and the predicted spectrum sequence data, to obtain a speech translation model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods described in the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method described in the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture to which speech translation methods, training methods, and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a speech translation method according to an embodiment of the present disclosure;
FIG. 3A schematically illustrates an example schematic diagram of a speech translation process according to an embodiment of the present disclosure;
FIG. 3B schematically illustrates an example schematic diagram of data in a speech translation process according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a model training method according to an embodiment of the disclosure;
FIG. 5 schematically illustrates an example schematic diagram of a training process according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a block diagram of a speech translation apparatus according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a model training apparatus according to an embodiment of the present disclosure; and
fig. 8 schematically illustrates a block diagram of an electronic device suitable for implementing a speech translation method and a training method according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments of the present disclosure are included to facilitate understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Speech translation may be implemented as follows. First, source language speech data are converted into source language text data using a speech recognition model. The source language text data are then translated into target language text data using a text translation model. Finally, the target language text data are converted into target language speech data using a speech synthesis model.
The above approach requires that both the source language and the target language have a written form, that the source language has a corresponding speech recognition model, and that the target language has a corresponding speech synthesis model. However, many languages have no corresponding written form, that is, they are languages without text, and it is therefore also difficult to obtain corresponding speech recognition and speech synthesis models for them. For such languages, the above approach is difficult to apply. In addition, the above approach involves a speech recognition model, a text translation model and a speech synthesis model, and the final result is affected by the errors produced by each of these models, which reduces the speech translation quality.
For this reason, the embodiments of the present disclosure propose a speech translation scheme: feature extraction is performed on the source spectrum sequence data and the first position coding sequence data to obtain a target feature vector sequence, the target feature vector sequence and the second position coding sequence data are processed directly to obtain target spectrum sequence data, and the target spectrum sequence data are then processed to obtain target language speech data. Since no text translation is required, the scheme can also be applied to speech translation for languages without text.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
Fig. 1 schematically illustrates an exemplary system architecture to which speech translation methods, training methods, and apparatus may be applied, according to embodiments of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the speech translation method, training method, and apparatus may be applied may include a terminal device, but the terminal device may implement the speech translation method, training method, and apparatus provided by the embodiments of the present disclosure without interaction with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104, so as to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as knowledge-reading applications, web browser applications, search applications, instant messaging tools, email clients and/or social platform software, etc. (merely as examples).
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be any of various types of servers that provide various services. For example, the server 105 may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that addresses the defects of high management difficulty and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services. The server 105 may also be a server of a distributed system or a server incorporating a blockchain.
It should be noted that, the speech translation method provided in the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the speech translation apparatus provided in the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
Alternatively, the speech translation method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the speech translation apparatus provided in the embodiments of the present disclosure may be generally disposed in the server 105. The speech translation method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the speech translation apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
The model training methods provided by embodiments of the present disclosure may be generally performed by server 105. Accordingly, the model training apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The model training method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the model training apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
The model training method provided by the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the model training apparatus provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically illustrates a flow chart of a speech translation method according to an embodiment of the present disclosure.
As shown in fig. 2, the method 200 includes operations S210 to S240.
In operation S210, source spectrum sequence data corresponding to source language voice data is determined. The source spectral sequence data comprises at least one source spectral data.
In operation S220, feature extraction is performed on the source spectrum sequence data and the first position-coded sequence data, so as to obtain a target feature vector sequence. The first position-coded sequence data includes a position code corresponding to at least one source spectral data.
In operation S230, the target feature vector sequence and the second position-encoded sequence data are processed to obtain target spectrum sequence data. The second position-coding sequence data includes position codes corresponding to the target feature vector sequence.
In operation S240, the target spectrum sequence data is processed to obtain target language voice data corresponding to the source language voice data.
According to embodiments of the present disclosure, the source language may refer to the language to be translated, and the target language may refer to the language into which translation is desired. The source language and the target language are different languages. A language may be a language with text or a language without text: a language with text refers to a language that has a corresponding written form, and a language without text refers to a language that has no corresponding written form. The source language speech data may refer to speech data that requires speech translation, and the target language speech data may refer to speech data in the target language. The source language speech data may include at least one object, where an object may include a character or a word.
According to an embodiment of the present disclosure, the source spectrum sequence data may be obtained by extracting acoustic features of source language voice data. The source language voice data may refer to voice data of a predetermined period of time. The source spectral sequence data may include source spectral data corresponding to at least one object. For example, the source spectral sequence data may include source spectral data corresponding to each of the at least one object. Alternatively, the source spectral sequence data may comprise source spectral data corresponding to each of the partial objects of the at least one object. The first position-coded sequence data may include a position code corresponding to at least one source spectrum data. For example, the first position-coded sequence data may include position codes each corresponding to at least one source spectrum data. Alternatively, the first position-coded sequence data may comprise position codes each corresponding to a portion of the source spectral data of the at least one source spectral data. The first position encoding may characterize the absolute position of the object (i.e., the source spectral data) in the source language speech data.
According to an embodiment of the present disclosure, the target feature vector sequence may include at least one target feature vector. The second position-coding sequence data may include a position code corresponding to the target feature vector sequence. For example, the second position-coded sequence data may include position codes each corresponding to at least one target feature vector. Alternatively, the second position-encoded sequence data may include position encodings corresponding to respective ones of the partial target feature vectors of the at least one target feature vector. The target spectrum sequence data may be obtained by extracting acoustic features of target language voice data.
According to the embodiments of the present disclosure, source language speech data can be obtained and preprocessed to obtain source spectrum sequence data corresponding to the source language speech data. The preprocessing may include at least one of: framing, windowing and acoustic feature extraction. The acoustic features may include at least one of: Fbank (i.e., FilterBank) features, Mel-Frequency Cepstral Coefficients (MFCC), chroma vectors, zero-crossing rate, sub-band energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off and chroma deviation. For example, the source spectrum sequence data may include source linear spectrum sequence data or source mel spectrum sequence data.
According to the embodiment of the disclosure, the position of at least one source spectrum data included in the source spectrum sequence data may be encoded by using a position encoding method, so as to obtain a position code corresponding to the at least one source spectrum data. And obtaining first position coding sequence data according to the position coding corresponding to the at least one source spectrum data. For example, the position coding method may be used to code the position corresponding to each of at least one source spectrum data included in the source spectrum sequence data, so as to obtain a position code corresponding to each of the at least one source spectrum data. Alternatively, the position encoding method may be used to encode positions corresponding to each of the partial source spectrum data in the at least one source spectrum data included in the source spectrum sequence data, so as to obtain position encodings corresponding to each of the partial source spectrum data in the at least one source spectrum data. The position coding method may include a sine and cosine position coding method or a learning position vector method.
According to the embodiment of the disclosure, after the source spectrum sequence data and the first position coding sequence data are obtained, fusion processing can be performed on the source spectrum sequence data and the first position coding sequence data to obtain a fusion result, and then feature extraction is performed on the fusion result to obtain a target feature vector sequence.
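The position coding and additive fusion described above can be sketched as follows with NumPy. This is a minimal illustration only: the sine-cosine formulation, the frame count and the 80-dimensional frame size are assumptions, not values given in the disclosure.

```python
import numpy as np

def sinusoidal_position_encoding(num_frames: int, dim: int) -> np.ndarray:
    """Return a (num_frames, dim) matrix of sine/cosine position codes."""
    positions = np.arange(num_frames)[:, np.newaxis]                      # (T, 1)
    div_terms = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))   # (dim/2,)
    encoding = np.zeros((num_frames, dim))
    encoding[:, 0::2] = np.sin(positions * div_terms)                     # even indices
    encoding[:, 1::2] = np.cos(positions * div_terms)                     # odd indices
    return encoding

# Source spectrum sequence: T frames, each an 80-dimensional mel-spectrum vector (stand-in data).
source_spectrum = np.random.randn(120, 80)
position_codes = sinusoidal_position_encoding(*source_spectrum.shape)

# Fusion by addition (both sequences have the same number of dimensions).
intermediate_code_sequence = source_spectrum + position_codes
print(intermediate_code_sequence.shape)                                   # (120, 80)
```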
According to the embodiment of the disclosure, after the target feature vector sequence is obtained, a position encoding method may be used to encode a position of at least one target feature vector included in the target feature vector sequence, so as to obtain a position code corresponding to the at least one target feature vector. And obtaining second position coding sequence data according to the position codes corresponding to the at least one target feature vector.
According to the embodiments of the present disclosure, the target feature vector sequence and the second position coding sequence data can be decoded to obtain target spectrum sequence data. The target spectrum sequence data are then processed to obtain target language speech data.
According to the embodiments of the present disclosure, feature extraction is performed on the source spectrum sequence data and the first position coding sequence data to obtain the target feature vector sequence, the target feature vector sequence and the second position coding sequence data are processed directly to obtain the target spectrum sequence data, and the target spectrum sequence data are then processed to obtain the target language speech data. Further, since text translation is not required, the scheme can be applied to speech translation for languages without text.
According to an embodiment of the present disclosure, operation S210 may include the following operations.
And preprocessing the source language voice data to obtain source linear spectrum sequence data corresponding to the source language voice data. And processing the source linear spectrum sequence data to obtain source Mel spectrum sequence data corresponding to the source language voice data. The source mel-spectrum sequence data is determined as source spectrum sequence data.
According to the embodiments of the present disclosure, framing and windowing can be performed on the source language speech data to obtain source language speech matrix data. The source language speech matrix data may include at least one frame of source language speech sub-data, and adjacent frames may overlap. After the source language speech matrix data are obtained, a short-time Fourier transform may be applied to them to obtain frequency-domain source language speech matrix data, that is, source linear spectrum sequence data. The source linear spectrum sequence data may then be processed with a mel filter bank to obtain the source mel spectrum sequence data.
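The preprocessing chain described above (framing, windowing, short-time Fourier transform, mel filtering) can be sketched as follows, assuming the librosa library is available. The file name, frame length, hop length and number of mel bands are illustrative assumptions, not parameters stated in the disclosure.

```python
import librosa
import numpy as np

# Hypothetical input file; 16 kHz sampling rate assumed.
waveform, sample_rate = librosa.load("source_speech.wav", sr=16000)

# Framing + windowing + STFT -> linear spectrum sequence (frequency bins x frames).
linear_spectrum = np.abs(
    librosa.stft(waveform, n_fft=1024, hop_length=256, win_length=1024, window="hann")
)

# Mel filter bank -> mel spectrum sequence used as the source spectrum sequence data.
mel_filters = librosa.filters.mel(sr=sample_rate, n_fft=1024, n_mels=80)
mel_spectrum = mel_filters @ linear_spectrum        # (80, num_frames)
source_spectrum_sequence = mel_spectrum.T           # one 80-dimensional vector per frame
```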
According to the embodiments of the present disclosure, mel spectrum sequence data can reflect speech characteristics, and the resulting mel frequencies conform to the hearing characteristics of the human ear. Based on the frequency peaks in the mel spectrum sequence data, the formants of the speech and the boundaries between phonemes can be clearly distinguished. Therefore, using the source mel spectrum sequence data as the source spectrum sequence data in speech translation clarifies the boundary relations between different objects in the source language speech data, which reduces the time needed for word segmentation and recognition and improves the speech translation speed.
According to an embodiment of the present disclosure, operation S220 may include the following operations.
And obtaining intermediate code sequence data according to the source spectrum sequence data and the first position code sequence data. And extracting the characteristics of the intermediate coding sequence data to obtain a target characteristic vector sequence.
According to the embodiments of the present disclosure, the source spectrum sequence data and the first position coding sequence data may have the same number of dimensions. The two may be added to obtain the intermediate code sequence data.
According to an embodiment of the present disclosure, feature extraction is performed on intermediate code sequence data to obtain a target feature vector sequence, which may include the following operations.
Based on the first attention strategy, the intermediate code sequence data is processed to obtain a first intermediate feature vector sequence. And processing the first intermediate feature vector sequence based on the first multi-layer perception strategy to obtain a target feature vector sequence.
According to the embodiments of the present disclosure, an attention strategy can focus on important information with high weights, ignore unimportant information with low weights, and exchange information with other positions by sharing important information, thereby transferring the important information. The first attention layer may be determined according to the first attention strategy, and the first feedforward neural network layer may be determined according to the first multi-layer perception strategy.
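A minimal PyTorch sketch of one encoding unit built from a first attention layer and a first feedforward neural network layer is given below. The layer sizes, residual connections and layer normalization are assumptions for illustration rather than details stated in the disclosure.

```python
import torch
from torch import nn

class EncodingUnit(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, hidden: int = 2048):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First attention layer: self-attention over the (batch, frames, dim) sequence.
        attended, _ = self.attention(x, x, x)
        x = self.norm1(x + attended)
        # First feedforward neural network layer.
        return self.norm2(x + self.feed_forward(x))
```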
According to an embodiment of the present disclosure, the target feature vector sequence is obtained by processing the source spectral sequence data and the first position-coded sequence data using an encoder included in the speech translation model.
According to embodiments of the present disclosure, the speech translation model may include an encoder. The source spectral sequence data and the first position-coded sequence data may be processed by an encoder to obtain a target feature vector sequence. Intermediate code sequence data is obtained, for example, from the source spectral sequence data and the first position-coded sequence data. And processing the intermediate coding sequence data by using an encoder to obtain a target characteristic vector sequence.
According to an embodiment of the present disclosure, an encoder may include N encoding units in cascade. The encoding unit may include a first attention layer and a first feedforward neural network layer. N is an integer greater than or equal to 1.
According to an embodiment of the present disclosure, processing intermediate coded sequence data with an encoder to obtain a target feature vector sequence may include the following operations.
In the case of i = 1, the intermediate code sequence data are processed with the first attention layer of level 1 to obtain the first intermediate feature vector sequence of level 1. The first intermediate feature vector sequence of level 1 is then processed with the first feedforward neural network layer of level 1 to obtain the fifth intermediate feature vector sequence of level 1.
In the case of 1 < i ≤ N, the fifth intermediate feature vector sequence of level (i-1) is processed with the first attention layer of level i to obtain the sixth intermediate feature vector sequence of level i. The sixth intermediate feature vector sequence of level i is processed with the first feedforward neural network layer of level i to obtain the fifth intermediate feature vector sequence of level i. The target feature vector sequence is obtained from the fifth intermediate feature vector sequence of level N.
According to the embodiment of the present disclosure, the value of N may be configured according to actual service requirements, which is not limited herein. For example, n=6.
According to an embodiment of the present disclosure, obtaining a target feature vector sequence according to a fifth intermediate feature vector sequence of an nth hierarchy may include: the fifth intermediate feature vector sequence of the nth hierarchy may be determined as the target feature vector sequence.
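Continuing the previous sketch, the cascade of N encoding units (here N = 6) can be written as follows: the level-(i-1) output feeds the level-i first attention layer, and the level-N output serves as the target feature vector sequence. This reuses the EncodingUnit class from the sketch above and is illustrative only.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    def __init__(self, num_levels: int = 6, dim: int = 512):
        super().__init__()
        self.levels = nn.ModuleList([EncodingUnit(dim) for _ in range(num_levels)])

    def forward(self, intermediate_code_sequence: torch.Tensor) -> torch.Tensor:
        x = intermediate_code_sequence
        for level in self.levels:
            x = level(x)          # fifth intermediate feature vector sequence of level i
        return x                  # target feature vector sequence (level-N output)
```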
According to an embodiment of the present disclosure, operation S230 may include the following operations.
And obtaining a second intermediate feature vector sequence according to the target feature vector sequence and the second position coding sequence data. And processing the second intermediate feature vector sequence to obtain target spectrum sequence data.
According to the embodiment of the disclosure, the target feature vector sequence and the second position coding sequence data may be subjected to addition processing to obtain a second intermediate feature vector sequence. And decoding the second intermediate feature vector sequence to obtain target spectrum sequence data.
According to an embodiment of the present disclosure, processing the second intermediate feature vector sequence to obtain the target spectrum sequence data may include the following operations.
And processing the second intermediate feature vector sequence based on the second attention strategy to obtain a third intermediate feature vector sequence. And processing the third intermediate feature vector sequence based on the second multi-layer perception strategy to obtain a fourth intermediate feature vector sequence. And processing the fourth intermediate feature vector sequence to obtain target spectrum sequence data.
According to embodiments of the present disclosure, the second attention layer may be determined according to a second attention policy. And processing the second intermediate feature vector sequence by using the second attention layer to obtain a third intermediate feature vector sequence. And determining a second feedforward neural network layer according to a second multi-layer sensing strategy. And processing the third intermediate feature vector sequence by using the second feedforward neural network layer to obtain a fourth intermediate feature vector sequence.
According to an embodiment of the present disclosure, the target spectral sequence data is obtained by processing the target feature vector sequence and the second position-encoded sequence data using a decoder included in the speech translation model.
According to embodiments of the present disclosure, the speech translation model may include a decoder. The target spectral sequence data may be obtained by processing a second intermediate eigenvector sequence obtained from the target eigenvector sequence and the second position code sequence with a decoder.
According to an embodiment of the present disclosure, the decoder may include N decoding units. The decoding unit may include a second attention layer and a second feedforward neural network layer.
According to an embodiment of the present disclosure, processing the second intermediate feature vector sequence with a decoder to obtain the target spectrum sequence data may include the following operations.
In the case of i = N, the second intermediate feature vector sequence is processed with the second attention layer of level N to obtain the third intermediate feature vector sequence of level N. The third intermediate feature vector sequence of level N is processed with the second feedforward neural network layer of level N to obtain the fourth intermediate feature vector sequence of level N.
In the case of 1 ≤ i < N, the fourth intermediate feature vector sequence of level (i+1) is processed with the second attention layer of level i to obtain the third intermediate feature vector sequence of level i. The third intermediate feature vector sequence of level i is processed with the second feedforward neural network layer of level i to obtain the fourth intermediate feature vector sequence of level i. The fourth intermediate feature vector sequence of level 1 is processed to obtain the target spectrum sequence data.
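The decoder can be sketched as a cascade of N decoding units, each a second attention layer followed by a second feedforward neural network layer, whose level-1 output is projected to spectrum frames. The disclosure indexes the decoding units from level N down to level 1, which is still a simple cascade; the layer sizes, the residual/normalization scheme and the final linear projection to 80 mel bins are illustrative assumptions.

```python
import torch
from torch import nn

class DecodingUnit(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, hidden: int = 2048):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attention(x, x, x)            # second attention layer
        x = self.norm1(x + attended)
        return self.norm2(x + self.feed_forward(x))      # second feedforward neural network layer

class Decoder(nn.Module):
    def __init__(self, num_levels: int = 6, dim: int = 512, mel_bins: int = 80):
        super().__init__()
        self.levels = nn.ModuleList([DecodingUnit(dim) for _ in range(num_levels)])
        self.to_spectrum = nn.Linear(dim, mel_bins)      # processing of the level-1 output

    def forward(self, second_intermediate_sequence: torch.Tensor) -> torch.Tensor:
        x = second_intermediate_sequence
        for level in self.levels:
            x = level(x)
        return self.to_spectrum(x)                       # target spectrum sequence data
```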
According to an embodiment of the present disclosure, operation S240 may include the following operations.
And processing the target frequency spectrum sequence data by using the vocoder to obtain target language voice data corresponding to the source language voice data.
According to embodiments of the present disclosure, the vocoder may be a speech analysis-by-synthesis system. The target spectrum sequence data may be reconstructed using the vocoder to obtain target language speech data corresponding to the source language speech data. For example, in the process of synthesizing the target spectrum sequence data into the target language speech data, the response of the vocal tract is first modeled using linear prediction; that is, the target spectrum sequence data are reconstructed based on linear prediction, and speech synthesis is then performed on the reconstructed target spectrum sequence data to obtain the target language speech data.
According to the embodiments of the present disclosure, if the target spectrum sequence data are target linear spectrum sequence data, the target linear spectrum sequence data may be converted into target mel spectrum sequence data, and the target mel spectrum sequence data are processed by the vocoder to obtain target language speech data corresponding to the source language speech data.
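The vocoder described above models the vocal tract response with linear prediction. As a stand-in illustration only, the sketch below reconstructs a waveform from a mel spectrum with librosa's Griffin-Lim-based mel_to_audio, which is a different (phase-reconstruction) technique; the sampling rate, FFT size and output path are assumptions.

```python
import librosa
import numpy as np
import soundfile as sf

def mel_to_speech(mel_spectrum: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """mel_spectrum: (80, num_frames) mel spectrogram; returns a waveform."""
    return librosa.feature.inverse.mel_to_audio(
        mel_spectrum, sr=sample_rate, n_fft=1024, hop_length=256
    )

# target_spectrum_sequence: (num_frames, 80) frames produced by the decoder (stand-in data here).
target_spectrum_sequence = np.abs(np.random.randn(120, 80))
waveform = mel_to_speech(target_spectrum_sequence.T)
sf.write("target_speech.wav", waveform, 16000)           # hypothetical output path
```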
The speech translation method according to the embodiments of the present disclosure will be further described with reference to fig. 3A and 3B.
Fig. 3A schematically illustrates an example schematic diagram of a speech translation process according to an embodiment of the disclosure.
As shown in fig. 3A, in 300A, speech translation model 304 includes an encoder 3040 and a decoder 3041.
The source language voice data 301 is preprocessed to obtain source linear spectrum sequence data corresponding to the source language voice data 301. The source linear spectrum sequence data is processed to obtain source mel spectrum sequence data corresponding to the source language voice data 301. The source mel-spectrum sequence data is determined as source spectrum sequence data 302.
A position code corresponding to at least one source spectral data included in the source spectral sequence data 302 is determined, resulting in first position-coded sequence data 303. Intermediate code sequence data is obtained from the source spectral sequence data 302 and the first position code sequence data 303.
Intermediate coded sequence data is processed by an encoder 3040 to obtain a target feature vector sequence 305.
A position code corresponding to at least one target feature vector comprised by the sequence of target feature vectors 305 is determined, resulting in second position-coded sequence data 306.
A second intermediate feature vector sequence is derived from the target feature vector sequence 305 and the second position-encoded sequence data 306. The second intermediate feature vector sequence is processed by a decoder 3041 to obtain target spectrum sequence data 307.
The target spectral sequence data 307 is processed by the vocoder 308 to obtain target language speech data 309 corresponding to the source language speech data 301.
Fig. 3B schematically illustrates an example schematic diagram of data in a speech translation process according to an embodiment of the present disclosure.
As shown in fig. 3B, 301 in 300B is the source language speech data 301 in fig. 3A. 302 is the source spectral sequence data 302 in fig. 3A. 307 is the target spectral sequence data 307 in fig. 3A. 309 is the target language speech data 309 in fig. 3A.
Fig. 4 schematically illustrates a flow chart of a model training method according to an embodiment of the present disclosure.
As shown in fig. 4, the method 400 includes operations S410 to S440.
In operation S410, source sample spectral sequence data corresponding to the source sample language voice data and real spectral sequence data corresponding to the target sample language voice data are determined, respectively. The source sample spectral sequence data comprises at least one source sample spectral data, and the target sample language voice data is obtained by translating the source sample language voice data.
In operation S420, feature extraction is performed on the source sample spectrum sequence data and the first sample position code sequence data to obtain a sample feature vector sequence. The first sample position coding sequence data includes sample position codes corresponding to the at least one source sample spectrum data.
In operation S430, the sample feature vector sequence and the second sample position-encoded sequence data are processed to obtain predicted spectrum sequence data. The second sample position-coding sequence data includes sample position codes corresponding to the sample feature vector sequences.
In operation S440, a predetermined model is trained using the real spectrum sequence data and the predicted spectrum sequence data to obtain a speech translation model.
According to embodiments of the present disclosure, the source sample spectral sequence data may include source sample spectral data corresponding to at least one sample object. For example, the source sample spectral sequence data may include source sample spectral data corresponding to each of the at least one sample object. Alternatively, the source sample spectral sequence data may comprise source sample spectral data corresponding to each of the partial sample objects of the at least one sample object. The first sample position encoded sequence data may comprise a sample position encoding corresponding to at least one source sample spectral data. For example, the first sample position-coding sequence data may comprise sample position codes each corresponding to at least one source sample spectral data. Alternatively, the first sample position encoded sequence data may comprise sample position encodings corresponding to respective ones of the portions of the source sample spectral data. The first sample position encoding may characterize the absolute position of the object (i.e., the source sample spectral data) in the source sample language speech data.
According to an embodiment of the present disclosure, the sequence of sample feature vectors may comprise at least one sample feature vector. The second sample position-coding sequence data may include sample position codes corresponding to the at least one sample feature vector. For example, the second sample position-coding sequence data may include sample position codes each corresponding to at least one sample feature vector. Alternatively, the second sample position-encoded sequence data may comprise sample position encodings corresponding to respective ones of the partial sample feature vectors of the at least one sample feature vector.
According to the embodiment of the disclosure, the source sample language voice data can be preprocessed to obtain the source sample spectrum sequence data. The target sample language voice data can be preprocessed to obtain the real spectrum sequence data. The pre-treatment may comprise at least one of: framing, windowing and acoustic feature extraction. For example, framing and windowing may be performed on the source sample language speech data to obtain source sample language speech matrix data. And performing short-time Fourier transform on the source sample language voice matrix data to obtain source sample language voice matrix data of a frequency domain, namely source sample linear spectrum sequence data. The linear spectral sequence data of the source sample may be processed with a mel filter to obtain mel spectral sequence data of the source sample. The source sample mel-spectrum sequence data is determined as source sample spectral sequence data. The target sample language voice data can be subjected to framing and windowing to obtain target sample language voice matrix data. And carrying out Fourier transform on the target sample language voice matrix data to obtain the target sample language voice matrix data of the frequency domain, namely the target sample linear spectrum sequence data. The linear spectrum sequence data of the target sample can be processed by using a Mel filter to obtain the real Mel spectrum sequence data. The true mel-spectrum sequence data is determined as the true spectrum sequence data.
According to an embodiment of the present disclosure, the predetermined model may include an encoder and a decoder. The predetermined model may include a Transformer model.
According to an embodiment of the present disclosure, feature extraction is performed on source sample spectrum sequence data and first sample position code sequence data to obtain a sample feature vector sequence, which may include: and obtaining intermediate sample coding sequence data according to the source sample spectrum sequence data and the first sample position coding sequence data. And extracting the characteristics of the intermediate sample coding sequence data to obtain a sample characteristic vector sequence.
According to an embodiment of the present disclosure, processing the sample feature vector sequence and the second sample position-coded sequence to obtain predicted spectral sequence data may include: and obtaining a third intermediate sample characteristic vector sequence according to the sample characteristic vector sequence and the second sample position coding sequence. And processing the third intermediate sample feature vector sequence to obtain predicted spectrum sequence data.
According to the embodiment of the disclosure, after the predicted spectrum sequence data is obtained, the predicted spectrum sequence data and the real spectrum sequence data can be utilized to train the predetermined model, a trained model is obtained, and the trained predetermined model is determined to be a speech translation model.
According to an embodiment of the present disclosure, the predetermined model may include an encoder.
According to an embodiment of the present disclosure, operation S420 may include the following operations.
And obtaining intermediate sample coding sequence data according to the source sample spectrum sequence data and the first sample position coding sequence data. And processing the intermediate sample coding sequence data by using an encoder to obtain a sample characteristic vector sequence.
According to an embodiment of the present disclosure, an encoder may include a model structure implementing a first attention strategy and a first multi-layer perception strategy.
According to an embodiment of the present disclosure, an encoder may include N encoding units in cascade. The encoding unit includes a first attention layer and a first feedforward neural network layer. N is an integer greater than or equal to 1.
According to an embodiment of the present disclosure, processing the intermediate sample coding sequence data with the encoder to obtain the sample feature vector sequence may include the following operations.
In the case of i=1, the intermediate sample encoded sequence data is processed with the first attention layer of level 1, resulting in the second intermediate sample feature vector sequence of level 1.
And processing the second intermediate sample feature vector sequence of the 1 st level by using the first feedforward neural network layer of the 1 st level to obtain the first intermediate sample feature vector sequence of the 1 st level.
And under the condition that i is more than 1 and less than or equal to N, processing the first intermediate sample feature vector sequence of the (i-1) th level by using the first attention layer of the i th level to obtain a second intermediate sample feature vector sequence of the i th level.
And processing the second intermediate sample feature vector sequence of the ith level by using the first feedforward neural network layer of the ith level to obtain the first intermediate sample feature vector sequence of the ith level. And obtaining a sample characteristic vector sequence according to the first intermediate sample characteristic vector sequence of the Nth level.
According to an embodiment of the present disclosure, the predetermined model may further include a decoder.
According to an embodiment of the present disclosure, operation S430 may include the following operations.
And obtaining a third intermediate sample characteristic vector sequence according to the sample characteristic vector sequence and the second sample position coding sequence data. And processing the third intermediate sample feature vector sequence by using a decoder to obtain predicted spectrum sequence data.
According to an embodiment of the present disclosure, the decoder may include a model structure implementing the second attention strategy and the second multi-layer perception strategy.
According to an embodiment of the present disclosure, the decoder may include N decoding units. The decoding unit may include a second attention layer and a second feedforward neural network layer.
According to an embodiment of the present disclosure, processing the third intermediate sample feature vector sequence with a decoder to obtain predicted spectral sequence data may include the following operations.
And under the condition that i is less than or equal to 1 and less than N, processing the fourth intermediate sample feature vector sequence of the (i+1) th level by using the second attention layer of the i th level to obtain a fifth intermediate sample feature vector sequence of the i th level. And processing the fifth intermediate sample feature vector sequence of the ith level by using the second feedforward neural network layer of the ith level to obtain a fourth intermediate sample feature vector sequence of the ith level. And processing the fourth intermediate sample characteristic vector sequence of the 1 st level to obtain predicted spectrum sequence data.
According to an embodiment of the present disclosure, in the case of i = N, the third intermediate sample feature vector sequence is processed with the second attention layer of level N to obtain the fifth intermediate sample feature vector sequence of level N. The fifth intermediate sample feature vector sequence of level N is processed with the second feedforward neural network layer of level N to obtain the fourth intermediate sample feature vector sequence of level N.
According to an embodiment of the present disclosure, operation S440 may include the following operations.
Based on the loss function, the output value is obtained by using the real spectrum sequence data and the predicted spectrum sequence data. And adjusting model parameters of the preset model according to the output value until preset conditions are met. The model obtained in the case where the predetermined condition is satisfied is determined as a speech translation model.
According to embodiments of the present disclosure, the loss function may include a mean square error loss function, a mean pairwise squared error loss function, or a cross-entropy loss function. The predetermined condition may include at least one of: the output value converging, and the number of training rounds reaching the maximum number of training rounds.
According to the embodiment of the disclosure, the real spectrum sequence data and the predicted spectrum sequence data can be input into the loss function to obtain the output value. Model parameters of the predetermined model may be adjusted according to the output value until the predetermined condition is satisfied. For example, the model parameters of the predetermined model may be adjusted according to a back-propagation algorithm or a stochastic gradient descent algorithm until the predetermined condition is satisfied.
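A minimal training step under these choices might look as follows; the stand-in model, the mean square error loss, the optimizer, the tensor shapes and the learning rate are illustrative assumptions rather than values given by the disclosure.

```python
import torch
from torch import nn

# Assumed setup: `model` maps a source sample spectrum sequence to a predicted
# spectrum sequence; a single linear layer stands in for the predetermined model.
model = nn.Linear(80, 80)
criterion = nn.MSELoss()                                    # mean square error loss function
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)    # stochastic gradient descent

def train_step(source_spectrum_seq, real_spectrum_seq):
    predicted_spectrum_seq = model(source_spectrum_seq)
    output_value = criterion(predicted_spectrum_seq, real_spectrum_seq)
    optimizer.zero_grad()
    output_value.backward()                                 # back-propagation
    optimizer.step()                                        # adjust model parameters
    return output_value.item()

# Training continues until the output value converges or a maximum training round is reached.
source = torch.randn(8, 120, 80)                            # (batch, frames, mel bins), illustrative
target = torch.randn(8, 120, 80)
loss = train_step(source, target)
```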
According to an embodiment of the present disclosure, the training method may further include the following operations.
When the first time period is determined to be inconsistent with the second time period, a target time period is determined. The first time period represents the duration of the initial source sample language voice data, the second time period represents the duration of the initial target sample language voice data, and the target time period is whichever of the two has the smaller value. Non-human voice data in the sample language voice data is then determined, where the sample language voice data is the voice data corresponding to the target time period. The sample language voice data is processed with the non-human voice data to obtain processed sample language voice data. The source sample language voice data and the target sample language voice data are then obtained from the processed sample language voice data and the sample language voice data corresponding to the non-target time period, where the non-target time period is whichever of the first time period and the second time period has the larger value.
According to embodiments of the present disclosure, the duration of the initial source sample language voice data may be determined to obtain the first time period, and the duration of the initial target sample language voice data may be determined to obtain the second time period. The first time period and the second time period may then be compared. If the two are not identical, the one with the smaller value may be determined as the target time period, and the voice data corresponding to the target time period may be determined as the sample language voice data.
According to embodiments of the present disclosure, after the sample language voice data is determined, non-human voice data in the sample language voice data may be determined. For example, the non-human voice data may be determined from the sample language voice data using a voice activity detection tool. The sample language voice data may then be processed using the non-human voice data to obtain the processed sample language voice data. For example, first non-human voice fragment data and second non-human voice fragment data may be randomly extracted from the non-human voice data, the first non-human voice fragment data may be added to the beginning of the sample language voice data, and the second non-human voice fragment data may be added to the end of the sample language voice data to obtain the processed sample language voice data. The duration of the first non-human voice fragment data and the duration of the second non-human voice fragment data may be the same or different.
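A possible sketch of this padding step is given below. It uses librosa's energy-based splitting as a stand-in for a voice activity detection tool; the threshold, the random split of the padding between the beginning and the end, and the fallback to silence are assumptions of this sketch.

```python
import numpy as np
import librosa

def pad_with_non_voice(short_wav, long_wav, sr=16000, top_db=30):
    """Pad the shorter recording with its own non-voice segments so both durations match.

    Assumes len(long_wav) >= len(short_wav); librosa.effects.split stands in for a VAD tool.
    """
    voiced = librosa.effects.split(short_wav, top_db=top_db)       # (start, end) voiced intervals
    gaps = [short_wav[end:start] for (_, end), (start, _) in zip(voiced[:-1], voiced[1:])]
    non_voice = np.concatenate(gaps) if gaps else np.array([])
    if non_voice.size == 0:
        non_voice = np.zeros(sr // 10)                             # fall back to silence

    deficit = len(long_wav) - len(short_wav)
    head_len = np.random.randint(0, deficit + 1)                   # random split of the deficit
    head = np.resize(non_voice, head_len)                          # first non-human voice fragment
    tail = np.resize(non_voice, deficit - head_len)                # second non-human voice fragment
    return np.concatenate([head, short_wav, tail])
```

The padded recording then has the same duration as the longer one, which matches the equal-duration requirement described above.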
According to an embodiment of the present disclosure, after the processed sample language voice data is obtained, if the sample language voice data is determined to be the initial source sample language voice data, the sample language voice data corresponding to the non-target time period is the initial target sample language voice data. In this case, the processed sample language voice data may be determined as the source sample language voice data, and the initial target sample language voice data may be determined as the target sample language voice data.
According to an embodiment of the present disclosure, if the sample language voice data is the initial target sample language voice data, the sample language voice data corresponding to the non-target time period is the initial source sample language voice data. In this case, the processed sample language voice data may be determined as the target sample language voice data, and the initial source sample language voice data may be determined as the source sample language voice data.
According to the embodiment of the disclosure, the sample language voice data is processed by using the non-human voice data, so that the durations of the source sample language voice data and the target sample language voice data are the same.
According to an embodiment of the present disclosure, the sampling frequency of the source sample language voice data is the same as the sampling frequency of the target sample language voice data.
According to embodiments of the present disclosure, the sampling frequency of the source sample language voice data and the sampling frequency of the target sample language voice data may be the same. Both may further be the same as the sampling frequency at which the vocoder synthesizes voice data.
According to the embodiment of the disclosure, because the sampling frequency of the source sample language voice data is the same as that of the target sample language voice data, the source sample spectrum sequence data obtained by processing the source sample language voice data and the target sample spectrum sequence data obtained by processing the target sample language voice data follow the same time-frequency variation law. The predicted spectrum sequence data and the real spectrum sequence data therefore also follow the same time-domain variation law, which helps improve the training speed of the model.
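In practice, this could amount to resampling both recordings to one common rate before feature extraction; a small sketch follows, where the file names and the assumed vocoder sampling frequency of 22 050 Hz are purely illustrative.

```python
import librosa

VOCODER_SR = 22050   # assumed sampling frequency of the vocoder; illustrative only

# Load both recordings and bring them to the vocoder's sampling frequency so the
# source sample and target sample speech share one time-frequency resolution.
source_wav, sr_src = librosa.load("source_sample.wav", sr=None)
target_wav, sr_tgt = librosa.load("target_sample.wav", sr=None)

source_wav = librosa.resample(source_wav, orig_sr=sr_src, target_sr=VOCODER_SR)
target_wav = librosa.resample(target_wav, orig_sr=sr_tgt, target_sr=VOCODER_SR)
```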
The above is only an exemplary embodiment; the disclosure is not limited thereto and may also include other speech translation methods and training methods known in the art, as long as the speech translation quality can be improved.
A model training method according to embodiments of the present disclosure is further described below with reference to fig. 5, in conjunction with a specific embodiment.
Fig. 5 schematically illustrates an example schematic diagram of a training process according to an embodiment of the disclosure.
As shown in fig. 5, in 500, a predetermined model 506 includes an encoder 5060 and a decoder 5061.
The source sample language voice data 501 is preprocessed to obtain source sample linear spectrum sequence data corresponding to the source sample language voice data 501. The source sample linear spectrum sequence data is processed to obtain source sample mel spectrum sequence data corresponding to the source sample language voice data 501. The source sample mel-spectrum sequence data is determined as source sample spectrum sequence data 502.
The target sample language voice data 503 is preprocessed to obtain target sample linear spectrum sequence data corresponding to the target sample language voice data 503. The target sample linear spectrum sequence data is processed to obtain target sample mel spectrum sequence data corresponding to the target sample language voice data 503. The target sample mel spectrum sequence data is determined as the real spectrum sequence data 504.
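As an illustration of this preprocessing, one way to go from a waveform to a linear spectrum and then a mel spectrum sequence is sketched below with librosa; the STFT and mel parameters are assumptions of this sketch and are not fixed by the disclosure.

```python
import numpy as np
import librosa

def to_mel_sequence(wav, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Waveform -> linear spectrum sequence -> mel spectrum sequence (one vector per frame)."""
    linear_spectrum = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spectrum = mel_basis @ linear_spectrum      # (n_mels, frames)
    return mel_spectrum.T                           # (frames, n_mels)
```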
Sample position codes corresponding to at least one source sample spectral data included in the source sample spectral sequence data 502 are determined, resulting in first sample position code sequence data 505. Intermediate sample code sequence data is obtained from the source sample spectral sequence data 502 and the first sample position code sequence data 505.
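The disclosure names a learning position vector method as one position coding method; a minimal sketch of learned position codes being added to the spectrum sequence is given below, with the module name, the maximum frame count and the feature dimension assumed for illustration.

```python
import torch
from torch import nn

class LearnedPositionCoding(nn.Module):
    """Learned position vectors added to a spectrum sequence (a learning position vector method)."""
    def __init__(self, max_frames=2000, d_model=80):
        super().__init__()
        self.position_embedding = nn.Embedding(max_frames, d_model)

    def forward(self, spectrum_seq):                 # (batch, frames, d_model)
        positions = torch.arange(spectrum_seq.size(1), device=spectrum_seq.device)
        position_codes = self.position_embedding(positions)   # position code sequence data
        return spectrum_seq + position_codes         # intermediate sample code sequence data
```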
The intermediate sample code sequence data is processed by an encoder 5060 to obtain a sample feature vector sequence 507.
Sample position codes corresponding to at least one sample feature vector included in the sequence of sample feature vectors 507 are determined to obtain second sample position code sequence data 508.
A third intermediate sample feature vector sequence is obtained from the sample feature vector sequence 507 and the second sample position code sequence data 508. The third intermediate sample feature vector sequence is processed by the decoder 5061 to obtain predicted spectrum sequence data 509.
The predicted spectral sequence data 509 and the real spectral sequence data 504 are input to a loss function 510, resulting in an output value 511. Model parameters of the predetermined model 506 are adjusted according to the output values 511 until a predetermined condition is satisfied. The model 506 obtained in the case where the predetermined condition is satisfied is determined as a speech translation model. The speech translation model may be speech translation model 304 in fig. 3A.
Fig. 6 schematically shows a block diagram of a speech translation apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the speech translation apparatus 600 may include a first determination module 610, a first obtaining module 620, a second obtaining module 630, and a third obtaining module 640.
The first determining module 610 is configured to determine source spectrum sequence data corresponding to source language speech data, where the source spectrum sequence data includes at least one source spectrum data.
The first obtaining module 620 is configured to perform feature extraction on the source spectrum sequence data and the first position-coded sequence data to obtain a target feature vector sequence. The first position-coded sequence data includes a position code corresponding to at least one source spectral data.
The second obtaining module 630 is configured to process the target feature vector sequence and the second position-coding sequence data to obtain target spectrum sequence data. The second position-coding sequence data includes a position code corresponding to the target feature vector sequence.
The third obtaining module 640 is configured to process the target spectrum sequence data to obtain target language voice data corresponding to the source language voice data.
According to an embodiment of the present disclosure, the first obtaining module 620 may include a first obtaining sub-module and a second obtaining sub-module.
And the first obtaining submodule is used for obtaining intermediate coding sequence data according to the source spectrum sequence data and the first position coding sequence data.
And the second obtaining submodule is used for carrying out feature extraction on the intermediate coding sequence data to obtain a target feature vector sequence.
According to an embodiment of the present disclosure, the second obtaining sub-module may include a first obtaining unit and a second obtaining unit.
The first obtaining unit is used for processing the intermediate coding sequence data based on a first attention strategy to obtain a first intermediate feature vector sequence.
The second obtaining unit is used for processing the first intermediate feature vector sequence based on the first multi-layer perception strategy to obtain a target feature vector sequence.
According to an embodiment of the present disclosure, the second obtaining module 630 may include a third obtaining sub-module and a fourth obtaining sub-module.
And the third obtaining submodule is used for obtaining a second intermediate feature vector sequence according to the target feature vector sequence and the second position coding sequence data.
And a fourth obtaining sub-module, configured to process the second intermediate feature vector sequence to obtain target spectrum sequence data.
According to an embodiment of the present disclosure, the fourth obtaining sub-module may include a third obtaining unit, a fourth obtaining unit, and a fifth obtaining unit.
And the third obtaining unit is used for processing the second intermediate feature vector sequence based on the second attention strategy to obtain a third intermediate feature vector sequence.
And the fourth obtaining unit is used for processing the third intermediate feature vector sequence based on the second multi-layer perception strategy to obtain a fourth intermediate feature vector sequence.
And a fifth obtaining unit, configured to process the fourth intermediate feature vector sequence to obtain target spectrum sequence data.
According to an embodiment of the present disclosure, the third obtaining module 640 may include a fifth obtaining sub-module.
The fifth obtaining sub-module is configured to process the target spectrum sequence data by using the vocoder to obtain the target language voice data corresponding to the source language voice data.
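As a rough stand-in for such a vocoder, a Griffin-Lim-based mel inversion from librosa is sketched below; a trained neural vocoder would normally be used instead, and the parameters here are illustrative assumptions only.

```python
import librosa

def spectrum_to_speech(target_mel_sequence, sr=22050, n_fft=1024, hop_length=256):
    """Convert a (frames, n_mels) target spectrum sequence back to a waveform."""
    mel = target_mel_sequence.T                     # back to (n_mels, frames) for librosa
    wav = librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
    return wav                                      # target language voice data (waveform)
```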
According to an embodiment of the present disclosure, the first determination module 610 may include a sixth acquisition sub-module, a seventh acquisition sub-module, and a first determination sub-module.
And a sixth obtaining sub-module, configured to pre-process the source language voice data to obtain source linear spectrum sequence data corresponding to the source language voice data.
And a seventh obtaining sub-module, configured to process the source linear spectrum sequence data to obtain source mel spectrum sequence data corresponding to the source language voice data.
A first determination submodule for determining the source mel-spectrum sequence data as source spectrum sequence data.
According to an embodiment of the present disclosure, the target feature vector sequence is obtained by processing the source spectral sequence data and the first position-coded sequence data using an encoder included in the speech translation model.
According to an embodiment of the present disclosure, the target spectral sequence data is obtained by processing the target feature vector sequence and the second position-encoded sequence data using a decoder included in the speech translation model.
Fig. 7 schematically illustrates a block diagram of a model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the model training apparatus 700 may include a second determining module 710, a fourth obtaining module 720, a fifth obtaining module 730, and a sixth obtaining module 740.
The second determining module 710 is configured to respectively determine source sample spectrum sequence data corresponding to the source sample language voice data and real spectrum sequence data corresponding to the target sample language voice data. The source sample spectrum sequence data includes at least one source sample spectrum data, and the target sample language voice data is obtained by translating the source sample language voice data.
The fourth obtaining module 720 is configured to perform feature extraction on the source sample spectrum sequence data and the first sample position code sequence data to obtain a sample feature vector sequence. The first sample position encoded sequence data includes a sample position encoding corresponding to at least one source sample spectral data.
A fifth obtaining module 730, configured to process the sample feature vector sequence and the second sample position encoded sequence data to obtain predicted spectrum sequence data. The second sample position-coded sequence data includes a position code corresponding to the sample feature vector sequence.
A sixth obtaining module 740 is configured to train the predetermined model by using the real spectrum sequence data and the predicted spectrum sequence data to obtain a speech translation model.
According to an embodiment of the present disclosure, the predetermined model includes an encoder.
According to an embodiment of the present disclosure, the fourth obtaining module 720 may include an eighth obtaining sub-module and a ninth obtaining sub-module.
And an eighth obtaining sub-module, configured to obtain intermediate sample code sequence data according to the source sample spectrum sequence data and the first sample position code sequence data.
And a ninth obtaining submodule, configured to process the intermediate sample coding sequence data by using the encoder to obtain a sample feature vector sequence.
According to an embodiment of the present disclosure, an encoder includes N encoding units in cascade, the encoding units including a first attention layer and a first feedforward neural network layer, N being an integer greater than or equal to 1.
According to an embodiment of the present disclosure, the ninth obtaining sub-module may include a sixth obtaining unit, a seventh obtaining unit, and an eighth obtaining unit.
A sixth obtaining unit, configured to process the first intermediate sample feature vector sequence of the (i-1) th level by using the first attention layer of the i-th level to obtain a second intermediate sample feature vector sequence of the i-th level, where i is greater than 1 and less than or equal to N.
A seventh obtaining unit, configured to process the second intermediate sample feature vector sequence of the ith level by using the first feedforward neural network layer of the ith level, so as to obtain the first intermediate sample feature vector sequence of the ith level.
And an eighth obtaining unit, configured to obtain a sample feature vector sequence according to the first intermediate sample feature vector sequence of the nth level.
According to an embodiment of the present disclosure, the predetermined model further comprises a decoder.
According to an embodiment of the present disclosure, the fifth obtaining module 730 may include a tenth obtaining sub-module and an eleventh obtaining sub-module.
And a tenth obtaining submodule, configured to obtain a third intermediate sample feature vector sequence according to the sample feature vector sequence and the second sample position coding sequence data.
An eleventh obtaining sub-module is configured to process the third intermediate sample feature vector sequence with a decoder to obtain predicted spectrum sequence data.
According to an embodiment of the present disclosure, the decoder includes N decoding units including a second attention layer and a second feedforward neural network layer.
According to an embodiment of the present disclosure, the eleventh obtaining sub-module may include a ninth obtaining unit, a tenth obtaining unit, and an eleventh obtaining unit.
A ninth obtaining unit, configured to process the fourth intermediate sample feature vector sequence of the (i+1)-th level by using the second attention layer of the i-th level to obtain a fifth intermediate sample feature vector sequence of the i-th level, where i is greater than or equal to 1 and less than N.
A tenth obtaining unit, configured to process the fifth intermediate sample feature vector sequence of the ith level by using the second feedforward neural network layer of the ith level, to obtain a fourth intermediate sample feature vector sequence of the ith level.
An eleventh obtaining unit, configured to process the fourth intermediate sample feature vector sequence of the 1 st level to obtain predicted spectrum sequence data.
According to an embodiment of the present disclosure, the sixth obtaining module 740 may include a twelfth obtaining sub-module, an adjusting sub-module, and a second determining sub-module.
And a twelfth obtaining sub-module, configured to obtain an output value based on the loss function by using the real spectrum sequence data and the predicted spectrum sequence data.
And the adjusting sub-module is used for adjusting the model parameters of the preset model according to the output value until the preset condition is met.
And a second determination sub-module for determining a model obtained in the case that a predetermined condition is satisfied as a speech translation model.
According to an embodiment of the present disclosure, the training apparatus 700 may further include a third determining module, a fourth determining module, a seventh obtaining module, and an eighth obtaining module.
The third determining module is configured to determine the target time period when the first time period is determined to be inconsistent with the second time period. The first time period represents the duration of the initial source sample language voice data, the second time period represents the duration of the initial target sample language voice data, and the target time period is whichever of the two has the smaller value.
The fourth determining module is configured to determine non-human voice data in the sample language voice data, where the sample language voice data is the voice data corresponding to the target time period.
The seventh obtaining module is configured to process the sample language voice data by using the non-human voice data to obtain the processed sample language voice data.
The eighth obtaining module is configured to obtain the source sample language voice data and the target sample language voice data according to the processed sample language voice data and the sample language voice data corresponding to the non-target time period, where the non-target time period is whichever of the first time period and the second time period has the larger value.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described above.
According to an embodiment of the present disclosure, a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
Fig. 8 schematically illustrates a block diagram of an electronic device suitable for implementing a speech translation method and a training method according to an embodiment of the disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as a speech translation method and a model training method. For example, in some embodiments, the speech translation method and model training method may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the speech translation method and model training method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the speech translation method and the model training method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (19)

1. A speech translation method, comprising:
determining source spectrum sequence data corresponding to source language voice data, wherein the source spectrum sequence data comprises at least one source spectrum data;
performing feature extraction on the source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, wherein the first position coding sequence data comprises position codes corresponding to the at least one source spectrum data;
Processing the target feature vector sequence and second position coding sequence data to obtain target spectrum sequence data, wherein the second position coding sequence data comprises position codes corresponding to the target feature vector sequence; and
processing the target frequency spectrum sequence data to obtain target language voice data corresponding to the source language voice data;
wherein the first position coding sequence data is obtained according to a position code corresponding to the at least one source spectrum data, and the position code corresponding to the at least one source spectrum data is obtained by coding the position of the at least one source spectrum data by using a position coding method; the second position coding sequence data is obtained according to the position coding of at least one target feature vector included in the target feature vector sequence, and the position coding corresponding to the at least one target feature vector is obtained by coding the position of the at least one target feature vector by using the position coding method; the position encoding method includes a learning position vector method.
2. The method of claim 1, wherein the feature extracting the source spectrum sequence data and the first position encoded sequence data to obtain a target feature vector sequence comprises:
Obtaining intermediate coding sequence data according to the source spectrum sequence data and the first position coding sequence data; and
and extracting the characteristics of the intermediate coding sequence data to obtain the target characteristic vector sequence.
3. The method according to claim 2, wherein the feature extraction of the intermediate code sequence data to obtain the target feature vector sequence includes:
processing the intermediate code sequence data based on a first attention strategy to obtain a first intermediate feature vector sequence; and
and processing the first intermediate feature vector sequence based on a first multi-layer perception strategy to obtain the target feature vector sequence.
4. A method according to any one of claims 1 to 3, wherein said processing the target feature vector sequence and the second position-coded sequence data to obtain target spectral sequence data comprises:
obtaining a second intermediate feature vector sequence according to the target feature vector sequence and the second position coding sequence data; and
and processing the second intermediate feature vector sequence to obtain the target spectrum sequence data.
5. The method of claim 4, wherein the processing the second intermediate feature vector sequence to obtain the target spectral sequence data comprises:
processing the second intermediate feature vector sequence based on a second attention strategy to obtain a third intermediate feature vector sequence;
processing the third intermediate feature vector sequence based on a second multi-layer perception strategy to obtain a fourth intermediate feature vector sequence; and
and processing the fourth intermediate feature vector sequence to obtain the target spectrum sequence data.
6. A method according to any one of claims 1 to 3, wherein said processing said target spectral sequence data to obtain target language speech data corresponding to said source language speech data comprises:
and processing the target frequency spectrum sequence data by using a vocoder to obtain target language voice data corresponding to the source language voice data.
7. A method according to any one of claims 1 to 3, wherein said determining source spectral sequence data corresponding to source language speech data comprises:
preprocessing the source language voice data to obtain source linear spectrum sequence data corresponding to the source language voice data;
Processing the source linear spectrum sequence data to obtain source Mel spectrum sequence data corresponding to the source language voice data; and
and determining the source Mel spectrum sequence data as the source spectrum sequence data.
8. The method of claim 1, wherein the target feature vector sequence is derived by processing the source spectral sequence data and the first position-coded sequence data with an encoder included in a speech translation model;
wherein the target spectrum sequence data is obtained by processing the target feature vector sequence and the second position coding sequence data by a decoder included in the speech translation model.
9. A model training method, comprising:
respectively determining source sample spectrum sequence data corresponding to source sample language voice data and real spectrum sequence data corresponding to target sample language voice data, wherein the source sample spectrum sequence data comprises at least one source sample spectrum data, and the target sample language voice data is obtained by translating the source sample language voice data;
performing feature extraction on the source sample spectrum sequence data and first sample position coding sequence data to obtain a sample feature vector sequence, wherein the first sample position coding sequence data comprises sample position codes corresponding to the at least one source sample spectrum data;
Processing the sample feature vector sequence and second sample position coding sequence data to obtain predicted spectrum sequence data, wherein the second sample position coding sequence data comprises position codes corresponding to the sample feature vector sequence; and
training a preset model by utilizing the real spectrum sequence data and the predicted spectrum sequence data to obtain a voice translation model;
wherein the first sample position coding sequence data is obtained according to a sample position code corresponding to the at least one source sample spectrum data, and the sample position code corresponding to the at least one source sample spectrum data is obtained by coding the position of the at least one source sample spectrum data by using a position coding method; the second sample position coding sequence data is obtained according to the position coding of at least one sample feature vector included in the sample feature vector sequence, and the position coding corresponding to the at least one sample feature vector is obtained by coding the position of the at least one sample feature vector by using the position coding method; the position encoding method includes a learning position vector method.
10. The method of claim 9, wherein the predetermined model comprises an encoder;
The feature extraction of the source sample spectrum sequence data and the first sample position coding sequence data to obtain a sample feature vector sequence includes:
obtaining intermediate sample code sequence data according to the source sample spectrum sequence data and the first sample position code sequence data; and
and processing the intermediate sample coding sequence data by using the encoder to obtain the sample characteristic vector sequence.
11. The method of claim 10, wherein the encoder comprises a cascade of N encoding units, the encoding units comprising a first attention layer and a first feed forward neural network layer, N being an integer greater than 1;
wherein said processing said intermediate sample encoded sequence data with said encoder to obtain said sample feature vector sequence comprises:
processing the intermediate sample code sequence data by using a first attention layer of a 1st level to obtain a second intermediate sample feature vector sequence of the 1st level under the condition that i=1; and
processing the second intermediate sample feature vector sequence of the 1st level by using a first feedforward neural network layer of the 1st level to obtain a first intermediate sample feature vector sequence of the 1st level;
under the condition that i is greater than 1 and less than or equal to N, processing the first intermediate sample feature vector sequence of the (i-1)-th level by using the first attention layer of the i-th level to obtain a second intermediate sample feature vector sequence of the i-th level;
processing the second intermediate sample feature vector sequence of the ith level by using a first feedforward neural network layer of the ith level to obtain a first intermediate sample feature vector sequence of the ith level; and
and obtaining the sample characteristic vector sequence according to the first intermediate sample characteristic vector sequence of the Nth level.
12. The method of claim 10 or 11, wherein the predetermined model further comprises a decoder;
the processing the sample feature vector sequence and the second sample position coding sequence data to obtain predicted spectrum sequence data includes:
obtaining a third intermediate sample feature vector sequence according to the sample feature vector sequence and the second sample position coding sequence data; and
and processing the third intermediate sample feature vector sequence by using the decoder to obtain the predicted spectrum sequence data.
13. The method of claim 12, wherein the decoder comprises N decoding units including a second attention layer and a second feedforward neural network layer;
Wherein said processing said third intermediate sample feature vector sequence with said decoder to obtain said predicted spectral sequence data comprises:
under the condition that i is greater than or equal to 1 and less than N, processing a fourth intermediate sample feature vector sequence of an (i+1)-th level by using a second attention layer of the i-th level to obtain a fifth intermediate sample feature vector sequence of the i-th level;
processing the fifth intermediate sample feature vector sequence of the ith level by using a second feedforward neural network layer of the ith level to obtain a fourth intermediate sample feature vector sequence of the ith level; and
and processing the fourth intermediate sample feature vector sequence of the 1 st level to obtain the predicted spectrum sequence data.
14. The method according to claim 10 or 11, wherein said training a predetermined model with said real spectral sequence data and said predicted spectral sequence data resulting in a speech translation model comprises:
based on a loss function, obtaining an output value by utilizing the real spectrum sequence data and the predicted spectrum sequence data;
adjusting model parameters of the preset model according to the output value until preset conditions are met; and
and determining a model obtained when the predetermined condition is satisfied as the speech translation model.
15. The method of claim 10 or 11, further comprising:
determining a target time period under the condition that a first time period is inconsistent with a second time period, wherein the target time period is the time period with the smaller value among the first time period and the second time period, the first time period represents the time period of initial source sample language voice data, and the second time period represents the time period of initial target sample language voice data;
determining non-human voice data in sample language voice data, wherein the sample language voice data is voice data corresponding to the target time period;
processing the sample language voice data by using the non-human voice data to obtain processed sample language voice data; and
and obtaining the source sample language voice data and the target sample language voice data according to the processed sample language voice data and sample language voice data corresponding to a non-target time period, wherein the non-target time period is the time period with the larger value among the first time period and the second time period.
16. A speech translation apparatus comprising:
a first determining module, configured to determine source spectrum sequence data corresponding to source language speech data, where the source spectrum sequence data includes at least one source spectrum data;
The first obtaining module is used for extracting features of the source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, wherein the first position coding sequence data comprises position codes corresponding to the at least one source spectrum data;
the second obtaining module is used for processing the target feature vector sequence and second position coding sequence data to obtain target spectrum sequence data, wherein the second position coding sequence data comprises position codes corresponding to the target feature vector sequence; and
the third obtaining module is used for processing the target frequency spectrum sequence data to obtain target language voice data corresponding to the source language voice data;
wherein the first position coding sequence data is obtained according to a position code corresponding to the at least one source spectrum data, and the position code corresponding to the at least one source spectrum data is obtained by coding the position of the at least one source spectrum data by using a position coding method; the second position coding sequence data is obtained according to the position coding of at least one target feature vector included in the target feature vector sequence, and the position coding corresponding to the at least one target feature vector is obtained by coding the position of the at least one target feature vector by using the position coding method; the position encoding method includes a learning position vector method.
17. A model training apparatus comprising:
the second determining module is used for respectively determining source sample spectrum sequence data corresponding to source sample language voice data and real spectrum sequence data corresponding to target sample language voice data, wherein the source sample spectrum sequence data comprises at least one source sample spectrum data, and the target sample language voice data is obtained by translating the source sample language voice data;
a fourth obtaining module, configured to perform feature extraction on the source sample spectrum sequence data and first sample position coding sequence data to obtain a sample feature vector sequence, where the first sample position coding sequence data includes a sample position code corresponding to the at least one source sample spectrum data;
a fifth obtaining module, configured to process the sample feature vector sequence and second sample position code sequence data to obtain predicted spectrum sequence data, where the second sample position code sequence data includes a sample position code corresponding to the sample feature vector sequence; and
the sixth obtaining module is used for training a preset model by utilizing the real spectrum sequence data and the predicted spectrum sequence data to obtain a voice translation model;
wherein the first sample position coding sequence data is obtained according to a sample position code corresponding to the at least one source sample spectrum data, and the sample position code corresponding to the at least one source sample spectrum data is obtained by coding the position of the at least one source sample spectrum data by using a position coding method; the second sample position code sequence data is obtained according to the position coding of at least one sample feature vector included in the sample feature vector sequence, and the position coding corresponding to the at least one sample feature vector is obtained by coding the position of the at least one sample feature vector by using the position coding method; the position encoding method includes a learning position vector method.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8 or any one of claims 9 to 15.
19. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-8 or any one of claims 9-15.
CN202210110163.XA 2022-01-28 2022-01-28 Speech translation and model training method, device, electronic equipment and storage medium Active CN114495977B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210110163.XA CN114495977B (en) 2022-01-28 2022-01-28 Speech translation and model training method, device, electronic equipment and storage medium
PCT/CN2022/113695 WO2023142454A1 (en) 2022-01-28 2022-08-19 Speech translation and model training methods, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210110163.XA CN114495977B (en) 2022-01-28 2022-01-28 Speech translation and model training method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114495977A CN114495977A (en) 2022-05-13
CN114495977B true CN114495977B (en) 2024-01-30

Family

ID=81479198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210110163.XA Active CN114495977B (en) 2022-01-28 2022-01-28 Speech translation and model training method, device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114495977B (en)
WO (1) WO2023142454A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495977B (en) * 2022-01-28 2024-01-30 北京百度网讯科技有限公司 Speech translation and model training method, device, electronic equipment and storage medium
CN116524955A (en) * 2023-07-05 2023-08-01 上海蜜度信息技术有限公司 Speech translation and model training method, system and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000045374A1 (en) * 1999-01-29 2000-08-03 Sony Electronics, Inc. A method and portable apparatus for performing spoken language translation
CN101727904A (en) * 2008-10-31 2010-06-09 国际商业机器公司 Voice translation method and device
GB201804073D0 (en) * 2018-03-14 2018-04-25 Papercup Tech Limited A speech processing system and a method of processing a speech signal
CN108682436A (en) * 2018-05-11 2018-10-19 北京海天瑞声科技股份有限公司 Voice alignment schemes and device
WO2019060160A1 (en) * 2017-09-25 2019-03-28 Google Llc Speech translation device and associated method
CN111222347A (en) * 2020-04-15 2020-06-02 北京金山数字娱乐科技有限公司 Sentence translation model training method and device and sentence translation method and device
CN111783477A (en) * 2020-05-13 2020-10-16 厦门快商通科技股份有限公司 Voice translation method and system
CN112204653A (en) * 2019-03-29 2021-01-08 谷歌有限责任公司 Direct speech-to-speech translation through machine learning
CN113971218A (en) * 2021-09-09 2022-01-25 北京小米移动软件有限公司 Position coding method, position coding device and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766414B (en) * 2018-06-29 2021-01-15 北京百度网讯科技有限公司 Method, apparatus, device and computer-readable storage medium for speech translation
CN110503945B (en) * 2019-09-06 2022-07-08 北京金山数字娱乐科技有限公司 Training method and device of voice processing model
CN111368559A (en) * 2020-02-28 2020-07-03 北京字节跳动网络技术有限公司 Voice translation method and device, electronic equipment and storage medium
CN113505610B (en) * 2021-07-09 2022-05-06 中国人民解放军战略支援部队信息工程大学 Model enhancement-based speech translation model training method and system, and speech translation method and equipment
US11361780B2 (en) * 2021-12-24 2022-06-14 Sandeep Dhawan Real-time speech-to-speech generation (RSSG) apparatus, method and a system therefore
CN114495977B (en) * 2022-01-28 2024-01-30 北京百度网讯科技有限公司 Speech translation and model training method, device, electronic equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Direct speech-to-speech translation with a sequence-to-sequence model; Ye Jia et al.; arXiv:1904.06037v2 *

Also Published As

Publication number Publication date
CN114495977A (en) 2022-05-13
WO2023142454A1 (en) 2023-08-03

Similar Documents

Publication Publication Date Title
CN112382271B (en) Voice processing method, device, electronic equipment and storage medium
TWI610295B (en) Computer-implemented method of decompressing and compressing transducer data for speech recognition and computer-implemented system of speech recognition
CN114495977B (en) Speech translation and model training method, device, electronic equipment and storage medium
CN113590858B (en) Target object generation method and device, electronic equipment and storage medium
CN113129870B (en) Training method, device, equipment and storage medium of speech recognition model
CN112466288A (en) Voice recognition method and device, electronic equipment and storage medium
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN114141228B (en) Training method of speech synthesis model, speech synthesis method and device
CN114895817B (en) Interactive information processing method, network model training method and device
CN109697978B (en) Method and apparatus for generating a model
CN113239157B (en) Method, device, equipment and storage medium for training conversation model
KR20220116395A (en) Method and apparatus for determining pre-training model, electronic device and storage medium
CN114242113B (en) Voice detection method, training device and electronic equipment
CN115358243A (en) Training method, device, equipment and storage medium for multi-round dialogue recognition model
CN113468857A (en) Method and device for training style conversion model, electronic equipment and storage medium
CN112634880A (en) Speaker identification method, device, equipment, storage medium and program product
CN113963358B (en) Text recognition model training method, text recognition device and electronic equipment
CN113689866B (en) Training method and device of voice conversion model, electronic equipment and medium
CN113889089A (en) Method and device for acquiring voice recognition model, electronic equipment and storage medium
CN114783428A (en) Voice translation method, voice translation device, voice translation model training method, voice translation model training device, voice translation equipment and storage medium
CN114898734A (en) Pre-training method and device based on speech synthesis model and electronic equipment
CN114187892A (en) Style migration synthesis method and device and electronic equipment
CN113553413A (en) Dialog state generation method and device, electronic equipment and storage medium
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN113793598B (en) Training method of voice processing model, data enhancement method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant