CN114495977A - Speech translation and model training method, device, electronic equipment and storage medium - Google Patents

Speech translation and model training method, device, electronic equipment and storage medium

Info

Publication number
CN114495977A
CN114495977A (application CN202210110163.XA)
Authority
CN
China
Prior art keywords
sample
data
sequence data
source
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210110163.XA
Other languages
Chinese (zh)
Other versions
CN114495977B (en)
Inventor
梁芸铭
赵情恩
熊新雷
陈蓉
张银辉
周羊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210110163.XA priority Critical patent/CN114495977B/en
Publication of CN114495977A publication Critical patent/CN114495977A/en
Priority to PCT/CN2022/113695 priority patent/WO2023142454A1/en
Application granted granted Critical
Publication of CN114495977B publication Critical patent/CN114495977B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a speech translation method, a model training method, an apparatus, an electronic device and a storage medium, and relates to the technical field of artificial intelligence, in particular to the technical fields of speech translation, speech synthesis and deep learning. A specific implementation scheme is as follows: determining source spectrum sequence data corresponding to source language speech data, wherein the source spectrum sequence data comprises at least one source spectrum data; performing feature extraction on the source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, wherein the first position coding sequence data comprises a position code corresponding to the at least one source spectrum data; processing the target feature vector sequence and second position coding sequence data to obtain target spectrum sequence data, wherein the second position coding sequence data comprises a position code corresponding to the target feature vector sequence; and processing the target spectrum sequence data to obtain target language speech data corresponding to the source language speech data.

Description

Speech translation and model training method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the technical fields of speech translation, speech synthesis and deep learning, and more particularly to a speech translation method, a model training method, an apparatus, an electronic device, and a storage medium.
Background
With the development of artificial intelligence technology, artificial intelligence has been widely applied in various fields. For example, speech translation is widely used in the speech technology field within artificial intelligence.
Speech translation refers to translating source language speech data into target language speech data. The source language speech data and the target language speech data are in different languages.
Disclosure of Invention
The disclosure provides a speech translation method, a model training method, apparatuses, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a speech translation method including: determining source spectral sequence data corresponding to source language speech data, wherein the source spectral sequence data comprises at least one source spectral data; performing feature extraction on the source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, wherein the first position coding sequence data comprises a position code corresponding to the at least one source spectrum data; processing the target feature vector sequence and second position coding sequence data to obtain target spectrum sequence data, wherein the second position coding sequence data comprise position codes corresponding to the target feature vector sequence; and processing the target frequency spectrum sequence data to obtain target language voice data corresponding to the source language voice data.
According to another aspect of the present disclosure, there is provided a model training method, including: respectively determining source sample spectrum sequence data corresponding to source sample language voice data and real spectrum sequence data corresponding to target sample language voice data, wherein the source sample spectrum sequence data comprises at least one source sample spectrum data, and the target sample language voice data is obtained by translating the source sample language voice data; performing feature extraction on the source sample spectrum sequence data and first sample position coding sequence data to obtain a sample feature vector sequence, wherein the first sample position coding sequence data comprises a sample position code corresponding to the at least one source sample spectrum data; processing the sample characteristic vector sequence and second sample position coding sequence data to obtain predicted spectrum sequence data, wherein the second sample position coding sequence data comprises a sample position code corresponding to the sample characteristic vector sequence; and training a predetermined model by using the real spectrum sequence data and the prediction spectrum sequence data to obtain a voice translation model.
According to another aspect of the present disclosure, there is provided a speech translation apparatus including: a first determining module, configured to determine source spectrum sequence data corresponding to source language speech data, where the source spectrum sequence data includes at least one source spectrum data; a first obtaining module, configured to perform feature extraction on the source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, where the first position coding sequence data includes a position code corresponding to the at least one source spectrum data; a second obtaining module, configured to process the target feature vector sequence and second position coding sequence data to obtain target spectrum sequence data, where the second position coding sequence data includes a position code corresponding to the target feature vector sequence; and a third obtaining module, configured to process the target spectrum sequence data to obtain target language speech data corresponding to the source language speech data.
According to another aspect of the present disclosure, there is provided a model training apparatus including: a second determining module, configured to determine source sample spectrum sequence data corresponding to source sample language voice data and real spectrum sequence data corresponding to target sample language voice data, respectively, where the source sample spectrum sequence data includes at least one source sample spectrum data, and the target sample language voice data is obtained by translating the source sample language voice data; a fourth obtaining module, configured to perform feature extraction on the source sample spectrum sequence data and first sample position coding sequence data to obtain a sample feature vector sequence, where the first sample position coding sequence data includes a sample position code corresponding to the at least one source sample spectrum data; a fifth obtaining module, configured to process the sample feature vector sequence and second sample position coding sequence data to obtain predicted spectrum sequence data, where the second sample position coding sequence data includes a sample position code corresponding to the sample feature vector sequence; and a sixth obtaining module, configured to train a predetermined model by using the real spectrum sequence data and the predicted spectrum sequence data, to obtain a speech translation model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture to which speech translation methods, training methods, and apparatus may be applied, according to embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a method of speech translation according to an embodiment of the present disclosure;
FIG. 3A schematically illustrates an example schematic of a speech translation process according to an embodiment of this disclosure;
FIG. 3B schematically shows an example diagram of data in a speech translation process according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a flow chart of a model training method according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates an example schematic diagram of a training process according to an embodiment of the disclosure;
FIG. 6 schematically shows a block diagram of a speech translation apparatus according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a model training apparatus according to an embodiment of the present disclosure; and
FIG. 8 schematically illustrates a block diagram of an electronic device suitable for implementing a speech translation method and a training method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Speech translation may be implemented as follows. First, the source language speech data is converted into source language text data using a speech recognition model. Then, the source language text data is translated into target language text data using a text translation model. Finally, the target language text data is converted into target language speech data using a speech synthesis model.
The above approach requires that both the source language and the target language have a written form, that the source language has a corresponding speech recognition model, and that the target language has a corresponding speech synthesis model. However, many languages have no written form, and it is therefore difficult to obtain corresponding speech recognition and speech synthesis models for them; for such languages the above approach is difficult to apply. In addition, the above approach involves a speech recognition model, a text translation model and a speech synthesis model, and the final result is affected by the errors produced by each of these models, which reduces speech translation quality.
Therefore, embodiments of the present disclosure provide a speech translation scheme: feature extraction is performed on source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, the target feature vector sequence and second position coding sequence data are processed directly to obtain target spectrum sequence data, and the target spectrum sequence data is then processed to obtain target language speech data. In this way, speech translation is performed on spectral data without intermediate text conversion, which avoids the error accumulation of the cascaded speech recognition, text translation and speech synthesis models. Further, since no text translation is required, the scheme can be applied to speech translation for languages without a written form.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the user personal information involved all comply with relevant laws and regulations and do not violate public order and good morals.
Fig. 1 schematically illustrates an exemplary system architecture to which the speech translation method, training method and apparatus may be applied, according to an embodiment of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the speech translation method, the training method, and the apparatus can be applied may include a terminal device, but the terminal device may implement the speech translation method, the training method, and the apparatus provided in the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a knowledge reading application, a web browser application, a search application, an instant messaging tool, a mailbox client, and/or social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be any of various types of servers providing various services. For example, the server 105 may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in conventional physical hosts and VPS (Virtual Private Server) services. The server 105 may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that the speech translation method provided by the embodiment of the present disclosure may be generally executed by the terminal device 101, 102, or 103. Correspondingly, the speech translation apparatus provided by the embodiment of the present disclosure may also be disposed in the terminal device 101, 102, or 103.
Alternatively, the speech translation method provided by the embodiment of the present disclosure may also be generally executed by the server 105. Accordingly, the speech translation apparatus provided by the embodiment of the present disclosure may be generally disposed in the server 105. The speech translation method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the speech translation apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
The model training methods provided by embodiments of the present disclosure may generally be performed by the server 105. Accordingly, the model training apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The model training method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the model training apparatus provided in the embodiments of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
The model training method provided by the embodiments of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Accordingly, the model training apparatus provided by the embodiment of the present disclosure may also be disposed in the terminal device 101, 102, or 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
FIG. 2 schematically shows a flow chart of a method of speech translation according to an embodiment of the present disclosure.
As shown in FIG. 2, the method 200 includes operations S210-S240.
In operation S210, source spectrum sequence data corresponding to source language speech data is determined. The source spectral sequence data includes at least one source spectral data.
In operation S220, feature extraction is performed on the source spectrum sequence data and the first position encoding sequence data to obtain a target feature vector sequence. The first location-encoding sequence data includes a location code corresponding to at least one source spectral data.
In operation S230, the target feature vector sequence and the second position encoding sequence data are processed to obtain target spectrum sequence data. The second position encoding sequence data includes a position code corresponding to the target feature vector sequence.
In operation S240, the target spectrum sequence data is processed to obtain target language speech data corresponding to the source language speech data.
According to embodiments of the present disclosure, the source language may refer to the language to be translated, and the target language may refer to the language into which translation is desired. The source language and the target language are different languages. Languages may include languages with a written form and languages without a written form. A language with a written form refers to a language that has corresponding text; a language without a written form refers to a language that has no corresponding text. Source language speech data may refer to speech data that requires speech translation. Target language speech data may refer to speech data in the target language. The source language speech data may include at least one object, where an object may be a character or a word.
According to an embodiment of the present disclosure, the source spectrum sequence data may be obtained by performing acoustic feature extraction on the source language speech data. The source language speech data may refer to speech data of a predetermined period of time. The source spectrum sequence data may include source spectrum data corresponding to at least one object. For example, the source spectrum sequence data may include source spectrum data corresponding to each of the at least one object. Alternatively, the source spectrum sequence data may include source spectrum data corresponding to each of partial objects of the at least one object. The first position coding sequence data may include a position code corresponding to at least one source spectrum data. For example, the first position coding sequence data may include position codes corresponding to each of the at least one source spectrum data. Alternatively, the first position coding sequence data may include position codes corresponding to each of partial source spectrum data of the at least one source spectrum data. The first position code can characterize the absolute position of an object (i.e., source spectrum data) in the source language speech data.
According to an embodiment of the present disclosure, the target feature vector sequence may include at least one target feature vector. The second position coding sequence data may include a position code corresponding to the target feature vector sequence. For example, the second position coding sequence data may include position codes corresponding to each of the at least one target feature vector. Alternatively, the second position coding sequence data may include position codes corresponding to each of partial target feature vectors of the at least one target feature vector. The target spectrum sequence data may be obtained by performing acoustic feature extraction on the target language speech data.
According to the embodiment of the disclosure, the source language speech data can be obtained and preprocessed to obtain the source spectrum sequence data corresponding to the source language speech data. The preprocessing may include at least one of: framing, windowing and acoustic feature extraction. The acoustic features may include at least one of: Fbank (i.e., FilterBank) features, Mel-Frequency Cepstral Coefficients (MFCCs), timbre vectors, zero-crossing rate, subband energy, subband energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off, and tone deviation. For example, the source spectrum sequence data may include source linear spectrum sequence data or source mel spectrum sequence data.
According to the embodiment of the disclosure, the position of at least one source spectrum data included in the source spectrum sequence data can be encoded by using a position encoding method, so as to obtain a position code corresponding to the at least one source spectrum data. And obtaining first position code sequence data according to the position code corresponding to the at least one source frequency spectrum data. For example, the position coding method may be used to code a position corresponding to each of at least one source spectrum data included in the source spectrum sequence data, so as to obtain a position code corresponding to each of the at least one source spectrum data. Alternatively, the positions corresponding to each of the partial source spectrum data in the at least one source spectrum data included in the source spectrum sequence data may be encoded by using a position encoding method, so as to obtain position codes corresponding to each of the partial source spectrum data in the at least one source spectrum data. The position encoding method may include a sine and cosine position encoding method or a learned position vector method.
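As a concrete illustration of the sine and cosine position encoding option mentioned above, here is a minimal Python sketch. The fixed base of 10000 and the even embedding dimension follow the standard Transformer formulation and are assumptions, not values specified in the disclosure.

```python
import numpy as np

def sinusoidal_position_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Sine/cosine position codes: one row per position (e.g. per source spectrum frame).
    Assumes an even dim."""
    positions = np.arange(num_positions)[:, np.newaxis]                  # shape (T, 1)
    div_term = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))   # shape (dim/2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * div_term)   # even dimensions
    pe[:, 1::2] = np.cos(positions * div_term)   # odd dimensions
    return pe

# Fusing with the source spectrum sequence by element-wise addition (dimensions must
# match) yields the intermediate coding sequence data discussed below:
# intermediate = source_spectrum_sequence + sinusoidal_position_encoding(T, dim)
```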
According to the embodiment of the disclosure, after the source spectrum sequence data and the first position encoding sequence data are obtained, fusion processing may be performed on the source spectrum sequence data and the first position encoding sequence data to obtain a fusion result, and then feature extraction may be performed on the fusion result to obtain a target feature vector sequence.
According to the embodiment of the present disclosure, after the target feature vector sequence is obtained, the position of at least one target feature vector included in the target feature vector sequence may be encoded by using a position encoding method, so as to obtain a position code corresponding to the at least one target feature vector. And obtaining second position coding sequence data according to the position codes corresponding to the at least one target feature vector.
According to the embodiment of the disclosure, the target feature vector sequence and the second position coding sequence can be decoded to obtain target spectrum sequence data. And then processing the target frequency spectrum sequence data to obtain target speech voice data.
According to the embodiment of the present disclosure, feature extraction is performed on the source spectrum sequence data and the first position coding sequence data to obtain the target feature vector sequence, the target feature vector sequence and the second position coding sequence data are processed directly to obtain the target spectrum sequence data, and the target spectrum sequence data is then processed to obtain the target language speech data. Speech translation is thus performed without intermediate text conversion, which avoids the error accumulation of cascaded models. Further, since no text translation is required, the scheme can be applied to speech translation for languages without a written form.
According to an embodiment of the present disclosure, operation S210 may include the following operations.
And preprocessing the source language voice data to obtain source linear spectrum sequence data corresponding to the source language voice data. And processing the source linear spectrum sequence data to obtain source Mel spectrum sequence data corresponding to the source language voice data. The source mel spectral sequence data is determined as source spectral sequence data.
According to the embodiment of the disclosure, source language speech matrix data can be obtained by performing framing and windowing on the source language speech data. The source language speech matrix data may include at least one frame of source language speech sub-data, and two adjacent frames of source language speech sub-data may overlap. After the source language speech matrix data is obtained, a short-time Fourier transform may be performed on it to obtain source language speech matrix data in the frequency domain, that is, source linear spectrum sequence data. The source linear spectrum sequence data may then be processed by a mel filter to obtain the source mel spectrum sequence data.
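The preprocessing pipeline above (framing, windowing, short-time Fourier transform, mel filtering) could look roughly like the following Python sketch using librosa. The sampling rate, window length, hop length and number of mel bands are illustrative assumptions, not values given in the disclosure.

```python
import numpy as np
import librosa

def source_mel_spectrogram(wav_path: str,
                           sr: int = 16000,
                           n_fft: int = 1024,      # window length (assumed)
                           hop_length: int = 256,  # frame shift; adjacent frames overlap
                           n_mels: int = 80) -> np.ndarray:
    """Framing + windowing + STFT -> source linear spectrum sequence data,
    then a mel filter bank -> source mel spectrum sequence data."""
    y, _ = librosa.load(wav_path, sr=sr)
    # Short-time Fourier transform over overlapping, windowed frames
    linear = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window="hann"))
    # Apply a mel filter bank to the linear spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = mel_fb @ linear                 # (n_mels, num_frames)
    return mel.T                          # one mel vector per frame
```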
According to an embodiment of the present disclosure, mel spectrum sequence data can reflect speech characteristics, and the mel frequency scale conforms to the auditory characteristics of the human ear. Based on the frequency peaks in the mel spectrum sequence data, the formants of the speech and the boundaries between phonemes can be shown more clearly. Using the source mel spectrum sequence data as the source spectrum sequence data in speech translation therefore helps clarify the boundary relations between different objects in the source language speech data, reduces the time needed for segmentation and recognition, and improves speech translation speed.
According to an embodiment of the present disclosure, operation S220 may include the following operations.
Intermediate coding sequence data is obtained according to the source spectrum sequence data and the first position coding sequence data. Feature extraction is performed on the intermediate coding sequence data to obtain the target feature vector sequence.
According to an embodiment of the present disclosure, the number of dimensions of the source spectral sequence data and the position encoding sequence data may be the same. The source spectral sequence data and the position code sequence data may be summed to obtain intermediate code sequence data.
According to the embodiment of the disclosure, the feature extraction of the intermediate coding sequence data to obtain the target feature vector sequence may include the following operations.
And processing the intermediate coding sequence data based on the first attention strategy to obtain a first intermediate feature vector sequence. And processing the first intermediate characteristic vector sequence based on a first multilayer perception strategy to obtain a target characteristic vector sequence.
According to the embodiment of the disclosure, the attention strategy can be used for realizing that important information is focused with high weight, non-important information is ignored with low weight, and information exchange can be carried out with other information by sharing the important information, so that the important information is transferred. The first attention layer may be determined according to a first attention strategy. A first feed-forward neural network layer may be determined according to a first multi-layer perceptual strategy.
According to the embodiment of the present disclosure, the target feature vector sequence is obtained by processing the source spectrum sequence data and the first position encoding sequence data by an encoder included in the speech translation model.
According to an embodiment of the present disclosure, the speech translation model may include an encoder. The source spectral sequence data and the first position encoded sequence data may be processed by an encoder to obtain a target feature vector sequence. For example, the intermediate encoded sequence data is obtained based on the source spectral sequence data and the first location encoded sequence data. And processing the intermediate coding sequence data by using an encoder to obtain a target feature vector sequence.
According to an embodiment of the present disclosure, the encoder may include a cascade of N encoding units. The encoding unit may include a first attention layer and a first feedforward neural network layer. N is an integer greater than or equal to 1.
According to the embodiment of the disclosure, processing the intermediate coding sequence data by using the encoder to obtain the target feature vector sequence may include the following operations.
In a case where i is 1, the intermediate coding sequence data is processed by the first attention layer of the 1st level to obtain a first intermediate feature vector sequence of the 1st level. The first intermediate feature vector sequence of the 1st level is processed by the first feedforward neural network layer of the 1st level to obtain a fifth intermediate feature vector sequence of the 1st level.
In a case where i is greater than 1 and less than or equal to N, the fifth intermediate feature vector sequence of the (i-1)-th level is processed by the first attention layer of the i-th level to obtain a sixth intermediate feature vector sequence of the i-th level. The sixth intermediate feature vector sequence of the i-th level is processed by the first feedforward neural network layer of the i-th level to obtain a fifth intermediate feature vector sequence of the i-th level. The target feature vector sequence is obtained according to the fifth intermediate feature vector sequence of the N-th level.
According to the embodiment of the present disclosure, the value of N may be configured according to actual service requirements, and is not limited herein. For example, N = 6.
According to an embodiment of the present disclosure, obtaining a target feature vector sequence according to the fifth intermediate feature vector sequence of the nth level may include: a fifth intermediate feature vector sequence of the nth level may be determined as the target feature vector sequence.
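To make the cascade of N encoding units concrete, here is a minimal PyTorch sketch, assuming N = 6 and a 512-dimensional feature space. The residual connections, layer normalization and specific dimensions are standard Transformer details assumed here, not spelled out in the disclosure.

```python
import torch
from torch import nn

class EncodingUnit(nn.Module):
    """One encoding unit: a first attention layer followed by a first feedforward layer."""
    def __init__(self, dim: int = 512, heads: int = 8, ff_dim: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)          # self-attention over the sequence
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))

class Encoder(nn.Module):
    """Cascade of N encoding units; the level-N output is the target feature vector sequence."""
    def __init__(self, dim: int = 512, n_levels: int = 6):
        super().__init__()
        self.levels = nn.ModuleList([EncodingUnit(dim) for _ in range(n_levels)])

    def forward(self, intermediate_coding: torch.Tensor) -> torch.Tensor:
        x = intermediate_coding                   # spectrum sequence + position codes
        for level in self.levels:                 # level i consumes the level (i-1) output
            x = level(x)
        return x
```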
Operation S230 may include the following operations according to an embodiment of the present disclosure.
And obtaining a second intermediate feature vector sequence according to the target feature vector sequence and the second position coding sequence data. And processing the second intermediate characteristic vector sequence to obtain target frequency spectrum sequence data.
According to the embodiment of the disclosure, the target feature vector sequence and the second position coding sequence data may be subjected to addition processing to obtain a second intermediate feature vector sequence. And decoding the second intermediate characteristic vector sequence to obtain target frequency spectrum sequence data.
According to an embodiment of the present disclosure, processing the second intermediate feature vector sequence to obtain the target spectrum sequence data may include the following operations.
And processing the second intermediate feature vector sequence based on the second attention strategy to obtain a third intermediate feature vector sequence. And processing the third intermediate feature vector sequence based on the second multilayer perception strategy to obtain a fourth intermediate feature vector sequence. And processing the fourth intermediate characteristic vector sequence to obtain target frequency spectrum sequence data.
According to an embodiment of the present disclosure, the second attention layer may be determined according to a second attention strategy. And processing the second intermediate feature vector sequence by utilizing the second attention layer to obtain a third intermediate feature vector sequence. A second feed-forward neural network layer is determined according to a second multi-layer perceptual strategy. And processing the third intermediate feature vector sequence by utilizing a second feedforward neural network layer to obtain a fourth intermediate feature vector sequence.
According to an embodiment of the present disclosure, the target spectral sequence data is obtained by processing the target feature vector sequence and the second position encoding sequence data using a decoder included in the speech translation model.
According to an embodiment of the present disclosure, the speech translation model may include a decoder. The decoder may be configured to process a second intermediate feature vector sequence obtained from the target feature vector sequence and the second position-encoding sequence to obtain target spectrum sequence data.
According to an embodiment of the present disclosure, a decoder may include N decoding units. The decoding unit may include a second attention layer and a second feedforward neural network layer.
According to an embodiment of the present disclosure, processing the second intermediate feature vector sequence by a decoder to obtain the target spectrum sequence data may include the following operations.
In a case where i is equal to N, the second intermediate feature vector sequence is processed by the second attention layer of the N-th level to obtain a third intermediate feature vector sequence of the N-th level. The third intermediate feature vector sequence of the N-th level is processed by the second feedforward neural network layer of the N-th level to obtain a fourth intermediate feature vector sequence of the N-th level.
In a case where i is greater than or equal to 1 and less than N, the fourth intermediate feature vector sequence of the (i+1)-th level is processed by the second attention layer of the i-th level to obtain a third intermediate feature vector sequence of the i-th level. The third intermediate feature vector sequence of the i-th level is processed by the second feedforward neural network layer of the i-th level to obtain a fourth intermediate feature vector sequence of the i-th level. The fourth intermediate feature vector sequence of the 1st level is processed to obtain the target spectrum sequence data.
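Correspondingly, a minimal PyTorch sketch of the N decoding units could look like the following. Cross-attention to the encoder output is not detailed in the passage above, so this sketch simply chains self-attention decoding units and ends with a linear projection to the spectrum dimension; all dimensions are assumptions.

```python
import torch
from torch import nn

class DecodingUnit(nn.Module):
    """One decoding unit: a second attention layer followed by a second feedforward layer."""
    def __init__(self, dim: int = 512, heads: int = 8, ff_dim: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))

class Decoder(nn.Module):
    """N decoding units plus a projection from feature space to the target spectrum."""
    def __init__(self, dim: int = 512, n_levels: int = 6, n_mels: int = 80):
        super().__init__()
        self.levels = nn.ModuleList([DecodingUnit(dim) for _ in range(n_levels)])
        self.to_spectrum = nn.Linear(dim, n_mels)

    def forward(self, second_intermediate: torch.Tensor) -> torch.Tensor:
        x = second_intermediate                   # target features + second position codes
        for level in self.levels:
            x = level(x)
        return self.to_spectrum(x)                # target spectrum sequence data
```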
According to an embodiment of the present disclosure, operation S240 may include the following operations.
The target spectrum sequence data is processed by using a vocoder to obtain target language speech data corresponding to the source language speech data.
According to embodiments of the present disclosure, the vocoder may be a speech analysis and synthesis system. The vocoder can be used to reconstruct the target spectrum sequence data to obtain the target language speech data corresponding to the source language speech data. For example, in the process of synthesizing the target spectrum sequence data into the target language speech data, the response of the vocal tract is modeled by linear prediction; that is, the target spectrum sequence data is reconstructed based on linear prediction, and speech synthesis is performed on the reconstructed target spectrum sequence data to obtain the target language speech data.
According to an embodiment of the present disclosure, if the target spectrum sequence data is target linear spectrum sequence data, the target linear spectrum sequence data may be converted into target mel spectrum sequence data, and the target mel spectrum sequence data is processed by the vocoder to obtain the target language speech data corresponding to the source language speech data.
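The disclosure describes a linear-prediction-based vocoder; as a simpler, openly available stand-in, the following Python sketch reconstructs a waveform from a mel spectrum with librosa's mel inversion plus Griffin-Lim phase estimation. This is an illustrative substitute, not the vocoder of the disclosure, and its parameters must match those used for feature extraction.

```python
import numpy as np
import librosa

def mel_to_waveform(mel: np.ndarray,
                    sr: int = 16000,
                    n_fft: int = 1024,
                    hop_length: int = 256) -> np.ndarray:
    """Invert a magnitude mel spectrum of shape (num_frames, n_mels) to a waveform:
    mel -> approximate linear spectrum -> Griffin-Lim phase reconstruction."""
    linear = librosa.feature.inverse.mel_to_stft(mel.T, sr=sr, n_fft=n_fft, power=1.0)
    return librosa.griffinlim(linear, hop_length=hop_length)
```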
The speech translation method according to the embodiment of the present disclosure is further described with reference to fig. 3A and 3B in conjunction with specific embodiments.
FIG. 3A schematically shows an example schematic of a speech translation process according to an embodiment of the disclosure.
As shown in FIG. 3A, in 300A, speech translation model 304 includes an encoder 3040 and a decoder 3041.
The source language voice data 301 is preprocessed to obtain source linear spectrum sequence data corresponding to the source language voice data 301. The source linear spectrum sequence data is processed to obtain source mel spectrum sequence data corresponding to the source language speech data 301. The source mel spectral sequence data is determined as source spectral sequence data 302.
A position code corresponding to at least one source spectrum data included in the source spectrum sequence data 302 is determined, resulting in first position code sequence data 303. The intermediate encoded sequence data is obtained based on the source spectral sequence data 302 and the first position encoded sequence data 303.
The intermediate encoded sequence data is processed by the encoder 3040 to obtain the target feature vector sequence 305.
A position code corresponding to at least one target feature vector included in the target feature vector sequence 305 is determined, and a second position code sequence data 306 is obtained.
A second intermediate feature vector sequence is obtained based on the target feature vector sequence 305 and the second position-encoded sequence data 306. The second intermediate feature vector sequence is processed by the decoder 3041 to obtain the target spectrum sequence data 307.
The target spectrum sequence data 307 is processed by the vocoder 308 to obtain target language speech data 309 corresponding to the source language speech data 301.
FIG. 3B schematically shows an example schematic of data in a speech translation process according to an embodiment of the disclosure.
As shown in FIG. 3B, 301 in 300B is the source language speech data 301 in FIG. 3A, 302 is the source spectrum sequence data 302 in FIG. 3A, 307 is the target spectrum sequence data 307 in FIG. 3A, and 309 is the target language speech data 309 in FIG. 3A.
FIG. 4 schematically shows a flow chart of a model training method according to an embodiment of the present disclosure.
As shown in fig. 4, the method 400 includes operations S410 to S440.
In operation S410, source sample spectrum sequence data corresponding to the source sample language voice data and real spectrum sequence data corresponding to the target sample language voice data are respectively determined. The source sample spectrum sequence data includes at least one source sample spectrum data, and the target sample language voice data is translated from the source sample language voice data.
In operation S420, feature extraction is performed on the source sample spectrum sequence data and the first sample position encoding sequence data to obtain a sample feature vector sequence. The first sample position code sequence data includes sample position codes corresponding to at least one source sample spectrum data.
In operation S430, the sample feature vector sequence and the second sample position encoding sequence data are processed to obtain prediction spectrum sequence data. The second sample position code sequence data comprises a sample position code corresponding to the sequence of sample feature vectors.
In operation S440, a predetermined model is trained using the real spectrum sequence data and the predicted spectrum sequence data, resulting in a speech translation model.
According to an embodiment of the present disclosure, the source sample spectral sequence data may include source sample spectral data corresponding to at least one sample object. For example, the source sample spectral sequence data may include source sample spectral data corresponding to each of the at least one sample object. Alternatively, the source sample spectral sequence data may include source sample spectral data corresponding to each of a portion of the at least one sample object. The first sample position code sequence data may include a sample position code corresponding to at least one source sample spectral data. For example, the first sample position encoding sequence data may include sample position encodings corresponding to respective ones of the at least one source sample spectral data. Alternatively, the first sample position code sequence data may include sample position codes corresponding to respective ones of the partial source sample spectral data of the at least one source sample spectral data. The first sample position encoding may characterize the absolute position of an object (i.e., source sample spectral data) in the source sample speech data.
According to an embodiment of the present disclosure, the sequence of sample feature vectors may comprise at least one sample feature vector. The second sample position code sequence data may comprise a sample position code corresponding to at least one sample feature vector. For example, the second sample position code sequence data may comprise sample position codes corresponding to respective ones of the at least one sample feature vector. Alternatively, the second sample position code sequence data may comprise sample position codes corresponding to respective ones of the partial sample feature vectors of the at least one sample feature vector.
According to the embodiment of the disclosure, source sample speech data can be preprocessed to obtain source sample spectrum sequence data. The target sample language voice data can be preprocessed to obtain real frequency spectrum sequence data. The pre-processing may include at least one of: framing, windowing and acoustic feature extraction. For example, the source sample speech matrix data may be obtained by performing framing and windowing on the source sample speech data. And performing short-time Fourier transform on the source sample language voice matrix data to obtain source sample language voice matrix data of a frequency domain, namely source sample linear spectrum sequence data. The source sample linear spectral sequence data may be processed using a mel-filter to obtain source sample mel-spectral sequence data. The source sample mel spectral sequence data is determined as source sample spectral sequence data. The target sample language voice data can be subjected to framing processing and windowing processing to obtain target sample language voice matrix data. And performing Fourier transform on the target sample language voice matrix data to obtain frequency domain target sample language voice matrix data, namely target sample linear spectrum sequence data. The linear spectrum sequence data of the target sample can be processed by utilizing a Mel filter to obtain real Mel spectrum sequence data. The true mel-spectrum sequence data is determined as true spectrum sequence data.
According to an embodiment of the present disclosure, the predetermined model may include an encoder and a decoder. The predetermined model may be a Transformer model.
According to an embodiment of the present disclosure, performing feature extraction on the source sample spectrum sequence data and the first sample position encoding sequence data to obtain a sample feature vector sequence may include: and obtaining intermediate sample coding sequence data according to the source sample spectrum sequence data and the first sample position coding sequence data. And performing feature extraction on the intermediate sample coding sequence data to obtain a sample feature vector sequence.
According to an embodiment of the present disclosure, processing the sample feature vector sequence and the second sample position coding sequence to obtain predicted spectrum sequence data may include: and obtaining a third intermediate sample feature vector sequence according to the sample feature vector sequence and the second sample position coding sequence. And processing the third intermediate sample feature vector sequence to obtain predicted spectrum sequence data.
According to the embodiment of the disclosure, after the predicted spectrum sequence data is obtained, the predetermined model may be trained by using the predicted spectrum sequence data and the real spectrum sequence data to obtain a trained model, and the trained predetermined model is determined as a speech translation model.
According to an embodiment of the present disclosure, the predetermined model may include an encoder.
According to an embodiment of the present disclosure, operation S420 may include the following operations.
Intermediate sample coding sequence data is obtained according to the source sample spectrum sequence data and the first sample position coding sequence data. The intermediate sample coding sequence data is processed by using the encoder to obtain the sample feature vector sequence.
According to an embodiment of the present disclosure, the encoder may include a model structure implementing a first attention strategy and a first multi-layer perception strategy.
According to an embodiment of the present disclosure, the encoder may include a cascade of N encoding units. The encoding unit includes a first attention layer and a first feedforward neural network layer. N is an integer greater than or equal to 1.
According to an embodiment of the present disclosure, processing the intermediate sample coding sequence data with the encoder to obtain the sample feature vector sequence may include the following operations.
When i is 1, the intermediate sample coded sequence data is processed by the first attention level of the 1 st level to obtain a second intermediate sample feature vector sequence of the 1 st level.
And processing the second intermediate sample feature vector sequence of the 1 st level by utilizing the first feedforward neural network layer of the 1 st level to obtain a first intermediate sample feature vector sequence of the 1 st level.
And under the condition that i is more than 1 and less than or equal to N, processing the first intermediate sample feature vector sequence of the (i-1) th level by utilizing the first attention layer of the i th level to obtain a second intermediate sample feature vector sequence of the i th level.
And processing the second intermediate sample feature vector sequence of the ith level by utilizing the first feedforward neural network layer of the ith level to obtain a first intermediate sample feature vector sequence of the ith level. And obtaining a sample feature vector sequence according to the first intermediate sample feature vector sequence of the Nth level.
According to an embodiment of the present disclosure, the predetermined model may further include a decoder.
Operation S430 may include the following operations according to an embodiment of the present disclosure.
A third intermediate sample feature vector sequence is obtained according to the sample feature vector sequence and the second sample position coding sequence data. The third intermediate sample feature vector sequence is processed by using the decoder to obtain the predicted spectrum sequence data.
According to an embodiment of the present disclosure, the decoder may include a model structure implementing a second attention strategy and a second multi-layer perception strategy.
According to an embodiment of the present disclosure, a decoder may include N decoding units. The decoding unit may include a second attention layer and a second feedforward neural network layer.
According to an embodiment of the present disclosure, processing the third intermediate sample feature vector sequence by a decoder to obtain predicted spectrum sequence data may include the following operations.
And under the condition that i is more than or equal to 1 and less than N, processing the fourth intermediate sample feature vector sequence of the (i +1) th level by using the second attention layer of the i th level to obtain a fifth intermediate sample feature vector sequence of the i th level. And processing the fifth intermediate sample feature vector sequence of the ith level by utilizing the second feedforward neural network layer of the ith level to obtain a fourth intermediate sample feature vector sequence of the ith level. And processing the fourth intermediate sample feature vector sequence of the level 1 to obtain predicted spectrum sequence data.
According to an embodiment of the present disclosure, in case that i is equal to N, the third intermediate sample feature vector sequence is processed with the second attention layer of the nth level, resulting in a fifth intermediate sample feature vector sequence of the nth level. And processing the fifth intermediate sample feature vector sequence of the Nth level by utilizing the second feedforward network layer of the Nth level to obtain a fourth intermediate sample feature vector sequence of the Nth level.
Operation S440 may include the following operations according to an embodiment of the present disclosure.
An output value is obtained based on a loss function by using the real spectrum sequence data and the predicted spectrum sequence data. Model parameters of the predetermined model are adjusted according to the output value until a predetermined condition is met. The model obtained in the case where the predetermined condition is satisfied is determined as the speech translation model.
According to embodiments of the present disclosure, the loss function may include a mean square error loss function, a mean pair-wise squared error loss function, or a cross-entropy loss function. The predetermined condition may include at least one of convergence of the output value and a training round reaching a maximum training round.
According to an embodiment of the present disclosure, the real spectrum sequence data and the predicted spectrum sequence data may be input to a loss function, resulting in an output value. The model parameters of the predetermined model may be adjusted according to the output values until a predetermined condition is satisfied. For example, the model parameters of the predetermined model may be adjusted according to a back-propagation algorithm or a stochastic gradient descent algorithm until the predetermined condition is met.
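For illustration, one parameter-update step might look like the sketch below, assuming PyTorch, the mean square error loss function, and the Adam optimizer (the optimizer choice is an assumption; the disclosure names back-propagation or stochastic gradient descent). The name model stands for a predetermined model assembled from the encoder and decoder sketches above, and the predetermined condition would be checked against the returned output value.

loss_fn = nn.MSELoss()                                          # mean square error loss function
model = nn.Sequential(Encoder(d_model=80), Decoder(d_model=80, n_mels=80))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(intermediate_sample_codes, real_spectrum):
    predicted_spectrum = model(intermediate_sample_codes)       # predicted spectrum sequence data
    output_value = loss_fn(predicted_spectrum, real_spectrum)
    optimizer.zero_grad()
    output_value.backward()                                     # back-propagation
    optimizer.step()                                            # adjust the model parameters
    return output_value.item()                                  # compared against the predetermined condition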
According to an embodiment of the present disclosure, the training method may further include the following operations.
In a case where it is determined that a first time period does not coincide with a second time period, a target time period is determined. The first time period characterizes the duration of the initial source sample language voice data, the second time period characterizes the duration of the initial target sample language voice data, and the target time period is the one of the two with the smaller value. Non-human voice data in the sample language voice data is determined, where the sample language voice data is the voice data corresponding to the target time period. The sample language voice data is processed by utilizing the non-human voice data to obtain processed sample language voice data. Source sample language voice data and target sample language voice data are obtained according to the processed sample language voice data and the sample language voice data corresponding to the non-target time period. The non-target time period is the one of the first time period and the second time period with the larger value.
According to an embodiment of the present disclosure, the duration of the initial source sample language voice data may be determined, resulting in the first time period, and the duration of the initial target sample language voice data may be determined, resulting in the second time period. The first time period and the second time period may be compared. If it is determined that the first time period and the second time period do not coincide, the one of the two with the smaller value may be determined as the target time period. The voice data corresponding to the target time period may be determined as the sample language voice data.
According to an embodiment of the present disclosure, after the sample language voice data is determined, non-human voice data in the sample language voice data may be determined. For example, the non-human voice data may be determined from the sample language voice data using a voice activity detection tool. The sample language voice data may then be processed by utilizing the non-human voice data to obtain the processed sample language voice data. For example, first non-human voice snippet data and second non-human voice snippet data may be randomly extracted from the non-human voice data. The first non-human voice snippet data is added to the beginning of the sample language voice data, and the second non-human voice snippet data is added to the end of the sample language voice data, to obtain the processed sample language voice data. The time length of the first non-human voice snippet data and the time length of the second non-human voice snippet data may be the same or different.
According to an embodiment of the present disclosure, after the processed sample language voice data is obtained, if the sample language voice data is determined to be the initial source sample language voice data, the sample language voice data corresponding to the non-target time period is the initial target sample language voice data. Thus, the processed sample language voice data may be determined as the source sample language voice data, and the initial target sample language voice data is determined as the target sample language voice data.
According to an embodiment of the present disclosure, if the sample language voice data is the initial target sample language voice data, the sample language voice data corresponding to the non-target time period is the initial source sample language voice data. Thus, the processed sample language voice data may be determined as the target sample language voice data, and the initial source sample language voice data is determined as the source sample language voice data.
According to the embodiment of the disclosure, the source sample language voice data and the target sample language voice data are made to have the same duration by using the non-human voice data to process the sample language voice data.
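For illustration, the padding procedure might be sketched as follows, assuming NumPy, a simple energy threshold standing in for the voice activity detection tool, and an even split of the padding between the beginning and the end of the recording; none of these choices is prescribed by the disclosure. The returned array plays the role of the processed sample language voice data, while the longer recording is used unchanged.

import random
import numpy as np

def pad_shorter_recording(shorter, longer, frame_len=400, energy_thresh=1e-4):
    # Pad the shorter recording with its own non-human-voice material so that both
    # recordings end up with the same number of samples.
    deficit = len(longer) - len(shorter)
    if deficit <= 0:
        return shorter
    # Stand-in for the voice activity detection tool: keep low-energy frames only.
    frames = [shorter[i:i + frame_len] for i in range(0, len(shorter) - frame_len + 1, frame_len)]
    non_speech = [f for f in frames if np.mean(f ** 2) < energy_thresh]
    if not non_speech:
        return shorter
    random.shuffle(non_speech)                        # mimic random snippet extraction
    pool = np.concatenate(non_speech)
    pool = np.tile(pool, deficit // len(pool) + 1)    # long enough to cover the deficit
    half = deficit // 2
    # One non-human-voice snippet at the beginning, one at the end.
    return np.concatenate([pool[:half], shorter, pool[half:deficit]])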
According to an embodiment of the present disclosure, the sampling frequency of the source sample language voice data is the same as the sampling frequency of the target sample language voice data.
According to an embodiment of the present disclosure, the sampling frequency of the source sample language voice data and the sampling frequency of the target sample language voice data may be the same, and both may match the sampling frequency of the voice data synthesized by the vocoder.
According to the embodiment of the disclosure, since the sampling frequency of the source sample language voice data is the same as that of the target sample language voice data, the time-domain frequency variation rule of the source sample spectrum sequence data obtained by processing the source sample language voice data is the same as that of the target sample spectrum sequence data obtained by processing the target sample language voice data. As a result, the time-domain variation rule of the predicted spectrum sequence data is the same as that of the real spectrum sequence data, which is beneficial to improving the training speed of the model.
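For illustration, both recordings can be loaded at a shared sampling frequency, for example with librosa, which resamples on load; the 22050 Hz rate and the file names below are illustrative assumptions, the rate being chosen to match the vocoder.

import librosa

vocoder_sr = 22050                                                  # assumed synthesis rate of the vocoder
source_wav, _ = librosa.load("source_sample.wav", sr=vocoder_sr)    # hypothetical file name
target_wav, _ = librosa.load("target_sample.wav", sr=vocoder_sr)    # hypothetical file name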
The above is only an exemplary embodiment, and the present disclosure is not limited thereto; other speech translation methods and training methods known in the art may also be included, as long as the speech translation quality can be improved.
The model training method according to the embodiment of the present disclosure is further described with reference to fig. 5.
Fig. 5 schematically shows an example schematic of a training process according to an embodiment of the disclosure.
As shown in fig. 5, in 500, the predetermined model 506 includes an encoder 5060 and a decoder 5061.
The source sample language voice data 501 is preprocessed to obtain source sample linear spectrum sequence data corresponding to the source sample language voice data 501. The source sample linear spectrum sequence data is processed to obtain source sample Mel spectrum sequence data corresponding to the source sample language voice data 501. The source sample Mel spectrum sequence data is determined as the source sample spectrum sequence data 502.
The target sample language voice data 503 is preprocessed to obtain target sample linear spectrum sequence data corresponding to the target sample language voice data 503. The target sample linear spectrum sequence data is processed to obtain target sample Mel spectrum sequence data corresponding to the target sample language voice data 503. The target sample Mel spectrum sequence data is determined as the real spectrum sequence data 504.
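For illustration, this preprocessing might be computed as in the sketch below, assuming librosa, 80 Mel bands, and logarithmic compression; the disclosure only specifies that linear spectrum sequence data is obtained first and then converted into Mel spectrum sequence data.

import numpy as np
import librosa

def to_mel_spectrum(wav, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    # Linear spectrum sequence data: magnitudes of the short-time Fourier transform.
    linear = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    # Mel spectrum sequence data: Mel filter bank applied to the linear spectrum.
    mel = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels) @ linear
    return np.log(mel + 1e-6).T                       # one spectrum frame per row

source_sample_spectrum = to_mel_spectrum(source_wav)  # source sample spectrum sequence data 502
real_spectrum = to_mel_spectrum(target_wav)           # real spectrum sequence data 504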
A sample position code corresponding to at least one source sample spectrum data included in the source sample spectrum sequence data 502 is determined, resulting in first sample position code sequence data 505. Intermediate sample coded sequence data is obtained from the source sample spectral sequence data 502 and the first sample position coded sequence data 505.
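For illustration, the sample position codes may be classic sinusoidal codes that are added to the spectrum frames, as sketched below; both the sinusoidal form and the additive combination are assumptions, since the disclosure only requires a sample position code per source sample spectrum data. In practice the spectrum frames would typically be projected to the model dimension before this step.

def position_codes(seq_len, d_model):
    # Sinusoidal position code for each position in the sequence (illustrative).
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(d_model)[None, :]
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    codes = np.zeros((seq_len, d_model))
    codes[:, 0::2] = np.sin(angles[:, 0::2])
    codes[:, 1::2] = np.cos(angles[:, 1::2])
    return codes

# First sample position coding sequence data 505, one code per spectrum frame,
# combined here by addition into the intermediate sample coded sequence data.
first_codes = position_codes(len(source_sample_spectrum), source_sample_spectrum.shape[1])
intermediate_sample_codes = source_sample_spectrum + first_codes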
The intermediate sample coded sequence data is processed by the encoder 5060 to obtain the sample feature vector sequence 507.
A sample position code corresponding to at least one sample feature vector comprised by the sequence of sample feature vectors 507 is determined, resulting in second sample position code sequence data 508.
And obtaining a third intermediate sample feature vector sequence according to the sample feature vector sequence 507 and the second sample position coding sequence data 508. The third intermediate sample feature vector sequence is processed by a decoder 5061 to obtain predicted spectral sequence data 509.
The predicted spectrum sequence data 509 and the real spectrum sequence data 504 are input to the loss function 510 to obtain an output value 511. The model parameters of the predetermined model 506 are adjusted according to the output value 511 until a predetermined condition is satisfied. The model 506 obtained in the case where the predetermined condition is satisfied is determined as a speech translation model. The speech translation model may be the speech translation model 304 in fig. 3A.
Fig. 6 schematically shows a block diagram of a speech translation apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the speech translation apparatus 600 may include a first determining module 610, a first obtaining module 620, a second obtaining module 630, and a third obtaining module 640.
A first determining module 610 is configured to determine source spectral sequence data corresponding to source language speech data, where the source spectral sequence data includes at least one source spectral data.
The first obtaining module 620 is configured to perform feature extraction on the source spectrum sequence data and the first position coding sequence data to obtain a target feature vector sequence. The first position coding sequence data includes a position code corresponding to the at least one source spectrum data.
The second obtaining module 630 is configured to process the target feature vector sequence and the second position coding sequence data to obtain target spectrum sequence data. The second position coding sequence data includes a position code corresponding to the target feature vector sequence.
And a third obtaining module 640, configured to process the target frequency spectrum sequence data to obtain target language voice data corresponding to the source language voice data.
According to an embodiment of the present disclosure, the first obtaining module 620 may include a first obtaining sub-module and a second obtaining sub-module.
And the first obtaining sub-module is used for obtaining intermediate coding sequence data according to the source spectrum sequence data and the first position coding sequence data.
And the second obtaining submodule is used for carrying out feature extraction on the intermediate coding sequence data to obtain a target feature vector sequence.
According to an embodiment of the present disclosure, the second obtaining sub-module may include a first obtaining unit and a second obtaining unit.
And the first obtaining unit is used for processing the intermediate coding sequence data based on the first attention strategy to obtain a first intermediate feature vector sequence.
And the second obtaining unit is used for processing the first intermediate feature vector sequence based on the first multilayer perception strategy to obtain a target feature vector sequence.
According to an embodiment of the present disclosure, the second obtaining module 630 may include a third obtaining sub-module and a fourth obtaining sub-module.
And the third obtaining submodule is used for obtaining a second intermediate feature vector sequence according to the target feature vector sequence and the second position coding sequence data.
And the fourth obtaining submodule is used for processing the second intermediate feature vector sequence to obtain target frequency spectrum sequence data.
According to an embodiment of the present disclosure, the fourth obtaining submodule may include a third obtaining unit, a fourth obtaining unit, and a fifth obtaining unit.
And the third obtaining unit is used for processing the second intermediate feature vector sequence based on the second attention strategy to obtain a third intermediate feature vector sequence.
And the fourth obtaining unit is used for processing the third intermediate feature vector sequence based on the second multilayer perception strategy to obtain a fourth intermediate feature vector sequence.
And the fifth obtaining unit is used for processing the fourth intermediate feature vector sequence to obtain target frequency spectrum sequence data.
According to an embodiment of the present disclosure, the third obtaining module 640 may include a fifth obtaining sub-module.
And the fifth obtaining submodule is used for processing the target frequency spectrum sequence data by utilizing the vocoder to obtain target language voice data corresponding to the source language voice data.
According to an embodiment of the present disclosure, the first determination module 610 may include a sixth obtaining sub-module, a seventh obtaining sub-module, and a first determination sub-module.
And the sixth obtaining submodule is used for preprocessing the source language voice data to obtain source linear spectrum sequence data corresponding to the source language voice data.
And the seventh obtaining submodule is used for processing the source linear spectrum sequence data to obtain source Mel spectrum sequence data corresponding to the source language voice data.
A first determining sub-module for determining the source mel spectral sequence data as the source spectral sequence data.
According to the embodiment of the present disclosure, the target feature vector sequence is obtained by processing the source spectrum sequence data and the first position coding sequence data by an encoder included in the speech translation model.
According to an embodiment of the present disclosure, the target spectrum sequence data is obtained by processing the target feature vector sequence and the second position coding sequence data using a decoder included in the speech translation model.
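For illustration, the four modules of the speech translation apparatus 600 can be strung together as in the sketch below, which reuses the encoder, decoder, Mel spectrum, and position-code sketches from earlier in this description and uses Griffin-Lim reconstruction via librosa as a stand-in for the vocoder. The dimensions are assumed to satisfy d_model equal to the number of Mel bands so that the sketches compose; this is an assumption of the sketch, not the disclosed configuration.

import numpy as np
import torch
import librosa

def translate(source_wav, encoder, decoder, sr=22050, n_fft=1024, hop_length=256):
    # First determining module 610: source spectrum sequence data.
    source_spectrum = to_mel_spectrum(source_wav, sr=sr, n_fft=n_fft, hop_length=hop_length)
    # First obtaining module 620: add position codes, then encode.
    x = source_spectrum + position_codes(len(source_spectrum), source_spectrum.shape[1])
    features = encoder(torch.tensor(x, dtype=torch.float32)[None])
    # Second obtaining module 630: add position codes to the features, then decode.
    codes = position_codes(features.shape[1], features.shape[2])
    target_spectrum = decoder(features + torch.tensor(codes, dtype=torch.float32))[0].detach().numpy()
    # Third obtaining module 640: vocoder step, approximated here with Griffin-Lim.
    mel_power = np.exp(target_spectrum.T)             # undo the log compression
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length)

# Example call with the assumed dimensions (80 Mel bands, d_model = 80):
# audio = translate(source_wav, Encoder(d_model=80), Decoder(d_model=80, n_mels=80))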
FIG. 7 schematically shows a block diagram of a model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the model training apparatus 700 may include a second determining module 710, a fourth obtaining module 720, a fifth obtaining module 730, and a sixth obtaining module 740.
A second determining module 710, configured to determine source sample spectral sequence data corresponding to the source sample language voice data and real spectral sequence data corresponding to the target sample language voice data, respectively. The source sample spectrum sequence data includes at least one source sample spectrum data, and the target sample language voice data is translated from the source sample language voice data.
A fourth obtaining module 720, configured to perform feature extraction on the source sample spectrum sequence data and the first sample position coding sequence data to obtain a sample feature vector sequence. The first sample position code sequence data includes a sample position code corresponding to at least one source sample spectrum data.
A fifth obtaining module 730, configured to process the sample feature vector sequence and the second sample position coding sequence data to obtain predicted spectrum sequence data. The second sample position code sequence data includes position codes corresponding to the sequence of sample feature vectors.
A sixth obtaining module 740, configured to train the predetermined model using the real spectrum sequence data and the predicted spectrum sequence data to obtain a speech translation model.
According to an embodiment of the present disclosure, the predetermined model includes an encoder.
According to an embodiment of the present disclosure, the fourth obtaining module 720 may include an eighth obtaining sub-module and a ninth obtaining sub-module.
And the eighth obtaining submodule is used for obtaining the intermediate sample coding sequence data according to the source sample spectrum sequence data and the first sample position coding sequence data.
And the ninth obtaining submodule is used for processing the coding sequence data of the intermediate sample by using the coder to obtain a sample feature vector sequence.
According to an embodiment of the present disclosure, an encoder includes a cascade of N encoding units, the encoding units including a first attention layer and a first feedforward neural network layer, N being an integer greater than or equal to 1.
According to an embodiment of the present disclosure, the ninth obtaining sub-module may include a sixth obtaining unit, a seventh obtaining unit, and an eighth obtaining unit.
And the sixth obtaining unit is used for processing the first intermediate sample feature vector sequence of the (i-1) th level by using the first attention layer of the i-th level to obtain a second intermediate sample feature vector sequence of the i-th level under the condition that i is more than 1 and less than or equal to N.
And the seventh obtaining unit is used for processing the second intermediate sample feature vector sequence of the ith level by utilizing the first feedforward neural network layer of the ith level to obtain the first intermediate sample feature vector sequence of the ith level.
And the eighth obtaining unit is configured to obtain a sample feature vector sequence according to the first intermediate sample feature vector sequence of the nth level.
According to an embodiment of the present disclosure, the predetermined model further comprises a decoder.
According to an embodiment of the present disclosure, the fifth obtaining module 730 may include a tenth obtaining sub-module and an eleventh obtaining sub-module.
And the tenth obtaining submodule is used for obtaining a third intermediate sample feature vector sequence according to the sample feature vector sequence and the second sample position coding sequence data.
And the eleventh obtaining submodule is used for processing the third intermediate sample feature vector sequence by using a decoder to obtain the predicted spectrum sequence data.
According to an embodiment of the present disclosure, a decoder includes N decoding units including a second attention layer and a second feedforward neural network layer.
According to an embodiment of the present disclosure, the eleventh obtaining sub-module may include a ninth obtaining unit, a tenth obtaining unit, and an eleventh obtaining unit.
And the ninth obtaining unit is used for processing the fourth intermediate sample feature vector sequence of the (i +1) th level by using the second attention layer of the i-th level to obtain a fifth intermediate sample feature vector sequence of the i-th level under the condition that i is more than or equal to 1 and less than N.
A tenth obtaining unit, configured to process the fifth intermediate sample feature vector sequence of the ith level by using the second feedforward neural network layer of the ith level, so as to obtain a fourth intermediate sample feature vector sequence of the ith level.
An eleventh obtaining unit, configured to process the fourth intermediate sample feature vector sequence of the level 1 to obtain predicted spectrum sequence data.
According to an embodiment of the present disclosure, the sixth obtaining module 740 may include a twelfth obtaining sub-module, an adjusting sub-module, and a second determining sub-module.
And a twelfth obtaining sub-module for obtaining an output value by using the real spectrum sequence data and the predicted spectrum sequence data based on the loss function.
And the adjusting submodule is used for adjusting the model parameters of the predetermined model according to the output value until a predetermined condition is met.
And the second determining submodule is used for determining the model obtained under the condition that the predetermined condition is met as the speech translation model.
According to an embodiment of the present disclosure, the model training apparatus 700 may further include a third determining module, a fourth determining module, a seventh obtaining module, and an eighth obtaining module.
And the third determining module is used for determining the target time period under the condition that the first time period is determined to be inconsistent with the second time period. The first time period characterizes the duration of the initial source sample language voice data, the second time period characterizes the duration of the initial target sample language voice data, and the target time period is the one of the two with the smaller value.
And the fourth determination module is used for determining the non-human voice data in the sample language voice data, wherein the sample language voice data is the voice data corresponding to the target time period.
And the seventh obtaining module is used for processing the sample language voice data by utilizing the non-human voice data to obtain the processed sample language voice data.
And the eighth obtaining module is used for obtaining the source sample language voice data and the target sample language voice data according to the processed sample language voice data and the sample language voice data corresponding to the non-target time period. The non-target time period is the one of the first time period and the second time period with the larger value.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present disclosure, a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described above.
According to an embodiment of the disclosure, a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
FIG. 8 schematically illustrates a block diagram of an electronic device suitable for implementing a speech translation method and a training method according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the respective methods and processes described above, such as the speech translation method and the model training method. For example, in some embodiments, the speech translation method and the model training method may be implemented as computer software programs that are tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the speech translation method and the model training method described above can be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the speech translation method and the model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (20)

1. A method of speech translation, comprising:
determining source spectral sequence data corresponding to source language speech data, wherein the source spectral sequence data comprises at least one source spectral data;
performing feature extraction on the source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, wherein the first position coding sequence data comprises a position code corresponding to the at least one source spectrum data;
processing the target feature vector sequence and second position coding sequence data to obtain target spectrum sequence data, wherein the second position coding sequence data comprises a position code corresponding to the target feature vector sequence; and
and processing the target frequency spectrum sequence data to obtain target language voice data corresponding to the source language voice data.
2. The method of claim 1, wherein the feature extracting the source spectrum sequence data and the first position coding sequence data to obtain a target feature vector sequence comprises:
obtaining intermediate coding sequence data according to the source spectrum sequence data and the first position coding sequence data; and
and performing feature extraction on the intermediate coding sequence data to obtain the target feature vector sequence.
3. The method of claim 2, wherein the feature extracting the intermediate coding sequence data to obtain the target feature vector sequence comprises:
processing the intermediate coding sequence data based on a first attention strategy to obtain a first intermediate feature vector sequence; and
and processing the first intermediate feature vector sequence based on a first multilayer perception strategy to obtain the target feature vector sequence.
4. The method according to any one of claims 1 to 3, wherein the processing the target feature vector sequence and the second position coding sequence data to obtain target spectrum sequence data comprises:
obtaining a second intermediate feature vector sequence according to the target feature vector sequence and the second position coding sequence data; and
and processing the second intermediate feature vector sequence to obtain the target frequency spectrum sequence data.
5. The method of claim 4, wherein the processing the second intermediate feature vector sequence to obtain the target spectral sequence data comprises:
processing the second intermediate feature vector sequence based on a second attention strategy to obtain a third intermediate feature vector sequence;
processing the third intermediate feature vector sequence based on a second multilayer perception strategy to obtain a fourth intermediate feature vector sequence; and
and processing the fourth intermediate feature vector sequence to obtain the target frequency spectrum sequence data.
6. The method according to any one of claims 1 to 5, wherein the processing the target spectrum sequence data to obtain target language voice data corresponding to the source language voice data comprises:
and processing the target frequency spectrum sequence data by using a vocoder to obtain target language voice data corresponding to the source language voice data.
7. The method of any of claims 1-6, wherein the determining source spectral sequence data corresponding to source language speech data comprises:
preprocessing the source language voice data to obtain source linear spectrum sequence data corresponding to the source language voice data;
processing the source linear spectrum sequence data to obtain source Mel spectrum sequence data corresponding to the source language voice data; and
determining the source Mel spectral sequence data as the source spectral sequence data.
8. The method of claim 1, wherein the sequence of target feature vectors is derived from processing the source spectral sequence data and the first position-encoded sequence data with an encoder comprised in a speech translation model;
wherein the target spectrum sequence data is obtained by processing the target feature vector sequence and the second position encoding sequence data by a decoder included in the speech translation model.
9. A model training method, comprising:
respectively determining source sample spectrum sequence data corresponding to source sample language voice data and real spectrum sequence data corresponding to target sample language voice data, wherein the source sample spectrum sequence data comprises at least one source sample spectrum data, and the target sample language voice data is obtained by translating the source sample language voice data;
performing feature extraction on the source sample spectrum sequence data and first sample position coding sequence data to obtain a sample feature vector sequence, wherein the first sample position coding sequence data comprises a sample position code corresponding to the at least one source sample spectrum data;
processing the sample feature vector sequence and second sample position coding sequence data to obtain predicted spectrum sequence data, wherein the second sample position coding sequence data comprises a position code corresponding to the sample feature vector sequence; and
and training a predetermined model by using the real spectrum sequence data and the predicted spectrum sequence data to obtain a speech translation model.
10. The method of claim 9, wherein the predetermined model comprises an encoder;
wherein, the performing feature extraction on the source sample spectrum sequence data and the first sample position coding sequence data to obtain a sample feature vector sequence includes:
obtaining intermediate sample coding sequence data according to the source sample spectrum sequence data and the first sample position coding sequence data; and
and processing the intermediate sample coding sequence data by using the encoder to obtain the sample characteristic vector sequence.
11. The method of claim 10, wherein the encoder comprises a cascade of N encoding units, the encoding units comprising a first attention layer and a first feed-forward neural network layer, N being an integer greater than or equal to 1;
wherein the processing the intermediate sample code sequence data with the encoder to obtain the sample feature vector sequence comprises:
under the condition that i is equal to 1, processing the intermediate sample coding sequence data by using a first attention layer of the 1st level to obtain a second intermediate sample feature vector sequence of the 1st level; and
processing the second intermediate sample feature vector sequence of the 1st level by utilizing a first feedforward neural network layer of the 1st level to obtain a first intermediate sample feature vector sequence of the 1st level;
under the condition that i is more than 1 and less than or equal to N, processing the first intermediate sample feature vector sequence of the (i-1) th level by utilizing the first attention layer of the i th level to obtain a second intermediate sample feature vector sequence of the i th level;
processing the second intermediate sample feature vector sequence of the ith level by utilizing a first feedforward neural network layer of the ith level to obtain a first intermediate sample feature vector sequence of the ith level; and
and obtaining the sample feature vector sequence according to the first intermediate sample feature vector sequence of the Nth level.
12. The method of claim 10 or 11, wherein the predetermined model further comprises a decoder;
wherein, the processing the sample feature vector sequence and the second sample position coding sequence data to obtain prediction spectrum sequence data includes:
obtaining a third intermediate sample feature vector sequence according to the sample feature vector sequence and the second sample position coding sequence data; and
and processing the third intermediate sample feature vector sequence by using the decoder to obtain the predicted spectrum sequence data.
13. The method of claim 12, wherein the decoder comprises N decoding units comprising a second attention layer and a second feed-forward neural network layer;
wherein the processing, by the decoder, the third intermediate sample feature vector sequence to obtain the predicted spectrum sequence data includes:
under the condition that i is more than or equal to 1 and less than N, processing a fourth intermediate sample feature vector sequence of the (i +1) th level by using a second attention layer of the i th level to obtain a fifth intermediate sample feature vector sequence of the i th level;
processing the fifth intermediate sample feature vector sequence of the ith level by utilizing a second feedforward neural network layer of the ith level to obtain a fourth intermediate sample feature vector sequence of the ith level; and
and processing the fourth intermediate sample feature vector sequence of the level 1 to obtain the predicted spectrum sequence data.
14. The method according to any one of claims 10 to 13, wherein the training of a predetermined model using the real spectral sequence data and the predicted spectral sequence data to obtain a speech translation model comprises:
obtaining an output value by using the real spectrum sequence data and the predicted spectrum sequence data based on a loss function;
adjusting the model parameters of the predetermined model according to the output value until a predetermined condition is met; and
and determining the model obtained under the condition that the predetermined condition is met as the speech translation model.
15. The method of any of claims 10-14, further comprising:
determining a target time period in the case that a first time period is determined to be inconsistent with a second time period, wherein the target time period is the one of the first time period and the second time period with the smaller value, the first time period represents the duration of the initial source sample language voice data, and the second time period represents the duration of the initial target sample language voice data;
determining non-human voice data in sample language voice data, wherein the sample language voice data is voice data corresponding to a target time period;
processing the sample language voice data by using the non-human voice data to obtain processed sample language voice data; and
and obtaining the source sample language voice data and the target sample language voice data according to the processed sample language voice data and the sample language voice data corresponding to the non-target time period, wherein the non-target time period is the one of the first time period and the second time period with the larger value.
16. A speech translation apparatus comprising:
a first determining module, configured to determine source spectral sequence data corresponding to source language speech data, wherein the source spectral sequence data includes at least one source spectral data;
a first obtaining module, configured to perform feature extraction on the source spectrum sequence data and first position coding sequence data to obtain a target feature vector sequence, where the first position coding sequence data includes a position code corresponding to the at least one source spectrum data;
a second obtaining module, configured to process the target feature vector sequence and second position coding sequence data to obtain target spectrum sequence data, where the second position coding sequence data includes a position code corresponding to the target feature vector sequence; and
and the third obtaining module is used for processing the target frequency spectrum sequence data to obtain target language voice data corresponding to the source language voice data.
17. A model training apparatus comprising:
a second determining module, configured to determine source sample spectrum sequence data corresponding to source sample language voice data and real spectrum sequence data corresponding to target sample language voice data, respectively, where the source sample spectrum sequence data includes at least one source sample spectrum data, and the target sample language voice data is obtained by translating the source sample language voice data;
a fourth obtaining module, configured to perform feature extraction on the source sample spectrum sequence data and first sample position coding sequence data to obtain a sample feature vector sequence, where the first sample position coding sequence data includes a sample position code corresponding to the at least one source sample spectrum data;
a fifth obtaining module, configured to process the sample feature vector sequence and second sample position coding sequence data to obtain predicted spectrum sequence data, where the second sample position coding sequence data includes a sample position code corresponding to the sample feature vector sequence; and
and the sixth obtaining module is used for training a predetermined model by using the real spectrum sequence data and the predicted spectrum sequence data to obtain a speech translation model.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 8 or any one of claims 9 to 15.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of claims 1-8 or any of claims 9-15.
20. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 8 or any one of claims 9 to 15.
CN202210110163.XA 2022-01-28 2022-01-28 Speech translation and model training method, device, electronic equipment and storage medium Active CN114495977B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210110163.XA CN114495977B (en) 2022-01-28 2022-01-28 Speech translation and model training method, device, electronic equipment and storage medium
PCT/CN2022/113695 WO2023142454A1 (en) 2022-01-28 2022-08-19 Speech translation and model training methods, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210110163.XA CN114495977B (en) 2022-01-28 2022-01-28 Speech translation and model training method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114495977A true CN114495977A (en) 2022-05-13
CN114495977B CN114495977B (en) 2024-01-30

Family

ID=81479198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210110163.XA Active CN114495977B (en) 2022-01-28 2022-01-28 Speech translation and model training method, device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114495977B (en)
WO (1) WO2023142454A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766414B (en) * 2018-06-29 2021-01-15 北京百度网讯科技有限公司 Method, apparatus, device and computer-readable storage medium for speech translation
CN110503945B (en) * 2019-09-06 2022-07-08 北京金山数字娱乐科技有限公司 Training method and device of voice processing model
CN111368559A (en) * 2020-02-28 2020-07-03 北京字节跳动网络技术有限公司 Voice translation method and device, electronic equipment and storage medium
CN113505610B (en) * 2021-07-09 2022-05-06 中国人民解放军战略支援部队信息工程大学 Model enhancement-based speech translation model training method and system, and speech translation method and equipment
US11361780B2 (en) * 2021-12-24 2022-06-14 Sandeep Dhawan Real-time speech-to-speech generation (RSSG) apparatus, method and a system therefore
CN114495977B (en) * 2022-01-28 2024-01-30 北京百度网讯科技有限公司 Speech translation and model training method, device, electronic equipment and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000045374A1 (en) * 1999-01-29 2000-08-03 Sony Electronics, Inc. A method and portable apparatus for performing spoken language translation
CN101727904A (en) * 2008-10-31 2010-06-09 国际商业机器公司 Voice translation method and device
WO2019060160A1 (en) * 2017-09-25 2019-03-28 Google Llc Speech translation device and associated method
GB201804073D0 (en) * 2018-03-14 2018-04-25 Papercup Tech Limited A speech processing system and a method of processing a speech signal
CN108682436A (en) * 2018-05-11 2018-10-19 北京海天瑞声科技股份有限公司 Voice alignment schemes and device
CN112204653A (en) * 2019-03-29 2021-01-08 谷歌有限责任公司 Direct speech-to-speech translation through machine learning
CN111222347A (en) * 2020-04-15 2020-06-02 北京金山数字娱乐科技有限公司 Sentence translation model training method and device and sentence translation method and device
CN111783477A (en) * 2020-05-13 2020-10-16 厦门快商通科技股份有限公司 Voice translation method and system
CN113971218A (en) * 2021-09-09 2022-01-25 北京小米移动软件有限公司 Position coding method, position coding device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YE JIA 等: "Direct speech-to-speech translation with a sequence-to-sequence model", ARXIV:1904.06037V2 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023142454A1 (en) * 2022-01-28 2023-08-03 北京百度网讯科技有限公司 Speech translation and model training methods, apparatus, electronic device, and storage medium
CN116524955A (en) * 2023-07-05 2023-08-01 上海蜜度信息技术有限公司 Speech translation and model training method, system and electronic equipment

Also Published As

Publication number Publication date
WO2023142454A1 (en) 2023-08-03
CN114495977B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN112382271B (en) Voice processing method, device, electronic equipment and storage medium
US10810993B2 (en) Sample-efficient adaptive text-to-speech
WO2023142454A1 (en) Speech translation and model training methods, apparatus, electronic device, and storage medium
CN114141228B (en) Training method of speech synthesis model, speech synthesis method and device
CN112466288A (en) Voice recognition method and device, electronic equipment and storage medium
CN112466314A (en) Emotion voice data conversion method and device, computer equipment and storage medium
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN113571039B (en) Voice conversion method, system, electronic equipment and readable storage medium
US20220328041A1 (en) Training neural networks to predict acoustic sequences using observed prosody info
CN114895817B (en) Interactive information processing method, network model training method and device
CN113689868B (en) Training method and device of voice conversion model, electronic equipment and medium
CN114360485A (en) Voice processing method, system, device and medium
CN113889073B (en) Voice processing method and device, electronic equipment and storage medium
CN115512682A (en) Polyphone pronunciation prediction method and device, electronic equipment and storage medium
CN114783428A (en) Voice translation method, voice translation device, voice translation model training method, voice translation model training device, voice translation equipment and storage medium
CN115240696A (en) Speech recognition method and readable storage medium
CN113689866A (en) Training method and device of voice conversion model, electronic equipment and medium
CN114512121A (en) Speech synthesis method, model training method and device
CN114360491A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN114187892A (en) Style migration synthesis method and device and electronic equipment
CN114360559B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN113793598B (en) Training method of voice processing model, data enhancement method, device and equipment
CN113689867B (en) Training method and device of voice conversion model, electronic equipment and medium
CN115662386A (en) Voice conversion method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant