CN113793598B - Training method of voice processing model, data enhancement method, device and equipment - Google Patents

Training method of voice processing model, data enhancement method, device and equipment

Info

Publication number
CN113793598B
CN113793598B CN202111083473.9A
Authority
CN
China
Prior art keywords
feature
speech
data
network
phoneme
Prior art date
Legal status
Active
Application number
CN202111083473.9A
Other languages
Chinese (zh)
Other versions
CN113793598A (en)
Inventor
赵情恩
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111083473.9A
Publication of CN113793598A
Application granted
Publication of CN113793598B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Abstract

The disclosure provides a training method of a speech processing model, and a method, an apparatus, a device and a medium for enhancing data, and relates to the field of artificial intelligence, in particular to the technical fields of speech recognition, speech synthesis and deep learning. A specific implementation of the training method of the speech processing model is as follows: determining a first phoneme feature of first speech data based on a first acoustic feature of the first speech data; obtaining a first prosodic feature of the first speech data using a prosodic encoding network of the speech processing model, based on the first acoustic feature, a first speech recognition feature of the first speech data and the first phoneme feature; obtaining a predicted acoustic feature using a decoding network of the speech processing model, based on the first acoustic feature, the first speech recognition feature and the first prosodic feature; and training the speech processing model based on the difference between the predicted acoustic feature and the first acoustic feature.

Description

Training method of voice processing model, data enhancement method, device and equipment
Technical Field
The disclosure relates to the technical field of artificial intelligence, and further relates to the technical fields of speech recognition, speech synthesis and deep learning, in particular to a training method of a speech processing model, a data enhancement method, a device and equipment.
Background
With the development of artificial intelligence, speech recognition has become an important technical support for scenarios such as intelligent search and intelligent voice assistants. To improve the accuracy of speech recognition, speech recognition techniques based on deep learning have developed rapidly. However, improving the accuracy of deep learning-based speech recognition relies on a large amount of labeled data.
Disclosure of Invention
Based on this, the present disclosure provides a training method of a speech processing model, and a method, an apparatus, a device and a medium for enhancing data, which facilitate the enhancement of speech data and make speech recording more engaging.
According to one aspect of the present disclosure, there is provided a training method of a speech processing model including a prosodic encoding network and a decoding network; the method comprises the following steps: determining a first phoneme feature of the first speech data based on the first acoustic feature of the first speech data; based on the first acoustic feature, the first speech recognition feature and the first phoneme feature of the first speech data, obtaining a first prosodic feature of the first speech data by adopting a prosodic coding network; based on the first acoustic feature, the first speech recognition feature and the first prosody feature, obtaining a predicted acoustic feature by using a decoding network; and training the speech processing model based on the difference between the predicted acoustic feature and the first acoustic feature.
According to another aspect of the present disclosure, there is provided a method of enhancing data, comprising: determining a second phoneme feature of third speech data based on a second acoustic feature of the third speech data; obtaining a second prosodic feature of the third speech data using a prosodic encoding network of a speech processing model, based on the second acoustic feature, a second speech recognition feature of the third speech data and the second phoneme feature; obtaining a target acoustic feature using a decoding network of the speech processing model, based on the second acoustic feature, a target speech recognition feature and the second prosodic feature; and obtaining fourth speech data having the target speech recognition feature based on the target acoustic feature, wherein the speech processing model is trained using the training method of the speech processing model described above, and the target speech recognition feature is a speech recognition feature of an object other than the object related to the third speech data.
According to another aspect of the present disclosure, there is provided a training apparatus of a speech processing model, wherein the speech processing model includes a prosodic encoding network and a decoding network; the apparatus comprises: a first phoneme feature determining module for determining a first phoneme feature of first speech data based on a first acoustic feature of the first speech data; a first prosodic feature obtaining module for obtaining a first prosodic feature of the first speech data using the prosodic encoding network, based on the first acoustic feature, a first speech recognition feature of the first speech data and the first phoneme feature; a first acoustic feature obtaining module for obtaining a predicted acoustic feature using the decoding network, based on the first acoustic feature, the first speech recognition feature and the first prosodic feature; and a model training module for training the speech processing model based on the difference between the predicted acoustic feature and the first acoustic feature.
According to another aspect of the present disclosure, there is provided an apparatus for enhancing data, comprising: a second phoneme feature determining module for determining a second phoneme feature of third speech data based on a second acoustic feature of the third speech data; a second prosodic feature obtaining module for obtaining a second prosodic feature of the third speech data using a prosodic encoding network of a speech processing model, based on the second acoustic feature, a second speech recognition feature of the third speech data and the second phoneme feature; a second acoustic feature obtaining module for obtaining a target acoustic feature using a decoding network of the speech processing model, based on the second acoustic feature, a target speech recognition feature and the second prosodic feature; and a speech data obtaining module, configured to obtain fourth speech data having the target speech recognition feature based on the target acoustic feature, wherein the speech processing model is trained using the training apparatus of the speech processing model described above, and the target speech recognition feature is a speech recognition feature of an object other than the object related to the third speech data.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of training the speech processing model and/or the method of enhancing data provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of training a speech processing model and/or the method of enhancing data provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of training a speech processing model and/or the method of enhancing data provided by the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an application scenario of a training method and a method and an apparatus for enhancing data of a speech processing model according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of training a speech processing model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the structure of a speech processing model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a prosody encoding network according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a training method of a speech processing model according to an embodiment of the present disclosure;
FIG. 6 is a flow diagram of a method of enhancing data according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of a training device of a speech processing model according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of an apparatus for enhancing data according to an embodiment of the present disclosure; and
FIG. 9 is a block diagram of an electronic device for implementing a training method and/or a data enhancement method for a speech processing model in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides a training method of a speech processing model, wherein the speech processing model includes a prosodic encoding network and a decoding network. The method comprises a phoneme characteristic determining stage, a prosodic characteristic obtaining stage, an acoustic characteristic obtaining stage and a model training stage. In the phoneme feature determining stage, a first phoneme feature of the first speech data is determined based on the first acoustic feature of the first speech data. In the prosodic feature obtaining stage, a first prosodic feature of the first speech data is obtained using a prosodic coding network based on the first acoustic feature, the first speech recognition feature and the first phoneme feature of the first speech data. In the acoustic feature acquisition stage, a predicted acoustic feature is obtained using a decoding network based on the first acoustic feature, the first speech recognition feature, and the first prosodic feature. In a model training phase, a speech processing model is trained based on differences between the predicted acoustic features and the first acoustic features.
An application scenario of the method and apparatus provided by the present disclosure will be described below with reference to fig. 1.
Fig. 1 is an application scenario schematic diagram of a training method of a speech processing model and a method and an apparatus for enhancing data according to an embodiment of the disclosure.
As shown in fig. 1, the scenario 100 of this embodiment includes an electronic device 110 and a database 120.
Wherein the electronic device 110 may access the database 120 through a network, for example. The database 120 may store a plurality of audio data, which may include voice data obtained by capturing subject voices. In an embodiment, the voice data may have a tag that indicates text information corresponding to the voice in the voice data, and may also indicate an object corresponding to the voice in the voice data.
In one embodiment, the electronic device 110 may read the voice data from the database 120 and perform object conversion on the voice data to obtain voice data that has the same speech content but corresponds to a different object. The electronic device 110 may also store the converted voice data in the database, for example, to increase the diversity of the voice data stored in the database. For example, the electronic device 110 may employ a speech processing model to perform the object conversion on the voice data.
For example, the electronic device 110 may also process the voice data by adding noise to the read voice data, changing the rate of the read voice data, spectrally enhancing the read voice data, and the like, and store the processed data in a database. In this way, the diversity of voice data stored in the database can be further improved.
In an embodiment, the electronic device 110 may also read the voice data with the tag from the database 120, and train the voice recognition model using the read voice data as a sample. The voice recognition model is used for processing the voice data to obtain texts corresponding to voices in the voice data. The resulting text is then compared to the text indicated by the tag, and a speech recognition model is trained based on the comparison.
In an embodiment, the application scenario 100 may further comprise a terminal device 130, where the terminal device 130 is communicatively connected to the electronic device 110 through a network. For example, the terminal device 130 may obtain a trained speech recognition model 140 from the electronic device 110, and process the speech data 150 collected in real-time based on the obtained speech recognition model 140 to recognize speech uttered by the subject in real-time. The terminal device 130 may also provide services to the object, for example, based on the recognition result 160 from the speech recognition.
It should be noted that, the training method and/or the data enhancing method of the speech processing model provided by the embodiments of the present disclosure may be generally performed by the electronic device 110, or may be performed by a server or the like communicatively connected to the electronic device 110. Accordingly, the training device and/or the data enhancing device for a speech processing model provided in the embodiments of the present disclosure may be generally disposed in the electronic device 110, or may also be disposed in a server or the like communicatively connected to the electronic device 110.
It should be understood that the number and types of electronic devices, databases and terminal devices in fig. 1 are merely illustrative. There may be any number and type of electronic devices, databases and terminal devices, as desired for implementation.
The training method of the speech processing model provided by the present disclosure will be described in detail below with reference to fig. 2 to 5, in conjunction with fig. 1.
Fig. 2 is a flow diagram of a method of training a speech processing model according to an embodiment of the present disclosure.
As shown in fig. 2, the training method 200 of the speech processing model of this embodiment may include operations S210 to S240.
In operation S210, first phoneme features of the first speech data are determined based on the first acoustic features of the first speech data.
According to embodiments of the present disclosure, the acoustic features of the speech data may be, for example, Mel-frequency cepstral coefficient (MFCC) features, perceptual linear prediction (PLP) features, filter bank (Fbank) features, and the like of the speech data. The MFCC features are obtained by performing a discrete cosine transform on the filter bank data.
For example, the first acoustic feature may be used as an input to a phoneme recognition model, and the first phoneme feature may be obtained after processing by the phoneme recognition model. The phoneme recognition model can be constructed based on a time-delayed bidirectional long short-term memory network.
For example, a portion of the speech data may be randomly extracted from a database storing audio. After the voice data is extracted, it may be preprocessed, for example, by removing noise (including environmental noise, busy tones, ring tones, etc.) and enhancing the voice data with an existing data enhancement method. Existing data enhancement methods may include methods of varying speech rate, mixing echoes, time-domain warping and/or frequency-domain masking, among others. The preprocessed voice data is taken as the first voice data. The first voice data may be divided into frames, and feature extraction is then performed on each frame to obtain the MFCC feature of that frame, where the MFCC features of the multiple frames obtained by framing constitute the first acoustic feature in the form of a sequence. In the framing process, the frame length may be, for example, 25 ms and the frame shift may be, for example, 10 ms, which is not limited in the present disclosure.
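As an illustration of the framing and feature extraction described above, the following sketch computes a frame-level MFCC sequence with a 25 ms frame length and a 10 ms frame shift. The 16 kHz sampling rate, the librosa library and the number of MFCC dimensions are assumptions not taken from the patent; this is a minimal sketch rather than the patent's implementation.

```python
# Minimal sketch: extracting a frame-level MFCC sequence as the first acoustic
# feature, assuming 16 kHz mono audio and the librosa library. Frame length and
# shift follow the 25 ms / 10 ms example above; n_mfcc is an illustrative choice.
import librosa
import numpy as np

def extract_mfcc_sequence(wav_path: str, sr: int = 16000, n_mfcc: int = 40) -> np.ndarray:
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=audio,
        sr=sr,
        n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),       # 25 ms frame length
        hop_length=int(0.010 * sr),  # 10 ms frame shift
    )
    # Shape (n_frames, n_mfcc): one MFCC vector M_i per audio frame.
    return mfcc.T
```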
In operation S220, a first prosodic feature of the first speech data is obtained using a prosodic coding network based on the first acoustic feature, the first speech recognition feature of the first speech data, and the first phoneme feature.
According to an embodiment of the present disclosure, the first acoustic feature may be input into an object recognition model and, after being processed by the object recognition model, the feature data that is input into the fully connected layer of the object recognition model may be used as the first speech recognition feature. The object recognition model may be, for example, an object encoder based on pooled vectors. The object encoder includes a multi-layer time-delay neural network whose penultimate layer is a global pooling layer. The embodiment may use the feature data output by the global pooling layer as the speech recognition feature, which uniquely represents the features of the object to which the first speech data relates.
According to an embodiment of the present disclosure, the speech processing model includes a prosodic encoding network. In this embodiment, the first acoustic feature, the first speech recognition feature and the first phoneme feature may be spliced and input into the prosodic encoding network, which outputs the first prosodic feature after processing. The prosodic encoding network may be constructed based on, for example, a long short-term memory network architecture or an attention-based (e.g., Transformer) architecture. Prosodic features are representations of the whole speech segment, such as syllable stress, intonation pattern, speaking rate and rhythm.
In operation S230, a predicted acoustic feature is obtained using a decoding network based on the first acoustic feature, the first speech recognition feature, and the first prosodic feature.
According to embodiments of the present disclosure, the speech processing model may also include a decoding network. In this embodiment, the first acoustic feature, the first speech recognition feature and the first prosodic feature may be spliced and input into the decoding network, which outputs the predicted acoustic feature after processing. The decoding network may be formed based on a convolutional neural network (CNN) and a bidirectional gated recurrent unit (Bi-Gated Recurrent Unit, BGRU).
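The decoding network just described (a CNN front end followed by a bidirectional GRU) can be sketched as follows in PyTorch. The layer sizes, kernel sizes and the final projection layer are illustrative assumptions rather than values given in the patent.

```python
# Minimal PyTorch sketch of a decoding network built from a CNN front end and a
# bidirectional GRU. Dimensions and the output projection are assumptions.
import torch
from torch import nn

class Decoder(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 256, out_dim: int = 40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.bgru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)  # predicted acoustic feature per frame

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, in_dim) = spliced acoustic, speech recognition and
        # prosodic features, broadcast to frame level.
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.bgru(h)
        return self.proj(h)
```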
In operation S240, a speech processing model is trained based on differences between the predicted acoustic features and the first acoustic features.
According to embodiments of the present disclosure, the loss of the speech processing model may be determined from the difference between the predicted acoustic feature and the first acoustic feature. The speech processing model is trained by a back-propagation algorithm to minimize the loss of the speech processing model.
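As a concrete illustration of this training step, the following sketch minimizes a reconstruction loss by back-propagation. The model interface, the Adam-style optimizer passed in by the caller, and the use of Smooth-L1 as the reconstruction loss (introduced later in the text) are assumptions.

```python
# Minimal sketch of one training step: compute the loss from the difference
# between the predicted and the first acoustic features, then back-propagate.
# The model is assumed to take the first acoustic and speech recognition
# features and return the predicted acoustic feature.
import torch.nn.functional as F

def train_step(model, optimizer, first_acoustic, first_speech_recognition_feature):
    predicted_acoustic = model(first_acoustic, first_speech_recognition_feature)
    loss = F.smooth_l1_loss(predicted_acoustic, first_acoustic)
    optimizer.zero_grad()
    loss.backward()    # back-propagation
    optimizer.step()   # update the speech processing model parameters
    return loss.item()
```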
It is understood that the foregoing operations S220 and S230 are essentially two processes: separating the speech recognition feature from the prosodic feature, and then fusing them again. By comparing the fused predicted acoustic feature with the first acoustic feature of the first speech data, it may be determined whether the separated prosodic feature is accurate. If the separated prosodic feature is accurate, then after the first speech recognition feature and the first prosodic feature of the first speech data are input into the decoding network together, the predicted acoustic feature obtained by the decoding network should differ only slightly from the first acoustic feature.
In summary, the above training method can improve the accuracy of the speech processing model. As a result, if a speech recognition feature other than the first speech recognition feature is input into the decoding network in place of the first speech recognition feature, object conversion of the speech data can be realized accurately. For example, with the speech processing model, speech data can be enhanced along the object-attribute dimension. The speech processing model can also perform object conversion on speech data collected in real time, providing a voice-changing function for users and making the speech recording function of a terminal device more engaging.
Fig. 3 is a schematic diagram of a structure of a speech processing model according to an embodiment of the present disclosure.
According to the embodiment of the disclosure, when determining the phoneme feature, the phonemes can be force-aligned with the acoustic feature, which avoids missing sounds in the speech data generated from the predicted acoustic feature.
Illustratively, as shown in fig. 3, the speech processing model 300 of this embodiment may include a phoneme alignment network 330 and a phoneme encoding network 340 in addition to the prosody encoding network 310 and the decoding network 320.
The phoneme alignment network 330 is similar to a speech recognition network and may be constructed based on a Long Short-Term Memory-Connectionist Temporal Classification (LSTM-CTC) architecture, a Chain Model architecture, etc. After the first acoustic feature is obtained, the first acoustic feature 301 may be input into the phoneme alignment network 330 to obtain a first phoneme sequence for the first speech data. For example, the first acoustic feature 301 is the sequence {M_1, M_2, ..., M_n} of MFCC features of the n audio frames obtained by framing. Through processing by the phoneme alignment network 330, the phoneme corresponding to each MFCC feature may be obtained. The n phonemes corresponding to the n MFCC features form the first phoneme sequence. Through the phoneme alignment network 330, the MFCC features may be force-aligned with the phonemes, such that the resulting phoneme sequence characterizes the number of frames over which each phoneme is sustained.
For example, if the phoneme alignment network 330 is constructed based on the LSTM-CTC architecture, the first phoneme sequence is obtained from the phoneme corresponding to the largest element of each of the n probability vectors output by the LSTM-CTC architecture. If the phoneme alignment network 330 is constructed based on the Chain Model architecture, after the corresponding phonemes are obtained from the probability vectors output by the Chain Model, the phonemes need to be post-processed. For example, if the Chain Model decodes MFCC features taken one frame out of every several frames, the phonemes obtained from the probability vectors need to be supplemented according to the number of skipped frames. If the two phonemes obtained from two adjacent probability vectors output by the Chain Model are "b" and "a", and the Chain Model takes one frame every two frames, the phoneme sequence obtained after supplementing should include "bba", "baa" or the like. The rule for supplementing phonemes may be set empirically, which is not limited by the present disclosure.
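The frame supplementation described above can be illustrated with a small sketch. Repeating the left-hand phoneme into the skipped frame is one possible rule (yielding "bba" in the example); the patent leaves the exact rule to be set empirically.

```python
# Illustrative sketch (not from the patent text) of supplementing a subsampled
# phoneme sequence to frame level. With a decoding stride of 2, the skipped
# frame between two kept frames repeats its left neighbour's phoneme here,
# so ["b", "a"] becomes ["b", "b", "a"].
from typing import List

def expand_to_frame_level(phonemes: List[str], stride: int = 2) -> List[str]:
    frames: List[str] = []
    for i, ph in enumerate(phonemes):
        frames.append(ph)
        if i < len(phonemes) - 1:
            # Fill the (stride - 1) skipped frames with the left phoneme; the
            # right phoneme would give "baa" instead.
            frames.extend([ph] * (stride - 1))
    return frames

print(expand_to_frame_level(["b", "a"]))  # ['b', 'b', 'a']
```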
The phoneme coding network 340 may adopt a combined architecture of CNN and RNN, so as to comprehensively consider the context information of the phonemes while extracting features of the phonemes, and improve the accuracy of the obtained phoneme features.
For example, after the first phoneme sequence is obtained, it may be mapped to a phoneme feature space to obtain a feature matrix 302 that characterizes the first phoneme sequence. The mapping may be implemented, for example, using a fully connected layer. After the feature matrix is obtained, the feature matrix 302 may be input directly into the phoneme encoding network, and the first phoneme feature may be obtained by encoding via the phoneme encoding network.
For example, the feature matrix 302 and the first speech recognition feature 303 of the first speech data may be input simultaneously to the phoneme encoding network 340 such that the encoded first phoneme feature has the object feature of the first speech data. For example, the feature matrix 302 may be spliced with the first speech recognition feature 303 and input into the phoneme encoding network 340. Thus, when the prosodic features are obtained based on the first phoneme features, the object features can be provided for the prosodic coding network, so that the prosodic coding network can better eliminate the object features.
In one embodiment, the phoneme encoding network 340 may be formed by sequentially connecting a plurality of CNNs, a max pooling layer (Maxpooling), an activation layer and a bidirectional long short-term memory network. The activation layer may be built based on the ReLU activation function. It will be appreciated that this structure of the phoneme encoding network is merely an example to facilitate an understanding of the present disclosure, which is not limited thereto.
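A minimal PyTorch sketch of such a phoneme encoding network (stacked convolutions, max pooling, a ReLU activation and a bidirectional LSTM) is given below; all dimensions and kernel sizes are illustrative assumptions.

```python
# Minimal PyTorch sketch of the phoneme encoding network described above.
import torch
from torch import nn

class PhonemeEncoder(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.MaxPool1d(kernel_size=3, stride=1, padding=1),  # keeps frame count
            nn.ReLU(),
        )
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, in_dim) = phoneme feature matrix spliced with the
        # speech recognition feature broadcast to each frame.
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.bilstm(h)
        return out  # first phoneme feature, one vector per frame
```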
After the first phoneme feature is obtained, the first phoneme feature, the first speech recognition feature 303 and the first acoustic feature 301 may be spliced and input into the prosodic encoding network 310, and the prosodic encoding network 310 outputs the first prosodic feature. The first prosodic feature, the first speech recognition feature 303 and the first acoustic feature 301 are then spliced and input into the decoding network 320, and the decoding network 320 outputs the predicted acoustic feature {M_1', M_2', ..., M_n'} 304.
Fig. 4 is a schematic structural diagram of a prosody encoding network according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, prosodic features may be extracted via multiple channels to improve the integrity of the extracted prosodic features. When extracting the features from the plurality of channels, the plurality of extracted features can be normalized to eliminate the features of the object in the extracted features as much as possible.
As shown in fig. 4, in this embodiment 400, the prosodic encoding network may include a feature extraction sub-network 411, a normalization sub-network 412, and an encoding sub-network 413, which are connected in sequence.
The feature extraction sub-network 411 may be composed of, for example, a plurality of convolution units Conv1 to Convp arranged in parallel, where p is a natural number greater than 1. The first acoustic feature 401, the first speech recognition feature 402 and the first phoneme feature 403 may be spliced and input into the feature extraction sub-network 411, and a plurality of feature data may be output from the feature extraction sub-network 411. For example, the three features may be spliced and then input into the convolution units Conv1 to Convp, and each of the convolution units Conv1 to Convp outputs one feature data, thereby yielding a plurality of feature data in total.
The normalization sub-network may be, for example, a Norm layer, and is configured to normalize the plurality of feature data so as to remove the object feature from them. For example, the plurality of feature data may be input into the normalization sub-network 412, and the data characterizing the first speech recognition feature 402 may be removed from the plurality of feature data by the normalization sub-network 412 to obtain the target feature data. The normalization may first compute a mean and a variance for each feature data. Each feature data is then regularized based on its mean and variance. For example, the mean may be subtracted from each element of the feature data, and the result divided by the variance, to obtain regularized feature data as one target feature data. Through this regularization, object features can be eliminated as much as possible while prosodic features are preserved.
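The regularization just described can be sketched as follows. Subtracting the mean and dividing by the variance follows the wording above (dividing by the standard deviation would be the more common variant), and the epsilon term is an added assumption for numerical stability.

```python
# Minimal sketch of per-feature regularization: for each feature data, subtract
# its mean and divide by its variance, as stated in the text.
from typing import List
import torch

def normalize_features(features: List[torch.Tensor]) -> List[torch.Tensor]:
    normalized = []
    for feat in features:
        mean = feat.mean()
        var = feat.var()
        normalized.append((feat - mean) / (var + 1e-8))  # epsilon for stability
    return normalized
```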
Wherein the encoding subnetwork 413 may employ a recurrent neural network or an attention model. After the target feature data is obtained, a plurality of target feature data are sequentially input to the encoding sub-network 413, and the first prosodic feature can be obtained based on the output of the encoding sub-network 413.
For example, if the encoding sub-network 413 employs a bidirectional long short-term memory network or a Transformer network, the output of the encoding sub-network 413 may be used directly as the first prosodic feature. The encoding sub-network 413 may also employ a BGRU, for example, to improve the accuracy of the resulting first prosodic feature.
For example, the encoding sub-network 413 may also include a variational autoencoder (VAE) to obtain a distribution of prosodic features. In one embodiment, the encoding sub-network 413 is composed of a BGRU and a VAE. After the target feature data is obtained, it may first be input into the BGRU, which outputs an intermediate feature. The intermediate feature is then input into the VAE, and the mean and variance of the prosodic feature distribution are obtained after processing by the VAE. Based on the mean and variance, a prosodic feature distribution 404 may be obtained, and randomly sampling the prosodic feature distribution 404 yields the first prosodic feature 405. This embodiment uses a VAE to obtain the first prosodic feature because, from prior experience, the prosodic feature distribution can be approximated as a Gaussian distribution. With the VAE, the prosodic feature distribution can better approximate the target probability distribution, which enriches the feature space and avoids inaccurate prosodic features caused by feature dispersion. The accuracy of the obtained first prosodic feature can thus be effectively improved.
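Random sampling from the distribution parameterized by the VAE can be sketched with the reparameterization trick. Predicting the log-variance rather than the variance itself is an assumption made here for numerical convenience.

```python
# Minimal PyTorch sketch of drawing the first prosodic feature from the
# distribution parameterized by the VAE output (per-dimension mean and
# log-variance, an assumed parameterization).
import torch

def sample_prosodic_feature(mean: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)  # random sample from N(0, I)
    return mean + eps * std      # first prosodic feature
```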
According to the embodiment of the disclosure, when a VAE is used, the VAE and each network preceding it can be trained according to the difference between the prosodic feature distribution determined by the mean and variance output by the VAE and a predetermined feature distribution, so that the prosodic feature distribution comes closer to the target probability distribution and the accuracy of the obtained prosodic features is further improved. The target probability distribution may be, for example, the normal distribution N(0, 1).
Thus, when training the speech processing model, a first loss of the speech processing model may be determined based on the prosodic feature distribution and the predetermined feature distribution. A second loss of the speech processing model is then determined based on the difference between the predicted acoustic feature and the first acoustic feature. Finally, the speech processing model is trained based on the first loss and the second loss. For example, the KL divergence may be used to represent the first loss, and the Smooth-L1 loss may be used to represent the second loss. It is to be understood that the above methods of representing the first loss and the second loss are merely examples to facilitate understanding of the present disclosure, which is not limited thereto.
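A sketch of the two losses named above is given below, assuming the VAE outputs a per-dimension mean and log-variance and that the two terms are simply summed; the weighting between them is not specified in the text.

```python
# Minimal PyTorch sketch of the first loss (KL divergence between N(mean, var)
# and the standard normal N(0, 1)) and the second loss (Smooth-L1 between the
# predicted and the first acoustic features).
import torch
import torch.nn.functional as F

def speech_model_loss(mean, log_var, predicted_acoustic, first_acoustic):
    # Closed-form KL(N(mean, var) || N(0, 1)), summed over feature dimensions.
    kl = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp(), dim=-1).mean()
    recon = F.smooth_l1_loss(predicted_acoustic, first_acoustic)
    return kl + recon  # equal weighting is an assumption
```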
For example, the entire speech processing model may be trained based on the first loss and the second loss. The decoding network may also be trained based on the second loss, and the prosodic encoding network may be trained based on the first loss and the second loss. During the training process, a back propagation algorithm may be employed to train the decoding network and the prosody encoding network.
Fig. 5 is a schematic diagram of a training method of a speech processing model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the speech processing model may further comprise an object recognition network for extracting the first speech recognition feature from the first acoustic feature. Specifically, the first acoustic feature may be input into the object recognition network, and the first speech recognition feature is output by a layer of the object recognition network adjacent to the output layer. The object recognition network may be constructed based on the object recognition model described above, which is not limited by the present disclosure.
After the first speech recognition feature is obtained, the entire speech processing model may be trained using operations S210 through S240 described above.
In one embodiment, the object recognition network may be pre-trained based on second speech data in a predetermined audio set and first labeling information for the second speech data. The first speech recognition feature is then obtained based on the pre-trained object recognition network, and the whole speech processing model is trained. The predetermined audio set may be any open-source audio set, such as AISHELL or LibriSpeech, while the audio set in which the first speech data is located is a data set used for training a speech recognition model. By pre-training the object recognition network in this way, the speech processing model gains higher stability and generalization capability.
For example, the first labeling information may be a tag indicating object probability information for the second speech data. For example, the first labeling information may include the actual object of the speech in the second speech data; if the actual object is speaker A, the object probability information has a probability of 1 for speaker A and a probability of 0 for speakers other than speaker A. The object recognition model outputs the probability that the object of the speech in the second speech data is each of a plurality of predetermined objects. A cross-entropy loss function may be employed to determine the loss of the object recognition model based on the actual object and the probabilities output by the object recognition model, and the object recognition network converges after a number of iterations.
In accordance with an embodiment of the present disclosure, as shown in fig. 5, in embodiment 500, a speech processing model may include an object recognition network 510, a phoneme alignment network 520, a phoneme encoding network 530, a prosody encoding network 540, and a decoding network 550.
The object recognition network 510 may include a convolution layer, a gated recurrent unit, a bottleneck layer and a fully connected layer, which are connected in sequence. In this embodiment, the first acoustic feature 501 may be input into the convolution layer of the object recognition network 510 and processed sequentially by the convolution layer, the gated recurrent unit, the bottleneck layer and the fully connected layer; the fully connected layer then outputs the probability that the object of the speech in the first speech data is each of a plurality of predetermined objects. The data output by the bottleneck layer is the first speech recognition feature of the first speech data.
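A minimal PyTorch sketch of such an object recognition network is given below. The layer sizes, the number of predetermined objects and the mean pooling over frames are illustrative assumptions; only the sequence convolution layer, gated recurrent unit, bottleneck layer, fully connected layer follows the description above.

```python
# Minimal sketch of an object recognition network whose bottleneck output
# serves as the speech recognition feature.
import torch
from torch import nn

class ObjectRecognitionNetwork(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 256, bottleneck: int = 128, n_speakers: int = 1000):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.bottleneck = nn.Linear(hidden, bottleneck)
        self.classifier = nn.Linear(bottleneck, n_speakers)

    def forward(self, acoustic: torch.Tensor):
        # acoustic: (batch, frames, in_dim)
        h = torch.relu(self.conv(acoustic.transpose(1, 2)).transpose(1, 2))
        h, _ = self.gru(h)
        h = h.mean(dim=1)  # pool frames into one utterance-level vector
        speech_recognition_feature = self.bottleneck(h)
        logits = self.classifier(speech_recognition_feature)  # per-object scores
        return logits, speech_recognition_feature
```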
In accordance with an embodiment of the present disclosure, as shown in fig. 5, in the case where the speech processing model further includes a phoneme alignment network 520 and a phoneme encoding network 530, the first acoustic feature 501 is input into the phoneme alignment network 520, and a first phoneme sequence may be obtained. The first phoneme sequence and the first speech recognition feature output by the bottleneck layer are spliced and input to the phoneme encoding network 530, and after being processed by the phoneme encoding network 530, the first phoneme feature may be output by the phoneme encoding network 530.
Illustratively, where the speech processing model includes the phoneme alignment network 520, this embodiment may also pre-train the phoneme alignment network based on the second speech data in the predetermined audio set and second labeling information for the second speech data. During training, the acoustic features of the second speech data (i.e., the MFCC features of the audio frames obtained by framing the second speech data) are input into the phoneme alignment network 520 to obtain a predicted phoneme sequence. From the difference between the predicted phoneme sequence and the second phoneme sequence, a loss of the phoneme alignment network 520 may be determined using the cross-entropy loss, and the phoneme alignment network is made to converge after multiple iterations based on this loss. By pre-training the phoneme alignment network in this way, the speech processing model gains higher stability and generalization capability.
For example, the second labeling information indicates a second phoneme sequence for the second speech data, and the second phoneme sequence indicates the duration of each phoneme included in the second speech data. The predetermined audio set may be the open-source data set described previously. The second phoneme sequence is similar to the first phoneme sequence described above, and may be composed of the phonemes actually corresponding to the MFCC features of the audio frames obtained by framing the second speech data.
According to an embodiment of the present disclosure, as shown in fig. 5, in the case where the prosody encoding network 540 includes a feature extraction sub-network, a normalized sub-network, BGRU, and VAE connected in sequence, the first acoustic feature 501, the first phoneme feature, and the first speech recognition feature output by the bottleneck layer may be spliced and then input into the feature extraction sub-network, and after being sequentially processed through the feature extraction sub-network, the normalized sub-network, BGRU, and VAE, variance and mean of prosody feature distribution may be output by the VAE. The first prosodic features may be derived by randomly sampling from a prosodic feature distribution determined based on the variance and the mean.
After the first prosodic feature is obtained, the first prosodic feature, the first acoustic feature, and the first speech recognition feature may be spliced and input into the decoding network 550, and the predicted acoustic feature 502 may be output after processing via the decoding network 550.
Based on the speech processing model obtained with the training method provided by the present disclosure, the present disclosure further provides a method for enhancing data, which will be described in detail below with reference to fig. 6.
Fig. 6 is a flow diagram of a method of enhancing data according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the method 600 of enhancing data may include operations S610 to S640. The speech processing model may be trained using the training method described above.
In operation S610, second phoneme features of the third speech data are determined based on the second acoustic features of the third speech data.
According to an embodiment of the present disclosure, the third speech data is similar to the first speech data described previously, and the second acoustic feature is similar to the first acoustic feature described previously. The method for obtaining the second phoneme feature is similar to the method for obtaining the first phoneme feature described above, and will not be described here again.
In operation S620, a prosodic encoding network of the speech processing model is employed to obtain second prosodic features of the third speech data based on the second acoustic features, the second speech recognition features and the second phoneme features of the third speech data.
According to an embodiment of the present disclosure, the second speech recognition feature is similar to the first speech recognition feature described previously. The method of obtaining the second prosodic features is similar to the method of obtaining the first prosodic features described above and will not be described again.
In operation S630, the target acoustic feature is obtained using the decoding network of the speech processing model based on the second acoustic feature, the target speech recognition feature, and the second prosodic feature.
The target speech recognition feature is a speech recognition feature of an object other than the object to which the third speech data relates. For example, if the third speech data is obtained by recording the speech of speaker A, the target speech recognition feature may be obtained by processing speech data recorded from speaker B. The method for obtaining the target speech recognition feature is similar to the method for processing the third speech data to obtain the second speech recognition feature, and will not be described here again.
Fourth voice data having target voice recognition characteristics is obtained based on the target acoustic characteristics in operation S640.
According to embodiments of the present disclosure, a vocoder may be used to upsample target acoustic features to synthesize speech data having a target object style corresponding to target speech recognition features.
For example, when the third speech data is a training sample of the speech recognition model, the number of training samples may be increased from the object feature dimension by employing the method of enhancing data of this embodiment. Thus, when the speech recognition model is trained based on the training sample, the accuracy and generalization capability of the speech recognition model obtained by training can be improved.
According to an embodiment of the present disclosure, when third voice data is a training sample of a voice recognition model and the third voice data has a first tag indicating a text corresponding to voice and a second tag indicating an object corresponding to voice, the first tag may be given to fourth voice data obtained based on the third voice data, and a third tag indicating an object corresponding to a target voice recognition feature may be added to the fourth voice data. Thus, the fourth voice data added with the label can be used as a training sample of the voice recognition model.
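The labeling scheme just described can be illustrated with a small sketch; the field names are hypothetical and only show how the text tag is carried over while the object tag is replaced with the target object.

```python
# Illustrative sketch (names are assumptions): build an additional labeled
# training sample from the generated fourth speech data, keeping the text tag
# of the third speech data and attaching the target object as the new tag.
def build_augmented_sample(third_sample: dict, fourth_audio, target_speaker: str) -> dict:
    return {
        "audio": fourth_audio,          # synthesized fourth speech data
        "text": third_sample["text"],   # first tag: text content is unchanged
        "speaker": target_speaker,      # third tag: object of the target feature
    }
```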
Based on the training method of the voice processing model provided by the disclosure, the disclosure also provides a training device of the voice processing model. The device will be described in detail below in connection with fig. 7.
Fig. 7 is a block diagram of a training apparatus of a speech processing model according to an embodiment of the present disclosure.
As shown in fig. 7, the training apparatus 700 of the speech processing model of this embodiment may include a first phoneme feature determining module 710, a first prosodic feature obtaining module 720, a first acoustic feature obtaining module 730, and a model training module 740. Wherein the speech processing model includes a prosodic encoding network and a decoding network.
The first phoneme feature determining module 710 is configured to determine a first phoneme feature of the first speech data based on a first acoustic feature of the first speech data. In an embodiment, the first phoneme characteristic determining module 710 may be used to perform the operation S210 described above, which is not described herein.
The first prosodic feature obtaining module 720 is configured to obtain a first prosodic feature of the first speech data using the prosodic coding network based on the first acoustic feature, the first speech recognition feature and the first phoneme feature of the first speech data. In an embodiment, the first prosodic feature obtaining module 720 may be used to perform the operation S220 described above, which is not described herein.
The first acoustic feature obtaining module 730 is configured to obtain a predicted acoustic feature using a decoding network based on the first acoustic feature, the first speech recognition feature, and the first prosodic feature. In an embodiment, the first acoustic feature obtaining module 730 may be configured to perform the operation S230 described above, which is not described herein.
The model training module 740 is configured to train the speech processing model based on differences between the predicted acoustic features and the first acoustic features. In an embodiment, the model training module 740 may be used to perform the operation S240 described above, which is not described herein.
According to an embodiment of the present disclosure, the above-described speech processing model further includes a phoneme alignment network and a phoneme encoding network. The first phoneme characteristic determining module 710 may include a phoneme sequence obtaining submodule and a phoneme characteristic obtaining submodule. The phoneme sequence obtaining submodule is used for inputting the first acoustic feature into the phoneme alignment network to obtain a first phoneme sequence aiming at the first voice data. The phoneme characteristic obtaining submodule is used for inputting a characteristic matrix representing the first phoneme sequence and the first voice recognition characteristic into the phoneme coding network to obtain the first phoneme characteristic.
According to an embodiment of the present disclosure, the prosodic encoding network includes a feature extraction sub-network, a normalization sub-network, and an encoding sub-network. The above-described first prosodic feature obtaining module 720 may include a feature data obtaining sub-module, a target data obtaining sub-module, and a prosodic feature obtaining sub-module. The feature data obtaining submodule is used for inputting the first acoustic feature, the first voice recognition feature and the first phoneme feature into the feature extraction subnetwork to obtain a plurality of feature data. The target data obtaining submodule is used for inputting the plurality of feature data into the normalized subnetwork to obtain target feature data from which the data representing the first voice recognition feature is removed. The prosodic feature obtaining submodule is used for inputting the target feature data into the coding subnetwork to obtain the first prosodic feature.
According to an embodiment of the present disclosure, the encoding sub-network includes a bidirectional gated recurrent unit and a variational autoencoder. The prosodic feature obtaining submodule includes an intermediate feature obtaining unit, a distribution determining unit and a prosodic feature obtaining unit. The intermediate feature obtaining unit is used for inputting the target feature data into the bidirectional gated recurrent unit to obtain an intermediate feature. The distribution determining unit is used for inputting the intermediate feature into the variational autoencoder to obtain the mean and variance of the prosodic feature distribution. The prosodic feature obtaining unit is used for randomly sampling the prosodic feature distribution based on the mean and variance to obtain the first prosodic feature.
According to an embodiment of the present disclosure, the model training module 740 includes a first loss determination sub-module, a second loss determination sub-module, and a training sub-module. The first loss determination submodule is used for determining a first loss of the voice processing model based on the prosodic feature distribution and the preset feature distribution. The second loss determination submodule is used for determining a second loss of the speech processing model based on a difference between the predicted acoustic feature and the first acoustic feature. The training submodule is used for training the speech processing model based on the first loss and the second loss.
According to an embodiment of the present disclosure, the training sub-module includes a first training unit and a second training unit. The first training unit is configured to train the decoding network based on the second penalty. The second training unit is used for training the prosody encoding network based on the first loss and the second loss.
According to an embodiment of the present disclosure, the above-described speech processing model further includes an object recognition network. The training device 700 of the above-mentioned speech processing model further comprises a speech feature extraction module for extracting a first speech recognition feature from the first acoustic feature by: the first acoustic feature is input into an object recognition network to obtain a first speech recognition feature.
According to an embodiment of the present disclosure, the training apparatus 700 of the foregoing speech processing model further includes a first pre-training module configured to pre-train the object recognition network based on the second speech data in the predetermined audio set and the first labeling information for the second speech data, where the first labeling information is used to indicate object probability information for the second speech data.
According to an embodiment of the present disclosure, the training device 700 of the above-mentioned speech processing model further includes a second pre-training module, configured to pre-train the phoneme alignment network based on second speech data in the predetermined audio set and second labeling information for the second speech data, where the second labeling information indicates a second phoneme sequence for the second speech data, and the second phoneme sequence indicates the duration of each phoneme included in the second speech data.
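For illustration, the two pre-training modules could be driven by a step of the following form, assuming the first labeling information is reduced to an object (speaker) class identifier and the second labeling information is expanded, using the per-phoneme durations, into a frame-level phoneme label; these encodings are assumptions made for the sketch, not requirements of the embodiment.

import torch.nn.functional as F

def pretrain_step(object_net, align_net, batch, optimizer):
    # Hypothetical pre-training step on the predetermined audio set.
    acoustic = batch["acoustic"]              # second speech data features, (B, T, n_mels)
    speaker_id = batch["speaker_id"]          # derived from the first labeling information, (B,)
    frame_phonemes = batch["frame_phonemes"]  # duration-expanded second labeling information, (B, T)

    spk_logits = object_net(acoustic)         # (B, n_speakers)
    phn_logits = align_net(acoustic)          # (B, T, n_phonemes)

    # Object recognition network: utterance-level classification loss.
    # Phoneme alignment network: frame-level classification loss.
    loss = F.cross_entropy(spk_logits, speaker_id) \
        + F.cross_entropy(phn_logits.transpose(1, 2), frame_phonemes)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()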
Based on the method for enhancing data provided by the disclosure, the disclosure also provides a device for enhancing data. The device will be described in detail below in connection with fig. 8.
Fig. 8 is a block diagram of an apparatus for enhancing data according to an embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 for enhancing data of this embodiment may include a second phoneme feature determining module 810, a second prosodic feature obtaining module 820, a second acoustic feature obtaining module 830, and a voice data obtaining module 840.
The second phoneme feature determining module 810 is configured to determine the second phoneme feature of the third speech data based on the second acoustic feature of the third speech data. In an embodiment, the second phoneme characteristic determining module 810 may be used to perform the operation S610 described above, which is not described herein.
The second prosodic feature obtaining module 820 is configured to obtain a second prosodic feature of the third speech data using a prosodic coding network of the speech processing model based on the second acoustic feature, the second speech recognition feature of the third speech data, and the second phoneme feature. The speech processing model may be trained by the training device of the speech processing model described above. In an embodiment, the second prosodic feature obtaining module 820 may be used to perform the operation S620 described above, which is not described herein.
The second acoustic feature obtaining module 830 is configured to obtain the target acoustic feature using a decoding network of the speech processing model based on the second acoustic feature, the target speech recognition feature, and the second prosodic feature. Wherein the target speech recognition feature is a speech recognition feature of an object other than the object to which the third speech data relates. In an embodiment, the second acoustic feature obtaining module 830 may be configured to perform the operation S630 described above, which is not described herein.
The voice data obtaining module 840 is configured to obtain fourth voice data having target voice recognition features based on the target acoustic features. In an embodiment, the voice data obtaining module 840 may be used to perform the operation S640 described above, which is not described herein.
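Taken together, modules 810 to 840 may be chained as in the following sketch, where speech_model and vocoder are assumed interfaces to the trained sub-networks and to a waveform generator; substituting the target speech recognition feature at the decoding step is what yields the enhanced fourth speech data. All attribute names are hypothetical.

import torch

def enhance(speech_model, vocoder, third_acoustic, target_speaker_feature):
    # Hypothetical chaining of modules 810-840; speech_model and vocoder are
    # assumed interfaces and are not defined by this embodiment.
    with torch.no_grad():
        source_speaker = speech_model.object_net(third_acoustic)         # second speech recognition feature
        phoneme_feat = speech_model.phoneme_path(third_acoustic, source_speaker)
        target_data = speech_model.extract_and_normalize(
            third_acoustic, source_speaker, phoneme_feat)
        prosody, _, _ = speech_model.prosody_encoder(target_data)        # second prosodic feature
        # Decode with the target speech recognition feature instead of the source one.
        target_acoustic = speech_model.decoder(
            third_acoustic, target_speaker_feature, prosody)
        return vocoder(target_acoustic)                                  # fourth speech data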
It should be noted that, in the technical solution of the present disclosure, the processes of obtaining, collecting, storing, using, processing, transmitting, providing, and disclosing the personal information of the user all comply with the relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that may be used to implement the training methods of the speech processing model and/or the methods of enhancing data of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the various methods and processes described above, such as a training method of a speech processing model and/or a method of enhancing data. For example, in some embodiments, the training method of the speech processing model and/or the method of enhancing data may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into RAM 903 and executed by the computing unit 901, one or more steps of the above-described training method of a speech processing model and/or method of enhancing data may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured in any other suitable manner (e.g., by means of firmware) to perform the training method of the speech processing model and/or the method of enhancing the data.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (20)

1. A method of training a speech processing model, wherein the speech processing model comprises a prosodic encoding network and a decoding network; the method comprises the following steps:
determining a first phoneme feature of first speech data based on a first acoustic feature of the first speech data;
based on the first acoustic feature, the first voice recognition feature of the first voice data and the first phoneme feature, obtaining a first prosodic feature of the first voice data by adopting the prosodic coding network, wherein the first voice recognition feature is used for uniquely representing the feature of an object involved in the first voice data;
Obtaining predicted acoustic features by using the decoding network based on the first acoustic features, the first speech recognition features, and the first prosodic features; and
training the speech processing model based on differences between the predicted acoustic features and the first acoustic features;
wherein the speech processing model further comprises an object recognition network; the method further comprises the steps of: inputting the first acoustic feature into the object recognition network to obtain the first voice recognition feature;
wherein the object recognition network is constructed based on an object recognition model comprising a pooled vector-based object encoder.
2. The method of claim 1, wherein the speech processing model further comprises a phoneme alignment network and a phoneme encoding network; the determining the first phoneme feature of the first speech data based on the first acoustic feature of the first speech data comprises:
inputting the first acoustic feature into the phoneme alignment network to obtain a first phoneme sequence aiming at the first voice data; and
inputting the feature matrix representing the first phoneme sequence and the first voice recognition feature into the phoneme encoding network to obtain the first phoneme feature.
3. The method of claim 1, wherein the prosodic encoding network includes a feature extraction sub-network, a normalization sub-network, and an encoding sub-network; obtaining a first prosodic feature of the first speech data using the prosodic encoding network comprises:
inputting the first acoustic feature, the first voice recognition feature and the first phoneme feature into the feature extraction sub-network to obtain a plurality of feature data;
inputting the plurality of feature data into the normalization sub-network to obtain target feature data from which feature data representing the first voice recognition feature is removed; and
inputting the target feature data into the encoding sub-network to obtain the first prosodic feature.
4. A method according to claim 3, wherein the encoding sub-network comprises a bidirectional gated recurrent unit and a variational autoencoder; the inputting the target feature data into the encoding sub-network to obtain the first prosodic feature comprises:
inputting the target feature data into the bidirectional gated recurrent unit to obtain an intermediate feature;
inputting the intermediate feature into the variational autoencoder to obtain a mean and a variance of a prosodic feature distribution; and
randomly sampling the prosodic feature distribution based on the mean and the variance to obtain the first prosodic feature.
5. The method of claim 4, wherein training the speech processing model comprises:
determining a first penalty of the speech processing model based on the prosodic feature distribution and a predetermined feature distribution;
determining a second loss of the speech processing model based on a difference between the predicted acoustic feature and the first acoustic feature; and
the speech processing model is trained based on the first loss and the second loss.
6. The method of claim 5, wherein training the speech processing model based on the first loss and the second loss comprises:
training the decoding network based on the second loss; and
training the prosody encoding network based on the first loss and the second loss.
7. The method of claim 1, further comprising:
pre-training the object recognition network based on second speech data in a predetermined audio set and first labeling information for the second speech data,
wherein the first labeling information is used for indicating object probability information for the second speech data.
8. The method of claim 2, further comprising:
pre-training the phoneme alignment network based on second speech data in a predetermined audio set and second labeling information for the second speech data,
the second labeling information is used for indicating a second phoneme sequence aiming at the second voice data, and the second phoneme sequence is used for indicating the duration of each phoneme included in the second voice data.
9. A method of enhancing data, comprising:
determining a second phoneme feature of third speech data based on a second acoustic feature of the third speech data;
obtaining a second prosodic feature of the third speech data using a prosodic encoding network of a speech processing model based on the second acoustic feature, the second speech recognition feature of the third speech data, and the second phoneme feature;
based on the second acoustic feature, the target voice recognition feature and the second prosodic feature, obtaining a target acoustic feature by adopting a decoding network of the voice processing model; and
based on the target acoustic features, fourth speech data with the target speech recognition features are obtained,
Wherein the speech processing model is trained by the method of any one of claims 1-8; the target speech recognition feature is a speech recognition feature of an object other than the object to which the third speech data relates.
10. A training device of a speech processing model, wherein the speech processing model comprises a prosodic encoding network and a decoding network; the device comprises:
a first phoneme feature determining module for determining a first phoneme feature of first speech data based on a first acoustic feature of the first speech data;
a first prosodic feature obtaining module, configured to obtain, based on the first acoustic feature, a first speech recognition feature of the first speech data, and the first phoneme feature, a first prosodic feature of the first speech data using the prosodic coding network, where the first speech recognition feature is used to uniquely represent a feature of an object to which the first speech data relates;
a first acoustic feature obtaining module, configured to obtain a predicted acoustic feature using the decoding network based on the first acoustic feature, the first speech recognition feature, and the first prosodic feature; and
A model training module for training the speech processing model based on differences between the predicted acoustic features and the first acoustic features;
wherein the speech processing model further comprises an object recognition network; the apparatus further comprises a speech feature extraction module, wherein the speech feature extraction module is used for inputting the first acoustic feature into the object recognition network to obtain the first speech recognition feature;
wherein the object recognition network is constructed based on an object recognition model comprising a pooled vector-based object encoder.
11. The apparatus of claim 10, wherein the speech processing model further comprises a phoneme alignment network and a phoneme encoding network; the first phoneme characteristic determining module comprises:
a phoneme sequence obtaining sub-module, configured to input the first acoustic feature into the phoneme alignment network to obtain a first phoneme sequence for the first speech data; and
and the phoneme characteristic obtaining submodule is used for inputting the characteristic matrix representing the first phoneme sequence and the first voice recognition characteristic into the phoneme coding network to obtain the first phoneme characteristic.
12. The apparatus of claim 10, wherein the prosody encoding network comprises a feature extraction sub-network, a normalization sub-network, and an encoding sub-network; the first prosodic feature obtaining module includes:
The feature data obtaining sub-module is used for inputting the first acoustic feature, the first voice recognition feature and the first phoneme feature into the feature extraction sub-network to obtain a plurality of feature data;
a target data obtaining sub-module, configured to input the plurality of feature data into the normalization sub-network to obtain target feature data from which feature data representing the first speech recognition feature is removed; and
a prosodic feature obtaining sub-module, configured to input the target feature data into the encoding sub-network to obtain the first prosodic feature.
13. The apparatus of claim 12, wherein the encoding sub-network comprises a bidirectional gated recurrent unit and a variational autoencoder; the prosodic feature obtaining sub-module includes:
an intermediate feature obtaining unit for inputting the target feature data into the bidirectional gated recurrent unit to obtain an intermediate feature;
a distribution determining unit for inputting the intermediate feature into the variational autoencoder to obtain a mean and a variance of a prosodic feature distribution; and
a prosodic feature obtaining unit for randomly sampling the prosodic feature distribution based on the mean and the variance to obtain the first prosodic feature.
14. The apparatus of claim 13, wherein the model training module comprises:
a first loss determination submodule for determining a first loss of the speech processing model based on the prosodic feature distribution and a predetermined feature distribution;
a second loss determination submodule for determining a second loss of the speech processing model based on a difference between the predicted acoustic feature and the first acoustic feature; and
and the training submodule is used for training the voice processing model based on the first loss and the second loss.
15. The apparatus of claim 14, wherein the training submodule comprises:
a first training unit for training the decoding network based on the second loss; and
and a second training unit for training the prosody encoding network based on the first loss and the second loss.
16. The apparatus of claim 10, further comprising:
a first pre-training module for pre-training the object recognition network based on second speech data in a predetermined audio set and first labeling information for the second speech data,
wherein the first labeling information is used for indicating object probability information for the second speech data.
17. The apparatus of claim 11, further comprising:
a second pre-training module for pre-training the phoneme alignment network based on second speech data in a predetermined audio set and second labeling information for the second speech data,
wherein the second labeling information indicates a second phoneme sequence for the second speech data, the second phoneme sequence indicating a duration of each phoneme included in the second speech data.
18. An apparatus for enhancing data, comprising:
a second phoneme feature determining module for determining a second phoneme feature of the third speech data based on a second acoustic feature of the third speech data;
a second prosodic feature obtaining module, configured to obtain a second prosodic feature of the third speech data using a prosodic encoding network of a speech processing model based on the second acoustic feature, a second speech recognition feature of the third speech data, and the second phoneme feature;
a second acoustic feature obtaining module, configured to obtain a target acoustic feature by using a decoding network of the speech processing model based on the second acoustic feature, the target speech recognition feature, and the second prosody feature; and
A voice data obtaining module for obtaining fourth voice data with the target voice recognition feature based on the target acoustic feature,
wherein the speech processing model is trained using the apparatus of any one of claims 10-17; the target speech recognition feature is a speech recognition feature of an object other than the object to which the third speech data relates.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
20. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-9.
CN202111083473.9A 2021-09-15 2021-09-15 Training method of voice processing model, data enhancement method, device and equipment Active CN113793598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111083473.9A CN113793598B (en) 2021-09-15 2021-09-15 Training method of voice processing model, data enhancement method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111083473.9A CN113793598B (en) 2021-09-15 2021-09-15 Training method of voice processing model, data enhancement method, device and equipment

Publications (2)

Publication Number Publication Date
CN113793598A CN113793598A (en) 2021-12-14
CN113793598B true CN113793598B (en) 2023-10-27

Family

ID=79183718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111083473.9A Active CN113793598B (en) 2021-09-15 2021-09-15 Training method of voice processing model, data enhancement method, device and equipment

Country Status (1)

Country Link
CN (1) CN113793598B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018175892A1 (en) * 2017-03-23 2018-09-27 D&M Holdings, Inc. System providing expressive and emotive text-to-speech
CN111247584B (en) * 2019-12-24 2023-05-23 深圳市优必选科技股份有限公司 Voice conversion method, system, device and storage medium

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007178686A (en) * 2005-12-27 2007-07-12 Matsushita Electric Ind Co Ltd Speech converter
CN101606190A (en) * 2007-02-19 2009-12-16 松下电器产业株式会社 Firmly sound conversion device, sound conversion device, speech synthesizing device, sound converting method, speech synthesizing method and program
US10008193B1 (en) * 2016-08-19 2018-06-26 Oben, Inc. Method and system for speech-to-singing voice conversion
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure
CN111754973A (en) * 2019-09-23 2020-10-09 北京京东尚科信息技术有限公司 Voice synthesis method and device and storage medium
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112420016A (en) * 2020-11-20 2021-02-26 四川长虹电器股份有限公司 Method and device for aligning synthesized voice and text and computer storage medium
CN112382271A (en) * 2020-11-30 2021-02-19 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium
CN112786018A (en) * 2020-12-31 2021-05-11 科大讯飞股份有限公司 Speech conversion and related model training method, electronic equipment and storage device
CN112786012A (en) * 2020-12-31 2021-05-11 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
EXEMPLAR-BASED SPARSE REPRESENTATION OF TIMBRE AND PROSODY FOR VOICE CONVERSION; Huaiping Ming et al.; ICASSP 2016; full text *
Voice conversion algorithm based on multi-spectral feature generative adversarial networks; Zhang Xiao; Zhang Wei; Wang Wenhao; Wan Yongjing; Computer Engineering & Science (05); full text *

Also Published As

Publication number Publication date
CN113793598A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN107945786B (en) Speech synthesis method and device
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
WO2021174757A1 (en) Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
EP3680894B1 (en) Real-time speech recognition method and apparatus based on truncated attention, device and computer-readable storage medium
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
WO2021051544A1 (en) Voice recognition method and device
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
US11120802B2 (en) Diarization driven by the ASR based segmentation
CN111402891A (en) Speech recognition method, apparatus, device and storage medium
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN109697978B (en) Method and apparatus for generating a model
CN108877779B (en) Method and device for detecting voice tail point
CN114495977B (en) Speech translation and model training method, device, electronic equipment and storage medium
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN113793598B (en) Training method of voice processing model, data enhancement method, device and equipment
CN113689866B (en) Training method and device of voice conversion model, electronic equipment and medium
US20220277732A1 (en) Method and apparatus for training speech recognition model, electronic device and storage medium
CN112735432B (en) Audio identification method, device, electronic equipment and storage medium
CN114067793A (en) Audio processing method and device, electronic equipment and readable storage medium
CN114512121A (en) Speech synthesis method, model training method and device
CN113920987A (en) Voice recognition method, device, equipment and storage medium
CN112951270A (en) Voice fluency detection method and device and electronic equipment
CN113327596B (en) Training method of voice recognition model, voice recognition method and device
US20230377560A1 (en) Speech tendency classification
CN113380233B (en) Audio recognition method, device, training method, training device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant