CN113793598A - Training method of voice processing model, data enhancement method, device and equipment - Google Patents

Training method of voice processing model, data enhancement method, device and equipment

Info

Publication number
CN113793598A
CN113793598A (application CN202111083473.9A)
Authority
CN
China
Prior art keywords
feature
data
speech
network
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111083473.9A
Other languages
Chinese (zh)
Other versions
CN113793598B (en)
Inventor
赵情恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111083473.9A priority Critical patent/CN113793598B/en
Publication of CN113793598A publication Critical patent/CN113793598A/en
Application granted granted Critical
Publication of CN113793598B publication Critical patent/CN113793598B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10L Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0631 Creating reference templates; Clustering
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/12 Extracted parameters being prediction coefficients
    • G10L 25/24 Extracted parameters being the cepstrum
    • G06N Computing arrangements based on specific computational models
    • G06N 3/045 Combinations of networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

The disclosure provides a training method for a speech processing model, as well as a data enhancement method, apparatus, device and medium, and relates to the field of artificial intelligence, in particular to the technical fields of speech recognition, speech synthesis and deep learning. The training method of the speech processing model is implemented as follows: determining a first phoneme feature of first speech data based on a first acoustic feature of the first speech data; obtaining a first prosodic feature of the first speech data using a prosody coding network of the speech processing model, based on the first acoustic feature, the first speech recognition feature of the first speech data, and the first phoneme feature; obtaining a predicted acoustic feature using a decoding network of the speech processing model, based on the first acoustic feature, the first speech recognition feature, and the first prosodic feature; and training the speech processing model based on the difference between the predicted acoustic feature and the first acoustic feature.

Description

Training method of voice processing model, data enhancement method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and further relates to the technical fields of speech recognition, speech synthesis, and deep learning, and in particular, to a training method for a speech processing model, and a data enhancement method, apparatus, and device.
Background
With the development of artificial intelligence, speech recognition has become an important technical foundation for scenarios such as intelligent search and intelligent voice assistants. To improve recognition accuracy, speech recognition technology based on deep learning has developed rapidly. However, improving the accuracy of deep-learning-based speech recognition relies on large amounts of labeled data.
Disclosure of Invention
In view of the above, the present disclosure provides a training method for a speech processing model, as well as a data enhancement method, apparatus, device and medium, which facilitate augmenting speech data and make voice recording more engaging.
According to one aspect of the present disclosure, there is provided a method of training a speech processing model, the speech processing model comprising a prosody coding network and a decoding network. The method comprises: determining a first phoneme feature of first speech data based on a first acoustic feature of the first speech data; obtaining a first prosodic feature of the first speech data using the prosody coding network, based on the first acoustic feature, the first speech recognition feature of the first speech data, and the first phoneme feature; obtaining a predicted acoustic feature using the decoding network, based on the first acoustic feature, the first speech recognition feature, and the first prosodic feature; and training the speech processing model based on the difference between the predicted acoustic feature and the first acoustic feature.
According to another aspect of the present disclosure, there is provided a method of enhancing data, comprising: determining a second phoneme feature of third speech data based on a second acoustic feature of the third speech data; obtaining a second prosodic feature of the third speech data using the prosody coding network of the speech processing model, based on the second acoustic feature, the second speech recognition feature of the third speech data, and the second phoneme feature; obtaining a target acoustic feature using the decoding network of the speech processing model, based on the second acoustic feature, a target speech recognition feature, and the second prosodic feature; and obtaining fourth speech data having the target speech recognition feature based on the target acoustic feature, wherein the speech processing model is trained with the above training method, and the target speech recognition feature is the speech recognition feature of an object other than the object to which the third speech data relates.
According to another aspect of the present disclosure, there is provided a training apparatus for a speech processing model, wherein the speech processing model comprises a prosody coding network and a decoding network. The apparatus includes: a first phoneme feature determination module for determining a first phoneme feature of first speech data based on a first acoustic feature of the first speech data; a first prosodic feature obtaining module for obtaining a first prosodic feature of the first speech data using the prosody coding network, based on the first acoustic feature, the first speech recognition feature of the first speech data, and the first phoneme feature; a first acoustic feature obtaining module for obtaining a predicted acoustic feature using the decoding network, based on the first acoustic feature, the first speech recognition feature, and the first prosodic feature; and a model training module for training the speech processing model based on the difference between the predicted acoustic feature and the first acoustic feature.
According to another aspect of the present disclosure, there is provided an apparatus for enhancing data, comprising: a second phoneme feature determination module for determining a second phoneme feature of third speech data based on a second acoustic feature of the third speech data; a second prosodic feature obtaining module for obtaining a second prosodic feature of the third speech data using the prosody coding network of the speech processing model, based on the second acoustic feature, the second speech recognition feature of the third speech data, and the second phoneme feature; a second acoustic feature obtaining module for obtaining a target acoustic feature using the decoding network of the speech processing model, based on the second acoustic feature, a target speech recognition feature, and the second prosodic feature; and a speech data obtaining module for obtaining fourth speech data having the target speech recognition feature based on the target acoustic feature, wherein the speech processing model is trained with the above training apparatus, and the target speech recognition feature is the speech recognition feature of an object other than the object to which the third speech data relates.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of training a speech processing model and/or a method of enhancing data provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method of training a speech processing model and/or a method of enhancing data provided by the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the training method of the speech processing model and/or the method of enhancing data provided by the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an application scenario of a method for training a speech processing model and a method and apparatus for enhancing data according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow diagram of a method of training a speech processing model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a structure of a speech processing model according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a prosody coding network according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a method of training a speech processing model according to an embodiment of the present disclosure;
FIG. 6 is a flow diagram of a method of enhancing data according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an apparatus for training speech processing models according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of an apparatus for enhancing data according to an embodiment of the present disclosure; and
FIG. 9 is a block diagram of an electronic device for implementing a method of training speech processing models and/or a method of enhancing data according to embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides a training method of a speech processing model, wherein the speech processing model comprises a prosody coding network and a decoding network. The method comprises a phoneme feature determining stage, a prosodic feature obtaining stage, an acoustic feature obtaining stage and a model training stage. In a phoneme feature determination stage, a first phoneme feature of the first speech data is determined based on a first acoustic feature of the first speech data. In the prosodic feature obtaining stage, a prosodic coding network is adopted to obtain a first prosodic feature of the first voice data based on the first acoustic feature, the first voice recognition feature of the first voice data and the first phoneme feature. In the acoustic feature obtaining stage, a decoding network is adopted to obtain a predicted acoustic feature based on the first acoustic feature, the first voice recognition feature and the first prosodic feature. In a model training phase, a speech processing model is trained based on a difference between the predicted acoustic feature and the first acoustic feature.
An application scenario of the method and apparatus provided by the present disclosure will be described below with reference to fig. 1.
FIG. 1 is a schematic diagram of an application scenario of a training method of a speech processing model and a method and an apparatus for enhancing data according to an embodiment of the present disclosure.
As shown in fig. 1, the scenario 100 of this embodiment includes an electronic device 110 and a database 120.
Wherein the electronic device 110 may access the database 120, for example, over a network. The database 120 may store a plurality of audio data, which may include voice data obtained by recording a subject's speech. In one embodiment, the voice data may carry a tag that indicates the text corresponding to the speech in the voice data and may also indicate the object that produced the speech.
In an embodiment, the electronic device 110 may read the voice data from the database 120 and perform object transformation on it to obtain voice data with the same content but a different object. The electronic device 110 may also store the converted voice data in the database, for example, to improve the diversity of the voice data stored there. For example, the electronic device 110 may employ a speech processing model to perform the object transformation.
For example, the electronic device 110 may also process the voice data by adding noise to the read voice data, changing the rate of the read voice data, performing spectral enhancement on the read voice data, and the like, and store the processed data in a database. By this means, the diversity of the voice data stored in the database can be further improved.
In an embodiment, the electronic device 110 may further read the tagged speech data from the database 120, and train the speech recognition model using the read speech data as a sample. The voice recognition model is used for processing the voice data to obtain a text corresponding to the voice in the voice data. The resulting text is then compared with the text indicated by the labels, and the speech recognition model is trained based on the comparison.
In an embodiment, the application scenario 100 may further include a terminal device 130, and the terminal device 130 is communicatively connected to the electronic device 110 through a network. For example, the terminal device 130 may obtain the trained speech recognition model 140 from the electronic device 110, and process the speech data 150 collected in real time based on the obtained speech recognition model 140 to recognize the speech uttered by the object in real time. The terminal device 130 may also provide services to the object based on the recognition result 160 obtained by speech recognition, for example.
It should be noted that the training method and/or the data enhancement method of the speech processing model provided by the embodiment of the present disclosure may be generally executed by the electronic device 110, or may be executed by a server or the like communicatively connected to the electronic device 110. Accordingly, the training device and/or the data enhancement device of the speech processing model provided by the embodiment of the present disclosure may be generally disposed in the electronic device 110, or may also be disposed in a server or the like communicatively connected to the electronic device 110.
It should be understood that the number and types of electronic devices, databases, and terminal devices in fig. 1 are merely illustrative. There may be any number and type of electronic devices, databases, and terminal devices, as desired for implementation.
The training method of the speech processing model provided by the present disclosure will be described in detail below with reference to fig. 1 through fig. 2 to 5 below.
FIG. 2 is a flow chart diagram of a method of training a speech processing model according to an embodiment of the present disclosure.
As shown in FIG. 2, the training method 200 of the speech processing model of this embodiment may include operations S210-S240.
In operation S210, a first phoneme feature of the first speech data is determined based on a first acoustic feature of the first speech data.
According to an embodiment of the present disclosure, the acoustic feature of the speech data may be, for example, a Mel Frequency Cepstrum Coefficient (MFCC) feature, a Perceptual Linear Prediction (PLP) feature, a FilterBank (Fbank) feature, or the like. MFCC features can be obtained by applying a discrete cosine transform to the filter bank features.
For example, the first acoustic feature may be used as an input of a phoneme recognition model, and the first phoneme feature may be obtained after processing by the phoneme recognition model. Wherein, the phoneme recognition model can be constructed based on a time-delay bidirectional long-short term memory network.
For example, a portion of the speech data may be randomly extracted from a database storing audio. The extracted speech data may then be preprocessed, for example by removing noise (including ambient noise, busy tones, ring-back tones, etc.) and by enhancing it with existing data enhancement methods, such as changing the speech rate, mixing in echoes, time-domain warping and/or frequency-domain masking. The preprocessed speech data is taken as the first speech data. The first speech data may be divided into frames, and feature extraction may then be performed on each frame to obtain the MFCC features of each frame; the MFCC features of the frames obtained by framing constitute the first acoustic feature in the form of a sequence. In the framing process, the frame length may be, for example, 25 ms and the frame shift 10 ms, which is not limited in this disclosure.
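As an illustrative, non-limiting sketch, the framing and MFCC extraction described above could be implemented as follows with the librosa library; the 16 kHz sample rate and the number of cepstral coefficients are assumptions, not values fixed by this disclosure.

```python
# Minimal sketch of MFCC extraction with a 25 ms frame length and 10 ms frame shift.
import librosa
import numpy as np

def extract_mfcc_sequence(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return an (n_frames, n_mfcc) MFCC sequence for one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)   # assumed 16 kHz sample rate
    mfcc = librosa.feature.mfcc(
        y=y,
        sr=sr,
        n_mfcc=n_mfcc,                         # assumed number of coefficients
        n_fft=int(0.025 * sr),                 # 25 ms frame length
        hop_length=int(0.010 * sr),            # 10 ms frame shift
    )
    return mfcc.T  # one MFCC vector per frame; the sequence forms the first acoustic feature
```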
In operation S220, a first prosody feature of the first speech data is obtained using a prosody coding network based on the first acoustic feature, the first speech recognition feature of the first speech data, and the first phoneme feature.
According to an embodiment of the present disclosure, the first acoustic feature may be input into the object recognition model, and after processing by the object recognition model, the feature data fed into its fully connected layer may be used as the first speech recognition feature. The object recognition model may be, for example, an object encoder based on pooled vectors. The object encoder includes multiple time-delay neural network layers, the penultimate of which is a global pooling layer. In this embodiment, the feature data output by the global pooling layer can be used as the speech recognition feature that uniquely characterizes the object to which the first speech data relates.
According to an embodiment of the disclosure, the speech processing model includes a prosody coding network. In this embodiment, the first acoustic feature, the first speech recognition feature and the first phoneme feature may be concatenated and input to the prosody coding network, which outputs the first prosodic feature after processing them. The prosody coding network may be constructed based on a long short-term memory (LSTM) network architecture or an attention network (e.g., Transformer) architecture. Prosodic features are representations of an entire speech segment, such as syllable emphasis, intonation pattern, speaking rate and rhythm.
In operation S230, a predicted acoustic feature is obtained using a decoding network based on the first acoustic feature, the first speech recognition feature, and the first prosodic feature.
According to an embodiment of the present disclosure, the speech processing model may further include a decoding network. In this embodiment, the first acoustic feature, the first speech recognition feature and the first prosodic feature may be concatenated and input into the decoding network, which outputs the predicted acoustic feature after processing them. The decoding network may be constructed based on a Convolutional Neural Network (CNN) and a Bidirectional Gated Recurrent Unit (BGRU).
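As an illustrative sketch (not the exact network of this disclosure), a CNN + BGRU decoding network of this kind could be written in PyTorch roughly as follows; the layer sizes, kernel sizes and the 80-dimensional acoustic feature are assumptions.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of a CNN + bidirectional-GRU decoding network (sizes are assumptions)."""

    def __init__(self, in_dim: int, hidden: int = 256, acoustic_dim: int = 80):
        # in_dim = acoustic dim + speech recognition dim + prosody dim after splicing
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.bgru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, acoustic_dim)  # predicted acoustic feature per frame

    def forward(self, acoustic, speaker, prosody):
        # acoustic: (B, T, Da); speaker: (B, Ds); prosody: (B, Dp)
        T = acoustic.size(1)
        speaker = speaker.unsqueeze(1).expand(-1, T, -1)
        prosody = prosody.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([acoustic, speaker, prosody], dim=-1)   # splice the three features
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)      # Conv1d expects (B, C, T)
        x, _ = self.bgru(x)
        return self.proj(x)
```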
In operation S240, a speech processing model is trained based on a difference between the predicted acoustic feature and the first acoustic feature.
According to embodiments of the present disclosure, a loss of the speech processing model may be determined from a difference between the predicted acoustic feature and the first acoustic feature. The speech processing model is trained by a back propagation algorithm to minimize the loss of the speech processing model.
It is understood that the operations S220 and S230 are two processes of separating the speech recognition feature from the prosodic feature and then fusing the speech recognition feature with the prosodic feature. By comparing the difference between the predicted acoustic feature obtained by fusion and the first acoustic feature of the first speech data, it can be determined whether the prosodic feature obtained by separation is accurate. If the separated prosodic feature is accurate, after the first speech recognition feature and the first prosodic feature of the first speech data are simultaneously input into the decoding network, the difference between the predicted acoustic feature decoded by the decoding network and the first acoustic feature should be small.
In summary, the above training method can improve the accuracy of the speech processing model. As a result, if a speech recognition feature other than the first speech recognition feature is input into the decoding network in place of the first speech recognition feature, object transformation of the speech data can be realized accurately. For example, the speech processing model can be used to augment speech data along the object attribute dimension. It can also perform object transformation on speech data collected in real time, providing a voice-changing function for users and making the voice recording function of a terminal device more engaging.
FIG. 3 is a schematic diagram of a structure of a speech processing model according to an embodiment of the present disclosure.
According to an embodiment of the disclosure, when the phoneme feature is determined, the phonemes and the acoustic features can be forcibly aligned, so as to prevent the speech data generated from the predicted acoustic features from having missing or dropped sounds.
Illustratively, as shown in fig. 3, the speech processing model 300 of this embodiment may include a phoneme alignment network 330 and a phoneme encoding network 340 in addition to the prosody encoding network 310 and the decoding network 320.
The phoneme alignment network 330 is similar to a speech recognition network and may be constructed based on a Long Short-Term Memory network with Connectionist Temporal Classification (LSTM-CTC) architecture, a Chain Model architecture, or the like. After the first acoustic feature is obtained, the first acoustic feature 301 may be input into the phoneme alignment network 330 to obtain a first phoneme sequence for the first speech data. For example, the first acoustic feature 301 is the sequence {M1, M2, ..., Mn} of MFCC features for the n audio frames obtained by framing. Through the processing of the phoneme alignment network 330, the phoneme corresponding to each MFCC feature can be obtained; the n phonemes corresponding to the n MFCC features constitute the first phoneme sequence. Through the phoneme alignment network 330, the MFCC features can be forcibly aligned with phonemes, and this forced alignment allows the resulting phoneme sequence to characterize the number of sustained frames for each phoneme.
For example, if the phoneme alignment network 330 is configured based on the LSTM-CTC architecture, the first phoneme sequence is obtained based on the phoneme corresponding to the largest element of each of the n probability vectors output by the LSTM-CTC architecture. If the phone alignment network 330 is constructed based on the Chain Model architecture, after the corresponding phone is obtained based on the probability vector output by the Chain Model architecture, the phone needs to be post-processed. For example, if the Chain Model decodes the MFCC features of one frame every several frames, the corresponding phonemes obtained based on the probability vectors need to be supplemented according to the number of frame skipping. If two corresponding phonemes obtained based on two adjacent probability vectors output by the Chain Model are "b" and "a", and the Chain Model takes one frame every two frames, the phoneme sequence obtained after supplementation should include "bba" or "baa", etc. The rules for the supplementary phonemes may be set empirically, and this disclosure is not limited thereto.
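For illustration only, one possible supplementation rule is to repeat each decoded phoneme by the frame-skipping factor; the function below is a hypothetical sketch of that rule, and, as noted above, the actual rule may be set empirically.

```python
def expand_skipped_phonemes(phonemes, skip_factor=2):
    """Repeat each decoded phoneme `skip_factor` times so the phoneme sequence is
    frame-aligned again, e.g. ["b", "a"] with skip_factor=2 -> ["b", "b", "a", "a"].
    Uniform repetition is only one possible supplementation rule."""
    expanded = []
    for phoneme in phonemes:
        expanded.extend([phoneme] * skip_factor)
    return expanded
```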
The phoneme coding network 340 may adopt a joint architecture of CNN and RNN, so as to comprehensively consider context information of phonemes while performing feature extraction on the phonemes, thereby improving accuracy of obtained phoneme features.
For example, after obtaining the phoneme sequence, the first phoneme sequence may be mapped to a phoneme feature space to obtain a feature matrix 302 representing the first phoneme sequence. The aforementioned mapping process may be implemented, for example, using a fully connected layer. After the feature matrix is obtained, the feature matrix 302 may be directly input into a phoneme coding network, and the phoneme coding network codes the feature matrix to obtain the first phoneme feature.
For example, the feature matrix 302 and the first speech recognition feature 303 of the first speech data may be simultaneously input to the phoneme encoding network 340 such that the encoded first phoneme feature has an object feature of the first speech data. For example, the feature matrix 302 may be spliced with the first speech recognition feature 303 and input to the phoneme encoding network 340. Therefore, when the prosody feature is obtained based on the first phoneme feature, the object feature can be provided for the prosody coding network, and the object feature can be eliminated better by the prosody coding network.
In one embodiment, the phoneme coding network 340 may be formed by sequentially connecting a plurality of CNNs, a max pooling layer, an activation layer, and a bidirectional long short-term memory network. The activation layer may be constructed based on the ReLU activation function. It is to be understood that this structure of the phoneme coding network is merely an example to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
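A rough PyTorch sketch of such a phoneme coding network is given below; the number of convolution layers and the hidden sizes are assumptions, and the input dimension is that of the phoneme feature matrix spliced with the speech recognition feature.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Sketch of the CNN + max pooling + ReLU + BiLSTM phoneme coding network
    (layer counts and sizes are assumptions)."""

    def __init__(self, in_dim: int, hidden: int = 256):
        # in_dim = phoneme feature dim + speech recognition feature dim after splicing
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.MaxPool1d(kernel_size=2),   # halves the time axis
            nn.ReLU(),
        )
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, phoneme_matrix, speaker):
        # phoneme_matrix: (B, T, Dp); speaker: (B, Ds), broadcast along time and spliced in
        T = phoneme_matrix.size(1)
        x = torch.cat([phoneme_matrix, speaker.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.blstm(x)
        return out  # first phoneme feature
```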
After the first phoneme feature is obtained, it may be concatenated with the first speech recognition feature 303 and the first acoustic feature 301 and input to the prosody coding network 310, which outputs the first prosodic feature. The first prosodic feature, the first speech recognition feature 303 and the first acoustic feature 301 are then concatenated and input into the decoding network 320, which outputs the predicted acoustic feature {M1', M2', ..., Mn'} 304.
Fig. 4 is a schematic structural diagram of a prosody coding network according to an embodiment of the present disclosure.
According to the embodiment of the disclosure, the prosodic features may be extracted via a plurality of channels to improve the integrity of the extracted prosodic features. When extracting features from multiple channels, the extracted features may also be normalized to eliminate features of objects in the extracted features as much as possible.
As shown in fig. 4, in this embodiment 400, the prosody coding network may include a feature extraction sub-network 411, a normalization sub-network 412, and a coding sub-network 413 connected in sequence.
The feature extraction sub-network 411 may be composed of, for example, a plurality of convolution units Conv1 to Convp arranged in parallel, where p is a natural number greater than 1. The first acoustic feature 401, the first speech recognition feature 402 and the first phoneme feature 403 may be concatenated and input into the feature extraction sub-network 411, which outputs a plurality of feature data. For example, the three features may be concatenated and input to the convolution units Conv1 to Convp, with each convolution unit outputting one piece of feature data, so that a plurality of feature data are obtained in total.
The normalization sub-network may be, for example, a Norm layer configured to normalize the plurality of feature data so as to remove the object feature from them. For example, the plurality of feature data may be input into the normalization sub-network 412, which removes the data characterizing the first speech recognition feature 402 and obtains the target feature data. Normalization may be performed by computing a mean and a variance for each feature data and then regularizing each feature data accordingly: the mean is subtracted from every element of the feature data, and the result is divided by the standard deviation (the square root of the variance) to obtain the normalized feature data, which serves as one piece of target feature data. This regularization eliminates the object feature as much as possible while retaining the prosodic feature.
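The regularization step can be sketched as follows; this is a minimal illustration assuming feature data shaped as batch × channels × time.

```python
import torch

def normalize_feature(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Per-utterance, per-channel normalization: subtract the mean and divide by the
    standard deviation along the time axis, which suppresses speaker-specific (object)
    traits while keeping frame-to-frame prosodic variation. x: (B, C, T)."""
    mean = x.mean(dim=-1, keepdim=True)
    std = x.var(dim=-1, keepdim=True).add(eps).sqrt()
    return (x - mean) / std
```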
The coding subnetwork 413 may employ a recurrent neural network or an attention model, among others. After the target feature data is obtained, the target feature data is sequentially input to the coding sub-network 413, and the first prosodic feature can be obtained based on the output of the coding sub-network 413.
For example, if the coding sub-network 413 employs a bidirectional long short-term memory network or a Transformer network, the output of the coding sub-network 413 may be used as the first prosodic feature. The coding sub-network 413 may also employ, for example, a BGRU, thereby improving the accuracy of the obtained first prosodic feature.
For example, the encoding sub-network 413 may further include a Variational Auto Encoder (VAE) to obtain the distribution of the prosodic features through the Variational encoder. For example, in one embodiment, coding subnetwork 413 is comprised of a BGRU and a VAE. After the target feature data is obtained, the target feature data may be input to the BGRU, and the BGRU may output the intermediate feature. And then inputting the intermediate features into VAE, and obtaining the mean value and the variance of prosodic feature distribution after VAE processing. Based on the mean and variance, a prosodic feature distribution 404 is obtained, and the prosodic feature distribution 404 is randomly sampled to obtain a first prosodic feature 405. This embodiment sets the VAE to obtain the first prosodic feature because the prosodic feature distribution can be approximated to a gaussian distribution according to a priori experience. By setting the VAE, the prosodic feature distribution can better approach the target probability distribution, so that the richness of a feature space is enhanced, and the condition that extracted prosodic features are inaccurate due to feature dispersion is avoided. And therefore the accuracy of the resulting first prosodic feature can be effectively improved.
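A minimal sketch of such a VAE head is given below, assuming the BGRU output has been summarized into a single vector per utterance; the prosodic feature dimension and the use of a log-variance parameterization are assumptions.

```python
import torch
import torch.nn as nn

class ProsodyVAEHead(nn.Module):
    """Sketch of the VAE: predict the mean and (log-)variance of the prosodic feature
    distribution and draw a random sample from it (dimensions are assumptions)."""

    def __init__(self, in_dim: int, prosody_dim: int = 64):
        super().__init__()
        self.to_mean = nn.Linear(in_dim, prosody_dim)
        self.to_logvar = nn.Linear(in_dim, prosody_dim)

    def forward(self, bgru_summary: torch.Tensor):
        mean = self.to_mean(bgru_summary)
        logvar = self.to_logvar(bgru_summary)
        std = torch.exp(0.5 * logvar)
        prosody = mean + std * torch.randn_like(std)   # random sample from N(mean, var)
        return prosody, mean, logvar
```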
According to the embodiment of the disclosure, in the case of setting the VAE, the VAE and the networks before the VAE can be trained according to the difference between the prosody distribution determined by the mean and the variance obtained by the VAE and the predetermined feature distribution, so that the prosody distribution can be closer to the target probability distribution, and the accuracy of the obtained prosody feature is further improved. The target probability distribution may be, for example, a normal distribution N (0, 1).
In this manner, when training the speech processing model, a first loss of the speech processing model may be determined based on the prosodic feature distribution and the predetermined feature distribution. A second loss of the speech processing model is then determined based on a difference between the predicted acoustic feature and the first acoustic feature. And finally, training the voice processing model based on the first loss and the second loss. For example, the first loss may be expressed by a KL divergence, and the second loss may be expressed by a Smooth-L1 loss. It is to be understood that the above-described methods of representing the first loss and the second loss are merely examples to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
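Under these assumptions, the two losses can be sketched as follows; the weighting between them is an assumption and is not fixed by this disclosure.

```python
import torch
import torch.nn.functional as F

def speech_model_loss(pred_acoustic, target_acoustic, mean, logvar, kl_weight: float = 1.0):
    """First loss: KL(N(mean, var) || N(0, 1)) on the prosodic feature distribution.
    Second loss: Smooth-L1 between the predicted and the first acoustic features."""
    kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
    recon = F.smooth_l1_loss(pred_acoustic, target_acoustic)
    return recon + kl_weight * kl
```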
For example, the entire speech processing model may be trained based on the first loss and the second loss. The decoding network may also be trained based on the second loss, and the prosodic coding network may be trained based on the first loss and the second loss. In the training process, a back propagation algorithm may be used to train the decoding network and the prosody coding network.
FIG. 5 is a schematic diagram illustrating a method for training a speech processing model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the speech processing model may further comprise an object recognition network for extracting the first speech recognition feature from the first acoustic feature. Specifically, the first acoustic feature may be input into the object recognition network, and the first speech recognition feature is output by the layer of the object recognition network adjacent to the output layer. The object recognition network may be constructed based on the object recognition model described above, which is not limited in this disclosure.
After the first speech recognition feature is obtained, the entire speech processing model may be trained using operations S210 through S240 described above.
In an embodiment, the object recognition network may be pre-trained based on second speech data in a predetermined audio set and first labeling information for the second speech data. The first speech recognition feature is then obtained based on the pre-trained object recognition network, and the whole speech processing model is trained. The predetermined audio set may be any open-source audio set, for example AISHELL or LibriSpeech, while the audio set containing the first speech data is the data set used for training the speech recognition model. Pre-training the object recognition network in this way gives the speech processing model higher stability and generalization capability.
For example, the first labeling information may be a tag indicating object probability information for the second speech data. It may include the actual object of the speech in the second speech data: if the actual object is speaker A, the probability for speaker A in the object probability information is 1, and the probability for speakers other than speaker A is 0. The object recognition model outputs, for the speech in the second speech data, the probability that its object is each of a plurality of predetermined objects. A cross-entropy loss function may be employed to determine the loss of the object recognition model from the actual object and the probabilities output by the model, and the object recognition network converges after multiple iterations.
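A minimal sketch of one such pre-training step is shown below; it assumes the object recognition network returns both per-speaker logits and the bottleneck embedding, matching the sketch given later for the object recognition network.

```python
import torch
import torch.nn.functional as F

def pretrain_step(object_net, optimizer, acoustic_batch, speaker_ids):
    """One pre-training step: the network scores each predetermined speaker, and the
    cross-entropy against the actual speaker index is minimized."""
    logits, _embedding = object_net(acoustic_batch)   # assumed (logits, bottleneck) output
    loss = F.cross_entropy(logits, speaker_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```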
According to an embodiment of the present disclosure, as shown in fig. 5, in an embodiment 500, a speech processing model may include an object recognition network 510, a phoneme alignment network 520, a phoneme encoding network 530, a prosody encoding network 540, and a decoding network 550.
The object recognition network 510 may include a convolutional layer, a gated recurrent unit, a bottleneck layer, and a fully connected layer connected in sequence. In this embodiment, the first acoustic feature 501 is input into the convolutional layer of the object recognition network 510, and after the convolutional layer, the gated recurrent unit, the bottleneck layer, and the fully connected layer process it in turn, the fully connected layer outputs the probability that the object of the speech in the first speech data is each of a plurality of predetermined objects. The data output by the bottleneck layer is the first speech recognition feature of the first speech data.
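For illustration, such an object recognition network might be sketched in PyTorch as follows; the layer sizes, the mean pooling over time, and the number of predetermined speakers are assumptions.

```python
import torch
import torch.nn as nn

class ObjectRecognitionNet(nn.Module):
    """Sketch of the convolution + GRU + bottleneck + fully-connected object recognition
    network; the bottleneck output serves as the speech recognition feature."""

    def __init__(self, acoustic_dim: int = 80, hidden: int = 256,
                 bottleneck_dim: int = 128, n_speakers: int = 1000):
        super().__init__()
        self.conv = nn.Conv1d(acoustic_dim, hidden, kernel_size=5, padding=2)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.bottleneck = nn.Linear(hidden, bottleneck_dim)
        self.classifier = nn.Linear(bottleneck_dim, n_speakers)

    def forward(self, acoustic):                     # acoustic: (B, T, acoustic_dim)
        x = self.conv(acoustic.transpose(1, 2)).transpose(1, 2)
        x, _ = self.gru(x)
        embedding = self.bottleneck(x.mean(dim=1))   # utterance-level speech recognition feature
        logits = self.classifier(embedding)          # per-speaker scores; softmax gives probabilities
        return logits, embedding
```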
According to an embodiment of the present disclosure, as shown in fig. 5, in the case that the speech processing model further includes a phoneme alignment network 520 and a phoneme encoding network 530, the first acoustic feature 501 is input into the phoneme alignment network 520, and a first phoneme sequence may be obtained. The first phoneme sequence and the first speech recognition feature output from the bottleneck layer are spliced and input to the phoneme encoding network 530, and after being processed by the phoneme encoding network 530, the first phoneme feature can be output from the phoneme encoding network 530.
Illustratively, where the speech processing model includes a phoneme alignment network 520, the embodiment may also pre-train the phoneme alignment network based on second speech data in the predetermined set of audio and second annotation information for the second speech data. In the training process, the acoustic features of the second speech data (i.e., the MFCC features of the multiple audio frames framed by the second speech data) are input to the phoneme alignment network 520 to obtain a predicted phoneme sequence. Based on the difference between the predicted phoneme sequence and the second phoneme sequence, a loss of the phoneme alignment network 520 may be determined using cross-entropy loss. The phoneme alignment network may then be made to converge through a number of iterations based on the loss. Therefore, the phoneme alignment network is trained in advance, so that the speech processing model has high stability and generalization capability.
For example, the second annotation information indicates a second phoneme sequence for the second speech data, and the second phoneme sequence indicates the duration of each phoneme included in the second speech data. The predetermined audio set may be the open-source data set described above. The second phoneme sequence is similar to the first phoneme sequence described above and may be constructed from the phonemes that actually correspond to the MFCC features of the audio frames obtained by framing the second speech data.
According to an embodiment of the present disclosure, as shown in fig. 5, in the case where the prosody coding network 540 includes a feature extraction sub-network, a normalization sub-network, a BGRU, and a VAE connected in sequence, the first acoustic feature 501, the first phoneme feature, and the first speech recognition feature output from the bottleneck layer may be spliced and input to the feature extraction sub-network, and after being sequentially processed through the feature extraction sub-network, the normalization sub-network, the BGRU, and the VAE, the variance and the mean of the prosody feature distribution may be output from the VAE. The first prosodic feature may be derived by randomly sampling from a distribution of prosodic features determined based on the variance and the mean.
After the first prosodic feature is obtained, the first prosodic feature, the first acoustic feature, and the first speech recognition feature may be concatenated and input to the decoding network 550, and the predicted acoustic feature 502 may be output after being processed by the decoding network 550.
Based on the speech processing model obtained by training the training method of the speech processing model provided by the present disclosure, the present disclosure also provides a method for enhancing data, which will be described in detail below with reference to fig. 6.
Fig. 6 is a flow diagram of a method of enhancing data according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the method 600 of enhancing data may include operations S610 to S640. The speech processing model may be trained using the training method described above.
In operation S610, a second phoneme feature of the third voice data is determined based on a second acoustic feature of the third voice data.
According to an embodiment of the present disclosure, the third speech data is similar to the first speech data described above, and the second acoustic feature is similar to the first acoustic feature described above. The method for obtaining the second phone feature is similar to the method for obtaining the first phone feature, and is not described herein again.
In operation S620, a second prosody feature of the third speech data is obtained using a prosody coding network of the speech processing model based on the second acoustic feature, the second speech recognition feature of the third speech data, and the second phoneme feature.
According to an embodiment of the present disclosure, the second speech recognition feature is similar to the first speech recognition feature described previously. The method of obtaining the second prosodic feature is similar to the method of obtaining the first prosodic feature described above, and is not described herein again.
In operation S630, a target acoustic feature is obtained using a decoding network of the speech processing model based on the second acoustic feature, the target speech recognition feature, and the second prosodic feature.
Wherein the target voice recognition feature is a voice recognition feature of an object other than the object to which the third voice data relates. For example, if the third speech data is obtained by recording the speech of the speaker a, the target speech recognition feature may be obtained by processing the speech data obtained by recording the speech of the speaker B. The method for processing to obtain the target speech recognition feature is similar to the method for processing the third speech data to obtain the second speech recognition feature, and is not described herein again.
In operation S640, fourth voice data having the target voice recognition feature is obtained based on the target acoustic feature.
According to an embodiment of the present disclosure, a vocoder may be used to upsample a target acoustic feature to synthesize speech data having a target object style corresponding to the target speech recognition feature.
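Where the acoustic features are MFCCs, a simple non-neural stand-in for the vocoder step is librosa's Griffin-Lim based inversion, sketched below; a trained neural vocoder, as described above, would normally be used instead, and the sample rate and hop length are assumptions.

```python
import librosa
import numpy as np
import soundfile as sf

def mfcc_to_wav(target_mfcc: np.ndarray, out_path: str, sr: int = 16000) -> None:
    """Rough stand-in for the vocoder: invert an (n_mfcc, n_frames) MFCC matrix to a
    waveform via librosa's Griffin-Lim based reconstruction, then write it to disk."""
    audio = librosa.feature.inverse.mfcc_to_audio(
        target_mfcc,
        sr=sr,                           # assumed sample rate
        hop_length=int(0.010 * sr),      # assumed 10 ms frame shift
    )
    sf.write(out_path, audio, sr)
```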
For example, when the third speech data is a training sample of the speech recognition model, the number of training samples can be increased from the object feature dimension by using the data enhancement method of this embodiment. Thus, when a speech recognition model is trained based on training samples, the accuracy and generalization ability of the trained speech recognition model can be improved.
According to an embodiment of the present disclosure, when the third speech data is a training sample of a speech recognition model and the third speech data has a first tag indicating a text corresponding to speech and a second tag indicating an object corresponding to speech, the first tag may be assigned to fourth speech data obtained based on the third speech data, and a third tag indicating an object corresponding to a target speech recognition feature may be added to the fourth speech data. Thus, the fourth voice data with the label added can be used as a training sample of the voice recognition model.
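As a small illustration of this labeling step, the sketch below copies the text label from the source sample and attaches the target object as the new label; the dictionary layout is an assumption.

```python
def make_augmented_sample(source_sample: dict, fourth_audio, target_speaker_id: str) -> dict:
    """Build a training sample for the fourth speech data: keep the first tag (text) of
    the third speech data and add a third tag for the target object."""
    return {
        "audio": fourth_audio,                  # synthesized fourth speech data
        "text": source_sample["text"],          # same spoken content as the third speech data
        "speaker": target_speaker_id,           # object of the target speech recognition feature
    }
```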
Based on the training method of the voice processing model provided by the disclosure, the disclosure also provides a training device of the voice processing model. The apparatus will be described in detail below with reference to fig. 7.
FIG. 7 is a block diagram of a training apparatus for a speech processing model according to an embodiment of the present disclosure.
As shown in fig. 7, the training apparatus 700 for a speech processing model of this embodiment may include a first phoneme feature determining module 710, a first prosodic feature obtaining module 720, a first acoustic feature obtaining module 730, and a model training module 740. The voice processing model comprises a prosody coding network and a decoding network.
The first phoneme feature determination module 710 is configured to determine a first phoneme feature of the first speech data based on the first acoustic feature of the first speech data. In an embodiment, the first phoneme characteristic determining module 710 may be configured to perform the operation S210 described above, and will not be described herein again.
The first prosodic feature obtaining module 720 is configured to obtain a first prosodic feature of the first speech data by using a prosodic coding network based on the first acoustic feature, the first speech recognition feature of the first speech data, and the first phoneme feature. In an embodiment, the first prosodic feature obtaining module 720 may be configured to perform operation S220 described above, and is not described herein again.
The first acoustic feature obtaining module 730 is configured to obtain a predicted acoustic feature by using a decoding network based on the first acoustic feature, the first speech recognition feature, and the first prosodic feature. In an embodiment, the first acoustic feature obtaining module 730 may be configured to perform the operation S230 described above, and is not described herein again.
The model training module 740 is configured to train a speech processing model based on the difference between the predicted acoustic feature and the first acoustic feature. In an embodiment, the model training module 740 may be configured to perform the operation S240 described above, which is not described herein again.
According to an embodiment of the present disclosure, the speech processing model further includes a phoneme alignment network and a phoneme coding network. The first phoneme characteristic determining module 710 may include a phoneme sequence obtaining sub-module and a phoneme characteristic obtaining sub-module. The phoneme sequence obtaining submodule is used for inputting the first acoustic characteristics into the phoneme alignment network to obtain a first phoneme sequence aiming at the first voice data. The phoneme characteristic obtaining submodule is used for inputting a characteristic matrix for representing the first phoneme sequence and the first speech recognition characteristic into the phoneme coding network to obtain the first phoneme characteristic.
According to an embodiment of the present disclosure, the prosody coding network includes a feature extraction sub-network, a normalization sub-network, and a coding sub-network. The first prosodic feature obtaining module 720 may include a feature data obtaining sub-module, a target data obtaining sub-module, and a prosodic feature obtaining sub-module. The feature data obtaining sub-module is used for inputting the first acoustic feature, the first speech recognition feature and the first phoneme feature into the feature extraction sub-network to obtain a plurality of feature data. The target data obtaining sub-module is used for inputting the plurality of feature data into the normalization sub-network to obtain target feature data from which the data characterizing the first speech recognition feature has been removed. The prosodic feature obtaining sub-module is used for inputting the target feature data into the coding sub-network to obtain the first prosodic feature.
According to an embodiment of the present disclosure, the coding sub-network includes a bidirectional gated recurrent unit and a variational auto-encoder. The prosodic feature obtaining sub-module comprises an intermediate feature obtaining unit, a distribution determining unit and a prosodic feature obtaining unit. The intermediate feature obtaining unit is used for inputting the target feature data into the bidirectional gated recurrent unit to obtain an intermediate feature. The distribution determining unit is used for inputting the intermediate feature into the variational auto-encoder to obtain the mean and variance of the prosodic feature distribution. The prosodic feature obtaining unit is used for randomly sampling the prosodic feature distribution based on the mean and variance to obtain the first prosodic feature.
According to an embodiment of the present disclosure, the model training module 740 includes a first loss determination sub-module, a second loss determination sub-module, and a training sub-module. The first loss determination submodule is used for determining a first loss of the voice processing model based on the prosodic feature distribution and the preset feature distribution. The second loss determination sub-module is configured to determine a second loss of the speech processing model based on a difference between the predicted acoustic feature and the first acoustic feature. The training submodule is used for training the voice processing model based on the first loss and the second loss.
According to an embodiment of the present disclosure, the training submodule includes a first training unit and a second training unit. The first training unit is used for training the decoding network based on the second loss. The second training unit is used for training the prosody coding network based on the first loss and the second loss.
According to an embodiment of the present disclosure, the above-mentioned speech processing model further includes an object recognition network. The training apparatus 700 for the speech processing model further comprises a speech feature extraction module, configured to extract a first speech recognition feature from the first acoustic feature by: and inputting the first acoustic feature into an object recognition network to obtain a first voice recognition feature.
According to an embodiment of the present disclosure, the training apparatus 700 of the speech processing model further includes a first pre-training module, configured to pre-train the object recognition network based on second speech data in a predetermined audio set and first label information for the second speech data, where the first label information is used to indicate object probability information for the second speech data.
According to an embodiment of the present disclosure, the training apparatus 700 of the speech processing model further includes a second pre-training module, configured to pre-train the phoneme alignment network based on second speech data in a predetermined audio set and second labeling information for the second speech data, where the second labeling information indicates a second phoneme sequence for the second speech data, and the second phoneme sequence indicates the duration of each phoneme included in the second speech data.
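The pre-training objectives can be sketched as simple cross-entropy losses, under the assumption that the first annotation information provides hard object identity labels and that the annotated durations are expanded into frame-level phoneme targets; the helper names below are illustrative only.

```python
# Minimal pre-training sketch (assumptions throughout).
import torch
import torch.nn as nn

def pretrain_object_net(net, acoustic, object_ids):
    # net: e.g. the ObjectRecognitionNet sketch above; object_ids: (batch,) identity labels
    _, logits = net(acoustic)
    return nn.functional.cross_entropy(logits, object_ids)

def expand_durations(phonemes, durations):
    # e.g. phonemes=[3, 7], durations=[2, 3] -> frame-level labels [3, 3, 7, 7, 7]
    return torch.repeat_interleave(torch.as_tensor(phonemes), torch.as_tensor(durations))

def pretrain_alignment_net(frame_logits, phonemes, durations):
    # frame_logits: (frames, num_phonemes) output of the phoneme alignment network
    targets = expand_durations(phonemes, durations)
    return nn.functional.cross_entropy(frame_logits, targets)
```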
Based on the method of enhancing data provided by the present disclosure, the present disclosure further provides an apparatus for enhancing data. This apparatus will be described in detail below with reference to fig. 8.
Fig. 8 is a block diagram of an apparatus for enhancing data according to an embodiment of the present disclosure.
As shown in fig. 8, the apparatus 800 for enhancing data according to this embodiment may include a second phoneme feature determining module 810, a second prosodic feature obtaining module 820, a second acoustic feature obtaining module 830, and a speech data obtaining module 840.
The second phoneme feature determination module 810 is configured to determine a second phoneme feature of third speech data based on a second acoustic feature of the third speech data. In an embodiment, the second phoneme feature determining module 810 may be configured to perform the operation S610 described above, which is not described herein again.
The second prosodic feature obtaining module 820 is configured to obtain a second prosodic feature of the third speech data by using a prosodic coding network of the speech processing model based on the second acoustic feature, the second speech recognition feature of the third speech data, and the second phoneme feature. The speech processing model may be obtained by training with the training device of the speech processing model described above. In an embodiment, the second prosodic feature obtaining module 820 may be configured to perform the operation S620 described above, and is not described herein again.
The second acoustic feature obtaining module 830 is configured to obtain the target acoustic feature by using a decoding network of the speech processing model based on the second acoustic feature, the target speech recognition feature and the second prosodic feature. Wherein the target voice recognition feature is a voice recognition feature of an object other than the object to which the third voice data relates. In an embodiment, the second acoustic feature obtaining module 830 may be configured to perform the operation S630 described above, and is not described herein again.
The voice data obtaining module 840 is configured to obtain fourth voice data with a target voice recognition feature based on the target acoustic feature. In an embodiment, the voice data obtaining module 840 may be configured to perform the operation S640 described above, which is not described herein again.
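To tie the modules together, the end-to-end enhancement flow can be sketched as below. The attribute names on `speech_model` and the use of a `vocoder` to turn the target acoustic feature into waveform audio are assumptions introduced for illustration; the disclosure only requires that fourth voice data be obtained based on the target acoustic feature.

```python
# End-to-end enhancement sketch under the assumptions of the previous snippets.
# `speech_model.phoneme_features`, `.prosody_encoder` and `.decoder` are hypothetical
# handles to the trained sub-networks; `vocoder` is an assumed waveform generator.
import torch

def enhance(speech_model, vocoder, third_acoustic, third_speaker_feat, target_speaker_feat):
    # phoneme and prosodic features come from the third speech data
    phoneme_feat = speech_model.phoneme_features(third_acoustic, third_speaker_feat)
    prosody, _, _ = speech_model.prosody_encoder(third_acoustic, third_speaker_feat, phoneme_feat)
    # the speech recognition feature is swapped for that of another object
    target_acoustic = speech_model.decoder(third_acoustic, target_speaker_feat, prosody)
    # fourth voice data carrying the target speech recognition feature
    return vocoder(target_acoustic)
```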
In the technical solutions of the present disclosure, the acquisition, collection, storage, use, processing, transmission, provision, and disclosure of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that may be used to implement the training methods of speech processing models and/or the methods of enhancing data of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the various methods and processes described above, such as a training method of a speech processing model and/or a method of enhancing data. For example, in some embodiments, the training method of the speech processing model and/or the method of enhancing data may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 900 via ROM 902 and/or communications unit 909. When the computer program is loaded into RAM 903 and executed by computing unit 901, one or more steps of the method for training speech processing models and/or the method for enhancing data described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform a training method of a speech processing model and/or a method of enhancing data.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also known as a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (23)

1. A training method of a voice processing model, wherein the voice processing model comprises a prosody coding network and a decoding network; the method comprises the following steps:
determining a first phoneme feature of first speech data based on a first acoustic feature of the first speech data;
obtaining a first prosodic feature of the first voice data by adopting the prosodic coding network based on the first acoustic feature, the first voice recognition feature of the first voice data and the first phoneme feature;
obtaining a predicted acoustic feature by using the decoding network based on the first acoustic feature, the first speech recognition feature and the first prosodic feature; and
training the speech processing model based on a difference between the predicted acoustic feature and the first acoustic feature.
2. The method of claim 1, wherein the speech processing model further comprises a phoneme alignment network and a phoneme coding network; the determining a first phoneme feature of the first speech data based on a first acoustic feature of the first speech data comprises:
inputting the first acoustic feature into the phoneme alignment network to obtain a first phoneme sequence for the first speech data; and
inputting the feature matrix representing the first phoneme sequence and the first speech recognition feature into the phoneme coding network to obtain the first phoneme feature.
3. The method of claim 1, wherein the prosody coding network comprises a feature extraction subnetwork, a normalization subnetwork, and a coding subnetwork; obtaining a first prosodic feature of the first speech data using the prosodic coding network comprises:
inputting the first acoustic feature, the first speech recognition feature and the first phoneme feature into the feature extraction sub-network to obtain a plurality of feature data;
inputting the plurality of feature data into the normalized sub-network to obtain target feature data with feature data representing the first voice recognition feature removed; and
inputting the target feature data into the coding sub-network to obtain the first prosodic feature.
4. The method of claim 3, wherein the coding sub-network comprises a bidirectional gated recurrent unit and a variational autoencoder; the inputting the target feature data into the coding sub-network to obtain the first prosodic feature comprises:
inputting the target feature data into the bidirectional gated recurrent unit to obtain an intermediate feature;
inputting the intermediate feature into the variational autoencoder to obtain a mean value and a variance of a prosodic feature distribution; and
randomly sampling the prosodic feature distribution based on the mean value and the variance to obtain the first prosodic feature.
5. The method of claim 4, wherein training the speech processing model comprises:
determining a first loss of the speech processing model based on the prosodic feature distribution and a predetermined feature distribution;
determining a second loss of the speech processing model based on a difference between the predicted acoustic feature and the first acoustic feature; and
training the speech processing model based on the first loss and the second loss.
6. The method of claim 5, wherein training the speech processing model based on the first loss and the second loss comprises:
training the decoding network based on the second loss; and
training the prosodic coding network based on the first loss and the second loss.
7. The method of claim 1, wherein the speech processing model further comprises an object recognition network; the method further includes extracting the first speech recognition feature from the first acoustic feature by:
inputting the first acoustic feature into the object recognition network to obtain the first speech recognition feature.
8. The method of claim 7, further comprising:
pre-training the object recognition network based on second speech data in a predetermined set of audio and first annotation information for the second speech data,
wherein the first annotation information is used for indicating object probability information for the second speech data.
9. The method of claim 2, further comprising:
pre-training the phoneme alignment network based on second speech data in a predetermined set of audio and second labeling information for the second speech data,
wherein the second labeling information is used for indicating a second phoneme sequence for the second speech data, and the second phoneme sequence is used for indicating the duration of each phoneme included in the second speech data.
10. A method of enhancing data, comprising:
determining a second phoneme feature of third speech data based on a second acoustic feature of the third speech data;
obtaining a second prosodic feature of the third voice data by adopting a prosodic coding network of a voice processing model based on the second acoustic feature, a second voice recognition feature of the third voice data and the second phoneme feature;
based on the second acoustic feature, the target voice recognition feature and the second prosodic feature, obtaining a target acoustic feature by adopting a decoding network of the voice processing model; and
obtaining fourth voice data with the target voice recognition feature based on the target acoustic feature,
wherein the speech processing model is obtained by training by adopting the method of any one of claims 1-9; the target speech recognition feature is a speech recognition feature of an object other than the object to which the third speech data relates.
11. A training device of a voice processing model, wherein the voice processing model comprises a prosody coding network and a decoding network; the device comprises:
a first phoneme feature determination module for determining a first phoneme feature of first speech data based on a first acoustic feature of the first speech data;
a first prosodic feature obtaining module, configured to obtain a first prosodic feature of the first voice data by using the prosodic coding network based on the first acoustic feature, the first voice recognition feature of the first voice data, and the first phoneme feature;
a first acoustic feature obtaining module, configured to obtain a predicted acoustic feature by using the decoding network based on the first acoustic feature, the first speech recognition feature, and the first prosodic feature; and
a model training module to train the speech processing model based on a difference between the predicted acoustic feature and the first acoustic feature.
12. The apparatus of claim 11, wherein the speech processing model further comprises a phoneme alignment network and a phoneme encoding network; the first phoneme feature determination module includes:
a phoneme sequence obtaining submodule, configured to input the first acoustic feature into the phoneme alignment network, so as to obtain a first phoneme sequence for the first speech data; and
a phoneme feature obtaining sub-module, configured to input the feature matrix representing the first phoneme sequence and the first speech recognition feature into the phoneme coding network to obtain the first phoneme feature.
13. The apparatus of claim 11, wherein the prosody coding network comprises a feature extraction subnetwork, a normalization subnetwork, and a coding subnetwork; the first prosodic feature obtaining module includes:
a feature data obtaining submodule, configured to input the first acoustic feature, the first speech recognition feature, and the first phoneme feature into the feature extraction sub-network, so as to obtain a plurality of feature data;
a target data obtaining sub-module, configured to input the plurality of feature data into the normalized sub-network, so as to obtain target feature data from which feature data representing the first speech recognition feature is removed; and
a prosodic feature obtaining sub-module, configured to input the target feature data into the coding sub-network to obtain the first prosodic feature.
14. The apparatus of claim 13, wherein the coding sub-network comprises a bidirectional gated recurrent unit and a variational autoencoder; the prosodic feature obtaining sub-module comprises:
an intermediate feature obtaining unit, configured to input the target feature data into the bidirectional gated recurrent unit to obtain an intermediate feature;
a distribution determining unit, configured to input the intermediate feature into the variational autoencoder to obtain a mean value and a variance of a prosodic feature distribution; and
a prosodic feature obtaining unit, configured to randomly sample the prosodic feature distribution based on the mean value and the variance to obtain the first prosodic feature.
15. The apparatus of claim 14, wherein the model training module comprises:
a first loss determination submodule for determining a first loss of the speech processing model based on the prosodic feature distribution and a predetermined feature distribution;
a second loss determination sub-module for determining a second loss of the speech processing model based on a difference between the predicted acoustic feature and the first acoustic feature; and
a training sub-module for training the speech processing model based on the first loss and the second loss.
16. The apparatus of claim 15, wherein the training submodule comprises:
a first training unit to train the decoding network based on the second loss; and
a second training unit to train the prosody coding network based on the first loss and the second loss.
17. The apparatus of claim 11, wherein the speech processing model further comprises an object recognition network; the apparatus also includes a speech feature extraction module to extract the first speech recognition feature from the first acoustic feature by:
inputting the first acoustic feature into the object recognition network to obtain the first speech recognition feature.
18. The apparatus of claim 17, further comprising:
a first pre-training module for pre-training the object recognition network based on second speech data in a predetermined audio set and first labeling information for the second speech data,
wherein the first labeling information is used for indicating object probability information for the second speech data.
19. The apparatus of claim 12, further comprising:
a second pre-training module for pre-training the phoneme alignment network based on second speech data in a predetermined audio set and second labeling information for the second speech data,
wherein the second labeling information indicates a second phoneme sequence for the second speech data, the second phoneme sequence indicating a duration of each phoneme included in the second speech data.
20. An apparatus for enhancing data, comprising:
a second phoneme feature determination module, configured to determine a second phoneme feature of third speech data based on a second acoustic feature of the third speech data;
a second prosodic feature obtaining module, configured to obtain a second prosodic feature of the third speech data by using a prosodic coding network of a speech processing model based on the second acoustic feature, the second speech recognition feature of the third speech data, and the second phoneme feature;
a second acoustic feature obtaining module, configured to obtain a target acoustic feature by using a decoding network of the speech processing model based on the second acoustic feature, a target speech recognition feature, and the second prosodic feature; and
a voice data obtaining module for obtaining fourth voice data with the target voice recognition feature based on the target acoustic feature,
wherein the speech processing model is obtained by training by adopting the device of any one of claims 11-19; the target speech recognition feature is a speech recognition feature of an object other than the object to which the third speech data relates.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-10.
23. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 10.
CN202111083473.9A 2021-09-15 2021-09-15 Training method of voice processing model, data enhancement method, device and equipment Active CN113793598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111083473.9A CN113793598B (en) 2021-09-15 2021-09-15 Training method of voice processing model, data enhancement method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111083473.9A CN113793598B (en) 2021-09-15 2021-09-15 Training method of voice processing model, data enhancement method, device and equipment

Publications (2)

Publication Number Publication Date
CN113793598A true CN113793598A (en) 2021-12-14
CN113793598B CN113793598B (en) 2023-10-27

Family

ID=79183718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111083473.9A Active CN113793598B (en) 2021-09-15 2021-09-15 Training method of voice processing model, data enhancement method, device and equipment

Country Status (1)

Country Link
CN (1) CN113793598B (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007178686A (en) * 2005-12-27 2007-07-12 Matsushita Electric Ind Co Ltd Speech converter
CN101606190A (en) * 2007-02-19 2009-12-16 松下电器产业株式会社 Firmly sound conversion device, sound conversion device, speech synthesizing device, sound converting method, speech synthesizing method and program
US10008193B1 (en) * 2016-08-19 2018-06-26 Oben, Inc. Method and system for speech-to-singing voice conversion
US20200027440A1 (en) * 2017-03-23 2020-01-23 D&M Holdings, Inc. System Providing Expressive and Emotive Text-to-Speech
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure
CN111754973A (en) * 2019-09-23 2020-10-09 北京京东尚科信息技术有限公司 Voice synthesis method and device and storage medium
US20210193160A1 (en) * 2019-12-24 2021-06-24 Ubtech Robotics Corp Ltd. Method and apparatus for voice conversion and storage medium
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112420016A (en) * 2020-11-20 2021-02-26 四川长虹电器股份有限公司 Method and device for aligning synthesized voice and text and computer storage medium
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium
CN112382271A (en) * 2020-11-30 2021-02-19 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN112786012A (en) * 2020-12-31 2021-05-11 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN112786018A (en) * 2020-12-31 2021-05-11 科大讯飞股份有限公司 Speech conversion and related model training method, electronic equipment and storage device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUAIPING MING ET AL.: "Exemplar-Based Sparse Representation of Timbre and Prosody for Voice Conversion", ICASSP 2016 *
ZHANG XIAO; ZHANG WEI; WANG WENHAO; WAN YONGJING: "Voice conversion algorithm based on multi-spectral feature generative adversarial network", Computer Engineering and Science, no. 05 *

Also Published As

Publication number Publication date
CN113793598B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
WO2021174757A1 (en) Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN107945786B (en) Speech synthesis method and device
EP3680894A1 (en) Real-time speech recognition method and apparatus based on truncated attention, device and computer-readable storage medium
CN111402891B (en) Speech recognition method, device, equipment and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN107437417B (en) Voice data enhancement method and device based on recurrent neural network voice recognition
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN114141228B (en) Training method of speech synthesis model, speech synthesis method and device
CN109697978B (en) Method and apparatus for generating a model
WO2023142454A1 (en) Speech translation and model training methods, apparatus, electronic device, and storage medium
CN112397051A (en) Voice recognition method and device and terminal equipment
CN113793591A (en) Speech synthesis method and related device, electronic equipment and storage medium
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN113327575A (en) Speech synthesis method, device, computer equipment and storage medium
CN114694637A (en) Hybrid speech recognition method, device, electronic equipment and storage medium
EP4024393A2 (en) Training a speech recognition model
CN115762489A (en) Data processing system and method of voice recognition model and voice recognition method
CN112735432B (en) Audio identification method, device, electronic equipment and storage medium
CN113920987A (en) Voice recognition method, device, equipment and storage medium
CN114512121A (en) Speech synthesis method, model training method and device
CN113889089A (en) Method and device for acquiring voice recognition model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant