WO2023001128A1 - Method, apparatus and device for processing audio data - Google Patents

Method, apparatus and device for processing audio data

Info

Publication number
WO2023001128A1
WO2023001128A1 (PCT/CN2022/106380, CN2022106380W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
feature vector
model
initial
target
Prior art date
Application number
PCT/CN2022/106380
Other languages
English (en)
French (fr)
Inventor
陈展
Original Assignee
杭州海康威视数字技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 杭州海康威视数字技术股份有限公司
Publication of WO2023001128A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/26: Speech to text systems
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04: Coding or decoding of speech or audio signals using predictive techniques
    • G10L 19/16: Vocoder architecture
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering

Definitions

  • The present application relates to the field of speech processing, and in particular to a method, apparatus and device for processing audio data.
  • In audio systems such as voice calls, video conferencing, broadcasting, and home theater, audio noise is a common problem: for example, noise caused by improper grounding, noise caused by electromagnetic radiation interference, noise generated by the internal circuits of a device, and noise caused by power supply interference.
  • In order to remove the noise in audio data and improve the quality of the audio data, noise reduction processing must be performed on the audio data to obtain audio data with the noise removed.
  • In order to perform noise reduction on audio data, noise reduction algorithms for single-channel signal processing, such as the Wiener filtering algorithm and the spectral subtraction algorithm, can be used, and noise reduction algorithms for multi-channel signal processing, such as the beamforming algorithm and the blind source separation algorithm, can also be used.
  • However, these noise reduction algorithms all operate directly on the noise in the audio data, and suffer from problems such as ineffective noise reduction and a poor noise reduction effect.
  • In particular, noise reduction algorithms such as the Wiener filtering algorithm, the spectral subtraction algorithm, the beamforming algorithm, and the blind source separation algorithm cannot effectively reduce the non-stationary noise in audio data, so their noise reduction effect on such noise is poor.
  • The present application provides a method for processing audio data, the method comprising:
  • the audio feature vector is input to the trained target vocoder model, and the target vocoder model outputs the target audio data corresponding to the audio feature vector; wherein the target audio data is the audio data obtained after noise reduction processing is performed on the noise of the audio data to be processed.
  • the training process of the target vocoder model includes: acquiring sample audio data and sample text data corresponding to the sample audio data, and acquiring a text feature vector corresponding to the sample text data;
  • the text feature vector is input to an initial vocoder model, and the initial audio data corresponding to the text feature vector is output by the initial vocoder model;
  • the initial vocoder model is trained based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
  • the training of the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model includes:
  • the inputting of the text feature vector to the initial vocoder model, and the outputting, by the initial vocoder model, of the initial audio data corresponding to the text feature vector, include:
  • inputting the text feature vector to the first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain the Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector;
  • inputting the MFCC feature vector to the second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
  • there are multiple pieces of sample audio data, and the multiple pieces include sample audio data with noise and sample audio data without noise; the amount of sample audio data without noise is greater than the amount of sample audio data with noise.
  • the determining of the audio feature vector corresponding to the audio data to be processed includes: obtaining an MFCC feature vector corresponding to the audio data to be processed, and determining the audio feature vector corresponding to the audio data to be processed based on the MFCC feature vector.
  • the target vocoder model includes a first target sub-model and a second target sub-model, the first target sub-model being used to map text feature vectors to MFCC feature vectors and the second target sub-model being used to map MFCC feature vectors to audio data;
  • the audio feature vector is input to the second target sub-model, and the second target sub-model processes the audio feature vector to obtain the target audio data corresponding to the audio feature vector;
  • or, the audio feature vector is input to the first target sub-model, the first target sub-model inputs the audio feature vector to the second target sub-model, and the second target sub-model processes the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
  • the noise reduction application scenario is an application scenario requiring voice noise reduction; for example, the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
  • the present application provides an audio data processing device, the device comprising:
  • An acquisition module configured to acquire audio data to be processed with noise in a noise reduction application scenario
  • a determining module configured to determine an audio feature vector corresponding to the audio data to be processed
  • a processing module configured to input the audio feature vector to the trained target vocoder model, and output the target audio data corresponding to the audio feature vector by the target vocoder model; wherein, the target audio data is the audio data after noise reduction processing is performed on the noise of the audio data to be processed.
  • the processing device further includes:
  • the training module is used to obtain the target vocoder model by training in the following manner:
  • the text feature vector is input to an initial vocoder model, and the initial audio data corresponding to the text feature vector is output by the initial vocoder model;
  • the initial vocoder model is trained based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
  • when the training module trains the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model, the training module is configured to:
  • when the training module inputs the text feature vector to the initial vocoder model and the initial vocoder model outputs the initial audio data corresponding to the text feature vector, the training module is configured to:
  • input the text feature vector to the first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain the Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector;
  • input the MFCC feature vector to the second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
  • when the training module acquires sample audio data,
  • there are multiple pieces of sample audio data, and the multiple pieces include sample audio data with noise and sample audio data without noise;
  • the amount of sample audio data without noise is greater than the amount of sample audio data with noise.
  • when the determining module determines the audio feature vector corresponding to the audio data to be processed, it is configured to: obtain the MFCC feature vector corresponding to the audio data to be processed, and determine the audio feature vector corresponding to the audio data to be processed based on the MFCC feature vector.
  • the target vocoder model includes a first target sub-model and a second target sub-model, the first target sub-model being used to map text feature vectors to MFCC feature vectors and the second target sub-model being used to map MFCC feature vectors to audio data;
  • when the processing module inputs the audio feature vector to the trained target vocoder model and the target vocoder model outputs the target audio data corresponding to the audio feature vector, the processing module is configured to:
  • input the audio feature vector to the second target sub-model, the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector;
  • or input the audio feature vector to the first target sub-model, the first target sub-model inputting the audio feature vector to the second target sub-model, and the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
  • the noise reduction application scenario is an application scenario requiring voice noise reduction; wherein, the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
  • the present application provides an audio data processing device, including a processor and a machine-readable storage medium, wherein the machine-readable storage medium stores machine-executable instructions that can be executed by the processor, and the processor is configured to execute the machine-executable instructions to implement the following steps:
  • the audio feature vector is input to the trained target vocoder model, and the target vocoder model outputs the target audio data corresponding to the audio feature vector; wherein the target audio data is the audio data obtained after noise reduction processing is performed on the noise of the audio data to be processed.
  • the machine-executable instructions cause the processor to train the target vocoder model in the following manner:
  • the text feature vector is input to an initial vocoder model, and the initial audio data corresponding to the text feature vector is output by the initial vocoder model;
  • the initial vocoder model is trained based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
  • when the initial vocoder model is trained based on the sample audio data and the initial audio data to obtain the trained target vocoder model, the processor is caused to:
  • when the text feature vector is input to the initial vocoder model and the initial vocoder model outputs the initial audio data corresponding to the text feature vector, the processor is caused to:
  • input the text feature vector to the first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain the Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector;
  • input the MFCC feature vector to the second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
  • there are multiple pieces of sample audio data, and the multiple pieces include sample audio data with noise and sample audio data without noise; wherein the amount of sample audio data without noise is greater than the amount of sample audio data with noise.
  • when determining the audio feature vector corresponding to the audio data to be processed, the processor is caused to: obtain the MFCC feature vector corresponding to the audio data to be processed, and determine the audio feature vector corresponding to the audio data to be processed based on the MFCC feature vector.
  • the target vocoder model includes a first target sub-model and a second target sub-model, the first target sub-model being used to map text feature vectors to MFCC feature vectors and the second target sub-model being used to map MFCC feature vectors to audio data;
  • when the audio feature vector is input to the trained target vocoder model and the target vocoder model outputs the target audio data corresponding to the audio feature vector, the processor is caused to:
  • input the audio feature vector to the second target sub-model, the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector;
  • or input the audio feature vector to the first target sub-model, the first target sub-model inputting the audio feature vector to the second target sub-model, and the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
  • the noise reduction application scenario is an application scenario requiring voice noise reduction; wherein, the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
  • As can be seen from the above technical solution, in the embodiments of the present application, the audio feature vector corresponding to the noisy audio data to be processed can be input to the target vocoder model, and the target vocoder model outputs the target audio data corresponding to that audio feature vector. The target audio data is thus synthesized directly from the audio feature vector, that is, generated by speech synthesis, and there is no need to attend to the noise of the audio data to be processed itself: the audio feature vector only needs to be input to the target vocoder model, and the target audio data is generated by speech synthesis. Speech noise reduction is therefore more reliable, the noise reduction capability is stronger, and the audio data can be denoised effectively with a very good noise reduction effect.
  • In this way, the non-stationary noise in the audio data can also be removed, achieving the purpose of denoising the non-stationary noise in the audio data.
  • Fig. 1 is a schematic diagram of the training process of the vocoder model in an embodiment of the present application
  • FIG. 2 is a schematic flow diagram of a method for processing audio data in an embodiment of the present application
  • Fig. 3 is the schematic diagram of obtaining MFCC feature vector in one embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a speech synthesis system in an embodiment of the present application.
  • FIG. 5 is a schematic flow diagram of a method for processing audio data in an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an audio data processing device in an embodiment of the present application.
  • Fig. 7 is a hardware structural diagram of an audio data processing device in an embodiment of the present application.
  • Although terms such as first, second, and third may be used to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Furthermore, depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • In order to perform noise reduction on audio data, noise reduction algorithms for single-channel signal processing, such as the Wiener filtering algorithm and the spectral subtraction algorithm, can be used, and noise reduction algorithms for multi-channel signal processing, such as the beamforming algorithm and the blind source separation algorithm, can also be used. Deep learning algorithms can also be used to perform noise reduction on audio data by training deep neural networks.
  • However, the noise reduction algorithms for single-channel signal processing and the noise reduction algorithms for multi-channel signal processing both operate directly on the noise in the audio data, and suffer from problems such as ineffective noise reduction and a poor noise reduction effect.
  • In particular, these noise reduction algorithms cannot effectively reduce the non-stationary noise in the audio data, resulting in a poor noise reduction effect.
  • As for the noise reduction method based on deep learning, it suffers from problems such as low reliability, ineffective noise reduction for certain noises (such as noise types that have not been learned), and a poor noise reduction effect.
  • In view of this, an audio data processing method is proposed in the embodiments of the present application, which directly synthesizes audio data from audio feature vectors, that is, synthesizes audio data by means of speech synthesis. It is a fourth noise reduction approach, distinct from the noise reduction algorithms for single-channel signal processing, the noise reduction algorithms for multi-channel signal processing, and the deep learning algorithms.
  • This noise reduction method synthesizes audio data directly through speech synthesis: it does not need to attend to the noise itself, and only needs to input the audio feature vector to the target vocoder model to generate the final audio data. Speech noise reduction is therefore more reliable, and the noise reduction capability is stronger.
  • It is a speech noise reduction method based on speech synthesis, which can enhance the speech signal and improve speech intelligibility.
  • a training process of a vocoder model and a processing process of audio data may be involved.
  • In the training process, the training data can be used to train a configured initial vocoder model (for ease of distinction, the untrained vocoder model is called the initial vocoder model) to obtain a trained target vocoder model (for ease of distinction, the trained vocoder model is called the target vocoder model).
  • In the audio data processing process, the audio feature vector can be input to the trained target vocoder model, and the audio data corresponding to the audio feature vector is synthesized directly by the target vocoder model; that is, the target vocoder model is used to synthesize audio data directly, obtaining noise-removed audio data.
  • the following describes the training process of the vocoder model and the processing process of the audio data.
  • a vocoder model can be pre-configured as the initial vocoder model.
  • There is no restriction on the structure of the initial vocoder model, as long as it can convert a text feature vector into audio data.
  • For example, it may be an initial vocoder model based on a deep learning algorithm, an initial vocoder model based on a neural network (such as a convolutional neural network), or another type of initial vocoder model.
  • Referring to FIG. 1, which is a schematic diagram of the training process of the vocoder model, the initial vocoder model is trained into the target vocoder model (this is also called the training process of the target vocoder model). The process includes:
  • Step 101: Acquire sample audio data and sample text data corresponding to the sample audio data.
  • a plurality of sample audio data may be acquired (for convenience of distinction, the audio data in the training process is referred to as sample audio data), that is, a large amount of sample audio data may be acquired.
  • For example, sample audio data with noise and sample audio data without noise (also called clean sample audio data) may both be acquired.
  • The number of noise-free sample audio data may be greater than the number of noisy sample audio data; alternatively, the two numbers may be equal, or the number of noise-free sample audio data may be smaller than the number of noisy sample audio data.
  • Of course, all of the acquired sample audio data may also be noise-free sample audio data.
  • In summary, sample audio data can be obtained and used as training data for the initial vocoder model; that is, the initial vocoder model can be trained and optimized using both the noisy sample audio data and the noise-free sample audio data to obtain a target vocoder model with noise reduction capability.
  • Of course, the initial vocoder model can also be trained and optimized using only noise-free sample audio data to obtain the target vocoder model.
  • Sample text data corresponding to the sample audio data can also be obtained (for ease of distinction, the text data in the training process is referred to as sample text data); for example, the sample text data corresponding to the sample audio data can be pre-configured.
  • For example, the sample audio data may be audio (any sound that can be heard may be called audio), such as a piece of speech like "the weather is really nice today".
  • The sample text data may be text (that is, the written form of language, usually a combination of one or more sentences), such as the text "the weather is really nice today". Obviously, regardless of whether there is noise in the speech "the weather is really nice today", the text corresponding to that speech can be configured as "the weather is really nice today"; this process is not restricted.
  • In summary, the sample text data corresponding to the sample audio data can be obtained, and there is no restriction on the method of obtaining it.
  • Step 102: Acquire the text feature vector corresponding to the sample text data.
  • For example, the text feature vector corresponding to the sample text data can be obtained; that is to say, there is a correspondence between the sample audio data, the sample text data, and the text feature vector. For example, sample audio data a1 and sample text data b1 correspond to text feature vector c1, sample audio data a2 and sample text data b2 correspond to text feature vector c2, and so on.
  • At least one text feature corresponding to the sample text data can be obtained, and all of the text features can be assembled into a feature vector; this feature vector is the text feature vector.
  • To obtain the text features, unsupervised methods can be used, such as TF-IDF (Term Frequency-Inverse Document Frequency), or supervised methods can be used, such as the chi-square test, information gain, and mutual information. There is no limit on the acquisition method, as long as the text features corresponding to the sample text data can be obtained and assembled into a text feature vector.
  • the text feature vector corresponding to each sample text data can be obtained.
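  • As an illustration only (the application does not prescribe any particular tool), a minimal sketch of turning sample text data into TF-IDF text feature vectors is shown below; the use of scikit-learn and the sample sentences are assumptions, not part of the application.

```python
# Hypothetical sketch: extracting TF-IDF text feature vectors from sample
# text data. The application only requires some text feature vector; the
# library choice (scikit-learn) and the sample texts are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

sample_texts = [
    "the weather is really nice today",
    "please lower the volume of the speaker",
]

vectorizer = TfidfVectorizer()  # unsupervised TF-IDF text features
text_feature_vectors = vectorizer.fit_transform(sample_texts).toarray()

# One text feature vector per sample text data, as described above.
print(text_feature_vectors.shape)  # (num_texts, vocabulary_size)
```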
  • Step 103: Input the text feature vector to the initial vocoder model, and the initial vocoder model outputs the initial audio data corresponding to the text feature vector.
  • For example, the text feature vector can be input to the initial vocoder model, and the initial vocoder model processes the text feature vector to obtain the initial audio data corresponding to the text feature vector.
  • The text feature vector then has a correspondence with the initial audio data; that is, there is a correspondence between sample audio data, sample text data, text feature vector, and initial audio data. For example, sample audio data a1, sample text data b1, and text feature vector c1 correspond to initial audio data d1; sample audio data a2, sample text data b2, and text feature vector c2 correspond to initial audio data d2; and so on. It can be seen from this correspondence that sample audio data a1 corresponds to initial audio data d1 (the initial audio data is also audio), and sample audio data a2 corresponds to initial audio data d2.
  • a vocoder model can be pre-configured as an initial vocoder model.
  • the function of the initial vocoder model is to convert text feature vectors into audio data.
  • There is no limitation on the structure of the initial vocoder model, as long as it can convert a text feature vector into audio data; for example, it may be an initial vocoder model based on a deep learning algorithm, an initial vocoder model based on a neural network, and the like.
  • After the text feature vector is input to the initial vocoder model, the initial vocoder model can process the text feature vector to obtain the audio data corresponding to the text feature vector.
  • For ease of distinction, the audio data obtained by the initial vocoder model is called the initial audio data, and the initial vocoder model outputs the initial audio data corresponding to the text feature vector.
  • In a possible implementation, the initial vocoder model can be divided into two sub-models, namely a first initial sub-model and a second initial sub-model; that is to say, the initial vocoder model can consist of a first initial sub-model and a second initial sub-model.
  • the function of the first initial sub-model is to convert the text feature vector into an MFCC (Mel Frequency Cepstrum Coefficient, Mel Frequency Cepstral Coefficient) feature vector.
  • the function of the second initial sub-model is to convert the MFCC feature vector into audio data, and there is no restriction on the structure of the second initial sub-model, as long as the second initial sub-model can convert the MFCC feature vector into audio data.
  • An MFCC feature vector is a group of features obtained by encoding the physical information of speech (such as the spectral envelope and spectral details); it consists of cepstral parameters extracted in the Mel-scale frequency domain, where the Mel scale describes the nonlinear characteristics of frequency perception.
  • The MFCC feature vector is one implementation of a speech parameter feature vector. Speech parameter feature vectors may also include LPC (Linear Prediction Coefficients) feature vectors, PLP (Perceptual Linear Predictive) feature vectors, LPCC (Linear Predictive Cepstral Coefficient) feature vectors, and so on.
  • In other words, the function of the first initial sub-model is to convert the text feature vector into a speech parameter feature vector, and the function of the second initial sub-model is to convert the speech parameter feature vector into audio data.
  • In the following, the MFCC feature vector is taken as an example of the speech parameter feature vector; the implementation for the other speech parameter feature vectors is similar.
  • Since the initial vocoder model is composed of a first initial sub-model and a second initial sub-model, the text feature vector can be input to the first initial sub-model of the initial vocoder model, and the first initial sub-model processes the text feature vector to obtain the MFCC feature vector corresponding to the text feature vector; this processing is not restricted.
  • The MFCC feature vector is then input to the second initial sub-model of the initial vocoder model, and the second initial sub-model processes the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector; this processing is likewise not restricted, and the second initial sub-model outputs the initial audio data corresponding to the text feature vector. A structural sketch follows.
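  • For concreteness, the sketch below shows one possible two-sub-model vocoder structure in PyTorch; the application deliberately leaves the network structure open, so every layer type and size here is an illustrative assumption rather than the claimed model.

```python
# Hypothetical sketch of the two-stage initial vocoder model described
# above. The application does not restrict the structure; layer types and
# sizes below are illustrative assumptions only.
import torch
import torch.nn as nn

class FirstSubModel(nn.Module):
    """Maps a text feature vector to an MFCC feature vector."""
    def __init__(self, text_dim: int = 256, mfcc_dim: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, mfcc_dim)
        )

    def forward(self, text_vec: torch.Tensor) -> torch.Tensor:
        return self.net(text_vec)

class SecondSubModel(nn.Module):
    """Maps an MFCC feature vector to a frame of audio samples."""
    def __init__(self, mfcc_dim: int = 80, samples_per_frame: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mfcc_dim, 512), nn.ReLU(),
            nn.Linear(512, samples_per_frame), nn.Tanh()
        )

    def forward(self, mfcc_vec: torch.Tensor) -> torch.Tensor:
        return self.net(mfcc_vec)

class InitialVocoderModel(nn.Module):
    """First initial sub-model followed by second initial sub-model."""
    def __init__(self):
        super().__init__()
        self.first = FirstSubModel()
        self.second = SecondSubModel()

    def forward(self, text_vec: torch.Tensor) -> torch.Tensor:
        mfcc_vec = self.first(text_vec)   # text features -> MFCC features
        return self.second(mfcc_vec)      # MFCC features -> audio frame
```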
  • Step 104: Train the initial vocoder model based on the sample audio data and the initial audio data (that is, adjust the parameters of the initial vocoder model) to obtain the trained target vocoder model.
  • The sample audio data is real audio data, while the initial audio data is the audio data produced from the sample text data by the initial vocoder model.
  • The smaller the loss value between the sample audio data and the initial audio data, the closer the initial audio data is to the sample audio data; that is, the better the performance of the initial vocoder model and the more accurate the initial audio data it produces.
  • Conversely, the larger the loss value between the sample audio data and the initial audio data, the greater the difference between them; that is, the worse the performance of the initial vocoder model and the less accurate the initial audio data it produces.
  • Based on this, the initial vocoder model can be trained using the loss value between the sample audio data and the initial audio data to obtain the trained target vocoder model.
  • the following steps may be used to train the initial vocoder model:
  • Step 1041: Determine a target loss value based on the sample audio data and the initial audio data.
  • For example, a loss function can be preconfigured whose input is the audio signal loss value between the sample audio data and the initial audio data and whose output is the target loss value. Therefore, the audio signal loss value between the sample audio data and the initial audio data can first be determined and then substituted into the loss function to obtain the target loss value.
  • Both the sample audio data and the initial audio data are audio signals, and the difference between the sample audio data and the initial audio data is the audio signal loss value.
  • For example, the sample audio data and the initial audio data can each be quantized; quantization digitizes the audio signal, yielding a sample audio data value and an initial audio data value that can be computed with (there is no limit on the quantization method). After the sample audio data value and the initial audio data value are obtained, the absolute value of the difference between them may serve as the audio signal loss value.
  • In this way, the target loss value between the sample audio data and the initial audio data is obtained. The smaller the target loss value, the better the performance of the initial vocoder model and the more accurate the initial audio data relative to the sample audio data; the larger the target loss value, the worse the performance of the initial vocoder model and the less accurate the initial audio data relative to the sample audio data.
  • In the training process, multiple sample audio data and multiple initial audio data can be obtained, with a one-to-one correspondence between them; for example, sample audio data a1 corresponds to initial audio data d1, sample audio data a2 corresponds to initial audio data d2, and so on.
  • For each such pair, a target loss value can be determined; a final target loss value is then computed from the target loss values of all pairs, for example as their mean or median, and the computation method is not restricted. A sketch follows.
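  • As a sketch of step 1041 under stated assumptions (the application fixes neither the quantization nor the loss function), the audio signal loss below is taken to be the mean absolute difference between the digitized waveforms, averaged over all sample/initial pairs:

```python
# Hypothetical sketch of the target loss value. Mean absolute difference
# over quantized waveforms, and mean aggregation across pairs, are assumed
# choices; the application leaves both open.
import numpy as np

def audio_signal_loss(sample_audio: np.ndarray, initial_audio: np.ndarray) -> float:
    """Absolute difference between the two digitized audio signals."""
    return float(np.mean(np.abs(sample_audio - initial_audio)))

def final_target_loss(pairs) -> float:
    """Aggregate per-pair target loss values, here by averaging."""
    return float(np.mean([audio_signal_loss(s, d) for s, d in pairs]))
```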
  • Step 1042: Determine, based on the target loss value, whether the initial vocoder model has converged. If not, step 1043 may be performed; if so, step 1044 may be performed.
  • For example, if the target loss value is less than a preset threshold, it is determined that the initial vocoder model has converged; if the target loss value is not less than the preset threshold, it is determined that the initial vocoder model has not converged. The preset threshold can be configured empirically (for example, any value greater than 0), and its value is not restricted.
  • Alternatively, the iteration duration of the initial vocoder model can be counted: if it reaches a duration threshold, the initial vocoder model is determined to have converged; otherwise, it is determined not to have converged.
  • Of course, these are only examples, and the determination method is not limited.
  • Step 1043: Adjust the parameters of the initial vocoder model based on the target loss value to obtain an adjusted vocoder model, take the adjusted vocoder model as the initial vocoder model, and return to the operation of inputting the text feature vector to the initial vocoder model, that is, return to step 103.
  • For example, the parameters of the initial vocoder model can be adjusted using a back-propagation algorithm (such as the gradient descent method) to obtain the adjusted vocoder model; the parameter adjustment process is not restricted here.
  • Since the initial vocoder model can be composed of a first initial sub-model and a second initial sub-model, the parameters of the first initial sub-model can be adjusted to obtain an adjusted first initial sub-model, and the parameters of the second initial sub-model can be adjusted to obtain an adjusted second initial sub-model; the adjusted first initial sub-model and the adjusted second initial sub-model together form the adjusted vocoder model.
  • Step 1044: Determine the converged initial vocoder model to be the target vocoder model. At this point, the training process of the vocoder model is complete: the initial vocoder model has been trained using training data (such as multiple sample audio data and multiple sample text data) to obtain the trained target vocoder model.
  • The converged initial vocoder model can be composed of a first initial sub-model and a second initial sub-model; the first initial sub-model in the converged model can be denoted the first target sub-model, and the second initial sub-model the second target sub-model. Therefore, the target vocoder model can be composed of the first target sub-model and the second target sub-model. A sketch of the full training procedure follows.
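  • Putting steps 1041 to 1044 together, the sketch below trains the hypothetical InitialVocoderModel from the earlier sketch; gradient descent as the back-propagation algorithm and a loss threshold as the convergence test are each only one of the options the application allows.

```python
# Hypothetical training loop for the sketched InitialVocoderModel.
# SGD and the loss-threshold convergence test are assumptions; the
# application also allows e.g. an iteration-duration test.
import torch

def train_vocoder(model, text_vecs, sample_audio,
                  threshold: float = 1e-3, max_steps: int = 10000):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for _ in range(max_steps):
        initial_audio = model(text_vecs)                 # step 103: synthesize
        loss = torch.mean(torch.abs(initial_audio - sample_audio))  # step 1041
        if loss.item() < threshold:                      # step 1042: converged?
            break                                        # step 1044: target model
        optimizer.zero_grad()
        loss.backward()                                  # step 1043: adjust the
        optimizer.step()                                 # parameters and loop
    return model  # the converged model serves as the target vocoder model
```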
  • Referring to FIG. 2, the processing method of the audio data may include:
  • Step 201: In a noise reduction application scenario, acquire the audio data to be processed that contains noise.
  • For example, noise reduction application scenarios may include, but are not limited to, audio systems such as voice calls, video conferencing, broadcasting, and home theater. Of course, these are only examples, and the noise reduction application scenario is not restricted.
  • The noise reduction application scenario can be any application scenario that requires voice noise reduction.
  • For example, the noise reduction application scenario can be a voice call application scenario, a video conference application scenario, a voice intercom application scenario, and the like.
  • The audio data in a noise reduction application scenario is audio data with noise, and such audio data may be referred to as the audio data to be processed. Therefore, the audio data to be processed that contains noise can be acquired.
  • Step 202: Determine the audio feature vector corresponding to the audio data to be processed.
  • For example, at least one audio feature corresponding to the audio data to be processed can be obtained, and all of the audio features can be assembled into a feature vector; this feature vector is the audio feature vector.
  • Audio feature vectors are feature vectors related to speech parameters, including but not limited to MFCC feature vectors, LPC feature vectors, PLP feature vectors, and LPCC feature vectors; the type of audio feature vector is not restricted. In the following, the MFCC feature vector is taken as an example, and the implementation for other types of audio feature vectors is similar.
  • For example, determining the audio feature vector corresponding to the audio data to be processed may include, but is not limited to: obtaining the MFCC feature vector corresponding to the audio data to be processed, and determining the audio feature vector corresponding to the audio data to be processed based on that MFCC feature vector; for example, the MFCC feature vector itself may be used as the audio feature vector corresponding to the audio data to be processed.
  • In one implementation, audio data to be processed with a frame length of M milliseconds (such as 16 milliseconds) is acquired, an MFCC feature vector (such as an 80-dimensional MFCC feature vector) is extracted from the audio data to be processed, and that MFCC feature vector is used as the audio feature vector corresponding to the audio data to be processed.
  • Obtaining the MFCC feature vector corresponding to the audio data to be processed may include, but is not limited to, the following method: performing windowing, a fast Fourier transform, filtering with a Mel filter bank, a logarithmic operation, and a discrete cosine transform to obtain the MFCC feature vector.
  • For example, pre-emphasis and framing can first be applied to the continuous audio to obtain multiple frames of audio data, each frame being one unit of the aforementioned audio data to be processed, for example 16 milliseconds of audio data to be processed.
  • Referring to FIG. 3, the audio data to be processed can be windowed to obtain windowed data; a fast Fourier transform (FFT) is applied to the windowed data to obtain transformed data; the transformed data is filtered with a Mel filter bank to obtain filtered data; a logarithmic operation is applied to the filtered data to obtain logarithmic data; and a discrete cosine transform (DCT) is applied to the logarithmic data. The output of the discrete cosine transform is the MFCC feature vector.
  • In summary, the MFCC feature vector is obtained through processing such as windowing, fast Fourier transform, Mel filter-bank filtering, logarithmic operation, and discrete cosine transform.
  • the above is just an example of obtaining the MFCC feature vector corresponding to the audio data to be processed, and there is no limitation to this implementation, as long as the MFCC feature vector can be obtained.
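  • For illustration, the sketch below extracts MFCC feature vectors with librosa, whose MFCC pipeline internally performs the windowing, fast Fourier transform, Mel filter-bank filtering, logarithm, and discrete cosine transform named above; the 16-millisecond hop and 80 dimensions echo the example figures, while the file name and sample rate are assumptions.

```python
# Hypothetical sketch: 80-dimensional MFCC feature vectors from audio,
# one possible implementation (the application prescribes no library).
import librosa

sr = 16000                                  # assumed sample rate
audio, _ = librosa.load("noisy_input.wav", sr=sr)  # hypothetical file

hop = int(0.016 * sr)                       # 16 ms per frame, as in the example
mfcc = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=80,              # 80-dimensional MFCC vectors
    n_fft=512, hop_length=hop,
)
print(mfcc.shape)                           # (80, num_frames)
```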
  • Step 203: Input the audio feature vector to the trained target vocoder model, and the target vocoder model outputs the target audio data corresponding to the audio feature vector.
  • the target audio data may be audio data after noise reduction processing is performed on the noise of the audio data to be processed.
  • For example, the target vocoder model can inversely transform the audio feature vector (that is, the acoustic feature vector) to obtain the corresponding sound waveform, and then splice the sound waveforms to obtain synthesized speech, namely the target audio data corresponding to the audio feature vector.
  • In other words, the target audio data can be synthesized directly from the audio feature vector through speech synthesis, without attending to the noise of the audio data to be processed itself; speech noise reduction is therefore more reliable and the noise reduction capability stronger. After the audio feature vector is input to the target vocoder model, the target vocoder model processes the audio feature vector to obtain the target audio data corresponding to the audio feature vector, and the target audio data is the audio data after noise reduction has been performed on the noise of the audio data to be processed; that is to say, noise-reduced target audio data is produced by speech synthesis.
  • In a possible implementation, the target vocoder model includes a first target sub-model and a second target sub-model.
  • The first target sub-model (with the same function as the first initial sub-model in the initial vocoder model) is used to map text feature vectors to MFCC feature vectors.
  • The second target sub-model (with the same function as the second initial sub-model in the initial vocoder model) is used to map MFCC feature vectors to audio data. On this basis:
  • In one approach, the audio feature vector (that is, the MFCC feature vector) can be input directly to the second target sub-model of the target vocoder model, and the second target sub-model processes the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
  • That is, the MFCC feature vector reaches the second target sub-model without passing through the first target sub-model. Since the second target sub-model is used to map MFCC feature vectors to audio data, once it obtains the MFCC feature vector it can process it (the processing is not restricted) and output the target audio data corresponding to the MFCC feature vector.
  • In another approach, the audio feature vector (that is, the MFCC feature vector) can be input to the first target sub-model of the target vocoder model, the first target sub-model inputs the audio feature vector to the second target sub-model, and the second target sub-model processes the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
  • That is, the MFCC feature vector is first input to the first target sub-model, which does not process it but forwards it to the second target sub-model. Since the second target sub-model is used to map MFCC feature vectors to audio data, it processes the MFCC feature vector (the processing is not restricted) and outputs the target audio data corresponding to the MFCC feature vector. A sketch follows.
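  • Under the same assumptions as the model sketch above, the inference path can be sketched as follows: at noise-reduction time the MFCC feature vector either skips the first target sub-model or passes through it unchanged, and the second target sub-model synthesizes the audio.

```python
# Hypothetical inference sketch, reusing the sketched model classes above.
# The MFCC feature vector of the noisy frame goes straight to the second
# target sub-model, which outputs the noise-reduced audio frame.
import torch

target_model = InitialVocoderModel()   # assume trained weights are loaded
target_model.eval()

mfcc_vec = torch.randn(1, 80)          # stand-in for a real MFCC frame
with torch.no_grad():
    target_audio = target_model.second(mfcc_vec)  # bypass first sub-model
print(target_audio.shape)              # (1, samples_per_frame)
```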
  • As can be seen from the above technical solution, in the embodiments of the present application, the audio feature vector corresponding to the noisy audio data to be processed can be input to the target vocoder model, and the target vocoder model outputs the target audio data corresponding to that audio feature vector.
  • The target audio data is thus synthesized directly from the audio feature vector, that is, generated by speech synthesis, and there is no need to attend to the noise of the audio data to be processed itself: the audio feature vector only needs to be input to the target vocoder model, and the target audio data is generated by speech synthesis. Speech noise reduction is therefore more reliable, the noise reduction capability is stronger, and the audio data can be denoised effectively with a very good noise reduction effect.
  • In this way, the non-stationary noise in the audio data can also be removed, achieving the purpose of denoising the non-stationary noise in the audio data.
  • the audio data processing method will be described below in combination with specific application scenarios.
  • Referring to FIG. 4, the speech synthesis system can include a text analysis module, a prosody processing module, an acoustic processing module, and a speech synthesis module; the text analysis module and the prosody processing module are front-end modules, while the acoustic processing module and the speech synthesis module are back-end modules.
  • the text analysis module is used to simulate the process of human understanding of natural speech, so that the computer can fully understand the input text, and provide various pronunciation, pause and other information for the acoustic processing module and speech synthesis module.
  • The prosody processing module is used to process the segmental features of pronunciation, such as pitch, duration, and intensity, so that the synthesized speech expresses the semantics correctly and sounds more natural; it then extracts text features according to the results of word segmentation and labeling, and turns the text features into a sequence of text feature vectors.
  • the acoustic processing module (ie, the acoustic model) is used to establish a mapping from text feature vectors to acoustic feature vectors, and the text feature vectors will become acoustic feature vectors after being processed by the acoustic processing module.
  • The speech synthesis module (that is, the vocoder) is used to obtain the corresponding sound waveforms by inversely transforming the acoustic feature vectors. For example, multiple acoustic feature vectors can be inversely transformed to obtain multiple corresponding sound waveforms, and the multiple sound waveforms are then spliced in sequence to obtain synthesized speech.
  • In this embodiment, the speech synthesis module can be retained, while the text analysis module, the prosody processing module, and the acoustic processing module are removed.
  • On this basis, the audio feature vector (such as the MFCC feature vector) corresponding to the audio data to be processed can be determined directly.
  • The speech synthesis module then obtains the target audio data corresponding to the MFCC feature vector based on the target vocoder model, the target audio data being the audio data after noise reduction has been performed on the noise of the audio data to be processed. That is to say, the MFCC feature vector replaces the related functions of the text analysis module, the prosody processing module, and the acoustic processing module, and the target vocoder model is used directly to synthesize speech, realizing a new noise reduction method.
  • Referring to FIG. 5, in this application scenario the audio data processing method may include:
  • Step 501: Acquire audio data to be processed with a frame length of M milliseconds (for example, 16 milliseconds).
  • Step 502: Extract an N-dimensional (for example, 80-dimensional) MFCC feature vector from the audio data to be processed.
  • Step 503: Input the MFCC feature vector to the target vocoder model, and the target vocoder model outputs the target audio data corresponding to the MFCC feature vector, thereby realizing noise reduction of the audio data. An end-to-end sketch follows.
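  • Tying steps 501 to 503 to the earlier sketches, a hypothetical end-to-end frame-by-frame pipeline could look as follows; the file name, sample rate, and model classes are assumptions carried over from the sketches above.

```python
# Hypothetical end-to-end sketch of steps 501-503, reusing the librosa
# MFCC extraction and the sketched vocoder model from earlier.
import librosa
import torch

sr = 16000
audio, _ = librosa.load("noisy_call.wav", sr=sr)        # step 501: frames of M ms
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=80,  # step 502: N-dim MFCC
                            hop_length=int(0.016 * sr))

model = InitialVocoderModel()                           # assume trained weights
model.eval()
frames = torch.from_numpy(mfcc.T).float()               # (num_frames, 80)
with torch.no_grad():
    denoised = model.second(frames)                     # step 503: synthesize
clean_audio = denoised.reshape(-1).numpy()              # splice the waveforms
```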
  • FIG. 6 is a schematic structural diagram of the device.
  • the device may include:
  • An acquisition module 61 configured to acquire audio data to be processed with noise in a noise reduction application scenario
  • a determining module 62 configured to determine an audio feature vector corresponding to the audio data to be processed
  • a processing module 63 configured to input the audio feature vector to the trained target vocoder model, the target vocoder model outputting the target audio data corresponding to the audio feature vector; wherein the target audio data is the audio data after noise reduction processing is performed on the noise of the audio data to be processed.
  • the device further includes (not shown in FIG. 6):
  • the training module is used to obtain the target vocoder model by training in the following manner:
  • the text feature vector is input to an initial vocoder model, and the initial audio data corresponding to the text feature vector is output by the initial vocoder model;
  • the initial vocoder model is trained based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
  • when the training module trains the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model, the training module is configured to:
  • when the training module inputs the text feature vector to the initial vocoder model and the initial vocoder model outputs the initial audio data corresponding to the text feature vector, the training module is configured to:
  • input the text feature vector to the first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain the Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector;
  • input the MFCC feature vector to the second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
  • when the training module acquires sample audio data, there are multiple pieces of sample audio data, and the multiple pieces include sample audio data with noise and sample audio data without noise; wherein the amount of sample audio data without noise is greater than the amount of sample audio data with noise.
  • when the determining module 62 determines the audio feature vector corresponding to the audio data to be processed, it is specifically configured to: obtain the MFCC feature vector corresponding to the audio data to be processed, and determine the audio feature vector corresponding to the audio data to be processed based on the MFCC feature vector.
  • In a possible implementation, the target vocoder model includes a first target sub-model and a second target sub-model, the first target sub-model being used to map text feature vectors to MFCC feature vectors and the second target sub-model being used to map MFCC feature vectors to audio data. When the processing module 63 inputs the audio feature vector to the trained target vocoder model and the target vocoder model outputs the target audio data corresponding to the audio feature vector, the processing module 63 is specifically configured to: input the audio feature vector to the second target sub-model, the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector; or input the audio feature vector to the first target sub-model, the first target sub-model inputting the audio feature vector to the second target sub-model, and the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
  • the noise reduction application scenario is an application scenario requiring voice noise reduction; wherein the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
  • Referring to FIG. 7, the audio data processing device includes a processor 71 and a machine-readable storage medium 72, the machine-readable storage medium 72 storing machine-executable instructions that can be executed by the processor 71; the processor 71 is configured to execute the machine-executable instructions to implement the following steps:
  • the audio feature vector is input to the trained target vocoder model, and the target vocoder model outputs the target audio data corresponding to the audio feature vector; wherein the target audio data is the audio data obtained after noise reduction processing is performed on the noise of the audio data to be processed.
  • the machine-executable instructions cause the processor to obtain the target vocoder model through training in the following manner:
  • the text feature vector is input to an initial vocoder model, and the initial audio data corresponding to the text feature vector is output by the initial vocoder model;
  • the initial vocoder model is trained based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
  • when the initial vocoder model is trained based on the sample audio data and the initial audio data to obtain the trained target vocoder model, the processor is caused to:
  • when the text feature vector is input to the initial vocoder model and the initial vocoder model outputs the initial audio data corresponding to the text feature vector, the processor is caused to:
  • input the text feature vector to the first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain the Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector;
  • input the MFCC feature vector to the second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
  • there are multiple pieces of sample audio data, and the multiple pieces include sample audio data with noise and sample audio data without noise; wherein the number of sample audio data without noise is greater than the number of sample audio data with noise.
  • when determining the audio feature vector corresponding to the audio data to be processed, the processor is caused to: obtain the MFCC feature vector corresponding to the audio data to be processed, and determine the audio feature vector corresponding to the audio data to be processed based on the MFCC feature vector.
  • the target vocoder model includes a first target sub-model and a second target sub-model, the first target sub-model being used to map text feature vectors to MFCC feature vectors and the second target sub-model being used to map MFCC feature vectors to audio data;
  • when the audio feature vector is input to the trained target vocoder model and the target vocoder model outputs the target audio data corresponding to the audio feature vector, the processor is caused to:
  • input the audio feature vector to the second target sub-model, the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector;
  • or input the audio feature vector to the first target sub-model, the first target sub-model inputting the audio feature vector to the second target sub-model, and the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
  • the noise reduction application scenario is an application scenario requiring voice noise reduction; wherein the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
  • An embodiment of the present application also provides a machine-readable storage medium on which several computer instructions are stored; when the computer instructions are executed by a processor, the audio data processing method disclosed in the above examples of the present application can be realized.
  • the above-mentioned machine-readable storage medium may be any electronic, magnetic, optical or other physical storage device, which may contain or store information, such as executable instructions, data, and so on.
  • the machine-readable storage medium can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disk (such as a CD or DVD), or similar storage media, or a combination of them.
  • a typical implementing device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet computer, wearable device, or any combination of these devices.
  • embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are performed on the computer or other programmable equipment to produce computer-implemented processing, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An audio data processing method, apparatus and device. The method includes: in a noise reduction application scenario, acquiring to-be-processed audio data in which noise is present (201); determining an audio feature vector corresponding to the to-be-processed audio data (202); inputting the audio feature vector into a trained target vocoder model, the target vocoder model outputting target audio data corresponding to the audio feature vector, wherein the target audio data is audio data obtained after noise reduction is performed on the noise of the to-be-processed audio data (203).

Description

Audio data processing method, apparatus and device. Technical Field
The present application relates to the field of speech processing, and in particular to an audio data processing method, apparatus and device.
Background
In audio systems such as voice calls, video conferencing, broadcasting and home theater, audio noise problems frequently arise, for example audio noise caused by improper grounding, by electromagnetic radiation interference, by circuitry inside the device, or by power supply interference.
To remove the noise in audio data and improve the quality of the audio data, noise reduction processing needs to be performed on the audio data to obtain denoised audio data. For this purpose, single-channel signal processing noise reduction algorithms such as the Wiener filtering algorithm and the spectral subtraction algorithm may be used, or multi-channel signal processing noise reduction algorithms such as the beamforming algorithm and the blind source separation algorithm.
However, the above noise reduction algorithms all perform noise reduction directly on the noise in the audio data, and suffer from problems such as failing to denoise the audio data effectively and a poor noise reduction effect. For example, if non-stationary noise is present in the audio data, noise reduction algorithms such as Wiener filtering, spectral subtraction, beamforming and blind source separation cannot reduce the non-stationary noise in the audio data, and the noise reduction effect is poor.
Summary of the Invention
The present application provides an audio data processing method, the method comprising:
in a noise reduction application scenario, acquiring to-be-processed audio data in which noise is present;
determining an audio feature vector corresponding to the to-be-processed audio data;
inputting the audio feature vector into a trained target vocoder model, and outputting, by the target vocoder model, target audio data corresponding to the audio feature vector; wherein the target audio data is audio data obtained after noise reduction is performed on the noise of the to-be-processed audio data.
Exemplarily, the training process of the target vocoder model includes:
acquiring sample audio data and sample text data corresponding to the sample audio data;
acquiring a text feature vector corresponding to the sample text data;
inputting the text feature vector into an initial vocoder model, and outputting, by the initial vocoder model, initial audio data corresponding to the text feature vector;
training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
Exemplarily, training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model includes:
determining a target loss value based on the sample audio data and the initial audio data;
determining, based on the target loss value, whether the initial vocoder model has converged;
if not, adjusting the parameters of the initial vocoder model based on the target loss value to obtain an adjusted vocoder model, taking the adjusted vocoder model as the initial vocoder model, and returning to the operation of inputting the text feature vector into the initial vocoder model;
if so, determining the converged initial vocoder model as the target vocoder model.
Exemplarily, inputting the text feature vector into the initial vocoder model and outputting, by the initial vocoder model, the initial audio data corresponding to the text feature vector includes:
inputting the text feature vector into a first initial sub-model of the initial vocoder model, and processing the text feature vector by the first initial sub-model to obtain a Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector; inputting the MFCC feature vector into a second initial sub-model of the initial vocoder model, and processing the MFCC feature vector by the second initial sub-model to obtain the initial audio data corresponding to the text feature vector.
Exemplarily, when the sample audio data is acquired, multiple pieces of sample audio data are acquired, the multiple pieces including sample audio data with noise and sample audio data without noise; the number of pieces of sample audio data without noise is greater than the number of pieces of sample audio data with noise.
Exemplarily, determining the audio feature vector corresponding to the to-be-processed audio data includes:
acquiring an MFCC feature vector corresponding to the to-be-processed audio data;
determining, based on the MFCC feature vector, the audio feature vector corresponding to the to-be-processed audio data.
In a possible implementation, the target vocoder model includes a first target sub-model and a second target sub-model; the first target sub-model is used to map text feature vectors to MFCC feature vectors, and the second target sub-model is used to map MFCC feature vectors to audio data;
inputting the audio feature vector into the trained target vocoder model and outputting, by the target vocoder model, the target audio data corresponding to the audio feature vector includes:
inputting the audio feature vector into the second target sub-model, and processing the audio feature vector by the second target sub-model to obtain the target audio data corresponding to the audio feature vector;
or, inputting the audio feature vector into the first target sub-model, passing the audio feature vector from the first target sub-model to the second target sub-model, and processing the audio feature vector by the second target sub-model to obtain the target audio data corresponding to the audio feature vector.
Exemplarily, the noise reduction application scenario is an application scenario requiring speech noise reduction; the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
The present application provides an audio data processing apparatus, the apparatus comprising:
an acquisition module, configured to acquire, in a noise reduction application scenario, to-be-processed audio data in which noise is present;
a determination module, configured to determine an audio feature vector corresponding to the to-be-processed audio data;
a processing module, configured to input the audio feature vector into a trained target vocoder model, the target vocoder model outputting target audio data corresponding to the audio feature vector; wherein the target audio data is audio data obtained after noise reduction is performed on the noise of the to-be-processed audio data.
Exemplarily, the processing apparatus further comprises:
a training module, configured to obtain the target vocoder model through training in the following manner:
acquiring sample audio data and sample text data corresponding to the sample audio data;
acquiring a text feature vector corresponding to the sample text data;
inputting the text feature vector into an initial vocoder model, the initial vocoder model outputting initial audio data corresponding to the text feature vector;
training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
Exemplarily, when training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model, the training module is configured to:
determine a target loss value based on the sample audio data and the initial audio data;
determine, based on the target loss value, whether the initial vocoder model has converged;
if not, adjust the parameters of the initial vocoder model based on the target loss value to obtain an adjusted vocoder model, take the adjusted vocoder model as the initial vocoder model, and return to the operation of inputting the text feature vector into the initial vocoder model;
if so, determine the converged initial vocoder model as the target vocoder model.
Exemplarily, when inputting the text feature vector into the initial vocoder model, the initial vocoder model outputting the initial audio data corresponding to the text feature vector, the training module is configured to:
input the text feature vector into a first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain a Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector;
input the MFCC feature vector into a second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
Exemplarily, when the training module acquires the sample audio data, multiple pieces of sample audio data are acquired, the multiple pieces including sample audio data with noise and sample audio data without noise; wherein the number of pieces of sample audio data without noise is greater than the number of pieces of sample audio data with noise.
Exemplarily, when determining the audio feature vector corresponding to the to-be-processed audio data, the determination module is configured to:
acquire an MFCC feature vector corresponding to the to-be-processed audio data;
determine, based on the MFCC feature vector, the audio feature vector corresponding to the to-be-processed audio data.
Exemplarily, the target vocoder model comprises a first target sub-model and a second target sub-model; the first target sub-model is used to map text feature vectors to MFCC feature vectors, and the second target sub-model is used to map MFCC feature vectors to audio data;
when inputting the audio feature vector into the trained target vocoder model, the target vocoder model outputting the target audio data corresponding to the audio feature vector, the processing module is configured to:
input the audio feature vector into the second target sub-model, the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector;
or, input the audio feature vector into the first target sub-model, the first target sub-model passing the audio feature vector to the second target sub-model, and the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
Exemplarily, the noise reduction application scenario is an application scenario requiring speech noise reduction; wherein the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
The present application provides an audio data processing device, comprising a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions executable by the processor; wherein the processor is configured to execute the machine-executable instructions to implement the following steps:
in a noise reduction application scenario, acquiring to-be-processed audio data in which noise is present;
determining an audio feature vector corresponding to the to-be-processed audio data;
inputting the audio feature vector into a trained target vocoder model, the target vocoder model outputting target audio data corresponding to the audio feature vector; wherein the target audio data is audio data obtained after noise reduction is performed on the noise of the to-be-processed audio data.
Exemplarily, the processor is caused to obtain the target vocoder model through training in the following manner:
acquiring sample audio data and sample text data corresponding to the sample audio data;
acquiring a text feature vector corresponding to the sample text data;
inputting the text feature vector into an initial vocoder model, the initial vocoder model outputting initial audio data corresponding to the text feature vector;
training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
Exemplarily, when the initial vocoder model is trained based on the sample audio data and the initial audio data to obtain the trained target vocoder model, the processor is caused to:
determine a target loss value based on the sample audio data and the initial audio data;
determine, based on the target loss value, whether the initial vocoder model has converged;
if not, adjust the parameters of the initial vocoder model based on the target loss value to obtain an adjusted vocoder model, take the adjusted vocoder model as the initial vocoder model, and return to the operation of inputting the text feature vector into the initial vocoder model;
if so, determine the converged initial vocoder model as the target vocoder model.
Exemplarily, when the text feature vector is input into the initial vocoder model and the initial vocoder model outputs the initial audio data corresponding to the text feature vector, the processor is caused to:
input the text feature vector into a first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain a Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector; and input the MFCC feature vector into a second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
Exemplarily, when the sample audio data is acquired, multiple pieces of sample audio data are acquired, the multiple pieces including sample audio data with noise and sample audio data without noise; wherein the number of pieces of sample audio data without noise is greater than the number of pieces of sample audio data with noise.
Exemplarily, when the audio feature vector corresponding to the to-be-processed audio data is determined, the processor is caused to:
acquire an MFCC feature vector corresponding to the to-be-processed audio data;
determine, based on the MFCC feature vector, the audio feature vector corresponding to the to-be-processed audio data.
Exemplarily, the target vocoder model comprises a first target sub-model and a second target sub-model; the first target sub-model is used to map text feature vectors to MFCC feature vectors, and the second target sub-model is used to map MFCC feature vectors to audio data;
when the audio feature vector is input into the trained target vocoder model and the target vocoder model outputs the target audio data corresponding to the audio feature vector, the processor is caused to:
input the audio feature vector into the second target sub-model, the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector;
or, input the audio feature vector into the first target sub-model, the first target sub-model passing the audio feature vector to the second target sub-model, and the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
Exemplarily, the noise reduction application scenario is an application scenario requiring speech noise reduction; wherein the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
As can be seen from the above technical solutions, in the embodiments of the present application, in a noise reduction application scenario, the audio feature vector corresponding to to-be-processed audio data in which noise is present can be input into the target vocoder model, and the target vocoder model outputs the target audio data corresponding to the audio feature vector, so that the target audio data is synthesized directly from the audio feature vector; that is, the target audio data is produced by speech synthesis. In other words, the target audio data is synthesized directly by means of speech synthesis, without attending to the noise in the to-be-processed audio data itself: the audio feature vector only needs to be input into the target vocoder model, and the target audio data can be generated by speech synthesis. Speech noise reduction is thus more reliable and has a stronger noise reduction capability, the audio data can be denoised effectively, and a very good noise reduction effect is achieved. By synthesizing target audio data from which noise has been removed, the non-stationary noise in the audio data can be removed, achieving the purpose of reducing non-stationary noise in audio data.
Brief Description of the Drawings
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below are merely some embodiments recorded in the present application; those of ordinary skill in the art may derive other drawings from these drawings.
Figure 1 is a schematic diagram of the training process of a vocoder model in an embodiment of the present application;
Figure 2 is a schematic flowchart of an audio data processing method in an embodiment of the present application;
Figure 3 is a schematic diagram of obtaining an MFCC feature vector in an embodiment of the present application;
Figure 4 is a schematic structural diagram of a speech synthesis system in an embodiment of the present application;
Figure 5 is a schematic flowchart of an audio data processing method in an embodiment of the present application;
Figure 6 is a schematic structural diagram of an audio data processing apparatus in an embodiment of the present application;
Figure 7 is a hardware structure diagram of an audio data processing device in an embodiment of the present application.
Detailed Description of Embodiments
The terms used in the embodiments of the present application are for the purpose of describing particular embodiments only and are not intended to limit the present application. The singular forms "a", "said" and "the" used in the present application and in the claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any or all possible combinations containing one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various pieces of information, the information should not be limited to these terms. These terms are only used to distinguish pieces of information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. In addition, depending on the context, the word "if" as used may be interpreted as "upon" or "when" or "in response to determining".
To remove the noise in audio data and improve the quality of the audio data, noise reduction processing needs to be performed on the audio data to obtain denoised audio data. For this purpose, single-channel signal processing noise reduction algorithms such as Wiener filtering and spectral subtraction may be used, multi-channel signal processing noise reduction algorithms such as beamforming and blind source separation may be used, and deep learning algorithms may also be used, performing noise reduction on the audio data by training a deep neural network.
Both the single-channel and the multi-channel noise reduction algorithms perform noise reduction directly on the noise in the audio data, and suffer from problems such as failing to denoise the audio data effectively and a poor noise reduction effect. For example, if non-stationary noise is present in the audio data, these algorithms cannot effectively reduce it, resulting in a poor noise reduction effect.
As for the deep-learning-based noise reduction approach, the reliability of deep learning algorithms is not high: certain noise (such as noise the model has not learned) cannot be reduced effectively, and the noise reduction effect is poor.
In view of the above findings, the embodiments of the present application propose an audio data processing method that synthesizes audio data directly from audio feature vectors, i.e. synthesizes audio data by means of speech synthesis. This is a fourth noise reduction method alongside single-channel noise reduction algorithms, multi-channel noise reduction algorithms and deep learning algorithms. It synthesizes audio data directly via speech synthesis without attending to the noise itself: the audio feature vector only needs to be input into the target vocoder model to generate the final audio data. Speech noise reduction thus becomes more reliable and gains a stronger noise reduction capability; it is a speech-synthesis-based speech noise reduction method that can enhance the speech signal and improve speech intelligibility.
The technical solutions of the embodiments of the present application are described below with reference to specific embodiments.
In the embodiments of the present application, in order to synthesize audio data by speech synthesis, a vocoder model training process and an audio data processing process may be involved. In the vocoder model training process, training data may be used to train a configured initial vocoder model (for ease of distinction, a vocoder model that has not finished training may be called an initial vocoder model) to obtain a trained target vocoder model (for ease of distinction, a vocoder model that has finished training may be called a target vocoder model).
In the audio data processing process, the audio feature vector may be input into the trained target vocoder model, and the target vocoder model directly synthesizes the audio data corresponding to the audio feature vector; that is, the target vocoder model can be used to synthesize audio data directly, yielding audio data from which the noise has been removed.
The vocoder model training process and the audio data processing process are described below.
For the vocoder model training process, a vocoder model may be configured in advance as the initial vocoder model. There is no restriction on the structure of this initial vocoder model, as long as it can convert text feature vectors into audio data; for example, it may be an initial vocoder model based on a deep learning algorithm, an initial vocoder model based on a neural network (such as a convolutional neural network), or another type of initial vocoder model.
Referring to Figure 1, a schematic diagram of the vocoder model training process, which trains the initial vocoder model into the target vocoder model (also called the target vocoder model training process), the process includes:
Step 101: acquire sample audio data and sample text data corresponding to the sample audio data.
Exemplarily, in order to train the target vocoder model, multiple pieces of sample audio data may be acquired (for ease of distinction, the audio data in the training process is called sample audio data), i.e. a large amount of sample audio data is acquired. The multiple pieces of sample audio data include sample audio data with noise and sample audio data without noise (which may also be called clean sample audio data). The number of pieces of sample audio data without noise may be greater than, equal to, or smaller than the number of pieces of sample audio data with noise. Of course, in practical applications, all of the acquired sample audio data may also be noise-free.
In summary, multiple pieces of sample audio data can be obtained, and these serve as training data for the initial vocoder model. That is, sample audio data with noise and sample audio data without noise can be used to train and optimize the initial vocoder model, obtaining a target vocoder model with noise reduction capability. Alternatively, only noise-free sample audio data may be used to train and optimize the initial vocoder model to obtain the target vocoder model.
Exemplarily, for each piece of sample audio data, the sample text data corresponding to that sample audio data may be acquired (for ease of distinction, the text data in the training process may be called sample text data); for example, the sample text data corresponding to the sample audio data may be configured in advance.
For example, the sample audio data may be audio (any sound that can be heard may be called audio), i.e. a segment of speech, such as the utterance "the weather is really nice today". The sample text data may be text (the written form of language, usually a combination of one or more sentences), i.e. a passage of writing, such as the words "the weather is really nice today". Obviously, regardless of whether the utterance contains noise, the corresponding text can be configured as "the weather is really nice today"; there is no restriction on this process.
In summary, for each piece of sample audio data, the corresponding sample text data can be acquired; there is no restriction on how the sample text data is acquired.
Step 102: acquire a text feature vector corresponding to the sample text data.
For example, for each piece of sample text data, the corresponding text feature vector can be acquired; that is, there is a correspondence among sample audio data, sample text data and text feature vectors. For example, sample audio data a1 and sample text data b1 correspond to text feature vector c1, sample audio data a2 and sample text data b2 correspond to text feature vector c2, and so on.
Exemplarily, for each piece of sample text data, the text features corresponding to that sample text data can be acquired. There may be at least one text feature, and all the text features can be combined into one feature vector, which is the text feature vector. The text features may be obtained by unsupervised methods such as TF-IDF (Term Frequency-Inverse Document Frequency), or by supervised methods such as chi-square, information gain and mutual information. There is no restriction on the acquisition method, as long as the text features corresponding to the sample text data can be obtained and combined into a text feature vector.
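For illustration, the following is a minimal sketch of turning sample text data into a text feature vector with the unsupervised TF-IDF method mentioned above. The scikit-learn vectorizer and the toy English corpus are assumptions of this sketch, not part of the application, and real Chinese sample text would first need word segmentation.

```python
# Minimal TF-IDF sketch (assumed implementation, not prescribed by the application).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the weather is really nice today",    # stand-in for segmented sample text b1
    "the weather will turn cloudy today",  # stand-in for segmented sample text b2
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # shape: (num_texts, vocabulary_size)

# Each row is one text feature vector; row 0 corresponds to sample text b1.
text_feature_vector = tfidf.toarray()[0]
print(text_feature_vector.shape)
```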
In summary, the text feature vector corresponding to each piece of sample text data can be acquired.
Step 103: input the text feature vector into the initial vocoder model, and let the initial vocoder model output the initial audio data corresponding to the text feature vector. Exemplarily, for each text feature vector, the text feature vector can be input into the initial vocoder model, which processes the text feature vector to obtain the initial audio data corresponding to it.
Obviously, text feature vectors correspond to initial audio data; that is, there is a correspondence among sample audio data, sample text data, text feature vectors and initial audio data. For example, sample audio data a1, sample text data b1 and text feature vector c1 correspond to initial audio data d1; sample audio data a2, sample text data b2 and text feature vector c2 correspond to initial audio data d2; and so on. From this correspondence it can be seen that sample audio data a1 corresponds to initial audio data d1 (initial audio data is also audio), and sample audio data a2 corresponds to initial audio data d2.
In a possible implementation, a vocoder model may be configured in advance as the initial vocoder model, whose function is to convert text feature vectors into audio data. There is no restriction on the structure of the initial vocoder model as long as it can convert text feature vectors into audio data; for example, it may be an initial vocoder model based on a deep learning algorithm, an initial vocoder model based on a neural network, and so on. On this basis, for each text feature vector, after it is input into the initial vocoder model, since the initial vocoder model is used to convert text feature vectors into audio data, the initial vocoder model can process the text feature vector to obtain the audio data corresponding to it. For ease of distinction, the audio data obtained by the initial vocoder model may be called initial audio data, and the initial audio data corresponding to the text feature vector is output.
In a possible implementation, the initial vocoder model may be divided into two sub-models, a first initial sub-model and a second initial sub-model; that is, the initial vocoder model may consist of the first initial sub-model and the second initial sub-model. The function of the first initial sub-model is to convert text feature vectors into MFCC (Mel Frequency Cepstrum Coefficient) feature vectors; there is no restriction on the structure of the first initial sub-model, as long as it can convert text feature vectors into MFCC feature vectors. The function of the second initial sub-model is to convert MFCC feature vectors into audio data; likewise, there is no restriction on the structure of the second initial sub-model, as long as it can convert MFCC feature vectors into audio data.
In the field of speech recognition, an MFCC feature vector is a set of feature vectors obtained by encoding the physical information of speech (such as the spectral envelope and spectral details); it consists of cepstral parameters extracted in the Mel-scale frequency domain, where the Mel scale describes the non-linear characteristics of frequency. Note that the MFCC feature vector is one implementation of a speech-parameter feature vector; besides MFCC feature vectors, speech-parameter feature vectors may also include LPC (Linear Prediction Coefficients) feature vectors, PLP (Perceptual Linear Predictive) feature vectors, LPCC (Linear Predictive Cepstral Coefficient) feature vectors, and the like.
In summary, the function of the first initial sub-model is to convert text feature vectors into speech-parameter feature vectors, and the function of the second initial sub-model is to convert speech-parameter feature vectors into audio data. For convenience of description, the embodiments of the present application take the case where the speech-parameter feature vector is an MFCC feature vector as an example; for LPC, PLP and LPCC feature vectors, the implementation is similar to that for MFCC feature vectors.
In step 103, the initial vocoder model consists of the first initial sub-model and the second initial sub-model. The text feature vector can be input into the first initial sub-model of the initial vocoder model, which processes the text feature vector to obtain the MFCC feature vector corresponding to it. The MFCC feature vector is then input into the second initial sub-model of the initial vocoder model, which processes the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
For example, for each text feature vector, after it is input into the first initial sub-model, since the first initial sub-model is used to convert text feature vectors into MFCC feature vectors, the first initial sub-model can process the text feature vector to obtain the corresponding MFCC feature vector (there is no restriction on this processing) and input the MFCC feature vector into the second initial sub-model. After the MFCC feature vector is input into the second initial sub-model, since the second initial sub-model is used to convert MFCC feature vectors into audio data, the second initial sub-model can process the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector (there is no restriction on this processing), and output that initial audio data.
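As a concrete illustration of this two-sub-model pipeline, below is a minimal PyTorch sketch. The feed-forward structure, the 128-dimensional text feature vector, the 80-dimensional MFCC feature vector and the 256-sample output frame are all assumptions of the sketch; as noted above, the application places no restriction on the sub-model structures.

```python
# Assumed two-stage initial vocoder model: text features -> MFCC -> audio frame.
import torch
import torch.nn as nn

TEXT_DIM, MFCC_DIM, FRAME_LEN = 128, 80, 256   # hypothetical dimensions

class FirstInitialSubmodel(nn.Module):
    """Converts a text feature vector into an MFCC feature vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(TEXT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, MFCC_DIM))
    def forward(self, text_vec):
        return self.net(text_vec)

class SecondInitialSubmodel(nn.Module):
    """Converts an MFCC feature vector into one frame of audio samples."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(MFCC_DIM, 512), nn.ReLU(),
                                 nn.Linear(512, FRAME_LEN), nn.Tanh())
    def forward(self, mfcc_vec):
        return self.net(mfcc_vec)

class InitialVocoder(nn.Module):
    """The first sub-model feeds its MFCC output into the second sub-model."""
    def __init__(self):
        super().__init__()
        self.first = FirstInitialSubmodel()
        self.second = SecondInitialSubmodel()
    def forward(self, text_vec):
        mfcc_vec = self.first(text_vec)   # text feature vector -> MFCC vector
        return self.second(mfcc_vec)      # MFCC vector -> initial audio data

vocoder = InitialVocoder()
initial_audio = vocoder(torch.randn(1, TEXT_DIM))  # one synthetic text vector
print(initial_audio.shape)                          # torch.Size([1, 256])
```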
Step 104: train the initial vocoder model based on the sample audio data and the initial audio data (i.e. adjust the parameters of the initial vocoder model) to obtain the trained target vocoder model.
For example, the sample audio data is real, existing audio data, while the initial audio data is the audio data produced by the initial vocoder model for the sample text data. Obviously, the smaller the loss value between the sample audio data and the initial audio data, the closer the two are, i.e. the better the performance of the initial vocoder model and the more accurate the initial audio data it produces. The larger the loss value, the more the two differ, i.e. the worse the performance of the initial vocoder model and the less accurate its initial audio data. In summary, the initial vocoder model can be trained based on the loss value between the sample audio data and the initial audio data to obtain the trained target vocoder model.
In a possible implementation, the initial vocoder model may be trained using the following steps:
Step 1041: determine a target loss value based on the sample audio data and the initial audio data.
Exemplarily, a loss function may be configured in advance. The input of this loss function may be the audio-signal loss value between the sample audio data and the initial audio data, and its output may be the target loss value. Therefore, the audio-signal loss value between the sample audio data and the initial audio data may be determined first, and then substituted into the loss function to obtain the target loss value.
Both the sample audio data and the initial audio data are audio signals, and the difference between them is the audio-signal loss value. For example, the sample audio data is quantized into sample audio data values on which computation can be performed, and the initial audio data is quantized into initial audio data values; there is no restriction on the quantization method. Quantization essentially digitizes the audio signals to obtain computable sample audio data values and initial audio data values. After those values are obtained, the absolute value of the difference between the sample audio data value and the initial audio data value may serve as the audio-signal loss value.
Of course, the above is merely an example of determining the target loss value, and is not limiting; it suffices that a target loss value between the sample audio data and the initial audio data can be obtained. The smaller the target loss value, the better the performance of the initial vocoder model and the more accurate the initial audio data relative to the sample audio data; the larger the target loss value, the worse the performance of the initial vocoder model and the less accurate the initial audio data relative to the sample audio data.
Exemplarily, referring to the above embodiments, multiple pieces of sample audio data and multiple pieces of initial audio data can be obtained, with a one-to-one correspondence between them; e.g. sample audio data a1 corresponds to initial audio data d1, sample audio data a2 corresponds to initial audio data d2, and so on.
Based on each data set (a data set comprising one piece of sample audio data and its corresponding initial audio data), the target loss value corresponding to that data set can be determined. Then a final target loss value is computed from the target loss values corresponding to all the data sets, e.g. their mean or median; there is no restriction on the computation method.
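A minimal sketch of this computation follows, assuming the configured loss function is simply the mean absolute difference between the digitized signals and that the final value is the mean over all data sets:

```python
# Assumed target-loss computation over (sample audio, initial audio) pairs.
import numpy as np

def audio_signal_loss(sample_audio, initial_audio):
    # Inputs are already quantized (digitized) sample values of equal length.
    return np.mean(np.abs(sample_audio - initial_audio))

def target_loss(pairs):
    # One pair per data set (a1/d1, a2/d2, ...); average the per-pair losses.
    return float(np.mean([audio_signal_loss(s, i) for s, i in pairs]))

pairs = [(np.random.uniform(-1, 1, 256), np.random.uniform(-1, 1, 256))
         for _ in range(4)]                 # synthetic stand-in data sets
print(target_loss(pairs))
```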
Step 1042: determine, based on the target loss value, whether the initial vocoder model has converged.
If not, step 1043 may be executed; if so, step 1044 may be executed.
For example, it may be determined whether the target loss value is smaller than a preset threshold. The preset threshold can be configured empirically, and there is no restriction on its value; for example, it may be a value greater than 0. If the target loss value is smaller than the preset threshold, the initial vocoder model is determined to have converged; if the target loss value is not smaller than the preset threshold, the initial vocoder model is determined not to have converged.
In practical applications, other ways of determining whether the initial vocoder model has converged may also be used. For example, the number of iterations of the initial vocoder model may be counted (adjusting the parameters of the initial vocoder model based on all the sample audio data in the training data set is called one iteration): if the iteration count reaches a count threshold, the initial vocoder model is determined to have converged; if not, it is determined not to have converged. Alternatively, the training duration of the initial vocoder model may be measured: if the training duration reaches a duration threshold, the initial vocoder model is determined to have converged; if not, it is determined not to have converged. Of course, these are only examples, and the determination method is not limited.
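The three convergence tests just described (loss threshold, iteration-count threshold and training-duration threshold) might be combined as in the following sketch; all three threshold values are assumed placeholders, since the application leaves them configurable:

```python
# Assumed convergence check combining the three criteria described above.
import time

def has_converged(loss, iterations, start_time,
                  loss_thresh=0.01, max_iters=10000, max_seconds=3600):
    if loss < loss_thresh:                        # target loss below threshold
        return True
    if iterations >= max_iters:                   # iteration count reached
        return True
    if time.time() - start_time >= max_seconds:   # training duration reached
        return True
    return False
```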
Step 1043: adjust the parameters of the initial vocoder model based on the target loss value to obtain an adjusted vocoder model, take the adjusted vocoder model as the initial vocoder model, and return to the operation of inputting the text feature vector into the initial vocoder model, i.e. return to step 103.
Exemplarily, based on the target loss value, a back-propagation algorithm (such as gradient descent) may be used to adjust the parameters of the initial vocoder model to obtain the adjusted vocoder model. There is no restriction on this parameter adjustment process, as long as the parameters of the initial vocoder model can be adjusted and the adjusted vocoder model makes the target loss value between the sample audio data and the initial audio data smaller.
Exemplarily, the initial vocoder model may consist of the first initial sub-model and the second initial sub-model; therefore, the parameters of the first initial sub-model may be adjusted to obtain an adjusted first initial sub-model, and the parameters of the second initial sub-model may be adjusted to obtain an adjusted second initial sub-model, and the adjusted first initial sub-model and the adjusted second initial sub-model together constitute the adjusted vocoder model.
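One such parameter-adjustment step might look like the following sketch, using back-propagation with plain gradient descent. The single feed-forward network stands in for the two sub-models (whose parameters are adjusted together), and the L1 loss and learning rate are assumptions:

```python
# Assumed single training step: forward pass, target loss, back-propagation.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 80), nn.ReLU(), nn.Linear(80, 256))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # gradient descent
loss_fn = nn.L1Loss()  # mean |sample audio - initial audio|, as sketched above

text_vecs = torch.randn(8, 128)             # batch of text feature vectors
sample_audio = torch.rand(8, 256) * 2 - 1   # digitized sample audio values

initial_audio = model(text_vecs)            # initial audio from the model
loss = loss_fn(initial_audio, sample_audio) # target loss value
optimizer.zero_grad()
loss.backward()                             # back-propagate the target loss
optimizer.step()                            # adjust the model parameters
```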
Step 1044: determine the converged initial vocoder model as the target vocoder model. This completes the vocoder model training process: the initial vocoder model is trained with the training data (e.g. multiple pieces of sample audio data and multiple pieces of sample text data) to obtain the trained target vocoder model.
Exemplarily, the converged initial vocoder model may consist of the first initial sub-model and the second initial sub-model. The first initial sub-model of the converged initial vocoder model may be denoted the first target sub-model, and the second initial sub-model of the converged initial vocoder model may be denoted the second target sub-model; thus the target vocoder model may consist of the first target sub-model and the second target sub-model.
For the audio data processing process: based on the trained target vocoder model, audio data in which noise is present can be processed to obtain audio data after noise reduction. Referring to Figure 2, a schematic diagram of the audio data processing method, the audio data processing method may include:
Step 201: in a noise reduction application scenario, acquire to-be-processed audio data in which noise is present.
Exemplarily, audio noise problems frequently arise in audio systems such as voice calls, video conferencing, broadcasting and home theater; therefore, noise reduction application scenarios may include, without limitation, audio systems such as voice calls, video conferencing, broadcasting and home theater. Of course, these are only a few examples; the noise reduction application scenario is not limited and may be any application scenario requiring speech noise reduction. For example, the noise reduction application scenario may be a voice call application scenario, or a video conference application scenario, or a voice intercom application scenario, and so on.
Exemplarily, the audio data in the noise reduction application scenario is audio data in which noise is present, and this audio data may be called to-be-processed audio data; therefore, to-be-processed audio data in which noise is present can be acquired.
Step 202: determine an audio feature vector corresponding to the to-be-processed audio data.
Exemplarily, for the to-be-processed audio data, the corresponding audio features can be acquired. There may be at least one audio feature, and all the audio features can be combined into one feature vector, which is the audio feature vector. The audio feature vector is a feature vector related to speech parameters and may include, without limitation, MFCC feature vectors, LPC feature vectors, PLP feature vectors, LPCC feature vectors, and the like; the type of the audio feature vector is not limited. The MFCC feature vector is taken as an example below; other types of audio feature vectors are implemented similarly.
In a possible implementation, determining the audio feature vector corresponding to the to-be-processed audio data may include, without limitation: acquiring the MFCC feature vector corresponding to the to-be-processed audio data, and determining the audio feature vector based on the MFCC feature vector; for example, the MFCC feature vector may be taken as the audio feature vector corresponding to the to-be-processed audio data.
For example, to-be-processed audio data with a frame length of M milliseconds (e.g. 16 ms) may be acquired first; then an MFCC feature vector, e.g. an 80-dimensional MFCC feature vector, is extracted from the to-be-processed audio data and taken as the audio feature vector corresponding to the to-be-processed audio data.
Exemplarily, acquiring the MFCC feature vector corresponding to the to-be-processed audio data may include, without limitation, the following: performing windowing, fast Fourier transform, Mel filter-bank filtering, logarithm operation and discrete cosine transform on the to-be-processed audio data to obtain the MFCC feature vector.
For example, referring to Figure 3, a schematic diagram of obtaining the MFCC feature vector: first, continuous audio may be pre-emphasized and divided into frames, producing multiple frames of audio data; each frame of audio data is one piece of the above to-be-processed audio data, e.g. 16 ms of to-be-processed audio data.
Then, the to-be-processed audio data may be windowed to obtain windowed data; a fast Fourier transform (FFT) is applied to the windowed data to obtain FFT data; the FFT data is filtered with a Mel filter bank to obtain filtered data; a logarithm operation is applied to the filtered data to obtain log data; and a discrete cosine transform (DCT) is applied to the log data to obtain DCT data. The DCT data is the MFCC feature vector, and the MFCC feature vector is thus obtained.
In the above steps, this embodiment places no restriction on how the windowing, fast Fourier transform, Mel filter-bank filtering, logarithm operation, discrete cosine transform and other processing are implemented.
Of course, the above is merely one example of acquiring the MFCC feature vector corresponding to the to-be-processed audio data; the implementation is not limited, as long as the MFCC feature vector can be obtained.
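As one concrete, non-limiting realization, the following sketch mirrors each Figure 3 stage (windowing, FFT, Mel filter-bank filtering, logarithm, DCT) for a single 16 ms frame. The Hann window, the 1024-point zero-padded FFT and the librosa Mel filter bank are assumptions of the sketch:

```python
# Assumed MFCC extraction for one 16 ms frame of to-be-processed audio data.
import numpy as np
import librosa
from scipy.fftpack import dct

sr = 16000
frame = np.random.randn(int(0.016 * sr))               # 16 ms frame, 256 samples

windowed = frame * np.hanning(len(frame))              # windowing
spectrum = np.abs(np.fft.rfft(windowed, n=1024)) ** 2  # FFT -> power spectrum
mel_fb = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=80)
mel_energies = mel_fb @ spectrum                       # Mel filter-bank filtering
log_mel = np.log(mel_energies + 1e-10)                 # logarithm operation
mfcc_vector = dct(log_mel, norm="ortho")               # DCT -> 80-dim MFCC vector
print(mfcc_vector.shape)                               # (80,)
```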
Step 203: input the audio feature vector into the trained target vocoder model, and let the target vocoder model output the target audio data corresponding to the audio feature vector. Exemplarily, the target audio data may be audio data obtained after noise reduction is performed on the noise of the to-be-processed audio data.
Exemplarily, the target vocoder model can inverse-transform the audio feature vector (i.e. the acoustic feature vector) into a corresponding sound waveform, and the sound waveforms are then concatenated into synthesized speech; that synthesized speech is the target audio data corresponding to the audio feature vector. In this approach, the target audio data can be synthesized directly from the audio feature vector, i.e. synthesized directly by means of speech synthesis, without attending to the noise of the to-be-processed audio data itself; speech noise reduction is thus more reliable and has a stronger noise reduction capability. Therefore, after the audio feature vector is input into the target vocoder model, the target vocoder model can process the audio feature vector to obtain the target audio data corresponding to the audio feature vector, the target audio data being audio data after noise reduction has been performed on the noise of the to-be-processed audio data. In other words, denoised target audio data is synthesized by means of speech synthesis.
Exemplarily, the target vocoder model includes a first target sub-model and a second target sub-model. The first target sub-model (with the same function as the first initial sub-model in the initial vocoder model) is used to map text feature vectors to MFCC feature vectors, and the second target sub-model (with the same function as the second initial sub-model in the initial vocoder model) is used to map MFCC feature vectors to audio data. On this basis:
In one possible implementation, the audio feature vector (i.e. the MFCC feature vector) may be input into the second target sub-model of the target vocoder model, which processes the audio feature vector to obtain the target audio data corresponding to the audio feature vector. For example, the MFCC feature vector is input directly into the second target sub-model (i.e. the MFCC feature vector bypasses the first target sub-model and goes straight to the second target sub-model). Since the second target sub-model is used to map MFCC feature vectors to audio data, upon receiving the MFCC feature vector it can process the MFCC feature vector to obtain the target audio data corresponding to the MFCC feature vector (there is no restriction on this processing), and output that target audio data.
In another possible implementation, the audio feature vector (i.e. the MFCC feature vector) may instead be input into the first target sub-model of the target vocoder model, which passes the audio feature vector on to the second target sub-model of the target vocoder model, and the second target sub-model processes the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
For example, the MFCC feature vector is first input into the first target sub-model; upon receiving the MFCC feature vector, the first target sub-model does not process it but passes it to the second target sub-model. Since the second target sub-model is used to map MFCC feature vectors to audio data, it can process the MFCC feature vector to obtain the target audio data corresponding to the MFCC feature vector (there is no restriction on this processing), and output that target audio data.
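Both inference paths might be sketched as follows, with a hypothetical trained second target sub-model and a pass-through function standing in for the first target sub-model; the two paths yield identical target audio data, because the first target sub-model does not process the MFCC feature vector during noise reduction:

```python
# Assumed inference paths: MFCC vector to target audio data, direct or routed.
import torch
import torch.nn as nn

second_target = nn.Sequential(nn.Linear(80, 512), nn.ReLU(),
                              nn.Linear(512, 256), nn.Tanh())  # MFCC -> audio

def first_target_passthrough(mfcc_vec):
    # The first target sub-model only hands the MFCC vector on, unprocessed.
    return mfcc_vec

mfcc_vec = torch.randn(1, 80)               # 80-dimensional MFCC feature vector
direct = second_target(mfcc_vec)            # path 1: straight to the second
routed = second_target(first_target_passthrough(mfcc_vec))  # path 2: via first
assert torch.equal(direct, routed)          # both paths give the same output
```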
As can be seen from the above technical solutions, in the embodiments of the present application, in a noise reduction application scenario, the audio feature vector corresponding to to-be-processed audio data in which noise is present can be input into the target vocoder model, and the target vocoder model outputs the target audio data corresponding to the audio feature vector, so that the target audio data is synthesized directly from the audio feature vector; that is, the target audio data is produced by speech synthesis. In other words, the target audio data is synthesized directly by means of speech synthesis, without attending to the noise in the to-be-processed audio data itself: the audio feature vector only needs to be input into the target vocoder model, and the target audio data can be generated by speech synthesis. Speech noise reduction is thus more reliable and has a stronger noise reduction capability, the audio data can be denoised effectively, and a very good noise reduction effect is achieved. By synthesizing target audio data from which noise has been removed, the non-stationary noise in the audio data can be removed, achieving the purpose of reducing non-stationary noise in audio data.
The audio data processing method is described below in connection with a specific application scenario.
Referring to Figure 4, a schematic structural diagram of a speech synthesis system: the system structure may include a text analysis module, a prosody processing module, an acoustic processing module and a speech synthesis module. The text analysis module and the prosody processing module are front-end modules, and the acoustic processing module and the speech synthesis module are back-end modules.
The text analysis module is used to simulate the human process of understanding natural speech, so that the computer can fully understand the input text and provide the acoustic processing module and the speech synthesis module with information such as pronunciations and pauses.
The prosody processing module is used to process the various segmental features of pronunciation, such as pitch, duration and intensity, so that the synthesized speech expresses the semantics correctly and sounds more natural; it then extracts text features according to the word segmentation and annotation results and turns the text features into a sequence of text feature vectors.
The acoustic processing module (i.e. the acoustic model) is used to establish a mapping from text feature vectors to acoustic feature vectors; after processing by the acoustic processing module, text feature vectors become acoustic feature vectors.
The speech synthesis module (i.e. the vocoder) is used to inverse-transform acoustic feature vectors into corresponding sound waveforms; for example, multiple acoustic feature vectors can be inverse-transformed into multiple corresponding sound waveforms, and the multiple sound waveforms can then be concatenated in sequence to obtain the synthesized speech.
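For intuition only, the following sketch performs the speech synthesis module's job (inverse-transforming acoustic feature vectors into waveforms and splicing them) with librosa's classical Griffin-Lim-based MFCC inversion standing in for a trained neural vocoder; the frame count and sampling rate are assumptions:

```python
# Classical-DSP stand-in for the vocoder: MFCC frames -> waveforms -> splice.
import numpy as np
import librosa

sr = 16000
mfcc_frames = np.random.randn(80, 40)  # 40 frames of 80-dim MFCCs (synthetic)

# Inverse transform: MFCC -> Mel spectrogram -> waveform (Griffin-Lim inside).
waveform = librosa.feature.inverse.mfcc_to_audio(mfcc_frames, n_mels=80, sr=sr)

# Concatenating per-segment waveforms in sequence yields the synthesized speech.
half = len(waveform) // 2
synthesized = np.concatenate([waveform[:half], waveform[half:]])
print(synthesized.shape)
```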
Based on the above speech synthesis system structure, in the embodiments of the present application, only the speech synthesis module may be retained, and the text analysis module, the prosody processing module and the acoustic processing module removed. In that case, the audio feature vector corresponding to the to-be-processed audio data, such as the MFCC feature vector, can be determined directly, and the speech synthesis module can obtain, based on the target vocoder model, the target audio data corresponding to that MFCC feature vector, the target audio data being audio data after noise reduction has been performed on the noise of the to-be-processed audio data. In other words, the MFCC feature vector replaces the related functions of the text analysis module, the prosody processing module and the acoustic processing module, and the target vocoder model is used directly to synthesize speech, realizing a new way of noise reduction.
Referring to Figure 5, a flowchart of the audio data processing method, the method may include:
Step 501: acquire to-be-processed audio data with a frame length of M milliseconds (e.g. 16 ms).
Step 502: extract an N-dimensional (e.g. 80-dimensional) MFCC feature vector from the to-be-processed audio data.
Step 503: input the MFCC feature vector into the target vocoder model, and let the target vocoder model output the target audio data corresponding to the MFCC feature vector, accomplishing noise reduction of the audio data.
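Putting steps 501 to 503 together under the assumptions of the earlier sketches (16 ms frames, 80-dimensional MFCC feature vectors, and an untrained feed-forward network standing in for the target vocoder model):

```python
# Assumed end-to-end sketch of the Figure 5 flow on one second of toy audio.
import numpy as np
import torch
import torch.nn as nn
import librosa

sr, frame_len = 16000, int(0.016 * 16000)       # step 501: 16 ms frames
noisy = np.random.randn(sr).astype(np.float32)  # 1 s of noisy audio (toy data)

# Stand-in for the trained target vocoder model: 80-dim MFCC -> one audio frame.
target_vocoder = nn.Sequential(nn.Linear(80, 512), nn.ReLU(),
                               nn.Linear(512, frame_len), nn.Tanh())

# Step 502: one 80-dimensional MFCC feature vector per 16 ms hop.
mfccs = librosa.feature.mfcc(y=noisy, sr=sr, n_mfcc=80, n_mels=80,
                             n_fft=1024, hop_length=frame_len)

# Step 503: map every MFCC vector to a denoised frame, then splice the frames.
with torch.no_grad():
    frames = target_vocoder(torch.from_numpy(mfccs.T.copy())).numpy()
denoised = frames.reshape(-1)
print(denoised.shape)
```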
Based on the same application concept as the above method, an embodiment of the present application provides an audio data processing apparatus. Referring to Figure 6, a schematic structural diagram of the apparatus, the apparatus may comprise:
an acquisition module 61, configured to acquire, in a noise reduction application scenario, to-be-processed audio data in which noise is present;
a determination module 62, configured to determine an audio feature vector corresponding to the to-be-processed audio data;
a processing module 63, configured to input the audio feature vector into a trained target vocoder model, the target vocoder model outputting target audio data corresponding to the audio feature vector; wherein the target audio data is audio data obtained after noise reduction is performed on the noise of the to-be-processed audio data.
In a possible implementation, the apparatus further comprises (not shown in Figure 6):
a training module, configured to obtain the target vocoder model through training in the following manner:
acquiring sample audio data and sample text data corresponding to the sample audio data;
acquiring a text feature vector corresponding to the sample text data;
inputting the text feature vector into an initial vocoder model, the initial vocoder model outputting initial audio data corresponding to the text feature vector;
training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
Exemplarily, when training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model, the training module is configured to:
determine a target loss value based on the sample audio data and the initial audio data;
determine, based on the target loss value, whether the initial vocoder model has converged;
if not, adjust the parameters of the initial vocoder model based on the target loss value to obtain an adjusted vocoder model, take the adjusted vocoder model as the initial vocoder model, and return to the operation of inputting the text feature vector into the initial vocoder model;
if so, determine the converged initial vocoder model as the target vocoder model.
Exemplarily, when inputting the text feature vector into the initial vocoder model, the initial vocoder model outputting the initial audio data corresponding to the text feature vector, the training module is configured to:
input the text feature vector into the first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain the Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector; and input the MFCC feature vector into the second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
In a possible implementation, when the training module acquires the sample audio data, multiple pieces of sample audio data are acquired, the multiple pieces including sample audio data with noise and sample audio data without noise; wherein the number of pieces of sample audio data without noise is greater than the number of pieces of sample audio data with noise.
In a possible implementation, when determining the audio feature vector corresponding to the to-be-processed audio data, the determination module 62 is specifically configured to: acquire the MFCC feature vector corresponding to the to-be-processed audio data; and determine, based on the MFCC feature vector, the audio feature vector corresponding to the to-be-processed audio data.
In a possible implementation, the target vocoder model comprises a first target sub-model and a second target sub-model; the first target sub-model is used to map text feature vectors to MFCC feature vectors, and the second target sub-model is used to map MFCC feature vectors to audio data. When the processing module 63 inputs the audio feature vector into the trained target vocoder model and the target vocoder model outputs the target audio data corresponding to the audio feature vector, the processing module 63 is specifically configured to: input the audio feature vector into the second target sub-model, the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector; or, input the audio feature vector into the first target sub-model, the first target sub-model passing the audio feature vector to the second target sub-model, and the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
In a possible implementation, the noise reduction application scenario is an application scenario requiring speech noise reduction; wherein the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
Based on the same application concept as the above method, an embodiment of the present application provides an audio data processing device. Referring to Figure 7, the audio data processing device comprises a processor 71 and a machine-readable storage medium 72, the machine-readable storage medium 72 storing machine-executable instructions executable by the processor 71; the processor 71 is configured to execute the machine-executable instructions to implement the following steps:
in a noise reduction application scenario, acquiring to-be-processed audio data in which noise is present;
determining an audio feature vector corresponding to the to-be-processed audio data;
inputting the audio feature vector into a trained target vocoder model, the target vocoder model outputting target audio data corresponding to the audio feature vector; wherein the target audio data is audio data obtained after noise reduction is performed on the noise of the to-be-processed audio data.
In a possible implementation, the processor is caused to obtain the target vocoder model through training in the following manner:
acquiring sample audio data and sample text data corresponding to the sample audio data;
acquiring a text feature vector corresponding to the sample text data;
inputting the text feature vector into an initial vocoder model, the initial vocoder model outputting initial audio data corresponding to the text feature vector;
training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
In a possible implementation, when the initial vocoder model is trained based on the sample audio data and the initial audio data to obtain the trained target vocoder model, the processor is caused to:
determine a target loss value based on the sample audio data and the initial audio data;
determine, based on the target loss value, whether the initial vocoder model has converged;
if not, adjust the parameters of the initial vocoder model based on the target loss value to obtain an adjusted vocoder model, take the adjusted vocoder model as the initial vocoder model, and return to the operation of inputting the text feature vector into the initial vocoder model;
if so, determine the converged initial vocoder model as the target vocoder model.
In a possible implementation, when the text feature vector is input into the initial vocoder model and the initial vocoder model outputs the initial audio data corresponding to the text feature vector, the processor is caused to:
input the text feature vector into the first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain the Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector; and input the MFCC feature vector into the second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
In a possible implementation, when the sample audio data is acquired, multiple pieces of sample audio data are acquired, the multiple pieces including sample audio data with noise and sample audio data without noise; wherein the number of pieces of sample audio data without noise is greater than the number of pieces of sample audio data with noise.
In a possible implementation, when the audio feature vector corresponding to the to-be-processed audio data is determined, the processor is caused to:
acquire the MFCC feature vector corresponding to the to-be-processed audio data;
determine, based on the MFCC feature vector, the audio feature vector corresponding to the to-be-processed audio data.
In a possible implementation, the target vocoder model comprises a first target sub-model and a second target sub-model; the first target sub-model is used to map text feature vectors to MFCC feature vectors, and the second target sub-model is used to map MFCC feature vectors to audio data;
when the audio feature vector is input into the trained target vocoder model and the target vocoder model outputs the target audio data corresponding to the audio feature vector, the processor is caused to:
input the audio feature vector into the second target sub-model, the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector;
or, input the audio feature vector into the first target sub-model, the first target sub-model passing the audio feature vector to the second target sub-model, and the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
In a possible implementation, the noise reduction application scenario is an application scenario requiring speech noise reduction; wherein the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
Based on the same application concept as the above method, an embodiment of the present application further provides a machine-readable storage medium on which several computer instructions are stored; when the computer instructions are executed by a processor, the audio data processing method disclosed in the above examples of the present application can be implemented.
The above machine-readable storage medium may be any electronic, magnetic, optical or other physical storage device, and may contain or store information such as executable instructions, data, and so on. For example, the machine-readable storage medium may be a RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disk (such as an optical disc or DVD), or a similar storage medium, or a combination thereof.
The systems, apparatuses, modules or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementing device is a computer, which may take the specific form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, e-mail transceiver, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described with its functions divided into various units. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Moreover, these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included within the scope of the claims of the present application.

Claims (24)

  1. An audio data processing method, comprising:
    in a noise reduction application scenario, acquiring to-be-processed audio data in which noise is present;
    determining an audio feature vector corresponding to the to-be-processed audio data;
    inputting the audio feature vector into a trained target vocoder model, and outputting, by the target vocoder model, target audio data corresponding to the audio feature vector; wherein the target audio data is audio data obtained after noise reduction is performed on the noise of the to-be-processed audio data.
  2. The method according to claim 1, wherein,
    the training process of the target vocoder model comprises:
    acquiring sample audio data and sample text data corresponding to the sample audio data;
    acquiring a text feature vector corresponding to the sample text data;
    inputting the text feature vector into an initial vocoder model, and outputting, by the initial vocoder model, initial audio data corresponding to the text feature vector;
    training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
  3. The method according to claim 2, wherein,
    training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model comprises:
    determining a target loss value based on the sample audio data and the initial audio data;
    determining, based on the target loss value, whether the initial vocoder model has converged;
    if not, adjusting the parameters of the initial vocoder model based on the target loss value to obtain an adjusted vocoder model, taking the adjusted vocoder model as the initial vocoder model, and returning to the operation of inputting the text feature vector into the initial vocoder model;
    if so, determining the converged initial vocoder model as the target vocoder model.
  4. The method according to claim 2, wherein,
    inputting the text feature vector into the initial vocoder model and outputting, by the initial vocoder model, the initial audio data corresponding to the text feature vector comprises:
    inputting the text feature vector into a first initial sub-model of the initial vocoder model, and processing the text feature vector by the first initial sub-model to obtain a Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector; inputting the MFCC feature vector into a second initial sub-model of the initial vocoder model, and processing the MFCC feature vector by the second initial sub-model to obtain the initial audio data corresponding to the text feature vector.
  5. The method according to any one of claims 2-4, wherein,
    when the sample audio data is acquired, multiple pieces of sample audio data are acquired, the multiple pieces comprising sample audio data with noise and sample audio data without noise; wherein the number of pieces of sample audio data without noise is greater than the number of pieces of sample audio data with noise.
  6. The method according to claim 1, wherein,
    determining the audio feature vector corresponding to the to-be-processed audio data comprises:
    acquiring an MFCC feature vector corresponding to the to-be-processed audio data;
    determining, based on the MFCC feature vector, the audio feature vector corresponding to the to-be-processed audio data.
  7. The method according to claim 6, wherein the target vocoder model comprises a first target sub-model and a second target sub-model, the first target sub-model is used to map text feature vectors to MFCC feature vectors, and the second target sub-model is used to map MFCC feature vectors to audio data;
    inputting the audio feature vector into the trained target vocoder model and outputting, by the target vocoder model, the target audio data corresponding to the audio feature vector comprises:
    inputting the audio feature vector into the second target sub-model, and processing the audio feature vector by the second target sub-model to obtain the target audio data corresponding to the audio feature vector;
    or, inputting the audio feature vector into the first target sub-model, passing the audio feature vector from the first target sub-model to the second target sub-model, and processing the audio feature vector by the second target sub-model to obtain the target audio data corresponding to the audio feature vector.
  8. The method according to any one of claims 1-4 and 6-7, wherein,
    the noise reduction application scenario is an application scenario requiring speech noise reduction; wherein the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
  9. An audio data processing apparatus, comprising:
    an acquisition module, configured to acquire, in a noise reduction application scenario, to-be-processed audio data in which noise is present;
    a determination module, configured to determine an audio feature vector corresponding to the to-be-processed audio data;
    a processing module, configured to input the audio feature vector into a trained target vocoder model, the target vocoder model outputting target audio data corresponding to the audio feature vector; wherein the target audio data is audio data obtained after noise reduction is performed on the noise of the to-be-processed audio data.
  10. The processing apparatus according to claim 9, further comprising:
    a training module, configured to obtain the target vocoder model through training in the following manner:
    acquiring sample audio data and sample text data corresponding to the sample audio data;
    acquiring a text feature vector corresponding to the sample text data;
    inputting the text feature vector into an initial vocoder model, the initial vocoder model outputting initial audio data corresponding to the text feature vector;
    training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
  11. The processing apparatus according to claim 10, wherein, when training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model, the training module is configured to:
    determine a target loss value based on the sample audio data and the initial audio data;
    determine, based on the target loss value, whether the initial vocoder model has converged;
    if not, adjust the parameters of the initial vocoder model based on the target loss value to obtain an adjusted vocoder model, take the adjusted vocoder model as the initial vocoder model, and return to the operation of inputting the text feature vector into the initial vocoder model;
    if so, determine the converged initial vocoder model as the target vocoder model.
  12. The processing apparatus according to claim 10, wherein,
    when inputting the text feature vector into the initial vocoder model, the initial vocoder model outputting the initial audio data corresponding to the text feature vector, the training module is configured to:
    input the text feature vector into a first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain a Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector;
    input the MFCC feature vector into a second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
  13. The processing apparatus according to any one of claims 10-12, wherein,
    when the training module acquires the sample audio data, multiple pieces of sample audio data are acquired, the multiple pieces comprising sample audio data with noise and sample audio data without noise; wherein the number of pieces of sample audio data without noise is greater than the number of pieces of sample audio data with noise.
  14. The processing apparatus according to claim 9, wherein,
    when determining the audio feature vector corresponding to the to-be-processed audio data, the determination module is configured to:
    acquire an MFCC feature vector corresponding to the to-be-processed audio data;
    determine, based on the MFCC feature vector, the audio feature vector corresponding to the to-be-processed audio data.
  15. The processing apparatus according to claim 14, wherein the target vocoder model comprises a first target sub-model and a second target sub-model, the first target sub-model is used to map text feature vectors to MFCC feature vectors, and the second target sub-model is used to map MFCC feature vectors to audio data;
    when inputting the audio feature vector into the trained target vocoder model, the target vocoder model outputting the target audio data corresponding to the audio feature vector, the processing module is configured to:
    input the audio feature vector into the second target sub-model, the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector;
    or, input the audio feature vector into the first target sub-model, the first target sub-model passing the audio feature vector to the second target sub-model, and the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
  16. The processing apparatus according to any one of claims 9-12 and 14-15, wherein,
    the noise reduction application scenario is an application scenario requiring speech noise reduction; wherein the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
  17. An audio data processing device, comprising: a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions executable by the processor; wherein the processor is configured to execute the machine-executable instructions to implement the following steps:
    in a noise reduction application scenario, acquiring to-be-processed audio data in which noise is present;
    determining an audio feature vector corresponding to the to-be-processed audio data;
    inputting the audio feature vector into a trained target vocoder model, and outputting, by the target vocoder model, target audio data corresponding to the audio feature vector; wherein the target audio data is audio data obtained after noise reduction is performed on the noise of the to-be-processed audio data.
  18. The processing device according to claim 17, wherein the processor is caused to obtain the target vocoder model through training in the following manner:
    acquiring sample audio data and sample text data corresponding to the sample audio data;
    acquiring a text feature vector corresponding to the sample text data;
    inputting the text feature vector into an initial vocoder model, the initial vocoder model outputting initial audio data corresponding to the text feature vector;
    training the initial vocoder model based on the sample audio data and the initial audio data to obtain the trained target vocoder model.
  19. The processing device according to claim 18, wherein,
    when the initial vocoder model is trained based on the sample audio data and the initial audio data to obtain the trained target vocoder model, the processor is caused to:
    determine a target loss value based on the sample audio data and the initial audio data;
    determine, based on the target loss value, whether the initial vocoder model has converged;
    if not, adjust the parameters of the initial vocoder model based on the target loss value to obtain an adjusted vocoder model, take the adjusted vocoder model as the initial vocoder model, and return to the operation of inputting the text feature vector into the initial vocoder model;
    if so, determine the converged initial vocoder model as the target vocoder model.
  20. The processing device according to claim 18, wherein,
    when the text feature vector is input into the initial vocoder model and the initial vocoder model outputs the initial audio data corresponding to the text feature vector, the processor is caused to:
    input the text feature vector into a first initial sub-model of the initial vocoder model, the first initial sub-model processing the text feature vector to obtain a Mel-frequency cepstral coefficient (MFCC) feature vector corresponding to the text feature vector; and input the MFCC feature vector into a second initial sub-model of the initial vocoder model, the second initial sub-model processing the MFCC feature vector to obtain the initial audio data corresponding to the text feature vector.
  21. The processing device according to any one of claims 18-20, wherein,
    when the sample audio data is acquired, multiple pieces of sample audio data are acquired, the multiple pieces comprising sample audio data with noise and sample audio data without noise; wherein the number of pieces of sample audio data without noise is greater than the number of pieces of sample audio data with noise.
  22. The processing device according to claim 17, wherein,
    when the audio feature vector corresponding to the to-be-processed audio data is determined, the processor is caused to:
    acquire an MFCC feature vector corresponding to the to-be-processed audio data;
    determine, based on the MFCC feature vector, the audio feature vector corresponding to the to-be-processed audio data.
  23. The processing device according to claim 22, wherein the target vocoder model comprises a first target sub-model and a second target sub-model, the first target sub-model is used to map text feature vectors to MFCC feature vectors, and the second target sub-model is used to map MFCC feature vectors to audio data;
    when the audio feature vector is input into the trained target vocoder model and the target vocoder model outputs the target audio data corresponding to the audio feature vector, the processor is caused to:
    input the audio feature vector into the second target sub-model, the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector;
    or, input the audio feature vector into the first target sub-model, the first target sub-model passing the audio feature vector to the second target sub-model, and the second target sub-model processing the audio feature vector to obtain the target audio data corresponding to the audio feature vector.
  24. The processing device according to any one of claims 17-20 and 22-23, wherein,
    the noise reduction application scenario is an application scenario requiring speech noise reduction; wherein the noise reduction application scenario is a voice call application scenario, or the noise reduction application scenario is a video conference application scenario.
PCT/CN2022/106380 2021-07-20 2022-07-19 Audio data processing method, apparatus and device WO2023001128A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110820027.5 2021-07-20
CN202110820027.5A CN113571047A (zh) Audio data processing method, apparatus and device

Publications (1)

Publication Number Publication Date
WO2023001128A1 true WO2023001128A1 (zh) 2023-01-26

Family

ID=78165740

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/106380 WO2023001128A1 (zh) Audio data processing method, apparatus and device

Country Status (2)

Country Link
CN (1) CN113571047A (zh)
WO (1) WO2023001128A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571047A (zh) Audio data processing method, apparatus and device
CN115662409B (zh) Speech recognition method, apparatus, device and storage medium
CN116386611B (zh) Denoising method for a teaching sound-field environment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050143988A1 (en) * 2003-12-03 2005-06-30 Kaori Endo Noise reduction apparatus and noise reducing method
US20130060567A1 (en) * 2008-03-28 2013-03-07 Alon Konchitsky Front-End Noise Reduction for Speech Recognition Engine
CN108630190A (zh) Method and apparatus for generating a speech synthesis model
CN109065067A (zh) Neural-network-model-based speech noise reduction method for a conference terminal
CN110491404A (zh) Voice processing method and apparatus, terminal device and storage medium
CN111223493A (zh) Speech signal noise reduction processing method, microphone and electronic device
WO2020191271A1 (en) * 2019-03-20 2020-09-24 Research Foundation Of The City University Of New York Method for extracting speech from degraded signals by predicting the inputs to a speech vocoder
CN113053400A (zh) Training method for an audio signal noise reduction model, audio signal noise reduction method and device
CN113571047A (zh) Audio data processing method, apparatus and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111653261A (zh) Speech synthesis method and apparatus, readable storage medium and electronic device
CN112599141B (zh) Neural network vocoder training method and apparatus, electronic device and storage medium
CN112530400A (zh) Deep-learning-based text-to-speech method, system, apparatus and medium
CN112634866B (zh) Speech synthesis model training and speech synthesis method, apparatus, device and medium
CN112786006B (zh) Speech synthesis method, synthesis model training method, apparatus, medium and device

Also Published As

Publication number Publication date
CN113571047A (zh) 2021-10-29

Similar Documents

Publication Publication Date Title
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
Wang et al. Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking
Žmolíková et al. Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures
JP6993353B2 (ja) ニューラルネットワークベースの声紋情報抽出方法及び装置
Weninger et al. Single-channel speech separation with memory-enhanced recurrent neural networks
WO2023001128A1 (zh) Audio data processing method, apparatus and device
Han et al. Learning spectral mapping for speech dereverberation and denoising
Delcroix et al. Strategies for distant speech recognitionin reverberant environments
Krueger et al. Model-based feature enhancement for reverberant speech recognition
CN109767756B (zh) 一种基于动态分割逆离散余弦变换倒谱系数的音声特征提取算法
WO2018223727A1 (zh) 识别声纹的方法、装置、设备及介质
EP4004906A1 (en) Per-epoch data augmentation for training acoustic models
US11600284B2 (en) Voice morphing apparatus having adjustable parameters
Chougule et al. Robust spectral features for automatic speaker recognition in mismatch condition
CA3195578A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
Ganapathy Multivariate autoregressive spectrogram modeling for noisy speech recognition
Su et al. Perceptually-motivated environment-specific speech enhancement
US11100940B2 (en) Training a voice morphing apparatus
CN114333865A (zh) 一种模型训练以及音色转换方法、装置、设备及介质
Yan et al. An initial investigation for detecting vocoder fingerprints of fake audio
Nguyen et al. Feature adaptation using linear spectro-temporal transform for robust speech recognition
JP2016143042A (ja) 雑音除去装置及び雑音除去プログラム
CN109741761B (zh) 声音处理方法和装置
Kaur et al. Speaker and speech recognition using deep neural network
WO2020015546A1 (zh) 一种远场语音识别方法、语音识别模型训练方法和服务器

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22845293

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE