WO2022141868A1 - Method and apparatus for extracting speech features, terminal, and storage medium - Google Patents


Info

Publication number
WO2022141868A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
feature
speech
voice data
features
Application number
PCT/CN2021/084166
Other languages
French (fr)
Chinese (zh)
Inventor
张之勇
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2022141868A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: characterised by the type of extracted parameters
    • G10L 25/27: characterised by the analysis technique
    • G10L 25/30: characterised by the analysis technique using neural networks

Definitions

  • the present application belongs to the field of computer technology, and in particular relates to a method, device, terminal and storage medium for extracting speech features.
  • Applications of intelligent speech technology typically retrain a speech model, or optimize an existing one, by labeling large amounts of supervised data, a process that consumes considerable manpower, money, and time. Moreover, very little labeled speech data is available for direct use as training samples, which hinders the training of speech models. Unsupervised speech feature extraction methods have therefore emerged.
  • The inventor realized that, due to the complexity and variability of speech data, it is difficult for existing speech models trained by unsupervised learning to learn effective features of the speech data, so the speech features extracted by such models are inaccurate.
  • In view of this, the embodiments of the present application provide a method, apparatus, terminal, and storage medium for extracting speech features, so as to solve the problem that existing speech models obtained by unsupervised learning struggle to learn effective features of speech data, which makes the speech features extracted with such models inaccurate.
  • A first aspect of the embodiments of the present application provides a method for extracting speech features, including:
  • acquiring voice data to be processed; and
  • inputting the voice data into a trained voice feature extraction model for processing to obtain a target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • A second aspect of the embodiments of the present application provides an apparatus for extracting speech features, including:
  • an acquisition unit, configured to acquire voice data to be processed; and
  • a processing unit, configured to input the voice data into a trained voice feature extraction model for processing to obtain a target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • A third aspect of the embodiments of the present application provides a terminal for extracting speech features, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements:
  • acquiring voice data to be processed; and
  • inputting the voice data into a trained voice feature extraction model for processing to obtain a target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • A fourth aspect of the embodiments of the present application provides a computer storage medium, where the computer storage medium stores a computer program, and the computer program, when executed by a processor, implements:
  • acquiring voice data to be processed; and
  • inputting the voice data into a trained voice feature extraction model for processing to obtain a target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • A fifth aspect of the embodiments of the present application provides a computer program product that, when the computer program product runs on a terminal that extracts voice features, causes the terminal that extracts voice features to execute:
  • acquiring voice data to be processed; and
  • inputting the voice data into a trained voice feature extraction model for processing to obtain a target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • The beneficial effects of the embodiments of the present application are that, on the one hand, the quantity of sample voice data is enlarged and, on the other hand, it is not necessary to manually provide sample voice data, which saves considerable manpower, money, and time.
  • FIG. 1 is a schematic flowchart of a method for extracting speech features provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application.
  • FIG. 3 is a schematic diagram of a speech feature extraction model structure provided by the application.
  • FIG. 4 is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application.
  • FIG. 5 is a schematic diagram of an apparatus for extracting speech features provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a terminal for extracting speech features provided by another embodiment of the present application.
  • Applications of intelligent speech technology typically retrain a speech model, or optimize an existing one, by labeling large amounts of supervised data, a process that consumes considerable manpower, money, and time. Moreover, very little labeled speech data is available for direct use as training samples, which hinders the training of speech models. Unsupervised speech feature extraction methods have therefore emerged.
  • The inventor realized that, due to the complexity and variability of speech data, it is difficult for existing speech models trained by unsupervised learning to learn effective features of the speech data, so the speech features extracted using such models are not accurate.
  • To this end, the present application provides a method for extracting voice features.
  • The voice feature extraction model used in the method is obtained based on self-supervised learning: taking the sample voice features corresponding to the original voice data in each sample voice data pair as the target, it is trained on the difference between the original voice data and the enhanced voice data in each sample voice data pair, where the enhanced voice data in each pair is obtained by performing data enhancement processing on the original voice data.
  • A speech feature extraction model trained in this way has learned the ability to extract, from the enhanced voice data, the voice features corresponding to the original voice data, which can be understood as the ability to extract, from distorted voice data, the voice features corresponding to the undistorted voice data. In other words, the model has learned how to extract effective speech features, so that in actual use it can extract effective, informative, and accurately expressed target speech features.
  • When these target speech features are applied in intelligent speech task processing scenarios, the processing results are therefore more accurate.
  • Moreover, the speech feature extraction model can generate enhanced speech data from the original speech data during training. On the one hand, this expands the quantity of sample speech data; on the other hand, it avoids manually preparing large amounts of sample speech data, saving considerable manpower, money, and time.
  • FIG. 1 is a schematic flowchart of a method for extracting speech features provided by an embodiment of the present application.
  • The execution subject of the method for extracting voice features in this embodiment is a terminal, a server, or the like, where the terminal includes, but is not limited to, mobile terminals such as smartphones, tablet computers, and personal digital assistants (PDAs), and may also include terminals such as desktop computers.
  • In the following, a terminal is taken as the execution subject for description.
  • the method for extracting speech features as shown in FIG. 1 may include S101 to S102, and the details are as follows:
  • S101 Acquire voice data to be processed.
  • The speech data to be processed is the speech data from which speech features need to be extracted.
  • the extracted speech features can be applied to different intelligent speech task processing scenarios.
  • the extracted speech features can be applied to scenarios such as speech recognition, speaker identification, language recognition, speech translation, simultaneous translation, and speech control.
  • In different application scenarios, the voice data to be processed may be the same or different.
  • For example, in some scenarios the voice data to be processed may be a complete piece of speech uploaded to the terminal in advance; if voice features need to be extracted in a voice control scenario, the voice data to be processed may be the speech uttered by the user and captured through a built-in sound pickup device (e.g., a microphone or a sound card). This is only an exemplary description and is not limiting.
  • Different application scenarios acquire the voice data to be processed in different ways.
  • For example, the voice data may be obtained by capturing the user's speech through a built-in sound pickup device (e.g., a microphone or a sound card).
  • Alternatively, the voice data may be obtained by the user uploading the to-be-processed voice data to the terminal in advance, from which the terminal retrieves it.
  • For example, when the terminal detects a feature extraction instruction, it obtains, according to the file identifier carried in the instruction, the file corresponding to that identifier and extracts the speech data to be processed from the file. This is only an exemplary description and is not limiting.
  • S102 Input the voice data into a trained voice feature extraction model for processing, and obtain a target voice feature corresponding to the voice data.
  • The voice feature extraction model is obtained based on self-supervised learning: taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target, it is trained on the difference between the original voice data and the enhanced voice data in each sample voice data pair, where the enhanced voice data is obtained by performing data enhancement processing on the original voice data.
  • a pre-trained voice feature extraction model is pre-stored in the terminal for extracting voice features.
  • The voice feature extraction model adopts self-supervised learning: it takes the sample voice features corresponding to the original voice data in each sample voice data pair as the target and is trained on the difference between the original voice data and the enhanced voice data in each pair.
  • the enhanced speech data in each sample speech data pair is obtained by performing data enhancement processing on the original speech data in each sample speech data pair.
  • The original voice data is pure voice data, that is, voice data that contains no noise or impurities and is not distorted.
  • the enhanced speech data is obtained by performing any one or more of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping speech processing on the original speech data.
  • In the related art, the sample speech data obtained are speech data containing noise, impurities, and distortion, together with speech features extracted from such data.
  • These speech data and speech features are used for machine-learning training, so that the trained speech model acquires the ability to extract speech features from speech data containing noise, impurities, and distortion.
  • However, because of the complexity and variability of speech data, and because the learning target in this approach is speech features extracted from speech data that already contains noise, impurities, and distortion, the speech model struggles to learn effective speech features during training.
  • As a result, the finally trained speech model cannot extract effective, accurate, and rich speech features when actually processing speech data, which in turn leads to inaccurate processing results when the model is applied in various intelligent speech task processing scenarios.
  • In the related art, an unsupervised learning method is also used to train a speech model.
  • Unsupervised learning refers to finding structure in the input data without any target, with the purpose of better understanding the correlations within the data.
  • Unsupervised speech feature extraction methods mainly include the principal component analysis method and methods based on a Gaussian mixture model. Both methods assume that the speech data obey a Gaussian distribution, and the dimensionality must be reduced manually during execution.
  • However, speech data do not necessarily conform to a Gaussian distribution, and manual dimensionality reduction inevitably causes the loss of high-dimensional features. As a result, such speech models cannot extract effective, accurate, and rich speech features when actually processing speech data, which in turn leads to inaccurate processing results when the models are applied in various intelligent speech task processing scenarios.
  • In the embodiments of the present application, by contrast, self-supervised learning is adopted, and the sample speech features extracted from the original speech data are used as the target of self-supervised learning.
  • This target is clear, and since the original speech data is speech data without noise, impurities, or distortion, the sample speech features extracted from it are more accurate, rich, and effective.
  • Performing data enhancement processing on the original voice data to obtain enhanced voice data is equivalent to increasing the number of training samples;
  • the type of data enhancement processing can be controlled, so that the speech feature extraction model can learn various effective speech features in a targeted manner during the training process.
  • the speech feature extraction model obtained by the final training can extract effective, accurate and rich speech features when actually processing speech data.
  • When these speech features are applied in various intelligent speech task processing scenarios, the processing results are therefore more accurate.
  • It should be noted that the speech feature extraction model may be pre-trained by the terminal that extracts speech features, or the file corresponding to the speech feature extraction model may be pre-trained by another device and then transplanted to the terminal. That is to say, the execution subject that trains the speech feature extraction model and the execution subject that uses it for speech feature extraction may be the same or different. For example, when another device is used to train the initial speech feature extraction model, after that device finishes training, the model parameters of the initial speech feature extraction model are fixed to obtain the file corresponding to the trained speech feature extraction model, and this file is then ported to the terminal that extracts speech features.
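  • For illustration only, the following is a minimal sketch of how a terminal might load a ported speech feature extraction model and run it on voice data to be processed. It assumes PyTorch and torchaudio are available and that the model was exported as a TorchScript file; the file names and export format are hypothetical and are not specified by the present application.

```python
import torch
import torchaudio  # assumed available for reading audio files

MODEL_PATH = "speech_feature_extractor.pt"  # hypothetical file ported from the training device
AUDIO_PATH = "utterance.wav"                # hypothetical voice data to be processed

# Load the trained speech feature extraction model (trained on this or another device).
model = torch.jit.load(MODEL_PATH)
model.eval()

# S101: acquire the voice data to be processed and convert it into a waveform.
waveform, sample_rate = torchaudio.load(AUDIO_PATH)  # shape: (channels, samples)

# S102: input the waveform into the trained model to obtain the target speech features.
with torch.no_grad():
    target_features = model(waveform)
print(target_features.shape)
```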
  • FIG. 2 is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application.
  • the foregoing S102 may include S1021 to S1023, and the details are as follows:
  • S1021 Input the voice data into the convolution filter for processing to obtain a first voice feature corresponding to the voice data, where the first voice feature includes a frequency feature.
  • The trained speech feature extraction model includes a convolutional filter, a convolutional encoder, and a quasi-recurrent neural network.
  • FIG. 3 is a schematic diagram of the structure of a speech feature extraction model provided by the present application.
  • the convolutional filter can be an interpretable convolutional filter (SincNet)
  • the convolutional encoder is composed of 7 convolutional neural network layers (ConvNet)
  • The quasi-recurrent neural network can be a Quasi-Recurrent Neural Network (QRNN). This is only an exemplary description and is not limiting.
  • When the voice data to be processed is processed by the trained voice feature extraction model, the voice data may first be converted into a waveform.
  • The conversion may be performed by existing speech-to-waveform conversion software.
  • Input the converted waveform into SincNet, and SincNet performs a time-domain convolution operation on the input waveform based on a sliding window with a preset duration to obtain the first voice feature corresponding to the voice data.
  • The first voice feature may include frequency features, Mel-frequency cepstral coefficient (MFCC) features, filter bank (Fbank) features, waveform (wave) features, log-power spectrum (Lps) features, and the like.
  • the frequency features may include audio features, fundamental frequency features, frequency band features, and the like.
  • the preset duration can be adjusted according to the actual situation, for example, in this embodiment, it can be set as a sliding window of 10 milliseconds. Speech data is time-sequential, and the input waveform is subjected to a time-domain convolution operation based on a sliding window with a preset duration.
  • The time-domain convolution operation performed by SincNet on the input waveform can be expressed by the following formula (1):
  • y[n] represents the first speech feature output by SincNet
  • x[n] represents the input waveform
  • h[n] is a preset filter of length L.
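  • The body of formula (1) is not reproduced in the text above; the standard discrete time-domain convolution consistent with the definitions of y[n], x[n], h[n], and the filter length L would read as follows (a reconstruction, not a quotation of the original formula):

```latex
% Reconstructed form of formula (1): time-domain convolution of the input
% waveform x[n] with a preset filter h[n] of length L.
y[n] = (x * h)[n] = \sum_{l=0}^{L-1} x[n-l]\, h[l]
```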
  • S1022 Perform convolution processing on the first voice feature by the convolutional encoder to obtain a second voice feature, where the second voice feature includes an MFCC feature and an Fbank feature.
  • The second voice feature may include MFCC features, Fbank features, wave features, Lps features, gamma (Gamma) features, prosody (Proso) features, and the like.
  • In this embodiment, the convolutional encoder is composed of 7 ConvNets. The first ConvNet performs convolution processing on the first speech feature to obtain a first processing result; the first processing result is input to the second ConvNet, which performs convolution processing on it to obtain a second processing result; and so on, until the last ConvNet performs convolution processing on the processing result passed from the previous ConvNet and outputs the second speech feature.
  • The first ConvNet convolves the first speech feature based on a preset convolution kernel, which can be understood as the first ConvNet performing feature selection within the first speech feature and removing redundant features to obtain the first processing result.
  • That is, features such as the MFCC feature, the Fbank feature, the wave feature, the Lps feature, the gamma (Gamma) feature, and the prosody (Proso) feature are extracted according to the information in the first speech feature.
  • the first processing result is input into the second ConvNet, and the second ConvNet further performs convolution on the basis of the features extracted by the first ConvNet to extract deeper features to obtain the second processing result.
  • the second speech feature is obtained after the last ConvNet performs convolution processing on the processing result passed by the previous ConvNet.
  • Optionally, the seventh processing result may be input into a down-sampling layer for processing, and the down-sampling layer then outputs the second speech feature.
  • The processing of the seventh processing result by the down-sampling layer can be expressed by the following formula (2):
  • P j,m represents the output of the downsampling layer
  • j represents the processing result of the jth ConvNet
  • m represents the mth downsampling band
  • n represents the downsampling factor
  • r represents the size of the down-sampling window, that is, how many bands of data are down-sampled together.
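  • The body of formula (2) is likewise missing from the text. One plausible reading, consistent with the variable definitions above, is average pooling of the j-th ConvNet output over r adjacent bands with a down-sampling factor n; this is an assumption rather than the original formula:

```latex
% Hypothetical reconstruction of formula (2): the m-th down-sampled band of
% the j-th ConvNet output x_j, averaging r adjacent bands with factor n.
P_{j,m} = \frac{1}{r} \sum_{i=0}^{r-1} x_{j}\!\left[m \cdot n + i\right]
```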
  • For example, the frequency of a man's voice is generally lower than that of a woman's; after the down-sampling processing, such a difference is generally reduced and can be largely eliminated, so that the extracted speech features are more accurate.
  • S1023 Input the second voice feature into the quasi-recurrent neural network for processing to obtain the target voice feature, where the target voice feature includes a target waveform feature, a target log power spectrum feature, a target spectral feature, a target filter bank feature, a target gamma feature, and a target prosody feature.
  • the second voice feature is input into the QRNN for processing to obtain the target voice feature corresponding to the voice data to be processed.
  • Specifically, the target speech features may include target waveform (wave) features, target log-power spectrum (Lps) features, target spectral (MFCC) features, target filter bank (Fbank) features, target gamma (Gamma) features, target prosody (Proso) features, long-term log-power spectrum (Long-Lps) features, long-term Mel-frequency cepstral coefficient (Long-MFCC) features, long-term filter bank (Long-Fbank) features, long-term gamma (Long Gamma) features, and the like. It is worth noting that some features of the first voice feature, the second voice feature, and the target voice feature are of the same type; the difference is that the corresponding features in the first and second voice features are less informative and less accurately expressed, whereas after processing by the quasi-recurrent neural network, the obtained target speech features are informative and accurately expressed.
  • The first layer in the QRNN is a convolution layer (Conv 1D), which is used to extract features from the input second speech feature; Sigmoid and Tanh are activation functions used in the QRNN; and the second layer is a pooling layer, which is used to reduce the number of features.
  • the pooling layer in QRNN adopts the fo-pool method.
  • The extraction of features from the second speech feature by the convolutional layer in the QRNN can be expressed by the following formula (3):
  • X represents the input second speech feature
  • Z, F, and O represent the gate outputs computed with the parameters W (W_z, W_f, and W_o, respectively);
  • W_z, W_f, and W_o represent convolution filters of a preset size R.
  • the features extracted by the convolutional layer are input into the pooling layer for processing, and the target speech features are output.
  • The processing by the pooling layer of the features extracted by the convolutional layer can be expressed by the following formulas (4) and (5):
  • c_t represents the cell state vector at time t;
  • h_t represents the hidden state vector at time t.
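  • The bodies of formulas (3), (4), and (5) are not reproduced above. The standard quasi-recurrent neural network equations that match these variable definitions (gated convolutions followed by fo-pooling) are given below as a reconstruction:

```latex
% Formula (3): gated convolutions of the input sequence X with filters W_z, W_f, W_o.
Z = \tanh(W_z * X), \qquad F = \sigma(W_f * X), \qquad O = \sigma(W_o * X)

% Formulas (4) and (5): fo-pooling, producing the cell state c_t and hidden state h_t.
c_t = f_t \odot c_{t-1} + (1 - f_t) \odot z_t
h_t = o_t \odot c_t
```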
  • S1024 to S1025 may be further included after S1022, and the details are as follows:
  • S1024 Extract a third speech feature corresponding to the second speech feature based on the quasi-recurrent neural network.
  • The third voice feature is of the same type as each feature included in the target voice feature; that is, the third voice feature includes MFCC features, Fbank features, wave features, Lps features, Gamma features, Proso features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features, and the like. This is only an exemplary description and is not limiting.
  • The second voice feature is input into the QRNN for processing to obtain the third voice feature corresponding to the second voice feature.
  • For the specific processing of the second speech feature by the quasi-recurrent neural network, reference may be made to the description in S1023, which is not repeated here.
  • S1025 Combine the second voice feature with the third voice feature in a skip connection manner to obtain the target voice feature.
  • Both the second voice feature and the third voice feature are represented as vectors, and the second voice feature and the third voice feature are added correspondingly (element-wise) to obtain the target voice feature. If a certain type of feature included in the third voice feature is not included in the second voice feature, the vector corresponding to that type of feature in the second voice feature defaults to 0. This is only an exemplary description and is not limiting.
  • the convolutional encoder is composed of 7 ConvNets, and each ConvNet has a corresponding processing result.
  • For example, combining the second voice feature with the third voice feature by means of skip connections may mean correspondingly adding the first processing result of the first ConvNet, the third processing result of the third ConvNet, and the fifth processing result of the fifth ConvNet to the third voice feature to obtain the target voice feature.
  • Alternatively, the second processing result of the second ConvNet, the fourth processing result of the fourth ConvNet, and the sixth processing result of the sixth ConvNet may be correspondingly added to the third voice feature to obtain the target voice feature. This is only an exemplary description and is not limiting.
  • In this way, the target speech feature is expressed as the sum of the features found by the convolutional encoder, so that the finally obtained target speech feature is more informative and more accurately expressed; a minimal sketch of this combination is given below.
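  • The sketch assumes that the second and third voice features are tensors whose last dimension concatenates the individual feature types; the zero-padding mirrors the zero-default behavior mentioned above, and the function name and tensor layout are illustrative only.

```python
import torch

def combine_skip(second_feature: torch.Tensor, third_feature: torch.Tensor) -> torch.Tensor:
    """Correspondingly (element-wise) add the second and third voice features."""
    if second_feature.shape[-1] < third_feature.shape[-1]:
        # Feature types absent from the second voice feature default to zero vectors.
        padded = torch.zeros_like(third_feature)
        padded[..., : second_feature.shape[-1]] = second_feature
        second_feature = padded
    return second_feature + third_feature
```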
  • In this embodiment, the speech feature extraction model takes the sample speech features corresponding to the original speech data in each sample speech data pair as the target and, based on self-supervised learning, is trained on the difference between the original speech data and the enhanced speech data in each sample speech data pair, where the enhanced speech data in each pair is obtained by performing data enhancement processing on the original speech data.
  • A speech feature extraction model trained in this way has learned the ability to extract, from the enhanced speech data, the speech features corresponding to the original speech data, which can be understood as the ability to extract, from distorted speech data, the speech features corresponding to the undistorted speech data. Therefore, the speech feature extraction model can extract effective, informative, and accurate target speech features in actual use.
  • When these target speech features are applied in intelligent speech task processing scenarios, the processing results are more accurate.
  • Moreover, the speech feature extraction model can generate enhanced speech data from the original speech data during training. On the one hand, this expands the quantity of sample speech data; on the other hand, it avoids manually preparing sample speech data, saving considerable manpower, money, and time.
  • FIG. 4 is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application.
  • the method may include S201-S206.
  • For steps S205 to S206 shown in FIG. 4, reference may be made to the relevant descriptions of S101 to S102 in the embodiment corresponding to FIG. 1, which are not repeated here for brevity. Steps S201 to S204 are specifically described below.
  • S201 Input a plurality of sample speech data pairs in the sample speech data set into an initial speech feature extraction model for processing to obtain a sample speech feature corresponding to each original speech data and a real speech feature corresponding to each enhanced speech data.
  • the sample speech data set includes a plurality of sample speech data pairs, and each sample speech data pair includes one original speech data and one enhanced speech data.
  • the enhanced voice data in each sample voice data pair is obtained from the original voice data in the sample voice data pair after data enhancement processing.
  • the data enhancement processing may be any one of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping speech processing, or any multiple processing.
  • Optionally, a probability value may be preset for each kind of data enhancement processing, and, based on the preset probability values, data enhancement processing is performed on the acquired original voice data to obtain the enhanced voice data corresponding to the original voice data in each sample voice data pair. Each probability value indicates the likelihood that the corresponding data enhancement processing is applied to a given piece of original speech data.
  • the probability value corresponding to reverberation processing is 0.5
  • the probability value corresponding to noise processing is 0.4
  • the probability value corresponding to frequency masking processing is 0.4
  • the probability value corresponding to time masking processing is 0.2
  • the probability value corresponding to clipping processing is 0.2
  • The probability value corresponding to overlapping speech processing is 0.1. That is to say, there is a probability of 0.5 of performing reverberation processing on a given piece of original voice data, a probability of 0.4 of performing noise addition processing, a probability of 0.4 of performing frequency masking processing, and so on; a minimal sketch of this sampling procedure is given below.
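  • The sketch assumes the augmentations are applied independently of one another, which the text does not state explicitly; the probability values are those listed above, and the augmentation callables themselves are hypothetical and left to the caller.

```python
import random

# Preset probability of applying each kind of data enhancement processing.
AUGMENT_PROBS = {
    "reverberation": 0.5,
    "noise": 0.4,
    "frequency_masking": 0.4,
    "time_masking": 0.2,
    "clipping": 0.2,
    "overlapping_speech": 0.1,
}

def make_enhanced_speech(original, augmentations):
    """Apply each augmentation to the original speech with its preset probability.

    `augmentations` maps a name in AUGMENT_PROBS to a callable that transforms
    the speech signal (e.g. {"noise": add_noise}); these callables are not
    specified by the text and are hypothetical.
    """
    enhanced = original
    for name, prob in AUGMENT_PROBS.items():
        if name in augmentations and random.random() < prob:
            enhanced = augmentations[name](enhanced)
    return enhanced
```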
  • reverberation processing is achieved by convolving the signal corresponding to the original speech data with a set of 1300 impulse responses, which are derived graphically. Impulse responses simulate different acoustic conditions with reverberation times ranging from 0.3 to 0.9 seconds.
  • the noise in the noise processing is extracted from the preset FreeSound dataset and DIRHA dataset.
  • the noise in the noise processing can include background noise and non-stationary noise, such as alarms, door knocks, telephone ringing, TV sounds, etc.
  • the signal-to-noise ratio is randomly sampled between 0 and 10dB.
  • the frequency masking process is realized by filtering the time signal corresponding to the original speech data with a band-stop filter.
  • the temporal masking process is achieved by setting random segments in the original speech data to zero. Clipping is achieved by adding random saturation to the raw speech data. Overlapping speech processing is implemented by overlapping speech signals in the original speech data with the main signal corresponding to the original speech data.
  • Inputting multiple sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing means that the original speech data in each sample speech data pair is input into the initial speech feature extraction model for processing, and
  • the enhanced speech data in each sample speech data pair is input into the initial speech feature extraction model for processing.
  • the initial speech feature extraction model outputs the sample speech features corresponding to each original speech data, and outputs the real speech features corresponding to each enhanced speech data.
  • the initial speech feature extraction model includes an initial convolution filter, an initial convolution encoder and an initial quasi-recurrent neural network.
  • the initial convolutional filter can be an interpretable convolutional filter (SincNet)
  • the initial convolutional encoder is composed of 7 convolutional neural network layers (ConvNet)
  • the initial quasi-cyclic neural network can be QRNN.
  • In FIG. 3, Skip Connections represents the skip connections;
  • FC represents the processing result of skip selection in 7 ConvNets.
  • The Workers at the top of FIG. 3 represent 12 self-supervised tasks, each implemented with a small feedforward neural network (typically one hidden layer with 256 hidden units).
  • Each of these 12 self-supervised tasks corresponds to one type of speech feature extracted from the speech data; this can be generally understood as supervising the sample speech features corresponding to each original speech data against the real speech features output for each enhanced speech data.
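  • A minimal sketch of one such worker is given below, assuming PyTorch; the class name, dimensions, and the use of a regression head are illustrative, with the single 256-unit hidden layer taken from the text.

```python
import torch.nn as nn

class Worker(nn.Module):
    """One self-supervised worker: a small feed-forward network (one hidden
    layer of 256 units) that predicts one type of speech feature from the
    encoder output."""

    def __init__(self, encoder_dim: int, feature_dim: int, hidden_units: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, hidden_units),
            nn.ReLU(),
            nn.Linear(hidden_units, feature_dim),
        )

    def forward(self, encoded):
        return self.net(encoded)
```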
  • the Speech Distortion (voice distortion) in Figure 3 represents the data enhancement process, and the speech segment below the Speech Distortion represents the original speech data.
  • a processing method is to process the original voice data through an initial voice feature extraction model to obtain sample voice features corresponding to the original voice data.
  • One processing method is to first perform Speech Distortion processing on the original voice data, that is, data enhancement processing, to obtain enhanced voice data corresponding to the original voice data, and then extract the real voice features corresponding to the enhanced voice data.
  • For the specific process of extracting the sample voice features and the real voice features, reference may be made to the description in S102, which is not repeated here.
  • S202 For each sample voice data pair, calculate, according to a preset loss function, the loss value between the sample voice feature corresponding to the original voice data in the sample voice data pair and the real voice feature corresponding to the enhanced voice data in the sample voice data pair.
  • The loss value between the sample voice feature corresponding to the original voice data in each sample voice data pair and the real voice feature corresponding to the enhanced voice data in that pair can be used to measure the accuracy of the voice features extracted by the initial voice feature extraction model.
  • It can be understood that the original voice data is pure voice data, that is, voice data without noise, impurities, or distortion, and the sample voice features corresponding to the original voice data are standard, informative, and accurately expressed voice features; these are the learning target of the initial speech feature extraction model.
  • The enhanced speech data is obtained by performing data enhancement processing on the original speech data and therefore contains noise and impurities. When voice features identical to the sample voice features corresponding to the original voice data can be extracted from the enhanced voice data, the training of the initial voice feature extraction model is complete.
  • the preset loss function may be a mean square error function, a mean absolute error function, etc., which is not limited.
  • the sample speech features may include MFCC features, Fbank features, wave features, Lps features, Gamma features, Proso features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features, and the like.
  • Correspondingly, the real speech features may also include waveform (wave) features, log power spectrum (Lps) features, spectral (MFCC) features, filter bank (Fbank) features, gamma features, prosody features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features, and the like.
  • the loss value between the sample speech features and the real speech features is calculated based on a preset loss function. It is worth noting that, since each sample speech feature and real speech feature contain corresponding multiple types of features, the final loss value is the sum of the loss values between each group of the same type of features.
  • the sample voice features include MFCC features, Fbank features, and wave features
  • the real voice features include MFCC features, Fbank features, and wave features.
  • Then the loss value between the sample voice feature and the real voice feature is the sum of the loss value between the MFCC feature of the sample voice feature and the MFCC feature of the real voice feature, the loss value between the Fbank feature of the sample voice feature and the Fbank feature of the real voice feature, and the loss value between the wave feature of the sample voice feature and the wave feature of the real voice feature. This is only an exemplary description and is not limiting; a minimal sketch of this computation is given below.
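  • The sketch assumes PyTorch tensors and uses mean squared error as one admissible choice of the preset loss function; the dictionary keys are illustrative.

```python
import torch
import torch.nn.functional as F

def total_feature_loss(sample_feats: dict, real_feats: dict) -> torch.Tensor:
    """Sum the per-feature-type losses between sample and real speech features."""
    total = torch.zeros(())
    for name, sample in sample_feats.items():   # e.g. "mfcc", "fbank", "wave"
        total = total + F.mse_loss(sample, real_feats[name])
    return total
```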
  • the preset condition may be that the loss value is less than or equal to the preset loss value threshold, or that the loss value falls within the preset error range, but it is not limited to this, and can also be set according to the actual situation, which is not limited here.
  • the preset condition is that the loss value is less than or equal to the preset loss value threshold. Then, when the device performing the training process confirms that the current loss value is greater than the preset loss value threshold, it is determined that the voice features extracted by the current initial voice feature extraction model have not yet met the requirements. At this time, it is necessary to adjust the model parameters of the initial speech feature extraction model, then return to S201, and continue to execute S201 and S202, until the loss value determined in S202 is less than or equal to the preset loss value threshold, execute S204.
  • the preset condition is that the loss value is less than or equal to the preset loss value threshold. Then, when the device performing the training process confirms that the current loss value is less than or equal to the preset loss value threshold, it determines that the training of the current initial speech feature extraction model meets the expected requirements, and stops training the initial speech feature extraction model.
  • At this point, the initial speech feature extraction model, after its model parameters have been adjusted, has been trained on a large number of samples, and its loss value remains within a small range.
  • Using the initial speech feature extraction model to process speech data can therefore yield informative and accurately expressed speech features. The initial speech feature extraction model at the moment training stops (that is, after the last round of training is completed) can thus be determined as the trained speech feature extraction model.
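  • A minimal sketch of this stopping criterion is shown below, assuming PyTorch; `compute_epoch_loss` stands for the (unspecified) routine that runs S201 and S202 over the sample speech data set, and the loop corresponds to adjusting the model parameters (S203) until the preset condition is met (S204).

```python
def train_until_converged(model, optimizer, compute_epoch_loss, loss_threshold, max_rounds=1000):
    """Repeat S201-S202 and adjust parameters until the loss meets the preset condition."""
    for _ in range(max_rounds):
        loss = compute_epoch_loss(model)      # S201 + S202: forward pass and loss value
        if loss.item() <= loss_threshold:     # preset condition satisfied
            break                             # S204: stop training
        optimizer.zero_grad()
        loss.backward()                       # S203: adjust the model parameters
        optimizer.step()
    return model
```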
  • the voice feature extraction model trained in this embodiment can extract the same voice features as the original voice data from the enhanced voice data, and the enhanced voice data is obtained by performing reverberation processing and noise processing on the original voice data.
  • In other words, the speech feature extraction model also learns how to denoise speech data and acquires the ability to be invariant to distortion.
  • the trained voice feature extraction model may also be uploaded to the blockchain.
  • uploading the trained voice feature extraction model to the blockchain can ensure its security and fairness and transparency to users.
  • The trained voice feature extraction model is uploaded to the blockchain; because files on the blockchain cannot be tampered with at will, the trained model is protected from malicious tampering, and subsequent users can obtain it directly and accurately.
  • It is also convenient for subsequent users to use the trained voice feature extraction model to process the voice data to be processed, thereby ensuring that informative, accurately expressed, and effective voice features are extracted.
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks associated with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • FIG. 5 is a schematic diagram of an apparatus for extracting speech features provided by an embodiment of the present application.
  • Each unit included in the apparatus is used to execute each step in the embodiment corresponding to FIG. 1 , FIG. 2 , and FIG. 4 .
  • an acquisition unit 310 configured to acquire the voice data to be processed
  • the processing unit 320 is configured to input the voice data into a trained voice feature extraction model for processing to obtain target voice features corresponding to the voice data.
  • The voice feature extraction model is obtained based on self-supervised learning by taking the sample speech feature corresponding to the original speech data in each sample speech data pair as the target and training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data enhancement processing on the original speech data.
  • the speech feature extraction model includes a convolution filter, a convolution encoder and a quasi-recurrent neural network, and the processing unit 320 is specifically used for:
  • inputting the voice data into the convolution filter for processing to obtain a first voice feature corresponding to the voice data, where the first voice feature includes a frequency feature;
  • performing convolution processing on the first voice feature by the convolutional encoder to obtain a second voice feature, where the second voice feature includes an MFCC feature and an Fbank feature; and
  • inputting the second voice feature into the quasi-recurrent neural network for processing to obtain the target voice feature, where the target voice feature includes a target waveform feature, a target log power spectrum feature, a target spectral feature, a target filter bank feature, a target gamma feature, and a target prosody feature.
  • Optionally, the processing unit 320 is further configured to:
  • extract a third voice feature corresponding to the second voice feature based on the quasi-recurrent neural network; and
  • combine the second voice feature with the third voice feature in a skip connection manner to obtain the target voice feature.
  • the device further includes:
  • the first training unit is used to input a plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing, and obtain the sample speech features corresponding to each original speech data and the real data corresponding to each enhanced speech data. voice characteristics;
  • a second training unit, configured to calculate, for each sample speech data pair and according to the preset loss function, the loss value between the sample speech feature corresponding to the original speech data in the sample speech data pair and the real speech feature corresponding to the enhanced speech data in that pair;
  • a third training unit configured to adjust the model parameters of the initial speech feature extraction model when the loss value does not meet the preset condition, and return to execute the step of inputting a plurality of sample speech data pairs in the sample speech data set into the The steps of processing in the initial voice feature extraction model to obtain the sample voice feature corresponding to each original voice data and the real voice feature corresponding to each enhanced voice data;
  • a fourth training unit configured to stop training the initial speech feature extraction model when the loss value satisfies the preset condition, and use the trained initial speech feature extraction model as the trained speech feature extraction model .
  • the real speech features include waveform features, logarithmic power spectral rate features, spectral features, filter bank features, gamma features, and prosody features.
  • Optionally, the data enhancement processing is any one or any combination of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping speech processing.
  • the device further includes:
  • the uploading unit is used for uploading the speech feature extraction model to the blockchain.
  • FIG. 6 is a schematic diagram of a terminal for extracting speech features provided by another embodiment of the present application.
  • the terminal 4 for extracting speech features in this embodiment includes: a processor 40 , a memory 41 , and computer instructions 42 that are stored in the memory 41 and run on the processor 40 .
  • When the processor 40 executes the computer instructions 42, the following is implemented:
  • acquiring voice data to be processed; and
  • inputting the voice data into the trained voice feature extraction model for processing to obtain the target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • the computer instructions 42 may be divided into one or more units, and the one or more units are stored in the memory 41 and executed by the processor 40 to complete the present application.
  • the one or more units may be a series of computer instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer instruction 42 in the terminal 4 for extracting speech features.
  • the computer instructions 42 can be divided into an acquisition unit and a processing unit, and the specific functions of each unit are as described above.
  • the terminal for extracting the voice feature may include, but is not limited to, the processor 40 and the memory 41 .
  • FIG. 6 is only an example of the terminal 4 for extracting voice features and does not constitute a limitation on the terminal; it may include more or fewer components than shown in the figure, combine some components, or have different components. For example, the terminal for extracting voice features may also include an input/output terminal, a network access terminal, a bus, and the like.
  • the so-called processor 40 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 41 may be an internal storage unit of the terminal for extracting voice features, such as a hard disk or memory of the terminal for extracting voice features.
  • The memory 41 may also be an external storage terminal of the terminal for extracting voice features, such as a plug-in hard disk equipped on the terminal, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like.
  • the memory 41 may also include both an internal storage unit of the terminal for extracting voice features and an external storage terminal.
  • the memory 41 is used to store the computer instructions and other programs and data required by the terminal.
  • the memory 41 can also be used to temporarily store data that has been output or will be output.
  • The embodiments of the present application also provide a computer storage medium, which may be non-volatile or volatile; the computer storage medium stores a computer program, and when the computer program is executed by a processor, the following steps are implemented:
  • acquiring voice data to be processed; and
  • inputting the voice data into the trained voice feature extraction model for processing to obtain the target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.

Abstract

The present application is applicable to the technical field of computers and provides a method and apparatus for extracting speech features, a terminal, and a storage medium, the method comprising: acquiring speech data to be processed; and inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data. The speech feature extraction model in the method is obtained by training, on the basis of self-supervised learning, differences between the original speech data and enhanced speech data in each sample speech data pair by taking sample speech features corresponding to original speech data in each sample speech data pair as targets. Effective, informative, and accurately expressed target speech features can be extracted on the basis of the speech feature extraction model, such that when the target speech features are applied to intelligent speech task processing scenarios, the processing results are more accurate.

Description

A method, device, terminal and storage medium for extracting speech features
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 29, 2020, with application number 202011602171.3 and entitled "A method, device, terminal and storage medium for extracting speech features", the entire content of which is incorporated herein by reference.
Technical Field
The present application belongs to the field of computer technology, and in particular relates to a method, device, terminal and storage medium for extracting speech features.
Background Art
As an important part of artificial intelligence, applications of intelligent speech technology typically retrain a speech model, or optimize an existing one, by labeling large amounts of supervised data, a process that consumes considerable manpower, money, and time. Moreover, very little labeled speech data is available for direct use as training samples, which hinders the training of speech models. Unsupervised speech feature extraction methods have therefore emerged.
Technical Problem
To sum up, the inventor realized that, due to the complexity and variability of speech data, it is difficult for existing speech models trained by unsupervised learning to learn effective features of the speech data, so the speech features extracted using such models are inaccurate.
Technical Solution
In view of this, the embodiments of the present application provide a method, device, terminal, and storage medium for extracting speech features, so as to solve the problem that existing speech models obtained by unsupervised learning struggle to learn effective features of speech data, which makes the speech features extracted with such models inaccurate.
本申请实施例的第一方面提供了一种提取语音特征的方法,包括:A first aspect of the embodiments of the present application provides a method for extracting speech features, including:
获取待处理的语音数据;Get the voice data to be processed;
将所述语音数据输入到已训练的语音特征提取模型中进行处理,得到所述语音数据对应的目标语音特征,所述语音特征提取模型是基于自监督学习,以每个样本语音数据对中的原始语音数据对应的样本语音特征为目标,对每个样本语音数据对中的原始语音数据和增强语音数据之间的差异性进行训练得到的,所述增强语音数据是对所述原始语音数据进行数据增强处理得到的。The voice data is input into the trained voice feature extraction model for processing, and the target voice feature corresponding to the voice data is obtained. The voice feature extraction model is based on self-supervised learning. The sample voice feature corresponding to the original voice data is the target, and is obtained by training the difference between the original voice data and the enhanced voice data in each sample voice data pair, and the enhanced voice data is obtained from the original voice data. data augmentation.
A second aspect of the embodiments of the present application provides an apparatus for extracting speech features, including:
an acquisition unit, configured to acquire speech data to be processed; and
a processing unit, configured to input the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained on the basis of self-supervised learning by taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data augmentation on the original speech data.
A third aspect of the embodiments of the present application provides a terminal for extracting speech features, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:
acquiring speech data to be processed; and
inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained on the basis of self-supervised learning by taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data augmentation on the original speech data.
A fourth aspect of the embodiments of the present application provides a computer storage medium storing a computer program which, when executed by a processor, implements:
acquiring speech data to be processed; and
inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained on the basis of self-supervised learning by taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data augmentation on the original speech data.
A fifth aspect of the embodiments of the present application provides a computer program product which, when run on a terminal for extracting speech features, causes the terminal to execute:
acquiring speech data to be processed; and
inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained on the basis of self-supervised learning by taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data augmentation on the original speech data.
Beneficial Effects
Compared with the prior art, the embodiments of the present application have the following beneficial effects: on the one hand, the amount of sample speech data is enlarged; on the other hand, sample speech data does not need to be provided manually, which saves a great deal of manpower, money and time.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for extracting speech features provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application;
FIG. 3 is a schematic diagram of the structure of the speech feature extraction model provided by the present application;
FIG. 4 is a schematic flowchart of a method for extracting speech features provided by yet another embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for extracting speech features provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a terminal for extracting speech features provided by another embodiment of the present application.
Embodiments of the Present Invention
To make the objectives, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
As an important part of artificial intelligence, applications of intelligent speech technology retrain a speech model, or optimize an original speech model, by annotating a large amount of supervised data, a process that consumes considerable manpower, money and time. Moreover, very little annotated speech data can be used directly as training samples, which is unfavorable for training speech models. Unsupervised speech feature extraction methods have therefore emerged.
However, the inventor realized that, due to the complexity and variability of speech data, it is difficult for existing speech models obtained by unsupervised learning to learn effective features of the speech data, so the speech features extracted with such models are inaccurate.
In view of this, the present application provides a method for extracting speech features. In this method, the speech feature extraction model is obtained by self-supervised learning, taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, where the enhanced speech data in each pair is obtained by applying data augmentation to the original speech data. A speech feature extraction model trained in this way learns the ability to extract, from enhanced speech data, the speech features corresponding to the original speech data, which can be understood as the ability to extract, from distorted speech data, the speech features corresponding to undistorted speech data; it also learns how to extract effective speech features, so that in actual use the model can extract effective, informative and accurately expressed target speech features. When these target speech features are then applied to intelligent speech task processing scenarios, the processing results are more accurate. Moreover, during training the model can generate enhanced speech data from the original speech data, which on the one hand enlarges the amount of sample speech data and on the other hand removes the need to provide sample speech data manually, saving a great deal of manpower, money and time.
Please refer to FIG. 1, which is a schematic flowchart of a method for extracting speech features provided by an embodiment of the present application. The method for extracting speech features in this embodiment is executed by a terminal, a server or the like, where the terminal includes, but is not limited to, mobile terminals such as smart phones, tablet computers, computers and personal digital assistants (Personal Digital Assistant, PDA), and may also include terminals such as desktop computers. This embodiment is described by taking a terminal as the execution body. As shown in FIG. 1, the method for extracting speech features may include S101 to S102, as follows:
S101: Acquire speech data to be processed.
The speech data to be processed is the speech data from which speech features need to be extracted. The extracted speech features can be applied in different intelligent speech task processing scenarios, for example speech recognition, speaker identification, language identification, speech translation, simultaneous interpretation and voice control.
Precisely because the features may be applied in different intelligent speech task processing scenarios, the speech data to be processed may be the same or different. For example, if speech features need to be extracted in a speaker identification scenario, the speech data to be processed may be a complete piece of speech uploaded to the terminal in advance; if speech features need to be extracted in a voice control scenario, the speech data to be processed may be speech uttered by the user and captured by a built-in sound pickup device (for example, a microphone or a sound card). This is only an exemplary description and is not limiting.
Exemplarily, the way the speech data to be processed is acquired also differs between application scenarios. When the application scenario requires results in real time, for example simultaneous interpretation or voice control, the speech data may be acquired by capturing the user's speech through a built-in sound pickup device (for example, a microphone or a sound card).
When the application scenario does not require results in real time, for example speaker identification, the speech data may be acquired by the user uploading the speech data to be processed to the terminal in advance, and the terminal acquiring it. Alternatively, when the terminal detects a feature extraction instruction, it may acquire, according to the file identifier contained in the instruction, the file corresponding to that identifier and extract the speech data to be processed from it. This is only an exemplary description and is not limiting.
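As a concrete illustration of the file-based case of S101, the following is a minimal sketch (not part of the original disclosure) that loads a speech file into a waveform for further processing; the file path and target sample rate are assumptions used only for illustration.

# Minimal sketch of acquiring speech data to be processed from a file.
# The path and target sample rate are illustrative assumptions.
import torchaudio

def acquire_speech_data(path: str, target_sr: int = 16000):
    waveform, sr = torchaudio.load(path)                     # (channels, samples)
    if sr != target_sr:                                      # resample if needed
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform.mean(dim=0, keepdim=True), target_sr     # mono waveform

# waveform, sr = acquire_speech_data("utterance.wav")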
S102: Input the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained on the basis of self-supervised learning by taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data augmentation on the original speech data.
In this embodiment, a pre-trained speech feature extraction model is stored in advance in the terminal that extracts speech features. The speech feature extraction model adopts self-supervised learning and is trained on the differences between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target.
The enhanced speech data in each sample speech data pair is obtained by performing data augmentation on the original speech data in that pair. It can be understood that the original speech data is clean speech data, that is, speech data without noise, impurities or distortion. The enhanced speech data is obtained by applying to the original speech data any one or any combination of reverberation processing, noise addition, frequency masking, time masking, clipping, and overlapping speech processing.
In the prior art, the sample speech data that is usually acquired is speech data containing noise, impurities and distortion, together with speech features extracted from that data. With those speech features as the learning target, the speech data and speech features are trained by machine learning so that the resulting speech model has the ability to extract effective speech features from noisy, impure and distorted speech data. However, with this way of training, because of the complexity and variability of speech data, and because the learning target is itself a set of speech features extracted from noisy, impure and distorted speech data, the speech model learns many meaningless features during training. Together with the interference caused by the complexity and variability of speech data, the finally trained speech model cannot extract effective, accurate and rich speech features when actually processing speech data, and its processing results are therefore inaccurate when it is applied in various intelligent speech task processing scenarios.
Alternatively, the prior art also trains speech models by unsupervised learning, which searches for structure in the input data without a target in order to better understand correlations in the data. Current unsupervised speech feature extraction methods mainly include principal component analysis and methods based on Gaussian mixture models. Both methods assume that the speech data follows a Gaussian distribution, and both require manual dimensionality reduction during execution. However, speech data does not necessarily follow a Gaussian distribution, and manual dimensionality reduction inevitably causes the loss of high-dimensional features, so the speech model cannot extract effective, accurate and rich speech features when actually processing speech data, and its processing results are inaccurate when it is applied in various intelligent speech task processing scenarios.
In contrast, the present application adopts self-supervised learning and takes the sample speech features extracted from the original speech data as the learning target. The target is explicit, and because the original speech data contains no noise, impurities or distortion, the sample speech features extracted from it are more accurate, rich and effective.
Performing data augmentation on the original speech data to obtain enhanced speech data is, on the one hand, equivalent to increasing the number of training samples; on the other hand, known transformations are applied to the original speech data, which makes it easy to control the types of enhanced speech data. That is, when the original speech data is augmented, the types of augmentation can be controlled so that the speech feature extraction model learns various effective speech features in a targeted way during training. As a result, the finally trained speech feature extraction model can extract effective, accurate and rich speech features when actually processing speech data, and its processing results are more accurate when it is applied in various intelligent speech task processing scenarios.
It can be understood that the speech feature extraction model may be pre-trained by the terminal that extracts speech features, or it may be pre-trained by another device and the file corresponding to the model then transplanted to the terminal. In other words, the execution body that trains the speech feature extraction model and the execution body that uses it for speech feature extraction may be the same or different. For example, when another device trains the initial speech feature extraction model, after finishing the training it fixes the model parameters of the initial model to obtain the file corresponding to the trained speech feature extraction model, and this file is then transplanted to the terminal that extracts speech features.
Please refer to FIG. 2, which is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application. Optionally, in a possible implementation, as shown in FIG. 2, the above S102 may include S1021 to S1023, as follows:
S1021: Input the speech data into the convolution filter for processing to obtain first speech features corresponding to the speech data, where the first speech features include frequency features.
The trained speech feature extraction model includes a convolution filter, a convolutional encoder and a quasi-recurrent neural network. Please refer to FIG. 3, which is a schematic diagram of the structure of the speech feature extraction model provided by the present application. The convolution filter may be an interpretable convolution filter (SincNet), the convolutional encoder is composed of 7 convolutional neural network layers (ConvNet), and the quasi-recurrent neural network may be a quasi-recurrent neural network (QRNN). This is only an exemplary description and is not limiting.
Exemplarily, when the trained speech feature extraction model processes the speech data to be processed, it may first convert the speech data into a waveform; specifically, the speech data may be converted by existing speech-to-waveform software, which is not repeated here. The converted waveform is input into SincNet, and SincNet performs a time-domain convolution on the input waveform based on a sliding window of preset duration to obtain the first speech features corresponding to the speech data. The first speech features may include frequency features, Mel-frequency cepstral coefficient (MFCC) features, filter bank (Fbank) features, waveform (wave) features, log-power spectrum (Lps) features and so on, where the frequency features may include audio features, fundamental frequency features, frequency band features and the like. The preset duration can be adjusted according to the actual situation; for example, in this embodiment it may be set to a 10-millisecond sliding window. Since speech data is sequential in time, performing the time-domain convolution on the input waveform based on a sliding window of preset duration can be understood as performing the time-domain convolution on a 10-millisecond segment of the waveform each time, until the whole input waveform has been processed.
Exemplarily, the time-domain convolution performed by SincNet on the input waveform can be expressed by the following formula (1):
y[n] = x[n] * h[n] = Σ_{l=0}^{L−1} x[l]·h[n−l]    (1)
In the above formula (1), y[n] denotes the first speech features output by SincNet, x[n] denotes the input waveform, and h[n] is a preset filter of length L.
This is only an exemplary description and is not limiting.
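As a minimal sketch of the time-domain filtering of formula (1) (an illustration, not the patented implementation), a bank of learnable filters can be applied to the raw waveform with a strided 1-D convolution; the filter length, number of filters and the 10 ms hop below are assumed values.

# Minimal sketch of the time-domain convolution of formula (1): y[n] = sum_l x[l]*h[n-l].
# Filter length L, number of filters and the 10 ms hop are illustrative assumptions.
import torch
import torch.nn.functional as F

def sinc_like_filtering(waveform: torch.Tensor, filters: torch.Tensor, sample_rate: int = 16000):
    """waveform: (1, T) mono signal; filters: (num_filters, L) kernels h[n]."""
    hop = int(0.010 * sample_rate)                    # 10 ms sliding window step
    x = waveform.unsqueeze(0)                         # (batch=1, channel=1, T)
    h = filters.unsqueeze(1)                          # (num_filters, 1, L)
    y = F.conv1d(x, h, stride=hop, padding=h.shape[-1] // 2)
    return y                                          # (1, num_filters, frames): first speech features

# filters = torch.randn(80, 251)   # e.g. 80 filters of length L = 251 (placeholder, normally sinc-parameterized)
# first_feats = sinc_like_filtering(torch.randn(1, 16000), filters)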
S1022: Perform convolution on the first speech features with the convolutional encoder to obtain second speech features, where the second speech features include MFCC features and Fbank features.
The first speech features are input into the convolutional encoder for convolution to obtain the second speech features. The second speech features may include MFCC features, Fbank features, wave features, Lps features, gamma (Gamma) features, prosody (Proso) features and so on.
The convolutional encoder is composed of 7 ConvNets. The first ConvNet performs convolution on the first speech features to obtain a first processing result. The first processing result is input to the second ConvNet, which performs convolution on it to obtain a second processing result, and so on, until the last ConvNet performs convolution on the processing result passed from the previous ConvNet and outputs the second speech features.
Exemplarily, the first ConvNet convolves the first speech features with preset convolution kernels, which can be understood as the first ConvNet performing feature selection among the first speech features and removing redundant features to obtain the first processing result, for example extracting MFCC features, Fbank features, wave features, Lps features, gamma (Gamma) features, prosody (Proso) features and the like from the information in the first speech features. The first processing result is input into the second ConvNet, which convolves further on the basis of the features extracted by the first ConvNet to extract deeper features and obtain the second processing result. By analogy, the second speech features are obtained after the last ConvNet performs convolution on the processing result passed from the previous ConvNet.
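The stacked encoder described above can be sketched as a chain of 1-D convolution blocks, each passing its result to the next; the channel widths, kernel sizes and activation below are assumptions for illustration and not taken from the original disclosure.

# Minimal sketch of a 7-layer convolutional encoder that refines the first speech features.
# Channel sizes, kernel sizes and the ReLU activation are illustrative assumptions.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, in_channels: int = 80, hidden: int = 256, num_layers: int = 7):
        super().__init__()
        layers = []
        for i in range(num_layers):
            layers.append(nn.Conv1d(in_channels if i == 0 else hidden, hidden,
                                    kernel_size=3, padding=1))
            layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)      # each block passes its result to the next ConvNet

    def forward(self, first_feats: torch.Tensor) -> torch.Tensor:
        # first_feats: (batch, in_channels, frames) -> second speech features (batch, hidden, frames)
        return self.net(first_feats)

# encoder = ConvEncoder(); second_feats = encoder(torch.randn(1, 80, 100))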
Optionally, in a possible implementation, in order to make the extracted second speech features more accurate and eliminate differences between speech features that may be caused by gender or age, the seventh processing result may be input into a downsampling layer for processing, and the downsampling layer then outputs the second speech features.
Exemplarily, the processing of the seventh processing result by the downsampling layer can be expressed by the following formula (2):
P_{j,m} = (1/r) Σ_{k=1}^{r} x_{j,(m−1)·n+k}    (2)
In the above formula (2), P_{j,m} denotes the output of the downsampling layer, where j indexes the processing result of the j-th ConvNet, m denotes the m-th downsampling band, n denotes the downsampling factor, and r denotes the size of the downsampling window, that is, how many frequency bands of data are downsampled together.
This is only an exemplary description and is not limiting.
In this embodiment, because different people have different organ structures and vocal habits, the extracted features often show certain differences, which manifest as spectral shifts; for example, men's voices are generally lower in frequency than women's, and adults' are generally lower than children's. Processing by the downsampling layer can largely eliminate these differences, making the extracted speech features more accurate.
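A minimal sketch of such a band-wise downsampling step is given below; treating it as average pooling over r neighbouring bands with stride n is an assumption consistent with formula (2), not a statement of the patented implementation.

# Minimal sketch of the downsampling layer: average r neighbouring bands with stride n.
# Interpreting formula (2) as band-wise average pooling is an assumption.
import torch
import torch.nn.functional as F

def downsample_bands(conv_out: torch.Tensor, r: int = 4, n: int = 2) -> torch.Tensor:
    """conv_out: (batch, bands, frames) output of the 7th ConvNet."""
    x = conv_out.transpose(1, 2)                          # (batch, frames, bands)
    pooled = F.avg_pool1d(x, kernel_size=r, stride=n)     # P_{j,m}: average of r bands per window
    return pooled.transpose(1, 2)                         # (batch, downsampled_bands, frames)

# second_feats = downsample_bands(torch.randn(1, 256, 100))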
S1023: Input the second speech features into the quasi-recurrent neural network for processing to obtain the target speech features, where the target speech features include target waveform features, target log-power spectrum features, target spectrum features, target filter bank features, target gamma features and target prosody features.
The second speech features are input into the QRNN for processing to obtain the target speech features corresponding to the speech data to be processed. The target speech features include target waveform features, target log-power spectrum features, target spectrum features, target filter bank features, target gamma features and target prosody features, and may also include long-term log-power spectrum (Long-Lps) features, long-term Mel-frequency cepstral coefficient (Long-MFCC) features, long-term filter bank (Long-Fbank) features, long-term gamma (Long Gamma) features and so on. It is worth noting that some of the features in the first speech features, the second speech features and the target speech features are of the same type; the difference is that the features extracted as first and second speech features are not very informative and not expressed very accurately, whereas after processing by the quasi-recurrent neural network the resulting target speech features are informative and accurately expressed.
As shown in FIG. 3, the first layer of the QRNN is a convolution layer (Conv 1D) used to extract features from the input second speech features, Sigmoid and Tanh are functions used in the QRNN, and the second layer is a pooling layer used to reduce the number of features; the difference is that the pooling layer in the QRNN adopts the fo-pool method. Exemplarily, the extraction of features from the second speech features by the convolution layer in the QRNN can be expressed by the following formula (3):
Z = tanh(W_z * X)
F = σ(W_f * X)
O = σ(W_o * X)    (3)
In the above formula (3), X denotes the input second speech features, Z, F and O denote the gates computed with the parameters W, and W_z, W_f and W_o denote convolution filters of preset size R. When the filter width is 2, the above formula (3) can be expressed as:
z_t = tanh(W_z^1·x_{t−1} + W_z^2·x_t)
f_t = σ(W_f^1·x_{t−1} + W_f^2·x_t)
o_t = σ(W_o^1·x_{t−1} + W_o^2·x_t)
That is, the larger the width of the filter, the more time steps of features can be taken into account, and the higher-level the features that can be computed.
The features extracted by the convolution layer are input into the pooling layer for processing, and the target speech features are output. The processing of the features extracted by the convolution layer can be implemented by the pooling layer through the following formulas (4) and (5):
c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ z_t    (4)
h_t = o_t ⊙ c_t    (5)
In the above formula (4), c_t denotes the cell state vector at time t, and in the above formula (5), h_t denotes the hidden state vector at time t.
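A minimal sketch of a single QRNN layer with width-2 gated convolution followed by fo-pooling, following formulas (3) to (5), is shown below; the hidden size and tensor layout are assumptions for illustration.

# Minimal sketch of one QRNN layer: width-2 gated convolution followed by fo-pooling
# (formulas (3)-(5)). Hidden size and tensor layout are illustrative assumptions.
import torch
import torch.nn as nn

class QRNNLayer(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One Conv1d produces Z, F, O together; kernel_size=2 is the filter width of formula (3).
        self.conv = nn.Conv1d(input_size, 3 * hidden_size, kernel_size=2, padding=1)
        self.hidden_size = hidden_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, input_size, time)
        zfo = self.conv(x)[:, :, :x.shape[-1]]            # causal trim so step t sees x_{t-1}, x_t
        z, f, o = zfo.chunk(3, dim=1)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        c = torch.zeros(x.shape[0], self.hidden_size, device=x.device)
        outputs = []
        for t in range(x.shape[-1]):                      # fo-pool: c_t = f_t*c_{t-1} + (1-f_t)*z_t
            c = f[:, :, t] * c + (1 - f[:, :, t]) * z[:, :, t]
            outputs.append(o[:, :, t] * c)                # h_t = o_t * c_t
        return torch.stack(outputs, dim=-1)               # (batch, hidden_size, time)

# layer = QRNNLayer(256, 256); target_feats = layer(torch.randn(1, 256, 100))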
Optionally, in a possible implementation, in order to make the extracted target speech features more informative and more accurately expressed, S1024 to S1025 may further be included after S1022, as follows:
S1024: Extract, based on the quasi-recurrent neural network, third speech features corresponding to the second speech features.
The third speech features are of the same types as the features included in the target speech features; that is, the third speech features include MFCC features, Fbank features, wave features, Lps features, gamma (Gamma) features, prosody (Proso) features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features and so on. This is only an exemplary description and is not limiting.
The second speech features are input into the QRNN for processing to obtain the third speech features corresponding to the second speech features. For the specific processing of the second speech features by the quasi-recurrent neural network, reference may be made to the description in S1023, which is not repeated here.
S1025: Combine the second speech features with the third speech features by means of skip connections to obtain the target speech features.
Both the second speech features and the third speech features are represented as vectors, and the second speech features and the third speech features are added in correspondence to obtain the target speech features. If a certain type of feature included in the third speech features is not present in the second speech features, the vector corresponding to that type of feature in the second speech features defaults to 0. This is only an exemplary description and is not limiting.
Optionally, in a possible implementation, it can be seen from S1022 that the convolutional encoder is composed of 7 ConvNets, each of which has a corresponding processing result. Combining the second speech features with the third speech features by skip connections may be: adding the first processing result of the first ConvNet, the third processing result of the third ConvNet and the fifth processing result of the fifth ConvNet to the third speech features in correspondence to obtain the target speech features; or adding the first processing result of the first ConvNet, the third processing result of the third ConvNet, the fifth processing result of the fifth ConvNet and the seventh processing result of the seventh ConvNet to the third speech features in correspondence to obtain the target speech features; or adding the second processing result of the second ConvNet, the fourth processing result of the fourth ConvNet and the sixth processing result of the sixth ConvNet to the third speech features in correspondence to obtain the target speech features. This is only an exemplary description and is not limiting.
In this embodiment, the target speech features are expressed as the sum of the features discovered by the convolutional encoder, which makes the finally obtained target speech features more informative and more accurately expressed.
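A minimal sketch of combining features by skip connections through corresponding addition is shown below; the particular layer selection and the tensor shapes are assumptions, and the zero-padding of missing feature types follows the description above.

# Minimal sketch of the skip connection in S1025: element-wise addition of selected
# encoder outputs (second speech features) and the QRNN output (third speech features).
# Shapes and the particular layer selection are illustrative assumptions.
import torch

def combine_with_skip(encoder_results: list, third_feats: torch.Tensor, picks=(0, 2, 4)) -> torch.Tensor:
    """encoder_results: list of 7 tensors, one per ConvNet; third_feats: QRNN output."""
    target = third_feats.clone()
    for i in picks:                      # e.g. the 1st, 3rd and 5th ConvNet results
        skip = encoder_results[i]
        if skip.shape != target.shape:   # a feature type missing on one side counts as zero
            continue
        target = target + skip           # corresponding (element-wise) addition
    return target

# target_feats = combine_with_skip([torch.randn(1, 256, 100) for _ in range(7)],
#                                  torch.randn(1, 256, 100))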
In the embodiments of the present application, the speech feature extraction model takes the sample speech features corresponding to the original speech data in each sample speech data pair as the target and is obtained by training, on the basis of self-supervised learning, on the differences between the original speech data and the enhanced speech data in each sample speech data pair, where the enhanced speech data in each pair is obtained by applying data augmentation to the original speech data. A speech feature extraction model trained in this way learns the ability to extract, from enhanced speech data, the speech features corresponding to the original speech data, which can be understood as the ability to extract, from distorted speech data, the speech features corresponding to undistorted speech data. In actual use, the model can therefore extract effective, informative and accurately expressed target speech features, and when these target speech features are applied to intelligent speech task processing scenarios, the processing results are more accurate. Moreover, during training the model can generate enhanced speech data from the original speech data, which on the one hand enlarges the amount of sample speech data and on the other hand removes the need to provide sample speech data manually, saving a great deal of manpower, money and time.
Please refer to FIG. 4, which is a schematic flowchart of a method for extracting speech features provided by yet another embodiment of the present application. The method may include S201 to S206. For steps S205 to S206 shown in FIG. 4, reference may be made to the related descriptions of S101 to S102 in the embodiment corresponding to FIG. 1, which are not repeated here for brevity. Steps S201 to S204 are described in detail below.
S201: Input a plurality of sample speech data pairs in a sample speech data set into an initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data.
The sample speech data set includes a plurality of sample speech data pairs, and each sample speech data pair includes one original speech data and one enhanced speech data, where the enhanced speech data in each pair is obtained by applying data augmentation to the original speech data in that pair. The data augmentation may be any one or any combination of reverberation processing, noise addition, frequency masking, time masking, clipping, and overlapping speech processing.
Exemplarily, a probability value may be preset for each kind of data augmentation, and data augmentation is applied to the original speech data in each acquired sample speech data pair based on the preset probability values to obtain the enhanced speech data corresponding to the original speech data in that pair. A probability value represents the likelihood that the corresponding data augmentation is applied to each original speech data.
For example, the probability value corresponding to reverberation processing is 0.5, that of noise addition is 0.4, that of frequency masking is 0.4, that of time masking is 0.2, that of clipping is 0.2, and that of overlapping speech processing is 0.1. That is, there is a probability of 0.5 that a given original speech data is reverberated, a probability of 0.4 that it has noise added, a probability of 0.4 that it is frequency-masked, a probability of 0.2 that it is time-masked, and a probability of 0.2 that it is clipped. It is worth noting that, although a probability value is set for each different kind of data augmentation, the number of augmentations applied to each original speech data is not limited: it may be one of them, or a combination of several of them, determined according to the probability values.
Exemplarily, reverberation processing is implemented by convolving the signal corresponding to the original speech data with a set of 1300 impulse responses derived with the image method; the impulse responses simulate different acoustic conditions, with reverberation times between 0.3 and 0.9 seconds. The noise used in noise addition is drawn from the preset FreeSound data set and DIRHA data set and may include background noise and non-stationary noise such as alarms, knocking, telephone ringing and television sound, with the signal-to-noise ratio sampled randomly between 0 and 10 dB. Frequency masking is implemented by filtering the time-domain signal corresponding to the original speech data with a band-stop filter. Time masking is implemented by setting random segments of the original speech data to zero. Clipping is implemented by adding random saturation to the original speech data. Overlapping speech processing is implemented by overlapping a speech signal with the main signal corresponding to the original speech data. These are all exemplary descriptions and are not limiting.
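A minimal sketch of the probability-driven augmentation described above is given below; the augmentation functions are placeholders, and the probability table simply restates the example values from the text.

# Minimal sketch of applying data augmentation with per-type probabilities to obtain
# enhanced speech data from original speech data. The augmentation functions are
# placeholders; only the probability values come from the example in the text.
import random
import numpy as np

AUG_PROBS = {
    "reverb": 0.5, "add_noise": 0.4, "freq_mask": 0.4,
    "time_mask": 0.2, "clip": 0.2, "overlap_speech": 0.1,
}

def time_mask(x: np.ndarray) -> np.ndarray:
    start = random.randrange(0, max(1, len(x) - 1600))
    y = x.copy()
    y[start:start + 1600] = 0.0            # zero out a random segment
    return y

AUG_FUNCS = {"time_mask": time_mask}       # the other augmentations would be registered here

def augment(original: np.ndarray) -> np.ndarray:
    enhanced = original
    for name, p in AUG_PROBS.items():      # each augmentation fires independently with its probability
        if name in AUG_FUNCS and random.random() < p:
            enhanced = AUG_FUNCS[name](enhanced)
    return enhanced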
The plurality of sample speech data pairs in the sample speech data set are input into the initial speech feature extraction model for processing; that is, the original speech data in each sample speech data pair is input into the initial speech feature extraction model for processing, and the enhanced speech data in each sample speech data pair is input into the initial speech feature extraction model for processing. The initial speech feature extraction model outputs the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data.
Exemplarily, as shown in FIG. 3, during training of the speech feature extraction model, the initial speech feature extraction model includes an initial convolution filter, an initial convolutional encoder and an initial quasi-recurrent neural network, where the initial convolution filter may be an interpretable convolution filter (SincNet), the initial convolutional encoder is composed of 7 convolutional neural network layers (ConvNet), and the initial quasi-recurrent neural network may be a QRNN. "Skip Connections" denotes the skip connections, and FC denotes the processing results selected by skipping among the 7 ConvNets. The "Workers" at the top of FIG. 3 denote 12 self-supervised tasks, each implemented with a small feed-forward neural network (typically one hidden layer with 256 hidden units). It can be clearly seen that each of the 12 self-supervised tasks corresponds to one speech feature extracted from the speech data, which can be understood informally as supervising the difference between the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data, and adjusting the model parameters of the initial speech feature extraction model according to that difference until the real speech features corresponding to each enhanced speech data are the same as the sample speech features corresponding to each original speech data.
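A minimal sketch of one such worker, a small feed-forward network with a single 256-unit hidden layer that regresses one feature type from the shared representation, is shown below; the input and output dimensions are assumptions.

# Minimal sketch of one self-supervised "worker": a small feed-forward network
# (one hidden layer of 256 units) that predicts one feature type from the shared
# representation. Input/output dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class Worker(nn.Module):
    def __init__(self, repr_dim: int = 256, feature_dim: int = 40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(repr_dim, 256),     # single hidden layer with 256 hidden units
            nn.ReLU(),
            nn.Linear(256, feature_dim),  # regress one feature type, e.g. MFCC or Fbank
        )

    def forward(self, shared_repr: torch.Tensor) -> torch.Tensor:
        return self.net(shared_repr)

# workers = nn.ModuleList([Worker(feature_dim=d) for d in (40, 80, 1, 257)])  # one per task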
In FIG. 3, "Speech Distortion" denotes the data augmentation processing, and the speech segment below "Speech Distortion" denotes the original speech data. Optionally, one way of processing is that the initial speech feature extraction model processes the original speech data to obtain the sample speech features corresponding to the original speech data. Another way of processing is to first perform Speech Distortion, that is, data augmentation, on the original speech data to obtain the enhanced speech data corresponding to that original speech data, and then extract the real speech features corresponding to the enhanced speech data. For the specific process of extracting the sample speech features and the real speech features, reference may be made to the description in S102, which is not repeated here.
S202: For each sample speech data pair, calculate, according to a preset loss function, the loss value between the sample speech features corresponding to the original speech data in the sample speech data pair and the real speech features corresponding to the enhanced speech data in that pair.
The loss value between the sample speech features corresponding to the original speech data in each sample speech data pair and the real speech features corresponding to the enhanced speech data in that pair can be used to measure the accuracy of the speech features extracted by the initial speech feature extraction model. It can be understood that the original speech data is clean speech data, that is, speech data without noise, impurities or distortion, so the sample speech features corresponding to the original speech data are standard, informative and accurately expressed speech features, which is exactly the learning target of the initial speech feature extraction model. The enhanced speech data is obtained by applying data augmentation to the original speech data and contains noise, impurities and the like. When the same speech features as the sample speech features corresponding to the original speech data can be extracted from the enhanced speech data, the training of the initial speech feature extraction model is shown to be complete.
The preset loss function may be a mean squared error function, a mean absolute error function or the like, which is not limited. The sample speech features may include MFCC features, Fbank features, wave features, Lps features, gamma (Gamma) features, prosody (Proso) features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features and so on. The real speech features may likewise include waveform features (wave features), log-power spectrum features (Lps features), spectrum features (MFCC features), filter bank features (Fbank features), gamma features, prosody features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features and so on.
For the original speech data and the enhanced speech data in each sample speech data pair, the loss value between the sample speech features and the real speech features is calculated based on the preset loss function. It is worth noting that, because the sample speech features and the real speech features each contain multiple corresponding types of features, the final loss value is the sum of the loss values between each group of features of the same type. For example, if the sample speech features include MFCC features, Fbank features and wave features, and the real speech features include MFCC features, Fbank features and wave features, then the loss value between the sample speech features and the real speech features is the sum of the loss value between the MFCC features of the sample speech features and the MFCC features of the real speech features, the loss value between the Fbank features of the sample speech features and the Fbank features of the real speech features, and the loss value between the wave features of the sample speech features and the wave features of the real speech features. This is only an exemplary description and is not limiting.
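A minimal sketch of this summed, per-feature-type loss is shown below; using mean squared error for every feature type is one of the options named above, and the dictionary keys are illustrative.

# Minimal sketch of the total loss in S202: sum of per-feature-type losses between the
# sample speech features (from original data) and the real speech features (from
# enhanced data). MSE is used here; the feature keys are illustrative.
import torch
import torch.nn.functional as F

def total_loss(sample_feats: dict, real_feats: dict) -> torch.Tensor:
    loss = torch.zeros(())
    for name, target in sample_feats.items():             # e.g. "mfcc", "fbank", "wave"
        pred = real_feats.get(name, torch.zeros_like(target))
        loss = loss + F.mse_loss(pred, target)            # one term per feature type
    return loss

# loss = total_loss({"mfcc": torch.randn(1, 40)}, {"mfcc": torch.randn(1, 40)})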
After the loss value is calculated, it is judged whether the loss value satisfies a preset condition. When the loss value does not satisfy the preset condition, S203 is executed; when the loss value satisfies the preset condition, S204 is executed. The preset condition may be that the loss value is less than or equal to a preset loss value threshold, or that the loss value falls within a preset error range, but it is not limited to these and may also be set according to the actual situation, which is not limited here.
S203: When the loss value does not satisfy the preset condition, adjust the model parameters of the initial speech feature extraction model, and return to the step of inputting the plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data.
For example, suppose the preset condition is that the loss value is less than or equal to the preset loss value threshold. Then, when the device performing the training process confirms that the current loss value is greater than the preset loss value threshold, it determines that the speech features extracted by the current initial speech feature extraction model do not yet meet the requirement. At this point, the model parameters of the initial speech feature extraction model need to be adjusted, after which the process returns to S201 and continues with S201 and S202 until the loss value determined in S202 is less than or equal to the preset loss value threshold, whereupon S204 is executed.
S204: When the loss value satisfies the preset condition, stop training the initial speech feature extraction model, and use the trained initial speech feature extraction model as the trained speech feature extraction model.
For example, suppose the preset condition is that the loss value is less than or equal to the preset loss value threshold. Then, when the device performing the training process confirms that the current loss value is less than or equal to the preset loss value threshold, it determines that the training of the current initial speech feature extraction model meets the expected requirement and stops training the initial speech feature extraction model.
At this point, the initial speech feature extraction model with adjusted parameters has been trained on a large number of samples and its loss value stays within a small range, so using it to process speech data yields informative and accurately expressed speech features. Therefore, the initial speech feature extraction model at the time training is stopped (that is, after the last round of training is completed) can be taken as the trained speech feature extraction model.
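A minimal sketch of the S201 to S204 training loop is shown below; the optimizer, threshold value and data handling are assumptions for illustration and are not taken from the original disclosure.

# Minimal sketch of the S201-S204 loop: extract features for each (original, enhanced)
# pair, compute the summed loss, and adjust parameters until the loss meets the threshold.
# Optimizer choice, threshold and data handling are illustrative assumptions.
# `workers` is assumed to be an nn.ModuleDict mapping a feature name to a small regression head.
import torch

def train(model, workers, pairs, total_loss_fn, threshold: float = 0.01, max_epochs: int = 100):
    params = list(model.parameters()) + list(workers.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for original, enhanced in pairs:                          # S201: process each sample pair
            sample_feats = {n: w(model(original)).detach()        # features from clean data act as the target
                            for n, w in workers.items()}
            real_feats = {n: w(model(enhanced)) for n, w in workers.items()}
            loss = total_loss_fn(sample_feats, real_feats)        # S202: summed per-feature loss
            optimizer.zero_grad()
            loss.backward()                                       # S203: adjust model parameters
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / max(1, len(pairs)) <= threshold:          # S204: stop when the condition is met
            break
    return model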
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic and should not constitute any limitation on the implementation of the embodiments of the present application.
The speech feature extraction model trained in this embodiment can extract from the enhanced speech data the same speech features as those of the original speech data, while the enhanced speech data is obtained by applying reverberation, noise addition and other processing to the original speech data. From another perspective, the speech feature extraction model has also learned how to denoise speech data and to be invariant to distortion.
Experiments show that when the speech features extracted by this speech feature extraction model are applied in scenarios such as speech recognition, speaker identification, language identification, speech translation, simultaneous interpretation and voice control, the processing results are clearly better than those of existing speech models and MFCC systems.
Optionally, in a possible implementation, after S102 or after S204, the trained speech feature extraction model may also be uploaded to a blockchain.
In this embodiment, uploading the trained speech feature extraction model to a blockchain ensures its security and its fairness and transparency toward users. Because files recorded on a blockchain cannot be tampered with at will, this prevents malicious tampering with the trained model, allows subsequent users to obtain the trained model directly and reliably, and makes it convenient for them to process speech data with it, ensuring that informative, accurately expressed, and effective speech features are extracted.
The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and cryptographic algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
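One lightweight way to make the uploaded model tamper-evident, sketched below under the assumption that the model has been serialized to a file, is to record a cryptographic digest of that file on the chain. The actual submission API depends on the blockchain platform in use and is not specified by the application, so only the hashing step is shown; the file name is hypothetical.

```python
import hashlib

def model_fingerprint(model_path: str) -> str:
    """Return the SHA-256 digest of a serialized model file; this digest is what
    would be recorded on the blockchain so that later tampering can be detected."""
    digest = hashlib.sha256()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example (hypothetical path):
# print(model_fingerprint("speech_feature_extractor.pt"))
```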
Please refer to FIG. 5, which is a schematic diagram of an apparatus for extracting speech features provided by an embodiment of the present application. The units included in the apparatus are used to execute the steps in the embodiments corresponding to FIG. 1, FIG. 2, and FIG. 4; for details, refer to the relevant descriptions of those embodiments. For ease of explanation, only the parts related to this embodiment are shown. Referring to FIG. 5, the apparatus includes:
an acquisition unit 310, configured to acquire speech data to be processed;
a processing unit 320, configured to input the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and where the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
Optionally, the speech feature extraction model includes a convolutional filter, a convolutional encoder, and a quasi-recurrent neural network, and the processing unit 320 is specifically configured to:
input the speech data into the convolutional filter for processing to obtain first speech features corresponding to the speech data, the first speech features including frequency features;
perform convolution processing on the first speech features through the convolutional encoder to obtain second speech features, the second speech features including MFCC features and Fbank features;
input the second speech features into the quasi-recurrent neural network for processing to obtain the target speech features, the target speech features including target waveform features, target logarithmic power spectrum features, target spectral features, target filter bank features, target gamma features, and target prosody features.
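To make the three-stage pipeline above more concrete, here is a minimal PyTorch sketch: a learnable convolutional filter bank over the raw waveform, a small convolutional encoder, and a recurrent stage producing frame-level target features. The layer sizes, kernel widths, and the use of an ordinary GRU in place of the quasi-recurrent neural network are assumptions of this sketch, not details taken from the application.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Minimal sketch of the described pipeline: convolutional filter bank,
    convolutional encoder, and a recurrent stage (GRU stands in for the QRNN)."""
    def __init__(self, n_filters=80, encoder_dim=256, hidden_dim=512):
        super().__init__()
        # learnable filter bank over the raw waveform -> first speech features (frequency-like)
        self.conv_filter = nn.Conv1d(1, n_filters, kernel_size=401, stride=160, padding=200)
        # convolutional encoder -> second speech features (MFCC/Fbank-like representation)
        self.encoder = nn.Sequential(
            nn.Conv1d(n_filters, encoder_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(encoder_dim, encoder_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # recurrent stage -> target speech features
        self.rnn = nn.GRU(encoder_dim, hidden_dim, batch_first=True)

    def forward(self, waveform):                    # waveform: (batch, samples)
        x = waveform.unsqueeze(1)                   # (batch, 1, samples)
        first = self.conv_filter(x)                 # first speech features
        second = self.encoder(first)                # second speech features
        out, _ = self.rnn(second.transpose(1, 2))   # (batch, frames, hidden_dim)
        return out                                  # target speech features
```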
Optionally, the processing unit 320 is further configured to, after the convolution processing of the first speech features yields the second speech features:
extract third speech features corresponding to the second speech features based on the quasi-recurrent neural network;
combine the second speech features with the third speech features by way of a skip connection to obtain the target speech features.
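Building on the FeatureExtractor sketch above, the skip-connection variant can be expressed by feeding the second features both into the recurrent stage and directly into the output, then combining the two. Concatenation is used here for illustration; the application does not fix the exact combining operation.

```python
import torch

class FeatureExtractorWithSkip(FeatureExtractor):  # reuses the class from the previous sketch
    """Variant with a skip connection: the second features are combined with the
    recurrent output (the third features) to form the target features."""
    def forward(self, waveform):
        x = waveform.unsqueeze(1)
        first = self.conv_filter(x)
        second = self.encoder(first)                  # second speech features
        third, _ = self.rnn(second.transpose(1, 2))   # third speech features
        # skip connection: concatenate second and third features along the feature axis
        return torch.cat([second.transpose(1, 2), third], dim=-1)
```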
Optionally, the apparatus further includes:
a first training unit, configured to input a plurality of sample speech data pairs in a sample speech data set into an initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data;
a second training unit, configured to, for each sample speech data pair, calculate, according to a preset loss function, a loss value between the sample speech features corresponding to the original speech data in the pair and the real speech features corresponding to the enhanced speech data in the pair;
a third training unit, configured to, when the loss value does not satisfy a preset condition, adjust the model parameters of the initial speech feature extraction model and return to the step of inputting the plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data;
a fourth training unit, configured to, when the loss value satisfies the preset condition, stop training the initial speech feature extraction model and take the trained initial speech feature extraction model as the trained speech feature extraction model.
Optionally, the real speech features include waveform features, logarithmic power spectrum features, spectral features, filter bank features, gamma features, and prosody features.
Optionally, the data augmentation processing is any one or any combination of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping-speech processing.
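As an illustration of how such sample pairs might be produced, the sketch below applies two of the listed augmentations (additive noise and time masking) to a waveform tensor. The SNR and mask length are illustrative values chosen for the example; reverberation, frequency masking, clipping, and overlapping speech would be added in the same spirit.

```python
import torch

def augment(waveform, noise_snr_db=10.0, mask_fraction=0.1):
    """Toy data-augmentation step: additive noise at a rough target SNR plus a random time mask."""
    # additive noise scaled to an approximate signal-to-noise ratio
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10.0 ** (noise_snr_db / 10.0))
    noisy = waveform + torch.randn_like(waveform) * noise_power.sqrt()
    # time masking: zero out one random contiguous span of samples
    n = waveform.shape[-1]
    span = int(n * mask_fraction)
    start = torch.randint(0, max(1, n - span), (1,)).item()
    noisy[..., start:start + span] = 0.0
    return noisy
```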
Optionally, the apparatus further includes:
an uploading unit, configured to upload the speech feature extraction model to a blockchain.
Please refer to FIG. 6, which is a schematic diagram of a terminal for extracting speech features provided by another embodiment of the present application. As shown in FIG. 6, the terminal 4 for extracting speech features in this embodiment includes a processor 40, a memory 41, and computer instructions 42 stored in the memory 41 and executable on the processor 40. When executing the computer instructions 42, the processor 40 implements:
acquiring speech data to be processed;
inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and where the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
Specifically, this corresponds, for example, to S101 to S102 shown in FIG. 1. Alternatively, when executing the computer instructions 42, the processor 40 implements the functions of the units in the above embodiments, for example the functions of units 310 to 320 shown in FIG. 5.
Illustratively, the computer instructions 42 may be divided into one or more units, which are stored in the memory 41 and executed by the processor 40 to complete the present application. The one or more units may be a series of computer instruction segments capable of accomplishing specific functions, the segments being used to describe the execution process of the computer instructions 42 in the terminal 4 for extracting speech features. For example, the computer instructions 42 may be divided into an acquisition unit and a processing unit, whose specific functions are as described above.
The terminal for extracting speech features may include, but is not limited to, the processor 40 and the memory 41. Those skilled in the art will understand that FIG. 6 is merely an example of the terminal 4 for extracting speech features and does not constitute a limitation on it; the terminal may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal may further include input/output devices, network access devices, a bus, and the like.
The processor 40 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 41 may be an internal storage unit of the terminal for extracting speech features, such as its hard disk or memory. The memory 41 may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal. Further, the memory 41 may include both an internal storage unit and an external storage device of the terminal. The memory 41 is used to store the computer instructions and other programs and data required by the terminal, and may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present application further provides a computer storage medium, which may be non-volatile or volatile and which stores a computer program. When the computer program is executed by a processor, the following is implemented: acquiring speech data to be processed; and inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and where the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.

Claims (20)

  1. A method for extracting speech features, comprising:
    acquiring speech data to be processed; and
    inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, wherein the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and wherein the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
  2. The method according to claim 1, wherein the speech feature extraction model comprises a convolutional filter, a convolutional encoder, and a quasi-recurrent neural network, and wherein inputting the speech data into the trained speech feature extraction model for processing to obtain the target speech features corresponding to the speech data comprises:
    inputting the speech data into the convolutional filter for processing to obtain first speech features corresponding to the speech data, the first speech features comprising frequency features;
    performing convolution processing on the first speech features through the convolutional encoder to obtain second speech features, the second speech features comprising MFCC features and Fbank features; and
    inputting the second speech features into the quasi-recurrent neural network for processing to obtain the target speech features, the target speech features comprising target waveform features, target logarithmic power spectrum features, target spectral features, target filter bank features, target gamma features, and target prosody features.
  3. The method according to claim 2, wherein, after performing convolution processing on the first speech features through the convolutional encoder to obtain the second speech features, the method further comprises:
    extracting third speech features corresponding to the second speech features based on the quasi-recurrent neural network; and
    combining the second speech features with the third speech features by way of a skip connection to obtain the target speech features.
  4. The method according to any one of claims 1 to 3, wherein, before acquiring the speech data to be processed, the method further comprises:
    inputting a plurality of sample speech data pairs in a sample speech data set into an initial speech feature extraction model for processing to obtain sample speech features corresponding to each original speech data and real speech features corresponding to each enhanced speech data;
    for each sample speech data pair, calculating, according to a preset loss function, a loss value between the sample speech features corresponding to the original speech data in the pair and the real speech features corresponding to the enhanced speech data in the pair;
    when the loss value does not satisfy a preset condition, adjusting model parameters of the initial speech feature extraction model and returning to the step of inputting the plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data; and
    when the loss value satisfies the preset condition, stopping training the initial speech feature extraction model and taking the trained initial speech feature extraction model as the trained speech feature extraction model.
  5. The method according to claim 4, wherein the real speech features comprise waveform features, logarithmic power spectrum features, spectral features, filter bank features, gamma features, and prosody features.
  6. The method according to claim 1, wherein the data augmentation processing is any one or any combination of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping-speech processing.
  7. The method according to claim 1, wherein, after inputting the speech data into the trained speech feature extraction model for processing to obtain the target speech features corresponding to the speech data, the method further comprises:
    uploading the speech feature extraction model to a blockchain.
  8. An apparatus for extracting speech features, comprising:
    an acquisition unit, configured to acquire speech data to be processed; and
    a processing unit, configured to input the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, wherein the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and wherein the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
  9. A terminal for extracting speech features, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
    acquiring speech data to be processed; and
    inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, wherein the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and wherein the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
  10. The terminal for extracting speech features according to claim 9, wherein the speech feature extraction model comprises a convolutional filter, a convolutional encoder, and a quasi-recurrent neural network, and wherein inputting the speech data into the trained speech feature extraction model for processing to obtain the target speech features corresponding to the speech data comprises:
    inputting the speech data into the convolutional filter for processing to obtain first speech features corresponding to the speech data, the first speech features comprising frequency features;
    performing convolution processing on the first speech features through the convolutional encoder to obtain second speech features, the second speech features comprising MFCC features and Fbank features; and
    inputting the second speech features into the quasi-recurrent neural network for processing to obtain the target speech features, the target speech features comprising target waveform features, target logarithmic power spectrum features, target spectral features, target filter bank features, target gamma features, and target prosody features.
  11. The terminal for extracting speech features according to claim 10, wherein, after performing convolution processing on the first speech features through the convolutional encoder to obtain the second speech features, the following is further implemented:
    extracting third speech features corresponding to the second speech features based on the quasi-recurrent neural network; and
    combining the second speech features with the third speech features by way of a skip connection to obtain the target speech features.
  12. The terminal for extracting speech features according to any one of claims 9 to 11, wherein, before acquiring the speech data to be processed, the following is further implemented:
    inputting a plurality of sample speech data pairs in a sample speech data set into an initial speech feature extraction model for processing to obtain sample speech features corresponding to each original speech data and real speech features corresponding to each enhanced speech data;
    for each sample speech data pair, calculating, according to a preset loss function, a loss value between the sample speech features corresponding to the original speech data in the pair and the real speech features corresponding to the enhanced speech data in the pair;
    when the loss value does not satisfy a preset condition, adjusting model parameters of the initial speech feature extraction model and returning to the step of inputting the plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data; and
    when the loss value satisfies the preset condition, stopping training the initial speech feature extraction model and taking the trained initial speech feature extraction model as the trained speech feature extraction model.
  13. The terminal for extracting speech features according to claim 12, wherein the real speech features comprise waveform features, logarithmic power spectrum features, spectral features, filter bank features, gamma features, and prosody features.
  14. The terminal for extracting speech features according to claim 9, wherein the data augmentation processing is any one or any combination of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping-speech processing.
  15. The terminal for extracting speech features according to claim 9, wherein, after inputting the speech data into the trained speech feature extraction model for processing to obtain the target speech features corresponding to the speech data, the following is further implemented:
    uploading the speech feature extraction model to a blockchain.
  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements:
    acquiring speech data to be processed; and
    inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, wherein the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and wherein the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
  17. The computer-readable storage medium according to claim 16, wherein the speech feature extraction model comprises a convolutional filter, a convolutional encoder, and a quasi-recurrent neural network, and wherein inputting the speech data into the trained speech feature extraction model for processing to obtain the target speech features corresponding to the speech data comprises:
    inputting the speech data into the convolutional filter for processing to obtain first speech features corresponding to the speech data, the first speech features comprising frequency features;
    performing convolution processing on the first speech features through the convolutional encoder to obtain second speech features, the second speech features comprising MFCC features and Fbank features; and
    inputting the second speech features into the quasi-recurrent neural network for processing to obtain the target speech features, the target speech features comprising target waveform features, target logarithmic power spectrum features, target spectral features, target filter bank features, target gamma features, and target prosody features.
  18. The computer-readable storage medium according to claim 17, wherein, after performing convolution processing on the first speech features through the convolutional encoder to obtain the second speech features, the following is further implemented:
    extracting third speech features corresponding to the second speech features based on the quasi-recurrent neural network; and
    combining the second speech features with the third speech features by way of a skip connection to obtain the target speech features.
  19. The computer-readable storage medium according to any one of claims 16 to 18, wherein, before acquiring the speech data to be processed, the following is further implemented:
    inputting a plurality of sample speech data pairs in a sample speech data set into an initial speech feature extraction model for processing to obtain sample speech features corresponding to each original speech data and real speech features corresponding to each enhanced speech data;
    for each sample speech data pair, calculating, according to a preset loss function, a loss value between the sample speech features corresponding to the original speech data in the pair and the real speech features corresponding to the enhanced speech data in the pair;
    when the loss value does not satisfy a preset condition, adjusting model parameters of the initial speech feature extraction model and returning to the step of inputting the plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data; and
    when the loss value satisfies the preset condition, stopping training the initial speech feature extraction model and taking the trained initial speech feature extraction model as the trained speech feature extraction model.
  20. The computer-readable storage medium according to claim 19, wherein the real speech features comprise waveform features, logarithmic power spectrum features, spectral features, filter bank features, gamma features, and prosody features.
PCT/CN2021/084166 2020-12-29 2021-03-30 Method and apparatus for extracting speech features, terminal, and storage medium WO2022141868A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011602171.3 2020-12-29
CN202011602171.3A CN112767927A (en) 2020-12-29 2020-12-29 Method, device, terminal and storage medium for extracting voice features

Publications (1)

Publication Number Publication Date
WO2022141868A1 true WO2022141868A1 (en) 2022-07-07

Family

ID=75697228

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084166 WO2022141868A1 (en) 2020-12-29 2021-03-30 Method and apparatus for extracting speech features, terminal, and storage medium

Country Status (2)

Country Link
CN (1) CN112767927A (en)
WO (1) WO2022141868A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882873B (en) * 2022-07-12 2022-09-23 深圳比特微电子科技有限公司 Speech recognition model training method and device and readable storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179911B (en) * 2020-01-02 2022-05-03 腾讯科技(深圳)有限公司 Target voice extraction method, device, equipment, medium and joint training method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887494A (en) * 2017-12-01 2019-06-14 腾讯科技(深圳)有限公司 The method and apparatus of reconstructed speech signal
CN111445900A (en) * 2020-03-11 2020-07-24 平安科技(深圳)有限公司 Front-end processing method and device for voice recognition and terminal equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ravanelli, Mirco; Zhong, Jianyuan; Pascual, Santiago; Swietojanski, Pawel; Monteiro, Joao; Trmal, Jan; Bengio, Yoshua: "Multi-Task Self-Supervised Learning for Robust Speech Recognition", ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4 May 2020, pages 6989–6993, XP033793230, DOI: 10.1109/ICASSP40776.2020.9053569 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115472147A (en) * 2022-09-15 2022-12-13 北京大学深圳医院 Language identification method and device
CN116229960A (en) * 2023-03-08 2023-06-06 江苏微锐超算科技有限公司 Robust detection method, system, medium and equipment for deceptive voice
CN116229960B (en) * 2023-03-08 2023-10-31 江苏微锐超算科技有限公司 Robust detection method, system, medium and equipment for deceptive voice

Also Published As

Publication number Publication date
CN112767927A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
Luo et al. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
WO2022141868A1 (en) Method and apparatus for extracting speech features, terminal, and storage medium
WO2021042870A1 (en) Speech processing method and apparatus, electronic device, and computer-readable storage medium
WO2021000408A1 (en) Interview scoring method and apparatus, and device and storage medium
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
WO2022121257A1 (en) Model training method and apparatus, speech recognition method and apparatus, device, and storage medium
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
WO2019204547A1 (en) Systems and methods for automatic speech recognition using domain adaptation techniques
Kadıoğlu et al. An empirical study of Conv-TasNet
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
WO2022178942A1 (en) Emotion recognition method and apparatus, computer device, and storage medium
WO2019237519A1 (en) General vector training method, voice clustering method, apparatus, device and medium
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
WO2020192009A1 (en) Silence detection method based on neural network, and terminal device and medium
WO2023283823A1 (en) Speech adversarial sample testing method and apparatus, device, and computer-readable storage medium
Hasannezhad et al. PACDNN: A phase-aware composite deep neural network for speech enhancement
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
CN110797033A (en) Artificial intelligence-based voice recognition method and related equipment thereof
CN108172214A (en) A kind of small echo speech recognition features parameter extracting method based on Mel domains
Girirajan et al. Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network.
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
Saleem et al. Variance based time-frequency mask estimation for unsupervised speech enhancement
Hizlisoy et al. Text independent speaker recognition based on MFCC and machine learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912638

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912638

Country of ref document: EP

Kind code of ref document: A1