WO2022141868A1 - Method and apparatus for extracting speech features, terminal, and storage medium - Google Patents


Info

Publication number
WO2022141868A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
feature
speech
voice data
features
Application number
PCT/CN2021/084166
Other languages
French (fr)
Chinese (zh)
Inventor
张之勇
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2022141868A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03: characterised by the type of extracted parameters
    • G10L 25/27: characterised by the analysis technique
    • G10L 25/30: characterised by the analysis technique using neural networks

Definitions

  • the present application belongs to the field of computer technology, and in particular relates to a method, device, terminal and storage medium for extracting speech features.
  • Applications of intelligent speech technology typically retrain a speech model, or optimize an existing one, by labeling large amounts of supervised data, a process that consumes considerable manpower, money, and time. Moreover, very little labeled speech data is available for direct use as training samples, which hinders the training of speech models. Unsupervised speech feature extraction methods have therefore emerged.
  • The inventor realized that, due to the complexity and variability of speech data, it is difficult for existing speech models trained by unsupervised learning to learn effective features of the speech data, so the speech features extracted by such models are inaccurate.
  • In view of this, the embodiments of the present application provide a method, apparatus, terminal, and storage medium for extracting speech features, so as to solve the problem that existing speech models obtained by unsupervised learning struggle to learn effective features of speech data, which makes the speech features extracted with such models inaccurate.
  • A first aspect of the embodiments of the present application provides a method for extracting speech features, including:
  • acquiring voice data to be processed; and
  • inputting the voice data into a trained voice feature extraction model for processing to obtain a target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • A second aspect of the embodiments of the present application provides an apparatus for extracting speech features, including:
  • an acquisition unit, configured to acquire voice data to be processed; and
  • a processing unit, configured to input the voice data into a trained voice feature extraction model for processing to obtain a target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • A third aspect of the embodiments of the present application provides a terminal for extracting speech features, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements:
  • acquiring voice data to be processed; and
  • inputting the voice data into a trained voice feature extraction model for processing to obtain a target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • A fourth aspect of the embodiments of the present application provides a computer storage medium, where the computer storage medium stores a computer program, and the computer program, when executed by a processor, implements:
  • acquiring voice data to be processed; and
  • inputting the voice data into a trained voice feature extraction model for processing to obtain a target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • A fifth aspect of the embodiments of the present application provides a computer program product that, when the computer program product runs on a terminal that extracts voice features, causes the terminal that extracts voice features to execute:
  • acquiring voice data to be processed; and
  • inputting the voice data into a trained voice feature extraction model for processing to obtain a target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • The beneficial effects of the embodiments of the present application are that, on the one hand, the quantity of sample voice data is enlarged and, on the other hand, it is not necessary to manually provide sample voice data, which saves considerable manpower, money, and time.
  • FIG. 1 is a schematic flowchart of a method for extracting speech features provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application.
  • FIG. 3 is a schematic diagram of a speech feature extraction model structure provided by the application.
  • FIG. 4 is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application.
  • FIG. 5 is a schematic diagram of an apparatus for extracting speech features provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a terminal for extracting speech features provided by another embodiment of the present application.
  • Applications of intelligent speech technology typically retrain a speech model, or optimize an existing one, by labeling large amounts of supervised data, a process that consumes considerable manpower, money, and time. Moreover, very little labeled speech data is available for direct use as training samples, which hinders the training of speech models. Unsupervised speech feature extraction methods have therefore emerged.
  • The inventor realized that, due to the complexity and variability of speech data, it is difficult for existing speech models trained by unsupervised learning to learn effective features of the speech data, so the speech features extracted using such models are not accurate.
  • To this end, the present application provides a method for extracting voice features.
  • The voice feature extraction model used in the method is obtained based on self-supervised learning: taking the sample voice features corresponding to the original voice data in each sample voice data pair as the target, it is trained on the difference between the original voice data and the enhanced voice data in each sample voice data pair, where the enhanced voice data in each pair is obtained by performing data enhancement processing on the original voice data.
  • A speech feature extraction model trained in this way has learned the ability to extract, from the enhanced voice data, the voice features corresponding to the original voice data, which can be understood as the ability to extract, from distorted voice data, the voice features corresponding to the undistorted voice data. In other words, the model has learned how to extract effective speech features, so that in actual use it can extract effective, informative, and accurately expressed target speech features.
  • When these target speech features are applied in intelligent speech task processing scenarios, the processing results are therefore more accurate.
  • Moreover, the speech feature extraction model can generate enhanced speech data from the original speech data during training. On the one hand, this expands the quantity of sample speech data; on the other hand, it avoids manually preparing large amounts of sample speech data, saving considerable manpower, money, and time.
  • FIG. 1 is a schematic flowchart of a method for extracting speech features provided by an embodiment of the present application.
  • The execution subject of the method for extracting voice features in this embodiment is a terminal, a server, or the like, where the terminal includes, but is not limited to, mobile terminals such as smartphones, tablet computers, and personal digital assistants (PDAs), and may also include terminals such as desktop computers.
  • In the following, a terminal is taken as the execution subject for description.
  • the method for extracting speech features as shown in FIG. 1 may include S101 to S102, and the details are as follows:
  • S101 Acquire voice data to be processed.
  • The speech data to be processed is the speech data from which speech features need to be extracted.
  • the extracted speech features can be applied to different intelligent speech task processing scenarios.
  • the extracted speech features can be applied to scenarios such as speech recognition, speaker identification, language recognition, speech translation, simultaneous translation, and speech control.
  • In different application scenarios, the voice data to be processed may be the same or different.
  • For example, in some scenarios the voice data to be processed may be a complete piece of speech uploaded to the terminal in advance; if voice features need to be extracted in a voice control scenario, the voice data to be processed may be the speech uttered by the user and captured through a built-in sound pickup device (e.g., a microphone or a sound card). This is only an exemplary description and is not limiting.
  • Different application scenarios acquire the voice data to be processed in different ways.
  • For example, the voice data may be obtained by capturing the user's speech through a built-in sound pickup device (e.g., a microphone or a sound card).
  • Alternatively, the voice data may be obtained by the user uploading the to-be-processed voice data to the terminal in advance, from which the terminal retrieves it.
  • For example, when the terminal detects a feature extraction instruction, it obtains, according to the file identifier carried in the instruction, the file corresponding to that identifier and extracts the speech data to be processed from the file. This is only an exemplary description and is not limiting.
  • S102 Input the voice data into a trained voice feature extraction model for processing, and obtain a target voice feature corresponding to the voice data.
  • The voice feature extraction model is obtained based on self-supervised learning: taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target, it is trained on the difference between the original voice data and the enhanced voice data in each sample voice data pair, where the enhanced voice data is obtained by performing data enhancement processing on the original voice data.
  • a pre-trained voice feature extraction model is pre-stored in the terminal for extracting voice features.
  • The voice feature extraction model adopts self-supervised learning: it takes the sample voice features corresponding to the original voice data in each sample voice data pair as the target and is trained on the difference between the original voice data and the enhanced voice data in each pair.
  • the enhanced speech data in each sample speech data pair is obtained by performing data enhancement processing on the original speech data in each sample speech data pair.
  • The original voice data is pure voice data, that is, voice data that contains no noise or impurities and is not distorted.
  • the enhanced speech data is obtained by performing any one or more of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping speech processing on the original speech data.
  • In the related art, the sample speech data obtained are speech data containing noise, impurities, and distortion, together with speech features extracted from such data.
  • These speech data and speech features are used for machine-learning training, so that the trained speech model acquires the ability to extract speech features from speech data containing noise, impurities, and distortion.
  • However, because of the complexity and variability of speech data, and because the learning target in this approach is speech features extracted from speech data that already contains noise, impurities, and distortion, the speech model struggles to learn effective speech features during training.
  • As a result, the finally trained speech model cannot extract effective, accurate, and rich speech features when actually processing speech data, which in turn leads to inaccurate processing results when the model is applied in various intelligent speech task processing scenarios.
  • In the related art, an unsupervised learning method is also used to train a speech model.
  • Unsupervised learning refers to finding structure in the input data without any target, with the purpose of better understanding the correlations within the data.
  • Unsupervised speech feature extraction methods mainly include the principal component analysis method and methods based on a Gaussian mixture model. Both methods assume that the speech data obey a Gaussian distribution, and the dimensionality must be reduced manually during execution.
  • However, speech data do not necessarily conform to a Gaussian distribution, and manual dimensionality reduction inevitably causes the loss of high-dimensional features. As a result, such speech models cannot extract effective, accurate, and rich speech features when actually processing speech data, which in turn leads to inaccurate processing results when the models are applied in various intelligent speech task processing scenarios.
  • In the embodiments of the present application, by contrast, self-supervised learning is adopted, and the sample speech features extracted from the original speech data are used as the target of self-supervised learning.
  • This target is clear, and since the original speech data is speech data without noise, impurities, or distortion, the sample speech features extracted from it are more accurate, rich, and effective.
  • Performing data enhancement processing on the original voice data to obtain enhanced voice data is equivalent to increasing the number of training samples;
  • the type of data enhancement processing can be controlled, so that the speech feature extraction model can learn various effective speech features in a targeted manner during the training process.
  • the speech feature extraction model obtained by the final training can extract effective, accurate and rich speech features when actually processing speech data.
  • When these speech features are applied in various intelligent speech task processing scenarios, the processing results are therefore more accurate.
  • It should be noted that the speech feature extraction model may be pre-trained by the terminal that extracts speech features, or the file corresponding to the speech feature extraction model may be pre-trained by another device and then transplanted to the terminal. That is to say, the execution subject that trains the speech feature extraction model and the execution subject that uses it for speech feature extraction may be the same or different. For example, when another device is used to train the initial speech feature extraction model, after that device finishes training, the model parameters of the initial speech feature extraction model are fixed to obtain the file corresponding to the trained speech feature extraction model, and this file is then ported to the terminal that extracts speech features.
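  • For illustration only, the following is a minimal sketch of how a terminal might load a ported speech feature extraction model and run it on voice data to be processed. It assumes PyTorch and torchaudio are available and that the model was exported as a TorchScript file; the file names and export format are hypothetical and are not specified by the present application.

```python
import torch
import torchaudio  # assumed available for reading audio files

MODEL_PATH = "speech_feature_extractor.pt"  # hypothetical file ported from the training device
AUDIO_PATH = "utterance.wav"                # hypothetical voice data to be processed

# Load the trained speech feature extraction model (trained on this or another device).
model = torch.jit.load(MODEL_PATH)
model.eval()

# S101: acquire the voice data to be processed and convert it into a waveform.
waveform, sample_rate = torchaudio.load(AUDIO_PATH)  # shape: (channels, samples)

# S102: input the waveform into the trained model to obtain the target speech features.
with torch.no_grad():
    target_features = model(waveform)
print(target_features.shape)
```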
  • FIG. 2 is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application.
  • the foregoing S102 may include S1021 to S1023, and the details are as follows:
  • S1021 Input the voice data into the convolution filter for processing to obtain a first voice feature corresponding to the voice data, where the first voice feature includes a frequency feature.
  • The trained speech feature extraction model includes a convolutional filter, a convolutional encoder, and a quasi-recurrent neural network.
  • FIG. 3 is a schematic diagram of the structure of a speech feature extraction model provided by the present application.
  • the convolutional filter can be an interpretable convolutional filter (SincNet)
  • the convolutional encoder is composed of 7 convolutional neural network layers (ConvNet)
  • The quasi-recurrent neural network can be a Quasi-Recurrent Neural Network (QRNN). This is only an exemplary description and is not limiting.
  • When the voice data to be processed is processed by the trained voice feature extraction model, the voice data may first be converted into a waveform.
  • The conversion may be performed by existing speech-to-waveform conversion software.
  • Input the converted waveform into SincNet, and SincNet performs a time-domain convolution operation on the input waveform based on a sliding window with a preset duration to obtain the first voice feature corresponding to the voice data.
  • The first voice feature may include frequency features, Mel-frequency cepstral coefficient (MFCC) features, filter bank (Fbank) features, waveform (wave) features, log-power spectrum (Lps) features, and the like.
  • the frequency features may include audio features, fundamental frequency features, frequency band features, and the like.
  • the preset duration can be adjusted according to the actual situation, for example, in this embodiment, it can be set as a sliding window of 10 milliseconds. Speech data is time-sequential, and the input waveform is subjected to a time-domain convolution operation based on a sliding window with a preset duration.
  • The time-domain convolution operation performed by SincNet on the input waveform can be expressed by the following formula (1):
  • y[n] represents the first speech feature output by SincNet
  • x[n] represents the input waveform
  • h[n] is a preset filter of length L.
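  • The body of formula (1) is not reproduced in the text above; the standard discrete time-domain convolution consistent with the definitions of y[n], x[n], h[n], and the filter length L would read as follows (a reconstruction, not a quotation of the original formula):

```latex
% Reconstructed form of formula (1): time-domain convolution of the input
% waveform x[n] with a preset filter h[n] of length L.
y[n] = (x * h)[n] = \sum_{l=0}^{L-1} x[n-l]\, h[l]
```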
  • S1022 Perform convolution processing on the first voice feature by the convolutional encoder to obtain a second voice feature, where the second voice feature includes an MFCC feature and an Fbank feature.
  • The second voice feature may include MFCC features, Fbank features, wave features, Lps features, gamma (Gamma) features, prosody (Proso) features, and the like.
  • In this embodiment, the convolutional encoder is composed of 7 ConvNets. The first ConvNet performs convolution processing on the first speech feature to obtain a first processing result; the first processing result is input to the second ConvNet, which performs convolution processing on it to obtain a second processing result; and so on, until the last ConvNet performs convolution processing on the processing result passed from the previous ConvNet and outputs the second speech feature.
  • The first ConvNet convolves the first speech feature based on a preset convolution kernel, which can be understood as the first ConvNet performing feature selection within the first speech feature and removing redundant features to obtain the first processing result.
  • That is, features such as the MFCC feature, the Fbank feature, the wave feature, the Lps feature, the gamma (Gamma) feature, and the prosody (Proso) feature are extracted according to the information in the first speech feature.
  • the first processing result is input into the second ConvNet, and the second ConvNet further performs convolution on the basis of the features extracted by the first ConvNet to extract deeper features to obtain the second processing result.
  • the second speech feature is obtained after the last ConvNet performs convolution processing on the processing result passed by the previous ConvNet.
  • Optionally, the seventh processing result may be input into a down-sampling layer for processing, and the down-sampling layer then outputs the second speech feature.
  • The processing of the seventh processing result by the down-sampling layer can be expressed by the following formula (2):
  • P j,m represents the output of the downsampling layer
  • j represents the processing result of the jth ConvNet
  • m represents the mth downsampling band
  • n represents the downsampling factor
  • r represents the size of the down-sampling window, that is, how many bands of data are down-sampled together.
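  • The body of formula (2) is likewise missing from the text. One plausible reading, consistent with the variable definitions above, is average pooling of the j-th ConvNet output over r adjacent bands with a down-sampling factor n; this is an assumption rather than the original formula:

```latex
% Hypothetical reconstruction of formula (2): the m-th down-sampled band of
% the j-th ConvNet output x_j, averaging r adjacent bands with factor n.
P_{j,m} = \frac{1}{r} \sum_{i=0}^{r-1} x_{j}\!\left[m \cdot n + i\right]
```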
  • For example, the frequency of a man's voice is generally lower than that of a woman's; after the down-sampling processing, such a difference is generally reduced and can be largely eliminated, so that the extracted speech features are more accurate.
  • S1023 Input the second voice feature into the quasi-recurrent neural network for processing to obtain the target voice feature, where the target voice feature includes a target waveform feature, a target log power spectrum feature, a target spectral feature, a target filter bank feature, a target gamma feature, and a target prosody feature.
  • the second voice feature is input into the QRNN for processing to obtain the target voice feature corresponding to the voice data to be processed.
  • Specifically, the target speech features may include target waveform (wave) features, target log-power spectrum (Lps) features, target spectral (MFCC) features, target filter bank (Fbank) features, target gamma (Gamma) features, target prosody (Proso) features, long-term log-power spectrum (Long-Lps) features, long-term Mel-frequency cepstral coefficient (Long-MFCC) features, long-term filter bank (Long-Fbank) features, long-term gamma (Long Gamma) features, and the like. It is worth noting that some features of the first voice feature, the second voice feature, and the target voice feature are of the same type; the difference is that the corresponding features in the first and second voice features are less informative and less accurately expressed, whereas after processing by the quasi-recurrent neural network, the obtained target speech features are informative and accurately expressed.
  • The first layer in the QRNN is a convolution layer (Conv 1D), which is used to extract features from the input second speech feature; Sigmoid and Tanh are activation functions used in the QRNN; and the second layer is a pooling layer, which is used to reduce the number of features.
  • the pooling layer in QRNN adopts the fo-pool method.
  • The extraction of features from the second speech feature by the convolutional layer in the QRNN can be expressed by the following formula (3):
  • X represents the input second speech feature
  • Z, F, and O represent the gate outputs computed with the parameters W (W_z, W_f, and W_o, respectively);
  • W_z, W_f, and W_o represent convolution filters of a preset size R.
  • the features extracted by the convolutional layer are input into the pooling layer for processing, and the target speech features are output.
  • The processing by the pooling layer of the features extracted by the convolutional layer can be expressed by the following formulas (4) and (5):
  • c_t represents the cell state vector at time t;
  • h_t represents the hidden state vector at time t.
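  • The bodies of formulas (3), (4), and (5) are not reproduced above. The standard quasi-recurrent neural network equations that match these variable definitions (gated convolutions followed by fo-pooling) are given below as a reconstruction:

```latex
% Formula (3): gated convolutions of the input sequence X with filters W_z, W_f, W_o.
Z = \tanh(W_z * X), \qquad F = \sigma(W_f * X), \qquad O = \sigma(W_o * X)

% Formulas (4) and (5): fo-pooling, producing the cell state c_t and hidden state h_t.
c_t = f_t \odot c_{t-1} + (1 - f_t) \odot z_t
h_t = o_t \odot c_t
```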
  • S1024 to S1025 may be further included after S1022, and the details are as follows:
  • S1024 Extract a third speech feature corresponding to the second speech feature based on the quasi-recurrent neural network.
  • The third voice feature is of the same type as each feature included in the target voice feature; that is, the third voice feature includes MFCC features, Fbank features, wave features, Lps features, Gamma features, Proso features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features, and the like. This is only an exemplary description and is not limiting.
  • The second voice feature is input into the QRNN for processing to obtain the third voice feature corresponding to the second voice feature.
  • For the specific processing of the second speech feature by the quasi-recurrent neural network, reference may be made to the description in S1023, which is not repeated here.
  • S1025 Combine the second voice feature with the third voice feature in a skip connection manner to obtain the target voice feature.
  • Both the second voice feature and the third voice feature are represented as vectors, and the second voice feature and the third voice feature are added correspondingly (element-wise) to obtain the target voice feature. If a certain type of feature included in the third voice feature is not included in the second voice feature, the vector corresponding to that type of feature in the second voice feature defaults to 0. This is only an exemplary description and is not limiting.
  • the convolutional encoder is composed of 7 ConvNets, and each ConvNet has a corresponding processing result.
  • For example, combining the second voice feature with the third voice feature by means of skip connections may mean correspondingly adding the first processing result of the first ConvNet, the third processing result of the third ConvNet, and the fifth processing result of the fifth ConvNet to the third voice feature to obtain the target voice feature.
  • Alternatively, the second processing result of the second ConvNet, the fourth processing result of the fourth ConvNet, and the sixth processing result of the sixth ConvNet may be correspondingly added to the third voice feature to obtain the target voice feature. This is only an exemplary description and is not limiting.
  • In this way, the target speech feature is expressed as the sum of the features found by the convolutional encoder, so that the finally obtained target speech feature is more informative and more accurately expressed; a minimal sketch of this combination is given below.
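  • The sketch assumes that the second and third voice features are tensors whose last dimension concatenates the individual feature types; the zero-padding mirrors the zero-default behavior mentioned above, and the function name and tensor layout are illustrative only.

```python
import torch

def combine_skip(second_feature: torch.Tensor, third_feature: torch.Tensor) -> torch.Tensor:
    """Correspondingly (element-wise) add the second and third voice features."""
    if second_feature.shape[-1] < third_feature.shape[-1]:
        # Feature types absent from the second voice feature default to zero vectors.
        padded = torch.zeros_like(third_feature)
        padded[..., : second_feature.shape[-1]] = second_feature
        second_feature = padded
    return second_feature + third_feature
```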
  • In this embodiment, the speech feature extraction model takes the sample speech features corresponding to the original speech data in each sample speech data pair as the target and, based on self-supervised learning, is trained on the difference between the original speech data and the enhanced speech data in each sample speech data pair, where the enhanced speech data in each pair is obtained by performing data enhancement processing on the original speech data.
  • A speech feature extraction model trained in this way has learned the ability to extract, from the enhanced speech data, the speech features corresponding to the original speech data, which can be understood as the ability to extract, from distorted speech data, the speech features corresponding to the undistorted speech data. Therefore, the speech feature extraction model can extract effective, informative, and accurate target speech features in actual use.
  • When these target speech features are applied in intelligent speech task processing scenarios, the processing results are more accurate.
  • Moreover, the speech feature extraction model can generate enhanced speech data from the original speech data during training. On the one hand, this expands the quantity of sample speech data; on the other hand, it avoids manually preparing sample speech data, saving considerable manpower, money, and time.
  • FIG. 4 is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application.
  • the method may include S201-S206.
  • For steps S205 to S206 shown in FIG. 4, reference may be made to the relevant descriptions of S101 to S102 in the embodiment corresponding to FIG. 1, which are not repeated here for brevity. Steps S201 to S204 are specifically described below.
  • S201 Input a plurality of sample speech data pairs in the sample speech data set into an initial speech feature extraction model for processing to obtain a sample speech feature corresponding to each original speech data and a real speech feature corresponding to each enhanced speech data.
  • the sample speech data set includes a plurality of sample speech data pairs, and each sample speech data pair includes one original speech data and one enhanced speech data.
  • the enhanced voice data in each sample voice data pair is obtained from the original voice data in the sample voice data pair after data enhancement processing.
  • the data enhancement processing may be any one of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping speech processing, or any multiple processing.
  • Optionally, a probability value may be preset for each kind of data enhancement processing, and, based on the preset probability values, data enhancement processing is performed on the acquired original voice data to obtain the enhanced voice data corresponding to the original voice data in each sample voice data pair. Each probability value indicates the likelihood that the corresponding data enhancement processing is applied to a given piece of original speech data.
  • the probability value corresponding to reverberation processing is 0.5
  • the probability value corresponding to noise processing is 0.4
  • the probability value corresponding to frequency masking processing is 0.4
  • the probability value corresponding to time masking processing is 0.2
  • the probability value corresponding to clipping processing is 0.2
  • The probability value corresponding to overlapping speech processing is 0.1. That is to say, there is a probability of 0.5 of performing reverberation processing on a given piece of original voice data, a probability of 0.4 of performing noise addition processing, a probability of 0.4 of performing frequency masking processing, and so on; a minimal sketch of this sampling procedure is given below.
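  • The sketch assumes the augmentations are applied independently of one another, which the text does not state explicitly; the probability values are those listed above, and the augmentation callables themselves are hypothetical and left to the caller.

```python
import random

# Preset probability of applying each kind of data enhancement processing.
AUGMENT_PROBS = {
    "reverberation": 0.5,
    "noise": 0.4,
    "frequency_masking": 0.4,
    "time_masking": 0.2,
    "clipping": 0.2,
    "overlapping_speech": 0.1,
}

def make_enhanced_speech(original, augmentations):
    """Apply each augmentation to the original speech with its preset probability.

    `augmentations` maps a name in AUGMENT_PROBS to a callable that transforms
    the speech signal (e.g. {"noise": add_noise}); these callables are not
    specified by the text and are hypothetical.
    """
    enhanced = original
    for name, prob in AUGMENT_PROBS.items():
        if name in augmentations and random.random() < prob:
            enhanced = augmentations[name](enhanced)
    return enhanced
```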
  • reverberation processing is achieved by convolving the signal corresponding to the original speech data with a set of 1300 impulse responses, which are derived graphically. Impulse responses simulate different acoustic conditions with reverberation times ranging from 0.3 to 0.9 seconds.
  • the noise in the noise processing is extracted from the preset FreeSound dataset and DIRHA dataset.
  • the noise in the noise processing can include background noise and non-stationary noise, such as alarms, door knocks, telephone ringing, TV sounds, etc.
  • the signal-to-noise ratio is randomly sampled between 0 and 10dB.
  • the frequency masking process is realized by filtering the time signal corresponding to the original speech data with a band-stop filter.
  • the temporal masking process is achieved by setting random segments in the original speech data to zero. Clipping is achieved by adding random saturation to the raw speech data. Overlapping speech processing is implemented by overlapping speech signals in the original speech data with the main signal corresponding to the original speech data.
  • Inputting multiple sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing means that the original speech data in each sample speech data pair is input into the initial speech feature extraction model for processing, and
  • the enhanced speech data in each sample speech data pair is input into the initial speech feature extraction model for processing.
  • the initial speech feature extraction model outputs the sample speech features corresponding to each original speech data, and outputs the real speech features corresponding to each enhanced speech data.
  • the initial speech feature extraction model includes an initial convolution filter, an initial convolution encoder and an initial quasi-recurrent neural network.
  • the initial convolutional filter can be an interpretable convolutional filter (SincNet)
  • the initial convolutional encoder is composed of 7 convolutional neural network layers (ConvNet)
  • the initial quasi-cyclic neural network can be QRNN.
  • In FIG. 3, Skip Connections represents the skip connections;
  • FC represents the processing result of skip selection in 7 ConvNets.
  • The Workers at the top of FIG. 3 represent 12 self-supervised tasks, each implemented with a small feedforward neural network (typically one hidden layer with 256 hidden units).
  • Each of these 12 self-supervised tasks corresponds to one type of speech feature extracted from the speech data; this can be generally understood as supervising the sample speech features corresponding to each original speech data against the real speech features output for each enhanced speech data.
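  • A minimal sketch of one such worker is given below, assuming PyTorch; the class name, dimensions, and the use of a regression head are illustrative, with the single 256-unit hidden layer taken from the text.

```python
import torch.nn as nn

class Worker(nn.Module):
    """One self-supervised worker: a small feed-forward network (one hidden
    layer of 256 units) that predicts one type of speech feature from the
    encoder output."""

    def __init__(self, encoder_dim: int, feature_dim: int, hidden_units: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, hidden_units),
            nn.ReLU(),
            nn.Linear(hidden_units, feature_dim),
        )

    def forward(self, encoded):
        return self.net(encoded)
```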
  • the Speech Distortion (voice distortion) in Figure 3 represents the data enhancement process, and the speech segment below the Speech Distortion represents the original speech data.
  • a processing method is to process the original voice data through an initial voice feature extraction model to obtain sample voice features corresponding to the original voice data.
  • One processing method is to first perform Speech Distortion processing on the original voice data, that is, data enhancement processing, to obtain enhanced voice data corresponding to the original voice data, and then extract the real voice features corresponding to the enhanced voice data.
  • For the specific process of extracting the sample voice features and the real voice features, reference may be made to the description in S102, which is not repeated here.
  • S202 For each sample voice data pair, calculate, according to a preset loss function, the loss value between the sample voice feature corresponding to the original voice data in the sample voice data pair and the real voice feature corresponding to the enhanced voice data in the sample voice data pair.
  • The loss value between the sample voice feature corresponding to the original voice data in each sample voice data pair and the real voice feature corresponding to the enhanced voice data in that pair can be used to measure the accuracy of the voice features extracted by the initial voice feature extraction model.
  • It can be understood that the original voice data is pure voice data, that is, voice data without noise, impurities, or distortion, and the sample voice features corresponding to the original voice data are standard, informative, and accurately expressed voice features; these are the learning target of the initial speech feature extraction model.
  • The enhanced speech data is obtained by performing data enhancement processing on the original speech data and therefore contains noise and impurities. When voice features identical to the sample voice features corresponding to the original voice data can be extracted from the enhanced voice data, the training of the initial voice feature extraction model is complete.
  • the preset loss function may be a mean square error function, a mean absolute error function, etc., which is not limited.
  • the sample speech features may include MFCC features, Fbank features, wave features, Lps features, Gamma features, Proso features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features, and the like.
  • Correspondingly, the real speech features may also include waveform (wave) features, log power spectrum (Lps) features, spectral (MFCC) features, filter bank (Fbank) features, gamma features, prosody features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features, and the like.
  • the loss value between the sample speech features and the real speech features is calculated based on a preset loss function. It is worth noting that, since each sample speech feature and real speech feature contain corresponding multiple types of features, the final loss value is the sum of the loss values between each group of the same type of features.
  • the sample voice features include MFCC features, Fbank features, and wave features
  • the real voice features include MFCC features, Fbank features, and wave features.
  • Then the loss value between the sample voice feature and the real voice feature is the sum of the loss value between the MFCC feature of the sample voice feature and the MFCC feature of the real voice feature, the loss value between the Fbank feature of the sample voice feature and the Fbank feature of the real voice feature, and the loss value between the wave feature of the sample voice feature and the wave feature of the real voice feature. This is only an exemplary description and is not limiting; a minimal sketch of this computation is given below.
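  • The sketch assumes PyTorch tensors and uses mean squared error as one admissible choice of the preset loss function; the dictionary keys are illustrative.

```python
import torch
import torch.nn.functional as F

def total_feature_loss(sample_feats: dict, real_feats: dict) -> torch.Tensor:
    """Sum the per-feature-type losses between sample and real speech features."""
    total = torch.zeros(())
    for name, sample in sample_feats.items():   # e.g. "mfcc", "fbank", "wave"
        total = total + F.mse_loss(sample, real_feats[name])
    return total
```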
  • the preset condition may be that the loss value is less than or equal to the preset loss value threshold, or that the loss value falls within the preset error range, but it is not limited to this, and can also be set according to the actual situation, which is not limited here.
  • the preset condition is that the loss value is less than or equal to the preset loss value threshold. Then, when the device performing the training process confirms that the current loss value is greater than the preset loss value threshold, it is determined that the voice features extracted by the current initial voice feature extraction model have not yet met the requirements. At this time, it is necessary to adjust the model parameters of the initial speech feature extraction model, then return to S201, and continue to execute S201 and S202, until the loss value determined in S202 is less than or equal to the preset loss value threshold, execute S204.
  • the preset condition is that the loss value is less than or equal to the preset loss value threshold. Then, when the device performing the training process confirms that the current loss value is less than or equal to the preset loss value threshold, it determines that the training of the current initial speech feature extraction model meets the expected requirements, and stops training the initial speech feature extraction model.
  • At this point, the initial speech feature extraction model, after its model parameters have been adjusted, has been trained on a large number of samples, and its loss value remains within a small range.
  • Using the initial speech feature extraction model to process speech data can therefore yield informative and accurately expressed speech features. The initial speech feature extraction model at the moment training stops (that is, after the last round of training is completed) can thus be determined as the trained speech feature extraction model.
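  • A minimal sketch of this stopping criterion is shown below, assuming PyTorch; `compute_epoch_loss` stands for the (unspecified) routine that runs S201 and S202 over the sample speech data set, and the loop corresponds to adjusting the model parameters (S203) until the preset condition is met (S204).

```python
def train_until_converged(model, optimizer, compute_epoch_loss, loss_threshold, max_rounds=1000):
    """Repeat S201-S202 and adjust parameters until the loss meets the preset condition."""
    for _ in range(max_rounds):
        loss = compute_epoch_loss(model)      # S201 + S202: forward pass and loss value
        if loss.item() <= loss_threshold:     # preset condition satisfied
            break                             # S204: stop training
        optimizer.zero_grad()
        loss.backward()                       # S203: adjust the model parameters
        optimizer.step()
    return model
```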
  • the voice feature extraction model trained in this embodiment can extract the same voice features as the original voice data from the enhanced voice data, and the enhanced voice data is obtained by performing reverberation processing and noise processing on the original voice data.
  • In other words, the speech feature extraction model also learns how to denoise speech data and acquires the ability to be invariant to distortion.
  • the trained voice feature extraction model may also be uploaded to the blockchain.
  • uploading the trained voice feature extraction model to the blockchain can ensure its security and fairness and transparency to users.
  • The trained voice feature extraction model is uploaded to the blockchain; because files on the blockchain cannot be tampered with at will, the trained model is protected from malicious tampering, and subsequent users can obtain it directly and accurately.
  • It is also convenient for subsequent users to use the trained voice feature extraction model to process the voice data to be processed, thereby ensuring that informative, accurately expressed, and effective voice features are extracted.
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a series of data blocks associated with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • FIG. 5 is a schematic diagram of an apparatus for extracting speech features provided by an embodiment of the present application.
  • Each unit included in the apparatus is used to execute each step in the embodiment corresponding to FIG. 1 , FIG. 2 , and FIG. 4 .
  • an acquisition unit 310 configured to acquire the voice data to be processed
  • the processing unit 320 is configured to input the voice data into a trained voice feature extraction model for processing to obtain target voice features corresponding to the voice data.
  • The voice feature extraction model is obtained based on self-supervised learning by taking the sample speech feature corresponding to the original speech data in each sample speech data pair as the target and training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data enhancement processing on the original speech data.
  • the speech feature extraction model includes a convolution filter, a convolution encoder and a quasi-recurrent neural network, and the processing unit 320 is specifically used for:
  • inputting the voice data into the convolution filter for processing to obtain a first voice feature corresponding to the voice data, where the first voice feature includes a frequency feature;
  • performing convolution processing on the first voice feature by the convolutional encoder to obtain a second voice feature, where the second voice feature includes an MFCC feature and an Fbank feature; and
  • inputting the second voice feature into the quasi-recurrent neural network for processing to obtain the target voice feature, where the target voice feature includes a target waveform feature, a target log power spectrum feature, a target spectral feature, a target filter bank feature, a target gamma feature, and a target prosody feature.
  • Optionally, the processing unit 320 is further configured to:
  • extract a third voice feature corresponding to the second voice feature based on the quasi-recurrent neural network; and
  • combine the second voice feature with the third voice feature in a skip connection manner to obtain the target voice feature.
  • the device further includes:
  • the first training unit is used to input a plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing, and obtain the sample speech features corresponding to each original speech data and the real data corresponding to each enhanced speech data. voice characteristics;
  • a second training unit, configured to calculate, for each sample speech data pair and according to the preset loss function, the loss value between the sample speech feature corresponding to the original speech data in the sample speech data pair and the real speech feature corresponding to the enhanced speech data in that pair;
  • a third training unit configured to adjust the model parameters of the initial speech feature extraction model when the loss value does not meet the preset condition, and return to execute the step of inputting a plurality of sample speech data pairs in the sample speech data set into the The steps of processing in the initial voice feature extraction model to obtain the sample voice feature corresponding to each original voice data and the real voice feature corresponding to each enhanced voice data;
  • a fourth training unit configured to stop training the initial speech feature extraction model when the loss value satisfies the preset condition, and use the trained initial speech feature extraction model as the trained speech feature extraction model .
  • the real speech features include waveform features, logarithmic power spectral rate features, spectral features, filter bank features, gamma features, and prosody features.
  • Optionally, the data enhancement processing is any one or any combination of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping speech processing.
  • the device further includes:
  • the uploading unit is used for uploading the speech feature extraction model to the blockchain.
  • FIG. 6 is a schematic diagram of a terminal for extracting speech features provided by another embodiment of the present application.
  • the terminal 4 for extracting speech features in this embodiment includes: a processor 40 , a memory 41 , and computer instructions 42 that are stored in the memory 41 and run on the processor 40 .
  • When the processor 40 executes the computer instructions 42, the following is implemented:
  • acquiring voice data to be processed; and
  • inputting the voice data into the trained voice feature extraction model for processing to obtain the target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.
  • the computer instructions 42 may be divided into one or more units, and the one or more units are stored in the memory 41 and executed by the processor 40 to complete the present application.
  • the one or more units may be a series of computer instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer instruction 42 in the terminal 4 for extracting speech features.
  • the computer instructions 42 can be divided into an acquisition unit and a processing unit, and the specific functions of each unit are as described above.
  • the terminal for extracting the voice feature may include, but is not limited to, the processor 40 and the memory 41 .
  • FIG. 6 is only an example of the terminal 4 for extracting voice features and does not constitute a limitation on the terminal; it may include more or fewer components than shown in the figure, combine some components, or have different components. For example, the terminal for extracting voice features may also include an input/output terminal, a network access terminal, a bus, and the like.
  • the so-called processor 40 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 41 may be an internal storage unit of the terminal for extracting voice features, such as a hard disk or memory of the terminal for extracting voice features.
  • The memory 41 may also be an external storage terminal of the terminal for extracting voice features, such as a plug-in hard disk equipped on the terminal, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like.
  • the memory 41 may also include both an internal storage unit of the terminal for extracting voice features and an external storage terminal.
  • the memory 41 is used to store the computer instructions and other programs and data required by the terminal.
  • the memory 41 can also be used to temporarily store data that has been output or will be output.
  • The embodiments of the present application also provide a computer storage medium, which may be non-volatile or volatile; the computer storage medium stores a computer program, and when the computer program is executed by a processor, the following steps are implemented:
  • acquiring voice data to be processed; and
  • inputting the voice data into the trained voice feature extraction model for processing to obtain the target voice feature corresponding to the voice data, where the voice feature extraction model is obtained based on self-supervised learning by taking the sample voice feature corresponding to the original voice data in each sample voice data pair as the target and training on the difference between the original voice data and the enhanced voice data in each sample voice data pair, the enhanced voice data being obtained by performing data enhancement processing on the original voice data.

Abstract

The present application is applicable to the technical field of computers and provides a method and apparatus for extracting speech features, a terminal, and a storage medium, the method comprising: acquiring speech data to be processed; and inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data. The speech feature extraction model in the method is obtained by training, on the basis of self-supervised learning, differences between the original speech data and enhanced speech data in each sample speech data pair by taking sample speech features corresponding to original speech data in each sample speech data pair as targets. Effective, informative, and accurately expressed target speech features can be extracted on the basis of the speech feature extraction model, such that when the target speech features are applied to intelligent speech task processing scenarios, the processing results are more accurate.

Description

A method, device, terminal and storage medium for extracting speech features
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 29, 2020, with application number 202011602171.3 and entitled "A method, device, terminal and storage medium for extracting speech features", the entire content of which is incorporated herein by reference.
Technical Field
The present application belongs to the field of computer technology, and in particular relates to a method, device, terminal and storage medium for extracting speech features.
Background Art
As an important part of artificial intelligence, applications of intelligent speech technology typically retrain a speech model, or optimize an existing one, by labeling large amounts of supervised data, a process that consumes considerable manpower, money, and time. Moreover, very little labeled speech data is available for direct use as training samples, which hinders the training of speech models. Unsupervised speech feature extraction methods have therefore emerged.
Technical Problem
To sum up, the inventor realized that, due to the complexity and variability of speech data, it is difficult for existing speech models trained by unsupervised learning to learn effective features of the speech data, so the speech features extracted using such models are inaccurate.
Technical Solution
In view of this, the embodiments of the present application provide a method, device, terminal, and storage medium for extracting speech features, so as to solve the problem that existing speech models obtained by unsupervised learning struggle to learn effective features of speech data, which makes the speech features extracted with such models inaccurate.
本申请实施例的第一方面提供了一种提取语音特征的方法,包括:A first aspect of the embodiments of the present application provides a method for extracting speech features, including:
获取待处理的语音数据;Get the voice data to be processed;
将所述语音数据输入到已训练的语音特征提取模型中进行处理,得到所述语音数据对应的目标语音特征,所述语音特征提取模型是基于自监督学习,以每个样本语音数据对中的原始语音数据对应的样本语音特征为目标,对每个样本语音数据对中的原始语音数据和增强语音数据之间的差异性进行训练得到的,所述增强语音数据是对所述原始语音数据进行数据增强处理得到的。The voice data is input into the trained voice feature extraction model for processing, and the target voice feature corresponding to the voice data is obtained. The voice feature extraction model is based on self-supervised learning. The sample voice feature corresponding to the original voice data is the target, and is obtained by training the difference between the original voice data and the enhanced voice data in each sample voice data pair, and the enhanced voice data is obtained from the original voice data. data augmentation.
A second aspect of the embodiments of the present application provides an apparatus for extracting speech features, including:
an acquisition unit, configured to acquire speech data to be processed; and
a processing unit, configured to input the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained on the basis of self-supervised learning by taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data augmentation on the original speech data.
A third aspect of the embodiments of the present application provides a terminal for extracting speech features, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:
acquiring speech data to be processed; and
inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained on the basis of self-supervised learning by taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data augmentation on the original speech data.
A fourth aspect of the embodiments of the present application provides a computer storage medium storing a computer program which, when executed by a processor, implements:
acquiring speech data to be processed; and
inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained on the basis of self-supervised learning by taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data augmentation on the original speech data.
A fifth aspect of the embodiments of the present application provides a computer program product which, when run on a terminal for extracting speech features, causes the terminal to execute:
acquiring speech data to be processed; and
inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained on the basis of self-supervised learning by taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data augmentation on the original speech data.
Beneficial Effects
Compared with the prior art, the embodiments of the present application have the following beneficial effects: on the one hand, the amount of sample speech data is enlarged; on the other hand, sample speech data does not need to be provided manually, which saves a great deal of manpower, money and time.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for extracting speech features provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application;
FIG. 3 is a schematic diagram of the structure of the speech feature extraction model provided by the present application;
FIG. 4 is a schematic flowchart of a method for extracting speech features provided by yet another embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for extracting speech features provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a terminal for extracting speech features provided by another embodiment of the present application.
Embodiments of the Present Invention
To make the objectives, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
As an important part of artificial intelligence, applications of intelligent speech technology retrain a speech model, or optimize an original speech model, by annotating a large amount of supervised data, a process that consumes considerable manpower, money and time. Moreover, very little annotated speech data can be used directly as training samples, which is unfavorable for training speech models. Unsupervised speech feature extraction methods have therefore emerged.
However, the inventor realized that, due to the complexity and variability of speech data, it is difficult for existing speech models obtained by unsupervised learning to learn effective features of the speech data, so the speech features extracted with such models are inaccurate.
In view of this, the present application provides a method for extracting speech features. In this method, the speech feature extraction model is obtained by self-supervised learning, taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, where the enhanced speech data in each pair is obtained by applying data augmentation to the original speech data. A speech feature extraction model trained in this way learns the ability to extract, from enhanced speech data, the speech features corresponding to the original speech data, which can be understood as the ability to extract, from distorted speech data, the speech features corresponding to undistorted speech data; it also learns how to extract effective speech features, so that in actual use the model can extract effective, informative and accurately expressed target speech features. When these target speech features are then applied to intelligent speech task processing scenarios, the processing results are more accurate. Moreover, during training the model can generate enhanced speech data from the original speech data, which on the one hand enlarges the amount of sample speech data and on the other hand removes the need to provide sample speech data manually, saving a great deal of manpower, money and time.
Please refer to FIG. 1, which is a schematic flowchart of a method for extracting speech features provided by an embodiment of the present application. The method for extracting speech features in this embodiment is executed by a terminal, a server or the like, where the terminal includes, but is not limited to, mobile terminals such as smart phones, tablet computers, computers and personal digital assistants (Personal Digital Assistant, PDA), and may also include terminals such as desktop computers. This embodiment is described by taking a terminal as the execution body. As shown in FIG. 1, the method for extracting speech features may include S101 to S102, as follows:
S101: Acquire speech data to be processed.
The speech data to be processed is the speech data from which speech features need to be extracted. The extracted speech features can be applied in different intelligent speech task processing scenarios, for example speech recognition, speaker identification, language identification, speech translation, simultaneous interpretation and voice control.
Precisely because the features may be applied in different intelligent speech task processing scenarios, the speech data to be processed may be the same or different. For example, if speech features need to be extracted in a speaker identification scenario, the speech data to be processed may be a complete piece of speech uploaded to the terminal in advance; if speech features need to be extracted in a voice control scenario, the speech data to be processed may be speech uttered by the user and captured by a built-in sound pickup device (for example, a microphone or a sound card). This is only an exemplary description and is not limiting.
Exemplarily, the way the speech data to be processed is acquired also differs between application scenarios. When the application scenario requires results in real time, for example simultaneous interpretation or voice control, the speech data may be acquired by capturing the user's speech through a built-in sound pickup device (for example, a microphone or a sound card).
When the application scenario does not require results in real time, for example speaker identification, the speech data may be acquired by the user uploading the speech data to be processed to the terminal in advance, and the terminal acquiring it. Alternatively, when the terminal detects a feature extraction instruction, it may acquire, according to the file identifier contained in the instruction, the file corresponding to that identifier and extract the speech data to be processed from it. This is only an exemplary description and is not limiting.
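As a concrete illustration of the file-based case of S101, the following is a minimal sketch (not part of the original disclosure) that loads a speech file into a waveform for further processing; the file path and target sample rate are assumptions used only for illustration.

# Minimal sketch of acquiring speech data to be processed from a file.
# The path and target sample rate are illustrative assumptions.
import torchaudio

def acquire_speech_data(path: str, target_sr: int = 16000):
    waveform, sr = torchaudio.load(path)                     # (channels, samples)
    if sr != target_sr:                                      # resample if needed
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform.mean(dim=0, keepdim=True), target_sr     # mono waveform

# waveform, sr = acquire_speech_data("utterance.wav")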
S102: Input the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained on the basis of self-supervised learning by taking the sample speech features corresponding to the original speech data in each sample speech data pair as the target and training on the differences between the original speech data and the enhanced speech data in each sample speech data pair, the enhanced speech data being obtained by performing data augmentation on the original speech data.
In this embodiment, a pre-trained speech feature extraction model is stored in advance in the terminal that extracts speech features. The speech feature extraction model adopts self-supervised learning and is trained on the differences between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target.
The enhanced speech data in each sample speech data pair is obtained by performing data augmentation on the original speech data in that pair. It can be understood that the original speech data is clean speech data, that is, speech data without noise, impurities or distortion. The enhanced speech data is obtained by applying to the original speech data any one or any combination of reverberation processing, noise addition, frequency masking, time masking, clipping, and overlapping speech processing.
In the prior art, the sample speech data that is usually acquired is speech data containing noise, impurities and distortion, together with speech features extracted from that data. With those speech features as the learning target, the speech data and speech features are trained by machine learning so that the resulting speech model has the ability to extract effective speech features from noisy, impure and distorted speech data. However, with this way of training, because of the complexity and variability of speech data, and because the learning target is itself a set of speech features extracted from noisy, impure and distorted speech data, the speech model learns many meaningless features during training. Together with the interference caused by the complexity and variability of speech data, the finally trained speech model cannot extract effective, accurate and rich speech features when actually processing speech data, and its processing results are therefore inaccurate when it is applied in various intelligent speech task processing scenarios.
Alternatively, the prior art also trains speech models by unsupervised learning, which searches for structure in the input data without a target in order to better understand correlations in the data. Current unsupervised speech feature extraction methods mainly include principal component analysis and methods based on Gaussian mixture models. Both methods assume that the speech data follows a Gaussian distribution, and both require manual dimensionality reduction during execution. However, speech data does not necessarily follow a Gaussian distribution, and manual dimensionality reduction inevitably causes the loss of high-dimensional features, so the speech model cannot extract effective, accurate and rich speech features when actually processing speech data, and its processing results are inaccurate when it is applied in various intelligent speech task processing scenarios.
In contrast, the present application adopts self-supervised learning and takes the sample speech features extracted from the original speech data as the learning target. The target is explicit, and because the original speech data contains no noise, impurities or distortion, the sample speech features extracted from it are more accurate, rich and effective.
Performing data augmentation on the original speech data to obtain enhanced speech data is, on the one hand, equivalent to increasing the number of training samples; on the other hand, known transformations are applied to the original speech data, which makes it easy to control the types of enhanced speech data. That is, when the original speech data is augmented, the types of augmentation can be controlled so that the speech feature extraction model learns various effective speech features in a targeted way during training. As a result, the finally trained speech feature extraction model can extract effective, accurate and rich speech features when actually processing speech data, and its processing results are more accurate when it is applied in various intelligent speech task processing scenarios.
It can be understood that the speech feature extraction model may be pre-trained by the terminal that extracts speech features, or it may be pre-trained by another device and the file corresponding to the model then transplanted to the terminal. In other words, the execution body that trains the speech feature extraction model and the execution body that uses it for speech feature extraction may be the same or different. For example, when another device trains the initial speech feature extraction model, after finishing the training it fixes the model parameters of the initial model to obtain the file corresponding to the trained speech feature extraction model, and this file is then transplanted to the terminal that extracts speech features.
Please refer to FIG. 2, which is a schematic flowchart of a method for extracting speech features provided by another embodiment of the present application. Optionally, in a possible implementation, as shown in FIG. 2, the above S102 may include S1021 to S1023, as follows:
S1021: Input the speech data into the convolution filter for processing to obtain first speech features corresponding to the speech data, where the first speech features include frequency features.
The trained speech feature extraction model includes a convolution filter, a convolutional encoder and a quasi-recurrent neural network. Please refer to FIG. 3, which is a schematic diagram of the structure of the speech feature extraction model provided by the present application. The convolution filter may be an interpretable convolution filter (SincNet), the convolutional encoder is composed of 7 convolutional neural network layers (ConvNet), and the quasi-recurrent neural network may be a quasi-recurrent neural network (QRNN). This is only an exemplary description and is not limiting.
Exemplarily, when the trained speech feature extraction model processes the speech data to be processed, it may first convert the speech data into a waveform; specifically, the speech data may be converted by existing speech-to-waveform software, which is not repeated here. The converted waveform is input into SincNet, and SincNet performs a time-domain convolution on the input waveform based on a sliding window of preset duration to obtain the first speech features corresponding to the speech data. The first speech features may include frequency features, Mel-frequency cepstral coefficient (MFCC) features, filter bank (Fbank) features, waveform (wave) features, log-power spectrum (Lps) features and so on, where the frequency features may include audio features, fundamental frequency features, frequency band features and the like. The preset duration can be adjusted according to the actual situation; for example, in this embodiment it may be set to a 10-millisecond sliding window. Since speech data is sequential in time, performing the time-domain convolution on the input waveform based on a sliding window of preset duration can be understood as performing the time-domain convolution on a 10-millisecond segment of the waveform each time, until the whole input waveform has been processed.
Exemplarily, the time-domain convolution performed by SincNet on the input waveform can be expressed by the following formula (1):
y[n] = x[n] * h[n] = Σ_{l=0}^{L−1} x[l]·h[n−l]    (1)
In the above formula (1), y[n] denotes the first speech features output by SincNet, x[n] denotes the input waveform, and h[n] is a preset filter of length L.
This is only an exemplary description and is not limiting.
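As a minimal sketch of the time-domain filtering of formula (1) (an illustration, not the patented implementation), a bank of learnable filters can be applied to the raw waveform with a strided 1-D convolution; the filter length, number of filters and the 10 ms hop below are assumed values.

# Minimal sketch of the time-domain convolution of formula (1): y[n] = sum_l x[l]*h[n-l].
# Filter length L, number of filters and the 10 ms hop are illustrative assumptions.
import torch
import torch.nn.functional as F

def sinc_like_filtering(waveform: torch.Tensor, filters: torch.Tensor, sample_rate: int = 16000):
    """waveform: (1, T) mono signal; filters: (num_filters, L) kernels h[n]."""
    hop = int(0.010 * sample_rate)                    # 10 ms sliding window step
    x = waveform.unsqueeze(0)                         # (batch=1, channel=1, T)
    h = filters.unsqueeze(1)                          # (num_filters, 1, L)
    y = F.conv1d(x, h, stride=hop, padding=h.shape[-1] // 2)
    return y                                          # (1, num_filters, frames): first speech features

# filters = torch.randn(80, 251)   # e.g. 80 filters of length L = 251 (placeholder, normally sinc-parameterized)
# first_feats = sinc_like_filtering(torch.randn(1, 16000), filters)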
S1022: Perform convolution on the first speech features with the convolutional encoder to obtain second speech features, where the second speech features include MFCC features and Fbank features.
The first speech features are input into the convolutional encoder for convolution to obtain the second speech features. The second speech features may include MFCC features, Fbank features, wave features, Lps features, gamma (Gamma) features, prosody (Proso) features and so on.
The convolutional encoder is composed of 7 ConvNets. The first ConvNet performs convolution on the first speech features to obtain a first processing result. The first processing result is input to the second ConvNet, which performs convolution on it to obtain a second processing result, and so on, until the last ConvNet performs convolution on the processing result passed from the previous ConvNet and outputs the second speech features.
Exemplarily, the first ConvNet convolves the first speech features with preset convolution kernels, which can be understood as the first ConvNet performing feature selection among the first speech features and removing redundant features to obtain the first processing result, for example extracting MFCC features, Fbank features, wave features, Lps features, gamma (Gamma) features, prosody (Proso) features and the like from the information in the first speech features. The first processing result is input into the second ConvNet, which convolves further on the basis of the features extracted by the first ConvNet to extract deeper features and obtain the second processing result. By analogy, the second speech features are obtained after the last ConvNet performs convolution on the processing result passed from the previous ConvNet.
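The stacked encoder described above can be sketched as a chain of 1-D convolution blocks, each passing its result to the next; the channel widths, kernel sizes and activation below are assumptions for illustration and not taken from the original disclosure.

# Minimal sketch of a 7-layer convolutional encoder that refines the first speech features.
# Channel sizes, kernel sizes and the ReLU activation are illustrative assumptions.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, in_channels: int = 80, hidden: int = 256, num_layers: int = 7):
        super().__init__()
        layers = []
        for i in range(num_layers):
            layers.append(nn.Conv1d(in_channels if i == 0 else hidden, hidden,
                                    kernel_size=3, padding=1))
            layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)      # each block passes its result to the next ConvNet

    def forward(self, first_feats: torch.Tensor) -> torch.Tensor:
        # first_feats: (batch, in_channels, frames) -> second speech features (batch, hidden, frames)
        return self.net(first_feats)

# encoder = ConvEncoder(); second_feats = encoder(torch.randn(1, 80, 100))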
Optionally, in a possible implementation, in order to make the extracted second speech features more accurate and eliminate differences between speech features that may be caused by gender or age, the seventh processing result may be input into a downsampling layer for processing, and the downsampling layer then outputs the second speech features.
Exemplarily, the processing of the seventh processing result by the downsampling layer can be expressed by the following formula (2):
P_{j,m} = (1/r) Σ_{k=1}^{r} x_{j,(m−1)·n+k}    (2)
In the above formula (2), P_{j,m} denotes the output of the downsampling layer, where j indexes the processing result of the j-th ConvNet, m denotes the m-th downsampling band, n denotes the downsampling factor, and r denotes the size of the downsampling window, that is, how many frequency bands of data are downsampled together.
This is only an exemplary description and is not limiting.
In this embodiment, because different people have different organ structures and vocal habits, the extracted features often show certain differences, which manifest as spectral shifts; for example, men's voices are generally lower in frequency than women's, and adults' are generally lower than children's. Processing by the downsampling layer can largely eliminate these differences, making the extracted speech features more accurate.
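A minimal sketch of such a band-wise downsampling step is given below; treating it as average pooling over r neighbouring bands with stride n is an assumption consistent with formula (2), not a statement of the patented implementation.

# Minimal sketch of the downsampling layer: average r neighbouring bands with stride n.
# Interpreting formula (2) as band-wise average pooling is an assumption.
import torch
import torch.nn.functional as F

def downsample_bands(conv_out: torch.Tensor, r: int = 4, n: int = 2) -> torch.Tensor:
    """conv_out: (batch, bands, frames) output of the 7th ConvNet."""
    x = conv_out.transpose(1, 2)                          # (batch, frames, bands)
    pooled = F.avg_pool1d(x, kernel_size=r, stride=n)     # P_{j,m}: average of r bands per window
    return pooled.transpose(1, 2)                         # (batch, downsampled_bands, frames)

# second_feats = downsample_bands(torch.randn(1, 256, 100))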
S1023: Input the second speech features into the quasi-recurrent neural network for processing to obtain the target speech features, where the target speech features include target waveform features, target log-power spectrum features, target spectrum features, target filter bank features, target gamma features and target prosody features.
The second speech features are input into the QRNN for processing to obtain the target speech features corresponding to the speech data to be processed. The target speech features include target waveform features, target log-power spectrum features, target spectrum features, target filter bank features, target gamma features and target prosody features, and may also include long-term log-power spectrum (Long-Lps) features, long-term Mel-frequency cepstral coefficient (Long-MFCC) features, long-term filter bank (Long-Fbank) features, long-term gamma (Long Gamma) features and so on. It is worth noting that some of the features in the first speech features, the second speech features and the target speech features are of the same type; the difference is that the features extracted as first and second speech features are not very informative and not expressed very accurately, whereas after processing by the quasi-recurrent neural network the resulting target speech features are informative and accurately expressed.
As shown in FIG. 3, the first layer of the QRNN is a convolution layer (Conv 1D) used to extract features from the input second speech features, Sigmoid and Tanh are functions used in the QRNN, and the second layer is a pooling layer used to reduce the number of features; the difference is that the pooling layer in the QRNN adopts the fo-pool method. Exemplarily, the extraction of features from the second speech features by the convolution layer in the QRNN can be expressed by the following formula (3):
Z = tanh(W_z * X)
F = σ(W_f * X)
O = σ(W_o * X)    (3)
In the above formula (3), X denotes the input second speech features, Z, F and O denote the gates computed with the parameters W, and W_z, W_f and W_o denote convolution filters of preset size R. When the filter width is 2, the above formula (3) can be expressed as:
z_t = tanh(W_z^1·x_{t−1} + W_z^2·x_t)
f_t = σ(W_f^1·x_{t−1} + W_f^2·x_t)
o_t = σ(W_o^1·x_{t−1} + W_o^2·x_t)
That is, the larger the width of the filter, the more time steps of features can be taken into account, and the higher-level the features that can be computed.
The features extracted by the convolution layer are input into the pooling layer for processing, and the target speech features are output. The processing of the features extracted by the convolution layer can be implemented by the pooling layer through the following formulas (4) and (5):
c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ z_t    (4)
h_t = o_t ⊙ c_t    (5)
In the above formula (4), c_t denotes the cell state vector at time t, and in the above formula (5), h_t denotes the hidden state vector at time t.
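A minimal sketch of a single QRNN layer with width-2 gated convolution followed by fo-pooling, following formulas (3) to (5), is shown below; the hidden size and tensor layout are assumptions for illustration.

# Minimal sketch of one QRNN layer: width-2 gated convolution followed by fo-pooling
# (formulas (3)-(5)). Hidden size and tensor layout are illustrative assumptions.
import torch
import torch.nn as nn

class QRNNLayer(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One Conv1d produces Z, F, O together; kernel_size=2 is the filter width of formula (3).
        self.conv = nn.Conv1d(input_size, 3 * hidden_size, kernel_size=2, padding=1)
        self.hidden_size = hidden_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, input_size, time)
        zfo = self.conv(x)[:, :, :x.shape[-1]]            # causal trim so step t sees x_{t-1}, x_t
        z, f, o = zfo.chunk(3, dim=1)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        c = torch.zeros(x.shape[0], self.hidden_size, device=x.device)
        outputs = []
        for t in range(x.shape[-1]):                      # fo-pool: c_t = f_t*c_{t-1} + (1-f_t)*z_t
            c = f[:, :, t] * c + (1 - f[:, :, t]) * z[:, :, t]
            outputs.append(o[:, :, t] * c)                # h_t = o_t * c_t
        return torch.stack(outputs, dim=-1)               # (batch, hidden_size, time)

# layer = QRNNLayer(256, 256); target_feats = layer(torch.randn(1, 256, 100))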
Optionally, in a possible implementation, in order to make the extracted target speech features more informative and more accurately expressed, S1024 to S1025 may further be included after S1022, as follows:
S1024: Extract, based on the quasi-recurrent neural network, third speech features corresponding to the second speech features.
The third speech features are of the same types as the features included in the target speech features; that is, the third speech features include MFCC features, Fbank features, wave features, Lps features, gamma (Gamma) features, prosody (Proso) features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features and so on. This is only an exemplary description and is not limiting.
The second speech features are input into the QRNN for processing to obtain the third speech features corresponding to the second speech features. For the specific processing of the second speech features by the quasi-recurrent neural network, reference may be made to the description in S1023, which is not repeated here.
S1025: Combine the second speech features with the third speech features by means of skip connections to obtain the target speech features.
Both the second speech features and the third speech features are represented as vectors, and the second speech features and the third speech features are added in correspondence to obtain the target speech features. If a certain type of feature included in the third speech features is not present in the second speech features, the vector corresponding to that type of feature in the second speech features defaults to 0. This is only an exemplary description and is not limiting.
Optionally, in a possible implementation, it can be seen from S1022 that the convolutional encoder is composed of 7 ConvNets, each of which has a corresponding processing result. Combining the second speech features with the third speech features by skip connections may be: adding the first processing result of the first ConvNet, the third processing result of the third ConvNet and the fifth processing result of the fifth ConvNet to the third speech features in correspondence to obtain the target speech features; or adding the first processing result of the first ConvNet, the third processing result of the third ConvNet, the fifth processing result of the fifth ConvNet and the seventh processing result of the seventh ConvNet to the third speech features in correspondence to obtain the target speech features; or adding the second processing result of the second ConvNet, the fourth processing result of the fourth ConvNet and the sixth processing result of the sixth ConvNet to the third speech features in correspondence to obtain the target speech features. This is only an exemplary description and is not limiting.
In this embodiment, the target speech features are expressed as the sum of the features discovered by the convolutional encoder, which makes the finally obtained target speech features more informative and more accurately expressed.
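A minimal sketch of combining features by skip connections through corresponding addition is shown below; the particular layer selection and the tensor shapes are assumptions, and the zero-padding of missing feature types follows the description above.

# Minimal sketch of the skip connection in S1025: element-wise addition of selected
# encoder outputs (second speech features) and the QRNN output (third speech features).
# Shapes and the particular layer selection are illustrative assumptions.
import torch

def combine_with_skip(encoder_results: list, third_feats: torch.Tensor, picks=(0, 2, 4)) -> torch.Tensor:
    """encoder_results: list of 7 tensors, one per ConvNet; third_feats: QRNN output."""
    target = third_feats.clone()
    for i in picks:                      # e.g. the 1st, 3rd and 5th ConvNet results
        skip = encoder_results[i]
        if skip.shape != target.shape:   # a feature type missing on one side counts as zero
            continue
        target = target + skip           # corresponding (element-wise) addition
    return target

# target_feats = combine_with_skip([torch.randn(1, 256, 100) for _ in range(7)],
#                                  torch.randn(1, 256, 100))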
In the embodiments of the present application, the speech feature extraction model takes the sample speech features corresponding to the original speech data in each sample speech data pair as the target and is obtained by training, on the basis of self-supervised learning, on the differences between the original speech data and the enhanced speech data in each sample speech data pair, where the enhanced speech data in each pair is obtained by applying data augmentation to the original speech data. A speech feature extraction model trained in this way learns the ability to extract, from enhanced speech data, the speech features corresponding to the original speech data, which can be understood as the ability to extract, from distorted speech data, the speech features corresponding to undistorted speech data. In actual use, the model can therefore extract effective, informative and accurately expressed target speech features, and when these target speech features are applied to intelligent speech task processing scenarios, the processing results are more accurate. Moreover, during training the model can generate enhanced speech data from the original speech data, which on the one hand enlarges the amount of sample speech data and on the other hand removes the need to provide sample speech data manually, saving a great deal of manpower, money and time.
Please refer to FIG. 4, which is a schematic flowchart of a method for extracting speech features provided by yet another embodiment of the present application. The method may include S201 to S206. For steps S205 to S206 shown in FIG. 4, reference may be made to the related descriptions of S101 to S102 in the embodiment corresponding to FIG. 1, which are not repeated here for brevity. Steps S201 to S204 are described in detail below.
S201: Input a plurality of sample speech data pairs in a sample speech data set into an initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data.
The sample speech data set includes a plurality of sample speech data pairs, and each sample speech data pair includes one original speech data and one enhanced speech data, where the enhanced speech data in each pair is obtained by applying data augmentation to the original speech data in that pair. The data augmentation may be any one or any combination of reverberation processing, noise addition, frequency masking, time masking, clipping, and overlapping speech processing.
Exemplarily, a probability value may be preset for each kind of data augmentation, and data augmentation is applied to the original speech data in each acquired sample speech data pair based on the preset probability values to obtain the enhanced speech data corresponding to the original speech data in that pair. A probability value represents the likelihood that the corresponding data augmentation is applied to each original speech data.
For example, the probability value corresponding to reverberation processing is 0.5, that of noise addition is 0.4, that of frequency masking is 0.4, that of time masking is 0.2, that of clipping is 0.2, and that of overlapping speech processing is 0.1. That is, there is a probability of 0.5 that a given original speech data is reverberated, a probability of 0.4 that it has noise added, a probability of 0.4 that it is frequency-masked, a probability of 0.2 that it is time-masked, and a probability of 0.2 that it is clipped. It is worth noting that, although a probability value is set for each different kind of data augmentation, the number of augmentations applied to each original speech data is not limited: it may be one of them, or a combination of several of them, determined according to the probability values.
Exemplarily, reverberation processing is implemented by convolving the signal corresponding to the original speech data with a set of 1300 impulse responses derived with the image method; the impulse responses simulate different acoustic conditions, with reverberation times between 0.3 and 0.9 seconds. The noise used in noise addition is drawn from the preset FreeSound data set and DIRHA data set and may include background noise and non-stationary noise such as alarms, knocking, telephone ringing and television sound, with the signal-to-noise ratio sampled randomly between 0 and 10 dB. Frequency masking is implemented by filtering the time-domain signal corresponding to the original speech data with a band-stop filter. Time masking is implemented by setting random segments of the original speech data to zero. Clipping is implemented by adding random saturation to the original speech data. Overlapping speech processing is implemented by overlapping a speech signal with the main signal corresponding to the original speech data. These are all exemplary descriptions and are not limiting.
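A minimal sketch of the probability-driven augmentation described above is given below; the augmentation functions are placeholders, and the probability table simply restates the example values from the text.

# Minimal sketch of applying data augmentation with per-type probabilities to obtain
# enhanced speech data from original speech data. The augmentation functions are
# placeholders; only the probability values come from the example in the text.
import random
import numpy as np

AUG_PROBS = {
    "reverb": 0.5, "add_noise": 0.4, "freq_mask": 0.4,
    "time_mask": 0.2, "clip": 0.2, "overlap_speech": 0.1,
}

def time_mask(x: np.ndarray) -> np.ndarray:
    start = random.randrange(0, max(1, len(x) - 1600))
    y = x.copy()
    y[start:start + 1600] = 0.0            # zero out a random segment
    return y

AUG_FUNCS = {"time_mask": time_mask}       # the other augmentations would be registered here

def augment(original: np.ndarray) -> np.ndarray:
    enhanced = original
    for name, p in AUG_PROBS.items():      # each augmentation fires independently with its probability
        if name in AUG_FUNCS and random.random() < p:
            enhanced = AUG_FUNCS[name](enhanced)
    return enhanced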
The plurality of sample speech data pairs in the sample speech data set are input into the initial speech feature extraction model for processing; that is, the original speech data in each sample speech data pair is input into the initial speech feature extraction model for processing, and the enhanced speech data in each sample speech data pair is input into the initial speech feature extraction model for processing. The initial speech feature extraction model outputs the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data.
Exemplarily, as shown in FIG. 3, during training of the speech feature extraction model, the initial speech feature extraction model includes an initial convolution filter, an initial convolutional encoder and an initial quasi-recurrent neural network, where the initial convolution filter may be an interpretable convolution filter (SincNet), the initial convolutional encoder is composed of 7 convolutional neural network layers (ConvNet), and the initial quasi-recurrent neural network may be a QRNN. "Skip Connections" denotes the skip connections, and FC denotes the processing results selected by skipping among the 7 ConvNets. The "Workers" at the top of FIG. 3 denote 12 self-supervised tasks, each implemented with a small feed-forward neural network (typically one hidden layer with 256 hidden units). It can be clearly seen that each of the 12 self-supervised tasks corresponds to one speech feature extracted from the speech data, which can be understood informally as supervising the difference between the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data, and adjusting the model parameters of the initial speech feature extraction model according to that difference until the real speech features corresponding to each enhanced speech data are the same as the sample speech features corresponding to each original speech data.
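A minimal sketch of one such worker, a small feed-forward network with a single 256-unit hidden layer that regresses one feature type from the shared representation, is shown below; the input and output dimensions are assumptions.

# Minimal sketch of one self-supervised "worker": a small feed-forward network
# (one hidden layer of 256 units) that predicts one feature type from the shared
# representation. Input/output dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class Worker(nn.Module):
    def __init__(self, repr_dim: int = 256, feature_dim: int = 40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(repr_dim, 256),     # single hidden layer with 256 hidden units
            nn.ReLU(),
            nn.Linear(256, feature_dim),  # regress one feature type, e.g. MFCC or Fbank
        )

    def forward(self, shared_repr: torch.Tensor) -> torch.Tensor:
        return self.net(shared_repr)

# workers = nn.ModuleList([Worker(feature_dim=d) for d in (40, 80, 1, 257)])  # one per task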
In FIG. 3, "Speech Distortion" denotes the data augmentation processing, and the speech segment below "Speech Distortion" denotes the original speech data. Optionally, one way of processing is that the initial speech feature extraction model processes the original speech data to obtain the sample speech features corresponding to the original speech data. Another way of processing is to first perform Speech Distortion, that is, data augmentation, on the original speech data to obtain the enhanced speech data corresponding to that original speech data, and then extract the real speech features corresponding to the enhanced speech data. For the specific process of extracting the sample speech features and the real speech features, reference may be made to the description in S102, which is not repeated here.
S202: For each sample speech data pair, calculate, according to a preset loss function, the loss value between the sample speech features corresponding to the original speech data in the sample speech data pair and the real speech features corresponding to the enhanced speech data in that pair.
The loss value between the sample speech features corresponding to the original speech data in each sample speech data pair and the real speech features corresponding to the enhanced speech data in that pair can be used to measure the accuracy of the speech features extracted by the initial speech feature extraction model. It can be understood that the original speech data is clean speech data, that is, speech data without noise, impurities or distortion, so the sample speech features corresponding to the original speech data are standard, informative and accurately expressed speech features, which is exactly the learning target of the initial speech feature extraction model. The enhanced speech data is obtained by applying data augmentation to the original speech data and contains noise, impurities and the like. When the same speech features as the sample speech features corresponding to the original speech data can be extracted from the enhanced speech data, the training of the initial speech feature extraction model is shown to be complete.
The preset loss function may be a mean squared error function, a mean absolute error function or the like, which is not limited. The sample speech features may include MFCC features, Fbank features, wave features, Lps features, gamma (Gamma) features, prosody (Proso) features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features and so on. The real speech features may likewise include waveform features (wave features), log-power spectrum features (Lps features), spectrum features (MFCC features), filter bank features (Fbank features), gamma features, prosody features, Long-Lps features, Long-MFCC features, Long-Fbank features, Long Gamma features and so on.
For the original speech data and the enhanced speech data in each sample speech data pair, the loss value between the sample speech features and the real speech features is calculated based on the preset loss function. It is worth noting that, because the sample speech features and the real speech features each contain multiple corresponding types of features, the final loss value is the sum of the loss values between each group of features of the same type. For example, if the sample speech features include MFCC features, Fbank features and wave features, and the real speech features include MFCC features, Fbank features and wave features, then the loss value between the sample speech features and the real speech features is the sum of the loss value between the MFCC features of the sample speech features and the MFCC features of the real speech features, the loss value between the Fbank features of the sample speech features and the Fbank features of the real speech features, and the loss value between the wave features of the sample speech features and the wave features of the real speech features. This is only an exemplary description and is not limiting.
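A minimal sketch of this summed, per-feature-type loss is shown below; using mean squared error for every feature type is one of the options named above, and the dictionary keys are illustrative.

# Minimal sketch of the total loss in S202: sum of per-feature-type losses between the
# sample speech features (from original data) and the real speech features (from
# enhanced data). MSE is used here; the feature keys are illustrative.
import torch
import torch.nn.functional as F

def total_loss(sample_feats: dict, real_feats: dict) -> torch.Tensor:
    loss = torch.zeros(())
    for name, target in sample_feats.items():             # e.g. "mfcc", "fbank", "wave"
        pred = real_feats.get(name, torch.zeros_like(target))
        loss = loss + F.mse_loss(pred, target)            # one term per feature type
    return loss

# loss = total_loss({"mfcc": torch.randn(1, 40)}, {"mfcc": torch.randn(1, 40)})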
After the loss value is calculated, it is judged whether the loss value satisfies a preset condition. When the loss value does not satisfy the preset condition, S203 is executed; when the loss value satisfies the preset condition, S204 is executed. The preset condition may be that the loss value is less than or equal to a preset loss value threshold, or that the loss value falls within a preset error range, but it is not limited to these and may also be set according to the actual situation, which is not limited here.
S203: When the loss value does not satisfy the preset condition, adjust the model parameters of the initial speech feature extraction model, and return to the step of inputting the plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data.
For example, suppose the preset condition is that the loss value is less than or equal to the preset loss value threshold. Then, when the device performing the training process confirms that the current loss value is greater than the preset loss value threshold, it determines that the speech features extracted by the current initial speech feature extraction model do not yet meet the requirement. At this point, the model parameters of the initial speech feature extraction model need to be adjusted, after which the process returns to S201 and continues with S201 and S202 until the loss value determined in S202 is less than or equal to the preset loss value threshold, whereupon S204 is executed.
S204: When the loss value satisfies the preset condition, stop training the initial speech feature extraction model, and use the trained initial speech feature extraction model as the trained speech feature extraction model.
For example, suppose the preset condition is that the loss value is less than or equal to the preset loss value threshold. Then, when the device performing the training process confirms that the current loss value is less than or equal to the preset loss value threshold, it determines that the training of the current initial speech feature extraction model meets the expected requirement and stops training the initial speech feature extraction model.
At this point, the initial speech feature extraction model with adjusted parameters has been trained on a large number of samples and its loss value stays within a small range, so using it to process speech data yields informative and accurately expressed speech features. Therefore, the initial speech feature extraction model at the time training is stopped (that is, after the last round of training is completed) can be taken as the trained speech feature extraction model.
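A minimal sketch of the S201 to S204 training loop is shown below; the optimizer, threshold value and data handling are assumptions for illustration and are not taken from the original disclosure.

# Minimal sketch of the S201-S204 loop: extract features for each (original, enhanced)
# pair, compute the summed loss, and adjust parameters until the loss meets the threshold.
# Optimizer choice, threshold and data handling are illustrative assumptions.
# `workers` is assumed to be an nn.ModuleDict mapping a feature name to a small regression head.
import torch

def train(model, workers, pairs, total_loss_fn, threshold: float = 0.01, max_epochs: int = 100):
    params = list(model.parameters()) + list(workers.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for original, enhanced in pairs:                          # S201: process each sample pair
            sample_feats = {n: w(model(original)).detach()        # features from clean data act as the target
                            for n, w in workers.items()}
            real_feats = {n: w(model(enhanced)) for n, w in workers.items()}
            loss = total_loss_fn(sample_feats, real_feats)        # S202: summed per-feature loss
            optimizer.zero_grad()
            loss.backward()                                       # S203: adjust model parameters
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / max(1, len(pairs)) <= threshold:          # S204: stop when the condition is met
            break
    return model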
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic and should not constitute any limitation on the implementation of the embodiments of the present application.
The speech feature extraction model trained in this embodiment can extract from the enhanced speech data the same speech features as those of the original speech data, while the enhanced speech data is obtained by applying reverberation, noise addition and other processing to the original speech data. From another perspective, the speech feature extraction model has also learned how to denoise speech data and to be invariant to distortion.
Experiments show that when the speech features extracted by this speech feature extraction model are applied in scenarios such as speech recognition, speaker identification, language identification, speech translation, simultaneous interpretation and voice control, the processing results are clearly better than those of existing speech models and MFCC systems.
Optionally, in a possible implementation, after S102 or after S204, the trained speech feature extraction model may also be uploaded to a blockchain.
In this embodiment, uploading the trained speech feature extraction model to a blockchain ensures its security and its fairness and transparency toward users. Because files recorded on a blockchain cannot be tampered with at will, this prevents malicious tampering with the trained model, allows subsequent users to obtain the trained model directly and reliably, and makes it convenient for them to process speech data with it, ensuring that informative, accurately expressed, and effective speech features are extracted.
The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and cryptographic algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
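One lightweight way to make the uploaded model tamper-evident, sketched below under the assumption that the model has been serialized to a file, is to record a cryptographic digest of that file on the chain. The actual submission API depends on the blockchain platform in use and is not specified by the application, so only the hashing step is shown; the file name is hypothetical.

```python
import hashlib

def model_fingerprint(model_path: str) -> str:
    """Return the SHA-256 digest of a serialized model file; this digest is what
    would be recorded on the blockchain so that later tampering can be detected."""
    digest = hashlib.sha256()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example (hypothetical path):
# print(model_fingerprint("speech_feature_extractor.pt"))
```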
Please refer to FIG. 5, which is a schematic diagram of an apparatus for extracting speech features provided by an embodiment of the present application. The units included in the apparatus are used to execute the steps in the embodiments corresponding to FIG. 1, FIG. 2, and FIG. 4; for details, refer to the relevant descriptions of those embodiments. For ease of explanation, only the parts related to this embodiment are shown. Referring to FIG. 5, the apparatus includes:
an acquisition unit 310, configured to acquire speech data to be processed;
a processing unit 320, configured to input the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and where the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
Optionally, the speech feature extraction model includes a convolutional filter, a convolutional encoder, and a quasi-recurrent neural network, and the processing unit 320 is specifically configured to:
input the speech data into the convolutional filter for processing to obtain first speech features corresponding to the speech data, the first speech features including frequency features;
perform convolution processing on the first speech features through the convolutional encoder to obtain second speech features, the second speech features including MFCC features and Fbank features;
input the second speech features into the quasi-recurrent neural network for processing to obtain the target speech features, the target speech features including target waveform features, target logarithmic power spectrum features, target spectral features, target filter bank features, target gamma features, and target prosody features.
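To make the three-stage pipeline above more concrete, here is a minimal PyTorch sketch: a learnable convolutional filter bank over the raw waveform, a small convolutional encoder, and a recurrent stage producing frame-level target features. The layer sizes, kernel widths, and the use of an ordinary GRU in place of the quasi-recurrent neural network are assumptions of this sketch, not details taken from the application.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Minimal sketch of the described pipeline: convolutional filter bank,
    convolutional encoder, and a recurrent stage (GRU stands in for the QRNN)."""
    def __init__(self, n_filters=80, encoder_dim=256, hidden_dim=512):
        super().__init__()
        # learnable filter bank over the raw waveform -> first speech features (frequency-like)
        self.conv_filter = nn.Conv1d(1, n_filters, kernel_size=401, stride=160, padding=200)
        # convolutional encoder -> second speech features (MFCC/Fbank-like representation)
        self.encoder = nn.Sequential(
            nn.Conv1d(n_filters, encoder_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(encoder_dim, encoder_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # recurrent stage -> target speech features
        self.rnn = nn.GRU(encoder_dim, hidden_dim, batch_first=True)

    def forward(self, waveform):                    # waveform: (batch, samples)
        x = waveform.unsqueeze(1)                   # (batch, 1, samples)
        first = self.conv_filter(x)                 # first speech features
        second = self.encoder(first)                # second speech features
        out, _ = self.rnn(second.transpose(1, 2))   # (batch, frames, hidden_dim)
        return out                                  # target speech features
```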
Optionally, the processing unit 320 is further configured to, after the convolution processing of the first speech features yields the second speech features:
extract third speech features corresponding to the second speech features based on the quasi-recurrent neural network;
combine the second speech features with the third speech features by way of a skip connection to obtain the target speech features.
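Building on the FeatureExtractor sketch above, the skip-connection variant can be expressed by feeding the second features both into the recurrent stage and directly into the output, then combining the two. Concatenation is used here for illustration; the application does not fix the exact combining operation.

```python
import torch

class FeatureExtractorWithSkip(FeatureExtractor):  # reuses the class from the previous sketch
    """Variant with a skip connection: the second features are combined with the
    recurrent output (the third features) to form the target features."""
    def forward(self, waveform):
        x = waveform.unsqueeze(1)
        first = self.conv_filter(x)
        second = self.encoder(first)                  # second speech features
        third, _ = self.rnn(second.transpose(1, 2))   # third speech features
        # skip connection: concatenate second and third features along the feature axis
        return torch.cat([second.transpose(1, 2), third], dim=-1)
```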
Optionally, the apparatus further includes:
a first training unit, configured to input a plurality of sample speech data pairs in a sample speech data set into an initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data;
a second training unit, configured to, for each sample speech data pair, calculate, according to a preset loss function, a loss value between the sample speech features corresponding to the original speech data in the pair and the real speech features corresponding to the enhanced speech data in the pair;
a third training unit, configured to, when the loss value does not satisfy a preset condition, adjust the model parameters of the initial speech feature extraction model and return to the step of inputting the plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data;
a fourth training unit, configured to, when the loss value satisfies the preset condition, stop training the initial speech feature extraction model and take the trained initial speech feature extraction model as the trained speech feature extraction model.
Optionally, the real speech features include waveform features, logarithmic power spectrum features, spectral features, filter bank features, gamma features, and prosody features.
Optionally, the data augmentation processing is any one or any combination of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping-speech processing.
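As an illustration of how such sample pairs might be produced, the sketch below applies two of the listed augmentations (additive noise and time masking) to a waveform tensor. The SNR and mask length are illustrative values chosen for the example; reverberation, frequency masking, clipping, and overlapping speech would be added in the same spirit.

```python
import torch

def augment(waveform, noise_snr_db=10.0, mask_fraction=0.1):
    """Toy data-augmentation step: additive noise at a rough target SNR plus a random time mask."""
    # additive noise scaled to an approximate signal-to-noise ratio
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10.0 ** (noise_snr_db / 10.0))
    noisy = waveform + torch.randn_like(waveform) * noise_power.sqrt()
    # time masking: zero out one random contiguous span of samples
    n = waveform.shape[-1]
    span = int(n * mask_fraction)
    start = torch.randint(0, max(1, n - span), (1,)).item()
    noisy[..., start:start + span] = 0.0
    return noisy
```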
Optionally, the apparatus further includes:
an uploading unit, configured to upload the speech feature extraction model to a blockchain.
Please refer to FIG. 6, which is a schematic diagram of a terminal for extracting speech features provided by another embodiment of the present application. As shown in FIG. 6, the terminal 4 for extracting speech features in this embodiment includes a processor 40, a memory 41, and computer instructions 42 stored in the memory 41 and executable on the processor 40. When executing the computer instructions 42, the processor 40 implements:
acquiring speech data to be processed;
inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and where the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
Specifically, this corresponds, for example, to S101 to S102 shown in FIG. 1. Alternatively, when executing the computer instructions 42, the processor 40 implements the functions of the units in the above embodiments, for example the functions of units 310 to 320 shown in FIG. 5.
Illustratively, the computer instructions 42 may be divided into one or more units, which are stored in the memory 41 and executed by the processor 40 to complete the present application. The one or more units may be a series of computer instruction segments capable of accomplishing specific functions, the segments being used to describe the execution process of the computer instructions 42 in the terminal 4 for extracting speech features. For example, the computer instructions 42 may be divided into an acquisition unit and a processing unit, whose specific functions are as described above.
The terminal for extracting speech features may include, but is not limited to, the processor 40 and the memory 41. Those skilled in the art will understand that FIG. 6 is merely an example of the terminal 4 for extracting speech features and does not constitute a limitation on it; the terminal may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal may further include input/output devices, network access devices, a bus, and the like.
The processor 40 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 41 may be an internal storage unit of the terminal for extracting speech features, such as its hard disk or memory. The memory 41 may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal. Further, the memory 41 may include both an internal storage unit and an external storage device of the terminal. The memory 41 is used to store the computer instructions and other programs and data required by the terminal, and may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present application further provides a computer storage medium, which may be non-volatile or volatile and which stores a computer program. When the computer program is executed by a processor, the following is implemented: acquiring speech data to be processed; and inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, where the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and where the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.

Claims (20)

  1. A method for extracting speech features, comprising:
    acquiring speech data to be processed; and
    inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, wherein the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and wherein the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
  2. The method according to claim 1, wherein the speech feature extraction model comprises a convolutional filter, a convolutional encoder, and a quasi-recurrent neural network, and wherein inputting the speech data into the trained speech feature extraction model for processing to obtain the target speech features corresponding to the speech data comprises:
    inputting the speech data into the convolutional filter for processing to obtain first speech features corresponding to the speech data, the first speech features comprising frequency features;
    performing convolution processing on the first speech features through the convolutional encoder to obtain second speech features, the second speech features comprising MFCC features and Fbank features; and
    inputting the second speech features into the quasi-recurrent neural network for processing to obtain the target speech features, the target speech features comprising target waveform features, target logarithmic power spectrum features, target spectral features, target filter bank features, target gamma features, and target prosody features.
  3. The method according to claim 2, wherein, after performing convolution processing on the first speech features through the convolutional encoder to obtain the second speech features, the method further comprises:
    extracting third speech features corresponding to the second speech features based on the quasi-recurrent neural network; and
    combining the second speech features with the third speech features by way of a skip connection to obtain the target speech features.
  4. The method according to any one of claims 1 to 3, wherein, before acquiring the speech data to be processed, the method further comprises:
    inputting a plurality of sample speech data pairs in a sample speech data set into an initial speech feature extraction model for processing to obtain sample speech features corresponding to each original speech data and real speech features corresponding to each enhanced speech data;
    for each sample speech data pair, calculating, according to a preset loss function, a loss value between the sample speech features corresponding to the original speech data in the pair and the real speech features corresponding to the enhanced speech data in the pair;
    when the loss value does not satisfy a preset condition, adjusting model parameters of the initial speech feature extraction model and returning to the step of inputting the plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data; and
    when the loss value satisfies the preset condition, stopping training the initial speech feature extraction model and taking the trained initial speech feature extraction model as the trained speech feature extraction model.
  5. The method according to claim 4, wherein the real speech features comprise waveform features, logarithmic power spectrum features, spectral features, filter bank features, gamma features, and prosody features.
  6. The method according to claim 1, wherein the data augmentation processing is any one or any combination of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping-speech processing.
  7. The method according to claim 1, wherein, after inputting the speech data into the trained speech feature extraction model for processing to obtain the target speech features corresponding to the speech data, the method further comprises:
    uploading the speech feature extraction model to a blockchain.
  8. An apparatus for extracting speech features, comprising:
    an acquisition unit, configured to acquire speech data to be processed; and
    a processing unit, configured to input the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, wherein the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and wherein the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
  9. A terminal for extracting speech features, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
    acquiring speech data to be processed; and
    inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, wherein the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and wherein the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
  10. The terminal for extracting speech features according to claim 9, wherein the speech feature extraction model comprises a convolutional filter, a convolutional encoder, and a quasi-recurrent neural network, and wherein inputting the speech data into the trained speech feature extraction model for processing to obtain the target speech features corresponding to the speech data comprises:
    inputting the speech data into the convolutional filter for processing to obtain first speech features corresponding to the speech data, the first speech features comprising frequency features;
    performing convolution processing on the first speech features through the convolutional encoder to obtain second speech features, the second speech features comprising MFCC features and Fbank features; and
    inputting the second speech features into the quasi-recurrent neural network for processing to obtain the target speech features, the target speech features comprising target waveform features, target logarithmic power spectrum features, target spectral features, target filter bank features, target gamma features, and target prosody features.
  11. The terminal for extracting speech features according to claim 10, wherein, after performing convolution processing on the first speech features through the convolutional encoder to obtain the second speech features, the following is further implemented:
    extracting third speech features corresponding to the second speech features based on the quasi-recurrent neural network; and
    combining the second speech features with the third speech features by way of a skip connection to obtain the target speech features.
  12. The terminal for extracting speech features according to any one of claims 9 to 11, wherein, before acquiring the speech data to be processed, the following is further implemented:
    inputting a plurality of sample speech data pairs in a sample speech data set into an initial speech feature extraction model for processing to obtain sample speech features corresponding to each original speech data and real speech features corresponding to each enhanced speech data;
    for each sample speech data pair, calculating, according to a preset loss function, a loss value between the sample speech features corresponding to the original speech data in the pair and the real speech features corresponding to the enhanced speech data in the pair;
    when the loss value does not satisfy a preset condition, adjusting model parameters of the initial speech feature extraction model and returning to the step of inputting the plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data; and
    when the loss value satisfies the preset condition, stopping training the initial speech feature extraction model and taking the trained initial speech feature extraction model as the trained speech feature extraction model.
  13. The terminal for extracting speech features according to claim 12, wherein the real speech features comprise waveform features, logarithmic power spectrum features, spectral features, filter bank features, gamma features, and prosody features.
  14. The terminal for extracting speech features according to claim 9, wherein the data augmentation processing is any one or any combination of reverberation processing, noise addition processing, frequency masking processing, time masking processing, clipping processing, and overlapping-speech processing.
  15. The terminal for extracting speech features according to claim 9, wherein, after inputting the speech data into the trained speech feature extraction model for processing to obtain the target speech features corresponding to the speech data, the following is further implemented:
    uploading the speech feature extraction model to a blockchain.
  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements:
    acquiring speech data to be processed; and
    inputting the speech data into a trained speech feature extraction model for processing to obtain target speech features corresponding to the speech data, wherein the speech feature extraction model is obtained, based on self-supervised learning, by training on the difference between the original speech data and the enhanced speech data in each sample speech data pair, with the sample speech features corresponding to the original speech data in each pair as the target, and wherein the enhanced speech data is obtained by applying data augmentation processing to the original speech data.
  17. The computer-readable storage medium according to claim 16, wherein the speech feature extraction model comprises a convolutional filter, a convolutional encoder, and a quasi-recurrent neural network, and wherein inputting the speech data into the trained speech feature extraction model for processing to obtain the target speech features corresponding to the speech data comprises:
    inputting the speech data into the convolutional filter for processing to obtain first speech features corresponding to the speech data, the first speech features comprising frequency features;
    performing convolution processing on the first speech features through the convolutional encoder to obtain second speech features, the second speech features comprising MFCC features and Fbank features; and
    inputting the second speech features into the quasi-recurrent neural network for processing to obtain the target speech features, the target speech features comprising target waveform features, target logarithmic power spectrum features, target spectral features, target filter bank features, target gamma features, and target prosody features.
  18. The computer-readable storage medium according to claim 17, wherein, after performing convolution processing on the first speech features through the convolutional encoder to obtain the second speech features, the following is further implemented:
    extracting third speech features corresponding to the second speech features based on the quasi-recurrent neural network; and
    combining the second speech features with the third speech features by way of a skip connection to obtain the target speech features.
  19. The computer-readable storage medium according to any one of claims 16 to 18, wherein, before acquiring the speech data to be processed, the following is further implemented:
    inputting a plurality of sample speech data pairs in a sample speech data set into an initial speech feature extraction model for processing to obtain sample speech features corresponding to each original speech data and real speech features corresponding to each enhanced speech data;
    for each sample speech data pair, calculating, according to a preset loss function, a loss value between the sample speech features corresponding to the original speech data in the pair and the real speech features corresponding to the enhanced speech data in the pair;
    when the loss value does not satisfy a preset condition, adjusting model parameters of the initial speech feature extraction model and returning to the step of inputting the plurality of sample speech data pairs in the sample speech data set into the initial speech feature extraction model for processing to obtain the sample speech features corresponding to each original speech data and the real speech features corresponding to each enhanced speech data; and
    when the loss value satisfies the preset condition, stopping training the initial speech feature extraction model and taking the trained initial speech feature extraction model as the trained speech feature extraction model.
  20. The computer-readable storage medium according to claim 19, wherein the real speech features comprise waveform features, logarithmic power spectrum features, spectral features, filter bank features, gamma features, and prosody features.
PCT/CN2021/084166 2020-12-29 2021-03-30 Method and apparatus for extracting speech features, terminal, and storage medium WO2022141868A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011602171.3 2020-12-29
CN202011602171.3A CN112767927A (en) 2020-12-29 2020-12-29 Method, device, terminal and storage medium for extracting voice features

Publications (1)

Publication Number Publication Date
WO2022141868A1 true WO2022141868A1 (en) 2022-07-07

Family

ID=75697228

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084166 WO2022141868A1 (en) 2020-12-29 2021-03-30 Method and apparatus for extracting speech features, terminal, and storage medium

Country Status (2)

Country Link
CN (1) CN112767927A (en)
WO (1) WO2022141868A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882873B (en) * 2022-07-12 2022-09-23 深圳比特微电子科技有限公司 Speech recognition model training method and device and readable storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179911B (en) * 2020-01-02 2022-05-03 腾讯科技(深圳)有限公司 Target voice extraction method, device, equipment, medium and joint training method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109887494A (en) * 2017-12-01 2019-06-14 腾讯科技(深圳)有限公司 The method and apparatus of reconstructed speech signal
CN111445900A (en) * 2020-03-11 2020-07-24 平安科技(深圳)有限公司 Front-end processing method and device for voice recognition and terminal equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ravanelli, Mirco; Zhong, Jianyuan; Pascual, Santiago; Swietojanski, Pawel; Monteiro, Joao; Trmal, Jan; Bengio, Yoshua: "Multi-Task Self-Supervised Learning for Robust Speech Recognition", ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4 May 2020, pages 6989–6993, XP033793230, DOI: 10.1109/ICASSP40776.2020.9053569 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115472147A (en) * 2022-09-15 2022-12-13 北京大学深圳医院 Language identification method and device
CN116229960A (en) * 2023-03-08 2023-06-06 江苏微锐超算科技有限公司 Robust detection method, system, medium and equipment for deceptive voice
CN116229960B (en) * 2023-03-08 2023-10-31 江苏微锐超算科技有限公司 Robust detection method, system, medium and equipment for deceptive voice

Also Published As

Publication number Publication date
CN112767927A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
Luo et al. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
WO2022141868A1 (en) Method and apparatus for extracting speech features, terminal, and storage medium
WO2021042870A1 (en) Speech processing method and apparatus, electronic device, and computer-readable storage medium
WO2021000408A1 (en) Interview scoring method and apparatus, and device and storage medium
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
WO2022121257A1 (en) Model training method and apparatus, speech recognition method and apparatus, device, and storage medium
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
WO2019204547A1 (en) Systems and methods for automatic speech recognition using domain adaptation techniques
Kadıoğlu et al. An empirical study of Conv-TasNet
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
WO2022178942A1 (en) Emotion recognition method and apparatus, computer device, and storage medium
WO2019237519A1 (en) General vector training method, voice clustering method, apparatus, device and medium
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
WO2020192009A1 (en) Silence detection method based on neural network, and terminal device and medium
WO2023283823A1 (en) Speech adversarial sample testing method and apparatus, device, and computer-readable storage medium
Hasannezhad et al. PACDNN: A phase-aware composite deep neural network for speech enhancement
KR102026226B1 (en) Method for extracting signal unit features using variational inference model based deep learning and system thereof
CN110797033A (en) Artificial intelligence-based voice recognition method and related equipment thereof
CN108172214A (en) A kind of small echo speech recognition features parameter extracting method based on Mel domains
Girirajan et al. Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network.
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
Saleem et al. Variance based time-frequency mask estimation for unsupervised speech enhancement
Hizlisoy et al. Text independent speaker recognition based on MFCC and machine learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912638

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912638

Country of ref document: EP

Kind code of ref document: A1