CN114548221A - Generative data enhancement method and system for a small sample unbalanced voice database


Info

Publication number
CN114548221A
Authority
CN
China
Prior art keywords: data, voice, training, neural network, database
Prior art date
Legal status
Granted
Application number
CN202210050846.0A
Other languages
Chinese (zh)
Other versions
CN114548221B (en)
Inventor
陶智
钱金阳
章溢华
张晓俊
许宜申
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202210050846.0A
Publication of CN114548221A
Application granted
Publication of CN114548221B
Legal status: Active
Anticipated expiration

Classifications

    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/045: Neural networks; architecture; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06F2218/02: Pattern recognition adapted for signal processing; preprocessing
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a generative data enhancement method for a small sample unbalanced voice database, comprising: S1, performing signal preprocessing on the original voice data and dividing the preprocessed voice data into a training set and a test set; S2, compressing the training set data and the test set data; S3, one-hot coding the compressed training set data and test set data; S4, training a low residual WaveNet neural network with the one-hot coded training set data; S5, generating voice samples that do not exist in the original database using the one-hot coded test set data and the trained low residual WaveNet neural network. The method and system can generate accurate and diverse voice samples to expand an existing small sample unbalanced voice database, so that more complex machine learning algorithms can be applied to it.

Description

Generative data enhancement method and system for a small sample unbalanced voice database
Technical Field
The invention relates to the technical field of voice data enhancement, and in particular to a generative data enhancement method and system for a small sample unbalanced voice database.
Background
Data enhancement is mainly used to prevent model overfitting. With the development of deep learning, the machine learning models used in fields such as speech recognition and classification have grown increasingly complex. Besides the learning algorithm itself, the effectiveness of these models depends heavily on whether the training database is sufficiently large and whether its classes are balanced. Small sample datasets are prone to overfitting and weak generalization, while unbalanced datasets bias the model's predictions, so the original small sample unbalanced voice database needs to be expanded and balanced.
Traditional voice data enhancement methods mainly include volume enhancement, speed enhancement, pitch enhancement, motion enhancement, noise enhancement, time-domain masking, frequency-domain masking, and the like. Training a machine learning model with such enhanced speech can improve the accuracy and robustness of the algorithm to some extent. However, these methods all enhance some particular characteristic of the original speech and are unsuitable for certain special databases. For example, in a vowel database, volume, speed, and similar attributes are inherent characteristics of a given class of samples and cannot be changed directly.
Various deep-learning-based speech generation models can address such problems. At present, however, these models are mainly designed for more accurate, real-time speech generation, and their training requires massive amounts of data. Even with a trained model, generative data enhancement on a special database such as a vowel database suffers from problems including low sample diversity and unsatisfactory quality of the generated data.
In view of the above, for a small sample unbalanced voice database, an applicable data enhancement model is needed: one that can be trained on the original small sample unbalanced voice database itself and can generate accurate and diverse data for different databases.
Disclosure of Invention
The technical problem solved by the invention is to provide a generative data enhancement method that generates accurate and diverse voice samples to expand an existing small sample unbalanced voice database, so that more complex machine learning algorithms can be applied to it.
In order to solve the above problem, the present invention provides a generative data enhancement method for a small sample unbalanced voice database, including:
S1, performing signal preprocessing on the original voice data, and dividing the preprocessed voice data into a training set and a test set;
S2, compressing the training set data and the test set data;
S3, one-hot coding the compressed training set data and the compressed test set data;
S4, training a low residual WaveNet neural network with the one-hot coded training set data;
S5, generating voice samples that do not exist in the original database using the one-hot coded test set data and the trained low residual WaveNet neural network.
As a further improvement of the present invention, the signal preprocessing of the original voice data includes pre-emphasis and normalization, performed as follows:

x̂(n) = x(n) - α · x(n-1), n = 1, 2, …, N

where x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{1≤k≤N} |x̂(k)|

where S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
As a further improvement of the present invention, compressing the training set data and the test set data includes compressing them with the following formula:

f(x_t) = sign(x_t) · ln(1 + μ·|x_t|) / ln(1 + μ)

where f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient.
As a further improvement of the present invention, in step S3 the one-hot coding is performed as follows: the value interval between -1 and 1 is divided into n segments, and each continuous value is represented by an n-bit binary number in which exactly one bit is 1 and the rest are 0; the position of the 1 bit indicates which of the n segments the value falls into.
As a further improvement of the present invention, in step S4 the low residual WaveNet neural network comprises a plurality of residual blocks of identical structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment whose length of k points equals the receptive field of the network convolution operations, and feeding it into the low residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

where L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41-S43 until the loss function reaches a set value or the set number of training iterations completes.
As a further improvement of the present invention, step S5 includes: generating one point of voice sample data using the one-hot coded test set data and the trained low residual WaveNet neural network, then feeding the generated voice sample data back as input to the low residual WaveNet neural network to generate the next point, until the generated voice sample data reaches a set length.
In order to solve the above problem, the present invention further provides a generating data enhancement system for a small sample unbalanced speech database, which includes:
the preprocessing module is used for preprocessing the original voice data and dividing the preprocessed voice data into a training set and a test set;
the compression module is used for compressing the training set data and the test set data;
the coding module is used for carrying out one-hot coding on the compressed training set data and the compressed test set data;
the neural network training module is used for training a low residual WaveNet neural network by using the training set data subjected to the one-hot coding;
and the voice sample generation module is used for generating a voice sample which does not exist in the original database by using the one-hot coded test set data and the trained low residual WaveNet neural network.
As a further improvement of the present invention, the signal preprocessing of the original voice data includes pre-emphasis and normalization, performed as follows:

x̂(n) = x(n) - α · x(n-1), n = 1, 2, …, N

where x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{1≤k≤N} |x̂(k)|

where S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
As a further improvement of the present invention, compressing the training set data and the test set data includes compressing them with the following formula:

f(x_t) = sign(x_t) · ln(1 + μ·|x_t|) / ln(1 + μ)

where f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient.
As a further improvement of the invention, the low residual WaveNet neural network comprises a plurality of residual blocks of identical structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment whose length of k points equals the receptive field of the network convolution operations, and feeding it into the low residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

where L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41-S43 until the loss function reaches a set value or the set number of training iterations completes.
The invention has the following beneficial effects:
For a small sample unbalanced voice database, the generative data enhancement method and system model the context of the speech autoregressively to generate voice data from the limited available samples, and use a low residual WaveNet network model, which is easier to train and generates faster than data generation that uses the WaveNet network model directly.
The method and system can generate accurate and diverse voice samples to expand an existing small sample unbalanced voice database, so that more complex machine learning algorithms can be applied to it.
The foregoing is only an overview of the technical solution of the present invention. To make the technical means of the invention clearer and implementable according to this description, and to make the above and other objects, features, and advantages of the invention more readily understood, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of a generative data enhancement method for a small sample unbalanced speech database in a preferred embodiment of the present invention;
FIG. 2 is a block diagram of a low residual WaveNet neural network in a preferred embodiment of the present invention;
FIG. 3 is a flowchart of MFCC feature parameter extraction.
Detailed Description
The present invention is further described below with reference to the accompanying drawings and specific examples so that those skilled in the art can better understand and practice it; the examples are not intended to limit the invention.
Example one
As shown in FIG. 1, the generative data enhancement method for a small sample unbalanced voice database in this embodiment includes the following steps:
step S1, performing signal preprocessing on the original voice data, and dividing the preprocessed voice data into a training set and a test set;
Specifically, the signal preprocessing of the original voice data includes pre-emphasis and normalization, performed as follows:

x̂(n) = x(n) - α · x(n-1), n = 1, 2, …, N

where x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{1≤k≤N} |x̂(k)|

where S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
In one embodiment, the pre-emphasis factor α is 0.97.
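As a concrete illustration, the pre-emphasis and normalization steps above can be sketched in a few lines of NumPy. The peak-normalization form and the handling of the first sample are assumptions consistent with the description; the function name is illustrative:

```python
import numpy as np

def preprocess(x, alpha=0.97):
    """Pre-emphasize a raw speech signal, then normalize it to [-1, 1].

    x_hat(n) = x(n) - alpha * x(n-1); the first sample is kept as-is.
    alpha = 0.97 is the pre-emphasis coefficient used in the embodiment.
    """
    x = np.asarray(x, dtype=np.float64)
    emphasized = np.empty_like(x)
    emphasized[0] = x[0]
    emphasized[1:] = x[1:] - alpha * x[:-1]
    # peak normalization so the mu-law compression that follows sees [-1, 1]
    peak = np.max(np.abs(emphasized))
    return emphasized / peak if peak > 0 else emphasized
```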
Step S2, compressing the training set data and the test set data;
Specifically, the training set data and the test set data are compressed using the following formula:

f(x_t) = sign(x_t) · ln(1 + μ·|x_t|) / ln(1 + μ)

where f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient. In one embodiment, the compression coefficient μ is 256.
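This companding is the standard μ-law transform used with WaveNet-style models; a minimal sketch with μ = 256 as in the embodiment (names illustrative):

```python
import numpy as np

def mu_law_compress(x, mu=256):
    """mu-law compression: f(x_t) = sign(x_t) * ln(1 + mu*|x_t|) / ln(1 + mu).

    Maps values in [-1, 1] back into [-1, 1] while expanding resolution
    near zero, where most speech samples concentrate.
    """
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
```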
Step S3: carrying out one-hot coding on the compressed training set data and the compressed test set data;
Specifically, the one-hot coding is performed as follows: the value interval between -1 and 1 is divided into n segments, and each continuous value is represented by an n-bit binary number in which exactly one bit is 1 and the rest are 0; the position of the 1 bit indicates which of the n segments the value falls into.
In one embodiment, the compressed training set data and test set data are encoded one-hot with a length of 256.
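A minimal sketch of this quantize-and-encode step, assuming compressed values in [-1, 1] and n = 256 levels (function and variable names are illustrative):

```python
import numpy as np

def one_hot_encode(x, n=256):
    """Divide [-1, 1] into n equal segments and one-hot encode each value.

    Returns an array of shape (len(x), n) with exactly one 1 per row;
    the column of the 1 is the segment the value falls into.
    """
    x = np.asarray(x, dtype=np.float64)
    # map [-1, 1] to a bin index in 0..n-1; the value 1.0 lands in the last bin
    idx = np.minimum(((x + 1.0) / 2.0 * n).astype(int), n - 1)
    out = np.zeros((len(x), n))
    out[np.arange(len(x)), idx] = 1.0
    return out
```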
Step S4, training a low residual WaveNet neural network with the one-hot coded training set data;
Specifically, the low residual WaveNet neural network comprises a plurality of residual blocks of identical structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment whose length of k points equals the receptive field of the network convolution operations, and feeding it into the low residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

where L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41-S43 until the loss function reaches a set value or the set number of training iterations completes.
As shown in FIG. 2, in one embodiment the low residual WaveNet neural network is composed of two residual blocks of identical structure, each containing 10 dilated causal convolutions with dilation coefficients 1, 2, 4, 8, 16, 32, 64, 128, 256, and 512. One causal convolution is applied before the two residual blocks; the input and the output of the two residual blocks are then combined through a residual connection, followed by two further causal convolutions. The training steps are as follows:
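The 2047-point segment length used in step S41 matches the receptive field of this stack: with kernel size 2 (the usual WaveNet choice, assumed here), each dilated causal convolution widens the receptive field by its dilation, so two blocks of dilations 1 through 512 give 1 + 2 × (1 + 2 + … + 512) = 2047. A quick check:

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

block = [2 ** i for i in range(10)]   # dilations 1, 2, ..., 512
two_blocks = block * 2                # two identical residual blocks
```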
S41, randomly selecting 16 voice segments of 2047 points each from the training set and feeding them into the low residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{256} p(x_i) · log q(x_i)

where L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and 256 is the length of the one-hot code;
S43, updating the weights of the neural network;
S44, repeating steps S41-S43 until the loss function reaches a set value or the set number of training iterations completes.
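The per-point loss in step S42 is the cross-entropy between the one-hot ground truth and the network's predicted distribution over the 256 levels; a minimal sketch (names illustrative):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """L(x) = -sum_i p(x_i) * log q(x_i) for a single predicted point.

    p: one-hot true distribution of length n (n = 256 here);
    q: the network's predicted probability distribution over the n levels.
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return -np.sum(p * np.log(q + eps))   # eps guards against log(0)
```

For a batch of 16 predicted points, the per-point losses would typically be averaged before the weight update in S43.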
Step S5, generating voice samples that do not exist in the original database using the one-hot coded test set data and the trained low residual WaveNet neural network.
Specifically, one point of voice sample data is generated using the one-hot coded test set data and the trained low residual WaveNet neural network; the generated sample is then fed back as input to the low residual WaveNet neural network to generate the next point, until the generated voice sample data reaches a set length. In one embodiment, the set length is 0.5 s.
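The point-by-point generation loop can be sketched as follows. Here `model` stands in for the trained network: it maps the most recent window of quantization indices to a probability distribution over the 256 levels. The window length and the sampling scheme are illustrative assumptions; in practice the generated indices would be mapped back through the inverse μ-law transform to waveform samples.

```python
import numpy as np

def generate(model, seed, length, n=256, seed_rng=0):
    """Autoregressive generation: each new point is sampled from the
    model's output distribution and fed back as input for the next point.
    """
    rng = np.random.default_rng(seed_rng)
    window = len(seed)
    samples = list(seed)
    for _ in range(length):
        probs = model(samples[-window:])   # distribution over the n levels
        samples.append(int(rng.choice(n, p=probs)))
    return samples[window:]                # only the newly generated points
```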
To verify the feasibility of the generative data enhancement method for a small sample unbalanced voice database, the traditional speech feature parameters (MFCCs) are extracted from the original database; new voice samples are then generated with the method of the invention and their MFCCs extracted, and the difference between the generated samples and the original samples in feature space is compared and analyzed.
Specifically, referring to FIG. 3, the conventional MFCC extraction process includes:
Preprocessing: pre-emphasis, windowing, and framing of the voice signal S(n), using a Hamming window as the window function to obtain each frame signal S_n(m);
Fast Fourier transform: obtaining the magnitude spectrum X_n(k) by short-time Fourier analysis;
Mel-filter processing: passing the magnitude spectrum X_n(k) through a set of M mel-scale triangular filters;
Logarithmic energy: computing the logarithmic energy output by each filter;
Discrete cosine transform (DCT): applying the DCT to the logarithmic energies to obtain M-order MFCC coefficients;
Dynamic difference parameters: extracting the first- and second-order derivatives of the MFCCs and appending them to the feature matrix.
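The extraction chain above can be sketched end-to-end in NumPy. The sample rate, frame size, hop, filter count, and coefficient count below are illustrative assumptions, not values from the patent, and the delta features of the last step are omitted for brevity:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Minimal MFCC sketch: Hamming-windowed framing, FFT magnitude,
    mel filter bank, log energy, DCT. Returns (n_frames, n_mfcc)."""
    # framing with a Hamming window
    n_frames = 1 + (len(signal) - n_fft) // hop
    win = np.hamming(n_fft)
    frames = np.stack([signal[i*hop:i*hop+n_fft] * win for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))            # |X_n(k)|

    # triangular mel-scale filter bank
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_e = np.log(mag @ fbank.T + 1e-10)                # log filter-bank energy
    # DCT-II to keep the first n_mfcc cepstral coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return log_e @ dct.T
```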
New data are generated with the generative data enhancement method of the small sample unbalanced voice database and their traditional MFCC features extracted; a t-test is then used to statistically compare the feature set of the generated speech with the feature set of the original speech. The resulting p-value is greater than 0.05, indicating that the generated samples show no significant difference from the original samples and carry the representative information of real samples.
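A sketch of this verification step using SciPy's two-sample t-test; the feature matrices and the per-dimension comparison are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def compare_feature_sets(feat_orig, feat_gen, alpha=0.05):
    """Two-sample t-test per feature dimension.

    Returns the p-values and a boolean mask of dimensions that differ
    significantly; p > alpha is read as 'no significant difference'.
    """
    _, p = stats.ttest_ind(feat_orig, feat_gen, axis=0)
    return p, p < alpha
```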
Example two
This embodiment discloses a generative data enhancement system for a small sample unbalanced voice database, comprising the following modules:
the preprocessing module is used for preprocessing the original voice data and dividing the preprocessed voice data into a training set and a test set;
Specifically, the signal preprocessing of the original voice data includes pre-emphasis and normalization, performed as follows:

x̂(n) = x(n) - α · x(n-1), n = 1, 2, …, N

where x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{1≤k≤N} |x̂(k)|

where S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
The compression module is used for compressing the training set data and the test set data;
Specifically, the training set data and the test set data are compressed using the following formula:

f(x_t) = sign(x_t) · ln(1 + μ·|x_t|) / ln(1 + μ)

where f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient.
The coding module is used for carrying out one-hot coding on the compressed training set data and the compressed test set data;
Specifically, the one-hot coding is performed as follows: the value interval between -1 and 1 is divided into n segments, and each continuous value is represented by an n-bit binary number in which exactly one bit is 1 and the rest are 0; the position of the 1 bit indicates which of the n segments the value falls into.
The neural network training module is used for training a low residual WaveNet neural network by using the training set data subjected to the one-hot coding;
Specifically, the low residual WaveNet neural network comprises a plurality of residual blocks of identical structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment whose length of k points equals the receptive field of the network convolution operations, and feeding it into the low residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

where L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41-S43 until the loss function reaches a set value or the set number of training iterations completes.
And the voice sample generation module is used for generating a voice sample which does not exist in the original database by using the one-hot coded test set data and the trained low residual WaveNet neural network.
Specifically, one point of voice sample data is generated using the one-hot coded test set data and the trained low residual WaveNet neural network; the generated sample is then fed back as input to the low residual WaveNet neural network to generate the next point, until the generated voice sample data reaches a set length. In one embodiment, the set length is 0.5 s.
In the following, the generative data enhancement method of the small sample unbalanced voice database is applied to pattern recognition of small sample unbalanced voice signals.
Given a small sample unbalanced voice database, the data enhancement technique of the invention is used for pattern recognition of the voice signals. The pattern recognition system modeling the voice signals comprises data generation, feature extraction, and classifier classification.
First, the data generation is the same as the data generation step in Example One;
Second, the feature extraction is the same as the conventional MFCC extraction process used in the feature-space comparison step of Example One;
Third, classifier classification:
Random forest (RF) classifiers are trained separately on the features of the original voice data and on the features of the voice data produced by the data enhancement method of the present invention.
Pattern recognition with 10-fold cross-validation is performed both for the system modeled on data without enhancement and for the system modeled on data enhanced by the method of the invention; the experimental results are shown in Table 1.

TABLE 1 (the table contents appear only as an image in the original document)
The experimental results in the table show that the original small sample unbalanced database is not conducive to modeling a voice-signal pattern recognition system; in particular, the accuracy and sensitivity indices both improve markedly after processing with the present data enhancement method.
For a small sample unbalanced voice database, the generative data enhancement method and system model the temporal context of the speech autoregressively to generate voice data from the limited samples, and use a low residual WaveNet network model, which is easier to train and generates faster than data generation that uses the WaveNet network model directly.
The method and system can generate accurate and diverse voice samples to expand an existing small sample unbalanced voice database, so that more complex machine learning algorithms can be applied to it.
The above embodiments are merely preferred embodiments that fully illustrate the present invention, and the scope of the invention is not limited to them. Equivalent substitutions or changes made by those skilled in the art on the basis of the invention all fall within its scope of protection, which is defined by the claims.

Claims (10)

1. A generative data enhancement method for a small sample unbalanced voice database, characterized by comprising the following steps:
S1, performing signal preprocessing on the original voice data, and dividing the preprocessed voice data into a training set and a test set;
S2, compressing the training set data and the test set data;
S3, one-hot coding the compressed training set data and the compressed test set data;
S4, training a low residual WaveNet neural network with the one-hot coded training set data;
S5, generating voice samples that do not exist in the original database using the one-hot coded test set data and the trained low residual WaveNet neural network.
2. The method as claimed in claim 1, wherein the signal preprocessing of the raw voice data comprises pre-emphasis and normalization, performed as follows:

x̂(n) = x(n) - α · x(n-1), n = 1, 2, …, N

wherein x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{1≤k≤N} |x̂(k)|

wherein S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
3. The generative data enhancement method for a small-sample unbalanced voice database as claimed in claim 1, wherein said compressing the training set data and the test set data comprises compressing them using the following equation:

f(x_t) = sign(x_t) · ln(1 + μ|x_t|) / ln(1 + μ)

wherein f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient.
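Not part of the claims: the equation of claim 3 is the standard μ-law companding transform; a one-line NumPy sketch (function name mine, μ = 255 as the conventional 8-bit default):

```python
import numpy as np

def mu_law_compress(x, mu=255):
    """mu-law compression: f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    x = np.asarray(x, dtype=np.float64)
    # log1p(v) computes ln(1 + v) accurately for small v
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
```

The transform maps [-1, 1] onto itself while expanding small amplitudes, so the subsequent quantization spends more of its n bins on quiet parts of the signal.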
4. The generative data enhancement method for a small-sample unbalanced voice database as claimed in claim 1, wherein in step S3 the one-hot encoding is performed as follows: the interval of values between -1 and 1 is divided into n segments, and each continuous value is represented by an n-bit binary number in which exactly one bit is 1 and the rest are 0; the position of the 1 bit indicates which of the n segments the value falls into.
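Not part of the claims: a minimal sketch of the quantize-then-one-hot encoding of claim 4 (function name and the bin-edge convention are my own assumptions; the claim does not fix how boundary values are assigned):

```python
import numpy as np

def one_hot(values, n=256):
    """Quantize values in [-1, 1] into n equal segments and one-hot encode them."""
    v = np.asarray(values, dtype=np.float64)
    # map [-1, 1] linearly onto bin indices 0 .. n-1; the value +1.0 is clipped into the last bin
    idx = np.clip(((v + 1.0) / 2.0 * n).astype(int), 0, n - 1)
    out = np.zeros((v.size, n), dtype=np.int8)
    out[np.arange(v.size), idx] = 1  # exactly one bit set per sample
    return out
```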
5. The generative data enhancement method for a small-sample unbalanced voice database as claimed in claim 1, wherein in step S4 the low-residual WaveNet neural network comprises a plurality of residual blocks with the same structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment of length k, equal to the receptive field of the network convolution operations, and feeding it into the low-residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

wherein L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41 to S43 until the loss function reaches a set value or the set number of training iterations is completed.
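Not part of the claims: the loss of step S42 is the cross-entropy between the one-hot true distribution p and the predicted distribution q over the n quantization classes. A minimal sketch (function name and the eps stabilizer are mine):

```python
import numpy as np

def one_hot_cross_entropy(p, q, eps=1e-12):
    """L(x) = -sum_i p(x_i) * log q(x_i) over the n one-hot classes (step S42 sketch)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    # eps guards against log(0) when the network assigns zero probability to the true class
    return -np.sum(p * np.log(q + eps))
```

Because p is one-hot, the sum reduces to -log of the probability the network assigned to the true bin, so the loss is near zero exactly when the correct class is predicted with high confidence.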
6. The generative data enhancement method for a small-sample unbalanced voice database as recited in claim 1, wherein step S5 comprises: generating one point of voice sample data using the one-hot coded test set data and the trained low-residual WaveNet neural network, then feeding the generated voice sample data back as input to the low-residual WaveNet neural network to generate the next point, and repeating until the length of the generated voice sample data reaches a set value.
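Not part of the claims: a minimal sketch of the autoregressive generation loop of claim 6. `predict_next` stands in for the trained low-residual WaveNet (a hypothetical callable; the claims do not define this interface), and the context-window length here plays the role of the receptive field.

```python
import numpy as np

def generate(seed, predict_next, target_len):
    """Feed each newly generated point back as input until target_len is reached."""
    buf = list(seed)
    while len(buf) < target_len:
        window = np.asarray(buf[-len(seed):])  # receptive-field-sized context
        buf.append(predict_next(window))       # one new point per step
    return np.asarray(buf)
```

With a dummy predictor such as `lambda w: float(w[-1]) + 1.0` the loop simply extends the sequence one step at a time, which is the essential behavior the claim describes.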
7. A generative data enhancement system for a small-sample unbalanced voice database, comprising:
a preprocessing module for preprocessing the original voice data and dividing the preprocessed voice data into a training set and a test set;
a compression module for compressing the training set data and the test set data;
a coding module for performing one-hot coding on the compressed training set data and test set data;
a neural network training module for training a low-residual WaveNet neural network with the one-hot coded training set data;
and a voice sample generation module for generating voice samples that do not exist in the original database, using the one-hot coded test set data and the trained low-residual WaveNet neural network.
8. The generative data enhancement system for a small-sample unbalanced voice database as recited in claim 7, wherein the signal preprocessing of the original voice data comprises performing pre-emphasis and normalization on the original voice data as follows:

x̂(n) = x(n) - α·x(n-1), 1 ≤ n ≤ N-1

wherein x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{0 ≤ n < N} |x̂(n)|

wherein S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
9. The generative data enhancement system for a small-sample unbalanced voice database as recited in claim 7, wherein said compressing the training set data and the test set data comprises compressing them using the following equation:

f(x_t) = sign(x_t) · ln(1 + μ|x_t|) / ln(1 + μ)

wherein f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient.
10. The generative data enhancement system for a small-sample unbalanced voice database as recited in claim 7, wherein the low-residual WaveNet neural network comprises a plurality of residual blocks with the same structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment of length k, equal to the receptive field of the network convolution operations, and feeding it into the low-residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

wherein L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41 to S43 until the loss function reaches a set value or the set number of training iterations is completed.
CN202210050846.0A 2022-01-17 2022-01-17 Method and system for enhancing generated data of small sample unbalanced voice database Active CN114548221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210050846.0A CN114548221B (en) 2022-01-17 2022-01-17 Method and system for enhancing generated data of small sample unbalanced voice database


Publications (2)

Publication Number Publication Date
CN114548221A true CN114548221A (en) 2022-05-27
CN114548221B CN114548221B (en) 2023-04-28

Family

ID=81672087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210050846.0A Active CN114548221B (en) 2022-01-17 2022-01-17 Method and system for enhancing generated data of small sample unbalanced voice database

Country Status (1)

Country Link
CN (1) CN114548221B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108172238A (en) * 2018-01-06 2018-06-15 广州音书科技有限公司 A voice enhancement algorithm based on multiple convolutional neural networks in a speech recognition system
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A multi-modal speech emotion recognition method based on an enhanced residual neural network
CN110136731A (en) * 2019-05-13 2019-08-16 天津大学 End-to-end bone conduction speech blind enhancement method based on a dilated causal convolution generative adversarial network
CN110648684A (en) * 2019-07-02 2020-01-03 中国人民解放军陆军工程大学 Bone conduction voice enhancement waveform generation method based on WaveNet
CN111402929A (en) * 2020-03-16 2020-07-10 南京工程学院 Small sample speech emotion recognition method based on domain invariance
CN111429947A (en) * 2020-03-26 2020-07-17 重庆邮电大学 Speech emotion recognition method based on multi-stage residual convolutional neural network
CN112420018A (en) * 2020-10-26 2021-02-26 昆明理工大学 Language identification method suitable for low signal-to-noise ratio environment
US20220013105A1 (en) * 2020-07-09 2022-01-13 Google Llc Self-Training WaveNet for Text-to-Speech


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AARON VAN DEN OORD: "WaveNet: A Generative Model for Raw Audio", arXiv *

Also Published As

Publication number Publication date
CN114548221B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN110751044B (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN108831443B (en) Mobile recording equipment source identification method based on stacked self-coding network
CN111243602A (en) Voiceprint recognition method based on gender, nationality and emotional information
CN110647656B (en) Audio retrieval method utilizing transform domain sparsification and compression dimension reduction
CN111724770A (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN111583957B (en) Drama classification method based on five-tone music rhythm spectrogram and cascade neural network
CN111986699A (en) Sound event detection method based on full convolution network
CN112183582A (en) Multi-feature fusion underwater target identification method
Benamer et al. Database for arabic speech commands recognition
Imran et al. An analysis of audio classification techniques using deep learning architectures
Chakravarty et al. Spoof detection using sequentially integrated image and audio features
CN112035700B (en) Voice deep hash learning method and system based on CNN
CN114065809A (en) Method and device for identifying abnormal sound of passenger car, electronic equipment and storage medium
CN110246509A (en) A stacked denoising autoencoder and deep neural network structure for voice lie detection
Wu et al. A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling
Gaafar et al. An improved method for speech/speaker recognition
Elhami et al. Audio feature extraction with convolutional neural autoencoders with application to voice conversion
CN114548221B (en) Method and system for enhancing generated data of small sample unbalanced voice database
Wani et al. Deepfakes audio detection leveraging audio spectrogram and convolutional neural networks
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN113658607A (en) Environmental sound classification method based on data enhancement and convolution cyclic neural network
Chit et al. Myanmar continuous speech recognition system using fuzzy logic classification in speech segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant