CN114548221A - Generative data enhancement method and system for a small sample unbalanced voice database


Info

Publication number
CN114548221A
Authority
CN
China
Prior art keywords: data, voice, training, neural network, database
Prior art date
Legal status
Granted
Application number
CN202210050846.0A
Other languages
Chinese (zh)
Other versions
CN114548221B (en)
Inventor
陶智
钱金阳
章溢华
张晓俊
许宜申
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202210050846.0A
Publication of CN114548221A
Application granted
Publication of CN114548221B
Legal status: Active
Anticipated expiration

Classifications

    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/045: Neural networks; architecture; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06F2218/02: Pattern recognition adapted for signal processing; preprocessing
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a generative data enhancement method for a small sample unbalanced voice database, comprising: S1, performing signal preprocessing on the original voice data and dividing the preprocessed voice data into a training set and a test set; S2, compressing the training set data and the test set data; S3, one-hot coding the compressed training set data and test set data; S4, training a low residual WaveNet neural network with the one-hot coded training set data; S5, generating voice samples that do not exist in the original database using the one-hot coded test set data and the trained low residual WaveNet neural network. The method and system can generate accurate and diverse voice samples to expand an existing small sample unbalanced voice database, so that more complex machine learning algorithms can be applied to it.

Description

Generative data enhancement method and system for a small sample unbalanced voice database
Technical Field
The invention relates to the technical field of voice data enhancement, and in particular to a generative data enhancement method and system for a small sample unbalanced voice database.
Background
Data enhancement is mainly used to prevent model overfitting. With the development of deep learning, the machine learning models used in fields such as speech recognition and classification have grown increasingly complex. Besides the learning algorithm itself, the effectiveness of these models depends heavily on whether the training database is sufficiently large and whether its classes are balanced. Small sample datasets are prone to overfitting and weak generalization, while unbalanced datasets bias the model's predictions, so the original small sample unbalanced voice database needs to be expanded and balanced.
Traditional voice data enhancement methods mainly include volume enhancement, speed enhancement, pitch enhancement, motion enhancement, noise enhancement, time-domain masking, frequency-domain masking, and the like. Training a machine learning model with such enhanced speech can improve the accuracy and robustness of the algorithm to some extent. However, these methods all enhance some particular characteristic of the original speech and are unsuitable for certain special databases. For example, in a vowel database, volume, speed, and similar attributes are inherent characteristics of a given class of samples and cannot be changed directly.
Various deep-learning-based speech generation models can address such problems. At present, however, these models are mainly designed for more accurate, real-time speech generation, and their training requires massive amounts of data. Even with a trained model, generative data enhancement on a special database such as a vowel database suffers from problems including low sample diversity and unsatisfactory quality of the generated data.
In view of the above, for a small sample unbalanced voice database, an applicable data enhancement model is needed: one that can be trained on the original small sample unbalanced voice database itself and can generate accurate and diverse data for different databases.
Disclosure of Invention
The technical problem solved by the invention is to provide a generative data enhancement method that generates accurate and diverse voice samples to expand an existing small sample unbalanced voice database, so that more complex machine learning algorithms can be applied to it.
In order to solve the above problem, the present invention provides a generative data enhancement method for a small sample unbalanced voice database, including:
S1, performing signal preprocessing on the original voice data, and dividing the preprocessed voice data into a training set and a test set;
S2, compressing the training set data and the test set data;
S3, one-hot coding the compressed training set data and the compressed test set data;
S4, training a low residual WaveNet neural network with the one-hot coded training set data;
S5, generating voice samples that do not exist in the original database using the one-hot coded test set data and the trained low residual WaveNet neural network.
As a further improvement of the present invention, the signal preprocessing of the original voice data includes pre-emphasis and normalization, performed as follows:

x̂(n) = x(n) - α · x(n-1), n = 1, 2, …, N

where x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{1≤k≤N} |x̂(k)|

where S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
As a further improvement of the present invention, compressing the training set data and the test set data includes compressing them with the following formula:

f(x_t) = sign(x_t) · ln(1 + μ·|x_t|) / ln(1 + μ)

where f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient.
As a further improvement of the present invention, in step S3 the one-hot coding is performed as follows: the value interval between -1 and 1 is divided into n segments, and each continuous value is represented by an n-bit binary number in which exactly one bit is 1 and the rest are 0; the position of the 1 bit indicates which of the n segments the value falls into.
As a further improvement of the present invention, in step S4 the low residual WaveNet neural network comprises a plurality of residual blocks of identical structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment whose length of k points equals the receptive field of the network convolution operations, and feeding it into the low residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

where L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41-S43 until the loss function reaches a set value or the set number of training iterations completes.
As a further improvement of the present invention, step S5 includes: generating one point of voice sample data using the one-hot coded test set data and the trained low residual WaveNet neural network, then feeding the generated voice sample data back as input to the low residual WaveNet neural network to generate the next point, until the generated voice sample data reaches a set length.
In order to solve the above problem, the present invention further provides a generating data enhancement system for a small sample unbalanced speech database, which includes:
the preprocessing module is used for preprocessing the original voice data and dividing the preprocessed voice data into a training set and a test set;
the compression module is used for compressing the training set data and the test set data;
the coding module is used for carrying out one-hot coding on the compressed training set data and the compressed test set data;
the neural network training module is used for training a low residual WaveNet neural network by using the training set data subjected to the one-hot coding;
and the voice sample generation module is used for generating a voice sample which does not exist in the original database by using the one-hot coded test set data and the trained low residual WaveNet neural network.
As a further improvement of the present invention, the signal preprocessing of the original voice data includes pre-emphasis and normalization, performed as follows:

x̂(n) = x(n) - α · x(n-1), n = 1, 2, …, N

where x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{1≤k≤N} |x̂(k)|

where S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
As a further improvement of the present invention, compressing the training set data and the test set data includes compressing them with the following formula:

f(x_t) = sign(x_t) · ln(1 + μ·|x_t|) / ln(1 + μ)

where f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient.
As a further improvement of the invention, the low residual WaveNet neural network comprises a plurality of residual blocks of identical structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment whose length of k points equals the receptive field of the network convolution operations, and feeding it into the low residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

where L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41-S43 until the loss function reaches a set value or the set number of training iterations completes.
The invention has the following beneficial effects:
For a small sample unbalanced voice database, the generative data enhancement method and system model the context of the speech autoregressively to generate voice data from the limited available samples, and use a low residual WaveNet network model, which is easier to train and generates faster than data generation that uses the WaveNet network model directly.
The method and system can generate accurate and diverse voice samples to expand an existing small sample unbalanced voice database, so that more complex machine learning algorithms can be applied to it.
The foregoing is only an overview of the technical solution of the present invention. To make the technical means of the invention clearer and implementable according to this description, and to make the above and other objects, features, and advantages of the invention more readily understood, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of a generative data enhancement method for a small sample unbalanced speech database in a preferred embodiment of the present invention;
FIG. 2 is a block diagram of a low residual WaveNet neural network in a preferred embodiment of the present invention;
FIG. 3 is a flowchart of MFCC feature parameter extraction.
Detailed Description
The present invention is further described below with reference to the accompanying drawings and specific examples so that those skilled in the art can better understand and practice it; the examples are not intended to limit the invention.
Example one
As shown in FIG. 1, the generative data enhancement method for a small sample unbalanced voice database in this embodiment includes the following steps:
step S1, performing signal preprocessing on the original voice data, and dividing the preprocessed voice data into a training set and a test set;
Specifically, the signal preprocessing of the original voice data includes pre-emphasis and normalization, performed as follows:

x̂(n) = x(n) - α · x(n-1), n = 1, 2, …, N

where x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{1≤k≤N} |x̂(k)|

where S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
In one embodiment, the pre-emphasis factor α is 0.97.
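As a concrete illustration, the pre-emphasis and normalization steps above can be sketched in a few lines of NumPy. The peak-normalization form and the handling of the first sample are assumptions consistent with the description; the function name is illustrative:

```python
import numpy as np

def preprocess(x, alpha=0.97):
    """Pre-emphasize a raw speech signal, then normalize it to [-1, 1].

    x_hat(n) = x(n) - alpha * x(n-1); the first sample is kept as-is.
    alpha = 0.97 is the pre-emphasis coefficient used in the embodiment.
    """
    x = np.asarray(x, dtype=np.float64)
    emphasized = np.empty_like(x)
    emphasized[0] = x[0]
    emphasized[1:] = x[1:] - alpha * x[:-1]
    # peak normalization so the mu-law compression that follows sees [-1, 1]
    peak = np.max(np.abs(emphasized))
    return emphasized / peak if peak > 0 else emphasized
```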
Step S2, compressing the training set data and the test set data;
Specifically, the training set data and the test set data are compressed using the following formula:

f(x_t) = sign(x_t) · ln(1 + μ·|x_t|) / ln(1 + μ)

where f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient. In one embodiment, the compression coefficient μ is 256.
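This companding is the standard μ-law transform used with WaveNet-style models; a minimal sketch with μ = 256 as in the embodiment (names illustrative):

```python
import numpy as np

def mu_law_compress(x, mu=256):
    """mu-law compression: f(x_t) = sign(x_t) * ln(1 + mu*|x_t|) / ln(1 + mu).

    Maps values in [-1, 1] back into [-1, 1] while expanding resolution
    near zero, where most speech samples concentrate.
    """
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
```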
Step S3: carrying out one-hot coding on the compressed training set data and the compressed test set data;
Specifically, the one-hot coding is performed as follows: the value interval between -1 and 1 is divided into n segments, and each continuous value is represented by an n-bit binary number in which exactly one bit is 1 and the rest are 0; the position of the 1 bit indicates which of the n segments the value falls into.
In one embodiment, the compressed training set data and test set data are encoded one-hot with a length of 256.
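A minimal sketch of this quantize-and-encode step, assuming compressed values in [-1, 1] and n = 256 levels (function and variable names are illustrative):

```python
import numpy as np

def one_hot_encode(x, n=256):
    """Divide [-1, 1] into n equal segments and one-hot encode each value.

    Returns an array of shape (len(x), n) with exactly one 1 per row;
    the column of the 1 is the segment the value falls into.
    """
    x = np.asarray(x, dtype=np.float64)
    # map [-1, 1] to a bin index in 0..n-1; the value 1.0 lands in the last bin
    idx = np.minimum(((x + 1.0) / 2.0 * n).astype(int), n - 1)
    out = np.zeros((len(x), n))
    out[np.arange(len(x)), idx] = 1.0
    return out
```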
Step S4, training a low residual WaveNet neural network with the one-hot coded training set data;
Specifically, the low residual WaveNet neural network comprises a plurality of residual blocks of identical structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment whose length of k points equals the receptive field of the network convolution operations, and feeding it into the low residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

where L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41-S43 until the loss function reaches a set value or the set number of training iterations completes.
As shown in FIG. 2, in one embodiment the low residual WaveNet neural network is composed of two residual blocks of identical structure, each containing 10 dilated causal convolutions with dilation coefficients 1, 2, 4, 8, 16, 32, 64, 128, 256, and 512. One causal convolution is applied before the two residual blocks; the input and the output of the two residual blocks are then combined through a residual connection, followed by two further causal convolutions. The training steps are as follows:
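The 2047-point segment length used in step S41 matches the receptive field of this stack: with kernel size 2 (the usual WaveNet choice, assumed here), each dilated causal convolution widens the receptive field by its dilation, so two blocks of dilations 1 through 512 give 1 + 2 × (1 + 2 + … + 512) = 2047. A quick check:

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

block = [2 ** i for i in range(10)]   # dilations 1, 2, ..., 512
two_blocks = block * 2                # two identical residual blocks
```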
S41, randomly selecting 16 voice segments of 2047 points each from the training set and feeding them into the low residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{256} p(x_i) · log q(x_i)

where L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and 256 is the length of the one-hot code;
S43, updating the weights of the neural network;
S44, repeating steps S41-S43 until the loss function reaches a set value or the set number of training iterations completes.
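The per-point loss in step S42 is the cross-entropy between the one-hot ground truth and the network's predicted distribution over the 256 levels; a minimal sketch (names illustrative):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """L(x) = -sum_i p(x_i) * log q(x_i) for a single predicted point.

    p: one-hot true distribution of length n (n = 256 here);
    q: the network's predicted probability distribution over the n levels.
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return -np.sum(p * np.log(q + eps))   # eps guards against log(0)
```

For a batch of 16 predicted points, the per-point losses would typically be averaged before the weight update in S43.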
Step S5, generating voice samples that do not exist in the original database using the one-hot coded test set data and the trained low residual WaveNet neural network.
Specifically, one point of voice sample data is generated using the one-hot coded test set data and the trained low residual WaveNet neural network; the generated sample is then fed back as input to the low residual WaveNet neural network to generate the next point, until the generated voice sample data reaches a set length. In one embodiment, the set length is 0.5 s.
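The point-by-point generation loop can be sketched as follows. Here `model` stands in for the trained network: it maps the most recent window of quantization indices to a probability distribution over the 256 levels. The window length and the sampling scheme are illustrative assumptions; in practice the generated indices would be mapped back through the inverse μ-law transform to waveform samples.

```python
import numpy as np

def generate(model, seed, length, n=256, seed_rng=0):
    """Autoregressive generation: each new point is sampled from the
    model's output distribution and fed back as input for the next point.
    """
    rng = np.random.default_rng(seed_rng)
    window = len(seed)
    samples = list(seed)
    for _ in range(length):
        probs = model(samples[-window:])   # distribution over the n levels
        samples.append(int(rng.choice(n, p=probs)))
    return samples[window:]                # only the newly generated points
```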
To verify the feasibility of the generative data enhancement method for a small sample unbalanced voice database, the traditional speech feature parameters (MFCCs) are extracted from the original database; new voice samples are then generated with the method of the invention and their MFCCs extracted, and the difference between the generated samples and the original samples in feature space is compared and analyzed.
Specifically, referring to FIG. 3, the conventional MFCC extraction process includes:
Preprocessing: pre-emphasis, windowing, and framing of the voice signal S(n), using a Hamming window as the window function to obtain each frame signal S_n(m);
Fast Fourier transform: obtaining the magnitude spectrum X_n(k) by short-time Fourier analysis;
Mel-filter processing: passing the magnitude spectrum X_n(k) through a set of M mel-scale triangular filters;
Logarithmic energy: computing the logarithmic energy output by each filter;
Discrete cosine transform (DCT): applying the DCT to the logarithmic energies to obtain M-order MFCC coefficients;
Dynamic difference parameters: extracting the first- and second-order derivatives of the MFCCs and appending them to the feature matrix.
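The extraction chain above can be sketched end-to-end in NumPy. The sample rate, frame size, hop, filter count, and coefficient count below are illustrative assumptions, not values from the patent, and the delta features of the last step are omitted for brevity:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Minimal MFCC sketch: Hamming-windowed framing, FFT magnitude,
    mel filter bank, log energy, DCT. Returns (n_frames, n_mfcc)."""
    # framing with a Hamming window
    n_frames = 1 + (len(signal) - n_fft) // hop
    win = np.hamming(n_fft)
    frames = np.stack([signal[i*hop:i*hop+n_fft] * win for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))            # |X_n(k)|

    # triangular mel-scale filter bank
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_e = np.log(mag @ fbank.T + 1e-10)                # log filter-bank energy
    # DCT-II to keep the first n_mfcc cepstral coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return log_e @ dct.T
```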
New data are generated with the generative data enhancement method of the small sample unbalanced voice database and their traditional MFCC features extracted; a t-test is then used to statistically compare the feature set of the generated speech with the feature set of the original speech. The resulting p-value is greater than 0.05, indicating that the generated samples show no significant difference from the original samples and carry the representative information of real samples.
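A sketch of this verification step using SciPy's two-sample t-test; the feature matrices and the per-dimension comparison are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def compare_feature_sets(feat_orig, feat_gen, alpha=0.05):
    """Two-sample t-test per feature dimension.

    Returns the p-values and a boolean mask of dimensions that differ
    significantly; p > alpha is read as 'no significant difference'.
    """
    _, p = stats.ttest_ind(feat_orig, feat_gen, axis=0)
    return p, p < alpha
```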
Example two
This embodiment discloses a generative data enhancement system for a small sample unbalanced voice database, comprising the following modules:
the preprocessing module is used for preprocessing the original voice data and dividing the preprocessed voice data into a training set and a test set;
Specifically, the signal preprocessing of the original voice data includes pre-emphasis and normalization, performed as follows:

x̂(n) = x(n) - α · x(n-1), n = 1, 2, …, N

where x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{1≤k≤N} |x̂(k)|

where S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
The compression module is used for compressing the training set data and the test set data;
Specifically, the training set data and the test set data are compressed using the following formula:

f(x_t) = sign(x_t) · ln(1 + μ·|x_t|) / ln(1 + μ)

where f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient.
The coding module is used for carrying out one-hot coding on the compressed training set data and the compressed test set data;
Specifically, the one-hot coding is performed as follows: the value interval between -1 and 1 is divided into n segments, and each continuous value is represented by an n-bit binary number in which exactly one bit is 1 and the rest are 0; the position of the 1 bit indicates which of the n segments the value falls into.
The neural network training module is used for training a low residual WaveNet neural network by using the training set data subjected to the one-hot coding;
Specifically, the low residual WaveNet neural network comprises a plurality of residual blocks of identical structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment whose length of k points equals the receptive field of the network convolution operations, and feeding it into the low residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

where L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41-S43 until the loss function reaches a set value or the set number of training iterations completes.
And the voice sample generation module is used for generating a voice sample which does not exist in the original database by using the one-hot coded test set data and the trained low residual WaveNet neural network.
Specifically, one point of voice sample data is generated using the one-hot coded test set data and the trained low residual WaveNet neural network; the generated sample is then fed back as input to the low residual WaveNet neural network to generate the next point, until the generated voice sample data reaches a set length. In one embodiment, the set length is 0.5 s.
In the following, the generative data enhancement method of the small sample unbalanced voice database is applied to pattern recognition of small sample unbalanced voice signals.
Given a small sample unbalanced voice database, the data enhancement technique of the invention is used for pattern recognition of the voice signals. The pattern recognition system modeling the voice signals comprises data generation, feature extraction, and classifier classification.
First, the data generation is the same as the data generation step in Example One;
Second, the feature extraction is the same as the conventional MFCC extraction process used in the feature-space comparison step of Example One;
Third, classifier classification:
Random forest (RF) classifiers are trained separately on the features of the original voice data and on the features of the voice data produced by the data enhancement method of the present invention.
Pattern recognition with 10-fold cross-validation is performed both for the system modeled on data without enhancement and for the system modeled on data enhanced by the method of the invention; the experimental results are shown in Table 1.

TABLE 1 (the table contents appear only as an image in the original document)
The experimental results in the table show that the original small sample unbalanced database is not conducive to modeling a voice-signal pattern recognition system; in particular, the accuracy and sensitivity indices both improve markedly after processing with the present data enhancement method.
For a small sample unbalanced voice database, the generative data enhancement method and system model the temporal context of the speech autoregressively to generate voice data from the limited samples, and use a low residual WaveNet network model, which is easier to train and generates faster than data generation that uses the WaveNet network model directly.
The method and system can generate accurate and diverse voice samples to expand an existing small sample unbalanced voice database, so that more complex machine learning algorithms can be applied to it.
The above embodiments are merely preferred embodiments that fully illustrate the present invention, and the scope of the invention is not limited to them. Equivalent substitutions or changes made by those skilled in the art on the basis of the invention all fall within its scope of protection, which is defined by the claims.

Claims (10)

1. A generative data enhancement method for a small sample unbalanced voice database, characterized by comprising the following steps:
S1, performing signal preprocessing on the original voice data, and dividing the preprocessed voice data into a training set and a test set;
S2, compressing the training set data and the test set data;
S3, one-hot coding the compressed training set data and the compressed test set data;
S4, training a low residual WaveNet neural network with the one-hot coded training set data;
S5, generating voice samples that do not exist in the original database using the one-hot coded test set data and the trained low residual WaveNet neural network.
2. The method as claimed in claim 1, wherein the signal preprocessing of the raw voice data comprises pre-emphasis and normalization, performed as follows:

x̂(n) = x(n) - α · x(n-1), n = 1, 2, …, N

wherein x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{1≤k≤N} |x̂(k)|

wherein S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
3. The generative data enhancement method for a small-sample unbalanced voice database as claimed in claim 1, wherein said compressing the training set data and the test set data comprises compressing them using the following equation:

f(x_t) = sign(x_t) · ln(1 + μ|x_t|) / ln(1 + μ)

wherein f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient.
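Not part of the claims: the equation of claim 3 is the standard μ-law companding transform; a one-line NumPy sketch (function name mine, μ = 255 as the conventional 8-bit default):

```python
import numpy as np

def mu_law_compress(x, mu=255):
    """mu-law compression: f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    x = np.asarray(x, dtype=np.float64)
    # log1p(v) computes ln(1 + v) accurately for small v
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
```

The transform maps [-1, 1] onto itself while expanding small amplitudes, so the subsequent quantization spends more of its n bins on quiet parts of the signal.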
4. The generative data enhancement method for a small-sample unbalanced voice database as claimed in claim 1, wherein in step S3 the one-hot encoding is performed as follows: the interval of values between -1 and 1 is divided into n segments, and each continuous value is represented by an n-bit binary number in which exactly one bit is 1 and the rest are 0; the position of the 1 bit indicates which of the n segments the value falls into.
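Not part of the claims: a minimal sketch of the quantize-then-one-hot encoding of claim 4 (function name and the bin-edge convention are my own assumptions; the claim does not fix how boundary values are assigned):

```python
import numpy as np

def one_hot(values, n=256):
    """Quantize values in [-1, 1] into n equal segments and one-hot encode them."""
    v = np.asarray(values, dtype=np.float64)
    # map [-1, 1] linearly onto bin indices 0 .. n-1; the value +1.0 is clipped into the last bin
    idx = np.clip(((v + 1.0) / 2.0 * n).astype(int), 0, n - 1)
    out = np.zeros((v.size, n), dtype=np.int8)
    out[np.arange(v.size), idx] = 1  # exactly one bit set per sample
    return out
```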
5. The generative data enhancement method for a small-sample unbalanced voice database as claimed in claim 1, wherein in step S4 the low-residual WaveNet neural network comprises a plurality of residual blocks with the same structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment of length k, equal to the receptive field of the network convolution operations, and feeding it into the low-residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

wherein L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41 to S43 until the loss function reaches a set value or the set number of training iterations is completed.
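Not part of the claims: the loss of step S42 is the cross-entropy between the one-hot true distribution p and the predicted distribution q over the n quantization classes. A minimal sketch (function name and the eps stabilizer are mine):

```python
import numpy as np

def one_hot_cross_entropy(p, q, eps=1e-12):
    """L(x) = -sum_i p(x_i) * log q(x_i) over the n one-hot classes (step S42 sketch)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    # eps guards against log(0) when the network assigns zero probability to the true class
    return -np.sum(p * np.log(q + eps))
```

Because p is one-hot, the sum reduces to -log of the probability the network assigned to the true bin, so the loss is near zero exactly when the correct class is predicted with high confidence.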
6. The generative data enhancement method for a small-sample unbalanced voice database as recited in claim 1, wherein step S5 comprises: generating one point of voice sample data using the one-hot coded test set data and the trained low-residual WaveNet neural network, then feeding the generated voice sample data back as input to the low-residual WaveNet neural network to generate the next point, and repeating until the length of the generated voice sample data reaches a set value.
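Not part of the claims: a minimal sketch of the autoregressive generation loop of claim 6. `predict_next` stands in for the trained low-residual WaveNet (a hypothetical callable; the claims do not define this interface), and the context-window length here plays the role of the receptive field.

```python
import numpy as np

def generate(seed, predict_next, target_len):
    """Feed each newly generated point back as input until target_len is reached."""
    buf = list(seed)
    while len(buf) < target_len:
        window = np.asarray(buf[-len(seed):])  # receptive-field-sized context
        buf.append(predict_next(window))       # one new point per step
    return np.asarray(buf)
```

With a dummy predictor such as `lambda w: float(w[-1]) + 1.0` the loop simply extends the sequence one step at a time, which is the essential behavior the claim describes.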
7. A generative data enhancement system for a small-sample unbalanced voice database, comprising:
a preprocessing module for preprocessing the original voice data and dividing the preprocessed voice data into a training set and a test set;
a compression module for compressing the training set data and the test set data;
a coding module for performing one-hot coding on the compressed training set data and test set data;
a neural network training module for training a low-residual WaveNet neural network with the one-hot coded training set data;
and a voice sample generation module for generating voice samples that do not exist in the original database, using the one-hot coded test set data and the trained low-residual WaveNet neural network.
8. The generative data enhancement system for a small-sample unbalanced voice database as recited in claim 7, wherein the signal preprocessing of the original voice data comprises performing pre-emphasis and normalization on the original voice data as follows:

x̂(n) = x(n) - α·x(n-1), 1 ≤ n ≤ N-1

wherein x̂(n) is the pre-emphasized voice data, x(n) and x(n-1) are the n-th and (n-1)-th sampling points of the original voice data, α is the pre-emphasis coefficient, and N is the total length of the data;

S(n) = x̂(n) / max_{0 ≤ n < N} |x̂(n)|

wherein S(n) is the normalized voice data, x̂(n) is the pre-emphasized voice data, and N is the total length of the voice data.
9. The generative data enhancement system for a small-sample unbalanced voice database as recited in claim 7, wherein said compressing the training set data and the test set data comprises compressing them using the following equation:

f(x_t) = sign(x_t) · ln(1 + μ|x_t|) / ln(1 + μ)

wherein f(x_t) is the compressed voice data, x_t is the voice data at time point t, and μ is the compression coefficient.
10. The generative data enhancement system for a small-sample unbalanced voice database as recited in claim 7, wherein the low-residual WaveNet neural network comprises a plurality of residual blocks with the same structure, each residual block comprising a plurality of dilated causal convolutions with exponentially increasing dilation rates, and the training steps are as follows:
S41, randomly selecting from the training set a voice segment of length k, equal to the receptive field of the network convolution operations, and feeding it into the low-residual WaveNet neural network;
S42, taking the 16 points following the input data as the true output, and computing the error loss between the true output and the predicted output as:

L(x) = -Σ_{i=1}^{n} p(x_i) · log q(x_i)

wherein L(x) is the loss value at point x, p(x_i) is the true data value at point x, q(x_i) is the predicted data value at point x, and n is the length of the one-hot code;
S43, updating the weight parameters of the neural network;
S44, repeating steps S41 to S43 until the loss function reaches a set value or the set number of training iterations is completed.
CN202210050846.0A 2022-01-17 2022-01-17 Method and system for enhancing generated data of small sample unbalanced voice database Active CN114548221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210050846.0A CN114548221B (en) 2022-01-17 2022-01-17 Method and system for enhancing generated data of small sample unbalanced voice database


Publications (2)

Publication Number Publication Date
CN114548221A true CN114548221A (en) 2022-05-27
CN114548221B CN114548221B (en) 2023-04-28

Family

ID=81672087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210050846.0A Active CN114548221B (en) 2022-01-17 2022-01-17 Method and system for enhancing generated data of small sample unbalanced voice database

Country Status (1)

Country Link
CN (1) CN114548221B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108172238A (en) * 2018-01-06 2018-06-15 广州音书科技有限公司 A voice enhancement algorithm based on multiple convolutional neural networks in a speech recognition system
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A multi-modal speech emotion recognition method based on an enhanced residual neural network
CN110136731A (en) * 2019-05-13 2019-08-16 天津大学 End-to-end bone conduction speech blind enhancement method based on a dilated causal convolution generative adversarial network
CN110648684A (en) * 2019-07-02 2020-01-03 中国人民解放军陆军工程大学 Bone conduction voice enhancement waveform generation method based on WaveNet
CN111402929A (en) * 2020-03-16 2020-07-10 南京工程学院 Small sample speech emotion recognition method based on domain invariance
CN111429947A (en) * 2020-03-26 2020-07-17 重庆邮电大学 Speech emotion recognition method based on multi-stage residual convolutional neural network
CN112420018A (en) * 2020-10-26 2021-02-26 昆明理工大学 Language identification method suitable for low signal-to-noise ratio environment
US20220013105A1 (en) * 2020-07-09 2022-01-13 Google Llc Self-Training WaveNet for Text-to-Speech


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AARON VAN DEN OORD: "WaveNet: A Generative Model for Raw Audio", arXiv *

Also Published As

Publication number Publication date
CN114548221B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
CN110751044B (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN108831443B (en) Mobile recording equipment source identification method based on stacked self-coding network
CN111243602A (en) Voiceprint recognition method based on gender, nationality and emotional information
CN110647656B (en) Audio retrieval method utilizing transform domain sparsification and compression dimension reduction
CN111724770A (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN111583957B (en) Drama classification method based on five-tone music rhythm spectrogram and cascade neural network
CN111986699A (en) Sound event detection method based on full convolution network
CN112183582A (en) Multi-feature fusion underwater target identification method
Benamer et al. Database for arabic speech commands recognition
Imran et al. An analysis of audio classification techniques using deep learning architectures
Chakravarty et al. Spoof detection using sequentially integrated image and audio features
CN112035700B (en) Voice deep hash learning method and system based on CNN
CN114065809A (en) Method and device for identifying abnormal sound of passenger car, electronic equipment and storage medium
CN110246509A (en) A stacked denoising autoencoder and deep neural network structure for voice lie detection
Wu et al. A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling
Gaafar et al. An improved method for speech/speaker recognition
Elhami et al. Audio feature extraction with convolutional neural autoencoders with application to voice conversion
CN114548221B (en) Method and system for enhancing generated data of small sample unbalanced voice database
Wani et al. Deepfakes audio detection leveraging audio spectrogram and convolutional neural networks
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN113658607A (en) Environmental sound classification method based on data enhancement and convolution cyclic neural network
Chit et al. Myanmar continuous speech recognition system using fuzzy logic classification in speech segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant