CN113160844A - Speech enhancement method and system based on noise background classification - Google Patents

Speech enhancement method and system based on noise background classification

Info

Publication number
CN113160844A
Authority
CN
China
Prior art keywords
voice
speech
noise
processed
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110459982.0A
Other languages
Chinese (zh)
Inventor
李晔
冯涛
张鹏
李姝
汪付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan filed Critical Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202110459982.0A priority Critical patent/CN113160844A/en
Publication of CN113160844A publication Critical patent/CN113160844A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a speech enhancement method and system based on noise background classification, comprising: acquiring a speech signal to be processed; extracting features from the speech signal to be processed; inputting the extracted features into a trained classifier to obtain a noise background label for the speech to be processed; selecting a trained generator corresponding to the noise background label; and inputting the speech signal to be processed into the selected trained generator to obtain an enhanced speech signal. The method extracts Mel-frequency cepstral coefficients from the noisy speech and inputs them into a classifier to classify the noise background; the classified speech is then enhanced, within the same model, by a generative adversarial network dedicated to that noise background.

Description

Speech enhancement method and system based on noise background classification
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a speech enhancement method and system based on noise background classification.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Speech is the most direct and efficient tool for exchanging information between people, and it is also a tool for communication between people and machines. However, both kinds of communication are always affected by noise; the type of noise differs from scene to scene, and different noises affect the useful speech information differently. For example, when people converse in a car, the noise consists of engine noise, horn sounds, and the like; the noise in a coffee shop is mostly conversation among guests; and the noise in a machine room is mostly the fan noise of running computers. A single method therefore may not enhance speech well across multiple scenes, and how to make one speech enhancement method achieve good results in different scenes has become an urgent technical problem for those skilled in the art.
At present, most speech enhancement methods target one specific background noise, and their enhancement effect is mediocre on other types of noise background; a speech enhancement method that handles multiple noise scenes is therefore urgently needed.
Disclosure of Invention
To overcome the shortcomings of the prior art, the present invention provides a speech enhancement method and system based on noise background classification. Different noise scenes are distinguished so that, within the same model, a network dedicated to a given scene performs the speech enhancement, achieving a better enhancement effect.
In a first aspect, the present invention provides a speech enhancement method based on noise background classification;
the speech enhancement method based on the noise background classification comprises the following steps:
acquiring a voice signal to be processed;
carrying out feature extraction on a voice signal to be processed;
inputting the extracted features into a trained classifier to obtain a noise background label of the voice to be processed;
selecting a trained generator corresponding to the noise background label;
and inputting the voice signal to be processed into the selected trained generator to obtain an enhanced voice signal.
In a second aspect, the present invention provides a speech enhancement system based on noise background classification;
a speech enhancement system based on noise background classification, comprising:
an acquisition module configured to: acquire a speech signal to be processed;
a feature extraction module configured to: extract features from the speech signal to be processed;
a classification module configured to: input the extracted features into a trained classifier to obtain a noise background label for the speech to be processed;
a selection module configured to: select a trained generator corresponding to the noise background label; and
an enhancement module configured to: input the speech signal to be processed into the selected trained generator to obtain an enhanced speech signal.
In a third aspect, the present invention further provides an electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein the processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory to cause the electronic device to perform the method according to the first aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
the invention fully considers the problem that most voice enhancement methods in the field of voice enhancement can not obtain good effect when voice enhancement is carried out under multiple scenes, selects a Mel frequency cepstrum coefficient for extracting noisy voice to input into a classifier to classify noise backgrounds, and uses a confrontation network generated aiming at the noise backgrounds in the same model to realize voice enhancement on the well-classified voice.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention rather than to limit it.
FIG. 1 is a flow chart of the method of the first embodiment.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments of the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and "comprising", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Interpretation of terms:
Mel-Frequency Cepstral Coefficient (MFCC);
Generative Adversarial Network (GAN).
Example one
The embodiment provides a speech enhancement method based on noise background classification;
as shown in fig. 1, the speech enhancement method based on noise background classification includes:
S101: acquiring a speech signal to be processed;
S102: extracting features from the speech signal to be processed;
S103: inputting the extracted features into a trained classifier to obtain a noise background label for the speech to be processed;
S104: selecting a trained generator corresponding to the noise background label;
S105: inputting the speech signal to be processed into the selected trained generator to obtain an enhanced speech signal.
Further, S102, extracting features from the speech signal to be processed, specifically comprises:
extracting the Mel-frequency cepstral coefficient (MFCC) features of the speech signal to be processed.
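A minimal sketch of this extraction step, using the librosa library; the sample rate, frame parameters, and number of coefficients below are assumptions, since the text does not specify them.

```python
# MFCC extraction sketch (parameter values are assumptions).
import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load a mono waveform and return its MFCC matrix of shape (n_mfcc, frames)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256)
    return mfcc.astype(np.float32)
```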
Further, in S103 the extracted features are input into the trained classifier to obtain the noise background label of the speech to be processed. Training the classifier comprises:
constructing a first training set, the first training set consisting of speech signal features with known noise background labels;
inputting the training set into the classifier and training it; and
stopping the training when the loss function of the classifier reaches its minimum or the set number of iterations is reached, to obtain the trained classifier.
Illustratively, the data set is built as follows:
For the clean speech data set, THCHS30 is used; THCHS30 is an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) of Tsinghua University.
For the noise background, noise recorded in five scenes is used: a coffee shop, a running automobile, a running subway, a server machine room, and a cafeteria.
The THCHS30 data set is divided into six equal parts of 5 hours each. Five parts of clean speech are programmatically mixed with the noise of the five scenes to synthesize noisy speech at different signal-to-noise ratios, forming the training set; the remaining part of clean speech is divided into five parts and mixed with the noise of the five scenes at different signal-to-noise ratios to form the test set.
When a noisy speech file is synthesized, a code for the current noise background type is appended to the end of each training file's name, as in the sketch below.
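As a concrete illustration, clean speech and scene noise might be mixed at a chosen signal-to-noise ratio as follows; the power-scaling formula is the standard one, while the sample rate, SNR value, and file-naming code in the comments are assumptions.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals snr_db, then add."""
    if len(noise) < len(clean):                      # tile short noise clips
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Hypothetical usage: synthesize a coffee-shop ("a"-coded) file at 5 dB SNR.
# noisy = mix_at_snr(clean, cafe_noise, snr_db=5)
# soundfile.write("utt0001_a.wav", noisy, 16000)  # noise code appended to name
```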
Extracting information from the noisy speech:
MFCC features are extracted from the noisy speech of each scene, the label code of its noise type is read, each MFCC feature is stored with its corresponding label in an array A, and the order of array A is then shuffled.
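A sketch of this pairing-and-shuffling step; the noise-code-to-label mapping and the file-name convention are assumptions carried over from the synthesis sketch above.

```python
import os
import random

NOISE_CODES = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}  # assumed scene coding

def label_from_filename(path):
    """Read the noise-type code appended to the file name, e.g. 'utt0001_a.wav'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return NOISE_CODES[stem.split("_")[-1]]

def build_array_a(feature_label_pairs):
    """Store each (MFCC, label) pair in array A, then shuffle its order."""
    A = list(feature_label_pairs)
    random.shuffle(A)
    return A
```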
Further, the classifier is a convolutional neural network.
Alternatively,
further, the specific structure of the classifier comprises:
the device comprises a first convolution layer, a first activation function layer, a first maximum pooling layer, a second convolution layer, a second activation function layer and a second maximum pooling layer which are connected in sequence.
The number of convolution kernels of the first convolution layer is the same as that of convolution kernels of the second convolution layer, the first convolution layer is provided with 32 convolution kernels, and each convolution kernel has a sampling window of 5 x 5.
The following are exemplary: the classifier is composed of a first layer which is composed of convolution layers, the convolution layers are provided with 32 convolution kernels, each convolution kernel is provided with a sampling window of 5 x 5, a ReLU activation function is used after each convolution layer, a max-posing pooling layer is applied, a second convolution layer is added later, the configuration of the second convolution layer is the same as that of the first convolution layer, the ReLU activation function is used, the max-posing pooling layer is applied, the output of the second max-posing pooling layer is flattened into 1 dimension, the second convolution layer is input into a full connection layer, and the prediction result of the classifier is obtained after the full connection layer.
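A Keras sketch of the structure just described: two identical convolutional blocks (32 kernels, 5 x 5 windows, ReLU, max-pooling), a flatten step, and a fully connected output layer. The input shape and the softmax output over five noise classes are assumptions consistent with the embodiment.

```python
import tensorflow as tf

def build_classifier(input_shape=(13, 98, 1), num_classes=5):
    """Two conv blocks (32 kernels, 5x5, ReLU, max-pooling), flatten, dense."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu",
                               input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```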
Further, the known noise background labels include, for example: a coffee shop, a running automobile, a running subway, a server machine room, and/or a cafeteria.
Illustratively, the classifier is trained as follows: the shuffled noisy-speech array A is input into the classifier, and the predicted noise background label is checked against the label stored in array A. As large batches of files are compared, an AdamOptimizer adjusts the parameters of each layer by back-propagation to reduce the error. After 150 training iterations, the model learns to predict the noise background label with an accuracy above 98%.
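Training as described might then look like the following sketch; the learning rate and batch size are assumptions, while the Adam optimizer and the 150 iterations follow the text.

```python
import tensorflow as tf

model = build_classifier()  # from the sketch above
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # learning rate assumed
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# `features` and `labels` would come from the shuffled array A built earlier:
# model.fit(features, labels, epochs=150, batch_size=64)
```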
Further, in S104 a trained generator corresponding to the noise background label is selected. The generator is trained as follows:
S1041: construct a second training set comprising noise-free speech signals and noisy speech signals with known noise background labels, where each noisy speech signal is obtained by adding the background noise of the corresponding label to a noise-free speech signal;
S1042: repeat the discriminator initialization, generator initialization, and weight optimization steps;
on the first execution, both the discriminator initialization step and the generator initialization step assign weights using normally distributed random numbers;
on subsequent executions, the discriminator initialization step and the generator initialization step use the weights produced by the optimizer in the previous weight optimization step;
S1043: check whether the amount of data trained so far exceeds a set value, and repeat the training until the set number of training rounds is reached; after training, store the weights of the final weight optimization step to obtain the trained generator.
Further, the discriminator initialization step specifically comprises:
on the first execution, assigning the weights using normally distributed random numbers; the preprocessed noise-free speech is input into the discriminator, and the discriminator outputs 1, indicating that the input is noise-free speech;
on subsequent executions, using the weights produced by the optimizer in the previous weight optimization step; the noise-free speech and the noisy speech processed by the generator are input into the discriminator, which outputs a discrimination result.
Further, the generator initialization step specifically comprises:
on the first execution, assigning the weights using normally distributed random numbers; the preprocessed noisy speech is input into the generator, first compressed by the encoding structure and then de-compressed by the decoding structure, while skip connections carry the speech features of the noisy speech from the encoding structure into the decoding structure to guide it in generating enhanced speech;
on subsequent executions, using the weights produced by the optimizer in the previous weight optimization step; the preprocessed noisy speech is processed in the same way: compressed by the encoding structure, de-compressed by the decoding structure, and guided by the skip connections from the encoding structure to the decoding structure to generate enhanced speech.
Further, the weight optimization step specifically comprises:
the AdamOptimizer in the generative adversarial network updates the weights of the convolution kernels of every encoding and decoding structure in the generator by gradient descent, according to the generator loss and the discriminator loss computed from the enhanced speech and the noise-free speech, so that the generated enhanced speech becomes more similar to the noise-free speech; the optimizer also updates the weights in the discriminator to strengthen its ability to recognize enhanced speech.
Further, the second training set is constructed by selecting a suitable speech data set and multiple noise backgrounds, and synthesizing noise-type-labeled training data at different signal-to-noise ratios from the clean speech and the different noises.
Illustratively, the generator is composed of several convolutional layers, called the encoding structure, and several deconvolution layers, called the decoding structure; the two structures are mirror-symmetric, and skip connections are added between them.
The discriminator is composed of several convolutional layers whose structure is the same as that of the convolutional layers in the generator.
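A compact sketch of such a mirror-symmetric generator and matching discriminator, using strided 1-D convolutions over fixed-length waveform frames; the layer count, kernel width, channel widths, and frame length are assumptions.

```python
import tensorflow as tf

def build_generator(frame_len=16384, depth=4, base=16):
    """Encoder of strided Conv1D layers mirrored by a Conv1DTranspose decoder,
    with skip connections joining stages of equal time resolution."""
    inp = tf.keras.Input(shape=(frame_len, 1))
    x, skips = inp, []
    for i in range(depth):                        # encoding: compress
        x = tf.keras.layers.Conv1D(base * 2 ** i, 31, strides=2,
                                   padding="same", activation="relu")(x)
        skips.append(x)
    skips.pop()                                   # bottleneck has no skip partner
    for i in reversed(range(depth)):              # decoding: de-compress
        x = tf.keras.layers.Conv1DTranspose(base * 2 ** i, 31, strides=2,
                                            padding="same", activation="relu")(x)
        if skips:                                 # skip connection from encoder
            x = tf.keras.layers.Concatenate()([x, skips.pop()])
    out = tf.keras.layers.Conv1D(1, 31, padding="same", activation="tanh")(x)
    return tf.keras.Model(inp, out)

def build_discriminator(frame_len=16384, depth=4, base=16):
    """Convolutional stack matching the generator's encoder; scores near 1
    indicate noise-free speech, near 0 enhanced speech."""
    inp = tf.keras.Input(shape=(frame_len, 1))
    x = inp
    for i in range(depth):
        x = tf.keras.layers.Conv1D(base * 2 ** i, 31, strides=2,
                                   padding="same", activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    return tf.keras.Model(inp, tf.keras.layers.Dense(1, activation="sigmoid")(x))
```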
Further, S105, inputting the speech signal to be processed into the selected trained generator to obtain an enhanced speech signal, specifically comprises:
inputting the speech signal to be processed into the selected trained generator and performing encoding and then decoding to obtain the enhanced speech signal.
The invention extracts the Mel-frequency cepstral coefficients (MFCC) of the noisy speech and inputs them into a convolutional neural network to classify the noise background; the classified speech is then enhanced, within the same model, by a generative adversarial network (GAN) dedicated to that noise background.
Speech enhancement networks are constructed in the same number as the noise backgrounds; the input noise type of each network is fixed, and each network accepts only noisy speech of its corresponding type.
Noisy speech from an unknown scene is input into the model, classified by the classifier, and passed through the matching speech enhancement network to obtain the enhanced speech.
Illustratively, several generative adversarial networks (GAN) are used as the speech enhancement model: five identical GANs are constructed, each with the same structure and each consisting of one generator and one discriminator.
Illustratively, the input data of the generative adversarial networks is processed as follows in the training phase:
The noisy speech, the noise-free speech, and the scene type labels of the noisy speech for the five noise backgrounds are stored in a TFRecord file.
In the TFRecord file, the noisy speech is marked noise, the noise-free speech is marked clean, and the scene type label of the noisy speech is marked label; according to the label type, the noisy and noise-free speech are routed to the corresponding generative adversarial network (a serialization sketch follows below).
Before being input, the noisy and noise-free speech are preprocessed and divided into batches, one batch being the 150 sampling points within one second.
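A sketch of this serialization step; the feature names noise, clean, and label follow the markers described above, while the float and int64 encodings are assumptions.

```python
import tensorflow as tf

def write_tfrecord(path, examples):
    """Write (noisy, clean, label) triples to a TFRecord file; `examples`
    yields (float32 array, float32 array, int) tuples."""
    with tf.io.TFRecordWriter(path) as writer:
        for noisy, clean, label in examples:
            feat = {
                "noise": tf.train.Feature(
                    float_list=tf.train.FloatList(value=noisy.tolist())),
                "clean": tf.train.Feature(
                    float_list=tf.train.FloatList(value=clean.tolist())),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
            }
            ex = tf.train.Example(features=tf.train.Features(feature=feat))
            writer.write(ex.SerializeToString())
```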
The five generative adversarial networks perform the same speech enhancement operation and differ only in the noisy and noise-free speech they receive; the following description takes as an example the generative adversarial network for the coffee-shop noise background, whose noise code is set to a.
Illustratively, the discriminator inside the generative adversarial network is initialized in the training phase as follows:
The weights of the convolution kernels of the convolutional layers in the discriminator are initialized with normally distributed random numbers; the preprocessed noise-free speech is input into the discriminator, and the discriminator outputs 1, indicating that such input is noise-free speech.
Illustratively, the generator inside the generative adversarial network is initialized in the training phase as follows:
The weights of the convolution kernels of the encoding and decoding structures in the generator are initialized with normally distributed random numbers. The preprocessed noisy speech is input into the generator, first compressed by the encoding structure and then de-compressed by the decoding structure, while skip connections carry the speech features of the noisy speech from the encoding structure into the decoding structure to guide it in generating enhanced speech.
Illustratively, the weight optimization stage of the training phase proceeds as follows:
After the discriminator initialization and generator initialization stages are complete, the enhanced speech produced by the generator is input into the discriminator. Because the discriminator saw only noise-free speech during initialization, and the enhanced speech still differs substantially from noise-free speech at this point, the discriminator outputs 0, indicating that the input is enhanced speech.
The AdamOptimizer in the generative adversarial network updates the weights of the convolution kernels of every encoding and decoding structure in the generator according to the generator loss and the discriminator loss computed from the enhanced speech and the noise-free speech, so that the generated enhanced speech becomes more similar to the noise-free speech; the optimizer also updates the weights in the discriminator to strengthen its ability to recognize enhanced speech.
At test time, the files of the test set are input into the classifier, which automatically classifies them and attaches a noise background label. According to that label, each noisy file is input into the GAN network responsible for that noise type; the GAN denoises the noisy speech in one-second intervals, and after all noisy speech has been processed, the processed segments are concatenated to obtain the enhanced speech.
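Putting the pieces together, test-time enhancement might proceed as sketched below, reusing the classifier and generators from the sketches above; the MFCC parameters and the one-second segmentation with padding are assumptions consistent with the text.

```python
import librosa
import numpy as np

def enhance_file(audio, sr, classifier, generators, frame_len=16384):
    """Classify the noise background once from the file's MFCCs, then denoise
    one second at a time with the generator matching the predicted label.
    Assumes sr <= frame_len and input shapes matching the training sketches."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=256)
    label = int(np.argmax(classifier.predict(mfcc[None, ..., None])))
    generator = generators[label]                    # one GAN per noise scene
    out = []
    for start in range(0, len(audio), sr):           # 1-second segments
        seg = audio[start:start + sr]
        pad = np.pad(seg, (0, frame_len - len(seg)))  # pad to generator input
        enhanced = generator.predict(pad[None, :, None])[0, :, 0]
        out.append(enhanced[:len(seg)])              # drop the padding again
    return np.concatenate(out)
```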
The innovation of the invention lies in extracting the Mel-frequency cepstral coefficients of the noisy speech, inputting them into a classifier to classify the noise background, and enhancing the classified speech with a generative adversarial network dedicated to that noise background within the model.
The invention provides a speech enhancement method based on noise background classification: the Mel-frequency cepstral coefficients of the noisy speech are input into a classifier to classify the noise background, and a generative adversarial network dedicated to that noise background within the model performs the speech enhancement. Compared with other speech enhancement methods, this method generalizes better and achieves better results in different noise scenes.
Example two
The embodiment provides a speech enhancement system based on noise background classification;
a speech enhancement system based on noise background classification, comprising:
an acquisition module configured to: acquire a speech signal to be processed;
a feature extraction module configured to: extract features from the speech signal to be processed;
a classification module configured to: input the extracted features into a trained classifier to obtain a noise background label for the speech to be processed;
a selection module configured to: select a trained generator corresponding to the noise background label; and
an enhancement module configured to: input the speech signal to be processed into the selected trained generator to obtain an enhanced speech signal.
It should be noted that the acquisition module, feature extraction module, classification module, selection module, and enhancement module described above correspond to steps S101 to S105 of the first embodiment; the modules share the examples and application scenarios of the corresponding steps, but are not limited to the disclosure of the first embodiment. The modules, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphases; for parts not described in detail in a given embodiment, reference may be made to the related descriptions of the other embodiments.
The proposed system can be implemented in other ways. For example, the system embodiments described above are merely illustrative; the division into modules is merely a logical division, and other divisions are possible in actual implementation: multiple modules may be combined or integrated into another system, and some features may be omitted or not executed.
Example three
The present embodiment also provides an electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein the processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory to cause the electronic device to perform the method of the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be implemented directly by a hardware processor or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described again here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A speech enhancement method based on noise background classification, characterized by comprising:
acquiring a speech signal to be processed;
extracting features from the speech signal to be processed;
inputting the extracted features into a trained classifier to obtain a noise background label for the speech to be processed;
selecting a trained generator corresponding to the noise background label; and
inputting the speech signal to be processed into the selected trained generator to obtain an enhanced speech signal.
2. The speech enhancement method based on noise background classification according to claim 1, characterized in that extracting features from the speech signal to be processed specifically comprises:
extracting the Mel-frequency cepstral coefficient features of the speech signal to be processed.
3. The speech enhancement method based on noise background classification according to claim 1, characterized in that the extracted features are input into the trained classifier to obtain the noise background label of the speech to be processed, wherein training the classifier comprises:
constructing a first training set, the first training set consisting of speech signal features with known noise background labels;
inputting the training set into the classifier and training the classifier; and
stopping the training when the loss function of the classifier reaches its minimum or the set number of iterations is reached, to obtain the trained classifier.
4. The speech enhancement method based on noise background classification according to claim 1, characterized in that a trained generator corresponding to the noise background label is selected based on that label, and wherein training the generator comprises:
(1) constructing a second training set, the second training set comprising noise-free speech signals and noisy speech signals with known noise background labels, wherein each noisy speech signal is obtained by adding the background noise of the corresponding label to a noise-free speech signal;
(2) repeating the discriminator initialization, generator initialization, and weight optimization steps, wherein on the first execution both the discriminator initialization step and the generator initialization step assign weights using normally distributed random numbers, and on subsequent executions both steps use the weights produced by the optimizer in the previous weight optimization step; and
(3) judging whether the amount of data trained so far exceeds a set value, and repeating the training until the set number of training rounds is reached; after training, storing the weights of the final weight optimization step to obtain the trained generator.
5. The speech enhancement method based on noise background classification according to claim 4, characterized in that the discriminator initialization step specifically comprises:
on the first execution, assigning the weights using normally distributed random numbers; the preprocessed noise-free speech is input into the discriminator, and the discriminator outputs 1, indicating that the input is noise-free speech; and
on subsequent executions, using the weights produced by the optimizer in the previous weight optimization step; the noise-free speech and the noisy speech processed by the generator are input into the discriminator, which outputs a discrimination result.
6. The speech enhancement method based on noise background classification according to claim 4, characterized in that the generator initialization step specifically comprises:
on the first execution, assigning the weights using normally distributed random numbers; the preprocessed noisy speech is input into the generator, first compressed by the encoding structure and then de-compressed by the decoding structure, while skip connections carry the speech features of the noisy speech from the encoding structure into the decoding structure to guide it in generating enhanced speech; and
on subsequent executions, using the weights produced by the optimizer in the previous weight optimization step; the preprocessed noisy speech is processed in the same way: compressed by the encoding structure, de-compressed by the decoding structure, and guided by the skip connections from the encoding structure to the decoding structure to generate enhanced speech.
7. The speech enhancement method according to claim 4, characterized in that the weight optimization step specifically comprises:
the optimizer in the generative adversarial network updating the weights of the convolution kernels of every encoding and decoding structure in the generator by gradient descent, according to the generator loss and the discriminator loss computed from the enhanced speech and the noise-free speech, so as to generate enhanced speech more similar to the noise-free speech; the optimizer also updating the weights in the discriminator to strengthen its ability to recognize enhanced speech.
8. A speech enhancement system based on noise background classification, characterized by comprising:
an acquisition module configured to: acquire a speech signal to be processed;
a feature extraction module configured to: extract features from the speech signal to be processed;
a classification module configured to: input the extracted features into a trained classifier to obtain a noise background label for the speech to be processed;
a selection module configured to: select a trained generator corresponding to the noise background label; and
an enhancement module configured to: input the speech signal to be processed into the selected trained generator to obtain an enhanced speech signal.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202110459982.0A 2021-04-27 2021-04-27 Speech enhancement method and system based on noise background classification Pending CN113160844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110459982.0A CN113160844A (en) 2021-04-27 2021-04-27 Speech enhancement method and system based on noise background classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110459982.0A CN113160844A (en) 2021-04-27 2021-04-27 Speech enhancement method and system based on noise background classification

Publications (1)

Publication Number Publication Date
CN113160844A true CN113160844A (en) 2021-07-23

Family

ID=76871861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110459982.0A Pending CN113160844A (en) 2021-04-27 2021-04-27 Speech enhancement method and system based on noise background classification

Country Status (1)

Country Link
CN (1) CN113160844A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104795064A (en) * 2015-03-30 2015-07-22 福州大学 Recognition method for sound event under scene of low signal to noise ratio
CN108806708A (en) * 2018-06-13 2018-11-13 中国电子科技集团公司第三研究所 Voice de-noising method based on Computational auditory scene analysis and generation confrontation network model
CN109285538A (en) * 2018-09-19 2019-01-29 宁波大学 A kind of mobile phone source title method under the additive noise environment based on normal Q transform domain
CN109448702A (en) * 2018-10-30 2019-03-08 上海力声特医学科技有限公司 Artificial cochlea's auditory scene recognition methods
CN109859767A (en) * 2019-03-06 2019-06-07 哈尔滨工业大学(深圳) A kind of environment self-adaption neural network noise-reduction method, system and storage medium for digital deaf-aid
EP3716270A1 (en) * 2019-03-29 2020-09-30 Goodix Technology (HK) Company Limited Speech processing system and method therefor
CN110164472A (en) * 2019-04-19 2019-08-23 天津大学 Noise classification method based on convolutional neural networks
US20200335086A1 (en) * 2019-04-19 2020-10-22 Behavioral Signal Technologies, Inc. Speech data augmentation
CN112446242A (en) * 2019-08-29 2021-03-05 北京三星通信技术研究有限公司 Acoustic scene classification method and device and corresponding equipment
CN110600054A (en) * 2019-09-06 2019-12-20 南京工程学院 Sound scene classification method based on network model fusion
CN111785288A (en) * 2020-06-30 2020-10-16 北京嘀嘀无限科技发展有限公司 Voice enhancement method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘载文 (Liu Zaiwen): 《水环境系统智能化软测量与控制方法》 (Intelligent Soft Measurement and Control Methods for Water Environment Systems), China Light Industry Press (中国轻工业出版社), 31 March 2013 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116597855A (en) * 2023-07-18 2023-08-15 深圳市则成电子股份有限公司 Adaptive noise reduction method and device and computer equipment
CN116597855B (en) * 2023-07-18 2023-09-29 深圳市则成电子股份有限公司 Adaptive noise reduction method and device and computer equipment

Similar Documents

Publication Publication Date Title
JP6538779B2 (en) Speech dialogue system, speech dialogue method and method for adapting a speech dialogue system
CA2498015C (en) Combining active and semi-supervised learning for spoken language understanding
US20030061037A1 (en) Method and apparatus for identifying noise environments from noisy signals
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
CN101154380B (en) Method and device for registration and validation of speaker's authentication
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
JP2013539558A (en) Parameter speech synthesis method and system
KR20150145024A (en) Terminal and server of speaker-adaptation speech-recognition system and method for operating the system
CN112735482A (en) Endpoint detection method and system based on combined deep neural network
CN114664318A (en) Voice enhancement method and system based on generation countermeasure network
JP2020071482A (en) Word sound separation method, word sound separation model training method and computer readable medium
CN113160844A (en) Speech enhancement method and system based on noise background classification
CN112750445A (en) Voice conversion method, device and system and storage medium
CN117373431A (en) Audio synthesis method, training method, device, equipment and storage medium
CN110633735B (en) Progressive depth convolution network image identification method and device based on wavelet transformation
CN113611293B (en) Mongolian data set expansion method
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
CN114822578A (en) Voice noise reduction method, device, equipment and storage medium
KR102241364B1 (en) Apparatus and method for determining user stress using speech signal
CN115910045B (en) Model training method and recognition method for voice wake-up word
CN115104152A (en) Speaker recognition device, speaker recognition method, and program
JP2020510862A (en) Sound Discrimination Using Periodic Display
JP2017520016A (en) Excitation signal generation method of glottal pulse model based on parametric speech synthesis system
EP1187096A1 (en) Speaker adaptation with speech model pruning
CN114822497A (en) Method, apparatus, device and medium for training speech synthesis model and speech synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination