WO2020098256A1 - Speech enhancement method based on fully convolutional neural network, device, and storage medium - Google Patents

Speech enhancement method based on fully convolutional neural network, device, and storage medium Download PDF

Info

Publication number
WO2020098256A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
convolutional neural
layer
output
node
Prior art date
Application number
PCT/CN2019/089180
Other languages
French (fr)
Chinese (zh)
Inventor
赵峰
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020098256A1 publication Critical patent/WO2020098256A1/en

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the present application relates to the field of speech technology, and in particular to a speech enhancement method, device and storage medium based on a fully convolutional neural network.
  • Speech enhancement refers to the technique of filtering out, by some method, the various noises that disturb clean speech in real-life scenes, so as to improve the quality and intelligibility of the speech.
  • the voices collected by microphones are usually "polluted" voices carrying various noises.
  • the main purpose of speech enhancement is to recover clean speech from these "polluted" noisy voices.
  • Speech enhancement has a wide range of applications, including voice calls, teleconferencing, scene recording, military eavesdropping, hearing aids, and speech recognition devices, and has become a preprocessing module for many speech coding and recognition systems. Taking hearing aids as an example, ordinary hearing aids only perform basic amplification of the voice.
  • the speech enhancement application is used in the front-end processing of speech-related applications to ensure that the speech is separated from the noisy signal so that the back-end recognition model can correctly recognize the content of the speech.
  • Existing speech enhancement methods include unsupervised and supervised methods. An unsupervised speech enhancement method extracts the amplitude spectrum or log spectrum of the speech signal while ignoring the phase information; when the signal is synthesized back to the time domain, the unchanged phase information of the noisy speech signal is applied, which weakens the quality of the enhanced speech signal.
  • a supervised speech enhancement method is a neural-network-based method; using a deep neural network (DNN) or a convolutional neural network (CNN) with fully connected layers for supervised speech enhancement cannot represent the high- and low-frequency components of the signal well, and the fully connected layers cannot preserve the original information and spatial arrangement information of the signal well.
  • DNN Deep Neural Network
  • CNN Convolutional Neural Network
  • the present application provides a speech enhancement method, device, and storage medium based on a fully convolutional neural network, to solve the problem that the neural network models of existing speech enhancement methods cannot preserve the original information and spatial arrangement information of the speech signal well.
  • the present application provides a speech enhancement method based on a fully convolutional neural network, including:
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer.
  • the hidden layer is a plurality of convolutional layers, each of which has multiple filters
  • the output model of the output layer is: y_t = F^T · R_t (1)
  • where y_t is the t-th node of the output layer, F^T is the transpose of the filter's weight matrix, F ∈ R^(f×1), f represents the filter size, and R_t is the t-th node of the hidden layer
  • the model of the hidden layer of the fully convolutional neural network model is constructed according to the following formulas: R_j^1 = f(Σ_{i=1..n} w_ij^1 · x_i + b_j^1) (2) and R_k^l = f(Σ_{j=1..H} w_jk^l · R_j^{l-1} + b_k^l) (3)
  • x_i represents the variable of the i-th node of the input layer, w_ij^1 the connection weight between input node i and node j of the first hidden layer, and b_j^1 the offset of node j of the first hidden layer
  • n represents the number of nodes in the input layer
  • H is the number of nodes in a hidden layer
  • f is the excitation function.
  • another aspect of the present application provides an electronic device including a memory and a processor, where the memory stores a voice enhancement program that, when executed by the processor, implements the following steps:
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer.
  • the hidden layer is a plurality of convolutional layers, each of which has multiple filters
  • the output model of the output layer is: y_t = F^T · R_t (1)
  • where y_t is the t-th node of the output layer, F^T is the transpose of the filter's weight matrix, F ∈ R^(f×1), f represents the filter size, and R_t is the t-th node of the hidden layer
  • the model of the hidden layer in the fully convolutional neural network model is given by formulas (2) and (3) above
  • x_i represents the variable of the i-th node of the input layer
  • n represents the number of nodes in the input layer
  • H is the number of nodes in a hidden layer
  • f is the excitation function.
  • the processor training the fully convolutional neural network model includes:
  • the parameters of the fully convolutional neural network model include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
  • Input a training sample in the training sample set, and extract feature vectors from the training sample
  • e_k represents the error of the k-th node of the output layer, o_k the actual value of the k-th node of the output layer, and y_k the output value of the k-th node of the output layer
  • N represents the number of nodes of the output layer
  • the end condition includes one or both of a first end condition and a second end condition; the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
  • yet another aspect of the present application provides a computer-readable storage medium that includes a speech enhancement program; when the speech enhancement program is executed by a processor, the steps of the speech enhancement method described above are implemented.
  • This application constructs a fully convolutional neural network model as a speech enhancement model and inputs the original speech signal for processing to obtain an enhanced speech signal.
  • the fully connected layers are removed and only convolutional layers are included, which greatly reduces the number of parameters of the neural network, making the fully convolutional neural network model suitable for memory-limited mobile devices; each output sample depends only on adjacent inputs, so the original information and spatial arrangement information of the speech signal can be well preserved with fewer associated weight values.
  • FIG. 1 is a schematic flowchart of a speech enhancement method based on a fully convolutional neural network described in this application;
  • FIG. 2 is a schematic diagram of the structure of a fully convolutional neural network model in this application.
  • FIG. 3 is a schematic diagram of a module of a speech enhancement program in the present application.
  • FIG. 1 is a schematic flowchart of a speech enhancement method based on a fully convolutional neural network described in this application. As shown in FIG. 1, the speech enhancement method based on a fully convolutional neural network described in this application includes the following steps:
  • Step S1: Construct a fully convolutional neural network model.
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer; the hidden layer is a plurality of convolutional layers, each of which has multiple filters, and the output model of the output layer is: y_t = F^T · R_t (1)
  • Step S2: Train the fully convolutional neural network model.
  • Step S3: Input the original speech signal into the trained fully convolutional neural network model.
  • Step S4: Output an enhanced speech signal.
  • the weight matrix F of the filter is shared during the convolution operation; therefore, regardless of whether an output-layer node corresponds to a high-frequency or a low-frequency part, the hidden-layer node R_t and its two adjacent nodes R_{t-1} and R_{t+1} are not forced to be very similar. Whether a hidden-layer node is similar to its neighbors depends on the original input-layer nodes, so the fully convolutional neural network can retain the original input information well.
  • a fully convolutional neural network model is constructed as the speech enhancement model
  • the original speech signal is input and processed to obtain an enhanced speech signal.
  • the fully connected layers are removed and only convolutional layers are included, which greatly reduces the parameters of the neural network, so that the fully convolutional neural network model can run on memory-limited mobile devices such as mobile phones.
  • each output sample depends only on adjacent inputs, so the original information and spatial arrangement information of the speech signal can be well preserved with fewer associated weight values.
  • the fully convolutional neural network model includes an input layer, six convolutional layers (with padding), and an output layer; each convolutional layer has 1024 nodes, the convolution stride is 1, and each convolutional layer has 15 filters of size 11; the model of the hidden layer is constructed according to formulas (2) and (3), where x_i is the variable of the i-th input node, n is the number of input nodes, H is the number of nodes in a hidden layer, and f is the excitation function, for which the PReLU activation function is selected.
  • training the fully convolutional neural network model includes:
  • the parameters of the fully convolutional neural network model include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
  • the training sample set contains five noise types (white noise, pink noise, office noise, supermarket noise, and street noise) at five signal-to-noise ratios; the test sample set contains signal-to-noise ratios and noise types that are the same as or different from those of the training sample set.
  • the signal-to-noise ratios can differ, and the noise types can also differ, to make the test conditions closer to reality. Only five noise types are listed for the training sample set in this application, but the application is not limited to these.
  • Input a training sample from the training sample set and extract log power spectra (LPS) feature vectors from it; for example, in the input training sample, 512 sampling points of the original speech are taken as one frame, and a 257-dimensional LPS vector is extracted from each frame as the feature vector.
  • LPS log power spectra
  • e_k represents the error of the k-th node of the output layer, o_k the actual value of the k-th node of the output layer, and y_k the output value of the k-th node of the output layer
  • N represents the number of nodes of the output layer
  • the end condition includes one or both of a first end condition and a second end condition; the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
  • the test error is calculated according to formula (6): MSE = (1/N) Σ_{z=1..N} Σ_k (o_k^z − y_k^z)², where MSE represents the test error, N represents the number of samples in the test sample set, o_k^z is the actual value of test sample z at the k-th output node, and y_k^z is the output value of test sample z at the k-th output node
  • the output data of the fully convolutional neural network model is normalized before the output-layer node errors and the test error are calculated, which reduces the test error and improves the model accuracy.
  • speech quality is evaluated by the Perceptual Evaluation of Speech Quality (PESQ) measure, and speech intelligibility is evaluated by the Short-Time Objective Intelligibility (STOI) score.
  • PESQ Perceptual Evaluation of Speech Quality
  • STOI Short-Time Objective Intelligibility
  • Speech enhancement is performed through the fully convolutional neural network model of this application; compared with deep neural network and convolutional neural network models that contain fully connected layers, both PESQ and STOI improve: PESQ increases by about 0.5, and STOI by about 0.2-0.3.
  • the model of the hidden layer applies the PReLU activation function.
  • the fully convolutional neural network model is trained using a TIMIT corpus, which is divided into a training set and a test set.
  • the model of the hidden layer uses the Adam optimizer to minimize the mean squared error between the clean speech and the enhanced speech.
  • the quality of the output enhanced speech signal is judged by PESQ and the short-time objective intelligibility score STOI.
  • the speech enhancement method based on the fully convolutional neural network described in this application is applied to an electronic device.
  • the electronic device may be a terminal device such as a television, a smart phone, a tablet computer, and a computer.
  • the electronic device is not limited to the listed examples, and the electronic device may be any other device controlled by the user to process user commands through voice recognition technology, and output voice recognition results by performing voice enhancement processing on the input user's voice.
  • the electronic device includes: a memory and a processor, and the memory includes a speech enhancement program, and when the speech enhancement program is executed by the processor, the following steps are implemented:
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer.
  • the hidden layer is a plurality of convolutional layers, each of which has multiple filters
  • the output model of the output layer is: y_t = F^T · R_t (1)
  • where y_t is the t-th node of the output layer, F^T is the transpose of the filter's weight matrix, F ∈ R^(f×1), f represents the filter size, and R_t is the t-th node of the hidden layer
  • the memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, an optical disk, or a plug-in hard disk, and is not limited thereto; it may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides them to the processor for execution.
  • the electronic device further includes a voice receiver, which receives the user's voice signal through a device such as the microphone of the electronic device; voice enhancement processing is then performed on the input voice signal.
  • the processor may be a central processing unit, a microprocessor, or another data processing chip, and can run the program stored in the memory.
  • the processor searches for the weight value of each filter in the fully convolutional neural network by the gradient descent method.
  • the model of the hidden layer in the fully convolutional neural network model is given by formulas (2) and (3) above
  • x_i represents the variable of the i-th node of the input layer
  • n represents the number of nodes in the input layer
  • H is the number of nodes in a hidden layer
  • f is the excitation function, which may be chosen from the PReLU, Sigmoid, tanh, ReLU, and similar activation functions
  • the step of the processor training the fully convolutional neural network model includes:
  • the parameters of the fully convolutional neural network model include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
  • the training sample set contains five noise types (white noise, pink noise, office noise, supermarket noise, and street noise) at five signal-to-noise ratios; the test sample set contains signal-to-noise ratios and noise types that are the same as or different from those of the training sample set, so that the test conditions are closer to reality. Only five noise types are listed for the training sample set in this application, but the application is not limited to these;
  • Input a training sample in the training sample set, and extract feature vectors from the training sample
  • e_k represents the error of the k-th node of the output layer, o_k the actual value of the k-th node of the output layer, and y_k the output value of the k-th node of the output layer
  • N represents the number of nodes of the output layer
  • the end condition includes one or both of a first end condition and a second end condition; the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
  • the test error is calculated according to formula (6): MSE = (1/N) Σ_{z=1..N} Σ_k (o_k^z − y_k^z)², where MSE represents the test error, N represents the number of samples in the test sample set, o_k^z is the actual value of test sample z at the k-th output node, and y_k^z is the output value of test sample z at the k-th output node
  • the speech enhancement program may also be divided into one or more modules, and the one or more modules are stored in the memory and executed by the processor to complete the application.
  • the module referred to in this application refers to a series of computer program instruction segments capable of performing specific functions.
  • the speech enhancement program can be divided into: model building module 1, model training module 2, input module 3 and output module 4.
  • the functions and operation steps implemented by the above modules are similar to those described above and will not be described in detail here; for example:
  • the model building module 1 constructs a fully convolutional neural network model.
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer; the hidden layer is a plurality of convolutional layers, each of which has multiple filters, and the output model of the output layer is: y_t = F^T · R_t (1)
  • where t is the index of the node, y_t is the t-th node of the output layer, F is the filter, f represents the filter size, and R_t is the t-th node of the hidden layer
  • the model training module 2 trains the fully convolutional neural network model
  • the input module 3 inputs the original voice signal into the trained fully convolutional neural network model
  • the output module 4 outputs the enhanced voice signal.
  • the model of the hidden layer applies the PReLU activation function.
  • the fully convolutional neural network model is trained using a TIMIT corpus, which is divided into a training set and a test set.
  • the model of the hidden layer uses the Adam optimizer to minimize the mean squared error between the clean speech and the enhanced speech.
  • the quality of the output enhanced speech signal is judged by PESQ and the short-time objective intelligibility score STOI.
  • the computer-readable storage medium may be any tangible medium that contains or stores programs or instructions; the programs in it can be executed, with the corresponding functions implemented by hardware associated with the stored program instructions.
  • the computer-readable storage medium may be a computer disk, hard disk, random access memory, read-only memory, or the like.
  • the present application is not limited to this, and may be any device that stores instructions or software and any related data files or data structures in a non-transitory manner and can be provided to the processor to cause the processor to execute the programs or instructions therein.
  • the computer-readable storage medium includes a speech enhancement program. When the speech enhancement program is executed by a processor, the following speech enhancement method is implemented:
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer.
  • the hidden layer is multiple convolutional layers, each of which has multiple filters
  • the output model of the output layer is: y_t = F^T · R_t (1)
  • where y_t is the t-th node of the output layer, F^T is the transpose of the filter's weight matrix, F ∈ R^(f×1), f represents the filter size, and R_t is the t-th node of the hidden layer
  • the model of the hidden layer of the fully convolutional neural network model is constructed according to formulas (2) and (3) above
  • x_i represents the variable of the i-th node of the input layer
  • n represents the number of nodes in the input layer
  • H is the number of nodes in a hidden layer
  • f is the excitation function.
  • training the fully convolutional neural network model includes:
  • the parameters of the fully convolutional neural network model include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
  • Input a training sample in the training sample set, and extract feature vectors from the training sample
  • e_k represents the error of the k-th node of the output layer, o_k the actual value of the k-th node of the output layer, and y_k the output value of the k-th node of the output layer
  • N represents the number of nodes of the output layer
  • the end condition includes one or both of a first end condition and a second end condition; the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
  • the test error is calculated according to formula (6): MSE = (1/N) Σ_{z=1..N} Σ_k (o_k^z − y_k^z)², where MSE represents the test error, N represents the number of samples in the test sample set, o_k^z is the actual value of test sample z at the k-th output node, and y_k^z is the output value of test sample z at the k-th output node
  • test samples in the test sample set and the training samples in the training sample set have different signal-to-noise ratios and types of noise.
  • the fully convolutional neural network model includes an input layer, six convolutional layers, and an output layer; each convolutional layer has 1024 nodes, and the convolution stride is 1.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as the ROM/RAM, magnetic disk, or optical disk described above) and includes several instructions that enable a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to perform the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The present application relates to the field of artificial intelligence. Disclosed is a speech enhancement method based on a fully convolutional neural network. The method comprises: constructing a fully convolutional neural network model, the model comprising an input layer, a hidden layer, and an output layer, the hidden layer comprising a plurality of convolutional layers, and each convolutional layer comprising a plurality of filters; training the fully convolutional neural network model; inputting an original speech signal into the trained model; and outputting an enhanced speech signal. In the fully convolutional neural network model of the present application, the fully connected layer is removed and only convolutional layers remain, so that the parameters of the neural network are significantly reduced and the model is suitable for memory-limited mobile devices; each output sample depends only on adjacent inputs, so the original information and spatial arrangement information of the speech signal are well preserved with fewer weight values. Also disclosed are an electronic device and a computer-readable storage medium.

Description

Speech enhancement method, device and storage medium based on fully convolutional neural network

Technical Field

The present application relates to the field of speech technology, and in particular to a speech enhancement method, device and storage medium based on a fully convolutional neural network.

Background

Speech enhancement refers to the technique of filtering out, by some method, the various noises that disturb clean speech in real-life scenes, so as to improve the quality and intelligibility of the speech. In daily life, the voices collected by microphones are usually "polluted" voices carrying various noises, and the main purpose of speech enhancement is to recover clean speech from these "polluted" noisy voices. Speech enhancement has a wide range of applications, including voice calls, teleconferencing, scene recording, military eavesdropping, hearing aids, and speech recognition devices, and has become a preprocessing module for many speech coding and recognition systems. Taking hearing aids as an example, ordinary hearing aids only perform basic amplification of the voice; more sophisticated ones apply sound-pressure-level compression to compensate for the patient's hearing range. However, if the auditory scene is complex, the speech the patient hears contains not only the amplified speech but also a great deal of noise, which over time will inevitably cause secondary damage to the patient's auditory system. In high-end digital hearing aids, therefore, speech enhancement has become an important aspect that cannot be ignored.

Speech enhancement is applied in the front-end processing of speech-related applications to ensure that speech is separated from the noisy signal so that the back-end recognition model can correctly recognize its content. Existing speech enhancement methods include unsupervised and supervised methods. An unsupervised speech enhancement method extracts the amplitude spectrum or log spectrum of the speech signal while ignoring the phase information; when the signal is synthesized back to the time domain, the unchanged phase information of the noisy speech signal is applied, which weakens the quality of the enhanced speech signal. A supervised speech enhancement method is a neural-network-based method; using a deep neural network (DNN) or a convolutional neural network (CNN) with fully connected layers for supervised speech enhancement cannot represent the high- and low-frequency components of the signal well, and the fully connected layers cannot preserve the original information and spatial arrangement information of the signal well.
Summary of the Invention

In view of the above problems, the present application provides a speech enhancement method, device, and storage medium based on a fully convolutional neural network, to solve the problem that the neural network models of existing speech enhancement methods cannot preserve the original information and spatial arrangement information of the speech signal well.

To achieve the above objective, the present application provides a speech enhancement method based on a fully convolutional neural network, including:

constructing a fully convolutional neural network model, where the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer; the hidden layer is a plurality of convolutional layers, each of which has multiple filters; and the output model of the output layer is:

$$y_t = F^T R_t \tag{1}$$

where $y_t$ is the $t$-th node of the output layer, $F^T$ is the transpose of the filter's weight matrix, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the $t$-th node of the hidden layer;

training the fully convolutional neural network model;

inputting the original speech signal into the trained fully convolutional neural network model; and

outputting an enhanced speech signal.
Preferably, the model of the hidden layer of the fully convolutional neural network model is constructed according to the following formulas:

$$R_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1}\, x_i + b_j^{1}\right) \tag{2}$$

$$R_k^{l} = f\left(\sum_{j=1}^{H} w_{jk}^{l}\, R_j^{l-1} + b_k^{l}\right) \tag{3}$$

where $R_j^{1}$ denotes the output value of the $j$-th node of the first hidden layer, $x_i$ denotes the variable of the $i$-th node of the input layer, $w_{ij}^{1}$ denotes the connection weight between the $i$-th node of the input layer and the $j$-th node of the first hidden layer, $b_j^{1}$ denotes the offset of the $j$-th node of the first hidden layer, $n$ denotes the number of nodes in the input layer, $R_k^{l}$ denotes the output value of the $k$-th node of the $l$-th hidden layer, $R_j^{l-1}$ denotes the output value of the $j$-th node of the $(l-1)$-th hidden layer, $w_{jk}^{l}$ denotes the connection weight between the $k$-th node of the $l$-th hidden layer and the $j$-th node of the $(l-1)$-th hidden layer, $b_k^{l}$ denotes the offset of the $k$-th node of the $l$-th hidden layer, $H$ is the number of nodes in a hidden layer, and $f$ is the excitation function.
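To make the layer equations concrete, the following is a minimal numpy sketch of a forward pass through formulas (1)-(3). It is illustrative only and not part of the patent: it reads R_t in formula (1) as the length-f window of hidden-layer activations feeding output node t (an assumption, since the shared filter F of size f slides over the hidden layer during convolution), and all layer sizes are hypothetical.

    import numpy as np

    def prelu(z, alpha=0.25):
        # PReLU excitation: identity for positive inputs, alpha-scaled otherwise.
        return np.where(z > 0, z, alpha * z)

    def hidden_forward(x, weights, biases, f=prelu):
        # Formula (2) for the first hidden layer, formula (3) for later ones:
        # each node applies f to a weighted sum of the previous layer plus an offset.
        h = x
        for W, b in zip(weights, biases):      # W: (nodes_out, nodes_in)
            h = f(W @ h + b)
        return h

    def output_forward(h, F):
        # Formula (1): y_t = F^T R_t, sliding the shared size-f filter F
        # over windows of the last hidden layer (assumed interpretation).
        f_size = len(F)
        return np.array([F @ h[t:t + f_size] for t in range(len(h) - f_size + 1)])

    rng = np.random.default_rng(0)
    x = rng.standard_normal(32)                               # hypothetical input nodes
    Ws = [0.1 * rng.standard_normal((32, 32)) for _ in range(2)]
    bs = [np.zeros(32), np.zeros(32)]
    y = output_forward(hidden_forward(x, Ws, bs), rng.standard_normal(11))
    print(y.shape)                                            # (22,)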
To achieve the above objective, another aspect of the present application provides an electronic device including a memory and a processor, where the memory stores a voice enhancement program that, when executed by the processor, implements the following steps:

constructing a fully convolutional neural network model, where the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer; the hidden layer is a plurality of convolutional layers, each of which has multiple filters; and the output model of the output layer is:

$$y_t = F^T R_t \tag{1}$$

where $y_t$ is the $t$-th node of the output layer, $F^T$ is the transpose of the filter's weight matrix, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the $t$-th node of the hidden layer;

training the fully convolutional neural network model;

inputting the original speech signal into the trained fully convolutional neural network model; and

outputting an enhanced speech signal.
Preferably, the model of the hidden layer in the fully convolutional neural network model is:

$$R_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1}\, x_i + b_j^{1}\right) \tag{2}$$

$$R_k^{l} = f\left(\sum_{j=1}^{H} w_{jk}^{l}\, R_j^{l-1} + b_k^{l}\right) \tag{3}$$

where $R_j^{1}$ denotes the output value of the $j$-th node of the first hidden layer, $x_i$ denotes the variable of the $i$-th node of the input layer, $w_{ij}^{1}$ denotes the connection weight between the $i$-th node of the input layer and the $j$-th node of the first hidden layer, $b_j^{1}$ denotes the offset of the $j$-th node of the first hidden layer, $n$ denotes the number of nodes in the input layer, $R_k^{l}$ denotes the output value of the $k$-th node of the $l$-th hidden layer, $R_j^{l-1}$ denotes the output value of the $j$-th node of the $(l-1)$-th hidden layer, $w_{jk}^{l}$ denotes the connection weight between the $k$-th node of the $l$-th hidden layer and the $j$-th node of the $(l-1)$-th hidden layer, $b_k^{l}$ denotes the offset of the $k$-th node of the $l$-th hidden layer, $H$ is the number of nodes in a hidden layer, and $f$ is the excitation function.
Preferably, the processor training the fully convolutional neural network model includes:

initially assigning values to the parameters of the fully convolutional neural network model, where the parameters include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;

constructing a sample set and dividing the sample set proportionally into a training sample set and a test sample set;

inputting a training sample from the training sample set and extracting feature vectors from the training sample;

substituting the input data of the training sample into formulas (1)-(3) to calculate the output value of each node of the hidden layers and of the output layer;

calculating the error of each node of the output layer:

$$e_k = o_k - y_k \tag{4}$$

where $e_k$ denotes the error of the $k$-th node of the output layer, $o_k$ denotes the actual value of the $k$-th node of the output layer, and $y_k$ denotes the output value of the $k$-th node of the output layer;

updating the parameters of the fully convolutional neural network model based on error back-propagation;

inputting the next training sample and continuing to update the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
setting the loss function of the fully convolutional neural network model:

$$E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^2 \tag{5}$$

where $n$ denotes the number of nodes of the output layer, $o_k$ denotes the actual value of the $k$-th node of the output layer, and $y_k$ denotes the output value of the $k$-th node of the output layer; and

judging whether the training satisfies an end condition: if the end condition is satisfied, the training ends and the trained fully convolutional neural network model is output; otherwise, training of the model continues. The end condition includes one or both of a first end condition and a second end condition, where the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
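As an illustration of this training procedure, the sketch below assumes a PyTorch model and data loader (names and hyperparameters are hypothetical, not from the patent) and wires together the mean-squared-error loss of formula (5), error back-propagation, and the two end conditions. Adam is used as the optimizer, matching the preference stated later in this description.

    import torch

    def train(model, loader, max_iters=100, target_delta=1e-4, patience=3):
        loss_fn = torch.nn.MSELoss()                 # loss of formula (5)
        opt = torch.optim.Adam(model.parameters())
        history = []
        for _ in range(max_iters):                   # first end condition: max iterations
            total = 0.0
            for noisy, clean in loader:              # one full pass = one iteration
                opt.zero_grad()
                loss = loss_fn(model(noisy), clean)  # node errors e_k = o_k - y_k
                loss.backward()                      # error back-propagation
                opt.step()                           # update weights and offsets
                total += loss.item()
            history.append(total / len(loader))
            # second end condition: loss change stays below the target value
            # over several consecutive iterations.
            if len(history) > patience and all(
                    abs(history[-i] - history[-i - 1]) < target_delta
                    for i in range(1, patience + 1)):
                break
        return model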
To achieve the above objective, yet another aspect of the present application provides a computer-readable storage medium that includes a speech enhancement program; when the speech enhancement program is executed by a processor, the steps of the speech enhancement method described above are implemented.

Compared with the prior art, the present application has the following advantages and beneficial effects:

The present application constructs a fully convolutional neural network model as a speech enhancement model and inputs the original speech signal for processing to obtain an enhanced speech signal. In the fully convolutional neural network model, the fully connected layers are removed and only convolutional layers remain, which greatly reduces the number of parameters of the neural network and makes the model suitable for memory-limited mobile devices; moreover, each output sample depends only on adjacent inputs, so the original information and spatial arrangement information of the speech signal can be well preserved with fewer associated weight values.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of the speech enhancement method based on a fully convolutional neural network described in this application;

FIG. 2 is a schematic diagram of the structure of the fully convolutional neural network model in this application;

FIG. 3 is a schematic diagram of the modules of the speech enhancement program in this application.

The implementation, functional characteristics, and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description

The embodiments of the present application are described below with reference to the drawings. Those of ordinary skill in the art will recognize that the described embodiments can be modified in various ways, or combined, without departing from the spirit and scope of the present application. Therefore, the drawings and descriptions are illustrative in nature, serve only to explain the present application, and are not intended to limit the protection scope of the claims. In addition, in this specification the drawings are not drawn to scale, and the same reference numerals denote the same parts.
FIG. 1 is a schematic flowchart of the speech enhancement method based on a fully convolutional neural network described in this application. As shown in FIG. 1, the method includes the following steps:

Step S1: Construct a fully convolutional neural network model. As shown in FIG. 2, the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer; the hidden layer is a plurality of convolutional layers, each of which has multiple filters; and the output model of the output layer is:

$$y_t = F^T R_t \tag{1}$$

where $y_t$ is the $t$-th node of the output layer, $F^T$ is the transpose of the weight matrix of the filter, $F \in \mathbb{R}^{f \times 1}$ ($f$ denotes the filter size), and $R_t$ is the $t$-th node of the hidden layer.

Step S2: Train the fully convolutional neural network model.

Step S3: Input the original speech signal into the trained fully convolutional neural network model.

Step S4: Output an enhanced speech signal.
In this application, the weight matrix F of the filter is shared during the convolution operation. Therefore, regardless of whether an output-layer node corresponds to a high-frequency or a low-frequency part, the hidden-layer node $R_t$ and its two adjacent nodes $R_{t-1}$ and $R_{t+1}$ are not forced to be very similar; whether a hidden-layer node resembles its neighbors depends on the original input-layer nodes. As a result, the fully convolutional neural network can retain the original input information well.
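The locality property behind this claim is easy to verify empirically. The sketch below (an illustration, not from the patent) stacks six padded 1-D convolutions, perturbs a single input sample, and checks that only the outputs within the receptive field change, i.e., each output depends only on adjacent inputs.

    import torch

    net = torch.nn.Sequential(*[
        torch.nn.Conv1d(1, 1, kernel_size=11, padding=5)   # one shared filter per layer
        for _ in range(6)
    ])

    x = torch.zeros(1, 1, 256)
    with torch.no_grad():
        base = net(x)
        x[0, 0, 128] = 1.0                     # perturb one input sample
        moved = (net(x) - base).abs().squeeze() > 1e-9
    # 6 layers x radius 5 = receptive radius 30: only outputs 98..158 change.
    print(moved.nonzero().min().item(), moved.nonzero().max().item())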
In this application, a fully convolutional neural network model is constructed as the speech enhancement model, and the original speech signal is input and processed to obtain an enhanced speech signal. In the fully convolutional neural network model, the fully connected layers are removed and only convolutional layers remain, which greatly reduces the number of parameters of the neural network, so that the model can run on memory-limited mobile devices such as mobile phones. Moreover, each output sample depends only on adjacent inputs, so the original information and spatial arrangement information of the speech signal can be well preserved with fewer associated weight values.
In an optional embodiment of the present application, the fully convolutional neural network model includes an input layer, six convolutional layers (with padding), and an output layer. Each convolutional layer has 1024 nodes, the convolution stride is 1, and each convolutional layer has 15 filters of size 11. The model of the hidden layer of the fully convolutional neural network model is constructed according to the following formulas:
$$R_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1}\, x_i + b_j^{1}\right) \tag{2}$$

$$R_k^{l} = f\left(\sum_{j=1}^{H} w_{jk}^{l}\, R_j^{l-1} + b_k^{l}\right) \tag{3}$$

where $R_j^{1}$ denotes the output value of the $j$-th node of the first hidden layer, $x_i$ denotes the variable of the $i$-th node of the input layer, $w_{ij}^{1}$ denotes the connection weight between the $i$-th node of the input layer and the $j$-th node of the first hidden layer, $b_j^{1}$ denotes the offset of the $j$-th node of the first hidden layer, $n$ denotes the number of nodes in the input layer, $R_k^{l}$ denotes the output value of the $k$-th node of the $l$-th hidden layer, $R_j^{l-1}$ denotes the output value of the $j$-th node of the $(l-1)$-th hidden layer, $w_{jk}^{l}$ denotes the connection weight between the $k$-th node of the $l$-th hidden layer and the $j$-th node of the $(l-1)$-th hidden layer, $b_k^{l}$ denotes the offset of the $k$-th node of the $l$-th hidden layer, $H$ is the number of nodes in a hidden layer, and $f$ is the excitation function, for which the PReLU activation function is selected.
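A hedged PyTorch sketch of this embodiment follows. The patent does not specify how the 15 filters of one layer connect to the next, so the channel wiring (15-channel feature maps between layers, a single-channel output) is an assumption made for illustration.

    import torch.nn as nn

    class FCNEnhancer(nn.Module):
        # Fully convolutional: no fully connected layers at all.
        def __init__(self, n_layers=6, n_filters=15, kernel=11):
            super().__init__()
            layers, ch_in = [], 1                       # single-channel signal input
            for _ in range(n_layers):
                # stride 1 with padding keeps the signal length unchanged
                layers += [nn.Conv1d(ch_in, n_filters, kernel,
                                     stride=1, padding=kernel // 2),
                           nn.PReLU()]                  # PReLU excitation function
                ch_in = n_filters
            # output layer of formula (1): one shared size-11 filter produces y_t
            layers.append(nn.Conv1d(ch_in, 1, kernel, stride=1,
                                    padding=kernel // 2))
            self.net = nn.Sequential(*layers)

        def forward(self, x):                           # x: (batch, 1, time)
            return self.net(x)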
In an optional embodiment of the present application, training the fully convolutional neural network model includes:

initially assigning values to the parameters of the fully convolutional neural network model, where the parameters include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
constructing a sample set and dividing it proportionally into a training sample set and a test sample set. The samples may be randomly selected from the TIMIT corpus, with a 6:1 ratio between the numbers of training and test samples; for example, 700 phrases are randomly selected from the TIMIT corpus, of which 600 constitute the training sample set and the remaining 100 constitute the test sample set. The training sample set contains five noise types (white noise, pink noise, office noise, supermarket noise, and street noise) at five signal-to-noise ratios; the test sample set contains signal-to-noise ratios and noise types that are the same as or different from those of the training sample set, so that the test conditions are closer to reality. Only five noise types are listed for the training sample set in this application, but the application is not limited to these.
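The patent does not state how the noisy samples are generated; a common recipe (an assumption added here for illustration) scales the noise so that the mixture reaches a target signal-to-noise ratio:

    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        # Scale noise so that 10*log10(P_clean / P_noise) equals snr_db, then add it.
        noise = noise[:len(clean)]
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
        return clean + scale * noise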
inputting a training sample from the training sample set and extracting log power spectra (LPS) feature vectors from it; for example, in the input training sample, 512 sampling points of the original speech are taken as one frame, and a 257-dimensional LPS vector is extracted from each frame as the feature vector.
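A minimal numpy sketch of this feature extraction follows; the hop size and window choice are assumptions not stated in the patent, while the 257-dimensional vector falls out of the 512-point frame (a 512-point rFFT yields 257 bins).

    import numpy as np

    def lps_features(signal, frame_len=512, hop=256):
        frames = np.stack([signal[i:i + frame_len]
                           for i in range(0, len(signal) - frame_len + 1, hop)])
        window = np.hamming(frame_len)               # windowing is an assumption
        spectra = np.fft.rfft(frames * window, axis=1)
        # log power spectra: shape (num_frames, 257) for frame_len = 512
        return np.log(np.abs(spectra) ** 2 + 1e-12)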
substituting the input data of the training sample into formulas (1)-(3) to calculate the output value of each node of the hidden layers and of the output layer;

calculating the error of each node of the output layer:

$$e_k = o_k - y_k \tag{4}$$

where $e_k$ denotes the error of the $k$-th node of the output layer, $o_k$ denotes the actual value of the $k$-th node of the output layer, and $y_k$ denotes the output value of the $k$-th node of the output layer;

updating the parameters of the fully convolutional neural network model based on error back-propagation;

inputting the next training sample and continuing to update the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
setting the loss function of the fully convolutional neural network model:

$$E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^2 \tag{5}$$

where $n$ denotes the number of nodes of the output layer, $o_k$ denotes the actual value of the $k$-th node of the output layer, and $y_k$ denotes the output value of the $k$-th node of the output layer; and

judging whether the training satisfies an end condition: if the end condition is satisfied, the training ends and the trained fully convolutional neural network model is output; otherwise, training of the model continues. The end condition includes one or both of a first end condition and a second end condition, where the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
Preferably, the test error is calculated according to the following formula:

$$MSE = \frac{1}{N}\sum_{z=1}^{N}\sum_{k}\left(o_k^{z} - y_k^{z}\right)^2 \tag{6}$$

where $MSE$ denotes the test error, $N$ denotes the number of samples in the test sample set, $o_k^{z}$ denotes the actual value of test sample $z$ at the $k$-th node of the output layer, and $y_k^{z}$ denotes the output value of test sample $z$ at the $k$-th node of the output layer. The smaller the test error, the higher the accuracy of the constructed fully convolutional neural network model.
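In code, formula (6) is a one-liner; the sketch below assumes the model outputs and targets for the test set are stacked into arrays of shape (N, number of output nodes).

    import numpy as np

    def test_error(outputs, targets):
        # Formula (6): squared node errors summed per sample, averaged over N samples.
        outputs, targets = np.asarray(outputs), np.asarray(targets)
        return np.sum((targets - outputs) ** 2) / len(outputs)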
In this application, the output data of the fully convolutional neural network model is normalized before the output-layer node errors and the test error are calculated, which reduces the test error and improves the model accuracy.

Preferably, speech quality is evaluated by the Perceptual Evaluation of Speech Quality (PESQ) measure, and speech intelligibility is evaluated by the Short-Time Objective Intelligibility (STOI) score.

When speech enhancement is performed with the fully convolutional neural network model of this application, both PESQ and STOI improve relative to deep neural network and convolutional neural network models that contain fully connected layers: PESQ increases by about 0.5, and STOI by about 0.2-0.3.
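Both metrics have open-source implementations; the sketch below assumes the third-party pesq and pystoi Python packages (pip install pesq pystoi), which are not mentioned in the patent.

    from pesq import pesq      # Perceptual Evaluation of Speech Quality
    from pystoi import stoi    # Short-Time Objective Intelligibility

    def evaluate(clean, enhanced, fs=16000):
        return {
            "PESQ": pesq(fs, clean, enhanced, "wb"),  # wideband mode, roughly [-0.5, 4.5]
            "STOI": stoi(clean, enhanced, fs),        # score in [0, 1]
        }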
Preferably, the model of the hidden layer applies the PReLU activation function.

Preferably, the fully convolutional neural network model is trained on the TIMIT corpus, which is divided into a training set and a test set.

Preferably, the model of the hidden layer uses the Adam optimizer to minimize the mean squared error between the clean speech and the enhanced speech.

Preferably, the quality of the output enhanced speech signal is judged by PESQ and the short-time objective intelligibility score STOI.
The speech enhancement method based on a fully convolutional neural network described in this application is applied to an electronic device. The electronic device may be a terminal device such as a television, a smartphone, a tablet computer, or a computer. However, the electronic device is not limited to these examples; it may be any other user-controlled device that processes user commands through speech recognition technology, performing speech enhancement on the input user speech and outputting the speech recognition result.

The electronic device includes a memory and a processor. The memory stores a speech enhancement program, and when the speech enhancement program is executed by the processor, the following steps are implemented:
构建全卷积神经网络模型,所述全卷积神经网络模型包括输入层、隐含层和输出层,所述隐含层为多个卷积层,每个卷积层均具有多个滤波器,所述输出层的输出模型为:Construct a fully convolutional neural network model. The fully convolutional neural network model includes an input layer, a hidden layer, and an output layer. The hidden layer is a plurality of convolutional layers, each of which has multiple filters , The output model of the output layer is:
y t=F T*R t  (1) y t = F T * R t (1)
其中,y t是输出层的第t个节点,F T是滤波器的权重矩阵的转置,F∈R f ×1,f表示滤波器尺寸,R t是隐含层的第t个节点; Where y t is the t-th node of the output layer, F T is the transpose of the filter's weight matrix, F ∈ R f × 1 , f represents the filter size, and R t is the t-th node of the hidden layer;
Train the fully convolutional neural network model;
Input the original speech signal into the trained fully convolutional neural network model;
Output the enhanced speech signal. (A code sketch of the output-layer computation in formula (1) follows these steps.)
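As an illustration of formula (1), a minimal NumPy sketch of the output-layer computation, under the assumption that the f hidden-layer values feeding each output node t are collected in a row of a matrix R; this framing of the data is an assumption, not something the application fixes:

```python
import numpy as np

def output_layer(R, F):
    """Formula (1): y_t = F^T * R_t for every output node t.

    R : (T, f) array whose row t holds the f hidden-layer values
        feeding output node t (the filter F is in R^{f x 1}).
    F : (f,) filter weight vector.
    """
    return R @ F  # y[t] = F^T R_t
```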
The memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, or an optical disc, or a plug-in hard disk, among others; it is not limited to these and may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides them to the processor so that the processor can execute them.
The electronic device further includes a speech receiver, which receives the user's speech signal through a device such as the microphone of the electronic device, after which speech enhancement processing is performed on the input speech signal.
The processor may be a central processing unit, a microprocessor, or another data processing chip, and it runs the programs stored in the memory.
The processor finds the weight values of the filters in the fully convolutional neural network by gradient descent, as sketched below.
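A minimal sketch of one plain gradient-descent update of the filter weights, assuming a squared-error objective between the actual values o and the outputs of formula (1); the learning rate and the objective are illustrative assumptions:

```python
import numpy as np

def gradient_descent_step(F, R, o, lr=1e-3):
    """One plain gradient-descent update of the filter weights F."""
    y = R @ F                      # forward pass, y[t] = F^T R_t
    grad = R.T @ (y - o) / len(o)  # gradient of (1/2) * mean squared error
    return F - lr * grad           # step against the gradient
```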
In an optional embodiment of the present application, the model of the hidden layers in the fully convolutional neural network model is:
$$h_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1} x_i + b_j^{1}\right) \qquad (2)$$

$$h_k^{l} = f\left(\sum_{j=1}^{H} w_{kj}^{l} h_j^{l-1} + b_k^{l}\right) \qquad (3)$$

where $h_j^{1}$ denotes the output value of the j-th node of the first hidden layer; $x_i$ denotes the variable of the i-th node of the input layer; $w_{ij}^{1}$ denotes the connection weight between the i-th node of the input layer and the j-th node of the first hidden layer; $b_j^{1}$ denotes the offset of the j-th node of the first hidden layer; $n$ denotes the number of nodes of the input layer; $h_k^{l}$ denotes the output value of the k-th node of the l-th hidden layer; $h_j^{l-1}$ denotes the output value of the j-th node of the (l-1)-th hidden layer; $w_{kj}^{l}$ denotes the connection weight between the k-th node of the l-th hidden layer and the j-th node of the (l-1)-th hidden layer; $b_k^{l}$ denotes the offset of the k-th node of the l-th hidden layer; $H$ denotes the number of nodes of a hidden layer; and $f$ is the activation function, which may be the PReLU, Sigmoid, tanh, ReLU, or another activation function.
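A minimal NumPy sketch of the forward pass in formulas (2) and (3), written in the same layerwise notation as the formulas; the PReLU slope of 0.25 is an assumed initial value, not a number fixed by the application:

```python
import numpy as np

def prelu(x, a=0.25):
    """PReLU activation; the slope a is learnable in practice."""
    return np.where(x > 0, x, a * x)

def hidden_forward(x, weights, biases, f=prelu):
    """Formulas (2)-(3): propagate input x through the hidden layers.

    weights[l] : (H_l, H_{l-1}) matrix of connection weights w^l
    biases[l]  : (H_l,) vector of offsets b^l
    """
    h = x
    for W, b in zip(weights, biases):
        h = f(W @ h + b)  # h^l_k = f(sum_j w^l_kj h^{l-1}_j + b^l_k)
    return h
```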
In one embodiment of the present application, the processor trains the fully convolutional neural network model through the following steps:
Initially assign values to the parameters of the fully convolutional neural network model, where the parameters include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
Construct a sample set and divide it proportionally into a training sample set and a test sample set. The samples may be selected at random from the TIMIT corpus, with a 6:1 ratio between the numbers of training and test samples; for example, 700 phrases are randomly selected from the TIMIT corpus, of which 600 form the training sample set and the remaining 100 form the test sample set. The training sample set contains five noise types (white noise, pink noise, office noise, supermarket noise, and street noise) at five signal-to-noise ratios, and the test sample set contains signal-to-noise ratios and noise types that are the same as or different from those in the training sample set, so that the test conditions better approximate real use. Only five noise types are listed for the training sample set here, but the application is not limited to these;
Input one training sample from the training sample set and extract a feature vector from the training sample;
Substitute the input data of the training sample into formulas (1)-(3) to compute the output values of the hidden-layer nodes and of the output-layer nodes;
Compute the error of each node of the output layer:
$e_k = o_k - y_k$  (4)
where $e_k$ denotes the error of the k-th node of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
Update the parameters of the fully convolutional neural network model by error back-propagation;
Input the next training sample and continue updating the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
Set the loss function of the fully convolutional neural network model:
$E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^{2}$  (5)
where $n$ denotes the number of nodes of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
Judge whether training satisfies an end condition. If the end condition is satisfied, end the training and output the trained fully convolutional neural network model; otherwise, continue training the model. The end condition includes one or both of a first end condition and a second end condition: the first end condition is that the current iteration count exceeds a set maximum number of iterations, and the second end condition is that the change of the loss function value over several consecutive iterations is smaller than a set target value. (A code sketch of this training loop follows.)
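A compact PyTorch sketch of this training procedure and its two end conditions. The application fixes only the use of back-propagation, the Adam optimizer, an MSE-style loss, and the two stopping rules; the layer sizes, feature dimension, learning rate, thresholds, and the dummy stand-in data below are all illustrative assumptions:

```python
import torch
import torch.nn as nn

MAX_ITERS, TARGET_DELTA = 100, 1e-5    # assumed end-condition settings

model = nn.Sequential(                 # dense stand-in for formulas (1)-(3)
    nn.Linear(257, 1024), nn.PReLU(),
    nn.Linear(1024, 1024), nn.PReLU(),
    nn.Linear(1024, 257),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()                 # plays the role of formula (5)

# Dummy noisy/clean feature pairs; in practice a DataLoader over the
# TIMIT-based sample set (the feature representation is an assumption).
train_loader = [(torch.randn(32, 257), torch.randn(32, 257)) for _ in range(10)]

prev_loss = float('inf')
for it in range(MAX_ITERS):            # first end condition
    epoch_loss = 0.0
    for noisy, clean in train_loader:  # one pass = one iteration
        optimizer.zero_grad()
        loss = loss_fn(model(noisy), clean)
        loss.backward()                # error back-propagation
        optimizer.step()               # parameter update
        epoch_loss += loss.item()
    if abs(prev_loss - epoch_loss) < TARGET_DELTA:
        break                          # second end condition
    prev_loss = epoch_loss
```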
Preferably, the test error is calculated according to the following formula:

$$MSE = \frac{1}{N}\sum_{z=1}^{N}\sum_{k}\left(o_k^{z} - y_k^{z}\right)^{2} \qquad (6)$$

where $MSE$ denotes the test error, $N$ the number of samples in the test sample set, $o_k^{z}$ the actual value of sample z of the test sample set at the k-th node of the output layer, and $y_k^{z}$ the output value of sample z of the test sample set at the k-th node of the output layer.
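A one-function NumPy sketch of formula (6), assuming the actual and output values of all N test samples are stacked into two arrays of identical shape:

```python
import numpy as np

def test_error(actual, outputs):
    """Formula (6): test-set MSE over N samples.

    actual, outputs : (N, n) arrays holding, for each sample z, the
    actual value o_k^z and the output value y_k^z at every output node k.
    """
    N = actual.shape[0]
    return np.sum((actual - outputs) ** 2) / N
```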
In other embodiments, the speech enhancement program may also be divided into one or more modules, which are stored in the memory and executed by the processor to implement this application. A module referred to in this application is a series of computer program instruction segments that performs a specific function. The speech enhancement program may be divided into: a model building module 1, a model training module 2, an input module 3, and an output module 4. The functions or operation steps implemented by these modules are similar to those described above and are not detailed again here; illustratively:
The model building module 1 constructs a fully convolutional neural network model, where the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer, the hidden layer consists of multiple convolutional layers, and each convolutional layer has multiple filters; the output model of the output layer is:
$y_t = F^{T} * R_t$  (1)
where $t$ is the node index, $y_t$ is the t-th node of the output layer, $F$ is the filter, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the t-th node of the hidden layer;
The model training module 2 trains the fully convolutional neural network model;
The input module 3 inputs the original speech signal into the trained fully convolutional neural network model;
The output module 4 outputs the enhanced speech signal.
Preferably, the hidden-layer model uses the PReLU activation function.
Preferably, the fully convolutional neural network model is trained on the TIMIT corpus, which is divided into a training set and a test set.
Preferably, the hidden-layer model uses the Adam optimizer to minimize the mean square error between clean speech and enhanced speech.
Preferably, the quality of the output enhanced speech signal is judged by PESQ and the short-time objective intelligibility score STOI.
In one embodiment of the present application, the computer-readable storage medium may be any tangible medium that contains or stores programs or instructions; the programs therein may be executed, with the stored program instructions directing the associated hardware to implement the corresponding functions. For example, the computer-readable storage medium may be a computer diskette, a hard disk, a random access memory, a read-only memory, or the like. The application is not limited to these; the medium may be any device that stores instructions or software and any associated data files or data structures in a non-transitory manner and can provide them to a processor so that the processor executes the programs or instructions therein. The computer-readable storage medium includes a speech enhancement program which, when executed by a processor, implements the following speech enhancement method:
Construct a fully convolutional neural network model, where the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer, the hidden layer consists of multiple convolutional layers, and each convolutional layer has multiple filters; the output model of the output layer is:
$y_t = F^{T} * R_t$  (1)
where $y_t$ is the t-th node of the output layer, $F^{T}$ is the transpose of the filter weight matrix, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the t-th node of the hidden layer;
Train the fully convolutional neural network model;
Input the original speech signal into the trained fully convolutional neural network model;
Output the enhanced speech signal.
Preferably, the model of the hidden layers of the fully convolutional neural network model is constructed according to the following formulas:

$$h_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1} x_i + b_j^{1}\right) \qquad (2)$$

$$h_k^{l} = f\left(\sum_{j=1}^{H} w_{kj}^{l} h_j^{l-1} + b_k^{l}\right) \qquad (3)$$

where $h_j^{1}$ denotes the output value of the j-th node of the first hidden layer; $x_i$ denotes the variable of the i-th node of the input layer; $w_{ij}^{1}$ denotes the connection weight between the i-th node of the input layer and the j-th node of the first hidden layer; $b_j^{1}$ denotes the offset of the j-th node of the first hidden layer; $n$ denotes the number of nodes of the input layer; $h_k^{l}$ denotes the output value of the k-th node of the l-th hidden layer; $h_j^{l-1}$ denotes the output value of the j-th node of the (l-1)-th hidden layer; $w_{kj}^{l}$ denotes the connection weight between the k-th node of the l-th hidden layer and the j-th node of the (l-1)-th hidden layer; $b_k^{l}$ denotes the offset of the k-th node of the l-th hidden layer; $H$ denotes the number of nodes of a hidden layer; and $f$ is the activation function.
Preferably, training the fully convolutional neural network model includes:
Initially assign values to the parameters of the fully convolutional neural network model, where the parameters include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
Construct a sample set and divide it proportionally into a training sample set and a test sample set;
Input one training sample from the training sample set and extract a feature vector from the training sample;
Substitute the input data of the training sample into formulas (1)-(3) to compute the output values of the hidden-layer nodes and of the output-layer nodes;
Compute the error of each node of the output layer:
$e_k = o_k - y_k$  (4)
where $e_k$ denotes the error of the k-th node of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
Update the parameters of the fully convolutional neural network model by error back-propagation;
Input the next training sample and continue updating the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
Set the loss function of the fully convolutional neural network model:
$E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^{2}$  (5)
where $n$ denotes the number of nodes of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
Judge whether training satisfies an end condition. If the end condition is satisfied, end the training and output the trained fully convolutional neural network model; otherwise, continue training the model. The end condition includes one or both of a first end condition and a second end condition: the first end condition is that the current iteration count exceeds a set maximum number of iterations, and the second end condition is that the change of the loss function value over several consecutive iterations is smaller than a set target value.
Preferably, the test error is calculated according to the following formula:

$$MSE = \frac{1}{N}\sum_{z=1}^{N}\sum_{k}\left(o_k^{z} - y_k^{z}\right)^{2} \qquad (6)$$

where $MSE$ denotes the test error, $N$ the number of samples in the test sample set, $o_k^{z}$ the actual value of sample z of the test sample set at the k-th node of the output layer, and $y_k^{z}$ the output value of sample z of the test sample set at the k-th node of the output layer.
Preferably, the test samples in the test sample set differ from the training samples in the training sample set in both signal-to-noise ratio and noise type.
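A minimal sketch of the standard construction for mixing a clean utterance with a noise recording at a chosen signal-to-noise ratio, which is how such training and test samples are typically synthesized; the construction itself is an assumption, since the application does not spell out the mixing procedure:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix clean speech with noise at a target SNR in decibels."""
    noise = noise[:len(clean)]             # trim noise to utterance length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```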
Preferably, the fully convolutional neural network model includes an input layer, six convolutional layers, and an output layer; each convolutional layer has 1024 nodes, and the convolution stride is 1.
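A PyTorch sketch of this preferred architecture. The application fixes six convolutional layers of 1024 nodes with stride 1 and no fully connected layers; the kernel size, padding, and single-channel input/output below are illustrative assumptions:

```python
import torch.nn as nn

def build_fcn(kernel_size=11):
    """Input layer -> six 1024-channel conv layers (stride 1) -> output layer."""
    pad = kernel_size // 2                  # keep the time axis unchanged
    layers, channels = [], [1] + [1024] * 6
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv1d(c_in, c_out, kernel_size,
                             stride=1, padding=pad), nn.PReLU()]
    layers.append(nn.Conv1d(1024, 1, kernel_size, stride=1, padding=pad))
    return nn.Sequential(*layers)           # no fully connected layers anywhere
```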
The specific implementation of the computer-readable storage medium of this application is substantially the same as that of the speech enhancement method and electronic device described above and is not repeated here.
It should be noted that, as used herein, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, device, article, or method that includes that element.
The serial numbers of the above embodiments of this application are for description only and do not represent the relative merits of the embodiments. From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc), including several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not limit its patent scope; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

1. A speech enhancement method based on a fully convolutional neural network, applied to an electronic device, characterized in that the method comprises:
    constructing a fully convolutional neural network model, the fully convolutional neural network model comprising an input layer, a hidden layer, and an output layer, the hidden layer consisting of multiple convolutional layers, each convolutional layer having multiple filters, the output model of the output layer being:
    $y_t = F^{T} * R_t$  (1)
    where $y_t$ is the t-th node of the output layer, $F^{T}$ is the transpose of the filter weight matrix, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the t-th node of the hidden layer;
    training the fully convolutional neural network model;
    inputting the original speech signal into the trained fully convolutional neural network model;
    outputting the enhanced speech signal.
2. The speech enhancement method based on a fully convolutional neural network according to claim 1, characterized in that the model of the hidden layers of the fully convolutional neural network model is constructed according to the following formulas:

    $$h_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1} x_i + b_j^{1}\right) \qquad (2)$$

    $$h_k^{l} = f\left(\sum_{j=1}^{H} w_{kj}^{l} h_j^{l-1} + b_k^{l}\right) \qquad (3)$$

    where $h_j^{1}$ denotes the output value of the j-th node of the first hidden layer; $x_i$ denotes the variable of the i-th node of the input layer; $w_{ij}^{1}$ denotes the connection weight between the i-th node of the input layer and the j-th node of the first hidden layer; $b_j^{1}$ denotes the offset of the j-th node of the first hidden layer; $n$ denotes the number of nodes of the input layer; $h_k^{l}$ denotes the output value of the k-th node of the l-th hidden layer; $h_j^{l-1}$ denotes the output value of the j-th node of the (l-1)-th hidden layer; $w_{kj}^{l}$ denotes the connection weight between the k-th node of the l-th hidden layer and the j-th node of the (l-1)-th hidden layer; $b_k^{l}$ denotes the offset of the k-th node of the l-th hidden layer; $H$ denotes the number of nodes of a hidden layer; and $f$ is the activation function.
3. The speech enhancement method based on a fully convolutional neural network according to claim 2, characterized in that training the fully convolutional neural network model comprises:
    initially assigning values to the parameters of the fully convolutional neural network model, the parameters including the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
    constructing a sample set and dividing it proportionally into a training sample set and a test sample set;
    inputting one training sample from the training sample set and extracting a feature vector from the training sample;
    substituting the input data of the training sample into formulas (1)-(3) to compute the output values of the hidden-layer nodes and of the output-layer nodes;
    computing the error of each node of the output layer:
    $e_k = o_k - y_k$  (4)
    where $e_k$ denotes the error of the k-th node of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
    updating the parameters of the fully convolutional neural network model by error back-propagation;
    inputting the next training sample and continuing to update the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
    setting the loss function of the fully convolutional neural network model:
    $E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^{2}$  (5)
    where $n$ denotes the number of nodes of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
    judging whether training satisfies an end condition; if the end condition is satisfied, ending the training and outputting the trained fully convolutional neural network model, and otherwise continuing to train the model, wherein the end condition includes one or both of a first end condition and a second end condition, the first end condition being that the current iteration count exceeds a set maximum number of iterations, and the second end condition being that the change of the loss function value over several consecutive iterations is smaller than a set target value.
4. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the test error is calculated according to the following formula:

    $$MSE = \frac{1}{N}\sum_{z=1}^{N}\sum_{k}\left(o_k^{z} - y_k^{z}\right)^{2} \qquad (6)$$

    where $MSE$ denotes the test error, $N$ the number of samples in the test sample set, $o_k^{z}$ the actual value of sample z of the test sample set at the k-th node of the output layer, and $y_k^{z}$ the output value of sample z of the test sample set at the k-th node of the output layer.
5. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the test samples in the test sample set differ from the training samples in the training sample set in both signal-to-noise ratio and noise type.
6. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the weight values of the filters in the fully convolutional neural network are found by gradient descent.
7. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the hidden-layer model uses the PReLU activation function.
8. The speech enhancement method based on a fully convolutional neural network according to claim 2, characterized in that the fully convolutional neural network model is trained on the TIMIT corpus, which is divided into a training set and a test set.
9. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the hidden-layer model uses the Adam optimizer to minimize the mean square error between clean speech and enhanced speech.
10. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the quality of the output enhanced speech signal is judged by PESQ and the short-time objective intelligibility score STOI.
11. The speech enhancement method based on a fully convolutional neural network according to any one of claims 1 to 10, characterized in that the fully convolutional neural network model comprises an input layer, six convolutional layers, and an output layer, each convolutional layer having 1024 nodes and a convolution stride of 1.
12. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory including a speech enhancement program which, when executed by the processor, implements the following steps:
    constructing a fully convolutional neural network model, the fully convolutional neural network model comprising an input layer, a hidden layer, and an output layer, the hidden layer consisting of multiple convolutional layers, each convolutional layer having multiple filters, the output model of the output layer being:
    $y_t = F^{T} * R_t$  (1)
    where $y_t$ is the t-th node of the output layer, $F^{T}$ is the transpose of the filter weight matrix, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the t-th node of the hidden layer;
    training the fully convolutional neural network model;
    inputting the original speech signal into the trained fully convolutional neural network model;
    outputting the enhanced speech signal.
13. The electronic device according to claim 12, characterized in that the model of the hidden layers in the fully convolutional neural network model is:

    $$h_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1} x_i + b_j^{1}\right) \qquad (2)$$

    $$h_k^{l} = f\left(\sum_{j=1}^{H} w_{kj}^{l} h_j^{l-1} + b_k^{l}\right) \qquad (3)$$

    where $h_j^{1}$ denotes the output value of the j-th node of the first hidden layer; $x_i$ denotes the variable of the i-th node of the input layer; $w_{ij}^{1}$ denotes the connection weight between the i-th node of the input layer and the j-th node of the first hidden layer; $b_j^{1}$ denotes the offset of the j-th node of the first hidden layer; $n$ denotes the number of nodes of the input layer; $h_k^{l}$ denotes the output value of the k-th node of the l-th hidden layer; $h_j^{l-1}$ denotes the output value of the j-th node of the (l-1)-th hidden layer; $w_{kj}^{l}$ denotes the connection weight between the k-th node of the l-th hidden layer and the j-th node of the (l-1)-th hidden layer; $b_k^{l}$ denotes the offset of the k-th node of the l-th hidden layer; $H$ denotes the number of nodes of a hidden layer; and $f$ is the activation function.
14. The electronic device according to claim 12, characterized in that the processor training the fully convolutional neural network model comprises:
    initially assigning values to the parameters of the fully convolutional neural network model, the parameters including the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
    constructing a sample set and dividing it proportionally into a training sample set and a test sample set;
    inputting one training sample from the training sample set and extracting a feature vector from the training sample;
    substituting the input data of the training sample into formulas (1)-(3) to compute the output values of the hidden-layer nodes and of the output-layer nodes;
    computing the error of each node of the output layer:
    $e_k = o_k - y_k$  (4)
    where $e_k$ denotes the error of the k-th node of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
    updating the parameters of the fully convolutional neural network model by error back-propagation;
    inputting the next training sample and continuing to update the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
    setting the loss function of the fully convolutional neural network model:
    $E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^{2}$  (5)
    where $n$ denotes the number of nodes of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
    judging whether training satisfies an end condition; if the end condition is satisfied, ending the training and outputting the trained fully convolutional neural network model, and otherwise continuing to train the model, wherein the end condition includes one or both of a first end condition and a second end condition, the first end condition being that the current iteration count exceeds a set maximum number of iterations, and the second end condition being that the change of the loss function value over several consecutive iterations is smaller than a set target value.
15. The electronic device according to claim 12, characterized in that the processor finds the weight values of the filters in the fully convolutional neural network by gradient descent.
16. The electronic device according to claim 12, characterized in that the hidden-layer model uses the PReLU activation function.
17. The electronic device according to claim 12, characterized in that the fully convolutional neural network model is trained on the TIMIT corpus, which is divided into a training set and a test set.
18. The electronic device according to claim 12, characterized in that the hidden-layer model uses the Adam optimizer to minimize the mean square error between clean speech and enhanced speech.
19. The electronic device according to claim 12, characterized in that, after the enhanced speech signal is output, the speech enhancement quality is judged by PESQ and the short-time objective intelligibility score STOI.
20. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a speech enhancement program which, when executed by a processor, implements the steps of the speech enhancement method according to any one of claims 1 to 10.
PCT/CN2019/089180 2018-11-14 2019-05-30 Speech enhancement method based on fully convolutional neural network, device, and storage medium WO2020098256A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811350813.8A CN109326299B (en) 2018-11-14 2018-11-14 Speech enhancement method, device and storage medium based on full convolution neural network
CN201811350813.8 2018-11-14

Publications (1)

Publication Number Publication Date
WO2020098256A1 true WO2020098256A1 (en) 2020-05-22

Family

ID=65261439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/089180 WO2020098256A1 (en) 2018-11-14 2019-05-30 Speech enhancement method based on fully convolutional neural network, device, and storage medium

Country Status (2)

Country Link
CN (1) CN109326299B (en)
WO (1) WO2020098256A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326299B (en) * 2018-11-14 2023-04-25 平安科技(深圳)有限公司 Speech enhancement method, device and storage medium based on full convolution neural network
CN110265053B (en) * 2019-06-29 2022-04-19 联想(北京)有限公司 Signal noise reduction control method and device and electronic equipment
CN110348566B (en) * 2019-07-15 2023-01-06 上海点积实业有限公司 Method and system for generating digital signal for neural network training
CN110534123B (en) * 2019-07-22 2022-04-01 中国科学院自动化研究所 Voice enhancement method and device, storage medium and electronic equipment
CN110648681B (en) * 2019-09-26 2024-02-09 腾讯科技(深圳)有限公司 Speech enhancement method, device, electronic equipment and computer readable storage medium
CN116508099A (en) * 2020-10-29 2023-07-28 杜比实验室特许公司 Deep learning-based speech enhancement
CN113345463B (en) * 2021-05-31 2024-03-01 平安科技(深圳)有限公司 Speech enhancement method, device, equipment and medium based on convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN106847302A (en) * 2017-02-17 2017-06-13 大连理工大学 Single channel mixing voice time-domain seperation method based on convolutional neural networks
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108172238A (en) * 2018-01-06 2018-06-15 广州音书科技有限公司 A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system
CN109326299A (en) * 2018-11-14 2019-02-12 平安科技(深圳)有限公司 Sound enhancement method, device and storage medium based on full convolutional neural networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157953B (en) * 2015-04-16 2020-02-07 科大讯飞股份有限公司 Continuous speech recognition method and system
US10090001B2 (en) * 2016-08-01 2018-10-02 Apple Inc. System and method for performing speech enhancement using a neural network-based combined symbol
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN108133702A (en) * 2017-12-20 2018-06-08 重庆邮电大学 A kind of deep neural network speech enhan-cement model based on MEE Optimality Criterias
CN108334843B (en) * 2018-02-02 2022-03-25 成都国铁电气设备有限公司 Arcing identification method based on improved AlexNet

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN106847302A (en) * 2017-02-17 2017-06-13 大连理工大学 Single channel mixing voice time-domain seperation method based on convolutional neural networks
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108172238A (en) * 2018-01-06 2018-06-15 广州音书科技有限公司 A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system
CN109326299A (en) * 2018-11-14 2019-02-12 平安科技(深圳)有限公司 Sound enhancement method, device and storage medium based on full convolutional neural networks

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753977A (en) * 2020-06-30 2020-10-09 中国科学院半导体研究所 Optical neural network convolution layer chip, convolution calculation method and electronic equipment
CN111753977B (en) * 2020-06-30 2024-01-02 中国科学院半导体研究所 Optical neural network convolution layer chip, convolution calculation method and electronic equipment
CN112182709A (en) * 2020-09-28 2021-01-05 中国水利水电科学研究院 Rapid prediction method for let-down water temperature of large-scale reservoir stop log door layered water taking facility
CN112188428A (en) * 2020-09-28 2021-01-05 广西民族大学 Energy efficiency optimization method for Sink node in sensing cloud network
CN112182709B (en) * 2020-09-28 2024-01-16 中国水利水电科学研究院 Method for rapidly predicting water drainage temperature of large reservoir stoplog gate layered water taking facility
CN112188428B (en) * 2020-09-28 2024-01-30 广西民族大学 Energy efficiency optimization method for Sink node in sensor cloud network
CN113314136A (en) * 2021-05-27 2021-08-27 西安电子科技大学 Voice optimization method based on directional noise reduction and dry sound extraction technology
CN113821967A (en) * 2021-06-04 2021-12-21 北京理工大学 Large sample training data generation method based on scattering center model

Also Published As

Publication number Publication date
CN109326299B (en) 2023-04-25
CN109326299A (en) 2019-02-12

Similar Documents

Publication Publication Date Title
WO2020098256A1 (en) Speech enhancement method based on fully convolutional neural network, device, and storage medium
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Zhang et al. Deep learning for environmentally robust speech recognition: An overview of recent developments
WO2020042707A1 (en) Convolutional recurrent neural network-based single-channel real-time noise reduction method
CN110956957B (en) Training method and system of speech enhancement model
Qian et al. Very deep convolutional neural networks for robust speech recognition
CN110853663B (en) Speech enhancement method based on artificial intelligence, server and storage medium
KR101807961B1 (en) Method and apparatus for processing speech signal based on lstm and dnn
JP6987378B2 (en) Neural network learning method and computer program
CN109036460A (en) Method of speech processing and device based on multi-model neural network
Liu et al. Speech enhancement method based on LSTM neural network for speech recognition
CN107068167A (en) Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
KR20200145219A (en) Method and apparatus for combined learning using feature enhancement based on deep neural network and modified loss function for speaker recognition robust to noisy environments
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN112183107A (en) Audio processing method and device
CN111357051A (en) Speech emotion recognition method, intelligent device and computer readable storage medium
CN111798875A (en) VAD implementation method based on three-value quantization compression
WO2020170907A1 (en) Signal processing device, learning device, signal processing method, learning method, and program
CN107545898B (en) Processing method and device for distinguishing speaker voice
CN115884032A (en) Smart call noise reduction method and system of feedback earphone
KR102204975B1 (en) Method and apparatus for speech recognition using deep neural network
CN113823301A (en) Training method and device of voice enhancement model and voice enhancement method and device
Agcaer et al. Optimization of amplitude modulation features for low-resource acoustic scene classification
WO2020015546A1 (en) Far-field speech recognition method, speech recognition model training method, and server

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19885956

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.08.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19885956

Country of ref document: EP

Kind code of ref document: A1