WO2020098256A1 - Speech enhancement method based on fully convolutional neural network, device, and storage medium - Google Patents

Speech enhancement method based on fully convolutional neural network, device, and storage medium Download PDF

Info

Publication number
WO2020098256A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
convolutional neural
layer
output
node
Prior art date
Application number
PCT/CN2019/089180
Other languages
French (fr)
Chinese (zh)
Inventor
赵峰
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2020098256A1 publication Critical patent/WO2020098256A1/en

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the present application relates to the field of speech technology, and in particular to a speech enhancement method, device and storage medium based on a fully convolutional neural network.
  • Speech enhancement refers to the technique of filtering out, by some method, the various noises that disturb clean speech in real-life scenes, so as to improve the quality and intelligibility of the speech.
  • the voices collected by microphones are usually "polluted" voices carrying various noises.
  • the main purpose of speech enhancement is to recover clean speech from these "polluted" noisy voices.
  • Speech enhancement has a wide range of applications, including voice calls, teleconferencing, scene recording, military eavesdropping, hearing aids, and speech recognition devices, and has become a preprocessing module for many speech coding and recognition systems. Taking hearing aids as an example, ordinary hearing aids only perform basic amplification of the voice.
  • the speech enhancement application is used in the front-end processing of speech-related applications to ensure that the speech is separated from the noisy signal so that the back-end recognition model can correctly recognize the content of the speech.
  • Existing speech enhancement methods include unsupervised and supervised methods. An unsupervised speech enhancement method extracts the amplitude spectrum or log spectrum of the speech signal while ignoring the phase information; when the signal is synthesized back to the time domain, the unchanged phase information of the noisy speech signal is applied, which weakens the quality of the enhanced speech signal.
  • a supervised speech enhancement method is a neural-network-based method; using a deep neural network (DNN) or a convolutional neural network (CNN) with fully connected layers for supervised speech enhancement cannot represent the high- and low-frequency components of the signal well, and the fully connected layers cannot preserve the original information and spatial arrangement information of the signal well.
  • DNN Deep Neural Network
  • CNN Convolutional Neural Network
  • the present application provides a speech enhancement method, device, and storage medium based on a fully convolutional neural network, to solve the problem that the neural network models of existing speech enhancement methods cannot preserve the original information and spatial arrangement information of the speech signal well.
  • the present application provides a speech enhancement method based on a fully convolutional neural network, including:
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer.
  • the hidden layer is a plurality of convolutional layers, each of which has multiple filters
  • the output model of the output layer is: y_t = F^T · R_t (1)
  • where y_t is the t-th node of the output layer, F^T is the transpose of the filter's weight matrix, F ∈ R^(f×1), f represents the filter size, and R_t is the t-th node of the hidden layer
  • the model of the hidden layer of the fully convolutional neural network model is constructed according to the following formulas: R_j^1 = f(Σ_{i=1..n} w_ij^1 · x_i + b_j^1) (2) and R_k^l = f(Σ_{j=1..H} w_jk^l · R_j^{l-1} + b_k^l) (3)
  • x_i represents the variable of the i-th node of the input layer, w_ij^1 the connection weight between input node i and node j of the first hidden layer, and b_j^1 the offset of node j of the first hidden layer
  • n represents the number of nodes in the input layer
  • H is the number of nodes in a hidden layer
  • f is the excitation function.
  • another aspect of the present application provides an electronic device including a memory and a processor, where the memory stores a voice enhancement program that, when executed by the processor, implements the following steps:
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer.
  • the hidden layer is a plurality of convolutional layers, each of which has multiple filters
  • the output model of the output layer is: y_t = F^T · R_t (1)
  • where y_t is the t-th node of the output layer, F^T is the transpose of the filter's weight matrix, F ∈ R^(f×1), f represents the filter size, and R_t is the t-th node of the hidden layer
  • the model of the hidden layer in the fully convolutional neural network model is given by formulas (2) and (3) above
  • x_i represents the variable of the i-th node of the input layer
  • n represents the number of nodes in the input layer
  • H is the number of nodes in a hidden layer
  • f is the excitation function.
  • the processor training the fully convolutional neural network model includes:
  • the parameters of the fully convolutional neural network model include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
  • Input a training sample in the training sample set, and extract feature vectors from the training sample
  • e_k represents the error of the k-th node of the output layer, o_k the actual value of the k-th node of the output layer, and y_k the output value of the k-th node of the output layer
  • N represents the number of nodes of the output layer
  • the end condition includes one or both of a first end condition and a second end condition; the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
  • yet another aspect of the present application provides a computer-readable storage medium that includes a speech enhancement program; when the speech enhancement program is executed by a processor, the steps of the speech enhancement method described above are implemented.
  • This application constructs a fully convolutional neural network model as a speech enhancement model and inputs the original speech signal for processing to obtain an enhanced speech signal.
  • the fully connected layers are removed and only convolutional layers are included, which greatly reduces the number of parameters of the neural network, making the fully convolutional neural network model suitable for memory-limited mobile devices; each output sample depends only on adjacent inputs, so the original information and spatial arrangement information of the speech signal can be well preserved with fewer associated weight values.
  • FIG. 1 is a schematic flowchart of a speech enhancement method based on a fully convolutional neural network described in this application;
  • FIG. 2 is a schematic diagram of the structure of a fully convolutional neural network model in this application.
  • FIG. 3 is a schematic diagram of a module of a speech enhancement program in the present application.
  • FIG. 1 is a schematic flowchart of a speech enhancement method based on a fully convolutional neural network described in this application. As shown in FIG. 1, the speech enhancement method based on a fully convolutional neural network described in this application includes the following steps:
  • Step S1: Construct a fully convolutional neural network model.
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer; the hidden layer is a plurality of convolutional layers, each of which has multiple filters, and the output model of the output layer is: y_t = F^T · R_t (1)
  • Step S2: Train the fully convolutional neural network model.
  • Step S3: Input the original speech signal into the trained fully convolutional neural network model.
  • Step S4: Output an enhanced speech signal.
  • the weight matrix F of the filter is shared during the convolution operation; therefore, regardless of whether an output-layer node corresponds to a high-frequency or a low-frequency part, the hidden-layer node R_t and its two adjacent nodes R_{t-1} and R_{t+1} are not forced to be very similar. Whether a hidden-layer node is similar to its neighbors depends on the original input-layer nodes, so the fully convolutional neural network can retain the original input information well.
  • a fully convolutional neural network model is constructed as the speech enhancement model
  • the original speech signal is input and processed to obtain an enhanced speech signal.
  • the fully connected layers are removed and only convolutional layers are included, which greatly reduces the parameters of the neural network, so that the fully convolutional neural network model can run on memory-limited mobile devices such as mobile phones.
  • each output sample depends only on adjacent inputs, so the original information and spatial arrangement information of the speech signal can be well preserved with fewer associated weight values.
  • the fully convolutional neural network model includes an input layer, six convolutional layers (with padding), and an output layer; each convolutional layer has 1024 nodes, the convolution stride is 1, and each convolutional layer has 15 filters of size 11; the model of the hidden layer is constructed according to formulas (2) and (3), where x_i is the variable of the i-th input node, n is the number of input nodes, H is the number of nodes in a hidden layer, and f is the excitation function, for which the PReLU activation function is selected.
  • training the fully convolutional neural network model includes:
  • the parameters of the fully convolutional neural network model include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
  • the training sample set contains five noise types (white noise, pink noise, office noise, supermarket noise, and street noise) at five signal-to-noise ratios; the test sample set contains signal-to-noise ratios and noise types that are the same as or different from those of the training sample set.
  • the signal-to-noise ratios can differ, and the noise types can also differ, to make the test conditions closer to reality. Only five noise types are listed for the training sample set in this application, but the application is not limited to these.
  • Input a training sample from the training sample set and extract log power spectra (LPS) feature vectors from it; for example, in the input training sample, 512 sampling points of the original speech are taken as one frame, and a 257-dimensional LPS vector is extracted from each frame as the feature vector.
  • LPS log power spectra
  • e_k represents the error of the k-th node of the output layer, o_k the actual value of the k-th node of the output layer, and y_k the output value of the k-th node of the output layer
  • N represents the number of nodes of the output layer
  • the end condition includes one or both of a first end condition and a second end condition; the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
  • the test error is calculated according to formula (6): MSE = (1/N) Σ_{z=1..N} Σ_k (o_k^z − y_k^z)², where MSE represents the test error, N represents the number of samples in the test sample set, o_k^z is the actual value of test sample z at the k-th output node, and y_k^z is the output value of test sample z at the k-th output node
  • the output data of the fully convolutional neural network model is normalized before the output-layer node errors and the test error are calculated, which reduces the test error and improves the model accuracy.
  • speech quality is evaluated by the Perceptual Evaluation of Speech Quality (PESQ) measure, and speech intelligibility is evaluated by the Short-Time Objective Intelligibility (STOI) score.
  • PESQ Perceptual Evaluation of Speech Quality
  • STOI Short-Time Objective Intelligibility
  • Speech enhancement is performed through the fully convolutional neural network model of this application; compared with deep neural network and convolutional neural network models that contain fully connected layers, both PESQ and STOI improve: PESQ increases by about 0.5, and STOI by about 0.2-0.3.
  • the model of the hidden layer applies the PReLU activation function.
  • the fully convolutional neural network model is trained using a TIMIT corpus, which is divided into a training set and a test set.
  • the model of the hidden layer uses the Adam optimizer to minimize the mean squared error between the clean speech and the enhanced speech.
  • the quality of the output enhanced speech signal is judged by PESQ and the short-time objective intelligibility score STOI.
  • the speech enhancement method based on the fully convolutional neural network described in this application is applied to an electronic device.
  • the electronic device may be a terminal device such as a television, a smart phone, a tablet computer, and a computer.
  • the electronic device is not limited to the listed examples, and the electronic device may be any other device controlled by the user to process user commands through voice recognition technology, and output voice recognition results by performing voice enhancement processing on the input user's voice.
  • the electronic device includes: a memory and a processor, and the memory includes a speech enhancement program, and when the speech enhancement program is executed by the processor, the following steps are implemented:
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer.
  • the hidden layer is a plurality of convolutional layers, each of which has multiple filters
  • the output model of the output layer is: y_t = F^T · R_t (1)
  • where y_t is the t-th node of the output layer, F^T is the transpose of the filter's weight matrix, F ∈ R^(f×1), f represents the filter size, and R_t is the t-th node of the hidden layer
  • the memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, an optical disk, or a plug-in hard disk, and is not limited thereto; it may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides them to the processor for execution.
  • the electronic device further includes a voice receiver, which receives the user's voice signal through a device such as the microphone of the electronic device; voice enhancement processing is then performed on the input voice signal.
  • the processor may be a central processing unit, a microprocessor, or another data processing chip, and can run the program stored in the memory.
  • the processor searches for the weight value of each filter in the fully convolutional neural network by the gradient descent method.
  • the model of the hidden layer in the fully convolutional neural network model is given by formulas (2) and (3) above
  • x_i represents the variable of the i-th node of the input layer
  • n represents the number of nodes in the input layer
  • H is the number of nodes in a hidden layer
  • f is the excitation function, which may be chosen from the PReLU, Sigmoid, tanh, ReLU, and similar activation functions
  • the step of the processor training the fully convolutional neural network model includes:
  • the parameters of the fully convolutional neural network model include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
  • the training sample set contains five noise types (white noise, pink noise, office noise, supermarket noise, and street noise) at five signal-to-noise ratios; the test sample set contains signal-to-noise ratios and noise types that are the same as or different from those of the training sample set, so that the test conditions are closer to reality. Only five noise types are listed for the training sample set in this application, but the application is not limited to these;
  • Input a training sample in the training sample set, and extract feature vectors from the training sample
  • e_k represents the error of the k-th node of the output layer, o_k the actual value of the k-th node of the output layer, and y_k the output value of the k-th node of the output layer
  • N represents the number of nodes of the output layer
  • the end condition includes one or both of a first end condition and a second end condition; the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
  • the test error is calculated according to formula (6): MSE = (1/N) Σ_{z=1..N} Σ_k (o_k^z − y_k^z)², where MSE represents the test error, N represents the number of samples in the test sample set, o_k^z is the actual value of test sample z at the k-th output node, and y_k^z is the output value of test sample z at the k-th output node
  • the speech enhancement program may also be divided into one or more modules, and the one or more modules are stored in the memory and executed by the processor to complete the application.
  • the module referred to in this application refers to a series of computer program instruction segments capable of performing specific functions.
  • the speech enhancement program can be divided into: model building module 1, model training module 2, input module 3 and output module 4.
  • the functions and operation steps implemented by the above modules are similar to those described above and will not be described in detail here; for example:
  • the model building module 1 constructs a fully convolutional neural network model.
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer; the hidden layer is a plurality of convolutional layers, each of which has multiple filters, and the output model of the output layer is: y_t = F^T · R_t (1)
  • where t is the index of the node, y_t is the t-th node of the output layer, F is the filter, f represents the filter size, and R_t is the t-th node of the hidden layer
  • the model training module 2 trains the fully convolutional neural network model
  • the input module 3 inputs the original voice signal into the trained fully convolutional neural network model
  • the output module 4 outputs the enhanced voice signal.
  • the model of the hidden layer applies the PReLU activation function.
  • the fully convolutional neural network model is trained using a TIMIT corpus, which is divided into a training set and a test set.
  • the model of the hidden layer uses the Adam optimizer to minimize the mean squared error between the clean speech and the enhanced speech.
  • the quality of the output enhanced speech signal is judged by PESQ and the short-time objective intelligibility score STOI.
  • the computer-readable storage medium may be any tangible medium that contains or stores programs or instructions; the programs in it can be executed, with the corresponding functions implemented by hardware associated with the stored program instructions.
  • the computer-readable storage medium may be a computer disk, hard disk, random access memory, read-only memory, or the like.
  • the present application is not limited to this, and may be any device that stores instructions or software and any related data files or data structures in a non-transitory manner and can be provided to the processor to cause the processor to execute the programs or instructions therein.
  • the computer-readable storage medium includes a speech enhancement program. When the speech enhancement program is executed by a processor, the following speech enhancement method is implemented:
  • the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer.
  • the hidden layer is multiple convolutional layers, each of which has multiple filters
  • the output model of the output layer is: y_t = F^T · R_t (1)
  • where y_t is the t-th node of the output layer, F^T is the transpose of the filter's weight matrix, F ∈ R^(f×1), f represents the filter size, and R_t is the t-th node of the hidden layer
  • the model of the hidden layer of the fully convolutional neural network model is constructed according to formulas (2) and (3) above
  • x_i represents the variable of the i-th node of the input layer
  • n represents the number of nodes in the input layer
  • H is the number of nodes in a hidden layer
  • f is the excitation function.
  • training the fully convolutional neural network model includes:
  • the parameters of the fully convolutional neural network model include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
  • Input a training sample in the training sample set, and extract feature vectors from the training sample
  • e_k represents the error of the k-th node of the output layer, o_k the actual value of the k-th node of the output layer, and y_k the output value of the k-th node of the output layer
  • N represents the number of nodes of the output layer
  • the end condition includes one or both of a first end condition and a second end condition; the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
  • the test error is calculated according to formula (6): MSE = (1/N) Σ_{z=1..N} Σ_k (o_k^z − y_k^z)², where MSE represents the test error, N represents the number of samples in the test sample set, o_k^z is the actual value of test sample z at the k-th output node, and y_k^z is the output value of test sample z at the k-th output node
  • test samples in the test sample set and the training samples in the training sample set have different signal-to-noise ratios and types of noise.
  • the fully convolutional neural network model includes an input layer, six convolutional layers, and an output layer; each convolutional layer has 1024 nodes, and the convolution stride is 1.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as the ROM/RAM, magnetic disk, or optical disk described above) and includes several instructions that enable a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to perform the method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The present application relates to the field of artificial intelligence. Disclosed is a speech enhancement method based on a fully convolutional neural network. The method comprises: constructing a fully convolutional neural network model, the model comprising an input layer, a hidden layer, and an output layer, the hidden layer comprising a plurality of convolutional layers, and each convolutional layer comprising a plurality of filters; training the fully convolutional neural network model; inputting an original speech signal into the trained model; and outputting an enhanced speech signal. In the fully convolutional neural network model of the present application, the fully connected layer is removed and only convolutional layers remain, so that the parameters of the neural network are significantly reduced and the model is suitable for memory-limited mobile devices; each output sample depends only on adjacent inputs, so the original information and spatial arrangement information of the speech signal are well preserved with fewer weight values. Also disclosed are an electronic device and a computer-readable storage medium.

Description

Speech enhancement method, device and storage medium based on fully convolutional neural network

Technical Field

The present application relates to the field of speech technology, and in particular to a speech enhancement method, device and storage medium based on a fully convolutional neural network.

Background

Speech enhancement refers to the technique of filtering out, by some method, the various noises that disturb clean speech in real-life scenes, so as to improve the quality and intelligibility of the speech. In daily life, the voices collected by microphones are usually "polluted" voices carrying various noises, and the main purpose of speech enhancement is to recover clean speech from these "polluted" noisy voices. Speech enhancement has a wide range of applications, including voice calls, teleconferencing, scene recording, military eavesdropping, hearing aids, and speech recognition devices, and has become a preprocessing module for many speech coding and recognition systems. Taking hearing aids as an example, ordinary hearing aids only perform basic amplification of the voice; more sophisticated ones apply sound-pressure-level compression to compensate for the patient's hearing range. However, if the auditory scene is complex, the speech the patient hears contains not only the amplified speech but also a great deal of noise, which over time will inevitably cause secondary damage to the patient's auditory system. In high-end digital hearing aids, therefore, speech enhancement has become an important aspect that cannot be ignored.

Speech enhancement is applied in the front-end processing of speech-related applications to ensure that speech is separated from the noisy signal so that the back-end recognition model can correctly recognize its content. Existing speech enhancement methods include unsupervised and supervised methods. An unsupervised speech enhancement method extracts the amplitude spectrum or log spectrum of the speech signal while ignoring the phase information; when the signal is synthesized back to the time domain, the unchanged phase information of the noisy speech signal is applied, which weakens the quality of the enhanced speech signal. A supervised speech enhancement method is a neural-network-based method; using a deep neural network (DNN) or a convolutional neural network (CNN) with fully connected layers for supervised speech enhancement cannot represent the high- and low-frequency components of the signal well, and the fully connected layers cannot preserve the original information and spatial arrangement information of the signal well.
Summary of the Invention

In view of the above problems, the present application provides a speech enhancement method, device, and storage medium based on a fully convolutional neural network, to solve the problem that the neural network models of existing speech enhancement methods cannot preserve the original information and spatial arrangement information of the speech signal well.

To achieve the above objective, the present application provides a speech enhancement method based on a fully convolutional neural network, including:

constructing a fully convolutional neural network model, where the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer; the hidden layer is a plurality of convolutional layers, each of which has multiple filters; and the output model of the output layer is:

$$y_t = F^T R_t \tag{1}$$

where $y_t$ is the $t$-th node of the output layer, $F^T$ is the transpose of the filter's weight matrix, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the $t$-th node of the hidden layer;

training the fully convolutional neural network model;

inputting the original speech signal into the trained fully convolutional neural network model; and

outputting an enhanced speech signal.
Preferably, the model of the hidden layer of the fully convolutional neural network model is constructed according to the following formulas:

$$R_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1}\, x_i + b_j^{1}\right) \tag{2}$$

$$R_k^{l} = f\left(\sum_{j=1}^{H} w_{jk}^{l}\, R_j^{l-1} + b_k^{l}\right) \tag{3}$$

where $R_j^{1}$ denotes the output value of the $j$-th node of the first hidden layer, $x_i$ denotes the variable of the $i$-th node of the input layer, $w_{ij}^{1}$ denotes the connection weight between the $i$-th node of the input layer and the $j$-th node of the first hidden layer, $b_j^{1}$ denotes the offset of the $j$-th node of the first hidden layer, $n$ denotes the number of nodes in the input layer, $R_k^{l}$ denotes the output value of the $k$-th node of the $l$-th hidden layer, $R_j^{l-1}$ denotes the output value of the $j$-th node of the $(l-1)$-th hidden layer, $w_{jk}^{l}$ denotes the connection weight between the $k$-th node of the $l$-th hidden layer and the $j$-th node of the $(l-1)$-th hidden layer, $b_k^{l}$ denotes the offset of the $k$-th node of the $l$-th hidden layer, $H$ is the number of nodes in a hidden layer, and $f$ is the excitation function.
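To make the layer equations concrete, the following is a minimal numpy sketch of a forward pass through formulas (1)-(3). It is illustrative only and not part of the patent: it reads R_t in formula (1) as the length-f window of hidden-layer activations feeding output node t (an assumption, since the shared filter F of size f slides over the hidden layer during convolution), and all layer sizes are hypothetical.

    import numpy as np

    def prelu(z, alpha=0.25):
        # PReLU excitation: identity for positive inputs, alpha-scaled otherwise.
        return np.where(z > 0, z, alpha * z)

    def hidden_forward(x, weights, biases, f=prelu):
        # Formula (2) for the first hidden layer, formula (3) for later ones:
        # each node applies f to a weighted sum of the previous layer plus an offset.
        h = x
        for W, b in zip(weights, biases):      # W: (nodes_out, nodes_in)
            h = f(W @ h + b)
        return h

    def output_forward(h, F):
        # Formula (1): y_t = F^T R_t, sliding the shared size-f filter F
        # over windows of the last hidden layer (assumed interpretation).
        f_size = len(F)
        return np.array([F @ h[t:t + f_size] for t in range(len(h) - f_size + 1)])

    rng = np.random.default_rng(0)
    x = rng.standard_normal(32)                               # hypothetical input nodes
    Ws = [0.1 * rng.standard_normal((32, 32)) for _ in range(2)]
    bs = [np.zeros(32), np.zeros(32)]
    y = output_forward(hidden_forward(x, Ws, bs), rng.standard_normal(11))
    print(y.shape)                                            # (22,)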
To achieve the above objective, another aspect of the present application provides an electronic device including a memory and a processor, where the memory stores a voice enhancement program that, when executed by the processor, implements the following steps:

constructing a fully convolutional neural network model, where the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer; the hidden layer is a plurality of convolutional layers, each of which has multiple filters; and the output model of the output layer is:

$$y_t = F^T R_t \tag{1}$$

where $y_t$ is the $t$-th node of the output layer, $F^T$ is the transpose of the filter's weight matrix, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the $t$-th node of the hidden layer;

training the fully convolutional neural network model;

inputting the original speech signal into the trained fully convolutional neural network model; and

outputting an enhanced speech signal.
Preferably, the model of the hidden layer in the fully convolutional neural network model is:

$$R_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1}\, x_i + b_j^{1}\right) \tag{2}$$

$$R_k^{l} = f\left(\sum_{j=1}^{H} w_{jk}^{l}\, R_j^{l-1} + b_k^{l}\right) \tag{3}$$

where $R_j^{1}$ denotes the output value of the $j$-th node of the first hidden layer, $x_i$ denotes the variable of the $i$-th node of the input layer, $w_{ij}^{1}$ denotes the connection weight between the $i$-th node of the input layer and the $j$-th node of the first hidden layer, $b_j^{1}$ denotes the offset of the $j$-th node of the first hidden layer, $n$ denotes the number of nodes in the input layer, $R_k^{l}$ denotes the output value of the $k$-th node of the $l$-th hidden layer, $R_j^{l-1}$ denotes the output value of the $j$-th node of the $(l-1)$-th hidden layer, $w_{jk}^{l}$ denotes the connection weight between the $k$-th node of the $l$-th hidden layer and the $j$-th node of the $(l-1)$-th hidden layer, $b_k^{l}$ denotes the offset of the $k$-th node of the $l$-th hidden layer, $H$ is the number of nodes in a hidden layer, and $f$ is the excitation function.
Preferably, the processor training the fully convolutional neural network model includes:

initially assigning values to the parameters of the fully convolutional neural network model, where the parameters include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;

constructing a sample set and dividing the sample set proportionally into a training sample set and a test sample set;

inputting a training sample from the training sample set and extracting feature vectors from the training sample;

substituting the input data of the training sample into formulas (1)-(3) to calculate the output value of each node of the hidden layers and of the output layer;

calculating the error of each node of the output layer:

$$e_k = o_k - y_k \tag{4}$$

where $e_k$ denotes the error of the $k$-th node of the output layer, $o_k$ denotes the actual value of the $k$-th node of the output layer, and $y_k$ denotes the output value of the $k$-th node of the output layer;

updating the parameters of the fully convolutional neural network model based on error back-propagation;

inputting the next training sample and continuing to update the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
setting the loss function of the fully convolutional neural network model:

$$E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^2 \tag{5}$$

where $n$ denotes the number of nodes of the output layer, $o_k$ denotes the actual value of the $k$-th node of the output layer, and $y_k$ denotes the output value of the $k$-th node of the output layer; and

judging whether the training satisfies an end condition: if the end condition is satisfied, the training ends and the trained fully convolutional neural network model is output; otherwise, training of the model continues. The end condition includes one or both of a first end condition and a second end condition, where the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
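As an illustration of this training procedure, the sketch below assumes a PyTorch model and data loader (names and hyperparameters are hypothetical, not from the patent) and wires together the mean-squared-error loss of formula (5), error back-propagation, and the two end conditions. Adam is used as the optimizer, matching the preference stated later in this description.

    import torch

    def train(model, loader, max_iters=100, target_delta=1e-4, patience=3):
        loss_fn = torch.nn.MSELoss()                 # loss of formula (5)
        opt = torch.optim.Adam(model.parameters())
        history = []
        for _ in range(max_iters):                   # first end condition: max iterations
            total = 0.0
            for noisy, clean in loader:              # one full pass = one iteration
                opt.zero_grad()
                loss = loss_fn(model(noisy), clean)  # node errors e_k = o_k - y_k
                loss.backward()                      # error back-propagation
                opt.step()                           # update weights and offsets
                total += loss.item()
            history.append(total / len(loader))
            # second end condition: loss change stays below the target value
            # over several consecutive iterations.
            if len(history) > patience and all(
                    abs(history[-i] - history[-i - 1]) < target_delta
                    for i in range(1, patience + 1)):
                break
        return model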
To achieve the above objective, yet another aspect of the present application provides a computer-readable storage medium that includes a speech enhancement program; when the speech enhancement program is executed by a processor, the steps of the speech enhancement method described above are implemented.

Compared with the prior art, the present application has the following advantages and beneficial effects:

The present application constructs a fully convolutional neural network model as a speech enhancement model and inputs the original speech signal for processing to obtain an enhanced speech signal. In the fully convolutional neural network model, the fully connected layers are removed and only convolutional layers remain, which greatly reduces the number of parameters of the neural network and makes the model suitable for memory-limited mobile devices; moreover, each output sample depends only on adjacent inputs, so the original information and spatial arrangement information of the speech signal can be well preserved with fewer associated weight values.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of the speech enhancement method based on a fully convolutional neural network described in this application;

FIG. 2 is a schematic diagram of the structure of the fully convolutional neural network model in this application;

FIG. 3 is a schematic diagram of the modules of the speech enhancement program in this application.

The implementation, functional characteristics, and advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description

The embodiments of the present application are described below with reference to the drawings. Those of ordinary skill in the art will recognize that the described embodiments can be modified in various ways, or combined, without departing from the spirit and scope of the present application. Therefore, the drawings and descriptions are illustrative in nature, serve only to explain the present application, and are not intended to limit the protection scope of the claims. In addition, in this specification the drawings are not drawn to scale, and the same reference numerals denote the same parts.
FIG. 1 is a schematic flowchart of the speech enhancement method based on a fully convolutional neural network described in this application. As shown in FIG. 1, the method includes the following steps:

Step S1: Construct a fully convolutional neural network model. As shown in FIG. 2, the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer; the hidden layer is a plurality of convolutional layers, each of which has multiple filters; and the output model of the output layer is:

$$y_t = F^T R_t \tag{1}$$

where $y_t$ is the $t$-th node of the output layer, $F^T$ is the transpose of the weight matrix of the filter, $F \in \mathbb{R}^{f \times 1}$ ($f$ denotes the filter size), and $R_t$ is the $t$-th node of the hidden layer.

Step S2: Train the fully convolutional neural network model.

Step S3: Input the original speech signal into the trained fully convolutional neural network model.

Step S4: Output an enhanced speech signal.
In this application, the weight matrix F of the filter is shared during the convolution operation. Therefore, regardless of whether an output-layer node corresponds to a high-frequency or a low-frequency part, the hidden-layer node $R_t$ and its two adjacent nodes $R_{t-1}$ and $R_{t+1}$ are not forced to be very similar; whether a hidden-layer node resembles its neighbors depends on the original input-layer nodes. As a result, the fully convolutional neural network can retain the original input information well.
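The locality property behind this claim is easy to verify empirically. The sketch below (an illustration, not from the patent) stacks six padded 1-D convolutions, perturbs a single input sample, and checks that only the outputs within the receptive field change, i.e., each output depends only on adjacent inputs.

    import torch

    net = torch.nn.Sequential(*[
        torch.nn.Conv1d(1, 1, kernel_size=11, padding=5)   # one shared filter per layer
        for _ in range(6)
    ])

    x = torch.zeros(1, 1, 256)
    with torch.no_grad():
        base = net(x)
        x[0, 0, 128] = 1.0                     # perturb one input sample
        moved = (net(x) - base).abs().squeeze() > 1e-9
    # 6 layers x radius 5 = receptive radius 30: only outputs 98..158 change.
    print(moved.nonzero().min().item(), moved.nonzero().max().item())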
In this application, a fully convolutional neural network model is constructed as the speech enhancement model, and the original speech signal is input and processed to obtain an enhanced speech signal. In the fully convolutional neural network model, the fully connected layers are removed and only convolutional layers remain, which greatly reduces the number of parameters of the neural network, so that the model can run on memory-limited mobile devices such as mobile phones. Moreover, each output sample depends only on adjacent inputs, so the original information and spatial arrangement information of the speech signal can be well preserved with fewer associated weight values.
In an optional embodiment of the present application, the fully convolutional neural network model includes an input layer, six convolutional layers (with padding), and an output layer. Each convolutional layer has 1024 nodes, the convolution stride is 1, and each convolutional layer has 15 filters of size 11. The model of the hidden layer of the fully convolutional neural network model is constructed according to the following formulas:
$$R_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1}\, x_i + b_j^{1}\right) \tag{2}$$

$$R_k^{l} = f\left(\sum_{j=1}^{H} w_{jk}^{l}\, R_j^{l-1} + b_k^{l}\right) \tag{3}$$

where $R_j^{1}$ denotes the output value of the $j$-th node of the first hidden layer, $x_i$ denotes the variable of the $i$-th node of the input layer, $w_{ij}^{1}$ denotes the connection weight between the $i$-th node of the input layer and the $j$-th node of the first hidden layer, $b_j^{1}$ denotes the offset of the $j$-th node of the first hidden layer, $n$ denotes the number of nodes in the input layer, $R_k^{l}$ denotes the output value of the $k$-th node of the $l$-th hidden layer, $R_j^{l-1}$ denotes the output value of the $j$-th node of the $(l-1)$-th hidden layer, $w_{jk}^{l}$ denotes the connection weight between the $k$-th node of the $l$-th hidden layer and the $j$-th node of the $(l-1)$-th hidden layer, $b_k^{l}$ denotes the offset of the $k$-th node of the $l$-th hidden layer, $H$ is the number of nodes in a hidden layer, and $f$ is the excitation function, for which the PReLU activation function is selected.
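A hedged PyTorch sketch of this embodiment follows. The patent does not specify how the 15 filters of one layer connect to the next, so the channel wiring (15-channel feature maps between layers, a single-channel output) is an assumption made for illustration.

    import torch.nn as nn

    class FCNEnhancer(nn.Module):
        # Fully convolutional: no fully connected layers at all.
        def __init__(self, n_layers=6, n_filters=15, kernel=11):
            super().__init__()
            layers, ch_in = [], 1                       # single-channel signal input
            for _ in range(n_layers):
                # stride 1 with padding keeps the signal length unchanged
                layers += [nn.Conv1d(ch_in, n_filters, kernel,
                                     stride=1, padding=kernel // 2),
                           nn.PReLU()]                  # PReLU excitation function
                ch_in = n_filters
            # output layer of formula (1): one shared size-11 filter produces y_t
            layers.append(nn.Conv1d(ch_in, 1, kernel, stride=1,
                                    padding=kernel // 2))
            self.net = nn.Sequential(*layers)

        def forward(self, x):                           # x: (batch, 1, time)
            return self.net(x)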
In an optional embodiment of the present application, training the fully convolutional neural network model includes:

initially assigning values to the parameters of the fully convolutional neural network model, where the parameters include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
constructing a sample set and dividing it proportionally into a training sample set and a test sample set. The samples may be randomly selected from the TIMIT corpus, with a 6:1 ratio between the numbers of training and test samples; for example, 700 phrases are randomly selected from the TIMIT corpus, of which 600 constitute the training sample set and the remaining 100 constitute the test sample set. The training sample set contains five noise types (white noise, pink noise, office noise, supermarket noise, and street noise) at five signal-to-noise ratios; the test sample set contains signal-to-noise ratios and noise types that are the same as or different from those of the training sample set, so that the test conditions are closer to reality. Only five noise types are listed for the training sample set in this application, but the application is not limited to these.
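The patent does not state how the noisy samples are generated; a common recipe (an assumption added here for illustration) scales the noise so that the mixture reaches a target signal-to-noise ratio:

    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        # Scale noise so that 10*log10(P_clean / P_noise) equals snr_db, then add it.
        noise = noise[:len(clean)]
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
        return clean + scale * noise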
inputting a training sample from the training sample set and extracting log power spectra (LPS) feature vectors from it; for example, in the input training sample, 512 sampling points of the original speech are taken as one frame, and a 257-dimensional LPS vector is extracted from each frame as the feature vector.
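A minimal numpy sketch of this feature extraction follows; the hop size and window choice are assumptions not stated in the patent, while the 257-dimensional vector falls out of the 512-point frame (a 512-point rFFT yields 257 bins).

    import numpy as np

    def lps_features(signal, frame_len=512, hop=256):
        frames = np.stack([signal[i:i + frame_len]
                           for i in range(0, len(signal) - frame_len + 1, hop)])
        window = np.hamming(frame_len)               # windowing is an assumption
        spectra = np.fft.rfft(frames * window, axis=1)
        # log power spectra: shape (num_frames, 257) for frame_len = 512
        return np.log(np.abs(spectra) ** 2 + 1e-12)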
substituting the input data of the training sample into formulas (1)-(3) to calculate the output value of each node of the hidden layers and of the output layer;

calculating the error of each node of the output layer:

$$e_k = o_k - y_k \tag{4}$$

where $e_k$ denotes the error of the $k$-th node of the output layer, $o_k$ denotes the actual value of the $k$-th node of the output layer, and $y_k$ denotes the output value of the $k$-th node of the output layer;

updating the parameters of the fully convolutional neural network model based on error back-propagation;

inputting the next training sample and continuing to update the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
setting the loss function of the fully convolutional neural network model:

$$E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^2 \tag{5}$$

where $n$ denotes the number of nodes of the output layer, $o_k$ denotes the actual value of the $k$-th node of the output layer, and $y_k$ denotes the output value of the $k$-th node of the output layer; and

judging whether the training satisfies an end condition: if the end condition is satisfied, the training ends and the trained fully convolutional neural network model is output; otherwise, training of the model continues. The end condition includes one or both of a first end condition and a second end condition, where the first end condition is that the current iteration count exceeds the set maximum number of iterations, and the second end condition is that the change in the loss function value over multiple consecutive iterations is smaller than the set target value.
Preferably, the test error is calculated according to the following formula:

$$MSE = \frac{1}{N}\sum_{z=1}^{N}\sum_{k}\left(o_k^{z} - y_k^{z}\right)^2 \tag{6}$$

where $MSE$ denotes the test error, $N$ denotes the number of samples in the test sample set, $o_k^{z}$ denotes the actual value of test sample $z$ at the $k$-th node of the output layer, and $y_k^{z}$ denotes the output value of test sample $z$ at the $k$-th node of the output layer. The smaller the test error, the higher the accuracy of the constructed fully convolutional neural network model.
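In code, formula (6) is a one-liner; the sketch below assumes the model outputs and targets for the test set are stacked into arrays of shape (N, number of output nodes).

    import numpy as np

    def test_error(outputs, targets):
        # Formula (6): squared node errors summed per sample, averaged over N samples.
        outputs, targets = np.asarray(outputs), np.asarray(targets)
        return np.sum((targets - outputs) ** 2) / len(outputs)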
In this application, the output data of the fully convolutional neural network model is normalized before the output-layer node errors and the test error are calculated, which reduces the test error and improves the model accuracy.

Preferably, speech quality is evaluated by the Perceptual Evaluation of Speech Quality (PESQ) measure, and speech intelligibility is evaluated by the Short-Time Objective Intelligibility (STOI) score.

When speech enhancement is performed with the fully convolutional neural network model of this application, both PESQ and STOI improve relative to deep neural network and convolutional neural network models that contain fully connected layers: PESQ increases by about 0.5, and STOI by about 0.2-0.3.
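Both metrics have open-source implementations; the sketch below assumes the third-party pesq and pystoi Python packages (pip install pesq pystoi), which are not mentioned in the patent.

    from pesq import pesq      # Perceptual Evaluation of Speech Quality
    from pystoi import stoi    # Short-Time Objective Intelligibility

    def evaluate(clean, enhanced, fs=16000):
        return {
            "PESQ": pesq(fs, clean, enhanced, "wb"),  # wideband mode, roughly [-0.5, 4.5]
            "STOI": stoi(clean, enhanced, fs),        # score in [0, 1]
        }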
Preferably, the model of the hidden layer applies the PReLU activation function.

Preferably, the fully convolutional neural network model is trained on the TIMIT corpus, which is divided into a training set and a test set.

Preferably, the model of the hidden layer uses the Adam optimizer to minimize the mean squared error between the clean speech and the enhanced speech.

Preferably, the quality of the output enhanced speech signal is judged by PESQ and the short-time objective intelligibility score STOI.
The speech enhancement method based on a fully convolutional neural network described in this application is applied to an electronic device. The electronic device may be a terminal device such as a television, a smartphone, a tablet computer, or a computer. However, the electronic device is not limited to these examples; it may be any other user-controlled device that processes user commands through speech recognition technology, performing speech enhancement on the input user speech and outputting the speech recognition result.

The electronic device includes a memory and a processor. The memory stores a speech enhancement program, and when the speech enhancement program is executed by the processor, the following steps are implemented:
构建全卷积神经网络模型,所述全卷积神经网络模型包括输入层、隐含层和输出层,所述隐含层为多个卷积层,每个卷积层均具有多个滤波器,所述输出层的输出模型为:Construct a fully convolutional neural network model. The fully convolutional neural network model includes an input layer, a hidden layer, and an output layer. The hidden layer is a plurality of convolutional layers, each of which has multiple filters , The output model of the output layer is:
y t=F T*R t  (1) y t = F T * R t (1)
其中,y t是输出层的第t个节点,F T是滤波器的权重矩阵的转置,F∈R f ×1,f表示滤波器尺寸,R t是隐含层的第t个节点; Where y t is the t-th node of the output layer, F T is the transpose of the filter's weight matrix, F ∈ R f × 1 , f represents the filter size, and R t is the t-th node of the hidden layer;
Train the fully convolutional neural network model;
Input the original speech signal into the trained fully convolutional neural network model;
Output the enhanced speech signal. (A code sketch of the output-layer computation in formula (1) follows these steps.)
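As an illustration of formula (1), a minimal NumPy sketch of the output-layer computation, under the assumption that the f hidden-layer values feeding each output node t are collected in a row of a matrix R; this framing of the data is an assumption, not something the application fixes:

```python
import numpy as np

def output_layer(R, F):
    """Formula (1): y_t = F^T * R_t for every output node t.

    R : (T, f) array whose row t holds the f hidden-layer values
        feeding output node t (the filter F is in R^{f x 1}).
    F : (f,) filter weight vector.
    """
    return R @ F  # y[t] = F^T R_t
```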
The memory includes at least one type of readable storage medium, which may be a non-volatile storage medium such as a flash memory, a hard disk, or an optical disc, or a plug-in hard disk, among others; it is not limited to these and may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides them to the processor so that the processor can execute them.
The electronic device further includes a speech receiver, which receives the user's speech signal through a device such as the microphone of the electronic device, after which speech enhancement processing is performed on the input speech signal.
The processor may be a central processing unit, a microprocessor, or another data processing chip, and it runs the programs stored in the memory.
The processor finds the weight values of the filters in the fully convolutional neural network by gradient descent, as sketched below.
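A minimal sketch of one plain gradient-descent update of the filter weights, assuming a squared-error objective between the actual values o and the outputs of formula (1); the learning rate and the objective are illustrative assumptions:

```python
import numpy as np

def gradient_descent_step(F, R, o, lr=1e-3):
    """One plain gradient-descent update of the filter weights F."""
    y = R @ F                      # forward pass, y[t] = F^T R_t
    grad = R.T @ (y - o) / len(o)  # gradient of (1/2) * mean squared error
    return F - lr * grad           # step against the gradient
```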
In an optional embodiment of the present application, the model of the hidden layers in the fully convolutional neural network model is:
$$h_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1} x_i + b_j^{1}\right) \qquad (2)$$

$$h_k^{l} = f\left(\sum_{j=1}^{H} w_{kj}^{l} h_j^{l-1} + b_k^{l}\right) \qquad (3)$$

where $h_j^{1}$ denotes the output value of the j-th node of the first hidden layer; $x_i$ denotes the variable of the i-th node of the input layer; $w_{ij}^{1}$ denotes the connection weight between the i-th node of the input layer and the j-th node of the first hidden layer; $b_j^{1}$ denotes the offset of the j-th node of the first hidden layer; $n$ denotes the number of nodes of the input layer; $h_k^{l}$ denotes the output value of the k-th node of the l-th hidden layer; $h_j^{l-1}$ denotes the output value of the j-th node of the (l-1)-th hidden layer; $w_{kj}^{l}$ denotes the connection weight between the k-th node of the l-th hidden layer and the j-th node of the (l-1)-th hidden layer; $b_k^{l}$ denotes the offset of the k-th node of the l-th hidden layer; $H$ denotes the number of nodes of a hidden layer; and $f$ is the activation function, which may be the PReLU, Sigmoid, tanh, ReLU, or another activation function.
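A minimal NumPy sketch of the forward pass in formulas (2) and (3), written in the same layerwise notation as the formulas; the PReLU slope of 0.25 is an assumed initial value, not a number fixed by the application:

```python
import numpy as np

def prelu(x, a=0.25):
    """PReLU activation; the slope a is learnable in practice."""
    return np.where(x > 0, x, a * x)

def hidden_forward(x, weights, biases, f=prelu):
    """Formulas (2)-(3): propagate input x through the hidden layers.

    weights[l] : (H_l, H_{l-1}) matrix of connection weights w^l
    biases[l]  : (H_l,) vector of offsets b^l
    """
    h = x
    for W, b in zip(weights, biases):
        h = f(W @ h + b)  # h^l_k = f(sum_j w^l_kj h^{l-1}_j + b^l_k)
    return h
```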
In one embodiment of the present application, the processor trains the fully convolutional neural network model through the following steps:
Initially assign values to the parameters of the fully convolutional neural network model, where the parameters include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
Construct a sample set and divide it proportionally into a training sample set and a test sample set. The samples may be selected at random from the TIMIT corpus, with a 6:1 ratio between the numbers of training and test samples; for example, 700 phrases are randomly selected from the TIMIT corpus, of which 600 form the training sample set and the remaining 100 form the test sample set. The training sample set contains five noise types (white noise, pink noise, office noise, supermarket noise, and street noise) at five signal-to-noise ratios, and the test sample set contains signal-to-noise ratios and noise types that are the same as or different from those in the training sample set, so that the test conditions better approximate real use. Only five noise types are listed for the training sample set here, but the application is not limited to these;
Input one training sample from the training sample set and extract a feature vector from the training sample;
Substitute the input data of the training sample into formulas (1)-(3) to compute the output values of the hidden-layer nodes and of the output-layer nodes;
Compute the error of each node of the output layer:
$e_k = o_k - y_k$  (4)
where $e_k$ denotes the error of the k-th node of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
Update the parameters of the fully convolutional neural network model by error back-propagation;
Input the next training sample and continue updating the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
Set the loss function of the fully convolutional neural network model:
$E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^{2}$  (5)
where $n$ denotes the number of nodes of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
Judge whether training satisfies an end condition. If the end condition is satisfied, end the training and output the trained fully convolutional neural network model; otherwise, continue training the model. The end condition includes one or both of a first end condition and a second end condition: the first end condition is that the current iteration count exceeds a set maximum number of iterations, and the second end condition is that the change of the loss function value over several consecutive iterations is smaller than a set target value. (A code sketch of this training loop follows.)
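A compact PyTorch sketch of this training procedure and its two end conditions. The application fixes only the use of back-propagation, the Adam optimizer, an MSE-style loss, and the two stopping rules; the layer sizes, feature dimension, learning rate, thresholds, and the dummy stand-in data below are all illustrative assumptions:

```python
import torch
import torch.nn as nn

MAX_ITERS, TARGET_DELTA = 100, 1e-5    # assumed end-condition settings

model = nn.Sequential(                 # dense stand-in for formulas (1)-(3)
    nn.Linear(257, 1024), nn.PReLU(),
    nn.Linear(1024, 1024), nn.PReLU(),
    nn.Linear(1024, 257),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()                 # plays the role of formula (5)

# Dummy noisy/clean feature pairs; in practice a DataLoader over the
# TIMIT-based sample set (the feature representation is an assumption).
train_loader = [(torch.randn(32, 257), torch.randn(32, 257)) for _ in range(10)]

prev_loss = float('inf')
for it in range(MAX_ITERS):            # first end condition
    epoch_loss = 0.0
    for noisy, clean in train_loader:  # one pass = one iteration
        optimizer.zero_grad()
        loss = loss_fn(model(noisy), clean)
        loss.backward()                # error back-propagation
        optimizer.step()               # parameter update
        epoch_loss += loss.item()
    if abs(prev_loss - epoch_loss) < TARGET_DELTA:
        break                          # second end condition
    prev_loss = epoch_loss
```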
Preferably, the test error is calculated according to the following formula:

$$MSE = \frac{1}{N}\sum_{z=1}^{N}\sum_{k}\left(o_k^{z} - y_k^{z}\right)^{2} \qquad (6)$$

where $MSE$ denotes the test error, $N$ the number of samples in the test sample set, $o_k^{z}$ the actual value of sample z of the test sample set at the k-th node of the output layer, and $y_k^{z}$ the output value of sample z of the test sample set at the k-th node of the output layer.
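A one-function NumPy sketch of formula (6), assuming the actual and output values of all N test samples are stacked into two arrays of identical shape:

```python
import numpy as np

def test_error(actual, outputs):
    """Formula (6): test-set MSE over N samples.

    actual, outputs : (N, n) arrays holding, for each sample z, the
    actual value o_k^z and the output value y_k^z at every output node k.
    """
    N = actual.shape[0]
    return np.sum((actual - outputs) ** 2) / N
```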
In other embodiments, the speech enhancement program may also be divided into one or more modules, which are stored in the memory and executed by the processor to implement this application. A module referred to in this application is a series of computer program instruction segments that performs a specific function. The speech enhancement program may be divided into: a model building module 1, a model training module 2, an input module 3, and an output module 4. The functions or operation steps implemented by these modules are similar to those described above and are not detailed again here; illustratively:
The model building module 1 constructs a fully convolutional neural network model, where the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer, the hidden layer consists of multiple convolutional layers, and each convolutional layer has multiple filters; the output model of the output layer is:
$y_t = F^{T} * R_t$  (1)
where $t$ is the node index, $y_t$ is the t-th node of the output layer, $F$ is the filter, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the t-th node of the hidden layer;
The model training module 2 trains the fully convolutional neural network model;
The input module 3 inputs the original speech signal into the trained fully convolutional neural network model;
The output module 4 outputs the enhanced speech signal.
Preferably, the hidden-layer model uses the PReLU activation function.
Preferably, the fully convolutional neural network model is trained on the TIMIT corpus, which is divided into a training set and a test set.
Preferably, the hidden-layer model uses the Adam optimizer to minimize the mean square error between clean speech and enhanced speech.
Preferably, the quality of the output enhanced speech signal is judged by PESQ and the short-time objective intelligibility score STOI.
In one embodiment of the present application, the computer-readable storage medium may be any tangible medium that contains or stores programs or instructions; the programs therein may be executed, with the stored program instructions directing the associated hardware to implement the corresponding functions. For example, the computer-readable storage medium may be a computer diskette, a hard disk, a random access memory, a read-only memory, or the like. The application is not limited to these; the medium may be any device that stores instructions or software and any associated data files or data structures in a non-transitory manner and can provide them to a processor so that the processor executes the programs or instructions therein. The computer-readable storage medium includes a speech enhancement program which, when executed by a processor, implements the following speech enhancement method:
Construct a fully convolutional neural network model, where the fully convolutional neural network model includes an input layer, a hidden layer, and an output layer, the hidden layer consists of multiple convolutional layers, and each convolutional layer has multiple filters; the output model of the output layer is:
$y_t = F^{T} * R_t$  (1)
where $y_t$ is the t-th node of the output layer, $F^{T}$ is the transpose of the filter weight matrix, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the t-th node of the hidden layer;
Train the fully convolutional neural network model;
Input the original speech signal into the trained fully convolutional neural network model;
Output the enhanced speech signal.
Preferably, the model of the hidden layers of the fully convolutional neural network model is constructed according to the following formulas:

$$h_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1} x_i + b_j^{1}\right) \qquad (2)$$

$$h_k^{l} = f\left(\sum_{j=1}^{H} w_{kj}^{l} h_j^{l-1} + b_k^{l}\right) \qquad (3)$$

where $h_j^{1}$ denotes the output value of the j-th node of the first hidden layer; $x_i$ denotes the variable of the i-th node of the input layer; $w_{ij}^{1}$ denotes the connection weight between the i-th node of the input layer and the j-th node of the first hidden layer; $b_j^{1}$ denotes the offset of the j-th node of the first hidden layer; $n$ denotes the number of nodes of the input layer; $h_k^{l}$ denotes the output value of the k-th node of the l-th hidden layer; $h_j^{l-1}$ denotes the output value of the j-th node of the (l-1)-th hidden layer; $w_{kj}^{l}$ denotes the connection weight between the k-th node of the l-th hidden layer and the j-th node of the (l-1)-th hidden layer; $b_k^{l}$ denotes the offset of the k-th node of the l-th hidden layer; $H$ denotes the number of nodes of a hidden layer; and $f$ is the activation function.
Preferably, training the fully convolutional neural network model includes:
Initially assign values to the parameters of the fully convolutional neural network model, where the parameters include the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
Construct a sample set and divide it proportionally into a training sample set and a test sample set;
Input one training sample from the training sample set and extract a feature vector from the training sample;
Substitute the input data of the training sample into formulas (1)-(3) to compute the output values of the hidden-layer nodes and of the output-layer nodes;
Compute the error of each node of the output layer:
$e_k = o_k - y_k$  (4)
where $e_k$ denotes the error of the k-th node of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
Update the parameters of the fully convolutional neural network model by error back-propagation;
Input the next training sample and continue updating the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
Set the loss function of the fully convolutional neural network model:
$E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^{2}$  (5)
where $n$ denotes the number of nodes of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
Judge whether training satisfies an end condition. If the end condition is satisfied, end the training and output the trained fully convolutional neural network model; otherwise, continue training the model. The end condition includes one or both of a first end condition and a second end condition: the first end condition is that the current iteration count exceeds a set maximum number of iterations, and the second end condition is that the change of the loss function value over several consecutive iterations is smaller than a set target value.
Preferably, the test error is calculated according to the following formula:

$$MSE = \frac{1}{N}\sum_{z=1}^{N}\sum_{k}\left(o_k^{z} - y_k^{z}\right)^{2} \qquad (6)$$

where $MSE$ denotes the test error, $N$ the number of samples in the test sample set, $o_k^{z}$ the actual value of sample z of the test sample set at the k-th node of the output layer, and $y_k^{z}$ the output value of sample z of the test sample set at the k-th node of the output layer.
Preferably, the test samples in the test sample set differ from the training samples in the training sample set in both signal-to-noise ratio and noise type.
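A minimal sketch of the standard construction for mixing a clean utterance with a noise recording at a chosen signal-to-noise ratio, which is how such training and test samples are typically synthesized; the construction itself is an assumption, since the application does not spell out the mixing procedure:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix clean speech with noise at a target SNR in decibels."""
    noise = noise[:len(clean)]             # trim noise to utterance length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```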
Preferably, the fully convolutional neural network model includes an input layer, six convolutional layers, and an output layer; each convolutional layer has 1024 nodes, and the convolution stride is 1.
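A PyTorch sketch of this preferred architecture. The application fixes six convolutional layers of 1024 nodes with stride 1 and no fully connected layers; the kernel size, padding, and single-channel input/output below are illustrative assumptions:

```python
import torch.nn as nn

def build_fcn(kernel_size=11):
    """Input layer -> six 1024-channel conv layers (stride 1) -> output layer."""
    pad = kernel_size // 2                  # keep the time axis unchanged
    layers, channels = [], [1] + [1024] * 6
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv1d(c_in, c_out, kernel_size,
                             stride=1, padding=pad), nn.PReLU()]
    layers.append(nn.Conv1d(1024, 1, kernel_size, stride=1, padding=pad))
    return nn.Sequential(*layers)           # no fully connected layers anywhere
```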
The specific implementation of the computer-readable storage medium of this application is substantially the same as that of the speech enhancement method and electronic device described above and is not repeated here.
It should be noted that, as used herein, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, device, article, or method that includes that element.
The serial numbers of the above embodiments of this application are for description only and do not represent the relative merits of the embodiments. From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc), including several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not limit its patent scope; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

1. A speech enhancement method based on a fully convolutional neural network, applied to an electronic device, characterized in that the method comprises:
    constructing a fully convolutional neural network model, the fully convolutional neural network model comprising an input layer, a hidden layer, and an output layer, the hidden layer consisting of multiple convolutional layers, each convolutional layer having multiple filters, the output model of the output layer being:
    $y_t = F^{T} * R_t$  (1)
    where $y_t$ is the t-th node of the output layer, $F^{T}$ is the transpose of the filter weight matrix, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the t-th node of the hidden layer;
    training the fully convolutional neural network model;
    inputting the original speech signal into the trained fully convolutional neural network model;
    outputting the enhanced speech signal.
2. The speech enhancement method based on a fully convolutional neural network according to claim 1, characterized in that the model of the hidden layers of the fully convolutional neural network model is constructed according to the following formulas:

    $$h_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1} x_i + b_j^{1}\right) \qquad (2)$$

    $$h_k^{l} = f\left(\sum_{j=1}^{H} w_{kj}^{l} h_j^{l-1} + b_k^{l}\right) \qquad (3)$$

    where $h_j^{1}$ denotes the output value of the j-th node of the first hidden layer; $x_i$ denotes the variable of the i-th node of the input layer; $w_{ij}^{1}$ denotes the connection weight between the i-th node of the input layer and the j-th node of the first hidden layer; $b_j^{1}$ denotes the offset of the j-th node of the first hidden layer; $n$ denotes the number of nodes of the input layer; $h_k^{l}$ denotes the output value of the k-th node of the l-th hidden layer; $h_j^{l-1}$ denotes the output value of the j-th node of the (l-1)-th hidden layer; $w_{kj}^{l}$ denotes the connection weight between the k-th node of the l-th hidden layer and the j-th node of the (l-1)-th hidden layer; $b_k^{l}$ denotes the offset of the k-th node of the l-th hidden layer; $H$ denotes the number of nodes of a hidden layer; and $f$ is the activation function.
3. The speech enhancement method based on a fully convolutional neural network according to claim 2, characterized in that training the fully convolutional neural network model comprises:
    initially assigning values to the parameters of the fully convolutional neural network model, the parameters including the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
    constructing a sample set and dividing it proportionally into a training sample set and a test sample set;
    inputting one training sample from the training sample set and extracting a feature vector from the training sample;
    substituting the input data of the training sample into formulas (1)-(3) to compute the output values of the hidden-layer nodes and of the output-layer nodes;
    computing the error of each node of the output layer:
    $e_k = o_k - y_k$  (4)
    where $e_k$ denotes the error of the k-th node of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
    updating the parameters of the fully convolutional neural network model by error back-propagation;
    inputting the next training sample and continuing to update the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
    setting the loss function of the fully convolutional neural network model:
    $E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^{2}$  (5)
    where $n$ denotes the number of nodes of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
    judging whether training satisfies an end condition; if the end condition is satisfied, ending the training and outputting the trained fully convolutional neural network model, and otherwise continuing to train the model, wherein the end condition includes one or both of a first end condition and a second end condition, the first end condition being that the current iteration count exceeds a set maximum number of iterations, and the second end condition being that the change of the loss function value over several consecutive iterations is smaller than a set target value.
4. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the test error is calculated according to the following formula:

    $$MSE = \frac{1}{N}\sum_{z=1}^{N}\sum_{k}\left(o_k^{z} - y_k^{z}\right)^{2} \qquad (6)$$

    where $MSE$ denotes the test error, $N$ the number of samples in the test sample set, $o_k^{z}$ the actual value of sample z of the test sample set at the k-th node of the output layer, and $y_k^{z}$ the output value of sample z of the test sample set at the k-th node of the output layer.
5. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the test samples in the test sample set differ from the training samples in the training sample set in both signal-to-noise ratio and noise type.
6. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the weight values of the filters in the fully convolutional neural network are found by gradient descent.
7. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the hidden-layer model uses the PReLU activation function.
8. The speech enhancement method based on a fully convolutional neural network according to claim 2, characterized in that the fully convolutional neural network model is trained on the TIMIT corpus, which is divided into a training set and a test set.
9. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the hidden-layer model uses the Adam optimizer to minimize the mean square error between clean speech and enhanced speech.
10. The speech enhancement method based on a fully convolutional neural network according to claim 3, characterized in that the quality of the output enhanced speech signal is judged by PESQ and the short-time objective intelligibility score STOI.
11. The speech enhancement method based on a fully convolutional neural network according to any one of claims 1 to 10, characterized in that the fully convolutional neural network model comprises an input layer, six convolutional layers, and an output layer, each convolutional layer having 1024 nodes and a convolution stride of 1.
12. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory including a speech enhancement program which, when executed by the processor, implements the following steps:
    constructing a fully convolutional neural network model, the fully convolutional neural network model comprising an input layer, a hidden layer, and an output layer, the hidden layer consisting of multiple convolutional layers, each convolutional layer having multiple filters, the output model of the output layer being:
    $y_t = F^{T} * R_t$  (1)
    where $y_t$ is the t-th node of the output layer, $F^{T}$ is the transpose of the filter weight matrix, $F \in \mathbb{R}^{f \times 1}$, $f$ denotes the filter size, and $R_t$ is the t-th node of the hidden layer;
    training the fully convolutional neural network model;
    inputting the original speech signal into the trained fully convolutional neural network model;
    outputting the enhanced speech signal.
13. The electronic device according to claim 12, characterized in that the model of the hidden layers in the fully convolutional neural network model is:

    $$h_j^{1} = f\left(\sum_{i=1}^{n} w_{ij}^{1} x_i + b_j^{1}\right) \qquad (2)$$

    $$h_k^{l} = f\left(\sum_{j=1}^{H} w_{kj}^{l} h_j^{l-1} + b_k^{l}\right) \qquad (3)$$

    where $h_j^{1}$ denotes the output value of the j-th node of the first hidden layer; $x_i$ denotes the variable of the i-th node of the input layer; $w_{ij}^{1}$ denotes the connection weight between the i-th node of the input layer and the j-th node of the first hidden layer; $b_j^{1}$ denotes the offset of the j-th node of the first hidden layer; $n$ denotes the number of nodes of the input layer; $h_k^{l}$ denotes the output value of the k-th node of the l-th hidden layer; $h_j^{l-1}$ denotes the output value of the j-th node of the (l-1)-th hidden layer; $w_{kj}^{l}$ denotes the connection weight between the k-th node of the l-th hidden layer and the j-th node of the (l-1)-th hidden layer; $b_k^{l}$ denotes the offset of the k-th node of the l-th hidden layer; $H$ denotes the number of nodes of a hidden layer; and $f$ is the activation function.
14. The electronic device according to claim 12, characterized in that the processor training the fully convolutional neural network model comprises:
    initially assigning values to the parameters of the fully convolutional neural network model, the parameters including the connection weights between the input layer and the hidden layer, the connection weights between adjacent hidden layers, and the offsets of the hidden layers;
    constructing a sample set and dividing it proportionally into a training sample set and a test sample set;
    inputting one training sample from the training sample set and extracting a feature vector from the training sample;
    substituting the input data of the training sample into formulas (1)-(3) to compute the output values of the hidden-layer nodes and of the output-layer nodes;
    computing the error of each node of the output layer:
    $e_k = o_k - y_k$  (4)
    where $e_k$ denotes the error of the k-th node of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
    updating the parameters of the fully convolutional neural network model by error back-propagation;
    inputting the next training sample and continuing to update the parameters of the fully convolutional neural network model until all training samples in the training sample set have been used, completing one iteration;
    setting the loss function of the fully convolutional neural network model:
    $E = \frac{1}{n}\sum_{k=1}^{n}\left(o_k - y_k\right)^{2}$  (5)
    where $n$ denotes the number of nodes of the output layer, $o_k$ the actual value of the k-th node of the output layer, and $y_k$ the output value of the k-th node of the output layer;
    judging whether training satisfies an end condition; if the end condition is satisfied, ending the training and outputting the trained fully convolutional neural network model, and otherwise continuing to train the model, wherein the end condition includes one or both of a first end condition and a second end condition, the first end condition being that the current iteration count exceeds a set maximum number of iterations, and the second end condition being that the change of the loss function value over several consecutive iterations is smaller than a set target value.
15. The electronic device according to claim 12, characterized in that the processor finds the weight values of the filters in the fully convolutional neural network by gradient descent.
16. The electronic device according to claim 12, characterized in that the hidden-layer model uses the PReLU activation function.
17. The electronic device according to claim 12, characterized in that the fully convolutional neural network model is trained on the TIMIT corpus, which is divided into a training set and a test set.
18. The electronic device according to claim 12, characterized in that the hidden-layer model uses the Adam optimizer to minimize the mean square error between clean speech and enhanced speech.
19. The electronic device according to claim 12, characterized in that, after the enhanced speech signal is output, the speech enhancement quality is judged by PESQ and the short-time objective intelligibility score STOI.
20. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a speech enhancement program which, when executed by a processor, implements the steps of the speech enhancement method according to any one of claims 1 to 10.
PCT/CN2019/089180 2018-11-14 2019-05-30 Speech enhancement method based on fully convolutional neural network, device, and storage medium WO2020098256A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811350813.8A CN109326299B (en) 2018-11-14 2018-11-14 Speech enhancement method, device and storage medium based on full convolution neural network
CN201811350813.8 2018-11-14

Publications (1)

Publication Number Publication Date
WO2020098256A1 true WO2020098256A1 (en) 2020-05-22

Family

ID=65261439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/089180 WO2020098256A1 (en) 2018-11-14 2019-05-30 Speech enhancement method based on fully convolutional neural network, device, and storage medium

Country Status (2)

Country Link
CN (1) CN109326299B (en)
WO (1) WO2020098256A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326299B (en) * 2018-11-14 2023-04-25 平安科技(深圳)有限公司 Speech enhancement method, device and storage medium based on full convolution neural network
CN110265053B (en) * 2019-06-29 2022-04-19 联想(北京)有限公司 Signal noise reduction control method and device and electronic equipment
CN110348566B (en) * 2019-07-15 2023-01-06 上海点积实业有限公司 Method and system for generating digital signal for neural network training
CN110534123B (en) * 2019-07-22 2022-04-01 中国科学院自动化研究所 Voice enhancement method and device, storage medium and electronic equipment
CN110648681B (en) * 2019-09-26 2024-02-09 腾讯科技(深圳)有限公司 Speech enhancement method, device, electronic equipment and computer readable storage medium
CN116508099A (en) * 2020-10-29 2023-07-28 杜比实验室特许公司 Deep learning-based speech enhancement
CN113345463B (en) * 2021-05-31 2024-03-01 平安科技(深圳)有限公司 Speech enhancement method, device, equipment and medium based on convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN106847302A (en) * 2017-02-17 2017-06-13 大连理工大学 Single channel mixing voice time-domain seperation method based on convolutional neural networks
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108172238A (en) * 2018-01-06 2018-06-15 广州音书科技有限公司 A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system
CN109326299A (en) * 2018-11-14 2019-02-12 平安科技(深圳)有限公司 Sound enhancement method, device and storage medium based on full convolutional neural networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157953B (en) * 2015-04-16 2020-02-07 科大讯飞股份有限公司 Continuous speech recognition method and system
US10090001B2 (en) * 2016-08-01 2018-10-02 Apple Inc. System and method for performing speech enhancement using a neural network-based combined symbol
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN108133702A (en) * 2017-12-20 2018-06-08 重庆邮电大学 A kind of deep neural network speech enhan-cement model based on MEE Optimality Criterias
CN108334843B (en) * 2018-02-02 2022-03-25 成都国铁电气设备有限公司 Arcing identification method based on improved AlexNet

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160322055A1 (en) * 2015-03-27 2016-11-03 Google Inc. Processing multi-channel audio waveforms
CN106847302A (en) * 2017-02-17 2017-06-13 大连理工大学 Single channel mixing voice time-domain seperation method based on convolutional neural networks
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks
CN108172238A (en) * 2018-01-06 2018-06-15 广州音书科技有限公司 A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system
CN109326299A (en) * 2018-11-14 2019-02-12 平安科技(深圳)有限公司 Sound enhancement method, device and storage medium based on full convolutional neural networks

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753977A (en) * 2020-06-30 2020-10-09 中国科学院半导体研究所 Optical neural network convolution layer chip, convolution calculation method and electronic equipment
CN111753977B (en) * 2020-06-30 2024-01-02 中国科学院半导体研究所 Optical neural network convolution layer chip, convolution calculation method and electronic equipment
CN112182709A (en) * 2020-09-28 2021-01-05 中国水利水电科学研究院 Rapid prediction method for let-down water temperature of large-scale reservoir stop log door layered water taking facility
CN112188428A (en) * 2020-09-28 2021-01-05 广西民族大学 Energy efficiency optimization method for Sink node in sensing cloud network
CN112182709B (en) * 2020-09-28 2024-01-16 中国水利水电科学研究院 Method for rapidly predicting water drainage temperature of large reservoir stoplog gate layered water taking facility
CN112188428B (en) * 2020-09-28 2024-01-30 广西民族大学 Energy efficiency optimization method for Sink node in sensor cloud network
CN113314136A (en) * 2021-05-27 2021-08-27 西安电子科技大学 Voice optimization method based on directional noise reduction and dry sound extraction technology
CN113821967A (en) * 2021-06-04 2021-12-21 北京理工大学 Large sample training data generation method based on scattering center model

Also Published As

Publication number Publication date
CN109326299B (en) 2023-04-25
CN109326299A (en) 2019-02-12

Similar Documents

Publication Publication Date Title
WO2020098256A1 (en) Speech enhancement method based on fully convolutional neural network, device, and storage medium
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Zhang et al. Deep learning for environmentally robust speech recognition: An overview of recent developments
WO2020042707A1 (en) Convolutional recurrent neural network-based single-channel real-time noise reduction method
CN110956957B (en) Training method and system of speech enhancement model
Qian et al. Very deep convolutional neural networks for robust speech recognition
CN110853663B (en) Speech enhancement method based on artificial intelligence, server and storage medium
KR101807961B1 (en) Method and apparatus for processing speech signal based on lstm and dnn
JP6987378B2 (en) Neural network learning method and computer program
CN109036460A (en) Method of speech processing and device based on multi-model neural network
Liu et al. Speech enhancement method based on LSTM neural network for speech recognition
CN107068167A (en) Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
KR20200145219A (en) Method and apparatus for combined learning using feature enhancement based on deep neural network and modified loss function for speaker recognition robust to noisy environments
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN112183107A (en) Audio processing method and device
CN111357051A (en) Speech emotion recognition method, intelligent device and computer readable storage medium
CN111798875A (en) VAD implementation method based on three-value quantization compression
WO2020170907A1 (en) Signal processing device, learning device, signal processing method, learning method, and program
CN107545898B (en) Processing method and device for distinguishing speaker voice
CN115884032A (en) Smart call noise reduction method and system of feedback earphone
KR102204975B1 (en) Method and apparatus for speech recognition using deep neural network
CN113823301A (en) Training method and device of voice enhancement model and voice enhancement method and device
Agcaer et al. Optimization of amplitude modulation features for low-resource acoustic scene classification
WO2020015546A1 (en) Far-field speech recognition method, speech recognition model training method, and server

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19885956

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.08.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19885956

Country of ref document: EP

Kind code of ref document: A1