CN112309411B - Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system

Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system

Info

Publication number
CN112309411B
CN112309411B
Authority
CN
China
Prior art keywords
gating
voice
real
imaginary
scale
Prior art date
Legal status
Active
Application number
CN202011332442.8A
Other languages
Chinese (zh)
Other versions
CN112309411A (en)
Inventor
刘明
周彦兵
唐飞
周小明
赵学华
Current Assignee
Shenzhen Institute of Information Technology
Original Assignee
Shenzhen Institute of Information Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Information Technology
Priority to CN202011332442.8A
Publication of CN112309411A
Application granted
Publication of CN112309411B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Abstract

The invention provides a phase-sensitive gating multi-scale cavity convolution network voice enhancement method, which uses a neural network model to construct a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech obtained by time-frequency analysis are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal. The invention also provides a phase-sensitive gating multi-scale cavity convolution network voice enhancement system. The beneficial effects of the invention are as follows: the speech enhancement effect is improved, the enhanced speech is guaranteed to have good intelligibility, and the problem of speech distortion is largely avoided.

Description

Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system
Technical Field
The invention relates to a voice enhancement method, in particular to a phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system.
Background
Early auditory experiments showed that when the signal-to-noise ratio is above 6 dB, phase distortion has little influence on speech quality and intelligibility. Most current single-channel speech enhancement methods therefore perform noise reduction only in the magnitude domain of the speech signal and reuse the noisy phase directly to reconstruct the time-domain signal. However, when the acoustic scene faced by a voice product is harsher, for example when the signal-to-noise ratio is below 0 dB or the noise completely masks the speech over short intervals, enhancing only the magnitude cannot guarantee good intelligibility of the enhanced speech, and distortion problems such as trembling or buzzing sounds can even appear.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system.
The invention provides a phase-sensitive gating multi-scale cavity convolution network voice enhancement method, which uses a neural network model to construct a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech obtained by time-frequency analysis are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal.
As a further improvement of the present invention, the noisy speech signal is first framed and windowed and then transformed by a short-time Fourier transform to obtain its complex spectrum; the real and imaginary parts are separated and only the effective (non-redundant) bins are kept, yielding two groups of input features: real-part features and imaginary-part features.
As a further improvement of the present invention, the two groups of input features are then fed into a gating multi-scale cavity convolution network model.
As a further improvement of the present invention, the processing flow of the gating multi-scale cavity convolution network model includes: first, a gating encoding module performs a gating encoding operation to obtain a high-dimensional nonlinear feature representation; a multi-scale feature analysis module then performs temporal feature analysis on the encoded real-part and imaginary-part representations separately; finally, a gating decoding module performs gating decoding operations to obtain the enhanced real and imaginary spectra.
As a further improvement of the invention, the enhanced real and imaginary spectra are passed through an inverse Fourier transform and overlap-added to finally obtain the enhanced speech signal.
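By way of illustration only, a minimal PyTorch sketch of this reconstruction step could look as follows, assuming the 16 kHz sampling rate with 20 ms frames and 10 ms overlap used in the detailed description; the function name and tensor layout are illustrative, not taken from the patent.

```python
import torch

def real_imag_to_waveform(enhanced: torch.Tensor) -> torch.Tensor:
    """Inverse STFT plus overlap-add on enhanced [batch, frames, 161, 2]
    real/imaginary spectra, returning the time-domain signal."""
    spec = torch.complex(enhanced[..., 0], enhanced[..., 1]).transpose(1, 2)
    window = torch.hann_window(320)
    # torch.istft applies the per-frame inverse transform and overlap-adds.
    return torch.istft(spec, n_fft=320, hop_length=160,
                       win_length=320, window=window)
```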
As a further improvement of the invention, the gating encoding module is formed by stacking at least two gated linear encoding units, and each gated linear encoding unit performs a two-dimensional convolution with a 1×3 convolution kernel and a 1×2 stride.
As a further improvement of the invention, the output of each gated linear encoding unit is passed through an exponential linear activation to apply a nonlinear transformation to the features.
As a further refinement of the invention, the input of the multi-scale feature analysis module comprises two sets of features: (1) the real or imaginary spectrum of the original noisy speech; (2) the real or imaginary features output by the gating encoding module.
As a further improvement of the invention, the multi-scale feature analysis module is formed by stacking at least two multi-scale analysis units. Each multi-scale analysis unit splices the two sets of feature tensors; before splicing, the two tensors are reshaped into three-dimensional tensors of shape [number of sentences, sentence length, 322]. The spliced feature tensor is then decomposed into 8 sub-bands, the first 7 with tensor shape [number of sentences, sentence length, 40] and the last with shape [number of sentences, sentence length, 42]. The input of the current sub-band is spliced with the convolution output of the adjacent sub-band, a one-dimensional cavity convolution is then applied, and an exponential linear activation follows each sub-band convolution. After several multi-scale analysis units, the multi-scale features are expanded by a 1024-dimensional fully connected layer, the output feature tensor is reshaped into the 4-dimensional form [number of sentences, sentence length, 4, 256], and the two reshaped feature tensors are respectively sent to the gating decoding module for the decoding operation.
The invention also provides a phase-sensitive gating multi-scale cavity convolution network voice enhancement system, comprising a readable storage medium storing execution instructions which, when executed by a processor, implement any of the methods described above.
The beneficial effects of the invention are as follows: through the above scheme, the speech enhancement effect is improved, the enhanced speech is guaranteed to have good intelligibility, and the problem of speech distortion is largely avoided.
Drawings
FIG. 1 is a processing flow diagram of the phase-sensitive gating multi-scale cavity convolution network voice enhancement method of the present invention.
FIG. 2 is a diagram of the gating multi-scale cavity convolution network structure of the phase-sensitive gating multi-scale cavity convolution network voice enhancement method of the present invention.
FIG. 3 is a block diagram of the gated linear encoding and decoding units of the phase-sensitive gating multi-scale cavity convolution network voice enhancement method of the present invention.
FIG. 4 is a block diagram of the multi-scale analysis unit of the phase-sensitive gating multi-scale cavity convolution network voice enhancement method of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and the detailed description.
A phase-sensitive gating multi-scale cavity (dilated) convolution network voice enhancement method aims to construct a mapping between the complex spectra of speech signals using a neural network model: the real and imaginary spectra of noisy speech obtained by time-frequency analysis are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal. The processing flow of the whole algorithm is shown in FIG. 1; the dotted-line part is the gating multi-scale cavity convolution network structure designed by the invention, the core module of the whole algorithm, which performs noise reduction on the real and imaginary spectra of noisy speech through three modules: gating encoding, multi-scale feature analysis, and gating decoding.
As shown in FIG. 1, the noisy speech signal is first framed and windowed and then transformed by a short-time Fourier transform to obtain its complex spectrum; the real and imaginary parts are separated and only the effective (non-redundant) bins are kept, yielding two groups of input features: real-part features and imaginary-part features. The two groups of features are then sent into the gating multi-scale cavity convolution network model: a gating encoding operation first produces a high-dimensional nonlinear feature representation, the multi-scale feature analysis module then performs temporal feature analysis on each encoded representation, and each is decoded to obtain the enhanced real and imaginary spectra. Each module of the gating multi-scale cavity convolution network is described in detail below.
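As an illustration of this front-end, the following is a minimal PyTorch sketch of the time-frequency analysis step, assuming the 16 kHz sampling rate, 20 ms frame length and 10 ms overlap described below (so n_fft = 320 and hop_length = 160 give the 161 bins per frame quoted in the text); the function name and tensor layout are illustrative assumptions, not the patent's reference code.

```python
import torch

def noisy_to_real_imag(waveform: torch.Tensor) -> torch.Tensor:
    """Framing + windowing + STFT, then splitting into real/imaginary features.

    waveform: [batch, samples] at 16 kHz. n_fft=320 (20 ms frames) and
    hop_length=160 (10 ms overlap) give 161 frequency bins per frame,
    matching the [sentences, length, 161, 2] input tensor described below.
    """
    window = torch.hann_window(320)
    spec = torch.stft(waveform, n_fft=320, hop_length=160, win_length=320,
                      window=window, return_complex=True)  # [batch, 161, frames]
    spec = spec.transpose(1, 2)                             # [batch, frames, 161]
    return torch.stack((spec.real, spec.imag), dim=-1)      # [batch, frames, 161, 2]
```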
The detailed structure of the gating multi-scale cavity convolution network is shown in FIG. 2; it consists of three parts: gating encoding, multi-scale feature analysis, and gating decoding. The real and imaginary features X_real(n, k) and X_imag(n, k) of the input noisy speech first enter the gating encoding part for feature transformation. The structure of the gated linear encoding unit is shown in FIG. 3(a). The tensor shape of the input real and imaginary features is [number of sentences, sentence length, 161, 2]: since a 16 kHz sampling rate is used with a frame length of 20 ms and a frame overlap of 10 ms, the 161 in the third dimension is the feature length per frame of the real or imaginary part, and the 2 in the fourth dimension indexes the real and imaginary parts. A total of 5 gated linear encoding units are stacked; each unit performs a two-dimensional convolution with a 1×3 kernel and a 1×2 stride, with channel counts of 16, 32, 64, 128 and 256 respectively, so that the output tensors of the 5 encoding units are, in order: [number of sentences, sentence length, 80, 16], [number of sentences, sentence length, 39, 32], [number of sentences, sentence length, 19, 64], [number of sentences, sentence length, 9, 128], and [number of sentences, sentence length, 4, 256]. To achieve attention control between features, a Sigmoid activation function nonlinearly activates the convolution output of one side of each encoding unit into a probability value within [0, 1], which then, in a gated-attention manner, multiplies the convolution output features of the other side pointwise. In addition, the output of each gated linear encoding unit is passed through the exponential linear activation of formula (1) to apply a nonlinear transformation to the features:

f(x) = x,            if x > 0
f(x) = α(e^x − 1),   if x ≤ 0        (1)

where α is a parameter optimized during training; the exponential linear activation helps alleviate gradient vanishing during training, making the model more robust to input noise.
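A minimal PyTorch sketch of one gated linear encoding unit as described above follows; the channels-first tensor layout, the module names, and the learnable-α implementation of formula (1) are assumptions for illustration, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class LearnableELU(nn.Module):
    """Exponential linear activation of formula (1) with alpha trained."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return torch.where(x > 0, x, self.alpha * (torch.exp(x) - 1.0))

class GatedLinearEncoderUnit(nn.Module):
    """Two parallel 1x3 convolutions with 1x2 stride; the sigmoid branch
    yields [0, 1] attention values that gate the linear branch pointwise."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.linear = nn.Conv2d(in_ch, out_ch, (1, 3), stride=(1, 2))
        self.gate = nn.Conv2d(in_ch, out_ch, (1, 3), stride=(1, 2))
        self.act = LearnableELU()

    def forward(self, x):   # x: [batch, ch, time, freq], channels-first layout
        return self.act(self.linear(x) * torch.sigmoid(self.gate(x)))

# Five stacked units, channels 2 -> 16 -> 32 -> 64 -> 128 -> 256; with a
# 161-bin input the frequency axis shrinks 161 -> 80 -> 39 -> 19 -> 9 -> 4,
# matching the output shapes quoted in the text.
encoder = nn.Sequential(*(
    GatedLinearEncoderUnit(i, o)
    for i, o in zip((2, 16, 32, 64, 128), (16, 32, 64, 128, 256))))
```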
Next, to make full use of the contextual information in the speech signal, a multi-scale time-domain feature analysis method analyzes and integrates the feature information of past and current frames, capturing the context most useful for estimating the current frame's features. The structure of the designed multi-scale analysis unit is shown in FIG. 4. Its input has two main parts: the real or imaginary spectrum of the original noisy speech, and the features output by the previous module; these are spliced, and before splicing the two feature tensors are reshaped into three-dimensional tensors of shape [number of sentences, sentence length, 322]. The spliced feature tensor is then decomposed into 8 sub-bands, the first 7 with tensor shape [number of sentences, sentence length, 40] and the last with shape [number of sentences, sentence length, 42]. In the convolution of each sub-band, the input of the current sub-band is spliced with the output of the adjacent sub-band, and a one-dimensional cavity convolution is then applied. A total of 5 multi-scale analysis units are stacked, and to better enlarge the convolution receptive field the cavity (dilation) rate increases progressively: 1, 3, 5, 7 and 11. After each sub-band convolution, the exponential linear activation of formula (1) is applied. This sub-band splicing convolution scheme gives each convolution layer a different receptive field range; along the decomposition direction the receptive field grows linearly, so the convolution layers possess temporal analysis capability at different scales. It is further desired that each multi-scale analysis unit produce an intermediate estimate of the complex spectral features and pass it as input to the next multi-scale analysis unit; therefore, after the multi-scale convolution layer, a fully connected linear decoding layer linearly transforms the multi-scale features into an intermediate estimate of the real or imaginary part, with tensor shape [number of sentences, sentence length, 161].
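The following sketch illustrates one multi-scale analysis unit under several assumptions that the text does not fix: kernel size 3, 40 output channels per sub-band, "adjacent" taken to mean the previous sub-band, left-padding for causality, and a fixed-α ELU standing in for formula (1). Only the sub-band widths, the cavity rates and the 161-dimensional linear decoding layer come from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAnalysisUnit(nn.Module):
    """One multi-scale analysis unit (a sketch).

    The 322-dim spliced feature is split into 8 sub-bands (7 x 40 + 1 x 42);
    each sub-band is concatenated with the convolution output of the previous
    sub-band, passed through a causal one-dimensional cavity (dilated)
    convolution, then exponentially linearly activated. A fully connected
    linear decoding layer produces the 161-dim intermediate estimate.
    """

    WIDTHS = [40] * 7 + [42]

    def __init__(self, dilation: int, kernel: int = 3, band_out: int = 40):
        super().__init__()
        self.dilation, self.kernel = dilation, kernel
        self.convs = nn.ModuleList(
            nn.Conv1d(w + (band_out if i > 0 else 0), band_out,
                      kernel, dilation=dilation)
            for i, w in enumerate(self.WIDTHS))
        self.to_estimate = nn.Linear(len(self.WIDTHS) * band_out, 161)

    def forward(self, x):                       # x: [batch, time, 322]
        outs, prev = [], None
        for conv, band in zip(self.convs, torch.split(x, self.WIDTHS, dim=-1)):
            h = band if prev is None else torch.cat([band, prev], dim=-1)
            h = F.pad(h.transpose(1, 2),        # Conv1d wants [batch, feat, time]
                      (self.dilation * (self.kernel - 1), 0))  # causal left pad
            prev = F.elu(conv(h)).transpose(1, 2)
            outs.append(prev)
        feats = torch.cat(outs, dim=-1)         # [batch, time, 320]
        return self.to_estimate(feats)          # [batch, time, 161] estimate

# Five units stacked with increasing cavity rates; each unit's intermediate
# estimate is re-spliced with the original 161-dim spectrum to form the
# next unit's 322-dim input.
units = nn.ModuleList(MultiScaleAnalysisUnit(d) for d in (1, 3, 5, 7, 11))
```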
After the 5 multi-scale analysis units, the multi-scale features are expanded by a 1024-dimensional fully connected layer and the output feature tensor is reshaped into the 4-dimensional form [number of sentences, sentence length, 4, 256]. The two reshaped feature tensors are then fed into the gated linear decoding units for decoding. The operation of a linear decoding unit is shown in FIG. 3(b); unlike the linear encoding unit, it uses a two-dimensional deconvolution (transposed convolution) to expand the feature tensor. Each decoding unit uses a 1×3 convolution kernel with a 1×2 stride, gradually expanding the features of each channel, and the channel counts decrease as 128, 64, 32, 16, 1, so that the output tensors of the 5 linear decoding units are, in order: [number of sentences, sentence length, 9, 128], [number of sentences, sentence length, 19, 64], [number of sentences, sentence length, 39, 32], [number of sentences, sentence length, 80, 16], and [number of sentences, sentence length, 161, 1]. As in the encoder, the output of each gated linear decoding unit is passed through the exponential linear activation of formula (1).
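A matching sketch of one gated linear decoding unit follows, with the same caveats as the encoder sketch. Note that reproducing the quoted 39 → 80 frequency expansion requires output_padding = 1 on the fourth unit, an inference from the quoted shapes rather than something the text states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLinearDecoderUnit(nn.Module):
    """Mirror of the encoding unit: a 1x3-kernel, 1x2-stride two-dimensional
    transposed convolution expands the frequency axis, gated by a sigmoid
    branch exactly as in the encoder."""
    def __init__(self, in_ch: int, out_ch: int, out_pad: int = 0):
        super().__init__()
        kwargs = dict(kernel_size=(1, 3), stride=(1, 2),
                      output_padding=(0, out_pad))
        self.linear = nn.ConvTranspose2d(in_ch, out_ch, **kwargs)
        self.gate = nn.ConvTranspose2d(in_ch, out_ch, **kwargs)

    def forward(self, x):          # x: [batch, ch, time, freq]
        # Fixed-alpha ELU stands in for formula (1) here.
        return F.elu(self.linear(x) * torch.sigmoid(self.gate(x)))

# Channels fall 256 -> 128 -> 64 -> 32 -> 16 -> 1 while the frequency axis
# grows 4 -> 9 -> 19 -> 39 -> 80 -> 161; the 39 -> 80 step only matches the
# quoted shapes with output_padding=1 on that unit.
decoder = nn.Sequential(*(
    GatedLinearDecoderUnit(i, o, p)
    for i, o, p in zip((256, 128, 64, 32, 16),
                       (128, 64, 32, 16, 1),
                       (0, 0, 0, 1, 0))))
```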
After the neural network model is constructed, a large amount of data is required for training so that the model acquires the ability to map to clean real and imaginary spectra. Sufficient pairs of noisy and ideal complex speech spectra are first prepared as the training data set: 4620 sentences of the TIMIT dataset [1] are chosen as clean training speech, and 12 kinds of noise from the NOISEX-92 [2] noise library, including restaurant (babble) noise, 2 kinds of fighter noise, 2 kinds of destroyer noise, factory noise, tank noise, Volvo car noise, high-frequency channel noise, white noise, Leopard military vehicle noise and machine-gun noise, are randomly mixed with the clean speech at signal-to-noise ratios between −5 and 15 dB, yielding about 38 hours of noisy training data. To tune the model parameters, a validation set is also needed: 280 sentences are selected from the TIMIT test set as clean validation speech and uniformly mixed with the 12 training noises at signal-to-noise ratios of −5 to 15 dB. The loss function for training the gating multi-scale cavity convolution network is the mean square error of formula (2), where n and k are the frame and frequency indices of the speech signal, N and K are the numbers of frames and frequency bins, X_real(n, k) and X_imag(n, k) are the ideal real and imaginary spectra, and X̂_real(n, k) and X̂_imag(n, k) are the real and imaginary spectra output by the neural network:

L = (1 / (N·K)) Σₙ Σₖ [ (X_real(n, k) − X̂_real(n, k))² + (X_imag(n, k) − X̂_imag(n, k))² ]        (2)
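Formula (2) translates directly into a few lines (names illustrative):

```python
import torch

def complex_mse_loss(est_real: torch.Tensor, est_imag: torch.Tensor,
                     ref_real: torch.Tensor, ref_imag: torch.Tensor):
    """Mean square error of formula (2): squared errors of the real and
    imaginary spectra, averaged over frames n and frequency bins k."""
    return torch.mean((ref_real - est_real) ** 2 + (ref_imag - est_imag) ** 2)
```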
During training, overfitting is reduced through a 20% dropout rate and batch normalization; back-propagation uses the Adam optimization algorithm, with 50 iterations at a learning rate of 0.001 followed by 10 iterations at a learning rate of 0.0001, yielding a gating multi-scale cavity convolution network model capable of mapping to the real and imaginary spectra of clean speech.
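A sketch of this training schedule, reusing the complex_mse_loss sketch above; the loop structure and names are assumptions, and the 20% dropout and batch normalization are taken to live inside the model's own modules:

```python
import torch

def train(model: torch.nn.Module, loader) -> None:
    """Adam back-propagation: 50 iterations (epochs) at lr 0.001, then 10
    at lr 0.0001, as stated above."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(60):
        if epoch == 50:
            for group in opt.param_groups:
                group["lr"] = 1e-4
        for noisy, clean in loader:            # paired complex-spectrum batches
            est_real, est_imag = model(noisy)
            loss = complex_mse_loss(est_real, est_imag,
                                    clean[..., 0], clean[..., 1])
            opt.zero_grad()
            loss.backward()
            opt.step()
```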
The following experiments verify the noise reduction effect of the proposed method. To evaluate the quality, intelligibility and distortion of the denoised speech, the PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility) and SDR (Signal-to-Distortion Ratio) metrics are used; as shown in Table 1, all results are measured on the test set, with higher values representing better performance. The test set consists of a further 320 sentences selected from the TIMIT test set, not repeated in the training or validation sets, mixed respectively with the 12 trained noises and 3 untrained noises from NOISEX-92 (an untrained fighter noise, an untrained factory noise, and pink noise) at five noise pollution levels: −5 dB, 0 dB, 5 dB, 10 dB and 15 dB. The experimental results in Table 1 show that the proposed method not only reduces noise well in the trained noise scenes but also generalizes well to untrained noise scenes, demonstrating good model generalization. Even with transient noise such as factory noise and machine-gun noise, the proposed method is clearly effective: abrupt background noise is barely audible and the speech quality is well recovered. In some low signal-to-noise-ratio environments, the enhanced speech likewise exhibits no buzzing, trembling or similar problems. In addition, the delay of the designed method is below 30 ms, fully meeting the real-time requirement of most voice products.
TABLE 1 evaluation results of the PESQ, STOI and SDR metrics in different noise environments
Unlike prior-art deep neural network noise reduction methods that enhance only in the magnitude domain, the method models the complex spectral information of the speech signal, i.e., the real and imaginary spectra after the Fourier transform. A cavity convolutional neural network with a multi-scale encoding-decoding architecture is constructed to learn the mapping between noisy and clean signals in the complex domain, thereby jointly optimizing phase and magnitude information. The main advantages of the algorithm are as follows:
(1) Learning is performed in the complex domain, so the enhancement of phase information is taken into account, achieving better speech intelligibility and quality in low signal-to-noise-ratio environments;
(2) The real and imaginary information of the complex spectrum amounts to two learning targets; compared with mapping a single magnitude spectrum, a multi-target model has better generalization performance;
(3) Modeling with multi-scale convolution captures the contextual information in speech more finely and recovers more speech detail;
(4) The designed model is a fully causal system, i.e., its output depends only on the current and past frames, minimizing the delay of the algorithm.
References:
[1] J. S. Garofolo, "Getting started with the DARPA TIMIT CD-ROM: An acoustic-phonetic continuous speech database," NIST Technical Report, 1988.
[2] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, 1993.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and the specific implementation of the invention is not to be considered limited to these descriptions. Several simple deductions or substitutions may be made by those skilled in the art without departing from the concept of the invention, and these should all be considered within the protection scope of the invention.

Claims (5)

1. A phase-sensitive gating multi-scale cavity convolution network voice enhancement method, characterized in that: a neural network model is used to construct a mapping between the complex spectra of speech signals, the real and imaginary spectra of noisy speech obtained by time-frequency analysis are mapped to enhanced real and imaginary spectra, and the enhanced real and imaginary spectra are restored to an enhanced time-domain speech signal; the noisy speech signal is first framed and windowed and then transformed by a short-time Fourier transform to obtain its complex spectrum, the real and imaginary parts are separated and only the effective (non-redundant) bins are kept, yielding two groups of input features: real-part features and imaginary-part features; the two groups of input features are then fed into a gating multi-scale cavity convolution network model, whose processing flow comprises: a gating encoding module first performs a gating encoding operation to obtain a high-dimensional nonlinear feature representation, a multi-scale feature analysis module performs temporal feature analysis on the encoded real-part and imaginary-part representations separately, and a gating decoding module performs gating decoding operations to obtain the enhanced real and imaginary spectra; the input of the multi-scale feature analysis module comprises two sets of features: (1) the real or imaginary spectrum of the original noisy speech; (2) the real or imaginary features output by the gating encoding module; the multi-scale feature analysis module is formed by stacking at least two multi-scale analysis units, each multi-scale analysis unit splices the two sets of feature tensors, the two tensors being reshaped before splicing into three-dimensional tensors of shape [number of sentences, sentence length, 322]; the spliced feature tensor is then decomposed into 8 sub-bands, the first 7 with tensor shape [number of sentences, sentence length, 40] and the last with shape [number of sentences, sentence length, 42]; the input of the current sub-band is spliced with the convolution output of its adjacent sub-band, a one-dimensional cavity convolution is then applied, and an exponential linear activation follows each sub-band convolution; after the multi-scale analysis units, the multi-scale features are expanded by a 1024-dimensional fully connected layer, the output feature tensor is reshaped into the 4-dimensional form [number of sentences, sentence length, 4, 256], and the two reshaped feature tensors are respectively sent to the gating decoding module for the decoding operation.
2. The phase-sensitive gating multi-scale cavity convolution network voice enhancement method of claim 1, characterized in that: the enhanced real and imaginary spectra are passed through an inverse Fourier transform and overlap-added to finally obtain the enhanced speech signal.
3. The phase-sensitive gating multi-scale cavity convolution network voice enhancement method of claim 1, characterized in that: the gating encoding module is formed by stacking at least two gated linear encoding units, and each gated linear encoding unit performs a two-dimensional convolution with a 1×3 convolution kernel and a 1×2 stride.
4. The phase-sensitive gating multi-scale cavity convolution network voice enhancement method of claim 3, characterized in that: the output of each gated linear encoding unit is passed through an exponential linear activation to apply a nonlinear transformation to the features.
5. A phase-sensitive gating multi-scale cavity convolution network voice enhancement system, characterized in that: it comprises a readable storage medium storing execution instructions which, when executed by a processor, implement the method of any one of claims 1 to 4.
CN202011332442.8A 2020-11-24 2020-11-24 Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system Active CN112309411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011332442.8A CN112309411B (en) 2020-11-24 2020-11-24 Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011332442.8A CN112309411B (en) 2020-11-24 2020-11-24 Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system

Publications (2)

Publication Number Publication Date
CN112309411A CN112309411A (en) 2021-02-02
CN112309411B true CN112309411B (en) 2024-06-11

Family

ID: 74335732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011332442.8A Active CN112309411B (en) 2020-11-24 2020-11-24 Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system

Country Status (1)

Country Link
CN (1) CN112309411B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129873B (en) * 2021-04-27 2022-07-08 思必驰科技股份有限公司 Optimization method and system for stack type one-dimensional convolution network awakening acoustic model
CN113707163B (en) * 2021-08-31 2024-05-14 北京达佳互联信息技术有限公司 Speech processing method and device and model training method and device
CN114283829B (en) * 2021-12-13 2023-06-16 电子科技大学 Voice enhancement method based on dynamic gating convolution circulation network
CN114842863B (en) * 2022-04-19 2023-06-02 电子科技大学 Signal enhancement method based on multi-branch-dynamic merging network
CN115862581A (en) * 2023-02-10 2023-03-28 杭州兆华电子股份有限公司 Secondary elimination method and system for repeated pattern noise

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3156908A1 (en) * 2015-01-06 2016-07-14 David Burton Mobile wearable monitoring systems
WO2019075267A1 (en) * 2017-10-11 2019-04-18 Google Llc Self-gating activation neural network layers
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110136731A (en) * 2019-05-13 2019-08-16 天津大学 Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice
CN110534123A (en) * 2019-07-22 2019-12-03 中国科学院自动化研究所 Sound enhancement method, device, storage medium, electronic equipment
CN110674866A (en) * 2019-09-23 2020-01-10 兰州理工大学 Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network
CN111160040A (en) * 2019-12-26 2020-05-15 西安交通大学 Information reliability evaluation system and method based on multi-scale gating equilibrium interaction fusion network
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160111107A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3156908A1 (en) * 2015-01-06 2016-07-14 David Burton Mobile wearable monitoring systems
WO2019075267A1 (en) * 2017-10-11 2019-04-18 Google Llc Self-gating activation neural network layers
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110136731A (en) * 2019-05-13 2019-08-16 天津大学 Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice
CN110534123A (en) * 2019-07-22 2019-12-03 中国科学院自动化研究所 Sound enhancement method, device, storage medium, electronic equipment
CN110674866A (en) * 2019-09-23 2020-01-10 兰州理工大学 Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network
CN111160040A (en) * 2019-12-26 2020-05-15 西安交通大学 Information reliability evaluation system and method based on multi-scale gating equilibrium interaction fusion network
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"A deep convolutional neural network speech enhancement method incorporating phase estimation" (一种融合相位估计的深度卷积神经网络语音增强方法); Yuan Wenhao, Liang Chunyan, Xia Bin, Sun Wenzhu; Acta Electronica Sinica; 2018-10-15 (No. 10) *

Also Published As

Publication number Publication date
CN112309411A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
CN112309411B (en) Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system
CN111971743B (en) Systems, methods, and computer readable media for improved real-time audio processing
US11069344B2 (en) Complex evolution recurrent neural networks
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
EP3926623B1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
US20210295859A1 (en) Enhanced multi-channel acoustic models
CN107680611B (en) Single-channel sound separation method based on convolutional neural network
CN110060690B (en) Many-to-many speaker conversion method based on STARGAN and ResNet
CN110739003B (en) Voice enhancement method based on multi-head self-attention mechanism
WO2019227586A1 (en) Voice model training method, speaker recognition method, apparatus, device and medium
EP4235646A2 (en) Adaptive audio enhancement for multichannel speech recognition
CN110060657B (en) SN-based many-to-many speaker conversion method
CN112331224A (en) Lightweight time domain convolution network voice enhancement method and system
CN113936681B (en) Speech enhancement method based on mask mapping and mixed cavity convolution network
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
Wang et al. Recurrent deep stacking networks for supervised speech separation
Mundodu Krishna et al. Single channel speech separation based on empirical mode decomposition and Hilbert transform
JP2023546099A (en) Audio generator, audio signal generation method, and audio generator learning method
Hasannezhad et al. PACDNN: A phase-aware composite deep neural network for speech enhancement
Du et al. A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
Sheeja et al. Speech dereverberation and source separation using DNN-WPE and LWPR-PCA
CN113571074B (en) Voice enhancement method and device based on multi-band structure time domain audio frequency separation network
Azam et al. Urdu spoken digits recognition using classified MFCC and backpropgation neural network
CN113707172A (en) Single-channel voice separation method, system and computer equipment of sparse orthogonal network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant