CN113555031B - Training method and device of voice enhancement model, and voice enhancement method and device - Google Patents


Info

Publication number
CN113555031B
Authority
CN
China
Prior art keywords
noise
speech
enhancement
features
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110869479.2A
Other languages
Chinese (zh)
Other versions
CN113555031A (en)
Inventor
陈联武
张晨
张旭
郑羲光
任新蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110869479.2A priority Critical patent/CN113555031B/en
Publication of CN113555031A publication Critical patent/CN113555031A/en
Application granted granted Critical
Publication of CN113555031B publication Critical patent/CN113555031B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L21/0208 Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
    • G10L21/0232 Noise filtering characterised by the method used for estimating noise; processing in the frequency domain
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • Y02T10/40 Engine management systems (cross-sectional tagging under climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The disclosure relates to a training method and apparatus for a speech enhancement model, and to a speech enhancement method and apparatus. The training method includes: acquiring a training sample set; inputting at least two spectra of a noisy speech signal into the corresponding networks among at least two feature extraction networks to obtain at least two features of the noisy speech signal, where the at least two spectra are obtained based on at least two preset sets of different time-frequency conversion parameters; fusing the at least two features to obtain a fused feature; inputting the fused feature into a speech enhancement network to obtain an estimated enhanced spectrum of the noisy speech signal; determining a target loss function of the speech enhancement model based on the estimated time-domain signal corresponding to the estimated enhanced spectrum and the corresponding clean speech signal; and adjusting parameters of the at least two feature extraction networks and the speech enhancement network according to the target loss function to train the speech enhancement model.

Description

Training method and device of voice enhancement model, and voice enhancement method and device
Technical Field
The disclosure relates to the field of audio and video, and in particular relates to a training method and device for a voice enhancement model, and a voice enhancement method and device.
Background
At present, neural-network-based speech enhancement techniques fall mainly into frequency-domain schemes and time-domain schemes. A frequency-domain scheme typically applies a Short-Time Fourier Transform (STFT) to convert the noisy speech signal into the frequency domain and extract spectral features, from which the neural network then estimates the clean speech signal. However, such a scheme applies the STFT to the noisy speech signal with a fixed set of time-frequency conversion parameters. Because speech and noise signals vary widely in practical scenes, different time-frequency conversion parameters should ideally be used for signals with different characteristics (for example, short-window analysis for very short transient noise such as taps, and long-window analysis for more stationary signals with a lower fundamental frequency). In practice, a frequency-domain scheme selects one fixed set of time-frequency conversion parameters according to the overall effect, so the extracted spectral features cannot adequately cover the characteristics of all signals contained in the noisy speech signal.
Disclosure of Invention
The disclosure provides a training method and apparatus for a speech enhancement model, and a speech enhancement method and apparatus, so as to at least solve the problem in the related art that spectral features extracted with a single fixed set of time-frequency conversion parameters, chosen according to the overall effect, cannot adequately cover the characteristics of all signals in a noisy speech signal.
According to a first aspect of embodiments of the present disclosure, there is provided a training method of a speech enhancement model, the speech enhancement model including at least two feature extraction networks and a speech enhancement network, the training method comprising: acquiring a training sample set, wherein each training sample in the training sample set comprises a noise-containing voice signal and a corresponding clean voice signal, and the noise-containing voice signal is a voice signal obtained by adding noise and reverberation to the corresponding clean voice signal; respectively inputting at least two frequency spectrums of the noise-containing voice signal into corresponding feature extraction networks in at least two feature extraction networks to obtain at least two features of the noise-containing voice signal, wherein the at least two frequency spectrums are obtained based on at least two preset groups of different time-frequency conversion parameters; at least two features are fused to obtain fused features; inputting the fused characteristics into a voice enhancement network to obtain a predicted enhancement spectrum of the noise-containing voice signal; determining a target loss function of the voice enhancement model based on the estimated time domain signal corresponding to the estimated enhancement frequency spectrum and the corresponding clean voice signal; and adjusting parameters of at least two feature extraction networks and a voice enhancement network according to the target loss function, and training the voice enhancement model.
Optionally, the output of the speech enhancement network is a mask of the noisy speech signal, where the mask represents the proportion of the clean speech signal's spectrum within the spectrum of the noisy speech signal. In this case, inputting the fused feature into the speech enhancement network to obtain the estimated enhanced spectrum of the noisy speech signal includes: multiplying the spectrum of the noisy speech signal by the mask of the noisy speech signal to obtain the estimated enhanced spectrum of the noisy speech signal, where the spectrum of the noisy speech signal is obtained based on a preset set of time-frequency conversion parameters.
Optionally, the output of the speech enhancement network is a predicted enhanced spectrum of the noisy speech signal.
Optionally, before inputting at least two spectrums of the noise-containing voice signal to corresponding feature extraction networks in the at least two feature extraction networks respectively, the method further includes: acquiring at least two preset groups of different time-frequency conversion parameters; and respectively carrying out short-time Fourier transform on the noise-containing voice signals based on at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signals.
Optionally, fusing at least two features to obtain fused features, including: and carrying out weighted splicing or weighted addition on the at least two features, wherein the weight corresponding to each of the at least two features is preset.
Optionally, the set of time-frequency conversion parameters includes: at least one of window length, window shift, window function, and fast fourier transform length.
According to a second aspect of embodiments of the present disclosure, there is provided a speech enhancement method, including: acquiring a noise-containing voice signal to be processed; respectively inputting at least two frequency spectrums of the noise-containing voice signal to be processed into a corresponding characteristic extraction network in the voice enhancement model to obtain at least two characteristics of the noise-containing voice signal to be processed, wherein the at least two frequency spectrums are obtained based on at least two groups of preset different time-frequency conversion parameters; at least two features are fused to obtain fused features; inputting the fused characteristics into a voice enhancement network in a voice enhancement model to obtain an enhanced frequency spectrum of a noise-containing voice signal to be processed; and obtaining a time domain signal corresponding to the enhanced frequency spectrum, and taking the time domain signal as an enhanced voice signal of the noise-containing voice signal to be processed.
Optionally, the output of the speech enhancement network is a mask of the noise-containing speech signal to be processed, where the mask represents a spectrum ratio of a clean speech signal in the noise-containing speech signal to be processed, and the fused features are input to the speech enhancement network in the speech enhancement model to obtain an enhanced spectrum of the noise-containing speech signal to be processed, and the method includes: multiplying the frequency spectrum of the noise-containing voice signal to be processed with the corresponding mask to obtain an enhanced frequency spectrum of the noise-containing voice signal to be processed, wherein the frequency spectrum of the noise-containing voice signal to be processed is acquired based on a preset set of time-frequency conversion parameters.
Optionally, the output of the speech enhancement network is an enhanced spectrum of the noisy speech signal to be processed.
Optionally, before at least two spectrums of the noise-containing voice signal to be processed are respectively input into the corresponding feature extraction networks in the voice enhancement model, the method further includes: acquiring at least two preset groups of different time-frequency conversion parameters; and performing short-time Fourier transform on the noise-containing voice signal to be processed based on at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signal to be processed.
Optionally, fusing at least two features to obtain fused features, including: and carrying out weighted splicing or weighted addition on the at least two features, wherein the weight corresponding to each of the at least two features is preset.
Optionally, the set of time-frequency conversion parameters includes at least one of window length, window shift, window function, and fast fourier transform length.
Optionally, the speech enhancement model is trained based on the training method of the speech enhancement model.
According to a third aspect of embodiments of the present disclosure, there is provided a training apparatus of a speech enhancement model, the speech enhancement model including at least two feature extraction networks and a speech enhancement network, the training apparatus comprising:
A training sample set acquisition unit configured to acquire a training sample set, wherein each training sample in the training sample set includes a noisy speech signal and a corresponding clean speech signal, the noisy speech signal being a speech signal of the corresponding clean speech signal to which noise and reverberation are added; the characteristic extraction unit is configured to input at least two frequency spectrums of the noise-containing voice signal into corresponding characteristic extraction networks in the at least two characteristic extraction networks respectively to obtain at least two characteristics of the noise-containing voice signal, wherein the at least two frequency spectrums are obtained based on at least two different preset time-frequency conversion parameters; the fusion unit is configured to fuse at least two features to obtain fused features; the estimated enhancement spectrum acquisition unit is configured to input the fused characteristics into a voice enhancement network to obtain an estimated enhancement spectrum of the noise-containing voice signal; a target loss function determining unit configured to determine a target loss function of the speech enhancement model based on the estimated time domain signal corresponding to the estimated enhancement spectrum and the corresponding clean speech signal; and the training unit is configured to adjust parameters of at least two feature extraction networks and a voice enhancement network according to the target loss function and train the voice enhancement model.
Optionally, the output of the speech enhancement network is a mask of the noisy speech signal, where the mask represents a spectrum ratio of a clean speech signal in the noisy speech signal, and the estimated enhancement spectrum acquisition unit is further configured to multiply the spectrum of the noisy speech signal with the mask of the noisy speech signal to obtain an estimated enhancement spectrum of the noisy speech signal, where the spectrum of the noisy speech signal is acquired based on a preset set of time-frequency conversion parameters.
Optionally, the output of the speech enhancement network is a predicted enhanced spectrum of the noisy speech signal.
Optionally, the feature extraction unit is further configured to obtain at least two preset groups of different time-frequency conversion parameters before inputting at least two spectrums of the noise-containing voice signal into corresponding feature extraction networks in the at least two feature extraction networks respectively to obtain at least two features of the noise-containing voice signal; and respectively carrying out short-time Fourier transform on the noise-containing voice signals based on at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signals.
Optionally, the fusion unit is further configured to perform weighted stitching or weighted addition on at least two features, where a weight corresponding to each of the at least two features is preset.
Optionally, the set of time-frequency conversion parameters includes: at least one of window length, window shift, window function, and fast fourier transform length.
According to a fourth aspect of embodiments of the present disclosure, there is provided a speech enhancement apparatus, comprising: a signal acquisition unit configured to acquire a noise-containing speech signal to be processed; the characteristic extraction unit is configured to input at least two frequency spectrums of the noise-containing voice signal to be processed into corresponding characteristic extraction networks in the voice enhancement model respectively to obtain at least two characteristics of the noise-containing voice signal to be processed, wherein the at least two frequency spectrums are obtained based on at least two different time-frequency conversion parameters; the fusion unit is configured to fuse at least two features to obtain fused features; the enhancement spectrum acquisition unit is configured to input the fused characteristics into a voice enhancement network in the voice enhancement model to obtain an enhancement spectrum of the noise-containing voice signal to be processed; the enhanced voice signal acquisition unit is configured to acquire a time domain signal corresponding to the enhanced frequency spectrum and take the time domain signal as an enhanced voice signal of the noise-containing voice signal to be processed.
Optionally, the output of the speech enhancement network is a mask of the noise-containing speech signal to be processed, where the mask represents a spectrum ratio of a clean speech signal in the noise-containing speech signal to be processed, and the enhancement spectrum acquisition unit is further configured to multiply the spectrum of the noise-containing speech signal to be processed with a corresponding mask to obtain an enhancement spectrum of the noise-containing speech signal to be processed, where the spectrum of the noise-containing speech signal to be processed is acquired based on a preset set of time-frequency conversion parameters.
Optionally, the output of the speech enhancement network is an enhanced spectrum of the noisy speech signal to be processed.
Optionally, the feature extraction unit is further configured to obtain at least two preset groups of different time-frequency conversion parameters before at least two spectrums of the noise-containing voice signal to be processed are respectively input into corresponding feature extraction networks in the voice enhancement model to obtain at least two features of the noise-containing voice signal to be processed; and performing short-time Fourier transform on the noise-containing voice signal to be processed based on at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signal to be processed.
Optionally, the fusion unit is further configured to perform weighted stitching or weighted addition on at least two features, where a weight corresponding to each of the at least two features is preset.
Optionally, the set of time-frequency conversion parameters includes at least one of window length, window shift, window function, and fast fourier transform length.
Optionally, the speech enhancement model is trained based on the training method of the speech enhancement model.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute instructions to implement the training method and the speech enhancement method of the speech enhancement model according to the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the training method of the speech enhancement model and the speech enhancement method according to the present disclosure.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a training method and a speech enhancement method for a speech enhancement model according to the present disclosure.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
According to the training method and apparatus for the speech enhancement model and the speech enhancement method and apparatus of the disclosure, several sets of time-frequency conversion parameters are preset during training, and features are extracted from the spectra obtained with these parameters, so that multi-scale features of the noisy speech signal can be extracted; the multi-scale features are fused, and the fused feature is used to train the speech enhancement model. Because the extracted multi-scale features contain different kinds of information about the noisy speech signal, i.e., the characteristics of all signals within it, a better training result can be obtained from them and the overall effect of the speech enhancement model is improved. The disclosure therefore solves the problem in the related art that spectral features extracted with a single fixed set of time-frequency conversion parameters, chosen according to the overall effect, cannot adequately cover the characteristics of all signals in the noisy speech signal.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram illustrating an implementation scenario of a training method of a speech enhancement model according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating a method of training a speech enhancement model according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating a method of speech enhancement according to an exemplary embodiment;
FIG. 4 is a schematic diagram of a speech enhancement system according to an exemplary embodiment;
FIG. 5 is a block diagram of a training apparatus for a speech enhancement model, according to an example embodiment;
FIG. 6 is a block diagram of a speech enhancement apparatus according to an exemplary embodiment;
fig. 7 is a block diagram of an electronic device 700 according to an embodiment of the disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The embodiments described in the examples below are not representative of all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: "any one of the items", "any combination of the items", and "all of the items". For example, "including at least one of A and B" covers three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "at least one of step one and step two is executed" covers three parallel cases: (1) executing step one; (2) executing step two; (3) executing step one and step two.
The present disclosure provides a training method for a speech enhancement model and a speech enhancement method, which can achieve a better training result and improve the overall effect of the speech enhancement model. Fig. 1 is a schematic diagram illustrating an implementation scenario of a training method of a speech enhancement model according to an exemplary embodiment of the present disclosure. As illustrated in Fig. 1, the implementation scenario includes a server 100, a user terminal 110 and a user terminal 120. The number of user terminals is not limited to two, and the terminals include, but are not limited to, devices such as mobile phones and personal computers; a user terminal may be equipped with a microphone for capturing sound. The server may be a single server, a server cluster composed of multiple servers, a cloud computing platform, or a virtualization center.
After receiving a request to train the speech enhancement model (which includes at least two feature extraction networks and a speech enhancement network) sent by the user terminals 110 and 120, the server 100 may collect clean speech signals and noise signals received in the past, mix the clean speech signals with the noise signals in a preset manner, and add reverberation to obtain noisy speech signals. Each noisy speech signal and its corresponding clean speech signal form one training sample for the speech enhancement model; a plurality of training samples obtained in this way are combined into a training sample set. After obtaining the training sample set, the server 100 inputs at least two spectra of each noisy speech signal into the corresponding networks among the at least two feature extraction networks to obtain at least two features of the noisy speech signal, where the at least two spectra are obtained based on at least two preset sets of different time-frequency conversion parameters. The server then fuses the at least two features, inputs the fused feature into the speech enhancement network to obtain an estimated enhanced spectrum of the noisy speech signal, determines the target loss function of the speech enhancement model based on the estimated time-domain signal corresponding to the estimated enhanced spectrum and the corresponding clean speech signal, and adjusts the parameters of the at least two feature extraction networks and the speech enhancement network according to the target loss function to train the speech enhancement model.
After the speech enhancement model has been trained, the user terminals 110 and 120 capture a noisy speech signal through their microphones (for example, the speech of a speaker in a conference) and send it to the server 100. After receiving the noisy speech signal, the server 100 inputs at least two spectra of the noisy speech signal into the corresponding feature extraction networks in the speech enhancement model to obtain at least two features of the noisy speech signal to be processed, where the at least two spectra are obtained based on at least two preset sets of different time-frequency conversion parameters. The at least two features are then fused, and the fused feature is input into the speech enhancement network of the speech enhancement model to obtain an enhanced spectrum of the noisy speech signal to be processed. Finally, the time-domain signal corresponding to the enhanced spectrum is obtained; this time-domain signal is the enhanced version of the noisy speech signal received by the user terminals 110 and 120, i.e., the conference speaker's speech with noise and reverberation removed.
Hereinafter, a training method and apparatus of a speech enhancement model, a speech enhancement method and apparatus according to exemplary embodiments of the present disclosure will be described in detail with reference to fig. 2 to 6.
FIG. 2 is a flow chart illustrating a method of training a speech enhancement model, according to an exemplary embodiment, where the speech enhancement model includes at least two feature extraction networks and a speech enhancement network, as in the training method of FIG. 2, the method of training a speech enhancement model comprising the steps of:
In step S201, a training sample set is obtained, where each training sample in the training sample set includes a noisy speech signal and a corresponding clean speech signal, and the noisy speech signal is a speech signal obtained by adding noise and reverberation to the corresponding clean speech signal. The noise-containing voice signals and the corresponding clean voice signals in the training sample set may be single-channel noise-containing voice signals and corresponding clean voice signals, or may be multi-channel noise-containing voice signals and corresponding clean voice signals, or may be noise-containing voice signals and corresponding clean voice signals required by a time domain noise reduction system performing framing operation, which is not limited in this disclosure.
Specifically, when the noisy speech signals and the corresponding clean speech signals in the training sample set are single-channel signals, the noisy speech signals may be generated by data augmentation. That is, for the clean speech and noise signals, the frequency response of the hardware device is first simulated with various EQ filters, the ambient reverberation is then simulated with various room impulse responses, and finally the simulated speech and noise signals are mixed at different signal-to-noise ratios to generate the noisy speech signals. Each noisy speech signal used during training therefore corresponds to a clean speech signal.
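As an illustration of this mixing step, the following is a minimal sketch assuming mono signals loaded as NumPy arrays; the function name, the use of a single room impulse response, and the omission of the EQ-filtering step are simplifications rather than details from the disclosure:

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_noisy_speech(clean, noise, rir, snr_db):
    """Build one (noisy, clean) training pair: add reverberation via a room
    impulse response, then mix in noise at the requested signal-to-noise ratio."""
    reverberant = fftconvolve(clean, rir)[: len(clean)]   # simulate ambient reverberation
    noise = noise[: len(reverberant)]                     # align lengths
    speech_power = np.mean(reverberant ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    noisy = reverberant + scale * noise                   # noisy speech signal
    return noisy, clean                                   # training sample (noisy, clean)
```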
Returning to Fig. 2, in step S202, at least two spectra of the noisy speech signal are respectively input into the corresponding networks among the at least two feature extraction networks to obtain at least two features of the noisy speech signal, where the at least two spectra are obtained based on at least two preset sets of different time-frequency conversion parameters. Specifically, this step may be implemented by performing time-frequency conversion through multi-scale time-frequency analysis (STFT_1, STFT_2, …, STFT_M) and then feeding each converted spectrum into the corresponding feature extraction network (FeaNet_1, FeaNet_2, …, FeaNet_M), where M is the number of scales, i.e., the number of sets of time-frequency conversion parameters. M = 1 corresponds to a speech enhancement scheme with fixed time-frequency resolution, so in the present disclosure M must satisfy M ≥ 2. A corresponding feature extraction network (FeaNet_1, FeaNet_2, …, FeaNet_M) is provided for the output of each STFT analysis. Since the FFT lengths of the different STFT_m differ, the input dimensions of the FeaNet_m also differ. Each FeaNet_m may adopt a different network structure; a typical FeaNet_m may consist of two Conv2d convolutional layers that extract the structural information of the spectrum, followed by a fully connected layer that maps the features to the required dimension.
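The following is a minimal PyTorch sketch of one such feature extraction network under the typical structure named above (two Conv2d layers plus a fully connected mapping); the channel counts, kernel sizes, and output dimension are assumptions, not values given in the disclosure:

```python
import torch
import torch.nn as nn

class FeaNet(nn.Module):
    """Feature extraction network FeaNet_m: two Conv2d layers over the magnitude
    spectrogram, then a fully connected layer mapping to a common feature dimension."""
    def __init__(self, num_freq_bins: int, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        # num_freq_bins depends on the FFT length of the matching STFT_m,
        # which is why the input dimension differs across the FeaNet_m.
        self.fc = nn.Linear(16 * num_freq_bins, feature_dim)

    def forward(self, spec_mag):               # spec_mag: (batch, frames, num_freq_bins)
        x = spec_mag.unsqueeze(1)              # (batch, 1, frames, freq)
        x = self.conv(x)                       # (batch, 16, frames, freq)
        x = x.permute(0, 2, 1, 3).flatten(2)   # (batch, frames, 16 * freq)
        return self.fc(x)                      # (batch, frames, feature_dim)
```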
According to an exemplary embodiment of the present disclosure, before inputting at least two spectrums of a noise-containing speech signal to corresponding feature extraction networks of the at least two feature extraction networks, respectively, obtaining at least two features of the noise-containing speech signal, the method further includes: acquiring at least two preset groups of different time-frequency conversion parameters; and respectively carrying out short-time Fourier transform on the noise-containing voice signals based on at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signals. According to the embodiment, short-time Fourier transform can be performed based on different time-frequency conversion parameters, so that a plurality of frequency spectrums of the noise-containing voice signals are obtained to cover all signals in the noise-containing voice signals.
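A minimal sketch of this multi-scale short-time Fourier transform step, assuming torch.stft on a mono waveform and a list of parameter dictionaries (the dictionary keys are hypothetical):

```python
import torch

def multi_scale_spectra(waveform, configs):
    """Compute one magnitude spectrogram per set of time-frequency conversion parameters."""
    spectra = []
    for cfg in configs:                                  # each cfg describes one STFT_m
        window = cfg["window_fn"](cfg["win_length"])
        spec = torch.stft(waveform,
                          n_fft=cfg["n_fft"],
                          hop_length=cfg["hop_length"],
                          win_length=cfg["win_length"],
                          window=window,
                          return_complex=True)
        spectra.append(spec.abs().transpose(-1, -2))     # (frames, freq_bins)
    return spectra
```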
According to an exemplary embodiment of the present disclosure, a set of time-frequency conversion parameters includes: at least one of window length, window shift, window function, and fast fourier transform length.
Specifically, for each STFT_m (m = 1, 2, …, M), the corresponding time-frequency conversion parameters may be chosen according to the actual scene. For example, for a speech enhancement system with 16 kHz input, a typical configuration is shown in Table 1 below. This configuration corresponds to multi-scale analysis with M = 3, and the window shift is uniformly set to 160 samples so that the multi-scale features stay aligned.
TABLE 1 time-frequency conversion parameter settings for 16KHz speech enhancement systems
         Window length   Window shift   Window function   FFT length
STFT_1   320             160            Hamming           320
STFT_2   512             160            Hamming           512
STFT_3   768             160            Kaiser            1024
STFT_0   512             160            Hamming           512
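Expressed as parameter sets for the multi-scale STFT sketch above, the Table 1 configuration might look as follows (the Kaiser window beta is an assumption, since the disclosure does not specify it):

```python
import torch

stft_configs = [
    {"win_length": 320, "hop_length": 160, "n_fft": 320,
     "window_fn": torch.hamming_window},                          # STFT_1
    {"win_length": 512, "hop_length": 160, "n_fft": 512,
     "window_fn": torch.hamming_window},                          # STFT_2
    {"win_length": 768, "hop_length": 160, "n_fft": 1024,
     "window_fn": lambda n: torch.kaiser_window(n, beta=8.0)},    # STFT_3
]
# The shared hop of 160 samples (10 ms at 16 kHz) keeps the frames of all
# scales aligned, which is why the window shift is uniform in Table 1.
```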
Returning to Fig. 2, in step S203, the at least two features are fused to obtain a fused feature. For example, during fusion, different fusion weights {α_1, α_2, …, α_M} can be set for the features of different scales (i.e., the at least two features) according to the requirements of the actual scene. The fusion may use concatenation fusion or additive fusion, but is not limited thereto; any fusion method applicable to the present disclosure may be used. It should be noted that this step may be implemented by a feature fusion module whose input is the multi-scale features {F_1, F_2, …, F_M} (i.e., the at least two features described above) and whose output is the fused feature F_all.
According to an exemplary embodiment of the present disclosure, fusing at least two features to obtain a fused feature includes: and carrying out weighted splicing or weighted addition on the at least two features, wherein the weight corresponding to each of the at least two features is preset. Through the embodiment, the fusion can be conveniently and rapidly carried out.
For example, when the dimensions of the multi-scale features (i.e., the at least two features described above) are not identical, concatenation fusion may be employed:
F_all = concat(α_1 F_1, α_2 F_2, …, α_M F_M)
where α_m (m = 1, 2, …, M) takes a value in the interval [0, 1] and serves as the weight of the corresponding feature, and M is a positive integer.
As another example, when the dimensions of the multi-scale features (i.e., the at least two features described above) are identical, additive fusion may be employed:
F_all = α_1 F_1 + α_2 F_2 + … + α_M F_M
where α_m (m = 1, 2, …, M) takes a value in the interval [0, 1] and serves as the weight of the corresponding feature, and M is a positive integer.
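A minimal sketch of the two fusion options, assuming the per-scale features are PyTorch tensors of shape (batch, frames, dim); the equal default weights are an added assumption:

```python
import torch

def fuse_features(features, weights=None, mode="concat"):
    """Weighted concatenation or weighted addition of the multi-scale features F_1..F_M."""
    if weights is None:
        weights = [1.0 / len(features)] * len(features)   # assumed default: equal weights
    weighted = [w * f for w, f in zip(weights, features)]
    if mode == "concat":                                  # works even if feature dimensions differ
        return torch.cat(weighted, dim=-1)
    return torch.stack(weighted, dim=0).sum(dim=0)        # addition requires identical dimensions
```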
In step S204, the fused feature is input into the speech enhancement network to obtain the estimated enhanced spectrum of the noisy speech signal. For example, the input of the speech enhancement network is the multi-scale fused feature F_all, and its output is either the estimated enhanced spectrum of the clean speech signal or a mask of the clean speech signal. When a mask is output, an additional multiplication module is required, which multiplies the mask by the spectrum of the noisy speech signal to obtain the estimated enhanced spectrum of the clean speech signal. The present disclosure does not limit the structure of the speech enhancement network; to keep the system complexity low in practical scenarios, a typical speech enhancement network structure may be a two-layer RNN followed by one fully connected layer.
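A minimal PyTorch sketch of such a speech enhancement network (the choice of GRU cells, the hidden size, and the sigmoid activation that keeps the mask in [0, 1] are assumptions; the disclosure only specifies two recurrent layers plus one fully connected layer):

```python
import torch
import torch.nn as nn

class EnhNet(nn.Module):
    """Speech enhancement network: two recurrent layers plus one fully connected
    layer that predicts a per-bin mask for the STFT_0 spectrum."""
    def __init__(self, input_dim: int, num_freq_bins: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_freq_bins)

    def forward(self, fused):                # fused: (batch, frames, input_dim)
        h, _ = self.rnn(fused)
        return torch.sigmoid(self.fc(h))     # mask in [0, 1], shape (batch, frames, num_freq_bins)
```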
According to an exemplary embodiment of the present disclosure, an output of a speech enhancement network is a mask of a noisy speech signal, where the mask represents a spectrum ratio of a clean speech signal in the noisy speech signal, and the fused features are input to the speech enhancement network to obtain an estimated enhancement spectrum of the noisy speech signal, including: multiplying the spectrum of the noisy speech signal with the mask of the noisy speech signal to obtain a predicted enhanced spectrum of the noisy speech signal, wherein the spectrum of the noisy speech signal is obtained based on a preset set of time-frequency conversion parameters. It should be noted that the preset set of time-frequency conversion parameters may be one of the preset at least two different sets of time-frequency conversion parameters, or may be a set of time-frequency conversion parameters preset separately. By separating the portion of the spectrum multiplied by the mask from the speech enhancement network, the complexity of the speech enhancement network may be reduced.
According to an exemplary embodiment of the present disclosure, the output of the speech enhancement network is a predicted enhanced spectrum of the noisy speech signal. By the embodiment, the estimated enhancement spectrum can be conveniently and rapidly acquired.
In step S205, the target loss function of the speech enhancement model is determined based on the estimated time-domain signal corresponding to the estimated enhanced spectrum and the corresponding clean speech signal. The present disclosure does not limit the target loss function; a conventional time-domain or frequency-domain loss function may be used, for example the spectral mean squared error (MSE), the mean absolute error (MAE) of the log energy spectrum, the time-domain MSE, and the like.
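Minimal sketches of the three example losses mentioned above, assuming complex spectra and time-domain waveforms as PyTorch tensors (the epsilon for numerical stability is an added assumption):

```python
import torch

def spectral_mse(est_spec, clean_spec):
    """Mean squared error between magnitude spectra."""
    return torch.mean((est_spec.abs() - clean_spec.abs()) ** 2)

def log_energy_mae(est_spec, clean_spec, eps=1e-8):
    """Mean absolute error between log energy spectra."""
    return torch.mean(torch.abs(torch.log(est_spec.abs() ** 2 + eps)
                                - torch.log(clean_spec.abs() ** 2 + eps)))

def time_domain_mse(est_wave, clean_wave):
    """Mean squared error between time-domain waveforms."""
    return torch.mean((est_wave - clean_wave) ** 2)
```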
In step S206, the parameters of the at least two feature extraction networks and the speech enhancement network are adjusted according to the target loss function to train the speech enhancement model. During training, a noisy speech signal is input and passes through the multi-scale feature extraction, feature fusion, and speech enhancement network described above, finally yielding an enhanced speech signal (i.e., the estimated time-domain signal). The value of the target loss function is computed from the enhanced speech signal and the corresponding clean speech signal, and with minimizing the target loss function as the objective, the parameters of the at least two feature extraction networks and the speech enhancement network are updated until the speech enhancement model converges.
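One training iteration might then look like the following sketch (the model is assumed to be a wrapper that runs the whole pipeline of FIG. 4 and returns the estimated time-domain signal; the optimizer and the length trimming are assumptions):

```python
import torch

def training_step(model, optimizer, noisy, clean, loss_fn):
    """One parameter update over a single (noisy, clean) pair or batch."""
    optimizer.zero_grad()
    est_wave = model(noisy)          # full pipeline: STFTs -> FeaNets -> fusion -> EnhNet -> ISTFT
    loss = loss_fn(est_wave, clean[..., : est_wave.shape[-1]])
    loss.backward()                  # gradients reach all FeaNet_m and EnhNet parameters
    optimizer.step()
    return loss.item()
```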
FIG. 3 is a flow chart illustrating a method of speech enhancement according to an exemplary embodiment, the speech enhancement model shown in FIG. 3 being trained based on the training method of any of the speech enhancement models described above, the method of speech enhancement comprising the steps of:
in step S301, a noise-containing speech signal to be processed is acquired. The noise-containing voice signal to be processed may be a voice signal received by a microphone in the terminal, or any other voice signal to be processed.
In step S302, at least two spectra of the noisy speech signal to be processed are respectively input into the corresponding feature extraction networks in the speech enhancement model to obtain at least two features of the noisy speech signal to be processed, where the at least two spectra are obtained based on at least two preset sets of different time-frequency conversion parameters. Specifically, this step may be implemented by performing time-frequency conversion through multi-scale time-frequency analysis (STFT_1, STFT_2, …, STFT_M) and then feeding each converted spectrum into the corresponding feature extraction network (FeaNet_1, FeaNet_2, …, FeaNet_M), where M is the number of scales, i.e., the number of sets of time-frequency conversion parameters. M = 1 corresponds to a speech enhancement scheme with fixed time-frequency resolution, so in the present disclosure M must satisfy M ≥ 2. A corresponding feature extraction network (FeaNet_1, FeaNet_2, …, FeaNet_M) is provided for the output of each STFT analysis. Since the FFT lengths of the different STFT_m differ, the input dimensions of the FeaNet_m also differ. Each FeaNet_m may adopt a different network structure; a typical FeaNet_m may consist of two Conv2d convolutional layers that extract the structural information of the spectrum, followed by a fully connected layer that maps the features to the required dimension.
According to an exemplary embodiment of the present disclosure, before inputting at least two spectrums of a noise-containing speech signal to be processed into corresponding feature extraction networks in a speech enhancement model, respectively, obtaining at least two features of the noise-containing speech signal to be processed, the method further includes: acquiring at least two preset groups of different time-frequency conversion parameters; and performing short-time Fourier transform on the noise-containing voice signal to be processed based on at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signal to be processed. According to the embodiment, short-time Fourier transform can be performed based on different time-frequency conversion parameters, so that a plurality of frequency spectrums of the noise-containing voice signals are obtained to cover all signals in the noise-containing voice signals.
According to an exemplary embodiment of the present disclosure, the set of time-frequency conversion parameters includes at least one of a window length, a window shift, a window function, and a fast fourier transform length.
Specifically, for each STFT_m (m = 1, 2, …, M), the corresponding time-frequency conversion parameters may be chosen according to the actual scene. For example, for a speech enhancement system with 16 kHz input, a typical configuration is shown in Table 1 below. This configuration corresponds to multi-scale analysis with M = 3, and the window shift is uniformly set to 160 samples so that the multi-scale features stay aligned.
TABLE 1 time-frequency conversion parameter settings for 16KHz speech enhancement systems
         Window length   Window shift   Window function   FFT length
STFT_1   320             160            Hamming           320
STFT_2   512             160            Hamming           512
STFT_3   768             160            Kaiser            1024
STFT_0   512             160            Hamming           512
In step S303, the at least two features are fused to obtain a fused feature. For example, during fusion, different fusion weights {α_1, α_2, …, α_M} can be set for the features of different scales (i.e., the at least two features) according to the requirements of the actual scene. The fusion may use concatenation fusion or additive fusion, but is not limited thereto; any fusion method applicable to the present disclosure may be used. It should be noted that this step may be implemented by a feature fusion module whose input is the multi-scale features {F_1, F_2, …, F_M} (i.e., the at least two features described above) and whose output is the fused feature F_all.
According to an exemplary embodiment of the present disclosure, fusing at least two features to obtain a fused feature includes: and carrying out weighted splicing or weighted addition on the at least two features, wherein the weight corresponding to each of the at least two features is preset. Through the embodiment, the fusion can be conveniently and rapidly carried out.
For example, when the dimensions of the multi-scale features (i.e., the at least two features described above) are not identical, concatenation fusion may be employed:
F_all = concat(α_1 F_1, α_2 F_2, …, α_M F_M)
where α_m (m = 1, 2, …, M) takes a value in the interval [0, 1] and serves as the weight of the corresponding feature, and M is a positive integer.
As another example, when the dimensions of the multi-scale features (i.e., the at least two features described above) are identical, additive fusion may be employed:
F_all = α_1 F_1 + α_2 F_2 + … + α_M F_M
where α_m (m = 1, 2, …, M) takes a value in the interval [0, 1] and serves as the weight of the corresponding feature, and M is a positive integer.
In step S304, the fused feature is input into the speech enhancement network of the speech enhancement model to obtain the enhanced spectrum of the noisy speech signal to be processed. For example, the input of the speech enhancement network is the multi-scale fused feature F_all, and its output is either the enhanced spectrum of the clean speech signal or a mask of the clean speech signal. When a mask is output, an additional multiplication module is required, which multiplies the mask by the spectrum of the noisy speech signal to obtain the enhanced spectrum of the clean speech signal. The present disclosure does not limit the structure of the speech enhancement network; to keep the system complexity low in practical scenarios, a typical speech enhancement network structure may be a two-layer RNN followed by one fully connected layer.
According to an exemplary embodiment of the present disclosure, an output of a speech enhancement network is a mask of a noise-containing speech signal to be processed, where the mask represents a spectral ratio of a clean speech signal in the noise-containing speech signal to be processed, and the fused features are input to the speech enhancement network in a speech enhancement model to obtain an enhanced spectrum of the noise-containing speech signal to be processed, and the method includes: multiplying the frequency spectrum of the noise-containing voice signal to be processed with the corresponding mask to obtain an enhanced frequency spectrum of the noise-containing voice signal to be processed, wherein the frequency spectrum of the noise-containing voice signal to be processed is acquired based on a preset set of time-frequency conversion parameters. It should be noted that the predetermined set of time-frequency conversion parameters may be different from the predetermined set of at least two different time-frequency conversion parameters. By separating the portion of the spectrum multiplied by the mask from the speech enhancement network, the complexity of the speech enhancement network may be reduced.
According to an exemplary embodiment of the present disclosure, the output of the speech enhancement network is an enhanced spectrum of the noisy speech signal to be processed. By the embodiment, the estimated enhancement spectrum can be conveniently and rapidly acquired.
In step S305, a time domain signal corresponding to the enhanced spectrum is obtained, and the time domain signal is used as an enhanced speech signal of the noise-containing speech signal to be processed.
To facilitate an understanding of the above embodiments, a complete system is described below. Fig. 4 is a schematic diagram of a speech enhancement system according to an exemplary embodiment. As shown in Fig. 4, the noisy speech signal is passed through a plurality of STFT modules (STFT_1, STFT_2, …, STFT_M) and the corresponding feature extraction networks (FeaNet_1, FeaNet_2, …, FeaNet_M) to obtain multi-scale features corresponding to the different time-frequency conversion parameters. The multi-scale features are fused by a feature fusion layer (Fusion Layer), and a mask of the noisy speech signal is then obtained through the speech enhancement network (EnhNet). In addition, STFT_0 and ISTFT_0 are the time-frequency conversions associated with the speech enhancement operation itself: the noisy speech is transformed by STFT_0 into the noisy spectrum, which is multiplied by the mask estimated by the speech enhancement network to obtain the enhanced speech spectrum, and the final time-domain enhanced speech, i.e., the final enhanced speech signal, is obtained through ISTFT_0. Note that the time-frequency conversion parameters corresponding to STFT_0 may be preset.
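Putting the sketches above together, an illustrative inference pass mirroring FIG. 4 might look as follows (feanets, enhnet, stft_configs, and stft0_cfg are instances of the hypothetical components sketched earlier, not structures defined by the disclosure):

```python
import torch

def enhance(noisy, feanets, enhnet, stft_configs, stft0_cfg, fusion_weights=None):
    """Illustrative inference pipeline mirroring FIG. 4, reusing the sketches above."""
    spectra = multi_scale_spectra(noisy, stft_configs)              # STFT_1 .. STFT_M
    features = [net(spec.unsqueeze(0))                              # FeaNet_1 .. FeaNet_M
                for net, spec in zip(feanets, spectra)]
    fused = fuse_features(features, fusion_weights, mode="concat")  # Fusion Layer
    mask = enhnet(fused)                                            # EnhNet -> Mask, (1, frames, bins)

    window = stft0_cfg["window_fn"](stft0_cfg["win_length"])
    spec0 = torch.stft(noisy, n_fft=stft0_cfg["n_fft"],
                       hop_length=stft0_cfg["hop_length"],
                       win_length=stft0_cfg["win_length"],
                       window=window, return_complex=True)          # STFT_0 -> noisy spectrum
    enhanced_spec = spec0 * mask.squeeze(0).transpose(0, 1)         # apply mask per time-frequency bin
    return torch.istft(enhanced_spec, n_fft=stft0_cfg["n_fft"],
                       hop_length=stft0_cfg["hop_length"],
                       win_length=stft0_cfg["win_length"],
                       window=window)                               # ISTFT_0 -> enhanced waveform
```

The shared hop length across all STFT configurations keeps the mask's frame axis aligned with the STFT_0 spectrum, which is what makes the per-bin multiplication above well defined.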
In summary, the present disclosure overcomes performance limitations caused by the adoption of fixed time-to-frequency conversion parameters in the speech noise reduction scheme in the related art, and proposes a real-time speech noise reduction scheme based on multi-scale features. The method comprises the steps of obtaining frequency spectrums corresponding to various time-frequency conversion parameters by a time-frequency analysis method for different time-frequency conversion parameters of an input noise-containing voice signal, extracting individual characteristics of each frequency spectrum, fusing the extracted characteristics, and estimating a clean voice signal by a neural network based on the fused characteristics. According to the method and the device, the characteristics of multiple frequency spectrums of the noise-containing voice signals are extracted, so that information of different types of voices and noise can be effectively extracted, and the overall effect of the model is improved.
FIG. 5 is a block diagram illustrating a training apparatus for a speech enhancement model according to an exemplary embodiment. Wherein the speech enhancement model comprises at least two feature extraction networks and a speech enhancement network, referring to fig. 5, the apparatus comprises a training sample set acquisition unit 50, a feature extraction unit 52, a fusion unit 54, an estimated enhancement spectrum acquisition unit 56, a target loss function determination unit 58 and a training unit 510.
A training sample set obtaining unit 50 configured to obtain a training sample set, wherein each training sample in the training sample set includes a noise-containing speech signal and a corresponding clean speech signal, the noise-containing speech signal being a speech signal of the corresponding clean speech signal to which noise and reverberation are added; the feature extraction unit 52 is configured to input at least two spectrums of the noise-containing voice signal into corresponding feature extraction networks in the at least two feature extraction networks respectively, so as to obtain at least two features of the noise-containing voice signal, wherein the at least two spectrums are obtained based on at least two preset different time-frequency conversion parameters; a fusion unit 54, configured to perform fusion processing on at least two features, so as to obtain fused features; the estimated enhancement spectrum acquisition unit 56 is configured to input the fused features into a voice enhancement network to obtain an estimated enhancement spectrum of the noise-containing voice signal; a target loss function determining unit 58 configured to determine a target loss function of the speech enhancement model based on the estimated time domain signal corresponding to the estimated enhancement spectrum and the corresponding clean speech signal; the training unit 510 is configured to adjust parameters of the at least two feature extraction networks and the speech enhancement network according to the objective loss function, and train the speech enhancement model.
According to an embodiment of the present disclosure, the output of the speech enhancement network is a mask of the noisy speech signal, where the mask represents a spectral ratio of the clean speech signal in the noisy speech signal, and the estimated enhancement spectrum obtaining unit 56 is further configured to multiply the spectrum of the noisy speech signal with the mask of the noisy speech signal to obtain an estimated enhancement spectrum of the noisy speech signal, where the spectrum of the noisy speech signal is obtained based on a preset set of time-frequency conversion parameters.
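One common instantiation of such a spectral-ratio mask, given here only as an assumed example and not necessarily the definition adopted by the disclosure, is the magnitude ratio between the clean and noise-containing spectra obtained with STFT_0:

```latex
M(t,f) = \frac{\lvert S(t,f)\rvert}{\lvert Y(t,f)\rvert}, \qquad
\hat{S}(t,f) = M(t,f)\, Y(t,f),
```

where Y denotes the spectrum of the noise-containing speech signal, S the spectrum of the clean speech signal, and Ŝ the estimated enhancement spectrum obtained by the multiplication described above.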
According to embodiments of the present disclosure, the output of the speech enhancement network is a predicted enhanced spectrum of the noisy speech signal.
According to an embodiment of the present disclosure, the feature extraction unit 52 is further configured to obtain at least two preset sets of different time-frequency conversion parameters before inputting at least two spectrums of the noise-containing speech signal to corresponding feature extraction networks of the at least two feature extraction networks, respectively, to obtain at least two features of the noise-containing speech signal; and respectively carrying out short-time Fourier transform on the noise-containing voice signals based on at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signals.
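A minimal sketch of this step is given below; the two parameter sets at the end are illustrative values chosen for the example, not parameters taken from the disclosure.

```python
import torch

def multi_scale_spectra(noisy_wave, param_sets):
    """Compute one STFT magnitude spectrum per set of time-frequency
    conversion parameters (window length, window shift, window function, FFT length)."""
    spectra = []
    for p in param_sets:
        window = (torch.hann_window(p["win_length"]) if p["window"] == "hann"
                  else torch.hamming_window(p["win_length"]))
        spec = torch.stft(noisy_wave, n_fft=p["n_fft"], hop_length=p["hop_length"],
                          win_length=p["win_length"], window=window, return_complex=True)
        spectra.append(spec.abs())   # magnitude spectrum fed to the feature extraction network
    return spectra

# two illustrative parameter sets (assumed values, not the disclosure's)
param_sets = [
    {"n_fft": 256,  "win_length": 256,  "hop_length": 128, "window": "hann"},
    {"n_fft": 1024, "win_length": 1024, "hop_length": 256, "window": "hamming"},
]
```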
According to an embodiment of the present disclosure, the fusion unit 54 is further configured to perform weighted concatenation or weighted addition on the at least two features, wherein the weight corresponding to each of the at least two features is preset.
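A possible reading of this fusion step is sketched below, assuming the branch features have already been aligned to a common number of frames; the dimension check that switches between concatenation and addition follows the behaviour recited in the claims, while the weights are simply preset scalars.

```python
import torch

def fuse_features(features, weights):
    """Weighted concatenation when feature dimensions differ, weighted addition
    when they match. `features` are (batch, frames, dim) tensors with a shared
    frame count; `weights` is a preset scalar per feature."""
    weighted = [w * f for w, f in zip(weights, features)]
    dims = {f.shape[-1] for f in features}
    if len(dims) > 1:                                   # inconsistent dimensions -> concatenate
        return torch.cat(weighted, dim=-1)
    return torch.stack(weighted, dim=0).sum(dim=0)      # consistent dimensions -> add
```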
According to an embodiment of the present disclosure, a set of time-frequency conversion parameters includes at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
Fig. 6 is a block diagram illustrating a speech enhancement apparatus according to an exemplary embodiment. Wherein the speech enhancement model comprises at least two feature extraction networks and a speech enhancement network, referring to fig. 6, the apparatus comprises a signal acquisition unit 60, a feature extraction unit 62, a fusion unit 64, an enhanced spectrum acquisition unit 66 and an enhanced speech signal acquisition unit 68.
A signal acquisition unit 60 configured to acquire a noise-containing speech signal to be processed; the feature extraction unit 62 is configured to input at least two frequency spectrums of the noise-containing voice signal to be processed into corresponding feature extraction networks in the voice enhancement model respectively, so as to obtain at least two features of the noise-containing voice signal to be processed, wherein the at least two frequency spectrums are acquired based on at least two different preset time-frequency conversion parameters; a fusion unit 64 configured to perform fusion processing on at least two features to obtain fused features; an enhanced spectrum acquisition unit 66 configured to input the fused features to a speech enhancement network in a speech enhancement model, to obtain an enhanced spectrum of the noise-containing speech signal to be processed; the enhanced speech signal obtaining unit 68 is configured to obtain a time domain signal corresponding to the enhanced spectrum, and take the time domain signal as an enhanced speech signal of the noise-containing speech signal to be processed.
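For completeness, a minimal inference pass mirroring this flow, reusing the MultiScaleEnhancer sketch from above, might look as follows; the parameter values and the random waveform are placeholders for the example.

```python
import torch

# MultiScaleEnhancer and the parameter values are the illustrative sketches from
# above, not the disclosure's actual modules or settings
model = MultiScaleEnhancer(
    branch_params=[{"n_fft": 256, "hop": 128}, {"n_fft": 1024, "hop": 256}],
    stft0_params={"n_fft": 512, "hop": 256},
)
model.eval()

noisy_wave = torch.randn(1, 16000)           # stand-in for one second of 16 kHz audio
with torch.no_grad():
    enhanced_wave, enhanced_spec = model(noisy_wave)   # time domain enhanced speech and its spectrum
```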
According to an embodiment of the disclosure, the output of the speech enhancement network is a mask of the noise-containing speech signal to be processed, wherein the mask represents a spectrum ratio of a clean speech signal in the noise-containing speech signal to be processed, and the enhancement spectrum acquisition unit is further configured to multiply the spectrum of the noise-containing speech signal to be processed with a corresponding mask to obtain an enhancement spectrum of the noise-containing speech signal to be processed, wherein the spectrum of the noise-containing speech signal to be processed is acquired based on a preset set of time-frequency conversion parameters.
According to embodiments of the present disclosure, the output of the speech enhancement network is an enhanced spectrum of the noisy speech signal to be processed.
According to an embodiment of the present disclosure, the feature extraction unit 62 is further configured to obtain at least two preset sets of different time-frequency conversion parameters before inputting at least two spectrums of the noise-containing speech signal to be processed into corresponding feature extraction networks in the speech enhancement model, respectively, to obtain at least two features of the noise-containing speech signal to be processed; and performing short-time Fourier transform on the noise-containing voice signal to be processed based on at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signal to be processed.
According to an embodiment of the present disclosure, the fusion unit 64 is further configured to perform weighted concatenation or weighted addition on the at least two features, wherein the weight corresponding to each of the at least two features is preset.
According to an embodiment of the present disclosure, the set of time-frequency conversion parameters includes at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
According to an embodiment of the present disclosure, the speech enhancement model is trained based on the above-described training method of the speech enhancement model.
According to embodiments of the present disclosure, an electronic device may be provided. Fig. 7 is a block diagram of an electronic device 700 according to an embodiment of the present disclosure. The electronic device 700 includes at least one memory 701, in which a set of computer-executable instructions is stored, and at least one processor 702; when the set of instructions is executed by the at least one processor 702, the training method of the speech enhancement model and the speech enhancement method according to embodiments of the present disclosure are performed.
By way of example, the electronic device 700 may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or any other device capable of executing the above-described set of instructions. Here, the electronic device 700 is not necessarily a single electronic device, but may be any apparatus or collection of circuits capable of executing the above-described instructions (or instruction sets) individually or in combination. The electronic device 700 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In electronic device 700, processor 702 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, the processor 702 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor 702 may execute instructions or code stored in the memory, wherein the memory 701 may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory 701 may be integrated with the processor 702, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, the memory 701 may include a separate device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 701 and the processor 702 may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., such that the processor 702 is able to read files stored in the memory 701.
In addition, the electronic device 700 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein the instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the training method of the speech enhancement model and the speech enhancement method of the embodiments of the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drives (HDD), solid-state drives (SSD), card memory (such as multimedia cards, Secure Digital (SD) cards, or eXtreme Digital (xD) cards), magnetic tape, floppy disks, magneto-optical data storage devices, hard disks, solid state disks, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the program. The computer program in the computer-readable storage medium described above can be run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server; further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems so that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a training method and a speech enhancement method for a speech enhancement model of an embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (28)

1. A method of training a speech enhancement model, the speech enhancement model comprising at least two feature extraction networks and a speech enhancement network, the method comprising:
Obtaining a training sample set, wherein each training sample in the training sample set comprises a noise-containing voice signal and a corresponding clean voice signal, and the noise-containing voice signal is a voice signal of the corresponding clean voice signal after noise and reverberation are added;
respectively inputting at least two frequency spectrums of a noise-containing voice signal into corresponding feature extraction networks in the at least two feature extraction networks to obtain at least two features of the noise-containing voice signal, wherein the at least two frequency spectrums are obtained based on at least two preset groups of different time-frequency conversion parameters;
performing fusion processing on the at least two features to obtain fused features;
inputting the fused characteristics into the voice enhancement network to obtain the estimated enhancement frequency spectrum of the noise-containing voice signal;
determining a target loss function of the voice enhancement model based on the estimated time domain signal corresponding to the estimated enhancement spectrum and the corresponding clean voice signal;
adjusting parameters of the at least two feature extraction networks and the voice enhancement network according to the target loss function, and training the voice enhancement model;
the fusing processing is performed on the at least two features to obtain fused features, which comprises the following steps:
When the dimensions of the at least two features are inconsistent, weighting and splicing the at least two features to obtain the fused features;
and when the dimensions of the at least two features are consistent, weighting and adding the at least two features to obtain the fused features.
2. The training method of claim 1, wherein the output of the speech enhancement network is a mask of the noisy speech signal, wherein the mask represents a spectral ratio of the clean speech signal in the noisy speech signal,
inputting the fused features into the voice enhancement network to obtain an estimated enhancement spectrum of the noise-containing voice signal, wherein the method comprises the following steps:
multiplying the spectrum of the noise-containing voice signal with the mask of the noise-containing voice signal to obtain the estimated enhancement spectrum of the noise-containing voice signal, wherein the spectrum of the noise-containing voice signal is obtained based on a preset set of time-frequency conversion parameters.
3. The training method of claim 1 wherein the output of the speech enhancement network is a predicted enhancement spectrum of the noisy speech signal.
4. The training method of claim 1, further comprising, prior to inputting at least two spectra of the noisy speech signal into corresponding ones of the at least two feature extraction networks, respectively, obtaining at least two features of the noisy speech signal:
Acquiring at least two preset groups of different time-frequency conversion parameters;
and respectively carrying out short-time Fourier transform on the noise-containing voice signals based on the at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signals.
5. The training method of claim 1, wherein the weight for each of the at least two features is preset.
6. The training method of claim 1, wherein the set of time-frequency conversion parameters comprises at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
7. A method of speech enhancement, comprising:
acquiring a noise-containing voice signal to be processed;
respectively inputting at least two frequency spectrums of the noise-containing voice signal to be processed into corresponding feature extraction networks in a voice enhancement model to obtain at least two features of the noise-containing voice signal to be processed, wherein the at least two frequency spectrums are obtained based on at least two different preset time-frequency conversion parameters;
performing fusion processing on the at least two features to obtain fused features;
inputting the fused characteristics to a voice enhancement network in the voice enhancement model to obtain an enhanced frequency spectrum of the noise-containing voice signal to be processed;
Acquiring a time domain signal corresponding to the enhanced frequency spectrum, and taking the time domain signal as an enhanced voice signal of the noise-containing voice signal to be processed;
wherein the performing fusion processing on the at least two features to obtain fused features comprises:
when the dimensions of the at least two features are inconsistent, performing weighted concatenation on the at least two features to obtain the fused features; and
when the dimensions of the at least two features are consistent, performing weighted addition on the at least two features to obtain the fused features.
8. The speech enhancement method of claim 7, wherein the output of the speech enhancement network is a mask of the noise-containing speech signal to be processed, wherein the mask represents a spectral ratio of the clean speech signal in the noise-containing speech signal to be processed,
the step of inputting the fused features to a voice enhancement network in the voice enhancement model to obtain an enhanced spectrum of the noise-containing voice signal to be processed, comprising:
multiplying the frequency spectrum of the noise-containing voice signal to be processed with a corresponding mask to obtain an enhanced frequency spectrum of the noise-containing voice signal to be processed, wherein the frequency spectrum of the noise-containing voice signal to be processed is acquired based on a preset set of time-frequency conversion parameters.
9. The speech enhancement method of claim 7 wherein the output of the speech enhancement network is an enhanced spectrum of the noisy speech signal to be processed.
10. The method of speech enhancement according to claim 7, further comprising, before inputting at least two spectra of the noise-containing speech signal to be processed into the corresponding feature extraction networks in the speech enhancement model, respectively, obtaining at least two features of the noise-containing speech signal to be processed:
acquiring at least two preset groups of different time-frequency conversion parameters;
and performing short-time Fourier transform on the noise-containing voice signal to be processed based on the at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signal to be processed.
11. The speech enhancement method of claim 7, wherein the weights for each of the at least two features are preset.
12. The speech enhancement method of claim 7, wherein the set of time-frequency conversion parameters comprises at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
13. The speech enhancement method according to any one of claims 7 to 12, wherein the speech enhancement model is trained based on the training method of the speech enhancement model according to any one of claims 1 to 6.
14. A training device for a speech enhancement model, the speech enhancement model comprising at least two feature extraction networks and a speech enhancement network, the training device comprising:
a training sample set acquisition unit configured to acquire a training sample set, wherein each training sample in the training sample set includes a noisy speech signal and a corresponding clean speech signal, the noisy speech signal being a speech signal of the corresponding clean speech signal to which noise and reverberation are added;
a feature extraction unit configured to input at least two spectra of the noise-containing speech signal into corresponding feature extraction networks of the at least two feature extraction networks, respectively, to obtain at least two features of the noise-containing speech signal, wherein the at least two spectra are obtained based on at least two preset sets of different time-frequency conversion parameters;
the fusion unit is configured to fuse the at least two features to obtain fused features;
the estimated enhancement spectrum acquisition unit is configured to input the fused characteristics into the voice enhancement network to obtain an estimated enhancement spectrum of the noise-containing voice signal;
A target loss function determining unit configured to determine a target loss function of the speech enhancement model based on the estimated time domain signal corresponding to the estimated enhancement spectrum and the corresponding clean speech signal;
a training unit configured to adjust parameters of the at least two feature extraction networks and the speech enhancement network according to the objective loss function, and train the speech enhancement model;
wherein the fusion unit is further configured to perform weighted concatenation on the at least two features to obtain the fused features when the dimensions of the at least two features are inconsistent, and to perform weighted addition on the at least two features to obtain the fused features when the dimensions of the at least two features are consistent.
15. The training apparatus of claim 14, wherein the output of the speech enhancement network is a mask of the noisy speech signal, wherein the mask represents a spectral ratio of the clean speech signal in the noisy speech signal,
the estimated enhancement spectrum acquisition unit is further configured to multiply the spectrum of the noise-containing voice signal with the mask of the noise-containing voice signal to obtain an estimated enhancement spectrum of the noise-containing voice signal, where the spectrum of the noise-containing voice signal is acquired based on a preset set of time-frequency conversion parameters.
16. The training apparatus of claim 14 wherein the output of the speech enhancement network is a predicted enhancement spectrum of the noisy speech signal.
17. The training device of claim 14, wherein the feature extraction unit is further configured to obtain at least two different sets of preset time-frequency conversion parameters before inputting at least two spectra of the noisy speech signal to corresponding feature extraction networks of the at least two feature extraction networks, respectively, to obtain at least two features of the noisy speech signal; and respectively carrying out short-time Fourier transform on the noise-containing voice signals based on the at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signals.
18. The training device of claim 14, wherein the weight for each of the at least two features is preset.
19. The training apparatus of claim 14, wherein the set of time-frequency conversion parameters comprises at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
20. A speech enhancement apparatus, comprising:
A signal acquisition unit configured to acquire a noise-containing speech signal to be processed;
the feature extraction unit is configured to input at least two frequency spectrums of the noise-containing voice signal to be processed into corresponding feature extraction networks in the voice enhancement model respectively to obtain at least two features of the noise-containing voice signal to be processed, wherein the at least two frequency spectrums are obtained based on at least two different time-frequency conversion parameters;
the fusion unit is configured to fuse the at least two features to obtain fused features;
the enhancement spectrum acquisition unit is configured to input the fused characteristics into a voice enhancement network in the voice enhancement model to obtain an enhancement spectrum of the noise-containing voice signal to be processed;
an enhanced speech signal obtaining unit configured to obtain a time domain signal corresponding to the enhanced spectrum, and take the time domain signal as an enhanced speech signal of the noise-containing speech signal to be processed;
wherein the fusion unit is further configured to perform weighted concatenation on the at least two features to obtain the fused features when the dimensions of the at least two features are inconsistent, and to perform weighted addition on the at least two features to obtain the fused features when the dimensions of the at least two features are consistent.
21. The speech enhancement apparatus of claim 20, wherein the output of the speech enhancement network is a mask of the noise-containing speech signal to be processed, wherein the mask represents a spectral ratio of the clean speech signal in the noise-containing speech signal to be processed,
the enhanced spectrum acquisition unit is further configured to multiply the spectrum of the noise-containing voice signal to be processed with a corresponding mask to obtain an enhanced spectrum of the noise-containing voice signal to be processed, wherein the spectrum of the noise-containing voice signal to be processed is acquired based on a preset set of time-frequency conversion parameters.
22. The speech enhancement apparatus of claim 20 wherein the output of the speech enhancement network is an enhanced spectrum of the noisy speech signal to be processed.
23. The speech enhancement apparatus of claim 20, wherein the feature extraction unit is further configured to obtain at least two different sets of preset time-frequency conversion parameters before inputting at least two spectra of the noise-containing speech signal to be processed into corresponding feature extraction networks in the speech enhancement model, respectively, to obtain at least two features of the noise-containing speech signal to be processed; and performing short-time Fourier transform on the noise-containing voice signal to be processed based on the at least two groups of different time-frequency conversion parameters to obtain at least two frequency spectrums of the noise-containing voice signal to be processed.
24. The speech enhancement apparatus of claim 20, wherein the weight for each of said at least two features is preset.
25. The speech enhancement apparatus of claim 20, wherein the set of time-frequency conversion parameters comprises at least one of a window length, a window shift, a window function, and a fast Fourier transform length.
26. The speech enhancement apparatus according to any one of claims 20 to 25, wherein the speech enhancement model is trained based on a training method of the speech enhancement model according to any one of claims 1 to 6.
27. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the training method of the speech enhancement model of any of claims 1 to 6 and/or the speech enhancement method of any of claims 7 to 13.
28. A computer-readable storage medium, characterized in that instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the training method of a speech enhancement model according to any one of claims 1 to 6 and/or the speech enhancement method according to any one of claims 7 to 13.
CN202110869479.2A 2021-07-30 2021-07-30 Training method and device of voice enhancement model, and voice enhancement method and device Active CN113555031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110869479.2A CN113555031B (en) 2021-07-30 2021-07-30 Training method and device of voice enhancement model, and voice enhancement method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110869479.2A CN113555031B (en) 2021-07-30 2021-07-30 Training method and device of voice enhancement model, and voice enhancement method and device

Publications (2)

Publication Number Publication Date
CN113555031A CN113555031A (en) 2021-10-26
CN113555031B true CN113555031B (en) 2024-02-23

Family

ID=78133309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110869479.2A Active CN113555031B (en) 2021-07-30 2021-07-30 Training method and device of voice enhancement model, and voice enhancement method and device

Country Status (1)

Country Link
CN (1) CN113555031B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114974281A (en) * 2022-05-24 2022-08-30 云知声智能科技股份有限公司 Training method and device of voice noise reduction model, storage medium and electronic device
CN117409794B (en) * 2023-12-13 2024-03-15 深圳市声菲特科技技术有限公司 Audio signal processing method, system, computer device and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853664A (en) * 2009-03-31 2010-10-06 华为技术有限公司 Signal denoising method and device and audio decoding system
JP2012181475A (en) * 2011-03-03 2012-09-20 Univ Of Tokyo Method for extracting feature of acoustic signal and method for processing acoustic signal using the feature
CN106297768A (en) * 2015-05-11 2017-01-04 苏州大学 Speech recognition method
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN111696572A (en) * 2019-03-13 2020-09-22 富士通株式会社 Speech separation apparatus, method and medium
WO2020232180A1 (en) * 2019-05-14 2020-11-19 Dolby Laboratories Licensing Corporation Method and apparatus for speech source separation based on a convolutional neural network
KR102191736B1 (en) * 2020-07-28 2020-12-16 주식회사 수퍼톤 Method and apparatus for speech enhancement with artificial neural network
CN112289333A (en) * 2020-12-25 2021-01-29 北京达佳互联信息技术有限公司 Training method and device of voice enhancement model and voice enhancement method and device
CN113113041A (en) * 2021-04-29 2021-07-13 电子科技大学 Voice separation method based on time-frequency cross-domain feature selection

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7505902B2 (en) * 2004-07-28 2009-03-17 University Of Maryland Discrimination of components of audio signals based on multiscale spectro-temporal modulations
KR100789084B1 (en) * 2006-11-21 2007-12-26 한양대학교 산학협력단 Speech enhancement method by overweighting gain with nonlinear structure in wavelet packet transform
US9536540B2 (en) * 2013-07-19 2017-01-03 Knowles Electronics, Llc Speech signal separation and synthesis based on auditory scene analysis and speech modeling

Also Published As

Publication number Publication date
CN113555031A (en) 2021-10-26

Similar Documents

Publication Publication Date Title
JP6765445B2 (en) Frequency-based audio analysis using neural networks
CN113241088B (en) Training method and device of voice enhancement model and voice enhancement method and device
CN110503971A (en) Time-frequency mask neural network based estimation and Wave beam forming for speech processes
CN110600017A (en) Training method of voice processing model, voice recognition method, system and device
CN112309426B (en) Voice processing model training method and device and voice processing method and device
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
CN113593594B (en) Training method and equipment for voice enhancement model and voice enhancement method and equipment
CN113192536B (en) Training method of voice quality detection model, voice quality detection method and device
CN113470685B (en) Training method and device for voice enhancement model and voice enhancement method and device
CN114283833A (en) Speech enhancement model training method, speech enhancement method, related device and medium
CN112712816B (en) Training method and device for voice processing model and voice processing method and device
CN113284507A (en) Training method and device of voice enhancement model and voice enhancement method and device
CN114121029A (en) Training method and device of speech enhancement model and speech enhancement method and device
CN114758668A (en) Training method of voice enhancement model and voice enhancement method
US20230116052A1 (en) Array geometry agnostic multi-channel personalized speech enhancement
CN113035221B (en) Training method and device for voice processing model and voice processing method and device
JP6891144B2 (en) Generation device, generation method and generation program
CN112652290B (en) Method for generating reverberation audio signal and training method of audio processing model
CN111477248B (en) Audio noise detection method and device
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
CN116092529A (en) Training method and device of tone quality evaluation model, and tone quality evaluation method and device
CN114446316B (en) Audio separation method, training method, device and equipment of audio separation model
CN113990327B (en) Speaking object characterization extraction model training method and speaking object identity recognition method
CN114242110A (en) Model training method, audio processing method, device, equipment, medium and product
CN114694683A (en) Speech enhancement evaluation method, and training method and device of speech enhancement evaluation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant