CN107248414A - Speech enhancement method and device based on multiframe spectra and non-negative matrix factorization - Google Patents

Speech enhancement method and device based on multiframe spectra and non-negative matrix factorization

Info

Publication number
CN107248414A
Authority
CN
China
Prior art keywords
frequency spectrum, multiframe, voice, multiframe frequency, speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710366412.0A
Other languages
Chinese (zh)
Inventor
何亮
施梦楠
徐灿
刘加
Current Assignee
Beijing Huacong Zhijia Technology Co Ltd
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201710366412.0A priority Critical patent/CN107248414A/en
Publication of CN107248414A publication Critical patent/CN107248414A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/27: characterised by the analysis technique

Abstract

The present invention proposes a speech enhancement method and device based on multiframe spectra and non-negative matrix factorization (NMF), belonging to the fields of speech enhancement and NMF. The method preprocesses clean speech, noise, and noisy speech to obtain short-term spectra, which are converted into multiframe spectra. The multiframe spectra of the noise and of the clean speech are each factorized into the product of a basis matrix and a coefficient matrix, yielding the basis matrix of the noise multiframe spectrum and the basis matrix of the clean-speech multiframe spectrum. The two basis matrices are combined into the basis matrix of the noisy-speech multiframe spectrum, and the noisy-speech multiframe spectrum is factorized against this basis to obtain its coefficient matrix, from which initial estimates of the noise and enhanced-speech multiframe spectra are derived. Wiener filtering then yields the multiframe spectrum of the enhanced speech, which is transformed back into a time-domain signal to obtain the final enhanced speech. The present invention preserves information peculiar to speech, better restores the speech, and improves the effect of speech enhancement.

Description

Speech enhancement method and device based on multiframe spectra and non-negative matrix factorization
Technical field
The invention belongs to the fields of speech enhancement and non-negative matrix factorization, and more particularly relates to a speech enhancement method and device based on multiframe spectra and non-negative matrix factorization.
Background technology
Speech enhancement, also known as speech denoising, is a speech processing technique that processes noisy speech to remove its noise component and recover its clean speech component, improving both speech quality and intelligibility. Speech enhancement technology can suppress background noise during voice communication and improve communication quality. It can also serve as the preprocessing front end of a speech processing system, helping the system resist noise interference and improving its stability. With today's rapid development and maturity of electronic information technology, speech enhancement systems are applied in many fields such as communications, mobile phones, computers, concerts, investigation, and field recording.
There are a great many speech enhancement methods. One family is the speech enhancement methods based on the short-term spectrum, including classic algorithms such as Wiener filtering, spectral subtraction, and the MMSE estimator. Short-term-spectrum methods are simple to implement and can suppress noise, so they have practical value. However, although the short-term spectrum matches the short-term stationarity of speech, it ignores deeper feature information of speech.
Tseng et al. proposed a speech enhancement method based on multiframe sparse dictionary learning with statistical criteria. The outstanding contribution of this method is that it used the multiframe spectrum in speech enhancement, which differs considerably from traditional short-term-spectrum methods. In short-term-spectrum methods, to obtain the short-term spectrum, the speech must be divided into frames of 10 ms to 30 ms, and the short-time Fourier transform (STFT) is then used to transform the time-domain signal into the frequency domain.
The smallest unit of speech is the phoneme, and the time span of a frame is generally smaller than that of a phoneme. In other words, one frame of the short-term spectrum cannot cover the smallest unit of speech. A single phoneme has a relatively fixed temporal structure, and some adjacent phonemes also have specific transition relations. Obviously, this information cannot be obtained from a single frame.
The multiframe spectrum can preserve this information. The multiframe spectrum is a way of modeling pronunciation that incorporates context. A model built this way has the following advantages: (1) in temporally continuous speech, the context mechanism affects the distribution of time-frequency energy, and the multiframe spectrum can preserve this kind of speech information; (2) the multiframe spectrum can capture information such as formant transitions and tone changes between adjacent phonemes.
Therefore, the multiframe spectrum, which models context, is expected to yield a better speech model and hence a better speech enhancement effect. In the multiframe spectrum, a "bag" composed of multiple frames replaces the original single frame. The frames in a "bag" are continuous in time, and a single "bag" is equivalent to a frame with a larger time span, so the multiframe spectrum may also be called a long-term spectrum. Compared with the short-term spectrum, the multiframe spectrum preserves the temporal dynamics and time structure of speech.
In 1999, Lee and Seung proposed non-negative matrix factorization (NMF). NMF was at first used mainly in image processing. In recent years, NMF has also achieved good results in speech enhancement, gradually becoming a mainstream speech enhancement method and receiving the attention of scholars.
NMF decomposes an n × m nonnegative data matrix V into two matrices W and H satisfying the approximate equality:
V ≈ WH (1-1)
where W is an n × r matrix, H is an r × m matrix, and the parameter r satisfies r < nm/(n+m).
W is generally called the basis matrix of V, and H the coefficient matrix of V. The basis matrix W preserves the data features of the nonnegative data matrix V; it consists of basis vectors w_i, each representing an independent feature vector. The coefficient matrix H is the dimension-reduced form of V; it consists of coefficient vectors h_i, each corresponding to a column vector v_i of V.
The main functions of NMF are feature extraction and data dimensionality reduction. The feature vectors of the data matrix V are contained in the basis matrix W. Data of the same class share approximate features, so NMF can extract the common features of a class of data. The coefficient matrix H corresponds one-to-one with the data matrix V and can be regarded as the result of reducing the dimensionality of V. Within the same class of data, W does not change as V changes, while H does. In general, the basis matrix contains the "commonality" of the data, and the coefficient matrix represents its "specificity".
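As an illustration of the factorization V ≈ WH in formula (1-1), the following numpy sketch uses toy dimensions (n = 6, r = 2, m = 8, chosen here for illustration only, not taken from the patent) and verifies the rank condition r < nm/(n+m):

```python
import numpy as np

n, r, m = 6, 2, 8
rng = np.random.default_rng(3)
W = rng.random((n, r))   # basis matrix: one basis vector w_i per column
H = rng.random((r, m))   # coefficient matrix: one coefficient vector h_i per column of V
V = W @ H                # in this toy case V = WH holds exactly

assert r < n * m / (n + m)   # rank condition from the text: 2 < 48/14
```

Since W and H are nonnegative, every entry of V is nonnegative as well, matching the nonnegativity requirement on the data matrix.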
Current NMF-based speech enhancement methods still operate on the short-term spectrum, and such methods suffer from the following problem: training on the short-term spectrum cannot capture the speech-specific information contained in the multiframe spectrum, so the recovered clean speech is of poor quality and the speech enhancement effect is poor.
The existing NMF-based speech enhancement method, whose flow is shown in Fig. 1, consists of a basis-matrix training stage and a speech enhancement stage, and comprises the following steps:
1) Basis-matrix training stage, specifically comprising the following steps:
1-a) Preprocess the training data and apply the fast Fourier transform, obtaining the short-term spectrum of the clean speech and the short-term spectrum of the noise in the training data;
1-b) Using the NMF algorithm, factorize the short-term spectra of the clean speech and of the noise obtained in step 1-a) into the products of their respective basis and coefficient matrices;
1-c) By minimizing the generalized KL divergence cost function, obtain the basis matrix of the clean-speech short-term spectrum and the basis matrix of the noise short-term spectrum respectively;
2) Speech enhancement stage, specifically comprising the following steps:
2-a) Preprocess the noisy speech and apply the fast Fourier transform, obtaining the short-term spectrum of the noisy speech;
2-b) Using the basis matrices of the noise and clean-speech short-term spectra obtained in step 1-c), synthesize the basis matrix of the noisy-speech short-term spectrum;
2-c) Using the NMF algorithm, factorize the noisy-speech short-term spectrum obtained in step 2-a) into the product of a basis matrix and a coefficient matrix;
2-d) Using the result of step 2-c), minimize the generalized KL divergence cost function, combined with the noisy-speech basis matrix synthesized in step 2-b), to obtain the coefficient matrix of the noisy-speech short-term spectrum;
2-e) From the coefficient matrix of the noisy-speech short-term spectrum obtained in step 2-d) and the basis matrix obtained in step 2-b), obtain initial estimates of the short-term spectra of the clean speech and of the noise;
2-f) Apply Wiener filtering to obtain the short-term spectrum of the enhanced speech;
2-g) Transform the enhanced-speech short-term spectrum obtained in step 2-f) back into a time-domain signal, obtaining the final enhanced speech.
Among the above steps, the most critical one is the NMF basis-matrix solving algorithm, i.e. step 1-c). The detailed procedure of the algorithm is as follows:
The KL divergence cost function requires that the KL divergence distance between the product of the basis matrix and the coefficient matrix and the target nonnegative matrix be as small as possible:
D(V ∥ WH) = Σ_{i,j} ( V_{ij} ln( V_{ij} / (WH)_{ij} ) − V_{ij} + (WH)_{ij} ) (1-2)
where V denotes the short-term spectrum of the speech, W and H denote the basis matrix and coefficient matrix of the short-term spectrum respectively, and i, j are the row and column indices of the matrices.
All elements of W and H are set to random nonnegative numbers and substituted into the following iterative formulas:
H_{au} ← H_{au} · ( Σ_i W_{ia} V_{iu} / (WH)_{iu} ) / ( Σ_k W_{ka} ) (1-3)
W_{ia} ← W_{ia} · ( Σ_u H_{au} V_{iu} / (WH)_{iu} ) / ( Σ_k H_{ak} ) (1-4)
where i, a, u, k are row and column indices of the matrices.
After a number of iterations, W and H converge, the non-negative matrix factorization is complete, and the basis matrix of the short-term spectrum is obtained.
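The iterative procedure of step 1-c) can be sketched in numpy as below. This is a minimal implementation of the standard Lee-Seung multiplicative updates for the generalized KL divergence, not the patent's own code; the matrix sizes, iteration counts, and the small eps guard against division by zero are assumed values:

```python
import numpy as np

def nmf_kl(V, r, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF minimizing the generalized KL divergence
    (the iterative formulas for H and W described above)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]   # update of H_au
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]   # update of W_ia
    return W, H

def kl_div(V, R, eps=1e-9):
    """Generalized KL divergence D(V || R)."""
    return float(np.sum(V * np.log((V + eps) / (R + eps)) - V + R))

rng = np.random.default_rng(1)
V = rng.random((20, 30)) + 0.1
W1, H1 = nmf_kl(V, r=5, n_iter=1)     # same random init, 1 iteration
W2, H2 = nmf_kl(V, r=5, n_iter=200)   # same random init, 200 iterations
```

Because the multiplicative updates never decrease fit quality, the divergence after 200 iterations is lower than after 1 iteration from the same initialization.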
Summary of the invention
The purpose of the present invention is to overcome the shortcomings of the prior art by proposing a speech enhancement method and device based on multiframe spectra and non-negative matrix factorization. The present invention applies the multiframe spectrum and NMF to a speech enhancement system: the multiframe spectrum is built on top of the original short-term spectrum, and the enhanced speech is obtained using NMF. This captures and preserves the medium- and long-term information peculiar to speech, better restores the speech, and improves the effect of speech enhancement.
The speech enhancement method based on multiframe spectra and non-negative matrix factorization proposed by the present invention is characterised in that it is divided into three stages: a multiframe-spectrum construction stage, a basis-matrix training stage, and a speech enhancement stage, comprising the following steps:
1) Multiframe-spectrum construction stage, specifically comprising the following steps:
1-1) Preprocess the speech to obtain its short-term spectrum. Preprocessing consists of zero-averaging and pre-emphasis: first zero-averaging, i.e. subtracting the mean from the whole speech segment; then pre-emphasis, i.e. high-pass filtering the zero-averaged speech; the speech is then divided into frames and the fast Fourier transform is applied. After preprocessing, the short-term spectrum of the speech is obtained;
1-2) From the short-term spectrum of the speech obtained in step 1-1), convert the short-term spectrum into the corresponding multiframe spectrum according to the bag structure;
2) Basis-matrix training stage, specifically comprising the following steps:
2-1) Take noise and clean speech and repeat step 1), obtaining the multiframe spectrum of the noise and the multiframe spectrum of the clean speech respectively;
2-2) Using the non-negative matrix factorization (NMF) algorithm, factorize the multiframe spectra of the noise and of the clean speech obtained in step 2-1) into the products of their respective basis and coefficient matrices;
2-3) By minimizing the generalized KL divergence cost function, obtain the basis matrix of the noise multiframe spectrum and the basis matrix of the clean-speech multiframe spectrum respectively;
3) Speech enhancement stage, specifically comprising the following steps:
3-1) Take the noisy speech and repeat step 1), obtaining the multiframe spectrum of the noisy speech;
3-2) Using the basis matrices of the noise and clean-speech multiframe spectra obtained in step 2-3), synthesize the basis matrix of the noisy-speech multiframe spectrum;
3-3) Using the NMF algorithm, factorize the noisy-speech multiframe spectrum obtained in step 3-1) into the product of a corresponding basis matrix and coefficient matrix;
3-4) Using the product of the basis and coefficient matrices from step 3-3), minimize the generalized KL divergence cost function, combined with the noisy-speech basis matrix obtained in step 3-2), to obtain the coefficient matrix of the noisy-speech multiframe spectrum;
3-5) From the coefficient matrix of the noisy-speech multiframe spectrum obtained in step 3-4) and the basis matrices of the noise and clean-speech multiframe spectra obtained in step 2-3), obtain initial estimates of the multiframe spectra of the noise and of the clean speech respectively;
3-6) Using the initial estimates of the noise and clean-speech multiframe spectra from step 3-5), apply Wiener filtering to obtain the multiframe spectrum of the enhanced speech;
3-7) Transform the enhanced-speech multiframe spectrum of step 3-6) into the short-term spectrum of the enhanced speech by undoing the bag structure; while undoing the bag structure, the copies of the same frame contained in multiple bags are summed and averaged;
3-8) Transform the enhanced-speech short-term spectrum of step 3-7) into a time-domain signal, obtaining the final enhanced speech.
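The preprocessing of step 1-1) (zero-averaging, pre-emphasis, framing, windowing, FFT) can be sketched as follows. The frame length, hop size, pre-emphasis coefficient, and Hamming window are assumed values; the text itself only requires frames of 10 to 30 ms:

```python
import numpy as np

def short_term_spectrum(x, frame_len=256, hop=128, alpha=0.97):
    """Sketch of step 1-1): zero-mean, pre-emphasis (a first-order
    high-pass filter), framing with a Hamming window, FFT magnitude."""
    x = x - x.mean()                                       # zero averaging
    x = np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))   # pre-emphasis
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    frames = np.stack([x[j*hop : j*hop+frame_len] * win for j in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T           # (n_bins, m): one column per frame

fs = 8000
t = np.arange(fs) / fs
V = short_term_spectrum(np.sin(2 * np.pi * 440 * t))       # 1 s test tone
```

With 8000 samples, a 256-sample frame, and a 128-sample hop, this yields 61 frames of 129 nonnegative magnitude bins each.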
The device implementing the speech enhancement method based on multiframe spectra and non-negative matrix factorization proposed by the present invention is characterised in that it comprises: a speech preprocessing module, a multiframe-spectrum construction module, a multiframe-spectrum basis-matrix training module, a noisy-speech basis-matrix synthesis module, a noisy-speech coefficient-matrix computation module, a speech and noise multiframe-spectrum computation module, a Wiener filtering module, a time-domain signal recovery module, and a storage module;
The speech preprocessing module performs framing, windowing, and the fast Fourier transform on the clean speech, the noise, and the noisy speech to be processed, obtaining the corresponding short-term spectra;
The multiframe-spectrum construction module converts the short-term spectra produced by the speech preprocessing module into the corresponding multiframe spectra;
The multiframe-spectrum basis-matrix training module trains the basis matrix of the noise and the basis matrix of the clean speech from the multiframe spectra of the noise and of the clean speech produced by the multiframe-spectrum construction module;
The noisy-speech basis-matrix synthesis module synthesizes the basis matrix of the noisy speech from the noise and clean-speech basis matrices obtained by the training module;
The noisy-speech coefficient-matrix computation module obtains the coefficient matrix of the noisy speech from the noisy-speech basis matrix produced by the synthesis module, using the non-negative matrix factorization method;
The speech and noise multiframe-spectrum computation module computes initial estimates of the multiframe spectra of the enhanced speech and of the noise from the synthesized noisy-speech basis matrix and the computed noisy-speech coefficient matrix;
The Wiener filtering module builds a Wiener filter from the initial estimates of the enhanced-speech and noise multiframe spectra and obtains the multiframe spectrum of the enhanced speech;
The time-domain signal recovery module obtains the time-domain signal of the enhanced speech from the enhanced-speech multiframe spectrum produced by the Wiener filtering module;
The storage module stores the noise and clean-speech basis-matrix data obtained by the training module and transmits the corresponding data to the corresponding modules.
Features and beneficial effects of the present invention:
Compared with traditional methods, the speech enhancement method and device proposed by the present invention apply the multiframe spectrum and non-negative matrix factorization to speech enhancement. The multiframe spectrum is built on top of the original short-term spectrum, and the speech is enhanced using NMF, capturing and preserving the information peculiar to speech, i.e. its long-term information, so the speech is better restored and the enhancement effect is improved. By using the multiframe spectrum, the present invention can effectively improve speech quality and the effect of speech enhancement.
Brief description of the drawings
Fig. 1 is the flow block diagram of the existing NMF-based speech enhancement method.
Fig. 2 is the flow block diagram of the speech enhancement method based on multiframe spectra and non-negative matrix factorization proposed by the present invention.
Embodiment
The speech enhancement method and device based on multiframe spectra and non-negative matrix factorization proposed by the present invention are described in detail below with reference to the drawings and specific embodiments.
The flow block diagram of the proposed speech enhancement method is shown in Fig. 2. The method is divided into three stages: a multiframe-spectrum construction stage, a basis-matrix training stage, and a speech enhancement stage, comprising the following steps:
1) Multiframe-spectrum construction stage, specifically comprising the following steps:
1-1) Preprocess the speech to obtain its short-term spectrum. Preprocessing consists of zero-averaging and pre-emphasis: first zero-averaging, i.e. subtracting the mean from the whole speech segment; then pre-emphasis, i.e. high-pass filtering the zero-averaged speech; the speech is then divided into frames and the fast Fourier transform is applied. After preprocessing, the short-term spectrum of the speech is obtained;
There is no requirement on the speech being preprocessed; it can be any speech;
1-2) From the short-term spectrum of the speech obtained in step 1-1), convert the short-term spectrum into the corresponding multiframe spectrum according to a specific "bag" structure;
2) Basis-matrix training stage, specifically comprising the following steps:
2-1) Take noise and clean speech and repeat step 1), obtaining the multiframe spectrum of the noise and the multiframe spectrum of the clean speech respectively; in the present invention, the noise and clean speech come from a standard database;
2-2) Using the NMF algorithm, factorize the multiframe spectra of the noise and of the clean speech obtained in step 2-1) into the products of their respective basis and coefficient matrices;
2-3) By minimizing the generalized KL divergence cost function, obtain the basis matrix of the noise multiframe spectrum and the basis matrix of the clean-speech multiframe spectrum respectively;
3) Speech enhancement stage, specifically comprising the following steps:
3-1) Take the noisy speech and repeat step 1), obtaining the multiframe spectrum of the noisy speech; in the present embodiment, the noisy speech was recorded in the laboratory;
3-2) Using the basis matrices of the noise and clean-speech multiframe spectra obtained in step 2-3), synthesize the basis matrix of the noisy-speech multiframe spectrum;
3-3) Using the NMF algorithm, factorize the noisy-speech multiframe spectrum obtained in step 3-1) into the product of a corresponding basis matrix and coefficient matrix;
3-4) Using the product of the basis and coefficient matrices from step 3-3), minimize the generalized KL divergence cost function, combined with the noisy-speech basis matrix obtained in step 3-2), to obtain the coefficient matrix of the noisy-speech multiframe spectrum;
3-5) From the coefficient matrix of the noisy-speech multiframe spectrum obtained in step 3-4) and the basis matrices of the noise and clean-speech multiframe spectra obtained in step 2-3), obtain initial estimates of the multiframe spectra of the noise and of the clean speech respectively;
3-6) Using the initial estimates of the noise and clean-speech multiframe spectra from step 3-5), apply Wiener filtering to obtain the multiframe spectrum of the enhanced speech;
3-7) Transform the enhanced-speech multiframe spectrum of step 3-6) into the short-term spectrum of the enhanced speech by undoing the bag structure; while undoing the bag structure, the copies of the same frame contained in multiple "bags" are summed and averaged;
3-8) Transform the enhanced-speech short-term spectrum of step 3-7) into a time-domain signal, obtaining the final enhanced speech.
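Steps 1-2) and 3-7) form a pack/unpack pair, and the sum-and-average of step 3-7) can be sketched as below. The bag form (frames i-2, i, i+2 stacked per bag) follows the embodiment; filling out-of-range frames with zeros is an assumption, since the boundary handling is not specified in the text:

```python
import numpy as np

def pack_bags(V):
    """Build the multiframe spectrum: bag i stacks frames i-2, i, i+2
    (out-of-range frames are left as zeros)."""
    n, m = V.shape
    B = np.zeros((3 * n, m))
    for j in range(m):
        for s, off in enumerate((-2, 0, 2)):
            k = j + off
            if 0 <= k < m:
                B[s*n:(s+1)*n, j] = V[:, k]
    return B

def unpack_bags(B, n):
    """Sketch of step 3-7): undo the bag structure. Each frame appears in
    up to three bags, so every stored copy is summed and averaged."""
    m = B.shape[1]
    acc = np.zeros((n, m))
    cnt = np.zeros(m)
    for j in range(m):
        for s, off in enumerate((-2, 0, 2)):
            k = j + off
            if 0 <= k < m:
                acc[:, k] += B[s*n:(s+1)*n, j]
                cnt[k] += 1
    return acc / cnt

rng = np.random.default_rng(4)
V = rng.random((5, 12))
V_rec = unpack_bags(pack_bags(V), 5)   # round trip recovers V exactly
```

When nothing is done to the bags in between, packing followed by averaging reproduces the original short-term spectrum, which is what makes the averaging a valid inverse after Wiener filtering.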
In the above step 1-2), the short-term spectrum of the speech obtained in step 1-1) is converted into the corresponding multiframe spectrum according to a specific "bag" structure, as follows:
1-2-1) Assume the short-term spectrum of the speech has the mathematical expression shown in formula (1):
V = [v_1, v_2, ..., v_m], V ∈ R^(n×m) (1)
where V denotes the short-term spectrum of the speech, m is the number of frames, n is the frame length of each frame of data, v_1, v_2, ..., v_m denote the individual frame spectra, and R denotes the set of real numbers, i.e. the set containing all rational and irrational numbers.
1-2-2) Build the multiframe spectrum using a specific "bag" structure. The "bag" can take many forms, and the various forms are equally applicable to this method. The concrete form of the "bag" used in this embodiment of the present invention is:
v̂_i = [v_{i-2}; v_i; v_{i+2}] (2)
where v̂_i is the "bag" of the multiframe spectrum being built, obtained by vertically stacking v_{i-2}, v_i, v_{i+2}, which denote the (i-2)-th, i-th, and (i+2)-th frame spectra obtained in step 1-2-1).
The resulting multiframe spectrum of the speech is given by:
V̂ = [v̂_1, v̂_2, ..., v̂_m] (3)
where V̂ denotes the multiframe spectrum.
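The construction in step 1-2-2) can be sketched in vectorized numpy as below: each column i of the multiframe spectrum stacks frames i-2, i, i+2 of the short-term spectrum. Zero columns for the missing neighbours at the edges are an assumption, since the text does not state the boundary handling; the matrix sizes are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 4, 10
V = rng.random((n, m))        # short-term spectrum: V = [v_1, ..., v_m]

# Pad two zero frames on each side so shifts by +/-2 always exist.
pad = np.zeros((n, 2))
Vp = np.hstack([pad, V, pad])

# Bag i = vertical stack of frames i-2, i, i+2 -> one column of B.
B = np.vstack([Vp[:, 0:m], Vp[:, 2:m+2], Vp[:, 4:m+4]])
```

For an interior bag, e.g. i = 3, the three row blocks of column 3 hold v_1, v_3, and v_5 respectively, matching the bag definition.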
In the above steps 2-2) and 3-3), the NMF algorithm proceeds as follows:
Decompose the n × m nonnegative data matrix V into two matrices W and H satisfying the approximate equality:
V ≈ WH (4)
where W and H denote the basis matrix and the coefficient matrix respectively.
In the above step 2-3), the basis matrices of the multiframe spectra of the noise and of the clean speech are obtained by minimizing the generalized KL divergence cost function, as follows:
The KL divergence distance between the product of the basis matrix and the coefficient matrix and the target nonnegative matrix should be as small as possible:
D(V ∥ WH) = Σ_{i,j} ( V_{ij} ln( V_{ij} / (WH)_{ij} ) − V_{ij} + (WH)_{ij} ) (5)
All elements of W and H are set to random nonnegative numbers and substituted into the iterative formulas:
H_{au} ← H_{au} · ( Σ_i W_{ia} V_{iu} / (WH)_{iu} ) / ( Σ_k W_{ka} ) (6)
W_{ia} ← W_{ia} · ( Σ_u H_{au} V_{iu} / (WH)_{iu} ) / ( Σ_k H_{ak} ) (7)
After a number of iterations, W and H converge, the non-negative matrix factorization is complete, and the basis matrix is obtained.
In the above step 3-2), the basis matrix of the noisy-speech multiframe spectrum is synthesized as shown in formula (8):
W_ns = [W_s, W_n] (8)
where W_ns denotes the basis matrix of the noisy-speech multiframe spectrum, W_s the basis matrix of the clean-speech multiframe spectrum, and W_n the basis matrix of the noise multiframe spectrum.
In the above step 3-4), the coefficient matrix of the multiframe spectrum of noisy speech is obtained; the concrete steps are as follows:
The KL divergence between the product of the basis matrix and the coefficient matrix and the target nonnegative matrix should be as small as possible:
D(Vns‖Wns·Hns) = Σ_{i,j} ( (Vns)_{ij} ln( (Vns)_{ij}/(Wns·Hns)_{ij} ) − (Vns)_{ij} + (Wns·Hns)_{ij} )
where Vns denotes the nonnegative data matrix of the noisy-speech multiframe spectrum. With Vns and Wns known, the elements of the coefficient matrix Hns of the noisy-speech multiframe spectrum are initialized as random nonnegative numbers and substituted into the iterative formula, in which only Hns is updated while Wns is held fixed:
(Hns)_{aj} ← (Hns)_{aj} · ( Σ_i (Wns)_{ia} (Vns)_{ij}/(Wns·Hns)_{ij} ) / ( Σ_i (Wns)_{ia} )
After several iterations (100 in the present embodiment), or once Hns converges, the nonnegative matrix factorization is complete and the coefficient matrix of the noisy-speech multiframe spectrum is obtained. By formula (11), Hns is partitioned into the coefficient matrix of the multiframe spectrum of clean speech and the coefficient matrix of the multiframe spectrum of noise:
Hns=(Hs′,Hn′) (11)
where Hs′ denotes the coefficient matrix of the multiframe spectrum of clean speech and Hn′ the coefficient matrix of the multiframe spectrum of noise.
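Step 3-4) differs from ordinary NMF only in that the concatenated basis Wns is held fixed and just the coefficients are updated. A minimal sketch, assuming the same KL multiplicative update as above; the function name is illustrative:

```python
import numpy as np

def encode_fixed_basis(V_ns, W_ns, n_iter=100, eps=1e-9, seed=0):
    """Find H_ns with V_ns ≈ W_ns @ H_ns while the basis W_ns stays fixed."""
    rng = np.random.default_rng(seed)
    # Random nonnegative initialization of the coefficient matrix.
    H = rng.random((W_ns.shape[1], V_ns.shape[1])) + eps
    col_sums = W_ns.sum(axis=0)[:, None] + eps
    for _ in range(n_iter):
        # KL multiplicative update for H only; W_ns is never modified.
        H *= (W_ns.T @ (V_ns / (W_ns @ H + eps))) / col_sums
    return H
```

Because Wns = [Ws, Wn], the split of formula (11) is just row slicing: `H_s, H_n = H[:r_s], H[r_s:]`, where `r_s` is the number of clean-speech basis vectors.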
In the above step 3-5), the initial estimates of the multiframe spectrum of clean speech and the multiframe spectrum of noise are obtained as follows:
Vs′=WsHs′ (12)
Vn′=WnHn′ (13)
where Vs′ denotes the initial estimate of the multiframe spectrum of clean speech and Vn′ the initial estimate of the multiframe spectrum of noise; the subscript s stands for clean speech and n for noise.
In the above step 3-6), the multiframe spectrum of the enhanced speech is obtained by applying the Wiener gain built from the two initial estimates element-wise to the noisy-speech multiframe spectrum:
V̂s = Vns ⊙ Vs′ ⊘ (Vs′+Vn′)
where V̂s denotes the multiframe spectrum of the enhanced speech, ⊙ denotes element-wise multiplication and ⊘ element-wise division.
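The Wiener filtering of step 3-6) reduces to an element-wise gain; the sketch below assumes the standard form (speech estimate over the sum of both estimates), which is consistent with the surrounding description, and the function name is illustrative:

```python
import numpy as np

def wiener_enhance(V_s_init, V_n_init, V_ns, eps=1e-9):
    """Apply the Wiener gain V_s'/(V_s'+V_n') element-wise to the noisy multiframe spectrum."""
    gain = V_s_init / (V_s_init + V_n_init + eps)  # eps guards against division by zero
    return gain * V_ns
```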
A speech enhancement device based on multiframe spectrum and nonnegative matrix factorization using the method of the invention comprises nine modules in total: a speech preprocessing module, a multiframe-spectrum construction module, a multiframe-spectrum basis-matrix training module, a noisy-speech basis-matrix synthesis module, a noisy-speech coefficient-matrix calculation module, a speech-and-noise multiframe-spectrum calculation module, a Wiener filtering module, a time-domain-signal recovery module and a storage module;
The speech preprocessing module performs framing, windowing and fast Fourier transform (FFT) on the clean speech, the noise and the noisy speech to be processed, obtaining the short-time spectrum of the corresponding speech;
The multiframe-spectrum construction module converts the short-time spectra produced by the speech preprocessing module into the corresponding multiframe spectra;
The multiframe-spectrum basis-matrix training module trains the basis matrix of noise and the basis matrix of clean speech from the multiframe spectra of noise and clean speech produced by the multiframe-spectrum construction module;
The noisy-speech basis-matrix synthesis module synthesizes the basis matrix of noisy speech from the basis matrices of noise and clean speech obtained by the training module;
The noisy-speech coefficient-matrix calculation module obtains the coefficient matrix of noisy speech by nonnegative matrix factorization, using the noisy-speech basis matrix obtained by the synthesis module;
The speech-and-noise multiframe-spectrum calculation module computes the initial estimates of the multiframe spectrum of clean speech and the multiframe spectrum of noise from the synthesized noisy-speech basis matrix and the coefficient matrix of noisy speech;
The Wiener filtering module builds a Wiener filter from these two initial estimates and obtains the multiframe spectrum of the enhanced speech;
The time-domain-signal recovery module obtains the time-domain signal of the enhanced speech from the multiframe spectrum of the enhanced speech produced by the Wiener filtering module;
The storage module stores the basis-matrix data of noise and clean speech obtained by the training module and transmits the corresponding data to the corresponding modules.
Each of the above modules can be implemented with conventional digital integrated circuits.

Claims (3)

1. A speech enhancement method based on multiframe spectrum and nonnegative matrix factorization, characterized in that it is divided into three stages: a multiframe-spectrum construction stage, a basis-matrix training stage and a speech enhancement stage; comprising the following steps:
1) the multiframe-spectrum construction stage, specifically comprising the following steps:
1-1) preprocessing the speech to obtain its short-time spectrum; the preprocessing comprises zero-mean removal and pre-emphasis: first the mean of the whole speech segment is subtracted; then pre-emphasis is applied by high-pass filtering the zero-mean speech; the speech is then framed and a fast Fourier transform is applied; after preprocessing, the short-time spectrum of the speech is obtained;
1-2) converting the short-time spectrum obtained in step 1-1) into the corresponding multiframe spectrum according to a pack structure;
2) the basis-matrix training stage, specifically comprising the following steps:
2-1) extracting noise and clean speech and repeating step 1) to obtain the multiframe spectrum of noise and the multiframe spectrum of clean speech, respectively;
2-2) converting, by the nonnegative matrix factorization (NMF) algorithm, the multiframe spectra of noise and clean speech obtained in step 2-1) into the products of their respective basis matrices and coefficient matrices;
2-3) obtaining the basis matrix of the multiframe spectrum of noise and the basis matrix of the multiframe spectrum of clean speech, respectively, by minimizing the generalized KL-divergence cost function;
3) the speech enhancement stage, specifically comprising the following steps:
3-1) extracting the noisy speech and repeating step 1) to obtain the multiframe spectrum of the noisy speech;
3-2) synthesizing the basis matrix of the noisy-speech multiframe spectrum from the basis matrices of the multiframe spectra of noise and clean speech obtained in step 2-3);
3-3) converting, by the NMF algorithm, the multiframe spectrum of the noisy speech obtained in step 3-1) into the product of the corresponding basis matrix and coefficient matrix;
3-4) obtaining the coefficient matrix of the multiframe spectrum of the noisy speech from the product obtained in step 3-3), by minimizing the generalized KL-divergence cost function in combination with the noisy-speech basis matrix of step 3-2);
3-5) obtaining the initial estimates of the multiframe spectrum of noise and the multiframe spectrum of clean speech, respectively, from the noisy-speech coefficient matrix of step 3-4) and the basis matrices of the multiframe spectra of noise and clean speech of step 2-3);
3-6) obtaining the multiframe spectrum of the enhanced speech by Wiener filtering, using the initial estimates of the multiframe spectra of noise and clean speech obtained in step 3-5);
3-7) transforming the multiframe spectrum of the enhanced speech obtained in step 3-6) into the short-time spectrum of the enhanced speech by releasing the pack structure; during the release, the copies of the same frame contained in multiple packs are summed and averaged;
3-8) transforming the short-time spectrum obtained in step 3-7) into a time-domain signal, giving the final enhanced speech.
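The release of the pack structure in step 3-7) (summing and averaging the copies of each frame held in different packs) can be sketched as follows, assuming packs of the form [v_{i-2}; v_i; v_{i+2}] from claim 2 with out-of-range edge frames left as zeros; the function name and edge handling are assumptions, since the claim leaves them unspecified:

```python
import numpy as np

def release_packs(V_bar, n):
    """Undo the [v_{i-2}; v_i; v_{i+2}] packing by averaging each frame's copies."""
    m = V_bar.shape[1]
    acc = np.zeros((n, m))
    cnt = np.zeros(m)
    # Row block j of pack i holds frame i + off (off = -2, 0, +2).
    for j, off in enumerate((-2, 0, 2)):
        block = V_bar[j * n:(j + 1) * n, :]
        for i in range(m):
            t = i + off
            if 0 <= t < m:          # skip the zero-padded edge entries
                acc[:, t] += block[:, i]
                cnt[t] += 1
    return acc / cnt                # sum-average over the valid copies
```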
2. The method of claim 1, characterized in that step 1-2) comprises the following concrete steps:
1-2-1) assume the short-time spectrum of the speech has the mathematical expression of formula (1):
V=[v_1, v_2, ……, v_m], V ∈ R^(n×m) (1)
where V denotes the short-time spectrum of the speech, m is the number of frames, n is the length of each frame, v_1, v_2 … v_m denote the short-time spectra of the individual frames, and R denotes the set of real numbers;
1-2-2) build the multiframe spectrum using a pack structure; one concrete form of the pack is:
v̄_i = [v_{i−2}; v_i; v_{i+2}] (2)
where v̄_i is a pack of the multiframe spectrum and v_{i−2}, v_i, v_{i+2} denote the (i−2)-th, i-th and (i+2)-th frame short-time spectra obtained in step 1-2-1);
the resulting multiframe spectrum of the speech is given by:
V̄ = [v̄_1, v̄_2, ……, v̄_m], V̄ ∈ R^(3n×m) (3)
where V̄ denotes the multiframe spectrum.
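The pack construction of formulas (2)-(3) can be sketched directly; here edge frames whose neighbours v_{i-2} or v_{i+2} fall outside the signal are zero-padded, an assumption since the claim leaves edge handling unspecified:

```python
import numpy as np

def build_multiframe(V):
    """Build V_bar whose columns are [v_{i-2}; v_i; v_{i+2}] per formulas (2)-(3)."""
    n, m = V.shape
    V_bar = np.zeros((3 * n, m))
    for j, off in enumerate((-2, 0, 2)):    # stacked row blocks of each pack
        for i in range(m):
            t = i + off
            if 0 <= t < m:                  # frames outside the signal stay zero
                V_bar[j * n:(j + 1) * n, i] = V[:, t]
    return V_bar
```

The resulting matrix lies in R^(3n×m), matching formula (3).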
3. A device implementing the speech enhancement method based on multiframe spectrum and nonnegative matrix factorization of claim 1, characterized by comprising: a speech preprocessing module, a multiframe-spectrum construction module, a multiframe-spectrum basis-matrix training module, a noisy-speech basis-matrix synthesis module, a noisy-speech coefficient-matrix calculation module, a speech-and-noise multiframe-spectrum calculation module, a Wiener filtering module, a time-domain-signal recovery module and a storage module;
the speech preprocessing module performs framing, windowing and fast Fourier transform on the clean speech, the noise and the noisy speech to be processed, obtaining the short-time spectrum of the corresponding speech;
the multiframe-spectrum construction module converts the short-time spectra produced by the speech preprocessing module into the corresponding multiframe spectra;
the multiframe-spectrum basis-matrix training module trains the basis matrix of noise and the basis matrix of clean speech from the multiframe spectra of noise and clean speech produced by the multiframe-spectrum construction module;
the noisy-speech basis-matrix synthesis module synthesizes the basis matrix of noisy speech from the basis matrices of noise and clean speech obtained by the training module;
the noisy-speech coefficient-matrix calculation module obtains the coefficient matrix of noisy speech by nonnegative matrix factorization, using the noisy-speech basis matrix obtained by the synthesis module;
the speech-and-noise multiframe-spectrum calculation module computes the initial estimates of the multiframe spectrum of clean speech and the multiframe spectrum of noise from the synthesized noisy-speech basis matrix and the coefficient matrix of noisy speech;
the Wiener filtering module builds a Wiener filter from the initial estimates of the multiframe spectra of clean speech and noise and obtains the multiframe spectrum of the enhanced speech;
the time-domain-signal recovery module obtains the time-domain signal of the enhanced speech from the multiframe spectrum of the enhanced speech produced by the Wiener filtering module;
the storage module stores the basis-matrix data of noise and clean speech obtained by the training module and transmits the corresponding data to the corresponding modules.
CN201710366412.0A 2017-05-23 2017-05-23 A kind of sound enhancement method and device based on multiframe frequency spectrum and Non-negative Matrix Factorization Pending CN107248414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710366412.0A CN107248414A (en) 2017-05-23 2017-05-23 A kind of sound enhancement method and device based on multiframe frequency spectrum and Non-negative Matrix Factorization

Publications (1)

Publication Number Publication Date
CN107248414A true CN107248414A (en) 2017-10-13

Family

ID=60017435


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564965A (en) * 2018-04-09 2018-09-21 太原理工大学 A kind of anti-noise speech recognition system
CN110428848A (en) * 2019-06-20 2019-11-08 西安电子科技大学 A kind of sound enhancement method based on the prediction of public space speech model
CN111710343A (en) * 2020-06-03 2020-09-25 中国科学技术大学 Single-channel voice separation method on double transform domains
CN111863014A (en) * 2019-04-26 2020-10-30 北京嘀嘀无限科技发展有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN113823305A (en) * 2021-09-03 2021-12-21 深圳市芒果未来科技有限公司 Method and system for suppressing noise of metronome in audio

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101441872A (en) * 2007-11-19 2009-05-27 三菱电机株式会社 Denoising acoustic signals using constrained non-negative matrix factorization
CN104103277A (en) * 2013-04-15 2014-10-15 北京大学深圳研究生院 Time frequency mask-based single acoustic vector sensor (AVS) target voice enhancement method
CN104505100A (en) * 2015-01-06 2015-04-08 中国人民解放军理工大学 Non-supervision speech enhancement method based robust non-negative matrix decomposition and data fusion
US9105270B2 (en) * 2013-02-08 2015-08-11 Asustek Computer Inc. Method and apparatus for audio signal enhancement in reverberant environment
CN106030705A (en) * 2014-02-27 2016-10-12 高通股份有限公司 Systems and methods for speaker dictionary based speech modeling

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUNG-WEI TSENG ET AL: "A Single Channel Speech Enhancement Approach by Combining Statistical Criterion and Multi-Frame Sparse Dictionary Learning", 《INTERSPEECH》 *
KEVIN W. WILSON ET AL: "SPEECH DENOISING USING NONNEGATIVE MATRIX FACTORIZATION WITH PRIORS", 《2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING》 *
ZHANG LIWEI ET AL: "Speech Enhancement Algorithm Based on Sparse Convolutive Nonnegative Matrix Factorization", 《Journal of Data Acquisition and Processing》 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20181126
Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030
Applicant after: Beijing Huacong Zhijia Technology Co., Ltd.
Address before: 100084 Tsinghua Yuan, Haidian District, Beijing, No. 1
Applicant before: Tsinghua University
RJ01 Rejection of invention patent application after publication
Application publication date: 20171013