CN1742322A - Noise reduction and audio-visual speech activity detection - Google Patents

Noise reduction and audio-visual speech activity detection


Publication number
CN1742322A
CN1742322A (application CN200480002628A)
Authority
CN
China
Prior art keywords
speaker
speech
estimate
signal
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200480002628
Other languages
Chinese (zh)
Other versions
CN100356446C (en)
Inventor
M. Taneda
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Mobile Communications AB
Original Assignee
Sony Ericsson Mobile Communications AB
Application filed by Sony Ericsson Mobile Communications AB
Publication of CN1742322A
Application granted
Publication of CN100356446C
Legal status: Expired - Fee Related


Abstract

The present invention generally relates to the field of noise reduction systems equipped with an audio-visual user interface, and in particular to an audio-visual speech activity recognition system (200b/c) of a video-enabled telecommunication device running a real-time lip-tracking application that can advantageously be used for a near-speaker detection algorithm in an environment where the speaker's voice is disturbed by statistically distributed background noise (n'(t)) comprising both environmental noise (n(t)) and the voices of surrounding persons.

Description

Noise reduction and audio-visual speech activity detection
Field of the invention and background
The present invention relates generally to the field of noise reduction based on speech activity recognition. In particular, it relates to the audio-visual user interface of a telecommunication device running an application that can advantageously be used for, e.g., a near-speaker detection algorithm in an environment where the speaker's voice is disturbed by statistically distributed background noise comprising both environmental noise and the voices of surrounding persons.
Discontinuous transmission of speech signals based on speech/pause detection is an effective means of improving the spectral efficiency of new-generation wireless communication systems. In this context, robust voice activity detection algorithms are needed, because traditional prior-art solutions exhibit a high misclassification rate in typical mobile environments with background noise.
The purpose of a voice activity detector (VAD) is to distinguish the speech signal from several types of acoustic background noise, even at low signal-to-noise ratios (SNR). In a typical telephone conversation, such a VAD is used together with a comfort noise generator (CNG) to achieve silence compression. In multimedia communication, silence compression allows a speech channel to be shared with other types of information, thereby enabling simultaneous voice and data applications. In cellular radio systems based on a discontinuous transmission (DTX) mode, such as GSM, the VAD is used to reduce co-channel interference and the power consumption of portable equipment. Moreover, a VAD is essential for variable-bit-rate (VBR) speech coding, which reduces the average data bit rate in future digital cellular networks such as UMTS. Most of the capacity gain results from distinguishing between speech activity and inactivity. However, the performance of a speech coding scheme based on phonetic classification depends heavily on the classifier, which has to be robust against every type of background noise. It is well known that VAD performance is critical for the overall speech quality, particularly at low SNR. If speech frames are detected as noise, intelligibility is severely degraded owing to speech clipping in the conversation. If, on the other hand, the percentage of noise detected as speech is high, the potential advantages of silence compression cannot be realized. In the presence of background noise it may be difficult to distinguish speech from silence; therefore, more efficient algorithms are needed for voice activity detection in wireless environments.
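For illustration only, a very simple frame-energy voice activity detector with a hangover counter is sketched below; it is neither the fuzzy VAD discussed next nor the audio-visual detector proposed in this document, and the frame length, threshold and hangover values are arbitrary assumptions.

```python
import numpy as np

def energy_vad(x, fs, frame_ms=20, threshold_db=-40.0, hangover=5):
    """Toy frame-energy VAD: returns 1 for speech frames, 0 for pauses.

    x            : mono audio signal, float samples in [-1, 1]
    fs           : sampling rate in Hz
    frame_ms     : frame length in milliseconds (assumed value)
    threshold_db : energy threshold relative to full scale (assumed value)
    hangover     : number of frames speech is held after the energy drops
    """
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(x) // frame_len
    decisions = np.zeros(n_frames, dtype=int)
    hold = 0
    for m in range(n_frames):
        frame = np.asarray(x[m * frame_len:(m + 1) * frame_len], dtype=float)
        energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
        if energy_db > threshold_db:
            decisions[m] = 1
            hold = hangover          # re-arm the hangover counter
        elif hold > 0:
            decisions[m] = 1         # bridge short pauses inside words
            hold -= 1
    return decisions
```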
Although the fuzzy voice activity detector (FVAD) proposed in F. Beritelli, S. Casale and A. Cavallaro, "Improved VAD G.729 Annex B for Mobile Communications Using Soft Computing" (Contribution ITU-T, Study Group 16, Question 19/16, Washington, September 2-5, 1997) performs better than other solutions reported in the literature, it exhibits increased activity, especially in the presence of non-stationary noise. The functional scheme of the FVAD is based on a traditional pattern-recognition approach, in which the four differential parameters used for speech activity/inactivity classification are the full-band energy difference, the low-band energy difference, the zero-crossing difference and the spectral distortion. The matching phase is performed by means of a set of fuzzy rules obtained automatically with the new hybrid learning tool described in M. Russo, "FuGeNeSys: Fuzzy Genetic Neural System for Fuzzy Modeling" (IEEE Transactions on Fuzzy Systems). As is well known, a fuzzy system allows a gradual, continuous transition between two values rather than an abrupt change. A fuzzy VAD therefore returns a continuous output signal ranging from 0 (inactivity) to 1 (activity), which does not depend on whether a single input signal has exceeded a predefined threshold, but on an overall evaluation ("defuzzification") of the values the inputs have assumed. The final decision is made by comparing the output of the fuzzy system (varying in the range between 0 and 1) with a fixed threshold selected experimentally, as described in C. B. Southcott et al., "Voice Control of the Pan-European Digital Mobile Radio System" (ICC '89, pp. 1070-1074).
Like voice activity detectors, traditional automatic speech recognition (ASR) systems also run into difficulties when operated in noisy environments, since the accuracy of conventional ASR algorithms degrades severely under noise. When the speaker talks in a noisy environment comprising environmental noise and the interfering voices of surrounding persons, the microphone picks up not only the speaker's voice but also these background sounds. The processed signal is therefore an audio signal consisting of the speaker's voice superimposed with said background sounds. The louder the interfering sounds, the more the intelligibility of the speaker's voice is reduced. To overcome this problem, noise reduction circuits exploiting the different frequency ranges of the environmental noise and of the respective speaker's voice have been added.
Fig. 2a shows a circuit for reducing pink noise in telephone-based applications according to the prior art, which implements a method of correlating the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) with an audio speech activity estimate based on a speech activity estimation algorithm. Said audio speech activity estimate is obtained by amplitude detection of the digital audio signal s(nT). The circuit outputs a noise-reduced audio signal ŝ_i(nT), which is calculated by subjecting the difference between the discrete signal spectrum S(k·Δf) and a sampled version of the estimated noise power density spectrum of the statistically distributed background noise n'(t) to an inverse fast Fourier transform (IFFT).
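The prior-art circuit of Fig. 2a is essentially a spectral-subtraction noise reducer whose noise estimate is updated during frames classified as pauses by an audio-only activity estimate. The following is a minimal illustrative sketch of that processing chain; the FFT size, overlap, spectral floor and the simple energy gate standing in for the amplitude detector are assumptions, not values taken from the disclosure.

```python
import numpy as np

def spectral_subtraction(s, fs, n_fft=512, hop=256, floor=0.05, gate_db=-40.0):
    """Prior-art style noise reduction: estimate the noise power density
    spectrum during pauses and subtract it from the signal spectrum."""
    s = np.asarray(s, dtype=float)
    window = np.hanning(n_fft)
    noise_psd = np.zeros(n_fft // 2 + 1)
    out = np.zeros(len(s))
    for start in range(0, len(s) - n_fft, hop):
        frame = s[start:start + n_fft] * window
        spectrum = np.fft.rfft(frame)                     # S(k*df)
        power = np.abs(spectrum) ** 2
        # crude audio-only pause detector (stands in for the amplitude detector)
        if 10 * np.log10(np.mean(frame ** 2) + 1e-12) < gate_db:
            noise_psd = 0.9 * noise_psd + 0.1 * power     # update noise estimate
        # subtract the estimated noise power, keep a small spectral floor
        clean_power = np.maximum(power - noise_psd, floor * power)
        clean = np.sqrt(clean_power) * np.exp(1j * np.angle(spectrum))
        # IFFT and un-normalized overlap-add (sketch only)
        out[start:start + n_fft] += np.fft.irfft(clean, n_fft) * window
    return out
```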
Overview of the prior art
The invention described in US 5,313,522 relates to a device for making a telephone conversation intelligible to a hearing-impaired participant. The device comprises circuitry for converting the received audio speech signal into a sequence of phonemes and means for coupling this circuitry to the POTS line. The circuitry further comprises means for correlating the detected phoneme sequence with recorded lip movements of a speaker and for displaying these lip movements in subsequent images on a display device, thereby allowing the hearing-impaired person to perform lip-reading while listening to the telephone conversation, which improves his or her level of understanding.
The invention disclosed in WO 99/52097 relates to a communication device and a method for sensing the lip movements of a speaker, generating an audio signal corresponding to the detected lip movements and transmitting said audio signal, thereby sensing the ambient noise level and controlling the power level of the audio signal to be transmitted accordingly.
Object of the invention
In view of the prior art described above, the object of the invention is to enhance the speech/pause detection accuracy of telephone-based voice activity detection (VAD) systems. In particular, the object of the invention is to improve the signal-to-interference ratio (SIR) of a speech signal recorded in a crowded environment in which the speaker's voice is severely disturbed by environmental noise and/or the voices of surrounding persons.
This object is achieved by means of the features of the independent claims. Advantageous features are defined in the dependent claims.
Brief summary of the invention
The present invention is directed to a noise reduction and automatic speech activity recognition system with an audio-visual user interface, wherein said system is adapted to run an application for combining a visual feature vector o_v,nT with an audio feature vector o_a,nT. The visual feature vector o_v,nT comprises features extracted from a digital video sequence v(nT) showing the speaker's face, obtained by detecting and analyzing e.g. the lip movements and/or facial expressions of said speaker S_i, and the audio feature vector o_a,nT comprises features extracted from a recorded analog audio sequence s(t). Said audio sequence s(t) thereby represents the voice of said speaker S_i disturbed by the statistically distributed background noise

n'(t) = n(t) + s_int(t),    (1)

which comprises environmental noise n(t) and a weighted sum of the interfering voices of persons in the surroundings of said speaker S_i:

s_int(t) ∝ Σ_{j=1}^{N} a_j · s_j(t - T_j)   (for j ≠ i)    (2a)

with a_j = 1 / (4π · R_jM^2)   [m^-2].    (2b)

Here, N denotes the total number of speakers (including said speaker S_i), a_j is the attenuation factor of the interfering signal s_j(t) of the j-th speaker S_j in the surroundings of speaker S_i, T_j is the time delay of s_j(t), and R_jM denotes the distance between the j-th speaker and the microphone recording the audio signal s(t). Visual features are extracted by tracking the speaker's lip movements; they can then be analyzed and used for further processing. For this purpose, the bimodal perceptual user interface comprises a video camera pointed at the speaker's face for recording a digital video sequence v(nT) showing the lip movements and/or facial expressions of said speaker S_i; audio feature extraction and analysis means for determining acoustic-phonetic characteristics of the speaker's voice and pronunciation from the recorded audio sequence s(t); and visual feature extraction and analysis means for continuously or intermittently determining the current position of the speaker's face, tracking the lip movements and/or facial expressions of the speaker over subsequent images, and determining acoustic-phonetic characteristics of the speaker's voice and pronunciation from the detected lip movements and/or facial expressions.
According to the present invention, the extracted and analyzed visual features described above are fed to a noise reduction circuit which is needed to increase the signal-to-interference ratio (SIR) of the recorded audio signal s(t). Said noise reduction circuit is particularly adapted to perform near-speaker detection by separating the speaker's voice from said background noise on the basis of the obtained acoustic-phonetic speech characteristics, combined in the audio-visual feature vector

o_av,nT := [o_a,nT^T, o_v,nT^T]^T.    (3)

It outputs a voice activity indicator signal ŝ_i(nT), which is obtained by combining the speech activity estimates supplied by said audio feature extraction and analysis means and said visual feature extraction and analysis means.
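Equation (3) simply stacks the audio and the visual feature vector into one audio-visual observation per frame, and the indicator signal is obtained by combining the two speech activity estimates. A minimal illustrative sketch of both operations is given below; the linear weighting and the decision threshold are assumptions rather than values taken from the disclosure.

```python
import numpy as np

def combine_features(o_a, o_v):
    """Eq. (3): o_av := [o_a^T, o_v^T]^T for one frame."""
    return np.concatenate([np.asarray(o_a, dtype=float),
                           np.asarray(o_v, dtype=float)])

def speech_activity_indicator(p_audio, p_visual, w_audio=0.5, w_visual=0.5,
                              threshold=0.5):
    """Fuse an audio and a visual speech activity estimate (both in [0, 1])
    into a binary voice activity decision (assumed linear weighting)."""
    p_av = w_audio * p_audio + w_visual * p_visual
    return 1 if p_av > threshold else 0
```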
Brief description of the drawings
Advantageous features, aspects and useful embodiments of the invention will become evident from the following description, the appended claims and the accompanying drawings, in which:
Fig. 1 shows a noise reduction and speech activity recognition system with an audio-visual user interface, said system being particularly adapted to run a real-time tracking application that combines visual features o_v,nT, extracted from a digital video sequence v(nT) showing the speaker's face by detecting and analyzing the lip movements and/or facial expressions of the speaker S_i, with audio features o_a,nT, extracted from an analog audio sequence s(t) representing the voice of said speaker S_i disturbed by the statistically distributed background noise n'(t),
Fig. 2a is a block diagram showing a conventional telephone-based noise reduction and speech activity recognition system according to the prior art, which is based on an audio speech activity estimate,
Fig. 2b shows an example of a camera-enhanced telephone-based noise reduction and speech activity recognition system according to one embodiment of the present invention, which implements an audio-visual speech activity estimation algorithm,
Fig. 2c shows an example of a camera-enhanced telephone-based noise reduction and speech activity recognition system according to an alternative embodiment of the present invention, which implements an audio-visual speech activity estimation algorithm,
Fig. 3a shows a flow chart illustrating a near-speaker detection method for reducing the noise level of a detected analog audio sequence s(t) according to the embodiment of the invention depicted in Fig. 1,
Fig. 3b is a flow chart showing the near-speaker detection method according to the embodiment of the invention depicted in Fig. 2b, and
Fig. 3c is a flow chart showing the near-speaker detection method according to the embodiment of the invention depicted in Fig. 2c.
Detailed description of the invention
In the following, the different embodiments of the present invention depicted in Figs. 1, 2b, 2c and 3a-c are explained in more detail. The meanings of the reference signs and symbols used in Figs. 1 to 3c can be taken from the table below.
According to the first embodiment of the present invention shown in Fig. 1, the noise reduction and speech activity recognition system 100 comprises a noise reduction circuit 106, which is particularly adapted to reduce the background noise n'(t) received by the microphone 101a and to perform near-speaker detection by separating the speaker's voice from said background noise n'(t), and a multi-channel echo cancellation unit 108, which is particularly adapted to perform near-speaker detection and/or double-talk detection algorithms based on the acoustic-phonetic speech characteristics obtained by means of the aforementioned audio and visual feature extraction and analysis means 104a+b and 106b, respectively. Thereby, said acoustic-phonetic speech characteristics are based on estimates of the opening of the speaker's mouth as a measure of the acoustic energy of articulated vowels or diphthongs, on rapid movements of the speaker's lips as a hint of labial or labio-dental consonants (e.g. plosives, fricatives or affricates, voiced or unvoiced, respectively), and on phonetic characteristics detected from other statistical correlations between the position and movement of the lips of speaker S_i and his/her voice and pronunciation.
The above-mentioned noise reduction circuit 106 comprises: a digital signal processing means 106a for calculating the discrete signal spectrum S(k·Δf) corresponding to the analog-to-digital-converted version s(nT) of the recorded audio sequence s(t) by performing a fast Fourier transform (FFT); audio feature extraction and analysis means 106b (e.g. an amplitude detector) for detecting acoustic-phonetic characteristics of the speaker's voice and pronunciation from the recorded audio sequence s(t); a means 106c for estimating the noise power density spectrum Φ_nn(f) of the statistically distributed background noise n'(t) based on the result of the speaker detection procedure performed by said audio feature extraction and analysis means 106b; a subtracting unit 106d for subtracting a discretized version of the estimated noise power density spectrum from the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio sequence s(nT); and a digital signal processing means 106e for calculating the discrete time signal ŝ_i(nT) corresponding to the obtained difference signal by performing an inverse fast Fourier transform (IFFT).
The described noise reduction and speech activity recognition system 100 comprises: audio feature extraction and analysis means 106b for determining acoustic-phonetic characteristics (o_a,nT) of the speaker's voice and pronunciation from the recorded audio sequence s(t); and visual feature extraction and analysis means 104a+b for determining the current position of the speaker's face at a data rate of 1 frame/s, tracking the lip movements and/or facial expressions of said speaker S_i at a data rate of 15 frames/s, and determining acoustic-phonetic characteristics (o_v,nT) of the speaker's voice and pronunciation from the detected lip movements and/or facial expressions.
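The visual front end described above locates the face at a low rate (about 1 frame/s) and tracks the lip region at a higher rate (about 15 frames/s). The following sketch illustrates one plausible way to organize such a two-rate loop, using OpenCV's stock Haar-cascade face detector and simple frame differencing in the lower face region as a stand-in for a real lip-tracking algorithm; the cascade file location, the region proportions and the motion measure are assumptions and do not reproduce the patent's algorithm.

```python
import cv2
import numpy as np

def lip_motion_features(video_path, face_rate=1, track_rate=15):
    """Two-rate visual front end: relocate the face about once per second,
    measure mouth-region motion energy at the tracking rate."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 15.0
    face, prev_mouth, features = None, None, []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if frame_idx % int(max(fps // face_rate, 1)) == 0:
            found = cascade.detectMultiScale(gray, 1.3, 5)
            if len(found):
                face = found[0]                          # (x, y, w, h)
        if face is not None and frame_idx % int(max(fps // track_rate, 1)) == 0:
            x, y, w, h = face
            mouth = gray[y + 2 * h // 3:y + h, x:x + w]  # lower third of the face
            if prev_mouth is not None and mouth.shape == prev_mouth.shape:
                motion = float(np.mean(np.abs(
                    mouth.astype(np.float32) - prev_mouth.astype(np.float32))))
                features.append(motion)                  # crude lip-motion energy
            prev_mouth = mouth
        frame_idx += 1
    cap.release()
    return features
```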
As shown in Fig. 1, the described noise reduction system 200b/c can advantageously be used for video-telephony-based applications in a telecommunication system, running on a video-enabled phone 102 equipped with a built-in video camera 101b' pointed at the face of the speaker S_i participating in the video-telephony session.
Fig. 2b shows an example of a camera-enhanced telephone-based noise reduction and speech activity recognition system 200b according to one embodiment of the present invention, which implements a slow audio-visual speech activity estimation algorithm. Thereby, an audio speech activity estimate derived from the audio feature vector o_a,t supplied by said audio feature extraction and analysis means 106b is correlated with a further speech activity estimate, the latter being obtained by calculating the difference between the discrete signal spectrum S(k·Δf) and a sampled version of the estimated noise power density spectrum of the statistically distributed background noise n'(t). Said audio speech activity estimate is obtained by amplitude detection of the band-pass-filtered discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t).
Similar to the embodiment shown in Fig. 1, the noise reduction and speech activity recognition system 200b depicted in Fig. 2b comprises: audio feature extraction and analysis means 106b for determining acoustic-phonetic characteristics (o_a,nT) of the speaker's voice and pronunciation from the recorded audio sequence s(t); and visual feature extraction and analysis means 104' and 104'' for determining the current position of the speaker's face at a data rate of 1 frame/s, tracking the lip movements and facial expressions of said speaker S_i at a data rate of 15 frames/s, and determining acoustic-phonetic characteristics (o_v,nT) of the speaker's voice and pronunciation from the detected lip movements and/or facial expressions. Thereby, said audio feature extraction and analysis means 106b can simply be realized as an amplitude detector.
In addition to the components 106a-e described above with reference to Fig. 1, the noise reduction circuit 106 depicted in Fig. 2b comprises: a delay unit 204, which provides a delayed version of the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t); a first multiplier unit 107a for correlating (S9) the discrete signal spectrum S_τ(k·Δf) of the delayed version s(nT-τ) of the analog-to-digital-converted audio signal s(nT) with a visual speech activity estimate taken from the visual feature vector o_v,t supplied by the visual feature extraction and analysis means 104a+b and/or 104'+104'', thereby yielding a further estimate used for updating the estimate of the spectrum S_i(f) of the signal s_i(t) representing the speaker's voice and a further estimate used for updating the estimate of the noise power density spectrum Φ_nn(f) of the statistically distributed background noise n'(t); and a second multiplier unit 107 for correlating (S8a) the discrete signal spectrum S_τ(k·Δf) of the delayed version s(nT-τ) of the analog-to-digital-converted audio signal s(nT) with the audio speech activity estimate obtained by amplitude detection (S8b) of the band-pass-filtered discrete signal spectrum S(k·Δf), thereby yielding an estimate of the spectrum S_i(f) of the signal s_i(t) representing the speaker's voice and an estimate of the noise power density spectrum Φ_nn(f) of said background noise n'(t). A sample-and-hold (S&H) unit 106d' provides the sampled version of the estimated noise power density spectrum. The noise reduction circuit 106 further comprises a band-pass filter with adjustable cut-off frequencies for filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t); the cut-off frequencies can be adjusted according to the bandwidth of the estimated speech signal spectrum. A switch 106f is provided for selectively switching between a first and a second mode, so as to receive said speech signal s_i(t) either with or without applying the proposed audio-visual speech recognition method providing the noise-reduced speech signal ŝ_i(t). According to a further aspect of the invention, means (not shown) are provided for switching off said microphone 101a when the actual level of the speech activity indicator signal ŝ_i(nT) falls below a predefined threshold.
Fig. 2c shows an example of a camera-enhanced telephone-based noise reduction and speech activity recognition system 200c according to an alternative embodiment of the present invention, which implements a fast audio-visual speech activity estimation algorithm. In this circuit, the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) is correlated with an audio-visual speech activity estimate, and a further speech activity estimate is obtained by calculating the difference between a delayed version of the discrete signal spectrum S(k·Δf) and the sampled version of the estimated noise power density spectrum. The aforementioned audio-visual speech activity estimate is obtained from the audio-visual feature vector o_av,t, which results from combining the audio feature vector o_a,t supplied by said audio feature extraction and analysis means 106b with the visual feature vector o_v,t supplied by said visual speech activity detection module 104''.
In addition to the components described above with reference to Fig. 1, the noise reduction circuit 106 depicted in Fig. 2c comprises a summation unit 107c for adding (S11a) the audio speech activity estimate supplied by the audio feature extraction and analysis means 106b to the visual speech activity estimate supplied by the visual feature extraction and analysis means 104' and 104'', thereby yielding an audio-visual speech activity estimate; said audio feature extraction and analysis means determine acoustic-phonetic characteristics (o_a,nT) of the speaker's voice and pronunciation from the recorded audio sequence s(t), and said visual feature extraction and analysis means determine the current position of the speaker's face at a data rate of 1 frame/s, track the lip movements and facial expressions of said speaker S_i at a data rate of 15 frames/s, and determine acoustic-phonetic characteristics (o_v,nT) of the speaker's voice and pronunciation from the detected lip movements and/or facial expressions. The noise reduction circuit 106 further comprises a multiplier unit 107' for correlating (S11b) the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) with the audio-visual speech activity estimate obtained by combining the audio feature vector o_a,t supplied by said audio feature extraction and analysis means 106b with the visual feature vector o_v,t supplied by said visual speech activity detection module 104'', thereby yielding an estimate of the spectrum S_i(f) of the signal s_i(t) representing the speaker's voice and an estimate of the noise power density spectrum Φ_nn(f) of the statistically distributed background noise n'(t). A sample-and-hold (S&H) unit 106d' provides the sampled version of the estimated noise power density spectrum. The noise reduction circuit 106 also comprises a band-pass filter with adjustable cut-off frequencies for filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t); said cut-off frequencies can be adjusted according to the bandwidth of the estimated speech signal spectrum. A switch 106f is provided for selectively switching between a first and a second mode, so as to receive said speech signal s_i(t) either with or without applying the proposed audio-visual speech recognition method providing the noise-reduced speech signal ŝ_i(t). According to a further aspect of the invention, the noise reduction system 200c comprises means (not shown) for switching off said microphone 101a when the actual level of the speech activity indicator signal ŝ_i(nT) falls below a predetermined threshold.
A further embodiment of the present invention is the near-speaker detection method shown in the flow chart of Fig. 3a. The method serves to reduce the noise level of a recorded analog audio sequence s(t) disturbed by the statistically distributed background noise n'(t), wherein said audio sequence represents the voice of the speaker S_i. After subjecting the analog audio sequence s(t) to an analog-to-digital conversion (S1), the corresponding discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio sequence s(nT) is calculated (S2) by performing a fast Fourier transform (FFT), and the voice of said speaker S_i is detected (S3) in said signal spectrum S(k·Δf) by analyzing visual features extracted from a video sequence recorded simultaneously with the recording of the analog audio sequence s(t), which tracks the current position of the speaker's face and, over subsequent images, the lip movements and/or facial expressions of the speaker S_i. The noise power density spectrum Φ_nn(f) of the statistically distributed background noise n'(t) is then estimated (S4) based on the result of the speaker detection step (S3), after which a sampled version of the estimated noise power density spectrum is subtracted (S5) from the discrete spectrum S(k·Δf) of the analog-to-digital-converted audio sequence s(nT). Finally, the discrete time signal ŝ_i(nT) corresponding to the obtained difference signal, which represents a discrete version of the recognized speech signal, is calculated (S6) by performing an inverse fast Fourier transform (IFFT).
Optionally, a multi-channel echo cancellation algorithm can be performed (S7) based on the acoustic-phonetic speech characteristics obtained by an algorithm for extracting visual features from the video sequence that tracks the position of the speaker's face and, over subsequent images, the lip movements and/or facial expressions of the speaker S_i. This algorithm models the echo path impulse response by means of an adaptive finite impulse response (FIR) filter and subtracts the echo signal from the analog audio sequence s(t). Said multi-channel echo cancellation algorithm thereby performs a double-talk detection procedure.
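Step S7 models the echo path impulse response with an adaptive FIR filter and subtracts the echo replica from the recorded sequence. A single-channel NLMS sketch of this idea is given below (the patent refers to a multi-channel canceller); the filter length, step size and the simple double-talk freeze are assumptions, not values from the disclosure.

```python
import numpy as np

def nlms_echo_canceller(mic, far_end, taps=256, mu=0.5, eps=1e-6,
                        doubletalk_ratio=2.0):
    """NLMS adaptive FIR echo canceller: model the echo path impulse
    response and subtract the echo estimate from the microphone signal.
    `mic` and `far_end` are assumed to be float arrays of equal length."""
    mic = np.asarray(mic, dtype=float)
    far_end = np.asarray(far_end, dtype=float)
    w = np.zeros(taps)                       # FIR model of the echo path
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]        # most recent far-end samples
        echo_hat = np.dot(w, x)
        e = mic[n] - echo_hat                # error = mic minus echo estimate
        out[n] = e
        # crude double-talk guard: freeze adaptation when the near end is loud
        if abs(mic[n]) < doubletalk_ratio * (abs(echo_hat) + eps):
            w += mu * e * x / (np.dot(x, x) + eps)
    return out
```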
According to a further aspect of the invention, a learning procedure is applied which enhances the step of detecting (S3) the voice of said speaker S_i in the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted version s(nT) of the analog audio sequence s(t) by analyzing visual features extracted from a video sequence recorded simultaneously with the recording of the analog audio sequence s(t), which tracks the current position of the speaker's face and, over subsequent images, the lip movements and/or facial expressions of the speaker S_i.
In the embodiment of the present invention shown in the flow charts of Figs. 3a and 3b, a near-speaker detection method is proposed which is characterized by the step of correlating (S8a) the discrete signal spectrum S_τ(k·Δf) of the delayed version s(nT-τ) of the analog-to-digital-converted audio signal s(nT) with the audio speech activity estimate obtained by amplitude detection (S8b) of the band-pass-filtered discrete signal spectrum S(k·Δf), thereby yielding an estimate of the spectrum S_i(f) of the signal s_i(t) representing the speaker's voice and an estimate of the noise power density spectrum Φ_nn(f) of said background noise. Furthermore, the discrete signal spectrum S_τ(k·Δf) of the delayed version s(nT-τ) of the analog-to-digital-converted audio signal s(nT) is correlated (S9) with a visual speech activity estimate taken from the visual feature vector o_v,t supplied by the visual feature extraction and analysis means 104a+b and/or 104'+104'', thereby yielding a further estimate used for updating the estimate of the spectrum S_i(f) of the speech signal s_i(t) representing the speaker's voice and a further estimate used for updating the estimate of the noise power density spectrum Φ_nn(f) of the statistically distributed background noise n'(t). The noise reduction circuit 106 thereby provides a band-pass filter 204 for filtering the discrete signal spectrum of the analog-to-digital-converted audio signal s(t), wherein the cut-off frequencies of said band-pass filter 204 are adjusted (S10) according to the bandwidth of the estimated speech signal spectrum.
In the alternative embodiment of the present invention shown in the flow charts of Figs. 3a and 3c, a near-speaker detection method is proposed which is characterized by the step of adding (S11a) the audio speech activity estimate, obtained by amplitude detection of the band-pass-filtered discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t), to a visual speech activity estimate taken from the visual feature vector o_v,t supplied by said visual feature extraction and analysis means 104a+b and/or 104'+104'', thereby yielding an audio-visual speech activity estimate. According to this embodiment, the discrete signal spectrum S(k·Δf) is correlated (S11b) with the audio-visual speech activity estimate, thereby yielding an estimate of the spectrum S_i(f) of the signal s_i(t) representing the speaker's voice and an estimate of the noise power density spectrum Φ_nn(f) of the statistically distributed background noise n'(t). The cut-off frequencies of the band-pass filter 204 used for filtering the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t) are adjusted (S11c) according to the bandwidth of the estimated speech signal spectrum.
Finally, the invention also relates to the use of the noise reduction system 200b/c described above and of the corresponding near-speaker detection method for video-telephony-based applications (e.g. video conferencing) in a telecommunication system, wherein the application runs on a videophone equipped with a built-in camera 101b' pointed at the face of the speaker S_i participating in a video-telephony session. This particularly applies to scenarios in which several persons are sitting in a room equipped with a number of video cameras and microphones, such that the speakers' voices interfere with one another.
Table: Described features and their corresponding reference signs
Reference sign  Technical feature (system component or procedural step)
100  Noise reduction and speech activity recognition system with an audio-visual user interface, said system being particularly adapted to run a real-time lip-tracking application that combines visual features o_v,nT, extracted from a digital video sequence v(nT) showing the speaker's face by detecting and analyzing the lip movements and/or facial expressions of the speaker S_i, with audio features o_a,nT, extracted from an analog audio sequence s(t) representing the voice of said speaker S_i disturbed by the statistically distributed background noise n'(t); besides the signal representing the voice of said speaker S_i, said audio sequence s(t) comprises environmental noise n(t) and a weighted sum Σ_j a_j·s_j(t - T_j) (j ≠ i) of the interfering voices of persons in the surroundings of said speaker S_i
101a  Microphone for recording the analog audio sequence s(t) representing the voice of the speaker S_i disturbed by the statistically distributed background noise n'(t), which comprises environmental noise n(t) and a weighted sum Σ_j a_j·s_j(t - T_j) (with j ≠ i) of the interfering voices of persons in the surroundings of said speaker S_i
101a'  Analog-to-digital converter (ADC) for converting the analog audio sequence s(t) recorded by said microphone 101a into the digital domain
101b  Video camera pointed at the speaker's face, for recording a video sequence showing the lip movements and/or facial expressions of said speaker S_i
101b'  Video camera as described above, with an integrated analog-to-digital converter (ADC)
102  Videophone application for transmitting a video sequence showing the speaker's face and lip movements in subsequent images
104  Visual front end of the automatic audio-visual speech recognition system 100, which, when executed, derives additional visual features from the lip movements and/or facial expressions of a speaker S_i whose voice is disturbed by the statistically distributed background noise n'(t) and performs speech recognition and near-speaker detection by merging these features in a bimodal approach; the visual front end 104 comprises visual feature extraction and analysis means for continuously or intermittently determining the current position of the speaker's face, tracking the lip movements and/or facial expressions of the speaker S_i over subsequent images, and determining acoustic-phonetic characteristics of the speaker's voice and pronunciation from the detected lip movements and/or facial expressions
104'  Visual feature extraction module for continuously tracking the lip movements and/or facial expressions of the speaker S_i and determining acoustic-phonetic characteristics of the speaker's voice from the detected lip movements and/or facial expressions
(The remaining entries of the reference-sign table appear only as images in the original publication and are not reproduced here.)

Claims (15)

1. A noise reduction system with an audio-visual user interface, said system being particularly adapted to run an application for combining visual features (o_v,nT) extracted from a digital video sequence (v(nT)) showing the face of a speaker (S_i) with audio features (o_a,nT) extracted from an analog audio sequence (s(t)), wherein said audio sequence (s(t)) can comprise noise from the environment of said speaker (S_i), said noise reduction system (200b/c) comprising:
- means (101a, 106b) for detecting and analyzing said analog audio sequence (s(t)),
- means (101b') for detecting said video sequence (v(nT)), and
- means (104a+b, 104'+104'') for analyzing the detected video signal (v(nT)),
characterized in that
a noise reduction circuit (106) is adapted to separate the speaker's voice from said background noise (n'(t)) on the basis of the obtained combined speech characteristics ((o_av,nT) := [o_a,nT^T, o_v,nT^T]^T) and to output a speech activity indicator signal (ŝ_i(nT)) obtained by combining the speech activity estimates supplied by said analysis means (106b, 104a+b, 104'+104'').
2. A noise reduction system according to claim 1,
characterized by
means (SW) for switching off the audio channel when the actual level of said speech activity indicator signal (ŝ_i(nT)) falls below a predefined threshold.
3. A noise reduction system according to claim 1 or 2,
characterized by
a multi-channel echo cancellation unit (108) which is particularly adapted to perform near-speaker detection and double-talk detection algorithms based on the acoustic-phonetic speech characteristics obtained by said audio feature extraction and analysis means (106b) and said visual feature extraction and analysis means (104a+b, 104'+104'').
4. A noise reduction system according to any one of claims 1 to 3,
characterized in that
said audio feature extraction and analysis means (106b) is an amplitude detector.
5. A near-speaker detection method for reducing the noise level of a detected analog audio sequence (s(t)),
said method being characterized by the following steps:
- subjecting said analog audio sequence (s(t)) to an analog-to-digital conversion (S1),
- calculating (S2) the corresponding discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted audio sequence (s(nT)) by performing a fast Fourier transform (FFT),
- detecting (S3) the voice of the speaker (S_i) in said signal spectrum (S(k·Δf)) by analyzing visual features (o_v,nT) extracted from a video sequence (v(nT)) recorded simultaneously with the recording of the analog audio sequence (s(t)), which tracks the current position of the speaker's face and, over subsequent images, the lip movements and/or facial expressions of the speaker (S_i),
- estimating (S4) the noise power density spectrum (Φ_nn(f)) of the statistically distributed background noise based on the result of the speaker detection step (S3),
- subtracting (S5) a discretized version of the estimated noise power density spectrum from the discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted audio sequence (s(nT)), and
- calculating (S6) the discrete time signal (ŝ_i(nT)) corresponding to the obtained difference signal by performing an inverse fast Fourier transform (IFFT), thereby producing a discrete version of the recognized speech signal.
6. A near-speaker detection method according to claim 5,
characterized by the following step:
performing (S7) a multi-channel echo cancellation algorithm based on acoustic-phonetic speech characteristics obtained by an algorithm for extracting visual features (o_v,nT) from a video sequence (v(nT)) tracking the position of the speaker's face and, over subsequent images, the lip movements and/or facial expressions of the speaker (S_i), said multi-channel echo cancellation algorithm modelling the echo path impulse response by means of an adaptive finite impulse response (FIR) filter and subtracting the echo signal from the analog audio sequence (s(t)).
7. A near-speaker detection method according to claim 6,
characterized in that
said multi-channel echo cancellation algorithm performs a double-talk detection procedure.
8. A near-speaker detection method according to any one of claims 5 to 7,
characterized in that
said acoustic-phonetic speech characteristics are based on estimates of the opening of the speaker's mouth as a measure of the acoustic energy of articulated vowels or diphthongs, on rapid movements of the speaker's lips as a hint of labial or labio-dental consonants, respectively, and on phonetic characteristics detected from other correlations between the position and movement of the lips of said speaker (S_i) and his/her voice and pronunciation.
9. A near-speaker detection method according to any one of claims 5 to 8,
characterized by
a learning procedure for enhancing the step of detecting (S3) the voice of said speaker S_i in the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted version (s(nT)) of the analog audio sequence s(t) by analyzing visual features (o_v,nT) extracted from a video sequence (v(nT)) recorded simultaneously with the recording of the analog audio sequence (s(t)), which tracks the current position of the speaker's face and, over subsequent images, the lip movements and/or facial expressions of the speaker (S_i).
10. A near-speaker detection method according to any one of claims 5 to 9,
characterized by the following step:
correlating (S8a) the discrete signal spectrum (S_τ(k·Δf)) of the delayed version (s(nT-τ)) of the analog-to-digital-converted audio signal (s(nT)) with the audio speech activity estimate obtained by amplitude detection (S8b) of the band-pass-filtered discrete signal spectrum (S(k·Δf)), thereby yielding an estimate of the spectrum (S_i(f)) of the signal (s_i(t)) representing the speaker's voice and an estimate of the noise power density spectrum (Φ_nn(f)) of said statistically distributed background noise (n'(t)).
11. A near-speaker detection method according to claim 10,
characterized by the following step:
correlating (S9) the discrete signal spectrum (S_τ(k·Δf)) of the delayed version (s(nT-τ)) of the analog-to-digital-converted audio signal (s(nT)) with a visual speech activity estimate taken from the visual feature vector (o_v,t) supplied by the visual feature extraction and analysis means (104a+b, 104'+104''), thereby yielding a further estimate used for updating the estimate of the spectrum (S_i(f)) of the speech signal (s_i(t)) representing the speaker's voice and a further estimate used for updating the estimate of the noise power density spectrum (Φ_nn(f)) of the statistically distributed background noise (n'(t)).
12. A near-speaker detection method according to claim 10 or 11,
characterized by the following step:
adjusting (S10) the cut-off frequencies of the band-pass filter (204) used for filtering the discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted audio signal (s(t)) according to the bandwidth of the estimated speech signal spectrum.
13. A near-speaker detection method according to any one of claims 5 to 9,
characterized by the following steps:
- adding (S11a) the audio speech activity estimate, obtained by amplitude detection of the band-pass-filtered discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted audio signal (s(t)), to a visual speech activity estimate taken from the visual feature vector (o_v,t) supplied by said visual feature extraction and analysis means (104a+b, 104'+104''), thereby yielding an audio-visual speech activity estimate,
- correlating (S11b) the discrete signal spectrum (S(k·Δf)) with the audio-visual speech activity estimate, thereby yielding an estimate of the spectrum (S_i(f)) of the signal (s_i(t)) representing the speaker's voice and an estimate of the noise power density spectrum (Φ_nn(f)) of the statistically distributed background noise (n'(t)), and
- adjusting (S11c) the cut-off frequencies of the band-pass filter (204) used for filtering the discrete signal spectrum (S(k·Δf)) of the analog-to-digital-converted audio signal (s(t)) according to the bandwidth of the estimated speech signal spectrum.
14. Use of a noise reduction system (200b/c) according to any one of claims 1 to 4 and of a near-speaker detection method according to any one of claims 5 to 13 for a video-telephony-based application in a telecommunication system, said application running on a video-enabled phone having a built-in video camera (101b') pointed at the face of a speaker (S_i) participating in a video-telephony session.
15. A telecommunication device equipped with an audio-visual user interface,
characterized by
a noise reduction system (200b/c) according to any one of claims 1 to 4.
CNB200480002628XA 2003-01-24 2004-01-09 Noise reduction and audio-visual speech activity detection Expired - Fee Related CN100356446C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP03001637 2003-01-24
EP03001637.2 2003-01-24
EP03022561.9 2003-10-02

Publications (2)

Publication Number Publication Date
CN1742322A true CN1742322A (en) 2006-03-01
CN100356446C CN100356446C (en) 2007-12-19

Family

ID=36094003

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200480002628XA Expired - Fee Related CN100356446C (en) 2003-01-24 2004-01-09 Noise reduction and audio-visual speech activity detection

Country Status (3)

Country Link
CN (1) CN100356446C (en)
AT (1) ATE389934T1 (en)
DE (1) DE60319796T2 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101656070B (en) * 2008-08-22 2012-01-04 展讯通信(上海)有限公司 Voice detection method
CN112289340A (en) * 2020-11-03 2021-01-29 北京猿力未来科技有限公司 Audio detection method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002029784A1 (en) * 2000-10-02 2002-04-11 Clarity, Llc Audio visual speech processing
US6792107B2 (en) * 2001-01-26 2004-09-14 Lucent Technologies Inc. Double-talk detector suitable for a telephone-enabled PC
DE10120168A1 (en) * 2001-04-18 2002-10-24 Deutsche Telekom Ag Determining characteristic intensity values of background noise in non-speech intervals by defining statistical-frequency threshold and using to remove signal segments below

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682273A (en) * 2011-03-18 2012-09-19 夏普株式会社 Device and method for detecting lip movement
CN103325385A (en) * 2012-03-23 2013-09-25 杜比实验室特许公司 Method and device for speech communication and method and device for operating jitter buffer
US9912617B2 (en) 2012-03-23 2018-03-06 Dolby Laboratories Licensing Corporation Method and apparatus for voice communication based on voice activity detection
CN103325385B (en) * 2012-03-23 2018-01-26 杜比实验室特许公司 Voice communication method and equipment, the method and apparatus of operation wobble buffer
CN102646418A (en) * 2012-03-29 2012-08-22 北京华夏电通科技股份有限公司 Method and system for eliminating multi-channel acoustic echo of remote voice frequency interaction
CN102646418B (en) * 2012-03-29 2014-07-23 北京华夏电通科技股份有限公司 Method and system for eliminating multi-channel acoustic echo of remote voice frequency interaction
CN110853667B (en) * 2013-01-29 2023-10-27 弗劳恩霍夫应用研究促进协会 audio encoder
CN110853667A (en) * 2013-01-29 2020-02-28 弗劳恩霍夫应用研究促进协会 Audio encoder
CN103617801B (en) * 2013-12-18 2017-09-29 联想(北京)有限公司 Speech detection method, device and electronic equipment
CN103617801A (en) * 2013-12-18 2014-03-05 联想(北京)有限公司 Voice detection method and device and electronic equipment
CN105321523A (en) * 2014-07-23 2016-02-10 中兴通讯股份有限公司 Noise inhibition method and device
CN104133404B (en) * 2014-07-23 2016-09-07 株洲南车时代电气股份有限公司 A kind of signal processing method and device
CN104133404A (en) * 2014-07-23 2014-11-05 株洲南车时代电气股份有限公司 Method and device for processing signal
WO2015117403A1 (en) * 2014-07-23 2015-08-13 中兴通讯股份有限公司 Noise suppression method and apparatus, computer program and computer storage medium
CN104537227B (en) * 2014-12-18 2017-06-30 中国科学院上海高等研究院 Transformer station's noise separation method
CN107004405A (en) * 2014-12-18 2017-08-01 三菱电机株式会社 Speech recognition equipment and audio recognition method
CN104537227A (en) * 2014-12-18 2015-04-22 中国科学院上海高等研究院 Transformer substation noise separation method
CN106155707A (en) * 2015-03-23 2016-11-23 联想(北京)有限公司 Information processing method and electronic equipment
CN106155707B (en) * 2015-03-23 2020-02-21 联想(北京)有限公司 Information processing method and electronic equipment
CN104991754B (en) * 2015-06-29 2018-03-16 小米科技有限责任公司 The way of recording and device
CN104991754A (en) * 2015-06-29 2015-10-21 小米科技有限责任公司 Recording method and apparatus
CN106531155A (en) * 2015-09-10 2017-03-22 三星电子株式会社 Apparatus and method for generating acoustic model, and apparatus and method for speech recognition
CN106531155B (en) * 2015-09-10 2022-03-15 三星电子株式会社 Apparatus and method for generating acoustic model and apparatus and method for speech recognition
CN106443071B (en) * 2016-09-20 2019-09-13 中国科学院上海微系统与信息技术研究所 The extracting method of the identifiable high-range acceleration transducer resonant frequency of noise
CN106443071A (en) * 2016-09-20 2017-02-22 中国科学院上海微系统与信息技术研究所 Method for extracting noise-recognizable high-range acceleration sensor resonant frequency
CN111052232A (en) * 2017-07-03 2020-04-21 耶路撒冷希伯来大学伊森姆研究发展有限公司 Method and system for enhancing speech signals of human speakers in video using visual information
CN108521516A (en) * 2018-03-30 2018-09-11 百度在线网络技术(北京)有限公司 Control method and device for terminal device
CN109040641A (en) * 2018-08-30 2018-12-18 维沃移动通信有限公司 A kind of video data synthetic method and device
CN109040641B (en) * 2018-08-30 2020-10-16 维沃移动通信有限公司 Video data synthesis method and device
CN111768760A (en) * 2020-05-26 2020-10-13 云知声智能科技股份有限公司 Multi-mode voice endpoint detection method and device
CN111768760B (en) * 2020-05-26 2023-04-18 云知声智能科技股份有限公司 Multi-mode voice endpoint detection method and device
CN111899723A (en) * 2020-08-28 2020-11-06 北京地平线机器人技术研发有限公司 Voice activation state detection method and device

Also Published As

Publication number Publication date
CN100356446C (en) 2007-12-19
DE60319796T2 (en) 2009-05-20
ATE389934T1 (en) 2008-04-15
DE60319796D1 (en) 2008-04-30

Similar Documents

Publication Publication Date Title
CN100356446C (en) Noise reduction and audio-visual speech activity detection
US7684982B2 (en) Noise reduction and audio-visual speech activity detection
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
Kim et al. An algorithm that improves speech intelligibility in noise for normal-hearing listeners
CN107293286B (en) Voice sample collection method based on network dubbing game
CN107799126A (en) Sound end detecting method and device based on Supervised machine learning
CN105513605A (en) Voice enhancement system and method for cellphone microphone
CN102324232A (en) Method for recognizing sound-groove and system based on gauss hybrid models
CN102723078A (en) Emotion speech recognition method based on natural language comprehension
CN108922541A (en) Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
CN108564956B (en) Voiceprint recognition method and device, server and storage medium
CN107274887A (en) Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN107945793A (en) A kind of voice-activation detecting method and device
CN110277087A (en) A kind of broadcast singal anticipation preprocess method
Wand et al. Analysis of phone confusion in EMG-based speech recognition
CN111489763A (en) Adaptive method for speaker recognition in complex environment based on GMM model
Kekre et al. Speaker recognition using Vector Quantization by MFCC and KMCG clustering algorithm
Fraile et al. Mfcc-based remote pathology detection on speech transmitted through the telephone channel-impact of linear distortions: Band limitation, frequency response and noise
CN115359804B (en) Directional audio pickup method and system based on microphone array
Zhang et al. Microphone array processing for distance speech capture: A probe study on whisper speech detection
CN115641839A (en) Intelligent voice recognition method and system
CN112992131A (en) Method for extracting ping-pong command of target voice in complex scene
Jahanirad et al. Blind source computer device identification from recorded VoIP calls for forensic investigation
Shahrul Azmi et al. Noise robustness of Spectrum Delta (SpD) features in Malay vowel recognition
Chandra Hindi vowel classification using QCN-PNCC features

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20071219

Termination date: 20160109

CF01 Termination of patent right due to non-payment of annual fee