CN115132219A - Speech recognition method and system based on quadratic spectral subtraction under complex noise background - Google Patents

Speech recognition method and system based on quadratic spectral subtraction under complex noise background

Info

Publication number
CN115132219A
Authority
CN
China
Prior art keywords
noise
audio
historical
current frame
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210711617.9A
Other languages
Chinese (zh)
Inventor
邵鹏志
谢志豪
王乃正
孟英谦
彭龙
李胜昌
宋彪
邬书豪
李泽宇
张世超
魏中锐
任智颖
葛祥雨
胡明哲
霸建民
高圣楠
张敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China North Computer Application Technology Research Institute
Original Assignee
China North Computer Application Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China North Computer Application Technology Research Institute
Priority to CN202210711617.9A
Publication of CN115132219A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/04: Segmentation; Word boundary detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention relates to a speech recognition method and system based on quadratic (secondary) spectral subtraction under a complex noise background, and belongs to the technical field of speech enhancement. The method comprises: selecting noisy historical audio and pure noise audio recorded under a complex noise background, and computing a historical noise estimate from them; framing the audio to be recognized under the complex noise background to obtain multiple frames of audio; and processing each frame in sequence by determining a historical noise removal factor and a current-frame noise removal factor based on the historical noise estimate and the noise estimate of the current frame audio, and performing secondary spectral subtraction on the current frame audio to obtain its noise-reduced speech spectrum. The method addresses the problem that, for complex real-world background noise, the prior art cannot reduce the residual noise to a low level.

Description

Speech recognition method and system based on quadratic spectral subtraction under complex noise background
Technical Field
The invention belongs to the technical field of speech enhancement, and particularly relates to a speech recognition method and system based on secondary spectral subtraction under a complex noise background.
Background
Spectral subtraction is a speech enhancement algorithm. Speech enhancement is a technique for extracting the useful speech signal from a noisy speech signal, and for suppressing and reducing noise interference, when the speech signal is disturbed or even buried by various noises. However, noise interference is usually random, and it is almost impossible to extract a completely clean speech signal from noisy speech. Speech enhancement therefore has two main purposes: first, to improve speech quality by eliminating background noise, which is a subjective measure; second, to improve the performance of speech tasks such as speech recognition and speaker recognition, which is an objective measure. These two objectives often cannot both be met: some speech enhancement algorithms significantly reduce background noise and improve speech quality, yet fail to improve speech recognition performance and may even degrade it slightly.
Spectral subtraction is a traditional and efficient speech enhancement algorithm for wideband noise. Its basic idea is to subtract the noise power spectrum from that of the noisy speech signal, under the assumption that the additive noise and the short-time stationary speech signal are independent of each other, so as to obtain a cleaner speech spectrum. Its main advantage is low computational complexity, which makes it suitable for real-time processing. Its disadvantage is that the processed signal retains a relatively large amount of residual noise, referred to as musical noise.
To attenuate the musical noise caused by spectral subtraction, Berouti proposed an over-subtraction algorithm that uses a noise removal factor to reduce the amplitude of the wideband spectral peaks left by spectral subtraction, and fills the spectral valleys (negative results of the subtraction) with a minimum audio energy, thereby controlling the amount of residual noise and the level of musical noise. The over-subtraction expression is as follows:
Figure BDA0003708270550000021
where $P_y(\omega)$, $P_s(\omega)$ and $P_n(\omega)$ denote the power spectra of the noisy signal, the clean speech signal and the noise signal, respectively; $\alpha$ is the noise removal factor, i.e. the coefficient applied to the noise spectrum that is subtracted from the audio spectrum; and $b$ represents the minimum audio energy retained in the audio.
Both spectral subtraction and over-subtraction assume a stationary background noise environment, i.e. that noise affects all spectral components of speech equally. However, real-world background noise varies over time, and different interfering noises affect each frequency band of speech differently, so over-subtraction still cannot reduce the residual noise to a low level.
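The over-subtraction rule described above can be sketched in a few lines. This is a minimal illustration, not the patent's exact expression (which is given only as a figure): the function name `over_subtract` is hypothetical, and taking the floor as `b * p_noise` is an assumption borrowed from the commonly cited form of Berouti's method.

```python
import numpy as np

def over_subtract(p_noisy, p_noise, alpha=3.0, b=0.02):
    """Over-subtraction on power spectra with a spectral floor.

    p_noisy, p_noise: power spectra of the noisy signal and of the noise estimate.
    alpha: noise removal (over-subtraction) factor.
    b: floor factor; using b * p_noise as the floor is an assumption of this sketch.
    """
    p_noisy = np.asarray(p_noisy, dtype=float)
    p_noise = np.asarray(p_noise, dtype=float)
    diff = p_noisy - alpha * p_noise
    floor = b * p_noise
    # Keep the subtracted value where it stays above the floor,
    # otherwise fill the spectral valley with the floor.
    return np.where(diff > floor, diff, floor)

# Two frequency bins: the first survives subtraction, the second is floored.
p_clean = over_subtract([10.0, 2.0], [1.0, 1.0], alpha=3.0, b=0.02)
```

With a single fixed `alpha` and `b`, every frame is treated alike, which is the limitation the following sections address by making the factors frame-dependent.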
Disclosure of Invention
In view of the above analysis, the present invention aims to provide a speech recognition method based on quadratic spectral subtraction under a complex noise background. The method estimates noise from noisy historical audio and from the current audio, computes a historical noise removal factor and a current-frame noise removal factor, and performs secondary spectral subtraction on the noisy audio to be recognized, thereby addressing the problem that real-world background noise is complex and the prior art cannot reduce the residual noise to a low level.
In one aspect, the invention provides a speech recognition method based on secondary spectral subtraction under a complex noise background, which specifically comprises the following steps:
obtaining a historical noise estimate of the complex noise background based on noisy historical audio and pure noise audio under the complex noise background;
framing the audio to be recognized under the complex noise background to obtain multiple frames of audio;
processing each frame of audio in sequence to obtain noise-reduced speech; wherein processing the current frame audio includes: performing secondary spectral subtraction on the current frame audio based on the historical noise estimate and the noise estimate of the current frame audio to obtain the noise-reduced speech spectrum of the current frame audio.
Further, obtaining the historical noise estimate of the complex noise background based on the noisy historical audio and the pure noise audio under the complex noise background includes:
framing each piece of noisy historical audio, and processing it to obtain the power spectrum of each frame signal;
selecting, within each piece of audio, the preset number of frames with the lowest power spectrum as pure noise, and estimating the average per-frame noise power spectrum $B_i(\omega)$ of each piece of noisy historical audio, where $i = 1, 2, 3, \dots, n$ and $n$ is the number of pieces of noisy historical audio;
framing each piece of pure noise audio, and processing it to obtain its average per-frame noise power spectrum $B_j(\omega)$, where $j = 1, 2, 3, \dots, k$ and $k$ is the number of pieces of pure noise audio;
averaging $B_i(\omega)$ and $B_j(\omega)$ to obtain the historical noise estimate
Figure BDA0003708270550000031
Further, the noise estimation of the current frame audio includes:
selecting the preset number of frames with the lowest power spectrum in the audio to be recognized as pure noise, and estimating the average per-frame noise power spectrum of the audio to be recognized
Figure BDA0003708270550000032
that is, the noise estimate of the current frame audio.
Further, performing secondary spectral subtraction on the current frame audio by using the following formula to obtain a noise-reduced speech spectrum of the current frame audio:
Figure BDA0003708270550000033
where
Figure BDA0003708270550000034
denotes the power spectrum estimate of the current frame audio; $Y_{n+1}(\omega, m)$ denotes the spectrum of the current frame audio; $\psi_{n+1}(\omega, m)$ denotes the phase information of the current frame audio; $\alpha_m$ and $\beta_m$ are the historical noise removal factor and the current-frame audio noise removal factor, respectively; and $b_m$ is the minimum spectral factor of the audio signal.
Further, $\alpha_m$, $\beta_m$ and $b_m$ are calculated by the following formulas:
Figure BDA0003708270550000041
Figure BDA0003708270550000042
Figure BDA0003708270550000043
where $c$ is a constant; $\xi_m$ is the frequency-domain posterior signal-to-noise ratio of the current frame audio signal; $\alpha_{\min}$ and $\alpha_{\max}$ denote the minimum and maximum values of $\alpha_m$; $\beta_{\min}$ and $\beta_{\max}$ denote the minimum and maximum values of $\beta_m$; and $b_{\min}$ and $b_{\max}$ denote the minimum and maximum values of $b_m$.
Further, $\xi_m$ is calculated by the following formula:
Figure BDA0003708270550000044
where $k$ is the frequency bin index, $\sum_k |Y_{n+1}(\omega_k, m)|$ denotes the spectral intensity of the current frame audio, and
Figure BDA0003708270550000045
denotes the spectral intensity of the historical noise estimate.
Further, $\alpha_m$, $\beta_m$ and $b_m$ are limited to maximum and minimum values, including $\alpha_{\max} = 3$, $\alpha_{\min} = 1$, $\beta_{\max} = 3$, $\beta_{\min} = 1$, $b_{\max} = 0.1$, $b_{\min} = 0.02$.
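As an illustration of the bounds just listed, the sketch below clamps candidate factor values to the stated ranges and computes a posterior signal-to-noise ratio from summed spectral magnitudes. The function names are hypothetical. The mapping from $\xi_m$ to $\alpha_m$, $\beta_m$ and $b_m$ is given only in figures and is deliberately not reproduced; treating $\xi_m$ as a plain ratio (rather than, say, a value in decibels) is an assumption.

```python
import numpy as np

# Bounds taken from the text: alpha and beta in [1, 3], b in [0.02, 0.1].
LIMITS = {"alpha": (1.0, 3.0), "beta": (1.0, 3.0), "b": (0.02, 0.1)}

def clamp_factor(name, raw_value):
    """Limit a raw factor value to its [min, max] range."""
    lo, hi = LIMITS[name]
    return max(lo, min(hi, raw_value))

def posterior_snr(frame_spectrum, hist_noise_magnitude):
    """Summed magnitude of the current frame over summed magnitude of the
    historical noise estimate, across frequency bins k (assumed form)."""
    num = float(np.sum(np.abs(frame_spectrum)))
    den = float(np.sum(np.abs(hist_noise_magnitude)))
    return num / max(den, 1e-12)  # guard against an all-zero noise estimate

xi_m = posterior_snr(np.array([2.0, -2.0]), np.array([1.0, 1.0]))
alpha_m = clamp_factor("alpha", 4.2)  # above the range, so it is cut to 3.0
```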
In another aspect, the invention also provides a speech recognition system based on quadratic spectral subtraction under a complex noise background, which comprises:
a historical noise estimation module, which obtains the historical noise estimate of the complex noise background based on the noisy historical audio and the pure noise audio under the complex noise background;
a noise reduction processing module for the audio to be recognized, which frames the audio to be recognized under the complex noise background to obtain multiple frames of audio, and processes each frame of audio in sequence to obtain noise-reduced speech; wherein processing the current frame audio includes: performing secondary spectral subtraction on the current frame audio based on the historical noise estimate and the noise estimate of the current frame audio to obtain the noise-reduced speech spectrum of the current frame audio.
Further, the historical noise estimation module includes the following modules:
a noisy historical audio processing module: framing each piece of noisy historical audio, and processing it to obtain the power spectrum of each frame signal; selecting, within each piece of audio, the preset number of frames with the lowest power spectrum as pure noise, and estimating the average per-frame noise power spectrum $B_n(\omega)$ of each piece of noisy historical audio, where $n$ denotes the $n$-th piece of noisy historical audio;
a pure noise audio processing module: framing each piece of pure noise audio, and processing it to obtain its average per-frame noise power spectrum $B_k(\omega)$, where $k$ denotes the $k$-th piece of pure noise audio;
a noise estimation module: averaging $B_n(\omega)$ and $B_k(\omega)$ to obtain the historical noise estimate
Figure BDA0003708270550000051
Further, the audio noise reduction processing module to be identified includes:
a current audio noise estimation module: framing the audio to be recognized under the complex noise background to obtain multiple frames of audio, selecting the preset number of frames with the lowest power spectrum as pure noise, and estimating the average per-frame noise power spectrum of the audio to be recognized
Figure BDA0003708270550000052
that is, the noise estimate of the current frame audio;
a noise reduction processing module: and performing secondary spectrum subtraction on the current frame audio based on the historical noise estimation and the noise estimation of the current frame audio to obtain a voice spectrum of the current frame audio after noise reduction.
The invention can realize at least one of the following beneficial effects:
1. By acquiring a large number of historical noisy audio recordings from complex noise backgrounds in different scenes and performing weighted averaging of their noise to obtain the historical noise estimate, and by extracting specific frames from the current audio to be recognized for current noise estimation, the invention addresses the problem that the noise estimate deviates from the actual noise in the audio to be recognized owing to differences among the complex noise backgrounds of different scenes.
2. The historical noise removal factor and the current-frame audio noise removal factor are determined by calculation, and secondary spectral subtraction is performed on the noisy audio to be recognized, which improves the accuracy of noise reduction for the audio to be recognized.
3. By iterating the noise estimate of each processed audio into the historical noise estimate, the historical noise estimate is continuously optimized, which further improves the accuracy of noise reduction by secondary spectral subtraction.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram showing the detailed steps of the method of the present invention;
FIG. 3 is a comparison of waveforms before and after processing a piece of audio to be processed according to an embodiment of the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
Method embodiment
Example 1
This embodiment discloses a speech recognition method based on quadratic spectral subtraction under a complex noise background, which comprises the following steps:
Step S1: selecting noisy historical audio and pure noise audio under a complex noise background, and computing the historical noise estimate.
The complex noise background means that the speech signal environment contains several different noises that are present regardless of whether a speech signal exists and that change continuously; noisy historical audio refers to audio containing both complex noise and a clean speech signal, and pure noise audio refers to noise audio captured under the complex noise background that contains no speech signal. The method is suitable for steady-state noise speech enhancement scenarios in the military field, such as speech recognition while an armored vehicle is operating; in that case, the noisy historical audio is audio containing both the noise generated by the operating armored vehicle and a clean speech signal, and the pure noise audio is the noise generated by the operating armored vehicle.
Specifically, step S1 includes the following steps:
Step S11: selecting $n$ pieces of noisy historical audio, indexed by $i = 1, 2, 3, \dots, n$, and framing the $i$-th piece; illustratively, each frame is 25 ms long with a frame shift of 10 ms. $X(i, m)$, $Y(i, m)$ and $B(i, m)$ denote the clean audio, the noisy audio and the noise signal at the $m$-th frame of the $i$-th piece of audio, respectively, and the corresponding spectra are denoted $X(\omega, m)$, $Y(\omega, m)$ and $B(\omega, m)$;
Step S12: performing a short-time Fourier transform on each frame of audio to obtain the power spectrum $X_i(\omega, m)$ and the phase spectrum $\psi_i(\omega, m)$ of each frame signal;
Step S13: selecting the 100 frames with the lowest power spectrum $X_i(\omega, m)$ over the entire audio as pure noise, and estimating the average per-frame noise power spectrum $B_i(\omega)$ of that audio;
Step S14: selecting $k$ pieces of pure noise audio, indexed by $j = 1, 2, 3, \dots, k$, framing the $j$-th piece with 25 ms frames and a 10 ms frame shift, performing a short-time Fourier transform on each frame of audio, and calculating a weighted power spectrum to obtain the average per-frame noise power spectrum $B_j(\omega)$;
Step S15: averaging $B_i(\omega)$ and $B_j(\omega)$ to obtain the historical noise estimate
Figure BDA0003708270550000071
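Steps S11 to S15 can be sketched as follows. This is a rough illustration under stated assumptions rather than the patent's implementation: all function names are hypothetical, the Hann window is an assumed choice (the text names none), and both the per-recording average and the final combination use equal weights because the weighting scheme is not specified.

```python
import numpy as np

def frame_power_spectra(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames and return per-frame power spectra."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hanning(frame_len)  # assumed window; not specified in the text
    frames = np.stack([signal[i * shift:i * shift + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def noise_from_quietest_frames(power, n_quiet=100):
    """Average the n_quiet frames with the lowest total power (step S13)."""
    n_quiet = min(n_quiet, power.shape[0])
    order = np.argsort(power.sum(axis=1))
    return power[order[:n_quiet]].mean(axis=0)

def historical_noise_estimate(noisy_specs, pure_noise_specs):
    """Equal-weight mean of the per-recording spectra B_i and B_j (step S15)."""
    return np.mean(np.stack(list(noisy_specs) + list(pure_noise_specs)), axis=0)

# Synthetic stand-ins for one noisy recording and one pure noise recording.
rng = np.random.default_rng(0)
sr = 8000
noisy = rng.normal(size=sr)
pure = rng.normal(size=sr)
b_i = noise_from_quietest_frames(frame_power_spectra(noisy, sr), n_quiet=20)
b_j = frame_power_spectra(pure, sr).mean(axis=0)
b_hist = historical_noise_estimate([b_i], [b_j])
```

Selecting the quietest frames is a simple stand-in for voice activity detection: it assumes that the lowest-power frames of a noisy recording contain no speech.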
Step S2: framing the audio to be recognized under the complex noise background, and processing each frame of audio in sequence to obtain noise-reduced speech, specifically comprising the following steps:
Step S21: assigning the audio to be recognized the serial number $n+1$, setting $i = n+1$, performing the same operations as in steps S11, S12 and S13, and estimating the average per-frame noise power spectrum of the audio to be recognized
Figure BDA0003708270550000081
Step S22: calculating the parameters required for performing secondary spectral subtraction on the current frame audio, namely the values of $\xi_m$, $\alpha_m$, $\beta_m$ and $b_m$; where $\xi_m$ is the frequency-domain posterior signal-to-noise ratio of the current frame audio signal; $\alpha_m$ and $\beta_m$ are the historical noise removal factor and the current-frame audio noise removal factor, respectively; and $b_m$ is the minimum spectral factor of the audio signal;
Figure BDA0003708270550000082
where $k$ is the frequency bin index, $\sum_k |Y_{n+1}(\omega_k, m)|$ denotes the spectral intensity of the current frame audio, and
Figure BDA0003708270550000083
denotes the spectral intensity of the historical noise estimate;
Figure BDA0003708270550000084
Figure BDA0003708270550000085
Figure BDA0003708270550000086
Step S23: performing secondary spectral subtraction on the current frame of the audio to be recognized by using the following formula to obtain the noise-reduced speech spectrum of the current frame audio:
Figure BDA0003708270550000087
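One plausible reading of step S23 is sketched below. The exact expression is in the figure and is not reproduced here, so the form used (subtract both scaled noise estimates from the frame power, floor the result, then reattach the original phase) is an assumption, as is the choice of floor `b_m * power`. The function name `two_stage_subtract` is hypothetical.

```python
import numpy as np

def two_stage_subtract(frame_spectrum, hist_noise, cur_noise, alpha_m, beta_m, b_m):
    """Subtract historical and current noise power from one frame (assumed form).

    frame_spectrum: complex spectrum Y_{n+1}(w, m) of the current frame.
    hist_noise, cur_noise: historical and current noise power spectra.
    """
    power = np.abs(frame_spectrum) ** 2
    phase = np.angle(frame_spectrum)  # psi_{n+1}(w, m), reused unchanged
    diff = power - alpha_m * hist_noise - beta_m * cur_noise
    floor = b_m * power  # assumption: floor relative to the noisy frame power
    est_power = np.where(diff > floor, diff, floor)
    return np.sqrt(est_power) * np.exp(1j * phase)

# Two bins: the first keeps most of its power, the second falls back to the floor.
out = two_stage_subtract(np.array([4 + 0j, 1j]),
                         hist_noise=np.array([2.0, 2.0]),
                         cur_noise=np.array([1.0, 1.0]),
                         alpha_m=2.0, beta_m=1.0, b_m=0.1)
```

Reusing the noisy phase is standard practice in spectral subtraction and matches the text's use of $\psi_{n+1}(\omega, m)$.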
It should be noted that, the next time audio is to be recognized, the current audio to be recognized is added to the noisy historical audio as the $(n+1)$-th piece, and its noise estimate
Figure BDA0003708270550000088
is iterated into the historical noise estimate
Figure BDA0003708270550000089
thereby optimizing the historical noise estimate.
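The text does not give the update rule for this iteration. One simple possibility, shown here purely as an assumption, is an incremental equal-weight mean over all recordings seen so far; the function name `update_historical_estimate` and the `count` bookkeeping are hypothetical.

```python
import numpy as np

def update_historical_estimate(hist_noise, count, new_noise):
    """Fold one new noise spectrum into a running mean over `count` spectra.

    Returns the updated estimate and the new count. Equal weighting is an
    assumption; the source does not state the update rule.
    """
    hist_noise = np.asarray(hist_noise, dtype=float)
    new_noise = np.asarray(new_noise, dtype=float)
    updated = hist_noise + (new_noise - hist_noise) / (count + 1)
    return updated, count + 1

# A historical mean over 3 spectra absorbs a 4th one.
b_hist_new, n_seen = update_historical_estimate([2.0, 4.0], 3, [6.0, 8.0])
```

An incremental mean avoids storing every past spectrum; an exponential moving average would be an alternative if recent recordings should count for more.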
Example 2
This embodiment further discloses a speech recognition method based on quadratic spectral subtraction under a complex noise background, which comprises the following steps:
Step S4: selecting, as noisy historical audio, 100 noisy speech recordings sampled on site at an armored vehicle training ground, each with a sampling rate of 44 kHz and a length of 1 minute; the background noise scenes include: engine noise alone while the armored vehicle is stationary, engine noise while the vehicle accelerates, noise from the tracks colliding with the wheels or the ground while the vehicle is moving, vehicle noise during deceleration, and the like; selecting, as pure noise audio, 300 armored vehicle noise recordings sampled on site at the training ground, each with a sampling rate of 44 kHz and a length of 1 minute; and computing the historical noise estimate, specifically comprising the following steps:
Step S41: framing each piece of noisy audio with 25 ms frames and a 10 ms frame shift; $X(i, m)$, $Y(i, m)$ and $B(i, m)$ denote the clean audio, the noisy audio and the noise signal at the $m$-th frame of the $i$-th piece of audio ($i = 1, 2, 3, \dots, 100$), respectively, and the corresponding spectra are denoted $X(\omega, m)$, $Y(\omega, m)$ and $B(\omega, m)$;
Step S42: performing a short-time Fourier transform on each frame of audio to obtain the power spectrum $X_i(\omega, m)$ and the phase spectrum $\psi_i(\omega, m)$ of each frame signal;
Step S43: selecting the 100 frames with the lowest power spectrum $X_i(\omega, m)$ over the entire audio as pure noise, and estimating the average per-frame noise power spectrum $B_i(\omega)$ of that audio;
Step S44: framing each piece of pure noise audio with 25 ms frames and a 10 ms frame shift, performing a short-time Fourier transform on each frame of audio, and calculating a weighted power spectrum to obtain the average per-frame noise power spectrum $B_j(\omega)$ ($j = 1, 2, 3, \dots, 80$);
Step S45: averaging $B_i(\omega)$ and $B_j(\omega)$ to obtain the historical noise estimate
Figure BDA0003708270550000091
Step S5: selecting, as audio to be recognized, 50 noisy speech recordings sampled on site at an armored vehicle training ground, each with a sampling rate of 44 kHz and a length of 1 minute; framing the audio to be recognized, and processing each frame of audio in sequence to obtain noise-reduced speech, specifically comprising the following steps:
Step S51: assigning the current audio to be recognized the serial number $n+1$, setting $i = n+1$, performing steps S41, S42 and S43, and estimating the average per-frame noise power spectrum of the audio to be recognized
Figure BDA0003708270550000101
Step S52: calculating the parameters required for performing secondary spectral subtraction on the current frame audio, namely the values of $\xi_m$, $\alpha_m$, $\beta_m$ and $b_m$; where $\xi_m$ is the frequency-domain posterior signal-to-noise ratio of the current frame audio signal; $\alpha_m$ and $\beta_m$ are the historical noise removal factor and the current-frame audio noise removal factor, respectively; and $b_m$ is the minimum spectral factor of the audio signal;
Figure BDA0003708270550000102
where $k$ is the frequency bin index, $\sum_k |Y_{n+1}(\omega_k, m)|$ denotes the spectral intensity of the current frame audio, and
Figure BDA0003708270550000103
denotes the spectral intensity of the historical noise estimate;
Figure BDA0003708270550000104
Figure BDA0003708270550000105
Figure BDA0003708270550000106
Step S53: performing secondary spectral subtraction on the current frame of the audio to be recognized by using the following formula to obtain the noise-reduced speech spectrum of the current frame audio:
Figure BDA0003708270550000107
Step S54: iterating the noise estimate of the current audio to be recognized
Figure BDA0003708270550000108
into the historical noise estimate
Figure BDA0003708270550000109
thereby optimizing the historical noise estimate;
Step S55: performing steps S51, S52, S53 and S54 on the next piece of audio to be recognized, until all 50 pieces of speech audio to be recognized have been processed.
The noise reduction effect of this embodiment is verified as follows:
The speaker's speech in the 50 pieces of audio to be recognized is labeled with an annotation tool using 10 ms time slots: each slot is labeled 1 if speech is present and 0 otherwise. Each audio file corresponds to one label file.
Because the length of each piece of audio is fixed, the noise-reduced result has the same length as the label data. The result data is compared with the label data, the number $a$ of slots whose labels differ is counted, and, with $b$ the total number of labeled slots of the audio, the error rate is $e = a/b$.
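The error rate $e = a/b$ can be computed directly from two per-slot label sequences. The sketch below assumes both sequences are already available as lists of 0/1 labels at 10 ms resolution; how the labels for the noise-reduced result are derived is not described in the text, and the function name is hypothetical.

```python
def labeling_error_rate(predicted, reference):
    """Fraction of time slots whose 0/1 speech labels differ (e = a / b)."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("label sequences must not be empty")
    a = sum(1 for p, r in zip(predicted, reference) if p != r)  # differing slots
    b = len(reference)                                          # total labeled slots
    return a / b

# Four 10 ms slots, two of which disagree.
e = labeling_error_rate([1, 0, 1, 1], [1, 1, 1, 0])
```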
By calculation, with the speech recognition method based on quadratic spectral subtraction under a complex noise background disclosed in this embodiment, the error rate of denoising the 50 pieces of audio to be recognized is 14.92%, i.e. an accuracy of 85.1%; with the traditional power-based spectral subtraction method, the error rate of denoising the same 50 pieces of audio is 19.10%, i.e. an accuracy of 80.9%. Compared with classical spectral subtraction, the method of this embodiment improves accuracy by about 4.2 percentage points.
FIG. 3 compares the waveforms before and after a piece of audio is processed with the speech recognition method of this embodiment; at the marked positions in the figure, the waveforms before and after noise reduction show that the background noise is clearly removed.
System embodiment
A speech recognition system based on quadratic spectral subtraction under a complex noise background, comprising:
a historical noise estimation module, which obtains the historical noise estimate of the complex noise background based on the noisy historical audio and the pure noise audio under the complex noise background;
a noise reduction processing module for the audio to be recognized, which frames the audio to be recognized under the complex noise background to obtain multiple frames of audio, and processes each frame of audio in sequence to obtain noise-reduced speech; wherein processing the current frame audio includes: performing secondary spectral subtraction on the current frame audio based on the historical noise estimate and the noise estimate of the current frame audio to obtain the noise-reduced speech spectrum of the current frame audio.
The historical noise estimation module comprises the following modules:
a noisy historical audio processing module: framing each piece of noisy historical audio, and processing it to obtain the power spectrum of each frame signal; selecting, within each piece of audio, the preset number of frames with the lowest power spectrum as pure noise, and estimating the average per-frame noise power spectrum $B_n(\omega)$ of each piece of noisy historical audio, where $n$ denotes the $n$-th piece of noisy historical audio;
a pure noise audio processing module: framing each piece of pure noise audio, and processing it to obtain its average per-frame noise power spectrum $B_k(\omega)$, where $k$ denotes the $k$-th piece of pure noise audio;
a noise estimation module: computing a weighted average of $B_n(\omega)$ and $B_k(\omega)$ to obtain the historical noise estimate
Figure BDA0003708270550000121
The noise reduction processing module for the audio to be recognized includes:
a current audio noise estimation module: framing the audio to be recognized under the complex noise background to obtain multiple frames of audio, selecting the preset number of frames with the lowest power spectrum as pure noise, and estimating the average per-frame noise power spectrum of the audio to be recognized
Figure BDA0003708270550000122
that is, the noise estimate of the current frame audio;
a noise reduction processing module: and performing secondary spectrum subtraction on the current frame audio based on the historical noise estimation and the noise estimation of the current frame audio to obtain a voice spectrum of the current frame audio after noise reduction.
Because the speech recognition system based on the quadratic spectral subtraction under the complex noise background and the speech recognition method based on the quadratic spectral subtraction under the complex noise background are based on the same invention concept, the related parts can be referred to each other, and the same technical effect can be realized.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims (10)

1. A speech recognition method based on quadratic spectral subtraction under a complex noise background is characterized by comprising the following steps:
obtaining a historical noise estimation of a complex noise background based on a noisy historical audio and a clean noise audio under the complex noise background;
framing the audio to be recognized under the complex noise background to obtain multiple frames of audio;
processing each frame of audio in sequence to obtain noise-reduced speech; wherein processing the current frame audio includes: performing secondary spectral subtraction on the current frame audio based on the historical noise estimate and the noise estimate of the current frame audio to obtain the noise-reduced speech spectrum of the current frame audio.
2. The speech recognition method of claim 1, wherein obtaining the historical noise estimate of the complex noise background based on noisy historical audio and pure noise audio under the complex noise background comprises:
framing each noisy historical audio recording and computing the power spectrum of each frame signal of the noisy historical audio;
selecting, for each recording, a preset number of frames with the lowest power spectrum as pure noise, and estimating the per-frame average noise power spectrum $B_i(\omega)$ of each noisy historical audio recording, where $i = 1, 2, 3, \ldots, n$ is the index of the noisy historical audio recording;
framing each pure noise audio recording and computing the per-frame average noise power spectrum $B_j(\omega)$ of each pure noise audio recording, where $j = 1, 2, 3, \ldots, k$ is the index of the pure noise audio recording;
averaging $B_i(\omega)$ and $B_j(\omega)$ to obtain the historical noise estimate
Figure FDA0003708270540000011
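The steps of claim 2 can be sketched roughly as follows. This is an illustrative sketch, not the claimed implementation: the function names, the Hann window, the frame length of 256 samples, the hop of 128 samples, and the choice of a single unweighted mean over all n + k spectra are assumptions, since the averaging formula itself is given only as a figure.

```python
import numpy as np

def frame_power_spectra(signal, frame_len=256, hop=128):
    """Split a 1-D signal into Hann-windowed frames and return |FFT|^2 per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # shape (n_frames, n_bins)

def lowest_power_noise_spectrum(signal, n_noise_frames=10, **kw):
    """B_i(w): mean power spectrum of the n_noise_frames lowest-power frames."""
    power = frame_power_spectra(signal, **kw)
    quietest = np.argsort(power.sum(axis=1))[:n_noise_frames]
    return power[quietest].mean(axis=0)

def pure_noise_spectrum(signal, **kw):
    """B_j(w): mean power spectrum over all frames of a noise-only recording."""
    return frame_power_spectra(signal, **kw).mean(axis=0)

def historical_noise_estimate(noisy_recordings, noise_recordings, n_noise_frames=10):
    """Assumed averaging rule: unweighted mean over all B_i and B_j spectra."""
    spectra = [lowest_power_noise_spectrum(x, n_noise_frames) for x in noisy_recordings]
    spectra += [pure_noise_spectrum(x) for x in noise_recordings]
    return np.mean(spectra, axis=0)
```

Treating the lowest-power frames of a noisy recording as speech-free avoids the need for a separate voice activity detector, at the cost of underestimating noise that fluctuates strongly between frames.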
3. The speech recognition method of claim 2, wherein the noise estimation of the current frame audio comprises:
selecting, from the audio to be identified, a preset number of frames with the lowest power spectrum as pure noise; and estimating, based on the pure noise, the average noise power spectrum of each frame of the audio to be identified
Figure FDA0003708270540000012
i.e. the noise estimate of the current frame audio.
4. The speech recognition method of claim 3, wherein quadratic spectral subtraction is performed on the current frame audio using the following formula to obtain a power spectrum estimate of the current frame audio, that is, the noise-reduced speech spectrum of the current frame audio:
Figure FDA0003708270540000021
wherein
Figure FDA0003708270540000022
represents the power spectrum estimate of the current frame audio; $m$ is the index of the current frame audio; $Y_{n+1}(\omega, m)$ is the spectrum of the current frame audio; $\psi_{n+1}(\omega, m)$ is the phase information of the current frame audio; $\alpha_m$ and $\beta_m$ are the historical noise removal factor and the current frame audio noise removal factor, respectively; and $b_m$ is the lowest spectral factor of the audio signal.
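Because the formula of claim 4 survives here only as a figure reference (Figure FDA0003708270540000021), the following is a hedged sketch of one plausible form rather than the claimed formula. It assumes a power-domain over-subtraction in which the historical noise estimate and the current-audio noise estimate are both removed, a floor of $b_m$ times the noisy power, and reuse of the noisy phase. The function name and the default factor values are hypothetical.

```python
import numpy as np

def quadratic_spectral_subtraction(Y, B_hist, B_cur, alpha=2.0, beta=2.0, b=0.05):
    """Denoise one frame (assumed form, not the patent's exact formula).

    Y      : complex spectrum of the current frame
    B_hist : historical noise power spectrum
    B_cur  : current-audio noise power spectrum
    """
    power = np.abs(Y) ** 2
    phase = np.angle(Y)
    # First subtraction removes the historical noise estimate,
    # second subtraction removes the current-audio noise estimate.
    cleaned = power - alpha * B_hist - beta * B_cur
    # Spectral floor: never drop below b times the noisy power (assumption).
    cleaned = np.maximum(cleaned, b * power)
    # Reattach the noisy phase to the estimated magnitude.
    return np.sqrt(cleaned) * np.exp(1j * phase)
```

The floor keeps bins from going negative after subtraction, which would otherwise produce isolated spectral peaks heard as musical noise.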
5. The speech recognition method of claim 4, wherein $\alpha_m$, $\beta_m$ and $b_m$ are calculated using the following formulas:
Figure FDA0003708270540000023
Figure FDA0003708270540000024
Figure FDA0003708270540000025
where $c$ is a constant and $\xi_m$ is the frequency-domain posterior signal-to-noise ratio of the current frame audio signal; $\alpha_{\min}$ and $\alpha_{\max}$ are the minimum and maximum values of $\alpha_m$; $\beta_{\min}$ and $\beta_{\max}$ are the minimum and maximum values of $\beta_m$; and $b_{\min}$ and $b_{\max}$ are the minimum and maximum values of $b_m$.
6. The speech recognition method of claim 5, wherein $\xi_m$ is calculated by the following formula:
Figure FDA0003708270540000026
where $k$ is the frequency bin index, $\sum_k |Y_{n+1}(\omega_k, m)|$ represents the spectral intensity of the current frame audio, and
Figure FDA0003708270540000031
represents the spectral intensity of the historical noise estimate.
7. The speech recognition method of claim 6, wherein the maximum and minimum values of $\alpha_m$, $\beta_m$ and $b_m$ are limited as follows:
$\alpha_{\max} = 3$, $\alpha_{\min} = 1$, $\beta_{\max} = 3$, $\beta_{\min} = 1$, $b_{\max} = 0.1$, $b_{\min} = 0.02$.
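The formulas of claims 5 and 6 are present only as figure references, so the sketch below does not reproduce them. Only the clipping bounds are taken from claim 7. The dB ratio of summed magnitudes used for $\xi_m$, the linear mapping from $\xi_m$ to each factor, the value `c=0.15`, and both function names are hypothetical, chosen to illustrate how SNR-dependent factors could be confined to the claimed ranges.

```python
import numpy as np

# Bounds stated in claim 7.
ALPHA_MIN, ALPHA_MAX = 1.0, 3.0
BETA_MIN, BETA_MAX = 1.0, 3.0
B_MIN, B_MAX = 0.02, 0.1

def posterior_snr_db(Y, B_hist):
    """Assumed form: summed frame magnitude over summed noise magnitude, in dB."""
    signal = max(np.sum(np.abs(Y)), 1e-12)
    noise = max(np.sum(np.sqrt(B_hist)), 1e-12)
    return 20.0 * np.log10(signal / noise)

def subtraction_factors(xi_db, c=0.15):
    """Hypothetical linear mapping; only the clipping bounds come from claim 7."""
    # Lower SNR -> stronger subtraction and a higher spectral floor.
    alpha = float(np.clip(ALPHA_MAX - c * xi_db, ALPHA_MIN, ALPHA_MAX))
    beta = float(np.clip(BETA_MAX - c * xi_db, BETA_MIN, BETA_MAX))
    b = float(np.clip(B_MAX - 0.1 * c * xi_db, B_MIN, B_MAX))
    return alpha, beta, b
```

Under these assumptions, a frame at very low SNR receives the maximum factors (3, 3, 0.1), and a frame at very high SNR receives the minimum factors (1, 1, 0.02).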
8. A speech recognition system based on quadratic spectral subtraction under a complex noise background, comprising:
a historical noise estimation module: obtaining a historical noise estimate of the complex noise background based on noisy historical audio and pure noise audio under the complex noise background;
a noise reduction processing module for the audio to be identified: performing framing processing on the audio to be identified under the complex noise background to obtain multi-frame audio; processing each frame of audio in sequence to obtain noise-reduced speech; wherein processing the current frame audio comprises: performing quadratic spectral subtraction on the current frame audio, based on the historical noise estimate and the noise estimate of the current frame audio, to obtain the noise-reduced speech spectrum of the current frame audio.
9. The speech recognition system of claim 8, wherein the historical noise estimation module comprises:
a noisy historical audio processing module for framing each noisy historical audio recording and computing the power spectrum of each frame signal of the noisy historical audio; and for selecting, for each recording, a preset number of frames with the lowest power spectrum as pure noise and estimating the per-frame average noise power spectrum $B_n(\omega)$ of each noisy historical audio recording, where $n$ denotes the $n$-th noisy historical audio recording;
a pure noise audio processing module for framing each pure noise audio recording and computing the per-frame average noise power spectrum $B_k(\omega)$ of each pure noise audio recording, where $k$ denotes the $k$-th pure noise audio recording;
a noise estimation module for averaging $B_k(\omega)$ and $B_n(\omega)$ to obtain the historical noise estimate
Figure FDA0003708270540000041
10. The speech recognition system of claim 9, wherein the noise reduction processing module for the audio to be identified comprises:
a current audio noise estimation module for performing framing processing on the audio to be identified under the complex noise background to obtain multi-frame audio, selecting a preset number of frames with the lowest power spectrum as pure noise, and estimating the average noise power spectrum of each frame of the audio to be identified
Figure FDA0003708270540000042
namely the noise estimate of the current frame audio;
and a noise reduction processing module for performing quadratic spectral subtraction on the current frame audio, based on the historical noise estimate and the noise estimate of the current frame audio, to obtain the noise-reduced speech spectrum of the current frame audio.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210711617.9A CN115132219A (en) 2022-06-22 2022-06-22 Speech recognition method and system based on quadratic spectral subtraction under complex noise background

Publications (1)

Publication Number Publication Date
CN115132219A true CN115132219A (en) 2022-09-30

Family

ID=83379433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210711617.9A Pending CN115132219A (en) 2022-06-22 2022-06-22 Speech recognition method and system based on quadratic spectral subtraction under complex noise background

Country Status (1)

Country Link
CN (1) CN115132219A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination