CN113851151A - Masking threshold estimation method, device, electronic equipment and storage medium - Google Patents
- Publication number: CN113851151A (application number CN202111250359.0A)
- Authority: CN (China)
- Prior art keywords: noise, signal, masking threshold, voice, determining
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
- G10L21/0208: Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
- G10L21/0272: Speech enhancement; voice signal separating
- G10L25/18: Speech or voice analysis in which the extracted parameters are spectral information of each sub-band
- G10L25/30: Speech or voice analysis characterised by the use of neural networks
- G10L2021/02087: Noise filtering where the noise is separate speech, e.g. cocktail party
- G10L2025/783: Detection of presence or absence of voice signals based on threshold decision
Abstract
The embodiment of the invention discloses a masking threshold estimation method and device, an electronic device and a storage medium. The method comprises the following steps: acquiring the amplitude spectrum of a noisy speech signal and the amplitude spectrum of the noise signal within it; determining the speech feature spectrum deviation of the noisy speech signal from the two amplitude spectra, and determining the speech feature flatness from the amplitude spectrum of the noisy speech signal; determining pure tone coefficients for different frequency bands of the noisy speech signal from the speech feature spectrum deviation and the speech feature flatness; determining an intermediate masking threshold from the power spectrum of the noisy speech signal, its amplitude spectrum and the pure tone coefficients; and determining a target masking threshold from the comparison of a predetermined absolute masking threshold with the intermediate masking threshold. The embodiment of the invention can improve the accuracy of masking threshold estimation, thereby effectively enhancing noise suppression and improving the speech recognition effect.
Description
Technical Field
The embodiment of the invention relates to the technical field of signal processing, in particular to a masking threshold estimation method and device, electronic equipment and a storage medium.
Background
With the rapid development of signal processing and speech recognition techniques, speech enhancement in front-end preprocessing is becoming increasingly important. Generally, when a device plays sound, noise is heard along with the speech; this noise interferes with the speech and can even impair how human ears perceive it. Blind source separation is usually adopted to address this, and its most important technical means at present is estimating the masking threshold.
At present, in non-stationary environments, many noise estimation algorithms suffer from problems such as tracking delay and large errors. Some researchers attempt speech enhancement based on the auditory characteristics of human ears in such environments, but the estimation accuracy of the masking threshold is then the key to speech enhancement based on auditory characteristics.
Therefore, how to improve the estimation accuracy of the masking threshold is a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
The embodiment of the invention provides a masking threshold estimation method and device, electronic equipment and a storage medium, which can improve the estimation accuracy of a masking threshold.
In a first aspect, an embodiment of the present invention provides a masking threshold estimation method, including:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to the comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
In a second aspect, an embodiment of the present invention further provides a masking threshold estimation apparatus, including:
the basic parameter acquisition module is used for acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
the characteristic parameter determining module is used for determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
the pure tone coefficient determining module is used for determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
the intermediate masking threshold determining module is used for determining an intermediate masking threshold according to the pure tone coefficient;
and the target masking threshold determining module is used for determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the masking threshold estimation method according to any embodiment of the invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the masking threshold estimation method according to any embodiment of the present invention.
According to the masking threshold estimation scheme provided by the embodiment of the invention, the amplitude spectrum of a noisy speech signal is obtained, together with the amplitude spectrum of the noise signal within it, where the noisy speech signal comprises a clean speech signal and a noise signal; the speech feature spectrum deviation of the noisy speech signal is determined from the two amplitude spectra, and the speech feature flatness from the amplitude spectrum of the noisy speech signal; pure tone coefficients for different frequency bands of the noisy speech signal are determined from the speech feature spectrum deviation and the speech feature flatness; an intermediate masking threshold is determined from the pure tone coefficients; and a target masking threshold is determined from the comparison of a predetermined absolute masking threshold with the intermediate masking threshold. The technical scheme provided by the embodiment of the invention can improve the accuracy of masking threshold estimation, effectively enhance the noise suppression result, and improve the speech recognition effect.
Drawings
Fig. 1 is a flowchart of a masking threshold estimation method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a masking threshold estimation device according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present invention. It should be understood that the drawings and the embodiments of the present invention are illustrative only and are not intended to limit the scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an" and "the" in the present invention are intended to be illustrative rather than limiting, and those skilled in the art will understand them as meaning "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present invention are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Example one
Fig. 1 is a flowchart of a masking threshold estimation method in a first embodiment of the present invention, which is applicable to a case of estimating a masking threshold in a noisy environment. The method may be performed by a masking threshold estimation apparatus, which may be implemented in software and/or hardware, and may be configured in an electronic device, for example, the electronic device may be a device with communication and computing capabilities, such as a background server. As shown in fig. 1, the method specifically includes:
s110, obtaining an amplitude spectrum of a voice signal with noise, and obtaining an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal.
The noisy speech signal is acquired by at least one voice acquisition device at a voice acquisition site. The site can be a conference room, a broadcasting room, a communication site in a noisy environment such as a railway station, a military communication site, a speech recognition site, and so on. For example, when a broadcaster reads the news, various sounds may appear in the broadcasting room: traffic noise from vehicles passing outside the building, or noise generated inside by the air conditioning system, the lighting control system, the cameras, and staff moving back and forth. At this moment, the speech signals in the broadcasting room need to be collected, and the broadcaster's speech enhanced.
The voice acquisition device can be a microphone or a wave detector. Specifically, the number of the voice collecting devices is not limited, and may be 1 or more. When the number of the voice collecting devices is 2 or more, the arrangement mode of the voice collecting devices is not limited in order to collect voice signals at different positions. For example, the speech acquisition devices may be arranged along a circumferential direction of a source of clean speech signals in the noisy speech signal. In addition, because noise interference in the noisy speech signal has uncertainty and randomness, the speech acquisition device can acquire the noisy speech signal continuously or intermittently at short intervals.
Further, to better perform masking threshold estimation on the noisy speech signal, the collected signal needs to be converted into a frequency-domain sound signal, for example by means of a Fourier transform. The power spectrum characterizes how the power of the noisy speech signal varies with frequency, i.e. the power of the signal per unit frequency band. The amplitude spectrum describes how the signal's amplitude is distributed over frequency: in the frequency-domain description of a signal, frequency is the independent variable and the amplitude of each frequency component making up the signal is the dependent variable, and this function of frequency is called the amplitude spectrum. Both the power spectrum and the amplitude spectrum of the noisy speech signal can be obtained by applying an FFT (Fast Fourier Transform) to the collected noisy speech signal.
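The conversion described above can be sketched as follows. This is an illustrative implementation only, not the patent's own code; a naive DFT stands in for a real FFT routine so the example stays dependency-free.

```python
import cmath
import math

def spectra(frame):
    """Amplitude and power spectrum of one frame.

    A naive O(N^2) DFT is used so the sketch needs no libraries; a real
    implementation would call an FFT routine instead.
    """
    n = len(frame)
    amp, power = [], []
    for k in range(n // 2 + 1):  # one-sided spectrum, bins 0..N/2
        x = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        amp.append(abs(x))         # amplitude spectrum |Y(k)|
        power.append(abs(x) ** 2)  # power spectrum |Y(k)|^2
    return amp, power

# a 32-sample frame holding a single sinusoid centred on bin 4
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
amp, power = spectra(frame)
peak_bin = max(range(len(amp)), key=lambda k: amp[k])
```

For a pure sinusoid at bin k the amplitude spectrum peaks at that bin with magnitude N/2, which is a quick sanity check for the transform.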
Further, the amplitude spectrum of the noise signal in the noisy speech signal is obtained by estimating the noise signal from the collected noisy speech signal and applying FFT processing. The noise signal may be estimated by one of the following three methods: a recursive-averaging noise estimation algorithm, a minima-tracking algorithm, or a histogram-based noise estimation algorithm.
It should be noted that the noisy speech signal includes a clean speech signal and a noise signal. The clean voice signal refers to a desired voice signal, and the noise signal refers to all interference signals except the desired voice signal. For example, when a broadcaster broadcasts news indoors, the sound signal of the broadcaster is a pure voice signal, and the sound generated by the broadcaster when walking back and forth among an indoor air conditioning system, a light control system, a camera and a worker and the sound generated by a passing vehicle outdoors are noise signals.
S120, determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise.
The voice characteristic spectrum deviation is a difference between an amplitude spectrum of each frequency band or each frequency point of the voice signal with noise and an average value of the amplitude spectrum, and can be used for measuring the precision of the obtained amplitude spectrum of the voice signal with noise. The speech feature spectrum deviation can be obtained by using a statistical formula.
In this embodiment, optionally, determining the voice feature spectrum deviation of the noisy voice signal according to the amplitude spectrum of the noisy voice signal and the amplitude spectrum of the noise signal, and determining the voice feature flatness according to the amplitude spectrum of the noisy voice signal includes:
determining the deviation of the speech characteristic spectrum by adopting the following formula:
where i is the frequency band index, j is the frequency point index, N is the number of frequency points (0 ≤ j < N), Diff(i) is the speech feature spectrum deviation of the i-th band of the noisy speech signal, D_i(j) is the estimated amplitude spectrum at the j-th frequency point of the i-th band of the noise signal, Y_i(j) is the amplitude spectrum at the j-th frequency point of the i-th band of the noisy speech signal, D̄(i) is the mean of the estimated amplitude spectrum of the noise signal over the i-th band, and Ȳ(i) is the mean of the amplitude spectrum of the noisy speech signal over the i-th band.
It will, of course, be appreciated that the mean D̄(i) of the estimated amplitude spectrum of the noise signal over the i-th band can be determined as D̄(i) = (1/N) · Σ_{j=0}^{N−1} D_i(j).
it should be noted that, in a short time (e.g., 10ms to 30ms), the shape of the vocal cords and vocal tract of the human is relatively stable, and thus the short-time spectrum of the acquired human voice signal has relative stability. The voice acquisition site may be in a non-stationary environment, and in order to ensure that the noise signal in the voice signal to be enhanced can be effectively suppressed in the non-stationary environment, the voice signal to be enhanced needs to be divided into a plurality of frequency bands according to the frequency domain of the voice signal to be enhanced, and each frequency band includes a plurality of frequency points. For example, a noisy speech signal may be divided into 5 frequency bands, each of which includes 20 frequency points.
The frequency band division may be based on the Bark scale, or alternatively on the Mel scale. The Bark scale is a unit of perceptual frequency: the frequency of the speech signal to be enhanced, in Hertz, is mapped onto 24 psychoacoustic critical bands, the width of each critical band being one Bark; when dividing bands on the Bark scale, the physical frequency must therefore be converted into a psychoacoustic frequency. The Mel scale is a band-division approach that is likewise closer to the human auditory system.
In practice, the band-division approach for the speech signal to be enhanced may be selected according to the application scenario of the device. For example, when a broadcaster reads the news in a broadcasting room, since the collected sound signal of each broadcaster and the noise signal in the room are relatively stable, the frequency domain of the speech signal to be enhanced may be divided into 26 frequency bands on the Mel scale.
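The two scales above can be sketched with their commonly cited closed forms (a Zwicker-style Bark mapping and the O'Shaughnessy Mel mapping). The patent does not reproduce these formulas, so they are standard stand-ins, and `band_index` is a hypothetical helper for the equal-width division described in the text.

```python
import math

def hz_to_bark(f):
    # Zwicker-style Bark mapping: roughly 24 critical bands across the hearing range
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def hz_to_mel(f):
    # O'Shaughnessy Mel mapping, the usual basis for Mel filterbanks
    return 2595.0 * math.log10(1.0 + f / 700.0)

def band_index(f, n_bands, f_max, scale):
    """Map frequency f to one of n_bands equal-width bands on the chosen scale."""
    top = scale(f_max)
    return min(int(scale(f) / top * n_bands), n_bands - 1)

# e.g. a 26-band Mel division as used in the broadcasting-room example
band = band_index(1000.0, 26, 8000.0, hz_to_mel)
```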
In this embodiment, optionally, the method further includes determining the flatness of the voice feature by using the following formula:
where flat(i) is the speech feature flatness of the i-th band of the noisy speech signal.
It will be appreciated that the mean Ȳ(i) of the amplitude spectrum of the i-th band of the noisy speech signal can be determined as Ȳ(i) = (1/N) · Σ_{j=0}^{N−1} Y_i(j).
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise. The method has the advantages that the calculation mode is simpler, more convenient and faster, the data required by calculation is easy to obtain, and the data can be used as the calculation basis of the pure tone coefficient in the follow-up process.
And S130, determining pure tone coefficients of different frequency bands in the noisy speech signal according to the speech characteristic spectrum deviation and the speech characteristic flatness.
The pure tone coefficient can be calculated and determined by establishing a relational expression between the voice characteristic spectrum deviation and the voice characteristic flatness and the pure tone coefficient, and can also be determined by establishing a neural network model for masking threshold estimation.
In a possible embodiment, optionally, determining pure-tone coefficients of different frequency bands in the noisy speech signal according to the speech feature spectrum deviation and the speech feature flatness includes:
where α(i) is the pure tone coefficient of the i-th band of the noisy speech signal, β_i is the first adjustment value, β_i ∈ [0, 1], and μ_i is the second adjustment value, μ_i ∈ [−70, −50].
It can be understood that the above formula relates the speech feature spectrum deviation and the speech feature flatness to the pure tone coefficient, yielding an intermediate pure tone value that is compared with 1; the smaller of the two is taken as the pure tone coefficient α(i). The first adjustment value β_i and the second adjustment value μ_i are adjustment values obtained as needed or empirically when calculating α(i), and can of course be adapted to the application scenario.
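The patent's own combining formula is not reproduced in this text. For orientation, the classic Johnston-model tonality coefficient follows the same pattern (a flatness measure in dB, normalised by a constant in the stated μ_i range and clipped at 1), so a hedged sketch looks like:

```python
def tonality(flatness_db, mu=-60.0):
    """Johnston-style pure-tone (tonality) coefficient: the band's spectral
    flatness in dB divided by mu and clipped at 1. The patent's second
    adjustment value lies in [-70, -50]; -60 dB is the classic choice and
    only a placeholder here."""
    return min(flatness_db / mu, 1.0)

# a very peaky band (flatness around -80 dB) is fully tone-like;
# a flat band (flatness around 0 dB) is fully noise-like
alpha_tone = tonality(-80.0)
alpha_noise = tonality(0.0)
```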
And S140, determining an intermediate masking threshold according to the pure tone coefficient.
The intermediate masking threshold is an intermediate product of the estimated target masking threshold, and can be determined by establishing a relational expression calculation or by establishing a neural network model of masking threshold estimation.
Optionally, determining an intermediate masking threshold according to the pure tone coefficient includes:
where T(i) is the intermediate masking threshold of the i-th band of the noisy speech signal, O(i) is the offset of the i-th band relative to the intermediate masking threshold, T(i−1) is the masking threshold of the (i−1)-th band, and λ_i is the third adjustment value, λ_i ∈ [0, 1].
It will be appreciated that the third adjustment value λ_i is an adjustment value obtained as needed or empirically when calculating the intermediate masking threshold T(i). The offset O(i) relative to the intermediate masking threshold and the spread critical band spectrum value C(i) may each be determined either by an explicit formula or by a neural network model for masking threshold estimation.
On the basis of the above technical solutions, optionally, the offset from the intermediate masking threshold is determined by using the following formula:
O(i)=(α(i)-μ)*i+(3μ-1)*α(i)+μ;
where μ is a fourth adjustment value, μ ∈ [0.5, 6.5].
It is to be understood that the fourth adjustment value μ is an adjustment value obtained according to needs or experience when calculating the offset o (i) from the intermediate masking threshold, and may be adaptively adjusted according to an application scenario.
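The offset formula quoted above can be coded directly; the μ value below is only a mid-range placeholder from the stated interval [0.5, 6.5].

```python
def offset(alpha_i, band_i, mu=3.5):
    """Offset O(i) relative to the intermediate masking threshold, per the
    formula given in the text: O(i) = (alpha(i) - mu)*i + (3*mu - 1)*alpha(i) + mu.
    mu=3.5 is just an illustrative choice within [0.5, 6.5]."""
    return (alpha_i - mu) * band_i + (3 * mu - 1) * alpha_i + mu
```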
On the basis of the above technical solutions, optionally, the expansion critical band spectrum is determined by using the following formula:
where C(i) is the spread critical band spectrum value, j is the frequency point index, N is the number of frequency points, P(j) is the power spectrum at the j-th frequency point of the noisy speech signal, and SP_ij is the spreading function of the j-th frequency point in the i-th band of the noisy speech signal.
It is understood that the spread critical band spectrum C(i) is obtained by summing, over the frequency points of each band of the noisy speech signal, the product of each point's power spectrum and the spreading function.
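The summation described above smears each band's power across its neighbours. The patent's SP_ij is not reproduced in this text, so the sketch below substitutes the well-known Schroeder spreading function as an assumed stand-in, applied band-to-band for simplicity.

```python
import math

def schroeder_spread_db(dz):
    """Schroeder spreading function (in dB), an assumed stand-in for the
    patent's SP_ij; dz is the Bark distance between masker and masked band."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)

def spread_spectrum(band_power):
    """C(i) = sum_j P(j) * SP(i, j): each band's power leaks into neighbours
    according to the spreading function (converted from dB to linear)."""
    n = len(band_power)
    return [
        sum(band_power[j] * 10 ** (schroeder_spread_db(i - j) / 10.0) for j in range(n))
        for i in range(n)
    ]

c = spread_spectrum([0.0, 0.0, 1.0, 0.0, 0.0])
```

Masking spreads asymmetrically: the output in the band above the excited one is larger than in the band below, matching the upward spread of masking.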
Optionally, the following formula is adopted to determine the spreading function of the jth frequency point in the ith frequency band of the noisy speech signal:
s150, determining a target masking threshold according to a comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
The absolute masking threshold is the masking threshold at which a person with normal hearing can just perceive the quietest sound in a noise-free environment, and may be determined by the following formula:
J(i) = 3.64 · i^(−0.8) − 6.5 · exp(−0.6 · (i − 3.3)²) + 10^(−3) · i⁴;
where j (i) is the absolute masking threshold for the ith frequency band.
Then the predetermined absolute masking threshold is compared with the intermediate masking threshold, which can be done using the following formula:
T(i)=max(T(i),J(i))。
it will be appreciated that the maximum of the intermediate masking threshold t (i) and the absolute masking threshold j (i) is taken as the target masking threshold.
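Combining the threshold in quiet with the element-wise maximum above gives the final step. Reading J(i)'s argument as frequency in kHz follows the classic Terhardt form of this formula and is an assumption here, as is the reconstructed −0.6 factor in the exponential.

```python
import math

def absolute_threshold_db(f_khz):
    """Terhardt-style threshold in quiet (dB SPL), the usual reading of the
    J(i) formula, with the frequency argument assumed to be in kHz."""
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)

def target_threshold(intermediate, absolute):
    """Final step of the scheme: the target masking threshold in each band is
    the larger of the intermediate threshold and the threshold in quiet."""
    return [max(t, j) for t, j in zip(intermediate, absolute)]
```

The dip of the threshold-in-quiet curve near 3 to 4 kHz reflects the ear's greatest sensitivity in that region.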
According to the technical solution of this embodiment of the invention, the amplitude spectrum of a noisy speech signal and the amplitude spectrum of the noise signal contained in it are obtained, where the noisy speech signal comprises a clean speech signal and a noise signal; the speech characteristic spectrum deviation of the noisy speech signal is determined from the amplitude spectrum of the noisy speech signal and the amplitude spectrum of the noise signal, and the speech characteristic flatness is determined from the amplitude spectrum of the noisy speech signal; pure tone coefficients of different frequency bands in the noisy speech signal are determined from the speech characteristic spectrum deviation and the speech characteristic flatness; an intermediate masking threshold is determined from the pure tone coefficients; and a target masking threshold is determined from the comparison result of a predetermined absolute masking threshold and the intermediate masking threshold. By dividing the noisy speech signal into several frequency bands and separately calculating the basic parameters, characteristic parameters, pure tone coefficient, and masking threshold of each band, the technical solution improves the accuracy of target masking threshold estimation, effectively enhances the noise suppression result, and improves the speech recognition effect.
Example two
Fig. 2 is a schematic structural diagram of a masking threshold estimation device in a second embodiment of the present invention, which is applicable to a case of estimating a masking threshold in a noisy environment. As shown in fig. 2, the apparatus includes:
a basic parameter obtaining module 210, configured to obtain an amplitude spectrum of a voice signal with noise, and obtain an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
a characteristic parameter determining module 220, configured to determine a voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determine a voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
a pure tone coefficient determining module 230, configured to determine pure tone coefficients of different frequency bands in the noisy speech signal according to the speech feature spectrum deviation and the speech feature flatness;
an intermediate masking threshold determining module 240, configured to determine an intermediate masking threshold according to the pure tone coefficient;
a target masking threshold determining module 250, configured to determine a target masking threshold according to a comparison result between a predetermined absolute masking threshold and the intermediate masking threshold.
According to the technical solution of this embodiment of the invention, the amplitude spectrum of a noisy speech signal and the amplitude spectrum of the noise signal contained in it are obtained, where the noisy speech signal comprises a clean speech signal and a noise signal; the speech characteristic spectrum deviation of the noisy speech signal is determined from the amplitude spectrum of the noisy speech signal and the amplitude spectrum of the noise signal, and the speech characteristic flatness is determined from the amplitude spectrum of the noisy speech signal; pure tone coefficients of different frequency bands in the noisy speech signal are determined from the speech characteristic spectrum deviation and the speech characteristic flatness; an intermediate masking threshold is determined from the pure tone coefficients; and a target masking threshold is determined from the comparison result of a predetermined absolute masking threshold and the intermediate masking threshold. By the technical solution provided in this embodiment of the invention, the accuracy of masking threshold estimation can be improved, the noise suppression result can be effectively enhanced, and the speech recognition effect improved.
Further, the characteristic parameter determining module 220 includes:
a voice characteristic spectrum deviation determining unit, configured to determine a voice characteristic spectrum deviation by using the following formula:
where i is the frequency band index, j is the frequency point index, N is the number of frequency points (0 ≤ j < N), Diff(i) is the speech characteristic spectrum deviation of the i-th band of the noisy speech signal, D_i(j) is the estimated amplitude spectrum of the j-th frequency point in the i-th band of the noise signal, Y_i(j) is the amplitude spectrum of the j-th frequency point in the i-th band of the noisy speech signal, D̄_i is the mean of the estimated amplitude spectrum of the i-th band of the noise signal, and Ȳ_i is the mean of the amplitude spectrum of the i-th band of the noisy speech signal;
a voice feature flatness determination unit, configured to determine a voice feature flatness by using the following formula:
where Flat(i) is the speech characteristic flatness of the i-th band of the noisy speech signal.
Further, the pure tone coefficient determining module 230 is specifically configured to:
where α(i) is the pure tone coefficient of the i-th band of the noisy speech signal, β_i is the first adjustment value, β_i ∈ [0, 1], and η_i is the second adjustment value, η_i ∈ [-70, -50].
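The patent's own formulas for the flatness and the pure tone coefficient are not reproduced in this text. As a hedged illustration only, a Johnston-style tonality coefficient built from the standard spectral flatness measure is sketched below, with η playing the role of the reference flatness in dB; the patent's exact combination with the spectrum deviation Diff(i) and the weight β_i is not shown here and this is not claimed to be it:

```python
import numpy as np

def spectral_flatness_db(power):
    """Standard spectral flatness in dB: 10*log10(geometric mean / arithmetic mean).
    Near 0 dB for noise-like (flat) spectra, strongly negative for tonal ones."""
    power = np.asarray(power, dtype=float)
    gm = np.exp(np.mean(np.log(power + 1e-12)))  # geometric mean (log-domain)
    am = np.mean(power) + 1e-12                  # arithmetic mean
    return 10.0 * np.log10(gm / am)

def tonality_coefficient(sfm_db, eta=-60.0):
    """Johnston-style alpha = min(SFM_dB / eta, 1): close to 1 for tone-like
    bands, close to 0 for noise-like bands. eta = -60 dB is an assumed value
    inside the patent's stated range [-70, -50]."""
    return min(sfm_db / eta, 1.0)
```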
Further, the intermediate masking threshold determining module 240 includes:
an intermediate masking threshold determining unit for determining the intermediate masking threshold using the following formula:
where T(i) is the intermediate masking threshold of the i-th band of the noisy speech signal, O(i) is the offset of the i-th band of the noisy speech signal from the intermediate masking threshold, T(i-1) is the masking threshold of the (i-1)-th band of the noisy speech signal, and λ_i is the third adjustment value, λ_i ∈ [0, 1].
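The recursion's exact formula is not reproduced in this text. One plausible reading, shown purely as an assumption, converts the spread critical band spectrum C(i) minus the offset O(i) from dB back to power (Johnston-style) and then smooths across bands with a weight λ on the previous band's threshold:

```python
import numpy as np

def intermediate_thresholds(C, O, lam=0.5):
    """Hypothetical sketch: raw threshold 10^((10*log10 C(i) - O(i)) / 10),
    then first-order smoothing T(i) = lam*T(i-1) + (1-lam)*T_raw(i).
    lam = 0.5 and the smoothing form itself are assumptions."""
    T_raw = 10.0 ** ((10.0 * np.log10(C + 1e-12) - O) / 10.0)
    T = np.empty_like(T_raw)
    T[0] = T_raw[0]
    for i in range(1, len(T_raw)):
        T[i] = lam * T[i - 1] + (1.0 - lam) * T_raw[i]
    return T
```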
Further, the intermediate masking threshold determining unit includes:
an offset from intermediate masking threshold determining subunit configured to determine the offset from intermediate masking threshold using the following equation:
O(i)=(α(i)-μ)*i+(3μ-1)*α(i)+μ;
where μ is the fourth adjustment value, μ ∈ [0.5, 6.5].
An expansion critical band spectrum determination subunit configured to determine the expansion critical band spectrum by using the following formula:
wherein C(i) is the spectrum value of the i-th expansion critical band, j is the frequency point index, N is the number of frequency points, P(j) is the power spectrum of the j-th frequency point of the noisy speech signal, and SP_ij is the spreading function of the j-th frequency point in the i-th frequency band of the noisy speech signal.
Further, the expansion critical band spectrum determining subunit is further configured to determine the expansion function of the jth frequency point in the ith frequency band of the noisy speech signal by using the following formula:
the masking threshold estimation device provided by the embodiment of the invention can execute the masking threshold estimation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the masking threshold estimation method.
EXAMPLE III
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. As shown in fig. 3, the electronic device 300 includes one or more processors 320 and a storage device 310 configured to store one or more programs. When the one or more programs are executed by the one or more processors 320, the one or more processors 320 implement the masking threshold estimation method provided in the embodiments of the present application, the method including:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold.
Of course, those skilled in the art will understand that the processor 320 may also implement the technical solution of the masking threshold estimation method provided in any embodiment of the present application.
The electronic device 300 shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 3, the electronic device 300 includes a processor 320, a storage device 310, an input device 330, and an output device 340; the number of the processors 320 in the electronic device may be one or more, and one processor 320 is taken as an example in fig. 3; the processor 320, the storage device 310, the input device 330, and the output device 340 in the electronic apparatus may be connected by a bus or other means, and are exemplified by a bus 350 in fig. 3.
The storage device 310 is a computer-readable storage medium, and can be used to store software programs, computer-executable programs, and module units, such as program instructions corresponding to the masking threshold estimation method in the embodiment of the present application.
The storage device 310 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the storage device 310 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the storage device 310 may further include memory located remotely from the processor 320, which may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numbers, character information, or voice information, and to generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 340 may include a display screen, a speaker, and other electronic devices.
Example four
The fourth embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the masking threshold estimation method provided in the embodiments of the present invention, the method including:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to the comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The masking threshold estimation apparatus, electronic device, and storage medium provided in the above embodiments can execute the masking threshold estimation method provided in any embodiment of the present application, and have corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiments, reference may be made to the masking threshold estimation method provided in any embodiment of the present application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A masking threshold estimation method, comprising:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to the comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
2. The method of claim 1, wherein determining a deviation of a speech feature spectrum of the noisy speech signal from the magnitude spectrum of the noisy speech signal and the magnitude spectrum of the noise signal, and determining a flatness of the speech feature from the magnitude spectrum of the noisy speech signal, comprises:
determining the deviation of the speech characteristic spectrum by adopting the following formula:
where i is the frequency band index, j is the frequency point index, N is the number of frequency points (0 ≤ j < N), Diff(i) is the speech characteristic spectrum deviation of the i-th band of the noisy speech signal, D_i(j) is the estimated amplitude spectrum of the j-th frequency point in the i-th band of the noise signal, Y_i(j) is the amplitude spectrum of the j-th frequency point in the i-th band of the noisy speech signal, D̄_i is the mean of the estimated amplitude spectrum of the i-th band of the noise signal, and Ȳ_i is the mean of the amplitude spectrum of the i-th band of the noisy speech signal;
further comprising, determining the flatness of the speech features using the following formula:
where Flat(i) is the speech characteristic flatness of the i-th band of the noisy speech signal.
3. The method according to claim 1, wherein determining pure-tone coefficients of different frequency bands in the noisy speech signal according to the speech feature spectrum bias and the flatness of the speech feature comprises:
where α(i) is the pure tone coefficient of the i-th band of the noisy speech signal, β_i is the first adjustment value, β_i ∈ [0, 1], and η_i is the second adjustment value, η_i ∈ [-70, -50].
4. The method of claim 1, wherein determining an intermediate masking threshold based on the pure tone coefficients comprises:
where T(i) is the intermediate masking threshold of the i-th band of the noisy speech signal, O(i) is the offset of the i-th band of the noisy speech signal from the intermediate masking threshold, T(i-1) is the masking threshold of the (i-1)-th band of the noisy speech signal, and λ_i is the third adjustment value, λ_i ∈ [0, 1].
5. The method of claim 4, wherein the offset from the intermediate masking threshold is determined using the following equation:
O(i)=(α(i)-μ)*i+(3μ-1)*α(i)+μ;
where μ is the fourth adjustment value, μ ∈ [0.5, 6.5].
6. The method of claim 4, wherein the extended critical band spectrum is determined using the following formula:
wherein C(i) is the spectrum value of the i-th expansion critical band, j is the frequency point index, N is the number of frequency points, P(j) is the power spectrum of the j-th frequency point of the noisy speech signal, and SP_ij is the spreading function of the j-th frequency point in the i-th frequency band of the noisy speech signal.
8. A masking threshold estimating apparatus, comprising:
the basic parameter acquisition module is used for acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
the characteristic parameter determining module is used for determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
the pure tone coefficient determining module is used for determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
the intermediate masking threshold determining module is used for determining an intermediate masking threshold according to the pure tone coefficient;
and the target masking threshold determining module is used for determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the masking threshold estimation method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the masking threshold estimation method as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111250359.0A CN113851151A (en) | 2021-10-26 | 2021-10-26 | Masking threshold estimation method, device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111250359.0A CN113851151A (en) | 2021-10-26 | 2021-10-26 | Masking threshold estimation method, device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113851151A true CN113851151A (en) | 2021-12-28 |
Family
ID=78983130
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111250359.0A Pending CN113851151A (en) | 2021-10-26 | 2021-10-26 | Masking threshold estimation method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113851151A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114241800A (en) * | 2022-02-28 | 2022-03-25 | 天津市北海通信技术有限公司 | Intelligent stop reporting auxiliary system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005107448A (en) * | 2003-10-02 | 2005-04-21 | Nippon Telegr & Teleph Corp <Ntt> | Noise reduction processing method, and device, program, and recording medium for implementing same method |
US20130054178A1 (en) * | 2011-08-30 | 2013-02-28 | Ou Eliko Tehnoloogia Arenduskeskus | Method and device for broadband analysis of systems and substances |
CN103021420A (en) * | 2012-12-04 | 2013-04-03 | 中国科学院自动化研究所 | Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation |
CN107705801A (en) * | 2016-08-05 | 2018-02-16 | 中国科学院自动化研究所 | The training method and Speech bandwidth extension method of Speech bandwidth extension model |
CN110310656A (en) * | 2019-05-27 | 2019-10-08 | 重庆高开清芯科技产业发展有限公司 | A kind of sound enhancement method |
WO2021114733A1 (en) * | 2019-12-10 | 2021-06-17 | 展讯通信(上海)有限公司 | Noise suppression method for processing at different frequency bands, and system thereof |
- 2021-10-26 CN CN202111250359.0A patent/CN113851151A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005107448A (en) * | 2003-10-02 | 2005-04-21 | Nippon Telegr & Teleph Corp <Ntt> | Noise reduction processing method, and device, program, and recording medium for implementing same method |
US20130054178A1 (en) * | 2011-08-30 | 2013-02-28 | Ou Eliko Tehnoloogia Arenduskeskus | Method and device for broadband analysis of systems and substances |
CN103021420A (en) * | 2012-12-04 | 2013-04-03 | 中国科学院自动化研究所 | Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation |
CN107705801A (en) * | 2016-08-05 | 2018-02-16 | 中国科学院自动化研究所 | The training method and Speech bandwidth extension method of Speech bandwidth extension model |
CN110310656A (en) * | 2019-05-27 | 2019-10-08 | 重庆高开清芯科技产业发展有限公司 | A kind of sound enhancement method |
WO2021114733A1 (en) * | 2019-12-10 | 2021-06-17 | 展讯通信(上海)有限公司 | Noise suppression method for processing at different frequency bands, and system thereof |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114241800A (en) * | 2022-02-28 | 2022-03-25 | 天津市北海通信技术有限公司 | Intelligent stop reporting auxiliary system |
CN114241800B (en) * | 2022-02-28 | 2022-05-27 | 天津市北海通信技术有限公司 | Intelligent stop reporting auxiliary system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6889698B2 (en) | Methods and devices for amplifying audio | |
CN109410975B (en) | Voice noise reduction method, device and storage medium | |
US8712074B2 (en) | Noise spectrum tracking in noisy acoustical signals | |
EP3526979B1 (en) | Method and apparatus for output signal equalization between microphones | |
US11069366B2 (en) | Method and device for evaluating performance of speech enhancement algorithm, and computer-readable storage medium | |
CN109036460B (en) | Voice processing method and device based on multi-model neural network | |
CN110634497A (en) | Noise reduction method and device, terminal equipment and storage medium | |
TW201448616A (en) | Method and apparatus for determining directions of uncorrelated sound sources in a Higher Order Ambisonics representation of a sound field | |
US10531178B2 (en) | Annoyance noise suppression | |
CN113766073A (en) | Howling detection in a conferencing system | |
CN112485761B (en) | Sound source positioning method based on double microphones | |
WO2016119388A1 (en) | Method and device for constructing focus covariance matrix on the basis of voice signal | |
Kim et al. | Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain. | |
CN109997186B (en) | Apparatus and method for classifying acoustic environments | |
CN110634508A (en) | Music classifier, related method and hearing aid | |
CN113851151A (en) | Masking threshold estimation method, device, electronic equipment and storage medium | |
CN113949955A (en) | Noise reduction processing method and device, electronic equipment, earphone and storage medium | |
Satoh et al. | Ambient sound-based proximity detection with smartphones | |
TW201503116A (en) | Method for using voiceprint identification to operate voice recoginition and electronic device thereof | |
CN112513977A (en) | Signal processing device and method, and program | |
CN110169082A (en) | Combining audio signals output | |
CN111383629B (en) | Voice processing method and device, electronic equipment and storage medium | |
CN112562717B (en) | Howling detection method and device, storage medium and computer equipment | |
CN115410593A (en) | Audio channel selection method, device, equipment and storage medium | |
Nabi et al. | An improved speech enhancement algorithm for dual-channel mobile phones using wavelet and genetic algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||