CN113851151A - Masking threshold estimation method, device, electronic equipment and storage medium - Google Patents

Masking threshold estimation method, device, electronic equipment and storage medium

Info

Publication number
CN113851151A
CN113851151A (application CN202111250359.0A)
Authority
CN
China
Prior art keywords: noise, signal, masking threshold, voice, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111250359.0A
Other languages
Chinese (zh)
Inventor
秦永红
付贤会
刘武钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Rongxun Technology Co ltd
Original Assignee
Beijing Rongxun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Rongxun Technology Co ltd filed Critical Beijing Rongxun Technology Co ltd
Priority to CN202111250359.0A priority Critical patent/CN113851151A/en
Publication of CN113851151A publication Critical patent/CN113851151A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 the extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 characterised by the analysis technique using neural networks
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0272 Voice signal separating
    • G10L2021/02087 Noise filtering, the noise being separate speech, e.g. cocktail party

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephone Function (AREA)

Abstract

The embodiment of the invention discloses a masking threshold estimation method and device, electronic equipment and a storage medium. The method comprises the following steps: acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise; determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness; determining an intermediate masking threshold according to the power spectrum of the voice signal with the noise, the amplitude spectrum of the voice signal with the noise and the pure tone coefficient; and determining a target masking threshold according to the comparison result of the predetermined absolute masking threshold and the intermediate masking threshold. The embodiment of the invention can improve the accuracy of the masking threshold estimation, further effectively enhance the noise suppression result and improve the voice recognition effect.

Description

Masking threshold estimation method, device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of signal processing, in particular to a masking threshold estimation method and device, electronic equipment and a storage medium.
Background
With the rapid development of signal processing and speech recognition techniques, speech enhancement in front-end preprocessing is becoming increasingly important. Generally, when a device plays sound, noise is heard along with the voice; the presence of noise interferes with the voice and can even affect how the human ear perceives it. A blind source separation technique is normally adopted, and the most important technical means in current blind source separation techniques is estimating the masking threshold.
At present, in non-stationary environments, many noise estimation algorithms suffer from problems such as tracking delay and large errors. Some researchers attempt speech enhancement using the auditory characteristics of the human ear in non-stationary environments, but the estimation accuracy of the masking threshold is the key to performing speech enhancement based on auditory characteristics.
Therefore, how to improve the estimation accuracy of the masking threshold is a technical problem to be solved urgently by those skilled in the art.
Disclosure of Invention
The embodiment of the invention provides a masking threshold estimation method and device, electronic equipment and a storage medium, which can improve the estimation accuracy of a masking threshold.
In a first aspect, an embodiment of the present invention provides a masking threshold estimation method, including:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to the comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
In a second aspect, an embodiment of the present invention further provides a masking threshold estimation apparatus, including:
the basic parameter acquisition module is used for acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
the characteristic parameter determining module is used for determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
the pure tone coefficient determining module is used for determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
the intermediate masking threshold determining module is used for determining an intermediate masking threshold according to the pure tone coefficient;
and the target masking threshold determining module is used for determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement a masking threshold estimation method as in any embodiment of the invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the masking threshold estimation method according to any embodiment of the present invention.
According to the masking threshold estimation scheme provided by the embodiment of the invention, the amplitude spectrum of a voice signal with noise is obtained, and the amplitude spectrum of a noise signal in the voice signal with noise is obtained; wherein the noisy speech signal comprises a clean speech signal and a noise signal; determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise; determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness; determining an intermediate masking threshold according to the pure tone coefficient; and determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold. By the technical scheme provided by the embodiment of the invention, the accuracy of the masking threshold estimation can be improved, the noise suppression result can be effectively enhanced, and the voice recognition effect is improved.
Drawings
Fig. 1 is a flowchart of a masking threshold estimation method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a masking threshold estimation device according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present invention. It should be understood that the drawings and the embodiments of the present invention are illustrative only and are not intended to limit the scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in the present invention are intended to be illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present invention are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Example one
Fig. 1 is a flowchart of a masking threshold estimation method in a first embodiment of the present invention, which is applicable to a case of estimating a masking threshold in a noisy environment. The method may be performed by a masking threshold estimation apparatus, which may be implemented in software and/or hardware, and may be configured in an electronic device, for example, the electronic device may be a device with communication and computing capabilities, such as a background server. As shown in fig. 1, the method specifically includes:
s110, obtaining an amplitude spectrum of a voice signal with noise, and obtaining an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal.
The noisy voice signal is acquired by at least one voice acquisition device at a voice acquisition site. The voice acquisition site can be a communication site in a noisy environment such as a conference room, a broadcasting room or a railway station, or a military communication site, a voice recognition site, and the like. For example, when a broadcaster broadcasts news, various sounds may appear in the broadcasting room: traffic noise from vehicles passing outside the building, or noise from the air conditioning system, the light control system, the camera and workers moving back and forth inside the building. At this time, the voice signals in the broadcasting room need to be collected, and speech enhancement performed on the broadcaster's voice signal.
The voice acquisition device can be a microphone or a wave detector. Specifically, the number of the voice collecting devices is not limited, and may be 1 or more. When the number of the voice collecting devices is 2 or more, the arrangement mode of the voice collecting devices is not limited in order to collect voice signals at different positions. For example, the speech acquisition devices may be arranged along a circumferential direction of a source of clean speech signals in the noisy speech signal. In addition, because noise interference in the noisy speech signal has uncertainty and randomness, the speech acquisition device can acquire the noisy speech signal continuously or intermittently at short intervals.
Further, in order to better perform masking threshold estimation on the noisy speech signal, the collected noisy speech signal needs to be converted into a frequency-domain signal, for example by using a Fourier transform or the like. The power spectrum characterizes how the power of the noisy speech signal changes with frequency, i.e., the power of the noisy speech signal per unit frequency band; the amplitude spectrum represents the distribution of the signal's amplitude over frequency: in the frequency-domain description of the signal, frequency is the independent variable, the amplitude of each frequency component forming the signal is the dependent variable, and this function of frequency is called the amplitude spectrum. The power spectrum and the amplitude spectrum of the noisy speech signal can be obtained by performing an FFT (Fast Fourier Transform) on the collected noisy speech signal.
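For illustration only, the following is a minimal NumPy sketch of this step; the Hann analysis window, frame length and FFT size are assumptions rather than values specified herein.

```python
import numpy as np

def frame_spectra(noisy_frame, n_fft=512):
    """Amplitude and power spectra of one frame of the noisy speech signal."""
    windowed = noisy_frame * np.hanning(len(noisy_frame))  # assumed Hann analysis window
    spectrum = np.fft.rfft(windowed, n_fft)                # FFT of the windowed frame
    amplitude = np.abs(spectrum)                           # amplitude spectrum Y(j)
    power = amplitude ** 2                                 # power spectrum P(j)
    return amplitude, power
```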
Further, the amplitude spectrum of the noise signal in the noisy speech signal is obtained by estimating the noise signal from the collected noisy speech signal and performing FFT processing. The noise signal estimation method may be one of the following three methods: a recursive average noise estimation algorithm, a minimum tracking algorithm, or a histogram noise estimation algorithm.
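As an illustration of the first of these options, a minimal sketch of a recursive-average noise estimate follows; the smoothing factor and the hold-during-speech simplification are assumptions, not details taken from this disclosure.

```python
def update_noise_estimate(noise_mag, noisy_mag, alpha=0.95, speech_present=False):
    """One recursive-average update of the noise amplitude spectrum D(j).

    During speech absence the estimate tracks the noisy amplitude spectrum;
    during speech presence it is simply held (a common simplification).
    """
    if speech_present:
        return noise_mag
    return alpha * noise_mag + (1.0 - alpha) * noisy_mag
```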
It should be noted that the noisy speech signal includes a clean speech signal and a noise signal. The clean voice signal refers to the desired voice signal, and the noise signal refers to all interference signals other than the desired voice signal. For example, when a broadcaster broadcasts news indoors, the broadcaster's sound signal is the clean speech signal, while the sounds generated indoors by the air conditioning system, the light control system, the camera and workers moving back and forth, together with the sound of vehicles passing outdoors, are noise signals.
S120, determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise.
The voice characteristic spectrum deviation is a difference between an amplitude spectrum of each frequency band or each frequency point of the voice signal with noise and an average value of the amplitude spectrum, and can be used for measuring the precision of the obtained amplitude spectrum of the voice signal with noise. The speech feature spectrum deviation can be obtained by using a statistical formula.
In this embodiment, optionally, determining the voice feature spectrum deviation of the noisy voice signal according to the amplitude spectrum of the noisy voice signal and the amplitude spectrum of the noise signal, and determining the voice feature flatness according to the amplitude spectrum of the noisy voice signal includes:
determining the deviation of the speech characteristic spectrum by adopting the following formula:
[Formula image: Figure BDA0003322404600000071]
wherein i is the frequency band number, j is the frequency point number, N is the number of frequency points (0 ≤ j < N), Diff(i) is the speech characteristic spectrum deviation of the i-th band of the noisy speech signal, D_i(j) is the estimated value of the amplitude spectrum at the j-th frequency point of the i-th band of the noise signal, Y_i(j) is the amplitude spectrum at the j-th frequency point of the i-th band of the noisy speech signal, D̄_i is the average of the estimated amplitude spectrum of the i-th band of the noise signal, and Ȳ_i is the average of the amplitude spectrum of the i-th band of the noisy speech signal.
It will, of course, be appreciated that D̄_i, the average of the estimated amplitude spectrum of the i-th band of the noise signal, can be determined by adopting the following formula:
D̄_i = (1/N) * Σ_{j=0}^{N-1} D_i(j)
it should be noted that, in a short time (e.g., 10ms to 30ms), the shape of the vocal cords and vocal tract of the human is relatively stable, and thus the short-time spectrum of the acquired human voice signal has relative stability. The voice acquisition site may be in a non-stationary environment, and in order to ensure that the noise signal in the voice signal to be enhanced can be effectively suppressed in the non-stationary environment, the voice signal to be enhanced needs to be divided into a plurality of frequency bands according to the frequency domain of the voice signal to be enhanced, and each frequency band includes a plurality of frequency points. For example, a noisy speech signal may be divided into 5 frequency bands, each of which includes 20 frequency points.
The frequency band division may be based on a Bark scale, or may also be based on a Mel scale. The Bark scale is a unit of perception frequency, the frequency of the voice signal to be enhanced is mapped to 24 psychoacoustic critical frequency bands in Hertz, the width of each critical frequency band is one Bark scale, and when the frequency band division is carried out in the Bark scale, physical frequency needs to be converted into psychoacoustic frequency. The Mel scale is a frequency band division approach closer to the human auditory system.
For example, for different dedicated devices, the way of band division for the speech signal to be enhanced may be selected according to the application scenario. For example, when a broadcaster broadcasts news in a broadcasting room, since the collected sound signal of each broadcaster and the noise signal in the broadcasting room are relatively stable, the frequency domain of the to-be-enhanced speech signal may be divided into 26 frequency bands according to the Mel scale.
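As an illustration of Mel-scale band division, the sketch below groups FFT bins into contiguous Mel bands; the sampling rate, FFT size and the contiguous-bin grouping are assumptions, while the 26-band default mirrors the example above.

```python
import numpy as np

def mel_band_edges(n_bands=26, sr=16000, n_fft=512):
    """FFT-bin edges of contiguous Mel-scale bands; band i covers bins edges[i]..edges[i+1]-1."""
    f_max = sr / 2.0
    mel_max = 2595.0 * np.log10(1.0 + f_max / 700.0)        # Hz -> Mel
    mel_edges = np.linspace(0.0, mel_max, n_bands + 1)      # equally spaced on the Mel scale
    hz_edges = 700.0 * (10.0 ** (mel_edges / 2595.0) - 1.0) # Mel -> Hz
    edges = np.round(hz_edges / f_max * (n_fft // 2)).astype(int)
    return edges
```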
In this embodiment, optionally, the method further includes determining the flatness of the voice feature by using the following formula:
[Formula image: Figure BDA0003322404600000081]
where Flat(i) is the speech characteristic flatness of the i-th band of the noisy speech signal.
It will be appreciated that Ȳ_i, the average of the amplitude spectrum of the i-th band of the noisy speech signal, can be determined by adopting the following formula:
Ȳ_i = (1/N) * Σ_{j=0}^{N-1} Y_i(j)
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise. The method has the advantages that the calculation mode is simpler, more convenient and faster, the data required by calculation is easy to obtain, and the data can be used as the calculation basis of the pure tone coefficient in the follow-up process.
And S130, determining pure tone coefficients of different frequency bands in the noisy speech signal according to the speech characteristic spectrum deviation and the speech characteristic flatness.
The pure tone coefficient can be calculated and determined by establishing a relational expression between the voice characteristic spectrum deviation and the voice characteristic flatness and the pure tone coefficient, and can also be determined by establishing a neural network model for masking threshold estimation.
In a possible embodiment, optionally, determining pure-tone coefficients of different frequency bands in the noisy speech signal according to the speech feature spectrum deviation and the speech feature flatness includes:
[Formula image: Figure BDA0003322404600000091]
where α(i) is the pure tone coefficient of the i-th band of the noisy speech signal, β_i is the first adjustment value, β_i ∈ [0, 1], and η_i is the second adjustment value, η_i ∈ [-70, -50].
It can be understood that the above formula relates the speech characteristic spectrum deviation and the speech characteristic flatness to the pure tone coefficient, yielding an intermediate pure tone coefficient; this intermediate value is compared with 1, and the minimum of the two is taken as the pure tone coefficient α(i). The first adjustment value β_i and the second adjustment value η_i are adjustment values obtained as needed or empirically when calculating α(i); of course, they can be adapted according to the application scenario.
And S140, determining an intermediate masking threshold according to the pure tone coefficient.
The intermediate masking threshold is an intermediate product of the estimated target masking threshold, and can be determined by establishing a relational expression calculation or by establishing a neural network model of masking threshold estimation.
Optionally, determining an intermediate masking threshold according to the pure tone coefficient includes:
[Formula image: Figure BDA0003322404600000092]
where T(i) is the intermediate masking threshold of the i-th band of the noisy speech signal, O(i) is the offset of the i-th band of the noisy speech signal from the intermediate masking threshold, T(i-1) is the masking threshold of the (i-1)-th band of the noisy speech signal, and λ_i is the third adjustment value, λ_i ∈ [0, 1].
It will be appreciated that the third adjustment value λ_i is an adjustment value obtained as needed or empirically when calculating the intermediate masking threshold T(i). The offset O(i) relative to the intermediate masking threshold and the extended critical band spectrum value C(i) may be determined by calculation through an established relational expression, or may be determined by establishing a neural network model for masking threshold estimation.
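The formula for T(i) is given above only as an image. For orientation, the sketch below uses the classic Johnston-style relation between the spread critical band spectrum C(i) and the offset O(i), blended with the previous band's threshold through λ_i; this reconstruction is an assumption and not the exact formula of this disclosure.

```python
import numpy as np

def intermediate_thresholds(C, O, lam=0.5):
    """Band-wise intermediate masking thresholds (assumed Johnston-style reconstruction).

    T_raw(i) = 10**(log10(C(i)) - O(i)/10); each band is then blended with the
    previous band's threshold through lam, mirroring the T(i-1) term in the text.
    """
    T = np.zeros(len(C))
    for i in range(len(C)):
        t_raw = 10.0 ** (np.log10(max(C[i], 1e-12)) - O[i] / 10.0)
        T[i] = t_raw if i == 0 else lam * T[i - 1] + (1.0 - lam) * t_raw
    return T
```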
On the basis of the above technical solutions, optionally, the offset from the intermediate masking threshold is determined by using the following formula:
O(i)=(α(i)-μ)*i+(3μ-1)*α(i)+μ;
where μ is a fourth adjustment value, μ ∈ [0.5, 6.5].
It is to be understood that the fourth adjustment value μ is an adjustment value obtained according to needs or experience when calculating the offset O(i) from the intermediate masking threshold, and may be adaptively adjusted according to the application scenario.
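Since the offset formula is stated explicitly, it translates directly into code; the value of μ below is merely an example inside the stated [0.5, 6.5] range.

```python
def offset(alpha_i, i, mu=4.0):
    """Offset O(i) relative to the intermediate masking threshold, per the stated formula."""
    return (alpha_i - mu) * i + (3.0 * mu - 1.0) * alpha_i + mu
```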
On the basis of the above technical solutions, optionally, the expansion critical band spectrum is determined by using the following formula:
C(i) = Σ_{j=0}^{N-1} P(j) * SP_ij
wherein C(i) is the expansion critical band spectrum value, j is the frequency point number, N is the number of frequency points, P(j) is the power spectrum at the j-th frequency point of the noisy speech signal, and SP_ij is the spreading function of the j-th frequency point in the i-th frequency band of the noisy speech signal.
It is understood that the expansion critical band spectrum C(i) is obtained by summing the products of the power spectrum of each frequency point in each frequency band of the noisy speech signal and the spreading function.
Optionally, the following formula is adopted to determine the spreading function of the jth frequency point in the ith frequency band of the noisy speech signal:
[Formula image: Figure BDA0003322404600000102]
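The spreading function itself is given only as an image; the sketch below performs the sum-of-products for C(i) with the widely used Schroeder spreading function substituted as an assumption.

```python
import numpy as np

def spread_band_spectrum(P, bark_of_bin, n_bands):
    """C(i) as the sum over frequency points of P(j) * SP_ij.

    SP_ij is modelled with the Schroeder spreading function (an assumption);
    bark_of_bin holds the Bark value of each FFT bin.
    """
    C = np.zeros(n_bands)
    for i in range(n_bands):
        dz = i - bark_of_bin                                 # critical-band distance
        sp_db = 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)
        C[i] = np.sum(P * 10.0 ** (sp_db / 10.0))            # power spectrum x spreading
    return C
```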
s150, determining a target masking threshold according to a comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
The absolute masking threshold is the masking threshold at which a person with normal hearing can just perceive the quietest sound in a noise-free environment, and may be determined by the following formula:
J(i) = 3.64 * i^(-0.8) - 6.5 * exp(-0.6 * (i - 3.3)^2) + 10^(-3) * i^4
where J(i) is the absolute masking threshold of the i-th frequency band.
The predetermined absolute masking threshold is then compared with the intermediate masking threshold, which can be done using the following formula:
T(i)=max(T(i),J(i))。
It will be appreciated that the maximum of the intermediate masking threshold T(i) and the absolute masking threshold J(i) is taken as the target masking threshold.
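Putting S150 together, the following sketch computes the absolute threshold and the element-wise maximum; interpreting i as the band centre frequency in kHz and converting dB to a power scale before the comparison are assumptions.

```python
import numpy as np

def target_masking_threshold(T_intermediate, f_khz):
    """Target threshold: maximum of the intermediate and absolute masking thresholds."""
    J_db = (3.64 * f_khz ** (-0.8)
            - 6.5 * np.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)                 # absolute threshold in quiet, dB SPL
    J = 10.0 ** (J_db / 10.0)                    # assumed dB -> power conversion
    return np.maximum(T_intermediate, J)
```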
According to the technical scheme of the embodiment of the invention, an amplitude spectrum of a voice signal with noise is obtained, and an amplitude spectrum of a noise signal in the voice signal with noise is obtained; wherein the noisy speech signal comprises a clean speech signal and a noise signal; determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise; determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness; determining an intermediate masking threshold according to the pure tone coefficient; and determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold. According to the technical scheme provided by the embodiment of the invention, the noise-carrying voice signal is divided into a plurality of frequency bands, and the basic parameter, the characteristic parameter, the pure tone coefficient and the masking threshold of each frequency band are respectively calculated, so that the accuracy of target masking threshold estimation can be improved, the noise suppression result can be effectively enhanced, and the voice recognition effect can be improved.
Example two
Fig. 2 is a schematic structural diagram of a masking threshold estimation device in a second embodiment of the present invention, which is applicable to a case of estimating a masking threshold in a noisy environment. As shown in fig. 2, the apparatus includes:
a basic parameter obtaining module 210, configured to obtain an amplitude spectrum of a voice signal with noise, and obtain an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
a characteristic parameter determining module 220, configured to determine a voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determine a voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
a pure tone coefficient determining module 230, configured to determine pure tone coefficients of different frequency bands in the noisy speech signal according to the speech feature spectrum deviation and the speech feature flatness;
an intermediate masking threshold determining module 240, configured to determine an intermediate masking threshold according to the pure tone coefficient;
a target masking threshold determining module 250, configured to determine a target masking threshold according to a comparison result between a predetermined absolute masking threshold and the intermediate masking threshold.
According to the technical scheme of the embodiment of the invention, an amplitude spectrum of a voice signal with noise is obtained, and an amplitude spectrum of a noise signal in the voice signal with noise is obtained; wherein the noisy speech signal comprises a clean speech signal and a noise signal; determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise; determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness; determining an intermediate masking threshold according to the pure tone coefficient; and determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold. By the technical scheme provided by the embodiment of the invention, the accuracy of the masking threshold estimation can be improved, the noise suppression result can be effectively enhanced, and the voice recognition effect is improved.
Further, the characteristic parameter determining module 220 includes:
a voice characteristic spectrum deviation determining unit, configured to determine a voice characteristic spectrum deviation by using the following formula:
[Formula image: Figure BDA0003322404600000131]
wherein i is the frequency band number, j is the frequency point number, N is the number of frequency points (0 ≤ j < N), Diff(i) is the speech characteristic spectrum deviation of the i-th band of the noisy speech signal, D_i(j) is the estimated value of the amplitude spectrum at the j-th frequency point of the i-th band of the noise signal, Y_i(j) is the amplitude spectrum at the j-th frequency point of the i-th band of the noisy speech signal, D̄_i is the average of the estimated amplitude spectrum of the i-th band of the noise signal, and Ȳ_i is the average of the amplitude spectrum of the i-th band of the noisy speech signal;
a voice feature flatness determination unit, configured to determine a voice feature flatness by using the following formula:
[Formula image: Figure BDA0003322404600000134]
where Flat(i) is the speech characteristic flatness of the i-th band of the noisy speech signal.
Further, the pure tone coefficient determining module 230 is specifically configured to:
[Formula image: Figure BDA0003322404600000135]
where α(i) is the pure tone coefficient of the i-th band of the noisy speech signal, β_i is the first adjustment value, β_i ∈ [0, 1], and η_i is the second adjustment value, η_i ∈ [-70, -50].
Further, the intermediate masking threshold determining module 240 includes:
an intermediate masking threshold determining unit for determining the intermediate masking threshold using the following formula:
[Formula image: Figure BDA0003322404600000136]
where T(i) is the intermediate masking threshold of the i-th band of the noisy speech signal, O(i) is the offset of the i-th band of the noisy speech signal from the intermediate masking threshold, T(i-1) is the masking threshold of the (i-1)-th band of the noisy speech signal, and λ_i is the third adjustment value, λ_i ∈ [0, 1].
Further, the intermediate masking threshold determining unit includes:
an offset from intermediate masking threshold determining subunit configured to determine the offset from intermediate masking threshold using the following equation:
O(i)=(α(i)-μ)*i+(3μ-1)*α(i)+μ;
where μ is a fourth adjustment value, μ ∈ [0.5, 6.5].
An expansion critical band spectrum determination subunit configured to determine the expansion critical band spectrum by using the following formula:
C(i) = Σ_{j=0}^{N-1} P(j) * SP_ij
wherein C(i) is the expansion critical band spectrum value, j is the frequency point number, N is the number of frequency points, P(j) is the power spectrum at the j-th frequency point of the noisy speech signal, and SP_ij is the spreading function of the j-th frequency point in the i-th frequency band of the noisy speech signal.
Further, the expansion critical band spectrum determining subunit is further configured to determine the spreading function of the j-th frequency point in the i-th frequency band of the noisy speech signal by using the following formula:
[Formula image: Figure BDA0003322404600000142]
the masking threshold estimation device provided by the embodiment of the invention can execute the masking threshold estimation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the masking threshold estimation method.
EXAMPLE III
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. As shown in fig. 3, the electronic device 300 includes: one or more processors 320; the storage device 310 is configured to store one or more programs, and when the one or more programs are executed by the one or more processors 320, the one or more processors 320 implement the masking threshold estimation method provided in the embodiment of the present application, the method includes:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold.
Of course, those skilled in the art can understand that the processor 320 also implements the technical solution of the masking threshold estimation method provided in any embodiment of the present application.
The electronic device 300 shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 3, the electronic device 300 includes a processor 320, a storage device 310, an input device 330, and an output device 340; the number of the processors 320 in the electronic device may be one or more, and one processor 320 is taken as an example in fig. 3; the processor 320, the storage device 310, the input device 330, and the output device 340 in the electronic apparatus may be connected by a bus or other means, and are exemplified by a bus 350 in fig. 3.
The storage device 310 is a computer-readable storage medium, and can be used to store software programs, computer-executable programs, and module units, such as program instructions corresponding to the masking threshold estimation method in the embodiment of the present application.
The storage device 310 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the storage device 310 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 310 may further include memory located remotely from processor 320, which may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numbers, character information, or voice information, and to generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 340 may include a display screen, a speaker, and other electronic devices.
Example four
The fourth embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the masking threshold estimation method provided in the embodiments of the present invention, the method including:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to the comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The masking threshold estimation method, the masking threshold estimation device, the electronic device and the storage medium provided in the above embodiments may be implemented by any of the masking threshold estimation methods provided in the embodiments of the present application, and have corresponding functional modules and beneficial effects for implementing the methods. For technical details that are not described in detail in the above embodiments, reference may be made to the masking threshold estimation method provided in any of the embodiments of the present application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A masking threshold estimation method, comprising:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to the comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
2. The method of claim 1, wherein determining a deviation of a speech feature spectrum of the noisy speech signal from the magnitude spectrum of the noisy speech signal and the magnitude spectrum of the noise signal, and determining a flatness of the speech feature from the magnitude spectrum of the noisy speech signal, comprises:
determining the deviation of the speech characteristic spectrum by adopting the following formula:
[Formula image: Figure FDA0003322404590000011]
wherein i is the frequency band number, j is the frequency point number, N is the number of frequency points (0 ≤ j < N), Diff(i) is the speech characteristic spectrum deviation of the i-th band of the noisy speech signal, D_i(j) is the estimated value of the amplitude spectrum at the j-th frequency point of the i-th band of the noise signal, Y_i(j) is the amplitude spectrum at the j-th frequency point of the i-th band of the noisy speech signal, D̄_i is the average of the estimated amplitude spectrum of the i-th band of the noise signal, and Ȳ_i is the average of the amplitude spectrum of the i-th band of the noisy speech signal;
further comprising, determining the flatness of the speech features using the following formula:
[Formula image: Figure FDA0003322404590000021]
where Flat(i) is the speech characteristic flatness of the i-th band of the noisy speech signal.
3. The method according to claim 1, wherein determining pure-tone coefficients of different frequency bands in the noisy speech signal according to the speech feature spectrum bias and the flatness of the speech feature comprises:
[Formula image: Figure FDA0003322404590000022]
where α(i) is the pure tone coefficient of the i-th band of the noisy speech signal, β_i is the first adjustment value, β_i ∈ [0, 1], and η_i is the second adjustment value, η_i ∈ [-70, -50].
4. The method of claim 1, wherein determining an intermediate masking threshold based on the pure tone coefficients comprises:
[Formula image: Figure FDA0003322404590000023]
where T(i) is the intermediate masking threshold of the i-th band of the noisy speech signal, O(i) is the offset of the i-th band of the noisy speech signal from the intermediate masking threshold, T(i-1) is the masking threshold of the (i-1)-th band of the noisy speech signal, and λ_i is the third adjustment value, λ_i ∈ [0, 1].
5. The method of claim 4, wherein the offset from the intermediate masking threshold is determined using the following equation:
O(i)=(α(i)-μ)*i+(3μ-1)*α(i)+μ;
where μ is a fourth adjustment value, μ ∈ [0.5, 6.5].
6. The method of claim 4, wherein the extended critical band spectrum is determined using the following formula:
C(i) = Σ_{j=0}^{N-1} P(j) * SP_ij
wherein C(i) is the expansion critical band spectrum value, j is the frequency point number, N is the number of frequency points, P(j) is the power spectrum at the j-th frequency point of the noisy speech signal, and SP_ij is the spreading function of the j-th frequency point in the i-th frequency band of the noisy speech signal.
7. The method of claim 6, wherein the spreading function of the j-th frequency point in the i-th frequency band of the noisy speech signal is determined by the following formula:
[Formula image: Figure FDA0003322404590000031]
8. a masking threshold estimating apparatus, comprising:
the basic parameter acquisition module is used for acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
the characteristic parameter determining module is used for determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
the pure tone coefficient determining module is used for determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
the intermediate masking threshold determining module is used for determining an intermediate masking threshold according to the pure tone coefficient;
and the target masking threshold determining module is used for determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the masking threshold estimation method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the masking threshold estimation method as claimed in any one of claims 1 to 7.
CN202111250359.0A 2021-10-26 2021-10-26 Masking threshold estimation method, device, electronic equipment and storage medium Pending CN113851151A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111250359.0A CN113851151A (en) 2021-10-26 2021-10-26 Masking threshold estimation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111250359.0A CN113851151A (en) 2021-10-26 2021-10-26 Masking threshold estimation method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113851151A true CN113851151A (en) 2021-12-28

Family

ID=78983130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111250359.0A Pending CN113851151A (en) 2021-10-26 2021-10-26 Masking threshold estimation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113851151A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005107448A (en) * 2003-10-02 2005-04-21 Nippon Telegr & Teleph Corp <Ntt> Noise reduction processing method, and device, program, and recording medium for implementing same method
US20130054178A1 (en) * 2011-08-30 2013-02-28 Ou Eliko Tehnoloogia Arenduskeskus Method and device for broadband analysis of systems and substances
CN103021420A (en) * 2012-12-04 2013-04-03 中国科学院自动化研究所 Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation
CN107705801A (en) * 2016-08-05 2018-02-16 中国科学院自动化研究所 The training method and Speech bandwidth extension method of Speech bandwidth extension model
CN110310656A (en) * 2019-05-27 2019-10-08 重庆高开清芯科技产业发展有限公司 A kind of sound enhancement method
WO2021114733A1 (en) * 2019-12-10 2021-06-17 展讯通信(上海)有限公司 Noise suppression method for processing at different frequency bands, and system thereof

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241800A (en) * 2022-02-28 2022-03-25 天津市北海通信技术有限公司 Intelligent stop reporting auxiliary system
CN114241800B (en) * 2022-02-28 2022-05-27 天津市北海通信技术有限公司 Intelligent stop reporting auxiliary system

Similar Documents

Publication Publication Date Title
JP6889698B2 (en) Methods and devices for amplifying audio
CN109410975B (en) Voice noise reduction method, device and storage medium
US8712074B2 (en) Noise spectrum tracking in noisy acoustical signals
EP3526979B1 (en) Method and apparatus for output signal equalization between microphones
US11069366B2 (en) Method and device for evaluating performance of speech enhancement algorithm, and computer-readable storage medium
CN109036460B (en) Voice processing method and device based on multi-model neural network
CN110634497A (en) Noise reduction method and device, terminal equipment and storage medium
TW201448616A (en) Method and apparatus for determining directions of uncorrelated sound sources in a Higher Order Ambisonics representation of a sound field
US10531178B2 (en) Annoyance noise suppression
CN113766073A (en) Howling detection in a conferencing system
CN112485761B (en) Sound source positioning method based on double microphones
WO2016119388A1 (en) Method and device for constructing focus covariance matrix on the basis of voice signal
Kim et al. Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain.
CN109997186B (en) Apparatus and method for classifying acoustic environments
CN110634508A (en) Music classifier, related method and hearing aid
CN113851151A (en) Masking threshold estimation method, device, electronic equipment and storage medium
CN113949955A (en) Noise reduction processing method and device, electronic equipment, earphone and storage medium
Satoh et al. Ambient sound-based proximity detection with smartphones
TW201503116A (en) Method for using voiceprint identification to operate voice recoginition and electronic device thereof
CN112513977A (en) Signal processing device and method, and program
CN110169082A (en) Combining audio signals output
CN111383629B (en) Voice processing method and device, electronic equipment and storage medium
CN112562717B (en) Howling detection method and device, storage medium and computer equipment
CN115410593A (en) Audio channel selection method, device, equipment and storage medium
Nabi et al. An improved speech enhancement algorithm for dual-channel mobile phones using wavelet and genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination