CN113851151A - Masking threshold estimation method, device, electronic equipment and storage medium - Google Patents
- Publication number: CN113851151A (application number CN202111250359.0A)
- Authority: CN (China)
- Prior art keywords: noise, signal, masking threshold, voice, determining
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
- G10L21/0208: Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
- G10L21/0272: Speech enhancement; voice signal separating
- G10L25/18: Speech or voice analysis in which the extracted parameters are spectral information of each sub-band
- G10L25/30: Speech or voice analysis characterised by the use of neural networks
- G10L2021/02087: Noise filtering where the noise is separate speech, e.g. cocktail party
- G10L2025/783: Detection of presence or absence of voice signals based on threshold decision
Abstract
The embodiment of the invention discloses a masking threshold estimation method and device, an electronic device and a storage medium. The method comprises the following steps: acquiring the amplitude spectrum of a noisy speech signal and the amplitude spectrum of the noise signal within it; determining the speech feature spectrum deviation of the noisy speech signal from the two amplitude spectra, and determining the speech feature flatness from the amplitude spectrum of the noisy speech signal; determining pure tone coefficients for different frequency bands of the noisy speech signal from the speech feature spectrum deviation and the speech feature flatness; determining an intermediate masking threshold from the power spectrum of the noisy speech signal, its amplitude spectrum and the pure tone coefficients; and determining a target masking threshold from the comparison of a predetermined absolute masking threshold with the intermediate masking threshold. The embodiment of the invention can improve the accuracy of masking threshold estimation, thereby effectively enhancing noise suppression and improving the speech recognition effect.
Description
Technical Field
The embodiment of the invention relates to the technical field of signal processing, in particular to a masking threshold estimation method and device, electronic equipment and a storage medium.
Background
With the rapid development of signal processing and speech recognition techniques, speech enhancement in front-end preprocessing is becoming increasingly important. Generally, when a device plays sound, noise is heard along with the speech; this noise interferes with the speech and can even impair how human ears perceive it. Blind source separation is usually adopted to address this, and its most important technical means at present is estimating the masking threshold.
At present, in non-stationary environments, many noise estimation algorithms suffer from problems such as tracking delay and large errors. Some researchers attempt speech enhancement based on the auditory characteristics of human ears in such environments, but the estimation accuracy of the masking threshold is then the key to speech enhancement based on auditory characteristics.
Therefore, how to improve the estimation accuracy of the masking threshold is a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
The embodiment of the invention provides a masking threshold estimation method and device, electronic equipment and a storage medium, which can improve the estimation accuracy of a masking threshold.
In a first aspect, an embodiment of the present invention provides a masking threshold estimation method, including:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to the comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
In a second aspect, an embodiment of the present invention further provides a masking threshold estimation apparatus, including:
the basic parameter acquisition module is used for acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
the characteristic parameter determining module is used for determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
the pure tone coefficient determining module is used for determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
the intermediate masking threshold determining module is used for determining an intermediate masking threshold according to the pure tone coefficient;
and the target masking threshold determining module is used for determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold.
In a third aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the masking threshold estimation method according to any embodiment of the invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the masking threshold estimation method according to any embodiment of the present invention.
According to the masking threshold estimation scheme provided by the embodiment of the invention, the amplitude spectrum of a noisy speech signal is obtained, together with the amplitude spectrum of the noise signal within it, where the noisy speech signal comprises a clean speech signal and a noise signal; the speech feature spectrum deviation of the noisy speech signal is determined from the two amplitude spectra, and the speech feature flatness from the amplitude spectrum of the noisy speech signal; pure tone coefficients for different frequency bands of the noisy speech signal are determined from the speech feature spectrum deviation and the speech feature flatness; an intermediate masking threshold is determined from the pure tone coefficients; and a target masking threshold is determined from the comparison of a predetermined absolute masking threshold with the intermediate masking threshold. The technical scheme provided by the embodiment of the invention can improve the accuracy of masking threshold estimation, effectively enhance the noise suppression result, and improve the speech recognition effect.
Drawings
Fig. 1 is a flowchart of a masking threshold estimation method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a masking threshold estimation device according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present invention. It should be understood that the drawings and the embodiments of the present invention are illustrative only and are not intended to limit the scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an" and "the" in the present invention are intended to be illustrative rather than limiting, and those skilled in the art will understand them as meaning "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present invention are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Example one
Fig. 1 is a flowchart of a masking threshold estimation method in a first embodiment of the present invention, which is applicable to a case of estimating a masking threshold in a noisy environment. The method may be performed by a masking threshold estimation apparatus, which may be implemented in software and/or hardware, and may be configured in an electronic device, for example, the electronic device may be a device with communication and computing capabilities, such as a background server. As shown in fig. 1, the method specifically includes:
s110, obtaining an amplitude spectrum of a voice signal with noise, and obtaining an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal.
The noisy speech signal is acquired by at least one voice acquisition device at a voice acquisition site. The site can be a conference room, a broadcasting room, a communication site in a noisy environment such as a railway station, a military communication site, a speech recognition site, and so on. For example, when a broadcaster reads the news, various sounds may appear in the broadcasting room: traffic noise from vehicles passing outside the building, or noise generated inside by the air conditioning system, the lighting control system, the cameras, and staff moving back and forth. At this moment, the speech signals in the broadcasting room need to be collected, and the broadcaster's speech enhanced.
The voice acquisition device can be a microphone or a wave detector. Specifically, the number of the voice collecting devices is not limited, and may be 1 or more. When the number of the voice collecting devices is 2 or more, the arrangement mode of the voice collecting devices is not limited in order to collect voice signals at different positions. For example, the speech acquisition devices may be arranged along a circumferential direction of a source of clean speech signals in the noisy speech signal. In addition, because noise interference in the noisy speech signal has uncertainty and randomness, the speech acquisition device can acquire the noisy speech signal continuously or intermittently at short intervals.
Further, to better perform masking threshold estimation on the noisy speech signal, the collected signal needs to be converted into a frequency-domain sound signal, for example by means of a Fourier transform. The power spectrum characterizes how the power of the noisy speech signal varies with frequency, i.e. the power of the signal per unit frequency band. The amplitude spectrum describes how the signal's amplitude is distributed over frequency: in the frequency-domain description of a signal, frequency is the independent variable and the amplitude of each frequency component making up the signal is the dependent variable, and this function of frequency is called the amplitude spectrum. Both the power spectrum and the amplitude spectrum of the noisy speech signal can be obtained by applying an FFT (Fast Fourier Transform) to the collected noisy speech signal.
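The conversion described above can be sketched as follows. This is an illustrative implementation only, not the patent's own code; a naive DFT stands in for a real FFT routine so the example stays dependency-free.

```python
import cmath
import math

def spectra(frame):
    """Amplitude and power spectrum of one frame.

    A naive O(N^2) DFT is used so the sketch needs no libraries; a real
    implementation would call an FFT routine instead.
    """
    n = len(frame)
    amp, power = [], []
    for k in range(n // 2 + 1):  # one-sided spectrum, bins 0..N/2
        x = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        amp.append(abs(x))         # amplitude spectrum |Y(k)|
        power.append(abs(x) ** 2)  # power spectrum |Y(k)|^2
    return amp, power

# a 32-sample frame holding a single sinusoid centred on bin 4
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
amp, power = spectra(frame)
peak_bin = max(range(len(amp)), key=lambda k: amp[k])
```

For a pure sinusoid at bin k the amplitude spectrum peaks at that bin with magnitude N/2, which is a quick sanity check for the transform.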
Further, the amplitude spectrum of the noise signal in the noisy speech signal is obtained by estimating the noise signal from the collected noisy speech signal and applying FFT processing. The noise signal may be estimated by one of the following three methods: a recursive-averaging noise estimation algorithm, a minima-tracking algorithm, or a histogram-based noise estimation algorithm.
It should be noted that the noisy speech signal includes a clean speech signal and a noise signal. The clean voice signal refers to a desired voice signal, and the noise signal refers to all interference signals except the desired voice signal. For example, when a broadcaster broadcasts news indoors, the sound signal of the broadcaster is a pure voice signal, and the sound generated by the broadcaster when walking back and forth among an indoor air conditioning system, a light control system, a camera and a worker and the sound generated by a passing vehicle outdoors are noise signals.
S120, determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise.
The voice characteristic spectrum deviation is a difference between an amplitude spectrum of each frequency band or each frequency point of the voice signal with noise and an average value of the amplitude spectrum, and can be used for measuring the precision of the obtained amplitude spectrum of the voice signal with noise. The speech feature spectrum deviation can be obtained by using a statistical formula.
In this embodiment, optionally, determining the voice feature spectrum deviation of the noisy voice signal according to the amplitude spectrum of the noisy voice signal and the amplitude spectrum of the noise signal, and determining the voice feature flatness according to the amplitude spectrum of the noisy voice signal includes:
determining the deviation of the speech characteristic spectrum by adopting the following formula:
where i is the frequency band index, j is the frequency point index, N is the number of frequency points (0 ≤ j < N), Diff(i) is the speech feature spectrum deviation of the i-th band of the noisy speech signal, D_i(j) is the estimated amplitude spectrum at the j-th frequency point of the i-th band of the noise signal, Y_i(j) is the amplitude spectrum at the j-th frequency point of the i-th band of the noisy speech signal, D̄(i) is the mean of the estimated amplitude spectrum of the noise signal over the i-th band, and Ȳ(i) is the mean of the amplitude spectrum of the noisy speech signal over the i-th band.
It will, of course, be appreciated that the mean D̄(i) of the estimated amplitude spectrum of the noise signal over the i-th band can be determined as D̄(i) = (1/N) · Σ_{j=0}^{N−1} D_i(j).
it should be noted that, in a short time (e.g., 10ms to 30ms), the shape of the vocal cords and vocal tract of the human is relatively stable, and thus the short-time spectrum of the acquired human voice signal has relative stability. The voice acquisition site may be in a non-stationary environment, and in order to ensure that the noise signal in the voice signal to be enhanced can be effectively suppressed in the non-stationary environment, the voice signal to be enhanced needs to be divided into a plurality of frequency bands according to the frequency domain of the voice signal to be enhanced, and each frequency band includes a plurality of frequency points. For example, a noisy speech signal may be divided into 5 frequency bands, each of which includes 20 frequency points.
The frequency band division may be based on the Bark scale, or alternatively on the Mel scale. The Bark scale is a unit of perceptual frequency: the frequency of the speech signal to be enhanced, in Hertz, is mapped onto 24 psychoacoustic critical bands, the width of each critical band being one Bark; when dividing bands on the Bark scale, the physical frequency must therefore be converted into a psychoacoustic frequency. The Mel scale is a band-division approach that is likewise closer to the human auditory system.
In practice, the band-division approach for the speech signal to be enhanced may be selected according to the application scenario of the device. For example, when a broadcaster reads the news in a broadcasting room, since the collected sound signal of each broadcaster and the noise signal in the room are relatively stable, the frequency domain of the speech signal to be enhanced may be divided into 26 frequency bands on the Mel scale.
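The two scales above can be sketched with their commonly cited closed forms (a Zwicker-style Bark mapping and the O'Shaughnessy Mel mapping). The patent does not reproduce these formulas, so they are standard stand-ins, and `band_index` is a hypothetical helper for the equal-width division described in the text.

```python
import math

def hz_to_bark(f):
    # Zwicker-style Bark mapping: roughly 24 critical bands across the hearing range
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def hz_to_mel(f):
    # O'Shaughnessy Mel mapping, the usual basis for Mel filterbanks
    return 2595.0 * math.log10(1.0 + f / 700.0)

def band_index(f, n_bands, f_max, scale):
    """Map frequency f to one of n_bands equal-width bands on the chosen scale."""
    top = scale(f_max)
    return min(int(scale(f) / top * n_bands), n_bands - 1)

# e.g. a 26-band Mel division as used in the broadcasting-room example
band = band_index(1000.0, 26, 8000.0, hz_to_mel)
```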
In this embodiment, optionally, the method further includes determining the flatness of the voice feature by using the following formula:
where flat(i) is the speech feature flatness of the i-th band of the noisy speech signal.
It will be appreciated that the mean Ȳ(i) of the amplitude spectrum of the i-th band of the noisy speech signal can be determined as Ȳ(i) = (1/N) · Σ_{j=0}^{N−1} Y_i(j).
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise. The method has the advantages that the calculation mode is simpler, more convenient and faster, the data required by calculation is easy to obtain, and the data can be used as the calculation basis of the pure tone coefficient in the follow-up process.
And S130, determining pure tone coefficients of different frequency bands in the noisy speech signal according to the speech characteristic spectrum deviation and the speech characteristic flatness.
The pure tone coefficient can be calculated and determined by establishing a relational expression between the voice characteristic spectrum deviation and the voice characteristic flatness and the pure tone coefficient, and can also be determined by establishing a neural network model for masking threshold estimation.
In a possible embodiment, optionally, determining pure-tone coefficients of different frequency bands in the noisy speech signal according to the speech feature spectrum deviation and the speech feature flatness includes:
where α(i) is the pure tone coefficient of the i-th band of the noisy speech signal, β_i is the first adjustment value, β_i ∈ [0, 1], and μ_i is the second adjustment value, μ_i ∈ [−70, −50].
It can be understood that the above formula relates the speech feature spectrum deviation and the speech feature flatness to the pure tone coefficient, yielding an intermediate pure tone value that is compared with 1; the smaller of the two is taken as the pure tone coefficient α(i). The first adjustment value β_i and the second adjustment value μ_i are adjustment values obtained as needed or empirically when calculating α(i), and can of course be adapted to the application scenario.
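The patent's own combining formula is not reproduced in this text. For orientation, the classic Johnston-model tonality coefficient follows the same pattern (a flatness measure in dB, normalised by a constant in the stated μ_i range and clipped at 1), so a hedged sketch looks like:

```python
def tonality(flatness_db, mu=-60.0):
    """Johnston-style pure-tone (tonality) coefficient: the band's spectral
    flatness in dB divided by mu and clipped at 1. The patent's second
    adjustment value lies in [-70, -50]; -60 dB is the classic choice and
    only a placeholder here."""
    return min(flatness_db / mu, 1.0)

# a very peaky band (flatness around -80 dB) is fully tone-like;
# a flat band (flatness around 0 dB) is fully noise-like
alpha_tone = tonality(-80.0)
alpha_noise = tonality(0.0)
```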
And S140, determining an intermediate masking threshold according to the pure tone coefficient.
The intermediate masking threshold is an intermediate product of the estimated target masking threshold, and can be determined by establishing a relational expression calculation or by establishing a neural network model of masking threshold estimation.
Optionally, determining an intermediate masking threshold according to the pure tone coefficient includes:
where T(i) is the intermediate masking threshold of the i-th band of the noisy speech signal, O(i) is the offset of the i-th band relative to the intermediate masking threshold, T(i−1) is the masking threshold of the (i−1)-th band, and λ_i is the third adjustment value, λ_i ∈ [0, 1].
It will be appreciated that the third adjustment value λ_i is an adjustment value obtained as needed or empirically when calculating the intermediate masking threshold T(i). The offset O(i) relative to the intermediate masking threshold and the spread critical band spectrum value C(i) may each be determined either by an explicit formula or by a neural network model for masking threshold estimation.
On the basis of the above technical solutions, optionally, the offset from the intermediate masking threshold is determined by using the following formula:
O(i)=(α(i)-μ)*i+(3μ-1)*α(i)+μ;
where μ is a fourth adjustment value, μ ∈ [0.5, 6.5].
It is to be understood that the fourth adjustment value μ is an adjustment value obtained according to needs or experience when calculating the offset o (i) from the intermediate masking threshold, and may be adaptively adjusted according to an application scenario.
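The offset formula quoted above can be coded directly; the μ value below is only a mid-range placeholder from the stated interval [0.5, 6.5].

```python
def offset(alpha_i, band_i, mu=3.5):
    """Offset O(i) relative to the intermediate masking threshold, per the
    formula given in the text: O(i) = (alpha(i) - mu)*i + (3*mu - 1)*alpha(i) + mu.
    mu=3.5 is just an illustrative choice within [0.5, 6.5]."""
    return (alpha_i - mu) * band_i + (3 * mu - 1) * alpha_i + mu
```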
On the basis of the above technical solutions, optionally, the expansion critical band spectrum is determined by using the following formula:
where C(i) is the spread critical band spectrum value, j is the frequency point index, N is the number of frequency points, P(j) is the power spectrum at the j-th frequency point of the noisy speech signal, and SP_ij is the spreading function of the j-th frequency point in the i-th band of the noisy speech signal.
It is understood that the spread critical band spectrum C(i) is obtained by summing, over the frequency points of each band of the noisy speech signal, the product of each point's power spectrum and the spreading function.
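The summation described above smears each band's power across its neighbours. The patent's SP_ij is not reproduced in this text, so the sketch below substitutes the well-known Schroeder spreading function as an assumed stand-in, applied band-to-band for simplicity.

```python
import math

def schroeder_spread_db(dz):
    """Schroeder spreading function (in dB), an assumed stand-in for the
    patent's SP_ij; dz is the Bark distance between masker and masked band."""
    return 15.81 + 7.5 * (dz + 0.474) - 17.5 * math.sqrt(1.0 + (dz + 0.474) ** 2)

def spread_spectrum(band_power):
    """C(i) = sum_j P(j) * SP(i, j): each band's power leaks into neighbours
    according to the spreading function (converted from dB to linear)."""
    n = len(band_power)
    return [
        sum(band_power[j] * 10 ** (schroeder_spread_db(i - j) / 10.0) for j in range(n))
        for i in range(n)
    ]

c = spread_spectrum([0.0, 0.0, 1.0, 0.0, 0.0])
```

Masking spreads asymmetrically: the output in the band above the excited one is larger than in the band below, matching the upward spread of masking.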
Optionally, the following formula is adopted to determine the spreading function of the jth frequency point in the ith frequency band of the noisy speech signal:
s150, determining a target masking threshold according to a comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
The absolute masking threshold is the masking threshold at which a person with normal hearing can just perceive the quietest sound in a noise-free environment, and may be determined by the following formula:
J(i) = 3.64 · i^(−0.8) − 6.5 · exp(−0.6 · (i − 3.3)²) + 10^(−3) · i⁴;
where j (i) is the absolute masking threshold for the ith frequency band.
Then the predetermined absolute masking threshold is compared with the intermediate masking threshold, which can be done using the following formula:
T(i)=max(T(i),J(i))。
it will be appreciated that the maximum of the intermediate masking threshold t (i) and the absolute masking threshold j (i) is taken as the target masking threshold.
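Combining the threshold in quiet with the element-wise maximum above gives the final step. Reading J(i)'s argument as frequency in kHz follows the classic Terhardt form of this formula and is an assumption here, as is the reconstructed −0.6 factor in the exponential.

```python
import math

def absolute_threshold_db(f_khz):
    """Terhardt-style threshold in quiet (dB SPL), the usual reading of the
    J(i) formula, with the frequency argument assumed to be in kHz."""
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)

def target_threshold(intermediate, absolute):
    """Final step of the scheme: the target masking threshold in each band is
    the larger of the intermediate threshold and the threshold in quiet."""
    return [max(t, j) for t, j in zip(intermediate, absolute)]
```

The dip of the threshold-in-quiet curve near 3 to 4 kHz reflects the ear's greatest sensitivity in that region.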
According to the technical solution of this embodiment of the invention, the amplitude spectrum of a noisy speech signal and the amplitude spectrum of the noise signal contained in it are obtained, where the noisy speech signal comprises a clean speech signal and a noise signal; the speech characteristic spectrum deviation of the noisy speech signal is determined from the amplitude spectrum of the noisy speech signal and the amplitude spectrum of the noise signal, and the speech characteristic flatness is determined from the amplitude spectrum of the noisy speech signal; pure tone coefficients of different frequency bands in the noisy speech signal are determined from the speech characteristic spectrum deviation and the speech characteristic flatness; an intermediate masking threshold is determined from the pure tone coefficients; and a target masking threshold is determined from the comparison result of a predetermined absolute masking threshold and the intermediate masking threshold. By dividing the noisy speech signal into several frequency bands and separately calculating the basic parameters, characteristic parameters, pure tone coefficient, and masking threshold of each band, the technical solution improves the accuracy of target masking threshold estimation, effectively enhances the noise suppression result, and improves the speech recognition effect.
Example two
Fig. 2 is a schematic structural diagram of a masking threshold estimation device in a second embodiment of the present invention, which is applicable to a case of estimating a masking threshold in a noisy environment. As shown in fig. 2, the apparatus includes:
a basic parameter obtaining module 210, configured to obtain an amplitude spectrum of a voice signal with noise, and obtain an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
a characteristic parameter determining module 220, configured to determine a voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determine a voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
a pure tone coefficient determining module 230, configured to determine pure tone coefficients of different frequency bands in the noisy speech signal according to the speech feature spectrum deviation and the speech feature flatness;
an intermediate masking threshold determining module 240, configured to determine an intermediate masking threshold according to the pure tone coefficient;
a target masking threshold determining module 250, configured to determine a target masking threshold according to a comparison result between a predetermined absolute masking threshold and the intermediate masking threshold.
According to the technical solution of this embodiment of the invention, the amplitude spectrum of a noisy speech signal and the amplitude spectrum of the noise signal contained in it are obtained, where the noisy speech signal comprises a clean speech signal and a noise signal; the speech characteristic spectrum deviation of the noisy speech signal is determined from the amplitude spectrum of the noisy speech signal and the amplitude spectrum of the noise signal, and the speech characteristic flatness is determined from the amplitude spectrum of the noisy speech signal; pure tone coefficients of different frequency bands in the noisy speech signal are determined from the speech characteristic spectrum deviation and the speech characteristic flatness; an intermediate masking threshold is determined from the pure tone coefficients; and a target masking threshold is determined from the comparison result of a predetermined absolute masking threshold and the intermediate masking threshold. By the technical solution provided in this embodiment of the invention, the accuracy of masking threshold estimation can be improved, the noise suppression result can be effectively enhanced, and the speech recognition effect improved.
Further, the characteristic parameter determining module 220 includes:
a voice characteristic spectrum deviation determining unit, configured to determine a voice characteristic spectrum deviation by using the following formula:
where i is the frequency band index, j is the frequency point index, N is the number of frequency points (0 ≤ j < N), Diff(i) is the speech characteristic spectrum deviation of the i-th band of the noisy speech signal, D_i(j) is the estimated amplitude spectrum of the j-th frequency point in the i-th band of the noise signal, Y_i(j) is the amplitude spectrum of the j-th frequency point in the i-th band of the noisy speech signal, D̄_i is the mean of the estimated amplitude spectrum of the i-th band of the noise signal, and Ȳ_i is the mean of the amplitude spectrum of the i-th band of the noisy speech signal;
a voice feature flatness determination unit, configured to determine a voice feature flatness by using the following formula:
where Flat(i) is the speech characteristic flatness of the i-th band of the noisy speech signal.
Further, the pure tone coefficient determining module 230 is specifically configured to:
where α(i) is the pure tone coefficient of the i-th band of the noisy speech signal, β_i is the first adjustment value, β_i ∈ [0, 1], and η_i is the second adjustment value, η_i ∈ [-70, -50].
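The patent's own formulas for the flatness and the pure tone coefficient are not reproduced in this text. As a hedged illustration only, a Johnston-style tonality coefficient built from the standard spectral flatness measure is sketched below, with η playing the role of the reference flatness in dB; the patent's exact combination with the spectrum deviation Diff(i) and the weight β_i is not shown here and this is not claimed to be it:

```python
import numpy as np

def spectral_flatness_db(power):
    """Standard spectral flatness in dB: 10*log10(geometric mean / arithmetic mean).
    Near 0 dB for noise-like (flat) spectra, strongly negative for tonal ones."""
    power = np.asarray(power, dtype=float)
    gm = np.exp(np.mean(np.log(power + 1e-12)))  # geometric mean (log-domain)
    am = np.mean(power) + 1e-12                  # arithmetic mean
    return 10.0 * np.log10(gm / am)

def tonality_coefficient(sfm_db, eta=-60.0):
    """Johnston-style alpha = min(SFM_dB / eta, 1): close to 1 for tone-like
    bands, close to 0 for noise-like bands. eta = -60 dB is an assumed value
    inside the patent's stated range [-70, -50]."""
    return min(sfm_db / eta, 1.0)
```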
Further, the intermediate masking threshold determining module 240 includes:
an intermediate masking threshold determining unit for determining the intermediate masking threshold using the following formula:
where T(i) is the intermediate masking threshold of the i-th band of the noisy speech signal, O(i) is the offset of the i-th band of the noisy speech signal from the intermediate masking threshold, T(i-1) is the masking threshold of the (i-1)-th band of the noisy speech signal, and λ_i is the third adjustment value, λ_i ∈ [0, 1].
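The recursion's exact formula is not reproduced in this text. One plausible reading, shown purely as an assumption, converts the spread critical band spectrum C(i) minus the offset O(i) from dB back to power (Johnston-style) and then smooths across bands with a weight λ on the previous band's threshold:

```python
import numpy as np

def intermediate_thresholds(C, O, lam=0.5):
    """Hypothetical sketch: raw threshold 10^((10*log10 C(i) - O(i)) / 10),
    then first-order smoothing T(i) = lam*T(i-1) + (1-lam)*T_raw(i).
    lam = 0.5 and the smoothing form itself are assumptions."""
    T_raw = 10.0 ** ((10.0 * np.log10(C + 1e-12) - O) / 10.0)
    T = np.empty_like(T_raw)
    T[0] = T_raw[0]
    for i in range(1, len(T_raw)):
        T[i] = lam * T[i - 1] + (1.0 - lam) * T_raw[i]
    return T
```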
Further, the intermediate masking threshold determining unit includes:
an offset from intermediate masking threshold determining subunit configured to determine the offset from intermediate masking threshold using the following equation:
O(i)=(α(i)-μ)*i+(3μ-1)*α(i)+μ;
where μ is the fourth adjustment value, μ ∈ [0.5, 6.5].
An expansion critical band spectrum determination subunit configured to determine the expansion critical band spectrum by using the following formula:
wherein C(i) is the spectrum value of the i-th expansion critical band, j is the frequency point index, N is the number of frequency points, P(j) is the power spectrum of the j-th frequency point of the noisy speech signal, and SP_ij is the spreading function of the j-th frequency point in the i-th frequency band of the noisy speech signal.
Further, the expansion critical band spectrum determining subunit is further configured to determine the expansion function of the jth frequency point in the ith frequency band of the noisy speech signal by using the following formula:
the masking threshold estimation device provided by the embodiment of the invention can execute the masking threshold estimation method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the masking threshold estimation method.
EXAMPLE III
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. As shown in fig. 3, the electronic device 300 includes one or more processors 320 and a storage device 310 configured to store one or more programs. When the one or more programs are executed by the one or more processors 320, the one or more processors 320 implement the masking threshold estimation method provided in the embodiments of the present application, the method including:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold.
Of course, those skilled in the art will understand that the processor 320 may also implement the technical solution of the masking threshold estimation method provided in any embodiment of the present application.
The electronic device 300 shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 3, the electronic device 300 includes a processor 320, a storage device 310, an input device 330, and an output device 340; the number of the processors 320 in the electronic device may be one or more, and one processor 320 is taken as an example in fig. 3; the processor 320, the storage device 310, the input device 330, and the output device 340 in the electronic apparatus may be connected by a bus or other means, and are exemplified by a bus 350 in fig. 3.
The storage device 310 is a computer-readable storage medium, and can be used to store software programs, computer-executable programs, and module units, such as program instructions corresponding to the masking threshold estimation method in the embodiment of the present application.
The storage device 310 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the storage device 310 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the storage device 310 may further include memory located remotely from the processor 320, which may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numbers, character information, or voice information, and to generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 340 may include a display screen, a speaker, and other electronic devices.
Example four
The fourth embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the masking threshold estimation method provided in the embodiments of the present invention, the method including:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to the comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The masking threshold estimation apparatus, electronic device, and storage medium provided in the above embodiments can execute the masking threshold estimation method provided in any embodiment of the present application, and have corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in the above embodiments, reference may be made to the masking threshold estimation method provided in any embodiment of the present application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A masking threshold estimation method, comprising:
acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
determining an intermediate masking threshold according to the pure tone coefficient;
and determining a target masking threshold according to the comparison result of the predetermined absolute masking threshold and the intermediate masking threshold.
2. The method of claim 1, wherein determining a deviation of a speech feature spectrum of the noisy speech signal from the magnitude spectrum of the noisy speech signal and the magnitude spectrum of the noise signal, and determining a flatness of the speech feature from the magnitude spectrum of the noisy speech signal, comprises:
determining the deviation of the speech characteristic spectrum by adopting the following formula:
where i is the frequency band index, j is the frequency point index, N is the number of frequency points (0 ≤ j < N), Diff(i) is the speech characteristic spectrum deviation of the i-th band of the noisy speech signal, D_i(j) is the estimated amplitude spectrum of the j-th frequency point in the i-th band of the noise signal, Y_i(j) is the amplitude spectrum of the j-th frequency point in the i-th band of the noisy speech signal, D̄_i is the mean of the estimated amplitude spectrum of the i-th band of the noise signal, and Ȳ_i is the mean of the amplitude spectrum of the i-th band of the noisy speech signal;
further comprising, determining the flatness of the speech features using the following formula:
where Flat(i) is the speech characteristic flatness of the i-th band of the noisy speech signal.
3. The method according to claim 1, wherein determining pure-tone coefficients of different frequency bands in the noisy speech signal according to the speech feature spectrum bias and the flatness of the speech feature comprises:
where α(i) is the pure tone coefficient of the i-th band of the noisy speech signal, β_i is the first adjustment value, β_i ∈ [0, 1], and η_i is the second adjustment value, η_i ∈ [-70, -50].
4. The method of claim 1, wherein determining an intermediate masking threshold based on the pure tone coefficients comprises:
where T(i) is the intermediate masking threshold of the i-th band of the noisy speech signal, O(i) is the offset of the i-th band of the noisy speech signal from the intermediate masking threshold, T(i-1) is the masking threshold of the (i-1)-th band of the noisy speech signal, and λ_i is the third adjustment value, λ_i ∈ [0, 1].
5. The method of claim 4, wherein the offset from the intermediate masking threshold is determined using the following equation:
O(i)=(α(i)-μ)*i+(3μ-1)*α(i)+μ;
where μ is the fourth adjustment value, μ ∈ [0.5, 6.5].
6. The method of claim 4, wherein the extended critical band spectrum is determined using the following formula:
wherein C(i) is the spectrum value of the i-th expansion critical band, j is the frequency point index, N is the number of frequency points, P(j) is the power spectrum of the j-th frequency point of the noisy speech signal, and SP_ij is the spreading function of the j-th frequency point in the i-th frequency band of the noisy speech signal.
8. A masking threshold estimating apparatus, comprising:
the basic parameter acquisition module is used for acquiring an amplitude spectrum of a voice signal with noise and acquiring an amplitude spectrum of a noise signal in the voice signal with noise; wherein the noisy speech signal comprises a clean speech signal and a noise signal;
the characteristic parameter determining module is used for determining the voice characteristic spectrum deviation of the voice signal with noise according to the amplitude spectrum of the voice signal with noise and the amplitude spectrum of the noise signal, and determining the voice characteristic flatness according to the amplitude spectrum of the voice signal with noise;
the pure tone coefficient determining module is used for determining pure tone coefficients of different frequency bands in the voice signal with noise according to the voice characteristic spectrum deviation and the voice characteristic flatness;
the intermediate masking threshold determining module is used for determining an intermediate masking threshold according to the pure tone coefficient;
and the target masking threshold determining module is used for determining a target masking threshold according to a comparison result of a predetermined absolute masking threshold and the intermediate masking threshold.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the masking threshold estimation method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the masking threshold estimation method as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111250359.0A CN113851151A (en) | 2021-10-26 | 2021-10-26 | Masking threshold estimation method, device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111250359.0A CN113851151A (en) | 2021-10-26 | 2021-10-26 | Masking threshold estimation method, device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113851151A true CN113851151A (en) | 2021-12-28 |
Family
ID=78983130
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111250359.0A Pending CN113851151A (en) | 2021-10-26 | 2021-10-26 | Masking threshold estimation method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113851151A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114241800A (en) * | 2022-02-28 | 2022-03-25 | 天津市北海通信技术有限公司 | Intelligent stop reporting auxiliary system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005107448A (en) * | 2003-10-02 | 2005-04-21 | Nippon Telegr & Teleph Corp <Ntt> | Noise reduction processing method, and device, program, and recording medium for implementing same method |
US20130054178A1 (en) * | 2011-08-30 | 2013-02-28 | Ou Eliko Tehnoloogia Arenduskeskus | Method and device for broadband analysis of systems and substances |
CN103021420A (en) * | 2012-12-04 | 2013-04-03 | 中国科学院自动化研究所 | Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation |
CN107705801A (en) * | 2016-08-05 | 2018-02-16 | 中国科学院自动化研究所 | The training method and Speech bandwidth extension method of Speech bandwidth extension model |
CN110310656A (en) * | 2019-05-27 | 2019-10-08 | 重庆高开清芯科技产业发展有限公司 | A kind of sound enhancement method |
WO2021114733A1 (en) * | 2019-12-10 | 2021-06-17 | 展讯通信(上海)有限公司 | Noise suppression method for processing at different frequency bands, and system thereof |
- 2021-10-26 CN CN202111250359.0A patent/CN113851151A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2005107448A (en) * | 2003-10-02 | 2005-04-21 | Nippon Telegr & Teleph Corp <Ntt> | Noise reduction processing method, and device, program, and recording medium for implementing same method |
US20130054178A1 (en) * | 2011-08-30 | 2013-02-28 | Ou Eliko Tehnoloogia Arenduskeskus | Method and device for broadband analysis of systems and substances |
CN103021420A (en) * | 2012-12-04 | 2013-04-03 | 中国科学院自动化研究所 | Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation |
CN107705801A (en) * | 2016-08-05 | 2018-02-16 | 中国科学院自动化研究所 | The training method and Speech bandwidth extension method of Speech bandwidth extension model |
CN110310656A (en) * | 2019-05-27 | 2019-10-08 | 重庆高开清芯科技产业发展有限公司 | A kind of sound enhancement method |
WO2021114733A1 (en) * | 2019-12-10 | 2021-06-17 | 展讯通信(上海)有限公司 | Noise suppression method for processing at different frequency bands, and system thereof |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114241800A (en) * | 2022-02-28 | 2022-03-25 | 天津市北海通信技术有限公司 | Intelligent stop reporting auxiliary system |
CN114241800B (en) * | 2022-02-28 | 2022-05-27 | 天津市北海通信技术有限公司 | Intelligent stop reporting auxiliary system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6889698B2 (en) | Methods and devices for amplifying audio | |
CN109410975B (en) | Voice noise reduction method, device and storage medium | |
US8712074B2 (en) | Noise spectrum tracking in noisy acoustical signals | |
EP3526979B1 (en) | Method and apparatus for output signal equalization between microphones | |
US11069366B2 (en) | Method and device for evaluating performance of speech enhancement algorithm, and computer-readable storage medium | |
CN109036460B (en) | Voice processing method and device based on multi-model neural network | |
CN110634497A (en) | Noise reduction method and device, terminal equipment and storage medium | |
TW201448616A (en) | Method and apparatus for determining directions of uncorrelated sound sources in a Higher Order Ambisonics representation of a sound field | |
US10531178B2 (en) | Annoyance noise suppression | |
CN113766073A (en) | Howling detection in a conferencing system | |
CN112485761B (en) | Sound source positioning method based on double microphones | |
WO2016119388A1 (en) | Method and device for constructing focus covariance matrix on the basis of voice signal | |
Kim et al. | Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain. | |
CN109997186B (en) | Apparatus and method for classifying acoustic environments | |
CN110634508A (en) | Music classifier, related method and hearing aid | |
CN113851151A (en) | Masking threshold estimation method, device, electronic equipment and storage medium | |
CN113949955A (en) | Noise reduction processing method and device, electronic equipment, earphone and storage medium | |
Satoh et al. | Ambient sound-based proximity detection with smartphones | |
TW201503116A (en) | Method for using voiceprint identification to operate voice recoginition and electronic device thereof | |
CN112513977A (en) | Signal processing device and method, and program | |
CN110169082A (en) | Combining audio signals output | |
CN111383629B (en) | Voice processing method and device, electronic equipment and storage medium | |
CN112562717B (en) | Howling detection method and device, storage medium and computer equipment | |
CN115410593A (en) | Audio channel selection method, device, equipment and storage medium | |
Nabi et al. | An improved speech enhancement algorithm for dual-channel mobile phones using wavelet and genetic algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||