CN112700789B

CN112700789B - Noise detection method, nonvolatile readable storage medium and electronic device

Info

Publication number: CN112700789B
Application number: CN202110310614.XA
Authority: CN
Inventors: 阎张懿; 林锦鸿; 梁明亮; 汪震
Original assignee: Shenzhen Zhongke Lanxun Technology Co ltd
Current assignee: Shenzhen Zhongke Lanxun Technology Co ltd
Priority date: 2021-03-24
Filing date: 2021-03-24
Publication date: 2021-06-25
Anticipated expiration: 2041-03-24
Also published as: CN112700789A

Abstract

The invention relates to the technical field of noise detection and discloses a noise detection method, a nonvolatile readable storage medium and an electronic device. The noise detection method comprises: acquiring a target speech frame, extracting multiple types of speech features from the target speech frame, and detecting, according to the multiple types of speech features, whether the target speech frame contains a noise signal. The method therefore judges whether the target speech frame contains a noise signal in a multi-dimensional manner. This avoids the false detections and missed detections that occur with single-dimension judgment, and thereby improves the accuracy and reliability of noise detection.

Description

Noise detection method, nonvolatile readable storage medium and electronic device

Technical Field

The invention relates to the technical field of noise detection, in particular to a noise detection method, a nonvolatile readable storage medium and electronic equipment.

Background

Electronic devices increasingly provide noise reduction, and the accuracy of noise detection is an important measure of their noise reduction quality. Conventional noise detection methods usually detect noise using a single feature. Noise types vary widely (for example, low-frequency, mid-frequency or high-frequency noise), and speech signals also vary widely. As a result, an electronic device that relies on a single feature to detect a given type of noise cannot reliably and accurately distinguish noise from speech.

Disclosure of Invention

An object of the embodiments of the present invention is to provide a noise detection method, a non-volatile readable storage medium, and an electronic device, which can improve the accuracy of noise detection.

In a first aspect, an embodiment of the present invention provides a noise detection method, including:

acquiring a target voice frame;

extracting multiple types of voice features according to the target voice frame;

and detecting, according to the multiple types of speech features, whether the target speech frame contains a noise signal.

Optionally, the detecting whether the target speech frame contains a noise signal according to the plurality of types of speech features includes:

determining a noise probability that each type of the voice features belongs to a noise feature;

and detecting, according to the noise probabilities of the multiple types of speech features, whether the target speech frame contains a noise signal.

Optionally, the detecting whether the target speech frame contains a noise signal according to the noise probabilities of the multiple classes of speech features includes:

calculating a weighted value of each type of voice feature according to the noise probability of each type of voice feature and a preset weight corresponding to the noise probability;

summing the weighted values of all types of speech features to obtain a total weighted value;

and detecting, according to the total weighted value and a first preset noise threshold, whether the target speech frame contains a noise signal.

Optionally, the detecting whether the target speech frame includes a noise signal according to the total weighted value and the first preset noise threshold includes:

judging whether the total weighted value is greater than the first preset noise threshold value;

and if so, determining that the target speech frame belongs to the type determined to contain a noise signal.

Optionally, the detecting whether the target speech frame includes a noise signal according to the total weighted value and the first preset noise threshold further includes:

if the total weighted value is smaller than the first preset noise threshold value, judging whether the total weighted value is larger than a second preset noise threshold value, wherein the second preset noise threshold value is smaller than the first preset noise threshold value;

if so, determining that the target voice frame belongs to a type possibly containing a noise signal;

and if not, determining that the target speech frame belongs to the noise-free type.

Optionally, the noise feature comprises a subband centroid value feature and/or a spectral template combination feature and/or a negative slope fitting feature, and the determining the noise probability that each type of the speech feature belongs to the noise feature comprises:

calculating, according to a subband centroid value algorithm, the centroid value of the noise frequency range of the target speech frame, and normalizing the centroid value to obtain the noise probability that the speech feature belongs to the subband centroid value feature; and/or,

calculating, according to a spectrum template combination algorithm, the degree of difference between the target speech frame and a preset speech frame template, and normalizing the degree of difference to obtain the noise probability that the speech feature belongs to the spectrum template combination feature; and/or,

calculating, according to a negative slope fitting algorithm, the error between the magnitude spectrum of the target speech frame and a linearly approximated magnitude spectrum, and normalizing the error to obtain the noise probability that the speech feature belongs to the negative slope fitting feature.

Optionally, the method further comprises:

acquiring a current noise detection state;

selecting a noise detection path according to the current noise detection state;

and under the selected noise detection path, performing a corresponding operation according to the result of detecting whether the target speech frame contains a noise signal.

Optionally, the current noise detection state includes a noise determination state, a noise possible state, and a noise-free state, and selecting a noise detection path according to the current noise detection state includes:

when the current noise detection state is a noise possible state or a noise-free state, selecting a first noise detection path;

and when the current noise detection state is a noise determination state, selecting a second noise detection path.

Optionally, the performing, in the noise detection path, a corresponding operation according to a detection result of whether the target speech frame includes a noise signal includes:

under the first noise detection path:

when the detection result is that the target speech frame belongs to the type determined to contain a noise signal: incrementing the consecutive frame number by a preset value, updating the current noise detection state to the noise determination state, and executing a first operation according to the incremented consecutive frame number and a preset frame number threshold, wherein the consecutive frame number is the number of temporally consecutive speech frames that contain a noise signal; and/or,

when the detection result is that the target speech frame belongs to the type possibly containing a noise signal: incrementing the consecutive frame number by the preset value, setting the current noise detection state to the noise possible state, and executing the first operation according to the incremented consecutive frame number and the preset frame number threshold; and/or,

when the detection result is that the target speech frame belongs to the noise-free type: resetting the consecutive frame number, and setting the current noise detection state to the noise-free state;

under the second noise detection path:

when the detection result is that the target speech frame belongs to the type determined to contain a noise signal: incrementing the consecutive frame number by the preset value, and executing the first operation according to the incremented consecutive frame number and the preset frame number threshold, wherein the consecutive frame number is the number of temporally consecutive speech frames that contain a noise signal; and/or,

when the detection result is that the target speech frame does not belong to the type determined to contain a noise signal: executing a second operation according to the consecutive frame number and the preset frame number threshold.

Optionally, the executing the first operation according to the accumulated consecutive frame number and the preset frame number threshold includes:

judging whether the accumulated continuous frame number is greater than the preset frame number threshold value or not;

if yes, executing noise reduction operation;

if not, returning to the step of obtaining the target voice frame.

Optionally, the executing the second operation according to the continuous frame number and the preset frame number threshold includes:

judging whether the continuous frame number is larger than the preset frame number threshold value or not;

if so, performing an intermittent noise determination operation;

and if not, resetting the continuous frame number.

Optionally, the performing intermittent noise determination operation includes:

starting from the target speech frame, traversing backward to the most recent historical speech frame that contains a noise signal, wherein all intermediate speech frames between the target speech frame and the historical speech frame are speech frames that do not contain a noise signal;

counting the intermediate speech frames in increments of a preset value to obtain an accumulated frame number;

judging whether the accumulated frame number is less than an interval frame number threshold;

if yes, executing a third operation;

and if not, resetting the continuous frame number and the accumulated frame number.

Optionally, the performing the third operation includes:

accumulating a preset value for the continuous frame number;

and executing a first operation according to the accumulated continuous frame number and a preset frame number threshold.

Optionally, the method further comprises: when the detection result indicates that the target speech frame belongs to the type determined to contain a noise signal, clearing the accumulated frame number.
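The counting rules above (the first, second and third operations, plus the intermittent-noise check) can be sketched as a small state machine. This is an illustrative reading of the claims, not the authors' code. The following are all assumptions: the class name `FrameCounter`, the default thresholds, the string results ("denoise" for the noise reduction operation, "next" for returning to acquire the next frame), and the choice to fall back to the noise-free state whenever the consecutive frame number is reset on the second path.

```python
from enum import Enum


class FrameType(Enum):
    NOISE = 1            # determined to contain a noise signal
    POSSIBLE_NOISE = 2   # possibly contains a noise signal
    NO_NOISE = 3         # noise-free


class State(Enum):
    NOISE_DETERMINED = 1
    NOISE_POSSIBLE = 2
    NOISE_FREE = 3


class FrameCounter:
    """Hypothetical sketch of the consecutive-frame and intermittent-noise logic."""

    def __init__(self, frame_threshold=5, gap_threshold=3, step=1):
        self.frame_threshold = frame_threshold   # preset frame number threshold
        self.gap_threshold = gap_threshold       # interval frame number threshold
        self.step = step                         # preset increment value
        self.run = 0                             # consecutive frame number
        self.gap = 0                             # accumulated frame number
        self.state = State.NOISE_FREE            # default state at start-up
        self.history = []                        # True where a frame was NOISE

    def _first_op(self):
        # Noise reduction once the run exceeds the threshold, else fetch next frame.
        return "denoise" if self.run > self.frame_threshold else "next"

    def _reset(self):
        # Assumption: leave the noise determination state once the run is reset.
        self.run = 0
        self.gap = 0
        self.state = State.NOISE_FREE
        return "next"

    def _intermittent(self):
        # Walk backward from the target frame to the most recent NOISE frame,
        # counting the non-noise frames strictly between the two.
        self.gap = 0
        for was_noise in reversed(self.history[:-1]):
            if was_noise:
                break
            self.gap += self.step
        if self.gap < self.gap_threshold:
            self.run += self.step                # third operation
            return self._first_op()
        return self._reset()

    def _second_op(self):
        if self.run > self.frame_threshold:
            return self._intermittent()
        return self._reset()

    def process(self, result):
        self.history.append(result is FrameType.NOISE)
        if result is FrameType.NOISE:
            self.gap = 0                         # clear the accumulated frame number
        if self.state is State.NOISE_DETERMINED:  # second noise detection path
            if result is FrameType.NOISE:
                self.run += self.step
                return self._first_op()
            return self._second_op()
        if result is FrameType.NO_NOISE:         # first noise detection path
            self.run = 0
            self.state = State.NOISE_FREE
            return "next"
        self.run += self.step
        self.state = (State.NOISE_DETERMINED if result is FrameType.NOISE
                      else State.NOISE_POSSIBLE)
        return self._first_op()
```

The `history` list grows without bound in this sketch. A practical version would keep only as many past flags as the interval frame number threshold requires.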

Optionally, the performing the noise reduction operation includes:

determining the noise magnitude according to the centroid value of the noise frequency range in the target speech frame;

and performing the noise reduction operation according to the noise magnitude.

Optionally, the method further comprises:

when detecting that the target voice frame contains a noise signal, accumulating a preset value for a continuous frame number, and executing a first operation according to the accumulated continuous frame number and a preset frame number threshold, wherein the continuous frame number is the frame number of the voice frame which is continuous in time and contains the noise signal;

and when detecting that the target voice frame does not contain a noise signal, executing a second operation according to the continuous frame number and a preset frame number threshold.

Optionally, before extracting the multi-class speech features, the method further comprises:

preliminarily judging whether the target voice frame contains a noise signal or not;

if yes, entering a step of extracting multi-class voice features according to the target voice frame;

if not, returning to the step of obtaining the target voice frame.

Optionally, the preliminarily judging whether the target speech frame contains a noise signal includes:

calculating the logarithm of the power of each frequency point in the target voice frame;

obtaining a first sum of the logarithms over all frequency points and a second sum of the logarithms over the frequency points within the noise frequency band;

calculating a ratio of the second sum to the first sum;

and judging whether the ratio is larger than a third preset noise threshold value.
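The preliminary check in the four steps above can be sketched as follows. This is a minimal Python/NumPy illustration under stated assumptions, not the patented implementation. The noise band (0 to 500 Hz), the third preset noise threshold (0.6) and the function name `preliminary_noise_check` are hypothetical. The sketch also uses `log10(1 + P)` instead of a bare logarithm, so that both sums are non-negative and the ratio stays between 0 and 1.

```python
import numpy as np


def preliminary_noise_check(frame, fs, band=(0.0, 500.0), threshold=0.6):
    """Pre-screen a frame by the share of log-power that falls in the noise band.

    `band` (noise frequency band in Hz) and `threshold` (the "third preset
    noise threshold") are illustrative values, not taken from the source.
    """
    power = np.abs(np.fft.rfft(frame)) ** 2          # power of each frequency point
    # Assumption: log10(1 + P) keeps every term non-negative; the source
    # does not specify how the logarithm handles small powers.
    log_power = np.log10(1.0 + power)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])

    first_sum = log_power.sum()                      # all frequency points
    second_sum = log_power[in_band].sum()            # noise band only
    ratio = second_sum / first_sum if first_sum > 0 else 0.0
    return bool(ratio > threshold), float(ratio)
```

With these settings, a frame dominated by a low-frequency tone yields a ratio near 1, while a frame dominated by a high-frequency tone yields a ratio near 0.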

Optionally, the noise signal is wind noise.

In a second aspect, an embodiment of the present invention provides a non-volatile readable storage medium storing computer-executable instructions for causing an electronic device to perform the noise detection method described above.

In a third aspect, embodiments of the present invention provide a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by an electronic device, cause the electronic device to perform the noise detection method described above.

In a fourth aspect, an embodiment of the present invention provides an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the noise detection method described above.

Compared with the prior art, the invention has at least the following beneficial effects. In the noise detection method provided by the embodiments of the invention, the target speech frame is first acquired, multiple types of speech features are then extracted from it, and finally whether the target speech frame contains a noise signal is detected according to these features. The method therefore judges whether the target speech frame contains a noise signal in a multi-dimensional manner, avoids the false detections and missed detections that occur with single-dimension judgment, and improves the accuracy and reliability of noise detection.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals denote similar elements. Unless otherwise stated, the figures are not drawn to scale.

Fig. 1 is a schematic block diagram of the circuit of an earphone according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of a noise detection method according to an embodiment of the present invention;

Fig. 3a is a schematic flow chart of S23 shown in Fig. 2;

Fig. 3b is a schematic spectrum diagram of various types of wind noise according to an embodiment of the present invention;

Fig. 3c is a schematic flow chart of S232 shown in Fig. 3a;

Fig. 3d is a schematic flow chart of S2323 shown in Fig. 3c;

Fig. 4a is a schematic flow chart of a noise detection method according to another embodiment of the present invention;

Fig. 4b is a schematic diagram of the speech frames on the time axis according to an embodiment of the present invention;

Fig. 5a is a schematic flow chart of a noise detection method according to still another embodiment of the present invention;

Fig. 5b is a schematic diagram of the speech frames on the time axis according to another embodiment of the present invention;

Fig. 6a is a schematic flow chart of a noise detection method according to still another embodiment of the present invention;

Fig. 6b is a schematic flow chart of S27 shown in Fig. 6a;

Fig. 6c is a diagram illustrating the simulated noise detection performance according to an embodiment of the present invention;

Fig. 7a is a schematic structural diagram of a noise detection apparatus according to an embodiment of the present invention;

Fig. 7b is a schematic structural diagram of the noise detection module shown in Fig. 7a;

Fig. 7c is a schematic structural diagram of a noise detection apparatus according to another embodiment of the present invention;

Fig. 7d is a schematic structural diagram of a noise detection apparatus according to still another embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a noise detection apparatus according to yet another embodiment of the present invention;

Fig. 9 is a schematic circuit structure diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that, provided they do not conflict, the various features of the embodiments of the invention may be combined with each other within the protection scope of the invention. In addition, the apparatus schematics divide the apparatus into functional modules and the flowcharts show logical sequences. In some cases, however, the steps shown or described may be performed with a module division different from that in the apparatus, or in an order different from that in the flowcharts. The terms "first", "second", "third" and the like used in the present invention do not limit data or execution order. They only distinguish identical or similar items having substantially the same function and effect.

The noise detection method provided herein is applicable to any suitable type of electronic device, such as a headset, a mobile phone, a smart watch, a tablet, a calling set or a speaker. When the electronic device is a headset, it may be an in-ear earphone, a headphone, or the like.

Referring to fig. 1, the earphone 100 includes a transducer 11, an ADC converter 12, a sampling rate converter 13, a controller 14, and a multiplier 15.

The transducer 11 is used for collecting a sound signal, which may be a noise signal or a voice signal, wherein the voice signal may be emitted by a user or other audio source device, and the transducer 11 may be any suitable acousto-electric transducer device, such as a microphone.

The ADC converter 12 converts the sound signal into a digital signal, and the sampling rate converter 13 resamples the digital signal at a preset sampling rate. The controller 14 detects, according to a noise detection algorithm, whether the sampled digital signal contains a noise signal. If it does, the controller 14 processes the sampled digital signal according to a noise reduction algorithm to obtain a noise cancellation signal, and the multiplier 15 multiplies the noise cancellation signal by the sampled digital signal to obtain a noise-reduced signal.

In addition to the noise reduction architectures described herein, those skilled in the art may also develop other alternative noise reduction architectures in accordance with the teachings disclosed herein.

As another aspect of the embodiments of the present invention, an embodiment of the present invention provides a noise detection method. Referring to fig. 2, the noise detection method S200 includes:

S21, acquiring a target speech frame;

In this embodiment, the target speech frame is the speech frame currently to be processed. Speech frames are obtained by framing and windowing the speech signal, and each speech frame contains a normal speech signal, a noise signal, or a mixture of the two. Any suitable window function may be used, such as a Hanning window, a triangular window or a rectangular window. It is understood that the noise signal may lie in the same frequency band as the normal speech signal or in a different one.
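A minimal NumPy sketch of framing and windowing follows. The frame length (256 samples), hop size (128 samples) and the function name `frame_signal` are illustrative assumptions. `np.hanning` and `np.bartlett` supply the Hanning and triangular windows mentioned above.

```python
import numpy as np


def frame_signal(x, frame_len=256, hop=128, window="hann"):
    """Split a 1-D signal into overlapping, windowed frames.

    Frame length, hop size and window names are illustrative choices.
    Returns an array of shape (n_frames, frame_len).
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds samples [i * hop, i * hop + frame_len).
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    frames = x[idx]
    windows = {
        "hann": np.hanning(frame_len),       # Hanning window
        "triangular": np.bartlett(frame_len),  # triangular (Bartlett) window
        "rectangular": np.ones(frame_len),   # no tapering
    }
    return frames * windows[window]
```

Each row of the returned array can then serve as one target speech frame for the steps that follow.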

S22, extracting multiple types of voice features according to the target voice frame;

In this embodiment, speech features characterize the target speech frame and can be used to discriminate whether it contains a noise signal. The electronic device can extract speech features of different dimensions from the target speech frame using different feature extraction algorithms, thereby obtaining multiple types of speech features.

S23, detecting whether the target speech frame contains a noise signal according to the multiple types of speech features.

In this embodiment, the electronic device may determine whether the target speech frame contains a noise signal according to any suitable rule and by combining with multiple types of speech features.

The embodiment can therefore judge whether the target speech frame contains a noise signal in a multi-dimensional manner. This avoids the false detections and missed detections that occur with single-dimension judgment, thereby improving the accuracy and reliability of noise detection.

Different speech features generally differ in how accurately they indicate whether the target speech frame contains a noise signal. Therefore, in some embodiments, the electronic device comprehensively determines whether the target speech frame contains a noise signal according to the confidence of each speech feature. Referring to Fig. 3a, S23 includes:

S231, determining the noise probability that each type of speech feature belongs to a noise feature;

S232, detecting whether the target speech frame contains a noise signal according to the noise probabilities of the multiple types of speech features.

In this embodiment, the noise features may be any suitable type of feature, such as the subband centroid value feature and/or the spectrum template combination feature and/or the negative slope fitting feature. In general, low-frequency noise such as wind noise, pink noise and brown noise exhibits these noise features.

In some embodiments, the noise lies in a low frequency band; for example, the noise is wind noise, pink noise or brown noise. Wind noise is a particular kind of noise produced by eddies that wind forms at the microphone, and it strongly degrades speech quality. Wind noise is usually sudden. Depending on wind strength, each burst lasts from a few to several hundred milliseconds, and the intervals between bursts are random. Its low-frequency energy is high, and it is highly non-stationary with abrupt short-time changes.

In this embodiment, the noise probability is used to indicate the probability that the target speech frame contains noise signals, and in the target speech frame, the probability that each type of speech feature belongs to the noise feature may be the same or different, for example, the probability that speech feature a belongs to the noise feature is 60%, the probability that speech feature B belongs to the noise feature is 70%, and the probability that speech feature C belongs to the noise feature is 60%.

In this embodiment, after obtaining the noise probabilities of the multiple types of speech features, the electronic device may process the noise probabilities of the multiple types of speech features in combination with any suitable rule, so as to detect whether the target speech frame contains a noise signal.

Therefore, the method fully considers and combines how accurately each speech feature indicates whether the target speech frame contains a noise signal, so that this can be judged in a multi-dimensional, reliable and accurate manner.

In the following, the principle of computing the noise probability that a speech feature belongs to a noise feature is explained in detail, taking wind noise as an example. It should be understood that the following explanation does not limit the scope of the present invention:

In some embodiments, when the noise feature is the Sub-band centroid value feature (SSC), the electronic device calculates the centroid value of the noise frequency range of the target speech frame according to a subband centroid value algorithm. It then normalizes the centroid value to obtain the noise probability that the speech feature belongs to the subband centroid value feature.

For example, the subband centroid value is a frequency weighted average of energy in a certain frequency range, and reflects information of frequency distribution and energy distribution of the speech signal. Herein, the frequency range selected in the subband centroid value algorithm is a range in which noise energy is concentrated, and when the noise is low-frequency noise such as wind noise, the frequency range selected in the subband centroid value algorithm is a low-frequency range. The centroid is calculated as follows:

where μ denotes the frequency point index, λ the frame index, fs the sampling rate, and M the frame length.

The smoothed power spectrum of the frame's frequency-domain signal is as follows:

where α is a smoothing factor ranging from 0 to 1. The calculated centroid is then normalized:
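Because the equations themselves are not reproduced here, the sketch below assumes the standard sub-band spectral centroid. Under that assumption, bin μ has frequency μ·fs/M, and the power spectrum is smoothed recursively as P(μ, λ) = α·P(μ, λ−1) + (1 − α)·|X(μ, λ)|². The centroid is then Σ f(μ)·P(μ, λ) / Σ P(μ, λ) over the noise band. The band (0 to 1000 Hz), α = 0.7 and the class name `SubbandCentroid` are hypothetical. So, in particular, is the linear normalization, in which a lower centroid gives a higher wind probability; it is not the patent's formula.

```python
import numpy as np


class SubbandCentroid:
    """Sketch of a sub-band spectral centroid tracker (hypothetical class)."""

    def __init__(self, fs, frame_len, band=(0.0, 1000.0), alpha=0.7):
        self.alpha = alpha                # smoothing factor, 0 < alpha < 1
        self.band = band                  # assumed noise frequency range, Hz
        self.frame_len = frame_len        # M, the frame length
        # Frequency of bin mu: f(mu) = mu * fs / M
        self.freqs = np.arange(frame_len // 2 + 1) * fs / frame_len
        self.mask = (self.freqs >= band[0]) & (self.freqs <= band[1])
        self.smoothed = None              # recursively smoothed power spectrum

    def update(self, frame):
        """Return (normalized wind probability, centroid in Hz) for one frame."""
        power = np.abs(np.fft.rfft(frame, n=self.frame_len)) ** 2
        if self.smoothed is None:
            self.smoothed = power
        else:
            # Assumed form: P(mu, lam) = alpha * P(mu, lam-1) + (1 - alpha) * |X|^2
            self.smoothed = self.alpha * self.smoothed + (1.0 - self.alpha) * power
        p = self.smoothed[self.mask]
        total = p.sum()
        if total <= 0.0:
            return 0.0, float("nan")
        centroid = float((self.freqs[self.mask] * p).sum() / total)
        lo, hi = self.band
        # Hypothetical normalization: lower centroid in the band -> higher probability.
        prob = float(np.clip(1.0 - (centroid - lo) / (hi - lo), 0.0, 1.0))
        return prob, centroid
```

One tracker instance is meant to be fed consecutive frames, so that the smoothing carries over from frame to frame.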

In some embodiments, when the noise feature is the Spectrum Template Combination feature (STC), the electronic device calculates the degree of difference between the target speech frame and a preset speech frame template according to a spectrum template combination algorithm. It then normalizes the degree of difference to obtain the noise probability that the speech feature belongs to the spectrum template combination feature.

For example, consider the magnitude spectrum of the noisy speech. Since wind noise is additive noise, the estimated noisy speech magnitude spectrum can be regarded as a combination of a clean speech spectrum template and a pure wind noise spectrum template, as follows:

where the clean speech spectrum template uses the long-term speech magnitude spectrum formula defined in the ITU-T P.50 standard [1]:

and the pure wind noise spectrum template can use the magnitude spectrum calculated from actually recorded pure wind noise. The mean square error between the actual noisy speech magnitude spectrum and the estimated noisy speech magnitude spectrum is then calculated:

This error clearly has a minimum; setting the derivative of the above equation to 0 simplifies it to:

Then, according to the above formula, the larger the resulting value, the smaller the probability that wind noise is present. To comply with the normalization rule, the normalization is as follows:
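The template-combination idea can be sketched roughly as follows, under explicit assumptions. The combination weights are found by ordinary least squares, which minimizes the mean square error described above. Negative weights are clipped to zero because both components are additive. The wind probability is taken as the wind template's share of the fitted spectrum, so a larger speech-template contribution gives a smaller probability. The function name is hypothetical, and the test templates are placeholders rather than the ITU-T P.50 long-term spectrum or recorded wind noise.

```python
import numpy as np


def stc_noise_probability(noisy_mag, speech_template, wind_template):
    """Fit noisy_mag ~= a * speech_template + b * wind_template by least squares.

    Returns (wind probability, fitting MSE). The share-based normalization
    below is an illustrative assumption, not the patent's formula.
    """
    A = np.column_stack([speech_template, wind_template])
    coeffs, *_ = np.linalg.lstsq(A, noisy_mag, rcond=None)   # minimizes squared error
    a, b = np.maximum(coeffs, 0.0)                            # additive components only
    mse = float(np.mean((noisy_mag - A @ np.array([a, b])) ** 2))
    speech_part = a * np.linalg.norm(speech_template)
    wind_part = b * np.linalg.norm(wind_template)
    total = speech_part + wind_part
    # The larger the speech-template contribution, the smaller the wind probability.
    prob = float(wind_part / total) if total > 0 else 0.0
    return prob, mse
```

In practice, `speech_template` would come from the ITU-T P.50 long-term speech spectrum and `wind_template` from recorded wind noise. Both would be sampled at the same frequency points as `noisy_mag`.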

In some embodiments, when the noise feature is the Negative Slope Fit feature (NSF), the electronic device calculates, according to a negative slope fitting algorithm, the error between the magnitude spectrum of the target speech frame and a linearly approximated magnitude spectrum. It then normalizes the error to obtain the noise probability that the speech feature belongs to the negative slope fitting feature.

For example, the spectrum of wind noise is approximately described by an expression that decreases with frequency, where f is the frequency. As shown in Fig. 3b, the wind noise amplitude decreases as the frequency increases; that is, the slope of the magnitude spectrum curve is negative. The wind noise magnitude spectrum can therefore be approximated linearly with a set of parameters:

where the two parameters control the slope and the DC component of the approximated magnitude spectrum, respectively. For convenience, they are expressed as a vector:

and a set of vectors is formed:

so that the linear approximation can be written compactly as:

The minimum mean square error criterion is then used to obtain the error between the real signal magnitude spectrum and the approximated magnitude spectrum:

Solving the above equation yields a set of optimal parameters:

If the frame signal corresponds to wind noise, the slope should be negative, and the fitting error should be less than a certain threshold, set here to not exceed 50% of the total. The above parameters are then normalized to correspond to the wind noise probability:
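The negative slope fit can be sketched with an ordinary least-squares line fit, which gives the minimum mean square error values of a slope and a DC term. Two choices here are assumptions made for illustration: the 50% limit is read as a bound on the relative squared fitting error, and that error is mapped linearly to a probability. The function name is hypothetical.

```python
import numpy as np


def nsf_noise_probability(mag, freqs, max_rel_error=0.5):
    """Fit mag ~= k * f + c by least squares and map the fit to a probability.

    `max_rel_error` mirrors the 50% limit in the text, read here (as an
    assumption) as a bound on the relative squared fitting error.
    Returns (probability, slope k, relative error).
    """
    A = np.column_stack([freqs, np.ones_like(freqs)])     # columns: slope, DC term
    (k, c), *_ = np.linalg.lstsq(A, mag, rcond=None)      # MMSE line fit
    residual = mag - (k * freqs + c)
    energy = float(np.sum(mag ** 2))
    rel_error = float(np.sum(residual ** 2)) / energy if energy > 0 else 1.0
    if k >= 0:                                            # wind noise needs a negative slope
        return 0.0, float(k), rel_error
    # Hypothetical linear normalization: zero error -> 1, error at the limit -> 0.
    prob = float(np.clip(1.0 - rel_error / max_rel_error, 0.0, 1.0))
    return prob, float(k), rel_error
```

A spectrum that falls in a straight line gives a probability near 1. A rising spectrum gives 0 regardless of how well it fits.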

In some embodiments, when the noise probabilities of the multiple types of speech features are combined for the determination, the electronic device may use a weighting algorithm. Referring to Fig. 3c, S232 includes:

S2321, calculating a weighted value of each type of speech feature according to its noise probability and the corresponding preset weight;

S2322, summing the weighted values of all types of speech features to obtain a total weighted value;

S2323, detecting whether the target speech frame includes a noise signal according to the total weighted value and a first preset noise threshold.

In some embodiments, the total weighted value is $P = w_1 P_{SSC} + w_2 P_{STC} + w_3 P_{NSF}$. Here $P_{SSC}$, $P_{STC}$ and $P_{NSF}$ are the noise probabilities of the three features. $w_1$ is the preset weight of the subband centroid value feature, $w_2$ is the preset weight of the spectrum template combination feature, and $w_3$ is the preset weight of the negative slope fitting feature. The preset weights may be equal, or may be user-defined according to business rules.

In some embodiments, referring to fig. 3d, S2323 includes:

S2324, judging whether the total weighted value is greater than the first preset noise threshold;

S2325, if so, determining that the target speech frame belongs to the type determined to contain a noise signal.

In this embodiment, the "type determined to contain a noise signal" means that the target speech frame definitely contains a noise signal.

In some embodiments, when the noise characteristics are obvious and easy to judge, the target speech frame may be directly determined to belong to the noise-free type if the total weighted value is less than or equal to the first preset noise threshold.

In other embodiments, noise signals vary considerably. The electronic device may therefore reliably determine that the target speech frame definitely contains, or definitely does not contain, a noise signal, and it may also determine that the frame very probably contains one. To account for this intermediate case and support high-quality noise reduction later, the electronic device distinguishes all of these cases. Therefore, with continued reference to Fig. 3d, in some embodiments S2323 further includes:

S2326, if not, judging whether the total weighted value is greater than a second preset noise threshold, where the second preset noise threshold is smaller than the first preset noise threshold; if so, executing S2327, and if not, executing S2328;

S2327, determining that the target speech frame belongs to the type possibly containing a noise signal;

S2328, determining that the target speech frame belongs to the noise-free type.

In this embodiment, the "type possibly containing a noise signal" means that the target speech frame very probably contains a noise signal. The "noise-free type" means that the target speech frame definitely does not contain a noise signal.

In this embodiment, the first and second preset noise thresholds are set by the user according to application requirements, subject to the condition stated above that the second preset noise threshold is smaller than the first.
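The three-way decision described above can be sketched as follows. This is a minimal Python illustration, not the patented implementation; the enum, the function name `classify_frame`, and the threshold values used in the test are hypothetical. Because every comparison in the steps is "greater than", a value exactly equal to a threshold falls into the lower type.

```python
from enum import Enum


class FrameType(Enum):
    """Hypothetical labels for the three detection results."""
    NOISE_DETERMINED = 2  # type determined to contain a noise signal
    NOISE_POSSIBLE = 1    # type possibly containing a noise signal
    NOISE_FREE = 0        # noise-free type


def classify_frame(total_weighted: float, x1: float, x2: float) -> FrameType:
    """Map a total weighted value onto the three types (X2 must be below X1)."""
    if not x2 < x1:
        raise ValueError("the second preset noise threshold must be smaller than the first")
    if total_weighted > x1:       # above the first preset noise threshold
        return FrameType.NOISE_DETERMINED
    if total_weighted > x2:       # S2326 "if so" branch, i.e. S2327
        return FrameType.NOISE_POSSIBLE
    return FrameType.NOISE_FREE   # S2326 "if not" branch, i.e. S2328
```

For example, with hypothetical thresholds X₁ = 0.6 and X₂ = 0.3, a total weighted value of 0.45 would be classified as possibly containing noise.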

Because this embodiment accounts for all three situations that can arise when detecting noise in a target speech frame, the method lays the groundwork for reliable, accurate, and high-quality noise reduction in subsequent steps.

As mentioned above, a target speech frame may belong to one of three types: the type determined to contain a noise signal, the type possibly containing a noise signal, and the noise-free type. Because some noises are persistent, the three types call for different logic operations in order to achieve more reliable, accurate, and high-quality noise reduction. Therefore, in some embodiments, referring to fig. 4a, the noise detection method S200 further includes:

S24, acquiring the current noise detection state;

S25, selecting a noise detection path according to the current noise detection state;

S26, under the selected noise detection path, executing the corresponding operation according to the detection result of whether the target speech frame contains a noise signal.

In some embodiments, the current noise detection state characterizes the result of the electronic device detecting whether the previous target speech frame contains a noise signal. The current noise detection state is one of a noise determination state, a noise possible state, and a noise-free state: if the previous target speech frame belongs to the type determined to contain a noise signal, the current noise detection state is the noise determination state; if it belongs to the type possibly containing a noise signal, the current noise detection state is the noise possible state; and if it belongs to the noise-free type, the current noise detection state is the noise-free state. When the electronic device starts the noise detection operation, the current noise detection state defaults to the noise-free state.

In some embodiments, the noise detection path indicates which logic branch the electronic device enters, as determined by the current noise detection state. When the current noise detection state is the noise possible state or the noise-free state, the first noise detection path is selected; when it is the noise determination state, the second noise detection path is selected.

In some embodiments, under the first noise detection path, when the detection result is that the target speech frame belongs to the type determined to contain a noise signal, the electronic device adds a preset value to the consecutive frame number, updates the current noise detection state to the noise determination state, and executes a first operation according to the accumulated consecutive frame number and a preset frame number threshold. Here, the consecutive frame number is the number of temporally consecutive speech frames containing a noise signal, and the preset value is set by the user according to application requirements, for example 1.

For example, a status flag bit C₀ records the current noise detection state, with the noise determination state marked as 2, the noise possible state as 1, and the noise-free state as 0.

A first counter C₁ records the consecutive frame number, i.e., as described above, the number of temporally consecutive speech frames containing a noise signal. Here, "speech frames containing a noise signal" covers both speech frames determined to contain noise and speech frames possibly containing noise.

For example, if the first speech frame A11, the second speech frame A12, the third speech frame A13, …, and the ninth speech frame A19 all belong to the type determined to contain a noise signal and are consecutive in time, the consecutive frame number is C₁ = 9.

As another example, if the first speech frame A11 and the second speech frame A12 both belong to the type determined to contain a noise signal, the third speech frame A13 belongs to the type possibly containing a noise signal, and the fourth speech frame A14, …, and the ninth speech frame A19 all belong to the type determined to contain a noise signal, the consecutive frame number is also C₁ = 9.

As another example, if the first speech frame A11 is of the type determined to contain a noise signal, the second speech frame A12 is of the type possibly containing a noise signal, the third speech frame A13 is of the type determined to contain a noise signal, the fourth speech frame A14 is of the type possibly containing a noise signal, and the fifth speech frame A15, …, and the ninth speech frame A19 are of the type determined to contain a noise signal, the consecutive frame number is also C₁ = 9.

Assume that the current noise detection state is the noise-free state. The electronic device reads the state flag bit C₀; since C₀ = 0, it selects the first noise detection path. The electronic device then performs the corresponding operation according to the detection result of whether the target speech frame B11 contains a noise signal.

When the total weighted value I₀ is greater than the first preset noise threshold X₁, the target speech frame B11 is of the type determined to contain a noise signal. The electronic device retrieves the consecutive frame number C₁, adds the preset value to it (e.g., C₁ = C₁ + 1), and updates the current noise detection state to the noise determination state, i.e., C₀ = 2.

The electronic device then executes the first operation according to the accumulated consecutive frame number and the preset frame number threshold: for example, it determines whether the accumulated consecutive frame number is greater than the preset frame number threshold; if so, it executes a noise reduction operation, and if not, it returns to the step of acquiring the target speech frame. For instance, with an accumulated consecutive frame number C₁ = 10 and a preset frame number threshold T₁ = 9, the number of speech frames in which the noise signal appears continuously exceeds the threshold, so noise reduction must be performed. As another instance, with an accumulated consecutive frame number C₁ = 6 the threshold is not exceeded, so the electronic device continues to detect whether the next target speech frame contains a noise signal, ensuring that noise reduction is triggered reliably.

Therefore, the method fully accounts for the persistence of noise, so that the noise reduction operation can be performed reliably and efficiently.

In some embodiments, under the first noise detection path, when the detection result is that the target speech frame belongs to the type possibly containing a noise signal, the electronic device adds a preset value to the consecutive frame number, sets the current noise detection state to the noise possible state, and executes the first operation according to the accumulated consecutive frame number and the preset frame number threshold.

For example, assume that the current noise detection state is the noise possible state. The electronic device reads the state flag bit C₀; since C₀ = 1, it selects the first noise detection path. The electronic device then performs the corresponding operation according to the detection result of whether the target speech frame B12 contains a noise signal.

When the total weighted value I₀ is less than the first preset noise threshold X₁ but greater than the second preset noise threshold X₂, the target speech frame B12 is of the type possibly containing a noise signal. The electronic device retrieves the consecutive frame number C₁, adds the preset value to it (e.g., C₁ = C₁ + 1), and sets the current noise detection state to the noise possible state, i.e., C₀ = 1.

The electronic device then executes the first operation according to the accumulated consecutive frame number and the preset frame number threshold, as described in the foregoing embodiment.

Therefore, to avoid missing frames that may contain a noise signal, which would prevent reliable, effective, and high-quality noise reduction, the method reflects the actual conditions of noise detection: it treats a high-confidence "target speech frame may contain a noise signal" result as a contributing factor and incorporates it into the conditions for triggering subsequent noise reduction, thereby achieving more reliable, effective, and high-quality noise reduction.

In some embodiments, under the first noise detection path, when the detection result is that the target speech frame belongs to the noise-free type, the electronic device clears the consecutive frame number and sets the current noise detection state to the noise-free state.

For example, assume that the current noise detection state is the noise possible state. The electronic device reads the state flag bit C₀; since C₀ = 1, it selects the first noise detection path. The electronic device then performs the corresponding operation according to the detection result of whether the target speech frame B13 contains a noise signal.

When the total weighted value I₀ is less than the second preset noise threshold X₂, the target speech frame B13 is of the noise-free type. The electronic device retrieves the consecutive frame number C₁, clears it (i.e., C₁ = 0), and sets the current noise detection state to the noise-free state, i.e., C₀ = 0.

It will be appreciated that, in the first noise detection path, the electronic device may be configured with the operating logic for one, two, or all three of the above situations.

In some embodiments, under the second noise detection path, when the detection result is that the target speech frame belongs to the type determined to contain a noise signal, the electronic device adds a preset value to the consecutive frame number and executes the first operation according to the accumulated consecutive frame number and the preset frame number threshold.

For example, assume that the current noise detection state is the noise determination state. The electronic device reads the state flag bit C₀; since C₀ = 2, it selects the second noise detection path. The electronic device then performs the corresponding operation according to the detection result of whether the target speech frame C11 contains a noise signal.

When the total weighted value I₀ is greater than the first preset noise threshold X₁, the target speech frame C11 is of the type determined to contain a noise signal. The electronic device retrieves the consecutive frame number C₁ and adds the preset value to it (e.g., C₁ = C₁ + 1); C₀ remains 2.

In some embodiments, under the second noise detection path, when the detection result is that the target speech frame does not belong to the type determined to contain a noise signal, the electronic device executes a second operation according to the consecutive frame number and the preset frame number threshold.

For example, assume that the current noise detection state is the noise determination state. The electronic device reads the state flag bit C₀; since C₀ = 2, it selects the second noise detection path. The electronic device then performs the corresponding operation according to the detection result of whether the target speech frame C12 contains a noise signal.

When the total weighted value I₀ is less than the first preset noise threshold X₁ but greater than the second preset noise threshold X₂, the target speech frame C12 is detected as belonging to the type possibly containing a noise signal; when I₀ is less than X₂, C12 is of the noise-free type. In either case, the electronic device retrieves the consecutive frame number C₁ and executes the second operation according to C₁ and the preset frame number threshold.

In some embodiments, under the second noise detection path, when the detection result is that the target speech frame does not belong to the type determined to contain a noise signal, the electronic device performs the second operation by determining whether the consecutive frame number is greater than the preset frame number threshold; if so, it performs the intermittent noise determination operation, and if not, it clears the consecutive frame number.

Generally, if the number of consecutive speech frames containing a noise signal (the consecutive frame number) has not reached the persistence condition, the electronic device need not perform noise reduction, and in this embodiment the consecutive frame number may be cleared. If the consecutive frame number has reached the persistence condition but the current target speech frame does not contain a noise signal, so that the run is interrupted, the electronic device needs to perform the intermittent noise determination operation.

Generally, some noise signals are bursty. Although each burst lasts for a certain period, bursts occurring at different times may be separated: for example, a speech frame set D11 containing noise signals and a speech frame set D12 containing noise signals may be separated by a certain period, i.e., the speech frames between the two sets are noise-free. If those in-between speech frames are not noise-reduced while the frames of the sets before and after them are, the speech across the two sets (including both sets) will not sound continuous and natural, resulting in a poor user experience. Here, the speech frames within each set are consecutive, and the consecutive frame number of each set is greater than the preset frame number threshold.

Therefore, the electronic device needs to perform the intermittent noise determination operation in order to perform noise reduction more effectively and with higher quality.

In some embodiments, when performing the intermittent noise determination operation, the electronic device first traverses backward from the target speech frame to the first historical speech frame that contains a noise signal; all speech frames between the target speech frame and that historical speech frame are intermediate speech frames, none of which contains a noise signal. Referring to fig. 4b, the speech frames are arranged on a time axis in chronological order: speech frames e1, e2, and e3 contain the noise signal, and e4, e5, e6, and e7 do not. Assuming that e8 is the target speech frame and does not contain the noise signal, the electronic device traverses backward and e3 is the first speech frame found to contain the noise signal; that is, e3 is the historical speech frame, and e4, e5, e6, and e7 are the intermediate speech frames.

Then, the electronic device adds a preset value to the total number of intermediate speech frames to obtain the accumulated frame number, where the preset value is user-defined, for example 1. In the above example, the total number of intermediate speech frames is w = 4, so the accumulated frame number is C₂ = w + 1 = 5.

Finally, the electronic device determines whether the accumulated frame number is less than the interval frame number threshold T₂; if so, it executes a third operation, and if not, it clears the consecutive frame number C₁ and the accumulated frame number C₂. For example, with T₂ = 6, since C₂ = 5 is less than T₂, the electronic device executes the third operation. Alternatively, with T₂ = 5, since C₂ = 5 is not less than T₂, the electronic device clears C₁ and C₂: as shown in fig. 4b, the target speech frame e8 contains no noise and the noise-free speech frames already span many frames over a long duration, so the electronic device does not need to perform noise reduction on speech frames e4 to e8.
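The backward traversal and the comparison with T₂ described above can be sketched as a small helper. This is a hypothetical Python illustration: the name `intermittent_check`, the list-of-booleans frame history, and the returned decision strings are assumptions, the preset value defaults to 1, and the strict "less than" test follows the rule stated for the e1 to e8 example.

```python
from typing import Sequence, Tuple


def intermittent_check(noisy: Sequence[bool], target: int, t2: int,
                       preset: int = 1) -> Tuple[str, int, int]:
    """Return (decision, w, c2) for the intermittent noise determination.

    noisy[k] is True if speech frame k contains a noise signal; target is
    the index of the current (noise-free) target speech frame.
    """
    k = target - 1
    # Traverse backward to the first historical frame that contains noise.
    while k >= 0 and not noisy[k]:
        k -= 1
    if k < 0:
        raise ValueError("no historical speech frame containing noise")
    w = target - k - 1   # number of intermediate noise-free frames
    c2 = w + preset      # accumulated frame number C2
    decision = "third_operation" if c2 < t2 else "clear_counters"
    return decision, w, c2
```

With e1 to e8 encoded as `[True, True, True, False, False, False, False, False]` and e8 at index 7, the helper finds e3, counts w = 4 and C₂ = 5, and returns the third operation for T₂ = 6 but a reset for T₂ = 5.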

In some embodiments, when performing the third operation, the electronic device first adds a preset value to the consecutive frame number and then executes the first operation according to the accumulated consecutive frame number and the preset frame number threshold; for example, it determines whether the accumulated consecutive frame number is greater than the preset frame number threshold, and if so, it executes the noise reduction operation, and if not, it returns to the step of acquiring the target speech frame.

In some embodiments, the accumulated frame number is cleared when the detection result is that the target speech frame belongs to a type determined to contain a noise signal.

To illustrate in detail how the electronic device performs the corresponding operation under the different noise detection paths according to whether the target speech frame contains a noise signal, the process is described below with reference to fig. 5a. It should be understood that the following explanation does not limit the scope of the present invention. The process is as follows:

S510, determining whether C₀ is 0 or 1; if so, executing S511, and if not, executing S516;

S511, determining whether I₀ is greater than X₁; if so, executing S512, and if not, executing S513;

S512, setting C₁ = C₁ + 1, C₂ = 0, C₀ = 2, and proceeding to S524;

S513, determining whether I₀ is greater than X₂; if so, executing S514, and if not, executing S515;

S514, setting C₁ = C₁ + 1, C₀ = 1, and proceeding to S524;

S515, setting C₁ = 0, C₀ = 0;

S516, determining whether I₀ is greater than X₁; if so, executing S517, and if not, executing S518;

S517, setting C₁ = C₁ + 1, C₂ = 0, and proceeding to S524;

S518, determining whether C₁ is greater than T₁; if not, executing S519, and if so, executing S520;

S519, setting C₁ = 0, C₂ = 0, C₀ = 0;

S520, setting C₂ = C₂ + 1, and proceeding to S521;

S521, determining whether C₂ is greater than T₂; if so, executing S522, and if not, executing S523;

S522, setting C₁ = 0, C₂ = 0, C₀ = 0;

S523, setting C₁ = C₁ + 1, and proceeding to S524;

S524, determining whether C₁ is greater than T₁; if not, executing S525, and if so, executing S526;

S525, re-acquiring a target speech frame;

S526, executing the noise reduction operation.
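The flowchart S510 to S526 can be condensed into a per-frame state machine. The sketch below is a hypothetical Python rendering, assuming a preset value of 1 and that `step` receives the total weighted value I₀ of each frame; the class and method names are illustrative. At S521 it keeps absorbing noise-free frames while C₂ ≤ T₂, which reproduces the f1 to f11 walk-through in this section (f10, with C₂ = T₂ = 4, is still noise-reduced); the e1 to e8 example phrases the same test as a strict "less than", so the two readings differ only at equality.

```python
from dataclasses import dataclass
from enum import IntEnum


class State(IntEnum):
    """Values of the status flag bit C0."""
    NOISE_FREE = 0
    NOISE_POSSIBLE = 1
    NOISE_DETERMINED = 2


@dataclass
class NoiseTracker:
    x1: float                      # first preset noise threshold X1
    x2: float                      # second preset noise threshold X2 (X2 < X1)
    t1: int                        # preset frame number threshold T1
    t2: int                        # interval frame number threshold T2
    c0: State = State.NOISE_FREE   # default state when detection starts
    c1: int = 0                    # consecutive frame number C1
    c2: int = 0                    # accumulated frame number C2

    def _reset(self) -> None:
        self.c0, self.c1, self.c2 = State.NOISE_FREE, 0, 0

    def step(self, i0: float) -> bool:
        """Process one frame; True means noise reduction runs (S526)."""
        if self.c0 in (State.NOISE_FREE, State.NOISE_POSSIBLE):  # S510: first path
            if i0 > self.x1:                       # S511 -> S512
                self.c1 += 1
                self.c2 = 0
                self.c0 = State.NOISE_DETERMINED
            elif i0 > self.x2:                     # S513 -> S514
                self.c1 += 1
                self.c0 = State.NOISE_POSSIBLE
            else:                                  # S515
                self.c1 = 0
                self.c0 = State.NOISE_FREE
                return False
        elif i0 > self.x1:                         # S516 -> S517 (second path)
            self.c1 += 1
            self.c2 = 0
        elif self.c1 <= self.t1:                   # S518 "not greater" -> S519
            self._reset()
            return False
        else:
            self.c2 += 1                           # S520
            if self.c2 > self.t2:                  # S521 -> S522
                self._reset()
                return False
            self.c1 += 1                           # S523
        return self.c1 > self.t1                   # S524 -> S526 or S525
```

With T₁ = 5, T₂ = 4 and hypothetical thresholds X₁ = 0.6, X₂ = 0.3, feeding six noisy frames (I₀ = 0.9) followed by five noise-free frames (I₀ = 0.1) triggers noise reduction for the sixth to tenth frames and clears all counters at the eleventh.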

In this embodiment, the electronic device can not only reduce noise reliably and effectively but also handle intermittent noise, which improves the noise reduction effect and yields high-quality speech output.

In some embodiments, to explain in detail how the method handles intermittent noise, the following description refers to fig. 5a and fig. 5b:

Assume that the electronic device starts noise detection with the current noise detection state defaulting to the noise-free state, i.e., C₀ = 0, the consecutive frame number C₁ = 0, the accumulated frame number C₂ = 0, the preset frame number threshold T₁ = 5, and the interval frame number threshold T₂ = 4.

Since C₀ = 0, S511 is executed to determine whether the target speech frame f1 contains a noise signal. Assuming that f1 is of the type determined to contain a noise signal, C₀ is set to 2, and the accumulated consecutive frame number becomes C₁ = 0 + 1 = 1.

Next, assuming that the target speech frame f2 also belongs to the type determined to contain a noise signal, the accumulated consecutive frame number becomes C₁ = 1 + 1 = 2. Likewise, assuming that the target speech frames f3, f4, f5, and f6 are all of the type determined to contain a noise signal, after f6 is judged to contain a noise signal the accumulated consecutive frame number is C₁ = 6.

Since the accumulated consecutive frame number C₁ = 6 is greater than the preset frame number threshold T₁ = 5, the electronic device starts performing the noise reduction operation.

Then, assume that the target speech frame f7 is of the noise-free type. When f7 is determined to be noise-free, S518 is executed; at this point C₁ = 6, which is clearly greater than T₁ = 5, so the electronic device starts counting the accumulated frame number, i.e., C₂ = 0 + 1 = 1.

Since the accumulated frame number C₂ = 1 is less than the interval frame number threshold T₂ = 4, the pause between the speech frames containing the noise signal and the noise-free speech frame is short. As mentioned above, to make adjacent noisy and noise-free speech frames sound natural after noise reduction, noise-free speech frames with such a short pause must also be included in the noise reduction operation. The electronic device therefore counts the target speech frame f7 as one more consecutive frame, i.e., C₁ = 6 + 1 = 7.

Clearly, the accumulated consecutive frame number C₁ = 7 is still greater than the preset frame number threshold T₁ = 5, so the electronic device continues the noise reduction operation.

Assuming that the target speech frames f8, f9, f10, and f11 all belong to the noise-free type, when f11 is judged to be noise-free, the accumulated frame number is C₂ = 5.

Since the accumulated frame number C₂ = 5 is greater than the interval frame number threshold T₂ = 4, too many consecutive speech frames belong to the noise-free type; the electronic device treats this as a genuinely noise-free situation and therefore does not need to continue the noise reduction operation.

It can thus be seen that if the electronic device, after performing noise reduction on the target speech frame f6, did not perform noise reduction on the target speech frames f7, f8, f9, and f10, the output speech segment would sound unnatural. With this method, even though f7, f8, f9, and f10 contain no noise signal, the electronic device performs noise reduction on them to preserve speech quality, thereby achieving a high-quality noise reduction effect.

In some embodiments, when performing the noise reduction operation, the electronic device may determine the magnitude of the noise from the centroid value of the noise frequency range in the target speech frame and perform noise reduction according to that magnitude.

In some embodiments, to improve noise reduction efficiency and speech output efficiency, the electronic device may perform a preliminary check before S22: if it determines that the target speech frame does not contain a noise signal, it skips noise detection for that frame and takes the next speech frame as the new target speech frame. Referring to fig. 6a, the noise detection method S200 further includes:

S27, preliminarily determining whether the target speech frame contains a noise signal; if so, proceeding to S22, and if not, returning to S21.

Therefore, the method allows noise detection to be performed efficiently.

In some embodiments, the noise signal is low-frequency noise, and referring to fig. 6b, S27 includes:

S271, calculating the logarithm of the power of each frequency bin in the target speech frame;

S272, calculating a first sum of the logarithms over all frequency bins and a second sum of the logarithms over the frequency bins within the noise frequency range;

S273, calculating the ratio of the second sum to the first sum;

S274, determining whether the ratio is greater than a third preset noise threshold.

In this embodiment, the third preset noise threshold is set by the user according to application requirements. Taking the logarithm amplifies the energy of low-frequency noise, so this step can effectively, if roughly, determine whether the target speech frame contains a noise signal in the low frequency band.
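Steps S271 to S274 can be sketched as follows, under stated assumptions: the input is a precomputed, non-negative power spectrum of the target speech frame, the noise frequency range is given as a range of bin indices, and each logarithm is taken of 1 + power so that every term is non-negative and the ratio stays between 0 and 1. The offset, the base-10 logarithm, and the function names are illustrative choices, not taken from the source.

```python
import math
from typing import Iterable, Sequence


def low_frequency_ratio(powers: Sequence[float], noise_bins: Iterable[int]) -> float:
    """Ratio of the log-power sum in the noise range to the total (S271 to S273)."""
    # S271: logarithm of each bin's power (assumed offset of 1 keeps terms >= 0)
    logs = [math.log10(1.0 + p) for p in powers]
    first_sum = sum(logs)                           # S272: all frequency bins
    second_sum = sum(logs[k] for k in noise_bins)   # S272: noise frequency range
    return second_sum / first_sum if first_sum > 0 else 0.0  # S273


def may_contain_low_frequency_noise(powers: Sequence[float],
                                    noise_bins: Iterable[int], x3: float) -> bool:
    """S274: compare the ratio with the third preset noise threshold X3."""
    return low_frequency_ratio(powers, noise_bins) > x3
```

For a flat spectrum of eight bins with the first two as the noise range, the ratio is 0.25; concentrating power in those two bins pushes it well above 0.5.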

As described above, the electronic device performs the corresponding operation according to both the current noise detection state and the detection result of whether the target speech frame contains a noise signal. In some embodiments, however, unlike the above embodiments, the electronic device performs the corresponding operation without considering the current noise detection state. When the target speech frame is detected to contain a noise signal, the device adds a preset value to the consecutive frame number (the number of temporally consecutive speech frames containing a noise signal) and executes the first operation according to the accumulated consecutive frame number and the preset frame number threshold. When the target speech frame is detected not to contain a noise signal, it executes the second operation according to the consecutive frame number and the preset frame number threshold. The first and second operations are as described in the embodiments above and are not repeated here.

To illustrate the benefits of the noise detection method provided by this embodiment, reference is made to the noise detection simulation results in fig. 6c:

As shown in fig. 6c, from top to bottom, the first graph shows the simulation of clean speech, the second graph shows wind noise, and the third graph shows speech with wind noise.

The fourth graph shows the ratio of the second sum to the first sum computed in the preliminary check of whether the target speech frame contains a noise signal: the ratio is high in segments of speech with wind noise and low, close to 0, in segments without wind noise.

The fifth graph shows the total weighted value of the noise features computed from the multiple types of speech features: the total weighted value is high in segments with wind noise and relatively low, close to 0, in segments without wind noise. The noise detection method provided by this embodiment therefore achieves relatively high detection accuracy and reliability.

It should be noted that, in the foregoing embodiments, the steps do not necessarily follow a fixed order; based on the description of the embodiments of the present invention, those skilled in the art will understand that in different embodiments the steps may be executed in different orders, for example in parallel or interchangeably.

As another aspect of the embodiments of the present invention, an embodiment provides a noise detection apparatus. The noise detection apparatus may be a software module comprising a plurality of instructions stored in a memory; a processor can access the memory and execute the instructions to carry out the noise detection method described in the above embodiments.

In some embodiments, the noise detection apparatus may also be built from hardware devices. For example, it may be built from one or more chips that cooperate with each other to carry out the noise detection method described in the above embodiments. As another example, it may be built from various logic devices, such as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a single-chip microcomputer, an ARM (Acorn RISC Machine) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components.

Referring to fig. 7a, the noise detection apparatus 700 includes a speech framing module 71, a feature extraction module 72, and a noise detection module 73.

The voice framing module 71 is configured to obtain a target voice frame, the feature extraction module 72 is configured to extract multiple types of voice features according to the target voice frame, and the noise detection module 73 is configured to detect whether the target voice frame contains a noise signal according to the multiple types of voice features.

Therefore, the apparatus can judge in multiple dimensions whether the target speech frame contains a noise signal, avoiding the false positives and missed detections that can occur with single-dimension judgment and thereby improving the accuracy and reliability of noise detection.

In some embodiments, referring to fig. 7b, the noise detection module 73 includes a probability determination unit 731 and a noise detection unit 732, the probability determination unit 731 is configured to determine a noise probability that each type of speech feature belongs to a noise feature, and the noise detection unit 732 is configured to detect whether the target speech frame includes a noise signal according to the noise probabilities of the types of speech features.

In some embodiments, the noise detection unit 732 is specifically configured to: calculate the weighted value of each type of speech feature from its noise probability and the corresponding preset weight; accumulate the weighted values of all types of speech features to obtain a total weighted value; and detect whether the target speech frame contains a noise signal according to the total weighted value and the first preset noise threshold.
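The computation performed by the noise detection unit 732 amounts to a weighted sum over the feature types. A minimal sketch, with a hypothetical function name; the probabilities and weights in the test are illustrative values, not taken from the source.

```python
from typing import Sequence


def total_weighted_value(probabilities: Sequence[float], weights: Sequence[float]) -> float:
    """Sum of (noise probability x preset weight) over all speech feature types."""
    if len(probabilities) != len(weights):
        raise ValueError("each feature type needs exactly one preset weight")
    return sum(p * w for p, w in zip(probabilities, weights))
```

The result would then be compared with the first preset noise threshold (and, where the three-type scheme is used, the second).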

In some embodiments, the noise detection unit 732 is further specifically configured to: determine whether the total weighted value is greater than the first preset noise threshold, and if so, determine that the target speech frame belongs to the type determined to contain a noise signal.

In some embodiments, the noise detection unit 732 is further specifically configured to: if the total weighted value is smaller than the first preset noise threshold, determine whether the total weighted value is greater than a second preset noise threshold, the second preset noise threshold being smaller than the first preset noise threshold; if so, determine that the target speech frame belongs to the type possibly containing a noise signal; and if not, determine that the target speech frame belongs to the noise-free type.
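The weighted decision made by the noise detection unit 732 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the function and enum names are hypothetical, the weights and thresholds are placeholder values, and a total exactly equal to the first threshold is handled like a smaller one because the text does not cover that case.

```python
from enum import Enum
from typing import Sequence


class FrameType(Enum):
    """The three detection results described for the noise detection unit."""
    NOISE_DETERMINED = "determined to contain noise"
    NOISE_POSSIBLE = "possibly contains noise"
    NOISE_FREE = "noise-free"


def classify_frame(probabilities: Sequence[float],
                   weights: Sequence[float],
                   first_threshold: float,
                   second_threshold: float) -> FrameType:
    """Classify a speech frame from per-feature noise probabilities.

    Hypothetical helper: one noise probability and one preset weight per
    feature type.
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one preset weight per feature type")
    if not second_threshold < first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    # Weighted value of each feature type, summed into the total weighted value.
    total = sum(p * w for p, w in zip(probabilities, weights))
    if total > first_threshold:
        return FrameType.NOISE_DETERMINED
    # Assumption: a total equal to the first threshold falls through to here.
    if total > second_threshold:
        return FrameType.NOISE_POSSIBLE
    return FrameType.NOISE_FREE
```

For example, with hypothetical probabilities [0.9, 0.8, 0.7], weights [0.4, 0.3, 0.3] and thresholds 0.7 and 0.4, the total weighted value is about 0.81, so the frame is classified as determined to contain noise.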

In some embodiments, the noise features include sub-band centroid value features and/or spectral template combination features and/or negative slope fitting features, and the probability determination unit 731 is configured to perform one or more of the following: obtain, according to a sub-band centroid value algorithm, the centroid value of the noise frequency range of the target speech frame, and normalize the centroid value to obtain the noise probability that the speech feature belongs to the sub-band centroid value feature; obtain, according to a spectral template combination algorithm, the degree of difference between the target speech frame and a preset speech frame template, and normalize the degree of difference to obtain the noise probability that the speech feature belongs to the spectral template combination feature; and obtain, according to a negative slope fitting algorithm, the error between the amplitude spectrum of the target speech frame and its linear approximation, and normalize the error to obtain the noise probability that the speech feature belongs to the negative slope fitting feature.
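The three feature scores can be illustrated with the sketch below. The text names the quantities (a sub-band centroid, a degree of difference from a preset template, and the error of a linear fit to the amplitude spectrum) but not how they are computed or normalized, so everything concrete here is an assumption. This includes bin-index centroids, a relative L1 difference against a hypothetical noise template, an ordinary least-squares line fit of a log-amplitude spectrum, the `error_scale` normalization constant, and the direction in which each score is mapped to a probability.

```python
import math
from typing import Sequence


def _clamp01(x: float) -> float:
    """Clamp a value into [0, 1] so it can serve as a probability."""
    return max(0.0, min(1.0, x))


def centroid_probability(power: Sequence[float], lo: int, hi: int) -> float:
    """Power centroid over bins lo..hi-1, normalized to [0, 1].

    Assumption: a centroid near the bottom of the noise range (energy piled
    up in the lowest bins) is scored as more noise-like.
    """
    band = power[lo:hi]
    total = sum(band)
    if len(band) < 2 or total <= 0.0:
        return 0.0
    centroid = sum(i * p for i, p in enumerate(band)) / total  # 0 .. len-1
    return _clamp01(1.0 - centroid / (len(band) - 1))


def template_probability(spectrum: Sequence[float],
                         noise_template: Sequence[float]) -> float:
    """Relative L1 difference from a hypothetical noise template.

    Assumption: the preset template describes noise, so a small degree of
    difference maps to a high noise probability.
    """
    if len(spectrum) != len(noise_template):
        raise ValueError("spectrum and template must have the same length")
    diff = sum(abs(s - t) for s, t in zip(spectrum, noise_template))
    scale = sum(abs(t) for t in noise_template)
    if scale == 0.0:
        return 0.0
    return _clamp01(1.0 - diff / scale)


def negative_slope_probability(log_amplitude: Sequence[float],
                               error_scale: float = 3.0) -> float:
    """Least-squares line fit of a log-amplitude spectrum.

    The RMS error between the spectrum and its linear approximation is
    normalized by the hypothetical `error_scale`; only a negative fitted
    slope is treated as noise-like.
    """
    n = len(log_amplitude)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2.0
    mean_y = sum(log_amplitude) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(log_amplitude))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope >= 0.0:
        return 0.0
    rms = math.sqrt(sum((y - (intercept + slope * x)) ** 2
                        for x, y in enumerate(log_amplitude)) / n)
    return _clamp01(1.0 - rms / error_scale)
```

Each function returns a value in [0, 1], which could then be fed to a weighted decision such as the one performed by the noise detection unit 732.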

In some embodiments, referring to fig. 7c, the noise detection apparatus 700 further includes a state detection module 74, a path selection module 75 and an operation execution module 76. The state detection module 74 is configured to obtain the current noise detection state; the path selection module 75 is configured to select a noise detection path according to the current noise detection state; and the operation execution module 76 is configured to execute, in the selected noise detection path, a corresponding operation according to the detection result of whether the target speech frame contains a noise signal.

In some embodiments, the current noise detection state is one of a noise determination state, a noise possible state and a noise-free state, and the path selection module 75 is specifically configured to: select a first noise detection path when the current noise detection state is the noise possible state or the noise-free state; and select a second noise detection path when the current noise detection state is the noise determination state.

In some embodiments, the operation execution module 76 is specifically configured to perform the following under the first noise detection path. When the detection result is that the target speech frame belongs to the type determined to contain the noise signal, it adds a preset value to the consecutive frame number, updates the current noise detection state to the noise determination state, and executes a first operation according to the accumulated consecutive frame number and a preset frame number threshold, where the consecutive frame number is the number of temporally consecutive speech frames containing the noise signal. When the detection result is that the target speech frame belongs to the type possibly containing the noise signal, it adds the preset value to the consecutive frame number, sets the current noise detection state to the noise possible state, and executes the first operation according to the accumulated consecutive frame number and the preset frame number threshold. When the detection result is that the target speech frame belongs to the noise-free type, it resets the consecutive frame number and sets the current noise detection state to the noise-free state.
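The path selection and the first noise detection path can be sketched as a small state machine. This is a hypothetical Python illustration, not the patented implementation: the preset value is assumed to be 1, the frame number threshold is a placeholder, and the returned strings `"reduce_noise"` and `"next_frame"` merely stand for performing the noise reduction operation and returning to speech framing.

```python
from dataclasses import dataclass
from enum import Enum


class FrameType(Enum):
    """Detection result for one target speech frame."""
    NOISE_DETERMINED = 1
    NOISE_POSSIBLE = 2
    NOISE_FREE = 3


class DetectionState(Enum):
    """Current noise detection state."""
    NOISE_DETERMINED = 1
    NOISE_POSSIBLE = 2
    NOISE_FREE = 3


@dataclass
class Tracker:
    state: DetectionState = DetectionState.NOISE_FREE
    consecutive: int = 0        # consecutive frame number
    step: int = 1               # preset value (assumed to be 1)
    frame_threshold: int = 5    # preset frame number threshold (placeholder)


def select_path(state: DetectionState) -> int:
    """Second path in the noise determination state, first path otherwise."""
    return 2 if state is DetectionState.NOISE_DETERMINED else 1


def first_operation(t: Tracker) -> str:
    """Reduce noise once the consecutive frame number exceeds the threshold."""
    return "reduce_noise" if t.consecutive > t.frame_threshold else "next_frame"


def first_path(t: Tracker, result: FrameType) -> str:
    """Handle one frame on the first noise detection path."""
    if result is FrameType.NOISE_FREE:
        t.consecutive = 0
        t.state = DetectionState.NOISE_FREE
        return "next_frame"
    # Determined or possible noise: count the frame and record the new state.
    t.consecutive += t.step
    t.state = (DetectionState.NOISE_DETERMINED
               if result is FrameType.NOISE_DETERMINED
               else DetectionState.NOISE_POSSIBLE)
    return first_operation(t)
```

Because a frame that only possibly contains noise still extends the consecutive frame number, a run of borderline frames can trigger noise reduction without any single frame crossing the first preset noise threshold.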

In some embodiments, the operation execution module 76 is further specifically configured to perform the following under the second noise detection path. When the detection result is that the target speech frame belongs to the type determined to contain the noise signal, it adds a preset value to the consecutive frame number and executes the first operation according to the accumulated consecutive frame number and the preset frame number threshold, where the consecutive frame number is the number of temporally consecutive speech frames containing the noise signal. When the detection result is that the target speech frame does not belong to the type determined to contain the noise signal, it executes a second operation according to the consecutive frame number and the preset frame number threshold.

In some embodiments, when executing the first operation, the operation execution module 76 is further specifically configured to: determine whether the accumulated consecutive frame number is greater than the preset frame number threshold; if so, perform a noise reduction operation; if not, return to the speech framing module 71 to obtain a new target speech frame.

In some embodiments, when executing the second operation, the operation execution module 76 is further specifically configured to: determine whether the consecutive frame number is greater than the preset frame number threshold; if so, perform an intermittent noise determination operation; if not, reset the consecutive frame number.

In some embodiments, when performing the intermittent noise determination operation, the operation execution module 76 is further specifically configured to: starting from the target speech frame, traverse backwards to the first historical speech frame encountered that contains a noise signal, and add a preset value for each intermediate speech frame to obtain an accumulated frame number; determine whether the accumulated frame number is smaller than an interval frame number threshold; if so, execute a third operation; if not, reset both the consecutive frame number and the accumulated frame number.

In some embodiments, when executing the third operation, the operation execution module 76 is further specifically configured to: add a preset value to the consecutive frame number, and execute the first operation according to the accumulated consecutive frame number and the preset frame number threshold.

In some embodiments, the operation execution module 76 is further specifically configured to: clear the accumulated frame number when the detection result is that the target speech frame belongs to the type determined to contain the noise signal.
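The second noise detection path, together with the second operation, the intermittent noise determination and the third operation, can be sketched as follows. It is a hypothetical Python illustration rather than the patented implementation. The preset value is assumed to be 1, both thresholds are placeholders, and the backward traversal to the last noisy frame is replaced by a running counter that is cleared on every frame determined to contain noise. Whether the target frame itself is counted is also an assumption. The text does not say when the state leaves the noise determination state, so the sketch leaves state transitions out.

```python
from dataclasses import dataclass


@dataclass
class SecondPathState:
    consecutive: int = 0       # consecutive frame number
    gap: int = 0               # accumulated frame number since the last noisy frame
    step: int = 1              # preset value (assumed to be 1)
    frame_threshold: int = 5   # preset frame number threshold (placeholder)
    gap_threshold: int = 3     # interval frame number threshold (placeholder)


def first_operation(s: SecondPathState) -> str:
    """Reduce noise once the consecutive frame number exceeds the threshold."""
    return "reduce_noise" if s.consecutive > s.frame_threshold else "next_frame"


def second_path(s: SecondPathState, noise_determined: bool) -> str:
    """Handle one frame while in the noise determination state."""
    if noise_determined:
        s.gap = 0                     # clear the accumulated frame number
        s.consecutive += s.step
        return first_operation(s)
    # Second operation: only a long enough noise run is checked for gaps.
    if s.consecutive <= s.frame_threshold:
        s.consecutive = 0             # too short to be a noise burst
        return "next_frame"
    # Intermittent noise determination: running count of frames since the
    # last noisy frame, standing in for the backward traversal.
    s.gap += s.step
    if s.gap < s.gap_threshold:
        # Third operation: treat the short gap as part of the ongoing noise.
        s.consecutive += s.step
        return first_operation(s)
    s.consecutive = 0
    s.gap = 0
    return "next_frame"
```

With these placeholder values, once a noise run has exceeded five frames, up to two noise-free frames in a row are bridged and noise reduction continues; a third noise-free frame ends the run and resets both counters.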

In some embodiments, when performing the noise reduction operation, the operation execution module 76 is further specifically configured to: determine the noise magnitude according to the centroid value of the noise frequency range in the target speech frame, and perform the noise reduction operation according to the noise magnitude.

In some embodiments, referring to fig. 7d, the noise detection apparatus 700 further includes a noise initial determination module 77 that runs before the feature extraction module 72. The noise initial determination module 77 is configured to determine whether the target speech frame contains a noise signal; if so, the feature extraction module 72 is executed; if not, control returns to the speech framing module 71.

In some embodiments, the noise signal is low-frequency noise, and the noise initial determination module 77 is specifically configured to: compute the logarithm of the power at each frequency bin of the target speech frame; compute a first sum of the logarithms over all frequency bins and a second sum of the logarithms over the frequency bins within the noise frequency range; calculate the ratio of the second sum to the first sum; and determine whether the ratio is greater than a third preset noise threshold.
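The low-frequency pre-check of the noise initial determination module 77 can be sketched as follows. This minimal Python version assumes the per-bin powers are on a scale where every value exceeds 1 (for example, unnormalized fixed-point FFT power), so that all logarithms are positive and the ratio lies between 0 and 1. The text does not say how other cases are handled. The function names, bin range and threshold are hypothetical placeholders.

```python
import math
from typing import Sequence


def low_frequency_log_ratio(power: Sequence[float], lo: int, hi: int) -> float:
    """Ratio of the log-power sum in bins lo..hi-1 to the sum over all bins."""
    if not power:
        raise ValueError("power spectrum must not be empty")
    if any(p <= 1.0 for p in power):
        raise ValueError("sketch assumes every bin power is greater than 1")
    logs = [math.log10(p) for p in power]
    first_sum = sum(logs)          # all frequency bins
    second_sum = sum(logs[lo:hi])  # bins in the noise frequency range
    return second_sum / first_sum


def initial_noise_check(power: Sequence[float], lo: int, hi: int,
                        third_threshold: float) -> bool:
    """True if the frame is judged to contain low-frequency noise."""
    return low_frequency_log_ratio(power, lo, hi) > third_threshold
```

For a hypothetical frame with power 1000 in the four lowest bins and 10 in the four highest, the log sums are 12 (noise range) and 16 (all bins), giving a ratio of 0.75, which passes a placeholder threshold of 0.6.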

In some embodiments, the noise signal is wind noise.

This embodiment differs from the above embodiments in that, referring to fig. 8, the noise detection apparatus 700 further includes a first operation module 78 and a second operation module 79. The first operation module 78 is configured to, when the target speech frame is detected to contain a noise signal, add a preset value to the consecutive frame number and execute a first operation according to the accumulated consecutive frame number and a preset frame number threshold, where the consecutive frame number is the number of temporally consecutive speech frames containing the noise signal. The second operation module 79 is configured to, when the target speech frame is detected not to contain a noise signal, execute a second operation according to the consecutive frame number and the preset frame number threshold.

The noise detection apparatus can execute the noise detection method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to that method. For technical details not described in the apparatus embodiments, refer to the noise detection method provided by the embodiments of the present invention.

Referring to fig. 9, fig. 9 is a schematic circuit structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 9, the electronic device 900 includes one or more processors 91 and memory 92. In fig. 9, one processor 91 is taken as an example.

The processor 91 and the memory 92 may be connected by a bus or other means, and fig. 9 illustrates the connection by a bus as an example.

The memory 92, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the noise detection method in the embodiments of the present invention. The processor 91 executes various functional applications and data processing of the noise detection apparatus by executing nonvolatile software programs, instructions and modules stored in the memory 92, that is, implements the functions of the noise detection method provided by the above-described method embodiment and the various modules or units of the above-described apparatus embodiment.

The memory 92 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 92 may optionally include memory located remotely from the processor 91, and such remote memory may be connected to the processor 91 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The program instructions/modules are stored in the memory 92 and, when executed by the one or more processors 91, perform the noise detection method of any of the method embodiments described above.

Embodiments of the present invention also provide a non-transitory computer storage medium storing computer-executable instructions, which are executed by one or more processors, such as the processor 91 in fig. 9, to enable the one or more processors to perform the noise detection method in any of the above method embodiments.

Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by an electronic device, cause the electronic device to perform any of the noise detection methods described above.

The above-described embodiments of the apparatus or device are merely illustrative, wherein the unit modules described as separate parts may or may not be physically separate, and the parts displayed as module units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network module units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and can of course also be implemented by hardware. Based on such an understanding, the essence of the above technical solutions, or the part of them that contributes to the related art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; within the idea of the invention, also technical features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A noise detection method, comprising:

acquiring a target voice frame;

extracting various voice characteristics according to the target voice frame;

detecting whether the target voice frame contains a noise signal or not according to the various voice characteristics;

acquiring a current noise detection state, wherein the current noise detection state comprises a noise determination state, a noise possible state and a noise-free state;

selecting a noise detection path according to the current noise detection state, wherein the selecting a noise detection path according to the current noise detection state comprises: when the current noise detection state is a noise possible state or a noise-free state, selecting a first noise detection path; when the current noise detection state is a noise determination state, selecting a second noise detection path; and

executing, in the noise detection path, a corresponding operation according to a detection result of whether the target voice frame contains a noise signal.

2. The method of claim 1, wherein the detecting whether the target speech frame contains a noise signal according to the classes of the speech features comprises:

3. The method of claim 2, wherein the detecting whether the target speech frame contains a noise signal according to the noise probabilities of the classes of the speech features comprises:

4. The method of claim 3, wherein the detecting whether the target speech frame contains a noise signal according to the total weight value and a preset threshold comprises:

5. The method of claim 4, wherein the detecting whether the target speech frame contains a noise signal according to the total weighted value and a preset threshold further comprises:

6. The method according to claim 2, wherein the noise features comprise sub-band centroid value features and/or spectral template combination features and/or negative slope fitting features, and wherein the determining the noise probability that each type of the speech features belongs to noise features comprises:

7. The method of claim 1, wherein performing, in the noise detection path, a corresponding operation according to a detection result of whether the target speech frame contains a noise signal comprises:

under the first noise detection path:

8. The method of claim 1, wherein performing, in the noise detection path, a corresponding operation according to a detection result of whether the target speech frame contains a noise signal comprises:

under the second noise detection path:

9. The method of claim 7 or 8, wherein the performing the first operation according to the accumulated consecutive frame number and the preset frame number threshold comprises:

judging whether the accumulated consecutive frame number is greater than the preset frame number threshold;

if yes, executing a noise reduction operation;

if not, returning to the step of obtaining the target voice frame.

10. The method of claim 8, wherein performing the second operation according to the consecutive frame number and a preset frame number threshold comprises:

judging whether the consecutive frame number is greater than the preset frame number threshold;

if so, executing an intermittent noise determination operation;

if not, resetting the consecutive frame number.

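11. The method according to claim 10, wherein said performing the intermittent noise determination operation comprises:

starting from the target voice frame, traversing backwards to the first historical voice frame encountered that contains a noise signal, and accumulating a preset value for each intermediate voice frame to obtain an accumulated frame number;

judging whether the accumulated frame number is smaller than an interval frame number threshold;

if yes, executing a third operation;

if not, resetting the consecutive frame number and the accumulated frame number.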

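12. The method of claim 11, wherein the performing the third operation comprises:

accumulating a preset value for the consecutive frame number;

executing the first operation according to the accumulated consecutive frame number and the preset frame number threshold.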

13. The method of claim 11, further comprising: when the detection result indicates that the target voice frame belongs to the type determined to contain the noise signal, clearing the accumulated frame number.

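14. The method of claim 9, wherein the performing a noise reduction operation comprises:

determining a noise magnitude according to the centroid value of the noise frequency range in the target voice frame;

performing the noise reduction operation according to the noise magnitude.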

15. The method of any of claims 1 to 6, further comprising:

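16. The method of any of claims 1 to 6, wherein prior to extracting the plurality of classes of speech features, the method further comprises:

determining whether the target voice frame contains a noise signal;

if so, extracting the plurality of classes of speech features;

if not, returning to the step of obtaining the target voice frame.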

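17. The method of claim 16, wherein the noise signal is low frequency noise, and wherein the determining whether the target speech frame contains a noise signal comprises:

calculating the logarithm of the power of each frequency point in the target voice frame;

calculating a first sum of the logarithms of all frequency points and a second sum of the logarithms of the frequency points within the noise frequency range;

calculating a ratio of the second sum to the first sum;

judging whether the ratio is greater than a third preset noise threshold.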

18. The method of any one of claims 1 to 6, wherein the noise signal is wind noise.

19. A non-transitory readable storage medium storing computer-executable instructions for causing an electronic device to perform the noise detection method of any one of claims 1 to 18.

20. An electronic device, comprising:

at least one processor; and

a memory connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the noise detection method of any one of claims 1 to 18.