CN110634497A - Noise reduction method and device, terminal equipment and storage medium - Google Patents


Info

Publication number
CN110634497A
Authority
CN
China
Prior art keywords
noise
audio
current
segment
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911028898.2A
Other languages
Chinese (zh)
Other versions
CN110634497B (en)
Inventor
熊伟浩
秦明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TP Link Technologies Co Ltd
Original Assignee
TP Link Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TP Link Technologies Co Ltd
Priority to CN201911028898.2A
Publication of CN110634497A
Application granted
Publication of CN110634497B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Abstract

The application belongs to the technical field of audio data processing and provides a noise reduction method, apparatus, terminal device, and storage medium. The noise reduction method comprises: segmenting audio data and extracting the speech features of each audio segment; obtaining the probability that each audio segment is noise, so that each segment can be preliminarily judged and the probability of misjudging burst noise is reduced; reclassifying the audio segments by means of the noise probability and a classification model to further determine whether the current audio segment is a noise segment, which further reduces the probability of misjudging burst noise; and filtering the noise spectrum of the segments determined to be noise segments and outputting the denoised target speech data. Burst noise can thereby be effectively reduced or filtered out.

Description

Noise reduction method and device, terminal equipment and storage medium
Technical Field
The present application belongs to the technical field of audio data processing, and in particular, to a noise reduction method, apparatus, terminal device, and storage medium.
Background
In industrial production and in daily life, excessive environmental noise is common and risks impairing workers' hearing. To reduce noise, it is necessary first to determine whether the current signal is noise and then to filter the noise. A method based on a statistical model is usually used for this determination: it calculates the signal-to-noise ratio of the current signal in real time and compares it with an estimated signal-to-noise ratio to decide whether the current signal is noise. This method is suitable for scenes where the noise changes slowly or remains essentially unchanged for a period of time.
However, for sounds that differ abruptly from the environmental noise, such as a car horn or the sudden rotation of a smart camera's motor, the signal-to-noise ratio calculated when the noise begins is much larger than the estimated signal-to-noise ratio, so the motor noise at that moment is misjudged as non-noise and the noise spectrum is not iteratively updated. Only after a period of time is the motor noise gradually judged to be noise, and this iteration is very slow.
In summary, existing noise reduction methods cannot effectively reduce or filter out burst noise.
Disclosure of Invention
The embodiments of the present application provide a noise reduction method, apparatus, terminal device, and storage medium, which can solve the problem that existing noise reduction methods cannot effectively reduce or filter out burst noise.
In a first aspect, an embodiment of the present application provides a noise reduction method, including:
acquiring audio data, dividing the audio data into a plurality of audio segments according to a first preset rule, and extracting voice characteristics of each audio segment;
acquiring a first noise probability of a current audio clip;
if the first noise probability meets the test condition, calculating the distance between the current audio clip and a classification vector according to the voice feature based on a classification model, and determining a classification result;
updating the first noise probability according to the distance between the current audio clip and the classification vector and the classification result;
if the updated first noise probability is within the preset probability range, determining the current audio clip as the noise clip;
and filtering the noise frequency spectrums of all noise segments in the audio data by using a filter, and outputting target audio data.
Further, the acquiring of the audio data, dividing the audio data into a plurality of audio segments according to a first preset rule, and extracting the voice feature of each audio segment includes:
performing framing processing on the audio data according to first preset time; each audio clip comprises a first preset number of audio sampling points;
windowing the audio data according to the audio sampling points of the current frame and a second preset number of audio sampling points of the previous frame to obtain a current audio segment;
performing Fourier transform on the current audio clip to obtain a frequency spectrum of the current audio clip with preset dimensionality;
and extracting voice characteristics according to the frequency spectrum of the current audio segment.
Further, the obtaining the first noise probability of the current audio piece includes:
acquiring all audio clips in a second preset time period;
calculating the voice energy of each audio clip in a second preset time period, and acquiring target energy according to the voice energy of each audio clip;
and calculating the ratio of the voice energy of the current audio clip to the target energy, and acquiring the corresponding first noise probability according to the ratio.
Further, if the first noise probability satisfies a test condition, calculating a distance between the current audio segment and a classification vector according to the speech feature based on a classification model, and determining a classification result, including:
if the first noise probability is larger than a first threshold value, inputting the voice features into a classification model to obtain the distance between the current audio clip and a classification vector;
judging whether the distance between the current audio clip and the classification vector is larger than a second threshold value or not;
if the distance between the current audio clip and the classification vector is larger than or equal to the second threshold, determining that the classification result of the current audio clip is a first classification result;
and if the distance between the current audio clip and the classification vector is smaller than the second threshold, determining that the classification result of the current audio clip is a second classification result.
Further, the updating the first noise probability according to the distance between the current audio piece and the classification vector and the classification result includes:
obtaining a classification result of M audio clips including the current audio clip, wherein M is a positive integer;
calculating noise ratios according to the classification results of the M audio segments;
acquiring the distance between a target audio clip and a classification vector, and calculating a distance average value according to the distances between all the target audio clips and the classification vector; the target audio clip refers to an audio clip of which the classification result is the first classification result in the M audio clips;
and if the noise ratio is greater than or equal to a third threshold and the distance average value is greater than or equal to a fourth threshold, determining that the current audio clip is a noise clip, and updating the first noise probability to be a second noise probability.
Further, the filtering, by using a filter, noise spectrums of all noise segments in the audio data, and outputting the target audio data includes:
acquiring the frequency spectrum of the current audio clip;
obtaining an iterative noise spectrum of a previous noise segment;
updating the frequency spectrum of the current audio segment to obtain a target noise frequency spectrum according to an iterative model, wherein the iterative model is as follows:
N_k(n) = N_{k-1}(n) + γ·Y_k(n);
where N_k(n) represents the target noise spectrum of the current noise segment, N_{k-1}(n) represents the iterative noise spectrum of the previous noise segment, γ represents the iteration factor, and Y_k(n) represents the spectrum of the current audio segment;
and filtering the target noise frequency spectrum of the current noise segment by using the filter, and outputting the target voice data.
Further, the filtering, by using the filter, the target noise spectrum of the current noise segment and outputting the target speech data includes:
according to a filtering model, filtering the target noise frequency spectrum to obtain an output frequency spectrum;
obtaining a current voice output segment through inverse Fourier transform according to the output frequency spectrum of the current audio segment; the current voice output segment comprises a first audio sampling point of the current frame after filtering and a second audio sampling point of the last frame after filtering;
and according to a second preset rule, superposing the second audio sampling points with a third preset number of the first audio sampling points to form the target voice data for outputting.
In a second aspect, an embodiment of the present application provides a noise reduction apparatus, including:
the first acquisition module is used for acquiring audio data, dividing the audio data into a plurality of audio segments according to a first preset rule and extracting the voice characteristics of each audio segment;
the second acquisition module is used for acquiring the first noise probability of the current audio clip;
the calculation module is used for calculating the distance between the current audio clip and a classification vector according to the voice features based on a classification model and determining a classification result if the first noise probability meets a test condition;
the updating module is used for updating the first noise probability according to the distance between the current audio clip and the classification vector and the classification result;
the determining module is used for determining that the current audio clip is a noise clip if the updated first noise probability is within a preset probability range;
and the output module is used for filtering the noise frequency spectrums of all the noise segments in the audio data by using a filter and outputting the target audio data.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor, when executing the computer program, implements the noise reduction method according to any one of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the noise reduction method according to any one of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the noise reduction method according to any one of the above first aspects.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Compared with the prior art, the embodiments of the present application have the following advantages: the audio data is segmented and the speech features of each audio segment are extracted; the probability that each audio segment is noise is obtained, so each segment can be preliminarily judged and the probability of misjudging burst noise is reduced; the audio segments are then reclassified by means of the noise probability and the classification model to further determine whether the current audio segment is a noise segment, further reducing the probability of misjudging burst noise; finally, the noise spectrum of the segments judged to be noise segments is filtered and the denoised target speech data is output, so burst noise can be effectively reduced or filtered out.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of an implementation of a noise reduction method provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of another implementation of a noise reduction method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of audio data samples of a noise reduction method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a coordinate expression of a speech feature calculation of a noise reduction method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of another implementation of a noise reduction method provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a noise reduction apparatus provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The noise reduction method provided by the embodiments of the present application can be applied to terminal devices with noise suppression, such as a noise suppression component or an echo canceller; the embodiments of the present application do not limit the specific type of terminal device. For example, the noise suppression component may be a noise-reducing headphone, a camera with audio, or any audio device having a noise reduction circuit or noise reduction function.
Fig. 1 shows a flowchart of an implementation of the noise reduction method provided by an embodiment of the present application, detailed as follows:
s101, audio data are obtained, the audio data are divided into a plurality of audio segments according to a first preset rule, and voice features of the audio segments are extracted.
In application, the audio data may be audio data internal to the noise reduction device itself or external audio data acquired by it. For example, audio data to be played may be cached in advance in the internal memory of the noise reduction device and acquired and processed directly, or the audio signal collected by a microphone (mic) in the noise reduction device may be processed.
In application, to maintain the continuity of the played audio data and reduce the processing difficulty, audio data of short duration can be collected each time and processed according to the first preset rule; for example, 1 s of audio data is collected each time. The first preset rule may be to divide the 1 s of audio data into segments of equal duration, for example 10 ms segments, i.e., each 1 s of collected audio data is divided into 100 segments of 10 ms, which reduces the amount of computation for processing the segment data each time.
In application, the above speech features include, but are not limited to, the fundamental frequency, harmonics, and mel-frequency cepstral coefficients of the sound. For example, the speech features may be calculated from the speech signal carried by the audio segment using a filter and a fast Fourier transform, without limitation.
S102, obtaining a first noise probability of the current audio clip.
In application, the first noise probability is the probability, from a preliminary judgment, that the current audio segment is noise. Specifically, it may be determined from the signal-to-noise ratio of the current audio segment, or from the power of its middle frequency bins. For example, when the first noise probability is determined from the signal-to-noise ratio, different signal-to-noise ratios can be mapped in advance to different noise probabilities: the higher the signal-to-noise ratio, the smaller the first noise probability with which the noise reduction apparatus judges the current audio segment to be a noise segment. Illustratively, a first noise probability of 0-0.5 indicates a high current signal-to-noise ratio, and the audio segment is preliminarily judged to be a non-noise segment; a first noise probability of 0.5-1 indicates a low signal-to-noise ratio, and the audio segment is preliminarily judged to be a noise segment.
S103, if the first noise probability meets the test condition, calculating the distance between the current audio clip and the classification vector according to the voice features based on a classification model, and determining a classification result.
In application, the test condition is used to check the first noise probability, i.e., to determine whether it meets a preset condition; for example, the test condition may be that the value is greater than or equal to 0 and less than or equal to 0.3. Normally, when burst noise occurs (such as the sound of a motor starting up), the signal-to-noise ratio calculated in the steps above is higher than the normal signal-to-noise ratio, so the burst noise is initially judged by the noise reduction apparatus to be a non-noise speech frame, i.e., its first noise probability falls within 0-0.3. If the first noise probability obtained for the current burst-noise segment is 0.2 (i.e., it is judged to have a small probability of being noise), the current audio segment is judged to satisfy the test condition; the distance between the current audio segment and the classification vector is then calculated from the speech features, and the classification result is determined anew. In this way, burst noise is not directly judged to be non-noise speech; instead, whether the current audio segment is noise is determined once more, so burst noise can be effectively reduced or filtered out.
In application, the classification model may be a support vector machine (SVM) model, and the classification vector may be a support vector of the SVM. The support vector machine maps the speech feature vector into a higher-dimensional space and processes the speech features in that space to obtain the distance between the speech features and the support vector. The support vectors are a group of vectors obtained by SVM training; for example, when a linear kernel function is used, the obtained support vectors satisfy the linear superposition principle. In other applications, the classification model may also be another classifier, such as a logistic regression model or a k-nearest neighbor (k-NN) classification algorithm, used in place of the support vector machine to classify the current audio segment. A logistic regression classifier behaves like the support vector machine: the distance between the current audio segment and the classification vector is calculated from the speech features and the classification result is determined. With k-NN, the classes of the k feature points of previously confirmed noise segments nearest to the speech features of the current audio segment are determined; if the correlation between the speech feature points of the current audio segment and those of a noise segment is greater than a certain threshold, the feature points of the current audio segment are considered to belong to noise, and the class occurring most frequently among the k points is taken as the class of the current audio segment.
In application, the classification result classifies the current audio segment according to its distance from the classification vector. Specifically, if the current audio segment is a noise segment it is regarded as a positive sample, and if it is a non-noise segment it is regarded as a negative sample. In determining positive and negative samples, the distance value may be set as follows: record in advance the sound of a rotating motor, including the noise produced by motors of different models during horizontal rotation, vertical rotation, zooming, and other operations; obtain the corresponding distances for these models; and set the distance value accordingly. If the distance value is set to 0, the classification result of the current audio segment is determined to be a positive sample when its distance from the classification vector is greater than 0 and, likewise, a negative sample when the distance is less than 0. The distance value may also, without limitation, be obtained by processing the sound data of other noises and using the resulting distance as the criterion, or be determined by partitioning the distances of several different noises.
And S104, updating the first noise probability according to the distance between the current audio clip and the classification vector and the classification result.
In application, the first noise probability is updated according to the distance between the current audio segment and the classification vector together with the classification result. Specifically, if the distance is greater than a preset threshold and the classification result is a positive sample, the first noise probability of 0.2 is updated to 1; alternatively, the probability of 0.2 is updated to 0.2 + x, where x may be a value preset by the user, and when 0.2 + x reaches 1 or exceeds 0.95 the current audio segment is determined to be a noise segment. For example, with the distance value set to 0 and a preset threshold of 1.5: if the distance is 2, the classification result of the current audio segment is a positive sample and the distance exceeds the preset threshold, so the first noise probability of 0.2 is updated to 1, i.e., the current audio segment is determined to be noise. If the distance is 1.2, the classification result is still a positive sample but the distance is below the preset threshold; the first noise probability of 0.2 may then be updated to 0.7, meaning the current audio segment is judged to have a high probability of being noise without being directly determined to be noise.
And S105, if the updated first noise probability is within the preset probability range, determining that the current audio clip is the noise clip.
In application, the preset probability range may be set to 0.95-1, which indicates that the current audio segment has a high probability of being a noise segment, and the first noise probability is updated after the current audio segment meets the condition. For example, the first noise probability is 0.2, and when the distance and the classification result meet the conditions, the current probability may be updated to be 1, that is, the current audio segment is determined to be a noise segment.
And S106, filtering the noise frequency spectrums of all the noise segments in the audio data by using a filter, and outputting target audio data.
In application, the current audio segment has its own frequency spectrum, which is either a noise spectrum or a normal speech spectrum; that is, when the current audio segment is judged to be a noise segment, its spectrum is a noise spectrum. The filter may be, without limitation, a Wiener filter, a low-pass filter, or the like; it filters the noise spectrum of the current noise segment, i.e., reduces or eliminates it, and the filtered spectrum is processed to obtain the target speech data, which is then output.
In this embodiment, the audio data is segmented and the speech features of each audio segment are extracted; the probability that each segment is noise is obtained, so each segment can be preliminarily judged and the probability of misjudging burst noise is reduced; the segments are then reclassified by means of the noise probability and the classification model to further determine whether the current audio segment is a noise segment, further reducing the probability of misjudging burst noise; finally, the noise spectrum of the segments determined to be noise segments is filtered and the denoised target speech data is output, so burst noise can be effectively reduced or filtered out.
Referring to fig. 2, in an embodiment, step S101 includes:
s201, performing framing processing on the audio data according to first preset time; wherein each audio clip comprises a first preset number of audio sample points.
In application, after the audio data is received, it can first be cached in a buffer queue of the noise reduction device, and the audio data within the first preset time in the buffer queue is then collected for processing. For example, 10 ms of data is taken from the buffer queue each time for processing; at an audio sampling rate of 8000 Hz, 80 audio samples can be processed each time (every 10 ms), i.e., each audio segment is considered to include 80 (the first preset number of) audio samples. Note that the first preset number is determined by the sampling rate of the audio data and is not limited here.
S202, windowing is carried out on the audio data according to the audio sampling points of the current frame and the second preset number of audio sampling points of the previous frame, and a current audio segment is obtained.
In application, the current frame contains 80 audio samples, the previous frame is the preceding frame of audio data (the audio samples of the previous audio segment), and the second preset number is a number set by the user. Specifically, after the 80 audio samples of the current frame are obtained, they are combined with the 48 samples remaining at the end of the previously processed speech frame into 128 samples, which can temporarily be treated as one frame; windowing this frame yields a current audio segment of 128 audio samples, as shown in fig. 3.
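As an illustration of this buffering and windowing scheme, the following is a minimal sketch in Python, assuming an 8000 Hz sampling rate, 80 fresh samples per 10 ms frame, and 48 carried-over samples; the Hann window is an assumption, since the embodiment does not name a specific window function.

```python
import numpy as np

FRAME_NEW = 80       # fresh samples per 10 ms at 8000 Hz
FRAME_KEEP = 48      # samples carried over from the previous frame
FRAME_LEN = FRAME_NEW + FRAME_KEEP   # 128-sample analysis frame

window = np.hanning(FRAME_LEN)       # window choice is an assumption

def frames_from_stream(samples):
    """Yield windowed 128-sample frames from a 1-D array of audio samples."""
    tail = np.zeros(FRAME_KEEP)      # the 48 trailing samples of the previous frame
    for start in range(0, len(samples) - FRAME_NEW + 1, FRAME_NEW):
        fresh = samples[start:start + FRAME_NEW]
        frame = np.concatenate([tail, fresh])   # 48 old + 80 new = 128 samples
        tail = fresh[-FRAME_KEEP:]              # keep the last 48 for the next frame
        yield frame * window

# usage: one second of 8000 Hz audio -> 100 frames of 10 ms each
audio = np.random.randn(8000)
frames = list(frames_from_stream(audio))
print(len(frames), frames[0].shape)  # 100 (128,)
```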
S203, carrying out Fourier transform on the current audio clip to obtain the frequency spectrum of the current audio clip with preset dimensionality.
In application, the spectrum with the preset dimensionality can be obtained by performing a fast Fourier transform (FFT) on the current audio segment. Specifically, the spectrum may be denoted Y_k(n), and the fast Fourier transform is formulated as follows:
Y_k(n) = Σ_{t=0}^{N−1} x(t) · e^(−j·2πnt/N)
where x(t), t = 0, 1, …, 127, represents the current time-domain audio signal, N = 128 is the total number of sample points, and j denotes the imaginary unit. Since the FFT of a real signal is conjugate-symmetric, i.e., Y_k(128−n) = Y_k*(n), it suffices to keep the first 65 dimensions; the values actually calculated are n = 0, 1, 2, …, 64, 65 dimensions in total.
Note that in other applications, at an audio sampling rate of 16000 Hz, 160 points are processed each time, the data of each frame may be 256 points (combined with the 96 samples remaining from the previous processing), and a 129-dimensional spectrum is obtained after the FFT. The values 80, 128, 65 and 160, 256, 129 are not fixed and can be adjusted according to the actual situation; they are not limited here.
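A minimal sketch of this step, using the fact that the FFT of a real 128-sample frame is fully described by its first 65 bins (numpy's rfft returns exactly those):

```python
import numpy as np

def frame_spectrum(frame):
    """Compute the 65-bin spectrum Y_k(n) of a windowed 128-sample frame.

    For real-valued input the FFT is conjugate-symmetric, so only
    bins n = 0..64 need to be kept; np.fft.rfft returns exactly those.
    """
    assert len(frame) == 128
    return np.fft.rfft(frame)        # complex array of shape (65,)

frame = np.hanning(128) * np.random.randn(128)
Y = frame_spectrum(frame)
print(Y.shape)  # (65,)
```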
And S204, extracting voice characteristics according to the frequency spectrum of the current audio clip.
In application, after the spectrum of the current audio segment is obtained, the speech features can be extracted from it. Specifically, the spectrum is passed through a mel filter bank to obtain the mel spectrum; cepstral analysis is then performed on the mel spectrum to obtain the mel-frequency cepstral coefficients (MFCCs), which are the features of the current audio segment.
For example:
1. Take the absolute value of Y_k(n) to obtain |Y_k(n)|.
2. Pass the magnitude spectrum |Y_k(n)| through a bank of M mel filters to obtain the mel spectrum, M values in total. Typically M = 20 at an 8000 Hz sampling rate and M = 40 at 16000 Hz:
Z_k(m) = Σ_{n=0}^{64} |Y_k(n)| · H_m(n)
where m = 0, 1, 2, …, M−1 and H_m(n) is the m-th mel filter, a triangular filter:
H_m(n) = 0 for n < f(m−1);
H_m(n) = (n − f(m−1)) / (f(m) − f(m−1)) for f(m−1) ≤ n ≤ f(m);
H_m(n) = (f(m+1) − n) / (f(m+1) − f(m)) for f(m) < n ≤ f(m+1);
H_m(n) = 0 for n > f(m+1).
Fig. 4 shows the coordinate representation of the mel filters H_m(n).
Here f(m) is the m-th mel frequency, calculated as follows. Taking an 8000 Hz sampling rate as an example, the cut-off frequency is 4000 Hz. According to the formula
f_mel = 2595 · log10(1 + f / 700),
the maximum mel frequency f_max = 2146.1 mel can be obtained. When M = 20, the mel-frequency spacing is
Δ = f_max / (M + 1) = 2146.1 / 21 ≈ 102.2 mel.
Since the mel frequencies are equally spaced, mel(f(m)) = mel(f(0)) + m·Δ, and f(m) can be calculated in reverse by inverting the mel formula.
3. Take the logarithm of the mel spectrum: U_k(m) = log10(Z_k(m)).
4. Apply the discrete cosine transform (DCT) to U_k(m) and retain the first 13 coefficients, which carry the most energy:
MFCC_k(p) = Σ_{m=0}^{M−1} U_k(m) · cos(π·p·(m + 0.5) / M)
where p = 0, 1, 2, …, 12.
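The four steps above can be sketched as follows. This is an illustrative Python rendering assuming the standard mel mapping f_mel = 2595·log10(1 + f/700) and the triangular filters reconstructed above, not the exact filter coefficients of the embodiment:

```python
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M=20, n_bins=65, fs=8000):
    """Triangular mel filters H_m(n) over the 65 FFT bins (8000 Hz case)."""
    edges_mel = np.linspace(0.0, mel(fs / 2.0), M + 2)   # M+2 equally spaced mel points
    edges_bin = np.floor(mel_inv(edges_mel) / (fs / 2.0) * (n_bins - 1)).astype(int)
    H = np.zeros((M, n_bins))
    for m in range(M):
        lo, mid, hi = edges_bin[m], edges_bin[m + 1], edges_bin[m + 2]
        for n in range(lo, mid):
            H[m, n] = (n - lo) / max(mid - lo, 1)        # rising edge
        for n in range(mid, hi):
            H[m, n] = (hi - n) / max(hi - mid, 1)        # falling edge
    return H

def mfcc(Y, H, n_coef=13):
    """Steps 1-4: magnitude -> mel spectrum -> log -> DCT, keep 13 coefficients."""
    Z = H @ np.abs(Y)                                    # mel spectrum Z_k(m)
    U = np.log10(np.maximum(Z, 1e-10))                   # floor avoids log of zero
    M = len(U)
    p = np.arange(n_coef)[:, None]
    m = np.arange(M)[None, :]
    return np.cos(np.pi * p * (m + 0.5) / M) @ U         # first 13 DCT coefficients

H = mel_filterbank()
Y = np.fft.rfft(np.hanning(128) * np.random.randn(128))
print(mfcc(Y, H).shape)  # (13,)
```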
In this embodiment, by taking short frames of audio data and processing each one together with the second preset number of audio samples from the previous frame, the spectral difference and spectral discontinuity between the current audio segment and the previous one are reduced, so that after the audio segments are denoised the output sound changes slowly rather than abruptly, avoiding frame loss.
Referring to fig. 5, in an embodiment, step S102 includes:
s301, all audio clips in a second preset time period are obtained.
In application, the second preset time may be a time period set by the user or one set inside the system, without limitation. Specifically, the second preset time period may be the 50 ms preceding the current audio segment; since audio data is collected every 10 ms and processed with superposition, this can be regarded as counting the 5 audio segments before the current one. In other applications, all audio segments within the second preset time period may also include the current audio segment, which is not limited here.
S302, calculating the voice energy of each audio clip in a second preset time period, and acquiring target energy according to the voice energy of each audio clip.
In application, taking a noise-reducing headphone as an example, the headphone receives the audio signal through a microphone (MIC) and performs analog-to-digital conversion through its audio circuit, converting the analog signal into a digital signal. The digital value at each audio sample can be regarded as the speech energy of that sample, so the speech energy of each audio segment can be calculated as the sum of the squares of the values of its 128 audio samples. The target energy may be taken as the average of the collected speech energies, or as the median of the speech energies of the several audio segments. For example, if the second preset time period contains 5 speech energies, arrange the 5 values in descending order and take the middle one as the target energy. Various methods of obtaining the target energy from several speech energies can be implemented and are not limited to the above.
S303, calculating the ratio of the voice energy of the current audio clip to the target energy, and acquiring the corresponding first noise probability according to the ratio.
In application, different noise reduction algorithms estimate the noise probability differently, but the judgment is basically based on the signal-to-noise ratio: the larger the signal-to-noise ratio, the smaller the calculated noise probability and the larger the speech probability. The signal-to-noise ratio may be estimated as the ratio of the speech energy of the current audio segment to the target energy, and the first noise probability corresponding to each ratio may be preset and stored in the internal storage of the noise reduction device. For example, when the ratio is smaller than 1, the audio segment is preliminarily judged to be a noise segment by the signal-to-noise criterion and the first noise probability finally calculated is larger, e.g., between 0.5 and 1; when the ratio is greater than 1, the current audio segment is preliminarily judged to be a speech frame and the noise probability finally calculated is smaller, e.g., 0-0.3 or 0-0.5, without limitation.
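A minimal sketch of steps S301-S303, assuming the median is used as the target energy; the exact ratio-to-probability mapping below is an illustrative assumption, since the embodiment only requires that higher ratios give lower noise probabilities:

```python
import numpy as np

def first_noise_probability(segments, current):
    """Map the energy ratio of the current segment to a first noise probability.

    `segments` holds the previous 5 audio segments (128 samples each); the
    target energy is their median energy, one of the options in the text.
    """
    energies = [float(np.sum(s ** 2)) for s in segments]
    target = float(np.median(energies))
    current_energy = float(np.sum(current ** 2))
    ratio = current_energy / max(target, 1e-12)   # estimated signal-to-noise ratio
    if ratio > 1.0:       # high SNR: likely speech, low noise probability (0-0.3)
        return min(0.3, 0.3 / ratio)
    else:                 # low SNR: likely noise (0.5-1)
        return 0.5 + 0.5 * (1.0 - ratio)

prev = [np.random.randn(128) for _ in range(5)]
print(first_noise_probability(prev, 0.1 * np.random.randn(128)))
```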
In this embodiment, the preliminary judgment of each audio segment preliminarily reduces the probability that a sudden sound is misjudged.
In an embodiment, the step S103 includes:
and if the first noise probability is larger than a first threshold value, inputting the voice features into a classification model, and obtaining the distance between the current audio clip and a classification vector.
In application, if the first noise probability is greater than the first threshold, the speech features are input into the classification model. Specifically, the first threshold may be a threshold biased toward the speech probability; if it is set to 0.05, then when the first noise probability is initially within 0-0.3, the noise reduction apparatus judges that the current audio segment is more likely to be speech. It then checks whether the current first noise probability is greater than 0.05; if so, the classification model is used to obtain the distance between the current audio segment and the classification vector, the classification result is determined from the distance, and whether the current audio segment is a speech or noise segment is judged again.
In other embodiments, if the first noise probability is smaller than the first threshold, the speech features are input into the classification model and the distance between the current audio segment and the classification vector is obtained. Specifically, in the case of motor rotation, the sudden motor sound is in most cases louder than the environmental noise, and it is short and abrupt. When the motor rotates, because the volume is high, the signal-to-noise ratio calculated according to step S102 is high, so a traditional noise reduction algorithm computes a high speech probability and a low noise probability, and the noise reduction is weak. When the first noise probability is smaller than 0.3, i.e., within 0-0.3, the classification model is used to re-check the current audio segment that was given the higher speech probability, which reduces the probability that the sudden sound is misjudged.
In application, the classification model is a support vector machine model, and specifically includes:
y = Σ_{i=1}^{N} w(i) · x(i) + b
where y is the distance between the current audio segment and the support vector, x(i) is the i-th feature dimension of the current audio segment, N is the number of speech features, and w(i) and b are specified coefficients. The value of y is calculated from the speech features of step S101, specifically from the Y_k(n) and MFCC features obtained above:
1) Calculate the difference values of the MFCC features from the two preceding frames.
The MFCCs are the computed 13-dimensional features MFCC_k(p), p = 0, 1, …, 12;
the feature values of the previous frame and of the frame before it are MFCC_{k−1}(p), p = 0, 1, …, 12 and MFCC_{k−2}(p), p = 0, 1, …, 12, respectively.
First-order difference: DM_k(p) = MFCC_k(p) − MFCC_{k−1}(p), p = 0, 1, …, 12, 13 dimensions in total;
Second-order difference: DDM_k(p) = MFCC_k(p) − 2·MFCC_{k−1}(p) + MFCC_{k−2}(p), p = 0, 1, …, 12, 13 dimensions in total.
Combining all the features gives 13 + 13 + 13 = 39 dimensions; the number of MFCCs used can be adjusted according to the computing resources and required precision and is not limited here. The subscript k denotes the k-th frame, i.e., the k-th audio segment.
2) Normalize the computed 39-dimensional features by subtracting the mean and dividing by the standard deviation; all normalized features are obtained and stored, forming the mean mean(n) and standard deviation std(n) of the corresponding dimensions, in order to meet the processing requirements of the SVM.
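A minimal sketch of steps 1) and 2), assuming mean and std were collected beforehand over training data:

```python
import numpy as np

def feature_vector(mfcc_k, mfcc_k1, mfcc_k2, mean, std):
    """Build the normalized 39-dimensional SVM input for frame k.

    mfcc_k, mfcc_k1, mfcc_k2: 13-dim MFCCs of frames k, k-1, k-2.
    mean, std: per-dimension statistics collected over the training set.
    """
    dm = mfcc_k - mfcc_k1                       # first-order difference DM_k(p)
    ddm = mfcc_k - 2.0 * mfcc_k1 + mfcc_k2      # second-order difference DDM_k(p)
    x = np.concatenate([mfcc_k, dm, ddm])       # 13 + 13 + 13 = 39 dimensions
    return (x - mean) / np.maximum(std, 1e-12)  # z-score normalization

mean, std = np.zeros(39), np.ones(39)           # placeholder statistics
x = feature_vector(np.random.randn(13), np.random.randn(13),
                   np.random.randn(13), mean, std)
print(x.shape)  # (39,)
```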
In other applications, when the classification model is a support vector machine model, the SVM may also use an RBF kernel function or a linear kernel function to process the speech features, which is not limited here. To reduce the computing resources occupied on the noise reduction device, this embodiment adopts an SVM classifier with a linear kernel function, whose prediction result can be expressed as:
y = Σ_{i=1}^{N} w(i) · x(i) + b
where x(i) is the i-th of the 39 feature dimensions, N is the total number of features, and w(i) and b are the SVM coefficients, i.e., the optimal values found in advance during training with the linear kernel function. For example, audio segments of positive samples and of negative samples are determined in advance such that all positive samples satisfy
Σ_{i=1}^{N} w(i) · x(i) + b > 0
and all negative samples satisfy
Σ_{i=1}^{N} w(i) · x(i) + b < 0.
That is, when a linear kernel function is used, the obtained support vectors satisfy the linear superposition principle, so they can be summed into a single set of weights w(i), i = 1…N; during prediction the features can be multiplied directly by the summed weights, saving a large amount of computation.
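A minimal sketch of this prediction step; the placeholder w and b stand in for coefficients that would come from SVM training:

```python
import numpy as np

def svm_distance(x, w, b):
    """Signed distance y = sum_i w(i) * x(i) + b for a 39-dim feature vector.

    With a linear kernel the support vectors collapse into a single weight
    vector w, so prediction is one dot product per frame.
    """
    return float(np.dot(w, x) + b)

def classify(x, w, b, second_threshold=0.0):
    """Positive sample (first classification result) iff y >= threshold."""
    y = svm_distance(x, w, b)
    return y, y >= second_threshold

w, b = np.random.randn(39), 0.0   # placeholders; real w, b come from training
y, is_noise = classify(np.random.randn(39), w, b)
print(y, is_noise)
```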
And judging whether the distance between the current audio fragment and the classification vector is larger than a second threshold value.
And if the distance between the current audio clip and the classification vector is larger than or equal to the second threshold, determining that the classification result of the current audio clip is a first classification result.
And if the distance between the current audio clip and the classification vector is smaller than the second threshold, determining that the classification result of the current audio clip is a second classification result.
In application, the second threshold may be a value preset inside the noise reduction apparatus for comparison with the distance between the current audio segment and the classification vector. The first classification result determines the current audio segment to be a positive sample; the second classification result determines it to be a negative sample. Specifically, y > 0 may be set to represent positive samples and y < 0 to represent negative samples.
In this embodiment, reusing the FFT and MFCC results for classification saves computation when these functions coexist with others, and the linear kernel function keeps the computing resources used in the prediction stage small.
In one embodiment, step S104 includes:
and obtaining the classification result of M audio segments including the current audio segment, wherein M is a positive integer.
And calculating the noise ratio according to the classification result of the M audio segments.
In application, the M audio segments may be several audio segments including the current one, or several audio segments before the current one (excluding it), which is not limited here. Specifically, this application uses 5 audio segments including the current audio segment and obtains the classification results of those 5 segments.
In application, the noise ratio is determined by the classification result. Specifically, in the classification results of 5 audio segments, the number of the first classification results is counted, and if the number of the first classification results is 3, the noise ratio is 0.6.
Acquiring the distance between a target audio clip and a classification vector, and calculating a distance average value according to the distances between all the target audio clips and the classification vector; the target audio clip refers to an audio clip of which the classification result is the first classification result in the M audio clips.
And if the noise ratio is greater than or equal to a third threshold and the distance average value is greater than or equal to a fourth threshold, determining that the current audio clip is a noise clip, and updating the first noise probability to be a second noise probability.
In application, the target audio segments are those segments whose classification result is the first classification result. Specifically, if 3 of the 5 audio segments have the first classification result, the distances of those 3 segments are obtained and averaged, and it is judged whether the average distance exceeds the fourth threshold.
In application, the third threshold and the fourth threshold are both preset by the user inside the noise reduction apparatus. Specifically, the third threshold may be 3 out of 5 (a noise ratio of 0.6) and the fourth threshold may be 1; that is, if among the 5 audio segments at least 3 target segments have the first classification result and the average distance obtained is greater than or equal to 1, the current audio segment is determined to be a noise segment.
In application, the second noise probability is the probability with which the current audio segment is determined to be a noise segment, i.e., it may be taken as 1. If the preliminarily determined first noise probability is 0.2 and, in the subsequent calculation, the noise ratio is greater than or equal to the third threshold and the distance average is greater than or equal to the fourth threshold, the first noise probability of 0.2 may be updated to a second noise probability of 1, which determines the current audio segment to be a noise segment.
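A minimal sketch of this update step, with M = 5 and the thresholds of the example above (at least 3 of 5 positive results and a mean distance of the positive frames of at least 1); the exact values are assumptions taken from the example:

```python
from collections import deque

def update_noise_probability(history, p1, third_threshold=0.6, fourth_threshold=1.0):
    """Re-check a low first noise probability against the last M = 5 frames.

    `history` holds (is_positive, distance) pairs for the 5 most recent
    frames, the current one included.
    """
    positives = [d for is_pos, d in history if is_pos]
    noise_ratio = len(positives) / len(history)     # e.g. 3 of 5 -> 0.6
    if positives and noise_ratio >= third_threshold:
        avg_distance = sum(positives) / len(positives)
        if avg_distance >= fourth_threshold:
            return 1.0       # second noise probability: segment is noise
    return p1                # keep the preliminary probability

history = deque([(True, 2.0), (True, 1.4), (False, -0.3),
                 (True, 0.9), (True, 1.8)], maxlen=5)
print(update_noise_probability(history, p1=0.2))  # 1.0
```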
In this embodiment, the classification result of the current audio segment is determined by combining the classification results of the previous frames with the distance average, which improves the classification precision of the current audio segment and allows noise to be suppressed accurately.
In one embodiment, step S106 includes:
acquiring the frequency spectrum of the current audio clip;
obtaining an iterative noise spectrum of a previous noise segment;
updating the frequency spectrum of the current audio segment to obtain a target noise frequency spectrum according to an iterative model, wherein the iterative model is as follows:
N_k(n) = N_{k-1}(n) + γ·Y_k(n);
where N_k(n) represents the target noise spectrum of the current noise segment, N_{k-1}(n) represents the iterative noise spectrum of the previous noise segment, γ represents the iteration factor, and Y_k(n) represents the spectrum of the current audio segment;
and filtering the target noise frequency spectrum of the current noise segment by using the filter, and outputting the target voice data.
In application, the spectrum of the current audio segment can be obtained by Fourier-transforming the current audio segment; the formula is given above and is not repeated. The iterative noise spectrum of the previous noise segment is obtained through the iterative model. Specifically, when k = 1, the iterative noise spectrum of the previous noise segment is 0 and the current iterative noise spectrum is the spectrum of the current audio segment multiplied by the iteration factor, i.e., N_1(n) = γ·Y_1(n); when k = 2, the current target noise spectrum is N_2(n) = N_1(n) + γ·Y_2(n).
In this embodiment, the target noise spectrum is formed by iterating the original noise spectrum: each iteration multiplies the current spectrum by the iteration factor and adds it to the previous iterative noise spectrum, which prevents sudden changes from making the noise-segment estimate inaccurate. Because environmental noise changes slowly in most cases, if the spectrum of the current audio segment were taken directly as the noise spectrum, the noise spectrum calculated for each frame would differ and the degree of noise removal would differ from frame to frame, making the continuously output speech data unstable; moreover, the spectrum of the current segment may deviate from the actual spectrum because of accidental factors, giving a poor noise-canceling effect. Iterating the noise spectrum reduces the error between the spectrum of the current audio segment and the actual spectrum and makes the continuously output speech data more stable.
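A minimal sketch of the iterative update, implementing N_k(n) = N_{k-1}(n) + γ·Y_k(n) as written; the value of the iteration factor γ and the use of the magnitude of Y_k(n) are assumptions, since the embodiment fixes neither:

```python
import numpy as np

GAMMA = 0.05   # iteration factor; the embodiment does not fix a value

class NoiseSpectrum:
    """Iterative noise estimate N_k(n) = N_{k-1}(n) + gamma * Y_k(n)."""

    def __init__(self, n_bins=65):
        self.N = np.zeros(n_bins)    # N_0(n) = 0 before any noise frame

    def update(self, Y):
        """Fold the spectrum of a frame judged to be noise into the estimate."""
        self.N = self.N + GAMMA * np.abs(Y)   # magnitude use is an assumption
        return self.N

est = NoiseSpectrum()
Y = np.fft.rfft(np.hanning(128) * np.random.randn(128))
print(est.update(Y)[:3])
```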
In other applications, this embodiment may also perform noise reduction on an audio segment judged to be a speech segment. Before noise reduction, the spectrum of the audio segment satisfies:
when someone is talking: Y_k(n) = N_k(n) + E_k(n), where E_k(n) denotes the human voice spectrum;
when no one is talking: Y_k(n) = N_k(n).
After noise reduction, the spectrum of the audio segment satisfies:
when someone is talking: S_k(n) ≈ E_k(n);
when no one is talking: S_k(n) ≈ 0.
In this embodiment, a good noise reduction effect can be further achieved by performing noise reduction on the audio segment determined as the voice, so that the user experience is improved.
In an embodiment, the filtering, with the filter, a target noise spectrum of a current noise segment and outputting the target speech data includes:
and filtering the target noise frequency spectrum according to a filtering model to obtain an output frequency spectrum.
Obtaining a current voice output segment through inverse Fourier transform according to the output frequency spectrum of the current audio segment; the current voice output segment comprises a first audio sampling point of the current frame after filtering and a second audio sampling point of the last frame after filtering.
And according to a second preset rule, superposing the second audio sampling points with a third preset number of the first audio sampling points to form the target voice data for outputting.
In application, the filtering model includes, but is not limited to, a Wiener filtering model or a particle filtering model. This embodiment adopts a Wiener filtering model, specifically:
S_k(n) = ((|Y_k(n)|² − N_k(n)²) / |Y_k(n)|²) · Y_k(n)
where S_k(n) is the output spectrum of the current speech segment, N_k(n) is the target noise spectrum of the current noise segment, and Y_k(n) is the spectrum of the current speech segment.
In application, S_k(n) is the final output spectrum of the current audio segment, and the current audio output segment is obtained from it by inverse Fourier transform; that is, the output segment is likewise composed of 128 audio samples, of which 80 are the audio data of the current frame and 48 the audio data of the previous frame. The method first collects 10 ms of data (80 audio samples), combines them with the 48 audio samples of the previous frame, and performs the initial processing (Fourier transform) to obtain the spectrum Y_k(n) of the current audio segment; the spectrum then goes through the detection, classification, and noise reduction processes to yield the output spectrum S_k(n) of the current audio segment; finally, S_k(n) is inverse-Fourier-transformed to recover audio samples (the filtered first audio samples of the current frame and the filtered second audio samples of the previous frame).
In application, the second audio samples are superimposed with a third preset number of the first audio samples according to the second preset rule to form the target speech data for output. Specifically, there are 48 second audio samples; as described in step S201, 10 ms of audio data (80 samples) is fetched from the buffer queue each time, and to keep the input and output audio data consistent, 32 of the current 80 audio samples are superimposed with the 48 samples of the previous frame to form the target audio data for output, while the remaining 48 audio samples are reserved to be superimposed at the next output. Of the current 80 audio samples, the trailing 48 are superimposed onto the next frame, and the first 32 are the samples that follow the previous frame continuously in time during actual sequential playback, as shown in fig. 3.
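A minimal end-to-end sketch of this output path for one frame: apply a noise-removing gain, inverse-transform, and overlap-add. The spectral-subtraction-style gain below is an illustrative stand-in for the embodiment's Wiener filtering model:

```python
import numpy as np

def denoise_frame(Y, N, out_tail):
    """Filter one 65-bin frame and overlap-add it into an 80-sample output.

    Y: complex spectrum of the current 128-sample frame.
    N: current noise magnitude estimate (65 bins).
    out_tail: the 48 trailing filtered samples kept from the previous frame.
    Returns (out, new_tail): 80 output samples and the 48 samples to keep.
    """
    gain = np.maximum(1.0 - N / np.maximum(np.abs(Y), 1e-12), 0.0)
    S = gain * Y                       # output spectrum S_k(n)
    s = np.fft.irfft(S, n=128)         # back to 128 time-domain samples
    out = s[:80].copy()
    out[:48] += out_tail               # overlap-add with the previous frame
    new_tail = s[80:]                  # reserve the last 48 samples
    return out, new_tail

tail = np.zeros(48)
Y = np.fft.rfft(np.hanning(128) * np.random.randn(128))
out, tail = denoise_frame(Y, np.zeros(65), tail)
print(out.shape, tail.shape)  # (80,) (48,)
```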
In this embodiment, the audio data is segmented and the speech features of each audio segment are extracted; the probability that each segment is noise is obtained, so each segment can be preliminarily judged and the probability of misjudging burst noise is reduced; the segments are then reclassified by means of the noise probability and the classification model to further determine whether the current audio segment is a noise segment, further reducing the probability of misjudging burst noise; finally, the noise spectrum of the segments determined to be noise segments is filtered and the denoised target speech data is output, so burst noise can be effectively reduced or filtered out.
As shown in fig. 6, the present embodiment also provides a noise reduction apparatus 100, including:
The first obtaining module 10 is configured to obtain audio data, divide the audio data into a plurality of audio segments according to a first preset rule, and extract the voice features of each audio segment.
The second obtaining module 20 is configured to obtain a first noise probability of the current audio segment.
The calculating module 30 is configured to, if the first noise probability satisfies a test condition, calculate a distance between the current audio segment and a classification vector according to the voice features based on a classification model, and determine a classification result.
The updating module 40 is configured to update the first noise probability according to the distance between the current audio segment and the classification vector and the classification result.
The determining module 50 is configured to determine that the current audio segment is a noise segment if the updated first noise probability is within a preset probability range.
The output module 60 is configured to filter, with a filter, the noise spectra of all noise segments in the audio data and output the target voice data. An illustrative wiring of these six modules is sketched below.
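Before each module is detailed, here is a hedged sketch of how the six modules might be chained per audio segment; every callable name and the form of the test condition (a simple threshold) are assumptions made for illustration, not the patent's implementation.

```python
def denoise_stream(segments, modules, test_threshold, prob_range):
    """Illustrative wiring of the six modules described above; `modules`
    bundles the per-module callables, all of whose names are assumed."""
    for seg in segments:
        feats = modules.extract_features(seg)            # first obtaining module 10
        p_noise = modules.first_noise_probability(seg)   # second obtaining module 20
        if p_noise > test_threshold:                     # test condition (assumed: a threshold)
            dist, label = modules.classify(feats)        # calculating module 30
            p_noise = modules.update_probability(p_noise, dist, label)  # updating module 40
        is_noise = prob_range[0] <= p_noise <= prob_range[1]            # determining module 50
        yield modules.filter_segment(seg, is_noise)      # output module 60
```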
In an embodiment, the first obtaining module 10 is further configured to:
performing framing processing on the audio data according to a first preset time, wherein each audio segment comprises a first preset number of audio sampling points;
windowing the audio data according to the audio sampling points of the current frame and a second preset number of audio sampling points of the previous frame to obtain a current audio segment;
performing a Fourier transform on the current audio segment to obtain a spectrum of the current audio segment with a preset dimensionality;
and extracting the voice features according to the spectrum of the current audio segment. One possible feature extraction is sketched below.
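This excerpt derives the voice features from each segment's spectrum but does not name them; the sketch below assumes log sub-band energies as a plausible stand-in, with the band count left as a free parameter.

```python
import numpy as np

def extract_features(frame_128, n_bands=22):
    """Hedged sketch: log sub-band energies computed from the segment
    spectrum; the actual feature set is not specified in this excerpt."""
    # 128-point windowed segment -> 65 spectral bins (the "preset dimensionality")
    spectrum = np.fft.rfft(frame_128 * np.hanning(len(frame_128)))
    power = np.abs(spectrum) ** 2
    bands = np.array_split(power, n_bands)   # coarse sub-band grouping (assumption)
    return np.log1p(np.array([band.sum() for band in bands]))
```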
In an embodiment, the second obtaining module 20 is further configured to:
acquiring all audio segments within a second preset time period;
calculating the voice energy of each audio segment within the second preset time period, and obtaining a target energy according to the voice energies of the audio segments;
and calculating the ratio of the voice energy of the current audio segment to the target energy, and obtaining the corresponding first noise probability according to the ratio. One possible form of this computation is sketched below.
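The mapping from the energy ratio to the first noise probability is not given here; the sketch below assumes the target energy is the maximum voice energy over the second preset time period and uses a clipped linear map, so quieter segments receive a higher noise probability. Both choices are assumptions.

```python
import numpy as np

def first_noise_probability(seg_energy, recent_energies):
    """Hedged sketch: `recent_energies` holds per-segment voice energies
    over the second preset time period; max() as the target energy and
    the linear map are assumptions."""
    target_energy = max(recent_energies)
    ratio = seg_energy / (target_energy + 1e-12)   # guard against division by zero
    return float(np.clip(1.0 - ratio, 0.0, 1.0))   # low-energy segment -> high noise probability
```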
In one embodiment, the calculation module 30 is further configured to:
if the first noise probability is greater than a first threshold, inputting the voice features into the classification model to obtain the distance between the current audio segment and the classification vector;
judging whether the distance between the current audio segment and the classification vector is greater than or equal to a second threshold;
if the distance between the current audio segment and the classification vector is greater than or equal to the second threshold, determining that the classification result of the current audio segment is a first classification result;
and if the distance between the current audio segment and the classification vector is less than the second threshold, determining that the classification result of the current audio segment is a second classification result. One possible form of this classification step is sketched below.
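The classification model itself is not specified in this excerpt (one of the cited references, CN109766929A, is an SVM-based audio classifier), so the sketch below assumes a linear decision function, under which the "distance between the current audio segment and the classification vector" is a signed margin.

```python
import numpy as np

def classify(features, w, b, second_threshold):
    """Hedged sketch: a linear (SVM-like) decision function is assumed;
    w and b would come from offline training on noise vs. voice segments."""
    distance = float(np.dot(w, features) + b)
    # first classification result ~ noise-like, second ~ voice-like
    label = "first" if distance >= second_threshold else "second"
    return distance, label
```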
In one embodiment, the update module 40 is further configured to:
obtaining the classification results of M audio segments including the current audio segment, wherein M is a positive integer;
calculating a noise ratio according to the classification results of the M audio segments;
acquiring the distance between each target audio segment and the classification vector, and calculating a distance average according to the distances between all target audio segments and the classification vector, wherein a target audio segment refers to an audio segment, among the M audio segments, whose classification result is the first classification result;
and if the noise ratio is greater than or equal to a third threshold and the distance average is greater than or equal to a fourth threshold, determining that the current audio segment is a noise segment and updating the first noise probability to a second noise probability. A sketch of this update rule follows.
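A minimal sketch of that update rule over a sliding window of M segments; the parameter names and the handling of an empty distance list are assumptions made for illustration.

```python
import numpy as np

def update_probability(p_first, results, distances,
                       third_threshold, fourth_threshold, p_second):
    """Hedged sketch: `results` are the classification labels of the M
    segments ending at the current one; `distances` are the margins of
    the segments labeled with the first classification result."""
    noise_ratio = results.count("first") / len(results)
    dist_mean = float(np.mean(distances)) if distances else 0.0
    if noise_ratio >= third_threshold and dist_mean >= fourth_threshold:
        return p_second    # confirmed noise segment: switch to the second noise probability
    return p_first         # otherwise keep the first noise probability
```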
In one embodiment, the output module 60 is further configured to:
acquiring the spectrum of the current audio segment;
obtaining the iterative noise spectrum of the previous noise segment;
and updating the spectrum of the current audio segment according to an iterative model to obtain a target noise spectrum, the iterative model being:
Nk(n)=Nk-1(n)+γYk(n);
where Nk(n) represents the target noise spectrum of the current noise segment, Nk-1(n) represents the iterative noise spectrum of the previous noise segment, γ represents the iteration factor, and Yk(n) represents the spectrum of the current audio segment;
and filtering, with the filter, the target noise spectrum of the current noise segment and outputting the target voice data. A sketch of the iterative update and one possible filtering model follows.
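The iterative model above can be transcribed directly; the filtering model, by contrast, is not given in this excerpt, so the gain below is a Wiener-style stand-in with a spectral floor, an assumption rather than the patent's formula. The value of γ is likewise a placeholder.

```python
import numpy as np

def update_noise_spectrum(noise_prev, Y, gamma=0.1):
    """N_k(n) = N_{k-1}(n) + gamma * Y_k(n), applied to magnitudes;
    gamma's value is not given in this excerpt, 0.1 is a placeholder."""
    return noise_prev + gamma * np.abs(Y)

def filter_gain(Y, noise_est, floor=0.05):
    """Hedged stand-in for the unspecified filtering model: a Wiener-style
    per-bin gain limited below by a spectral floor."""
    power = np.abs(Y) ** 2
    return np.maximum(1.0 - (noise_est ** 2) / (power + 1e-12), floor)
```

The gain returned by filter_gain would play the role of the gain argument in the overlap-add sketch shown earlier.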
In one embodiment, the output module 60 is further configured to:
filtering the target noise spectrum according to a filtering model to obtain an output spectrum;
obtaining a current voice output segment through an inverse Fourier transform according to the output spectrum of the current audio segment, wherein the current voice output segment comprises the filtered first audio sampling points of the current frame and the filtered second audio sampling points of the previous frame;
and superimposing, according to a second preset rule, the second audio sampling points with a third preset number of the first audio sampling points to form the target voice data for output, as in the overlap-add sketch given earlier.
In this embodiment, the audio data is segmented and the voice features of each audio segment are extracted; the probability that each audio segment is noise is then obtained, so that each audio segment can be judged preliminarily and the probability of misjudging burst noise is reduced. The audio segments are then classified again using the noise probability and the classification model to further determine whether the current audio segment is a noise segment, which again reduces the probability of misjudging burst noise. Finally, the noise spectra of the segments determined to be noise segments are filtered and the noise-reduced target voice data is output, so that burst noise can be effectively reduced or filtered out.
An embodiment of the present application further provides a terminal device, where the terminal device includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, the processor implementing the steps of any of the various method embodiments described above when executing the computer program.
The embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method embodiments.
The embodiments of the present application further provide a computer program product which, when run on a terminal device, enables the terminal device to implement the steps of the above method embodiments.
Fig. 7 is a schematic diagram of a terminal device 80 according to an embodiment of the present application. As shown in fig. 7, the terminal device 80 of this embodiment includes: a processor 803, a memory 801 and a computer program 802 stored in the memory 801 and executable on the processor 803. The processor 803 implements the steps in the various method embodiments described above, such as the steps S101 to S106 shown in fig. 1, when executing the computer program 802. Alternatively, the processor 803 realizes the functions of the modules/units in the above-described device embodiments when executing the computer program 802.
Illustratively, the computer program 802 may be divided into one or more modules/units, which are stored in the memory 801 and executed by the processor 803 to implement the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and these instruction segments are used to describe the execution of the computer program 802 in the terminal device 80. For example, the computer program 802 may be divided into a first obtaining module, a second obtaining module, a calculating module, an updating module, a determining module, and an output module, whose specific functions are as follows:
The first obtaining module is used for acquiring audio data, dividing the audio data into a plurality of audio segments according to a first preset rule, and extracting the voice features of each audio segment.
The second obtaining module is used for acquiring the first noise probability of the current audio segment.
The calculating module is used for calculating, if the first noise probability satisfies the test condition, the distance between the current audio segment and the classification vector according to the voice features based on a classification model, and determining the classification result.
The updating module is used for updating the first noise probability according to the distance between the current audio segment and the classification vector and the classification result.
The determining module is used for determining that the current audio segment is a noise segment if the updated first noise probability is within the preset probability range.
The output module is used for filtering the noise spectra of all noise segments in the audio data with a filter and outputting the target voice data.
The terminal device 80 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, the processor 803 and the memory 801. Those skilled in the art will appreciate that fig. 7 is merely an example of the terminal device 80 and does not constitute a limitation on it; the terminal device 80 may include more or fewer components than shown, combine some components, or use different components; for example, it may also include input/output devices, network access devices, buses, and the like.
The processor 803 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 801 may be an internal storage unit of the terminal device 80, such as a hard disk or an internal memory of the terminal device 80. The memory 801 may also be an external storage device of the terminal device 80, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal device 80. In one embodiment, the memory 801 may include both an internal storage unit and an external storage device of the terminal device 80. The memory 801 is used to store the computer program and the other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
If the integrated modules/units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media exclude electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A noise reduction method, comprising:
acquiring audio data, dividing the audio data into a plurality of audio segments according to a first preset rule, and extracting voice features of each audio segment;
acquiring a first noise probability of a current audio segment;
if the first noise probability satisfies a test condition, calculating a distance between the current audio segment and a classification vector according to the voice features based on a classification model, and determining a classification result;
updating the first noise probability according to the distance between the current audio segment and the classification vector and the classification result;
if the updated first noise probability is within a preset probability range, determining that the current audio segment is a noise segment;
and filtering, with a filter, the noise spectra of all noise segments in the audio data, and outputting target voice data.
2. The noise reduction method according to claim 1, wherein the acquiring audio data, dividing the audio data into a plurality of audio segments according to the first preset rule, and extracting the voice features of each audio segment comprises:
performing framing processing on the audio data according to a first preset time, wherein each audio segment comprises a first preset number of audio sampling points;
windowing the audio data according to the audio sampling points of the current frame and a second preset number of audio sampling points of the previous frame to obtain a current audio segment;
performing a Fourier transform on the current audio segment to obtain a spectrum of the current audio segment with a preset dimensionality;
and extracting the voice features according to the spectrum of the current audio segment.
3. The noise reduction method according to claim 1, wherein the acquiring a first noise probability of the current audio segment comprises:
acquiring all audio segments within a second preset time period;
calculating the voice energy of each audio segment within the second preset time period, and obtaining a target energy according to the voice energies of the audio segments;
and calculating the ratio of the voice energy of the current audio segment to the target energy, and obtaining the corresponding first noise probability according to the ratio.
4. The noise reduction method according to claim 1, wherein the calculating, if the first noise probability satisfies the test condition, a distance between the current audio segment and the classification vector according to the voice features based on the classification model, and determining a classification result comprises:
if the first noise probability is greater than a first threshold, inputting the voice features into the classification model to obtain the distance between the current audio segment and the classification vector;
judging whether the distance between the current audio segment and the classification vector is greater than or equal to a second threshold;
if the distance between the current audio segment and the classification vector is greater than or equal to the second threshold, determining that the classification result of the current audio segment is a first classification result;
and if the distance between the current audio segment and the classification vector is less than the second threshold, determining that the classification result of the current audio segment is a second classification result.
5. The noise reduction method according to claim 4, wherein the updating the first noise probability according to the distance between the current audio segment and the classification vector and the classification result comprises:
obtaining the classification results of M audio segments including the current audio segment, wherein M is a positive integer;
calculating a noise ratio according to the classification results of the M audio segments;
acquiring the distance between each target audio segment and the classification vector, and calculating a distance average according to the distances between all target audio segments and the classification vector, wherein a target audio segment refers to an audio segment, among the M audio segments, whose classification result is the first classification result;
and if the noise ratio is greater than or equal to a third threshold and the distance average is greater than or equal to a fourth threshold, determining that the current audio segment is a noise segment, and updating the first noise probability to a second noise probability.
6. The noise reduction method according to claim 2, wherein the filtering, with the filter, the noise spectra of all noise segments in the audio data and outputting the target voice data comprises:
acquiring the spectrum of the current audio segment;
obtaining the iterative noise spectrum of the previous noise segment;
and updating the spectrum of the current audio segment according to an iterative model to obtain a target noise spectrum, the iterative model being:
Nk(n)=Nk-1(n)+γYk(n);
wherein Nk(n) represents the target noise spectrum of the current noise segment, Nk-1(n) represents the iterative noise spectrum of the previous noise segment, γ represents an iteration factor, and Yk(n) represents the spectrum of the current audio segment;
and filtering, with the filter, the target noise spectrum of the current noise segment, and outputting the target voice data.
7. The noise reduction method according to claim 6, wherein the filtering, with the filter, the target noise spectrum of the current noise segment and outputting the target voice data comprises:
filtering the target noise spectrum according to a filtering model to obtain an output spectrum;
obtaining a current voice output segment through an inverse Fourier transform according to the output spectrum of the current audio segment, wherein the current voice output segment comprises the filtered first audio sampling points of the current frame and the filtered second audio sampling points of the previous frame;
and superimposing, according to a second preset rule, the second audio sampling points with a third preset number of the first audio sampling points to form the target voice data for output.
8. A noise reduction apparatus, comprising:
a first obtaining module, configured to acquire audio data, divide the audio data into a plurality of audio segments according to a first preset rule, and extract voice features of each audio segment;
a second obtaining module, configured to acquire a first noise probability of a current audio segment;
a calculating module, configured to, if the first noise probability satisfies a test condition, calculate a distance between the current audio segment and a classification vector according to the voice features based on a classification model, and determine a classification result;
an updating module, configured to update the first noise probability according to the distance between the current audio segment and the classification vector and the classification result;
a determining module, configured to determine that the current audio segment is a noise segment if the updated first noise probability is within a preset probability range;
and an output module, configured to filter, with a filter, the noise spectra of all noise segments in the audio data and output target voice data.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN201911028898.2A 2019-10-28 2019-10-28 Noise reduction method and device, terminal equipment and storage medium Active CN110634497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911028898.2A CN110634497B (en) 2019-10-28 2019-10-28 Noise reduction method and device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110634497A true CN110634497A (en) 2019-12-31
CN110634497B CN110634497B (en) 2022-02-18

Family

ID=68978124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911028898.2A Active CN110634497B (en) 2019-10-28 2019-10-28 Noise reduction method and device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110634497B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101404160A (en) * 2008-11-21 2009-04-08 北京科技大学 Voice denoising method based on audio recognition
CN102314884A (en) * 2011-08-16 2012-01-11 捷思锐科技(北京)有限公司 Voice-activation detecting method and device
US20140257801A1 (en) * 2013-03-11 2014-09-11 Samsung Electronics Co. Ltd. Method and apparatus of suppressing vocoder noise
CN104103278A (en) * 2013-04-02 2014-10-15 北京千橡网景科技发展有限公司 Real time voice denoising method and device
CN105023572A (en) * 2014-04-16 2015-11-04 王景芳 Noised voice end point robustness detection method
CN109766929A (en) * 2018-12-24 2019-05-17 重庆第二师范学院 A kind of audio frequency classification method and system based on SVM
CN109785850A (en) * 2019-01-18 2019-05-21 腾讯音乐娱乐科技(深圳)有限公司 A kind of noise detecting method, device and storage medium

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113162837A (en) * 2020-01-07 2021-07-23 腾讯科技(深圳)有限公司 Voice message processing method, device, equipment and storage medium
CN113162837B (en) * 2020-01-07 2023-09-26 腾讯科技(深圳)有限公司 Voice message processing method, device, equipment and storage medium
CN111429930A (en) * 2020-03-16 2020-07-17 云知声智能科技股份有限公司 Noise reduction model processing method and system based on adaptive sampling rate
CN111429930B (en) * 2020-03-16 2023-02-28 云知声智能科技股份有限公司 Noise reduction model processing method and system based on adaptive sampling rate
CN111402918B (en) * 2020-03-20 2023-08-08 北京达佳互联信息技术有限公司 Audio processing method, device, equipment and storage medium
CN111402918A (en) * 2020-03-20 2020-07-10 北京达佳互联信息技术有限公司 Audio processing method, device, equipment and storage medium
CN111782861A (en) * 2020-06-12 2020-10-16 Oppo广东移动通信有限公司 Noise detection method and device and storage medium
WO2021248523A1 (en) * 2020-06-12 2021-12-16 瑞声声学科技(深圳)有限公司 Airflow noise elimination method and apparatus, computer device, and storage medium
CN111768801A (en) * 2020-06-12 2020-10-13 瑞声科技(新加坡)有限公司 Airflow noise eliminating method and device, computer equipment and storage medium
CN111653276A (en) * 2020-06-22 2020-09-11 四川长虹电器股份有限公司 Voice awakening system and method
CN111883159A (en) * 2020-08-05 2020-11-03 龙马智芯(珠海横琴)科技有限公司 Voice processing method and device
CN112309419A (en) * 2020-10-30 2021-02-02 浙江蓝鸽科技有限公司 Noise reduction and output method and system for multi-channel audio
CN112309419B (en) * 2020-10-30 2023-05-02 浙江蓝鸽科技有限公司 Noise reduction and output method and system for multipath audio
WO2022140927A1 (en) * 2020-12-28 2022-07-07 深圳市韶音科技有限公司 Audio noise reduction method and system
CN112911365A (en) * 2021-02-02 2021-06-04 卡莱特云科技股份有限公司 Audio synchronous playing method and device
CN112911365B (en) * 2021-02-02 2024-03-29 卡莱特云科技股份有限公司 Audio synchronous playing method and device
CN113593591A (en) * 2021-07-27 2021-11-02 北京小米移动软件有限公司 Corpus noise reduction method and device, electronic equipment and storage medium
CN113936698A (en) * 2021-09-26 2022-01-14 度小满科技(北京)有限公司 Audio data processing method and device and electronic equipment
CN115359803A (en) * 2022-10-21 2022-11-18 中诚华隆计算机技术有限公司 Voice noise reduction optimization method and device based on chip

Also Published As

Publication number Publication date
CN110634497B (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN110634497B (en) Noise reduction method and device, terminal equipment and storage medium
US10504539B2 (en) Voice activity detection systems and methods
JP2021516369A (en) Mixed speech recognition method, device and computer readable storage medium
CN109036460B (en) Voice processing method and device based on multi-model neural network
US20200184987A1 (en) Noise reduction using specific disturbance models
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN111445919B (en) Speech enhancement method, system, electronic device, and medium incorporating AI model
CN110111811B (en) Audio signal detection method, device and storage medium
JP6764923B2 (en) Speech processing methods, devices, devices and storage media
JP6439682B2 (en) Signal processing apparatus, signal processing method, and signal processing program
CN110556125B (en) Feature extraction method and device based on voice signal and computer storage medium
WO2021007841A1 (en) Noise estimation method, noise estimation apparatus, speech processing chip and electronic device
JP6348427B2 (en) Noise removal apparatus and noise removal program
WO2022218254A1 (en) Voice signal enhancement method and apparatus, and electronic device
US11594239B1 (en) Detection and removal of wind noise
JP5994639B2 (en) Sound section detection device, sound section detection method, and sound section detection program
CN106847299B (en) Time delay estimation method and device
CN112289337B (en) Method and device for filtering residual noise after machine learning voice enhancement
WO2024041512A1 (en) Audio noise reduction method and apparatus, and electronic device and readable storage medium
WO2024017110A1 (en) Voice noise reduction method, model training method, apparatus, device, medium, and product
JP6724290B2 (en) Sound processing device, sound processing method, and program
CN107919136B (en) Digital voice sampling frequency estimation method based on Gaussian mixture model
Chehresa et al. MMSE speech enhancement based on GMM and solving an over-determined system of equations
Indumathi et al. An efficient speaker recognition system by employing BWT and ELM
KR101096091B1 (en) Apparatus for Separating Voice and Method for Separating Voice of Single Channel Using the Same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant