WO2018107874A1

WO2018107874A1 - Method and apparatus for automatically controlling gain of audio data

Info

Publication number: WO2018107874A1
Application number: PCT/CN2017/104796
Authority: WO
Inventors: 雷延强; 程雪峰
Original assignee: 广州视源电子科技股份有限公司
Priority date: 2016-12-16
Filing date: 2017-09-30
Publication date: 2018-06-21
Also published as: CN106653047A

Abstract

A method and an apparatus for automatically controlling the gain of audio data. The method comprises: dividing audio data into frames, and extracting the feature parameters of each frame of the data (S1); according to the feature parameters of the current frame of the data and a speech-type Gaussian mixed model, obtaining the speech-type conditional probability of the current frame of the data, and according to the feature parameters of the current frame of the data and a pre-configured noise-type Gaussian mixed model, obtaining the noise-type conditional probability of the current frame of the data (S2); according to the speech-type conditional probability of the current frame of the data and the noise-type conditional probability of the current frame of the data, calculating the probability of the current frame of the data being a speech frame and the probability of the current frame of the data being a noise frame (S3); when the probability of the current frame of the data being the speech frame is greater than the probability of the current frame of the data being the noise frame, determining the current frame of the audio data as the speech frame, and when the probability of the current frame of the data being the speech frame is less than the probability of the current frame of the data being the noise frame, determining the current frame of the data to be a noise frame (S4); when the current frame of the data is determined to be a speech frame, controlling the gain thereof on the basis of a pre-configured speech frame gain control rule, and when the current frame of the data is determined to be a noise frame, controlling the gain thereof on the basis of a pre-configured noise frame gain control rule (S5). The method can improve the level of identifying speech and noise, thereby automatically controlling the gain, and effectively improving user experience.

Description

Automatic gain control method and device for audio data

Technical field

The present invention relates to audio signal processing technologies, and in particular, to an automatic gain control method and apparatus for audio data.

Background technique

In the process of speech signal processing, the volume intensity of different audio signals is often different, and accompanied by noise, but as a user, it is expected that the conversation with each person is the same volume intensity without the control of the volume keys. Realize and enhance the user experience. The existing automatic gain control method performs gain control on the two parts by analyzing the speech part and the noise part in the audio signal.

The existing automatic gain control methods use time domain analysis to distinguish between speech and noise. This method of differentiation has great limitations and cannot effectively distinguish the characteristics of speech and noise. It often recognizes speech as noise or noise. Recognized as speech, causing erroneous gain control of the audio signal. For example, in a cochlear/hearing aid device, if the noise is erroneously amplified, the user experience is very poor, and even the user may be seriously uncomfortable.

Summary of the invention

In view of the above problems, an object of the present invention is to provide an automatic gain control method and apparatus for audio data, which can accurately and effectively distinguish between a voice portion and a noise portion in audio data, and separately perform gain control thereon, thereby greatly improving the user. Comfort.

In order to achieve the above object, an aspect of the present invention provides an automatic gain control method for audio data, including:

Performing framing processing on the audio data, and extracting characteristic parameters of each frame data;

Obtaining a voice class condition probability of the current frame data according to a feature parameter of the current frame data and a pre-configured voice class Gaussian mixture model; and pre-matching according to the feature parameter of the current frame data a noise class Gaussian mixture model is obtained, and a noise class condition probability of the current frame data is obtained;

Calculating a probability that the current frame data belongs to a voice frame and a probability of belonging to a noise frame according to a voice class condition probability of the current frame data and a noise class condition probability of the current frame data;

Determining the current frame audio data as a speech frame when the probability that the current frame data belongs to the speech frame is greater than the probability of belonging to the noise frame; and when the probability that the current frame data belongs to the speech frame is less than the probability of belonging to the noise frame, The current frame data is determined to be a noise frame;

When the current frame data is determined to be a speech frame, the gain of the current frame data is controlled according to a pre-configured speech frame gain control rule, and the pre-configured noise frame gain is performed when the current frame data is determined to be a noise frame. A control rule controls the gain of the current frame data.

Preferably, the automatic gain control method of the audio data further comprises the steps of constructing a speech class Gaussian mixture model and constructing a noise class Gaussian mixture model;

The step of constructing a voice category Gaussian mixture model specifically includes:

Performing frame processing on the voice sample data and extracting feature parameters of each frame data by the same processing method as the audio data;

Dividing the feature parameters of the voice sample data into a plurality of voice categories according to a K-means algorithm;

Obtaining an initial weight, an initial mean, and an initial covariance of a Gaussian submodel corresponding to each voice category;

The weight, mean and covariance of the Gaussian submodel corresponding to each speech category are iteratively optimized by the EM algorithm to obtain a Gaussian mixture model of the speech category;

The step of constructing the noise class Gaussian mixture model specifically includes:

Performing frame processing on the noise sample data and extracting characteristic parameters of each frame data by the same processing method as the audio data;

Dividing the characteristic parameters of the noise sample data into a plurality of noise categories according to a K-means algorithm;

Obtaining an initial weight, an initial mean, and an initial covariance of a Gaussian submodel corresponding to each noise class;

The weight, mean and covariance of the Gaussian submodel corresponding to each noise class are iteratively optimized by the EM algorithm to obtain the Gaussian mixture model of the noise class.

Preferably, the probability that the current frame data belongs to a voice frame and the probability of belonging to a noise frame according to a voice class condition probability of the current frame data and a noise class condition probability of the current frame data includes:

Calculating the current frame data according to a speech-like conditional probability p(x _T /Y ₁ ) of the current frame data and a noise-like conditional probability p(x _T /Y ₂ ) of the current frame data in combination with a Bayesian formula voice frame belonging posterior probability _{_{p '(Y 1 / x T}} ) , and after the noise frame belonging posterior probability _{_{p' (Y 2 / x T}} );

According to p(Y ₁ /x _T )=α ₁ ·p(Y ₁ /x _T-W+1 )+...α _W-1 ·p(Y ₁ /x _T-1 )+α _W ·p'(Y ₁ / x _T ) calculate p (Y1/x _T );

According to p(Y ₂ /x _T )=α ₁ ·p(Y ₂ /x _T-W+1 )+...α _W-1 ·p(Y ₂ /x _T-1 )+α _W ·p'(Y ₂ /x _T ) calculate p(Y2/x _T );

among them,

T is a frame number of the current frame data in the audio data; X _T is a feature parameter of the current frame data; T-W+1 is a frame number of a front W frame of the current frame data; σ is the default value.

Preferably, any adjacent two frames of data obtained by performing frame processing on the audio data have overlapping portions.

Preferably, when the current frame data is determined to be a speech frame, the gain of the current frame data is controlled according to a preset voice frame gain control rule, and the current frame data is determined as a noise frame according to a pre- The configured noise frame gain control rule controls the gain of the current frame data, including:

When the current frame data is determined to be a speech frame, acquiring time domain energy of the current frame data and calculating a ratio of a preset expected energy value to the time domain energy, and each data point of the current frame data is obtained. Multiplying the ratio to enlarge or reduce the current frame data;

When the current frame data is determined to be a noise frame, the current frame data is kept unchanged.

Another aspect of the present invention provides an automatic gain control apparatus for audio data, including:

a pre-processing module, configured to perform frame processing on the audio data, and extract feature parameters of each frame data;

a first probability acquisition module, configured to use a feature parameter of the current frame data and a pre-configured voice class Obtaining a conditional probability of the speech condition of the current frame data; and obtaining a noise-like conditional probability of the current frame data according to a feature parameter of the current frame data and a pre-configured noise class Gaussian mixture model;

a second probability acquisition module, configured to calculate a probability that the current frame data belongs to a voice frame and a probability of belonging to a noise frame according to a voice class condition probability of the current frame data and a noise class condition probability of the current frame data;

a determining module, configured to determine, as the voice frame, the current frame audio data when the probability that the current frame data belongs to the voice frame is greater than the probability of belonging to the noise frame; and the probability that the current frame data belongs to the voice frame is less than the noise frame The probability of the current frame data is determined as a noise frame;

a gain control module, configured to: when the current frame data is determined to be a voice frame, control a gain of the current frame data according to a preset voice frame gain control rule, and when the current frame data is determined to be a noise frame, A pre-configured noise frame gain control rule controls the gain of the current frame data.

Preferably, the automatic gain control device of the audio data further includes a first model building module and a second model building module;

The first model building module includes:

a first pre-processing unit, configured to perform frame-by-frame processing on the voice sample data and extract feature parameters of each frame data by using the same processing method as the audio data;

a first classifying unit, configured to divide feature parameters of the voice sample data into a plurality of voice categories according to a K-means algorithm;

a first initial parameter obtaining unit, configured to acquire an initial weight, an initial mean value, and an initial covariance of a Gaussian submodel corresponding to each voice category;

The first model optimization unit is configured to perform iterative optimization on the weight, the mean value and the covariance of the Gaussian submodel corresponding to each voice category by using an EM algorithm to obtain a Gaussian mixture model of the voice category;

The second model building module includes:

a second pre-processing unit, configured to perform frame-by-frame processing on the noise sample data and extract feature parameters of each frame data by using the same processing method as the audio data;

a second classifying unit, configured to divide a feature parameter of the noise sample data into a plurality of noise categories according to a K-means algorithm;

a second initial parameter obtaining unit, configured to acquire an initial weight, an initial mean, and an initial covariance of the Gaussian submodel corresponding to each noise category;

The second model optimization unit is configured to iteratively optimize the weight, mean and covariance of the Gaussian submodel corresponding to each noise class by using the EM algorithm to obtain a Gaussian mixture model of the noise class.

Preferably, the second probability acquisition module comprises:

a posterior probability acquiring unit, configured to combine Bayesian with a conditional probability p(x _T /Y ₁ ) of the current frame data and a noise-like conditional probability p(x _T /Y ₂ ) of the current frame data after formula to calculate the current frame data belonging to the speech frame posteriori probabilities _{_{p '(Y 1 / x T}} ) , and after the noise frame belonging posterior probability _{_{p' (Y 2 / x T}} );

Probability weighted smoothing unit for

According to p(Y ₁ /x _T )=α ₁ ·p(Y ₁ /x _T-W+1 )+...α _W-1 ·p(Y ₁ /x _T-1 )+α _W ·p'(Y ₁ /x _T ) calculate p(Y1/x _T ); and

among them,

Preferably, the gain control module comprises:

a first gain control unit, configured to acquire time domain energy of the current frame data and calculate a ratio of a preset expected energy value to the time domain energy when the current frame data is determined to be a speech frame, Multiplying each data point of the current frame data by the ratio to enlarge or reduce the current frame data;

a second gain control unit, configured to: when the current frame data is determined to be a noise frame, The current frame data does not change.

Compared with the prior art, an embodiment of the present invention provides an audio data automatic gain control method and apparatus, where the method includes: performing frame processing on audio data, and extracting data of each frame. a feature parameter; obtaining a voice class conditional probability of the current frame data according to a feature parameter of the current frame data and a pre-configured voice class Gaussian mixture model; and a Gaussian mixture model according to the feature parameter of the current frame data and a pre-configured noise class Obtaining a noise-like conditional probability of the current frame data; calculating a probability that the current frame data belongs to a speech frame and a noise frame according to a speech-like conditional probability of the current frame data and a noise-like conditional probability of the current frame data Probability of determining the current frame audio data as a speech frame when the probability that the current frame data belongs to the speech frame is greater than the probability of belonging to the noise frame; and the probability that the current frame data belongs to the speech frame is less than the probability of belonging to the noise frame Determining the current frame data as a noise frame; when the current frame data When determining to be a speech frame, controlling the gain of the current frame data according to a preset voice frame gain control rule, and controlling the current frame according to a pre-configured noise frame gain control rule when the current frame data is determined to be a noise frame The gain of the data. In voice real-time communication, due to the diversity of the usage environment, the noise is changed with the change of the environment. In the embodiment of the present invention, by introducing a Gaussian mixture model, it is very accurately determined whether the current frame is a speech segment or a noise segment, and respectively Gain control is performed on the speech segment and the noise end to implement automatic gain control to avoid erroneously amplifying the noise. The technical solution of the invention greatly improves the recognition level of speech and noise, and performs automatic gain control accordingly, thereby effectively improving the user experience.

DRAWINGS

In order to more clearly illustrate the technical solutions of the present invention, the drawings used in the embodiments will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, which are common in the art. For the skilled person, other drawings can be obtained from these drawings without any creative work.

1 is a schematic flow chart of an automatic gain control method for audio data according to an embodiment of the present invention;

FIG. 2 is a structural block diagram of an automatic gain control apparatus for audio data according to an embodiment of the present invention.

detailed description

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, but not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

Please refer to FIG. 1 , which is a schematic flowchart of an automatic gain control method for audio data according to an embodiment of the present invention, including:

S1, performing framing processing on the audio data, and extracting characteristic parameters of each frame data;

Preferably, any adjacent two frames of data obtained by performing frame processing on the audio data have overlapping portions. Although the framing can adopt the method of continuous segmentation, the method of overlapping segmentation can make the frame-frame transition smoothly and maintain its continuity. The overlapping portion of the previous frame and the next frame is called frame shift, and the ratio of frame shift to frame length is preferably 0 to 1/2.

The method for extracting the feature parameters may be an MFCC (Mel Frequency Cepstral Coefficient) algorithm, an LPC (Linear Predictive Analysis) algorithm, an LPL (Linear Predictive Analysis) algorithm, or the like.

S2: obtaining a voice class condition probability of the current frame data according to a feature parameter of the current frame data and a pre-configured voice class Gaussian mixture model; and a Gaussian mixture model according to the feature parameter of the current frame data and a pre-configured noise class, Obtaining a noise-like conditional probability of the current frame data;

S3. Calculate, according to a voice class condition probability of the current frame data and a noise class condition probability of the current frame data, a probability that the current frame data belongs to a voice frame and a probability of belonging to a noise frame;

S4, when the probability that the current frame data belongs to the voice frame is greater than the probability of belonging to the noise frame, determining the current frame audio data as a voice frame; and when the probability that the current frame data belongs to the voice frame is less than a probability belonging to the noise frame Determining the current frame data as a noise frame;

It should be noted that when the probability that the current frame data belongs to the voice frame is equal to the probability of belonging to the noise frame, the current frame data may be determined as a voice frame or a noise frame according to a preset setting. Those skilled in the art should be able to understand.

S5. When the current frame data is determined to be a voice frame, control the gain of the current frame data according to a preset voice frame gain control rule, and follow the pre-configured noise when the current frame data is determined to be a noise frame. A frame gain control rule controls the gain of the current frame data.

In voice real-time communication, due to the diversity of the usage environment, the noise is changed with the change of the environment. In the embodiment of the present invention, by introducing a Gaussian mixture model, it is very accurately determined whether the current frame is a speech segment or a noise segment, and respectively Gain control is performed on the speech segment and the noise end to implement automatic gain control to avoid erroneously amplifying the noise. The technical solution of the invention greatly improves the recognition level of speech and noise, and performs automatic gain control accordingly, thereby effectively improving the user experience.

The weight, mean and covariance of the Gaussian submodel corresponding to each phonetic category are iteratively optimized by the EM algorithm (expected maximum algorithm) to obtain a Gaussian mixture model of the speech category;

Through the above steps, a Gaussian mixture model of speech categories and a Gaussian mixture model of noise categories can be constructed. Since the steps of constructing the Gaussian mixture model of the speech category and the Gaussian mixture model of the noise category are basically the same, the following is an example of constructing the Gaussian mixture model of the speech category.

1. It is assumed that the voice sample data is divided into m frame data, and the feature parameters of the voice sample data are divided into K voice categories according to the K-means algorithm, that is, the voice category Gaussian mixture model is composed of K Gaussian sub-models.

2. For the kth Gaussian submodel, the initial mean value can be obtained.

Initial covariance

And set the initial weight of any Gaussian submodel

3. Iteratively optimize the mean μ _k , covariance ∑ _k , and weight ω _k of the kth Gaussian submodel:

Where t is the number of iterations, and t is greater than or equal to 0;

It is a standard Gaussian function; x _i represents the characteristic parameter of the i-th frame speech sample data.

4. Assuming that the EM algorithm is stable at t=t1, you can

Assigned to ω _k , will

Assigned to μ _k , will

Assigned to C _k to obtain a Gaussian mixture model for speech categories:

Substituting the feature parameter x _T of the current frame data into the speech class Gaussian mixture model p(x/Y ₁ ) can obtain the speech-like conditional probability p(x _T /Y ₁ ) of the current frame data.

Similarly, the noise class Gaussian mixture model p(x/Y ₂ ) can be obtained; the characteristic parameter x _T of the current frame data is substituted into the noise class Gaussian mixture model p(x/Y ₂ ) to obtain the noise conditional probability of the current frame data. p(x _T /Y ₂ ). It should be noted that the Gaussian mixture model of the noise category and the Gaussian mixture model of the speech category are identical in form, and all belong to the Gaussian mixture model, but the number of specific Gaussian submodels and the specific parameters may be different. One skilled in the art should be able to understand.

As a further improvement of the embodiment of the present invention, in step S3, the probability of the current frame data belonging to the voice frame is calculated according to the voice class condition probability of the current frame data and the noise class condition probability of the current frame data. The probability of belonging to a noise frame, including:

S31, calculating a current Bayes formula according to a speech-like conditional probability p(x _T /Y ₁ ) of the current frame data and a noise-like conditional probability p(x _T /Y ₂ ) of the current frame data, and calculating the current after the frame data belonging to the speech frame posteriori probabilities _{_{p '(Y 1 / x T}} ) , and after the noise frame belonging posterior probability _{_{p' (Y 2 / x T}} );

Specifically, according to the Bayesian formula, the posterior probability that the current frame data belongs to the speech frame is

The posterior probability that the current frame data belongs to the noise frame is

p(Y ₁ ) is the prior probability of the speech class, and p(Y ₂ ) is the prior probability of the noise class. Since the probability of occurrence of noise and speech cannot be estimated in the actual application scenario, p(Y ₁ )=p(Y ₂ ) can be set equal, thus p′(Y ₁ /x _T ) and p′(Y ₂ /x _T ) is transformed into:

S32, according to p(Y ₁ /x _T )=α ₁ ·p(Y ₁ /x _T-W+1 )+...α _W-1 ·p(Y ₁ /x _T-1 )+α _W ·p' (Y ₁ /x _T ) calculates p(Y1/x _T ); and,

among them,

p(Y ₁ /x _T ) is the probability that p'(Y ₁ /x _T ) is weighted and smoothed; similarly, p(Y ₂ /x _T ) is p'(Y ₁ /x _T ) weighted smoothing The probability of getting after. W represents the window width of the weighted smoothing.

α ₁ ~ α _w is the weighting factor. It can be seen from the expression of α _j that α ₁ to α _w obey the Gaussian distribution and α ₁ +...+α _w-1 +α _w =1. In α ₁ to α _w , α _w is the maximum value. That is, the weighting coefficient of the posterior probability of the current frame data is the largest.

In principle, according to the size of p'(Y1/x _T ) and p'(Y2/x _T ), it can be determined whether the current frame data belongs to a speech frame or a noise frame, but the speech or noise is usually a continuous multi-frame, weighted smoothing. It can make the transition of the recognition result more stable and prevent some abnormal mutation results.

Preferably, in step S5, when the current frame data is determined to be a speech frame, the gain of the current frame data is controlled according to a pre-configured speech frame gain control rule, and the current frame data is determined as The noise frame controls the gain of the current frame data according to a pre-configured noise frame gain control rule, including:

When the ratio is greater than 1, it means that the time domain energy does not reach the expected energy value, and the current frame data needs to be amplified; when the ratio is less than 1, it represents that the time domain energy exceeds the expected energy. The value needs to be reduced.

Through step S5, the speech frame can be enlarged or reduced according to the time domain energy of the speech frame to achieve the automatic gain control effect, and the noise frame remains unchanged, thereby avoiding erroneously amplifying the noise frame.

It should be noted that the above is only one implementation manner of the speech frame gain control rule and the noise frame gain control rule, and the purpose is to automatically perform gain amplification or reduction on the speech frame, and avoid the amplification operation of the noise frame, and other implementations. The way to compress the gain of a noise frame, for example, is also optional.

In order to perform the automatic gain control method of the audio data described above, an embodiment of the present invention further provides an automatic gain control apparatus for audio data. FIG. 2 is a structural block diagram of an automatic gain control apparatus for audio data according to an embodiment of the present invention. The automatic gain control device of the audio data includes:

The pre-processing module 1 is configured to perform frame processing on the audio data, and extract feature parameters of each frame data;

The first probability acquisition module 2 is configured to obtain a voice class condition probability of the current frame data according to a feature parameter of the current frame data and a pre-configured voice class Gaussian mixture model; and pre-configure according to the feature parameter of the current frame data a noise class Gaussian mixture model to obtain a noise-like conditional probability of the current frame data;

The second probability acquisition module 3 is configured to calculate a probability that the current frame data belongs to a voice frame and a probability of belonging to a noise frame according to a voice class condition probability of the current frame data and a noise class condition probability of the current frame data;

The determining module 4 is configured to determine the current frame audio data as a voice frame when the probability that the current frame data belongs to the voice frame is greater than the probability of belonging to the noise frame; and the probability that the current frame data belongs to the voice frame is less than the noise When the probability of the frame is determined, the current frame data is determined as a noise frame;

a gain control module 5, configured to: when the current frame data is determined to be a voice frame, control a gain of the current frame data according to a preset voice frame gain control rule, and when the current frame data is determined to be a noise frame The gain of the current frame data is controlled in accordance with a pre-configured noise frame gain control rule.

The first model building module includes:

The second model building module includes:

Preferably, the second probability acquisition module 3 includes:

Probability weighted smoothing unit for

According to p(Y ₁ /x _T )=α ₁ ·p(Y ₁ /x _T-W+1 )+...α _W-1 ·p(Y ₁ /x _T-1 )+α _W ·p'(Y ₁ /x _T ) calculate p(Y ₁ /x _T ); and

According to p(Y ₂ /x _T )=α ₁ ·p(Y ₂ /x _T-W+1 )+...α _W-1 ·p(Y ₂ /x _T-1 )+α _W ·p'(Y ₂ / x _T ) Calculate p(Y ₂ /x _T );

among them,

Preferably, the gain control module 5 comprises:

And a second gain control unit, configured to keep the current frame data unchanged when the current frame data is determined to be a noise frame.

It should be noted that the automatic gain control apparatus for audio data provided by the embodiment of the present invention is used to perform the above-mentioned automatic gain control method for audio data, and the beneficial effects and working principles of the two are corresponding one-to-one, and thus will not be described again.

Compared with the prior art, an embodiment of the present invention provides an audio data automatic gain control method and apparatus, where the method includes: performing frame processing on audio data, and extracting data of each frame. a feature parameter; obtaining a voice class conditional probability of the current frame data according to a feature parameter of the current frame data and a pre-configured voice class Gaussian mixture model; and a Gaussian mixture model according to the feature parameter of the current frame data and a pre-configured noise class Obtaining a noise-like conditional probability of the current frame data; calculating a probability that the current frame data belongs to a speech frame and a noise frame according to a speech-like conditional probability of the current frame data and a noise-like conditional probability of the current frame data Probability of determining the current frame audio data as a speech frame when the probability that the current frame data belongs to the speech frame is greater than the probability of belonging to the noise frame; and the probability that the current frame data belongs to the speech frame is less than the probability of belonging to the noise frame Determining the current frame data as a noise frame; when the current frame data Determined to language In the case of a tone frame, controlling the gain of the current frame data according to a pre-configured voice frame gain control rule, and controlling the current frame data according to a pre-configured noise frame gain control rule when the current frame data is determined to be a noise frame Gain. In voice real-time communication, due to the diversity of the usage environment, the noise is changed with the change of the environment. In the embodiment of the present invention, by introducing a Gaussian mixture model, it is very accurately determined whether the current frame is a speech segment or a noise segment, and respectively Gain control is performed on the speech segment and the noise end to implement automatic gain control to avoid erroneously amplifying the noise. The technical solution of the invention greatly improves the recognition level of speech and noise, and performs automatic gain control accordingly, thereby effectively improving the user experience.

The above disclosure is only a preferred embodiment of the present invention, and of course, the scope of the present invention is not limited thereto, and those skilled in the art can understand all or part of the process of implementing the above embodiments, and according to the present invention. The equivalent changes required are still within the scope of the invention.

One of ordinary skill in the art can understand that all or part of the process of implementing the foregoing embodiments can be completed by a computer program to instruct related hardware, and the program can be stored in a computer readable storage medium. When executed, the flow of an embodiment of the methods as described above may be included. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Claims

An automatic gain control method for audio data, comprising:

Performing framing processing on the audio data, and extracting characteristic parameters of each frame data;

Obtaining a speech-like conditional probability of the current frame data according to a feature parameter of the current frame data and a pre-configured voice class Gaussian mixture model; and obtaining a Gaussian mixture model according to the feature parameter of the current frame data and a pre-configured noise class a noise-like conditional probability of the current frame data;

Calculating a probability that the current frame data belongs to a voice frame and a probability of belonging to a noise frame according to a voice class condition probability of the current frame data and a noise class condition probability of the current frame data;

Determining the current frame audio data as a speech frame when the probability that the current frame data belongs to the speech frame is greater than the probability of belonging to the noise frame; and when the probability that the current frame data belongs to the speech frame is less than the probability of belonging to the noise frame, The current frame data is determined to be a noise frame;

When the current frame data is determined to be a speech frame, the gain of the current frame data is controlled according to a pre-configured speech frame gain control rule, and the pre-configured noise frame gain is performed when the current frame data is determined to be a noise frame. A control rule controls the gain of the current frame data.
The automatic gain control method for audio data according to claim 1, wherein said automatic gain control method of audio data further comprises the steps of constructing a speech class Gaussian mixture model and constructing a noise class Gaussian mixture model;

The step of constructing a voice category Gaussian mixture model specifically includes:

Performing frame processing on the voice sample data and extracting feature parameters of each frame data by the same processing method as the audio data;

Dividing the feature parameters of the voice sample data into a plurality of voice categories according to a K-means algorithm;

Obtaining an initial weight, an initial mean, and an initial covariance of a Gaussian submodel corresponding to each voice category;

The weight, mean and covariance of the Gaussian submodel corresponding to each speech category are iteratively optimized by the EM algorithm to obtain a Gaussian mixture model of the speech category;

The step of constructing the noise class Gaussian mixture model specifically includes:

Performing frame processing on the noise sample data and extracting characteristic parameters of each frame data by the same processing method as the audio data;

Dividing the characteristic parameters of the noise sample data into a plurality of noise categories according to a K-means algorithm;

Obtaining an initial weight, an initial mean, and an initial covariance of a Gaussian submodel corresponding to each noise class;

The weight, mean and covariance of the Gaussian submodel corresponding to each noise class are iteratively optimized by the EM algorithm to obtain the Gaussian mixture model of the noise class.
The automatic gain control method for audio data according to claim 1, wherein said calculating current frame data according to a speech condition condition probability of said current frame data and a noise class condition probability of said current frame data The probability of a speech frame and the probability of belonging to a noise frame, including:

Calculating the current frame data according to a speech-like conditional probability p(x T /Y 1 ) of the current frame data and a noise-like conditional probability p(x T /Y 2 ) of the current frame data in combination with a Bayesian formula voice frame belonging posterior probability p '(Y 1 / x T ) , and after the noise frame belonging posterior probability p' (Y 2 / x T );

According to p(Y 1 /x T )=α 1 ·p(Y 1 /x T-W+1 )+...α W-1 ·p(Y 1 /x T-1 )+α W ·p'(Y 1 / x T ) calculate p (Y1/x T );

According to p(Y 2 /x T )=α 1 ·p(Y 2 /x T-W+1 )+...α W-1 ·p(Y 2 /x T-1 )+α W ·p'(Y 2 /x T ) calculate p(Y2/x T );

among them,

T is a frame number of the current frame data in the audio data; X T is a feature parameter of the current frame data; T-W+1 is a frame number of a front W frame of the current frame data; σ is the default value.
The automatic gain control method for audio data according to claim 1, wherein any adjacent two frames of data obtained by performing frame processing on the audio data have overlapping portions.
The automatic gain control method for audio data according to any one of claims 1 to 4, wherein when the current frame data is determined to be a speech frame, the control unit is controlled according to a pre-configured speech frame gain control rule. Determining the gain of the current frame data, and controlling the gain of the current frame data according to a pre-configured noise frame gain control rule when the current frame data is determined to be a noise frame, including:

When the current frame data is determined to be a speech frame, acquiring time domain energy of the current frame data and calculating a ratio of a preset expected energy value to the time domain energy, and each data point of the current frame data is obtained. Multiplying the ratio to enlarge or reduce the current frame data;

When the current frame data is determined to be a noise frame, the current frame data is kept unchanged.
An automatic gain control device for audio data, comprising:

a pre-processing module, configured to perform frame processing on the audio data, and extract feature parameters of each frame data;

a first probability acquisition module, configured to obtain a voice class condition probability of the current frame data according to a feature parameter of the current frame data and a pre-configured voice class Gaussian mixture model; and pre-configured according to the feature parameter of the current frame data a noise class Gaussian mixture model, which obtains a noise-like conditional probability of the current frame data;

a second probability acquisition module, configured to calculate a probability that the current frame data belongs to a voice frame and a probability of belonging to a noise frame according to a voice class condition probability of the current frame data and a noise class condition probability of the current frame data;

a determining module, configured to determine, as the voice frame, the current frame audio data when the probability that the current frame data belongs to the voice frame is greater than the probability of belonging to the noise frame; and the probability that the current frame data belongs to the voice frame is less than the noise frame The probability of the current frame data is determined as a noise frame;

a gain control module, configured to: when the current frame data is determined to be a voice frame, control a gain of the current frame data according to a preset voice frame gain control rule, and when the current frame data is determined to be a noise frame, A pre-configured noise frame gain control rule controls the gain of the current frame data.
An automatic gain control apparatus for audio data according to claim 6, wherein said said The automatic gain control device for audio data further includes a first model building module and a second model building module;

The first model building module includes:

a first pre-processing unit, configured to perform frame-by-frame processing on the voice sample data and extract feature parameters of each frame data by using the same processing method as the audio data;

a first classifying unit, configured to divide feature parameters of the voice sample data into a plurality of voice categories according to a K-means algorithm;

a first initial parameter obtaining unit, configured to acquire an initial weight, an initial mean value, and an initial covariance of a Gaussian submodel corresponding to each voice category;

The first model optimization unit is configured to perform iterative optimization on the weight, the mean value and the covariance of the Gaussian submodel corresponding to each voice category by using an EM algorithm to obtain a Gaussian mixture model of the voice category;

The second model building module includes:

a second pre-processing unit, configured to perform frame-by-frame processing on the noise sample data and extract feature parameters of each frame data by using the same processing method as the audio data;

a second classifying unit, configured to divide a feature parameter of the noise sample data into a plurality of noise categories according to a K-means algorithm;

a second initial parameter obtaining unit, configured to acquire an initial weight, an initial mean, and an initial covariance of the Gaussian submodel corresponding to each noise category;

The second model optimization unit is configured to iteratively optimize the weight, mean and covariance of the Gaussian submodel corresponding to each noise class by using the EM algorithm to obtain a Gaussian mixture model of the noise class.
The automatic gain control apparatus for audio data according to claim 6, wherein the second probability acquisition module comprises:

a posterior probability acquiring unit, configured to combine Bayesian with a conditional probability p(x T /Y 1 ) of the current frame data and a noise-like conditional probability p(x T /Y 2 ) of the current frame data after formula to calculate the current frame data belonging to the speech frame posteriori probabilities p '(Y 1 / x T ) , and after the noise frame belonging posterior probability p' (Y 2 / x T );

Probability weighted smoothing unit for

According to p(Y 1 /x T )=α 1 ·p(Y 1 /x T-W+1 )+...α W-1 ·p(Y 1 /x T-1 )+α W ·p'(Y 1 /x T ) calculate p(Y1/x T ); and

According to p(Y 2 /x T )=α 1 ·p(Y 2 /x T-W+1 )+...α W-1 ·p(Y 2 /x T-1 )+α W ·p'(Y 2 /x T ) calculate p(Y2/x T );

among them,

T is a frame number of the current frame data in the audio data; X T is a feature parameter of the current frame data; T-W+1 is a frame number of a front W frame of the current frame data; σ is the default value.
The automatic gain control apparatus for audio data according to claim 6, wherein any adjacent two frames of data obtained by performing frame processing on said audio data have overlapping portions.
The automatic gain control apparatus for audio data according to any one of claims 6 to 9, wherein the gain control module comprises:

a first gain control unit, configured to acquire time domain energy of the current frame data and calculate a ratio of a preset expected energy value to the time domain energy when the current frame data is determined to be a speech frame, Multiplying each data point of the current frame data by the ratio to enlarge or reduce the current frame data;

And a second gain control unit, configured to keep the current frame data unchanged when the current frame data is determined to be a noise frame.