WO2018107874A1 - Method and apparatus for automatically controlling gain of audio data - Google Patents


Publication number
WO2018107874A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2017/104796
Other languages
French (fr)
Chinese (zh)
Inventor
雷延强
程雪峰
Original Assignee
广州视源电子科技股份有限公司
Application filed by 广州视源电子科技股份有限公司
Publication of WO2018107874A1 publication Critical patent/WO2018107874A1/en


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present invention relates to audio signal processing technologies, and in particular, to an automatic gain control method and apparatus for audio data.
  • in speech signal processing, the volume of different audio signals often differs and is accompanied by noise, yet a user expects every conversation to be at the same volume without operating the volume keys, which would improve the user experience.
  • the existing automatic gain control method analyzes the speech part and the noise part of the audio signal and performs gain control on the two parts separately.
  • the existing automatic gain control methods distinguish speech from noise by time-domain analysis. This approach has great limitations and cannot effectively distinguish the characteristics of speech and noise; it often recognizes speech as noise, or noise as speech, causing erroneous gain control of the audio signal. For example, in a cochlear-implant or hearing-aid device, erroneously amplifying the noise gives the user a very poor experience and may even cause serious discomfort.
  • an object of the present invention is to provide an automatic gain control method and apparatus for audio data, which can accurately and effectively distinguish between the speech portion and the noise portion of audio data and perform gain control on each separately, thereby greatly improving user comfort.
  • an aspect of the present invention provides an automatic gain control method for audio data, including:
  • when the current frame data is determined to be a speech frame, the gain of the current frame data is controlled according to a pre-configured speech frame gain control rule, and when the current frame data is determined to be a noise frame, the gain of the current frame data is controlled according to a pre-configured noise frame gain control rule.
  • the automatic gain control method of the audio data further comprises the steps of constructing a speech class Gaussian mixture model and constructing a noise class Gaussian mixture model;
  • the step of constructing a voice category Gaussian mixture model specifically includes:
  • the weight, mean and covariance of the Gaussian submodel corresponding to each speech category are iteratively optimized by the EM algorithm to obtain a Gaussian mixture model of the speech category;
  • the step of constructing the noise class Gaussian mixture model specifically includes:
  • the weight, mean and covariance of the Gaussian submodel corresponding to each noise class are iteratively optimized by the EM algorithm to obtain the Gaussian mixture model of the noise class.
  • calculating the probability that the current frame data belongs to a speech frame and the probability that it belongs to a noise frame according to the speech-class conditional probability and the noise-class conditional probability of the current frame data includes:
  • T is the frame number of the current frame data in the audio data;
  • x_T is the feature parameter of the current frame data;
  • T-W+1 is the frame number of the first frame of the W-frame window ending at the current frame;
  • W is a preset value.
  • any adjacent two frames of data obtained by performing frame processing on the audio data have overlapping portions.
  • when the current frame data is determined to be a speech frame, controlling the gain of the current frame data according to a preset speech frame gain control rule, and when the current frame data is determined to be a noise frame, controlling the gain of the current frame data according to a pre-configured noise frame gain control rule, includes:
  • when the current frame data is determined to be a speech frame, acquiring the time-domain energy of the current frame data, calculating the ratio of a preset expected energy value to the time-domain energy, and multiplying each data point of the current frame data by the ratio to amplify or attenuate the current frame data;
  • when the current frame data is determined to be a noise frame, keeping the current frame data unchanged.
  • Another aspect of the present invention provides an automatic gain control apparatus for audio data, including:
  • a pre-processing module configured to perform frame processing on the audio data, and extract feature parameters of each frame data
  • a first probability acquisition module configured to obtain the speech-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured speech-class Gaussian mixture model, and to obtain the noise-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured noise-class Gaussian mixture model;
  • a second probability acquisition module configured to calculate a probability that the current frame data belongs to a voice frame and a probability of belonging to a noise frame according to a voice class condition probability of the current frame data and a noise class condition probability of the current frame data;
  • a determining module configured to determine the current frame audio data as a speech frame when the probability that the current frame data belongs to a speech frame is greater than the probability that it belongs to a noise frame, and to determine the current frame data as a noise frame when the probability that it belongs to a speech frame is less than the probability that it belongs to a noise frame;
  • a gain control module configured to: when the current frame data is determined to be a voice frame, control a gain of the current frame data according to a preset voice frame gain control rule, and when the current frame data is determined to be a noise frame, A pre-configured noise frame gain control rule controls the gain of the current frame data.
  • the automatic gain control device of the audio data further includes a first model building module and a second model building module;
  • the first model building module includes:
  • a first pre-processing unit configured to perform frame-by-frame processing on the voice sample data and extract feature parameters of each frame data by using the same processing method as the audio data;
  • a first classifying unit configured to divide feature parameters of the voice sample data into a plurality of voice categories according to a K-means algorithm
  • a first initial parameter obtaining unit configured to acquire an initial weight, an initial mean value, and an initial covariance of a Gaussian submodel corresponding to each voice category;
  • the first model optimization unit is configured to perform iterative optimization on the weight, the mean value and the covariance of the Gaussian submodel corresponding to each voice category by using an EM algorithm to obtain a Gaussian mixture model of the voice category;
  • the second model building module includes:
  • a second pre-processing unit configured to perform frame-by-frame processing on the noise sample data and extract feature parameters of each frame data by using the same processing method as the audio data;
  • a second classifying unit configured to divide a feature parameter of the noise sample data into a plurality of noise categories according to a K-means algorithm
  • a second initial parameter obtaining unit configured to acquire an initial weight, an initial mean, and an initial covariance of the Gaussian submodel corresponding to each noise category
  • the second model optimization unit is configured to iteratively optimize the weight, mean and covariance of the Gaussian submodel corresponding to each noise class by using the EM algorithm to obtain a Gaussian mixture model of the noise class.
  • the second probability acquisition module comprises:
  • a posterior probability acquiring unit configured to apply the Bayes formula to the speech-class conditional probability p(x_T/Y_1) of the current frame data and the noise-class conditional probability p(x_T/Y_2) of the current frame data, so as to calculate the posterior probability p'(Y_1/x_T) that the current frame data belongs to a speech frame and the posterior probability p'(Y_2/x_T) that it belongs to a noise frame;
  • T is the frame number of the current frame data in the audio data;
  • x_T is the feature parameter of the current frame data;
  • T-W+1 is the frame number of the first frame of the W-frame window ending at the current frame;
  • W is a preset value.
  • any adjacent two frames of data obtained by performing frame processing on the audio data have overlapping portions.
  • the gain control module comprises:
  • a first gain control unit configured to, when the current frame data is determined to be a speech frame, acquire the time-domain energy of the current frame data, calculate the ratio of a preset expected energy value to the time-domain energy, and multiply each data point of the current frame data by the ratio to amplify or attenuate the current frame data;
  • a second gain control unit configured to keep the current frame data unchanged when the current frame data is determined to be a noise frame.
  • an embodiment of the present invention provides an automatic gain control method and apparatus for audio data, where the method includes: performing framing processing on the audio data and extracting feature parameters of each frame of data; obtaining the speech-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured speech-class Gaussian mixture model; obtaining the noise-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured noise-class Gaussian mixture model; calculating the probability that the current frame data belongs to a speech frame and the probability that it belongs to a noise frame according to the two conditional probabilities; determining the current frame data as a speech frame when the probability that it belongs to a speech frame is greater than the probability that it belongs to a noise frame, and as a noise frame when the former is less than the latter; and controlling the gain of the current frame data according to the corresponding pre-configured gain control rule.
  • the method remains applicable even as the noise changes with the environment.
  • in the embodiment of the present invention, by introducing Gaussian mixture models, whether the current frame is a speech segment or a noise segment can be determined very accurately, and gain control is performed on the speech segments and the noise segments respectively, implementing automatic gain control while avoiding erroneous amplification of the noise.
  • the technical solution of the invention greatly improves the ability to distinguish speech from noise and performs automatic gain control accordingly, thereby effectively improving the user experience.
  • FIG. 1 is a schematic flow chart of an automatic gain control method for audio data according to an embodiment of the present invention
  • FIG. 2 is a structural block diagram of an automatic gain control apparatus for audio data according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of an automatic gain control method for audio data according to an embodiment of the present invention, including:
  • any adjacent two frames of data obtained by performing frame processing on the audio data have overlapping portions.
  • the framing can adopt continuous segmentation, but overlapping segmentation makes the frame-to-frame transition smooth and maintains continuity.
  • the overlapping portion between the previous frame and the next frame is called the frame shift, and the ratio of the frame shift to the frame length is preferably 0 to 1/2.
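The overlapping framing described above can be sketched as follows. This is a minimal NumPy sketch; the 16 kHz sample rate, 400-sample frame length, and 200-sample frame shift (a shift-to-length ratio of 1/2, the upper end of the preferred range) are illustrative values, not taken from the patent, and `frame_audio` is a hypothetical name.

```python
import numpy as np

def frame_audio(signal, frame_len=400, frame_shift=200):
    """Split a 1-D signal into overlapping frames.

    With frame_shift < frame_len, adjacent frames share
    frame_len - frame_shift samples, smoothing the transition
    between frames as described above.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])

# Example: 1 s of 16 kHz audio, 25 ms frames, 12.5 ms shift
sig = np.arange(16000, dtype=float)
frames = frame_audio(sig)
```

Each row of `frames` is one frame, ready for per-frame feature extraction in the next step.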
  • the method for extracting the feature parameters may be an MFCC (Mel Frequency Cepstral Coefficient) algorithm, an LPC (Linear Predictive Analysis) algorithm, an LPL algorithm, or the like.
  • S2: obtaining the speech-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured speech-class Gaussian mixture model; and obtaining the noise-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured noise-class Gaussian mixture model;
  • the current frame data may also be determined as a speech frame or a noise frame according to a preset setting, as those skilled in the art will understand.
  • the automatic gain control method of the audio data further comprises the steps of constructing a speech class Gaussian mixture model and constructing a noise class Gaussian mixture model;
  • the step of constructing a voice category Gaussian mixture model specifically includes:
  • the weight, mean and covariance of the Gaussian submodel corresponding to each speech category are iteratively optimized by the EM (expectation-maximization) algorithm to obtain the Gaussian mixture model of the speech class;
  • the step of constructing the noise class Gaussian mixture model specifically includes:
  • the weight, mean and covariance of the Gaussian submodel corresponding to each noise class are iteratively optimized by the EM algorithm to obtain the Gaussian mixture model of the noise class.
  • through the above steps, a Gaussian mixture model of the speech class and a Gaussian mixture model of the noise class can be constructed. Since the steps of constructing the two models are basically the same, constructing the speech-class Gaussian mixture model is taken as an example below.
  • the voice sample data is divided into m frame data, and the feature parameters of the voice sample data are divided into K voice categories according to the K-means algorithm, that is, the voice category Gaussian mixture model is composed of K Gaussian sub-models.
  • from this partition, the initial mean value and the initial covariance of the Gaussian submodel corresponding to each speech category can be obtained, and an initial weight is set for every Gaussian submodel;
  • t is the number of iterations, with t greater than or equal to 0; N(·) is a standard Gaussian function; x_i represents the feature parameter of the i-th frame of speech sample data.
  • substituting the feature parameter x_T of the current frame data into the speech-class Gaussian mixture model p(x/Y_1) yields the speech-class conditional probability p(x_T/Y_1) of the current frame data.
  • similarly, the noise-class Gaussian mixture model p(x/Y_2) can be obtained; substituting the feature parameter x_T of the current frame data into p(x/Y_2) yields the noise-class conditional probability p(x_T/Y_2) of the current frame data.
  • the Gaussian mixture model of the noise class and the Gaussian mixture model of the speech class are identical in form, as both are Gaussian mixture models, but the number of Gaussian submodels and the specific parameters may differ, as those skilled in the art will understand.
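The model-construction steps above (K-means partition, initial parameters, EM refinement) can be sketched as follows. This is a minimal NumPy sketch, not the patent's implementation: it assumes diagonal covariances, equal initial weights of 1/K, and a fixed iteration count, none of which the text specifies (the original weight/mean/covariance update formulas are not reproduced in this extraction); `fit_gmm` and `gmm_likelihood` are hypothetical names.

```python
import numpy as np

def fit_gmm(X, K, n_iter=50, seed=0):
    """Fit a K-component, diagonal-covariance Gaussian mixture by EM,
    initialised from a K-means-style partition of the sample features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # K-means initialisation: partition the features into K categories
    means = X[rng.choice(n, K, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(10):
        labels = ((X[:, None, :] - means[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            pts = X[labels == k]
            if len(pts):
                means[k] = pts.mean(0)
    # Initial covariance from each cluster; equal initial weights (assumed 1/K)
    covs = np.ones((K, d))
    for k in range(K):
        pts = X[labels == k]
        if len(pts) > 1:
            covs[k] = np.maximum(pts.var(0), 1e-6)
    weights = np.full(K, 1.0 / K)
    # EM iterations: refine weight, mean and covariance of each submodel
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each frame
        log_p = (-0.5 * (((X[:, None] - means) ** 2) / covs
                         + np.log(2 * np.pi * covs)).sum(-1)
                 + np.log(weights))
        log_p -= log_p.max(1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(1, keepdims=True)
        # M-step: re-estimate the mixture parameters
        Nk = np.maximum(resp.sum(0), 1e-10)
        weights = Nk / n
        means = (resp.T @ X) / Nk[:, None]
        covs = np.maximum((resp.T @ X ** 2) / Nk[:, None] - means ** 2, 1e-6)
    return weights, means, covs

def gmm_likelihood(x, weights, means, covs):
    """Mixture density p(x | class) of one feature vector x."""
    comp = np.exp(-0.5 * (((x - means) ** 2) / covs
                          + np.log(2 * np.pi * covs)).sum(-1))
    return float((weights * comp).sum())

# Example: two well-separated 2-D feature clusters, K = 2 submodels
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])
weights, means, covs = fit_gmm(X, K=2)
```

Fitting one such mixture on speech sample features and another on noise sample features gives the two models p(x/Y_1) and p(x/Y_2); evaluating `gmm_likelihood` at x_T gives the per-frame conditional probabilities.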
  • in step S3, the probability that the current frame data belongs to a speech frame and the probability that it belongs to a noise frame are calculated according to the speech-class conditional probability and the noise-class conditional probability of the current frame data, including:
  • the posterior probability that the current frame data belongs to the speech frame is p'(Y_1/x_T) = p(x_T/Y_1)p(Y_1) / [p(x_T/Y_1)p(Y_1) + p(x_T/Y_2)p(Y_2)];
  • the posterior probability that the current frame data belongs to the noise frame is p'(Y_2/x_T) = p(x_T/Y_2)p(Y_2) / [p(x_T/Y_1)p(Y_1) + p(x_T/Y_2)p(Y_2)];
  • p(Y_1) is the prior probability of the speech class, and p(Y_2) is the prior probability of the noise class;
  • T is the frame number of the current frame data in the audio data;
  • x_T is the feature parameter of the current frame data;
  • T-W+1 is the frame number of the first frame of the W-frame window ending at the current frame;
  • W is a preset value.
  • p(Y_1/x_T) is the result of weighted smoothing of p'(Y_1/x_T); similarly, p(Y_2/x_T) is the result of weighted smoothing of p'(Y_2/x_T);
  • W represents the window width of the weighted smoothing.
  • from p'(Y_1/x_T) and p'(Y_2/x_T) alone it can already be determined whether the current frame data belongs to a speech frame or a noise frame, but speech and noise usually span multiple consecutive frames; weighted smoothing makes the transition of the recognition result more stable and suppresses occasional abnormal jumps.
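The Bayes decision with windowed smoothing described above can be sketched as follows. Uniform weighting over the last W frames is an assumption of this sketch (the text fixes only the window width W, not the smoothing weights), and `classify_frames`, the default prior of 0.5, and the example values of W are illustrative.

```python
import numpy as np

def classify_frames(p_x_speech, p_x_noise, prior_speech=0.5, W=5):
    """Per-frame speech/noise decision with windowed smoothing.

    p_x_speech[t] is p(x_t/Y1) from the speech-class mixture model and
    p_x_noise[t] is p(x_t/Y2) from the noise-class model.  The Bayes
    formula gives the per-frame posterior p'(Y1/x_t); averaging it over
    frames T-W+1 .. T gives the smoothed p(Y1/x_T) that is compared
    against p(Y2/x_T) (equivalently, against 0.5).
    """
    p_x_speech = np.asarray(p_x_speech, dtype=float)
    p_x_noise = np.asarray(p_x_noise, dtype=float)
    num = p_x_speech * prior_speech
    den = num + p_x_noise * (1.0 - prior_speech)
    post = num / np.maximum(den, 1e-12)           # p'(Y1/x_t)
    smoothed = np.empty_like(post)
    for t in range(len(post)):
        lo = max(0, t - W + 1)                    # window T-W+1 .. T
        smoothed[t] = post[lo : t + 1].mean()     # p(Y1/x_T)
    return smoothed > 0.5                         # True -> speech frame
```

Because each smoothed posterior averages the current frame with its recent history, a single anomalous frame in the middle of a speech run no longer flips the decision.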
  • step S5, in which the gain of the current frame data is controlled according to a pre-configured speech frame gain control rule when the current frame data is determined to be a speech frame, and according to a pre-configured noise frame gain control rule when the current frame data is determined to be a noise frame, includes:
  • when the current frame data is determined to be a speech frame, acquiring the time-domain energy of the current frame data, calculating the ratio of a preset expected energy value to the time-domain energy, and multiplying each data point of the current frame data by the ratio to amplify or attenuate the current frame data;
  • when the current frame data is determined to be a noise frame, keeping the current frame data unchanged.
  • when the ratio is greater than 1, the time-domain energy has not reached the expected energy value and the current frame data needs to be amplified; when the ratio is less than 1, the time-domain energy exceeds the expected energy value and the current frame data needs to be attenuated.
  • through step S5, a speech frame can be amplified or attenuated according to its time-domain energy to achieve automatic gain control, while a noise frame remains unchanged, thereby avoiding erroneously amplifying noise frames.
  • the above is only one implementation of the speech frame gain control rule and the noise frame gain control rule; the purpose is to automatically amplify or attenuate the gain of speech frames while avoiding amplification of noise frames. Other implementations, for example compressing the gain of a noise frame, are also optional.
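The speech-frame gain rule above (scale by the ratio of the expected energy to the time-domain energy; leave noise frames unchanged) can be sketched as follows. Treating the energy ratio as a direct per-sample multiplier follows the text literally; whether a square root should instead be applied so that the output energy exactly matches the target is not specified, so this sketch adopts the literal reading. `apply_gain` and the energy values in the usage example are illustrative.

```python
import numpy as np

def apply_gain(frame, is_speech, expected_energy):
    """Speech frames: multiply every data point by the ratio of the
    preset expected energy to the frame's time-domain energy.
    Noise frames: keep the data unchanged."""
    if not is_speech:
        return frame                       # noise frame stays as-is
    energy = float(np.sum(frame ** 2))     # time-domain energy
    if energy == 0.0:
        return frame                       # silent frame: nothing to scale
    ratio = expected_energy / energy       # > 1 amplify, < 1 attenuate
    return frame * ratio                   # multiply each data point

# Usage: a low-energy speech frame is amplified toward the target
out = apply_gain(np.ones(4), is_speech=True, expected_energy=8.0)
```

A per-frame guard for zero energy is added here so silent frames are not divided by zero; the patent text does not address that case.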
  • FIG. 2 is a structural block diagram of an automatic gain control apparatus for audio data according to an embodiment of the present invention.
  • the automatic gain control device of the audio data includes:
  • the pre-processing module 1 is configured to perform frame processing on the audio data, and extract feature parameters of each frame data;
  • the first probability acquisition module 2 is configured to obtain the speech-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured speech-class Gaussian mixture model, and to obtain the noise-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured noise-class Gaussian mixture model;
  • the second probability acquisition module 3 is configured to calculate a probability that the current frame data belongs to a voice frame and a probability of belonging to a noise frame according to a voice class condition probability of the current frame data and a noise class condition probability of the current frame data;
  • the determining module 4 is configured to determine the current frame audio data as a speech frame when the probability that the current frame data belongs to a speech frame is greater than the probability that it belongs to a noise frame, and to determine the current frame data as a noise frame when the probability that it belongs to a speech frame is less than the probability that it belongs to a noise frame;
  • a gain control module 5 configured to: when the current frame data is determined to be a voice frame, control a gain of the current frame data according to a preset voice frame gain control rule, and when the current frame data is determined to be a noise frame The gain of the current frame data is controlled in accordance with a pre-configured noise frame gain control rule.
  • the automatic gain control device of the audio data further includes a first model building module and a second model building module;
  • the first model building module includes:
  • a first pre-processing unit configured to perform frame-by-frame processing on the voice sample data and extract feature parameters of each frame data by using the same processing method as the audio data;
  • a first classifying unit configured to divide feature parameters of the voice sample data into a plurality of voice categories according to a K-means algorithm
  • a first initial parameter obtaining unit configured to acquire an initial weight, an initial mean value, and an initial covariance of a Gaussian submodel corresponding to each voice category;
  • the first model optimization unit is configured to perform iterative optimization on the weight, the mean value and the covariance of the Gaussian submodel corresponding to each voice category by using an EM algorithm to obtain a Gaussian mixture model of the voice category;
  • the second model building module includes:
  • a second pre-processing unit configured to perform frame-by-frame processing on the noise sample data and extract feature parameters of each frame data by using the same processing method as the audio data;
  • a second classifying unit configured to divide a feature parameter of the noise sample data into a plurality of noise categories according to a K-means algorithm
  • a second initial parameter obtaining unit configured to acquire an initial weight, an initial mean, and an initial covariance of the Gaussian submodel corresponding to each noise category
  • the second model optimization unit is configured to iteratively optimize the weight, mean and covariance of the Gaussian submodel corresponding to each noise class by using the EM algorithm to obtain a Gaussian mixture model of the noise class.
  • the second probability acquisition module 3 includes:
  • a posterior probability acquiring unit configured to apply the Bayes formula to the speech-class conditional probability p(x_T/Y_1) of the current frame data and the noise-class conditional probability p(x_T/Y_2) of the current frame data, so as to calculate the posterior probability p'(Y_1/x_T) that the current frame data belongs to a speech frame and the posterior probability p'(Y_2/x_T) that it belongs to a noise frame;
  • T is the frame number of the current frame data in the audio data;
  • x_T is the feature parameter of the current frame data;
  • T-W+1 is the frame number of the first frame of the W-frame window ending at the current frame;
  • W is a preset value.
  • any adjacent two frames of data obtained by performing frame processing on the audio data have overlapping portions.
  • the gain control module 5 comprises:
  • a first gain control unit configured to, when the current frame data is determined to be a speech frame, acquire the time-domain energy of the current frame data, calculate the ratio of a preset expected energy value to the time-domain energy, and multiply each data point of the current frame data by the ratio to amplify or attenuate the current frame data;
  • a second gain control unit configured to keep the current frame data unchanged when the current frame data is determined to be a noise frame.
  • the automatic gain control apparatus for audio data provided by the embodiment of the present invention is used to perform the above automatic gain control method for audio data; the beneficial effects and working principles of the two correspond one to one and thus are not described again.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Abstract

A method and an apparatus for automatically controlling the gain of audio data. The method comprises: dividing audio data into frames, and extracting the feature parameters of each frame of the data (S1); according to the feature parameters of the current frame of the data and a speech-type Gaussian mixed model, obtaining the speech-type conditional probability of the current frame of the data, and according to the feature parameters of the current frame of the data and a pre-configured noise-type Gaussian mixed model, obtaining the noise-type conditional probability of the current frame of the data (S2); according to the speech-type conditional probability of the current frame of the data and the noise-type conditional probability of the current frame of the data, calculating the probability of the current frame of the data being a speech frame and the probability of the current frame of the data being a noise frame (S3); when the probability of the current frame of the data being the speech frame is greater than the probability of the current frame of the data being the noise frame, determining the current frame of the audio data as the speech frame, and when the probability of the current frame of the data being the speech frame is less than the probability of the current frame of the data being the noise frame, determining the current frame of the data to be a noise frame (S4); when the current frame of the data is determined to be a speech frame, controlling the gain thereof on the basis of a pre-configured speech frame gain control rule, and when the current frame of the data is determined to be a noise frame, controlling the gain thereof on the basis of a pre-configured noise frame gain control rule (S5). The method can improve the level of identifying speech and noise, thereby automatically controlling the gain, and effectively improving user experience.

Description

一种音频数据的自动增益控制方法与装置Automatic gain control method and device for audio data 技术领域Technical field
本发明涉及音频信号处理技术,尤其涉及一种音频数据的自动增益控制方法及装置。The present invention relates to audio signal processing technologies, and in particular, to an automatic gain control method and apparatus for audio data.
背景技术Background technique
在语音信号处理过程中,不同音频信号的音量强度往往是不一样的,且伴随有噪声,但作为用户,期望与每个人之间的通话都是相同的音量强度而不通过音量键的控制来实现,提升用户体验。现有的自动增益控制方法通过分析出音频信号中的语音部分和噪声部分,分别对这两部分进行增益控制。In the process of speech signal processing, the volume intensity of different audio signals is often different, and accompanied by noise, but as a user, it is expected that the conversation with each person is the same volume intensity without the control of the volume keys. Realize and enhance the user experience. The existing automatic gain control method performs gain control on the two parts by analyzing the speech part and the noise part in the audio signal.
现有的自动增益控制方法都是通过时域分析来区分语音与噪声,这种区分方法的局限性较大,无法有效地区分语音和噪声的特征,往往会把语音识别为噪声,或者将噪声识别为语音,造成错误地对音频信号进行增益控制。例如,在人工耳蜗/助听器设备中,若错误地将噪声进行放大,对使用者的体验是非常差的,甚至会造成使用者严重的不舒适感。The existing automatic gain control methods use time domain analysis to distinguish between speech and noise. This method of differentiation has great limitations and cannot effectively distinguish the characteristics of speech and noise. It often recognizes speech as noise or noise. Recognized as speech, causing erroneous gain control of the audio signal. For example, in a cochlear/hearing aid device, if the noise is erroneously amplified, the user experience is very poor, and even the user may be seriously uncomfortable.
Summary of the Invention

In view of the above problems, an object of the present invention is to provide an automatic gain control method and apparatus for audio data that can accurately and effectively distinguish the speech portion and the noise portion of audio data and apply gain control to each separately, thereby greatly improving user comfort.

To achieve the above object, one aspect of the present invention provides an automatic gain control method for audio data, comprising:
performing framing processing on the audio data, and extracting the feature parameters of each frame of data;

obtaining the speech-class conditional probability of the current frame of data according to the feature parameters of the current frame of data and a pre-configured speech-class Gaussian mixture model; and obtaining the noise-class conditional probability of the current frame of data according to the feature parameters of the current frame of data and a pre-configured noise-class Gaussian mixture model;

calculating the probability that the current frame of data is a speech frame and the probability that it is a noise frame, according to the speech-class conditional probability and the noise-class conditional probability of the current frame of data;

determining the current frame of data to be a speech frame when its probability of being a speech frame is greater than its probability of being a noise frame; and determining it to be a noise frame when its probability of being a speech frame is less than its probability of being a noise frame;

when the current frame of data is determined to be a speech frame, controlling its gain according to a pre-configured speech-frame gain control rule; and when it is determined to be a noise frame, controlling its gain according to a pre-configured noise-frame gain control rule.
Preferably, the automatic gain control method for audio data further comprises a step of constructing the speech-class Gaussian mixture model and a step of constructing the noise-class Gaussian mixture model;

the step of constructing the speech-class Gaussian mixture model specifically comprises:

performing framing processing on speech sample data and extracting the feature parameters of each frame of data, using the same processing method as for the audio data;

dividing the feature parameters of the speech sample data into several speech classes according to the K-means algorithm;

obtaining the initial weight, initial mean, and initial covariance of the Gaussian sub-model corresponding to each speech class;

iteratively optimizing the weight, mean, and covariance of the Gaussian sub-model corresponding to each speech class by means of the EM algorithm, to obtain the speech-class Gaussian mixture model;

the step of constructing the noise-class Gaussian mixture model specifically comprises:

performing framing processing on noise sample data and extracting the feature parameters of each frame of data, using the same processing method as for the audio data;

dividing the feature parameters of the noise sample data into several noise classes according to the K-means algorithm;

obtaining the initial weight, initial mean, and initial covariance of the Gaussian sub-model corresponding to each noise class;

iteratively optimizing the weight, mean, and covariance of the Gaussian sub-model corresponding to each noise class by means of the EM algorithm, to obtain the noise-class Gaussian mixture model.
Preferably, calculating the probability that the current frame of data is a speech frame and the probability that it is a noise frame, according to the speech-class conditional probability and the noise-class conditional probability of the current frame of data, comprises:

calculating, from the speech-class conditional probability p(x_T/Y_1) and the noise-class conditional probability p(x_T/Y_2) of the current frame of data, combined with the Bayes formula, the posterior probability p'(Y_1/x_T) that the current frame of data is a speech frame and the posterior probability p'(Y_2/x_T) that it is a noise frame;

calculating p(Y_1/x_T) according to p(Y_1/x_T) = α_1·p(Y_1/x_{T−W+1}) + … + α_{W−1}·p(Y_1/x_{T−1}) + α_W·p'(Y_1/x_T);

calculating p(Y_2/x_T) according to p(Y_2/x_T) = α_1·p(Y_2/x_{T−W+1}) + … + α_{W−1}·p(Y_2/x_{T−1}) + α_W·p'(Y_2/x_T);

where

α_j = exp(−(j−W)²/(2σ²)) / Σ_{i=1}^{W} exp(−(i−W)²/(2σ²)),  j = 1, …, W;

T is the frame number of the current frame of data in the audio data; x_T is the feature parameter of the current frame of data; T−W+1 is the frame number of the earliest of the W frames ending at the current frame; W and σ are preset values.
Preferably, any two adjacent frames of data obtained by framing the audio data have an overlapping portion.
Preferably, controlling the gain of the current frame of data according to the pre-configured speech-frame gain control rule when it is determined to be a speech frame, and according to the pre-configured noise-frame gain control rule when it is determined to be a noise frame, comprises:

when the current frame of data is determined to be a speech frame, obtaining the time-domain energy of the current frame of data, calculating the ratio of a preset expected energy value to the time-domain energy, and multiplying each data point of the current frame of data by the ratio to amplify or attenuate the current frame of data;

when the current frame of data is determined to be a noise frame, keeping the current frame of data unchanged.
Another aspect of an embodiment of the present invention provides an automatic gain control apparatus for audio data, comprising:

a preprocessing module, configured to perform framing processing on the audio data and extract the feature parameters of each frame of data;

a first probability acquisition module, configured to obtain the speech-class conditional probability of the current frame of data according to the feature parameters of the current frame of data and a pre-configured speech-class Gaussian mixture model, and to obtain the noise-class conditional probability of the current frame of data according to the feature parameters of the current frame of data and a pre-configured noise-class Gaussian mixture model;

a second probability acquisition module, configured to calculate the probability that the current frame of data is a speech frame and the probability that it is a noise frame, according to the speech-class conditional probability and the noise-class conditional probability of the current frame of data;

a determination module, configured to determine the current frame of data to be a speech frame when its probability of being a speech frame is greater than its probability of being a noise frame, and to determine it to be a noise frame when its probability of being a speech frame is less than its probability of being a noise frame;

a gain control module, configured to control the gain of the current frame of data according to a pre-configured speech-frame gain control rule when the current frame of data is determined to be a speech frame, and to control the gain of the current frame of data according to a pre-configured noise-frame gain control rule when it is determined to be a noise frame.
Preferably, the automatic gain control apparatus for audio data further comprises a first model construction module and a second model construction module;

the first model construction module comprises:

a first preprocessing unit, configured to perform framing processing on speech sample data and extract the feature parameters of each frame of data, using the same processing method as for the audio data;

a first classification unit, configured to divide the feature parameters of the speech sample data into several speech classes according to the K-means algorithm;

a first initial parameter acquisition unit, configured to obtain the initial weight, initial mean, and initial covariance of the Gaussian sub-model corresponding to each speech class;

a first model optimization unit, configured to iteratively optimize the weight, mean, and covariance of the Gaussian sub-model corresponding to each speech class by means of the EM algorithm, to obtain the speech-class Gaussian mixture model;

the second model construction module comprises:

a second preprocessing unit, configured to perform framing processing on noise sample data and extract the feature parameters of each frame of data, using the same processing method as for the audio data;

a second classification unit, configured to divide the feature parameters of the noise sample data into several noise classes according to the K-means algorithm;

a second initial parameter acquisition unit, configured to obtain the initial weight, initial mean, and initial covariance of the Gaussian sub-model corresponding to each noise class;

a second model optimization unit, configured to iteratively optimize the weight, mean, and covariance of the Gaussian sub-model corresponding to each noise class by means of the EM algorithm, to obtain the noise-class Gaussian mixture model.
Preferably, the second probability acquisition module comprises:

a posterior probability acquisition unit, configured to calculate, from the speech-class conditional probability p(x_T/Y_1) and the noise-class conditional probability p(x_T/Y_2) of the current frame of data, combined with the Bayes formula, the posterior probability p'(Y_1/x_T) that the current frame of data is a speech frame and the posterior probability p'(Y_2/x_T) that it is a noise frame;

a probability weighted smoothing unit, configured to calculate p(Y_1/x_T) according to p(Y_1/x_T) = α_1·p(Y_1/x_{T−W+1}) + … + α_{W−1}·p(Y_1/x_{T−1}) + α_W·p'(Y_1/x_T); and to calculate p(Y_2/x_T) according to p(Y_2/x_T) = α_1·p(Y_2/x_{T−W+1}) + … + α_{W−1}·p(Y_2/x_{T−1}) + α_W·p'(Y_2/x_T);

where

α_j = exp(−(j−W)²/(2σ²)) / Σ_{i=1}^{W} exp(−(i−W)²/(2σ²)),  j = 1, …, W;

T is the frame number of the current frame of data in the audio data; x_T is the feature parameter of the current frame of data; T−W+1 is the frame number of the earliest of the W frames ending at the current frame; W and σ are preset values.
Preferably, any two adjacent frames of data obtained by framing the audio data have an overlapping portion.
Preferably, the gain control module comprises:

a first gain control unit, configured to, when the current frame of data is determined to be a speech frame, obtain the time-domain energy of the current frame of data, calculate the ratio of a preset expected energy value to the time-domain energy, and multiply each data point of the current frame of data by the ratio to amplify or attenuate the current frame of data;

a second gain control unit, configured to keep the current frame of data unchanged when it is determined to be a noise frame.
Compared with the prior art, the embodiments of the present invention are beneficial in that they provide an automatic gain control method and apparatus for audio data, the method comprising: performing framing processing on audio data and extracting the feature parameters of each frame of data; obtaining the speech-class conditional probability of the current frame of data according to its feature parameters and a pre-configured speech-class Gaussian mixture model, and obtaining the noise-class conditional probability of the current frame of data according to its feature parameters and a pre-configured noise-class Gaussian mixture model; calculating, from these two conditional probabilities, the probability that the current frame of data is a speech frame and the probability that it is a noise frame; determining the current frame of data to be a speech frame when the former probability is greater, and a noise frame when it is smaller; and, when the current frame of data is determined to be a speech frame, controlling its gain according to a pre-configured speech-frame gain control rule, and, when it is determined to be a noise frame, controlling its gain according to a pre-configured noise-frame gain control rule. In real-time voice communication, because usage environments are diverse, the noise changes as the environment changes. By introducing Gaussian mixture models, the embodiments of the present invention determine very accurately whether the current frame is a speech segment or a noise segment, and apply gain control to the speech segments and the noise segments separately, achieving automatic gain control and avoiding erroneous amplification of noise. The technical solution of the present invention greatly improves the accuracy of distinguishing speech from noise, performs automatic gain control accordingly, and effectively improves the user experience.
Brief Description of the Drawings

In order to illustrate the technical solutions of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of an automatic gain control method for audio data according to an embodiment of the present invention;

FIG. 2 is a structural block diagram of an automatic gain control apparatus for audio data according to an embodiment of the present invention.
Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of the present invention.
Referring to FIG. 1, which is a schematic flowchart of an automatic gain control method for audio data according to an embodiment of the present invention, the method comprises:

S1: performing framing processing on the audio data, and extracting the feature parameters of each frame of data.

Preferably, any two adjacent frames of data obtained by framing the audio data have an overlapping portion. Although framing can be done with contiguous, non-overlapping segments, overlapping segmentation makes the transition from frame to frame smooth and preserves continuity. The overlapping portion of the previous frame and the next frame is called the frame shift, and the ratio of frame shift to frame length is preferably 0 to 1/2.

The feature parameters may be extracted with an MFCC (Mel-frequency cepstral coefficient) algorithm, an LPC (linear prediction) algorithm, an LPL (linear prediction) algorithm, or the like.
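As a minimal sketch of the framing in S1 (not part of the patent's implementation: the frame length of 320 samples, shift of 160 samples, and the use of log-energy as a stand-in feature are illustrative assumptions):

```python
import numpy as np

def frame_signal(x, frame_len=320, frame_shift=160):
    """Split a 1-D signal into overlapping frames.

    frame_shift < frame_len gives adjacent frames an overlapping
    portion, so frame-to-frame transitions stay smooth.
    """
    n_frames = 1 + max(0, (len(x) - frame_len) // frame_shift)
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

def log_energy(frames):
    """A simple per-frame feature (stand-in for MFCC/LPC features)."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-12)

signal = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
frames = frame_signal(signal)   # shape: (n_frames, frame_len)
feats = log_energy(frames)      # one feature value per frame
```

With a shift of half the frame length, the second half of each frame is repeated as the first half of the next, which is exactly the overlap described above.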
S2: obtaining the speech-class conditional probability of the current frame of data according to its feature parameters and a pre-configured speech-class Gaussian mixture model; and obtaining the noise-class conditional probability of the current frame of data according to its feature parameters and a pre-configured noise-class Gaussian mixture model.

S3: calculating the probability that the current frame of data is a speech frame and the probability that it is a noise frame, according to the speech-class conditional probability and the noise-class conditional probability of the current frame of data.

S4: determining the current frame of data to be a speech frame when its probability of being a speech frame is greater than its probability of being a noise frame; and determining it to be a noise frame when its probability of being a speech frame is less than its probability of being a noise frame.

It should be noted that, when the probability that the current frame of data is a speech frame equals the probability that it is a noise frame, the current frame of data may be determined to be either a speech frame or a noise frame according to a preset setting, as those skilled in the art will understand.

S5: when the current frame of data is determined to be a speech frame, controlling its gain according to a pre-configured speech-frame gain control rule; and when it is determined to be a noise frame, controlling its gain according to a pre-configured noise-frame gain control rule.
In real-time voice communication, because usage environments are diverse, the noise changes as the environment changes. By introducing Gaussian mixture models, this embodiment of the present invention determines very accurately whether the current frame is a speech segment or a noise segment, and applies gain control to the speech segments and the noise segments separately, achieving automatic gain control and avoiding erroneous amplification of noise. The technical solution of the present invention greatly improves the accuracy of distinguishing speech from noise, performs automatic gain control accordingly, and effectively improves the user experience.
Preferably, the automatic gain control method for audio data further comprises a step of constructing the speech-class Gaussian mixture model and a step of constructing the noise-class Gaussian mixture model;

the step of constructing the speech-class Gaussian mixture model specifically comprises:

performing framing processing on speech sample data and extracting the feature parameters of each frame of data, using the same processing method as for the audio data;

dividing the feature parameters of the speech sample data into several speech classes according to the K-means algorithm;

obtaining the initial weight, initial mean, and initial covariance of the Gaussian sub-model corresponding to each speech class;

iteratively optimizing the weight, mean, and covariance of the Gaussian sub-model corresponding to each speech class by means of the EM (expectation-maximization) algorithm, to obtain the speech-class Gaussian mixture model;

the step of constructing the noise-class Gaussian mixture model specifically comprises:

performing framing processing on noise sample data and extracting the feature parameters of each frame of data, using the same processing method as for the audio data;

dividing the feature parameters of the noise sample data into several noise classes according to the K-means algorithm;

obtaining the initial weight, initial mean, and initial covariance of the Gaussian sub-model corresponding to each noise class;

iteratively optimizing the weight, mean, and covariance of the Gaussian sub-model corresponding to each noise class by means of the EM algorithm, to obtain the noise-class Gaussian mixture model.
Through the above steps, the speech-class Gaussian mixture model and the noise-class Gaussian mixture model can be constructed. Since the steps for constructing the two models are essentially the same, the construction of the speech-class Gaussian mixture model is described in detail below as an example.

1. Assume that the speech sample data is divided into m frames of data, and that the feature parameters of the speech sample data are divided into K speech classes according to the K-means algorithm; that is, the speech-class Gaussian mixture model consists of K Gaussian sub-models.
2. For the k-th Gaussian sub-model, its initial mean

μ_k^(0) = (1/m_k)·Σ_{x_i ∈ class k} x_i

and initial covariance

C_k^(0) = (1/m_k)·Σ_{x_i ∈ class k} (x_i − μ_k^(0))·(x_i − μ_k^(0))^T

can be obtained from the m_k frames that the K-means algorithm assigned to the k-th class, and the initial weight of every Gaussian sub-model is set to

ω_k^(0) = 1/K.
3. The mean μ_k, covariance C_k, and weight ω_k of the k-th Gaussian sub-model are iteratively optimized:

γ_{i,k}^(t) = ω_k^(t)·N(x_i; μ_k^(t), C_k^(t)) / Σ_{j=1}^{K} ω_j^(t)·N(x_i; μ_j^(t), C_j^(t)),

ω_k^(t+1) = (1/m)·Σ_{i=1}^{m} γ_{i,k}^(t),

μ_k^(t+1) = Σ_{i=1}^{m} γ_{i,k}^(t)·x_i / Σ_{i=1}^{m} γ_{i,k}^(t),

C_k^(t+1) = Σ_{i=1}^{m} γ_{i,k}^(t)·(x_i − μ_k^(t+1))·(x_i − μ_k^(t+1))^T / Σ_{i=1}^{m} γ_{i,k}^(t),
where t is the iteration index, t ≥ 0;

N(x; μ, C) = (2π)^{−d/2}·|C|^{−1/2}·exp(−(1/2)·(x − μ)^T·C^{−1}·(x − μ))

is the standard Gaussian density (d being the dimension of the feature parameters); and x_i denotes the feature parameter of the i-th frame of speech sample data.
4. Assuming that the EM algorithm has converged at t = t1, ω_k^(t1) can be assigned to ω_k, μ_k^(t1) to μ_k, and C_k^(t1) to C_k, giving the speech-class Gaussian mixture model:

p(x/Y_1) = Σ_{k=1}^{K} ω_k·N(x; μ_k, C_k).
Substituting the feature parameter x_T of the current frame of data into the speech-class Gaussian mixture model p(x/Y_1) gives the speech-class conditional probability p(x_T/Y_1) of the current frame of data.

Similarly, the noise-class Gaussian mixture model p(x/Y_2) can be obtained; substituting the feature parameter x_T of the current frame of data into p(x/Y_2) gives the noise-class conditional probability p(x_T/Y_2) of the current frame of data. It should be noted that the noise-class Gaussian mixture model and the speech-class Gaussian mixture model have the same form (both are Gaussian mixture models), but the number of Gaussian sub-models and the concrete parameters of each may differ, as those skilled in the art will understand.
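The K-means-initialized EM training of steps 1–4 can be sketched as follows. This is an illustrative simplification, not the patent's implementation: it uses scalar (1-D) features, seeds the initial centers from percentiles instead of a full K-means run, and adds a small variance floor for numerical stability:

```python
import numpy as np

def fit_gmm(x, K=2, iters=50):
    """Fit a 1-D Gaussian mixture to samples x with EM.

    Initialization mimics the K-means step: each sample is assigned
    to its nearest of K spread-out initial centers, and each class's
    mean/variance seed one Gaussian sub-model with weight 1/K.
    """
    m = len(x)
    mu = np.percentile(x, 100 * (np.arange(K) + 0.5) / K)
    assign = np.argmin(np.abs(x[:, None] - mu[None, :]), axis=1)
    mu = np.array([x[assign == k].mean() for k in range(K)])
    var = np.array([x[assign == k].var() + 1e-6 for k in range(K)])
    w = np.full(K, 1.0 / K)                  # initial weights 1/K

    for _ in range(iters):
        # E-step: responsibility of sub-model k for sample i
        dens = (w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
                / np.sqrt(2 * np.pi * var))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weight, mean, variance
        nk = gamma.sum(axis=0)
        w = nk / m
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

def gmm_pdf(x, w, mu, var):
    """Mixture density p(x) = sum_k w_k N(x; mu_k, var_k)."""
    return (w * np.exp(-(x - mu) ** 2 / (2 * var))
            / np.sqrt(2 * np.pi * var)).sum()
```

Training one such model on speech-frame features and another on noise-frame features yields p(x/Y_1) and p(x/Y_2); evaluating `gmm_pdf` at x_T gives the class-conditional probabilities used in step S2.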
As a further improvement of the embodiment of the present invention, in step S3, calculating the probability that the current frame of data is a speech frame and the probability that it is a noise frame, according to the speech-class conditional probability and the noise-class conditional probability of the current frame of data, comprises:

S31: calculating, from the speech-class conditional probability p(x_T/Y_1) and the noise-class conditional probability p(x_T/Y_2) of the current frame of data, combined with the Bayes formula, the posterior probability p'(Y_1/x_T) that the current frame of data is a speech frame and the posterior probability p'(Y_2/x_T) that it is a noise frame.
Specifically, according to the Bayes formula, the posterior probability that the current frame of data is a speech frame is

p'(Y_1/x_T) = p(x_T/Y_1)·p(Y_1) / (p(x_T/Y_1)·p(Y_1) + p(x_T/Y_2)·p(Y_2)),

and the posterior probability that the current frame of data is a noise frame is

p'(Y_2/x_T) = p(x_T/Y_2)·p(Y_2) / (p(x_T/Y_1)·p(Y_1) + p(x_T/Y_2)·p(Y_2)),

where p(Y_1) is the prior probability of the speech class and p(Y_2) is the prior probability of the noise class. Because the occurrence probabilities of noise and speech cannot be estimated in a practical application scenario, the priors can be set equal, p(Y_1) = p(Y_2), so that p'(Y_1/x_T) and p'(Y_2/x_T) become:

p'(Y_1/x_T) = p(x_T/Y_1) / (p(x_T/Y_1) + p(x_T/Y_2)),

p'(Y_2/x_T) = p(x_T/Y_2) / (p(x_T/Y_1) + p(x_T/Y_2)).
S32: calculating p(Y_1/x_T) according to p(Y_1/x_T) = α_1·p(Y_1/x_{T−W+1}) + … + α_{W−1}·p(Y_1/x_{T−1}) + α_W·p'(Y_1/x_T); and calculating p(Y_2/x_T) according to p(Y_2/x_T) = α_1·p(Y_2/x_{T−W+1}) + … + α_{W−1}·p(Y_2/x_{T−1}) + α_W·p'(Y_2/x_T);

where

α_j = exp(−(j−W)²/(2σ²)) / Σ_{i=1}^{W} exp(−(i−W)²/(2σ²)),  j = 1, …, W;

T is the frame number of the current frame of data in the audio data; x_T is the feature parameter of the current frame of data; T−W+1 is the frame number of the earliest of the W frames ending at the current frame; W and σ are preset values.
p(Y_1/x_T) is the probability obtained by weighted smoothing of p'(Y_1/x_T); similarly, p(Y_2/x_T) is the probability obtained by weighted smoothing of p'(Y_2/x_T). W is the window width of the weighted smoothing.

α_1 to α_W are the weighting coefficients. From the expression for α_j, α_1 to α_W follow a Gaussian profile and α_1 + … + α_{W−1} + α_W = 1. Among α_1 to α_W, α_W is the largest; that is, the posterior probability of the current frame of data receives the largest weighting coefficient.

In principle, whether the current frame of data is a speech frame or a noise frame could be decided from the relative sizes of p'(Y_1/x_T) and p'(Y_2/x_T) alone, but speech and noise usually span multiple consecutive frames, and weighted smoothing makes the transitions of the recognition result smoother and prevents abnormal, abrupt flips.
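The posterior computation and weighted smoothing of S31–S32 can be sketched as follows (illustrative only: the window width `W` and `sigma` are hypothetical presets, and `cond_speech`/`cond_noise` stand for the per-frame conditional probabilities produced by the two mixture models):

```python
import numpy as np

def smoothed_posteriors(cond_speech, cond_noise, W=5, sigma=2.0):
    """Per-frame speech/noise posteriors with Gaussian-weighted smoothing.

    cond_speech[t] = p(x_t/Y1), cond_noise[t] = p(x_t/Y2).
    Returns (p(Y1/x_t), p(Y2/x_t)) for every frame t.
    """
    cond_speech = np.asarray(cond_speech, dtype=float)
    cond_noise = np.asarray(cond_noise, dtype=float)
    # Equal priors: posterior is the normalized conditional probability
    total = cond_speech + cond_noise
    post1 = cond_speech / total
    post2 = cond_noise / total
    # Gaussian weights alpha_1..alpha_W, largest on the current frame
    j = np.arange(1, W + 1)
    alpha = np.exp(-((j - W) ** 2) / (2 * sigma ** 2))
    alpha /= alpha.sum()
    sm1 = np.copy(post1)
    sm2 = np.copy(post2)
    for t in range(W - 1, len(post1)):
        # alpha_1..alpha_{W-1} weight the already-smoothed history,
        # alpha_W weights the raw posterior of the current frame
        sm1[t] = alpha @ np.concatenate([sm1[t - W + 1 : t], [post1[t]]])
        sm2[t] = alpha @ np.concatenate([sm2[t - W + 1 : t], [post2[t]]])
    return sm1, sm2
```

A frame is then judged a speech frame when its smoothed speech posterior exceeds its smoothed noise posterior (step S4); an isolated one-frame flip in the raw posteriors is damped by the history terms.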
Preferably, in step S5, controlling the gain of the current frame data according to a pre-configured speech frame gain control rule when the current frame data is determined to be a speech frame, and controlling the gain of the current frame data according to a pre-configured noise frame gain control rule when the current frame data is determined to be a noise frame, includes:
when the current frame data is determined to be a speech frame, acquiring the time-domain energy of the current frame data, calculating the ratio of a preset expected energy value to that time-domain energy, and multiplying each data point of the current frame data by the ratio so as to amplify or attenuate the current frame data;
when the current frame data is determined to be a noise frame, keeping the current frame data unchanged.
When the ratio is greater than 1, the time-domain energy falls short of the expected energy value and the current frame data needs to be amplified; when the ratio is less than 1, the time-domain energy exceeds the expected energy value and the current frame data needs to be attenuated.
Through step S5, a speech frame can be amplified or attenuated according to its time-domain energy, achieving the automatic gain control effect, while a noise frame is kept unchanged, which avoids erroneously amplifying noise.
It should be noted that the above is only one implementation of the speech frame gain control rule and the noise frame gain control rule; the aim is to automatically amplify or attenuate speech frames while avoiding any amplification of noise frames. Other implementations, such as compressing the gain of noise frames, are also possible.
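A minimal sketch of the gain rule of step S5 (the function name is hypothetical; the patent gives no code). The rule is applied literally as stated: each sample of a speech frame is multiplied by the ratio of the expected energy to the measured time-domain energy, and noise frames are returned untouched.

```python
import numpy as np

def apply_gain_control(frame, is_speech, expected_energy):
    """Scale a speech frame toward a target energy; leave noise unchanged.

    frame: 1-D array of samples for the current frame.
    """
    if not is_speech:
        return frame  # noise frame: kept unchanged, never amplified
    energy = np.sum(frame.astype(np.float64) ** 2)  # time-domain energy
    if energy == 0.0:
        return frame  # silent frame: nothing to scale
    ratio = expected_energy / energy
    # ratio > 1 amplifies the frame, ratio < 1 attenuates it
    return frame * ratio
```

For example, a frame with energy 2 and an expected energy of 4 yields a ratio of 2, so every sample is doubled.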
To carry out the automatic gain control method for audio data described above, an embodiment of the present invention further provides an automatic gain control apparatus for audio data. FIG. 2 is a structural block diagram of an automatic gain control apparatus for audio data according to an embodiment of the present invention. The automatic gain control apparatus for audio data includes:
a pre-processing module 1, configured to perform framing processing on the audio data and to extract a feature parameter of each frame of data;
a first probability acquisition module 2, configured to obtain a speech-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured speech-class Gaussian mixture model, and to obtain a noise-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured noise-class Gaussian mixture model;
a second probability acquisition module 3, configured to calculate the probability that the current frame data belongs to a speech frame and the probability that it belongs to a noise frame according to the speech-class conditional probability and the noise-class conditional probability of the current frame data;
a determination module 4, configured to determine the current frame data to be a speech frame when its probability of belonging to a speech frame is greater than its probability of belonging to a noise frame, and to determine the current frame data to be a noise frame when its probability of belonging to a speech frame is less than its probability of belonging to a noise frame;
a gain control module 5, configured to control the gain of the current frame data according to a pre-configured speech frame gain control rule when the current frame data is determined to be a speech frame, and to control the gain of the current frame data according to a pre-configured noise frame gain control rule when the current frame data is determined to be a noise frame.
Preferably, the automatic gain control apparatus for audio data further includes a first model building module and a second model building module.
The first model building module includes:
a first pre-processing unit, configured to perform framing processing on speech sample data and to extract a feature parameter of each frame of data, using the same processing method as that applied to the audio data;
a first classification unit, configured to divide the feature parameters of the speech sample data into a number of speech categories according to the K-means algorithm;
a first initial parameter acquisition unit, configured to acquire the initial weight, initial mean and initial covariance of the Gaussian sub-model corresponding to each speech category;
a first model optimization unit, configured to iteratively optimize the weight, mean and covariance of the Gaussian sub-model corresponding to each speech category through the EM algorithm, obtaining the speech-class Gaussian mixture model.
The second model building module includes:
a second pre-processing unit, configured to perform framing processing on noise sample data and to extract a feature parameter of each frame of data, using the same processing method as that applied to the audio data;
a second classification unit, configured to divide the feature parameters of the noise sample data into a number of noise categories according to the K-means algorithm;
a second initial parameter acquisition unit, configured to acquire the initial weight, initial mean and initial covariance of the Gaussian sub-model corresponding to each noise category;
a second model optimization unit, configured to iteratively optimize the weight, mean and covariance of the Gaussian sub-model corresponding to each noise category through the EM algorithm, obtaining the noise-class Gaussian mixture model.
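The model-building pipeline above (K-means initialisation of the sub-models' weights, means and covariances, then EM refinement) can be sketched as below. This is an illustrative diagonal-covariance implementation, not the patent's own: the farthest-point seeding of K-means and the small variance floor are assumptions added for determinism and numerical safety, and `train_gmm` is a hypothetical name.

```python
import numpy as np

def train_gmm(features, k, iters=50):
    """K-means-initialised, EM-refined diagonal-covariance GMM.

    features: (N, D) array of per-frame feature parameters.
    Returns (weights, means, variances) of the k Gaussian sub-models.
    """
    X = np.asarray(features, dtype=float)
    N, D = X.shape
    # --- K-means initialisation (deterministic farthest-point seeding) ---
    means = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - m) ** 2).sum(1) for m in means], axis=0)
        means.append(X[int(d2.argmax())])
    means = np.array(means)
    for _ in range(20):
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(k):
            if np.any(labels == c):
                means[c] = X[labels == c].mean(0)
    weights = np.array([(labels == c).mean() for c in range(k)])
    weights = np.clip(weights, 1e-12, None)
    var = np.array([X[labels == c].var(0) + 1e-6 if np.any(labels == c)
                    else np.ones(D) for c in range(k)])
    # --- EM refinement of weights, means and (diagonal) covariances ---
    for _ in range(iters):
        # E-step: responsibilities under each diagonal Gaussian
        log_p = (-0.5 * (((X[:, None, :] - means[None]) ** 2) / var[None]
                         + np.log(2 * np.pi * var[None])).sum(-1)
                 + np.log(weights)[None, :])
        log_p -= log_p.max(1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = resp.sum(0) + 1e-12
        weights = nk / N
        means = (resp[:, :, None] * X[:, None, :]).sum(0) / nk[:, None]
        var = ((resp[:, :, None] * (X[:, None, :] - means[None]) ** 2).sum(0)
               / nk[:, None] + 1e-6)
    return weights, means, var
```

Training this once on speech features and once on noise features yields the two class models whose likelihoods p(xT/Y1) and p(xT/Y2) are used at run time.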
Preferably, the second probability acquisition module 3 includes:
a posterior probability acquisition unit, configured to calculate, by combining the speech-class conditional probability p(xT/Y1) of the current frame data and the noise-class conditional probability p(xT/Y2) of the current frame data with the Bayesian formula, the posterior probability p′(Y1/xT) that the current frame data belongs to a speech frame and the posterior probability p′(Y2/xT) that it belongs to a noise frame;
a probability-weighted smoothing unit, configured to calculate p(Y1/xT) according to p(Y1/xT) = α1·p(Y1/xT-W+1) + … + αW-1·p(Y1/xT-1) + αW·p′(Y1/xT), and to calculate p(Y2/xT) according to p(Y2/xT) = α1·p(Y2/xT-W+1) + … + αW-1·p(Y2/xT-1) + αW·p′(Y2/xT);
where:
Figure PCTCN2017104796-appb-000020
T is the frame index of the current frame data in the audio data; xT is the feature parameter of the current frame data; T-W+1 is the frame index of the earliest of the W frames ending at the current frame; W and σ are preset values.
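The Bayes step of the posterior probability acquisition unit can be sketched as follows. The priors p(Y1) and p(Y2) are not stated in the text, so equal priors are assumed by default; the function name is hypothetical.

```python
def posterior_probs(lik_speech, lik_noise, prior_speech=0.5):
    """Turn class-conditional likelihoods p(xT/Y1), p(xT/Y2) into the
    posteriors p'(Y1/xT), p'(Y2/xT) via the Bayesian formula.

    prior_speech is an assumption: the patent does not specify the priors.
    """
    prior_noise = 1.0 - prior_speech
    evidence = lik_speech * prior_speech + lik_noise * prior_noise
    if evidence == 0.0:
        return 0.5, 0.5  # degenerate case: no evidence either way
    p_speech = lik_speech * prior_speech / evidence
    return p_speech, 1.0 - p_speech
```

With equal priors the posterior reduces to the normalised likelihood ratio, so likelihoods of 0.8 and 0.2 give posteriors of 0.8 and 0.2.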
Preferably, any two adjacent frames of data obtained by performing framing processing on the audio data have an overlapping portion.
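Framing with overlap can be sketched as below; the frame length and hop size are illustrative (e.g. 400 samples with 50% overlap — the patent does not fix these values).

```python
import numpy as np

def split_frames(samples, frame_len=400, hop=200):
    """Split an audio signal into frames where any two adjacent frames
    share frame_len - hop samples (here 50% overlap). Trailing samples
    that do not fill a whole frame are dropped."""
    samples = np.asarray(samples)
    if len(samples) < frame_len:
        return np.empty((0, frame_len), dtype=samples.dtype)
    n = 1 + (len(samples) - frame_len) // hop
    return np.stack([samples[i * hop:i * hop + frame_len] for i in range(n)])
```

For a 1000-sample signal this yields 4 frames starting at samples 0, 200, 400 and 600, each sharing its second half with the next frame's first half.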
Preferably, the gain control module 5 includes:
a first gain control unit, configured to, when the current frame data is determined to be a speech frame, acquire the time-domain energy of the current frame data, calculate the ratio of a preset expected energy value to that time-domain energy, and multiply each data point of the current frame data by the ratio so as to amplify or attenuate the current frame data;
a second gain control unit, configured to keep the current frame data unchanged when the current frame data is determined to be a noise frame.
It should be noted that the automatic gain control apparatus for audio data provided by the embodiment of the present invention is used to perform the automatic gain control method for audio data described above; their beneficial effects and working principles correspond one to one, and are therefore not repeated here.
Compared with the prior art, the embodiments of the present invention provide a method and apparatus for automatic gain control of audio data. The method includes: performing framing processing on audio data and extracting a feature parameter of each frame of data; obtaining a speech-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured speech-class Gaussian mixture model, and obtaining a noise-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured noise-class Gaussian mixture model; calculating the probability that the current frame data belongs to a speech frame and the probability that it belongs to a noise frame according to the speech-class conditional probability and the noise-class conditional probability of the current frame data; determining the current frame data to be a speech frame when its probability of belonging to a speech frame is greater than its probability of belonging to a noise frame, and determining it to be a noise frame when that probability is smaller; and controlling the gain of the current frame data according to a pre-configured speech frame gain control rule when the current frame data is determined to be a speech frame, and according to a pre-configured noise frame gain control rule when it is determined to be a noise frame. In real-time voice communication, because of the diversity of usage environments, the noise changes as the environment changes. By introducing Gaussian mixture models, the embodiments of the present invention determine very accurately whether the current frame is a speech segment or a noise segment, and apply gain control to speech segments and noise segments separately, realizing automatic gain control and avoiding erroneous amplification of noise. The technical solution of the present invention greatly improves the level of discrimination between speech and noise, performs automatic gain control accordingly, and effectively improves the user experience.
What is disclosed above is merely a preferred embodiment of the present invention, which of course cannot be used to limit the scope of the rights of the present invention. Those of ordinary skill in the art will understand that all or part of the processes implementing the above embodiments, and equivalent changes made in accordance with the claims of the present invention, still fall within the scope covered by the invention.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be accomplished by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

Claims (10)

  1. An automatic gain control method for audio data, comprising:
    performing framing processing on the audio data and extracting a feature parameter of each frame of data;
    obtaining a speech-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured speech-class Gaussian mixture model; and obtaining a noise-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured noise-class Gaussian mixture model;
    calculating the probability that the current frame data belongs to a speech frame and the probability that it belongs to a noise frame according to the speech-class conditional probability and the noise-class conditional probability of the current frame data;
    determining the current frame data to be a speech frame when its probability of belonging to a speech frame is greater than its probability of belonging to a noise frame; and determining the current frame data to be a noise frame when its probability of belonging to a speech frame is less than its probability of belonging to a noise frame;
    controlling the gain of the current frame data according to a pre-configured speech frame gain control rule when the current frame data is determined to be a speech frame, and controlling the gain of the current frame data according to a pre-configured noise frame gain control rule when the current frame data is determined to be a noise frame.
  2. The automatic gain control method for audio data according to claim 1, further comprising a step of building the speech-class Gaussian mixture model and a step of building the noise-class Gaussian mixture model;
    the step of building the speech-class Gaussian mixture model specifically comprises:
    performing framing processing on speech sample data and extracting a feature parameter of each frame of data, using the same processing method as that applied to the audio data;
    dividing the feature parameters of the speech sample data into a number of speech categories according to the K-means algorithm;
    acquiring the initial weight, initial mean and initial covariance of the Gaussian sub-model corresponding to each speech category;
    iteratively optimizing the weight, mean and covariance of the Gaussian sub-model corresponding to each speech category through the EM algorithm to obtain the speech-class Gaussian mixture model;
    the step of building the noise-class Gaussian mixture model specifically comprises:
    performing framing processing on noise sample data and extracting a feature parameter of each frame of data, using the same processing method as that applied to the audio data;
    dividing the feature parameters of the noise sample data into a number of noise categories according to the K-means algorithm;
    acquiring the initial weight, initial mean and initial covariance of the Gaussian sub-model corresponding to each noise category;
    iteratively optimizing the weight, mean and covariance of the Gaussian sub-model corresponding to each noise category through the EM algorithm to obtain the noise-class Gaussian mixture model.
  3. The automatic gain control method for audio data according to claim 1, wherein calculating the probability that the current frame data belongs to a speech frame and the probability that it belongs to a noise frame according to the speech-class conditional probability of the current frame data and the noise-class conditional probability of the current frame data comprises:
    calculating, by combining the speech-class conditional probability p(xT/Y1) of the current frame data and the noise-class conditional probability p(xT/Y2) of the current frame data with the Bayesian formula, the posterior probability p′(Y1/xT) that the current frame data belongs to a speech frame and the posterior probability p′(Y2/xT) that it belongs to a noise frame;
    calculating p(Y1/xT) according to p(Y1/xT) = α1·p(Y1/xT-W+1) + … + αW-1·p(Y1/xT-1) + αW·p′(Y1/xT);
    calculating p(Y2/xT) according to p(Y2/xT) = α1·p(Y2/xT-W+1) + … + αW-1·p(Y2/xT-1) + αW·p′(Y2/xT);
    where:
    Figure PCTCN2017104796-appb-100001
    T is the frame index of the current frame data in the audio data; xT is the feature parameter of the current frame data; T-W+1 is the frame index of the earliest of the W frames ending at the current frame; W and σ are preset values.
  4. The automatic gain control method for audio data according to claim 1, wherein any two adjacent frames of data obtained by performing framing processing on the audio data have an overlapping portion.
  5. The automatic gain control method for audio data according to any one of claims 1 to 4, wherein controlling the gain of the current frame data according to a pre-configured speech frame gain control rule when the current frame data is determined to be a speech frame, and controlling the gain of the current frame data according to a pre-configured noise frame gain control rule when the current frame data is determined to be a noise frame, comprises:
    when the current frame data is determined to be a speech frame, acquiring the time-domain energy of the current frame data, calculating the ratio of a preset expected energy value to that time-domain energy, and multiplying each data point of the current frame data by the ratio so as to amplify or attenuate the current frame data;
    when the current frame data is determined to be a noise frame, keeping the current frame data unchanged.
  6. An automatic gain control apparatus for audio data, comprising:
    a pre-processing module, configured to perform framing processing on the audio data and to extract a feature parameter of each frame of data;
    a first probability acquisition module, configured to obtain a speech-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured speech-class Gaussian mixture model, and to obtain a noise-class conditional probability of the current frame data according to the feature parameter of the current frame data and a pre-configured noise-class Gaussian mixture model;
    a second probability acquisition module, configured to calculate the probability that the current frame data belongs to a speech frame and the probability that it belongs to a noise frame according to the speech-class conditional probability and the noise-class conditional probability of the current frame data;
    a determination module, configured to determine the current frame data to be a speech frame when its probability of belonging to a speech frame is greater than its probability of belonging to a noise frame, and to determine the current frame data to be a noise frame when its probability of belonging to a speech frame is less than its probability of belonging to a noise frame;
    a gain control module, configured to control the gain of the current frame data according to a pre-configured speech frame gain control rule when the current frame data is determined to be a speech frame, and to control the gain of the current frame data according to a pre-configured noise frame gain control rule when the current frame data is determined to be a noise frame.
  7. The automatic gain control apparatus for audio data according to claim 6, further comprising a first model building module and a second model building module;
    the first model building module comprises:
    a first pre-processing unit, configured to perform framing processing on speech sample data and to extract a feature parameter of each frame of data, using the same processing method as that applied to the audio data;
    a first classification unit, configured to divide the feature parameters of the speech sample data into a number of speech categories according to the K-means algorithm;
    a first initial parameter acquisition unit, configured to acquire the initial weight, initial mean and initial covariance of the Gaussian sub-model corresponding to each speech category;
    a first model optimization unit, configured to iteratively optimize the weight, mean and covariance of the Gaussian sub-model corresponding to each speech category through the EM algorithm, obtaining the speech-class Gaussian mixture model;
    the second model building module comprises:
    a second pre-processing unit, configured to perform framing processing on noise sample data and to extract a feature parameter of each frame of data, using the same processing method as that applied to the audio data;
    a second classification unit, configured to divide the feature parameters of the noise sample data into a number of noise categories according to the K-means algorithm;
    a second initial parameter acquisition unit, configured to acquire the initial weight, initial mean and initial covariance of the Gaussian sub-model corresponding to each noise category;
    a second model optimization unit, configured to iteratively optimize the weight, mean and covariance of the Gaussian sub-model corresponding to each noise category through the EM algorithm, obtaining the noise-class Gaussian mixture model.
  8. The automatic gain control apparatus for audio data according to claim 6, wherein the second probability acquisition module comprises:
    a posterior probability acquisition unit, configured to calculate, by combining the speech-class conditional probability p(xT/Y1) of the current frame data and the noise-class conditional probability p(xT/Y2) of the current frame data with the Bayesian formula, the posterior probability p′(Y1/xT) that the current frame data belongs to a speech frame and the posterior probability p′(Y2/xT) that it belongs to a noise frame;
    a probability-weighted smoothing unit, configured to calculate p(Y1/xT) according to p(Y1/xT) = α1·p(Y1/xT-W+1) + … + αW-1·p(Y1/xT-1) + αW·p′(Y1/xT), and to calculate p(Y2/xT) according to p(Y2/xT) = α1·p(Y2/xT-W+1) + … + αW-1·p(Y2/xT-1) + αW·p′(Y2/xT);
    where:
    Figure PCTCN2017104796-appb-100002
    T is the frame index of the current frame data in the audio data; xT is the feature parameter of the current frame data; T-W+1 is the frame index of the earliest of the W frames ending at the current frame; W and σ are preset values.
  9. The automatic gain control apparatus for audio data according to claim 6, wherein any two adjacent frames of data obtained by performing framing processing on the audio data have an overlapping portion.
  10. The automatic gain control apparatus for audio data according to any one of claims 6 to 9, wherein the gain control module comprises:
    a first gain control unit, configured to, when the current frame data is determined to be a speech frame, acquire the time-domain energy of the current frame data, calculate the ratio of a preset expected energy value to that time-domain energy, and multiply each data point of the current frame data by the ratio so as to amplify or attenuate the current frame data;
    a second gain control unit, configured to keep the current frame data unchanged when the current frame data is determined to be a noise frame.
PCT/CN2017/104796 2016-12-16 2017-09-30 Method and apparatus for automatically controlling gain of audio data WO2018107874A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611169178.4A CN106653047A (en) 2016-12-16 2016-12-16 Automatic gain control method and device for audio data
CN201611169178.4 2016-12-16

Publications (1)

Publication Number Publication Date
WO2018107874A1 true WO2018107874A1 (en) 2018-06-21

Family

ID=58822148

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/104796 WO2018107874A1 (en) 2016-12-16 2017-09-30 Method and apparatus for automatically controlling gain of audio data

Country Status (2)

Country Link
CN (1) CN106653047A (en)
WO (1) WO2018107874A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105798A (en) * 2018-10-29 2020-05-05 宁波方太厨具有限公司 Equipment control method based on voice recognition
CN111192573A (en) * 2018-10-29 2020-05-22 宁波方太厨具有限公司 Equipment intelligent control method based on voice recognition
CN113542863A (en) * 2020-04-14 2021-10-22 深圳Tcl数字技术有限公司 Sound processing method, storage medium and smart television
CN113593600A (en) * 2021-01-26 2021-11-02 腾讯科技(深圳)有限公司 Mixed voice separation method and device, storage medium and electronic equipment
CN113593603A (en) * 2021-07-27 2021-11-02 浙江大华技术股份有限公司 Audio category determination method and device, storage medium and electronic device

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
CN106653047A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Automatic gain control method and device for audio data
CN107134277A (en) * 2017-06-15 2017-09-05 深圳市潮流网络技术有限公司 A kind of voice-activation detecting method based on GMM model
CN107507621B (en) * 2017-07-28 2021-06-22 维沃移动通信有限公司 Noise suppression method and mobile terminal
CN108171271B (en) * 2018-01-11 2022-04-29 湖南大唐先一科技有限公司 Early warning method and system for equipment degradation
CN109688284B (en) * 2018-12-28 2021-10-08 广东美电贝尔科技集团股份有限公司 Echo delay detection method
CN110111805B (en) * 2019-04-29 2021-10-29 北京声智科技有限公司 Automatic gain control method and device in far-field voice interaction and readable storage medium
CN112133299B (en) * 2019-06-25 2021-08-27 大众问问(北京)信息科技有限公司 Sound signal processing method, device and equipment
CN111028857B (en) * 2019-12-27 2024-01-19 宁波蛙声科技有限公司 Method and system for reducing noise of multichannel audio-video conference based on deep learning

Citations (9)

Publication number Priority date Publication date Assignee Title
US5455888A (en) * 1992-12-04 1995-10-03 Northern Telecom Limited Speech bandwidth extension method and apparatus
CN101593522A (en) * 2009-07-08 2009-12-02 清华大学 A kind of full frequency domain digital hearing aid method and apparatus
US20100082339A1 (en) * 2008-09-30 2010-04-01 Alon Konchitsky Wind Noise Reduction
CN101976566A (en) * 2010-07-09 2011-02-16 瑞声声学科技(深圳)有限公司 Voice enhancement method and device using same
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
CN105390142A (en) * 2015-12-17 2016-03-09 广州大学 Digital hearing aid voice noise elimination method
CN105741849A (en) * 2016-03-06 2016-07-06 北京工业大学 Speech enhancement method fusing phase estimation and human auditory characteristics for digital hearing aids
CN105845150A (en) * 2016-03-21 2016-08-10 福州瑞芯微电子股份有限公司 Speech enhancement method and system using cepstral correction
CN106653047A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Automatic gain control method and device for audio data

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1322488C (en) * 2004-04-14 2007-06-20 华为技术有限公司 Sound enhancement method
CN100419854C (en) * 2005-11-23 2008-09-17 北京中星微电子有限公司 Voice gain factor estimating device and method
CN101197130B (en) * 2006-12-07 2011-05-18 华为技术有限公司 Voice activity detection method and detector thereof
CN101110217B (en) * 2007-07-25 2010-10-13 北京中星微电子有限公司 Automatic gain control method for audio signal and apparatus thereof
CN101930735B (en) * 2009-06-23 2012-11-21 富士通株式会社 Speech emotion recognition equipment and speech emotion recognition method
WO2011010604A1 (en) * 2009-07-21 2011-01-27 日本電信電話株式会社 Audio signal section estimating apparatus, audio signal section estimating method, program therefor and recording medium
CN102800322B (en) * 2011-05-27 2014-03-26 中国科学院声学研究所 Method for estimating noise power spectrum and voice activity
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Voiceprint recognition method and system based on Gaussian mixture models
CN103646649B (en) * 2013-12-30 2016-04-13 中国科学院自动化研究所 An efficient speech detection method
CN105931635B (en) * 2016-03-31 2019-09-17 北京奇艺世纪科技有限公司 Audio segmentation method and device

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5455888A (en) * 1992-12-04 1995-10-03 Northern Telecom Limited Speech bandwidth extension method and apparatus
US20100082339A1 (en) * 2008-09-30 2010-04-01 Alon Konchitsky Wind Noise Reduction
CN101593522A (en) * 2009-07-08 2009-12-02 清华大学 Full frequency domain digital hearing aid method and apparatus
CN101976566A (en) * 2010-07-09 2011-02-16 瑞声声学科技(深圳)有限公司 Voice enhancement method and device using same
CN103236260A (en) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Voice recognition system
CN105390142A (en) * 2015-12-17 2016-03-09 广州大学 Digital hearing aid voice noise elimination method
CN105741849A (en) * 2016-03-06 2016-07-06 北京工业大学 Speech enhancement method fusing phase estimation and human auditory characteristics for digital hearing aids
CN105845150A (en) * 2016-03-21 2016-08-10 福州瑞芯微电子股份有限公司 Speech enhancement method and system using cepstral correction
CN106653047A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Automatic gain control method and device for audio data

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105798A (en) * 2018-10-29 2020-05-05 宁波方太厨具有限公司 Equipment control method based on voice recognition
CN111192573A (en) * 2018-10-29 2020-05-22 宁波方太厨具有限公司 Equipment intelligent control method based on voice recognition
CN111105798B (en) * 2018-10-29 2023-08-18 宁波方太厨具有限公司 Equipment control method based on voice recognition
CN111192573B (en) * 2018-10-29 2023-08-18 宁波方太厨具有限公司 Intelligent control method for equipment based on voice recognition
CN113542863A (en) * 2020-04-14 2021-10-22 深圳Tcl数字技术有限公司 Sound processing method, storage medium and smart television
CN113542863B (en) * 2020-04-14 2023-05-23 深圳Tcl数字技术有限公司 Sound processing method, storage medium and intelligent television
CN113593600A (en) * 2021-01-26 2021-11-02 腾讯科技(深圳)有限公司 Mixed voice separation method and device, storage medium and electronic equipment
CN113593600B (en) * 2021-01-26 2024-03-15 腾讯科技(深圳)有限公司 Mixed voice separation method and device, storage medium and electronic equipment
CN113593603A (en) * 2021-07-27 2021-11-02 浙江大华技术股份有限公司 Audio category determination method and device, storage medium and electronic device

Also Published As

Publication number Publication date
CN106653047A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
WO2018107874A1 (en) Method and apparatus for automatically controlling gain of audio data
US20210327448A1 (en) Speech noise reduction method and apparatus, computing device, and computer-readable storage medium
US11335352B2 (en) Voice identity feature extractor and classifier training
US9349384B2 (en) Method and system for object-dependent adjustment of levels of audio objects
US8239196B1 (en) System and method for multi-channel multi-feature speech/noise classification for noise suppression
JP4107613B2 (en) Low cost filter coefficient determination method in dereverberation.
JP6243858B2 (en) Speech model learning method, noise suppression method, speech model learning device, noise suppression device, speech model learning program, and noise suppression program
US9842608B2 (en) Automatic selective gain control of audio data for speech recognition
CN111418010A (en) Multi-microphone noise reduction method and device and terminal equipment
JP6361156B2 (en) Noise estimation apparatus, method and program
WO2020098083A1 (en) Call separation method and apparatus, computer device and storage medium
US8503694B2 (en) Sound capture system for devices with two microphones
US11128954B2 (en) Method and electronic device for managing loudness of audio signal
WO2020253073A1 (en) Speech endpoint detection method, apparatus and device, and storage medium
CN111048118B (en) Voice signal processing method and device and terminal
CN106571138B (en) Signal endpoint detection method, detection device and detection equipment
WO2024041512A1 (en) Audio noise reduction method and apparatus, and electronic device and readable storage medium
Tong et al. Evaluating VAD for automatic speech recognition
WO2017128910A1 (en) Method, apparatus and electronic device for determining speech presence probability
US20230223014A1 (en) Adapting Automated Speech Recognition Parameters Based on Hotword Properties
WO2021016925A1 (en) Audio processing method and apparatus
US10600432B1 (en) Methods for voice enhancement
JP2013235050A (en) Information processing apparatus and method, and program
KR101543300B1 (en) Speech Presence Uncertainty Estimation method Based on Multiple Linear Regression Analysis
CN114981888A (en) Noise floor estimation and noise reduction

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 17879685

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 24.10.2019)

122 Ep: PCT application non-entry in European phase

Ref document number: 17879685

Country of ref document: EP

Kind code of ref document: A1