CN107919136B - Digital voice sampling frequency estimation method based on Gaussian mixture model
- Publication number: CN107919136B (application CN201711112810.6A)
- Authority: CN (China)
- Prior art keywords: sampling frequency, voice, training, sampling, speech
- Prior art date: 2017-11-13
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
Abstract
The invention discloses a Gaussian mixture model (GMM)-based digital voice sampling frequency estimation method, which comprises the steps of: firstly, generating a GMM by training on high-sampling-rate digital voice; then interpolating the low-sampling-rate input voice to be estimated to increase its sampling frequency; and finally, carrying out probability calculation on the interpolated digital voice with the GMM and adjusting the interpolation multiple according to the calculation result until the output probability of the GMM reaches its maximum, thereby obtaining the sampling frequency of the input voice. The invention can identify the sampling frequency of unknown digital voice and reduce the system performance degradation caused by sampling frequency mismatch.
Description
Technical Field
The invention belongs to the field of voice processing, and particularly relates to a voice processing method for estimating the sampling frequency of input voice by using a Gaussian mixture model generated by high-sampling-rate digital voice training.
Background
Speech is a basic means of human communication and the most convenient and effective human-computer interaction tool while on the move. Digital speech has the advantages of high accuracy and easy storage and transmission, but different digital systems differ in computational performance, access speed, storage space, battery capacity and application, and therefore use different sampling frequencies. If the sampling frequency of the input speech does not match that of the digital system, the performance of the speech processing system degrades. It is therefore necessary to transform the input speech so that its sampling frequency matches the digital system, thereby enhancing the practical applicability of the speech processing system.
If the sampling frequency of the input speech is known, it suffices to calculate its ratio to the system sampling frequency and then interpolate or decimate the input speech so that the two are consistent. However, in some applications the sampling frequency of the input speech is unknown. For example, when monitoring audio on a network, a captured segment of digital speech may arrive with an unknown sampling frequency.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a digital speech sampling frequency estimation method based on a Gaussian Mixture Model (GMM). In the method, firstly, a GMM is generated by high sampling rate digital voice training; then interpolating the input voice with low sampling rate to be estimated, and increasing the sampling frequency; and finally, carrying out probability calculation on the interpolated digital voice by using the GMM, and adjusting the interpolation multiple according to the calculation result to enable the output probability of the GMM to reach the maximum value, thereby obtaining the sampling frequency of the input voice.
The method comprises the following specific steps:
(1) sampling training voice at 48kHz, windowing and framing the training voice, extracting cepstrum characteristics, and training by using the characteristic vectors of all voice units to generate a Gaussian mixture model;
(2) interpolating low-sampling-rate input voice to be estimated (namely voice with the sampling frequency lower than 48 kHz) to improve the sampling frequency;
(3) inputting the interpolated digital voice into GMM, and calculating the output probability;
(4) repeating (2) and (3) for all interpolation multiples, and recording the output probability of each time;
(5) comparing the output probabilities corresponding to all the interpolation multiples; the interpolation multiple corresponding to the maximum output probability is the ratio of the training-speech sampling frequency to the input-speech sampling frequency.
Drawings
Fig. 1 is an overall framework of a digital speech sampling frequency estimation system based on a gaussian mixture model, which mainly comprises a model training module, a signal interpolation module, an interpolation multiple control module and a frequency estimation module.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
As shown in fig. 1, the digital speech sampling frequency estimation method based on the Gaussian mixture model mainly comprises the model training, signal interpolation, interpolation multiple control and frequency estimation modules. The specific embodiment of each main module is described in detail below:
1. model training
Firstly, sampling training voice at 48kHz, windowing, framing, and carrying out fast Fourier transform on each frame of voice signal to obtain an amplitude spectrum of each frame of signal; then, performing Mel filtering on the magnitude spectrum of each frame of signal, and taking the logarithm to obtain the cepstrum characteristic parameter of the training voice; and finally, training by using the feature vectors of all the phonetic units to generate a Gaussian mixture model:
$$p(o_t) = \sum_{m=1}^{M} c_m\,\mathcal{N}(o_t;\,\mu_m,\,\Sigma_m)$$

where $o_t$ denotes the cepstrum feature vector of the $t$-th frame of training speech, and $c_m$, $\mu_m$ and $\Sigma_m$ respectively denote the mixing coefficient, mean vector and covariance matrix of the $m$-th Gaussian component in the GMM.
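The density above can be evaluated directly with NumPy. The sketch below is a minimal illustration with made-up parameters (not the patent's trained model): it computes the average per-frame log-likelihood of cepstral feature vectors under a diagonal-covariance GMM, which serves as the "output probability" used later to score each interpolation multiple:

```python
import numpy as np

def gmm_avg_loglik(feats, weights, means, variances):
    """Average per-frame log-likelihood log p(o_t) under a
    diagonal-covariance Gaussian mixture model.

    feats:     (T, D) cepstral feature vectors, one row per frame
    weights:   (M,)   mixing coefficients c_m, summing to 1
    means:     (M, D) mean vectors mu_m
    variances: (M, D) diagonal covariances Sigma_m
    """
    T, D = feats.shape
    # log N(o_t; mu_m, Sigma_m) for every (frame, component) pair
    diff = feats[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi)
                       + np.sum(np.log(variances), axis=1))      # (M,)
    log_exp = -0.5 * np.sum(diff**2 / variances[None, :, :], axis=2)  # (T, M)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + log_exp
    # log-sum-exp over components, then average over frames
    m = log_comp.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))
    return float(np.mean(log_p))

# Toy check: a single standard-normal component in 2-D, evaluated at its mean
feats = np.zeros((3, 2))
ll = gmm_avg_loglik(feats, np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2)))
print(ll)  # -log(2*pi) for a 2-D standard normal at the mean
```

In practice the parameters would come from EM training on 48 kHz cepstral features, as described above.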
2. Signal interpolation
Because the gaussian mixture model is trained by training speech with a high sampling rate, the sampling frequency of the input digital speech can be considered to be lower than that of the training speech, and the input speech is interpolated to match the sampling frequency with the GMM.
Let the interpolation multiple be $D_i$. The input digital speech $x(n)$ is interpolated to obtain $x_i(n)$:

$$x_i(n) = \begin{cases} x(n/D_i), & n = 0,\ \pm D_i,\ \pm 2D_i,\ \dots \\ 0, & \text{otherwise} \end{cases}$$

The sampling frequency of the interpolated digital speech $x_i(n)$ is $D_i$ times that of the original input digital speech $x(n)$.
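As a concrete illustration, the sketch below raises the sampling rate of a signal by an integer multiple. The patent's exact interpolation filter is not specified here, so linear interpolation via NumPy's `np.interp` is used as a simple stand-in for zero-insertion followed by lowpass filtering:

```python
import numpy as np

def interpolate(x, d):
    """Raise the sampling rate of x by an integer factor d.

    Linear interpolation stands in for a proper lowpass interpolation
    filter; output sample k sits at position k/d on the original time axis.
    """
    n = np.arange(len(x))
    n_new = np.arange(len(x) * d) / d
    return np.interp(n_new, n, x)

x = np.sin(2 * np.pi * 5 * np.arange(80) / 80.0)  # toy 80-sample signal
xi = interpolate(x, 3)
print(len(xi))  # 240: the sampling frequency is raised 3-fold
```

Every third output sample coincides with an original sample, so the original signal is preserved exactly at those positions.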
3. Interpolation multiple control
The ratios $D_1, D_2, \dots, D_i, \dots, D_N$ of the training-speech sampling frequency to a set of common speech sampling frequencies $f_1, f_2, \dots, f_i, \dots, f_N$ are used as the initial interpolation multiples.
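For a 48 kHz training rate these initial multiples can be tabulated directly. The list of "common" rates below is illustrative; the patent does not enumerate them:

```python
# Candidate interpolation multiples D_i = f0 / f_i for a set of
# common speech sampling rates (the rate list is an assumption).
F0 = 48000
common_rates = [8000, 12000, 16000, 24000, 32000, 48000]
multiples = [F0 / f for f in common_rates]
print(multiples)  # [6.0, 4.0, 3.0, 2.0, 1.5, 1.0]
```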
4. Frequency estimation
The digital speech after each interpolation is input into the Gaussian mixture model and its output probability is calculated. The output probabilities corresponding to all the interpolation multiples are compared to determine the interpolation multiple $\hat{D}$ that maximizes the output probability, and the interpolation multiple is then fine-tuned near $\hat{D}$ so that the output probability of the GMM reaches its maximum. If the interpolation multiple at this point is denoted $D^{*}$, the sampling frequency $\hat{f}_s$ of the original input speech can be estimated as:

$$\hat{f}_s = \frac{f_0}{D^{*}}$$

where $f_0$ is the sampling frequency of the high-rate training speech, here taken to be 48 kHz.
After the sampling frequency of the input speech has been estimated, the input speech can be resampled to match the sampling frequency of the target system and then fed into the target system for processing.
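Putting the modules together, the estimation loop reduces to an argmax over candidate multiples followed by the ratio $f_0 / D$. In this minimal sketch the scoring function is a stand-in for "interpolate, extract cepstra, evaluate the GMM"; it is rigged to peak at a multiple of 3 (i.e. a 16 kHz input) purely for demonstration:

```python
F0 = 48000  # sampling frequency of the high-rate training speech

def gmm_score(speech, multiple):
    # Stand-in for: interpolate `speech` by `multiple`, extract cepstral
    # features, and return the GMM average log-likelihood. Rigged to peak
    # at an assumed true multiple of 3 for demonstration only.
    true_multiple = 3.0
    return -abs(multiple - true_multiple)

candidates = [6.0, 4.0, 3.0, 2.0, 1.5, 1.0]
speech = None  # placeholder for the input digital speech signal
best = max(candidates, key=lambda d: gmm_score(speech, d))
estimated_fs = F0 / best
print(estimated_fs)  # 16000.0
```

A real implementation would replace `gmm_score` with the interpolation and GMM-likelihood routines of the preceding modules, then fine-tune the multiple around `best` as described above.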
Claims (5)
1. A digital voice sampling frequency estimation method based on a Gaussian mixture model is characterized in that firstly, a GMM is generated by high sampling rate digital voice training; then interpolating the input voice with low sampling rate to be estimated, and increasing the sampling frequency; and finally, carrying out probability calculation on the interpolated digital voice by using the GMM, and adjusting the interpolation multiple according to the calculation result to enable the output probability of the GMM to reach the maximum value, thereby obtaining the sampling frequency of the input voice.
2. The method for estimating the sampling frequency of the digital voice based on the Gaussian mixture model as claimed in claim 1, specifically comprising:
(1) sampling training voice at 48kHz, windowing and framing the training voice, extracting cepstrum characteristics, and training by using the characteristic vectors of all voice units to generate a Gaussian mixture model;
(2) interpolating the input voice with low sampling rate to be estimated, and increasing the sampling frequency;
(3) inputting the interpolated digital voice into GMM, and calculating the output probability;
(4) repeating (2) and (3) for all interpolation multiples, and recording the output probability of each time;
(5) comparing the output probabilities corresponding to all the interpolation multiples; the interpolation multiple corresponding to the maximum output probability is the ratio of the training-speech sampling frequency to the input-speech sampling frequency.
3. The method for estimating the sampling frequency of the digital speech based on the Gaussian mixture model as claimed in claim 2, wherein the Gaussian mixture model generated in the step (1) is:
$$p(o_t) = \sum_{m=1}^{M} c_m\,\mathcal{N}(o_t;\,\mu_m,\,\Sigma_m)$$

where $o_t$ denotes the cepstrum feature vector of the $t$-th frame of training speech, and $c_m$, $\mu_m$ and $\Sigma_m$ respectively denote the mixing coefficient, mean vector and covariance matrix of the $m$-th Gaussian component in the GMM.
4. The method of claim 2, wherein the low-sampling-rate input speech to be estimated is interpolated as follows: let the interpolation multiple be $D_i$; the input digital speech $x(n)$ is interpolated to obtain $x_i(n)$:

$$x_i(n) = \begin{cases} x(n/D_i), & n = 0,\ \pm D_i,\ \pm 2D_i,\ \dots \\ 0, & \text{otherwise} \end{cases}$$

The sampling frequency of the interpolated digital speech $x_i(n)$ is $D_i$ times that of the original input digital speech $x(n)$.
5. The method of claim 4, wherein the ratios $D_1, D_2, \dots, D_i, \dots, D_N$ of the training-speech sampling frequency to a set of common speech sampling frequencies $f_1, f_2, \dots, f_i, \dots, f_N$ are used as the initial interpolation multiples.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711112810.6A | 2017-11-13 | 2017-11-13 | Digital voice sampling frequency estimation method based on Gaussian mixture model |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN107919136A | 2018-04-17 |
| CN107919136B | 2021-07-09 |
Family

ID=61896270

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201711112810.6A (Active) | Digital voice sampling frequency estimation method based on Gaussian mixture model | 2017-11-13 | 2017-11-13 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN107919136B |
Families Citing this family (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109459612A | 2019-01-09 | 2019-03-12 | 上海艾为电子技术股份有限公司 | Detection method and device for the sampling frequency of digital audio signals |
| CN111341302B | 2020-03-02 | 2023-10-31 | 苏宁云计算有限公司 | Voice stream sampling rate determining method and device |
Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8639502B1 | 2009-02-16 | 2014-01-28 | Arrowhead Center, Inc. | Speaker model-based speech enhancement system |
| CN107204840A | 2017-07-31 | 2017-09-26 | 电子科技大学 | Sinusoidal signal frequency estimation method based on DFT and iterative correction |
Family Cites Families (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3998606B8 | 2009-10-21 | 2022-12-07 | Dolby International AB | Oversampling in a combined transposer filter bank |
Non-Patent Citations (3)

- Haiyang Wu et al., "Improved AdaBoost Algorithm Using VQMAP for Speaker Identification", 2010 International Conference on Electrical and Control Engineering, 2010.
- Petr Zelinka et al., "Smooth interpolation of Gaussian mixture models", 2009 19th International Conference Radioelektronika, 2009.
- Lin Zhou et al., "VTS feature compensation based on two-layer GMM structure for robust speech recognition", 2016 8th International Conference on Wireless Communications & Signal Processing (WCSP), 2016.
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |