CN107919115B - Characteristic compensation method based on nonlinear spectral transformation - Google Patents
- Publication number
- CN107919115B (application CN201711112747.6A)
- Authority
- CN
- China
- Prior art keywords
- voice
- nonlinear
- output probability
- mfcc
- gmm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The invention discloses a feature compensation method based on nonlinear spectral transformation. In the training stage, a Gaussian mixture model (GMM) is trained on standard speech from a large number of speakers. In the testing stage, the magnitude spectrum of each frame of the target speaker's speech undergoes nonlinear frequency transformation under a range of transformation parameters so as to maximize the output probability of the GMM, and the Mel-frequency cepstral coefficients (MFCC) yielding the maximum output probability are taken as the compensated target speech feature parameters. The invention matches the target speaker's speech features to a pre-trained acoustic model, reduces the impact of speaker mismatch on the speech recognition system, and offers good real-time performance and independence from the back-end recognizer.
Description
Technical Field
The invention belongs to the field of speech recognition, and in particular relates to a nonlinear feature compensation method that applies nonlinear frequency transformation to the magnitude spectrum of each frame of the target speaker's speech so that its spectral characteristics match a pre-trained acoustic model.
Background
In speech recognition systems, the Hidden Markov Model (HMM) acoustic model of each speech unit is typically trained on speech from a large number of speakers. This covers the pronunciation characteristics of many speakers, but it also degrades the system's recognition performance for a particular speaker or class of speakers. Moreover, most speakers deviate from standard pronunciation, some with heavy accents. In practical applications it is therefore necessary to compensate either the target speaker's speech features or the acoustic model parameters, reducing the influence of the mismatch and improving recognition performance.
Speaker adaptation is a commonly used robust speech recognition method: based on a small amount of the target speaker's speech in the test environment, it adjusts the parameters of a pre-trained acoustic model to match the current speaker's pronunciation characteristics. In general, large-vocabulary speech recognition systems have many speech units and very limited adaptation data, so most Gaussian units of the acoustic model lack sufficient data to estimate their means and variances. It is therefore usually assumed that all Gaussian units, or each class of them, share the same linear transformation, and their data are pooled to estimate a single transformation matrix applied to all Gaussian units within the class. This reduces the accuracy of speaker adaptation and leaves system performance far from that of an ideal system trained on a large amount of target-speaker data. In addition, speaker adaptation transforms every Gaussian unit of the acoustic model, which involves complex matrix operations and high computational cost.
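For contrast, the class-based linear transformation that such adaptation estimates amounts to the sketch below; the single shared matrix, the function name, and the toy dimensions are illustrative assumptions, not part of the patent:

```python
import numpy as np

def adapt_means(means, A, b):
    """Apply one shared linear transform mu' = A @ mu + b to every
    Gaussian mean in a class (the pooled estimation the text describes)."""
    return means @ A.T + b

# Toy example: 3 Gaussian means of dimension 2, one shared transform.
means = np.arange(6.0).reshape(3, 2)
A = np.eye(2) * 2
b = np.array([1.0, -1.0])
adapted = adapt_means(means, A, b)
```

Because every mean in the class is moved by the same (A, b), speakers whose deviation is not well modeled by one linear map are adapted poorly, which is the accuracy limitation the paragraph above points out.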
Disclosure of Invention
Purpose of the invention: to address the problems in the prior art, the invention provides a feature compensation method based on nonlinear spectral transformation. In this method, a Gaussian mixture model (GMM) is trained on standard speech from a large number of speakers; in the testing stage, the magnitude spectrum of each frame of the target speaker's speech undergoes nonlinear frequency transformation under a range of transformation parameters so as to maximize the output probability of the GMM, and the Mel-frequency cepstral coefficients (MFCC) yielding the maximum output probability are taken as the compensated target speech feature parameters.
The method comprises the following specific steps:
(1) extract standard MFCCs from the training speech of a large number of speakers and train a Gaussian mixture model;
(2) window and frame the target speaker's speech and apply the Fast Fourier Transform (FFT) to obtain the magnitude spectrum of each frame;
(3) apply frequency transformation to the magnitude spectrum of each frame;
(4) apply Mel filtering to the transformed magnitude spectrum, take the logarithm, and apply the Discrete Cosine Transform (DCT) to obtain the nonlinearly frequency-transformed MFCC;
(5) acoustically decode the transformed MFCC with the GMM and record the output probability;
(6) change the frequency transformation parameter and repeat steps (3) to (5);
(7) compare the output probabilities for all frequency transformation parameters and select the MFCC whose parameter gives the maximum output probability as the compensated target speech feature parameter.
Drawings
Fig. 1 is a general framework diagram of a nonlinear spectral transform-based feature compensation system, which mainly includes FFT, frequency transform, Mel filtering, logarithm taking, DCT and GMM decoding modules.
Detailed Description
The present invention is further illustrated by the following examples. These examples are purely illustrative and do not limit the scope of the invention; various equivalent modifications that occur to those skilled in the art upon reading this disclosure fall within the scope of the appended claims.
As shown in fig. 1, the characteristic compensation method based on nonlinear spectral transformation mainly includes FFT, frequency transformation, Mel filtering, logarithm taking, DCT and GMM decoding modules. The specific embodiments of the various main modules in the drawings are described in detail below:
1. model training
Standard MFCCs are extracted from the training speech of a large number of speakers and used to train a Gaussian mixture model.
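As an illustrative sketch only (the patent does not prescribe a training algorithm or toolkit), a diagonal-covariance GMM of the kind used here can be trained with a few iterations of EM; the function name, component count, and toy data below are assumptions:

```python
import numpy as np

def train_gmm(X, n_components=4, n_iter=50, seed=0):
    """Minimal EM for a diagonal-covariance GMM (illustrative stand-in
    for full-scale training on many speakers' MFCCs)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, n_components, replace=False)]  # init on samples
    variances = np.ones((n_components, d))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: per-component weighted log densities, shape (n, K).
        log_p = (np.log(weights)
                 - 0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                          + (((X[:, None, :] - means) ** 2) / variances).sum(axis=2)))
        resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
        # M-step: responsibility-weighted parameter updates.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return weights, means, variances
```

A production system would train on MFCC vectors from many hours of standard speech; the EM loop above only shows the shape of the model being fit.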
2. FFT
The target speaker's input speech is windowed, framed, and transformed with the FFT to obtain the magnitude spectrum of each frame.
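A minimal sketch of this step; the Hamming window, 256-sample frames, and 50% hop are illustrative choices the patent does not fix:

```python
import numpy as np

def magnitude_spectrum(signal, frame_len=256, hop=128, n_fft=256):
    """Frame the signal with a Hamming window and take |FFT| per frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * np.hamming(frame_len)
                       for i in range(n_frames)])
    # Keep the non-negative-frequency half of the spectrum (N/2 + 1 bins).
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
```

For example, a 1 kHz tone sampled at 8 kHz should peak at bin 32 of a 256-point FFT (1000 / 8000 × 256 = 32).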
3. Frequency transformation
Generally, different speakers produce the same speech unit with different formant positions; transforming the frequency axis with a bilinear transformation brings the target speaker's formant structure closer to the standard spectrum of the training speech. Let k denote the digital frequency variable of the original target speech magnitude spectrum; the new digital frequency variable l is obtained by

l = round( (N / 2π) · ( ω_k + 2 · arctan( (a · sin ω_k) / (1 − a · cos ω_k) ) ) ),  ω_k = 2πk / N,

where a is the frequency transformation parameter, round() is the rounding function, and N is the length (number of points) of the discrete Fourier transform.
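The digital frequency mapping k -> l described above can be sketched as follows; applying the map as a source-bin lookup (rather than interpolating) is an assumption, since the patent does not specify that detail:

```python
import numpy as np

def warp_indices(N, a):
    """Bilinear frequency warping of digital frequency index k -> l
    (the standard all-pass warp consistent with the symbols a, round(), N)."""
    k = np.arange(N // 2 + 1)
    omega = 2 * np.pi * k / N
    omega_w = omega + 2 * np.arctan2(a * np.sin(omega), 1 - a * np.cos(omega))
    return np.clip(np.round(N * omega_w / (2 * np.pi)).astype(int), 0, N // 2)

def warp_spectrum(mag, a):
    """Warp a (frames x bins) magnitude half-spectrum by source-bin lookup."""
    N = 2 * (mag.shape[1] - 1)
    return mag[:, warp_indices(N, a)]
```

For a = 0 the map is the identity, and for |a| < 1 it is monotone, so the warped spectrum preserves the ordering of frequencies while compressing or stretching the formant region.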
4. Feature extraction
Mel filtering is applied to the transformed magnitude spectrum, followed by taking the logarithm and applying the DCT, yielding the nonlinearly frequency-transformed MFCC.
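The Mel-filter / log / DCT chain of this step might look as follows; the filter count (26), the 8 kHz sampling rate, and 13 cepstral coefficients are illustrative assumptions:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=256, fs=8000):
    """Triangular Mel filterbank over the FFT half-spectrum."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_from_mag(mag, n_ceps=13, fs=8000):
    """Mel filtering -> log -> DCT-II, as in step (4) of the method."""
    fb = mel_filterbank(n_fft=2 * (mag.shape[1] - 1), fs=fs)
    logmel = np.log(mag @ fb.T + 1e-10)
    n = fb.shape[0]
    # DCT-II matrix built explicitly (avoids a scipy dependency).
    dct = np.cos(np.pi / n * np.outer(np.arange(n), np.arange(n) + 0.5))
    return logmel @ dct[:n_ceps].T
```

Running this on the warped magnitude spectrum of each frame gives the candidate MFCC vectors that are scored by the GMM in the next step.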
5. GMM decoding
The nonlinearly frequency-transformed MFCC is acoustically decoded with the pre-trained GMM and the output probability is recorded. The frequency transformation parameter a takes several equally spaced values in the interval [-1, 1]; for each value of a, frequency transformation, feature extraction, and GMM decoding are performed and the output probability is recorded. Once all values of a have been processed, the output probabilities are compared and the MFCC corresponding to the value of a with the maximum output probability is selected as the compensated target speech feature parameter.
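Putting the pieces together, the parameter search of this step can be sketched end to end; the feature function, GMM parameters, and grid of a values below are placeholders (the patent samples a at equal intervals in [-1, 1]; a narrower illustrative grid is used here):

```python
import numpy as np

def warp(mag, a):
    """Bilinear warp of a (frames x bins) magnitude half-spectrum."""
    N = 2 * (mag.shape[1] - 1)
    w = 2 * np.pi * np.arange(mag.shape[1]) / N
    l = np.round(N / (2 * np.pi)
                 * (w + 2 * np.arctan2(a * np.sin(w), 1 - a * np.cos(w)))).astype(int)
    return mag[:, np.clip(l, 0, mag.shape[1] - 1)]

def diag_gmm_score(X, weights, means, variances):
    """Total log output probability of X under a diagonal-covariance GMM."""
    lp = (np.log(weights)
          - 0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                   + (((X[:, None, :] - means) ** 2) / variances).sum(axis=2)))
    return np.logaddexp.reduce(lp, axis=1).sum()

def compensate(mag, feat_fn, gmm, a_grid=np.linspace(-0.4, 0.4, 9)):
    """Search over warp parameters a; keep the features that maximize
    the GMM output probability (steps (3)-(7) of the method)."""
    score, a_best = max((diag_gmm_score(feat_fn(warp(mag, a)), *gmm), a)
                        for a in a_grid)
    return feat_fn(warp(mag, a_best)), a_best
```

In use, `feat_fn` would be a full MFCC extractor and `gmm` the model trained on standard speech; the search itself is just an argmax over a one-dimensional grid, which is what gives the method its low computational cost relative to transforming every Gaussian unit of the acoustic model.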
Claims (4)
1. A feature compensation method based on nonlinear spectral transformation, characterized in that: a Gaussian mixture model (GMM) is trained on standard speech from a large number of speakers; in the testing stage, the magnitude spectrum of each frame of the target speaker's speech undergoes nonlinear frequency transformation under a range of transformation parameters so as to maximize the output probability of the GMM, and the Mel-frequency cepstral coefficients (MFCC) yielding the maximum output probability are taken as the compensated target speech feature parameters.
2. The feature compensation method based on nonlinear spectral transformation according to claim 1, specifically comprising:
(1) extracting standard MFCC from training voices of a large number of speakers, and training to generate a Gaussian mixture model;
(2) windowing the voice of a target speaker, framing, and performing Fast Fourier Transform (FFT) to obtain the amplitude spectrum of each frame of voice signal;
(3) carrying out frequency conversion on the amplitude spectrum of each frame of voice signal;
(4) performing Mel filtering on the transformed magnitude spectrum, taking logarithm, and performing Discrete Cosine Transform (DCT) to obtain MFCC after nonlinear frequency transformation;
(5) performing acoustic decoding on the MFCC subjected to the nonlinear frequency conversion by using the GMM, and recording the output probability;
(6) replacing the frequency conversion parameters, and repeating the steps (3) to (5);
(7) and comparing the output probability corresponding to each frequency conversion parameter, and selecting the MFCC corresponding to the conversion parameter with the maximum output probability as the compensated target voice characteristic parameter.
3. The feature compensation method based on nonlinear spectral transformation according to claim 2, characterized in that: the nonlinear transformation of the digital frequency is performed by

l = round( (N / 2π) · ( ω_k + 2 · arctan( (a · sin ω_k) / (1 − a · cos ω_k) ) ) ),  ω_k = 2πk / N,

wherein k and l respectively denote the digital frequency of the speech magnitude spectrum before and after transformation, a is the frequency transformation parameter, round() is the rounding function, and N is the length of the discrete Fourier transform.
4. The feature compensation method based on nonlinear spectral transformation according to claim 2, characterized in that: the nonlinearly frequency-transformed MFCC is acoustically decoded with the pre-trained GMM and the output probability recorded; the frequency transformation parameter a takes several equally spaced values in the interval [-1, 1], and for each value of a, frequency transformation, feature extraction, and GMM decoding are performed and the output probability recorded; after all values of a have been processed, the output probabilities are compared and the MFCC corresponding to the value of a with the maximum output probability is selected as the compensated target speech feature parameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711112747.6A CN107919115B (en) | 2017-11-13 | 2017-11-13 | Characteristic compensation method based on nonlinear spectral transformation |
Publications (2)
Publication Number | Publication Date
---|---
CN107919115A | 2018-04-17
CN107919115B | 2021-07-27
Family
ID=61896268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711112747.6A Active CN107919115B (en) | 2017-11-13 | 2017-11-13 | Characteristic compensation method based on nonlinear spectral transformation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107919115B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108877784B (en) * | 2018-09-05 | 2022-12-06 | 河海大学 | Robust speech recognition method based on accent recognition |
CN108986794B (en) * | 2018-09-19 | 2023-02-28 | 河海大学 | Speaker compensation method based on power function frequency transformation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102324232A (en) * | 2011-09-12 | 2012-01-18 | 辽宁工业大学 | Method for recognizing sound-groove and system based on gauss hybrid models |
US8160877B1 (en) * | 2009-08-06 | 2012-04-17 | Narus, Inc. | Hierarchical real-time speaker recognition for biometric VoIP verification and targeting |
KR20120130371A (en) * | 2011-05-23 | 2012-12-03 | 수원대학교산학협력단 | Method for recogning emergency speech using gmm |
CN107103914A (en) * | 2017-03-20 | 2017-08-29 | 南京邮电大学 | A kind of high-quality phonetics transfer method |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7165028B2 (en) * | 2001-12-12 | 2007-01-16 | Texas Instruments Incorporated | Method of speech recognition resistant to convolutive distortion and additive distortion |
JP2010078650A (en) * | 2008-09-24 | 2010-04-08 | Toshiba Corp | Speech recognizer and method thereof |
KR101014321B1 (en) * | 2009-02-24 | 2011-02-14 | 한국전자통신연구원 | Method for emotion recognition based on Minimum Classification Error |
CN102664010B (en) * | 2012-05-04 | 2014-04-16 | 山东大学 | Robust speaker distinguishing method based on multifactor frequency displacement invariant feature |
CN103000174B (en) * | 2012-11-26 | 2015-06-24 | 河海大学 | Feature compensation method based on rapid noise estimation in speech recognition system |
- 2017-11-13: Application CN201711112747.6A filed (CN107919115B, Active)
Non-Patent Citations (5)
Title |
---|
"Cubic spline approximation of bilinear frequency unwarping"; Pramod H. Kachare et al.; 2015 International Conference on Industrial Instrumentation and Control (ICIC); 2015-12-31; full text *
"Parametric Voice Conversion Based on Bilinear Frequency Warping Plus Amplitude Scaling"; Daniel Erro et al.; IEEE Transactions on Audio, Speech, and Language Processing, Vol. 21, Issue 3, March 2013; 2013-03-31; full text *
"A nonlinear frequency scale transformation suitable for speaker recognition"; Yu Yibiao et al.; Acta Acustica; 2008-09-30; Sections 2.2, 3 and 4 *
"Controlling formants in voice conversion by spectral shifting"; Peng Bai et al.; Speech Technology; 2007-01-31; full text *
"Adaptive Gaussian mixture model and its application to speaker recognition"; Wang Yunqi et al.; Communications Technology; 2014-07-31; full text *
Also Published As
Publication number | Publication date |
---|---|
CN107919115A (en) | 2018-04-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8438026B2 (en) | Method and system for generating training data for an automatic speech recognizer | |
Boril et al. | Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments | |
CN108877784B (en) | Robust speech recognition method based on accent recognition | |
Aggarwal et al. | Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system | |
CN104392718A (en) | Robust voice recognition method based on acoustic model array | |
Chuang et al. | Speaker-Aware Deep Denoising Autoencoder with Embedded Speaker Identity for Speech Enhancement. | |
Ismail et al. | Mfcc-vq approach for qalqalahtajweed rule checking | |
Rajnoha et al. | ASR systems in noisy environment: Analysis and solutions for increasing noise robustness | |
Alam et al. | Robust feature extraction based on an asymmetric level-dependent auditory filterbank and a subband spectrum enhancement technique | |
Chauhan et al. | Speech to text converter using Gaussian Mixture Model (GMM) | |
CN107919115B (en) | Characteristic compensation method based on nonlinear spectral transformation | |
KR101236539B1 (en) | Apparatus and Method For Feature Compensation Using Weighted Auto-Regressive Moving Average Filter and Global Cepstral Mean and Variance Normalization | |
Hachkar et al. | A comparison of DHMM and DTW for isolated digits recognition system of Arabic language | |
Chavan et al. | Speech recognition in noisy environment, issues and challenges: A review | |
Alam et al. | Regularized minimum variance distortionless response-based cepstral features for robust continuous speech recognition | |
Kaur et al. | Power-Normalized Cepstral Coefficients (PNCC) for Punjabi automatic speech recognition using phone based modelling in HTK | |
Upadhyay et al. | Robust recognition of English speech in noisy environments using frequency warped signal processing | |
Akhter et al. | An analysis of performance evaluation metrics for voice conversion models | |
Xiao | Robust speech features and acoustic models for speech recognition | |
Wu et al. | An environment-compensated minimum classification error training approach based on stochastic vector mapping | |
CN108986794B (en) | Speaker compensation method based on power function frequency transformation | |
Singh et al. | A comparative study of recognition of speech using improved MFCC algorithms and Rasta filters | |
Dutta et al. | A comparison of three spectral features for phone recognition in sub-optimal environments | |
Kathania et al. | Experiments on children's speech recognition under acoustically mismatched conditions | |
Harshita et al. | Speech Recognition with Frequency Domain Linear Prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |