WO2010109725A1 - Voice processing apparatus, voice processing method, and voice processing program - Google Patents

Voice processing apparatus, voice processing method, and voice processing program Download PDF

Info

Publication number
WO2010109725A1
Authority
WO
WIPO (PCT)
Prior art keywords
distribution
feature
speech
acoustic model
noise
Prior art date
Application number
PCT/JP2009/069580
Other languages
French (fr)
Japanese (ja)
Inventor
Yusuke Shinohara
Masami Akamine
Original Assignee
Toshiba Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corporation
Publication of WO2010109725A1 publication Critical patent/WO2010109725A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • the present invention relates to a voice processing device, a voice processing method, and a voice processing program.
  • the feature enhancement method is a technique for estimating, from noisy speech features extracted from speech on which noise is superimposed in a noisy environment (hereinafter referred to as noisy speech), the features of speech in a noise-free environment such as a soundproofed room (hereinafter referred to as clean speech). By using the clean speech features estimated by the feature enhancement method, speech recognition performance under noise can be improved.
  • Non-Patent Document 1 discloses a conventional speech recognition apparatus.
  • a conventional speech recognition apparatus includes a feature extraction unit, a first acoustic model storage unit, a probability calculation unit, a distribution storage unit, a mixed distribution generation unit, a feature enhancement unit, a second acoustic model storage unit, and a decoding unit.
  • the feature extraction unit extracts a noisy speech feature from each frame of the input noisy speech.
  • the first acoustic model storage unit stores a first acoustic model representing a standard phonemic pattern in a noisy environment.
  • the probability calculation unit collates the noisy speech feature sequence with the first acoustic model, and calculates a probability of staying in each distribution of the first acoustic model in each frame (distribution posterior probability).
  • the distribution storage unit stores a set of basis distributions. Each of the basis distributions is a combined Gaussian distribution of clean speech features and noisy speech features.
  • the mixed distribution generation unit generates a mixed distribution by mixing the base distribution with the distribution posterior probability for each frame. This mixed distribution represents a combined distribution of clean speech characteristics and noisy speech features in the frame.
  • the feature enhancement unit estimates the clean speech feature from the noisy speech feature using the mixture distribution in each frame.
  • the second acoustic model storage unit stores a second acoustic model representing a standard phonemic pattern in a clean environment.
  • the decoding unit collates the sequence of clean speech features estimated by the feature enhancement unit with the second acoustic model, and outputs an optimal word string.
  • since the speech recognition apparatus of Non-Patent Document 1 performs feature enhancement using joint Gaussian distributions learned in advance, its speech recognition performance deteriorates under noise that differs from the noise present during learning.
  • to address this, a joint Gaussian distribution of clean speech features and noisy speech features could be synthesized dynamically from a Gaussian distribution of clean speech features each time the noise changes.
  • however, since an ordinary acoustic model has several thousand to several tens of thousands of Gaussian distributions, an enormous amount of computation is required to synthesize the joint Gaussian distributions dynamically, which is not practical.
  • the present invention has been made in view of the above, and its purpose is to provide a voice processing device, a voice processing method, and a voice processing program capable of achieving high speech recognition performance with a small amount of computation even in an environment where the noise changes.
  • the present invention comprises: a feature extraction unit that extracts a first speech feature from each frame of first speech on which noise is superimposed in a noisy environment and calculates a sequence of the first speech features; a noise estimation unit that estimates the noise superimposed on the first speech; a first distribution storage unit that stores a set of first basis distributions representing a distribution of second speech features of second speech in a noise-free environment; a distribution synthesis unit that synthesizes, based on the noise, from each of the first basis distributions a second basis distribution representing a joint distribution of the first speech feature and the second speech feature; a first acoustic model storage unit that stores a first acoustic model representing a standard pattern of phonemes in a noisy environment; a probability calculation unit that collates the sequence of the first speech features with the first acoustic model and calculates, for each frame, a state posterior probability that is a probability of staying in each state of the first acoustic model; a mixture weight storage unit that stores, for each state of the first acoustic model, a mixture weight corresponding to each of the second basis distributions; a mixture weight fusion unit that fuses the mixture weights using the state posterior probabilities to calculate fused mixture weights; a mixture distribution generation unit that mixes, for each frame, the second basis distributions with the fused mixture weights to generate a mixture distribution that is a joint distribution of the first speech feature and the second speech feature in the frame; and a feature enhancement unit that estimates, for each frame, the second speech feature from the first speech feature using the mixture distribution.
  • FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to the present embodiment.
  • the speech recognition apparatus includes a feature extraction unit 1, a noise estimation unit 2, a first distribution storage unit 3, a first distribution storage control unit 4, a second distribution storage unit 5, a distribution synthesis unit 6, a first acoustic model storage unit 7, a first acoustic model storage control unit 8, a probability calculation unit 9, a mixture weight storage unit 10, a mixture weight storage control unit 11, a mixture weight fusion unit 12, a mixture distribution generation unit 13, a feature enhancement unit 14, a second acoustic model storage unit 15, a second acoustic model storage control unit 16, and a decoding unit 17.
  • the feature extraction unit 1 extracts features from each frame of the input noisy speech, and calculates a sequence of noisy speech features.
  • a frame is a short segment cut out of the input speech signal; frames are cut out sequentially while shifting the extraction window little by little.
  • a vector having a mel frequency cepstrum coefficient (MFCC) as an element can be used as a feature.
  • the feature dimension is d.
  • a sequence of noisy speech features is calculated by extracting features from each of the sequentially extracted frames.
  • the noise estimation unit 2 estimates noise superimposed on the input noisy speech. For example, it is possible to select a section that does not include speech and includes only noise using a voice section detector, and perform noise estimation using this section. More specifically, in the section consisting only of noise, the above features are extracted from each frame, and the average / covariance is obtained from the obtained set of features. This mean / covariance defines a Gaussian distribution of noise features.
  • the first distribution storage unit 3 stores a set of first basis distributions.
  • a d-dimensional Gaussian distribution is used as the first basis distribution.
  • Each basis distribution represents a distribution of clean speech features. A method of calculating the first basis distribution set will be described in detail later.
  • the first distribution storage control unit 4 performs control such that the first distribution storage unit 3 stores the first set of basis distributions.
  • the second distribution storage unit 5 stores a set of second basis distributions.
  • a 2 ⁇ d-dimensional Gaussian distribution is used as the second basis distribution.
  • Each basis distribution represents a combined Gaussian distribution of clean speech features and noisy speech features.
  • the distribution synthesis unit 6 synthesizes, based on the noise estimated by the noise estimation unit 2, a second basis distribution from each of the first basis distributions stored in the first distribution storage unit 3, and stores them in the second distribution storage unit 5. That is, a joint Gaussian distribution of clean speech features and noisy speech features is synthesized from a Gaussian distribution of noise features and a Gaussian distribution of clean speech features.
  • for the synthesis of the distributions, for example, the Vector Taylor Series (VTS) method or the unscented transform can be used.
  • the first acoustic model storage unit 7 stores a first acoustic model representing a standard phonemic pattern in a noisy environment. More specifically, the acoustic model is a hidden Markov model, and the output distribution in each state and the transition probability between states are stored. The first acoustic model is created in advance from learning data consisting of a set of noisy speech features.
  • the first acoustic model storage control unit 8 performs control such that the first acoustic model storage unit 7 stores the first acoustic model.
  • the probability calculation unit 9 collates the sequence of noisy speech features calculated by the feature extraction unit 1 with the first acoustic model stored in the first acoustic model storage unit 7, and calculates, for each frame, the probability of staying in each state of the first acoustic model (state posterior probability).
  • state posterior probability can be calculated by using a forward backward algorithm.
  • the state posterior probability can be calculated from the N best candidate list. The method for calculating the state posterior probability using the N best candidate list is described in detail in Non-Patent Document 1, for example.
  • the mixing weight storage unit 10 stores the mixing weight corresponding to each of the second basis distributions for each state of the first acoustic model.
  • when the number of states is L and the number of basis distributions is K, L × K mixture weight values are stored.
  • for each state of the first acoustic model, the mixture distribution generated by mixing the K basis distributions stored in the first distribution storage unit 3 with the mixture weights corresponding to that state represents the distribution of clean speech features in that state.
  • similarly, for each state, the mixture distribution generated by mixing the K basis distributions stored in the second distribution storage unit 5 with the mixture weights corresponding to that state represents the joint distribution of clean and noisy speech features in that state. The method for calculating the mixture weights will be described in detail later.
  • the mixing weight storage control unit 11 controls the mixing weight storage unit 10 to store the mixing weight.
  • the mixture weight fusion unit 12 fuses the mixture weights stored in the mixture weight storage unit 10 using the state posterior probabilities calculated by the probability calculation unit 9, and calculates fused mixture weights. Specifically, the fusion of the mixture weights is performed according to Equation (1).
  • here, γ(t, j) is the state posterior probability of staying in state j at frame t, w(j, k) is the mixture weight of the k-th basis distribution in state j, Σ_j denotes the sum over j, and v(t, k) is the fused mixture weight of the k-th basis distribution at frame t.
  • the mixture distribution generation unit 13 mixes, for each frame, the second basis distributions acquired from the second distribution storage unit 5 with the fused mixture weights calculated by the mixture weight fusion unit 12 to generate a mixture distribution.
  • the mixture distribution is a Gaussian mixture distribution.
  • the generated mixture distribution represents a combined distribution of clean speech characteristics and noisy speech features in the frame.
  • the feature enhancement unit 14 estimates, for each frame, the clean speech feature from the noisy speech feature using the mixture distribution. Details of feature enhancement methods that use a mixture distribution representing the joint distribution of clean and noisy speech features are disclosed in, for example, Non-Patent Document 1.
  • the second acoustic model storage unit 15 stores a second acoustic model representing a standard phonemic pattern in a clean environment. More specifically, the acoustic model is a hidden Markov model, and the output distribution in each state and the transition probability between states are stored.
  • the second acoustic model is created in advance using learning data composed of a set of clean speech features. Preferably, it is created in advance using learning data consisting of a set of clean speech features processed by the feature enhancement unit 14. That is, an acoustic model is created using a set of clean speech features obtained by processing the set of noisy speech features used for learning the first acoustic model by the feature enhancement unit 14 as learning data.
  • the second acoustic model storage control unit 16 controls the second acoustic model storage unit 15 to store the second acoustic model.
  • the decoding unit 17 collates the sequence of clean speech features estimated by the feature enhancement unit 14 with the second acoustic model stored in the second acoustic model storage unit 15 and outputs an optimum word string.
  • a Viterbi algorithm is used for collation.
  • a first set of basis distributions and a mixture weight are calculated using an EM algorithm so as to maximize the likelihood of learning data including a given set of clean speech features.
  • in the following, a method of creating learning data consisting of a set of clean speech features is described first, and then the likelihood maximization method using the EM algorithm is described.
  • each clean speech feature is associated with one of the states of the first acoustic model.
  • specifically, given a sequence of clean speech features extracted from a single utterance, its transcription (the spoken word sequence), and the first acoustic model, each clean speech feature in the sequence can be associated with a state of the acoustic model using the Viterbi algorithm. Alternatively, a soft (fuzzy) association may be performed using the forward-backward algorithm.
  • learning data that is a set of clean speech features associated with any state of the acoustic model can be created.
  • let D denote the set of learning data and Dj the subset of learning data associated with the j-th state.
  • let x_i denote the i-th learning sample (a clean speech feature).
  • let θ_k denote the mean and covariance parameters of the k-th basis Gaussian, collected as θ = {θ_1, ..., θ_K}.
  • let w_jk denote the mixture weight of the k-th basis distribution in state j, collected over k as w_j = {w_j1, ..., w_jK} and over all j as w = {w_1, ..., w_L}.
  • here, K is the number of basis distributions and L is the number of states.
  • the (log) likelihood L(θ, w) of the learning data is defined as in Equation (2).
  • ⁇ and w are calculated using an EM algorithm so as to maximize this likelihood.
  • in the E step, the posterior probability that each learning sample belongs to each basis distribution is calculated based on the current values of θ and w.
  • in the M step, θ and w are updated so as to maximize the expected complete-data log likelihood under these posterior probabilities.
  • initial values of θ and w are required; for example, a Gaussian mixture model with K components can be trained on the entire learning data D, and the resulting set of Gaussians and mixture weights (denoted u) can be used, setting w_1 = ... = w_L = u. After initialization, the E and M steps are iterated until the increase in likelihood converges.
  • the maximum likelihood training method using the EM algorithm is described in detail in, for example, L. Rabiner and B.-H. Juang (translated by Sadaoki Furui), Fundamentals of Speech Recognition, NTT Advanced Technology, 1995.
  • the first basis distribution set and the mixture weight calculated as described above are stored in the first distribution storage unit 3 and the mixture weight storage unit 10, respectively.
  • FIG. 2 is a flowchart showing the operation of the speech processing apparatus according to this embodiment.
  • the feature extraction unit 1 extracts features from each frame of the input noisy speech and calculates a sequence of noisy speech features (step S1).
  • the noise estimation unit 2 performs noise estimation from the noisy speech feature sequence calculated by the feature extraction unit 1 (step S2).
  • the distribution synthesis unit 6 synthesizes the second basis distribution from each of the first basis distributions using the noise estimated by the noise estimation unit 2 and stores it in the second distribution storage unit 5 (Ste S3).
  • steps S4 and S5 are executed in parallel with steps S2 and S3. That is, the probability calculation unit 9 collates the sequence of noisy speech features calculated by the feature extraction unit 1 with the first acoustic model, and calculates, for each frame, the probability of staying in each state of the first acoustic model (state posterior probability) (step S4).
  • the mixture weight fusion unit 12 fuses, for each frame, the mixture weights acquired from the mixture weight storage unit 10 with the state posterior probabilities calculated by the probability calculation unit 9, and calculates the fused mixture weights (step S5).
  • in step S6, the mixture distribution generation unit 13 mixes the second basis distributions stored in the second distribution storage unit 5 with the fused mixture weights to generate a mixture distribution.
  • the feature emphasizing unit 14 calculates a clean speech feature from the noisy speech feature using the mixture distribution generated by the mixture distribution generation unit 13 for each frame (step S7).
  • the decoding unit 17 collates the sequence of clean speech features calculated by the feature enhancement unit 14 with the second acoustic model stored in the second acoustic model storage unit 15, outputs the optimum word string, and ends the speech recognition (speech processing) (step S8).
  • in this way, the correct speech content is recognized from the noisy speech.
  • the speech processing apparatus according to this embodiment uses only a small number of basis distributions instead of the large number of distributions used in the prior art, so the amount of computation required to synthesize the joint distributions of clean and noisy speech features is greatly reduced, and high speech recognition performance can be maintained with a small amount of computation even in an environment where the noise changes.
  • the voice processing apparatus of this embodiment includes a control device such as a CPU, a storage device, an external storage device such as a hard disk drive, a display device, and input devices such as a keyboard and a mouse, and has a hardware configuration using an ordinary computer.
  • the voice processing program executed by the voice processing apparatus is recorded on a computer-readable recording medium such as a CD-ROM, flexible disk (FD), CD-R, or DVD (Digital Versatile Disk) as a file in an installable or executable format, and is provided as a computer program product.
  • the voice processing program executed by the voice processing apparatus of the present embodiment may be provided by being stored on a computer connected to a network such as the Internet and downloaded via the network.
  • the voice processing program executed by the voice processing apparatus according to the present embodiment may be provided or distributed via a network such as the Internet.
  • the voice processing program of the present embodiment may be provided by being incorporated in advance in a ROM or the like.
  • the voice processing program executed by the voice processing apparatus has a module configuration including the above-described units (feature extraction unit, noise estimation unit, first distribution storage control unit, distribution synthesis unit, first acoustic model storage control unit, probability calculation unit, mixture weight storage control unit, mixture weight fusion unit, mixture distribution generation unit, feature enhancement unit, second acoustic model storage control unit, and decoding unit).
  • as actual hardware, a CPU (processor) reads the voice processing program from the storage medium and executes it, whereby the above units are loaded onto the main storage device and the feature extraction unit, noise estimation unit, first distribution storage control unit, distribution synthesis unit, first acoustic model storage control unit, probability calculation unit, mixture weight storage control unit, mixture weight fusion unit, mixture distribution generation unit, feature enhancement unit, second acoustic model storage control unit, and decoding unit are generated on the main storage device.
  • the present invention is not limited to the above-described embodiment as it is, and can be embodied by modifying the constituent elements without departing from the scope of the invention in the implementation stage.
  • various inventions can be formed by appropriately combining a plurality of constituent elements disclosed in the above embodiments. For example, some components may be deleted from all the components shown in the embodiment. Furthermore, constituent elements over different embodiments may be appropriately combined.
  • the speech processing apparatus, speech processing method, and speech processing program according to the present invention are useful when speech recognition is performed under noise.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A voice processing apparatus comprises: a feature extraction unit (1) that extracts first speech features; a noise estimation unit (2) that estimates noise; a first distribution storage unit (3) that stores a set of first basis distributions; a distribution synthesis unit (6) that synthesizes a second basis distribution from each first basis distribution based on the estimated noise; a first acoustic model storage unit (7) that stores a first acoustic model; a probability calculation unit (9) that calculates state posterior probabilities by collating the sequence of first speech features with the first acoustic model; a mixture weight storage unit (10) that stores mixture weights corresponding to the respective second basis distributions; a mixture weight fusion unit (12) that fuses the mixture weights using the state posterior probabilities to calculate fused mixture weights; a mixture distribution generation unit (13) that mixes the second basis distributions with the fused mixture weights to generate a mixture distribution; and a feature enhancement unit (14) that estimates second speech features from the first speech features using the mixture distribution.

Description

Voice processing apparatus, voice processing method, and voice processing program
The present invention relates to a voice processing apparatus, a voice processing method, and a voice processing program.
Many methods have conventionally been proposed for operating a speech recognition apparatus (voice processing apparatus) stably under noise. In particular, feature enhancement methods have been studied actively. A feature enhancement method is a technique for estimating, from noisy speech features extracted from speech on which noise is superimposed in a noisy environment (hereinafter referred to as noisy speech), the features of speech in a noise-free environment such as a soundproofed room (hereinafter referred to as clean speech). By using the clean speech features estimated by the feature enhancement method, speech recognition performance under noise can be improved.
For example, Non-Patent Document 1 discloses a conventional speech recognition apparatus. The conventional speech recognition apparatus includes a feature extraction unit, a first acoustic model storage unit, a probability calculation unit, a distribution storage unit, a mixture distribution generation unit, a feature enhancement unit, a second acoustic model storage unit, and a decoding unit. The feature extraction unit extracts a noisy speech feature from each frame of the input noisy speech. The first acoustic model storage unit stores a first acoustic model representing standard phoneme patterns in a noisy environment. The probability calculation unit collates the sequence of noisy speech features with the first acoustic model and calculates, for each frame, the probability of staying in each distribution of the first acoustic model (distribution posterior probability). The distribution storage unit stores a set of basis distributions, each of which is a joint Gaussian distribution of clean speech features and noisy speech features. The mixture distribution generation unit generates, for each frame, a mixture distribution by mixing the basis distributions with the distribution posterior probabilities; this mixture distribution represents the joint distribution of clean and noisy speech features in that frame. The feature enhancement unit estimates, in each frame, the clean speech feature from the noisy speech feature using the mixture distribution. The second acoustic model storage unit stores a second acoustic model representing standard phoneme patterns in a clean environment. The decoding unit collates the sequence of clean speech features estimated by the feature enhancement unit with the second acoustic model and outputs an optimum word string.
However, since the speech recognition apparatus of Non-Patent Document 1 performs feature enhancement using joint Gaussian distributions learned in advance, its speech recognition performance deteriorates under noise that differs from the noise present during learning. To solve this problem, the joint Gaussian distributions of clean and noisy speech features could be synthesized dynamically from the Gaussian distributions of clean speech features each time the noise changes. However, since an ordinary acoustic model has several thousand to several tens of thousands of Gaussian distributions, dynamically synthesizing the joint Gaussian distributions requires an enormous amount of computation and is not practical.
The present invention has been made in view of the above, and an object thereof is to provide a voice processing apparatus, a voice processing method, and a voice processing program capable of achieving high speech recognition performance with a small amount of computation even in an environment where the noise changes.
To solve the above problems and achieve the object, the present invention comprises: a feature extraction unit that extracts a first speech feature from each frame of first speech on which noise is superimposed in a noisy environment and calculates a sequence of the first speech features; a noise estimation unit that estimates the noise superimposed on the first speech; a first distribution storage unit that stores a set of first basis distributions representing the distribution of second speech features of second speech in a noise-free environment; a distribution synthesis unit that synthesizes, based on the noise, from each of the first basis distributions a second basis distribution representing the joint distribution of the first speech feature and the second speech feature; a first acoustic model storage unit that stores a first acoustic model representing standard phoneme patterns in a noisy environment; a probability calculation unit that collates the sequence of first speech features with the first acoustic model and calculates, for each frame, a state posterior probability, which is the probability of staying in each state of the first acoustic model; a mixture weight storage unit that stores, for each state of the first acoustic model, a mixture weight corresponding to each of the second basis distributions; a mixture weight fusion unit that fuses the mixture weights using the state posterior probabilities to calculate fused mixture weights; a mixture distribution generation unit that mixes, for each frame, the second basis distributions with the fused mixture weights to generate a mixture distribution, which is the joint distribution of the first speech feature and the second speech feature in that frame; and a feature enhancement unit that estimates, for each frame, the second speech feature from the first speech feature using the mixture distribution.
According to the present invention, high speech recognition performance can be maintained with a small amount of computation even in an environment where the noise changes.
FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to the present embodiment. FIG. 2 is a flowchart showing the operation of the speech processing apparatus according to the present embodiment.
Exemplary embodiments of a voice processing apparatus, a voice processing method, and a voice processing program according to the present invention are described in detail below with reference to the accompanying drawings.
FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to the present embodiment. The speech recognition apparatus includes a feature extraction unit 1, a noise estimation unit 2, a first distribution storage unit 3, a first distribution storage control unit 4, a second distribution storage unit 5, a distribution synthesis unit 6, a first acoustic model storage unit 7, a first acoustic model storage control unit 8, a probability calculation unit 9, a mixture weight storage unit 10, a mixture weight storage control unit 11, a mixture weight fusion unit 12, a mixture distribution generation unit 13, a feature enhancement unit 14, a second acoustic model storage unit 15, a second acoustic model storage control unit 16, and a decoding unit 17.
The feature extraction unit 1 extracts a feature from each frame of the input noisy speech and calculates a sequence of noisy speech features. A frame is a short segment cut out of the input speech signal; frames are cut out sequentially while shifting the extraction window little by little. As the feature, for example, a vector whose elements are mel-frequency cepstral coefficients (MFCCs) can be used. In the following, the feature dimension is denoted d. A sequence of noisy speech features is obtained by extracting a feature from each of the sequentially extracted frames.
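As an informal illustration only (not part of the patent text), the following Python sketch shows one way the framewise MFCC extraction described above could be implemented; the use of librosa, the 16 kHz sampling rate, and the 25 ms / 10 ms framing are all assumptions.

```python
# Hypothetical sketch of the feature extraction step (unit 1): framewise MFCC vectors.
# librosa is an assumed third-party dependency, not mentioned in the patent.
import librosa

def extract_noisy_features(wav_path, d=13, frame_len=0.025, frame_shift=0.010):
    signal, sr = librosa.load(wav_path, sr=16000)
    # MFCCs are computed on overlapping frames, shifted little by little as described above.
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=d,
        n_fft=int(frame_len * sr), hop_length=int(frame_shift * sr))
    return mfcc.T  # shape: (num_frames, d), one d-dimensional feature per frame
```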
The noise estimation unit 2 estimates the noise superimposed on the input noisy speech. For example, a voice activity detector can be used to select a segment that contains no speech and consists only of noise, and noise estimation can be performed on this segment. More specifically, in the noise-only segment, the above features are extracted from each frame, and the mean and covariance are computed from the resulting set of features. This mean and covariance define a Gaussian distribution of the noise features.
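A minimal sketch of this noise estimation step, assuming the noise-only frames have already been selected by a voice activity detector (the detector itself is not shown):

```python
# Hypothetical sketch of the noise estimation step (unit 2): fit a Gaussian to
# features taken from frames judged to contain noise only.
import numpy as np

def estimate_noise_gaussian(noise_only_features):
    """noise_only_features: array of shape (num_noise_frames, d)."""
    mu_n = noise_only_features.mean(axis=0)               # noise mean
    sigma_n = np.cov(noise_only_features, rowvar=False)   # noise covariance (d x d)
    return mu_n, sigma_n
```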
The first distribution storage unit 3 stores a set of first basis distributions. In this embodiment, d-dimensional Gaussian distributions are used as the first basis distributions. Each basis distribution represents a distribution of clean speech features. The method for calculating the set of first basis distributions is described in detail later.
The first distribution storage control unit 4 controls the first distribution storage unit 3 to store the set of first basis distributions.
The second distribution storage unit 5 stores a set of second basis distributions. In this embodiment, 2×d-dimensional Gaussian distributions are used as the second basis distributions. Each basis distribution represents a joint Gaussian distribution of a clean speech feature and a noisy speech feature.
The distribution synthesis unit 6 synthesizes, based on the noise estimated by the noise estimation unit 2, a second basis distribution from each of the first basis distributions stored in the first distribution storage unit 3, and stores them in the second distribution storage unit 5. That is, a joint Gaussian distribution of clean and noisy speech features is synthesized from the Gaussian distribution of the noise features and a Gaussian distribution of clean speech features. For the synthesis of the distributions, for example, the Vector Taylor Series (VTS) method or the unscented transform can be used.
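The following sketch illustrates a first-order Vector Taylor Series synthesis of one joint (clean, noisy) Gaussian from a clean-speech Gaussian and the noise Gaussian. It works in the log-spectral domain for simplicity; the patent does not fix the feature domain or the exact VTS formulation, so this is an assumed simplification rather than the disclosed method.

```python
# Simplified, hypothetical sketch of the distribution synthesis step (unit 6) using a
# first-order VTS expansion in the log-spectral domain, where
# noisy = clean + log(1 + exp(noise - clean)) elementwise.
import numpy as np

def synthesize_joint_gaussian(mu_x, sigma_x, mu_n, sigma_n):
    """Return the 2d-dimensional joint Gaussian of (clean x, noisy y) for one basis."""
    d = mu_x.shape[0]
    g = np.log1p(np.exp(mu_n - mu_x))                    # mismatch term at the expansion point
    mu_y = mu_x + g                                      # noisy mean
    G = np.diag(1.0 / (1.0 + np.exp(mu_n - mu_x)))       # Jacobian dy/dx at (mu_x, mu_n)
    sigma_y = G @ sigma_x @ G.T + (np.eye(d) - G) @ sigma_n @ (np.eye(d) - G).T
    sigma_xy = sigma_x @ G.T                             # cross-covariance between x and y
    mu_joint = np.concatenate([mu_x, mu_y])
    sigma_joint = np.block([[sigma_x, sigma_xy], [sigma_xy.T, sigma_y]])
    return mu_joint, sigma_joint
```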
The first acoustic model storage unit 7 stores a first acoustic model representing standard phoneme patterns in a noisy environment. More specifically, the acoustic model is a hidden Markov model, and the output distribution of each state and the transition probabilities between states are stored. The first acoustic model is created in advance from learning data consisting of a set of noisy speech features.
The first acoustic model storage control unit 8 controls the first acoustic model storage unit 7 to store the first acoustic model.
The probability calculation unit 9 collates the sequence of noisy speech features calculated by the feature extraction unit 1 with the first acoustic model stored in the first acoustic model storage unit 7, and calculates, for each frame, the probability of staying in each state of the first acoustic model (state posterior probability). The state posterior probabilities can be calculated, for example, using the forward-backward algorithm. Alternatively, they can be calculated from an N-best candidate list; a method for calculating state posterior probabilities from an N-best candidate list is described in detail in, for example, Non-Patent Document 1.
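A minimal sketch of the state posterior computation with the forward-backward algorithm; the observation likelihoods, transition matrix, and initial probabilities are assumed to be given, and the scaling (or log-domain arithmetic) that a practical implementation would need is omitted.

```python
# Hypothetical sketch of the state posterior computation (unit 9): obs_lik[t, j] is
# p(feature_t | state j), trans[i, j] the transition probability, init[j] the initial
# state probability.
import numpy as np

def state_posteriors(obs_lik, trans, init):
    T, L = obs_lik.shape
    alpha = np.zeros((T, L))
    beta = np.zeros((T, L))
    alpha[0] = init * obs_lik[0]
    for t in range(1, T):                        # forward pass
        alpha[t] = (alpha[t - 1] @ trans) * obs_lik[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = trans @ (obs_lik[t + 1] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # gamma[t, j] = p(state j | frame t)
```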
The mixture weight storage unit 10 stores, for each state of the first acoustic model, the mixture weight corresponding to each of the second basis distributions. When the number of states is L and the number of basis distributions is K, L × K values are stored. For each state of the first acoustic model, the mixture distribution generated by mixing the K basis distributions stored in the first distribution storage unit 3 with the mixture weights corresponding to that state represents the distribution of clean speech features in that state. Similarly, for each state, the mixture distribution generated by mixing the K basis distributions stored in the second distribution storage unit 5 with the mixture weights corresponding to that state represents the joint distribution of clean and noisy speech features in that state. The method for calculating the mixture weights is described in detail later.
The mixture weight storage control unit 11 controls the mixture weight storage unit 10 to store the mixture weights.
The mixture weight fusion unit 12 fuses the mixture weights stored in the mixture weight storage unit 10 using the state posterior probabilities calculated by the probability calculation unit 9, and calculates fused mixture weights. Specifically, the fusion of the mixture weights is performed according to Equation (1).
v(t, k) = Σ_j γ(t, j) · w(j, k)    …(1)
Here, γ(t, j) is the state posterior probability of staying in state j at frame t, w(j, k) is the mixture weight of the k-th basis distribution in state j, Σ_j denotes the sum over j, and v(t, k) is the fused mixture weight of the k-th basis distribution at frame t.
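Equation (1) is a simple posterior-weighted average over states, so the fusion step can be written as one matrix product; a minimal sketch:

```python
# Hypothetical sketch of the mixture weight fusion step (unit 12), i.e. Equation (1):
# v[t, k] = sum_j gamma[t, j] * w[j, k], computed for all frames at once.
import numpy as np

def fuse_mixture_weights(gamma, w):
    """gamma: (T, L) state posteriors; w: (L, K) per-state mixture weights."""
    return gamma @ w    # v: (T, K) fused mixture weights, one row per frame
```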
The mixture distribution generation unit 13 mixes, for each frame, the second basis distributions acquired from the second distribution storage unit 5 with the fused mixture weights calculated by the mixture weight fusion unit 12 to generate a mixture distribution. In this embodiment the basis distributions are Gaussian, so the mixture distribution is a Gaussian mixture. The generated mixture distribution represents the joint distribution of clean and noisy speech features in that frame.
The feature enhancement unit 14 estimates, for each frame, the clean speech feature from the noisy speech feature using the mixture distribution generated by the mixture distribution generation unit 13. Details of feature enhancement methods that use a mixture distribution representing the joint distribution of clean and noisy speech features are disclosed in, for example, Non-Patent Document 1.
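The patent defers to Non-Patent Document 1 for the enhancement itself; as an assumed stand-in, the following sketch uses the standard MMSE (conditional mean) estimator under the per-frame joint Gaussian mixture.

```python
# Hypothetical sketch of the feature enhancement step (unit 14): an MMSE estimate of the
# clean feature under the per-frame joint GMM. This is the common conditional-mean form,
# not necessarily the exact estimator of Non-Patent Document 1.
import numpy as np
from scipy.stats import multivariate_normal

def enhance_frame(y, v_t, joint_means, joint_covs):
    """y: noisy feature (d,); v_t: fused weights (K,);
    joint_means[k]: (2d,); joint_covs[k]: (2d, 2d)."""
    d = y.shape[0]
    post = np.zeros(len(v_t))
    cond_means = []
    for k, (m, S) in enumerate(zip(joint_means, joint_covs)):
        mu_x, mu_y = m[:d], m[d:]
        Sxy, Syy = S[:d, d:], S[d:, d:]
        post[k] = v_t[k] * multivariate_normal.pdf(y, mean=mu_y, cov=Syy)
        cond_means.append(mu_x + Sxy @ np.linalg.solve(Syy, y - mu_y))
    post /= post.sum()                       # component posteriors given the noisy frame
    return sum(p * cm for p, cm in zip(post, cond_means))   # MMSE clean-feature estimate
```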
The second acoustic model storage unit 15 stores a second acoustic model representing standard phoneme patterns in a clean environment. More specifically, the acoustic model is a hidden Markov model, and the output distribution of each state and the transition probabilities between states are stored. The second acoustic model is created in advance using learning data consisting of a set of clean speech features. Preferably, it is created in advance using learning data consisting of a set of clean speech features processed by the feature enhancement unit 14; that is, the acoustic model is created using, as learning data, the set of clean speech features obtained by processing with the feature enhancement unit 14 the set of noisy speech features used to train the first acoustic model. By applying the same feature enhancement processing during acoustic model training and during speech recognition, the problem of a feature mismatch between training and recognition can be avoided.
The second acoustic model storage control unit 16 controls the second acoustic model storage unit 15 to store the second acoustic model.
The decoding unit 17 collates the sequence of clean speech features estimated by the feature enhancement unit 14 with the second acoustic model stored in the second acoustic model storage unit 15, and outputs an optimum word string. The Viterbi algorithm is used for the collation.
Next, a method for calculating the set of first basis distributions stored in the first distribution storage unit 3 and the mixture weights stored in the mixture weight storage unit 10 is described.
In this method, the set of first basis distributions and the mixture weights are calculated using the EM algorithm so as to maximize the likelihood of learning data consisting of a given set of clean speech features, and are stored in the first distribution storage unit 3 and the mixture weight storage unit 10, respectively. In the following, a method of creating learning data consisting of a set of clean speech features is described first, and then the likelihood maximization method using the EM algorithm is described.
The procedure for creating the learning data is as follows. First, a set of clean speech features is prepared. Next, each clean speech feature is associated with one of the states of the first acoustic model. Specifically, given a sequence of clean speech features extracted from a single utterance, its transcription (the spoken word sequence), and the first acoustic model, each clean speech feature in the sequence can be associated with a state of the acoustic model using the Viterbi algorithm. Alternatively, a soft (fuzzy) association may be performed using the forward-backward algorithm. In this way, learning data consisting of clean speech features each associated with a state of the acoustic model can be created.
Next, the procedure for calculating the set of basis distributions and the mixture weights with the EM algorithm so as to maximize the likelihood of the learning data is described. Let D denote the set of learning data and Dj the subset associated with the j-th state. Let x_i denote the i-th learning sample (a clean speech feature). Let θ_k denote the parameters of the k-th basis distribution, collected as θ = {θ_1, ..., θ_K}; specifically, θ_k is the mean and covariance of the k-th Gaussian. Let w_jk denote the mixture weight of the k-th basis distribution in state j, collected over k as w_j = {w_j1, ..., w_jK} and over j as w = {w_1, ..., w_L}. Here, K is the number of basis distributions and L is the number of states. The (log) likelihood L(θ, w) of the learning data is then defined as in Equation (2).
L(θ, w) = Σ_{j=1..L} Σ_{x_i ∈ Dj} log { Σ_{k=1..K} w_jk · N(x_i; θ_k) }    …(2)
θ and w are calculated with the EM algorithm so as to maximize this likelihood. In the E step, the posterior probability that each learning sample belongs to each basis distribution is calculated based on the current values of θ and w. In the M step, θ and w are updated so as to maximize the expected complete-data log likelihood under these posterior probabilities. Initial values of θ and w are required; for example, a Gaussian mixture model with K components can be trained on the entire learning data D, and the resulting set of Gaussians and mixture weights (denoted u) can be used, with the L mixture weight vectors w_j all set to the same value, i.e., w_1 = ... = w_L = u. After initializing θ and w, the E and M steps are iterated until the increase in likelihood converges, yielding the θ and w that maximize the likelihood. The maximum likelihood training method using the EM algorithm is described in detail in, for example, L. Rabiner and B.-H. Juang (translated by Sadaoki Furui), Fundamentals of Speech Recognition, NTT Advanced Technology, 1995. The set of first basis distributions and the mixture weights calculated as described above are stored in the first distribution storage unit 3 and the mixture weight storage unit 10, respectively.
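As an informal illustration of this training procedure, the following sketch runs EM for the shared basis Gaussians with state-specific mixture weights; diagonal covariances and the simple random initialization are assumptions made only to keep the sketch short.

```python
# Hypothetical sketch of the EM training of the shared basis Gaussians and per-state
# mixture weights of Equation (2). Diagonal covariances are assumed for brevity.
import numpy as np

def log_gauss(x, mean, var):
    # x: (N, d); mean, var: (d,). Log density of N(mean, diag(var)) at each row of x.
    return -0.5 * (np.sum(np.log(2 * np.pi * var))
                   + np.sum((x - mean) ** 2 / var, axis=1))

def train_basis_and_weights(data_by_state, K, n_iter=20):
    """data_by_state[j]: array (N_j, d) of clean features aligned to state j."""
    L = len(data_by_state)
    all_x = np.vstack(data_by_state)
    d = all_x.shape[1]
    rng = np.random.default_rng(0)
    # Initialization: K means drawn from the pooled data, global variance, uniform weights.
    means = all_x[rng.choice(len(all_x), size=K, replace=False)]
    varis = np.tile(all_x.var(axis=0), (K, 1))
    w = np.full((L, K), 1.0 / K)
    for _ in range(n_iter):
        stats_gamma = np.zeros(K)
        stats_x = np.zeros((K, d))
        stats_xx = np.zeros((K, d))
        for j, xj in enumerate(data_by_state):
            # E step: responsibilities of the shared basis Gaussians for the data of state j.
            log_p = np.stack([log_gauss(xj, means[k], varis[k]) for k in range(K)], axis=1)
            log_p += np.log(w[j])
            log_p -= log_p.max(axis=1, keepdims=True)
            gamma = np.exp(log_p)
            gamma /= gamma.sum(axis=1, keepdims=True)
            # M step, state-specific part: mixture weights of state j.
            w[j] = gamma.sum(axis=0) / len(xj)
            # Accumulate statistics for the shared Gaussians over all states.
            stats_gamma += gamma.sum(axis=0)
            stats_x += gamma.T @ xj
            stats_xx += gamma.T @ (xj ** 2)
        # M step, shared part: update the basis Gaussians from the pooled statistics.
        means = stats_x / stats_gamma[:, None]
        varis = np.maximum(stats_xx / stats_gamma[:, None] - means ** 2, 1e-6)
    return means, varis, w
```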
Next, the operation of the speech processing apparatus according to this embodiment is described. FIG. 2 is a flowchart showing the operation of the speech processing apparatus according to this embodiment.
First, the feature extraction unit 1 extracts a feature from each frame of the input noisy speech and calculates a sequence of noisy speech features (step S1).
Next, the noise estimation unit 2 estimates the noise from the sequence of noisy speech features calculated by the feature extraction unit 1 (step S2). Next, the distribution synthesis unit 6 synthesizes a second basis distribution from each of the first basis distributions using the noise estimated by the noise estimation unit 2, and stores them in the second distribution storage unit 5 (step S3).
Steps S4 and S5 are executed in parallel with steps S2 and S3. That is, the probability calculation unit 9 collates the sequence of noisy speech features calculated by the feature extraction unit 1 with the first acoustic model and calculates, for each frame, the probability of staying in each state of the first acoustic model (state posterior probability) (step S4). Next, the mixture weight fusion unit 12 fuses, for each frame, the mixture weights acquired from the mixture weight storage unit 10 with the state posterior probabilities calculated by the probability calculation unit 9, and calculates the fused mixture weights (step S5).
In step S6, the mixture distribution generation unit 13 mixes the second basis distributions stored in the second distribution storage unit 5 with the fused mixture weights to generate a mixture distribution. Next, the feature enhancement unit 14 calculates, for each frame, the clean speech feature from the noisy speech feature using the mixture distribution generated by the mixture distribution generation unit 13 (step S7).
Finally, the decoding unit 17 collates the sequence of clean speech features calculated by the feature enhancement unit 14 with the second acoustic model stored in the second acoustic model storage unit 15, outputs the optimum word string, and ends the speech recognition (speech processing) (step S8). In this way, the correct speech content is recognized from the noisy speech.
As described above, the speech processing apparatus according to this embodiment uses only a small number of basis distributions instead of the large number of distributions used in the prior art. The amount of computation required to synthesize the joint distributions of clean and noisy speech features is therefore greatly reduced, and high speech recognition performance can be maintained with a small amount of computation even in an environment where the noise changes.
The speech processing apparatus of this embodiment includes a control device such as a CPU, a storage device, an external storage device, a display device, and input devices such as a keyboard and a mouse, and has a hardware configuration using an ordinary computer.
The speech processing program executed by the speech processing apparatus of this embodiment is recorded on a computer-readable recording medium such as a CD-ROM, flexible disk (FD), CD-R, or DVD (Digital Versatile Disk) as a file in an installable or executable format, and is provided as a computer program product.
The speech processing program executed by the speech processing apparatus of this embodiment may also be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The speech processing program executed by the speech processing apparatus of this embodiment may also be provided or distributed via a network such as the Internet.
The speech processing program of this embodiment may also be provided by being incorporated in advance in a ROM or the like.
The speech processing program executed by the speech processing apparatus of this embodiment has a module configuration including the units described above (feature extraction unit, noise estimation unit, first distribution storage control unit, distribution synthesis unit, first acoustic model storage control unit, probability calculation unit, mixture weight storage control unit, mixture weight fusion unit, mixture distribution generation unit, feature enhancement unit, second acoustic model storage control unit, and decoding unit). As actual hardware, a CPU (processor) reads the speech processing program from the storage medium and executes it, whereby the above units are loaded onto the main storage device and the feature extraction unit, noise estimation unit, first distribution storage control unit, distribution synthesis unit, first acoustic model storage control unit, probability calculation unit, mixture weight storage control unit, mixture weight fusion unit, mixture distribution generation unit, feature enhancement unit, second acoustic model storage control unit, and decoding unit are generated on the main storage device.
The present invention is not limited to the above embodiment as it is; in the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment, and constituent elements of different embodiments may be combined as appropriate.
As described above, the voice processing apparatus, voice processing method, and voice processing program according to the present invention are useful when speech recognition is performed under noise.
Description of Symbols

 1 Feature extraction unit
 2 Noise estimation unit
 3 First distribution storage unit
 6 Distribution synthesis unit
 7 First acoustic model storage unit
 9 Probability calculation unit
 10 Mixture weight storage unit
 12 Mixture weight fusion unit
 13 Mixture distribution generation unit
 14 Feature enhancement unit

Claims (5)

  1.  A speech processing apparatus comprising:
      a feature extraction unit that extracts a first speech feature from each frame of first speech on which noise is superimposed in a noisy environment, and calculates a sequence of the first speech features;
      a noise estimation unit that estimates the noise superimposed on the first speech;
      a first distribution storage unit that stores a set of first basis distributions representing a distribution of second speech features of second speech in a noise-free environment;
      a distribution synthesis unit that, based on the noise, synthesizes from each of the first basis distributions a second basis distribution representing a combined distribution of the first speech feature and the second speech feature;
      a first acoustic model storage unit that stores a first acoustic model representing standard phoneme patterns in a noisy environment;
      a probability calculation unit that collates the sequence of the first speech features with the first acoustic model and calculates, for each frame, a state posterior probability that is the probability of staying in each state of the first acoustic model;
      a mixture weight storage unit that stores, for each state of the first acoustic model, a mixture weight corresponding to each of the second basis distributions;
      a mixture weight fusion unit that fuses the mixture weights using the state posterior probabilities to calculate fused mixture weights;
      a mixture distribution generation unit that, for each frame, mixes the second basis distributions with the fused mixture weights to generate a mixture distribution that is a combined distribution of the first speech feature and the second speech feature in the frame; and
      a feature enhancement unit that, for each frame, estimates the second speech feature from the first speech feature using the mixture distribution.
  2.  The speech processing apparatus according to claim 1, further comprising:
      a second acoustic model storage unit that stores a second acoustic model representing standard phoneme patterns in a noise-free environment; and
      a decoding unit that collates the second speech feature with the second acoustic model and outputs an optimal word sequence.
  3.  The speech processing apparatus according to claim 1, wherein the set of first basis distributions and the mixture weights are calculated by an EM algorithm so as to maximize the likelihood of training data, the training data consisting of a set of the second speech features each associated with one of the states of the first acoustic model.
  4.  A speech processing method executed by a speech processing apparatus, the method comprising:
      a feature extraction step in which a feature extraction unit extracts a first speech feature from each frame of first speech on which noise is superimposed in a noisy environment, and calculates a sequence of the first speech features;
      a noise estimation step in which a noise estimation unit estimates the noise superimposed on the first speech;
      a distribution synthesis step in which, based on the noise, a distribution synthesis unit synthesizes a second basis distribution representing a combined distribution of the first speech feature and the second speech feature from each of the first basis distributions in a first distribution storage unit that stores a set of first basis distributions representing a distribution of second speech features of second speech in a noise-free environment;
      a probability calculation step in which a probability calculation unit collates the sequence of the first speech features with the first acoustic model in a first acoustic model storage unit that stores a first acoustic model representing standard phoneme patterns in a noisy environment, and calculates, for each frame, a state posterior probability that is the probability of staying in each state of the first acoustic model;
      a mixture weight fusion step in which a mixture weight fusion unit fuses, using the state posterior probabilities, the mixture weights in a mixture weight storage unit that stores, for each state of the first acoustic model, a mixture weight corresponding to each of the second basis distributions, to calculate fused mixture weights;
      a mixture distribution generation step in which a mixture distribution generation unit, for each frame, mixes the second basis distributions with the fused mixture weights to generate a mixture distribution that is a combined distribution of the first speech feature and the second speech feature in the frame; and
      a feature enhancement step in which a feature enhancement unit, for each frame, estimates the second speech feature from the first speech feature using the mixture distribution.
  5.  A speech processing program for causing a computer to execute:
      a feature extraction step of extracting a first speech feature from each frame of first speech on which noise is superimposed in a noisy environment, and calculating a sequence of the first speech features;
      a noise estimation step of estimating the noise superimposed on the first speech;
      a distribution synthesis step of synthesizing, based on the noise, a second basis distribution representing a combined distribution of the first speech feature and the second speech feature from each of the first basis distributions in a first distribution storage unit that stores a set of first basis distributions representing a distribution of second speech features of second speech in a noise-free environment;
      a probability calculation step of collating the sequence of the first speech features with the first acoustic model in a first acoustic model storage unit that stores a first acoustic model representing standard phoneme patterns in a noisy environment, and calculating, for each frame, a state posterior probability that is the probability of staying in each state of the first acoustic model;
      a mixture weight fusion step of fusing, using the state posterior probabilities, the mixture weights in a mixture weight storage unit that stores, for each state of the first acoustic model, a mixture weight corresponding to each of the second basis distributions, to calculate fused mixture weights;
      a mixture distribution generation step of, for each frame, mixing the second basis distributions with the fused mixture weights to generate a mixture distribution that is a combined distribution of the first speech feature and the second speech feature in the frame; and
      a feature enhancement step of, for each frame, estimating the second speech feature from the first speech feature using the mixture distribution.
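By way of illustration only, the per-frame processing recited in claims 1, 4, and 5 (distribution synthesis, mixture weight fusion, mixture distribution generation, and feature enhancement) might be sketched in Python as follows. The sketch assumes Gaussian basis distributions over log-Mel features, a zeroth-order log-add approximation for combining a clean-speech basis distribution with the noise estimate, and an MMSE-style conditional-mean estimate in the feature enhancement step; none of these specific choices is recited in the claims, and all function and field names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def synthesize_joint_basis(mu_x, cov_x, mu_n, cov_n):
    """Distribution synthesis (sketch): combine one clean-speech basis Gaussian
    with the noise estimate into a joint (clean, noisy) Gaussian, using a
    zeroth-order log-add approximation in the log-Mel domain (an assumption)."""
    g = np.exp(mu_x) / (np.exp(mu_x) + np.exp(mu_n))   # sensitivity of the noisy feature to the clean feature
    G, I = np.diag(g), np.eye(len(g))
    mu_y = np.log(np.exp(mu_x) + np.exp(mu_n))         # approximate noisy-speech mean
    cov_yy = G @ cov_x @ G.T + (I - G) @ cov_n @ (I - G).T
    cov_xy = cov_x @ G.T                                # clean/noisy cross-covariance
    return {"mu_x": mu_x, "mu_y": mu_y, "cov_yy": cov_yy, "cov_xy": cov_xy}

def enhance_frame(y, gamma, state_weights, joint_bases):
    """Weight fusion, mixture generation, and feature enhancement for one frame.

    y             : noisy-speech feature vector of the frame
    gamma         : state posteriors from the noisy-environment acoustic model, shape (S,)
    state_weights : per-state mixture weights, shape (S, K)
    joint_bases   : K joint basis distributions as returned by synthesize_joint_basis
    """
    fused = gamma @ state_weights                       # fused mixture weights, shape (K,)
    lik = np.array([multivariate_normal.pdf(y, b["mu_y"], b["cov_yy"]) for b in joint_bases])
    resp = fused * lik
    resp /= resp.sum()                                  # responsibility of each basis for this frame
    x_hat = np.zeros_like(joint_bases[0]["mu_x"])
    for r, b in zip(resp, joint_bases):
        # Conditional mean of the clean feature given the noisy feature under basis b.
        cond = b["mu_x"] + b["cov_xy"] @ np.linalg.solve(b["cov_yy"], y - b["mu_y"])
        x_hat += r * cond
    return x_hat
```

Per basis distribution, the enhancement step reduces to the standard conditional mean of the clean feature given the noisy feature under a joint Gaussian; the fused mixture weights simply tilt the responsibilities toward the bases favored by whichever states of the first acoustic model are currently likely.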
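For claim 2, the decoding unit collates the enhanced features with the second (noise-free environment) acoustic model and outputs an optimal word sequence. As a toy stand-in only, since the claim does not specify the search, a bare Viterbi pass over HMM states could look like the following; a practical decoder would additionally use a lexicon and language model.

```python
import numpy as np

def viterbi_decode(log_obs, log_trans, log_init):
    """Toy Viterbi search over HMM states (illustrative stand-in for the decoding unit).

    log_obs   : per-frame log observation likelihoods, shape (T, S)
    log_trans : log transition matrix, shape (S, S), rows = from-state
    log_init  : log initial state probabilities, shape (S,)
    """
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans    # score of reaching each state from each predecessor
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):               # trace the best predecessors backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```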
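Claim 3 states that the first basis distributions and the per-state mixture weights are estimated jointly with the EM algorithm from clean-speech features labelled with states of the first acoustic model. The following compressed sketch of such a tied-mixture EM update is an illustration under simplified assumptions (random initialisation, fixed iteration count, no log-domain safeguards), not the procedure disclosed in the specification.

```python
import numpy as np
from scipy.stats import multivariate_normal

def train_tied_mixture(X, states, num_states, num_bases, num_iters=20, seed=0):
    """EM sketch: shared Gaussian basis distributions with per-state mixture
    weights, fitted to clean features X labelled with acoustic-model states."""
    rng = np.random.default_rng(seed)
    X, states = np.asarray(X), np.asarray(states)
    N, D = X.shape
    mu = X[rng.choice(N, num_bases, replace=False)]           # initial means picked from the data
    cov = np.array([np.cov(X.T) + 1e-3 * np.eye(D)] * num_bases)
    w = np.full((num_states, num_bases), 1.0 / num_bases)     # per-state mixture weights

    for _ in range(num_iters):
        # E-step: responsibility of each shared basis for each labelled sample.
        lik = np.stack([multivariate_normal.pdf(X, mu[k], cov[k]) for k in range(num_bases)], axis=1)
        resp = w[states] * lik
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update the shared bases and the state-dependent weights.
        nk = resp.sum(axis=0)
        mu = (resp.T @ X) / nk[:, None]
        for k in range(num_bases):
            diff = X - mu[k]
            cov[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(D)
        for s in range(num_states):
            mask = states == s
            if mask.any():
                w[s] = resp[mask].sum(axis=0) / mask.sum()
    return mu, cov, w
```

The design point worth noting is that the Gaussian bases are shared across all states while only the mixture weights are state-dependent, which is what allows a single basis set to be re-weighted frame by frame using the state posteriors.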
PCT/JP2009/069580 2009-03-26 2009-11-18 Voice processing apapratus, voice processing method, and voice processing program WO2010109725A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009077325A JP2010230913A (en) 2009-03-26 2009-03-26 Voice processing apparatus, voice processing method, and voice processing program
JP2009-077325 2009-03-26

Publications (1)

Publication Number Publication Date
WO2010109725A1 (en)

Family

ID=42780427

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/069580 WO2010109725A1 (en) 2009-03-26 2009-11-18 Voice processing apapratus, voice processing method, and voice processing program

Country Status (2)

Country Link
JP (1) JP2010230913A (en)
WO (1) WO2010109725A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788600B (en) * 2014-12-26 2019-07-26 联想(北京)有限公司 Method for recognizing sound-groove and electronic equipment


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004004509A (en) * 2001-12-20 2004-01-08 Matsushita Electric Ind Co Ltd Method, device and computer program for creating acoustic model
JP2007279349A (en) * 2006-04-06 2007-10-25 Toshiba Corp Feature amount compensation apparatus, method, and program
JP2007279444A (en) * 2006-04-07 2007-10-25 Toshiba Corp Feature amount compensation apparatus, method and program

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106384587A (en) * 2015-07-24 2017-02-08 科大讯飞股份有限公司 Voice recognition method and system thereof
CN106384587B (en) * 2015-07-24 2019-11-15 科大讯飞股份有限公司 A kind of audio recognition method and system
CN108511002A (en) * 2018-01-23 2018-09-07 努比亚技术有限公司 The recognition methods of hazard event voice signal, terminal and computer readable storage medium
CN108511002B (en) * 2018-01-23 2020-12-01 太仓鸿羽智能科技有限公司 Method for recognizing sound signal of dangerous event, terminal and computer readable storage medium

Also Published As

Publication number Publication date
JP2010230913A (en) 2010-10-14


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 09842335
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 Ep: pct application non-entry in european phase
    Ref document number: 09842335
    Country of ref document: EP
    Kind code of ref document: A1