WO2019171415A1

WO2019171415A1 - Speech feature compensation apparatus, method, and program

Info

Publication number: WO2019171415A1
Application number: PCT/JP2018/008251
Authority: WO
Inventors: Qiongqiong Wang; Koji Okabe; Takafumi Koshinaka
Original assignee: Nec Corporation
Priority date: 2018-03-05
Filing date: 2018-03-05
Publication date: 2019-09-12
Also published as: JP7243760B2; JP2021510846A; JP2021140188A; JP6897879B2

Abstract

The speech feature compensation apparatus 100 includes training means11 for training a generator 21 and a discriminator 22 of GAN using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN, feature extraction means 12 for extracting a feature vector from an input short speech, and generation means 13 for generating a robust feature vector based on the extracted feature vector using the trained parameters.

Description

SPEECH FEATURE COMPENSATION APPARATUS, METHOD, AND PROGRAM

The present invention relates to a feature compensation apparatus, a feature compensation method, and a program for compensating feature vectors in speech and audio to robust ones.
　

Speaker recognition refers to recognizing persons from their voice. No two individuals sound identical because their vocal tract shapes, larynx sizes, and other parts of their voice production organs are different. Considering the identities of human’s voice, speaker recognition has been increasingly applied to forensic, telephone-based services such as telephone banking, and so on.

Speaker recognition systems can be divided into text-dependent and text-independent ones. In text-dependent systems, recognition phrases are fixed, or known beforehand. In text-independent systems, there are no constraints on the words which the speakers are allowed to use. Text-independent recognition has wider range of applications but also is much more challenging of the two tasks and has consistently been improving in the past decades.

Since the reference (what are spoken in training) and the test (what are uttered in actual use) utterances in text-independent speaker recognition applications may have completely different contents, the recognition system must take this phonetic mismatch into account. The performance crucially depends on the length of speech. In the case where users speak a long utterance, usually equal to or longer than 1 minute, it is considered that most phonemes are covered. As a result, it produces good recognition accuracy despite of different speech contents, while in the case of short speech, speaker recognition performance degrades on short utterances because speaker feature vectors of such utterances extracted with statistical method are unreliable to perform accurate recognition.

In practical speaker verification applications, often only short speech segments are observed during testing. In general, short speech segments of less than 10 seconds are used. Then it is important to improve text-independent speaker recognition with short utterances by speaker feature vector restoration.

PTL1 discloses a technology that employs a Denoising Autoencoder (DAE) to compensate speaker feature vectors of a short utterance which contains limited phonetic information.

As shown in Fig. 23, in a feature compensation apparatus based on DAE described in PTL1, it first estimates acoustic diversity degree in the input utterance as posteriors based on speech models. Then both the acoustic diversity degree and recognition feature vector are presented to an input layer 401. "Feature vector" in this description refers to a set of numerics (specific data) that represents a target object. The DAE-based transformation including an input layer 401, one or multiple hidden layers 402, and an output layer 403 is able to produce a restored recognition feature vector in the output layer with help of supervised training using pairs of long and short speech segments.

NPL 1 disclosed MFCC (Mel-Frequency Cepstrum Coefficients) as acoustic features.
　

United States Patent Application 2016/0098993

Najim Dehak, Patrick J. Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-End Factor Analysis for Speaker Verification", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 4, MAY 2011

However, PTL1 uses only mean square error minimization in DAE optimization. Such objective function is too simple to perform accurately. In addition, the simple objective function restricts the short speech to be part of the long speech for better result. In real world, to train such a network, only long speech can be used (short speech is cut from it). It wastes the information of existing short speech of the speakers. This system needs sufficient speakers with multiple sets of long speech for training, which may not be realistic for all applications.

In view of the above mentioned situation, the objective of the present invention is to provide robust feature compensation for short speech.
　

An exemplary aspect of the speech feature compensation apparatus includes training means for training a generator and a discriminator of GAN (Generative Adversarial Network ) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN, feature extraction means for extracting a feature vector from an input short speech, and generation means for generating a robust feature vector based on the extracted feature vector using the trained parameters.

An exemplary aspect of the speech processing method includes training a generator and a discriminator of GAN using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN, extracting a feature vector from an input short speech, and generating a robust feature vector based on the extracted feature vector using the trained parameters.

An exemplary aspect of the speech processing program causes a computer to execute training a generator and a discriminator of GAN using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN, extracting a feature vector from an input short speech, and generating a robust feature vector based on the extracted feature vector using the trained parameters.
　

According to the present invention, speech compensation apparatus, speech feature compensation method, and program of the present invention can provide robust feature compensation for short speech.
　

Fig. 1 is a block diagram of a robust feature compensation apparatus of the first example embodiment in accordance with the present invention. Fig. 2 shows an example of contents of the short speech data storage. Fig. 3 shows an example of contents of the long speech data storage. Fig. 4 shows an example of contents of the generator parameter storage. Fig. 5 shows a concept of NN architecture in the first example embodiment. Fig. 6 is a flowchart illustrating operation of the robust feature compensation apparatus of the first example embodiment. Fig. 7 is a flowchart illustrating operation of the training phase of the robust feature compensation apparatus of the first example embodiment. Fig. 8 is a flowchart illustrating operation of the robust feature compensation phase of the robust feature compensation apparatus of the first example embodiment. Fig. 9 is a block diagram of a robust feature compensation apparatus of the second example embodiment in accordance with the present invention. Fig. 10 shows a concept of NN architecture in the second example embodiment. Fig. 11 is a flowchart illustrating operation of the robust feature compensation apparatus of the second example embodiment. Fig. 12 is a flowchart illustrating operation of the training phase of the robust feature compensation apparatus of the second example embodiment. Fig. 13 is a flowchart illustrating operation of the robust feature compensation phase of the robust feature compensation apparatus of the second example embodiment. Fig. 14 is a block diagram of a robust feature compensation apparatus of the third example embodiment in accordance with the present invention. Fig. 15 shows a concept of NN architecture in the third example embodiment. Fig. 16 is a flowchart illustrating operation of the robust feature compensation apparatus of the third example embodiment. Fig. 17 is a flowchart illustrating operation of the training phase of the robust feature compensation apparatus of the third example embodiment. Fig. 18 is a flowchart illustrating operation of the robust feature compensation phase of the robust feature compensation apparatus of the third example embodiment. Fig. 19 is an exemplary computer configuration used in example embodiments in accordance with the present invention. Fig. 20 shows an exemplary computer configuration used in embodiments in accordance with the present invention. Fig. 21 shows a block diagram showing main parts of a speech feature compensation apparatus. Fig. 22 shows a block diagram showing another aspect of a speech feature compensation apparatus. Fig. 23 is a block diagram of a feature compensation apparatus of PTL 1.

Each example embodiment of the present invention will be described below with reference to the figures. The following detailed descriptions are merely exemplary in nature and are not intended to limit the invention or the application and uses of the invention. Furthermore, there is no intention to be bound by any theory presented in the preceding background of the invention or the following detailed description.

Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures illustrating integrated circuit architecture may be exaggerated relative to other elements to help to improve understanding of the present and alternate example embodiments.

In speaker recognition applications in real world, often text-independent speaker recognition are used and short speech segments (less than 10 seconds) are observed. In such cases, phonetic mismatch must be taken into account, since the unbalanced phonetic distribution resulted in unreliable speaker feature vector extracted from the short speech. With the length of segments getting shorting, performance depredated accordingly. Therefore, there is a need for improve text-independent speaker recognition with short utterances by speaker feature restoration method.

In the view of the above, following example embodiments utilize Generative Adversarial Network (GAN) including a generator and a discriminator that improves each other during the iterative training process, so that the generator will generate a robust feature vector for the short speech with compensation.

First Example embodiment.
A robust feature compensation apparatus of the first example embodiment can provide robust feature vector to short speech segment, from raw feature vector of the short speech, using a generator. Thus, in this example embodiment, the generator of a GAN trained with short and long speech is capable of generating a robust feature vector for short speech. Note the term of the long speech is longer than the term of the short speech.

<Configuration of robust feature compensation apparatus >
In the first example embodiment of the present invention, a robust feature compensation apparatus for feature restoration using a generator of GAN will be described.

Fig. 1 illustrates a block diagram of a robust feature compensation apparatus 100 of the first example embodiment. The robust feature compensation apparatus 100 includes a training part 100A and a feature restoration part 100B.

The training part 100A includes a short speech data storage 101, a long speech data storage 102,

feature extraction units

103a and 103b, a noise storage 104, a generator & discriminator training unit 105, and a generator parameter storage 106. The feature restoration part 100B includes a feature extraction unit 103c, a generation unit 107 and a generated feature storage 108. The

feature extraction units

103a, 103b, and 103c have the same function.

The short speech data storage 101 stores short speech recordings with speaker labels as shown in Fig. 2.

The long speech data storage 102 stores long speech recordings with speaker labels as shown in Fig. 3. The long speech data storage 102 contains at least one long speech recording of each speaker who has short speech recordings in the short speech data storage 101.

The noise storage 104 stores random vector representing noise.

The generator parameter storage 106 stores generator parameters as shown in Fig. 4. The generator includes an encoder and a decoder, as understood from Fig. 4. So parameters of both encoder and decoder are stored in the generator parameter storage 106.

The feature extraction unit 103a extracts feature vectors from the short speech data in the short speech data storage 101. The feature extraction unit 103b extracts feature vectors from long speech in the long speech data storage 102. The feature vectors are individually-measurable properties of observations, for example, i-vector - a fixed dimensional feature vector extracted from acoustic features such as MFCC described in NPL1.

The generator & discriminator training unit 105 receives the feature vectors of a short speech segment from the feature extraction unit 103a, the feature vector of a long speech segment from the feature extraction unit 103b and the noise from noise storage 104. The generator & discriminator training unit 105 trains a generator and a discriminator (not shown in Fig. 1) iteratively to determine "real" (the feature vector is extracted from a long speech) or "fake" (the feature vector is generated based on a feature vector from a short speech), and the speaker label which the feature vectors belong to. Each of the generator and the discriminator includes an input layer, one or multiple hidden layers, and an output layer.

In the training, in "real" case, the received feature vectors of the long speech are given to the input layer of the discriminator; in "fake" case, the received feature vectors of the short speech are given to the input layer of the generator. The output layer of the generator is the input layer of the discriminator. Further, "real/fake" and speaker labels are given to the output layer of the discriminator. Details of those layers will be described later. After the training, the generator & discriminator training unit 105 stores the generator parameters in the generator parameter storage 106.

In the feature restoration part 100B, the feature extraction unit 103c extracts feature vectors from a short speech recording. Together with the feature vector, the generation unit 107 receives noise stored in the noise storage 104 and generator parameters stored in the generator parameter storage 106. The generation unit 107 generates a robust restored feature.

Fig. 5 shows a concept of the architecture of the generator and the discriminator. The generator has two neural networks (NNs) - encoder NN and decoder NN, and the discriminator has one NN. Each NN includes three types of layers: input, hidden and output. The hidden layer can be plural. There are a linear transformation and/ or an activation (transfer) function at least between the input layer and the hidden layer(s), and between the hidden layer(s) and the output layer. The input layer of the encoder NN is a feature vector of a short speech recording. The output layer of the encoder NN is a speaker factor (a feature vector). The input layer of the decoder is addition or concatenation of the noise and the speaker factor - the output layer of the encoder NN. The output layer of the decoder is a restored feature vector. For the discriminator, the input layer is the feature vector of long speech or the restored feature vector - output of the decoder NN. The output of the discriminator is "real/fake" and a speaker label.

In the training part 100A, the input layer of encoder NN (feature vectors of short speech recordings), part of the input layer of decoder NN (noise), one type input layer of two for the discriminator (feature vectors of long speech recordings) and the output layer of the discriminator (outputting "real/fake" and speaker label) are given, and as a result, the hidden layer(s) of three NN (encoder, decoder, discriminator) parameters, the output layer of the encoder NN (speaker factor), the output layer of the decoder NN (restored feature vector is determined. For example, the number of layers in encoder, decoder and discriminative can be 15, 15 and 16.

In the evaluation part of the training part 100A, encoder parameters, decoder parameters, input layer of encoder NN (feature vector of short speech), part of the input layer of decoder NN (noise) are provided, and as a result, output layer of the decoder NN (restored feature vector) is determined.

In the discriminator, output layer consists of (2+n) neurons, where n is the number of speakers in the training data, and 2 is "real/fake"/. In the training part 100A, the neurons can take a value "1" or "0" corresponding to "real/fake" and "true speaker label/wrong speaker labels".

In the training part 100A, the generator (encoder and decoder) and the discriminator iteratively train each other. In each iteration, generator parameters are updated once while discriminator parameters are fixed, then the discriminator parameters are updated once while the generator parameters are fixed. For this purpose, a wide range of optimization techniques can be applied, for example, the gradient decent method, known as back propagation to minimize pre-defined cost functions, such as cross entropy, mean square error and so on.

For example, the objective function can be expressed as next:
For the generator:

For the discriminator:

Where values (a) are objectives for the generator and values (b) are objectives for the discriminator; A is the feature vector of the given short speech; B is the feature vector of the given long speech; element (c) is the noise modeling other variations besides speaker; G(A,z) is the generated feature vector form the generator; Element (d) is for speaker classification results, i.e. speaker posteriors, with N^d as the total number of speakers in the training set; element (e) is for "real/fake" feature vector classification; element (f) is the i^th element in D^d. operators (g) and (h) are the expectation and mean square error operators, respectively. Constants (i) are predetermined constants. y^d is the true speaker ID (ground truth).

It can also be expressed as next.
For the generator:

For the discriminator

<Operation of robust feature compensation apparatus>
Next, the operation of robust feature compensation apparatus 100 will be described with reference to drawings.

The whole operation of robust feature compensation apparatus 100 will be described by referring to Fig. 6. Fig. 6 contains operations of the training part 100A and the feature restoration part 100B. However, this shows an example, the operations of the training and feature restoration can be executed continuously or time intervals can be inserted.

In step A01 (training part), the generator & discriminator training unit 105 trains the generator and the discriminator together iteratively, based on short speech and long speech from the same speakers stored the in short speech data storage 101 and long speech data storage 102, respectively. In detail, in each iteration, firstly discriminator parameters are fixed, and generator parameters are updated using objective functions; then the generator parameters are fixed and the discriminator parameters are updated using objective functions. Note that the order of updating generator parameters and discriminator parameters in an iterative can be changed. For the training, a wide range of optimization techniques can be applied, for example, the gradient decent method, known as back propagation to minimize pre-defined cost function, such as cross entropy, mean square error and so on. The objective function used in updating the generator is to update the generator to be able to generate a restored feature vector that the discriminator cannot discriminate, while the objective function in updating the discriminator is to update the discriminator to be able to discriminate the generated feature vector.

In step A02 (feature restoration part), the generation unit 107 generates a restored feature vector from a given short speech utterance in the output layer using generator parameters stored in the generator parameter storage 106.

Fig. 7 is a flowchart illustrating that the generator and the discriminator are together trained using short speech feature vectors and long speech feature vectors with noise. Fig. 7 shows the training part in Fig. 6.

First, in step B01, as the beginning of the training part, the feature extraction unit 103a reads short speech data with speaker labels from the short speech data storage 101.

In step B02, the feature extraction unit 103a further extracts feature vectors from the short speech.

In step B03, the feature extraction unit 103b reads long speech data with speaker labels from the long speech data storage 102.

In step B04, the feature extraction unit 103b further extracts feature vectors from the long speech.

In step B05, the generator & discriminator training unit 105 reads the noise data stored in the noise storage 104.

In step B06, the generator & discriminator training unit 105 trains a generator and a discriminator together using short speech feature vectors sent from the feature extraction unit 103a and long speech feature vectors sent from the feature extraction unit 103b with speaker labels and noise.

In step B07, as the result of the training, the generator & discriminator training unit 105 generates generator parameters and discriminator parameters, and stores the generator parameters in the generator parameter storage 106.

Note that the order of B01-B02 and B03-B04 can be switched, not limited to the form presented in Fig. 7.

Fig. 8 is a flowchart illustrating a feature restoration part 100B.

Firstly, in step C01, the feature extraction unit 103c reads short speech data represented through an external device (not shown in Fig. 1).
In step C02, the feature extraction unit 103c extracts feature vectors from the given short speech data.

In step C03, the generation unit 107 reads noise data stored in the noise storage 104.

In step C04, the generation unit 107 reads generator parameters from the generator parameter storage 106.

In step C06, the generation unit 107 restores the feature vector of the short speech and generates a robust feature vector.

Note here the order of C03 and C04 can be switched.

Effect of First Example Embodiment.
As explained above, the first example embodiment can improve the robustness of the feature vector of short speech. The reason is that the joint training of the generator and the discriminator improve the performance of each other, and it is learned the relationship between long speech feature vectors and short speech feature vectors in the training. As a result, such NN can generate a feature vector for a short speech as robust as that from long speech.

Second Example embodiment.
A robust feature compensation apparatus of the second example embodiment can provide robust feature to short speech segment, from raw feature of the short speech, using an encoder. Thus, in this example embodiment, the encoder - part of a generator of a GAN trained with short and long speech is capable of produce a robust speaker feature vector robust for short speech.

<Configuration of robust feature compensation apparatus >
In the second example embodiment of the present invention, a robust feature compensation apparatus for speaker feature extraction using a generator and a discriminator of GAN will be described.

Fig. 9 illustrates a block diagram of a robust feature compensation apparatus 200 of the second example embodiment. The robust feature compensation apparatus 200 includes a training part 200A and a speaker feature extraction part 200B.

The training part200A includes a short speech data storage 201, a long speech data storage 202,

feature extraction units

203a and 203b, a noise storage 204, a generator & discriminator training unit 205, and an encoder parameter storage 206. The speaker feature extraction part 200B includes a feature extraction unit 203c, an encoding unit 207 as generation means, and a generated feature storage 208. The

feature extraction units

203a, 203b and 203c have the same function.

The short speech data storage 201 stores short speech recordings with speaker labels, as shown in Fig 2.

The long speech data storage 202 stores long speech recordings with speaker labels, as shown in Fig 3. The long speech data storage 202 contains at least one long speech recording of each speaker who has short speech recordings in the short speech data storage 201.

The noise storage 204 stores random vector representing noise.

The encoder parameter storage 206 stores encoder parameters, each is a part of the result of the generator & discriminator training unit 205. The generator (not shown in Fig. 9) consists of an encoder and a decoder, same as that in the first example embodiment as understood from the figure Fig. 4.

The feature extraction unit 203a extracts features from the short speech in the short speech data storage 201. The feature extraction unit 203b extracts features from the long speech in the long speech data storage 202. The features are individually-measurable properties of observations, for example, i-vector - a fixed dimensional feature vector extracted from acoustic features such as MFCC.

The generator & discriminator training unit 205 receives feature vectors of short speech from the feature extraction unit 203a, the feature vector of long speech from the feature extraction unit 203b and noise from the noise storage 204. The generator & discriminator training unit 205 trains the generator and the discriminator (not shown in Fig. 9) iteratively to determine real (the feature vector is extracted from a long speech) or fake (the feature vector is generated based on a feature vector from a short speech), and the speaker label which the feature vectors belong to. The detail of training is shown in the first example embodiment. After the training, the generator & discriminator training unit 205 outputs generator parameters and discriminator parameter, and stored them in the encoder parameter storage 206.

In the speaker feature extraction part 200B, the feature extraction unit 203c extracts feature vectors from a short speech. Together with the feature vector, the encoding unit 207 receives the noise stored in the noise storage 204 and encoder parameters in encoder parameter storage 206. The encoding unit 207 encodes a robust speaker feature.

Fig. 10 shows a concept of the architecture of the generator and the discriminator of the second example embodiment. The generator has two NNs - encoder NN and decoder NN, and the discriminator has one NN. Each NN includes three types of layers: input, hidden and output. The hidden layer can be plural. There are a linear transformation and/ or an activation (transfer) function at least between the input layer and the hidden layer(s), and between the hidden layer(s) and the output layer. The input layer of the encoder NN is feature vector of the short speech. The output layer of the encoder NN is speaker factor. The input layer of the decoder is addition or concatenation of the noise and the speaker factor - the output layer of the encoder NN. The output layer of the decoder is the restored feature vector. For the discriminator, the input layer is the feature vector of long speech or the restored feature vector - output of the decoder NN. The output of the discriminator is "real/fake" and speaker label.

The training part 200A of the second example embodiment is same as that in the first example embodiment as mentioned.

In the evaluation part, the encoder parameters and input layer of the encoder NN (feature vector of short speech) are provided, and as a result, the output layer of the encoder NN (speaker factor) is obtained.

<Operation of robust feature compensation apparatus>
Next, the operation of the robust feature compensation apparatus 200 will be described with reference to drawings.

The whole operation of the robust feature compensation apparatus 200 will be described by referring to Fig. 11. Fig. 11 contains operations of the training part 200A and the speaker feature extraction part 200B. However, this shows an example, the operations of the training and speaker feature extraction can be executed continuously or time intervals can be inserted.

In step D01 (training part), the generator & discriminator training unit 205 training the generator and the discriminator together iteratively, based on short speech and long speech from the same speakers stored in short speech data storage 201 and long speech data storage 202, respectively. In detail, in each iteration, first parameters of the discriminator are fixed and the generator parameters are updated using objective functions; then the generator parameters are fixed and the discriminator parameters are updated using objective functions. Note that the order of updating generator parameters and discriminator parameters in an iterative can be changed. For the training, a wide range of optimization techniques can be applied, for example, the gradient decent method, known as back propagation to minimize pre-defined cost function, such as cross entropy, mean square error and so on. The objective function used in updating the generator is to update the generator to be able to generate a restored feature vector that the discriminator cannot discriminate, while the objective function in updating the discriminator is to update the discriminator to be able to discriminate the generated feature vector.

In step D02 (speaker feature extraction part), the encoding unit 207 encodes a robust speaker feature vector from a given short speech utterance, in the output layer of the encoder using encoder parameter stored in the encoder parameter storage 206.

Fig. 12 is a flowchart illustrating that the generator and the discriminator are together trained using the short speech feature vectors and long speech feature vectors with noise. Fig. 12 shows the training part in Fig. 11.

First, in step E01, as the beginning of the training part, the feature extraction unit 203a reads short speech data with speaker labels from the short speech data storage 201.

In step E02, the feature extraction unit 203a further extracts feature vectors from the short speech.

In step E03, the feature extraction unit 203b reads long speech data with speaker labels from long speech data storage 202.

In step E04, the feature extraction unit 203b further extracts feature vectors from the long speech.

In step E05, the generator & discriminator training unit 205 reads noise data stored in the noise storage 204.

In step E06, the generator & discriminator training unit 205 trains the generator and the discriminator together using short speech feature vectors sent from the feature extraction unit 203a and long speech feature vectors sent from the feature extraction unit 203b with speaker labels and noise.

In step E07, as the result of the training, the generator & discriminator training unit 205 trains the generator and the discriminator, and stores the parameters of encoder - part of the generator in the encoder parameter storage 206.

Note that the order of E01-E02 and E03-E04 can be switched, not limited to the form presented in Fig. 12.

Fig. 13 is a flowchart illustrating the speaker feature extraction part 200B.

Firstly, in step F01, the feature extraction unit 203c reads short speech data represented through an external device (not shown in Fig. 9).

In step F02, the feature extraction unit 203c extracts feature vectors from the given short speech data.

In step F03, the encoding unit 207 reads noise data stored in the noise storage 204.

In step F04, the encoding unit 207 reads encoder parameters from the encoder parameter storage 206.

In step F06, the encoding unit 207 encodes the feature vector of the short speech and extracts a robust speaker feature vector.

Note here the order of F03 and F04 can be switched.

Effect of Second Example Embodiment.

As explained above, the second example embodiment can improve the robustness of the feature vector of short speech. The robust feature restoration is done in the first example embodiment. With the same training structure, it can at the same time produces a robust speaker feature vector in the output layer of the encoder. Using the speaker feature vectors is more direct for speaker verification application.

<Third Example Embodiment>
A robust feature compensation apparatus of the third example embodiment can provide robust feature to short speech segment, from raw feature of the short speech, using a generator and a discriminator, bottleneck feature vector produced in the last layer in the discriminator. Thus, in this example embodiment, a generator and a discriminator of a GAN trained with short and long speech is capable of produce a bottleneck feature robust for short speech.

<Configuration of robust feature compensation apparatus>
In the third example embodiment of the present invention, a robust feature compensation apparatus for bottleneck feature extraction using an encoder of a generator of GAN will be described.

Fig. 14 illustrates a block diagram of a robust feature compensation apparatus 300 of the third example embodiment. The robust feature compensation apparatus 300 includes a training part 300A and a bottleneck feature extraction part 300B.

The training part 300A includes a short speech data storage 301, a long speech data storage 302,

feature extraction unit

303a, 303b and 303c, a noise storage 304, a generator & discriminator training unit 305, a generator parameter storage 306, and a discriminator parameter storage 307. The bottleneck feature extraction part 300B includes a feature extraction unit 303c, a generation unit 308 and a bottleneck feature storage 309. The

feature extraction units

303a, 303b and 303c have the same function.

The short speech data storage 301 stores short speech recordings with speaker labels, as shown in Fig 2.

The long speech data storage 302 stores long speech recordings with speaker labels, as shown in Fig 3. The long speech data storage 302 contains at least one long speech recording of each speaker who has short speech recordings in the short speech data storage 301.

The noise storage 304 stores random vector representing noise.

The generator parameter storage 306 stores generator parameters. The generator (not shown in Fig. 14) consists of an encoder and a decoder, same as that in the first example embodiment as understood from Fig. 4. So parameters of both encoder and decoder are stored in the generator parameter storage 306.

The discriminator parameter storage 307 stores the parameter of the discriminator (not shown in Fig. 14).

The feature extraction unit 303a extracts features from the short speech in the short speech data storage 301. The feature extraction unit 303b extracts features from long speech in the long speech data storage 302. The features are individually-measurable properties of observations, for example, i-vector - a fixed dimensional feature vector extracted from acoustic features such as MFCC.

The generator & discriminator training unit 305 receives feature vectors of short speech from the feature extraction unit 303a, the feature vector of long speech from the feature extraction unit 303b and the noise from noise storage 304. The generator & discriminator training unit 305 trains the generator and the discriminator iteratively to determine real (the feature vector is extracted from a long speech) or fake (the feature vector is generated based on a feature vector from a short speech), and the speaker label which the feature vectors belong to. The detail of training is shown in the first example embodiment. After the training, The generator & discriminator training unit 305 outputs generator parameters and discriminator parameter, and stored them in generator parameter storage 306 and discriminator parameter storage 307, respectively.

In the bottleneck feature extraction part 300B, the feature extraction unit 303c extracts feature vectors from a short speech. Together with the feature vector, the generation unit 308 receives the noise stored in the noise storage 304 and generator parameters stored in the generator parameter storage 306. The generation unit 308 generates one or more robust bottleneck features representing the speaker factor.

Fig. 15 shows a concept of the architecture of the generator and the discriminator of the second example embodiment. The generator has two NNs - encoder NN and decoder NN, and the discriminator has one NN. Each NN includes three types of layers: input, hidden and output. The hidden layer can be plural. There are a linear transformation and/ or an activation (transfer) function at least between the input layer and the hidden layer(s), and between the hidden layer(s) and the output layer. The input layer of the encoder NN is feature vector of the short speech. The output layer of the encoder NN is speaker factor. The input layer of the decoder is addition or concatenation of the noise and the speaker factor - the output layer of the encoder NN. The output layer of the decoder is the restored feature vector. For the discriminator, the input layer is the feature vector of long speech or the restored feature vector - output of the decoder NN. The output of the discriminator is "real/fake" and speaker label in the training, and in evaluation part, the original output layer is discard and the last layer before that is used as the output layer.

The training part of the third example embodiment is same as that in the first example embodiment.

In the evaluation part, the encoder parameters, decoder parameters, discriminator parameters, input layer of the encoder NN (feature vector of short speech), part of the input layer of the decoder NN (noise) are provided, and as a result, the output layer of discriminator NN (bottleneck feature vector) is obtained.

<Operation of robust feature compensation apparatus>
Next, the operation of robust feature compensation apparatus 300 will be described with reference to drawings.

The whole operation of the robust feature compensation apparatus 300 will be described by referring to Fig. 16. Fig. 16 contains operations of the training part 300A and the bottleneck feature extraction part 300B. However, this shows an example, the operations of the training and feature restoration can be executed continuously or time intervals can be inserted.

In step G01 (training part), the generator & discriminator training unit 305 trains the generator and the discriminator together iteratively, based on short speech and long speech from the same speakers stored in the short speech data storage 301 and the long speech data storage 302, respectively. In detail, in each iteration, first parameters of the discriminator are fixed and the generator parameters are updated using objective functions; then the generator parameters are fixed and the discriminator parameters are updated using objective functions. Note that the order of updating generator parameters and discriminator parameters in an iterative can be changed. For the training, a wide range of optimization techniques can be applied, for example, the gradient decent method, known as back propagation to minimize pre-defined cost function, such as cross entropy, mean square error and so on. The objective function used in updating the generator is to update the generator to be able to generate a restored feature vector that the discriminator cannot discriminate, while the objective function in updating the discriminator is to update the discriminator to be able to discriminate the generated feature vector.

In step G02 (bottleneck feature extraction part), the generation unit 308 generates the restored feature vector from a given short speech utterance in the output layer using generator parameter stored in the generator parameter storage 306, and input it into the discriminator. The generation unit 308 extracts the last hidden layer as the robust bottleneck feature.

Fig. 17 is a flowchart illustrating that the generator and the discriminator are together trained using the short speech feature vectors and long speech feature vectors with noise. Fig. 17 shows the training part in Fig. 16.

First, in step H01, as the beginning of the training part, the feature extraction unit 303a reads short speech data with speaker labels from the short speech data storage 301.

In step H02, the feature extraction unit 303a further extracts feature vectors from the short speech data.

In step H03, the feature extraction unit 303b reads long speech data with speaker labels from the long speech data storage 302.

In step H04, the feature extraction unit 303b further extracts feature vectors from the long speech data.

In step H05, the generator & discriminator training unit 305 reads noise data stored in the noise storage 304.

In step H06, the generator & discriminator training unit 305 trains the generator and the discriminator together using short speech feature vectors sent from the feature extraction unit 303a and long speech feature vectors sent from the feature extraction unit 303b with speaker labels and noise.

In step H07, as the result of the training, the generator & discriminator training unit 305 generates generator parameters and discriminator parameters, and stores them in the generator parameter storage 306 and the discriminator parameter storage 307, respectively.

Note that the order of H01-H02 and H03-H04 can be switched, not limited to the form presented in Fig. 17.

Fig. 18 is a flowchart illustrating the bottleneck feature extraction part 300B.

Firstly, in step I01, the feature extraction unit 303c reads short speech data presented through from an external device (not shown in Fig. 14).

In step I02, the feature extraction unit 303c extracts feature vectors from the given short speech data.

In step I03, generation unit 308 reads noise data stored in noise storage 304.

In step I04, the generation unit 308 reads generator parameters from the generator parameter storage 306.

In step I05, the generation unit 308 reads discriminator parameters from the discriminator parameter storage 307.

Note here the order of I03 - I05 can be switched.

In step I06, the generation unit 308 extracts a bottleneck feature produced in the last layer of discriminator NN.

Effect of Third Example Embodiment.

As explained above, the third example embodiment can improve the robustness of the feature vector of short speech. As a result, such NN can generate a feature vector for a short speech as robust as that from long speech. The robust feature restoration is done in the first example embodiment. With the same training structure, it can at the same time produces a robust bottleneck feature in the output layer of the discriminator (the original output layer "real/fake" and speaker labels are discarded after training part).

Note that in the all example embodiments, the speaker labels in the output layer in the discriminator in train can be replaced by emotion labels, language labels, etc. for the usage of feature compensation for emotion recognition, language recognition and so on. Correspondingly, the output later of encoder can be changed to represent emotion feature vectors or language feature vectors.

Fourth Example Embodiment.
A robust feature compensation apparatus of the fourth example embodiment is shown in Fig. 19. The speech feature compensation apparatus 500 based on GAN, includes: a generator & discriminator training unit 501 that trains an GAN model to generate generator and discriminator parameters, based on at least one short speech feature vector and at least one long speech feature vector from the same speaker; and a robust feature compensation unit 502 that compensate the feature vector of short speech, based on short speech vector and the generator and discriminator parameters.

The speech feature compensation apparatus 500 can provide robust feature compensation to short speech. The reason is that the generator and the discriminator are jointly trained which improves each other’s performance iteratively, using feature vectors of short speech and long speech, so as to learn the relation between the feature vectors of short speech and long speech.

<Configuration of information processing apparatus>
Fig. 20 illustrates, by way of example, a configuration of an information processing apparatus 900 (computer) which can implement a robust feature compensation apparatus relevant to an example embodiment of the present invention. In other words, Fig. 20 illustrates a configuration of a computer (information processing apparatus) capable of implementing the devices in Figs.1, 9, 14 and 19 representing a hardware environment where the individual functions in the above-described example embodiments can be implemented.

The information processing apparatus 900 illustrated in Fig. 19 includes the following components:
- CPU (Central Processing Unit) 901;
- ROM (Read Only Memory) 902;
- RAM (Random Access Memory) 903;
- Hard disk 904 (storage device);
- Communication interface to an external device 905;
- Reader/writer 908 capable of reading and writing data stored in a storage medium 907 such as CD-ROM (Compact Disc Read Only Memory); and
- Input/output interface 909.

The information processing apparatus 900 is a general computer where these components are connected via a bus 906 (communication line).

The present invention explained with the above-described example embodiments as examples is accomplished by providing the information processing apparatus 900 illustrated in Fig. 20 with a computer program which is capable of implementing the functions illustrated in the block diagrams (Figs. 1, 9, 14 and 19) or the flowcharts (Figs. 6-8, Figs. 11-13 and Figs 16-18) referenced in the explanation of these example embodiments, and then by reading the computer program into the CPU 901 in such hardware, interpreting it, and executing it. The computer program provided to the apparatus can be stored in a volatile readable and writable storage memory (RAM 903) or in a non-volatile storage device such as the hard disk 904.

In addition, in the case described above, general procedures can now be used to provide the computer program to such hardware. These procedures include, for example, installing the computer program into the apparatus via any of various storage medium 907 such as CD-ROM, or downloading it from an external source via communication lines such as the Internet. In these cases, the present invention can be seen as being composed of codes forming such computer program or being composed of the storage medium 907 storing the codes.

As a final point, it should be clear that the process, techniques and methodology described and illustrated here are not limited or related to a particular apparatus. It can be implemented using a combination of components. Also various types of general purpose devise may be used in accordance with the instructions herein. The present invention has also been described using a particular set of examples. However, these are merely illustrative and not restrictive. For example the described software may be implemented in a wide variety of languages such as C/C++, Java, MATLAB and Python etc. Moreover other implementations of the inventive technology will be apparent to those skilled in the art.

Fig. 21 is a block diagram showing main parts of a speech feature compensation apparatus according to the present invention. As shown in Fig. 21, the speech feature compensation apparatus 10 includes training means 11 (realized by the generator &

discriminator training unit

105, 205, 305 in the example embodiments) for training a generator 21 and a discriminator 22 of GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN, feature extraction means 12 (realized by the

feature extraction unit

103c, 203c, 303c in the example embodiments) for extracting a feature vector from an input short speech, and generation means 13 (realized by the

generation unit

107, 308 or the encoding unit 207 in the example embodiments) for generating a robust feature vector based on the extracted feature vector using the trained parameters.

As shown in Fig. 22, the generator 21 may include an encoder 211 inputting the first feature vector and outputting a feature vector and a decoder 212 outputting a restored feature vector, and outputs the trained parameters with respect to at least the encoder, and the generation means 13 may include an encoding unit 131 which generating the robust feature vector by encoding the feature vector of the input short speech using the trained parameters.
　

100, 200, 300 robust feature compensation apparatus
101, 201, 301 short speech data storage
102, 202, 302 long speech data storage
103a, 203a, 303a feature extraction unit
103b, 203b, 303b feature extraction unit
103c, 203c, 303c feature extraction unit
104, 204, 304 noise storage
105, 205, 305 generator & discriminator training unit
106 generator parameter storage
206 encoder parameter storage
306 generator parameter storage
107 generation unit
207 encoding unit
307 discriminator parameter storage
108, 208 generated feature storage
308 generation unit
309 bottleneck feature storage

Claims

A speech feature compensation apparatus comprising:
training means for training a generator and a discriminator of GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN,
feature extraction means for extracting a feature vector from an input short speech, and
generation means for generating a robust feature vector based on the extracted feature vector using the trained parameters.
The speech feature compensation apparatus according to claim 1,
wherein the generation means generates a restored feature vector corresponding to the feature vector extracted from the input short speech.
The speech feature compensation apparatus according to claim 1 or 2,
wherein
the generator includes an encoder inputting the first feature vector and outputting a feature vector and a decoder outputting a restored feature vector, and outputs the trained parameters with respect to at least the encoder, and
the generation means includes an encoding unit which generating the robust feature vector by encoding the feature vector of the input short speech using the trained parameters.
The speech feature compensation apparatus according to claim 1,
wherein the generation means generates at least one bottleneck feature by the discriminator.
The speech feature compensation apparatus according to any one of claims 1 to 4,
wherein
the discriminator based on a neural network inputs the second feature vector, and
the training means trains the neural network so that a cost function is minimized, the cost function counting real/fake classification errors, speaker identification errors, and MSE (Mean Square Error) between the second feature vector and the generated feature vector of the long speech by the generator.
　
A speech feature compensation method comprising:
training a generator and a discriminator of GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN,
extracting a feature vector from an input short speech, and
generating a robust feature vector based on the extracted feature vector using the trained parameters.
The speech feature compensation method according to claim 6,
wherein a restored feature vector corresponding to the feature vector extracted from the input short speech is generated.
A speech feature compensation program for causing a computer to execute:
training a generator and a discriminator of GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN,
extracting a feature vector from an input short speech, and
generating a robust feature vector based on the extracted feature vector using the trained parameters.
The speech feature compensation program according to claim 8,
wherein a restored feature vector corresponding to the feature vector extracted from the input short speech is generated.