WO2019171415A1 - Speech feature compensation apparatus, method, and program - Google Patents

Speech feature compensation apparatus, method, and program Download PDF

Info

Publication number
WO2019171415A1
WO2019171415A1 (PCT/JP2018/008251)
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
speech
feature
generator
discriminator
Prior art date
Application number
PCT/JP2018/008251
Other languages
French (fr)
Inventor
Qiongqiong Wang
Koji Okabe
Takafumi Koshinaka
Original Assignee
Nec Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nec Corporation filed Critical Nec Corporation
Priority to JP2020539019A priority Critical patent/JP6897879B2/en
Priority to PCT/JP2018/008251 priority patent/WO2019171415A1/en
Publication of WO2019171415A1 publication Critical patent/WO2019171415A1/en
Priority to JP2021096366A priority patent/JP7243760B2/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction

Definitions

  • the present invention relates to a feature compensation apparatus, a feature compensation method, and a program for compensating feature vectors in speech and audio to robust ones.
  • Speaker recognition refers to recognizing persons from their voice. No two individuals sound identical because their vocal tract shapes, larynx sizes, and other parts of their voice production organs differ. Because the human voice carries speaker identity, speaker recognition has been increasingly applied to forensics, telephone-based services such as telephone banking, and so on.
  • Speaker recognition systems can be divided into text-dependent and text-independent ones.
  • In text-dependent systems, recognition phrases are fixed or known beforehand.
  • In text-independent systems, there are no constraints on the words the speakers are allowed to use. Text-independent recognition has a wider range of applications but is the more challenging of the two tasks, and it has been improving consistently over the past decades.
  • Since the reference utterances (what is spoken in training) and the test utterances (what is uttered in actual use) in text-independent speaker recognition applications may have completely different contents, the recognition system must take this phonetic mismatch into account. The performance crucially depends on the length of speech. When users speak a long utterance, usually one minute or longer, most phonemes are considered to be covered; as a result, recognition accuracy is good despite differing speech contents. For short speech, on the other hand, speaker recognition performance degrades because speaker feature vectors extracted from such utterances with statistical methods are too unreliable for accurate recognition.
  • PTL1 discloses a technology that employs a Denoising Autoencoder (DAE) to compensate speaker feature vectors of a short utterance which contains limited phonetic information.
  • DAE: Denoising Autoencoder
  • In a feature compensation apparatus based on the DAE described in PTL 1, the acoustic diversity degree in the input utterance is first estimated as posteriors based on speech models.
  • Then both the acoustic diversity degree and the recognition feature vector are presented to an input layer 401.
  • A feature vector in this description refers to a set of numeric values (specific data) that represents a target object.
  • The DAE-based transformation, consisting of an input layer 401, one or multiple hidden layers 402, and an output layer 403, is able to produce a restored recognition feature vector in the output layer with the help of supervised training using pairs of long and short speech segments.
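  • For concreteness, the following is a minimal sketch of such a DAE-style compensation network; PyTorch is assumed, and the layer sizes, dimensions, and training pairing are illustrative assumptions rather than the exact configuration of PTL 1.

```python
# Hypothetical sketch of a DAE-style feature compensator (not the exact PTL 1 network).
import torch
import torch.nn as nn

class DenoisingCompensator(nn.Module):
    def __init__(self, feat_dim=400, diversity_dim=50, hidden_dim=512):
        super().__init__()
        # Input layer 401 receives the recognition feature vector plus the
        # acoustic diversity posteriors; hidden layers 402 transform them;
        # output layer 403 emits the restored recognition feature vector.
        self.net = nn.Sequential(
            nn.Linear(feat_dim + diversity_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, short_feat, diversity):
        return self.net(torch.cat([short_feat, diversity], dim=-1))

# Supervised training with (short, long) pairs and mean square error,
# as described for PTL 1; the tensors below are dummy data.
model = DenoisingCompensator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

short_feat = torch.randn(8, 400)   # features of short segments
diversity = torch.randn(8, 50)     # acoustic diversity posteriors
long_feat = torch.randn(8, 400)    # features of the matching long segments

loss = mse(model(short_feat, diversity), long_feat)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```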
  • NPL 1 disclosed MFCC (Mel-Frequency Cepstrum Coefficients) as acoustic features.
  • PTL1 uses only mean square error minimization in DAE optimization.
  • Such an objective function is too simple to achieve accurate compensation.
  • In addition, the simple objective function requires the short speech to be a part of the long speech to obtain better results.
  • In the real world, only long speech can be used to train such a network (the short speech is cut from it), which wastes the information in the speakers' existing short speech. The system also needs a sufficient number of speakers, each with multiple sets of long speech, for training, which may not be realistic for all applications.
  • the objective of the present invention is to provide robust feature compensation for short speech.
  • An exemplary aspect of the speech feature compensation apparatus includes training means for training a generator and a discriminator of a GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment that is longer than the short speech segment and is from the same speaker as the short speech, and outputting trained parameters of the GAN; feature extraction means for extracting a feature vector from an input short speech; and generation means for generating a robust feature vector based on the extracted feature vector using the trained parameters.
  • GAN: Generative Adversarial Network
  • An exemplary aspect of the speech processing method includes training a generator and a discriminator of a GAN using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment that is longer than the short speech segment and is from the same speaker as the short speech, and outputting trained parameters of the GAN, extracting a feature vector from an input short speech, and generating a robust feature vector based on the extracted feature vector using the trained parameters.
  • An exemplary aspect of the speech processing program causes a computer to execute training a generator and a discriminator of a GAN using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment that is longer than the short speech segment and is from the same speaker as the short speech, and outputting trained parameters of the GAN, extracting a feature vector from an input short speech, and generating a robust feature vector based on the extracted feature vector using the trained parameters.
  • The speech feature compensation apparatus, speech feature compensation method, and program of the present invention can provide robust feature compensation for short speech.
  • Fig. 1 is a block diagram of a robust feature compensation apparatus of the first example embodiment in accordance with the present invention.
  • Fig. 2 shows an example of contents of the short speech data storage.
  • Fig. 3 shows an example of contents of the long speech data storage.
  • Fig. 4 shows an example of contents of the generator parameter storage.
  • Fig. 5 shows a concept of NN architecture in the first example embodiment.
  • Fig. 6 is a flowchart illustrating operation of the robust feature compensation apparatus of the first example embodiment.
  • Fig. 7 is a flowchart illustrating operation of the training phase of the robust feature compensation apparatus of the first example embodiment.
  • Fig. 8 is a flowchart illustrating operation of the robust feature compensation phase of the robust feature compensation apparatus of the first example embodiment.
  • Fig. 9 is a block diagram of a robust feature compensation apparatus of the second example embodiment in accordance with the present invention.
  • Fig. 10 shows a concept of NN architecture in the second example embodiment.
  • Fig. 11 is a flowchart illustrating operation of the robust feature compensation apparatus of the second example embodiment.
  • Fig. 12 is a flowchart illustrating operation of the training phase of the robust feature compensation apparatus of the second example embodiment.
  • Fig. 13 is a flowchart illustrating operation of the robust feature compensation phase of the robust feature compensation apparatus of the second example embodiment.
  • Fig. 14 is a block diagram of a robust feature compensation apparatus of the third example embodiment in accordance with the present invention.
  • Fig. 15 shows a concept of NN architecture in the third example embodiment.
  • Fig. 16 is a flowchart illustrating operation of the robust feature compensation apparatus of the third example embodiment.
  • Fig. 17 is a flowchart illustrating operation of the training phase of the robust feature compensation apparatus of the third example embodiment.
  • Fig. 18 is a flowchart illustrating operation of the robust feature compensation phase of the robust feature compensation apparatus of the third example embodiment.
  • Fig. 19 is an exemplary computer configuration used in example embodiments in accordance with the present invention.
  • Fig. 20 shows an exemplary computer configuration used in embodiments in accordance with the present invention.
  • Fig. 21 shows a block diagram showing main parts of a speech feature compensation apparatus.
  • Fig. 22 shows a block diagram showing another aspect of a speech feature compensation apparatus.
  • Fig. 23 is a block diagram of a feature compensation apparatus of PTL 1.
  • GAN Generative Adversarial Network
  • A robust feature compensation apparatus of the first example embodiment can provide a robust feature vector for a short speech segment, derived from the raw feature vector of the short speech, using a generator.
  • The generator of a GAN trained with short and long speech is capable of generating a robust feature vector for short speech. Note that the duration of the long speech is longer than that of the short speech.
  • Fig. 1 illustrates a block diagram of a robust feature compensation apparatus 100 of the first example embodiment.
  • the robust feature compensation apparatus 100 includes a training part 100A and a feature restoration part 100B.
  • the training part 100A includes a short speech data storage 101, a long speech data storage 102, feature extraction units 103a and 103b, a noise storage 104, a generator & discriminator training unit 105, and a generator parameter storage 106.
  • the feature restoration part 100B includes a feature extraction unit 103c, a generation unit 107 and a generated feature storage 108.
  • the feature extraction units 103a, 103b, and 103c have the same function.
  • the short speech data storage 101 stores short speech recordings with speaker labels as shown in Fig. 2.
  • the long speech data storage 102 stores long speech recordings with speaker labels as shown in Fig. 3.
  • the long speech data storage 102 contains at least one long speech recording of each speaker who has short speech recordings in the short speech data storage 101.
  • The noise storage 104 stores random vectors representing noise.
  • the generator parameter storage 106 stores generator parameters as shown in Fig. 4.
  • the generator includes an encoder and a decoder, as understood from Fig. 4. So parameters of both encoder and decoder are stored in the generator parameter storage 106.
  • the feature extraction unit 103a extracts feature vectors from the short speech data in the short speech data storage 101.
  • the feature extraction unit 103b extracts feature vectors from long speech in the long speech data storage 102.
  • The feature vectors are individually measurable properties of observations, for example, an i-vector: a fixed-dimensional feature vector extracted from acoustic features such as the MFCCs described in NPL 1.
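  • As an illustration only, frame-level acoustic features such as MFCCs can be computed as sketched below; librosa is assumed, and the i-vector extractor itself (the statistical model that maps frame-level MFCCs to a fixed-dimensional vector) is represented by a hypothetical placeholder call.

```python
# Illustrative front-end: MFCC extraction (librosa assumed).
# The i-vector extractor of NPL 1 is not reproduced here; it is a separate
# statistical model (UBM + total-variability matrix) trained offline.
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=20):
    """Return a (n_frames, n_mfcc) matrix of MFCCs for one recording."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T

# A fixed-dimensional utterance-level vector (e.g., an i-vector) would then be
# computed from the frame-level MFCCs by the chosen extractor:
# ivec = ivector_extractor(extract_mfcc("utt001.wav"))   # hypothetical call
```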
  • the generator & discriminator training unit 105 receives the feature vectors of a short speech segment from the feature extraction unit 103a, the feature vector of a long speech segment from the feature extraction unit 103b and the noise from noise storage 104.
  • the generator & discriminator training unit 105 trains a generator and a discriminator (not shown in Fig. 1) iteratively to determine "real" (the feature vector is extracted from a long speech) or "fake” (the feature vector is generated based on a feature vector from a short speech), and the speaker label which the feature vectors belong to.
  • Each of the generator and the discriminator includes an input layer, one or multiple hidden layers, and an output layer.
  • the generator & discriminator training unit 105 stores the generator parameters in the generator parameter storage 106.
  • the feature extraction unit 103c extracts feature vectors from a short speech recording. Together with the feature vector, the generation unit 107 receives noise stored in the noise storage 104 and generator parameters stored in the generator parameter storage 106. The generation unit 107 generates a robust restored feature.
  • Fig. 5 shows a concept of the architecture of the generator and the discriminator.
  • the generator has two neural networks (NNs) - encoder NN and decoder NN, and the discriminator has one NN.
  • Each NN includes three types of layers: input, hidden and output.
  • There can be one or more hidden layers. There is a linear transformation and/or an activation (transfer) function at least between the input layer and the hidden layer(s), and between the hidden layer(s) and the output layer.
  • the input layer of the encoder NN is a feature vector of a short speech recording.
  • the output layer of the encoder NN is a speaker factor (a feature vector).
  • the input layer of the decoder is addition or concatenation of the noise and the speaker factor - the output layer of the encoder NN.
  • the output layer of the decoder is a restored feature vector.
  • For the discriminator, the input layer is the feature vector of long speech or the restored feature vector (the output of the decoder NN).
  • the output of the discriminator is "real/fake" and a speaker label.
  • In the training, the input layer of the encoder NN (feature vectors of short speech recordings), part of the input layer of the decoder NN (noise), one of the two types of input to the discriminator (feature vectors of long speech recordings), and the output layer of the discriminator (outputting "real/fake" and the speaker label) are provided; as a result, the hidden layer(s) of the three NNs (encoder, decoder, discriminator), the output layer of the encoder NN (speaker factor), and the output layer of the decoder NN (restored feature vector) are determined.
  • The number of layers in the encoder, the decoder, and the discriminator can be 15, 15, and 16, respectively.
  • The encoder parameters, the decoder parameters, the input layer of the encoder NN (feature vector of short speech), and part of the input layer of the decoder NN (noise) are provided, and as a result, the output layer of the decoder NN (restored feature vector) is determined.
  • The output layer of the discriminator consists of (2+n) neurons, where n is the number of speakers in the training data and 2 corresponds to "real/fake".
  • The neurons can take the value "1" or "0" corresponding to "real/fake" and "true speaker label/wrong speaker labels".
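  • The sketch below illustrates this three-network layout in PyTorch; all dimensions and layer counts are illustrative assumptions, not the configuration of the application.

```python
# Illustrative GAN layout: encoder NN + decoder NN (= generator) and one
# discriminator NN. All layer sizes below are assumptions.
import torch
import torch.nn as nn

FEAT_DIM, SPK_FACTOR_DIM, NOISE_DIM, HIDDEN, N_SPEAKERS = 400, 200, 100, 512, 1000

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, SPK_FACTOR_DIM),    # output: speaker factor
        )
    def forward(self, short_feat):
        return self.net(short_feat)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SPK_FACTOR_DIM + NOISE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, FEAT_DIM),          # output: restored feature vector
        )
    def forward(self, speaker_factor, noise):
        # Concatenation of the speaker factor and the noise (addition is the
        # other option mentioned in the text).
        return self.net(torch.cat([speaker_factor, noise], dim=-1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(FEAT_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),  # last hidden layer (bottleneck)
        )
        # Output layer: 2 neurons for "real/fake" plus N_SPEAKERS neurons for
        # the speaker label, i.e. (2 + n) neurons in total.
        self.out = nn.Linear(HIDDEN, 2 + N_SPEAKERS)
    def forward(self, feat):
        h = self.hidden(feat)
        logits = self.out(h)
        return logits[..., :2], logits[..., 2:], h   # real/fake, speaker, bottleneck
```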
  • The generator (encoder and decoder) and the discriminator iteratively train each other: the generator parameters are updated once while the discriminator parameters are fixed, and then the discriminator parameters are updated once while the generator parameters are fixed.
  • A wide range of optimization techniques can be applied, for example, the gradient descent method with back propagation, to minimize pre-defined cost functions such as cross entropy, mean square error, and so on.
  • The objective functions, one for the generator and one for the discriminator, can be expressed with the following notation:
  • values (a) are objectives for the generator and values (b) are objectives for the discriminator;
  • A is the feature vector of the given short speech;
  • B is the feature vector of the given long speech;
  • element (c) is the noise z, modeling variations other than the speaker;
  • G(A,z) is the feature vector generated by the generator;
  • element (d) is the speaker classification result, i.e. the speaker posteriors, with N_d as the total number of speakers in the training set;
  • element (e) is the "real/fake" feature vector classification;
  • element (f) is the i-th element in D_d;
  • operators (g) and (h) are the expectation and mean square error operators, respectively;
  • constants (i) are predetermined constants;
  • y_d is the true speaker ID (ground truth).
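  • The formula images referred to by (a) and (b) are not reproduced in this text. As a hedged reconstruction only, objectives of the following general form would be consistent with the legend above; the exact terms and weightings of the application may differ.

```latex
% Assumed general form, reconstructed from the legend above; not the verbatim formulas.
% Generator: fool the discriminator, match the true speaker, and stay close to the
% long-speech feature vector.
\min_{G}\;
  \lambda_{1}\,\mathrm{MSE}\bigl(G(A,z),\,B\bigr)
  \;-\;\lambda_{2}\,\mathbb{E}\bigl[\log D_{d,y_{d}}\bigl(G(A,z)\bigr)\bigr]
  \;-\;\lambda_{3}\,\mathbb{E}\bigl[\log D_{r}\bigl(G(A,z)\bigr)\bigr]

% Discriminator: classify long-speech vectors as "real", generated vectors as "fake",
% and classify the speaker of real vectors correctly.
\min_{D}\;
  -\,\mathbb{E}\bigl[\log D_{r}(B)\bigr]
  \;-\;\mathbb{E}\bigl[\log\bigl(1 - D_{r}(G(A,z))\bigr)\bigr]
  \;-\;\mathbb{E}\bigl[\log D_{d,y_{d}}(B)\bigr]
```

  • Here D_r denotes the "real/fake" output (e), D_{d,i} the i-th speaker posterior (f) of the speaker output D_d (d), z the noise (c), the constants lambda correspond to (i), and E and MSE are the operators (g) and (h).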
  • Fig. 6 contains the operations of the training part 100A and the feature restoration part 100B. However, this is only an example; the operations of training and feature restoration can be executed continuously, or time intervals can be inserted between them.
  • In step A01, the generator & discriminator training unit 105 trains the generator and the discriminator together iteratively, based on short speech and long speech from the same speakers stored in the short speech data storage 101 and the long speech data storage 102, respectively.
  • Firstly, the discriminator parameters are fixed and the generator parameters are updated using the objective functions; then the generator parameters are fixed and the discriminator parameters are updated using the objective functions.
  • The order of updating the generator parameters and the discriminator parameters in an iteration can be changed.
  • A wide range of optimization techniques can be applied, for example, the gradient descent method with back propagation, to minimize pre-defined cost functions such as cross entropy, mean square error, and so on.
  • The objective function used in updating the generator aims to make the generator able to generate a restored feature vector that the discriminator cannot discriminate, while the objective function used in updating the discriminator aims to make the discriminator able to discriminate the generated feature vector. A minimal sketch of this alternating update is given below.
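  • The sketch continues the PyTorch example above (Encoder, Decoder, Discriminator, NOISE_DIM) and assumes cross entropy for the discriminator outputs and mean square error for the restoration term; the actual cost functions and weights are design choices of the application and may differ.

```python
# Sketch of the alternating GAN update of step A01, reusing the Encoder, Decoder
# and Discriminator classes sketched earlier and paired batches of short/long
# feature vectors with speaker labels.
import torch
import torch.nn.functional as F

enc, dec, disc = Encoder(), Decoder(), Discriminator()
g_opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)

def train_step(short_feat, long_feat, speaker_ids):
    noise = torch.randn(short_feat.size(0), NOISE_DIM)
    real = torch.zeros(short_feat.size(0), dtype=torch.long)   # class 0 = "real"
    fake = torch.ones(short_feat.size(0), dtype=torch.long)    # class 1 = "fake"

    # 1) Update the generator while the discriminator parameters are fixed:
    #    make the restored vector look "real", carry the true speaker label,
    #    and stay close to the long-speech feature vector.
    restored = dec(enc(short_feat), noise)
    rf_logits, spk_logits, _ = disc(restored)
    g_loss = (F.cross_entropy(rf_logits, real)
              + F.cross_entropy(spk_logits, speaker_ids)
              + F.mse_loss(restored, long_feat))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # 2) Update the discriminator while the generator parameters are fixed:
    #    long-speech vectors are "real", generated vectors are "fake".
    restored = dec(enc(short_feat), noise).detach()
    rf_real, spk_real, _ = disc(long_feat)
    rf_fake, _, _ = disc(restored)
    d_loss = (F.cross_entropy(rf_real, real)
              + F.cross_entropy(rf_fake, fake)
              + F.cross_entropy(spk_real, speaker_ids))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()
    return g_loss.item(), d_loss.item()
```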
  • In step A02 (the feature restoration part), the generation unit 107 generates a restored feature vector from a given short speech utterance in the output layer, using the generator parameters stored in the generator parameter storage 106.
  • Fig. 7 is a flowchart illustrating how the generator and the discriminator are trained together using short speech feature vectors and long speech feature vectors with noise.
  • Fig. 7 shows the training part in Fig. 6.
  • In step B01, as the beginning of the training part, the feature extraction unit 103a reads short speech data with speaker labels from the short speech data storage 101.
  • In step B02, the feature extraction unit 103a extracts feature vectors from the short speech.
  • In step B03, the feature extraction unit 103b reads long speech data with speaker labels from the long speech data storage 102.
  • In step B04, the feature extraction unit 103b extracts feature vectors from the long speech.
  • In step B05, the generator & discriminator training unit 105 reads the noise data stored in the noise storage 104.
  • In step B06, the generator & discriminator training unit 105 trains a generator and a discriminator together using the short speech feature vectors sent from the feature extraction unit 103a and the long speech feature vectors sent from the feature extraction unit 103b, with speaker labels and noise.
  • In step B07, as the result of the training, the generator & discriminator training unit 105 generates generator parameters and discriminator parameters, and stores the generator parameters in the generator parameter storage 106.
  • Fig. 8 is a flowchart illustrating the feature restoration part 100B.
  • In step C01, the feature extraction unit 103c reads short speech data presented from an external device (not shown in Fig. 1).
  • In step C02, the feature extraction unit 103c extracts feature vectors from the given short speech data.
  • In step C03, the generation unit 107 reads the noise data stored in the noise storage 104.
  • In step C04, the generation unit 107 reads the generator parameters from the generator parameter storage 106.
  • In step C06, the generation unit 107 restores the feature vector of the short speech and generates a robust feature vector. A sketch of this restoration pass is given below.
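  • Continuing the same sketch, the restoration phase amounts to one forward pass through the trained generator; the file name and storage format below are hypothetical.

```python
# Sketch of the feature restoration phase (feature restoration part 100B),
# reusing the Encoder/Decoder classes and constants sketched above.
import torch

enc, dec = Encoder(), Decoder()
state = torch.load("generator_parameters.pt")   # hypothetical dump of generator parameter storage 106
enc.load_state_dict(state["encoder"])
dec.load_state_dict(state["decoder"])
enc.eval()
dec.eval()

short_feat = torch.randn(1, FEAT_DIM)           # feature vector of the input short speech
noise = torch.randn(1, NOISE_DIM)               # noise read from the noise storage 104
with torch.no_grad():
    restored = dec(enc(short_feat), noise)      # robust (restored) feature vector
```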
  • the first example embodiment can improve the robustness of the feature vector of short speech.
  • The reason is that the joint training of the generator and the discriminator improves the performance of each, and the relationship between long speech feature vectors and short speech feature vectors is learned in the training.
  • Such an NN can generate a feature vector for a short speech that is as robust as one obtained from long speech.
  • A robust feature compensation apparatus of the second example embodiment can provide a robust feature for a short speech segment, derived from the raw feature of the short speech, using an encoder.
  • The encoder, which is part of a generator of a GAN trained with short and long speech, is capable of producing a speaker feature vector that is robust for short speech.
  • Fig. 9 illustrates a block diagram of a robust feature compensation apparatus 200 of the second example embodiment.
  • the robust feature compensation apparatus 200 includes a training part 200A and a speaker feature extraction part 200B.
  • The training part 200A includes a short speech data storage 201, a long speech data storage 202, feature extraction units 203a and 203b, a noise storage 204, a generator & discriminator training unit 205, and an encoder parameter storage 206.
  • the speaker feature extraction part 200B includes a feature extraction unit 203c, an encoding unit 207 as generation means, and a generated feature storage 208.
  • the feature extraction units 203a, 203b and 203c have the same function.
  • The short speech data storage 201 stores short speech recordings with speaker labels, as shown in Fig. 2.
  • The long speech data storage 202 stores long speech recordings with speaker labels, as shown in Fig. 3.
  • the long speech data storage 202 contains at least one long speech recording of each speaker who has short speech recordings in the short speech data storage 201.
  • The noise storage 204 stores random vectors representing noise.
  • The encoder parameter storage 206 stores encoder parameters, which are a part of the result of the generator & discriminator training unit 205.
  • The generator (not shown in Fig. 9) consists of an encoder and a decoder, the same as in the first example embodiment, as understood from Fig. 4.
  • the feature extraction unit 203a extracts features from the short speech in the short speech data storage 201.
  • the feature extraction unit 203b extracts features from the long speech in the long speech data storage 202.
  • the features are individually-measurable properties of observations, for example, i-vector - a fixed dimensional feature vector extracted from acoustic features such as MFCC.
  • the generator & discriminator training unit 205 receives feature vectors of short speech from the feature extraction unit 203a, the feature vector of long speech from the feature extraction unit 203b and noise from the noise storage 204.
  • the generator & discriminator training unit 205 trains the generator and the discriminator (not shown in Fig. 9) iteratively to determine real (the feature vector is extracted from a long speech) or fake (the feature vector is generated based on a feature vector from a short speech), and the speaker label which the feature vectors belong to.
  • The details of the training are described in the first example embodiment.
  • The generator & discriminator training unit 205 outputs generator parameters and discriminator parameters, and stores the encoder parameters, which are a part of the generator parameters, in the encoder parameter storage 206.
  • The feature extraction unit 203c extracts feature vectors from a short speech. Together with the feature vector, the encoding unit 207 receives the noise stored in the noise storage 204 and the encoder parameters in the encoder parameter storage 206. The encoding unit 207 encodes a robust speaker feature vector.
  • Fig. 10 shows a concept of the architecture of the generator and the discriminator of the second example embodiment.
  • the generator has two NNs - encoder NN and decoder NN, and the discriminator has one NN.
  • Each NN includes three types of layers: input, hidden and output.
  • There can be one or more hidden layers. There is a linear transformation and/or an activation (transfer) function at least between the input layer and the hidden layer(s), and between the hidden layer(s) and the output layer.
  • The input layer of the encoder NN is the feature vector of the short speech.
  • The output layer of the encoder NN is the speaker factor.
  • The input layer of the decoder is the addition or concatenation of the noise and the speaker factor (the output layer of the encoder NN).
  • The output layer of the decoder is the restored feature vector.
  • For the discriminator, the input layer is the feature vector of long speech or the restored feature vector (the output of the decoder NN).
  • the output of the discriminator is "real/fake"
  • The training part 200A of the second example embodiment is the same as that in the first example embodiment, as mentioned.
  • In the speaker feature extraction, the encoder parameters and the input layer of the encoder NN are provided, and as a result, the output layer of the encoder NN (speaker factor) is obtained; a sketch is given below.
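  • In other words, only the encoder half of the trained generator is run in this phase; the sketch reuses the Encoder class and constants above, and the file name is hypothetical.

```python
# Sketch of speaker feature extraction in the second example embodiment:
# only the encoder (part of the trained generator) is run.
import torch

enc = Encoder()
enc.load_state_dict(torch.load("encoder_parameters.pt"))  # hypothetical dump of encoder parameter storage 206
enc.eval()

short_feat = torch.randn(1, FEAT_DIM)       # feature vector of the input short speech
with torch.no_grad():
    speaker_factor = enc(short_feat)        # robust speaker feature vector
```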
  • Fig. 11 contains operations of the training part 200A and the speaker feature extraction part 200B.
  • In step D01, the generator & discriminator training unit 205 trains the generator and the discriminator together iteratively, based on short speech and long speech from the same speakers stored in the short speech data storage 201 and the long speech data storage 202, respectively.
  • First, the parameters of the discriminator are fixed and the generator parameters are updated using the objective functions; then the generator parameters are fixed and the discriminator parameters are updated using the objective functions.
  • The order of updating the generator parameters and the discriminator parameters in an iteration can be changed.
  • A wide range of optimization techniques can be applied, for example, the gradient descent method with back propagation, to minimize pre-defined cost functions such as cross entropy, mean square error, and so on.
  • The objective function used in updating the generator aims to make the generator able to generate a restored feature vector that the discriminator cannot discriminate, while the objective function used in updating the discriminator aims to make the discriminator able to discriminate the generated feature vector.
  • In step D02, the encoding unit 207 encodes a robust speaker feature vector from a given short speech utterance, in the output layer of the encoder, using the encoder parameters stored in the encoder parameter storage 206.
  • Fig. 12 is a flowchart illustrating how the generator and the discriminator are trained together using the short speech feature vectors and long speech feature vectors with noise.
  • Fig. 12 shows the training part in Fig. 11.
  • In step E01, as the beginning of the training part, the feature extraction unit 203a reads short speech data with speaker labels from the short speech data storage 201.
  • In step E02, the feature extraction unit 203a extracts feature vectors from the short speech.
  • In step E03, the feature extraction unit 203b reads long speech data with speaker labels from the long speech data storage 202.
  • In step E04, the feature extraction unit 203b extracts feature vectors from the long speech.
  • In step E05, the generator & discriminator training unit 205 reads the noise data stored in the noise storage 204.
  • In step E06, the generator & discriminator training unit 205 trains the generator and the discriminator together using the short speech feature vectors sent from the feature extraction unit 203a and the long speech feature vectors sent from the feature extraction unit 203b, with speaker labels and noise.
  • In step E07, as the result of the training, the generator & discriminator training unit 205 stores the parameters of the encoder, which is a part of the generator, in the encoder parameter storage 206.
  • The order of steps E01-E02 and E03-E04 can be switched; it is not limited to the form presented in Fig. 12.
  • Fig. 13 is a flowchart illustrating the speaker feature extraction part 200B.
  • In step F01, the feature extraction unit 203c reads short speech data presented from an external device (not shown in Fig. 9).
  • In step F02, the feature extraction unit 203c extracts feature vectors from the given short speech data.
  • In step F03, the encoding unit 207 reads the noise data stored in the noise storage 204.
  • In step F04, the encoding unit 207 reads the encoder parameters from the encoder parameter storage 206.
  • In step F06, the encoding unit 207 encodes the feature vector of the short speech and extracts a robust speaker feature vector.
  • the second example embodiment can improve the robustness of the feature vector of short speech.
  • The robust feature restoration is done as in the first example embodiment; with the same training structure, the apparatus can at the same time produce a robust speaker feature vector in the output layer of the encoder. Using the speaker feature vector is more direct for speaker verification applications.
  • A robust feature compensation apparatus of the third example embodiment can provide a robust feature for a short speech segment, derived from the raw feature of the short speech, using a generator and a discriminator, with a bottleneck feature vector produced in the last hidden layer of the discriminator.
  • A generator and a discriminator of a GAN trained with short and long speech are capable of producing a bottleneck feature that is robust for short speech.
  • Fig. 14 illustrates a block diagram of a robust feature compensation apparatus 300 of the third example embodiment.
  • the robust feature compensation apparatus 300 includes a training part 300A and a bottleneck feature extraction part 300B.
  • The training part 300A includes a short speech data storage 301, a long speech data storage 302, feature extraction units 303a and 303b, a noise storage 304, a generator & discriminator training unit 305, a generator parameter storage 306, and a discriminator parameter storage 307.
  • the bottleneck feature extraction part 300B includes a feature extraction unit 303c, a generation unit 308 and a bottleneck feature storage 309. The feature extraction units 303a, 303b and 303c have the same function.
  • The short speech data storage 301 stores short speech recordings with speaker labels, as shown in Fig. 2.
  • The long speech data storage 302 stores long speech recordings with speaker labels, as shown in Fig. 3.
  • the long speech data storage 302 contains at least one long speech recording of each speaker who has short speech recordings in the short speech data storage 301.
  • The noise storage 304 stores random vectors representing noise.
  • the generator parameter storage 306 stores generator parameters.
  • The generator (not shown in Fig. 14) consists of an encoder and a decoder, the same as in the first example embodiment, as understood from Fig. 4, so parameters of both the encoder and the decoder are stored in the generator parameter storage 306.
  • the discriminator parameter storage 307 stores the parameter of the discriminator (not shown in Fig. 14).
  • the feature extraction unit 303a extracts features from the short speech in the short speech data storage 301.
  • the feature extraction unit 303b extracts features from long speech in the long speech data storage 302.
  • the features are individually-measurable properties of observations, for example, i-vector - a fixed dimensional feature vector extracted from acoustic features such as MFCC.
  • the generator & discriminator training unit 305 receives feature vectors of short speech from the feature extraction unit 303a, the feature vector of long speech from the feature extraction unit 303b and the noise from noise storage 304.
  • the generator & discriminator training unit 305 trains the generator and the discriminator iteratively to determine real (the feature vector is extracted from a long speech) or fake (the feature vector is generated based on a feature vector from a short speech), and the speaker label which the feature vectors belong to.
  • The details of the training are described in the first example embodiment.
  • The generator & discriminator training unit 305 outputs generator parameters and discriminator parameters, and stores them in the generator parameter storage 306 and the discriminator parameter storage 307, respectively.
  • the feature extraction unit 303c extracts feature vectors from a short speech. Together with the feature vector, the generation unit 308 receives the noise stored in the noise storage 304 and generator parameters stored in the generator parameter storage 306. The generation unit 308 generates one or more robust bottleneck features representing the speaker factor.
  • Fig. 15 shows a concept of the architecture of the generator and the discriminator of the third example embodiment.
  • the generator has two NNs - encoder NN and decoder NN, and the discriminator has one NN.
  • Each NN includes three types of layers: input, hidden and output.
  • There can be one or more hidden layers. There is a linear transformation and/or an activation (transfer) function at least between the input layer and the hidden layer(s), and between the hidden layer(s) and the output layer.
  • The input layer of the encoder NN is the feature vector of the short speech.
  • The output layer of the encoder NN is the speaker factor.
  • The input layer of the decoder is the addition or concatenation of the noise and the speaker factor (the output layer of the encoder NN).
  • The output layer of the decoder is the restored feature vector.
  • For the discriminator, the input layer is the feature vector of long speech or the restored feature vector (the output of the decoder NN).
  • The output of the discriminator is "real/fake" and the speaker label in the training; in the evaluation part, the original output layer is discarded and the last layer before it is used as the output layer.
  • The training part of the third example embodiment is the same as that in the first example embodiment.
  • In the bottleneck feature extraction, the encoder parameters, the decoder parameters, the discriminator parameters, the input layer of the encoder NN (feature vector of short speech), and part of the input layer of the decoder NN (noise) are provided, and as a result, the output layer of the discriminator NN (bottleneck feature vector) is obtained; a sketch is given below.
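  • The sketch reuses the classes and constants above; the restored vector is passed through the discriminator and the activation of its last hidden layer is kept as the bottleneck feature. File names are hypothetical.

```python
# Sketch of bottleneck feature extraction in the third example embodiment:
# run the generator, feed the restored vector to the discriminator, and keep
# the last hidden layer activation instead of the "real/fake"/speaker outputs.
import torch

enc, dec, disc = Encoder(), Decoder(), Discriminator()
gen_state = torch.load("generator_parameters.pt")                 # generator parameter storage 306 (hypothetical file)
enc.load_state_dict(gen_state["encoder"])
dec.load_state_dict(gen_state["decoder"])
disc.load_state_dict(torch.load("discriminator_parameters.pt"))   # discriminator parameter storage 307 (hypothetical file)
enc.eval()
dec.eval()
disc.eval()

short_feat = torch.randn(1, FEAT_DIM)
noise = torch.randn(1, NOISE_DIM)
with torch.no_grad():
    restored = dec(enc(short_feat), noise)
    _, _, bottleneck = disc(restored)     # last hidden layer = robust bottleneck feature
```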
  • Fig. 16 contains the operations of the training part 300A and the bottleneck feature extraction part 300B. However, this is only an example; the operations of training and bottleneck feature extraction can be executed continuously, or time intervals can be inserted between them.
  • In step G01, the generator & discriminator training unit 305 trains the generator and the discriminator together iteratively, based on short speech and long speech from the same speakers stored in the short speech data storage 301 and the long speech data storage 302, respectively.
  • First, the parameters of the discriminator are fixed and the generator parameters are updated using the objective functions; then the generator parameters are fixed and the discriminator parameters are updated using the objective functions.
  • The order of updating the generator parameters and the discriminator parameters in an iteration can be changed.
  • A wide range of optimization techniques can be applied, for example, the gradient descent method with back propagation, to minimize pre-defined cost functions such as cross entropy, mean square error, and so on.
  • The objective function used in updating the generator aims to make the generator able to generate a restored feature vector that the discriminator cannot discriminate, while the objective function used in updating the discriminator aims to make the discriminator able to discriminate the generated feature vector.
  • In step G02, the generation unit 308 generates the restored feature vector from a given short speech utterance in the output layer, using the generator parameters stored in the generator parameter storage 306, and inputs it into the discriminator.
  • The generation unit 308 then extracts the last hidden layer as the robust bottleneck feature.
  • Fig. 17 is a flowchart illustrating how the generator and the discriminator are trained together using the short speech feature vectors and long speech feature vectors with noise.
  • Fig. 17 shows the training part in Fig. 16.
  • In step H01, as the beginning of the training part, the feature extraction unit 303a reads short speech data with speaker labels from the short speech data storage 301.
  • In step H02, the feature extraction unit 303a extracts feature vectors from the short speech data.
  • In step H03, the feature extraction unit 303b reads long speech data with speaker labels from the long speech data storage 302.
  • In step H04, the feature extraction unit 303b extracts feature vectors from the long speech data.
  • In step H05, the generator & discriminator training unit 305 reads the noise data stored in the noise storage 304.
  • In step H06, the generator & discriminator training unit 305 trains the generator and the discriminator together using the short speech feature vectors sent from the feature extraction unit 303a and the long speech feature vectors sent from the feature extraction unit 303b, with speaker labels and noise.
  • In step H07, as the result of the training, the generator & discriminator training unit 305 generates generator parameters and discriminator parameters, and stores them in the generator parameter storage 306 and the discriminator parameter storage 307, respectively.
  • The order of steps H01-H02 and H03-H04 can be switched; it is not limited to the form presented in Fig. 17.
  • Fig. 18 is a flowchart illustrating the bottleneck feature extraction part 300B.
  • In step I01, the feature extraction unit 303c reads short speech data presented from an external device (not shown in Fig. 14).
  • In step I02, the feature extraction unit 303c extracts feature vectors from the given short speech data.
  • In step I03, the generation unit 308 reads the noise data stored in the noise storage 304.
  • In step I04, the generation unit 308 reads the generator parameters from the generator parameter storage 306.
  • In step I05, the generation unit 308 reads the discriminator parameters from the discriminator parameter storage 307.
  • In step I06, the generation unit 308 extracts a bottleneck feature produced in the last layer of the discriminator NN.
  • the third example embodiment can improve the robustness of the feature vector of short speech.
  • Such an NN can generate a feature vector for a short speech that is as robust as one obtained from long speech.
  • The robust feature restoration is done as in the first example embodiment; with the same training structure, the apparatus can at the same time produce a robust bottleneck feature in the output layer of the discriminator (the original output layer, "real/fake" and speaker labels, is discarded after the training part).
  • The speaker labels in the output layer of the discriminator in training can be replaced by emotion labels, language labels, etc., so that the feature compensation can be used for emotion recognition, language recognition, and so on.
  • The output layer of the encoder can then be changed to represent emotion feature vectors or language feature vectors.
  • The speech feature compensation apparatus 500 based on a GAN includes: a generator & discriminator training unit 501 that trains a GAN model to generate generator and discriminator parameters, based on at least one short speech feature vector and at least one long speech feature vector from the same speaker; and a robust feature compensation unit 502 that compensates the feature vector of short speech, based on the short speech feature vector and the generator and discriminator parameters.
  • the speech feature compensation apparatus 500 can provide robust feature compensation to short speech.
  • The reason is that the generator and the discriminator are jointly trained, which iteratively improves each other's performance, using feature vectors of short speech and long speech, so as to learn the relation between the feature vectors of short speech and long speech.
  • Fig. 20 illustrates, by way of example, a configuration of an information processing apparatus 900 (computer) which can implement a robust feature compensation apparatus relevant to an example embodiment of the present invention.
  • Fig. 20 illustrates a configuration of a computer (information processing apparatus) capable of implementing the devices in Figs.1, 9, 14 and 19 representing a hardware environment where the individual functions in the above-described example embodiments can be implemented.
  • The information processing apparatus 900 illustrated in Fig. 20 includes the following components: a CPU (Central Processing Unit) 901; a ROM (Read Only Memory) 902; a RAM (Random Access Memory) 903; a hard disk 904 (storage device); a communication interface 905 to an external device; a reader/writer 908 capable of reading and writing data stored in a storage medium 907 such as a CD-ROM (Compact Disc Read Only Memory); and an input/output interface 909.
  • the information processing apparatus 900 is a general computer where these components are connected via a bus 906 (communication line).
  • the present invention explained with the above-described example embodiments as examples is accomplished by providing the information processing apparatus 900 illustrated in Fig. 20 with a computer program which is capable of implementing the functions illustrated in the block diagrams (Figs. 1, 9, 14 and 19) or the flowcharts (Figs. 6-8, Figs. 11-13 and Figs 16-18) referenced in the explanation of these example embodiments, and then by reading the computer program into the CPU 901 in such hardware, interpreting it, and executing it.
  • the computer program provided to the apparatus can be stored in a volatile readable and writable storage memory (RAM 903) or in a non-volatile storage device such as the hard disk 904.
  • Fig. 21 is a block diagram showing main parts of a speech feature compensation apparatus according to the present invention.
  • The speech feature compensation apparatus 10 includes training means 11 (realized by the generator & discriminator training unit 105, 205, 305 in the example embodiments) for training a generator 21 and a discriminator 22 of a GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment that is longer than the short speech segment and is from the same speaker as the short speech, and outputting trained parameters of the GAN; feature extraction means 12 (realized by the feature extraction unit 103c, 203c, 303c in the example embodiments) for extracting a feature vector from an input short speech; and generation means 13 (realized by the generation unit 107, 308 or the encoding unit 207 in the example embodiments) for generating a robust feature vector based on the extracted feature vector using the trained parameters.
  • GAN: Generative Adversarial Network
  • The generator 21 may include an encoder 211 that inputs the first feature vector and outputs a feature vector, and a decoder 212 that outputs a restored feature vector; the trained parameters with respect to at least the encoder are output, and the generation means 13 may include an encoding unit 131 which generates the robust feature vector by encoding the feature vector of the input short speech using the trained parameters.
  • Reference signs list:
    100, 200, 300: robust feature compensation apparatus
    101, 201, 301: short speech data storage
    102, 202, 302: long speech data storage
    103a, 203a, 303a: feature extraction unit
    103b, 203b, 303b: feature extraction unit
    103c, 203c, 303c: feature extraction unit
    104, 204, 304: noise storage
    105, 205, 305: generator & discriminator training unit
    106: generator parameter storage
    206: encoder parameter storage
    306: generator parameter storage
    107: generation unit
    207: encoding unit
    307: discriminator parameter storage
    108, 208: generated feature storage
    308: generation unit
    309: bottleneck feature storage

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The speech feature compensation apparatus 100 includes training means 11 for training a generator 21 and a discriminator 22 of a GAN using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment that is longer than the short speech segment and is from the same speaker as the short speech, and outputting trained parameters of the GAN; feature extraction means 12 for extracting a feature vector from an input short speech; and generation means 13 for generating a robust feature vector based on the extracted feature vector using the trained parameters.

Description

SPEECH FEATURE COMPENSATION APPARATUS, METHOD, AND PROGRAM
The present invention relates to a feature compensation apparatus, a feature compensation method, and a program for compensating feature vectors in speech and audio to robust ones.
 
Speaker recognition refers to recognizing persons from their voice. No two individuals sound identical because their vocal tract shapes, larynx sizes, and other parts of their voice production organs differ. Because the human voice carries speaker identity, speaker recognition has been increasingly applied to forensics, telephone-based services such as telephone banking, and so on.
Speaker recognition systems can be divided into text-dependent and text-independent ones. In text-dependent systems, recognition phrases are fixed or known beforehand. In text-independent systems, there are no constraints on the words which the speakers are allowed to use. Text-independent recognition has a wider range of applications but is the more challenging of the two tasks, and it has been improving consistently over the past decades.
Since the reference utterances (what is spoken in training) and the test utterances (what is uttered in actual use) in text-independent speaker recognition applications may have completely different contents, the recognition system must take this phonetic mismatch into account. The performance crucially depends on the length of speech. When users speak a long utterance, usually one minute or longer, most phonemes are considered to be covered; as a result, recognition accuracy is good despite differing speech contents. For short speech, on the other hand, speaker recognition performance degrades because speaker feature vectors extracted from such utterances with statistical methods are too unreliable for accurate recognition.
In practical speaker verification applications, often only short speech segments are observed during testing. In general, short speech segments of less than 10 seconds are used. It is therefore important to improve text-independent speaker recognition with short utterances by speaker feature vector restoration.
PTL1 discloses a technology that employs a Denoising Autoencoder (DAE) to compensate speaker feature vectors of a short utterance which contains limited phonetic information.
As shown in Fig. 23, a feature compensation apparatus based on the DAE described in PTL1 first estimates the acoustic diversity degree in the input utterance as posteriors based on speech models. Then both the acoustic diversity degree and the recognition feature vector are presented to an input layer 401. "Feature vector" in this description refers to a set of numeric values (specific data) that represents a target object. The DAE-based transformation, consisting of an input layer 401, one or multiple hidden layers 402, and an output layer 403, is able to produce a restored recognition feature vector in the output layer with the help of supervised training using pairs of long and short speech segments.
NPL 1 disclosed MFCC (Mel-Frequency Cepstrum Coefficients) as acoustic features.
 
PTL 1: United States Patent Application 2016/0098993
NPL 1: Najim Dehak, Patrick J. Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-End Factor Analysis for Speaker Verification", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 4, MAY 2011
However, PTL1 uses only mean square error minimization in DAE optimization. Such an objective function is too simple to achieve accurate compensation. In addition, the simple objective function requires the short speech to be a part of the long speech to obtain better results. In the real world, only long speech can be used to train such a network (the short speech is cut from it), which wastes the information in the speakers' existing short speech. The system also needs a sufficient number of speakers, each with multiple sets of long speech, for training, which may not be realistic for all applications.
In view of the above mentioned situation, the objective of the present invention is to provide robust feature compensation for short speech.
 
An exemplary aspect of the speech feature compensation apparatus includes training means for training a generator and a discriminator of a GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment that is longer than the short speech segment and is from the same speaker as the short speech, and outputting trained parameters of the GAN; feature extraction means for extracting a feature vector from an input short speech; and generation means for generating a robust feature vector based on the extracted feature vector using the trained parameters.
An exemplary aspect of the speech processing method includes training a generator and a discriminator of a GAN using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment that is longer than the short speech segment and is from the same speaker as the short speech, and outputting trained parameters of the GAN, extracting a feature vector from an input short speech, and generating a robust feature vector based on the extracted feature vector using the trained parameters.
An exemplary aspect of the speech processing program causes a computer to execute training a generator and a discriminator of a GAN using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment that is longer than the short speech segment and is from the same speaker as the short speech, and outputting trained parameters of the GAN, extracting a feature vector from an input short speech, and generating a robust feature vector based on the extracted feature vector using the trained parameters.
 
According to the present invention, the speech feature compensation apparatus, speech feature compensation method, and program can provide robust feature compensation for short speech.
 
Fig. 1 is a block diagram of a robust feature compensation apparatus of the first example embodiment in accordance with the present invention.
Fig. 2 shows an example of contents of the short speech data storage.
Fig. 3 shows an example of contents of the long speech data storage.
Fig. 4 shows an example of contents of the generator parameter storage.
Fig. 5 shows a concept of NN architecture in the first example embodiment.
Fig. 6 is a flowchart illustrating operation of the robust feature compensation apparatus of the first example embodiment.
Fig. 7 is a flowchart illustrating operation of the training phase of the robust feature compensation apparatus of the first example embodiment.
Fig. 8 is a flowchart illustrating operation of the robust feature compensation phase of the robust feature compensation apparatus of the first example embodiment.
Fig. 9 is a block diagram of a robust feature compensation apparatus of the second example embodiment in accordance with the present invention.
Fig. 10 shows a concept of NN architecture in the second example embodiment.
Fig. 11 is a flowchart illustrating operation of the robust feature compensation apparatus of the second example embodiment.
Fig. 12 is a flowchart illustrating operation of the training phase of the robust feature compensation apparatus of the second example embodiment.
Fig. 13 is a flowchart illustrating operation of the robust feature compensation phase of the robust feature compensation apparatus of the second example embodiment.
Fig. 14 is a block diagram of a robust feature compensation apparatus of the third example embodiment in accordance with the present invention.
Fig. 15 shows a concept of NN architecture in the third example embodiment.
Fig. 16 is a flowchart illustrating operation of the robust feature compensation apparatus of the third example embodiment.
Fig. 17 is a flowchart illustrating operation of the training phase of the robust feature compensation apparatus of the third example embodiment.
Fig. 18 is a flowchart illustrating operation of the robust feature compensation phase of the robust feature compensation apparatus of the third example embodiment.
Fig. 19 is an exemplary computer configuration used in example embodiments in accordance with the present invention.
Fig. 20 shows an exemplary computer configuration used in embodiments in accordance with the present invention.
Fig. 21 shows a block diagram showing main parts of a speech feature compensation apparatus.
Fig. 22 shows a block diagram showing another aspect of a speech feature compensation apparatus.
Fig. 23 is a block diagram of a feature compensation apparatus of PTL 1.
Each example embodiment of the present invention will be described below with reference to the figures. The following detailed descriptions are merely exemplary in nature and are not intended to limit the invention or the application and uses of the invention. Furthermore, there is no intention to be bound by any theory presented in the preceding background of the invention or the following detailed description.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures illustrating integrated circuit architecture may be exaggerated relative to other elements to help to improve understanding of the present and alternate example embodiments.
In real-world speaker recognition applications, text-independent speaker recognition is often used and short speech segments (less than 10 seconds) are observed. In such cases, phonetic mismatch must be taken into account, since the unbalanced phonetic distribution results in an unreliable speaker feature vector extracted from the short speech. As the segments get shorter, performance degrades accordingly. Therefore, there is a need to improve text-independent speaker recognition with short utterances by a speaker feature restoration method.
In view of the above, the following example embodiments utilize a Generative Adversarial Network (GAN) including a generator and a discriminator that improve each other during the iterative training process, so that the generator will generate a robust, compensated feature vector for the short speech.
First Example Embodiment.
A robust feature compensation apparatus of the first example embodiment can provide a robust feature vector for a short speech segment, from the raw feature vector of the short speech, using a generator. Thus, in this example embodiment, the generator of a GAN trained with short and long speech is capable of generating a robust feature vector for short speech. Note that the duration of the long speech is longer than that of the short speech.
<Configuration of robust feature compensation apparatus>
In the first example embodiment of the present invention, a robust feature compensation apparatus for feature restoration using a generator of GAN will be described.
Fig. 1 illustrates a block diagram of a robust feature compensation apparatus 100 of the first example embodiment. The robust feature compensation apparatus 100 includes a training part 100A and a feature restoration part 100B.
The training part 100A includes a short speech data storage 101, a long speech data storage 102, feature extraction units 103a and 103b, a noise storage 104, a generator & discriminator training unit 105, and a generator parameter storage 106. The feature restoration part 100B includes a feature extraction unit 103c, a generation unit 107 and a generated feature storage 108. The feature extraction units 103a, 103b, and 103c have the same function.
The short speech data storage 101 stores short speech recordings with speaker labels as shown in Fig. 2.
The long speech data storage 102 stores long speech recordings with speaker labels as shown in Fig. 3. The long speech data storage 102 contains at least one long speech recording of each speaker who has short speech recordings in the short speech data storage 101.
The noise storage 104 stores a random vector representing noise.
The generator parameter storage 106 stores generator parameters as shown in Fig. 4. The generator includes an encoder and a decoder, as understood from Fig. 4. So parameters of both encoder and decoder are stored in the generator parameter storage 106.
The feature extraction unit 103a extracts feature vectors from the short speech data in the short speech data storage 101. The feature extraction unit 103b extracts feature vectors from the long speech in the long speech data storage 102. The feature vectors are individually-measurable properties of observations, for example, an i-vector - a fixed-dimensional feature vector extracted from acoustic features such as MFCCs, as described in NPL 1.
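A minimal sketch of how such acoustic features might be extracted in practice is given below. The patent does not prescribe any toolkit; the use of Python and librosa, the function names, and the mean-vector stand-in for a trained i-vector extractor are illustrative assumptions only.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=20):
    """Return a (frames, n_mfcc) matrix of MFCCs for one recording."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return mfcc.T

def utterance_vector(mfcc, ivector_extractor=None):
    """Map frame-level MFCCs to one fixed-dimensional utterance-level vector.
    A real system would apply a trained i-vector (or similar) extractor here;
    the mean vector below is only a stand-in for illustration."""
    if ivector_extractor is not None:
        return ivector_extractor(mfcc)
    return mfcc.mean(axis=0)
```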
The generator & discriminator training unit 105 receives the feature vector of a short speech segment from the feature extraction unit 103a, the feature vector of a long speech segment from the feature extraction unit 103b, and the noise from the noise storage 104. The generator & discriminator training unit 105 trains a generator and a discriminator (not shown in Fig. 1) iteratively to determine "real" (the feature vector is extracted from a long speech) or "fake" (the feature vector is generated based on a feature vector from a short speech), and the speaker label to which the feature vectors belong. Each of the generator and the discriminator includes an input layer, one or more hidden layers, and an output layer.
In the training, in "real" case, the received feature vectors of the long speech are given to the input layer of the discriminator; in "fake" case, the received feature vectors of the short speech are given to the input layer of the generator. The output layer of the generator is the input layer of the discriminator. Further, "real/fake" and speaker labels are given to the output layer of the discriminator. Details of those layers will be described later. After the training, the generator & discriminator training unit 105 stores the generator parameters in the generator parameter storage 106.
In the feature restoration part 100B, the feature extraction unit 103c extracts a feature vector from a short speech recording. Together with the feature vector, the generation unit 107 receives the noise stored in the noise storage 104 and the generator parameters stored in the generator parameter storage 106. The generation unit 107 generates a robust restored feature vector.
Fig. 5 shows a concept of the architecture of the generator and the discriminator. The generator has two neural networks (NNs) - an encoder NN and a decoder NN - and the discriminator has one NN. Each NN includes three types of layers: input, hidden and output. The hidden layers can be plural. There are a linear transformation and/or an activation (transfer) function at least between the input layer and the hidden layer(s), and between the hidden layer(s) and the output layer. The input layer of the encoder NN is a feature vector of a short speech recording. The output layer of the encoder NN is a speaker factor (a feature vector). The input layer of the decoder is the addition or concatenation of the noise and the speaker factor - the output layer of the encoder NN. The output layer of the decoder is a restored feature vector. For the discriminator, the input layer is the feature vector of long speech or the restored feature vector - the output of the decoder NN. The output of the discriminator is "real/fake" and a speaker label.
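The following is a minimal, non-authoritative PyTorch sketch of such an architecture. The framework, the layer sizes, the activation functions and all identifiers are assumptions for illustration; only the roles of the layers (short-speech feature vector in, speaker factor, noise, restored feature vector, and a "real/fake" plus speaker-label output) follow the description above.

```python
import torch
import torch.nn as nn

FEAT_DIM, SPK_DIM, NOISE_DIM, HIDDEN, N_SPK = 400, 200, 100, 512, 1000  # assumed sizes

class Encoder(nn.Module):
    """Encoder NN: feature vector of short speech -> speaker factor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, HIDDEN), nn.ReLU(),
                                 nn.Linear(HIDDEN, SPK_DIM))
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Decoder NN: speaker factor combined with noise -> restored feature vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(SPK_DIM + NOISE_DIM, HIDDEN), nn.ReLU(),
                                 nn.Linear(HIDDEN, FEAT_DIM))
    def forward(self, spk_factor, noise):
        return self.net(torch.cat([spk_factor, noise], dim=-1))  # concatenation variant

class Discriminator(nn.Module):
    """Discriminator NN: feature vector -> 2 "real/fake" units + N_SPK speaker units."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(FEAT_DIM, HIDDEN), nn.ReLU())
        self.out = nn.Linear(HIDDEN, 2 + N_SPK)
    def forward(self, x):
        h = self.hidden(x)          # last hidden layer (bottleneck used in the third embodiment)
        logits = self.out(h)
        return logits[..., :2], logits[..., 2:], h  # real/fake logits, speaker logits, bottleneck
```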
In the training part 100A, the input layer of the encoder NN (feature vectors of short speech recordings), part of the input layer of the decoder NN (noise), one of the two types of input to the discriminator (feature vectors of long speech recordings) and the output layer of the discriminator (outputting "real/fake" and speaker label) are given, and as a result, the hidden layer parameters of the three NNs (encoder, decoder, discriminator), the output layer of the encoder NN (speaker factor), and the output layer of the decoder NN (restored feature vector) are determined. For example, the numbers of layers in the encoder, decoder and discriminator can be 15, 15 and 16, respectively.
In the evaluation phase (the feature restoration part 100B), the encoder parameters, the decoder parameters, the input layer of the encoder NN (feature vector of short speech) and part of the input layer of the decoder NN (noise) are provided, and as a result, the output layer of the decoder NN (restored feature vector) is determined.
In the discriminator, the output layer consists of (2+n) neurons, where n is the number of speakers in the training data and 2 corresponds to "real/fake". In the training part 100A, the neurons can take a value "1" or "0" corresponding to "real/fake" and "true speaker label/wrong speaker labels".
In the training part 100A, the generator (encoder and decoder) and the discriminator iteratively train each other. In each iteration, the generator parameters are updated once while the discriminator parameters are fixed, then the discriminator parameters are updated once while the generator parameters are fixed. For this purpose, a wide range of optimization techniques can be applied, for example, the gradient descent method, known as back propagation, to minimize pre-defined cost functions such as cross entropy, mean square error and so on.
For example, the objective functions can be expressed as follows:
For the generator:
[Math. 1] (objective function for the generator; the formula is provided as an image in the original publication)
For the discriminator:
[Math. 2] [Math. 3] (objective functions for the discriminator; the formulas are provided as images in the original publication)
where the values (a) are the objectives for the generator and the values (b) are the objectives for the discriminator; A is the feature vector of the given short speech; B is the feature vector of the given long speech; element (c) is the noise modeling variations other than the speaker; G(A,z) is the feature vector generated by the generator; element (d) denotes the speaker classification results, i.e., speaker posteriors, with Nd as the total number of speakers in the training set; element (e) is for "real/fake" feature vector classification; element (f) is the i-th element of Dd; operators (g) and (h) are the expectation and mean square error operators, respectively; constants (i) are predetermined constants; and yd is the true speaker ID (ground truth).
The objective functions can also be expressed as follows.
For the generator:
[Math. 4] (alternative expression of the generator objective; the formula is provided as an image in the original publication)
For the discriminator
[Math. 5] (alternative expression of the discriminator objective; the formula is provided as an image in the original publication)
<Operation of robust feature compensation apparatus>
Next, the operation of the robust feature compensation apparatus 100 will be described with reference to the drawings.
The whole operation of the robust feature compensation apparatus 100 will be described by referring to Fig. 6. Fig. 6 contains the operations of the training part 100A and the feature restoration part 100B. However, this is only an example; the training and feature restoration operations can be executed continuously, or time intervals can be inserted between them.
In step A01 (training part), the generator & discriminator training unit 105 trains the generator and the discriminator together iteratively, based on short speech and long speech from the same speakers stored in the short speech data storage 101 and the long speech data storage 102, respectively. In detail, in each iteration, the discriminator parameters are first fixed and the generator parameters are updated using the objective functions; then the generator parameters are fixed and the discriminator parameters are updated using the objective functions. Note that the order of updating the generator parameters and the discriminator parameters within an iteration can be changed. For the training, a wide range of optimization techniques can be applied, for example, the gradient descent method, known as back propagation, to minimize a pre-defined cost function such as cross entropy, mean square error and so on. The objective function used in updating the generator is intended to make the generator able to generate a restored feature vector that the discriminator cannot discriminate, while the objective function used in updating the discriminator is intended to make the discriminator able to discriminate the generated feature vector.
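A minimal sketch of one such training iteration, continuing the illustrative PyTorch code above (E, Dec and D are the encoder, decoder and discriminator; opt_g and opt_d are assumed optimizers over the generator and discriminator parameters, respectively), is shown below.

```python
import torch

def train_step(E, Dec, D, opt_g, opt_d, A, B, spk_id, noise_dim=100):
    """One iteration of step A01: update the generator with the discriminator
    fixed, then the discriminator with the generator fixed."""
    z = torch.randn(A.size(0), noise_dim, device=A.device)   # noise from the noise storage 104
    fake = Dec(E(A), z)                                       # restored feature vector G(A, z)
    # 1) update generator parameters (opt_g covers encoder + decoder only)
    loss_g = generator_loss(D, B, fake, spk_id)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    # 2) update discriminator parameters (the generated vector is detached inside the loss)
    loss_d = discriminator_loss(D, B, fake, spk_id)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    return loss_g.item(), loss_d.item()
```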
In step A02 (feature restoration part), the generation unit 107 generates, in the output layer, a restored feature vector from a given short speech utterance using the generator parameters stored in the generator parameter storage 106.
Fig. 7 is a flowchart illustrating that the generator and the discriminator are together trained using short speech feature vectors and long speech feature vectors with noise. Fig. 7 shows the training part in Fig. 6.
First, in step B01, as the beginning of the training part, the feature extraction unit 103a reads short speech data with speaker labels from the short speech data storage 101.
In step B02, the feature extraction unit 103a further extracts feature vectors from the short speech.
In step B03, the feature extraction unit 103b reads long speech data with speaker labels from the long speech data storage 102.
In step B04, the feature extraction unit 103b further extracts feature vectors from the long speech.
In step B05, the generator & discriminator training unit 105 reads the noise data stored in the noise storage 104.
In step B06, the generator & discriminator training unit 105 trains a generator and a discriminator together using short speech feature vectors sent from the feature extraction unit 103a and long speech feature vectors sent from the feature extraction unit 103b with speaker labels and noise.
In step B07, as the result of the training, the generator & discriminator training unit 105 generates generator parameters and discriminator parameters, and stores the generator parameters in the generator parameter storage 106.
Note that the order of B01-B02 and B03-B04 can be switched, not limited to the form presented in Fig. 7.
Fig. 8 is a flowchart illustrating a feature restoration part 100B.
Firstly, in step C01, the feature extraction unit 103c reads short speech data provided through an external device (not shown in Fig. 1).
In step C02, the feature extraction unit 103c extracts feature vectors from the given short speech data.
In step C03, the generation unit 107 reads noise data stored in the noise storage 104.
In step C04, the generation unit 107 reads generator parameters from the generator parameter storage 106.
In step C06, the generation unit 107 restores the feature vector of the short speech and generates a robust feature vector.
Note here the order of C03 and C04 can be switched.
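A minimal sketch of this feature restoration phase, continuing the illustrative PyTorch modules above, might look as follows; the checkpoint file names are hypothetical.

```python
import torch

def restore_feature(short_feat, enc_path="encoder.pt", dec_path="decoder.pt", noise_dim=100):
    """Steps C01-C06: load trained generator parameters and generate the robust
    restored feature vector for one short-speech feature vector."""
    E, Dec = Encoder(), Decoder()
    E.load_state_dict(torch.load(enc_path))     # encoder part of the generator parameters
    Dec.load_state_dict(torch.load(dec_path))   # decoder part of the generator parameters
    E.eval(); Dec.eval()
    with torch.no_grad():
        z = torch.randn(1, noise_dim)            # noise read from the noise storage
        restored = Dec(E(short_feat.unsqueeze(0)), z)
    return restored.squeeze(0)                   # robust restored feature vector
```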
Effect of First Example Embodiment.
As explained above, the first example embodiment can improve the robustness of the feature vector of short speech. The reason is that the joint training of the generator and the discriminator improves the performance of each of them, and the relationship between long speech feature vectors and short speech feature vectors is learned in the training. As a result, such an NN can generate a feature vector for a short speech that is as robust as one obtained from long speech.
Second Example Embodiment.
A robust feature compensation apparatus of the second example embodiment can provide a robust feature vector for a short speech segment, from the raw feature vector of the short speech, using an encoder. Thus, in this example embodiment, the encoder - part of the generator of a GAN trained with short and long speech - is capable of producing a speaker feature vector that is robust for short speech.
<Configuration of robust feature compensation apparatus>
In the second example embodiment of the present invention, a robust feature compensation apparatus for speaker feature extraction using a generator and a discriminator of GAN will be described.
Fig. 9 illustrates a block diagram of a robust feature compensation apparatus 200 of the second example embodiment. The robust feature compensation apparatus 200 includes a training part 200A and a speaker feature extraction part 200B.
The training part 200A includes a short speech data storage 201, a long speech data storage 202, feature extraction units 203a and 203b, a noise storage 204, a generator & discriminator training unit 205, and an encoder parameter storage 206. The speaker feature extraction part 200B includes a feature extraction unit 203c, an encoding unit 207 as generation means, and a generated feature storage 208. The feature extraction units 203a, 203b and 203c have the same function.
The short speech data storage 201 stores short speech recordings with speaker labels, as shown in Fig 2.
The long speech data storage 202 stores long speech recordings with speaker labels, as shown in Fig 3. The long speech data storage 202 contains at least one long speech recording of each speaker who has short speech recordings in the short speech data storage 201.
The noise storage 204 stores a random vector representing noise.
The encoder parameter storage 206 stores encoder parameters, which are part of the result produced by the generator & discriminator training unit 205. The generator (not shown in Fig. 9) consists of an encoder and a decoder, the same as that in the first example embodiment, as understood from Fig. 4.
The feature extraction unit 203a extracts feature vectors from the short speech in the short speech data storage 201. The feature extraction unit 203b extracts feature vectors from the long speech in the long speech data storage 202. The feature vectors are individually-measurable properties of observations, for example, an i-vector - a fixed-dimensional feature vector extracted from acoustic features such as MFCCs.
The generator & discriminator training unit 205 receives the feature vectors of short speech from the feature extraction unit 203a, the feature vectors of long speech from the feature extraction unit 203b, and noise from the noise storage 204. The generator & discriminator training unit 205 trains the generator and the discriminator (not shown in Fig. 9) iteratively to determine "real" (the feature vector is extracted from a long speech) or "fake" (the feature vector is generated based on a feature vector from a short speech), and the speaker label to which the feature vectors belong. The details of the training are given in the first example embodiment. After the training, the generator & discriminator training unit 205 outputs generator parameters and discriminator parameters, and stores the encoder parameters in the encoder parameter storage 206.
In the speaker feature extraction part 200B, the feature extraction unit 203c extracts a feature vector from a short speech. Together with the feature vector, the encoding unit 207 receives the noise stored in the noise storage 204 and the encoder parameters stored in the encoder parameter storage 206. The encoding unit 207 encodes a robust speaker feature vector.
Fig. 10 shows a concept of the architecture of the generator and the discriminator of the second example embodiment. The generator has two NNs - an encoder NN and a decoder NN - and the discriminator has one NN. Each NN includes three types of layers: input, hidden and output. The hidden layers can be plural. There are a linear transformation and/or an activation (transfer) function at least between the input layer and the hidden layer(s), and between the hidden layer(s) and the output layer. The input layer of the encoder NN is the feature vector of the short speech. The output layer of the encoder NN is the speaker factor. The input layer of the decoder is the addition or concatenation of the noise and the speaker factor - the output layer of the encoder NN. The output layer of the decoder is the restored feature vector. For the discriminator, the input layer is the feature vector of long speech or the restored feature vector - the output of the decoder NN. The output of the discriminator is "real/fake" and the speaker label.
The training part 200A of the second example embodiment is same as that in the first example embodiment as mentioned.
In the evaluation part, the encoder parameters and the input layer of the encoder NN (feature vector of short speech) are provided, and as a result, the output layer of the encoder NN (speaker factor) is obtained.
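A minimal sketch of this encoder-only evaluation, continuing the illustrative PyTorch code of the first example embodiment (the checkpoint file name is hypothetical), might look as follows.

```python
import torch

def extract_speaker_factor(short_feat, enc_path="encoder.pt"):
    """Only the encoder part of the trained generator is applied; its output
    layer (the speaker factor) is taken as the robust speaker feature vector."""
    E = Encoder()
    E.load_state_dict(torch.load(enc_path))   # encoder parameters from the storage 206
    E.eval()
    with torch.no_grad():
        spk_factor = E(short_feat.unsqueeze(0))
    return spk_factor.squeeze(0)
```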
<Operation of robust feature compensation apparatus>
Next, the operation of the robust feature compensation apparatus 200 will be described with reference to drawings.
The whole operation of the robust feature compensation apparatus 200 will be described by referring to Fig. 11. Fig. 11 contains the operations of the training part 200A and the speaker feature extraction part 200B. However, this is only an example; the training and speaker feature extraction operations can be executed continuously, or time intervals can be inserted between them.
In step D01 (training part), the generator & discriminator training unit 205 trains the generator and the discriminator together iteratively, based on short speech and long speech from the same speakers stored in the short speech data storage 201 and the long speech data storage 202, respectively. In detail, in each iteration, the discriminator parameters are first fixed and the generator parameters are updated using the objective functions; then the generator parameters are fixed and the discriminator parameters are updated using the objective functions. Note that the order of updating the generator parameters and the discriminator parameters within an iteration can be changed. For the training, a wide range of optimization techniques can be applied, for example, the gradient descent method, known as back propagation, to minimize a pre-defined cost function such as cross entropy, mean square error and so on. The objective function used in updating the generator is intended to make the generator able to generate a restored feature vector that the discriminator cannot discriminate, while the objective function used in updating the discriminator is intended to make the discriminator able to discriminate the generated feature vector.
In step D02 (speaker feature extraction part), the encoding unit 207 encodes a robust speaker feature vector from a given short speech utterance, in the output layer of the encoder, using the encoder parameters stored in the encoder parameter storage 206.
Fig. 12 is a flowchart illustrating that the generator and the discriminator are together trained using the short speech feature vectors and long speech feature vectors with noise. Fig. 12 shows the training part in Fig. 11.
First, in step E01, as the beginning of the training part, the feature extraction unit 203a reads short speech data with speaker labels from the short speech data storage 201.
In step E02, the feature extraction unit 203a further extracts feature vectors from the short speech.
In step E03, the feature extraction unit 203b reads long speech data with speaker labels from long speech data storage 202.
In step E04, the feature extraction unit 203b further extracts feature vectors from the long speech.
In step E05, the generator & discriminator training unit 205 reads noise data stored in the noise storage 204.
In step E06, the generator & discriminator training unit 205 trains the generator and the discriminator together using short speech feature vectors sent from the feature extraction unit 203a and long speech feature vectors sent from the feature extraction unit 203b with speaker labels and noise.
In step E07, as the result of the training, the generator & discriminator training unit 205 generates generator parameters and discriminator parameters, and stores the parameters of the encoder - part of the generator - in the encoder parameter storage 206.
Note that the order of E01-E02 and E03-E04 can be switched, not limited to the form presented in Fig. 12.
Fig. 13 is a flowchart illustrating the speaker feature extraction part 200B.
Firstly, in step F01, the feature extraction unit 203c reads short speech data provided through an external device (not shown in Fig. 9).
In step F02, the feature extraction unit 203c extracts feature vectors from the given short speech data.
In step F03, the encoding unit 207 reads noise data stored in the noise storage 204.
In step F04, the encoding unit 207 reads encoder parameters from the encoder parameter storage 206.
In step F06, the encoding unit 207 encodes the feature vector of the short speech and extracts a robust speaker feature vector.
Note here the order of F03 and F04 can be switched.
Effect of Second Example Embodiment.
As explained above, the second example embodiment can improve the robustness of the feature vector of short speech. The robust feature restoration is done as in the first example embodiment. With the same training structure, the apparatus can at the same time produce a robust speaker feature vector in the output layer of the encoder. Using such speaker feature vectors is more direct for speaker verification applications.
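For illustration only (this usage is not part of the patent text), such speaker feature vectors could be compared directly with a cosine score and an application-chosen threshold, as in the following sketch.

```python
import torch.nn.functional as F

def verify(enroll_vec, test_vec, threshold=0.5):
    """Cosine scoring between an enrollment and a test speaker feature vector."""
    score = F.cosine_similarity(enroll_vec.unsqueeze(0), test_vec.unsqueeze(0)).item()
    return score >= threshold, score
```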
<Third Example Embodiment>
A robust feature compensation apparatus of the third example embodiment can provide a robust feature vector for a short speech segment, from the raw feature vector of the short speech, using a generator and a discriminator, with a bottleneck feature vector produced in the last hidden layer of the discriminator. Thus, in this example embodiment, the generator and the discriminator of a GAN trained with short and long speech are capable of producing a bottleneck feature that is robust for short speech.
<Configuration of robust feature compensation apparatus>
In the third example embodiment of the present invention, a robust feature compensation apparatus for bottleneck feature extraction using a generator and a discriminator of a GAN will be described.
Fig. 14 illustrates a block diagram of a robust feature compensation apparatus 300 of the third example embodiment. The robust feature compensation apparatus 300 includes a training part 300A and a bottleneck feature extraction part 300B.
The training part 300A includes a short speech data storage 301, a long speech data storage 302, feature extraction units 303a and 303b, a noise storage 304, a generator & discriminator training unit 305, a generator parameter storage 306, and a discriminator parameter storage 307. The bottleneck feature extraction part 300B includes a feature extraction unit 303c, a generation unit 308 and a bottleneck feature storage 309. The feature extraction units 303a, 303b and 303c have the same function.
The short speech data storage 301 stores short speech recordings with speaker labels, as shown in Fig 2.
The long speech data storage 302 stores long speech recordings with speaker labels, as shown in Fig 3. The long speech data storage 302 contains at least one long speech recording of each speaker who has short speech recordings in the short speech data storage 301.
The noise storage 304 stores a random vector representing noise.
The generator parameter storage 306 stores generator parameters. The generator (not shown in Fig. 14) consists of an encoder and a decoder, the same as that in the first example embodiment, as understood from Fig. 4. So the parameters of both the encoder and the decoder are stored in the generator parameter storage 306.
The discriminator parameter storage 307 stores the parameters of the discriminator (not shown in Fig. 14).
The feature extraction unit 303a extracts feature vectors from the short speech in the short speech data storage 301. The feature extraction unit 303b extracts feature vectors from the long speech in the long speech data storage 302. The feature vectors are individually-measurable properties of observations, for example, an i-vector - a fixed-dimensional feature vector extracted from acoustic features such as MFCCs.
The generator & discriminator training unit 305 receives the feature vectors of short speech from the feature extraction unit 303a, the feature vectors of long speech from the feature extraction unit 303b, and the noise from the noise storage 304. The generator & discriminator training unit 305 trains the generator and the discriminator iteratively to determine "real" (the feature vector is extracted from a long speech) or "fake" (the feature vector is generated based on a feature vector from a short speech), and the speaker label to which the feature vectors belong. The details of the training are given in the first example embodiment. After the training, the generator & discriminator training unit 305 outputs generator parameters and discriminator parameters, and stores them in the generator parameter storage 306 and the discriminator parameter storage 307, respectively.
In the bottleneck feature extraction part 300B, the feature extraction unit 303c extracts feature vectors from a short speech. Together with the feature vector, the generation unit 308 receives the noise stored in the noise storage 304 and generator parameters stored in the generator parameter storage 306. The generation unit 308 generates one or more robust bottleneck features representing the speaker factor.
Fig. 15 shows a concept of the architecture of the generator and the discriminator of the third example embodiment. The generator has two NNs - an encoder NN and a decoder NN - and the discriminator has one NN. Each NN includes three types of layers: input, hidden and output. The hidden layers can be plural. There are a linear transformation and/or an activation (transfer) function at least between the input layer and the hidden layer(s), and between the hidden layer(s) and the output layer. The input layer of the encoder NN is the feature vector of the short speech. The output layer of the encoder NN is the speaker factor. The input layer of the decoder is the addition or concatenation of the noise and the speaker factor - the output layer of the encoder NN. The output layer of the decoder is the restored feature vector. For the discriminator, the input layer is the feature vector of long speech or the restored feature vector - the output of the decoder NN. The output of the discriminator is "real/fake" and the speaker label in the training; in the evaluation part, the original output layer is discarded and the last layer before it is used as the output layer.
The training part of the third example embodiment is same as that in the first example embodiment.
In the evaluation part, the encoder parameters, decoder parameters, discriminator parameters, input layer of the encoder NN (feature vector of short speech), part of the input layer of the decoder NN (noise) are provided, and as a result, the output layer of discriminator NN (bottleneck feature vector) is obtained.
<Operation of robust feature compensation apparatus>
Next, the operation of robust feature compensation apparatus 300 will be described with reference to drawings.
The whole operation of the robust feature compensation apparatus 300 will be described by referring to Fig. 16. Fig. 16 contains the operations of the training part 300A and the bottleneck feature extraction part 300B. However, this is only an example; the training and bottleneck feature extraction operations can be executed continuously, or time intervals can be inserted between them.
In step G01 (training part), the generator & discriminator training unit 305 trains the generator and the discriminator together iteratively, based on short speech and long speech from the same speakers stored in the short speech data storage 301 and the long speech data storage 302, respectively. In detail, in each iteration, the discriminator parameters are first fixed and the generator parameters are updated using the objective functions; then the generator parameters are fixed and the discriminator parameters are updated using the objective functions. Note that the order of updating the generator parameters and the discriminator parameters within an iteration can be changed. For the training, a wide range of optimization techniques can be applied, for example, the gradient descent method, known as back propagation, to minimize a pre-defined cost function such as cross entropy, mean square error and so on. The objective function used in updating the generator is intended to make the generator able to generate a restored feature vector that the discriminator cannot discriminate, while the objective function used in updating the discriminator is intended to make the discriminator able to discriminate the generated feature vector.
In step G02 (bottleneck feature extraction part), the generation unit 308 generates the restored feature vector from a given short speech utterance in the output layer using the generator parameters stored in the generator parameter storage 306, and inputs it into the discriminator. The generation unit 308 extracts the last hidden layer as the robust bottleneck feature.
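A minimal sketch of this bottleneck feature extraction, continuing the illustrative PyTorch code of the first example embodiment, might look as follows; it relies on the illustrative Discriminator returning its last hidden layer.

```python
import torch

def extract_bottleneck(short_feat, E, Dec, D, noise_dim=100):
    """Step G02: generate the restored feature vector with the trained generator,
    feed it to the trained discriminator, and keep the last hidden layer."""
    E.eval(); Dec.eval(); D.eval()
    with torch.no_grad():
        z = torch.randn(1, noise_dim)
        restored = Dec(E(short_feat.unsqueeze(0)), z)
        _, _, bottleneck = D(restored)          # last hidden layer of the discriminator NN
    return bottleneck.squeeze(0)                # robust bottleneck feature
```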
Fig. 17 is a flowchart illustrating that the generator and the discriminator are together trained using the short speech feature vectors and long speech feature vectors with noise. Fig. 17 shows the training part in Fig. 16.
First, in step H01, as the beginning of the training part, the feature extraction unit 303a reads short speech data with speaker labels from the short speech data storage 301.
In step H02, the feature extraction unit 303a further extracts feature vectors from the short speech data.
In step H03, the feature extraction unit 303b reads long speech data with speaker labels from the long speech data storage 302.
In step H04, the feature extraction unit 303b further extracts feature vectors from the long speech data.
In step H05, the generator & discriminator training unit 305 reads noise data stored in the noise storage 304.
In step H06, the generator & discriminator training unit 305 trains the generator and the discriminator together using short speech feature vectors sent from the feature extraction unit 303a and long speech feature vectors sent from the feature extraction unit 303b with speaker labels and noise.
In step H07, as the result of the training, the generator & discriminator training unit 305 generates generator parameters and discriminator parameters, and stores them in the generator parameter storage 306 and the discriminator parameter storage 307, respectively.
Note that the order of H01-H02 and H03-H04 can be switched, not limited to the form presented in Fig. 17.
Fig. 18 is a flowchart illustrating the bottleneck feature extraction part 300B.
Firstly, in step I01, the feature extraction unit 303c reads short speech data provided through an external device (not shown in Fig. 14).
In step I02, the feature extraction unit 303c extracts feature vectors from the given short speech data.
In step I03, generation unit 308 reads noise data stored in noise storage 304.
In step I04, the generation unit 308 reads generator parameters from the generator parameter storage 306.
In step I05, the generation unit 308 reads discriminator parameters from the discriminator parameter storage 307.
Note here the order of I03 - I05 can be switched.
In step I06, the generation unit 308 extracts a bottleneck feature produced in the last hidden layer of the discriminator NN.
Effect of Third Example Embodiment.
As explained above, the third example embodiment can improve the robustness of the feature vector of short speech. As a result, such an NN can generate a feature vector for a short speech that is as robust as one obtained from long speech. The robust feature restoration is done as in the first example embodiment. With the same training structure, the apparatus can at the same time produce a robust bottleneck feature in the output layer of the discriminator (the original output layer - "real/fake" and speaker labels - is discarded after the training part).
Note that in all the example embodiments, the speaker labels in the output layer of the discriminator in training can be replaced by emotion labels, language labels, etc., so that the feature compensation can be used for emotion recognition, language recognition and so on. Correspondingly, the output layer of the encoder can be changed to represent emotion feature vectors or language feature vectors.
Fourth Example Embodiment.
A robust feature compensation apparatus of the fourth example embodiment is shown in Fig. 19. The speech feature compensation apparatus 500, based on a GAN, includes: a generator & discriminator training unit 501 that trains a GAN model to generate generator and discriminator parameters, based on at least one short speech feature vector and at least one long speech feature vector from the same speaker; and a robust feature compensation unit 502 that compensates the feature vector of short speech, based on the short speech feature vector and the generator and discriminator parameters.
The speech feature compensation apparatus 500 can provide robust feature compensation for short speech. The reason is that the generator and the discriminator are jointly trained, which improves the performance of each of them iteratively, using feature vectors of short speech and long speech, so as to learn the relation between the feature vectors of short speech and long speech.
<Configuration of information processing apparatus>
Fig. 20 illustrates, by way of example, a configuration of an information processing apparatus 900 (computer) which can implement a robust feature compensation apparatus relevant to an example embodiment of the present invention. In other words, Fig. 20 illustrates a configuration of a computer (information processing apparatus) capable of implementing the devices in Figs.1, 9, 14 and 19 representing a hardware environment where the individual functions in the above-described example embodiments can be implemented.
The information processing apparatus 900 illustrated in Fig. 20 includes the following components:
- CPU (Central Processing Unit) 901;
- ROM (Read Only Memory) 902;
- RAM (Random Access Memory) 903;
- Hard disk 904 (storage device);
- Communication interface to an external device 905;
- Reader/writer 908 capable of reading and writing data stored in a storage medium 907 such as CD-ROM (Compact Disc Read Only Memory); and
- Input/output interface 909.
The information processing apparatus 900 is a general computer where these components are connected via a bus 906 (communication line).
The present invention, explained with the above-described example embodiments as examples, is accomplished by providing the information processing apparatus 900 illustrated in Fig. 20 with a computer program capable of implementing the functions illustrated in the block diagrams (Figs. 1, 9, 14 and 19) or the flowcharts (Figs. 6-8, Figs. 11-13 and Figs. 16-18) referenced in the explanation of these example embodiments, and then by reading the computer program into the CPU 901 of such hardware, interpreting it, and executing it. The computer program provided to the apparatus can be stored in a volatile readable and writable memory (RAM 903) or in a non-volatile storage device such as the hard disk 904.
In addition, in the case described above, general procedures can be used to provide the computer program to such hardware. These procedures include, for example, installing the computer program into the apparatus via any of various storage media 907 such as a CD-ROM, or downloading it from an external source via communication lines such as the Internet. In these cases, the present invention can be seen as being composed of the codes forming such a computer program, or as being composed of the storage medium 907 storing the codes.
As a final point, it should be clear that the processes, techniques and methodology described and illustrated here are not limited or related to a particular apparatus. They can be implemented using a combination of components. Also, various types of general purpose devices may be used in accordance with the instructions herein. The present invention has also been described using a particular set of examples. However, these are merely illustrative and not restrictive. For example, the described software may be implemented in a wide variety of languages such as C/C++, Java, MATLAB, Python, etc. Moreover, other implementations of the inventive technology will be apparent to those skilled in the art.
Fig. 21 is a block diagram showing main parts of a speech feature compensation apparatus according to the present invention. As shown in Fig. 21, the speech feature compensation apparatus 10 includes training means 11 (realized by the generator & discriminator training unit 105, 205, 305 in the example embodiments) for training a generator 21 and a discriminator 22 of GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN, feature extraction means 12 (realized by the feature extraction unit 103c, 203c, 303c in the example embodiments) for extracting a feature vector from an input short speech, and generation means 13 (realized by the generation unit 107, 308 or the encoding unit 207 in the example embodiments) for generating a robust feature vector based on the extracted feature vector using the trained parameters.
As shown in Fig. 22, the generator 21 may include an encoder 211 inputting the first feature vector and outputting a feature vector, and a decoder 212 outputting a restored feature vector, and may output the trained parameters with respect to at least the encoder; the generation means 13 may include an encoding unit 131 which generates the robust feature vector by encoding the feature vector of the input short speech using the trained parameters.
 
100, 200, 300 robust feature compensation apparatus
101, 201, 301 short speech data storage
102, 202, 302 long speech data storage
103a, 203a, 303a feature extraction unit
103b, 203b, 303b feature extraction unit
103c, 203c, 303c feature extraction unit
104, 204, 304 noise storage
105, 205, 305 generator & discriminator training unit
106 generator parameter storage
206 encoder parameter storage
306 generator parameter storage
107 generation unit
207 encoding unit
307 discriminator parameter storage
108, 208 generated feature storage
308 generation unit
309 bottleneck feature storage

Claims (9)

  1. A speech feature compensation apparatus comprising:
    training means for training a generator and a discriminator of GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN,
    feature extraction means for extracting a feature vector from an input short speech, and
    generation means for generating a robust feature vector based on the extracted feature vector using the trained parameters.
  2. The speech feature compensation apparatus according to claim 1,
    wherein the generation means generates a restored feature vector corresponding to the feature vector extracted from the input short speech.
  3. The speech feature compensation apparatus according to claim 1 or 2,
    wherein
    the generator includes an encoder inputting the first feature vector and outputting a feature vector and a decoder outputting a restored feature vector, and outputs the trained parameters with respect to at least the encoder, and
    the generation means includes an encoding unit which generates the robust feature vector by encoding the feature vector of the input short speech using the trained parameters.
  4. The speech feature compensation apparatus according to claim 1,
    wherein the generation means generates at least one bottleneck feature by the discriminator.
  5. The speech feature compensation apparatus according to any one of claims 1 to 4,
    wherein
    the discriminator based on a neural network inputs the second feature vector, and
    the training means trains the neural network so that a cost function is minimized, the cost function counting real/fake classification errors, speaker identification errors, and MSE (Mean Square Error) between the second feature vector and the generated feature vector of the long speech by the generator.
     
  6. A speech feature compensation method comprising:
    training a generator and a discriminator of GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN,
    extracting a feature vector from an input short speech, and
    generating a robust feature vector based on the extracted feature vector using the trained parameters.
  7. The speech feature compensation method according to claim 6,
    wherein a restored feature vector corresponding to the feature vector extracted from the input short speech is generated.
  8. A speech feature compensation program for causing a computer to execute:
    training a generator and a discriminator of GAN (Generative Adversarial Network) using a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, from the same speaker regarding the short speech, which is longer than the short speech segment, and outputting trained parameters of the GAN,
    extracting a feature vector from an input short speech, and
    generating a robust feature vector based on the extracted feature vector using the trained parameters.
  9. The speech feature compensation program according to claim 8,
    wherein a restored feature vector corresponding to the feature vector extracted from the input short speech is generated.
PCT/JP2018/008251 2018-03-05 2018-03-05 Speech feature compensation apparatus, method, and program WO2019171415A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2020539019A JP6897879B2 (en) 2018-03-05 2018-03-05 Voice feature compensator, method and program
PCT/JP2018/008251 WO2019171415A1 (en) 2018-03-05 2018-03-05 Speech feature compensation apparatus, method, and program
JP2021096366A JP7243760B2 (en) 2018-03-05 2021-06-09 Audio feature compensator, method and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2018/008251 WO2019171415A1 (en) 2018-03-05 2018-03-05 Speech feature compensation apparatus, method, and program

Publications (1)

Publication Number Publication Date
WO2019171415A1 true WO2019171415A1 (en) 2019-09-12

Family

ID=67845548

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/008251 WO2019171415A1 (en) 2018-03-05 2018-03-05 Speech feature compensation apparatus, method, and program

Country Status (2)

Country Link
JP (2) JP6897879B2 (en)
WO (1) WO2019171415A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111477247A (en) * 2020-04-01 2020-07-31 宁波大学 GAN-based voice countermeasure sample generation method
CN111785281A (en) * 2020-06-17 2020-10-16 国家计算机网络与信息安全管理中心 Voiceprint recognition method and system based on channel compensation
CN113488069A (en) * 2021-07-06 2021-10-08 浙江工业大学 Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
CN113555026A (en) * 2021-07-23 2021-10-26 平安科技(深圳)有限公司 Voice conversion method, device, electronic equipment and medium
WO2022007438A1 (en) * 2020-11-27 2022-01-13 平安科技(深圳)有限公司 Emotional voice data conversion method, apparatus, computer device, and storage medium
JP2022536189A (en) * 2020-04-28 2022-08-12 平安科技(深▲せん▼)有限公司 Method, Apparatus, Equipment and Storage Medium for Recognizing Voiceprint of Original Speech
CN116631406A (en) * 2023-07-21 2023-08-22 山东科技大学 Identity feature extraction method, equipment and storage medium based on acoustic feature generation

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6897879B2 (en) * 2018-03-05 2021-07-07 日本電気株式会社 Voice feature compensator, method and program
CN113314109B (en) * 2021-07-29 2021-11-02 南京烽火星空通信发展有限公司 Voice generation method based on cycle generation network
KR102498268B1 (en) * 2022-07-15 2023-02-09 국방과학연구소 Electronic apparatus for speaker recognition and operation method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160098987A1 (en) * 2014-10-02 2016-04-07 Microsoft Technology Licensing , LLC Neural network-based speech processing
US20160098993A1 (en) * 2014-10-03 2016-04-07 Nec Corporation Speech processing apparatus, speech processing method and computer-readable medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002023792A (en) * 2000-07-10 2002-01-25 Casio Comput Co Ltd Device and method for collating speech, and storage medium with speech collation processing program stored therein
US10395356B2 (en) * 2016-05-25 2019-08-27 Kla-Tencor Corp. Generating simulated images from input images for semiconductor applications
JP6897879B2 (en) * 2018-03-05 2021-07-07 日本電気株式会社 Voice feature compensator, method and program

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160098987A1 (en) * 2014-10-02 2016-04-07 Microsoft Technology Licensing , LLC Neural network-based speech processing
US20160098993A1 (en) * 2014-10-03 2016-04-07 Nec Corporation Speech processing apparatus, speech processing method and computer-readable medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PASCUAL, SANTIAGO ET AL.: "SEGAN: Speech Enhancement Generative Adversarial Network", ARXIV PREPRINT ARXIV:1703.09452, 9 June 2017 (2017-06-09), pages 3642 - 3646, XP055579756 *
YU , HONG ET AL.: "Adversarial Network Bottleneck Features for Noise Robust Speaker Verification", ARXIV PREPRINT ARXIV:1706.03397, 11 June 2017 (2017-06-11), XP080769015 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111477247A (en) * 2020-04-01 2020-07-31 宁波大学 GAN-based voice countermeasure sample generation method
CN111477247B (en) * 2020-04-01 2023-08-11 宁波大学 Speech countermeasure sample generation method based on GAN
JP2022536189A (en) * 2020-04-28 2022-08-12 平安科技(深▲せん▼)有限公司 Method, Apparatus, Equipment and Storage Medium for Recognizing Voiceprint of Original Speech
JP7242912B2 (en) 2020-04-28 2023-03-20 平安科技(深▲せん▼)有限公司 Method, Apparatus, Equipment and Storage Medium for Recognizing Voiceprint of Original Speech
CN111785281A (en) * 2020-06-17 2020-10-16 国家计算机网络与信息安全管理中心 Voiceprint recognition method and system based on channel compensation
WO2022007438A1 (en) * 2020-11-27 2022-01-13 平安科技(深圳)有限公司 Emotional voice data conversion method, apparatus, computer device, and storage medium
CN113488069A (en) * 2021-07-06 2021-10-08 浙江工业大学 Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
CN113488069B (en) * 2021-07-06 2024-05-24 浙江工业大学 Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network
CN113555026A (en) * 2021-07-23 2021-10-26 平安科技(深圳)有限公司 Voice conversion method, device, electronic equipment and medium
CN113555026B (en) * 2021-07-23 2024-04-19 平安科技(深圳)有限公司 Voice conversion method, device, electronic equipment and medium
CN116631406A (en) * 2023-07-21 2023-08-22 山东科技大学 Identity feature extraction method, equipment and storage medium based on acoustic feature generation
CN116631406B (en) * 2023-07-21 2023-10-13 山东科技大学 Identity feature extraction method, equipment and storage medium based on acoustic feature generation

Also Published As

Publication number Publication date
JP7243760B2 (en) 2023-03-22
JP2021510846A (en) 2021-04-30
JP2021140188A (en) 2021-09-16
JP6897879B2 (en) 2021-07-07

Similar Documents

Publication Publication Date Title
WO2019171415A1 (en) Speech feature compensation apparatus, method, and program
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
CN112071330B (en) Audio data processing method and device and computer readable storage medium
Kekre et al. Speaker identification by using vector quantization
CN107112006A (en) Speech processes based on neutral net
US12046226B2 (en) Text-to-speech synthesis method and system, a method of training a text-to-speech synthesis system, and a method of calculating an expressivity score
Siuzdak et al. WavThruVec: Latent speech representation as intermediate features for neural speech synthesis
US7505950B2 (en) Soft alignment based on a probability of time alignment
Kekre et al. Performance comparison of speaker recognition using vector quantization by LBG and KFCG
Soboleva et al. Replacing human audio with synthetic audio for on-device unspoken punctuation prediction
Saleem et al. NSE-CATNet: deep neural speech enhancement using convolutional attention transformer network
CN113963715A (en) Voice signal separation method and device, electronic equipment and storage medium
Devi et al. A novel approach for speech feature extraction by cubic-log compression in MFCC
Mengistu Automatic text independent amharic language speaker recognition in noisy environment using hybrid approaches of LPCC, MFCC and GFCC
US20240119922A1 (en) Text to speech synthesis without using parallel text-audio data
Anand et al. Advancing Accessibility: Voice Cloning and Speech Synthesis for Individuals with Speech Disorders
Nguyen et al. Resident identification in smart home by voice biometrics
CN113270090B (en) Combined model training method and equipment based on ASR model and TTS model
Nijhawan et al. Real time speaker recognition system for hindi words
Park et al. Perturbation AUTOVC: Voice Conversion From Perturbation and Autoencoder Loss
Sathiarekha et al. A survey on the evolution of various voice conversion techniques
Yang et al. Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens
Kekre et al. Performance comparison of automatic speaker recognition using vector quantization by LBG KFCG and KMCG
Gunawan et al. Development of Language Identification using Line Spectral Frequencies and Learning Vector Quantization Networks
Pol et al. USE OF MEL FREQUENCY CEPSTRAL COEFFICIENTS FOR THE IMPLEMENTATION OF A SPEAKER RECOGNITION SYSTEM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18908539

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020539019

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18908539

Country of ref document: EP

Kind code of ref document: A1