WO2009122780A1 - Adaptive speaker selection device, adaptive speaker selection method, and recording medium - Google Patents


Info

Publication number
WO2009122780A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
adaptive
learning
speakers
similarity
Prior art date
Application number
PCT/JP2009/052379
Other languages
French (fr)
Japanese (ja)
Inventor
真宏 谷
江森 正
祥史 大西
孝文 越仲
Original Assignee
日本電気株式会社 (NEC Corporation)
Priority date
Filing date
Publication date
Application filed by NEC Corporation (日本電気株式会社)
Priority to JP2010505436A priority Critical patent/JPWO2009122780A1/en
Publication of WO2009122780A1 publication Critical patent/WO2009122780A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065Adaptation
    • G10L15/07Adaptation to the speaker

Definitions

  • the present invention relates to a technique for selecting an adaptive speaker from learning speakers in order to create an acoustic model adapted to an evaluation speaker.
  • Voice recognition systems are used in various fields.
  • To improve speech recognition accuracy, speaker adaptation technology, which adapts the acoustic model used in a speech recognition system to the user, is known, and various methods have been proposed for creating a speaker adaptation model (an acoustic model adapted to the user).
  • Patent Document 1 and Non-Patent Document 1 disclose a technique for creating a speaker adaptation model using sufficient statistics.
  • FIG. 8 shows a schematic example of a speaker adaptive model creation apparatus that implements this method.
  • The storage unit 10 includes a sufficient statistics storage unit 12 and a speaker model storage unit 14, and the data processing unit 30 includes a feature amount calculation unit 32, a similarity calculation unit 34, a speaker selection unit 36, and an adaptive model creation unit 38.
  • The speaker adaptive model creation device 1 creates an acoustic model for each speaker using a database composed of sample speech data of a plurality of speakers, selects several of these acoustic models, and adapts them to the uttering speaker (corresponding to the user described above) to create an acoustic model for that speaker.
  • In the following description, a speaker of sample speech data is referred to as a "learning speaker", and a probability model created for each learning speaker that represents the speaker's acoustic characteristics is referred to as a "speaker model".
  • the speaker to be adapted is called “evaluation speaker”, and the acoustic model adapted to the evaluation speaker is called “adaptive model”.
  • the speaker of the speaker model selected to create the adaptive model is called “adaptive speaker”.
  • The speaker adaptive model creation apparatus 1 creates an adaptive model through the following steps. 1. Creation of sufficient statistics and speaker models using the database
  • The speaker model is a probability model created for each learning speaker that represents the speaker's acoustic characteristics. Here, it is expressed by a 1-state, 64-mixture Gaussian mixture model (GMM: Gaussian Mixture Model) without distinguishing phonemes.
  • GMM is a probability model of observation data expressed by a mixed normal distribution.
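As an illustrative sketch of how a GMM speaker model is later used as a similarity measure (the likelihood of the evaluation speaker's features under each learning speaker's model), the following minimal example replaces the 1-state, 64-mixture models with toy 2-component, 2-dimensional diagonal-covariance GMMs; all parameter values here are invented for illustration.

```python
import math

def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        # per-component log densities, combined with log-sum-exp for stability
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
            comp_logs.append(ll)
        m = max(comp_logs)
        total += m + math.log(sum(math.exp(c - m) for c in comp_logs))
    return total

# Toy speaker models: (mixture weights, component means, component variances)
speaker_models = {
    "spk_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "spk_b": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
eval_frames = [[0.1, -0.2], [0.9, 1.1]]  # features close to spk_a's components

sims = {s: gmm_log_likelihood(eval_frames, *m) for s, m in speaker_models.items()}
best = max(sims, key=sims.get)
```

Ranking the learning speakers by this score and keeping the top N corresponds to the likelihood-based selection described below for Patent Document 1 and Non-Patent Document 1.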
  • Sufficient statistics are created for each learning speaker and expressed by a hidden Markov model (HMM: Hidden Markov Model).
  • "Sufficient statistics" means statistics sufficient to build an acoustic model from the database; here, the means, variances, and EM counts of the HMM are used.
  • The "EM count" is the frequency of the probability of transition from state i to the normal distribution of state j in the EM algorithm generally used when training an HMM.
  • The sufficient statistics are calculated by training once from a speaker-independent model with the EM algorithm using the learning speaker's speech data.
  • The sufficient statistics storage unit 12 and the speaker model storage unit 14 store, respectively, the sufficient statistics and the speaker model for each learning speaker calculated as described above. 2. Input of the evaluation speaker's voice data
  • the voice data of the evaluation speaker is input by the input means 20.
  • the input unit 20 receives the voice data of the evaluation speaker from a voice input device such as a microphone. 3. Selection of adaptive speakers and creation of adaptive models
  • the data processing means 30 of the speaker adaptive model creation device 1 is responsible for these processes.
  • the feature amount calculation unit 32 receives the voice data of the evaluation speaker input by the input unit 20, calculates the feature amount necessary for speech recognition, and outputs it to the similarity calculation unit 34.
  • the similarity calculation unit 34 reads the speaker model of each learning speaker stored in the speaker model storage unit 14, and the feature amount of the evaluation speaker received from the feature amount calculation unit 32 for each of these speaker models. And the combination of the similarity and the learning speaker corresponding to the similarity is output to the speaker selection unit 36.
  • the likelihood obtained by inputting the feature amount extracted from the speech of the evaluation speaker into the speaker model of the learning speaker is used as the similarity.
  • The speaker selection unit 36 selects, from the pairs of similarity and learning speaker output by the similarity calculation unit 34, the N learning speakers with the highest similarity, that is, the highest likelihood, as adaptive speakers.
  • An identifier (ID number or the like) indicating the adapted speaker is output to the adaptive model creation unit 38.
  • the number N of adaptive speakers is a constant determined empirically.
  • The adaptive model creation unit 38 receives the identifiers of the learning speakers selected as adaptive speakers from the speaker selection unit 36, reads the sufficient statistics of the learning speakers indicated by these identifiers from the sufficient statistics storage unit 12, and then creates and outputs an adaptive model using the read sufficient statistics, which is used for speech recognition of the evaluation speaker.
  • The process of creating the adaptive model using the sufficient statistics read from the sufficient statistics storage unit 12 is a statistical processing calculation represented by equations (1) to (3), where N mix is the number of mixture distributions, N state is the number of states, and N sel is the number of selected adaptive speakers.
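Equations (1) to (3) themselves do not survive in this extract. A common way to combine per-speaker HMM sufficient statistics (EM counts, means, variances) into an adapted Gaussian is the count-weighted average sketched below; this is a plausible reading of the statistical processing described above, not the patent's exact formulas.

```python
def merge_sufficient_stats(stats):
    """
    stats: list of (EM count, mean, variance) triples, one per selected
    adaptive speaker, for a single Gaussian of a single HMM state.
    Returns the count-weighted adapted mean and variance, using the
    identity var = E[x^2] - E[x]^2.
    """
    total = sum(c for c, _, _ in stats)
    mean = sum(c * m for c, m, _ in stats) / total
    # var_j + mean_j**2 recovers speaker j's second moment E[x^2]
    second = sum(c * (v + m * m) for c, m, v in stats) / total
    return mean, second - mean * mean

# Two adaptive speakers' statistics for one Gaussian: (count, mean, variance)
adapted_mean, adapted_var = merge_sufficient_stats([(100, 0.0, 1.0),
                                                    (300, 2.0, 1.0)])
# -> (1.5, 1.75): the speaker with 3x the data dominates the adapted mean,
#    and the spread between the two speakers' means inflates the variance.
```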
  • Non-Patent Document 2, for example, determines the number of adaptive speakers based on the distance between the evaluation speaker and the learning speakers in an acoustic feature space.
  • As the feature amount of speech data, for example, the mel-frequency cepstral coefficients (MFCC) described in Non-Patent Document 3 and their rates of change are known.
  • Patent Document 1 and Non-Patent Document 1 use the likelihood of the evaluation speaker's speech as the similarity and select learning speakers with high similarity as adaptive speakers. That is, only the similarity between the learning speakers' speech and the evaluation speaker's speech is used as the selection criterion for adaptive speakers. However, when not only the acoustic features but also the phonological features representing the utterance content are similar among the voices of the selected adaptive speakers, there is little variation in the adaptive speakers' utterance content, so the appearance frequency of the phonemes used for training becomes biased, which may degrade the accuracy of the adaptive model.
  • the present invention has been made in view of the above circumstances, and provides an adaptive speaker selection technique for avoiding deterioration in accuracy of an adaptive model.
  • One aspect of the present invention is an adaptive speaker selection method for selecting a plurality of adaptive speakers from a set of learning speakers in order to create an acoustic model adapted to an evaluation speaker.
  • A plurality of learning speakers whose speech similarity to the evaluation speaker is as high as possible and whose mutual speech similarity is as low as possible are selected as the adaptive speakers.
  • According to the adaptive speaker selection technique of the present invention, when an acoustic model adapted to the evaluation speaker is created using the acoustic models of the selected adaptive speakers, deterioration in the accuracy of the created acoustic model can be suppressed.
  • FIG. 1 is a diagram showing a schematic example of an adaptive speaker selection device for explaining the technique of the present invention. FIG. 2 is a diagram showing a configuration example of the similarity calculation unit in the adaptive speaker selection device shown in FIG. 1. FIG. 3 is a flowchart showing the flow of processing by the adaptive speaker selection device shown in FIG. 1. FIG. 4 is a flowchart showing the flow of processing of the similarity calculation unit of the example shown in FIG. 2. FIG. 5 is a flowchart showing an example of the flow of processing by the adaptive speaker selection unit in the adaptive speaker selection device shown in FIG. 1. FIG. 6 is a diagram showing the adaptive speaker model generation apparatus according to the embodiment of the present invention.
  • Each element described as a functional block performing various processes can be configured, in terms of hardware, by a processor, a memory, and other circuits, and in terms of software is realized by a program recorded in memory or loaded into it. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware alone, software alone, or a combination thereof, and are not limited to any one of these. For clarity, only the elements necessary for explaining the technique of the present invention are shown in the drawings.
  • FIG. 1 is an example of a schematic diagram of an adaptive speaker selection device 100 based on the technique according to the present invention.
  • the adaptive speaker selection device 100 includes a speaker model storage unit 112, a learning speaker similarity storage unit 114, a feature amount calculation unit 120, a similarity calculation unit 130, and an adaptive speaker selection unit 140.
  • the speaker model storage unit 112 stores a speaker model created for each learning speaker in association with the learning speaker.
  • a unique identification number is assigned to the learning speaker, and the speaker model is associated with the identification number.
  • the speaker model is expressed by GMM, for example.
  • The speaker model may be an HMM, SVM (Support Vector Machine), NN (Neural Network), or BN (Bayesian Network).
  • The learning speaker similarity storage unit 114 stores a similarity table indicating the speech similarity between every pair of learning speakers in the set of learning speakers whose speaker models are stored in the speaker model storage unit 112. The number of these similarities equals the number of pairs of learning speakers.
  • The reciprocal of the distance between the speaker models of two learning speakers, or the n-th power of the reciprocal, is used as the speech similarity between the two learning speakers (hereinafter simply the similarity between learning speakers).
  • For calculating the distance between speaker models, for example, KL divergence, which computes a statistical distance between two speaker models that are probability models, can be used.
  • the degree of similarity is not limited to that derived from the distance between models, and may be based on, for example, the likelihood of a learning speaker's voice or a feature amount extracted from the voice.
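The reciprocal-of-KL-divergence similarity has a simple closed form when each speaker model is a single Gaussian (for GMMs, KL divergence is usually approximated numerically). The sketch below uses that single-Gaussian simplification and a symmetrised KL distance; the model parameters are invented toy values.

```python
import math

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """Closed-form KL divergence KL(p||q) between two univariate Gaussians."""
    return (0.5 * math.log(var_q / var_p)
            + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q) - 0.5)

def similarity(model_a, model_b, n=1):
    """Similarity as the n-th power of the reciprocal of a symmetrised KL distance."""
    d = kl_gauss(*model_a, *model_b) + kl_gauss(*model_b, *model_a)
    return (1.0 / d) ** n

# Models are (mean, variance) pairs; closer models yield higher similarity.
close = similarity((0.0, 1.0), (0.5, 1.0))  # symmetrised KL = 0.25 -> sim 4.0
far = similarity((0.0, 1.0), (5.0, 1.0))    # much larger distance -> small sim
```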
  • the feature amount calculation unit 120 calculates a feature amount necessary for speech recognition from the voice signal (evaluation speaker voice signal) of the evaluation speaker, and outputs the feature amount to the similarity calculation unit 130.
  • the evaluation speaker voice signal is, for example, voice data of the evaluation speaker obtained by A / D conversion with a sampling frequency of 16 kHz and 16 bits.
  • The feature amount extracted by the feature amount calculation unit 120 is, for example, the mel-frequency cepstral coefficients (MFCC) described in Non-Patent Document 3 or their rates of change.
  • The feature amount calculation unit 120 cuts out the evaluation speaker voice signal in frames of about 10 msec and performs pre-emphasis, fast Fourier transform (FFT), filter bank analysis, and cosine transform to extract feature quantities as a time series of feature vectors.
  • the feature amount is not limited to this, and may be, for example, voice data itself as long as the feature of the voice can be expressed.
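The framing and transform steps above can be sketched as follows. This toy version uses a naive DFT in place of the FFT and omits the mel filter bank and cosine transform that full MFCC extraction would add, so it is a simplified illustration rather than the document's exact front end.

```python
import math

def frames_from_signal(signal, frame_len=160, hop=80):
    """Cut the signal into overlapping frames (10 ms at 16 kHz = 160 samples)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def pre_emphasis(frame, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return [frame[0]] + [frame[n] - alpha * frame[n - 1]
                         for n in range(1, len(frame))]

def power_spectrum(frame):
    """Hamming window followed by a naive DFT (an FFT in practice)."""
    N = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
                for n, x in enumerate(frame)]
    spec = []
    for k in range(N // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * n / N)
                 for n, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * n / N)
                  for n, x in enumerate(windowed))
        spec.append(re * re + im * im)
    return spec

# A 20 ms toy signal: a 1 kHz tone sampled at 16 kHz (peak lands in DFT bin 10)
sig = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(320)]
feats = [power_spectrum(pre_emphasis(f)) for f in frames_from_signal(sig)]
```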
  • The similarity calculation unit 130 calculates the similarity between the evaluation speaker and each learning speaker using the feature amount of the evaluation speaker voice signal extracted by the feature amount calculation unit 120. Specifically, for example, the speaker model of each learning speaker is read from the speaker model storage unit 112, and the likelihood for the evaluation speaker's feature amount is calculated as the similarity for each speaker model.
  • the similarity calculation unit 130 includes an evaluation speaker model creation unit 132 and a similarity calculation execution unit 134.
  • the evaluation speaker model creation unit 132 creates a speaker model of the evaluation speaker (hereinafter referred to as an evaluation speaker model) using the feature amount of the evaluation speaker obtained by the feature amount calculation unit 120.
  • the evaluation speaker model has the same format as the speaker model of the learning speaker stored in the speaker model storage unit 112. For example, if the speaker model is expressed in GMM, the evaluation speaker model The creation unit 132 creates an evaluation speaker model in the GMM format.
  • The similarity calculation execution unit 134 reads each speaker model from the speaker model storage unit 112 and calculates the similarity between each speaker model and the evaluation speaker model created by the evaluation speaker model creation unit 132. Specifically, for example, the inter-model distance between the evaluation speaker model and each speaker model is calculated using KL divergence, and the reciprocal of the inter-model distance, or the n-th power (n: a positive number) of the reciprocal, is derived as the similarity.
  • the similarity calculation unit 130 outputs the calculated similarities to the adaptive speaker selection unit 140.
  • The adaptive speaker selection unit 140 selects N adaptive speakers using the similarity between the evaluation speaker and the learning speakers calculated by the similarity calculation unit 130 and the learning speaker similarities stored in the learning speaker similarity storage unit 114.
  • The number N of adaptive speakers to be selected may be determined by any conventionally known method. For example, as described in Non-Patent Document 1, it may be determined empirically as a constant, or, as described in Non-Patent Document 2, it may be determined based on the distance between the evaluation speaker and the learning speakers in an acoustic feature space.
  • The adaptive speaker selection unit 140 makes the selection such that "the similarity between the evaluation speaker and the learning speakers is as high as possible, and the similarity among the learning speakers is as low as possible".
  • an adaptive speaker selection method performed by the adaptive speaker selection unit 140 will be described.
  • One method defines, as a potential function, the sum of a decreasing function of the similarity between the evaluation speaker and the adaptive speakers and an increasing function of the similarity among the learning speakers, and selects the learning speakers that minimize the value of the potential function.
  • N learning speakers who minimize the potential function U are selected using Equation (4).
  • N is the number of adaptive speakers to be selected as described above.
  • r ti is the inter-model distance between the evaluation speaker t and the learning speaker i, r ij is the inter-model distance between the learning speaker i and the learning speaker j, and both can be calculated using KL divergence.
  • The parameters of the potential function can be set by performing speech recognition experiments using development data and choosing values that give high recognition performance.
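Equation (4) does not survive in this extract, but one concrete potential of the stated shape is U(S) = Σ over i in S of r_ti + α · Σ over pairs i&lt;j in S of 1/r_ij, which falls as evaluation-speaker similarity (1/r_ti) grows and rises as inter-speaker similarity (1/r_ij) grows. The exhaustive minimization below is a sketch under that assumed form; the distances are invented toy values, not data from the patent.

```python
from itertools import combinations

def potential(subset, r_eval, r_pair, alpha=1.0):
    """U(S): evaluation-to-speaker distances (decreasing function of their
    similarity) plus alpha times reciprocal pairwise distances (increasing
    function of inter-speaker similarity)."""
    u = sum(r_eval[i] for i in subset)
    u += alpha * sum(1.0 / r_pair[i][j]
                     for i, j in combinations(sorted(subset), 2))
    return u

def select_adaptive_speakers(n, r_eval, r_pair, alpha=1.0):
    """Exhaustively pick the n speakers minimising U (candidate narrowing,
    described below, keeps this tractable for large speaker sets)."""
    return min(combinations(list(r_eval), n),
               key=lambda s: potential(s, r_eval, r_pair, alpha))

# Toy inter-model distances: speakers 0-2 are all close to the evaluation
# speaker, but 0 and 1 are also very close to each other.
r_eval = {0: 1.0, 1: 1.1, 2: 1.2, 3: 5.0}
r_pair = {0: {1: 0.1, 2: 4.0, 3: 4.0}, 1: {2: 4.0, 3: 4.0}, 2: {3: 4.0}}
chosen = select_adaptive_speakers(2, r_eval, r_pair)  # avoids the 0-1 cluster
```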
  • Next, another example of the adaptive speaker selection method performed by the adaptive speaker selection unit 140 will be described.
  • First, candidate learning speakers for the adaptive speakers are narrowed down. Specifically, for example, learning speakers whose similarity to the evaluation speaker is equal to or greater than a predetermined threshold are selected as candidates. Thereafter, the learning speaker similarities for the selected candidate learning speakers are read from the learning speaker similarity storage unit 114, and adaptive speakers are selected from the candidate learning speakers using the potential function in the same manner as in the first example method described above.
  • Alternatively, the candidates may be determined as the adaptive speakers without performing the process of selecting adaptive speakers from among the candidates.
  • Alternatively, M learning speakers (M > N) may be selected as candidates in descending order of similarity to the evaluation speaker, and then adaptive speakers may be selected from the candidate learning speakers using the potential function as in the first example method described above.
  • The method of selecting adaptive speakers after first narrowing down the number of candidate learning speakers can improve the processing speed. For example, when there are 1000 learning speakers and 10 adaptive speakers are to be selected from them, selecting the adaptive speakers without narrowing down the candidates requires evaluating equation (4) or equation (5) 1000 C 10 times. On the other hand, if the adaptive speakers are selected after narrowing down to 30 candidates, the number of evaluations of equation (4) or equation (5) is reduced to 30 C 10 times.
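The reduction in the number of candidate subsets can be checked directly with binomial coefficients:

```python
from math import comb

# Selecting 10 adaptive speakers directly from 1000 learning speakers
direct = comb(1000, 10)    # roughly 2.6e23 candidate subsets
# After narrowing down to 30 candidate learning speakers
narrowed = comb(30, 10)    # 30,045,015 candidate subsets
speedup = direct // narrowed
```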
  • FIG. 3 is a flowchart showing a flow of processing by the adaptive speaker selection device 100 shown in FIG.
  • the feature amount calculation unit 120 calculates the feature amount of the evaluation speaker voice signal (S10).
  • Next, the similarity calculation unit 130 calculates, for the speaker model of each learning speaker stored in the speaker model storage unit 112, the similarity to the evaluation speaker using the feature amount of the evaluation speaker voice signal calculated by the feature amount calculation unit 120 (S20).
  • The similarity may be calculated by, for example, computing the likelihood for the feature amount of the evaluation speaker voice signal for each speaker model, or, as shown in FIG. 4, by creating an evaluation speaker model from the feature amount of the evaluation speaker voice signal (S22) and calculating the similarity between each speaker model and the evaluation speaker model (S24).
  • Then, based on the similarity between the evaluation speaker and the learning speakers and the similarities among the learning speakers, the adaptive speaker selection unit 140 selects as adaptive speakers N learning speakers such that "the similarity between the evaluation speaker and the learning speakers is as high as possible, and the similarity among the learning speakers is as low as possible" (S30).
  • the adaptive speaker may be selected directly from all the learning speakers, or, as shown in FIG. 5, M persons (M > N) candidates may be selected (S32), and N adaptive speakers may be selected from the selected M candidates (S34).
  • As described above, learning speakers are selected as adaptive speakers such that "the similarity between the evaluation speaker and the learning speakers is as high as possible, and the similarity among the learning speakers is as low as possible," so the variation in the adaptive speakers' utterance content is prevented from becoming small. Deterioration in the accuracy of the adaptive model created using the adaptive speakers' sufficient statistics can therefore be suppressed.
  • FIG. 6 shows an adaptive speaker model generation apparatus 200 according to the embodiment of the present invention.
  • the adaptive speaker model generation apparatus 200 includes a storage unit 210, an input unit 220, and a data processing unit 230.
  • The storage unit 210 includes a sufficient statistics storage unit 212 that stores, for each learning speaker ID, the sufficient statistics obtained for that learning speaker; a speaker model storage unit 214 that stores, for each learning speaker ID, the acoustic model of that learning speaker; and a learning speaker similarity storage unit 216 that stores a similarity table indicating the speech similarity between every pair of learning speakers in the set of learning speakers.
  • the input unit 220 receives the voice signal of the evaluation speaker from a voice input device such as a microphone and inputs it to the data processing unit 230.
  • the data processing unit 230 includes a feature amount calculation unit 232, a similarity calculation unit 234, a speaker selection unit 236, and an adaptive model creation unit 238.
  • the feature amount calculation unit 232 receives the evaluation speaker voice signal from the input unit 220, calculates a feature amount necessary for speech recognition, and outputs the feature amount to the similarity calculation unit 234. Note that the specific method of calculating the feature amount by the feature amount calculation unit 232 may be any of the methods used by the feature amount calculation unit 120 in the adaptive speaker selection device 100 illustrated in FIG.
  • The similarity calculation unit 234 reads the speaker model of each learning speaker stored in the speaker model storage unit 214, calculates for each speaker model the similarity to the feature amount of the evaluation speaker voice signal received from the feature amount calculation unit 232, and outputs pairs of the similarity and the ID of the corresponding learning speaker to the speaker selection unit 236. The type of similarity calculated by the similarity calculation unit 234 and its calculation method are the same as those of the similarity calculation unit 130 in the adaptive speaker selection device 100, so their description is omitted.
  • Like the adaptive speaker selection unit 140, the speaker selection unit 236 selects as adaptive speakers N learning speakers such that "the similarity between the evaluation speaker and the learning speakers is as high as possible, and the similarity among the learning speakers is as low as possible." The speaker selection unit 236 outputs the IDs of the selected N adaptive speakers to the adaptive model creation unit 238.
  • The adaptive model creation unit 238 reads from the sufficient statistics storage unit 212 the sufficient statistics corresponding to the IDs of the N adaptive speakers output from the speaker selection unit 236, and creates an acoustic model adapted to the evaluation speaker (an adaptive model) by statistical processing.
  • the adaptive model creation method by the adaptive model creation unit 238 is not limited to the above-described method.
  • For example, the adaptive model creation unit 238 may convert the likelihoods of the adaptive speakers' speaker models for the evaluation speaker's feature amount, calculated by the similarity calculation unit 234, into weighting coefficients and integrate each adaptive speaker's sufficient statistics with the corresponding weight, or may integrate the adaptive speakers' speaker models weighted by arbitrary coefficients.
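A sketch of the likelihood-weighted variant mentioned above. The weighting scheme here, normalising per-speaker log-likelihoods with log-sum-exp, is one plausible choice, not the patent's prescribed one, and the numbers are invented for illustration.

```python
import math

def likelihood_weights(log_likelihoods):
    """Normalise per-speaker log-likelihoods into weights summing to 1
    (log-sum-exp for numerical stability)."""
    m = max(log_likelihoods)
    exps = [math.exp(ll - m) for ll in log_likelihoods]
    s = sum(exps)
    return [e / s for e in exps]

def weighted_mean(weights, speaker_means):
    """Integrate one statistic (here, a Gaussian mean) across adaptive
    speakers with the likelihood-derived weights."""
    return sum(w * mu for w, mu in zip(weights, speaker_means))

# Three adaptive speakers; the third matched the evaluation speaker worst.
w = likelihood_weights([-100.0, -100.0, -103.0])
adapted = weighted_mean(w, [0.0, 1.0, 2.0])  # dominated by the first two
```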
  • FIG. 7 is a flowchart showing the flow of processing by the adaptive speaker model generation apparatus 200.
  • Steps S50 to S70 constitute the processing up to the selection of adaptive speakers, and are the same as the corresponding processing by the adaptive speaker selection device 100 described above.
  • In step S80, the adaptive model creation unit 238 in the adaptive speaker model generation apparatus 200 creates an acoustic model adapted to the evaluation speaker using the sufficient statistics of the N adaptive speakers selected by the speaker selection unit 236.
  • As described above, the adaptive speaker model generation apparatus 200 selects adaptive speakers by the same method as the adaptive speaker selection device 100 shown in FIG. 1 and creates a model adapted to the evaluation speaker, so deterioration in the accuracy of the created model can be suppressed.
  • the present invention is used, for example, in a technique for selecting an adaptive speaker from learning speakers in order to create an acoustic model adapted to an evaluation speaker.

Abstract

A feature quantity calculation unit of an adaptive speaker selection device calculates the feature quantity of a voice signal of an evaluation speaker. A similarity calculation unit calculates, regarding speaker models of respective learning speakers, the degrees of similarity to the evaluation speaker by using the feature quantity of the voice signal of the evaluation speaker calculated by the feature quantity calculation unit (S20). An adaptive speaker selection unit selects N learning speakers such that “the degrees of similarity between the evaluation speaker and the learning speakers are as high as possible and the degree of similarity between the learning speakers is as low as possible” as adaptive speakers on the basis of the degrees of similarity between the evaluation speaker and the learning speakers and the degree of similarity between the learning speakers (S30). Consequently, the adaptive speakers can be selected so that the deterioration of the accuracy of a speaker-adaptive model can be suppressed.

Description

Adaptive speaker selection apparatus, adaptive speaker selection method, and recording medium
 The present invention relates to a technique for selecting adaptive speakers from learning speakers in order to create an acoustic model adapted to an evaluation speaker.
 Speech recognition systems are used in various fields. To improve speech recognition accuracy, speaker adaptation technology, which adapts the acoustic model used in a speech recognition system to the user, is known, and various methods have been proposed for creating a speaker adaptation model (an acoustic model adapted to the user).
 Patent Document 1 and Non-Patent Document 1 disclose a technique for creating a speaker adaptation model using sufficient statistics. FIG. 8 shows a schematic example of a speaker adaptive model creation apparatus that implements this technique.
 The speaker adaptive model creation apparatus 1 shown in FIG. 8 includes storage means 10, input means 20, and data processing means 30. The storage means 10 includes a sufficient statistics storage unit 12 and a speaker model storage unit 14, and the data processing means 30 includes a feature amount calculation unit 32, a similarity calculation unit 34, a speaker selection unit 36, and an adaptive model creation unit 38.
 The speaker adaptive model creation apparatus 1 creates an acoustic model for each speaker using a database composed of sample speech data of a plurality of speakers, selects several of these acoustic models, and adapts them to the uttering speaker (corresponding to the user described above) to create an acoustic model for that speaker. In the following description of this specification, a speaker of sample speech data is referred to as a "learning speaker", and a probability model created for each learning speaker that represents the speaker's acoustic characteristics is referred to as a "speaker model". The speaker to be adapted to is called the "evaluation speaker", the acoustic model adapted to the evaluation speaker is called the "adaptive model", and the speaker of a speaker model selected to create the adaptive model is called an "adaptive speaker".
 話者適応モデル作成装置1は、下記のステップを経て適応モデルを作成する。
1.データベースを用いて十分統計量と話者モデルを作成する。
 話者モデルは、学習話者毎に作成された、該話者の音響的な特徴を表す確率モデルである。ここでは、音素を区別することなく1状態64混合の混合ガウス分布モデル(GMM:Gaussian Mixture Model)で表現される。なお、GMMは、混合正規分布で表現した観測データの確率モデルである。
The speaker adaptive model creation apparatus 1 creates an adaptive model through the following steps.
1. Create sufficient statistics and speaker models using the database.
The speaker model is a probability model created for each learning speaker and representing the acoustic characteristics of the speaker. Here, it is expressed by a mixed Gaussian distribution model (GMM: Gaussian Mixture Model) with one state 64 without distinguishing phonemes. GMM is a probability model of observation data expressed by a mixed normal distribution.
 十分統計量は、学習話者毎に作成され、隠れマルコフモデル(HMM:Hidden Markov Model)で表現される。「十分統計量」とは、データベースから音響モデルを構築するために十分な統計量のことを意味し、ここでは、HMMにおける平均、分散、およびEMカウントが用いられる。なお、「EMカウント」は、HMMを学習する際に一般的に用いられるEMアルゴリズムにおいて、状態iから状態jの正規分布に遷移する確率の度数である。十分統計量は、当該学習話者の音声データを用いて、EMアルゴリズで不特定話者モデルから1回学習することにより算出される。 Sufficient statistics are created for each learning speaker and expressed in a hidden Markov model (HMM: Hidden Markov Model). By “sufficient statistics” is meant sufficient statistics to build an acoustic model from a database, where mean, variance, and EM count in the HMM are used. The “EM count” is the frequency of the probability of transition from the state i to the normal distribution of the state j in the EM algorithm generally used when learning the HMM. The sufficient statistic is calculated by learning once from the unspecified speaker model with the EM algorithm using the speech data of the learning speaker.
 In the speaker adaptive model creation device 1, the sufficient statistic storage unit 12 and the speaker model storage unit 14 store, respectively, the sufficient statistics and the speaker model calculated as described above for each learning speaker.
2. Input of the evaluation speaker's speech data
 In the speaker adaptive model creation device 1, the input means 20 inputs the speech data of the evaluation speaker. The input means 20 receives the evaluation speaker's speech data from a voice input device such as a microphone.
3. Selection of adaptive speakers and creation of the adaptive model
 The data processing means 30 of the speaker adaptive model creation device 1 is responsible for these processes.
 The feature amount calculation unit 32 receives the evaluation speaker's speech data input by the input means 20, calculates the feature amounts necessary for speech recognition, and outputs them to the similarity calculation unit 34.
 The similarity calculation unit 34 reads the speaker model of each learning speaker stored in the speaker model storage unit 14, calculates, for each speaker model, its similarity to the evaluation speaker's feature amounts received from the feature amount calculation unit 32, and outputs each pair of a similarity and the corresponding learning speaker to the speaker selection unit 36.
 Here, the likelihood obtained by inputting the feature amounts extracted from the evaluation speaker's speech into a learning speaker's speaker model is used as the similarity: the larger the likelihood, the higher the similarity.
 From the pairs of similarities and learning speakers output by the similarity calculation unit 34, the speaker selection unit 36 selects as adaptive speakers the N learning speakers with the highest similarity, that is, the highest likelihood, and outputs identifiers (such as ID numbers) of the selected adaptive speakers to the adaptive model creation unit 38. The number N of adaptive speakers is a constant determined empirically.
 The adaptive model creation unit 38 receives from the speaker selection unit 36 the identifiers of the learning speakers selected as adaptive speakers, and reads the sufficient statistics of the learning speakers indicated by these identifiers from the sufficient statistic storage unit 12. It then creates and outputs an adaptive model using the read sufficient statistics, for use in speech recognition of the evaluation speaker.
 The process of creating an adaptive model from the sufficient statistics read out of the sufficient statistic storage unit 12 is, specifically, the statistical calculation given by equations (1) to (3) below.
Figure JPOXMLDOC01-appb-M000004
Figure JPOXMLDOC01-appb-M000005
Figure JPOXMLDOC01-appb-M000006
 Here, μ_i^adp (i = 1, ..., N_mix) and ν_i^adp (i = 1, ..., N_mix) are, respectively, the mean and variance of the normal distribution in each state of the adaptive model's HMM, and N_mix is the number of mixture components. a_adp[i][j] (i = 1, ..., N_state, j = 1, ..., N_state) is the transition probability from state i to state j, and N_state is the number of states. N_sel is the number of selected adaptive speakers, and μ_i^j (i = 1, ..., N_mix, j = 1, ..., N_sel) and ν_i^j (i = 1, ..., N_mix, j = 1, ..., N_sel) are, respectively, the mean and variance of the acoustic model of the j-th selected adaptive speaker. C_mix^j (j = 1, ..., N_sel) and C_state^k[i][j] (k = 1, ..., N_sel, i = 1, ..., N_state, j = 1, ..., N_state) are, respectively, the EM counts for the normal distributions and the EM counts for the state transitions.
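Equations (1) to (3) appear above only as image placeholders. The sketch below implements the count-weighted combination conventionally used with this sufficient-statistics approach (cf. Non-Patent Document 1): means are averaged with EM-count weights, variances are combined via second moments, and transition probabilities come from pooled transition counts. The array layout and the exact formulas are assumptions under that reading, not a verbatim transcription of the patent's equations.

```python
import numpy as np

def combine_sufficient_statistics(means, variances, c_mix, c_state):
    """Combine per-speaker sufficient statistics into an adaptive model.

    means, variances: (Nsel, Nmix) per-speaker Gaussian means/variances
    c_mix:            (Nsel, Nmix) EM counts for the normal distributions
    c_state:          (Nsel, Nstate, Nstate) EM counts for state transitions
    """
    total = c_mix.sum(axis=0)                       # sum over speakers, per mixture
    # eq. (1) analogue: count-weighted mean
    mu_adp = (c_mix * means).sum(axis=0) / total
    # eq. (2) analogue: combine second moments, then subtract the new mean squared
    second = (c_mix * (variances + means ** 2)).sum(axis=0) / total
    var_adp = second - mu_adp ** 2
    # eq. (3) analogue: pooled transition counts, normalized per source state
    trans = c_state.sum(axis=0)
    a_adp = trans / trans.sum(axis=1, keepdims=True)
    return mu_adp, var_adp, a_adp
```

With a single selected speaker this reduces to that speaker's own statistics, which is a quick sanity check on the weighting.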
 In the method described above, the number N of adaptive speakers is empirically fixed at a constant; however, as described in Non-Patent Document 2, for example, N can also be determined based on the inter-speaker distance between the evaluation speaker and the learning speakers in the acoustic feature space.
 Known feature amounts for speech data include, for example, the mel-frequency cepstral coefficients (MFCC) described in Non-Patent Document 3 and their rates of change.
[Patent Document 1] Japanese Patent No. 3756879
[Non-Patent Document 1] Shinichi Yoshizawa, Akira Baba, Kanako Matsunami, Yuichiro Mera, Miichi Yamada, Akinobu Lee, Kiyohiro Shikano, "Unsupervised Training of Phone Models Using Sufficient Statistics and Speaker Distance", IEICE Transactions, D-II, Vol. J85-D-II, No. 3, pp. 382-389, March 2002
[Non-Patent Document 2] Masahiro Tani, Tadashi Emori, Yoshifumi Onishi, Takafumi Koshinaka, Koichi Shinoda, "Speaker Selection for Unsupervised Speaker Adaptation Using Sufficient Statistics", IEICE Technical Report, Vol. 107, No. 406, pp. 85-89, December 2007
[Non-Patent Document 3] Kiyohiro Shikano, Katsunobu Itou, Tatsuya Kawahara, Kazuya Takeda, Mikio Yamamoto, "Speech Recognition Systems", Ohmsha, 2001, pp. 13-15
 The methods described in Patent Document 1 and Non-Patent Document 1 use the likelihood of the evaluation speaker's speech as the similarity and select learning speakers with high similarity as adaptive speakers. That is, only the speech similarity between the learning speakers and the evaluation speaker is used as the selection criterion for the adaptive speakers. For example, when not only the acoustic features but also the phonological features representing the utterance content are similar among the voices of the selected adaptive speakers, there is little variation in the adaptive speakers' utterance content, so the frequencies of the phonemes used for training become biased, which may degrade the accuracy of the adaptive model.
 The present invention has been made in view of the above circumstances and provides an adaptive speaker selection technique for avoiding such degradation in the accuracy of the adaptive model.
 One aspect of the present invention is an adaptive speaker selection method that selects a plurality of adaptive speakers from a set of learning speakers in order to create an acoustic model adapted to an evaluation speaker. The method selects, as adaptive speakers, a plurality of learning speakers whose speech similarity to the evaluation speaker is as high as possible and whose speech similarity to one another is as low as possible.
 An apparatus that executes the method of the above aspect, and a program that causes a computer to execute the method, are also effective as aspects of the present invention.
 According to the adaptive speaker selection technique of the present invention, when an acoustic model adapted to the evaluation speaker is created using the acoustic models of the selected adaptive speakers, degradation in the accuracy of the created acoustic model can be suppressed.
FIG. 1 is a schematic diagram of an adaptive speaker selection device for explaining the technique according to the present invention.
FIG. 2 shows a configuration example of the similarity calculation unit in the adaptive speaker selection device shown in FIG. 1.
FIG. 3 is a flowchart showing the flow of processing by the adaptive speaker selection device shown in FIG. 1.
FIG. 4 is a flowchart showing the flow of processing of the similarity calculation unit of the example shown in FIG. 2.
FIG. 5 is a flowchart showing an example of the flow of processing by the adaptive speaker selection unit in the adaptive speaker selection device shown in FIG. 1.
FIG. 6 shows an adaptive speaker model generation device according to an embodiment of the present invention.
FIG. 7 is a flowchart showing the flow of processing by the adaptive speaker model generation device shown in FIG. 6.
FIG. 8 is a schematic diagram of the speaker adaptive model creation device used to explain the prior art.
Explanation of Reference Numerals
1 speaker adaptive model creation device; 10 storage means
12 sufficient statistic storage unit; 14 speaker model storage unit
20 input means; 30 data processing means
32 feature amount calculation unit; 34 similarity calculation unit
36 speaker selection unit; 38 adaptive model creation unit
100 adaptive speaker selection device; 112 speaker model storage unit
114 learning-speaker similarity storage unit; 120 feature amount calculation unit
130 similarity calculation unit; 132 evaluation speaker model creation unit
134 similarity calculation execution unit; 140 adaptive speaker selection unit
200 adaptive speaker model generation device; 210 storage means
212 sufficient statistic storage unit; 214 speaker model storage unit
216 learning-speaker similarity storage unit; 220 input means
230 data processing means; 232 feature amount calculation unit
234 similarity calculation unit; 236 speaker selection unit
238 adaptive model creation unit
 In the drawings used in the following description, each element depicted as a functional block performing various processes can be implemented, in hardware, by a processor, memory, and other circuits, and, in software, by a program recorded in or loaded into memory. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware alone, software alone, or a combination thereof, and they are not limited to any one of these. For clarity, these drawings show only what is necessary to explain the technique of the present invention.
 Before describing specific embodiments of the present invention, the principle of the present invention will first be described.
 FIG. 1 is an example of a schematic diagram of an adaptive speaker selection device 100 based on the technique according to the present invention. The adaptive speaker selection device 100 includes a speaker model storage unit 112, a learning-speaker similarity storage unit 114, a feature amount calculation unit 120, a similarity calculation unit 130, and an adaptive speaker selection unit 140.
 The speaker model storage unit 112 stores the speaker model created for each learning speaker in association with that learning speaker. As an association method, for example, a unique identification number is assigned to each learning speaker, and the speaker model is associated with that identification number. The speaker model is expressed here as a GMM, but it may also be, for example, an HMM, an SVM (Support Vector Machine), an NN (Neural Network), or a BN (Bayesian Network).
 The learning-speaker similarity storage unit 114 stores a similarity table indicating the speech similarity between every pair of learning speakers in the set of learning speakers whose speaker models are stored in the speaker model storage unit 112. The number of these similarities equals the number of pairs of learning speakers.
 As the speech similarity between two learning speakers (hereinafter simply the learning-speaker similarity), for example, the reciprocal of the distance between the two learning speakers' speaker models, or its n-th power (n: a positive number), is used. The distance between speaker models can be calculated using, for example, the KL divergence, which gives a statistical distance between two probability models. The similarity is not limited to one derived from an inter-model distance; it may also be based on, for example, the likelihood of a learning speaker's speech or of feature amounts extracted from that speech.
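For single Gaussians the KL divergence has a closed form, so the reciprocal-distance similarity can be sketched as below. This is an illustration only: the diagonal-Gaussian parameterization and the symmetrization are assumptions, and for full GMMs the KL divergence has no closed form and is typically approximated (e.g., by Monte Carlo sampling).

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """KL divergence D(p0 || p1) between two diagonal-covariance Gaussians."""
    mu0, var0, mu1, var1 = map(np.asarray, (mu0, var0, mu1, var1))
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def speaker_similarity(model_a, model_b, n=1):
    """Reciprocal (raised to the n-th power) of a symmetrized KL distance.

    model_a, model_b: (mean, variance) tuples for two speaker models.
    """
    d = 0.5 * (gaussian_kl(*model_a, *model_b) + gaussian_kl(*model_b, *model_a))
    return 1.0 / d ** n
```

Because KL divergence is asymmetric, the average of the two directions is used here as the distance before taking the reciprocal.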
 The feature amount calculation unit 120 calculates, from the speech signal of the evaluation speaker (evaluation speaker speech signal), the feature amounts necessary for speech recognition and outputs them to the similarity calculation unit 130. The evaluation speaker speech signal is, for example, the evaluation speaker's speech data obtained by 16-bit A/D conversion at a sampling frequency of 16 kHz. The feature amounts extracted by the feature amount calculation unit 120 are, for example, the mel-frequency cepstral coefficients (MFCC) described in Non-Patent Document 3 and their rates of change. In this case, the feature amount calculation unit 120 cuts the evaluation speaker speech signal into fixed-length segments of about 10 msec, called frames, and performs pre-emphasis, fast Fourier transform (FFT), filter bank analysis, and cosine transform to extract feature amounts in the form of a time series of feature vectors. Of course, the feature amounts are not limited to these; they may even be the speech data itself, as long as they can represent the characteristics of the speech.
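The MFCC pipeline just described can be sketched end to end as follows. This is a rough illustration: the frame length, FFT size, filter count, and coefficient count are assumed values, not parameters taken from the patent.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_filt=24, n_ceps=12):
    """Minimal MFCC sketch: pre-emphasis, framing, FFT, mel filter bank, DCT."""
    signal = np.asarray(signal, dtype=float)
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame] for i in range(n_frames)])
    frames *= np.hamming(frame)
    power = np.abs(np.fft.rfft(frames, n=512)) ** 2                 # power spectrum
    # triangular mel-spaced filter bank between 0 Hz and sr/2
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filt + 2))
    bins = np.floor((512 + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filt, 257))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fb.T + 1e-10)
    # DCT-II to decorrelate; keep the first n_ceps cepstral coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filt)))
    return log_energy @ dct.T
```

The delta features mentioned in the text would be obtained by differencing this time series across frames.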
 The similarity calculation unit 130 calculates the similarity between the evaluation speaker and each learning speaker using the feature amounts of the evaluation speaker speech signal extracted by the feature amount calculation unit 120. Specifically, for example, it reads the speaker model of each learning speaker from the speaker model storage unit 112 and, for each speaker model, calculates the likelihood of the evaluation speaker's feature amounts as the similarity.
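This likelihood-as-similarity computation can be sketched as follows for diagonal-covariance GMM speaker models; the `(weights, means, variances)` model layout and the per-frame averaging are assumed conventions, not taken from the patent.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Average per-frame log-likelihood of feature vectors x under a diagonal GMM."""
    log_p = (-0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
             + (((x[:, None, :] - means) ** 2) / variances).sum(axis=2))
             + np.log(weights))
    m = log_p.max(axis=1, keepdims=True)              # log-sum-exp, stably
    return float(np.mean(m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))))

def rank_learning_speakers(features, speaker_models):
    """Return learning-speaker IDs sorted by similarity (likelihood), highest first."""
    scores = {sid: gmm_log_likelihood(features, *model)
              for sid, model in speaker_models.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The head of this ranking corresponds to the likelihood-only selection of the prior art; the invention additionally penalizes mutual similarity among the selected speakers, as described below.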
 The calculation of the similarity between the evaluation speaker and a learning speaker is not limited to the above method. An example of another method will be described with reference to FIG. 2.
 In this method, the similarity calculation unit 130 includes an evaluation speaker model creation unit 132 and a similarity calculation execution unit 134. The evaluation speaker model creation unit 132 creates a speaker model of the evaluation speaker (hereinafter the evaluation speaker model) using the evaluation speaker's feature amounts obtained by the feature amount calculation unit 120. The evaluation speaker model has the same form as the learning speakers' speaker models stored in the speaker model storage unit 112; for example, if the speaker models are expressed as GMMs, the evaluation speaker model creation unit 132 creates the evaluation speaker model in GMM form.
 The similarity calculation execution unit 134 reads each speaker model from the speaker model storage unit 112 and calculates, for each speaker model, its similarity to the evaluation speaker model created by the evaluation speaker model creation unit 132. Specifically, for example, the inter-model distance between the evaluation speaker model and the speaker model is calculated using the KL divergence, and the reciprocal of that distance, or its n-th power (n: a positive number), is derived as the similarity.
 The similarity calculation unit 130 outputs each calculated similarity to the adaptive speaker selection unit 140.
 The adaptive speaker selection unit 140 selects N adaptive speakers using the similarities between the evaluation speaker and the learning speakers calculated by the similarity calculation unit 130 and the learning-speaker similarities stored in the learning-speaker similarity storage unit 114. The number N of adaptive speakers to select may be determined by any conventionally known method. For example, as described in Non-Patent Document 1, it may be fixed empirically at a constant, or, as described in Non-Patent Document 2, it may be determined based on the inter-speaker distance between the evaluation speaker and the learning speakers in the acoustic feature space.
 Specifically, the adaptive speaker selection unit 140 makes the selection so that "the similarity between the evaluation speaker and the learning speakers is as large as possible, and the similarity among the learning speakers is as small as possible." Examples of the adaptive speaker selection method used by the adaptive speaker selection unit 140 are described below.
 One method takes as a potential function the sum of a decreasing function of the similarity between the evaluation speaker and the adaptive speakers and an increasing function of the similarity among the learning speakers, and selects as adaptive speakers the learning speakers that minimize the value of this potential function. Specifically, the N learning speakers that minimize the potential function U of equation (4) are selected.
Figure JPOXMLDOC01-appb-M000007
 In equation (4), N is the number of adaptive speakers to select, as described above. r_ti is the inter-model distance between the evaluation speaker t and the learning speaker i, and r_ij is the inter-model distance between the learning speakers i and j; both can be calculated using the KL divergence. The parameters k_1, k_2, ..., l_1, l_2, ..., m_1, m_2, ..., n_1, n_2, ... characterizing the potential function U are set, for example, by running speech recognition experiments on development data so that recognition performance is high. To simplify the calculation, equation (5), obtained from equation (4) by setting k_1 = 1, l_1 = 1, m_1 = 1, n_1 = 1 and all other parameters to 0, may also be used.
Figure JPOXMLDOC01-appb-M000008
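Equations (4) and (5) appear in this text only as image placeholders. Purely as an illustrative stand-in, the sketch below minimizes U(S) = Σ_{i∈S} r_ti + Σ_{i<j∈S} 1/r_ij by exhaustive search. This U is one concrete instance of "a decreasing function of the evaluation-speaker similarity plus an increasing function of the learning-speaker similarity" when similarity is taken as the reciprocal distance 1/r; it is an assumption, not the patent's actual equation.

```python
import numpy as np
from itertools import combinations

def select_adaptive_speakers(r_t, r, n_sel):
    """Exhaustively pick the size-n_sel subset S of learning speakers minimizing
    U(S) = sum_{i in S} r_t[i] + sum_{i<j in S} 1 / r[i][j].

    r_t[i]  : model distance from the evaluation speaker to learning speaker i
    r[i][j] : model distance between learning speakers i and j (symmetric)
    """
    best, best_u = None, np.inf
    for subset in combinations(range(len(r_t)), n_sel):
        u = sum(r_t[i] for i in subset)                      # close to evaluation speaker
        u += sum(1.0 / r[i][j] for i, j in combinations(subset, 2))  # mutually far apart
        if u < best_u:
            best, best_u = subset, u
    return best
```

The exhaustive loop visits every subset, which motivates the candidate-narrowing variant described next.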
 Another example of the adaptive speaker selection method used by the adaptive speaker selection unit 140 will now be described. This method first narrows down the learning speakers that are candidates for adaptive speakers. Specifically, for example, learning speakers whose similarity to the evaluation speaker is at least a predetermined threshold are selected as candidates. The learning-speaker similarities for the selected candidates are then read from the learning-speaker similarity storage unit 114, and adaptive speakers are selected from the candidates using a potential function, in the same manner as in the first example above. In this method, when the number of learning speakers selected as candidates is no greater than the number of adaptive speakers to select, the candidates may simply be determined to be the adaptive speakers, without performing the selection process.
 Alternatively, without using a threshold for candidate selection, the M learning speakers (M > N) with the highest similarity to the evaluation speaker may be selected as candidates, after which adaptive speakers are selected from the candidates using a potential function, as in the first example above.
 Narrowing down the number of candidate learning speakers before selecting the adaptive speakers can improve processing speed. For example, when there are 1000 learning speakers and 10 adaptive speakers are to be selected from among them, selecting the adaptive speakers without narrowing down candidates requires evaluating equation (4) or (5) 1000C10 times (the number of ways to choose 10 speakers out of 1000). If the candidates are first narrowed down to 30, the number of evaluations of equation (4) or (5) is reduced to 30C10 times.
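The two subset counts above can be checked directly with Python's `math.comb`:

```python
from math import comb

# Evaluating the potential function for every size-10 subset of 1000 speakers:
assert comb(1000, 10) > 10 ** 23        # roughly 2.6e23 subsets -- infeasible
# After first narrowing the candidates down to 30:
assert comb(30, 10) == 30_045_015      # roughly 3.0e7 subsets -- tractable
```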
 FIG. 3 is a flowchart showing the flow of processing by the adaptive speaker selection device 100 shown in FIG. 1. First, the feature amount calculation unit 120 calculates the feature amounts of the evaluation speaker speech signal (S10). Using these feature amounts, the similarity calculation unit 130 calculates, for the speaker model of each learning speaker stored in the speaker model storage unit 112, the similarity to the evaluation speaker (S20). The similarity may be calculated, for example, as the likelihood of the feature amounts of the evaluation speaker speech signal under each speaker model, or, as shown in FIG. 4, an evaluation speaker model may be created from the feature amounts of the evaluation speaker speech signal (S22) and the similarity between each speaker model and the evaluation speaker model calculated (S24).
 Then, based on the similarities between the evaluation speaker and the learning speakers and the similarities among the learning speakers, the adaptive speaker selection unit 140 selects as adaptive speakers the N learning speakers for which "the similarity between the evaluation speaker and the learning speakers is as large as possible, and the similarity among the learning speakers is as small as possible" (S30). In selecting the adaptive speakers, they may be selected directly from all the learning speakers, or, as shown in FIG. 5, M candidates (M > N) may be selected according to their similarity to the evaluation speaker (S32) and the N adaptive speakers then selected from the M candidates (S34).
 The principle of the adaptive speaker selection technique according to the present invention has been described above. According to this technique, when selecting adaptive speakers, learning speakers are chosen so that "the similarity between the evaluation speaker and the learning speakers is as large as possible, and the similarity among the learning speakers is as small as possible," which prevents the variation in the adaptive speakers' utterance content from becoming small. Degradation in the accuracy of the adaptive model created using the adaptive speakers' sufficient statistics can therefore be suppressed.
 An embodiment of the present invention will now be described based on the above.
 FIG. 6 shows an adaptive speaker model generation device 200 according to the embodiment of the present invention. The adaptive speaker model generation device 200 includes storage means 210, input means 220, and data processing means 230.
 The storage means 210 includes a sufficient statistic storage unit 212 that stores, for each learning speaker ID, the sufficient statistics obtained for that learning speaker; a speaker model storage unit 214 that stores, for each learning speaker ID, the learning speaker's acoustic model; and a learning-speaker similarity storage unit 216 that stores a similarity table indicating the speech similarity between every pair of learning speakers in the set.
 The input means 220 receives the evaluation speaker's speech signal from a voice input device such as a microphone and inputs it to the data processing means 230.
 The data processing means 230 includes a feature amount calculation unit 232, a similarity calculation unit 234, a speaker selection unit 236, and an adaptive model creation unit 238.
 The feature amount calculation unit 232 receives the evaluation speaker speech signal from the input means 220, calculates the feature amounts necessary for speech recognition, and outputs them to the similarity calculation unit 234. The specific feature calculation method used by the feature amount calculation unit 232 may be any of the methods used by the feature amount calculation unit 120 in the adaptive speaker selection device 100 shown in FIG. 1.
 The similarity calculation unit 234 reads the speaker model of each learning speaker stored in the speaker model storage unit 214, calculates, for each speaker model, its similarity to the evaluation speaker speech signal received from the feature amount calculation unit 232, and outputs each pair of a similarity and the corresponding learning speaker's ID to the speaker selection unit 236. The types of similarity calculated by the similarity calculation unit 234 and the methods of calculating them are the same as those of the similarity calculation unit 130 in the adaptive speaker selection device 100, so a detailed description is omitted here.
 Like the adaptive speaker selection unit 140 in the adaptive speaker selection device 100, the speaker selection unit 236 selects as adaptive speakers the N learning speakers for which "the similarity between the evaluation speaker and the learning speakers is as large as possible, and the similarity among the learning speakers is as small as possible." The speaker selection unit 236 outputs the IDs of the selected N adaptive speakers to the adaptive model creation unit 238.
 The adaptive model creation unit 238 reads from the sufficient statistic storage unit 212 the sufficient statistics corresponding to the IDs of the N adaptive speakers output by the speaker selection unit 236, and creates an acoustic model adapted to the evaluation speaker (an adaptive model) by statistical calculation.
 The method by which the adaptive model creation unit 238 creates the adaptive model is not limited to the one described above. For example, the sufficient statistics of the adaptive speakers may be weighted and combined using weighting coefficients determined by the likelihoods, calculated by the similarity calculation unit 234, of the adaptive speakers' models with respect to the evaluation speaker's feature amounts, or the adaptive speakers' models may be weighted by arbitrary coefficients and combined.
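The likelihood-weighted combination of sufficient statistics mentioned above can be sketched as follows for single-Gaussian, one-dimensional statistics. The softmax weighting and the layout of the statistics dictionaries are assumptions for illustration; the patent only states that likelihoods determine the weights:

```python
import numpy as np

def combine_sufficient_statistics(stats, log_likelihoods):
    """Weight each adaptive speaker's Gaussian sufficient statistics by a
    softmax of its model's log-likelihood for the evaluation speaker, pool
    them, and re-derive a mean and variance."""
    ll = np.array(log_likelihoods, dtype=float)
    w = np.exp(ll - ll.max())
    w /= w.sum()                                   # normalized weights
    # Pool zeroth/first/second-order statistics.
    n = sum(wi * s["count"] for wi, s in zip(w, stats))
    first = sum(wi * s["sum"] for wi, s in zip(w, stats))
    second = sum(wi * s["sum_sq"] for wi, s in zip(w, stats))
    mean = first / n
    var = second / n - mean ** 2
    return mean, var

stats = [
    {"count": 100.0, "sum": np.array([100.0]), "sum_sq": np.array([200.0])},
    {"count": 100.0, "sum": np.array([300.0]), "sum_sq": np.array([1000.0])},
]
mean, var = combine_sufficient_statistics(stats, [-10.0, -10.0])
print(mean, var)   # equal log-likelihoods -> equal weights -> pooled mean 2.0
```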
 FIG. 7 is a flowchart showing the flow of processing by the adaptive speaker model generation device 200. Steps S50 to S70 cover the processing up to the selection of the adaptive speakers, and are the same as the processing by the adaptive speaker selection device 100 shown in FIG. 3. In step S80, the adaptive model creation unit 238 of the adaptive speaker model generation device 200 creates an acoustic model adapted to the evaluation speaker using the sufficient statistics of the N adaptive speakers selected by the speaker selection unit 236.
 Because the adaptive speaker model generation device 200 of this embodiment selects adaptive speakers by the same method as the adaptive speaker selection device 100 shown in FIG. 1 and then creates a model adapted to the evaluation speaker, it can suppress degradation of the adaptive model's accuracy.
 While the present invention has been described above with reference to the embodiments (and examples), the present invention is not limited to those embodiments (and examples). Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within its scope.
 This application claims priority based on Japanese Patent Application No. 2008-092206 filed on March 31, 2008, the entire disclosure of which is incorporated herein.
 The present invention is applicable, for example, to techniques for selecting adaptive speakers from learning speakers in order to create an acoustic model adapted to an evaluation speaker.

Claims (12)

  1.  An adaptive speaker selection method for selecting a plurality of adaptive speakers from a set of learning speakers in order to create an acoustic model adapted to an evaluation speaker, wherein a plurality of learning speakers whose speech similarity to the evaluation speaker is as high as possible and whose speech similarity to one another is as small as possible are selected as the adaptive speakers.
  2.  The adaptive speaker selection method according to claim 1, wherein N learning speakers that minimize the value of the potential function U shown in Equation (1) are selected as the adaptive speakers.
     [Equation (1): formula image (Figure JPOXMLDOC01-appb-M000001) not reproduced]
     In Equation (1), r_ti is the inter-model distance between the evaluation speaker t and learning speaker i, r_ij is the inter-model distance between learning speakers i and j, and k_1, k_2, ..., l_1, l_2, ..., m_1, m_2, ..., n_1, n_2, ... are characteristic parameters of the potential function U.
  3.  The adaptive speaker selection method according to claim 1 or 2, wherein learning speakers whose similarity to the evaluation speaker is equal to or greater than a predetermined threshold are selected as candidates, and the plurality of adaptive speakers are selected from the selected candidates.
  4.  A speaker adaptive model generation method for creating an acoustic model adapted to the evaluation speaker using the sufficient statistics of the plurality of adaptive speakers selected by the adaptive speaker selection method according to any one of claims 1 to 3.
  5.  An adaptive speaker selection device that selects a plurality of adaptive speakers from a set of learning speakers in order to create an acoustic model adapted to an evaluation speaker, the device comprising:
     a learning speaker similarity storage unit that stores the speech similarity between every pair of learning speakers in the set of learning speakers;
     a similarity calculation unit that calculates the speech similarity between the evaluation speaker and each of the learning speakers; and
     a speaker selection unit that, based on the similarities calculated by the similarity calculation unit and the inter-learning-speaker similarities stored in the learning speaker similarity storage unit, selects as the adaptive speakers a plurality of learning speakers whose speech similarity to the evaluation speaker is as high as possible and whose speech similarity to one another is as small as possible.
  6.  The adaptive speaker selection device according to claim 5, wherein the speaker selection unit selects as the adaptive speakers N learning speakers that minimize the value of the potential function U shown in Equation (2).
     [Equation (2): formula image (Figure JPOXMLDOC01-appb-M000002) not reproduced]
     In Equation (2), r_ti is the inter-model distance between the evaluation speaker t and learning speaker i, r_ij is the inter-model distance between learning speakers i and j, and k_1, k_2, ..., l_1, l_2, ..., m_1, m_2, ..., n_1, n_2, ... are characteristic parameters of the potential function U.
  7.  The adaptive speaker selection device according to claim 5 or 6, wherein the speaker selection unit selects, as candidates, learning speakers whose similarity to the evaluation speaker is equal to or greater than a predetermined threshold, and selects the plurality of adaptive speakers from the selected candidates.
  8.  A speaker adaptive model generation device comprising:
     a sufficient statistic storage unit that stores the sufficient statistics of each learning speaker; and
     adaptive model creation means that creates an acoustic model adapted to the evaluation speaker using the sufficient statistics, stored in the sufficient statistic storage unit, of the plurality of adaptive speakers selected by the adaptive speaker selection device according to any one of claims 5 to 7.
  9.  A computer-readable recording medium storing a program that causes a computer to execute an adaptive speaker selection process for selecting a plurality of adaptive speakers from a set of learning speakers in order to create an acoustic model adapted to an evaluation speaker,
     wherein the adaptive speaker selection process is a process of selecting, as the adaptive speakers, a plurality of learning speakers whose speech similarity to the evaluation speaker is as high as possible and whose speech similarity to one another is as small as possible.
  10.  The recording medium according to claim 9, wherein the adaptive speaker selection process is a process of selecting as the adaptive speakers N learning speakers that minimize the value of the potential function U shown in Equation (3).
     [Equation (3): formula image (Figure JPOXMLDOC01-appb-M000003) not reproduced]
     In Equation (3), r_ti is the inter-model distance between the evaluation speaker t and learning speaker i, r_ij is the inter-model distance between learning speakers i and j, and k_1, k_2, ..., l_1, l_2, ..., m_1, m_2, ..., n_1, n_2, ... are characteristic parameters of the potential function U.
  11.  The recording medium according to claim 9 or 10, wherein the program further causes the computer to execute a candidate selection process for selecting, as candidates, learning speakers whose similarity to the evaluation speaker is equal to or greater than a predetermined threshold, and the adaptive speaker selection process is a process of selecting the plurality of adaptive speakers from the candidates selected by the candidate selection process.
  12.  A computer-readable recording medium storing a program that causes a computer to execute the adaptive speaker selection process according to any one of claims 9 to 11, and a process of creating an acoustic model adapted to the evaluation speaker using the sufficient statistics of the plurality of adaptive speakers selected by the adaptive speaker selection process.
PCT/JP2009/052379 2008-03-31 2009-02-13 Adaptive speaker selection device, adaptive speaker selection method, and recording medium WO2009122780A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010505436A JPWO2009122780A1 (en) 2008-03-31 2009-02-13 Adaptive speaker selection device, adaptive speaker selection method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2008092206 2008-03-31
JP2008-092206 2008-03-31

Publications (1)

Publication Number Publication Date
WO2009122780A1 (en)

Family

ID=41135179

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/052379 WO2009122780A1 (en) 2008-03-31 2009-02-13 Adaptive speaker selection device, adaptive speaker selection method, and recording medium

Country Status (2)

Country Link
JP (1) JPWO2009122780A1 (en)
WO (1) WO2009122780A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0420999A (en) * 1990-05-16 1992-01-24 Mitsubishi Electric Corp Standard speaker selector
JPH04324499A (en) * 1991-04-24 1992-11-13 Sharp Corp Speech recognition device
JPH08123466A (en) * 1994-10-28 1996-05-17 Mitsubishi Electric Corp Speech recognition device

Non-Patent Citations (3)

Title
KANAKO MATSUNAMI ET AL.: "Jubun Tokeiryo o Mochiita Kyoshi Nashi Washa Tekio Oyobi Kankyo Tekio", TRANSACTIONS OF INFORMATION PROCESSING SOCIETY OF JAPAN, vol. 43, no. 7, 15 July 2002 (2002-07-15), pages 2038 - 2045 *
MASAHIRO TANI ET AL.: "Jubun Tokeiryo o Mochiita Kyoshi Nashi Washa Tekio ni Okeru Washa Sentakuho", IEICE TECHNICAL REPORT, vol. 107, no. 405, 13 December 2007 (2007-12-13), pages 85 - 89 *
MITSURU SAMEJIMA ET AL.: "Kodomo Onsei ni Taisuru Jubun Tokeiryo ni Motozuku Kyoshi Nashi Washa Tekio no Kento", THE ACOUSTICAL SOCIETY OF JAPAN (ASJ) 2004 NEN SHUKI KENKYU HAPPYOKAI KOEN RONBUNSHU -I, 21 September 2004 (2004-09-21), pages 109 - 110 *

Also Published As

Publication number Publication date
JPWO2009122780A1 (en) 2011-07-28


Legal Events

121 — Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 09728331; Country of ref document: EP; Kind code of ref document: A1)
WWE — Wipo information: entry into national phase (Ref document number: 2010505436; Country of ref document: JP)
NENP — Non-entry into the national phase (Ref country code: DE)
122 — Ep: pct application non-entry in european phase (Ref document number: 09728331; Country of ref document: EP; Kind code of ref document: A1)