WO2009122780A1 - Adaptive speaker selection device, adaptive speaker selection method, and recording medium - Google Patents


Info

Publication number
WO2009122780A1
Authority
WO
WIPO (PCT)
Prior art keywords
speaker
adaptive
learning
speakers
similarity
Prior art date
Application number
PCT/JP2009/052379
Other languages
French (fr)
Japanese (ja)
Inventor
真宏 谷
江森 正
祥史 大西
孝文 越仲
Original Assignee
日本電気株式会社 (NEC Corporation)
Priority date
Filing date
Publication date
Application filed by NEC Corporation (日本電気株式会社)
Priority to JP2010505436A priority Critical patent/JPWO2009122780A1/en
Publication of WO2009122780A1 publication Critical patent/WO2009122780A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065Adaptation
    • G10L15/07Adaptation to the speaker

Definitions

  • the present invention relates to a technique for selecting an adaptive speaker from learning speakers in order to create an acoustic model adapted to an evaluation speaker.
  • Voice recognition systems are used in various fields.
  • To improve speech recognition accuracy, speaker adaptation technology, which adapts the acoustic model used in a speech recognition system to the user, is known, and various methods have been proposed for creating a speaker adaptation model (an acoustic model adapted to the user).
  • Patent Document 1 and Non-Patent Document 1 disclose a technique for creating a speaker adaptation model using sufficient statistics.
  • FIG. 8 shows a schematic example of a speaker adaptive model creation apparatus that implements this method.
  • The storage unit 10 includes a sufficient statistics storage unit 12 and a speaker model storage unit 14, and the data processing unit 30 includes a feature amount calculation unit 32, a similarity calculation unit 34, a speaker selection unit 36, and an adaptive model creation unit 38.
  • The speaker adaptive model creation device 1 creates an acoustic model for each speaker using a database composed of sample speech data of a plurality of speakers, selects several of these acoustic models, and adapts them to the uttering speaker (corresponding to the user described above) to create an acoustic model for that speaker.
  • In the following description, a speaker of sample speech data is referred to as a "learning speaker", and a probability model created for each learning speaker that represents the speaker's acoustic characteristics is referred to as a "speaker model".
  • the speaker to be adapted is called “evaluation speaker”, and the acoustic model adapted to the evaluation speaker is called “adaptive model”.
  • the speaker of the speaker model selected to create the adaptive model is called “adaptive speaker”.
  • The speaker adaptive model creation apparatus 1 creates an adaptive model through the following steps. 1. Creation of sufficient statistics and speaker models using the database
  • The speaker model is a probability model created for each learning speaker that represents the speaker's acoustic characteristics. Here, it is expressed by a 1-state, 64-mixture Gaussian mixture model (GMM: Gaussian Mixture Model) without distinguishing phonemes.
  • GMM is a probability model of observation data expressed by a mixed normal distribution.
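As an illustrative sketch of how a GMM speaker model is later used as a similarity measure (the likelihood of the evaluation speaker's features under each learning speaker's model), the following minimal example replaces the 1-state, 64-mixture models with toy 2-component, 2-dimensional diagonal-covariance GMMs; all parameter values here are invented for illustration.

```python
import math

def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        # per-component log densities, combined with log-sum-exp for stability
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
            comp_logs.append(ll)
        m = max(comp_logs)
        total += m + math.log(sum(math.exp(c - m) for c in comp_logs))
    return total

# Toy speaker models: (mixture weights, component means, component variances)
speaker_models = {
    "spk_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "spk_b": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
eval_frames = [[0.1, -0.2], [0.9, 1.1]]  # features close to spk_a's components

sims = {s: gmm_log_likelihood(eval_frames, *m) for s, m in speaker_models.items()}
best = max(sims, key=sims.get)
```

Ranking the learning speakers by this score and keeping the top N corresponds to the likelihood-based selection described below for Patent Document 1 and Non-Patent Document 1.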
  • Sufficient statistics are created for each learning speaker and expressed by a hidden Markov model (HMM: Hidden Markov Model).
  • "Sufficient statistics" means statistics sufficient to build an acoustic model from the database; here, the means, variances, and EM counts of the HMM are used.
  • The "EM count" is the frequency of the probability of transition from state i to the normal distribution of state j in the EM algorithm generally used when training an HMM.
  • The sufficient statistics are calculated by training once from a speaker-independent model with the EM algorithm using the learning speaker's speech data.
  • The sufficient statistics storage unit 12 and the speaker model storage unit 14 store, respectively, the sufficient statistics and the speaker model for each learning speaker calculated as described above. 2. Input of the evaluation speaker's voice data
  • the voice data of the evaluation speaker is input by the input means 20.
  • the input unit 20 receives the voice data of the evaluation speaker from a voice input device such as a microphone. 3. Selection of adaptive speakers and creation of adaptive models
  • the data processing means 30 of the speaker adaptive model creation device 1 is responsible for these processes.
  • the feature amount calculation unit 32 receives the voice data of the evaluation speaker input by the input unit 20, calculates the feature amount necessary for speech recognition, and outputs it to the similarity calculation unit 34.
  • the similarity calculation unit 34 reads the speaker model of each learning speaker stored in the speaker model storage unit 14, and the feature amount of the evaluation speaker received from the feature amount calculation unit 32 for each of these speaker models. And the combination of the similarity and the learning speaker corresponding to the similarity is output to the speaker selection unit 36.
  • the likelihood obtained by inputting the feature amount extracted from the speech of the evaluation speaker into the speaker model of the learning speaker is used as the similarity.
  • The speaker selection unit 36 selects, from the pairs of similarity and learning speaker output by the similarity calculation unit 34, the N learning speakers with the highest similarity, that is, the highest likelihood, as adaptive speakers.
  • An identifier (ID number or the like) indicating the adapted speaker is output to the adaptive model creation unit 38.
  • the number N of adaptive speakers is a constant determined empirically.
  • The adaptive model creation unit 38 receives the identifiers of the learning speakers selected as adaptive speakers from the speaker selection unit 36, reads the sufficient statistics of the learning speakers indicated by these identifiers from the sufficient statistics storage unit 12, and then creates and outputs an adaptive model using the read sufficient statistics, which is used for speech recognition of the evaluation speaker.
  • The process of creating the adaptive model using the sufficient statistics read from the sufficient statistics storage unit 12 is a statistical processing calculation represented by equations (1) to (3), where N mix is the number of mixture distributions, N state is the number of states, and N sel is the number of selected adaptive speakers.
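Equations (1) to (3) themselves do not survive in this extract. A common way to combine per-speaker HMM sufficient statistics (EM counts, means, variances) into an adapted Gaussian is the count-weighted average sketched below; this is a plausible reading of the statistical processing described above, not the patent's exact formulas.

```python
def merge_sufficient_stats(stats):
    """
    stats: list of (EM count, mean, variance) triples, one per selected
    adaptive speaker, for a single Gaussian of a single HMM state.
    Returns the count-weighted adapted mean and variance, using the
    identity var = E[x^2] - E[x]^2.
    """
    total = sum(c for c, _, _ in stats)
    mean = sum(c * m for c, m, _ in stats) / total
    # var_j + mean_j**2 recovers speaker j's second moment E[x^2]
    second = sum(c * (v + m * m) for c, m, v in stats) / total
    return mean, second - mean * mean

# Two adaptive speakers' statistics for one Gaussian: (count, mean, variance)
adapted_mean, adapted_var = merge_sufficient_stats([(100, 0.0, 1.0),
                                                    (300, 2.0, 1.0)])
# -> (1.5, 1.75): the speaker with 3x the data dominates the adapted mean,
#    and the spread between the two speakers' means inflates the variance.
```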
  • Non-Patent Document 2, for example, determines the number of adaptive speakers based on the distance between the evaluation speaker and the learning speakers in an acoustic feature space.
  • As the feature amount of speech data, for example, the mel-frequency cepstral coefficients (MFCC) described in Non-Patent Document 3 and their rates of change are known.
  • Patent Document 1 and Non-Patent Document 1 use the likelihood of the evaluation speaker's speech as the similarity and select learning speakers with high similarity as adaptive speakers. That is, only the similarity between the learning speakers' speech and the evaluation speaker's speech is used as the selection criterion for adaptive speakers. However, when not only the acoustic features but also the phonological features representing the utterance content are similar among the voices of the selected adaptive speakers, there is little variation in the adaptive speakers' utterance content, so the appearance frequency of the phonemes used for training becomes biased, which may degrade the accuracy of the adaptive model.
  • the present invention has been made in view of the above circumstances, and provides an adaptive speaker selection technique for avoiding deterioration in accuracy of an adaptive model.
  • One aspect of the present invention is an adaptive speaker selection method for selecting a plurality of adaptive speakers from a set of learning speakers in order to create an acoustic model adapted to an evaluation speaker.
  • A plurality of learning speakers whose speech similarity to the evaluation speaker is as high as possible and whose mutual speech similarity is as low as possible are selected as the adaptive speakers.
  • According to the adaptive speaker selection technique of the present invention, when an acoustic model adapted to the evaluation speaker is created using the acoustic models of the selected adaptive speakers, deterioration in the accuracy of the created acoustic model can be suppressed.
  • FIG. 1 is a diagram showing a schematic example of an adaptive speaker selection device for explaining the technique of the present invention. FIG. 2 is a diagram showing a configuration example of the similarity calculation unit in the adaptive speaker selection device shown in FIG. 1. FIG. 3 is a flowchart showing the flow of processing by the adaptive speaker selection device shown in FIG. 1. FIG. 4 is a flowchart showing the flow of processing of the similarity calculation unit of the example shown in FIG. 2. FIG. 5 is a flowchart showing an example of the flow of processing by the adaptive speaker selection unit in the adaptive speaker selection device shown in FIG. 1. FIG. 6 is a diagram showing the adaptive speaker model generation apparatus according to the embodiment of the present invention.
  • Each element described as a functional block performing various processes can be configured, in terms of hardware, by a processor, a memory, and other circuits, and in terms of software is realized by a program recorded in memory or loaded into it. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware alone, software alone, or a combination thereof, and are not limited to any one of these. For clarity, only the elements necessary for explaining the technique of the present invention are shown in the drawings.
  • FIG. 1 is an example of a schematic diagram of an adaptive speaker selection device 100 based on the technique according to the present invention.
  • the adaptive speaker selection device 100 includes a speaker model storage unit 112, a learning speaker similarity storage unit 114, a feature amount calculation unit 120, a similarity calculation unit 130, and an adaptive speaker selection unit 140.
  • the speaker model storage unit 112 stores a speaker model created for each learning speaker in association with the learning speaker.
  • a unique identification number is assigned to the learning speaker, and the speaker model is associated with the identification number.
  • the speaker model is expressed by GMM, for example.
  • The speaker model may be an HMM, SVM (Support Vector Machine), NN (Neural Network), or BN (Bayesian Network).
  • The learning speaker similarity storage unit 114 stores a similarity table indicating the speech similarity between every pair of learning speakers in the set of learning speakers whose speaker models are stored in the speaker model storage unit 112. The number of these similarities equals the number of pairs of learning speakers.
  • The reciprocal of the distance between the speaker models of two learning speakers, or the n-th power of the reciprocal, is used as the speech similarity between the two learning speakers (hereinafter simply the similarity between learning speakers).
  • For calculating the distance between speaker models, for example, KL divergence, which computes a statistical distance between two speaker models that are probability models, can be used.
  • the degree of similarity is not limited to that derived from the distance between models, and may be based on, for example, the likelihood of a learning speaker's voice or a feature amount extracted from the voice.
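The reciprocal-of-KL-divergence similarity has a simple closed form when each speaker model is a single Gaussian (for GMMs, KL divergence is usually approximated numerically). The sketch below uses that single-Gaussian simplification and a symmetrised KL distance; the model parameters are invented toy values.

```python
import math

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """Closed-form KL divergence KL(p||q) between two univariate Gaussians."""
    return (0.5 * math.log(var_q / var_p)
            + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q) - 0.5)

def similarity(model_a, model_b, n=1):
    """Similarity as the n-th power of the reciprocal of a symmetrised KL distance."""
    d = kl_gauss(*model_a, *model_b) + kl_gauss(*model_b, *model_a)
    return (1.0 / d) ** n

# Models are (mean, variance) pairs; closer models yield higher similarity.
close = similarity((0.0, 1.0), (0.5, 1.0))  # symmetrised KL = 0.25 -> sim 4.0
far = similarity((0.0, 1.0), (5.0, 1.0))    # much larger distance -> small sim
```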
  • the feature amount calculation unit 120 calculates a feature amount necessary for speech recognition from the voice signal (evaluation speaker voice signal) of the evaluation speaker, and outputs the feature amount to the similarity calculation unit 130.
  • the evaluation speaker voice signal is, for example, voice data of the evaluation speaker obtained by A / D conversion with a sampling frequency of 16 kHz and 16 bits.
  • The feature amount extracted by the feature amount calculation unit 120 is, for example, the mel-frequency cepstral coefficients (MFCC) described in Non-Patent Document 3 or their rates of change.
  • The feature amount calculation unit 120 cuts out the evaluation speaker voice signal in frames of about 10 msec and performs pre-emphasis, fast Fourier transform (FFT), filter bank analysis, and cosine transform to extract feature quantities as a time series of feature vectors.
  • the feature amount is not limited to this, and may be, for example, voice data itself as long as the feature of the voice can be expressed.
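The framing and transform steps above can be sketched as follows. This toy version uses a naive DFT in place of the FFT and omits the mel filter bank and cosine transform that full MFCC extraction would add, so it is a simplified illustration rather than the document's exact front end.

```python
import math

def frames_from_signal(signal, frame_len=160, hop=80):
    """Cut the signal into overlapping frames (10 ms at 16 kHz = 160 samples)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def pre_emphasis(frame, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return [frame[0]] + [frame[n] - alpha * frame[n - 1]
                         for n in range(1, len(frame))]

def power_spectrum(frame):
    """Hamming window followed by a naive DFT (an FFT in practice)."""
    N = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
                for n, x in enumerate(frame)]
    spec = []
    for k in range(N // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * n / N)
                 for n, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * n / N)
                  for n, x in enumerate(windowed))
        spec.append(re * re + im * im)
    return spec

# A 20 ms toy signal: a 1 kHz tone sampled at 16 kHz (peak lands in DFT bin 10)
sig = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(320)]
feats = [power_spectrum(pre_emphasis(f)) for f in frames_from_signal(sig)]
```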
  • The similarity calculation unit 130 calculates the similarity between the evaluation speaker and each learning speaker using the feature amount of the evaluation speaker voice signal extracted by the feature amount calculation unit 120. Specifically, for example, the speaker model of each learning speaker is read from the speaker model storage unit 112, and the likelihood for the evaluation speaker's feature amount is calculated as the similarity for each speaker model.
  • the similarity calculation unit 130 includes an evaluation speaker model creation unit 132 and a similarity calculation execution unit 134.
  • the evaluation speaker model creation unit 132 creates a speaker model of the evaluation speaker (hereinafter referred to as an evaluation speaker model) using the feature amount of the evaluation speaker obtained by the feature amount calculation unit 120.
  • the evaluation speaker model has the same format as the speaker model of the learning speaker stored in the speaker model storage unit 112. For example, if the speaker model is expressed in GMM, the evaluation speaker model The creation unit 132 creates an evaluation speaker model in the GMM format.
  • The similarity calculation execution unit 134 reads each speaker model from the speaker model storage unit 112 and calculates the similarity between each speaker model and the evaluation speaker model created by the evaluation speaker model creation unit 132. Specifically, for example, the inter-model distance between the evaluation speaker model and each speaker model is calculated using KL divergence, and the reciprocal of the inter-model distance, or the n-th power (n: a positive number) of the reciprocal, is derived as the similarity.
  • the similarity calculation unit 130 outputs the calculated similarities to the adaptive speaker selection unit 140.
  • The adaptive speaker selection unit 140 selects N adaptive speakers using the similarity between the evaluation speaker and the learning speakers calculated by the similarity calculation unit 130 and the learning speaker similarities stored in the learning speaker similarity storage unit 114.
  • The number N of adaptive speakers to be selected may be determined by any conventionally known method. For example, as described in Non-Patent Document 1, it may be determined empirically as a constant, or, as described in Non-Patent Document 2, it may be determined based on the distance between the evaluation speaker and the learning speakers in an acoustic feature space.
  • The adaptive speaker selection unit 140 makes the selection such that "the similarity between the evaluation speaker and the learning speakers is as high as possible, and the similarity among the learning speakers is as low as possible".
  • an adaptive speaker selection method performed by the adaptive speaker selection unit 140 will be described.
  • One method defines, as a potential function, the sum of a decreasing function of the similarity between the evaluation speaker and the adaptive speakers and an increasing function of the similarity among the learning speakers, and selects the learning speakers that minimize the value of the potential function.
  • N learning speakers who minimize the potential function U are selected using Equation (4).
  • N is the number of adaptive speakers to be selected as described above.
  • r ti is the inter-model distance between the evaluation speaker t and the learning speaker i, r ij is the inter-model distance between the learning speaker i and the learning speaker j, and both can be calculated using KL divergence.
  • The parameters of the potential function can be set by performing speech recognition experiments using development data and choosing values that give high recognition performance.
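Equation (4) does not survive in this extract, but one concrete potential of the stated shape is U(S) = Σ over i in S of r_ti + α · Σ over pairs i&lt;j in S of 1/r_ij, which falls as evaluation-speaker similarity (1/r_ti) grows and rises as inter-speaker similarity (1/r_ij) grows. The exhaustive minimization below is a sketch under that assumed form; the distances are invented toy values, not data from the patent.

```python
from itertools import combinations

def potential(subset, r_eval, r_pair, alpha=1.0):
    """U(S): evaluation-to-speaker distances (decreasing function of their
    similarity) plus alpha times reciprocal pairwise distances (increasing
    function of inter-speaker similarity)."""
    u = sum(r_eval[i] for i in subset)
    u += alpha * sum(1.0 / r_pair[i][j]
                     for i, j in combinations(sorted(subset), 2))
    return u

def select_adaptive_speakers(n, r_eval, r_pair, alpha=1.0):
    """Exhaustively pick the n speakers minimising U (candidate narrowing,
    described below, keeps this tractable for large speaker sets)."""
    return min(combinations(list(r_eval), n),
               key=lambda s: potential(s, r_eval, r_pair, alpha))

# Toy inter-model distances: speakers 0-2 are all close to the evaluation
# speaker, but 0 and 1 are also very close to each other.
r_eval = {0: 1.0, 1: 1.1, 2: 1.2, 3: 5.0}
r_pair = {0: {1: 0.1, 2: 4.0, 3: 4.0}, 1: {2: 4.0, 3: 4.0}, 2: {3: 4.0}}
chosen = select_adaptive_speakers(2, r_eval, r_pair)  # avoids the 0-1 cluster
```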
  • Next, another example of the adaptive speaker selection method performed by the adaptive speaker selection unit 140 will be described.
  • First, candidate learning speakers for the adaptive speakers are narrowed down. Specifically, for example, learning speakers whose similarity to the evaluation speaker is equal to or greater than a predetermined threshold are selected as candidates. Thereafter, the learning speaker similarities for the selected candidate learning speakers are read from the learning speaker similarity storage unit 114, and adaptive speakers are selected from the candidate learning speakers using the potential function in the same manner as in the first example method described above.
  • Alternatively, the candidates may be determined as the adaptive speakers without performing the process of selecting adaptive speakers from among the candidates.
  • Alternatively, M learning speakers (M > N) may be selected as candidates in descending order of similarity to the evaluation speaker, and then adaptive speakers may be selected from the candidate learning speakers using the potential function as in the first example method described above.
  • The method of selecting adaptive speakers after first narrowing down the number of candidate learning speakers can improve the processing speed. For example, when there are 1000 learning speakers and 10 adaptive speakers are to be selected from them, selecting the adaptive speakers without narrowing down the candidates requires evaluating equation (4) or equation (5) 1000 C 10 times. On the other hand, if the adaptive speakers are selected after narrowing down to 30 candidates, the number of evaluations of equation (4) or equation (5) is reduced to 30 C 10 times.
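The reduction in the number of candidate subsets can be checked directly with binomial coefficients:

```python
from math import comb

# Selecting 10 adaptive speakers directly from 1000 learning speakers
direct = comb(1000, 10)    # roughly 2.6e23 candidate subsets
# After narrowing down to 30 candidate learning speakers
narrowed = comb(30, 10)    # 30,045,015 candidate subsets
speedup = direct // narrowed
```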
  • FIG. 3 is a flowchart showing a flow of processing by the adaptive speaker selection device 100 shown in FIG.
  • the feature amount calculation unit 120 calculates the feature amount of the evaluation speaker voice signal (S10).
  • Next, the similarity calculation unit 130 calculates, for the speaker model of each learning speaker stored in the speaker model storage unit 112, the similarity to the evaluation speaker using the feature amount of the evaluation speaker voice signal calculated by the feature amount calculation unit 120 (S20).
  • The similarity may be calculated by, for example, computing the likelihood for the feature amount of the evaluation speaker voice signal for each speaker model, or, as shown in FIG. 4, by creating an evaluation speaker model from the feature amount of the evaluation speaker voice signal (S22) and calculating the similarity between each speaker model and the evaluation speaker model (S24).
  • Then, based on the similarity between the evaluation speaker and the learning speakers and the similarities among the learning speakers, the adaptive speaker selection unit 140 selects as adaptive speakers N learning speakers such that "the similarity between the evaluation speaker and the learning speakers is as high as possible, and the similarity among the learning speakers is as low as possible" (S30).
  • the adaptive speaker may be selected directly from all the learning speakers, or, as shown in FIG. 5, M persons (M > N) candidates may be selected (S32), and N adaptive speakers may be selected from the selected M candidates (S34).
  • As described above, learning speakers are selected as adaptive speakers such that "the similarity between the evaluation speaker and the learning speakers is as high as possible, and the similarity among the learning speakers is as low as possible," so the variation in the adaptive speakers' utterance content is prevented from becoming small. Deterioration in the accuracy of the adaptive model created using the adaptive speakers' sufficient statistics can therefore be suppressed.
  • FIG. 6 shows an adaptive speaker model generation apparatus 200 according to the embodiment of the present invention.
  • the adaptive speaker model generation apparatus 200 includes a storage unit 210, an input unit 220, and a data processing unit 230.
  • The storage unit 210 includes a sufficient statistics storage unit 212 that stores, for each learning speaker ID, the sufficient statistics obtained for that learning speaker; a speaker model storage unit 214 that stores, for each learning speaker ID, the acoustic model of that learning speaker; and a learning speaker similarity storage unit 216 that stores a similarity table indicating the speech similarity between every pair of learning speakers in the set of learning speakers.
  • the input unit 220 receives the voice signal of the evaluation speaker from a voice input device such as a microphone and inputs it to the data processing unit 230.
  • the data processing unit 230 includes a feature amount calculation unit 232, a similarity calculation unit 234, a speaker selection unit 236, and an adaptive model creation unit 238.
  • the feature amount calculation unit 232 receives the evaluation speaker voice signal from the input unit 220, calculates a feature amount necessary for speech recognition, and outputs the feature amount to the similarity calculation unit 234. Note that the specific method of calculating the feature amount by the feature amount calculation unit 232 may be any of the methods used by the feature amount calculation unit 120 in the adaptive speaker selection device 100 illustrated in FIG.
  • The similarity calculation unit 234 reads the speaker model of each learning speaker stored in the speaker model storage unit 214, calculates for each speaker model the similarity to the feature amount of the evaluation speaker voice signal received from the feature amount calculation unit 232, and outputs pairs of the similarity and the ID of the corresponding learning speaker to the speaker selection unit 236. The type of similarity calculated by the similarity calculation unit 234 and its calculation method are the same as those of the similarity calculation unit 130 in the adaptive speaker selection device 100, so their description is omitted.
  • Like the adaptive speaker selection unit 140, the speaker selection unit 236 selects as adaptive speakers N learning speakers such that "the similarity between the evaluation speaker and the learning speakers is as high as possible, and the similarity among the learning speakers is as low as possible." The speaker selection unit 236 outputs the IDs of the selected N adaptive speakers to the adaptive model creation unit 238.
  • The adaptive model creation unit 238 reads from the sufficient statistics storage unit 212 the sufficient statistics corresponding to the IDs of the N adaptive speakers output from the speaker selection unit 236, and creates an acoustic model adapted to the evaluation speaker (an adaptive model) by statistical processing.
  • the adaptive model creation method by the adaptive model creation unit 238 is not limited to the above-described method.
  • For example, the adaptive model creation unit 238 may convert the likelihoods of the adaptive speakers' speaker models for the evaluation speaker's feature amount, calculated by the similarity calculation unit 234, into weighting coefficients and integrate each adaptive speaker's sufficient statistics with the corresponding weight, or may integrate the adaptive speakers' speaker models weighted by arbitrary coefficients.
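A sketch of the likelihood-weighted variant mentioned above. The weighting scheme here, normalising per-speaker log-likelihoods with log-sum-exp, is one plausible choice, not the patent's prescribed one, and the numbers are invented for illustration.

```python
import math

def likelihood_weights(log_likelihoods):
    """Normalise per-speaker log-likelihoods into weights summing to 1
    (log-sum-exp for numerical stability)."""
    m = max(log_likelihoods)
    exps = [math.exp(ll - m) for ll in log_likelihoods]
    s = sum(exps)
    return [e / s for e in exps]

def weighted_mean(weights, speaker_means):
    """Integrate one statistic (here, a Gaussian mean) across adaptive
    speakers with the likelihood-derived weights."""
    return sum(w * mu for w, mu in zip(weights, speaker_means))

# Three adaptive speakers; the third matched the evaluation speaker worst.
w = likelihood_weights([-100.0, -100.0, -103.0])
adapted = weighted_mean(w, [0.0, 1.0, 2.0])  # dominated by the first two
```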
  • FIG. 7 is a flowchart showing the flow of processing by the adaptive speaker model generation apparatus 200.
  • Steps S50 to S70 constitute the processing up to the selection of adaptive speakers, and are the same as the corresponding processing by the adaptive speaker selection device 100 described above.
  • In step S80, the adaptive model creation unit 238 in the adaptive speaker model generation apparatus 200 creates an acoustic model adapted to the evaluation speaker using the sufficient statistics of the N adaptive speakers selected by the speaker selection unit 236.
  • As described above, the adaptive speaker model generation apparatus 200 selects adaptive speakers by the same method as the adaptive speaker selection device 100 shown in FIG. 1 and creates a model adapted to the evaluation speaker, so deterioration in the accuracy of the created model can be suppressed.
  • the present invention is used, for example, in a technique for selecting an adaptive speaker from learning speakers in order to create an acoustic model adapted to an evaluation speaker.

Abstract

A feature quantity calculation unit of an adaptive speaker selection device calculates the feature quantity of a voice signal of an evaluation speaker. A similarity calculation unit calculates, regarding speaker models of respective learning speakers, the degrees of similarity to the evaluation speaker by using the feature quantity of the voice signal of the evaluation speaker calculated by the feature quantity calculation unit (S20). An adaptive speaker selection unit selects N learning speakers such that “the degrees of similarity between the evaluation speaker and the learning speakers are as high as possible and the degree of similarity between the learning speakers is as low as possible” as adaptive speakers on the basis of the degrees of similarity between the evaluation speaker and the learning speakers and the degree of similarity between the learning speakers (S30). Consequently, the adaptive speakers can be selected so that the deterioration of the accuracy of a speaker-adaptive model can be suppressed.

Description

Adaptive speaker selection apparatus, adaptive speaker selection method, and recording medium
 The present invention relates to a technique for selecting adaptive speakers from learning speakers in order to create an acoustic model adapted to an evaluation speaker.
 Speech recognition systems are used in various fields. To improve speech recognition accuracy, speaker adaptation technology, which adapts the acoustic model used in a speech recognition system to the user, is known, and various methods have been proposed for creating a speaker adaptation model (an acoustic model adapted to the user).
 Patent Document 1 and Non-Patent Document 1 disclose a technique for creating a speaker adaptation model using sufficient statistics. FIG. 8 shows a schematic example of a speaker adaptive model creation apparatus that implements this technique.
 The speaker adaptive model creation apparatus 1 shown in FIG. 8 includes storage means 10, input means 20, and data processing means 30. The storage means 10 includes a sufficient statistics storage unit 12 and a speaker model storage unit 14, and the data processing means 30 includes a feature amount calculation unit 32, a similarity calculation unit 34, a speaker selection unit 36, and an adaptive model creation unit 38.
 The speaker adaptive model creation apparatus 1 creates an acoustic model for each speaker using a database composed of sample speech data of a plurality of speakers, selects several of these acoustic models, and adapts them to the uttering speaker (corresponding to the user described above) to create an acoustic model for that speaker. In the following description of this specification, a speaker of sample speech data is referred to as a "learning speaker", and a probability model created for each learning speaker that represents the speaker's acoustic characteristics is referred to as a "speaker model". The speaker to be adapted to is called the "evaluation speaker", the acoustic model adapted to the evaluation speaker is called the "adaptive model", and the speaker of a speaker model selected to create the adaptive model is called an "adaptive speaker".
 話者適応モデル作成装置1は、下記のステップを経て適応モデルを作成する。
1.データベースを用いて十分統計量と話者モデルを作成する。
 話者モデルは、学習話者毎に作成された、該話者の音響的な特徴を表す確率モデルである。ここでは、音素を区別することなく1状態64混合の混合ガウス分布モデル(GMM:Gaussian Mixture Model)で表現される。なお、GMMは、混合正規分布で表現した観測データの確率モデルである。
The speaker adaptive model creation apparatus 1 creates an adaptive model through the following steps.
1. Create sufficient statistics and speaker models using the database.
The speaker model is a probability model created for each learning speaker and representing the acoustic characteristics of the speaker. Here, it is expressed by a mixed Gaussian distribution model (GMM: Gaussian Mixture Model) with one state 64 without distinguishing phonemes. GMM is a probability model of observation data expressed by a mixed normal distribution.
 十分統計量は、学習話者毎に作成され、隠れマルコフモデル(HMM:Hidden Markov Model)で表現される。「十分統計量」とは、データベースから音響モデルを構築するために十分な統計量のことを意味し、ここでは、HMMにおける平均、分散、およびEMカウントが用いられる。なお、「EMカウント」は、HMMを学習する際に一般的に用いられるEMアルゴリズムにおいて、状態iから状態jの正規分布に遷移する確率の度数である。十分統計量は、当該学習話者の音声データを用いて、EMアルゴリズで不特定話者モデルから1回学習することにより算出される。 Sufficient statistics are created for each learning speaker and expressed in a hidden Markov model (HMM: Hidden Markov Model). By “sufficient statistics” is meant sufficient statistics to build an acoustic model from a database, where mean, variance, and EM count in the HMM are used. The “EM count” is the frequency of the probability of transition from the state i to the normal distribution of the state j in the EM algorithm generally used when learning the HMM. The sufficient statistic is calculated by learning once from the unspecified speaker model with the EM algorithm using the speech data of the learning speaker.
 In the speaker adaptive model creation device 1, the sufficient statistic storage unit 12 and the speaker model storage unit 14 store, respectively, the sufficient statistics and the speaker model calculated as described above for each learning speaker.
2. Input of the evaluation speaker's speech data
 In the speaker adaptive model creation device 1, the input means 20 inputs the speech data of the evaluation speaker. The input means 20 receives the evaluation speaker's speech data from a voice input device such as a microphone.
3. Selection of adaptive speakers and creation of the adaptive model
 The data processing means 30 of the speaker adaptive model creation device 1 is responsible for these processes.
 The feature amount calculation unit 32 receives the evaluation speaker's speech data input by the input means 20, calculates the feature amounts necessary for speech recognition, and outputs them to the similarity calculation unit 34.
 The similarity calculation unit 34 reads the speaker model of each learning speaker stored in the speaker model storage unit 14, calculates, for each speaker model, its similarity to the evaluation speaker's feature amounts received from the feature amount calculation unit 32, and outputs each pair of a similarity and the corresponding learning speaker to the speaker selection unit 36.
 Here, the likelihood obtained by inputting the feature amounts extracted from the evaluation speaker's speech into a learning speaker's speaker model is used as the similarity: the larger the likelihood, the higher the similarity.
 From the pairs of similarities and learning speakers output by the similarity calculation unit 34, the speaker selection unit 36 selects as adaptive speakers the N learning speakers with the highest similarity, that is, the highest likelihood, and outputs identifiers (such as ID numbers) of the selected adaptive speakers to the adaptive model creation unit 38. The number N of adaptive speakers is a constant determined empirically.
 The adaptive model creation unit 38 receives from the speaker selection unit 36 the identifiers of the learning speakers selected as adaptive speakers, and reads the sufficient statistics of the learning speakers indicated by these identifiers from the sufficient statistic storage unit 12. It then creates and outputs an adaptive model using the read sufficient statistics, for use in speech recognition of the evaluation speaker.
 The process of creating an adaptive model from the sufficient statistics read out of the sufficient statistic storage unit 12 is, specifically, the statistical calculation given by equations (1) to (3) below.
Figure JPOXMLDOC01-appb-M000004
Figure JPOXMLDOC01-appb-M000005
Figure JPOXMLDOC01-appb-M000006
 Here, μ_i^adp (i = 1, ..., N_mix) and ν_i^adp (i = 1, ..., N_mix) are, respectively, the mean and variance of the normal distribution in each state of the adaptive model's HMM, and N_mix is the number of mixture components. a_adp[i][j] (i = 1, ..., N_state, j = 1, ..., N_state) is the transition probability from state i to state j, and N_state is the number of states. N_sel is the number of selected adaptive speakers, and μ_i^j (i = 1, ..., N_mix, j = 1, ..., N_sel) and ν_i^j (i = 1, ..., N_mix, j = 1, ..., N_sel) are, respectively, the mean and variance of the acoustic model of the j-th selected adaptive speaker. C_mix^j (j = 1, ..., N_sel) and C_state^k[i][j] (k = 1, ..., N_sel, i = 1, ..., N_state, j = 1, ..., N_state) are, respectively, the EM counts for the normal distributions and the EM counts for the state transitions.
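Equations (1) to (3) appear above only as image placeholders. The sketch below implements the count-weighted combination conventionally used with this sufficient-statistics approach (cf. Non-Patent Document 1): means are averaged with EM-count weights, variances are combined via second moments, and transition probabilities come from pooled transition counts. The array layout and the exact formulas are assumptions under that reading, not a verbatim transcription of the patent's equations.

```python
import numpy as np

def combine_sufficient_statistics(means, variances, c_mix, c_state):
    """Combine per-speaker sufficient statistics into an adaptive model.

    means, variances: (Nsel, Nmix) per-speaker Gaussian means/variances
    c_mix:            (Nsel, Nmix) EM counts for the normal distributions
    c_state:          (Nsel, Nstate, Nstate) EM counts for state transitions
    """
    total = c_mix.sum(axis=0)                       # sum over speakers, per mixture
    # eq. (1) analogue: count-weighted mean
    mu_adp = (c_mix * means).sum(axis=0) / total
    # eq. (2) analogue: combine second moments, then subtract the new mean squared
    second = (c_mix * (variances + means ** 2)).sum(axis=0) / total
    var_adp = second - mu_adp ** 2
    # eq. (3) analogue: pooled transition counts, normalized per source state
    trans = c_state.sum(axis=0)
    a_adp = trans / trans.sum(axis=1, keepdims=True)
    return mu_adp, var_adp, a_adp
```

With a single selected speaker this reduces to that speaker's own statistics, which is a quick sanity check on the weighting.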
 In the method described above, the number N of adaptive speakers is empirically fixed at a constant; however, as described in Non-Patent Document 2, for example, N can also be determined based on the inter-speaker distance between the evaluation speaker and the learning speakers in the acoustic feature space.
 Known feature amounts for speech data include, for example, the mel-frequency cepstral coefficients (MFCC) described in Non-Patent Document 3 and their rates of change.
[Patent Document 1] Japanese Patent No. 3756879
[Non-Patent Document 1] Shinichi Yoshizawa, Akira Baba, Kanako Matsunami, Yuichiro Mera, Miichi Yamada, Akinobu Lee, Kiyohiro Shikano, "Unsupervised Training of Phone Models Using Sufficient Statistics and Speaker Distance", IEICE Transactions, D-II, Vol. J85-D-II, No. 3, pp. 382-389, March 2002
[Non-Patent Document 2] Masahiro Tani, Tadashi Emori, Yoshifumi Onishi, Takafumi Koshinaka, Koichi Shinoda, "Speaker Selection for Unsupervised Speaker Adaptation Using Sufficient Statistics", IEICE Technical Report, Vol. 107, No. 406, pp. 85-89, December 2007
[Non-Patent Document 3] Kiyohiro Shikano, Katsunobu Itou, Tatsuya Kawahara, Kazuya Takeda, Mikio Yamamoto, "Speech Recognition Systems", Ohmsha, 2001, pp. 13-15
 The methods described in Patent Document 1 and Non-Patent Document 1 use the likelihood of the evaluation speaker's speech as the similarity and select learning speakers with high similarity as adaptive speakers. That is, only the speech similarity between the learning speakers and the evaluation speaker is used as the selection criterion for the adaptive speakers. For example, when not only the acoustic features but also the phonological features representing the utterance content are similar among the voices of the selected adaptive speakers, there is little variation in the adaptive speakers' utterance content, so the frequencies of the phonemes used for training become biased, which may degrade the accuracy of the adaptive model.
 The present invention has been made in view of the above circumstances and provides an adaptive speaker selection technique for avoiding such degradation in the accuracy of the adaptive model.
 One aspect of the present invention is an adaptive speaker selection method that selects a plurality of adaptive speakers from a set of learning speakers in order to create an acoustic model adapted to an evaluation speaker. The method selects, as adaptive speakers, a plurality of learning speakers whose speech similarity to the evaluation speaker is as high as possible and whose speech similarity to one another is as low as possible.
 An apparatus that executes the method of the above aspect, and a program that causes a computer to execute the method, are also effective as aspects of the present invention.
 According to the adaptive speaker selection technique of the present invention, when an acoustic model adapted to the evaluation speaker is created using the acoustic models of the selected adaptive speakers, degradation in the accuracy of the created acoustic model can be suppressed.
FIG. 1 is a schematic diagram of an adaptive speaker selection device for explaining the technique according to the present invention.
FIG. 2 shows a configuration example of the similarity calculation unit in the adaptive speaker selection device shown in FIG. 1.
FIG. 3 is a flowchart showing the flow of processing by the adaptive speaker selection device shown in FIG. 1.
FIG. 4 is a flowchart showing the flow of processing of the similarity calculation unit of the example shown in FIG. 2.
FIG. 5 is a flowchart showing an example of the flow of processing by the adaptive speaker selection unit in the adaptive speaker selection device shown in FIG. 1.
FIG. 6 shows an adaptive speaker model generation device according to an embodiment of the present invention.
FIG. 7 is a flowchart showing the flow of processing by the adaptive speaker model generation device shown in FIG. 6.
FIG. 8 is a schematic diagram of the speaker adaptive model creation device used to explain the prior art.
Explanation of Reference Numerals
1 speaker adaptive model creation device; 10 storage means
12 sufficient statistic storage unit; 14 speaker model storage unit
20 input means; 30 data processing means
32 feature amount calculation unit; 34 similarity calculation unit
36 speaker selection unit; 38 adaptive model creation unit
100 adaptive speaker selection device; 112 speaker model storage unit
114 learning-speaker similarity storage unit; 120 feature amount calculation unit
130 similarity calculation unit; 132 evaluation speaker model creation unit
134 similarity calculation execution unit; 140 adaptive speaker selection unit
200 adaptive speaker model generation device; 210 storage means
212 sufficient statistic storage unit; 214 speaker model storage unit
216 learning-speaker similarity storage unit; 220 input means
230 data processing means; 232 feature amount calculation unit
234 similarity calculation unit; 236 speaker selection unit
238 adaptive model creation unit
 In the drawings used in the following description, each element depicted as a functional block performing various processes can be implemented, in hardware, by a processor, memory, and other circuits, and, in software, by a program recorded in or loaded into memory. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware alone, software alone, or a combination thereof, and they are not limited to any one of these. For clarity, these drawings show only what is necessary to explain the technique of the present invention.
 Before describing specific embodiments of the present invention, the principle of the present invention will first be described.
 FIG. 1 is an example of a schematic diagram of an adaptive speaker selection device 100 based on the technique according to the present invention. The adaptive speaker selection device 100 includes a speaker model storage unit 112, a learning-speaker similarity storage unit 114, a feature amount calculation unit 120, a similarity calculation unit 130, and an adaptive speaker selection unit 140.
 The speaker model storage unit 112 stores the speaker model created for each learning speaker in association with that learning speaker. As an association method, for example, a unique identification number is assigned to each learning speaker, and the speaker model is associated with that identification number. The speaker model is expressed here as a GMM, but it may also be, for example, an HMM, an SVM (Support Vector Machine), an NN (Neural Network), or a BN (Bayesian Network).
 The learning-speaker similarity storage unit 114 stores a similarity table indicating the speech similarity between every pair of learning speakers in the set of learning speakers whose speaker models are stored in the speaker model storage unit 112. The number of these similarities equals the number of pairs of learning speakers.
 As the speech similarity between two learning speakers (hereinafter simply the learning-speaker similarity), for example, the reciprocal of the distance between the two learning speakers' speaker models, or its n-th power (n: a positive number), is used. The distance between speaker models can be calculated using, for example, the KL divergence, which gives a statistical distance between two probability models. The similarity is not limited to one derived from an inter-model distance; it may also be based on, for example, the likelihood of a learning speaker's speech or of feature amounts extracted from that speech.
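For single Gaussians the KL divergence has a closed form, so the reciprocal-distance similarity can be sketched as below. This is an illustration only: the diagonal-Gaussian parameterization and the symmetrization are assumptions, and for full GMMs the KL divergence has no closed form and is typically approximated (e.g., by Monte Carlo sampling).

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """KL divergence D(p0 || p1) between two diagonal-covariance Gaussians."""
    mu0, var0, mu1, var1 = map(np.asarray, (mu0, var0, mu1, var1))
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def speaker_similarity(model_a, model_b, n=1):
    """Reciprocal (raised to the n-th power) of a symmetrized KL distance.

    model_a, model_b: (mean, variance) tuples for two speaker models.
    """
    d = 0.5 * (gaussian_kl(*model_a, *model_b) + gaussian_kl(*model_b, *model_a))
    return 1.0 / d ** n
```

Because KL divergence is asymmetric, the average of the two directions is used here as the distance before taking the reciprocal.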
 The feature amount calculation unit 120 calculates, from the speech signal of the evaluation speaker (evaluation speaker speech signal), the feature amounts necessary for speech recognition and outputs them to the similarity calculation unit 130. The evaluation speaker speech signal is, for example, the evaluation speaker's speech data obtained by 16-bit A/D conversion at a sampling frequency of 16 kHz. The feature amounts extracted by the feature amount calculation unit 120 are, for example, the mel-frequency cepstral coefficients (MFCC) described in Non-Patent Document 3 and their rates of change. In this case, the feature amount calculation unit 120 cuts the evaluation speaker speech signal into fixed-length segments of about 10 msec, called frames, and performs pre-emphasis, fast Fourier transform (FFT), filter bank analysis, and cosine transform to extract feature amounts in the form of a time series of feature vectors. Of course, the feature amounts are not limited to these; they may even be the speech data itself, as long as they can represent the characteristics of the speech.
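The MFCC pipeline just described can be sketched end to end as follows. This is a rough illustration: the frame length, FFT size, filter count, and coefficient count are assumed values, not parameters taken from the patent.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_filt=24, n_ceps=12):
    """Minimal MFCC sketch: pre-emphasis, framing, FFT, mel filter bank, DCT."""
    signal = np.asarray(signal, dtype=float)
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame] for i in range(n_frames)])
    frames *= np.hamming(frame)
    power = np.abs(np.fft.rfft(frames, n=512)) ** 2                 # power spectrum
    # triangular mel-spaced filter bank between 0 Hz and sr/2
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filt + 2))
    bins = np.floor((512 + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filt, 257))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fb.T + 1e-10)
    # DCT-II to decorrelate; keep the first n_ceps cepstral coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filt)))
    return log_energy @ dct.T
```

The delta features mentioned in the text would be obtained by differencing this time series across frames.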
 The similarity calculation unit 130 calculates the similarity between the evaluation speaker and each learning speaker using the feature amounts of the evaluation speaker speech signal extracted by the feature amount calculation unit 120. Specifically, for example, it reads the speaker model of each learning speaker from the speaker model storage unit 112 and, for each speaker model, calculates the likelihood of the evaluation speaker's feature amounts as the similarity.
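This likelihood-as-similarity computation can be sketched as follows for diagonal-covariance GMM speaker models; the `(weights, means, variances)` model layout and the per-frame averaging are assumed conventions, not taken from the patent.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Average per-frame log-likelihood of feature vectors x under a diagonal GMM."""
    log_p = (-0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
             + (((x[:, None, :] - means) ** 2) / variances).sum(axis=2))
             + np.log(weights))
    m = log_p.max(axis=1, keepdims=True)              # log-sum-exp, stably
    return float(np.mean(m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))))

def rank_learning_speakers(features, speaker_models):
    """Return learning-speaker IDs sorted by similarity (likelihood), highest first."""
    scores = {sid: gmm_log_likelihood(features, *model)
              for sid, model in speaker_models.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The head of this ranking corresponds to the likelihood-only selection of the prior art; the invention additionally penalizes mutual similarity among the selected speakers, as described below.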
 The calculation of the similarity between the evaluation speaker and a learning speaker is not limited to the above method. An example of another method will be described with reference to FIG. 2.
 In this method, the similarity calculation unit 130 includes an evaluation speaker model creation unit 132 and a similarity calculation execution unit 134. The evaluation speaker model creation unit 132 creates a speaker model of the evaluation speaker (hereinafter the evaluation speaker model) using the evaluation speaker's feature amounts obtained by the feature amount calculation unit 120. The evaluation speaker model has the same form as the learning speakers' speaker models stored in the speaker model storage unit 112; for example, if the speaker models are expressed as GMMs, the evaluation speaker model creation unit 132 creates the evaluation speaker model in GMM form.
 The similarity calculation execution unit 134 reads each speaker model from the speaker model storage unit 112 and calculates, for each speaker model, its similarity to the evaluation speaker model created by the evaluation speaker model creation unit 132. Specifically, for example, the inter-model distance between the evaluation speaker model and the speaker model is calculated using the KL divergence, and the reciprocal of that distance, or its n-th power (n: a positive number), is derived as the similarity.
 The similarity calculation unit 130 outputs each calculated similarity to the adaptive speaker selection unit 140.
 The adaptive speaker selection unit 140 selects N adaptive speakers using the similarities between the evaluation speaker and the learning speakers calculated by the similarity calculation unit 130 and the learning-speaker similarities stored in the learning-speaker similarity storage unit 114. The number N of adaptive speakers to select may be determined by any conventionally known method. For example, as described in Non-Patent Document 1, it may be fixed empirically at a constant, or, as described in Non-Patent Document 2, it may be determined based on the inter-speaker distance between the evaluation speaker and the learning speakers in the acoustic feature space.
 Specifically, the adaptive speaker selection unit 140 makes the selection so that "the similarity between the evaluation speaker and the learning speakers is as large as possible, and the similarity among the learning speakers is as small as possible." Examples of the adaptive speaker selection method used by the adaptive speaker selection unit 140 are described below.
 One method takes as a potential function the sum of a decreasing function of the similarity between the evaluation speaker and the adaptive speakers and an increasing function of the similarity among the learning speakers, and selects as adaptive speakers the learning speakers that minimize the value of this potential function. Specifically, the N learning speakers that minimize the potential function U of equation (4) are selected.
Figure JPOXMLDOC01-appb-M000007
 In equation (4), N is the number of adaptive speakers to select, as described above. r_ti is the inter-model distance between the evaluation speaker t and the learning speaker i, and r_ij is the inter-model distance between the learning speakers i and j; both can be calculated using the KL divergence. The parameters k_1, k_2, ..., l_1, l_2, ..., m_1, m_2, ..., n_1, n_2, ... characterizing the potential function U are set, for example, by running speech recognition experiments on development data so that recognition performance is high. To simplify the calculation, equation (5), obtained from equation (4) by setting k_1 = 1, l_1 = 1, m_1 = 1, n_1 = 1 and all other parameters to 0, may also be used.
Figure JPOXMLDOC01-appb-M000008
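Equations (4) and (5) appear in this text only as image placeholders. Purely as an illustrative stand-in, the sketch below minimizes U(S) = Σ_{i∈S} r_ti + Σ_{i<j∈S} 1/r_ij by exhaustive search. This U is one concrete instance of "a decreasing function of the evaluation-speaker similarity plus an increasing function of the learning-speaker similarity" when similarity is taken as the reciprocal distance 1/r; it is an assumption, not the patent's actual equation.

```python
import numpy as np
from itertools import combinations

def select_adaptive_speakers(r_t, r, n_sel):
    """Exhaustively pick the size-n_sel subset S of learning speakers minimizing
    U(S) = sum_{i in S} r_t[i] + sum_{i<j in S} 1 / r[i][j].

    r_t[i]  : model distance from the evaluation speaker to learning speaker i
    r[i][j] : model distance between learning speakers i and j (symmetric)
    """
    best, best_u = None, np.inf
    for subset in combinations(range(len(r_t)), n_sel):
        u = sum(r_t[i] for i in subset)                      # close to evaluation speaker
        u += sum(1.0 / r[i][j] for i, j in combinations(subset, 2))  # mutually far apart
        if u < best_u:
            best, best_u = subset, u
    return best
```

The exhaustive loop visits every subset, which motivates the candidate-narrowing variant described next.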
 Another example of the adaptive speaker selection method used by the adaptive speaker selection unit 140 will now be described. This method first narrows down the learning speakers that are candidates for adaptive speakers. Specifically, for example, learning speakers whose similarity to the evaluation speaker is at least a predetermined threshold are selected as candidates. The learning-speaker similarities for the selected candidates are then read from the learning-speaker similarity storage unit 114, and adaptive speakers are selected from the candidates using a potential function, in the same manner as in the first example above. In this method, when the number of learning speakers selected as candidates is no greater than the number of adaptive speakers to select, the candidates may simply be determined to be the adaptive speakers, without performing the selection process.
 Alternatively, without using a threshold for candidate selection, the M learning speakers (M > N) with the highest similarity to the evaluation speaker may be selected as candidates, after which adaptive speakers are selected from the candidates using a potential function, as in the first example above.
 Narrowing down the number of candidate learning speakers before selecting the adaptive speakers can improve processing speed. For example, when there are 1000 learning speakers and 10 adaptive speakers are to be selected from among them, selecting the adaptive speakers without narrowing down candidates requires evaluating equation (4) or (5) 1000C10 times (the number of ways to choose 10 speakers out of 1000). If the candidates are first narrowed down to 30, the number of evaluations of equation (4) or (5) is reduced to 30C10 times.
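The two subset counts above can be checked directly with Python's `math.comb`:

```python
from math import comb

# Evaluating the potential function for every size-10 subset of 1000 speakers:
assert comb(1000, 10) > 10 ** 23        # roughly 2.6e23 subsets -- infeasible
# After first narrowing the candidates down to 30:
assert comb(30, 10) == 30_045_015      # roughly 3.0e7 subsets -- tractable
```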
 FIG. 3 is a flowchart showing the flow of processing by the adaptive speaker selection device 100 shown in FIG. 1. First, the feature amount calculation unit 120 calculates the feature amounts of the evaluation speaker speech signal (S10). Using these feature amounts, the similarity calculation unit 130 calculates, for the speaker model of each learning speaker stored in the speaker model storage unit 112, the similarity to the evaluation speaker (S20). The similarity may be calculated, for example, as the likelihood of the feature amounts of the evaluation speaker speech signal under each speaker model, or, as shown in FIG. 4, an evaluation speaker model may be created from the feature amounts of the evaluation speaker speech signal (S22) and the similarity between each speaker model and the evaluation speaker model calculated (S24).
 Then, based on the similarities between the evaluation speaker and the learning speakers and the similarities among the learning speakers, the adaptive speaker selection unit 140 selects as adaptive speakers the N learning speakers for which "the similarity between the evaluation speaker and the learning speakers is as large as possible, and the similarity among the learning speakers is as small as possible" (S30). In selecting the adaptive speakers, they may be selected directly from all the learning speakers, or, as shown in FIG. 5, M candidates (M > N) may be selected according to their similarity to the evaluation speaker (S32) and the N adaptive speakers then selected from the M candidates (S34).
 The principle of the adaptive speaker selection technique according to the present invention has been described above. According to this technique, when selecting adaptive speakers, learning speakers are chosen so that "the similarity between the evaluation speaker and the learning speakers is as large as possible, and the similarity among the learning speakers is as small as possible," which prevents the variation in the adaptive speakers' utterance content from becoming small. Degradation in the accuracy of the adaptive model created using the adaptive speakers' sufficient statistics can therefore be suppressed.
 An embodiment of the present invention will now be described based on the above.
 FIG. 6 shows an adaptive speaker model generation device 200 according to the embodiment of the present invention. The adaptive speaker model generation device 200 includes storage means 210, input means 220, and data processing means 230.
 The storage means 210 includes a sufficient statistic storage unit 212 that stores, for each learning speaker ID, the sufficient statistics obtained for that learning speaker; a speaker model storage unit 214 that stores, for each learning speaker ID, the learning speaker's acoustic model; and a learning-speaker similarity storage unit 216 that stores a similarity table indicating the speech similarity between every pair of learning speakers in the set.
 The input means 220 receives the evaluation speaker's speech signal from a voice input device such as a microphone and inputs it to the data processing means 230.
 The data processing means 230 includes a feature amount calculation unit 232, a similarity calculation unit 234, a speaker selection unit 236, and an adaptive model creation unit 238.
 The feature amount calculation unit 232 receives the evaluation speaker speech signal from the input means 220, calculates the feature amounts necessary for speech recognition, and outputs them to the similarity calculation unit 234. The specific feature calculation method used by the feature amount calculation unit 232 may be any of the methods used by the feature amount calculation unit 120 in the adaptive speaker selection device 100 shown in FIG. 1.
 The similarity calculation unit 234 reads the speaker model of each learning speaker stored in the speaker model storage unit 214, calculates, for each speaker model, its similarity to the evaluation speaker speech signal received from the feature amount calculation unit 232, and outputs each pair of a similarity and the corresponding learning speaker's ID to the speaker selection unit 236. The types of similarity calculated by the similarity calculation unit 234 and the methods of calculating them are the same as those of the similarity calculation unit 130 in the adaptive speaker selection device 100, so a detailed description is omitted here.
 Like the adaptive speaker selection unit 140 in the adaptive speaker selection device 100, the speaker selection unit 236 selects as adaptive speakers the N learning speakers for which "the similarity between the evaluation speaker and the learning speakers is as large as possible, and the similarity among the learning speakers is as small as possible." The speaker selection unit 236 outputs the IDs of the selected N adaptive speakers to the adaptive model creation unit 238.
 The adaptive model creation unit 238 reads from the sufficient statistic storage unit 212 the sufficient statistics corresponding to the IDs of the N adaptive speakers output by the speaker selection unit 236, and creates an acoustic model adapted to the evaluation speaker (an adaptive model) by statistical calculation.
 The method by which the adaptive model creation unit 238 creates the adaptive model is not limited to the one described above. For example, the sufficient statistics of the adaptive speakers may be weighted and combined using weighting coefficients determined by the likelihoods, calculated by the similarity calculation unit 234, of the adaptive speakers' models with respect to the evaluation speaker's feature amounts, or the adaptive speakers' models may be weighted by arbitrary coefficients and combined.
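The likelihood-weighted combination of sufficient statistics mentioned above can be sketched as follows for single-Gaussian, one-dimensional statistics. The softmax weighting and the layout of the statistics dictionaries are assumptions for illustration; the patent only states that likelihoods determine the weights:

```python
import numpy as np

def combine_sufficient_statistics(stats, log_likelihoods):
    """Weight each adaptive speaker's Gaussian sufficient statistics by a
    softmax of its model's log-likelihood for the evaluation speaker, pool
    them, and re-derive a mean and variance."""
    ll = np.array(log_likelihoods, dtype=float)
    w = np.exp(ll - ll.max())
    w /= w.sum()                                   # normalized weights
    # Pool zeroth/first/second-order statistics.
    n = sum(wi * s["count"] for wi, s in zip(w, stats))
    first = sum(wi * s["sum"] for wi, s in zip(w, stats))
    second = sum(wi * s["sum_sq"] for wi, s in zip(w, stats))
    mean = first / n
    var = second / n - mean ** 2
    return mean, var

stats = [
    {"count": 100.0, "sum": np.array([100.0]), "sum_sq": np.array([200.0])},
    {"count": 100.0, "sum": np.array([300.0]), "sum_sq": np.array([1000.0])},
]
mean, var = combine_sufficient_statistics(stats, [-10.0, -10.0])
print(mean, var)   # equal log-likelihoods -> equal weights -> pooled mean 2.0
```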
 FIG. 7 is a flowchart showing the flow of processing by the adaptive speaker model generation device 200. Steps S50 to S70 cover the processing up to the selection of the adaptive speakers, and are the same as the processing by the adaptive speaker selection device 100 shown in FIG. 3. In step S80, the adaptive model creation unit 238 of the adaptive speaker model generation device 200 creates an acoustic model adapted to the evaluation speaker using the sufficient statistics of the N adaptive speakers selected by the speaker selection unit 236.
 Because the adaptive speaker model generation device 200 of this embodiment selects adaptive speakers by the same method as the adaptive speaker selection device 100 shown in FIG. 1 and then creates a model adapted to the evaluation speaker, it can suppress degradation of the adaptive model's accuracy.
 While the present invention has been described above with reference to the embodiments (and examples), the present invention is not limited to those embodiments (and examples). Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within its scope.
 This application claims priority based on Japanese Patent Application No. 2008-092206 filed on March 31, 2008, the entire disclosure of which is incorporated herein.
 The present invention is applicable, for example, to techniques for selecting adaptive speakers from learning speakers in order to create an acoustic model adapted to an evaluation speaker.

Claims (12)

  1.  An adaptive speaker selection method for selecting a plurality of adaptive speakers from a set of learning speakers in order to create an acoustic model adapted to an evaluation speaker, wherein a plurality of learning speakers whose speech similarity to the evaluation speaker is as high as possible and whose speech similarity to one another is as small as possible are selected as the adaptive speakers.
  2.  The adaptive speaker selection method according to claim 1, wherein N learning speakers that minimize the value of the potential function U shown in Equation (1) are selected as the adaptive speakers.
     [Equation (1): formula image (Figure JPOXMLDOC01-appb-M000001) not reproduced]
     In Equation (1), r_ti is the inter-model distance between the evaluation speaker t and learning speaker i, r_ij is the inter-model distance between learning speakers i and j, and k_1, k_2, ..., l_1, l_2, ..., m_1, m_2, ..., n_1, n_2, ... are characteristic parameters of the potential function U.
  3.  The adaptive speaker selection method according to claim 1 or 2, wherein learning speakers whose similarity to the evaluation speaker is equal to or greater than a predetermined threshold are selected as candidates, and the plurality of adaptive speakers are selected from the selected candidates.
  4.  A speaker adaptive model generation method for creating an acoustic model adapted to the evaluation speaker using the sufficient statistics of the plurality of adaptive speakers selected by the adaptive speaker selection method according to any one of claims 1 to 3.
  5.  An adaptive speaker selection device that selects a plurality of adaptive speakers from a set of learning speakers in order to create an acoustic model adapted to an evaluation speaker, the device comprising:
     a learning speaker similarity storage unit that stores the speech similarity between every pair of learning speakers in the set of learning speakers;
     a similarity calculation unit that calculates the speech similarity between the evaluation speaker and each of the learning speakers; and
     a speaker selection unit that, based on the similarities calculated by the similarity calculation unit and the inter-learning-speaker similarities stored in the learning speaker similarity storage unit, selects as the adaptive speakers a plurality of learning speakers whose speech similarity to the evaluation speaker is as high as possible and whose speech similarity to one another is as small as possible.
  6.  The adaptive speaker selection device according to claim 5, wherein the speaker selection unit selects as the adaptive speakers N learning speakers that minimize the value of the potential function U shown in Equation (2).
     [Equation (2): formula image (Figure JPOXMLDOC01-appb-M000002) not reproduced]
     In Equation (2), r_ti is the inter-model distance between the evaluation speaker t and learning speaker i, r_ij is the inter-model distance between learning speakers i and j, and k_1, k_2, ..., l_1, l_2, ..., m_1, m_2, ..., n_1, n_2, ... are characteristic parameters of the potential function U.
  7.  The adaptive speaker selection device according to claim 5 or 6, wherein the speaker selection unit selects, as candidates, learning speakers whose similarity to the evaluation speaker is equal to or greater than a predetermined threshold, and selects the plurality of adaptive speakers from the selected candidates.
  8.  A speaker adaptive model generation device comprising:
     a sufficient statistic storage unit that stores the sufficient statistics of each learning speaker; and
     adaptive model creation means that creates an acoustic model adapted to the evaluation speaker using the sufficient statistics, stored in the sufficient statistic storage unit, of the plurality of adaptive speakers selected by the adaptive speaker selection device according to any one of claims 5 to 7.
  9.  A computer-readable recording medium storing a program that causes a computer to execute an adaptive speaker selection process for selecting a plurality of adaptive speakers from a set of learning speakers in order to create an acoustic model adapted to an evaluation speaker,
     wherein the adaptive speaker selection process is a process of selecting, as the adaptive speakers, a plurality of learning speakers whose speech similarity to the evaluation speaker is as high as possible and whose speech similarity to one another is as small as possible.
  10.  The recording medium according to claim 9, wherein the adaptive speaker selection process is a process of selecting as the adaptive speakers N learning speakers that minimize the value of the potential function U shown in Equation (3).
     [Equation (3): formula image (Figure JPOXMLDOC01-appb-M000003) not reproduced]
     In Equation (3), r_ti is the inter-model distance between the evaluation speaker t and learning speaker i, r_ij is the inter-model distance between learning speakers i and j, and k_1, k_2, ..., l_1, l_2, ..., m_1, m_2, ..., n_1, n_2, ... are characteristic parameters of the potential function U.
  11.  The recording medium according to claim 9 or 10, wherein the program further causes the computer to execute a candidate selection process for selecting, as candidates, learning speakers whose similarity to the evaluation speaker is equal to or greater than a predetermined threshold, and the adaptive speaker selection process is a process of selecting the plurality of adaptive speakers from the candidates selected by the candidate selection process.
  12.  A computer-readable recording medium storing a program that causes a computer to execute the adaptive speaker selection process according to any one of claims 9 to 11, and a process of creating an acoustic model adapted to the evaluation speaker using the sufficient statistics of the plurality of adaptive speakers selected by the adaptive speaker selection process.
PCT/JP2009/052379 2008-03-31 2009-02-13 Adaptive speaker selection device, adaptive speaker selection method, and recording medium WO2009122780A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010505436A JPWO2009122780A1 (en) 2008-03-31 2009-02-13 Adaptive speaker selection device, adaptive speaker selection method, and program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2008092206 2008-03-31
JP2008-092206 2008-03-31

Publications (1)

Publication Number Publication Date
WO2009122780A1 (en)

Family

ID=41135179

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/052379 WO2009122780A1 (en) 2008-03-31 2009-02-13 Adaptive speaker selection device, adaptive speaker selection method, and recording medium

Country Status (2)

Country Link
JP (1) JPWO2009122780A1 (en)
WO (1) WO2009122780A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0420999A (en) * 1990-05-16 1992-01-24 Mitsubishi Electric Corp Standard speaker selector
JPH04324499A (en) * 1991-04-24 1992-11-13 Sharp Corp Speech recognition device
JPH08123466A (en) * 1994-10-28 1996-05-17 Mitsubishi Electric Corp Speech recognition device

Non-Patent Citations (3)

Title
KANAKO MATSUNAMI ET AL.: "Jubun Tokeiryo o Mochiita Kyoshi Nashi Washa Tekio Oyobi Kankyo Tekio", TRANSACTIONS OF INFORMATION PROCESSING SOCIETY OF JAPAN, vol. 43, no. 7, 15 July 2002 (2002-07-15), pages 2038 - 2045 *
MASAHIRO TANI ET AL.: "Jubun Tokeiryo o Mochiita Kyoshi Nashi Washa Tekio ni Okeru Washa Sentakuho", IEICE TECHNICAL REPORT, vol. 107, no. 405, 13 December 2007 (2007-12-13), pages 85 - 89 *
MITSURU SAMEJIMA ET AL.: "Kodomo Onsei ni Taisuru Jubun Tokeiryo ni Motozuku Kyoshi Nashi Washa Tekio no Kento", THE ACOUSTICAL SOCIETY OF JAPAN (ASJ) 2004 NEN SHUKI KENKYU HAPPYOKAI KOEN RONBUNSHU -I, 21 September 2004 (2004-09-21), pages 109 - 110 *

Also Published As

Publication number Publication date
JPWO2009122780A1 (en) 2011-07-28


Legal Events

121 — Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 09728331; Country of ref document: EP; Kind code of ref document: A1)
WWE — Wipo information: entry into national phase (Ref document number: 2010505436; Country of ref document: JP)
NENP — Non-entry into the national phase (Ref country code: DE)
122 — Ep: pct application non-entry in european phase (Ref document number: 09728331; Country of ref document: EP; Kind code of ref document: A1)