JP6267667B2

JP6267667B2 - Learning data generating apparatus, method and program

Info

Publication number: JP6267667B2
Application number: JP2015040322A
Authority: JP
Inventors: 太一浅見; 隆伸大庭; 澄宇阪内
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-03-02
Filing date: 2015-03-02
Publication date: 2018-01-24
Anticipated expiration: 2035-03-02
Also published as: JP2016161762A

Description

The present invention relates to a technique for generating learning data for a speaker feature extraction model with which speaker recognition can be performed with high accuracy.

Non-Patent Document 1 discloses a method for calculating, from an input speech signal, a speaker feature vector used for speaker recognition. The input speech signal (usually the signal of a segment called an "utterance", in which one sentence is spoken) is divided into acoustic analysis frames of several tens of milliseconds. An acoustic feature vector is extracted from each frame, the vectors are arranged in time order to form an acoustic feature vector sequence, and the speaker feature vector w is calculated from that sequence by equation (1) below. Equation (1) corresponds to equation (13) of Non-Patent Document 1.

$$w = (I + T' \Sigma^{-1} N_u T)^{-1} T' \Sigma^{-1} F_u \tag{1}$$
Here I is the identity matrix and the prime (') denotes matrix transposition. N_u and F_u are the zeroth-order and first-order statistics, respectively, calculated with respect to a predetermined mixed normal distribution using the input acoustic feature vector sequence. T and Σ are the parameters of the speaker feature extraction model and are learned before speaker features are extracted.
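As a rough illustration of equation (1), the sketch below evaluates w in plain Python. It rests on several assumptions that are not stated in the text: Σ and N_u are treated as diagonal and passed as lists of their diagonal entries, T is passed as a list of rows, and the transposed T is used in the second factor so that the matrix dimensions agree. The names `solve` and `speaker_feature_vector` and the toy numbers are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def speaker_feature_vector(t, sigma_diag, n_diag, f):
    """Evaluate w = (I + T' S^-1 N T)^-1 T' S^-1 F for diagonal S and N (assumed).

    t: K x R matrix as a list of K rows; sigma_diag, n_diag, f: lists of length K.
    """
    k, r = len(t), len(t[0])
    # Left-hand matrix I + T' S^-1 N T (R x R).
    a = [[(1.0 if i == j else 0.0)
          + sum(t[q][i] * n_diag[q] / sigma_diag[q] * t[q][j] for q in range(k))
          for j in range(r)] for i in range(r)]
    # Right-hand vector T' S^-1 F (length R).
    b = [sum(t[q][i] * f[q] / sigma_diag[q] for q in range(k)) for i in range(r)]
    return solve(a, b)


# Toy call: T is the 2 x 2 identity, so each entry of w is f / (sigma + n).
w = speaker_feature_vector([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], [1.0, 3.0], [4.0, 8.0])
```

Solving the linear system instead of forming the inverse explicitly is a design choice of this sketch; the left-hand matrix is the identity plus a positive semidefinite term, so it is always invertible.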

Speaker feature vectors obtained from utterances of the same speaker tend to be similar to one another (their cosine similarity is high), so speaker recognition can be performed using speaker feature vectors.
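The comparison described here can be sketched in a few lines. The function names and the decision threshold of 0.5 are hypothetical; the text only says that vectors of the same speaker have a high cosine similarity and does not give a threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two speaker feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.5):
    """Hypothetical decision rule: accept when the similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold
```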

[Non-Patent Document 1] Tetsuji Ogawa and Sayaka Shiota, "Speaker recognition using i-vector," Journal of the Acoustical Society of Japan, Vol. 70, No. 6, pp. 332-339, June 2014.

The conventional technique described in Non-Patent Document 1 requires a set of training utterances (hereinafter the "learning set") for estimating the parameters T and Σ of the speaker feature extraction model. To learn an accurate speaker feature extraction model, the learning set should contain utterances recorded with diverse speakers, diverse recording devices, and diverse ambient noise environments. However, the more utterances the learning set contains, the longer the learning process takes and the more memory it uses, so the size of a learning set that can be used in practice has an upper limit (usually on the order of tens of thousands of utterances). For this reason, tens of thousands of utterances are usually selected at random from a large speech data set (hereinafter the "mother set") consisting of millions of utterances recorded with diverse speakers, recording devices, and ambient noise environments, and are used as the learning set. If every utterance in the mother set carried labels identifying its speaker, recording device, and ambient noise, the selection could be made using those labels. Labeling a large speech data set is costly and impractical, however, so the learning set ends up being created by random selection.

However, a learning set obtained by random selection does not necessarily cover the diverse speakers, recording devices, and ambient noise contained in the mother set. If the diversity of speakers in the learning set is low, the learned parameters T and Σ of the speaker feature extraction model will be poor, speaker feature vectors of different speakers may become similar, and speaker recognition performance may degrade. If the diversity of recording devices in the learning set is low, the speaker feature vectors of one and the same speaker may differ widely when the recording devices differ, which may also degrade speaker recognition performance. Likewise, if the diversity of ambient noise in the learning set is low, the speaker feature vectors of one and the same speaker may differ widely when the ambient noise differs, again degrading speaker recognition performance.

An object of the present invention is to provide a learning data generation device, method, and program that generate learning data containing diverse speakers, recording devices, and ambient noise.

A learning data generation device according to one aspect of the present invention comprises: an acoustic feature extraction unit that extracts an acoustic feature vector group from the speech signal of each utterance included in a mother set; a mixed normal distribution fitting unit that obtains a mixed normal distribution by fitting a mixed normal distribution with a predetermined number of mixtures M to the acoustic feature vector group; a component content calculation unit that, treating each of the M normal distributions constituting the obtained mixed normal distribution as a component, calculates the content of each component in the mother set using the acoustic feature vector group; and an utterance selection unit that generates learning data by selecting, based on the content of each component in the mother set, utterances from among those included in the mother set so that the composition ratio of the components in the learning set becomes close to the composition ratio of the components in the mother set.

Learning data that contains diverse speakers, recording devices, and ambient noise can be generated.

FIG. 1 is a block diagram illustrating an example of the learning data generation device. FIG. 2 is a flowchart illustrating an example of the learning data generation method.

[Technical background]
One idea behind the present invention is to calculate, for a set of utterances, a diversity score that represents the diversity of the acoustic properties (speaker, recording device, and ambient noise) contained in the set, and to pick from the mother set a set of utterances with a high diversity score for use as the learning set.

The diversity score is based on the idea that a set of utterances has a high diversity score when it evenly contains the components (parts) that make up the speech of the mother set. First, the mother set is decomposed into components by fitting a mixed normal distribution (Gaussian mixture model) to the acoustic feature vectors of all frames of the mother set; each normal distribution in the mixture is one component. Next, how much of each component each utterance in the mother set contains is calculated from the likelihoods under the individual normal distributions. When an utterance is selected from the mother set and placed in the learning set, the components contained in that utterance are added to the learning set. If the composition ratio of the components in a learning set built from several selected utterances is close to the composition ratio of the components in the whole mother set, the learning set can be said to contain the components of the mother set evenly. The diversity score D(U) of a learning set U is therefore calculated by equation (2), whose value is high when the composition ratio of the components in the learning set is close to that of the mother set.

M is the number of mixtures (the number of components) of the mixed normal distribution, w_i is the proportion of the i-th component in the mother set, and f_i^U is the content of the i-th component in the learning set U.

As stated in the section [Problems to be Solved by the Invention], the learning set of the speaker feature extraction model has an upper size limit (tens of thousands of utterances) imposed by computation time and memory usage. Let C be the upper limit on the number of utterances in the learning set. By picking the learning set U from the mother set so that D(U) is as large as possible subject to |U| ≤ C, a learning set that evenly contains diverse speakers, recording devices, and ambient noise can be selected from the mother set. Here |U| is the number of utterances in the learning set U.

However, the number of ways to pick tens of thousands of utterances from a mother set of millions of utterances is enormous, and computing D(U) for every combination to find the maximizing U is not feasible in a realistic amount of time. The learning set U is therefore built by a greedy method that adds utterances from the mother set to U one at a time until |U| = C, which keeps the processing time realistic. Because the diversity score D(U) is a submodular function, selecting utterances greedily yields a U that approximately maximizes D(U).

[Embodiment]
Examples of embodiments of the present invention are described below with reference to the drawings.

The learning data generation device includes, for example, an acoustic feature extraction unit 101, a mixed normal distribution fitting unit 102, a component content calculation unit 103, and an utterance selection unit 104. The learning data generation method is realized by this device performing the steps illustrated in FIG. 2.

<Acoustic feature extraction unit 101>
Input: the mother set
Output: the acoustic feature vector group of each utterance in the mother set (to the mixed normal distribution fitting unit 102 and the component content calculation unit 103)
Processing: The acoustic feature extraction unit 101 extracts an acoustic feature vector group from the speech signal of each utterance included in the input mother set (step S1). The extracted acoustic feature vector group of each utterance is output to the mixed normal distribution fitting unit 102 and the component content calculation unit 103.

To extract an acoustic feature vector group, the speech signal of each utterance is divided into acoustic analysis frames of several tens of milliseconds, and an acoustic feature vector is extracted from each frame. The acoustic feature vector of each frame is a real-valued vector and may be extracted by any existing method, such as MFCC or LPC cepstrum.
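The framing step can be sketched as follows. The frame length of 25 ms and the frame shift of 10 ms are assumptions chosen for illustration; the text only specifies frames of several tens of milliseconds. The function name is hypothetical, and the sketch stops at framing: computing MFCC or LPC cepstrum from each frame is left out.

```python
def split_into_frames(samples, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Cut a sampled speech signal into overlapping acoustic analysis frames.

    frame_ms and shift_ms are illustrative defaults, not values from the text.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

Each returned frame would then be passed to whichever feature extractor is used, giving one acoustic feature vector per frame.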

<Mixed normal distribution fitting unit 102>
Input: the acoustic feature vector group of each utterance in the mother set (from the acoustic feature extraction unit 101), and the number of mixtures M
Output: the mixed normal distribution (to the component content calculation unit 103; the mixture weights of the normal distributions also go to the utterance selection unit 104)
Processing: The mixed normal distribution fitting unit 102 fits a mixed normal distribution with the input number of mixtures M to the input acoustic feature vector groups of the utterances in the mother set, obtains the mixture weight, mean vector, and covariance matrix of each normal distribution, and outputs the resulting mixed normal distribution (step S2). In the component content calculation unit 103 that follows, each normal distribution of the mixture is regarded as one component.

The mixed normal distribution is fitted (that is, the mixture weights, mean vectors, and covariance matrices are estimated) using, for example, the general EM algorithm described in Reference 1. The number of mixtures M is an integer of 1 or more. A larger M decomposes the acoustic features into components more finely, but it also increases the number of parameters of the mixed normal distribution, so more acoustic feature vectors are needed for estimation. A value of about 512 is usually used for M.
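To make the E-step and M-step concrete, here is a deliberately small EM sketch. It is simplified in ways the text does not describe: the data are one-dimensional scalars rather than acoustic feature vectors, each component has a scalar variance instead of a covariance matrix, and the initial means are supplied by the caller. The names `normal_pdf`, `fit_gmm_1d`, `n_iter`, and `var_floor` are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, init_means, n_iter=20, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    m = len(init_means)
    weights = [1.0 / m] * m
    means = list(init_means)
    variances = [1.0] * m
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(m)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(m):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_floor, spread)  # floor avoids a collapsed component
    return weights, means, variances


# Two well-separated clusters around 0 and 10.
weights, means, variances = fit_gmm_1d([-0.2, 0.0, 0.2, 9.8, 10.0, 10.2], [-0.2, 10.2])
```

A production fit with M around 512 on multi-dimensional features would normally rely on an established library and log-domain arithmetic; this sketch only shows the structure of the two alternating steps.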

[Reference 1] C. M. Bishop, "Pattern Recognition and Machine Learning, Vol. 2" (Japanese edition), pp. 154-155, Springer Japan, July 1, 2008.

<Component content calculation unit 103>
Input: the acoustic feature vector group of each utterance in the mother set (from the acoustic feature extraction unit 101), and the mixed normal distribution (from the mixed normal distribution fitting unit 102)
Output: the component content of each utterance in the mother set
Processing: The component content calculation unit 103 calculates and outputs the component content of each utterance in the mother set, using the input acoustic feature vector group of each utterance and the mixed normal distribution.

There are M components, one for each normal distribution in the mixture, and a content is calculated for each component. The component content of an utterance is the sum of the component contents of the acoustic feature vectors in that utterance. The component content of one acoustic feature vector is calculated as follows.

(1) The component content calculation unit 103 calculates the likelihood of the target acoustic feature vector x under each of the normal distributions from the 1st to the M-th. Let μ_m be the mean vector and S_m the covariance matrix of the m-th normal distribution, and let d be the number of dimensions of the acoustic feature vector. The likelihood L_m of the m-th normal distribution for x is then the normal density of equation (3).

$$L_m = \frac{1}{(2\pi)^{d/2} |S_m|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_m)' S_m^{-1} (x - \mu_m)\right) \tag{3}$$

(2) The component content calculation unit 103 normalizes the M likelihoods L_1 to L_M so that they sum to 1, as in equation (4).

$$P_m = \frac{L_m}{\sum_{k=1}^{M} L_k} \tag{4}$$

The M normalized likelihoods P_1 to P_M obtained in step (2) are the component contents of the acoustic feature vector x. The component content calculation unit 103 calculates the component contents of every acoustic feature vector in the utterance and sums them over the utterance for each component, which gives the component content of the utterance.

The component content calculation unit 103 performs this calculation for each utterance in the mother set and obtains the component content of each utterance, that is, how much of each component each utterance contains (step S3).
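The per-frame and per-utterance calculation of step S3 can be sketched as below. Two simplifications are assumptions of this sketch rather than part of the described method: covariance matrices are taken to be diagonal and passed as lists of variances, and likelihoods are computed in the log domain with the largest value subtracted before exponentiation, which rescales all M likelihoods by the same factor and so leaves the normalized values unchanged. All function names are hypothetical.

```python
import math


def log_normal_diag(x, mean, var):
    """Log density of a normal distribution with diagonal covariance (assumed)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def frame_content(x, means, variances):
    """Normalized likelihoods P_1..P_M of one acoustic feature vector x."""
    logs = [log_normal_diag(x, mu, var) for mu, var in zip(means, variances)]
    peak = max(logs)
    likes = [math.exp(value - peak) for value in logs]  # common rescaling only
    total = sum(likes)
    return [value / total for value in likes]


def utterance_content(frames, means, variances):
    """Component content of an utterance: per-component sum over its frames."""
    content = [0.0] * len(means)
    for x in frames:
        for i, p in enumerate(frame_content(x, means, variances)):
            content[i] += p
    return content


# Two components centered at (0, 0) and (4, 0) with unit variances.
means = [[0.0, 0.0], [4.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
content = utterance_content([[2.0, 0.0], [2.0, 0.0], [0.0, 0.0]], means, variances)
```

Because each frame contributes contents that sum to 1, the contents of an utterance sum to its number of frames, so longer utterances carry more content.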

<Utterance selection unit 104>
Input: the mother set, the component content of each utterance in the mother set (from the component content calculation unit 103), the mixture weight of each normal distribution of the mixed normal distribution (from the mixed normal distribution fitting unit 102), and the upper limit C on the number of utterances
Output: the learning set
Processing: The utterance selection unit 104 selects utterances from the mother set, using the input component content of each utterance, the mixture weight of each normal distribution, and the upper limit C on the number of utterances, and outputs them as the learning set.

Utterances are selected by a greedy method using the following procedure.

(0) The utterance selection unit 104 initializes the learning set U to the empty set, and initializes the mother set to the set whose elements are all the utterances of the mother set.

(1) The utterance selection unit 104 calculates, for each utterance in the mother set, the increase in the diversity score that would result from adding that utterance to the learning set U. The increase Improve_n obtained by adding the n-th utterance u_n of the mother set to the learning set is calculated by equation (5) together with equation (2).

$$\mathrm{Improve}_n = D(U \cup \{u_n\}) - D(U) \tag{5}$$

Here w_i is the mixture weight of the i-th normal distribution, and f_i^U is the sum of the i-th component contents of all utterances in U.

(2) The utterance selection unit 104 moves the utterance that increases the diversity score the most from the mother set to the learning set U.

(3) If the number of utterances in the learning set U is less than C, the utterance selection unit 104 returns to step (1) and repeats. When the number of utterances in U reaches C, it stops and outputs the learning set U as the learning data.

C is an integer that is at least 1 and at most the number of utterances in the mother set, and it is the number of utterances in the learning set that is finally output. A larger C gives a learning set that contains the speakers, recording devices, and ambient noise of the mother set more diversely, but the larger number of utterances increases the processing time and memory usage when the speaker feature extraction model is learned. C is usually set to about 30,000 to 50,000.

Through steps (1) and (2), utterances are selected one after another so that the composition ratio of the components in the learning set U approaches the mixture weights of the mixed normal distribution (which equal the composition ratio of the components in the mother set). As a result, under the constraint of the upper limit C on the number of utterances, a set of utterances that reproduces the composition ratio of the components of the mother set as faithfully as possible is selected and output as the learning set.
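Steps (0) to (3) can be sketched as the loop below. The formula for the diversity score D(U) is not reproduced in this text, so the sketch substitutes an assumed stand-in, the weighted sum of log(1 + f_i^U); it is concave in each content and therefore submodular, but it should not be read as the patent's equation (2). The function names and toy data are hypothetical.

```python
import math


def diversity_score(totals, weights):
    """ASSUMED stand-in for D(U): sum over components of w_i * log(1 + f_i^U)."""
    return sum(w * math.log1p(f) for w, f in zip(weights, totals))


def greedy_select(contents, weights, limit):
    """Pick up to `limit` utterances, each time taking the largest score increase.

    contents[n][i] is the content of component i in utterance n.
    Returns the indices of the selected utterances in selection order.
    """
    remaining = set(range(len(contents)))      # step (0): mother set
    selected = []                              # step (0): learning set U
    totals = [0.0] * len(weights)              # f_i^U for the current U
    while len(selected) < limit and remaining:
        base = diversity_score(totals, weights)
        best_n, best_gain = None, -math.inf
        for n in sorted(remaining):            # step (1): increase for every candidate
            candidate = [t + c for t, c in zip(totals, contents[n])]
            gain = diversity_score(candidate, weights) - base
            if gain > best_gain:
                best_n, best_gain = n, gain
        remaining.remove(best_n)               # step (2): move the best utterance
        selected.append(best_n)
        totals = [t + c for t, c in zip(totals, contents[best_n])]
    return selected                            # step (3): stop at the limit


# Utterances 0 and 1 contain only component 0; utterance 2 contains only component 1.
picked = greedy_select([[10.0, 0.0], [10.0, 0.0], [0.0, 10.0]], [0.5, 0.5], 2)
```

With equal mixture weights, the second pick is the utterance that supplies the component still missing from U rather than a second copy of the component already covered. Each round scans every remaining utterance, which is the cost that accelerated variants aim to reduce.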

Because the diversity score D(U) is a submodular function, an accelerated method such as the one described in Reference 2 may be used instead. When the function to be maximized is submodular, such a method yields the same learning set as the greedy method above with less computation.
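One common way to obtain such a speed-up is lazy evaluation: for a submodular score, the increase an utterance offers can only shrink as U grows, so an increase computed in an earlier round is an upper bound and most candidates need not be re-evaluated. The sketch below illustrates that idea with a priority queue. It is written for this text and is not taken from Reference 2, and it uses the same assumed stand-in score (weighted sum of log(1 + f_i^U)) in place of the diversity score formula, which is not reproduced here. All names are hypothetical.

```python
import heapq
import math


def score(totals, weights):
    """ASSUMED stand-in for D(U): sum over components of w_i * log(1 + f_i^U)."""
    return sum(w * math.log1p(f) for w, f in zip(weights, totals))


def gain(totals, weights, content):
    """Increase of the score when an utterance with this content is added."""
    added = [t + c for t, c in zip(totals, content)]
    return score(added, weights) - score(totals, weights)


def lazy_greedy_select(contents, weights, limit):
    """Greedy selection that re-evaluates a candidate only when it reaches the top."""
    totals = [0.0] * len(weights)
    # Heap entries: (negated gain, utterance index, round in which the gain was computed).
    heap = [(-gain(totals, weights, c), n, 0) for n, c in enumerate(contents)]
    heapq.heapify(heap)
    selected = []
    while len(selected) < limit and heap:
        neg_gain, n, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # The gain is current, and every other stored gain is an upper bound
            # that is no larger, so utterance n is the best choice this round.
            selected.append(n)
            totals = [t + c for t, c in zip(totals, contents[n])]
        else:
            # Stale gain: recompute against the current U and put it back.
            heapq.heappush(heap, (-gain(totals, weights, contents[n]), n, len(selected)))
    return selected


picked = lazy_greedy_select([[3.0, 1.0], [1.0, 3.0], [2.0, 2.0], [4.0, 0.0]], [0.7, 0.3], 3)
```

The selection order matches that of a full scan over all remaining utterances in every round, while only the candidates that rise to the top of the queue are re-scored.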

[Reference 2] Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen and Natalie Glance, "Cost-effective outbreak detection in networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420-429, 2007.

With the above configuration, the utterance selection unit 104 outputs a learning set whose utterances have been selected so that the composition ratio of the components in the learning set is close to the composition ratio of the components in the mother set, in other words, so that the learning set evenly contains the diverse speakers, recording devices, and ambient noise of the mother set (step S4). Using a speaker feature extraction model learned from this learning set enables speaker recognition with higher accuracy than using a learning set created by selecting utterances at random.

[Program and recording medium]
The processes described for the learning data generation device and method need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

When the processes of the learning data generation device are implemented by a computer, the processing content of the functions that the device should have is described by a program. Executing this program on a computer implements those processes on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

It goes without saying that other modifications may be made as appropriate without departing from the spirit of the present invention.

101 acoustic feature extraction unit
102 mixed normal distribution fitting unit
103 component content calculation unit
104 utterance selection unit

Claims

1. A learning data generation device comprising:
an acoustic feature extraction unit that extracts an acoustic feature vector group from the speech signal of each utterance included in a mother set;
a mixed normal distribution fitting unit that obtains a mixed normal distribution by fitting a mixed normal distribution with a predetermined number of mixtures M to the acoustic feature vector group;
a component content calculation unit that, treating each of the M normal distributions constituting the obtained mixed normal distribution as a component, calculates the content of each component in the mother set using the acoustic feature vector group; and
an utterance selection unit that generates learning data by selecting, based on the content of each component in the mother set, utterances from among the utterances included in the mother set so that the composition ratio of the components in a learning set becomes close to the composition ratio of the components in the mother set.

2. The learning data generation device according to claim 1, wherein
the utterance selection unit generates the learning data by repeating a process of selecting, based on the content of each component in the mother set, one utterance from among the utterances included in the mother set such that the composition ratio of the components in the learning set after the utterance is added becomes close to the composition ratio of the components in the mother set, and adding the selected utterance to the learning set.

3. A learning data generation method comprising:
an acoustic feature extraction step of extracting an acoustic feature vector group from the speech signal of each utterance included in a mother set;
a mixed normal distribution fitting step of obtaining a mixed normal distribution by fitting a mixed normal distribution with a predetermined number of mixtures M to the acoustic feature vector group;
a component content calculation step of, treating each of the M normal distributions constituting the obtained mixed normal distribution as a component, calculating the content of each component in the mother set using the acoustic feature vector group; and
an utterance selection step of generating learning data by selecting, based on the content of each component in the mother set, utterances from among the utterances included in the mother set so that the composition ratio of the components in a learning set becomes close to the composition ratio of the components in the mother set.

4. A program for causing a computer to function as each unit of the learning data generation device according to claim 1 or 2.