JP6057170B2 - Spoken language evaluation device, parameter estimation device, method, and program

Info

Publication number: JP6057170B2
Application number: JP2013036258A
Authority: JP
Inventors: 康智大石; 弘和亀岡; 順貴小野; 祐一石本; 知子松井
Original assignee: Nippon Telegraph and Telephone Corp; Inter University Research Institute Corp Research Organization of Information and Systems
Current assignee: Nippon Telegraph and Telephone Corp; Inter University Research Institute Corp Research Organization of Information and Systems
Priority date: 2013-02-26
Filing date: 2013-02-26
Publication date: 2017-01-11
Anticipated expiration: 2033-02-26
Also published as: JP2014164187A

Description

The present invention relates to a spoken language evaluation device, a parameter estimation device, a method, and a program, and more particularly to a spoken language evaluation device, a parameter estimation device, a method, and a program for evaluating the type of language represented by an input speech signal.

Classifying or identifying language types is an important technology from both a linguistic standpoint and the standpoint of engineering applications. In linguistics, languages have been classified and typologized on the basis of grammar, vocabulary, and historical or geographical background, using methods based on functional or geographical comparison. However, many such methods rely on the observation and introspection of individual researchers, and they can hardly be called highly objective.

In speech engineering, on the other hand, classifying the type of language represented by a speech signal underlies language identification and multilingual speech recognition. Attempts have therefore been made to classify and identify language types using large-scale speech corpora that contain many speech signals in multiple languages. Various techniques, such as vector quantization, hidden Markov models (HMMs), and Gaussian mixture models (GMMs), have been used for this classification and identification.

C. S. Greenberg et al., "The 2011 NIST Language Recognition Evaluation," in Proc. Interspeech 2012.

However, conventional methods have not achieved satisfactory results in evaluating the type of language represented by a speech signal from the speech signal alone, without prior knowledge.

The present invention has been made in view of the above circumstances. Its object is to provide a spoken language evaluation device, method, and program capable of accurately evaluating the type of language represented by an input speech signal without requiring prior knowledge, as well as a parameter estimation device, method, and program for estimating the parameters used by them.

To achieve the above object, a spoken language evaluation device according to a first invention includes: extraction means for extracting evaluation feature information from an evaluation speech signal whose language type is unknown; likelihood calculation means for calculating a likelihood indicating how plausible it is that the language represented by the evaluation speech signal is each of a plurality of languages, based on the evaluation feature information extracted by the extraction means and on a model that includes, for each of the plurality of languages whose types are known, a first parameter indicating basis spectra of a plurality of states, each corresponding to one of a plurality of phonemes, extracted by non-negative matrix factorization of a learning speech signal, and a second parameter indicating state transition probabilities of the basis spectra; and evaluation means for evaluating the type of language represented by the evaluation speech signal based on the likelihood calculated by the likelihood calculation means.

According to the spoken language evaluation device of the first invention, the extraction means extracts evaluation feature information from an evaluation speech signal whose language type is unknown. The likelihood calculation means then calculates a likelihood indicating how plausible it is that the language represented by the evaluation speech signal is each of a plurality of languages. This calculation is based on the evaluation feature information extracted by the extraction means and on a model that includes, for each of the plurality of languages whose types are known, a first parameter indicating basis spectra of a plurality of states, each corresponding to one of a plurality of phonemes, extracted by non-negative matrix factorization of a learning speech signal, and a second parameter indicating state transition probabilities of the basis spectra. Finally, the evaluation means evaluates the type of language represented by the evaluation speech signal based on the likelihood calculated by the likelihood calculation means.
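The likelihood-based evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the method defined in this publication: the multinomial-style emission score, the initial state distribution `pi`, and all function names are assumptions introduced here. Each language is represented by basis spectra `H` (first parameter) and a transition matrix `A` (second parameter), and the language with the highest forward-algorithm log-likelihood is selected.

```python
import numpy as np

EPS = 1e-12  # guards against log(0); an illustrative choice


def emission_log_scores(Y, H):
    """Y: (T, F) non-negative feature frames; H: (K, F) basis spectra, rows sum to 1.

    Returns (T, K) scores sum_f Y[t, f] * log H[k, f]
    (an assumed multinomial-style emission model).
    """
    return Y @ np.log(H + EPS).T


def log_forward(log_b, pi, A):
    """Log-likelihood of the frame sequence, summed over all state paths."""
    log_alpha = np.log(pi + EPS) + log_b[0]
    for t in range(1, log_b.shape[0]):
        m = log_alpha.max()  # shift for numerical stability
        log_alpha = m + np.log(np.exp(log_alpha - m) @ A + EPS) + log_b[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())


def evaluate_language(Y, models):
    """models: {language: (H, pi, A)}.

    Returns the language with the highest log-likelihood and all scores.
    """
    scores = {
        lang: log_forward(emission_log_scores(Y, H), pi, A)
        for lang, (H, pi, A) in models.items()
    }
    return max(scores, key=scores.get), scores
```

In this sketch the decision is a plain argmax over languages; the evaluation means in the text is described more generally as evaluating the language type based on the likelihood.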

A spoken language evaluation device according to a second invention includes: extraction means for extracting evaluation feature information from an evaluation speech signal whose language type is unknown; likelihood calculation means for calculating a likelihood indicating how plausible it is that the language represented by the evaluation speech signal is each of a plurality of languages, based on the evaluation feature information extracted by the extraction means and on a model that includes a first parameter, which is learned with an update rule based on a weighted average according to the scale of the mel spectrum extracted as learning feature information from a learning speech signal and which indicates basis spectra of a plurality of states common to the plurality of languages, and a second parameter, which indicates, for each of the plurality of languages whose types are known, state transition probabilities of the basis spectra, where each transition depends on the basis spectrum one time step earlier; and evaluation means for evaluating the type of language represented by the evaluation speech signal based on the likelihood calculated by the likelihood calculation means.

According to the spoken language evaluation device of the second invention, the extraction means extracts evaluation feature information from an evaluation speech signal whose language type is unknown. The likelihood calculation means then calculates a likelihood indicating how plausible it is that the language represented by the evaluation speech signal is each of a plurality of languages. This calculation is based on the evaluation feature information extracted by the extraction means and on a model that includes a first parameter, which is learned with an update rule based on a weighted average according to the scale of the mel spectrum extracted as learning feature information from a learning speech signal and which indicates basis spectra of a plurality of states common to the plurality of languages, and a second parameter, which indicates, for each of the plurality of languages whose types are known, state transition probabilities of the basis spectra, where each transition depends on the basis spectrum one time step earlier. Finally, the evaluation means evaluates the type of language represented by the evaluation speech signal based on the likelihood calculated by the likelihood calculation means.

In this way, using a model that includes basis spectra of a plurality of states and parameters indicating the state transition probabilities of the basis spectra makes it possible to perform evaluation based on linguistic properties, including the phonetic properties and phoneme transitions of a language. As a result, the type of language represented by an input speech signal can be evaluated accurately without requiring prior knowledge.

A parameter estimation device according to a third invention includes: extraction means for extracting learning feature information from a learning speech signal for each of a plurality of languages whose types are known; initial value generation means for generating, for a model that includes a first parameter indicating basis spectra of a plurality of states, each corresponding to one of a plurality of phonemes, extracted by non-negative matrix factorization of the learning speech signal, and a second parameter indicating state transition probabilities of the basis spectra, initial values of the first parameter and the second parameter for each of the plurality of languages; estimation means for estimating the first parameter and the second parameter for each of the plurality of languages by optimization using either the initial values or the current values of the first parameter and the second parameter, together with the learning feature information extracted by the extraction means; and control means for outputting the estimated first parameter and second parameter when the estimation result of the estimation means satisfies a predetermined condition, and for controlling the estimation means to perform the estimation of the first parameter and the second parameter when the estimation result does not satisfy the predetermined condition.

According to the parameter estimation device of the third invention, the extraction means extracts learning feature information from a learning speech signal for each of a plurality of languages whose types are known. The initial value generation means then generates initial values of the first parameter and the second parameter for each of the plurality of languages, for a model that includes a first parameter indicating basis spectra of a plurality of states, each corresponding to one of a plurality of phonemes, extracted by non-negative matrix factorization of the learning speech signal, and a second parameter indicating state transition probabilities of the basis spectra. Next, the estimation means estimates the first parameter and the second parameter for each of the plurality of languages by optimization using the initial values of the first parameter and the second parameter and the learning feature information extracted by the extraction means. The control means outputs the estimated first parameter and second parameter when the estimation result of the estimation means satisfies a predetermined condition, and controls the estimation means to perform the estimation of the first parameter and the second parameter when the estimation result does not satisfy the predetermined condition. In the second and subsequent passes through the estimation means, the values of the first parameter and the second parameter estimated by the estimation means are used in place of the initial values.
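The interplay between the estimation means and the control means can be sketched as a simple loop. This is only an illustration under stated assumptions: the text says only that a "predetermined condition" is checked, so the stopping rule used here (a small relative change in an objective value, or an iteration cap) and the names `update`, `objective`, `tol`, and `max_iter` are hypothetical.

```python
def estimate_until_converged(params, features, update, objective,
                             tol=1e-6, max_iter=1000):
    """Repeat estimation until an assumed stopping condition is met.

    params:    initial parameter values (from the initial value generation step)
    update:    function (params, features) -> new params (one estimation pass)
    objective: function (params, features) -> float used by the stopping rule
    Returns the final parameters and the number of passes performed.
    """
    prev = objective(params, features)
    for n_pass in range(1, max_iter + 1):
        # From the second pass on, the previous estimate replaces the initial values.
        params = update(params, features)
        cur = objective(params, features)
        # Assumed condition: the objective has stopped changing appreciably.
        if abs(cur - prev) <= tol * max(1.0, abs(prev)):
            return params, n_pass
        prev = cur
    return params, max_iter


if __name__ == "__main__":
    # Toy usage: estimate a mean by repeatedly moving halfway toward it.
    data = [1.0, 2.0, 3.0, 4.0]
    mean = sum(data) / len(data)
    step = lambda p, xs: p + 0.5 * (mean - p)
    score = lambda p, xs: -sum((x - p) ** 2 for x in xs)
    print(estimate_until_converged(0.0, data, step, score))
```

Passing the update and objective as functions keeps the loop independent of the particular model, which mirrors how the control means is described separately from the estimation means.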

A parameter estimation device according to a fourth invention includes: extraction means for extracting learning feature information from a learning speech signal for each of a plurality of languages whose types are known; initial value generation means for generating, for a model that includes a first parameter, which is learned with an update rule based on a weighted average according to the scale of the mel spectrum extracted as the learning feature information and which indicates basis spectra of a plurality of states, and a second parameter, which indicates state transition probabilities of the basis spectra, where each transition depends on the basis spectrum one time step earlier, initial values of the first parameter common to the plurality of languages and of the second parameter for each of the plurality of languages; estimation means for estimating the first parameter common to the plurality of languages and the second parameter for each of the plurality of languages by optimization using either the initial values or the current values of the first parameter and the second parameter, together with the learning feature information extracted by the extraction means; and control means for outputting the estimated first parameter and second parameter when the estimation result of the estimation means satisfies a predetermined condition, and for controlling the estimation means to perform the estimation of the first parameter and the second parameter when the estimation result does not satisfy the predetermined condition.

According to the parameter estimation device of the fourth invention, the extraction means extracts learning feature information from a learning speech signal for each of a plurality of languages whose types are known. The initial value generation means then generates initial values of the first parameter common to the plurality of languages and of the second parameter for each of the plurality of languages, for a model that includes a first parameter, which is learned with an update rule based on a weighted average according to the scale of the mel spectrum extracted as the learning feature information and which indicates basis spectra of a plurality of states, and a second parameter, which indicates state transition probabilities of the basis spectra, where each transition depends on the basis spectrum one time step earlier. Next, the estimation means estimates the first parameter common to the plurality of languages and the second parameter for each of the plurality of languages by optimization using the initial values of the first parameter and the second parameter and the learning feature information extracted by the extraction means. The control means outputs the estimated first parameter and second parameter when the estimation result of the estimation means satisfies a predetermined condition, and controls the estimation means to perform the estimation of the first parameter and the second parameter when the estimation result does not satisfy the predetermined condition. In the second and subsequent passes through the estimation means, the values of the first parameter and the second parameter estimated by the estimation means are used in place of the initial values.

In this way, because the basis spectra of a plurality of states and the parameters indicating the state transition probabilities of the basis spectra are estimated by optimization, it is possible to estimate parameters that enable evaluation based on linguistic properties, including the phonetic properties and phoneme transitions of a language. Parameters for accurately evaluating the type of language represented by an input speech signal can therefore be estimated without requiring prior knowledge.

In the third and fourth inventions, the estimation means can use the forward-backward algorithm to obtain, for the model, a variable γ indicating the posterior distribution of the latent variable corresponding to the basis spectrum selected at each time, and a variable ξ indicating the joint posterior distribution of two consecutive latent variables, and can use γ and ξ to update the first parameter and the second parameter so that the expected value for the first parameter and the second parameter is maximized.
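As a rough illustration of how γ and ξ can be computed, the following sketch implements the standard scaled forward-backward recursions for a generic hidden-state model. It is an assumption-laden sketch, not the procedure of this publication: the emission likelihoods `b`, the initial distribution `pi`, and the function names are introduced here, and only the transition matrix update is shown.

```python
import numpy as np


def forward_backward(b, pi, A):
    """Scaled forward-backward recursions.

    b:  (T, K) emission likelihoods p(y_t | z_t = k)
    pi: (K,)   initial state distribution (assumed here)
    A:  (K, K) transitions, A[j, k] = p(z_t = k | z_{t-1} = j)
    Returns gamma (T, K), xi (T-1, K, K), and the log-likelihood.
    """
    T, K = b.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    c = np.zeros(T)  # per-frame scaling factors

    alpha[0] = pi * b[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (b[t + 1] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta  # posterior of the state at each time
    # Joint posterior of consecutive states (z_t = j, z_{t+1} = k).
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (b[1:] * beta[1:])[:, None, :] / c[1:, None, None])
    return gamma, xi, np.log(c).sum()


def update_transitions(xi):
    """Re-estimate the transition matrix from expected transition counts."""
    counts = xi.sum(axis=0)
    return counts / counts.sum(axis=1, keepdims=True)
```

The scaling factors `c` keep the recursions numerically stable on long sequences, and their logarithms sum to the sequence log-likelihood, which can serve as a convergence measure.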

A spoken language evaluation method according to a fifth invention is a spoken language evaluation method in a spoken language evaluation device that includes extraction means, likelihood calculation means, and evaluation means, in which: the extraction means extracts evaluation feature information from an evaluation speech signal whose language type is unknown; the likelihood calculation means calculates a likelihood indicating how plausible it is that the language represented by the evaluation speech signal is each of a plurality of languages, based on the evaluation feature information extracted by the extraction means and on a model that includes, for each of the plurality of languages whose types are known, a first parameter indicating basis spectra of a plurality of states, each corresponding to one of a plurality of phonemes, extracted by non-negative matrix factorization of a learning speech signal, and a second parameter indicating state transition probabilities of the basis spectra; and the evaluation means evaluates the type of language represented by the evaluation speech signal based on the likelihood calculated by the likelihood calculation means.

A spoken language evaluation method according to a sixth invention is a spoken language evaluation method in a spoken language evaluation device that includes extraction means, likelihood calculation means, and evaluation means, in which: the extraction means extracts evaluation feature information from an evaluation speech signal whose language type is unknown; the likelihood calculation means calculates a likelihood indicating how plausible it is that the language represented by the evaluation speech signal is each of a plurality of languages, based on a model that includes a first parameter, which is learned with an update rule based on a weighted average according to the scale of the mel spectrum extracted as learning feature information from a learning speech signal and which indicates basis spectra of a plurality of states common to the plurality of languages, and a second parameter, which indicates, for each of the plurality of languages whose types are known, state transition probabilities of the basis spectra, where each transition depends on the basis spectrum one time step earlier, on a model that includes a second parameter indicating the state transition probabilities of the basis spectra for each of the plurality of languages whose types are known, and on the evaluation feature information extracted by the extraction means; and the evaluation means evaluates the type of language represented by the evaluation speech signal based on the likelihood calculated by the likelihood calculation means.

A parameter estimation method according to a seventh invention is a parameter estimation method in a parameter estimation device that includes extraction means, initial value generation means, estimation means, and control means, in which: the extraction means extracts learning feature information from a learning speech signal for each of a plurality of languages whose types are known; the initial value generation means generates, for a model that includes a first parameter indicating basis spectra of a plurality of states, each corresponding to one of a plurality of phonemes, extracted by non-negative matrix factorization of the learning speech signal, and a second parameter indicating state transition probabilities of the basis spectra, initial values of the first parameter and the second parameter for each of the plurality of languages; the estimation means estimates the first parameter and the second parameter for each of the plurality of languages by optimization using either the initial values or the current values of the first parameter and the second parameter, together with the learning feature information extracted by the extraction means; and the control means outputs the estimated first parameter and second parameter when the estimation result of the estimation means satisfies a predetermined condition, and controls the estimation means to perform the estimation of the first parameter and the second parameter when the estimation result does not satisfy the predetermined condition.

A parameter estimation method according to an eighth invention is a parameter estimation method in a parameter estimation device that includes extraction means, initial value generation means, estimation means, and control means, in which: the extraction means extracts learning feature information from a learning speech signal for each of a plurality of languages whose types are known; the initial value generation means generates, for a model that includes a first parameter, which is learned with an update rule based on a weighted average according to the scale of the mel spectrum extracted as the learning feature information and which indicates basis spectra of a plurality of states, and a second parameter, which indicates state transition probabilities of the basis spectra, where each transition depends on the basis spectrum one time step earlier, initial values of the first parameter common to the plurality of languages and of the second parameter for each of the plurality of languages; the estimation means estimates the first parameter common to the plurality of languages and the second parameter for each of the plurality of languages by optimization using either the initial values or the current values of the first parameter and the second parameter, together with the learning feature information extracted by the extraction means; and the control means outputs the estimated first parameter and second parameter when the estimation result of the estimation means satisfies a predetermined condition, and controls the estimation means to perform the estimation of the first parameter and the second parameter when the estimation result does not satisfy the predetermined condition.

In the seventh and eighth inventions, the estimation means can use the forward-backward algorithm to obtain, for the model, a variable γ indicating the posterior distribution of the latent variable corresponding to the basis spectrum selected at each time, and a variable ξ indicating the joint posterior distribution of two consecutive latent variables, and can use γ and ξ to update the first parameter and the second parameter so that the expected value for the first parameter and the second parameter is maximized.

A spoken language evaluation program according to a ninth invention is a program for causing a computer to function as each of the means constituting the spoken language evaluation device described above.

A parameter estimation program according to the ninth invention is a program for causing a computer to function as each of the means constituting the parameter estimation device described above.

As described above, the spoken language evaluation device, method, and program of the present invention use a model that includes basis spectra of a plurality of states and parameters indicating the state transition probabilities of the basis spectra. This makes it possible to perform evaluation based on linguistic properties, including the phonetic properties and phoneme transitions of a language, with the effect that the type of language represented by an input speech signal can be evaluated accurately without requiring prior knowledge.

Furthermore, the parameter estimation device, method, and program of the present invention make it possible to estimate parameters that can be used in the spoken language evaluation device, method, and program described above.

A schematic diagram for explaining the principle of the present embodiment.

A schematic diagram for explaining the principle of the present embodiment.

A block diagram showing the schematic configuration of the spoken language evaluation device according to the first embodiment.

A flowchart showing the learning processing routine in the first embodiment.

A flowchart showing the evaluation processing routine in the first embodiment.

A sequence diagram showing the learning processing and evaluation processing in the first embodiment.

A block diagram showing the schematic configuration of the spoken language evaluation device according to the second embodiment.

A flowchart showing the learning processing routine in the second embodiment.

A flowchart showing the evaluation processing routine in the second embodiment.

A sequence diagram showing the learning processing and evaluation processing in the second embodiment.

A graph showing an example of evaluation results for 5 languages.

A graph showing an example of evaluation results for 13 languages.

Embodiments of the present invention are described in detail below with reference to the drawings.

<Principle of Embodiment>
First, to facilitate understanding of the present invention, the principle of the embodiments to which the present invention is applied is described.

In each of the embodiments described below, nonnegative matrix factorization (NMF; Reference 1: D. D. Lee et al., "Learning the parts of objects by non-negative matrix factorization," Nature, Vol. 401, pp. 788-791, 1999), a technique that has been developed in the field of speech signal processing, is used to learn sparse bases and the temporal transition structure of those bases solely from a large amount of multilingual speech corpus data. In this way, language information, that is, linguistic properties including the phonetic properties and phoneme transitions characteristic of each language type, is extracted without requiring prior knowledge about each language type. The language type is then evaluated, for example classified or identified, on the basis of this language information.

Specifically, the embodiments exploit the strong ability of NMF to learn sparse bases from data without supervision: characteristic basis spectra that depend on the language type are extracted from the mel spectrogram (non-negative data) of a speech signal and applied to multilingual classification. It is also important to capture not only the basis spectra, which represent phonetic properties, but also how the basis spectra transition over time. Each embodiment therefore uses a generative model called topic-transition PLSA, which incorporates a probabilistic model of temporal basis transitions into an NMF-like unsupervised learning method for non-negative data.

PLSA (Probabilistic Latent Semantic Analysis; Reference 2: T. Hofmann, "Probabilistic Latent Semantic Indexing," in SIGIR 1999) is a natural language processing technique originally intended for text data. It handles the counts of the words appearing in each document through latent variables that correspond to topics. PLSA is mathematically equivalent to a certain type of NMF, but the modeling of temporal basis transitions, which is difficult to introduce into the NMF formulation, can be introduced naturally in PLSA as transition probabilities of the hidden variables. Using this technique, the linguistic properties of a language, including its phonetic properties and phoneme transitions, are learned separately as basis spectra and state transition probabilities, respectively.

First, the generative model based on topic-transition PLSA is described. FIG. 1 shows a schematic diagram of a mel spectrogram generated by PLSA. For example, to generate the mel spectrogram of a language whose type is m (hereinafter "language (m)"), K basis spectra $\vec{H}^{(m)} = [\vec{h}_1^{(m)}, \ldots, \vec{h}_K^{(m)}]$ are prepared. The symbol "→" denotes a vector. The k-th basis spectrum is expressed as $\vec{h}_k^{(m)} = [h_{k,1}^{(m)}, \ldots, h_{k,\Omega}^{(m)}]^T$, where $\omega = 1, \ldots, \Omega$ is an index over the center frequencies of the mel filter bank. Each element of a basis spectrum is regarded as a probability representing how readily the corresponding frequency occurs, and each basis spectrum is considered to correspond to a phoneme. At time t, one of these basis spectra is selected, and the mel spectrum at time t is regarded as being generated from a multinomial distribution whose parameter is the selected basis spectrum. The generative model can therefore be written as equation (1). Within the equations, vectors are written in bold.

Here, $\pi_{1,t}^{(m)}, \ldots, \pi_{K,t}^{(m)}$ are the occurrence probabilities of the basis spectra at time t, and $\vec{y}_t^{(m)}$ is the mel spectrum at time t. As noted above, PLSA was originally a natural language processing technique for text that handles the counts of the words appearing in each document through latent variables corresponding to topics. In the PLSA-based generative model of the mel spectrogram, the basis spectrum index k corresponds to a topic, and the mel spectrogram value $y_{\omega,t}$ is interpreted as the count of frequency (word) $\omega$ at time (document) t.

The generative model of equation (1) is extended, as shown in FIG. 2, to a model in which the basis spectrum (topic) in PLSA transitions depending on the basis spectrum (topic) at the previous time (t-1). The extended generative model can be written as equation (2). Here, $\vec{A}$ is, as shown in FIG. 2, the transition probability matrix giving the probability of a transition from the basis spectrum $k_{t-1}$ at time (t-1) to each basis spectrum $k_t$ (k = 1, ..., K) at time t.
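The generative process described above can be illustrated with a minimal sampling sketch. It assumes that one basis spectrum is active per frame, that the frame's counts are drawn from a multinomial distribution over the Ω mel bins, and that the total count per frame is supplied by the caller. The function name `sample_mel_spectrogram`, the `counts` argument, and the toy parameter values are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def sample_mel_spectrogram(pi, A, H, counts, rng):
    """Draw a state path and an integer spectrogram from the sketched model.

    pi:     (K,)  initial state probabilities
    A:      (K, K) transition matrix, A[j, k] = P(state k at t | state j at t-1)
    H:      (K, Omega) basis spectra, each row summing to 1
    counts: (T,)  total count per frame (an illustrative assumption)
    """
    T = len(counts)
    K, Omega = H.shape
    states = np.empty(T, dtype=int)
    Y = np.empty((Omega, T), dtype=int)
    for t in range(T):
        # The first frame uses pi; later frames depend on the previous state.
        p = pi if t == 0 else A[states[t - 1]]
        states[t] = rng.choice(K, p=p)
        # The frame is a multinomial draw whose parameter is the selected basis.
        Y[:, t] = rng.multinomial(counts[t], H[states[t]])
    return states, Y

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
H = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
states, Y = sample_mel_spectrogram(pi, A, H, counts=np.full(50, 100), rng=rng)
```

With a transition matrix that favors self-transitions, as in this toy example, the sampled state path tends to stay in one basis for several frames, which is the kind of temporal structure the transition matrix is meant to capture.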

The log-likelihood function of the mel spectrogram is then given by equation (3), and the parameters of the generative model of equation (2) are $\vec{\theta}^{(m)} = \{\vec{\pi}^{(m)}, \vec{A}^{(m)}, \vec{h}^{(m)}\}$.

Accordingly, consider solving the inverse problem of estimating, from the observed mel spectrogram, the linguistic properties of a language, including its phonetic properties and phoneme transitions. The phonetic properties are estimated as the parameter $\vec{h}^{(m)} = \{\vec{h}_1^{(m)}, \ldots, \vec{h}_K^{(m)}\}$ representing the basis spectra. The phoneme transitions are estimated as the parameter $\{\vec{\pi}^{(m)}, \vec{A}^{(m)}\}$ representing the state transition probabilities, which consist of the occurrence probabilities of the basis spectra (initial state probabilities) $\vec{\pi}^{(m)}$ and the transition probability matrix $\vec{A}^{(m)}$. The parameter $\vec{h}^{(m)}$ is an example of the first parameter of the present invention, and the parameter $\{\vec{\pi}^{(m)}, \vec{A}^{(m)}\}$ is an example of the second parameter of the present invention.

For convenience, $\vec{z}_t^{(m)} = [z_{1,t}^{(m)}, \ldots, z_{K,t}^{(m)}]^T$ is introduced in place of the basis spectrum index k. $\vec{z}_t^{(m)}$ is a K-dimensional binary random variable in which exactly one element $z_{k,t}^{(m)}$ is 1 and all others are 0. That is, $z_{k,t}^{(m)} \in \{0, 1\}$ and $\sum_k z_{k,t}^{(m)} = 1$. $\vec{z}_t^{(m)}$ is the latent variable of the generative model of equation (2) and takes one of K states.

Next, estimation of the parameters of the generative model based on topic-transition PLSA is described. The parameters of topic-transition PLSA, $\vec{\theta}^{(m)} = \{\vec{\pi}^{(m)}, \vec{A}^{(m)}, \vec{h}^{(m)}\}$, can be estimated using, for example, the EM algorithm. The Q function (the expected value of the complete-data log-likelihood function) can be written as equation (4). Here, $\vec{\theta}^{(m)\,\mathrm{old}}$ is the current value of the parameter $\vec{\theta}^{(m)}$ (the initial value or the most recently updated value).

In equation (4), $\gamma(\vec{z}_t^{(m)})$ is the posterior distribution of the latent variable $\vec{z}_t^{(m)}$, and $\xi(\vec{z}_{t-1}^{(m)}, \vec{z}_t^{(m)})$ is the joint posterior distribution of two consecutive latent variables $(\vec{z}_{t-1}^{(m)}, \vec{z}_t^{(m)})$. These are variables representing the expected values given by equations (5) to (7).

In the E step, the variables $\gamma(\vec{z}_t^{(m)})$ and $\xi(\vec{z}_{t-1}^{(m)}, \vec{z}_t^{(m)})$ can be obtained using, for example, the forward-backward algorithm (Reference 3: L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, pp. 257-286, 1989). The variable $\gamma(\vec{z}_t^{(m)})$ can be written out as equation (8).

Here, the variable $\alpha(\vec{z}_t^{(m)})$ is defined by equation (9) and the variable $\beta(\vec{z}_t^{(m)})$ by equation (10).

Then, with $\alpha(\vec{z}_1^{(m)})$ given by equation (11), the variable $\alpha(\vec{z}_t^{(m)})$ is computed in order for t = 1, ..., T.

Likewise, with $\beta(\vec{z}_T^{(m)}) = 1$, the variable $\beta(\vec{z}_t^{(m)})$ is computed in order for t = T, ..., 1, after which the variable $\gamma(\vec{z}_t^{(m)})$ is obtained from equation (8). Using equation (12), the variable $\xi(\vec{z}_{t-1}^{(m)}, \vec{z}_t^{(m)})$ can similarly be written out as equation (13), so it can be computed from the previously obtained variables $\alpha(\vec{z}_t^{(m)})$ and $\beta(\vec{z}_t^{(m)})$.
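The E step described above can be sketched as a scaled forward-backward pass. This sketch follows the standard hidden Markov model recursions (as in Reference 3) with multinomial emissions and is assumed, not confirmed, to match the patent's equations (8) to (13). The function names, the per-frame scaling factors `c`, and the omission of the state-independent multinomial coefficient from the log-likelihood are choices made for this illustration.

```python
import numpy as np

def log_emission(Y, H):
    """Log multinomial likelihood of every frame under every basis spectrum,
    omitting the multinomial coefficient (it does not depend on the state)."""
    # Y: (Omega, T) non-negative counts; H: (K, Omega), rows summing to 1.
    return np.log(H + 1e-300) @ Y  # shape (K, T)

def forward_backward(Y, pi, A, H):
    logB = log_emission(Y, H)
    shift = logB.max(axis=0)          # per-frame shift for numerical stability
    B = np.exp(logB - shift)          # shifted emission likelihoods, (K, T)
    K, T = B.shape
    alpha = np.empty((T, K))
    beta = np.empty((T, K))
    c = np.empty(T)                   # per-frame scaling factors

    # Forward pass, t = 1, ..., T (normalised at every frame).
    alpha[0] = pi * B[:, 0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = B[:, t] * (alpha[t - 1] @ A)
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    # Backward pass, t = T, ..., 1, starting from beta(z_T) = 1.
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, t + 1] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta              # posterior of each state, (T, K)
    # Joint posterior of consecutive states, (T-1, K, K).
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, 1:].T * beta[1:])[:, None, :]) / c[1:, None, None]
    loglik = np.log(c).sum() + shift.sum()
    return gamma, xi, loglik
```

Scaling each forward vector to sum to one avoids numerical underflow on long utterances; the log-likelihood is then recovered as the sum of the log scaling factors.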

In the M step, the variables $\gamma(\vec{z}_t^{(m)})$ and $\xi(\vec{z}_{t-1}^{(m)}, \vec{z}_t^{(m)})$ are treated as constants, and the parameter $\vec{\theta}^{(m)} = \{\vec{\pi}^{(m)}, \vec{A}^{(m)}, \vec{h}^{(m)}\}$ that maximizes the Q function $Q(\vec{\theta}^{(m)}, \vec{\theta}^{(m)\,\mathrm{old}})$ is estimated. Appropriate Lagrange multipliers can be used for this purpose. The update equations for the parameter $\vec{\theta}^{(m)} = \{\vec{\pi}^{(m)}, \vec{A}^{(m)}, \vec{h}^{(m)}\}$ are given by equations (14) to (16).
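A minimal sketch of such an M step is given below. It uses the standard closed-form updates for a hidden Markov model with multinomial emissions, which is what maximizing the Q function under sum-to-one constraints yields; that these coincide exactly with the patent's equations (14) to (16) is an assumption. The function name `m_step` and the array layouts are illustrative.

```python
import numpy as np

def m_step(Y, gamma, xi):
    """Re-estimate (pi, A, H) from the E-step posteriors.

    Y:     (Omega, T) integer mel spectrogram
    gamma: (T, K)     state posteriors
    xi:    (T-1, K, K) joint posteriors of consecutive states
    Assumes every state has non-zero posterior mass; otherwise the
    normalisations below would divide by zero.
    """
    # Initial state probabilities from the posterior at the first frame.
    pi = gamma[0] / gamma[0].sum()
    # Transition matrix: expected transition counts, normalised per row.
    A = xi.sum(axis=0)
    A /= A.sum(axis=1, keepdims=True)
    # Basis spectra: posterior-weighted counts per mel bin, normalised per state.
    H = gamma.T @ Y.T                 # (K, Omega)
    H /= H.sum(axis=1, keepdims=True)
    return pi, A, H
```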

Equation (16) is the update equation used when the parameter $\vec{h}^{(m)}$ representing the basis spectra is estimated separately for each language (m). It is also possible to estimate a parameter $\vec{h}$ representing basis spectra common to all target language types while estimating the parameter $\{\vec{\pi}^{(m)}, \vec{A}^{(m)}\}$ representing state transition probabilities that differ for each language (m). In that case, the update equation for the parameter $\vec{h}$ representing the common basis spectra is equation (17).
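One way to realize a shared-basis update is to pool the posterior-weighted counts over all languages before normalizing. The sketch below assumes this pooled form; whether it matches the patent's equation (17) term by term is not confirmed by the text. The function name `update_shared_basis` and the list-based inputs are illustrative.

```python
import numpy as np

def update_shared_basis(Ys, gammas):
    """Update basis spectra shared by all languages.

    Ys:     list of (Omega, T_m) integer mel spectrograms, one per language
    gammas: list of (T_m, K) state posteriors, one per language
    Returns a (K, Omega) array whose rows sum to 1.
    """
    # Accumulate posterior-weighted counts over every language, then normalise.
    num = sum(g.T @ Y.T for Y, g in zip(Ys, gammas))
    return num / num.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Ys = [rng.integers(0, 10, size=(4, 6)), rng.integers(0, 10, size=(4, 9))]
gammas = [rng.dirichlet(np.ones(3), size=6), rng.dirichlet(np.ones(3), size=9)]
H_shared = update_shared_basis(Ys, gammas)
```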

<First Embodiment>
Next, a first embodiment is described.

The spoken language evaluation device 10 according to the first embodiment is implemented as a computer including a CPU, a RAM, and a ROM that stores a program for executing a spoken language evaluation processing routine comprising the learning processing and evaluation processing described below.

Functionally, this computer can be represented, as shown in FIG. 3, by a configuration including a learning unit 20 and an evaluation unit 40.

First, each part of the learning unit 20 is described in detail. The learning unit 20 can be represented by a configuration including a speech feature extraction unit 21, a parameter initial value generation unit 22, a parameter estimation unit 23, a convergence determination unit 24, and a parameter output unit 25. The speech feature extraction unit 21 is an example of the extraction means of the present invention. The parameter initial value generation unit 22 is an example of the initial value generation means of the present invention. The parameter estimation unit 23 is an example of the estimation means of the present invention. The convergence determination unit 24 and the parameter output unit 25 are an example of the control means of the present invention.

The speech feature extraction unit 21 accepts as input a training speech signal whose language type is known. Let m denote the language type of the training speech signal; this signal is hereinafter referred to as the "training speech signal of language (m)". The speech feature extraction unit 21 accepts two or more language types, and M types are assumed here. That is, m is an index of the language type, with m = 1, ..., M. Each part of the learning unit 20 (the speech feature extraction unit 21, the parameter initial value generation unit 22, the parameter estimation unit 23, the convergence determination unit 24, and the parameter output unit 25) performs the same processing on the training speech signal of each language. The following describes the processing that each part of the learning unit 20 performs for one of the M languages.

The speech feature extraction unit 21 extracts a speech feature $\vec{y}_t^{\prime (m)}$ from the training speech signal of language (m) and outputs it, where t denotes time. For example, focusing on the spectral envelope, which expresses the characteristics of phonemes, a short-time Fourier transform is applied to the training speech signal of language (m) frame by frame with a frame length of 32 ms and a frame shift of 16 ms, and the output values obtained by passing the amplitude spectrum through a mel filter bank can be used as the speech feature $\vec{y}_t^{\prime (m)}$. For convenience, the t-th frame is referred to as time t, so that t = 1, ..., T, where T is the total number of frames. The speech feature at time t is $\vec{y}_t^{\prime (m)} = [y_{1,t}^{\prime (m)}, \ldots, y_{\Omega,t}^{\prime (m)}]^T$; for example, with 22 mel filters, $\Omega = 22$. The amplitude spectrogram itself may also be used as the speech feature. Because the speech feature is expressed as a random variable following a multinomial distribution, as in equation (2), the speech feature extraction unit 21 generates and outputs a rounded speech feature $\vec{y}_t^{(m)}$ in which every element of $\vec{y}_t^{\prime (m)}$ is rounded to an integer.
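The feature extraction described above can be sketched with NumPy alone. The frame length (32 ms), frame shift (16 ms), and number of mel filters (22) come from the text; the 16 kHz sampling rate, the Hann window, the triangular filter shape, the particular mel-scale formula, and the `scale` factor applied before rounding are assumptions made for this illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters with centres equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def rounded_mel_features(signal, sr=16000, frame_ms=32, shift_ms=16,
                         n_filters=22, scale=1.0):
    """Return an (Omega, T) integer mel spectrogram.

    Assumes len(signal) is at least one frame long.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame) // hop
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    spec = np.abs(np.fft.rfft(signal[idx] * np.hanning(frame), axis=1))
    mel = mel_filterbank(n_filters, frame, sr) @ spec.T   # (Omega, T)
    # Round every element to an integer so frames can be treated as counts.
    return np.rint(scale * mel).astype(int)

t = np.arange(16000) / 16000.0
Y = rounded_mel_features(np.sin(2 * np.pi * 440.0 * t))
```

The `scale` factor matters in practice: if the mel outputs are small, rounding would discard most of the information, so the signal or the features may need to be scaled before rounding.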

The parameter initial value generation unit 22 accepts as input the rounded speech feature $\vec{y}_t^{(m)}$ output from the speech feature extraction unit 21 and generates initial values of the parameter $\vec{\theta}^{(m)} = \{\vec{\pi}^{(m)}, \vec{A}^{(m)}, \vec{h}^{(m)}\}$ of the generative model of equation (2). Here, $\vec{h}^{(m)}$ is the parameter representing the basis spectra, $\vec{\pi}^{(m)}$ is the parameter representing the initial state probability of each basis spectrum, and $\vec{A}^{(m)}$ is the parameter representing the transition probability matrix of the basis spectra.

Specifically, the parameter initial value generation unit 22 treats the rounded speech features $\vec{y}_t^{(m)}$ (t = 1, ..., T) as a single matrix $\vec{Y}^{(m)}$, applies ordinary NMF to it, and uses the estimated basis spectra as the initial value of the parameter $\vec{h}^{(m)}$. This NMF step can be implemented with well-known techniques (for example, Reference 4: D. D. Lee et al., "Algorithms for non-negative matrix factorization," in NIPS 2000; Reference 5: A. T. Cemgil, "Bayesian inference in nonnegative matrix factorisation models," University of Cambridge, 2008).

For the parameter $\vec{\pi}^{(m)}$, the parameter initial value generation unit 22 generates as the initial value the uniform probability 1/K for every element (K is the total number of basis spectra). For the parameter $\vec{A}^{(m)}$, it likewise generates the uniform probability 1/K in every row as the initial value.
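The initialization can be sketched as follows. The text only says that ordinary NMF is applied; this sketch assumes the multiplicative updates for the generalized Kullback-Leibler divergence from Reference 4, and additionally normalizes each estimated basis to sum to one so that it can serve as a multinomial parameter. The function name `init_parameters`, the iteration count, and the random seed are illustrative.

```python
import numpy as np

def init_parameters(Y, K, n_iter=200, seed=0, eps=1e-12):
    """Return initial (pi, A, H) for a (Omega, T) non-negative matrix Y."""
    rng = np.random.default_rng(seed)
    Omega, T = Y.shape
    W = rng.random((Omega, K)) + eps   # basis spectra as columns
    U = rng.random((K, T)) + eps       # activations
    for _ in range(n_iter):
        # Multiplicative updates for the generalised KL divergence.
        R = Y / (W @ U + eps)
        U *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
        R = Y / (W @ U + eps)
        W *= (R @ U.T) / (U.sum(axis=1)[None, :] + eps)
    # Normalise each basis so it is a probability distribution over mel bins.
    H = (W / W.sum(axis=0, keepdims=True)).T   # (K, Omega)
    pi = np.full(K, 1.0 / K)                   # uniform initial state probabilities
    A = np.full((K, K), 1.0 / K)               # uniform rows of the transition matrix
    return pi, A, H
```

Starting the transition probabilities from a uniform distribution means that the first E step is driven entirely by the NMF-initialized basis spectra.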

The parameter estimation unit 23 takes as input either the initial value of the parameter $\vec{\theta}^{(m)} = \{\vec{\pi}^{(m)}, \vec{A}^{(m)}, \vec{h}^{(m)}\}$ generated by the parameter initial value generation unit 22 or the updated value of $\vec{\theta}^{(m)}$ produced by the state transition probability update unit 232 and the basis spectrum update unit 233 described below, together with the rounded speech feature $\vec{y}_t^{(m)}$ output from the speech feature extraction unit 21. It updates the parameter $\vec{\theta}^{(m)}$ so that the value of the likelihood function of the mel spectrogram of language (m) is maximized. The parameter estimation unit 23 can further be represented by a configuration including a forward-backward algorithm unit 231, a state transition probability update unit 232, and a basis spectrum update unit 233.

The forward-backward algorithm unit 231 accepts as input either the initial value of the parameter $\vec{\theta}^{(m)} = \{\vec{\pi}^{(m)}, \vec{A}^{(m)}, \vec{h}^{(m)}\}$ generated by the parameter initial value generation unit 22 or the updated value of $\vec{\theta}^{(m)}$ produced by the state transition probability update unit 232 and the basis spectrum update unit 233 described below. Using the initial or updated parameter values, it applies the forward-backward algorithm to compute the variable $\alpha(\vec{z}_t^{(m)})$ of equation (9) and the variable $\beta(\vec{z}_t^{(m)})$ of equation (10), with equation (11) as the starting point. It then uses the computed $\alpha(\vec{z}_t^{(m)})$ and $\beta(\vec{z}_t^{(m)})$ to obtain and output the variable $\gamma(\vec{z}_t^{(m)})$ of equation (8) and the variable $\xi(\vec{z}_{t-1}^{(m)}, \vec{z}_t^{(m)})$ of equation (13).

The state transition probability update unit 232 accepts as input the variables $\gamma(\vec{z}_t^{(m)})$ and $\xi(\vec{z}_{t-1}^{(m)}, \vec{z}_t^{(m)})$ output from the forward-backward algorithm unit 231. Using these variables, it updates the parameter $\vec{\pi}^{(m)}$ representing the initial state probabilities by equation (14) and the parameter $\vec{A}^{(m)}$ representing the transition probability matrix by equation (15). The parameter $\{\vec{\pi}^{(m)}, \vec{A}^{(m)}\}$ representing the state transition probabilities is thereby updated. The state transition probability update unit 232 outputs the updated value of the parameter $\{\vec{\pi}^{(m)}, \vec{A}^{(m)}\}$.

The basis spectrum update unit 233 accepts as input the rounded speech feature $\vec{y}_t^{(m)}$ output from the speech feature extraction unit 21 and the variables $\gamma(\vec{z}_t^{(m)})$ and $\xi(\vec{z}_{t-1}^{(m)}, \vec{z}_t^{(m)})$ output from the forward-backward algorithm unit 231. Using the rounded speech feature and these variables, it updates every element of the parameter $\vec{h}^{(m)} = \{\vec{h}_1^{(m)}, \ldots, \vec{h}_K^{(m)}\}$ representing the basis spectra by equation (16). The basis spectrum update unit 233 outputs the updated parameter $\vec{h}^{(m)}$.

The convergence determination unit 24 accepts as input the parameter $\{\vec{\pi}^{(m)}, \vec{A}^{(m)}\}$ output from the state transition probability update unit 232 and the parameter $\vec{h}^{(m)}$ output from the basis spectrum update unit 233. Using the parameter $\vec{\theta}^{(m)} = \{\vec{\pi}^{(m)}, \vec{A}^{(m)}, \vec{h}^{(m)}\}$, it computes the variable $\alpha(\vec{z}_T^{(m)})$ by equations (9) and (11) and then computes the likelihood function of equation (12). The convergence determination unit 24 determines whether the computed value of the likelihood function has converged, for example according to whether a predetermined condition is satisfied. For example, convergence can be declared when the difference between the likelihood value computed in the previous step and the value computed in the current step is no greater than a predetermined threshold ε, where ε can be set to, for example, $1.0 \times 10^{-5}$.

When the convergence determination unit 24 determines that the value of the likelihood function has converged, that is, when the predetermined condition is satisfied, it passes the parameter $\vec{\theta}^{(m)}$ to the parameter output unit 25. When it determines that the value has not converged, that is, when the predetermined condition is not satisfied, it passes the parameter $\vec{\theta}^{(m)}$ to the forward-backward algorithm unit 231.

As a method of determining convergence other than the one based on the likelihood function, the determination may instead be based on whether the difference between the value of each parameter before and after the update is no greater than a predetermined threshold ε2. In this case, the predetermined condition is that the difference between the pre-update and post-update value of each parameter is no greater than ε2, where ε2 can be set to, for example, $1.0 \times 10^{-5}$. Alternatively, convergence may be declared when the number of parameter updates reaches a predetermined number of iterations. In this case, the predetermined condition is that the number of parameter updates has reached that number, which can be set to, for example, 1000.
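The three stopping criteria can be sketched as small helper functions. The thresholds and the iteration limit are the example values from the text; applying the first criterion to the log-likelihood rather than the raw likelihood, measuring the parameter change as the largest absolute element-wise difference, and combining criteria in `should_stop` are assumptions made for this illustration, as are the function names.

```python
import numpy as np

def likelihood_converged(prev_loglik, loglik, eps=1.0e-5):
    """True when the change in (log-)likelihood is no greater than eps."""
    return abs(loglik - prev_loglik) <= eps

def parameters_converged(old_params, new_params, eps2=1.0e-5):
    """True when every parameter array changed by no more than eps2.

    old_params / new_params: iterables of arrays such as (pi, A, H).
    """
    return all(np.max(np.abs(new - old)) <= eps2
               for old, new in zip(old_params, new_params))

def should_stop(prev_loglik, loglik, n_updates, eps=1.0e-5, max_updates=1000):
    """Stop on likelihood convergence or when the update limit is reached."""
    return likelihood_converged(prev_loglik, loglik, eps) or n_updates >= max_updates
```

Combining a tolerance with an iteration limit is a common safeguard, because EM can keep improving the likelihood by very small amounts for many iterations.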

The parameter output unit 25 stores all of the parameters $\vec{\theta}^{(m)} = \{\vec{\pi}^{(m)}, \vec{A}^{(m)}, \vec{h}^{(m)}\}$ passed from the convergence determination unit 24 in the language (m) parameter accumulation database (DB) 31m, which is held in the parameter storage unit 30.

By performing the above processing of each part of the learning unit 20 for each of the M languages, the parameters $\vec{\theta}^{(1)}, \vec{\theta}^{(2)}, \ldots, \vec{\theta}^{(M)}$ estimated for language (1), language (2), ..., language (M) are accumulated in the language (1) parameter accumulation DB 311, the language (2) parameter accumulation DB 312, ..., and the language (M) parameter accumulation DB 31M, respectively, and are held in the parameter storage unit 30.

Next, each part of the evaluation unit 40 is described in detail. The evaluation unit 40 can be represented by a configuration including a speech feature extraction unit 41, a likelihood calculation unit 42, and a language evaluation result output unit 43. The speech feature extraction unit 41 is an example of the extraction means of the present invention. The likelihood calculation unit 42 is an example of the likelihood calculation means of the present invention. The language evaluation result output unit 43 is an example of the evaluation means of the present invention.

The speech feature extraction unit 41 accepts as input an evaluation speech signal whose language type is unknown. In the same way that the speech feature extraction unit 21 of the learning unit 20 extracts the speech feature $\vec{y}_t^{\prime (m)}$ (t = 1, ..., T) from the training speech signal of language (m), the speech feature extraction unit 41 extracts a speech feature $\vec{y}_t^{\prime}$ (t = 1, ..., T) from the evaluation speech signal. It then generates and outputs a rounded speech feature $\vec{y}_t$ in which every element of $\vec{y}_t^{\prime}$ is rounded to an integer.

The likelihood calculation unit 42 accepts as input the rounded speech feature $\vec{y}_t$ output from the speech feature extraction unit 41. Using $\vec{y}_t$ and the parameter $\vec{\theta}^{(m)}$ accumulated in each language (m) parameter accumulation DB 31m held in the parameter storage unit 30, it calculates, for every language whose parameters are stored in the parameter storage unit 30, a likelihood indicating how plausible it is that the language type of the evaluation speech signal is language (m).

Specifically, using the parameter $\vec{\theta}^{(m)}$ for each language (m) accumulated in each language (m) parameter accumulation DB 31m, the likelihood calculation unit 42 computes the variable $\alpha(\vec{z}_T^{(m)})$ by equations (9) and (11) and then computes the likelihood function of equation (12). The likelihood calculation unit 42 outputs the value of the likelihood function (the likelihood) $L_m$ (m = 1, ..., M) calculated for each language (m).

The language evaluation result output unit 43 compares the likelihoods $L_m$ of the languages (m) output from the likelihood calculation unit 42 and outputs a language evaluation result indicating that the language (m) with the maximum likelihood $L_m$ is the language type of the evaluation speech signal.
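The evaluation step can be sketched as a forward pass per language followed by a maximum-likelihood decision. The sketch assumes multinomial emissions and a scaled forward recursion of the standard hidden Markov model form, and compares log-likelihoods rather than raw likelihoods (which gives the same ranking). The function names, the `models` dictionary, and the toy values are illustrative.

```python
import numpy as np

def log_likelihood(Y, pi, A, H):
    """Log-likelihood of a (Omega, T) integer spectrogram under one language model,
    omitting the multinomial coefficient, which is identical for every language."""
    logB = np.log(H + 1e-300) @ Y     # (K, T) log emission terms
    shift = logB.max(axis=0)          # per-frame shift for numerical stability
    B = np.exp(logB - shift)
    alpha = pi * B[:, 0]
    total = 0.0
    for t in range(B.shape[1]):
        if t > 0:
            alpha = B[:, t] * (alpha @ A)
        c = alpha.sum()
        alpha = alpha / c
        total += np.log(c) + shift[t]
    return total

def identify_language(Y, models):
    """models: dict mapping a language label to its (pi, A, H) parameters."""
    scores = {name: log_likelihood(Y, *theta) for name, theta in models.items()}
    return max(scores, key=scores.get), scores

uniform_pi = np.array([0.5, 0.5])
uniform_A = np.full((2, 2), 0.5)
models = {
    "a": (uniform_pi, uniform_A, np.array([[0.8, 0.1, 0.1], [0.6, 0.2, 0.2]])),
    "b": (uniform_pi, uniform_A, np.array([[0.1, 0.1, 0.8], [0.2, 0.2, 0.6]])),
}
Y_eval = np.array([[90] * 5, [5] * 5, [5] * 5])
best, scores = identify_language(Y_eval, models)
```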

Next, the operation of the spoken language evaluation device 10 according to the first embodiment is described. First, the learning unit 20 executes the learning process shown in FIG. 4, generating a language (m) parameter storage DB 31m for each of the target languages and storing it in the parameter storage unit 30. Once the language (m) parameter storage DBs 31m (m = 1, ..., M) have been generated, inputting an evaluation speech signal to the spoken language evaluation device 10 causes the evaluation unit 40 to execute the evaluation process shown in FIG. 5. Each process is described in detail below, with reference also to the sequence diagram shown in FIG. 6.

First, in step 100 of the learning process shown in FIG. 4, the speech feature extraction unit 21 extracts speech features ^→y'_t^(m) (t = 1, ..., T) from the input training speech signal of language (m) and generates rounded speech features ^→y_t^(m) by rounding every element to an integer value. As shown in FIG. 6, the speech feature extraction unit 21 passes the rounded speech features ^→y_t^(m) to the parameter initial value generation unit 22 and the basis spectrum update unit 233.

Next, in step 102, the parameter initial value generation unit 22 uses the rounded speech features ^→y_t^(m) received from the speech feature extraction unit 21 to generate the initial value of the parameter ^→h^(m), for example as the basis spectra estimated by applying ordinary NMF. For the parameter ^→π^(m), it generates as the initial value, for example, the value 1/K, which gives equal probability to all elements. Likewise, for the parameter ^→A^(m), it generates as the initial value, for example, the value 1/K, which gives equal probability within every row. As shown in FIG. 6, the parameter initial value generation unit 22 combines these initial values into the initial value of the parameter ^→θ^(m) and passes it to the forward-backward algorithm unit 231.
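The uniform initialization of ^→π^(m) and ^→A^(m) can be sketched as follows. The function name `init_transition_params` is hypothetical, and the NMF-based initialization of ^→h^(m) is deliberately left out because it depends on details not given in this passage.

```python
def init_transition_params(K):
    """Return (pi, A) with every entry set to 1/K for a K-state model."""
    pi = [1.0 / K] * K                      # initial state probabilities
    A = [[1.0 / K] * K for _ in range(K)]   # each row is a uniform distribution
    return pi, A


pi0, A0 = init_transition_params(4)
```

Building `A` with a list comprehension gives each row its own list, so a later in-place update of one row does not silently change the others.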

Next, in step 104, the forward-backward algorithm unit 231 uses the initial value of the parameter ^→θ^(m) received from the parameter initial value generation unit 22 and, by the forward-backward algorithm, calculates the variable α(^→z_t^(m)) of equation (9) and the variable β(^→z_t^(m)) of equation (10) using equation (11). From the calculated α(^→z_t^(m)) and β(^→z_t^(m)), it then obtains the variable γ(^→z_t^(m)) of equation (8) and the variable ξ(^→z_{t-1}^(m), ^→z_t^(m)) of equation (13). As shown in FIG. 6, the forward-backward algorithm unit 231 passes the obtained γ(^→z_t^(m)) and ξ(^→z_{t-1}^(m), ^→z_t^(m)) to the state transition probability update unit 232 and the basis spectrum update unit 233.
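As an illustration of step 104, the following is a minimal sketch of the textbook scaled forward-backward recursion for a K-state hidden Markov model. It is not a transcription of equations (8) to (13), which are not reproduced in this passage. The emission likelihoods `B[t][k]` are taken as given inputs, and all names are assumptions.

```python
import math


def forward_backward(pi, A, B):
    """Scaled forward-backward pass.

    pi: initial state probabilities (length K)
    A:  transition matrix, A[i][j] = p(z_t = j | z_{t-1} = i)
    B:  emission likelihoods, B[t][k] = p(y_t | z_t = k)
    Returns (gamma, xi, log_likelihood).
    """
    T, K = len(B), len(pi)
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[1.0] * K for _ in range(T)]
    scale = [0.0] * T

    # Forward pass, normalising each frame so that alpha[t] sums to one.
    for t in range(T):
        for k in range(K):
            prior = pi[k] if t == 0 else sum(
                alpha[t - 1][i] * A[i][k] for i in range(K))
            alpha[t][k] = prior * B[t][k]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, reusing the forward scale factors.
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(
                A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(K)
            ) / scale[t + 1]

    # State posteriors and pairwise posteriors for consecutive frames.
    gamma = [[alpha[t][k] * beta[t][k] for k in range(K)] for t in range(T)]
    xi = [[[alpha[t - 1][i] * A[i][j] * B[t][j] * beta[t][j] / scale[t]
            for j in range(K)] for i in range(K)] for t in range(1, T)]
    log_likelihood = sum(math.log(s) for s in scale)
    return gamma, xi, log_likelihood


pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.1], [0.6, 0.3], [0.1, 0.8]]
gamma, xi, log_likelihood = forward_backward(pi, A, B)
```

Per-frame scaling keeps the recursion from underflowing on long utterances, and the log-likelihood falls out as the sum of the log scale factors.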

Next, in step 106, the state transition probability update unit 232 uses the variables γ(^→z_t^(m)) and ξ(^→z_{t-1}^(m), ^→z_t^(m)) received from the forward-backward algorithm unit 231 to update, by equations (14) and (15), the parameter ^→π^(m) representing the initial state probabilities and the parameter ^→A^(m) representing the transition probability matrix. As shown in FIG. 6, the state transition probability update unit 232 passes the updated values of ^→π^(m) and ^→A^(m) to the convergence determination unit 24.
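A sketch of what such an update typically looks like is given below, using the standard Baum-Welch re-estimation for a single observation sequence. Equations (14) and (15) are not reproduced in this passage, so this is the textbook form rather than a confirmed copy of them. The function name and data layout are assumptions.

```python
def update_transition_params(gamma, xi):
    """Re-estimate (pi, A) from state and pairwise posteriors.

    gamma: gamma[t][k], posterior of state k at frame t
    xi:    xi[t][i][j], posterior of moving from state i to state j
           between frames t and t + 1
    """
    K = len(gamma[0])
    first_total = sum(gamma[0])
    pi = [g / first_total for g in gamma[0]]
    A = []
    for i in range(K):
        # Expected number of i -> j transitions, summed over time.
        row = [sum(x[i][j] for x in xi) for j in range(K)]
        total = sum(row)  # assumed non-zero in this sketch
        A.append([r / total for r in row])
    return pi, A


new_pi, new_A = update_transition_params(
    gamma=[[0.6, 0.4], [0.5, 0.5]],
    xi=[[[0.4, 0.2], [0.1, 0.3]]],
)
```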

Next, in step 108, the basis spectrum update unit 233 uses the rounded speech features ^→y_t^(m) received from the speech feature extraction unit 21, together with the variables γ(^→z_t^(m)) and ξ(^→z_{t-1}^(m), ^→z_t^(m)) received from the forward-backward algorithm unit 231, to update all elements of the parameter ^→h^(m) representing the basis spectra by equation (16). As shown in FIG. 6, the basis spectrum update unit 233 passes the updated value of ^→h^(m) to the convergence determination unit 24.

Note that steps 106 and 108 may be performed in the reverse order.

Next, in step 110, the convergence determination unit 24 takes the updated parameter ^→θ^(m) = {^→π^(m), ^→A^(m), ^→h^(m)}, which combines the updated values of ^→π^(m) and ^→A^(m) received from the state transition probability update unit 232 with the updated value of ^→h^(m) received from the basis spectrum update unit 233. With it, the unit calculates the variable α(^→z_T^(m)) by equations (9) and (11), evaluates the likelihood function of equation (12), and determines whether the likelihood function value has converged, for example according to whether a predetermined condition is satisfied. If the value has not converged, that is, if the predetermined condition is not satisfied, the process returns to step 104. In that case, as shown in FIG. 6, the convergence determination unit 24 passes the updated value of ^→θ^(m) to the forward-backward algorithm unit 231. The update of ^→θ^(m) in steps 104 to 108 is thereby repeated.

If, on the other hand, step 110 determines that the likelihood function value has converged, that is, that the predetermined condition is satisfied, the process proceeds to step 112. In that case, as shown in FIG. 6, the convergence determination unit 24 passes the updated value of ^→θ^(m) to the parameter output unit 25.

In step 112, the parameter output unit 25 stores all of the parameters ^→θ^(m) = {^→π^(m), ^→A^(m), ^→h^(m)} received from the convergence determination unit 24 in the language (m) parameter storage DB 31m and, as shown in FIG. 6, passes it to the parameter storage unit 30. The learning process then ends.
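The control flow of steps 104 to 112 can be summarized as a loop. In the sketch below, `step` is a hypothetical callable standing in for one pass of steps 104 to 108 plus the likelihood evaluation of step 110. The threshold and iteration cap are placeholder defaults, not values prescribed for the first embodiment.

```python
def train(step, theta, eps=1e-5, max_iter=1000):
    """Repeat `step` until the likelihood value changes by at most `eps`.

    step: callable mapping theta -> (updated_theta, likelihood_value)
    Returns the final theta and the number of iterations performed.
    """
    previous = None
    for iteration in range(1, max_iter + 1):
        theta, value = step(theta)
        if previous is not None and abs(value - previous) <= eps:
            return theta, iteration      # converged: proceed to step 112
        previous = value                 # not converged: back to step 104
    return theta, max_iter


# Toy stand-in whose "likelihood" halves on every call.
theta, iterations = train(lambda th: (th / 2.0, th / 2.0), 1.0)
```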

By executing the above learning process for each of language (1), language (2), ..., language (M), the language (1) parameter storage DB 311 holding the parameters ^→θ^(1) of language (1), the language (2) parameter storage DB 312 holding the parameters ^→θ^(2) of language (2), ..., and the language (M) parameter storage DB 31M holding the parameters ^→θ^(M) of language (M) are each generated and stored in the parameter storage unit 30.

Next, in step 120 of the evaluation process shown in FIG. 5, the speech feature extraction unit 41 extracts speech features ^→y'_t (t = 1, ..., T) from the input evaluation speech signal, in the same way that the speech feature extraction unit 21 extracts the speech features ^→y'_t^(m) (t = 1, ..., T) from the training speech signal of language (m). The speech feature extraction unit 41 also generates rounded speech features ^→y_t by rounding every element of ^→y'_t to an integer value. As shown in FIG. 6, it passes the rounded speech features ^→y_t to the likelihood calculation unit 42.

Next, in step 122, the likelihood calculation unit 42 sets the variable m, which corresponds to the index of the language type, to 1. Then, in step 124, the likelihood calculation unit 42 uses the rounded speech features ^→y_t received from the speech feature extraction unit 41 and the parameters ^→θ^(m) stored in the language (m) parameter storage DB 31m held in the parameter storage unit 30 to calculate the likelihood L_m, which indicates how plausible it is that the language of the evaluation speech signal is language (m).

Next, in step 126, the likelihood calculation unit 42 checks whether the variable m has reached M, and thereby determines whether likelihoods have been calculated for all languages whose parameters are stored in the parameter storage unit 30. If m ≠ M, the process proceeds to step 128, increments m by 1, and returns to step 124. If m = M, the process proceeds to step 130. In that case, as shown in FIG. 6, the likelihood calculation unit 42 passes all of the calculated likelihoods L_m (m = 1, ..., M) to the language evaluation result output unit 43.

In step 130, the language evaluation result output unit 43 compares the likelihoods L_m of the languages (m) received from the likelihood calculation unit 42 and outputs a language evaluation result stating that the language (m) with the largest likelihood L_m is the language of the evaluation speech signal. The evaluation process then ends.
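Steps 122 to 130 amount to scoring the utterance under every stored model and taking the best score. The sketch below mirrors that flow step by step. The `likelihood` argument is a hypothetical callable standing in for the calculation of step 124, and the toy scoring function in the usage lines is purely illustrative.

```python
def evaluate_language(features, models, likelihood):
    """Return (best_m, likelihoods) for 1-based language indices m = 1..M.

    models:     per-language parameter sets, models[m - 1] for language m
    likelihood: callable (features, model) -> L_m
    """
    M = len(models)
    likelihoods = []
    m = 1                                                          # step 122
    while True:
        likelihoods.append(likelihood(features, models[m - 1]))    # step 124
        if m == M:                                                 # step 126
            break
        m += 1                                                     # step 128
    best_m = max(range(1, M + 1), key=lambda i: likelihoods[i - 1])  # step 130
    return best_m, likelihoods


# Toy score: the closer the model value is to the feature sum, the higher.
best, scores = evaluate_language(
    [1, 2, 3], [10, 6, 2], lambda y, model: -abs(sum(y) - model))
```

If two languages tie for the largest value, `max` returns the lower index. The text does not say how ties are resolved, so that behavior is an artifact of this sketch.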

As described above, the spoken language evaluation device according to the first embodiment estimates, from the speech features extracted from the speech signal of each language, the optimal parameters of a generative model that includes parameters representing the basis spectra and the state transition probabilities between basis spectra. This yields parameters that reflect the linguistic properties of a language, including its phonetic properties and phoneme transitions. Using these parameters, the language of an input speech signal can be evaluated accurately without requiring prior knowledge.

<Second Embodiment>
Next, a second embodiment is described. The first embodiment estimates the parameters ^→θ^(m) = {^→π^(m), ^→A^(m), ^→h^(m)} separately for each language type (m). The second embodiment instead estimates a parameter ^→h representing basis spectra common to all target language types, together with parameters {^→π^(m), ^→A^(m)} representing the state transition probabilities of each language type (m). In the spoken language evaluation device according to the second embodiment, components identical or corresponding to those of the spoken language evaluation device 10 of the first embodiment are given identical or corresponding reference numerals, and their detailed description is omitted.

The spoken language evaluation device 210 according to the second embodiment is implemented by a computer comprising a CPU, a RAM, and a ROM storing a program for executing a spoken language evaluation processing routine that includes the learning process and the evaluation process described later.

Functionally, as shown in FIG. 7, this computer can be represented as a configuration including a learning unit 220 and an evaluation unit 240.

First, each part of the learning unit 220 is described in detail. The learning unit 220 can be represented as a configuration including a speech feature extraction unit 221, a parameter initial value generation unit 222, a parameter estimation unit 223, a convergence determination unit 224, and a parameter output unit 225. The speech feature extraction unit 221 is an example of the extraction means of the present invention. The parameter initial value generation unit 222 is an example of the initial value generation means of the present invention. The parameter estimation unit 223 is an example of the estimation means of the present invention. The convergence determination unit 224 and the parameter output unit 225 are an example of the control means of the present invention.

The speech feature extraction unit 221 receives as input training speech signals whose language types are known. In the first embodiment, a training speech signal of a single language type m was input. In the second embodiment, speech signals of multiple language types m (m = 1, ..., M, where M is the total number of language types) are input as the training speech signals (hereinafter referred to as "the training speech signals of languages (m = 1, ..., M)").

In the same way that the speech feature extraction unit 21 of the first embodiment extracts the speech features ^→y'_t^(m) (t = 1, ..., T) from the training speech signal of language (m), the speech feature extraction unit 221 extracts and outputs the speech features ^→y'_t^(m) (t = 1, ..., T; m = 1, ..., M) from the training speech signals of languages (m = 1, ..., M). The speech feature extraction unit 221 also generates and outputs rounded speech features ^→y_t^(m), obtained by rounding every element of ^→y'_t^(m) to an integer value.

The parameter initial value generation unit 222 treats the rounded speech features ^→y_t^(m) (t = 1, ..., T; m = 1, ..., M) as a single matrix ^→Y and generates the initial value of the parameter ^→h as the basis spectra estimated by applying ordinary NMF. The initial values of the parameters ^→π^(m) and ^→A^(m) are generated in the same way as by the parameter initial value generation unit 22 of the first embodiment, except that they are generated for m = 1, ..., M. The same applies below to the updated values of ^→π^(m) and ^→A^(m): they differ from the first embodiment in that m = 1, ..., M.
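One way to read "treating the features of all languages as a single matrix" is to concatenate the per-language frame sequences along the time axis before running NMF. The sketch below shows only that pooling step. The concatenation order and the name `stack_features` are assumptions, since the text does not specify how the matrix ^→Y is assembled.

```python
def stack_features(features_by_language):
    """Concatenate the frames of languages m = 1..M into one frame list Y.

    features_by_language: dict mapping m -> list of rounded feature vectors
    """
    Y = []
    for m in sorted(features_by_language):
        Y.extend(features_by_language[m])
    return Y


# Language 1 contributes two frames, language 2 contributes one.
Y = stack_features({1: [[1, 0], [2, 1]], 2: [[0, 3]]})
```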

The parameter estimation unit 223 can further be represented as a configuration including a forward-backward algorithm unit 231, a state transition probability update unit 232, and a basis spectrum update unit 2233. The forward-backward algorithm unit 231 and the state transition probability update unit 232 differ from the first embodiment only in that they use the parameter ^→h in place of the parameter ^→h^(m) and that every variable and parameter is handled for m = 1, ..., M, so their description is omitted.

The basis spectrum update unit 2233 uses the rounded speech features ^→y_t^(m) and the variables γ(^→z_t^(m)) and ξ(^→z_{t-1}^(m), ^→z_t^(m)) (m = 1, ..., M) to update, by equation (17), all elements of the parameter ^→h = {^→h_1, ..., ^→h_K} representing the basis spectra common to all language types. The basis spectrum update unit 2233 outputs the updated value of ^→h.

Using the parameters ^→θ^(m) = {^→π^(m), ^→A^(m), ^→h}, the convergence determination unit 224 calculates the variable α(^→z_T^(m)) by equations (9) and (11) and evaluates the likelihood function given by equation (18).

The convergence determination unit 224 determines whether the calculated likelihood function value has converged, for example according to whether a predetermined condition is satisfied. For example, convergence can be declared when the difference between the likelihood function value calculated one step earlier and the value calculated this time is at most a predetermined threshold ε3. The threshold can be set to, for example, ε3 = 1.0 × 10^-5.

As a way of determining convergence other than using the likelihood function, the determination may be based on whether the difference between the pre-update and post-update value of each parameter is at most a predetermined threshold ε4. In this case, "the predetermined condition is satisfied" means that the difference between the pre-update and post-update value of each parameter is at most ε4. The threshold can be set to, for example, ε4 = 1.0 × 10^-5. Alternatively, convergence may be declared when the number of parameter updates reaches a predetermined number of iterations. In this case, "the predetermined condition is satisfied" means that the parameter updates have reached that number of iterations. The number of iterations can be set to, for example, 1000.
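The three stopping conditions described above can be collected into one predicate, as in the sketch below. The function name, the `mode` switch, and the flat-list parameter layout are assumptions. The "difference" is read here as an element-wise absolute difference, which the text does not spell out.

```python
def has_converged(mode, prev_value=0.0, value=0.0, prev_params=(), params=(),
                  iteration=0, eps3=1.0e-5, eps4=1.0e-5, max_iter=1000):
    """Return True when the stopping condition selected by `mode` holds."""
    if mode == "likelihood":
        # Change in the likelihood function value between two steps.
        return abs(value - prev_value) <= eps3
    if mode == "parameter":
        # Every parameter element must have moved by at most eps4.
        return all(abs(a - b) <= eps4 for a, b in zip(prev_params, params))
    if mode == "iterations":
        # Fixed budget of parameter updates.
        return iteration >= max_iter
    raise ValueError("unknown mode: " + mode)


by_likelihood = has_converged("likelihood", prev_value=-100.0, value=-100.000001)
by_parameter = has_converged("parameter", prev_params=[0.5, 0.5], params=[0.5, 0.6])
by_iterations = has_converged("iterations", iteration=1000)
```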

Of the parameters ^→θ = {^→π^(m), ^→A^(m), ^→h} (m = 1, ..., M) received from the convergence determination unit 224, the parameter output unit 225 stores the parameter ^→h, representing the basis spectra common to all language types, in the basis spectrum storage DB 32, and stores the parameters {^→π^(m), ^→A^(m)} (m = 1, ..., M), representing the state transition probabilities of language (m), in the language (m) transition probability storage DB 33m. Because "m" is the index of the language type, the parameters {^→π^(1), ^→A^(1)}, ..., {^→π^(M), ^→A^(M)} estimated for language (1), ..., language (M) are stored in the language (1) transition probability storage DB 331, ..., and the language (M) transition probability storage DB 33M, respectively. The basis spectrum storage DB 32 and each language (m) transition probability storage DB 33m are stored in the parameter storage unit 30.
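The storage layout of the second embodiment, with one shared store for ^→h and one store per language for {^→π^(m), ^→A^(m)}, can be sketched with plain dictionaries. The function name and the dictionary keys are illustrative assumptions, not part of the described device.

```python
def split_for_storage(pi, A, h):
    """Split jointly estimated parameters into the two kinds of store.

    pi, A: dicts keyed by language index m = 1..M
    h:     basis spectra shared by all languages
    Returns (basis_db, transition_dbs), where transition_dbs[m] holds the
    initial state probabilities and transition matrix of language m.
    """
    basis_db = {"h": h}
    transition_dbs = {m: {"pi": pi[m], "A": A[m]} for m in pi}
    return basis_db, transition_dbs


basis_db, transition_dbs = split_for_storage(
    pi={1: [0.5, 0.5], 2: [0.9, 0.1]},
    A={1: [[0.5, 0.5], [0.5, 0.5]], 2: [[0.8, 0.2], [0.3, 0.7]]},
    h=[[1, 4], [3, 2]],
)
```

Keeping ^→h in a single store means that evaluating language (m) pairs `basis_db` with `transition_dbs[m]`, and adding a language adds only a small transition store.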

Next, each part of the evaluation unit 240 is described in detail. The evaluation unit 240 can be represented as a configuration including a speech feature extraction unit 41, a likelihood calculation unit 242, and a language evaluation result output unit 43. The speech feature extraction unit 41 is an example of the extraction means of the present invention. The likelihood calculation unit 242 is an example of the likelihood calculation means of the present invention. The language evaluation result output unit 43 is an example of the evaluation means of the present invention.

The likelihood calculation unit 242 uses the rounded speech features ^→y_t output by the speech feature extraction unit 41, the parameter ^→h stored in the basis spectrum storage DB 32 held in the parameter storage unit 30, and the parameters {^→π^(m), ^→A^(m)} stored in the language (m) transition probability storage DB 33m to calculate, for every language whose parameters are stored in the parameter storage unit 30, a likelihood indicating how plausible it is that the language of the evaluation speech signal is language (m). The likelihood calculation differs from the first embodiment only in that the parameter ^→h is used in place of the parameter ^→h^(m).

Next, the operation of the spoken language evaluation device 210 according to the second embodiment is described. First, the learning unit 220 executes the learning process shown in FIG. 8, generating the basis spectrum storage DB 32 and each language (m) transition probability storage DB 33m and storing them in the parameter storage unit 30. Once these DBs have been generated, inputting an evaluation speech signal to the spoken language evaluation device 210 causes the evaluation unit 240 to execute the evaluation process shown in FIG. 9. Each process is described in detail below, with reference also to the sequence diagram shown in FIG. 10. Processing identical or corresponding to the learning process (FIG. 4) and the evaluation process (FIG. 5) of the first embodiment is given identical or corresponding reference numerals, and its detailed description is omitted.

First, in step 200 of the learning process shown in FIG. 8, the speech feature extraction unit 221 extracts speech features ^→y'_t^(m) (t = 1, ..., T; m = 1, ..., M) from the training speech signals of languages (m = 1, ..., M) and generates rounded speech features ^→y_t^(m) by rounding every element to an integer value. As shown in FIG. 10, the speech feature extraction unit 221 passes the rounded speech features ^→y_t^(m) to the parameter initial value generation unit 222 and the basis spectrum update unit 2233.

Next, in step 202, the parameter initial value generation unit 222 treats the rounded speech features ^→y_t^(m) (t = 1, ..., T; m = 1, ..., M) as a single matrix ^→Y and generates the initial value of the parameter ^→h as the basis spectra estimated by applying ordinary NMF. The initial values of the parameters ^→π^(m) and ^→A^(m) are generated for m = 1, ..., M in the same way as by the parameter initial value generation unit 22 of the first embodiment. As shown in FIG. 10, the parameter initial value generation unit 222 combines these initial values into the initial value of the parameters ^→θ = {^→π^(m), ^→A^(m), ^→h} (m = 1, ..., M) and passes it to the forward-backward algorithm unit 231.

Next, in step 204, the forward-backward algorithm unit 231 obtains the variables γ(^→z_t^(m)) and ξ(^→z_{t-1}^(m), ^→z_t^(m)) for m = 1, ..., M. As shown in FIG. 10, it passes the obtained γ(^→z_t^(m)) and ξ(^→z_{t-1}^(m), ^→z_t^(m)) to the state transition probability update unit 232 and the basis spectrum update unit 2233.

Next, in step 206, the state transition probability update unit 232 updates the parameters ^→π^(m) and ^→A^(m) for m = 1, ..., M. As shown in FIG. 10, it passes the updated values of ^→π^(m) and ^→A^(m) (m = 1, ..., M) to the convergence determination unit 224.

Next, in step 208, the basis spectrum update unit 2233 uses the rounded speech features ^→y_t^(m) and the variables γ(^→z_t^(m)) and ξ(^→z_{t-1}^(m), ^→z_t^(m)) to update, by equation (17), all elements of the parameter ^→h representing the basis spectra common to all language types. As shown in FIG. 10, the basis spectrum update unit 2233 passes the updated value of ^→h to the convergence determination unit 224.

Note that steps 206 and 208 may be performed in the reverse order.

Next, in step 211, the convergence determination unit 224 uses the parameters ^→θ^(m) = {^→π^(m), ^→A^(m), ^→h} to calculate the variable α(^→z_T^(m)) by equations (9) and (11), evaluates the likelihood function of equation (18), and determines whether the likelihood function value has converged, for example according to whether a predetermined condition is satisfied. If the value has not converged, that is, if the predetermined condition is not satisfied, the process returns to step 204. In that case, as shown in FIG. 10, the convergence determination unit 224 passes the updated value of ^→θ to the forward-backward algorithm unit 231. If the value has converged, that is, if the predetermined condition is satisfied, the process proceeds to step 212. In that case, as shown in FIG. 10, the convergence determination unit 224 passes the updated value of ^→θ to the parameter output unit 225.

In step 212, of the parameters ^→θ = {^→π^(m), ^→A^(m), ^→h} (m = 1, ..., M) received from the convergence determination unit 224, the parameter output unit 225 stores the parameter ^→h in the basis spectrum storage DB 32 and the parameters {^→π^(m), ^→A^(m)} (m = 1, ..., M) in the language (m) transition probability storage DB 33m and, as shown in FIG. 10, passes them to the parameter storage unit 30. The learning process then ends.

Next, in step 120 of the evaluation process shown in FIG. 9, the speech feature extraction unit 41 extracts the speech features ^→y'_t (t = 1, ..., T) from the input evaluation speech signal. The speech feature extraction unit 41 also generates rounded speech features ^→y_t by rounding every element of the speech features ^→y'_t to an integer value. As shown in FIG. 10, the speech feature extraction unit 41 passes the generated rounded speech features ^→y_t to the likelihood calculation unit 242.
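The rounding in step 120 amounts to an element-wise conversion of each feature vector to integers. The sketch below assumes non-negative feature values and round-half-up; the text does not specify the rounding rule, and the function name `round_features` is hypothetical.

```python
import math


def round_features(frames):
    """Round every element of every feature vector y'_t to an integer.

    Round-half-up is an assumption; the source only says the elements are
    rounded to integer values.
    """
    return [[math.floor(v + 0.5) for v in frame] for frame in frames]
```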

Next, in step 122, the likelihood calculation unit 242 sets the variable m, which corresponds to the index indicating the language type, to 1. Next, in step 226, the likelihood calculation unit 242 calculates the likelihood L_m, which indicates how likely it is that the language type indicated by the evaluation speech signal is language (m), using the rounded speech features ^→y_t, the parameter ^→h stored in the base spectrum storage DB 32 of the parameter storage unit 30, and the parameters {^→π^(m), ^→A^(m)} stored in the language (m) transition probability storage DB 33m.

Thereafter, in steps 126 to 130, processing is performed in the same manner as the evaluation process in the first embodiment, a language evaluation result is output, and the evaluation process ends.
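The per-language likelihood calculation can be sketched with the standard forward recursion for a hidden Markov model, computed in the log domain. This is an illustration under stated assumptions, not the embodiment's equations: the emission term `log_B` is taken as a precomputed input because the observation model is defined elsewhere in the description, and choosing the language with the largest likelihood is an assumed decision rule. The names `logsumexp`, `forward_log_likelihood`, and `evaluate_language` are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(log_pi, log_A, log_B):
    """Log-likelihood of one utterance under one language's model.

    log_pi[k]   : log initial probability of state k
    log_A[j][k] : log transition probability from state j to state k
    log_B[t][k] : log emission probability of frame t in state k
                  (assumed precomputed from the shared base spectra)
    """
    K = len(log_pi)
    alpha = [log_pi[k] + log_B[0][k] for k in range(K)]
    for t in range(1, len(log_B)):
        alpha = [
            logsumexp([alpha[j] + log_A[j][k] for j in range(K)]) + log_B[t][k]
            for k in range(K)
        ]
    return logsumexp(alpha)


def evaluate_language(models, log_B):
    """Return (best language, per-language log-likelihoods).

    models maps a language index m to (log_pi, log_A). Picking the maximum
    is an assumed decision rule.
    """
    scores = {
        m: forward_log_likelihood(lp, lA, log_B) for m, (lp, lA) in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Because the base spectra are common to all languages in the second embodiment, the sketch reuses one `log_B` for every language and varies only the initial and transition probabilities.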

As described above, the spoken language evaluation device according to the second embodiment provides, in addition to the effects of the first embodiment, the ability to estimate the parameters for evaluating a plurality of language types in a single learning process.

An example of evaluation results illustrating the effect of the present invention is described next.

First, the evaluation of speech signals in five languages (English, American English, German, Swedish, and French) by the spoken language evaluation device 10 according to the first embodiment is described. The total number of base spectra K was set to K = 12. For each language, 12 utterances by four speakers (two male, two female) were used as learning speech signals to learn the parameters ^→θ^(m) = {^→π^(m), ^→A^(m), ^→h^(m)} of each language (m), where m is the index indicating the language type. Twelve utterances by four other speakers (two male, two female) not used for learning were used as evaluation speech signals, and the value of the likelihood function for each language was calculated. FIG. 11 shows the evaluation results. Each graph in FIG. 11 corresponds to the language type of an evaluation speech signal; the values 1 to 5 on the horizontal axis are the indices indicating the language type, and the vertical axis is the value of the likelihood function (log-likelihood). As FIG. 11 shows, when there are five languages, the language types can be classified relatively well.

Next, the evaluation of speech signals in 13 languages (English, American English, German, Dutch, Swedish, French, Spanish, Italian, Portuguese, Russian, Polish, Greek, and Hindi) by the spoken language evaluation device 10 according to the first embodiment is described. All conditions other than the number of language types, such as the total number of base spectra and the parameter learning and evaluation methods, are the same as in the five-language case above. FIG. 12 shows the evaluation results. As FIG. 12 shows, classification accuracy is lower than in the five-language case, but some languages can still be classified appropriately.

Classification of multilingual speech based on analysis of time-series speech signals, as performed by the spoken language evaluation device according to the present invention, directly serves as a foundation for language identification technology and is expected to be applied as preprocessing for multilingual speech recognition. Further extension to multimodal processing, including multilingual speech translation and subtitle translation, could lead beyond speech engineering fields such as speech recognition and speech synthesis toward fusion with various fields such as language engineering and cognitive science. From a linguistic point of view, the method makes it possible to extract elements close to phonemes for the many languages that have no written form, which is expected to help describe those languages and clarify their genealogical relationships. This is expected to contribute to the development of spoken-language linguistics as a science and to the creation of a new, data-driven language science.

In the embodiments described above, the learning unit and the evaluation unit are implemented on a single computer, but each may instead be implemented on a separate computer. The computer implementing the learning unit is an example of the parameter estimation device of the present invention, and the computer implementing the evaluation unit is an example of the spoken language evaluation device of the present invention.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the spoken language evaluation device described above contains a computer system; where a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).

Although this specification describes embodiments in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

10, 210 Spoken language evaluation device
20, 220 Learning unit
21, 221 Speech feature extraction unit of the learning unit
22, 222 Parameter initial value generation unit
23, 223 Parameter estimation unit
24, 224 Convergence determination unit
25 Parameter output unit
30 Parameter storage unit
40, 240 Evaluation unit
41 Speech feature extraction unit of the evaluation unit
42, 242 Likelihood calculation unit
43 Language evaluation result output unit
225 Parameter output unit
231 Forward-backward algorithm unit
232 State transition probability update unit
233, 2233 Base spectrum update unit

Claims

A spoken language evaluation device comprising:
extraction means for extracting evaluation feature information from an evaluation speech signal whose language type is unknown;
likelihood calculation means for calculating a likelihood indicating how likely it is that the language type indicated by the evaluation speech signal is each of a plurality of types of languages whose language types are known, based on the evaluation feature information extracted by the extraction means and on a model for each of the plurality of types of languages, the model including a first parameter indicating base spectra of a plurality of states, which correspond to respective ones of a plurality of phonemes and are extracted by non-negative matrix factorization of a learning speech signal, and a second parameter indicating state transition probabilities of the base spectra; and
evaluation means for evaluating the language type indicated by the evaluation speech signal based on the likelihood calculated by the likelihood calculation means.

A spoken language evaluation device comprising:
extraction means for extracting evaluation feature information from an evaluation speech signal whose language type is unknown;
likelihood calculation means for calculating a likelihood indicating how likely it is that the language type indicated by the evaluation speech signal is each of a plurality of types of languages whose language types are known, based on the evaluation feature information extracted by the extraction means and on a model including a first parameter indicating base spectra of a plurality of states common to the plurality of types of languages, which is learned from a learning speech signal by an update rule based on a weighted average corresponding to the scale of the mel spectrum extracted as learning feature information, and a second parameter, for each of the plurality of types of languages, indicating state transition probabilities of the base spectra, which transition depending on the base spectrum at the immediately preceding time; and
evaluation means for evaluating the language type indicated by the evaluation speech signal based on the likelihood calculated by the likelihood calculation means.

A parameter estimation apparatus comprising:
extraction means for extracting learning feature information from a learning speech signal for each of a plurality of types of languages whose language types are known;
initial value generation means for generating, for a model including a first parameter indicating base spectra of a plurality of states, which correspond to respective ones of a plurality of phonemes and are extracted by non-negative matrix factorization of the learning speech signal, and a second parameter indicating state transition probabilities of the base spectra, initial values of the first parameter and the second parameter for each of the plurality of types of languages;
estimation means for estimating the first parameter and the second parameter for each of the plurality of types of languages by optimization using the initial values of the first parameter and the second parameter, or the current values of the first parameter and the second parameter, and the learning feature information extracted by the extraction means; and
control means for performing control such that the estimated first parameter and second parameter are output when the estimation result of the estimation means satisfies a predetermined condition, and such that the estimation means estimates the first parameter and the second parameter when the estimation result does not satisfy the predetermined condition.

A parameter estimation apparatus comprising:
extraction means for extracting learning feature information from a learning speech signal for each of a plurality of types of languages whose language types are known;
initial value generation means for generating, for a model including a first parameter indicating base spectra of a plurality of states, which is learned by an update rule based on a weighted average corresponding to the scale of the mel spectrum extracted as the learning feature information, and a second parameter indicating state transition probabilities of the base spectra, which transition depending on the base spectrum at the immediately preceding time, an initial value of the first parameter common to the plurality of types of languages and an initial value of the second parameter for each of the plurality of types of languages;
estimation means for estimating the first parameter common to the plurality of types of languages and the second parameter for each of the plurality of types of languages by optimization using the initial values of the first parameter and the second parameter, or the current values of the first parameter and the second parameter, and the learning feature information extracted by the extraction means; and
control means for performing control such that the estimated first parameter and second parameter are output when the estimation result of the estimation means satisfies a predetermined condition, and such that the estimation means estimates the first parameter and the second parameter when the estimation result does not satisfy the predetermined condition.

The parameter estimation apparatus according to claim 3 or claim 4, wherein the estimation means uses a forward-backward algorithm to calculate, in the model, a variable γ indicating a posterior distribution of the latent variable corresponding to the base spectrum selected at each time and a variable ξ indicating a joint posterior distribution of two consecutive latent variables, and updates the first parameter and the second parameter using the variable γ and the variable ξ so that the expected values of the first parameter and the second parameter are maximized.

A spoken language evaluation method in a spoken language evaluation device including extraction means, likelihood calculation means, and evaluation means, wherein:
the extraction means extracts evaluation feature information from an evaluation speech signal whose language type is unknown;
the likelihood calculation means calculates a likelihood indicating how likely it is that the language type indicated by the evaluation speech signal is each of a plurality of types of languages whose language types are known, based on the evaluation feature information extracted by the extraction means and on a model for each of the plurality of types of languages, the model including a first parameter indicating base spectra of a plurality of states, which correspond to respective ones of a plurality of phonemes and are extracted by non-negative matrix factorization of a learning speech signal, and a second parameter indicating state transition probabilities of the base spectra; and
the evaluation means evaluates the language type indicated by the evaluation speech signal based on the likelihood calculated by the likelihood calculation means.

A spoken language evaluation method in a spoken language evaluation device including extraction means, likelihood calculation means, and evaluation means, wherein:
the extraction means extracts evaluation feature information from an evaluation speech signal whose language type is unknown;
the likelihood calculation means calculates a likelihood indicating how likely it is that the language type indicated by the evaluation speech signal is each of a plurality of types of languages whose language types are known, based on the evaluation feature information extracted by the extraction means and on a model including a first parameter indicating base spectra of a plurality of states common to the plurality of types of languages, which is learned from a learning speech signal by an update rule based on a weighted average corresponding to the scale of the mel spectrum extracted as learning feature information, and a second parameter, for each of the plurality of types of languages, indicating state transition probabilities of the base spectra, which transition depending on the base spectrum at the immediately preceding time; and
the evaluation means evaluates the language type indicated by the evaluation speech signal based on the likelihood calculated by the likelihood calculation means.

A parameter estimation method in a parameter estimation apparatus including extraction means, initial value generation means, estimation means, and control means, wherein:
the extraction means extracts learning feature information from a learning speech signal for each of a plurality of types of languages whose language types are known;
the initial value generation means generates, for a model including a first parameter indicating base spectra of a plurality of states, which correspond to respective ones of a plurality of phonemes and are extracted by non-negative matrix factorization of the learning speech signal, and a second parameter indicating state transition probabilities of the base spectra, initial values of the first parameter and the second parameter for each of the plurality of types of languages;
the estimation means estimates the first parameter and the second parameter for each of the plurality of types of languages by optimization using the initial values of the first parameter and the second parameter, or the current values of the first parameter and the second parameter, and the learning feature information extracted by the extraction means; and
the control means performs control such that the estimated first parameter and second parameter are output when the estimation result of the estimation means satisfies a predetermined condition, and such that the estimation means estimates the first parameter and the second parameter when the estimation result does not satisfy the predetermined condition.

A parameter estimation method in a parameter estimation apparatus including extraction means, initial value generation means, estimation means, and control means, wherein:
the extraction means extracts learning feature information from a learning speech signal for each of a plurality of types of languages whose language types are known;
the initial value generation means generates, for a model including a first parameter indicating base spectra of a plurality of states, which is learned by an update rule based on a weighted average corresponding to the scale of the mel spectrum extracted as the learning feature information, and a second parameter indicating state transition probabilities of the base spectra, which transition depending on the base spectrum at the immediately preceding time, an initial value of the first parameter common to the plurality of types of languages and an initial value of the second parameter for each of the plurality of types of languages;
the estimation means estimates the first parameter common to the plurality of types of languages and the second parameter for each of the plurality of types of languages by optimization using the initial values of the first parameter and the second parameter, or the current values of the first parameter and the second parameter, and the learning feature information extracted by the extraction means; and
the control means performs control such that the estimated first parameter and second parameter are output when the estimation result of the estimation means satisfies a predetermined condition, and such that the estimation means estimates the first parameter and the second parameter when the estimation result does not satisfy the predetermined condition.

The parameter estimation method according to claim 8 or claim 9, wherein the estimation means uses a forward-backward algorithm to calculate, in the model, a variable γ indicating a posterior distribution of the latent variable corresponding to the base spectrum selected at each time and a variable ξ indicating a joint posterior distribution of two consecutive latent variables, and updates the first parameter and the second parameter using the variable γ and the variable ξ so that the expected values of the first parameter and the second parameter are maximized.

A spoken language evaluation program for causing a computer to function as each means constituting the spoken language evaluation device according to claim 1.

A parameter estimation program for causing a computer to function as each means constituting the parameter estimation apparatus according to any one of claims 3 to 5.