JP5544575B2

JP5544575B2 - Spoken language evaluation apparatus, method, and program

Info

Publication number: JP5544575B2
Application number: JP2011198383A
Authority: JP
Inventors: ジョナトンルルー; 弘和亀岡; 隆仁川西; 邦夫柏野; 秀一板橋; 祐一石本
Original assignee: Nippon Telegraph and Telephone Corp; Inter University Research Institute Corp Research Organization of Information and Systems
Current assignee: Nippon Telegraph and Telephone Corp; Inter University Research Institute Corp Research Organization of Information and Systems
Priority date: 2011-09-12
Filing date: 2011-09-12
Publication date: 2014-07-09
Anticipated expiration: 2031-09-12
Also published as: JP2013061402A

Description

The present invention relates to a spoken language evaluation apparatus, method, and program, and more particularly to a spoken language evaluation apparatus, method, and program for evaluating the type of language represented by an input speech signal.

Conventionally, the type of language represented by a speech signal has been identified from the speech signal itself, and many techniques have been proposed for this purpose (see, for example, Patent Document 1, Patent Document 2, Non-Patent Document 1, and Non-Patent Document 2). These techniques fall mainly into two classes: those that use text-level grammar in addition to acoustic information, and those that use only acoustic information and exploit phoneme-level features.

As an example of a technique that uses text-level grammar, the technique described in Patent Document 1 recognizes and analyzes language by natural language analysis processing that uses a lexical grammar model, semantic rules, and the like. As techniques that exploit phoneme-level features, many methods have been proposed that classify languages by considering similarity to the sounds, such as vowels, contained in each language. For example, the technique described in Patent Document 2 assumes several phonetic alphabets as prior knowledge and recognizes the language using phoneme information rather than text.

Patent Document 1: JP-A-8-106374
Patent Document 2: JP 2001-109490 A

Non-Patent Document 1: Zissman, M. A., "Comparison of four approaches to automatic language identification of telephone speech," IEEE Trans. on Speech and Audio Processing, Vol. 4, No. 1, pp. 31-44, Jan. 1996.
Non-Patent Document 2: Yeshwant K. et al., "Reviewing Automatic Language Identification," IEEE Signal Processing Magazine, pp. 33-41, Oct. 1994.

However, techniques that use text-level grammar are difficult to apply to languages that have no writing system and whose grammar has not been analyzed. For example, the technique described in Patent Document 1 requires natural language analysis processing and cannot be applied to a language that has no written form.

Techniques that exploit phoneme-level features, such as the technique described in Patent Document 2, require prior knowledge and are therefore difficult to apply to the many unwritten languages that have not yet been analyzed.

The present invention has been made to solve the above problems, and an object thereof is to provide a spoken language evaluation apparatus, method, and program capable of evaluating the type of language represented by an input speech signal without conversion to a text-level linguistic representation and without requiring prior knowledge.

To achieve the above object, the spoken language evaluation apparatus of the present invention includes: extraction means for extracting evaluation feature information from an evaluation speech signal whose language type is unknown; blending ratio calculation means for calculating, for each language type, a blending ratio indicating the proportion of the basis vector of each phoneme blended in the evaluation feature information, based on (i) a phoneme representation for each language type, expressed by basis vectors for the respective phonemes and obtained by non-negative matrix factorization of learning feature information for each language type extracted from a plurality of learning speech signals whose language types are known, and (ii) the evaluation feature information extracted by the extraction means; and evaluation means for evaluating the type of language represented by the evaluation speech signal corresponding to the evaluation feature information, based on the similarity between the evaluation feature information and each piece of information given by the product of the blending ratio for each language type calculated by the blending ratio calculation means and the phoneme representation for that language type.

According to the spoken language evaluation apparatus of the present invention, the extraction means extracts evaluation feature information from an evaluation speech signal whose language type is unknown. A phoneme representation for each language type, expressed by basis vectors for the respective phonemes, is obtained in advance by non-negative matrix factorization of learning feature information for each language type extracted from a plurality of learning speech signals whose language types are known. The blending ratio calculation means then calculates, for each language type, a blending ratio indicating the proportion of the basis vector of each phoneme blended in the evaluation feature information, based on the phoneme representation for each language type obtained in advance and the evaluation feature information extracted by the extraction means. The evaluation means then evaluates the type of language represented by the evaluation speech signal corresponding to the evaluation feature information, based on the similarity between the evaluation feature information and each piece of information given by the product of the blending ratio for each language type calculated by the blending ratio calculation means and the phoneme representation for that language type.

In this way, the type of language represented by the evaluation speech signal is evaluated from the similarity between the evaluation feature information and the information given by the product of the phoneme representation for each language type, obtained by non-negative matrix factorization of the learning speech signals, and the blending ratio calculated from that phoneme representation and the evaluation feature information. The type of language represented by the input speech signal can therefore be evaluated without conversion to a text-level linguistic representation and without requiring prior knowledge.

The phoneme representation may be a phoneme representation having a time-series structure. This makes it possible to evaluate the type of language represented by the input speech signal while also taking into account subtle phoneme changes within the continuous variation of the sound.

The evaluation means may identify the language type corresponding to the phoneme representation that yields the highest similarity as the type of language represented by the evaluation speech signal, or may create, based on the similarity for each language type, a language phylogenetic tree showing the genealogical relationships among the language types.

The blending ratio calculation means may calculate the blending ratio for each combination of language type and at least one of gender and age group, based on phoneme representations obtained for each such combination from learning feature information extracted from learning speech signals for which at least one of the speaker's gender and age is known.

The extraction means may also extract the learning feature information for each language type from the plurality of learning speech signals, and the apparatus may further include phoneme representation calculation means for calculating the phoneme representation for each language type by non-negative matrix factorization of the learning feature information for each language type extracted by the extraction means.

The spoken language evaluation method of the present invention is a spoken language evaluation method for a spoken language evaluation apparatus that includes extraction means, blending ratio calculation means, and evaluation means. In this method, the extraction means extracts evaluation feature information from an evaluation speech signal whose language type is unknown; the blending ratio calculation means calculates, for each language type, a blending ratio indicating the proportion of the basis vector of each phoneme blended in the evaluation feature information, based on a phoneme representation for each language type, expressed by basis vectors for the respective phonemes and obtained by non-negative matrix factorization of learning feature information for each language type extracted from a plurality of learning speech signals whose language types are known, and on the evaluation feature information extracted by the extraction means; and the evaluation means evaluates the type of language represented by the speech signal corresponding to the evaluation feature information, based on the similarity between the evaluation feature information and each piece of information given by the product of the blending ratio for each language type calculated by the blending ratio calculation means and the phoneme representation for that language type.

In the spoken language evaluation method for a spoken language evaluation apparatus that further includes phoneme representation calculation means, the extraction means extracts the learning feature information for each language type from the plurality of learning speech signals, and the phoneme representation calculation means calculates the phoneme representation for each language type by non-negative matrix factorization of the learning feature information for each language type extracted by the extraction means.

The spoken language evaluation program of the present invention is a program for causing a computer to function as each of the means constituting the spoken language evaluation apparatus described above.

As described above, the spoken language evaluation apparatus, method, and program of the present invention evaluate the type of language represented by the evaluation speech signal from the similarity between the evaluation feature information and the information given by the product of the phoneme representation for each language type, obtained by non-negative matrix factorization of the learning speech signals, and the blending ratio calculated from that phoneme representation and the evaluation feature information. This provides the effect that the type of language represented by the input speech signal can be evaluated without conversion to a text-level linguistic representation and without requiring prior knowledge.

A schematic diagram showing the configuration of the spoken language evaluation apparatus according to the first embodiment.
A conceptual diagram of non-negative matrix factorization.
A graph showing an example of a phoneme representation of Chinese.
A graph showing an example of a phoneme representation of Spanish.
A flowchart showing the learning processing routine in the spoken language evaluation apparatus according to the first embodiment.
A flowchart showing the evaluation processing routine in the spoken language evaluation apparatus according to the first embodiment.
A conceptual diagram of non-negative matrix factorization for a time-series phoneme representation.
A graph showing an example of a time-series phoneme representation of English.
A graph showing an example of a time-series phoneme representation of German.
A graph showing an example of a time-series phoneme representation of Swedish.
A graph showing an example of a time-series phoneme representation of French.
A graph showing an example of similarity values for a given speech signal.
A diagram showing an example of a language phylogenetic tree output.
A graph showing an example of language classification using time-series phoneme representations.

Embodiments of the present invention are described in detail below with reference to the drawings.

The spoken language evaluation apparatus 1 according to the first embodiment is implemented as a computer that includes a CPU, a RAM, and a ROM storing a program for executing a spoken language evaluation processing routine that comprises the learning process and the evaluation process described later.

Functionally, as shown in FIG. 1, this computer can be represented as a configuration that includes: a speech signal input unit 11 that receives a speech signal; a feature information extraction unit 12 that extracts feature information from the speech signal; a phoneme representation calculation unit 13 that, for pre-learning, calculates a phoneme representation from the feature information obtained for each language type and gender; a phoneme representation storage unit 14 that stores the phoneme representation for each language type and gender; a phoneme blending ratio calculation unit 15 that, for language evaluation, calculates a blending ratio for the feature information obtained from the feature information extraction unit 12 using each of the phoneme representations for each language type and gender stored in the phoneme representation storage unit 14; a language similarity evaluation unit 16 that analyzes the phoneme blending ratio for each language type and evaluates the similarity to each language; and a display control unit 17 that performs control so that the evaluation result is displayed on a display device.

The speech signal input unit 11, the feature information extraction unit 12, the phoneme representation calculation unit 13, and the phoneme representation storage unit 14 function as a learning unit 2, while the speech signal input unit 11, the feature information extraction unit 12, the phoneme blending ratio calculation unit 15, the language similarity evaluation unit 16, and the display control unit 17 function as an evaluation unit 3. That is, the speech signal input unit 11 and the feature information extraction unit 12 are shared by the learning unit 2 and the evaluation unit 3.

The speech signal input unit 11 receives a digitized speech signal, for example from an electronically recorded file or from an input device such as a microphone. In the learning stage, speech signals for which the language type and the speaker's gender (male or female) are known (learning speech signals) are input. In the evaluation stage, a speech signal whose language type is unknown and whose speaker's gender is either known or unknown (evaluation speech signal) is input.

The feature information extraction unit 12 extracts feature information from the digitized speech signal obtained from the speech signal input unit 11. In the present embodiment, a case in which a mel spectrum is extracted as the feature information is described. Different features may instead be extracted depending on what is used for the phoneme representation and its identification method (for example, a spectrum with principal component analysis (PCA), or a mel cepstrum with vector quantization). In the learning stage, a mel spectrum is extracted from the learning speech signals for each language type and gender and is used as the learning feature information. In the evaluation stage, a mel spectrum is extracted from the evaluation speech signal and is used as the evaluation feature information.
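The following is a minimal sketch of how a mel spectrum might be extracted, not the implementation used in the embodiment. The frame length, hop size, number of mel bands, the Hann window, the HTK-style mel formula, and all function names are assumptions made for illustration; the source specifies none of them.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumed choice; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_spectrum(signal, sr, n_fft=512, hop=256, n_mels=40):
    """Return a non-negative matrix Y of shape (n_mels, n_frames).

    Assumes the signal has at least n_fft samples.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft//2 + 1)
    return mel_filterbank(n_mels, n_fft, sr) @ power.T
```

Because every entry of the result is a weighted sum of power values, the matrix is non-negative, which is the property the factorization described next relies on.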

The phoneme representation calculation unit 13 analyzes the mel spectrum for each language type and gender extracted as learning feature information by the feature information extraction unit 12, and extracts phoneme structures that appear repeatedly in the speech signal. For this purpose, non-negative matrix factorization (NMF), which is well suited to handling non-negative information such as sound, can be used (see, for example, D. D. Lee, H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, Vol. 401, pp. 788-791, 1999). NMF has been applied to automatic music transcription and to the separation of sound sources from monaural mixtures (see, for example, P. Smaragdis, J. C. Brown, "Non-Negative Matrix Factorization for Music Transcription," in Proc. 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA2003), pp. 177-180, 2003, and T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1066-1074, 2007).

FIG. 2 illustrates how NMF decomposes a speech signal into phonemes. In the figure, Y is the mel spectrum extracted by the feature information extraction unit 12, H is the phoneme representation (the basis vectors of the individual phonemes arranged side by side; the basis vector of a phoneme is hereinafter also referred to simply as a "phoneme"), and U is the blending ratio, which indicates the proportion in which each phoneme is blended into Y. An appropriate phoneme representation H and blending ratio U can be obtained through the iterative computation of NMF by minimizing the difference between the mel spectrum Y and the product of the phoneme representation H and the blending ratio U. Since only the phoneme representation H is used in the evaluation stage, the obtained phoneme representation H is output.
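The minimization described above can be written compactly as follows. This is a sketch in notation that is partly assumed: the dimensions F (mel bands), T (frames), and K (phonemes), the generic distance measure D, and the convention that the phoneme basis vectors are the columns of H (so that the product is HU) are not stated in the source.

```latex
\[
(\hat{H}, \hat{U}) \;=\; \operatorname*{arg\,min}_{H \ge 0,\; U \ge 0} \; D\!\left(Y \,\middle\|\, HU\right),
\qquad
Y \in \mathbb{R}_{\ge 0}^{F \times T},\quad
H \in \mathbb{R}_{\ge 0}^{F \times K},\quad
U \in \mathbb{R}_{\ge 0}^{K \times T}
\]
```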

Here, the Kullback-Leibler (KL) divergence is used as the distance measure that NMF minimizes between the mel spectrum Y and the product of the phoneme representation H and the blending ratio U. The Itakura-Saito distance or the Euclidean distance may be used instead of the KL divergence. FIG. 3 shows a phoneme representation of Chinese created with the mel spectrum as input and the KL divergence as the distance measure. In the figure, ten phonemes are arranged horizontally, and for each phoneme the vertical axis represents frequency and the horizontal axis represents intensity. FIG. 4 shows a phoneme representation of Spanish drawn in the same way. Spanish and Chinese share some similar phonemes and differ in others. This is presumed to correspond to differences in the vowel inventories of the two languages.
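A minimal sketch of NMF under the KL divergence is given below, using the standard multiplicative updates of Lee and Seung. The function names, the random initialization, the iteration count, and the convention that the columns of H are the phoneme basis vectors are assumptions for illustration; the default of ten phonemes mirrors the ten phonemes shown in FIG. 3.

```python
import numpy as np

def kl_divergence(Y, Z, eps=1e-12):
    """Generalized KL divergence D(Y || Z) for non-negative matrices."""
    return float(np.sum(Y * np.log((Y + eps) / (Z + eps)) - Y + Z))

def nmf_kl(Y, n_phonemes=10, n_iter=200, eps=1e-12, seed=0):
    """Factorize Y (F x T) as H (F x K) times U (K x T), all non-negative."""
    rng = np.random.default_rng(seed)
    F, T = Y.shape
    H = rng.random((F, n_phonemes)) + eps
    U = rng.random((n_phonemes, T)) + eps
    for _ in range(n_iter):
        # Update the blending ratio U with H held fixed.
        R = Y / (H @ U + eps)
        U *= (H.T @ R) / (H.sum(axis=0)[:, None] + eps)
        # Update the phoneme representation H with U held fixed.
        R = Y / (H @ U + eps)
        H *= (R @ U.T) / (U.sum(axis=1)[None, :] + eps)
    return H, U
```

In the learning stage only H would be kept for each language type and gender; U is a by-product of the factorization.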

The phoneme representation storage unit 14 stores the phoneme representation H output from the phoneme representation calculation unit 13, separately for each gender and for each language type. Although the phoneme representations are stored here by gender and language type, depending on the feature information and the definition of the distance, male and female speakers may be pooled, or the data may be divided more finely, for example by age group.

The phoneme blending ratio calculation unit 15 receives the mel spectrum extracted as evaluation feature information by the feature information extraction unit 12. In NMF, as in the upper equation of FIG. 2, the mel spectrum Y of a speech signal can be approximated by the product of the phoneme representation H and the blending ratio U. Using this approximation, the phoneme blending ratio calculation unit 15 calculates the blending ratio U for each language type from the input mel spectrum Y and the phoneme representation H for each language type stored by gender in the phoneme representation storage unit 14.

When the gender of the speaker of the evaluation speech signal is known, the blending ratio U is calculated using, among the phoneme representations H stored by gender and language type in the phoneme representation storage unit 14, those that correspond to the speaker's gender. When the speaker's gender is unknown, all of the stored phoneme representations H are used, and the blending ratio U is calculated for each language type and gender.
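As a sketch of this step, the blending ratio can be estimated by running only the U update of KL-based NMF while keeping a stored H fixed. The function names, the dictionary keyed by (language, gender) pairs, and the use of `None` to mean "gender unknown" are hypothetical choices for illustration.

```python
import numpy as np

def blending_ratio(Y, H, n_iter=200, eps=1e-12, seed=0):
    """Estimate U >= 0 such that Y is approximated by H @ U, with H fixed."""
    rng = np.random.default_rng(seed)
    U = rng.random((H.shape[1], Y.shape[1])) + eps
    col_sums = H.sum(axis=0)[:, None] + eps
    for _ in range(n_iter):
        # Multiplicative update for U under the KL divergence.
        U *= (H.T @ (Y / (H @ U + eps))) / col_sums
    return U

def blending_ratios(Y, stored, gender=None):
    """stored maps (language, gender) -> H. gender=None means unknown,
    in which case every stored phoneme representation is used."""
    return {key: blending_ratio(Y, H)
            for key, H in stored.items()
            if gender is None or key[1] == gender}
```

Calling `blending_ratios(Y, stored, gender="male")` would then return one U per language for the male models only, while `blending_ratios(Y, stored)` covers every language and gender.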

The language similarity evaluation unit 16 calculates, for each language type, the product of the blending ratio U calculated by the phoneme blending ratio calculation unit 15 and the phoneme representation H stored in the phoneme representation storage unit 14, and calculates the similarity between that product and the mel spectrum Y of the evaluation speech signal output from the feature information extraction unit 12. The similarity can be defined by the difference (distance) between Y and the product of U and H, so that a smaller distance means a higher similarity. In this way, the similarity between the language represented by the input speech signal and the phoneme representation of each language type is expressed as a distance. The language type corresponding to the phoneme representation H that yields the smallest distance is evaluated as the language type most similar to the language represented by the evaluation speech signal.
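The selection step might look like the following sketch. It assumes the KL divergence as the distance (the source allows other distances), and the function names and the dictionaries `models` and `ratios`, keyed by language type, are hypothetical.

```python
import numpy as np

def kl_distance(Y, Z, eps=1e-12):
    """Generalized KL divergence between Y and its reconstruction Z."""
    return float(np.sum(Y * np.log((Y + eps) / (Z + eps)) - Y + Z))

def identify_language(Y, models, ratios):
    """models[lang] is H for that language; ratios[lang] is the U estimated
    for Y with that H. Returns the closest language and all distances."""
    distances = {lang: kl_distance(Y, models[lang] @ ratios[lang])
                 for lang in models}
    best = min(distances, key=distances.get)
    return best, distances
```

The returned dictionary of distances can also serve as the input for a bar-graph display or for building a language tree.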

The language similarity evaluation unit 16 may also use the calculated similarities to obtain, as the evaluation result, a language phylogenetic tree for the input speech signal, for the purpose of systematizing languages. The group average method (UPGMA: Unweighted Pair-Group Method using Average) or a similar method can be used to create the language phylogenetic tree. UPGMA builds the tree step by step by repeatedly merging the two languages (or language groups) separated by the smallest distance. The distance to a merged language group is calculated as the average of the distances to the individual languages in that group.
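A minimal pure-Python sketch of UPGMA follows. The input format (a dictionary of pairwise distances between labelled languages), the nested-tuple tree output, and the function name are assumptions for illustration, and the distances in the usage note are made-up values, not results from the source.

```python
from itertools import combinations

def upgma(dist):
    """dist maps (a, b) label pairs to distances, each pair listed once.
    Returns the tree as nested tuples of labels."""
    leaf_d = {}
    for (a, b), v in dist.items():
        leaf_d[(a, b)] = leaf_d[(b, a)] = v
    labels = sorted({x for pair in dist for x in pair})
    # Each cluster is (tree, leaves); a leaf starts as its own cluster.
    clusters = [(label, (label,)) for label in labels]

    def average(c1, c2):
        # Group-average distance: mean over all leaf pairs across clusters.
        total = sum(leaf_d[(a, b)] for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]
```

For example, with hypothetical distances in which "en" and "de" are closest, "sv" is next, and "zh" is far from all three, the function returns `("zh", ("sv", ("de", "en")))`.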

When the gender of the speaker of the evaluation speech signal is unknown, both the similarity obtained with the male phoneme representations and the similarity obtained with the female phoneme representations are preferably calculated, and the gender corresponding to the phoneme representation with the higher similarity is also included in the evaluation result.

The display control unit 17 performs control so that the evaluation result from the language similarity evaluation unit 16 is displayed on the display device. For example, the language type most similar to the language represented by the evaluation speech signal can be displayed as text, or the similarity between the evaluation speech signal and the phoneme representation of each language type can be displayed as a bar graph or the like. When a language phylogenetic tree has been obtained, it is preferably displayed as well.

Although the case in which the evaluation result is displayed on a display device has been described here, the evaluation result may instead be output as sound by an audio output device. For example, the language type most similar to the language represented by the evaluation speech signal can be announced by voice, or speech from the learning data of the most similar language type can be played back.

Next, the operation of the spoken language evaluation apparatus 1 according to the first embodiment is described. Before the evaluation process that evaluates the type of language represented by the evaluation speech signal, the learning processing routine shown in FIG. 5 is executed.

In step 100, digitized learning speech signals are input from an electronically recorded file or from an input device such as a microphone.

Next, in step 102, a mel spectrum is extracted as learning feature information from the learning speech signals input in step 100. The learning feature information extracted here is feature information for each language type and gender.

Next, in step 104, the mel spectrum extracted in step 102 is decomposed into phonemes by NMF for each language type and gender, yielding the phoneme representation H and the blending ratio U.

Next, in step 106, the phoneme representation H calculated in step 104 is stored in the phoneme representation storage unit 14 by gender and language type, and the process ends.

Then, once the above learning process routine has been executed and the phoneme representation H for each language type has been stored in the phoneme representation storage unit 14 separately by gender, the evaluation process routine shown in FIG. 6 is executed.

In step 120, an evaluation speech signal is input. The gender of the speaker of this evaluation speech signal is assumed to be known.

Next, in step 122, a mel spectrum is extracted as evaluation feature information from the evaluation speech signal input in step 120, by the same processing as in step 102 of the learning process.

Next, in step 124, the blending ratio U is calculated for each language type based on the mel spectrum Y extracted from the evaluation speech signal in step 122 and the phoneme representation H for each language type, corresponding to the speaker's gender, stored in the phoneme representation storage unit 14.
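Step 124 can be pictured as the same factorization with H held fixed, so that only U is estimated. The following is a minimal sketch under that reading; the Euclidean multiplicative update and the names `blend_ratio` and `blend_ratios_per_language` are assumptions introduced for illustration.

```python
import numpy as np

def blend_ratio(Y, H, n_iter=200, eps=1e-9, seed=0):
    """Estimate non-negative U such that Y ~= H @ U, with H held fixed."""
    rng = np.random.default_rng(seed)
    U = rng.random((H.shape[1], Y.shape[1])) + eps
    HtY = H.T @ Y  # constant across iterations because H is fixed
    HtH = H.T @ H
    for _ in range(n_iter):
        U *= HtY / (HtH @ U + eps)
    return U

def blend_ratios_per_language(Y, H_by_lang):
    """Compute one blending ratio per stored language type."""
    return {lang: blend_ratio(Y, H) for lang, H in H_by_lang.items()}
```

Here `H_by_lang` would hold only the phoneme representations that match the known gender of the speaker.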

Next, in step 126, for each language type, the product of the blending ratio U calculated in step 124 and the phoneme representation H stored in the phoneme representation storage unit 14 for the speaker's gender is calculated, and the similarity between that product and the mel spectrum Y extracted in step 122 is calculated. The language type corresponding to the phoneme representation H that yields the highest similarity is evaluated as the language type most similar to the language type indicated by the evaluation speech signal. In addition, to systematize the language types, the calculated similarities are used to obtain a phylogenetic tree for the input evaluation speech signal as an evaluation result.
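The selection in step 126 might look like the sketch below, which treats a smaller reconstruction distance as a higher similarity. The Frobenius norm as the distance and the function name `most_similar_language` are assumptions; the embodiment may use a different distance measure.

```python
import numpy as np

def most_similar_language(Y, H_by_lang, U_by_lang):
    """Return the language whose product H @ U is closest to Y, plus all distances."""
    distances = {
        lang: float(np.linalg.norm(Y - H @ U_by_lang[lang]))
        for lang, H in H_by_lang.items()
    }
    best = min(distances, key=distances.get)  # smallest distance = most similar
    return best, distances
```

The full dictionary of distances is returned as well, since the same values feed the bar graph and the phylogenetic tree.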

Next, in step 128, the evaluation result from step 126 is displayed on the display device, and the process ends.

As described above, the spoken language evaluation apparatus of the first embodiment stores, for each language type, the phoneme representation obtained when the learning feature information extracted from the learning speech signal is expressed, by non-negative matrix factorization, as a phoneme representation and its blending ratio. It then calculates a blending ratio for each language type based on the evaluation feature information extracted from the evaluation speech signal and the stored phoneme representation for each language type. Finally, it evaluates which language type the language of the evaluation speech signal resembles, based on the similarity between the evaluation feature information and the product of the stored phoneme representation for each language type and the calculated blending ratio. In this way, the language type indicated by an input speech signal can be evaluated using only the speech signal, without conversion to a text-level linguistic representation and without requiring prior knowledge.

In addition, a language phylogenetic tree can be obtained using the similarity between the language type indicated by the evaluation speech signal and each language type, which may also yield new cultural and historical insights into the relationships between language types.

Next, a second embodiment will be described. The spoken language evaluation apparatus according to the second embodiment differs from the first embodiment in that the phoneme representation calculation unit 13 uses time-series phoneme representations, so only that point is described.

Language characteristics are classified by, for example, the types of vowels. However, particularly in continuous speech, where the volume of each phoneme such as a vowel changes continuously, recognition can become difficult during the gradual transition from one phoneme to the next (for example, the change in sound from "yo" to "u" in "ohayou", meaning "good morning"). The second embodiment evaluates the language type while also taking such subtle sound changes in continuous speech into account.

As shown in FIG. 7, the phoneme representation calculation unit 13 in the second embodiment calculates time-series phoneme representations for phonemes having a time-series structure, using non-negative matrix factor deconvolution (NMFD; see, for example, Paris Smaragdis, "Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs," Independent Component Analysis and Blind Signal Separation, Lecture Notes in Computer Science, 2004, Volume 3195/2004, 494-499).
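To make the time-series structure concrete, the sketch below shows only the forward (reconstruction) model of NMFD: each phoneme is a short spectrogram patch, and the observation is approximated as a sum of time-shifted contributions. The array layout `(T, n_freq, K)` and the names `shift_right` and `nmfd_reconstruct` are assumptions for illustration; the update rules that estimate H and U are not shown.

```python
import numpy as np

def shift_right(U, tau):
    """Shift the columns of U right by tau frames, zero-filling on the left."""
    if tau == 0:
        return U.copy()
    out = np.zeros_like(U)
    out[:, tau:] = U[:, :-tau]
    return out

def nmfd_reconstruct(H_seq, U):
    """Convolutive model: Y_hat = sum over tau of H_seq[tau] @ shift_right(U, tau).

    H_seq: (T, n_freq, K) time-series phoneme representation (T frames per phoneme).
    U:     (K, n_frames) blending ratio, one activation track per phoneme.
    """
    return sum(H_seq[tau] @ shift_right(U, tau) for tau in range(H_seq.shape[0]))

# Stand-in sizes: 12 phonemes, 5 frames each, 40 mel bins, 100 frames.
rng = np.random.default_rng(0)
H_seq = rng.random((5, 40, 12))
U = rng.random((12, 100))
Y_hat = nmfd_reconstruct(H_seq, U)
```

With T = 1 the model reduces to the plain product H @ U used in the first embodiment.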

FIGS. 8 to 11 show the time-series phoneme representations of English, German, Swedish, and French calculated by NMFD. Each phoneme representation consists of 12 phonemes arranged side by side. Within each phoneme, the horizontal direction represents five time steps and the vertical direction represents frequency. The darker a cell, the larger its value. That is, each phoneme represents the temporal evolution of a mel spectrum.

The learning process and the evaluation process in the second embodiment differ from those of the first embodiment only in that time-series phoneme representations are calculated using NMFD instead of phoneme representations calculated by NMF, so their description is omitted.

As described above, the spoken language evaluation apparatus of the second embodiment provides, in addition to the effects of the first embodiment, appropriate evaluation of the language type indicated by the evaluation speech signal while also taking into account subtle phoneme changes within continuous changes in sound.

Here, an example of evaluation results is described to illustrate the effect of the present invention.

Phoneme representations were created from a speech corpus of 21 languages, and FIG. 12 shows the evaluation result comparing similarities for Japanese speech input from a female speaker. To convert distances into similarity values, exp(-distance) was used as the similarity value. As the figure shows, the similarity value for Japanese was the highest, confirming that the language type indicated by the input speech signal was correctly identified as Japanese.
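The conversion exp(-distance) maps a distance of zero to a similarity of 1 and larger distances to values approaching 0. A small sketch follows; the helper names and the example distances in the usage note are made up for illustration and are not the values behind FIG. 12.

```python
import math

def to_similarity(distances):
    """Convert per-language distances to similarity values via exp(-distance)."""
    return {lang: math.exp(-d) for lang, d in distances.items()}

def rank_languages(distances):
    """Return (language, similarity) pairs sorted from most to least similar."""
    sims = to_similarity(distances)
    return sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
```

For instance, `rank_languages({"ja": 0.5, "ko": 1.0, "en": 2.0})` lists "ja" first, since the smallest distance gives the largest similarity.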

FIG. 13 shows the language phylogenetic tree created by the UPGMA method after repeating the same procedure between each pair of languages. The result does not necessarily match the language family tree of linguistic classification, but it does reflect geographical proximity, which appears to suggest some relationship between phonemes and language types.
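UPGMA repeatedly merges the two closest clusters and sets the distance from the merged cluster to every other cluster to the size-weighted average of the two old distances. The sketch below is a minimal pure-Python version that returns only the tree topology as nested tuples, without branch heights. The function name `upgma` and the toy distances are assumptions and have no connection to the data behind FIG. 13.

```python
from itertools import combinations

def upgma(names, pair_dist):
    """Build a UPGMA tree. pair_dist maps (a, b) tuples to distances.

    Returns the tree as nested tuples of the original names.
    """
    d = {frozenset(pair): dist for pair, dist in pair_dist.items()}
    size = {name: 1 for name in names}  # cluster label -> number of leaves
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        na, nb = size.pop(a), size.pop(b)
        merged = (a, b)
        for c in size:
            # Size-weighted average of the distances from a and b to c.
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        size[merged] = na + nb
    return next(iter(size))

# Toy, made-up distances between four language labels.
toy = {
    ("en", "de"): 1.0, ("en", "sv"): 4.0, ("de", "sv"): 4.0,
    ("en", "fr"): 6.0, ("de", "fr"): 6.0, ("sv", "fr"): 6.0,
}
tree = upgma(["en", "de", "sv", "fr"], toy)
```

With these toy values, "en" and "de" merge first, "sv" joins them next, and "fr" attaches last.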

Next, FIG. 14 shows the result of measuring distances among four languages, English, German, Swedish, and French, using NMFD. The vertical axis represents the learned phoneme changes, and the horizontal axis represents the language type of the input speech. The darker a cell, the more similar the pair; the whiter a cell, the larger the difference. As the figure shows, for every input language type the most similar language type is the correct one, which indicates that identification using NMFD is effective. Furthermore, Swedish phonemes differ greatly from those of the other languages, while German phonemes are similar to those of the other languages. In general, Swedish and French have more vowel types than English, and German has fewer. The results appear to be influenced by this tendency.

In the above embodiments, the case where the learning unit and the evaluation unit are implemented on a single computer has been described, but they may instead be implemented on separate computers.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although the spoken language evaluation apparatus described above contains a computer system, the term "computer system" also includes a website-providing environment (or display environment) when a WWW system is used.

Although this specification describes embodiments in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
1 Spoken language evaluation apparatus
2 Learning unit
3 Evaluation unit
11 Speech signal input unit
12 Feature information extraction unit
13 Phoneme representation calculation unit
14 Phoneme representation storage unit
15 Phoneme blending ratio calculation unit
16 Language similarity evaluation unit
17 Display control unit

Claims

1. A spoken language evaluation apparatus comprising:
extraction means for extracting evaluation feature information from an evaluation speech signal whose language type is unknown;
blending ratio calculation means for calculating, for each language type, a blending ratio indicating the proportion in which the basis vector of each phoneme is blended in the evaluation feature information, based on the evaluation feature information extracted by the extraction means and on a phoneme representation for each language type, the phoneme representation being expressed by basis vectors for each phoneme obtained by non-negative matrix decomposition of learning feature information for each language type extracted from a plurality of learning speech signals whose language types are known; and
evaluation means for evaluating the language type indicated by the evaluation speech signal corresponding to the evaluation feature information, based on the similarity between the evaluation feature information and each piece of information given by the product of the blending ratio for each language type calculated by the blending ratio calculation means and the phoneme representation for each language type.

2. The spoken language evaluation apparatus according to claim 1, wherein the phoneme representation is a phoneme representation having a time-series structure.

3. The spoken language evaluation apparatus according to claim 1 or 2, wherein the evaluation means identifies the language type corresponding to the phoneme representation that yields the highest similarity as the language type indicated by the evaluation speech signal, or creates, based on the similarity for each language type, a language phylogenetic tree indicating the systematic relationships between language types.

4. The spoken language evaluation apparatus according to claim 1, wherein the blending ratio calculation means calculates the blending ratio for each language type and for each of at least one of gender and age, based on phoneme representations for each language type and for each of at least one of gender and age, obtained from learning feature information extracted from learning speech signals for which at least one of the speaker's gender and age is known.

5. The spoken language evaluation apparatus according to claim 1, wherein the extraction means extracts learning feature information for each language type from the plurality of learning speech signals, the apparatus further comprising phoneme representation calculation means for calculating the phoneme representation for each language type by performing non-negative matrix decomposition on the learning feature information for each language type extracted by the extraction means.

6. A spoken language evaluation method in a spoken language evaluation apparatus including extraction means, blending ratio calculation means, and evaluation means, wherein:
the extraction means extracts evaluation feature information from an evaluation speech signal whose language type is unknown;
the blending ratio calculation means calculates, for each language type, a blending ratio indicating the proportion in which the basis vector of each phoneme is blended in the evaluation feature information, based on the evaluation feature information extracted by the extraction means and on a phoneme representation for each language type, the phoneme representation being expressed by basis vectors for each phoneme obtained by non-negative matrix decomposition of learning feature information for each language type extracted from a plurality of learning speech signals whose language types are known; and
the evaluation means evaluates the language type indicated by the evaluation speech signal corresponding to the evaluation feature information, based on the similarity between the evaluation feature information and each piece of information given by the product of the blending ratio for each language type calculated by the blending ratio calculation means and the phoneme representation for each language type.

7. The spoken language evaluation method according to claim 6, wherein the spoken language evaluation apparatus further includes phoneme representation calculation means, the extraction means extracts learning feature information for each language type from the plurality of learning speech signals, and the phoneme representation calculation means calculates the phoneme representation for each language type by performing non-negative matrix decomposition on the learning feature information for each language type extracted by the extraction means.

8. A spoken language evaluation program for causing a computer to function as each of the means constituting the spoken language evaluation apparatus according to any one of claims 1 to 5.