JP2017126004A - Voice evaluating device, method, and program

Info

Publication number: JP2017126004A
Application number: JP2016005914A
Authority: JP
Inventors: 飛雄太田中; Hyuta Tanaka
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2016-01-15
Filing date: 2016-01-15
Publication date: 2017-07-20

Abstract

PROBLEM TO BE SOLVED: In a technique that supports language learning by presenting the differences between speech uttered by a user and model speech, to make it possible to compare both the degree of change within each syllable and the degree of change across multiple consecutive syllables when comparing the prosodic information of the user's speech with that of the model speech.

SOLUTION: A prosodic feature analyzing module 105 analyzes the prosodic features of the evaluation target speech uttered by the user and calculates the degree of change of each prosodic feature within each syllable of that speech. A prosodic feature evaluating module 106 compares the degree of change of the prosodic feature in each syllable of the evaluation target speech with the degree of change of the prosodic feature in each syllable of the corresponding pre-stored speech of a native speaker (model speech), and displays the comparison result on a display device 108 to present it to the user.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for supporting language learning by presenting the differences between speech uttered by a user and model speech, such as the speech of a native speaker.

Techniques that support language learning using speech data are in widespread use. Foreign-language pronunciation is generally evaluated at the phoneme level, using speech recognition or signal processing techniques (for example, checking “L” versus “R” in English). For prosody (what is referred to as accent, intonation, rhythm, and so on), many systems merely display graphs of the fundamental frequency (pitch), power, and duration. Where an evaluation is performed at all, it often consists of estimating the accent position and judging whether it is correct, or of simply comparing physical quantities such as fundamental frequency, power, and duration.

Going a step beyond such simple evaluations, a conventional technique has been disclosed that segments the learner's input speech using speech recognition and compares the same syllable intervals of the model speech and the learner's input speech. It thereby shows the learner where the two differ and presents the difference from the model speech numerically for both pronunciation and accent (for example, the technique described in Patent Document 1).

Patent Document 1: JP 2003-162291 A

However, the conventional technique that compares the model speech with the learner's input speech syllable by syllable only compares the average fundamental frequency and power within each syllable numerically. It cannot compare the degree of change of the fundamental frequency or power within each syllable, nor can it compare the degree of change across multiple consecutive syllables.

An object of the present invention is therefore to make it possible, when comparing the prosodic information of the evaluation target speech uttered by a user with that of the model speech, to compare both the degree of change within each syllable and the degree of change across multiple consecutive syllables.

In one example aspect, a prosody evaluation unit is provided that executes a prosodic feature acquisition process, which acquires the degree of change of a prosodic feature for each syllable included in the input evaluation target speech, and a prosodic feature comparison process, which compares the acquired per-syllable degree of change with the degree of change of the prosodic feature for each syllable included in the model speech corresponding to the evaluation target speech and presents the comparison result.

According to the present invention, both the degree of change within each syllable and the degree of change across multiple consecutive syllables can be compared for prosodic information.

A block diagram of an embodiment of the speech evaluation device.
A diagram showing an outline of the application processing of the prosody evaluation unit in the speech evaluation device.
A diagram showing an example of the processing sequence of the prosody evaluation application execution unit.
An explanatory diagram of the evaluation target speech, phoneme symbol labels, and syllables.
A diagram showing a display example of the evaluation results that the prosodic feature comparison module displays on the display device via the application front end.
A diagram showing an example hardware configuration of an embodiment of the language learning device.
A flowchart showing an example of the overall prosody comparison processing.
A flowchart showing a detailed example of the fundamental frequency acquisition processing.
A flowchart showing a detailed example of the RMS power acquisition processing.
A flowchart showing a detailed example of the duration acquisition processing.
A flowchart showing a detailed example of the prosodic feature comparison processing.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram of an embodiment of a language learning device 100 to which the speech evaluation device according to the present invention is applied. The language learning device 100 includes a prosody evaluation unit 101; an acoustic model 110, a pronunciation database 111, and model speech data 112, which are referenced by the prosody evaluation unit 101; and an input device 102 and an output device 103, which are connected to the prosody evaluation unit 101. The prosody evaluation unit 101 includes three software modules: a phoneme labeling module 104, a prosodic feature acquisition module 105 that executes the prosodic feature acquisition process, and a prosodic feature comparison module 106 that executes the prosodic feature comparison process. Although not specifically illustrated, the prosody evaluation unit 101 also includes an application front end, a software component that controls the execution of these modules. The input device 102 includes a voice input device 107 for inputting speech uttered by the user. The voice input device 107 consists of, for example, a microphone, an amplifier that amplifies the microphone output, and an A/D (analog-to-digital) converter that converts the analog speech signal output by the amplifier into a digital speech signal. The output device 103 includes a display device 108 and a voice output device 109. The display device 108 is, for example, a liquid crystal display. The voice output device 109 consists of, for example, a D/A (digital-to-analog) converter that converts the digital speech signal to be output into an analog speech signal, an amplifier that amplifies the output of the D/A converter, and a speaker that emits the analog speech signal output by that amplifier.

In FIG. 1, the acoustic model 110 is a statistical model of foreign-language speech used for phoneme labeling. The pronunciation database 111 stores pronunciation data (phoneme symbol strings) for words and sentences together with syllable information (syllable boundary positions). The model speech data 112 is a database that stores, for comparative evaluation, model speech such as the speech of a native speaker and the prosodic features (fundamental frequency, power, and duration) acquired from it.

FIG. 2 shows an outline of the processing of the prosody evaluation unit 101 in the language learning device 100 of FIG. 1. The prosody evaluation unit 101 first displays the character string of a model word or sentence, for example “Information”, on the display device 108 of FIG. 1 (S201), and plays the model speech through the voice output device 109 of FIG. 1 (S202). In response, the user utters the evaluation target speech corresponding to the model speech into the voice input device 107 of FIG. 1. The prosody evaluation unit 101 compares the evaluation target speech with the model speech, displays character strings and graphs of the comparison result on the display device 108, and outputs the comparison result as sound from the voice output device 109, thereby giving the user feedback on how accurate the evaluation target speech was and where it was wrong.

FIG. 3 shows an example of the processing sequence of the prosody evaluation unit 101 of FIG. 1.

First, the user inputs the evaluation target speech through the voice input device 107 of FIG. 1 (S301). As a result, the application front end 300 obtains the evaluation target speech as sampled data of a digital time-domain signal. Hereinafter, “evaluation target speech” refers to this sampled data.

Next, the application front end 300 of the prosody evaluation unit 101 acquires from the pronunciation database 111, as pronunciation information, the phoneme symbol string of the word or sentence presented as the task (S302).

Subsequently, the application front end 300 activates the phoneme labeling module 104 and passes it the pronunciation information acquired in S302 (S303). Based on this information and with reference to the acoustic model 110, the phoneme labeling module 104 performs phoneme segmentation on the evaluation target speech input in S301 and then performs phoneme labeling, which assigns a phoneme symbol label to each resulting phoneme segment (S304). The phoneme labeling module 104 returns the resulting phoneme labeling information to the application front end 300 (S305).

Next, the application front end 300 activates the prosodic feature acquisition module 105 and passes it the phoneme labeling information acquired in S305 (S306). The prosodic feature acquisition module 105 acquires prosodic features for the evaluation target speech input in S301. It refers to the phoneme labeling information and syllable boundary information attached to the model speech in the pronunciation database 111 and matches that phoneme labeling information against the phoneme labeling information of the evaluation target speech passed from the application front end 300, thereby performing syllable segmentation on the evaluation target speech. For each resulting syllable (segment), the prosodic feature acquisition module 105 calculates the degree of change of each prosodic feature. It also calculates a representative value of each prosodic feature for each syllable (the above constitutes S307). The prosodic feature acquisition module 105 returns the resulting per-syllable degrees of change and representative values to the application front end 300 (S308).

Next, the application front end 300 activates the prosodic feature comparison module 106 and passes it the per-syllable degrees of change and representative values acquired in S308 (S309). Referring to the model speech data 112, the prosodic feature comparison module 106 compares the degree of change of the prosodic feature calculated for each syllable of the evaluation target speech with the degree of change stored in advance for each syllable of the corresponding model speech. Again referring to the model speech data 112, it also compares the degree of change across multiple syllables of the per-syllable representative values calculated for the evaluation target speech with the degree of change across multiple syllables of the per-syllable representative values stored in advance for the model speech (the above constitutes S310). The prosodic feature comparison module 106 returns the evaluation results of these two comparisons to the application front end 300 (S311).

Based on the evaluation result, acquired in S311, of comparing the per-syllable degrees of change of the prosodic features of the evaluation target speech and the model speech, the application front end 300 displays on the display device 108 of FIG. 1, for each syllable, the change shape of the prosodic feature of the evaluation target speech alongside the change shape of the prosodic feature of the model speech. The application front end 300 also displays on the display device 108, side by side, a line graph that plots the representative values of the prosodic feature of the evaluation target speech across the syllables and a line graph that plots the representative values of the prosodic feature of the model speech across the syllables, both acquired in S311 (the above constitutes S312).

FIGS. 4 and 5 are explanatory diagrams of the processing sequence of FIG. 3 performed by the prosody evaluation unit 101 of FIG. 1.

First, FIG. 4 is an explanatory diagram of the evaluation target speech, the phoneme symbol labels, and the syllables. When the user utters, for example, the English word “Information” in S301 of FIG. 3, the application front end 300 receives digital time-domain waveform data of the evaluation target speech such as that illustrated in FIG. 4(a). When the phoneme labeling module 104 then performs phoneme labeling in S304 of FIG. 3, phoneme segmentation is performed along the time axis on this waveform data, as shown in FIG. 4(b), and the phoneme symbol labels “IH”, “N”, “F”, “ER”, “M”, “EY”, “SH”, “AH”, and “N” are assigned to the respective phoneme segments. This phoneme symbol string corresponds to the pronunciation of the English word “Information”. Next, in S307 of FIG. 3, the prosodic feature acquisition module 105 refers to the phoneme labeling information and syllable boundary information attached to the model speech in the pronunciation database 111 and matches that phoneme labeling information against the phoneme symbol label string (phoneme labeling information) of the evaluation target speech in FIG. 4(b). Syllable segmentation is thereby performed along the time axis on the waveform data of the evaluation target speech, and the syllables “in”, “for”, “ma”, and “tion” are assigned to the respective syllable segments.
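The grouping of labeled phoneme segments into syllable segments can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Segment` type, the `group_into_syllables` function, the representation of syllable boundary information as a list of phoneme counts, and all timestamps are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str
    start_ms: int  # segment start on the time axis (hypothetical values below)
    end_ms: int    # segment end on the time axis


def group_into_syllables(phonemes, syllable_names, syllable_sizes):
    """Merge consecutive phoneme segments into syllable segments.

    syllable_sizes[k] is the number of phonemes in syllable k; this stands in
    for the syllable boundary information held in the pronunciation database.
    """
    if len(syllable_names) != len(syllable_sizes):
        raise ValueError("one size is required per syllable name")
    if sum(syllable_sizes) != len(phonemes):
        raise ValueError("syllable sizes do not cover the phoneme sequence")
    syllables = []
    index = 0
    for name, size in zip(syllable_names, syllable_sizes):
        chunk = phonemes[index:index + size]
        # A syllable spans from its first phoneme's start to its last phoneme's end.
        syllables.append(Segment(name, chunk[0].start_ms, chunk[-1].end_ms))
        index += size
    return syllables


# Phoneme labels for "Information"; each segment is given a made-up 100 ms length.
LABELS = ["IH", "N", "F", "ER", "M", "EY", "SH", "AH", "N"]
PHONEMES = [Segment(lab, i * 100, (i + 1) * 100) for i, lab in enumerate(LABELS)]
SYLLABLES = group_into_syllables(PHONEMES, ["in", "for", "ma", "tion"], [2, 2, 2, 3])
```

With this grouping, "IH N" becomes “in”, "F ER" becomes “for”, "M EY" becomes “ma”, and "SH AH N" becomes “tion”.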

In S307 of FIG. 3, the prosodic feature acquisition module 105 calculates the degree of change of each prosodic feature for each syllable illustrated in FIG. 4(c).

Specifically, for the evaluation target speech illustrated in FIG. 4(a), the prosodic feature acquisition module 105 calculates the logarithm of the fundamental frequency for each frame of a predetermined length (for example, 256 milliseconds). The fundamental frequency can be calculated, for example, by an autocorrelation or cepstrum computation on the samples of the evaluation target speech within the frame. Next, for each of the syllables “in”, “for”, “ma”, and “tion” illustrated in FIG. 4(c), the prosodic feature acquisition module 105 takes the logarithms of the fundamental frequency of the frames contained in that syllable as the first prosodic feature of the syllable. Then, for each syllable, it performs, for example, a regression analysis (such as linear or polynomial regression) on the logarithms of the fundamental frequency of the frames contained in the syllable, and classifies the degree of change of the log fundamental frequency (first prosodic feature) in that syllable as an “increasing tendency”, a “no-change tendency”, or a “decreasing tendency”.
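As a rough sketch of this step, the code below estimates a frame's fundamental frequency from its autocorrelation peak and classifies a syllable's trend from the least-squares slope of its per-frame log values. The function names, the 75 to 400 Hz search range, and the `flat_threshold` used to decide “no change” are all assumptions made for illustration; the text does not state how the three tendencies are separated.

```python
import math


def estimate_f0_autocorr(frame, sample_rate, f_min=75.0, f_max=400.0):
    """Estimate F0 in Hz from the strongest autocorrelation peak, or None."""
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation of the frame at this lag.
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None


def linear_slope(values):
    """Least-squares slope of values against their frame index (needs >= 2 values)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def classify_trend(log_values, flat_threshold=0.01):
    """Map a syllable's per-frame log values to one of three trend labels."""
    if len(log_values) < 2:
        return "no change"
    slope = linear_slope(log_values)
    if slope > flat_threshold:
        return "increasing"
    if slope < -flat_threshold:
        return "decreasing"
    return "no change"
```

A linear fit is the simplest of the regression options mentioned; a polynomial fit would need a different rule for turning the fitted curve into one of the three labels. The same `classify_trend` can be applied unchanged to per-frame log RMS power values.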

In addition, for the evaluation target speech illustrated in FIG. 4(a), the prosodic feature acquisition module 105 calculates, for each frame, the RMS[x] (root mean square) of the evaluation target speech within that frame according to Equation 1 below, and then calculates its logarithm. Here, where the length of one frame is N samples, x_1, x_2, ..., x_N are the sample values of the evaluation target speech within the frame.

$$\mathrm{RMS}[x] = \sqrt{\frac{1}{N}\sum_{i=1}^{N} x_i^{2}}$$

Next, for each of the syllables “in”, “for”, “ma”, and “tion” illustrated in FIG. 4(c), the prosodic feature acquisition module 105 takes the logarithms of the RMS power of the frames contained in that syllable as the second prosodic feature of the syllable. Then, for each syllable, it performs, for example, a regression analysis on the logarithms of the RMS power of the frames contained in the syllable, in the same way as for the fundamental frequency, and classifies the degree of change of the log RMS power (second prosodic feature) in that syllable as an “increasing tendency”, a “no-change tendency”, or a “decreasing tendency”.
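The per-frame log RMS computation can be sketched as below. The non-overlapping framing in `split_frames` and the small `floor` that keeps the logarithm finite for silent frames are assumptions added for this illustration; the text specifies neither.

```python
import math


def split_frames(samples, frame_length):
    """Cut a sample sequence into consecutive non-overlapping full frames."""
    last_start = len(samples) - frame_length
    return [samples[i:i + frame_length] for i in range(0, last_start + 1, frame_length)]


def log_rms(frame, floor=1e-12):
    """Natural log of the root mean square of one frame's samples."""
    mean_square = sum(x * x for x in frame) / len(frame)
    # The floor avoids log(0) when the frame is completely silent.
    return math.log(max(math.sqrt(mean_square), floor))


def log_rms_per_frame(samples, frame_length):
    """Log RMS value for every full frame of the signal."""
    return [log_rms(frame) for frame in split_frames(samples, frame_length)]
```

The resulting list, restricted to the frames that fall inside one syllable, is what the regression-based trend classification would operate on.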

The prosodic feature acquisition module 105 also calculates, for each of the syllables “in”, “for”, “ma”, and “tion”, a representative value from the logarithms of the fundamental frequency of the frames contained in that syllable. Similarly, for each syllable, it calculates a representative value from the logarithms of the RMS power of the frames contained in that syllable.
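The text does not say which statistic serves as the representative value. The sketch below assumes the mean (with the median as an alternative) and additionally assumes that frames without a usable value, such as unvoiced frames with no fundamental frequency, are marked `None` and skipped; both choices are illustrative only.

```python
from statistics import mean, median


def representative_value(frame_values, method="mean"):
    """Summarise one syllable's per-frame log values as a single number."""
    # Frames marked None (for example, no detectable F0) are ignored.
    usable = [v for v in frame_values if v is not None]
    if not usable:
        return None
    if method == "mean":
        return mean(usable)
    if method == "median":
        return median(usable)
    raise ValueError(f"unknown method: {method}")
```

The median is less sensitive to a single mis-estimated frame, which may matter for short syllables.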

Furthermore, for the evaluation target speech illustrated in FIG. 4(a), the prosodic feature acquisition module 105 obtains the duration of each phoneme from the phoneme labeling information passed from the application front end 300 and, for each syllable illustrated in FIG. 4(c), calculates the sum of the durations of the phonemes contained in that syllable as the duration of the syllable.
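Summing phoneme durations into syllable durations is a simple grouped sum. In the sketch below, the function name, the use of phoneme counts per syllable as the boundary information, and the millisecond values in the usage note are assumptions for illustration.

```python
def syllable_durations(phoneme_durations, syllable_sizes):
    """Sum consecutive phoneme durations into one duration per syllable."""
    if sum(syllable_sizes) != len(phoneme_durations):
        raise ValueError("syllable sizes do not cover the phoneme sequence")
    durations = []
    index = 0
    for size in syllable_sizes:
        durations.append(sum(phoneme_durations[index:index + size]))
        index += size
    return durations
```

For example, with made-up durations in milliseconds for "IH N F ER M EY SH AH N", `syllable_durations([60, 50, 70, 90, 40, 120, 80, 50, 70], [2, 2, 2, 3])` gives one total each for “in”, “for”, “ma”, and “tion”.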

Next, in S310 of FIG. 3, the prosodic feature comparison module 106 compares each degree of change of the first or second prosodic feature calculated for each syllable of the evaluation target speech with the corresponding degree of change stored in advance for each syllable of the model speech, and presents the evaluation result of the comparison to the user by displaying it on the display device 108 of FIG. 1 via the application front end 300. In doing so, the prosodic feature comparison module 106 causes the display device 108 to show, side by side, the change shape corresponding to the degree of change of the first or second prosodic feature for each syllable of the evaluation target speech and the change shape corresponding to the degree of change for each syllable of the model speech. Depending on whether the corresponding degree of change is an “increasing tendency”, a “no-change tendency”, or a “decreasing tendency”, the change shape can be an upward-sloping arrow, a horizontal arrow, or a downward-sloping arrow, respectively.
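The mapping from trend labels to arrow shapes, and the flagging of syllables where the user and the model differ, can be sketched as follows. The label strings, the Unicode arrows, the row dictionary layout, and the example trends are hypothetical; actual rendering on the display device is outside this sketch.

```python
ARROWS = {"increasing": "↗", "no change": "→", "decreasing": "↘"}


def compare_trends(syllables, user_trends, model_trends):
    """Pair up per-syllable trend arrows and flag the syllables that differ."""
    if not (len(syllables) == len(user_trends) == len(model_trends)):
        raise ValueError("all inputs must have one entry per syllable")
    rows = []
    for syllable, user, model in zip(syllables, user_trends, model_trends):
        rows.append({
            "syllable": syllable,
            "user": ARROWS[user],
            "model": ARROWS[model],
            "highlight": user != model,  # differing shapes get emphasised
        })
    return rows


# Hypothetical trends in which only the syllable "for" deviates from the model.
ROWS = compare_trends(
    ["in", "for", "ma", "tion"],
    ["increasing", "decreasing", "increasing", "decreasing"],
    ["increasing", "increasing", "increasing", "decreasing"],
)
```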

FIG. 5 shows an example of the evaluation results that the prosodic feature comparison module 106 displays on the display device 108 via the application front end 300. For the logarithm of the fundamental frequency (first prosodic feature) of the evaluation target speech, the prosodic feature comparison module 106 causes the display device 108 to display, for each syllable in FIG. 5(a-1), an upward-sloping arrow, a horizontal arrow, or a downward-sloping arrow corresponding to an “increasing tendency”, a “no-change tendency”, or a “decreasing tendency”, as shown in FIG. 5(c-1). Alongside this display, for the logarithm of the fundamental frequency (first prosodic feature) of the model speech, it causes the display device 108 to display the corresponding arrow for each syllable in FIG. 5(a-2), as shown in FIG. 5(c-2). Where the change shape of the log fundamental frequency differs between the evaluation target speech in FIG. 5(c-1) and the model speech in FIG. 5(c-2), as with the downward-sloping arrow in FIG. 5(c-1) for the syllable “for” in FIG. 5(a-1), the module causes the display device 108 to highlight that portion. This side-by-side display lets the user easily see in which syllables the fundamental frequency, that is, the pitch of the voice, deviates from the model.

Similarly, for the logarithm of the RMS power (second prosodic feature) of the evaluation target speech, the prosodic feature comparison module 106 causes the display device 108 to display, for each syllable in FIG. 5(a-1), an upward-sloping arrow, a horizontal arrow, or a downward-sloping arrow corresponding to an “increasing tendency”, a “no-change tendency”, or a “decreasing tendency”, as shown in FIG. 5(b-1). Alongside this display, for the logarithm of the RMS power (second prosodic feature) of the model speech, it causes the display device 108 to display the corresponding arrow for each syllable in FIG. 5(a-2), as shown in FIG. 5(b-2). Where the change shape of the log RMS power differs between the evaluation target speech in FIG. 5(b-1) and the model speech in FIG. 5(b-2), as with the horizontal arrow in FIG. 5(b-1) for the syllable “for” in FIG. 5(a-1) or the upward-sloping arrow in FIG. 5(b-1) for the syllable “ma” in FIG. 5(a-1), the module causes the display device 108 to highlight that portion. This side-by-side display lets the user easily see in which syllables the RMS power, that is, the loudness of the voice, deviates from the model.

In addition, in S310 of FIG. 3, the prosodic feature comparison module 106 compares the degree of change across multiple syllables of the per-syllable representative values of the first or second prosodic feature calculated for the evaluation target speech with the degree of change across multiple syllables of the per-syllable representative values stored in advance for the model speech, and presents the evaluation result of the comparison to the user by displaying it on the display device 108 of FIG. 1 via the application front end 300. In doing so, the prosodic feature comparison module 106 causes the display device 108 to show, side by side, a line graph that plots the representative values of the first or second prosodic feature of the evaluation target speech across the syllables and a line graph that plots the representative values of the first or second prosodic feature of the model speech across the syllables.

To explain concretely with FIG. 5, for the representative value of the logarithmic fundamental frequency (the first prosodic feature) of each syllable of the evaluation target speech, the prosodic feature comparison module 106 causes the display device 108 to display a line graph that plots a square mark for each syllable in FIG. 5 (a-1) and connects the plots across the syllables, as shown in FIG. 5 (f-1). In contrast with this display for the evaluation target speech, the module displays the corresponding line graph for the logarithmic fundamental frequency of each syllable of the model speech in FIG. 5 (a-2), as shown in FIG. 5 (f-2). In this case, for parts where the change between syllables in the representative value of the logarithmic fundamental frequency differs between the evaluation target speech of FIG. 5 (f-1) and the model speech of FIG. 5 (f-2), such as the slope of the line graph in FIG. 5 (f-1) spanning the syllables "in" and "for" of FIG. 5 (a-1) or the slope spanning the syllables "ma" and "tion", the module causes the display device 108 to highlight them. This contrasting display lets the user easily see between which syllables the change in fundamental frequency, that is, in voice pitch, deviates.

Further, for the representative value of the logarithmic RMS power (the second prosodic feature) of each syllable of the evaluation target speech, the prosodic feature comparison module 106 causes the display device 108 to display a line graph that plots a square mark for each syllable in FIG. 5 (a-1) and connects the plots across the syllables, as shown in FIG. 5 (e-1). In contrast with this display for the evaluation target speech, the module displays the corresponding line graph for the logarithmic RMS power of each syllable of the model speech in FIG. 5 (a-2), as shown in FIG. 5 (e-2). In this case, for parts where the change between syllables in the representative value of the logarithmic RMS power differs between the evaluation target speech of FIG. 5 (e-1) and the model speech of FIG. 5 (e-2), such as the slope of the line graph in FIG. 5 (e-1) spanning the syllables "for" and "ma" of FIG. 5 (a-1), the module causes the display device 108 to highlight them. This contrasting display lets the user easily see between which syllables the change in RMS power, that is, in the loudness of the voice, deviates.

Furthermore, in S310 of FIG. 3, the prosodic feature comparison module 106 compares the per-syllable duration calculated for the evaluation target speech with the per-syllable duration stored in advance for the model speech, displays the evaluation result of the comparison on the display device 108 of FIG. 1 via the application front end 300, and presents it to the user.

To explain concretely with FIG. 5, for the evaluation target speech, the prosodic feature comparison module 106 causes the display device 108 to display the duration of each syllable in FIG. 5 (a-1), as shown in FIG. 5 (d-1). In contrast with this display, the module displays the duration of each syllable of the model speech in FIG. 5 (a-2), as shown in FIG. 5 (d-2). In this case, for parts where the duration differs greatly between the evaluation target speech of FIG. 5 (d-1) and the model speech of FIG. 5 (d-2), such as the duration value in FIG. 5 (d-1) for the syllable "for" of FIG. 5 (a-1), the module causes the display device 108 to highlight them. This contrasting display lets the user easily see in which syllables the duration, that is, how long the voice is sustained, deviates.

以上のようにして、本実施形態によれば、基本周波数、パワーなどの韻律情報や継続時間長などについて、各音節内における変化の度合いと、連続する複数の音節にまたがる変化の度合いをそれぞれ比較することが可能となる。 As described above, according to the present embodiment, the degree of change in each syllable and the degree of change across a plurality of consecutive syllables are compared with respect to prosodic information such as fundamental frequency and power and duration length. It becomes possible to do.

In addition to the contrastive comparison by display described above, the prosodic feature comparison module 106 can execute evaluation processing using an appropriate one of the following methods, chosen according to the nature of the language of the evaluation target speech, the user, and so on.

<Comparison evaluation 1 of prosodic feature change shapes>
For the evaluation target speech and the model speech, the prosodic feature comparison module 106 calculates, for each syllable, how closely the change shapes of the logarithmic fundamental frequency and of the logarithmic RMS power over the frames in that syllable match.
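The text does not give a formula for the per-syllable match. One plausible reading, sketched below, is the fraction of syllables whose trend label agrees between the two utterances. The function name `trend_match`, the label strings, and the sample data are all illustrative assumptions, not taken from the patent.

```python
from typing import List, Sequence, Tuple


def trend_match(user: Sequence[str], model: Sequence[str]) -> Tuple[float, List[int]]:
    """Return (match rate, indices of mismatching syllables).

    Assumption: both sequences hold one trend label per syllable and the
    syllables are already aligned one to one.
    """
    if not user or len(user) != len(model):
        raise ValueError("both utterances need the same non-zero syllable count")
    mismatched = [i for i, (u, m) in enumerate(zip(user, model)) if u != m]
    rate = 1.0 - len(mismatched) / len(user)
    return rate, mismatched


# Hypothetical labels for the four syllables "in", "for", "ma", "tion".
user_trends = ["flat", "flat", "increase", "decrease"]
model_trends = ["flat", "increase", "decrease", "decrease"]
rate, to_highlight = trend_match(user_trends, model_trends)
print(rate, to_highlight)
```

The returned indices could also drive the highlighting of differing syllables described for the arrow display.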

<Comparison evaluation 2 of prosodic feature change shapes>
When there are a plurality of syllables, the prosodic feature comparison module 106 also calculates and identifies the change shape between syllables from the representative prosodic feature values of the preceding and following syllables, and calculates how closely these match as well.
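As a rough sketch of the inter-syllable step, the change shape between neighbouring syllables can be derived from the difference of their representative values. The dead-band `threshold`, the label names, and the sample values are assumptions made for illustration; the text does not specify how the shape is identified.

```python
from typing import List, Sequence


def inter_syllable_shapes(representatives: Sequence[float], threshold: float = 0.05) -> List[str]:
    """Classify the change between each pair of consecutive syllables.

    `threshold` is a hypothetical dead band: smaller differences count as "flat".
    """
    shapes = []
    for prev, cur in zip(representatives, representatives[1:]):
        delta = cur - prev
        if delta > threshold:
            shapes.append("rise")
        elif delta < -threshold:
            shapes.append("fall")
        else:
            shapes.append("flat")
    return shapes


# Hypothetical representative log-F0 values for four syllables.
print(inter_syllable_shapes([5.0, 5.2, 5.2, 4.9]))
```

Applying the same function to the model speech and comparing the two label lists position by position would give a match rate for the inter-syllable shapes.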

<Comparison evaluation of per-syllable prosodic features over the whole utterance>
For the evaluation target speech and the model speech, the prosodic feature comparison module 106 calculates a correlation coefficient γ over all syllables according to Equation 2, for the per-syllable representative value of the fundamental frequency, the per-syllable representative value of the RMS power, or the per-syllable duration. Here, x_i and y_i (i = 1, 2, ..., n, where n is the number of syllables) are the per-syllable prosodic features (the representative values of the logarithmic fundamental frequency, the representative values of the logarithmic RMS power, or the durations) of the evaluation target speech and the model speech, respectively.
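Equation 2 is not reproduced in this text. The sketch below assumes it is the ordinary Pearson correlation coefficient between the sequences x_i and y_i, which is the usual meaning of a correlation coefficient. That assumption and the function name `correlation` are not confirmed by the source.

```python
import math
from typing import Sequence


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between per-syllable features of two utterances."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant sequence has no defined correlation.
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical normalized syllable durations for user and model speech.
print(correlation([0.20, 0.35, 0.25, 0.20], [0.22, 0.30, 0.28, 0.20]))
```

A value near 1 would indicate that the user's per-syllable pattern follows the model's pattern closely over the whole utterance.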

<Display of match rate and correlation coefficient>
The prosodic feature evaluation module 106 converts the prosodic feature match rate calculated by <Comparison evaluation 1 of prosodic feature change shapes> or <Comparison evaluation 2 of prosodic feature change shapes>, or the correlation coefficient γ calculated by <Comparison evaluation of per-syllable prosodic features over the whole utterance>, into an N-level rating (for example 1/2/3/4/5, or BAD/GOOD/Excellent) by setting scores and thresholds that take into account, for example, the user's proficiency level and learning goal. The rating is displayed on the display device 108 of FIG. 1 and presented to the user.
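A minimal sketch of the N-level conversion follows. The threshold values 0.5 and 0.8 are invented for illustration; the text says only that thresholds are set with the user's proficiency and goals in mind. The labels follow the BAD/GOOD/Excellent example.

```python
from bisect import bisect_right
from typing import Sequence


def to_grade(
    score: float,
    thresholds: Sequence[float] = (0.5, 0.8),  # hypothetical cut points, ascending
    labels: Sequence[str] = ("BAD", "GOOD", "Excellent"),
) -> str:
    """Map a match rate or correlation coefficient to one of N labels."""
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    # bisect_right counts how many thresholds the score has reached.
    return labels[bisect_right(thresholds, score)]


print(to_grade(0.62))
```

Passing a stricter `thresholds` tuple for advanced learners would be one way to reflect the proficiency-dependent setting the text mentions.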

図６は、図１の語学学習装置１００をソフトウェア処理として実現できるコンピュータのハードウェア構成例を示す図である。図６に示されるコンピュータは、ＣＰＵ６０１、ＲＯＭ（リードオンリーメモリ）６０２、ＲＡＭ（ランダムアクセスメモリ）６０３、入力装置６０４、出力装置６０５、外部記憶装置６０６、可搬記録媒体６１０が挿入される可搬記録媒体駆動装置６０７、及び通信インターフェース６０８を有し、これらがバス６０９によって相互に接続された構成を有する。同図に示される構成は上記システムを実現できるコンピュータの一例であり、そのようなコンピュータはこの構成に限定されるものではない。 FIG. 6 is a diagram illustrating a hardware configuration example of a computer capable of realizing the language learning apparatus 100 of FIG. 1 as software processing. The computer shown in FIG. 6 has a CPU 601, a ROM (Read Only Memory) 602, a RAM (Random Access Memory) 603, an input device 604, an output device 605, an external storage device 606, and a portable recording medium 610 inserted therein. A recording medium driving device 607 and a communication interface 608 are included and are connected to each other via a bus 609. The configuration shown in the figure is an example of a computer that can implement the above system, and such a computer is not limited to this configuration.

ＲＯＭ６０２は、コンピュータを制御する韻律評価プログラムを含む各プログラムを記憶するメモリである。ＲＡＭ６０３は、各プログラムの実行時に、ＲＯＭ６０２に記憶されているプログラム又はデータを一時的に格納するメモリである。 The ROM 602 is a memory that stores programs including a prosody evaluation program that controls the computer. The RAM 603 is a memory that temporarily stores a program or data stored in the ROM 602 when each program is executed.

外部記憶装置６０６は、例えばＳＳＤ（ソリッドステートドライブ）記憶装置またはハードディスク記憶装置であり、入力テキストデータ、入力音声データ、接続音声素片データ、または合成音声データ等の保存に用いられる。また、外部記憶装置６０６は、図１の音響モデル１１０、発音データベース１１１、模範音声データ１１２を記憶する。 The external storage device 606 is, for example, an SSD (solid state drive) storage device or a hard disk storage device, and is used for storing input text data, input speech data, connected speech segment data, synthesized speech data, or the like. Further, the external storage device 606 stores the acoustic model 110, the pronunciation database 111, and the model voice data 112 of FIG.

ＣＰＵ６０１は、各プログラムを、ＲＯＭ６０２からＲＡＭ６０３に読み出して実行することにより、当該コンピュータ全体の制御を行う。 The CPU 601 controls the entire computer by reading each program from the ROM 602 to the RAM 603 and executing it.

入力装置６０４は、図１の入力装置１０２に対応する。出力装置６０５は、図１の出力装置１０３に対応する。 The input device 604 corresponds to the input device 102 of FIG. The output device 605 corresponds to the output device 103 in FIG.

可搬記録媒体駆動装置６０７は、光ディスクやＳＤＲＡＭ、コンパクトフラッシュ等の可搬記録媒体６１０を収容するもので、外部記憶装置６０６の補助の役割を有する。 The portable recording medium driving device 607 accommodates a portable recording medium 610 such as an optical disk, SDRAM, or compact flash, and has an auxiliary role for the external storage device 606.

通信インターフェース６０８は、例えばＬＡＮ（ローカルエリアネットワーク）又はＷＡＮ（ワイドエリアネットワーク）の通信回線を接続するための装置である。 The communication interface 608 is a device for connecting, for example, a LAN (local area network) or WAN (wide area network) communication line.

本実施形態による語学学習装置１００では、ＣＰＵ６０１が、ＲＯＭ６０２に記憶された韻律評価プログラムを、ＲＡＭ６０３をワークメモリとして使用しながら実行することにより、図１の韻律評価部１０１内の音素ラベリングモジュール１０４、韻律特徴取得モジュール１０５、及び韻律特徴比較モジュール１０６の各ブロックの機能を実現する。そのプログラムは、例えば外部記憶装置６０６や可搬記録媒体６１０に記録して配布してもよく、或いはネットワーク接続装置６０８によりネットワークから取得できるようにしてもよい。 In the language learning device 100 according to the present embodiment, the CPU 601 executes the prosody evaluation program stored in the ROM 602 while using the RAM 603 as a work memory, whereby the phoneme labeling module 104 in the prosody evaluation unit 101 in FIG. The function of each block of the prosodic feature acquisition module 105 and the prosodic feature comparison module 106 is realized. For example, the program may be recorded and distributed in the external storage device 606 or the portable recording medium 610, or may be acquired from the network by the network connection device 608.

FIG. 7 is a flowchart showing an example of the overall prosody evaluation process in the case where the CPU 601 of a computer having the hardware configuration example of FIG. 6 realizes, through software program processing, the functions of the language learning device 100 corresponding to the configuration of FIG. 1. By executing the prosody evaluation processing program stored in the ROM 602, the CPU 601 sequentially executes the evaluation target speech input process (S701), the phoneme labeling process (S702), the prosodic feature acquisition process (S703 to S705), and the prosodic feature comparison process (S706).

Ｓ７０１の評価対象音声の入力処理は、図３のＳ３０１の処理シーケンスに対応し、これにより評価対象音声が入力される。 The input processing of the evaluation target voice in S701 corresponds to the processing sequence of S301 in FIG. 3, and thereby the evaluation target voice is input.

Ｓ７０２の音素ラベリング処理は、図３のＳ３０２からＳ３０５の処理シーケンスに対応し、これにより評価対象音声に対して音素ラベルが付与された音素ラベリング情報が得られる。 The phoneme labeling process in S702 corresponds to the processing sequence from S302 to S305 in FIG. 3, and thereby, phoneme labeling information in which a phoneme label is assigned to the evaluation target speech is obtained.

Ｓ７０３からＳ７０５の韻律特徴取得処理は、図３のＳ３０６からＳ３０８の処理シーケンスに対応する。韻律特徴取得処理は、基本周波数の取得処理（Ｓ７０３）、ＲＭＳパワーの取得処理（Ｓ７０４）、及び継続時間長の取得処理（Ｓ７０５）を含む。 The prosodic feature acquisition processing from S703 to S705 corresponds to the processing sequence from S306 to S308 in FIG. The prosodic feature acquisition process includes a fundamental frequency acquisition process (S703), an RMS power acquisition process (S704), and a duration length acquisition process (S705).

図８は、図７のＳ７０３の基本周波数の取得処理の詳細例を示すフローチャートである。 FIG. 8 is a flowchart showing a detailed example of the fundamental frequency acquisition process in S703 of FIG.

図３のＳ３０７に関する図４及び図５による詳細説明で前述したように、ＣＰＵ６０１はまず、フレーム毎に基本周波数を算出し（Ｓ８０１）、更に基本周波数の対数値を算出する（Ｓ８０２）。 As described above in detail with reference to FIGS. 4 and 5 regarding S307 in FIG. 3, the CPU 601 first calculates a fundamental frequency for each frame (S801), and further calculates a logarithmic value of the fundamental frequency (S802).

次に、図３のＳ３０７に関する図４及び図５による詳細説明で前述したように、ＣＰＵ６０１は、図６の外部記憶装置６０６に記憶されている発音データベース１１１（図１参照）中の模範音声に付与されている音素ラベリング情報と音節区切り情報を参照し、その音素ラベリング情報を図７のＳ７０２で得られた音素ラベリング情報中の音素記号ラベル列と照合することによって、評価対象音声の音節情報を決定する（Ｓ８０３）。 Next, as described above in detail with reference to FIG. 4 and FIG. 5 regarding S307 in FIG. 3, the CPU 601 adds the model voice in the pronunciation database 111 (see FIG. 1) stored in the external storage device 606 in FIG. By referring to the given phoneme labeling information and syllable separation information and comparing the phoneme labeling information with the phoneme symbol label string in the phoneme labeling information obtained in S702 of FIG. 7, the syllable information of the speech to be evaluated is obtained. Determine (S803).

次に、図３のＳ３０７に関する図４及び図５による詳細説明で前述したように、ＣＰＵ６０１は、Ｓ８０３で決定した音節毎に、Ｓ８０１及びＳ８０２で算出した、その音節に含まれる各フレームの基本周波数の対数値に対して回帰分析を実行する（Ｓ８０４）。 Next, as described above in detail with reference to FIGS. 4 and 5 regarding S307 in FIG. 3, the CPU 601 calculates the basic frequency of each frame included in the syllable calculated in S801 and S802 for each syllable determined in S803. Regression analysis is performed on the logarithmic value of (S804).

続いて、図３のＳ３０７に関する図４及び図５による詳細説明で前述したように、ＣＰＵ６０１は、Ｓ８０３で決定した音節毎に、Ｓ８０４での回帰分析結果に基づいて、その音節での基本周波数の対数値の変化度合いを、「増加傾向」「変化しない傾向」「減少傾向」の何れかに判別する（Ｓ８０５）。 Subsequently, as described above in detail with reference to FIGS. 4 and 5 regarding S307 in FIG. 3, the CPU 601 determines the basic frequency of the syllable for each syllable determined in S803 based on the regression analysis result in S804. The change degree of the logarithmic value is discriminated as one of “increase tendency”, “not change tendency”, and “decrease tendency” (S805).
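Steps S804 and S805 can be sketched as a least-squares line fit over the frames of one syllable followed by a three-way decision on the slope. The slope `threshold` and the label strings below are assumptions; this part of the text does not state how the regression result is mapped to the three tendencies.

```python
from typing import Sequence


def regression_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the values against frame index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0  # a single frame carries no trend
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


def classify_trend(values: Sequence[float], threshold: float = 0.01) -> str:
    """Label one syllable; `threshold` is a hypothetical slope dead band."""
    slope = regression_slope(values)
    if slope > threshold:
        return "increase"
    if slope < -threshold:
        return "decrease"
    return "flat"


# Hypothetical per-frame log-F0 values inside one syllable.
print(classify_trend([5.00, 5.08, 5.21, 5.30]))
```

The same two functions would apply unchanged to the per-frame logarithmic RMS power in S904 and S905.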

更に、図３のＳ３０７に関する図４及び図５による詳細説明で前述したように、ＣＰＵ６０１は、Ｓ８０３で決定した音節毎に、Ｓ８０１及びＳ８０２で算出した、その音節に含まれる各フレームに対応する基本周波数の対数値から代表値を算出する（Ｓ８０６）。代表値の算出方法としては、中央値の選択に方法もしくは刈込み平均の計算による方法などを採用できる。 Further, as described above in detail with reference to FIG. 4 and FIG. 5 regarding S307 in FIG. 3, the CPU 601 calculates the basic corresponding to each frame included in the syllable calculated in S801 and S802 for each syllable determined in S803. A representative value is calculated from the logarithmic value of the frequency (S806). As a method for calculating the representative value, a method for selecting a median value or a method by calculating a trimmed average can be employed.
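The two representative-value options named above, the median and the trimmed mean, can be sketched as follows. The trim fraction is an assumed parameter; the text does not say how much is trimmed.

```python
import statistics
from typing import Sequence


def trimmed_mean(values: Sequence[float], trim_fraction: float = 0.25) -> float:
    """Mean after dropping the lowest and highest `trim_fraction` of the values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical per-frame log-F0 values with one outlier frame.
frames = [5.1, 5.2, 5.3, 5.4, 9.0]
print(statistics.median(frames), trimmed_mean(frames))
```

Both choices limit the influence of outlier frames, such as pitch-tracking errors, on the per-syllable value.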

図９は、図７のＳ７０４のＲＭＳパワーの取得処理の詳細例を示すフローチャートである。 FIG. 9 is a flowchart showing a detailed example of the RMS power acquisition process in S704 of FIG.

図３のＳ３０７に関する図４及び図５による詳細説明で前述したように、ＣＰＵ６０１はまず、前述した数１式に基づいて、フレーム毎にＲＭＳパワーを算出し（Ｓ９０１）、更にＲＭＳパワーの対数値を算出する（Ｓ９０２）。 As described above in detail with reference to FIGS. 4 and 5 relating to S307 in FIG. 3, the CPU 601 first calculates the RMS power for each frame based on the above-described equation (1) (S901), and further calculates the logarithmic value of the RMS power. Is calculated (S902).
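Equation 1 is not reproduced in this part of the text. The sketch below assumes the standard definition of RMS power, the square root of the mean squared sample value in a frame. The non-overlapping framing, the natural logarithm, and the floor `eps` for silent frames are further assumptions.

```python
import math
from typing import List, Sequence


def frame_log_rms(samples: Sequence[float], frame_len: int, eps: float = 1e-10) -> List[float]:
    """Log RMS power of each complete, non-overlapping frame."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    result = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        # Floor the value so that a silent frame does not produce log(0).
        result.append(math.log(max(rms, eps)))
    return result


print(frame_log_rms([1.0, -1.0, 1.0, -1.0], 2))
```

Any trailing samples that do not fill a whole frame are ignored in this sketch.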

次に、ＣＰＵ６０１は、図８のＳ８０３と同様にして、評価対象音声の音節情報を決定する（Ｓ９０３）。 Next, the CPU 601 determines the syllable information of the evaluation target speech in the same manner as S803 in FIG. 8 (S903).

次に、図３のＳ３０７に関する図４及び図５による詳細説明で前述したように、ＣＰＵ６０１は、Ｓ９０３で決定した音節毎に、Ｓ９０１及びＳ９０２で算出した、その音節に含まれる各フレームのＲＭＳパワーの対数値に対して回帰分析を実行する（Ｓ９０４）。 Next, as described above in detail with reference to FIGS. 4 and 5 regarding S307 in FIG. 3, the CPU 601 calculates, for each syllable determined in S903, the RMS power of each frame included in the syllable calculated in S901 and S902. Regression analysis is performed on the logarithmic value of (S904).

続いて、図３のＳ３０７に関する図４及び図５による詳細説明で前述したように、ＣＰＵ６０１は、Ｓ９０３で決定した音節毎に、Ｓ９０４での回帰分析結果に基づいて、その音節でのＲＭＳパワーの対数値の変化度合いを、「増加傾向」「変化しない傾向」「減少傾向」の何れかに判別する（Ｓ９０５）。 Subsequently, as described above in detail with reference to FIGS. 4 and 5 relating to S307 in FIG. 3, the CPU 601 determines the RMS power of the syllable for each syllable determined in S903 based on the regression analysis result in S904. The degree of change of the logarithmic value is determined as one of “increase tendency”, “not change tendency”, and “decrease tendency” (S905).

更に、図３のＳ３０７に関する図４及び図５による詳細説明で前述したように、ＣＰＵ６０１は、Ｓ９０３で決定した音節毎に、Ｓ９０１及びＳ９０２で算出した、その音節に含まれる各フレームに対応するＲＭＳパワーの対数値から代表値を算出する（Ｓ９０６）。代表値の算出方法は、図８のＳ８０６の場合と同様である。 Further, as described above in detail with reference to FIGS. 4 and 5 regarding S307 in FIG. 3, the CPU 601 calculates, for each syllable determined in S903, the RMS corresponding to each frame included in the syllable calculated in S901 and S902. A representative value is calculated from the logarithmic value of power (S906). The method for calculating the representative value is the same as in S806 of FIG.

図１０は、図７のＳ７０５の継続時間長の取得処理の詳細例を示すフローチャートである。 FIG. 10 is a flowchart showing a detailed example of the duration time length acquisition process in S705 of FIG.

図３のＳ３０７に関する図４及び図５による詳細説明で前述したように、ＣＰＵ６０１はまず、評価対象音声について、図７のＳ７０２で得られた音素ラベリング情報に付与されている音素毎の継続時間長を取得する（Ｓ１００１）。 As described above in detail with reference to FIGS. 4 and 5 relating to S307 in FIG. 3, the CPU 601 first determines the duration of each phoneme given to the phoneme labeling information obtained in S702 of FIG. Is acquired (S1001).

次に、ＣＰＵ６０１は、図８のＳ８０３と同様にして、評価対象音声の音節情報を決定する（Ｓ１００２）。 Next, the CPU 601 determines the syllable information of the evaluation target voice in the same manner as S803 in FIG. 8 (S1002).

続いて、ＣＰＵ６０１は、Ｓ１００２で決定した音節毎に、その音節に含まれる各音素の継続時間長の総和を、その音節の継続時間長として算出する（Ｓ１００３）。 Subsequently, for each syllable determined in S1002, the CPU 601 calculates the sum of the durations of the phonemes included in the syllable as the duration of the syllable (S1003).

更に、ＣＰＵ６０１は、Ｓ１００３で算出した音節毎の継続時間長を、図７のＳ７０１で入力した評価対象音声の全体の時間長で除算することにより、音節毎の継続時間長を正規化する（Ｓ１００４）。 Further, the CPU 601 normalizes the duration time for each syllable by dividing the duration time for each syllable calculated in S1003 by the overall time length of the evaluation target speech input in S701 in FIG. 7 (S1004). ).
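Steps S1003 and S1004 amount to summing phoneme durations per syllable and dividing by the utterance length. In this sketch the syllable boundaries are given as a count of phonemes per syllable. That representation, the function names, and the sample numbers are assumptions for illustration.

```python
from typing import List, Sequence


def syllable_durations(phoneme_durations: Sequence[float], phonemes_per_syllable: Sequence[int]) -> List[float]:
    """Sum consecutive phoneme durations into one duration per syllable."""
    if sum(phonemes_per_syllable) != len(phoneme_durations):
        raise ValueError("syllable sizes must cover every phoneme exactly once")
    durations, pos = [], 0
    for count in phonemes_per_syllable:
        durations.append(sum(phoneme_durations[pos:pos + count]))
        pos += count
    return durations


def normalize_durations(durations: Sequence[float], utterance_length: float) -> List[float]:
    """Divide each syllable duration by the total utterance length."""
    if utterance_length <= 0:
        raise ValueError("utterance_length must be positive")
    return [d / utterance_length for d in durations]


# Hypothetical phoneme durations in seconds, grouped into two syllables.
per_syllable = syllable_durations([0.10, 0.20, 0.10, 0.10], [2, 2])
print(normalize_durations(per_syllable, 0.5))
```

Normalizing by the utterance length makes the durations comparable between a slow learner and a faster model speaker.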

FIG. 11 is a flowchart showing a detailed example of the prosodic feature evaluation process in S706 of FIG. 7.

図３のＳ３１０に関する図５による詳細説明で前述したように、ＣＰＵ６０１はまず、図７のＳ７０３で得られた評価対象音声に対応する図５の（ａ−１）の音節毎の基本周波数の対数値の変化度合いに対応して、右肩上がりの矢印形状、水平の矢印形状、又は右肩下がりの矢印形状の何れかの変化形状を、図５（ｃ−１）に示されるように表示装置１０８に表示させる。また、ＣＰＵ６０１は、外部記憶装置６０６に記憶されている模範音声データ１１２を参照することにより、評価対象音声に関するこの表示に対比させて、模範音声に対応する音節毎の基本周波数の対数値に関し、図５の（ａ−２）の音節毎に、右肩上がりの矢印形状、水平の矢印形状、又は右肩下がりの矢印形状の何れかを、図５（ｃ−２）に示されるように表示装置１０８に表示させる。この場合に、ＣＰＵ６０１は、図５（ｃ−１）の評価対象音声と図５（ｃ−２）の模範音声とで基本周波数の対数値の変化形状が異なる部分については、表示装置１０８に強調表示をさせる（以上、Ｓ１１０１）。 As described above in detail with reference to FIG. 5 regarding S310 in FIG. 3, the CPU 601 first sets the basic frequency pair for each syllable in FIG. 5A-1 corresponding to the evaluation target speech obtained in S703 in FIG. Corresponding to the change degree of the numerical value, any one of the changing shape of the rising arrow shape, the horizontal arrow shape, or the falling arrow shape is displayed as shown in FIG. 108 is displayed. Further, the CPU 601 refers to the model voice data 112 stored in the external storage device 606, and in contrast to this display regarding the evaluation target voice, the CPU 601 relates to the logarithmic value of the fundamental frequency for each syllable corresponding to the model voice. For each syllable of (a-2) in FIG. 5, any one of the rising arrow shape, the horizontal arrow shape, or the lowering arrow shape is displayed as shown in FIG. 5 (c-2). It is displayed on the device 108. In this case, the CPU 601 emphasizes on the display device 108 a portion where the change shape of the logarithmic value of the fundamental frequency differs between the evaluation target voice in FIG. 5C-1 and the model voice in FIG. 5C-2. Display is performed (S1101).

次に図３のＳ３１０に関する図５による詳細説明で前述したように、ＣＰＵ６０１は、図７のＳ７０４で得られた評価対象音声に対応する図５の（ａ−１）の音節毎のＲＭＳパワーの対数値の変化度合いに対応して、右肩上がりの矢印形状、水平の矢印形状、又は右肩下がりの矢印形状の何れかの変化形状を、図５（ｂ−１）に示されるように表示装置１０８に表示させる。また、ＣＰＵ６０１は、外部記憶装置６０６に記憶されている模範音声データ１１２を参照することにより、評価対象音声に関するこの表示に対比させて、模範音声に対応する音節毎のＲＭＳパワーの対数値に関し、図５の（ａ−２）の音節毎に、右肩上がりの矢印形状、水平の矢印形状、又は右肩下がりの矢印形状の何れかを、図５（ｂ−２）に示されるように表示装置１０８に表示させる。この場合に、ＣＰＵ６０１は、図５（ｂ−１）の評価対象音声と図５（ｂ−２）の模範音声とでＲＭＳパワーの対数値の変化形状が異なる部分については、表示装置１０８に強調表示をさせる（以上、Ｓ１１０２）。 Next, as described above in detail with reference to FIG. 5 regarding S310 in FIG. 3, the CPU 601 determines the RMS power for each syllable in FIG. 5A corresponding to the evaluation target speech obtained in S704 in FIG. Corresponding to the degree of change of the logarithmic value, any one of the rising shape of the arrow that rises to the right, the shape of the horizontal arrow, or the shape of the arrow that drops to the right is displayed as shown in FIG. It is displayed on the device 108. Further, the CPU 601 refers to the model voice data 112 stored in the external storage device 606, and in contrast to this display regarding the evaluation target voice, the CPU 601 relates to the logarithmic value of the RMS power for each syllable corresponding to the model voice. For each syllable of (a-2) in FIG. 5, any one of the rising arrow shape, the horizontal arrow shape, or the lowering arrow shape is displayed as shown in FIG. 5 (b-2). It is displayed on the device 108. In this case, the CPU 601 emphasizes on the display device 108 a portion where the change shape of the logarithmic value of the RMS power is different between the evaluation target voice in FIG. 5B-1 and the exemplary voice in FIG. 5B-2. Display is made (S1102).

次に図３のＳ３１０に関する図５による詳細説明で前述したように、ＣＰＵ６０１は、図７のＳ７０３で得られた評価対象音声に対応する音節毎の基本周波数の対数値の代表値を、図５の（ａ−１）の音節毎に、図５（ｆ−１）に示されるように、四角のマークでプロットをし、かつ複数の音節にわたってプロット間を結ぶ折れ線グラフを、表示装置１０８に表示させる。また、ＣＰＵ６０１は、外部記憶装置６０６に記憶されている模範音声データ１１２を参照することにより、評価対象音声に関するこの表示に対比させて、模範音声に対応する音節毎の基本周波数の対数値の代表値を、図５（ａ−２）の音節毎に、図５（ｆ−２）に示されるように、四角のマークでプロットをし、かつ複数の音節にわたってプロット間を結ぶ折れ線グラフを、表示装置１０８に表示させる。この場合に、ＣＰＵ６０１は、図５（ｆ−１）の評価対象音声と図５（ｆ−２）の模範音声とで基本周波数の対数値の代表値に関して、音節間の変化が異なる部分については、表示装置１０８に強調表示をさせる（以上、Ｓ１１０３）。 Next, as described above in detail with reference to FIG. 5 regarding S310 in FIG. 3, the CPU 601 sets the representative value of the logarithmic value of the fundamental frequency for each syllable corresponding to the evaluation target speech obtained in S703 in FIG. For each syllable of (a-1), as shown in FIG. 5 (f-1), a plot is made with square marks, and a line graph connecting the plots across a plurality of syllables is displayed on the display device 108. Let In addition, the CPU 601 refers to the model voice data 112 stored in the external storage device 606, and contrasts with this display related to the evaluation target voice to represent the logarithmic value of the fundamental frequency for each syllable corresponding to the model voice. A value is plotted for each syllable in FIG. 5 (a-2) with a square mark as shown in FIG. 5 (f-2), and a line graph connecting the plots across a plurality of syllables is displayed. It is displayed on the device 108. In this case, the CPU 601 determines a portion in which the change between syllables is different with respect to the representative value of the logarithmic value of the fundamental frequency between the evaluation target voice in FIG. 5F-1 and the model voice in FIG. 5F-2. Then, the display device 108 is highlighted (S1103).

次に図３のＳ３１０に関する図５による詳細説明で前述したように、ＣＰＵ６０１は、図７のＳ７０４で得られた評価対象音声に対応する音節毎のＲＭＳパワーの対数値の代表値を、図５の（ａ−１）の音節毎に、図５（ｅ−１）に示されるように、四角のマークでプロットをし、かつ複数の音節にわたってプロット間を結ぶ折れ線グラフを、表示装置１０８に表示させる。また、ＣＰＵ６０１は、外部記憶装置６０６に記憶されている模範音声データ１１２を参照することにより、評価対象音声に関するこの表示に対比させて、模範音声に対応する音節毎のＲＭＳパワーの対数値の代表値を、図５（ａ−２）の音節毎に、図５（ｅ−２）に示されるように、四角のマークでプロットをし、かつ複数の音節にわたってプロット間を結ぶ折れ線グラフを、表示装置１０８に表示させる。この場合に、ＣＰＵ６０１は、図５（ｅ−１）の評価対象音声と図５（ｅ−２）の模範音声とでＲＭＳパワーの対数値の代表値に関して、音節間の変化が異なる部分については、表示装置１０８に強調表示をさせる（以上、Ｓ１１０４）。 Next, as described above in detail with reference to FIG. 5 regarding S310 in FIG. 3, the CPU 601 sets the representative value of the logarithmic value of the RMS power for each syllable corresponding to the evaluation target speech obtained in S704 in FIG. For each syllable of (a-1) of FIG. 5, as shown in FIG. 5 (e-1), a line graph connecting the plots across a plurality of syllables is displayed on the display device 108. Let Further, the CPU 601 refers to the model voice data 112 stored in the external storage device 606, and contrasts with this display related to the evaluation target voice, and represents the logarithmic value of the RMS power for each syllable corresponding to the model voice. A value is plotted for each syllable of FIG. 5 (a-2) with a square mark as shown in FIG. 5 (e-2), and a line graph connecting the plots across a plurality of syllables is displayed. It is displayed on the device 108. In this case, the CPU 601 determines a portion in which the change between syllables is different regarding the representative value of the logarithmic value of the RMS power between the evaluation target voice in FIG. 5 (e-1) and the exemplary voice in FIG. 5 (e-2). Then, the display device 108 is highlighted (S1104).

続いて図３のＳ３１０に関する図５による詳細説明で前述したように、ＣＰＵ６０１は、図７のＳ７０５で得られた評価対象音声に対応する音節毎の正規化された継続時間長を、図５（ａ−１）の音節毎に、図５（ｄ−１）に示されるように、表示装置１０８に数値表示させる。また、ＣＰＵ６０１は、外部記憶装置６０６に記憶されている模範音声データ１１２を参照することにより、評価対象音声に関するこの表示に対比させて、模範音声に対応する音節毎の正規化された継続時間長を、図５（ａ−２）の音節毎に、図５（ｄ−２）に示されるように、表示装置１０８に数値表示させる。この場合、図５（ｄ−１）の評価対象音声と図５（ｄ−２）の模範音声とで継続時間長が大きく異なる部分については、表示装置１０８に強調表示をさせる。 Subsequently, as described above in detail with reference to FIG. 5 regarding S310 in FIG. 3, the CPU 601 determines the normalized duration length for each syllable corresponding to the evaluation target speech obtained in S705 in FIG. For each syllable of (a-1), a numerical value is displayed on the display device 108 as shown in FIG. 5 (d-1). Further, the CPU 601 refers to the model voice data 112 stored in the external storage device 606, and contrasts this display regarding the evaluation target voice with respect to the normalized duration time for each syllable corresponding to the model voice. For each syllable of FIG. 5 (a-2), as shown in FIG. 5 (d-2), numerical values are displayed on the display device. In this case, the display device 108 is highlighted on a portion where the duration is significantly different between the evaluation target voice in FIG. 5 (d-1) and the model voice in FIG. 5 (d-2).

その後、ＣＰＵ６０１は、評価対象音声に音節が複数存在するか否かを判定する（Ｓ１１０６）。 Thereafter, the CPU 601 determines whether or not there are a plurality of syllables in the evaluation target voice (S1106).

Ｓ１１０６の判定がＮＯならば、ＣＰＵ６０１は、Ｓ１１０９の処理に移行し、前述した＜韻律特徴の変化形状の比較評価１＞の手法により、評価対象音声と模範音声とで、音節毎に、その音節に含まれる各フレームの基本周波数の対数値やＲＭＳパワーの対数値や継続時間長の変化形状の一致率を算出し、その算出結果を、前述した＜一致率や相関係数の表示＞の手法により、図１の表示装置１０８に表示する。 If the determination in S1106 is NO, the CPU 601 proceeds to the processing in S1109, and for each syllable of the evaluation target speech and the exemplary speech, using the method of <Comparison evaluation 1 of prosody feature change shape> described above. The logarithmic value of the fundamental frequency of each frame, the logarithmic value of RMS power, and the coincidence rate of the change shape of the duration length are calculated, and the calculation result is the method of <displaying the coincidence rate and correlation coefficient> described above. Is displayed on the display device 108 of FIG.

If the determination in S1106 is YES, the CPU 601 calculates the amount of change in the representative pitch frequency between syllables, based on the per-syllable representative values of the logarithmic pitch frequency of the evaluation target speech obtained in S703 of FIG. 7. Similarly, the CPU 601 calculates the amount of change in the representative RMS power between syllables, based on the per-syllable representative values of the logarithmic RMS power of the evaluation target speech obtained in S704 of FIG. 7. Likewise, the CPU 601 calculates the amount of change in duration between syllables, based on the per-syllable durations of the evaluation target speech obtained in S705 of FIG. 7 (the above is S1107).

Subsequently, based on the amounts of change in the representative pitch frequency between syllables calculated in S1107, the CPU 601 identifies the change shape connecting the representative pitch frequency values of the syllables. This corresponds to the change shape of the line graph in FIG. 5 (f-1). Similarly, based on the amounts of change in the representative RMS power between syllables calculated in S1107, the CPU 601 identifies the change shape connecting the representative RMS power values of the syllables. This corresponds to the change shape of the line graph in FIG. 5 (e-1). Likewise, based on the amounts of change in duration between syllables calculated in S1107, the CPU 601 identifies the change shape connecting the durations of the syllables (the above is S1108).

Thereafter, the CPU 601 proceeds to S1109. Using the method of <Comparison evaluation 2 of the change shape of prosodic features> described above, it calculates, for each syllable, the match rate between the evaluation target speech and the model speech for the change shapes of the logarithm of the fundamental frequency of each frame included in the syllable, the logarithm of the RMS power, and the duration. It then displays the calculation result on the display device 108 of FIG. 1 using the method of <Display of match rate and correlation coefficient> described above.
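A match rate and a correlation coefficient of the kind displayed in S1109 could be computed as in the following sketch. The exact definitions belong to the earlier sections of the description, so two assumptions are made here: the match rate is the fraction of positions whose shape labels agree, and the correlation is the Pearson coefficient. All names are hypothetical.

```python
import math

def match_rate(user_shapes, model_shapes):
    """Fraction of aligned positions whose shape labels are identical."""
    if len(user_shapes) != len(model_shapes):
        raise ValueError("sequences must be aligned to the same length")
    if not user_shapes:
        return 0.0
    hits = sum(u == m for u, m in zip(user_shapes, model_shapes))
    return hits / len(user_shapes)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical shape labels for user speech and model speech.
rate = match_rate(["rise", "fall", "flat", "rise"],
                  ["rise", "fall", "rise", "rise"])
corr = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

A percentage for display would simply be `rate * 100`.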

As described above, the present embodiment makes it possible to compare, for prosodic information, both the degree of change within each syllable and the degree of change across a plurality of consecutive syllables.
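For the within-syllable degree of change, the appendices mention regression analysis and arrow-shaped display. The sketch below fits a least-squares line to the frame values of one syllable and maps the slope to one of three tendencies. The flat threshold, the label names, and the arrow characters are assumptions chosen for illustration.

```python
def regression_slope(values):
    """Least-squares slope of values against their frame index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def syllable_trend(values, flat_threshold):
    """Classify a syllable as 'rise', 'flat', or 'fall' from its slope.

    flat_threshold is a hypothetical tolerance around zero slope.
    """
    slope = regression_slope(values)
    if slope > flat_threshold:
        return "rise"
    if slope < -flat_threshold:
        return "fall"
    return "flat"

# Hypothetical arrow glyphs for the three tendencies.
ARROWS = {"rise": "↗", "flat": "→", "fall": "↘"}

trend = syllable_trend([3.0, 2.0, 1.0], flat_threshold=0.1)
arrow = ARROWS[trend]
```

Comparing the `trend` labels of corresponding syllables in the user speech and the model speech gives the per-syllable comparison that the embodiment presents.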

Regarding the above embodiment, the following additional notes are disclosed.
(Appendix 1)
A speech evaluation apparatus comprising a prosody evaluation unit that executes:
a prosodic feature acquisition process of acquiring the degree of change of a prosodic feature for each syllable included in input evaluation target speech; and
a prosodic feature comparison process of comparing the acquired degree of change of the prosodic feature for each syllable with the degree of change of the prosodic feature for each syllable included in model speech corresponding to the evaluation target speech, and presenting the result of the comparison.
(Appendix 2)
The speech evaluation apparatus according to Appendix 1, wherein the degree of change indicates one of an increasing tendency, a tendency not to change, and a decreasing tendency of the prosodic feature in the syllable.
(Appendix 3)
The speech evaluation apparatus according to Appendix 2, wherein, in the prosodic feature acquisition process, the prosody evaluation unit acquires the degree of change by performing regression analysis on the prosodic feature values included in each syllable.
(Appendix 4)
The speech evaluation apparatus according to any one of Appendices 1 to 3, wherein, in the prosodic feature comparison process, the prosody evaluation unit causes a display device to display the degree of change of the prosodic feature for each syllable included in the evaluation target speech in contrast with the degree of change of the prosodic feature for each syllable included in the model speech.
(Appendix 5)
The speech evaluation apparatus according to Appendix 4, wherein, in the prosodic feature evaluation process, the prosody evaluation unit causes the display device to display an upward-sloping arrow, a horizontal arrow, or a downward-sloping arrow according to whether the degree of change is an increasing tendency, a tendency not to change, or a decreasing tendency.
(Appendix 6)
The speech evaluation apparatus according to any one of Appendices 1 to 3, wherein, in the prosodic feature comparison process, the prosody evaluation unit associates each syllable included in the evaluation target speech with a syllable included in the model speech, acquires the match rate of the prosodic features over all corresponding pairs of syllables of the evaluation target speech and the model speech, and causes the display device to display the match rate as a numerical value.
(Appendix 7)
The speech evaluation apparatus according to any one of Appendices 1 to 6, wherein the prosody evaluation unit:
in the prosodic feature acquisition process, further acquires a representative value of the prosodic feature for each syllable of the evaluation target speech; and
in the prosodic feature comparison process, further compares the degree of change, across a plurality of the syllables, of the acquired per-syllable representative values of the prosodic feature with the degree of change, across the plurality of syllables, of the per-syllable representative values of the prosodic feature included in the model speech, and presents the result of the comparison to the user.
(Appendix 8)
The speech evaluation apparatus according to Appendix 7, wherein, in the prosodic feature comparison process, the prosody evaluation unit causes a display device to display a line graph plotting, across the plurality of syllables, the representative values of the prosodic feature acquired for each syllable included in the evaluation target speech, in contrast with a line graph plotting, across the plurality of syllables, the representative values of the prosodic feature for each syllable included in the model speech.
(Appendix 9)
The speech evaluation apparatus according to Appendix 7, wherein, in the prosodic feature comparison process, the prosody evaluation unit associates each syllable included in the evaluation target speech with a syllable included in the model speech, acquires the match rate of the representative values of the prosodic features over all corresponding pairs of syllables of the evaluation target speech and the model speech, and causes the display device to display the match rate as a numerical value.
(Appendix 10)
The language learning device according to any one of Appendices 1 to 9, wherein, in the prosodic feature acquisition process, the prosody evaluation unit divides the evaluation target speech into frames each having a predetermined time length, acquires the fundamental frequency of the evaluation target speech for each frame, and, for each syllable, uses the fundamental frequencies acquired for the frames included in that syllable as the prosodic feature of the syllable.
(Appendix 11)
The speech evaluation apparatus according to any one of Appendices 1 to 9, wherein, in the prosodic feature acquisition process, the prosody evaluation unit divides the evaluation target speech into frames each having a predetermined time length, acquires the power of the evaluation target speech for each frame, and, for each syllable, uses the power values acquired for the frames included in that syllable as the prosodic feature of the syllable.
(Appendix 12)
The speech evaluation apparatus according to any one of Appendices 1 to 11, wherein the prosody evaluation unit:
in the prosodic feature acquisition process, further acquires a duration for each phoneme included in the evaluation target speech and, for each syllable included in the evaluation target speech, acquires the sum of the durations of the phonemes included in the syllable as the duration of that syllable; and
in the prosodic feature comparison process, further compares the acquired duration of each syllable with the duration of each syllable included in the model speech, and presents the result of the comparison.
(Appendix 13)
The speech evaluation apparatus according to any one of Appendices 4, 5, 8, and 12, wherein, in the prosodic feature comparison process, when, for a syllable, there is a portion in which the prosodic feature of the evaluation target speech differs from the corresponding prosodic feature of the model speech, the prosody evaluation unit highlights the differing portion on the display device.
(Appendix 14)
A language learning method in which a speech evaluation apparatus:
acquires the degree of change of a prosodic feature value for each syllable included in input evaluation target speech; and
compares the acquired degree of change of the prosodic feature value for each syllable with the degree of change of the prosodic feature value for each syllable included in model speech, and presents the result of the comparison to a user.
(Appendix 15)
A program that causes a computer used as a speech evaluation apparatus to execute:
a step of acquiring the degree of change of a prosodic feature for each syllable included in input evaluation target speech; and
a step of comparing the degree of change of the prosodic feature for each syllable acquired for the evaluation target speech with the degree of change of the prosodic feature value for each syllable included in model speech, and presenting the result of the comparison.
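
The frame-wise power acquisition and the syllable duration summation described in Appendices 11 and 12 can be sketched as follows. This is an illustrative sketch only: non-overlapping frames, the small floor applied before taking the logarithm, and all function names are assumptions and are not taken from the embodiment.

```python
import math

def frame_log_rms(samples, frame_len):
    """Split samples into non-overlapping frames of frame_len samples and
    return the natural log of the RMS power of each complete frame."""
    result = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        # Hypothetical floor to avoid log(0) on silent frames.
        result.append(math.log(max(rms, 1e-10)))
    return result

def syllable_durations(phoneme_durations_per_syllable):
    """Duration of each syllable as the sum of its phoneme durations."""
    return [sum(phonemes) for phonemes in phoneme_durations_per_syllable]

# Hypothetical waveform samples and phoneme durations (seconds).
log_power = frame_log_rms([1.0, -1.0, 1.0, -1.0, 0.5, 0.5, 0.5, 0.5], 4)
durations = syllable_durations([[0.05, 0.1], [0.2]])
```

The frames that fall inside a syllable's time span would then supply that syllable's prosodic feature values.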

DESCRIPTION OF SYMBOLS
100 Language learning apparatus
101 Prosody evaluation unit
102 Input device
103 Output device
104 Phoneme labeling module
105 Prosodic feature acquisition module
106 Prosodic feature evaluation module
107 Speech input device
108 Display device
109 Speech output device
110 Acoustic model
111 Pronunciation database
112 Model speech data
300 Application front end
601 CPU
602 ROM
603 RAM
604 Input device
605 Output device
606 External storage device
607 Portable recording medium drive device
608 Communication interface
609 Bus
610 Portable recording medium

Claims

A speech evaluation apparatus comprising a prosody evaluation unit that executes:
a prosodic feature acquisition process of acquiring the degree of change of a prosodic feature for each syllable included in input evaluation target speech; and
a prosodic feature comparison process of comparing the acquired degree of change of the prosodic feature for each syllable with the degree of change of the prosodic feature for each syllable included in model speech corresponding to the evaluation target speech, and presenting the result of the comparison.

The speech evaluation apparatus according to claim 1, wherein the degree of change indicates one of an increasing tendency, a tendency not to change, and a decreasing tendency of the prosodic feature in the syllable.

The speech evaluation apparatus according to claim 2, wherein, in the prosodic feature acquisition process, the prosody evaluation unit acquires the degree of change by performing regression analysis on the prosodic feature values included in each syllable.

The speech evaluation apparatus according to claim 1, wherein, in the prosodic feature comparison process, the prosody evaluation unit causes a display device to display the degree of change of the prosodic feature for each syllable included in the evaluation target speech in contrast with the degree of change of the prosodic feature for each syllable included in the model speech.

The speech evaluation apparatus according to claim 4, wherein, in the prosodic feature evaluation process, the prosody evaluation unit causes the display device to display an upward-sloping arrow, a horizontal arrow, or a downward-sloping arrow according to whether the degree of change is an increasing tendency, a tendency not to change, or a decreasing tendency.

The speech evaluation apparatus according to any one of claims 1 to 3, wherein, in the prosodic feature comparison process, the prosody evaluation unit associates each syllable included in the evaluation target speech with a syllable included in the model speech, acquires the match rate of the prosodic features over all corresponding pairs of syllables of the evaluation target speech and the model speech, and causes the display device to display the match rate as a numerical value.

The speech evaluation apparatus according to claim 1, wherein the prosody evaluation unit:
in the prosodic feature acquisition process, further acquires a representative value of the prosodic feature for each syllable of the evaluation target speech; and
in the prosodic feature comparison process, further compares the degree of change, across a plurality of the syllables, of the acquired per-syllable representative values of the prosodic feature with the degree of change, across the plurality of syllables, of the per-syllable representative values of the prosodic feature included in the model speech, and presents the result of the comparison to the user.

The speech evaluation apparatus according to claim 7, wherein, in the prosodic feature comparison process, the prosody evaluation unit causes a display device to display a line graph plotting, across the plurality of syllables, the representative values of the prosodic feature acquired for each syllable included in the evaluation target speech, in contrast with a line graph plotting, across the plurality of syllables, the representative values of the prosodic feature for each syllable included in the model speech.

The speech evaluation apparatus according to claim 7, wherein, in the prosodic feature comparison process, the prosody evaluation unit associates each syllable included in the evaluation target speech with a syllable included in the model speech, acquires the match rate of the representative values of the prosodic features over all corresponding pairs of syllables of the evaluation target speech and the model speech, and causes the display device to display the match rate as a numerical value.

The language learning device according to claim 1, wherein, in the prosodic feature acquisition process, the prosody evaluation unit divides the evaluation target speech into frames each having a predetermined time length, acquires the fundamental frequency of the evaluation target speech for each frame, and, for each syllable, uses the fundamental frequencies acquired for the frames included in that syllable as the prosodic feature of the syllable.

The speech evaluation apparatus according to any one of claims 1 to 9, wherein, in the prosodic feature acquisition process, the prosody evaluation unit divides the evaluation target speech into frames each having a predetermined time length, acquires the power of the evaluation target speech for each frame, and, for each syllable, uses the power values acquired for the frames included in that syllable as the prosodic feature of the syllable.

The speech evaluation apparatus according to claim 1, wherein the prosody evaluation unit:
in the prosodic feature acquisition process, further acquires a duration for each phoneme included in the evaluation target speech and, for each syllable included in the evaluation target speech, acquires the sum of the durations of the phonemes included in the syllable as the duration of that syllable; and
in the prosodic feature comparison process, further compares the acquired duration of each syllable with the duration of each syllable included in the model speech, and presents the result of the comparison.

The speech evaluation apparatus according to claim 4, wherein, in the prosodic feature comparison process, when, for a syllable, there is a portion in which the prosodic feature of the evaluation target speech differs from the corresponding prosodic feature of the model speech, the prosody evaluation unit highlights the differing portion on the display device.

A language learning method in which a speech evaluation apparatus:
acquires the degree of change of a prosodic feature value for each syllable included in input evaluation target speech; and
compares the acquired degree of change of the prosodic feature value for each syllable with the degree of change of the prosodic feature value for each syllable included in model speech, and presents the result of the comparison to a user.

A program that causes a computer used as a speech evaluation apparatus to execute:
a step of acquiring the degree of change of a prosodic feature for each syllable included in input evaluation target speech; and
a step of comparing the degree of change of the prosodic feature for each syllable acquired for the evaluation target speech with the degree of change of the prosodic feature for each syllable included in model speech, and presenting the result of the comparison.