JP4797597B2

JP4797597B2 - Language learning device

Info

Publication number: JP4797597B2
Application number: JP2005339398A
Authority: JP
Inventors: あかね野口; 直博江本; 寿一佐藤; 達也入山; 隆一成山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2005-11-24
Filing date: 2005-11-24
Publication date: 2011-10-19
Anticipated expiration: 2025-11-24
Also published as: JP2007147783A

Description

The present invention relates to a technique for showing a learner the differences between a model sound and a sound produced by the learner.

Various systems that support language learning have been proposed. For example, Patent Document 1 discloses a pronunciation learning method that evaluates the quality of pronunciation by comparing an audio signal representing a learner's pronunciation with an audio signal representing a native speaker's pronunciation. With this method, the learner can obtain an objective numerical evaluation of his or her pronunciation ability.
Patent Document 1: JP 2002-40926 A

However, the technique described in Patent Document 1 only presents the learner with a numerical evaluation of the pronunciation. The learner therefore obtains an objective numerical evaluation of his or her pronunciation ability, but cannot tell specifically how his or her pronunciation differs from the model pronunciation. As a result, until the learner can produce a pronunciation that matches the model, the learner must persevere through trial and error, repeatedly adjusting the pronunciation and having it evaluated.

The present invention has been made against this background, and its object is to provide a technique that allows a learner, in language learning, to grasp the differences between the learner's voice and a model voice.

To solve the above problem, the present invention provides a language learning device comprising:
voice input means for outputting input speech as utterance voice data;
first pitch curve generating means for generating a first pitch curve, which indicates the temporal change in pitch, from model voice data that represents the speech of a plurality of words and includes first word break information indicating the utterance time of the speech of each of the words;
second pitch curve generating means for generating a second pitch curve representing the temporal change in pitch of the utterance voice data;
frame specifying means for analyzing the model voice data and the utterance voice data in units of predetermined frames and specifying the corresponding frames of the two;
first pitch average value specifying means for specifying the average value of the first pitch curve as a first average value;
second pitch average value specifying means for specifying the average value of the second pitch curve as a second average value;
display means for displaying the first pitch curve and the second pitch curve with pitch height shown in the vertical direction and temporal change shown in the horizontal direction, such that, for each frame specified by the frame specifying means, the pitches of the corresponding frames of the model voice data and the utterance voice data are at the same horizontal position, and such that the vertical display position of the first average value in the first pitch curve coincides with the vertical display position of the second average value in the second pitch curve;
specifying means for specifying, as second word break information, the utterance time of the speech corresponding to each word in the utterance voice data, based on the first word break information and the frames specified by the frame specifying means;
word pitch specifying means for specifying one representative value for each section of the first pitch curve based on the first word break information, and one representative value for each section of the second pitch curve based on the second word break information;
figure display means for displaying a figure corresponding to each word in the model voice data at a position according to the pitch height of that word as specified by the word pitch specifying means, and for displaying a figure corresponding to each word in the utterance voice data at a position according to the pitch height of that word as specified by the word pitch specifying means;
difference display means for detecting a word for which the difference in pitch between the first pitch curve and the second pitch curve is equal to or greater than a predetermined value, and displaying information identifying that word;
operation means operated by an operator;
selection means for selecting, in response to operation of the operation means, one of the one or more pieces of information displayed by the difference display means;
storage means for storing the model voice data and the utterance voice data; and
output means for reading from the storage means, and outputting, the model voice data or the utterance voice data corresponding to the word indicated by the information selected by the selection means, wherein, when outputting the model voice data, the output means applies time-stretch processing to the model voice data so that its utterance time equals the utterance time of the utterance voice data.

In a preferred aspect of the present invention, the device further comprises detecting means for detecting portions where the first pitch curve and the second pitch curve differ, and the display means displays the differing portions detected by the detecting means in a distinguishable manner.

According to the present invention, a learner engaged in language learning can grasp the differences between the learner's voice and a model voice.

<A: Configuration>
FIG. 1 is a block diagram illustrating the hardware configuration of a language learning device 1 according to an embodiment of the invention. In the figure, reference numeral 11 denotes a control unit that includes an arithmetic unit such as a CPU (Central Processing Unit) and various memories such as a ROM (Read Only Memory) and a RAM (Random Access Memory). Reference numeral 12 denotes a storage unit composed of a large-capacity storage device such as a hard disk. The arithmetic unit of the control unit 11 controls each part of the language learning device 1 via a bus 13 by reading and executing computer programs stored in a memory such as the ROM or in the storage unit 12.

Reference numeral 14 denotes a display unit composed of, for example, a liquid crystal display. Under the control of the control unit 11, it displays character strings, various messages, menu screens for operating the language learning device 1, and the like. Reference numeral 15 denotes an input unit that includes input devices such as a keyboard and a mouse, and outputs to the control unit 11 a signal corresponding to the operation performed, such as a key press or a mouse operation. Reference numeral 16 denotes a microphone that picks up sound, and reference numeral 17 denotes an audio processing unit. The microphone 16 is connected to the audio processing unit 17, which converts the sound input from the microphone 16 (hereinafter referred to as the learner voice) into digital data (hereinafter referred to as learner data) and supplies it to the control unit 11. Reference numeral 18 denotes a speaker connected to the audio processing unit 17; it outputs sound corresponding to the signal output from the audio processing unit 17.

The storage unit 12 of the language learning device 1 stores a table TBL1. FIG. 2 shows an example of the structure of the table TBL1. As illustrated, the table stores the items “example sentence text data”, “model voice data”, and “word break information” in association with one another. The “example sentence text data” item stores text data representing an example sentence used for language learning. The “model voice data” item stores WAVE-format digital data (hereinafter referred to as model voice data) representing the voice of a native speaker reading the example sentence aloud (hereinafter referred to as the model voice). The “word break information” item stores information indicating the utterance time of each word in the model voice (hereinafter referred to as first word break information).
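
The structure of the table TBL1 described above can be pictured with a small sketch. This is only an illustration under assumptions: the class and field names, the example sentence, and the file path are hypothetical, and the model voice is referenced here by a file path even though the text says the WAVE data itself is stored in the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordBreak:
    word: str
    start_sec: float  # where the word starts in the model voice
    end_sec: float    # where the word ends in the model voice


@dataclass
class ExampleEntry:
    example_text: str             # "example sentence text data"
    model_voice_path: str         # stands in for the WAVE-format "model voice data"
    word_breaks: List[WordBreak]  # "word break information" (first word break information)


def find_entry(table: List[ExampleEntry], text: str) -> Optional[ExampleEntry]:
    """Return the entry whose example sentence matches text, or None."""
    for entry in table:
        if entry.example_text == text:
            return entry
    return None


# Hypothetical contents; the real sentences and timings are not given in the text.
TBL1 = [
    ExampleEntry(
        example_text="One two three",
        model_voice_path="model/one_two_three.wav",
        word_breaks=[
            WordBreak("One", 0.0, 0.4),
            WordBreak("two", 0.4, 0.8),
            WordBreak("three", 0.8, 1.4),
        ],
    ),
]
```

Keeping the three items in one record mirrors the way the table associates them, so selecting a sentence yields its model voice and word boundaries in one lookup.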

Next, the screen that the control unit 11 of the language learning device 1 displays on the display unit 14 will be described. FIG. 3 shows an example of the screen displayed on the display unit 14 of the language learning device 1. As illustrated, the screen is divided into a pitch display portion 100 in the upper part, a rhythm display portion 200 in the middle part, and an operation button display portion 300 in the lower part.

The pitch display portion 100 displays the temporal changes in pitch (tone height) of the model voice and the learner voice in association with each other. The pitch display portion 100 is further divided into a radio button display portion 110 and a pitch graph display portion 120.
The radio button display portion 110 displays a radio button 111 for switching the content shown in the pitch graph display portion 120. The radio button 111 allows either “graph display” or “word display” to be selected. When the learner performs an operation such as moving the mouse pointer to the area where the radio button 111 is displayed and left-clicking, the control unit 11 switches the content displayed in the pitch graph display portion 120.
When “graph display” is selected with the radio button 111, the pitch graph display portion 120 displays graphs (pitch curves) showing the temporal changes in pitch of the model voice and the learner voice, as shown in FIG. 3. Conversely, when “word display” is selected with the radio button 111, the pitch graph display portion 120 displays band-shaped figures showing the temporal changes in pitch of the model voice and the learner voice word by word, as shown in FIG. 4.
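
For the “word display” mode, each word needs a single pitch level that decides where its band is drawn. The text does not say how that level is derived, so the sketch below simply assumes the mean of the voiced frames inside each word; the function name and the frame-index representation of word boundaries are likewise hypothetical.

```python
def word_pitch_levels(pitch_curve, word_frames):
    """Return one representative pitch per word.

    pitch_curve: per-frame pitch in Hz, with None for frames without a pitch.
    word_frames: list of (word, first_frame, last_frame), bounds inclusive.
    The representative value is assumed to be the mean of the voiced frames.
    """
    levels = []
    for word, first, last in word_frames:
        voiced = [p for p in pitch_curve[first:last + 1] if p is not None]
        level = sum(voiced) / len(voiced) if voiced else None
        levels.append((word, level))
    return levels
```

A median or the pitch at the word's midpoint would serve equally well as a representative value; the mean is used here only because it is the simplest choice.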

Next, the rhythm display portion 200 displays the pronunciation rhythm (utterance time) of the model voice and the learner voice in association with each other. The rhythm display portion 200 is further divided into a radio button display portion 210 and a rhythm graph display portion 220.
The radio button display portion 210 displays a radio button 211 for switching the figures shown in the rhythm graph display portion 220. The radio button 211 allows one of “word comparison”, “same-scale comparison”, and “full-sentence comparison” to be selected.
When “word comparison” is selected with the radio button 211, the rhythm graph display portion 220 displays band-shaped figures for comparing, between the model voice and the learner voice, the ratio of each word's utterance time to the utterance time of the entire example sentence, as shown in FIG. 3.
When “same-scale comparison” is selected, the rhythm graph display portion 220 displays figures for comparing the absolute lengths of the pronunciation times of the words. Specifically, band-shaped figures indicating the utterance time of each word in the model voice and the learner voice are displayed, as shown in FIG. 5.
When “full-sentence comparison” is selected, band-shaped figures indicating the utterance time of the entire example sentence in the model voice and the learner voice are displayed, as shown in FIG. 4.

In this way, the content displayed in the pitch graph display portion 120 and the rhythm graph display portion 220 can be changed as needed with the radio buttons 111 and 211, and any combination of these display modes can be shown together.

Next, the operation button display portion 300 in FIG. 3 displays various buttons that let the learner perform operations such as recording and playback. In the operation button display portion 300 shown in FIG. 3, reference numeral 301 denotes a button for entering an instruction to start or stop voice input. Reference numeral 302 denotes a button for entering an instruction to play back the model voice. Reference numeral 303 denotes a radio button for selecting the playback speed of the model voice; one of “normal playback”, “slow playback”, and “match recording” can be selected. When “match recording” is selected, the model voice is played back after time-stretch processing that makes its utterance time equal to the utterance time of the recorded learner voice for the example sentence. When “slow playback” is selected, the model voice is played back after time-stretch processing at a predetermined ratio.
Reference numeral 304 denotes a button for entering an instruction to play back the recorded learner voice. Reference numeral 305 denotes a button for entering an instruction to score the learner voice, and reference numeral 306 denotes an evaluation result display portion that shows the evaluation result for the learner's pronunciation. On detecting that the button 305 has been clicked, the control unit 11 compares the model voice data with the learner data, calculates a score from their degree of coincidence using a predetermined algorithm, and displays the calculated score in the evaluation result display portion 306.
Reference numeral 307 denotes a button for entering an instruction to display the scoring details. On detecting that this button has been clicked, the control unit 11 displays the details of the scoring result. For example, scores for several items such as “pitch” and “rhythm” may be displayed item by item, or scores for each word in the example sentence may be displayed word by word.

Reference numeral 308 denotes a button for instructing audio output of the words for which the difference between the model voice and the learner voice is most pronounced. On detecting that this button has been clicked, the control unit 11 of the language learning device 1 plays back, from the model voice and from the learner voice, the word for which the difference was most pronounced in the pitch comparison. It likewise plays back, from each voice, the word for which the difference was most pronounced in the rhythm comparison.
This concludes the description of the screen displayed on the display unit 14.

<B: Operation>
Next, the operation of this embodiment will be described with reference to FIG. 6.
<B-1: Pitch graph display processing>
FIG. 6 is a flowchart showing the flow of processing performed by the control unit 11 of the language learning device 1. First, the learner operates the input unit 15 of the language learning device 1 to enter an instruction to display a list of example sentences. On detecting that this instruction has been entered, the control unit 11 of the language learning device 1 reads the example sentence text data stored in the table TBL1 (step SA1) and displays the list of example sentences represented by the read data on the display unit 14 (step SA2). When the learner then operates the input unit 15 to select one of the displayed example sentences (step SA3; YES), the control unit 11 identifies the selected example sentence based on the screen displayed on the display unit 14 and the signal sent from the input unit 15 (step SA4). Having identified the selected example sentence, the control unit 11 reads the model voice data stored in the table TBL1 in association with that sentence (step SA5).

Next, the control unit 11 extracts the pitch of the voice represented by the read model voice data and generates a pitch curve indicating the temporal change in pitch (step SA6). Specifically, the pitch curve is generated as follows. First, the control unit 11 divides the voice represented by the read model voice data at predetermined time intervals (for example, 100 msec) along its playback time axis (each resulting section is hereinafter referred to as a frame). Next, the control unit 11 extracts the pitch of the voice in each frame. Having extracted a pitch for every frame, it generates a pitch curve connecting the pitches obtained for the frames (hereinafter referred to as the first pitch curve) and stores curve data representing the generated first pitch curve in the storage unit 12.
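
The frame-by-frame pitch extraction described above can be sketched as follows. The text does not name a pitch extraction method, so this sketch substitutes a plain autocorrelation peak search; the function names, the 75 to 400 Hz search range, and the silence threshold are all assumptions made for illustration.

```python
import math


def split_frames(samples, sample_rate, frame_ms=100):
    """Cut the signal into consecutive, non-overlapping frames of frame_ms each."""
    n = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


def estimate_pitch(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the pitch of one frame in Hz from its autocorrelation peak.

    Returns None for a (near-)silent frame. The search range fmin..fmax is an
    assumed range for speech, not a value taken from the text.
    """
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]
    if sum(v * v for v in x) < 1e-9:
        return None
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(x) - 1)
    best_lag, best_val = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else None


def pitch_curve(samples, sample_rate, frame_ms=100):
    """Return one pitch value (Hz or None) per frame, in time order."""
    return [estimate_pitch(f, sample_rate) for f in split_frames(samples, sample_rate, frame_ms)]
```

Connecting the returned values in order gives the pitch curve. A production system would likely use a more robust estimator and overlapping frames, but the per-frame structure stays the same.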

The control unit 11 then controls the display unit 14 to display a message prompting the learner to pronounce the example sentence, for example, “Press a key, then speak; when you have finished speaking, press the key again” (step SA8).

Following the message displayed on the display unit 14, the learner operates the input unit 15 and reads the example sentence aloud. When the learner speaks, the microphone 16 converts the learner's voice into an audio signal and outputs the signal to the audio processing unit 17. The audio processing unit 17 converts the audio signal output from the microphone 16 into digital data, which becomes the learner data. The learner data is output from the audio processing unit 17 and stored in the storage unit 12.

Next, the control unit 11 monitors the signal sent from the input unit 15 and determines whether the learner has finished speaking. On detecting that the learner has finished speaking and operated the input unit 15 (step SA9; YES), the control unit 11 directly compares the waveforms of the model voice data and the learner data and, using for example DTW (Dynamic Time Warping), establishes a frame-by-frame temporal correspondence between the model voice data and the learner data. Based on the result of this correspondence and the first word break information, it specifies information indicating the utterance time of each word in the learner data (hereinafter referred to as second word break information) (step SA10). For example, the two waveforms of the model voice and the learner voice are associated with each other as shown in FIG. 7.
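
The alignment step and the derivation of the second word break information can be sketched as below. This is a textbook DTW under assumptions: each frame is reduced to a single number, the local cost is the absolute difference, and word boundaries are expressed as inclusive frame indices. The function names are hypothetical, and the text does not specify which features or cost the device uses.

```python
def dtw_path(model_feats, learner_feats):
    """Return the minimum-cost alignment as a list of (model_frame, learner_frame)."""
    n, m = len(model_feats), len(learner_feats)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(model_feats[i - 1] - learner_feats[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Walk back from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def map_word_breaks(model_word_frames, path):
    """Project word boundaries from model frames onto learner frames.

    model_word_frames: list of (word, first_frame, last_frame), bounds inclusive.
    Returns the same structure expressed in learner frame indices.
    """
    result = []
    for word, first, last in model_word_frames:
        learner_frames = [j for i, j in path if first <= i <= last]
        result.append((word, min(learner_frames), max(learner_frames)))
    return result
```

With 100 msec frames, multiplying the returned frame indices by 0.1 converts them back to seconds, which gives the utterance time of each word in the learner data.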

The control unit 11 of the language learning device 1 then applies the same processing as in step SA6 to the learner data, extracting its pitch and generating a pitch curve (hereinafter referred to as the second pitch curve), and stores data representing the generated second pitch curve in the storage unit 12 (step SA11).

Next, the control unit 11 calculates the average pitch of the model voice data stored in the storage unit 12 (step SA12). Specifically, it calculates, for example, the average of the pitch values of the frames obtained by the division in step SA6. The control unit 11 performs the same processing on the learner data stored in the storage unit 12 to calculate the average pitch of the learner data (step SA13).

Next, the control unit 11 controls the display unit 14 to display the selected example sentence, the first pitch curve, and the second pitch curve in the pitch graph display portion 120, as illustrated in FIG. 3 (step SA14). In doing so, the control unit 11 stretches or compresses the second pitch curve along the time axis frame by frame so that the utterance time of each frame in the learner data matches the utterance time of the corresponding frame in the first pitch curve. The control unit 11 also displays the first pitch curve and the second pitch curve in a positional relationship in which the display position of the average pitch of the first pitch curve coincides with the display position of the average pitch of the second pitch curve.
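
One way to picture the frame-by-frame stretching and compressing is to resample the learner's pitch values onto the model's frame axis using the frame correspondence. The sketch below assumes that correspondence is available as a list of (model_frame, learner_frame) pairs and averages the learner values that fall on the same model frame; the function name and the averaging rule are assumptions, not details from the text.

```python
def warp_to_model_axis(learner_pitch, path, n_model_frames):
    """Place learner pitch values on the model's time axis.

    learner_pitch: per-frame learner pitch values.
    path: list of (model_frame, learner_frame) pairs from the frame alignment.
    Returns one value per model frame (None where no learner frame is aligned).
    """
    buckets = [[] for _ in range(n_model_frames)]
    for model_frame, learner_frame in path:
        buckets[model_frame].append(learner_pitch[learner_frame])
    return [sum(b) / len(b) if b else None for b in buckets]
```

After this step both curves have one value per model frame, so corresponding frames share the same horizontal position when plotted.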

In the pitch graph display portion 120 shown in FIG. 3, the curve drawn with a chain line represents the first pitch curve, and the curve drawn with a solid line represents the second pitch curve. As illustrated, the vertical display positions of the first and second pitch curves are adjusted so that the average pitch of each coincides with the zero point of the vertical axis (for example, the position of the horizontal axis). This allows the learner to see the differences in the temporal change of pitch (intonation) between the model voice and his or her own pronunciation.
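
Aligning both averages with the zero point of the vertical axis amounts to subtracting each curve's own mean before plotting. A minimal sketch, assuming pitch curves are lists of per-frame values with None for frames without a pitch (the function name is hypothetical):

```python
def center_on_mean(curve):
    """Shift a pitch curve so that its average sits at zero on the vertical axis."""
    voiced = [p for p in curve if p is not None]
    if not voiced:
        return list(curve)
    mean = sum(voiced) / len(voiced)
    return [None if p is None else p - mean for p in curve]
```

For example, a model curve of 200, 220, 240 Hz and a learner curve of 100, 110, 120 Hz become -20, 0, 20 and -10, 0, 10, so the two contours overlap around the same baseline even though the speakers' average pitches differ by a factor of two.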

The control unit 11 also extracts the locations where the pitch differs between the model voice and the learner voice and displays icons indicating the extracted locations. This extraction is performed, for example, by calculating the difference between the pitch values at corresponding locations on the first and second pitch curves and identifying the locations where the calculated difference is equal to or greater than a predetermined value. As shown in FIG. 3, the control unit 11 then displays icons I1 and I2, which indicate that the pronunciation pitch differs from that of the model voice, at the portions where the pitch difference between the two is equal to or greater than the predetermined value, and displays those portions of the second pitch curve in a different color or line thickness. In FIG. 3, the corresponding portions of the second pitch curve are drawn with a thick solid line.
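
The threshold test described above can be sketched as follows, assuming both curves have already been brought onto the same frame axis and vertical baseline. The function names, the grouping of consecutive frames into regions, and the frame-index representation of words are illustrative assumptions.

```python
def find_pitch_differences(first, second, threshold):
    """Return (start, end) frame ranges, inclusive, where the curves differ.

    A frame counts as differing when both curves have a value there and the
    absolute difference is equal to or greater than threshold.
    """
    regions, start = [], None
    for k, (p, q) in enumerate(zip(first, second)):
        differs = p is not None and q is not None and abs(p - q) >= threshold
        if differs and start is None:
            start = k
        elif not differs and start is not None:
            regions.append((start, k - 1))
            start = None
    if start is not None:
        regions.append((start, min(len(first), len(second)) - 1))
    return regions


def words_with_differences(regions, word_frames):
    """Return the words whose frame range overlaps any differing region."""
    return [
        word
        for word, first, last in word_frames
        if any(start <= last and first <= end for start, end in regions)
    ]
```

Each returned region is a candidate location for an icon such as I1 or I2, and the overlapping words are the ones a learner would want to hear played back.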

At this point, the learner can operate the input unit 15 to request audio output of a differing location. This is done, for example, by moving the mouse pointer to the area where the icon I1 or I2 is displayed and left-clicking. On detecting that such an operation has been performed, the control unit 11 of the language learning device 1 reads from the storage unit 12, in turn, the audio data of the word containing the selected differing location in the model voice and the audio data of that word in the learner voice, and supplies them to the audio processing unit 17. As a result, the two voices are output in turn through the speaker 18.
For example, when the icon I1 in FIG. 3 is left-clicked, the control unit 11 outputs through the speaker 18, in turn, the word “One” from the model voice and the word “One” from the learner's utterance. The unit of audio output need not be a single word; for example, the audio of a section that includes the words before and after the problematic word may be output. This lets the learner understand the difference in pronunciation within the flow of a series of words.

Consider displaying the pitch curve of the learner voice and the pitch curve of the model voice when, for example, the learner's utterance time differs greatly from that of the model voice. If the two pitch curves are simply displayed one above the other or side by side, as shown in FIG. 9, it is difficult for the learner to see which parts of the curves correspond and which parts of his or her pronunciation differ from the model voice.
Furthermore, even if the pitch curves are superimposed on the same scale, the average pitches of the model voice and the learner voice generally differ, so the two curves are offset vertically and it is hard to identify the parts that need correction. In particular, when the speaker of the model voice and the learner are of different sexes, the average pitch of the model voice and that of the learner voice differ greatly, and even with the two pitch curves superimposed it is difficult for the learner to understand specifically how to correct the pronunciation.

In contrast, in the present embodiment, as shown in FIG. 3, the first pitch curve and the second pitch curve are displayed in a positional relationship in which the display positions of their average pitch values coincide, so the difference in intonation between the learner's pronunciation and the model voice is easy to grasp visually.
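The mean-aligned display described above can be illustrated with a minimal sketch. It assumes pitch is held as a list of values in Hz with `None` for unvoiced frames and that the vertical axis is linear in Hz; neither detail is stated in the source, and the function name `align_by_mean` is hypothetical.

```python
from statistics import fmean

def align_by_mean(model_pitch, learner_pitch):
    """Shift the learner curve vertically so its mean equals the model's mean."""
    # Unvoiced frames are stored as None (an assumption) and are ignored
    # when the averages are computed.
    model_voiced = [p for p in model_pitch if p is not None]
    learner_voiced = [p for p in learner_pitch if p is not None]
    offset = fmean(model_voiced) - fmean(learner_voiced)
    # Apply the same vertical offset to every voiced learner frame.
    return [None if p is None else p + offset for p in learner_pitch]

model = [220.0, 240.0, 260.0, None, 200.0]    # mean of voiced frames: 230 Hz
learner = [110.0, 130.0, None, 120.0, 100.0]  # mean of voiced frames: 115 Hz
shifted = align_by_mean(model, learner)
print(shifted)
```

After the shift both curves share the same average, so the remaining vertical gaps reflect differences in intonation rather than differences in overall voice height.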

Further, the second pitch curve is displayed compressed or stretched frame by frame so that the utterance time of each frame in the learner data matches the utterance time of the corresponding frame of the first pitch curve. As a result, even when the learner's utterance time and the model voice's utterance time differ greatly, the learner can see which part of one graph corresponds to which part of the other, and can easily tell which part (which word) needs its pitch (intonation) corrected.
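One way to picture the frame-by-frame stretching is a piecewise-linear time mapping, sketched below. The sketch assumes the corresponding frames are already known and are given as `(start, end)` time spans in seconds for each speaker; how the device actually pairs frames is not shown here, and `warp_to_model_time` is a hypothetical name.

```python
def warp_to_model_time(learner_points, learner_frames, model_frames):
    """Map learner (time, pitch) samples onto the model's time axis, frame by frame."""
    warped = []
    for t, pitch in learner_points:
        for (ls, le), (ms, me) in zip(learner_frames, model_frames):
            if ls <= t < le:
                # Stretch or compress linearly inside this pair of frames.
                scale = (me - ms) / (le - ls)
                warped.append((ms + (t - ls) * scale, pitch))
                break
        # Samples that fall outside every frame are simply dropped.
    return warped

learner_frames = [(0.0, 1.0), (1.0, 3.0)]   # learner speaks slowly
model_frames = [(0.0, 0.5), (0.5, 1.5)]     # model speaks faster
points = [(0.5, 120.0), (2.0, 140.0)]
warped = warp_to_model_time(points, learner_frames, model_frames)
print(warped)
```

Each learner sample keeps its pitch value and only its horizontal position changes, which is what lets the two curves line up frame for frame on screen.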

Also, because locations where the pitch differs from the model voice are marked with icons, the learner can visually grasp the differences between the model and the learner's own pronunciation and can easily tell which locations need correction. Furthermore, because the word at each differing location is output as sound for both the model voice and the learner voice, the learner can easily grasp which word to correct and how.
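The selection of words to mark with an icon can be sketched as a simple threshold test. The claims only say that the pitch difference is "a predetermined value or more"; the 20 Hz default, the use of one pitch value per word, and the name `words_to_flag` are assumptions made for illustration, and the learner values are taken to be already mean-aligned.

```python
def words_to_flag(words, model_pitch, learner_pitch, threshold_hz=20.0):
    """Return the words whose pitch differs from the model by the threshold or more."""
    flagged = []
    for word, m, l in zip(words, model_pitch, learner_pitch):
        # "Predetermined value or more" is read here as >= threshold_hz.
        if abs(m - l) >= threshold_hz:
            flagged.append(word)
    return flagged

words = ["One", "two", "three"]
model_pitch = [250.0, 230.0, 210.0]
learner_pitch = [215.0, 235.0, 225.0]   # hypothetical, already mean-aligned
flagged = words_to_flag(words, model_pitch, learner_pitch)
print(flagged)
```

Only the flagged words would receive an icon, and clicking an icon would then trigger playback of that word from both recordings.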

<B-2: Rhythm word ratio display processing>
Returning to the description of FIG. 6: together with the pitch graph, the control unit 11 of the language learning device 1 displays band-shaped figures indicating rhythm in the rhythm graph display portion 220, as shown in FIG. 3 (step SA15). Specifically, it displays in the rhythm graph display portion 220 a band-shaped figure showing the ratio of each word's utterance time to the utterance time of the entire example sentence in the model voice, and a band-shaped figure showing the same ratio for the learner voice.
In the rhythm graph display portion 220 shown in FIG. 3, the bands with a word displayed inside represent the ratio of each word's utterance time to the utterance time of the entire example sentence in the model voice, and the shaded bands represent the corresponding ratio for the learner's utterance.
The length of each band with a word displayed inside is determined from the per-word utterance time stored in the table TBL1 (first word break information), and the length of each filled band is determined from the utterance time obtained in step SA10 (second word break information). Specifically, the length of each band corresponds to the word's share of the utterance time: a larger share gives a longer band, and a smaller share gives a shorter band.
Further, the control unit 11 extracts locations where the rhythm (word utterance time) differs between the model voice and the learner voice, and displays an icon I3 near each extracted location, as shown in FIG. 3.
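The ratio bands and the extraction of differing rhythm can be sketched as follows. The source does not say how large a difference must be before the icon I3 is shown, so the `tolerance` parameter is an assumption, as are the function names and the 400-pixel bar width.

```python
def duration_ratios(durations):
    """Each word's share of the whole sentence's utterance time."""
    total = sum(durations)
    return [d / total for d in durations]

def band_widths(durations, total_width_px=400):
    """Band length in pixels, proportional to each word's share."""
    return [round(r * total_width_px) for r in duration_ratios(durations)]

def rhythm_differences(words, model_dur, learner_dur, tolerance=0.05):
    """Words whose share of the sentence differs by more than the tolerance."""
    pairs = zip(words, duration_ratios(model_dur), duration_ratios(learner_dur))
    return [w for w, m, l in pairs if abs(m - l) > tolerance]

words = ["One", "two", "three"]
model_dur = [0.25, 0.25, 0.5]    # seconds, 1.0 s in total
learner_dur = [1.0, 0.5, 0.5]    # seconds, 2.0 s in total
print(band_widths(model_dur), band_widths(learner_dur))
print(rhythm_differences(words, model_dur, learner_dur))
```

Because both rows are normalized to the same total width, a learner who speaks twice as slowly overall but with the same proportions would produce identical bands and no flagged words.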

By displaying the ratio of each word's utterance time for both the model voice and the learner voice in this way, the difference in rhythm (word utterance time) between the learner voice and the model voice becomes easy to grasp visually, and the learner can readily see which word's utterance time (speaking speed) should be corrected.

<B-3: Pitch word graph display processing>
Next, the display processing of the pitch word graph in the pitch graph display portion 120 is described.
With the screen shown in FIG. 3 displayed on the display unit 14, when the learner operates the input unit 15 to select “word display” with the radio button 111, the control unit 11 of the language learning device 1 detects the operation and switches the pitch graph display portion 120 to the pitch word graph display. Specifically, the control unit 11 first identifies, for each word in the model voice data, the highest pitch value within that word. It then identifies, for each word in the learner data, the highest pitch value within that word. The control unit 11 then displays in the pitch graph display portion 120 bands indicating the utterance time of each word, as shown in FIG. 4.

In the pitch graph display portion 120 of FIG. 4, the bands with a word displayed inside represent the pronunciation of the model voice, and the shaded bands represent the learner's pronunciation. The vertical position of each band is determined by the highest pitch value of its word. Specifically, each band is placed on the screen, relative to a predetermined display position, according to how high or low the pitch is: a band is displayed higher when the pitch is high and lower when the pitch is low. The bands are also displayed in a positional relationship in which the display position of the average pitch of the model voice data coincides with the display position of the average pitch of the learner data.

By displaying, for each word, a band-shaped figure at a position that reflects the pitch of that word, the device can show the learner visually which words have a pitch that deviates from the pitch of the model pronunciation.
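The per-word peak pitch and the vertical placement of the bands can be sketched as below. The sketch assumes pitch samples are `(time, Hz)` pairs, that word boundaries are `(start, end)` spans, and that every word contains at least one pitch sample; `baseline_y`, `pixels_per_hz`, and both function names are hypothetical.

```python
def word_peak_pitches(pitch_points, word_spans):
    """Highest pitch inside each word's (start, end) span."""
    peaks = []
    for start, end in word_spans:
        in_word = [p for t, p in pitch_points if start <= t < end]
        peaks.append(max(in_word))  # assumes at least one sample per word
    return peaks

def band_y(peak, mean_pitch, baseline_y=100, pixels_per_hz=0.5):
    """Vertical pixel position of a band; screen y grows downward."""
    # A pitch above the speaker's own mean moves the band up (smaller y).
    return baseline_y - (peak - mean_pitch) * pixels_per_hz

points = [(0.0, 200.0), (0.1, 240.0), (0.3, 220.0), (0.4, 180.0)]
spans = [(0.0, 0.2), (0.2, 0.5)]
peaks = word_peak_pitches(points, spans)
mean_pitch = sum(p for _, p in points) / len(points)   # 210 Hz
positions = [band_y(p, mean_pitch) for p in peaks]
print(peaks, positions)
```

Since each speaker's bands are placed relative to that speaker's own mean, the two means fall on the same baseline, matching the positional relationship described for FIG. 4.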

<B-4: Rhythm equal magnification comparison display processing>
Next, the rhythm equal magnification comparison display processing is described. With the screen shown in FIG. 3 displayed on the display unit 14, when the learner operates the input unit 15 to select “equal magnification comparison” with the radio button 211, the control unit 11 of the language learning device 1 switches the rhythm graph display portion 220 to the rhythm equal magnification comparison display. Specifically, the control unit 11 displays in the rhythm graph display portion 220 band-shaped figures indicating the utterance time of each word in the model voice data and in the learner data, as shown in FIG. 5. In the rhythm graph display portion 220 shown in FIG. 5, the bands with a word displayed inside represent the utterance time of each word in the model voice, and the shaded bands represent the utterance time of each word in the learner's utterance.

At this time, while checking the screen displayed on the display unit 14, the learner can operate the input unit 15 to select a band in the rhythm graph display portion 220 that indicates the utterance time of a word in the learner voice. This is done, for example, by moving the mouse pointer onto the band and left-clicking. The control unit 11 of the language learning device 1 performs the processing of steps SA20 and SA21 in FIG. 6 on the selected band to change its display position. This display position change is described below with reference to FIG. 6.

First, when the control unit 11 detects that an operation selecting a band has been performed (step SA20; YES), it acquires the drawing position of the left end (hereinafter, the left end coordinate) of the band that represents the utterance time, in the model voice, of the same word as the one represented by the selected band. For example, if band B1 is selected in the example shown in FIG. 5, it acquires the left end coordinate of band B2. The control unit 11 then changes the left end coordinate of the selected band to the acquired left end coordinate and redraws the band (step SA21). FIG. 8 shows the screen displayed when the band corresponding to the word “ten” is selected.

On the screen shown in FIG. 5, it is difficult to tell at a glance whether the word “ten” takes longer in the model voice or in the learner's pronunciation. In contrast, in the example shown in FIG. 8, the bands corresponding to the word “ten” are drawn with the same left end position, so which utterance is longer can be seen at a glance. The learner can thus see whether the word should be pronounced longer or shorter.
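Steps SA20 and SA21 amount to copying one coordinate and redrawing, as in the following sketch. The `Band` type, the pairing of bands by list index, and the numeric values are assumptions for illustration; the source only states that the selected learner band takes the left end coordinate of the model band for the same word.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Band:
    word: str
    left: float    # left end coordinate in pixels
    width: float   # proportional to the word's utterance time

def snap_to_model(learner_bands, model_bands, index):
    """Redraw the selected learner band at the model band's left end coordinate."""
    # Bands are paired by position in the sentence (an assumption), which
    # stays unambiguous even if the same word occurs twice.
    aligned = replace(learner_bands[index], left=model_bands[index].left)
    return learner_bands[:index] + [aligned] + learner_bands[index + 1:]

model_bands = [Band("One", 0.0, 40.0), Band("ten", 40.0, 30.0)]
learner_bands = [Band("One", 0.0, 55.0), Band("ten", 55.0, 45.0)]
redrawn = snap_to_model(learner_bands, model_bands, 1)
longer = "learner" if redrawn[1].width > model_bands[1].width else "model"
print(redrawn[1], longer)
```

With the left ends aligned, the comparison of the two widths is what the learner reads directly off the screen in FIG. 8.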

<B-5: Rhythm full text comparison display processing>
Next, the rhythm full text comparison display processing is described. When “full text comparison” is selected with the radio button 211 on the screen shown in FIG. 3, the control unit 11 of the language learning device 1 displays in the rhythm graph display portion 220 a band-shaped figure indicating the utterance time of the entire example sentence in the model voice and a band-shaped figure indicating the utterance time of the entire example sentence in the learner voice, as shown in FIG. 4. The learner can thereby visually grasp the difference in utterance time of the whole sentence between the model voice and the learner's own voice.

As described above, in the present embodiment the intonation (temporal change in pitch) and rhythm of the learner's pronunciation are visualized and displayed in a form that is easy to compare with the intonation and rhythm of the model, so that the learner can recognize what needs to be done to bring his or her pronunciation closer to the model pronunciation.

<C: Modifications>
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various other forms. Examples are given below.
(1) In the embodiment described above, both pitch and rhythm are displayed on one screen. However, only the pitch or only the rhythm may be displayed, and the learner may be allowed to select by operation which of the two is displayed.

(2) In the embodiment described above, the model voice data representing the model voice of the example sentence, the word break information indicating the utterance time of each word in the model voice, and the text data of the example sentence are stored in the storage unit of the language learning device. Instead of storing these data in the storage unit in advance, the learner may input them by operating the input unit 15.

(3) In the embodiment described above, a band-shaped figure corresponding to each word (or to the entire example sentence) is displayed on the display unit in the pitch display or rhythm display. The figure corresponding to a word (or to the entire example sentence) is not limited to a band; for example, a chain of connected circles may be displayed, as shown in FIG. 10. Alternatively, figures indicating the start and end of each word (or of the entire example sentence), such as emoticons that depict a human face with characters, may be displayed at those positions, with a connecting figure displayed in the section between them to represent the word (or the entire example sentence). In short, any figure may be used as long as it corresponds to each word (or to the entire example sentence).

(4) In the embodiment described above, the model voice data stored in the storage unit 12 is in WAVE format. However, the data format is not limited to this; any format may be used as long as the data represents voice.
Also, although the model voice data and the learner data are digital data in the embodiment described above, analog data may be used.

(5) In the embodiment described above, the language learning device 1 performs the pitch curve generation processing, the pitch curve display processing, and the other processing according to the present invention. Alternatively, two or more devices connected by a communication network may share the functions of the embodiment, so that a system comprising those devices realizes the language learning device 1 of the embodiment.
For example, the system may be configured as a computer device equipped with a microphone, a speaker, a display device, an input device, and the like, connected via a communication network to a server device that stores the model voice data and compares the model voice data with voice data. In this case, the computer device converts the voice input from the microphone into voice data and transmits it to the server device, and the server device compares the received voice data with the model voice data, generates the pitch curves and the like, and transmits them to the language learning device 1.

(6) The program executed by the control unit 11 of the language learning device 1 in the embodiment described above can be provided stored on a recording medium such as a magnetic tape, a magnetic disk, a floppy (registered trademark) disk, an optical recording medium, a magneto-optical recording medium, a CD (Compact Disk)-ROM, a DVD (Digital Versatile Disk), or a RAM.

FIG. 1 is a block diagram showing an example of the hardware configuration of a language learning device according to a first embodiment of the present invention.
FIG. 2 is a diagram showing an example of the data structure of a table stored in the storage unit of the embodiment.
FIG. 3 is a diagram showing an example of a screen displayed on the display unit of the embodiment.
FIG. 4 is a diagram showing an example of a screen displayed on the display unit of a language learning device according to a second embodiment of the present invention.
FIG. 5 is a diagram showing an example of a screen displayed on the display unit of the embodiment.
FIG. 6 is a flowchart showing the flow of processing performed by the control unit of the embodiment.
FIG. 7 is a diagram showing an example of the waveform of the model voice and the waveform of the learner voice.
FIG. 8 is a diagram showing an example of a screen displayed on the display unit of the embodiment.
FIG. 9 is a diagram showing an example of a pitch graph.
FIG. 10 is a diagram showing an example of a screen according to a modification of the present invention.

Explanation of symbols

1 ... language learning device, 11 ... control unit, 12 ... storage unit, 13 ... bus, 14 ... display unit, 15 ... input unit, 16 ... microphone, 17 ... voice processing unit, 18 ... speaker.

Claims

A language learning apparatus comprising:
voice input means for outputting input voice as utterance voice data;
first pitch curve generating means for generating, from model voice data that includes the speech of each word and first word break information indicating the utterance time of each such speech, a first pitch curve indicating the temporal change in pitch of the speech;
second pitch curve generating means for generating a second pitch curve indicating the temporal change in pitch of the utterance voice data;
frame specifying means for analyzing the model voice data and the utterance voice data in units of predetermined frames and specifying the frames that correspond between the two;
first pitch average value specifying means for specifying the average value of the first pitch curve as a first average value;
second pitch average value specifying means for specifying the average value of the second pitch curve as a second average value;
display means for displaying the first pitch curve and the second pitch curve with pitch height indicated in the vertical direction and temporal change indicated in the horizontal direction, such that, for each frame specified by the frame specifying means, the pitches of the corresponding frames of the model voice data and the utterance voice data are at the same position in the horizontal direction, and in a positional relationship in which the vertical display position of the first average value on the first pitch curve coincides with the vertical display position of the second average value on the second pitch curve;
specifying means for specifying, as second word break information, the utterance time of the speech corresponding to each word in the utterance voice data, based on the first word break information and the frames specified by the frame specifying means;
word pitch specifying means for specifying one representative value for each section of the first pitch curve based on the first word break information, and one representative value for each section of the second pitch curve based on the second word break information;
graphic display means for displaying a graphic corresponding to each word in the model voice data at a position corresponding to the pitch of that word specified by the word pitch specifying means, and a graphic corresponding to each word in the utterance voice data at a position corresponding to the pitch of that word specified by the word pitch specifying means;
difference display means for detecting words for which the difference in pitch between the first pitch curve and the second pitch curve is a predetermined value or more, and displaying information identifying those words;
operation means operated by an operator;
selection means for selecting one of the plural pieces of information displayed by the difference display means in accordance with an operation of the operation means;
storage means for storing the model voice data and the utterance voice data; and
output means for reading from the storage means, and outputting, the model voice data or the utterance voice data corresponding to the word indicated by the information selected by the selection means, the output means, when outputting the model voice data, performing time-stretch processing on the model voice data so that its utterance time equals the utterance time of the utterance voice data.

The language learning apparatus according to claim 1, further comprising detecting means for detecting differing portions between the first pitch curve and the second pitch curve,
wherein the display means displays the differing portions detected by the detecting means in an identifiable manner.