JP4543919B2 - Language learning device

Info

Publication number: JP4543919B2
Application number: JP2004371875A
Authority: JP
Inventors: 紀行畑
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2004-12-22
Filing date: 2004-12-22
Publication date: 2010-09-15
Anticipated expiration: 2024-12-22
Also published as: JP2006178214A

Description

The present invention relates to a language learning system that supports language learning.

In language learning, whether of a foreign language or of one's native language, and especially in self-study of pronunciation or speaking, a widely used method is to play back a model voice recorded on a recording medium such as a CD (Compact Disk) and to pronounce or speak in imitation of it. The aim is to acquire correct pronunciation by imitating the model voice. To make such learning more effective, the learner needs to evaluate objectively the difference between the model voice and his or her own voice. However, merely listening to the model voice recorded on a CD and imitating it makes it difficult to grasp concretely how one's own voice differs from the model voice.

Techniques for solving this problem include, for example, those described in Patent Documents 1 to 3. Patent Document 1 discloses a technique for reproducing a model voice and a user voice simultaneously. Patent Document 2 discloses a technique for simultaneously outputting the waveform of a model voice (model waveform) and the waveform of a user voice (own-voice waveform). Patent Document 3 discloses a technique that, when comparing a model voice with a user voice, aligns the beginnings of the two or uniformly stretches one of them so that both have the same length.
Patent Document 1: JP 7-219418 A
Patent Document 2: JP 2002-23613 A
Patent Document 3: JP 11-143496 A

The technique described in Patent Document 1 merely reproduces the model voice and the user voice at the same time, so the differences between the two are hard to perceive. The technique described in Patent Document 2 likewise merely outputs the waveform of the model voice and the waveform of the user voice at the same time, so the differences between the two are again hard to perceive. Furthermore, the technique described in Patent Document 3 only stretches one voice uniformly to equalize the lengths of the model voice and the user voice, so the phonemes of the two do not necessarily coincide on the time axis.

The present invention has been made in view of the above circumstances, and an object thereof is to provide a language learning device capable of presenting the differences between a model voice and a user voice.

To solve the above problem, the present invention provides a language learning device comprising: storage means for storing a model voice; input means for inputting a user voice; separation means for separating each of the model voice and the user voice into consonant parts and vowel parts; processing means for changing the length of the vowel parts of the model voice or the user voice so that the model voice and the user voice have the same length; and comparison means for comparing the model voice and the user voice whose lengths have been equalized by the processing means and extracting differences between them.

In a preferred embodiment, the language learning device may further comprise rearrangement means for rearranging the consonant parts of one of the model voice and the user voice so that they occupy the same positions on the time axis as the consonant parts of the other, and the processing means may equalize the lengths of the model voice and the user voice by changing the length of each vowel part located between a consonant part rearranged by the rearrangement means and the consonant part that appears next on the time axis.
In another preferred embodiment, the language learning device may further comprise: image generation means for extracting a predetermined parameter from the user voice and the model voice and generating image data representing a figure corresponding to the amount of change of the parameter; and display means for displaying the image data generated by the image generation means in a different display mode at the differences extracted by the comparison means.

In still another preferred embodiment, the language learning device may further comprise: difference emphasizing means for emphasizing the differences extracted by the comparison means; and output means for outputting a voice in which the differences have been emphasized by the difference emphasizing means.
In this embodiment, the difference emphasizing means may make the volume of the model voice higher than that of the user voice in portions where the comparison means extracted a difference, and lower than that of the user voice in portions where the comparison means extracted no difference.
Alternatively, the difference emphasizing means may amplify a specific frequency region of the model voice in portions where the comparison means extracted a difference.

According to the present invention, the user can identify concretely the portions in which his or her own voice differs from the model voice.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
<First Embodiment>
FIG. 1 is a block diagram showing the functional configuration of a language learning device 1 according to the first embodiment of the present invention. The storage unit 10 stores model voice data representing a voice that serves as a model in language learning. The input unit 20 acquires the voice of the user (learner or student) and outputs user voice data. The data processing unit 30 processes the user voice data so that the user voice and the model voice have the same length. The difference extraction unit 40 compares the user voice data and the model voice data, whose lengths have been equalized, and extracts the differences between them. The difference emphasizing unit 50 processes the model voice data and the user voice data so as to emphasize the differences extracted by the difference extraction unit 40. The sound output unit 60 reproduces the model voice and the user voice in which the differences have been emphasized. The function of each component is described in detail later.

FIG. 2 is a block diagram showing the hardware configuration of the language learning device 1. A CPU (Central Processing Unit) 101 reads and executes programs stored in a ROM (Read Only Memory) 103 or an HDD (Hard Disk Drive) 104, using a RAM (Random Access Memory) 102 as a work area. The HDD 104 is a storage device that stores various application programs and data. In the present embodiment in particular, the HDD 104 stores a language learning program and a model voice database DB1 in which the model voice data used by the language learning program is recorded.

The display 105 is a display device, such as a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display), that displays characters and images under the control of the CPU 101. The microphone 106 is a sound collection device for acquiring the user's voice and outputs an audio signal corresponding to the voice uttered by the user. The audio processing unit 107 has a function of converting the analog audio signal output from the microphone 106 into digital audio data, and a function of converting audio data stored in the HDD 104 into an audio signal and outputting it to the speaker 108. The user can input instructions to the language learning device 1 by operating the keyboard 109. The components described above are connected to one another via a bus 110.

FIG. 3 is a diagram showing the contents of the model voice database DB1. The model voice database DB1 stores a plurality of sets, each consisting of the text data of an example sentence used for language learning (for example, in English learning, a sentence such as "Good to see you again. How are you?"), speech waveform data obtained by digitizing the speech waveform of that example sentence, and an identifier specifying the example sentence. The example sentence text data, the speech waveform data, and the identifier are associated with one another.
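
As an illustration only, the association among identifier, example sentence text, and waveform data described for the model voice database DB1 could be sketched as follows. The class name `ExampleSentence`, the function `build_database`, and the identifier format "S001" are hypothetical and do not appear in the specification; the waveform is assumed to be a plain list of integer samples.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExampleSentence:
    """One record of the model voice database (hypothetical layout)."""
    identifier: str      # identifier specifying the example sentence
    text: str            # example sentence text data
    waveform: List[int]  # digitized speech waveform samples


def build_database(records: List[ExampleSentence]) -> Dict[str, ExampleSentence]:
    """Index the records by identifier so a selected sentence can be looked up."""
    return {record.identifier: record for record in records}


db1 = build_database([
    ExampleSentence("S001", "Good to see you again. How are you?", [0, 12, -7, 3]),
])
selected = db1["S001"]  # model voice data for the sentence chosen by the user
```

Keying the records by identifier mirrors the statement that the three items are associated with one another; any other storage layout that preserves this association would serve equally well.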

Next, the operation of the language learning device 1 will be described. In the present embodiment, the CPU 101 executes the language learning program stored in the HDD 104, whereby functions corresponding to the functional components shown in FIG. 1 are realized in the language learning device 1.

FIG. 4 is a flowchart showing the operation of the language learning device 1 according to the present embodiment. When the language learning program is executed, the CPU 101 displays on the display 105 a message prompting the user to select an example sentence. Following the message displayed on the display 105, the user selects one example sentence from those recorded in the model voice database DB1. The CPU 101 reproduces the voice of the selected example sentence (step S101). Specifically, the CPU 101 extracts the model voice data corresponding to the selected example sentence from the model voice database DB1 and outputs the extracted model voice data to the audio processing unit 107. The audio processing unit 107 performs digital-to-analog conversion on the input model voice data and outputs the result to the speaker 108 as an analog audio signal. The model voice is thus reproduced.

The user listens to the model voice reproduced from the speaker 108 and utters the example sentence toward the microphone 106 in imitation of the model voice. That is, the user voice is input (step S102). Specifically, when reproduction of the model voice ends, the CPU 101 displays on the display 105 a message prompting the user to utter the example sentence, such as "Next is your turn. Please pronounce the example sentence." The CPU 101 further displays on the display 105 a message instructing the operation for inputting the user voice, such as "Press the space key and then speak. When you have finished, press the space key again." The user operates the keyboard 109 according to the displayed messages and inputs the user voice. That is, after pressing the space key of the keyboard 109, the user utters the example sentence toward the microphone 106. When the utterance is finished, the user presses the space key once more.

The user's voice is converted into an electric signal by the microphone 106, which outputs a user audio signal. The user audio signal is converted into digital audio data by the audio processing unit 107 and recorded in the HDD 104 as user voice data. After reproduction of the model voice is complete, the CPU 101 starts recording the user voice data when the space key is pressed, and ends recording when the space key is pressed again. That is, the user voice uttered between the first and the second press of the space key is recorded in the HDD 104.

Subsequently, the CPU 101 processes the user voice data so that the user voice and the model voice have the same length (step S103). The details are as follows. FIG. 5 illustrates the waveforms of the model voice (FIG. 5(A)) and the user voice (FIG. 5(B)). In the example shown in FIG. 5, the model voice and the user voice are both utterances of the same example sentence, but their lengths differ because the speaking rates differ. That is, the user speaks more slowly, so the user voice is longer. The CPU 101 makes the length of the user voice equal to that of the model voice as follows.

FIG. 6 is a flowchart showing in more detail the processing in step S103 that equalizes the lengths of the user voice and the model voice. The CPU 101 first calculates the lengths of the model voice and the user voice, for example by measuring their data sizes (step S103-1). From this result, the CPU 101 further calculates the difference in length between the model voice and the user voice (Δt in FIG. 5).
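
A minimal sketch of how step S103-1 might derive a duration from a data size is given below. It assumes uncompressed PCM audio; the function names and the default format (16 kHz, 16-bit, monaural) are illustrative assumptions and are not taken from the specification.

```python
def duration_seconds(num_bytes, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Length of uncompressed PCM audio computed from its data size (assumed format)."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)


def length_difference(model_bytes, user_bytes, **pcm_format):
    """Delta t: positive when the user voice is longer than the model voice."""
    return duration_seconds(user_bytes, **pcm_format) - duration_seconds(model_bytes, **pcm_format)


# A 32000-byte model voice and a 48000-byte user voice in the assumed format
delta_t = length_difference(32000, 48000)
```

With the assumed format, the model voice lasts 1.0 s and the user voice 1.5 s, so `delta_t` is 0.5 s.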

Subsequently, the CPU 101 adds an identifier to each consonant portion and each vowel portion of the user voice data (step S103-2). The reason is as follows. In general, the length of a consonant is roughly the same regardless of the speaker, whereas the length of a vowel varies greatly from speaker to speaker. Therefore, if only the vowel lengths are changed while the consonant lengths are left as they are, the length of the user voice can be changed without sounding unnatural. In the prior art, the length of the voice is changed uniformly regardless of whether a portion is a consonant or a vowel, which gives an unnatural impression to the listener. The present embodiment avoids this problem.

The vowel portions and the consonant portions are separated, for example, as follows. The CPU 101 extracts the vowels contained in the selected example sentence from its text data. Consider, for example, a case where the vowels "a", "u", "i", and "e" are extracted in this order from the beginning of the text data. The CPU 101 divides the voice data into segments of a predetermined duration (frames). The CPU 101 takes the logarithm of the amplitude spectrum obtained by Fourier-transforming the waveform represented by the framed model voice data and the waveform represented by the user voice signal, and applies an inverse Fourier transform to obtain a spectral envelope for each frame. From the spectral envelope thus obtained, the CPU 101 extracts the formant frequencies of the first, second, and third formants. In general, a vowel is characterized by the distribution of its first, second, and third formants. Starting from the beginning of the voice data, the CPU 101 first performs matching against the formant frequency distribution of the vowel "a". When the matching determines that a frame corresponds to the vowel "a", the CPU 101 generates data C indicating the type of the detected vowel and its position in the voice data. If data C has already been generated, the new information is added to data C. The CPU 101 also matches the subsequent frames against the vowel "a" and, once they no longer match, proceeds to matching against the vowel "u". In this way the vowels are searched for from the beginning, and data C is generated. Instead of generating data indicating the types and positions of vowels, data indicating the positions of consonants may be generated, or data indicating the positions of both vowels and consonants may be generated. Position data for liaison sections between sounds (intermediate sounds) and data for silent sections may also be generated.
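
The envelope computation described above (logarithm of the amplitude spectrum followed by an inverse Fourier transform) resembles the common cepstral smoothing method. The following is a rough sketch under that assumption, using a naive DFT from the standard library so that it is self-contained. The low-quefrency cutoff `n_lifter`, the constant `eps`, and the function names are hypothetical choices; the specification does not state how the envelope is smoothed.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def spectral_envelope(frame, n_lifter=8, eps=1e-10):
    """Smoothed log-amplitude spectrum of one frame via the real cepstrum."""
    log_amp = [math.log(abs(c) + eps) for c in dft(frame)]    # log amplitude spectrum
    cepstrum = [c.real for c in dft(log_amp, inverse=True)]   # inverse transform
    n = len(cepstrum)
    # Keep only low-quefrency terms (and their mirror images) to smooth the spectrum.
    liftered = [c if (i < n_lifter or i > n - n_lifter) else 0.0
                for i, c in enumerate(cepstrum)]
    return [c.real for c in dft(liftered)]                    # back to the frequency axis


# A decaying frame has more energy at low frequencies than at high frequencies.
envelope = spectral_envelope([0.5 ** t for t in range(16)], n_lifter=4)
```

Formant frequencies would then be read off as the peaks of `envelope`. A practical implementation would normally use an FFT library and a window function, both omitted here for brevity.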

Subsequently, the CPU 101 numbers the consonant portions of the model voice data in order from the beginning of the data and stores them in the RAM 102 as a table TB1, together with information indicating the start position of each consonant portion (for example, the time from the beginning of the data). The vowels and consonants in the model voice data may be separated in the same manner as for the user voice data described above. Alternatively, information indicating the positions of the vowels or consonants may be stored in advance in the model voice database DB1, and the consonant portions may be identified on the basis of that information. The CPU 101 then cuts the consonant portions out of the user voice data and, referring to the table TB1, rearranges the cut-out consonant portions so that their positions coincide with those of the consonants in the model voice (step S103-3). Further, the CPU 101 processes the portions of the user voice data that correspond to vowels located between consonants so that they have the same length as in the model voice (step S103-3). For example, when a vowel portion of the user voice is longer than the corresponding vowel portion of the model voice, this can be done by deleting the data of the excess part. Conversely, when a vowel portion of the user voice is shorter than the corresponding vowel portion of the model voice, the waveform of the vowel portion may be repeatedly appended until the desired length is reached. In this way, the length of the user voice becomes the same as that of the model voice, and the positions of the consonants and vowels on the time axis also coincide.
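
The vowel length adjustment described above (deleting the excess part of a long vowel, or repeatedly appending the waveform of a short one) can be sketched as follows. The function name `fit_vowel` and the representation of a vowel portion as a list of samples are assumptions made for illustration.

```python
def fit_vowel(samples, target_len):
    """Trim or repeat a vowel portion so that it has exactly target_len samples."""
    if target_len <= 0 or not samples:
        return []
    if len(samples) >= target_len:
        # User vowel is longer than the model vowel: delete the excess part.
        return list(samples[:target_len])
    # User vowel is shorter: append the waveform repeatedly until long enough.
    out = []
    while len(out) < target_len:
        out.extend(samples)
    return out[:target_len]


shortened = fit_vowel([1, 2, 3, 4, 5], 3)  # [1, 2, 3]
lengthened = fit_vowel([1, 2], 5)          # [1, 2, 1, 2, 1]
```

This sketch joins repetitions abruptly. A practical implementation would probably repeat whole pitch periods or cross-fade at the joins to avoid audible clicks, which the specification does not discuss.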

The description now returns to FIG. 4. The CPU 101 extracts the differences between the model voice and the user voice (step S104). This processing is performed, for example, as follows. As described above, the CPU 101 divides the waveform represented by the model voice data into segments of a predetermined duration (frames). The CPU 101 likewise divides the waveform represented by the user voice data into frames. The CPU 101 takes the logarithm of the amplitude spectrum obtained by Fourier-transforming the waveform represented by the framed model voice data and the waveform represented by the user voice signal, and applies an inverse Fourier transform to obtain a spectral envelope for each frame.

FIG. 7 illustrates the spectral envelopes of the model voice (top) and the user voice (bottom). The spectral envelopes shown in FIG. 7 consist of three frames, frame I to frame III. The CPU 101 compares the obtained spectral envelopes frame by frame. When the difference between the spectral envelope of the model voice and that of the user voice exceeds a predetermined threshold, the CPU 101 determines that the model voice and the user voice differ in that frame. The difference between the model voice and the user voice may be obtained, for example, as the distance between two points when the frequency and spectral density of a characteristic formant are plotted on a spectral density versus frequency diagram, or by comparing the spectral densities at a specific frequency. Alternatively, the difference may be obtained by comparing the formant frequencies of one or more specific formants. In the example shown in FIG. 5, the CPU 101 determines that there is a difference in frame II. The CPU 101 generates data F recording a flag indicating that the model voice and the user voice differed, and stores it in the RAM 102. When there is no difference between the model voice and the user voice, the CPU 101 generates data F indicating that fact and stores it in the RAM 102. Further, the CPU 101 generates data D recording a flag indicating whether the model voice and the user voice differ in the frame, and stores it in the RAM 102. That is, data D indicates, for each frame, whether the user's pronunciation is good or poor (whether it differs from the model voice). The CPU 101 compares the spectrum of the model voice with that of the user voice for all frames in this manner. As a result, the RAM 102 holds data D identifying the frames determined to differ from the model voice.
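
The frame-by-frame threshold comparison can be sketched as below. The root-mean-square distance between envelopes is an assumed stand-in for the formant-based measures mentioned in the text, and the function names are hypothetical; `data_d` and `data_f` correspond to the per-frame flags (data D) and the overall flag (data F).

```python
import math


def envelope_distance(model_env, user_env):
    """RMS difference between two spectral envelopes (an assumed distance measure)."""
    squared = [(m - u) ** 2 for m, u in zip(model_env, user_env)]
    return math.sqrt(sum(squared) / len(squared))


def extract_differences(model_envs, user_envs, threshold):
    """Return (data_d, data_f): per-frame difference flags and an overall flag."""
    data_d = [envelope_distance(m, u) > threshold
              for m, u in zip(model_envs, user_envs)]
    data_f = any(data_d)
    return data_d, data_f


# Three frames (I to III); only the second differs by more than the threshold.
model_envs = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
user_envs = [[0.0, 0.0], [3.0, 3.0], [2.0, 2.0]]
data_d, data_f = extract_differences(model_envs, user_envs, threshold=0.5)
```

In this example `data_d` is `[False, True, False]` and `data_f` is `True`, matching the case in which only frame II is judged to differ.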

The description again returns to FIG. 4. On the basis of data F, the CPU 101 determines whether the user voice contains a portion that differs from the model voice (step S105). When the user voice contains such a portion (step S105: YES), the CPU 101 performs the difference emphasis processing described below (step S106). The CPU 101 performs the processing of steps S105 and S106 over all frames (step S108). As a result, a voice in which the differences are emphasized is reproduced. When the user's pronunciation contains no poor portion (step S105: NO), the CPU 101 displays a message such as "Good" on the display 105 and ends the processing.

The difference emphasis processing in step S106 is performed, for example, as follows. Because the purpose is for the user to check his or her own pronunciation, the user voice is basically what is reproduced in step S107, described later. However, in portions that differ from the model voice, the model voice is reproduced instead of the user voice as the emphasis processing. This has two effects: the user can identify concretely the portions that differed from the model voice, and the user can be shown the correct pronunciation for those portions. Specifically, the difference emphasis processing is performed as follows. The CPU 101 multiplies the model voice and the user voice, whose lengths were equalized in step S103, by their respective volume coefficients and adds the results. The volume coefficient is a parameter indicating the volume of the reproduced voice. For example, when the volume coefficient of the user voice is 1 and that of the model voice is 0, only the user voice is reproduced from the speaker 108. Conversely, when the volume coefficient of the user voice is 0 and that of the model voice is 1, only the model voice is reproduced from the speaker 108.

The CPU 101 refers to data D and determines the volume coefficients for each frame. For the user voice data, when data D indicates that there is a difference, the CPU 101 sets the volume coefficient of that frame to 0; when data D indicates that there is no difference from the model voice, the CPU 101 sets the volume coefficient of that frame to 1. For the model voice data, when data D indicates that there is a difference, the CPU 101 sets the volume coefficient of that frame to 1; when data D indicates that there is no difference, the CPU 101 sets the volume coefficient of that frame to 0. The CPU 101 multiplies the user voice data and the model voice data by the volume coefficients thus determined and mixes the two. The CPU 101 outputs the resulting mixed voice data to the audio processing unit 107.
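
The per-frame mixing with volume coefficients of 0 and 1 can be sketched as follows. Frames are assumed to be equal-length lists of samples, and the function name `mix_with_emphasis` is hypothetical.

```python
def mix_with_emphasis(model_frames, user_frames, data_d):
    """Mix length-equalized voices frame by frame according to the flags in data D."""
    mixed = []
    for model, user, differs in zip(model_frames, user_frames, data_d):
        model_gain = 1.0 if differs else 0.0  # model voice is heard where a difference exists
        user_gain = 1.0 - model_gain          # user voice is heard everywhere else
        mixed.append([model_gain * m + user_gain * u for m, u in zip(model, user)])
    return mixed


# Frame 0 has no difference (user voice is kept); frame 1 differs (model voice replaces it).
mixed = mix_with_emphasis([[1, 1], [2, 2]], [[5, 5], [6, 6]], [False, True])
```

Intermediate coefficient values, for example 0.8 and 0.2, would give the variant in which the model voice is merely made louder or quieter than the user voice rather than replacing it outright.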

Subsequently, the audio processing unit 107 performs digital-to-analog conversion on the input mixed voice data and outputs the result to the speaker 108 as an audio signal. The voice that has undergone the emphasis processing is reproduced from the speaker 108 (step S107). By listening to this voice, the user can identify concretely the portions in which his or her own voice differs from the model voice, and can learn the correct pronunciation for those portions.

<Second Embodiment>
Next, a second embodiment of the present invention will be described. In the following description, elements common to the first embodiment are given the same reference numerals, and their description is omitted.
FIG. 8 is a block diagram showing the functional configuration of a language learning device 2 according to the second embodiment of the present invention. Only the parts that differ from the language learning device 1 of the first embodiment are described. On the basis of the differences extracted from the model voice data and the user voice data by the difference extraction unit 40, the difference emphasizing unit 51 generates image data representing an image that visualizes the model voice and the user voice and in which the differences between them are emphasized. The difference display unit 61 displays an image based on the image data generated by the difference emphasizing unit 51. The hardware configuration of the language learning device 2 is the same as that of the language learning device 1 shown in FIG. 2, so its description is omitted.

FIG. 9 is a flowchart showing the operation of the language learning device 2 according to the second embodiment of the present invention. The processing in steps S101 to S105 is the same as in the first embodiment, so its description is omitted. When the user voice contains a portion that differs from the model voice (step S105: YES), the CPU 101 performs difference emphasis processing (step S206). In this embodiment, the differences between the model voice and the user voice are represented visually as an image.

FIG. 10 is a diagram illustrating an image output in this embodiment. The model voice image A and the user voice image B are displayed one above the other. Each voice image consists mainly of two diagrams: a diagram showing volume and pitch (A-1, A-2) and a diagram showing frequency characteristics (B-1, B-2). In all of these diagrams, the horizontal direction is the time axis. In step S206, the CPU 101 generates image data representing an image such as that shown in FIG. 10 based on the model voice data and the user voice data. Specifically, the CPU 101 calculates the pitch and the volume for each frame from each set of voice data. As shown in FIG. 10, the CPU 101 determines the width of a figure according to the volume and the vertical position (coordinate) of the figure according to the pitch. The CPU 101 then generates image data that displays a figure of the determined width at the determined vertical position, at the position on the time axis corresponding to the frame being processed.
The CPU 101 also performs frequency analysis by applying a Fourier transform to the voice waveform for each frame. That is, from the spectrum obtained by the Fourier transform, it finds the frequency at which the amplitude is largest (the maximum amplitude frequency). The CPU 101 converts the maximum amplitude frequency into an identifier (color code) that specifies a color to be displayed on the display 105. For example, a table that associates frequencies with color codes is stored in advance in the HDD 104, and the CPU 101 refers to this table to convert the maximum amplitude frequency into a color code. The CPU 101 generates image data that displays the color specified by this color code at the position corresponding to the frame being processed. By changing the display color or brightness at the points where the model voice and the user voice differ, the device allows the user to recognize visually the differences between the model voice and his or her own voice.
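The conversion from a frame to a color code can be sketched as follows. This is only an illustration under stated assumptions: a direct discrete Fourier transform stands in for whatever Fourier transform the device uses, and the table `COLOR_TABLE`, its frequency bands, its color values, and both function names are hypothetical stand-ins for the table said to be stored in the HDD 104.

```python
import cmath
import math

# Hypothetical table: (exclusive upper frequency bound in Hz, color code).
COLOR_TABLE = [
    (250.0, "#0000ff"),
    (500.0, "#00ff00"),
    (1000.0, "#ffff00"),
    (float("inf"), "#ff0000"),
]


def max_amplitude_frequency(frame, sample_rate):
    """Return the frequency (Hz) of the DFT bin with the largest magnitude."""
    n = len(frame)
    best_bin, best_mag = 0, -1.0
    # Only bins 0..n/2 are needed for a real-valued frame.
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n


def color_code_for(freq_hz):
    """Look up the color code of the first band whose upper bound exceeds freq_hz."""
    for upper, code in COLOR_TABLE:
        if freq_hz < upper:
            return code
    return COLOR_TABLE[-1][1]


if __name__ == "__main__":
    rate = 8000
    frame = [math.sin(2 * math.pi * 500 * t / rate) for t in range(64)]
    peak = max_amplitude_frequency(frame, rate)
    print(peak, color_code_for(peak))
```

A direct DFT costs on the order of n squared operations per frame; a fast Fourier transform would normally be preferred for real signals, but the lookup step would be unchanged.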

The CPU 101 outputs the generated image data to the display 105. The display 105 displays an image such as that shown in FIG. 10 in accordance with the image data (step S207).
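The placement of one figure per frame in step S206 can be sketched as below, assuming the pitch of each frame has already been estimated by some other means. The pixel sizes, the pitch range, the use of root-mean-square amplitude as the volume, and the names `frame_rms` and `frame_figure` are all assumptions made for illustration; the specification states only that the width follows the volume and the vertical position follows the pitch.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of a frame, used here as its volume."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def frame_figure(frame_index, pitch_hz, volume, frame_px=4, height_px=200,
                 min_hz=50.0, max_hz=500.0, max_width_px=40.0, max_volume=1.0):
    """Return the placement of the figure drawn for one frame.

    x     : position on the horizontal time axis
    y     : vertical position, with higher pitch drawn nearer the top (y = 0)
    width : figure width, proportional to the volume
    """
    clamped = min(max(pitch_hz, min_hz), max_hz)
    ratio = (clamped - min_hz) / (max_hz - min_hz)
    return {
        "x": frame_index * frame_px,
        "y": height_px * (1.0 - ratio),
        "width": max_width_px * min(volume / max_volume, 1.0),
    }


if __name__ == "__main__":
    print(frame_figure(3, 275.0, frame_rms([0.5, -0.5, 0.5, -0.5])))
```

Pitch values outside the assumed range are clamped so that every figure stays inside the drawing area.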

<Modifications>
The present invention is not limited to the embodiments described above, and various modifications are possible.
In each of the embodiments described above, the functions of the language learning device 1 or the language learning device 2 are realized by the CPU 101 executing the language learning program stored in the HDD 104. However, the functions of the language learning device may instead be realized in hardware, using electronic circuits or the like corresponding to the functional components shown in FIG. 1 or FIG. 8.

In each of the embodiments described above, the process of aligning the lengths of the model voice and the user voice (step S103) was described as matching the length of the user voice to that of the model voice. However, the length of the model voice may instead be matched to that of the user voice.

In the first embodiment, the processing for emphasizing the differences between the model voice and the user voice was described as being performed by volume adjustment, but additional processing may also be performed. For example, when the portion in which the model voice and the user voice differ is a consonant, a specific frequency region may be amplified when the model voice is reproduced so that the model voice is heard more clearly. Alternatively, since it is generally known that, for example, the positions of the second and third formants differ between the pronunciations of l and r in English, emphasis processing that amplifies a specific formant may be performed. This emphasis processing may be performed by the CPU 101 in accordance with the language learning program, or it may be realized in hardware using a band-pass filter and an amplifier.
Also, in the mode in which differences are emphasized by volume adjustment, the volume is not limited to the two levels "0" and "1"; intermediate volume values may also be used.
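One software way to amplify a specific frequency region is sketched below. It is not the band-pass filter and amplifier arrangement mentioned in the text: instead, a single frame is transformed with a direct discrete Fourier transform, the bins inside the chosen band are scaled, and the frame is transformed back. The function name `amplify_band`, its parameters, and the band limits used in the example are hypothetical.

```python
import cmath
import math


def amplify_band(frame, sample_rate, low_hz, high_hz, gain):
    """Scale the spectral content of one frame between low_hz and high_hz."""
    n = len(frame)
    spectrum = [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bin k and its mirror n - k carry the same frequency for real input,
        # so both are scaled to keep the output real-valued.
        freq = min(k, n - k) * sample_rate / n
        if low_hz <= freq <= high_hz:
            spectrum[k] *= gain
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


if __name__ == "__main__":
    rate, n = 8000, 64
    tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(n)]
    boosted = amplify_band(tone, rate, 900.0, 1100.0, 2.0)
    print(max(abs(s) for s in boosted))
```

Processing frames independently in this way can introduce discontinuities at frame boundaries, so a practical version would likely use overlapping windows or a time-domain filter; that choice is left open here.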

In the second embodiment, the model voice image and the user voice image were described as being displayed one above the other, but the manner of displaying the two images is not limited to this. For example, the two images may be arranged side by side, or they may be superimposed. When they are superimposed, it is desirable to use different display colors for the model voice and the user voice.
In the mode in which the model voice image and the user voice image are displayed together, the example in FIG. 10 displays the image showing pitch and volume and the image showing frequency characteristics separately, but these may be expressed as a single image. For example, in the second embodiment the image showing pitch and volume is displayed in a single color, but it may instead be colored to indicate the frequency characteristics.
The voice image is not limited to that illustrated in FIG. 10. For example, the voice waveform may be displayed directly.
The method of frequency analysis in step S206 is not limited to that described in the second embodiment. For example, the spectral envelope of the voice waveform may be obtained for each frame and a formant frequency determined from it. As the formant frequency, for example, any of the first to third formants can be used.

It is also possible to provide a language learning device that has both the difference emphasis function described in the first embodiment and the image generation and display function described in the second embodiment.

FIG. 1 is a block diagram showing the functional configuration of the language learning device 1 according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the language learning device 1.
FIG. 3 is a diagram showing the contents of the model voice database DB1.
FIG. 4 is a flowchart showing the operation of the language learning device 1 according to the first embodiment.
FIG. 5 is a diagram illustrating the waveforms of the model voice (A) and the user voice (B).
FIG. 6 is a flowchart showing the processing of step S103 in more detail.
FIG. 7 is a diagram illustrating the spectral envelopes of the model voice (top) and the user voice (bottom).
FIG. 8 is a block diagram showing the functional configuration of the language learning device 2 according to the second embodiment of the present invention.
FIG. 9 is a flowchart showing the operation of the language learning device 2 according to the second embodiment.
FIG. 10 is a diagram illustrating an image output in the second embodiment.

Explanation of symbols

1: language learning device, 2: language learning device, 10: storage unit, 20: input unit, 30: voice processing unit, 40: difference extraction unit, 50: difference emphasizing unit, 51: difference emphasizing unit, 60: voice output unit, 61: difference display unit, 101: CPU, 102: RAM, 103: ROM, 104: HDD, 105: display, 106: microphone, 107: voice processing unit, 108: speaker, 109: keyboard, 110: bus

Claims

A language learning device comprising:
storage means for storing a model voice;
input means for inputting a user voice;
separating means for separating each of the model voice and the user voice into consonant parts and vowel parts;
processing means for changing the length of the vowel parts of the model voice or the user voice so that the model voice and the user voice have the same length;
comparison means for comparing the model voice and the user voice whose lengths have been aligned by the processing means, and extracting differences;
difference emphasizing means for emphasizing the differences extracted by the comparison means; and
output means for outputting a voice in which the differences have been emphasized by the difference emphasizing means,
wherein the difference emphasizing means makes the volume of the model voice higher than the volume of the user voice for portions from which a difference is extracted by the comparison means, and makes the volume of the model voice lower than the volume of the user voice for portions from which no difference is extracted by the comparison means.

The language learning device according to claim 1, further comprising rearrangement means for rearranging the consonant parts of one of the model voice and the user voice so that their positions on the time axis coincide with those of the consonant parts of the other,
wherein the processing means makes the model voice and the user voice the same length by changing the length of each vowel part located between a consonant part rearranged by the rearrangement means and the consonant part that appears next to it on the time axis.

The language learning device according to claim 1, further comprising:
image generating means for extracting a predetermined parameter from the user voice and the model voice and generating image data representing a figure corresponding to the amount of change in the parameter; and
display means for displaying the image data generated by the image generating means, using a different display mode at the differences extracted by the comparison means.

The language learning device according to claim 1, wherein the difference emphasizing means amplifies a specific frequency region of the model voice for portions from which a difference is extracted by the comparison means.