JP2007139868A

JP2007139868A - Language learning device

Info

Publication number: JP2007139868A
Application number: JP2005330100A
Authority: JP
Inventors: Takaya Kakizaki; 貴也柿崎; Naohiro Emoto; 直博江本
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2005-11-15
Filing date: 2005-11-15
Publication date: 2007-06-07

Abstract

PROBLEM TO BE SOLVED: To provide a language learning device with which a learner can easily grasp the difference between the learner's pronunciation and a model pronunciation.

SOLUTION: A band with a word displayed inside it represents the model pronunciation, and a filled band represents the learner's pronunciation. The vertical position of the band containing the word is determined according to the pitch stored in association with each word, and the vertical position of the filled band is determined according to the pitch obtained from the learner's pronunciation. When the pitch of the model pronunciation is higher than that of the learner's pronunciation, as with the band containing the word "critical", the band representing the model pronunciation is displayed above the band representing the learner's pronunciation.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a technique for supporting language learning.

In learning a foreign language or one's native language, and particularly in self-study of pronunciation or speaking, a widely used method is to play back a model voice recorded on a recording medium such as a CD (Compact Disc) and to pronounce or speak in imitation of it. The aim is to acquire correct pronunciation by imitating the model voice. To learn more effectively, however, learners need to grasp their own voice concretely, know how it differs from the model voice, and bring their pronunciation closer to the correct one.

Techniques that enable a learner to grasp his or her own voice concretely include those disclosed in Patent Documents 1 and 2. Patent Document 1 discloses a technique for displaying the learner's speech waveform, power, pitch, and the like. Patent Document 2 discloses a technique for displaying the speech waveform of a model voice and the speech waveform of the learner's voice side by side.
Patent Document 1: Japanese Patent Laid-Open No. 1-154189
Patent Document 2: Japanese Patent Laid-Open No. 2002-23613

According to the technique disclosed in Patent Document 1, the speech waveform, power, pitch, and the like are visualized, so the learner can grasp his or her own voice. According to the technique disclosed in Patent Document 2, the learner can compare speech waveforms and see the difference between the model voice and his or her own voice. With the technique of Patent Document 1, however, it is difficult for an ordinary learner without specialized knowledge of speech waveforms to work out from differences in waveforms and the like how to improve pronunciation, to grasp concretely how his or her voice differs from the model voice, and to identify the poorly pronounced parts. With the technique of Patent Document 2, the waveform of the model voice and the waveform of the learner's voice are displayed side by side, but an ordinary learner without specialized knowledge of speech waveforms still has difficulty grasping the difference between the waveforms, and it remains hard for learners to see for themselves how to improve their pronunciation.

The present invention has been made against the background described above, and its object is to provide a technique that enables a learner to easily grasp the differences between a model voice and the learner's voice.

To solve the problem described above, the present invention provides a language learning device comprising: storage means storing an example sentence of conversation and the pitch of the uttered voice of each word in the example sentence; voice input means to which a voice is input; uttered-voice extraction means for extracting, from the voice input to the voice input means, the uttered voice corresponding to each word in the example sentence; pitch extraction means for extracting the pitch of each uttered voice extracted by the uttered-voice extraction means; display means for displaying the example sentence stored in the storage means in horizontal layout, displaying in horizontal layout a plurality of band-shaped figures representing the uttered voices of the words extracted by the uttered-voice extraction means, and displaying each word in the example sentence in association with the band-shaped figure representing the uttered voice of that word; and display position determining means for determining the vertical display position of each word in the example sentence according to the pitch of the uttered voice of that word stored in the storage means, and determining the vertical display positions of the plurality of band-shaped figures according to the pitch of each word extracted by the pitch extraction means.

The present invention also provides a language learning device comprising: storage means storing an example sentence of conversation and the utterance time of the uttered voice of each word in the example sentence; voice input means to which a voice is input; uttered-voice extraction means for extracting, from the voice input to the voice input means, the uttered voice corresponding to each word in the example sentence; utterance time extraction means for extracting the utterance time of each uttered voice extracted by the uttered-voice extraction means; display means for displaying the example sentence stored in the storage means in horizontal layout, displaying in horizontal layout a plurality of band-shaped figures representing the uttered voices of the words extracted by the uttered-voice extraction means, and displaying each word in the example sentence in association with the band-shaped figure representing the uttered voice of that word; and display length determining means for determining the display length of each word in the example sentence according to the utterance time of the uttered voice of that word stored in the storage means, and determining the display lengths of the plurality of band-shaped figures according to the utterance time of each word extracted by the utterance time extraction means.

The present invention further provides a language learning device comprising: storage means storing an example sentence and the pitch of the uttered voice of each phoneme or syllable in the example sentence; voice input means to which a voice is input; uttered-voice extraction means for extracting, from the voice input to the voice input means, the uttered voice corresponding to each phoneme or syllable in the example sentence; pitch extraction means for extracting the pitch of each phoneme or syllable extracted by the uttered-voice extraction means; display means for displaying the example sentence stored in the storage means in horizontal layout, displaying in horizontal layout a plurality of band-shaped figures representing the uttered voices of the phonemes or syllables extracted by the uttered-voice extraction means, and displaying each phoneme or syllable in the example sentence in association with the band-shaped figure representing its uttered voice; and display position determining means for determining the vertical display position of each phoneme or syllable in the example sentence according to the pitch of its uttered voice stored in the storage means, and determining the vertical display positions of the plurality of band-shaped figures according to the pitch of each phoneme or syllable extracted by the pitch extraction means.

According to the present invention, the learner can easily grasp the differences between the model voice and the learner's voice.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[Configuration of the embodiment]
FIG. 1 is a block diagram illustrating the hardware configuration of a language learning device 1 according to an embodiment of the present invention. As shown in FIG. 1, each unit of the language learning device 1 is connected to a bus 101, and the units exchange signals and data with one another via the bus 101.

The microphone 109 is connected to the audio processing unit 108, converts input sound into an electrical signal (hereinafter referred to as an audio signal), and outputs it to the audio processing unit 108. The speaker 110 is connected to the audio processing unit 108 and outputs sound corresponding to the signal output from the audio processing unit 108. The audio processing unit 108 has a function of converting the audio signal input from the microphone 109 into digital data (hereinafter referred to as learner data) and outputting it, and a function of converting digital data representing audio into an analog audio signal and outputting it to the speaker 110.

The display unit 106 includes a display device such as a liquid crystal display and, under the control of the CPU 102, displays character strings, various messages, a menu screen for operating the language learning device 1, and the like. The input unit 107 includes input devices such as a keyboard and a mouse (neither shown) and outputs to the CPU 102 a signal corresponding to the operation performed, such as a key press or a mouse operation.

The storage unit 105 includes, for example, an HDD (Hard Disk Drive) device and stores various data. Specifically, the storage unit 105 stores the learner data output from the audio processing unit 108. The storage unit 105 also stores example sentence text data representing example sentences used for language learning, and digital data (hereinafter referred to as model data) representing the voice of a native speaker reading out an example sentence (hereinafter referred to as the model voice). The storage unit 105 stores an example sentence table TB1 in the format illustrated in FIG. 2. This table holds, in association with one another, the example sentence text data, the file name of the model data representing the voice of the native speaker reading out the example sentence, and an identifier that uniquely identifies each item of example sentence text data. The storage unit 105 further stores word text data indicating the words contained in each example sentence, the pitch of the pronunciation of each word in the model voice, and the utterance start time of each word in the model voice. These are held in a word table TB2 in the format illustrated in FIG. 3, which associates the identifier of an example sentence with the text data of each word contained in that sentence and with the pitch and utterance start time of each word.
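The two tables described above can be sketched as simple in-memory records. This is only an illustrative sketch: the field names, the use of hertz for pitch, the pitch values, and the pairing of identifier "001" with the Panama sentence are assumptions made for the example, not details given in FIG. 2 or FIG. 3. Only the identifier "001", the file name "a001", and the start times 0.0 sec and 0.3 sec come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleSentence:
    """One row of the example sentence table TB1 (field names are hypothetical)."""
    sentence_id: str   # identifier that uniquely identifies the sentence
    text: str          # example sentence text data
    model_file: str    # file name of the model data


@dataclass(frozen=True)
class WordEntry:
    """One row of the word table TB2 (field names are hypothetical)."""
    sentence_id: str   # identifier of the example sentence
    word: str          # word text data
    pitch_hz: float    # pitch of the model pronunciation (unit assumed)
    start_sec: float   # utterance start time in the model voice


# Illustrative contents; the pitch values are made up.
TB1 = {
    "001": ExampleSentence(
        "001",
        "The critical region for ecology though is the east end of Panama.",
        "a001",
    ),
}
TB2 = [
    WordEntry("001", "The", 120.0, 0.0),
    WordEntry("001", "critical", 180.0, 0.3),
]


def words_for(sentence_id: str) -> list[WordEntry]:
    """Return the TB2 rows for one example sentence, in stored order."""
    return [w for w in TB2 if w.sentence_id == sentence_id]
```

Looking up `TB1["001"].model_file` corresponds to reading the model data file name for the selected sentence, and `words_for("001")` yields the per-word pitch and start time used later for display.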

The CPU (Central Processing Unit) 102 executes a program stored in the ROM (Read Only Memory) 103, using the RAM (Random Access Memory) 104 as a work area. When the CPU 102 executes the program, it controls each unit, and the function of displaying the differences between the model voice and the learner's voice is realized.

[Operation of the embodiment]
Next, the operation of this embodiment will be described.
First, when the learner performs an operation instructing display of a list of example sentences, the CPU 102 reads the example sentence text data stored in the example sentence table TB1 (FIG. 4: step SA1) and displays a list of the example sentences represented by the read data on the display unit 106 (step SA2). When the learner then operates the input unit 107 to select one of the displayed example sentences (step SA3; YES), the CPU 102 identifies the selected example sentence based on the screen displayed on the display unit 106 and the signal sent from the input unit 107 (step SA4). Having identified the selected example sentence, the CPU 102 reads from the example sentence table TB1 the file name of the model data stored in association with that sentence (step SA5). For example, when the example sentence with the identifier "001" is selected in the table shown in FIG. 2, the file name "a001" is read.

Next, the CPU 102 reads the model data specified by the read file name from the storage unit 105 and outputs it to the audio processing unit 108 (step SA6). When the model data is input to the audio processing unit 108, the model data, which is digital data, is converted into an analog signal and output to the speaker 110, and the model voice is reproduced from the speaker 110.

When reproduction of the model voice ends, the CPU 102 controls the display unit 106 to display a message prompting the learner to pronounce the example sentence, for example, "Press a key, then speak, and press the key again when you have finished speaking" (step SA7). After listening to the model voice output from the speaker 110, the learner operates the input unit 107 according to the message and reads the example sentence aloud in imitation of the model voice. When the learner speaks, the learner's voice (hereinafter referred to as the learner voice) is converted into an audio signal by the microphone 109, and the converted signal is output to the audio processing unit 108. On receiving the audio signal output from the microphone 109, the audio processing unit 108 converts it into learner data, which is digital data. This learner data is output from the audio processing unit 108 and stored in the storage unit 105.

When the learner finishes speaking and operates the input unit 107 (step SA8; YES), the CPU 102 adjusts the length of the voice represented by the learner data, processing the learner data so that the length of the model voice represented by the model data and the length of the learner voice represented by the learner data become the same (step SA9). FIG. 5 illustrates the waveform of the model voice and the waveform of the learner voice input to the microphone 109. In FIG. 5, both waveforms were produced by uttering the same example sentence, but because the speaking rates differ, the waveforms differ in length. The CPU 102 analyzes the model data and the learner data and obtains the difference between the length of the model voice and the length of the learner voice (Δt in FIG. 5). When, as shown in FIG. 5, the learner voice is longer than the model voice by Δt, the CPU 102 shortens the learner voice by Δt.
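The description does not say how the learner data is shortened or lengthened in step SA9. As one hedged illustration, the sketch below rescales a sequence of values (for example, a per-frame feature track) to a target length by linear interpolation. The function name `match_length` is hypothetical, and a real device would more likely use a pitch-preserving time-stretch, since naive resampling of raw audio also shifts its pitch.

```python
def match_length(learner: list[float], target_len: int) -> list[float]:
    """Rescale `learner` to `target_len` values by linear interpolation.

    Assumption: this stands in for the unspecified length adjustment of
    step SA9; it is not the method stated in the description.
    """
    if target_len <= 0 or not learner:
        raise ValueError("need a non-empty input and a positive target length")
    if target_len == 1 or len(learner) == 1:
        return [learner[0]] * target_len

    # Map output index i onto a fractional position in the input sequence.
    scale = (len(learner) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(learner) - 1)
        frac = pos - lo
        out.append(learner[lo] * (1.0 - frac) + learner[hi] * frac)
    return out
```

Here `target_len` would be the length of the model voice, so that a learner voice that is Δt longer is compressed to the same length before frame matching.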

Next, the CPU 102 divides the waveform of the model voice and the waveform of the learner voice into a plurality of frames at predetermined time intervals, as shown in FIG. 6. It then associates the waveform of each frame of the model voice with the waveform of each frame of the learner voice using the DP (Dynamic Programming) matching method (step SA10). In the waveforms illustrated in FIG. 6, for example, frame A1 of the model voice is associated with frame B1 of the learner voice, and frame A3 of the model voice is associated with frame B4 of the learner voice.
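The frame association of step SA10 can be sketched with a basic dynamic-programming alignment (often called dynamic time warping). The sketch assumes each frame has already been reduced to a single number (for example, its energy) and uses the absolute difference as the local distance; the description does not specify the frame features or the distance measure, so both are assumptions.

```python
def dp_match(model: list[float], learner: list[float]) -> list[tuple[int, int]]:
    """Align two per-frame feature sequences by DP matching.

    Returns (model_index, learner_index) pairs, 0-based, from first to last frame.
    """
    n, m = len(model), len(learner)
    inf = float("inf")
    # cost[i][j]: minimum accumulated distance aligning model[:i] with learner[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(model[i - 1] - learner[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Trace back from the last frame pair to recover the alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


# The learner holds the first sound for two frames, so model frame 0 maps to
# learner frames 0 and 1, and the remaining frames shift by one.
print(dp_match([1.0, 5.0, 9.0], [1.0, 1.0, 5.0, 9.0]))
# [(0, 0), (0, 1), (1, 2), (2, 3)]
```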

When the association between the model voice and the learner voice is complete, the CPU 102 divides each speech waveform into the pronunciations of the individual words (step SA11). Specifically, for the model voice, it first reads the utterance start times from the word table TB2. Suppose the example sentence selected by the learner is "The critical region for ecology though is the east end of Panama." In this case, the utterance start time "0.0 sec" of "The" is read from the word table TB2 first. As shown in FIG. 6, the CPU 102 adds information indicating a word boundary (hereinafter referred to as word boundary information C) to the frame at the "0.0 sec" position of the speech waveform (frame A1). Next, the CPU 102 reads the utterance start time "0.3 sec" of "critical" from the word table TB2 and adds word boundary information C to the frame corresponding to the position 0.3 sec after the start of the utterance (frame A3).

After adding word boundary information C for the model voice up to the last word, "Panama", the CPU 102 adds word boundary information for the learner voice. First, the CPU 102 extracts the frames of the model voice to which word boundary information has been added. It then identifies the corresponding frame in the learner voice and adds word boundary information C to the identified frame. For example, when frame A1 with word boundary information C is extracted, the CPU 102 identifies frame B1, with which frame A1 was associated in step SA10, and adds word boundary information C to frame B1. Likewise, when frame A3 with word boundary information C is extracted, the CPU 102 identifies frame B4, with which frame A3 was associated in step SA10, and adds word boundary information C to frame B4.
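The boundary transfer of step SA11 can be sketched as two small lookups. The frame duration is not stated in the description; 150 ms is assumed here only because it places 0.3 sec in the third frame (A3), matching the example. Times are kept in integer milliseconds to avoid floating-point rounding, and the function names are hypothetical.

```python
def model_boundary_frames(start_times_ms: list[int], frame_ms: int) -> list[int]:
    """Return the 0-based model frame index containing each word's start time."""
    return [t // frame_ms for t in start_times_ms]


def learner_boundary_frames(
    model_bounds: list[int], path: list[tuple[int, int]]
) -> list[int]:
    """Map each model boundary frame to the first learner frame aligned with it.

    `path` is a list of (model_frame, learner_frame) pairs from frame matching.
    """
    first_match: dict[int, int] = {}
    for model_idx, learner_idx in path:
        first_match.setdefault(model_idx, learner_idx)
    return [first_match[b] for b in model_bounds]


# Alignment matching the FIG. 6 example: A1-B1, A2-B2, A2-B3, A3-B4 (0-based here).
path = [(0, 0), (1, 1), (1, 2), (2, 3)]
model_bounds = model_boundary_frames([0, 300], frame_ms=150)   # "The", "critical"
print(model_bounds)                                  # [0, 2]  -> frames A1 and A3
print(learner_boundary_frames(model_bounds, path))   # [0, 3]  -> frames B1 and B4
```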

Having added word boundary information to the frames and divided the speech waveforms into the pronunciations of the individual words, the CPU 102 first calculates the utterance time of each word in the model voice (step SA12). For "The" in the model voice, for example, the waveform from frame A1 to frame A2 is extracted as the waveform representing the pronunciation of "The". The extracted waveform is then analyzed and the utterance time is calculated.

Next, the CPU 102 calculates the pitch and the utterance time of each word in the learner voice (step SA13). For "The" in the learner voice, for example, the waveform from frame B1 to frame B3 is extracted as the waveform representing the pronunciation of "The". The extracted waveform is then analyzed, and the pitch and the utterance time are calculated.
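The description does not name the pitch analysis method used in step SA13. The sketch below uses plain autocorrelation, a common choice, purely as a stand-in; the search range of 50 to 500 Hz, the function names, and the 150 ms frame duration in the test values are all assumptions.

```python
import math


def estimate_pitch_hz(
    samples: list[float], sample_rate: int, fmin: float = 50.0, fmax: float = 500.0
) -> float:
    """Estimate pitch as the lag with the largest autocorrelation (0.0 if none)."""
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def utterance_ms(first_frame: int, last_frame: int, frame_ms: int) -> int:
    """Utterance time of a word spanning first_frame..last_frame inclusive."""
    return (last_frame - first_frame + 1) * frame_ms


# A synthetic 200 Hz tone sampled at 8 kHz stands in for one word's waveform.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(800)]
pitch = estimate_pitch_hz(tone, 8000)
duration = utterance_ms(0, 2, frame_ms=150)   # learner "The": frames B1 to B3
```

Real speech is far less regular than a pure tone, so a practical implementation would add voicing detection and smoothing, which this sketch omits.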

When calculation of the pitch and utterance time of each word is complete, the CPU 102 displays the pitch and utterance time word by word according to the values obtained, as illustrated in FIG. 7 (step SA14).
In the pitch display part A1 of FIG. 7, a band with a word displayed inside it represents the pronunciation of the model voice, and a filled band represents the learner's pronunciation. The vertical position of a band containing a word is determined according to the pitch stored in association with that word in the word table TB2, and the vertical position of a filled band is determined according to the pitch of the learner's voice obtained in step SA13. Each band is placed on the screen relative to a predetermined display position according to the pitch of the pronunciation: the higher the pitch, the higher the band is displayed, and the lower the pitch, the lower the band is displayed. For example, the position of the band containing "The" is determined according to the pitch stored in association with "The" in the word table TB2, and the position of the band representing the learner's pronunciation of "The" is determined according to the pitch calculated in step SA13. When the pitch of the model voice and the pitch of the learner's voice match, the band indicating the learner's pronunciation is not displayed. For example, when the pitch of the learner's pronunciation of "The" is the same as the pitch of "The" in the model voice, no filled band is displayed, as illustrated in FIG. 7.

When the pitch of the model voice is higher than the pitch of the learner's voice, the band indicating the pronunciation of the model voice is displayed above the band indicating the learner's pronunciation, as with the band containing the word "critical" in FIG. 7. When the pitch of the model voice is lower than the pitch of the learner's voice, the band indicating the pronunciation of the model voice is displayed below the band indicating the learner's pronunciation, as with the band containing the word "region" in FIG. 7.
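The placement rule for the pitch display can be sketched as a mapping from pitch to a vertical screen coordinate. The linear mapping, the baseline position, the reference pitch, the scale, and the dictionary layout are all assumptions made for illustration; the description only requires that a higher pitch be drawn higher and that the learner's band be omitted when the pitches match.

```python
def band_y(pitch_hz: float, baseline_y: int = 200, ref_hz: float = 150.0,
           px_per_hz: float = 0.5) -> int:
    """Map a pitch to a y coordinate; smaller y means higher on the screen."""
    return round(baseline_y - (pitch_hz - ref_hz) * px_per_hz)


def pitch_bands(word: str, model_hz: float, learner_hz: float,
                tol_hz: float = 0.0) -> list[dict]:
    """Return the bands to draw for one word in the pitch display part."""
    model_band = {"label": word, "y": band_y(model_hz), "filled": False}
    if abs(model_hz - learner_hz) <= tol_hz:
        # Pitches match: only the model band (with the word inside) is shown.
        return [model_band]
    learner_band = {"label": "", "y": band_y(learner_hz), "filled": True}
    return [model_band, learner_band]


# Model pitch higher than learner pitch: the model band gets the smaller y,
# so it is drawn above the filled learner band.
model_band, learner_band = pitch_bands("critical", model_hz=180.0, learner_hz=160.0)
assert model_band["y"] < learner_band["y"]
```

A tolerance greater than zero (`tol_hz`) would treat nearly equal pitches as matching, which may be more practical than exact equality, though the description does not mention one.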

In the utterance time display part A2 of FIG. 7, a band with a word displayed inside it represents the utterance time of the model voice, and a colored band represents the learner's utterance time.
The length of a band representing the utterance time of the model voice is determined according to the utterance time calculated in step SA12, and the length of a colored band is determined according to the utterance time calculated in step SA13. For example, the length of the band containing "The" is determined according to the utterance time calculated in step SA12, and the length of the band representing the learner's pronunciation of "The" is determined according to the utterance time calculated in step SA13. When the utterance time of the model voice matches the learner's utterance time, the two bands have the same length. For example, when the learner's utterance time for "region" is the same as that of "region" in the model voice, the band representing the utterance time of "region" in the model voice and the band representing the learner's utterance time for "region" have the same length, as illustrated in FIG. 7.

When the utterance time of the model voice is longer than the learner's utterance time, the band indicating the utterance time of the model voice is displayed longer than the band indicating the learner's utterance time, as with the band containing the word "critical" in FIG. 7. When the utterance time of the model voice is shorter than the learner's utterance time, the band indicating the utterance time of the model voice is displayed shorter than the band indicating the learner's utterance time, as with the band containing the word "though" in FIG. 7.
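The length rule for the utterance time display can be sketched in the same spirit. The scale of 20 pixels per 100 ms and the millisecond values are assumptions; the description only requires that band length follow utterance time.

```python
def band_width(utterance_ms: int, px_per_100ms: int = 20) -> int:
    """Convert an utterance time in milliseconds to a band width in pixels."""
    return utterance_ms * px_per_100ms // 100


def compare_durations(model_ms: int, learner_ms: int) -> str:
    """Describe how the model band relates to the learner band in length."""
    if model_ms > learner_ms:
        return "model longer"
    if model_ms < learner_ms:
        return "model shorter"
    return "same length"


# Hypothetical utterance times for three words of the example sentence.
for word, model_ms, learner_ms in [
    ("critical", 450, 300),
    ("region", 300, 300),
    ("though", 150, 300),
]:
    print(word, band_width(model_ms), band_width(learner_ms),
          compare_durations(model_ms, learner_ms))
```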

As described above, according to this embodiment, the pitch of the model voice and the pitch of the learner's voice are displayed together for each word, so the learner can easily see where his or her voice differs from the model voice and can grasp the differences concretely. Likewise, because the utterance time of the model voice and the utterance time of the learner's voice are displayed together, the learner can easily see where the utterance times differ from those of the model voice and can grasp those differences concretely.

[Modifications]
An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment and can be implemented in various other forms. For example, the embodiment described above may be modified as follows.

In the embodiment described above, both the pitch and the utterance time are displayed on one screen, but only the pitch or only the pronunciation speed may be displayed, and the learner may be allowed to select, by an operation, whether the pitch or the pronunciation speed is displayed. In addition, as illustrated in FIG. 8, an icon notifying the learner that the pronunciation differs from the model voice may be displayed at the portion whose pronunciation differs from the model voice.

In the embodiment described above, the pitch and the utterance time are displayed for each word, but the model voice may instead be stored for each phoneme or syllable and compared with the learner's voice, so that the pitch and the utterance time are displayed for each phoneme.

When the model voice or the learner's voice contains a silent section, the silent section may be represented by a blank band, as shown in FIG. 9.
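One way to derive such blank bands is sketched below in Python. The specification does not state how silent sections are detected, so this sketch simply assumes that a start and end time is already known for each word and treats any sufficiently long gap between consecutive words as silence. The function name `with_silence_bands` and the threshold `min_gap_s` are hypothetical.

```python
def with_silence_bands(intervals, min_gap_s=0.05):
    """Insert blank bands into a row of word bands.

    intervals: list of (word, start_s, end_s), sorted by start time.
    Returns a list of (label, start_s, end_s); a label of None marks
    a blank band standing for a silent section.
    """
    row = []
    previous_end = None
    for word, start_s, end_s in intervals:
        # A gap of at least min_gap_s since the previous word is
        # treated as a silent section (an assumed criterion).
        if previous_end is not None and start_s - previous_end >= min_gap_s:
            row.append((None, previous_end, start_s))
        row.append((word, start_s, end_s))
        previous_end = end_s
    return row


# Hypothetical timings with a pause between the two words.
row = with_silence_bands([("critical", 0.0, 0.5), ("though", 0.8, 1.1)])
```

Each entry whose label is `None` would be drawn as an unfilled band covering the pause between the neighbouring words.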

In addition, as illustrated in FIG. 10(a), each band representing an utterance time may be displayed in close contact with the adjacent band. In this mode, when a band representing the utterance time of a word is clicked, the display position of each band may be changed with reference to the display position of the clicked band. For example, when the utterance times of the words are displayed as in FIG. 10(a) and the band representing the utterance time of the word "region" is clicked, the language learning device may, as shown in FIG. 10(b), align the left edge of the "region" band of the model voice with the left edge of the "region" band of the learner's voice, and display the other bands at positions that follow from the display position of "region".
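The click-to-align behaviour can be sketched in Python as follows. This is only an illustration under stated assumptions: the specification does not say which row is moved, so the sketch keeps the model row fixed and shifts the learner row. The names `pack_bands` and `align_on_word`, the pixel scale, and the sample words and durations are hypothetical.

```python
def pack_bands(durations_s, px_per_second=200.0):
    """Return (left, width) pairs for bands packed edge to edge."""
    spans = []
    left = 0.0
    for duration in durations_s:
        width = duration * px_per_second
        spans.append((left, width))
        left += width
    return spans


def align_on_word(model_spans, learner_spans, index):
    """Shift the whole learner row so that the left edge of band `index`
    coincides with the left edge of the same band in the model row."""
    offset = model_spans[index][0] - learner_spans[index][0]
    return [(left + offset, width) for left, width in learner_spans]


# Hypothetical three-word sentence in which "region" is the second word.
words = ["the", "region", "is"]
model_spans = pack_bands([0.20, 0.50, 0.30])
learner_spans = pack_bands([0.35, 0.40, 0.30])
aligned = align_on_word(model_spans, learner_spans, words.index("region"))
```

Because the whole row is shifted by a single offset, the bands stay in contact with each other and keep their lengths. Bands to the left of the clicked word may end up at negative positions, which a renderer would need to clip or scroll.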

In addition, when the learner's utterance time for a word is significantly longer than the utterance time of the corresponding word in the model voice, the band of the word preceding that word may be displayed apart from the band of the word whose utterance time was long.

Brief description of the drawings

FIG. 1 is a diagram illustrating the hardware configuration of the language learning device 1 according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the format of the example sentence table TB1.
FIG. 3 is a diagram illustrating the format of the word table TB2.
FIG. 4 is a flowchart illustrating the flow of processing performed by the CPU 102.
FIG. 5 is a diagram illustrating the waveform of the model voice and the waveform of the learner's voice.
FIG. 6 is a diagram showing the waveform of the model voice and the waveform of the learner's voice divided into a plurality of frames.
FIG. 7 is a diagram illustrating a screen displayed on the display unit 106.
FIG. 8 is a diagram illustrating a screen displayed on the display unit 106 in a modification of the present invention.
FIG. 9 is a diagram illustrating a screen displayed on the display unit 106 in a modification of the present invention.
FIG. 10 is a diagram illustrating a screen displayed on the display unit 106 in a modification of the present invention.

Explanation of symbols

1: language learning device, 101: bus, 102: CPU, 103: ROM, 104: RAM, 105: storage unit, 106: display unit, 107: input unit, 108: audio processing unit, 109: microphone, 110: speaker.

Claims

Storage means for storing an example sentence of a conversation and the pitch of the utterance voice of each word in the example sentence;
Voice input means for inputting voice;
Utterance voice extraction means for extracting utterance voice corresponding to each word in the example sentence from the voice input to the voice input means;
Pitch extraction means for extracting the pitch of each voice extracted by the utterance voice extraction means;
Display means for displaying the example sentence stored in the storage means in horizontal writing, displaying in a horizontal arrangement a plurality of band-like figures representing the utterance voices of the words extracted by the utterance voice extraction means, and displaying each word in association with the band-like figure representing the utterance voice of that word;
A language learning apparatus comprising, together with the above means, display position determining means that determines the vertical display position of each word in the example sentence according to the pitch of the utterance voice of that word stored in the storage means, and determines the vertical display positions of the plurality of band-like figures according to the pitch of each word extracted by the pitch extraction means.

Storage means for storing an example sentence of a conversation and an utterance time of an utterance voice of each word in the example sentence;
Voice input means for inputting voice;
Utterance voice extraction means for extracting utterance voice corresponding to each word in the example sentence from the voice input to the voice input means;
Utterance time extraction means for extracting the utterance time for each utterance voice extracted by the utterance voice extraction means;
Display means for displaying the example sentence stored in the storage means in horizontal writing, displaying in a horizontal arrangement a plurality of band-like figures representing the utterance voices of the words extracted by the utterance voice extraction means, and displaying each word in association with the band-like figure representing the utterance voice of that word;
A language learning apparatus comprising, together with the above means, display length determining means that determines the display length of each word in the example sentence according to the length of the utterance time of the utterance voice of that word stored in the storage means, and determines the display lengths of the plurality of band-like figures according to the length of the utterance time of each word extracted by the utterance time extraction means.

Storage means for storing an example sentence and the pitch of the utterance voice of each phoneme or syllable in the example sentence;
Voice input means for inputting voice;
Utterance voice extraction means for extracting utterance voice corresponding to each phoneme or syllable in the example sentence from the voice input to the voice input means;
Pitch extraction means for extracting the pitch of the utterance voice of each phoneme or syllable extracted by the utterance voice extraction means;
Display means for displaying the example sentence stored in the storage means in horizontal writing, displaying in a horizontal arrangement a plurality of band-like figures representing the utterance voices of the phonemes or syllables extracted by the utterance voice extraction means, and displaying each phoneme or syllable in the example sentence in association with the band-like figure representing the utterance voice of that phoneme or syllable;
Display position determining means for determining the vertical display position of each phoneme or syllable in the example sentence according to the pitch of the utterance voice of that phoneme or syllable stored in the storage means, and determining the vertical display positions of the plurality of band-like figures according to the pitch of each phoneme or syllable extracted by the pitch extraction means.