JP5580019B2

JP5580019B2 - Language learning support system and language learning support method

Info

Publication number: JP5580019B2
Application number: JP2009236237A
Authority: JP
Inventors: 洋一時岡; 和広奥田
Original assignee: POWERSHIFT INC.
Current assignee: POWERSHIFT INC.
Priority date: 2009-10-13
Filing date: 2009-10-13
Publication date: 2014-08-27
Anticipated expiration: 2029-10-13
Also published as: JP2011085641A

Description

The present invention relates to a language learning support system and a language learning support method for acquiring a foreign language, and more particularly to a language learning support system and method for improving conversational ability in a foreign language.

In today's globalized society, learning foreign languages is becoming increasingly important. Demand is particularly high for the conversation skills needed to communicate with people from other countries: many language schools offer services, and a variety of commercially available materials let learners study on their own. Nevertheless, there are quite a few cases in which people cannot speak a foreign language adequately despite years of study. This problem has long been pointed out, and various methods have been proposed as effective ways of improving foreign-language ability, particularly conversational ability.

In general, it is said that improving foreign-language conversation skills requires comprehensively improving the following four abilities:
(1) the ability to hear the sounds themselves;
(2) the ability to connect the sounds heard with their meaning (or concept);
(3) the ability to produce the sounds themselves;
(4) the ability to connect the sounds being uttered (or about to be uttered) with their meaning (or concept).

The utterance ability referred to here is not merely word-level "pronunciation" ability. It means "utterance = speech" that native speakers of the language perceive as natural in actual everyday conversation, and covers the various phonetic features linguistically called prosody: intonation, rhythm, stress, pitch, tone and tempo, as well as loudness and speaking rate. In elementary-school English education, pronunciation training that emphasizes this prosodic aspect has also been proposed.

Training methods that past research and actual teaching results indicate are effective for improving the above abilities include "shadowing" and "reproduction".

"Shadowing", disclosed for example in Patent Documents 1 and 2, is a method in which the trainee listens to speech by a native speaker or the like and, while listening, utters the sounds heard so that they overlap the model. It was originally a training method for aspiring professional simultaneous interpreters, and shadowing is said to be effective for improving the above four abilities, especially abilities (1) and (3). However, shadowing cannot be performed without a certain level of syntactic knowledge and understanding of meaning, so it is unsuited to beginners, and it often fails to produce the intended effect.

Shadowing also imposes a cognitively demanding task, because the learner must listen to the model native speaker's voice (hereinafter the "original") while simultaneously hearing his or her own utterance (sound that is also heard through bone conduction). Experiments by the present inventors have shown that, as a result, especially at the early stage of learning a completely unknown language as an adult, parts of the original may be missed and the quality of the repeated utterance itself may decline.

"Reproduction", on the other hand, is a training method in which the learner imitates a stretch of the original aloud immediately after hearing it. In typical reproduction, the original is divided into sentences (units that carry one complete meaning) and playback is paused after each one, so the length of each stretch of the original varies and may in some cases be considerably long.

However, experiments by the present inventors revealed that the longer the original lasts, the more the learner's reproduction fidelity (the accuracy of the repeated utterance) tends to decline. This is thought to be related to the theory of human short-term memory for verbal speech disclosed in Non-Patent Document 1 and elsewhere. For these reasons, conventional reproduction cannot necessarily be called an efficient learning method either.

Patent Document 1: JP 2003-241629 A
Patent Document 2: JP 2003-307997 A

Non-Patent Document 1: Miller, G. A., "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information," Psychological Review, vol. 63, pp. 81-97.
Non-Patent Document 2: "[Chunk] The deciding factor in speaking ability is 'small English'" [online], Eigoriki.net, [retrieved September 23, 2009], Internet <http://eigoriki.net/2008/04/post-9.html>

As described above, although various methods have been proposed for improving foreign-language conversation skills, they do not necessarily take the characteristics of human memory into account, and no efficient learning method has been established.
The present invention was made in view of these points, and its object is to provide a language learning support system and a language learning support method that enable efficient training, particularly when learning foreign-language conversation.

The first invention, made to solve the above problem, is a language learning support system with which a learner trains in conversation in a target language by listening to playback speech output from audio output means based on original speech in that language and speaking along with it, the system comprising:
a) sentence dividing means for acquiring a sentence of original speech in the target language and dividing it into chunks to obtain original-speech chunks;
b) native-language speech chunk acquisition means for acquiring native-language speech chunks, in the learner's native language, whose meanings correspond respectively to the original-speech chunks;
c) echo adding means for applying, to each original-speech chunk and each native-language speech chunk, echo processing that repeats the sound a predetermined number of times with a delay time long enough that each repetition can be heard independently during playback; and
d) data creating means for creating first audio data whose output timing over time is controlled so that the echo-processed original-speech chunks and native-language speech chunks are output alternately, chunk by chunk, separated by a predetermined time interval;
wherein the first audio data created by the data creating means is output from the audio output means.

The second invention, made to solve the above problem, is a language learning support method by which a learner trains in conversation in a target language by listening to playback speech output from audio output means based on original speech in that language and speaking along with it, the method comprising:
a) a sentence dividing step of acquiring a sentence of original speech in the target language and dividing it into chunks to obtain original-speech chunks;
b) a native-language speech chunk acquisition step of acquiring native-language speech chunks, in the learner's native language, whose meanings correspond respectively to the original-speech chunks;
c) an echo adding step of adding, to each original-speech chunk and each native-language speech chunk, an echo that repeats the sound a predetermined number of times with a delay time long enough that each repetition can be heard independently during playback; and
d) a data creating step of creating first audio data whose output timing over time is controlled so that the echo-added original-speech chunks and native-language speech chunks are output alternately, chunk by chunk, separated by a predetermined time interval;
wherein the first audio data created in the data creating step is output from the audio output means.

As is well known to those skilled in the art, a chunk is a unit of spoken output consisting of one or more words that carries a meaning and is shorter than a sentence (see, for example, Non-Patent Document 2). In the language learning support system according to the first invention, the sentence dividing means divides a sentence of original speech in the given target language (for example, English) into multiple chunks by performing, for example, silence detection based on speech waveform analysis. Such division can often be done automatically, but depending on the sentence or the speaker's speaking style, an appropriate division may not be obtained. In such cases, an operator may review the result of the automatic division and correct it manually where it is not appropriate.
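The automatic division by silence detection can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes mono floating-point samples, and the frame size, RMS threshold and minimum pause length (`frame_ms`, `threshold`, `min_silence_ms`) are hypothetical parameters chosen for the example.

```python
from typing import List, Optional, Tuple


def split_on_silence(samples: List[float], rate: int,
                     frame_ms: int = 10, threshold: float = 0.02,
                     min_silence_ms: int = 150) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of voiced segments (chunk candidates).

    Hypothetical energy-based detector: a frame counts as voiced when its RMS
    level is at least `threshold`; a run of silent frames lasting at least
    `min_silence_ms` closes the current chunk. Defaults are illustrative only.
    """
    frame = max(1, rate * frame_ms // 1000)          # samples per frame
    min_frames = max(1, min_silence_ms // frame_ms)  # silent frames needed to split
    voiced = []
    for i in range(0, len(samples), frame):
        window = samples[i:i + frame]
        rms = (sum(x * x for x in window) / len(window)) ** 0.5
        voiced.append(rms >= threshold)

    chunks: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first voiced frame of the open chunk
    silent_run = 0
    for idx, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = idx
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_frames:
                end = idx - silent_run + 1  # first frame of the pause
                chunks.append((start * frame, min(end * frame, len(samples))))
                start, silent_run = None, 0
    if start is not None:  # close a chunk still open at the end of the sentence
        end = len(voiced) - silent_run
        chunks.append((start * frame, min(end * frame, len(samples))))
    return chunks
```

Pauses shorter than `min_silence_ms` are bridged, so a brief gap inside a phrase does not split it; the manual review described above would correct cases where the threshold choice still yields an unsuitable division.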

The native-language speech chunk acquisition means, for example, captures native-language speech uttered by a translator or the like for each original-speech chunk and stores it chunk by chunk. Since the pronunciation of the native-language speech is not itself the object of learning, it may be synthesized speech generated by a computer or the like.

The echo adding means applies echo processing to each of the original-speech chunks making up a sentence and to each of the corresponding native-language speech chunks. The number of echo repetitions is not particularly limited, but the delay is set long enough that each repetition can be heard independently. The duration of each speech chunk after echo processing should preferably fall within about 3 to 5 seconds, a range chosen with regard to how long speech can be accurately held in short-term memory. Past research such as Non-Patent Document 1 suggests that the general limit of short-term memory is about 15 to 20 seconds, but this presumably applies when the meaning or concept of the material is understood; without such understanding, or with insufficient understanding, the span that can be retained is considerably shorter. In the inventors' experiments so far, especially at the early stage of learning a completely unknown language as an adult, material could be retained for no more than a few seconds. For such learning, therefore, the duration is preferably set to about 3 to 5 seconds.
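A minimal sketch of the echo step, assuming mono floating-point samples in [-1, 1]. The patent only requires that the repetitions be separately audible (and, per the playback description, that the level gradually decrease); starting each copy after the previous one ends plus a short gap (`gap_s`), and attenuating by a constant `decay` factor, are illustrative assumptions. `within_memory_span` is a hypothetical helper that checks the preferred 3 to 5 second length.

```python
from typing import List


def add_echo(chunk: List[float], rate: int, repeats: int = 3,
             gap_s: float = 0.3, decay: float = 0.6) -> List[float]:
    """Play `chunk` `repeats` times in total (original included).

    Each copy starts only after the previous copy has ended plus `gap_s`
    seconds, so the copies never overlap and stay separately audible; each
    copy is `decay` times as loud as the one before it.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    delay = len(chunk) + int(round(rate * gap_s))  # echo delay in samples
    out = [0.0] * (len(chunk) + delay * (repeats - 1))
    for k in range(repeats):
        gain = decay ** k
        offset = k * delay
        for i, x in enumerate(chunk):
            out[offset + i] += gain * x
    return out


def within_memory_span(samples: List[float], rate: int,
                       lo: float = 3.0, hi: float = 5.0) -> bool:
    """True if the echo-processed chunk lasts between `lo` and `hi` seconds."""
    return lo <= len(samples) / rate <= hi
```

With a typical chunk of about 1 second, three copies and 0.3 second gaps give 1 + 2 × (1 + 0.3) = 3.6 seconds, inside the preferred range; a longer chunk would call for fewer repetitions or shorter gaps.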

The data creating means arranges the echo-processed original-speech chunks and the semantically corresponding native-language speech chunks so that they are output alternately, chunk by chunk, and inserts silent intervals so that a predetermined gap separates the end of one chunk's output from the start of the next. This produces the first audio data, which is stored, for example, in a storage device. The learner reads the audio data from the storage device through the audio output means and plays and listens to it. In the playback, for example, the meaning of one chunk is repeated a predetermined number of times in Japanese, the native language, and then the corresponding chunk is repeated a predetermined number of times in English, the target language.
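The alternating layout of the first audio data can be sketched as below, assuming the chunks have already been echo-processed and are mono sample lists at a common rate. The native-first order follows the example in the text; the gap length `gap_s` is a hypothetical placeholder for the "predetermined time interval".

```python
from typing import List, Sequence


def build_first_audio(native_chunks: Sequence[List[float]],
                      original_chunks: Sequence[List[float]],
                      rate: int, gap_s: float = 0.5) -> List[float]:
    """Interleave native-language and target-language chunks with silence.

    Output order: native 1, target 1, native 2, target 2, ...; a silent gap
    of `gap_s` seconds separates every pair of consecutive chunks.
    """
    if len(native_chunks) != len(original_chunks):
        raise ValueError("each original chunk needs a native-language counterpart")
    silence = [0.0] * int(round(rate * gap_s))
    out: List[float] = []
    for native, original in zip(native_chunks, original_chunks):
        for part in (native, original):  # meaning first, then the target language
            if out:
                out.extend(silence)
            out.extend(part)
    return out
```

For example, `build_first_audio([[1.0], [3.0]], [[2.0], [4.0]], rate=2, gap_s=0.5)` yields `[1.0, 0.0, 2.0, 0.0, 3.0, 0.0, 4.0]`, one silent sample between chunks.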

Therefore, even if the learner does not initially understand a chunk's meaning well, he or she can hear the meaning and the English speech alternately at a brisk tempo. Moreover, each chunk's playback falls within a span that people can easily retain, and the sound level gradually decreases during playback, so the learner can easily speak the memorized English over the playback. This trains English speech production while at the same time promoting understanding of meaning.

In one aspect of the language learning support system according to the first invention, the data creating means may store the original-speech chunks and the native-language speech chunks on different audio channels, typically the left (L) and right (R) channels of stereo audio. When the audio output means can output the left and right channels independently, the learner can then, for example, listen only to the target language or, conversely, only to the native-language speech.
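The channel separation can be sketched with Python's standard `wave` and `struct` modules. Assumptions for illustration only: original (target-language) chunks go on the left channel and native-language chunks on the right, the other channel is silent while a chunk plays, and output is 16-bit PCM. The helper names are hypothetical.

```python
import io
import struct
import wave
from typing import List, Sequence, Tuple


def _pcm16(x: float) -> int:
    """Clamp a float sample to [-1, 1] and convert it to a signed 16-bit value."""
    x = max(-1.0, min(1.0, x))
    return int(round(x * 32767))


def build_stereo_timeline(native_chunks: Sequence[List[float]],
                          original_chunks: Sequence[List[float]],
                          rate: int, gap_s: float = 0.5
                          ) -> Tuple[List[float], List[float]]:
    """Alternating layout as in the mono case, split across two channels."""
    gap = [0.0] * int(round(rate * gap_s))
    left: List[float] = []   # target-language (original) channel
    right: List[float] = []  # native-language channel
    for native, original in zip(native_chunks, original_chunks):
        for part, is_original in ((native, False), (original, True)):
            if left:  # both channels always have equal length
                left.extend(gap)
                right.extend(gap)
            silent = [0.0] * len(part)
            left.extend(part if is_original else silent)
            right.extend(silent if is_original else part)
    return left, right


def to_stereo_wav(left: List[float], right: List[float], rate: int) -> bytes:
    """Encode two equal-length channels as an in-memory 16-bit stereo WAV file."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)  # bytes per sample
        w.setframerate(rate)
        w.writeframes(b"".join(struct.pack("<hh", _pcm16(l), _pcm16(r))
                               for l, r in zip(left, right)))
    return buf.getvalue()
```

A player that lets the user mute one channel can then present target-language-only or native-language-only listening from the same file.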

In the language learning support system according to the first invention and the language learning support method according to the second invention, audio data in other forms may also be played back in addition to the first audio data, in order to enhance the learning effect.
For example, the data creating means preferably creates second audio data whose output timing over time is controlled so that original-speech chunks without echo processing are output intermittently, with at least an interval between chunks long enough for the learner to speak, and the first audio data and the second audio data can be selectively played back from the audio output means.

In this configuration, the data creating means takes the original-speech chunks that were divided as described above and not echo-processed, creates second audio data whose output timing is controlled so that the chunks are output intermittently with at least an interval between them during which the learner can speak, and stores it, for example, in a storage device. The learner reads the second audio data from the storage device through the audio output means and plays and listens to it. Since a single chunk without echo processing usually lasts about one second, the learner can repeat each original-speech chunk aloud as it is played back intermittently.

In this case, the data creating means preferably controls the output timing so that the original-speech chunks are output in step with the tempo of a reference sound emitted at regular intervals (for example, a metronome click), and mixes the reference sound into the audio output. Because the chunks are output in time with the reference sound and the reference sound also continues during the intervals in which the learner speaks, the learner can repeat each chunk while keeping rhythm with it. In particular, the reference sound makes it easy to tell when to speak, which encourages speaking aloud and further enhances the effect of training to produce English speech. Furthermore, a reference sound that prompts speaking exerts a psychologically beneficial compelling force, eliciting a proactiveness the learner did not consciously intend and thereby enhancing learning. There is also a secondary benefit: learners generally tire less and find it easier to train for long periods than when they must time their own utterances without a reference sound.
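One way to lay out metronome-aligned second audio data is sketched below. The scheme is an assumption, not the patent's procedure: each non-echoed chunk starts on a beat, occupies the whole number of beats it needs, and is followed by the same number of free beats for the learner's repetition; a short click (hypothetical `click_s`, `click_amp`) is mixed onto every beat, including the free ones. The tempo `bpm` is also a placeholder.

```python
import math
from typing import List, Sequence, Tuple


def build_second_audio(chunks: Sequence[List[float]], rate: int,
                       bpm: float = 80.0, click_s: float = 0.02,
                       click_amp: float = 0.3) -> Tuple[List[float], List[int]]:
    """Return (mixed samples, chunk start offsets in samples)."""
    beat = int(round(rate * 60.0 / bpm))              # samples per beat
    click = [click_amp] * max(1, int(round(rate * click_s)))
    starts: List[int] = []
    n_beats = 0
    for c in chunks:
        starts.append(n_beats * beat)                 # chunk begins on a beat
        span = max(1, math.ceil(len(c) / beat))       # beats the chunk occupies
        n_beats += 2 * span                           # listen, then repeat aloud
    out = [0.0] * (n_beats * beat)
    for start, c in zip(starts, chunks):
        for i, x in enumerate(c):
            out[start + i] += x
    for b in range(n_beats):                          # click on every beat
        for i, x in enumerate(click):
            if b * beat + i < len(out):
                out[b * beat + i] += x
    return out, starts
```

Samples are simply summed, so a real implementation would also normalize or limit the mix to avoid clipping.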

As is generally known, providing visual information in parallel with speech can further enhance the learning effect. The language learning support system according to the first invention may therefore further comprise image data storage means for storing image data corresponding to each original-speech chunk, and image playback means for playing back the corresponding image data in synchronization with the output speech when audio data is played back from the audio output means. The image data may be either still images or moving images.
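Synchronizing images with the audio reduces to finding which chunk is playing at a given time. A minimal sketch, assuming the chunk start times (in seconds) are known from the data creation step and that image `i` belongs to chunk `i`; the function name is hypothetical.

```python
import bisect
from typing import Optional, Sequence


def image_index_at(t: float, chunk_starts_s: Sequence[float]) -> Optional[int]:
    """Index of the chunk (and so of its image) whose playback began most
    recently at time `t`; None before the first chunk starts.

    `chunk_starts_s` must be sorted in ascending order.
    """
    i = bisect.bisect_right(chunk_starts_s, t) - 1
    return i if i >= 0 else None
```

A player would call this on each timer tick with the current playback position and switch the displayed image when the returned index changes.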

The language learning support system according to the first invention can provide the learner with a learning environment in various forms, but it is preferable that the learner can study at home or elsewhere, at any time and for any length of time. A preferred embodiment of the system therefore includes a host computer and a terminal connected to each other via a network, in which the audio output means is provided in the terminal, the audio data created by the data creating means is stored in a storage device of the host computer, and, in response to a request from the terminal, the host computer sends the audio data to the terminal.
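The request/response exchange between terminal and host can be illustrated with Python's standard `http.server` and `urllib`. This is only a sketch under stated assumptions: the stored data is kept in an in-memory dict keyed by a hypothetical lesson ID taken from the URL path, and it is served as `audio/wav`; the embodiment described later instead stores FLV video files and looks them up through a database.

```python
import http.server
import threading
import urllib.error
import urllib.request
from typing import Dict, Optional


class AudioHandler(http.server.BaseHTTPRequestHandler):
    """Host side: return stored audio for GET /<lesson-id>, else 404."""
    store: Dict[str, bytes] = {}

    def do_GET(self) -> None:
        data = self.store.get(self.path.strip("/"))
        if data is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args) -> None:  # keep the console quiet
        pass


def serve(store: Dict[str, bytes], port: int = 0) -> http.server.ThreadingHTTPServer:
    """Start the host in a background thread; port 0 picks a free port."""
    handler = type("StoreHandler", (AudioHandler,), {"store": dict(store)})
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(url: str) -> Optional[bytes]:
    """Terminal side: download the audio data, or None if it does not exist."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read()
    except urllib.error.HTTPError:
        return None
```

Because the terminal only needs a standard HTTP client and an ordinary media player, no dedicated software is required on its side, which matches the point made in the next paragraph.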

In this case, the Internet can serve as the network and an ordinary personal computer as the terminal. If the host computer sends the audio data (or audio data with images) to the terminal in a format that can be played by the audio-file (or audio-with-image-file) playback software installed as standard on personal computers, no dedicated software needs to be installed on the terminal.

As described above, with the language learning support system and method according to the present invention, speech in the target language and native-language speech with the corresponding meaning are output repeatedly and alternately, chunk by chunk. Even beginners and others who lack knowledge of the target language's syntax or understanding of its meaning can therefore practice imitating the speech they hear while linking it to its meaning. In particular, because the duration of each speech unit the learner hears usually falls within the span of short-term memory, the learner is less likely to miss parts of the original speech and can repeat it accurately. This enables highly effective learning of foreign-language conversation.

FIG. 1 is a system configuration diagram of an embodiment of the language learning support system according to the present invention.
FIG. 2 shows an example of the playback material files from which the video files stored in the video file storage unit of the server are generated.
FIG. 3 shows an example of the basic speech in the language learning support system of the embodiment.
FIG. 4 shows the audio output during playback of the first practice mode audio file, based on the example shown in FIG. 3.
FIG. 5 shows the audio output during playback of the second practice mode audio file, based on the example shown in FIG. 3.
FIG. 6 shows the audio output during playback of the third practice mode audio file, based on the example shown in FIG. 3.
FIG. 7 shows the audio output during playback of the review mode audio file, based on the example shown in FIG. 3.
FIG. 8 shows the state during playback of the second practice mode audio file with still images, based on the example shown in FIG. 3.
FIG. 9 is a flowchart of the chunk division process in the language learning support system of the embodiment.
FIG. 10 is a flowchart of the audio processing for the first practice mode in the language learning support system of the embodiment.
FIG. 11 is a flowchart of the audio processing for the second practice mode in the language learning support system of the embodiment.
FIG. 12 is a flowchart of the audio processing for the third practice mode in the language learning support system of the embodiment.
FIG. 13 is a flowchart of the audio processing for the review mode in the language learning support system of the embodiment.

An embodiment of the language learning support system according to the present invention is described below with reference to the accompanying drawings. In the system illustrated here, a learner whose native language is Japanese learns English (the target language) conversation; as will be clear from the description below, however, neither the learner nor the target language is limited to these.

FIG. 1 is an overall configuration diagram of the English conversation learning support system of this embodiment. In this system, a server 30 on the side of a service provider 3 and a terminal 10 used by a learner 11 are connected via a network 2 such as the Internet.

The terminal 10 is an ordinary personal computer (PC) whose main body 12 is provided with an operation unit 14 such as a keyboard, a display unit 13 such as an LCD monitor, a speaker 15, and the like. The PC has standard web-browsing software, audio-file playback software and similar applications installed, and the learner 11 can carry out the learning described below using the functions these applications provide.

The server 30 comprises: a video file storage unit 32 in which video files containing moving-image data with sound to be sent to the learner 11 are stored in advance; a playback file related database (DB) 33 in which information related to the video files is collected and stored; and a data transmission processing unit 31 which, in response to a request from the terminal 10 via the network 2 or the like, uses the related information in the playback file related database 33 to read an appropriate video file from the video file storage unit 32 and send it to the terminal 10. The data transmission processing unit 31 is a functional block implemented by dedicated software installed on the server 30. The video files stored in the video file storage unit 32 are in FLV format, which has been widely used for video distribution in recent years. Typically, a video file in AVI or WMV format created on a material creation computer 40 separate from the server 30 is converted to FLV format when it is uploaded to the server 30, and is then stored in the video file storage unit 32. The related information stored in the playback file related database 33 includes information indicating where each video file is stored in the video file storage unit 32, as well as so-called file attribute information such as the file name, comments on the file, file size and playback time.
The material creation computer 40 comprises, as functional blocks implemented by attached hardware and installed software, a voice input unit 45, a data storage unit, an image information creation unit, a chunk division unit 42, a voice data processing unit 41, a video file generation unit 46, and so on.

FIG. 2 shows an example of the playback material files from which the moving image files stored in the moving image file storage unit of the server are generated. The playback material files fall broadly into two groups: audio files, which store audio data in a predetermined format such as WAVE, and image files, which store still image data in a predetermined format such as JPEG. There are five kinds of audio file: basic audio files, first practice mode audio files, second practice mode audio files, third practice mode audio files, and review mode audio files.

Next, for the English conversation learning support system of this embodiment, an example of how the audio of the English conversation learning material is output, as heard by the learner 11 through the speaker 15 (or headphones, not shown) of the terminal 10, is described in detail.

FIG. 3 shows an example of the five types of basic voice to be processed; FIG. 4 is a schematic diagram of the playback output of the first practice mode audio; FIG. 5, of the second practice mode audio; FIG. 6, of the third practice mode audio; and FIG. 7, of the review mode audio.

As an example, suppose, as shown in FIG. 3, that a sentence in the target language, English, is "Where is the check in counter for this flight?". The English sentence in the basic voice of FIG. 3 is typically spoken by a native speaker at a standard speaking rate. The audio of this sentence is divided into several chunks. As described above, a chunk is a unit of spoken output consisting of one or more words that together carry a meaning. The example sentence is divided into three chunks: "Where is", "the check in counter", and "for this flight?". Separately, a Japanese recording of the corresponding sentence, 「私の乗る飛行機のチェックインカウンターはどこですか？」 ("Where is the check-in counter for the plane I am taking?"), is prepared, together with three Japanese voice chunks corresponding to the three English voice chunks: 「どこなの」 ("where is it"), 「チェックインカウンターは」 ("the check-in counter"), and 「この便の」 ("for this flight"). In addition to this speech data, metronome sound data with four beats to the bar at an andante tempo is prepared.

From these five types of basic voice, the four forms of audio data described below are created. In FIGS. 4 to 8, the Japanese voice chunks 「どこなの」, 「チェックインカウンターは」, and 「この便の」 are denoted J1, J2, and J3, and the English voice chunks "Where is", "the check in counter", and "for this flight?" are denoted E1, E2, and E3, respectively. The original, unchunked English sentence "Where is the check in counter for this flight?" is denoted E, and the original Japanese sentence 「私の乗る飛行機のチェックインカウンターはどこですか？」 is denoted J.
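The notation above pairs each English chunk En with a Japanese chunk Jn that gives its meaning. A minimal sketch of that pairing as a data structure, assuming hypothetical names (`Clip`, `LessonSentence`) and storing only transcripts in place of audio; the shared metronome track is left out because it is not specific to one sentence.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Clip:
    """One recorded utterance; `transcript` stands in for the audio here."""
    label: str        # e.g. "E", "J", "E1", "J2"
    transcript: str


@dataclass
class LessonSentence:
    """Basic voices for one sentence (hypothetical structure)."""
    english: Clip                 # E: whole English sentence
    japanese: Clip                # J: whole Japanese sentence
    english_chunks: List[Clip]    # E1, E2, ...
    japanese_chunks: List[Clip]   # J1, J2, ... (Jn gives the meaning of En)

    def __post_init__(self) -> None:
        # Every English chunk needs exactly one Japanese chunk giving its meaning.
        if len(self.english_chunks) != len(self.japanese_chunks):
            raise ValueError("English and Japanese chunk counts must match")

    def chunk_pairs(self) -> List[Tuple[Clip, Clip]]:
        """(Jn, En) pairs in English word order."""
        return list(zip(self.japanese_chunks, self.english_chunks))


sentence = LessonSentence(
    english=Clip("E", "Where is the check in counter for this flight?"),
    japanese=Clip("J", "私の乗る飛行機のチェックインカウンターはどこですか？"),
    english_chunks=[Clip("E1", "Where is"), Clip("E2", "the check in counter"),
                    Clip("E3", "for this flight?")],
    japanese_chunks=[Clip("J1", "どこなの"), Clip("J2", "チェックインカウンターは"),
                     Clip("J3", "この便の")],
)
```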

As shown in FIG. 4, in the first practice mode audio, the Japanese speech is recorded on the left channel (Lch) of the stereo audio and the English speech on the right channel (Rch). For both languages, an echo is added chunk by chunk so that each chunk is heard repeated several times (five times in this example), with the output level dropping at each repetition. The echoed Japanese and English voice chunks are then played alternately, so that each echoed Japanese voice chunk, which conveys the meaning of the following English voice chunk, is played just before that echoed English voice chunk.
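The echo described above can be modelled as the chunk followed by delayed copies, each quieter than the last. A minimal sketch on plain lists of float samples, assuming a fixed per-repetition gain factor `decay` (the text says only that the level drops with each repetition; the 0.6 default and the function name are assumptions).

```python
from typing import List


def add_echo(samples: List[float], sample_rate: int, delay_s: float,
             repeats: int = 5, decay: float = 0.6) -> List[float]:
    """Return the chunk plus delayed copies so that it is heard `repeats` times.

    Copy k starts k * delay_s after the original and is decay**k as loud.
    Copies that overlap are summed.
    """
    delay = int(round(delay_s * sample_rate))
    out = [0.0] * (len(samples) + delay * (repeats - 1))
    gain = 1.0
    for k in range(repeats):
        start = k * delay
        for i, x in enumerate(samples):
            out[start + i] += gain * x
        gain *= decay  # each repetition is quieter than the previous one
    return out
```

For example, a single-sample impulse `[1.0]` at 10 Hz with a 0.1 s delay, three repeats, and `decay=0.5` becomes `[1.0, 0.5, 0.25]`.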

The echo delay and the number of repetitions are adjusted so that the total length of one repeated English chunk, that is, the playback time t1 of an echoed English voice chunk, stays within about 5 seconds. This limit reflects how long a learner 11 can hold speech in short-term memory when he or she does not understand its meaning or concept. In addition, a silent interval of a predetermined time t2 (about 0.5 to 1 second) is inserted between the end of one echoed voice chunk and the start of the next echoed voice chunk (in the other language).
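If an echoed chunk is the original of length d followed by n − 1 copies spaced by a delay τ, its playback time is t1 = d + (n − 1)τ, so keeping t1 within about 5 s bounds the delay at τ ≤ (5 − d)/(n − 1). A small sketch of that arithmetic, assuming evenly spaced copies (hypothetical helper names):

```python
def echoed_length(chunk_s: float, repeats: int, delay_s: float) -> float:
    """Playback time t1 of an echoed chunk: the last copy starts
    (repeats - 1) * delay_s after the first and lasts chunk_s."""
    return chunk_s + (repeats - 1) * delay_s


def max_echo_delay(chunk_s: float, repeats: int, limit_s: float = 5.0) -> float:
    """Largest echo delay that keeps t1 within limit_s (about 5 s in the text)."""
    if repeats < 2:
        raise ValueError("a delay only matters with two or more repetitions")
    spare = limit_s - chunk_s
    if spare < 0:
        raise ValueError("the chunk alone is longer than the limit")
    return spare / (repeats - 1)
```

With a 1.0 s chunk and five repetitions, the delay can be at most 1.0 s, giving t1 = 5.0 s.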

When the learner 11 listens to the first practice mode audio in normal playback, the meaning of each English voice chunk is given in a concise Japanese voice chunk just before it, which helps the learner grasp the meaning and the English as one. Because the playback time of each echoed English voice chunk is kept within a range that even beginners can easily retain, the learner does not miss it and can say the English voice chunk aloud, for example within the time t2 before the next voice chunk starts. In other words, even beginners can comfortably do the same kind of training as shadowing. Moreover, since each chunk carries a self-contained meaning, the exercise is effectively equivalent to reproduction training as well. In normal study the learner hears the Lch audio (Japanese) and the Rch audio (English) together, but on a terminal 10 capable of stereo playback, the learner 11 can also choose to listen to only the Japanese or only the English.

As shown in FIG. 5, in the second practice mode audio, a four-beat metronome sound is recorded on the Lch of the stereo audio and the English speech on the Rch. There is no Japanese speech in this mode. The metronome sounds once per second: for example, a "ting" sounds every 4 seconds on the first beat, and a "click, click, click" sounds on the second to fourth beats. Electronic tones or the like may of course be used instead.
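The metronome described above is a regular pulse with an accent every fourth beat. A minimal sketch that lists its events rather than rendering audio, assuming one beat per second and four beats per bar as in the example (the function name and the "ting"/"click" labels are illustrative):

```python
from typing import List, Tuple


def metronome_events(bars: int, beat_s: float = 1.0,
                     beats_per_bar: int = 4) -> List[Tuple[float, str]]:
    """Return (time_s, sound) events; beat 1 of each bar is the accented 'ting'."""
    events = []
    for bar in range(bars):
        for beat in range(beats_per_bar):
            t = (bar * beats_per_bar + beat) * beat_s
            events.append((t, "ting" if beat == 0 else "click"))
    return events
```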

In the English speech recorded on the Rch, in the first round the English voice chunks E1, E2, and E3 are output in order, each synchronized with the first beat of the Lch metronome, and on the third beat a copy of that first-beat chunk at a reduced output level is output. In the second round the English voice chunks are output on the first beat as in the first round, but the third beat is silent.
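The Rch timing described above can be written as a cue sheet. A minimal sketch, assuming one four-beat bar per chunk and a hypothetical `quiet_gain` of 0.3 for the attenuated copy (the text does not specify the attenuation):

```python
from typing import List, Tuple

Cue = Tuple[float, str, float]   # (start time in s, chunk label, gain)


def mode2_cues(chunks: List[str], beat_s: float = 1.0, beats_per_bar: int = 4,
               quiet_gain: float = 0.3) -> List[Cue]:
    """R-channel cue sheet for the second practice mode (one bar per chunk).

    Round 1: chunk on beat 1, quieter copy of it on beat 3.
    Round 2: chunk on beat 1, beat 3 left silent for the learner.
    """
    cues: List[Cue] = []
    bar = 0
    for round_no in (1, 2):
        for chunk in chunks:
            bar_start = bar * beats_per_bar * beat_s
            cues.append((bar_start, chunk, 1.0))                          # beat 1
            if round_no == 1:
                cues.append((bar_start + 2 * beat_s, chunk, quiet_gain))  # beat 3
            bar += 1
    return cues
```

For E1 to E3 this yields six cues in round 1 and three in round 2, with the last chunk starting at 20 s.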

In the basic exercise, the learner 11 first listens to the English voice chunk played on the first beat and then, when the lower-level copy of that chunk is played on the third beat, says it aloud in the manner of shadowing, speaking over the attenuated English speech. In the second round, the learner 11 listens to the English voice chunk on the first beat and says it aloud on the third metronome beat, again in the manner of shadowing. Because the learner 11 only has to say the chunk just heard in time with the regular metronome beat, it is easy to find the timing for speaking. The mode thus actively prompts the learner 11 to speak, which makes shadowing and reproduction training highly effective.

As shown in FIG. 6, in the third practice mode audio, a mix of the four-beat metronome sound and the Japanese speech is recorded on the Lch of the stereo audio, and the English speech on the Rch. The metronome sound is the same as in the second practice mode audio.

In the Japanese speech mixed with the metronome sound, the Japanese voice chunks J1, J2, and J3 are output in order, each synchronized with the first beat of the metronome, and this sequence is repeated three times. On the Rch, the English voice chunks E1, E2, and E3 are output in order, each synchronized with the second beat of the metronome: at the normal output level in the first round, at a reduced output level in the second round, and fully muted (silent) in the third round. Thus, in the first and second rounds, Japanese and English voice chunks are output alternately in time with the metronome tempo (with the English at a low output level in the second round), and in the third round the learner 11 can speak during the silent interval that follows each Japanese voice chunk, again in time with the metronome tempo.
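The two channels of the third practice mode can likewise be sketched as cue sheets. This assumes one four-beat bar per chunk pair and a hypothetical `quiet_gain` of 0.3; the English chunk's beat is a parameter (`english_beat`, default 2 as stated here; step S35 below places it on the third beat). Muted third-round chunks are simply not placed.

```python
from typing import List, Tuple

Cue = Tuple[float, str, float]   # (start time in s, chunk label, gain)


def mode3_cues(j_chunks: List[str], e_chunks: List[str], english_beat: int = 2,
               beat_s: float = 1.0, beats_per_bar: int = 4,
               quiet_gain: float = 0.3) -> Tuple[List[Cue], List[Cue]]:
    """Return (left_cues, right_cues) for three rounds of chunk pairs."""
    english_gain = {1: 1.0, 2: quiet_gain, 3: 0.0}   # round -> R-channel gain
    left: List[Cue] = []
    right: List[Cue] = []
    bar = 0
    for round_no in (1, 2, 3):
        for j, e in zip(j_chunks, e_chunks):
            bar_start = bar * beats_per_bar * beat_s
            left.append((bar_start, j, 1.0))                # Jn on beat 1
            gain = english_gain[round_no]
            if gain > 0:                                    # round 3 stays silent
                right.append((bar_start + (english_beat - 1) * beat_s, e, gain))
            bar += 1
    return left, right
```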

As a result, the learner 11 can say the English voice chunks rhythmically while understanding (and confirming) their Japanese meanings in the order in which the English voice chunks are played, that is, in English word order.

As shown in FIG. 7, in the review mode audio, the Japanese speech is recorded on the Lch of the stereo audio and the English speech on the Rch. Here, both languages use the undivided whole sentences, English speech E and Japanese speech J, rather than chunks. Japanese speech J and English speech E are basically played alternately, one sentence at a time. Every instance of the Japanese speech J has the same speaking rate and output level, whereas the English speech E varies: the first instance is sped up to about 1.2 to 1.5 times the normal speaking rate (shown as →E← in FIG. 7), the second is at the normal speaking rate and the standard output level, the time slot for the third is silent, and the fourth is at the normal speaking rate but at a reduced output level.
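The review-mode sequence above can be sketched as a timed cue list. This assumes that J precedes each of the four English slots, a hypothetical 0.5 s gap between clips, a 1.3× speed-up (within the stated 1.2 to 1.5), and a `quiet_gain` of 0.3; the silent slot keeps the duration of a normal-speed E so the learner has time to speak.

```python
from typing import List, NamedTuple


class Cue(NamedTuple):
    start_s: float
    channel: str   # "L" (Japanese) or "R" (English)
    clip: str      # "J" or "E"
    speed: float   # playback-rate multiplier
    gain: float    # 0.0 means the slot is silent


def review_cues(j_len_s: float, e_len_s: float, speedup: float = 1.3,
                quiet_gain: float = 0.3, gap_s: float = 0.5) -> List[Cue]:
    """Alternate J with four English slots: sped up, normal, silent, quieter."""
    english_slots = [(speedup, 1.0), (1.0, 1.0), (1.0, 0.0), (1.0, quiet_gain)]
    cues: List[Cue] = []
    t = 0.0
    for speed, gain in english_slots:
        cues.append(Cue(round(t, 6), "L", "J", 1.0, 1.0))
        t += j_len_s + gap_s
        cues.append(Cue(round(t, 6), "R", "E", speed, gain))
        t += e_len_s / speed + gap_s   # a faster E takes less time
    return cues
```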

Hearing the faster English speech first makes the English at the standard rate sound slow on the second hearing, so it is easier to catch. The learner 11 can therefore catch the English sentence accurately on the first and second hearings and then say it aloud in the manner of reproduction on the third and fourth.

The English conversation learning support system of this embodiment supports learning (training) that combines the three kinds of practice mode audio and the one kind of review mode audio described above as appropriate. Typically, training on one English sentence proceeds in the order first practice mode audio → second practice mode audio → third practice mode audio → review mode audio. Depending on the learner's level and progress, any one of the first to third practice mode audio or the review mode audio may of course be repeated.

In addition to the audio output from the speaker 15 or the like, the English conversation learning support system of this embodiment adds a visual learning effect by drawing images on the screen of the display unit 13 in synchronization with the audio. Specifically, image data corresponding to each English voice chunk is prepared, and the corresponding images are displayed in turn, synchronized with the playback of each English voice chunk and of the Japanese voice chunk that corresponds to it in meaning. FIG. 8 shows an example of the display output when the second practice mode audio is played for the example sentence of FIG. 3. Displaying such images in synchronization with the audio lets the learner 11 visually grasp the context and situations in which the English voice chunk is used, yielding a much higher learning effect than audio alone.

To keep the learner from relying on written text when speaking, text in English (the target language) is basically not displayed. However, where image information alone cannot convey the situation adequately, for example for abstract expressions of emotion, a minimum of Japanese (native language) text may be displayed, for example in a comic-style speech balloon.

In the English conversation learning support system of this embodiment, the moving image file storage unit 32 holds moving image files generated by combining the audio data in audio files, such as the practice mode and review mode audio files described above, with the image data in the corresponding image files.

For example, when the learner 11 subscribes in advance to the learning material service of the service provider 3, authentication information identifying the learner 11 is held in the server 30. In this setting, when the learner 11 operates the terminal 10 at hand to send a request for English conversation learning to the server 30, the server 30 checks whether the requesting learner 11 is an authorized user. Once authentication is complete, the data transmission processing unit 31 in the server 30 reads, from the playback file related database 33, the related information for the files needed for the learning requested by the terminal 10, uses that information to locate the corresponding moving image file, reads it from the moving image file storage unit 32, and transmits it to the terminal 10 over the network 2. The terminal 10 decodes the audio data and image data in the received file using predetermined application software and plays them back by streaming. The audio and display output described above is thus produced on the terminal 10, and the learner 11 can use it to study English conversation.
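The request flow above (authenticate, look up related information, read the file) can be sketched with in-memory stand-ins. All names here (`FileRecord`, `DataSender`, `handle_request`) are hypothetical, dictionaries replace the real database and file store, and network transfer and streaming are omitted.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class FileRecord:
    """Related information from the playback file DB (fields illustrative)."""
    location: str      # where the FLV file lives in the moving image file store
    file_name: str
    comment: str
    size_bytes: int
    duration_s: float


class DataSender:
    """Hypothetical stand-in for the data transmission processing unit 31."""

    def __init__(self, learners: Set[str], related_db: Dict[str, FileRecord],
                 file_store: Dict[str, bytes]) -> None:
        self.learners = learners        # authentication info held on enrolment
        self.related_db = related_db    # lesson id -> FileRecord
        self.file_store = file_store    # storage location -> file bytes

    def handle_request(self, learner_id: str, lesson_id: str) -> bytes:
        if learner_id not in self.learners:
            raise PermissionError("learner is not authorized")
        record = self.related_db[lesson_id]       # look up the related info
        return self.file_store[record.location]   # read the file to send
```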

If the learner 11 is permitted to keep transmitted files, the terminal 10 may of course first store (that is, download) a file sent from the server 30 in its internal storage device and then play it back offline.

Next, the creation of the various audio files, executed on the material creation computer 40 in the English conversation learning support system of this embodiment, is described with reference to FIGS. 9 to 13. FIG. 9 is a flowchart of the chunk division process; FIG. 10, of the audio processing for the first practice mode; FIG. 11, for the second practice mode; FIG. 12, for the third practice mode; and FIG. 13, for the review mode.

The English sentences (English speech) on which the teaching material is based can be, for example, speech recorded by a native speaker specifically for the material, captured through the voice input unit 45, and stored in the data storage unit 44 either as is or after suitable audio compression. Ordinary recorded conversations, not made for teaching, may of course also be used as material. Because of the processing described later, however, background noise should be as low as possible. Separately, Japanese sentences corresponding to the English sentences must be prepared; these can be obtained by having a native speaker read a suitable Japanese translation of each English sentence and storing the recording in the data storage unit 44 via the voice input unit 45.

For the English speech prepared as described above, the chunk division unit 42 first starts the chunk division process shown in FIG. 9.

The chunk division unit 42 reads the English speech (original speech) from the data storage unit 44 into its internal working memory (step S1). It then analyzes the waveform of the original speech to distinguish silent regions from voiced regions (step S2). For example, when a silent state (a state in which the output level is at or below a predetermined threshold) continues for a predetermined time or longer, the span from the start to the end of that silent state is judged to be a silent region, and everything else is judged to be voiced. A voiced region between two silent regions can then be treated as one chunk. The parameters for this judgment (for example, the minimum duration of a silent region) are set in advance by the operator, and default values are used for any parameters that are not set.
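Step S2 can be sketched as a scan for quiet runs that last at least a minimum duration. This minimal version uses a per-sample amplitude threshold for brevity (frame-based RMS would be more robust on real recordings); the function name and the default `level` and `min_silence_s` values are assumptions standing in for the operator-set parameters.

```python
from typing import List, Optional, Tuple


def split_on_silence(samples: List[float], sample_rate: int,
                     level: float = 0.02, min_silence_s: float = 0.25
                     ) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of voiced regions (candidate chunks).

    A run of samples with |x| <= level lasting at least min_silence_s is a
    silent region; shorter quiet runs stay inside the surrounding chunk.
    """
    min_run = int(round(min_silence_s * sample_rate))
    n = len(samples)
    chunks: List[Tuple[int, int]] = []
    start: Optional[int] = None   # index where the current voiced region began
    i = 0
    while i < n:
        if abs(samples[i]) <= level:
            j = i
            while j < n and abs(samples[j]) <= level:
                j += 1
            # A long enough quiet run closes the current chunk, if any.
            if j - i >= min_run and start is not None:
                chunks.append((start, i))
                start = None
            i = j
        else:
            if start is None:
                start = i
            i += 1
    if start is not None:          # speech ran to the end of the recording
        chunks.append((start, n))
    return chunks
```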

In general, when a native English speaker talks, silent intervals occur at the boundaries between chunks that form units of meaning, because the speaker pauses for breath or to think about what to say next. In many cases, therefore, a sentence can be divided into chunks on the basis of the silent/voiced identification described above. Depending on the manner of speaking, however, a silent region may occur in the middle of a chunk so that it is wrongly split into several chunks, or, conversely, several chunks may be judged to be one. Therefore, after step S2 finishes, the operator, for example, evaluates the identification result of step S2 (step S3). If the operator judges that the identification is not appropriate (that is, the chunk division is wrong), the operator corrects the automatic chunk division, for example by manually shifting the boundaries of silent and voiced regions or inserting forced division points where needed (step S4), and the process returns to step S3 to evaluate the corrected division again. When the chunk division is judged appropriate in step S3, a predetermined operator action stores the divided speech in the data storage unit 44 as one of the basic voices shown in FIG. 3 (step S5).
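The two corrections described for step S4, joining a chunk that was wrongly split and forcing a boundary inside a chunk, can be sketched on (start, end) sample ranges. The helper names are hypothetical; a real editor would also let the operator drag boundaries.

```python
from typing import List, Tuple

Chunk = Tuple[int, int]   # (start, end) sample indices


def merge_chunks(chunks: List[Chunk], k: int) -> List[Chunk]:
    """Join chunk k with chunk k+1 (a pause was wrongly taken as a boundary)."""
    (start, _), (_, end) = chunks[k], chunks[k + 1]
    return chunks[:k] + [(start, end)] + chunks[k + 2:]


def split_chunk(chunks: List[Chunk], k: int, at: int) -> List[Chunk]:
    """Insert a forced boundary inside chunk k at sample index `at`."""
    start, end = chunks[k]
    if not start < at < end:
        raise ValueError("split point must lie strictly inside the chunk")
    return chunks[:k] + [(start, at), (at, end)] + chunks[k + 1:]
```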

Japanese speech could be divided into chunks by the same process as English speech. In general, however, simply dividing the original Japanese sentence does not produce Japanese voice chunks that convey the meaning of the English voice chunks concisely and accurately. It is therefore preferable not to use the above process for the Japanese voice chunks, but instead to have a native speaker say them chunk by chunk and capture them through the voice input unit 45.

When the operator starts the process of creating the first practice mode audio file (FIG. 10), the voice data processing unit 41 reads the chunked English speech (English voice chunks) and the chunked Japanese speech (Japanese voice chunks) for a given sentence from the data storage unit 44 into its internal working memory (step S11). In the example of FIG. 3, the three English voice chunks "Where is", "the check in counter", and "for this flight?" and the three Japanese voice chunks 「どこなの」, 「チェックインカウンターは」, and 「この便の」 are read.

Next, echo with a predetermined number of repetitions is added to each chunk (step S12). Parameters such as the echo delay and the number of repetitions are set in advance by the operator, and default values are used for any that are not set. This yields, for each chunk, audio data with five echoes added, as shown in FIG. 4. The echoed chunks are then assigned to the channels as shown in FIG. 4, and their placement on the time axis is adjusted so that Japanese and English voice chunks are output alternately at predetermined intervals (step S13).
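Step S13 can be sketched as building two equal-length tracks in which each echoed Jn plays on the left, then each echoed En on the right, with a silent gap after each. This assumes the inputs are already echoed sample lists and uses a hypothetical 0.75 s gap (within the 0.5 to 1 s given for t2); the function name is illustrative.

```python
from typing import List, Tuple


def arrange_alternating(j_echoed: List[List[float]], e_echoed: List[List[float]],
                        sample_rate: int, gap_s: float = 0.75
                        ) -> Tuple[List[float], List[float]]:
    """Build (left, right): echoed Jn on the left, then echoed En on the right,
    each followed by gap_s of silence, for n = 1, 2, ..."""
    gap = [0.0] * int(round(gap_s * sample_rate))
    left: List[float] = []
    right: List[float] = []
    for j, e in zip(j_echoed, e_echoed):
        # Japanese chunk plays on the left while the right stays silent.
        left += j + gap
        right += [0.0] * (len(j) + len(gap))
        # English chunk plays on the right while the left stays silent.
        right += e + gap
        left += [0.0] * (len(e) + len(gap))
    return left, right
```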

When steps S12 and S13 are executed automatically, however, the processed audio may sound unnatural to a human listener or may not suit the intended learning. Therefore, after step S13 finishes, the operator, for example, evaluates the playback of the processed audio (step S14). If the operator judges that the processing is not appropriate, the operator corrects parameters such as the echo delay time (step S15), and the process returns to step S12 and runs again. Alternatively, processing may be run in advance under several parameter sets to create different audio files, and the operator may compare them by ear and choose the most suitable one. When the playback is judged appropriate in step S14, a predetermined operator action stores the created audio file in the data storage unit 44 as the finished audio file for the first practice mode (step S16).

When the operator starts the process of creating the second practice mode audio file, the voice data processing unit 41 reads the chunked English speech (English voice chunks) for a given sentence and the metronome sound from the data storage unit 44 into its internal working memory (steps S21 and S22).

Next, using the four-beat metronome sound as the reference, the data placement on the time axis is adjusted so that each English voice chunk starts playing exactly on the first beat; the first-beat English voice chunk is copied, the copy's output level is reduced by a predetermined amount, and the copy is placed so that it starts playing exactly on the third beat (steps S23 and S24). In the second round, the English voice chunk on the third beat is deleted. Then, as shown in FIG. 5, the English voice chunks with their playback timing adjusted as above are stored on the Rch of the stereo audio and the metronome sound on the Lch, forming an audio file (step S25).
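Steps S23 and S24 amount to mixing each chunk into the Rch track at a beat-aligned offset, with the copy on the third beat scaled down. A minimal sketch, assuming the reduction is specified in decibels (the -10 dB figure and the names `render` and `db_to_gain` are illustrative; the text leaves the amount to the operator in step S27):

```python
from typing import Dict, List, Tuple


def db_to_gain(db: float) -> float:
    """Convert a level change in dB to a linear amplitude gain."""
    return 10 ** (db / 20)


def render(cues: List[Tuple[float, str, float]], clips: Dict[str, List[float]],
           sample_rate: int, total_s: float) -> List[float]:
    """Mix clips into one track. cues: (start_s, label, gain); clips: label -> samples."""
    track = [0.0] * int(round(total_s * sample_rate))
    for start_s, label, gain in cues:
        start = int(round(start_s * sample_rate))
        for i, x in enumerate(clips[label]):
            if start + i < len(track):   # anything past the end is cut off
                track[start + i] += gain * x
    return track


# E1 on beat 1 and a copy 10 dB quieter on beat 3 of a one-second-per-beat bar.
bar = render([(0.0, "E1", 1.0), (2.0, "E1", db_to_gain(-10.0))],
             {"E1": [1.0, 1.0]}, sample_rate=1, total_s=4.0)
```

Deleting the third-beat copy in the second round corresponds to simply omitting that cue.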

After step S25 finishes, the operator, for example, evaluates the playback of the processed audio (step S26). If the operator judges that the processing is not appropriate, the operator corrects parameters such as the amount by which the output level is reduced (step S27), and the process returns to step S23 and runs again. Alternatively, processing may be run in advance under several parameter sets to create different audio files, and the operator may compare them by ear and choose the most suitable one. When the playback is judged appropriate in step S26, a predetermined operator action stores the created audio file in the data storage unit 44 as the finished audio file for the second practice mode (step S28).

When the operator starts the process of creating the third practice mode audio file, the voice data processing unit 41 reads the English voice chunks and Japanese voice chunks for a given sentence from the data storage unit 44 into its internal working memory (step S31). It also reads the metronome sound from the data storage unit 44 into the working memory (step S32).

Next, using the four-beat metronome sound as the reference, the Japanese voice chunks are arranged in order so that each starts playing exactly on the first beat, and they are copied so that the sequence repeats three times (three rounds), with their placement on the time axis adjusted accordingly (step S33). The Japanese voice chunks processed in this way are then mixed with the metronome sound on the time axis (step S34). Next, as shown in FIG. 6, the English voice chunks of the sentence are arranged in order so that each starts playing on the third beat of the metronome, and they are copied so that the sequence repeats three times (three rounds), with their placement on the time axis adjusted accordingly (step S35). The output level of each English voice chunk in the second round is then reduced by a predetermined amount, and each chunk in the third round is deleted (step S36). Finally, the English speech is stored on the Rch of the stereo audio and the Japanese speech (mixed with the metronome sound) on the Lch, forming an audio file (step S37).
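Steps S34 and S37 can be sketched with the Python standard library: sum two tracks sample by sample for the Lch, then write Lch and Rch as a 16-bit stereo WAV with the `wave` and `struct` modules. The hard clipping and the 16-bit PCM format are assumptions; the text says only that the speech is stored as a stereo audio file.

```python
import io
import struct
import wave
from typing import List


def mix(a: List[float], b: List[float]) -> List[float]:
    """Sample-wise sum of two tracks, zero-padding the shorter one and
    hard-clipping the result to [-1.0, 1.0]."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [max(-1.0, min(1.0, x + y)) for x, y in zip(a, b)]


def write_stereo_wav(fileobj, left: List[float], right: List[float],
                     sample_rate: int) -> None:
    """Write float tracks as 16-bit PCM stereo WAV: `left` -> Lch, `right` -> Rch."""
    n = max(len(left), len(right))
    left = left + [0.0] * (n - len(left))
    right = right + [0.0] * (n - len(right))
    frames = bytearray()
    for l_val, r_val in zip(left, right):
        l_val = max(-1.0, min(1.0, l_val))
        r_val = max(-1.0, min(1.0, r_val))
        # WAV stores stereo frames as interleaved little-endian int16 pairs.
        frames += struct.pack("<hh", int(l_val * 32767), int(r_val * 32767))
    with wave.open(fileobj, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(bytes(frames))


# Usage: Lch = Japanese chunks mixed with the metronome, Rch = English chunks.
buf = io.BytesIO()
write_stereo_wav(buf, mix([0.5, 0.8], [0.6]), [0.0, -0.5], sample_rate=8000)
```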

After step S37, the operator, for example, evaluates the processed playback sound (step S38). If the operator judges that the processing is not appropriate, the operator corrects parameters such as the attenuation applied to the audio output level (step S39), and the process returns to step S33 and runs again. Alternatively, the processing may be run in advance under several parameter sets to create different audio files, which the operator listens to and compares in order to select the most suitable one. When the playback sound is judged appropriate in step S38, a predetermined operator operation stores the created audio file in the data storage unit 44 as the finished audio file for the third practice mode (step S40).
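The evaluate-and-adjust cycle of steps S38 to S40 can be sketched as follows, along with the alternative of pre-rendering several parameter sets and choosing one. The callbacks stand in for the operator's listening judgment and parameter correction. `refine_until_approved`, `pick_best` and the `max_rounds` cap are hypothetical and are not part of the described system.

```python
from typing import Callable, Iterable, Tuple, TypeVar

P = TypeVar("P")  # parameter set, e.g. an attenuation value in dB
A = TypeVar("A")  # rendered audio

def refine_until_approved(render: Callable[[P], A], approve: Callable[[A], bool],
                          adjust: Callable[[P, A], P], params: P,
                          max_rounds: int = 10) -> Tuple[P, A]:
    """Render, let the 'operator' judge, adjust, and repeat until approved."""
    for _ in range(max_rounds):
        audio = render(params)            # steps S33 to S37
        if approve(audio):                # step S38: playback judged appropriate
            return params, audio          # step S40: caller stores the file
        params = adjust(params, audio)    # step S39: operator corrects parameters
    raise RuntimeError("no acceptable rendering within max_rounds")

def pick_best(render: Callable[[P], A], score: Callable[[A], float],
              candidates: Iterable[P]) -> Tuple[P, A]:
    """Alternative: pre-render every candidate and keep the highest-scoring one."""
    rendered = [(p, render(p)) for p in candidates]
    return max(rendered, key=lambda pa: score(pa[1]))
```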

When the audio file creation process for the review mode is started by an operator instruction, the audio data processing unit 41 reads the English speech and Japanese speech of an arbitrary sentence that have not been divided into chunks, that is, the original speech, from the data storage unit 44 into its internal working memory (step S41). From the original English speech E, it then creates English speech E1, whose speaking rate is raised by a factor of about 1.2 to 1.5; English speech E2, whose output level is attenuated by a predetermined amount; and English speech E3, which is silenced (step S42). Next, as shown in FIG. 7, the Japanese speech is stored in Lch and the English speech described above in Rch, and the data arrangement is adjusted so that the Japanese speech J alternates with the English speech E1, E, E2 and E3 (step S43).
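The review-mode arrangement of step S43 can be sketched as a two-channel segment list in which J alternates with E1, E, E2 and E3. The text does not specify the following, so each is an assumption:

- a fixed hypothetical pause `gap` between segments,
- E3 kept at the same length as E so that the learner has time to speak,
- -12 dB for E2,
- a default rate of 1.3, within the stated 1.2 to 1.5 range.

Only durations are computed here. Raising the speaking rate without changing pitch would need a time-scale modification method such as WSOLA, which is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Segment:
    start: float              # seconds
    channel: str              # "L" = Japanese, "R" = English
    name: str                 # "J", "E1", "E", "E2" or "E3"
    duration: float           # seconds
    gain_db: Optional[float]  # None = muted (the silenced E3)

def review_schedule(dur_j: float, dur_e: float, rate: float = 1.3,
                    attenuation_db: float = -12.0, gap: float = 0.5) -> List[Segment]:
    """Hypothetical J/E1, J/E, J/E2, J/E3 alternation for the review mode."""
    variants = [
        ("E1", dur_e / rate, 0.0),      # faster speaking rate, so shorter
        ("E", dur_e, 0.0),              # original English speech
        ("E2", dur_e, attenuation_db),  # attenuated output level
        ("E3", dur_e, None),            # silenced: the learner speaks here
    ]
    segments: List[Segment] = []
    t = 0.0
    for name, duration, gain in variants:
        segments.append(Segment(t, "L", "J", dur_j, 0.0))
        t += dur_j + gap
        segments.append(Segment(t, "R", name, duration, gain))
        t += duration + gap
    return segments
```

For example, `review_schedule(2.0, 3.0, rate=1.5)` starts J at 0 s and E1 at 2.5 s, and ends with the muted E3 at 19.5 s.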

After step S43, the operator, for example, evaluates the processed playback sound (step S44). If the operator judges that the processing is not appropriate, the operator corrects parameters such as the speaking rate (step S45), and the process returns to step S42 and runs again. Alternatively, the processing may be run in advance under several parameter sets to create different audio files, which the operator listens to and compares in order to select the most suitable one. When the playback sound is judged appropriate in step S44, a predetermined operator operation stores the created audio file in the data storage unit 44 as the finished audio file for the review mode (step S46).

The processes described above prepare the audio files for the first to third practice modes and the file for the review mode in the data storage unit 44. The operator also uses the image information creation unit 43 to create appropriate images corresponding to each English voice chunk and stores them in the data storage unit 44 as well. Using the files stored in the data storage unit 44, the moving image file generation unit 46 combines the image data and audio data for each audio file corresponding to one sentence so that the audio output and image output are synchronized. It then automatically generates a moving image file in a format such as AVI or WMV. When the generated moving image file is given a predetermined file name and uploaded to the server 30, it is converted into an FLV-format moving image file and stored in the moving image file storage unit 32. At the same time, information related to the moving image file (for example, the file storage location and the file attribute information described above) is collected and saved in the reproduction file related database 33.

As a result, the data transmission processing unit 31 can respond when it receives an instruction from the learner 11 to play a given sentence. It refers to the related information saved in the reproduction file related database 33 to identify the corresponding moving image file and immediately locate where it is stored. It then reads the data constituting that file from the moving image file storage unit 32 and sends it to the terminal 10.
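The lookup performed by the data transmission processing unit 31 can be sketched as follows. An in-memory SQLite table stands in for the reproduction file related database 33, and a dictionary stands in for the moving image file storage unit 32. The table layout, the `(sentence_id, mode)` key, the path pattern and all function names are hypothetical.

```python
import sqlite3
from typing import Dict

def open_catalog() -> sqlite3.Connection:
    """Hypothetical stand-in for the reproduction file related database 33."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE play_files (sentence_id TEXT, mode TEXT, "
               "location TEXT, fmt TEXT, PRIMARY KEY (sentence_id, mode))")
    return db

def register(db: sqlite3.Connection, storage: Dict[str, bytes],
             sentence_id: str, mode: str, data: bytes) -> str:
    """Store an uploaded (already converted) file and record its location."""
    location = f"videos/{sentence_id}_{mode}.flv"
    storage[location] = data
    db.execute("INSERT OR REPLACE INTO play_files VALUES (?, ?, ?, ?)",
               (sentence_id, mode, location, "FLV"))
    return location

def fetch_for_terminal(db: sqlite3.Connection, storage: Dict[str, bytes],
                       sentence_id: str, mode: str) -> bytes:
    """Resolve a learner's playback request to stored data via the catalog."""
    row = db.execute("SELECT location FROM play_files "
                     "WHERE sentence_id = ? AND mode = ?",
                     (sentence_id, mode)).fetchone()
    if row is None:
        raise KeyError((sentence_id, mode))
    return storage[row[0]]
```

Keying the table on sentence and mode lets a single query resolve a request to a storage location. That is the role the text assigns to the related information.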

The above embodiment is merely one example of the present invention. Modifications, additions and alterations made as appropriate within the spirit of the invention are evidently also encompassed by the claims of the present application.

Description of Symbols
10 ... Terminal
11 ... Learner
12 ... PC main body
13 ... Display unit
14 ... Operation unit
15 ... Speaker
2 ... Network
3 ... Service provider
30 ... Server
31 ... Data transmission processing unit
32 ... Moving image file storage unit
33 ... Reproduction file related database
40 ... Material creation computer
41 ... Audio data processing unit
42 ... Chunk division unit
43 ... Image information creation unit
44 ... Data storage unit
45 ... Audio input unit
46 ... Moving image file generation unit

Claims

1. A language learning support system in which a learner listens to reproduced speech output from voice output means on the basis of original speech in a target language and practices conversation in that language by speaking along with that speech, the system comprising:
a) sentence dividing means for acquiring a sentence of the original speech in the target language, performing speech waveform analysis on the original speech to distinguish silent regions from speech regions, and thereby dividing the sentence into chunks to obtain original voice chunks;
b) native voice chunk acquisition means for acquiring, for each of the original voice chunks, a native voice chunk in the learner's native language having the corresponding meaning;
c) echo adding means for applying echo addition to each original voice chunk and each native voice chunk so that the chunk is repeated a predetermined number of times at a delay long enough for each repetition to be heard separately during playback; and
d) data creation means for creating first audio data whose output timing over time is controlled so that the echo-added original voice chunks of the target language and the echo-added native voice chunks are output alternately, chunk by chunk, at a predetermined time interval;
wherein the first audio data created by the data creation means is output from the voice output means.

2. The language learning support system according to claim 1,
wherein the data creation means stores the original voice chunks and the native voice chunks in different audio channels.

3. The language learning support system according to claim 1,
wherein the data creation means further creates second audio data whose output timing over time is controlled so that original voice chunks not subjected to echo addition are output intermittently, with a time interval after at least every chunk that allows the learner to speak, and the first audio data and the second audio data can be selectively played back and output from the voice output means.

4. The language learning support system according to claim 3,
wherein the data creation means controls the output timing so that the original voice chunks are output in time with the tempo of a reference sound emitted at fixed time intervals, and creates the second audio data by mixing the reference sound into the audio output.

5. The language learning support system according to any one of claims 1 to 4, further comprising:
image data storage means for storing image data corresponding to each original voice chunk; and
image reproduction means for reproducing, when audio data is played back and output from the voice output means, the image data corresponding to the audio being output.

6. The language learning support system according to any one of claims 1 to 5,
comprising a host computer and a terminal connected to each other via a network, wherein the voice output means is provided in the terminal, the audio data created by the data creation means is stored in a storage device of the host computer, and audio data is transmitted from the host computer to the terminal in response to a request from the terminal to the host computer.

7. A language learning support method in which a learner listens to reproduced speech output from voice output means on the basis of original speech in a target language and practices conversation in that language by speaking along with that speech, the method comprising:
a) a sentence division step in which sentence dividing means acquires a sentence of the original speech in the target language, performs speech waveform analysis on the original speech to distinguish silent regions from speech regions, and thereby divides the sentence into chunks to obtain original voice chunks;
b) a native voice chunk acquisition step in which native voice chunk acquisition means acquires, for each of the original voice chunks, a native voice chunk in the learner's native language having the corresponding meaning;
c) an echo adding step in which echo adding means adds an echo to each original voice chunk and each native voice chunk so that the chunk is repeated a predetermined number of times at a delay long enough for each repetition to be heard separately during playback; and
d) a data creation step in which data creation means creates first audio data whose output timing over time is controlled so that the echo-added original voice chunks of the target language and the echo-added native voice chunks are output alternately, chunk by chunk, at a predetermined time interval;
wherein the first audio data created in the data creation step is output from the voice output means.

8. The language learning support method according to claim 7,
wherein, in the data creation step, second audio data is further created whose output timing over time is controlled so that original voice chunks not subjected to echo addition are output intermittently, with a time interval after at least every chunk that allows the learner to speak, and the first audio data and the second audio data can be selectively played back and output from the voice output means.

9. The language learning support method according to claim 8,
wherein, in the data creation step, the output timing is controlled so that the original voice chunks are output in time with the tempo of a reference sound generated at regular time intervals, and the second audio data is created by mixing the reference sound into the audio output.

10. The language learning support method according to any one of claims 7 to 9,
wherein image data corresponding to each original voice chunk is prepared, and when audio data is played back and output from the voice output means, the corresponding image data is reproduced in synchronization with the audio being output.
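Elements a), c) and d) of claims 1 and 7 can be sketched end to end on plain lists of float samples. This is an illustrative Python sketch, not the claimed implementation, and it rests on these assumptions:

- Silent and speech regions are told apart by simple frame RMS against a fixed threshold.
- Pauses shorter than `min_silence_ms` are kept inside a chunk.
- "Heard separately" is read as requiring the echo delay to be at least the chunk length.
- `repeats` counts the total number of occurrences.
- The caller supplies the native voice chunks, and each one plays before its original voice chunk.

Following claim 2, the two languages go to separate channels. All names and default values are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

def split_into_chunks(samples: Sequence[float], rate: int, frame_ms: int = 20,
                      threshold: float = 0.02,
                      min_silence_ms: int = 200) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of speech regions separated by silence."""
    frame = max(1, int(rate * frame_ms / 1000))
    # Label each frame as speech (True) or silence (False) by its RMS level.
    voiced = []
    for start in range(0, len(samples), frame):
        window = samples[start:start + frame]
        rms = math.sqrt(sum(x * x for x in window) / len(window))
        voiced.append(rms >= threshold)
    min_silent_frames = max(1, math.ceil(min_silence_ms / frame_ms))
    chunks: List[Tuple[int, int]] = []
    chunk_start: Optional[int] = None  # first speech frame of the current chunk
    silent_run = 0                     # silent frames since the last speech frame
    for i, is_speech in enumerate(voiced):
        if is_speech:
            if chunk_start is None:
                chunk_start = i
            silent_run = 0
        elif chunk_start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                end_frame = i - silent_run + 1  # first frame of the pause
                chunks.append((chunk_start * frame, min(end_frame * frame, len(samples))))
                chunk_start, silent_run = None, 0
    if chunk_start is not None:  # speech ran to the end, or only a short pause followed
        end_frame = len(voiced) - silent_run
        chunks.append((chunk_start * frame, min(end_frame * frame, len(samples))))
    return chunks

def add_echo(chunk: Sequence[float], rate: int, repeats: int = 3,
             delay_s: Optional[float] = None, decay: float = 1.0) -> List[float]:
    """Repeat a chunk `repeats` times in total, `delay_s` apart (default: back to back)."""
    n = len(chunk)
    delay = n if delay_s is None else int(round(delay_s * rate))
    if delay < n:
        raise ValueError("delay shorter than chunk: repetitions would overlap")
    out = [0.0] * (delay * (repeats - 1) + n)
    gain = 1.0
    for r in range(repeats):
        offset = r * delay
        for i, x in enumerate(chunk):
            out[offset + i] += gain * x
        gain *= decay  # hypothetical optional decay; 1.0 = plain repetition
    return out

def first_audio_data(original_chunks: List[List[float]], native_chunks: List[List[float]],
                     rate: int, gap_s: float = 1.0, repeats: int = 3,
                     decay: float = 1.0) -> Tuple[List[float], List[float]]:
    """Alternate echoed native (left) and original (right) chunks with a fixed gap."""
    if len(original_chunks) != len(native_chunks):
        raise ValueError("each original chunk needs a native-language counterpart")
    gap = [0.0] * int(round(gap_s * rate))
    left: List[float] = []
    right: List[float] = []
    for orig, native in zip(original_chunks, native_chunks):
        for chunk, is_native in ((native, True), (orig, False)):
            echoed = add_echo(chunk, rate, repeats=repeats, decay=decay)
            silent = [0.0] * len(echoed)
            left += (echoed if is_native else silent) + gap
            right += (silent if is_native else echoed) + gap
    return left, right
```

The minimum-silence rule is what keeps a short breath inside a phrase from splitting it. Raising `min_silence_ms` yields fewer, longer chunks.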