JP2006178334A

JP2006178334A - Language learning system

Info

Publication number: JP2006178334A
Application number: JP2004373815A
Authority: JP
Inventors: Naohiro Emoto; 直博江本
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2004-12-24
Filing date: 2004-12-24
Publication date: 2006-07-06
Also published as: CN100585663C; KR20060073502A; CN1794315A; KR100659212B1

Abstract

PROBLEM TO BE SOLVED: To provide a language learning system with which a learner learns a language using a model speech.

SOLUTION: The language learning system includes: a database in which feature values extracted from the speech of a speaker and speech data of that speaker are recorded in association with each other; a speech acquisition means which acquires the learner's speech; a feature value extraction means which extracts feature values of the learner's speech from the speech acquired by the speech acquisition means; an approximation degree calculation means which calculates an approximation degree index indicating the difference between the feature values recorded in the database and the feature values extracted by the feature value extraction means; a speech data extraction means which extracts, from the database, the speech data associated with the feature values whose approximation degree index, as calculated by the approximation degree calculation means, satisfies a first condition; and a reproduction means which outputs speech in accordance with the speech data extracted by the speech data extraction means.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a language learning system that supports language learning.

In learning a foreign language or one's native language, and particularly in self-study of pronunciation or speaking, a widely used method is to play back a model voice recorded on a recording medium such as a CD (Compact Disc) and to pronounce or speak in imitation of it. The aim is to acquire correct pronunciation by imitating the model voice. To make such learning more effective, the learner needs to evaluate the difference between the model voice and his or her own voice. In most cases, however, the model voice recorded on a CD is the voice of one particular announcer or native speaker. For many learners, the model voice therefore has characteristics entirely different from those of their own voice, which makes it difficult to judge how accurately they are pronouncing in comparison with the model.

Techniques for solving this problem include, for example, those described in Patent Documents 1 and 2. The technique described in Patent Document 1 reflects parameters of the user's speech, such as intonation, speech speed, and voice quality, in the model voice, thereby converting the model voice into a voice resembling the user's. The technique described in Patent Document 2 allows the learner to select any one of a plurality of model voices.
Patent Documents: JP 2002-244547 A; JP 2004-133409 A

However, while the technique described in Patent Document 1 can correct intonation, it has difficulty correcting sounds whose pronunciation is clearly different, such as "r" and "l" or "s" and "th" in English. In addition, because it modifies the speech waveform, the processing becomes complicated. The technique described in Patent Document 2 relies on selecting a model voice, so the learner must choose the model voice himself or herself, which is cumbersome.

The present invention has been made in view of the above circumstances, and its object is to provide a language learning apparatus that allows a learner to study, with simpler processing, using a model voice that resembles the learner's own voice.

To solve the above problem, the present invention provides a language learning system comprising: a database in which data associating a feature amount extracted from a speaker's voice with one or more pieces of voice data of that speaker is recorded for a plurality of speakers; voice acquisition means for acquiring a learner's voice; feature amount extraction means for extracting a feature amount of the learner's voice from the voice acquired by the voice acquisition means; approximation degree calculation means for calculating, for each speaker, an approximation index indicating the difference between the feature amount of that speaker recorded in the database and the feature amount extracted by the feature amount extraction means; voice data extraction means for extracting, from the database, one piece of voice data associated with a feature amount whose approximation index calculated by the approximation degree calculation means satisfies a first condition; and reproduction means for outputting a voice in accordance with the one piece of voice data extracted by the voice data extraction means.

In a preferred embodiment of this language learning system, the first condition may be that the feature amount with the highest degree of approximation is extracted.
In another preferred embodiment, the language learning system may further comprise speech speed conversion means for converting the speech speed of the voice data extracted by the voice data extraction means, and the reproduction means may output a voice in accordance with the voice data whose speech speed has been converted by the speech speed conversion means.
In yet another preferred embodiment, the language learning system may further comprise: storage means for storing a model voice; comparison means for comparing the model voice with the learner's voice acquired by the voice acquisition means and quantifying the degree of approximation between the two; and database update means for adding, when the degree of approximation obtained by the comparison means satisfies a second condition, the learner's voice acquired by the voice acquisition means to the database in association with the feature amount extracted by the feature amount extraction means.

According to the present invention, a voice uttered by a speaker whose voice characteristics resemble the learner's is reproduced as the voice of an example sentence during learning. The learner can therefore recognize more accurately the pronunciation to be imitated (the target pronunciation), which improves learning efficiency.

Embodiments of the present invention will now be described with reference to the drawings.
<1. Configuration>
FIG. 1 is a block diagram showing the functional configuration of a language learning system 1 according to a first embodiment of the present invention. A storage unit 11 stores a database DB1 in which a feature amount extracted from a speaker's voice and voice data of that speaker's voice are recorded in association with each other. An input unit 12 acquires the voice of a learner (user) and outputs it as user voice data. A feature extraction unit 13 extracts a feature amount from the learner's voice. A voice data extraction unit 14 compares the feature amount extracted by the feature extraction unit 13 with the feature amounts recorded in the database DB1 and extracts the one that satisfies a predetermined condition. The voice data extraction unit 14 then extracts the voice data associated with the extracted feature amount. A reproduction unit 15 reproduces the voice data extracted by the voice data extraction unit 14.

The contents of the database DB1 are described in detail later. The language learning system 1 further includes the following components for updating the database DB1. A storage unit 16 stores a model voice database DB2 in which model voice data serving as a model for language learning and text data of that model voice are recorded in association with each other. A comparison unit 17 compares the user voice data acquired by the input unit 12 with the model voice data stored in the storage unit 16. If the comparison shows that the user voice satisfies a predetermined condition, a DB update unit 18 adds the user voice data to the database DB1.

FIG. 2 illustrates the contents of the database DB1. The database DB1 records a speaker ID ("ID001" in FIG. 2), which is an identifier for a speaker, together with a feature amount extracted from that speaker's voice data. The database DB1 also records, in association with one another, an example sentence ID, which is an identifier for an example sentence, the voice data of that example sentence, and the pronunciation level (described later) of that example sentence. The database DB1 holds a plurality of data sets each consisting of an example sentence ID, voice data, and a pronunciation level, and each data set is recorded in association with the speaker ID assigned to the speaker of the voice data. In other words, the database DB1 holds voice data of a plurality of example sentences uttered by a plurality of speakers, and these data are recorded per speaker, keyed by speaker ID and feature amount.
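
The layout of the database DB1 described above could be modeled roughly as follows. This is only a sketch: the class and field names, the use of bytes for voice data, the numeric scale of the pronunciation level, the sentence ID "S001", and the sample formant values are all assumptions made for illustration; only the speaker ID "ID001" comes from FIG. 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Utterance:
    """One data set in DB1: example sentence ID, voice data, pronunciation level."""
    sentence_id: str
    voice_data: bytes            # audio payload; the storage format is an assumption
    pronunciation_level: float   # hypothetical numeric scale


@dataclass
class SpeakerRecord:
    """Everything DB1 holds for one speaker."""
    speaker_id: str
    feature: Tuple[float, ...]   # e.g. formant frequencies in Hz
    utterances: List[Utterance] = field(default_factory=list)


# DB1 modeled as a mapping from speaker ID to that speaker's record.
db1: Dict[str, SpeakerRecord] = {}
db1["ID001"] = SpeakerRecord(
    speaker_id="ID001",
    feature=(730.0, 1090.0, 2440.0),          # illustrative values only
    utterances=[Utterance("S001", b"", 0.8)],  # hypothetical entry
)
```

Keying the mapping by speaker ID mirrors the statement that each data set is recorded in association with the speaker ID of its speaker.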

FIG. 3 is a block diagram showing the hardware configuration of the language learning system 1. A CPU (Central Processing Unit) 101 reads and executes programs stored in a ROM (Read Only Memory) 103 or an HDD (Hard Disk Drive) 104, using a RAM (Random Access Memory) 102 as a work area. The HDD 104 is a storage device that stores various application programs and data. The HDD 104 also stores the database DB1 and the model voice database DB2. A display 105 is a display device, such as a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display), that displays characters and images under the control of the CPU 101. A microphone 106 is a sound collection device for acquiring the user's voice and outputs a voice signal corresponding to the voice uttered by the user. A voice processing unit 107 has a function of converting the analog voice signal output from the microphone 106 into digital voice data and a function of converting voice data stored in the HDD 104 into a voice signal and outputting it to a speaker 108. The user can enter instructions to the language learning system 1 by operating a keyboard 109. The components described above are connected to one another via a bus 110. The language learning system 1 can also communicate with other devices via an I/F (interface) 111.

<2. Operation>
Next, the operation of the language learning system 1 according to this embodiment will be described. The operation of reproducing the voice of an example sentence is described first, followed by the operation of updating the contents of the database DB1. In the language learning system 1, the CPU 101 provides the functions shown in FIG. 1 by executing a language learning program stored in the HDD 104. The learner (user) operates the keyboard 109, for example when the language learning program starts, to enter a user ID, which is an identifier for the learner. The CPU 101 stores the entered user ID in the RAM 102 as the user ID of the learner currently using the system.

<2-1. Audio playback>
FIG. 4 is a flowchart showing the operation of the language learning system 1. When the language learning program is executed, the CPU 101 of the language learning system 1 searches the model voice database DB2 and creates a list of available example sentences. Based on this list, the CPU 101 displays on the display 105 a message prompting the user to select an example sentence. Following the message, the user selects one example sentence from the list. The CPU 101 reproduces the voice of the selected example sentence (step S101). Specifically, the CPU 101 reads the model voice data of the example sentence from the model voice database DB2 and outputs it to the voice processing unit 107. The voice processing unit 107 performs digital-to-analog conversion on the input model voice data and outputs the resulting analog voice signal to the speaker 108. The model voice is thus reproduced from the speaker 108.

The user listens to the model voice reproduced from the speaker 108 and utters the example sentence into the microphone 106 in imitation of the model voice. That is, the user voice is input (step S102). Specifically, when reproduction of the model voice ends, the CPU 101 displays on the display 105 a message prompting the user to utter the example sentence, such as "Now it is your turn. Please pronounce the example sentence." The CPU 101 also displays on the display 105 a message instructing the user how to input the user voice, such as "Press the space key, then speak, and press the space key again when you have finished." The user operates the keyboard 109 according to the displayed messages to input the user voice. That is, the user presses the space key on the keyboard 109 and then utters the example sentence into the microphone 106. When finished, the user presses the space key again.

The microphone 106 converts the user's voice into an electric signal and outputs it as a user voice signal. The voice processing unit 107 converts the user voice signal into digital voice data, which is recorded in the HDD 104 as user voice data. After reproduction of the model voice is complete, the CPU 101 starts recording the user voice data when the space key is pressed and stops recording when the space key is pressed again. In other words, the user voice between the first and second presses of the space key is recorded in the HDD 104.

Next, the CPU 101 performs feature amount extraction processing on the obtained user voice data (step S103). Specifically, the CPU 101 divides the voice data into segments of a predetermined duration (frames). The CPU 101 takes the logarithm of the amplitude spectrum obtained by Fourier-transforming the waveform indicated by the model voice data divided into frames and the waveform indicated by the user voice signal, and applies an inverse Fourier transform to it to obtain a spectrum envelope for each frame. From the spectrum envelope thus obtained, the CPU 101 extracts the formant frequencies of the first and second formants. In general, a vowel is characterized by the distribution of its first and second formants. Starting from the beginning of the voice data, the CPU 101 matches the formant frequency distribution obtained for each frame against the formant frequency distribution of a predetermined vowel (for example, "a"). When the matching determines that a frame corresponds to the vowel "a", the CPU 101 calculates the formant frequencies of predetermined formants in that frame (for example, the first, second, and third formants). The CPU 101 stores the calculated formant frequencies in the RAM 102 as the feature amount P of the user's voice.
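
The spectrum envelope step in S103 can be sketched with the cepstrum method, using only the Python standard library. Several points are assumptions, not details from the text: the low-quefrency liftering and the final forward transform (the text mentions only the log amplitude spectrum and an inverse Fourier transform), the lifter length of 30, the function names, and the simple peak picking used to suggest formant candidates. Practical formant estimators are usually more elaborate.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; adequate for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def spectral_envelope(frame, n_lifter=30):
    """Log-amplitude spectral envelope of one frame via the cepstrum."""
    # Log of the amplitude spectrum (small constant avoids log(0)).
    log_amp = [math.log(abs(c) + 1e-12) for c in dft(frame)]
    # Inverse transform of the log spectrum gives the real cepstrum.
    cepstrum = [c.real for c in dft(log_amp, inverse=True)]
    n = len(cepstrum)
    # Assumed step: keep only low-quefrency terms and their mirror images.
    liftered = [c if (i < n_lifter or i > n - n_lifter) else 0.0
                for i, c in enumerate(cepstrum)]
    # Transform back to frequency; keep bins from 0 Hz to the Nyquist frequency.
    return [c.real for c in dft(liftered)][: n // 2 + 1]


def formant_candidates(envelope, sample_rate, n_fft, count=3):
    """Frequencies (Hz) of the first `count` local maxima of the envelope."""
    peaks = [k for k in range(1, len(envelope) - 1)
             if envelope[k - 1] < envelope[k] >= envelope[k + 1]]
    return [k * sample_rate / n_fft for k in peaks[:count]]
```

A frame of 256 samples at 8 kHz, for example, would be passed as `spectral_envelope(frame)` and the result handed to `formant_candidates(envelope, 8000, 256)`; both figures are illustrative.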

Next, the CPU 101 extracts from the database DB1 the voice data associated with a feature amount similar to the feature amount P of the user's voice (step S104). Specifically, the CPU 101 compares the extracted feature amount P with the feature amounts recorded in the database DB1 and identifies the one closest to the feature amount P. In this comparison, for example, the CPU 101 calculates the differences between the first to third formant frequencies of the feature amount P and those of a feature amount recorded in the database DB1, and takes the sum of the absolute values of the three differences as an approximation index indicating the degree of approximation between the two. The CPU 101 identifies in the database DB1 the feature amount with the smallest approximation index, that is, the feature amount closest to the feature amount P. The CPU 101 then extracts the voice data associated with the identified feature amount and stores the extracted voice data in the RAM 102.
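
The matching in step S104 amounts to a nearest-neighbor search under a sum-of-absolute-differences measure. A minimal sketch follows; the function names, the dictionary layout, the speaker ID "ID002", and the sample formant values are assumptions for illustration.

```python
from typing import Dict, Sequence, Tuple


def approximation_index(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of absolute formant-frequency differences; smaller means closer."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_speaker(p: Sequence[float],
                    db_features: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return (speaker_id, index) for the recorded feature closest to p."""
    best_id = min(db_features,
                  key=lambda sid: approximation_index(p, db_features[sid]))
    return best_id, approximation_index(p, db_features[best_id])


# Hypothetical feature amounts (first to third formant frequencies in Hz).
features = {
    "ID001": (730.0, 1090.0, 2440.0),
    "ID002": (850.0, 1220.0, 2810.0),
}
P = (800.0, 1200.0, 2700.0)
print(closest_speaker(P, features))  # -> ('ID002', 180.0)
```

Here the index for "ID001" is 70 + 110 + 260 = 440 and for "ID002" is 50 + 20 + 110 = 180, so "ID002" is selected.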

Next, the CPU 101 reproduces the voice data (step S105). Specifically, the CPU 101 outputs the voice data to the voice processing unit 107. The voice processing unit 107 performs digital-to-analog conversion on the input voice data and outputs the resulting voice signal to the speaker 108. The extracted voice data is thus reproduced as a voice from the speaker 108. Because the voice data has been extracted by feature amount matching, the reproduced voice has characteristics similar to those of the learner's voice. Consequently, even for an example sentence that was difficult to imitate when the learner only heard it uttered by a speaker (an announcer, a native speaker, or the like) whose voice characteristics are entirely different from the learner's own, hearing it uttered by a speaker whose voice closely resembles the learner's allows the learner to understand more accurately the pronunciation to be imitated, which improves learning efficiency.

<2-2. Database update>
Next, the operation of updating the database DB1 will be described.
FIG. 5 is a flowchart showing the operation of updating the database DB1 in the language learning system 1. First, the model voice is reproduced and the user voice is input by the processing of steps S101 and S102 described above. Next, the CPU 101 compares the model voice with the user voice (step S201). Specifically, the CPU 101 divides the waveform indicated by the model voice data into segments of a predetermined duration (frames). The CPU 101 likewise divides the waveform indicated by the user voice data into frames. The CPU 101 takes the logarithm of the amplitude spectrum obtained by Fourier-transforming the framed waveform of the model voice data and that of the user voice signal, and applies an inverse Fourier transform to it to obtain a spectrum envelope for each frame.

FIG. 6 illustrates the spectrum envelopes of the model voice (top) and the user voice (bottom). The spectrum envelopes shown in FIG. 6 each consist of three frames, frame I to frame III. The CPU 101 compares the obtained spectrum envelopes frame by frame and quantifies the degree of approximation between them. The quantification (calculation of the approximation index) is performed, for example, as follows. The CPU 101 may plot the frequency and spectral density of a characteristic formant on a spectral density versus frequency diagram, and calculate as the approximation index the distance between the two resulting points summed over the entire voice data. Alternatively, the CPU 101 may calculate as the approximation index the difference in spectral density at a specific frequency integrated over the entire voice data. Because the model voice and the user voice usually differ in length (duration), it is preferable to equalize their lengths before performing the above processing.
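
The second variant above (integrating the spectral density difference at a specific frequency over the whole utterance) might look like the sketch below. It assumes each voice has already been reduced to one spectral density value per frame at the chosen frequency. Equalizing the lengths by linear interpolation, taking the absolute value of the difference, and all function and parameter names are assumptions; the text does not say how the lengths are equalized.

```python
def resample(values, length):
    """Linearly interpolate a sequence of per-frame values to `length` points."""
    if length == 1 or len(values) == 1:
        return [values[0]] * length
    out = []
    for i in range(length):
        pos = i * (len(values) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def density_difference_index(model_track, user_track, frame_seconds):
    """Approximate the time integral of |model - user| spectral density."""
    # Equalize the two lengths first, as the text recommends.
    n = max(len(model_track), len(user_track))
    m = resample(model_track, n)
    u = resample(user_track, n)
    # Rectangle-rule integration over frames of equal duration.
    return sum(abs(a - b) for a, b in zip(m, u)) * frame_seconds
```

With this convention a smaller index means the user voice is closer to the model voice, matching the convention used for the feature amount comparison in step S104.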

Referring again to FIG. 5, the CPU 101 determines, based on the calculated approximation index, whether to update the database DB1 (step S202). Specifically, the HDD 104 stores in advance a condition for additionally registering acquired voice data in the database DB1. The CPU 101 determines whether the approximation index calculated in step S201 satisfies this registration condition. If the registration condition is satisfied (step S202: YES), the CPU 101 advances the processing to step S203, described below. If the registration condition is not satisfied (step S202: NO), the CPU 101 ends the processing.

If the registration condition is satisfied, the CPU 101 performs database update processing (step S203). Specifically, the CPU 101 attaches to the voice data that satisfied the registration condition the user ID identifying the learner (user) who is the speaker of that voice data. The CPU 101 searches the model voice database DB2 for a user ID identical to this user ID and additionally registers the voice data in the model voice database DB2 in association with it. If the user ID extracted from the update request is not registered in the model voice database DB2, the CPU 101 additionally registers the user ID and registers the voice data in association with it. In this way, the learner's voice data is additionally registered in the database DB1, and the database DB1 is updated.
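
Steps S202 and S203 can be sketched together as a conditional insert. The paragraph above names DB2 in the lookup step but concludes that DB1 is updated; this sketch assumes the per-speaker store being written is DB1. The registration condition `index <= threshold`, the dictionary layout, and all names are likewise assumptions for illustration.

```python
def update_database(db1, user_id, feature, sentence_id, voice_data,
                    index, threshold):
    """Register the learner's utterance in db1 if the index meets the condition.

    Assumes a smaller index means closer to the model voice and that the
    registration condition is `index <= threshold`; both are illustrative.
    Returns True when the utterance was registered.
    """
    if index > threshold:
        return False  # step S202: NO, end without updating
    # Step S203: reuse the entry for a known ID, or create one for a new ID.
    record = db1.setdefault(user_id, {"feature": feature, "utterances": []})
    record["utterances"].append(
        {"sentence_id": sentence_id, "voice_data": voice_data}
    )
    return True
```

Using `dict.setdefault` covers both branches of the description: an existing ID gains another utterance, and an unregistered ID is added first.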

The database update operation described above may be performed in parallel with the voice reproduction operation described earlier, or after the voice reproduction operation is complete. As learners' voice data is added to the database DB1 one after another, the database DB1 accumulates voice data of many speakers. Thus, the more the language learning system 1 is used, the more speakers' voice data is registered in the database DB1, and the more likely it becomes that a new learner using the language learning system 1 will hear a voice with characteristics similar to his or her own.

<3. Modifications>
The present invention is not limited to the embodiment described above, and various modifications are possible.
<3-1. Modification 1>
In the embodiment described above, after storing the voice data extracted in step S104 in the RAM 102, the CPU 101 may perform speech speed conversion processing on the voice data. Specifically, the RAM 102 stores in advance a variable a that specifies the ratio of speech speeds before and after the conversion. The CPU 101 processes the extracted voice data so that the duration of the voice (the time required to reproduce the voice data from beginning to end) is multiplied by a. When a > 1, the conversion lengthens the voice, that is, the speech speed becomes slower. Conversely, when a < 1, the conversion shortens the voice, that is, the speech speed becomes faster. In this embodiment, the initial value of the variable a is set to a value greater than 1. Therefore, after the model voice is reproduced and the user voice is input, the example sentence reproduced in a voice resembling the user's is played back more slowly than the model voice. The learner can thus recognize more clearly the pronunciation to be imitated (the target pronunciation).
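
The text does not say how the duration is multiplied by a. One simple possibility is a naive overlap-add time stretch, sketched below with the standard library only. The function name, the frame and hop sizes, and the choice of a Hann window are assumptions; the sketch needs round(hop * a) to be smaller than the frame length, otherwise the output has gaps. This naive method can produce audible artifacts and is shown only to make the role of a concrete.

```python
import math


def time_stretch(samples, a, frame=400, hop=100):
    """Naive overlap-add time stretch: output lasts about `a` times the input."""
    out_hop = max(1, round(hop * a))  # spacing of frames in the output
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame) for i in range(frame)]
    n_frames = max(1, (len(samples) - frame) // hop + 1)
    out_len = (n_frames - 1) * out_hop + frame
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for f in range(n_frames):
        src, dst = f * hop, f * out_hop
        for i in range(frame):
            if src + i < len(samples):
                # Copy the windowed input frame to its new position.
                out[dst + i] += samples[src + i] * window[i]
                norm[dst + i] += window[i]
    # Divide by the summed window weights to even out the overlap gain.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]
```

With a = 1.5 the output is roughly one and a half times as long as the input (slower speech), and with a = 0.5 roughly half as long (faster speech).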

<3-2. Modification 2>
In the embodiment described above, step S104 extracts the voice data associated with the feature amount closest to the feature amount extracted from the voice of the learner (user), but the condition for extracting voice data is not limited to being closest to the feature amount of the learner's voice. For example, the database DB1 may record, in association with the voice data of each example sentence, the utterance level of that voice (an index indicating the degree of approximation to the model voice; a higher utterance level means a closer approximation to the model voice), and this utterance level may be incorporated into the condition for selecting voice data. As a specific condition, for example, the voice data whose feature amount is closest may be extracted from among those whose utterance level is at or above a certain level. Alternatively, the voice data with the highest utterance level may be extracted from among those whose feature amount has a degree of approximation at or above a certain value. The utterance level may be calculated, for example, in the same manner as the approximation index in step S201.
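
The two alternative selection conditions of Modification 2 can be sketched as a filter followed by a ranking. The candidate layout, the function names, and the sample values are assumptions. Because the approximation index of step S104 is a distance, "degree of approximation at or above a certain value" is expressed here as an index at or below a threshold, which is also an assumption.

```python
def _index(p, q):
    """Sum of absolute feature differences; smaller means closer."""
    return sum(abs(a - b) for a, b in zip(p, q))


def select_by_level_then_closeness(candidates, p, min_level):
    """Among entries with utterance level >= min_level, pick the closest feature."""
    eligible = [c for c in candidates if c["level"] >= min_level]
    if not eligible:
        return None
    return min(eligible, key=lambda c: _index(p, c["feature"]))


def select_by_closeness_then_level(candidates, p, max_index):
    """Among entries with index <= max_index, pick the highest utterance level."""
    eligible = [c for c in candidates if _index(p, c["feature"]) <= max_index]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["level"])


# Hypothetical candidates: features in Hz, utterance levels on a 0 to 1 scale.
candidates = [
    {"id": "A", "feature": (700.0, 1100.0, 2400.0), "level": 0.9},
    {"id": "B", "feature": (800.0, 1200.0, 2700.0), "level": 0.4},
    {"id": "C", "feature": (780.0, 1190.0, 2650.0), "level": 0.7},
]
P = (800.0, 1200.0, 2700.0)
```

Both functions return None when no entry passes the filter, leaving the fallback behavior (for example, reverting to the closest-feature rule) to the caller.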

<3-3. Modification 3>
The system configuration is not limited to that described in the above embodiment. The language learning system 1 may be connected to a server device via a network, and the server device may perform some of the functions of the language learning system described above.
Furthermore, in the above-described embodiment, the functions of the language learning system are realized in software by the CPU 101 executing the language learning program. However, the system may instead be realized in hardware, using electronic circuits or the like corresponding to the functional components shown in FIG. 1.

<3-4. Modification 4>
In the above-described embodiment, the formant frequencies of the first to third formants are used as the feature quantity of the speaker's voice. However, the feature quantity of the voice is not limited to formant frequencies. It may be a feature quantity calculated by another speech analysis method, such as a spectrogram.

A block diagram showing the functional configuration of the language learning system 1 according to the first embodiment of the present invention.
A diagram illustrating the contents of the database DB1.
A block diagram showing the hardware configuration of the language learning system 1.
A flowchart showing the operation of the language learning system 1.
A flowchart showing the update operation of the database DB1 in the language learning system 1.
A diagram illustrating the spectral envelopes of the model voice (top) and the user voice (bottom).

Explanation of symbols

1 ... language learning system, 2 ... language learning system, 11 ... storage unit, 12 ... input unit, 13 ... feature extraction unit, 14 ... voice data extraction unit, 15 ... reproduction unit, 16 ... storage unit, 17 ... comparison unit, 18 ... DB update unit, 21 ... speech speed conversion unit, 101 ... CPU, 102 ... RAM, 104 ... HDD, 105 ... display, 106 ... microphone, 107 ... voice processing unit, 108 ... speaker, 109 ... keyboard, 110 ... bus, 111 ... I/F

Claims

1. A language learning system comprising:
a database that records, for each of a plurality of speakers, data associating a feature quantity extracted from the speaker's voice with one or more pieces of voice data of the speaker;
voice acquisition means for acquiring a learner's voice;
feature quantity extraction means for extracting a feature quantity of the learner's voice from the voice acquired by the voice acquisition means;
approximation calculation means for calculating an approximation index indicating the difference between the feature quantities of the plurality of speakers recorded in the database and the feature quantity extracted by the feature quantity extraction means;
voice data extraction means for extracting from the database one piece of voice data associated with a feature quantity for which the approximation index calculated by the approximation calculation means satisfies a first condition; and
reproduction means for outputting a voice in accordance with the one piece of voice data extracted by the voice data extraction means.

2. The language learning system according to claim 1, wherein the first condition is that the one having the highest degree of approximation is extracted.

3. The language learning system according to claim 1, further comprising speech speed conversion means for converting the speech speed of the voice data extracted by the extraction means,
wherein the reproduction means outputs a voice in accordance with the voice data whose speech speed has been converted by the speech speed conversion means.

4. The language learning system according to claim 1, further comprising:
storage means for storing a model voice;
comparison means for comparing the model voice with the learner's voice acquired by the voice acquisition means and quantifying the degree of approximation between the two; and
database updating means for adding, when the degree of approximation obtained by the comparison means satisfies a second condition, the learner's voice acquired by the acquisition means to the database in association with the feature quantity extracted by the feature quantity extraction means.