JP2017161632A

JP2017161632A - Pronunciation learning support system and pronunciation learning support method

Info

Publication number: JP2017161632A
Application number: JP2016044435A
Authority: JP
Inventors: 博司佐久田; Hiroshi Sakuta; 大長谷川; Dai Hasegawa; 明夫林; Akio Hayashi
Original assignee: Individual
Current assignee: Individual
Priority date: 2016-03-08
Filing date: 2016-03-08
Publication date: 2017-09-14

Abstract

PROBLEM TO BE SOLVED: To provide a pronunciation learning support system that has a relatively simple structure while taking liaison learning and accent learning into consideration. SOLUTION: A pronunciation learning support system includes: a voice acquisition section 105 which acquires a voice of a word, a phrase, or a sentence uttered by a learner; a feature extraction section which obtains feature quantity time series data of the acquired voice; a database 111 which stores a plurality of pattern feature quantity time series data prepared in advance for the voice of the word, the phrase, or the sentence; a selection section 109 which collates the feature quantity time series data of the acquired voice with the plurality of pattern feature quantity time series data and selects the pattern feature quantity time series data most similar to the feature quantity time series data of the acquired voice; and a learner output section 117 which displays data corresponding to the selected pattern feature quantity time series data. SELECTED DRAWING: Figure 1

Description

The present invention relates to a pronunciation learning support system and a pronunciation learning support method used when a learner learns a language such as a foreign language.

A pronunciation learning support system for use in language learning has been developed that recognizes a speaker's pronunciation from the voice data of the speaker, who is the learner, and from images (video data) of the speaker's mouth, and that gives feedback to the learner by displaying, side by side, images of the inside of the oral cavity for the correct pronunciation and for the speaker's incorrect pronunciation (see, for example, Non-Patent Document 1). However, that system is a large-scale system that includes an image analysis system. Moreover, it does not take into account the learning of linked sounds, such as liaison in French, or the learning of accent, both of which are particularly difficult for foreign language learners.

Accordingly, there is a need for a pronunciation learning support system of relatively simple configuration that takes the learning of linked sounds and the learning of accent into account, and for a pronunciation learning support method using such a system.

Olov Engwall et al., Designing the user interface of the computer-based speech training system ARTUR based on early user tests, Behaviour & Information Technology, Vol. 25, No. 4, July - August 2006, 353 - 365

An object of the present invention is to provide a pronunciation learning support system of relatively simple configuration that takes the learning of linked sounds and the learning of accent into account, and a pronunciation learning support method using such a system.

A pronunciation learning support method according to a first aspect of the present invention uses a pronunciation learning support system and includes the steps of: the pronunciation learning support system acquiring a voice of a word, phrase, or sentence uttered by a learner; the system obtaining feature quantity time-series data of the acquired voice; the system collating the feature quantity time-series data of the acquired voice with a plurality of pattern feature quantity time-series data for the voice of the word, phrase, or sentence stored in a database of the system, and selecting the pattern feature quantity time-series data most similar to the feature quantity time-series data of the acquired voice; and the system displaying image data corresponding to the selected pattern feature quantity time-series data.

In the method of this aspect, the pronunciation learning support system is configured to collate the feature quantity time-series data of the acquired voice of a word, phrase, or sentence with a plurality of pattern feature quantity time-series data for the voice of that word, phrase, or sentence stored in the database of the system, and to select the pattern feature quantity time-series data most similar to the feature quantity time-series data of the acquired voice. The system configuration can therefore be made simpler than when a plurality of pattern feature quantity time-series data is not used.

In addition, because the feature quantity time-series data of the learner's voice is associated with one of the pattern feature quantity time-series data, data prepared in advance for that pattern feature quantity time-series data can be output to the learner.

In the pronunciation learning support method according to a first embodiment of the first aspect of the present invention, the data corresponding to the selected pattern feature quantity time-series data includes an image of the vocal tract corresponding to the voice.

According to this embodiment, an image of the vocal tract corresponding to the learner's voice can be displayed to the learner, which improves the effectiveness of the learner's pronunciation learning.

In the pronunciation learning support method according to a second embodiment of the first aspect of the present invention, the data corresponding to the selected pattern feature quantity time-series data includes moving image data.

According to this embodiment, the use of moving image data further improves the effectiveness of the learner's pronunciation learning.

In the pronunciation learning support method according to a third embodiment of the first aspect of the present invention, the data corresponding to the selected pattern feature quantity time-series data includes stereoscopic image data.

According to this embodiment, the use of stereoscopic image data further improves the effectiveness of the learner's pronunciation learning.

The pronunciation learning support method according to a fourth embodiment of the first aspect of the present invention further includes a step of obtaining the accent of the acquired voice, and the data corresponding to the selected pattern feature quantity time-series data includes data on the accent of the acquired voice.

According to this embodiment, the use of data on the accent of the learner's voice improves the effectiveness of the learner's accent learning.

In the pronunciation learning support method according to a fifth embodiment of the first aspect of the present invention, in the displaying step, data corresponding to feature quantity time-series data of a standard voice is output in addition to the data corresponding to the selected pattern feature quantity time-series data.

According to this embodiment, because data corresponding to the feature quantity time-series data of the standard voice is output in addition to the data corresponding to the selected pattern feature quantity time-series data, the learner can learn by comparing the two, which improves the learning effect.

In the pronunciation learning support method according to a sixth embodiment of the first aspect of the present invention, at least part of the data corresponding to the selected pattern feature quantity time-series data, or at least part of the data corresponding to the feature quantity time-series data of the standard voice, is associated with phonemes.

According to this embodiment, data associated with phonemes is output, which deepens the learner's understanding and improves the learning effect.

A pronunciation learning support device according to a second aspect of the present invention includes: a voice acquisition unit that acquires a voice of a word, phrase, or sentence uttered by a learner; a feature extraction unit that obtains feature quantity time-series data of the acquired voice; a database that stores a plurality of pattern feature quantity time-series data prepared in advance for the voice of the word, phrase, or sentence; a selection unit that collates the feature quantity time-series data of the acquired voice with the plurality of pattern feature quantity time-series data and selects the pattern feature quantity time-series data most similar to the feature quantity time-series data of the acquired voice; and a learner output unit that displays data corresponding to the selected pattern feature quantity time-series data.

The pronunciation learning support system of this aspect is configured to collate the feature quantity time-series data of the acquired voice of a word, phrase, or sentence with a plurality of pattern feature quantity time-series data for the voice of that word, phrase, or sentence stored in the database of the system, and to select the pattern feature quantity time-series data most similar to the feature quantity time-series data of the acquired voice. The system configuration can therefore be made simpler than when a plurality of pattern feature quantity time-series data is not used.

The pronunciation learning support system according to a first embodiment of the second aspect of the present invention further includes an accent analysis unit that obtains the accent of the acquired voice, and the learner output unit is configured to output the analyzed accent in association with the selected pattern feature quantity time-series data.

According to this embodiment, the learner's accent is output in association with the selected pattern feature quantity time-series data, which deepens the learner's understanding and improves the learning effect.

FIG. 1 is a diagram showing the configuration of a pronunciation learning support system according to one embodiment of the present invention.
FIG. 2 is a flowchart for explaining the operation of the pronunciation learning support system.
FIG. 3 is a diagram showing an example of the structure of the data stored in the database.
FIG. 4 is a diagram showing, by way of example, the characters, phonetic symbols, accent display, and vocal tract images of the standard data for the word "April".
FIG. 5 is a flowchart for explaining how the pronunciation learning support system of the present invention is used.

The pronunciation learning support system of the present invention supports a learner in learning the pronunciation of words, phrases, or sentences when the learner is acquiring a language such as a foreign language. The system acquires and analyzes the voice of a word, phrase, or sentence pronounced by the learner, and provides the learner with appropriate information for correcting the pronunciation.

FIG. 1 is a diagram showing the configuration of a pronunciation learning support system 100 according to one embodiment of the present invention. The pronunciation learning support system 100 includes: a learner input unit 101 that receives input such as commands from the learner; a control unit 103 that controls the operation of each unit; a voice acquisition unit 105 that acquires the voice of a word, phrase, or sentence uttered by the learner; a feature extraction unit 107 that extracts feature quantities of the acquired voice; a database 111 that stores data including a plurality of pattern feature quantities prepared in advance for the voice of each word, phrase, or sentence; a selection unit 109 that collates the feature quantities of the acquired voice with the plurality of pattern feature quantities stored in the database 111 and selects the pattern feature quantity most similar to that of the acquired voice; an accent analysis unit 113 that analyzes the accent of the acquired voice; a display content creation unit 115 that creates the content to be displayed to the learner; and a learner output unit 117 that performs output to the learner, such as display and audio.

FIG. 2 is a flowchart for explaining the operation of the pronunciation learning support system 100.

In step S1010 of FIG. 2, the control unit 103 of the pronunciation learning support system 100 determines the word, phrase, or sentence to be learned. The pronunciation learning support system 100 may be configured so that the learner selects the word, phrase, or sentence from a predetermined list and specifies it through the learner input unit 101.

In step S1020 of FIG. 2, the voice acquisition unit 105 of the pronunciation learning support system 100 acquires voice data of the word, phrase, or sentence uttered by the learner. The pronunciation learning support system 100 may be configured so that, in accordance with an instruction from the control unit 103, the learner output unit 117 outputs audio or an image that prompts the learner to utter the specified word, phrase, or sentence. The control unit 103 then instructs the voice acquisition unit 105 to acquire the voice data of the learner's word, phrase, or sentence. As one example, the voice acquisition unit 105 uses a power threshold to delimit the section containing the learner's utterance and acquires the voice data of the word, phrase, or sentence.
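The power-threshold step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `find_voiced_section`, the non-overlapping frames, and the use of mean squared amplitude as "power" are choices made here, not details given in the specification.

```python
from typing import List, Optional, Tuple


def find_voiced_section(
    samples: List[float], frame_len: int, power_threshold: float
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the utterance, or None if silent.

    Assumption: the utterance spans from the first to the last frame whose
    mean power (mean squared amplitude) exceeds the threshold.
    """
    voiced_starts = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(s * s for s in frame) / frame_len
        if power > power_threshold:
            voiced_starts.append(start)
    if not voiced_starts:
        return None
    # End index is exclusive: one past the last sample of the last voiced frame.
    return voiced_starts[0], voiced_starts[-1] + frame_len
```

A practical system would probably smooth the decision over several frames so that short pauses inside a sentence do not split the utterance.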

In step S1030 of FIG. 2, the feature extraction unit 107 of the pronunciation learning support system 100, in accordance with an instruction from the control unit 103, receives the acquired voice data from the voice acquisition unit 105 and creates time-series data of spectral feature quantities of the pronunciation for each time frame of, as one example, 10 to 30 milliseconds. A cepstrum can be used as the feature quantity. The cepstrum is defined as the inverse FFT of the logarithm of the power spectrum obtained by applying an FFT to the voice data. The cepstrum is described in, for example, JP-A-8-87292 and JP-A-9-305195. In general, the spectrum of the human voice is the product of the vocal cord vibration spectrum and the vocal tract spectrum. The pronunciation of a word, phrase, or sentence is related to the vocal tract spectrum, while the loudness and pitch of the voice are related to the vocal cord vibration spectrum. Using the cepstrum makes it possible to capture the characteristics of the vocal tract spectrum, which are what relate to the pronunciation of a word, phrase, or sentence. The vocal tract is explained later. The feature extraction unit 107 creates a time series of per-frame cepstra as the feature quantity time-series data of the voice data of the word, phrase, or sentence.
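As a rough illustration of the cepstrum definition above (inverse transform of the log power spectrum), the following sketch computes a cepstrum per frame. It is dependency-free and therefore uses a naive O(n^2) discrete Fourier transform in place of an FFT; the function names, the non-overlapping frames, the absence of a window function, and the choice to keep only the first `n_coeffs` coefficients are all assumptions made for this example.

```python
import cmath
import math
from typing import List


def dft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Naive discrete Fourier transform (stand-in for an FFT library)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def cepstrum(frame: List[float], n_coeffs: int, eps: float = 1e-12) -> List[float]:
    """Inverse transform of the log power spectrum of one frame."""
    spectrum = dft([complex(s) for s in frame])
    # eps avoids log(0) for frames with empty frequency bins.
    log_power = [math.log(abs(c) ** 2 + eps) for c in spectrum]
    ceps = dft([complex(v) for v in log_power], inverse=True)
    return [c.real for c in ceps[:n_coeffs]]


def cepstrum_series(
    samples: List[float], frame_len: int, n_coeffs: int
) -> List[List[float]]:
    """Feature time series: one cepstrum vector per non-overlapping frame."""
    return [
        cepstrum(samples[i:i + frame_len], n_coeffs)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
```

At a sampling rate of 16 kHz, for instance, a 20 ms frame would correspond to `frame_len = 320`; the sampling rate itself is not stated in the specification.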

In step S1040 of FIG. 2, the selection unit 109 of the pronunciation learning support system 100, in accordance with an instruction from the control unit 103, receives the feature quantity time-series data of the learner's pronunciation from the feature extraction unit 107, collates it with the feature quantity time-series data stored in the database 111, and selects, from the stored feature quantity time-series data, the one most similar to the feature quantity time-series data of the learner's pronunciation. The collation method is described in detail later.

FIG. 3 is a diagram showing an example of the structure of the data stored in the database 111. For each word, phrase, or sentence to be learned, the database 111 stores standard data and a plurality of pattern data. FIG. 3 shows only word A and word B, each of which has n pattern data. The standard data is, for example, data on the standard pronunciation of the word. The pattern data is data on incorrect pronunciations of the same word. The standard data and the pattern data include data for each phoneme, the unit of speech sound of a language. The data for each phoneme includes the phonetic symbol of the phoneme, the time series of the phoneme's feature quantities, and an image of the vocal tract when the phoneme is uttered.

The vocal tract is now explained. In general, the vocal tract consists of the larynx, the pharynx, the oral cavity, and the nasal cavity. The shape and position of the tongue and lips in the oral cavity have a particularly large influence on the pronunciation of words and the like. In this specification, the term vocal tract refers in particular, though without limitation, to the shape and position of the tongue and lips in the oral cavity. Because the relationship between phonemes and the shape and position of the tongue and lips is generally well understood, a vocal tract image showing the shape and position of the tongue and lips can be created for each phoneme. If necessary, such a vocal tract image may be created for each phoneme on the basis of statistical analysis of MRI data. Alternatively, vocal tract images may be created using Electromagnetic Articulography (EMA) measurements.

The feature quantity time-series data that is collated with the learner's feature quantity time-series data is obtained by arranging the time series of the phonemes' feature quantities in phoneme order. In addition to the data for each phoneme, the standard data and the pattern data may include voice data, accent display data, and the like.
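One possible in-memory layout for the standard data and pattern data described above is sketched below. The class and field names (`PhonemeData`, `PronunciationData`, `WordEntry`, `vocal_tract_image`) are hypothetical and chosen only for this example; the specification does not prescribe a storage format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeData:
    symbol: str                   # phonetic symbol of the phoneme
    features: List[List[float]]   # feature vectors, one per time frame
    vocal_tract_image: str        # hypothetical: file name or ID of the image


@dataclass
class PronunciationData:
    """Standard data or one pattern datum for a word, phrase, or sentence."""
    phonemes: List[PhonemeData]

    def symbols(self) -> str:
        return "".join(p.symbol for p in self.phonemes)

    def feature_series(self) -> List[List[float]]:
        # Concatenate per-phoneme feature time series in phoneme order,
        # giving the series that is collated with the learner's data.
        series: List[List[float]] = []
        for p in self.phonemes:
            series.extend(p.features)
        return series


@dataclass
class WordEntry:
    standard: PronunciationData
    patterns: List[PronunciationData]   # commonly observed mispronunciations
```

Keeping the features per phoneme, rather than only the concatenated series, preserves the frame-to-phoneme boundaries that are needed later to relate a matched frame back to a phoneme.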

FIG. 4 is a diagram showing, by way of example, the characters, phonetic symbols, accent display, and vocal tract images of the standard data for the word "April". The vocal tract images are merely conceptual. One example of the phonetic symbols for the word "April" is as follows.
eipril
Because each phonetic symbol corresponds to a phoneme, there are six phonemes in this case. Because the vocal tract image data corresponds to phonemes, there are six pieces of image data, one for each of the six phonemes.

The plurality of pattern data for the word "April" corresponds, for example, to the pronunciations represented by the following phonetic symbols.
april, aplil, aprir, eiplil, eiprir
The pronunciations of many learners are collected, and pronunciations that learners tend to get wrong are stored as pattern data. In this example, each pattern datum stores five or six pieces of vocal tract image data, one for each of its phonemes.

The method of collating the learner's feature quantity time-series data with the stored feature quantity time-series data is now explained. Comparing the two raises the following problems. The length of the learner's voice data is not the same as the length of the pattern voice data. In addition, some parts of the voice data are uttered quickly and others slowly. To collate two such time series, the time axis must therefore be stretched and compressed locally while the series are matched. Specifically, the frame lengths of the two time series are denoted I and J, the respective frames are denoted x and y, and, among the paths that connect the frames of the two time series from (1, 1) to (I, J), the one that minimizes the total collation distance is found. This collation is performed by DP matching, which uses dynamic programming. DP matching is described in, for example, JP-A-9-305195.
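The DP matching and the selection of the most similar stored series can be sketched as follows. This is a minimal version under assumptions not stated in the specification: Euclidean distance between frames, the three basic step directions, 0-based indices (so the path runs from (0, 0) to (I-1, J-1)), and no normalization of the total distance by path length. The function names are chosen for this example.

```python
import math
from typing import List, Optional, Sequence, Tuple

Frame = Sequence[float]


def frame_distance(x: Frame, y: Frame) -> float:
    """Euclidean distance between two feature vectors (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dp_match(a: Sequence[Frame], b: Sequence[Frame]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (minimum total distance, alignment path) between two series."""
    n_a, n_b = len(a), len(b)
    cost = [[math.inf] * n_b for _ in range(n_a)]
    back: List[List[Optional[Tuple[int, int]]]] = [[None] * n_b for _ in range(n_a)]
    cost[0][0] = frame_distance(a[0], b[0])
    for i in range(n_a):
        for j in range(n_b):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            best_cost, best_prev = min(candidates, key=lambda c: c[0])
            cost[i][j] = best_cost + frame_distance(a[i], b[j])
            back[i][j] = best_prev
    # Trace the optimal path back from the final cell to the first cell.
    path = []
    node: Optional[Tuple[int, int]] = (n_a - 1, n_b - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return cost[n_a - 1][n_b - 1], path


def select_most_similar(learner: Sequence[Frame], stored: Sequence[Sequence[Frame]]) -> int:
    """Index of the stored series with the smallest DP-matching distance."""
    return min(range(len(stored)), key=lambda k: dp_match(learner, stored[k])[0])
```

Because the total distance is not normalized, longer stored series can accumulate more cost; dividing by the path length is one common remedy, though whether the described system does so is not stated.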

In step S1050 of FIG. 2, the accent analysis unit 113 of the pronunciation learning support system 100, in accordance with an instruction from the control unit 103, receives the voice data of the learner's word, phrase, or sentence from the voice acquisition unit 105 and obtains the accent. As described above, the frames of the learner's feature quantity time-series data are associated with the frames of the stored feature quantity time-series data, so the position of the accent in the learner's pronunciation can be associated with the data of one of the stored phonemes.
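The association between the learner's accent position and a stored phoneme can be sketched as below. The specification does not say how the accent itself is detected, so this example simply assumes that the accent lies at the learner frame with the highest power; the alignment path is assumed to be a list of (learner frame, stored frame) index pairs such as DP matching yields, and `phoneme_frame_counts` (the number of frames in each stored phoneme) is a hypothetical input.

```python
from typing import List, Sequence, Tuple


def phoneme_of_frame(stored_frame: int, phoneme_frame_counts: Sequence[int]) -> int:
    """Index of the phoneme that contains the given stored frame."""
    end = 0
    for idx, count in enumerate(phoneme_frame_counts):
        end += count
        if stored_frame < end:
            return idx
    raise ValueError("stored_frame lies beyond the last phoneme")


def accent_phoneme(
    frame_powers: Sequence[float],
    path: List[Tuple[int, int]],
    phoneme_frame_counts: Sequence[int],
) -> int:
    """Phoneme index on which the learner's accent falls.

    Assumption: the accent is at the learner frame with maximum power.
    """
    accent_frame = max(range(len(frame_powers)), key=lambda i: frame_powers[i])
    # First stored frame that the alignment path pairs with the accent frame.
    stored_frame = next(j for i, j in path if i == accent_frame)
    return phoneme_of_frame(stored_frame, phoneme_frame_counts)
```

Pitch could serve as the accent cue instead of power, depending on the language; the mapping through the alignment path would stay the same.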

In step S1060 of FIG. 2, the output content creation unit 115 of the pronunciation learning support system 100, in accordance with an instruction from the control unit 103, creates the display output content, which includes the learner's accent for the word, phrase, or sentence as well as the phonetic symbols, vocal tract image data, and the like that correspond to the stored feature quantity time-series data most similar to the feature quantity time-series data of the learner's pronunciation. The learner output unit 117 of the pronunciation learning support system 100 then displays the created content to the learner. The vocal tract image data may be displayed in correspondence with the phonetic symbols. For example, the color of the phonetic symbol of the phoneme whose vocal tract image is currently displayed may be changed. The vocal tract image data may also be displayed as a moving image. For a word composed of a series of phonemes, moving image data may be created by interpolating the per-phoneme image data by an appropriate method.
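The specification leaves the interpolation method open ("an appropriate method"). As one simple possibility, the sketch below cross-fades linearly between consecutive per-phoneme images; representing an image as a grid of grayscale values, the function name, and the `steps_between` parameter are assumptions of this example.

```python
from typing import List

Image = List[List[float]]  # assumed: rows of grayscale pixel values


def interpolate_frames(key_images: List[Image], steps_between: int) -> List[Image]:
    """Build a frame sequence by linear blending between consecutive key images."""
    frames: List[Image] = []
    for a, b in zip(key_images, key_images[1:]):
        for s in range(steps_between):
            t = s / steps_between  # blend weight, 0.0 at image a
            frames.append([
                [(1 - t) * pa + t * pb for pa, pb in zip(row_a, row_b)]
                for row_a, row_b in zip(a, b)
            ])
    # Finish on an unblended copy of the last key image.
    frames.append([row[:] for row in key_images[-1]])
    return frames
```

A pixel cross-fade does not move the tongue or lips smoothly; interpolating articulator contours or model parameters would likely look more natural, at the cost of a more elaborate image representation.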

In addition, the accent of the standard data and the content corresponding to the stored feature quantity time-series data of the standard data may be displayed alongside the content corresponding to the stored feature quantity time-series data most similar to the learner's feature quantity time-series data, so that the two can be compared. Alternatively, those phonemes of the learner's pronunciation that differ from the phonemes of the standard pronunciation may be marked in the display.
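Marking the phonemes that differ from the standard pronunciation can be sketched with a sequence comparison. This example uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the specification does not name a comparison method, and the function name is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def differing_phonemes(learner: Sequence[str], standard: Sequence[str]) -> List[int]:
    """Indices of learner phonemes that do not match the standard sequence."""
    matcher = SequenceMatcher(a=list(learner), b=list(standard), autojunk=False)
    marked: List[int] = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" cover learner phonemes with no match
        # in the standard pronunciation.
        if tag in ("replace", "delete"):
            marked.extend(range(i1, i2))
    return marked
```

For the pattern "eiprir" against the standard "eipril", this marks index 5, the final phoneme. Phonemes present only in the standard pronunciation have no learner index and are not reported by this sketch.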

Image data such as the vocal tract image data may be stereoscopic data. The use of stereoscopic data is known to improve the learning effect.

In addition, a comment on how to correct the pronunciation may be stored in the database 111 for each pattern datum, and the output content creation unit 115 may select the comment for the relevant pattern datum and output it from the learner output unit 117 as displayed text or as audio. For example, when the pattern datum corresponds to the pronunciation
eiprir
a comment on the final phoneme, such as "To pronounce 'L', touch the tip of your tongue just below the gums," may be displayed or output as audio from the learner output unit 117.

A further advantage of the present invention is now explained. In English, for example, a word-final r is not pronounced, but when a word beginning with a vowel follows, the r is pronounced. This phenomenon corresponds to liaison in French. In 1) below, the r of "there" is not pronounced, whereas in 2), the r of "there" is pronounced because it is followed by "is", which begins with a vowel.

In 1) and 2), the phonemes are written using English phonetic symbols. Pronunciation arising from this kind of phenomenon cannot be learned by practicing the pronunciation of individual words alone. According to the present invention, phrases and sentences as well as words can be the object of learning, so pronunciation arising from this kind of phenomenon can also be learned. In general, the present invention is effective for learning linked sounds in various languages, including liaison in French.

In addition, because the position of the learner's accent can be associated with a phoneme as described above, phonetic symbols with accent marks can be displayed for the learner's pronunciation.

図５は、本発明の発音学習支援システム１００の学習者の使用方法を説明するための流れ図である。 FIG. 5 is a flowchart for explaining how to use the learner in the pronunciation learning support system 100 of the present invention.

図５のステップＳ２０１０において、学習者は、発音学習支援システム１００を起動し、学習者用入力部１０１によって、学習の対象とする単語、語句、または文を指定する。発音学習支援システム１００は、指定された単語、語句、または文の学習の準備をし、準備が完了したら学習者用出力部１１７によって学習者に発音を指示する。 In step S2010 of FIG. 5, the learner activates the pronunciation learning support system 100 and designates a word, a phrase, or a sentence to be learned by the learner input unit 101. The pronunciation learning support system 100 prepares to learn a specified word, phrase, or sentence, and when the preparation is completed, the learner output unit 117 instructs the learner to pronounce.

図５のステップＳ２０２０において、学習者は、発音学習支援システム１００の指示にしたがって、単語、語句、または文を発音する。発音学習支援システム１００は、図２の流れ図に示した処理を行い、その結果を、学習者の発音に対応した画像データの表示などとして学習者用出力部１１７から出力する。 In step S2020 in FIG. 5, the learner pronounces a word, phrase, or sentence according to the instruction from the pronunciation learning support system 100. The pronunciation learning support system 100 performs the processing shown in the flowchart of FIG. 2 and outputs the result from the learner output unit 117 as display of image data corresponding to the learner's pronunciation.

図５のステップＳ２０３０において、学習者は、学習者用出力部１１７に表示された発音に対応した画像データを観察する。 In step S2030 of FIG. 5, the learner observes image data corresponding to the pronunciation displayed on the learner output unit 117.

In step S2040 of FIG. 5, the learner studies using the image data. In response to an instruction given by the learner via the learner input unit 101, the image data corresponding to the learner's pronunciation may be displayed repeatedly on the learner output unit 117, and the image data corresponding to the standard speech may be displayed alongside it for comparison.

In step S2050 of FIG. 5, the learner decides whether to continue learning. If the learner continues, the process returns to step S2010. Otherwise, the process ends.
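The flow of steps S2010 to S2050 can be sketched as a simple loop. This is only an illustrative outline: the function `run_session` and all five callbacks are hypothetical stand-ins for the learner's actions and for the units of the system, and are not part of the disclosed system.

```python
def run_session(choose_target, record_voice, analyze, show, wants_to_continue):
    """Run the learner loop of FIG. 5 and return the number of rounds.

    Every argument is a hypothetical callback supplied by the caller.
    """
    rounds = 0
    while True:
        target = choose_target()         # S2010: designate word, phrase, or sentence
        voice = record_voice(target)     # S2020: learner pronounces the target
        result = analyze(target, voice)  # S2020: system processing (FIG. 2)
        show(result)                     # S2030, S2040: learner observes and studies
        rounds += 1
        if not wants_to_continue():      # S2050: continue or end
            return rounds


if __name__ == "__main__":
    answers = iter([True, False])  # continue once, then stop
    n = run_session(
        choose_target=lambda: "les amis",
        record_voice=lambda target: b"",            # placeholder audio
        analyze=lambda target, voice: "image data for " + target,
        show=print,
        wants_to_continue=lambda: next(answers),
    )
    print(n)
```

Passing the steps as callbacks keeps the loop independent of how input, analysis, and display are actually realized.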

According to the pronunciation learning support method of the present invention, the learner can recognize the characteristics of his or her own pronunciation clearly and visually through phonetic symbols corresponding to phonemes and through images of the vocal tract. By storing pronunciations that learners are prone to get wrong as type data in the database 111, the system can handle a variety of learner pronunciations. Further, since the position of the accent in the learner's pronunciation can be displayed in association with the data of one of the stored phonemes, the learner can clearly recognize his or her own accent. Furthermore, since the phoneme and accent data of the standard pronunciation can be displayed in contrast with the data on the learner's pronunciation, the learner's understanding deepens and the learning effect improves. In particular, since the image data including the vocal tract image is displayed as stereoscopic image data, the learner's understanding deepens further and the learning effect improves further.
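The selection of the most similar type data from the database can be sketched as follows. The similarity measure is not stated in this passage; the sketch assumes dynamic time warping (DTW) over sequences of feature vectors purely for illustration, and the names `dtw_distance` and `select_type`, the dictionary layout of the database, and the sample values are all hypothetical.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors.

    DTW is an assumed similarity measure, chosen here only as an example.
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def select_type(acquired, database):
    """Return the label of the stored type data closest to the acquired data."""
    return min(database, key=lambda label: dtw_distance(acquired, database[label]))


if __name__ == "__main__":
    # Hypothetical one-dimensional feature sequences for illustration only.
    database = {
        "standard": [(0.0,), (1.0,), (2.0,)],
        "typical_error": [(0.0,), (3.0,), (3.0,)],
    }
    acquired = [(0.0,), (1.0,), (1.0,), (2.0,)]
    print(select_type(acquired, database))
```

Because DTW tolerates differences in speaking rate, an utterance that is slower than the stored pattern can still match it; other distance measures could be substituted without changing `select_type`.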

Claims

1. A pronunciation learning support method using a pronunciation learning support system, the method comprising the steps of:
the pronunciation learning support system acquiring speech of a word, phrase, or sentence uttered by a learner;
the pronunciation learning support system obtaining feature amount time-series data of the acquired speech;
the pronunciation learning support system collating the feature amount time-series data of the acquired speech with a plurality of type feature amount time-series data of the speech of the word, phrase, or sentence stored in a database of the pronunciation learning support system, and selecting the type feature amount time-series data most similar to the feature amount time-series data of the acquired speech; and
the pronunciation learning support system outputting data corresponding to the selected type feature amount time-series data.

2. The pronunciation learning support method according to claim 1, wherein the data corresponding to the selected type feature amount time-series data includes a vocal tract image corresponding to speech.

3. The pronunciation learning support method according to claim 1, wherein the data corresponding to the selected type feature amount time-series data includes moving image data.

4. The pronunciation learning support method according to claim 1, wherein the data corresponding to the selected type feature amount time-series data includes stereoscopic image data.

5. The pronunciation learning support method according to claim 1, further comprising a step of obtaining an accent of the acquired speech, wherein the data corresponding to the selected type feature amount time-series data includes data on the accent of the acquired speech.

6. The pronunciation learning support method, wherein, in the displaying step, data corresponding to feature amount time-series data of standard speech is output in addition to the data corresponding to the selected type feature amount time-series data.

7. The pronunciation learning support method according to any of the above, wherein at least a part of the data corresponding to the selected type feature amount time-series data, or at least a part of the data corresponding to the feature amount time-series data of the standard speech, is associated with a phoneme.

8. A pronunciation learning support system comprising:
a speech acquisition unit that acquires speech of a word, phrase, or sentence uttered by a learner;
a feature extraction unit that obtains feature amount time-series data of the acquired speech;
a database that stores a plurality of type feature amount time-series data prepared in advance for the speech of the word, phrase, or sentence;
a selection unit that collates the feature amount time-series data of the acquired speech with the plurality of type feature amount time-series data and selects the type feature amount time-series data most similar to the feature amount time-series data of the acquired speech; and
a learner output unit that displays image data corresponding to the selected type feature amount time-series data.

9. The pronunciation learning support system according to claim 8, further comprising an accent analysis unit that obtains the accent of the acquired speech, wherein the learner output unit is configured to output the analyzed accent in association with the selected type feature amount time-series data.