JP2012073299A

JP2012073299A - Language training device

Info

Publication number: JP2012073299A
Application number: JP2010216194A
Authority: JP
Inventors: Shingo Yuasa; 信吾湯浅; Hiroyuki Saito; 裕之斉藤; Chiaki Yoshizuka; 千晶吉塚
Original assignee: Panasonic Corp
Current assignee: Panasonic Corp
Priority date: 2010-09-27
Filing date: 2010-09-27
Publication date: 2012-04-12

Abstract

PROBLEM TO BE SOLVED: To provide a language training device that objectively evaluates whether a user moves the mouth correctly and also lets the user check the movement of his or her own mouth, so that the user can be trained efficiently through a comprehensive evaluation combining subjective and objective evaluation. SOLUTION: Imaging means 10 images the user, and a feature amount extraction unit 34 extracts a feature amount from the change in shape of the user's lip part. Evaluation means 35 compares the reference feature amount of a lip shape change stored in feature storing means 33 with the feature amount extracted by the feature amount extraction unit 34, and quantitatively evaluates the difference between the two. Video display means 20 displays reference target video stored in target storing means 31. A half mirror is laminated on the display surface of the video display means 20 so that mirror video of the user is displayed together with the target video. The user can thus compare his or her own mirror video with the target video.

Description

The present invention relates to a language training device used to acquire correct pronunciation when recovering speech function or learning a foreign language.

As a language training device for learning foreign and other languages, a technique has been proposed in which video of lip movement is captured and voice is sampled for both a standard speaker and a practice subject, and lip contour features and frequency components are then assessed (see, for example, Patent Document 1). Patent Document 1 also includes a figure in which the standard speaker and the practice subject are displayed side by side on a screen.

On the other hand, when the aim is not to learn a foreign language but to restore function in a person whose pronunciation or language function has declined because of a cerebral infarction, the aftereffects of a traffic accident, or the like, a speech-language-hearing therapist often provides one-on-one training.

Patent Document 1: Japanese Translation of PCT Application Publication (Tokuhyo) No. 2008-158055

In the technique described in Patent Document 1, the practice subject is compared with the standard speaker in both lip movement and voice. The technique therefore evaluates whether the lips move as required for correct pronunciation and also whether correct pronunciation is actually produced.

Whether the aim is foreign-language acquisition or functional recovery, it is desirable during language training for the trainee to use a mirror to check on the spot how he or she is opening the mouth. However, Patent Document 1 does not explicitly state that the trainee (practice subject) checks his or her own mouth opening during language training. The technique of Patent Document 1 suggests displaying the standard speaker and the practice subject side by side on a screen, but its focus is on evaluating and advising on the practice subject's pronunciation rather than on letting the practice subject compare mouth opening with the standard speaker on the spot. The technique of Patent Document 1 therefore cannot be expected to provide the same effect as training in which a mirror is used to check mouth opening on the spot.

An object of the present invention is to provide a language training device that yields an objective evaluation of how the mouth is opened in order to acquire correct pronunciation, and that also lets the user check his or her own mouth opening on the spot, thereby enabling efficient language training that combines subjective and objective evaluation.

To achieve the above object, the present invention comprises: video display means for displaying video; target presentation means for presenting, on the video display means, video that includes a model lip shape as target video; feature storage means for storing, as a first feature amount, a feature amount representing the change in shape of the lip part in the target video over a predetermined period; imaging means for imaging a spatial region that includes the user's lip part; mirror video presentation means for presenting to the user, together with the target video displayed on the video display means, mirror video of a spatial region that includes at least the spatial region imaged by the imaging means; feature extraction means for extracting from the images captured by the imaging means, as a second feature amount, a feature amount representing the change in shape of the user's lip part over a period corresponding to the predetermined period; evaluation means for evaluating the difference of the second feature amount from the first feature amount; and display processing means for determining the display content of the video display means according to the result of the evaluation by the evaluation means.

The mirror video presentation means preferably has a configuration in which a half mirror is placed in front of the video display means.

Alternatively, the device may further comprise video inversion means for generating inverted video by flipping the image captured by the imaging means left to right. In that case the target presentation means has a function of presenting the inverted video on the video display means together with the target video, and the mirror video presentation means is constituted by the video inversion means and the target presentation means.

The device may further comprise tongue position detection means for detecting the position of the user's tongue. In this configuration, the feature storage means stores, as a third feature amount, a feature amount representing the change in a model tongue position over the predetermined period; the feature extraction means extracts, as a fourth feature amount, a feature amount representing the change in the user's tongue position over a period corresponding to the predetermined period from the tongue positions detected by the tongue position detection means; and the evaluation means evaluates the difference of the fourth feature amount from the third feature amount.

The device may further comprise voice output means for outputting voice, voice control means for presenting a model voice to the voice output means as target voice, and voice acquisition means for acquiring the user's voice. In this configuration, the feature storage means stores a feature amount of the target voice as a fifth feature amount, the feature extraction means extracts a feature amount of the voice acquired by the voice acquisition means as a sixth feature amount, and the evaluation means evaluates the difference of the sixth feature amount from the fifth feature amount.

In this case, the device may further comprise recording means for recording the user's voice acquired by the voice acquisition means, and the voice control means may cause the voice output means to output the user's voice recorded by the recording means.

Preferably, the target video is a real image captured by the imaging means, and the feature extraction means has a function of extracting the first feature amount from the real image and storing it in the feature storage means.

Alternatively, the target video is preferably a model image in which the lip shape is represented by a model.

With the configuration of the present invention, an objective evaluation of how the mouth is opened in order to acquire correct pronunciation is obtained from the feature amount of the change in lip shape. In addition, mirror video that includes the user's lip part is presented, so the user can check his or her own mouth opening on the spot. Efficient language training that combines subjective and objective evaluation therefore becomes possible.

The drawings are, in order: a block diagram showing the embodiment; a side view showing its external appearance; a perspective view showing an example of its use; and an operation diagram showing example screens.

As shown in FIG. 2, the embodiment described below comprises imaging means 10, which images a spatial region including the lip part of user 1, the subject of language training, and video display means 20, which displays video. User 1 normally uses the device while seated on chair 2.

Imaging means 10 comprises an imaging element such as a solid-state imaging element, typified by a CCD image sensor or a CMOS image sensor. Imaging means 10 is preferably configured to capture color images. It is further desirable to adopt a configuration that, in addition to the color image, acquires three-dimensional information about the space and outputs a distance image, as described below.

Various techniques for outputting a distance image are known, but it is desirable to use one that projects intensity-modulated light, receives the intensity-modulated light reflected by an object, and measures the distance to the object from the phase difference between the projected and received light. With an active configuration that projects intensity-modulated light, the information (light-reception output) needed to generate the pixel values (distance values) for one full frame can be obtained in a single exposure. A moving distance image can therefore be generated, and its temporal resolution can be higher than when three-dimensional information is obtained by triangulation or stereo imaging.
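The phase-difference measurement described above can be illustrated with the relation commonly used for continuous-wave time-of-flight ranging, d = c·Δφ / (4π·f). The patent text does not state this formula or any modulation frequency; the function names and the 10 MHz value in the usage note are assumptions made only for this sketch.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0  # speed of light in vacuum


def phase_to_distance(phase_rad: float, modulation_hz: float) -> float:
    """Distance implied by the phase lag between projected and received light.

    The light travels to the object and back, hence the factor 4*pi:
    2*pi per modulation cycle, times 2 for the round trip.
    """
    if modulation_hz <= 0:
        raise ValueError("modulation frequency must be positive")
    wrapped = phase_rad % (2 * math.pi)  # phase is only known modulo one cycle
    return SPEED_OF_LIGHT_M_S * wrapped / (4 * math.pi * modulation_hz)


def unambiguous_range(modulation_hz: float) -> float:
    """Largest distance that can be measured before the phase wraps around."""
    if modulation_hz <= 0:
        raise ValueError("modulation frequency must be positive")
    return SPEED_OF_LIGHT_M_S / (2 * modulation_hz)
```

With an assumed 10 MHz modulation, `unambiguous_range(10e6)` is roughly 15 m, which comfortably covers a user seated in front of the display.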

Video display means 20 comprises a display such as a flat panel display, typified by a liquid crystal display or a plasma display. In this embodiment, half mirror 21 is placed over the display surface on which video display means 20 displays video.

Because half mirror 21 is placed on the display surface of video display means 20, what user 1 sees depends on the relationship between the luminance of video display means 20 and the brightness of the space in which user 1 is located. When the luminance of the display surface is low, user 1 sees a mirror image of that space. When the luminance of the display surface is high, user 1 sees the video displayed on video display means 20 through half mirror 21. The luminance at which the video becomes visible to user 1 can be set appropriately by choosing the transmittance of half mirror 21.

It follows that, by adjusting the luminance of each region of the display surface of video display means 20, regions that mainly show video, regions that mainly show the mirror image, and regions that show video superimposed on the mirror image can be formed. The ability to vary the appearance region by region is one characteristic of the configuration that uses video display means 20 with half mirror 21.
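As a minimal sketch of region-by-region luminance control, the function below scales the luminance of rectangular regions of a frame before it is sent to the display behind the half mirror. The three mode names and their gain values (1.0, 0.5, 0.0) are assumptions for illustration; the source only says that bright regions show video and dark regions show the mirror image.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]              # rows of 0-255 luminance values
Region = Tuple[int, int, int, int]   # (top, left, bottom, right), bottom/right exclusive

# Hypothetical gains: full brightness shows the video through the half mirror,
# zero brightness leaves only the reflection, and a middle value overlays both.
GAIN = {"video": 1.0, "overlay": 0.5, "mirror": 0.0}


def apply_regions(frame: Frame, regions: Sequence[Tuple[Region, str]]) -> Frame:
    """Return a copy of the frame with each region's luminance scaled by its mode."""
    out = [row[:] for row in frame]
    for (top, left, bottom, right), mode in regions:
        gain = GAIN[mode]
        for y in range(top, bottom):
            for x in range(left, right):
                out[y][x] = round(frame[y][x] * gain)
    return out
```

Pixels outside every listed region are left unchanged, so the default appearance is whatever the original frame displays.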

In the configuration example of FIG. 2, half mirror 21 presents user 1 with a mirror image of the space in which user 1 is located. However, a display equivalent to a mirror image is also provided to user 1 when, without half mirror 21, inverted video obtained by flipping the image captured by imaging means 10 left to right is displayed on video display means 20. The mirror image in half mirror 21 and the inverted video generated by flipping the captured image left to right both correspond mathematically to a reflection of the space in which user 1 is located. Both can therefore be called "mirror video", in which the left and right of that space are reversed. In this embodiment, half mirror 21 serves as the mirror video presentation means. Strictly speaking, the inverted video is not a reflection because the display surface is not its plane of symmetry, but it can be treated as equivalent to a reflection by enlarging or reducing it.
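The left-right inversion used in place of the half mirror amounts to reversing every pixel row of the captured frame. The sketch below assumes a frame stored as a list of rows; that representation and the function name are illustrative and do not come from the source.

```python
from typing import List, Sequence, TypeVar

Pixel = TypeVar("Pixel")


def mirror_frame(frame: Sequence[Sequence[Pixel]]) -> List[List[Pixel]]:
    """Flip a frame left to right so the user sees it as in a mirror."""
    return [list(row[::-1]) for row in frame]
```

Applying `mirror_frame` twice returns the original frame, which is a quick sanity check that only the horizontal axis is being reversed.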

In addition to imaging means 10 and video display means 20, this embodiment includes voice acquisition means 11, which acquires the voice uttered by user 1, and voice output means 22, which produces the voice output described later. A microphone can be used as voice acquisition means 11 and a speaker as voice output means 22. There are no particular restrictions on the configuration or placement of the microphone and speaker. Voice acquisition means 11 and voice output means 22 are not essential, and a configuration without them may be adopted.

Imaging means 10, voice acquisition means 11, video display means 20, and voice output means 22 are connected to control means 30. Control means 30 is built mainly around a device that has a processor and runs it according to a program, such as a microcontroller, DSP, or FPGA.

As shown in FIG. 1, control means 30 comprises target storage means 31, which stores as target video a video that includes the lip shape serving as the model for language training, and display processing means 32, which causes video display means 20 to display the target video supplied from target storage means 31. Target storage means 31 and display processing means 32 together constitute the target presentation means.

User 1 can watch and imitate the model lip movement displayed on video display means 20. Language training models are stored in target storage means 31 in advance, by a speech-language-hearing therapist or the like, in a predetermined format such as single sounds, words, or sentences. Target storage means 31 stores the change in lip shape over a predetermined period that depends on that format.

As described above, in this embodiment video display means 20 and half mirror 21 together constitute the mirror video presentation means, which presents to user 1 mirror video of the spatial region imaged by imaging means 10 together with the target video stored in target storage means 31. In other words, the mirror video presentation means has a configuration in which half mirror 21 is placed in front of video display means 20.

From the model stored in target storage means 31 (the change in lip shape over the predetermined period), a feature amount representing the change in lip shape (the first feature amount) is extracted in advance. Control means 30 includes feature storage means 33, which stores this feature amount. The lip feature amount is expressed as a set of parameters representing the lip shape. Such parameters can be obtained using well-known techniques.

For example, multiple feature points such as the positions of the mouth corners can be extracted from the lip image, the lip size normalized, and ratios expressing the positional relationships among the feature points used as the parameter set. Extracting feature points requires locating the lips, so in addition to recognizing the lip shape, color information in the color image is used to separate the lips from the surrounding area. Once the lip position has been extracted in this way, feature points can be extracted relative to it.
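The idea of size-normalized ratios can be sketched as follows. The four feature points (two mouth corners and the midpoints of the upper and lower lips) and the two ratios computed from them are assumptions chosen for illustration; the source only says that ratios among feature points are used after normalizing lip size.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates, y increasing downward


def lip_parameters(points: Dict[str, Point]) -> Tuple[float, float]:
    """Return (opening_ratio, corner_position), both independent of image scale.

    opening_ratio:   lip opening height divided by mouth width.
    corner_position: where the mouth-corner line sits between the upper-lip
                     midpoint (0.0) and the lower-lip midpoint (1.0).
    """
    left, right = points["left_corner"], points["right_corner"]
    top, bottom = points["upper_mid"], points["lower_mid"]
    width = math.dist(left, right)
    if width == 0:
        raise ValueError("mouth corners coincide; cannot normalize")
    opening_ratio = math.dist(top, bottom) / width
    height = bottom[1] - top[1]
    corner_y = (left[1] + right[1]) / 2
    corner_position = (corner_y - top[1]) / height if height else 0.5
    return opening_ratio, corner_position
```

Because both values are ratios, the same mouth shape gives the same parameters whether the user sits close to the camera or far from it.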

Feature storage means 33 stores the change over time of this parameter set as the feature amount. The time axis of the feature amount is assumed to be normalized. When the language training model is a word or sentence made up of multiple sounds, it is desirable to perform feature extraction by speech recognition and split the feature amount by sound element. The following assumes that the model is a single sound (either one vowel or one consonant-vowel pair).

Feature storage means 33 in control means 30 stores the feature amount of the model so that the change in the lip shape of user 1 during language training can be evaluated. To that end, control means 30 comprises feature extraction means 34, which extracts a feature amount of the change in lip shape (the second feature amount) from the images of user 1 captured by imaging means 10, and evaluation means 35, which evaluates the difference between this feature amount and the feature amount stored in feature storage means 33.

Feature extraction means 34 normalizes the time axis of the extracted feature amount so that it can be compared with the feature amount stored in feature storage means 33. That is, the time axis of the feature amount extracted by feature extraction means 34 is matched to the time axis of the stored feature amount. Evaluation means 35 then adjusts the time axis so that the point at which the lip shape of user 1 starts to change coincides with the start of the shape change corresponding to the stored feature amount.
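One simple way to put two feature tracks on a common time axis is to resample both to the same number of frames. The sketch below uses linear interpolation between neighboring frames; the source does not specify how the normalization is done, so the method, the function name, and the frame count are all assumptions.

```python
from typing import List, Sequence


def resample(track: Sequence[Sequence[float]], n_frames: int) -> List[List[float]]:
    """Resample a track of parameter vectors to n_frames evenly spaced frames.

    The first and last frames are preserved; intermediate frames are linearly
    interpolated between their two nearest neighbors in the original track.
    """
    if n_frames < 2 or len(track) < 2:
        raise ValueError("need at least two frames in and out")
    last = len(track) - 1
    out: List[List[float]] = []
    for i in range(n_frames):
        pos = i * last / (n_frames - 1)   # fractional index into the original track
        lo = min(int(pos), last - 1)      # clamp so that lo + 1 stays in range
        frac = pos - lo
        out.append([a + (b - a) * frac for a, b in zip(track[lo], track[lo + 1])])
    return out
```

Resampling the stored model track and the user's track to the same `n_frames` gives two sequences of equal length whose start and end points coincide, which is what a frame-by-frame comparison needs.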

Evaluation means 35 treats, for example, the parameter set that forms the feature amount as a multidimensional vector, computes the Euclidean distance between the feature amount stored in feature storage means 33 and the feature amount extracted by feature extraction means 34, and judges the difference between the feature amounts to be smaller the smaller this distance is. In other words, using the Euclidean distance as the evaluation value quantifies how much the change in the lip shape of user 1 differs from the change in lip shape in the model.
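The Euclidean-distance evaluation can be sketched as below. It assumes both tracks have already been brought to the same number of frames and parameters, and it flattens each track into one long vector before measuring the distance; that flattening and the function name are illustrative choices, not details given in the source.

```python
import math
from typing import Sequence


def track_distance(reference: Sequence[Sequence[float]],
                   observed: Sequence[Sequence[float]]) -> float:
    """Euclidean distance between two time-aligned tracks of parameter vectors.

    A smaller value means the user's lip movement is closer to the model.
    """
    if len(reference) != len(observed):
        raise ValueError("tracks must have the same number of frames")
    if any(len(r) != len(o) for r, o in zip(reference, observed)):
        raise ValueError("frames must have the same number of parameters")
    flat_ref = [value for frame in reference for value in frame]
    flat_obs = [value for frame in observed for value in frame]
    return math.dist(flat_ref, flat_obs)
```

A distance of zero means the two tracks are identical; how larger values map to a score shown to the user is left open here, as it is in the source.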

Beyond computing the overall Euclidean distance, evaluation means 35 can also evaluate differences in lip shape by computing the distance for selected parameters within the feature amount. When evaluation means 35 evaluates differences in lip shape, the shape differences and the corresponding advice for getting closer to the model can be set in evaluation means 35 as rules (knowledge). With such knowledge in place, the evaluation result yields advice on which part of the mouth should take what shape in order to approach the model.
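A rule set of this kind could be represented as a mapping from parameter name to a pair of messages, one for "too small" and one for "too large". The parameter names, the tolerance, and the advice wording below are all hypothetical; the source says only that shape differences and advice are stored as rules.

```python
from typing import Dict, List, Tuple

# Hypothetical rule format: parameter name -> (message if too small, message if too large)
Rules = Dict[str, Tuple[str, str]]


def advise(reference: Dict[str, float], observed: Dict[str, float],
           rules: Rules, tolerance: float = 0.05) -> List[str]:
    """Return advice for every parameter that deviates from the model by more than tolerance."""
    advice: List[str] = []
    for name, (too_small_msg, too_large_msg) in rules.items():
        diff = observed[name] - reference[name]
        if diff < -tolerance:
            advice.append(too_small_msg)
        elif diff > tolerance:
            advice.append(too_large_msg)
    return advice
```

For example, with a rule `{"opening_ratio": ("Open your mouth wider.", "Close your mouth slightly.")}`, a user whose opening ratio is well below the model's would receive the first message, and a user within the tolerance would receive none.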

The evaluation result obtained by evaluation means 35 is output to video display means 20 through display processing means 32. Video display means 20 thus shows how much the change in the lip shape of user 1 differs from the model. When advice rules have been set in evaluation means 35, video display means 20 also displays advice on how to bring the lip shape change of user 1 closer to the model.

The configuration example above assumes moving images, but still images may be used instead. For example, in the early stages of language learning, diagrams showing the lip shape for each sound are sometimes used, so pronunciation practice can be carried out with a still image of the model in place of such a diagram. In this case too, the evaluation result from evaluation means 35 is displayed on video display means 20 as in the example above.

When a still image is used as the target image, a still image captured by imaging means 10 while user 1 sustains the same sound can be used for the comparison. Compared with moving images, the feature amount contains less data and no time-axis alignment is needed, so the evaluation by evaluation means 35 is simpler.

To create the target video stored in target storage means 31, it is desirable to use real images of an expert such as a speech-language-hearing therapist. Control means 30 is given a selectable operating mode for creating target video, and with this mode selected, imaging means 10 captures real images of the expert that include the lip part. The real images stored in target storage means 31 are then passed to feature extraction means 34 to extract the feature amount, and the extracted feature amount is stored in feature storage means 33. In this way, real images of the expert serve as the model, and the feature amount of the model is stored in feature storage means 33.

The target video stored in target storage means 31 need not be a real image. For example, a human body model in a virtual three-dimensional space rendered by computer graphics may be used, with the lip movement of the model serving as the target video. In that case, the parameters used to create the human body model can be used as the feature amount stored in feature storage means 33.

The lip shape can also be represented with a model simplified to a half-moon or a rhombus. That is, the lip shape in the target video may be drawn with simple figures of the kind used in animation. The feature amount in this case can be set using a publicly available database on the human body or extracted from real images. Such a model cannot represent the change in lip shape precisely, but information that acts as noise can be omitted and only the necessary information emphasized, so user 1 may find it easier to understand than real images.

In actual use, it is desirable to let the user choose between target video based on real images and target video based on a model.

In language training, pronunciation can be evaluated more accurately if the tongue position is detected in addition to the change in lip shape. Tongue position detection means 12, which detects the position of the tongue, may therefore be added.

Tongue position detection means 12 comprises, for example, a tiny housing that can be attached to the tongue and that contains a gyro sensor or three-axis acceleration sensor, a power supply unit that receives power without contact, and a transmitter that sends the sensor output without contact. A device of this kind has a configuration similar to an RFID tag (IC tag), with a semiconductor gyro sensor or three-axis acceleration sensor in place of the RFID memory. As with RFID, such a configuration allows a housing a few millimeters square, so the tongue position can be detected without hindering tongue movement.

When tongue position detection means 12 is used, a feature amount for the model tongue position (the third feature amount), corresponding to the tongue position that tongue position detection means 12 detects, is stored in feature storage means 33 in advance. Feature extraction means 34 extracts a feature amount (the fourth feature amount) corresponding to the change in tongue position detected by tongue position detection means 12. As with the feature amounts of the lip shape change, the two feature amounts are evaluated by evaluation means 35, and the evaluation result is presented on video display means 20 through display processing means 32. Also as with the lip shape change, the feature amount for the change in tongue position is extracted over the predetermined period and its time axis is aligned.

As described above, if the tongue position detecting means 12 is provided and the feature amount of the tongue position change is also evaluated, sounds that are distinguished by tongue position, such as consonants, can also be compared with the model. That is, using the tongue position detecting means 12 detects not only the lip shape change but also the tongue position change, so language training can be performed more accurately.

In language training, it is also desirable to evaluate the voice. The present embodiment therefore includes voice acquisition means 11 for acquiring the voice uttered by the user 1, so that the voice can also be compared with the model. In addition, a target voice serving as a model is registered in the target storage means 31 and can be presented from the voice output means 22 via the voice control means 37. The feature storage means 33 stores a feature amount of the voice corresponding to the target voice (fifth feature amount).

The voice of the user 1 acquired by the voice acquisition means 11 is input to the feature extraction means 34, which extracts the feature amount of the voice. Techniques well known in the field of speech recognition can be used for this extraction. The voice feature amount is likewise extracted over a predetermined period, with the time axes matched. The evaluation means 35 compares the feature amount of the voice acquired from the user 1 with the feature amount stored in the feature storage means 33, thereby evaluating the difference from the model. The evaluation result is presented on the video display means 20.
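The comparison of a model feature amount with a user feature amount after matching their time axes can be sketched as follows. This is only an illustrative sketch: the description does not specify how the time axes are matched or how the difference is measured, so the linear resampling, the Euclidean distance, and the names `resample` and `mean_difference` are all assumptions introduced here.

```python
from math import sqrt


def resample(seq, n):
    """Linearly resample a list of feature vectors to n frames (assumed alignment method)."""
    if n < 2 or len(seq) < 2:
        raise ValueError("need at least two frames")
    out = []
    for i in range(n):
        pos = i * (len(seq) - 1) / (n - 1)  # fractional index into seq
        lo = int(pos)
        hi = min(lo + 1, len(seq) - 1)
        t = pos - lo
        out.append([a + (b - a) * t for a, b in zip(seq[lo], seq[hi])])
    return out


def mean_difference(reference, observed):
    """Mean per-frame Euclidean distance after stretching `observed` to the reference length."""
    n = len(reference)
    aligned = resample(observed, n)
    total = 0.0
    for ref_frame, obs_frame in zip(reference, aligned):
        total += sqrt(sum((a - b) ** 2 for a, b in zip(ref_frame, obs_frame)))
    return total / n
```

A value of 0 means the user's sequence coincides with the model after alignment; larger values indicate a larger deviation. The same function could be applied to lip, tongue, or voice feature sequences, since each is described as being extracted over a predetermined period.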

The control means 30 includes recording means 36 for recording the voice of the user acquired by the voice acquisition means 11, and the voice recorded by the recording means 36 is output from the voice output means 22 through the voice control means 37 as required. That is, the user 1 can check, after uttering it, the voice uttered during language training. The voice control means 37 also has a function of outputting the voice of the user 1 recorded by the recording means 36 to the voice output means 22 together with the target voice stored in the target storage means 31. With this function, the voice of the user 1 can be output superimposed on the model voice, so the user 1 can personally check the difference between the two.

An example of the use of the language training apparatus described above follows. Here, as shown in FIG. 3, the user 1 is assumed to use the apparatus while seated on a chair 2 in front of the half mirror 21. The screen in FIG. 3 has the specific content shown in FIG. 4. This screen is set up for training a user 1 who cannot recall a certain word, or who can recall the word but cannot pronounce it correctly.

The lower part of the screen shown in FIG. 4 contains a target area 42, which displays the target video, and a subject area 43, which displays the mirror video of the user 1. The subject area 43 is where the mirror video produced by the half mirror 21 appears. The part of the display surface of the video display means 20 corresponding to the subject area 43 is set to a luminance of 0, or to a luminance low enough that no light passes through to the front side of the half mirror 21.

The upper part of the screen shown in FIG. 4 contains a picture area 41, which displays a picture or figure related to the word that the user 1 is to recall. Below the picture area 41 is a character area 44, which displays the characters of that word.

To allow interactive input by the user 1, the apparatus can recognize the movement of the hand of the user 1 from the image captured by the imaging means 10 and accept gesture input. When the imaging means 10 generates only two-dimensional images, the hand position is recognized within a plane at a fixed distance from the half mirror 21, which enables input according to the hand position. The user does not touch the half mirror 21, but the operation resembles that of a touch panel. That is, operations are assigned to positions on the screen, and the operation corresponding to the position that the hand approaches is performed.

When the imaging means 10 generates three-dimensional images, on the other hand, the apparatus first recognizes the tip of the hand, then extracts its three-dimensional position and performs input according to that position. In this case, a cursor (marker) indicating the hand position is displayed on the screen, and placing the cursor over the desired position performs the operation assigned to that position on the screen.

The operations described above depend on the position of the hand, but a technique in which the operation depends on the movement of the hand may be adopted instead. In some cases, the operation can also be performed with parts of the body of the user 1 other than the hand. Input techniques that use the body movements of the user 1 to perform operations in this way are called “gesture input”.

The screen shown in FIG. 4 has five buttons 45 to 49 that accept gesture input. These buttons 45 to 49 are used as follows. The pictures and figures displayed in the picture area 41 are selected from a plurality of sets. Each set contains a plurality of pictures and figures, and a set is selected to suit the user 1 undergoing language training.
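Position-based gesture input of this kind amounts to a hit test of the recognized hand position against the screen regions assigned to the buttons. The sketch below illustrates the idea. The normalized coordinate system, the `Button` type, the `hit_test` function, and the button layout are all hypothetical; the description does not give the actual geometry of FIG. 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    label: str
    x: float  # left edge, normalized screen coordinates (assumed)
    y: float  # top edge
    w: float  # width
    h: float  # height


# Hypothetical layout: five buttons in a row; the real positions are not specified.
BUTTONS = [
    Button("hint", 0.0, 0.9, 0.2, 0.1),      # button 45
    Button("answer", 0.2, 0.9, 0.2, 0.1),    # button 46
    Button("previous", 0.4, 0.9, 0.2, 0.1),  # button 47
    Button("next", 0.6, 0.9, 0.2, 0.1),      # button 48
    Button("back", 0.8, 0.9, 0.2, 0.1),      # button 49
]


def hit_test(buttons, hand_x, hand_y):
    """Return the label of the button under the hand position, or None."""
    for b in buttons:
        if b.x <= hand_x < b.x + b.w and b.y <= hand_y < b.y + b.h:
            return b.label
    return None
```

With three-dimensional imaging, the same test would be applied to the on-screen cursor position rather than to the hand position in a fixed plane.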

When a picture or figure is first displayed in the picture area 41, nothing is displayed in the character area 44. If the user 1 can recall the word for the picture or figure displayed in the picture area 41, the user 1 presses the “answer” button 46 (selecting by gesture input is referred to below as “pressing”). Pressing the “answer” button 46 displays the target video in the target area 42, showing a model of the correct pronunciation. Because the subject area 43 shows the mirror video of the user 1 at this time, the user 1 can practice moving the lips while checking the shape change (movement) of the lips in the target video.

If the user 1 cannot recall the word even after looking at the picture or figure in the picture area 41, the user 1 presses the “hint” button 45. Blank marks (circles), one per character of the word, then appear in the character area 44. At first the blank marks contain no characters and indicate only the number of characters. If the user 1 can recall the word at this point, the user 1 presses the “answer” button 46; if not, the user 1 presses the “hint” button 45 again.

Each time the “hint” button 45 is pressed, one more hiragana character is displayed in the character area 44. That is, the word corresponding to the picture or figure in the picture area 41 is revealed in the character area 44 one sound at a time with each press of the “hint” button 45. It is desirable that the hiragana also be revealed one character at a time at fixed intervals even when the “hint” button 45 is not pressed.

In either case, the user 1 can press the “answer” button 46 upon feeling that the word has been recalled. Pressing the “answer” button 46 displays the target video in the target area 42 even if the word has not been recalled, but in that case the characters displayed in the character area 44 are incomplete, so the user 1 does not obtain the reward of learning the correct word. The user 1 is therefore expected to come to press the “answer” button 46 only when the word has actually been recalled, in order to obtain that reward. If the “answer” button 46 is pressed by mistake, pressing the “back” button 49 returns the screen to the state in which the target video is not displayed.
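The hint and answer behavior described above can be modeled as a small state machine. The following is a minimal sketch under stated assumptions: the class name `HintSession`, its method names, and the example word are hypothetical, and the timed reveal is represented only by the possibility of calling `press_hint` from a timer.

```python
class HintSession:
    """Tracks what the character area and target area show for one word (illustrative only)."""

    BLANK = "○"  # blank mark shown for each character not yet revealed

    def __init__(self, word):
        self.word = word
        self.revealed = None          # None: character area is still empty
        self.target_video_shown = False

    def press_hint(self):
        """First press shows blank marks only; each later press reveals one more character."""
        if self.revealed is None:
            self.revealed = 0
        elif self.revealed < len(self.word):
            self.revealed += 1

    def press_answer(self):
        """Show the target video regardless of how many characters are revealed."""
        self.target_video_shown = True

    def press_back(self):
        """Undo an accidental press of the answer button."""
        self.target_video_shown = False

    def character_area(self):
        """Return the text currently displayed in the character area."""
        if self.revealed is None:
            return ""
        hidden = len(self.word) - self.revealed
        return self.word[: self.revealed] + self.BLANK * hidden
```

For the example word "りんご", the character area would show nothing at first, then "○○○" after the first hint, then "り○○" after the second. A periodic timer calling `press_hint` would give the automatic reveal at fixed intervals that the description recommends.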

After practicing the lip movement for one word along with the target video as described above, pressing the “next” button 48 displays the next picture or figure of the set in the picture area 41. To return to the previous picture or figure, the user presses the “previous” button 47.

The operation example above is effective for a user 1 who can understand pictures and figures but cannot recall words. For a user 1 who can recall the word from a picture or figure but cannot pronounce it, the characters are displayed in the character area 44 at the same time as the picture or figure, after which the model voice is output from the voice output means 22 (see FIG. 1) or the target video is displayed in the target area 42. In this operation the evaluation means 35 performs its evaluation and yields a quantitative evaluation value for the difference between the model and the user 1, so the evaluation value may be converted into a score to motivate the user to train.
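One simple way to turn a quantitative difference into a score is a clamped linear mapping onto a 0 to 100 scale. The description does not specify any scoring formula, so the function `to_score`, its `worst_difference` parameter, and the 100-point scale below are assumptions made purely for illustration.

```python
def to_score(difference, worst_difference):
    """Map a non-negative difference to an integer score from 0 (worst) to 100 (identical)."""
    if worst_difference <= 0:
        raise ValueError("worst_difference must be positive")
    # Clamp so that differences beyond the assumed worst case still score 0.
    ratio = min(max(difference / worst_difference, 0.0), 1.0)
    return round(100 * (1.0 - ratio))
```

Here `worst_difference` is the difference at or beyond which the score bottoms out; in practice it would have to be chosen empirically for each kind of feature amount.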

The subject area 43, in which the half mirror 21 shows the mirror video of the user 1, can be moved, and its transparency changed, by changing what is displayed on the screen of the video display means 20. Accordingly, the position of the user 1 can be recognized from the image captured by the imaging means 10, and the position of the subject area 43 can be changed according to the position of the user 1. In this case, if the user 1 moves so that the subject area 43 overlaps the target area 42, the mirror video of the user 1 can be superimposed on the model target video. Such superposition lets the user 1 visually check the difference in lip movement between the model and the user 1, further enhancing the training effect.

The configuration described above uses the half mirror 21 to show the mirror video of the user 1. Instead of using the half mirror 21, however, an inverted video may be generated by flipping the image captured by the imaging means 10 left to right, and this inverted video may be displayed on the screen of the video display means 20. That is, video inversion means 38 is provided to invert, left and right, the image of the spatial area including the user 1 captured by the imaging means 10 (see FIG. 1). The inverted video generated by the video inversion means 38 is displayed on the video display means 20 together with the target video by the display processing means 32. Even when an inverted video generated by image processing is used in this way, a mirror video with the same effect as one produced by the half mirror 21 can be presented to the user 1.
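Left-right inversion of a captured frame is a simple per-row reversal of pixels. The sketch below shows it on a frame represented as a list of pixel rows. This representation and the name `flip_left_right` are assumptions for illustration; the description does not say how the video inversion means 38 is implemented.

```python
def flip_left_right(frame):
    """Return a horizontally mirrored copy of a frame given as a list of pixel rows."""
    return [list(reversed(row)) for row in frame]
```

Applying the function to every captured frame yields the inverted video; applying it twice restores the original frame.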

If the display processing means 32 handles the target video and the inverted video as separate layers, the target video and the mirror video of the user 1 can easily be displayed either side by side or superimposed.
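Treating the two videos as separate layers means the same pair of frames can be composed in either arrangement. The following sketch shows both compositions on grayscale frames stored as lists of rows. The frame representation, the function names, and the use of alpha blending for the superimposed state are assumptions, not details taken from the description.

```python
def side_by_side(target, mirror):
    """Place two equally tall frames next to each other, target on the left."""
    if len(target) != len(mirror):
        raise ValueError("frames must have the same height")
    return [t_row + m_row for t_row, m_row in zip(target, mirror)]


def overlay(target, mirror, alpha=0.5):
    """Blend the mirror layer over the target layer; alpha is the mirror layer's opacity."""
    return [
        [(1 - alpha) * t + alpha * m for t, m in zip(t_row, m_row)]
        for t_row, m_row in zip(target, mirror)
    ]
```

Because the layers stay separate until composition, switching between the two display states requires only choosing which function to call on each frame pair.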

DESCRIPTION OF SYMBOLS
10 Imaging means
11 Voice acquisition means
12 Tongue position detecting means
20 Video display means (mirror video presentation means)
21 Half mirror (mirror video presentation means)
22 Voice output means
30 Control means
31 Target storage means (target presentation means)
32 Display processing means
33 Feature storage means
34 Feature extraction means
35 Evaluation means
36 Recording means
37 Voice control means
38 Video inversion means (mirror video presentation means)

Claims

1. A language training apparatus comprising: video display means for displaying video; target presentation means for presenting, on the video display means as a target video, a video including the shape of a lip portion serving as a model; feature storage means for storing, as a first feature amount, a feature amount representing the shape change of the lip portion included in the target video during a predetermined period; imaging means for imaging a spatial area including the lip portion of a user; mirror video presentation means for presenting to the user, together with the target video displayed on the video display means, a mirror video of a spatial area including at least the spatial area imaged by the imaging means; feature extraction means for extracting, as a second feature amount, a feature amount representing the shape change of the lip portion of the user in a period corresponding to the predetermined period from the image captured by the imaging means; evaluation means for evaluating a difference of the second feature amount with respect to the first feature amount; and display processing means for determining the display content of the video display means according to the evaluation result of the evaluation means.

2. The language training apparatus according to claim 1, wherein the mirror video presentation means has a configuration in which a half mirror is disposed in front of the video display means.

3. The language training apparatus according to claim 1, further comprising video inversion means for generating an inverted video obtained by inverting left and right the image captured by the imaging means, wherein the target presentation means has a function of presenting the inverted video together with the target video on the video display means, and the mirror video presentation means comprises the video inversion means and the target presentation means.

4. The language training apparatus according to claim 1, further comprising tongue position detecting means for detecting the position of the tongue of the user, wherein the feature storage means stores, as a third feature amount, a feature amount representing the change in position, during the predetermined period, of the tongue serving as a model, the feature extraction means extracts, as a fourth feature amount, a feature amount representing the change in position of the tongue of the user in a period corresponding to the predetermined period from the tongue position detected by the tongue position detecting means, and the evaluation means evaluates a difference of the fourth feature amount with respect to the third feature amount.

5. The language training apparatus according to claim 1, further comprising: voice output means for outputting voice; voice control means for presenting a model voice to the voice output means as a target voice; and voice acquisition means for acquiring the voice of the user, wherein the feature storage means stores the feature amount of the target voice as a fifth feature amount, the feature extraction means extracts the feature amount of the voice acquired by the voice acquisition means as a sixth feature amount, and the evaluation means evaluates a difference of the sixth feature amount with respect to the fifth feature amount.

6. The language training apparatus according to claim 5, further comprising recording means for recording the voice of the user acquired by the voice acquisition means, wherein the voice control means causes the voice output means to output the voice of the user recorded by the recording means.

7. The language training apparatus according to any one of claims 1 to 6, wherein the target video is an actual image captured by the imaging means, and the feature extraction means has a function of extracting the first feature amount from the actual image and storing it in the feature storage means.

8. The language training apparatus according to claim 1, wherein the target video is a model image in which the shape of the lip portion is represented by a model.