JP2010185967A

JP2010185967A - Pronunciation training device

Info

Publication number: JP2010185967A
Application number: JP2009028881A
Authority: JP
Inventors: Masahiro Kumazawa; 昌宏熊澤
Original assignee: Chubu Electric Power Co Inc
Current assignee: Chubu Electric Power Co Inc
Priority date: 2009-02-10
Filing date: 2009-02-10
Publication date: 2010-08-26

Abstract

PROBLEM TO BE SOLVED: To provide a pronunciation training device that allows a trainee to directly check the lip shape, movement and the like needed for correct pronunciation, and to train pronunciation efficiently. SOLUTION: The pronunciation training device 1 comprises: language recognition means for recognizing the language uttered by a trainee at a trainee voice generation position and the language uttered by an instructor at an instructor voice generation position; character conversion means for converting the recognized trainee-uttered language into characters that visualize it, and the recognized instructor-uttered language into characters that visualize it; and display control means for performing control so that the characters visualizing the trainee-uttered language are displayed at a trainee voice visualization position in a captured image and the characters visualizing the instructor-uttered language are displayed at an instructor voice visualization position in the captured image. COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a pronunciation training device that displays, in a display area, characters visualizing the language uttered by a trainee and characters visualizing the language uttered by an instructor.

Patent Document 1 discloses a voice display device that allows a person to practice correct utterance or pronunciation easily, even alone. In that device, the display unit shows a detected pattern shape representing the speaker's utterance or pronunciation side by side with a pre-registered reference pattern shape representing the correct utterance or pronunciation, so the speaker can view the two shapes next to each other on the screen.

The speaker can therefore correct his or her own utterance or pronunciation, and acquire the correct one, by repeating it while watching both pattern shapes on the screen and working to eliminate the difference between them.

Patent Document 2 discloses a technique relating to a sound source exploration system that estimates the direction of a sound source, captures video of the vicinity of the estimated sound source position with a camera, and marks the sound source position on that video as shown on a display.

Patent Document 1: JP 2001-343890 A
Patent Document 2: JP 2002-181913 A

In the voice display device described above, however, the trainee (the speaker) can see the reference pattern shape on the screen but cannot directly see the shape and movement of the lips that produce the corresponding sound. Because the trainee cannot directly learn the lip shape and movement, training to eliminate the difference between the detected pattern and the reference pattern requires the trainee to guess at lip shapes and movements that might produce the sound corresponding to the reference pattern shape, and to bring the detected pattern shape closer to the reference pattern shape through repeated trial and error.

Consequently, with the voice display device described above, it can take a long time for the trainee to bring the detected pattern shape close to the reference pattern shape, and the pronunciation training can hardly be called efficient.

The present invention has been proposed in view of these circumstances. Its object is to provide a pronunciation training device with which a trainee can directly check the lip shape, movement and the like needed for correct pronunciation and can train pronunciation efficiently.

The pronunciation training device according to the invention of claim 1 comprises: display means having a display area capable of displaying a captured image, taken by a camera, that includes an image of a trainee and an image of an instructor; sound collecting means capable of collecting a trainee voice, which is the voice uttered by the trainee, and an instructor voice, which is the voice uttered by the instructor; position specifying means for specifying, based on the trainee voice and the instructor voice collected by the sound collecting means, a trainee voice generation position at which the trainee voice was uttered and an instructor voice generation position at which the instructor voice was uttered; correlation means for correlating the trainee voice generation position specified by the position specifying means with a trainee voice visualization display position in the captured image displayed in the display area, and for correlating the instructor voice generation position specified by the position specifying means with an instructor voice visualization display position in that captured image; language recognition means for recognizing trainee-uttered language, which is the language uttered by the trainee at the trainee voice generation position, and instructor-uttered language, which is the language uttered by the instructor at the instructor voice generation position; character conversion means for converting the trainee-uttered language recognized by the language recognition means into trainee language visualization characters that visualize it, and for converting the instructor-uttered language recognized by the language recognition means into instructor language visualization characters that visualize it; and display control means for performing control to display the trainee language visualization characters at the trainee voice visualization display position correlated with the trainee voice generation position by the correlation means, and to display the instructor language visualization characters at the instructor voice visualization display position correlated with the instructor voice generation position by the correlation means.
According to the pronunciation training device of the invention of claim 1, the display control means performs control to display the trainee language visualization characters at the trainee voice visualization display position in the captured image that includes the image of the trainee and the image of the instructor, and to display the instructor language visualization characters at the instructor voice visualization display position in that captured image.
As a result, the display area can show the captured image including the images of the trainee and the instructor and, at the same time, show the trainee language visualization characters and the instructor language visualization characters within that captured image.
The trainee can therefore compare the trainee language visualization characters with the instructor language visualization characters shown in the display area and, on seeing that they differ, recognize that the language he or she uttered differs from the language the instructor uttered.
In addition, while recognizing this difference, the trainee can directly check the lip shape, movement and the like shown in the instructor's image in the display area, and match his or her own lip shape and movement to the instructor's.
The trainee can thus efficiently carry out training that corrects his or her own uttered language so that it approaches the language uttered by the instructor.
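
As a rough illustration of the correlation between a voice generation position and a display position, the sketch below maps a 3-D position to a pixel in the captured image with a pinhole camera model. The coordinate frame (camera at the origin, looking along +y, x to the right, z up), the image size, the field of view and the function name `project_to_display` are all assumptions made for illustration; the text above does not state how the correlation is computed.

```python
import math


def project_to_display(position, width=1920, height=1080, hfov_deg=60.0):
    """Map a 3-D sound position (metres) to a pixel (u, v) in the captured image.

    Assumed frame: camera at the origin, optical axis along +y,
    x to the right, z up. Pixel (0, 0) is the top-left corner.
    """
    x, y, z = position
    if y <= 0:
        raise ValueError("position must be in front of the camera (y > 0)")
    # Focal length in pixels, derived from the assumed horizontal field of view.
    focal_px = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    u = width / 2 + focal_px * x / y
    v = height / 2 - focal_px * z / y  # image rows grow downward
    return round(u), round(v)


# Hypothetical positions: trainee to the left of the axis, instructor to the right.
trainee_px = project_to_display((-0.5, 2.0, 0.1))
instructor_px = project_to_display((0.5, 2.0, 0.1))
```

Under these assumptions, each speaker's characters would be anchored at the returned pixel, so the text appears next to the person who spoke.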

The invention of claim 2 is the device of claim 1, wherein the sound collecting means comprises: a microphone group, for collecting the trainee voice and the instructor voice, in which pairs of microphones, each pair spaced apart by a predetermined horizontal interval, are arranged orthogonally to one another; and an opposed microphone, for collecting the trainee voice and the instructor voice, placed at an opposing interval from each microphone of the pairs that is equal to the horizontal interval; and wherein the position specifying means calculates the trainee voice generation position by the hyperbolic method, using the difference in the time at which the trainee voice reaches the two microphones of each pair and the difference in the time at which it reaches each of those microphones and the opposed microphone, and likewise calculates the instructor voice generation position by the hyperbolic method, using the corresponding arrival time differences of the instructor voice.
According to the invention of claim 2, the position specifying means calculates the trainee voice generation position and the instructor voice generation position by the hyperbolic method, using the arrival time differences of the trainee voice and the instructor voice between the two microphones of each pair in the microphone group and between each of those microphones and the opposed microphone.
Therefore, even when the position at which the trainee voice or the instructor voice is uttered changes from moment to moment, the hyperbolic method can take the intersection of the multiple hyperbolic loci, which pass between the microphones of each pair and between each microphone and the opposed microphone, as the trainee voice generation position or the instructor voice generation position.
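
The idea behind the hyperbolic method can be sketched numerically: each arrival time difference constrains the source to one hyperbolic locus, and the source lies where the loci meet. The sketch below is not the computation used in the device; it swaps the geometric intersection for a simple compass search that minimizes the mismatch between predicted and measured time differences. The microphone coordinates, the speed of sound, and the names `mic_positions`, `tdoa`, `cost` and `locate` are assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at about 20 degrees C


def mic_positions(d):
    """Hypothetical layout: M1/M4 on the x axis, M2/M3 on the y axis, M5 above."""
    h = d * math.sqrt(3.0) / 2.0  # puts M5 at distance d from each of M1..M4
    return {
        "M1": (-d / 2, 0.0, 0.0), "M4": (d / 2, 0.0, 0.0),
        "M2": (0.0, -d / 2, 0.0), "M3": (0.0, d / 2, 0.0),
        "M5": (0.0, 0.0, h),
    }


# Pairs named in the text: the two crossed pairs, plus each of M1..M4 against M5.
PAIRS = [("M1", "M4"), ("M2", "M3"),
         ("M1", "M5"), ("M2", "M5"), ("M3", "M5"), ("M4", "M5")]


def tdoa(source, mics, pair):
    """Arrival time difference (seconds) of a sound from `source` at the two mics."""
    a, b = pair
    return (math.dist(source, mics[a]) - math.dist(source, mics[b])) / SPEED_OF_SOUND


def cost(source, mics, measured):
    """Sum of squared mismatches between predicted and measured time differences."""
    return sum((tdoa(source, mics, p) - measured[p]) ** 2 for p in PAIRS)


def locate(mics, measured, start=(0.5, 0.5, 0.5), step=0.5, tol=1e-6, max_iter=10000):
    """Compass search for the point whose time differences best match `measured`."""
    best, best_cost = tuple(start), cost(start, mics, measured)
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for axis in range(3):
            for sign in (1.0, -1.0):
                cand = list(best)
                cand[axis] += sign * step
                c = cost(tuple(cand), mics, measured)
                if c < best_cost:
                    best, best_cost, improved = tuple(cand), c, True
        if not improved:
            step /= 2.0  # refine the search once no neighbour is better
    return best
```

A typical call would build `measured = {p: ... for p in PAIRS}` from the recorded signals and pass it to `locate`. With a compact array the distance to the source is only weakly constrained by time differences, so a practical system would likely need a more careful solver than this search.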

The invention of claim 3 is the device of claim 1 or 2, further comprising: determination means for comparing the trainee-uttered language recognized by the language recognition means with the instructor-uttered language recognized by the language recognition means, and making either a correct determination, in which the trainee-uttered language matches the instructor-uttered language, or an incorrect determination, in which it does not; and determination result display control means for performing control to display the result of the correct determination or of the incorrect determination in the display area.
According to the invention of claim 3, the determination result display control means performs control to display the result of the correct determination or of the incorrect determination in the display area.
As a result, from whichever determination result (correct or incorrect) is displayed in the display area, the trainee can easily confirm at a glance whether he or she has pronounced correctly, that is, whether the language he or she uttered matches the language the instructor uttered.
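
The determination step can be pictured as a comparison of the two recognized strings. The following is a minimal sketch under assumptions: the function name `judge`, the result labels, and the normalization (ignoring whitespace and letter case) are illustrative and are not taken from the text.

```python
def judge(trainee_text, instructor_text):
    """Return 'correct' if the two recognized utterances match, else 'incorrect'."""
    def normalize(text):
        # Assumption: whitespace and letter case are not part of the pronunciation.
        return "".join(text.split()).casefold()

    if normalize(trainee_text) == normalize(instructor_text):
        return "correct"
    return "incorrect"
```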

The invention of claim 4 is the device of any one of claims 1 to 3, wherein the character conversion means changes the size of the trainee language visualization characters and the size of the instructor language visualization characters based on, respectively, the volume of the trainee voice and the volume of the instructor voice collected by the sound collecting means, and the display control means displays the trainee language visualization characters at the trainee voice visualization display position in the size set by the character conversion means, and displays the instructor language visualization characters at the instructor voice visualization display position in the size set by the character conversion means.
According to the invention of claim 4, the display control means displays the trainee language visualization characters at the trainee voice visualization display position in a size changed according to the volume of the trainee voice, and displays the instructor language visualization characters at the instructor voice visualization display position in a size changed according to the volume of the instructor voice.
The trainee can therefore compare the size of the trainee language visualization characters with the size of the instructor language visualization characters and judge whether the volume of the trainee voice matches the volume of the instructor voice.
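
One simple way to tie character size to volume is a clamped linear mapping from sound pressure level to font size. The sketch below assumes such a mapping; the decibel range, the point sizes and the function name `font_size_for_level` are hypothetical, since the text does not specify how size depends on volume.

```python
def font_size_for_level(level_db, min_db=40.0, max_db=80.0, min_pt=24, max_pt=96):
    """Map a sound pressure level (dB) to a font size (pt), clamped to [min_pt, max_pt]."""
    t = (level_db - min_db) / (max_db - min_db)
    t = max(0.0, min(1.0, t))  # clamp so very quiet or very loud speech stays legible
    return round(min_pt + t * (max_pt - min_pt))


# A quieter trainee voice yields smaller characters than a louder instructor voice.
trainee_pt = font_size_for_level(55.0)
instructor_pt = font_size_for_level(70.0)
```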

The invention of claim 5 is the device of any one of claims 1 to 4, further comprising display position adjusting means for adjusting the trainee voice visualization display position and the instructor voice visualization display position to arbitrary positions in the display area.
According to the invention of claim 5, because the display position adjusting means can adjust the trainee voice visualization display position and the instructor voice visualization display position to arbitrary positions in the display area, those positions can be set so that they do not obscure the image of the trainee or the image of the instructor in the display area.
As a result, the trainee can view the trainee language visualization characters at the trainee voice visualization display position and the instructor language visualization characters at the instructor voice visualization display position separately, apart from the images of the trainee and the instructor.
This makes it easy both to compare the trainee language visualization characters with the instructor language visualization characters and to compare the image of the trainee with the image of the instructor.
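
To illustrate positioning characters so they do not obscure a person's image, the sketch below tries a few candidate label positions around a face rectangle and returns the first one that stays inside the display area without overlapping the face. The `Rect` type, the candidate order and the function name `place_label` are assumptions; the text only says the positions can be adjusted freely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int

    def overlaps(self, other):
        """True if the two rectangles share any interior area."""
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)

    def contains(self, other):
        """True if `other` lies entirely inside this rectangle."""
        return (self.x <= other.x and self.y <= other.y
                and other.x + other.w <= self.x + self.w
                and other.y + other.h <= self.y + self.h)


def place_label(face, label_w, label_h, area):
    """Return a label rectangle inside `area` that does not cover `face`, or None."""
    candidates = [
        Rect(face.x, face.y - label_h, label_w, label_h),  # above the face
        Rect(face.x, face.y + face.h, label_w, label_h),   # below the face
        Rect(face.x + face.w, face.y, label_w, label_h),   # right of the face
        Rect(face.x - label_w, face.y, label_w, label_h),  # left of the face
    ]
    for rect in candidates:
        if area.contains(rect) and not rect.overlaps(face):
            return rect
    return None
```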

The invention of claim 6 is the device of any one of claims 1 to 5, wherein the trainee is a hearing-impaired person.
According to the invention of claim 6, the display area can show the captured image including the image of the hearing-impaired person and the image of the instructor and, at the same time, show within that captured image the characters visualizing the language uttered by the hearing-impaired person and the instructor language visualization characters.
The hearing-impaired person can therefore compare the characters visualizing his or her own uttered language with the instructor language visualization characters shown in the display area and, on seeing that they differ, recognize that the language he or she uttered differs from the language the instructor uttered.
In addition, while recognizing this difference, the hearing-impaired person can directly check the lip shape, movement and the like shown in the instructor's image in the display area, and match his or her own lip shape and movement to the instructor's.
The hearing-impaired person can thus efficiently carry out training that corrects his or her own uttered language so that it approaches the language uttered by the instructor.

According to the pronunciation training device of the present invention, the display control means performs control to display the trainee language visualization characters at the trainee voice visualization display position in the captured image that includes the image of the trainee and the image of the instructor, and to display the instructor language visualization characters at the instructor voice visualization display position in that captured image.
As a result, the display area can show the captured image including the images of the trainee and the instructor and, at the same time, show the trainee language visualization characters and the instructor language visualization characters within that captured image.
The trainee can therefore compare the trainee language visualization characters with the instructor language visualization characters shown in the display area and, on seeing that they differ, recognize that the language he or she uttered differs from the language the instructor uttered.
In addition, while recognizing this difference, the trainee can directly check the lip shape, movement and the like shown in the instructor's image in the display area, and match his or her own lip shape and movement to the instructor's.
The trainee can thus efficiently carry out training that corrects his or her own uttered language so that it approaches the language uttered by the instructor.

FIG. 1 is a schematic configuration diagram of a pronunciation training device according to an embodiment of the present invention.
FIG. 2 is a schematic block diagram of the personal computer that forms part of the pronunciation training device.
FIG. 3 is a flowchart of the processing executed by the pronunciation training device.
FIG. 4 is an explanatory diagram of the sound source position specifying process executed by the pronunciation training device.
FIG. 5 is an explanatory diagram of the display image coordinate conversion process executed by the pronunciation training device.
FIG. 6 is a diagram showing an image displayed on the display during pronunciation training with the pronunciation training device.
FIG. 7 is a diagram showing an image displayed on the display when the sound pressure level of the hearing-impaired person's voice is lower than that of the instructor's voice.

Embodiment 1 of the present invention will be described with reference to FIGS. 1 to 7. FIG. 1 is a schematic configuration diagram of the pronunciation training device 1 of this embodiment. The pronunciation training device 1 includes a measurement unit 10, an amplifier 20, a low-pass filter 30, an A/D converter 40, a personal computer 50, a video input/output unit 60, and a large display 70.

As shown in FIG. 1, the measurement unit 10 includes support members 15A to 15C, a base 16, a CCD camera 17, a microphone support 18, and microphones M1 to M5. The base 16 is placed on top of the support members 15A to 15C. The microphone support 18 is held above the base 16 by a camera support member, to which the CCD camera 17 is fixed. The microphones M1 to M5 are attached to the microphone support 18.

In the measurement unit 10, the microphones M1 and M4 form one pair of microphones, and the horizontal interval between them is kept at a predetermined distance. The microphones M2 and M3 form another pair, and the horizontal interval between them is likewise kept at a predetermined distance. The pair M1, M4 is arranged orthogonally to the pair M2, M3, and the two pairs together form a microphone group. The microphone M5 is arranged so as to protrude above the microphones M1 to M4. The opposing interval between the microphone M5 and each of the microphones M1 to M4 is kept at the same distance as the horizontal interval between M1 and M4 and the horizontal interval between M2 and M3. The microphone M5 is an example of the opposed microphone of the present invention, and the microphones M1 to M5 are an example of the sound collecting means of the present invention.
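
As a worked example of this geometry, suppose (as an assumption not stated in the text) that the two pairs cross at their midpoints and that M5 sits on the vertical axis through the crossing point. Writing d for the common horizontal interval and h for the height of M5 above the plane of M1 to M4, the requirement that M5 be at distance d from each of M1 to M4 fixes h:

```latex
\sqrt{\left(\tfrac{d}{2}\right)^{2} + h^{2}} = d
\quad\Longrightarrow\quad
h = \tfrac{\sqrt{3}}{2}\, d \approx 0.866\, d
```

Under that assumption, M5 forms an equilateral triangle of side d with each of the pairs M1, M4 and M2, M3.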

Each of the microphones M1 to M5 is connected to the amplifier 20, which amplifies the sound pressure signals sent from the microphones. The amplifier 20 is connected to the low-pass filter 30, which limits the frequency band and passes only audio signals at or below a predetermined frequency (here, 4500 Hz). The low-pass filter 30 is connected to the A/D converter 40, which converts the sound pressure signal (an analog signal) into a digital signal. The digital signal is sent to the personal computer 50.
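
The effect of band-limiting before digitization can be illustrated in software. In the device the low-pass filter is hardware placed ahead of the A/D converter; the sketch below is only a digital stand-in, a first-order filter with the 4500 Hz cutoff from the text. The sampling rate and the names `minimum_sample_rate` and `one_pole_lowpass` are assumptions.

```python
import math

CUTOFF_HZ = 4500.0        # cutoff stated in the text
SAMPLE_RATE_HZ = 44100.0  # assumed; the text does not give the A/D sampling rate


def minimum_sample_rate(cutoff_hz):
    """Nyquist bound: sampling must exceed twice the highest passed frequency."""
    return 2.0 * cutoff_hz


def one_pole_lowpass(samples, cutoff_hz=CUTOFF_HZ, sample_rate_hz=SAMPLE_RATE_HZ):
    """First-order (RC-style) low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1])."""
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    a = dt / (rc + dt)
    y, out = 0.0, []
    for x in samples:
        y += a * (x - y)
        out.append(y)
    return out
```

With a 4500 Hz cutoff, the Nyquist bound says the converter would need to sample faster than 9000 Hz for the passed band to be represented without aliasing.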

The CCD camera 17 is connected to the video input/output unit 60. The video input/output unit 60 converts the imaging signal (an analog signal) sent from the CCD camera 17 into a digital signal and sends that digital imaging signal to the personal computer 50.

The personal computer 50 is connected to the large display 70. Reference numeral 71 denotes the display area of the large display 70. The large display 70 is an example of the display means of the present invention.

FIG. 2 is a schematic block diagram of the personal computer 50. The personal computer 50 includes a keyboard 51, an arithmetic processing unit 52, and a storage unit 53.

The keyboard 51 is connected to the arithmetic processing unit 52. It is used to enter the number of microphones, the interval between the microphones M1 and M4, the interval between the microphones M2 and M3, the interval between the microphone M5 and each of the microphones M1 to M4, the setting of the frequency passed by the low-pass filter 30, and similar values.

The arithmetic processing unit 52 is connected to the storage unit 53, a speaker 54, and the large display 70. The storage unit 53 includes a digital signal (sound pressure signal) arithmetic processing program storage unit 53A, a pronunciation training data processing program storage unit 53B, an image display control program storage unit 53C, and a data storage unit 53D. The storage unit 53A stores programs that execute the frequency analysis process (S5), the sound source position specifying process (S6), the display image coordinate conversion process (S8), and so on, all described later. The storage unit 53B stores programs that execute the pronunciation training language selection process (S3), the language recognition process (S7), the pronunciation training language display position adjustment process (S9), the character conversion process (S10), and the pronunciation training correctness determination process (S11), all described later. The storage unit 53C stores, among others, a program that executes the pronunciation training image display process (S12), described later. The data storage unit 53D stores reference data of mel-frequency cepstral coefficients (MFCC) obtained by collecting and analyzing a large number of speech samples.
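
The assignment of processing steps to program storage units can be summarized as a lookup table. The step labels and names below come from the text; steps the text does not name here are left out rather than guessed, and the dictionary layout and the helper name `storage_unit_for` are illustrative assumptions.

```python
# Steps per program storage unit, as listed in the text. The lists for 53A and 53C
# are given there as non-exhaustive, so this table is non-exhaustive too.
PROGRAM_STORE = {
    "53A": {
        "S5": "frequency analysis",
        "S6": "sound source position specification",
        "S8": "display image coordinate conversion",
    },
    "53B": {
        "S3": "pronunciation training language selection",
        "S7": "language recognition",
        "S9": "pronunciation training language display position adjustment",
        "S10": "character conversion",
        "S11": "pronunciation training correctness determination",
    },
    "53C": {
        "S12": "pronunciation training image display",
    },
}


def storage_unit_for(step):
    """Return the storage unit label that holds the program for `step`."""
    for unit, steps in PROGRAM_STORE.items():
        if step in steps:
            return unit
    raise KeyError(step)
```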

As described later, the arithmetic processing unit 52 generates an image signal of the display image to be shown in the display area 71, a signal relating to the character image data, and audio signals relating to the voices uttered by the instructor and the hearing-impaired person. It generates these from the image data of the captured images of the pronunciation instructor and the hearing-impaired person taken by the CCD camera 17 and from character image data obtained by converting the voices uttered by the instructor and the hearing-impaired person into characters. The arithmetic processing unit 52 then transmits the generated image signal and the like to the large display 70 and displays, in the display area 71, character images corresponding to the voices uttered by the instructor and the hearing-impaired person, superimposed on the captured image. In addition, the arithmetic processing unit 52 outputs the audio signals to the speaker 54, and the speaker 54 plays back as sound the language uttered by the instructor and the hearing-impaired person.

Next, the process by which the arithmetic processing unit 52 displays, in the display area 71, the captured images of the instructor and the hearing-impaired person together with character images obtained by converting their uttered voices into characters will be described. The hearing-impaired person is an example of the trainee of the present invention. When the pronunciation training device 1 is powered on, the arithmetic processing unit 52 executes an initial setting process (S1), as shown in FIG. 3. In the initial setting process (S1), the values entered from the keyboard 51 are stored in the data storage unit 53D: the number of microphones (five here), the spacing between the microphones M1 and M4, the spacing between the microphones M2 and M3, the spacing between the microphone M5 and each of the microphones M1 to M4, and the set value of the frequency passed by the low-pass filter 30.

After the initial setting process (S1), the arithmetic processing unit 52 executes an initial screen display process (S2). In the initial screen display process (S2), the program stored in the image display control program storage unit 53C is run to display, as the initial image in the display area 71, a selection menu screen of languages for utterance training. The instructor can select any one of the utterance training languages shown on the selection menu screen. The selection menu screen also shows a training language display position adjustment button, described later.

After the initial screen display process (S2), the arithmetic processing unit 52 executes a pronunciation training language selection process (S3). In this process, it is determined whether the instructor has selected a language for utterance training from the selection menu screen. Specifically, the arithmetic processing unit 52 determines whether data of the pronunciation training language selected by the instructor (pronunciation training language data) is stored in the data storage unit 53D. This determination is repeated until it is judged that a pronunciation training language has been selected. The process then generates character image data (for example, "Please pronounce 'hi' (ひ).") according to the selected pronunciation training language data and stores that character image data in the data storage unit 53D.

After the pronunciation training language selection process (S3), the arithmetic processing unit 52 executes an input signal acquisition process (S4). This process acquires the sound pressure levels (sound pressure signals) of the voice uttered by the instructor (instructor voice) and the voice uttered by the hearing-impaired person (hearing-impaired person voice), an audio signal of the instructor voice (for example, "hi" (ひ) in FIG. 6), an audio signal of the hearing-impaired person voice (for example, "i" (い) in FIG. 6), and the imaging signal described above. Here, the sound pressure signals and audio signals detected by the microphones M1 to M5, together with the imaging signal, are input to the arithmetic processing unit 52 as digital signals, as shown in FIG. 2. The arithmetic processing unit 52 then stores the sound pressure signals (sound pressure level data), the audio signals (audio data), and the imaging signal (imaging data) in the data storage unit 53D.

After the input signal acquisition process (S4), the arithmetic processing unit 52 executes a frequency analysis process (S5). Using the program (a Fourier transform processing program) stored in the digital signal (sound pressure signal) arithmetic processing program storage unit 53A, this process analyzes the sound pressure level of the sound pressure signals acquired in the input signal acquisition process (S4) and obtains their frequency spectrum distribution. MFCCs are then calculated from that spectrum distribution. In this example, twelve MFCCs are calculated for each of the instructor voice and the hearing-impaired person voice from their respective frequency spectrum distributions. The frequency analysis process (S5) also stores the MFCC data in the data storage unit 53D.
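
The frequency analysis step (S5) can be illustrated with a minimal sketch that turns one short audio frame into twelve MFCCs. The source states only that a Fourier transform is applied and that twelve MFCCs are obtained per voice; everything else here is an assumption made for illustration: the Hamming window, the 20 triangular mel filters, the DCT-II, the omission of coefficient 0, and the function names `power_spectrum`, `mel_filterbank`, and `mfcc`.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale (assumed design)."""
    top = hz_to_mel(sample_rate / 2.0)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):   # rising edge of the triangle
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):   # falling edge of the triangle
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, n_coeffs=12, n_filters=20):
    """Return n_coeffs MFCCs for one frame (coefficient 0 is skipped)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = power_spectrum(windowed)
    log_energies = []
    for row in mel_filterbank(n_filters, n, sample_rate):
        energy = sum(w * p for w, p in zip(row, spec))
        log_energies.append(math.log(max(energy, 1e-10)))  # avoid log(0)
    # DCT-II of the log filterbank energies.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energies))
            for c in range(1, n_coeffs + 1)]
```

A real system would normally use an FFT library and average over many frames; the naive DFT is used here only to keep the sketch dependency-free.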

After the frequency analysis process (S5), the arithmetic processing unit 52 executes a sound source position specifying process (S6). Using a program stored in the digital signal (sound pressure signal) arithmetic processing program storage unit 53A, this process calculates by the hyperbolic method the horizontal angle θ (see FIG. 4) between the microphone group origin position O (see FIG. 4) and the positions P1 and P2 (see FIG. 4). The data of the horizontal angle θ is stored in the data storage unit 53D. The microphone group origin position O is the point where the straight line connecting the microphones M1 and M4 intersects the straight line connecting the microphones M2 and M3. The position P1 is the position from which the instructor voice is uttered (instructor voice generation position), and the position P2 is the position from which the hearing-impaired person voice is uttered (hearing-impaired person voice generation position). In the figure, L1 is the horizontal distance from the position O1, which directly faces the microphone group origin position O, to the positions P1 and P2, and L2 is the distance from the microphone group origin position O to the position O1. L4 is the distance from the position O1 to the intersection O2 of a vertical line extending downward from the position O1 and a horizontal line extending leftward from the positions P1 and P2.

The value of the horizontal angle θ depends on the distance between the microphones M1 and M4, the distance between the microphones M2 and M3, the difference in the times at which the instructor voice from the position P1 or the hearing-impaired person voice from the position P2 reaches the pair of microphones M1 and M4, and the difference in the times at which that voice reaches the other pair of microphones M2 and M3. The horizontal angle θ is calculated using equation (1) below, where DX is the arrival time difference of the instructor voice or the hearing-impaired person voice at the pair of microphones M1 and M4, and DY is the arrival time difference of that voice at the other pair of microphones M2 and M3.
$$\theta = \tan^{-1}\left(\frac{DY}{DX}\right) \tag{1}$$
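
As a small illustration of equation (1), the following sketch computes θ from the two arrival time differences. The function name `horizontal_angle` is hypothetical, and the use of `math.atan2` instead of a plain arctangent of DY/DX is an assumption made so that DX = 0 does not raise a division error; the two agree whenever DX is positive.

```python
import math


def horizontal_angle(dx, dy):
    """Equation (1): theta = arctan(DY / DX), in radians.

    dx: arrival time difference at the microphone pair M1, M4 (DX)
    dy: arrival time difference at the microphone pair M2, M3 (DY)
    atan2 is an assumed substitute that also handles dx == 0.
    """
    return math.atan2(dy, dx)
```

For example, equal time differences on both pairs (DX = DY > 0) give θ = 45 degrees.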

In the sound source position specifying process (S6), the microphone M5 is used in addition to the microphones M1 to M4, and equation (2) below is used to calculate the elevation angle φ (see FIG. 4) at which the microphone group origin position O is seen when looking up from the positions P1 and P2. The data of the elevation angle φ is stored in the data storage unit 53D. Here, DZ1 is the arrival time difference of the instructor voice or the hearing-impaired person voice at the microphones M5 and M1, DZ2 is the arrival time difference at the microphones M5 and M2, DZ3 is the arrival time difference at the microphones M5 and M3, and DZ4 is the arrival time difference at the microphones M5 and M4.
$$\varphi = \tan^{-1}\left\{\frac{DZ1 + DZ2 + DZ3 + DZ4}{2 \times \sqrt{3} \times \left(\sqrt{DX^{2}} + \sqrt{DY^{2}}\right)}\right\} \tag{2}$$
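
Equation (2) can be evaluated in the same way. The sketch below follows the denominator exactly as it is printed, that is √(DX²) + √(DY²); the printed text does not make the grouping under the radical fully clear, so this reading is an assumption and should be checked against the original drawings. The function name `elevation_angle` is hypothetical, and `math.atan2` is again used only to avoid a division by zero.

```python
import math


def elevation_angle(dx, dy, dz):
    """Equation (2) as printed, in radians.

    dx, dy: arrival time differences DX and DY used in equation (1)
    dz:     the four differences (DZ1, DZ2, DZ3, DZ4) between M5 and M1..M4
    The denominator grouping follows the printed form (an assumption).
    """
    denominator = 2.0 * math.sqrt(3.0) * (math.sqrt(dx ** 2) + math.sqrt(dy ** 2))
    return math.atan2(sum(dz), denominator)
```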

In the sound source position specifying process (S6), the hyperbolic method makes it possible to calculate, as the positions P1 and P2, the points at which the following hyperbolic loci intersect: the locus passing between the microphones M1 and M4, the locus passing between the microphones M2 and M3, and the loci passing between the microphone M5 and each of the microphones M1 to M4. The arithmetic processing unit 52 is an example of the position specifying means of the present invention.
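
The geometric idea behind the hyperbolic method can be sketched as follows: for one microphone pair, every point whose path-length difference to the two microphones equals the speed of sound times the measured arrival time difference lies on one hyperbolic locus, and the source is where the loci of several pairs meet. The speed of sound value, the coordinates, and the function names are assumptions for illustration; the source does not give the solver it uses.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def arrival_time_difference(source, mic_a, mic_b, c=SPEED_OF_SOUND_M_S):
    """Arrival time at mic_a minus arrival time at mic_b, in seconds."""
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / c


def hyperbola_residual(point, mic_a, mic_b, dt, c=SPEED_OF_SOUND_M_S):
    """Zero when `point` lies on the locus defined by the pair and dt."""
    return (math.dist(point, mic_a) - math.dist(point, mic_b)) - c * dt
```

A position estimate would then be a point at which the residuals of all microphone pairs are simultaneously close to zero, found for instance by a grid search or least-squares fit.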

After the sound source position specifying process (S6), the arithmetic processing unit 52 executes a language recognition process (S7). In this process, a language recognition program is read from the pronunciation training data processing program storage unit 53B. The language recognition program extracts words using a Japanese recognition function. The process then uses the language recognition program to extract the Japanese (here, "hi" (ひ) and "i" (い)) from the audio signals of the instructor voice ("hi" in FIG. 6) and the hearing-impaired person voice ("i" in FIG. 6) acquired in the input signal acquisition process (S4). The arithmetic processing unit 52 is an example of the language recognition means of the present invention. The Japanese "hi" is an example of the instructor's uttered language of the present invention, and the Japanese "i" is an example of the trainee's uttered language of the present invention.

After the language recognition process (S7), the arithmetic processing unit 52 executes a display image coordinate conversion process (S8). This process correlates the position P1 (instructor voice generation position) and the position P2 (hearing-impaired person voice generation position) with display positions in the display area 71. As shown in FIG. 5, the horizontal direction X of the display area 71 is associated with the horizontal angle θ and the vertical direction of the display area 71 is associated with the elevation angle φ, and the display positions in the display area 71 corresponding to the positions P1 and P2 are calculated.

The arithmetic processing unit 52 then calculates a corrected horizontal angle by adding an arbitrary horizontal angle to the horizontal angle θ calculated in the sound source position specifying process (S6), and a corrected elevation angle by adding an arbitrary elevation angle to the elevation angle φ calculated in that process. This allows the display position P5 (see FIG. 6) of the display area 71 corresponding to the corrected horizontal angle and the display position P6 (see FIG. 6) of the display area 71 corresponding to the corrected elevation angle to be adjusted to positions that do not cover the image of the mouth of the instructor 75 (see FIG. 6) or the image of the mouth of the hearing-impaired person 76 (see FIG. 6).

The display image coordinate conversion process (S8) then stores the data of the display positions P5 and P6 in the data storage unit 53D. The arithmetic processing unit 52 is an example of the correlation means of the present invention. The display position P5 is an example of the instructor voice visualization display position of the present invention, and the display position P6 is an example of the trainee voice visualization display position of the present invention.
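
The coordinate conversion of S8, including the added correction angles, might look like the sketch below. The source says only that the horizontal direction of the display area is associated with θ and the vertical direction with φ; the linear mapping, the camera field-of-view parameters, the convention that θ = 0 and φ = 0 fall at the screen centre, and the name `angles_to_display` are all assumptions.

```python
import math


def angles_to_display(theta, phi, width_px, height_px, h_fov, v_fov,
                      theta_offset=0.0, phi_offset=0.0):
    """Map (theta, phi) in radians to a pixel position in the display area.

    theta_offset / phi_offset are the arbitrary correction angles added so
    that the character image does not cover the speaker's mouth.
    h_fov / v_fov are assumed horizontal and vertical fields of view.
    """
    theta_c = theta + theta_offset          # corrected horizontal angle
    phi_c = phi + phi_offset                # corrected elevation angle
    x = (theta_c / h_fov + 0.5) * width_px
    y = (0.5 - phi_c / v_fov) * height_px   # screen y grows downward
    x = min(max(x, 0.0), width_px - 1.0)    # keep the position on screen
    y = min(max(y, 0.0), height_px - 1.0)
    return round(x), round(y)
```

With a 1920 x 1080 display and an assumed 60 by 40 degree field of view, a correction of +10 degrees in elevation moves the character image a quarter of the screen height upward.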

After the display image coordinate conversion process (S8), the arithmetic processing unit 52 executes a pronunciation training language display position adjustment process (S9). This process determines whether adjustment data (horizontal angle θ and elevation angle φ) for the display positions P5 and P6 is stored in the data storage unit 53D. The adjustment data is entered with the training language display position adjustment button described above: the instructor or the hearing-impaired person specifies the horizontal angle θ and the elevation angle φ and can set them to arbitrary values. As shown in FIG. 6, for example, the display positions P5 and P6 corresponding to the arbitrarily set values can thereby be adjusted to positions that do not cover the image of the instructor 75 or the image of the hearing-impaired person 76. The arithmetic processing unit 52 and the training language display position adjustment button are an example of the display position adjustment means of the present invention.

If the pronunciation training language display position adjustment process (S9) then determines that adjustment data is stored in the data storage unit 53D, it overwrites the data of the display positions P5 and P6 stored in the data storage unit 53D by the display image coordinate conversion process (S8) with the adjustment data.

If, on the other hand, the pronunciation training language display position adjustment process (S9) determines that no adjustment data is stored in the data storage unit 53D, it keeps the data of the display positions P5 and P6 stored in the data storage unit 53D by the display image coordinate conversion process (S8).
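
The two branches of S9 amount to "use the manual adjustment if one is stored, otherwise keep the computed value". A minimal sketch, in which the dictionary layout, the (θ, φ) tuples in degrees, and the name `resolve_display_positions` are assumptions, and in which allowing an adjustment for only one of the two positions is an extension not stated in the source:

```python
def resolve_display_positions(computed, adjustment=None):
    """Return the display position data to use after S9.

    computed:   positions stored by S8, e.g. {"P5": (theta, phi), "P6": ...}
    adjustment: manually entered data with the same keys, or None if no
                adjustment data is stored
    """
    resolved = dict(computed)        # keep the S8 result by default
    if adjustment is not None:
        resolved.update(adjustment)  # overwrite with the adjustment data
    return resolved
```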

After the pronunciation training language display position adjustment process (S9), the arithmetic processing unit 52 executes a character conversion process (S10). In this process, a character conversion program is read from the pronunciation training data processing program storage unit 53B. The character conversion program converts the words extracted by the language recognition process (S7) into characters. Here, the words "hi" (ひ) and "i" (い) extracted by the language recognition process (S7) are converted into the character "hi" and the character "i". As shown in FIG. 6, described later, the characters "hi" and "i" are visualized by being displayed in the display area 71, so that the instructor 75 and the hearing-impaired person 76 can confirm them by sight. The image data of the characters "hi" and "i" is stored in the data storage unit 53D. The arithmetic processing unit 52 is an example of the character conversion means of the present invention. The character "hi" is an example of the character visualizing the instructor's uttered language of the present invention, and the character "i" is an example of the character visualizing the trainee's uttered language of the present invention.

The character conversion process (S10) further reads from the data storage unit 53D the sound pressure level data acquired in the input signal acquisition process (S4) and compares it with preset threshold data. According to the result of comparing the sound pressure level with the threshold data, character image data is generated in which the size of each of the characters "hi" (ひ) and "i" (い) is set to large, medium, or small. The character image data set to the large, medium, or small size is stored in the data storage unit 53D. The sound pressure level of the hearing-impaired person voice is an example of the volume of the hearing-impaired person voice of the present invention, and the sound pressure level of the instructor voice is an example of the volume of the instructor voice of the present invention.
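
The size selection in S10 is a threshold comparison. The sketch below assumes two thresholds expressed in decibels; the source does not state the number of thresholds, their values, or their unit, so the defaults and the name `character_size` are purely illustrative.

```python
def character_size(sound_pressure_db, small_max_db=50.0, medium_max_db=65.0):
    """Choose the displayed character size from the sound pressure level.

    The two threshold values are hypothetical placeholders.
    """
    if sound_pressure_db < small_max_db:
        return "small"
    if sound_pressure_db < medium_max_db:
        return "medium"
    return "large"
```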

After the character conversion process (S10), the arithmetic processing unit 52 executes a pronunciation training correctness determination process (S11). This process reads from the data storage unit 53D the MFCC reference data and the MFCC data calculated in the frequency analysis process (S5). It then determines the MFCC reference distribution ranges from the reference data. Next, it determines the MFCC distribution range of the instructor voice from the twelve MFCCs described above, and the MFCC distribution range of the hearing-impaired person voice from another twelve MFCCs different from those twelve. The process then compares the MFCC distribution range of the instructor voice and the MFCC distribution range of the hearing-impaired person voice with the MFCC reference distribution ranges, and selects an instructor voice reference distribution range that contains the MFCC distribution range of the instructor voice and a hearing-impaired person voice reference distribution range that contains the MFCC distribution range of the hearing-impaired person voice. Finally, it determines whether the hearing-impaired person voice reference distribution range is included in the instructor voice reference distribution range.

When the pronunciation training correctness determination process (S11) determines that the hearing-impaired person voice reference distribution range is included in the instructor voice reference distribution range, it makes a correct-answer determination: the instructor voice (for example, "hi" (ひ)) and the hearing-impaired person voice ("hi") match, and the hearing-impaired person is uttering correctly in line with the instructor. According to the result of the correct-answer determination, the process generates character image data for displaying the determination result (for example, "Pronounced correctly.") and stores that character image data in the data storage unit 53D.

When, on the other hand, the process determines that the hearing-impaired person voice reference distribution range is not included in the instructor voice reference distribution range, it makes an incorrect-answer determination: the instructor voice (for example, "hi" (ひ)) and the hearing-impaired person voice (for example, "i" (い)) do not match, and the hearing-impaired person is uttering incorrectly, differently from the instructor. According to the result of the incorrect-answer determination, the pronunciation training correctness determination process (S11) generates character image data for displaying the determination result (for example, "Pronounced incorrectly.") and stores that character image data in the data storage unit 53D. The arithmetic processing unit 52 is an example of the determination means of the present invention.
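
One possible reading of the S11 decision is sketched below. The source does not define how a "distribution range" is represented, so this sketch assumes a per-coefficient (min, max) interval, assumes the reference data is a dictionary of such ranges keyed by sound, and picks the first reference range that contains a voice's range. All function names are hypothetical.

```python
def distribution_range(mfcc_vectors):
    """Per-coefficient (min, max) over one or more MFCC vectors."""
    return [(min(col), max(col)) for col in zip(*mfcc_vectors)]


def contains(outer, inner):
    """True if every interval of `inner` lies inside that of `outer`."""
    return all(o_lo <= i_lo and i_hi <= o_hi
               for (o_lo, o_hi), (i_lo, i_hi) in zip(outer, inner))


def select_reference(reference_ranges, voice_range):
    """Label of the first reference range containing voice_range, or None."""
    for label, ref in reference_ranges.items():
        if contains(ref, voice_range):
            return label
    return None


def is_correct(reference_ranges, instructor_vectors, trainee_vectors):
    """Correct if the trainee's reference range lies in the instructor's."""
    instructor_ref = select_reference(
        reference_ranges, distribution_range(instructor_vectors))
    trainee_ref = select_reference(
        reference_ranges, distribution_range(trainee_vectors))
    if instructor_ref is None or trainee_ref is None:
        return False  # assumed handling when no reference range matches
    return contains(reference_ranges[instructor_ref],
                    reference_ranges[trainee_ref])
```

The toy data in practice would have twelve coefficients per vector; two are enough to exercise the logic.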

After the pronunciation training correctness determination process (S11), the arithmetic processing unit 52 executes a pronunciation training image display process (S12). Using the program stored in the image display control program storage unit 53C, this process displays images in which the instructor voice and the hearing-impaired person voice have each been converted into characters, at the display positions determined by the display image coordinate conversion process (S8) and the pronunciation training language display position adjustment process (S9). The display is based on the image data of the captured image corresponding to the imaging signal acquired in the input signal acquisition process (S4), the character image data generated in the character conversion process (S10), and the like. The arithmetic processing unit 52 is an example of the display control means of the present invention.

FIG. 6 shows an example of the display image for the case where the instructor 75 has selected "hi" (ひ) as the pronunciation training word on the selection menu screen described above and the hearing-impaired person 76 utters incorrectly, differently from the instructor 75. When, at the position P1 (see FIG. 4), the instructor 75 utters "hi" and the hearing-impaired person 76, although intending to say "hi", actually says "i" (い), the pronunciation training image display process (S12) proceeds as follows. It reads from the data storage unit 53D the character image data stored there by the pronunciation training language selection process (S3) ("Please pronounce 'hi'."), the character image data stored there by the character conversion process (S10) (the character image data of "hi" and "i"), the character image data stored there by the pronunciation training correctness determination process (S11) ("Pronounced incorrectly."), and the like. It also reads the captured image data and the like stored in the data storage unit 53D by the input signal acquisition process (S4).

As shown in FIG. 6, the pronunciation training image display process (S12) displays the character image "Please pronounce 'hi' (ひ)." superimposed on the captured image (the instructor 75 and the hearing-impaired person 76). Then, based on the data stored in the data storage unit 53D by the display image coordinate conversion process (S8) and the pronunciation training language display position adjustment process (S9), the character image "hi" is displayed inside a circled area at the display position P5, and the character image "i" (い) is displayed inside a circled area at the display position P6. The example shown is one in which the sound pressure level of the hearing-impaired person voice matches that of the instructor voice, so the character image "i" and the character image "hi" are the same size.

As illustrated, the character image "Pronounced incorrectly." is then displayed in the display area 71. Through the image in the display area 71, the hearing-impaired person 76 is thereby informed that his or her voice differs from the voice of the instructor 75 and that the character images at the display positions P5 and P6 differ from each other. While looking at the image in the display area 71, the hearing-impaired person 76 directly checks the shape and movement of the lips of the instructor 75 and, while looking at the character images at the display positions P5 and P6 and the character image showing the determination result ("Pronounced incorrectly." and the like), practices bringing his or her own utterance closer to the language uttered by the instructor 75 (here, "hi" (ひ)). The arithmetic processing unit 52 is an example of the determination result display control means of the present invention.

On the other hand, when the language uttered by the hearing-impaired person 76 matches the language uttered by the instructor 75 (“hi”) and the pronunciation training correct/incorrect determination process (S11) makes a correct-answer determination, the character image “hi” is displayed at the display position P6, and the character image “Pronounced correctly.” is displayed in place of the character image “Pronounced incorrectly.” in FIG. 6. This informs the hearing-impaired person 76 that the language he or she utters matches the language uttered by the instructor 75.
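The choice between the two result messages can be sketched as follows. This is a minimal illustration, not the device's implementation: it assumes the recognized languages are already available as text strings, and the function name `judge_pronunciation` and the English message constants are hypothetical.

```python
# Hypothetical sketch of the result-message selection made after the
# correct/incorrect determination (S11). The recognized utterances are
# assumed to be plain strings; the names below are illustrative only.

CORRECT_MESSAGE = "Pronounced correctly."
INCORRECT_MESSAGE = "Pronounced incorrectly."


def judge_pronunciation(instructor_text: str, trainee_text: str) -> tuple[bool, str]:
    """Return (is_correct, message) for one instructor/trainee utterance pair."""
    # Assumption: surrounding whitespace is not significant for the comparison.
    is_correct = instructor_text.strip() == trainee_text.strip()
    message = CORRECT_MESSAGE if is_correct else INCORRECT_MESSAGE
    return is_correct, message


# Example corresponding to FIG. 6: the instructor says "hi", the trainee says "i".
result = judge_pronunciation("ひ", "い")
```

In this sketch the message shown in the display area 71 is simply the second element of the returned pair.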

When the sound pressure level of the hearing-impaired person's voice does not match that of the instructor's voice, the character image “hi” at the display position P5 and the character image “i” at the display position P6 are displayed in different sizes, as shown in FIG. 7, based on the character image data (the character image data for “hi” and “i”) stored in the data storage unit 53D. In the illustrated example, the sound pressure level of the hearing-impaired person's voice is lower than that of the instructor's voice, so the character image “i” is displayed smaller than the character image “hi”.

After the pronunciation training image display process (S12), the arithmetic processing unit 52 determines whether to continue the pronunciation training (S13). Here, the arithmetic processing unit 52 determines whether the pronunciation training device 1 is powered on or off. If it determines in S13 that the power is on and the training is to continue, the processes described above (S2 to S12) are repeated. If it determines in S13 that the power is off and the training is to end, the processes described above (S2 to S12) are terminated.

<Effects of this embodiment>
In the pronunciation training device 1 of this embodiment, the pronunciation training image display process (S12) displays the images of the hearing-impaired person 76 and the instructor 75 in the display area 71 and, at the same time, displays the character image “i” at the display position P6 and the character image “hi” at the display position P5 within the display area 71.
Therefore, by comparing the character image “hi” with the character image “i” displayed in the display area 71 and confirming that the hearing-impaired person's voice (“i”) differs from the instructor's voice (“hi”), the hearing-impaired person 76 can recognize that the language he or she uttered differs from the language uttered by the instructor 75.
In addition, at the same time as recognizing that difference, the hearing-impaired person 76 can match the shape, movement, and the like of his or her own lips to those of the instructor 75 while directly checking the lip shape and movement shown in the image of the instructor 75 displayed in the display area 71.
Accordingly, the hearing-impaired person 76 can efficiently train to correct the language he or she utters so that it approaches the language uttered by the instructor 75.

In addition, the sound source position specifying process (S6) calculates the positions P1 and P2 by the hyperbolic method. It uses the time difference with which the instructor's voice at the position P1 and the hearing-impaired person's voice at the position P2 each reach the pair of microphones M1 and M4, the time difference with which they reach the other pair of microphones M2 and M3, and the time differences with which they reach the microphone M5 and each of the microphones M1 to M4.
Therefore, even when the position P1 at which the instructor's voice is uttered and the position P2 at which the hearing-impaired person's voice is uttered change successively, the hyperbolic method can successively calculate the hyperbolic locus passing between the microphones M1 and M4, the hyperbolic locus passing between the microphones M2 and M3, and the hyperbolic loci passing between the microphone M5 and each of the microphones M1 to M4.
The points at which these hyperbolic loci intersect can then be calculated as the position P1 at which the instructor's voice is uttered and the position P2 at which the hearing-impaired person's voice is uttered.
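The following is a simplified two-dimensional sketch of locating a sound source from arrival-time differences. It is not the device's actual computation. The microphone coordinates, the speed of sound, and all function names are assumptions. Instead of intersecting the hyperbolic loci explicitly, it uses a standard algebraic linearization, taking one microphone (here M5) as the reference and solving for the position and the reference distance by least squares. For noise-free time differences this yields the same intersection point.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def solve_linear(a, b):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate microphone/source geometry")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def locate_source(ref_mic, other_mics, time_differences):
    """Estimate the (x, y) source position.

    time_differences[i] is the arrival time at other_mics[i] minus the
    arrival time at ref_mic, in seconds. With d_i = c * dt_i and r0 the
    unknown distance from the source to ref_mic, each microphone gives
        2 (m_i - m_0) . p + 2 d_i r0 = |m_i|^2 - |m_0|^2 - d_i^2,
    which is linear in the unknowns (x, y, r0).
    """
    rows, rhs = [], []
    for (mx, my), dt in zip(other_mics, time_differences):
        d = SPEED_OF_SOUND * dt
        rows.append([2 * (mx - ref_mic[0]), 2 * (my - ref_mic[1]), 2 * d])
        rhs.append(mx * mx + my * my - ref_mic[0] ** 2 - ref_mic[1] ** 2 - d * d)
    # Least squares via the normal equations (A^T A) u = A^T b.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * v for r, v in zip(rows, rhs)) for i in range(3)]
    x, y, _r0 = solve_linear(ata, atb)
    return x, y


# Hypothetical layout: M1/M4 and M2/M3 form two orthogonal pairs, M5 is the reference.
M5 = (0.0, 0.0)
OTHERS = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5)]  # M1, M4, M2, M3
true_p1 = (1.2, 0.8)  # assumed instructor position, used only to simulate the delays
delays = [(distance(true_p1, m) - distance(true_p1, M5)) / SPEED_OF_SOUND for m in OTHERS]
estimated_p1 = locate_source(M5, OTHERS, delays)
```

Calling `locate_source` again with the delays measured for the second speaker would give the position P2 in the same way. With measured (noisy) delays, the least-squares solution is only an approximation of the intersection point.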

Furthermore, the pronunciation training image display process (S12) displays, in the display area 71, either the character image for a correct-answer determination result (“Pronounced correctly.”) or the character image for an incorrect-answer determination result (“Pronounced incorrectly.”).
Based on the character image displayed in the display area 71 (“Pronounced correctly.” or “Pronounced incorrectly.”), the hearing-impaired person 76 can easily confirm at a glance whether he or she has pronounced correctly, that is, whether the language he or she uttered matches the language uttered by the instructor 75.

In addition, when the sound pressure level of the hearing-impaired person's voice matches that of the instructor's voice, the pronunciation training image display process (S12) makes the character image “i” of the hearing-impaired person's voice at the display position P6 the same size as the character image “hi” of the instructor's voice at the display position P5, as shown in FIG. 6.
On the other hand, when the sound pressure level of the hearing-impaired person's voice is lower than that of the instructor's voice, the pronunciation training image display process (S12) makes the character image “i” of the hearing-impaired person's voice at the display position P6 smaller than the character image “hi” of the instructor's voice at the display position P5, as shown in FIG. 7.
Therefore, by comparing the size of the character image “i” with the size of the character image “hi”, the hearing-impaired person 76 can judge whether the sound pressure level of his or her voice matches that of the instructor's voice.

Furthermore, as described above, the pronunciation training language display position adjustment process (S9) can adjust the display positions P5 and P6 to positions that do not obstruct the image of the instructor 75 or the image of the hearing-impaired person 76, as shown in FIGS. 6 and 7.
As a result, the hearing-impaired person 76 can view the character image “hi” of the instructor's voice at the display position P5 and the character image “i” of the hearing-impaired person's voice at the display position P6 separately from, and at a distance from, the images of the hearing-impaired person 76 and the instructor 75.
This makes it easy both to compare the character image “i” of the hearing-impaired person's voice with the character image “hi” of the instructor's voice and to compare the image of the hearing-impaired person 76 with the image of the instructor 75.
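One way to move a character image so that it does not cover a person's image is sketched below. The description does not state how the adjustment process (S9) chooses the new position, so everything here is an assumption: the rectangle model, the search order (up, right, left, down in growing steps), and the names `Rect` and `adjust_display_position` are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Rect(NamedTuple):
    """Axis-aligned rectangle in display-area pixel coordinates."""
    left: int
    top: int
    width: int
    height: int


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share any interior area."""
    return (a.left < b.left + b.width and b.left < a.left + a.width
            and a.top < b.top + b.height and b.top < a.top + a.height)


def contains(outer: Rect, inner: Rect) -> bool:
    """True if inner lies entirely within outer."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and inner.left + inner.width <= outer.left + outer.width
            and inner.top + inner.height <= outer.top + outer.height)


def adjust_display_position(label: Rect, display: Rect, people: Sequence[Rect],
                            step: int = 10, max_steps: int = 50) -> Optional[Rect]:
    """Shift the label by growing offsets until it covers no person rectangle.

    Returns the first candidate found that stays inside the display area,
    or None if no such position exists within max_steps.
    """
    for k in range(max_steps + 1):
        offset = k * step
        for dx, dy in ((0, -offset), (offset, 0), (-offset, 0), (0, offset)):
            candidate = Rect(label.left + dx, label.top + dy, label.width, label.height)
            if contains(display, candidate) and not any(overlaps(candidate, p) for p in people):
                return candidate
    return None


# Illustrative values only: a 640x480 display area and one person rectangle.
display_area = Rect(0, 0, 640, 480)
instructor_image = Rect(200, 150, 100, 200)
initial_p5 = Rect(230, 200, 60, 60)  # initially on top of the instructor's image
adjusted_p5 = adjust_display_position(initial_p5, display_area, [instructor_image])
```

The same function could be applied to the display position P6 by passing the rectangles of both persons in `people`.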

The present invention is not limited to the embodiment described above and can be implemented with part of its configuration changed as appropriate without departing from the spirit of the invention. The pronunciation training device of this embodiment is not limited to a single hearing-impaired person 76. It may display captured images of two or more hearing-impaired persons 76 and the instructor 75 in the display area 71, so that the two or more hearing-impaired persons 76, by viewing the captured images, carry out pronunciation training while directly checking the shape, movement, and the like of the lips of the instructor 75. This allows two or more hearing-impaired persons 76 to train at the same time, and therefore efficiently.

The pronunciation training device of this embodiment is also not limited to setting the character size at the display positions P5 and P6 to one of three levels (large, medium, and small) based on the sound pressure level data acquired by the input signal acquisition process (S4). Characters such as “hi” and “i” may be displayed at the display positions P5 and P6 in sizes set over three or more levels.
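A stepwise mapping from sound pressure level to character size can be sketched as follows. The threshold values in decibels, the point sizes, and the function name `character_size` are assumptions chosen for illustration; the description does not give concrete values.

```python
import bisect
from typing import Sequence


def character_size(spl_db: float,
                   thresholds: Sequence[float] = (55.0, 70.0),
                   sizes: Sequence[int] = (24, 48, 72)) -> int:
    """Map a sound pressure level to a character size.

    thresholds must be sorted in ascending order and sizes must have exactly
    one more entry than thresholds. The defaults give three levels:
    small (24), medium (48), and large (72). All values are hypothetical.
    """
    if len(sizes) != len(thresholds) + 1:
        raise ValueError("sizes must have one more entry than thresholds")
    # bisect_right counts how many thresholds are <= spl_db.
    return sizes[bisect.bisect_right(thresholds, spl_db)]


# Three levels (the default) and a finer four-level variant.
instructor_size = character_size(72.0)
trainee_size = character_size(50.0)
four_level_size = character_size(65.0, thresholds=(50.0, 60.0, 70.0), sizes=(20, 30, 40, 50))
```

Passing longer `thresholds` and `sizes` sequences gives the finer gradation mentioned above without changing the function.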

Furthermore, the pronunciation training device of this embodiment is not limited to using MFCC (mel-frequency cepstral coefficients) as the method for determining whether the instructor's voice and the hearing-impaired person's voice match. It may use a speech feature vector calculated from the speech signal acquired by the input signal acquisition process (S4), a combination of the speech feature vector and MFCC, or a probabilistic model (for example, a hidden Markov model).
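The description does not say how two sets of features are compared. As one possible illustration, the sketch below compares two sequences of per-frame feature vectors (for example, MFCC frames computed elsewhere) using dynamic time warping, a technique swapped in here as an assumption. The threshold value and the names `dtw_distance` and `voices_match` are hypothetical.

```python
import math
from typing import Sequence

Frame = Sequence[float]


def dtw_distance(seq_a: Sequence[Frame], seq_b: Sequence[Frame]) -> float:
    """Dynamic time warping distance between two sequences of feature frames.

    The local cost is the Euclidean distance between frames, and the result
    is the accumulated cost along the cheapest monotonic alignment.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # requires Python 3.8+
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def voices_match(instructor_frames: Sequence[Frame], trainee_frames: Sequence[Frame],
                 threshold: float = 1.0) -> bool:
    """Treat the two utterances as matching when their DTW distance is small.

    The threshold is an arbitrary illustrative value, not taken from the source.
    """
    return dtw_distance(instructor_frames, trainee_frames) <= threshold


# Toy two-dimensional "features"; real MFCC frames would have more coefficients.
instructor = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
trainee_slow = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0)]  # same sound, held longer
trainee_other = [(5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]
```

Dynamic time warping is used in this sketch because it tolerates differences in speaking rate between the two speakers. A hidden Markov model, as mentioned above, would be an alternative.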

In addition, the pronunciation training device of this embodiment is not limited to pronunciation training for the hearing-impaired person 76. It may also be applied, for example, to pronunciation training for infants or for non-native speakers learning Japanese.

Moreover, the pronunciation training device of this embodiment is not limited to pronunciation training for single words and may be applied to pronunciation training for short or long sentences. The pronunciation training device is also not limited to Japanese pronunciation training and may be applied to pronunciation training in other languages.

In addition to determining whether the hearing-impaired person 76 is vocalizing correctly in accordance with the instructor 75, the pronunciation training device of this embodiment may display appropriate message images or the like in the display area 71 in order to correct the pronunciation of the hearing-impaired person 76. For example, the display area 71 may show a message image that tells the hearing-impaired person 76 specifically about lip shape, movement, and the like (for example, “Please round your lips when you pronounce it.”). By repeating pronunciation training in accordance with the message images displayed in the display area 71, the hearing-impaired person 76 can bring the shape, movement, and the like of his or her lips closer to those of the instructor 75. The pronunciation of the hearing-impaired person 76 can thus be corrected so that it matches the pronunciation of the instructor 75.

1: pronunciation training device; 17: CCD camera; 52: arithmetic processing unit; 70: large display; 71: display area; 75: instructor; 76: hearing-impaired person; M1 to M5: microphones; P1: instructor voice generation position; P2: hearing-impaired person voice generation position; P5, P6: display positions in the display area

Claims

A pronunciation training device comprising:
display means having a display area capable of displaying a captured image, taken by a camera, that includes an image of a trainee and an image of an instructor;
sound collecting means capable of collecting a trainee voice, which is the voice uttered by the trainee, and an instructor voice, which is the voice uttered by the instructor;
position specifying means for specifying, based on the trainee voice and the instructor voice collected by the sound collecting means, a trainee voice generation position at which the trainee voice is generated and an instructor voice generation position at which the instructor voice is generated;
correlation means for correlating the trainee voice generation position specified by the position specifying means with a trainee voice visualization display position in the captured image displayed in the display area, and for correlating the instructor voice generation position specified by the position specifying means with an instructor voice visualization display position in the captured image displayed in the display area;
language recognition means for recognizing a trainee-generated language, which is the language uttered by the trainee at the trainee voice generation position, and an instructor-generated language, which is the language uttered by the instructor at the instructor voice generation position;
character conversion means for converting the trainee-generated language recognized by the language recognition means into trainee-generated language visualization characters that visualize it, and for converting the instructor-generated language recognized by the language recognition means into instructor-generated language visualization characters that visualize it; and
display control means for performing control to display the trainee-generated language visualization characters at the trainee voice visualization display position correlated with the trainee voice generation position by the correlation means, and control to display the instructor-generated language visualization characters at the instructor voice visualization display position correlated with the instructor voice generation position by the correlation means.

The pronunciation training device according to claim 1, wherein
the sound collecting means includes:
a microphone group in which pairs of microphones, each pair arranged at a predetermined horizontal interval, are arranged orthogonally to each other in order to collect the trainee voice and the instructor voice; and
an opposing microphone arranged, in order to collect the trainee voice and the instructor voice, at an opposing interval from each microphone constituting the pairs of microphones, the opposing interval being equal to the horizontal interval, and
the position specifying means:
calculates the trainee voice generation position by the hyperbolic method, using the time difference with which the trainee voice reaches the pair of microphones and the time differences with which the trainee voice reaches each of the microphones and the opposing microphone; and
calculates the instructor voice generation position by the hyperbolic method, using the time difference with which the instructor voice reaches the pair of microphones and the time differences with which the instructor voice reaches each of the microphones and the opposing microphone.

The pronunciation training device according to claim 1, comprising:
determination means for comparing the trainee-generated language recognized by the language recognition means with the instructor-generated language recognized by the language recognition means, and for making either a correct-answer determination that the trainee-generated language matches the instructor-generated language or an incorrect-answer determination that the trainee-generated language does not match the instructor-generated language; and
determination result display control means for performing control to display, in the display area, the result of the correct-answer determination or of the incorrect-answer determination made by the determination means.

The pronunciation training device according to any one of the preceding claims, wherein
the character conversion means changes the size of the trainee-generated language visualization characters and the size of the instructor-generated language visualization characters based on the volume of the trainee voice and the volume of the instructor voice collected by the sound collecting means, and
the display control means:
displays the trainee-generated language visualization characters at the trainee voice visualization display position in the size changed by the character conversion means; and
displays the instructor-generated language visualization characters at the instructor voice visualization display position in the size changed by the character conversion means.

The pronunciation training device according to any one of claims 1 to 4, further comprising display position adjusting means for adjusting the trainee voice visualization display position and the instructor voice visualization display position to arbitrary positions in the display area.

The pronunciation training device according to claim 1, wherein the trainee is a hearing-impaired person.