JP7432879B2

JP7432879B2 - speech training system

Info

Publication number: JP7432879B2
Application number: JP2020128338A
Authority: JP
Inventors: 真一坂本
Original assignee: OTODESIGNERS CO Ltd
Current assignee: OTODESIGNERS CO Ltd
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2024-02-19
Anticipated expiration: 2040-07-29
Also published as: JP2022025493A

Description

The present invention relates to a training system that analyzes the voice (words) uttered by a user, determines whether the user's speech is easy for elderly and middle-aged people to hear, presents that speech after simulated hearing loss conversion, and thereby helps the user learn a way of speaking that is easier to hear.

Japan's population aging rate remains extremely high, and enabling smooth communication with elderly people has become a critically important issue not only in daily life but also in business.

It is known that the decline in the ability to understand speech caused by reduced auditory frequency resolution begins not only in old age but already in middle age. Mishearing caused by "unaware hearing loss", of which the person is not conscious, has become a major problem in social life and in business settings.

To solve these problems, speakers need to improve how they speak in conversation and produce speech that is easy to hear even for elderly and middle-aged people. In particular, it is known from psychoacoustics that, for elderly and middle-aged listeners whose ability to understand speech has declined, consonants need to be uttered longer and more clearly than the vowels in the speech.

In recent years, speech training applications and the like have been offered that use simulated hearing loss technology, which converts a user's speech into the sound an elderly person would likely hear, to let users experience how they sound to the listener and to help them produce speech that is easy to hear even for elderly and middle-aged people. There is also a movement to repurpose speech training applications used for language learning and the like for training people to speak to the elderly.

Patent Document 1 discloses a speech evaluation device comprising: first audio data storage means for storing first audio data representing speech; sound collection means for outputting second audio data representing collected speech; pointed-out section specifying means for comparing features of the first audio data stored in the first audio data storage means with features of the second audio data output from the sound collection means and specifying a pointed-out section based on the comparison result; audio data processing means for processing the first audio data so that the manner of the speech represented by the audio data corresponding to the pointed-out section specified by the pointed-out section specifying means differs from the manner of the speech represented by the audio data corresponding to sections other than the pointed-out section; and sound emission control means for causing sound emission means to emit the speech represented by the first audio data processed by the audio data processing means.

Patent Document 2 discloses a signal processing device comprising a processing unit that generates a time-varying filter whose filter characteristics change at each point in time and uses the generated time-varying filter to obtain an output signal from an input signal that is a sound signal varying over time. The processing unit is configured to generate the time-varying filter for each point in time based on the difference, at each point in time, between a first auditory spectrogram obtained by passing the input signal through a first auditory filter bank reflecting the compression characteristics of a first listener and a second auditory spectrogram obtained by passing the input signal through a second auditory filter bank reflecting the compression characteristics of a second listener.

Patent Document 1: Japanese Patent Application No. 2006-217300
Patent Document 2: Japanese Patent Application No. 2015-27305

It is commonly assumed that one should speak loudly when words fail to get through to an elderly person or the like. However, it is known from psychoacoustics that, for mishearing by elderly and middle-aged people caused by reduced auditory frequency resolution, speaking loudly does almost nothing to reduce mishearing and instead increases discomfort at the loud voice.

It is also known from psychoacoustics that, in such cases, it is consonants rather than vowels that are prone to being misheard. Nevertheless, many people tend to stress the vowels when they feel their words are not getting through.

Simulated hearing loss technology, which converts a user's speech into the sound an elderly person would likely hear, lets users experience in simulation how an elderly person hears their voice. It therefore makes it easy for users to realize that raising the voice is pointless and that the articulation of consonants is what matters.

However, merely listening to speech that has undergone simulated hearing loss conversion lets users sense that an elderly person would find it hard to hear, but it does not tell them which phonemes in their own speech cause the difficulty or, concretely, how they should change the way they speak.

The speech evaluation device described in Patent Document 1 comprises: first audio data storage means for storing first audio data representing speech; sound collection means for outputting second audio data representing collected speech; pointed-out section specifying means for comparing features of the first audio data stored in the first audio data storage means with features of the second audio data output from the sound collection means and specifying a pointed-out section based on the comparison result; audio data processing means for processing the first audio data so that the manner of the speech represented by the audio data corresponding to the pointed-out section differs from the manner of the speech represented by the audio data corresponding to sections other than the pointed-out section; and sound emission control means for causing sound emission means to emit the speech represented by the processed first audio data. This makes it easier for the user to grasp the result of scoring in language learning and the like.

However, the user cannot experience how his or her voice sounds to an elderly person, so the device provides no motivation for speech training. Moreover, the user needs to be told concretely which phonemes in his or her speech are poorly uttered and how to improve them, yet Patent Document 1 neither suggests nor discloses a method for doing so.

The signal processing device described in Patent Document 2 comprises a processing unit that generates a time-varying filter whose filter characteristics change at each point in time and uses the generated time-varying filter to obtain an output signal from an input signal that is a sound signal varying over time. The processing unit is configured to generate the time-varying filter for each point in time based on the difference, at each point in time, between a first auditory spectrogram obtained by passing the input signal through a first auditory filter bank reflecting the compression characteristics of a first listener and a second auditory spectrogram obtained by passing the input signal through a second auditory filter bank reflecting the compression characteristics of a second listener. This makes it possible to generate simulated hearing loss speech that more appropriately reflects the auditory characteristics of a hearing-impaired person or the like.

However, the user cannot learn concretely which phonemes in his or her speech are poorly uttered or how to improve them, and Patent Document 2 neither suggests nor discloses a method for doing so.

As a means for solving the above problems, the speech training system of the present invention comprises: a sound collection unit for collecting the user's voice; a duration extraction unit that extracts the duration of each phoneme of the collected user's voice; a duration holding unit that extracts and/or holds the duration of each phoneme of a model voice recorded in advance; a simulated hearing loss conversion unit that applies simulated hearing loss conversion to the collected user's voice; a duration comparison unit that compares the duration of each phoneme of the user's voice with the duration of each phoneme of the model voice; and a comparison result presentation unit that presents the comparison result of the duration comparison unit to the user.

This allows users to experience how their voices sound to elderly people, which strongly motivates them to undertake speech training, and also lets them learn which phonemes they utter poorly and how to improve their speech.

The speech training system of the present invention is further configured to include a comparison result reproduction unit that plays back only the specific phonemes presented by the comparison result presentation unit. This allows the user to learn in greater detail which phonemes are poorly uttered and how to improve them while actually listening to those phonemes.

The speech training system according to the present invention makes it possible to improve how one speaks in conversation and to produce speech that is easy to hear even for elderly and middle-aged people.

Because users can understand how the duration of each phoneme differs from that of the model speech while listening to their own voice after simulated hearing loss conversion, they can stay highly motivated for the training while improving, in particular, their ability to utter the consonants contained in speech.

Furthermore, because users can play back and listen to only those phonemes in their own speech that are especially hard to convey, they can pay particular attention to uttering those phonemes, which makes a further improvement in speaking ability possible.

FIG. 1: Block diagram of the first embodiment of the present invention
FIG. 2: Block diagram of the second embodiment of the present invention
FIG. 3: An example of a user screen in the first embodiment
FIG. 4: An example of a screen of the comparison result presentation unit in the first embodiment
FIG. 5: An example of a screen for listening to simulated hearing loss converted speech in the first embodiment
FIG. 6: An example of a user screen in the second embodiment

The best mode for carrying out the present invention is described in detail below with reference to the drawings. In the following description, parts having the same function are given the same reference numerals, and repeated description of them is omitted.

FIG. 1 is a block diagram of a system according to the first embodiment of the present invention. The system consists of a sound collection unit 2 that collects the voice uttered by the user, a duration extraction unit 3 that extracts the duration of each phoneme of the collected user's voice, a duration holding unit 4 that extracts and/or holds the duration of each phoneme of a model voice recorded in advance, a simulated hearing loss conversion unit 5 that applies simulated hearing loss conversion to the collected user's voice, a duration comparison unit 6 that compares the duration of each phoneme of the user's voice with the duration of each phoneme of the model voice, and a comparison result presentation unit 7 that presents the comparison result of the duration comparison unit to the user.

The user 1 speaks to the system the predetermined task speech content presented on a user screen or the like. The system may be dedicated hardware, or it may be a smartphone, a personal computer, or the like.

The speech uttered by the user 1 is collected by the sound collection unit 2. The sound collection unit 2 may be a microphone or the like built into the dedicated hardware, smartphone, or personal computer, or it may be another sound collection device procured by the user 1. The sound collection unit 2 may have a recording function, recording and saving the voice of the user 1 beforehand and then sending the audio data to the duration extraction unit 3, or it may send the collected voice directly to the duration extraction unit 3.

The duration extraction unit 3 analyzes the voice of the user 1 collected by the sound collection unit 2, divides it into the phonemes it contains, and extracts the duration of each phoneme. Speech segmentation techniques such as DP matching or HMMs (hidden Markov models) are used to analyze and segment the phonemes.
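The following is a minimal sketch of how DP matching (dynamic time warping) could yield per-phoneme durations. It assumes, purely for illustration, that each frame is a single number (a real system would more likely use feature vectors such as MFCCs), that frames are 10 ms long, and that a reference utterance with frame-level phoneme labels is available. The names `dtw_path` and `user_durations` are hypothetical and do not come from the patent.

```python
from math import inf


def dtw_path(ref, usr):
    """Return a minimum-cost alignment [(ref_idx, usr_idx), ...] between two sequences."""
    n, m = len(ref), len(usr)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - usr[j - 1])  # local frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def user_durations(ref_feats, ref_labels, usr_feats, frame_ms=10.0):
    """Transfer the reference phoneme labels to the user's frames and sum run lengths."""
    labels = [None] * len(usr_feats)
    for i, j in dtw_path(ref_feats, usr_feats):
        labels[j] = ref_labels[i]
    segments = []
    for lab in labels:
        # Note: adjacent identical phonemes (e.g. "n" "n") would merge in this sketch.
        if segments and segments[-1][0] == lab:
            segments[-1][1] += frame_ms
        else:
            segments.append([lab, frame_ms])
    return [(lab, dur) for lab, dur in segments]


ref_feats = [0.0, 0.0, 5.0, 5.0, 1.0, 1.0]
ref_labels = ["a", "a", "k", "k", "o", "o"]
usr_feats = [0.0, 0.0, 0.0, 5.0, 1.0, 1.0, 1.0]
durations = user_durations(ref_feats, ref_labels, usr_feats)
```

In this toy input the user holds the first and last sounds longer and the middle one shorter than the reference, and the alignment recovers that as per-phoneme durations.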

The duration holding unit 4 stores the duration of each phoneme in a recording, with the same content as the task speech, of a model speaker who can produce speech that is easy to hear even for elderly and middle-aged people. Instead of storing the phoneme durations in advance, the system may store the model speaker's audio data, have the duration extraction unit 3 analyze it each time, and record the result in the duration holding unit 4.

The duration comparison unit 6 compares, phoneme by phoneme, the duration of each phoneme in the speech uttered by the user 1, as extracted by the duration extraction unit 3, with the duration of each phoneme in the model speaker's speech recorded in the duration holding unit 4. One possible comparison method is to set a threshold in advance on the difference between the two durations and send information about phonemes that were longer or shorter by more than that threshold to the comparison result presentation unit 7. Another is to compute the ratio of the two durations and set a threshold on that ratio.
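The two comparison methods described above can be sketched as follows. This is only an illustration under stated assumptions: durations are given in milliseconds as aligned lists of (phoneme, duration) pairs, and the function names and threshold values are hypothetical rather than taken from the patent.

```python
def flag_by_difference(user, model, threshold_ms):
    """Flag phonemes whose duration differs from the model by more than threshold_ms."""
    flagged = []
    for (ph_u, dur_u), (ph_m, dur_m) in zip(user, model):
        if ph_u != ph_m:
            raise ValueError("phoneme sequences are not aligned")
        diff = dur_u - dur_m
        if abs(diff) > threshold_ms:
            flagged.append((ph_u, "long" if diff > 0 else "short", diff))
    return flagged


def flag_by_ratio(user, model, max_ratio):
    """Flag phonemes whose user/model duration ratio exceeds max_ratio or its inverse."""
    flagged = []
    for (ph_u, dur_u), (ph_m, dur_m) in zip(user, model):
        if ph_u != ph_m:
            raise ValueError("phoneme sequences are not aligned")
        ratio = dur_u / dur_m
        if ratio > max_ratio:
            flagged.append((ph_u, "long"))
        elif ratio < 1.0 / max_ratio:
            flagged.append((ph_u, "short"))
    return flagged


user = [("ts", 40.0), ("u", 80.0), ("a", 150.0)]
model = [("ts", 90.0), ("u", 85.0), ("a", 100.0)]
by_diff = flag_by_difference(user, model, threshold_ms=30.0)
by_ratio = flag_by_ratio(user, model, max_ratio=1.4)
```

A difference threshold treats all phonemes on the same absolute scale, whereas a ratio threshold is relative to the model duration, so it may be more forgiving for long vowels and stricter for short consonants.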

The comparison result presentation unit 7 presents to the user 1 the comparison result for each phoneme's duration output by the duration comparison unit 6. One way to present it is to display the phonemes contained in the task speech content as text and indicate whether each phoneme was longer or shorter than in the model speaker's speech. Another is to draw the waveforms, sound spectrograms, or the like of the speech of the user 1 and of the model speaker, mark the section corresponding to each phoneme within them, and thus show the user 1 visually how long each phoneme was.

Meanwhile, the simulated hearing loss conversion unit 5 applies simulated hearing loss conversion to the speech of the user 1 collected by the sound collection unit 2, converting it into the sound that elderly and middle-aged people would likely hear. Possible conversion methods include simulating, by signal processing in the frequency domain, the degree of broadening of the auditory filters of elderly and middle-aged people and resynthesizing the signal with FFT and overlap-add processing, or the method described in Patent Document 2.
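As a rough illustration of the frequency-domain step only, the sketch below smears the magnitude spectrum of a single analysis frame so that each bin's energy spreads into its neighbors, which is one simple way to mimic broadened auditory filters. The Gaussian kernel, the constant width in bins, and the name `smear_spectrum` are assumptions made for this example; the patent does not specify them, and FFT analysis and overlap-add resynthesis around this step are omitted.

```python
import math


def smear_spectrum(magnitudes, width_bins):
    """Spread each bin's power over neighboring bins with normalized Gaussian weights.

    Total power is preserved; spectral peaks become lower and wider.
    """
    n = len(magnitudes)
    power = [m * m for m in magnitudes]
    smeared = [0.0] * n
    for k, p in enumerate(power):
        weights = [math.exp(-0.5 * ((j - k) / width_bins) ** 2) for j in range(n)]
        total = sum(weights)
        for j, w in enumerate(weights):
            smeared[j] += p * w / total
    return [math.sqrt(p) for p in smeared]


frame = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # a single sharp spectral peak
out = smear_spectrum(frame, width_bins=1.0)
```

Real auditory filters are asymmetric and widen with center frequency, so a practical implementation would likely use a frequency-dependent kernel rather than the constant width assumed here.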

By pressing a play button on the user screen or the like, the user 1 can listen to the speech after simulated hearing loss conversion. If play buttons are also provided for the user's original speech without conversion and for the model speaker's original and converted speech, the user 1 can listen to all of these, identify the weak points in his or her own speech, and carry out concrete training to improve them.

FIGS. 3, 4, and 5 show examples of the screens presented on a smartphone when the speech training system of the present invention is implemented as a smartphone application.

Following the screen of FIG. 3, the user 1 presses the record button and then speaks into the smartphone the training task speech content displayed at the top of the screen (in this example, "Itsumo, arigatou", meaning "Thank you, always"). The microphone built into the smartphone functions as the sound collection unit 2 and collects the utterance of the user 1.

The duration extraction unit 3 of the smartphone application applies analysis processing such as DP matching to the collected voice of the user 1, divides it into the phonemes of "Itsumo, arigatou", namely "i", "ts", "u", "m", "o", " ", "a", "r", "i", "g", "a", "t", "o", " ", and extracts the duration of each phoneme (a section with no phoneme symbol is a pause between words).

Because the duration holding unit 4 stores the duration of each phoneme of "Itsumo, arigatou" as spoken by a model speaker who can produce speech that is easy to hear even for elderly and middle-aged people, the duration comparison unit 6 compares the two sets of phoneme durations.

FIG. 4 is an example of the comparison result presentation unit 7. Here, results are presented only for the consonant "ts" of "tsu" in "itsumo" and the "a" at the start of "arigatou", for which the difference in duration was especially large, together with advice for improving the utterance. In this example, the output of the duration comparison unit 6 indicates that the "ts" uttered by the user 1 was shorter than that of the model speaker and the "a" was too long, so the screen prompts the user to open the mouth wide and utter each of these phonemes carefully.

FIG. 5 is a screen on which the user 1 listens to speech after simulated hearing loss conversion and experiences how the utterance sounds to elderly and middle-aged people. For both the speech of the user 1 and that of the model speaker, the converted speech and the original speech can be played. Because the user 1 can hear how the phonemes pointed out in FIG. 4 actually sound to elderly and middle-aged people, he or she can carry out speech improvement training according to the advice with strong motivation and a concrete goal.

FIG. 2 is a block diagram of a system according to the second embodiment of the present invention. The system consists of a sound collection unit 2 that collects the voice uttered by the user, a duration extraction unit 3 that extracts the duration of each phoneme of the collected user's voice, a duration holding unit 4 that extracts and/or holds the duration of each phoneme of a model voice recorded in advance, a simulated hearing loss conversion unit 5 that applies simulated hearing loss conversion to the collected user's voice, a duration comparison unit 6 that compares the duration of each phoneme of the user's voice with the duration of each phoneme of the model voice, a comparison result presentation unit 7 that presents the comparison result of the duration comparison unit to the user, and a comparison result reproduction unit 8 that plays back only the specific phonemes presented by the comparison result presentation unit 7.

The comparison result reproduction unit 8 has a function of playing back only those phonemes, among the ones displayed by the comparison result presentation unit 7, whose duration differed especially greatly from the model.

FIG. 6 shows an example of the screen presented on a smartphone when the comparison result reproduction unit 8 of the second embodiment of the present invention is implemented as a smartphone application.

In this example, the speech waveform is drawn and the phoneme sections whose duration differed especially greatly are marked with shading; tapping a shaded area plays back that phoneme section. In actual playback, so that the user 1 can hear the phoneme section easily, it will probably be necessary either to start playback a few milliseconds to a few hundred milliseconds before the phoneme section and end it a few milliseconds to a few hundred milliseconds after the section, or to include several of the preceding and following phonemes.
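The padded playback described above amounts to widening the phoneme section and clamping it to the recording. The sketch below shows one way to compute the range; the function name `playback_window` and the 200 ms default padding are assumptions for illustration, since the patent only gives a range of a few milliseconds to a few hundred milliseconds.

```python
def playback_window(seg_start_ms, seg_end_ms, total_ms, pad_ms=200.0):
    """Return (start, end) of the range to play, padded and clamped to the recording."""
    start = max(0.0, seg_start_ms - pad_ms)
    end = min(total_ms, seg_end_ms + pad_ms)
    return start, end


# A phoneme near the beginning and one near the end of a 1200 ms recording.
early = playback_window(100.0, 160.0, 1200.0)
late = playback_window(1000.0, 1100.0, 1200.0)
```

Clamping matters for phonemes near the start or end of the utterance, where the full padding is not available.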

In this example the phoneme section is indicated by a waveform with shading, but the method is not limited to waveforms. A plot of speech power over time or a sound spectrogram may be used instead of the waveform, or the phoneme may be displayed as text with only its section played back.

In this example, the comparison result presentation unit 7 displays as-is the differences in duration, output by the duration comparison unit 6, between each phoneme in the speech of the user 1 and in that of the model speaker. However, when people listen to speech and try to recognize its content, they do not attend equally to every phoneme.

Psychoacoustic findings have been reported that, for a listener to recognize the content of speech accurately, correctly hearing word-initial consonants matters most. The phonemes of the task speech content can therefore be weighted in advance. For the initial consonant of each phrase, even a small difference in duration is shown by the comparison result presentation unit 7 to prompt the user 1 to train, whereas for word-final vowels a difference is not displayed even if it is somewhat large.

For example, suppose the task speech content presented on the user screen is "Konnichiwa" ("Hello"; phonemes "k" "o" "n" "n" "i" "ch" "i" "w" "a"). For "k", the most important word-initial consonant, the difference in duration between the speech of the user 1 and that of the model speaker, as calculated by the duration comparison unit 6, can be doubled before it is compared with the threshold, so that the user 1 is prompted to train even for a slight difference. The difference for the word-final "a", on the other hand, can be halved before comparison, so that training is not prompted even if there is some difference.
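The weighting in this example can be sketched as scaling each duration difference before the threshold test. The weights 2 and 1/2 come from the text above; the 30 ms threshold, the sample differences, and the name `needs_training` are illustrative assumptions.

```python
def needs_training(diff_ms, weight, threshold_ms):
    """Scale the absolute duration difference by the phoneme's weight, then compare."""
    return abs(diff_ms) * weight > threshold_ms


phonemes = ["k", "o", "n", "n", "i", "ch", "i", "w", "a"]
# Word-initial consonant weighted 2, word-final vowel weighted 1/2, others 1.
weights = [2.0] + [1.0] * (len(phonemes) - 2) + [0.5]
# Hypothetical user-minus-model duration differences in milliseconds.
diffs = [-20.0, 5.0, 0.0, 10.0, -5.0, 20.0, 0.0, 5.0, 50.0]

flagged = [
    ph
    for ph, diff, w in zip(phonemes, diffs, weights)
    if needs_training(diff, w, threshold_ms=30.0)
]
```

With these numbers a 20 ms shortfall on the initial "k" is flagged while the same 20 ms on "ch" and a 50 ms excess on the final "a" are not, which matches the intent of emphasizing word-initial consonants.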

The speech of a speaker trained with the speech training system of the present invention becomes easier to hear not only for middle-aged and elderly people but also for hearing-impaired people in general, including young people with hearing loss. The present invention can therefore also be used as a system for training speech that is easy to convey to hearing-impaired people in general.

1... User, 2... Sound collection unit, 3... Duration extraction unit, 4... Duration holding unit, 5... Simulated hearing loss conversion unit, 6... Duration comparison unit, 7... Comparison result presentation unit, 8... Comparison result reproduction unit.

Claims

1. A speech training system for producing speech that is easy for elderly and middle-aged people to hear, comprising: a sound collection unit for collecting the user's voice; a duration extraction unit that extracts the duration of each phoneme of the collected user's voice; a duration holding unit that extracts and/or holds the duration of each phoneme of a model voice recorded in advance; a simulated hearing loss conversion unit that simulates how the collected user's voice sounds to elderly and middle-aged people; a play button for listening to the speech converted by the simulated hearing loss conversion unit; a duration comparison unit that compares the duration of each phoneme of the user's voice with the duration of each phoneme of the model voice; and a comparison result presentation unit that presents advice for speech improvement to the user based on the comparison result of the duration comparison unit.

2. The speech training system according to claim 1, further comprising a comparison result reproduction unit that plays back only the specific phonemes presented by the comparison result presentation unit.