JP7416245B2

JP7416245B2 - Learning device, learning method, and learning program

Info

Publication number: JP7416245B2
Application number: JP2022531321A
Authority: JP
Inventors: 妙佐藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2024-01-17
Anticipated expiration: 2040-06-24
Also published as: WO2021260848A1; JPWO2021260848A1

Description

The present embodiment relates to a learning device, a learning method, and a learning program for voice selection.

Various methods have been proposed for selecting a voice to be presented to a user from among a plurality of voice candidates. A voice classification model is sometimes used for such selection. Some classification models of this type are trained by supplying, as training data, information on whether each classified voice was correct or incorrect. Generating the training data requires an appropriate evaluation of the voices. As one proposal concerning voice evaluation, the method described in Non-Patent Document 1 is known, for example.

Non-Patent Document 1: Takuya Nishimoto et al., "Cognitive load measurement method and its evaluation in speech interfaces", Speech and Linguistic Information Processing 45-5

The embodiment provides a learning device, a learning method, and a learning program that can efficiently collect training data for voice classification.

The learning device according to the embodiment includes a learning unit that acquires training data for a learning model, which is used to select a voice to be presented to a user from among a plurality of voice candidates, based on the user's reaction to a plurality of voices presented to the user simultaneously. The plurality of voices are presented from respective speakers that are arranged equidistant from the user in different directions and that emit the voices toward the user from those different directions.

According to the embodiment, a learning device, a learning method, and a learning program are provided that can efficiently collect training data for voice classification.

FIG. 1 is a diagram showing the hardware configuration of an example of the voice generation device according to the embodiment.
FIG. 2A is a diagram showing an example of speaker arrangement.
FIG. 2B is a diagram showing an example of speaker arrangement.
FIG. 2C is a diagram showing an example of speaker arrangement.
FIG. 2D is a diagram showing an example of speaker arrangement.
FIG. 3 is a diagram showing the configuration of an example of the familiarity DB.
FIG. 4 is a diagram showing the configuration of an example of the user log DB.
FIG. 5 is a diagram showing the configuration of an example of the calling text DB.
FIG. 6 is a functional block diagram of the voice generation device.
FIG. 7A is a flowchart showing the voice presentation processing performed by the voice generation device.
FIG. 7B is a flowchart showing the voice presentation processing performed by the voice generation device.
FIG. 8 is a conceptual diagram of a binary classification model using "familiarity", "concentration level", and "amount of change in arousal level".

Hereinafter, the embodiment will be described with reference to the drawings. FIG. 1 is a diagram showing the hardware configuration of an example of a voice generation device that includes the learning device according to the embodiment. The voice generation device 1 according to the embodiment emits a calling voice that prompts the user to wake up when the user is not in an awake state, for example when the user is drowsy.

In the embodiment, whether the user is in an awake state is determined based on the "arousal level". The arousal level in the embodiment is an index indicating the degree of arousal and corresponds to the user's wakefulness level. The wakefulness level corresponds to the activity level of the cerebrum and represents the degree of arousal, ranging from sleep to excitement. The wakefulness level is measured from eye movement, blink activity, electrodermal activity, reaction time to a stimulus, and the like. The arousal level in the embodiment is calculated from any one of these measurements (eye movement, blink activity, electrodermal activity, and reaction time to a stimulus) or from a combination of them. The arousal level is, for example, a value that increases as the user moves from a sleeping state toward an excited state. The arousal level may be a continuous numerical value or a discrete value such as Level 1, Level 2, and so on. When the arousal level is calculated from a combination of the values of eye movement, blink activity, electrodermal activity, and reaction time to a stimulus, the way they are combined is not particularly limited. For example, the values may simply be summed, or they may be combined by weighted addition.
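
As an illustration of the weighted-addition option mentioned above, the following is a minimal sketch. The signal names, the weight values, and the binning into discrete levels are assumptions made for this example; the source does not specify them, and it assumes each signal has already been converted to a score in which a larger value means a more aroused user.

```python
from typing import Mapping

# Hypothetical weights; the source does not give any concrete values.
DEFAULT_WEIGHTS = {
    "eye_movement": 1.0,
    "blink_activity": 1.0,
    "electrodermal_activity": 1.0,
    "reaction_time": 1.0,
}


def arousal_level(scores: Mapping[str, float],
                  weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Combine per-signal scores by weighted addition.

    Only the signals present in `scores` are used, so a single signal or
    any combination of signals can be supplied. With all weights equal
    to 1.0 this reduces to a simple sum.
    """
    return sum(weights[name] * value for name, value in scores.items())


def to_discrete_level(level: float, step: float = 1.0) -> int:
    """Map a continuous arousal level to Level 1, Level 2, ... (assumed binning)."""
    return max(1, int(level // step) + 1)
```

Whether a simple sum or a weighted sum is more suitable depends on how the individual signals are scaled, which the source leaves open.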

The voice generation device 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, speakers 7a and 7b, a camera 8, an input device 9, a display 10, and a communication module 11. The voice generation device 1 is, for example, one of various terminals such as a personal computer (PC), a smartphone, or a tablet terminal. The voice generation device 1 is not limited to these and may be mounted in various devices used by the user. Note that the voice generation device 1 does not need to have all of the components shown in FIG. 1. For example, the microphone 6, the speakers 7a and 7b, the camera 8, and the display 10 may be devices separate from the voice generation device 1.

The processor 2 is a control circuit, such as a CPU, that controls the overall operation of the voice generation device 1. The processor 2 does not need to be a CPU and may be an ASIC, an FPGA, a GPU, or the like. The processor 2 does not need to consist of a single CPU or the like and may consist of a plurality of CPUs or the like.

The ROM 3 is a nonvolatile memory such as a flash memory. The ROM 3 stores, for example, a startup program for the voice generation device 1. The RAM 4 is a volatile memory such as an SDRAM. The RAM 4 can be used as working memory for various kinds of processing in the voice generation device 1.

The storage 5 is a storage device such as a flash memory, a hard disk drive (HDD), or a solid state drive (SSD). The storage 5 stores various programs used by the voice generation device 1. The storage 5 may also store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a voice synthesis parameter database (DB) 54, and a calling text database (DB) 55. These databases are described in detail later.

The microphone 6 is a device that converts input voice into a voice signal, which is an electrical signal. The voice signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5. For example, the voice synthesis parameters used to synthesize a calling voice can be obtained from voice input through the microphone 6.

The speakers 7a and 7b are devices that output voice based on an input voice signal. It is desirable that the speaker 7a and the speaker 7b are not close to each other. It is also desirable that the speaker 7a and the speaker 7b are placed in different directions as seen from the user. Furthermore, it is desirable that the distance between the speaker 7a and the user is equal to the distance between the speaker 7b and the user.

FIGS. 2A and 2B are diagrams showing examples of the arrangement of the speakers 7a and 7b. In FIG. 2A, the speakers 7a and 7b are both placed in front of the user U so as to be equidistant from the user. In FIG. 2B, the speakers 7a and 7b are placed in front of and behind the user U, respectively, so as to be equidistant from the user.

The number of speakers placed in the user's environment is equal to the number of voices presented. In other words, FIG. 1 is an example in which two voices are presented. However, three or more voices may be presented, in which case three or more speakers are placed. Even when three or more speakers are placed, it is desirable that the speakers are not close to one another, that they are placed in different directions as seen from the user, and that they are all equidistant from the user. For example, FIGS. 2C and 2D show arrangement examples with three speakers 7a, 7b, and 7c. In FIG. 2C, the speakers 7a, 7b, and 7c are placed in front of the user U. In FIG. 2D, the speakers 7a, 7b, and 7c are placed behind the user U.
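
The arrangement conditions above (equal distance from the user, different directions) can be sketched geometrically as speakers spread over an arc centered on the user. The function name, the coordinate convention, and the parameters below are assumptions for illustration only; the source does not prescribe specific angles or distances.

```python
import math
from typing import List, Tuple


def speaker_positions(n: int, radius: float,
                      start_deg: float, spread_deg: float) -> List[Tuple[float, float]]:
    """Return (x, y) positions of n speakers on an arc around the user.

    The user is at the origin. An angle of 0 degrees is straight ahead
    (+y), and positive angles are toward the user's right (+x). Every
    speaker is at the same distance `radius`, and each one is placed in
    a different direction between start_deg and start_deg + spread_deg.
    """
    if n < 2:
        raise ValueError("at least two speakers are needed")
    step = spread_deg / (n - 1)
    positions = []
    for i in range(n):
        angle = math.radians(start_deg + i * step)
        positions.append((radius * math.sin(angle), radius * math.cos(angle)))
    return positions
```

For example, `speaker_positions(3, 1.0, -45.0, 90.0)` spreads three speakers in front of the user, in the spirit of FIG. 2C, while `speaker_positions(2, 1.0, 0.0, 180.0)` places one speaker in front and one behind, in the spirit of FIG. 2B.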

The camera 8 images the user and acquires an image of the user. The image of the user obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5. The image of the user is used, for example, to acquire the arousal level or to acquire the user's reaction to a calling voice.

The input device 9 is a mechanical input device such as a button, a switch, a keyboard, or a mouse, or a software-based input device that uses a touch sensor. The input device 9 accepts various inputs from the user and outputs a signal corresponding to the user's input to the processor 2.

The display 10 is a display such as a liquid crystal display or an organic EL display. The display 10 displays various images.

The communication module 11 is a device with which the voice generation device 1 performs communication. The communication module 11 communicates with, for example, a server provided outside the voice generation device 1. The communication method used by the communication module 11 is not particularly limited. The communication module 11 may communicate wirelessly or by wire.

Next, the familiarity database (DB) 51, the user log database (DB) 52, the model database (DB) 53, the voice synthesis parameter database (DB) 54, and the calling text database (DB) 55 will be described.

FIG. 3 is a diagram showing the configuration of an example of the familiarity DB 51. The familiarity DB 51 is a database that records the user's "familiarity". The familiarity DB 51 records, for example, a user ID, a voice label, a familiar target, a familiarity, a number of reactions, a number of presentations, and an average arousal level change in association with one another.

The "user ID" is an ID assigned to each user of the voice generation device 1. User attribute information such as a user name may be associated with the user ID.

The "voice label" is a label uniquely assigned to each calling voice candidate. Any label can be used as a voice label. For example, the name of the familiar target may be used as the voice label.

The "familiar target" is a person with whom the user converses regularly, or a source of a voice that the user hears often. The familiar target does not necessarily have to be a person.

The "familiarity" is the degree to which the user is familiar with the voice of the corresponding familiar target. The familiarity can be calculated from, for example, how often the user communicates with the familiar target through SNS or the like, how often the user converses with the familiar target in daily life, and how often the user hears the familiar target in daily life. For example, the higher these frequencies are, the larger the familiarity value becomes. The familiarity may also be obtained through self-reporting by the user.

The "number of reactions" is the number of times the user reacted to a calling voice generated based on the corresponding voice label. The "number of presentations" is the number of times a calling voice generated based on the corresponding voice label was presented to the user. A reaction probability can be calculated by dividing the number of reactions by the number of presentations. The reaction probability is the probability that the user reacts to a calling voice generated based on the corresponding voice label.

The "average arousal level change" is the average of the amounts of change in the user's arousal level in response to calling voices generated based on the corresponding voice label. The amount of change in arousal level is described later.
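
The fields of the familiarity DB 51 described above, together with the reaction probability, could be represented as in the following minimal sketch. The class name, field names, and types are assumptions for illustration, and returning 0.0 when a voice has never been presented is also an assumption, since the source does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class FamiliarityRecord:
    """One row of the familiarity DB (field names are illustrative)."""
    user_id: str
    voice_label: str
    familiar_target: str
    familiarity: float
    reaction_count: int        # "number of reactions"
    presentation_count: int    # "number of presentations"
    mean_arousal_change: float  # "average arousal level change"

    def reaction_probability(self) -> float:
        """Number of reactions divided by number of presentations."""
        if self.presentation_count == 0:
            # Assumption: treat a never-presented voice as probability 0.
            return 0.0
        return self.reaction_count / self.presentation_count
```

For instance, a record with 3 reactions out of 4 presentations gives a reaction probability of 0.75.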

FIG. 4 is a diagram showing the configuration of an example of the user log DB 52. The user log DB 52 is a database that records logs of the user's use of the voice generation device 1. The user log DB 52 records, for example, a log date and time, a user ID, a voice label, a familiar target, a concentration level, a reaction presence/absence, an arousal level, a new arousal level, an amount of change in arousal level, and a correct-answer label in association with one another. The user ID, the voice label, and the familiar target are the same as those in the familiarity DB 51.

The "log date and time" is the date and time at which the user used the voice generation device 1. The log date and time is recorded, for example, each time a calling voice is presented to the user.

The "reaction presence/absence" is information on whether the user reacted after a calling voice was presented. When the user reacted, "Yes" is recorded. When the user did not react, "No" is recorded.

The "concentration level" is the degree to which the user is concentrating when the calling voice is presented. The concentration level can be measured, for example, by estimating the posture and behavior of the working user from the image obtained by the camera 8. In this case, the concentration level is calculated so that it rises each time the user takes a posture or action regarded as concentrating and falls each time the user takes a posture or action regarded as not concentrating. The concentration level can also be measured by estimating, from the image obtained by the camera 8, how far the pupils of the working user are open. In this case, the concentration level is calculated so that it is higher when the pupils are more dilated and lower when they are more constricted. The concentration level may be a discrete value such as Lv (Level) 1, Lv 2, and so on. The method of acquiring the concentration level is not limited to any specific method.
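
The posture-based update rule described above can be sketched as a bounded counter. The step size of one level per observation and the range Lv 1 to Lv 5 are assumptions for this example; the source only states that the value rises and falls and may be discrete.

```python
def update_concentration(level: int, is_focused: bool,
                         min_level: int = 1, max_level: int = 5) -> int:
    """Raise the level by one for a focused posture or action, lower it
    by one otherwise, and keep it within [min_level, max_level].

    `is_focused` is assumed to come from a separate, unspecified
    image-based estimator of the user's posture and behavior.
    """
    step = 1 if is_focused else -1
    return min(max_level, max(min_level, level + step))
```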

The "arousal level" is the arousal level acquired before the voice generation device 1 presents the calling voice.

The "new arousal level" is the arousal level newly acquired after the user has reacted. The new arousal level is not recorded when the user did not react.

The "amount of change in arousal level" is a quantity representing the change in arousal level before and after the user's reaction. For example, the amount of change in arousal level is obtained as the difference between the new arousal level and the arousal level. It may instead be, for example, the ratio of the new arousal level to the arousal level. The amount of change in arousal level is not recorded when the user did not react.
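
The two ways of computing the amount of change in arousal level mentioned above, difference or ratio, can be sketched as follows. The function name and the use of `None` to represent "not recorded" are assumptions for illustration, and the ratio form assumes a non-zero arousal level before the presentation.

```python
from typing import Optional


def arousal_change(before: float, after: Optional[float],
                   use_ratio: bool = False) -> Optional[float]:
    """Return the change in arousal level across the user's reaction.

    `before` is the arousal level acquired before the calling voice was
    presented, and `after` is the new arousal level, or None when the
    user did not react. In that case nothing is recorded, so None is
    returned.
    """
    if after is None:
        return None
    if use_ratio:
        return after / before  # assumes before != 0
    return after - before
```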

The "correct-answer label" is a label indicating a correct or incorrect answer for supervised learning. For example, a correct answer is recorded as 〇 and an incorrect answer as ×. The correct-answer label is described in detail later.

The model DB 53 is a database that records voice label classification models used to extract voice label candidates. In the embodiment, a model is configured to classify voice labels as correct or incorrect in the two-dimensional space of familiarity and concentration level. The models include an initial model and a learning model. The initial model is generated from initial values stored in the model DB 53 and is not updated by learning. Here, the initial values are, for example, the values of constants (the coefficients of a plane equation) that determine the classification surface for classifying voice labels, defined in the three-dimensional space of "familiarity", "concentration level", and "amount of change in arousal level". The classification surface generated from these initial values is the initial model. In the initial model, a voice label whose familiarity is greater than the classification surface is classified as correct (〇), and any other voice label is classified as incorrect (×). The learning model is a trained model generated from the initial model. The learning model can be a binary classification model whose classification surface differs from that of the initial model.
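
A classification surface given by plane-equation coefficients can be sketched as below. The form a·familiarity + b·concentration + c·arousal_change + d = 0, the class name, and the method names are assumptions for illustration; the source states only that the initial values are coefficients of a plane equation and that labels with familiarity above the surface are classified as correct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaneModel:
    """Plane a*familiarity + b*concentration + c*arousal_change + d = 0."""
    a: float
    b: float
    c: float
    d: float

    def familiarity_on_plane(self, concentration: float,
                             arousal_change: float) -> float:
        """Familiarity value of the plane at the given other coordinates."""
        if self.a == 0:
            raise ValueError("coefficient a must be non-zero")
        return -(self.b * concentration + self.c * arousal_change + self.d) / self.a

    def is_correct(self, familiarity: float, concentration: float,
                   arousal_change: float) -> bool:
        """True (correct) when the familiarity lies above the plane."""
        return familiarity > self.familiarity_on_plane(concentration, arousal_change)
```

With the hypothetical coefficients `PlaneModel(1.0, -1.0, 0.0, 0.0)`, the surface reduces to familiarity = concentration, so a voice label is classified as correct only when its familiarity exceeds the user's current concentration level.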

The voice synthesis parameter DB 54 is a database that records voice synthesis parameters. The voice synthesis parameters are data used to synthesize the voice of the user's familiar target. For example, the voice synthesis parameters may be feature data extracted from voice data collected in advance through the microphone 6. Alternatively, voice synthesis parameters acquired or defined by another system may be recorded in advance. The voice synthesis parameters are associated with voice labels.

FIG. 5 is a diagram showing the configuration of an example of the calling text DB 55. The calling text DB 55 is a database that records template data for various calling texts used to prompt the user to wake up. The calling text is not particularly limited. However, it is desirable that the calling text includes a call that uses the user's name. This is to enhance the cocktail party effect, which is described later.

The familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the calling text DB 55 do not necessarily have to be stored in the storage 5. For example, they may be stored on a server separate from the voice generation device 1. In this case, the voice generation device 1 accesses the server using the communication module 11 and acquires the necessary information.

FIG. 6 is a functional block diagram of the voice generation device 1. As shown in FIG. 6, the voice generation device 1 includes an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26. The operations of the acquisition unit 21, the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 are realized, for example, by the processor 2 executing a program stored in the storage 5. The determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 may instead be realized by hardware separate from the processor 2.

The acquisition unit 21 acquires the user's arousal level. The acquisition unit 21 also acquires the user's reaction to the calling voice. As described above, the arousal level is calculated from any one of eye movement, blink activity, electrodermal activity, and reaction time to a stimulus, or from a combination of them. The eye movement, blink activity, and reaction time to a stimulus used to calculate the arousal level can be measured, for example, from the image of the user acquired by the camera 8. The reaction time to a stimulus may also be measured from the voice signal acquired by the microphone 6. The electrodermal activity can be measured, for example, by a sensor worn on the user's arm. The user's reaction can be acquired by measuring, for example from the image acquired by the camera 8, whether the user showed a physical reaction, such as the user's head or line of sight turning toward the speaker 7a or 7b, and the direction of that reaction. The acquisition unit 21 may be configured to acquire, through communication, an arousal level or a user reaction calculated outside the voice generation device 1.
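
Determining the presence and direction of a reaction from the user's head orientation can be sketched as follows. This sketch assumes that a head yaw angle has already been estimated from the camera image by some unspecified method, and that each speaker's azimuth relative to the user is known. The function name, the angle convention, and the 20-degree tolerance are hypothetical.

```python
from typing import Mapping, Optional


def reacted_speaker(head_yaw_deg: float,
                    speaker_azimuths_deg: Mapping[str, float],
                    tolerance_deg: float = 20.0) -> Optional[str]:
    """Return the label of the speaker the user turned toward, or None.

    A reaction is counted when the head yaw is within `tolerance_deg`
    of a speaker's azimuth. If several speakers qualify, the closest
    one is returned. Angles wrap around at +/-180 degrees.
    """
    best_label, best_error = None, tolerance_deg
    for label, azimuth in speaker_azimuths_deg.items():
        # Smallest absolute angular difference, in the range [0, 180].
        error = abs((head_yaw_deg - azimuth + 180.0) % 360.0 - 180.0)
        if error <= best_error:
            best_label, best_error = label, error
    return best_label
```

Returning the speaker label rather than a plain boolean keeps both pieces of information named in the text: whether there was a reaction and in which direction.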

The determination unit 22 determines whether the user is awake based on the arousal level acquired by the acquisition unit 21. When the determination unit 22 determines that the user is not awake, it transmits a voice label selection request to the reception unit 231 of the selection unit 23. The determination unit 22 makes this determination by comparing the arousal level with a predetermined threshold. The threshold is an arousal level threshold for determining whether the user is awake and is stored in, for example, the storage 5. The determination unit 22 also determines whether the user reacted, based on the information on the user's reaction acquired by the acquisition unit 21.

When it is determined that the user is not awake, the selection unit 23 selects the voice labels of candidate voices for prompting the user to wake up. The selection unit 23 includes a reception unit 231, a model selection unit 232, a voice label candidate extraction unit 233, a voice label selection unit 234, and a transmission unit 235.

The reception unit 231 receives the voice label selection request from the determination unit 22.

The model selection unit 232 selects, from the model DB 53, the model to be used for voice label selection. The model selection unit 232 selects either the initial model or the learning model based on a degree of fit. The degree of fit is a value for determining which of the initial model and the learning model has the higher accuracy. The degree of fit is described in detail later.

The voice label candidate extraction unit 233 extracts, from the familiarity DB 51, voice labels that are candidates for the calling voice to be presented to the user, based on the model selected by the model selection unit 232 and the user's concentration level.

The voice label selection unit 234 selects, from the voice labels extracted by the voice label candidate extraction unit 233, the voice labels used to generate the calling voices to be presented to the user.

The transmission unit 235 transmits information on the voice labels selected by the voice label selection unit 234 to the generation unit 24.

The generation unit 24 generates a calling voice for prompting the user to wake up, based on the voice label received from the transmission unit 235. The generation unit 24 acquires the voice synthesis parameters corresponding to the received voice label from the voice synthesis parameter DB 54. The generation unit 24 then generates the calling voice based on the calling text data recorded in the calling text DB 55 and the voice synthesis parameters.
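
The generation flow above (look up the parameters for a voice label, fill in a calling text template, synthesize) can be sketched as follows. The `{name}` placeholder syntax, the dictionary layout of the parameter store, and the `synthesize` callable are all hypothetical; the source does not describe the template format or the synthesis engine, so the engine is passed in as a function rather than implemented.

```python
from typing import Any, Callable, Mapping


def generate_calling_voice(voice_label: str,
                           user_name: str,
                           template: str,
                           synthesis_params: Mapping[str, Any],
                           synthesize: Callable[[str, Any], bytes]) -> bytes:
    """Build the calling text and synthesize it in the labeled voice.

    `synthesis_params` stands in for the voice synthesis parameter DB,
    keyed by voice label. `template` stands in for one entry of the
    calling text DB and is assumed to contain a {name} placeholder.
    """
    params = synthesis_params[voice_label]
    text = template.format(name=user_name)
    return synthesize(text, params)
```

Passing the synthesizer as an argument keeps the sketch independent of any particular text-to-speech library.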

The presentation unit 25 presents the calling voice generated by the generation unit 24 to the user. For example, the presentation unit 25 plays back the calling voice generated by the generation unit 24 using the speakers 7a and 7b.

The learning unit 26 trains the models recorded in the model DB 53. The learning unit 26 performs the training using, for example, binary classification learning with the correct-answer labels.

Next, the operation of the voice generation device 1 will be described. FIGS. 7A and 7B are flowcharts showing the voice presentation processing performed by the voice generation device 1. The processing in FIGS. 7A and 7B may be performed periodically.

In step S1, the acquisition unit 21 acquires the user's arousal level. The acquisition unit 21 outputs the acquired arousal level to the determination unit 22. The acquisition unit 21 also retains the acquired arousal level until the time at which the user's reaction is acquired after the calling voice has been presented.

In step S2, the determination unit 22 determines whether the arousal level acquired by the acquisition unit 21 is less than or equal to a threshold. When the arousal level is determined to exceed the threshold, that is, when the user is awake, the process of FIGS. 7A and 7B ends. When the arousal level is determined to be less than or equal to the threshold, that is, when the user is not awake (for example, the user is drowsy), the process proceeds to step S3.

In step S3, the determination unit 22 transmits a voice label selection request to the selection unit 23. When the reception unit 231 receives the request, the model selection unit 232 refers to the user log DB 52 and obtains the response count. The response count is the total number of log entries whose "response presence" field is "present".

In step S4, the model selection unit 232 determines whether the response count is less than a threshold. This threshold is used to determine whether a usable learning model is recorded in the model DB 53, and is set to 2, for example. In that case, a response count of 0 or 1 is determined to be less than the threshold. When the response count is determined to be less than the threshold, the process proceeds to step S5. When the response count is determined to be greater than or equal to the threshold, the process proceeds to step S6.

In step S5, the model selection unit 232 selects the initial value, that is, the initial model, from the model DB 53. The model selection unit 232 then outputs the selected initial model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.

In step S6, the model selection unit 232 calculates the degree of fit. To do so, the model selection unit 232 first obtains all past logs, both with and without a response, from the user log DB 52. It then calculates the degree of fit of both the initial model and the learning model. For example, the model selection unit 232 can use accuracy as the degree of fit: for each log, the correct or incorrect output of the model for that log's concentration level is compared with the response presence recorded in the log. The degree of fit is not limited to accuracy. It may instead be precision, recall, the F-measure, or another value calculated from the models' correct or incorrect outputs and the response presence recorded in the logs. Precision is the proportion of the data predicted as correct for which the user's response was actually "present". Recall is the proportion of the logs that actually have a user response that were predicted as correct. The F-measure is the harmonic mean of recall and precision, and can be calculated as 2·Recall·Precision/(Recall+Precision).
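The four degree-of-fit measures named in step S6 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `fit_scores` and the boolean encoding (True for a model output of "correct" and for a logged response of "present") are assumptions, and returning 0.0 for an undefined ratio is a convention chosen only for this sketch.

```python
from typing import Dict, Sequence


def fit_scores(predicted: Sequence[bool], responded: Sequence[bool]) -> Dict[str, float]:
    """Compare model outputs with logged responses.

    predicted[i] is True when the model classified log i as "correct";
    responded[i] is True when log i records a response as "present".
    """
    if len(predicted) != len(responded):
        raise ValueError("predicted and responded must have the same length")

    pairs = list(zip(predicted, responded))
    tp = sum(1 for p, r in pairs if p and r)          # predicted correct, user responded
    fp = sum(1 for p, r in pairs if p and not r)      # predicted correct, no response
    fn = sum(1 for p, r in pairs if not p and r)      # predicted incorrect, user responded
    tn = sum(1 for p, r in pairs if not p and not r)  # predicted incorrect, no response

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of recall and precision, as in the text.
    denom = recall + precision
    f_measure = 2 * recall * precision / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

Calling `fit_scores` once with the initial model's outputs and once with the learning model's outputs yields the two values that are compared in the next step.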

In step S7, the model selection unit 232 compares the degree of fit of the initial model with that of the learning model and determines whether the learning model's degree of fit is higher. When the initial model's degree of fit is determined to be higher, the process proceeds to step S5, and the model selection unit 232 selects the initial value, that is, the initial model. When the learning model's degree of fit is determined to be higher, the process proceeds to step S8.

In step S8, the model selection unit 232 selects the learning model. The model selection unit 232 then outputs the selected learning model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
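The branching of steps S4 through S8 can be summarized in a short sketch. The function name `select_model` and the string return values are hypothetical. The text does not say what happens when the two degrees of fit are equal; falling back to the initial model in that case is an assumption made here.

```python
def select_model(
    response_count: int,
    fit_initial: float,
    fit_learned: float,
    threshold: int = 2,
) -> str:
    """Return "initial" or "learned" following steps S4 to S8.

    threshold=2 mirrors the example value given for step S4.
    """
    # S4 -> S5: too few logged responses, so no usable learning model yet.
    if response_count < threshold:
        return "initial"
    # S7 -> S8: the learning model fits the logs better.
    if fit_learned > fit_initial:
        return "learned"
    # S7 -> S5: the initial model fits better (a tie also lands here, by assumption).
    return "initial"
```

With the default threshold, `select_model(1, 0.2, 0.9)` keeps the initial model even though the learning model's score is higher, because the response count check in step S4 comes first.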

In step S9, the voice label candidate extraction unit 233 acquires the user's current concentration level from the acquisition unit 21.

In step S10, the voice label candidate extraction unit 233 extracts, from the familiarity DB 51, candidate voice labels to be used for generating the calling voice. The number of candidates extracted is at least a specified number, for example, the number of calling voices to be presented. For example, from among the voice labels registered in the familiarity DB 51, the voice label candidate extraction unit 233 extracts all voice labels that are labeled as correct for the current concentration level. A voice label labeled as correct is one for which presenting the calling voice is expected both to elicit a reaction from the user and to raise the user's arousal level.

In step S11, the voice label selection unit 234 selects a specified number of voice labels, for example, the same number as the number of calling voices to be presented, from among the voice labels extracted by the voice label candidate extraction unit 233. For example, the voice label selection unit 234 obtains a weighted winning probability for each voice label based on its past presentation count, and then selects voice labels by random sampling based on the weighted winning probabilities. The weighted winning probability can be calculated, for example, according to equation (1). It may also be calculated using an equation different from equation (1).
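The selection in step S11 can be sketched as weighted sampling without replacement. Equation (1) is not reproduced in this text, so the weight used below, 1/(1 + presentation count), is purely a stand-in assumption that merely favors rarely presented labels; it should not be read as the patent's formula. The function names and the dictionary layout are likewise hypothetical.

```python
import random
from typing import Dict, List, Optional


def winning_probabilities(presentation_counts: Dict[str, int]) -> Dict[str, float]:
    """Hypothetical stand-in for equation (1): fewer past presentations, higher probability."""
    weights = {label: 1.0 / (1 + count) for label, count in presentation_counts.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


def sample_labels(
    presentation_counts: Dict[str, int],
    k: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw k distinct voice labels, re-normalizing the probabilities after each draw."""
    if k > len(presentation_counts):
        raise ValueError("cannot select more labels than there are candidates")
    rng = rng or random.Random()
    remaining = dict(presentation_counts)
    chosen: List[str] = []
    for _ in range(k):
        probs = winning_probabilities(remaining)
        labels = list(probs)
        pick = rng.choices(labels, weights=[probs[label] for label in labels], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # a label is presented from at most one speaker
    return chosen
```

For two speakers, `sample_labels(counts, 2)` returns two different labels, one per simultaneously presented calling voice.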

In step S12, the transmission unit 235 transmits information indicating the voice labels selected by the voice label selection unit 234 to the generation unit 24. The generation unit 24 acquires the voice synthesis parameters corresponding to each received voice label from the voice synthesis parameter DB 54. It then generates a calling voice from the data of a calling sentence randomly selected from the calling sentence DB 55 and the voice synthesis parameters. The calling voice can be generated by voice synthesis processing using the voice synthesis parameters. After that, the process proceeds to step S13.

In step S13, the presentation unit 25 presents the calling voices generated by the generation unit 24 to the user simultaneously from the speakers 7a and 7b.

In step S14, the acquisition unit 21 acquires the user's reaction and outputs information on the reaction to the determination unit 22.

In step S15, the determination unit 22 determines whether the user reacted. When it is determined that the user did not react, the process proceeds to step S20. When it is determined that the user reacted, the process proceeds to step S16.

In step S16, the determination unit 22 requests the acquisition unit 21 to acquire a new arousal level. In response, the acquisition unit 21 acquires the new arousal level. The new arousal level may be acquired in the same manner as the arousal level.

In step S17, the acquisition unit 21 sets the correct labels. The acquisition unit 21 sets the correct labels, for example, as follows.
1) When the acquired reaction is that the user turned toward a specific speaker
Voice label corresponding to the voice presented from that speaker: 〇
All other voice labels: ×
2) When the acquired reaction is that the user turned toward a direction between speakers or the like
The angle between the direction the user faced and the direction of each speaker is obtained. Voice label of the voice presented from the speaker with the smaller angle: 〇
All other voice labels: ×
3) When the acquired reaction is that the user turned toward one speaker and then toward another speaker
Voice label of the voice presented from the speaker the user faced first: 〇
All other voice labels: ×
4) When no reaction could be acquired
All voice labels: ×
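The four labeling rules of step S17 can be sketched as a single function. This is a sketch under several assumptions: directions are represented as horizontal angles in degrees, the user's reaction is given as the sequence of directions faced in time order (empty or None when no reaction was acquired), and the names `angular_gap` and `set_correct_labels` are hypothetical. When two speakers are exactly equally close, the text gives no rule; this sketch simply keeps the first one in the dictionary.

```python
from typing import Dict, Optional, Sequence


def angular_gap(a_deg: float, b_deg: float) -> float:
    """Smallest absolute angle, in degrees, between two horizontal directions."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def set_correct_labels(
    speaker_directions: Dict[str, float],
    facing_directions: Optional[Sequence[float]],
) -> Dict[str, bool]:
    """Map each voice label to True (〇) or False (×).

    speaker_directions: voice label -> direction of the speaker that presented it.
    facing_directions: directions the user faced, in time order.
    """
    # Rule 4: no reaction acquired, so every label is ×.
    if not facing_directions:
        return {label: False for label in speaker_directions}

    # Rule 3: only the direction the user faced first counts.
    first = facing_directions[0]

    # Rules 1 and 2: the speaker with the smallest angle to that direction gets 〇.
    nearest = min(
        speaker_directions,
        key=lambda label: angular_gap(speaker_directions[label], first),
    )
    return {label: label == nearest for label in speaker_directions}
```

Rules 1 and 2 collapse into the same nearest-angle search, since facing a speaker directly is the case where the angle is zero.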

In step S18, the acquisition unit 21 registers the concentration level, the response presence information, the arousal level, the new arousal level, the amount of change in arousal level, and the correct label in the user log DB 52, in association with the log date and time, the voice label, the familiarity target, and the familiarity level. After that, the process proceeds to step S19.

In step S19, the learning unit 26 refers to the user log DB 52 and obtains the response count. The learning unit 26 then determines whether the response count is less than a threshold. This threshold is used to determine whether the information necessary for learning has been accumulated, and is set to 2, for example. In that case, a response count of 0 or 1 is determined to be less than the threshold. When the response count is determined to be less than the threshold, the process of FIGS. 7A and 7B ends. When the response count is determined to be greater than or equal to the threshold, the process proceeds to step S20.

In step S20, the learning unit 26 performs binary classification learning. For example, the learning unit 26 acquires the correct labels recorded in the user log DB 52 together with the familiarity level and the concentration level associated with each correct label. The learning unit 26 then generates a binary classification model of voice labels in the three-dimensional space of "familiarity level", "concentration level", and "amount of change in arousal level". FIG. 8 illustrates a binary classification model that uses these three quantities. In the example of FIG. 8, a voice label whose familiarity level lies in the space above the classification plane P is classified as correct (〇), whereas a voice label whose familiarity level lies in the space below the classification plane P is classified as incorrect (×). Various kinds of binary classification learning, such as logistic regression, SVM (Support Vector Machine), and neural networks, can be used to generate the model. The learning unit 26 records the result of the binary classification learning in the model DB 53. After that, the process of FIGS. 7A and 7B ends.
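As an illustration of step S20, the following sketch fits a classification plane in the three-dimensional feature space using logistic regression, one of the techniques the text lists. It is a plain-Python gradient-descent version written for readability; the function names, the learning rate, the number of epochs, and the feature values scaled to the range 0 to 1 are all assumptions for this example, not details taken from the embodiment.

```python
import math
from typing import List, Sequence, Tuple

Feature = Tuple[float, float, float]  # (familiarity, concentration, arousal change)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(
    features: Sequence[Feature],
    labels: Sequence[int],  # 1 for correct (〇), 0 for incorrect (×)
    learning_rate: float = 0.5,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit weights w and bias b by batch gradient descent on the logistic loss."""
    n = len(features)
    w = [0.0, 0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0, 0.0, 0.0]
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        w = [wi - learning_rate * g / n for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / n
    return w, b


def is_correct(w: Sequence[float], b: float, x: Feature) -> bool:
    """True when x lies on the 'correct' side of the plane w·x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b >= 0.0


# Hypothetical log entries: (familiarity, concentration, arousal change), correct label.
LOG_FEATURES: List[Feature] = [
    (0.9, 0.8, 0.7), (0.8, 0.6, 0.9), (0.7, 0.9, 0.6),
    (0.2, 0.3, 0.1), (0.1, 0.2, 0.0), (0.3, 0.1, 0.2),
]
LOG_LABELS = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(LOG_FEATURES, LOG_LABELS)
```

The learned plane `weights · x + bias = 0` plays the role of the classification plane P in FIG. 8. In practice an existing library implementation of logistic regression or an SVM would likely be preferred over this hand-written loop.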

The reason why the three axes of "familiarity level", "concentration level", and "amount of change in arousal level" are adopted in the binary classification model of the embodiment is as follows. People pay selective attention to familiar sounds, such as the conversation of a person they are interested in or their own name. This is called the cocktail party effect. In addition, Yumiko Honjo, "Physiological psychological research on attention and arousal", Kwansei Gakuin University doctoral dissertation, Otsu No. 217, pp. 187-188, derives a model of attention and arousal that incorporates both selective attention and arousal. This suggests that the occurrence of selective attention is related to the arousal level. The "familiarity level" is therefore considered to influence both how readily the cocktail party effect occurs and how the arousal level changes as a result, and is adopted as one axis of learning.

Regarding the "concentration level", "The brain directs attention and increases concentration through 'efficient selection'", RIKEN News Release, December 8, 2011, [Online] [retrieved June 10, 2020], Internet URL: https://www.riken.jp/press/2011/20111208/, reports that, in a state of concentration, the information transmitted from sensation to perception is limited. It is therefore presumed that the sounds perceived when concentration is high are those that the user needs more or that the user hears more easily. The "concentration level" can thus be considered to influence how readily the user's selective attention is triggered, that is, which sounds the user tends to respond to, and is adopted as one axis of learning.

The amount of change in arousal level characterizes the user's reaction beyond what the correct label conveys, that is, beyond whether or not the user responds. The "amount of change in arousal level" is therefore expected to further improve the accuracy of determining correct labels, and is adopted as one axis of learning.

As described above, according to the embodiment, when the user is determined not to be awake, the user is called using a voice that is familiar to the user. Therefore, even when the user is drowsy, for example, the cocktail party effect makes it possible for the user to hear the calling voice, and the arousal level is expected to improve in a short time. In the embodiment, the familiarity level and the concentration level are used to select the familiar voice. This makes it possible to present a calling voice that the user is more likely to respond to.

According to the embodiment, voice labels are classified using a learning model having the three axes of familiarity level, concentration level, and amount of change in arousal level. As learning progresses, voice label candidates that are better suited to the user are therefore expected to be extracted. According to the embodiment, the voice label used to generate the voice is selected from the extracted candidates by random sampling based on past presentation counts. This suppresses the habituation and boredom that would result from frequently presenting calling voices with the same voice label. As a result, even when the voice generation device 1 is used over a long period, the user can be expected to keep reacting to the calling voice, and the user's arousal level is expected to rise.

Furthermore, according to the embodiment, calling voices are presented simultaneously from a plurality of speakers placed in the environment, and the user's reaction to each calling voice is acquired. Correct labels are then set according to the user's reaction. Training data can thereby be obtained efficiently.

[Modifications]
Modifications of the embodiment will now be described. The embodiment shows an example in which the selection of a voice label based on the familiarity level, the concentration level, and the amount of change in arousal level, the generation of the calling voice, and the training of the learning model are all performed in the voice generation device 1. However, the selection of the voice label, the generation of the calling voice, and the training of the learning model may be performed in separate devices.

In the embodiment, the three axes of "familiarity level", "concentration level", and "amount of change in arousal level" are adopted in the binary classification model. Alternatively, a simpler binary classification model may be used, for example, one using only the "familiarity level", or only the "familiarity level" and the "concentration level".

In the embodiment, the learning device is used to train a classification model of voice labels for a calling voice that prompts the user to wake up. However, the learning device of the embodiment can also be used to train various models for selecting voices that the user perceives easily.

Each process according to the embodiment described above can be stored as a program that can be executed by a processor, which is a computer. The program can also be stored in and distributed on a storage medium of an external storage device, such as a magnetic disk, an optical disk, or a semiconductor memory. The processor reads the program stored in the storage medium of the external storage device, and its operation is controlled by the read program, whereby it can execute the processes described above.

The present invention is not limited to the above embodiment, and can be modified in various ways at the implementation stage without departing from its gist. The embodiments may also be combined as appropriate, in which case the combined effects are obtained. Furthermore, the above embodiment includes various inventions, and various inventions can be extracted from combinations selected from the plurality of disclosed constituent features. For example, if the problem can still be solved and the effect can still be obtained after some constituent features are deleted from all of the constituent features shown in the embodiment, the configuration from which those constituent features are deleted can be extracted as an invention.

1...Voice generation device
2...Processor
3...ROM
4...RAM
5...Storage
6...Microphone
7a, 7b...Speaker
8...Camera
9...Input device
10...Display
11...Communication module
21...Acquisition unit
22...Determination unit
23...Selection unit
24...Generation unit
25...Presentation unit
26...Learning unit
51...Familiarity database (DB)
52...User log database (DB)
53...Model database (DB)
54...Voice synthesis parameter database (DB)
55...Calling sentence database (DB)
231...Reception unit
232...Model selection unit
233...Voice label candidate extraction unit
234...Voice label selection unit
235...Transmission unit

Claims

A learning device comprising a learning unit that acquires training data for a learning model for selecting a voice to be presented to a user from among a plurality of voice candidates, based on the user's reactions to a plurality of voices simultaneously presented to the user,
wherein the plurality of voices are voices presented from respective ones of a plurality of speakers that are arranged equidistant from the user in different directions and that emit voices toward the user from the different directions.

The learning device according to claim 1, wherein the learning model is a classification model that, in a three-dimensional space consisting of a familiarity level representing the degree to which the user is familiar with each of the plurality of voice candidates, a concentration level representing the user's current degree of concentration, and an amount of change, due to the presentation of the voice, in an arousal level representing the user's degree of arousal from sleepiness to wakefulness, classifies the voice candidates into first voice candidates, for which presentation of the voice is expected to cause a reaction of the user and an increase in the user's arousal level, and second voice candidates, for which presentation of the voice is not expected to cause a reaction of the user or an increase in the user's arousal level.

A learning method comprising acquiring, by a learning unit, training data for a learning model for selecting a voice to be presented to a user from among a plurality of voice candidates, based on the user's reactions to a plurality of voices simultaneously presented to the user,
wherein the plurality of voices are voices presented from respective ones of a plurality of speakers that are arranged equidistant from the user in different directions and that emit voices toward the user from the different directions.

A learning program for causing a processor to function as the learning unit of the learning device according to claim 1 or 2.