WO2021260848A1 - Learning device, learning method, and learning program - Google Patents

Learning device, learning method, and learning program

Info

Publication number
WO2021260848A1
WO2021260848A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
voice
learning
degree
model
Prior art date
Application number
PCT/JP2020/024823
Other languages
French (fr)
Japanese (ja)
Inventor
妙 佐藤
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2020/024823 priority Critical patent/WO2021260848A1/en
Priority to JP2022531321A priority patent/JP7416245B2/en
Publication of WO2021260848A1 publication Critical patent/WO2021260848A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers

Definitions

  • This embodiment relates to a learning device, a learning method, and a learning program for voice selection.
  • Various methods have been proposed for selecting the voice to be presented to a user from among a plurality of voice candidates; speech classification models may be used for such selection.
  • In some classification models of this type, learning is performed by providing information on the correctness of the classified speech as teacher data.
  • Appropriate evaluation of the speech is required to generate the teacher data.
  • As a proposal for voice evaluation, for example, the method described in Non-Patent Document 1 is known.
  • The embodiment provides a learning device, a learning method, and a learning program that can efficiently collect teacher data for speech classification.
  • The learning device according to the embodiment includes a learning unit that acquires teacher data for a learning model, which selects the voice to be presented to the user from among a plurality of voice candidates, based on the user's reaction to a plurality of voices presented to the user simultaneously.
  • According to the embodiment, a learning device, a learning method, and a learning program capable of efficiently collecting teacher data for speech classification are provided.
  • FIG. 1 is a diagram showing a hardware configuration of an example of a voice generator according to an embodiment.
  • FIG. 2A is a diagram showing an example of speaker arrangement.
  • FIG. 2B is a diagram showing an example of speaker arrangement.
  • FIG. 2C is a diagram showing an example of speaker arrangement.
  • FIG. 2D is a diagram showing an example of speaker arrangement.
  • FIG. 3 is a diagram showing the configuration of an example of the familiarity DB.
  • FIG. 4 is a diagram showing an example configuration of a user log DB.
  • FIG. 5 is a diagram showing the structure of an example of the call statement DB.
  • FIG. 6 is a functional block diagram of the voice generator.
  • FIG. 7A is a flowchart showing a voice presentation process by the voice generator.
  • FIG. 7B is a flowchart showing a voice presentation process by the voice generator.
  • FIG. 8 is a diagram showing an image of a binary classification model using "familiarity," "concentration level," and "arousal level change amount."
  • FIG. 1 is a diagram showing a hardware configuration of an example of a voice generation device including a learning device according to an embodiment.
  • The voice generation device 1 according to the embodiment emits a call voice urging the user's awakening when the user is not in an awake state, for example because of drowsiness.
  • The arousal level in the embodiment is an index indicating the degree of arousal, and corresponds to the physiological arousal level.
  • The physiological arousal level corresponds to the activity level of the cerebrum and represents the degree of arousal from sleep to excitement.
  • The physiological arousal level is assessed from eye movement, blinking activity, electrodermal activity, reaction time to stimuli, and the like.
  • The arousal level in the embodiment is calculated from any one of eye movement, blinking activity, electrodermal activity, and reaction time to stimuli, or from a combination thereof.
  • The arousal level is, for example, a value that increases as the user moves from a sleep state toward an excited state.
  • The arousal level may be a continuous numerical value or a discrete value such as Level 1, Level 2, and so on. When the arousal level is calculated from a combination of eye movement, blinking activity, electrodermal activity, and reaction time to stimuli, the manner of combination is not particularly limited; for example, a simple sum or a weighted sum of these values can be used.
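For an illustration of the combination methods just mentioned, the following is a minimal sketch that combines the four measurements by a weighted sum and derives a discrete level by binning. It assumes each signal is pre-normalized to [0, 1]; the function names, weights, and bin count are hypothetical and not taken from the patent.

```python
# Hypothetical sketch: combine normalized arousal-related signals into one
# arousal level by a weighted sum, one of the combination methods the text
# allows. Weights and normalization are illustrative assumptions.

def arousal_level(eye_movement: float,
                  blink_activity: float,
                  electrodermal: float,
                  reaction_time_s: float,
                  weights=(0.3, 0.2, 0.3, 0.2)) -> float:
    """Each input is assumed pre-normalized to [0, 1], larger = more aroused;
    reaction time is inverted so that a faster reaction raises the score."""
    features = (eye_movement, blink_activity, electrodermal,
                1.0 - min(reaction_time_s, 1.0))
    return sum(w * f for w, f in zip(weights, features))

def arousal_level_discrete(score: float, n_levels: int = 5) -> int:
    """Optional binning into discrete values Level 1, Level 2, ..."""
    return min(int(score * n_levels) + 1, n_levels)
```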
  • The voice generation device 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, speakers 7a and 7b, a camera 8, an input device 9, a display 10, and a communication module 11.
  • The voice generation device 1 is, for example, one of various terminals such as a personal computer (PC), a smartphone, or a tablet terminal. It is not limited to these, and can be mounted on various devices used by the user.
  • The voice generation device 1 does not have to have all the components shown in FIG. 1. For example, the microphone 6, the speakers 7a and 7b, the camera 8, and the display 10 may be devices separate from the voice generation device 1.
  • The processor 2 is a control circuit, such as a CPU, that controls the overall operation of the voice generation device 1.
  • The processor 2 does not have to be a CPU and may be an ASIC, an FPGA, a GPU, or the like.
  • The processor 2 does not have to consist of a single CPU or the like and may consist of a plurality of CPUs or the like.
  • The ROM 3 is a non-volatile memory such as a flash memory.
  • For example, the boot program of the voice generation device 1 is stored in the ROM 3.
  • The RAM 4 is a volatile memory such as an SDRAM. The RAM 4 can be used as working memory for various processes in the voice generation device 1.
  • The storage 5 is a storage device such as a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • Various programs used in the voice generation device 1 are stored in the storage 5.
  • The storage 5 may store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a voice synthesis parameter database (DB) 54, and a call statement database (DB) 55. These databases will be described in detail later.
  • The microphone 6 is a device that converts input voice into a voice signal, which is an electric signal.
  • The voice signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5.
  • For example, the voice synthesis parameters for synthesizing the call voice can be acquired from voice input via the microphone 6.
  • Speakers 7a and 7b are devices that output voice based on the input voice signal.
  • Here, it is desirable that the speaker 7a and the speaker 7b are not in close proximity to each other.
  • It is also desirable that the speakers 7a and 7b are arranged in different directions as seen from the user.
  • Furthermore, it is desirable that the speaker 7a and the speaker 7b are equidistant from the user.
  • FIGS. 2A and 2B are diagrams showing arrangement examples of the speakers 7a and 7b.
  • In FIG. 2A, the speakers 7a and 7b are both arranged in front of the user U, equidistant from the user.
  • In FIG. 2B, the speakers 7a and 7b are arranged in front of and behind the user U, respectively, equidistant from the user.
  • Speakers are arranged in the user's environment in the same number as the number of presented voices; FIG. 1 shows an example in which the number of presented voices is two.
  • The number of presented voices may instead be three or more.
  • In that case, three or more speakers are arranged. Even then, it is desirable that the speakers are not close to each other, that they are placed in different directions as seen from the user, and that each speaker is equidistant from the user.
  • For example, FIGS. 2C and 2D show arrangement examples with three speakers 7a, 7b, and 7c.
  • In FIG. 2C, the speakers 7a, 7b, and 7c are arranged in front of the user U.
  • In FIG. 2D, the speakers 7a, 7b, and 7c are arranged behind the user U.
  • The camera 8 captures images of the user.
  • The user's image obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5.
  • The user's image is used, for example, to acquire the arousal level or the user's reaction to the call voice.
  • The input device 9 is a mechanical input device such as a button, a switch, a keyboard, or a mouse, or a software-based input device using a touch sensor.
  • The input device 9 receives various inputs from the user. The input device 9 then outputs signals corresponding to the user's inputs to the processor 2.
  • The display 10 is a display such as a liquid crystal display or an organic EL display.
  • The display 10 displays various images.
  • The communication module 11 is a device with which the voice generation device 1 carries out communication.
  • The communication module 11 communicates with, for example, a server provided outside the voice generation device 1.
  • The communication method of the communication module 11 is not particularly limited.
  • The communication module 11 may carry out communication wirelessly or by wire.
  • FIG. 3 is a diagram showing the configuration of an example of the familiarity DB 51.
  • The familiarity DB 51 is a database that records the user's "familiarity."
  • The familiarity DB 51 records, for example, a user ID, a voice label, a familiar target, a familiarity, a number of reactions, a number of presentations, and an average arousal level change, in association with one another.
  • The "user ID" is an ID assigned to each user of the voice generation device 1.
  • The user ID may be associated with user attribute information such as the user's name.
  • The "voice label" is a label uniquely assigned to each call-voice candidate. Any label can be used as the voice label; for example, the name of the familiar target may be used.
  • The "familiar target" is a person with whom the user routinely converses, or another source of a voice that the user often hears.
  • The familiar target does not necessarily have to be a person.
  • "Familiarity" is the degree of the user's familiarity with the voice of the corresponding familiar target.
  • The familiarity can be calculated from, for example, the frequency of communication with the familiar target via SNS, the frequency of daily conversation with the familiar target, and the frequency with which the user hears the familiar target in daily life; the higher these frequencies, the larger the familiarity value.
  • The familiarity may also be acquired by the user's self-report.
  • The "number of reactions" is the number of times the user reacted to call voices generated based on the corresponding voice label.
  • The "number of presentations" is the number of times call voices generated based on the corresponding voice label were presented to the user.
  • The reaction probability can be calculated by dividing the number of reactions by the number of presentations.
  • The reaction probability is the probability that the user reacts to a call voice generated based on the corresponding voice label.
  • The "average arousal level change" is the average of the user's arousal level change amounts for call voices generated based on the corresponding voice label.
  • The arousal level change amount will be described later.
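For a rough illustration of the familiarity DB just described, the sketch below models one record and derives the reaction probability by dividing the number of reactions by the number of presentations. The field names are hypothetical stand-ins for the columns of FIG. 3.

```python
# Hypothetical sketch of a familiarity DB record (FIG. 3) and the derived
# reaction probability (number of reactions / number of presentations).
from dataclasses import dataclass

@dataclass
class FamiliarityRecord:
    user_id: str
    voice_label: str
    familiar_target: str
    familiarity: float
    n_reactions: int = 0
    n_presentations: int = 0
    avg_arousal_change: float = 0.0

    @property
    def reaction_probability(self) -> float:
        # Guard against division by zero before any presentation.
        if self.n_presentations == 0:
            return 0.0
        return self.n_reactions / self.n_presentations

rec = FamiliarityRecord("U001", "label_A", "target_A", 0.9,
                        n_reactions=3, n_presentations=4)
print(rec.reaction_probability)  # 0.75
```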
  • FIG. 4 is a diagram showing the configuration of an example of the user log DB 52.
  • The user log DB 52 is a database that records logs of the user's use of the voice generation device 1.
  • The user log DB 52 records, for example, a log generation date and time, a user ID, a voice label, a familiar target, a concentration level, reaction presence/absence, an arousal level, a new arousal level, an arousal level change amount, and a correct answer label, in association with one another.
  • The user ID, the voice label, and the familiar target are the same as those in the familiarity DB 51.
  • The "log generation date and time" is the date and time at which the user used the voice generation device 1.
  • The log generation date and time is recorded, for example, each time a call voice is presented to the user.
  • "Reaction presence/absence" is information on whether the user reacted after the call voice was presented to the user. When the user reacted, "yes" is recorded; when the user did not react, "no" is recorded.
  • The "concentration level" is the degree of the user's concentration at the time the call voice is presented.
  • The concentration level can be measured, for example, by estimating the posture and behavior of the user during work from the image obtained by the camera 8.
  • The concentration value is calculated so as to increase whenever the user's posture or behavior suggests concentration, and to decrease whenever it suggests a lack of concentration.
  • Alternatively, the degree of dilation of the user's pupils during work can be estimated from the image obtained by the camera 8.
  • The concentration value is then calculated to be higher when the pupils are more dilated and lower when they are more constricted.
  • The concentration level may be a discrete value such as Lv (Level) 1, Lv 2, and so on.
  • The method of acquiring the concentration level is not limited to any specific method.
  • the "awakening degree” is the awakening degree acquired before the presentation of the call voice by the voice generation device 1.
  • the "new arousal degree" is the arousal degree newly acquired after the user's reaction. New arousal is not recorded when there is no user response.
  • the "awakening degree change amount” is an amount representing the change in the arousal degree before and after the user's reaction.
  • the amount of change in alertness is obtained, for example, from the difference between the new alertness and the alertness.
  • the amount of change in arousal level may be the ratio of the new arousal level to the arousal level or the like. The amount of change in alertness is not recorded when there is no reaction from the user.
  • the "correct answer label” is a label of correct or incorrect answers for supervised learning. For example, the correct answer is recorded as ⁇ , and the incorrect answer is recorded as ⁇ .
  • the correct label will be described in detail later.
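The sketch below illustrates, under assumed field names, how one user-log entry might be assembled: the arousal level change amount is computed as the difference between the new arousal level and the level measured before presentation, and the reaction-dependent fields are left unrecorded when there is no reaction.

```python
# Hypothetical sketch of assembling one user log DB entry (FIG. 4). The
# change amount uses the difference form; a ratio could be used instead.
from datetime import datetime

def make_log_entry(user_id, voice_label, familiar_target, concentration,
                   reacted, arousal_before, arousal_after=None,
                   correct_label=None):
    # arousal_after must be supplied when reacted is True.
    change = (arousal_after - arousal_before) if reacted else None
    return {
        "timestamp": datetime.now().isoformat(),
        "user_id": user_id,
        "voice_label": voice_label,
        "familiar_target": familiar_target,
        "concentration": concentration,
        "reacted": reacted,
        "arousal": arousal_before,
        "new_arousal": arousal_after if reacted else None,
        "arousal_change": change,
        "correct_label": correct_label,
    }
```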
  • The model DB 53 is a database that records voice-label classification models for extracting voice label candidates.
  • A model is configured to classify voice labels as correct or incorrect in the two-dimensional space of familiarity and concentration level.
  • The models include an initial model and a learning model.
  • The initial model is a model generated based on initial values stored in the model DB 53, and is not updated by learning.
  • The initial values are, for example, constants (the coefficients of a plane equation) that determine the classification plane for voice labels defined in the three-dimensional space of "familiarity," "concentration level," and "arousal level change amount."
  • The classification plane generated from these initial values constitutes the initial model.
  • The learning model is a trained model generated from the initial model.
  • The learning model can be a binary classification model whose classification plane differs from that of the initial model.
  • The voice synthesis parameter DB 54 is a database in which voice synthesis parameters are recorded.
  • A voice synthesis parameter is data used to synthesize the voice of one of the user's familiar targets.
  • For example, a voice synthesis parameter may be feature data extracted from voice data collected in advance via the microphone 6.
  • Alternatively, voice synthesis parameters acquired or defined by other systems may be recorded in advance.
  • Each voice synthesis parameter is associated with a voice label.
  • FIG. 5 is a diagram showing the configuration of an example of the call statement DB 55.
  • The call statement DB 55 is a database in which template data of various call statements for encouraging the user's awakening are recorded.
  • The call statements are not particularly limited. However, it is desirable that a call statement includes a call using the user's name, in order to enhance the cocktail party effect described later.
  • The familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call statement DB 55 do not necessarily have to be stored in the storage 5.
  • These databases may be stored in a server separate from the voice generation device 1.
  • In that case, the voice generation device 1 accesses the server using the communication module 11 and acquires the necessary information.
  • FIG. 6 is a functional block diagram of the voice generation device 1.
  • The voice generation device 1 has an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26.
  • The operations of the acquisition unit 21, the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 are realized, for example, by the processor 2 executing a program stored in the storage 5.
  • The determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 may also be realized by hardware separate from the processor 2.
  • The acquisition unit 21 acquires the user's arousal level. The acquisition unit 21 also acquires the user's reaction to the call voice. As described above, the arousal level is calculated from any one of eye movement, blinking activity, electrodermal activity, and reaction time to stimuli, or from a combination thereof.
  • The eye movement, the blinking activity, and the reaction time to stimuli used to calculate the arousal level can be measured, for example, from an image of the user acquired by the camera 8.
  • The reaction time to stimuli may also be measured from the audio signal acquired by the microphone 6.
  • Electrodermal activity can be measured, for example, by a sensor worn on the user's arm.
  • The user's reaction includes the presence or absence of a physical reaction, such as the user's head or line of sight turning toward the speaker 7a or 7b, together with the direction of that reaction; it can be obtained, for example, by measurement from an image acquired by the camera 8.
  • The acquisition unit 21 may instead be configured to acquire, by communication, an arousal level or a user reaction calculated outside the voice generation device 1.
  • The determination unit 22 determines whether or not the user is awake based on the arousal level acquired by the acquisition unit 21. When the determination unit 22 determines that the user is not in an awake state, it transmits a voice label selection request to the reception unit 231 of the selection unit 23. Here, the determination unit 22 makes this determination by comparing the arousal level with a predetermined threshold value.
  • The threshold value is an arousal level threshold for determining whether or not the user is in an awake state, and is stored in, for example, the storage 5. The determination unit 22 also determines whether or not there was a user reaction, based on the reaction information acquired by the acquisition unit 21.
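A minimal sketch of this determination is shown below; the threshold value used here is an arbitrary placeholder, since the patent does not fix a specific number.

```python
# Hypothetical sketch of the determination unit's check: a voice label
# selection request is issued when the arousal level is at or below the
# threshold, i.e. when the user is not in an awake state.
AROUSAL_THRESHOLD = 0.5  # placeholder value

def needs_call(arousal: float, threshold: float = AROUSAL_THRESHOLD) -> bool:
    return arousal <= threshold
```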
  • The selection unit 23 selects the voice labels of candidate voices for encouraging the user's awakening.
  • The selection unit 23 includes a reception unit 231, a model selection unit 232, a voice label candidate extraction unit 233, a voice label selection unit 234, and a transmission unit 235.
  • The reception unit 231 receives the voice label selection request from the determination unit 22.
  • The model selection unit 232 selects, from the model DB 53, the model to be used for selecting voice labels.
  • The model selection unit 232 selects either the initial model or the learning model based on the degree of fit.
  • The degree of fit is a value for determining which of the initial model and the learning model has higher accuracy. The degree of fit will be described in detail later.
  • The voice label candidate extraction unit 233 extracts candidate voice labels for the call voice to be presented to the user from the familiarity DB 51, based on the model selected by the model selection unit 232 and the user's concentration level.
  • The voice label selection unit 234 selects, from the voice labels extracted by the voice label candidate extraction unit 233, the voice labels used to generate the call voices to be presented to the user.
  • The transmission unit 235 transmits the information of the voice labels selected by the voice label selection unit 234 to the generation unit 24.
  • The generation unit 24 generates call voices for encouraging the user's awakening based on the voice labels received from the transmission unit 235.
  • The generation unit 24 acquires the voice synthesis parameters corresponding to the voice labels received from the transmission unit 235 from the voice synthesis parameter DB 54. The generation unit 24 then generates call voices based on the call statement data recorded in the call statement DB 55 and the voice synthesis parameters.
  • The presentation unit 25 presents the call voices generated by the generation unit 24 to the user.
  • The presentation unit 25 reproduces the call voices generated by the generation unit 24 using the speakers 7a and 7b.
  • The learning unit 26 trains the model recorded in the model DB 53.
  • The learning unit 26 performs the learning by, for example, binary classification learning using the correct answer labels.
  • FIGS. 7A and 7B are flowcharts showing the voice presentation processing by the voice generation device 1. The processes of FIGS. 7A and 7B may be performed periodically.
  • In step S1, the acquisition unit 21 acquires the user's arousal level.
  • The acquisition unit 21 outputs the acquired arousal level to the determination unit 22, and holds it until the timing at which the user's reaction is acquired after the call voice is presented.
  • In step S2, the determination unit 22 determines whether or not the arousal level acquired by the acquisition unit 21 is equal to or less than the threshold value.
  • When it is determined in step S2 that the arousal level exceeds the threshold value, that is, when the user is awake, the processes of FIGS. 7A and 7B end.
  • When it is determined in step S2 that the arousal level is equal to or less than the threshold value, that is, when the user is not awake, for example because of drowsiness, the process proceeds to step S3.
  • In step S3, the determination unit 22 transmits a voice label selection request to the selection unit 23.
  • The model selection unit 232 refers to the user log DB 52 and acquires the number of logs with a reaction, that is, the total number of "yes" entries under "reaction presence/absence."
  • In step S4, the model selection unit 232 determines whether or not the number of logs with a reaction is less than a threshold value.
  • This threshold value is for determining whether or not a usable learning model is recorded in the model DB 53.
  • The threshold is set to, for example, 2; in this case, when the number of reactions is 0 or 1, it is determined to be less than the threshold.
  • When the number is less than the threshold, the process proceeds to step S5.
  • Otherwise, the process proceeds to step S6.
  • In step S5, the model selection unit 232 selects the initial values, that is, the initial model, from the model DB 53. The model selection unit 232 then outputs the selected initial model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
  • In step S6, the model selection unit 232 calculates the degree of fit.
  • The model selection unit 232 first acquires all past logs, with and without reactions, from the user log DB 52. It then calculates the degree of fit of both the initial model and the learning model.
  • As the degree of fit, the model selection unit 232 can use, for example, the accuracy obtained by comparing each model's correct/incorrect output, given the concentration value of each log, against the reaction presence/absence of that log.
  • The degree of fit is not limited to the accuracy; it may instead be, for example, the precision, the recall, or the F-measure, likewise calculated from the model's correct/incorrect output and the reaction presence/absence of each log.
  • The precision is the proportion of the data predicted to be correct for which the user actually reacted.
  • The recall is the proportion of the logs with an actual user reaction that were predicted to be correct.
  • The F-measure is the harmonic mean of the recall and the precision. For example, the F-measure can be calculated as 2 × Recall × Precision / (Recall + Precision).
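The sketch below shows one way such a degree of fit could be computed from the user log: a model's correct/incorrect output for each log's concentration value is scored against the logged reaction using accuracy, precision, recall, or the F-measure. The function shape and data layout are assumptions for illustration.

```python
# Hypothetical sketch of the degree-of-fit computation. `model` is any
# callable that returns True ("correct") for a given concentration value;
# `logs` is an iterable of (concentration, reacted) pairs from the user log.

def degree_of_fit(model, logs, metric="accuracy"):
    tp = fp = fn = tn = 0
    for concentration, reacted in logs:
        predicted = model(concentration)
        if predicted and reacted:
            tp += 1
        elif predicted and not reacted:
            fp += 1
        elif not predicted and reacted:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if metric == "accuracy":
        return (tp + tn) / max(tp + fp + fn + tn, 1)
    if metric == "precision":
        return precision
    if metric == "recall":
        return recall
    # F-measure: 2 * Recall * Precision / (Recall + Precision)
    denom = recall + precision
    return 2 * recall * precision / denom if denom else 0.0
```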
  • In step S7, the model selection unit 232 compares the degrees of fit of the initial model and the learning model, and determines whether or not the degree of fit of the learning model is higher.
  • When it is determined in step S7 that the degree of fit of the initial model is higher, the process proceeds to step S5, where the model selection unit 232 selects the initial values, that is, the initial model.
  • When it is determined in step S7 that the degree of fit of the learning model is higher, the process proceeds to step S8.
  • In step S8, the model selection unit 232 selects the learning model. The model selection unit 232 then outputs the selected learning model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
  • In step S9, the voice label candidate extraction unit 233 acquires the user's current concentration level from the acquisition unit 21.
  • The voice label candidate extraction unit 233 then extracts candidate voice labels for generating the call voice from the familiarity DB 51.
  • The number of candidate voice labels extracted is equal to or greater than a specified number, for example the number of call voices to be presented.
  • For example, the voice label candidate extraction unit 233 extracts, from the voice labels registered in the familiarity DB 51, all voice labels to which the correct answer label is assigned for the current concentration value.
  • A voice label with the correct answer label is a voice label for which the user is expected both to react to the presented call voice and to show an increase in arousal level.
  • The voice label selection unit 234 selects the specified number of voice labels, for example the same number as the number of presented call voices, from the voice labels extracted by the voice label candidate extraction unit 233.
  • When selecting voice labels, the voice label selection unit 234 obtains, for example, a weighted selection probability based on the number of past presentations. The voice label selection unit 234 then selects voice labels by random sampling based on the weighted selection probability.
  • The weighted selection probability can be calculated, for example, according to equation (1).
  • The weighted selection probability may instead be calculated by an equation different from equation (1).
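Equation (1) is not reproduced in this text, so the sketch below substitutes an assumed weighting that is inversely related to the past presentation count, which favors rarely presented labels, and samples the specified number of labels without replacement.

```python
# Hypothetical sketch of weighted random sampling of voice labels. The
# inverse-presentation-count weighting stands in for equation (1), which
# is not given here; any other weighting could be dropped in.
import random

def select_labels(candidates, n):
    """candidates: list of (voice_label, n_presentations) tuples."""
    pool = list(candidates)
    chosen = []
    for _ in range(min(n, len(pool))):
        weights = [1.0 / (1 + n_pres) for _, n_pres in pool]
        idx = random.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx)[0])  # sample without replacement
    return chosen

print(select_labels([("label_A", 10), ("label_B", 2), ("label_C", 0)], 2))
```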
  • In step S12, the transmission unit 235 transmits information indicating the voice labels selected by the voice label selection unit 234 to the generation unit 24.
  • The generation unit 24 acquires the voice synthesis parameters corresponding to the received voice labels from the voice synthesis parameter DB 54. The generation unit 24 then generates call voices based on the voice synthesis parameters and call statement data randomly selected from the call statement DB 55.
  • The call voices can be generated by a speech synthesis process using the voice synthesis parameters. After that, the process proceeds to step S13.
  • In step S13, the presentation unit 25 presents the call voices generated by the generation unit 24 to the user simultaneously from the speakers 7a and 7b.
  • In step S14, the acquisition unit 21 acquires the user's reaction. The acquisition unit 21 then outputs the reaction information to the determination unit 22.
  • In step S15, the determination unit 22 determines whether or not there was a reaction from the user. When it is determined in step S15 that there was no reaction, the process proceeds to step S20. When it is determined in step S15 that there was a reaction, the process proceeds to step S16.
  • In step S16, the determination unit 22 requests the acquisition unit 21 to acquire the new arousal level.
  • In response, the acquisition unit 21 acquires the new arousal level.
  • The new arousal level may be acquired in the same manner as the arousal level.
  • In step S17, the acquisition unit 21 sets the correct answer labels.
  • The acquisition unit 21 sets the correct answer labels, for example, as follows:
    1) When the acquired reaction is that the user turned toward a specific speaker: the voice label of the voice presented from that speaker is 〇; the other voice labels are ×.
    2) When the acquired reaction is that the user faced between multiple speakers: the angle between the user's facing direction and the direction of each speaker is obtained, and the voice label of the voice presented from the speaker with the smaller angle is 〇; the other voice labels are ×.
    3) When the acquired reaction is that the user turned toward one speaker and then toward another: the voice label of the voice presented from the speaker faced first is 〇; the other voice labels are ×.
    4) When no reaction is obtained: the labels of all voices are ×.
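The sketch below illustrates rules 1, 2, and 4 under assumed names, reducing the acquired reaction to the angle between the user's facing direction and each speaker; rule 3 would additionally require the time order in which the directions were observed.

```python
# Hypothetical sketch of correct-answer labeling from the reaction
# direction. Rule 1 is the near-zero-angle case of rule 2; rule 4 is the
# no-reaction case. 'o' stands for the correct mark and 'x' for incorrect.

def set_correct_labels(speaker_angles, reacted):
    """speaker_angles: {voice_label: angle in degrees between the user's
    facing direction and the speaker that presented that voice}."""
    if not reacted:  # rule 4: no reaction, all labels incorrect
        return {label: "x" for label in speaker_angles}
    # Rules 1 and 2: the voice from the most directly faced speaker
    # (smallest angle) is correct; all others are incorrect.
    best = min(speaker_angles, key=speaker_angles.get)
    return {label: ("o" if label == best else "x")
            for label in speaker_angles}

print(set_correct_labels({"label_A": 12.0, "label_B": 95.0}, reacted=True))
# {'label_A': 'o', 'label_B': 'x'}
```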
  • In step S18, the acquisition unit 21 registers the concentration level, the reaction presence/absence information, the arousal level, the new arousal level, the arousal level change amount, and the correct answer label in the user log DB 52, in association with the log generation date and time, the voice label, the familiar target, and the familiarity. After that, the process proceeds to step S19.
  • In step S19, the learning unit 26 refers to the user log DB 52 and acquires the number of logs with a reaction. The learning unit 26 then determines whether or not this number is less than a threshold value.
  • This threshold value is for determining whether or not the information necessary for learning has been accumulated.
  • The threshold is set to, for example, 2; in this case, when the number of reactions is 0 or 1, it is determined to be less than the threshold.
  • When the number is less than the threshold, the processes of FIGS. 7A and 7B end.
  • Otherwise, the process proceeds to step S20.
  • In step S20, the learning unit 26 carries out binary classification learning. The learning unit 26 then records the result of the binary classification learning in the model DB 53. After that, the processes of FIGS. 7A and 7B end.
  • In step S20, the learning unit 26 acquires, for example, the correct answer labels recorded in the user log DB 52 together with the familiarity and concentration values associated with them. The learning unit 26 then generates a binary classification model of the voice labels in the three-dimensional space of "familiarity," "concentration level," and "arousal level change amount."
  • FIG. 8 is a diagram showing an image of a binary classification model using "familiarity," "concentration level," and "arousal level change amount."
  • Voice labels whose familiarity places them in the space above the classification plane P are classified as correct answers (〇).
  • Voice labels whose familiarity places them in the space below the classification plane P are classified as incorrect answers (×).
  • Various binary classification learning methods, using for example logistic regression, an SVM (Support Vector Machine), or a neural network, can be used to generate the model.
  • The arousal level change amount characterizes the user's reaction beyond the mere presence or absence of a reaction reflected in the correct answer label. "Arousal level change amount" is therefore adopted as one axis of learning, since it is expected to further improve the accuracy of the correct-answer-label determination.
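As a concrete illustration of such binary classification learning, the sketch below fits a scikit-learn logistic regression over the three axes named above. The training data are fabricated placeholders, and the patent specifies neither a library nor hyperparameters.

```python
# Hypothetical sketch of the binary classification learning in step S20
# using logistic regression over familiarity, concentration level, and
# arousal level change amount. Data values are fabricated placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: familiarity, concentration level, arousal level change amount
X = np.array([[0.9, 2.0, 0.30],
              [0.8, 1.0, 0.20],
              [0.3, 3.0, 0.00],
              [0.2, 2.0, -0.05]])
y = np.array([1, 1, 0, 0])  # 1 = correct (o), 0 = incorrect (x)

model = LogisticRegression().fit(X, y)
# The decision boundary w.x + b = 0 plays the role of the classification
# plane P shown in FIG. 8.
print(model.coef_, model.intercept_)
print(model.predict([[0.7, 2.0, 0.25]]))  # expected: [1]
```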
  • According to the embodiment, when it is determined that the user is not awake, a call is made to the user using a voice familiar to the user. Therefore, even when the user is drowsy, the cocktail party effect can make the call voice easier for the user to hear, and the arousal level is expected to improve in a short time. Further, in the embodiment, the familiarity and the concentration level are used in selecting the familiar voice, so the user can be presented with a call voice to which the user is more likely to respond.
  • According to the embodiment, voice labels are classified using a learning model with the three axes of familiarity, concentration level, and arousal level change amount. Therefore, as learning progresses, voice label candidates more suitable for the user are expected to be extracted. Further, the voice label used to generate the voice is selected from the extracted candidates by random sampling weighted by the number of past presentations. This suppresses the habituation and boredom that would result from frequently presenting call voices with the same voice label. As a result, even when the voice generation device 1 is used for a long period, the user's reaction to the call voice can still be expected, and consequently an increase in the user's arousal level can be expected.
  • In the embodiment, call voices are presented simultaneously from a plurality of speakers arranged in the environment, and the user's reaction to each call voice is acquired. Correct answer labels are then set according to this reaction. As a result, teacher data can be collected efficiently.
  • In the embodiment, the binary classification model employs the three axes of "familiarity," "concentration level," and "arousal level change amount."
  • More simply, a binary classification model using, for example, "familiarity" alone, or "familiarity" and "concentration level," may be used.
  • In the embodiment, the learning device is used as a learning device for a voice-label classification model for call voices that encourage the user's awakening.
  • The learning device of the embodiment can, however, be used to learn various models for selecting a voice that is easy for the user to recognize.
  • Each process according to the above-described embodiment can be stored as a program executable by a processor, that is, by a computer.
  • The program can be stored and distributed in a storage medium of an external storage device, such as a magnetic disk, an optical disk, or a semiconductor memory.
  • The processor reads the program stored in the storage medium of the external storage device, and its operation is controlled by the read program, whereby the above-described processes are executed.
  • The present invention is not limited to the above embodiment and can be variously modified at the implementation stage without departing from its gist.
  • The embodiments may also be carried out in appropriate combinations, in which case combined effects can be obtained.
  • The above-described embodiment includes various inventions, and various inventions can be extracted by combinations selected from the plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, as long as the problem can be solved and the effect obtained, the configuration from which those constituent elements are deleted can be extracted as an invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Traffic Control Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

This learning device comprises a learning unit (26) that acquires training data for a learning model for selecting a voice to be presented to a user from among a plurality of voice candidates on the basis of the user's reaction to a plurality of voices simultaneously presented to the user.

Description

Learning device, learning method, and learning program
 This embodiment relates to a learning device, a learning method, and a learning program for voice selection.
 Various methods have been proposed for selecting the voice to be presented to a user from among a plurality of voice candidates. Speech classification models may be used for such selection. In some classification models of this type, learning is performed by providing information on the correctness of the classified speech as teacher data. Appropriate evaluation of the speech is required to generate the teacher data. As a proposal for voice evaluation, for example, the method described in Non-Patent Document 1 is known.
 The embodiment provides a learning device, a learning method, and a learning program that can efficiently collect teacher data for speech classification.
 The learning device according to the embodiment includes a learning unit that acquires teacher data for a learning model, which selects the voice to be presented to the user from among a plurality of voice candidates, based on the user's reaction to a plurality of voices presented to the user simultaneously.
 According to the embodiment, a learning device, a learning method, and a learning program capable of efficiently collecting teacher data for speech classification are provided.
FIG. 1 is a diagram showing the hardware configuration of an example of a voice generation device according to the embodiment. FIG. 2A is a diagram showing an example of speaker arrangement. FIG. 2B is a diagram showing an example of speaker arrangement. FIG. 2C is a diagram showing an example of speaker arrangement. FIG. 2D is a diagram showing an example of speaker arrangement. FIG. 3 is a diagram showing the configuration of an example of the familiarity DB. FIG. 4 is a diagram showing the configuration of an example of the user log DB. FIG. 5 is a diagram showing the configuration of an example of the call statement DB. FIG. 6 is a functional block diagram of the voice generation device. FIG. 7A is a flowchart showing voice presentation processing by the voice generation device. FIG. 7B is a flowchart showing voice presentation processing by the voice generation device. FIG. 8 is a diagram showing an image of a binary classification model using "familiarity," "concentration level," and "arousal level change amount."
 Hereinafter, embodiments will be described with reference to the drawings. FIG. 1 is a diagram showing the hardware configuration of an example of a voice generation device including the learning device according to the embodiment. The voice generation device 1 according to the embodiment emits a call voice urging the user's awakening when the user is not in an awake state, for example because of drowsiness.
 In the embodiment, whether or not the user is awake is determined based on the "arousal level." The arousal level in the embodiment is an index indicating the degree of arousal, and corresponds to the physiological arousal level, which reflects the activity level of the cerebrum and represents the degree of arousal from sleep to excitement. The physiological arousal level is assessed from eye movement, blinking activity, electrodermal activity, reaction time to stimuli, and the like. The arousal level in the embodiment is calculated from any one of these measures or from a combination thereof. The arousal level is, for example, a value that increases as the user moves from a sleep state toward an excited state; it may be a continuous numerical value or a discrete value such as Level 1, Level 2, and so on. When the arousal level is calculated from a combination of eye movement, blinking activity, electrodermal activity, and reaction time to stimuli, the manner of combination is not particularly limited; for example, a simple sum or a weighted sum of these values can be used.
 The voice generation device 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, speakers 7a and 7b, a camera 8, an input device 9, a display 10, and a communication module 11. The voice generation device 1 is, for example, one of various terminals such as a personal computer (PC), a smartphone, or a tablet terminal; it is not limited to these and can be mounted on various devices used by the user. The voice generation device 1 does not have to have all the components shown in FIG. 1. For example, the microphone 6, the speakers 7a and 7b, the camera 8, and the display 10 may be devices separate from the voice generation device 1.
 The processor 2 is a control circuit, such as a CPU, that controls the overall operation of the voice generation device 1. The processor 2 does not have to be a CPU and may be an ASIC, an FPGA, a GPU, or the like. The processor 2 does not have to consist of a single CPU or the like and may consist of a plurality of CPUs or the like.
 The ROM 3 is a non-volatile memory such as a flash memory, and stores, for example, the boot program of the voice generation device 1. The RAM 4 is a volatile memory such as an SDRAM and can be used as working memory for various processes in the voice generation device 1.
 The storage 5 is a storage device such as a flash memory, a hard disk drive (HDD), or a solid state drive (SSD). The storage 5 stores various programs used in the voice generation device 1. The storage 5 may also store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a voice synthesis parameter database (DB) 54, and a call statement database (DB) 55. These databases will be described in detail later.
 The microphone 6 is a device that converts input voice into a voice signal, which is an electric signal. The voice signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5. For example, the voice synthesis parameters for synthesizing the call voice can be acquired from voice input via the microphone 6.
 The speakers 7a and 7b are devices that output voice based on an input voice signal. Here, it is desirable that the speakers 7a and 7b are not in close proximity to each other, that they are arranged in different directions as seen from the user, and that they are equidistant from the user.
 FIGS. 2A and 2B are diagrams showing arrangement examples of the speakers 7a and 7b. In FIG. 2A, the speakers 7a and 7b are both arranged in front of the user U, equidistant from the user. In FIG. 2B, the speakers 7a and 7b are arranged in front of and behind the user U, respectively, equidistant from the user.
 Speakers are arranged in the user's environment in the same number as the number of presented voices; FIG. 1 shows an example in which the number of presented voices is two. The number of presented voices may instead be three or more, in which case three or more speakers are arranged. Even then, it is desirable that the speakers are not close to each other, that they are placed in different directions as seen from the user, and that each speaker is equidistant from the user. For example, FIGS. 2C and 2D show arrangement examples with three speakers 7a, 7b, and 7c. In FIG. 2C, the speakers 7a, 7b, and 7c are arranged in front of the user U; in FIG. 2D, they are arranged behind the user U.
 The camera 8 captures images of the user. The user's image obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5, and is used, for example, to acquire the arousal level or the user's reaction to the call voice.
 The input device 9 is a mechanical input device such as a button, a switch, a keyboard, or a mouse, or a software-based input device using a touch sensor. The input device 9 receives various inputs from the user and outputs signals corresponding to the user's inputs to the processor 2.
 The display 10 is a display such as a liquid crystal display or an organic EL display, and displays various images.
 The communication module 11 is a device with which the voice generation device 1 carries out communication, for example with a server provided outside the voice generation device 1. The communication method of the communication module 11 is not particularly limited; communication may be carried out wirelessly or by wire.
 Next, the familiarity database (DB) 51, the user log database (DB) 52, the model database (DB) 53, the voice synthesis parameter database (DB) 54, and the call statement database (DB) 55 will be described.
 FIG. 3 is a diagram showing the configuration of an example of the familiarity DB 51. The familiarity DB 51 is a database recording the user's "familiarity." It records, for example, a user ID, a voice label, a familiar target, a familiarity, a number of reactions, a number of presentations, and an average arousal level change, in association with one another.
 The "user ID" is an ID assigned to each user of the voice generation device 1. The user ID may be associated with user attribute information such as the user's name.
 The "voice label" is a label uniquely assigned to each call-voice candidate. Any label can be used as the voice label; for example, the name of the familiar target may be used.
 The "familiar target" is a person with whom the user routinely converses, or another source of a voice that the user often hears. The familiar target does not necessarily have to be a person.
 The "familiarity" is the degree of the user's familiarity with the voice of the corresponding familiar target. The familiarity can be calculated from, for example, the frequency of communication with the familiar target via SNS, the frequency of daily conversation with the familiar target, and the frequency with which the user hears the familiar target in daily life; the higher these frequencies, the larger the familiarity value. The familiarity may also be acquired by the user's self-report.
 The "number of reactions" is the number of times the user reacted to call voices generated based on the corresponding voice label. The "number of presentations" is the number of times call voices generated based on the corresponding voice label were presented to the user. Dividing the number of reactions by the number of presentations yields the reaction probability, that is, the probability that the user reacts to a call voice generated based on the corresponding voice label.
 The "average arousal level change" is the average of the user's arousal level change amounts for call voices generated based on the corresponding voice label. The arousal level change amount will be described later.
 FIG. 4 is a diagram showing the configuration of an example of the user log DB 52. The user log DB 52 is a database recording logs of the user's use of the voice generation device 1. It records, for example, a log generation date and time, a user ID, a voice label, a familiar target, a concentration level, reaction presence/absence, an arousal level, a new arousal level, an arousal level change amount, and a correct answer label, in association with one another. The user ID, the voice label, and the familiar target are the same as those in the familiarity DB 51.
 The "log generation date and time" is the date and time at which the user used the voice generation device 1. It is recorded, for example, each time a call voice is presented to the user.
 The "reaction presence/absence" is information on whether the user reacted after the call voice was presented. When the user reacted, "yes" is recorded; when the user did not react, "no" is recorded.
 The "concentration level" is the degree of the user's concentration at the time the call voice is presented. The concentration level can be measured, for example, by estimating the posture and behavior of the user during work from images obtained by the camera 8; its value is calculated to increase whenever the user's posture or behavior suggests concentration and to decrease whenever it suggests a lack of concentration. Alternatively, the degree of dilation of the user's pupils during work can be estimated from images obtained by the camera 8; the concentration value is then calculated to be higher when the pupils are more dilated and lower when they are more constricted. The concentration level may be a discrete value such as Lv (Level) 1, Lv 2, and so on. The method of acquiring the concentration level is not limited to any specific method.
 The "arousal level" is the arousal level acquired before the voice generation device 1 presents the call voice.
 The "new arousal level" is the arousal level newly acquired after the user's reaction. The new arousal level is not recorded when there is no user reaction.
 The "arousal level change amount" is a quantity representing the change in the arousal level before and after the user's reaction. It is obtained, for example, as the difference between the new arousal level and the original arousal level, or it may instead be, for example, their ratio. It is not recorded when there is no user reaction.
 The "correct answer label" is a correct/incorrect label for supervised learning. For example, a correct answer is recorded as 〇 and an incorrect answer as ×. The correct answer label will be described in detail later.
 The model DB 53 is a database that records voice-label classification models used to extract voice-label candidates. In the embodiment, a model is configured to classify voice labels as correct or incorrect in the three-dimensional space of familiarity, concentration, and arousal-level change. The models include an initial model and a learning model. The initial model is generated from initial values stored in the model DB 53 and is not updated by learning. Here, the initial values are the constants (the coefficients of a plane equation) that determine the classification plane for voice-label classification defined in the three-dimensional space of "familiarity," "concentration," and "arousal-level change." The classification plane generated from these initial values constitutes the initial model. In the initial model, voice labels whose familiarity places them above the classification plane are classified as correct (〇), and the other voice labels are classified as incorrect (×). The learning model is a trained model generated from the initial model; it can be a binary classification model whose classification plane differs from that of the initial model.
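 As a rough illustration of the initial model, the sketch below classifies a voice label by which side of a fixed plane its (familiarity, concentration, arousal-level change) point lies on. The coefficient values are placeholders and not values defined by the embodiment.

    # Hypothetical initial model: a plane a*x + b*y + c*z + d = 0 in the
    # (familiarity, concentration, arousal-level change) space. Points on
    # the positive side are classified as correct. Coefficients below are
    # placeholders, illustrative only.
    INITIAL_COEFFS = (1.0, 0.5, 0.8, -2.0)  # (a, b, c, d)

    def classify_initial(familiarity: float, concentration: float,
                         arousal_change: float) -> bool:
        a, b, c, d = INITIAL_COEFFS
        # True corresponds to correct (〇), False to incorrect (×).
        return a * familiarity + b * concentration + c * arousal_change + d > 0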
 The voice synthesis parameter DB 54 is a database that records voice synthesis parameters. A voice synthesis parameter is data used to synthesize the voice of one of the user's familiar targets. For example, a voice synthesis parameter may be feature data extracted from voice data previously captured through the microphone 6. Alternatively, voice synthesis parameters acquired or defined by another system may be recorded in advance. Each voice synthesis parameter is associated with a voice label.
 FIG. 5 is a diagram showing an example configuration of the call-sentence DB 55. The call-sentence DB 55 is a database that records template data of various call sentences for prompting the user's awakening. The call sentences are not particularly limited; however, it is desirable that a call sentence include a call using the user's name, in order to enhance the cocktail-party effect described later.
 Here, the familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call-sentence DB 55 need not necessarily be stored in the storage 5. For example, they may be stored in a server separate from the voice generation device 1. In that case, the voice generation device 1 accesses the server via the communication module 11 and acquires the necessary information.
 FIG. 6 is a functional block diagram of the voice generation device 1. As shown in FIG. 6, the voice generation device 1 has an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26. The operations of these units are realized, for example, by the processor 2 executing a program stored in the storage 5. The determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 may instead be realized by hardware separate from the processor 2.
 The acquisition unit 21 acquires the user's arousal level and the user's reaction to the call voice. As described above, the arousal level is calculated from eye movement, blinking activity, electrodermal activity, reaction time to a stimulus, or a combination thereof. The eye movement, blinking activity, and reaction time used to calculate the arousal level can be measured, for example, from images of the user captured by the camera 8; the reaction time to a stimulus may also be measured from the audio signal captured by the microphone 6. Electrodermal activity can be measured, for example, by a sensor worn on the user's arm. The user's reaction can be acquired by measuring, for example from images captured by the camera 8, whether the user reacted physically and the direction of the reaction, such as the user's head or gaze turning toward the speaker 7a or 7b. The acquisition unit 21 may also be configured to acquire, by communication, an arousal level or a user reaction calculated outside the voice generation device 1.
 The determination unit 22 determines, based on the arousal level acquired by the acquisition unit 21, whether the user is in an awake state. When it determines that the user is not in an awake state, the determination unit 22 transmits a voice-label selection request to the receiving unit 231 of the selection unit 23. Here, the determination unit 22 makes the determination by comparing the arousal level with a predetermined threshold. The threshold is an arousal-level threshold for determining whether the user is awake and is stored, for example, in the storage 5. The determination unit 22 also determines, based on the reaction information acquired by the acquisition unit 21, whether the user reacted.
 When it is determined that the user is not in an awake state, the selection unit 23 selects the voice labels of candidate voices for prompting the user's awakening. The selection unit 23 has a receiving unit 231, a model selection unit 232, a voice-label candidate extraction unit 233, a voice-label selection unit 234, and a transmission unit 235.
 The receiving unit 231 receives the voice-label selection request from the determination unit 22.
 The model selection unit 232 selects, from the model DB 53, the model used for voice-label selection. The model selection unit 232 selects either the initial model or the learning model based on the degree of fit. The degree of fit is a value for determining which of the initial model and the learning model has the higher accuracy; it is described in detail later.
 The voice-label candidate extraction unit 233 extracts, from the familiarity DB 51, voice labels that are candidates for the call voices to be presented to the user, based on the model selected by the model selection unit 232 and the user's concentration level.
 The voice-label selection unit 234 selects, from the voice labels extracted by the voice-label candidate extraction unit 233, the voice labels used to generate the call voices presented to the user.
 The transmission unit 235 transmits information on the voice labels selected by the voice-label selection unit 234 to the generation unit 24.
 The generation unit 24 generates call voices for prompting the user's awakening, based on the voice labels received from the transmission unit 235. The generation unit 24 acquires, from the voice synthesis parameter DB 54, the voice synthesis parameters corresponding to the received voice labels, and generates the call voices based on the call-sentence data recorded in the call-sentence DB 55 and the voice synthesis parameters.
 The presentation unit 25 presents the call voices generated by the generation unit 24 to the user, for example by reproducing them through the speakers 7.
 The learning unit 26 trains the models recorded in the model DB 53, for example by binary classification learning using the correct-answer labels.
 Next, the operation of the voice generation device 1 is described. FIGS. 7A and 7B are flowcharts showing the voice presentation process performed by the voice generation device 1. The process of FIGS. 7A and 7B may be performed periodically.
 In step S1, the acquisition unit 21 acquires the user's arousal level and outputs it to the determination unit 22. The acquisition unit 21 also retains the acquired arousal level until the user's reaction to the presented call voice is acquired.
 In step S2, the determination unit 22 determines whether the arousal level acquired by the acquisition unit 21 is equal to or below the threshold. When the arousal level is determined to exceed the threshold, that is, when the user is awake, the process of FIGS. 7A and 7B ends. When the arousal level is determined to be equal to or below the threshold, that is, when the user is not awake, for example because of drowsiness, the process proceeds to step S3.
 In step S3, the determination unit 22 transmits a voice-label selection request to the selection unit 23. When the receiving unit 231 receives the request, the model selection unit 232 refers to the user log DB 52 and acquires the reaction count, that is, the total number of "present" entries in "reaction presence/absence."
 In step S4, the model selection unit 232 determines whether the reaction count is below a threshold. This threshold is for determining whether a usable learning model is recorded in the model DB 53 and is set, for example, to 2; in that case, a reaction count of 0 or 1 is determined to be below the threshold. When the reaction count is determined to be below the threshold, the process proceeds to step S5. When the reaction count is determined to be equal to or above the threshold, the process proceeds to step S6.
 In step S5, the model selection unit 232 selects the initial values, that is, the initial model, from the model DB 53 and outputs the selected initial model to the voice-label candidate extraction unit 233. The process then proceeds to step S9.
 In step S6, the model selection unit 232 calculates the degree of fit. To do so, the model selection unit 232 first acquires all past logs, with and without reactions, from the user log DB 52, and then calculates the degree of fit of both the initial model and the learning model. For example, the model selection unit 232 can use as the degree of fit the accuracy, that is, the rate of agreement between each model's correct/incorrect output, when given each log's concentration value, and that log's reaction presence/absence. The degree of fit is not limited to the accuracy; it may also be the precision, recall, or F-measure calculated from the model's correct/incorrect outputs and the logged reaction presence/absence. The precision is the proportion of the data predicted to be correct for which the user's reaction was actually "present." The recall is the proportion of the logs with an actual user reaction that were predicted to be correct. The F-measure is the harmonic mean of the recall and the precision; for example, it can be calculated as 2 * Recall * Precision / (Recall + Precision).
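 For illustration, the following sketch computes the four fit-degree metrics named above from a model's correct/incorrect outputs and the logged reaction presence/absence; it is a generic formulation, not code from the embodiment.

    def fit_metrics(predicted, actual):
        # predicted[i]: True if the model classified log i as correct
        # actual[i]:    True if the user actually reacted in log i
        tp = sum(1 for p, a in zip(predicted, actual) if p and a)
        tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
        fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
        fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
        accuracy = (tp + tn) / len(actual) if actual else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f_measure = (2 * recall * precision / (recall + precision)
                     if (recall + precision) else 0.0)
        return {"accuracy": accuracy, "precision": precision,
                "recall": recall, "f_measure": f_measure}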
 In step S7, the model selection unit 232 compares the degrees of fit of the initial model and the learning model and determines whether the learning model's degree of fit is higher. When the initial model's degree of fit is determined to be higher, the process proceeds to step S5, and the model selection unit 232 selects the initial values, that is, the initial model. When the learning model's degree of fit is determined to be higher, the process proceeds to step S8.
 In step S8, the model selection unit 232 selects the learning model and outputs it to the voice-label candidate extraction unit 233. The process then proceeds to step S9.
 In step S9, the voice-label candidate extraction unit 233 acquires the user's current concentration level from the acquisition unit 21.
 In step S10, the voice-label candidate extraction unit 233 extracts, from the familiarity DB 51, the candidate voice labels used to generate the call voices. The number of extracted candidates is at least a specified number, for example the number of call voices to be presented. The voice-label candidate extraction unit 233 extracts, for example, from the voice labels registered in the familiarity DB 51, all voice labels classified as correct for the current concentration value. A voice label classified as correct is one for which a user reaction to the presented call voice is expected and an increase in arousal level is also expected.
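 Under the same illustrative assumptions as the earlier sketches, candidate extraction can be pictured as the following filter; the row layout is hypothetical.

    def extract_candidates(label_rows, concentration, classify):
        # label_rows: iterable of (voice_label, familiarity, avg_arousal_change)
        # classify:   a binary classifier such as classify_initial above
        return [label for (label, fam, d_arousal) in label_rows
                if classify(fam, concentration, d_arousal)]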
 In step S11, the voice-label selection unit 234 selects, from the voice labels extracted by the voice-label candidate extraction unit 233, a specified number of voice labels, for example the same number as the number of call voices to be presented. In selecting the voice labels, the voice-label selection unit 234 obtains, for example, a weighted winning probability based on the past presentation counts, and then selects the voice labels by random sampling based on the weighted winning probabilities. The weighted winning probability can be calculated, for example, according to Equation (1), and may also be calculated by a different equation.

 [Equation (1): formula image not reproduced]
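 Since Equation (1) is not reproduced, the sketch below simply assumes a weight that decreases with the past presentation count and normalizes the weights into winning probabilities; the actual formula of the embodiment may differ.

    import random

    def weighted_sample(candidates, presentation_counts, k):
        # candidates:          list of candidate voice labels (assumed unique)
        # presentation_counts: dict mapping each label to its past presentation count
        # Assumed weighting: rarely presented labels get higher probability.
        weights = [1.0 / (1 + presentation_counts[c]) for c in candidates]
        chosen, pool, pool_w = [], list(candidates), list(weights)
        for _ in range(min(k, len(pool))):
            pick = random.choices(pool, weights=pool_w, k=1)[0]
            i = pool.index(pick)
            pool.pop(i)
            pool_w.pop(i)
            chosen.append(pick)
        return chosen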
 In step S12, the transmission unit 235 transmits information indicating the voice labels selected by the voice-label selection unit 234 to the generation unit 24. The generation unit 24 acquires, from the voice synthesis parameter DB 54, the voice synthesis parameters corresponding to the received voice labels, and generates the call voices based on call-sentence data selected at random from the call-sentence DB 55 and the voice synthesis parameters. The call voices can be generated by voice synthesis processing using the voice synthesis parameters. The process then proceeds to step S13.
 In step S13, the presentation unit 25 presents the call voices generated by the generation unit 24 to the user simultaneously from the speakers 7a and 7b.
 In step S14, the acquisition unit 21 acquires the user's reaction and outputs the reaction information to the determination unit 22.
 In step S15, the determination unit 22 determines whether the user reacted. When it is determined that the user did not react, the process proceeds to step S20. When it is determined that the user reacted, the process proceeds to step S16.
 In step S16, the determination unit 22 requests the acquisition unit 21 to acquire a new arousal level, and in response the acquisition unit 21 acquires it. The new arousal level may be acquired in the same manner as the arousal level.
 In step S17, the acquisition unit 21 sets the correct-answer labels, for example as follows (a code sketch of these rules follows the list).
 1) When the acquired reaction is that the user turned toward a specific speaker:
  the voice label corresponding to the voice presented from that speaker: 〇
  all other voice labels: ×
 2) When the acquired reaction is that the user turned toward a point between speakers:
  the angle between the user's direction and each speaker's direction is obtained, and the voice label of the voice presented from the speaker with the smaller angle: 〇
  all other voice labels: ×
 3) When the acquired reaction is that the user turned toward one speaker and then toward another speaker:
  the voice label of the voice presented from the speaker the user turned toward first: 〇
  all other voice labels: ×
 4) When no reaction could be acquired:
  all voice labels: ×
 In step S18, the acquisition unit 21 registers the concentration level, the reaction presence/absence information, the arousal level, the new arousal level, the arousal-level change amount, and the correct-answer labels in the user log DB 52 in association with the log generation date and time, the voice labels, the familiar targets, and the familiarity. The process then proceeds to step S19.
 In step S19, the learning unit 26 refers to the user log DB 52, acquires the reaction count, and determines whether the reaction count is below a threshold. This threshold is for determining whether enough information for learning has been accumulated and is set, for example, to 2; in that case, a reaction count of 0 or 1 is determined to be below the threshold. When the reaction count is determined to be below the threshold, the process of FIGS. 7A and 7B ends. When the reaction count is determined to be equal to or above the threshold, the process proceeds to step S20.
 In step S20, the learning unit 26 performs binary classification learning and records the learning result in the model DB 53, after which the process of FIGS. 7A and 7B ends. In step S20, the learning unit 26 acquires, for example, the correct-answer labels recorded in the user log DB 52 together with the familiarity and concentration associated with them, and generates a binary classification model of the voice labels in the three-dimensional space of "familiarity," "concentration," and "arousal-level change." FIG. 8 illustrates such a binary classification model using "familiarity," "concentration," and "arousal-level change." In the example of FIG. 8, voice labels whose familiarity places them in the space above the classification plane P are classified as correct (〇), while voice labels in the space below the classification plane P are classified as incorrect (×). Various binary classification learning techniques, such as logistic regression, SVM (Support Vector Machine), and neural networks, can be used to generate the model.
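 As one concrete possibility, the binary classification learning of step S20 could be realized with logistic regression as sketched below. The use of scikit-learn is an assumption of this sketch; the embodiment only names logistic regression, SVM, and neural networks as examples.

    from sklearn.linear_model import LogisticRegression

    def train_learning_model(log_rows):
        # log_rows: iterable of (familiarity, concentration, arousal_change,
        #           correct) where correct is True for 〇 and False for ×.
        X = [[fam, conc, d_ar] for (fam, conc, d_ar, _) in log_rows]
        y = [int(correct) for (_, _, _, correct) in log_rows]
        model = LogisticRegression()
        model.fit(X, y)  # the fitted coefficients define the classification plane P
        return model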
 The reason why the three axes of "familiarity," "concentration," and "arousal-level change" are adopted for the binary classification model in the embodiment is now explained. A person has the property that selective attention works on familiar voices, such as the conversation of a person of interest or the person's own name. This is called the cocktail-party effect. Furthermore, in Yumiko Honjo, "Physiological-psychological study on attention and arousal," doctoral dissertation, Kwansei Gakuin University, Otsu No. 217, pp. 187-188, a model of attention and arousal incorporating both selective attention and arousal is derived. From this, the occurrence of selective attention and the arousal level are considered to be related. Because "familiarity" is thus considered to affect both how readily the cocktail-party effect occurs and the change in arousal level the effect produces, it is adopted as one axis of learning.
 As for "concentration," it is reported in "The brain directs attention and heightens concentration through 'efficient selection'," RIKEN news release, December 8, 2011, [Online] [retrieved June 10, 2020], Internet URL: https://www.riken.jp/press/2011/20111208/, that in a concentrated state the information transmitted from sensation to perception is limited. That is, a sound perceived while concentration is high is presumed to be a sound the user needs more or that more readily reaches the user's ears. Because "concentration" can thus be considered to affect how readily the user's selective attention arises, that is, which sounds the user tends to react to, it is adopted as one axis of learning.
 The arousal-level change amount characterizes the user's reaction in addition to the correct-answer label, that is, whether the user reacts at all. "Arousal-level change" is therefore adopted as one axis of learning, as it is expected to further improve the accuracy of correct-answer label determination.
 As described above, according to the embodiment, when the user is determined not to be in an awake state, the user is called with voices familiar to the user. Therefore, even when the user is drowsy, the cocktail-party effect lets the call voice reach the user, and an improvement in arousal level within a short time can be expected. Furthermore, in the embodiment, familiarity and concentration are used in selecting the familiar voices, so call voices that the user reacts to more readily can be presented to the user.
 According to the embodiment, voice labels are also classified using a learning model with the three axes of familiarity, concentration, and arousal-level change. As learning progresses, voice-label candidates better suited to the user are therefore expected to be extracted. Furthermore, according to the embodiment, the voice labels used to generate the voices are selected from the extracted candidates by random sampling based on the past presentation counts. This suppresses the user's habituation and boredom that would result from frequent presentation of call voices with the same voice label. Consequently, even when the voice generation device 1 is used over a long period, the user's reaction to the call voices remains likely, and a rise in the user's arousal level can be expected as a result.
 Furthermore, according to the embodiment, call voices are presented simultaneously from a plurality of speakers arranged in the environment, the user's reaction to each call voice is acquired, and the correct-answer labels are set according to that reaction. This makes it possible to obtain teacher data efficiently.
 [Modification example]
 A modification of the embodiment is described. In the embodiment, an example was shown in which the selection of voice labels based on familiarity, concentration, and arousal-level change, the generation of the call voices, and the training of the learning model are all performed within the voice generation device 1. However, the voice-label selection, the call-voice generation, and the training of the learning model may be performed in separate devices.
 In the embodiment, the binary classification model uses the three axes of "familiarity," "concentration," and "arousal-level change." More simply, a binary classification model using, for example, only "familiarity," or only "familiarity" and "concentration," may instead be used.
 In the embodiment, the learning device is used as a device for training a voice-label classification model for call voices that prompt the user's awakening. The learning device of the embodiment can, however, also be used to train various models for selecting voices that the user readily perceives.
 Each process according to the above embodiment can also be stored as a program executable by a processor, which is a computer, and can be stored and distributed in a storage medium of an external storage device such as a magnetic disk, an optical disc, or a semiconductor memory. The processor reads the program stored in the storage medium of the external storage device, and its operation is controlled by the read program, whereby the above processes can be executed.
 The present invention is not limited to the above embodiment and can be variously modified at the implementation stage without departing from its gist. The embodiments may also be combined as appropriate, in which case the combined effects are obtained. Furthermore, the above embodiment includes various inventions, and various inventions can be extracted by combinations selected from the disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, a configuration from which those constituent elements are deleted can be extracted as an invention as long as the problem can be solved and the effects can be obtained.
 1 ... voice generation device
 2 ... processor
 3 ... ROM
 4 ... RAM
 5 ... storage
 6 ... microphone
 7a, 7b ... speaker
 8 ... camera
 9 ... input device
 10 ... display
 11 ... communication module
 21 ... acquisition unit
 22 ... determination unit
 23 ... selection unit
 24 ... generation unit
 25 ... presentation unit
 26 ... learning unit
 51 ... familiarity database (DB)
 52 ... user log database (DB)
 53 ... model database (DB)
 54 ... voice synthesis parameter database (DB)
 55 ... call-sentence database (DB)
 231 ... receiving unit
 232 ... model selection unit
 233 ... voice-label candidate extraction unit
 234 ... voice-label selection unit
 235 ... transmission unit

Claims (5)

  1.  A learning device comprising a learning unit that acquires teacher data for a learning model for selecting a voice to be presented to a user from a plurality of voice candidates, based on the user's reactions to a plurality of voices presented to the user simultaneously.
  2.  The learning device according to claim 1, wherein the plurality of voices are voices presented from each of a plurality of speakers that are arranged equidistant from the user and in different directions, and that emit voices toward the user from the different directions.
  3.  The learning device according to claim 1 or 2, wherein the learning model is a classification model that classifies, in a three-dimensional space consisting of a familiarity representing the degree to which the user is familiar with each of the plurality of voice candidates, a concentration representing the user's current degree of concentration, and a change amount of an arousal level representing the user's degree of arousal, from sleep to excitement, caused by the presentation of the voice, the plurality of voice candidates into a first voice candidate for which the user's reaction to the presented voice is expected and an increase in the user's arousal level is expected, and a second voice candidate for which the user's reaction to the presented voice is not expected or an increase in the user's arousal level is not expected.
  4.  A learning method comprising acquiring, by a learning unit, teacher data for a learning model for selecting a voice to be presented to a user from a plurality of voice candidates, based on the user's reactions to a plurality of voices presented to the user simultaneously.
  5.  A learning program for causing a processor to function as the learning unit of the learning device according to any one of claims 1 to 3.
PCT/JP2020/024823 2020-06-24 2020-06-24 Learning device, learning method, and learning program WO2021260848A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/024823 WO2021260848A1 (en) 2020-06-24 2020-06-24 Learning device, learning method, and learning program
JP2022531321A JP7416245B2 (en) 2020-06-24 2020-06-24 Learning devices, learning methods and learning programs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/024823 WO2021260848A1 (en) 2020-06-24 2020-06-24 Learning device, learning method, and learning program

Publications (1)

Publication Number Publication Date
WO2021260848A1 true WO2021260848A1 (en) 2021-12-30

Family

ID=79282108

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/024823 WO2021260848A1 (en) 2020-06-24 2020-06-24 Learning device, learning method, and learning program

Country Status (2)

Country Link
JP (1) JP7416245B2 (en)
WO (1) WO2021260848A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001304898A (en) * 2000-04-25 2001-10-31 Sony Corp On-vehicle equipment
JP2007271296A (en) * 2006-03-30 2007-10-18 Yamaha Corp Alarm device, and program
JP2013101248A (en) * 2011-11-09 2013-05-23 Sony Corp Voice control device, voice control method, and program
JP2016191791A (en) * 2015-03-31 2016-11-10 ソニー株式会社 Information processing device, information processing method, and program
JP2020024293A (en) * 2018-08-07 2020-02-13 トヨタ自動車株式会社 Voice interaction system
JP2020034835A (en) * 2018-08-31 2020-03-05 国立大学法人京都大学 Voice interactive system, voice interactive method, program, learning model generation device, and learning model generation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10433075B2 (en) 2017-09-12 2019-10-01 Whisper.Ai, Inc. Low latency audio enhancement


Also Published As

Publication number Publication date
JPWO2021260848A1 (en) 2021-12-30
JP7416245B2 (en) 2024-01-17


Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20941543; Country of ref document: EP; Kind code of ref document: A1)

ENP Entry into the national phase (Ref document number: 2022531321; Country of ref document: JP; Kind code of ref document: A)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 20941543; Country of ref document: EP; Kind code of ref document: A1)