WO2021260846A1 - Voice generation device, voice generation method, and voice generation program - Google Patents

Voice generation device, voice generation method, and voice generation program

Info

Publication number
WO2021260846A1
WO2021260846A1 (PCT/JP2020/024820)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
user
degree
arousal
unit
Prior art date
Application number
PCT/JP2020/024820
Other languages
French (fr)
Japanese (ja)
Inventor
妙 佐藤
昭宏 千葉
真奈 笹川
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2020/024820 priority Critical patent/WO2021260846A1/en
Priority to JP2022531319A priority patent/JP7416244B2/en
Publication of WO2021260846A1 publication Critical patent/WO2021260846A1/en

Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61M: DEVICES FOR INTRODUCING MEDIA INTO, OR ONTO, THE BODY; DEVICES FOR TRANSDUCING BODY MEDIA OR FOR TAKING MEDIA FROM THE BODY; DEVICES FOR PRODUCING OR ENDING SLEEP OR STUPOR
    • A61M 21/00: Other devices or methods to cause a change in the state of consciousness; Devices for producing or ending sleep by mechanical, optical, or acoustical means, e.g. for hypnosis
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • This embodiment relates to a voice generator, a voice generation method, and a voice generation program.
  • Non-Patent Documents 1 and 2 describe that sleep is the opposite of arousal.
  • The arousal level is an index showing the degree of arousal from sleep to excitement.
  • "Sleepiness" is defined as a state in which the arousal level is lower than a moderate arousal level. For this reason, even when drowsiness is felt during work in a remote environment such as working from home or during a distance lesson, it is desirable to raise the arousal level in as short a time as possible.
  • The embodiment provides a voice generation device, a voice generation method, and a voice generation program for urging the user to awaken in a short time.
  • The voice generation device includes: an acquisition unit that acquires an arousal degree indicating the user's degree of arousal from sleep to excitement; a determination unit that determines, based on the arousal degree, whether or not the user is awake; a selection unit that, when the user is not awake, selects a voice for prompting the user's awakening from a plurality of voice candidates based on a familiarity degree indicating how familiar the user is with each of the voice candidates and a concentration degree indicating the user's current degree of concentration; and a generation unit that generates, based on the selected voice, a call voice to be presented to the user.
  • According to the embodiment, a voice generation device, a voice generation method, and a voice generation program for urging the user to awaken in a short time are provided.
  • FIG. 1 is a diagram showing a hardware configuration of an example of a voice generator according to an embodiment.
  • FIG. 2 is a diagram showing the configuration of an example of the familiarity DB.
  • FIG. 3 is a diagram showing an example configuration of a user log DB.
  • FIG. 4 is a diagram showing the structure of an example of the call statement DB.
  • FIG. 5 is a functional block diagram of the voice generator.
  • FIG. 6A is a flowchart showing a voice presentation process by the voice generator.
  • FIG. 6B is a flowchart showing a voice presentation process by the voice generator.
  • FIG. 7 is a diagram showing an image of a binary classification model using "familiarity" and "concentration".
  • FIG. 8 is a diagram showing an image of a binary classification model using "familiarity", "concentration", and "arousal degree change amount".
  • FIG. 1 is a diagram showing a hardware configuration of an example of a voice generator according to an embodiment.
  • The voice generation device 1 according to the embodiment emits a call voice urging the user to awaken when the user is not in an awake state, such as when the user is drowsy.
  • The arousal degree in the embodiment is an index indicating the degree of arousal corresponding to the arousal level.
  • The arousal level corresponds to the activity level of the cerebrum and represents the degree of arousal from sleep to excitement.
  • The arousal level is measured from eye movements, blinking activity, electrodermal activity, reaction time to stimuli, and the like.
  • The arousal degree in the embodiment is therefore calculated from any one of these measurements, or from a combination of them.
  • The arousal degree is, for example, a value that increases as the user moves from a sleep state toward an excited state.
  • The arousal degree may be a continuous value or a discrete value such as Level 1, Level 2, and so on. When the arousal degree is calculated by combining the values of eye movement, blinking activity, electrodermal activity, and reaction time to a stimulus, the manner of combination is not particularly limited; for example, the values may simply be summed or combined by weighted addition.
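As one concrete illustration, the following Python sketch combines the four indicators into a single arousal degree by weighted addition. The weights, the normalization of the inputs, and the function name are illustrative assumptions and are not specified by the embodiment.

    # A minimal sketch of weighted addition of the measured indicators.
    # All weights and value ranges are illustrative assumptions.
    def arousal_degree(eye_movement: float,
                       blink_activity: float,
                       electrodermal: float,
                       reaction_time: float,
                       weights=(0.3, 0.3, 0.2, 0.2)) -> float:
        """Combine indicators normalized to [0, 1] (larger = more aroused)
        into one arousal degree in [0, 1] by weighted addition."""
        indicators = (eye_movement, blink_activity, electrodermal, reaction_time)
        return sum(w * v for w, v in zip(weights, indicators))

    print(arousal_degree(0.2, 0.4, 0.3, 0.1))  # 0.26 -> rather drowsy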
  • The voice generation device 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, a speaker 7, a camera 8, an input device 9, a display 10, and a communication module 11.
  • the voice generation device 1 is various terminals such as a personal computer (PC), a smartphone, and a tablet terminal. Not limited to this, the voice generation device 1 can be mounted on various devices used by the user.
  • the voice generator 1 does not have to have all the configurations shown in FIG. For example, the microphone 6, the speaker 7, the camera 8, and the display 10 may be separate devices from the voice generation device 1.
  • The processor 2 is a control circuit, such as a CPU, that controls the overall operation of the voice generation device 1.
  • the processor 2 does not have to be a CPU, and may be an ASIC, FPGA, GPU or the like.
  • the processor 2 does not have to be composed of a single CPU or the like, and may be composed of a plurality of CPUs or the like.
  • ROM 3 is a non-volatile memory such as a flash memory.
  • the start program of the voice generator 1 is stored in the ROM 3.
  • RAM 4 is a volatile memory such as SDRAM. The RAM 4 can be used as a working memory for various processes in the voice generator 1.
  • The storage 5 is a storage device such as a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • Various programs used by the voice generation device 1 are stored in the storage 5.
  • The storage 5 may also store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a voice synthesis parameter database (DB) 54, and a call statement database (DB) 55. These databases are described in detail later.
  • the microphone 6 is a device that converts the input voice into a voice signal which is an electric signal.
  • the audio signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5.
  • the voice synthesis parameter for synthesizing the calling voice can be acquired from the voice input via the microphone 6.
  • the speaker 7 is a device that outputs voice based on the input voice signal.
  • the camera 8 captures the user and acquires the image of the user.
  • the user's image obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5.
  • the user's image is used, for example, to acquire the degree of arousal or to acquire the user's reaction to the calling voice.
  • The input device 9 is a mechanical input device such as a button, a switch, a keyboard, or a mouse, or a software input device using a touch sensor.
  • the input device 9 receives various inputs from the user. Then, the input device 9 outputs a signal corresponding to the user's input to the processor 2.
  • the display 10 is a display such as a liquid crystal display or an organic EL display.
  • the display 10 displays various images.
  • the communication module 11 is a device for the voice generation device 1 to carry out communication.
  • the communication module 11 communicates with, for example, a server provided outside the voice generator 1.
  • the communication method by the communication module 11 is not particularly limited.
  • the communication module 11 may carry out communication wirelessly or may carry out communication by wire.
  • FIG. 2 is a diagram showing a configuration of an example of familiarity DB 51.
  • the familiarity DB 51 is a database that records the "familiarity" of the user.
  • The familiarity DB 51 records, in association with one another, for example, a user ID, a voice label, a familiar target, a familiarity degree, a number of reactions, a number of presentations, and an average arousal change value.
  • the "user ID" is an ID assigned to each user of the voice generator 1.
  • the user ID may be associated with user attribute information such as a user name.
  • the "voice label” is a label uniquely attached to each of the candidates for the calling voice. Any label can be used as the audio label. For example, a familiar name may be used for the voice label.
  • the "familiar target” is a target that generates a voice that the user often talks to or hears.
  • the familiar target does not necessarily have to be a person.
  • The "familiarity degree" is the degree of the user's familiarity with the voice of the corresponding familiar target.
  • the degree of familiarity can be calculated from the frequency of communication with a familiar target by SNS or the like, the frequency of daily conversation with a familiar target, the frequency of daily hearing from a familiar target, and the like. For example, the higher the frequency of communication with a familiar target by SNS or the like, the frequency of daily conversation with a familiar target, and the frequency of daily hearing from a familiar target, the greater the value of familiarity.
  • the degree of familiarity may be acquired by self-reporting by the user.
  • the "number of responses" is the number of times the user responded to the call voice generated based on the corresponding voice label.
  • the number of presentations is the number of times the call voice generated based on the corresponding voice label is presented to the user.
  • the reaction probability can be calculated by dividing the number of reactions by the number of presentations.
  • the reaction probability is the probability that the user will react to the call voice generated based on the corresponding voice label.
  • the "average value of change in arousal level” is the average value of the amount of change in the arousal level of the user with respect to the call voice generated based on the corresponding voice label.
  • the amount of change in arousal level will be described later.
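The statistics in this table could be maintained as in the following sketch; the field names mirror the columns described above, while the incremental-average update and all values are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class FamiliarityRecord:
        user_id: str
        voice_label: str
        familiarity: float
        reactions: int = 0           # "number of reactions"
        presentations: int = 0       # "number of presentations"
        avg_arousal_change: float = 0.0

        @property
        def reaction_probability(self) -> float:
            # reaction probability = number of reactions / number of presentations
            return self.reactions / self.presentations if self.presentations else 0.0

        def record_presentation(self, reacted: bool, arousal_change=None):
            self.presentations += 1
            if reacted and arousal_change is not None:
                self.reactions += 1
                # running mean of the arousal degree change amount
                self.avg_arousal_change += (
                    (arousal_change - self.avg_arousal_change) / self.reactions)

    rec = FamiliarityRecord("U1", "mother", familiarity=0.9)
    rec.record_presentation(True, arousal_change=2.0)
    rec.record_presentation(False)
    print(rec.reaction_probability, rec.avg_arousal_change)  # 0.5 2.0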
  • FIG. 3 is a diagram showing the configuration of an example of the user log DB 52.
  • the user log DB 52 is a database that records logs related to the use of the voice generation device 1 by the user.
  • The user log DB 52 records, in association with one another, for example, a log generation date and time, a user ID, a voice label, a familiar target, a concentration degree, the presence or absence of a reaction, an arousal degree, a new arousal degree, an arousal degree change amount, and a correct answer label.
  • the user ID, the voice label, and the familiar object are the same as the familiarity DB 51.
  • the "log generation date and time” is the date and time when the user used the voice generator 1.
  • the log generation date and time is recorded, for example, each time a call voice is presented to the user.
  • The "presence or absence of a reaction" indicates whether the user reacted after the call voice was presented. When the user reacted, "yes" is recorded; when the user did not react, "none" is recorded.
  • The "concentration degree" is the degree of the user's concentration at the time the call voice is presented.
  • The concentration degree can be measured, for example, by estimating the posture and behavior of the user during work from the image obtained by the camera 8.
  • The concentration value is calculated so as to increase each time the user adopts a posture or behavior considered to indicate concentration, and to decrease each time the user adopts a posture or behavior considered to indicate a lack of concentration.
  • Alternatively, the degree of dilation of the user's pupils during work can be estimated from the image obtained by the camera 8.
  • In that case, the concentration value is calculated to be higher when the pupils are more dilated and lower when they are more constricted.
  • the degree of concentration may be a discrete value such as Lv (Level) 1, Lv2, ....
  • the method for acquiring the degree of concentration is not limited to a specific method.
  • the "awakening degree” is the awakening degree acquired before the presentation of the call voice by the voice generation device 1.
  • the "new arousal degree" is the arousal degree newly acquired after the user's reaction. New arousal is not recorded when there is no user response.
  • the "awakening degree change amount” is an amount representing the change in the arousal degree before and after the user's reaction.
  • the amount of change in alertness is obtained, for example, from the difference between the new alertness and the alertness.
  • the amount of change in arousal level may be the ratio of the new arousal level to the arousal level or the like. The amount of change in alertness is not recorded when there is no reaction from the user.
  • the "correct answer label” is a label of correct or incorrect answers for supervised learning.
  • the correct answer is recorded as ⁇
  • the incorrect answer is recorded as ⁇ .
  • the correct answer is recorded when the user reacts to the call voice from the voice generator 1 and the arousal level rises beyond the threshold value in the voice presentation operation described later. ..
  • the incorrect answer is recorded when the user does not respond to the call voice in the voice presentation process, or when the arousal level is equal to or less than the threshold value even when the user responds.
  • the model DB 53 is a database that records a model of voice label classification for extracting voice label candidates.
  • the model is a model configured to classify correct or incorrect answers of voice labels in a two-dimensional space of familiarity and concentration.
  • the model includes an initial model and a learning model.
  • The initial model is a model generated from initial values stored in the model DB 53 and is not updated by learning. The initial values are the two constants (a, b) of a linear function y = ax + b defined in the two-dimensional space of familiarity y and concentration x; voice labels whose familiarity exceeds ax + b are classified as correct, and the others as incorrect.
  • The learning model is a trained model generated from the initial model.
  • The learning model can be a binary classification model whose constants (a, b) differ from those of the initial model.
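A minimal sketch of this linear classification boundary, assuming placeholder values for the constants (a, b):

    # Linear binary classification in the (concentration x, familiarity y)
    # plane: points above the line y = a*x + b are classified as correct.
    A_INIT, B_INIT = 0.5, 0.2  # assumed placeholder initial constants

    def classify(familiarity: float, concentration: float,
                 a: float = A_INIT, b: float = B_INIT) -> bool:
        """True (correct, 〇) if the label lies above the line y = a*x + b."""
        return familiarity > a * concentration + b

    print(classify(familiarity=0.9, concentration=0.6))  # True  (0.9 > 0.5)
    print(classify(familiarity=0.3, concentration=0.6))  # False (0.3 <= 0.5)

The learning step described later can then be viewed as replacing the constants (a, b), or the linear boundary itself, with a fitted one.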
  • the voice synthesis parameter DB 54 is a database in which voice synthesis parameters are recorded.
  • the voice synthesis parameter is data used for synthesizing the voice of the user's familiar target.
  • the voice synthesis parameter may be feature amount data extracted from voice data previously collected through the microphone 6.
  • speech synthesis parameters acquired or defined by other systems may be pre-recorded.
  • the speech synthesis parameter is associated with the speech label.
  • FIG. 4 is a diagram showing the configuration of an example of the call statement DB55.
  • the call statement DB 55 is a database in which template data of various call statements for encouraging the awakening of the user are recorded.
  • the call statement is not particularly limited. However, it is desirable that the call statement includes a call using the user's name. This is to enhance the cocktail party effect described later.
  • the familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call statement DB 55 do not necessarily have to be stored in the storage 5.
  • the familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call statement DB 55 may be stored in a server separate from the voice generation device 1.
  • the voice generator 1 accesses the server using the communication module 11 and acquires necessary information.
  • FIG. 5 is a functional block diagram of the voice generator 1.
  • the voice generation device 1 has an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26.
  • The operations of the acquisition unit 21, the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 are realized, for example, by the processor 2 executing a program stored in the storage 5.
  • the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 may be realized by hardware different from the processor 2.
  • the acquisition unit 21 acquires the arousal level of the user. Further, the acquisition unit 21 acquires the user's reaction to the call voice. As described above, the degree of arousal is calculated by any one of eye movements, blinking activity, electrical skin activity, reaction time to stimuli, or a combination thereof.
  • the eye movement, blinking activity, and reaction time to the stimulus for calculating the degree of arousal can be measured from, for example, an image of the user acquired by the camera 8.
  • the reaction time to the stimulus may be measured from the audio signal acquired by the microphone 6.
  • skin electrical activity can be measured, for example, by a sensor worn on the user's arm.
  • The user's reaction can be obtained by measuring, for example from an image acquired by the camera 8, whether the user looked in the direction of the sound after the call voice was presented.
  • the acquisition unit 21 may be configured to acquire the arousal degree or the user's reaction calculated outside the voice generation device 1 by communication.
  • The determination unit 22 determines whether or not the user is awake based on the arousal degree acquired by the acquisition unit 21. When the determination unit 22 determines that the user is not in an awake state, it transmits a voice label selection request to the reception unit 231 of the selection unit 23. The determination unit 22 makes this determination by comparing the arousal degree with a predetermined threshold value.
  • The threshold value is an arousal degree threshold for determining whether or not the user is awake, and is stored in, for example, the storage 5. The determination unit 22 also determines whether or not there was a user reaction based on the user reaction information acquired by the acquisition unit 21.
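A minimal sketch of this threshold test; the threshold value and names are illustrative assumptions:

    # Assumed threshold; in the embodiment it is stored in the storage 5.
    AROUSAL_THRESHOLD = 0.5

    def needs_wakeup_call(arousal: float,
                          threshold: float = AROUSAL_THRESHOLD) -> bool:
        """True when the user is judged not to be awake, i.e. when a voice
        label selection request should be sent to the selection unit."""
        return arousal <= threshold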
  • The selection unit 23 selects the voice label of a candidate voice for encouraging the user's awakening.
  • the selection unit 23 includes a reception unit 231, a model selection unit 232, an audio label candidate extraction unit 233, an audio label selection unit 234, and a transmission unit 235.
  • the receiving unit 231 receives a voice label selection request from the determination unit 22.
  • the model selection unit 232 selects a model to be used for selecting an audio label from the model DB 53.
  • the model selection unit 232 selects either an initial model or a learning model based on the degree of fit.
  • the degree of fit is a value for determining which of the initial model and the learning model has higher accuracy. The degree of fit will be described in detail later.
  • the voice label candidate extraction unit 233 extracts voice labels that are candidates for the call voice to be presented to the user from the familiarity DB 51 based on the model selected by the model selection unit 232 and the concentration level of the user.
  • the voice label selection unit 234 selects a voice label for generating a call voice to be presented to the user from the voice label extracted by the voice label candidate extraction unit 233.
  • the transmission unit 235 transmits the information of the voice label selected by the voice label selection unit 234 to the generation unit 24.
  • the generation unit 24 generates a call voice for encouraging the user to awaken based on the voice label received from the transmission unit 235.
  • the generation unit 24 acquires the voice synthesis parameter corresponding to the voice label received from the transmission unit 235 from the voice synthesis parameter DB 54. Then, the generation unit 24 generates a call voice based on the call text data recorded in the call text DB 55 and the voice synthesis parameter.
  • the presentation unit 25 presents the call voice generated by the generation unit 24 to the user.
  • the presentation unit 25 reproduces the call voice generated by the generation unit 24 by using the speaker 7.
  • the learning unit 26 learns the model recorded in the model DB 53.
  • the learning unit 26 performs learning by using, for example, binary classification learning using a correct answer label.
  • FIGS. 6A and 6B are flowcharts showing the voice presentation process by the voice generator 1. The processes of FIGS. 6A and 6B may be performed periodically.
  • In step S1, the acquisition unit 21 acquires the user's arousal degree.
  • The acquisition unit 21 outputs the acquired arousal degree to the determination unit 22 and holds it until the timing at which the user's reaction is acquired after the call voice is presented.
  • In step S2, the determination unit 22 determines whether or not the arousal degree acquired by the acquisition unit 21 is equal to or less than the threshold value.
  • When it is determined in step S2 that the arousal degree exceeds the threshold value, that is, when the user is awake, the processes of FIGS. 6A and 6B are terminated.
  • When it is determined in step S2 that the arousal degree is equal to or less than the threshold value, that is, when the user is not awake, for example because of drowsiness, the process proceeds to step S3.
  • In step S3, the determination unit 22 transmits a voice label selection request to the selection unit 23.
  • The model selection unit 232 then refers to the user log DB 52 and acquires the number of reactions, that is, the total count of logs whose reaction field is "yes".
  • In step S4, the model selection unit 232 determines whether or not the number of reactions is less than a threshold value.
  • This threshold value is for determining whether or not a usable learning model is recorded in the model DB 53.
  • The threshold is set to, for example, 2; in this case, the number of reactions is judged to be less than the threshold when it is 0 or 1.
  • When the number of reactions is less than the threshold value, the process proceeds to step S5.
  • Otherwise, the process proceeds to step S6.
  • In step S5, the model selection unit 232 selects the initial values, that is, the initial model, from the model DB 53, and outputs the selected initial model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
  • In step S6, the model selection unit 232 calculates the degree of fit.
  • The model selection unit 232 first acquires all past logs, with and without reactions, from the user log DB 52, and then calculates the degree of fit of both the initial model and the learning model.
  • For example, the accuracy obtained by comparing each model's correct/incorrect output for the concentration value of each log with the presence or absence of a reaction in that log can be used as the degree of fit.
  • The degree of fit is not limited to the accuracy; other metrics calculated from the model's correct/incorrect outputs and the logged reactions, such as the precision, the recall, or the F-measure, may be used.
  • The precision is the proportion of the logs predicted to be correct in which the user actually reacted.
  • The recall is the proportion of the logs in which the user actually reacted that were predicted to be correct.
  • The F-measure is the harmonic mean of the recall and the precision; for example, it can be calculated as 2 × Recall × Precision / (Recall + Precision).
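The following sketch computes these degree-of-fit metrics by comparing a model's predictions with the logged reactions; the data and names are illustrative, not the patent's implementation.

    # "Predicted" means the model classified the log's (familiarity,
    # concentration) point as correct; "actual" means the log recorded
    # a user reaction.
    def fit_metrics(predicted: list[bool], actual: list[bool]) -> dict:
        tp = sum(p and a for p, a in zip(predicted, actual))
        fp = sum(p and not a for p, a in zip(predicted, actual))
        fn = sum(a and not p for p, a in zip(predicted, actual))
        tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
        accuracy = (tp + tn) / len(actual)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (2 * recall * precision / (recall + precision)
                     if recall + precision else 0.0)
        return {"accuracy": accuracy, "precision": precision,
                "recall": recall, "F": f_measure}

    print(fit_metrics([True, True, False, False],
                      [True, False, False, True]))
    # {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'F': 0.5}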
  • In step S7, the model selection unit 232 compares the degrees of fit of the initial model and the learning model, and determines whether or not the degree of fit of the learning model is higher.
  • When it is determined in step S7 that the degree of fit of the initial model is higher, the process proceeds to step S5, where the model selection unit 232 selects the initial values, that is, the initial model.
  • When it is determined in step S7 that the degree of fit of the learning model is higher, the process proceeds to step S8.
  • In step S8, the model selection unit 232 selects the learning model and outputs it to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
  • In step S9, the voice label candidate extraction unit 233 acquires the user's current concentration degree from the acquisition unit 21.
  • The voice label candidate extraction unit 233 then extracts candidate voice labels for generating the call voice from the familiarity DB 51.
  • For example, from the voice labels registered in the familiarity DB 51, the voice label candidate extraction unit 233 extracts all voice labels that the selected model classifies as correct for the current concentration value.
  • A voice label classified as correct is one for which presenting the call voice is expected both to elicit a reaction from the user and to raise the arousal degree.
  • The voice label selection unit 234 selects one voice label from the voice labels extracted by the voice label candidate extraction unit 233.
  • When selecting a voice label, the voice label selection unit 234 obtains, for example, a weighted winning probability based on the number of past presentations, and then selects one voice label by random sampling according to that probability.
  • The weighted winning probability can be calculated, for example, according to equation (1).
  • The weighted winning probability may also be calculated by an equation different from equation (1).
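Equation (1) itself is not reproduced in this text, so the weighting below is only an assumption chosen to match the stated goal of suppressing habituation: labels presented less often receive a higher selection probability.

    import random

    def pick_voice_label(presentations: dict[str, int]) -> str:
        """presentations maps voice label -> number of past presentations.
        Assumed weighting: weight 1 / (n + 1), then a weighted random choice."""
        labels = list(presentations)
        weights = [1.0 / (n + 1) for n in presentations.values()]
        return random.choices(labels, weights=weights, k=1)[0]

    print(pick_voice_label({"mother": 5, "colleague": 1, "newscaster": 0}))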
  • In step S12, the transmission unit 235 transmits information indicating the voice label selected by the voice label selection unit 234 to the generation unit 24.
  • The generation unit 24 acquires the voice synthesis parameter corresponding to the received voice label from the voice synthesis parameter DB 54, and generates a call voice based on that parameter and call text data randomly selected from the call text DB 55.
  • The call voice can be generated by a voice synthesis process using the voice synthesis parameters. After that, the process proceeds to step S13.
  • In step S13, the presentation unit 25 presents the call voice generated by the generation unit 24 to the user using the speaker 7.
  • In step S14, the acquisition unit 21 acquires the user's reaction and outputs the user reaction information to the determination unit 22.
  • In step S15, the determination unit 22 determines whether or not there was a reaction from the user. When it is determined in step S15 that there was no reaction, the process proceeds to step S20. When it is determined that there was a reaction, the process proceeds to step S16.
  • In step S16, the determination unit 22 requests the acquisition unit 21 to acquire a new arousal degree.
  • The acquisition unit 21 acquires the new arousal degree.
  • The new arousal degree may be acquired in the same manner as the arousal degree.
  • In step S17, the determination unit 22 acquires the new arousal degree from the acquisition unit 21 and determines whether or not the new arousal degree is equal to or less than the threshold value.
  • The threshold value in step S17 may be the same as or different from the threshold value in step S2.
  • When the new arousal degree exceeds the threshold value, the process proceeds to step S18.
  • When the new arousal degree is equal to or less than the threshold value, the process proceeds to step S20.
  • In step S18, the acquisition unit 21 sets the correct answer label to correct (〇).
  • In step S19, the acquisition unit 21 acquires the average arousal change value from the familiarity DB 51 and updates it using the newly calculated arousal degree change amount and the previously acquired average. The acquisition unit 21 then registers the concentration degree, the with-reaction information, the arousal degree, the new arousal degree, the arousal degree change amount, and the correct answer label in the user log DB 52 in association with the log generation date and time, the voice label, the familiar target, and the familiarity degree. After that, the process proceeds to step S22.
  • In step S20, the acquisition unit 21 sets the correct answer label to incorrect (×).
  • In step S21, the acquisition unit 21 registers the concentration degree, the without-reaction information, the arousal degree, and the correct answer label in the user log DB 52 in association with the log generation date and time, the voice label, the familiar target, and the familiarity degree. After that, the process proceeds to step S22.
  • In step S22, the learning unit 26 refers to the user log DB 52, acquires the number of reactions, and determines whether or not the number of reactions is less than a threshold value.
  • This threshold value is for determining whether or not the information necessary for learning has been accumulated.
  • The threshold is set to, for example, 2; in this case, the number of reactions is judged to be less than the threshold when it is 0 or 1.
  • When the number of reactions is less than the threshold value, the processes of FIGS. 6A and 6B are terminated.
  • Otherwise, the process proceeds to step S23.
  • In step S23, the learning unit 26 carries out binary classification learning and records the learning result in the model DB 53. After that, the processes of FIGS. 6A and 6B are completed.
  • In step S23, the learning unit 26 acquires, for example, the correct answer labels recorded in the user log DB 52 together with the familiarity and concentration degrees associated with them, and generates a binary classification model of the voice labels in the two-dimensional space of "familiarity" and "concentration".
  • FIG. 7 is a diagram showing an image of a binary classification model using "familiarity" and "concentration".
  • Voice labels within the area a are classified as correct answers (〇); voice labels outside the area a are classified as incorrect answers (×).
  • Various binary classification methods such as logistic regression, an SVM (Support Vector Machine), or a neural network can be used to generate the model.
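As an illustration, the sketch below trains such a binary classifier with logistic regression, one of the methods named above; the training rows pair (familiarity, concentration) with the stored correct answer label (〇 = 1, × = 0), and the data values are fabricated.

    from sklearn.linear_model import LogisticRegression

    X = [[0.9, 0.2],   # [familiarity, concentration] per logged presentation
         [0.8, 0.7],
         [0.3, 0.6],
         [0.2, 0.1]]
    y = [1, 1, 0, 0]   # correct answer labels (1 = 〇, 0 = ×)

    model = LogisticRegression().fit(X, y)

    # Extracting candidates for the current concentration amounts to asking
    # which familiarity values the model classifies as correct.
    print(model.predict([[0.85, 0.5], [0.25, 0.5]]))  # e.g. [1 0]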
  • According to the embodiment, when it is determined that the user is not awake, a call is made to the user using a voice familiar to the user. Even when the user is drowsy, the cocktail party effect therefore makes it easier for the user to notice and hear the call voice, so the arousal degree is expected to improve in a short time. Further, in the embodiment, the familiarity degree and the concentration degree are used in selecting the familiar voice, so a call voice that the user is more likely to respond to can be presented.
  • According to the embodiment, the voice labels are classified using a learning model with the two axes of familiarity and concentration, so as learning progresses, voice label candidates better suited to the user are expected to be extracted. Further, a voice label for generating the voice is selected from the extracted candidates by random sampling based on the number of past presentations. This suppresses the habituation and boredom that would result from frequently presenting call voices with the same voice label. As a result, even when the voice generation device 1 is used over a long period, the user can still be expected to react to the call voice, and the user's arousal degree can be expected to rise.
  • In the embodiment, the candidate voice labels used for generating the call voice are extracted from the voice labels classified using a learning model with the two axes of familiarity and concentration.
  • However, the learning model does not necessarily have to be used to extract these candidates.
  • For example, the candidates may be obtained by extracting a plurality of voice labels having higher weighted sums of the familiarity degree and the concentration degree.
  • In the embodiment, the selection of the voice label based on familiarity and concentration, the generation of the call voice, and the learning of the learning model are all performed in the voice generation device 1.
  • However, the voice label selection, the call voice generation, and the learning of the learning model may also be performed in separate devices.
  • In the embodiment, the learning model is configured to classify voice labels in the two-dimensional space of familiarity and concentration.
  • The learning model may further be configured to classify voice labels in a three-dimensional space that also includes the arousal degree change amount.
  • FIG. 8 is a diagram showing an image of a binary classification model using "familiarity", "concentration", and "arousal degree change amount".
  • This model is a binary classification model using a classification plane, defined in the three-dimensional space of "familiarity", "concentration", and "arousal degree change amount", for classifying voice labels.
  • Voice labels located above the classification plane P are classified as correct answers (〇).
  • Voice labels located below the classification plane P are classified as incorrect answers (×).
  • Various binary classification methods such as logistic regression, an SVM (Support Vector Machine), or a neural network can be used to generate this model as well.
  • The arousal degree change amount characterizes not only whether the user reacted, which determines the correct answer label, but also the quality of the reaction. Therefore, by adopting the "arousal degree change amount" as an axis, the accuracy of the correct answer label determination is expected to improve further.
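Under the same illustrative logistic-regression assumption as before, extending the model to this three-dimensional space amounts to adding the arousal degree change amount as a third feature column; the data are again fabricated.

    from sklearn.linear_model import LogisticRegression

    # [familiarity, concentration, arousal degree change amount]
    X = [[0.9, 0.2, 2.0],
         [0.8, 0.7, 1.5],
         [0.3, 0.6, 0.2],
         [0.2, 0.1, 0.0]]
    y = [1, 1, 0, 0]

    model_3d = LogisticRegression().fit(X, y)
    print(model_3d.predict([[0.85, 0.5, 1.8]]))  # e.g. [1]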
  • In the embodiment, a call voice may also be generated to propose an active action that is expected to improve the arousal level, even if the action takes some time.
  • That is, a call making such a suggestion may be presented to the user.
  • In this case, the user is given an opportunity to act, and as a result the arousal degree is expected to improve in a short time.
  • Since the call voice proposing the active action is also produced with a voice familiar to the user, the cocktail party effect again makes it easier for the user to hear the call voice.
  • In the embodiment, the call text used for the call voice is randomly selected from the templates recorded in the call text DB 55.
  • These templates can be modified as appropriate. For example, by collecting the user's daily conversations and the like, a template may be changed so as to include words that appear frequently in those conversations and that easily attract the user's attention.
  • Further, the volume and the like of the call voice may be changed according to the arousal degree.
  • Each process according to the above-described embodiment can be stored as a program executable by a processor, which is a computer.
  • The program can be stored and distributed in a storage medium of an external storage device such as a magnetic disk, an optical disk, or a semiconductor memory.
  • The processor reads the program stored in the storage medium of the external storage device, and its operation is controlled by the read program, whereby the above-described processes are executed.
  • The present invention is not limited to the above embodiment and can be variously modified at the implementation stage without departing from the gist thereof.
  • The embodiments may also be carried out in combination as appropriate, in which case the combined effects are obtained.
  • Further, the above-described embodiment includes various inventions, and various inventions can be extracted by combinations selected from the plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, as long as the problem can be solved and the effects are obtained, the configuration from which those constituent elements are deleted can be extracted as an invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Anesthesiology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Hematology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Biomedical Technology (AREA)
  • Psychology (AREA)
  • Traffic Control Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

This voice generation device (1) has an acquisition unit (21), a determination unit (22), a selection unit (23), and a generation unit (24). The acquisition unit acquires an alertness degree representing a user's degree of alertness from sleep to excitement. The determination unit determines, on the basis of the alertness degree, whether the user is in a state of alertness. When the user is not in a state of alertness, the selection unit selects a voice prompting the user to be alert from among a plurality of voice candidates, this selection being made on the basis of a familiarity degree representing the degree to which the user is familiar with each of the plurality of voice candidates and a concentration degree representing the user's current degree of concentration. On the basis of the selected voice, the generation unit generates a calling-out voice to be presented to the user.

Description

Voice generation device, voice generation method, and voice generation program

 This embodiment relates to a voice generation device, a voice generation method, and a voice generation program.

 It is difficult for a person to get through the day without feeling any drowsiness. This is because human brain function has a short-cycle fluctuation rhythm of arousal called the ultradian rhythm. Non-Patent Documents 1 and 2 explain that sleep is a state at the opposite pole from arousal, and the arousal level is an index showing the degree of arousal from sleep to excitement. Further, Non-Patent Document 2 defines "sleepiness" as a state in which the arousal level is lower than a moderate arousal level. For this reason, even when drowsiness is felt during work in a remote environment such as working from home or during a distance lesson, it is desirable to raise the arousal level in as short a time as possible.

 The embodiment provides a voice generation device, a voice generation method, and a voice generation program for urging the user to awaken in a short time.

 The voice generation device according to the embodiment includes: an acquisition unit that acquires an arousal degree indicating the user's degree of arousal from sleep to excitement; a determination unit that determines, based on the arousal degree, whether or not the user is awake; a selection unit that, when the user is not awake, selects a voice for prompting the user's awakening from a plurality of voice candidates based on a familiarity degree indicating how familiar the user is with each of the voice candidates and a concentration degree indicating the user's current degree of concentration; and a generation unit that generates, based on the selected voice, a call voice to be presented to the user.

 According to the embodiment, a voice generation device, a voice generation method, and a voice generation program for urging the user to awaken in a short time are provided.

 FIG. 1 is a diagram showing the hardware configuration of an example of a voice generation device according to an embodiment. FIG. 2 is a diagram showing the configuration of an example of the familiarity DB. FIG. 3 is a diagram showing the configuration of an example of the user log DB. FIG. 4 is a diagram showing the configuration of an example of the call statement DB. FIG. 5 is a functional block diagram of the voice generation device. FIG. 6A and FIG. 6B are flowcharts showing voice presentation processing by the voice generation device. FIG. 7 is a diagram showing an image of a binary classification model using "familiarity" and "concentration". FIG. 8 is a diagram showing an image of a binary classification model using "familiarity", "concentration", and "arousal degree change amount".
 Hereinafter, embodiments will be described with reference to the drawings. FIG. 1 is a diagram showing the hardware configuration of an example of a voice generation device according to an embodiment. The voice generation device 1 according to the embodiment emits a call voice urging the user to awaken when the user is not in an awake state, such as when the user is drowsy.

 In the embodiment, whether or not the user is awake is determined based on the "arousal degree". The arousal degree in the embodiment is an index indicating the degree of arousal corresponding to the arousal level. The arousal level corresponds to the activity level of the cerebrum and represents the degree of arousal from sleep to excitement. The arousal level is measured from eye movements, blinking activity, electrodermal activity, reaction time to stimuli, and the like. The arousal degree in the embodiment is therefore calculated from any one of these measurements or from a combination of them. The arousal degree is, for example, a value that increases as the user moves from a sleep state toward an excited state, and it may be a continuous value or a discrete value such as Level 1, Level 2, and so on. When the arousal degree is calculated by combining the values of eye movement, blinking activity, electrodermal activity, and reaction time to a stimulus, the manner of combination is not particularly limited; for example, the values may simply be summed or combined by weighted addition.

 The voice generation device 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, a speaker 7, a camera 8, an input device 9, a display 10, and a communication module 11. The voice generation device 1 is any of various terminals such as a personal computer (PC), a smartphone, or a tablet terminal, and may also be mounted on various other devices used by the user. The voice generation device 1 does not have to have all the components shown in FIG. 1. For example, the microphone 6, the speaker 7, the camera 8, and the display 10 may be devices separate from the voice generation device 1.

 The processor 2 is a control circuit, such as a CPU, that controls the overall operation of the voice generation device 1. The processor 2 does not have to be a CPU and may be an ASIC, an FPGA, a GPU, or the like. The processor 2 also does not have to be a single CPU and may be composed of a plurality of CPUs or the like.

 The ROM 3 is a non-volatile memory such as a flash memory and stores, for example, the start-up program of the voice generation device 1. The RAM 4 is a volatile memory such as an SDRAM and can be used as a working memory for various processes in the voice generation device 1.

 The storage 5 is a storage device such as a flash memory, a hard disk drive (HDD), or a solid state drive (SSD). Various programs used by the voice generation device 1 are stored in the storage 5. The storage 5 may also store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a voice synthesis parameter database (DB) 54, and a call statement database (DB) 55. These databases are described in detail later.

 The microphone 6 is a device that converts input voice into a voice signal, which is an electric signal. The voice signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5. For example, the voice synthesis parameters for synthesizing the call voice can be acquired from voice input via the microphone 6.

 The speaker 7 is a device that outputs voice based on an input voice signal.

 The camera 8 images the user and acquires an image of the user. The user's image obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5. The user's image is used, for example, to acquire the arousal degree or to acquire the user's reaction to the call voice.

 The input device 9 is a mechanical input device such as a button, a switch, a keyboard, or a mouse, or a software input device using a touch sensor. The input device 9 receives various inputs from the user and outputs signals corresponding to those inputs to the processor 2.

 The display 10 is a display such as a liquid crystal display or an organic EL display and displays various images.

 The communication module 11 is a device with which the voice generation device 1 carries out communication, for example with a server provided outside the voice generation device 1. The communication method is not particularly limited; communication may be wireless or wired.

 Next, the familiarity database (DB) 51, the user log database (DB) 52, the model database (DB) 53, the voice synthesis parameter database (DB) 54, and the call statement database (DB) 55 will be described.
 FIG. 2 is a diagram showing the configuration of an example of the familiarity DB 51. The familiarity DB 51 is a database that records the user's "familiarity". The familiarity DB 51 records, in association with one another, for example, a user ID, a voice label, a familiar target, a familiarity degree, a number of reactions, a number of presentations, and an average arousal change value.

 The "user ID" is an ID assigned to each user of the voice generation device 1. The user ID may be associated with user attribute information such as a user name.

 The "voice label" is a label uniquely attached to each candidate for the call voice. Any label can be used; for example, the name of the familiar target may be used as the voice label.

 The "familiar target" is a person with whom the user regularly talks or a source whose voice the user often hears. The familiar target does not necessarily have to be a person.

 The "familiarity degree" is the degree of the user's familiarity with the voice of the corresponding familiar target. The familiarity degree can be calculated from, for example, the frequency of communication with the familiar target via SNS or the like, the frequency of daily conversation with the familiar target, and the frequency with which the user hears the familiar target in daily life: the higher these frequencies, the larger the familiarity value. The familiarity degree may also be acquired by the user's self-report.

 The "number of reactions" is the number of times the user reacted to call voices generated based on the corresponding voice label. The "number of presentations" is the number of times call voices generated based on the corresponding voice label were presented to the user. Dividing the number of reactions by the number of presentations gives the reaction probability, that is, the probability that the user reacts to a call voice generated based on the corresponding voice label.

 The "average arousal change value" is the average of the arousal degree change amounts of the user in response to call voices generated based on the corresponding voice label. The arousal degree change amount is described later.
 図3は、ユーザログDB52の一例の構成を示す図である。ユーザログDB52は、ユーザによる音声生成装置1の利用に係るログを記録したデータベースである。ユーザログDB52は、例えばログ発生日時と、ユーザIDと、音声ラベルと、なじみ対象と、集中度と、反応有無と、覚醒度と、新覚醒度と、覚醒度変化量と、正解ラベルとを関連付けて記録している。ユーザIDと、音声ラベルと、なじみ対象は、なじみ度DB51と同じものである。 FIG. 3 is a diagram showing the configuration of an example of the user log DB 52. The user log DB 52 is a database that records logs related to the use of the voice generation device 1 by the user. The user log DB 52 has, for example, a log generation date and time, a user ID, a voice label, a familiar target, a concentration level, a reaction presence / absence, an alertness level, a new alertness level, an arousal level change amount, and a correct answer label. It is associated and recorded. The user ID, the voice label, and the familiar object are the same as the familiarity DB 51.
 「ログ発生日時」は、ユーザによる音声生成装置1の利用があった日時である。ログ発生日時は、例えばユーザに対する呼びかけ音声の提示がされる毎に記録される。 The "log generation date and time" is the date and time when the user used the voice generator 1. The log generation date and time is recorded, for example, each time a call voice is presented to the user.
 「反応有無」は、ユーザに対して呼びかけ音声が提示された後のユーザの反応の有無の情報である。ユーザの反応があったときには、「あり」が記録される。ユーザの反応がなかったときには、「なし」が記録される。 "Presence / absence of reaction" is information on the presence / absence of reaction of the user after the call voice is presented to the user. When there is a user reaction, "yes" is recorded. "None" is recorded when there is no user response.
 「集中度」は、呼びかけ音声の提示の際のユーザの集中の度合いである。集中度は、例えば作業中のユーザの姿勢、行動をカメラ8で得られる画像から推定することで測定され得る。集中度の値は、ユーザが集中していると考えられる姿勢、行動をする毎に高くなり、ユーザが集中していないと考えられる姿勢、行動をする毎に低くなるように算出される。また、作業中のユーザの瞳孔の開き具合をカメラ8で得られる画像から推定することで測定され得る。集中度の値は、瞳孔がより散瞳している場合に高くなり、瞳孔がより縮瞳している場合には低くなるように算出される。集中度は、例えばLv(Level)1、Lv2、…といった離散値であってよい。なお、集中度の取得手法は、特定の手法には限定されない。 "Concentration ratio" is the degree of concentration of the user when presenting the call voice. The degree of concentration can be measured, for example, by estimating the posture and behavior of the user during work from the image obtained by the camera 8. The value of the degree of concentration is calculated so as to increase each time the user thinks that the user is concentrated and takes an action, and lowers each time the user thinks that the user is not concentrated and takes an action. Further, the degree of opening of the pupil of the user during work can be measured by estimating from the image obtained by the camera 8. The concentration value is calculated to be higher when the pupil is more mydriatic and lower when the pupil is more miotic. The degree of concentration may be a discrete value such as Lv (Level) 1, Lv2, .... The method for acquiring the degree of concentration is not limited to a specific method.
 The "arousal level" is the arousal level acquired before the voice generation device 1 presents the call voice.
 The "new arousal level" is the arousal level newly acquired after the user reacted. It is not recorded when the user did not react.
 The "arousal-level change amount" represents the change in the arousal level before and after the user's reaction. It is obtained, for example, as the difference between the new arousal level and the previous arousal level; it may instead be their ratio or the like. It is not recorded when the user did not react.
 The "correct-answer label" is a correct/incorrect label for supervised learning; for example, a correct answer is recorded as 〇 and an incorrect answer as ×. In the embodiment, a correct answer is recorded when, in the voice presentation operation described later, the user reacts to a call voice from the voice generation device 1 and the arousal level consequently rises above a threshold. An incorrect answer is recorded when the user does not react to the call voice, or when the user reacts but the arousal level remains at or below the threshold.
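 The labeling rule above can be written compactly as follows; the argument names and string return values are assumptions for illustration, while the condition itself (a reaction followed by a rise above the threshold) is taken from the description.

```python
# Minimal sketch of the correct-answer labeling rule in the embodiment.
# `responded` is the response presence/absence, `new_arousal` the arousal
# level re-acquired after the reaction, `threshold` the awakening threshold.

def correct_answer_label(responded: bool, new_arousal: float, threshold: float) -> str:
    if responded and new_arousal > threshold:
        return "correct"    # recorded as 〇
    return "incorrect"      # recorded as ×
```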
 The model DB 53 records models for the voice-label classification used to extract voice-label candidates. In the embodiment, a model classifies voice labels as correct or incorrect in the two-dimensional space of familiarity and concentration. The models include an initial model and a learning model. The initial model is generated from initial values stored in the model DB 53 and is not updated by learning. Here, the initial values are the two constants (a, b) of a linear function (y = ax + b) for classifying voice labels, defined in the two-dimensional space of, for example, familiarity y and concentration x. The binary classification model using the linear function y = ax + b generated from these initial values is the initial model: a voice label whose familiarity exceeds y = ax + b in the xy space is classified as correct (〇), and any other voice label is classified as incorrect (×). The learning model is a trained model generated from the initial model; it can be a binary classification model whose constants (a, b) differ from those of the initial model.
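 The initial model's decision rule reduces to a single comparison against the stored line, as in the following sketch; the function and argument names are assumptions for illustration.

```python
# Minimal sketch of the initial model: a fixed linear boundary y = a*x + b
# in the (concentration x, familiarity y) plane, with a and b the stored
# initial values that learning never updates.

def initial_model_classify(familiarity: float, concentration: float,
                           a: float, b: float) -> bool:
    """True means correct (〇): familiarity lies above the line y = a*x + b."""
    return familiarity > a * concentration + b
```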
 The voice synthesis parameter DB 54 records voice synthesis parameters: data used to synthesize the voice of a target familiar to the user. For example, a voice synthesis parameter may be feature data extracted from voice data collected in advance through the microphone 6, or voice synthesis parameters acquired or defined by another system may be recorded in advance. Each voice synthesis parameter is associated with a voice label.
 FIG. 4 shows an example configuration of the call text DB 55. The call text DB 55 records template data of various call texts for encouraging the user's awakening. The call text is not particularly limited, but it desirably includes an address using the user's name; this enhances the cocktail party effect described later.
 The familiarity DB 51, user log DB 52, model DB 53, voice synthesis parameter DB 54, and call text DB 55 need not be stored in the storage 5. For example, they may be stored in a server separate from the voice generation device 1; in that case, the voice generation device 1 accesses the server through the communication module 11 and acquires the necessary information.
 FIG. 5 is a functional block diagram of the voice generation device 1. As shown in FIG. 5, the voice generation device 1 has an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26. The operations of these units are realized, for example, by the processor 2 executing a program stored in the storage 5. The determination unit 22, selection unit 23, generation unit 24, presentation unit 25, and learning unit 26 may instead be realized by hardware separate from the processor 2.
 The acquisition unit 21 acquires the user's arousal level, and also acquires the user's reaction to the call voice. As described above, the arousal level is calculated from eye movement, blinking activity, electrodermal activity, reaction time to a stimulus, or a combination of these. The eye movement, blinking activity, and reaction time used to calculate the arousal level can be measured, for example, from images of the user captured by the camera 8; the reaction time to a stimulus may also be measured from the audio signal acquired by the microphone 6. Electrodermal activity can be measured, for example, by a sensor worn on the user's arm. The user's reaction can be obtained by determining, for example from images captured by the camera 8, whether the user looked toward the source of the sound after the call voice was presented. The acquisition unit 21 may also be configured to acquire, by communication, an arousal level or a user reaction computed outside the voice generation device 1.
 The determination unit 22 determines, based on the arousal level acquired by the acquisition unit 21, whether the user is in an awake state. When it determines that the user is not in an awake state, it transmits a voice-label selection request to the receiving unit 231 of the selection unit 23. The determination unit 22 makes this determination by comparing the arousal level with a predetermined threshold, that is, a threshold for judging whether the user is awake, which is stored, for example, in the storage 5. The determination unit 22 also determines the presence or absence of a user reaction based on the reaction information acquired by the acquisition unit 21.
 When it is determined that the user is not in an awake state, the selection unit 23 selects the voice label of a candidate voice for encouraging the user's awakening. The selection unit 23 has a receiving unit 231, a model selection unit 232, a voice label candidate extraction unit 233, a voice label selection unit 234, and a transmission unit 235.
 The receiving unit 231 receives the voice-label selection request from the determination unit 22.
 The model selection unit 232 selects, from the model DB 53, the model to be used for voice-label selection. It selects either the initial model or the learning model based on the degree of fit, a value for judging which of the two models has the higher accuracy. The degree of fit is described in detail later.
 The voice label candidate extraction unit 233 extracts, from the familiarity DB 51, voice labels that are candidates for the call voice to be presented to the user, based on the model selected by the model selection unit 232 and the user's concentration level.
 The voice label selection unit 234 selects, from the voice labels extracted by the voice label candidate extraction unit 233, the voice label to be used to generate the call voice presented to the user.
 The transmission unit 235 transmits information on the voice label selected by the voice label selection unit 234 to the generation unit 24.
 The generation unit 24 generates a call voice for encouraging the user's awakening based on the voice label received from the transmission unit 235. It acquires the voice synthesis parameter associated with that voice label from the voice synthesis parameter DB 54, and then generates the call voice from the call text data recorded in the call text DB 55 and the voice synthesis parameter.
 The presentation unit 25 presents the call voice generated by the generation unit 24 to the user, for example by reproducing it through the speaker 7.
 The learning unit 26 trains the model recorded in the model DB 53, for example by binary classification learning using the correct-answer labels.
 Next, the operation of the voice generation device 1 is described. FIGS. 6A and 6B are flowcharts showing the voice presentation process performed by the voice generation device 1. The process of FIGS. 6A and 6B may be performed periodically.
 In step S1, the acquisition unit 21 acquires the user's arousal level and outputs it to the determination unit 22. The acquisition unit 21 also retains the acquired arousal level until the user's reaction to the presented call voice is acquired.
 In step S2, the determination unit 22 determines whether the arousal level acquired by the acquisition unit 21 is at or below the threshold. When the arousal level exceeds the threshold, that is, when the user is awake, the process of FIGS. 6A and 6B ends. When the arousal level is at or below the threshold, that is, when the user is not in an awake state (for example, the user is drowsy), the process proceeds to step S3.
 In step S3, the determination unit 22 transmits a voice-label selection request to the selection unit 23. When the receiving unit 231 receives the request, the model selection unit 232 refers to the user log DB 52 and acquires the response count: the total number of "present" entries in the response presence/absence field.
 In step S4, the model selection unit 232 determines whether the response count is below a threshold for judging whether a usable learning model is recorded in the model DB 53. The threshold is set to 2, for example; in that case, a response count of 0 or 1 is judged to be below the threshold. When the response count is below the threshold, the process proceeds to step S5; otherwise, it proceeds to step S6.
 In step S5, the model selection unit 232 selects the initial values, that is, the initial model, from the model DB 53 and outputs the selected initial model to the voice label candidate extraction unit 233. The process then proceeds to step S9.
 In step S6, the model selection unit 232 calculates the degree of fit. It first acquires all past logs, both with and without responses, from the user log DB 52, and then calculates the degree of fit of both the initial model and the learning model. For example, the model selection unit 232 can use as the degree of fit the accuracy obtained by comparing each model's correct/incorrect output for the concentration value of each log against the response presence/absence recorded in that log. The degree of fit is not limited to accuracy; it may be the precision, recall, F-measure, or the like, likewise calculated from the models' correct/incorrect outputs and the logged response presence/absence. Precision is the proportion of data predicted to be correct for which the user's response was actually "present". Recall is the proportion of logs with an actual user response that were predicted to be correct. The F-measure is the harmonic mean of recall and precision; for example, it can be calculated as 2·Recall·Precision / (Recall + Precision).
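 A minimal sketch of these fit metrics follows, computed from a model's correct/incorrect predictions and the logged response presence/absence; the variable names are assumptions for illustration.

```python
# Minimal sketch of the degree-of-fit metrics named above: accuracy,
# precision, recall, and F-measure (the harmonic mean of recall and
# precision, 2*R*P / (R + P)).

def fit_metrics(predicted_correct: list[bool], responded: list[bool]) -> dict:
    tp = sum(p and r for p, r in zip(predicted_correct, responded))
    fp = sum(p and not r for p, r in zip(predicted_correct, responded))
    fn = sum(not p and r for p, r in zip(predicted_correct, responded))
    tn = sum((not p) and (not r) for p, r in zip(predicted_correct, responded))
    accuracy = (tp + tn) / len(responded)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```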
 In step S7, the model selection unit 232 compares the degrees of fit of the initial model and the learning model and determines whether the learning model's degree of fit is higher. When the initial model's degree of fit is higher, the process proceeds to step S5, and the model selection unit 232 selects the initial model. When the learning model's degree of fit is higher, the process proceeds to step S8.
 In step S8, the model selection unit 232 selects the learning model and outputs it to the voice label candidate extraction unit 233. The process then proceeds to step S9.
 In step S9, the voice label candidate extraction unit 233 acquires the user's current concentration level from the acquisition unit 21.
 In step S10, the voice label candidate extraction unit 233 extracts, from the familiarity DB 51, candidate voice labels to be used to generate the call voice. For example, it extracts all registered voice labels that the model classifies as correct for the current concentration value. A voice label classified as correct is one for which presenting a call voice is expected both to produce a user reaction and to raise the arousal level.
 In step S11, the voice label selection unit 234 selects one voice label from the voice labels extracted by the voice label candidate extraction unit 233. In selecting a voice label, the voice label selection unit 234 obtains, for example, a weighted selection probability for each label based on its past presentation count, and then selects one voice label by random sampling according to those probabilities. The weighted selection probability can be calculated, for example, according to equation (1); it may also be calculated by a different formula. [Equation (1), which defines the weighted selection probability, appears only as an image in the original publication and is not reproduced here.]
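 Because equation (1) is published only as an image, the weighting below is an assumed inverse-frequency scheme chosen to match the stated aim (labels presented less often are favored); it is an illustration, not the patented formula.

```python
# Minimal sketch of step S11: weighted random sampling over the extracted
# candidate labels. The 1 / (1 + presentation count) weight is an assumption
# standing in for equation (1), which is not reproduced in this text.

import random

def select_voice_label(candidates: list[str], presentations: dict[str, int]) -> str:
    weights = [1.0 / (1 + presentations.get(label, 0)) for label in candidates]
    total = sum(weights)
    probabilities = [w / total for w in weights]   # weighted selection probabilities
    return random.choices(candidates, weights=probabilities, k=1)[0]
```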
 In step S12, the transmission unit 235 transmits information indicating the voice label selected by the voice label selection unit 234 to the generation unit 24. The generation unit 24 acquires the voice synthesis parameter corresponding to the received voice label from the voice synthesis parameter DB 54, and then generates a call voice from a call text selected at random from the call text DB 55 and the voice synthesis parameter; the call voice can be generated by voice synthesis processing using the voice synthesis parameter. The process then proceeds to step S13.
 In step S13, the presentation unit 25 presents the call voice generated by the generation unit 24 to the user through the speaker 7.
 In step S14, the acquisition unit 21 acquires the user's reaction and outputs the reaction information to the determination unit 22.
 In step S15, the determination unit 22 determines whether the user reacted. When there was no reaction, the process proceeds to step S20; when there was a reaction, it proceeds to step S16.
 In step S16, the determination unit 22 requests the acquisition unit 21 to acquire the new arousal level, and the acquisition unit 21 acquires it. The new arousal level may be acquired in the same way as the arousal level.
 In step S17, the determination unit 22 obtains the new arousal level from the acquisition unit 21 and determines whether it is at or below a threshold. The threshold in step S17 may be the same as or different from the threshold in step S2. When the new arousal level is above the threshold, the process proceeds to step S18; when it is at or below the threshold, it proceeds to step S20.
 In step S18, the acquisition unit 21 sets the correct-answer label to correct (〇).
 In step S19, the acquisition unit 21 acquires the average arousal-level change from the familiarity DB 51 and updates it using the newly calculated arousal-level change amount and the previously acquired average. The acquisition unit 21 also registers the concentration level, the "response present" information, the arousal level, the new arousal level, the arousal-level change amount, and the correct-answer label in the user log DB 52, associated with the log date and time, voice label, familiar target, and familiarity. The process then proceeds to step S22.
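 If the stored average is maintained as an arithmetic mean over the past responses, the update in step S19 can be sketched as follows; the incremental-mean formula and argument names are assumptions for illustration, since the description does not fix the update rule.

```python
# Minimal sketch of updating the per-label average arousal-level change,
# assuming an arithmetic mean over the n changes observed so far.

def update_average_change(old_average: float, n: int, new_change: float) -> float:
    return (old_average * n + new_change) / (n + 1)
```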
 In step S20, the acquisition unit 21 sets the correct-answer label to incorrect (×).
 In step S21, the acquisition unit 21 registers the concentration level, the "response absent" information, the arousal level, and the correct-answer label in the user log DB 52, associated with the log date and time, voice label, familiar target, and familiarity. The process then proceeds to step S22.
 In step S22, the learning unit 26 refers to the user log DB 52, acquires the response count, and determines whether it is below a threshold for judging whether enough information has accumulated for learning. The threshold is set to 2, for example; in that case, a response count of 0 or 1 is judged to be below the threshold. When the response count is below the threshold, the process of FIGS. 6A and 6B ends; otherwise, the process proceeds to step S23.
 In step S23, the learning unit 26 performs binary classification learning and records the result in the model DB 53, after which the process of FIGS. 6A and 6B ends. In step S23, the learning unit 26 acquires, for example, the correct-answer labels recorded in the user log DB 52 together with the familiarity and concentration values associated with them, and generates a binary classification model of voice labels in the two-dimensional space of familiarity and concentration. FIG. 7 illustrates such a model: it uses a linear function (y = ax + b) for classifying voice labels, defined in the two-dimensional space of familiarity y and concentration x. As shown in FIG. 7, a voice label whose familiarity lies above the straight line L representing y = ax + b, that is, whose familiarity falls in region a of FIG. 7, is classified as correct (〇), while any other voice label is classified as incorrect (×). Various binary classification learning methods, such as logistic regression, SVM (Support Vector Machine), or neural networks, can be used to generate the model.
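 As one of the named learning methods, logistic regression yields a linear boundary in the familiarity-concentration plane, as in the following sketch; the scikit-learn usage and array names are assumptions for illustration, not the patented implementation.

```python
# Minimal sketch of the two-dimensional binary classification learning in
# step S23 using logistic regression; its linear decision boundary plays
# the role of the line y = ax + b in FIG. 7.

from sklearn.linear_model import LogisticRegression

def train_label_classifier(concentration: list[float], familiarity: list[float],
                           correct: list[bool]) -> LogisticRegression:
    X = list(zip(concentration, familiarity))   # points in the (x, y) plane
    y = [1 if c else 0 for c in correct]        # 1 = correct (〇), 0 = incorrect (×)
    model = LogisticRegression()
    model.fit(X, y)
    return model
```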
 Here, the reason the two axes of familiarity and concentration are adopted in the binary classification model of the embodiment is explained. A person tends to direct selective attention toward familiar sounds, such as the conversation of a person of interest or the person's own name; this is called the cocktail party effect. Furthermore, in Yumiko Honjo, "Physiological Psychological Study on Attention and Arousal" (Kwansei Gakuin University doctoral dissertation, Otsu No. 217, pp. 187-188), a model of attention and arousal incorporating both selective attention and arousal is derived, which suggests a relationship between the occurrence of selective attention and the arousal level. Because familiarity is thus considered to affect both how readily the cocktail party effect occurs and the change in arousal level it produces, it is adopted as one axis of learning.
 As for concentration, the RIKEN news release "The brain directs attention and heightens concentration through 'efficient selection'" (December 8, 2011, https://www.riken.jp/press/2011/20111208/, retrieved June 10, 2020) reports that in a concentrated state the information transmitted from sensation to perception is limited. In other words, a sound perceived while concentration is high is presumed to be one that the user needs or that readily reaches the user's ears. Because concentration can thus be considered to affect how readily the user's selective attention arises, that is, which sounds the user tends to react to, it is adopted as the other axis of learning.
 As described above, according to the embodiment, when the user is determined not to be awake, the user is called using a voice familiar to the user. Even when the user is drowsy, the cocktail party effect therefore helps the call voice reach the user, and an increase in arousal level within a short time can be expected. In addition, because both familiarity and concentration are used in selecting the familiar voice, the user can be presented with a call voice to which the user is more likely to react.
 Also according to the embodiment, voice labels are classified using a learning model with the two axes of familiarity and concentration, so as learning progresses, voice-label candidates better suited to the user can be expected to be extracted. Furthermore, the voice label used to generate the voice is selected from the extracted candidates by random sampling based on past presentation counts. This suppresses the habituation and boredom that would result from frequently presenting call voices with the same voice label; even when the voice generation device 1 is used over a long period, the user's reaction to the call voice therefore remains likely, and a rise in the user's arousal level can consequently be expected.
 [Modifications]
 A modification of the embodiment is described. In the embodiment, a plurality of candidate voice labels used to generate the call voice are extracted from the familiarity DB 51 based on familiarity and concentration, and one voice label is then selected from the extracted candidates by random sampling. As a simpler process, a single voice label may be extracted from the familiarity DB 51 based on familiarity and concentration, and the call voice may be generated from that voice label.
 Also, in the embodiment, the candidate voice labels used to generate the call voice are extracted from the voice labels classified by a learning model with the two axes of familiarity and concentration. A learning model is not necessarily required for this extraction; for example, the candidates may be extracted by taking the voice labels with the highest weighted sums of familiarity and concentration.
 Also, the embodiment shows an example in which the selection of the voice label based on familiarity and concentration, the generation of the call voice, and the training of the learning model are all performed in the voice generation device 1. These may instead be performed in separate devices.
 The learning model described above classifies voice labels in the two-dimensional space of familiarity and concentration. The learning model may further be configured to classify voice labels in a three-dimensional space that also includes the arousal-level change amount. FIG. 8 illustrates a binary classification model using familiarity, concentration, and arousal-level change amount: it uses a classification surface for classifying voice labels, defined in that three-dimensional space. In the example of FIG. 8, a voice label located in the space above the classification surface P is classified as correct (〇), while one located in the space below is classified as incorrect (×). As before, various binary classification learning methods, such as logistic regression, SVM (Support Vector Machine), or neural networks, can be used to generate the model. The arousal-level change amount characterizes the user's reaction in addition to the correct-answer label, that is, in addition to whether the user responds; adopting it as an axis can therefore be expected to further improve the accuracy of the correct-answer classification.
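 The three-dimensional variant can reuse the same learning scheme with one more feature, as in this sketch; again an assumed scikit-learn illustration, whose learned boundary plays the role of the surface P in FIG. 8.

```python
# Minimal sketch of the three-dimensional variant: the arousal-level change
# amount joins familiarity and concentration as a third feature, so the
# logistic-regression boundary becomes a plane in that space.

from sklearn.linear_model import LogisticRegression

def train_3d_label_classifier(concentration, familiarity, arousal_change, correct):
    X = list(zip(concentration, familiarity, arousal_change))
    y = [1 if c else 0 for c in correct]
    model = LogisticRegression()
    model.fit(X, y)
    return model
```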
 Also, in the embodiment, when the user is still not awake after the call, that is, when the new arousal level is determined to be at or below the threshold, a call proposing an active behavior expected to raise the arousal level, even at the cost of some time, may be made to the user. This gives the user an opportunity to act, and an increase in arousal level within a short time can consequently be expected. Moreover, by making this proposing call in a voice familiar to the user as well, the cocktail party effect again helps the call voice reach the user.
 Also, in the embodiment, the call text used for the call voice is selected at random from the templates recorded in the call text DB 55. These templates may be changed as appropriate; for example, by collecting the user's everyday conversation, the templates may be revised to include words that frequently appear in it and readily attract the user's attention.
 Also, when the call voice is generated in the embodiment, the volume or the like may additionally be adjusted according to the arousal level.
 Each process according to the embodiment described above can also be stored as a program executable by a processor, that is, a computer, and distributed on a storage medium of an external storage device such as a magnetic disk, optical disc, or semiconductor memory. The processor reads the program stored on the storage medium of the external storage device, and its operation is controlled by the read program, whereby the processes described above can be executed.
 The present invention is not limited to the above embodiment and can be variously modified at the implementation stage without departing from its gist. The embodiments may also be combined as appropriate, in which case their combined effects are obtained. Furthermore, the above embodiment includes various inventions, and various inventions can be extracted by combinations selected from the disclosed constituent elements. For example, even if some constituent elements are deleted from all those shown in the embodiment, a configuration from which those elements are deleted can be extracted as an invention, provided the problem can still be solved and the effect obtained.
 1 … Voice generation device
 2 … Processor
 3 … ROM
 4 … RAM
 5 … Storage
 6 … Microphone
 7 … Speaker
 8 … Camera
 9 … Input device
 10 … Display
 11 … Communication module
 21 … Acquisition unit
 22 … Determination unit
 23 … Selection unit
 24 … Generation unit
 25 … Presentation unit
 26 … Learning unit
 51 … Familiarity database (DB)
 52 … User log database (DB)
 53 … Model database (DB)
 54 … Voice synthesis parameter database (DB)
 55 … Call text database (DB)
 231 … Receiving unit
 232 … Model selection unit
 233 … Voice label candidate extraction unit
 234 … Voice label selection unit
 235 … Transmission unit

Claims (6)

 1. A voice generation device comprising:
 an acquisition unit that acquires an arousal level indicating the user's degree of arousal, from sleep to excitement;
 a determination unit that determines, based on the arousal level, whether the user is in an awake state;
 a selection unit that, when the user is not in an awake state, selects a voice for encouraging the user's awakening from among a plurality of voice candidates, based on a familiarity indicating the degree to which the user is familiar with each of the plurality of voice candidates and a concentration level indicating the degree of the user's current concentration; and
 a generation unit that generates, based on the selected voice, a call voice to be presented to the user.
 2. The voice generation device according to claim 1, wherein the selection unit extracts a plurality of voice candidates based on the familiarity and the concentration level, and selects one voice candidate from the extracted voice candidates by random sampling as the voice for encouraging the user's awakening.
 3. The voice generation device according to claim 1 or 2, wherein the selection unit selects the voice for encouraging the user's awakening using a classification model that classifies, in a two-dimensional space of the familiarity and the concentration level, the plurality of voice candidates into first voice candidates for which a reaction of the user to the presentation of the call voice is expected and a rise in the user's arousal level is expected, and second voice candidates for which a reaction of the user to the presentation of the call voice is not expected or a rise in the user's arousal level is not expected.
 4. The voice generation device according to claim 1 or 2, wherein the selection unit selects the voice for encouraging the user's awakening using a classification model that classifies, in a three-dimensional space of the familiarity, the concentration level, and the amount of change in the arousal level due to the presentation of the call voice, the plurality of voice candidates into first voice candidates for which a reaction of the user to the presentation of the call voice is expected and a rise in the user's arousal level is expected, and second voice candidates for which a reaction of the user to the presentation of the call voice is not expected or a rise in the user's arousal level is not expected.
 5. A voice generation method comprising:
 acquiring, by an acquisition unit, an arousal level indicating the user's degree of arousal, from sleep to excitement;
 determining, by a determination unit, based on the arousal level, whether the user is in an awake state;
 selecting, by a selection unit, when the user is not in an awake state, a voice for encouraging the user's awakening from among a plurality of voice candidates, based on a familiarity indicating the degree to which the user is familiar with each of the plurality of voice candidates and a concentration level indicating the degree of the user's current concentration; and
 generating, by a generation unit, based on the selected voice, a call voice to be presented to the user.
 6. A voice generation program for causing a processor to function as the acquisition unit, the determination unit, the selection unit, and the generation unit of the voice generation device according to any one of claims 1 to 4.
PCT/JP2020/024820 2020-06-24 2020-06-24 Voice generation device, voice generation method, and voice generation program WO2021260846A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/024820 WO2021260846A1 (en) 2020-06-24 2020-06-24 Voice generation device, voice generation method, and voice generation program
JP2022531319A JP7416244B2 (en) 2020-06-24 2020-06-24 Voice generation device, voice generation method, and voice generation program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/024820 WO2021260846A1 (en) 2020-06-24 2020-06-24 Voice generation device, voice generation method, and voice generation program

Publications (1)

Publication Number Publication Date
WO2021260846A1 true WO2021260846A1 (en) 2021-12-30

Family

ID=79282106

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/024820 WO2021260846A1 (en) 2020-06-24 2020-06-24 Voice generation device, voice generation method, and voice generation program

Country Status (2)

Country Link
JP (1) JP7416244B2 (en)
WO (1) WO2021260846A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016192127A (en) * 2015-03-31 2016-11-10 パイオニア株式会社 Music information update device
JP2019124977A (en) * 2018-01-11 2019-07-25 トヨタ自動車株式会社 On-board voice output device, voice output control method, and voice output control program


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
OMI, TAKUHIRO: "Vision technology and transportation systems that support safe driving - Image sensor for evaluating the driver's doze state", IMAGE LABORATORY, vol. 26, no. 2, 10 February 2015 (2015-02-10), pages 64 - 69 *

Also Published As

Publication number Publication date
JPWO2021260846A1 (en) 2021-12-30
JP7416244B2 (en) 2024-01-17


Legal Events

Date Code Title Description
121  Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20941780; Country of ref document: EP; Kind code of ref document: A1)
ENP  Entry into the national phase (Ref document number: 2022531319; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122  Ep: pct application non-entry in european phase (Ref document number: 20941780; Country of ref document: EP; Kind code of ref document: A1)