WO2021260846A1 - Voice generation device, voice generation method, and voice generation program - Google Patents

Voice generation device, voice generation method, and voice generation program

Info

Publication number
WO2021260846A1
WO2021260846A1 (PCT/JP2020/024820)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
user
degree
arousal
unit
Prior art date
Application number
PCT/JP2020/024820
Other languages
French (fr)
Japanese (ja)
Inventor
妙 佐藤
昭宏 千葉
真奈 笹川
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2020/024820 priority Critical patent/WO2021260846A1/en
Priority to JP2022531319A priority patent/JP7416244B2/en
Publication of WO2021260846A1 publication Critical patent/WO2021260846A1/en

Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61M: DEVICES FOR INTRODUCING MEDIA INTO, OR ONTO, THE BODY; DEVICES FOR TRANSDUCING BODY MEDIA OR FOR TAKING MEDIA FROM THE BODY; DEVICES FOR PRODUCING OR ENDING SLEEP OR STUPOR
    • A61M 21/00: Other devices or methods to cause a change in the state of consciousness; Devices for producing or ending sleep by mechanical, optical, or acoustical means, e.g. for hypnosis
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • This embodiment relates to a voice generator, a voice generation method, and a voice generation program.
  • Non-Patent Documents 1 and 2 describe that sleep is the opposite of arousal.
  • The arousal level is an index showing the degree of arousal from sleep to excitement.
  • "Sleepiness" is defined as a state in which the arousal level is lower than a moderate arousal level. For this reason, even when drowsiness is felt during work in a remote environment such as working from home or during a distance lesson, it is desirable to raise the arousal level in as short a time as possible.
  • The embodiment provides a voice generation device, a voice generation method, and a voice generation program for urging the user to awaken in a short time.
  • The voice generation device includes: an acquisition unit that acquires an arousal degree indicating the user's degree of arousal from sleep to excitement; a determination unit that determines, based on the arousal degree, whether or not the user is awake; a selection unit that, when the user is not awake, selects a voice for prompting the user's awakening from a plurality of voice candidates based on a familiarity degree indicating how familiar the user is with each of the voice candidates and a concentration degree indicating the user's current degree of concentration; and a generation unit that generates, based on the selected voice, a call voice to be presented to the user.
  • According to the embodiment, a voice generation device, a voice generation method, and a voice generation program for urging the user to awaken in a short time are provided.
  • FIG. 1 is a diagram showing a hardware configuration of an example of a voice generator according to an embodiment.
  • FIG. 2 is a diagram showing the configuration of an example of the familiarity DB.
  • FIG. 3 is a diagram showing an example configuration of a user log DB.
  • FIG. 4 is a diagram showing the structure of an example of the call statement DB.
  • FIG. 5 is a functional block diagram of the voice generator.
  • FIG. 6A is a flowchart showing a voice presentation process by the voice generator.
  • FIG. 6B is a flowchart showing a voice presentation process by the voice generator.
  • FIG. 7 is a diagram showing an image of a binary classification model using "familiarity" and "concentration".
  • FIG. 8 is a diagram showing an image of a binary classification model using "familiarity", "concentration", and "arousal degree change amount".
  • FIG. 1 is a diagram showing a hardware configuration of an example of a voice generator according to an embodiment.
  • The voice generation device 1 according to the embodiment emits a call voice urging the user to awaken when the user is not in an awake state, such as when the user is drowsy.
  • The arousal degree in the embodiment is an index indicating the degree of arousal corresponding to the arousal level.
  • The arousal level corresponds to the activity level of the cerebrum and represents the degree of arousal from sleep to excitement.
  • The arousal level is measured from eye movements, blinking activity, electrodermal activity, reaction time to stimuli, and the like.
  • The arousal degree in the embodiment is therefore calculated from any one of these measurements, or from a combination of them.
  • The arousal degree is, for example, a value that increases as the user moves from a sleep state toward an excited state.
  • The arousal degree may be a continuous value or a discrete value such as Level 1, Level 2, and so on. When the arousal degree is calculated by combining the values of eye movement, blinking activity, electrodermal activity, and reaction time to a stimulus, the manner of combination is not particularly limited; for example, the values may simply be summed or combined by weighted addition.
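As one concrete illustration, the following Python sketch combines the four indicators into a single arousal degree by weighted addition. The weights, the normalization of the inputs, and the function name are illustrative assumptions and are not specified by the embodiment.

    # A minimal sketch of weighted addition of the measured indicators.
    # All weights and value ranges are illustrative assumptions.
    def arousal_degree(eye_movement: float,
                       blink_activity: float,
                       electrodermal: float,
                       reaction_time: float,
                       weights=(0.3, 0.3, 0.2, 0.2)) -> float:
        """Combine indicators normalized to [0, 1] (larger = more aroused)
        into one arousal degree in [0, 1] by weighted addition."""
        indicators = (eye_movement, blink_activity, electrodermal, reaction_time)
        return sum(w * v for w, v in zip(weights, indicators))

    print(arousal_degree(0.2, 0.4, 0.3, 0.1))  # 0.26 -> rather drowsy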
  • The voice generation device 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, a speaker 7, a camera 8, an input device 9, a display 10, and a communication module 11.
  • the voice generation device 1 is various terminals such as a personal computer (PC), a smartphone, and a tablet terminal. Not limited to this, the voice generation device 1 can be mounted on various devices used by the user.
  • the voice generator 1 does not have to have all the configurations shown in FIG. For example, the microphone 6, the speaker 7, the camera 8, and the display 10 may be separate devices from the voice generation device 1.
  • The processor 2 is a control circuit, such as a CPU, that controls the overall operation of the voice generation device 1.
  • the processor 2 does not have to be a CPU, and may be an ASIC, FPGA, GPU or the like.
  • the processor 2 does not have to be composed of a single CPU or the like, and may be composed of a plurality of CPUs or the like.
  • ROM 3 is a non-volatile memory such as a flash memory.
  • the start program of the voice generator 1 is stored in the ROM 3.
  • RAM 4 is a volatile memory such as SDRAM. The RAM 4 can be used as a working memory for various processes in the voice generator 1.
  • The storage 5 is a storage device such as a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • Various programs used by the voice generation device 1 are stored in the storage 5.
  • The storage 5 may also store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a voice synthesis parameter database (DB) 54, and a call statement database (DB) 55. These databases are described in detail later.
  • the microphone 6 is a device that converts the input voice into a voice signal which is an electric signal.
  • the audio signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5.
  • the voice synthesis parameter for synthesizing the calling voice can be acquired from the voice input via the microphone 6.
  • the speaker 7 is a device that outputs voice based on the input voice signal.
  • the camera 8 captures the user and acquires the image of the user.
  • the user's image obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5.
  • the user's image is used, for example, to acquire the degree of arousal or to acquire the user's reaction to the calling voice.
  • The input device 9 is a mechanical input device such as a button, a switch, a keyboard, or a mouse, or a software input device using a touch sensor.
  • the input device 9 receives various inputs from the user. Then, the input device 9 outputs a signal corresponding to the user's input to the processor 2.
  • the display 10 is a display such as a liquid crystal display or an organic EL display.
  • the display 10 displays various images.
  • the communication module 11 is a device for the voice generation device 1 to carry out communication.
  • the communication module 11 communicates with, for example, a server provided outside the voice generator 1.
  • the communication method by the communication module 11 is not particularly limited.
  • the communication module 11 may carry out communication wirelessly or may carry out communication by wire.
  • FIG. 2 is a diagram showing a configuration of an example of familiarity DB 51.
  • the familiarity DB 51 is a database that records the "familiarity" of the user.
  • The familiarity DB 51 records, in association with one another, for example, a user ID, a voice label, a familiar target, a familiarity degree, a number of reactions, a number of presentations, and an average arousal change value.
  • the "user ID" is an ID assigned to each user of the voice generator 1.
  • the user ID may be associated with user attribute information such as a user name.
  • the "voice label” is a label uniquely attached to each of the candidates for the calling voice. Any label can be used as the audio label. For example, a familiar name may be used for the voice label.
  • the "familiar target” is a target that generates a voice that the user often talks to or hears.
  • the familiar target does not necessarily have to be a person.
  • The "familiarity degree" is the degree of the user's familiarity with the voice of the corresponding familiar target.
  • the degree of familiarity can be calculated from the frequency of communication with a familiar target by SNS or the like, the frequency of daily conversation with a familiar target, the frequency of daily hearing from a familiar target, and the like. For example, the higher the frequency of communication with a familiar target by SNS or the like, the frequency of daily conversation with a familiar target, and the frequency of daily hearing from a familiar target, the greater the value of familiarity.
  • the degree of familiarity may be acquired by self-reporting by the user.
  • the "number of responses" is the number of times the user responded to the call voice generated based on the corresponding voice label.
  • the number of presentations is the number of times the call voice generated based on the corresponding voice label is presented to the user.
  • the reaction probability can be calculated by dividing the number of reactions by the number of presentations.
  • the reaction probability is the probability that the user will react to the call voice generated based on the corresponding voice label.
  • the "average value of change in arousal level” is the average value of the amount of change in the arousal level of the user with respect to the call voice generated based on the corresponding voice label.
  • the amount of change in arousal level will be described later.
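The statistics in this table could be maintained as in the following sketch; the field names mirror the columns described above, while the incremental-average update and all values are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class FamiliarityRecord:
        user_id: str
        voice_label: str
        familiarity: float
        reactions: int = 0           # "number of reactions"
        presentations: int = 0       # "number of presentations"
        avg_arousal_change: float = 0.0

        @property
        def reaction_probability(self) -> float:
            # reaction probability = number of reactions / number of presentations
            return self.reactions / self.presentations if self.presentations else 0.0

        def record_presentation(self, reacted: bool, arousal_change=None):
            self.presentations += 1
            if reacted and arousal_change is not None:
                self.reactions += 1
                # running mean of the arousal degree change amount
                self.avg_arousal_change += (
                    (arousal_change - self.avg_arousal_change) / self.reactions)

    rec = FamiliarityRecord("U1", "mother", familiarity=0.9)
    rec.record_presentation(True, arousal_change=2.0)
    rec.record_presentation(False)
    print(rec.reaction_probability, rec.avg_arousal_change)  # 0.5 2.0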
  • FIG. 3 is a diagram showing the configuration of an example of the user log DB 52.
  • the user log DB 52 is a database that records logs related to the use of the voice generation device 1 by the user.
  • The user log DB 52 records, in association with one another, for example, a log generation date and time, a user ID, a voice label, a familiar target, a concentration degree, the presence or absence of a reaction, an arousal degree, a new arousal degree, an arousal degree change amount, and a correct answer label.
  • the user ID, the voice label, and the familiar object are the same as the familiarity DB 51.
  • the "log generation date and time” is the date and time when the user used the voice generator 1.
  • the log generation date and time is recorded, for example, each time a call voice is presented to the user.
  • The "presence or absence of a reaction" indicates whether the user reacted after the call voice was presented. When the user reacted, "yes" is recorded; when the user did not react, "none" is recorded.
  • The "concentration degree" is the degree of the user's concentration at the time the call voice is presented.
  • The concentration degree can be measured, for example, by estimating the posture and behavior of the user during work from the image obtained by the camera 8.
  • The concentration value is calculated so as to increase each time the user adopts a posture or behavior considered to indicate concentration, and to decrease each time the user adopts a posture or behavior considered to indicate a lack of concentration.
  • Alternatively, the degree of dilation of the user's pupils during work can be estimated from the image obtained by the camera 8.
  • In that case, the concentration value is calculated to be higher when the pupils are more dilated and lower when they are more constricted.
  • the degree of concentration may be a discrete value such as Lv (Level) 1, Lv2, ....
  • the method for acquiring the degree of concentration is not limited to a specific method.
  • the "awakening degree” is the awakening degree acquired before the presentation of the call voice by the voice generation device 1.
  • the "new arousal degree" is the arousal degree newly acquired after the user's reaction. New arousal is not recorded when there is no user response.
  • the "awakening degree change amount” is an amount representing the change in the arousal degree before and after the user's reaction.
  • the amount of change in alertness is obtained, for example, from the difference between the new alertness and the alertness.
  • the amount of change in arousal level may be the ratio of the new arousal level to the arousal level or the like. The amount of change in alertness is not recorded when there is no reaction from the user.
  • the "correct answer label” is a label of correct or incorrect answers for supervised learning.
  • the correct answer is recorded as ⁇
  • the incorrect answer is recorded as ⁇ .
  • the correct answer is recorded when the user reacts to the call voice from the voice generator 1 and the arousal level rises beyond the threshold value in the voice presentation operation described later. ..
  • the incorrect answer is recorded when the user does not respond to the call voice in the voice presentation process, or when the arousal level is equal to or less than the threshold value even when the user responds.
  • the model DB 53 is a database that records a model of voice label classification for extracting voice label candidates.
  • the model is a model configured to classify correct or incorrect answers of voice labels in a two-dimensional space of familiarity and concentration.
  • the model includes an initial model and a learning model.
  • The initial model is a model generated from initial values stored in the model DB 53 and is not updated by learning. The initial values are the two constants (a, b) of a linear function y = ax + b defined in the two-dimensional space of familiarity y and concentration x; voice labels whose familiarity exceeds ax + b are classified as correct, and the others as incorrect.
  • The learning model is a trained model generated from the initial model.
  • The learning model can be a binary classification model whose constants (a, b) differ from those of the initial model.
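A minimal sketch of this linear classification boundary, assuming placeholder values for the constants (a, b):

    # Linear binary classification in the (concentration x, familiarity y)
    # plane: points above the line y = a*x + b are classified as correct.
    A_INIT, B_INIT = 0.5, 0.2  # assumed placeholder initial constants

    def classify(familiarity: float, concentration: float,
                 a: float = A_INIT, b: float = B_INIT) -> bool:
        """True (correct, 〇) if the label lies above the line y = a*x + b."""
        return familiarity > a * concentration + b

    print(classify(familiarity=0.9, concentration=0.6))  # True  (0.9 > 0.5)
    print(classify(familiarity=0.3, concentration=0.6))  # False (0.3 <= 0.5)

The learning step described later can then be viewed as replacing the constants (a, b), or the linear boundary itself, with a fitted one.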
  • the voice synthesis parameter DB 54 is a database in which voice synthesis parameters are recorded.
  • the voice synthesis parameter is data used for synthesizing the voice of the user's familiar target.
  • the voice synthesis parameter may be feature amount data extracted from voice data previously collected through the microphone 6.
  • speech synthesis parameters acquired or defined by other systems may be pre-recorded.
  • the speech synthesis parameter is associated with the speech label.
  • FIG. 4 is a diagram showing the configuration of an example of the call statement DB55.
  • the call statement DB 55 is a database in which template data of various call statements for encouraging the awakening of the user are recorded.
  • the call statement is not particularly limited. However, it is desirable that the call statement includes a call using the user's name. This is to enhance the cocktail party effect described later.
  • the familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call statement DB 55 do not necessarily have to be stored in the storage 5.
  • the familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call statement DB 55 may be stored in a server separate from the voice generation device 1.
  • the voice generator 1 accesses the server using the communication module 11 and acquires necessary information.
  • FIG. 5 is a functional block diagram of the voice generator 1.
  • the voice generation device 1 has an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26.
  • The operations of the acquisition unit 21, the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 are realized, for example, by the processor 2 executing a program stored in the storage 5.
  • the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 may be realized by hardware different from the processor 2.
  • the acquisition unit 21 acquires the arousal level of the user. Further, the acquisition unit 21 acquires the user's reaction to the call voice. As described above, the degree of arousal is calculated by any one of eye movements, blinking activity, electrical skin activity, reaction time to stimuli, or a combination thereof.
  • the eye movement, blinking activity, and reaction time to the stimulus for calculating the degree of arousal can be measured from, for example, an image of the user acquired by the camera 8.
  • the reaction time to the stimulus may be measured from the audio signal acquired by the microphone 6.
  • skin electrical activity can be measured, for example, by a sensor worn on the user's arm.
  • The user's reaction can be obtained by measuring, for example from an image acquired by the camera 8, whether the user looked in the direction of the sound after the call voice was presented.
  • the acquisition unit 21 may be configured to acquire the arousal degree or the user's reaction calculated outside the voice generation device 1 by communication.
  • The determination unit 22 determines whether or not the user is awake based on the arousal degree acquired by the acquisition unit 21. When the determination unit 22 determines that the user is not in an awake state, it transmits a voice label selection request to the reception unit 231 of the selection unit 23. The determination unit 22 makes this determination by comparing the arousal degree with a predetermined threshold value.
  • The threshold value is an arousal degree threshold for determining whether or not the user is awake, and is stored in, for example, the storage 5. The determination unit 22 also determines whether or not there was a user reaction based on the user reaction information acquired by the acquisition unit 21.
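A minimal sketch of this threshold test; the threshold value and names are illustrative assumptions:

    # Assumed threshold; in the embodiment it is stored in the storage 5.
    AROUSAL_THRESHOLD = 0.5

    def needs_wakeup_call(arousal: float,
                          threshold: float = AROUSAL_THRESHOLD) -> bool:
        """True when the user is judged not to be awake, i.e. when a voice
        label selection request should be sent to the selection unit."""
        return arousal <= threshold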
  • The selection unit 23 selects the voice label of a candidate voice for encouraging the user's awakening.
  • the selection unit 23 includes a reception unit 231, a model selection unit 232, an audio label candidate extraction unit 233, an audio label selection unit 234, and a transmission unit 235.
  • the receiving unit 231 receives a voice label selection request from the determination unit 22.
  • the model selection unit 232 selects a model to be used for selecting an audio label from the model DB 53.
  • the model selection unit 232 selects either an initial model or a learning model based on the degree of fit.
  • the degree of fit is a value for determining which of the initial model and the learning model has higher accuracy. The degree of fit will be described in detail later.
  • the voice label candidate extraction unit 233 extracts voice labels that are candidates for the call voice to be presented to the user from the familiarity DB 51 based on the model selected by the model selection unit 232 and the concentration level of the user.
  • the voice label selection unit 234 selects a voice label for generating a call voice to be presented to the user from the voice label extracted by the voice label candidate extraction unit 233.
  • the transmission unit 235 transmits the information of the voice label selected by the voice label selection unit 234 to the generation unit 24.
  • the generation unit 24 generates a call voice for encouraging the user to awaken based on the voice label received from the transmission unit 235.
  • the generation unit 24 acquires the voice synthesis parameter corresponding to the voice label received from the transmission unit 235 from the voice synthesis parameter DB 54. Then, the generation unit 24 generates a call voice based on the call text data recorded in the call text DB 55 and the voice synthesis parameter.
  • the presentation unit 25 presents the call voice generated by the generation unit 24 to the user.
  • the presentation unit 25 reproduces the call voice generated by the generation unit 24 by using the speaker 7.
  • the learning unit 26 learns the model recorded in the model DB 53.
  • the learning unit 26 performs learning by using, for example, binary classification learning using a correct answer label.
  • FIGS. 6A and 6B are flowcharts showing the voice presentation process by the voice generator 1. The processes of FIGS. 6A and 6B may be performed periodically.
  • In step S1, the acquisition unit 21 acquires the user's arousal degree.
  • The acquisition unit 21 outputs the acquired arousal degree to the determination unit 22 and holds it until the timing at which the user's reaction is acquired after the call voice is presented.
  • In step S2, the determination unit 22 determines whether or not the arousal degree acquired by the acquisition unit 21 is equal to or less than the threshold value.
  • When it is determined in step S2 that the arousal degree exceeds the threshold value, that is, when the user is awake, the processes of FIGS. 6A and 6B are terminated.
  • When it is determined in step S2 that the arousal degree is equal to or less than the threshold value, that is, when the user is not awake, for example because of drowsiness, the process proceeds to step S3.
  • In step S3, the determination unit 22 transmits a voice label selection request to the selection unit 23.
  • The model selection unit 232 then refers to the user log DB 52 and acquires the number of reactions, that is, the total count of logs whose reaction field is "yes".
  • In step S4, the model selection unit 232 determines whether or not the number of reactions is less than a threshold value.
  • This threshold value is for determining whether or not a usable learning model is recorded in the model DB 53.
  • The threshold is set to, for example, 2; in this case, the number of reactions is judged to be less than the threshold when it is 0 or 1.
  • When the number of reactions is less than the threshold value, the process proceeds to step S5.
  • Otherwise, the process proceeds to step S6.
  • In step S5, the model selection unit 232 selects the initial values, that is, the initial model, from the model DB 53, and outputs the selected initial model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
  • In step S6, the model selection unit 232 calculates the degree of fit.
  • The model selection unit 232 first acquires all past logs, with and without reactions, from the user log DB 52, and then calculates the degree of fit of both the initial model and the learning model.
  • For example, the accuracy obtained by comparing each model's correct/incorrect output for the concentration value of each log with the presence or absence of a reaction in that log can be used as the degree of fit.
  • The degree of fit is not limited to the accuracy; other metrics calculated from the model's correct/incorrect outputs and the logged reactions, such as the precision, the recall, or the F-measure, may be used.
  • The precision is the proportion of the logs predicted to be correct in which the user actually reacted.
  • The recall is the proportion of the logs in which the user actually reacted that were predicted to be correct.
  • The F-measure is the harmonic mean of the recall and the precision; for example, it can be calculated as 2 × Recall × Precision / (Recall + Precision).
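The following sketch computes these degree-of-fit metrics by comparing a model's predictions with the logged reactions; the data and names are illustrative, not the patent's implementation.

    # "Predicted" means the model classified the log's (familiarity,
    # concentration) point as correct; "actual" means the log recorded
    # a user reaction.
    def fit_metrics(predicted: list[bool], actual: list[bool]) -> dict:
        tp = sum(p and a for p, a in zip(predicted, actual))
        fp = sum(p and not a for p, a in zip(predicted, actual))
        fn = sum(a and not p for p, a in zip(predicted, actual))
        tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
        accuracy = (tp + tn) / len(actual)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (2 * recall * precision / (recall + precision)
                     if recall + precision else 0.0)
        return {"accuracy": accuracy, "precision": precision,
                "recall": recall, "F": f_measure}

    print(fit_metrics([True, True, False, False],
                      [True, False, False, True]))
    # {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'F': 0.5}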
  • In step S7, the model selection unit 232 compares the degrees of fit of the initial model and the learning model, and determines whether or not the degree of fit of the learning model is higher.
  • When it is determined in step S7 that the degree of fit of the initial model is higher, the process proceeds to step S5, where the model selection unit 232 selects the initial values, that is, the initial model.
  • When it is determined in step S7 that the degree of fit of the learning model is higher, the process proceeds to step S8.
  • In step S8, the model selection unit 232 selects the learning model and outputs it to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
  • In step S9, the voice label candidate extraction unit 233 acquires the user's current concentration degree from the acquisition unit 21.
  • The voice label candidate extraction unit 233 then extracts candidate voice labels for generating the call voice from the familiarity DB 51.
  • For example, from the voice labels registered in the familiarity DB 51, the voice label candidate extraction unit 233 extracts all voice labels that the selected model classifies as correct for the current concentration value.
  • A voice label classified as correct is one for which presenting the call voice is expected both to elicit a reaction from the user and to raise the arousal degree.
  • The voice label selection unit 234 selects one voice label from the voice labels extracted by the voice label candidate extraction unit 233.
  • When selecting a voice label, the voice label selection unit 234 obtains, for example, a weighted winning probability based on the number of past presentations, and then selects one voice label by random sampling according to that probability.
  • The weighted winning probability can be calculated, for example, according to equation (1).
  • The weighted winning probability may also be calculated by an equation different from equation (1).
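Equation (1) itself is not reproduced in this text, so the weighting below is only an assumption chosen to match the stated goal of suppressing habituation: labels presented less often receive a higher selection probability.

    import random

    def pick_voice_label(presentations: dict[str, int]) -> str:
        """presentations maps voice label -> number of past presentations.
        Assumed weighting: weight 1 / (n + 1), then a weighted random choice."""
        labels = list(presentations)
        weights = [1.0 / (n + 1) for n in presentations.values()]
        return random.choices(labels, weights=weights, k=1)[0]

    print(pick_voice_label({"mother": 5, "colleague": 1, "newscaster": 0}))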
  • In step S12, the transmission unit 235 transmits information indicating the voice label selected by the voice label selection unit 234 to the generation unit 24.
  • The generation unit 24 acquires the voice synthesis parameter corresponding to the received voice label from the voice synthesis parameter DB 54, and generates a call voice based on that parameter and call text data randomly selected from the call text DB 55.
  • The call voice can be generated by a voice synthesis process using the voice synthesis parameters. After that, the process proceeds to step S13.
  • In step S13, the presentation unit 25 presents the call voice generated by the generation unit 24 to the user using the speaker 7.
  • In step S14, the acquisition unit 21 acquires the user's reaction and outputs the user reaction information to the determination unit 22.
  • In step S15, the determination unit 22 determines whether or not there was a reaction from the user. When it is determined in step S15 that there was no reaction, the process proceeds to step S20. When it is determined that there was a reaction, the process proceeds to step S16.
  • In step S16, the determination unit 22 requests the acquisition unit 21 to acquire a new arousal degree.
  • The acquisition unit 21 acquires the new arousal degree.
  • The new arousal degree may be acquired in the same manner as the arousal degree.
  • In step S17, the determination unit 22 acquires the new arousal degree from the acquisition unit 21 and determines whether or not the new arousal degree is equal to or less than the threshold value.
  • The threshold value in step S17 may be the same as or different from the threshold value in step S2.
  • When the new arousal degree exceeds the threshold value, the process proceeds to step S18.
  • When the new arousal degree is equal to or less than the threshold value, the process proceeds to step S20.
  • In step S18, the acquisition unit 21 sets the correct answer label to correct (〇).
  • In step S19, the acquisition unit 21 acquires the average arousal change value from the familiarity DB 51 and updates it using the newly calculated arousal degree change amount and the previously acquired average. The acquisition unit 21 then registers the concentration degree, the with-reaction information, the arousal degree, the new arousal degree, the arousal degree change amount, and the correct answer label in the user log DB 52 in association with the log generation date and time, the voice label, the familiar target, and the familiarity degree. After that, the process proceeds to step S22.
  • In step S20, the acquisition unit 21 sets the correct answer label to incorrect (×).
  • In step S21, the acquisition unit 21 registers the concentration degree, the without-reaction information, the arousal degree, and the correct answer label in the user log DB 52 in association with the log generation date and time, the voice label, the familiar target, and the familiarity degree. After that, the process proceeds to step S22.
  • In step S22, the learning unit 26 refers to the user log DB 52, acquires the number of reactions, and determines whether or not the number of reactions is less than a threshold value.
  • This threshold value is for determining whether or not the information necessary for learning has been accumulated.
  • The threshold is set to, for example, 2; in this case, the number of reactions is judged to be less than the threshold when it is 0 or 1.
  • When the number of reactions is less than the threshold value, the processes of FIGS. 6A and 6B are terminated.
  • Otherwise, the process proceeds to step S23.
  • In step S23, the learning unit 26 carries out binary classification learning and records the learning result in the model DB 53. After that, the processes of FIGS. 6A and 6B are completed.
  • In step S23, the learning unit 26 acquires, for example, the correct answer labels recorded in the user log DB 52 together with the familiarity and concentration degrees associated with them, and generates a binary classification model of the voice labels in the two-dimensional space of "familiarity" and "concentration".
  • FIG. 7 is a diagram showing an image of a binary classification model using "familiarity" and "concentration".
  • Voice labels within the area a are classified as correct answers (〇); voice labels outside the area a are classified as incorrect answers (×).
  • Various binary classification methods such as logistic regression, an SVM (Support Vector Machine), or a neural network can be used to generate the model.
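As an illustration, the sketch below trains such a binary classifier with logistic regression, one of the methods named above; the training rows pair (familiarity, concentration) with the stored correct answer label (〇 = 1, × = 0), and the data values are fabricated.

    from sklearn.linear_model import LogisticRegression

    X = [[0.9, 0.2],   # [familiarity, concentration] per logged presentation
         [0.8, 0.7],
         [0.3, 0.6],
         [0.2, 0.1]]
    y = [1, 1, 0, 0]   # correct answer labels (1 = 〇, 0 = ×)

    model = LogisticRegression().fit(X, y)

    # Extracting candidates for the current concentration amounts to asking
    # which familiarity values the model classifies as correct.
    print(model.predict([[0.85, 0.5], [0.25, 0.5]]))  # e.g. [1 0]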
  • According to the embodiment, when it is determined that the user is not awake, a call is made to the user using a voice familiar to the user. Even when the user is drowsy, the cocktail party effect therefore makes it easier for the user to notice and hear the call voice, so the arousal degree is expected to improve in a short time. Further, in the embodiment, the familiarity degree and the concentration degree are used in selecting the familiar voice, so a call voice that the user is more likely to respond to can be presented.
  • According to the embodiment, the voice labels are classified using a learning model with the two axes of familiarity and concentration, so as learning progresses, voice label candidates better suited to the user are expected to be extracted. Further, a voice label for generating the voice is selected from the extracted candidates by random sampling based on the number of past presentations. This suppresses the habituation and boredom that would result from frequently presenting call voices with the same voice label. As a result, even when the voice generation device 1 is used over a long period, the user can still be expected to react to the call voice, and the user's arousal degree can be expected to rise.
  • In the embodiment, the candidate voice labels used for generating the call voice are extracted from the voice labels classified using a learning model with the two axes of familiarity and concentration.
  • However, the learning model does not necessarily have to be used to extract these candidates.
  • For example, the candidates may be obtained by extracting a plurality of voice labels having higher weighted sums of the familiarity degree and the concentration degree.
  • In the embodiment, the selection of the voice label based on familiarity and concentration, the generation of the call voice, and the learning of the learning model are all performed in the voice generation device 1.
  • However, the voice label selection, the call voice generation, and the learning of the learning model may also be performed in separate devices.
  • In the embodiment, the learning model is configured to classify voice labels in the two-dimensional space of familiarity and concentration.
  • The learning model may further be configured to classify voice labels in a three-dimensional space that also includes the arousal degree change amount.
  • FIG. 8 is a diagram showing an image of a binary classification model using "familiarity", "concentration", and "arousal degree change amount".
  • This model is a binary classification model using a classification plane, defined in the three-dimensional space of "familiarity", "concentration", and "arousal degree change amount", for classifying voice labels.
  • Voice labels located above the classification plane P are classified as correct answers (〇).
  • Voice labels located below the classification plane P are classified as incorrect answers (×).
  • Various binary classification methods such as logistic regression, an SVM (Support Vector Machine), or a neural network can be used to generate this model as well.
  • The arousal degree change amount characterizes not only whether the user reacted, which determines the correct answer label, but also the quality of the reaction. Therefore, by adopting the "arousal degree change amount" as an axis, the accuracy of the correct answer label determination is expected to improve further.
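Under the same illustrative logistic-regression assumption as before, extending the model to this three-dimensional space amounts to adding the arousal degree change amount as a third feature column; the data are again fabricated.

    from sklearn.linear_model import LogisticRegression

    # [familiarity, concentration, arousal degree change amount]
    X = [[0.9, 0.2, 2.0],
         [0.8, 0.7, 1.5],
         [0.3, 0.6, 0.2],
         [0.2, 0.1, 0.0]]
    y = [1, 1, 0, 0]

    model_3d = LogisticRegression().fit(X, y)
    print(model_3d.predict([[0.85, 0.5, 1.8]]))  # e.g. [1]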
  • In the embodiment, a call voice may also be generated to propose an active action that is expected to improve the arousal level, even if the action takes some time.
  • That is, a call making such a suggestion may be presented to the user.
  • In this case, the user is given an opportunity to act, and as a result the arousal degree is expected to improve in a short time.
  • Since the call voice proposing the active action is also produced with a voice familiar to the user, the cocktail party effect again makes it easier for the user to hear the call voice.
  • In the embodiment, the call text used for the call voice is randomly selected from the templates recorded in the call text DB 55.
  • These templates can be modified as appropriate. For example, by collecting the user's daily conversations and the like, a template may be changed so as to include words that appear frequently in those conversations and that easily attract the user's attention.
  • Further, the volume and the like of the call voice may be changed according to the arousal degree.
  • Each process according to the above-described embodiment can be stored as a program executable by a processor, which is a computer.
  • The program can be stored and distributed in a storage medium of an external storage device such as a magnetic disk, an optical disk, or a semiconductor memory.
  • The processor reads the program stored in the storage medium of the external storage device, and its operation is controlled by the read program, whereby the above-described processes are executed.
  • The present invention is not limited to the above embodiment and can be variously modified at the implementation stage without departing from the gist thereof.
  • The embodiments may also be carried out in combination as appropriate, in which case the combined effects are obtained.
  • Further, the above-described embodiment includes various inventions, and various inventions can be extracted by combinations selected from the plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, as long as the problem can be solved and the effects are obtained, the configuration from which those constituent elements are deleted can be extracted as an invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Anesthesiology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Hematology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Biomedical Technology (AREA)
  • Psychology (AREA)
  • Traffic Control Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

This voice generation device (1) has an acquisition unit (21), a determination unit (22), a selection unit (23), and a generation unit (24). The acquisition unit acquires an alertness degree representing a user's degree of alertness from sleep to excitement. The determination unit determines, on the basis of the alertness degree, whether the user is in a state of alertness. When the user is not in a state of alertness, the selection unit selects a voice prompting the user to be alert from among a plurality of voice candidates, this selection being made on the basis of a familiarity degree representing the degree to which the user is familiar with each of the plurality of voice candidates and a concentration degree representing the user's current degree of concentration. On the basis of the selected voice, the generation unit generates a calling-out voice to be presented to the user.

Description

Voice generation device, voice generation method, and voice generation program

 This embodiment relates to a voice generation device, a voice generation method, and a voice generation program.

 It is difficult for a person to get through the day without feeling any drowsiness. This is because human brain function has a short-cycle fluctuation rhythm of arousal called the ultradian rhythm. Non-Patent Documents 1 and 2 explain that sleep is a state at the opposite pole from arousal, and the arousal level is an index showing the degree of arousal from sleep to excitement. Further, Non-Patent Document 2 defines "sleepiness" as a state in which the arousal level is lower than a moderate arousal level. For this reason, even when drowsiness is felt during work in a remote environment such as working from home or during a distance lesson, it is desirable to raise the arousal level in as short a time as possible.

 The embodiment provides a voice generation device, a voice generation method, and a voice generation program for urging the user to awaken in a short time.

 The voice generation device according to the embodiment includes: an acquisition unit that acquires an arousal degree indicating the user's degree of arousal from sleep to excitement; a determination unit that determines, based on the arousal degree, whether or not the user is awake; a selection unit that, when the user is not awake, selects a voice for prompting the user's awakening from a plurality of voice candidates based on a familiarity degree indicating how familiar the user is with each of the voice candidates and a concentration degree indicating the user's current degree of concentration; and a generation unit that generates, based on the selected voice, a call voice to be presented to the user.

 According to the embodiment, a voice generation device, a voice generation method, and a voice generation program for urging the user to awaken in a short time are provided.

 FIG. 1 is a diagram showing the hardware configuration of an example of a voice generation device according to an embodiment. FIG. 2 is a diagram showing the configuration of an example of the familiarity DB. FIG. 3 is a diagram showing the configuration of an example of the user log DB. FIG. 4 is a diagram showing the configuration of an example of the call statement DB. FIG. 5 is a functional block diagram of the voice generation device. FIG. 6A and FIG. 6B are flowcharts showing voice presentation processing by the voice generation device. FIG. 7 is a diagram showing an image of a binary classification model using "familiarity" and "concentration". FIG. 8 is a diagram showing an image of a binary classification model using "familiarity", "concentration", and "arousal degree change amount".
 Hereinafter, embodiments will be described with reference to the drawings. FIG. 1 is a diagram showing the hardware configuration of an example of a voice generation device according to an embodiment. The voice generation device 1 according to the embodiment emits a call voice urging the user to awaken when the user is not in an awake state, such as when the user is drowsy.

 In the embodiment, whether or not the user is awake is determined based on the "arousal degree". The arousal degree in the embodiment is an index indicating the degree of arousal corresponding to the arousal level. The arousal level corresponds to the activity level of the cerebrum and represents the degree of arousal from sleep to excitement. The arousal level is measured from eye movements, blinking activity, electrodermal activity, reaction time to stimuli, and the like. The arousal degree in the embodiment is therefore calculated from any one of these measurements or from a combination of them. The arousal degree is, for example, a value that increases as the user moves from a sleep state toward an excited state, and it may be a continuous value or a discrete value such as Level 1, Level 2, and so on. When the arousal degree is calculated by combining the values of eye movement, blinking activity, electrodermal activity, and reaction time to a stimulus, the manner of combination is not particularly limited; for example, the values may simply be summed or combined by weighted addition.

 The voice generation device 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, a speaker 7, a camera 8, an input device 9, a display 10, and a communication module 11. The voice generation device 1 is any of various terminals such as a personal computer (PC), a smartphone, or a tablet terminal, and may also be mounted on various other devices used by the user. The voice generation device 1 does not have to have all the components shown in FIG. 1. For example, the microphone 6, the speaker 7, the camera 8, and the display 10 may be devices separate from the voice generation device 1.

 The processor 2 is a control circuit, such as a CPU, that controls the overall operation of the voice generation device 1. The processor 2 does not have to be a CPU and may be an ASIC, an FPGA, a GPU, or the like. The processor 2 also does not have to be a single CPU and may be composed of a plurality of CPUs or the like.

 The ROM 3 is a non-volatile memory such as a flash memory and stores, for example, the start-up program of the voice generation device 1. The RAM 4 is a volatile memory such as an SDRAM and can be used as a working memory for various processes in the voice generation device 1.

 The storage 5 is a storage device such as a flash memory, a hard disk drive (HDD), or a solid state drive (SSD). Various programs used by the voice generation device 1 are stored in the storage 5. The storage 5 may also store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a voice synthesis parameter database (DB) 54, and a call statement database (DB) 55. These databases are described in detail later.

 The microphone 6 is a device that converts input voice into a voice signal, which is an electric signal. The voice signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5. For example, the voice synthesis parameters for synthesizing the call voice can be acquired from voice input via the microphone 6.

 The speaker 7 is a device that outputs voice based on an input voice signal.

 The camera 8 images the user and acquires an image of the user. The user's image obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5. The user's image is used, for example, to acquire the arousal degree or to acquire the user's reaction to the call voice.

 The input device 9 is a mechanical input device such as a button, a switch, a keyboard, or a mouse, or a software input device using a touch sensor. The input device 9 receives various inputs from the user and outputs signals corresponding to those inputs to the processor 2.

 The display 10 is a display such as a liquid crystal display or an organic EL display and displays various images.

 The communication module 11 is a device with which the voice generation device 1 carries out communication, for example with a server provided outside the voice generation device 1. The communication method is not particularly limited; communication may be wireless or wired.

 Next, the familiarity database (DB) 51, the user log database (DB) 52, the model database (DB) 53, the voice synthesis parameter database (DB) 54, and the call statement database (DB) 55 will be described.
 FIG. 2 is a diagram showing the configuration of an example of the familiarity DB 51. The familiarity DB 51 is a database that records the user's "familiarity". The familiarity DB 51 records, in association with one another, for example, a user ID, a voice label, a familiar target, a familiarity degree, a number of reactions, a number of presentations, and an average arousal change value.

 The "user ID" is an ID assigned to each user of the voice generation device 1. The user ID may be associated with user attribute information such as a user name.

 The "voice label" is a label uniquely attached to each candidate for the call voice. Any label can be used; for example, the name of the familiar target may be used as the voice label.

 The "familiar target" is a person with whom the user regularly talks or a source whose voice the user often hears. The familiar target does not necessarily have to be a person.

 The "familiarity degree" is the degree of the user's familiarity with the voice of the corresponding familiar target. The familiarity degree can be calculated from, for example, the frequency of communication with the familiar target via SNS or the like, the frequency of daily conversation with the familiar target, and the frequency with which the user hears the familiar target in daily life: the higher these frequencies, the larger the familiarity value. The familiarity degree may also be acquired by the user's self-report.

 The "number of reactions" is the number of times the user reacted to call voices generated based on the corresponding voice label. The "number of presentations" is the number of times call voices generated based on the corresponding voice label were presented to the user. Dividing the number of reactions by the number of presentations gives the reaction probability, that is, the probability that the user reacts to a call voice generated based on the corresponding voice label.

 The "average arousal change value" is the average of the arousal degree change amounts of the user in response to call voices generated based on the corresponding voice label. The arousal degree change amount is described later.
 図3は、ユーザログDB52の一例の構成を示す図である。ユーザログDB52は、ユーザによる音声生成装置1の利用に係るログを記録したデータベースである。ユーザログDB52は、例えばログ発生日時と、ユーザIDと、音声ラベルと、なじみ対象と、集中度と、反応有無と、覚醒度と、新覚醒度と、覚醒度変化量と、正解ラベルとを関連付けて記録している。ユーザIDと、音声ラベルと、なじみ対象は、なじみ度DB51と同じものである。 FIG. 3 is a diagram showing the configuration of an example of the user log DB 52. The user log DB 52 is a database that records logs related to the use of the voice generation device 1 by the user. The user log DB 52 has, for example, a log generation date and time, a user ID, a voice label, a familiar target, a concentration level, a reaction presence / absence, an alertness level, a new alertness level, an arousal level change amount, and a correct answer label. It is associated and recorded. The user ID, the voice label, and the familiar object are the same as the familiarity DB 51.
 「ログ発生日時」は、ユーザによる音声生成装置1の利用があった日時である。ログ発生日時は、例えばユーザに対する呼びかけ音声の提示がされる毎に記録される。 The "log generation date and time" is the date and time when the user used the voice generator 1. The log generation date and time is recorded, for example, each time a call voice is presented to the user.
 「反応有無」は、ユーザに対して呼びかけ音声が提示された後のユーザの反応の有無の情報である。ユーザの反応があったときには、「あり」が記録される。ユーザの反応がなかったときには、「なし」が記録される。 "Presence / absence of reaction" is information on the presence / absence of reaction of the user after the call voice is presented to the user. When there is a user reaction, "yes" is recorded. "None" is recorded when there is no user response.
 「集中度」は、呼びかけ音声の提示の際のユーザの集中の度合いである。集中度は、例えば作業中のユーザの姿勢、行動をカメラ8で得られる画像から推定することで測定され得る。集中度の値は、ユーザが集中していると考えられる姿勢、行動をする毎に高くなり、ユーザが集中していないと考えられる姿勢、行動をする毎に低くなるように算出される。また、作業中のユーザの瞳孔の開き具合をカメラ8で得られる画像から推定することで測定され得る。集中度の値は、瞳孔がより散瞳している場合に高くなり、瞳孔がより縮瞳している場合には低くなるように算出される。集中度は、例えばLv(Level)1、Lv2、…といった離散値であってよい。なお、集中度の取得手法は、特定の手法には限定されない。 "Concentration ratio" is the degree of concentration of the user when presenting the call voice. The degree of concentration can be measured, for example, by estimating the posture and behavior of the user during work from the image obtained by the camera 8. The value of the degree of concentration is calculated so as to increase each time the user thinks that the user is concentrated and takes an action, and lowers each time the user thinks that the user is not concentrated and takes an action. Further, the degree of opening of the pupil of the user during work can be measured by estimating from the image obtained by the camera 8. The concentration value is calculated to be higher when the pupil is more mydriatic and lower when the pupil is more miotic. The degree of concentration may be a discrete value such as Lv (Level) 1, Lv2, .... The method for acquiring the degree of concentration is not limited to a specific method.
 The "arousal level" is the arousal level acquired before the voice generation device 1 presents the call voice.
 The "new arousal level" is the arousal level newly acquired after the user reacted. It is not recorded when the user did not react.
 The "arousal-level change amount" represents the change in the arousal level before and after the user's reaction. It is obtained, for example, as the difference between the new arousal level and the previous arousal level; it may instead be their ratio or the like. It is not recorded when the user did not react.
 The "correct-answer label" is a correct/incorrect label for supervised learning; for example, a correct answer is recorded as 〇 and an incorrect answer as ×. In the embodiment, a correct answer is recorded when, in the voice presentation operation described later, the user reacts to a call voice from the voice generation device 1 and the arousal level consequently rises above a threshold. An incorrect answer is recorded when the user does not react to the call voice, or when the user reacts but the arousal level remains at or below the threshold.
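 The labeling rule above can be written compactly as follows; the argument names and string return values are assumptions for illustration, while the condition itself (a reaction followed by a rise above the threshold) is taken from the description.

```python
# Minimal sketch of the correct-answer labeling rule in the embodiment.
# `responded` is the response presence/absence, `new_arousal` the arousal
# level re-acquired after the reaction, `threshold` the awakening threshold.

def correct_answer_label(responded: bool, new_arousal: float, threshold: float) -> str:
    if responded and new_arousal > threshold:
        return "correct"    # recorded as 〇
    return "incorrect"      # recorded as ×
```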
 The model DB 53 records models for the voice-label classification used to extract voice-label candidates. In the embodiment, a model classifies voice labels as correct or incorrect in the two-dimensional space of familiarity and concentration. The models include an initial model and a learning model. The initial model is generated from initial values stored in the model DB 53 and is not updated by learning. Here, the initial values are the two constants (a, b) of a linear function (y = ax + b) for classifying voice labels, defined in the two-dimensional space of, for example, familiarity y and concentration x. The binary classification model using the linear function y = ax + b generated from these initial values is the initial model: a voice label whose familiarity exceeds y = ax + b in the xy space is classified as correct (〇), and any other voice label is classified as incorrect (×). The learning model is a trained model generated from the initial model; it can be a binary classification model whose constants (a, b) differ from those of the initial model.
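 The initial model's decision rule reduces to a single comparison against the stored line, as in the following sketch; the function and argument names are assumptions for illustration.

```python
# Minimal sketch of the initial model: a fixed linear boundary y = a*x + b
# in the (concentration x, familiarity y) plane, with a and b the stored
# initial values that learning never updates.

def initial_model_classify(familiarity: float, concentration: float,
                           a: float, b: float) -> bool:
    """True means correct (〇): familiarity lies above the line y = a*x + b."""
    return familiarity > a * concentration + b
```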
 The voice synthesis parameter DB 54 records voice synthesis parameters: data used to synthesize the voice of a target familiar to the user. For example, a voice synthesis parameter may be feature data extracted from voice data collected in advance through the microphone 6, or voice synthesis parameters acquired or defined by another system may be recorded in advance. Each voice synthesis parameter is associated with a voice label.
 FIG. 4 shows an example configuration of the call text DB 55. The call text DB 55 records template data of various call texts for encouraging the user's awakening. The call text is not particularly limited, but it desirably includes an address using the user's name; this enhances the cocktail party effect described later.
 The familiarity DB 51, user log DB 52, model DB 53, voice synthesis parameter DB 54, and call text DB 55 need not be stored in the storage 5. For example, they may be stored in a server separate from the voice generation device 1; in that case, the voice generation device 1 accesses the server through the communication module 11 and acquires the necessary information.
 FIG. 5 is a functional block diagram of the voice generation device 1. As shown in FIG. 5, the voice generation device 1 has an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26. The operations of these units are realized, for example, by the processor 2 executing a program stored in the storage 5. The determination unit 22, selection unit 23, generation unit 24, presentation unit 25, and learning unit 26 may instead be realized by hardware separate from the processor 2.
 The acquisition unit 21 acquires the user's arousal level, and also acquires the user's reaction to the call voice. As described above, the arousal level is calculated from eye movement, blinking activity, electrodermal activity, reaction time to a stimulus, or a combination of these. The eye movement, blinking activity, and reaction time used to calculate the arousal level can be measured, for example, from images of the user captured by the camera 8; the reaction time to a stimulus may also be measured from the audio signal acquired by the microphone 6. Electrodermal activity can be measured, for example, by a sensor worn on the user's arm. The user's reaction can be obtained by determining, for example from images captured by the camera 8, whether the user looked toward the source of the sound after the call voice was presented. The acquisition unit 21 may also be configured to acquire, by communication, an arousal level or a user reaction computed outside the voice generation device 1.
 The determination unit 22 determines, based on the arousal level acquired by the acquisition unit 21, whether the user is in an awake state. When it determines that the user is not in an awake state, it transmits a voice-label selection request to the receiving unit 231 of the selection unit 23. The determination unit 22 makes this determination by comparing the arousal level with a predetermined threshold, that is, a threshold for judging whether the user is awake, which is stored, for example, in the storage 5. The determination unit 22 also determines the presence or absence of a user reaction based on the reaction information acquired by the acquisition unit 21.
 When it is determined that the user is not in an awake state, the selection unit 23 selects the voice label of a candidate voice for encouraging the user's awakening. The selection unit 23 has a receiving unit 231, a model selection unit 232, a voice label candidate extraction unit 233, a voice label selection unit 234, and a transmission unit 235.
 The receiving unit 231 receives the voice-label selection request from the determination unit 22.
 The model selection unit 232 selects, from the model DB 53, the model to be used for voice-label selection. It selects either the initial model or the learning model based on the degree of fit, a value for judging which of the two models has the higher accuracy. The degree of fit is described in detail later.
 The voice label candidate extraction unit 233 extracts, from the familiarity DB 51, voice labels that are candidates for the call voice to be presented to the user, based on the model selected by the model selection unit 232 and the user's concentration level.
 The voice label selection unit 234 selects, from the voice labels extracted by the voice label candidate extraction unit 233, the voice label to be used to generate the call voice presented to the user.
 The transmission unit 235 transmits information on the voice label selected by the voice label selection unit 234 to the generation unit 24.
 The generation unit 24 generates a call voice for encouraging the user's awakening based on the voice label received from the transmission unit 235. It acquires the voice synthesis parameter associated with that voice label from the voice synthesis parameter DB 54, and then generates the call voice from the call text data recorded in the call text DB 55 and the voice synthesis parameter.
 The presentation unit 25 presents the call voice generated by the generation unit 24 to the user, for example by reproducing it through the speaker 7.
 The learning unit 26 trains the model recorded in the model DB 53, for example by binary classification learning using the correct-answer labels.
 Next, the operation of the voice generation device 1 is described. FIGS. 6A and 6B are flowcharts showing the voice presentation process performed by the voice generation device 1. The process of FIGS. 6A and 6B may be performed periodically.
 In step S1, the acquisition unit 21 acquires the user's arousal level and outputs it to the determination unit 22. The acquisition unit 21 also retains the acquired arousal level until the user's reaction to the presented call voice is acquired.
 In step S2, the determination unit 22 determines whether the arousal level acquired by the acquisition unit 21 is at or below the threshold. When the arousal level exceeds the threshold, that is, when the user is awake, the process of FIGS. 6A and 6B ends. When the arousal level is at or below the threshold, that is, when the user is not in an awake state (for example, the user is drowsy), the process proceeds to step S3.
 In step S3, the determination unit 22 transmits a voice-label selection request to the selection unit 23. When the receiving unit 231 receives the request, the model selection unit 232 refers to the user log DB 52 and acquires the response count: the total number of "present" entries in the response presence/absence field.
 In step S4, the model selection unit 232 determines whether the response count is below a threshold for judging whether a usable learning model is recorded in the model DB 53. The threshold is set to 2, for example; in that case, a response count of 0 or 1 is judged to be below the threshold. When the response count is below the threshold, the process proceeds to step S5; otherwise, it proceeds to step S6.
 In step S5, the model selection unit 232 selects the initial values, that is, the initial model, from the model DB 53 and outputs the selected initial model to the voice label candidate extraction unit 233. The process then proceeds to step S9.
 In step S6, the model selection unit 232 calculates the degree of fit. It first acquires all past logs, both with and without responses, from the user log DB 52, and then calculates the degree of fit of both the initial model and the learning model. For example, the model selection unit 232 can use as the degree of fit the accuracy obtained by comparing each model's correct/incorrect output for the concentration value of each log against the response presence/absence recorded in that log. The degree of fit is not limited to accuracy; it may be the precision, recall, F-measure, or the like, likewise calculated from the models' correct/incorrect outputs and the logged response presence/absence. Precision is the proportion of data predicted to be correct for which the user's response was actually "present". Recall is the proportion of logs with an actual user response that were predicted to be correct. The F-measure is the harmonic mean of recall and precision; for example, it can be calculated as 2·Recall·Precision / (Recall + Precision).
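 A minimal sketch of these fit metrics follows, computed from a model's correct/incorrect predictions and the logged response presence/absence; the variable names are assumptions for illustration.

```python
# Minimal sketch of the degree-of-fit metrics named above: accuracy,
# precision, recall, and F-measure (the harmonic mean of recall and
# precision, 2*R*P / (R + P)).

def fit_metrics(predicted_correct: list[bool], responded: list[bool]) -> dict:
    tp = sum(p and r for p, r in zip(predicted_correct, responded))
    fp = sum(p and not r for p, r in zip(predicted_correct, responded))
    fn = sum(not p and r for p, r in zip(predicted_correct, responded))
    tn = sum((not p) and (not r) for p, r in zip(predicted_correct, responded))
    accuracy = (tp + tn) / len(responded)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```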
 In step S7, the model selection unit 232 compares the degrees of fit of the initial model and the learning model and determines whether the learning model's degree of fit is higher. When the initial model's degree of fit is higher, the process proceeds to step S5, and the model selection unit 232 selects the initial model. When the learning model's degree of fit is higher, the process proceeds to step S8.
 In step S8, the model selection unit 232 selects the learning model and outputs it to the voice label candidate extraction unit 233. The process then proceeds to step S9.
 In step S9, the voice label candidate extraction unit 233 acquires the user's current concentration level from the acquisition unit 21.
 In step S10, the voice label candidate extraction unit 233 extracts, from the familiarity DB 51, candidate voice labels to be used to generate the call voice. For example, it extracts all registered voice labels that the model classifies as correct for the current concentration value. A voice label classified as correct is one for which presenting a call voice is expected both to produce a user reaction and to raise the arousal level.
 In step S11, the voice label selection unit 234 selects one voice label from the voice labels extracted by the voice label candidate extraction unit 233. In selecting a voice label, the voice label selection unit 234 obtains, for example, a weighted selection probability for each label based on its past presentation count, and then selects one voice label by random sampling according to those probabilities. The weighted selection probability can be calculated, for example, according to equation (1); it may also be calculated by a different formula. [Equation (1), which defines the weighted selection probability, appears only as an image in the original publication and is not reproduced here.]
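 Because equation (1) is published only as an image, the weighting below is an assumed inverse-frequency scheme chosen to match the stated aim (labels presented less often are favored); it is an illustration, not the patented formula.

```python
# Minimal sketch of step S11: weighted random sampling over the extracted
# candidate labels. The 1 / (1 + presentation count) weight is an assumption
# standing in for equation (1), which is not reproduced in this text.

import random

def select_voice_label(candidates: list[str], presentations: dict[str, int]) -> str:
    weights = [1.0 / (1 + presentations.get(label, 0)) for label in candidates]
    total = sum(weights)
    probabilities = [w / total for w in weights]   # weighted selection probabilities
    return random.choices(candidates, weights=probabilities, k=1)[0]
```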
 In step S12, the transmission unit 235 transmits information indicating the voice label selected by the voice label selection unit 234 to the generation unit 24. The generation unit 24 acquires the voice synthesis parameter corresponding to the received voice label from the voice synthesis parameter DB 54, and then generates a call voice from a call text selected at random from the call text DB 55 and the voice synthesis parameter; the call voice can be generated by voice synthesis processing using the voice synthesis parameter. The process then proceeds to step S13.
 In step S13, the presentation unit 25 presents the call voice generated by the generation unit 24 to the user through the speaker 7.
 In step S14, the acquisition unit 21 acquires the user's reaction and outputs the reaction information to the determination unit 22.
 In step S15, the determination unit 22 determines whether the user reacted. When there was no reaction, the process proceeds to step S20; when there was a reaction, it proceeds to step S16.
 In step S16, the determination unit 22 requests the acquisition unit 21 to acquire the new arousal level, and the acquisition unit 21 acquires it. The new arousal level may be acquired in the same way as the arousal level.
 In step S17, the determination unit 22 obtains the new arousal level from the acquisition unit 21 and determines whether it is at or below a threshold. The threshold in step S17 may be the same as or different from the threshold in step S2. When the new arousal level is above the threshold, the process proceeds to step S18; when it is at or below the threshold, it proceeds to step S20.
 In step S18, the acquisition unit 21 sets the correct-answer label to correct (〇).
 In step S19, the acquisition unit 21 acquires the average arousal-level change from the familiarity DB 51 and updates it using the newly calculated arousal-level change amount and the previously acquired average. The acquisition unit 21 also registers the concentration level, the "response present" information, the arousal level, the new arousal level, the arousal-level change amount, and the correct-answer label in the user log DB 52, associated with the log date and time, voice label, familiar target, and familiarity. The process then proceeds to step S22.
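 If the stored average is maintained as an arithmetic mean over the past responses, the update in step S19 can be sketched as follows; the incremental-mean formula and argument names are assumptions for illustration, since the description does not fix the update rule.

```python
# Minimal sketch of updating the per-label average arousal-level change,
# assuming an arithmetic mean over the n changes observed so far.

def update_average_change(old_average: float, n: int, new_change: float) -> float:
    return (old_average * n + new_change) / (n + 1)
```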
 In step S20, the acquisition unit 21 sets the correct-answer label to incorrect (×).
 In step S21, the acquisition unit 21 registers the concentration level, the "response absent" information, the arousal level, and the correct-answer label in the user log DB 52, associated with the log date and time, voice label, familiar target, and familiarity. The process then proceeds to step S22.
 In step S22, the learning unit 26 refers to the user log DB 52, acquires the response count, and determines whether it is below a threshold for judging whether enough information has accumulated for learning. The threshold is set to 2, for example; in that case, a response count of 0 or 1 is judged to be below the threshold. When the response count is below the threshold, the process of FIGS. 6A and 6B ends; otherwise, the process proceeds to step S23.
 In step S23, the learning unit 26 performs binary classification learning and records the result in the model DB 53, after which the process of FIGS. 6A and 6B ends. In step S23, the learning unit 26 acquires, for example, the correct-answer labels recorded in the user log DB 52 together with the familiarity and concentration values associated with them, and generates a binary classification model of voice labels in the two-dimensional space of familiarity and concentration. FIG. 7 illustrates such a model: it uses a linear function (y = ax + b) for classifying voice labels, defined in the two-dimensional space of familiarity y and concentration x. As shown in FIG. 7, a voice label whose familiarity lies above the straight line L representing y = ax + b, that is, whose familiarity falls in region a of FIG. 7, is classified as correct (〇), while any other voice label is classified as incorrect (×). Various binary classification learning methods, such as logistic regression, SVM (Support Vector Machine), or neural networks, can be used to generate the model.
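 As one of the named learning methods, logistic regression yields a linear boundary in the familiarity-concentration plane, as in the following sketch; the scikit-learn usage and array names are assumptions for illustration, not the patented implementation.

```python
# Minimal sketch of the two-dimensional binary classification learning in
# step S23 using logistic regression; its linear decision boundary plays
# the role of the line y = ax + b in FIG. 7.

from sklearn.linear_model import LogisticRegression

def train_label_classifier(concentration: list[float], familiarity: list[float],
                           correct: list[bool]) -> LogisticRegression:
    X = list(zip(concentration, familiarity))   # points in the (x, y) plane
    y = [1 if c else 0 for c in correct]        # 1 = correct (〇), 0 = incorrect (×)
    model = LogisticRegression()
    model.fit(X, y)
    return model
```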
 Here, the reason the two axes of familiarity and concentration are adopted in the binary classification model of the embodiment is explained. A person tends to direct selective attention toward familiar sounds, such as the conversation of a person of interest or the person's own name; this is called the cocktail party effect. Furthermore, in Yumiko Honjo, "Physiological Psychological Study on Attention and Arousal" (Kwansei Gakuin University doctoral dissertation, Otsu No. 217, pp. 187-188), a model of attention and arousal incorporating both selective attention and arousal is derived, which suggests a relationship between the occurrence of selective attention and the arousal level. Because familiarity is thus considered to affect both how readily the cocktail party effect occurs and the change in arousal level it produces, it is adopted as one axis of learning.
 As for concentration, the RIKEN news release "The brain directs attention and heightens concentration through 'efficient selection'" (December 8, 2011, https://www.riken.jp/press/2011/20111208/, retrieved June 10, 2020) reports that in a concentrated state the information transmitted from sensation to perception is limited. In other words, a sound perceived while concentration is high is presumed to be one that the user needs or that readily reaches the user's ears. Because concentration can thus be considered to affect how readily the user's selective attention arises, that is, which sounds the user tends to react to, it is adopted as the other axis of learning.
 As described above, according to the embodiment, when the user is determined not to be awake, the user is called using a voice familiar to the user. Even when the user is drowsy, the cocktail party effect therefore helps the call voice reach the user, and an increase in arousal level within a short time can be expected. In addition, because both familiarity and concentration are used in selecting the familiar voice, the user can be presented with a call voice to which the user is more likely to react.
 Also according to the embodiment, voice labels are classified using a learning model with the two axes of familiarity and concentration, so as learning progresses, voice-label candidates better suited to the user can be expected to be extracted. Furthermore, the voice label used to generate the voice is selected from the extracted candidates by random sampling based on past presentation counts. This suppresses the habituation and boredom that would result from frequently presenting call voices with the same voice label; even when the voice generation device 1 is used over a long period, the user's reaction to the call voice therefore remains likely, and a rise in the user's arousal level can consequently be expected.
 [Modifications]
 A modification of the embodiment is described. In the embodiment, a plurality of candidate voice labels used to generate the call voice are extracted from the familiarity DB 51 based on familiarity and concentration, and one voice label is then selected from the extracted candidates by random sampling. As a simpler process, a single voice label may be extracted from the familiarity DB 51 based on familiarity and concentration, and the call voice may be generated from that voice label.
 Also, in the embodiment, the candidate voice labels used to generate the call voice are extracted from the voice labels classified by a learning model with the two axes of familiarity and concentration. A learning model is not necessarily required for this extraction; for example, the candidates may be extracted by taking the voice labels with the highest weighted sums of familiarity and concentration.
 Also, the embodiment shows an example in which the selection of the voice label based on familiarity and concentration, the generation of the call voice, and the training of the learning model are all performed in the voice generation device 1. These may instead be performed in separate devices.
 The learning model described above classifies voice labels in the two-dimensional space of familiarity and concentration. The learning model may further be configured to classify voice labels in a three-dimensional space that also includes the arousal-level change amount. FIG. 8 illustrates a binary classification model using familiarity, concentration, and arousal-level change amount: it uses a classification surface for classifying voice labels, defined in that three-dimensional space. In the example of FIG. 8, a voice label located in the space above the classification surface P is classified as correct (〇), while one located in the space below is classified as incorrect (×). As before, various binary classification learning methods, such as logistic regression, SVM (Support Vector Machine), or neural networks, can be used to generate the model. The arousal-level change amount characterizes the user's reaction in addition to the correct-answer label, that is, in addition to whether the user responds; adopting it as an axis can therefore be expected to further improve the accuracy of the correct-answer classification.
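 The three-dimensional variant can reuse the same learning scheme with one more feature, as in this sketch; again an assumed scikit-learn illustration, whose learned boundary plays the role of the surface P in FIG. 8.

```python
# Minimal sketch of the three-dimensional variant: the arousal-level change
# amount joins familiarity and concentration as a third feature, so the
# logistic-regression boundary becomes a plane in that space.

from sklearn.linear_model import LogisticRegression

def train_3d_label_classifier(concentration, familiarity, arousal_change, correct):
    X = list(zip(concentration, familiarity, arousal_change))
    y = [1 if c else 0 for c in correct]
    model = LogisticRegression()
    model.fit(X, y)
    return model
```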
 Also, in the embodiment, when the user is still not awake after the call, that is, when the new arousal level is determined to be at or below the threshold, a call proposing an active behavior expected to raise the arousal level, even at the cost of some time, may be made to the user. This gives the user an opportunity to act, and an increase in arousal level within a short time can consequently be expected. Moreover, by making this proposing call in a voice familiar to the user as well, the cocktail party effect again helps the call voice reach the user.
 Also, in the embodiment, the call text used for the call voice is selected at random from the templates recorded in the call text DB 55. These templates may be changed as appropriate; for example, by collecting the user's everyday conversation, the templates may be revised to include words that frequently appear in it and readily attract the user's attention.
 Also, when the call voice is generated in the embodiment, the volume or the like may additionally be adjusted according to the arousal level.
 Each process according to the embodiment described above can also be stored as a program executable by a processor, that is, a computer, and distributed on a storage medium of an external storage device such as a magnetic disk, optical disc, or semiconductor memory. The processor reads the program stored on the storage medium of the external storage device, and its operation is controlled by the read program, whereby the processes described above can be executed.
 The present invention is not limited to the above embodiment and can be variously modified at the implementation stage without departing from its gist. The embodiments may also be combined as appropriate, in which case their combined effects are obtained. Furthermore, the above embodiment includes various inventions, and various inventions can be extracted by combinations selected from the disclosed constituent elements. For example, even if some constituent elements are deleted from all those shown in the embodiment, a configuration from which those elements are deleted can be extracted as an invention, provided the problem can still be solved and the effect obtained.
 1 … Voice generation device
 2 … Processor
 3 … ROM
 4 … RAM
 5 … Storage
 6 … Microphone
 7 … Speaker
 8 … Camera
 9 … Input device
 10 … Display
 11 … Communication module
 21 … Acquisition unit
 22 … Determination unit
 23 … Selection unit
 24 … Generation unit
 25 … Presentation unit
 26 … Learning unit
 51 … Familiarity database (DB)
 52 … User log database (DB)
 53 … Model database (DB)
 54 … Voice synthesis parameter database (DB)
 55 … Call text database (DB)
 231 … Receiving unit
 232 … Model selection unit
 233 … Voice label candidate extraction unit
 234 … Voice label selection unit
 235 … Transmission unit

Claims (6)

 1. A voice generation device comprising:
 an acquisition unit that acquires an arousal level indicating the user's degree of arousal, from sleep to excitement;
 a determination unit that determines, based on the arousal level, whether the user is in an awake state;
 a selection unit that, when the user is not in an awake state, selects a voice for encouraging the user's awakening from among a plurality of voice candidates, based on a familiarity indicating the degree to which the user is familiar with each of the plurality of voice candidates and a concentration level indicating the degree of the user's current concentration; and
 a generation unit that generates, based on the selected voice, a call voice to be presented to the user.
 2. The voice generation device according to claim 1, wherein the selection unit extracts a plurality of voice candidates based on the familiarity and the concentration level, and selects one voice candidate from the extracted voice candidates by random sampling as the voice for encouraging the user's awakening.
 3. The voice generation device according to claim 1 or 2, wherein the selection unit selects the voice for encouraging the user's awakening using a classification model that classifies, in a two-dimensional space of the familiarity and the concentration level, the plurality of voice candidates into first voice candidates for which a reaction of the user to the presentation of the call voice is expected and a rise in the user's arousal level is expected, and second voice candidates for which a reaction of the user to the presentation of the call voice is not expected or a rise in the user's arousal level is not expected.
 4. The voice generation device according to claim 1 or 2, wherein the selection unit selects the voice for encouraging the user's awakening using a classification model that classifies, in a three-dimensional space of the familiarity, the concentration level, and the amount of change in the arousal level due to the presentation of the call voice, the plurality of voice candidates into first voice candidates for which a reaction of the user to the presentation of the call voice is expected and a rise in the user's arousal level is expected, and second voice candidates for which a reaction of the user to the presentation of the call voice is not expected or a rise in the user's arousal level is not expected.
 5. A voice generation method comprising:
 acquiring, by an acquisition unit, an arousal level indicating the user's degree of arousal, from sleep to excitement;
 determining, by a determination unit, based on the arousal level, whether the user is in an awake state;
 selecting, by a selection unit, when the user is not in an awake state, a voice for encouraging the user's awakening from among a plurality of voice candidates, based on a familiarity indicating the degree to which the user is familiar with each of the plurality of voice candidates and a concentration level indicating the degree of the user's current concentration; and
 generating, by a generation unit, based on the selected voice, a call voice to be presented to the user.
 6. A voice generation program for causing a processor to function as the acquisition unit, the determination unit, the selection unit, and the generation unit of the voice generation device according to any one of claims 1 to 4.
PCT/JP2020/024820 2020-06-24 2020-06-24 Voice generation device, voice generation method, and voice generation program WO2021260846A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/024820 WO2021260846A1 (en) 2020-06-24 2020-06-24 Voice generation device, voice generation method, and voice generation program
JP2022531319A JP7416244B2 (en) 2020-06-24 2020-06-24 Voice generation device, voice generation method, and voice generation program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/024820 WO2021260846A1 (en) 2020-06-24 2020-06-24 Voice generation device, voice generation method, and voice generation program

Publications (1)

Publication Number Publication Date
WO2021260846A1 true WO2021260846A1 (en) 2021-12-30

Family

ID=79282106

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/024820 WO2021260846A1 (en) 2020-06-24 2020-06-24 Voice generation device, voice generation method, and voice generation program

Country Status (2)

Country Link
JP (1) JP7416244B2 (en)
WO (1) WO2021260846A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016192127A (en) * 2015-03-31 2016-11-10 パイオニア株式会社 Music information update device
JP2019124977A (en) * 2018-01-11 2019-07-25 トヨタ自動車株式会社 On-board voice output device, voice output control method, and voice output control program


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
OMI, TAKUHIRO: "Vision technology and transportation systems that support safe driving - Image sensor for evaluating the driver's doze state", IMAGE LABORATORY, vol. 26, no. 2, 10 February 2015 (2015-02-10), pages 64 - 69 *

Also Published As

Publication number Publication date
JPWO2021260846A1 (en) 2021-12-30
JP7416244B2 (en) 2024-01-17


Legal Events

Date Code Title Description
121  Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20941780; Country of ref document: EP; Kind code of ref document: A1)
ENP  Entry into the national phase (Ref document number: 2022531319; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122  Ep: pct application non-entry in european phase (Ref document number: 20941780; Country of ref document: EP; Kind code of ref document: A1)