WO2021260848A1 - Learning device, learning method, and learning program - Google Patents

Learning device, learning method, and learning program

Info

Publication number
WO2021260848A1
WO2021260848A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
voice
learning
degree
model
Prior art date
Application number
PCT/JP2020/024823
Other languages
French (fr)
Japanese (ja)
Inventor
妙 佐藤
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2020/024823 priority Critical patent/WO2021260848A1/en
Priority to JP2022531321A priority patent/JP7416245B2/en
Publication of WO2021260848A1 publication Critical patent/WO2021260848A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers

Definitions

  • This embodiment relates to a learning device, a learning method, and a learning program for voice selection.
  • Various methods have been proposed for selecting the voice to be presented to a user from among a plurality of voice candidates; speech classification models may be used for such selection.
  • In some classification models of this type, learning is performed by providing information on the correctness of the classified speech as teacher data.
  • Appropriate evaluation of the speech is required to generate the teacher data.
  • As a proposal for voice evaluation, for example, the method described in Non-Patent Document 1 is known.
  • The embodiment provides a learning device, a learning method, and a learning program that can efficiently collect teacher data for speech classification.
  • The learning device according to the embodiment includes a learning unit that acquires teacher data for a learning model, which selects the voice to be presented to the user from among a plurality of voice candidates, based on the user's reaction to a plurality of voices presented to the user simultaneously.
  • According to the embodiment, a learning device, a learning method, and a learning program capable of efficiently collecting teacher data for speech classification are provided.
  • FIG. 1 is a diagram showing a hardware configuration of an example of a voice generator according to an embodiment.
  • FIG. 2A is a diagram showing an example of speaker arrangement.
  • FIG. 2B is a diagram showing an example of speaker arrangement.
  • FIG. 2C is a diagram showing an example of speaker arrangement.
  • FIG. 2D is a diagram showing an example of speaker arrangement.
  • FIG. 3 is a diagram showing the configuration of an example of the familiarity DB.
  • FIG. 4 is a diagram showing an example configuration of a user log DB.
  • FIG. 5 is a diagram showing the structure of an example of the call statement DB.
  • FIG. 6 is a functional block diagram of the voice generator.
  • FIG. 7A is a flowchart showing a voice presentation process by the voice generator.
  • FIG. 7B is a flowchart showing a voice presentation process by the voice generator.
  • FIG. 8 is a diagram showing an image of a binary classification model using "familiarity," "concentration level," and "arousal level change amount."
  • FIG. 1 is a diagram showing a hardware configuration of an example of a voice generation device including a learning device according to an embodiment.
  • The voice generation device 1 according to the embodiment emits a call voice urging the user's awakening when the user is not in an awake state, for example because of drowsiness.
  • The arousal level in the embodiment is an index indicating the degree of arousal, and corresponds to the physiological arousal level.
  • The physiological arousal level corresponds to the activity level of the cerebrum and represents the degree of arousal from sleep to excitement.
  • The physiological arousal level is assessed from eye movement, blinking activity, electrodermal activity, reaction time to stimuli, and the like.
  • The arousal level in the embodiment is calculated from any one of eye movement, blinking activity, electrodermal activity, and reaction time to stimuli, or from a combination thereof.
  • The arousal level is, for example, a value that increases as the user moves from a sleep state toward an excited state.
  • The arousal level may be a continuous numerical value or a discrete value such as Level 1, Level 2, and so on. When the arousal level is calculated from a combination of eye movement, blinking activity, electrodermal activity, and reaction time to stimuli, the manner of combination is not particularly limited; for example, a simple sum or a weighted sum of these values can be used.
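For an illustration of the combination methods just mentioned, the following is a minimal sketch that combines the four measurements by a weighted sum and derives a discrete level by binning. It assumes each signal is pre-normalized to [0, 1]; the function names, weights, and bin count are hypothetical and not taken from the patent.

```python
# Hypothetical sketch: combine normalized arousal-related signals into one
# arousal level by a weighted sum, one of the combination methods the text
# allows. Weights and normalization are illustrative assumptions.

def arousal_level(eye_movement: float,
                  blink_activity: float,
                  electrodermal: float,
                  reaction_time_s: float,
                  weights=(0.3, 0.2, 0.3, 0.2)) -> float:
    """Each input is assumed pre-normalized to [0, 1], larger = more aroused;
    reaction time is inverted so that a faster reaction raises the score."""
    features = (eye_movement, blink_activity, electrodermal,
                1.0 - min(reaction_time_s, 1.0))
    return sum(w * f for w, f in zip(weights, features))

def arousal_level_discrete(score: float, n_levels: int = 5) -> int:
    """Optional binning into discrete values Level 1, Level 2, ..."""
    return min(int(score * n_levels) + 1, n_levels)
```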
  • The voice generation device 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, speakers 7a and 7b, a camera 8, an input device 9, a display 10, and a communication module 11.
  • The voice generation device 1 is, for example, one of various terminals such as a personal computer (PC), a smartphone, or a tablet terminal. It is not limited to these, and can be mounted on various devices used by the user.
  • The voice generation device 1 does not have to have all the components shown in FIG. 1. For example, the microphone 6, the speakers 7a and 7b, the camera 8, and the display 10 may be devices separate from the voice generation device 1.
  • The processor 2 is a control circuit, such as a CPU, that controls the overall operation of the voice generation device 1.
  • The processor 2 does not have to be a CPU and may be an ASIC, an FPGA, a GPU, or the like.
  • The processor 2 does not have to consist of a single CPU or the like and may consist of a plurality of CPUs or the like.
  • The ROM 3 is a non-volatile memory such as a flash memory.
  • For example, the boot program of the voice generation device 1 is stored in the ROM 3.
  • The RAM 4 is a volatile memory such as an SDRAM. The RAM 4 can be used as working memory for various processes in the voice generation device 1.
  • The storage 5 is a storage device such as a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • Various programs used in the voice generation device 1 are stored in the storage 5.
  • The storage 5 may store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a voice synthesis parameter database (DB) 54, and a call statement database (DB) 55. These databases will be described in detail later.
  • The microphone 6 is a device that converts input voice into a voice signal, which is an electric signal.
  • The voice signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5.
  • For example, the voice synthesis parameters for synthesizing the call voice can be acquired from voice input via the microphone 6.
  • Speakers 7a and 7b are devices that output voice based on the input voice signal.
  • Here, it is desirable that the speaker 7a and the speaker 7b are not in close proximity to each other.
  • It is also desirable that the speakers 7a and 7b are arranged in different directions as seen from the user.
  • Furthermore, it is desirable that the speaker 7a and the speaker 7b are equidistant from the user.
  • FIGS. 2A and 2B are diagrams showing arrangement examples of the speakers 7a and 7b.
  • In FIG. 2A, the speakers 7a and 7b are both arranged in front of the user U, equidistant from the user.
  • In FIG. 2B, the speakers 7a and 7b are arranged in front of and behind the user U, respectively, equidistant from the user.
  • Speakers are arranged in the user's environment in the same number as the number of presented voices; FIG. 1 shows an example in which the number of presented voices is two.
  • The number of presented voices may instead be three or more.
  • In that case, three or more speakers are arranged. Even then, it is desirable that the speakers are not close to each other, that they are placed in different directions as seen from the user, and that each speaker is equidistant from the user.
  • For example, FIGS. 2C and 2D show arrangement examples with three speakers 7a, 7b, and 7c.
  • In FIG. 2C, the speakers 7a, 7b, and 7c are arranged in front of the user U.
  • In FIG. 2D, the speakers 7a, 7b, and 7c are arranged behind the user U.
  • The camera 8 captures images of the user.
  • The user's image obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5.
  • The user's image is used, for example, to acquire the arousal level or the user's reaction to the call voice.
  • The input device 9 is a mechanical input device such as a button, a switch, a keyboard, or a mouse, or a software-based input device using a touch sensor.
  • The input device 9 receives various inputs from the user. The input device 9 then outputs signals corresponding to the user's inputs to the processor 2.
  • The display 10 is a display such as a liquid crystal display or an organic EL display.
  • The display 10 displays various images.
  • The communication module 11 is a device with which the voice generation device 1 carries out communication.
  • The communication module 11 communicates with, for example, a server provided outside the voice generation device 1.
  • The communication method of the communication module 11 is not particularly limited.
  • The communication module 11 may carry out communication wirelessly or by wire.
  • FIG. 3 is a diagram showing the configuration of an example of the familiarity DB 51.
  • The familiarity DB 51 is a database that records the user's "familiarity."
  • The familiarity DB 51 records, for example, a user ID, a voice label, a familiar target, a familiarity, a number of reactions, a number of presentations, and an average arousal level change, in association with one another.
  • The "user ID" is an ID assigned to each user of the voice generation device 1.
  • The user ID may be associated with user attribute information such as the user's name.
  • The "voice label" is a label uniquely assigned to each call-voice candidate. Any label can be used as the voice label; for example, the name of the familiar target may be used.
  • The "familiar target" is a person with whom the user routinely converses, or another source of a voice that the user often hears.
  • The familiar target does not necessarily have to be a person.
  • "Familiarity" is the degree of the user's familiarity with the voice of the corresponding familiar target.
  • The familiarity can be calculated from, for example, the frequency of communication with the familiar target via SNS, the frequency of daily conversation with the familiar target, and the frequency with which the user hears the familiar target in daily life; the higher these frequencies, the larger the familiarity value.
  • The familiarity may also be acquired by the user's self-report.
  • The "number of reactions" is the number of times the user reacted to call voices generated based on the corresponding voice label.
  • The "number of presentations" is the number of times call voices generated based on the corresponding voice label were presented to the user.
  • The reaction probability can be calculated by dividing the number of reactions by the number of presentations.
  • The reaction probability is the probability that the user reacts to a call voice generated based on the corresponding voice label.
  • The "average arousal level change" is the average of the user's arousal level change amounts for call voices generated based on the corresponding voice label.
  • The arousal level change amount will be described later.
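For a rough illustration of the familiarity DB just described, the sketch below models one record and derives the reaction probability by dividing the number of reactions by the number of presentations. The field names are hypothetical stand-ins for the columns of FIG. 3.

```python
# Hypothetical sketch of a familiarity DB record (FIG. 3) and the derived
# reaction probability (number of reactions / number of presentations).
from dataclasses import dataclass

@dataclass
class FamiliarityRecord:
    user_id: str
    voice_label: str
    familiar_target: str
    familiarity: float
    n_reactions: int = 0
    n_presentations: int = 0
    avg_arousal_change: float = 0.0

    @property
    def reaction_probability(self) -> float:
        # Guard against division by zero before any presentation.
        if self.n_presentations == 0:
            return 0.0
        return self.n_reactions / self.n_presentations

rec = FamiliarityRecord("U001", "label_A", "target_A", 0.9,
                        n_reactions=3, n_presentations=4)
print(rec.reaction_probability)  # 0.75
```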
  • FIG. 4 is a diagram showing the configuration of an example of the user log DB 52.
  • The user log DB 52 is a database that records logs of the user's use of the voice generation device 1.
  • The user log DB 52 records, for example, a log generation date and time, a user ID, a voice label, a familiar target, a concentration level, reaction presence/absence, an arousal level, a new arousal level, an arousal level change amount, and a correct answer label, in association with one another.
  • The user ID, the voice label, and the familiar target are the same as those in the familiarity DB 51.
  • The "log generation date and time" is the date and time at which the user used the voice generation device 1.
  • The log generation date and time is recorded, for example, each time a call voice is presented to the user.
  • "Reaction presence/absence" is information on whether the user reacted after the call voice was presented to the user. When the user reacted, "yes" is recorded; when the user did not react, "no" is recorded.
  • The "concentration level" is the degree of the user's concentration at the time the call voice is presented.
  • The concentration level can be measured, for example, by estimating the posture and behavior of the user during work from the image obtained by the camera 8.
  • The concentration value is calculated so as to increase whenever the user's posture or behavior suggests concentration, and to decrease whenever it suggests a lack of concentration.
  • Alternatively, the degree of dilation of the user's pupils during work can be estimated from the image obtained by the camera 8.
  • The concentration value is then calculated to be higher when the pupils are more dilated and lower when they are more constricted.
  • The concentration level may be a discrete value such as Lv (Level) 1, Lv 2, and so on.
  • The method of acquiring the concentration level is not limited to any specific method.
  • the "awakening degree” is the awakening degree acquired before the presentation of the call voice by the voice generation device 1.
  • the "new arousal degree" is the arousal degree newly acquired after the user's reaction. New arousal is not recorded when there is no user response.
  • the "awakening degree change amount” is an amount representing the change in the arousal degree before and after the user's reaction.
  • the amount of change in alertness is obtained, for example, from the difference between the new alertness and the alertness.
  • the amount of change in arousal level may be the ratio of the new arousal level to the arousal level or the like. The amount of change in alertness is not recorded when there is no reaction from the user.
  • the "correct answer label” is a label of correct or incorrect answers for supervised learning. For example, the correct answer is recorded as ⁇ , and the incorrect answer is recorded as ⁇ .
  • the correct label will be described in detail later.
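The sketch below illustrates, under assumed field names, how one user-log entry might be assembled: the arousal level change amount is computed as the difference between the new arousal level and the level measured before presentation, and the reaction-dependent fields are left unrecorded when there is no reaction.

```python
# Hypothetical sketch of assembling one user log DB entry (FIG. 4). The
# change amount uses the difference form; a ratio could be used instead.
from datetime import datetime

def make_log_entry(user_id, voice_label, familiar_target, concentration,
                   reacted, arousal_before, arousal_after=None,
                   correct_label=None):
    # arousal_after must be supplied when reacted is True.
    change = (arousal_after - arousal_before) if reacted else None
    return {
        "timestamp": datetime.now().isoformat(),
        "user_id": user_id,
        "voice_label": voice_label,
        "familiar_target": familiar_target,
        "concentration": concentration,
        "reacted": reacted,
        "arousal": arousal_before,
        "new_arousal": arousal_after if reacted else None,
        "arousal_change": change,
        "correct_label": correct_label,
    }
```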
  • The model DB 53 is a database that records voice-label classification models for extracting voice label candidates.
  • A model is configured to classify voice labels as correct or incorrect in the two-dimensional space of familiarity and concentration level.
  • The models include an initial model and a learning model.
  • The initial model is a model generated based on initial values stored in the model DB 53, and is not updated by learning.
  • The initial values are, for example, constants (the coefficients of a plane equation) that determine the classification plane for voice labels defined in the three-dimensional space of "familiarity," "concentration level," and "arousal level change amount."
  • The classification plane generated from these initial values constitutes the initial model.
  • The learning model is a trained model generated from the initial model.
  • The learning model can be a binary classification model whose classification plane differs from that of the initial model.
  • The voice synthesis parameter DB 54 is a database in which voice synthesis parameters are recorded.
  • A voice synthesis parameter is data used to synthesize the voice of one of the user's familiar targets.
  • For example, a voice synthesis parameter may be feature data extracted from voice data collected in advance via the microphone 6.
  • Alternatively, voice synthesis parameters acquired or defined by other systems may be recorded in advance.
  • Each voice synthesis parameter is associated with a voice label.
  • FIG. 5 is a diagram showing the configuration of an example of the call statement DB 55.
  • The call statement DB 55 is a database in which template data of various call statements for encouraging the user's awakening are recorded.
  • The call statements are not particularly limited. However, it is desirable that a call statement includes a call using the user's name, in order to enhance the cocktail party effect described later.
  • The familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call statement DB 55 do not necessarily have to be stored in the storage 5.
  • These databases may be stored in a server separate from the voice generation device 1.
  • In that case, the voice generation device 1 accesses the server using the communication module 11 and acquires the necessary information.
  • FIG. 6 is a functional block diagram of the voice generation device 1.
  • The voice generation device 1 has an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26.
  • The operations of the acquisition unit 21, the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 are realized, for example, by the processor 2 executing a program stored in the storage 5.
  • The determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 may also be realized by hardware separate from the processor 2.
  • The acquisition unit 21 acquires the user's arousal level. The acquisition unit 21 also acquires the user's reaction to the call voice. As described above, the arousal level is calculated from any one of eye movement, blinking activity, electrodermal activity, and reaction time to stimuli, or from a combination thereof.
  • The eye movement, the blinking activity, and the reaction time to stimuli used to calculate the arousal level can be measured, for example, from an image of the user acquired by the camera 8.
  • The reaction time to stimuli may also be measured from the audio signal acquired by the microphone 6.
  • Electrodermal activity can be measured, for example, by a sensor worn on the user's arm.
  • The user's reaction includes the presence or absence of a physical reaction, such as the user's head or line of sight turning toward the speaker 7a or 7b, together with the direction of that reaction; it can be obtained, for example, by measurement from an image acquired by the camera 8.
  • The acquisition unit 21 may instead be configured to acquire, by communication, an arousal level or a user reaction calculated outside the voice generation device 1.
  • The determination unit 22 determines whether or not the user is awake based on the arousal level acquired by the acquisition unit 21. When the determination unit 22 determines that the user is not in an awake state, it transmits a voice label selection request to the reception unit 231 of the selection unit 23. Here, the determination unit 22 makes this determination by comparing the arousal level with a predetermined threshold value.
  • The threshold value is an arousal level threshold for determining whether or not the user is in an awake state, and is stored in, for example, the storage 5. The determination unit 22 also determines whether or not there was a user reaction, based on the reaction information acquired by the acquisition unit 21.
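A minimal sketch of this determination is shown below; the threshold value used here is an arbitrary placeholder, since the patent does not fix a specific number.

```python
# Hypothetical sketch of the determination unit's check: a voice label
# selection request is issued when the arousal level is at or below the
# threshold, i.e. when the user is not in an awake state.
AROUSAL_THRESHOLD = 0.5  # placeholder value

def needs_call(arousal: float, threshold: float = AROUSAL_THRESHOLD) -> bool:
    return arousal <= threshold
```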
  • The selection unit 23 selects the voice labels of candidate voices for encouraging the user's awakening.
  • The selection unit 23 includes a reception unit 231, a model selection unit 232, a voice label candidate extraction unit 233, a voice label selection unit 234, and a transmission unit 235.
  • The reception unit 231 receives the voice label selection request from the determination unit 22.
  • The model selection unit 232 selects, from the model DB 53, the model to be used for selecting voice labels.
  • The model selection unit 232 selects either the initial model or the learning model based on the degree of fit.
  • The degree of fit is a value for determining which of the initial model and the learning model has higher accuracy. The degree of fit will be described in detail later.
  • The voice label candidate extraction unit 233 extracts candidate voice labels for the call voice to be presented to the user from the familiarity DB 51, based on the model selected by the model selection unit 232 and the user's concentration level.
  • The voice label selection unit 234 selects, from the voice labels extracted by the voice label candidate extraction unit 233, the voice labels used to generate the call voices to be presented to the user.
  • The transmission unit 235 transmits the information of the voice labels selected by the voice label selection unit 234 to the generation unit 24.
  • The generation unit 24 generates call voices for encouraging the user's awakening based on the voice labels received from the transmission unit 235.
  • The generation unit 24 acquires the voice synthesis parameters corresponding to the voice labels received from the transmission unit 235 from the voice synthesis parameter DB 54. The generation unit 24 then generates call voices based on the call statement data recorded in the call statement DB 55 and the voice synthesis parameters.
  • The presentation unit 25 presents the call voices generated by the generation unit 24 to the user.
  • The presentation unit 25 reproduces the call voices generated by the generation unit 24 using the speakers 7a and 7b.
  • The learning unit 26 trains the model recorded in the model DB 53.
  • The learning unit 26 performs the learning by, for example, binary classification learning using the correct answer labels.
  • FIGS. 7A and 7B are flowcharts showing the voice presentation processing by the voice generation device 1. The processes of FIGS. 7A and 7B may be performed periodically.
  • In step S1, the acquisition unit 21 acquires the user's arousal level.
  • The acquisition unit 21 outputs the acquired arousal level to the determination unit 22, and holds it until the timing at which the user's reaction is acquired after the call voice is presented.
  • In step S2, the determination unit 22 determines whether or not the arousal level acquired by the acquisition unit 21 is equal to or less than the threshold value.
  • When it is determined in step S2 that the arousal level exceeds the threshold value, that is, when the user is awake, the processes of FIGS. 7A and 7B end.
  • When it is determined in step S2 that the arousal level is equal to or less than the threshold value, that is, when the user is not awake, for example because of drowsiness, the process proceeds to step S3.
  • In step S3, the determination unit 22 transmits a voice label selection request to the selection unit 23.
  • The model selection unit 232 refers to the user log DB 52 and acquires the number of logs with a reaction, that is, the total number of "yes" entries under "reaction presence/absence."
  • In step S4, the model selection unit 232 determines whether or not the number of logs with a reaction is less than a threshold value.
  • This threshold value is for determining whether or not a usable learning model is recorded in the model DB 53.
  • The threshold is set to, for example, 2; in this case, when the number of reactions is 0 or 1, it is determined to be less than the threshold.
  • When the number is less than the threshold, the process proceeds to step S5.
  • Otherwise, the process proceeds to step S6.
  • In step S5, the model selection unit 232 selects the initial values, that is, the initial model, from the model DB 53. The model selection unit 232 then outputs the selected initial model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
  • In step S6, the model selection unit 232 calculates the degree of fit.
  • The model selection unit 232 first acquires all past logs, with and without reactions, from the user log DB 52. It then calculates the degree of fit of both the initial model and the learning model.
  • As the degree of fit, the model selection unit 232 can use, for example, the accuracy obtained by comparing each model's correct/incorrect output, given the concentration value of each log, against the reaction presence/absence of that log.
  • The degree of fit is not limited to the accuracy; it may instead be, for example, the precision, the recall, or the F-measure, likewise calculated from the model's correct/incorrect output and the reaction presence/absence of each log.
  • The precision is the proportion of the data predicted to be correct for which the user actually reacted.
  • The recall is the proportion of the logs with an actual user reaction that were predicted to be correct.
  • The F-measure is the harmonic mean of the recall and the precision. For example, the F-measure can be calculated as 2 × Recall × Precision / (Recall + Precision).
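The sketch below shows one way such a degree of fit could be computed from the user log: a model's correct/incorrect output for each log's concentration value is scored against the logged reaction using accuracy, precision, recall, or the F-measure. The function shape and data layout are assumptions for illustration.

```python
# Hypothetical sketch of the degree-of-fit computation. `model` is any
# callable that returns True ("correct") for a given concentration value;
# `logs` is an iterable of (concentration, reacted) pairs from the user log.

def degree_of_fit(model, logs, metric="accuracy"):
    tp = fp = fn = tn = 0
    for concentration, reacted in logs:
        predicted = model(concentration)
        if predicted and reacted:
            tp += 1
        elif predicted and not reacted:
            fp += 1
        elif not predicted and reacted:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if metric == "accuracy":
        return (tp + tn) / max(tp + fp + fn + tn, 1)
    if metric == "precision":
        return precision
    if metric == "recall":
        return recall
    # F-measure: 2 * Recall * Precision / (Recall + Precision)
    denom = recall + precision
    return 2 * recall * precision / denom if denom else 0.0
```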
  • In step S7, the model selection unit 232 compares the degrees of fit of the initial model and the learning model, and determines whether or not the degree of fit of the learning model is higher.
  • When it is determined in step S7 that the degree of fit of the initial model is higher, the process proceeds to step S5, where the model selection unit 232 selects the initial values, that is, the initial model.
  • When it is determined in step S7 that the degree of fit of the learning model is higher, the process proceeds to step S8.
  • In step S8, the model selection unit 232 selects the learning model. The model selection unit 232 then outputs the selected learning model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
  • In step S9, the voice label candidate extraction unit 233 acquires the user's current concentration level from the acquisition unit 21.
  • The voice label candidate extraction unit 233 then extracts candidate voice labels for generating the call voice from the familiarity DB 51.
  • The number of candidate voice labels extracted is equal to or greater than a specified number, for example the number of call voices to be presented.
  • For example, the voice label candidate extraction unit 233 extracts, from the voice labels registered in the familiarity DB 51, all voice labels to which the correct answer label is assigned for the current concentration value.
  • A voice label with the correct answer label is a voice label for which the user is expected both to react to the presented call voice and to show an increase in arousal level.
  • The voice label selection unit 234 selects the specified number of voice labels, for example the same number as the number of presented call voices, from the voice labels extracted by the voice label candidate extraction unit 233.
  • When selecting voice labels, the voice label selection unit 234 obtains, for example, a weighted selection probability based on the number of past presentations. The voice label selection unit 234 then selects voice labels by random sampling based on the weighted selection probability.
  • The weighted selection probability can be calculated, for example, according to equation (1).
  • The weighted selection probability may instead be calculated by an equation different from equation (1).
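Equation (1) is not reproduced in this text, so the sketch below substitutes an assumed weighting that is inversely related to the past presentation count, which favors rarely presented labels, and samples the specified number of labels without replacement.

```python
# Hypothetical sketch of weighted random sampling of voice labels. The
# inverse-presentation-count weighting stands in for equation (1), which
# is not given here; any other weighting could be dropped in.
import random

def select_labels(candidates, n):
    """candidates: list of (voice_label, n_presentations) tuples."""
    pool = list(candidates)
    chosen = []
    for _ in range(min(n, len(pool))):
        weights = [1.0 / (1 + n_pres) for _, n_pres in pool]
        idx = random.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx)[0])  # sample without replacement
    return chosen

print(select_labels([("label_A", 10), ("label_B", 2), ("label_C", 0)], 2))
```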
  • In step S12, the transmission unit 235 transmits information indicating the voice labels selected by the voice label selection unit 234 to the generation unit 24.
  • The generation unit 24 acquires the voice synthesis parameters corresponding to the received voice labels from the voice synthesis parameter DB 54. The generation unit 24 then generates call voices based on the voice synthesis parameters and call statement data randomly selected from the call statement DB 55.
  • The call voices can be generated by a speech synthesis process using the voice synthesis parameters. After that, the process proceeds to step S13.
  • In step S13, the presentation unit 25 presents the call voices generated by the generation unit 24 to the user simultaneously from the speakers 7a and 7b.
  • In step S14, the acquisition unit 21 acquires the user's reaction. The acquisition unit 21 then outputs the reaction information to the determination unit 22.
  • In step S15, the determination unit 22 determines whether or not there was a reaction from the user. When it is determined in step S15 that there was no reaction, the process proceeds to step S20. When it is determined in step S15 that there was a reaction, the process proceeds to step S16.
  • In step S16, the determination unit 22 requests the acquisition unit 21 to acquire the new arousal level.
  • In response, the acquisition unit 21 acquires the new arousal level.
  • The new arousal level may be acquired in the same manner as the arousal level.
  • In step S17, the acquisition unit 21 sets the correct answer labels.
  • The acquisition unit 21 sets the correct answer labels, for example, as follows:
    1) When the acquired reaction is that the user turned toward a specific speaker: the voice label of the voice presented from that speaker is 〇; the other voice labels are ×.
    2) When the acquired reaction is that the user faced between multiple speakers: the angle between the user's facing direction and the direction of each speaker is obtained, and the voice label of the voice presented from the speaker with the smaller angle is 〇; the other voice labels are ×.
    3) When the acquired reaction is that the user turned toward one speaker and then toward another: the voice label of the voice presented from the speaker faced first is 〇; the other voice labels are ×.
    4) When no reaction is obtained: the labels of all voices are ×.
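The sketch below illustrates rules 1, 2, and 4 under assumed names, reducing the acquired reaction to the angle between the user's facing direction and each speaker; rule 3 would additionally require the time order in which the directions were observed.

```python
# Hypothetical sketch of correct-answer labeling from the reaction
# direction. Rule 1 is the near-zero-angle case of rule 2; rule 4 is the
# no-reaction case. 'o' stands for the correct mark and 'x' for incorrect.

def set_correct_labels(speaker_angles, reacted):
    """speaker_angles: {voice_label: angle in degrees between the user's
    facing direction and the speaker that presented that voice}."""
    if not reacted:  # rule 4: no reaction, all labels incorrect
        return {label: "x" for label in speaker_angles}
    # Rules 1 and 2: the voice from the most directly faced speaker
    # (smallest angle) is correct; all others are incorrect.
    best = min(speaker_angles, key=speaker_angles.get)
    return {label: ("o" if label == best else "x")
            for label in speaker_angles}

print(set_correct_labels({"label_A": 12.0, "label_B": 95.0}, reacted=True))
# {'label_A': 'o', 'label_B': 'x'}
```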
  • In step S18, the acquisition unit 21 registers the concentration level, the reaction presence/absence information, the arousal level, the new arousal level, the arousal level change amount, and the correct answer label in the user log DB 52, in association with the log generation date and time, the voice label, the familiar target, and the familiarity. After that, the process proceeds to step S19.
  • In step S19, the learning unit 26 refers to the user log DB 52 and acquires the number of logs with a reaction. The learning unit 26 then determines whether or not this number is less than a threshold value.
  • This threshold value is for determining whether or not the information necessary for learning has been accumulated.
  • The threshold is set to, for example, 2; in this case, when the number of reactions is 0 or 1, it is determined to be less than the threshold.
  • When the number is less than the threshold, the processes of FIGS. 7A and 7B end.
  • Otherwise, the process proceeds to step S20.
  • In step S20, the learning unit 26 carries out binary classification learning. The learning unit 26 then records the result of the binary classification learning in the model DB 53. After that, the processes of FIGS. 7A and 7B end.
  • In step S20, the learning unit 26 acquires, for example, the correct answer labels recorded in the user log DB 52 together with the familiarity and concentration values associated with them. The learning unit 26 then generates a binary classification model of the voice labels in the three-dimensional space of "familiarity," "concentration level," and "arousal level change amount."
  • FIG. 8 is a diagram showing an image of a binary classification model using "familiarity," "concentration level," and "arousal level change amount."
  • Voice labels whose familiarity places them in the space above the classification plane P are classified as correct answers (〇).
  • Voice labels whose familiarity places them in the space below the classification plane P are classified as incorrect answers (×).
  • Various binary classification learning methods, using for example logistic regression, an SVM (Support Vector Machine), or a neural network, can be used to generate the model.
  • The arousal level change amount characterizes the user's reaction beyond the mere presence or absence of a reaction reflected in the correct answer label. "Arousal level change amount" is therefore adopted as one axis of learning, since it is expected to further improve the accuracy of the correct-answer-label determination.
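As a concrete illustration of such binary classification learning, the sketch below fits a scikit-learn logistic regression over the three axes named above. The training data are fabricated placeholders, and the patent specifies neither a library nor hyperparameters.

```python
# Hypothetical sketch of the binary classification learning in step S20
# using logistic regression over familiarity, concentration level, and
# arousal level change amount. Data values are fabricated placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: familiarity, concentration level, arousal level change amount
X = np.array([[0.9, 2.0, 0.30],
              [0.8, 1.0, 0.20],
              [0.3, 3.0, 0.00],
              [0.2, 2.0, -0.05]])
y = np.array([1, 1, 0, 0])  # 1 = correct (o), 0 = incorrect (x)

model = LogisticRegression().fit(X, y)
# The decision boundary w.x + b = 0 plays the role of the classification
# plane P shown in FIG. 8.
print(model.coef_, model.intercept_)
print(model.predict([[0.7, 2.0, 0.25]]))  # expected: [1]
```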
  • According to the embodiment, when it is determined that the user is not awake, a call is made to the user using a voice familiar to the user. Therefore, even when the user is drowsy, the cocktail party effect can make the call voice easier for the user to hear, and the arousal level is expected to improve in a short time. Further, in the embodiment, the familiarity and the concentration level are used in selecting the familiar voice, so the user can be presented with a call voice to which the user is more likely to respond.
  • According to the embodiment, voice labels are classified using a learning model with the three axes of familiarity, concentration level, and arousal level change amount. Therefore, as learning progresses, voice label candidates more suitable for the user are expected to be extracted. Further, the voice label used to generate the voice is selected from the extracted candidates by random sampling weighted by the number of past presentations. This suppresses the habituation and boredom that would result from frequently presenting call voices with the same voice label. As a result, even when the voice generation device 1 is used for a long period, the user's reaction to the call voice can still be expected, and consequently an increase in the user's arousal level can be expected.
  • In the embodiment, call voices are presented simultaneously from a plurality of speakers arranged in the environment, and the user's reaction to each call voice is acquired. Correct answer labels are then set according to this reaction. As a result, teacher data can be collected efficiently.
  • In the embodiment, the binary classification model employs the three axes of "familiarity," "concentration level," and "arousal level change amount."
  • More simply, a binary classification model using, for example, "familiarity" alone, or "familiarity" and "concentration level," may be used.
  • In the embodiment, the learning device is used as a learning device for a voice-label classification model for call voices that encourage the user's awakening.
  • The learning device of the embodiment can, however, be used to learn various models for selecting a voice that is easy for the user to recognize.
  • Each process according to the above-described embodiment can be stored as a program executable by a processor, that is, by a computer.
  • The program can be stored and distributed in a storage medium of an external storage device, such as a magnetic disk, an optical disk, or a semiconductor memory.
  • The processor reads the program stored in the storage medium of the external storage device, and its operation is controlled by the read program, whereby the above-described processes are executed.
  • The present invention is not limited to the above embodiment and can be variously modified at the implementation stage without departing from its gist.
  • The embodiments may also be carried out in appropriate combinations, in which case combined effects can be obtained.
  • The above-described embodiment includes various inventions, and various inventions can be extracted by combinations selected from the plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, as long as the problem can be solved and the effect obtained, the configuration from which those constituent elements are deleted can be extracted as an invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Traffic Control Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

This learning device comprises a learning unit (26) that acquires training data for a learning model for selecting a voice to be presented to a user from among a plurality of voice candidates on the basis of the user's reaction to a plurality of voices simultaneously presented to the user.

Description

Learning device, learning method, and learning program
 This embodiment relates to a learning device, a learning method, and a learning program for voice selection.
 Various methods have been proposed for selecting the voice to be presented to a user from among a plurality of voice candidates. Speech classification models may be used for such selection. In some classification models of this type, learning is performed by providing information on the correctness of the classified speech as teacher data. Appropriate evaluation of the speech is required to generate the teacher data. As a proposal for voice evaluation, for example, the method described in Non-Patent Document 1 is known.
 The embodiment provides a learning device, a learning method, and a learning program that can efficiently collect teacher data for speech classification.
 The learning device according to the embodiment includes a learning unit that acquires teacher data for a learning model, which selects the voice to be presented to the user from among a plurality of voice candidates, based on the user's reaction to a plurality of voices presented to the user simultaneously.
 According to the embodiment, a learning device, a learning method, and a learning program capable of efficiently collecting teacher data for speech classification are provided.
FIG. 1 is a diagram showing the hardware configuration of an example of a voice generation device according to the embodiment. FIG. 2A is a diagram showing an example of speaker arrangement. FIG. 2B is a diagram showing an example of speaker arrangement. FIG. 2C is a diagram showing an example of speaker arrangement. FIG. 2D is a diagram showing an example of speaker arrangement. FIG. 3 is a diagram showing the configuration of an example of the familiarity DB. FIG. 4 is a diagram showing the configuration of an example of the user log DB. FIG. 5 is a diagram showing the configuration of an example of the call statement DB. FIG. 6 is a functional block diagram of the voice generation device. FIG. 7A is a flowchart showing voice presentation processing by the voice generation device. FIG. 7B is a flowchart showing voice presentation processing by the voice generation device. FIG. 8 is a diagram showing an image of a binary classification model using "familiarity," "concentration level," and "arousal level change amount."
 Hereinafter, embodiments will be described with reference to the drawings. FIG. 1 is a diagram showing the hardware configuration of an example of a voice generation device including the learning device according to the embodiment. The voice generation device 1 according to the embodiment emits a call voice urging the user's awakening when the user is not in an awake state, for example because of drowsiness.
 In the embodiment, whether or not the user is awake is determined based on the "arousal level." The arousal level in the embodiment is an index indicating the degree of arousal, and corresponds to the physiological arousal level, which reflects the activity level of the cerebrum and represents the degree of arousal from sleep to excitement. The physiological arousal level is assessed from eye movement, blinking activity, electrodermal activity, reaction time to stimuli, and the like. The arousal level in the embodiment is calculated from any one of these measures or from a combination thereof. The arousal level is, for example, a value that increases as the user moves from a sleep state toward an excited state; it may be a continuous numerical value or a discrete value such as Level 1, Level 2, and so on. When the arousal level is calculated from a combination of eye movement, blinking activity, electrodermal activity, and reaction time to stimuli, the manner of combination is not particularly limited; for example, a simple sum or a weighted sum of these values can be used.
 The voice generation device 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, speakers 7a and 7b, a camera 8, an input device 9, a display 10, and a communication module 11. The voice generation device 1 is, for example, one of various terminals such as a personal computer (PC), a smartphone, or a tablet terminal; it is not limited to these and can be mounted on various devices used by the user. The voice generation device 1 does not have to have all the components shown in FIG. 1. For example, the microphone 6, the speakers 7a and 7b, the camera 8, and the display 10 may be devices separate from the voice generation device 1.
 The processor 2 is a control circuit, such as a CPU, that controls the overall operation of the voice generation device 1. The processor 2 does not have to be a CPU and may be an ASIC, an FPGA, a GPU, or the like. The processor 2 does not have to consist of a single CPU or the like and may consist of a plurality of CPUs or the like.
 The ROM 3 is a non-volatile memory such as a flash memory, and stores, for example, the boot program of the voice generation device 1. The RAM 4 is a volatile memory such as an SDRAM and can be used as working memory for various processes in the voice generation device 1.
 The storage 5 is a storage device such as a flash memory, a hard disk drive (HDD), or a solid state drive (SSD). The storage 5 stores various programs used in the voice generation device 1. The storage 5 may also store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a voice synthesis parameter database (DB) 54, and a call statement database (DB) 55. These databases will be described in detail later.
 The microphone 6 is a device that converts input voice into a voice signal, which is an electric signal. The voice signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5. For example, the voice synthesis parameters for synthesizing the call voice can be acquired from voice input via the microphone 6.
 The speakers 7a and 7b are devices that output voice based on an input voice signal. Here, it is desirable that the speakers 7a and 7b are not in close proximity to each other, that they are arranged in different directions as seen from the user, and that they are equidistant from the user.
 FIGS. 2A and 2B are diagrams showing arrangement examples of the speakers 7a and 7b. In FIG. 2A, the speakers 7a and 7b are both arranged in front of the user U, equidistant from the user. In FIG. 2B, the speakers 7a and 7b are arranged in front of and behind the user U, respectively, equidistant from the user.
 Speakers are arranged in the user's environment in the same number as the number of presented voices; FIG. 1 shows an example in which the number of presented voices is two. The number of presented voices may instead be three or more, in which case three or more speakers are arranged. Even then, it is desirable that the speakers are not close to each other, that they are placed in different directions as seen from the user, and that each speaker is equidistant from the user. For example, FIGS. 2C and 2D show arrangement examples with three speakers 7a, 7b, and 7c. In FIG. 2C, the speakers 7a, 7b, and 7c are arranged in front of the user U; in FIG. 2D, they are arranged behind the user U.
 The camera 8 captures images of the user. The user's image obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5, and is used, for example, to acquire the arousal level or the user's reaction to the call voice.
 The input device 9 is a mechanical input device such as a button, a switch, a keyboard, or a mouse, or a software-based input device using a touch sensor. The input device 9 receives various inputs from the user and outputs signals corresponding to the user's inputs to the processor 2.
 The display 10 is a display such as a liquid crystal display or an organic EL display, and displays various images.
 The communication module 11 is a device with which the voice generation device 1 carries out communication, for example with a server provided outside the voice generation device 1. The communication method of the communication module 11 is not particularly limited; communication may be carried out wirelessly or by wire.
 Next, the familiarity database (DB) 51, the user log database (DB) 52, the model database (DB) 53, the voice synthesis parameter database (DB) 54, and the call statement database (DB) 55 will be described.
 FIG. 3 is a diagram showing the configuration of an example of the familiarity DB 51. The familiarity DB 51 is a database recording the user's "familiarity." It records, for example, a user ID, a voice label, a familiar target, a familiarity, a number of reactions, a number of presentations, and an average arousal level change, in association with one another.
 The "user ID" is an ID assigned to each user of the voice generation device 1. The user ID may be associated with user attribute information such as the user's name.
 The "voice label" is a label uniquely assigned to each call-voice candidate. Any label can be used as the voice label; for example, the name of the familiar target may be used.
 The "familiar target" is a person with whom the user routinely converses, or another source of a voice that the user often hears. The familiar target does not necessarily have to be a person.
 The "familiarity" is the degree of the user's familiarity with the voice of the corresponding familiar target. The familiarity can be calculated from, for example, the frequency of communication with the familiar target via SNS, the frequency of daily conversation with the familiar target, and the frequency with which the user hears the familiar target in daily life; the higher these frequencies, the larger the familiarity value. The familiarity may also be acquired by the user's self-report.
 The "number of reactions" is the number of times the user reacted to call voices generated based on the corresponding voice label. The "number of presentations" is the number of times call voices generated based on the corresponding voice label were presented to the user. Dividing the number of reactions by the number of presentations yields the reaction probability, that is, the probability that the user reacts to a call voice generated based on the corresponding voice label.
 The "average arousal level change" is the average of the user's arousal level change amounts for call voices generated based on the corresponding voice label. The arousal level change amount will be described later.
 FIG. 4 is a diagram showing the configuration of an example of the user log DB 52. The user log DB 52 is a database recording logs of the user's use of the voice generation device 1. It records, for example, a log generation date and time, a user ID, a voice label, a familiar target, a concentration level, reaction presence/absence, an arousal level, a new arousal level, an arousal level change amount, and a correct answer label, in association with one another. The user ID, the voice label, and the familiar target are the same as those in the familiarity DB 51.
 The "log generation date and time" is the date and time at which the user used the voice generation device 1. It is recorded, for example, each time a call voice is presented to the user.
 The "reaction presence/absence" is information on whether the user reacted after the call voice was presented. When the user reacted, "yes" is recorded; when the user did not react, "no" is recorded.
 The "concentration level" is the degree of the user's concentration at the time the call voice is presented. The concentration level can be measured, for example, by estimating the posture and behavior of the user during work from images obtained by the camera 8; its value is calculated to increase whenever the user's posture or behavior suggests concentration and to decrease whenever it suggests a lack of concentration. Alternatively, the degree of dilation of the user's pupils during work can be estimated from images obtained by the camera 8; the concentration value is then calculated to be higher when the pupils are more dilated and lower when they are more constricted. The concentration level may be a discrete value such as Lv (Level) 1, Lv 2, and so on. The method of acquiring the concentration level is not limited to any specific method.
 The "arousal level" is the arousal level acquired before the voice generation device 1 presents the call voice.
 The "new arousal level" is the arousal level newly acquired after the user's reaction. The new arousal level is not recorded when there is no user reaction.
 The "arousal level change amount" is a quantity representing the change in the arousal level before and after the user's reaction. It is obtained, for example, as the difference between the new arousal level and the original arousal level, or it may instead be, for example, their ratio. It is not recorded when there is no user reaction.
 The "correct answer label" is a correct/incorrect label for supervised learning. For example, a correct answer is recorded as 〇 and an incorrect answer as ×. The correct answer label will be described in detail later.
 The model DB 53 is a database that records voice-label classification models used to extract voice-label candidates. In the embodiment, a model is configured to classify voice labels as correct or incorrect in the three-dimensional space of familiarity, concentration, and arousal-level change. The models include an initial model and a learning model. The initial model is generated from initial values stored in the model DB 53 and is not updated by learning. Here, the initial values are the constants (the coefficients of a plane equation) that determine the classification plane for voice-label classification defined in the three-dimensional space of "familiarity," "concentration," and "arousal-level change." The classification plane generated from these initial values constitutes the initial model. In the initial model, voice labels whose familiarity places them above the classification plane are classified as correct (〇), and the other voice labels are classified as incorrect (×). The learning model is a trained model generated from the initial model; it can be a binary classification model whose classification plane differs from that of the initial model.
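 As a rough illustration of the initial model, the sketch below classifies a voice label by which side of a fixed plane its (familiarity, concentration, arousal-level change) point lies on. The coefficient values are placeholders and not values defined by the embodiment.

    # Hypothetical initial model: a plane a*x + b*y + c*z + d = 0 in the
    # (familiarity, concentration, arousal-level change) space. Points on
    # the positive side are classified as correct. Coefficients below are
    # placeholders, illustrative only.
    INITIAL_COEFFS = (1.0, 0.5, 0.8, -2.0)  # (a, b, c, d)

    def classify_initial(familiarity: float, concentration: float,
                         arousal_change: float) -> bool:
        a, b, c, d = INITIAL_COEFFS
        # True corresponds to correct (〇), False to incorrect (×).
        return a * familiarity + b * concentration + c * arousal_change + d > 0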
 The voice synthesis parameter DB 54 is a database that records voice synthesis parameters. A voice synthesis parameter is data used to synthesize the voice of one of the user's familiar targets. For example, a voice synthesis parameter may be feature data extracted from voice data previously captured through the microphone 6. Alternatively, voice synthesis parameters acquired or defined by another system may be recorded in advance. Each voice synthesis parameter is associated with a voice label.
 FIG. 5 is a diagram showing an example configuration of the call-sentence DB 55. The call-sentence DB 55 is a database that records template data of various call sentences for prompting the user's awakening. The call sentences are not particularly limited; however, it is desirable that a call sentence include a call using the user's name, in order to enhance the cocktail-party effect described later.
 Here, the familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call-sentence DB 55 need not necessarily be stored in the storage 5. For example, they may be stored in a server separate from the voice generation device 1. In that case, the voice generation device 1 accesses the server via the communication module 11 and acquires the necessary information.
 FIG. 6 is a functional block diagram of the voice generation device 1. As shown in FIG. 6, the voice generation device 1 has an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26. The operations of these units are realized, for example, by the processor 2 executing a program stored in the storage 5. The determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 may instead be realized by hardware separate from the processor 2.
 The acquisition unit 21 acquires the user's arousal level and the user's reaction to the call voice. As described above, the arousal level is calculated from eye movement, blinking activity, electrodermal activity, reaction time to a stimulus, or a combination thereof. The eye movement, blinking activity, and reaction time used to calculate the arousal level can be measured, for example, from images of the user captured by the camera 8; the reaction time to a stimulus may also be measured from the audio signal captured by the microphone 6. Electrodermal activity can be measured, for example, by a sensor worn on the user's arm. The user's reaction can be acquired by measuring, for example from images captured by the camera 8, whether the user reacted physically and the direction of the reaction, such as the user's head or gaze turning toward the speaker 7a or 7b. The acquisition unit 21 may also be configured to acquire, by communication, an arousal level or a user reaction calculated outside the voice generation device 1.
 The determination unit 22 determines, based on the arousal level acquired by the acquisition unit 21, whether the user is in an awake state. When it determines that the user is not in an awake state, the determination unit 22 transmits a voice-label selection request to the receiving unit 231 of the selection unit 23. Here, the determination unit 22 makes the determination by comparing the arousal level with a predetermined threshold. The threshold is an arousal-level threshold for determining whether the user is awake and is stored, for example, in the storage 5. The determination unit 22 also determines, based on the reaction information acquired by the acquisition unit 21, whether the user reacted.
 When it is determined that the user is not in an awake state, the selection unit 23 selects the voice labels of candidate voices for prompting the user's awakening. The selection unit 23 has a receiving unit 231, a model selection unit 232, a voice-label candidate extraction unit 233, a voice-label selection unit 234, and a transmission unit 235.
 The receiving unit 231 receives the voice-label selection request from the determination unit 22.
 The model selection unit 232 selects, from the model DB 53, the model used for voice-label selection. The model selection unit 232 selects either the initial model or the learning model based on the degree of fit. The degree of fit is a value for determining which of the initial model and the learning model has the higher accuracy; it is described in detail later.
 The voice-label candidate extraction unit 233 extracts, from the familiarity DB 51, voice labels that are candidates for the call voices to be presented to the user, based on the model selected by the model selection unit 232 and the user's concentration level.
 The voice-label selection unit 234 selects, from the voice labels extracted by the voice-label candidate extraction unit 233, the voice labels used to generate the call voices presented to the user.
 The transmission unit 235 transmits information on the voice labels selected by the voice-label selection unit 234 to the generation unit 24.
 The generation unit 24 generates call voices for prompting the user's awakening, based on the voice labels received from the transmission unit 235. The generation unit 24 acquires, from the voice synthesis parameter DB 54, the voice synthesis parameters corresponding to the received voice labels, and generates the call voices based on the call-sentence data recorded in the call-sentence DB 55 and the voice synthesis parameters.
 The presentation unit 25 presents the call voices generated by the generation unit 24 to the user, for example by reproducing them through the speakers 7.
 The learning unit 26 trains the models recorded in the model DB 53, for example by binary classification learning using the correct-answer labels.
 Next, the operation of the voice generation device 1 is described. FIGS. 7A and 7B are flowcharts showing the voice presentation process performed by the voice generation device 1. The process of FIGS. 7A and 7B may be performed periodically.
 In step S1, the acquisition unit 21 acquires the user's arousal level and outputs it to the determination unit 22. The acquisition unit 21 also retains the acquired arousal level until the user's reaction to the presented call voice is acquired.
 In step S2, the determination unit 22 determines whether the arousal level acquired by the acquisition unit 21 is equal to or below the threshold. When the arousal level is determined to exceed the threshold, that is, when the user is awake, the process of FIGS. 7A and 7B ends. When the arousal level is determined to be equal to or below the threshold, that is, when the user is not awake, for example because of drowsiness, the process proceeds to step S3.
 In step S3, the determination unit 22 transmits a voice-label selection request to the selection unit 23. When the receiving unit 231 receives the request, the model selection unit 232 refers to the user log DB 52 and acquires the reaction count, that is, the total number of "present" entries in "reaction presence/absence."
 In step S4, the model selection unit 232 determines whether the reaction count is below a threshold. This threshold is for determining whether a usable learning model is recorded in the model DB 53 and is set, for example, to 2; in that case, a reaction count of 0 or 1 is determined to be below the threshold. When the reaction count is determined to be below the threshold, the process proceeds to step S5. When the reaction count is determined to be equal to or above the threshold, the process proceeds to step S6.
 In step S5, the model selection unit 232 selects the initial values, that is, the initial model, from the model DB 53 and outputs the selected initial model to the voice-label candidate extraction unit 233. The process then proceeds to step S9.
 In step S6, the model selection unit 232 calculates the degree of fit. To do so, the model selection unit 232 first acquires all past logs, with and without reactions, from the user log DB 52, and then calculates the degree of fit of both the initial model and the learning model. For example, the model selection unit 232 can use as the degree of fit the accuracy, that is, the rate of agreement between each model's correct/incorrect output, when given each log's concentration value, and that log's reaction presence/absence. The degree of fit is not limited to the accuracy; it may also be the precision, recall, or F-measure calculated from the model's correct/incorrect outputs and the logged reaction presence/absence. The precision is the proportion of the data predicted to be correct for which the user's reaction was actually "present." The recall is the proportion of the logs with an actual user reaction that were predicted to be correct. The F-measure is the harmonic mean of the recall and the precision; for example, it can be calculated as 2 * Recall * Precision / (Recall + Precision).
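 For illustration, the following sketch computes the four fit-degree metrics named above from a model's correct/incorrect outputs and the logged reaction presence/absence; it is a generic formulation, not code from the embodiment.

    def fit_metrics(predicted, actual):
        # predicted[i]: True if the model classified log i as correct
        # actual[i]:    True if the user actually reacted in log i
        tp = sum(1 for p, a in zip(predicted, actual) if p and a)
        tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
        fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
        fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
        accuracy = (tp + tn) / len(actual) if actual else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f_measure = (2 * recall * precision / (recall + precision)
                     if (recall + precision) else 0.0)
        return {"accuracy": accuracy, "precision": precision,
                "recall": recall, "f_measure": f_measure}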
 In step S7, the model selection unit 232 compares the degrees of fit of the initial model and the learning model and determines whether the learning model's degree of fit is higher. When the initial model's degree of fit is determined to be higher, the process proceeds to step S5, and the model selection unit 232 selects the initial values, that is, the initial model. When the learning model's degree of fit is determined to be higher, the process proceeds to step S8.
 In step S8, the model selection unit 232 selects the learning model and outputs it to the voice-label candidate extraction unit 233. The process then proceeds to step S9.
 In step S9, the voice-label candidate extraction unit 233 acquires the user's current concentration level from the acquisition unit 21.
 In step S10, the voice-label candidate extraction unit 233 extracts, from the familiarity DB 51, the candidate voice labels used to generate the call voices. The number of extracted candidates is at least a specified number, for example the number of call voices to be presented. The voice-label candidate extraction unit 233 extracts, for example, from the voice labels registered in the familiarity DB 51, all voice labels classified as correct for the current concentration value. A voice label classified as correct is one for which a user reaction to the presented call voice is expected and an increase in arousal level is also expected.
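 Under the same illustrative assumptions as the earlier sketches, candidate extraction can be pictured as the following filter; the row layout is hypothetical.

    def extract_candidates(label_rows, concentration, classify):
        # label_rows: iterable of (voice_label, familiarity, avg_arousal_change)
        # classify:   a binary classifier such as classify_initial above
        return [label for (label, fam, d_arousal) in label_rows
                if classify(fam, concentration, d_arousal)]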
 In step S11, the voice-label selection unit 234 selects, from the voice labels extracted by the voice-label candidate extraction unit 233, a specified number of voice labels, for example the same number as the number of call voices to be presented. In selecting the voice labels, the voice-label selection unit 234 obtains, for example, a weighted winning probability based on the past presentation counts, and then selects the voice labels by random sampling based on the weighted winning probabilities. The weighted winning probability can be calculated, for example, according to Equation (1), and may also be calculated by a different equation.

 [Equation (1): formula image not reproduced]
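 Since Equation (1) is not reproduced, the sketch below simply assumes a weight that decreases with the past presentation count and normalizes the weights into winning probabilities; the actual formula of the embodiment may differ.

    import random

    def weighted_sample(candidates, presentation_counts, k):
        # candidates:          list of candidate voice labels (assumed unique)
        # presentation_counts: dict mapping each label to its past presentation count
        # Assumed weighting: rarely presented labels get higher probability.
        weights = [1.0 / (1 + presentation_counts[c]) for c in candidates]
        chosen, pool, pool_w = [], list(candidates), list(weights)
        for _ in range(min(k, len(pool))):
            pick = random.choices(pool, weights=pool_w, k=1)[0]
            i = pool.index(pick)
            pool.pop(i)
            pool_w.pop(i)
            chosen.append(pick)
        return chosen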
 In step S12, the transmission unit 235 transmits information indicating the voice labels selected by the voice-label selection unit 234 to the generation unit 24. The generation unit 24 acquires, from the voice synthesis parameter DB 54, the voice synthesis parameters corresponding to the received voice labels, and generates the call voices based on call-sentence data selected at random from the call-sentence DB 55 and the voice synthesis parameters. The call voices can be generated by voice synthesis processing using the voice synthesis parameters. The process then proceeds to step S13.
 In step S13, the presentation unit 25 presents the call voices generated by the generation unit 24 to the user simultaneously from the speakers 7a and 7b.
 In step S14, the acquisition unit 21 acquires the user's reaction and outputs the reaction information to the determination unit 22.
 In step S15, the determination unit 22 determines whether the user reacted. When it is determined that the user did not react, the process proceeds to step S20. When it is determined that the user reacted, the process proceeds to step S16.
 In step S16, the determination unit 22 requests the acquisition unit 21 to acquire a new arousal level, and in response the acquisition unit 21 acquires it. The new arousal level may be acquired in the same manner as the arousal level.
 In step S17, the acquisition unit 21 sets the correct-answer labels, for example as follows (a code sketch of these rules follows the list).
 1) When the acquired reaction is that the user turned toward a specific speaker:
  the voice label corresponding to the voice presented from that speaker: 〇
  all other voice labels: ×
 2) When the acquired reaction is that the user turned toward a point between speakers:
  the angle between the user's direction and each speaker's direction is obtained, and the voice label of the voice presented from the speaker with the smaller angle: 〇
  all other voice labels: ×
 3) When the acquired reaction is that the user turned toward one speaker and then toward another speaker:
  the voice label of the voice presented from the speaker the user turned toward first: 〇
  all other voice labels: ×
 4) When no reaction could be acquired:
  all voice labels: ×
 In step S18, the acquisition unit 21 registers the concentration level, the reaction presence/absence information, the arousal level, the new arousal level, the arousal-level change amount, and the correct-answer labels in the user log DB 52 in association with the log generation date and time, the voice labels, the familiar targets, and the familiarity. The process then proceeds to step S19.
 In step S19, the learning unit 26 refers to the user log DB 52, acquires the reaction count, and determines whether the reaction count is below a threshold. This threshold is for determining whether enough information for learning has been accumulated and is set, for example, to 2; in that case, a reaction count of 0 or 1 is determined to be below the threshold. When the reaction count is determined to be below the threshold, the process of FIGS. 7A and 7B ends. When the reaction count is determined to be equal to or above the threshold, the process proceeds to step S20.
 In step S20, the learning unit 26 performs binary classification learning and records the learning result in the model DB 53, after which the process of FIGS. 7A and 7B ends. In step S20, the learning unit 26 acquires, for example, the correct-answer labels recorded in the user log DB 52 together with the familiarity and concentration associated with them, and generates a binary classification model of the voice labels in the three-dimensional space of "familiarity," "concentration," and "arousal-level change." FIG. 8 illustrates such a binary classification model using "familiarity," "concentration," and "arousal-level change." In the example of FIG. 8, voice labels whose familiarity places them in the space above the classification plane P are classified as correct (〇), while voice labels in the space below the classification plane P are classified as incorrect (×). Various binary classification learning techniques, such as logistic regression, SVM (Support Vector Machine), and neural networks, can be used to generate the model.
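 As one concrete possibility, the binary classification learning of step S20 could be realized with logistic regression as sketched below. The use of scikit-learn is an assumption of this sketch; the embodiment only names logistic regression, SVM, and neural networks as examples.

    from sklearn.linear_model import LogisticRegression

    def train_learning_model(log_rows):
        # log_rows: iterable of (familiarity, concentration, arousal_change,
        #           correct) where correct is True for 〇 and False for ×.
        X = [[fam, conc, d_ar] for (fam, conc, d_ar, _) in log_rows]
        y = [int(correct) for (_, _, _, correct) in log_rows]
        model = LogisticRegression()
        model.fit(X, y)  # the fitted coefficients define the classification plane P
        return model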
 The reason why the three axes of "familiarity," "concentration," and "arousal-level change" are adopted for the binary classification model in the embodiment is now explained. A person has the property that selective attention works on familiar voices, such as the conversation of a person of interest or the person's own name. This is called the cocktail-party effect. Furthermore, in Yumiko Honjo, "Physiological-psychological study on attention and arousal," doctoral dissertation, Kwansei Gakuin University, Otsu No. 217, pp. 187-188, a model of attention and arousal incorporating both selective attention and arousal is derived. From this, the occurrence of selective attention and the arousal level are considered to be related. Because "familiarity" is thus considered to affect both how readily the cocktail-party effect occurs and the change in arousal level the effect produces, it is adopted as one axis of learning.
 As for "concentration," it is reported in "The brain directs attention and heightens concentration through 'efficient selection'," RIKEN news release, December 8, 2011, [Online] [retrieved June 10, 2020], Internet URL: https://www.riken.jp/press/2011/20111208/, that in a concentrated state the information transmitted from sensation to perception is limited. That is, a sound perceived while concentration is high is presumed to be a sound the user needs more or that more readily reaches the user's ears. Because "concentration" can thus be considered to affect how readily the user's selective attention arises, that is, which sounds the user tends to react to, it is adopted as one axis of learning.
 The arousal-level change amount characterizes the user's reaction in addition to the correct-answer label, that is, whether the user reacts at all. "Arousal-level change" is therefore adopted as one axis of learning, as it is expected to further improve the accuracy of correct-answer label determination.
 As described above, according to the embodiment, when the user is determined not to be in an awake state, the user is called with voices familiar to the user. Therefore, even when the user is drowsy, the cocktail-party effect lets the call voice reach the user, and an improvement in arousal level within a short time can be expected. Furthermore, in the embodiment, familiarity and concentration are used in selecting the familiar voices, so call voices that the user reacts to more readily can be presented to the user.
 According to the embodiment, voice labels are also classified using a learning model with the three axes of familiarity, concentration, and arousal-level change. As learning progresses, voice-label candidates better suited to the user are therefore expected to be extracted. Furthermore, according to the embodiment, the voice labels used to generate the voices are selected from the extracted candidates by random sampling based on the past presentation counts. This suppresses the user's habituation and boredom that would result from frequent presentation of call voices with the same voice label. Consequently, even when the voice generation device 1 is used over a long period, the user's reaction to the call voices remains likely, and a rise in the user's arousal level can be expected as a result.
 Furthermore, according to the embodiment, call voices are presented simultaneously from a plurality of speakers arranged in the environment, the user's reaction to each call voice is acquired, and the correct-answer labels are set according to that reaction. This makes it possible to obtain teacher data efficiently.
 [Modification example]
 A modification of the embodiment is described. In the embodiment, an example was shown in which the selection of voice labels based on familiarity, concentration, and arousal-level change, the generation of the call voices, and the training of the learning model are all performed within the voice generation device 1. However, the voice-label selection, the call-voice generation, and the training of the learning model may be performed in separate devices.
 In the embodiment, the binary classification model uses the three axes of "familiarity," "concentration," and "arousal-level change." More simply, a binary classification model using, for example, only "familiarity," or only "familiarity" and "concentration," may instead be used.
 In the embodiment, the learning device is used as a device for training a voice-label classification model for call voices that prompt the user's awakening. The learning device of the embodiment can, however, also be used to train various models for selecting voices that the user readily perceives.
 Each process according to the above embodiment can also be stored as a program executable by a processor, which is a computer, and can be stored and distributed in a storage medium of an external storage device such as a magnetic disk, an optical disc, or a semiconductor memory. The processor reads the program stored in the storage medium of the external storage device, and its operation is controlled by the read program, whereby the above processes can be executed.
 The present invention is not limited to the above embodiment and can be variously modified at the implementation stage without departing from its gist. The embodiments may also be combined as appropriate, in which case the combined effects are obtained. Furthermore, the above embodiment includes various inventions, and various inventions can be extracted by combinations selected from the disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, a configuration from which those constituent elements are deleted can be extracted as an invention as long as the problem can be solved and the effects can be obtained.
 1 ... voice generation device
 2 ... processor
 3 ... ROM
 4 ... RAM
 5 ... storage
 6 ... microphone
 7a, 7b ... speaker
 8 ... camera
 9 ... input device
 10 ... display
 11 ... communication module
 21 ... acquisition unit
 22 ... determination unit
 23 ... selection unit
 24 ... generation unit
 25 ... presentation unit
 26 ... learning unit
 51 ... familiarity database (DB)
 52 ... user log database (DB)
 53 ... model database (DB)
 54 ... voice synthesis parameter database (DB)
 55 ... call-sentence database (DB)
 231 ... receiving unit
 232 ... model selection unit
 233 ... voice-label candidate extraction unit
 234 ... voice-label selection unit
 235 ... transmission unit

Claims (5)

  1.  A learning device comprising a learning unit that acquires teacher data for a learning model for selecting a voice to be presented to a user from a plurality of voice candidates, based on the user's reactions to a plurality of voices presented to the user simultaneously.
  2.  The learning device according to claim 1, wherein the plurality of voices are voices presented from each of a plurality of speakers that are arranged equidistant from the user and in different directions, and that emit voices toward the user from the different directions.
  3.  The learning device according to claim 1 or 2, wherein the learning model is a classification model that classifies, in a three-dimensional space consisting of a familiarity representing the degree to which the user is familiar with each of the plurality of voice candidates, a concentration representing the user's current degree of concentration, and a change amount of an arousal level representing the user's degree of arousal, from sleep to excitement, caused by the presentation of the voice, the plurality of voice candidates into a first voice candidate for which the user's reaction to the presented voice is expected and an increase in the user's arousal level is expected, and a second voice candidate for which the user's reaction to the presented voice is not expected or an increase in the user's arousal level is not expected.
  4.  A learning method comprising acquiring, by a learning unit, teacher data for a learning model for selecting a voice to be presented to a user from a plurality of voice candidates, based on the user's reactions to a plurality of voices presented to the user simultaneously.
  5.  A learning program for causing a processor to function as the learning unit of the learning device according to any one of claims 1 to 3.
PCT/JP2020/024823 2020-06-24 2020-06-24 Learning device, learning method, and learning program WO2021260848A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/024823 WO2021260848A1 (en) 2020-06-24 2020-06-24 Learning device, learning method, and learning program
JP2022531321A JP7416245B2 (en) 2020-06-24 2020-06-24 Learning devices, learning methods and learning programs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/024823 WO2021260848A1 (en) 2020-06-24 2020-06-24 Learning device, learning method, and learning program

Publications (1)

Publication Number Publication Date
WO2021260848A1 true WO2021260848A1 (en) 2021-12-30

Family

ID=79282108

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/024823 WO2021260848A1 (en) 2020-06-24 2020-06-24 Learning device, learning method, and learning program

Country Status (2)

Country Link
JP (1) JP7416245B2 (en)
WO (1) WO2021260848A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001304898A (en) * 2000-04-25 2001-10-31 Sony Corp On-vehicle equipment
JP2007271296A (en) * 2006-03-30 2007-10-18 Yamaha Corp Alarm device, and program
JP2013101248A (en) * 2011-11-09 2013-05-23 Sony Corp Voice control device, voice control method, and program
JP2016191791A (en) * 2015-03-31 2016-11-10 ソニー株式会社 Information processing device, information processing method, and program
JP2020024293A (en) * 2018-08-07 2020-02-13 トヨタ自動車株式会社 Voice interaction system
JP2020034835A (en) * 2018-08-31 2020-03-05 国立大学法人京都大学 Voice interactive system, voice interactive method, program, learning model generation device, and learning model generation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10433075B2 (en) 2017-09-12 2019-10-01 Whisper.Ai, Inc. Low latency audio enhancement


Also Published As

Publication number Publication date
JPWO2021260848A1 (en) 2021-12-30
JP7416245B2 (en) 2024-01-17


Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20941543; Country of ref document: EP; Kind code of ref document: A1)

ENP Entry into the national phase (Ref document number: 2022531321; Country of ref document: JP; Kind code of ref document: A)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 20941543; Country of ref document: EP; Kind code of ref document: A1)