WO2021260848A1 - Learning device, learning method, and learning program

Learning device, learning method, and learning program

Info

Publication number
WO2021260848A1
WO2021260848A1 (PCT/JP2020/024823)
Authority
WO
WIPO (PCT)
Prior art keywords
user
voice
learning
degree
model
Prior art date
Application number
PCT/JP2020/024823
Other languages
English (en)
Japanese (ja)
Inventor
妙 佐藤
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to JP2022531321A priority Critical patent/JP7416245B2/ja
Priority to PCT/JP2020/024823 priority patent/WO2021260848A1/fr
Publication of WO2021260848A1 publication Critical patent/WO2021260848A1/fr

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers

Definitions

  • This embodiment relates to a learning device, a learning method, and a learning program for voice selection.
  • Speech classification models may be used for such selection.
  • Learning is performed by giving information on the correctness of the classified speech as teacher data.
  • Appropriate evaluation of the speech is required to generate such teacher data.
  • As an evaluation method, the method described in Non-Patent Document 1 is known.
  • The embodiment provides a learning device, a learning method, and a learning program that can efficiently collect teacher data for speech classification.
  • The learning device includes a learning unit that acquires teacher data for a learning model, which selects the voice to be presented to the user from a plurality of voice candidates, based on the user's reactions to a plurality of voices presented to the user at the same time.
  • According to the embodiment, a learning device, a learning method, and a learning program capable of efficiently collecting teacher data for speech classification are provided.
  • FIG. 1 is a diagram showing a hardware configuration of an example of a voice generator according to an embodiment.
  • FIG. 2A is a diagram showing an example of speaker arrangement.
  • FIG. 2B is a diagram showing an example of speaker arrangement.
  • FIG. 2C is a diagram showing an example of speaker arrangement.
  • FIG. 2D is a diagram showing an example of speaker arrangement.
  • FIG. 3 is a diagram showing the configuration of an example of the familiarity DB.
  • FIG. 4 is a diagram showing an example configuration of a user log DB.
  • FIG. 5 is a diagram showing the structure of an example of the call statement DB.
  • FIG. 6 is a functional block diagram of the voice generator.
  • FIG. 7A is a flowchart showing a voice presentation process by the voice generator.
  • FIG. 7B is a flowchart showing a voice presentation process by the voice generator.
  • FIG. 8 is a diagram showing an image of a binary classification model using "familiarity", "concentration degree", and "arousal degree change amount".
  • FIG. 1 is a diagram showing a hardware configuration of an example of a voice generation device including a learning device according to an embodiment.
  • The voice generation device 1 according to the embodiment emits a call voice urging the user to awaken when the user is not in an awake state, for example when the user is drowsy.
  • The arousal degree in the embodiment is an index indicating the degree of awakening, corresponding to the arousal level.
  • the arousal level corresponds to the activity level of the cerebrum and represents the degree of arousal from sleep to excitement.
  • the arousal level is measured from eye movements, blinking activity, electrical skin activity, reaction time to stimuli, and the like.
  • the degree of arousal in the embodiment is calculated by any one of eye movements, blinking activity, electrical skin activity, reaction time to stimuli, or a combination thereof for measuring these arousal levels.
  • the arousal level is a value that increases from a sleep state to an excitement state, for example.
  • The arousal degree may be a continuous numerical value or a discrete value such as Level 1, Level 2, and so on. Further, when the arousal degree is calculated from a combination of the values of eye movement, blinking activity, skin electrical activity, and reaction time to a stimulus, the combination method is not particularly limited. For example, simple summation of these values, weighted addition, and the like can be used.
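  • As a concrete illustration, the sketch below combines the four measured indicators into a single arousal degree by weighted addition. The weights and the assumption that each indicator is normalized to [0, 1] are illustrative choices, not values given in the embodiment.

```python
# Minimal sketch: arousal degree as a weighted addition of four indicators.
# Assumes each indicator has already been normalized to [0, 1]; weights of
# 1.0 for every indicator would reduce this to simple summation.

def arousal_degree(eye_movement: float,
                   blink_activity: float,
                   skin_activity: float,
                   reaction_time: float,
                   weights: tuple = (0.25, 0.25, 0.25, 0.25)) -> float:
    values = (eye_movement, blink_activity, skin_activity, reaction_time)
    return sum(w * v for w, v in zip(weights, values))
```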
  • The voice generator 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, speakers 7a and 7b, a camera 8, an input device 9, a display 10, and a communication module 11.
  • the voice generation device 1 is various terminals such as a personal computer (PC), a smartphone, and a tablet terminal. Not limited to this, the voice generation device 1 can be mounted on various devices used by the user.
  • The voice generator 1 does not have to have all the components shown in FIG. 1. For example, the microphone 6, the speakers 7a and 7b, the camera 8, and the display 10 may be devices separate from the voice generation device 1.
  • the processor 2 is a control circuit that controls the overall operation of a voice generator 1 such as a CPU.
  • the processor 2 does not have to be a CPU, and may be an ASIC, FPGA, GPU or the like.
  • the processor 2 does not have to be composed of a single CPU or the like, and may be composed of a plurality of CPUs or the like.
  • ROM 3 is a non-volatile memory such as a flash memory.
  • the start program of the voice generator 1 is stored in the ROM 3.
  • RAM 4 is a volatile memory such as SDRAM. The RAM 4 can be used as a working memory for various processes in the voice generator 1.
  • the storage 5 is a storage such as a flash memory, a hard disk drive (HDD), and a solid state drive (SSD).
  • Various programs used in the voice generator 1 are stored in the storage 5.
  • The storage 5 may store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a voice synthesis parameter database (DB) 54, and a call statement database (DB) 55. These databases will be described in detail later.
  • the microphone 6 is a device that converts the input voice into a voice signal which is an electric signal.
  • the audio signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5.
  • the voice synthesis parameter for synthesizing the calling voice can be acquired from the voice input via the microphone 6.
  • Speakers 7a and 7b are devices that output voice based on the input voice signal.
  • Desirably, the speaker 7a and the speaker 7b are not in close proximity to each other.
  • Desirably, the speaker 7a and the speaker 7b are arranged in different directions with the user at the center.
  • Desirably, the distance between the speaker 7a and the user and the distance between the speaker 7b and the user are equal.
  • FIGS. 2A and 2B are diagrams showing arrangement examples of the speakers 7a and 7b.
  • In FIG. 2A, the speakers 7a and 7b are arranged in front of the user U, each equidistant from the user.
  • In FIG. 2B, the speakers 7a and 7b are arranged in front of and behind the user U, each equidistant from the user.
  • FIG. 1 shows an example in which the number of presented voices is two.
  • the number of presented voices may be three or more.
  • In that case, three or more speakers are arranged. Even with three or more speakers, it is desirable that the speakers are not close to each other, that each speaker is arranged in a different direction with the user at the center, and that each speaker is equidistant from the user.
  • An arrangement example with three speakers 7a, 7b, and 7c is shown in FIGS. 2C and 2D.
  • In FIG. 2C, the speakers 7a, 7b, and 7c are arranged in front of the user U.
  • In FIG. 2D, the speakers 7a, 7b, and 7c are arranged behind the user U.
  • the camera 8 captures the user and acquires the image of the user.
  • the user's image obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5.
  • the user's image is used, for example, to acquire the degree of arousal or to acquire the user's reaction to the calling voice.
  • The input device 9 is a mechanical input device such as a button, switch, keyboard, or mouse, or a software input device using a touch sensor.
  • the input device 9 receives various inputs from the user. Then, the input device 9 outputs a signal corresponding to the user's input to the processor 2.
  • the display 10 is a display such as a liquid crystal display or an organic EL display.
  • the display 10 displays various images.
  • the communication module 11 is a device for the voice generation device 1 to carry out communication.
  • the communication module 11 communicates with, for example, a server provided outside the voice generator 1.
  • the communication method by the communication module 11 is not particularly limited.
  • the communication module 11 may carry out communication wirelessly or may carry out communication by wire.
  • FIG. 3 is a diagram showing a configuration of an example of familiarity DB 51.
  • the familiarity DB 51 is a database that records the "familiarity" of the user.
  • the familiarity DB 51 records, for example, a user ID, a voice label, a familiar object, a familiarity, a number of reactions, a number of presentations, and an average value of arousal change.
  • the "user ID" is an ID assigned to each user of the voice generator 1.
  • the user ID may be associated with user attribute information such as a user name.
  • the "voice label” is a label uniquely attached to each of the candidates for the calling voice. Any label can be used as the audio label. For example, a familiar name may be used for the voice label.
  • the "familiar target” is a target that generates a voice that the user often talks to or hears.
  • the familiar target does not necessarily have to be a person.
  • “Familiarity” is the degree of familiarity of the user with the corresponding familiar voice.
  • the degree of familiarity can be calculated from the frequency of communication with a familiar target by SNS or the like, the frequency of daily conversation with a familiar target, the frequency of daily hearing from a familiar target, and the like. For example, the higher the frequency of communication with a familiar target by SNS or the like, the frequency of daily conversation with a familiar target, and the frequency of daily hearing from a familiar target, the greater the value of familiarity.
  • the degree of familiarity may be acquired by self-reporting by the user.
  • the "number of responses" is the number of times the user responded to the call voice generated based on the corresponding voice label.
  • the number of presentations is the number of times the call voice generated based on the corresponding voice label is presented to the user.
  • The reaction probability is the probability that the user reacts to the call voice generated based on the corresponding voice label.
  • The reaction probability can be calculated by dividing the number of reactions by the number of presentations.
  • the "average value of change in arousal level” is the average value of the amount of change in the arousal level of the user with respect to the call voice generated based on the corresponding voice label.
  • the amount of change in arousal level will be described later.
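  • For illustration, the sketch below shows one way a familiarity DB 51 record and the derived reaction probability could be represented in code; the field names and types are assumptions, not part of the embodiment.

```python
# Illustrative record layout for the familiarity DB 51. The derived
# reaction probability is the number of reactions divided by the number
# of presentations, as described above.
from dataclasses import dataclass

@dataclass
class FamiliarityRecord:
    user_id: str
    voice_label: str
    familiar_target: str
    familiarity: float
    reaction_count: int
    presentation_count: int
    avg_arousal_change: float

    @property
    def reaction_probability(self) -> float:
        if self.presentation_count == 0:
            return 0.0  # no presentations yet, so no meaningful probability
        return self.reaction_count / self.presentation_count
```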
  • FIG. 4 is a diagram showing the configuration of an example of the user log DB 52.
  • the user log DB 52 is a database that records logs related to the use of the voice generation device 1 by the user.
  • The user log DB 52 records, for example, a log generation date and time, a user ID, a voice label, a familiar target, a concentration degree, reaction presence/absence, an arousal degree, a new arousal degree, an arousal degree change amount, and a correct answer label, in association with one another.
  • the user ID, the voice label, and the familiar object are the same as the familiarity DB 51.
  • the "log generation date and time” is the date and time when the user used the voice generator 1.
  • the log generation date and time is recorded, for example, each time a call voice is presented to the user.
  • "Reaction presence/absence" is information on whether the user reacted after the call voice was presented to the user. "Yes" is recorded when there is a user reaction; "none" is recorded when there is no user reaction.
  • The "concentration degree" is the degree of concentration of the user at the time the call voice is presented.
  • the degree of concentration can be measured, for example, by estimating the posture and behavior of the user during work from the image obtained by the camera 8.
  • The concentration value is calculated so as to increase each time the user takes an action considered to indicate concentration, and to decrease each time the user takes an action considered to indicate a lack of concentration.
  • Alternatively, the degree of opening of the user's pupil during work can be estimated from the image obtained by the camera 8.
  • In that case, the concentration value is calculated to be higher the more the pupil is dilated (mydriatic) and lower the more it is constricted (miotic).
  • the degree of concentration may be a discrete value such as Lv (Level) 1, Lv2, ....
  • the method for acquiring the degree of concentration is not limited to a specific method.
  • the "awakening degree” is the awakening degree acquired before the presentation of the call voice by the voice generation device 1.
  • the "new arousal degree" is the arousal degree newly acquired after the user's reaction. New arousal is not recorded when there is no user response.
  • the "awakening degree change amount” is an amount representing the change in the arousal degree before and after the user's reaction.
  • the amount of change in alertness is obtained, for example, from the difference between the new alertness and the alertness.
  • the amount of change in arousal level may be the ratio of the new arousal level to the arousal level or the like. The amount of change in alertness is not recorded when there is no reaction from the user.
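  • As a small sketch of the change amount described above, the function below returns the difference by default and the ratio as the mentioned alternative; the function name and signature are assumptions.

```python
def arousal_change(arousal: float, new_arousal: float,
                   as_ratio: bool = False) -> float:
    # Difference between the new arousal degree and the arousal degree;
    # optionally the ratio, as the alternative definition mentioned above.
    return new_arousal / arousal if as_ratio else new_arousal - arousal
```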
  • the "correct answer label” is a label of correct or incorrect answers for supervised learning. For example, the correct answer is recorded as ⁇ , and the incorrect answer is recorded as ⁇ .
  • the correct label will be described in detail later.
  • the model DB 53 is a database that records a model of voice label classification for extracting voice label candidates.
  • the model is a model configured to classify correct or incorrect answers of voice labels in a two-dimensional space of familiarity and concentration.
  • the model includes an initial model and a learning model.
  • the initial model is a model generated based on the initial value stored in the model DB 53, and is a model that is not updated by learning.
  • The initial value is the value of a constant (the equation of a plane) that defines the classification surface for classifying voice labels in the three-dimensional space of, for example, "familiarity", "concentration degree", and "arousal degree change amount".
  • the classification plane generated by this initial value is the initial model.
  • the training model is a trained model generated from the initial model.
  • the learning model can be a binary classification model with a different classification surface than the initial model.
  • the voice synthesis parameter DB 54 is a database in which voice synthesis parameters are recorded.
  • the voice synthesis parameter is data used for synthesizing the voice of the user's familiar target.
  • the voice synthesis parameter may be feature amount data extracted from voice data previously collected through the microphone 6.
  • speech synthesis parameters acquired or defined by other systems may be pre-recorded.
  • the speech synthesis parameter is associated with the speech label.
  • FIG. 5 is a diagram showing the configuration of an example of the call statement DB55.
  • the call statement DB 55 is a database in which template data of various call statements for encouraging the awakening of the user are recorded.
  • the call statement is not particularly limited. However, it is desirable that the call statement includes a call using the user's name. This is to enhance the cocktail party effect described later.
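  • As a minimal sketch of such name-based call statements, the template strings and the "{name}" placeholder below are invented for illustration; the actual template data in the call statement DB 55 is not specified here.

```python
# Sketch: drawing a call statement template and filling in the user's name.
# Calls that include the user's name are preferred, to enhance the cocktail
# party effect described in the embodiment.
import random

CALL_TEMPLATES = [
    "{name}, are you awake?",
    "Hey, {name}, look over here!",
    "{name}, it's time to get back to work.",
]

def make_call_statement(user_name: str,
                        rng: random.Random = random.Random()) -> str:
    template = rng.choice(CALL_TEMPLATES)
    return template.format(name=user_name)
```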
  • the familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call statement DB 55 do not necessarily have to be stored in the storage 5.
  • the familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call statement DB 55 may be stored in a server separate from the voice generation device 1.
  • the voice generator 1 accesses the server using the communication module 11 and acquires necessary information.
  • FIG. 6 is a functional block diagram of the voice generator 1.
  • the voice generation device 1 has an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26.
  • The operations of the acquisition unit 21, the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 are realized, for example, by the processor 2 executing a program stored in the storage 5.
  • the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 may be realized by hardware different from the processor 2.
  • the acquisition unit 21 acquires the arousal level of the user. Further, the acquisition unit 21 acquires the user's reaction to the call voice. As described above, the degree of arousal is calculated by any one of eye movements, blinking activity, electrical skin activity, reaction time to stimuli, or a combination thereof.
  • the eye movement, blinking activity, and reaction time to the stimulus for calculating the degree of arousal can be measured from, for example, an image of the user acquired by the camera 8.
  • the reaction time to the stimulus may be measured from the audio signal acquired by the microphone 6.
  • skin electrical activity can be measured, for example, by a sensor worn on the user's arm.
  • The user's reaction is the presence or absence of a physical reaction by the user, such as the user's head turning toward the speaker 7a or 7b or the user's line of sight turning toward the speaker 7a or 7b, together with the direction of that reaction. It can be acquired, for example, by measurement from an image acquired by the camera 8.
  • the acquisition unit 21 may be configured to acquire the arousal degree or the user's reaction calculated outside the voice generation device 1 by communication.
  • the determination unit 22 determines whether or not the user is awake based on the degree of arousal acquired by the acquisition unit 21. Then, when the determination unit 22 determines that the user is in an awake state, the determination unit 22 transmits a voice label selection request to the reception unit 231 of the selection unit 23. Here, the determination unit 22 makes a determination by comparing the degree of arousal with a predetermined threshold value.
  • the threshold value is a threshold value of the degree of arousal for determining whether or not the user is in an awake state, and is stored in, for example, the storage 5. Further, the determination unit 22 determines whether or not there is a user reaction based on the user reaction information acquired by the acquisition unit 21.
  • The selection unit 23 selects the voice label of a voice that is a candidate for encouraging the user to awaken.
  • The selection unit 23 includes a reception unit 231, a model selection unit 232, a voice label candidate extraction unit 233, a voice label selection unit 234, and a transmission unit 235.
  • the receiving unit 231 receives a voice label selection request from the determination unit 22.
  • The model selection unit 232 selects a model to be used for selecting a voice label from the model DB 53.
  • the model selection unit 232 selects either an initial model or a learning model based on the degree of fit.
  • the degree of fit is a value for determining which of the initial model and the learning model has higher accuracy. The degree of fit will be described in detail later.
  • the voice label candidate extraction unit 233 extracts voice labels that are candidates for the call voice to be presented to the user from the familiarity DB 51 based on the model selected by the model selection unit 232 and the concentration level of the user.
  • the voice label selection unit 234 selects a voice label for generating a call voice to be presented to the user from the voice label extracted by the voice label candidate extraction unit 233.
  • the transmission unit 235 transmits the information of the voice label selected by the voice label selection unit 234 to the generation unit 24.
  • the generation unit 24 generates a call voice for encouraging the user to awaken based on the voice label received from the transmission unit 235.
  • the generation unit 24 acquires the voice synthesis parameter corresponding to the voice label received from the transmission unit 235 from the voice synthesis parameter DB 54. Then, the generation unit 24 generates a call voice based on the call text data recorded in the call text DB 55 and the voice synthesis parameter.
  • the presentation unit 25 presents the call voice generated by the generation unit 24 to the user.
  • The presentation unit 25 reproduces the call voice generated by the generation unit 24 using the speakers 7a and 7b.
  • the learning unit 26 learns the model recorded in the model DB 53.
  • the learning unit 26 performs learning by using, for example, binary classification learning using a correct answer label.
  • FIGS. 7A and 7B are flowcharts showing the voice presentation process by the voice generator 1. The processes of FIGS. 7A and 7B may be performed periodically.
  • step S1 the acquisition unit 21 acquires the user's arousal level.
  • the acquisition unit 21 outputs the acquired arousal level to the determination unit 22. Further, the acquisition unit 21 holds the acquired arousal level until the timing of acquiring the reaction from the user after the presentation of the call voice.
  • step S2 the determination unit 22 determines whether or not the arousal level acquired by the acquisition unit 21 is equal to or less than the threshold value.
  • step S2 when it is determined that the arousal degree exceeds the threshold value, that is, when the user is in the awake state, the processes of FIGS. 7A and 7B are terminated.
  • step S2 when it is determined that the arousal degree is equal to or less than the threshold value, that is, when the user is not in an awake state such as having drowsiness, the process proceeds to step S3.
  • step S3 the determination unit 22 transmits a voice label selection request to the selection unit 23.
  • The model selection unit 232 refers to the user log DB 52 and acquires the number of reactions. The number of reactions is the total count of logs whose "reaction presence/absence" is "yes".
  • step S4 the model selection unit 232 determines whether or not the number of times there is a reaction is less than the threshold value.
  • the threshold value is a threshold value for determining whether or not the available learning model is recorded in the model DB 53.
  • the threshold is set to, for example, 2. In this case, when the number of reactions is 0 or 1, it is determined that the number of reactions is less than the threshold value.
  • When it is determined in step S4 that the number of reactions is less than the threshold value, the process proceeds to step S5.
  • When it is determined in step S4 that the number of reactions is equal to or greater than the threshold value, the process proceeds to step S6.
  • step S5 the model selection unit 232 selects an initial value, that is, an initial model from the model DB 53. Then, the model selection unit 232 outputs the selected initial model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
  • step S6 the model selection unit 232 calculates the degree of fit.
  • The model selection unit 232 first acquires all past logs, both with and without reactions, from the user log DB 52. Then, the model selection unit 232 calculates the degree of fit of both the initial model and the learning model.
  • The model selection unit 232 can use, as the degree of fit, for example the accuracy (correct answer rate) obtained by comparing the correct/incorrect output of the corresponding model, given the concentration value of each log, against the reaction presence/absence of each log.
  • The degree of fit is not limited to the accuracy; a precision rate, a recall rate, an F value (F-measure), or the like, calculated from the model's correct/incorrect output and the reaction presence/absence of the logs, may also be used.
  • The precision rate is the proportion, among the logs predicted to be correct, of those in which the user actually reacted.
  • The recall rate is the proportion, among the logs in which the user actually reacted, of those predicted to be correct.
  • The F value is the harmonic mean of the recall and the precision. For example, the F value can be calculated as 2 × Recall × Precision / (Recall + Precision).
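  • The sketch below computes these fit measures from logs represented as two boolean lists (the model's correct/incorrect output and the recorded reaction presence/absence); this log representation is an assumption for illustration.

```python
# Degree-of-fit measures: accuracy, precision, recall, and F value computed
# by comparing model predictions against recorded user reactions.

def fit_metrics(predicted_correct: list, reacted: list) -> dict:
    tp = sum(p and r for p, r in zip(predicted_correct, reacted))
    fp = sum(p and not r for p, r in zip(predicted_correct, reacted))
    fn = sum((not p) and r for p, r in zip(predicted_correct, reacted))
    tn = sum((not p) and (not r) for p, r in zip(predicted_correct, reacted))
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        # Harmonic mean of recall and precision, as in the text above.
        "f_value": (2 * recall * precision / (recall + precision)
                    if (recall + precision) else 0.0),
    }
```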
  • step S7 the model selection unit 232 compares the degree of fitting of the initial model and the learning model, and determines whether or not the degree of fitting of the learning model is higher.
  • When it is determined in step S7 that the degree of fit of the initial model is higher, the process proceeds to step S5. In this case, the model selection unit 232 selects the initial value, that is, the initial model.
  • When it is determined in step S7 that the degree of fit of the learning model is higher, the process proceeds to step S8.
  • step S8 the model selection unit 232 selects a learning model. Then, the model selection unit 232 outputs the selected learning model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
  • step S9 the voice label candidate extraction unit 233 acquires the current user concentration level from the acquisition unit 21.
  • the voice label candidate extraction unit 233 extracts the candidate voice label used for generating the calling voice from the familiarity DB 51.
  • The number of candidate voice labels extracted is equal to or greater than a specified number, for example the number of call voices to be presented.
  • For example, the voice label candidate extraction unit 233 extracts, from the voice labels registered in the familiarity DB 51, all voice labels that are classified as correct at the current concentration value.
  • A voice label classified as correct is a voice label for which the user is expected to react to the presented call voice and for which the arousal degree is also expected to increase.
  • the voice label selection unit 234 selects a specified number of voice labels, for example, the same number as the number of presented call voices, from the voice labels extracted by the voice label candidate extraction unit 233.
  • the voice label selection unit 234 obtains a weighted winning probability based on the number of past presentations, for example, when selecting a voice label. Then, the voice label selection unit 234 selects a voice label by random sampling based on the weighted winning probability.
  • the weighted winning probability can be calculated, for example, according to the equation (1).
  • the weighted winning probability may be calculated by an equation different from the equation (1).
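  • Since equation (1) itself is not reproduced in this text, the sketch below uses an assumed stand-in weighting, inversely related to the past presentation count, to illustrate random sampling with a weighted winning probability; all names here are illustrative.

```python
# Sketch: select voice labels by weighted random sampling without
# replacement. Labels presented less often in the past win more easily,
# which suppresses habituation to any single voice label. The
# 1 / (1 + presentation_count) weighting is an assumption standing in
# for equation (1).
import random

def select_voice_labels(candidates: list, presentation_counts: dict,
                        k: int,
                        rng: random.Random = random.Random()) -> list:
    pool = list(candidates)
    weights = [1.0 / (1 + presentation_counts.get(label, 0))
               for label in pool]
    chosen = []
    for _ in range(min(k, len(pool))):
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))
        weights.pop(idx)
    return chosen
```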
  • step S12 the transmission unit 235 transmits information indicating the voice label selected by the voice label selection unit 234 to the generation unit 24.
  • the generation unit 24 acquires the voice synthesis parameter corresponding to the received voice label from the voice synthesis parameter DB 54. Then, the generation unit 24 generates a call voice based on the data of the call text randomly selected from the call text DB 55 and the voice synthesis parameter.
  • the generation of the calling voice can be performed by a voice synthesis process using the voice synthesis parameters. After that, the process proceeds to step S13.
  • step S13 the presentation unit 25 simultaneously presents the call voice generated by the generation unit 24 to the user from the speakers 7a and 7b.
  • step S14 the acquisition unit 21 acquires the user's reaction. Then, the acquisition unit 21 outputs the user reaction information to the determination unit 22.
  • step S15 the determination unit 22 determines whether or not there has been a reaction from the user. When it is determined in step S15 that there is no reaction from the user, the process proceeds to step S20. When it is determined in step S15 that there is a reaction from the user, the process proceeds to step S16.
  • step S16 the determination unit 22 requests the acquisition unit 21 to acquire the new arousal degree.
  • the acquisition unit 21 acquires the new arousal degree.
  • the acquisition of the new arousal degree may be performed in the same manner as the acquisition of the arousal degree.
  • step S17 the acquisition unit 21 sets the correct answer label.
  • The acquisition unit 21 sets the correct answer labels, for example, as follows.
    1) When the acquired reaction is that the user points at a specific speaker: the voice label corresponding to the voice presented from that speaker is ○, and the other voice labels are ×.
    2) When the acquired reaction is that the user faces between a plurality of speakers: the angle between the user's direction and the direction of each speaker is obtained, and the voice label of the voice presented from the speaker forming the smaller angle is ○, while the other voice labels are ×.
    3) When the acquired reaction is that the user turns to one speaker and then turns to another speaker: the voice label of the voice presented from the speaker faced first is ○, and the other voice labels are ×.
    4) When no reaction can be obtained: the labels of all voices are ×.
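  • The sketch below implements the four labeling rules above; representing directions as angles in degrees, and all function and parameter names, are assumptions for illustration.

```python
# Correct answer labeling from the user's reaction, per the four rules.

def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, wrapping at 360."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def set_correct_labels(speaker_angles: dict,
                       pointed_label=None,
                       first_faced_label=None,
                       user_angle=None) -> dict:
    labels = {label: "x" for label in speaker_angles}  # rule 4: default all x
    if pointed_label is not None:        # rule 1: user pointed at a speaker
        labels[pointed_label] = "o"
    elif first_faced_label is not None:  # rule 3: speaker the user faced first
        labels[first_faced_label] = "o"
    elif user_angle is not None:         # rule 2: speaker with the smaller angle
        nearest = min(speaker_angles,
                      key=lambda lb: angle_diff(speaker_angles[lb], user_angle))
        labels[nearest] = "o"
    return labels
```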
  • step S18 the acquisition unit 21 registers the concentration degree, the reaction presence/absence information, the arousal degree, the new arousal degree, the arousal degree change amount, and the correct answer label in the user log DB 52, in association with the log generation date and time, the voice label, the familiar target, and the familiarity. After that, the process proceeds to step S19.
  • step S19 the learning unit 26 refers to the user log DB 52 and acquires the number of times there is a reaction. Then, the learning unit 26 determines whether or not the number of times there is a reaction is less than the threshold value.
  • the threshold value is a threshold value for determining whether or not the information necessary for learning has been accumulated.
  • the threshold is set to, for example, 2. In this case, when the number of reactions is 0 or 1, it is determined that the number of reactions is less than the threshold value.
  • When it is determined in step S19 that the number of reactions is less than the threshold value, the processes of FIGS. 7A and 7B are terminated.
  • When it is determined in step S19 that the number of reactions is equal to or greater than the threshold value, the process proceeds to step S20.
  • step S20 the learning unit 26 carries out binary classification learning. Then, the learning unit 26 records the learning result of the implementation of the binary classification learning in the model DB 53. After that, the processing of FIGS. 7A and 7B is completed.
  • step S20 the learning unit 26 acquires, for example, the correct answer labels recorded in the user log DB 52 and the familiarity and concentration degree associated with each correct answer label. Then, the learning unit 26 generates a binary classification model of the voice labels in the three-dimensional space of "familiarity", "concentration degree", and "arousal degree change amount".
  • FIG. 8 is a diagram showing an image of a binary classification model using "familiarity", "concentration degree", and "arousal degree change amount".
  • A voice label whose familiarity places it in the space above the classification surface P is classified as a correct answer (○).
  • A voice label whose familiarity places it in the space below the classification surface P is classified as an incorrect answer (×).
  • Various binary classification learning methods using logistic regression, SVM (Support Vector Machine), a neural network, or the like can be used to generate the model.
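  • As one possible realization of this learning step, the sketch below fits a logistic regression over the three axes using scikit-learn; the library choice and the toy data are assumptions, since the embodiment only requires some binary classifier.

```python
# Sketch: binary classification of voice labels over the three axes
# (familiarity, concentration degree, arousal degree change amount).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data; each row is one labeled log entry.
X = np.array([[0.9, 0.2, 0.30],
              [0.4, 0.8, 0.05],
              [0.7, 0.5, 0.20],
              [0.1, 0.9, 0.00]])
y = np.array([1, 0, 1, 0])  # correct answer label: 1 = o, 0 = x

model = LogisticRegression().fit(X, y)
# The fitted coefficients define a separating plane analogous to the
# classification surface P in FIG. 8.
print(model.predict([[0.8, 0.3, 0.25]]))
```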
  • Like the correct answer label, the arousal degree change amount characterizes the user's reaction, beyond the mere presence or absence of a reaction. The "arousal degree change amount" is therefore adopted as one axis of learning, as it is expected to further improve the accuracy of the correct answer label determination.
  • According to the embodiment, when it is determined that the user is not awake, a call is made to the user using a voice familiar to the user. Therefore, even when the user is drowsy, the cocktail party effect can be exploited so that the user hears the call voice, and the arousal degree is expected to improve in a short time. Further, in the embodiment, the familiarity and the concentration degree are used in selecting a familiar voice, so the user can be presented with a call voice to which the user is more likely to react.
  • In the embodiment, voice labels are classified using a learning model with the three axes of familiarity, concentration degree, and arousal degree change. As the learning progresses, voice label candidates better suited to the user are expected to be extracted. Further, according to the embodiment, a voice label for generating the voice is selected from the extracted candidates by random sampling weighted by the number of past presentations. This suppresses the user's habituation and boredom caused by frequently presenting call voices with the same voice label. As a result, even when the voice generation device 1 is used over a long period, the user can still be expected to react to the call voice, and the user's arousal degree is accordingly expected to increase.
  • According to the embodiment, the call voices are simultaneously presented from a plurality of speakers arranged in the environment, and the user's reaction to each call voice is acquired. The correct answer labels are then set according to the user's reaction. As a result, teacher data can be collected efficiently.
  • In the embodiment, the binary classification model employs the three axes of "familiarity", "concentration degree", and "arousal degree change amount".
  • More simply, a binary classification model using only "familiarity", or "familiarity" and "concentration degree", may be used.
  • In the embodiment, the learning device is used to learn a voice label classification model for call voices that encourage the user to awaken.
  • However, the learning device of the embodiment can also be used to learn various models for selecting a voice that is easy for the user to recognize.
  • Each process according to the above-described embodiment can be stored as a program executable by a processor, that is, a computer.
  • The program can be stored and distributed in a storage medium of an external storage device, such as a magnetic disk, an optical disc, or a semiconductor memory.
  • The processor reads the program stored in the storage medium of the external storage device and, with its operation controlled by the read program, can execute the above-described processing.
  • the present invention is not limited to the above embodiment, and can be variously modified at the implementation stage without departing from the gist thereof.
  • each embodiment may be carried out in combination as appropriate, in which case the combined effect can be obtained.
  • the above-described embodiment includes various inventions, and various inventions can be extracted by a combination selected from a plurality of disclosed constituent requirements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, if the problem can be solved and the effect is obtained, the configuration in which the constituent elements are deleted can be extracted as an invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Traffic Control Systems (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

This learning device comprises a learning unit (26) that acquires teacher data for a learning model for selecting a voice to be presented to a user from a plurality of voice candidates, based on the user's reaction to a plurality of voices presented simultaneously to the user.
PCT/JP2020/024823 2020-06-24 2020-06-24 Learning device, learning method, and learning program WO2021260848A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022531321A JP7416245B2 (ja) 2020-06-24 2024-01-17 Learning device, learning method and learning program
PCT/JP2020/024823 WO2021260848A1 (fr) 2020-06-24 2020-06-24 Learning device, learning method, and learning program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/024823 WO2021260848A1 (fr) 2020-06-24 2020-06-24 Learning device, learning method, and learning program

Publications (1)

Publication Number Publication Date
WO2021260848A1 (fr) 2021-12-30

Family

ID=79282108

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/024823 WO2021260848A1 (fr) 2020-06-24 2020-06-24 Learning device, learning method, and learning program

Country Status (2)

Country Link
JP (1) JP7416245B2 (fr)
WO (1) WO2021260848A1 (fr)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001304898A * 2000-04-25 2001-10-31 Sony Corp In-vehicle device
JP2007271296A * 2006-03-30 2007-10-18 Yamaha Corp Alarm device and program
JP2013101248A * 2011-11-09 2013-05-23 Sony Corp Voice control device, voice control method, and program
JP2016191791A * 2015-03-31 2016-11-10 ソニー株式会社 Information processing device, information processing method, and program
JP2020024293A * 2018-08-07 2020-02-13 トヨタ自動車株式会社 Voice dialogue system
JP2020034835A * 2018-08-31 2020-03-05 国立大学法人京都大学 Voice dialogue system, voice dialogue method, program, learning model generation device, and learning model generation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10433075B2 (en) 2017-09-12 2019-10-01 Whisper.Ai, Inc. Low latency audio enhancement


Also Published As

Publication number Publication date
JP7416245B2 (ja) 2024-01-17
JPWO2021260848A1 (fr) 2021-12-30

Similar Documents

Publication Publication Date Title
US10944708B2 (en) Conversation agent
JP6263308B1 (ja) Dementia diagnosis device, dementia diagnosis method, and dementia diagnosis program
US11009952B2 (en) Interface for electroencephalogram for computer control
CN106464758B (zh) Initiating communication using user signals
JP2021057057A (ja) Mobile and wearable video capture and feedback platform for therapy of mental disorders
CN109460752B (zh) Emotion analysis method and apparatus, electronic device, and storage medium
KR20180137490A (ko) Personal emotion-based computer-readable cognitive memory and cognitive insights for improving memory and decision-making
CN110881987B (zh) Emotion monitoring system for the elderly based on a wearable device
US11751813B2 (en) System, method and computer program product for detecting a mobile phone user's risky medical condition
JP6930277B2 (ja) Presentation device, presentation method, communication control device, communication control method, and communication control system
JP6906197B2 (ja) Information processing method, information processing device, and information processing program
CN113287175A (zh) Interactive health state assessment method and system therefor
WO2019086856A1 (fr) Systèmes et procédés permettant de combiner et d'analyser des états humains
JP2021146214A (ja) Techniques for separating driving emotion from media-induced emotion in a driver monitoring system
KR102552220B1 (ko) Content providing method, system, and computer program for adaptively performing mental health diagnosis and treatment
JP2018503187A (ja) 被験者とのインタラクションのスケジューリング
WO2021260848A1 (fr) Learning device, learning method, and learning program
WO2021260846A1 (fr) Voice generation device, voice generation method, and voice generation program
US20190141418A1 (en) A system and method for generating one or more statements
CN108461125B (zh) Memory training device for the elderly
US10079074B1 (en) System for monitoring disease progression
WO2021260844A1 (fr) Voice generation device, voice generation method, and voice generation program
JP7534745B1 (ja) Seizure prediction program, storage medium, seizure prediction device, and seizure prediction method
US20240008766A1 (en) System, method and computer program product for processing a mobile phone user's condition
WO2023199422A1 (fr) Internal state inference device, internal state inference method, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20941543

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022531321

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20941543

Country of ref document: EP

Kind code of ref document: A1