WO2021260848A1 - Learning device, learning method, and learning program - Google Patents
Learning device, learning method, and learning program
- Publication number
- WO2021260848A1 (PCT/JP2020/024823; JP2020024823W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- user
- voice
- learning
- degree
- model
- Prior art date
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
  - G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
  - G06F3/16—Sound input; Sound output
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
  - G06N20/00—Machine learning
- G—PHYSICS; G10—MUSICAL INSTRUMENTS, ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
  - G10L13/00—Speech synthesis; Text to speech systems
  - G10L13/02—Methods for producing synthetic speech; Speech synthesisers
Definitions
- This embodiment relates to a learning device, a learning method, and a learning program for voice selection.
- Speech classification models may be used for such selection.
- In such a model, learning is performed by giving information on the correctness of the classified speech as teacher data.
- Appropriate evaluation of speech is required to generate teacher data.
- For such evaluation, the method mentioned in Non-Patent Document 1 is known.
- The embodiment provides a learning device, a learning method, and a learning program that can efficiently collect teacher data for speech classification.
- The learning device includes a learning unit that acquires teacher data for a learning model, which selects the voice to be presented to the user from a plurality of voice candidates, based on the user's reaction to the plurality of voices presented to the user at the same time.
- According to the embodiment, a learning device, a learning method, and a learning program capable of efficiently collecting teacher data for speech classification are provided.
- FIG. 1 is a diagram showing a hardware configuration of an example of a voice generator according to an embodiment.
- FIG. 2A is a diagram showing an example of speaker arrangement.
- FIG. 2B is a diagram showing an example of speaker arrangement.
- FIG. 2C is a diagram showing an example of speaker arrangement.
- FIG. 2D is a diagram showing an example of speaker arrangement.
- FIG. 3 is a diagram showing the configuration of an example of the familiarity DB.
- FIG. 4 is a diagram showing an example configuration of a user log DB.
- FIG. 5 is a diagram showing the structure of an example of the call statement DB.
- FIG. 6 is a functional block diagram of the voice generator.
- FIG. 7A is a flowchart showing a voice presentation process by the voice generator.
- FIG. 7B is a flowchart showing a voice presentation process by the voice generator.
- FIG. 8 is a diagram showing an image of a binary classification model using "familiarity", "concentration", and "awakening degree change".
- FIG. 1 is a diagram showing a hardware configuration of an example of a voice generation device including a learning device according to an embodiment.
- The voice generation device 1 according to the embodiment emits a call voice urging the user to awaken when the user is not in an awake state, for example when drowsy.
- The degree of arousal in the embodiment is an index corresponding to the arousal level.
- the arousal level corresponds to the activity level of the cerebrum and represents the degree of arousal from sleep to excitement.
- the arousal level is measured from eye movements, blinking activity, electrical skin activity, reaction time to stimuli, and the like.
- the degree of arousal in the embodiment is calculated by any one of eye movements, blinking activity, electrical skin activity, reaction time to stimuli, or a combination thereof for measuring these arousal levels.
- the arousal level is a value that increases from a sleep state to an excitement state, for example.
- The arousal degree may be a continuous numerical value or a discrete value such as Level 1, Level 2, .... When the arousal degree is calculated from a combination of the values of eye movement, blinking activity, electrodermal activity, and reaction time to a stimulus, the combination method is not particularly limited; for example, a simple sum or a weighted sum of these values can be used.
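The simple-sum and weighted-sum combinations mentioned above can be sketched as follows. The indicator names, scores, and weights are illustrative assumptions, not values from the embodiment.

```python
def arousal_degree(indicators, weights=None):
    """Combine per-indicator scores into a single arousal degree.

    indicators: dict mapping an indicator name (e.g. "eye_movement",
        "blinking", "electrodermal", "reaction_time") to a measured score.
    weights: optional dict of per-indicator weights; when omitted,
        a simple (unweighted) sum is used.
    """
    if weights is None:
        return sum(indicators.values())                        # simple summing
    return sum(weights[k] * v for k, v in indicators.items())  # weighted addition

scores = {"eye_movement": 0.4, "blinking": 0.6,
          "electrodermal": 0.5, "reaction_time": 0.3}
print(arousal_degree(scores))  # simple sum of the four scores
print(arousal_degree(scores, {"eye_movement": 2.0, "blinking": 1.0,
                              "electrodermal": 1.0, "reaction_time": 0.5}))
```

Either result can then be compared with the awake-state threshold described later.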
- The voice generator 1 includes a processor 2, a ROM 3, a RAM 4, a storage 5, a microphone 6, speakers 7a and 7b, a camera 8, an input device 9, a display 10, and a communication module 11.
- the voice generation device 1 is various terminals such as a personal computer (PC), a smartphone, and a tablet terminal. Not limited to this, the voice generation device 1 can be mounted on various devices used by the user.
- The voice generator 1 does not have to have all the configurations shown in FIG. 1. For example, the microphone 6, the speakers 7a and 7b, the camera 8, and the display 10 may be devices separate from the voice generation device 1.
- The processor 2, such as a CPU, is a control circuit that controls the overall operation of the voice generator 1.
- the processor 2 does not have to be a CPU, and may be an ASIC, FPGA, GPU or the like.
- the processor 2 does not have to be composed of a single CPU or the like, and may be composed of a plurality of CPUs or the like.
- ROM 3 is a non-volatile memory such as a flash memory.
- the start program of the voice generator 1 is stored in the ROM 3.
- RAM 4 is a volatile memory such as SDRAM. The RAM 4 can be used as a working memory for various processes in the voice generator 1.
- the storage 5 is a storage such as a flash memory, a hard disk drive (HDD), and a solid state drive (SSD).
- Various programs used in the voice generator 1 are stored in the storage 5.
- The storage 5 may store a familiarity database (DB) 51, a user log database (DB) 52, a model database (DB) 53, a speech synthesis parameter database (DB) 54, and a call statement database (DB) 55. These databases will be described in detail later.
- the microphone 6 is a device that converts the input voice into a voice signal which is an electric signal.
- the audio signal obtained by the microphone 6 can be stored in, for example, the RAM 4 or the storage 5.
- the voice synthesis parameter for synthesizing the calling voice can be acquired from the voice input via the microphone 6.
- Speakers 7a and 7b are devices that output voice based on the input voice signal.
- the speaker 7a and the speaker 7b are not in close proximity to each other.
- The speaker 7a and the speaker 7b are arranged in different directions as seen from the user.
- The speaker 7a and the speaker 7b are equidistant from the user.
- FIGS. 2A and 2B are diagrams showing arrangement examples of the speakers 7a and 7b.
- In FIG. 2A, the speakers 7a and 7b are arranged in front of the user U, equidistant from the user.
- In FIG. 2B, the speakers 7a and 7b are arranged in front of and behind the user U, equidistant from the user.
- FIG. 1 shows an example in which the number of presented voices is two.
- the number of presented voices may be three or more.
- In that case, three or more speakers are also arranged. Even with three or more speakers, it is desirable that the speakers not be close to each other, that each speaker be placed in a different direction as seen from the user, and that each speaker be equidistant from the user.
- An arrangement example with three speakers 7a, 7b, and 7c is shown in FIGS. 2C and 2D.
- In FIG. 2C, the speakers 7a, 7b, and 7c are arranged in front of the user U.
- In FIG. 2D, the speakers 7a, 7b, and 7c are arranged behind the user U.
- the camera 8 captures the user and acquires the image of the user.
- the user's image obtained by the camera 8 can be stored in, for example, the RAM 4 or the storage 5.
- the user's image is used, for example, to acquire the degree of arousal or to acquire the user's reaction to the calling voice.
- The input device 9 is a mechanical input device such as a button, a switch, a keyboard, or a mouse, or a software input device using a touch sensor.
- the input device 9 receives various inputs from the user. Then, the input device 9 outputs a signal corresponding to the user's input to the processor 2.
- the display 10 is a display such as a liquid crystal display or an organic EL display.
- the display 10 displays various images.
- the communication module 11 is a device for the voice generation device 1 to carry out communication.
- the communication module 11 communicates with, for example, a server provided outside the voice generator 1.
- the communication method by the communication module 11 is not particularly limited.
- the communication module 11 may carry out communication wirelessly or may carry out communication by wire.
- FIG. 3 is a diagram showing a configuration of an example of familiarity DB 51.
- the familiarity DB 51 is a database that records the "familiarity" of the user.
- the familiarity DB 51 records, for example, a user ID, a voice label, a familiar object, a familiarity, a number of reactions, a number of presentations, and an average value of arousal change.
- the "user ID" is an ID assigned to each user of the voice generator 1.
- the user ID may be associated with user attribute information such as a user name.
- the "voice label” is a label uniquely attached to each of the candidates for the calling voice. Any label can be used as the audio label. For example, a familiar name may be used for the voice label.
- the "familiar target” is a target that generates a voice that the user often talks to or hears.
- the familiar target does not necessarily have to be a person.
- “Familiarity” is the degree of familiarity of the user with the corresponding familiar voice.
- The degree of familiarity can be calculated from, for example, the frequency of communication with the familiar target via SNS or the like, the frequency of daily conversation with the familiar target, and the frequency of hearing the familiar target daily. The higher these frequencies, the greater the value of familiarity.
- the degree of familiarity may be acquired by self-reporting by the user.
- the "number of responses" is the number of times the user responded to the call voice generated based on the corresponding voice label.
- the number of presentations is the number of times the call voice generated based on the corresponding voice label is presented to the user.
- the reaction probability can be calculated by dividing the number of reactions by the number of presentations.
- the reaction probability is the probability that the user will react to the call voice generated based on the corresponding voice label.
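The reaction probability just described (the number of reactions divided by the number of presentations) can be sketched as follows; the zero-presentation convention is a choice made here, not specified in the text.

```python
def reaction_probability(num_reactions, num_presentations):
    """Probability that the user reacts to the call voice of a voice label:
    number of reactions divided by number of presentations."""
    if num_presentations == 0:
        return 0.0  # nothing presented yet; convention chosen here, not specified
    return num_reactions / num_presentations

print(reaction_probability(3, 10))  # 0.3
```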
- the "average value of change in arousal level” is the average value of the amount of change in the arousal level of the user with respect to the call voice generated based on the corresponding voice label.
- the amount of change in arousal level will be described later.
- FIG. 4 is a diagram showing the configuration of an example of the user log DB 52.
- the user log DB 52 is a database that records logs related to the use of the voice generation device 1 by the user.
- the user log DB 52 has, for example, a log generation date and time, a user ID, a voice label, a familiar target, a concentration level, a reaction presence / absence, an alertness level, a new alertness level, an arousal level change amount, and a correct answer label. It is associated and recorded.
- the user ID, the voice label, and the familiar object are the same as the familiarity DB 51.
- the "log generation date and time” is the date and time when the user used the voice generator 1.
- the log generation date and time is recorded, for example, each time a call voice is presented to the user.
- "Presence/absence of reaction" is information on whether the user reacted after the call voice was presented. "Yes" is recorded when the user reacted; "None" is recorded when the user did not.
- “Concentration ratio” is the degree of concentration of the user when presenting the call voice.
- the degree of concentration can be measured, for example, by estimating the posture and behavior of the user during work from the image obtained by the camera 8.
- The value of the degree of concentration is calculated so as to increase each time the user takes an action considered to indicate concentration, and to decrease each time the user takes an action considered to indicate a lack of concentration.
- Alternatively, the degree of opening of the user's pupil during work can be estimated from the image obtained by the camera 8.
- In that case, the concentration value is calculated to be higher when the pupil is more dilated (mydriasis) and lower when it is more constricted (miosis).
- the degree of concentration may be a discrete value such as Lv (Level) 1, Lv2, ....
- the method for acquiring the degree of concentration is not limited to a specific method.
- the "awakening degree” is the awakening degree acquired before the presentation of the call voice by the voice generation device 1.
- the "new arousal degree" is the arousal degree newly acquired after the user's reaction. New arousal is not recorded when there is no user response.
- the "awakening degree change amount” is an amount representing the change in the arousal degree before and after the user's reaction.
- The amount of change in arousal degree is obtained, for example, from the difference between the new arousal degree and the previous arousal degree.
- The amount of change in arousal degree may instead be the ratio of the new arousal degree to the previous arousal degree, or the like. The amount of change is not recorded when there is no reaction from the user.
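The change amount just described (a difference by default, optionally a ratio) can be sketched as:

```python
def arousal_change(previous_degree, new_degree, as_ratio=False):
    """Change in arousal degree before and after the user's reaction.

    The difference is the default; as_ratio=True returns the ratio of the
    new arousal degree to the previous one instead.
    """
    if as_ratio:
        return new_degree / previous_degree
    return new_degree - previous_degree

print(arousal_change(0.4, 0.7))                 # difference (about 0.3)
print(arousal_change(0.4, 0.7, as_ratio=True))  # ratio (about 1.75)
```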
- the "correct answer label” is a label of correct or incorrect answers for supervised learning. For example, the correct answer is recorded as ⁇ , and the incorrect answer is recorded as ⁇ .
- the correct label will be described in detail later.
- the model DB 53 is a database that records a model of voice label classification for extracting voice label candidates.
- the model is a model configured to classify correct or incorrect answers of voice labels in a two-dimensional space of familiarity and concentration.
- the model includes an initial model and a learning model.
- the initial model is a model generated based on the initial value stored in the model DB 53, and is a model that is not updated by learning.
- The initial value is the value of a constant (the equation of a plane) that determines the classification surface for classifying voice labels, defined in the three-dimensional space of, for example, "familiarity", "concentration", and "awakening degree change".
- The classification plane generated from this initial value is the initial model.
- The learning model is a trained model generated from the initial model.
- the learning model can be a binary classification model with a different classification surface than the initial model.
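A binary classification model over such a feature space could be realized, for example, with logistic regression or a support vector machine. The following stdlib-only sketch trains a tiny logistic-regression classifier by stochastic gradient descent; the feature tuples and labels are made-up illustrations, not data from the embodiment.

```python
import math

def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Train a tiny logistic-regression classifier by stochastic gradient descent.

    samples: list of (familiarity, concentration, arousal_change) feature tuples.
    labels:  1 for a correct-answer label, 0 for an incorrect one.
    """
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))              # sigmoid
            g = p - y                                   # log-loss gradient factor
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Return 1 (correct) or 0 (incorrect) for a feature tuple x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0.0 else 0

# Made-up toy data: high familiarity with a positive arousal change -> "correct".
X = [(0.9, 0.2, 0.5), (0.8, 0.4, 0.4), (0.1, 0.8, 0.0), (0.2, 0.9, -0.1)]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
print([predict(w, b, x) for x in X])  # should recover the training labels
```

The learned hyperplane w·x + b = 0 plays the role of the classification surface described above.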
- the voice synthesis parameter DB 54 is a database in which voice synthesis parameters are recorded.
- the voice synthesis parameter is data used for synthesizing the voice of the user's familiar target.
- the voice synthesis parameter may be feature amount data extracted from voice data previously collected through the microphone 6.
- speech synthesis parameters acquired or defined by other systems may be pre-recorded.
- the speech synthesis parameter is associated with the speech label.
- FIG. 5 is a diagram showing the configuration of an example of the call statement DB55.
- the call statement DB 55 is a database in which template data of various call statements for encouraging the awakening of the user are recorded.
- The call statement is not particularly limited. However, it desirably includes a call using the user's name, to enhance the cocktail party effect described later.
- the familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call statement DB 55 do not necessarily have to be stored in the storage 5.
- the familiarity DB 51, the user log DB 52, the model DB 53, the voice synthesis parameter DB 54, and the call statement DB 55 may be stored in a server separate from the voice generation device 1.
- the voice generator 1 accesses the server using the communication module 11 and acquires necessary information.
- FIG. 6 is a functional block diagram of the voice generator 1.
- the voice generation device 1 has an acquisition unit 21, a determination unit 22, a selection unit 23, a generation unit 24, a presentation unit 25, and a learning unit 26.
- the operation of the acquisition unit 21, the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 is, for example, when the processor 2 executes a program stored in the storage 5. It will be realized.
- the determination unit 22, the selection unit 23, the generation unit 24, the presentation unit 25, and the learning unit 26 may be realized by hardware different from the processor 2.
- the acquisition unit 21 acquires the arousal level of the user. Further, the acquisition unit 21 acquires the user's reaction to the call voice. As described above, the degree of arousal is calculated by any one of eye movements, blinking activity, electrical skin activity, reaction time to stimuli, or a combination thereof.
- the eye movement, blinking activity, and reaction time to the stimulus for calculating the degree of arousal can be measured from, for example, an image of the user acquired by the camera 8.
- the reaction time to the stimulus may be measured from the audio signal acquired by the microphone 6.
- skin electrical activity can be measured, for example, by a sensor worn on the user's arm.
- The user's reaction is the presence or absence of a physical reaction, such as the user's head or line of sight turning toward the speaker 7a or 7b, together with the direction of that reaction. It can be obtained, for example, by measurement from an image acquired by the camera 8.
- the acquisition unit 21 may be configured to acquire the arousal degree or the user's reaction calculated outside the voice generation device 1 by communication.
- the determination unit 22 determines whether or not the user is awake based on the degree of arousal acquired by the acquisition unit 21. Then, when the determination unit 22 determines that the user is in an awake state, the determination unit 22 transmits a voice label selection request to the reception unit 231 of the selection unit 23. Here, the determination unit 22 makes a determination by comparing the degree of arousal with a predetermined threshold value.
- the threshold value is a threshold value of the degree of arousal for determining whether or not the user is in an awake state, and is stored in, for example, the storage 5. Further, the determination unit 22 determines whether or not there is a user reaction based on the user reaction information acquired by the acquisition unit 21.
- the selection unit 23 selects an audio label of a voice that is a candidate for encouraging the user to awaken.
- the selection unit 23 includes a reception unit 231, a model selection unit 232, an audio label candidate extraction unit 233, an audio label selection unit 234, and a transmission unit 235.
- the receiving unit 231 receives a voice label selection request from the determination unit 22.
- the model selection unit 232 selects a model to be used for selecting an audio label from the model DB 53.
- the model selection unit 232 selects either an initial model or a learning model based on the degree of fit.
- the degree of fit is a value for determining which of the initial model and the learning model has higher accuracy. The degree of fit will be described in detail later.
- the voice label candidate extraction unit 233 extracts voice labels that are candidates for the call voice to be presented to the user from the familiarity DB 51 based on the model selected by the model selection unit 232 and the concentration level of the user.
- the voice label selection unit 234 selects a voice label for generating a call voice to be presented to the user from the voice label extracted by the voice label candidate extraction unit 233.
- the transmission unit 235 transmits the information of the voice label selected by the voice label selection unit 234 to the generation unit 24.
- the generation unit 24 generates a call voice for encouraging the user to awaken based on the voice label received from the transmission unit 235.
- the generation unit 24 acquires the voice synthesis parameter corresponding to the voice label received from the transmission unit 235 from the voice synthesis parameter DB 54. Then, the generation unit 24 generates a call voice based on the call text data recorded in the call text DB 55 and the voice synthesis parameter.
- the presentation unit 25 presents the call voice generated by the generation unit 24 to the user.
- The presentation unit 25 reproduces the call voice generated by the generation unit 24 using the speakers 7a and 7b.
- the learning unit 26 learns the model recorded in the model DB 53.
- the learning unit 26 performs learning by using, for example, binary classification learning using a correct answer label.
- FIGS. 7A and 7B are flowcharts showing the voice presentation process by the voice generator 1. The processes of FIGS. 7A and 7B may be performed periodically.
- In step S1, the acquisition unit 21 acquires the user's arousal level.
- the acquisition unit 21 outputs the acquired arousal level to the determination unit 22. Further, the acquisition unit 21 holds the acquired arousal level until the timing of acquiring the reaction from the user after the presentation of the call voice.
- In step S2, the determination unit 22 determines whether the arousal level acquired by the acquisition unit 21 is equal to or less than the threshold value.
- When it is determined in step S2 that the arousal degree exceeds the threshold value, that is, when the user is in the awake state, the processes of FIGS. 7A and 7B are terminated.
- When it is determined in step S2 that the arousal degree is equal to or less than the threshold value, that is, when the user is not in an awake state, for example because of drowsiness, the process proceeds to step S3.
- In step S3, the determination unit 22 transmits a voice label selection request to the selection unit 23.
- The model selection unit 232 refers to the user log DB 52 and acquires the number of reactions, that is, the total number of logs in which "presence/absence of reaction" is "yes".
- In step S4, the model selection unit 232 determines whether the number of reactions is less than the threshold value.
- the threshold value is a threshold value for determining whether or not the available learning model is recorded in the model DB 53.
- the threshold is set to, for example, 2. In this case, when the number of reactions is 0 or 1, it is determined that the number of reactions is less than the threshold value.
- When it is determined in step S4 that the number of reactions is less than the threshold value, the process proceeds to step S5.
- When it is determined in step S4 that the number of reactions is not less than the threshold value, the process proceeds to step S6.
- In step S5, the model selection unit 232 selects the initial value, that is, the initial model from the model DB 53. Then, the model selection unit 232 outputs the selected initial model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
- In step S6, the model selection unit 232 calculates the degree of fit.
- the model selection unit 232 first acquires all past reactioned and unreacted logs from the user log DB 52. Then, the model selection unit 232 calculates the degree of fit of both the initial model and the learning model.
- As the degree of fit, the model selection unit 232 can use, for example, the accuracy (correct answer rate) obtained by comparing the correct/incorrect output of the corresponding model, given the concentration value of each log, with the presence or absence of a reaction recorded in that log.
- The degree of fit is not limited to the correct answer rate; any value calculated from the model's correct/incorrect output and the presence or absence of a reaction in the log, such as the precision, the recall, or the F value (F-measure), may be used.
- The precision rate is the percentage of the logs predicted to be correct in which the user actually reacted ("yes").
- The recall rate is the percentage of the logs in which the user actually reacted that were predicted to be correct.
- The F value is the harmonic mean of the recall and the precision. For example, the F value can be calculated as 2 × Recall × Precision / (Recall + Precision).
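The fit-degree candidates above (accuracy, precision, recall, and F value) can all be computed from the model's correct/incorrect predictions and the logged reactions; a minimal sketch with made-up example logs:

```python
def fit_degrees(predicted, reacted):
    """Compute accuracy, precision, recall, and F value.

    predicted: list of booleans, True = the model output "correct answer".
    reacted:   list of booleans, True = the log records a user reaction ("yes").
    """
    tp = sum(p and r for p, r in zip(predicted, reacted))       # true positives
    fp = sum(p and not r for p, r in zip(predicted, reacted))   # false positives
    fn = sum(r and not p for p, r in zip(predicted, reacted))   # false negatives
    accuracy = sum(p == r for p, r in zip(predicted, reacted)) / len(reacted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_value = (2 * recall * precision / (recall + precision)
               if recall + precision else 0.0)
    return accuracy, precision, recall, f_value

pred = [True, True, False, True, False]
react = [True, False, False, True, True]
print(fit_degrees(pred, react))  # accuracy 0.6; precision, recall, F all 2/3
```

Computing these for both the initial model and the learning model allows the comparison made in step S7.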
- In step S7, the model selection unit 232 compares the degrees of fit of the initial model and the learning model, and determines whether the degree of fit of the learning model is higher.
- When it is determined in step S7 that the degree of fit of the initial model is equal or higher, the process proceeds to step S5, and the model selection unit 232 selects the initial value, that is, the initial model.
- When it is determined in step S7 that the degree of fit of the learning model is higher, the process proceeds to step S8.
- In step S8, the model selection unit 232 selects the learning model. Then, the model selection unit 232 outputs the selected learning model to the voice label candidate extraction unit 233. After that, the process proceeds to step S9.
- In step S9, the voice label candidate extraction unit 233 acquires the current concentration level of the user from the acquisition unit 21.
- the voice label candidate extraction unit 233 extracts the candidate voice label used for generating the calling voice from the familiarity DB 51.
- The number of candidate voice labels extracted is equal to or greater than a specified number, for example, the number of call voices to be presented.
- the voice label candidate extraction unit 233 extracts all voice labels to which the correct answer label is attached to the current concentration value from the voice labels registered in the familiarity DB 51, for example.
- A voice label with the correct answer label is a voice label for which the user is expected to react to the presented call voice and for which the degree of arousal is expected to increase.
- the voice label selection unit 234 selects a specified number of voice labels, for example, the same number as the number of presented call voices, from the voice labels extracted by the voice label candidate extraction unit 233.
- the voice label selection unit 234 obtains a weighted winning probability based on the number of past presentations, for example, when selecting a voice label. Then, the voice label selection unit 234 selects a voice label by random sampling based on the weighted winning probability.
- the weighted winning probability can be calculated, for example, according to the equation (1).
- the weighted winning probability may be calculated by an equation different from the equation (1).
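The specific form of equation (1) is not reproduced in this excerpt, so the inverse weighting below, 1 / (presentations + 1), is only one plausible assumption that favors rarely presented labels; the sampling loop then picks distinct labels according to those weights.

```python
import random

def weighted_probabilities(presentation_counts):
    """Weight each voice label by 1 / (presentations + 1) and normalize.

    NOTE: this weighting is an assumption standing in for equation (1),
    which is not reproduced here; it favors rarely presented labels.
    """
    weights = [1.0 / (n + 1) for n in presentation_counts]
    total = sum(weights)
    return [w / total for w in weights]

def sample_labels(labels, presentation_counts, k):
    """Select k distinct voice labels by random sampling with the weights above."""
    pool = list(zip(labels, weighted_probabilities(presentation_counts)))
    chosen = []
    for _ in range(k):
        r, acc = random.random() * sum(p for _, p in pool), 0.0
        for i, (label, p) in enumerate(pool):
            acc += p
            if r <= acc:
                chosen.append(label)
                del pool[i]  # draw without replacement
                break
    return chosen

# Hypothetical voice labels and their past presentation counts.
print(sample_labels(["label_A", "label_B", "label_C"], [10, 2, 0], k=2))
```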
- In step S12, the transmission unit 235 transmits information indicating the voice label selected by the voice label selection unit 234 to the generation unit 24.
- the generation unit 24 acquires the voice synthesis parameters corresponding to the received voice label from the voice synthesis parameter DB 54. The generation unit 24 then generates a call voice by voice synthesis processing, based on those parameters and call text data randomly selected from the call text DB 55. After that, the process proceeds to step S13.
- In step S13, the presentation unit 25 simultaneously presents the call voices generated by the generation unit 24 to the user from the speakers 7a and 7b.
- In step S14, the acquisition unit 21 acquires the user's reaction and outputs the user reaction information to the determination unit 22.
- In step S15, the determination unit 22 determines whether or not there has been a reaction from the user. When it is determined that there is no reaction, the process proceeds to step S20; when it is determined that there is a reaction, the process proceeds to step S16.
- In step S16, the determination unit 22 requests the acquisition unit 21 to acquire a new arousal degree.
- the acquisition unit 21 acquires the new arousal degree in the same manner as the earlier acquisition of the arousal degree.
- In step S17, the acquisition unit 21 sets the correct answer labels.
- the acquisition unit 21 sets the correct answer labels as follows, for example:
  1) When the acquired reaction is that the user points to a specific speaker: the voice label of the voice presented from that speaker: ○; all other voice labels: ×.
  2) When the acquired reaction is that the user faces between a plurality of speakers: the angle between the direction of the user and the direction of each speaker is obtained, and the voice label of the voice presented from the speaker with the smaller angle: ○; all other voice labels: ×.
  3) When the acquired reaction is that the user turns to one speaker and then turns to another speaker: the voice label of the voice presented from the speaker faced first: ○; all other voice labels: ×.
  4) When no reaction can be obtained: the labels of all voices: ×.
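For case 2), the angle comparison can be sketched as follows. Modeling directions as compass-style angles in degrees is an assumption for illustration; the patent does not specify the representation:

```python
def angular_distance(a, b):
    """Smallest angle in degrees between two directions."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def label_by_angle(user_direction, speaker_directions):
    """Mark the speaker whose direction forms the smaller angle with the
    user's facing direction as correct; all others as incorrect."""
    angles = [angular_distance(user_direction, s) for s in speaker_directions]
    closest = min(range(len(angles)), key=angles.__getitem__)
    return ["correct" if i == closest else "incorrect"
            for i in range(len(speaker_directions))]

# User faces 20 degrees; speakers are at 0 and 90 degrees.
print(label_by_angle(20.0, [0.0, 90.0]))  # → ['correct', 'incorrect']
```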
- In step S18, the acquisition unit 21 registers, in the user log DB 52, the concentration level, the reaction presence/absence information, the arousal level, the new arousal level, the arousal level change amount, and the correct answer label, in association with the log generation date/time, the voice label, the familiar target, and the familiarity degree. After that, the process proceeds to step S19.
- In step S19, the learning unit 26 refers to the user log DB 52 and acquires the number of times a reaction was obtained. The learning unit 26 then determines whether or not this number is less than the threshold value.
- the threshold value is used to determine whether or not the information necessary for learning has been accumulated.
- the threshold is set to, for example, 2. In this case, when the number of reactions is 0 or 1, it is determined to be less than the threshold value.
- when the number of reactions is determined to be less than the threshold value, the processes of FIGS. 7A and 7B are terminated.
- when the number of reactions is determined to be equal to or greater than the threshold value, the process proceeds to step S20.
- In step S20, the learning unit 26 carries out binary classification learning and records the learning result in the model DB 53. After that, the processing of FIGS. 7A and 7B is completed.
- in step S20, the learning unit 26 acquires, for example, the correct answer labels recorded in the user log DB 52, together with the degree of familiarity, the degree of concentration, and the arousal degree change amount associated with each label. The learning unit 26 then generates a binary classification model of the voice labels in the three-dimensional space of "familiarity", "concentration ratio", and "arousal degree change amount".
- FIG. 8 is a diagram showing an image of a binary classification model using "familiarity", "concentration ratio", and "arousal degree change amount".
- a voice label whose familiarity places it in the space above the classification surface P is classified as a correct answer (○).
- a voice label whose familiarity places it in the space below the classification surface P is classified as an incorrect answer (×).
- various binary classification learning methods, such as logistic regression, SVM (Support Vector Machine), or a neural network, can be used to generate the model.
- the arousal degree change amount characterizes the user's reaction beyond the mere presence or absence of a reaction indicated by the correct answer label. The "arousal degree change amount" is therefore adopted as one axis of learning, since it is expected to further improve the accuracy of the correct answer label determination.
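As one concrete instance of the binary classification learning named above, a minimal logistic regression over the three axes can be sketched as follows. The toy samples and their scaling are invented for illustration and do not come from the patent:

```python
import math

def train_logistic(samples, labels, learning_rate=0.5, epochs=2000):
    """Fit sigmoid(w . x + b) by gradient descent on log-loss.
    Each sample is a (familiarity, concentration, arousal_change) triple."""
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # derivative of log-loss with respect to z
            w = [wi - learning_rate * grad * xi for wi, xi in zip(w, x)]
            b -= learning_rate * grad
    return w, b

def classify(w, b, x):
    """1 = correct-answer side of the surface, 0 = incorrect side."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0.0 else 0

# Toy data: high familiarity, concentration, and arousal gain -> correct.
X = [(0.9, 0.8, 0.4), (0.8, 0.9, 0.3), (0.2, 0.1, -0.1), (0.1, 0.3, 0.0)]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
print([classify(w, b, x) for x in X])  # → [1, 1, 0, 0]
```

The learned (w, b) plays the role of the classification surface P in FIG. 8: points with w . x + b above zero fall on the correct-answer side.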
- According to the embodiment, when it is determined that the user is not awake, a call is made to the user using a voice familiar to the user. Even when the user is drowsy, the cocktail party effect therefore makes the call voice easier for the user to notice, so the degree of arousal is expected to improve in a short time. Further, in the embodiment, the degree of familiarity and the degree of concentration are used in selecting a familiar voice, so the user can be presented with a call voice to which the user is more likely to respond.
- the voice labels are classified using a learning model with three axes: familiarity, concentration, and arousal change. As the learning progresses, voice label candidates more suitable for the user are therefore expected to be extracted. Further, according to the embodiment, the voice label used to generate the voice is selected from the extracted candidates by random sampling based on the number of past presentations. This suppresses the user's habituation and boredom caused by frequent presentation of call voices with the same voice label. As a result, even when the voice generation device 1 is used for a long period, the user can still be expected to react to the call voice, and the user's arousal level is consequently expected to increase.
- the call voices are simultaneously presented from a plurality of speakers arranged in the environment, and the user's reaction to each call voice is acquired. The correct answer label is then set according to this reaction, so teacher data can be obtained efficiently.
- the binary classification model employs the three axes of "familiarity", "concentration ratio", and "arousal degree change amount".
- more simply, a binary classification model using only "familiarity", or only "familiarity" and "concentration ratio", may be used.
- in the embodiment, the learning device is used as a learning device for a voice label classification model for a call voice that encourages the user to awaken.
- however, the learning device of the embodiment can also be used to learn various models for selecting a voice that the user can easily recognize.
- Each process according to the above-described embodiment can be stored as a program executable by a processor, that is, a computer.
- the program can be stored and distributed in a storage medium of an external storage device, such as a magnetic disk, an optical disk, or a semiconductor memory.
- the processor reads the program stored in the storage medium of the external storage device, and its operation is controlled by the read program, whereby the above-described processing can be executed.
- the present invention is not limited to the above embodiment, and can be variously modified at the implementation stage without departing from the gist thereof.
- each embodiment may be carried out in combination as appropriate, in which case the combined effect can be obtained.
- the above-described embodiment includes various inventions, which can be extracted by combinations selected from the plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, the configuration with those elements deleted can be extracted as an invention, provided that the problem can still be solved and the effect still obtained.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Traffic Control Systems (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
This learning device comprises a learning unit (26) that acquires training data for a learning model for selecting a voice to be presented to a user from among a plurality of candidate voices, based on the user's reaction to a plurality of voices presented simultaneously to the user.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2022531321A JP7416245B2 (ja) | 2020-06-24 | 2020-06-24 | 学習装置、学習方法及び学習プログラム |
PCT/JP2020/024823 WO2021260848A1 (fr) | 2020-06-24 | 2020-06-24 | Dispositif d'apprentissage, procédé d'apprentissage et programme d'apprentissage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2020/024823 WO2021260848A1 (fr) | 2020-06-24 | 2020-06-24 | Dispositif d'apprentissage, procédé d'apprentissage et programme d'apprentissage |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021260848A1 true WO2021260848A1 (fr) | 2021-12-30 |
Family
ID=79282108
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2020/024823 WO2021260848A1 (fr) | 2020-06-24 | 2020-06-24 | Dispositif d'apprentissage, procédé d'apprentissage et programme d'apprentissage |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP7416245B2 (fr) |
WO (1) | WO2021260848A1 (fr) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001304898A (ja) * | 2000-04-25 | 2001-10-31 | Sony Corp | 車載機器 |
JP2007271296A (ja) * | 2006-03-30 | 2007-10-18 | Yamaha Corp | アラーム装置、およびプログラム |
JP2013101248A (ja) * | 2011-11-09 | 2013-05-23 | Sony Corp | 音声制御装置、音声制御方法、およびプログラム |
JP2016191791A (ja) * | 2015-03-31 | 2016-11-10 | ソニー株式会社 | 情報処理装置、情報処理方法及びプログラム |
JP2020024293A (ja) * | 2018-08-07 | 2020-02-13 | トヨタ自動車株式会社 | 音声対話システム |
JP2020034835A (ja) * | 2018-08-31 | 2020-03-05 | 国立大学法人京都大学 | 音声対話システム、音声対話方法、プログラム、学習モデル生成装置及び学習モデル生成方法 |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10433075B2 (en) | 2017-09-12 | 2019-10-01 | Whisper.Ai, Inc. | Low latency audio enhancement |
- 2020
- 2020-06-24 JP JP2022531321A patent/JP7416245B2/ja active Active
- 2020-06-24 WO PCT/JP2020/024823 patent/WO2021260848A1/fr active Application Filing
Also Published As
Publication number | Publication date |
---|---|
JP7416245B2 (ja) | 2024-01-17 |
JPWO2021260848A1 (fr) | 2021-12-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10944708B2 (en) | Conversation agent | |
JP6263308B1 (ja) | 認知症診断装置、認知症診断方法、及び認知症診断プログラム | |
US11009952B2 (en) | Interface for electroencephalogram for computer control | |
CN106464758B (zh) | 利用用户信号来发起通信 | |
JP2021057057A (ja) | 精神障害の療法のためのモバイルおよびウェアラブルビデオ捕捉およびフィードバックプラットフォーム | |
CN109460752B (zh) | 一种情绪分析方法、装置、电子设备及存储介质 | |
KR20180137490A (ko) | 기억과 의사 결정력 향상을 위한 개인 감정-기반의 컴퓨터 판독 가능한 인지 메모리 및 인지 통찰 | |
CN110881987B (zh) | 一种基于可穿戴设备的老年人情绪监测系统 | |
US11751813B2 (en) | System, method and computer program product for detecting a mobile phone user's risky medical condition | |
JP6930277B2 (ja) | 提示装置、提示方法、通信制御装置、通信制御方法及び通信制御システム | |
JP6906197B2 (ja) | 情報処理方法、情報処理装置及び情報処理プログラム | |
CN113287175A (zh) | 互动式健康状态评估方法及其系统 | |
WO2019086856A1 (fr) | Systèmes et procédés permettant de combiner et d'analyser des états humains | |
JP2021146214A (ja) | ドライバー監視システムでメディア誘発感情から運転感情を分離するための技術 | |
KR102552220B1 (ko) | 정신건강 진단 및 치료를 적응적으로 수행하기 위한 컨텐츠 제공 방법, 시스템 및 컴퓨터 프로그램 | |
JP2018503187A (ja) | 被験者とのインタラクションのスケジューリング | |
WO2021260848A1 (fr) | Dispositif d'apprentissage, procédé d'apprentissage et programme d'apprentissage | |
WO2021260846A1 (fr) | Dispositif de génération vocale, procédé de génération vocale et programme de génération vocale | |
US20190141418A1 (en) | A system and method for generating one or more statements | |
CN108461125B (zh) | 一种针对老年人的记忆力训练装置 | |
US10079074B1 (en) | System for monitoring disease progression | |
WO2021260844A1 (fr) | Dispositif de génération vocale, procédé de génération vocale et programme de génération vocale | |
JP7534745B1 (ja) | 発作予測プログラム、記憶媒体、発作予測装置および発作予測方法 | |
US20240008766A1 (en) | System, method and computer program product for processing a mobile phone user's condition | |
WO2023199422A1 (fr) | Dispositif d'inférence d'état interne, procédé d'inférence d'état interne et support de stockage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20941543 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2022531321 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20941543 Country of ref document: EP Kind code of ref document: A1 |