WO2023012941A1 - Auditory attention state estimation device, learning device, methods therefor, and program - Google Patents

Auditory attention state estimation device, learning device, methods therefor, and program

Info

Publication number
WO2023012941A1
Authority
WO
WIPO (PCT)
Prior art keywords
learning
pupil diameter
sound source
sig
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2021/028988
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
雄太 鈴木
シンイ リャオ
茂人 古川
詠皓 楊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Inc
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2023539458A (JP7619465B2)
Priority to US18/293,976 (US20240341649A1)
Priority to PCT/JP2021/028988 (WO2023012941A1)
Publication of WO2023012941A1
Anticipated expiration
Legal status: Ceased

Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B5/16: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/168: Evaluating attention deficit, hyperactivity
    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B3/00: Apparatus for testing the eyes; Instruments for examining the eyes
    • A61B3/10: Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions
    • A61B3/11: Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions, for measuring interpupillary distance or diameter of pupils
    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B5/16: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state
    • A61B5/163: Devices for psychotechnics; Testing reaction times; Devices for evaluating the psychological state by tracking eye movement, gaze, or pupil change
    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B5/72: Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235: Details of waveform analysis
    • A61B5/7246: Details of waveform analysis using correlation, e.g. template matching or determination of similarity
    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00: Measuring for diagnostic purposes; Identification of persons
    • A61B5/72: Signal processing specially adapted for physiological signals or for diagnostic purposes
    • A61B5/7235: Details of waveform analysis
    • A61B5/7264: Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
    • A61B5/7267: Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems, involving training the classification device
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/60: Analysis of geometric attributes
    • G06T7/62: Analysis of geometric attributes of area, perimeter, diameter or volume
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30196: Human being; Person
    • G06T2207/30201: Face

Definitions

  • the present invention relates to technology for estimating auditory attentional states.
  • in Non-Patent Document 1, the constriction and dilation of the pupil (pupil oscillation, or PFT (Pupillary Frequency Tagging)) induced by switching a light source ON and OFF is used to estimate where the user's visual attention is directed.
  • the present invention provides a method for estimating where the user's auditory attention is directed, based on changes in pupil diameter.
  • a feature value based on the strength of correlation between each of a plurality of mutually different visual stimulus patterns corresponding to a plurality of mutually different sound sources and the amount of change in the pupil diameter of the user is obtained, and the feature value is used to estimate where the user has directed auditory attention among the sounds from the sound sources.
  • FIG. 1 is a block diagram illustrating the functional configuration of the auditory attention state estimation system of the embodiment.
  • FIG. 2 is a block diagram illustrating the functional configuration of the learning device of the embodiment.
  • FIG. 3 is a block diagram illustrating the functional configuration of the auditory attention state estimation device of the embodiment.
  • FIG. 4 is a conceptual diagram for explaining the content of the experiment.
  • FIG. 5 is a graph illustrating experimental results.
  • FIG. 6 is a block diagram illustrating the hardware configuration of the learning device and the auditory attention state estimation device of the embodiment.
  • the subject 100 performed a task of paying attention aurally to a sound emitted from one of the sound sources 140-i.
  • For example, the sound sources 140-1, .
  • a task was performed to count the number of words in a given category.
  • the following sounds are simultaneously output from four sound sources 140-1, . ,6) or the task of answering the color (white, pink, red, black) was performed.
  • subject 100 pays auditory attention to the sound emitted from sound source 140-1.
  • Sound source 140-1: "Then go to the white 3 for the bear", "Then go to the pink 8 for the tiger", "Then go to the red 9 for the cat", "Then go to the black 6 for the dog"
  • Sound source 140-2: "Let's go to rainbow 2 for deer", four times
  • Sound source 140-3: "Let's go to blue 9 for tiger", four times
  • Sound source 140-4: "Well then go to pink 7 for pigs, please", four times
  • the pupil diameter of the subject 100 was measured with a pupil diameter acquisition device 150 (an eye tracker in this experiment). The task and the pupil diameter measurement were performed multiple times for multiple subjects 100, and the pupil diameter was measured while each subject 100 paid auditory attention to the sound emitted from each sound source 140-i.
  • FIG. 5 illustrates a signal (frequency domain pupil diameter variation signal) obtained by transforming the time-series signal representing the pupil diameter variation of the subject 100 performing the above task into the frequency domain.
  • the horizontal axis of FIG. 5 represents the frequency (Hz), and the vertical axis represents the power [au] of the frequency domain pupil diameter variation signal.
  • the thick line represents the mean power and the thin line represents the power distribution.
  • when the subject 100 pays auditory attention to the sound emitted from a sound source 140-i, the power of the frequency domain pupil diameter variation signal, which is based on the pupil diameter variation of the subject 100, peaks at or near the flickering frequency of the light source 130-i provided in that sound source 140-i.
  • when the subject 100 pays auditory attention to the sound emitted from the sound source 140-1 provided with the light source 130-1 that blinks at a flickering frequency of 1.50 [Hz], the power peak p1 of the frequency domain pupil diameter variation signal based on the pupil diameter variation of the subject 100 appears at or near 1.50 [Hz].
  • when the subject 100 pays auditory attention to the sound emitted from the sound source 140-2 provided with the light source 130-2 that blinks at a flickering frequency of 1.75 [Hz], the power peak p2 of the frequency domain pupil diameter variation signal based on the pupil diameter variation of the subject 100 appears at or near 1.75 [Hz].
  • when the subject 100 pays auditory attention to the sound emitted from the sound source 140-3 provided with the light source 130-3 that blinks at a flickering frequency of 2.00 [Hz], the power peak p3 of the frequency domain pupil diameter variation signal based on the pupil diameter variation of the subject 100 appears at or near 2.00 [Hz].
  • when the subject 100 pays auditory attention to the sound emitted from the sound source 140-4 provided with the light source 130-4 that blinks at a flickering frequency of 2.25 [Hz], the power peak p4 of the frequency domain pupil diameter variation signal based on the pupil diameter variation of the subject 100 appears at or near 2.25 [Hz]. That is, there is a correlation between the pupil diameter variation when the subject 100 pays auditory attention to the sound emitted from the sound source 140-i and the visual stimulus pattern emitted from the light source 130-i corresponding to that sound source 140-i. Here, the subject 100 is not instructed to gaze at the light source 130-i during execution of the task. Therefore, this correlation is expected to hold even if the subject 100 does not gaze at the light source 130-i.
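The experimental observation above (the power of the pupil diameter variation peaks at the flickering frequency of the light source attached to the attended sound source) can be illustrated with a minimal Python sketch. It is not the procedure of the patent. The sampling rate, the 8-second window, the synthetic pupil signal, and the function names `power_at` and `attended_index` are all assumptions made for illustration; only the four flickering frequencies come from the experiment described above.

```python
import cmath
import math


def power_at(signal, fs, freq):
    """Power of `signal` (sampled at `fs` Hz) at one frequency `freq` in Hz."""
    n = len(signal)
    mean = sum(signal) / n
    # Single-frequency discrete Fourier sum of the mean-removed signal.
    acc = sum((x - mean) * cmath.exp(-2j * math.pi * freq * k / fs)
              for k, x in enumerate(signal))
    return abs(acc / n) ** 2


def attended_index(pupil_change, fs, tag_freqs):
    """Index of the flickering frequency with the largest power, plus all powers."""
    powers = [power_at(pupil_change, fs, f) for f in tag_freqs]
    best = max(range(len(powers)), key=powers.__getitem__)
    return best, powers


fs = 60.0                               # assumed eye-tracker sampling rate [Hz]
tag_freqs = [1.50, 1.75, 2.00, 2.25]    # flickering frequencies of light sources 130-1..130-4
t = [k / fs for k in range(480)]        # assumed 8 s window (whole cycles of every frequency)

# Synthetic pupil diameter variation: dominated by 1.75 Hz, as if sound source 140-2 were attended.
pupil = [0.30 * math.sin(2 * math.pi * 1.75 * x)
         + 0.05 * math.sin(2 * math.pi * 1.50 * x)
         + 0.05 * math.sin(2 * math.pi * 2.25 * x) for x in t]

best, powers = attended_index(pupil, fs, tag_freqs)
print("largest power at", tag_freqs[best], "Hz")
```

With real eye-tracker data the peak may fall slightly off the nominal frequency ("at or near" in the text), so a practical version would probably search a small band around each flickering frequency rather than a single value.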
  • the visual stimulus pattern corresponding to the sound source is a time-varying visual stimulus pattern.
  • it may be a periodically time-varying visual stimulus pattern, or an aperiodically time-varying visual stimulus pattern.
  • the visual stimulus pattern may be a time-varying luminance (brightness) pattern, a time-varying color pattern, a time-varying pattern of designs, or a time-varying pattern of shapes.
  • the visual stimulus pattern corresponding to the sound source is not limited to one created by a light source. It may be displayed or projected by a display or projector, or it may be anything created by the sound source itself or by the environment around the sound source (e.g., a luminance change).
  • the auditory attention state estimation system 1 of this embodiment includes a learning device 11, an auditory attention state estimation device 12, a plurality of visual stimulus generators 13-1, . . . , 13-N, a plurality of sound source devices 14-1, . . . , 14-N, and a pupil diameter acquisition device 15, and estimates, based on the amount of change in the pupil diameter of the user 10, the destination to which the user 10 pays auditory attention.
  • N is an integer of 1 or more, for example, N is an integer of 2 or more.
  • the learning device 11 of this embodiment has an input unit 111 , a storage unit 112 , a learning unit 113 , an output unit 114 and a control unit 117 .
  • the learning device 11 executes each process under the control of the control unit 117. The input data to the learning device 11 and the data obtained by each process are stored in the storage unit 112, and are read and used as needed.
  • the auditory attention state estimation device 12 of the present embodiment includes an input unit 121, a storage unit 122, a visual stimulus control unit 123, an auditory information control unit 124, a feature amount extraction unit 125, an estimation unit 126, and a control unit 127 .
  • the auditory attention state estimation device 12 executes each process under the control of the control unit 127. The input data to the auditory attention state estimation device 12 and the data obtained by each process are stored in the storage unit 122, and are read and used as needed.
  • An example of the sound source device 14-n is a speaker or the like, but this does not limit the present invention. Any device can be used as the sound source device 14-n as long as it can arrange a sound source at a desired spatial position.
  • the sound source devices 14-1, . . . , 14-N are different from each other, and the sound source devices 14-1, . For example, the directions of the sound source devices 14-1, .
  • the sound source devices 14-1, . . . , 14-N may emit sounds SO(Info-1), .
  • Some sound source devices may emit sound SO(Info-n) at different timings than other sound source devices. However, it is preferable that sounds emitted simultaneously from different sound source devices are different from each other.
  • the visual stimulus generator 13-n may be placed near the sound source device 14-n, may be placed in contact with the sound source device 14-n, or may be fixed to the sound source device 14-n. Alternatively, it may be configured integrally with the sound source device 14-n.
  • Each of the visual stimulation patterns VS(Sig-n) of the present embodiment is a visual stimulation pattern that periodically changes over time.
  • the visual stimulation pattern VS(Sig-n) of the embodiment may be a luminance (brightness) pattern that periodically changes over time, a pattern that periodically blinks (ON/OFF), a color pattern that periodically changes over time, a design pattern that periodically changes over time, or a shape pattern that periodically changes over time. That is, the visual stimulus generator 13-n may present luminance (brightness) that periodically changes over time, light that periodically blinks, a color that periodically changes over time, a design that periodically changes over time, or a shape that periodically changes over time. Any visual stimulus generator 13-n may be used as long as it can visually present such a visual stimulus pattern VS(Sig-n).
  • the visual stimulus generator 13-n may be an LED light source, a laser light generator, a display, or a projector.
  • when the time-series signals representing the visual stimulus patterns VS(Sig-1), . time-series signal, color time-series signal, pattern time-series signal, shape time-series signal, etc.) are transformed into the frequency domain (e.g., by Fourier transform), their peak frequencies are different from each other.
  • the pupil diameter acquisition device 15 of this embodiment is a device that measures the pupil diameter Pub of the user 10 .
  • the pupil diameter acquisition device 15 includes a camera that captures the movement of the eye of the user 10 and a device that acquires and outputs the pupil diameter Pub of the user 10 from the image captured by the camera.
  • An example of the pupil diameter acquisition device 15 is a commercially available eye tracker or the like.
  • learning data T is input to the learning device 11. The learning device 11 obtains an estimation model M(θ) through learning processing using the learning data T, and outputs the model parameter θ that specifies the estimation model M(θ).
  • the model parameter θ is input to the auditory attention state estimation device 12.
  • the auditory attention state estimation device 12 outputs the output information Info-1, . . . , Info-N to the sound source devices 14-1, . Output sounds SO(Info-1), ..., SO(Info-N) based on the output information Info-1, ..., Info-N, respectively.
  • the auditory attention state estimation device 12 outputs the output information Sig-1, . . .
  • Sig-N to the visual stimulus generators 13-1, . , 13-N present visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) based on the output information Sig-1, .
  • the user 10 pays auditory attention to the sound SO(Info-n) emitted from at least one of the sound source devices 14-n, and the pupil diameter acquisition device 15 acquires the pupil diameter Pub of the user 10 and sends it to the auditory attention state estimation device 12.
  • the auditory attention state estimating device 12 detects each of the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N) presented based on the output information Sig-1, .
  • M(θ) machine learning model
  • Learning data T is input to the input unit 111 of the learning device 11 (FIG. 2), and the learning data T is stored in the storage unit 112 .
  • the learning data T consists of data (Tf 1 , Ta 1 ), ..., (Tf J , Ta J ) in which a learning feature value Tf j is associated with correct answer information Ta j . The learning feature value Tf j is based on the strength of correlation between each of N (plural) mutually different learning visual stimulus patterns TVS(Sig-1), ..., TVS(Sig-N), corresponding to N mutually different learning sound sources, and the learning pupil diameter change amount TVPub. The correct answer information Ta j represents the destination to which auditory attention is directed among the sounds from the learning sound sources (learning sounds).
  • the learning sound can be any sound such as voice, music, environmental sound, ringing sound, alarm sound, etc.
  • a specific example of the learning sound is the same as the sound emitted from the sound source device 14-n described above.
  • a specific example of the N learning sound sources is the same as the sound source devices 14-1, . . . , 14-N described above.
  • the N learning sound sources may be N devices that emit learning sounds, and may be devices different from the sound source devices 14-1, . . . , 14-N.
  • like the visual stimulus patterns VS(Sig-1), . . . , VS(Sig-N), the N learning visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N) are visual stimulus patterns that periodically change over time.
  • the N learning visual stimulus patterns TVS(Sig-1), . are the same, and specific examples of the learning visual stimulation patterns TVS(Sig-1), ..., TVS(Sig-N) are the visual stimulation patterns VS(Sig-1), ..., VS(Sig-N ), respectively.
  • the learning pupil diameter change amount TVPub is the amount of pupil diameter change of a learning user who is presented with the learning sounds and the learning visual stimulus patterns TVS(Sig-1), . . . , TVS(Sig-N), observed while the learning user pays auditory attention to one of the learning sounds TSO n (where n ∈ {1, ..., N}).
  • the time when the learning user pays auditory attention to any learning sound TSO n is, for example, the time when the learning user is performing the above-described assignment (task).
  • the learning pupil diameter change amount TVPub is obtained, for example, based on the learning user's pupil diameter TPub measured by the pupil diameter acquisition device 15, but it may instead be obtained based on the learning user's pupil diameter TPub measured by another device. A method of obtaining the learning pupil diameter change amount TVPub from the pupil diameter TPub is exemplified below.
  • preprocessing is performed on the time-series data of pupil diameter TPub to obtain time-series data of pupil diameter TPub' after preprocessing.
  • in the preprocessing, for example, linear interpolation, quadratic spline interpolation, or the like can be used to interpolate portions of the time-series data of the pupil diameter TPub that are missing due to blinking or the like of the learning user.
  • a low-pass filter (for example, a low-pass filter that passes a band including all blinking frequencies of the learning visual stimulus patterns TVS(Sig-1), ..., TVS(Sig-N)) may also be applied to the interpolated time-series data of the pupil diameter TPub to reduce noise.
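The preprocessing described above (filling gaps caused by blinks, then low-pass filtering) might look roughly like the following Python sketch. It is an illustration under stated assumptions: missing samples are represented by `None`, linear interpolation is used (the text also allows quadratic spline interpolation), and a centered moving average stands in for the low-pass filter. The function names and the sample values are hypothetical.

```python
def interpolate_missing(samples):
    """Linearly interpolate None entries (e.g. blinks); gaps at the edges are held constant."""
    known = [i for i, v in enumerate(samples) if v is not None]
    if not known:
        raise ValueError("no valid samples to interpolate from")
    out = list(samples)
    for i in range(len(out)):
        if out[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = samples[right]
        elif right is None:
            out[i] = samples[left]
        else:
            w = (i - left) / (right - left)
            out[i] = samples[left] * (1 - w) + samples[right] * w
    return out


def moving_average(samples, width):
    """Centered moving average used here as a crude low-pass filter (`width` odd)."""
    half = width // 2
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


# Hypothetical pupil diameter samples; None marks samples lost to a blink.
raw = [4.0, 4.2, None, None, 4.8, 5.0, None, 5.0]
filled = interpolate_missing(raw)
smooth = moving_average(filled, 3)
print(filled)
print(smooth)
```

Whatever filter is used, its pass band has to include every blinking frequency of the stimulus patterns, as the text requires; for a moving average this means keeping the window much shorter than the shortest blinking period.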
  • the learning feature value Tf j may be of any kind as long as it is based on the strength of the correlation between each of the N learning visual stimulus patterns TVS(Sig-1), ..., TVS(Sig-N) and the learning pupil diameter change amount TVPub.
  • the learning feature amount Tf j is exemplified below.
  • a learning feature Tf j representing the magnitude (for example, the peak value of the magnitude of TFVPub) may be used.
  • each of the plurality of learning frequency-domain visual stimulus signals TFVS(Sig-1), ..., TFVS(Sig-N) is a signal obtained by transforming (for example, by Fourier transform) the time-series signal representing each of the plurality of learning visual stimulus patterns TVS(Sig-1), ..., TVS(Sig-N) into the frequency domain.
  • the learning frequency domain pupil diameter change amount signal TFVPub is a signal obtained by converting the time-series signal representing the learning pupil diameter change amount TVPub into the frequency domain.
  • the "magnitude of α" may be the absolute value of the amplitude of α, the power of α (the square of the amplitude of α), or a value that increases monotonically with the absolute value of α.
  • a learning feature Tf j whose elements include powers (peak values of powers) at or near peaks p1, . . . , p4 may be used.
  • there is a strong correlation between the stimulus pattern TVS(Sig-n) and the learning pupil diameter variation TVPub.
  • the learning frequency domain pupil diameter change amount signal TFVPub at and/or in the vicinity of the peak frequency described above, other information is included in the learning feature Tf j .
  • the size of TFVPub e.g., the peak value of the size
  • (2) phase changes of N training visual stimulus patterns TVS(Sig-1), ..., TVS(Sig-N) and training pupil diameter The sequence TSS(Sig -1), ..., TSS(Sig-N) and each maximum value CCFmax(TSS(Sig-1), TPS) , .
  • the "sequence corresponding to the time-series signal" may be, for example, the time-series signal itself, a sequence obtained by transforming the time-series signal into the frequency domain, or a sequence of function values of the time-series signal.
  • the "maximum value of the cross-correlation function between the sequence α1 and the sequence α2" means the maximum of the cross-correlation function values between the sequence α1 and the sequence α2 taken over the variable delay amount τ.
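As a rough illustration of a feature built from the maximum of a cross-correlation function over the delay amount, the following Python sketch computes a normalized cross-correlation for each non-negative delay and keeps the largest value. The normalization (Pearson correlation over the overlapping samples), the restriction to non-negative delays, the sampling rate, and the synthetic sequences are assumptions for illustration; the text does not fix these details.

```python
import math


def normalized_xcorr(a, b, lag):
    """Pearson correlation between a[k] and b[k + lag] over the overlapping samples (lag >= 0)."""
    x, y = a[:len(a) - lag], b[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def ccf_max(stim, pupil, max_lag):
    """Maximum cross-correlation value over delays 0..max_lag, and the delay achieving it."""
    values = [normalized_xcorr(stim, pupil, lag) for lag in range(max_lag + 1)]
    best = max(range(len(values)), key=values.__getitem__)
    return values[best], best


fs = 20.0     # assumed sampling rate [Hz]
n = 200       # assumed 10 s of data


def tone(freq, shift):
    """Sinusoid at `freq` Hz delayed by `shift` samples (synthetic stand-in for a sequence)."""
    return [math.sin(2 * math.pi * freq * (k - shift) / fs) for k in range(n)]


stim_a = tone(1.5, 0)    # sequence for a 1.5 Hz stimulus pattern
stim_b = tone(2.0, 0)    # sequence for a 2.0 Hz stimulus pattern
pupil = tone(1.5, 3)     # synthetic pupil sequence following stim_a with a 3-sample delay

ccf_a, lag_a = ccf_max(stim_a, pupil, max_lag=10)
ccf_b, lag_b = ccf_max(stim_b, pupil, max_lag=10)
print(ccf_a, lag_a, ccf_b)
```

In this synthetic case the maximum for `stim_a` is reached at the 3-sample delay and is much larger than the maximum for `stim_b`, which is the kind of contrast such a feature is meant to expose.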
  • there is a strong correlation between the visual stimulus pattern TVS(Sig-n) and the learning pupil diameter variation TVPub.
  • the learning visual stimulation pattern TVS(Sig-n) having a phase change with a higher degree of synchronization with the learning pupil diameter change amount TVPub has a stronger correlation with the learning pupil diameter change amount TVPub.
  • the larger the maximum value CCFmax(TSS(Sig-n), TPS) of the cross-correlation function is, the stronger the correlation with the training pupillary diameter change amount TVPub is for the learning visual stimulation pattern TVS(Sig-n).
  • the learning pupil diameter change amount TVPub is the pupil diameter change amount of the learning user when the learning user aurally pays attention to the learning sound TSO n .
  • Ta j is information representing the learning sound TSO n . That is, the correct answer information Ta j represents the learning sound TSO n corresponding to the learning pupil diameter change amount TVPub.
  • the correct answer information Ta j may be information (e.g., an index) representing a sound source (e.g., a device that emits the learning sound TSO n ), information representing the learning visual stimulus pattern TVS(Sig-n) corresponding to the learning sound TSO n (e.g., its blinking frequency), or information representing the learning sound TSO n itself.
  • the correct information Ta j is associated with the learning feature Tf j (step S111).
  • the learning unit 113 obtains an estimation model M(θ) by learning processing (machine learning) using the learning data T read from the storage unit 112, and outputs a model parameter θ specifying the estimation model M(θ).
  • the estimation model M(θ) is a model that takes as input a feature value f, based on the strength of correlation between each of N (plural) mutually different visual stimulus patterns VS(Sig-1), ..., VS(Sig-N) corresponding to N (plural) mutually different sound sources and the pupil diameter change amount VPub of the user, and estimates the destination to which the user directs auditory attention among the sounds from the sound sources.
  • the configuration of the feature value f is the same as that of the learning feature value Tf j described above, except that the learning sounds are replaced with sounds, the learning sound sources with sound sources, the learning visual stimulus patterns TVS(Sig-1), ..., TVS(Sig-N) with the visual stimulus patterns VS(Sig-1), ..., VS(Sig-N), and the learning pupil diameter change amount TVPub with the pupil diameter change amount VPub.
  • the estimation model M(θ) is configured to estimate the sound emitted from the sound source corresponding to the visual stimulus pattern VS(Sig-n) that has a high correlation with the pupil diameter change amount VPub as the destination to which the user pays auditory attention.
  • the "destination to which the user pays auditory attention" estimated using such an estimation model M(θ) is, for example, at least one of the following (3.1), (3.2), and (3.3).
  • suppose the N (plural) sound sources of interest include a sound source AS 1 (first sound source) and a sound source AS 2 (second sound source), and that, among the visual stimulus patterns VS(Sig-1), ..., VS(Sig-N), the visual stimulus pattern VS(Sig-n 1 ) (first visual stimulus pattern) corresponds to the sound source AS 1 and the visual stimulus pattern VS(Sig-n 2 ) (second visual stimulus pattern) corresponds to the sound source AS 2 . If the correlation between the visual stimulus pattern VS(Sig-n 1 ) and the pupil diameter change amount VPub is stronger than the correlation between the visual stimulus pattern VS(Sig-n 2 ) and the pupil diameter change amount VPub, it is estimated that the user's auditory attention is directed to the sound source AS 1 or the vicinity of the sound source AS 1 .
  • the "destination to which the user pays auditory attention" estimated using the estimation model M(θ) may be information (e.g., an index) representing a sound source (e.g., a device that emits sound), information representing the visual stimulus pattern VS(Sig-n) corresponding to the sound SO n (e.g., its blinking frequency), or information representing the sound SO n . In addition, the estimation model M(θ) may estimate one "destination to which the user pays auditory attention", may estimate a plurality of such destinations, or may estimate the probability of a "destination to which the user pays auditory attention".
  • the estimation model M(θ) may be based on any method.
  • an estimation model M(θ) based on K-NN (k-nearest neighbor algorithm), SVM (support vector machine), deep learning, a hidden Markov model, etc. can be exemplified.
  • as a specific learning processing method, a known method corresponding to the method of the estimation model M(θ) may be used.
  • for example, an appropriate initial value of the provisional model parameter θ' is first set. Then the process of updating the provisional model parameter θ' so as to reduce the error between the result obtained by applying the learning feature value Tf j to the estimation model M(θ') and the correct answer information Ta j is repeated, and the provisional model parameter θ' at the time when a predetermined termination condition is satisfied is set as the model parameter θ.
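The iterative procedure described above (initialize provisional parameters, update them to reduce the error against the correct answer information, stop on a termination condition) can be sketched in Python as follows. The model here is a softmax (multinomial logistic) regression trained by batch gradient descent, chosen only because it is short; the text names K-NN, SVM, deep learning, and hidden Markov models as examples and does not prescribe this one. The feature vectors (hypothetical normalized powers at four blinking frequencies), the labels, the learning rate, and the stopping threshold are all illustrative assumptions.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def predict_proba(theta, f):
    """Probability of each destination; theta[c] holds the weights and a trailing bias for class c."""
    return softmax([sum(w * x for w, x in zip(row[:-1], f)) + row[-1] for row in theta])


def classify(theta, f):
    p = predict_proba(theta, f)
    return max(range(len(p)), key=p.__getitem__)


def train(features, labels, n_classes, lr=0.5, max_iter=2000, tol=1e-3):
    dim = len(features[0])
    theta = [[0.0] * (dim + 1) for _ in range(n_classes)]   # initial provisional parameters
    for _ in range(max_iter):
        grad = [[0.0] * (dim + 1) for _ in range(n_classes)]
        loss = 0.0
        for f, a in zip(features, labels):
            p = predict_proba(theta, f)
            loss -= math.log(p[a])                  # cross-entropy error against the label
            for c in range(n_classes):
                err = p[c] - (1.0 if c == a else 0.0)
                for d in range(dim):
                    grad[c][d] += err * f[d]
                grad[c][dim] += err
        for c in range(n_classes):                  # update step that reduces the error
            for d in range(dim + 1):
                theta[c][d] -= lr * grad[c][d] / len(features)
        if loss / len(features) < tol:              # assumed termination condition
            break
    return theta


# Hypothetical learning data: (Tf_j, Ta_j) pairs, two per attended sound source.
features = [[0.9, 0.1, 0.1, 0.1], [0.8, 0.2, 0.1, 0.0],
            [0.1, 0.9, 0.1, 0.1], [0.2, 0.8, 0.0, 0.1],
            [0.1, 0.1, 0.9, 0.1], [0.0, 0.1, 0.8, 0.2],
            [0.1, 0.1, 0.1, 0.9], [0.1, 0.0, 0.2, 0.8]]
labels = [0, 0, 1, 1, 2, 2, 3, 3]

theta = train(features, labels, n_classes=4)
print(classify(theta, [0.1, 0.1, 0.85, 0.1]))
```

Because `predict_proba` returns one probability per sound source, the same sketch also covers the case where the model outputs the probability of each destination instead of a single index.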
  • the model parameter θ output from the learning unit 113 is sent to the output unit 114, and the output unit 114 sends the model parameter θ to the auditory attention state estimation device 12 (step S113).
  • the model parameter θ sent from the learning device 11 is input to the input unit 121 of the auditory attention state estimation device 12 (FIG. 3) and stored in the storage unit 122.
  • the storage unit 122 stores output information Sig-1, . . . , Sig-N representing visual stimulation patterns VS(Sig-1), . 1), the output information Info-1, . . . , Info-N representing SO(Info-N) are stored (step S122).
  • the auditory information control unit 124 reads the output information Info-1, .
  • Each sound source device 14-n presents (outputs) sound SO(Info-n) based on the sent output information Info-n (step S124).
  • the visual stimulus control unit 123 reads the output information Sig-1, .
  • the user 10 presented with the sounds SO(Info-1), ..., SO(Info-N) and the visual stimulus patterns VS(Sig-1), ..., VS(Sig-N) pays auditory attention to any one of the sounds SO(Info-n). For example, the user 10 pays auditory attention to any one sound SO(Info-n) by performing the above-described task.
  • the pupil diameter acquisition device 15 measures the pupil diameter Pub of the user 10 and sends time-series data of the pupil diameter Pub to the feature amount extraction unit 125 .
  • the feature amount extraction unit 125 uses the time-series data of the pupil diameter Pub and the output information Sig-1, ..., Sig-N to obtain and output a feature value f based on the strength of correlation between each of the visual stimulus patterns VS(Sig-1), ..., VS(Sig-N) and the pupil diameter change amount VPub of the user 10.
  • the configuration of the feature value f is the same as that of the learning feature value Tf j described above, except that the learning sounds are replaced with sounds, the learning sound sources with sound sources, the learning visual stimulus patterns TVS(Sig-1), ..., TVS(Sig-N) with the visual stimulus patterns VS(Sig-1), ..., VS(Sig-N), and the learning pupil diameter change amount TVPub with the pupil diameter change amount VPub.
  • the feature quantity extraction unit 125 obtains the feature quantity f as follows.
  • the feature amount extraction unit 125 preprocesses the time-series data of the pupil diameter Pub, and obtains the time-series data of the pupil diameter Pub' after preprocessing.
  • in the preprocessing, for example, linear interpolation, quadratic spline interpolation, or the like can be used to interpolate portions of the time-series data of the pupil diameter Pub that are missing due to blinking or the like of the user 10.
  • a low-pass filter may also be applied to the interpolated time-series data of the pupil diameter Pub to reduce noise.
  • from the time-series data of the pupil diameter Pub' obtained while the user 10 pays auditory attention to one of the sounds, the feature amount extraction unit 125 subtracts an average value of the pupil diameter Pub' and standardizes the result as a z-value to obtain the time-series data of the pupil diameter change amount VPub.
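The standardization step above (subtract an average, then express the result as a z-value) can be written as a short Python sketch. Which interval the average and standard deviation are taken over is not recoverable from the text; this sketch assumes they are computed over the whole analysis window, and the function name and sample values are hypothetical.

```python
import math


def z_standardize(pupil):
    """Subtract the mean and divide by the standard deviation of the window (z-value)."""
    n = len(pupil)
    mean = sum(pupil) / n
    var = sum((p - mean) ** 2 for p in pupil) / n
    if var == 0:
        raise ValueError("a constant signal cannot be standardized")
    std = math.sqrt(var)
    return [(p - mean) / std for p in pupil]


# Hypothetical preprocessed pupil diameter samples Pub' (arbitrary units).
pub_prime = [4.0, 4.5, 5.0, 4.5, 4.0, 3.5, 3.0, 3.5]
vpub = z_standardize(pub_prime)
print(vpub)
```

Standardizing in this way puts recordings from different users and sessions on a common scale before the correlation-based feature is computed.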
  • the feature amount extraction unit 125 obtains the feature amount f based on the time-series data of the pupil diameter change amount VPub and the N visual stimulus patterns VS(Sig-1), ..., VS(Sig-N). .
  • the feature quantity f is based on the strength of the correlation between each of the N visual stimulation patterns VS(Sig-1), . . . , VS(Sig-N) and the pupil diameter change amount VPub.
  • the feature quantity f is exemplified below.
  • each of the plurality of frequency-domain visual stimulus signals FVS(Sig-1), ..., FVS(Sig-N) corresponds to a plurality of visual stimulation patterns VS(Sig-1), ..., VS(Sig-N). It is a signal obtained by transforming the time-series signal representing each into the frequency domain. Further, the frequency domain pupil diameter change amount signal FVPub is a signal obtained by converting the time-series signal representing the pupil diameter change amount VPub into the frequency domain.
  • FVPub for example, the peak value of the magnitude of FVPub
  • the visual stimulus pattern VS(Sig-n) having a phase change with a higher degree of synchronization with the pupil diameter change amount VPub has a stronger correlation with the pupil diameter change amount VPub.
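One way to read the frequency-domain feature described above is sketched below. It assumes, purely for illustration, that each visual stimulus pattern VS(Sig-n) changes periodically at its own known frequency, and it takes the magnitude of the pupil signal's spectrum at that frequency as the correlation strength for source n. The function names, sampling rate, and frequencies are hypothetical and do not come from the source.

```python
import cmath
import math


def dft_at(signal, freq_hz, fs_hz):
    """Single-frequency DFT of `signal`, normalised by its length."""
    n = len(signal)
    return sum(
        x * cmath.exp(-2j * math.pi * freq_hz * k / fs_hz)
        for k, x in enumerate(signal)
    ) / n


def spectral_features(vpub, stim_freqs_hz, fs_hz):
    """Magnitude of the pupil spectrum at each stimulus frequency."""
    return [abs(dft_at(vpub, f, fs_hz)) for f in stim_freqs_hz]


fs = 60.0                       # assumed eye-tracker sampling rate [Hz]
stim_freqs = [1.0, 1.5, 2.0]    # assumed frequencies of VS(Sig-1..3) [Hz]
t = [k / fs for k in range(240)]  # 4 s of data

# Synthetic VPub: strongly follows the 1.5 Hz pattern, weakly the 2.0 Hz one.
vpub = [
    math.sin(2 * math.pi * 1.5 * tk) + 0.2 * math.sin(2 * math.pi * 2.0 * tk)
    for tk in t
]

f = spectral_features(vpub, stim_freqs, fs)
attended = max(range(len(f)), key=f.__getitem__)  # index of strongest response
```

The phase of `dft_at(...)` (via `cmath.phase`) could be compared with the phase of the corresponding stimulus signal when a phase-synchronization measure is wanted instead of a magnitude.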
  • the feature quantity f is sent to the estimation unit 126 (step S125).
  • the estimation unit 126 reads the model parameter ⁇ from the storage unit 122 .
  • the estimation result E may be information (eg, index) representing the sound source (eg, sound source device 14-n), or the visual stimulation pattern VS(Sig-n) corresponding to the sound SO n .
  • The estimation result E may represent one "destination to which the user has directed his auditory attention", or may represent a plurality of "destinations to which the user has directed his auditory attention". Alternatively, it may represent the probability of "where the user pays his attention aurally" (step S126).
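The three forms of the estimation result E listed above (a single destination, several destinations, or probabilities) can be illustrated with a small sketch. The per-source scores are assumed to come from some estimation model; the softmax conversion and the `estimation_result` helper are illustrative assumptions, not the method given in the source.

```python
import math


def softmax(scores):
    """Convert arbitrary scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def estimation_result(scores, top_k=1):
    """Return E as a best index, the top-k indices, and per-source probabilities."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda n: probs[n], reverse=True)
    return {"index": ranked[0], "top": ranked[:top_k], "probabilities": probs}


# Hypothetical scores for N = 3 sound sources.
E = estimation_result([0.1, 2.0, 0.7], top_k=2)
```

Here `E["index"]` identifies one sound source, `E["top"]` lists several candidates, and `E["probabilities"]` gives a probability for each source.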
  • the visual stimulus pattern VS(Sig-n) of the first embodiment was a visual stimulus pattern that periodically changes over time.
  • the visual stimulus pattern VS(Sig-n) may be a non-periodically time-varying stimulus pattern.
  • differences from the first embodiment will be mainly described, and the same reference numerals will be used for items that are common to the first embodiment, and the description will be omitted or simplified.
  • The auditory attention state estimation system 2 of this embodiment includes a learning device 21, an auditory attention state estimation device 22, a plurality of visual stimulus generators 13-1, ..., 13-N, a plurality of sound source devices 14-1, ..., 14-N, and a pupil diameter acquisition device 15, and estimates the destination to which the user 10 pays auditory attention based on the amount of change in the pupil diameter of the user 10.
  • the learning device 21 of this embodiment has an input unit 111 , a storage unit 112 , a learning unit 213 , an output unit 114 and a control unit 117 .
  • the auditory attention state estimation device 22 of the present embodiment includes an input unit 121, a storage unit 122, a visual stimulus control unit 223, an auditory information control unit 124, a feature amount extraction unit 225, an estimation unit 226, and a control unit 127 .
  • the learning pupil diameter change amount TVPub for example, a sequence TSS ( Sig-1), ..., TSS(Sig-N) and each maximum value CCFmax(TSS(Sig-1), TPS), . . . , CCFmax(TSS(Sig-N), TPS), etc.
  • the learning feature amount Tf j may include the degree of synchronization of the phase change (step S211).
  • the learning unit 213 obtains the estimated model M( ⁇ ) through learning processing (machine learning) using the learning data T read from the storage unit 112, and outputs model parameters ⁇ that specify the estimated model M( ⁇ ).
  • the learning data T is the only difference from the first embodiment in the processing of the learning unit 213 .
  • the output unit 114 sends the model parameter ⁇ to the auditory attention state estimation device 22 (step S213).
  • the model parameter ⁇ sent from the learning device 21 is input to the input section 121 of the auditory attention state estimation device 22 (FIG. 3) and stored in the storage section 122 .
  • The storage unit 122 stores output information Sig-1, ..., Sig-N representing the visual stimulation patterns VS(Sig-1), ..., VS(Sig-N), and output information Info-1, ..., Info-N representing the sounds SO(Info-1), ..., SO(Info-N).
  • The processing of the auditory information control unit 124 is the same as in the first embodiment (step S124).
  • the visual stimulus control unit 223 reads the output information Sig-1, .
  • the pupil diameter acquisition device 15 measures the pupil diameter Pub of the user 10 and sends time-series data of the pupil diameter Pub to the feature amount extraction unit 225 .
  • The feature quantity extraction unit 225 uses the time-series data of the pupil diameter Pub and the output information Sig-1, ..., Sig-N read from the storage unit 122 to obtain and output a feature quantity f based on the strength of the correlation between each of the visual stimulus patterns VS(Sig-1), ..., VS(Sig-N) and the pupil diameter change amount VPub of the user 10.
  • The feature quantity f has the same configuration as the learning feature quantity Tf j described above, except that the learning sounds are replaced with the sounds, the learning sound sources with the sound sources, the learning visual stimulus patterns TVS(Sig-1), ..., TVS(Sig-N) with the visual stimulus patterns VS(Sig-1), ..., VS(Sig-N), and the learning pupil diameter change amount TVPub with the pupil diameter change amount VPub.
  • the difference from the first embodiment is that the visual stimulation patterns VS(Sig-1), .
  • the feature quantity extraction unit 225 executes the processes (4.1) and (4.2) described in the first embodiment, and in (4.3) N visual stimulus patterns VS(Sig-1), ..., VS( Sig-N) and the degree of synchronization between the phase change of the pupil diameter variation VPub and time-series signals representing each of the N visual stimulus patterns VS(Sig-1), ..., VS(Sig-N) , SS(Sig-N) corresponding to , and the sequence PS corresponding to the time-series signal representing the amount of change in pupil diameter VPub. 1), PS), ..., CCFmax(SS(Sig-N), PS), etc.
  • the feature quantity f may also include the degree and the like.
  • the feature quantity f is sent to the estimation unit 226 (step S225).
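For non-periodic stimulus patterns, the CCFmax values mentioned above can be illustrated as follows. The sketch assumes that CCFmax denotes the maximum of a normalised cross-correlation function between a stimulus sequence SS(Sig-n) and the pupil sequence PS, searched over a limited range of lags in which the pupil response follows the stimulus. That reading, the lag range, and the synthetic data are assumptions made for illustration only.

```python
import random
from statistics import mean, pstdev


def normalize(seq):
    """Zero-mean, unit-variance copy of `seq` (all zeros if it is constant)."""
    mu, sd = mean(seq), pstdev(seq)
    return [(x - mu) / sd for x in seq] if sd else [0.0] * len(seq)


def ccf_max(stim, pupil, max_lag):
    """Maximum normalised cross-correlation for lags 0..max_lag.

    A positive lag means the pupil sequence trails the stimulus sequence.
    """
    s, p = normalize(stim), normalize(pupil)
    n = len(s)
    best = float("-inf")
    for lag in range(max_lag + 1):
        pairs = n - lag
        c = sum(s[k] * p[k + lag] for k in range(pairs)) / pairs
        best = max(best, c)
    return best


# Three pseudo-random (non-periodic) stimulus sequences SS(Sig-1..3).
rng = random.Random(0)
patterns = [
    [1.0 if rng.random() < 0.5 else -1.0 for _ in range(200)]
    for _ in range(3)
]

# Synthetic pupil sequence PS: a copy of the third pattern delayed by 5 samples.
delay = 5
pupil = [0.0] * delay + patterns[2][:-delay]

f = [ccf_max(ss, pupil, max_lag=10) for ss in patterns]
attended = max(range(len(f)), key=f.__getitem__)
```

A real pupil response is slower and smoother than a delayed copy of the stimulus, so in practice the stimulus sequence would likely need to be filtered, or the lag window widened, before correlating.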
  • the estimation unit 226 reads the model parameter ⁇ from the storage unit 122.
  • the sound source device 14-n is the sound source that emits the nth sound SO (Info-n).
  • each position ⁇ n in space becomes a sound source for emitting the n-th sound SO(Info-n)
  • the visual stimulus generator 13-n is arranged at a position corresponding to the position ⁇ n which is the sound source.
  • the visual stimulus generator 13-n is placed at the position ⁇ n or near the position ⁇ n .
  • The learning visual stimulus pattern and the visual stimulus pattern are presented from a dedicated device for presenting visual stimulus patterns, such as a visual stimulus generator (hereinafter referred to as a "dedicated visual stimulus device").
  • Alternatively, temporal changes in the images of devices other than the dedicated visual stimulus device, landscapes, machines, plants, animals, etc. may be used as the learning visual stimulus pattern and the visual stimulus pattern.
  • In that case, the learning visual stimulus pattern is generated from images obtained by capturing such subjects with a camera or the like.
  • the visual stimulus pattern at the time of estimating the auditory attention state is a pattern that the user 10 visually perceives directly from devices other than the visual stimulus dedicated device, landscapes, machines, plants, animals, and the like.
  • the visual stimulus controllers 123 and 223 and the visual stimulus generators 13-1, . . . , 13-N can be omitted.
  • The learning sounds and the sounds are presented from a dedicated device such as a sound source device (hereinafter referred to as a "dedicated sound presentation device").
  • sounds emitted from devices other than the dedicated sound presentation device, landscapes, machines, plants, animals, and the like may be used.
  • Learning sounds can be generated from audio signals obtained by recording such sounds with a microphone or the like.
  • the sounds presented to the user 10 when estimating the auditory attention state are the sounds that the user 10 perceives auditorily directly from devices other than the sound presentation dedicated device, landscapes, machines, plants, animals, and the like.
  • In that case, the auditory information control unit 124 and the sound source devices 14-1, ..., 14-N can be omitted.
  • The learning devices 11 and 21 and the auditory attention state estimation devices 12 and 22 in each embodiment are, for example, devices configured by executing a predetermined program on a general-purpose or dedicated computer equipped with a processor (hardware processor) such as a CPU (central processing unit) and memory such as RAM (random-access memory) and ROM (read-only memory). That is, the learning devices 11 and 21 and the auditory attention state estimation devices 12 and 22 in each embodiment have, for example, processing circuitry configured to implement their respective units.
  • This computer may have a single processor and memory, or may have multiple processors and memories.
  • This program may be installed in the computer, or may be recorded in ROM or the like in advance.
  • Processing units may be configured using an electronic circuit that realizes the processing functions on its own, rather than an electronic circuit, such as a CPU, that realizes a functional configuration by reading a program.
  • an electronic circuit that constitutes one device may include a plurality of CPUs.
  • FIG. 6 is a block diagram illustrating the hardware configuration of the learning devices 11 and 21 and the auditory attention state estimation devices 12 and 22 in each embodiment.
  • The learning devices 11 and 21 and the auditory attention state estimation devices 12 and 22 of this example include a CPU (Central Processing Unit) 10a, an input section 10b, an output section 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g.
  • the CPU 10a of this example has a control section 10aa, an arithmetic section 10ab, and a register 10ac, and executes various arithmetic processing according to various programs read into the register 10ac.
  • the input unit 10b is an input terminal, a keyboard, a mouse, a touch panel, etc. for inputting data.
  • the output unit 10c is an output terminal for outputting data, a display, a LAN card controlled by the CPU 10a having read a predetermined program, and the like.
  • the RAM 10d is SRAM (Static Random Access Memory), DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored.
  • the auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored.
  • the bus 10g connects the CPU 10a, the input section 10b, the output section 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged.
  • the CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d according to the read OS (Operating System) program.
  • the CPU 10a writes various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. Then, the address on the RAM 10d where the program and data are written is stored in the register 10ac of the CPU 10a.
  • the control unit 10aa of the CPU 10a sequentially reads these addresses stored in the register 10ac, reads the program and data from the area on the RAM 10d indicated by the read address, and causes the calculation unit 10ab to sequentially execute the calculation indicated by the program, The calculation result is stored in the register 10ac.
  • the above program can be recorded on a computer-readable recording medium.
  • a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media are magnetic recording devices, optical discs, magneto-optical recording media, semiconductor memories, and the like.
  • the distribution of this program is carried out, for example, by selling, assigning, lending, etc. portable recording media such as DVDs and CD-ROMs on which the program is recorded. Further, the program may be distributed by storing the program in the storage device of the server computer and transferring the program from the server computer to other computers via the network.
  • A computer that executes such a program, for example, first stores the program recorded on a portable recording medium or transferred from a server computer in its own storage device. When executing the process, this computer reads the program stored in its own storage device and executes the process according to the read program. As other execution forms of this program, the computer may read the program directly from a portable recording medium and execute processing according to the program, or, each time the program is transferred from the server computer to this computer, the computer may sequentially execute processing according to the received program.
  • The above-mentioned processing may also be executed by a so-called ASP (Application Service Provider) type service, which does not transfer the program from the server computer to this computer and realizes the processing functions only through execution instructions and result acquisition.
  • The program in this embodiment includes information that is used for processing by a computer and that is equivalent to a program (data that is not a direct instruction to the computer but has the property of prescribing the processing of the computer, etc.).
  • the device is configured by executing a predetermined program on a computer, but at least part of these processing contents may be implemented by hardware.
  • an estimation model is obtained through learning, and the target to which the user aurally pays attention is estimated using the estimation model.
  • As long as a feature quantity based on the strength of correlation between each of a plurality of mutually different visual stimulus patterns corresponding to a plurality of mutually different sound sources and the user's pupil diameter change amount is used, any method may be used to estimate the destination to which the user has aurally directed attention.
  • a threshold may be determined by sampling feature amounts obtained in the past, and a destination to which the user's attention is aurally directed may be estimated by comparing the threshold with a newly obtained feature amount.
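The threshold-based alternative can be sketched as below. The source only says that a threshold is determined by sampling past feature amounts, so the rule used here (mean plus k standard deviations of past values) and the helper names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_threshold(past_features, k=2.0):
    """Threshold derived from previously observed feature values (assumed rule)."""
    return mean(past_features) + k * pstdev(past_features)


def estimate_attended(features, threshold):
    """Index of the strongest feature above the threshold, or None if none exceeds it."""
    above = [n for n, v in enumerate(features) if v > threshold]
    return max(above, key=lambda n: features[n]) if above else None


# Hypothetical past feature values, e.g. correlation strengths seen earlier.
past = [0.10, 0.12, 0.08, 0.11, 0.09]
thr = fit_threshold(past)

new_features = [0.11, 0.45, 0.13]   # one value per sound source
attended = estimate_attended(new_features, thr)
```

Returning `None` when no feature exceeds the threshold lets the caller distinguish "no source is being attended" from a low-confidence guess.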

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Surgery (AREA)
  • Veterinary Medicine (AREA)
  • Molecular Biology (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Pathology (AREA)
  • Psychiatry (AREA)
  • Developmental Disabilities (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Social Psychology (AREA)
  • Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Educational Technology (AREA)
  • Child & Adolescent Psychology (AREA)
  • Physiology (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Geometry (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Ophthalmology & Optometry (AREA)
  • Eye Examination Apparatus (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
PCT/JP2021/028988 2021-08-04 2021-08-04 聴覚注意状態推定装置、学習装置、それらの方法、およびプログラム Ceased WO2023012941A1 (ja)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2023539458A JP7619465B2 (ja) 2021-08-04 2021-08-04 聴覚注意状態推定装置、学習装置、それらの方法、およびプログラム
US18/293,976 US20240341649A1 (en) 2021-08-04 2021-08-04 Hearing attentional state estimation apparatus, learning apparatus, method, and program thereof
PCT/JP2021/028988 WO2023012941A1 (ja) 2021-08-04 2021-08-04 聴覚注意状態推定装置、学習装置、それらの方法、およびプログラム

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/028988 WO2023012941A1 (ja) 2021-08-04 2021-08-04 聴覚注意状態推定装置、学習装置、それらの方法、およびプログラム

Publications (1)

Publication Number Publication Date
WO2023012941A1 true WO2023012941A1 (ja) 2023-02-09

Family

ID=85154446

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/028988 Ceased WO2023012941A1 (ja) 2021-08-04 2021-08-04 聴覚注意状態推定装置、学習装置、それらの方法、およびプログラム

Country Status (3)

Country Link
US (1) US20240341649A1
JP (1) JP7619465B2
WO (1) WO2023012941A1

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010104754A (ja) * 2008-09-30 2010-05-13 Hanamura Takeshi 情動分析装置
WO2013023056A1 (en) * 2011-08-09 2013-02-14 Ohio University Pupillometric assessment of language comprehension
JP2015132783A (ja) * 2014-01-16 2015-07-23 日本電信電話株式会社 音の顕著度推定装置、その方法、及びプログラム
JP2019126423A (ja) * 2018-01-22 2019-08-01 日本電信電話株式会社 聴覚的注意推定装置、聴覚的注意推定方法、プログラム
US20200253526A1 (en) * 2019-02-07 2020-08-13 University Of Oregon Measuring responses to sound using pupillometry
JP2021019983A (ja) * 2019-07-30 2021-02-18 株式会社豊田中央研究所 心理状態判定装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021019963A (ja) * 2019-07-30 2021-02-18 純生 倉田 介護ベッド
EP4161387B1 (en) * 2020-06-03 2024-02-28 Apple Inc. Sound-based attentive state assessment

Also Published As

Publication number Publication date
JPWO2023012941A1 2023-02-09
US20240341649A1 (en) 2024-10-17
JP7619465B2 (ja) 2025-01-22

Similar Documents

Publication Publication Date Title
Martin et al. Multimodality in VR: A survey
Carlile et al. The perception of auditory motion
Vroomen et al. Visual anticipatory information modulates multisensory interactions of artificial audiovisual stimuli
Van Erp et al. Distilling the underlying dimensions of tactile melodies
US10921888B2 (en) Sensory evoked response based attention evaluation systems and methods
Poikonen et al. Dance on cortex: enhanced theta synchrony in experts when watching a dance piece
Popov et al. Brain areas associated with visual spatial attention display topographic organization during auditory spatial attention
Romei et al. The contributions of sensory dominance and attentional bias to cross-modal enhancement of visual cortex excitability
Dugué et al. Distinct perceptual rhythms for feature and conjunction searches
Diederich et al. Saccadic reaction times to audiovisual stimuli show effects of oscillatory phase reset
Heron et al. Adaptation minimizes distance-related audiovisual delays
Geronazzo et al. The impact of an accurate vertical localization with HRTFs on short explorations of immersive virtual reality scenarios
Chouiter et al. Experience-based auditory predictions modulate brain activity to silence as do real sounds
Van Opstal et al. Reconstructing spectral cues for sound localization from responses to rippled noise stimuli
Nemes et al. Multiple spatial frequency channels in human visual perceptual memory
Hamilton et al. Gone in a flash: manipulation of audiovisual temporal integration using transcranial magnetic stimulation
Tian et al. Rhythm facilitates auditory working memory via beta-band encoding and theta-band maintenance
JP7619465B2 (ja) 聴覚注意状態推定装置、学習装置、それらの方法、およびプログラム
Tajadura-Jiménez et al. Auditory-induced emotion: A neglected channel for communication in human-computer interaction
Wang et al. Temporal and spectral EEG dynamics can be indicators of stealth placement
KR20190058565A (ko) 개인의 작업 메모리 부하에 영향을 미치는 센서 입력들의 식별
Bratzke et al. Short-term memory of temporal information revisited
JP7533796B2 (ja) 合成光源生成装置、瞳孔径変化誘発装置、それらの方法、およびプログラム
Globerson et al. Brain responses to regular and octave-scrambled melodies: A case of predictive-coding?
Dal Palù et al. From multisensory to multicognitive: The sound of a product is other than the sum of its parts

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21952119

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023539458

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21952119

Country of ref document: EP

Kind code of ref document: A1