US9002707B2 - Determining the position of the source of an utterance - Google Patents

Determining the position of the source of an utterance

Info

Publication number
US9002707B2
US9002707B2 (application US13/669,843, US201213669843A)
Authority
US
United States
Prior art keywords
information
event
target
input
utterance
Prior art date
Legal status
Expired - Fee Related, expires
Application number
US13/669,843
Other languages
English (en)
Other versions
US20130124209A1 (en)
Inventor
Keiichi Yamada
Current Assignee
Sony Corp
Original Assignee
Sony Corp
Priority date
Filing date
Publication date
Application filed by Sony Corp filed Critical Sony Corp
Assigned to SONY CORPORATION. Assignors: YAMADA, KEIICHI
Publication of US20130124209A1 publication Critical patent/US20130124209A1/en
Application granted granted Critical
Publication of US9002707B2 publication Critical patent/US9002707B2/en


Classifications

    • G06K9/00335
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • G10L17/10Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2134Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on separation criteria, e.g. independent component analysis
    • G06K9/0057
    • G06K9/00684
    • G06K9/624
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/35Categorising the entire scene, e.g. birthday party or wedding scene
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • G01S3/802Systems for determining direction or deviation from predetermined direction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/22Source localisation; Inverse modelling
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating

Definitions

  • the present disclosure relates to an information processing apparatus, an information processing method, and a program, and more particularly, to an information processing apparatus, an information processing method, and a program which analyze the external environment based on input information from the outside world, for example, information such as images and voices, and specifically analyze the position of a person who is speaking and the like.
  • the present disclosure relates to an information processing apparatus, an information processing method, and a program which identify a user who is speaking and analyze each utterance when a plurality of persons are speaking simultaneously.
  • a system that performs an interactive process between a person and information processing apparatuses such as a PC or a robot, for example, a communication process or an interactive process is referred to as a man-machine interaction system.
  • the information processing apparatus such as a PC, a robot, or the like performs analysis based on input information by inputting image information or voice information to recognize human actions such as human behavior or words.
  • various channels for gestures, line of sight, facial expressions, and the like are used as information transmission channels. If it is possible to analyze all of these channels in a machine, even communication between people and machines may reach the same level as that of communication between people.
  • An interface capable of analyzing input information from these multiple channels (also referred to as modality or modal) is called a multi-modal interface, and development and research into such an interface have been conducted extensively in recent years.
  • An information processing apparatus receives images and voices of users (father, mother, sister, and brother) in front of the television via a camera and a microphone, and analyzes the position of each of the users, which user is speaking, and the like, so that a system capable of performing processes according to the analysis information, such as the camera zooming in on a user who has spoken or making an adequate response to the user who has spoken, may be realized.
  • Examples of the related art in which an existing man-machine interaction system is disclosed include, for example, Japanese Unexamined Patent Application Publication No. 2009-31951 and Japanese Unexamined Patent Application Publication No. 2009-140366.
  • a process is performed in which information from multiple channels (modals) is integrated in a probabilistic manner, and the position of each of a plurality of users, who each of the users is, and who is issuing a signal, that is, who is speaking, are determined with respect to each of the plurality of users.
  • a probability that each of the targets is an utterance source is calculated from analysis results of image data captured by a camera or sound information obtained by a microphone.
  • an information processing apparatus may perform a process for integrating information estimated to be more accurate by performing a stochastic process on uncertain information included in various input information such as image information, sound information, and the like, in a system that analyzes input information from a plurality of channels (modalities), more specifically, in processes such as, for example, determining the positions of persons in the surrounding area, so that robustness may be improved and highly accurate analysis may be performed.
  • an information processing apparatus including: a plurality of information input units that input observation information of a real space; an event detection unit that generates event information including estimated position information and estimated identification information of users present in the real space based on analysis of the information input from the information input units; and an information integration processing unit that inputs the event information, and generates target information including a position of each user and user identification information on the basis of the input event information, and signal information representing a probability value of the event generation source, wherein the information integration processing unit includes an utterance source probability calculation unit, and wherein the utterance source probability calculation unit performs a process of calculating an utterance source score as an index value representing an utterance source probability of each target by multiplying weights based on utterance situations by a plurality of different information items input from the event detection unit.
  • the utterance source probability calculation unit may receive an input of (a) user position information (sound source direction information) and (b) user identification information (utterer identification information) which correspond to an utterance event as input information from a voice event detection unit constituting the event detection unit, may receive an input of (a) user position information (face position information), (b) user identification information (face identification information), and (c) lip movement information as the target information generated based on input information from an image event detection unit constituting the event detection unit, and may perform a process of calculating the utterance source score based on the input information by adopting at least one item of the above-mentioned information.
  • the utterance source probability calculation unit may perform a process of adjusting the weight coefficients α, β, and γ according to an utterance situation.
  • the utterance source probability calculation unit may perform a process of adjusting the weight coefficients α, β, and γ according to the following two conditions of (Condition 1) whether it is a single utterance from only one target or a simultaneous utterance from two targets and (Condition 2) whether positions of the two targets are close to each other or positions of the two targets are far apart.
  • the utterance source probability calculation unit may perform a process of adjusting the weight coefficients α, β, and γ such that the weight coefficient applied to the lip movement information is small in a situation where two targets with an utterance probability are present and the two targets speak simultaneously.
  • the utterance source probability calculation unit may perform a process of adjusting the weight coefficients α, β, and γ such that the weight coefficient applied to the sound source direction information is small in a situation where two targets with an utterance probability are present and positions of the two targets are close to each other and only one target speaks.
  • the utterance source probability calculation unit may perform a process of adjusting the weight coefficients α, β, and γ such that the weight coefficients applied to the lip movement information and to the sound source direction information are small in a situation where two targets with an utterance probability are present and positions of the two targets are close to each other and the two targets speak simultaneously.
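  • As a rough illustration of the situation-dependent weighting described above, the sketch below combines three per-target cues (a sound source direction match D, an utterer identification score S, and a lip movement score L) into a single utterance source score, shrinking the lip-movement weight when two targets speak simultaneously and the sound-direction weight when the two targets are close together. The mapping of α, β, γ to the cues, the weighted-product form, and the numeric values are illustrative assumptions, not the patent's definitions.

```python
# Hedged sketch: situation-dependent weighting of three per-target cues into an
# utterance source score. The weighted-product form, the mapping of the
# coefficients to the cues, and the numeric values are illustrative assumptions.

def choose_weights(simultaneous_utterance: bool, targets_close: bool):
    """Return (alpha, beta, gamma) weights for (sound source direction,
    utterer identification, lip movement), normalized to sum to 1."""
    alpha, beta, gamma = 1.0, 1.0, 1.0
    if simultaneous_utterance:
        gamma = 0.2   # lip movement is ambiguous when two targets speak at once
    if targets_close:
        alpha = 0.2   # sound source direction is ambiguous when targets are close
    total = alpha + beta + gamma
    return alpha / total, beta / total, gamma / total

def utterance_source_score(d_score, s_score, l_score, alpha, beta, gamma):
    """Combine per-target cue scores (each in (0, 1]) into one index value."""
    return (d_score ** alpha) * (s_score ** beta) * (l_score ** gamma)

# Example: two nearby targets, only one of them speaking.
a, b, g = choose_weights(simultaneous_utterance=False, targets_close=True)
print(utterance_source_score(d_score=0.4, s_score=0.9, l_score=0.8,
                             alpha=a, beta=b, gamma=g))
```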
  • an information processing method of performing an information analysis process in an information processing apparatus including: receiving, by a plurality of information input units, an input of observation information of a real space; generating, by an event detection unit, event information including estimated position information and estimated identification information of users present in the real space based on analysis of the information input from the information input units; and receiving, by an information integration processing unit, an input of an event, and generating target information including a position of each user and user identification information on the basis of the input event information, and signal information representing a probability value of the event generation source, wherein in the generating of the target information, a process of calculating an utterance source score as an index value representing an utterance source probability of each target by multiplying weights based on utterance situations by a plurality of different information items input in the generating of the event information is performed.
  • the program of the present disclosure is a program which can be provided in a storage medium or a communication medium in a computer-readable format, for example, in an information processing apparatus or a computer system capable of executing various program codes.
  • a process corresponding to the program is realized in the information processing apparatus or the computer system.
  • a system in the specification is a logical group configuration of a plurality of apparatuses, and the present disclosure is not limited to the apparatuses with each configuration being provided in the same case.
  • a configuration that generates a user position, identification information, utterer information, and the like by information analysis based on uncertain and asynchronous input information is realized.
  • the information processing apparatus may include an information integration processing unit that receives an input of event information including estimated position and estimated identification data of a user based on image information or voice information, and generates target information including a position and user identification information of each user based on the input event information and signal information representing a probability value for an event generating source.
  • the information integration processing unit includes an utterance source probability calculation unit, and the utterance source probability calculation unit performs a process of calculating an utterance source score as an index value representing an utterance source probability of each target by multiplying weights based on utterance situations by a plurality of different information items input from an event detection unit.
  • FIG. 1 is a diagram illustrating an overview of a process performed by an information processing apparatus according to an embodiment of the present disclosure.
  • FIG. 2 is a diagram illustrating a configuration and a process of an information processing apparatus according to an embodiment of the present disclosure.
  • FIG. 3A and FIG. 3B are diagrams illustrating an example of information that is generated by a voice event detection unit and an image event detection unit, and is input to an information integration processing unit.
  • FIG. 4A to FIG. 4C are diagrams illustrating a basic processing example to which a particle filter is applied.
  • FIG. 5 is a diagram illustrating a configuration of particles set in the present processing example.
  • FIG. 6 is a diagram illustrating a configuration of target data of each target included in respective particles.
  • FIG. 7 is a diagram illustrating a configuration and a generation process of target information.
  • FIG. 8 is a diagram illustrating a configuration and a generation process of target information.
  • FIG. 9 is a diagram illustrating a configuration and a generation process of target information.
  • FIG. 10 is a flowchart illustrating a processing sequence performed by an information integration processing unit.
  • FIG. 11 is a diagram illustrating a calculation process of a particle weight, in detail.
  • FIG. 12 is a diagram illustrating an utterer specification process.
  • FIG. 13 is a flowchart illustrating an example of a processing sequence performed by an utterance source probability calculation unit.
  • FIG. 14 is a diagram illustrating a process of calculating an utterance source score performed by an utterance source probability calculation unit.
  • FIG. 15 is a flowchart illustrating a calculation processing sequence of an utterance source score performed by an utterance source probability calculation unit.
  • FIG. 16A to FIG. 16D are diagrams illustrating an example of an utterance situation that is a determination element of a weight coefficient in a process of calculating an utterance source score performed by an utterance source probability calculation unit.
  • FIG. 17 is a diagram illustrating an example of a process of determining a weight coefficient in a process of calculating an utterance source score performed by an utterance source probability calculation unit.
  • FIG. 18A and FIG. 18B are diagrams illustrating an example of a process of determining a weight coefficient in a process of calculating an utterance source score performed by an utterance source probability calculation unit.
  • the present disclosure realizes a configuration in which an identifier is used with respect to voice event information corresponding to an utterance of a user from within input event information when calculating an utterance source probability, so that it is not necessary for a weight coefficient described in BACKGROUND to be adjusted beforehand.
  • an identifier for identifying whether each target is an utterance source, or an identifier that, given only two items of target information, determines which of the two seems more likely to be the utterance source, is used.
  • As the input information to the identifier, sound source direction information and utterer identification information included in the voice event information, lip movement information included in the image event information, and a target position and a total number of targets included in the target information are used.
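  • The identifier itself is not specified in this excerpt; the sketch below assumes a simple logistic-style discriminator over the inputs listed above, purely to make the data flow concrete. The feature names, weights, and model form are illustrative assumptions rather than the patent's identifier.

```python
import math

# Hedged sketch of an utterance-source identifier over the inputs listed above
# (sound source direction, utterer identification score, lip movement score,
# target position, number of targets). The logistic form and the weights are
# illustrative assumptions, not the patent's identifier.

WEIGHTS = {"direction_match": 2.0, "speaker_id": 1.5, "lip_movement": 1.0,
           "num_targets": -0.1, "bias": -1.0}

def utterance_source_probability(sound_dir, target_pos, speaker_id_score,
                                 lip_score, num_targets):
    # Closer agreement between sound direction and target position -> closer to 1.
    direction_match = math.exp(-abs(sound_dir - target_pos))
    z = (WEIGHTS["direction_match"] * direction_match
         + WEIGHTS["speaker_id"] * speaker_id_score
         + WEIGHTS["lip_movement"] * lip_score
         + WEIGHTS["num_targets"] * num_targets
         + WEIGHTS["bias"])
    return 1.0 / (1.0 + math.exp(-z))  # probability that this target is the utterer

print(utterance_source_probability(sound_dir=0.1, target_pos=0.0,
                                   speaker_id_score=0.8, lip_score=0.7,
                                   num_targets=3))
```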
  • the information processing apparatus 100 of the present disclosure inputs image information and voice information from sensors that input real-time observation information, here, for example, a camera 21 and a plurality of microphones 31 to 34 , and performs analysis of the environment based on the input information. Specifically, position analysis of a plurality of users 1 and 2 ( 11 and 12 ) and identification of the users at the corresponding positions are performed.
  • the information processing apparatus 100 performs analysis of the image information and the voice information input from the camera 21 and the plurality of microphones 31 to 34 to thereby identify the positions of the two users 1 and 2 , and whether the user in each position is the sister or the brother.
  • the identified result is used for various processes.
  • the identified result is used for a process such as the camera zooming in on a user who has spoken, the television making a response to the user having a conversation, or the like.
  • a user position identification and a user specification process are performed as a user identification process based on input information from a plurality of information input units (camera 21 , and microphones 31 to 34 ).
  • Applications of the identified result are not particularly limited.
  • Various uncertain information is included in the image information and the voice information input from the camera 21 and the plurality of microphones 31 to 34 .
  • a stochastic process is performed with respect to the uncertain information included in the input information, and the information subjected to the stochastic process is integrated into information estimated to be highly accurate. By this estimation process, robustness is improved and analysis is performed with high accuracy.
  • the information processing apparatus 100 includes an image input unit (camera) 111 and a plurality of voice input units (microphones) 121 a to 121 d as an input device.
  • the information processing apparatus 100 inputs image information from the image input unit (camera) 111 , and inputs voice information from the voice input unit (microphones) 121 to thereby perform analysis based on this input information.
  • Each of the plurality of voice input units (microphones) 121 a to 121 d is disposed in various positions as shown in FIG. 1 .
  • the voice information input from the plurality of microphones 121 a to 121 d is input to an information integration processing unit 131 via a voice event detection unit 122 .
  • the voice event detection unit 122 analyzes and integrates the voice information input from the plurality of voice input units (microphones) 121 a to 121 d disposed in a plurality of different positions. Specifically, based on the voice information input from the voice input units (microphones) 121 a to 121 d , the voice event detection unit 122 generates the position at which the sound was generated and user identification information indicating which user generated the sound, and inputs the generated information to the information integration processing unit 131 .
  • the specific process is a process for specifying an event generation source such as a person (utterer) who is speaking, or the like.
  • the voice event detection unit 122 analyzes the voice information input from the plurality of voice input units (microphones) 121 a to 121 d disposed in a plurality of different positions, and generates position information of a voice generation source as probability distribution data. Specifically, the voice event detection unit 122 generates an expected value and distribution data N(m e , σ e ) with respect to a sound source direction. In addition, the voice event detection unit 122 generates user identification information based on a comparison with feature information of a voice of a user that is registered in advance. The identification information is also generated as a probabilistic estimated value.
  • the voice event detection unit 122 analyzes the voice information input from the plurality of voice input units (microphones) 121 a to 121 d disposed in the plurality of different positions, generates “integrated voice event information” configured by probability distribution data as position information of a generation source of the voice, and user identification information constituted by a probabilistic estimated value, and inputs the generated integrated voice event information to the information integration processing unit 131 .
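  • A voice event, as described here, carries a Gaussian estimate of the sound source direction together with a probabilistic speaker identity. The container below is a minimal sketch of that structure; the class and field names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict

# Minimal sketch of "integrated voice event information": a Gaussian estimate
# N(m_e, sigma_e) of the sound source direction plus a probability for each
# registered user. The class and field names are assumptions for illustration.

@dataclass
class VoiceEvent:
    direction_mean: float             # m_e, expected sound source direction
    direction_var: float              # sigma_e^2, uncertainty of the estimate
    speaker_probs: Dict[str, float]   # P(utterer == user) per registered user

event = VoiceEvent(direction_mean=15.0, direction_var=25.0,
                   speaker_probs={"user1": 0.7, "user2": 0.2, "user3": 0.1})
print(event)
```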
  • the image information input from the image input unit (camera) 111 is input to the information integration processing unit 131 via the image event detection unit 112 .
  • the image event detection unit 112 analyzes the image information input from the image input unit (camera) 111 , extracts a face of a person included in the image, and generates position information of the face as probability distribution data. Specifically, an expected value for a position or an orientation of the face, and distribution data N(m e , σ e ) are generated.
  • the image event detection unit 112 identifies a face by performing a comparison with feature information of a user's face that is registered in advance, and generates user identification information.
  • the identification information is generated as a probabilistic estimated value. Since feature information with respect to the faces of a plurality of users to be verified is registered in advance in the image event detection unit 112 , a comparison between feature information of an image of a face area extracted from an input image and feature information of the registered face images is performed, and a process of determining which registered user's face the input image corresponds to with high probability is carried out, so that a posterior probability or a score with respect to all of the registered users is calculated.
  • the image event detection unit 112 calculates an attribute score corresponding to a face included in the image input from the image input unit (camera) 111 , for example, a face attribute score generated based on a movement of a mouth area.
  • the image event detection unit 112 identifies the mouth area from the face area included in the image input from the image input unit (camera) 111 and detects a movement of the mouth area, so that a score corresponding to the movement detection result is calculated, with a higher value being calculated when a movement of the mouth area is detected.
  • a movement detection process of the mouth area is performed as a process to which VSD (Visual Speech Detection) is applied.
  • a method disclosed in Japanese Unexamined Patent Application Publication No. 2005-157679, an application of the same applicant as the present disclosure, is applied. Specifically, for example, the left and right corners of the lips are detected from a face image detected from the image input from the image input unit (camera) 111 , a difference in luminance is calculated after the left and right corners of the lips are aligned in an N-th frame and an (N+1)-th frame, and the difference value is subjected to threshold processing, thereby detecting a movement of the lips.
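  • The lip-movement cue is described here as a frame-to-frame luminance difference around the aligned lip corners, compared against a threshold. The sketch below mirrors that idea on already-aligned mouth patches; the alignment step is omitted and the threshold value is an illustrative assumption.

```python
import numpy as np

# Hedged sketch of the lip-movement cue: compare the mouth region between two
# consecutive frames (assumed to be already aligned at the lip corners) and use
# the mean absolute luminance difference, checked against a threshold, as the
# movement decision. The threshold value is an illustrative assumption.

def lip_movement_score(mouth_patch_n, mouth_patch_n1, threshold=8.0):
    diff = np.abs(mouth_patch_n1.astype(np.float32) - mouth_patch_n.astype(np.float32))
    score = float(diff.mean())            # larger lip movement -> larger score
    return score, score > threshold       # (face attribute score, moving?)

# Example with random grayscale patches standing in for aligned mouth regions.
rng = np.random.default_rng(0)
frame_n = rng.integers(0, 255, (20, 40), dtype=np.uint8)
frame_n1 = rng.integers(0, 255, (20, 40), dtype=np.uint8)
print(lip_movement_score(frame_n, frame_n1))
```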
  • the information integration processing unit 131 performs a process of probabilistically estimating who each of a plurality of users is, a position of each of the plurality of users, and who is generating signals such as a voice or the like, based on the input information from the voice event detection unit 122 or the image event detection unit 112 .
  • the information integration processing unit 131 outputs, to a processing determination unit 132 , each item of information such as (a) “target information” as estimation information concerning the position of each of the plurality of users and who they are, and (b) “signal information” indicating an event generation source such as, for example, a user who is speaking, based on the input information from the voice event detection unit 122 or the image event detection unit 112 .
  • There are two types of signal information: (b1) signal information based on a voice event and (b2) signal information based on an image event.
  • a target information updating unit 141 of the information integration processing unit 131 performs target updating using, for example, a particle filter by inputting the image event information detected in the image event detection unit 112 , and generates the target information and the signal information based on the image event to thereby output the generated information to the processing determination unit 132 .
  • the target information obtained as the updating result is also output to the utterance source probability calculation unit 142 .
  • the utterance source probability calculation unit 142 of the information integration processing unit 131 calculates a probability that each of the targets is the generation source of the input voice event using an identification model (identifier), by inputting the voice event information detected in the voice event detection unit 122 .
  • the utterance source probability calculation unit 142 generates signal information based on the voice event based on the calculated value, and outputs the generated information to the processing determination unit 132 .
  • the processing determination unit 132 receiving the identification processing result including the target information and the signal information generated by the information integration processing unit 131 performs a process using the identification processing result. For example, processes such as a camera zooming-in with respect to, for example, a user who has spoken, or a television making a response with respect to the user who has spoken, or the like are performed.
  • the voice event detection unit 122 generates probability distribution data of position information of the generation source of a voice, and more specifically, an expected value and distribution data N(m e , σ e ) with respect to a sound source direction.
  • the voice event detection unit 122 generates user identification information based on a comparison result such as feature information of a user that is registered in advance, and inputs the generated information to the information integration processing unit 131 .
  • the image event detection unit 112 extracts a face of a person included in the image, and generates position information of the face as probability distribution data. Specifically, the image event detection unit 112 generates an expected value and dispersion data N(m e , σ e ) with respect to a position and an orientation of the face. In addition, the image event detection unit 112 generates user identification information based on a comparison process performed with the feature information of the face of the user that is registered in advance, and inputs the generated information to the information integration processing unit 131 .
  • the image event detection unit 112 detects a face attribute score as face attribute information from the face area within the image input from the image input unit (camera) 111 , for example, from a movement of the mouth area, calculates a score corresponding to the movement detection result of the mouth area, more specifically, a face attribute score with a high value when a significant movement of the mouth area is detected, and inputs the calculated score to the information integration processing unit 131 .
  • Referring to FIG. 3A and FIG. 3B , examples of information that is generated by the voice event detection unit 122 and the image event detection unit 112 and input to the information integration processing unit 131 are described.
  • the image event detection unit 112 generates data such as (Va) an expected value and dispersion data N(m e , σ e ) with respect to a position and an orientation of a face, (Vb) user identification information based on feature information of a face image, and (Vc) a score corresponding to attributes of a detected face, for example, a face attribute score generated based on a movement of a mouth area, and inputs the generated data to the information integration processing unit 131 .
  • the voice event detection unit 122 inputs, to the information integration processing unit 131 , data such as (Aa) an expected value and dispersion data N(m e , σ e ) with respect to a sound source direction, and (Ab) user identification information based on voice characteristics.
  • An example of a real environment including the same camera and microphones as those described with reference to FIG. 1 is illustrated in FIG. 3A , and there are a plurality of users 1 to k ( 201 to 20 k ).
  • the voice is input via the microphone.
  • the camera continuously photographs images.
  • the information that is generated by the voice event detection unit 122 and the image event detection unit 112 , and is input to the information integration processing unit 131 is classified into three types such as (a) user position information, (b) user identification information (face identification information or utterer identification information), and (c) face attribute information (face attribute score).
  • The (a) user position information is integrated information of (Va) an expected value and dispersion data N(m e , σ e ) with respect to a face position or direction, which is generated by the image event detection unit 112 , and (Aa) an expected value and dispersion data N(m e , σ e ) with respect to a sound source direction, which is generated by the voice event detection unit 122 .
  • user identification information (face identification information or utterer identification information) is integrated information of (Vb) user identification information based on feature information of a face image, which is generated by the image event detection unit 112 , and (Ab) user identification information based on feature information of voice, which is generated by the voice event detection unit 122 .
  • The (c) face attribute information corresponds to the score (Vc) corresponding to the detected face attribute generated by the image event detection unit 112 , for example, a face attribute score generated based on the movement of the mouth area.
  • Three kinds of information such as the (a) user position information, the (b) user identification information (face identification information or utterer identification information), and the (c) face attribute information (face attribute score) are generated for each event.
  • When voice information is input from the voice input units (microphones) 121 a to 121 d , the voice event detection unit 122 generates the above described (a) user position information and (b) user identification information based on the voice information, and inputs the generated information to the information integration processing unit 131 .
  • the image event detection unit 112 generates the (a) user position information, the (b) user identification information, and the (c) face attribute information (face attribute score) based on the image information input from the image input unit (camera) 111 at a certain frame interval determined in advance, and inputs the generated information to the information integration processing unit 131 .
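  • Each detected face thus yields its own image event carrying the three items (a) to (c). The container below is a minimal sketch of such a per-face event; the class and field names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Minimal sketch of a per-face image event: a Gaussian position estimate, face
# identification probabilities, and a face attribute score (here the
# lip-movement score). The class and field names are assumptions.

@dataclass
class FaceEvent:
    position_mean: float              # m_e for the face position (1-D for brevity)
    position_var: float               # sigma_e^2
    face_id_probs: Dict[str, float]   # P(face == user) per registered user
    face_attribute_score: float       # e.g. lip-movement score S_eID

# One frame with two detected faces produces two separate events.
frame_events: List[FaceEvent] = [
    FaceEvent(-0.5, 0.04, {"user1": 0.8, "user2": 0.2}, face_attribute_score=0.9),
    FaceEvent(0.7, 0.09, {"user1": 0.1, "user2": 0.9}, face_attribute_score=0.1),
]
print(len(frame_events), "face events in this frame")
```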
  • In this configuration, the image input unit (camera) 111 is an example in which a single camera is set, and images of a plurality of users are photographed by the single camera.
  • the (a) user position information and the (b) user identification information are generated with respect to each of the plurality of faces included in a single image, and the generated information is input to the information integration processing unit 131 .
  • a process in which the voice event detection unit 122 generates the (a) user position information and the (b) user identification information (utterer identification information) will be described based on the voice information input from the voice input unit (microphone) 121 a to 121 d.
  • the voice event detection unit 122 generates estimated information of the position of the user who produced the voice, that is, the position of the utterer, analyzed based on the voice information input from the voice input units (microphones) 121 a to 121 d . That is, the voice event detection unit 122 generates the position at which the utterer is estimated to be, as Gaussian distribution (normal distribution) data N(m e , σ e ) obtained from an expected value (average) [m e ] and distribution information [σ e ].
  • the voice event detection unit 122 estimates who the utterer is based on the voice information input from the voice input unit (microphone) 121 a to 121 d , by a comparison between feature information of the input voice and feature information of the voices of users 1 to k registered in advance. Specifically, a probability that the utterer is each of the users 1 to k is calculated. The calculated value (b) is used as the user identification information (utterer identification information).
  • the highest score is distributed to a user having registered voice characteristics closest to the characteristics of the input voice, and the lowest score (for example, zero) is distributed to a user having the most different characteristics from the characteristics of the input voice, so that data setting a probability that the input voice belongs to each of the users is generated, and the generated data is used as the (b) user identification information (utterer identification information).
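  • One way to read the score distribution above is: compute a similarity per registered user and normalize the similarities into a probability distribution over users. The sketch below does exactly that; the feature vectors and the exponential-of-distance similarity measure are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: turn per-user voice-feature similarities into the
# (b) utterer identification information, i.e. a probability that the input
# voice belongs to each registered user. The feature vectors and the
# exponential-of-distance similarity are illustrative assumptions.

def utterer_identification(input_features, registered_features):
    sims = {user: float(np.exp(-np.linalg.norm(input_features - feats)))
            for user, feats in registered_features.items()}
    total = sum(sims.values())
    return {user: s / total for user, s in sims.items()}  # probabilities sum to 1

registered = {"user1": np.array([1.0, 0.2]), "user2": np.array([0.1, 0.9])}
print(utterer_identification(np.array([0.9, 0.3]), registered))
```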
  • Next, a process in which the image event detection unit 112 generates information such as (a) user position information, (b) user identification information (face identification information), and (c) face attribute information (face attribute score) based on the image information input from the image input unit (camera) 111 will be described.
  • the image event detection unit 112 generates estimated information of a face position with respect to each of faces included in the image information input from the image input unit (camera) 111 . That is, a position at which it is estimated that the face detected from the image is present is generated as Gaussian distribution (normal distribution) data N(m e , σ e ) obtained from an expected value (average) [m e ] and distribution information [σ e ].
  • the image event detection unit 112 detects a face included in image information based on the image information input from the image input unit (camera) 111 , and estimates who each of the faces is by a comparison between the input image information and feature information of a face of each user 1 to k registered in advance. Specifically, a probability that each extracted face is each of the users 1 to k is calculated. The calculated value is used as (b) user identification information (face identification information).
  • the highest score is distributed to a user having characteristics of a registered face closest to the characteristics of a face included in the input image, and the lowest score (for example, zero) is distributed to a user having the most different characteristics from the characteristics of the face, so that data setting a probability that the face in the input image belongs to each user is generated, and the generated data is used as (b) user identification information (face identification information).
  • the image event detection unit 112 detects a face area included in the image information based on image information input from the image input unit (camera) 111 , and calculates attributes of the detected face, specifically, attribute scores such as the above described movement of the mouth area of the face, whether the detected face is a smiling face, whether the detected face is a male face or a female face, whether the detected face is an adult face, and the like.
  • In this embodiment, a case in which a score corresponding to the movement of the mouth area of the face included in the image is calculated and used as the face attribute score will be described.
  • the image event detection unit 112 detects the left and right corners of the lips from the face image detected from the image input from the image input unit (camera) 111 , calculates a difference in luminance after the left and right corners of the lips are aligned in an N-th frame and an (N+1)-th frame, and subjects the difference value to threshold processing.
  • a face attribute score in which a higher score is obtained with an increase in the movement of the lips is set.
  • the image event detection unit 112 when a plurality of faces is detected from an image photographed by the camera, the image event detection unit 112 generates event information corresponding to each of the faces as a separate event according to each of the detected faces. That is, the image event detection unit 112 generates event information including the following information and inputs them to the information integration processing unit 131 .
  • the image event detection unit 112 generates the information such as (a) user position information, (b) user identification information (face identification information), and (c) face attribute information (face attribute score), and inputs the generated information to the information integration processing unit 131 .
  • the image event detection unit 112 generates (a) user position information, (b) user identification information (face identification information), and (c) face attribute information (face attribute score) with respect to each of the faces included in each of the photographed images of the plurality of cameras, and inputs the generated information to the information integration processing unit 131 .
  • the information integration processing unit 131 inputs the three items of information shown in FIG. 3B from the voice event detection unit 122 and the image event detection unit 112 as described above, that is, (a) user position information, (b) user identification information (face identification information or utterer identification information), and (c) face attribute information (face attribute score) in this stated order.
  • the voice event detection unit 122 generates and inputs each item of information of the above (a) and (b) as voice event information when a new voice is input, while the image event detection unit 112 generates and inputs each item of information of (a), (b), and (c) as image event information at a certain frame period.
  • a process performed by the information integration processing unit 131 will be described with reference to FIG. 4A to FIG. 4C .
  • the information integration processing unit 131 includes a target information updating unit 141 and an utterance source probability calculation unit 142 , and performs the following processes.
  • the target information updating unit 141 inputs the image event information detected in the image event detection unit 112 , for example, performs a target updating process using a particle filter, and generates target information and signal information based on the image event to thereby output the generated information to the processing determination unit 132 .
  • the target information as the updating result is output to the utterance source probability calculation unit 142 .
  • the utterance source probability calculation unit 142 inputs the voice event information detected in the voice event detection unit 122 , and calculates, using an identification model (identifier), a probability that each target is the utterance source of the input voice event.
  • the utterance source probability calculation unit 142 generates, based on the calculated value, signal information based on the voice event, and outputs the generated information to the processing determination unit 132 .
  • the target information updating unit 141 of the information integration processing unit 131 performs a process of leaving only the more probable hypotheses by setting probability distribution data of hypotheses with respect to the position and identification information of a user, and updating the hypotheses based on the input information. As this processing scheme, a process to which a particle filter is applied is performed.
  • the process to which the particle filter is applied is performed by setting a large number of particles corresponding to various hypotheses.
  • a large number of particles corresponding to hypotheses concerning the position of the user and who the user is are set, and a process of increasing the weight of the more probable particles based on the three items of information shown in FIG. 3B from the image event detection unit 112 , that is, (a) user position information, (b) user identification information (face identification information or utterer identification information), and (c) face attribute information (face attribute score), is performed.
  • FIG. 4A to FIG. 4C show a processing example of estimating the presence position of a user by the particle filter.
  • In FIG. 4A to FIG. 4C , a process of estimating the position of a user 301 within a one-dimensional area on a straight line is performed.
  • An initial hypothesis (H) becomes uniform particle distribution data as shown in FIG. 4A .
  • image data 302 is acquired, and probability distribution data of presence of a user 301 based on the acquired image is acquired as data of FIG. 4B .
  • particle distribution data of FIG. 4A is updated, thereby obtaining updated hypothesis probability distribution data of FIG. 4C .
  • This process is repeatedly performed based on the input information, thereby obtaining more probable position information of the user.
  • input information is processed only with respect to a presence position of the user using the image data only.
  • each of the particles has information concerning the presence position of the user 301 only.
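  • The one-dimensional example of FIG. 4A to FIG. 4C can be sketched as follows: particles start uniformly distributed, are re-weighted by the likelihood of the observed position, and are then resampled; repeating this concentrates the particles around the probable user position. The Gaussian observation model, the process noise, and the numeric values are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of the FIG. 4 example: estimating a user's position on a line
# with a particle filter. The Gaussian observation likelihood and the process
# noise are illustrative assumptions.

rng = np.random.default_rng(1)
num_particles = 500
particles = rng.uniform(-5.0, 5.0, num_particles)   # FIG. 4A: uniform hypotheses

def update(particles, observed_pos, obs_sigma=0.5, process_sigma=0.1):
    # FIG. 4B: likelihood of each hypothesis under the new image observation.
    weights = np.exp(-0.5 * ((particles - observed_pos) / obs_sigma) ** 2)
    weights /= weights.sum()
    # FIG. 4C: resample in proportion to the weights, then add diffusion noise.
    idx = rng.choice(num_particles, size=num_particles, p=weights)
    return particles[idx] + rng.normal(0.0, process_sigma, num_particles)

for observed in [1.2, 1.0, 1.1, 0.9]:                # repeated observations
    particles = update(particles, observed)
print("estimated user position:", particles.mean())
```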
  • the target information updating unit 141 of the information integration processing unit 131 acquires the information shown in FIG. 3B from the image event detection unit 112 , that is, (a) user position information, (b) user identification information (face identification information or utterer identification information), and (c) face attribute information (face attribute score), and determines positions of a plurality of users and who each of the plurality of users is. Accordingly, in the process to which the particle filter is applied, the information integration processing unit 131 sets a large number of particles corresponding to a hypothesis concerning a position of the user and who the user is, so that particle updating is performed based on the two items of information shown in FIG. 3B in the image event detection unit 112 .
  • a particle updating processing example in which the information integration processing unit 131 receives the three items of information shown in FIG. 3B , that is, (a) user position information, (b) user identification information (face identification information or utterer identification information), and (c) face attribute information (face attribute score), from the voice event detection unit 122 and the image event detection unit 112 will be described with reference to FIG. 5 .
  • the particles shown in FIG. 5 are particles 1 to m.
  • a plurality (n) of targets corresponding to virtual users, greater in number than the number of people estimated to be present in the real space, are set in each of the particles.
  • Each of the m particles maintains data for that number of targets on a per-target basis.
  • the face image detected in the image event detection unit 112 is subjected to the updating process as a separate event by associating a target with each of the face image events.
  • the image event detection unit 112 generates (a) user position information, (b) user identification information, and (c) face attribute information (face attribute score) based on the image information input from the image input unit (camera) 111 at a certain frame interval determined in advance, and inputs the generated information to the information integration processing unit 131 .
  • Event generation source hypothesis data 371 and 372 shown in FIG. 5 is event generation source hypothesis data set in each of the particles, and the updating target corresponding to each event ID is determined according to the event generation source hypothesis data set in each of the particles.
  • each packet of target data included in each of the particles will be described with reference to FIG. 6 .
  • the target data of the target 375 is configured by the following data, that is, (a) a probability distribution of the presence position corresponding to each of the targets [Gaussian distribution: N(m 1n , σ 1n )] and (b) user confirmation degree information (uID) indicating who each of the targets is
  • the data to be updated is the data included in each packet of target data, that is, (a) user position information and (b) user identification information (face identification information or utterer identification information).
  • the (c) face attribute information (face attribute score [S eID ]) is finally used as signal information indicating an event generation source.
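  • Putting this together, each of the m particles holds n target entries, and each target entry holds a position Gaussian, a uID confirmation-degree distribution, and a face attribute score, plus the particle's event generation source hypotheses. The layout below is a minimal sketch with assumed names, not the patent's data definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Minimal sketch of the particle/target layout: m particles, each holding n
# target entries with a position Gaussian, a uID confirmation-degree
# distribution, and a face attribute score, plus the particle's event
# generation source hypotheses. Names are assumptions, not the patent's.

@dataclass
class TargetData:
    pos_mean: float                        # m_1n
    pos_var: float                         # sigma_1n^2
    uid_confidence: Dict[str, float]       # confirmation degree per registered user
    face_attribute: float = 0.0            # face attribute score S_eID

@dataclass
class Particle:
    weight: float
    targets: List[TargetData] = field(default_factory=list)
    event_source_hypothesis: Dict[str, int] = field(default_factory=dict)  # event ID -> target index

m, n = 3, 2   # tiny numbers for illustration; the patent assumes many particles
particles = [Particle(weight=1.0 / m,
                      targets=[TargetData(0.0, 1.0, {"user1": 0.5, "user2": 0.5})
                               for _ in range(n)])
             for _ in range(m)]
print(particles[0])
```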
  • the weight of each particle is also updated, so that a weight of a particle having data closest to information in a real space is increased, and a weight of a particle having data unsuitable for the information in the real space is reduced.
  • the signal information based on the face attribute information (face attribute score), that is, the signal information indicating the event generation source is calculated.
  • This calculation uses the number of particles m set in the target information updating unit 141 of the information integration processing unit 131 and the number of targets allocated to each event.
  • This data is finally used as the signal information indicating the event generation source.
  • the target information updating unit 141 generates (a) target information including position estimated information indicating a position of each of a plurality of users, estimated information (uID estimated information) indicating who each of the plurality of users is, and an expected value of face attribute information (S tID ), for example, a face attribute expected value indicating speaking with a moving mouth, and (b) signal information (image event correspondence signal information) indicating an event generation source such as a user who is speaking, and outputs the generated information to the processing determination unit 132 .
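  • The target information output here can be read as a particle-weighted summary of each target's data: a weighted mean position, a weighted uID distribution, and a weighted expected value of the face attribute. The averaging below is a sketch under that reading and is not the patent's exact formula.

```python
# Hedged sketch: summarizing per-particle target data into "target information"
# by particle-weight averaging of position, uID distribution, and face
# attribute. This weighted-average reading is an assumption, not the patent's
# exact formula.

particles = [   # (particle weight, list of per-target dicts), tiny example values
    (0.6, [{"pos": 0.1, "uid": {"user1": 0.8, "user2": 0.2}, "attr": 0.9}]),
    (0.4, [{"pos": 0.3, "uid": {"user1": 0.6, "user2": 0.4}, "attr": 0.7}]),
]

def target_information(particles, target_idx):
    pos = sum(w * targets[target_idx]["pos"] for w, targets in particles)
    attr = sum(w * targets[target_idx]["attr"] for w, targets in particles)
    users = particles[0][1][target_idx]["uid"].keys()
    uid = {u: sum(w * targets[target_idx]["uid"][u] for w, targets in particles)
           for u in users}
    return {"position": pos, "uID": uid, "face_attribute_expectation": attr}

print(target_information(particles, 0))
```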
  • ‘i’ denotes an event ID.
  • in a case where a face attribute score [S eID ] is not present for the face image event eID (for example, when a movement of the mouth is not detected because a hand covers the mouth even though a face is detected), a value S prior based on prior knowledge, or the like, is used as the face attribute score S eID .
  • As the value based on prior knowledge, when a previously obtained value is present for each target, that value is used; otherwise, an average value of the face attribute calculated in advance from off-line face image events is used.
  • In FIG. 9 , three targets corresponding to events are set within the system; however, a calculation example of an expected value of the face attribute is illustrated for the case where only two targets are input as face image events within one frame of image 1 from the image event detection unit 112 to the information integration processing unit 131 .
  • the face attribute is described in this embodiment as the face attribute expected value based on a score corresponding to the movement of the mouth, that is, as data indicating an expected value that each target is the utterer; however, the face attribute score, as described above, can also be calculated as a score for a smiling face, an age, or the like, and the face attribute expected value in such a case is calculated as data corresponding to the attribute that the score represents.
  • the information integration processing unit 131 performs updating of the particles based on the input information, and generates (a) target information as estimated information concerning a position of a plurality of users, and who each of the plurality of users is, and (b) signal information indicating the event generation source such as a user who is speaking to thereby output the generated information to the processing determination unit 132 .
  • the target information updating unit 141 of the information integration processing unit 131 performs a particle filtering process to which a plurality of particles, in which a plurality of target data corresponding to virtual users are set, are applied, and generates analysis information including position information of users present in the real space. That is, each packet of target data set in the particles is set in association with each event input from the event detection unit, and updating of the target data corresponding to the event selected from each of the particles is performed according to an input event identifier.
  • the target information updating unit 141 calculates a likelihood between the event generation source hypothesis target set in each of the particles and the event information input from the event detection unit, and sets a value corresponding to the magnitude of the likelihood as the weight of each particle, so that a re-sampling process that preferentially selects particles having large weights is performed to update the particles. This process will be described later.
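  • The re-sampling step mentioned above selects particles in proportion to their weights. The sketch below uses systematic resampling as one common way to do this; the patent does not specify the resampling scheme in this excerpt, so this is an illustrative assumption.

```python
import numpy as np

# Hedged sketch of the re-sampling step: particles with larger weights are
# selected preferentially. Systematic resampling is used here as one common
# choice; the resampling scheme itself is not specified in this excerpt.

def systematic_resample(weights):
    n = len(weights)
    positions = (np.arange(n) + np.random.random()) / n
    cumulative = np.cumsum(weights / np.sum(weights))
    return np.searchsorted(cumulative, positions)   # indices of surviving particles

weights = np.array([0.05, 0.6, 0.05, 0.3])
print(systematic_resample(weights))  # the heavy particle (index 1) survives most often
```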
  • updating over time is performed.
  • the signal information is generated as a probability value of the event generation source.
  • the utterance source probability calculation unit 142 of the information integration processing unit 131 inputs the voice event information detected in the voice event detection unit 122 , and calculates, using an identification model (identifier), a probability that each target is the utterance source of the input voice event.
  • the utterance source probability calculation unit 142 generates signal information concerning a voice event based on the calculated value, and outputs the generated information to the processing determination unit 132 .
  • the information integration processing unit 131 inputs event information shown in FIG. 3B from the voice event detection unit 122 and the image event detection unit 112 , that is, the user position information and the user identification information (face identification information or utterer identification information), generates (a) target information as estimated information concerning a position of a plurality of users, and who each of the plurality of users is, and (b) signal information indicating an event generation source of, for example, a user, or the like who is speaking, and outputs the generated information to the processing determination unit 132 .
  • This processing sequence will be described with reference to the flowchart shown in FIG. 10 .
  • In step S 101 , the information integration processing unit 131 inputs event information such as (a) user position information, (b) user identification information (face identification information or utterer identification information), and (c) face attribute information (face attribute score) from the voice event detection unit 122 and the image event detection unit 112 .
  • step S 102 When acquisition of the event information succeeds, the process proceeds to step S 102, and when acquisition of the event information fails, the process proceeds to step S 121.
  • the process of step S 121 will be described later.
  • the information integration processing unit 131 determines whether a voice event is input in step S 102 .
  • When the input event is a voice event, the process proceeds to step S 111, and when the input event is an image event, the process proceeds to step S 103.
  • step S 111 a probability that each target is the utterance source of the input voice event is calculated using an identification model (identifier).
  • the calculated result is output to the processing determination unit 132 (see FIG. 2 ) as the signal information based on the voice event. Details of step S 111 will be described later.
  • step S 103 updating of the particles based on the input information is performed; however, before the particles are updated, whether a new target has to be set in each of the particles is determined in step S 103.
  • step S 104 When the number of events input from the image event detection unit 112 is larger than the number of targets, a new target has to be set. Specifically, this corresponds to a case in which a face that was not present until now appears in the image frame 350 shown in FIG. 5. In this case, the process proceeds to step S 104, and a new target is set in each particle. This target is set as a target to be updated so as to correspond to the new event.
  • As for the event generation source: for example, when the event is a voice event, the user who is speaking is the event generation source, and when the event is an image event, the user having the extracted face is the event generation source.
  • the event generation source hypothesis for the acquired event is generated in each of the particles so that no overlap occurs.
  • step S 106 a weight corresponding to each particle, that is, a particle weight [W pID ] is calculated.
  • a uniform value is initially set for each particle, and the weight is then updated according to the event input.
  • the particle weight [W pID ] corresponds to an index of the correctness of the hypothesis of each particle, each particle generating a hypothesis target of the event generation source.
  • the particle weight [W pID ] is calculated as a value corresponding to the sum of the likelihoods between the event and the targets, the likelihood being the similarity index between the event and the target calculated in each particle.
  • the process of calculating the likelihood shown in a lower end of FIG. 11 is performed such that (a) inter-Gaussian distribution likelihood [DL] as similarity data between an event with respect to user position information and target data, and (b) inter-user confirmation degree information (uID) likelihood [UL] as similarity data between an event with respect to user identification information (face identification information or utterer identification information) and target data are separately calculated.
  • DL: inter-Gaussian distribution likelihood
  • uID: inter-user confirmation degree information
  • the calculation of the inter-Gaussian distribution likelihood [DL], that is, the similarity data between the event and the hypothesis target with respect to (a) the user position information, is performed as follows.
  • the inter-user confirmation degree information (uID) likelihood [UL] is calculated by the following equation using, as Pt[i], a value (score) of confirmation degree of each of the users 1 to k of the user confirmation degree information (uID) of the hypothesis target selected from the particle.
  • UL = Σ P e [i] × P t [i] (sum over i = 1 to k)
  • [W pID ] = Σ n ( UL^α × DL^(1−α) ), where α is a preset weight coefficient
  • n denotes the number of targets corresponding to an event included in a particle.
  • the particle weight [W pID ] is calculated with respect to each of the particles.
  • step S 106 The calculation of the weight [W pID ] corresponding to each particle in step S 106 of the flowchart of FIG. 10 is performed as the process described with reference to FIG. 11 .
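  • As a concrete illustration of this weight calculation, the sketch below combines the two likelihoods over the hypothesis targets of one particle; the Gaussian-likelihood form, the event fields (m, sigma2, uid), and the value of α are assumptions made for this example.

```python
import math

def gaussian_likelihood(m_e, sigma2_e, m_t, sigma2_t):
    # Inter-Gaussian distribution likelihood [DL]: similarity of the event position
    # Gaussian N(m_e, sigma_e) and the target position Gaussian N(m_t, sigma_t),
    # evaluated here as the density of their difference at zero.
    var = sigma2_e + sigma2_t
    return math.exp(-0.5 * (m_e - m_t) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def uid_likelihood(p_event, p_target):
    # Inter-user confirmation degree likelihood: UL = sum_i Pe[i] x Pt[i].
    return sum(pe * pt for pe, pt in zip(p_event, p_target))

def particle_weight(hypothesis_targets, event, alpha=0.5):
    # [W pID] = sum over the event-corresponding targets of UL^alpha x DL^(1-alpha).
    w = 0.0
    for t in hypothesis_targets:
        dl = gaussian_likelihood(event.m, event.sigma2, t.m, t.sigma2)
        ul = uid_likelihood(event.uid, t.uid)
        w += (ul ** alpha) * (dl ** (1.0 - alpha))
    return w
```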
  • step S 107 a re-sampling process of the particle based on the particle weight [W pID ] of each particle set in step S 106 is performed.
  • the particle 1 is re-sampled with 40% probability, and the particle 2 is re-sampled with 10% probability.
  • For example, when the number of particles is m = 100 to 1,000, the re-sampled result is configured by particles having a distribution ratio corresponding to the particle weights.
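  • A minimal sketch of the re-sampling in step S 107, assuming plain multinomial re-sampling in proportion to the particle weights (the concrete re-sampling scheme is not prescribed above):

```python
import copy
import random

def resample(particles):
    # Normalize the particle weights [W pID] into selection probabilities.
    total = sum(p.weight for p in particles)
    probs = [p.weight / total for p in particles]
    # Draw the same number of particles with replacement: a particle holding 40% of
    # the total weight is selected with 40% probability per draw, and so on.
    chosen = random.choices(particles, weights=probs, k=len(particles))
    new_particles = [copy.deepcopy(p) for p in chosen]
    # After re-sampling, the weights are reset to a uniform value.
    for p in new_particles:
        p.weight = 1.0 / len(new_particles)
    return new_particles
```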
  • step S 108 updating of target data (user position and user confirmation degree) included in each particle is performed.
  • each target is configured by data such as:
  • user confirmation degree information: uID t1 = Pt[1], uID t2 = Pt[2], …, uID tk = Pt[k]
  • “i” is an event ID.
  • the expected value S tID of the face event attribute is calculated by the following Equation 2, using the complement [1 − Σ eID P eID (tID)] and the prior knowledge value [S prior ].
  • S tID = Σ eID P eID (tID) × S eID + (1 − Σ eID P eID (tID)) × S prior <Equation 2>
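  • As a worked numerical example (values chosen purely for illustration): with two face events whose generation source probabilities for target tID are P 1 (tID) = 0.6 and P 2 (tID) = 0.2, face attribute scores S 1 = 0.9 and S 2 = 0.3, and prior value S prior = 0.5, Equation 2 gives S tID = 0.6×0.9 + 0.2×0.3 + (1 − 0.8)×0.5 = 0.70.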
  • the updating of the target data in step S 108 is performed with respect to each of (a) user position, (b) user confirmation degree, and (c) expected value of face attribute (expected value (probability) being an utterer in this embodiment).
  • (a) user position will be described.
  • the updating of (a) the user position is performed in the following two stages: (a1) updating with respect to all targets of all particles, and (a2) updating with respect to the event generation source hypothesis target set in each particle.
  • the (a1) updating with respect to all targets of all particles is applied both to targets selected as the event generation source hypothesis target and to the other targets. It is performed on the assumption that the dispersion of the user position expands over time, and uses a Kalman filter driven by the elapsed time since the previous updating process and by the position information of the event.
  • m t denotes a predicted expectation value (predicted state)
  • σ t ² denotes a predicted covariance (predicted estimation covariance)
  • xc denotes movement information (control model)
  • σ c ² denotes noise (process noise).
  • Gaussian distribution N(m t , σ t ) as the user position information included in all targets is updated.
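  • A minimal sketch of this prediction update, assuming the usual form in which the mean follows the movement information xc and the variance grows with the process noise over the elapsed time dt (the exact update expressions are not reproduced in the text above):

```python
def predict_position(target, dt, xc=0.0, sigma2_c=0.01):
    # Applied to every target of every particle: the position dispersion expands
    # with the elapsed time dt since the previous updating process.
    target.m = target.m + xc * dt                   # m_t: predicted expectation value
    target.sigma2 = target.sigma2 + sigma2_c * dt   # sigma_t^2: predicted covariance
    return target
```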
  • step S 104 a target selected according to the set event generation source hypothesis is updated.
  • In the updating process performed based on the event generation source hypothesis, the updating of the target that can be associated with the event is performed.
  • the updating process using the Gaussian distribution N(m e , σ e ) indicating the user position included in the event information input from the voice event detection unit 122 or the image event detection unit 112 is performed.
  • K denotes the Kalman Gain
  • m e denotes the observed value (observed state) included in the input event information: N(m e , σ e )
  • σ e ² denotes the observed covariance included in the input event information: N(m e , σ e )
  • K = σ t ² / (σ t ² + σ e ²)
  • m t = m t + K × (m e − m t )
  • σ t ² = (1 − K) × σ t ²
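  • A sketch of this update for the hypothesis target, written directly from the Kalman gain and the observed Gaussian N(m e , σ e ) defined above (the target fields follow the earlier sketch):

```python
def correct_position(target, m_e, sigma2_e):
    # K = sigma_t^2 / (sigma_t^2 + sigma_e^2)
    K = target.sigma2 / (target.sigma2 + sigma2_e)
    # m_t <- m_t + K * (m_e - m_t)
    target.m = target.m + K * (m_e - target.m)
    # sigma_t^2 <- (1 - K) * sigma_t^2
    target.sigma2 = (1.0 - K) * target.sigma2
    return target
```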
  • step S 108 an updating process with respect to the user confirmation degree information (uID) is performed.
  • the update rate [β] corresponds to a value of 0 to 1, and is set in advance.
  • user confirmation degree information: uID t1 = Pt[1], uID t2 = Pt[2], …, uID tk = Pt[k]
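  • The concrete uID update expression is not reproduced above; a plausible sketch, assuming the confirmation degrees Pt[i] are blended with the event's user identification distribution Pe[i] at the preset update rate β:

```python
def update_uid(target, p_event, beta=0.3):
    # Assumed blending: Pt[i] <- (1 - beta) * Pt[i] + beta * Pe[i],
    # followed by a renormalization as a numerical safeguard.
    target.uid = [(1.0 - beta) * pt + beta * pe
                  for pt, pe in zip(target.uid, p_event)]
    total = sum(target.uid)
    target.uid = [p / total for p in target.uid]
    return target
```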
  • (c) expected value of the face attribute: the expected value (probability) of being the utterer in this embodiment
  • the target information is generated based on the above described data and each particle weight [W pID ], and the generated target information is output to the processing determination unit 132.
  • the target information is the data shown as the target information 380 at the right end of FIG. 7.
  • For example, the user position of a target is expressed as the weighted sum of the corresponding Gaussian distributions over all particles: Σ i=1 to m W i × N(m i1 , σ i1 ) (Equation A)
  • W i denotes the particle weight [W pID ] of the i-th particle, and m denotes the number of particles.
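  • A sketch of how Equation A can be evaluated: the per-particle Gaussians of one target are combined into a weighted mixture whose mean serves as the integrated position estimate; the mixture-variance line is an addition made for illustration only.

```python
def integrated_position(particles, target_index=0):
    # Equation A: sum over the m particles of W_i x N(m_i1, sigma_i1) for one target.
    weights = [p.weight for p in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    means = [p.targets[target_index].m for p in particles]
    variances = [p.targets[target_index].sigma2 for p in particles]
    mix_mean = sum(w * m for w, m in zip(weights, means))
    # Mixture variance via the law of total variance (illustrative addition).
    mix_var = sum(w * (v + (m - mix_mean) ** 2)
                  for w, m, v in zip(weights, means, variances))
    return mix_mean, mix_var
```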
  • the signal information indicating the event generation source is data indicating who is speaking, that is, data indicating the utterer, in the case of the voice event, and is data indicating whose face is included in the image, and thus likewise indicating the utterer, in the case of the image event.
  • This data is output to the processing determination unit 132 as the signal information indicating the event generation source.
  • step S 109 When the process of step S 109 is completed, the process returns to step S 101 to thereby proceed to a waiting state for input of the event information from the voice event detection unit 122 and the image event detection unit 112 .
  • The descriptions of steps S 101 to S 109 shown in FIG. 10 have now been given.
  • When the information integration processing unit 131 does not acquire the event information shown in FIG. 3B from the voice event detection unit 122 and the image event detection unit 112 in step S 101, updating of the configuration data of the targets included in each of the particles is performed in step S 121.
  • This updating is a process considering a change in the user position over time.
  • the updating of the targets is the same process as the (a1) updating with respect to all targets of all particles described for step S 108; it is performed on the assumption that the dispersion of the user position expands over time, and uses the Kalman filter driven by the elapsed time since the previous updating process and by the position information of the event.
  • the predicted user position after the elapsed time [dt] from the previous updating process is calculated for all targets. That is, the following updating is performed on the Gaussian distribution N(m t , σ t ) serving as the distribution information of the user position, that is, on its expected value (average) [m t ] and its variance [σ t ].
  • m t denotes a predicted expectation value (predicted state)
  • σ t ² denotes a predicted covariance (predicted estimation covariance)
  • xc denotes movement information (control model)
  • σ c ² denotes noise (process noise).
  • Gaussian distribution N(m t , σ t ) as the user position information included in all targets is updated.
  • step S 121 After the process of step S 121 is completed, whether elimination of a target is necessary is determined in step S 122, and when the elimination of the target is necessary, the target is eliminated in step S 123.
  • the elimination of the target is performed as a process of eliminating data in which a specific user position is not obtained, such as a case in which a peak is not detected in the user position information included in the target, and the like.
  • After steps S 122 to S 123, or when the elimination is determined to be unnecessary, the process returns to step S 101 to thereby proceed to a waiting state for input of the event information from the voice event detection unit 122 and the image event detection unit 112.
  • the information integration processing unit 131 repeatedly performs the process based on the flowchart shown in FIG. 10 for each input of the event information from the voice event detection unit 122 and the image event detection unit 112 .
  • the weight of a particle in which a more reliable target is set as the hypothesis target increases, and particles with larger weights remain through the re-sampling process based on the particle weights.
  • As a result, highly reliable data similar to the event information input from the voice event detection unit 122 and the image event detection unit 112 remains, so that the following highly reliable information is ultimately generated and output to the processing determination unit 132: (a) target information, that is, estimated information indicating the position of each of a plurality of users and who each of those users is, and (b) signal information indicating the event generation source, for example, the user who is speaking.
  • the signal information includes two items: (b1) signal information based on a voice event, generated by the process of step S 111, and (b2) signal information based on an image event, generated by the process of steps S 103 to S 109.
  • step S 111 shown in the flowchart of FIG. 10 , that is, a process of generating signal information based on a voice event will be described in detail.
  • the information integration processing unit 131 shown in FIG. 2 includes the target information updating unit 141 and the utterance source probability calculation unit 142 .
  • the target information updated for each item of image event information in the target information updating unit 141 is output to the utterance source probability calculation unit 142.
  • the utterance source probability calculation unit 142 generates the signal information based on the voice event by applying the voice event information input from the voice event detection unit 122 and the target information updated for each item of image event information in the target information updating unit 141. That is, this signal information is the utterance source probability, which indicates how likely each target is to be the utterance source of the voice event information.
  • the utterance source probability calculation unit 142 calculates the utterance source probability indicating how much each target resembles the utterance source of the voice event information using the target information input from the target information updating unit 141 .
  • FIG. 12 shows an example of the input information to the utterance source probability calculation unit 142, that is, (A) voice event information and (B) target information.
  • the (A) voice event information is voice event information input from the voice event detection unit 122 .
  • the (B) target information is the target information updated for each item of image event information in the target information updating unit 141.
  • To calculate the utterance source probability, the sound source direction information (position information) and the utterer identification information included in the voice event information shown in (A) of FIG. 12, as well as the lip movement information included in the image event information and the target position and the total number of targets included in the target information, are used.
  • the lip movement information originally included in the image event information is supplied to the utterance source probability calculation unit 142 from the target information updating unit 141 , as one item of the face attribute information included in the target information.
  • the lip movement information in this embodiment is generated from a lip state score obtainable by applying the visual speech detection technique.
  • the visual speech detection technique is described in, for example, [Visual lip activity detection and speaker detection using mouth region intensities/IEEE Transactions on Circuits and Systems for Video Technology, Volume 19, Issue 1 (January 2009), Pages: 133-137 (see, URL: http://poseidon.csd.auth.gr/papers/PUBLISHED/JOURNAL/pdf/Siatras09a)], [Facilitating Speech Detection in Style!: The Effect of Visual Speaking Style on the Detection of Speech in Noise Auditory-Visual Speech Processing 2005 (see, URL: http://www.isca-speech.org/archive/avsp05/av05_023.html)], and the like, and this technique may be applied.
  • a graph of the time/lip state score shown in the bottom of the target information of (B) of FIG. 12 corresponds to the lip movement information.
  • the lip movement information is normalized with the sum of the lip movement information of all targets.
  • As techniques for such processing, Japanese Unexamined Patent Application Publication No. 2010-20294 and Japanese Unexamined Patent Application Publication No. 2004-286805 are disclosed, for example, and these existing processes may be applied.
  • the utterance source probability calculation unit 142 acquires (a) user position information (sound source direction information) and (b) user identification information (utterer identification information) corresponding to the utterance, as the voice event information input from the voice event detection unit 122.
  • the utterance source probability calculation unit 142 acquires (a) user position information, (b) user identification information, and (c) lip movement information, as the target information updated for each item of image event information in the target information updating unit 141.
  • the utterance source probability calculation unit 142 generates the probability (signal information) that each target is the utterance source based on the above described information, and outputs the generated probability to the processing determination unit 132.
  • the processing example shown in the flowchart of FIG. 13 is a processing example using an identifier in which targets are individually selected, and an utterance source probability (utterance source score) indicating whether the target is a generation source is determined from only information of the selected target.
  • step S 201 a single target acting as a target to be processed is selected from all targets.
  • step S 202 an utterance source score is obtained as a probability value indicating whether the selected target is the utterance source, using the identifier of the utterance source probability calculation unit 142.
  • the identifier is an identifier for calculating the utterance source probability for each target, based on input information such as (a) user position information (sound source direction information) and (b) user identification information (utterer identification information) input from the voice event detection unit 122 , and (a) user position information, (b) user identification information, (c) lip movement information, and (d) target position or the number of targets input from the target information updating unit 141 .
  • the input information of the identifier may be all of the above described information, or only some items of the input information may be used.
  • step S 202 the identifier calculates the utterance source score as the probability value indicating whether the selected target is the utterance source.
  • step S 202 Details of the process of calculating the utterance source score performed in step S 202 will be described later in detail with reference to FIG. 14 .
  • step S 203 whether other unprocessed targets are present is determined, and when the other unprocessed targets are present, processes after step S 201 are performed with respect to the other unprocessed targets.
  • step S 203 when the other unprocessed targets are absent, the process proceeds to step S 204 .
  • step S 204 the utterance source score obtained for each target is normalized with the sum of the utterance source scores of all of the targets, and the normalized score is determined to be the utterance source probability corresponding to each target.
  • a target with the highest utterance source score is estimated to be the utterance source.
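  • A sketch of steps S 201 to S 204: the utterance source score is computed target by target and then normalized over all targets to give the utterance source probabilities (a sketch of the score function itself follows the coefficient definitions below); the function interface is an assumption made for this example.

```python
def utterance_source_probabilities(targets, voice_event, score_fn):
    # Steps S201-S202: obtain an utterance source score for every target.
    scores = [score_fn(target, voice_event) for target in targets]
    # Step S204: normalize with the sum over all targets; the target with the
    # highest resulting probability is estimated to be the utterance source.
    total = sum(scores)
    return [s / total for s in scores]
```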
  • the utterance source score is calculated in the utterance source probability calculation unit 142 shown in FIG. 2. That is, the utterance source score is calculated as a probability value indicating whether or not the selected target is the utterance source.
  • the utterance source probability calculation unit 142 inputs (a) user position information (sound source direction information), and (b) user identification information (utterer identification information) from the voice event detection unit 122 , and inputs (a) user position information, (b) user identification information, (c) lip movement information, and (d) target position or the total number of targets from the target information updating unit 141 , to calculate the utterance source score for obtaining the utterance source probability for each target.
  • the utterance source probability calculation unit 142 may be configured to calculate the score using all of the information described above, or using only a part thereof.
  • the equation of calculating the utterance source score P using the three kinds of information D, S, and L may be defined by the following equation, for example, as shown in FIG. 14 .
  • P = D^α × S^β × L^γ, where:
  • D is sound source direction information
  • S is utterer identification information
  • L is lip movement information
  • α is a weight coefficient of the sound source direction information
  • β is a weight coefficient of the utterer identification information
  • γ is a weight coefficient of the lip movement information
  • the utterance source score calculation equation P = D^α × S^β × L^γ is applied, and the utterance source score is calculated as a probability value indicating whether or not the selected target is the utterance source.
  • a condition is that three kinds of information of D: sound source direction information, S: utterer identification information, and L: lip movement information are acquired as input information.
  • α: weight coefficient of the sound source direction information
  • β: weight coefficient of the utterer identification information
  • γ: weight coefficient of the lip movement information
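  • A sketch of the score calculation itself, under the stated condition that the three kinds of information D, S, and L have been obtained for the selected target; how D, S, and L are mapped to scalar values is not specified above and is assumed here to yield values in (0, 1].

```python
def utterance_source_score(d, s, l, alpha, beta, gamma):
    # P = D^alpha x S^beta x L^gamma, with alpha + beta + gamma = 1.
    # d: match between the sound source direction and the target position
    # s: utterer identification score for the target
    # l: normalized lip movement score of the target
    assert abs(alpha + beta + gamma - 1.0) < 1e-6
    return (d ** alpha) * (s ** beta) * (l ** gamma)
```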
  • one voice recognition result included in the voice event input from the voice event detection unit 122 includes the following information.
  • the utterance source probability calculation unit 142 adjusts all the weight coefficients by changing the weights of the lip movement information and the sound source direction information according to whether or not there is time overlap between the voice event that is the target of the utterance source score calculation and the just previous voice event, and whether or not there is another target close in position to the target, and calculates the utterance source score using the adjusted weight coefficients.
  • the utterance source probability calculation unit 142 acquires information about whether or not there is time overlap of utterances and whether or not there is another target close in position, on the basis of the information input from the image event detection unit 112 and the voice event detection unit 122, and determines the coefficients (α, β, and γ) applied to the calculation of the utterance source score.
  • step S 301 time overlap between the voice event that is the processing target of the process of calculating the utterance source score and the just previous voice event is confirmed.
  • the determination of whether or not there is time overlap may be performed only for the subsequent voice event, that is, the one shifted later in time. This is because it is difficult to determine definitively whether or not another voice event overlaps in time at the time point when the preceding voice event is detected (that is, at the time when the end time of the preceding voice event is determined).
  • step S 302 it is confirmed whether or not there is the other target close in position to the processing target.
  • this process may be performed using the user position information input from the target information updating unit 141 .
  • step S 303 the weight coefficients α (weight coefficient of the sound source direction information) and γ (weight coefficient of the lip movement information) are changed according to whether or not there is the time overlap determined in step S 301 and whether or not there is another target close in position as determined in step S 302, and all the weight coefficients are adjusted.
  • the utterance source probability calculation unit 142 dynamically adjusts the values to which the weight coefficients of the input information are set, according to the situation in which the voice is uttered.
  • a preset threshold value is applied to the difference of the sound source directions, that is, to the angle representing the sound source direction, to determine whether the targets are close or far.
  • For example, an absolute difference in sound source direction of 10° or less corresponds to a distance between two users of about 53 cm or less at a position 3 m away from the microphone (3 m × tan 10° ≈ 0.53 m).
  • In the following, "the sound source direction is close" may be read as "the distance between the users is small" or "the positions of the users are close".
  • the utterance source probability calculation unit 142 does not perform the adjustment of all the weight coefficients (α, β, and γ), and uses the preset values.
  • the utterance source probability calculation unit 142 performs the process of adjusting the weight coefficients α, β, and γ such that the weight γ of the lip movement information is made small.
  • the utterance source probability calculation unit 142 performs the process of adjusting the weight coefficients α, β, and γ such that the weight α of the sound source direction information is made small.
  • the adjustment is performed such that the weight (γ) of the lip movement information and the weight (α) of the sound source direction information are made small.
  • the utterance source probability calculation unit 142 performs the process of adjusting the weight coefficients α, β, and γ such that the weight γ of the lip movement information and the weight α of the sound source direction information are made small.
  • An example summarizing the adjustment of these weight coefficients (α, β, and γ) is shown in FIG. 17.
  • FIG. 18A and FIG. 18B are diagrams illustrating the following two examples as specific adjustment examples of the weight coefficients (α, β, and γ).
  • the other two coefficients are adjusted such that their ratio remains the same as in the preset values.
  • step S 303 in the flowchart shown in FIG. 15, for example, the weight coefficients α (weight coefficient of the sound source direction information), β (weight coefficient of the utterer identification information), and γ (weight coefficient of the lip movement information) are adjusted as described above.
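  • A sketch of the adjustment in step S 303, based only on what is stated above: the coefficients selected for de-emphasis according to FIG. 17 (not reproduced here) are reduced, and all coefficients are rescaled so that they again sum to 1 while the untouched ones keep their preset ratio; the reduction factor and the flags are assumptions made for this example.

```python
def adjust_weight_coefficients(alpha, beta, gamma,
                               reduce_alpha=False, reduce_gamma=False,
                               shrink=0.5):
    # reduce_alpha / reduce_gamma mark which weights are to be made small
    # (the sound source direction weight alpha and/or the lip movement weight gamma),
    # depending on time overlap of utterances and on targets that are close in position.
    if reduce_alpha:
        alpha *= shrink
    if reduce_gamma:
        gamma *= shrink
    # Re-normalize so that alpha + beta + gamma = 1; coefficients that were not
    # reduced keep the same ratio to each other as in the preset values.
    total = alpha + beta + gamma
    return alpha / total, beta / total, gamma / total
```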
  • the utterance source score is calculated for each target, and the target with the highest score can be determined to be the utterance source by comparing the scores.
  • Both the utterer identification information and the lip movement information are considered, the weight coefficients applied to these items of information are changed when calculating the utterance source score, and the utterance source probability is obtained from the calculated scores.
  • the technique disclosed in the specification may have the following configurations.
  • An information processing apparatus including:
  • an event detection unit that generates event information including estimated position information and estimated identification information of users present in the real space based on analysis of the information input from the information input unit;
  • an information integration processing unit that inputs the event information, and generates target information including a position of each user and user identification information on the basis of the input event information, and signal information representing a probability value of the event generation source,
  • the information integration processing unit includes an utterance source probability calculation unit
  • the utterance source probability calculation unit performs a process of calculating an utterance source score as an index value representing an utterance source probability of each target by multiplying weights based on utterance situations by a plurality of different information items input from the event detection unit.
  • the utterance source probability calculation unit receives an input of (a) user position information (sound source direction information) and (b) user identification information (utterer identification information) corresponding to an utterance event as input information from a voice event detection unit constituting the event detection unit, receives an input of (a) user position information (face position information), (b) user identification information (face identification information), and (c) lip movement information as the target information generated based on input information from an image event detection unit constituting the event detection unit, and performs a process of calculating the utterance source score based on at least one item of this input information.
  • α is a weight coefficient of the sound source direction information
  • β is a weight coefficient of the utterer identification information
  • γ is a weight coefficient of the lip movement information
  • α + β + γ = 1.
  • the configuration of the present disclosure also includes a method for the process performed in the apparatus described above, and a program for executing the process.
  • a series of processes described throughout the specification can be performed by hardware or software or by a complex configuration of both.
  • a program in which the processing sequence is recorded may be installed in the memory of a computer built into dedicated hardware and executed there, or installed in a general-purpose computer capable of performing various processes and executed there.
  • the program may be recorded in a recording medium in advance.
  • the program can also be received via a network such as a LAN (Local Area Network) or the Internet and installed in a recording medium such as a built-in hard disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Psychiatry (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Social Psychology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)
US13/669,843 2011-11-11 2012-11-06 Determining the position of the source of an utterance Expired - Fee Related US9002707B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2011247130A JP2013104938A (ja) 2011-11-11 2011-11-11 情報処理装置、および情報処理方法、並びにプログラム
JP2011-247130 2011-11-11

Publications (2)

Publication Number Publication Date
US20130124209A1 US20130124209A1 (en) 2013-05-16
US9002707B2 true US9002707B2 (en) 2015-04-07

Family

ID=48281470

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/669,843 Expired - Fee Related US9002707B2 (en) 2011-11-11 2012-11-06 Determining the position of the source of an utterance

Country Status (3)

Country Link
US (1) US9002707B2 (zh)
JP (1) JP2013104938A (zh)
CN (1) CN103106390A (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10178301B1 (en) * 2015-06-25 2019-01-08 Amazon Technologies, Inc. User identification based on voice and face

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2831706B1 (en) * 2012-03-26 2018-12-26 Tata Consultancy Services Limited A multimodal system and method facilitating gesture creation through scalar and vector data
US11209897B2 (en) 2014-04-25 2021-12-28 Lenovo (Singapore) Pte. Ltd. Strengthening prediction confidence and command priority using natural user interface (NUI) inputs
JP6592940B2 (ja) * 2015-04-07 2019-10-23 ソニー株式会社 情報処理装置、情報処理方法、及びプログラム
JP6501260B2 (ja) * 2015-08-20 2019-04-17 本田技研工業株式会社 音響処理装置及び音響処理方法
CN105913849B (zh) * 2015-11-27 2019-10-25 中国人民解放军总参谋部陆航研究所 一种基于事件检测的说话人分割方法
GB2567600B (en) 2016-08-29 2022-05-04 Groove X Inc Autonomously acting robot that recognizes direction of sound source
WO2018072214A1 (zh) * 2016-10-21 2018-04-26 向裴 混合现实音频系统
CN106782595B (zh) * 2016-12-26 2020-06-09 云知声(上海)智能科技有限公司 一种降低语音泄露的鲁棒阻塞矩阵方法
JP6472823B2 (ja) * 2017-03-21 2019-02-20 株式会社東芝 信号処理装置、信号処理方法および属性付与装置
WO2018173139A1 (ja) * 2017-03-22 2018-09-27 ヤマハ株式会社 撮影収音装置、収音制御システム、撮影収音装置の制御方法、及び収音制御システムの制御方法
WO2018178207A1 (en) * 2017-03-31 2018-10-04 Sony Corporation Apparatus and method
JP6853163B2 (ja) * 2017-11-27 2021-03-31 日本電信電話株式会社 話者方向推定装置、話者方向推定方法、およびプログラム
JP7120254B2 (ja) * 2018-01-09 2022-08-17 ソニーグループ株式会社 情報処理装置、情報処理方法、およびプログラム
CN110610718B (zh) * 2018-06-15 2021-10-08 炬芯科技股份有限公司 一种提取期望声源语音信号的方法及装置
CN111081257A (zh) * 2018-10-19 2020-04-28 珠海格力电器股份有限公司 一种语音采集方法、装置、设备及存储介质
WO2020213245A1 (ja) * 2019-04-16 2020-10-22 ソニー株式会社 情報処理装置、情報処理方法、及びプログラム
CN110189242B (zh) * 2019-05-06 2023-04-11 阿波罗智联(北京)科技有限公司 图像处理方法和装置
CN110767226B (zh) * 2019-10-30 2022-08-16 山西见声科技有限公司 具有高准确度的声源定位方法、装置、语音识别方法、系统、存储设备及终端
JP7111206B2 (ja) * 2021-02-17 2022-08-02 日本電信電話株式会社 話者方向強調装置、話者方向強調方法、およびプログラム
CN114454164B (zh) * 2022-01-14 2024-01-09 纳恩博(北京)科技有限公司 机器人的控制方法和装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6009396A (en) * 1996-03-15 1999-12-28 Kabushiki Kaisha Toshiba Method and system for microphone array input type speech recognition using band-pass power distribution for sound source position/direction estimation
US20020161577A1 (en) * 2001-04-25 2002-10-31 International Business Mashines Corporation Audio source position detection and audio adjustment
US20030033150A1 (en) * 2001-07-27 2003-02-13 Balan Radu Victor Virtual environment systems
US20080120100A1 (en) * 2003-03-17 2008-05-22 Kazuya Takeda Method For Detecting Target Sound, Method For Detecting Delay Time In Signal Input, And Sound Signal Processor
JP2009031951A (ja) 2007-07-25 2009-02-12 Sony Corp 情報処理装置、および情報処理方法、並びにコンピュータ・プログラム
JP2009140366A (ja) 2007-12-07 2009-06-25 Sony Corp 情報処理装置、および情報処理方法、並びにコンピュータ・プログラム
US8311233B2 (en) * 2004-12-02 2012-11-13 Koninklijke Philips Electronics N.V. Position sensing using loudspeakers as microphones

Also Published As

Publication number Publication date
US20130124209A1 (en) 2013-05-16
JP2013104938A (ja) 2013-05-30
CN103106390A (zh) 2013-05-15

Similar Documents

Publication Publication Date Title
US9002707B2 (en) Determining the position of the source of an utterance
US20120035927A1 (en) Information Processing Apparatus, Information Processing Method, and Program
US8140458B2 (en) Information processing apparatus, information processing method, and computer program
JP4462339B2 (ja) 情報処理装置、および情報処理方法、並びにコンピュータ・プログラム
US20110224978A1 (en) Information processing device, information processing method and program
US10740598B2 (en) Multi-modal emotion recognition device, method, and storage medium using artificial intelligence
JP4730404B2 (ja) 情報処理装置、および情報処理方法、並びにコンピュータ・プログラム
US20100185571A1 (en) Information processing apparatus, information processing method, and program
US20190341058A1 (en) Joint neural network for speaker recognition
US9899025B2 (en) Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities
CN112088315A (zh) 多模式语音定位
JP7370014B2 (ja) 収音装置、収音方法、及びプログラム
JP2005141687A (ja) 物体追跡方法、物体追跡装置、物体追跡システム、プログラム、および、記録媒体
JP2009042910A (ja) 情報処理装置、および情報処理方法、並びにコンピュータ・プログラム
US11460927B2 (en) Auto-framing through speech and video localizations
WO2019171780A1 (ja) 個人識別装置および特徴収集装置
JP2013257418A (ja) 情報処理装置、および情報処理方法、並びにプログラム
JP5940944B2 (ja) 視聴状況判定装置、識別器構築装置、視聴状況判定方法、識別器構築方法およびプログラム
CN114911449A (zh) 音量控制方法、装置、存储介质和电子设备
CN115862597A (zh) 人物类型的确定方法、装置、电子设备和存储介质
WO2023277888A1 (en) Multiple perspective hand tracking
Wang et al. Real-time automated video and audio capture with multiple cameras and microphones
CN115910047B (zh) 数据处理方法、模型训练方法、关键词检测方法及设备
CN116129946A (zh) 语音端点检测方法、装置、存储介质及电子设备

Legal Events

Date Code Title Description
AS Assignment

Owner name: SONY CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YAMADA, KEIICHI;REEL/FRAME:029712/0964

Effective date: 20121105

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20230407