WO2007026280A1 - Dialogue system for interacting with a person by means of the person's visual and speech identities - Google Patents

Dialogue system for interacting with a person by means of the person's visual and speech identities

Info

Publication number
WO2007026280A1
Authority
WO
WIPO (PCT)
Prior art keywords
person
speech
visual
ids
dialogue system
Prior art date
Application number
PCT/IB2006/052915
Other languages
English (en)
Inventor
Holger Scholl
Original Assignee
Philips Intellectual Property & Standards GmbH
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Philips Intellectual Property & Standards GmbH and Koninklijke Philips Electronics N.V.
Publication of WO2007026280A1 publication Critical patent/WO2007026280A1/fr

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/06 - Decision making techniques; Pattern matching strategies
    • G10L 17/10 - Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/22 - Interactive procedures; Man-machine interfaces

Definitions

  • the present invention relates to a dialogue system and a method performed in the dialogue system for interacting with a person by making use of both the person's visual ID and the person's speech ID.
  • Speaker identification is a known technology which so far has indeed only been used for initial person identification. Once the person was identified by the system, ordinary speech recognition was used. This is of course sufficient if the system is close-ended, meaning that the dialogue with the system has a clear goal and end. It is then unlikely that the person will change during this interaction, so an initial identification is sufficient. The situation is different with open-ended systems, which remain active over a longer period of time and with which a person might interact only occasionally (e.g. every 10 minutes on average). This would be the case, for example, for a home jukebox application where the person can select music to be played. The system is personalized to that person, who might interact with it every now and then, but who also might not want any other person to interfere with the system (or for this to happen by accident due to wrong recognition of background speech).
  • US 20050027515 discloses a speech detection system that combines a conventional audio microphone with an additional speech sensor that provides a speech sensor signal based on an additional input.
  • the speech sensor signal is based on an action undertaken by a speaker during speech, such as facial movement, bone vibrations, throat vibration, throat impedance changes etc. Accordingly, it is assumed that the desired speaker carries said speech sensor for providing information about whether the speaker is speaking. However, this reference provides no information about the identity of the speaker or the desired speaker.
  • US 20040267536 discloses a system and a method facilitating speech detection and/or enhancement utilizing audio/video fusion.
  • This reference fuses audio and video in a probabilistic generative model that implements cross-modal, self-supervised learning, enabling rapid adaptation to audio-visual data. In this reference, too, no information about the identity of the speaker or the desired speaker is provided.
  • the present invention relates to a method performed in a dialogue system of interacting with a person based on both the person's visual ID and the person's speech ID, said method comprising the steps of: detecting the speech signal from said person and processing said speech ID therefrom; detecting said person visually and processing said visual ID therefrom; and repeatedly comparing whether said processed IDs match the IDs associated with said person, the interaction with said person being maintained while the IDs match.
  • in this way, said method is adapted to react in a much better and more natural way when interacting with said person at the dialogue level.
  • the dialogue communication may be stopped.
  • an identified person will hereafter be referred to as an active person.
  • the result of the processing will be that the new visual ID does not match the previous visual ID.
  • the system may be adapted to address the unidentified person accordingly, e.g. saying "I'm currently communicating with John".
  • the detected speech signal is divided into speech segments, wherein the speech ID is processed for each of said speech segments.
  • a speech ID is thus processed from each of said speech segments, each segment covering a short time interval.
  • if a continuous speech signal of e.g. 2 seconds is detected, it will be broken into many speech segments, e.g. 10-millisecond speech segments (200 speech segments in total).
  • in that way, the speech signals from an active person can be distinguished from those of unidentified persons or of another active person: e.g. the first 100 segments belong to the active person, while the rest of the speech segments belong to unidentified persons or to active persons who are not interacting with the dialogue system.
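  • as a rough illustration of this per-segment attribution, the following sketch splits a signal into fixed-length segments and attaches a speech ID to each one. The 16 kHz sample rate, the data layout and the classify_segment back end are illustrative assumptions, not part of the disclosure:

```python
SEGMENT_MS = 10  # e.g. 10-millisecond segments, as in the example above

def split_into_segments(samples, sample_rate_hz=16000):
    """Split a continuous speech signal into short fixed-length segments."""
    n = sample_rate_hz * SEGMENT_MS // 1000  # samples per segment
    return [samples[i:i + n] for i in range(0, len(samples), n)]

def attribute_segments(segments, classify_segment):
    """Attach a speech ID to every segment.

    classify_segment stands in for any speaker-identification back end;
    it returns an ID string, or None when the voices are too mixed.
    """
    return [classify_segment(seg) for seg in segments]

# e.g. a 2-second signal at 16 kHz yields 200 segments of 160 samples each
```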
  • the step of detecting said person visually comprises capturing video of said person and generating at least one image frame of said person therefrom, wherein said visual ID is processed from said at least one image frame.
  • each of said speech segments and each of said at least one image frame are harmonized so that their absolute times overlap. In that way, it is ensured that the real times of the speech segments and image frames overlap, i.e. that they are arranged into "pairs" having the same real time.
  • said harmonization comprises labeling those speech segments or image frames occurring at the higher rate according to the existing segments or frames occurring at the lower rate, so that their real times overlap.
  • a number of subsequent image frames may be labeled as having the same visual ID as the first image frame.
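  • a minimal sketch of this pairing step, assuming each image frame carries a single time instant and each speech segment a real-time interval (the tuple layout is an assumption made for illustration):

```python
def harmonize(frames, segments):
    """Pair image frames with the speech segments they overlap in real time.

    frames:   list of (visual_id, t) tuples, t being the frame's time instant
    segments: list of (speech_id, t_start, t_end) real-time intervals
    Returns (visual_id, speech_id) pairs whose real times overlap.
    """
    pairs = []
    for visual_id, t in frames:
        for speech_id, t_start, t_end in segments:
            if t_start <= t <= t_end:  # the segment's interval contains the frame
                pairs.append((visual_id, speech_id))
                break
    return pairs
```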
  • the step of detecting said person visually comprises detecting the person's face, and wherein said visual ID comprises the person's face ID.
  • the present invention further relates to a computer-readable medium having stored therein instructions for causing a processing unit to execute said methods.
  • the present invention relates to a dialogue system for interacting with a person based on both the person's visual ID and the person's speech ID, comprising:
  • a speech recognizer for detecting the speech signal from said person and processing said speech ID therefrom, a visual identifier for detecting said person visually and processing said visual ID therefrom, and a processor for repeatedly comparing whether said processed IDs match the IDs associated with said person and for maintaining the interaction with said person while the IDs match.
  • said visual identifier comprises a video camera for detecting said person visually, wherein the resulting video stream is subsequently cut into at least one image frame from which said visual ID is processed.
  • said dialogue system further comprises a memory for storing said IDs.
  • the speech recognizer is further adapted to adjust the dialogue behavior so as to react appropriately to the mismatch.
  • Such an appropriate reaction could e.g. comprise implying to said person that the dialogue system is currently not interacting with said person. This makes the dialogue system more human-like.
  • the speech recognizer would imply the mismatch between the speech IDs by e.g. saying to person B: "I'm currently interacting with person A".
  • the speech recognizer is further adapted to re-process said speech signal.
  • said dialogue system is further adapted to focus its attention physically on the person interacting with it when two or more persons are present, by focusing on the one of said two or more persons whose visual and speech IDs match.
  • said dialogue system becomes more human-like, since now it is provided with a "head" which can "look at" an interacting person.
  • the feature of maintaining the person in sight may be realized e.g. via audio localization, where the view is centered on the (loudest) speaking voice source, or via centering the visual identifier (e.g. the camera) on the face with the matching visual ID.
  • Said rotation means may e.g. comprise a rotation head or the like which enables a two-dimensional rotation (two degrees of freedom), or preferably a three-dimensional rotation (three degrees of freedom).
  • the system in contrast determines which person is the active one e.g. based on voice ID and centers its view (i.e. moves its head accordingly) onto the face with the matching visual ID, overcoming the disadvantages of existing approaches.
  • figure 1 illustrates the principle of the present invention, namely to enable a dialogue system to interact with one or more persons from a group of persons
  • figure 2 shows a method according to the present invention performed in a home dialogue system for interacting with a person based on both the person's speech ID and the person's visual ID
  • figure 3 shows an example of how to harmonize the video frames and the speech segments
  • figure 4 shows an extension of the example in Fig. 3, wherein the update frequency of the identification of an active person is increased
  • figure 5 shows a dialogue system according to the present invention.
  • Figure 1 illustrates the principle of the present invention, namely to enable a dialogue system 100 to interact with one or more persons 101-105 based on both the persons' speech IDs and visual IDs, which are pre-stored in the dialogue system 100.
  • the person's visual ID is the person's face ID, which may e.g. additionally include the shape of the person's head, specifics relating to the person's ears, the person's hair etc.
  • the dialogue system 100, which will be discussed in more detail later, comprises a camera or video camera for the visual detection and a microphone for the speech detection.
  • the scene as illustrated here shows a room with five persons 101-105 who interact with the dialogue system 100 and who have been identified by the system 100. These persons 101-105 will hereafter be referred to as active persons.
  • the main object of the present invention is to maintain the dialogue with each one of these active persons 101-105 in a very natural way, based on the visual and speech IDs of the persons, by repeatedly comparing whether the visual and speech IDs of the person currently interacting with the dialogue system match the IDs associated with said person. As long as the IDs match, the interaction with the active person is maintained.
  • an active person 101 is interacting with the dialogue system 100, whereby the system repeatedly compares whether the visual and speech IDs of active person 101 match the pre-stored visual and speech IDs associated with person 101. If active person 102 were to interrupt active person 101 by suddenly addressing the system 100, the resulting speech signal could first comprise a combination of speech signals from both active persons and thereafter a speech signal from active person 102 alone. In this scenario the speech processing would reveal that the processed speech ID (belonging to active person 102) does not match the speech ID of the active person who is currently interacting with the dialogue system, i.e. active person 101.
  • since the dialogue system 100 is adapted to address the persons personally, one possible reaction from the system 100 could be: "Hey John (active person 102), I'm currently interacting with Eric (active person 101); I'll interact with you as soon as I have assisted Eric".
  • the active person 102 might interrupt active person 101 without addressing the dialogue system 100.
  • the system 100 is adapted to behave in a very human-like manner. This is similar to the scenario where a person A is talking (interacting) with a person B, but hears a person C (whom he/she knows) talking to someone else and not looking towards person A. Person A would obviously not react to that, since he/she knows that person C is not addressing him/her (since there is no eye contact).
  • the dialogue system 100 of the present invention is adapted to behave as a normal human being would do.
  • if active person 102 had simultaneously gazed at the dialogue system 100, the system would have interpreted that as a "sign" to interact with it, i.e. in a way as eye contact.
  • the initial identification of the persons 101-105 to the dialogue system 100 comprises standing in front of the camera comprised in the system 100 for enabling face identification, and providing a speech test for the speech identification.
  • the initial visual ID of a person will be used for the subsequent processing, i.e. considered as constant.
  • an active person will be allowed to move out of the sight of the camera, e.g. to interact with the dialogue system 100 from another room, or to be among a large group of people where the active person is not visually observable (e.g. the person is relatively small, or is in another room).
  • the received speech signals are processed and compared to the speech ID, wherein the interaction is continued while the speech IDs match the initially identified speech ID.
  • the system could be adapted to immediately process the visual image of the person. Since the resulting visual ID would clearly not correspond to the initial visual ID the dialogue system 100 could e.g. respond saying, "I'm currently communicating with John".
  • Figure 2 shows a method according to the present invention performed in a home dialogue system 100 for interacting with a person based on both the person's speech ID and the person's visual ID.
  • the initial step of identifying the person is to detect the person both visually in step 201 (V_D) using e.g. a video camera and acoustically in step 202 (A_S) using a microphone.
  • the video stream is cut into image frames (C_F) in step 203, where each frame may be a full image of e.g. the person's face.
  • Each image frame is then processed and based thereon the person's face ID is determined in step 205 (Im_ID).
  • a short time speech segment is generated (S_S) in step 204 and processed, and based thereon the person's speech ID is determined in step 206 (Sp_ID).
  • This can be done on word or utterance level where a word or even a complete utterance may be labeled according to the dominant speech ID.
  • since the frame rates for video and speech may be different, they need to be harmonized (Harm) in step 207, so that each video frame and each speech segment have the same real time, or real times that overlap (since the image frame is typically only one time instant, it is essential that the real time of the speech segment includes the real time of the image frame). This will be discussed in more detail under Fig. 3.
  • the person is positively identified (Pos. Id) in step 209 if both the visual and speech IDs are recognized and match the same person. Otherwise, the person is not identified (N_Pos. Id) in step 210 by the system 100.
  • step 208 is performed, as previously mentioned, by repeatedly comparing the visual and speech IDs of an active person (e.g. the person marked 101 in Fig. 1), i.e. by regularly checking whether the two IDs belong to the active person who is currently interacting with the dialogue system 100. Since the active person can be among many other active persons, e.g. as shown in Fig. 1, the above-mentioned speech segments may partly belong to one of the active persons who is not interacting with the dialogue system 100. In such situations, the dialogue system 100 may be adapted to evaluate, based on pre-defined evaluation criteria, whether the majority of the speech segments belong to the active person who is currently interacting with the system.
  • the system 100 may be adapted to consider that as "acceptable" and consider the dialogue over these 2 seconds as a dialogue with the active person.
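  • as a rough sketch, such a majority criterion could take the following form; the 50% threshold is an illustrative assumption, not a value from the disclosure:

```python
def majority_belongs_to(active_id, segment_ids, threshold=0.5):
    """True when more than `threshold` of the attributed speech segments
    carry the speech ID of the person currently interacting with the system."""
    if not segment_ids:
        return False
    share = sum(1 for sid in segment_ids if sid == active_id) / len(segment_ids)
    return share > threshold
```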
  • the dialogue system 100 may also be adapted to further process the utterance in the speech signal only if the processed speech ID matches the active person, e.g. by transferring the result of the recognition to the semantic interpretation module.
  • the utterance is rejected otherwise, as it might contain a mixture of speech from different persons. This type of rejection can be compared to a "Nothing understood" rejection, and will help to filter out false alarms caused by background or party speech.
  • the system determines from face and voice a match of both (both showing "A"); hence the utterance is OK and is to be processed further.
  • person B talks loudly in the background while A wants to give the next command.
  • the voice ID sequence shows As and Bs intermixed, depending on who was louder, or no ID where both voices were too mixed.
  • the voice ID sequence does not match the visual ID sequence.
  • the system rejects the utterance based on this detected mismatch of IDs. It could also reject the utterance based only on a mixed sequence of As and Bs in the voice ID.
  • the system detects, from a mismatch of the visual ID (saying person A) and the voice ID (saying person B), that the command was not given by the correct person and hence does not react to it, although it may have been perfectly understood by the speech recognizer.
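  • the accept/reject logic of these three scenarios can be sketched as follows, operating on the harmonized (visual ID, voice ID) pairs of one utterance; the exact rejection policy is an illustrative reading of the examples above:

```python
def decide_utterance(pairs):
    """pairs: harmonized (visual_id, voice_id) sequence for one utterance.
    Returns the ID of the person to act for, or None to reject."""
    voice_ids = {v for _, v in pairs if v is not None}
    visual_ids = {f for f, _ in pairs if f is not None}
    if len(voice_ids) != 1:
        return None  # mixed As and Bs (or no usable ID): reject the utterance
    if voice_ids != visual_ids:
        return None  # face/voice mismatch: command given by the wrong person
    return next(iter(voice_ids))
```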
  • Figure 3 shows an example of how to synchronize the video frames and the speech segments in time.
  • the horizontal axis (R_t) 309 may be considered a real-time axis (a kind of "common clock" for both the image frames and the speech segments).
  • the frequency of the image frame rate is lower than that of the short-time speech segments, i.e. two image frames 301-302 vs. five speech segments 303-307 over the time interval t 308. This is common, since in almost all real systems the speech frame rate is much higher than the visual frame rate, the visual frames requiring much more processing power. Therefore, in general, fewer visual frames are provided, with longer time intervals between them than between the speech segments.
  • the synchronization as shown here means to associate the image frames with the speech segments whose real-time intervals contain the frames' time instants.
  • image frame marked as 301 has a real time t2 310, which falls within the real-time interval extending from t1 315 to t3 316 of the first speech segment 303.
  • the identification of the active person may be updated at the frame rate given by the image frames, where the face ID obtained from frame 301 is compared with the speech ID obtained from segment 303; this is then repeated for image frame 302 and speech segment 306.
  • Figure 4 shows an extension of the example in Fig. 3, wherein the update frequency of the identification of an active person is increased. This is done by labeling the image frames 312-313 between image frames 301 and 302 with the same person ID as image frame 301.
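  • a sketch of this label propagation, assuming unprocessed intermediate frames are marked None:

```python
def propagate_labels(frame_ids):
    """Fill unprocessed frames (None) with the most recent known face ID."""
    last = None
    filled = []
    for fid in frame_ids:
        if fid is not None:
            last = fid
        filled.append(last)
    return filled

# e.g. ["A", None, None, "B"] -> ["A", "A", "A", "B"]
```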
  • Figure 5 shows a dialogue system 100, 500 according to the present invention comprising a speech recognizer (S_R) 501, a visual identifier (V_I) 502, a processor (P) 503 and a memory 504.
  • the speech recognizer (S_R) 501 comprises a microphone for receiving an input from a person 505, and from the received input generating speech segments. These are then processed for obtaining the person's speech ID from each respective segment.
  • the visual identifier (V_I) 502 comprises a digital camera or a video camera for capturing digital images/video from the person 505. In the case of the video camera the resulting video stream is cut into image frames. The visual ID from the person 505 is then processed from the image frames.
  • the processor (P) 503 is adapted to harmonize the speech segments and the image frames as shown in Figs. 3 and 4, and to identify the person 505 based on the speech and visual IDs.
  • the processor (P) 503 is further adapted to repeatedly compare whether said processed IDs match the IDs associated with said person, and to maintain the interaction with said person while the IDs match.
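  • a minimal sketch of this control loop; the sense() and respond() interfaces to the visual identifier, speech recognizer and speech output are assumptions made for illustration:

```python
def interaction_loop(active_person, stored_ids, sense, respond):
    """stored_ids: (visual_id, speech_id) captured when the person was
    identified; sense() yields freshly processed (visual_id, speech_id)
    pairs; respond() utters a system prompt."""
    for visual_id, speech_id in sense():
        if (visual_id, speech_id) == stored_ids:
            continue  # IDs match: maintain the interaction
        if speech_id is not None and speech_id != stored_ids[1]:
            # another voice interrupts: imply that the system is busy
            respond(f"I'm currently interacting with {active_person}")
        else:
            break  # the IDs no longer match: stop the dialogue communication
```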
  • the invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer.
  • in a system claim enumerating several means, several of these means can be embodied by one and the same item of hardware.
  • the mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Image Analysis (AREA)
  • Air Conditioning Control Device (AREA)

Abstract

The invention relates to a dialogue system and a method performed in a dialogue system for interacting with a person based on both the person's visual identity and speech identity. The speech identity is obtained from the person's speech signal, and the visual identity is determined from the person's visual image. The identities are then processed by repeatedly comparing whether they match the identities associated with that person, the interaction with the person being maintained as long as the identities match.
PCT/IB2006/052915 2005-08-31 2006-08-23 Dialogue system for interacting with a person by means of the person's visual and speech identities WO2007026280A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP05107972 2005-08-31
EP05107972.1 2005-08-31

Publications (1)

Publication Number Publication Date
WO2007026280A1 (fr) 2007-03-08

Family

ID=37682728

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/052915 WO2007026280A1 (fr) 2005-08-31 2006-08-23 Dialogue system for interacting with a person by means of the person's visual and speech identities

Country Status (2)

Country Link
TW (1) TW200729155A (fr)
WO (1) WO2007026280A1 (fr)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003065350A1 (fr) * 2002-01-30 2003-08-07 Koninklijke Philips Electronics N.V. Audio-visual detection of voice activity for speech recognition system
US20030154084A1 (en) * 2002-02-14 2003-08-14 Koninklijke Philips Electronics N.V. Method and system for person identification using video-speech matching
EP1494210A1 (fr) * 2003-07-03 2005-01-05 Sony Corporation Speech communication system and method, and robot

Also Published As

Publication number Publication date
TW200729155A (en) 2007-08-01

Similar Documents

Publication Publication Date Title
US10743107B1 (en) Synchronization of audio signals from distributed devices
US11023690B2 (en) Customized output to optimize for user preference in a distributed system
US11322148B2 (en) Speaker attributed transcript generation
US6975991B2 (en) Wearable display system with indicators of speakers
US11875796B2 (en) Audio-visual diarization to identify meeting attendees
CN111402900B (zh) 一种语音交互方法,设备和系统
US20210407516A1 (en) Processing Overlapping Speech from Distributed Devices
US20170134552A1 (en) Techniques for voice controlling Bluetooth headset
CN112997186A (zh) “存活性”检测系统
US20190355352A1 (en) Voice and conversation recognition system
US11528568B1 (en) Assisted hearing aid with synthetic substitution
US20200351603A1 (en) Audio Stream Processing for Distributed Device Meeting
JP2004515982A (ja) テレビ会議及び他の適用においてイベントを予測する方法及び装置
US20180158462A1 (en) Speaker identification
US11468895B2 (en) Distributed device meeting initiation
JP3838159B2 (ja) 音声認識対話装置およびプログラム
CN112673423A (zh) 一种车内语音交互方法及设备
JP4585380B2 (ja) 次発言者検出方法、装置、およびプログラム
WO2007026280A1 (fr) 2007-03-08 Dialogue system for interacting with a person by means of the person's visual and speech identities
KR102000282B1 (ko) 청각 기능 보조용 대화 지원 장치
JP4219129B2 (ja) テレビジョン受像機
CN115050375A (zh) 一种设备的语音操作方法、装置和电子设备
US20240127821A1 (en) Selective speech-to-text for deaf or severely hearing impaired
CN115985324A (zh) 角色区分方法、装置、设备及可读存储介质
JP2006127353A (ja) 会話参与手続き認識装置および会話参与手続き認識システム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06795742

Country of ref document: EP

Kind code of ref document: A1