WO2007026280A1 - Dialogue system for interacting with a person by means of the person's visual and speech identities - Google Patents
Dialogue system for interacting with a person by means of the person's visual and speech identities
- Publication number: WO2007026280A1 (PCT application PCT/IB2006/052915)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- person
- speech
- visual
- ids
- dialogue system
- Prior art date
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition; G10L15/24—Speech recognition using non-acoustical features
- G10L17/00—Speaker identification or verification techniques; G10L17/06—Decision making techniques; Pattern matching strategies; G10L17/10—Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
- G10L17/00—Speaker identification or verification techniques; G10L17/22—Interactive procedures; Man-machine interfaces
Definitions
- the present invention relates to a dialogue system and a method performed in the dialogue system for interacting with a person by making use of both the person's visual ID and the person's speech ID.
- Speaker identification is a known technology which has so far indeed only been used for initial person identification. Once the person was identified by the system, ordinary speech recognition was used. This is of course sufficient if the system is close-ended, meaning that the dialogue with the system has a clear goal and end. It is then unlikely that the person will change during this interaction, and so an initial identification is sufficient. The situation is different with open-ended systems, which remain active over a longer period of time and with which a person might interact only occasionally (e.g. every 10 minutes on average). This would, for example, be the case for a home jukebox application where the person can select music to be played. The system is personalized for that person, who might interact with it every now and then, but who also might not want any other person to interfere with the system (or for this to happen by accident due to wrong recognition of background speech).
- US 20050027515 discloses speech detection that combines a conventional audio microphone with an additional speech sensor providing a speech sensor signal based on an additional input.
- The speech sensor signal is based on an action undertaken by the speaker during speech, such as facial movement, bone vibration, throat vibration, throat impedance changes, etc. Accordingly, it is assumed that the desired speaker carries said speech sensor, which provides information about whether the speaker is speaking. However, this reference provides no information about the identity of the speaker or the desired speaker.
- US 20040267536 discloses a system and a method facilitating speech detection and/or enhancement utilizing audio/video fusion.
- This reference fuses audio and video in a probabilistic generative model that implements cross-modal, self-supervised learning, enabling rapid adaptation to audio-visual data. This reference also provides no information about the identity of the speaker or the desired speaker.
- The present invention relates to a method performed in a dialogue system of interacting with a person based on both the person's visual ID and the person's speech ID, said method comprising the steps of: • detecting the speech signal from said person and processing said speech ID therefrom,
- The method is adapted to react in a better and more natural way when interacting with said person on a dialogue level.
- the dialogue communication may be stopped.
- an identified person who will hereafter be referred to as an active person
- the result of the processing will be that the new visual ID does not match the previous visual ID.
- the system may be adapted to address the unidentified person accordingly, e.g. saying "I'm currently communicating with John".
- the detected speech signal is divided into speech segments, wherein the speech ID is processed for each of said speech segments.
- A speech ID is processed from each of said speech segments over a short time interval.
- If a continuous speech signal is formed of e.g. 2 seconds, it will be broken into many speech segments, e.g. 10-millisecond speech segments (200 speech segments in total).
- In that way the speech signal from an active person can be distinguished from that of unidentified persons or of another active person, e.g. the first 100 segments belong to the active person, while the remaining speech segments belong to unidentified persons or to an active person who is not interacting with the dialogue system.
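The segmentation described above can be sketched as follows. This is a hypothetical illustration, not the patented implementation: `classify_segment` stands in for a real speaker-identification model, and the 16 kHz sample rate is an assumption.

```python
# Hypothetical sketch: split a speech signal into fixed-length segments and
# attribute each segment to a speaker ID. classify_segment is a stand-in for
# a real speaker-identification model.

SEGMENT_MS = 10  # segment length, as in the 10-millisecond example above

def split_into_segments(signal, sample_rate):
    """Break a mono signal (a list of samples) into fixed-length segments."""
    samples_per_segment = sample_rate * SEGMENT_MS // 1000
    return [signal[i:i + samples_per_segment]
            for i in range(0, len(signal), samples_per_segment)]

def attribute_segments(segments, classify_segment):
    """Label each segment with the speaker ID returned by the classifier."""
    return [classify_segment(seg) for seg in segments]

# A continuous 2-second signal at an assumed 16 kHz yields 200 segments.
signal = [0.0] * (16_000 * 2)
segments = split_into_segments(signal, 16_000)
assert len(segments) == 200
```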
- The step of detecting said person visually comprises capturing video of said person and generating at least one image frame of said person therefrom, wherein said visual ID is processed from said at least one image frame.
- Each of said speech segments and each of said at least one image frame are harmonized so that their absolute times overlap. In that way it is ensured that the real times of the speech segments and image frames overlap, i.e. that they are arranged into "pairs" having the same real time.
- Said harmonization comprises labeling those speech segments/image frames with the higher rate according to the existing segments/image frames with the lower rate, so that their real times overlap.
- A number of image frames may be labeled as having the same visual ID as the first image frame.
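The pairing implied by this harmonization can be sketched as follows (hypothetical; timestamps are in seconds on the common real-time clock, and each image frame is treated as a single time instant, as described above):

```python
# Hypothetical sketch of the harmonization step: each image frame is a single
# time instant, so it is paired with the speech segment whose real-time
# interval contains that instant.

def pair_frames_with_segments(frame_times, segment_intervals):
    """Return, for each frame, (frame time, index of the containing segment)."""
    pairs = []
    for t in frame_times:
        for i, (start, end) in enumerate(segment_intervals):
            if start <= t < end:
                pairs.append((t, i))
                break
    return pairs

# Two image frames vs. five speech segments over one second (illustrative
# numbers only, not taken from the patent figures).
frames = [0.1, 0.6]
segments = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]
print(pair_frames_with_segments(frames, segments))  # [(0.1, 0), (0.6, 3)]
```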
- the step of detecting said person visually comprises detecting the person's face, and wherein said visual ID comprises the person's face ID.
- the present invention further relates to a computer-readable medium having stored therein instructions for causing a processing unit to execute said methods.
- The present invention relates to a dialogue system for interacting with a person based on both the person's visual ID and the person's speech ID, comprising:
- a speech recognizer for detecting the speech signal from said person and processing said speech ID therefrom, a visual identifier for detecting said person visually and processing said visual ID therefrom, and a processor for repeatedly comparing whether said processed IDs match the IDs associated to said person.
- said visual identifier comprises a video camera for detecting said person visually, wherein the resulting video stream is subsequently cut into at least one image frame from which said visual ID is processed.
- said dialogue system further comprises a memory for storing said IDs.
- The speech recognizer is further adapted to adjust the dialogue behavior so as to react appropriately to the mismatch.
- Such an appropriate reaction could e.g. comprise implying to said person that the dialogue system is currently not interacting with said person. This makes the dialogue system more human-like.
- The speech recognizer would imply the mismatch between the speech IDs by e.g. saying to person B: "I'm currently interacting with person A".
- the speech recognizer is further adapted to re-process said speech signal.
- Said dialogue system is further adapted to focus its attention physically on the person interacting with it when two or more persons are present, by focusing on the one of said two or more persons whose visual and speech IDs match.
- Said dialogue system thereby becomes more human-like, since it is now provided with a "head" which can "look at" an interacting person.
- The feature of maintaining the person in sight may be implemented e.g. via audio localization, where the view is centered on the (loudest) speaking voice source, or via centering the visual identifier (e.g. the video camera) on said person.
- Said rotation means may e.g. comprise a rotating head or the like which enables a two-dimensional rotation (two degrees of freedom), or preferably a three-dimensional rotation (three degrees of freedom).
- The system, in contrast, determines which person is the active one, e.g. based on the voice ID, and centers its view (i.e. moves its head accordingly) onto the face with the matching visual ID, overcoming the disadvantages of existing approaches.
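A minimal sketch of this selection step follows. It is hypothetical, not the patented implementation: the face detections and their pan angles are assumed to come from the visual identifier, and both names are illustrative.

```python
# Hypothetical sketch: among several detected faces, the system centers its
# "head" on the face whose visual ID matches the active speaker's voice ID.
# Each detection is a (visual_id, pan_angle_in_degrees) pair.

def choose_focus(faces, active_voice_id):
    """Return the pan angle of the face matching the active voice ID, if any."""
    for visual_id, pan_deg in faces:
        if visual_id == active_voice_id:
            return pan_deg
    return None  # the active person is not currently in sight

faces = [("john", -20.0), ("eric", 15.0)]
assert choose_focus(faces, "eric") == 15.0   # head turns towards Eric
assert choose_focus(faces, "anna") is None   # no matching face: no movement
```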
- figure 1 illustrates the principle of the present invention, namely to enable a dialogue system to interact with one or more persons from a group of persons
- figure 2 shows a method according to the present invention performed in a home dialogue system for interacting with a person based on both the person's speech ID and the person' s visual ID
- figure 3 shows an example of how to harmonize the video frames and the speech segments
- figure 4 shows an extension of the example in Fig. 3, wherein the update frequency of the identification of an active person is increased
- figure 5 shows a dialogue system according to the present invention.
- Figure 1 illustrates the principle of the present invention, namely to enable a dialogue system 100 to interact with one or more persons 101-105 based on both the persons' speech IDs and visual IDs, which are pre-stored in the dialogue system 100.
- The person's visual ID is the person's face ID, which may e.g. additionally include the shape of the person's head, specifics relating to the person's ears, the person's hair, etc.
- The dialogue system 100, which will be discussed in more detail later, comprises a camera or video camera for the visual detection and a microphone for the speech detection.
- The scene illustrated here shows a room with five persons 101-105 who interact with the dialogue system 100 and have been identified by the system 100. These persons 101-105 will hereafter be referred to as active persons.
- The main object of the present invention is to maintain the dialogue with each one of these active persons 101-105 in a very natural way, based on the visual and speech IDs of the persons, by repeatedly comparing whether the visual and speech IDs of the person currently interacting with the dialogue system match the IDs associated to said person. As long as the IDs match, the interaction with the active person is maintained.
- An active person 101 is interacting with the dialogue system 100, whereby the system repeatedly compares whether the visual and speech IDs of active person 101 match the pre-stored visual and speech IDs associated to person 101. If active person 102 interrupted active person 101 by suddenly addressing the system 100, the resulting speech signal could first comprise a combination of speech signals from both active persons and thereafter a speech signal from active person 102. In this scenario the speech processing would reveal that the processed speech ID (belonging to active person 102) does not match the speech ID of the active person who is currently interacting with the dialogue system, i.e. active person 101.
- the dialogue system 100 is adapted to address the persons personally, one possible reaction from the system 100 could be: "Hey John (active person 102) I'm currently interacting with Eric (active person 101), I'll interact with you as soon as I have assisted Eric".
- The active person 102 might interrupt active person 101.
- The system 100 is adapted to behave in a very human-like manner. This is similar to the scenario where a person A is talking to a person B but hears a person C (whom he/she knows) talking to someone else and not looking towards person A. Person A would obviously not react to that, since he/she knows that person C is not addressing him/her (there being no eye contact).
- the dialogue system 100 of the present invention is adapted to behave as a normal human being would do.
- Had active person 102 simultaneously gazed at the dialogue system 100, it would have interpreted that as a "sign" to interact with the system 100, i.e. as a kind of eye contact.
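The interruption scenario above can be condensed into a small decision function. This is a hypothetical sketch, not the patented implementation; the ID values and the returned action strings are illustrative only.

```python
# Hypothetical sketch of one round of the repeated ID comparison: the dialogue
# continues only while both processed IDs match the currently active person.

def interaction_step(active_person, visual_id, speech_id):
    """Decide the system's reaction to one pair of processed IDs."""
    if visual_id == active_person and speech_id == active_person:
        return "continue"              # IDs match: maintain the interaction
    if speech_id is not None and speech_id != active_person:
        # A different known person addressed the system; it may answer
        # personally, e.g. "Hey John, I'm currently interacting with Eric".
        return "defer:" + speech_id
    return "reject"                    # inconsistent or unknown IDs

assert interaction_step("eric", "eric", "eric") == "continue"
assert interaction_step("eric", "eric", "john") == "defer:john"
```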
- The initial identification of a person 101-105 to the dialogue system 100 comprises standing in front of the camera comprised in the system 100 for enabling face identification, and providing a speech test for the speech identification.
- the initial visual ID of a person will be used for the subsequent processing, i.e. considered as constant.
- An active person will be allowed to move out of the sight of the camera, e.g. to interact with the dialogue system 100 from another room, or where the active person is among a large group of people and is not visually observable (e.g. the person is relatively small, or is in another room).
- The received speech signals are processed and compared to the speech ID, wherein the interaction is continued while the speech IDs match the initially identified speech ID.
- The system could be adapted to immediately process the visual image of the person. Since the resulting visual ID would clearly not correspond to the initial visual ID, the dialogue system 100 could e.g. respond by saying, "I'm currently communicating with John".
- Figure 2 shows a method according to the present invention performed in a home dialogue system 100 for interacting with a person based on both the person's speech ID and the person's visual ID.
- the initial step of identifying the person is to detect the person both visually in step 201 (V_D) using e.g. a video camera and acoustically in step 202 (A_S) using a microphone.
- the video stream is cut into image frames (C_F) in step 203, where each frame may be a full image of e.g. the person's face.
- Each image frame is then processed and based thereon the person's face ID is determined in step 205 (Im_ID).
- A short-time speech segment is generated (S_S) in step 204 and processed, and based thereon the person's speech ID is determined in step 206 (Sp_ID).
- This can be done on word or utterance level where a word or even a complete utterance may be labeled according to the dominant speech ID.
- Since the frame rates for video and speech may be different, they need to be harmonized (Harm) in step 207, so that each video frame and each speech segment have the same real time, or real times that overlap (since the image frame is typically only one time instant, it is essential that the real time of the speech segment includes the real time of the image frame). This will be discussed in more detail under Fig. 3.
- the person is positively identified (Pos. Id) in step 209 if both the visual and speech IDs are recognized and match the same person. Otherwise, the person is not identified (N_Pos. Id) in step 210 by the system 100.
- Step 208 comprises, as previously mentioned, repeatedly comparing the visual and speech IDs of an active person (e.g. the person marked as 101 in Fig. 1) by regularly checking whether the two IDs belong to the active person who is currently interacting with the dialogue system 100. Since the active person can be among many other active persons, e.g. as shown in Fig. 1, the above-mentioned speech segments may partly belong to one of the active persons who is not interacting with the dialogue system 100. In such situations, the dialogue system 100 may be adapted to evaluate, based on pre-defined evaluation criteria, whether the majority of the speech segments belong to the active person who is currently interacting with the system.
- The system 100 may be adapted to consider that as "acceptable" and to consider the dialogue over these 2 seconds as a dialogue with the active person.
- The dialogue system 100 may also be adapted to further process the utterance in the speech signal if the processed speech ID does not match the active person, e.g. by transferring the result of the speech recognition to the semantic interpretation module.
- The utterance is rejected as it might contain a mixture of speech from different persons. This type of rejection can be compared to a "Nothing understood" rejection, and will help filter out false alarms triggered by background or party speech.
- the system determines from face and voice: Match of both (both showing "A"), hence the utterance is OK and to be processed further.
- person B talks loudly in the background while A wants to give the next command.
- the voice ID sequence shows As and Bs intermixed, depending on who was louder, or no ID where both voices were too mixed.
- the voice ID sequence does not match the visual ID sequence.
- The system rejects the utterance based on this detected mismatch of IDs. It could also reject the utterance based only on a mixed sequence of As and Bs in the voice ID.
- The system detects, from a mismatch of the visual ID (saying person A) and the voice ID (B), that the command was not given by the correct person and hence does not react to it, although it may have been perfectly understood by the speech recognizer.
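The rejection rule in this scenario can be sketched as follows. This is a hypothetical illustration: the per-segment ID sequences are assumed to come from the harmonization step, with None standing for segments where the voices were too mixed to identify.

```python
# Hypothetical sketch of the rejection rule: an utterance is rejected either
# when its per-segment voice IDs disagree with the visual IDs, or when the
# voice-ID sequence itself is a mix of several speakers.

def accept_utterance(visual_ids, voice_ids):
    """Accept only if the voice IDs are uniform and match the visual IDs."""
    speakers = {v for v in voice_ids if v is not None}  # None = too mixed
    if len(speakers) != 1:
        return False          # intermixed As and Bs -> "nothing understood"
    return speakers == set(visual_ids)

assert accept_utterance(["A", "A", "A"], ["A", "A", "A"]) is True
assert accept_utterance(["A", "A", "A"], ["A", "B", "A"]) is False  # mixed
assert accept_utterance(["A", "A"], ["B", "B"]) is False            # mismatch
```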
- Figure 3 shows an example of how to synchronize the video frames and the speech segments in time.
- The horizontal axis (R_t) 309 may be considered as a real-time axis (a kind of "common clock" for both the image frames and the speech segments).
- The frequency of the image frame rate is lower than that of the short-time speech segments, i.e. two image frames 301-302 vs. five speech segments 303-307 over the time interval t 308. This is common since in almost all real systems the speech frame rate is much higher than the visual frame rate, as processing visual frames requires much more processing power. Therefore, in general, fewer visual frames are provided, with longer time intervals between them than between the speech segments.
- The synchronization as shown here means to associate the image frames with the speech segments whose real times overlap.
- Image frame 301 has a real time t2 310, which falls within the real-time interval extending from t1 315 to t3 316 of the first speech segment 303.
- The identification of the active person may be updated at the frame rate given by the image frames, where the face ID obtained from frame 301 is compared with the speech ID obtained from segment 303, and this is then repeated for image frame 302 and speech segment 306.
- Figure 4 shows an extension of the example in Fig. 3, wherein the update frequency of the identification of an active person is increased. This is done by labeling the image frames 312-313 between image frames 301 and 302 with the same person ID as image frame 301.
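One possible reading of this labeling step, as a sketch (the function and the frame representation are hypothetical; None marks frames that were captured but not classified):

```python
# Hypothetical sketch of the label propagation in Fig. 4: image frames that
# were not classified inherit the person ID of the last classified frame,
# raising the update rate of the identification.

def propagate_labels(frame_ids):
    """Fill None entries with the most recent known person ID."""
    last = None
    out = []
    for fid in frame_ids:
        last = fid if fid is not None else last
        out.append(last)
    return out

# E.g. frames 301, 312, 313, 302 where only 301 and 302 were classified.
print(propagate_labels(["A", None, None, "A"]))  # ['A', 'A', 'A', 'A']
```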
- Figure 5 shows a dialogue system 100, 500 according to the present invention comprising a speech recognizer (S_R) 501, a visual identifier (V_I) 502, a processor (P) 503 and a memory 504.
- the speech recognizer (S_R) 501 comprises a microphone for receiving an input from a person 505, and from the received input generating speech segments. These are then processed for obtaining the person's speech ID from each respective segment.
- the visual identifier (V_I) 502 comprises a digital camera or a video camera for capturing digital images/video from the person 505. In the case of the video camera the resulting video stream is cut into image frames. The visual ID from the person 505 is then processed from the image frames.
- The processor (P) 503 is adapted to harmonize the speech segments and the image frames as shown in Figs. 3 and 4, and to identify the person 505 based on the speech and visual IDs.
- the processor (P) 503 is further adapted to repeatedly compare whether said processed IDs match the IDs associated to said person, and to maintain the interaction with said person while the IDs match.
- the invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer.
- In a system claim enumerating several means, several of these means can be embodied by one and the same item of hardware.
- the mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Abstract
The invention relates to a dialogue system and a method performed in a dialogue system for interacting with a person based on both the person's visual identity and speech identity. The speech identity is obtained from the person's speech signal, and the visual identity is determined from the person's visual image. The identities are then processed by repeatedly comparing whether they match the identities associated with that person, the interaction with the person being maintained as long as the identities match.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP05107972 | 2005-08-31 | ||
EP05107972.1 | 2005-08-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2007026280A1 true WO2007026280A1 (fr) | 2007-03-08 |
Family
ID=37682728
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2006/052915 WO2007026280A1 (fr) | 2005-08-31 | 2006-08-23 | Systeme de dialogue destine a interagir avec une personne au moyen des identites visuelle et vocale de cette personne |
Country Status (2)
Country | Link |
---|---|
TW (1) | TW200729155A (fr) |
WO (1) | WO2007026280A1 (fr) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2003065350A1 (fr) * | 2002-01-30 | 2003-08-07 | Koninklijke Philips Electronics N.V. | Audio visual detection of voice activity for speech recognition system |
US20030154084A1 (en) * | 2002-02-14 | 2003-08-14 | Koninklijke Philips Electronics N.V. | Method and system for person identification using video-speech matching |
EP1494210A1 (fr) * | 2003-07-03 | 2005-01-05 | Sony Corporation | Speech communication system and method, and robot apparatus |
-
2006
- 2006-08-23 WO PCT/IB2006/052915 patent/WO2007026280A1/fr active Application Filing
- 2006-08-28 TW TW095131581A patent/TW200729155A/zh unknown
Also Published As
Publication number | Publication date |
---|---|
TW200729155A (en) | 2007-08-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 06795742 Country of ref document: EP Kind code of ref document: A1 |