WO2008069519A1 - Integrated gesture/speech recognition system and method - Google Patents

Integrated gesture/speech recognition system and method

Info

Publication number
WO2008069519A1
Authority
WO
WIPO (PCT)
Prior art keywords
gesture
speech
feature information
integrated
module
Prior art date
Application number
PCT/KR2007/006189
Other languages
English (en)
Inventor
Young Giu Jung
Mun Sung Han
Jae Seon Lee
Jun Seok Park
Original Assignee
Electronics And Telecommunications Research Institute
Priority date
Filing date
Publication date
Priority claimed from KR1020070086575A (KR100948600B1)
Application filed by Electronics And Telecommunications Research Institute
Priority to JP2009540141A (JP2010511958A)
Publication of WO2008069519A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/033Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F3/038Control and interface arrangements therefor, e.g. drivers or device-embedded control circuitry
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/038Indexing scheme relating to G06F3/038
    • G06F2203/0381Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer

Definitions

  • the present invention relates to an integrated recognition technology, and more particularly, to a gesture/speech integrated recognition system and method capable of recognizing an order (command) of a user by extracting feature information on a gesture by using an end point detection (EPD) value of a speech and integrating feature information on the speech with the feature information on the gesture, thereby recognizing the order of the user at a high recognition rate in a noisy environment.
  • speech recognition technology and gesture recognition technology are among the most convenient interface technologies.
  • the speech recognition technology and the gesture recognition technology have a high recognition rate in a limited environment.
  • however, the performance of these technologies degrades in a noisy environment, because environmental noise affects the performance of speech recognition, and changes in lighting and the type of gesture affect the performance of a camera-based gesture recognition technology. Therefore, the speech recognition technology needs a technique for recognizing speech by using an algorithm robust against noise, and the gesture recognition technology needs a technique for extracting the particular section of a gesture that includes recognition information.
  • a particular section of the gesture cannot be easily identified, so that recognition is difficult.
  • An aspect of the present invention provides a means for extracting feature information by detecting an order section from a speech of a user by using an algorithm robust against environmental noise, and detecting a feature section of a gesture by using information on an order start point of the speech, thereby easily recognizing the order in the gesture that is not easily identified.
  • An aspect of the present invention also provides a means for synchronizing a gesture and a speech by applying optimal frames set in advance to an order section of the gesture detected by a speech end point detection (EPD) value, thereby solving a problem of a synchronization difference between the speech and the gesture while integrated recognition is performed.
  • a gesture/speech integrated recognition system including: a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech; a gesture feature extraction unit extracting gesture feature information by detecting an order section from a gesture in the taken images by using information on the detected start point and end point; and an integrated recognition unit outputting the extracted speech feature information and the gesture feature information as integrated recognition data by using a learning parameter set in advance.
  • the system may further include a synchronization module including: a gesture start point detection module detecting a start point of the gesture from the taken images by using the detected start point; and an optimal frame applying module calculating and extracting optimal image frames by applying the number of optimal frames set in advance from the start point of the gesture.
  • the gesture start point detection module may detect the start point of the gesture from the taken images by checking the start point indicated by the end point detection (EPD) flag of the detected speech.
  • the speech feature extraction unit may include: an EPD module detecting a start point and an end point in the input speech; and a hearing model-based speech feature extraction module extracting speech feature information included in the detected order by using a hearing model-based algorithm.
  • the speech feature extraction unit may remove noise from the extracted speech feature information.
  • the gesture feature extraction unit may include: a hand tracking module tracking movements of a hand in images taken by a camera and transmitting the movements to the synchronization module; and a gesture feature extraction module extracting gesture feature information by using the optimal image frames extracted by the synchronization module.
  • the integrated recognition unit may include: an integrated learning database (DB); an integrated learning DB control module generating a learning parameter on the basis of the integrated learning DB; an integrated feature control module controlling the learning parameter and the feature vectors of the extracted speech feature information and gesture feature information; and an integrated recognition module generating a recognition result.
  • the integrated feature control module may control the feature vectors of the extracted speech feature information and the gesture feature information through extension and reduction of the number of nodes of the input vectors.
  • a gesture/speech integrated recognition method including: a first step of extracting speech feature information by detecting a start point, which is an EPD value, and an end point of an order in an input speech; a second step of extracting gesture feature information by detecting an order section from a gesture in images input through a camera by using the detected start point of the order; and a third step of outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance.
  • the speech feature information may be extracted from the order section obtained by the start point and the end point of the order on the basis of a hearing model.
  • the second step may include: a B step of detecting an order section from the gesture of the hand movements by using the transmitted EPD value; a C step of determining optimal frames in the order section of the gesture by applying the number of optimal frames set in advance; and a D step of extracting gesture feature information from the determined optimal frames.
  • the gesture/speech integrated recognition system and method according to the present invention increases a recognition rate of a gesture that is not easily identified by detecting an order section of the gesture by using an end point detection (EPD) value that is a start point of an order section of a speech.
  • optimal frames are applied to the order section of the gesture to synchronize the speech and the gesture, so that it is possible to implement integrated recognition between the speech and the gesture.
  • FIG. 1 is a view illustrating a concept of a gesture/speech integrated recognition system according to an embodiment of the present invention.
  • FIG. 2 is a view illustrating a configuration of a gesture/speech integrated recognition system according to an embodiment of the present invention.
  • FIG. 3 is a flowchart illustrating a gesture/speech integrated recognition method according to an embodiment of the present invention.
  • Best Mode for Carrying Out the Invention
  • FIG. 1 is a view illustrating a concept of a gesture/speech integrated recognition system according to an embodiment of the present invention.
  • a person 100 orders through a speech 110 and a gesture 120.
  • the person may point at a corn bread with a finger while saying "select a corn bread" as an order for selecting the corn bread from among the displayed goods.
  • feature information on the speech order of the person is recognized through speech recognition 111, and feature information on the gesture of the person is recognized through gesture recognition 121.
  • the recognized feature information on the gesture and the speech is recognized through integrated recognition 130 as a single user order in order to increase a recognition rate of the speech affected by environmental noise and the gesture that cannot be easily identified.
  • the present invention provides a technology of integrated recognition for a speech and a gesture of a person.
  • the recognized order is transmitted by a controller to a speaker 170, a display apparatus 171, a diffuser 172, a touching device 173, and a tasting device 174 which are output devices for the five senses, and the controller controls each device.
  • the recognition result is transmitted over a network, so that the resulting five-sense data is delivered to control each of the output devices.
  • the present invention relates to integrated recognition, and a configuration after the recognition operation can be modified in various manners, so that a detailed description thereof is omitted.
  • FIG. 2 is a view illustrating a configuration of the gesture/speech integrated recognition system according to an embodiment of the present invention.
  • the gesture/speech integrated recognition system includes a speech feature extraction unit 210 for extracting speech feature information by detecting a start point and an end point of an order from a speech input through a microphone 211, a gesture feature extraction unit 220 for extracting gesture feature information by detecting an order section from a gesture in images taken by a camera by using information on the start point and the end point detected by the speech feature extraction unit 210, a synchronization module 230 for detecting a start point of the gesture from the taken images by using the start point detected by the speech feature extraction unit 210 and calculating optimal image frames by applying the number of optimal frames set in advance from the detected start point of the gesture, and an integrated recognition unit 240 for outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance.
  • the speech feature extraction unit 210 includes the microphone 211 receiving a speech of a user, an end point detection (EPD) module 212 detecting a start point and an end point of an order section from the speech of the user, and a hearing model-based speech feature extraction module 213 extracting speech feature information from the order section of the speech detected by the EPD module 212 on the basis of the hearing model.
  • a channel noise reduction module (not shown) for removing noise from the extracted speech feature information may further be included.
  • the EPD module 212 detects the start and the end of the order by analyzing the speech input through the wired or wireless microphone.
  • the EPD module 212 acquires a speech signal, calculates an energy value needed to detect the end point of the speech signal, and detects the start and the end of the order by identifying a section to be calculated as the order from the input speech signal.
  • the EPD module 212 first acquires the speech signal through the microphone and converts the acquired speech into a form suitable for frame-by-frame calculation. Meanwhile, when the speech is input wirelessly, data loss or signal deterioration due to signal interference may occur, so that an operation to deal with this problem during signal acquisition is needed.
  • the energy value needed to detect the end point of the speech signal by the EPD module 212 is calculated as follows.
  • a frame size used to analyze the speech signal is 160 samples, and the frame energy is calculated as E = Σ S(n)^2 (summed over n = 0, ..., N-1), where S(n) denotes a vocal cord signal sample and N denotes the number of samples per frame.
  • the calculated frame energy is used as a parameter for detecting the end point.
  • the EPD module 212 determines a section to be calculated as the order after calculating the frame energy value. For example, the operation of calculating the start point and the end point of the speech signal is determined by four energy thresholds using the frame energy and 10 conditions.
  • the four energy thresholds and the 10 conditions can be set in various manners and may have values obtained in experiments in order to obtain the order section.
  • the four thresholds are used to detect the start and end of every frame by an EPD algorithm.
  • the EPD module 212 transmits information on the detected start point (hereinafter referred to as an "EPD value") of the order to a gesture start point detection module 231 of the synchronization module 230.
  • the EPD module 212 transmits information on the order section in the input speech to the hearing model-based speech feature extraction module 213 to extract speech feature information.
  • the hearing model-based speech feature extraction module 213 that receives the information on the order section in the speech extracts the feature information on the basis of the hearing model from the order section detected by the EPD module 212.
  • Examples of an algorithm used to extract the speech feature information on the basis of the hearing model include an ensemble interval histogram (EIH) and a zero-crossings with peak amplitudes (ZCPA).
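  • The following is a much reduced, single-band sketch of the ZCPA idea (intervals between upward zero crossings voting into frequency bins, weighted by the log-compressed peak amplitude); the real ZCPA and EIH front ends first split the signal into many auditory filter bands, which is omitted here.

```python
import numpy as np

def zcpa_frame_feature(frame, sample_rate=16000, n_bins=16):
    """Single-band ZCPA-style histogram for one speech frame."""
    frame = np.asarray(frame, dtype=np.float64)
    hist = np.zeros(n_bins)
    # logarithmically spaced frequency bins up to the Nyquist frequency
    edges = np.logspace(np.log10(60.0), np.log10(sample_rate / 2.0), n_bins + 1)
    # indices of upward zero crossings
    ups = np.where((frame[:-1] < 0) & (frame[1:] >= 0))[0]
    for a, b in zip(ups[:-1], ups[1:]):
        freq = sample_rate / (b - a)            # inverse crossing interval
        peak = np.max(np.abs(frame[a:b]))       # peak amplitude in the interval
        k = np.searchsorted(edges, freq) - 1
        if 0 <= k < n_bins:
            hist[k] += np.log1p(peak)           # log-compressed peak amplitude
    return hist
```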
  • a channel noise reduction module removes noise from the speech feature information extracted by the hearing model-based speech feature extraction module 213, and the speech feature information from which the noise is removed is transmitted to the integrated recognition unit 240.
  • the gesture feature extraction unit 220 includes a face and hand detection module 222 for detecting the face and the hand of the user from the taken images, a hand tracking module 223 for tracking the movements of the detected hand and transmitting them to the synchronization module 230, and a gesture feature extraction module 224 for extracting feature information on the gesture by using the optimal frames calculated by the synchronization module 230.
  • the face and hand detection module 222 detects the face and the hand that gives a gesture, and a hand tracking module 223 continuously tracks the hand movements in the images.
  • the hand tracking module 223 tracks the hand.
  • the hand tracking module 223 may track various body portions having movements recognized as a gesture.
  • the hand tracking module 223 continuously stores the hand movements as time elapses, and a part recognized as a gesture order in the hand movement is detected by the synchronization module 230 by using the EPD value transmitted from the speech feature extraction unit 210.
  • the synchronization module 230 which detects a section recognized as the gesture order in the hand movements by using the EPD value and applies the optimal frames for synchronization between speeches and gestures will be described.
  • the synchronization module 230 includes the gesture start point detection module 231 for detecting the start point of the gesture from the taken images by using the speech EPD value, and an optimal frame applying module 232 for calculating the optimal image frames needed for integrated recognition by using the start frame of the gesture calculated from the detected start point of the gesture.
  • the gesture start point detection module 231 of the synchronization module 230 checks the speech EPD flag in the image signal; in this manner, the gesture start point detection module 231 calculates the start frame of the gesture. In addition, the optimal frame applying module 232 calculates the optimal image frames needed for integrated recognition by using the calculated start frame of the gesture and transmits the optimal image frames to the gesture feature extraction module 224.
  • in order to calculate the optimal image frames needed for integrated recognition applied by the optimal frame applying module 232, the number of frames that is determined to give a high gesture recognition rate is set in advance, and when the start frame of the gesture is calculated by the gesture start point detection module 231, the optimal image frames are determined.
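  • A minimal sketch of this step, assuming the camera frame rate and the preset number of optimal frames (the values below are placeholders):

```python
def select_optimal_frames(image_frames, speech_start_sec, fps=15, n_optimal=10):
    """Pick the image frames handed to gesture feature extraction.

    `speech_start_sec` is the speech EPD start time; `n_optimal` plays the
    role of the experimentally chosen number of optimal frames.
    """
    gesture_start = int(round(speech_start_sec * fps))   # gesture start frame
    return image_frames[gesture_start:gesture_start + n_optimal]

# example: frames recorded at 15 fps, order speech starting 2.4 s into the buffer
frames = list(range(100))                  # stand-ins for buffered images
print(select_optimal_frames(frames, 2.4))  # -> frames 36..45
```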
  • the integrated recognition unit 240 includes an integrated model generation module 242 for generating an integrated model, an integrated learning database (DB) 244 implemented to be suitable for integrated recognition algorithm development based on a statistical model, an integrated learning DB control module 243 for controlling learning by the integrated model generation module 242 and the integrated learning DB 244 and for controlling a learning parameter, an integrated feature control module 241 for controlling the learning parameter and the feature vectors of the input speech feature information and gesture feature information, and an integrated recognition module 245 for providing various functions by generating recognition results.
  • the integrated model generation module 242 generates a high-performance integrated model in order to effectively integrate the speech feature information with the gesture feature information.
  • various learning algorithms such as the hidden Markov model (HMM), neural network (NN), and dynamic time warping (DTW) are implemented, and experiments are performed.
  • a method of obtaining high-performance integrated recognition by selecting an NN-based integrated model and optimizing the NN parameters may be used.
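  • A sketch of such an NN-based integrated model's forward pass over the joint speech+gesture vector; the single-hidden-layer shape and the sizes used below are assumptions, since the text does not fix the network topology.

```python
import numpy as np

def nn_integrated_recognition(joint_vec, w_hidden, w_out):
    """Forward pass of a small NN over the joint speech/gesture feature vector."""
    hidden = np.tanh(joint_vec @ w_hidden)    # hidden (integration) layer
    scores = hidden @ w_out                   # one score per defined order
    return int(np.argmax(scores))             # index of the recognized order

# toy usage: random parameters stand in for the learned ones
rng = np.random.default_rng(0)
joint = rng.standard_normal(384)              # e.g. 256 speech + 128 gesture nodes
order = nn_integrated_recognition(
    joint,
    rng.standard_normal((384, 64)) * 0.01,
    rng.standard_normal((64, 10)) * 0.01,     # assuming 10 defined orders
)
print("recognized order index:", order)
```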
  • the generation of the high-performance integrated model has a problem of synchronizing two modalities that have different frame numbers in the learning model.
  • the problem of the synchronization in the learning model is referred to as a learning model optimization problem.
  • an integration layer is provided, and a method of connecting the speech and the gesture is optimized in the layer.
  • An overlapping length between the speech and the gesture is calculated with respect to a time axis for the optimization, and the synchronization is performed on the basis of the overlapping length.
  • the overlapping length is used to search for the connection method having the highest recognition rate through recognition-rate experiments.
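  • A minimal sketch of the integration step, assuming fixed node counts per modality; the overlap-based search for the best connection method is not reproduced here.

```python
import numpy as np

def fit_length(features, n_nodes):
    """Extend (zero-pad) or reduce (truncate) a feature vector to n_nodes."""
    flat = np.ravel(np.asarray(features, dtype=np.float64))
    if len(flat) >= n_nodes:
        return flat[:n_nodes]
    return np.pad(flat, (0, n_nodes - len(flat)))

def integration_layer(speech_feat, gesture_feat, speech_nodes=256, gesture_nodes=128):
    """Build the joint input vector fed to the integrated recognizer."""
    return np.concatenate([fit_length(speech_feat, speech_nodes),
                           fit_length(gesture_feat, gesture_nodes)])
```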
  • the integrated learning DB 244 implements an integrated recognition DB to be proper for the integrated recognition algorithm development based on the statistical model.
  • Table 1 shows an order set defined to integrate gestures and speeches.
  • the defined order set is obtained on the basis of natural gestures that people can understand without learning.
  • the speech is recorded as single-channel (mono) pulse code modulation (PCM) waveform data at a sampling rate of 16 kHz with 16 bits per sample.
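  • For instance, recordings with exactly this format can be written with Python's standard wave module (the file name and sample source below are illustrative):

```python
import wave
import numpy as np

def write_order_recording(path, samples, sample_rate=16000):
    """Save an order utterance as mono 16 kHz / 16-bit PCM, as specified."""
    pcm16 = np.asarray(samples, dtype=np.int16)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)             # channel number of 1
        wav.setsampwidth(2)             # 16 bits per sample
        wav.setframerate(sample_rate)   # 16 kHz sampling rate
        wav.writeframes(pcm16.tobytes())

# example: one second of silence
write_order_recording("order_000.wav", np.zeros(16000, dtype=np.int16))
```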
  • 24-bit BITMAP images with a size of 320x240 are recorded at 15 frames per second with a blue-screen background, under lighting provided by four fluorescent boxes, by using an STH-DCSG-C stereo camera. Since the stereo camera does not have a speech interface, a speech collection module and an image collection module are provided independently, and an image/speech synchronization program, in which the speech recording program controls the image collecting process through inter-process communication (IPC), is written to collect data.
  • the image collection module is configured by using an open source computer vision library (OpenCV) and a small vision system (SVS).
  • stereo camera images have to be adjusted to the real recording environment through an additional calibration process, and in order to acquire optimal images, the associated gain, exposure, brightness, red, and blue parameter values are modified to control the color, exposure, and white balance (WB) values.
  • the calibration information and parameter information are stored in a separate INI file so that the image storage module can load and use the information.
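  • A sketch of loading such a file with Python's configparser; the file name, section, and key names are hypothetical, since the text only says the values are kept in a separate INI file.

```python
import configparser

def load_camera_settings(path="stereo_camera.ini"):
    """Read calibration/image parameters (gain, exposure, brightness, red, blue)."""
    cfg = configparser.ConfigParser()
    cfg.read(path)
    keys = ("gain", "exposure", "brightness", "red", "blue")
    section = cfg["parameters"] if cfg.has_section("parameters") else {}
    return {key: float(section.get(key, 0)) for key in keys}
```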
  • the integrated learning DB control module 243 generates the learning parameter on the basis of the integrated learning DB 244 and the integrated model generated by the integrated model generation module 242, which are generated and stored in advance.
  • the integrated feature control module 241 controls the learning parameter generated by the integrated learning DB control module 243 and the feature vectors of the feature information on the speech and the gesture extracted by the speech feature extraction unit 210 and the gesture feature extraction unit 220.
  • the control operation is associated with the extension and the reduction of the node number of input vectors.
  • the integrated feature control module 241 has the integration layer, and the integration layer is developed to effectively integrate lengths of the speech and the gesture that have different values and propose a single recognition rate.
  • the integrated recognition module 245 generates a recognition result by using a control result obtained by the integrated feature control module 241.
  • the integrated recognition module 245 provides various functions for interactions with a network or an integration representing unit.
  • FIG. 3 is a flowchart illustrating a gesture/speech integrated recognition method according to an embodiment of the present invention.
  • the gesture/speech integrated recognition method operates with three threads.
  • the three threads include a speech feature extraction thread 10 for extracting speech features, a gesture feature extraction thread 20 for extracting gesture features, and an integrated recognition thread 30 for performing integrated recognition on the speech and the gesture.
  • the three threads 10, 20, and 30 are generated at the time point when the learning parameter is loaded and operate cooperatively by using a thread flag. Now, the gesture/speech integrated recognition method in which the three threads 10, 20, and 30 operate cooperatively is described.
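  • A rough sketch of the three cooperating threads, using Python's threading with a queue standing in for the thread flag; the feature extraction bodies are replaced by dummy vectors.

```python
import queue
import threading
import time
import numpy as np

feature_q = queue.Queue()   # extracted features flow to the recognition thread

def speech_worker():
    """Stands in for the speech feature extraction thread (thread 10)."""
    while True:
        # ... EPD detection and hearing-model feature extraction go here ...
        feature_q.put(("speech", np.zeros(256)))    # dummy speech feature vector
        time.sleep(1.0)

def gesture_worker():
    """Stands in for the gesture feature extraction thread (thread 20)."""
    while True:
        feature_q.put(("gesture", np.zeros(128)))   # dummy gesture feature vector
        time.sleep(1.0)

def recognition_worker():
    """Sleeps on the queue until both modalities have arrived (thread 30)."""
    pending = {}
    while True:
        kind, feat = feature_q.get()                # blocks: the 'sleep state'
        pending[kind] = feat
        if {"speech", "gesture"} <= pending.keys():
            joint = np.concatenate([pending.pop("speech"), pending.pop("gesture")])
            print("recognizing order from joint vector of length", len(joint))

for target in (speech_worker, gesture_worker, recognition_worker):
    threading.Thread(target=target, daemon=True).start()

time.sleep(3)   # let the threads exchange a few feature vectors
```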
  • the gesture feature extraction thread 20 continuously receives images including the gesture through a camera (operation S320), while speech frames of the speech continuously input through the microphone are calculated (operation S312), and the EPD module 212 detects a start point and an end point (the speech EPD value) of an order included in the speech (operation S313). When the speech EPD value is detected, it is transmitted to a synchronization operation 40 of the gesture feature extraction thread 20.
  • the hearing model-based speech feature extraction module 213 extracts a speech feature from the order section on the basis of the hearing model (operation S314) and transmits the extracted speech feature to the integrated recognition thread 30.
  • the gesture feature extraction thread 20 detects a hand and a face from the images continuously input through the camera (operation S321). When the hand and face are detected, a gesture of the user is tracked (operation S322). Since the gesture of the user continuously changes, a gesture having a predetermined length is stored in a buffer (operation S323).
  • the speech EPD flag in the gesture images stored in the buffer is checked (operation S324).
  • the start point and the end point of the gesture including the feature information in the images are found by using the speech EPD flag (operation S325), and the found gesture feature is stored (operation S326).
  • the stored gesture feature is not yet synchronized with the speech, so the optimal frames are calculated from the start frame of the gesture by applying the number of optimal frames set in advance.
  • the calculated optimal frames are used to extract gesture feature information by the gesture feature extraction module 224, and the extracted gesture feature information is transmitted to the integrated recognition thread 30.
  • the speech/gesture feature extraction threads 10 and 20 are in a sleep state while the integrated recognition thread checks a recognition result (operations S328 and S315).
  • the integrated recognition thread 30 generates a high-performance integrated model in advance by using the integrated model generation module 242 before receiving the speech feature information and the gesture feature information, and controls the generated integrated model and the integrated learning DB 244 so that the integrated learning DB control module 243 generates and loads the learning parameter (operation S331).
  • the integrated recognition thread 30 is maintained in the sleep state until the speech/gesture feature information is received (operation S332).
  • when the integrated recognition thread 30 that is in the sleep state receives a signal associated with the feature information after the extraction of the feature information on the speech and the gesture is completed (operation S333), the integrated recognition thread 30 loads each feature into a memory (operation S334).
  • the recognition result is calculated by using the optimized integrated learning model and the learning parameter that are set in advance (operation S335).
  • meanwhile, the speech feature extraction thread 10 and the gesture feature extraction thread 20 that were in the sleep state again perform the operations of extracting feature information from the speech and the images that are input.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a gesture/speech integrated recognition system and method including a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech; a gesture feature extraction unit extracting gesture feature information by detecting an order section of a gesture in captured images, by using information on the detected start point and end point; and an integrated recognition unit outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance. Accordingly, it is possible to increase the recognition performance for an order by integrating the speech and the gesture in a noisy environment.
PCT/KR2007/006189 2006-12-04 2007-12-03 Système et procédé de reconnaissance intégrée de geste/voix WO2008069519A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2009540141A JP2010511958A (ja) 2006-12-04 2007-12-03 ジェスチャー/音声統合認識システム及び方法

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR20060121836 2006-12-04
KR10-2006-0121836 2006-12-04
KR10-2007-0086575 2007-08-28
KR1020070086575A KR100948600B1 (ko) 2006-12-04 2007-08-28 제스처/음성 융합 인식 시스템 및 방법

Publications (1)

Publication Number Publication Date
WO2008069519A1 true WO2008069519A1 (fr) 2008-06-12

Family

ID=39492339

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2007/006189 WO2008069519A1 (fr) 2006-12-04 2007-12-03 Système et procédé de reconnaissance intégrée de geste/voix

Country Status (1)

Country Link
WO (1) WO2008069519A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0594129A2 (fr) * 1992-10-20 1994-04-27 Hitachi, Ltd. Système d'affichage capable d'accepter des commandes d'utilisateur utilisant voix et gestes
JPH07306772A (ja) * 1994-05-16 1995-11-21 Canon Inc 情報処理方法及び装置
KR20010075838A (ko) * 2000-01-20 2001-08-11 오길록 멀티모달 인터페이스 처리 장치 및 그 방법
US20030001908A1 (en) * 2001-06-29 2003-01-02 Koninklijke Philips Electronics N.V. Picture-in-picture repositioning and/or resizing based on speech and gesture control

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012525625A (ja) * 2009-04-30 2012-10-22 サムスン エレクトロニクス カンパニー リミテッド マルチモーダル情報を用いるユーザ意図推論装置及び方法
JP2011081541A (ja) * 2009-10-06 2011-04-21 Canon Inc 入力装置及びその制御方法
EP2347810A3 (fr) * 2009-12-30 2011-09-14 Crytek GmbH Dispositif d'entrée mobile et de capteur pour système de divertissement vidéo commandé par ordinateur
US9344753B2 (en) 2009-12-30 2016-05-17 Crytek Gmbh Mobile input and sensor device for a computer-controlled video entertainment system
US8977972B2 (en) 2009-12-31 2015-03-10 Intel Corporation Using multi-modal input to control multiple objects on a display
GB2476711B (en) * 2009-12-31 2012-09-05 Intel Corp Using multi-modal input to control multiple objects on a display
GB2476711A (en) * 2009-12-31 2011-07-06 Intel Corp Using multi-modal input to control multiple objects on a display
WO2011130083A3 (fr) * 2010-04-14 2012-02-02 T-Mobile Usa, Inc. Suppression de bruit et reconnaissance de la parole assistées par une caméra
WO2011130083A2 (fr) * 2010-04-14 2011-10-20 T-Mobile Usa, Inc. Suppression de bruit et reconnaissance de la parole assistées par une caméra
US8635066B2 (en) 2010-04-14 2014-01-21 T-Mobile Usa, Inc. Camera-assisted noise cancellation and speech recognition
CN102298442A (zh) * 2010-06-24 2011-12-28 索尼公司 手势识别设备、手势识别方法及程序
EP2400371A3 (fr) * 2010-06-24 2015-04-08 Sony Corporation Appareil de reconnaissance de geste, procédé de reconnaissance de geste et programme
US9842168B2 (en) 2011-03-31 2017-12-12 Microsoft Technology Licensing, Llc Task driven user intents
US9858343B2 (en) 2011-03-31 2018-01-02 Microsoft Technology Licensing Llc Personalization of queries, conversations, and searches
US10585957B2 (en) 2011-03-31 2020-03-10 Microsoft Technology Licensing, Llc Task driven user intents
US10642934B2 (en) 2011-03-31 2020-05-05 Microsoft Technology Licensing, Llc Augmented conversational understanding architecture
US10296587B2 (en) 2011-03-31 2019-05-21 Microsoft Technology Licensing, Llc Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof
US9298287B2 (en) 2011-03-31 2016-03-29 Microsoft Technology Licensing, Llc Combined activation for natural user interface systems
US9244984B2 (en) 2011-03-31 2016-01-26 Microsoft Technology Licensing, Llc Location based conversational understanding
US10049667B2 (en) 2011-03-31 2018-08-14 Microsoft Technology Licensing, Llc Location-based conversational understanding
US9760566B2 (en) 2011-03-31 2017-09-12 Microsoft Technology Licensing, Llc Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof
US10061843B2 (en) 2011-05-12 2018-08-28 Microsoft Technology Licensing, Llc Translating natural language utterances to keyword search queries
US9454962B2 (en) 2011-05-12 2016-09-27 Microsoft Technology Licensing, Llc Sentence simplification for spoken language understanding
US9002714B2 (en) 2011-08-05 2015-04-07 Samsung Electronics Co., Ltd. Method for controlling electronic apparatus based on voice recognition and motion recognition, and electronic apparatus applying the same
US9733895B2 (en) 2011-08-05 2017-08-15 Samsung Electronics Co., Ltd. Method for controlling electronic apparatus based on voice recognition and motion recognition, and electronic apparatus applying the same
CN103376891A (zh) * 2012-04-23 2013-10-30 凹凸电子(武汉)有限公司 多媒体系统,显示装置的控制方法及控制器
WO2014010879A1 (fr) * 2012-07-09 2014-01-16 엘지전자 주식회사 Appareil et procédé de reconnaissance vocale
WO2014078480A1 (fr) * 2012-11-16 2014-05-22 Aether Things, Inc. Structure unifiée pour la configuration, l'interaction et la commande d'un dispositif, et procédés, dispositifs et systèmes associés
AU2013270485C1 (en) * 2012-12-31 2016-01-21 Huawei Technologies Co. , Ltd. Input processing method and apparatus
AU2013270485B2 (en) * 2012-12-31 2015-09-10 Huawei Technologies Co. , Ltd. Input processing method and apparatus
EP2765473A4 (fr) * 2012-12-31 2014-12-10 Huawei Tech Co Ltd Procédé et appareil de traitement d'entrée
EP2765473A1 (fr) * 2012-12-31 2014-08-13 Huawei Technologies Co., Ltd. Procédé et appareil de traitement d'entrée
CN103064530A (zh) * 2012-12-31 2013-04-24 华为技术有限公司 输入处理方法和装置
CN104317392A (zh) * 2014-09-25 2015-01-28 联想(北京)有限公司 一种信息控制方法及电子设备
CN105792005A (zh) * 2014-12-22 2016-07-20 深圳Tcl数字技术有限公司 录像控制的方法及装置
CN105792005B (zh) * 2014-12-22 2019-05-14 深圳Tcl数字技术有限公司 录像控制的方法及装置
EP4220630A1 (fr) * 2016-11-03 2023-08-02 Samsung Electronics Co., Ltd. Dispositif électronique et son procédé de commande
CN110121696A (zh) * 2016-11-03 2019-08-13 三星电子株式会社 电子设备及其控制方法
EP3523709A4 (fr) * 2016-11-03 2019-11-06 Samsung Electronics Co., Ltd. Dispositif électronique et procédé de commande associé
WO2018084576A1 (fr) 2016-11-03 2018-05-11 Samsung Electronics Co., Ltd. Dispositif électronique et procédé de commande associé
US20180122379A1 (en) * 2016-11-03 2018-05-03 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
US10679618B2 (en) 2016-11-03 2020-06-09 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
US11908465B2 (en) 2016-11-03 2024-02-20 Samsung Electronics Co., Ltd. Electronic device and controlling method thereof
US11521038B2 (en) 2018-07-19 2022-12-06 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
US10739864B2 (en) 2018-12-31 2020-08-11 International Business Machines Corporation Air writing to speech system using gesture and wrist angle orientation for synthesized speech modulation
WO2021135281A1 (fr) * 2019-12-30 2021-07-08 浪潮(北京)电子信息产业有限公司 Procédé, appareil, dispositif et support pour la détection de point d'extrémité sur la base d'une fusion de caractéristiques multicouches
CN115881118A (zh) * 2022-11-04 2023-03-31 荣耀终端有限公司 一种语音交互方法及相关电子设备
CN115881118B (zh) * 2022-11-04 2023-12-22 荣耀终端有限公司 一种语音交互方法及相关电子设备

Similar Documents

Publication Publication Date Title
WO2008069519A1 (fr) Système et procédé de reconnaissance intégrée de geste/voix
KR100948600B1 (ko) 제스처/음성 융합 인식 시스템 및 방법
CN107799126B (zh) 基于有监督机器学习的语音端点检测方法及装置
US11762474B2 (en) Systems, methods and devices for gesture recognition
US11854550B2 (en) Determining input for speech processing engine
US8793134B2 (en) System and method for integrating gesture and sound for controlling device
CN109766759A (zh) 情绪识别方法及相关产品
CN112016367A (zh) 一种情绪识别系统、方法及电子设备
CN110310623A (zh) 样本生成方法、模型训练方法、装置、介质及电子设备
CN110322760B (zh) 语音数据生成方法、装置、终端及存储介质
CN106157956A (zh) 语音识别的方法及装置
JP2012014394A (ja) ユーザ指示取得装置、ユーザ指示取得プログラムおよびテレビ受像機
KR20100062207A (ko) 화상통화 중 애니메이션 효과 제공 방법 및 장치
KR102291740B1 (ko) 영상처리 시스템
CN113129867A (zh) 语音识别模型的训练方法、语音识别方法、装置和设备
CN114779922A (zh) 教学设备的控制方法、控制设备、教学系统和存储介质
CN111326152A (zh) 语音控制方法及装置
CN107452381B (zh) 一种多媒体语音识别装置及方法
CN110728993A (zh) 一种变声识别方法及电子设备
CN110286771A (zh) 交互方法、装置、智能机器人、电子设备及存储介质
KR20210066774A (ko) 멀티모달 기반 사용자 구별 방법 및 장치
CN117975957A (zh) 一种基于人工智能的交互系统及装置
CN113497912A (zh) 通过语音和视频定位的自动取景
KR20130054131A (ko) 디스플레이장치 및 그 제어방법
CN116229962A (zh) 终端设备及语音唤醒方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07851181

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2009540141

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07851181

Country of ref document: EP

Kind code of ref document: A1