WO2008069519A1 - Integrated gesture/speech recognition system and method - Google Patents
Integrated gesture/speech recognition system and method
- Publication number
- WO2008069519A1 · PCT/KR2007/006189
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- gesture
- speech
- feature information
- integrated
- module
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 23
- 238000000605 extraction Methods 0.000 claims abstract description 45
- 238000001514 detection method Methods 0.000 claims description 18
- 239000013598 vector Substances 0.000 claims description 7
- 238000013528 artificial neural network Methods 0.000 claims description 5
- 230000010354 integration Effects 0.000 claims description 5
- 230000009467 reduction Effects 0.000 claims description 5
- 238000013179 statistical model Methods 0.000 claims description 3
- 238000005516 engineering process Methods 0.000 description 18
- 238000011161 development Methods 0.000 description 5
- 230000007613 environmental effect Effects 0.000 description 3
- 238000002474 experimental method Methods 0.000 description 3
- 239000000284 extract Substances 0.000 description 3
- 230000006870 function Effects 0.000 description 3
- 238000005457 optimization Methods 0.000 description 2
- 230000008569 process Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000006866 deterioration Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000008921 facial expression Effects 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000001360 synchronised effect Effects 0.000 description 1
- 210000001260 vocal cord Anatomy 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/03—Arrangements for converting the position or the displacement of a member into a coded form
- G06F3/033—Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
- G06F3/038—Control and interface arrangements therefor, e.g. drivers or device-embedded control circuitry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/038—Indexing scheme relating to G06F3/038
- G06F2203/0381—Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer
Definitions
- the present invention relates to an integrated recognition technology, and more particularly, to a gesture/speech integrated recognition system and method capable of recognizing a user's order by extracting feature information on a gesture by using an end point detection (EPD) value of a speech and integrating feature information on the speech with the feature information on the gesture, thereby recognizing the user's order at a high recognition rate in a noisy environment.
- EPD end point detection
- speech recognition technology and gesture recognition technology are used as the most convenient interface technologies.
- the speech recognition technology and the gesture recognition technology have a high recognition rate in a limited environment.
- However, the performance of both technologies degrades in a noisy environment. This is because environmental noise affects the performance of speech recognition, while changes in lighting and the type of gesture affect the performance of camera-based gesture recognition. Therefore, the speech recognition technology requires the development of algorithms robust against noise, and the gesture recognition technology requires the development of a technique for extracting the particular section of a gesture that contains the recognition information.
- a particular section of the gesture cannot be easily identified, so that recognition is difficult.
- An aspect of the present invention provides a means for extracting feature information by detecting an order section from a speech of a user by using an algorithm robust against environmental noise, and detecting a feature section of a gesture by using information on an order start point of the speech, thereby easily recognizing the order in the gesture that is not easily identified.
- An aspect of the present invention also provides a means for synchronizing a gesture and a speech by applying optimal frames set in advance to an order section of the gesture detected by a speech end point detection (EPD) value, thereby solving a problem of a synchronization difference between the speech and the gesture while integrated recognition is performed.
- EPD speech end point detection
- a gesture/speech integrated recognition system including: a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech; a gesture feature extraction unit extracting gesture feature information by detecting an order section from a gesture in taken images by using information on the detected start point and end point; and an integrated recognition unit outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance.
- the system may further include a synchronization module including: a gesture start point detection module detecting a start point of the gesture from the taken images by using the detected start point; and an optimal frame applying module calculating and extracting optimal image frames by applying the number of optimal frames set in advance from the start point of the gesture.
- the gesture start point detection module may detect the start point of the gesture by checking, in the taken images, an end point detection (EPD) flag corresponding to the detected start point of the speech.
- EPD end point detection
- the speech feature extraction unit may include: an EPD module detecting a start point and an end point in the input speech; and a hearing model-based speech feature extraction module extracting speech feature information included in the detected order by using a hearing model-based algorithm.
- the speech feature extraction unit may remove noise from the extracted speech feature information.
- the gesture feature extraction unit may include: a hand tracking module tracking movements of a hand in images taken by a camera and transmitting the movements to the synchronization module; and a gesture feature extraction module extracting gesture feature information by using the optimal image frames extracted by the synchronization module.
- the integrated recognition unit may include: an integrated learning DB implemented to be suitable for integrated recognition algorithm development based on a statistical model; an integrated model generation module generating an integrated model for integrating the speech feature information with the gesture feature information; an integrated learning DB control module controlling learning by the integrated model generation module and the integrated learning DB and generating a learning parameter; an integrated feature control module controlling the learning parameter and feature vectors of the extracted speech feature information and gesture feature information; and an integrated recognition module generating a recognition result.
- the integrated feature control module may control the feature vectors of the extracted speech feature information and gesture feature information through extension and reduction of the number of nodes of the input vectors.
- a gesture/speech integrated recognition method including: a first step of extracting speech feature information by detecting a start point that is an EPD value and an end point of an order in an input speech; a second step of extracting gesture feature information by detecting an order section from a gesture in images input through a camera by using the detected start point of the order; and a third step of outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance.
- the speech feature information may be extracted from the order section obtained by the start point and the end point of the order on the basis of a hearing model.
- the second step may include: a B step of detecting an order section from the gesture in the tracked hand movements by using the transmitted EPD value; a C step of determining optimal frames in the order section of the gesture by applying optimal frames set in advance; and a D step of extracting gesture feature information from the determined optimal frames.
- the gesture/speech integrated recognition system and method according to the present invention increases a recognition rate of a gesture that is not easily identified by detecting an order section of the gesture by using an end point detection (EPD) value that is a start point of an order section of a speech.
- EPD end point detection
- optimal frames are applied to the order section of the gesture to synchronize the speech and the gesture, so that it is possible to implement integrated recognition between the speech and the gesture.
- FIG. 1 is a view illustrating a concept of a gesture/speech integrated recognition system according to an embodiment of the present invention.
- FIG. 2 is a view illustrating a configuration of a gesture/speech integrated recognition system according to an embodiment of the present invention.
- FIG. 3 is a flowchart illustrating a gesture/speech integrated recognition method according to an embodiment of the present invention.
Best Mode for Carrying Out the Invention
- FIG. 1 is a view illustrating a concept of a gesture/speech integrated recognition system according to an embodiment of the present invention.
- a person 100 orders through a speech 110 and a gesture 120.
- the person may indicate a corn bread with a finger while saying "select a corn bread" as an order for selecting the corn bread from among the displayed goods.
- feature information on the speech order of the person is recognized through speech recognition 111, and feature information on the gesture of the person is recognized through gesture recognition 121.
- the feature information recognized from the gesture and the speech is then processed through integrated recognition 130 as a single user order in order to increase the recognition rate of the speech, which is affected by environmental noise, and of the gesture, which cannot be easily identified.
- the present invention provides a technology of integrated recognition for a speech and a gesture of a person.
- the recognized order is transmitted by a controller to a speaker 170, a display apparatus 171, a diffuser 172, a touching device 173, and a tasting device 174 which are output devices for the five senses, and the controller controls each device.
- the recognition result is transmitted to a network, so that data of the five senses as the result is transmitted to control each of the output devices.
- the present invention relates to integrated recognition, and a configuration after the recognition operation can be modified in various manners, so that a detailed description thereof is omitted.
- FIG. 2 is a view illustrating a configuration of the gesture/speech integrated recognition system according to an embodiment of the present invention.
- the gesture/speech integrated recognition system includes a speech feature extraction unit 210 for extracting speech feature information by detecting a start point and an end point of an order from a speech input through a microphone 211, a gesture feature extraction unit 220 for extracting gesture feature information by detecting an order section from a gesture in images taken by a camera by using information on the start point and the end point detected by the speech feature extraction unit 210, a synchronization module 230 for detecting a start point of the gesture in the taken images by using the start point detected by the speech feature extraction unit 210 and calculating optimal image frames by applying the number of optimal frames set in advance from the detected start point of the gesture, and an integrated recognition unit 240 for outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance.
- the speech feature extraction unit 210 includes the microphone 211 receiving a speech of a user, an end point detection (EPD) module 212 detecting a start point and an end point of an order section from the speech of the user, and a hearing model-based speech feature extraction module 213 extracting speech feature information from the order section of the speech detected by the EPD module 212 on the basis of the hearing model.
- a channel noise reduction module (not shown) for removing noise from the extracted speech feature information may further be included.
- the EPD module 212 detects the start and the end of the order by analyzing the speech input through the wired or wireless microphone.
- the EPD module 212 acquires a speech signal, calculates an energy value needed to detect the end point of the speech signal, and detects the start and the end of the order by identifying a section to be calculated as the order from the input speech signal.
- the EPD module 212 first acquires the speech signal through the microphone and converts the acquired speech into a form suitable for frame-based calculation. Meanwhile, when the speech is input wirelessly, data loss or signal deterioration due to signal interference may occur, so an operation to deal with this problem during signal acquisition is needed.
- the energy value needed to detect the end point of the speech signal by the EPD module 212 is calculated as follows.
- a frame size used to analyze the speech signal is 160 samples, and frame energy is calculated by the following equation.
- S(n) denotes a vocal cord signal sample
- N denotes the number of samples per frame.
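- The equation itself is not reproduced in this text. A conventional short-time frame-energy formula consistent with these definitions would be (whether a logarithm is additionally applied is an assumption):

$$E = \sum_{n=0}^{N-1} S(n)^2$$

where E denotes the energy of one 160-sample frame.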
- the calculated frame energy is used as a parameter for detecting the end point.
- the EPD module 212 determines the section to be treated as the order after calculating the frame energy values. For example, the start point and the end point of the speech signal are determined by using four energy thresholds based on the frame energy and 10 conditions.
- the four energy thresholds and the 10 conditions can be set in various manners and may have values obtained in experiments in order to obtain the order section.
- the four thresholds are used to detect the start and end of every frame by an EPD algorithm.
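- As a rough illustration of how such an energy-based endpoint detector might operate, the following sketch uses a simplified two-threshold rule with assumed parameter values; the patent's actual four-threshold, ten-condition decision logic is determined experimentally and is not reproduced here.

```python
# Illustrative sketch only: a simplified energy-threshold endpoint detector.
# start_thr, end_thr and min_order_frames are assumed values, not the patent's
# experimentally chosen four thresholds and ten conditions.
import numpy as np

FRAME_SIZE = 160  # samples per frame, as described above

def frame_energy(frame):
    """Short-time energy of one frame (see the equation sketched earlier)."""
    frame = np.asarray(frame, dtype=np.float64)
    return float(np.sum(frame ** 2))

def detect_endpoints(signal, start_thr=1e6, end_thr=1e5, min_order_frames=5):
    """Return (start_frame, end_frame) of the order section, or None."""
    n_frames = len(signal) // FRAME_SIZE
    energies = [frame_energy(signal[i * FRAME_SIZE:(i + 1) * FRAME_SIZE])
                for i in range(n_frames)]
    start = None
    for i, e in enumerate(energies):
        if start is None and e > start_thr:
            start = i                       # candidate start of the order
        elif start is not None and e < end_thr:
            if i - start >= min_order_frames:
                return (start, i)           # order section is long enough
            start = None                    # too short: treat as a noise burst
    return None
```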
- the EPD module 212 transmits information on the detected start point of the order (hereinafter referred to as an "EPD value") to a gesture start point detection module 231 of the synchronization module 230.
- the EPD module 212 transmits information on the order section in the input speech to the hearing model-based speech feature extraction module 213 to extract speech feature information.
- the hearing model-based speech feature extraction module 213 that receives the information on the order section in the speech extracts the feature information on the basis of the hearing model from the order section detected by the EPD module 212.
- Examples of an algorithm used to extract the speech feature information on the basis of the hearing model include an ensemble interval histogram (EIH) and a zero-crossings with peak amplitudes (ZCPA).
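- A minimal single-band sketch of the ZCPA idea is given below for illustration: upward zero-crossing intervals estimate frequency, and the log-compressed peak amplitude between crossings weights a frequency-bin histogram. A real ZCPA front end applies this per band of an auditory filter bank and accumulates the histograms over frames; the parameter values here are assumptions.

```python
# Simplified, single-band illustration of ZCPA (zero-crossings with peak amplitudes).
import numpy as np

def zcpa_histogram(frame, fs=16000, n_bins=64, f_max=4000.0):
    frame = np.asarray(frame, dtype=np.float64)
    up = np.where((frame[:-1] <= 0) & (frame[1:] > 0))[0]  # upward zero crossings
    hist = np.zeros(n_bins)
    for a, b in zip(up[:-1], up[1:]):
        freq = fs / (b - a)                       # inverse of the crossing interval
        if freq >= f_max:
            continue
        peak = np.max(np.abs(frame[a:b]))         # peak amplitude in the interval
        hist[int(freq / f_max * n_bins)] += np.log1p(peak)  # log-compressed weight
    return hist
```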
- a channel noise reduction module removes noise from the speech feature information extracted by the hearing model-based speech feature extraction module 213, and the speech feature information from which the noise is removed is transmitted to the integrated recognition unit 240.
- the gesture feature extraction unit 220 includes a face and hand detection module 222 for detecting a face and a hand from images taken by a camera, a hand tracking module 223 for tracking movements of the detected hand and transmitting them to the synchronization module 230, and a gesture feature extraction module 224 for extracting feature information on the gesture by using the optimal frames calculated by the synchronization module 230.
- the face and hand detection module 222 detects the face and the hand that gives a gesture, and a hand tracking module 223 continuously tracks the hand movements in the images.
- the hand tracking module 223 tracks the hand.
- the hand tracking module 223 may track various body portions having movements recognized as a gesture.
- the hand tracking module 223 continuously stores the hand movements as time elapses, and a part recognized as a gesture order in the hand movement is detected by the synchronization module 230 by using the EPD value transmitted from the speech feature extraction unit 210.
- the synchronization module 230 which detects a section recognized as the gesture order in the hand movements by using the EPD value and applies the optimal frames for synchronization between speeches and gestures will be described.
- the synchronization module 230 includes the gesture start point detection module 231 for detecting the start point of the gesture in the taken images by using the EPD value detected by the speech feature extraction unit 210, and an optimal frame applying module 232 for calculating the optimal image frames needed for integrated recognition by using the start frame of the gesture obtained from the detected start point.
- the gesture start point detection module 231 of the synchronization module 230 checks a speech EPD flag in the image signal. In this manner, the gesture start point detection module 231 calculates the start frame of the gesture. In addition, the optimal frame applying module 232 calculates the optimal image frames needed for integrated recognition by using the calculated start frame of the gesture and transmits the optimal image frames to the gesture feature extraction module 224.
- In order to calculate the optimal image frames needed for integrated recognition by the optimal frame applying module 232, the number of frames determined to yield a high gesture recognition rate is set in advance; when the start frame of the gesture is calculated by the gesture start point detection module 231, the optimal image frames are determined from that frame.
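- A minimal sketch of this synchronization step is shown below, assuming the 15 frames-per-second camera rate used in the data collection described later and an illustrative optimal-frame count (the actual number is set in advance from experiments):

```python
# Sketch of mapping the speech EPD start time to an image frame index and taking
# a preset number of "optimal" frames from that point for gesture feature extraction.
CAMERA_FPS = 15               # image frame rate used in the described setup
OPTIMAL_FRAME_COUNT = 20      # assumed value; the patent sets this experimentally

def select_optimal_frames(frame_buffer, epd_start_sec, buffer_start_sec):
    """frame_buffer: list of images buffered while tracking the hand."""
    start_frame = int((epd_start_sec - buffer_start_sec) * CAMERA_FPS)
    start_frame = max(start_frame, 0)
    return frame_buffer[start_frame:start_frame + OPTIMAL_FRAME_COUNT]
```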
- the integrated recognition unit 240 includes an integrated model generation module 242 for generating an integrated model that effectively integrates the speech feature information with the gesture feature information, an integrated learning database (DB) 244 implemented to be suitable for integrated recognition algorithm development based on a statistical model, an integrated learning DB control module 243 for controlling learning by the integrated model generation module 242 and the integrated learning DB 244 and generating a learning parameter, an integrated feature control module 241 for controlling the learning parameter and the feature vectors of the input speech feature information and gesture feature information, and an integrated recognition module 245 for providing various functions by generating recognition results.
- the integrated model generation module 242 generates a high-performance integrated model in order to effectively integrate the speech feature information with the gesture feature information.
- various learning algorithms such as the hidden Markov model (HMM), neural network (NN), and dynamic time warping (DTW) are implemented, and experiments are performed.
- HMM hidden Markov model
- NN neural network
- DTW dynamic time warping
- a method of optimizing NN parameters by determining an integrated model on the basis of the NN to obtain high-performance integrated recognition may be used.
- the generation of the high-performance integrated model has a problem of synchronizing two modalities that have different frame numbers in the learning model.
- the problem of the synchronization in the learning model is referred to as a learning model optimization problem.
- an integration layer is provided, and a method of connecting the speech and the gesture is optimized in the layer.
- An overlapping length between the speech and the gesture is calculated with respect to a time axis for the optimization, and the synchronization is performed on the basis of the overlapping length.
- the overlapping length is used to search for the connection method having the highest recognition rate through a recognition rate experiment.
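- A minimal sketch of the overlapping-length computation on the time axis follows, with the interval representation assumed for illustration; candidate connection methods can then be compared by their recognition rates as described above:

```python
# Overlap of the speech and gesture order sections on the time axis, used to
# align the two modalities before they enter the integration layer.
def overlap_length(speech_interval, gesture_interval):
    """Each interval is (start_sec, end_sec); returns the overlap in seconds (>= 0)."""
    s0, s1 = speech_interval
    g0, g1 = gesture_interval
    return max(0.0, min(s1, g1) - max(s0, g0))

# e.g. overlap_length((0.40, 1.60), (0.55, 1.90)) -> 1.05
```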
- the integrated learning DB 244 implements an integrated recognition DB to be proper for the integrated recognition algorithm development based on the statistical model.
- Table 1 shows an order set defined to integrate gestures and speeches.
- the defined order set is obtained on the basis of natural gestures that people can understand without learning.
- the speech is recorded as a single-channel pulse code modulation (PCM) waveform at a sampling rate of 16 kHz with 16 bits per sample.
- PCM pulse code modulation
- 24-bit BITMAP images with a size of 320x240 are recorded at 15 frames per second using an STH-DCSG-C stereo camera, with a blue-screen background under lighting from four fluorescent boxes. Since the stereo camera does not have a speech interface, the speech collection module and the image collection module are provided independently, and a speech/image synchronization program is written to collect the data, in which the speech recording program controls the image collection process through inter-process communication (IPC).
- the image collection module is configured by using an open source computer vision library (OpenCV) and a small vision system (SVS).
- OpenCV open source computer vision library
- SVS small vision system
- Stereo camera images have to be adjusted to the real recording environment through an additional calibration process, and in order to acquire optimal images, the associated gain, exposure, brightness, red, and blue parameter values are modified to control the color, exposure, and white balance (WB) values.
- Calibration information and parameter information are stored in a separate ini file so that the image storage module can call and use the information.
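- A minimal sketch of how such capture parameters might be loaded from the ini file and applied with OpenCV is shown below; the file name, section, and key names are assumptions, and the SVS/stereo-specific configuration is omitted:

```python
# Illustrative sketch: read assumed calibration/parameter values from an .ini file
# and apply the basic capture settings (320x240, 15 fps) with OpenCV.
import configparser
import cv2

config = configparser.ConfigParser()
config.read("capture_params.ini")                        # hypothetical file name
gain = config.getfloat("camera", "gain", fallback=1.0)   # hypothetical keys
exposure = config.getfloat("camera", "exposure", fallback=-4.0)

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 320)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 240)
cap.set(cv2.CAP_PROP_FPS, 15)
cap.set(cv2.CAP_PROP_GAIN, gain)
cap.set(cv2.CAP_PROP_EXPOSURE, exposure)

ok, frame = cap.read()                                   # frame: 320x240 BGR image
cap.release()
```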
- the integrated learning DB control module 243 generates the learning parameter in advance on the basis of the integrated model generated by the integrated model generation module 242 and the integrated learning DB 244, and stores it.
- the integrated feature control module 241 controls the learning parameter generated by the integrated learning DB control module 243 and the feature vectors of the feature information on the speech and the gesture extracted by the speech feature extraction unit 210 and the gesture feature extraction unit 220.
- the control operation is associated with the extension and the reduction of the node number of input vectors.
- the integrated feature control module 241 has the integration layer, which is developed to effectively integrate the speech and gesture features, whose lengths have different values, and to produce a single recognition result.
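- A minimal sketch of the node-number extension/reduction idea follows, with the fixed node counts assumed purely for illustration:

```python
# Speech and gesture feature vectors of differing lengths are reduced (truncated)
# or extended (zero-padded) to fixed node counts and concatenated for the
# integration layer of the NN-based model. Node counts below are assumptions.
import numpy as np

SPEECH_NODES = 256
GESTURE_NODES = 128

def fit_length(vec, n_nodes):
    vec = np.asarray(vec, dtype=np.float32).ravel()
    if len(vec) >= n_nodes:
        return vec[:n_nodes]                        # reduction of node number
    return np.pad(vec, (0, n_nodes - len(vec)))     # extension by zero padding

def integrated_input(speech_feat, gesture_feat):
    return np.concatenate([fit_length(speech_feat, SPEECH_NODES),
                           fit_length(gesture_feat, GESTURE_NODES)])
```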
- the integrated recognition module 245 generates a recognition result by using a control result obtained by the integrated feature control module 241.
- the integrated recognition module 245 provides various functions for interactions with a network or an integration representing unit.
- FIG. 3 is a flowchart illustrating a gesture/speech integrated recognition method according to an embodiment of the present invention.
- the gesture/speech integrated recognition method operates with three threads.
- the three threads include a speech feature extraction thread 10 for extracting speech features, a gesture feature extraction thread 20 for extracting gesture features, and an integrated recognition thread 30 for performing integrated recognition on the speech and the gesture.
- the three threads 10, 20, and 30 are generated when the learning parameter is loaded and operate cooperatively by using a thread flag. The gesture/speech integrated recognition method in which the three threads 10, 20, and 30 operate cooperatively is now described.
- the gesture feature extraction thread 20 continuously receives images including the gesture through the camera (operation S320). In the speech feature extraction thread 10, speech frames of the speech continuously input through the microphone are calculated (operation S312), and the EPD module 212 detects a start point and an end point (speech EPD value) of an order included in the speech (operation S313). When the speech EPD value is detected, it is transmitted to a synchronization operation 40 of the gesture feature extraction thread 20.
- the hearing model-based speech feature extraction module 213 extracts a speech feature from the order section on the basis of the hearing model (operation S314) and transmits the extracted speech feature to the integrated recognition thread 30.
- the gesture feature extraction thread 20 detects a hand and a face from the images continuously input through the camera (operation S321). When the hand and face are detected, a gesture of the user is tracked (operation S322). Since the gesture of the user continuously changes, a gesture having a predetermined length is stored in a buffer (operation S323).
- a speech EPD flag in the gesture images stored in the buffer is checked (operation S324).
- the start point and the end point of the gesture including the feature information in the images are searched for by using the speech EPD flag (operation S325), and the found gesture feature is stored (operation S326).
- the stored gesture feature is not yet synchronized with the speech, so the optimal frames are calculated from the start frame of the gesture by applying the number of optimal frames set in advance.
- the calculated optimal frames are used to extract gesture feature information by the gesture feature extraction module 224, and the extracted gesture feature information is transmitted to the integrated recognition thread 30.
- the speech/gesture feature extraction threads 10 and 20 are in a sleep state while the integrated recognition thread checks a recognition result (operations S328 and S315).
- the integrated recognition thread 30 generates a high-performance integrated model in advance by using the integrated model generation module 242 before receiving the speech feature information and the gesture feature information, and the integrated learning DB control module 243 controls the generated integrated model and the integrated learning DB 244 to generate and load the learning parameter (operation S331).
- the integrated recognition thread 30 is maintained in the sleep state until the speech/gesture feature information is received (operation S332).
- when the integrated recognition thread 30 that is in the sleep state receives a signal associated with the feature information after the extraction of the feature information on the speech and the gesture is completed (operation S333), the integrated recognition thread 30 loads each feature into memory (operation S334).
- the recognition result is calculated by using the optimized integrated learning model and the learning parameter that are set in advance (operation S335).
- the speech feature extraction thread 10 and the gesture feature extraction thread 20 that are in the sleep state perform operations of extracting feature information from a speech and an image that are input.
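- A structural sketch of the three cooperating threads described above is given below, using Python threading events in place of the thread flag mentioned in the text; the feature-extraction and recognition bodies are stubs, so only the wake/sleep flow is shown:

```python
# Sketch of the thread structure: speech feature extraction, gesture feature
# extraction, and integrated recognition, coordinated by simple event flags.
import threading

epd_detected = threading.Event()     # speech EPD value found
speech_ready = threading.Event()     # speech feature information extracted
gesture_ready = threading.Event()    # gesture feature information extracted

def speech_feature_thread():
    # ... detect the order section (EPD) and extract hearing-model-based features ...
    epd_detected.set()               # pass the EPD value to gesture synchronization
    speech_ready.set()

def gesture_feature_thread():
    # ... track the hand and buffer frames ...
    epd_detected.wait()              # sleep until the speech EPD value arrives
    # ... select the optimal frames and extract gesture features ...
    gesture_ready.set()

def integrated_recognition_thread():
    # the learning parameter is assumed to have been loaded before thread creation
    speech_ready.wait()              # sleep until both feature sets are available
    gesture_ready.wait()
    # ... load the features into memory and compute the recognition result ...

threads = [threading.Thread(target=f) for f in
           (integrated_recognition_thread, gesture_feature_thread, speech_feature_thread)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```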
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- User Interface Of Digital Computer (AREA)
- Image Analysis (AREA)
Abstract
A gesture/speech integrated recognition system and method are provided, including a speech feature extraction unit extracting speech feature information by detecting a start point and an end point of an order in an input speech; a gesture feature extraction unit extracting gesture feature information by detecting an order section of a gesture in captured images by using information on the detected start point and end point; and an integrated recognition unit outputting the extracted speech feature information and gesture feature information as integrated recognition data by using a learning parameter set in advance. Accordingly, it is possible to increase the recognition rate of an order by integrating the speech and the gesture in a noisy environment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009540141A JP2010511958A (ja) | 2006-12-04 | 2007-12-03 | ジェスチャー/音声統合認識システム及び方法 |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20060121836 | 2006-12-04 | ||
KR10-2006-0121836 | 2006-12-04 | ||
KR10-2007-0086575 | 2007-08-28 | ||
KR1020070086575A KR100948600B1 (ko) | 2006-12-04 | 2007-08-28 | 제스처/음성 융합 인식 시스템 및 방법 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008069519A1 true WO2008069519A1 (fr) | 2008-06-12 |
Family
ID=39492339
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2007/006189 WO2008069519A1 (fr) | 2006-12-04 | 2007-12-03 | Système et procédé de reconnaissance intégrée de geste/voix |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2008069519A1 (fr) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011081541A (ja) * | 2009-10-06 | 2011-04-21 | Canon Inc | 入力装置及びその制御方法 |
GB2476711A (en) * | 2009-12-31 | 2011-07-06 | Intel Corp | Using multi-modal input to control multiple objects on a display |
EP2347810A3 (fr) * | 2009-12-30 | 2011-09-14 | Crytek GmbH | Dispositif d'entrée mobile et de capteur pour système de divertissement vidéo commandé par ordinateur |
WO2011130083A2 (fr) * | 2010-04-14 | 2011-10-20 | T-Mobile Usa, Inc. | Suppression de bruit et reconnaissance de la parole assistées par une caméra |
CN102298442A (zh) * | 2010-06-24 | 2011-12-28 | 索尼公司 | 手势识别设备、手势识别方法及程序 |
JP2012525625A (ja) * | 2009-04-30 | 2012-10-22 | サムスン エレクトロニクス カンパニー リミテッド | マルチモーダル情報を用いるユーザ意図推論装置及び方法 |
CN103064530A (zh) * | 2012-12-31 | 2013-04-24 | 华为技术有限公司 | 输入处理方法和装置 |
CN103376891A (zh) * | 2012-04-23 | 2013-10-30 | 凹凸电子(武汉)有限公司 | 多媒体系统,显示装置的控制方法及控制器 |
WO2014010879A1 (fr) * | 2012-07-09 | 2014-01-16 | 엘지전자 주식회사 | Appareil et procédé de reconnaissance vocale |
WO2014078480A1 (fr) * | 2012-11-16 | 2014-05-22 | Aether Things, Inc. | Structure unifiée pour la configuration, l'interaction et la commande d'un dispositif, et procédés, dispositifs et systèmes associés |
CN104317392A (zh) * | 2014-09-25 | 2015-01-28 | 联想(北京)有限公司 | 一种信息控制方法及电子设备 |
US9002714B2 (en) | 2011-08-05 | 2015-04-07 | Samsung Electronics Co., Ltd. | Method for controlling electronic apparatus based on voice recognition and motion recognition, and electronic apparatus applying the same |
US9244984B2 (en) | 2011-03-31 | 2016-01-26 | Microsoft Technology Licensing, Llc | Location based conversational understanding |
US9298287B2 (en) | 2011-03-31 | 2016-03-29 | Microsoft Technology Licensing, Llc | Combined activation for natural user interface systems |
CN105792005A (zh) * | 2014-12-22 | 2016-07-20 | 深圳Tcl数字技术有限公司 | 录像控制的方法及装置 |
US9454962B2 (en) | 2011-05-12 | 2016-09-27 | Microsoft Technology Licensing, Llc | Sentence simplification for spoken language understanding |
US9760566B2 (en) | 2011-03-31 | 2017-09-12 | Microsoft Technology Licensing, Llc | Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof |
US9842168B2 (en) | 2011-03-31 | 2017-12-12 | Microsoft Technology Licensing, Llc | Task driven user intents |
US9858343B2 (en) | 2011-03-31 | 2018-01-02 | Microsoft Technology Licensing Llc | Personalization of queries, conversations, and searches |
US20180122379A1 (en) * | 2016-11-03 | 2018-05-03 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
US10061843B2 (en) | 2011-05-12 | 2018-08-28 | Microsoft Technology Licensing, Llc | Translating natural language utterances to keyword search queries |
CN110121696A (zh) * | 2016-11-03 | 2019-08-13 | 三星电子株式会社 | 电子设备及其控制方法 |
US10642934B2 (en) | 2011-03-31 | 2020-05-05 | Microsoft Technology Licensing, Llc | Augmented conversational understanding architecture |
US10739864B2 (en) | 2018-12-31 | 2020-08-11 | International Business Machines Corporation | Air writing to speech system using gesture and wrist angle orientation for synthesized speech modulation |
WO2021135281A1 (fr) * | 2019-12-30 | 2021-07-08 | 浪潮(北京)电子信息产业有限公司 | Procédé, appareil, dispositif et support pour la détection de point d'extrémité sur la base d'une fusion de caractéristiques multicouches |
US11521038B2 (en) | 2018-07-19 | 2022-12-06 | Samsung Electronics Co., Ltd. | Electronic apparatus and control method thereof |
CN115881118A (zh) * | 2022-11-04 | 2023-03-31 | 荣耀终端有限公司 | 一种语音交互方法及相关电子设备 |
-
2007
- 2007-12-03 WO PCT/KR2007/006189 patent/WO2008069519A1/fr active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0594129A2 (fr) * | 1992-10-20 | 1994-04-27 | Hitachi, Ltd. | Système d'affichage capable d'accepter des commandes d'utilisateur utilisant voix et gestes |
JPH07306772A (ja) * | 1994-05-16 | 1995-11-21 | Canon Inc | 情報処理方法及び装置 |
KR20010075838A (ko) * | 2000-01-20 | 2001-08-11 | 오길록 | 멀티모달 인터페이스 처리 장치 및 그 방법 |
US20030001908A1 (en) * | 2001-06-29 | 2003-01-02 | Koninklijke Philips Electronics N.V. | Picture-in-picture repositioning and/or resizing based on speech and gesture control |
Cited By (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012525625A (ja) * | 2009-04-30 | 2012-10-22 | サムスン エレクトロニクス カンパニー リミテッド | マルチモーダル情報を用いるユーザ意図推論装置及び方法 |
JP2011081541A (ja) * | 2009-10-06 | 2011-04-21 | Canon Inc | 入力装置及びその制御方法 |
EP2347810A3 (fr) * | 2009-12-30 | 2011-09-14 | Crytek GmbH | Dispositif d'entrée mobile et de capteur pour système de divertissement vidéo commandé par ordinateur |
US9344753B2 (en) | 2009-12-30 | 2016-05-17 | Crytek Gmbh | Mobile input and sensor device for a computer-controlled video entertainment system |
US8977972B2 (en) | 2009-12-31 | 2015-03-10 | Intel Corporation | Using multi-modal input to control multiple objects on a display |
GB2476711B (en) * | 2009-12-31 | 2012-09-05 | Intel Corp | Using multi-modal input to control multiple objects on a display |
GB2476711A (en) * | 2009-12-31 | 2011-07-06 | Intel Corp | Using multi-modal input to control multiple objects on a display |
WO2011130083A3 (fr) * | 2010-04-14 | 2012-02-02 | T-Mobile Usa, Inc. | Suppression de bruit et reconnaissance de la parole assistées par une caméra |
WO2011130083A2 (fr) * | 2010-04-14 | 2011-10-20 | T-Mobile Usa, Inc. | Suppression de bruit et reconnaissance de la parole assistées par une caméra |
US8635066B2 (en) | 2010-04-14 | 2014-01-21 | T-Mobile Usa, Inc. | Camera-assisted noise cancellation and speech recognition |
CN102298442A (zh) * | 2010-06-24 | 2011-12-28 | 索尼公司 | 手势识别设备、手势识别方法及程序 |
EP2400371A3 (fr) * | 2010-06-24 | 2015-04-08 | Sony Corporation | Appareil de reconnaissance de geste, procédé de reconnaissance de geste et programme |
US9842168B2 (en) | 2011-03-31 | 2017-12-12 | Microsoft Technology Licensing, Llc | Task driven user intents |
US9858343B2 (en) | 2011-03-31 | 2018-01-02 | Microsoft Technology Licensing Llc | Personalization of queries, conversations, and searches |
US10585957B2 (en) | 2011-03-31 | 2020-03-10 | Microsoft Technology Licensing, Llc | Task driven user intents |
US10642934B2 (en) | 2011-03-31 | 2020-05-05 | Microsoft Technology Licensing, Llc | Augmented conversational understanding architecture |
US10296587B2 (en) | 2011-03-31 | 2019-05-21 | Microsoft Technology Licensing, Llc | Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof |
US9298287B2 (en) | 2011-03-31 | 2016-03-29 | Microsoft Technology Licensing, Llc | Combined activation for natural user interface systems |
US9244984B2 (en) | 2011-03-31 | 2016-01-26 | Microsoft Technology Licensing, Llc | Location based conversational understanding |
US10049667B2 (en) | 2011-03-31 | 2018-08-14 | Microsoft Technology Licensing, Llc | Location-based conversational understanding |
US9760566B2 (en) | 2011-03-31 | 2017-09-12 | Microsoft Technology Licensing, Llc | Augmented conversational understanding agent to identify conversation context between two humans and taking an agent action thereof |
US10061843B2 (en) | 2011-05-12 | 2018-08-28 | Microsoft Technology Licensing, Llc | Translating natural language utterances to keyword search queries |
US9454962B2 (en) | 2011-05-12 | 2016-09-27 | Microsoft Technology Licensing, Llc | Sentence simplification for spoken language understanding |
US9002714B2 (en) | 2011-08-05 | 2015-04-07 | Samsung Electronics Co., Ltd. | Method for controlling electronic apparatus based on voice recognition and motion recognition, and electronic apparatus applying the same |
US9733895B2 (en) | 2011-08-05 | 2017-08-15 | Samsung Electronics Co., Ltd. | Method for controlling electronic apparatus based on voice recognition and motion recognition, and electronic apparatus applying the same |
CN103376891A (zh) * | 2012-04-23 | 2013-10-30 | 凹凸电子(武汉)有限公司 | 多媒体系统,显示装置的控制方法及控制器 |
WO2014010879A1 (fr) * | 2012-07-09 | 2014-01-16 | 엘지전자 주식회사 | Appareil et procédé de reconnaissance vocale |
WO2014078480A1 (fr) * | 2012-11-16 | 2014-05-22 | Aether Things, Inc. | Structure unifiée pour la configuration, l'interaction et la commande d'un dispositif, et procédés, dispositifs et systèmes associés |
AU2013270485C1 (en) * | 2012-12-31 | 2016-01-21 | Huawei Technologies Co. , Ltd. | Input processing method and apparatus |
AU2013270485B2 (en) * | 2012-12-31 | 2015-09-10 | Huawei Technologies Co. , Ltd. | Input processing method and apparatus |
EP2765473A4 (fr) * | 2012-12-31 | 2014-12-10 | Huawei Tech Co Ltd | Procédé et appareil de traitement d'entrée |
EP2765473A1 (fr) * | 2012-12-31 | 2014-08-13 | Huawei Technologies Co., Ltd. | Procédé et appareil de traitement d'entrée |
CN103064530A (zh) * | 2012-12-31 | 2013-04-24 | 华为技术有限公司 | 输入处理方法和装置 |
CN104317392A (zh) * | 2014-09-25 | 2015-01-28 | 联想(北京)有限公司 | 一种信息控制方法及电子设备 |
CN105792005A (zh) * | 2014-12-22 | 2016-07-20 | 深圳Tcl数字技术有限公司 | 录像控制的方法及装置 |
CN105792005B (zh) * | 2014-12-22 | 2019-05-14 | 深圳Tcl数字技术有限公司 | 录像控制的方法及装置 |
EP4220630A1 (fr) * | 2016-11-03 | 2023-08-02 | Samsung Electronics Co., Ltd. | Dispositif électronique et son procédé de commande |
CN110121696A (zh) * | 2016-11-03 | 2019-08-13 | 三星电子株式会社 | 电子设备及其控制方法 |
EP3523709A4 (fr) * | 2016-11-03 | 2019-11-06 | Samsung Electronics Co., Ltd. | Dispositif électronique et procédé de commande associé |
WO2018084576A1 (fr) | 2016-11-03 | 2018-05-11 | Samsung Electronics Co., Ltd. | Dispositif électronique et procédé de commande associé |
US20180122379A1 (en) * | 2016-11-03 | 2018-05-03 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
US10679618B2 (en) | 2016-11-03 | 2020-06-09 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
US11908465B2 (en) | 2016-11-03 | 2024-02-20 | Samsung Electronics Co., Ltd. | Electronic device and controlling method thereof |
US11521038B2 (en) | 2018-07-19 | 2022-12-06 | Samsung Electronics Co., Ltd. | Electronic apparatus and control method thereof |
US10739864B2 (en) | 2018-12-31 | 2020-08-11 | International Business Machines Corporation | Air writing to speech system using gesture and wrist angle orientation for synthesized speech modulation |
WO2021135281A1 (fr) * | 2019-12-30 | 2021-07-08 | 浪潮(北京)电子信息产业有限公司 | Procédé, appareil, dispositif et support pour la détection de point d'extrémité sur la base d'une fusion de caractéristiques multicouches |
CN115881118A (zh) * | 2022-11-04 | 2023-03-31 | 荣耀终端有限公司 | 一种语音交互方法及相关电子设备 |
CN115881118B (zh) * | 2022-11-04 | 2023-12-22 | 荣耀终端有限公司 | 一种语音交互方法及相关电子设备 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2008069519A1 (fr) | Système et procédé de reconnaissance intégrée de geste/voix | |
KR100948600B1 (ko) | 제스처/음성 융합 인식 시스템 및 방법 | |
CN107799126B (zh) | 基于有监督机器学习的语音端点检测方法及装置 | |
US11762474B2 (en) | Systems, methods and devices for gesture recognition | |
US11854550B2 (en) | Determining input for speech processing engine | |
US8793134B2 (en) | System and method for integrating gesture and sound for controlling device | |
CN109766759A (zh) | 情绪识别方法及相关产品 | |
CN112016367A (zh) | 一种情绪识别系统、方法及电子设备 | |
CN110310623A (zh) | 样本生成方法、模型训练方法、装置、介质及电子设备 | |
CN110322760B (zh) | 语音数据生成方法、装置、终端及存储介质 | |
CN106157956A (zh) | 语音识别的方法及装置 | |
JP2012014394A (ja) | ユーザ指示取得装置、ユーザ指示取得プログラムおよびテレビ受像機 | |
KR20100062207A (ko) | 화상통화 중 애니메이션 효과 제공 방법 및 장치 | |
KR102291740B1 (ko) | 영상처리 시스템 | |
CN113129867A (zh) | 语音识别模型的训练方法、语音识别方法、装置和设备 | |
CN114779922A (zh) | 教学设备的控制方法、控制设备、教学系统和存储介质 | |
CN111326152A (zh) | 语音控制方法及装置 | |
CN107452381B (zh) | 一种多媒体语音识别装置及方法 | |
CN110728993A (zh) | 一种变声识别方法及电子设备 | |
CN110286771A (zh) | 交互方法、装置、智能机器人、电子设备及存储介质 | |
KR20210066774A (ko) | 멀티모달 기반 사용자 구별 방법 및 장치 | |
CN117975957A (zh) | 一种基于人工智能的交互系统及装置 | |
CN113497912A (zh) | 通过语音和视频定位的自动取景 | |
KR20130054131A (ko) | 디스플레이장치 및 그 제어방법 | |
CN116229962A (zh) | 终端设备及语音唤醒方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 07851181 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2009540141 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 07851181 Country of ref document: EP Kind code of ref document: A1 |