WO2010126321A2 - Apparatus and method for user intention inference using multimodal information - Google Patents


Info

Publication number
WO2010126321A2
Authority
WO
WIPO (PCT)
Prior art keywords
user intention
user
intention
modal
predicted
Prior art date
Application number
PCT/KR2010/002723
Other languages
French (fr)
Korean (ko)
Other versions
WO2010126321A3 (en)
Inventor
조정미
김정수
방원철
김남훈
Original Assignee
Samsung Electronics Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020090038267A (KR101581883B1)
Priority claimed from KR1020100036031A (KR101652705B1)
Application filed by Samsung Electronics Co., Ltd.
Priority to JP2012508401A (JP5911796B2)
Priority to EP10769966.2A (EP2426598B1)
Priority to CN201080017476.6A (CN102405463B)
Publication of WO2010126321A2
Publication of WO2010126321A3

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/03 Arrangements for converting the position or the displacement of a member into a coded form
    • G06F 3/033 Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F 3/0346 Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor with detection of the device orientation or free movement in a 3D space, e.g. 3D mice, 6-DOF [six degrees of freedom] pointers using gyroscopes, accelerometers or tilt-sensors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/451 Execution arrangements for user interfaces

Definitions

  • One or more aspects relate to a system using multi-modal information, and more particularly, to an apparatus and method for processing user input using multi-modal information.
  • A multi-modal interface is an interface that uses voice, keyboard, pen, and similar modalities for communication between human and machine.
  • When multi-modal information is input through such an interface, user intention can be analyzed in two ways: by fusing the multi-modal inputs at the signal level, or by analyzing each modality input separately and then fusing the results at the semantic level.
  • Signal-level fusion merges the multi-modal input signals and analyzes and classifies them at once.
  • This approach is well suited to signals that occur simultaneously, such as a voice signal and the accompanying lip movement.
  • However, because two or more signals are processed jointly, the feature space is very large, the model for computing correlations between the signals is very complex, and the amount of training required is high.
  • It also does not scale easily when combined with other modalities or applied to other terminals.
  • Semantic-level fusion analyzes the meaning of each modality input signal and then fuses the analysis results.
  • Since independence between modalities is maintained, learning and extension are easy.
  • However, users provide multi-modal input precisely because the modalities are related, and this association is hard to find when each modality's meaning is analyzed individually.
  • An apparatus and method are provided that infer user intention efficiently and accurately by first predicting part of the user intention from motion information and then inferring the user intention from that partial prediction and multi-modal input information.
  • According to one aspect, an apparatus for inferring user intention includes a first predictor configured to predict a part of the user intention using at least one piece of motion information, and a second predictor configured to predict the user intention using the predicted part of the user intention and multi-modal information input from at least one multi-modal sensor.
  • According to another aspect, a method of inferring user intention includes receiving at least one piece of motion information, predicting a part of the user intention using the received motion information, receiving multi-modal information input from at least one multi-modal sensor, and predicting the user intention using the predicted part of the user intention and the multi-modal information.
  • According to an embodiment, a part of the user intention is predicted through user motion recognition, and the multi-modal information is then analyzed according to that partial prediction to predict the user intention secondarily; this maintains independence between modalities while making inter-modality associations easy to capture, so user intention can be inferred accurately.
  • In addition, since the start and end of a user's voice input can be predicted from motion information alone or by fusing multi-modal information such as voice or image data with the motion information, the user can input voice to the user intention inference apparatus without learning a special voice input method.
  • FIG. 1 is a diagram illustrating a configuration of a user intention inference apparatus according to an exemplary embodiment.
  • FIG. 2 is a diagram illustrating an example of a configuration of the user intention predictor of FIG. 1.
  • FIG. 3 is a diagram illustrating an exemplary operation of the user intention predictor of FIG. 2.
  • FIG. 4 is a diagram illustrating an example of an operation of predicting the user intention by receiving additional multi-modal input after a part of the user intention has been predicted.
  • FIG. 5 is a diagram illustrating another example of an operation of predicting the user intention by receiving additional multi-modal input after a part of the user intention has been predicted.
  • FIG. 6 is a diagram illustrating an example of a configuration that classifies signals by combining an acoustic signal and an image signal.
  • FIG. 7 is a diagram illustrating a method of inferring user intention using multi-modal information according to an exemplary embodiment.
  • According to one aspect, an apparatus for inferring user intention includes a first predictor configured to predict a part of the user intention using at least one piece of motion information, and a second predictor configured to predict the user intention using the predicted part of the user intention and multi-modal information input from at least one multi-modal sensor.
  • The first predictor may use the predicted part of the user intention to generate a control signal for executing an operation performed in the process of predicting the user intention.
  • The control signal for executing the operation performed in the process of predicting the user intention may be a control signal for controlling the operation of a multi-modal sensor controlled by the user intention inference apparatus.
  • To predict the user intention, the second predictor may interpret the multi-modal information input from the multi-modal sensor in association with the predicted part of the user intention.
  • If the predicted part of the user intention is selection of an object displayed on the display screen and a voice is input from the multi-modal sensor, the second predictor may predict the user intention by interpreting the input voice in association with the object selection.
  • The second predictor may predict the user intention using the multi-modal information input from the at least one multi-modal sensor within the scope of the predicted part of the user intention.
  • When the predicted part of the user intention is an operation of bringing a microphone to the mouth, the second predictor may sense an acoustic signal, extract and analyze features from the sensed signal, and predict the user intention.
  • The second predictor may determine whether a voice section is detected in the acoustic signal and, when a voice section is detected, predict the user intention as a voice command.
  • When a breath sound is detected in the acoustic signal, the second predictor may predict the user intention as blowing.
  • When the predicted part of the user intention is selection of an object displayed on the display screen, the second predictor may use the multi-modal information to predict the user intention as at least one of deleting, classifying, and sorting the selected object.
  • The apparatus may further include a user intention applying unit configured to use the user intention prediction result to control software or hardware controlled by the user intention inference apparatus.
  • According to another aspect, a method of inferring user intention includes receiving at least one piece of motion information, predicting a part of the user intention using the received motion information, receiving multi-modal information input from at least one multi-modal sensor, and predicting the user intention using the predicted part of the user intention and the multi-modal information.
  • FIG. 1 is a diagram illustrating a configuration of a user intention inference apparatus according to an exemplary embodiment.
  • Referring to FIG. 1, the user intention inference apparatus 100 includes a motion sensor 110, a controller 120, and a multi-modal sensing unit 130.
  • The user intention inference apparatus 100 may be implemented as any type of device or system, such as a cellular phone, a personal digital assistant (PDA), a digital camera, a portable game console, an MP3 player, a portable/personal multimedia player (PMP), a handheld e-book reader, a portable laptop PC, a global positioning system (GPS) navigation device, a desktop PC, a high definition television (HDTV), an optical disc player, or a set-top box.
  • Depending on the implementation, the user intention inference apparatus 100 may further include components for a multi-modal interface, such as a user interface unit, a display unit, and a sound output unit.
  • To sense motion information, the motion sensor 110 may include an inertial sensor, a geomagnetic sensor for sensing direction, an acceleration sensor or a gyro sensor for sensing movement, and the like.
  • Besides these, the motion sensor 110 may also include an image sensor, an acoustic sensor, and the like.
  • According to an embodiment, a plurality of motion sensors may be attached to parts of the user's body and to the user intention inference apparatus 100 to sense motion information.
  • the multi-modal sensing unit 130 may include at least one multi-modal sensor 132, 134, 136, and 138.
  • The acoustic sensor 132 senses acoustic signals, the image sensor 134 senses image information, the biometric information sensor 136 senses biometric information such as body temperature, and the touch sensor 138 senses touch gestures on a touch pad.
  • Various other kinds or types of multi-modal sensors may also be included.
  • Although FIG. 1 shows four sensors in the multi-modal sensing unit 130, the number of sensors is not limited thereto.
  • The kinds and range of sensors included in the multi-modal sensing unit 130 may be broader than those of the motion sensor 110, whose purpose is motion sensing.
  • Although the motion sensor 110 and the multi-modal sensing unit 130 are shown as separate in FIG. 1, they may be integrated.
  • Alternatively, the motion sensor 110 and the multi-modal sensing unit 130 may redundantly include the same kinds of sensors, for example an image sensor and an acoustic sensor.
  • The multi-modal sensing unit 130 may include modules that extract feature values from the multi-modal information sensed by each multi-modal sensor 132, 134, 136, 138 according to its type and analyze its meaning.
  • Components for analyzing the multi-modal information may be included in the controller 120.
  • The controller 120 may include applications, data, and an operating system for controlling the operation of each component of the user intention inference apparatus 100.
  • According to an embodiment, the controller 120 includes a user intention predictor 122 and a user intention applying unit 124.
  • The user intention predictor 122 receives at least one piece of motion information sensed by the motion sensor 110 and primarily predicts a part of the user intention using the received motion information.
  • The user intention predictor 122 may then secondarily predict the user intention using the predicted part of the user intention and multi-modal information input from at least one multi-modal sensor. That is, in the secondary prediction, the user intention predictor 122 uses the motion information sensed by the motion sensor 110 together with the multi-modal information input from the multi-modal sensing unit 130 to finally predict the user intention.
  • The user intention predictor 122 may use various known inference models to infer the user intention; one possible shape of this two-stage structure is sketched below.
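As an illustration only, the two-stage structure described above might be organized as follows. This is a minimal sketch assuming simple rule-based models; all class, method, and field names are hypothetical and do not come from the patent.

```python
# Hypothetical sketch of the two-stage user-intention prediction.
# All names and thresholds are illustrative assumptions.

class FirstPredictor:
    """Primarily predicts a part of the user intention from motion info."""
    def predict_partial(self, motion_info):
        # e.g. a distance/orientation rule; any known inference model works
        if motion_info["mic_to_mouth_cm"] < 20 and motion_info["mic_faces_mouth"]:
            return "bring_mic_to_mouth"
        return "unknown"

class SecondPredictor:
    """Secondarily predicts the user intention within the narrowed scope."""
    def predict(self, partial_intention, multimodal_info):
        if partial_intention == "bring_mic_to_mouth":
            if multimodal_info.get("voice_detected"):
                return "voice_command"
            if multimodal_info.get("breath_detected"):
                return "blow"
        return partial_intention  # stay within the narrowed scope
```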
  • In addition, the user intention predictor 122 may use the primarily predicted part of the user intention to generate a control signal for executing an operation performed in the secondary prediction process.
  • The control signal for executing the operation performed in the user intention inference process may be a control signal for controlling the operation of the multi-modal sensing unit 130, which is controlled by the user intention inference apparatus 100.
  • For example, based on the part of the user intention primarily predicted from the motion information, only those sensors of the multi-modal sensing unit 130 associated with that partial prediction may be activated.
  • In this case, the power consumed by sensor operation is reduced compared with activating all sensors of the multi-modal sensing unit 130.
  • In addition, since only the sensing information from those sensors is analyzed, interpretation of the multi-modal input information is simplified, reducing the complexity of the user intention prediction process while still allowing accurate inference; a sketch of this selective activation follows.
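A minimal sketch of the selective activation idea, assuming a hypothetical mapping from each partial intention to the sensors it needs; the sensor names and the mapping itself are illustrative assumptions.

```python
# Hypothetical mapping from a primarily predicted partial intention
# to the multi-modal sensors associated with it.
SENSORS_FOR_PARTIAL_INTENTION = {
    "bring_mic_to_mouth": {"acoustic", "image"},    # microphone + camera
    "select_object":      {"image", "ultrasonic"},  # camera + ultrasonic
}

def control_signals(partial_intention, all_sensors):
    """Return an activate/deactivate control signal for each sensor."""
    wanted = SENSORS_FOR_PARTIAL_INTENTION.get(partial_intention, set())
    return {sensor: (sensor in wanted) for sensor in all_sensors}

# Only the microphone and camera are switched on; the rest stay off,
# saving the power that activating every sensor would consume.
signals = control_signals("bring_mic_to_mouth",
                          ["acoustic", "image", "biometric", "touch"])
```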
  • The user intention predictor 122 may include a module (not shown) that extracts and analyzes features according to the type of multi-modal information in order to predict the user intention secondarily.
  • The user intention predictor 122 may also interpret the multi-modal information input from the multi-modal sensing unit 130 in association with the primarily predicted part of the user intention.
  • For example, when the primarily predicted part of the user intention is determined to be selection of an object displayed on the display screen and a voice is then input from the multi-modal sensing unit 130, the user intention predictor 122 may secondarily predict the user intention by interpreting the input voice in conjunction with the object selection. Specifically, if the acoustic signal input through the multi-modal sensing unit 130 is analyzed as "organize by date", the user intention predictor 122 may interpret the user intention as "sort the objects selected on the display screen in date order".
  • Also, when the primarily predicted part of the user intention is selection of an object displayed on the display screen, the user intention predictor 122 may use the multi-modal information to predict the secondary user intention as at least one of deleting, classifying, and sorting, as in the sketch below.
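For illustration, resolving a recognized utterance within the scope of an object selection could look like the following sketch; the command strings, action names, and file names are hypothetical.

```python
# Hypothetical second-stage interpretation: the recognized utterance is
# resolved against the narrowed scope "objects selected on screen".
ACTIONS = {
    "organize by date": "sort_by_date",
    "delete": "delete",
    "classify": "classify",
}

def interpret(selected_objects, recognized_text):
    action = ACTIONS.get(recognized_text)
    if action is None:
        return None  # outside the narrowed scope; no intention inferred
    return {"action": action, "targets": selected_objects}

# "organize by date" + selected photos -> sort those photos in date order
intent = interpret(["photo1.jpg", "photo2.jpg"], "organize by date")
```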
  • The user intention applying unit 124 may use the user intention prediction result to control software or hardware controlled by the user intention inference apparatus.
  • The user intention applying unit 124 may provide a multi-modal interface for interacting with the predicted user intention. For example, if the user intention is predicted to be a voice command, voice recognition may be performed to determine the meaning of the command, and an application that automatically places a call to a specific person, or a search application, may be executed based on the recognition result; if the intention is to transfer an object selected by the user, an e-mail application may be executed. As another example, when the user intention is predicted to be humming, an application that searches for music similar to the hummed source may be driven. As yet another example, when the user intention is predicted to be blowing, it may serve as a command that makes an avatar perform a specific action in a game application.
  • In this way, multi-modal information is interpreted in relation to the primarily predicted part of the user intention while the modalities retain their independence, so inter-modality associations are easy to capture and the user intention can be inferred accurately.
  • FIG. 2 is a diagram illustrating an example of a configuration of the user intention predictor of FIG. 1.
  • the user intention predictor 122 may include a motion information analyzer 210, a first predictor 220, and a second predictor 230.
  • The motion information analyzer 210 analyzes one or more pieces of motion information received from the motion sensor 110.
  • The motion information analyzer 210 may measure position and angle information of each part of the user's body to which a motion sensor 110 is attached, and may also use the measured position and angle information to calculate position and angle information of body parts to which no motion sensor is attached.
  • For example, when motion sensors are attached to both wrists and the head, the distances between the sensors can be measured, and each sensor can provide three-dimensional rotation angle information relative to a reference coordinate system. The distance between the wrist and the head and the rotation angle of the wrist can therefore be calculated from the motion information, yielding the distance between the wrist and the mouth area of the face and the wrist rotation angle. Assuming the user is holding in the hand a microphone corresponding to the acoustic sensor 132 of the user intention inference apparatus 100, the distance between the microphone and the mouth and the direction of the microphone can be calculated.
  • As another example, when the motion sensor 110 is mounted on the user's head and on the microphone corresponding to the acoustic sensor, the distance between the microphone and the head can be measured from the motion information and three-dimensional angle information of the sensor axis can be obtained from an inertial sensor attached to the microphone, so the motion information analyzer 210 may calculate the distance between the microphone and the mouth area of the face and the rotation angle information of the microphone; the underlying vector arithmetic is sketched below.
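The geometry in these examples reduces to vector arithmetic. A sketch, assuming each motion sensor reports a 3-D position in the reference frame and the wrist sensor reports a rotation matrix; the choice of the local microphone axis is an assumption.

```python
import numpy as np

def mic_mouth_distance_and_direction(mouth_pos, wrist_pos, wrist_rotation):
    """mouth_pos, wrist_pos: 3-D positions in the reference frame.
    wrist_rotation: 3x3 rotation matrix reported by the wrist sensor.
    Returns the wrist-to-mouth distance and the cosine of the angle
    between the microphone axis and the wrist-to-mouth direction."""
    to_mouth = mouth_pos - wrist_pos
    distance = float(np.linalg.norm(to_mouth))
    # Assume the microphone points along the wrist sensor's local x-axis.
    mic_axis = wrist_rotation @ np.array([1.0, 0.0, 0.0])
    cos_angle = float(mic_axis @ to_mouth) / distance
    return distance, cos_angle
```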
  • As yet another example, the motion sensor 110 may include an image sensor that inputs image information to the motion information analyzer 210.
  • In this case, the motion information analyzer 210 may recognize objects such as a face or hands in the image and calculate positional relationships between the objects, for example the distance and angle between the face and the two hands, or the distance and angle between the two hands.
  • The first predictor 220 predicts a part of the user intention using the result of the motion information analysis. For example, through analysis of motion information including images, the first predictor 220 may primarily predict whether the motion is selecting an object on the screen.
  • The second predictor 230 predicts the user intention using the part of the user intention predicted by the first predictor 220 and the multi-modal information input from the multi-modal sensing unit 130.
  • To do so, the second predictor 230 may interpret the multi-modal information input from the multi-modal sensor in association with the primarily predicted part of the user intention. For example, when the primarily predicted part is selection of an object displayed on the display screen and a voice is received from the multi-modal sensing unit 130, the second predictor 230 may secondarily predict the user intention by interpreting the input voice in relation to the object selection.
  • As another example, suppose the first predictor 220 primarily predicts that the partial user intention is bringing the microphone to the mouth.
  • If the multi-modal sensing unit 130 then detects lip movement through the image sensor 134 such as a camera while a voice is sensed in the acoustic signal from the microphone,
  • the second predictor 230 may predict the user intention as voice command input.
  • In this case, the second predictor 230 may detect a voice section in the acoustic signal and perform feature extraction and analysis on the detected section, and the result may be made available to the user intention applying unit 124 for semantic analysis.
  • Likewise, suppose the first predictor 220 primarily predicts that the partial user intention is bringing the microphone to the mouth, and the multi-modal sensing unit 130 detects through the image sensor 134 such as a camera an image of the lips protruding forward while a breath sound is sensed by the microphone.
  • In that case, the second predictor 230 may predict the user intention as blowing.
  • Here the two user intentions, "bring the microphone to the mouth and input a voice command" and "bring the microphone to the mouth and blow", are different.
  • The two intentions share the common part "bring the microphone to the mouth", and the first predictor 220 predicts this common part first, narrowing the scope of possible user intentions.
  • Within this narrowed scope, the second predictor 230 may then predict the user intention in consideration of the multi-modal information.
  • In other words, the second predictor 230 may determine whether the user intention is "voice command input" or "blowing" in consideration of the sensed multi-modal information.
  • FIG. 3 is a diagram illustrating an exemplary operation of the user intention predictor of FIG. 2.
  • Referring to FIG. 3, the first predictor 220 may predict a part of the user intention using the motion information analyzed by the motion information analyzer 210.
  • The second predictor 230 receives multi-modal signals, such as an image sensed by the image sensor 134 of the multi-modal sensing unit 130 or an acoustic signal sensed by the acoustic sensor 132, and generates information about whether a voice is detected, which is used to predict the user intention.
  • In detail, the motion information analyzer 210 calculates the distance between the user's mouth and the hand holding the microphone using motion information sensed by motion sensors mounted on the user's head and wrist (310).
  • The motion information analyzer 210 also calculates the direction of the microphone from the rotation angle of the wrist (320).
  • The first predictor 220 predicts a part of the user intention by determining, from the distance and direction information calculated by the motion information analyzer 210, whether the user is moving the microphone toward the mouth (330). For example, when the hand holding the microphone is within a 20 cm radius of the mouth and the microphone is oriented toward the mouth, the first predictor 220 may predict that the user intends to bring the microphone to the mouth, as in the sketch below.
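Continuing the geometry sketch above, the primary prediction at step 330 might then reduce to a threshold test like the one below; the 20 cm radius comes from the example in the text, while the orientation threshold is an assumption.

```python
def predicts_mic_to_mouth(distance_cm, cos_angle,
                          max_dist_cm=20.0, min_cos=0.7):
    """Primary prediction (step 330): the user is bringing the microphone
    to the mouth if the hand holding it is within 20 cm of the mouth and
    the microphone axis points roughly toward the mouth (cosine threshold
    assumed for illustration)."""
    return distance_cm <= max_dist_cm and cos_angle >= min_cos
```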
  • The second predictor 230 then analyzes the multi-modal input signals from the acoustic sensor 132 such as a microphone and the image sensor 134 such as a camera, and predicts the user intention, for example as a voice command or as an intention such as humming or blowing.
  • When the primary prediction is that the microphone is being brought to the mouth, lip movement is detected by the camera, and a voice is detected in the acoustic signal sensed by the microphone,
  • the second predictor 230 may determine that the user intention is a voice command (340).
  • When the primary prediction is that the microphone is being brought to the mouth, an image of the lips protruding forward is detected by the camera, and a breath sound is detected in the acoustic signal input from the microphone, the second predictor 230 may determine that the user intention is blowing (350). This decision logic is sketched below.
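The two determinations at 340 and 350 can be read as a small decision table over the camera and microphone analyses. A sketch, with the four boolean flags assumed to be produced by the image and acoustic analyzers:

```python
def second_stage_decision(lips_moving, lips_protruded,
                          voice_detected, breath_detected):
    """Decision table for FIG. 3, given the primary prediction
    'bring the microphone to the mouth' (inputs are assumed flags)."""
    if lips_moving and voice_detected:
        return "voice_command"  # step 340
    if lips_protruded and breath_detected:
        return "blow"           # step 350
    return "undetermined"       # fall through; keep sensing
```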
  • FIG. 4 is a diagram illustrating an example of an operation of predicting a user's intention by receiving an additional multimodal input after a part of the user's intention is predicted.
  • Referring to FIG. 4, when a part of the user intention has been primarily predicted, the second predictor 230 activates sensors of the multi-modal sensing unit 130, such as the microphone and the camera, so that multi-modal signals are input (420).
  • The second predictor 230 extracts features from the acoustic signal input from the microphone and the image signal input from the camera, and classifies and analyzes the features (430).
  • Time-domain acoustic features such as time energy, frequency energy, zero crossing rate, linear predictive coding (LPC) coefficients, cepstral coefficients, and pitch, or statistical features such as the frequency spectrum, may be extracted.
  • The extractable features are not limited to these; other feature algorithms may be used.
  • The extracted features may be classified into a speech class or a non-speech class using classification and learning algorithms such as a decision tree, support vector machine, Bayesian network, or neural network, though classification is not limited to these algorithms; a feature-extraction sketch follows.
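As a rough illustration of step 430, a few of the listed time-domain features can be computed directly from an audio frame and fed to one of the named classifiers. The frame length, the feature subset, and the use of scikit-learn's SVC are assumptions for the sketch, not part of the patent.

```python
import numpy as np
from sklearn.svm import SVC  # support vector machine, one option named above

def frame_features(frame):
    """A small subset of the listed features for one audio frame:
    time energy, zero crossing rate, and a coarse low-band energy."""
    energy = float(np.sum(frame ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    spectrum = np.abs(np.fft.rfft(frame))
    low_band_energy = float(np.sum(spectrum[: len(spectrum) // 4] ** 2))
    return [energy, zcr, low_band_energy]

# Given labeled frames (speech = 1, non-speech = 0), train and classify:
# X = [frame_features(f) for f in frames]; y = labels
clf = SVC()
# clf.fit(X, y)
# clf.predict([frame_features(new_frame)])
```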
  • When a voice section is detected as a result of the feature analysis, the second predictor 230 may predict the user intention as voice command input (440); when no voice section is detected but a breath sound is detected, it may predict the user intention as blowing (450). Likewise, as other kinds of features are detected, the user intention may be determined in various ways, for example as humming. In each case, the second predictor 230 predicts the user intention within the scope narrowed by the primary prediction.
  • As described above, the user intention may be predicted using the user's multi-modal information, and the execution of voice detection may be controlled according to the prediction result.
  • Therefore, voice can be input intuitively, without learning a separate input method such as pressing a button or touching the screen.
  • The second predictor 230 may use at least one piece of sensing information, such as image information input from the image sensor 134 such as a camera, or information input from the biometric information sensor 136 such as a throat microphone indicating that the user is uttering speech,
  • together with the feature information extracted from the acoustic signal, to detect a voice section and process the speech in the detected section.
  • The sensing information may include at least one of image information indicating a change in the shape of the user's mouth, temperature information that changes due to breathing during utterance, vibration information from a body part such as the throat or jaw that vibrates during utterance, and infrared information detected from the face or mouth during utterance.
  • The user intention applying unit 124 may perform voice recognition by processing the voice signal belonging to the detected voice section, and switch application modules based on the recognition result. For example, intelligent switching of voice input start and end can be performed: when a name is recognized, a phone number for the recognized name may be searched, or a call may be placed to the retrieved number.
  • In addition, when applied to voice calls, the start and end of the call can be grasped automatically from the multi-modal information, so the operation mode can be switched to voice call mode even if the user does not perform a separate operation such as pressing a call button.
  • FIG. 5 illustrates another example of an operation of predicting a user's intention by receiving an additional multimodal input after a part of the user's intention is predicted.
  • Referring to FIG. 5, when the primarily predicted part of the user intention received from the first predictor 220 is selection of a specific object (460), the second predictor 230 activates sensors such as a camera and an ultrasonic sensor and receives multi-modal input (470).
  • The second predictor 230 analyzes the input multi-modal signals to predict the user intention (480).
  • The predicted user intention falls within the scope defined by the primary prediction.
  • For example, the second predictor 230 may determine from the multi-modal signal analysis that the user is waving a hand.
  • In this case, the second predictor 230 may interpret the waving motion, according to the application being executed by the user intention applying unit 124, as an intention to delete a specific item or file shown on the screen, and the user intention applying unit 124 may be controlled to delete that item or file.
  • FIG. 6 is a diagram illustrating an example of feature-based signal classification in which the second predictor 230 performs integrated analysis using an acoustic signal and an image signal together.
  • The second predictor 230 may include an acoustic feature extractor 510, an acoustic feature analyzer 520, an image feature extractor 530, an image feature analyzer 540, and an integrated analyzer 550.
  • The acoustic feature extractor 510 extracts acoustic features from the acoustic signal.
  • The acoustic feature analyzer 520 detects a voice section by applying classification and learning algorithms to the acoustic features.
  • The image feature extractor 530 extracts image features from a series of image signals.
  • The image feature analyzer 540 detects a voice section by applying classification and learning algorithms to the extracted image features.
  • The integrated analyzer 550 fuses the results classified separately from the acoustic signal and the image signal and finally detects the voice section.
  • In this process, the acoustic features and the image features may be applied individually, or the two kinds of features may be fused and applied.
  • When other sensing information is also extracted, the integrated analyzer 550 may detect the voice section by fusing it with the detection information extracted from the acoustic signal and the image signal, as in the sketch below.
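A minimal late-fusion sketch in the spirit of FIG. 6: the acoustic path (510/520) and the image path (530/540) each yield a per-frame speech score, and the integrated analyzer 550 combines them. The weights and threshold are assumptions.

```python
def fuse_voice_detection(audio_score, image_score,
                         w_audio=0.6, w_image=0.4, threshold=0.5):
    """Integrated analyzer (sketch): fuse per-frame speech scores from
    the acoustic and image analysis paths into one voice-section decision.
    Scores are assumed to lie in [0, 1]; weights are illustrative."""
    return w_audio * audio_score + w_image * image_score >= threshold

# Example: weak audio evidence (noisy room) rescued by clear lip movement.
is_speech = fuse_voice_detection(audio_score=0.4, image_score=0.9)
```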
  • Accordingly, when using the voice interface, the user can input voice intuitively without separately learning a voice input method; for example, no separate button press or screen touch is needed for voice input.
  • Also, the voice section can be detected accurately even in environments with noise such as household noise, vehicle noise, or noise from non-speakers.
  • Since the voice may be detected using other biometric information in addition to the image, the user's voice section can be detected accurately even when the lighting is too bright or too dark or the user's mouth is covered.
  • FIG. 7 is a diagram illustrating a user intention reasoning method using multi-modal information according to an exemplary embodiment.
  • Referring to FIG. 7, the user intention inference apparatus 100 receives motion information sensed by at least one motion sensor (610).
  • The user intention inference apparatus 100 primarily predicts a part of the user intention using the received motion information (620).
  • When multi-modal information is input from at least one multi-modal sensor, the user intention inference apparatus 100 predicts the user intention using the primarily predicted part of the user intention and the multi-modal information (640). In this secondary prediction step, the multi-modal information input from the multi-modal sensor may be interpreted in association with the primarily predicted part of the user intention.
  • The primarily predicted part of the user intention may also be used to generate a control signal for executing an operation performed in the secondary prediction process.
  • The control signal for executing the operation performed in the secondary prediction process may be a control signal for controlling the operation of a multi-modal sensor controlled by the user intention inference apparatus 100.
  • The user intention may be predicted using the multi-modal information input from the at least one multi-modal sensor within the scope of the primarily predicted part of the user intention. The overall flow is sketched below.
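Tying the steps of FIG. 7 together, the method might be wired as below, reusing the hypothetical two-stage predictor classes sketched earlier; the sensor-reading interfaces are assumptions.

```python
def infer_user_intention(motion_sensor, multimodal_unit,
                         first_predictor, second_predictor):
    """End-to-end sketch of the FIG. 7 flow (all objects hypothetical)."""
    motion_info = motion_sensor.read()                         # step 610
    partial = first_predictor.predict_partial(motion_info)     # step 620
    multimodal_info = multimodal_unit.read()                   # multi-modal input
    return second_predictor.predict(partial, multimodal_info)  # step 640
```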
  • One aspect of the invention may be embodied as computer-readable code on a computer-readable recording medium. The codes and code segments implementing the program can be easily inferred by computer programmers skilled in the art.
  • Computer-readable recording media include all kinds of recording devices that store data that can be read by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, and the like.
  • the computer readable recording medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
  • the invention is industrially applicable in the fields of computers, electronics, computer software and information technology.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed are an apparatus and a method for user intention inference using multimodal information. The apparatus for user intention inference according to one aspect of the present invention comprises: a first prediction unit which predicts a portion of user intention using at least one piece of motion information; and a second prediction unit which predicts user intention using the portion of user intention predicted by the first prediction unit and multimodal information input by at least one multimodal sensor.

Description

User Intention Inference Apparatus and Method Using Multi-modal Information
One or more aspects relate to a system using multi-modal information, and more particularly, to an apparatus and method for processing user input using multi-modal information.
A multi-modal interface is an interface that uses voice, keyboard, pen, and similar modalities for communication between human and machine. When multi-modal information is input through such an interface, user intention can be analyzed in two ways: by fusing the multi-modal inputs at the signal level, or by analyzing each modality input separately and then fusing the results at the semantic level.
Signal-level fusion merges the multi-modal input signals and analyzes and classifies them at once. It is well suited to signals that occur simultaneously, such as a voice signal and lip movement. However, because two or more signals are integrated and processed together, the feature space is very large, the model for computing correlations between the signals is very complex, and the amount of training required is high. It also does not extend easily to other modalities or other terminals.
Semantic-level fusion analyzes the meaning of each modality input signal and then fuses the analysis results. Since independence between modalities is maintained, learning and extension are easy. However, users provide multi-modal input precisely because the modalities are related, and this association is hard to find when each modality's meaning is analyzed individually.
An apparatus and method are provided that can infer user intention efficiently and accurately by predicting part of the user intention from motion information and then inferring the user intention from that partial prediction and multi-modal input information.
According to one aspect, an apparatus for inferring user intention includes a first predictor configured to predict a part of the user intention using at least one piece of motion information, and a second predictor configured to predict the user intention using the predicted part of the user intention and multi-modal information input from at least one multi-modal sensor.
According to another aspect, a method of inferring user intention includes receiving at least one piece of motion information, predicting a part of the user intention using the received motion information, receiving multi-modal information input from at least one multi-modal sensor, and predicting the user intention using the predicted part of the user intention and the multi-modal information.
According to an embodiment, a part of the user intention is predicted through user motion recognition, and the multi-modal information is analyzed according to that partial prediction to predict the user intention secondarily; this maintains independence between modalities while making inter-modality associations easy to capture, so user intention can be inferred accurately.
In addition, since the start and end of a user's voice input can be predicted from motion information alone or by fusing multi-modal information such as voice or image data with the motion information, the user can input voice to the user intention inference apparatus without learning a special voice input method.
FIG. 1 is a diagram illustrating a configuration of a user intention inference apparatus according to an exemplary embodiment.
FIG. 2 is a diagram illustrating an example of a configuration of the user intention predictor of FIG. 1.
FIG. 3 is a diagram illustrating an exemplary operation of the user intention predictor of FIG. 2.
FIG. 4 is a diagram illustrating an example of an operation of predicting the user intention by receiving additional multi-modal input after a part of the user intention has been predicted.
FIG. 5 is a diagram illustrating another example of an operation of predicting the user intention by receiving additional multi-modal input after a part of the user intention has been predicted.
FIG. 6 is a diagram illustrating an example of a configuration that classifies signals by combining an acoustic signal and an image signal.
FIG. 7 is a diagram illustrating a method of inferring user intention using multi-modal information according to an exemplary embodiment.
According to one aspect, an apparatus for inferring user intention includes a first predictor configured to predict a part of the user intention using at least one piece of motion information, and a second predictor configured to predict the user intention using the predicted part of the user intention and multi-modal information input from at least one multi-modal sensor.
The first predictor may use the predicted part of the user intention to generate a control signal for executing an operation performed in the process of predicting the user intention.
The control signal for executing the operation performed in the process of predicting the user intention may be a control signal for controlling the operation of a multi-modal sensor controlled by the user intention inference apparatus.
To predict the user intention, the second predictor may interpret the multi-modal information input from the multi-modal sensor in association with the predicted part of the user intention.
If the predicted part of the user intention is selection of an object displayed on the display screen and a voice is input from the multi-modal sensor, the second predictor may predict the user intention by interpreting the input voice in association with the object selection.
The second predictor may predict the user intention using the multi-modal information input from the at least one multi-modal sensor within the scope of the predicted part of the user intention.
When the predicted part of the user intention is an operation of bringing a microphone to the mouth, the second predictor may sense an acoustic signal, extract and analyze features from the sensed signal, and predict the user intention.
The second predictor may determine whether a voice section is detected in the acoustic signal and, when a voice section is detected, predict the user intention as a voice command.
When a breath sound is detected in the acoustic signal, the second predictor may predict the user intention as blowing.
When the predicted part of the user intention is selection of an object displayed on the display screen, the second predictor may use the multi-modal information to predict the user intention as at least one of deleting, classifying, and sorting the selected object.
The apparatus may further include a user intention applying unit configured to use the user intention prediction result to control software or hardware controlled by the user intention inference apparatus.
According to another aspect, a method of inferring user intention includes receiving at least one piece of motion information, predicting a part of the user intention using the received motion information, receiving multi-modal information input from at least one multi-modal sensor, and predicting the user intention using the predicted part of the user intention and the multi-modal information.
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In describing the various embodiments, detailed descriptions of well-known functions or constructions are omitted where they would unnecessarily obscure the subject matter of the invention.
FIG. 1 is a diagram illustrating a configuration of a user intention inference apparatus according to an exemplary embodiment.
The user intention inference apparatus 100 includes a motion sensor 110, a controller 120, and a multi-modal sensing unit 130. The user intention inference apparatus 100 may be implemented as any type of device or system, such as a cellular phone, a personal digital assistant (PDA), a digital camera, a portable game console, an MP3 player, a portable/personal multimedia player (PMP), a handheld e-book reader, a portable laptop PC, a global positioning system (GPS) navigation device, a desktop PC, a high definition television (HDTV), an optical disc player, or a set-top box. Depending on the implementation, the user intention inference apparatus 100 may further include various components, such as components for a multi-modal interface including a user interface unit, a display unit, and a sound output unit.
To sense motion information, the motion sensor 110 may include an inertial sensor, a geomagnetic sensor for sensing direction, an acceleration sensor or a gyro sensor for sensing movement, and the like. Besides the sensors listed above, the motion sensor 110 may also include an image sensor, an acoustic sensor, and the like. According to an embodiment, a plurality of motion sensors may be attached to parts of the user's body and to the user intention inference apparatus 100 to sense motion information.
The multi-modal sensing unit 130 may include at least one multi-modal sensor 132, 134, 136, 138. The acoustic sensor 132 senses acoustic signals, the image sensor 134 senses image information, the biometric information sensor 136 senses biometric information such as body temperature, and the touch sensor 138 senses touch gestures on a touch pad; various other kinds or types of multi-modal sensors may also be included.
Although FIG. 1 shows four sensors in the multi-modal sensing unit 130, the number of sensors is not limited thereto. The kinds and range of sensors included in the multi-modal sensing unit 130 may be broader than those included in the motion sensor 110, whose purpose is motion sensing. Also, although the motion sensor 110 and the multi-modal sensing unit 130 are shown as separate in FIG. 1, they may be integrated. Alternatively, the motion sensor 110 and the multi-modal sensing unit 130 may redundantly include the same kinds of sensors, for example an image sensor and an acoustic sensor.
The multi-modal sensing unit 130 may include modules that extract feature values from the multi-modal information sensed by each multi-modal sensor 132, 134, 136, 138 according to its type and analyze its meaning. The components that analyze the multi-modal information may instead be included in the controller 120.
The controller 120 may include applications, data, and an operating system for controlling the operation of each component of the user intention inference apparatus 100. According to an embodiment, the controller 120 includes a user intention predictor 122 and a user intention applying unit 124.
사용자 의도 예측부(122)는 모션 센서(110)로부터 감지된 적어도 하나의 모션 정보를 수신하고, 수신된 모션 정보를 이용하여 1차적으로 사용자 의도의 일부분을 예측한다. 또한, 사용자 의도 예측부(122)는 예측된 사용자 의도의 일부분 및 적어도 하나의 멀티 모달 센서로부터 입력된 멀티 모달 정보를 이용하여 2차적으로 사용자 의도를 예측할 수 있다. 즉, 사용자 의도 예측부(122)는 2차적으로 사용자 의도를 예측할 때 모션 센서(110)로부터 감지된 모션 정보 및 멀티 모달 감지부(130)로부터 입력된 멀티 모달 정보를 이용하여 최종적으로 사용자 의도를 예측할 수 있다. 사용자 의도 예측부(122)는 사용자의 의도를 추론하기 위한 알려진 여러 가지 추론 모델을 이용할 수 있다. The user intention predictor 122 receives at least one motion information detected from the motion sensor 110, and primarily predicts a part of the user intention using the received motion information. In addition, the user intention predictor 122 may secondarily predict the user intention using a part of the predicted user intention and the multi-modal information input from the at least one multi-modal sensor. That is, when the user intention predictor 122 secondarily predicts the user intention, the user intention predictor 122 finally uses the motion information detected from the motion sensor 110 and the multi-modal information input from the multi-modal sensing unit 130 to finally determine the user intention. It can be predicted. The user intention predictor 122 may use various known inference models for inferring the user's intention.
또한, 사용자 의도 예측부(122)는 1차적으로 예측된 사용자 의도의 일부분을 이용하여 2차적으로 사용자 의도를 예측하는 과정에서 수행되는 동작을 실행시키기 위한 제어 신호를 생성할 수 있다. 사용자 의도 추론 과정에서 수행되는 동작을 실행시키기 위한 제어 신호는 사용자 의도 추론 장치(100)에 의해 제어되는 멀티 모달 감지부(130)의 동작을 제어하는 제어 신호일 수 있다. In addition, the user intention predictor 122 may generate a control signal for executing an operation performed in the process of predicting the user intention secondary by using a part of the user intentionally predicted. The control signal for executing the operation performed in the user intention inference process may be a control signal for controlling the operation of the multi-modal sensing unit 130 controlled by the user intention inference apparatus 100.
예를 들어, 모션 정보를 이용하여 1차적으로 예측된 사용자 의도의 일부분에 기반하여 멀티 모달 감지부(130)의 센서 중 1차적으로 예측된 사용자 의도의 일부분과 연관된 일부 센서 동작을 활성화시킬 수 있으며 이 경우 멀티 모달 감지부(130)의 모든 센서를 활성화하는 경우에 비하여 센서 동작에 사용하는 전력 소모를 감소시킬 수 있다. 또한, 일부 센서로부터 입력되는 감지 정보를 분석하게 되므로, 멀티 모달 입력 정보의 해석을 단순화하여 사용자 의도 예측 과정의 복잡도를 감소시키면서도 정확한 사용자 의도를 추론할 수 있다. For example, the motion information may be used to activate some sensor operations associated with a part of the first predicted user intention among the sensors of the multi-modal sensing unit 130 based on the part of the first predicted user intention. In this case, power consumption used for the sensor operation may be reduced as compared with the case of activating all the sensors of the multi-modal sensing unit 130. In addition, since the detection information input from some sensors is analyzed, accurate user intention can be inferred while simplifying the interpretation of the multi-modal input information while reducing the complexity of the user intention prediction process.
사용자 의도 예측부(122)는 2차적으로 사용자 의도를 예측하기 위하여 멀티 모달 정보의 종류에 따라 특징을 추출하고 분석하는 모듈(도시되지 않음)을 포함하여 구성될 수 있다. 또한, 사용자 의도 예측부(122)는 멀티 모달 감지부(130)로부터 입력되는 멀티 모달 정보를 1차적으로 예측된 사용자 의도의 일부분과 연관되도록 해석할 수 있다. The user intention predictor 122 may include a module (not shown) that extracts and analyzes features according to types of multi-modal information in order to predict user intention secondarily. In addition, the user intention predictor 122 may interpret the multi-modal information input from the multi-modal sensing unit 130 to be associated with a part of the user's intention predicted primarily.
예를 들어, 사용자 의도 예측부(122)에서 1차적으로 예측된 사용자 의도의 일부분이 디스플레이 화면에 표시된 오브젝트의 선택으로 결정되는 경우, 멀티 모달 감지부(130)로부터 음성이 입력되면, 입력된 음성을 오브젝트 선택과 연관하여 해석함으로써 2차적으로 사용자 의도를 예측할 수 있다. 구체적으로, 1차로 예측된 사용자 의도의 일부분이 디스플레이 화면에 표시된 오브젝트의 선택으로 결정되고, 멀티 모달 감지부(130)에서 입력된 음향 신호가 "날짜별로 정리"라고 분석된 경우, 사용자 의도 예측부(122)는 사용자 의도를 "디스플레이 화면에서 선택된 오브젝트를 날짜 순서대로 정렬"하라는 의미로 해석할 수 있다. For example, when a part of the user intention primarily predicted by the user intention predictor 122 is determined by selection of an object displayed on the display screen, when the voice is input from the multi-modal sensing unit 130, the input voice is input. Can be secondarily predicted by interpreting in conjunction with object selection. In detail, when a part of the first intention predicted by the user is determined by the selection of an object displayed on the display screen, and the sound signal input by the multi-modal detection unit 130 is analyzed as “organized by date”, the user intention predictor The user's intention may be interpreted to mean "arrange the object selected on the display screen in the order of date".
또한, 사용자 의도 예측부(122)는 1차적으로 예측된 사용자 의도의 일부분이 디스플레이 화면에 표시된 오브젝트의 선택인 경우, 멀티 모달 정보를 이용하여 2차적 사용자 의도를 삭제, 분류 및 정렬 중 적어도 하나로 예측할 수 있다. In addition, when a part of the first predicted user intention is a selection of an object displayed on the display screen, the user intention predictor 122 may predict the secondary user intention as at least one of deleting, classifying, and sorting using the multi-modal information. Can be.
사용자 의도 적용부(124)는 사용자 의도 예측 결과를 이용하여 사용자 의도 추론 장치에서 제어되는 소프트웨어 또는 하드웨어를 제어할 수 있다. 사용자 의도 적용부(124)는 예측된 사용자 의도에 인터랙션하기 위한 멀티 모달 인터페이스를 제공할 수 있다. 예를 들어, 사용자의 의도가 음성 명령으로 예측된 경우, 음성 명령내 의미를 파악하기 위해 음성 인식을 수행하고, 인식 결과에 따라 특정 사람에 대하여 자동으로 전화를 연결하는 애플리케이션이나 검색 애플리케이션을 실행할 수 있으며, 사용자가 선택한 오브젝트를 전송하려는 의도인 경우에는 이메일 애플리케이션을 실행할 수 있다. 다른 예로, 사용자 의도가 허밍(humming)으로 예측되는 경우, 허밍 음원과 유사한 음악을 검색하는 애플리케이션이 구동될 수 있다. 또 다른 예로, 사용자 의도가 불기(blow)로 예측되는 경우, 게임 애플리케이션에서 아바타가 특정 동작을 실행하는 명령으로 이용될 수 있다. The user intention application unit 124 may control software or hardware controlled by the user intention inference apparatus using the user intention prediction result. The user intention applying unit 124 may provide a multi-modal interface for interacting with the predicted user intention. For example, if a user's intention is predicted as a voice command, you can run an application or search application that performs voice recognition to understand the meaning in the voice command and automatically connects the phone to a specific person based on the recognition result. If the intention is to transfer the object selected by the user, the email application can be executed. As another example, when the user intention is predicted to be humming, an application for searching for music similar to the humming sound source may be driven. As another example, when the user intention is predicted to be blow, the avatar may be used as a command for executing a specific action in the game application.
According to an embodiment, a part of the user intention is predicted through user motion recognition, and the multimodal information is then analyzed according to that predicted part to predict the user intention secondarily. Because the multimodal information can be interpreted in relation to the primarily predicted part of the user intention while the interpretation process itself remains independent, associations between modalities are easy to identify and the user intention can be inferred accurately.
FIG. 2 is a diagram illustrating an example of a configuration of the user intention predicting unit of FIG. 1.
The user intention predicting unit 122 may include a motion information analyzing unit 210, a primary predicting unit 220, and a secondary predicting unit 230.
The motion information analyzing unit 210 analyzes one or more pieces of motion information received from the motion sensors 110. The motion information analyzing unit 210 may measure position information and angle information for each part of the user's body to which a motion sensor 110 is attached, and may also calculate, from the measured positions and angles, position information and angle information for body parts to which no motion sensor 110 is attached.
For example, when motion sensors 110 are attached to both wrists and to the head, the distances between the sensors can be measured, and each sensor can obtain three-dimensional rotation angle information with respect to a reference coordinate system. Accordingly, the distance between the wrist and the head, together with the rotation angle of the wrist, can be computed from the motion information, yielding the distance between the wrist and the mouth area of the face and the wrist rotation angle. Assuming the user is holding in the hand a microphone corresponding to the acoustic sensor 132 of the user intention inference apparatus 100, the distance between the microphone and the mouth and the direction of the microphone can then be calculated.
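A minimal sketch of this geometry follows, assuming each sensor reports a 3-D position and a unit-quaternion orientation in a common reference frame and that the microphone points along an assumed local axis of the wrist sensor; the disclosure itself gives no formulas:

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    r = np.array([x, y, z])
    return v + 2.0 * np.cross(r, np.cross(r, v) + w * v)

def mic_geometry(mouth_pos, wrist_pos, wrist_quat, mic_axis=(0.0, 0.0, 1.0)):
    """Distance from the mouth to the hand-held microphone, and the angle
    (radians) between the microphone axis and the direction to the mouth."""
    to_mouth = np.asarray(mouth_pos, float) - np.asarray(wrist_pos, float)
    distance = float(np.linalg.norm(to_mouth))
    mic_dir = quat_rotate(wrist_quat, np.asarray(mic_axis, float))
    cos_a = np.dot(mic_dir, to_mouth) / (np.linalg.norm(mic_dir) * distance)
    return distance, float(np.arccos(np.clip(cos_a, -1.0, 1.0)))

# e.g. mouth at the head sensor, mic held 15 cm away and pointing at the mouth:
d, a = mic_geometry([0, 0, 0], [0, 0, -0.15], (1.0, 0.0, 0.0, 0.0))
print(d, a)  # -> 0.15, 0.0
```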
As another example, when motion sensors 110 are mounted on the user's head and on the microphone that serves as the acoustic sensor, the distance between the microphone and the head can be measured from the motion information, and the inertial sensor attached to the microphone provides three-dimensional angle information about the axis on which the sensor is mounted, so that the motion information analyzing unit 210 can calculate the distance between the microphone and the mouth area of the face and the rotation angle of the microphone.
As yet another example, an image sensor may be included in the motion sensors 110 and may feed image information to the motion information analyzing unit 210. In this case, the motion information analyzing unit 210 may recognize objects such as a face or hands in the images and then calculate the positional relationships between the objects. For example, the motion information analyzing unit 210 may calculate the distance and angle between the face and the two hands, the distance and angle between the two hands, and the like.
The primary predicting unit 220 predicts the part of the user intention triggered by the motion information analysis. For example, the primary predicting unit 220 may primarily predict, through analysis of motion information including images, whether the motion is one of selecting an object on the screen.
The secondary predicting unit 230 predicts the user intention by using the part of the user intention predicted by the primary predicting unit 220 and the multimodal information input from the multimodal sensing unit 130.
To predict the user intention, the secondary predicting unit 230 may interpret the multimodal information input from the multimodal sensors so that it is associated with the primarily predicted part of the user intention. As one example, when the primarily predicted part of the user intention is the selection of an object displayed on the display screen and a voice is input from the multimodal sensing unit 130, the secondary predicting unit 230 may predict the user intention secondarily by interpreting the input voice in association with the object selection.
As another example, if the primary predicting unit 220 predicts the part of the user intention to be bringing the microphone to the mouth, and the multimodal sensing unit 130 then detects mouth movement through an image sensor 134 such as a camera while a voice is input through the acoustic sensor 132 such as a microphone, the secondary predicting unit 230 may predict the user intention to be voice command input. To predict the voice command input intention, the secondary predicting unit 230 may detect a speech segment from the acoustic signal and perform semantic analysis, through feature extraction and analysis of the detected speech segment, so as to put the result into a form usable by the user intention applying unit 124.
As yet another example, if the primary predicting unit 220 primarily predicts the part of the user intention to be bringing the microphone to the mouth, and the multimodal sensing unit 130 consistently detects, through the image sensor 134 such as a camera, images of the lips protruding forward while a breath sound is input through the microphone, the secondary predicting unit 230 may predict the user intention to be blowing.
In the two examples above, the user intentions differ: "bring the microphone to the mouth and input a voice command" versus "bring the microphone to the mouth and blow." However, part of the two intentions, "bring the microphone to the mouth," is common, and the primary predicting unit 220 can narrow the range of possible user intentions by predicting this common part first. Within the range narrowed by the primary predicting unit 220, the secondary predicting unit 230 can predict the user intention in consideration of the multimodal information. Considering only these two examples, once the motion of bringing the microphone to the mouth is detected, the primary predicting unit 220 limits the range of user intentions to "voice command input" and "blowing," and the secondary predicting unit 230 can determine which of the two the user intention is from the sensed multimodal information.
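The narrowing described here can be sketched as a two-stage lookup, under assumed intention names and confidence scores:

```python
# The primary prediction selects the candidate set; the secondary prediction
# chooses within it from multimodal evidence scores (values are illustrative).
CANDIDATES = {
    "mic_to_mouth": ["voice_command", "blow", "humming"],
    "object_selection": ["delete", "classify", "sort"],
}

def predict_user_intention(primary_part, evidence):
    """evidence: candidate intention -> confidence from multimodal analysis."""
    candidates = CANDIDATES.get(primary_part, [])
    scored = {c: evidence.get(c, 0.0) for c in candidates}
    return max(scored, key=scored.get) if scored else None

# Audio analysis found a speech segment, so "voice_command" scores highest:
print(predict_user_intention("mic_to_mouth", {"voice_command": 0.9, "blow": 0.1}))
```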
FIG. 3 is a diagram illustrating an exemplary operation of the user intention predicting unit of FIG. 2.
The primary predicting unit 220 may predict the part of the user intention using the motion information analyzed by the motion information analyzing unit 210. The secondary predicting unit 230 may receive multimodal signals, such as an image sensed by the image sensor 134 of the multimodal sensing unit 130 or an acoustic signal sensed by the acoustic sensor 132, generate information on whether a voice is being detected, and predict the user intention.
In one example, the motion information analyzing unit 210 calculates the distance between the user's mouth and the hand holding the microphone, using the motion information sensed by the motion sensors mounted on the user's head and wrist (310). The motion information analyzing unit 210 calculates the direction of the microphone from the rotation angle of the wrist (320).
The primary predicting unit 220 uses the distance and direction information calculated by the motion information analyzing unit 210 to predict whether the motion is one of bringing the microphone to the mouth, thereby predicting the part of the user intention (330). For example, if the primary predicting unit 220 determines that the hand holding the microphone is positioned within a radius of 20 cm around the user's mouth and that the microphone is pointing toward the mouth, it may predict that the user is about to bring the microphone to the mouth.
In this case, the secondary predicting unit 230 can analyze the multimodal input signals from the acoustic sensor 132 such as a microphone and the image sensor 134 such as a camera, and predict the user intention, for instance as a voice command intention or as an intention such as humming or blowing.
The secondary predicting unit 230 may determine the user intention to be a voice command intention when the partial prediction, that is, the primary prediction, is that the microphone is being brought to the mouth, lip movement is detected by the camera, and a voice is detected in the acoustic signal sensed by the microphone (340). In contrast, when the primary prediction is that the microphone is being brought to the mouth, an image of the lips protruding forward is detected by the camera, and a breath sound is detected in the acoustic signal input from the microphone, the secondary predicting unit 230 may determine the user intention to be blowing (350).
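The branch taken at steps 340 and 350 reduces to a small decision rule; the boolean cues below are assumptions standing in for the actual camera and microphone analyses:

```python
def secondary_decision(lip_movement, speech_detected, lips_protruded, breath_sound):
    """Assumes the primary prediction was already 'microphone brought to mouth'."""
    if lip_movement and speech_detected:
        return "voice_command"   # corresponds to step 340
    if lips_protruded and breath_sound:
        return "blow"            # corresponds to step 350
    return "undecided"           # fall through: keep sensing

print(secondary_decision(True, True, False, False))   # -> voice_command
print(secondary_decision(False, False, True, True))   # -> blow
```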
FIG. 4 is a diagram illustrating an example of an operation of predicting the user intention by receiving additional multimodal input after the part of the user intention has been predicted.
When the predicted part of the user intention received from the primary predicting unit 220 is bringing the microphone to the mouth (410), the secondary predicting unit 230 activates sensors such as the microphone and camera included in the multimodal sensing unit 130 and receives multimodal signals (420).
The secondary predicting unit 230 extracts features from the acoustic signal received from the microphone and the image signal received from the camera, and classifies and analyzes the features (430).
As acoustic features, time-domain features such as time energy, frequency energy, zero-crossing rate, LPC (linear predictive coding) coefficients, cepstral coefficients, and pitch, or statistical features such as the frequency spectrum, may be extracted from the acoustic signal received from the microphone. The extractable features are not limited to these, and other feature algorithms may be used. The extracted features may be classified into a speech activity class or a non-speech activity class using classification and learning algorithms such as a decision tree, a support vector machine, a Bayesian network, or a neural network, although the classification is not limited to these.
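A hedged sketch of such feature extraction and classification follows, using only a subset of the listed features (time energy, zero-crossing rate, and one crude spectral statistic) and a decision tree; the use of scikit-learn and the placeholder training data are assumptions, not part of the disclosure:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumes scikit-learn is available

def frame_features(frame):
    """A small subset of the features named in the text, per audio frame."""
    energy = float(np.sum(frame ** 2))                            # time energy
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))     # fraction of sign changes
    spectrum = np.abs(np.fft.rfft(frame))
    centroid = float(np.sum(np.arange(len(spectrum)) * spectrum)
                     / (np.sum(spectrum) + 1e-9))                 # crude spectral statistic
    return [energy, zcr, centroid]

# Placeholder training data: real use would label frames as speech (1)
# or non-speech (0) from a recorded corpus.
rng = np.random.default_rng(0)
X_train = [frame_features(rng.standard_normal(400)) for _ in range(100)]
y_train = rng.integers(0, 2, 100)
clf = DecisionTreeClassifier().fit(X_train, y_train)

def is_speech(frame):
    """Classify one frame into the speech / non-speech activity class."""
    return bool(clf.predict([frame_features(frame)])[0])
```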
When a speech segment is detected as a result of the feature analysis (440), the secondary predicting unit 230 may predict the user intention to be voice command input. When no speech segment is detected (440) but a breath sound is detected (450), the secondary predicting unit 230 may predict a blowing intention. Likewise, as other kinds of features are detected, the user intention may be determined in various other ways, such as humming. In each case, the secondary predicting unit 230 predicts the user intention within the range limited by the primary prediction.
Therefore, according to an embodiment, the user intention can be predicted using the user's multimodal information and the performance of the voice detection operation can be controlled according to the prediction result, so that when using a voice interface the user can input speech intuitively without having to learn a separate voice input method, for example operating a dedicated button or touching the screen for voice input.
In addition to the acoustic information from the microphone, the secondary predicting unit 230 may detect the speech segment by using, together with the feature information extracted from the acoustic signal, at least one of the image information input from the image sensor 134 such as a camera and the sensing information input from the biometric information sensor 136 such as a throat microphone, which changes when a person utters speech, and may then process the speech of the detected segment. Here, the sensing information may include at least one of image information showing changes in the shape of the user's mouth, temperature information that changes due to the breath emitted during utterance, vibration information of body parts such as the throat or jawbone that vibrate during utterance, and infrared information sensed from the face or mouth during utterance.
When a speech segment is detected (440), the user intention applying unit 124 may process the voice signal belonging to the detected segment to perform speech recognition, and may switch the application module using the speech recognition result. For example, an application may be executed according to the recognition result; if a name is recognized, intelligent switching between the start and end of voice input becomes possible, such as searching for the phone number of the recognized name or placing a call to the retrieved number. Also, when the user intention inference apparatus 100 is implemented as a mobile communication device, the intention to start or end a voice call can be grasped from the multimodal information, and the operation mode can be switched to a voice call mode automatically even if the user does not perform a separate action such as pressing a call button.
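As a sketch of such application switching, with a stand-in recognizer and a hypothetical contact store in place of real telephony APIs:

```python
CONTACTS = {"alice": "010-1234-5678"}   # hypothetical contact store

def recognize_speech(samples):
    """Stand-in for a real speech recognizer; returns a recognized name."""
    return "alice"

def on_speech_segment(samples):
    name = recognize_speech(samples)
    number = CONTACTS.get(name)
    if number is not None:
        print(f"dialing {number}")            # hand off to the call application
    else:
        print(f"searching for '{name}'")      # fall back to a search application

on_speech_segment(samples=[])  # -> dialing 010-1234-5678
```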
FIG. 5 is a diagram illustrating another example of an operation of predicting the user intention by receiving additional multimodal input after the part of the user intention has been predicted.
When the primarily predicted part of the user intention received from the primary predicting unit 220 is the selection of a specific object (460), the secondary predicting unit 230 activates sensors such as a camera and an ultrasonic sensor and receives multimodal signals (470).
The secondary predicting unit 230 analyzes the received multimodal signals (480) and predicts the user intention. Here, the predicted user intention may be one of the intentions within the range limited by the primary prediction.
As a result of the multimodal signal analysis, the secondary predicting unit 230 may determine that the motion is a hand-waving motion (490). Depending on the application running in the user intention applying unit 124, the secondary predicting unit 230 may interpret the hand-waving motion as an intention to delete a specific item or file shown on the screen, and may control the user intention applying unit 124 so that the specific item or file is deleted.
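A sketch of this application-dependent gesture interpretation, with hypothetical application and command names:

```python
# The same gesture is resolved differently depending on the foreground
# application; application and command names are illustrative only.
GESTURE_MAP = {
    ("file_browser", "hand_wave"): "delete_selected_file",
    ("photo_viewer", "hand_wave"): "delete_current_photo",
}

def apply_gesture(active_app, gesture):
    return GESTURE_MAP.get((active_app, gesture), "ignore")

print(apply_gesture("file_browser", "hand_wave"))  # -> delete_selected_file
```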
FIG. 6 is a diagram illustrating an example of feature-based signal classification in which the secondary predicting unit 230 performs an integrated analysis using an acoustic signal and an image signal together.
The secondary predicting unit 230 may include an acoustic feature extracting unit 510, an acoustic feature analyzing unit 520, an image feature extracting unit 530, an image feature analyzing unit 540, and an integrated analyzing unit 550.
The acoustic feature extracting unit 510 extracts acoustic features from the acoustic signal. The acoustic feature analyzing unit 520 extracts a speech segment by applying classification and learning algorithms to the acoustic features. The image feature extracting unit 530 extracts image features from a series of image signals. The image feature analyzing unit 540 extracts a speech segment by applying classification and learning algorithms to the extracted image features.
The integrated analyzing unit 550 fuses the results classified from the acoustic signal and the image signal, respectively, and finally detects the speech segment. Here, the acoustic features and image features may be applied individually or fused together; and when features are extracted and analyzed from other signals, for example signals representing vibration or temperature, the integrated analyzing unit 550 may fuse them with the detection information extracted from the acoustic and image signals to detect the speech segment.
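One simple way to realize such fusion is a late, score-level combination of the per-modality classifications; the weighted average below is an assumption, as the disclosure does not fix a particular fusion rule:

```python
def fuse_speech_decision(p_audio, p_video, p_extra=None,
                         weights=(0.6, 0.3, 0.1), threshold=0.5):
    """Weighted average of per-modality speech probabilities; an extra
    modality (e.g. vibration or temperature) defaults to 'uninformative'."""
    probs = (p_audio, p_video, 0.5 if p_extra is None else p_extra)
    score = sum(w * p for w, p in zip(weights, probs))
    return score >= threshold   # True -> speech segment detected

print(fuse_speech_decision(0.8, 0.7))  # both modalities point to speech -> True
```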
According to an embodiment, when using a voice interface the user can input speech intuitively without separately learning a voice input method; for example, the user does not need to perform a separate action such as pressing a button or touching the screen for voice input. In addition, the user's speech segment can be detected accurately in a variety of noisy environments, regardless of the type or level of noise, such as household noise, vehicle noise, or noise from other speakers. Furthermore, because voice detection can also rely on biometric information other than images, the user's speech segment can be detected accurately even when the lighting is too bright or too dark, or when the user's mouth is covered.
FIG. 7 is a diagram illustrating a method of inferring user intention using multimodal information, according to an embodiment.
The user intention inference apparatus 100 receives motion information sensed by at least one motion sensor (610). The user intention inference apparatus 100 primarily predicts a part of the user intention using the received motion information (620).
When multimodal information input from at least one multimodal sensor is received (630), the user intention inference apparatus 100 secondarily predicts the user intention using the primarily predicted part of the user intention and the multimodal information (640). In the secondary prediction step, an operation of interpreting the multimodal information input from the multimodal sensors so that it is associated with the primarily predicted part of the user intention may be performed.
A control signal for executing an operation performed in the secondary user intention prediction process may be generated using the primarily predicted part of the user intention. This control signal may be a control signal that controls the operation of a multimodal sensor controlled by the user intention inference apparatus 100. The user intention may be determined, within the range of the primarily predicted part of the user intention, using the multimodal information input from the at least one multimodal sensor.
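The overall flow of steps 610 through 640, including the control signal that activates only the sensors needed by the secondary stage, can be sketched end to end with stub sensors and placeholder decision rules (all names and thresholds here are assumptions):

```python
class Sensor:
    """Hypothetical sensor stub with activate/read methods."""
    def __init__(self, value):
        self.value = value
        self.active = False
    def activate(self):
        self.active = True
    def read(self):
        return self.value if self.active else None

SENSORS_FOR = {
    "mic_to_mouth": ["mic", "camera"],
    "object_selection": ["camera", "ultrasonic"],
}

def predict_primary(motion_info):
    # steps 610-620: a placeholder rule on the mouth-to-hand distance (metres)
    return "mic_to_mouth" if motion_info.get("mouth_distance", 1.0) < 0.2 else "object_selection"

def predict_secondary(primary, readings):
    # step 640: decide within the range narrowed by the primary prediction
    if primary == "mic_to_mouth":
        return "voice_command" if readings.get("mic") == "speech" else "blow"
    return "object_action"

def run_inference(motion_info, sensors):
    primary = predict_primary(motion_info)
    for name in SENSORS_FOR[primary]:   # control signal: activate only the
        sensors[name].activate()        # sensors the secondary stage needs
    readings = {n: sensors[n].read() for n in SENSORS_FOR[primary]}  # step 630
    return primary, predict_secondary(primary, readings)

sensors = {"mic": Sensor("speech"), "camera": Sensor("lips"), "ultrasonic": Sensor(0.5)}
print(run_inference({"mouth_distance": 0.15}, sensors))
# -> ('mic_to_mouth', 'voice_command')
```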
One aspect of the present invention may be embodied as computer-readable code on a computer-readable recording medium. Codes and code segments implementing the program can be easily inferred by computer programmers skilled in the art. The computer-readable recording medium includes all kinds of recording devices that store data readable by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical disks. The computer-readable recording medium can also be distributed over network-coupled computer systems so that the computer-readable code is stored and executed in a distributed fashion.
The above description is only one embodiment of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to implement it in modified forms without departing from the essential characteristics of the present invention. Therefore, the scope of the present invention should not be limited to the embodiments described above, but should be construed to include various embodiments within a scope equivalent to the subject matter recited in the claims.
The present invention is industrially applicable in the fields of computers, electronic products, computer software, and information technology.

Claims (17)

  1. A user intention inference apparatus comprising: a primary predicting unit that predicts a part of a user intention using at least one piece of motion information; and a secondary predicting unit that predicts the user intention using the predicted part of the user intention and multimodal information input from at least one multimodal sensor.
  2. The apparatus of claim 1, wherein the primary predicting unit generates, using the predicted part of the user intention, a control signal for executing an operation performed in the process of predicting the user intention.
  3. The apparatus of claim 2, wherein the control signal for executing the operation performed in the process of predicting the user intention is a control signal that controls an operation of a multimodal sensor controlled by the user intention inference apparatus.
  4. The apparatus of claim 1, wherein, in order to predict the user intention, the secondary predicting unit interprets the multimodal information input from the multimodal sensor so that it is associated with the predicted part of the user intention.
  5. The apparatus of claim 4, wherein, when the predicted part of the user intention is the selection of an object displayed on a display screen and a voice is input from the multimodal sensor, the secondary predicting unit predicts the user intention by interpreting the input voice in association with the object selection.
  6. The apparatus of claim 1, wherein the secondary predicting unit predicts the user intention, within the range of the predicted part of the user intention, using the multimodal information input from the at least one multimodal sensor.
  7. The apparatus of claim 6, wherein, when the predicted part of the user intention is a motion of bringing a microphone to the mouth, the secondary predicting unit senses an acoustic signal and extracts and analyzes features of the sensed acoustic signal to predict the user intention.
  8. The apparatus of claim 7, wherein the secondary predicting unit determines whether a speech segment is detected in the acoustic signal and, when a speech segment is detected, predicts the user intention to be a voice command intention.
  9. The apparatus of claim 8, wherein the secondary predicting unit predicts the user intention to be blowing when a breath sound is detected in the acoustic signal.
  10. The apparatus of claim 1, wherein, when the predicted part of the user intention is the selection of an object displayed on a display screen, the secondary predicting unit uses the multimodal information to predict the user intention to be at least one of deletion, classification, and sorting of the selected object.
  11. The apparatus of claim 1, further comprising a user intention applying unit that controls software or hardware controlled by the user intention inference apparatus using the result of the user intention prediction.
  12. A user intention inference method comprising: receiving at least one piece of motion information; predicting a part of a user intention using the received motion information; receiving multimodal information input from at least one multimodal sensor; and predicting the user intention using the predicted part of the user intention and the multimodal information.
  13. The method of claim 12, further comprising generating, using the predicted part of the user intention, a control signal for executing an operation performed in the process of predicting the user intention.
  14. The method of claim 13, wherein the control signal for executing the operation performed in the process of predicting the user intention is a control signal that controls an operation of a multimodal sensor controlled by the user intention inference apparatus.
  15. The method of claim 12, wherein predicting the user intention comprises interpreting the multimodal information input from the multimodal sensor so that it is associated with the predicted part of the user intention.
  16. The method of claim 12, wherein, in the step of predicting the user intention, the user intention is predicted, within the range of the predicted part of the user intention, using the multimodal information input from the at least one multimodal sensor.
  17. The method of claim 12, further comprising controlling software or hardware controlled by the user intention inference apparatus using the result of the user intention prediction.
PCT/KR2010/002723 2009-04-30 2010-04-29 Apparatus and method for user intention inference using multimodal information WO2010126321A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2012508401A JP5911796B2 (en) 2009-04-30 2010-04-29 User intention inference apparatus and method using multimodal information
EP10769966.2A EP2426598B1 (en) 2009-04-30 2010-04-29 Apparatus and method for user intention inference using multimodal information
CN201080017476.6A CN102405463B (en) 2009-04-30 2010-04-29 Utilize the user view reasoning device and method of multi-modal information

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
KR10-2009-0038267 2009-04-30
KR1020090038267A KR101581883B1 (en) 2009-04-30 2009-04-30 Appratus for detecting voice using motion information and method thereof
KR10-2009-0067034 2009-07-22
KR20090067034 2009-07-22
KR1020100036031A KR101652705B1 (en) 2009-07-22 2010-04-19 Apparatus for predicting intention of user using multi modal information and method thereof
KR10-2010-0036031 2010-04-19

Publications (2)

Publication Number Publication Date
WO2010126321A2 true WO2010126321A2 (en) 2010-11-04
WO2010126321A3 WO2010126321A3 (en) 2011-03-24

Family

ID=45541557

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2010/002723 WO2010126321A2 (en) 2009-04-30 2010-04-29 Apparatus and method for user intention inference using multimodal information

Country Status (5)

Country Link
US (1) US8606735B2 (en)
EP (1) EP2426598B1 (en)
JP (1) JP5911796B2 (en)
CN (1) CN102405463B (en)
WO (1) WO2010126321A2 (en)


Families Citing this family (322)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001013255A2 (en) 1999-08-13 2001-02-22 Pixo, Inc. Displaying and traversing links in character array
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
ITFI20010199A1 (en) 2001-10-22 2003-04-22 Riccardo Vieri SYSTEM AND METHOD TO TRANSFORM TEXTUAL COMMUNICATIONS INTO VOICE AND SEND THEM WITH AN INTERNET CONNECTION TO ANY TELEPHONE SYSTEM
US7669134B1 (en) 2003-05-02 2010-02-23 Apple Inc. Method and apparatus for displaying information during an instant messaging session
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US7633076B2 (en) 2005-09-30 2009-12-15 Apple Inc. Automated response to and sensing of user activity in portable devices
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
ITFI20070177A1 (en) 2007-07-26 2009-01-27 Riccardo Vieri SYSTEM FOR THE CREATION AND SETTING OF AN ADVERTISING CAMPAIGN DERIVING FROM THE INSERTION OF ADVERTISING MESSAGES WITHIN AN EXCHANGE OF MESSAGES AND METHOD FOR ITS FUNCTIONING.
US9053089B2 (en) 2007-10-02 2015-06-09 Apple Inc. Part-of-speech tagging using latent analogy
US8364694B2 (en) 2007-10-26 2013-01-29 Apple Inc. Search assistant for digital media assets
US8620662B2 (en) 2007-11-20 2013-12-31 Apple Inc. Context-aware unit selection
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US8327272B2 (en) 2008-01-06 2012-12-04 Apple Inc. Portable multifunction device, method, and graphical user interface for viewing and managing electronic calendars
US8065143B2 (en) 2008-02-22 2011-11-22 Apple Inc. Providing text input using speech data and non-speech data
US8289283B2 (en) 2008-03-04 2012-10-16 Apple Inc. Language input interface on a device
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US8464150B2 (en) 2008-06-07 2013-06-11 Apple Inc. Automatic language identification for dynamic text processing
US20100030549A1 (en) 2008-07-31 2010-02-04 Lee Michael M Mobile device having human language translation capability with positional feedback
US8768702B2 (en) 2008-09-05 2014-07-01 Apple Inc. Multi-tiered voice feedback in an electronic device
US8898568B2 (en) 2008-09-09 2014-11-25 Apple Inc. Audio user interface
US8712776B2 (en) 2008-09-29 2014-04-29 Apple Inc. Systems and methods for selective text to speech synthesis
US8583418B2 (en) 2008-09-29 2013-11-12 Apple Inc. Systems and methods of detecting language and natural language strings for text to speech synthesis
US8352268B2 (en) 2008-09-29 2013-01-08 Apple Inc. Systems and methods for selective rate of speech and speech preferences for text to speech synthesis
US8352272B2 (en) 2008-09-29 2013-01-08 Apple Inc. Systems and methods for text to speech synthesis
US8396714B2 (en) 2008-09-29 2013-03-12 Apple Inc. Systems and methods for concatenation of words in text to speech synthesis
US8355919B2 (en) 2008-09-29 2013-01-15 Apple Inc. Systems and methods for text normalization for text to speech synthesis
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
WO2010067118A1 (en) 2008-12-11 2010-06-17 Novauris Technologies Limited Speech recognition involving a mobile device
US8862252B2 (en) 2009-01-30 2014-10-14 Apple Inc. Audio user interface for displayless electronic device
US8380507B2 (en) 2009-03-09 2013-02-19 Apple Inc. Systems and methods for determining the language to use for speech generated by a text to speech engine
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10540976B2 (en) 2009-06-05 2020-01-21 Apple Inc. Contextual voice commands
US10255566B2 (en) 2011-06-03 2019-04-09 Apple Inc. Generating and processing task items that represent tasks to perform
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US9431006B2 (en) 2009-07-02 2016-08-30 Apple Inc. Methods and apparatuses for automatic speech recognition
US8682649B2 (en) 2009-11-12 2014-03-25 Apple Inc. Sentiment prediction from textual data
US8600743B2 (en) 2010-01-06 2013-12-03 Apple Inc. Noise profile determination for voice-related feature
US8311838B2 (en) 2010-01-13 2012-11-13 Apple Inc. Devices and methods for identifying a prompt corresponding to a voice input in a sequence of prompts
US8381107B2 (en) 2010-01-13 2013-02-19 Apple Inc. Adaptive audio feedback system and method
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US8639516B2 (en) 2010-06-04 2014-01-28 Apple Inc. User-specific noise suppression for voice quality improvements
US9323844B2 (en) 2010-06-11 2016-04-26 Doat Media Ltd. System and methods thereof for enhancing a user's search experience
US9529918B2 (en) 2010-06-11 2016-12-27 Doat Media Ltd. System and methods thereof for downloading applications via a communication network
US20160300138A1 (en) * 2010-06-11 2016-10-13 Doat Media Ltd. Method and system for context-based intent verification
US10713312B2 (en) * 2010-06-11 2020-07-14 Doat Media Ltd. System and method for context-launching of applications
US9141702B2 (en) 2010-06-11 2015-09-22 Doat Media Ltd. Method for dynamically displaying a personalized home screen on a device
US9552422B2 (en) 2010-06-11 2017-01-24 Doat Media Ltd. System and method for detecting a search intent
US20140365474A1 (en) * 2010-06-11 2014-12-11 Doat Media Ltd. System and method for sharing content over the web
US9069443B2 (en) 2010-06-11 2015-06-30 Doat Media Ltd. Method for dynamically displaying a personalized home screen on a user device
US8713021B2 (en) 2010-07-07 2014-04-29 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US9104670B2 (en) 2010-07-21 2015-08-11 Apple Inc. Customized search or acquisition of digital media assets
US20120038555A1 (en) * 2010-08-12 2012-02-16 Research In Motion Limited Method and Electronic Device With Motion Compensation
US8719006B2 (en) 2010-08-27 2014-05-06 Apple Inc. Combined statistical and rule-based part-of-speech tagging for text-to-speech synthesis
US8700392B1 (en) * 2010-09-10 2014-04-15 Amazon Technologies, Inc. Speech-inclusive device interfaces
US9274744B2 (en) 2010-09-10 2016-03-01 Amazon Technologies, Inc. Relative position-inclusive device interfaces
US8719014B2 (en) 2010-09-27 2014-05-06 Apple Inc. Electronic device with text error correction based on voice recognition data
US9348417B2 (en) 2010-11-01 2016-05-24 Microsoft Technology Licensing, Llc Multimodal input system
US20120159341A1 (en) 2010-12-21 2012-06-21 Microsoft Corporation Interactions with contextual and task-based computing environments
US10515147B2 (en) 2010-12-22 2019-12-24 Apple Inc. Using statistical language models for contextual lookup
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US20120166522A1 (en) * 2010-12-27 2012-06-28 Microsoft Corporation Supporting intelligent user interface interactions
US8781836B2 (en) 2011-02-22 2014-07-15 Apple Inc. Hearing assistance system for providing consistent human speech
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US9263045B2 (en) 2011-05-17 2016-02-16 Microsoft Technology Licensing, Llc Multi-mode text input
US20120304067A1 (en) * 2011-05-25 2012-11-29 Samsung Electronics Co., Ltd. Apparatus and method for controlling user interface using sound recognition
US10672399B2 (en) 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
TWI447066B (en) * 2011-06-08 2014-08-01 Sitronix Technology Corp Distance sensing circuit and touch electronic device
US8928336B2 (en) 2011-06-09 2015-01-06 Ford Global Technologies, Llc Proximity switch having sensitivity control and method therefor
US8975903B2 (en) 2011-06-09 2015-03-10 Ford Global Technologies, Llc Proximity switch having learned sensitivity and method therefor
US8812294B2 (en) 2011-06-21 2014-08-19 Apple Inc. Translating phrases from one language into another using an order-based set of declarative rules
US10004286B2 (en) 2011-08-08 2018-06-26 Ford Global Technologies, Llc Glove having conductive ink and method of interacting with proximity sensor
US8706472B2 (en) 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion
US8994660B2 (en) 2011-08-29 2015-03-31 Apple Inc. Text correction processing
US9143126B2 (en) 2011-09-22 2015-09-22 Ford Global Technologies, Llc Proximity switch having lockout control for controlling movable panel
US8762156B2 (en) 2011-09-28 2014-06-24 Apple Inc. Speech recognition repair using contextual information
US8994228B2 (en) 2011-11-03 2015-03-31 Ford Global Technologies, Llc Proximity switch having wrong touch feedback
US10112556B2 (en) 2011-11-03 2018-10-30 Ford Global Technologies, Llc Proximity switch having wrong touch adaptive learning and method
US8878438B2 (en) 2011-11-04 2014-11-04 Ford Global Technologies, Llc Lamp and proximity switch assembly and method
US9223415B1 (en) 2012-01-17 2015-12-29 Amazon Technologies, Inc. Managing resource usage for task performance
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US8933708B2 (en) 2012-04-11 2015-01-13 Ford Global Technologies, Llc Proximity switch assembly and activation method with exploration mode
US9197206B2 (en) 2012-04-11 2015-11-24 Ford Global Technologies, Llc Proximity switch having differential contact surface
US9520875B2 (en) 2012-04-11 2016-12-13 Ford Global Technologies, Llc Pliable proximity switch assembly and activation method
US9568527B2 (en) 2012-04-11 2017-02-14 Ford Global Technologies, Llc Proximity switch assembly and activation method having virtual button mode
US9219472B2 (en) 2012-04-11 2015-12-22 Ford Global Technologies, Llc Proximity switch assembly and activation method using rate monitoring
US9531379B2 (en) 2012-04-11 2016-12-27 Ford Global Technologies, Llc Proximity switch assembly having groove between adjacent proximity sensors
US9660644B2 (en) 2012-04-11 2017-05-23 Ford Global Technologies, Llc Proximity switch assembly and activation method
US9831870B2 (en) 2012-04-11 2017-11-28 Ford Global Technologies, Llc Proximity switch assembly and method of tuning same
US9559688B2 (en) 2012-04-11 2017-01-31 Ford Global Technologies, Llc Proximity switch assembly having pliable surface and depression
US9184745B2 (en) 2012-04-11 2015-11-10 Ford Global Technologies, Llc Proximity switch assembly and method of sensing user input based on signal rate of change
US9944237B2 (en) 2012-04-11 2018-04-17 Ford Global Technologies, Llc Proximity switch assembly with signal drift rejection and method
US9065447B2 (en) 2012-04-11 2015-06-23 Ford Global Technologies, Llc Proximity switch assembly and method having adaptive time delay
US9287864B2 (en) 2012-04-11 2016-03-15 Ford Global Technologies, Llc Proximity switch assembly and calibration method therefor
US9280610B2 (en) 2012-05-14 2016-03-08 Apple Inc. Crowd sourcing information to fulfill user requests
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US8775442B2 (en) 2012-05-15 2014-07-08 Apple Inc. Semantic search using a single-source semantic model
US9136840B2 (en) 2012-05-17 2015-09-15 Ford Global Technologies, Llc Proximity switch assembly having dynamic tuned threshold
US8981602B2 (en) 2012-05-29 2015-03-17 Ford Global Technologies, Llc Proximity switch assembly having non-switch contact and method
US9337832B2 (en) 2012-06-06 2016-05-10 Ford Global Technologies, Llc Proximity switch and method of adjusting sensitivity therefor
US10019994B2 (en) 2012-06-08 2018-07-10 Apple Inc. Systems and methods for recognizing textual identifiers within a plurality of words
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US9641172B2 (en) 2012-06-27 2017-05-02 Ford Global Technologies, Llc Proximity switch assembly having varying size electrode fingers
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US8922340B2 (en) 2012-09-11 2014-12-30 Ford Global Technologies, Llc Proximity switch based door latch release
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
US8935167B2 (en) 2012-09-25 2015-01-13 Apple Inc. Exemplar-based latent perceptual modeling for automatic speech recognition
US8796575B2 (en) 2012-10-31 2014-08-05 Ford Global Technologies, Llc Proximity switch assembly having ground layer
US9081413B2 (en) * 2012-11-20 2015-07-14 3M Innovative Properties Company Human interaction system based upon real-time intention detection
CN103841137A (en) * 2012-11-22 2014-06-04 腾讯科技(深圳)有限公司 Method for intelligent terminal to control webpage application, and intelligent terminal
US9147398B2 (en) * 2013-01-23 2015-09-29 Nokia Technologies Oy Hybrid input device for touchless user interface
EP4138075A1 (en) 2013-02-07 2023-02-22 Apple Inc. Voice trigger for a digital assistant
US9311204B2 (en) 2013-03-13 2016-04-12 Ford Global Technologies, Llc Proximity interface development system having replicator and method
US9977779B2 (en) 2013-03-14 2018-05-22 Apple Inc. Automatic supplementation of word correction dictionaries
US10572476B2 (en) 2013-03-14 2020-02-25 Apple Inc. Refining a search based on schedule items
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
US10642574B2 (en) 2013-03-14 2020-05-05 Apple Inc. Device, method, and graphical user interface for outputting captions
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US9733821B2 (en) 2013-03-14 2017-08-15 Apple Inc. Voice control to diagnose inadvertent activation of accessibility features
KR101857648B1 (en) 2013-03-15 2018-05-15 애플 인크. User training by intelligent digital assistant
CN112230878A (en) 2013-03-15 2021-01-15 苹果公司 Context-sensitive handling of interrupts
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
KR101759009B1 (en) 2013-03-15 2017-07-17 애플 인크. Training an at least partial voice command system
WO2014144579A1 (en) 2013-03-15 2014-09-18 Apple Inc. System and method for updating an adaptive speech recognition model
JP6032350B2 (en) * 2013-03-21 2016-11-24 富士通株式会社 Motion detection device and motion detection method
CN103200330A (en) * 2013-04-16 2013-07-10 上海斐讯数据通信技术有限公司 Method and mobile terminal for achieving triggering of flashlight
WO2014197336A1 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
WO2014197334A2 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
WO2014197335A1 (en) 2013-06-08 2014-12-11 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
CN110442699A (en) 2013-06-09 2019-11-12 苹果公司 Operate method, computer-readable medium, electronic equipment and the system of digital assistants
CN105265005B (en) 2013-06-13 2019-09-17 苹果公司 System and method for the urgent call initiated by voice command
US9633317B2 (en) 2013-06-20 2017-04-25 Viv Labs, Inc. Dynamically evolving cognitive architecture system based on a natural language intent interpreter
US9594542B2 (en) 2013-06-20 2017-03-14 Viv Labs, Inc. Dynamically evolving cognitive architecture system based on training by third-party developers
US10474961B2 (en) 2013-06-20 2019-11-12 Viv Labs, Inc. Dynamically evolving cognitive architecture system based on prompting for additional user input
US9519461B2 (en) 2013-06-20 2016-12-13 Viv Labs, Inc. Dynamically evolving cognitive architecture system based on third-party developers
US10791216B2 (en) 2013-08-06 2020-09-29 Apple Inc. Auto-activating smart responses based on activities from remote devices
US11199906B1 (en) 2013-09-04 2021-12-14 Amazon Technologies, Inc. Global user input management
US9367203B1 (en) 2013-10-04 2016-06-14 Amazon Technologies, Inc. User interface techniques for simulating three-dimensional depth
US20160163314A1 (en) * 2013-11-25 2016-06-09 Mitsubishi Electric Corporation Dialog management system and dialog management method
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
EP2887205A1 (en) * 2013-12-17 2015-06-24 Sony Corporation Voice activated device, method & computer program product
US10741182B2 (en) * 2014-02-18 2020-08-11 Lenovo (Singapore) Pte. Ltd. Voice input correction using non-audio based input
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9582482B1 (en) 2014-07-11 2017-02-28 Google Inc. Providing an annotation linking related entities in onscreen content
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US9792334B2 (en) * 2014-09-25 2017-10-17 Sap Se Large-scale processing and querying for real-time surveillance
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US10038443B2 (en) 2014-10-20 2018-07-31 Ford Global Technologies, Llc Directional proximity switch assembly
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
JP5784211B1 (en) * 2014-12-19 2015-09-24 株式会社Cygames Information processing program and information processing method
CN105812506A (en) * 2014-12-27 2016-07-27 深圳富泰宏精密工业有限公司 Operation mode control system and method
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US10152299B2 (en) 2015-03-06 2018-12-11 Apple Inc. Reducing response latency of intelligent automated assistants
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9654103B2 (en) 2015-03-18 2017-05-16 Ford Global Technologies, Llc Proximity switch assembly having haptic feedback and method
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
KR102351497B1 (en) 2015-03-19 2022-01-14 삼성전자주식회사 Method and apparatus for detecting a voice section based on image information
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
US9548733B2 (en) 2015-05-20 2017-01-17 Ford Global Technologies, Llc Proximity sensor assembly having interleaved electrode configuration
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10200824B2 (en) 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US20160378747A1 (en) 2015-06-29 2016-12-29 Apple Inc. Virtual assistant for media playback
CN105159111B (en) * 2015-08-24 2019-01-25 百度在线网络技术(北京)有限公司 Intelligent interaction device control method and system based on artificial intelligence
US10331312B2 (en) 2015-09-08 2019-06-25 Apple Inc. Intelligent automated assistant in a media environment
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10740384B2 (en) 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US10970646B2 (en) * 2015-10-01 2021-04-06 Google Llc Action suggestions for user-selected content
CN105389461A (en) * 2015-10-21 2016-03-09 胡习 Interactive children's self-management system and management method thereof
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10764226B2 (en) * 2016-01-15 2020-09-01 Staton Techiya, Llc Message delivery and presentation methods, systems and devices using receptivity
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
DK179588B1 (en) 2016-06-09 2019-02-22 Apple Inc. Intelligent automated assistant in a home environment
CN107490971B (en) * 2016-06-09 2019-06-11 苹果公司 Intelligent automated assistant in a home environment
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
DK179343B1 (en) 2016-06-11 2018-05-14 Apple Inc Intelligent task discovery
DK179049B1 (en) 2016-06-11 2017-09-18 Apple Inc Data driven natural language event detection and classification
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
US10621992B2 (en) * 2016-07-22 2020-04-14 Lenovo (Singapore) Pte. Ltd. Activating voice assistant based on at least one of user proximity and context
CN106446524A (en) * 2016-08-31 2017-02-22 北京智能管家科技有限公司 Intelligent hardware multimodal cascade modeling method and apparatus
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10535005B1 (en) 2016-10-26 2020-01-14 Google Llc Providing contextual actions for mobile onscreen content
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10229680B1 (en) * 2016-12-29 2019-03-12 Amazon Technologies, Inc. Contextual entity resolution
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
DK201770383A1 (en) 2017-05-09 2018-12-14 Apple Inc. User interface for correcting recognition errors
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
DK201770439A1 (en) 2017-05-11 2018-12-13 Apple Inc. Offline personal assistant
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
DK180048B1 (en) 2017-05-11 2020-02-04 Apple Inc. Maintaining the data protection of personal information
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
DK201770429A1 (en) 2017-05-12 2018-12-14 Apple Inc. Low-latency intelligent automated assistant
DK201770411A1 (en) 2017-05-15 2018-12-20 Apple Inc. Multi-modal interfaces
DK201770432A1 (en) 2017-05-15 2018-12-21 Apple Inc. Hierarchical belief states for digital assistants
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US20180336892A1 (en) 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
DK179549B1 (en) 2017-05-16 2019-02-12 Apple Inc. Far-field extension for digital assistant services
US10664533B2 (en) 2017-05-24 2020-05-26 Lenovo (Singapore) Pte. Ltd. Systems and methods to determine response cue for digital assistant based on context
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
CN111656406A (en) 2017-12-14 2020-09-11 奇跃公司 Context-based rendering of virtual avatars
CN108563321A (en) * 2018-01-02 2018-09-21 联想(北京)有限公司 Information processing method and electronic equipment
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
TWI691923B (en) * 2018-04-02 2020-04-21 華南商業銀行股份有限公司 Fraud detection system for financial transaction and method thereof
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
DK179822B1 (en) 2018-06-01 2019-07-12 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
DK201870355A1 (en) 2018-06-01 2019-12-16 Apple Inc. Virtual assistant operation in multi-device environments
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc Attention-aware virtual assistant dismissal
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US11588902B2 (en) * 2018-07-24 2023-02-21 Newton Howard Intelligent reasoning framework for user intent extraction
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US10831442B2 (en) * 2018-10-19 2020-11-10 International Business Machines Corporation Digital assistant user interface amalgamation
CN109192209A (en) * 2018-10-23 2019-01-11 珠海格力电器股份有限公司 Speech recognition method and device
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
CN111737670B (en) * 2019-03-25 2023-08-18 广州汽车集团股份有限公司 Method, system and vehicle-mounted multimedia device for multi-mode data collaborative man-machine interaction
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
DK180129B1 (en) 2019-05-31 2020-06-02 Apple Inc. User activity shortcut suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
DK201970511A1 (en) 2019-05-31 2021-02-15 Apple Inc Voice identification in digital assistant systems
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11468890B2 (en) 2019-06-01 2022-10-11 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
CN110196642B (en) * 2019-06-21 2022-05-17 济南大学 Navigation type virtual microscope based on intention understanding model
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11887600B2 (en) * 2019-10-04 2024-01-30 Disney Enterprises, Inc. Techniques for interpreting spoken input using non-verbal cues
EP3832435A1 (en) * 2019-12-06 2021-06-09 XRSpace CO., LTD. Motion tracking system and method
US11869213B2 (en) * 2020-01-17 2024-01-09 Samsung Electronics Co., Ltd. Electronic device for analyzing skin image and method for controlling the same
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11043220B1 (en) 2020-05-11 2021-06-22 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
CN111968631B (en) * 2020-06-29 2023-10-10 百度在线网络技术(北京)有限公司 Interaction method, apparatus, device, and storage medium for a smart device
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
US11804215B1 (en) 2022-04-29 2023-10-31 Apple Inc. Sonic responses

Family Cites Families (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0375860A (en) * 1989-08-18 1991-03-29 Hitachi Ltd Personalized terminal
US5621858A (en) * 1992-05-26 1997-04-15 Ricoh Corporation Neural network acoustic and visual speech recognition system training method and apparatus
US5473726A (en) * 1993-07-06 1995-12-05 The United States Of America As Represented By The Secretary Of The Air Force Audio and amplitude modulated photo data collection for speech recognition
JP3375449B2 (en) 1995-02-27 2003-02-10 シャープ株式会社 Integrated recognition dialogue device
US5806036A (en) * 1995-08-17 1998-09-08 Ricoh Company, Ltd. Speechreading using facial feature parameters from a non-direct frontal view of the speaker
JP3702978B2 (en) * 1996-12-26 2005-10-05 ソニー株式会社 Recognition device, recognition method, learning device, and learning method
JPH11164186A (en) * 1997-11-27 1999-06-18 Fuji Photo Film Co Ltd Image recorder
US6629065B1 (en) * 1998-09-30 2003-09-30 Wisconsin Alumni Research Foundation Methods and apparata for rapid computer-aided design of objects in virtual reality and other environments
JP2000132305A (en) * 1998-10-23 2000-05-12 Olympus Optical Co Ltd Operation input device
US6842877B2 (en) * 1998-12-18 2005-01-11 Tangis Corporation Contextual responses based on automated learning techniques
US6563532B1 (en) * 1999-01-05 2003-05-13 Interval Research Corporation Low attention recording unit for use by vigorously active recorder
JP2000276190A (en) 1999-03-26 2000-10-06 Yasuto Takeuchi Voice call device requiring no phonation
SE9902229L (en) * 1999-06-07 2001-02-05 Ericsson Telefon Ab L M Apparatus and method of controlling a voice controlled operation
US6904405B2 (en) * 1999-07-17 2005-06-07 Edwin A. Suominen Message recognition using shared language model
JP2001100878A (en) 1999-09-29 2001-04-13 Toshiba Corp Multi-modal input/output device
US7028269B1 (en) * 2000-01-20 2006-04-11 Koninklijke Philips Electronics N.V. Multi-modal video target acquisition and re-direction system and method
JP2001216069A (en) * 2000-02-01 2001-08-10 Toshiba Corp Operation inputting device and direction detecting method
JP2005174356A (en) * 2000-02-01 2005-06-30 Toshiba Corp Direction detection method
NZ503882A (en) * 2000-04-10 2002-11-26 Univ Otago Artificial intelligence system comprising a neural network with an adaptive component arranged to aggregate rule nodes
US6754373B1 (en) * 2000-07-14 2004-06-22 International Business Machines Corporation System and method for microphone activation using visual speech cues
US6894714B2 (en) * 2000-12-05 2005-05-17 Koninklijke Philips Electronics N.V. Method and apparatus for predicting events in video conferencing and other applications
US6964023B2 (en) * 2001-02-05 2005-11-08 International Business Machines Corporation System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
KR20020068235A (en) 2001-02-20 2002-08-27 유재천 Method and apparatus of recognizing speech using a tooth and lip image
US7171357B2 (en) * 2001-03-21 2007-01-30 Avaya Technology Corp. Voice-activity detection using energy ratios and periodicity
US7102485B2 (en) * 2001-05-08 2006-09-05 Gene Williams Motion activated communication device
US7203643B2 (en) * 2001-06-14 2007-04-10 Qualcomm Incorporated Method and apparatus for transmitting speech activity in distributed voice recognition systems
CA2397703C (en) * 2001-08-15 2009-04-28 At&T Corp. Systems and methods for abstracting portions of information that is represented with finite-state devices
US6990639B2 (en) * 2002-02-07 2006-01-24 Microsoft Corporation System and process for controlling electronic components in a ubiquitous computing environment using multimodal integration
DE10208469A1 (en) * 2002-02-27 2003-09-04 Bsh Bosch Siemens Hausgeraete Electrical device, in particular extractor hood
US7230955B1 (en) * 2002-12-27 2007-06-12 At & T Corp. System and method for improved use of voice activity detection
KR100515798B1 (en) * 2003-02-10 2005-09-21 한국과학기술원 Robot driving method using facial gestures
CA2420129A1 (en) * 2003-02-17 2004-08-17 Catena Networks, Canada, Inc. A method for robustly detecting voice activity
US8745541B2 (en) * 2003-03-25 2014-06-03 Microsoft Corporation Architecture for controlling a computer using hand gestures
US20040243416A1 (en) * 2003-06-02 2004-12-02 Gardos Thomas R. Speech recognition
US7343289B2 (en) * 2003-06-25 2008-03-11 Microsoft Corp. System and method for audio/video speaker detection
US7383181B2 (en) * 2003-07-29 2008-06-03 Microsoft Corporation Multi-sensory speech detection system
US7318030B2 (en) * 2003-09-17 2008-01-08 Intel Corporation Method and apparatus to perform voice activity detection
JP4311190B2 (en) * 2003-12-17 2009-08-12 株式会社デンソー In-vehicle device interface
US20050228673A1 (en) * 2004-03-30 2005-10-13 Nefian Ara V Techniques for separating and evaluating audio and video source data
US8788265B2 (en) * 2004-05-25 2014-07-22 Nokia Solutions And Networks Oy System and method for babble noise detection
US7624355B2 (en) * 2004-05-27 2009-11-24 Baneth Robin C System and method for controlling a user interface
FI20045315A (en) * 2004-08-30 2006-03-01 Nokia Corp Detection of voice activity in an audio signal
JP4630646B2 (en) * 2004-11-19 2011-02-09 任天堂株式会社 Breath blowing discrimination program, breath blowing discrimination device, game program, and game device
EP1686804A1 (en) * 2005-01-26 2006-08-02 Alcatel Predictor of multimedia system user behavior
WO2006104555A2 (en) * 2005-03-24 2006-10-05 Mindspeed Technologies, Inc. Adaptive noise state update for a voice activity detector
GB2426166B (en) * 2005-05-09 2007-10-17 Toshiba Res Europ Ltd Voice activity detection apparatus and method
US7346504B2 (en) * 2005-06-20 2008-03-18 Microsoft Corporation Multi-sensory speech enhancement using a clean speech prior
US20070005363A1 (en) * 2005-06-29 2007-01-04 Microsoft Corporation Location aware multi-modal multi-lingual device
WO2007057879A1 (en) * 2005-11-17 2007-05-24 Shaul Simhi Personalized voice activity detection
KR100820141B1 (en) 2005-12-08 2008-04-08 한국전자통신연구원 Apparatus and method for speech segment detection and system for speech recognition
US7860718B2 (en) * 2005-12-08 2010-12-28 Electronics And Telecommunications Research Institute Apparatus and method for speech segment detection and system for speech recognition
DE102006037156A1 (en) * 2006-03-22 2007-09-27 Volkswagen Ag Interactive operating device and method for operating the interactive operating device
KR20080002187A (en) 2006-06-30 2008-01-04 주식회사 케이티 System and method for customized emotion service with an alteration of human being's emotion and circumstances
US8775168B2 (en) * 2006-08-10 2014-07-08 Stmicroelectronics Asia Pacific Pte, Ltd. Yule-Walker based low-complexity voice activity detector in noise suppression systems
WO2008069519A1 (en) * 2006-12-04 2008-06-12 Electronics And Telecommunications Research Institute Gesture/speech integrated recognition system and method
US8326636B2 (en) * 2008-01-16 2012-12-04 Canyon Ip Holdings Llc Using a physical phenomenon detector to control operation of a speech recognition engine
US20080252595A1 (en) * 2007-04-11 2008-10-16 Marc Boillot Method and Device for Virtual Navigation and Voice Processing
JP2009042910A (en) * 2007-08-07 2009-02-26 Sony Corp Information processor, information processing method, and computer program
US8321219B2 (en) 2007-10-05 2012-11-27 Sensory, Inc. Systems and methods of performing speech recognition using gestures
US20090262078A1 (en) * 2008-04-21 2009-10-22 David Pizzi Cellular phone with special sensor functions
US20100162181A1 (en) * 2008-12-22 2010-06-24 Palm, Inc. Interpreting Gesture Input Including Introduction Or Removal Of A Point Of Contact While A Gesture Is In Progress

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
None
See also references of EP2426598A4

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016148398A1 (en) * 2015-03-16 2016-09-22 주식회사 스마트올웨이즈온 Set-top box and photographing apparatus for performing context-aware function based on multi-modal information to self-learn and self-improve user interface and user experience

Also Published As

Publication number Publication date
EP2426598B1 (en) 2017-06-21
US8606735B2 (en) 2013-12-10
WO2010126321A3 (en) 2011-03-24
JP5911796B2 (en) 2016-04-27
US20100280983A1 (en) 2010-11-04
JP2012525625A (en) 2012-10-22
CN102405463B (en) 2015-07-29
EP2426598A2 (en) 2012-03-07
EP2426598A4 (en) 2012-11-14
CN102405463A (en) 2012-04-04

Similar Documents

Publication Publication Date Title
WO2010126321A2 (en) Apparatus and method for user intention inference using multimodal information
CN106575150B (en) Method for recognizing gestures using motion data and wearable computing device
LaViola Jr 3D gestural interaction: The state of the field
WO2010110573A2 (en) Multi-telepointer, virtual object display device, and virtual object control method
WO2013055025A1 (en) Intelligent robot, system for interaction between intelligent robot and user, and method for interacting between intelligent robot and user
WO2017188801A1 (en) Optimum control method based on multi-mode command of operation-voice, and electronic device to which same is applied
KR20100119250A (en) Apparatus for detecting voice using motion information and method thereof
WO2013009062A2 (en) Method and terminal device for controlling content by sensing head gesture and hand gesture, and computer-readable recording medium
CN104516499B (en) Apparatus and method for event using user interface
LaViola Jr An introduction to 3D gestural interfaces
CN111833872B (en) Voice control method, device, equipment, system and medium for elevator
CN109725727 (en) Gesture control method and device for a screen-equipped device
KR101652705B1 (en) Apparatus for predicting intention of user using multi modal information and method thereof
WO2016036197A1 (en) Hand movement recognizing device and method
CN114167984A (en) Device control method, device, storage medium and electronic device
WO2019156412A1 (en) Method for operating voice recognition service and electronic device supporting same
Wang et al. A gesture-based method for natural interaction in smart spaces
CN109725722 (en) Gesture control method and device for a screen-equipped device
WO2014178491A1 (en) Speech recognition method and apparatus
Babu et al. Controlling Computer Features Through Hand Gesture
Mali et al. Hand gestures recognition using inertial sensors through deep learning
Chaudhry et al. Music Recommendation System through Hand Gestures and Facial Emotions
Costagliola et al. Gesture‐Based Computing
US11464380B2 (en) Artificial intelligence cleaner and operating method thereof
KR20230043285A (en) Method and apparatus for hand movement tracking using deep learning

Legal Events

Date Code Title Description
WWE  Wipo information: entry into national phase
     Ref document number: 201080017476.6
     Country of ref document: CN

121  Ep: the EPO has been informed by WIPO that EP was designated in this application
     Ref document number: 10769966
     Country of ref document: EP
     Kind code of ref document: A2

WWE  Wipo information: entry into national phase
     Ref document number: 2012508401
     Country of ref document: JP

NENP Non-entry into the national phase
     Ref country code: DE

REEP Request for entry into the European phase
     Ref document number: 2010769966
     Country of ref document: EP

WWE  Wipo information: entry into national phase
     Ref document number: 2010769966
     Country of ref document: EP