WO2010126321A2 - Apparatus and method for user intention inference using multimodal information - Google Patents


Info

Publication number
WO2010126321A2
Authority
WO
WIPO (PCT)
Prior art keywords
user intention
user
intention
modal
predicted
Prior art date
Application number
PCT/KR2010/002723
Other languages
French (fr)
Korean (ko)
Other versions
WO2010126321A3 (en)
Inventor
조정미
김정수
방원철
김남훈
Original Assignee
Samsung Electronics Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020090038267A (KR101581883B1)
Priority claimed from KR1020100036031A (KR101652705B1)
Application filed by Samsung Electronics Co., Ltd.
Priority to JP2012508401A (JP5911796B2)
Priority to EP10769966.2A (EP2426598B1)
Priority to CN201080017476.6A (CN102405463B)
Publication of WO2010126321A2
Publication of WO2010126321A3

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/03 Arrangements for converting the position or the displacement of a member into a coded form
    • G06F 3/033 Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F 3/0346 Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor with detection of the device orientation or free movement in a 3D space, e.g. 3D mice, 6-DOF [six degrees of freedom] pointers using gyroscopes, accelerometers or tilt-sensors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/451 Execution arrangements for user interfaces

Definitions

  • One or more aspects relate to a system using multi-modal information, and more particularly, to an apparatus and method for processing user input using multi-modal information.
  • A multi-modal interface is an interface that uses voice, keyboard, pen, and similar modalities for communication between human and machine.
  • When multi-modal information is input through such an interface, user intention can be analyzed in two ways: by fusing the multi-modal inputs at the signal level, or by analyzing each modality input separately and then fusing the results at the semantic level.
  • Signal-level fusion merges the multi-modal input signals and analyzes and classifies them at once.
  • This approach is well suited to signals that occur simultaneously, such as a voice signal and the accompanying lip movement.
  • However, because two or more signals are processed jointly, the feature space is very large, the model for computing correlations between the signals is very complex, and the amount of training required is high.
  • It also does not scale easily when combined with other modalities or applied to other terminals.
  • Semantic-level fusion analyzes the meaning of each modality input signal and then fuses the analysis results.
  • Since independence between modalities is maintained, learning and extension are easy.
  • However, users provide multi-modal input precisely because the modalities are related, and this association is hard to find when each modality's meaning is analyzed individually.
  • An apparatus and method are provided that infer user intention efficiently and accurately by first predicting part of the user intention from motion information and then inferring the user intention from that partial prediction and multi-modal input information.
  • According to one aspect, an apparatus for inferring user intention includes a first predictor configured to predict a part of the user intention using at least one piece of motion information, and a second predictor configured to predict the user intention using the predicted part of the user intention and multi-modal information input from at least one multi-modal sensor.
  • According to another aspect, a method of inferring user intention includes receiving at least one piece of motion information, predicting a part of the user intention using the received motion information, receiving multi-modal information input from at least one multi-modal sensor, and predicting the user intention using the predicted part of the user intention and the multi-modal information.
  • According to an embodiment, a part of the user intention is predicted through user motion recognition, and the multi-modal information is then analyzed according to that partial prediction to predict the user intention secondarily; this maintains independence between modalities while making inter-modality associations easy to capture, so user intention can be inferred accurately.
  • In addition, since the start and end of a user's voice input can be predicted from motion information alone or by fusing multi-modal information such as voice or image data with the motion information, the user can input voice to the user intention inference apparatus without learning a special voice input method.
  • FIG. 1 is a diagram illustrating a configuration of a user intention inference apparatus according to an exemplary embodiment.
  • FIG. 2 is a diagram illustrating an example of a configuration of the user intention predictor of FIG. 1.
  • FIG. 3 is a diagram illustrating an exemplary operation of the user intention predictor of FIG. 2.
  • FIG. 4 is a diagram illustrating an example of an operation of predicting the user intention by receiving additional multi-modal input after a part of the user intention has been predicted.
  • FIG. 5 is a diagram illustrating another example of an operation of predicting the user intention by receiving additional multi-modal input after a part of the user intention has been predicted.
  • FIG. 6 is a diagram illustrating an example of a configuration that classifies signals by combining an acoustic signal and an image signal.
  • FIG. 7 is a diagram illustrating a method of inferring user intention using multi-modal information according to an exemplary embodiment.
  • According to one aspect, an apparatus for inferring user intention includes a first predictor configured to predict a part of the user intention using at least one piece of motion information, and a second predictor configured to predict the user intention using the predicted part of the user intention and multi-modal information input from at least one multi-modal sensor.
  • The first predictor may use the predicted part of the user intention to generate a control signal for executing an operation performed in the process of predicting the user intention.
  • The control signal for executing the operation performed in the process of predicting the user intention may be a control signal for controlling the operation of a multi-modal sensor controlled by the user intention inference apparatus.
  • To predict the user intention, the second predictor may interpret the multi-modal information input from the multi-modal sensor in association with the predicted part of the user intention.
  • If the predicted part of the user intention is selection of an object displayed on the display screen and a voice is input from the multi-modal sensor, the second predictor may predict the user intention by interpreting the input voice in association with the object selection.
  • The second predictor may predict the user intention using the multi-modal information input from the at least one multi-modal sensor within the scope of the predicted part of the user intention.
  • When the predicted part of the user intention is an operation of bringing a microphone to the mouth, the second predictor may sense an acoustic signal, extract and analyze features from the sensed signal, and predict the user intention.
  • The second predictor may determine whether a voice section is detected in the acoustic signal and, when a voice section is detected, predict the user intention as a voice command.
  • When a breath sound is detected in the acoustic signal, the second predictor may predict the user intention as blowing.
  • When the predicted part of the user intention is selection of an object displayed on the display screen, the second predictor may use the multi-modal information to predict the user intention as at least one of deleting, classifying, and sorting the selected object.
  • The apparatus may further include a user intention applying unit configured to use the user intention prediction result to control software or hardware controlled by the user intention inference apparatus.
  • According to another aspect, a method of inferring user intention includes receiving at least one piece of motion information, predicting a part of the user intention using the received motion information, receiving multi-modal information input from at least one multi-modal sensor, and predicting the user intention using the predicted part of the user intention and the multi-modal information.
  • FIG. 1 is a diagram illustrating a configuration of a user intention inference apparatus according to an exemplary embodiment.
  • Referring to FIG. 1, the user intention inference apparatus 100 includes a motion sensor 110, a controller 120, and a multi-modal sensing unit 130.
  • The user intention inference apparatus 100 may be implemented as any type of device or system, such as a cellular phone, a personal digital assistant (PDA), a digital camera, a portable game console, an MP3 player, a portable/personal multimedia player (PMP), a handheld e-book reader, a portable laptop PC, a global positioning system (GPS) navigation device, a desktop PC, a high definition television (HDTV), an optical disc player, or a set-top box.
  • Depending on the implementation, the user intention inference apparatus 100 may further include components for a multi-modal interface, such as a user interface unit, a display unit, and a sound output unit.
  • To sense motion information, the motion sensor 110 may include an inertial sensor, a geomagnetic sensor for sensing direction, an acceleration sensor or a gyro sensor for sensing movement, and the like.
  • Besides these, the motion sensor 110 may also include an image sensor, an acoustic sensor, and the like.
  • According to an embodiment, a plurality of motion sensors may be attached to parts of the user's body and to the user intention inference apparatus 100 to sense motion information.
  • the multi-modal sensing unit 130 may include at least one multi-modal sensor 132, 134, 136, and 138.
  • The acoustic sensor 132 senses acoustic signals, the image sensor 134 senses image information, the biometric information sensor 136 senses biometric information such as body temperature, and the touch sensor 138 senses touch gestures on a touch pad.
  • Various other kinds or types of multi-modal sensors may also be included.
  • Although FIG. 1 shows four sensors in the multi-modal sensing unit 130, the number of sensors is not limited thereto.
  • The kinds and range of sensors included in the multi-modal sensing unit 130 may be broader than those of the motion sensor 110, whose purpose is motion sensing.
  • Although the motion sensor 110 and the multi-modal sensing unit 130 are shown as separate in FIG. 1, they may be integrated.
  • Alternatively, the motion sensor 110 and the multi-modal sensing unit 130 may redundantly include the same kinds of sensors, for example an image sensor and an acoustic sensor.
  • The multi-modal sensing unit 130 may include modules that extract feature values from the multi-modal information sensed by each multi-modal sensor 132, 134, 136, 138 according to its type and analyze its meaning.
  • Components for analyzing the multi-modal information may be included in the controller 120.
  • The controller 120 may include applications, data, and an operating system for controlling the operation of each component of the user intention inference apparatus 100.
  • According to an embodiment, the controller 120 includes a user intention predictor 122 and a user intention applying unit 124.
  • The user intention predictor 122 receives at least one piece of motion information sensed by the motion sensor 110 and primarily predicts a part of the user intention using the received motion information.
  • The user intention predictor 122 may then secondarily predict the user intention using the predicted part of the user intention and multi-modal information input from at least one multi-modal sensor. That is, in the secondary prediction, the user intention predictor 122 uses the motion information sensed by the motion sensor 110 together with the multi-modal information input from the multi-modal sensing unit 130 to finally predict the user intention.
  • The user intention predictor 122 may use various known inference models to infer the user intention; one possible shape of this two-stage structure is sketched below.
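As an illustration only, the two-stage structure described above might be organized as follows. This is a minimal sketch assuming simple rule-based models; all class, method, and field names are hypothetical and do not come from the patent.

```python
# Hypothetical sketch of the two-stage user-intention prediction.
# All names and thresholds are illustrative assumptions.

class FirstPredictor:
    """Primarily predicts a part of the user intention from motion info."""
    def predict_partial(self, motion_info):
        # e.g. a distance/orientation rule; any known inference model works
        if motion_info["mic_to_mouth_cm"] < 20 and motion_info["mic_faces_mouth"]:
            return "bring_mic_to_mouth"
        return "unknown"

class SecondPredictor:
    """Secondarily predicts the user intention within the narrowed scope."""
    def predict(self, partial_intention, multimodal_info):
        if partial_intention == "bring_mic_to_mouth":
            if multimodal_info.get("voice_detected"):
                return "voice_command"
            if multimodal_info.get("breath_detected"):
                return "blow"
        return partial_intention  # stay within the narrowed scope
```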
  • In addition, the user intention predictor 122 may use the primarily predicted part of the user intention to generate a control signal for executing an operation performed in the secondary prediction process.
  • The control signal for executing the operation performed in the user intention inference process may be a control signal for controlling the operation of the multi-modal sensing unit 130, which is controlled by the user intention inference apparatus 100.
  • For example, based on the part of the user intention primarily predicted from the motion information, only those sensors of the multi-modal sensing unit 130 associated with that partial prediction may be activated.
  • In this case, the power consumed by sensor operation is reduced compared with activating all sensors of the multi-modal sensing unit 130.
  • In addition, since only the sensing information from those sensors is analyzed, interpretation of the multi-modal input information is simplified, reducing the complexity of the user intention prediction process while still allowing accurate inference; a sketch of this selective activation follows.
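A minimal sketch of the selective activation idea, assuming a hypothetical mapping from each partial intention to the sensors it needs; the sensor names and the mapping itself are illustrative assumptions.

```python
# Hypothetical mapping from a primarily predicted partial intention
# to the multi-modal sensors associated with it.
SENSORS_FOR_PARTIAL_INTENTION = {
    "bring_mic_to_mouth": {"acoustic", "image"},    # microphone + camera
    "select_object":      {"image", "ultrasonic"},  # camera + ultrasonic
}

def control_signals(partial_intention, all_sensors):
    """Return an activate/deactivate control signal for each sensor."""
    wanted = SENSORS_FOR_PARTIAL_INTENTION.get(partial_intention, set())
    return {sensor: (sensor in wanted) for sensor in all_sensors}

# Only the microphone and camera are switched on; the rest stay off,
# saving the power that activating every sensor would consume.
signals = control_signals("bring_mic_to_mouth",
                          ["acoustic", "image", "biometric", "touch"])
```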
  • The user intention predictor 122 may include a module (not shown) that extracts and analyzes features according to the type of multi-modal information in order to predict the user intention secondarily.
  • The user intention predictor 122 may also interpret the multi-modal information input from the multi-modal sensing unit 130 in association with the primarily predicted part of the user intention.
  • For example, when the primarily predicted part of the user intention is determined to be selection of an object displayed on the display screen and a voice is then input from the multi-modal sensing unit 130, the user intention predictor 122 may secondarily predict the user intention by interpreting the input voice in conjunction with the object selection. Specifically, if the acoustic signal input through the multi-modal sensing unit 130 is analyzed as "organize by date", the user intention predictor 122 may interpret the user intention as "sort the objects selected on the display screen in date order".
  • Also, when the primarily predicted part of the user intention is selection of an object displayed on the display screen, the user intention predictor 122 may use the multi-modal information to predict the secondary user intention as at least one of deleting, classifying, and sorting, as in the sketch below.
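For illustration, resolving a recognized utterance within the scope of an object selection could look like the following sketch; the command strings, action names, and file names are hypothetical.

```python
# Hypothetical second-stage interpretation: the recognized utterance is
# resolved against the narrowed scope "objects selected on screen".
ACTIONS = {
    "organize by date": "sort_by_date",
    "delete": "delete",
    "classify": "classify",
}

def interpret(selected_objects, recognized_text):
    action = ACTIONS.get(recognized_text)
    if action is None:
        return None  # outside the narrowed scope; no intention inferred
    return {"action": action, "targets": selected_objects}

# "organize by date" + selected photos -> sort those photos in date order
intent = interpret(["photo1.jpg", "photo2.jpg"], "organize by date")
```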
  • The user intention applying unit 124 may use the user intention prediction result to control software or hardware controlled by the user intention inference apparatus.
  • The user intention applying unit 124 may provide a multi-modal interface for interacting with the predicted user intention. For example, if the user intention is predicted to be a voice command, voice recognition may be performed to determine the meaning of the command, and an application that automatically places a call to a specific person, or a search application, may be executed based on the recognition result; if the intention is to transfer an object selected by the user, an e-mail application may be executed. As another example, when the user intention is predicted to be humming, an application that searches for music similar to the hummed source may be driven. As yet another example, when the user intention is predicted to be blowing, it may serve as a command that makes an avatar perform a specific action in a game application.
  • In this way, multi-modal information is interpreted in relation to the primarily predicted part of the user intention while the modalities retain their independence, so inter-modality associations are easy to capture and the user intention can be inferred accurately.
  • FIG. 2 is a diagram illustrating an example of a configuration of the user intention predictor of FIG. 1.
  • the user intention predictor 122 may include a motion information analyzer 210, a first predictor 220, and a second predictor 230.
  • The motion information analyzer 210 analyzes one or more pieces of motion information received from the motion sensor 110.
  • The motion information analyzer 210 may measure position and angle information of each part of the user's body to which a motion sensor 110 is attached, and may also use the measured position and angle information to calculate position and angle information of body parts to which no motion sensor is attached.
  • For example, when motion sensors are attached to both wrists and the head, the distances between the sensors can be measured, and each sensor can provide three-dimensional rotation angle information relative to a reference coordinate system. The distance between the wrist and the head and the rotation angle of the wrist can therefore be calculated from the motion information, yielding the distance between the wrist and the mouth area of the face and the wrist rotation angle. Assuming the user is holding in the hand a microphone corresponding to the acoustic sensor 132 of the user intention inference apparatus 100, the distance between the microphone and the mouth and the direction of the microphone can be calculated.
  • As another example, when the motion sensor 110 is mounted on the user's head and on the microphone corresponding to the acoustic sensor, the distance between the microphone and the head can be measured from the motion information and three-dimensional angle information of the sensor axis can be obtained from an inertial sensor attached to the microphone, so the motion information analyzer 210 may calculate the distance between the microphone and the mouth area of the face and the rotation angle information of the microphone; the underlying vector arithmetic is sketched below.
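The geometry in these examples reduces to vector arithmetic. A sketch, assuming each motion sensor reports a 3-D position in the reference frame and the wrist sensor reports a rotation matrix; the choice of the local microphone axis is an assumption.

```python
import numpy as np

def mic_mouth_distance_and_direction(mouth_pos, wrist_pos, wrist_rotation):
    """mouth_pos, wrist_pos: 3-D positions in the reference frame.
    wrist_rotation: 3x3 rotation matrix reported by the wrist sensor.
    Returns the wrist-to-mouth distance and the cosine of the angle
    between the microphone axis and the wrist-to-mouth direction."""
    to_mouth = mouth_pos - wrist_pos
    distance = float(np.linalg.norm(to_mouth))
    # Assume the microphone points along the wrist sensor's local x-axis.
    mic_axis = wrist_rotation @ np.array([1.0, 0.0, 0.0])
    cos_angle = float(mic_axis @ to_mouth) / distance
    return distance, cos_angle
```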
  • As yet another example, the motion sensor 110 may include an image sensor that inputs image information to the motion information analyzer 210.
  • In this case, the motion information analyzer 210 may recognize objects such as a face or hands in the image and calculate positional relationships between the objects, for example the distance and angle between the face and the two hands, or the distance and angle between the two hands.
  • The first predictor 220 predicts a part of the user intention using the result of the motion information analysis. For example, through analysis of motion information including images, the first predictor 220 may primarily predict whether the motion is selecting an object on the screen.
  • The second predictor 230 predicts the user intention using the part of the user intention predicted by the first predictor 220 and the multi-modal information input from the multi-modal sensing unit 130.
  • To do so, the second predictor 230 may interpret the multi-modal information input from the multi-modal sensor in association with the primarily predicted part of the user intention. For example, when the primarily predicted part is selection of an object displayed on the display screen and a voice is received from the multi-modal sensing unit 130, the second predictor 230 may secondarily predict the user intention by interpreting the input voice in relation to the object selection.
  • As another example, suppose the first predictor 220 primarily predicts that the partial user intention is bringing the microphone to the mouth.
  • If the multi-modal sensing unit 130 then detects lip movement through the image sensor 134 such as a camera while a voice is sensed in the acoustic signal from the microphone,
  • the second predictor 230 may predict the user intention as voice command input.
  • In this case, the second predictor 230 may detect a voice section in the acoustic signal and perform feature extraction and analysis on the detected section, and the result may be made available to the user intention applying unit 124 for semantic analysis.
  • Likewise, suppose the first predictor 220 primarily predicts that the partial user intention is bringing the microphone to the mouth, and the multi-modal sensing unit 130 detects through the image sensor 134 such as a camera an image of the lips protruding forward while a breath sound is sensed by the microphone.
  • In that case, the second predictor 230 may predict the user intention as blowing.
  • Here the two user intentions, "bring the microphone to the mouth and input a voice command" and "bring the microphone to the mouth and blow", are different.
  • The two intentions share the common part "bring the microphone to the mouth", and the first predictor 220 predicts this common part first, narrowing the scope of possible user intentions.
  • Within this narrowed scope, the second predictor 230 may then predict the user intention in consideration of the multi-modal information.
  • In other words, the second predictor 230 may determine whether the user intention is "voice command input" or "blowing" in consideration of the sensed multi-modal information.
  • FIG. 3 is a diagram illustrating an exemplary operation of the user intention predictor of FIG. 2.
  • Referring to FIG. 3, the first predictor 220 may predict a part of the user intention using the motion information analyzed by the motion information analyzer 210.
  • The second predictor 230 receives multi-modal signals, such as an image sensed by the image sensor 134 of the multi-modal sensing unit 130 or an acoustic signal sensed by the acoustic sensor 132, and generates information about whether a voice is detected, which is used to predict the user intention.
  • In detail, the motion information analyzer 210 calculates the distance between the user's mouth and the hand holding the microphone using motion information sensed by motion sensors mounted on the user's head and wrist (310).
  • The motion information analyzer 210 also calculates the direction of the microphone from the rotation angle of the wrist (320).
  • The first predictor 220 predicts a part of the user intention by determining, from the distance and direction information calculated by the motion information analyzer 210, whether the user is moving the microphone toward the mouth (330). For example, when the hand holding the microphone is within a 20 cm radius of the mouth and the microphone is oriented toward the mouth, the first predictor 220 may predict that the user intends to bring the microphone to the mouth, as in the sketch below.
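Continuing the geometry sketch above, the primary prediction at step 330 might then reduce to a threshold test like the one below; the 20 cm radius comes from the example in the text, while the orientation threshold is an assumption.

```python
def predicts_mic_to_mouth(distance_cm, cos_angle,
                          max_dist_cm=20.0, min_cos=0.7):
    """Primary prediction (step 330): the user is bringing the microphone
    to the mouth if the hand holding it is within 20 cm of the mouth and
    the microphone axis points roughly toward the mouth (cosine threshold
    assumed for illustration)."""
    return distance_cm <= max_dist_cm and cos_angle >= min_cos
```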
  • The second predictor 230 then analyzes the multi-modal input signals from the acoustic sensor 132 such as a microphone and the image sensor 134 such as a camera, and predicts the user intention, for example as a voice command or as an intention such as humming or blowing.
  • When the primary prediction is that the microphone is being brought to the mouth, lip movement is detected by the camera, and a voice is detected in the acoustic signal sensed by the microphone,
  • the second predictor 230 may determine that the user intention is a voice command (340).
  • When the primary prediction is that the microphone is being brought to the mouth, an image of the lips protruding forward is detected by the camera, and a breath sound is detected in the acoustic signal input from the microphone, the second predictor 230 may determine that the user intention is blowing (350). This decision logic is sketched below.
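The two determinations at 340 and 350 can be read as a small decision table over the camera and microphone analyses. A sketch, with the four boolean flags assumed to be produced by the image and acoustic analyzers:

```python
def second_stage_decision(lips_moving, lips_protruded,
                          voice_detected, breath_detected):
    """Decision table for FIG. 3, given the primary prediction
    'bring the microphone to the mouth' (inputs are assumed flags)."""
    if lips_moving and voice_detected:
        return "voice_command"  # step 340
    if lips_protruded and breath_detected:
        return "blow"           # step 350
    return "undetermined"       # fall through; keep sensing
```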
  • FIG. 4 is a diagram illustrating an example of an operation of predicting a user's intention by receiving an additional multimodal input after a part of the user's intention is predicted.
  • Referring to FIG. 4, when a part of the user intention has been primarily predicted, the second predictor 230 activates sensors of the multi-modal sensing unit 130, such as the microphone and the camera, so that multi-modal signals are input (420).
  • The second predictor 230 extracts features from the acoustic signal input from the microphone and the image signal input from the camera, and classifies and analyzes the features (430).
  • Time-domain acoustic features such as time energy, frequency energy, zero crossing rate, linear predictive coding (LPC) coefficients, cepstral coefficients, and pitch, or statistical features such as the frequency spectrum, may be extracted.
  • The extractable features are not limited to these; other feature algorithms may be used.
  • The extracted features may be classified into a speech class or a non-speech class using classification and learning algorithms such as a decision tree, support vector machine, Bayesian network, or neural network, though classification is not limited to these algorithms; a feature-extraction sketch follows.
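As a rough illustration of step 430, a few of the listed time-domain features can be computed directly from an audio frame and fed to one of the named classifiers. The frame length, the feature subset, and the use of scikit-learn's SVC are assumptions for the sketch, not part of the patent.

```python
import numpy as np
from sklearn.svm import SVC  # support vector machine, one option named above

def frame_features(frame):
    """A small subset of the listed features for one audio frame:
    time energy, zero crossing rate, and a coarse low-band energy."""
    energy = float(np.sum(frame ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    spectrum = np.abs(np.fft.rfft(frame))
    low_band_energy = float(np.sum(spectrum[: len(spectrum) // 4] ** 2))
    return [energy, zcr, low_band_energy]

# Given labeled frames (speech = 1, non-speech = 0), train and classify:
# X = [frame_features(f) for f in frames]; y = labels
clf = SVC()
# clf.fit(X, y)
# clf.predict([frame_features(new_frame)])
```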
  • When a voice section is detected as a result of the feature analysis, the second predictor 230 may predict the user intention as voice command input (440); when no voice section is detected but a breath sound is detected, it may predict the user intention as blowing (450). Likewise, as other kinds of features are detected, the user intention may be determined in various ways, for example as humming. In each case, the second predictor 230 predicts the user intention within the scope narrowed by the primary prediction.
  • As described above, the user intention may be predicted using the user's multi-modal information, and the execution of voice detection may be controlled according to the prediction result.
  • Therefore, voice can be input intuitively, without learning a separate input method such as pressing a button or touching the screen.
  • The second predictor 230 may use at least one piece of sensing information, such as image information input from the image sensor 134 such as a camera, or information input from the biometric information sensor 136 such as a throat microphone indicating that the user is uttering speech,
  • together with the feature information extracted from the acoustic signal, to detect a voice section and process the speech in the detected section.
  • The sensing information may include at least one of image information indicating a change in the shape of the user's mouth, temperature information that changes due to breathing during utterance, vibration information from a body part such as the throat or jaw that vibrates during utterance, and infrared information detected from the face or mouth during utterance.
  • The user intention applying unit 124 may perform voice recognition by processing the voice signal belonging to the detected voice section, and switch application modules based on the recognition result. For example, intelligent switching of voice input start and end can be performed: when a name is recognized, a phone number for the recognized name may be searched, or a call may be placed to the retrieved number.
  • In addition, when applied to voice calls, the start and end of the call can be grasped automatically from the multi-modal information, so the operation mode can be switched to voice call mode even if the user does not perform a separate operation such as pressing a call button.
  • FIG. 5 illustrates another example of an operation of predicting a user's intention by receiving an additional multimodal input after a part of the user's intention is predicted.
  • Referring to FIG. 5, when the primarily predicted part of the user intention received from the first predictor 220 is selection of a specific object (460), the second predictor 230 activates sensors such as a camera and an ultrasonic sensor and receives multi-modal input (470).
  • The second predictor 230 analyzes the input multi-modal signals to predict the user intention (480).
  • The predicted user intention falls within the scope defined by the primary prediction.
  • For example, the second predictor 230 may determine from the multi-modal signal analysis that the user is waving a hand.
  • In this case, the second predictor 230 may interpret the waving motion, according to the application being executed by the user intention applying unit 124, as an intention to delete a specific item or file shown on the screen, and the user intention applying unit 124 may be controlled to delete that item or file.
  • FIG. 6 is a diagram illustrating an example of feature-based signal classification in which the second predictor 230 performs integrated analysis using an acoustic signal and an image signal together.
  • The second predictor 230 may include an acoustic feature extractor 510, an acoustic feature analyzer 520, an image feature extractor 530, an image feature analyzer 540, and an integrated analyzer 550.
  • The acoustic feature extractor 510 extracts acoustic features from the acoustic signal.
  • The acoustic feature analyzer 520 detects a voice section by applying classification and learning algorithms to the acoustic features.
  • The image feature extractor 530 extracts image features from a series of image signals.
  • The image feature analyzer 540 detects a voice section by applying classification and learning algorithms to the extracted image features.
  • The integrated analyzer 550 fuses the results classified separately from the acoustic signal and the image signal and finally detects the voice section.
  • In this process, the acoustic features and the image features may be applied individually, or the two kinds of features may be fused and applied.
  • When other sensing information is also extracted, the integrated analyzer 550 may detect the voice section by fusing it with the detection information extracted from the acoustic signal and the image signal, as in the sketch below.
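A minimal late-fusion sketch in the spirit of FIG. 6: the acoustic path (510/520) and the image path (530/540) each yield a per-frame speech score, and the integrated analyzer 550 combines them. The weights and threshold are assumptions.

```python
def fuse_voice_detection(audio_score, image_score,
                         w_audio=0.6, w_image=0.4, threshold=0.5):
    """Integrated analyzer (sketch): fuse per-frame speech scores from
    the acoustic and image analysis paths into one voice-section decision.
    Scores are assumed to lie in [0, 1]; weights are illustrative."""
    return w_audio * audio_score + w_image * image_score >= threshold

# Example: weak audio evidence (noisy room) rescued by clear lip movement.
is_speech = fuse_voice_detection(audio_score=0.4, image_score=0.9)
```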
  • Accordingly, when using the voice interface, the user can input voice intuitively without separately learning a voice input method; for example, no separate button press or screen touch is needed for voice input.
  • Also, the voice section can be detected accurately even in environments with noise such as household noise, vehicle noise, or noise from non-speakers.
  • Since the voice may be detected using other biometric information in addition to the image, the user's voice section can be detected accurately even when the lighting is too bright or too dark or the user's mouth is covered.
  • FIG. 7 is a diagram illustrating a user intention reasoning method using multi-modal information according to an exemplary embodiment.
  • Referring to FIG. 7, the user intention inference apparatus 100 receives motion information sensed by at least one motion sensor (610).
  • The user intention inference apparatus 100 primarily predicts a part of the user intention using the received motion information (620).
  • When multi-modal information is input from at least one multi-modal sensor, the user intention inference apparatus 100 predicts the user intention using the primarily predicted part of the user intention and the multi-modal information (640). In this secondary prediction step, the multi-modal information input from the multi-modal sensor may be interpreted in association with the primarily predicted part of the user intention.
  • The primarily predicted part of the user intention may also be used to generate a control signal for executing an operation performed in the secondary prediction process.
  • The control signal for executing the operation performed in the secondary prediction process may be a control signal for controlling the operation of a multi-modal sensor controlled by the user intention inference apparatus 100.
  • The user intention may be predicted using the multi-modal information input from the at least one multi-modal sensor within the scope of the primarily predicted part of the user intention. The overall flow is sketched below.
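Tying the steps of FIG. 7 together, the method might be wired as below, reusing the hypothetical two-stage predictor classes sketched earlier; the sensor-reading interfaces are assumptions.

```python
def infer_user_intention(motion_sensor, multimodal_unit,
                         first_predictor, second_predictor):
    """End-to-end sketch of the FIG. 7 flow (all objects hypothetical)."""
    motion_info = motion_sensor.read()                         # step 610
    partial = first_predictor.predict_partial(motion_info)     # step 620
    multimodal_info = multimodal_unit.read()                   # multi-modal input
    return second_predictor.predict(partial, multimodal_info)  # step 640
```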
  • One aspect of the invention may be embodied as computer-readable code on a computer-readable recording medium. The codes and code segments implementing the program can be easily inferred by computer programmers skilled in the art.
  • Computer-readable recording media include all kinds of recording devices that store data that can be read by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, and the like.
  • the computer readable recording medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
  • the invention is industrially applicable in the fields of computers, electronics, computer software and information technology.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed are an apparatus and a method for user intention inference using multimodal information. The apparatus for user intention inference according to one aspect of the present invention comprises: a first prediction unit which predicts a portion of user intention using at least one piece of motion information; and a second prediction unit which predicts user intention using the portion of user intention predicted by the first prediction unit and multimodal information input by at least one multimodal sensor.

Description

User Intention Inference Apparatus and Method Using Multi-modal Information
One or more aspects relate to a system using multi-modal information, and more particularly, to an apparatus and method for processing user input using multi-modal information.
A multi-modal interface is an interface that uses voice, keyboard, pen, and similar modalities for communication between human and machine. When multi-modal information is input through such an interface, user intention can be analyzed in two ways: by fusing the multi-modal inputs at the signal level, or by analyzing each modality input separately and then fusing the results at the semantic level.
Signal-level fusion merges the multi-modal input signals and analyzes and classifies them at once. It is well suited to signals that occur simultaneously, such as a voice signal and lip movement. However, because two or more signals are integrated and processed together, the feature space is very large, the model for computing correlations between the signals is very complex, and the amount of training required is high. It also does not extend easily to other modalities or other terminals.
Semantic-level fusion analyzes the meaning of each modality input signal and then fuses the analysis results. Since independence between modalities is maintained, learning and extension are easy. However, users provide multi-modal input precisely because the modalities are related, and this association is hard to find when each modality's meaning is analyzed individually.
An apparatus and method are provided that can infer user intention efficiently and accurately by predicting part of the user intention from motion information and then inferring the user intention from that partial prediction and multi-modal input information.
According to one aspect, an apparatus for inferring user intention includes a first predictor configured to predict a part of the user intention using at least one piece of motion information, and a second predictor configured to predict the user intention using the predicted part of the user intention and multi-modal information input from at least one multi-modal sensor.
According to another aspect, a method of inferring user intention includes receiving at least one piece of motion information, predicting a part of the user intention using the received motion information, receiving multi-modal information input from at least one multi-modal sensor, and predicting the user intention using the predicted part of the user intention and the multi-modal information.
According to an embodiment, a part of the user intention is predicted through user motion recognition, and the multi-modal information is analyzed according to that partial prediction to predict the user intention secondarily; this maintains independence between modalities while making inter-modality associations easy to capture, so user intention can be inferred accurately.
In addition, since the start and end of a user's voice input can be predicted from motion information alone or by fusing multi-modal information such as voice or image data with the motion information, the user can input voice to the user intention inference apparatus without learning a special voice input method.
FIG. 1 is a diagram illustrating a configuration of a user intention inference apparatus according to an exemplary embodiment.
FIG. 2 is a diagram illustrating an example of a configuration of the user intention predictor of FIG. 1.
FIG. 3 is a diagram illustrating an exemplary operation of the user intention predictor of FIG. 2.
FIG. 4 is a diagram illustrating an example of an operation of predicting the user intention by receiving additional multi-modal input after a part of the user intention has been predicted.
FIG. 5 is a diagram illustrating another example of an operation of predicting the user intention by receiving additional multi-modal input after a part of the user intention has been predicted.
FIG. 6 is a diagram illustrating an example of a configuration that classifies signals by combining an acoustic signal and an image signal.
FIG. 7 is a diagram illustrating a method of inferring user intention using multi-modal information according to an exemplary embodiment.
According to one aspect, an apparatus for inferring user intention includes a first predictor configured to predict a part of the user intention using at least one piece of motion information, and a second predictor configured to predict the user intention using the predicted part of the user intention and multi-modal information input from at least one multi-modal sensor.
The first predictor may use the predicted part of the user intention to generate a control signal for executing an operation performed in the process of predicting the user intention.
The control signal for executing the operation performed in the process of predicting the user intention may be a control signal for controlling the operation of a multi-modal sensor controlled by the user intention inference apparatus.
To predict the user intention, the second predictor may interpret the multi-modal information input from the multi-modal sensor in association with the predicted part of the user intention.
If the predicted part of the user intention is selection of an object displayed on the display screen and a voice is input from the multi-modal sensor, the second predictor may predict the user intention by interpreting the input voice in association with the object selection.
The second predictor may predict the user intention using the multi-modal information input from the at least one multi-modal sensor within the scope of the predicted part of the user intention.
When the predicted part of the user intention is an operation of bringing a microphone to the mouth, the second predictor may sense an acoustic signal, extract and analyze features from the sensed signal, and predict the user intention.
The second predictor may determine whether a voice section is detected in the acoustic signal and, when a voice section is detected, predict the user intention as a voice command.
When a breath sound is detected in the acoustic signal, the second predictor may predict the user intention as blowing.
When the predicted part of the user intention is selection of an object displayed on the display screen, the second predictor may use the multi-modal information to predict the user intention as at least one of deleting, classifying, and sorting the selected object.
The apparatus may further include a user intention applying unit configured to use the user intention prediction result to control software or hardware controlled by the user intention inference apparatus.
According to another aspect, a method of inferring user intention includes receiving at least one piece of motion information, predicting a part of the user intention using the received motion information, receiving multi-modal information input from at least one multi-modal sensor, and predicting the user intention using the predicted part of the user intention and the multi-modal information.
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In describing the various embodiments, detailed descriptions of well-known functions or constructions are omitted where they would unnecessarily obscure the subject matter of the invention.
FIG. 1 is a diagram illustrating a configuration of a user intention inference apparatus according to an exemplary embodiment.
The user intention inference apparatus 100 includes a motion sensor 110, a controller 120, and a multi-modal sensing unit 130. The user intention inference apparatus 100 may be implemented as any type of device or system, such as a cellular phone, a personal digital assistant (PDA), a digital camera, a portable game console, an MP3 player, a portable/personal multimedia player (PMP), a handheld e-book reader, a portable laptop PC, a global positioning system (GPS) navigation device, a desktop PC, a high definition television (HDTV), an optical disc player, or a set-top box. Depending on the implementation, the user intention inference apparatus 100 may further include various components, such as components for a multi-modal interface including a user interface unit, a display unit, and a sound output unit.
To sense motion information, the motion sensor 110 may include an inertial sensor, a geomagnetic sensor for sensing direction, an acceleration sensor or a gyro sensor for sensing movement, and the like. Besides the sensors listed above, the motion sensor 110 may also include an image sensor, an acoustic sensor, and the like. According to an embodiment, a plurality of motion sensors may be attached to parts of the user's body and to the user intention inference apparatus 100 to sense motion information.
The multi-modal sensing unit 130 may include at least one multi-modal sensor 132, 134, 136, 138. The acoustic sensor 132 senses acoustic signals, the image sensor 134 senses image information, the biometric information sensor 136 senses biometric information such as body temperature, and the touch sensor 138 senses touch gestures on a touch pad; various other kinds or types of multi-modal sensors may also be included.
Although FIG. 1 shows four sensors in the multi-modal sensing unit 130, the number of sensors is not limited thereto. The kinds and range of sensors included in the multi-modal sensing unit 130 may be broader than those included in the motion sensor 110, whose purpose is motion sensing. Also, although the motion sensor 110 and the multi-modal sensing unit 130 are shown as separate in FIG. 1, they may be integrated. Alternatively, the motion sensor 110 and the multi-modal sensing unit 130 may redundantly include the same kinds of sensors, for example an image sensor and an acoustic sensor.
The multi-modal sensing unit 130 may include modules that extract feature values from the multi-modal information sensed by each multi-modal sensor 132, 134, 136, 138 according to its type and analyze its meaning. The components that analyze the multi-modal information may instead be included in the controller 120.
The controller 120 may include applications, data, and an operating system for controlling the operation of each component of the user intention inference apparatus 100. According to an embodiment, the controller 120 includes a user intention predictor 122 and a user intention applying unit 124.
사용자 의도 예측부(122)는 모션 센서(110)로부터 감지된 적어도 하나의 모션 정보를 수신하고, 수신된 모션 정보를 이용하여 1차적으로 사용자 의도의 일부분을 예측한다. 또한, 사용자 의도 예측부(122)는 예측된 사용자 의도의 일부분 및 적어도 하나의 멀티 모달 센서로부터 입력된 멀티 모달 정보를 이용하여 2차적으로 사용자 의도를 예측할 수 있다. 즉, 사용자 의도 예측부(122)는 2차적으로 사용자 의도를 예측할 때 모션 센서(110)로부터 감지된 모션 정보 및 멀티 모달 감지부(130)로부터 입력된 멀티 모달 정보를 이용하여 최종적으로 사용자 의도를 예측할 수 있다. 사용자 의도 예측부(122)는 사용자의 의도를 추론하기 위한 알려진 여러 가지 추론 모델을 이용할 수 있다. The user intention predictor 122 receives at least one motion information detected from the motion sensor 110, and primarily predicts a part of the user intention using the received motion information. In addition, the user intention predictor 122 may secondarily predict the user intention using a part of the predicted user intention and the multi-modal information input from the at least one multi-modal sensor. That is, when the user intention predictor 122 secondarily predicts the user intention, the user intention predictor 122 finally uses the motion information detected from the motion sensor 110 and the multi-modal information input from the multi-modal sensing unit 130 to finally determine the user intention. It can be predicted. The user intention predictor 122 may use various known inference models for inferring the user's intention.
또한, 사용자 의도 예측부(122)는 1차적으로 예측된 사용자 의도의 일부분을 이용하여 2차적으로 사용자 의도를 예측하는 과정에서 수행되는 동작을 실행시키기 위한 제어 신호를 생성할 수 있다. 사용자 의도 추론 과정에서 수행되는 동작을 실행시키기 위한 제어 신호는 사용자 의도 추론 장치(100)에 의해 제어되는 멀티 모달 감지부(130)의 동작을 제어하는 제어 신호일 수 있다. In addition, the user intention predictor 122 may generate a control signal for executing an operation performed in the process of predicting the user intention secondary by using a part of the user intentionally predicted. The control signal for executing the operation performed in the user intention inference process may be a control signal for controlling the operation of the multi-modal sensing unit 130 controlled by the user intention inference apparatus 100.
예를 들어, 모션 정보를 이용하여 1차적으로 예측된 사용자 의도의 일부분에 기반하여 멀티 모달 감지부(130)의 센서 중 1차적으로 예측된 사용자 의도의 일부분과 연관된 일부 센서 동작을 활성화시킬 수 있으며 이 경우 멀티 모달 감지부(130)의 모든 센서를 활성화하는 경우에 비하여 센서 동작에 사용하는 전력 소모를 감소시킬 수 있다. 또한, 일부 센서로부터 입력되는 감지 정보를 분석하게 되므로, 멀티 모달 입력 정보의 해석을 단순화하여 사용자 의도 예측 과정의 복잡도를 감소시키면서도 정확한 사용자 의도를 추론할 수 있다. For example, the motion information may be used to activate some sensor operations associated with a part of the first predicted user intention among the sensors of the multi-modal sensing unit 130 based on the part of the first predicted user intention. In this case, power consumption used for the sensor operation may be reduced as compared with the case of activating all the sensors of the multi-modal sensing unit 130. In addition, since the detection information input from some sensors is analyzed, accurate user intention can be inferred while simplifying the interpretation of the multi-modal input information while reducing the complexity of the user intention prediction process.
사용자 의도 예측부(122)는 2차적으로 사용자 의도를 예측하기 위하여 멀티 모달 정보의 종류에 따라 특징을 추출하고 분석하는 모듈(도시되지 않음)을 포함하여 구성될 수 있다. 또한, 사용자 의도 예측부(122)는 멀티 모달 감지부(130)로부터 입력되는 멀티 모달 정보를 1차적으로 예측된 사용자 의도의 일부분과 연관되도록 해석할 수 있다. The user intention predictor 122 may include a module (not shown) that extracts and analyzes features according to types of multi-modal information in order to predict user intention secondarily. In addition, the user intention predictor 122 may interpret the multi-modal information input from the multi-modal sensing unit 130 to be associated with a part of the user's intention predicted primarily.
예를 들어, 사용자 의도 예측부(122)에서 1차적으로 예측된 사용자 의도의 일부분이 디스플레이 화면에 표시된 오브젝트의 선택으로 결정되는 경우, 멀티 모달 감지부(130)로부터 음성이 입력되면, 입력된 음성을 오브젝트 선택과 연관하여 해석함으로써 2차적으로 사용자 의도를 예측할 수 있다. 구체적으로, 1차로 예측된 사용자 의도의 일부분이 디스플레이 화면에 표시된 오브젝트의 선택으로 결정되고, 멀티 모달 감지부(130)에서 입력된 음향 신호가 "날짜별로 정리"라고 분석된 경우, 사용자 의도 예측부(122)는 사용자 의도를 "디스플레이 화면에서 선택된 오브젝트를 날짜 순서대로 정렬"하라는 의미로 해석할 수 있다. For example, when a part of the user intention primarily predicted by the user intention predictor 122 is determined by selection of an object displayed on the display screen, when the voice is input from the multi-modal sensing unit 130, the input voice is input. Can be secondarily predicted by interpreting in conjunction with object selection. In detail, when a part of the first intention predicted by the user is determined by the selection of an object displayed on the display screen, and the sound signal input by the multi-modal detection unit 130 is analyzed as “organized by date”, the user intention predictor The user's intention may be interpreted to mean "arrange the object selected on the display screen in the order of date".
또한, 사용자 의도 예측부(122)는 1차적으로 예측된 사용자 의도의 일부분이 디스플레이 화면에 표시된 오브젝트의 선택인 경우, 멀티 모달 정보를 이용하여 2차적 사용자 의도를 삭제, 분류 및 정렬 중 적어도 하나로 예측할 수 있다. In addition, when a part of the first predicted user intention is a selection of an object displayed on the display screen, the user intention predictor 122 may predict the secondary user intention as at least one of deleting, classifying, and sorting using the multi-modal information. Can be.
사용자 의도 적용부(124)는 사용자 의도 예측 결과를 이용하여 사용자 의도 추론 장치에서 제어되는 소프트웨어 또는 하드웨어를 제어할 수 있다. 사용자 의도 적용부(124)는 예측된 사용자 의도에 인터랙션하기 위한 멀티 모달 인터페이스를 제공할 수 있다. 예를 들어, 사용자의 의도가 음성 명령으로 예측된 경우, 음성 명령내 의미를 파악하기 위해 음성 인식을 수행하고, 인식 결과에 따라 특정 사람에 대하여 자동으로 전화를 연결하는 애플리케이션이나 검색 애플리케이션을 실행할 수 있으며, 사용자가 선택한 오브젝트를 전송하려는 의도인 경우에는 이메일 애플리케이션을 실행할 수 있다. 다른 예로, 사용자 의도가 허밍(humming)으로 예측되는 경우, 허밍 음원과 유사한 음악을 검색하는 애플리케이션이 구동될 수 있다. 또 다른 예로, 사용자 의도가 불기(blow)로 예측되는 경우, 게임 애플리케이션에서 아바타가 특정 동작을 실행하는 명령으로 이용될 수 있다. The user intention application unit 124 may control software or hardware controlled by the user intention inference apparatus using the user intention prediction result. The user intention applying unit 124 may provide a multi-modal interface for interacting with the predicted user intention. For example, if a user's intention is predicted as a voice command, you can run an application or search application that performs voice recognition to understand the meaning in the voice command and automatically connects the phone to a specific person based on the recognition result. If the intention is to transfer the object selected by the user, the email application can be executed. As another example, when the user intention is predicted to be humming, an application for searching for music similar to the humming sound source may be driven. As another example, when the user intention is predicted to be blow, the avatar may be used as a command for executing a specific action in the game application.
According to an embodiment, a part of the user intention is predicted through user motion recognition, and the multimodal information is then analyzed according to that predicted part to predict the user intention secondarily. Because the multimodal information can be interpreted in relation to the primarily predicted part of the user intention while the interpretation process itself remains independent, associations between modalities are easy to identify and the user intention can be inferred accurately.
FIG. 2 is a diagram illustrating an example of a configuration of the user intention predicting unit of FIG. 1.
The user intention predicting unit 122 may include a motion information analyzing unit 210, a primary predicting unit 220, and a secondary predicting unit 230.
The motion information analyzing unit 210 analyzes one or more pieces of motion information received from the motion sensors 110. The motion information analyzing unit 210 may measure position information and angle information for each part of the user's body to which a motion sensor 110 is attached, and may also calculate, from the measured positions and angles, position information and angle information for body parts to which no motion sensor 110 is attached.
For example, when motion sensors 110 are attached to both wrists and to the head, the distances between the sensors can be measured, and each sensor can obtain three-dimensional rotation angle information with respect to a reference coordinate system. Accordingly, the distance between the wrist and the head, together with the rotation angle of the wrist, can be computed from the motion information, yielding the distance between the wrist and the mouth area of the face and the wrist rotation angle. Assuming the user is holding in the hand a microphone corresponding to the acoustic sensor 132 of the user intention inference apparatus 100, the distance between the microphone and the mouth and the direction of the microphone can then be calculated.
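A minimal sketch of this geometry follows, assuming each sensor reports a 3-D position and a unit-quaternion orientation in a common reference frame and that the microphone points along an assumed local axis of the wrist sensor; the disclosure itself gives no formulas:

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    r = np.array([x, y, z])
    return v + 2.0 * np.cross(r, np.cross(r, v) + w * v)

def mic_geometry(mouth_pos, wrist_pos, wrist_quat, mic_axis=(0.0, 0.0, 1.0)):
    """Distance from the mouth to the hand-held microphone, and the angle
    (radians) between the microphone axis and the direction to the mouth."""
    to_mouth = np.asarray(mouth_pos, float) - np.asarray(wrist_pos, float)
    distance = float(np.linalg.norm(to_mouth))
    mic_dir = quat_rotate(wrist_quat, np.asarray(mic_axis, float))
    cos_a = np.dot(mic_dir, to_mouth) / (np.linalg.norm(mic_dir) * distance)
    return distance, float(np.arccos(np.clip(cos_a, -1.0, 1.0)))

# e.g. mouth at the head sensor, mic held 15 cm away and pointing at the mouth:
d, a = mic_geometry([0, 0, 0], [0, 0, -0.15], (1.0, 0.0, 0.0, 0.0))
print(d, a)  # -> 0.15, 0.0
```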
As another example, when motion sensors 110 are mounted on the user's head and on the microphone that serves as the acoustic sensor, the distance between the microphone and the head can be measured from the motion information, and the inertial sensor attached to the microphone provides three-dimensional angle information about the axis on which the sensor is mounted, so that the motion information analyzing unit 210 can calculate the distance between the microphone and the mouth area of the face and the rotation angle of the microphone.
As yet another example, an image sensor may be included in the motion sensors 110 and may feed image information to the motion information analyzing unit 210. In this case, the motion information analyzing unit 210 may recognize objects such as a face or hands in the images and then calculate the positional relationships between the objects. For example, the motion information analyzing unit 210 may calculate the distance and angle between the face and the two hands, the distance and angle between the two hands, and the like.
The primary predicting unit 220 predicts the part of the user intention triggered by the motion information analysis. For example, the primary predicting unit 220 may primarily predict, through analysis of motion information including images, whether the motion is one of selecting an object on the screen.
The secondary predicting unit 230 predicts the user intention by using the part of the user intention predicted by the primary predicting unit 220 and the multimodal information input from the multimodal sensing unit 130.
To predict the user intention, the secondary predicting unit 230 may interpret the multimodal information input from the multimodal sensors so that it is associated with the primarily predicted part of the user intention. As one example, when the primarily predicted part of the user intention is the selection of an object displayed on the display screen and a voice is input from the multimodal sensing unit 130, the secondary predicting unit 230 may predict the user intention secondarily by interpreting the input voice in association with the object selection.
As another example, if the primary predicting unit 220 predicts the part of the user intention to be bringing the microphone to the mouth, and the multimodal sensing unit 130 then detects mouth movement through an image sensor 134 such as a camera while a voice is input through the acoustic sensor 132 such as a microphone, the secondary predicting unit 230 may predict the user intention to be voice command input. To predict the voice command input intention, the secondary predicting unit 230 may detect a speech segment from the acoustic signal and perform semantic analysis, through feature extraction and analysis of the detected speech segment, so as to put the result into a form usable by the user intention applying unit 124.
As yet another example, if the primary predicting unit 220 primarily predicts the part of the user intention to be bringing the microphone to the mouth, and the multimodal sensing unit 130 consistently detects, through the image sensor 134 such as a camera, images of the lips protruding forward while a breath sound is input through the microphone, the secondary predicting unit 230 may predict the user intention to be blowing.
In the two examples above, the user intentions differ: "bring the microphone to the mouth and input a voice command" versus "bring the microphone to the mouth and blow." However, part of the two intentions, "bring the microphone to the mouth," is common, and the primary predicting unit 220 can narrow the range of possible user intentions by predicting this common part first. Within the range narrowed by the primary predicting unit 220, the secondary predicting unit 230 can predict the user intention in consideration of the multimodal information. Considering only these two examples, once the motion of bringing the microphone to the mouth is detected, the primary predicting unit 220 limits the range of user intentions to "voice command input" and "blowing," and the secondary predicting unit 230 can determine which of the two the user intention is from the sensed multimodal information.
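The narrowing described here can be sketched as a two-stage lookup, under assumed intention names and confidence scores:

```python
# The primary prediction selects the candidate set; the secondary prediction
# chooses within it from multimodal evidence scores (values are illustrative).
CANDIDATES = {
    "mic_to_mouth": ["voice_command", "blow", "humming"],
    "object_selection": ["delete", "classify", "sort"],
}

def predict_user_intention(primary_part, evidence):
    """evidence: candidate intention -> confidence from multimodal analysis."""
    candidates = CANDIDATES.get(primary_part, [])
    scored = {c: evidence.get(c, 0.0) for c in candidates}
    return max(scored, key=scored.get) if scored else None

# Audio analysis found a speech segment, so "voice_command" scores highest:
print(predict_user_intention("mic_to_mouth", {"voice_command": 0.9, "blow": 0.1}))
```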
FIG. 3 is a diagram illustrating an exemplary operation of the user intention predicting unit of FIG. 2.
The primary predicting unit 220 may predict the part of the user intention using the motion information analyzed by the motion information analyzing unit 210. The secondary predicting unit 230 may receive multimodal signals, such as an image sensed by the image sensor 134 of the multimodal sensing unit 130 or an acoustic signal sensed by the acoustic sensor 132, generate information on whether a voice is being detected, and predict the user intention.
In one example, the motion information analyzing unit 210 calculates the distance between the user's mouth and the hand holding the microphone, using the motion information sensed by the motion sensors mounted on the user's head and wrist (310). The motion information analyzing unit 210 calculates the direction of the microphone from the rotation angle of the wrist (320).
The primary predicting unit 220 uses the distance and direction information calculated by the motion information analyzing unit 210 to predict whether the motion is one of bringing the microphone to the mouth, thereby predicting the part of the user intention (330). For example, if the primary predicting unit 220 determines that the hand holding the microphone is positioned within a radius of 20 cm around the user's mouth and that the microphone is pointing toward the mouth, it may predict that the user is about to bring the microphone to the mouth.
In this case, the secondary predicting unit 230 can analyze the multimodal input signals from the acoustic sensor 132 such as a microphone and the image sensor 134 such as a camera, and predict the user intention, for instance as a voice command intention or as an intention such as humming or blowing.
The secondary predicting unit 230 may determine the user intention to be a voice command intention when the partial prediction, that is, the primary prediction, is that the microphone is being brought to the mouth, lip movement is detected by the camera, and a voice is detected in the acoustic signal sensed by the microphone (340). In contrast, when the primary prediction is that the microphone is being brought to the mouth, an image of the lips protruding forward is detected by the camera, and a breath sound is detected in the acoustic signal input from the microphone, the secondary predicting unit 230 may determine the user intention to be blowing (350).
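The branch taken at steps 340 and 350 reduces to a small decision rule; the boolean cues below are assumptions standing in for the actual camera and microphone analyses:

```python
def secondary_decision(lip_movement, speech_detected, lips_protruded, breath_sound):
    """Assumes the primary prediction was already 'microphone brought to mouth'."""
    if lip_movement and speech_detected:
        return "voice_command"   # corresponds to step 340
    if lips_protruded and breath_sound:
        return "blow"            # corresponds to step 350
    return "undecided"           # fall through: keep sensing

print(secondary_decision(True, True, False, False))   # -> voice_command
print(secondary_decision(False, False, True, True))   # -> blow
```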
FIG. 4 is a diagram illustrating an example of an operation of predicting the user intention by receiving additional multimodal input after the part of the user intention has been predicted.
When the predicted part of the user intention received from the primary predicting unit 220 is bringing the microphone to the mouth (410), the secondary predicting unit 230 activates sensors such as the microphone and camera included in the multimodal sensing unit 130 and receives multimodal signals (420).
The secondary predicting unit 230 extracts features from the acoustic signal received from the microphone and the image signal received from the camera, and classifies and analyzes the features (430).
As acoustic features, time-domain features such as time energy, frequency energy, zero-crossing rate, LPC (linear predictive coding) coefficients, cepstral coefficients, and pitch, or statistical features such as the frequency spectrum, may be extracted from the acoustic signal received from the microphone. The extractable features are not limited to these, and other feature algorithms may be used. The extracted features may be classified into a speech activity class or a non-speech activity class using classification and learning algorithms such as a decision tree, a support vector machine, a Bayesian network, or a neural network, although the classification is not limited to these.
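A hedged sketch of such feature extraction and classification follows, using only a subset of the listed features (time energy, zero-crossing rate, and one crude spectral statistic) and a decision tree; the use of scikit-learn and the placeholder training data are assumptions, not part of the disclosure:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumes scikit-learn is available

def frame_features(frame):
    """A small subset of the features named in the text, per audio frame."""
    energy = float(np.sum(frame ** 2))                            # time energy
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))     # fraction of sign changes
    spectrum = np.abs(np.fft.rfft(frame))
    centroid = float(np.sum(np.arange(len(spectrum)) * spectrum)
                     / (np.sum(spectrum) + 1e-9))                 # crude spectral statistic
    return [energy, zcr, centroid]

# Placeholder training data: real use would label frames as speech (1)
# or non-speech (0) from a recorded corpus.
rng = np.random.default_rng(0)
X_train = [frame_features(rng.standard_normal(400)) for _ in range(100)]
y_train = rng.integers(0, 2, 100)
clf = DecisionTreeClassifier().fit(X_train, y_train)

def is_speech(frame):
    """Classify one frame into the speech / non-speech activity class."""
    return bool(clf.predict([frame_features(frame)])[0])
```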
When a speech segment is detected as a result of the feature analysis (440), the secondary predicting unit 230 may predict the user intention to be voice command input. When no speech segment is detected (440) but a breath sound is detected (450), the secondary predicting unit 230 may predict a blowing intention. Likewise, as other kinds of features are detected, the user intention may be determined in various other ways, such as humming. In each case, the secondary predicting unit 230 predicts the user intention within the range limited by the primary prediction.
Therefore, according to an embodiment, the user intention can be predicted using the user's multimodal information and the performance of the voice detection operation can be controlled according to the prediction result, so that when using a voice interface the user can input speech intuitively without having to learn a separate voice input method, for example operating a dedicated button or touching the screen for voice input.
In addition to the acoustic information from the microphone, the secondary predicting unit 230 may detect the speech segment by using, together with the feature information extracted from the acoustic signal, at least one of the image information input from the image sensor 134 such as a camera and the sensing information input from the biometric information sensor 136 such as a throat microphone, which changes when a person utters speech, and may then process the speech of the detected segment. Here, the sensing information may include at least one of image information showing changes in the shape of the user's mouth, temperature information that changes due to the breath emitted during utterance, vibration information of body parts such as the throat or jawbone that vibrate during utterance, and infrared information sensed from the face or mouth during utterance.
When a speech segment is detected (440), the user intention applying unit 124 may process the voice signal belonging to the detected segment to perform speech recognition, and may switch the application module using the speech recognition result. For example, an application may be executed according to the recognition result; if a name is recognized, intelligent switching between the start and end of voice input becomes possible, such as searching for the phone number of the recognized name or placing a call to the retrieved number. Also, when the user intention inference apparatus 100 is implemented as a mobile communication device, the intention to start or end a voice call can be grasped from the multimodal information, and the operation mode can be switched to a voice call mode automatically even if the user does not perform a separate action such as pressing a call button.
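As a sketch of such application switching, with a stand-in recognizer and a hypothetical contact store in place of real telephony APIs:

```python
CONTACTS = {"alice": "010-1234-5678"}   # hypothetical contact store

def recognize_speech(samples):
    """Stand-in for a real speech recognizer; returns a recognized name."""
    return "alice"

def on_speech_segment(samples):
    name = recognize_speech(samples)
    number = CONTACTS.get(name)
    if number is not None:
        print(f"dialing {number}")            # hand off to the call application
    else:
        print(f"searching for '{name}'")      # fall back to a search application

on_speech_segment(samples=[])  # -> dialing 010-1234-5678
```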
FIG. 5 is a diagram illustrating another example of an operation of predicting the user intention by receiving additional multimodal input after the part of the user intention has been predicted.
When the primarily predicted part of the user intention received from the primary predicting unit 220 is the selection of a specific object (460), the secondary predicting unit 230 activates sensors such as a camera and an ultrasonic sensor and receives multimodal signals (470).
The secondary predicting unit 230 analyzes the received multimodal signals (480) and predicts the user intention. Here, the predicted user intention may be one of the intentions within the range limited by the primary prediction.
As a result of the multimodal signal analysis, the secondary predicting unit 230 may determine that the motion is a hand-waving motion (490). Depending on the application running in the user intention applying unit 124, the secondary predicting unit 230 may interpret the hand-waving motion as an intention to delete a specific item or file shown on the screen, and may control the user intention applying unit 124 so that the specific item or file is deleted.
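A sketch of this application-dependent gesture interpretation, with hypothetical application and command names:

```python
# The same gesture is resolved differently depending on the foreground
# application; application and command names are illustrative only.
GESTURE_MAP = {
    ("file_browser", "hand_wave"): "delete_selected_file",
    ("photo_viewer", "hand_wave"): "delete_current_photo",
}

def apply_gesture(active_app, gesture):
    return GESTURE_MAP.get((active_app, gesture), "ignore")

print(apply_gesture("file_browser", "hand_wave"))  # -> delete_selected_file
```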
FIG. 6 is a diagram illustrating an example of feature-based signal classification in which the secondary predicting unit 230 performs an integrated analysis using an acoustic signal and an image signal together.
The secondary predicting unit 230 may include an acoustic feature extracting unit 510, an acoustic feature analyzing unit 520, an image feature extracting unit 530, an image feature analyzing unit 540, and an integrated analyzing unit 550.
The acoustic feature extracting unit 510 extracts acoustic features from the acoustic signal. The acoustic feature analyzing unit 520 extracts a speech segment by applying classification and learning algorithms to the acoustic features. The image feature extracting unit 530 extracts image features from a series of image signals. The image feature analyzing unit 540 extracts a speech segment by applying classification and learning algorithms to the extracted image features.
The integrated analyzing unit 550 fuses the results classified from the acoustic signal and the image signal, respectively, and finally detects the speech segment. Here, the acoustic features and image features may be applied individually or fused together; and when features are extracted and analyzed from other signals, for example signals representing vibration or temperature, the integrated analyzing unit 550 may fuse them with the detection information extracted from the acoustic and image signals to detect the speech segment.
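One simple way to realize such fusion is a late, score-level combination of the per-modality classifications; the weighted average below is an assumption, as the disclosure does not fix a particular fusion rule:

```python
def fuse_speech_decision(p_audio, p_video, p_extra=None,
                         weights=(0.6, 0.3, 0.1), threshold=0.5):
    """Weighted average of per-modality speech probabilities; an extra
    modality (e.g. vibration or temperature) defaults to 'uninformative'."""
    probs = (p_audio, p_video, 0.5 if p_extra is None else p_extra)
    score = sum(w * p for w, p in zip(weights, probs))
    return score >= threshold   # True -> speech segment detected

print(fuse_speech_decision(0.8, 0.7))  # both modalities point to speech -> True
```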
According to an embodiment, when using a voice interface the user can input speech intuitively without separately learning a voice input method; for example, the user does not need to perform a separate action such as pressing a button or touching the screen for voice input. In addition, the user's speech segment can be detected accurately in a variety of noisy environments, regardless of the type or level of noise, such as household noise, vehicle noise, or noise from other speakers. Furthermore, because voice detection can also rely on biometric information other than images, the user's speech segment can be detected accurately even when the lighting is too bright or too dark, or when the user's mouth is covered.
FIG. 7 is a diagram illustrating a method of inferring user intention using multimodal information, according to an embodiment.
The user intention inference apparatus 100 receives motion information sensed by at least one motion sensor (610). The user intention inference apparatus 100 primarily predicts a part of the user intention using the received motion information (620).
When multimodal information input from at least one multimodal sensor is received (630), the user intention inference apparatus 100 secondarily predicts the user intention using the primarily predicted part of the user intention and the multimodal information (640). In the secondary prediction step, an operation of interpreting the multimodal information input from the multimodal sensors so that it is associated with the primarily predicted part of the user intention may be performed.
A control signal for executing an operation performed in the secondary user intention prediction process may be generated using the primarily predicted part of the user intention. This control signal may be a control signal that controls the operation of a multimodal sensor controlled by the user intention inference apparatus 100. The user intention may be determined, within the range of the primarily predicted part of the user intention, using the multimodal information input from the at least one multimodal sensor.
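The overall flow of steps 610 through 640, including the control signal that activates only the sensors needed by the secondary stage, can be sketched end to end with stub sensors and placeholder decision rules (all names and thresholds here are assumptions):

```python
class Sensor:
    """Hypothetical sensor stub with activate/read methods."""
    def __init__(self, value):
        self.value = value
        self.active = False
    def activate(self):
        self.active = True
    def read(self):
        return self.value if self.active else None

SENSORS_FOR = {
    "mic_to_mouth": ["mic", "camera"],
    "object_selection": ["camera", "ultrasonic"],
}

def predict_primary(motion_info):
    # steps 610-620: a placeholder rule on the mouth-to-hand distance (metres)
    return "mic_to_mouth" if motion_info.get("mouth_distance", 1.0) < 0.2 else "object_selection"

def predict_secondary(primary, readings):
    # step 640: decide within the range narrowed by the primary prediction
    if primary == "mic_to_mouth":
        return "voice_command" if readings.get("mic") == "speech" else "blow"
    return "object_action"

def run_inference(motion_info, sensors):
    primary = predict_primary(motion_info)
    for name in SENSORS_FOR[primary]:   # control signal: activate only the
        sensors[name].activate()        # sensors the secondary stage needs
    readings = {n: sensors[n].read() for n in SENSORS_FOR[primary]}  # step 630
    return primary, predict_secondary(primary, readings)

sensors = {"mic": Sensor("speech"), "camera": Sensor("lips"), "ultrasonic": Sensor(0.5)}
print(run_inference({"mouth_distance": 0.15}, sensors))
# -> ('mic_to_mouth', 'voice_command')
```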
One aspect of the present invention may be embodied as computer-readable code on a computer-readable recording medium. Codes and code segments implementing the program can be easily inferred by computer programmers skilled in the art. The computer-readable recording medium includes all kinds of recording devices that store data readable by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical disks. The computer-readable recording medium can also be distributed over network-coupled computer systems so that the computer-readable code is stored and executed in a distributed fashion.
The above description is only one embodiment of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to implement it in modified forms without departing from the essential characteristics of the present invention. Therefore, the scope of the present invention should not be limited to the embodiments described above, but should be construed to include various embodiments within a scope equivalent to the subject matter recited in the claims.
The present invention is industrially applicable in the fields of computers, electronic products, computer software, and information technology.

Claims (17)

  1. A user intention inference apparatus comprising: a primary predicting unit that predicts a part of a user intention using at least one piece of motion information; and a secondary predicting unit that predicts the user intention using the predicted part of the user intention and multimodal information input from at least one multimodal sensor.
  2. The apparatus of claim 1, wherein the primary predicting unit generates, using the predicted part of the user intention, a control signal for executing an operation performed in the process of predicting the user intention.
  3. The apparatus of claim 2, wherein the control signal for executing the operation performed in the process of predicting the user intention is a control signal that controls an operation of a multimodal sensor controlled by the user intention inference apparatus.
  4. The apparatus of claim 1, wherein, in order to predict the user intention, the secondary predicting unit interprets the multimodal information input from the multimodal sensor so that it is associated with the predicted part of the user intention.
  5. The apparatus of claim 4, wherein, when the predicted part of the user intention is the selection of an object displayed on a display screen and a voice is input from the multimodal sensor, the secondary predicting unit predicts the user intention by interpreting the input voice in association with the object selection.
  6. The apparatus of claim 1, wherein the secondary predicting unit predicts the user intention, within the range of the predicted part of the user intention, using the multimodal information input from the at least one multimodal sensor.
  7. The apparatus of claim 6, wherein, when the predicted part of the user intention is a motion of bringing a microphone to the mouth, the secondary predicting unit senses an acoustic signal and extracts and analyzes features of the sensed acoustic signal to predict the user intention.
  8. The apparatus of claim 7, wherein the secondary predicting unit determines whether a speech segment is detected in the acoustic signal and, when a speech segment is detected, predicts the user intention to be a voice command intention.
  9. The apparatus of claim 8, wherein the secondary predicting unit predicts the user intention to be blowing when a breath sound is detected in the acoustic signal.
  10. The apparatus of claim 1, wherein, when the predicted part of the user intention is the selection of an object displayed on a display screen, the secondary predicting unit uses the multimodal information to predict the user intention to be at least one of deletion, classification, and sorting of the selected object.
  11. The apparatus of claim 1, further comprising a user intention applying unit that controls software or hardware controlled by the user intention inference apparatus using the result of the user intention prediction.
  12. A user intention inference method comprising: receiving at least one piece of motion information; predicting a part of a user intention using the received motion information; receiving multimodal information input from at least one multimodal sensor; and predicting the user intention using the predicted part of the user intention and the multimodal information.
  13. The method of claim 12, further comprising generating, using the predicted part of the user intention, a control signal for executing an operation performed in the process of predicting the user intention.
  14. The method of claim 13, wherein the control signal for executing the operation performed in the process of predicting the user intention is a control signal that controls an operation of a multimodal sensor controlled by the user intention inference apparatus.
  15. The method of claim 12, wherein predicting the user intention comprises interpreting the multimodal information input from the multimodal sensor so that it is associated with the predicted part of the user intention.
  16. The method of claim 12, wherein, in the step of predicting the user intention, the user intention is predicted, within the range of the predicted part of the user intention, using the multimodal information input from the at least one multimodal sensor.
  17. The method of claim 12, further comprising controlling software or hardware controlled by the user intention inference apparatus using the result of the user intention prediction.
PCT/KR2010/002723 2009-04-30 2010-04-29 Apparatus and method for user intention inference using multimodal information WO2010126321A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2012508401A JP5911796B2 (en) 2009-04-30 2010-04-29 User intention inference apparatus and method using multimodal information
EP10769966.2A EP2426598B1 (en) 2009-04-30 2010-04-29 Apparatus and method for user intention inference using multimodal information
CN201080017476.6A CN102405463B (en) 2009-04-30 2010-04-29 Utilize the user view reasoning device and method of multi-modal information

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
KR10-2009-0038267 2009-04-30
KR1020090038267A KR101581883B1 (en) 2009-04-30 2009-04-30 Appratus for detecting voice using motion information and method thereof
KR10-2009-0067034 2009-07-22
KR20090067034 2009-07-22
KR1020100036031A KR101652705B1 (en) 2009-07-22 2010-04-19 Apparatus for predicting intention of user using multi modal information and method thereof
KR10-2010-0036031 2010-04-19

Publications (2)

Publication Number Publication Date
WO2010126321A2 true WO2010126321A2 (en) 2010-11-04
WO2010126321A3 WO2010126321A3 (en) 2011-03-24

Family

ID=45541557

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2010/002723 WO2010126321A2 (en) 2009-04-30 2010-04-29 Apparatus and method for user intention inference using multimodal information

Country Status (5)

Country Link
US (1) US8606735B2 (en)
EP (1) EP2426598B1 (en)
JP (1) JP5911796B2 (en)
CN (1) CN102405463B (en)
WO (1) WO2010126321A2 (en)


Families Citing this family (322)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001013255A2 (en) 1999-08-13 2001-02-22 Pixo, Inc. Displaying and traversing links in character array
US8645137B2 (en) 2000-03-16 2014-02-04 Apple Inc. Fast, language-independent method for user authentication by voice
ITFI20010199A1 (en) 2001-10-22 2003-04-22 Riccardo Vieri SYSTEM AND METHOD TO TRANSFORM TEXTUAL COMMUNICATIONS INTO VOICE AND SEND THEM WITH AN INTERNET CONNECTION TO ANY TELEPHONE SYSTEM
US7669134B1 (en) 2003-05-02 2010-02-23 Apple Inc. Method and apparatus for displaying information during an instant messaging session
US8677377B2 (en) 2005-09-08 2014-03-18 Apple Inc. Method and apparatus for building an intelligent automated assistant
US7633076B2 (en) 2005-09-30 2009-12-15 Apple Inc. Automated response to and sensing of user activity in portable devices
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
ITFI20070177A1 (en) 2007-07-26 2009-01-27 Riccardo Vieri SYSTEM FOR THE CREATION AND SETTING OF AN ADVERTISING CAMPAIGN DERIVING FROM THE INSERTION OF ADVERTISING MESSAGES WITHIN AN EXCHANGE OF MESSAGES AND METHOD FOR ITS FUNCTIONING.
US9053089B2 (en) 2007-10-02 2015-06-09 Apple Inc. Part-of-speech tagging using latent analogy
US8364694B2 (en) 2007-10-26 2013-01-29 Apple Inc. Search assistant for digital media assets
US8620662B2 (en) 2007-11-20 2013-12-31 Apple Inc. Context-aware unit selection
US10002189B2 (en) 2007-12-20 2018-06-19 Apple Inc. Method and apparatus for searching using an active ontology
US9330720B2 (en) 2008-01-03 2016-05-03 Apple Inc. Methods and apparatus for altering audio output signals
US8327272B2 (en) 2008-01-06 2012-12-04 Apple Inc. Portable multifunction device, method, and graphical user interface for viewing and managing electronic calendars
US8065143B2 (en) 2008-02-22 2011-11-22 Apple Inc. Providing text input using speech data and non-speech data
US8289283B2 (en) 2008-03-04 2012-10-16 Apple Inc. Language input interface on a device
US8996376B2 (en) 2008-04-05 2015-03-31 Apple Inc. Intelligent text-to-speech conversion
US10496753B2 (en) 2010-01-18 2019-12-03 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US8464150B2 (en) 2008-06-07 2013-06-11 Apple Inc. Automatic language identification for dynamic text processing
US20100030549A1 (en) 2008-07-31 2010-02-04 Lee Michael M Mobile device having human language translation capability with positional feedback
US8768702B2 (en) 2008-09-05 2014-07-01 Apple Inc. Multi-tiered voice feedback in an electronic device
US8898568B2 (en) 2008-09-09 2014-11-25 Apple Inc. Audio user interface
US8712776B2 (en) 2008-09-29 2014-04-29 Apple Inc. Systems and methods for selective text to speech synthesis
US8583418B2 (en) 2008-09-29 2013-11-12 Apple Inc. Systems and methods of detecting language and natural language strings for text to speech synthesis
US8352268B2 (en) 2008-09-29 2013-01-08 Apple Inc. Systems and methods for selective rate of speech and speech preferences for text to speech synthesis
US8352272B2 (en) 2008-09-29 2013-01-08 Apple Inc. Systems and methods for text to speech synthesis
US8396714B2 (en) 2008-09-29 2013-03-12 Apple Inc. Systems and methods for concatenation of words in text to speech synthesis
US8355919B2 (en) 2008-09-29 2013-01-15 Apple Inc. Systems and methods for text normalization for text to speech synthesis
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
WO2010067118A1 (en) 2008-12-11 2010-06-17 Novauris Technologies Limited Speech recognition involving a mobile device
US8862252B2 (en) 2009-01-30 2014-10-14 Apple Inc. Audio user interface for displayless electronic device
US8380507B2 (en) 2009-03-09 2013-02-19 Apple Inc. Systems and methods for determining the language to use for speech generated by a text to speech engine
US10241644B2 (en) 2011-06-03 2019-03-26 Apple Inc. Actionable reminder entries
US10540976B2 (en) 2009-06-05 2020-01-21 Apple Inc. Contextual voice commands
US10255566B2 (en) 2011-06-03 2019-04-09 Apple Inc. Generating and processing task items that represent tasks to perform
US10241752B2 (en) 2011-09-30 2019-03-26 Apple Inc. Interface for a virtual digital assistant
US9858925B2 (en) 2009-06-05 2018-01-02 Apple Inc. Using context information to facilitate processing of commands in a virtual assistant
US9431006B2 (en) 2009-07-02 2016-08-30 Apple Inc. Methods and apparatuses for automatic speech recognition
US8682649B2 (en) 2009-11-12 2014-03-25 Apple Inc. Sentiment prediction from textual data
US8600743B2 (en) 2010-01-06 2013-12-03 Apple Inc. Noise profile determination for voice-related feature
US8311838B2 (en) 2010-01-13 2012-11-13 Apple Inc. Devices and methods for identifying a prompt corresponding to a voice input in a sequence of prompts
US8381107B2 (en) 2010-01-13 2013-02-19 Apple Inc. Adaptive audio feedback system and method
US10705794B2 (en) 2010-01-18 2020-07-07 Apple Inc. Automatically adapting user interfaces for hands-free interaction
US10553209B2 (en) 2010-01-18 2020-02-04 Apple Inc. Systems and methods for hands-free notification summaries
US10276170B2 (en) 2010-01-18 2019-04-30 Apple Inc. Intelligent automated assistant
US10679605B2 (en) 2010-01-18 2020-06-09 Apple Inc. Hands-free list-reading by intelligent automated assistant
US8682667B2 (en) 2010-02-25 2014-03-25 Apple Inc. User profiling for selecting user specific voice input processing information
US8639516B2 (en) 2010-06-04 2014-01-28 Apple Inc. User-specific noise suppression for voice quality improvements
US9323844B2 (en) 2010-06-11 2016-04-26 Doat Media Ltd. System and methods thereof for enhancing a user's search experience
US9529918B2 (en) 2010-06-11 2016-12-27 Doat Media Ltd. System and methods thereof for downloading applications via a communication network
US20160300138A1 (en) * 2010-06-11 2016-10-13 Doat Media Ltd. Method and system for context-based intent verification
US10713312B2 (en) * 2010-06-11 2020-07-14 Doat Media Ltd. System and method for context-launching of applications
US9141702B2 (en) 2010-06-11 2015-09-22 Doat Media Ltd. Method for dynamically displaying a personalized home screen on a device
US9552422B2 (en) 2010-06-11 2017-01-24 Doat Media Ltd. System and method for detecting a search intent
US20140365474A1 (en) * 2010-06-11 2014-12-11 Doat Media Ltd. System and method for sharing content over the web
US9069443B2 (en) 2010-06-11 2015-06-30 Doat Media Ltd. Method for dynamically displaying a personalized home screen on a user device
US8713021B2 (en) 2010-07-07 2014-04-29 Apple Inc. Unsupervised document clustering using latent semantic density analysis
US9104670B2 (en) 2010-07-21 2015-08-11 Apple Inc. Customized search or acquisition of digital media assets
US20120038555A1 (en) * 2010-08-12 2012-02-16 Research In Motion Limited Method and Electronic Device With Motion Compensation
US8719006B2 (en) 2010-08-27 2014-05-06 Apple Inc. Combined statistical and rule-based part-of-speech tagging for text-to-speech synthesis
US8700392B1 (en) * 2010-09-10 2014-04-15 Amazon Technologies, Inc. Speech-inclusive device interfaces
US9274744B2 (en) 2010-09-10 2016-03-01 Amazon Technologies, Inc. Relative position-inclusive device interfaces
US8719014B2 (en) 2010-09-27 2014-05-06 Apple Inc. Electronic device with text error correction based on voice recognition data
US9348417B2 (en) 2010-11-01 2016-05-24 Microsoft Technology Licensing, Llc Multimodal input system
US20120159341A1 (en) 2010-12-21 2012-06-21 Microsoft Corporation Interactions with contextual and task-based computing environments
US10515147B2 (en) 2010-12-22 2019-12-24 Apple Inc. Using statistical language models for contextual lookup
US10762293B2 (en) 2010-12-22 2020-09-01 Apple Inc. Using parts-of-speech tagging and named entity recognition for spelling correction
US20120166522A1 (en) * 2010-12-27 2012-06-28 Microsoft Corporation Supporting intelligent user interface interactions
US8781836B2 (en) 2011-02-22 2014-07-15 Apple Inc. Hearing assistance system for providing consistent human speech
US9262612B2 (en) 2011-03-21 2016-02-16 Apple Inc. Device access using voice authentication
US9263045B2 (en) 2011-05-17 2016-02-16 Microsoft Technology Licensing, Llc Multi-mode text input
US20120304067A1 (en) * 2011-05-25 2012-11-29 Samsung Electronics Co., Ltd. Apparatus and method for controlling user interface using sound recognition
US10672399B2 (en) 2011-06-03 2020-06-02 Apple Inc. Switching between text data and audio data based on a mapping
US10057736B2 (en) 2011-06-03 2018-08-21 Apple Inc. Active transport based notifications
TWI447066B (en) * 2011-06-08 2014-08-01 Sitronix Technology Corp Distance sensing circuit and touch electronic device
US8928336B2 (en) 2011-06-09 2015-01-06 Ford Global Technologies, Llc Proximity switch having sensitivity control and method therefor
US8975903B2 (en) 2011-06-09 2015-03-10 Ford Global Technologies, Llc Proximity switch having learned sensitivity and method therefor
US8812294B2 (en) 2011-06-21 2014-08-19 Apple Inc. Translating phrases from one language into another using an order-based set of declarative rules
US10004286B2 (en) 2011-08-08 2018-06-26 Ford Global Technologies, Llc Glove having conductive ink and method of interacting with proximity sensor
US8706472B2 (en) 2011-08-11 2014-04-22 Apple Inc. Method for disambiguating multiple readings in language conversion
US8994660B2 (en) 2011-08-29 2015-03-31 Apple Inc. Text correction processing
US9143126B2 (en) 2011-09-22 2015-09-22 Ford Global Technologies, Llc Proximity switch having lockout control for controlling movable panel
US8762156B2 (en) 2011-09-28 2014-06-24 Apple Inc. Speech recognition repair using contextual information
US8994228B2 (en) 2011-11-03 2015-03-31 Ford Global Technologies, Llc Proximity switch having wrong touch feedback
US10112556B2 (en) 2011-11-03 2018-10-30 Ford Global Technologies, Llc Proximity switch having wrong touch adaptive learning and method
US8878438B2 (en) 2011-11-04 2014-11-04 Ford Global Technologies, Llc Lamp and proximity switch assembly and method
US9223415B1 (en) 2012-01-17 2015-12-29 Amazon Technologies, Inc. Managing resource usage for task performance
US10134385B2 (en) 2012-03-02 2018-11-20 Apple Inc. Systems and methods for name pronunciation
US9483461B2 (en) 2012-03-06 2016-11-01 Apple Inc. Handling speech synthesis of content for multiple languages
US8933708B2 (en) 2012-04-11 2015-01-13 Ford Global Technologies, Llc Proximity switch assembly and activation method with exploration mode
US9197206B2 (en) 2012-04-11 2015-11-24 Ford Global Technologies, Llc Proximity switch having differential contact surface
US9520875B2 (en) 2012-04-11 2016-12-13 Ford Global Technologies, Llc Pliable proximity switch assembly and activation method
US9568527B2 (en) 2012-04-11 2017-02-14 Ford Global Technologies, Llc Proximity switch assembly and activation method having virtual button mode
US9219472B2 (en) 2012-04-11 2015-12-22 Ford Global Technologies, Llc Proximity switch assembly and activation method using rate monitoring
US9531379B2 (en) 2012-04-11 2016-12-27 Ford Global Technologies, Llc Proximity switch assembly having groove between adjacent proximity sensors
US9660644B2 (en) 2012-04-11 2017-05-23 Ford Global Technologies, Llc Proximity switch assembly and activation method
US9831870B2 (en) 2012-04-11 2017-11-28 Ford Global Technologies, Llc Proximity switch assembly and method of tuning same
US9559688B2 (en) 2012-04-11 2017-01-31 Ford Global Technologies, Llc Proximity switch assembly having pliable surface and depression
US9184745B2 (en) 2012-04-11 2015-11-10 Ford Global Technologies, Llc Proximity switch assembly and method of sensing user input based on signal rate of change
US9944237B2 (en) 2012-04-11 2018-04-17 Ford Global Technologies, Llc Proximity switch assembly with signal drift rejection and method
US9065447B2 (en) 2012-04-11 2015-06-23 Ford Global Technologies, Llc Proximity switch assembly and method having adaptive time delay
US9287864B2 (en) 2012-04-11 2016-03-15 Ford Global Technologies, Llc Proximity switch assembly and calibration method therefor
US9280610B2 (en) 2012-05-14 2016-03-08 Apple Inc. Crowd sourcing information to fulfill user requests
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US8775442B2 (en) 2012-05-15 2014-07-08 Apple Inc. Semantic search using a single-source semantic model
US9136840B2 (en) 2012-05-17 2015-09-15 Ford Global Technologies, Llc Proximity switch assembly having dynamic tuned threshold
US8981602B2 (en) 2012-05-29 2015-03-17 Ford Global Technologies, Llc Proximity switch assembly having non-switch contact and method
US9337832B2 (en) 2012-06-06 2016-05-10 Ford Global Technologies, Llc Proximity switch and method of adjusting sensitivity therefor
US10019994B2 (en) 2012-06-08 2018-07-10 Apple Inc. Systems and methods for recognizing textual identifiers within a plurality of words
US9721563B2 (en) 2012-06-08 2017-08-01 Apple Inc. Name recognition system
US9641172B2 (en) 2012-06-27 2017-05-02 Ford Global Technologies, Llc Proximity switch assembly having varying size electrode fingers
US9495129B2 (en) 2012-06-29 2016-11-15 Apple Inc. Device, method, and user interface for voice-activated navigation and browsing of a document
US9576574B2 (en) 2012-09-10 2017-02-21 Apple Inc. Context-sensitive handling of interruptions by intelligent digital assistant
US8922340B2 (en) 2012-09-11 2014-12-30 Ford Global Technologies, Llc Proximity switch based door latch release
US9547647B2 (en) 2012-09-19 2017-01-17 Apple Inc. Voice-based media searching
US8935167B2 (en) 2012-09-25 2015-01-13 Apple Inc. Exemplar-based latent perceptual modeling for automatic speech recognition
US8796575B2 (en) 2012-10-31 2014-08-05 Ford Global Technologies, Llc Proximity switch assembly having ground layer
US9081413B2 (en) * 2012-11-20 2015-07-14 3M Innovative Properties Company Human interaction system based upon real-time intention detection
CN103841137A (en) * 2012-11-22 2014-06-04 腾讯科技(深圳)有限公司 Method for intelligent terminal to control webpage application, and intelligent terminal
US9147398B2 (en) * 2013-01-23 2015-09-29 Nokia Technologies Oy Hybrid input device for touchless user interface
EP4138075A1 (en) 2013-02-07 2023-02-22 Apple Inc. Voice trigger for a digital assistant
US9311204B2 (en) 2013-03-13 2016-04-12 Ford Global Technologies, Llc Proximity interface development system having replicator and method
US9977779B2 (en) 2013-03-14 2018-05-22 Apple Inc. Automatic supplementation of word correction dictionaries
US10572476B2 (en) 2013-03-14 2020-02-25 Apple Inc. Refining a search based on schedule items
US9368114B2 (en) 2013-03-14 2016-06-14 Apple Inc. Context-sensitive handling of interruptions
US10642574B2 (en) 2013-03-14 2020-05-05 Apple Inc. Device, method, and graphical user interface for outputting captions
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US9733821B2 (en) 2013-03-14 2017-08-15 Apple Inc. Voice control to diagnose inadvertent activation of accessibility features
KR101857648B1 (en) 2013-03-15 2018-05-15 애플 인크. User training by intelligent digital assistant
CN112230878A (en) 2013-03-15 2021-01-15 苹果公司 Context-sensitive handling of interrupts
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
KR101759009B1 (en) 2013-03-15 2017-07-17 애플 인크. Training an at least partial voice command system
WO2014144579A1 (en) 2013-03-15 2014-09-18 Apple Inc. System and method for updating an adaptive speech recognition model
JP6032350B2 (en) * 2013-03-21 2016-11-24 富士通株式会社 Motion detection device and motion detection method
CN103200330A (en) * 2013-04-16 2013-07-10 上海斐讯数据通信技术有限公司 Method and mobile terminal for achieving triggering of flashlight
WO2014197336A1 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for detecting errors in interactions with a voice-based digital assistant
WO2014197334A2 (en) 2013-06-07 2014-12-11 Apple Inc. System and method for user-specified pronunciation of words for speech synthesis and recognition
US9582608B2 (en) 2013-06-07 2017-02-28 Apple Inc. Unified ranking with entropy-weighted information for phrase-based semantic auto-completion
WO2014197335A1 (en) 2013-06-08 2014-12-11 Apple Inc. Interpreting and acting upon commands that involve sharing information with remote devices
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
CN110442699A (en) 2013-06-09 2019-11-12 苹果公司 Operate method, computer-readable medium, electronic equipment and the system of digital assistants
CN105265005B (en) 2013-06-13 2019-09-17 苹果公司 System and method for the urgent call initiated by voice command
US9633317B2 (en) 2013-06-20 2017-04-25 Viv Labs, Inc. Dynamically evolving cognitive architecture system based on a natural language intent interpreter
US9594542B2 (en) 2013-06-20 2017-03-14 Viv Labs, Inc. Dynamically evolving cognitive architecture system based on training by third-party developers
US10474961B2 (en) 2013-06-20 2019-11-12 Viv Labs, Inc. Dynamically evolving cognitive architecture system based on prompting for additional user input
US9519461B2 (en) 2013-06-20 2016-12-13 Viv Labs, Inc. Dynamically evolving cognitive architecture system based on third-party developers
US10791216B2 (en) 2013-08-06 2020-09-29 Apple Inc. Auto-activating smart responses based on activities from remote devices
US11199906B1 (en) 2013-09-04 2021-12-14 Amazon Technologies, Inc. Global user input management
US9367203B1 (en) 2013-10-04 2016-06-14 Amazon Technologies, Inc. User interface techniques for simulating three-dimensional depth
US20160163314A1 (en) * 2013-11-25 2016-06-09 Mitsubishi Electric Corporation Dialog management system and dialog management method
US10296160B2 (en) 2013-12-06 2019-05-21 Apple Inc. Method for extracting salient dialog usage from live data
EP2887205A1 (en) * 2013-12-17 2015-06-24 Sony Corporation Voice activated device, method & computer program product
US10741182B2 (en) * 2014-02-18 2020-08-11 Lenovo (Singapore) Pte. Ltd. Voice input correction using non-audio based input
US9620105B2 (en) 2014-05-15 2017-04-11 Apple Inc. Analyzing audio input for efficient speech and music recognition
US10592095B2 (en) 2014-05-23 2020-03-17 Apple Inc. Instantaneous speaking of content on touch devices
US9502031B2 (en) 2014-05-27 2016-11-22 Apple Inc. Method for supporting dynamic grammars in WFST-based ASR
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US9760559B2 (en) 2014-05-30 2017-09-12 Apple Inc. Predictive text input
US9785630B2 (en) 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
US9734193B2 (en) 2014-05-30 2017-08-15 Apple Inc. Determining domain salience ranking from ambiguous words in natural speech
US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US9842101B2 (en) 2014-05-30 2017-12-12 Apple Inc. Predictive conversion of language input
US9966065B2 (en) 2014-05-30 2018-05-08 Apple Inc. Multi-command single utterance input method
US10078631B2 (en) 2014-05-30 2018-09-18 Apple Inc. Entropy-guided text prediction using combined word and character n-gram language models
US10289433B2 (en) 2014-05-30 2019-05-14 Apple Inc. Domain specific language for encoding assistant dialog
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9633004B2 (en) 2014-05-30 2017-04-25 Apple Inc. Better resolution when referencing to concepts
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9582482B1 (en) 2014-07-11 2017-02-28 Google Inc. Providing an annotation linking related entities in onscreen content
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10789041B2 (en) 2014-09-12 2020-09-29 Apple Inc. Dynamic thresholds for always listening speech trigger
US9792334B2 (en) * 2014-09-25 2017-10-17 Sap Se Large-scale processing and querying for real-time surveillance
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US10038443B2 (en) 2014-10-20 2018-07-31 Ford Global Technologies, Llc Directional proximity switch assembly
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
JP5784211B1 (en) * 2014-12-19 2015-09-24 株式会社Cygames Information processing program and information processing method
CN105812506A (en) * 2014-12-27 2016-07-27 深圳富泰宏精密工业有限公司 Operation mode control system and method
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US10152299B2 (en) 2015-03-06 2018-12-11 Apple Inc. Reducing response latency of intelligent automated assistants
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US9654103B2 (en) 2015-03-18 2017-05-16 Ford Global Technologies, Llc Proximity switch assembly having haptic feedback and method
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
KR102351497B1 (en) 2015-03-19 2022-01-14 삼성전자주식회사 Method and apparatus for detecting a voice section based on image information
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
US9548733B2 (en) 2015-05-20 2017-01-17 Ford Global Technologies, Llc Proximity sensor assembly having interleaved electrode configuration
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10200824B2 (en) 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US9578173B2 (en) 2015-06-05 2017-02-21 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US20160378747A1 (en) 2015-06-29 2016-12-29 Apple Inc. Virtual assistant for media playback
CN105159111B (en) * 2015-08-24 2019-01-25 百度在线网络技术(北京)有限公司 Intelligent interaction device control method and system based on artificial intelligence
US10331312B2 (en) 2015-09-08 2019-06-25 Apple Inc. Intelligent automated assistant in a media environment
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10740384B2 (en) 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US10970646B2 (en) * 2015-10-01 2021-04-06 Google Llc Action suggestions for user-selected content
CN105389461A (en) * 2015-10-21 2016-03-09 胡习 Interactive children's self-management system and management method thereof
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10764226B2 (en) * 2016-01-15 2020-09-01 Staton Techiya, Llc Message delivery and presentation methods, systems and devices using receptivity
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US11227589B2 (en) 2016-06-06 2022-01-18 Apple Inc. Intelligent list reading
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
DK179588B1 (en) 2016-06-09 2019-02-22 Apple Inc. Intelligent automated assistant in a home environment
CN107490971B (en) * 2016-06-09 2019-06-11 苹果公司 Intelligent automated assistant in a home environment
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
DK179343B1 (en) 2016-06-11 2018-05-14 Apple Inc Intelligent task discovery
DK179049B1 (en) 2016-06-11 2017-09-18 Apple Inc Data driven natural language event detection and classification
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
US10621992B2 (en) * 2016-07-22 2020-04-14 Lenovo (Singapore) Pte. Ltd. Activating voice assistant based on at least one of user proximity and context
CN106446524A (en) * 2016-08-31 2017-02-22 北京智能管家科技有限公司 Intelligent hardware multimodal cascade modeling method and apparatus
US10474753B2 (en) 2016-09-07 2019-11-12 Apple Inc. Language identification using recurrent neural networks
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10535005B1 (en) 2016-10-26 2020-01-14 Google Llc Providing contextual actions for mobile onscreen content
US11281993B2 (en) 2016-12-05 2022-03-22 Apple Inc. Model and ensemble compression for metric learning
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10229680B1 (en) * 2016-12-29 2019-03-12 Amazon Technologies, Inc. Contextual entity resolution
US11204787B2 (en) 2017-01-09 2021-12-21 Apple Inc. Application integration with a digital assistant
DK201770383A1 (en) 2017-05-09 2018-12-14 Apple Inc. User interface for correcting recognition errors
US10417266B2 (en) 2017-05-09 2019-09-17 Apple Inc. Context-aware ranking of intelligent response suggestions
US10395654B2 (en) 2017-05-11 2019-08-27 Apple Inc. Text normalization based on a data-driven learning network
DK201770439A1 (en) 2017-05-11 2018-12-13 Apple Inc. Offline personal assistant
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
DK180048B1 (en) 2017-05-11 2020-02-04 Apple Inc. Maintaining the data protection of personal information
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
US11301477B2 (en) 2017-05-12 2022-04-12 Apple Inc. Feedback analysis of a digital assistant
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
DK201770429A1 (en) 2017-05-12 2018-12-14 Apple Inc. Low-latency intelligent automated assistant
DK201770411A1 (en) 2017-05-15 2018-12-20 Apple Inc. Multi-modal interfaces
DK201770432A1 (en) 2017-05-15 2018-12-21 Apple Inc. Hierarchical belief states for digital assistants
DK201770431A1 (en) 2017-05-15 2018-12-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10403278B2 (en) 2017-05-16 2019-09-03 Apple Inc. Methods and systems for phonetic matching in digital assistant services
US10303715B2 (en) 2017-05-16 2019-05-28 Apple Inc. Intelligent automated assistant for media exploration
US10311144B2 (en) 2017-05-16 2019-06-04 Apple Inc. Emoji word sense disambiguation
US20180336892A1 (en) 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
DK179549B1 (en) 2017-05-16 2019-02-12 Apple Inc. Far-field extension for digital assistant services
US10664533B2 (en) 2017-05-24 2020-05-26 Lenovo (Singapore) Pte. Ltd. Systems and methods to determine response cue for digital assistant based on context
US10657328B2 (en) 2017-06-02 2020-05-19 Apple Inc. Multi-task recurrent neural network architecture for efficient morphology handling in neural language modeling
US10445429B2 (en) 2017-09-21 2019-10-15 Apple Inc. Natural language understanding using vocabularies with compressed serialized tries
US10755051B2 (en) 2017-09-29 2020-08-25 Apple Inc. Rule-based natural language processing
US10636424B2 (en) 2017-11-30 2020-04-28 Apple Inc. Multi-turn canned dialog
CN111656406A (en) 2017-12-14 2020-09-11 奇跃公司 Context-based rendering of virtual avatars
CN108563321A (en) * 2018-01-02 2018-09-21 联想(北京)有限公司 Information processing method and electronic equipment
US10733982B2 (en) 2018-01-08 2020-08-04 Apple Inc. Multi-directional dialog
US10733375B2 (en) 2018-01-31 2020-08-04 Apple Inc. Knowledge-based framework for improving natural language understanding
US10789959B2 (en) 2018-03-02 2020-09-29 Apple Inc. Training speaker recognition models for digital assistants
US10592604B2 (en) 2018-03-12 2020-03-17 Apple Inc. Inverse text normalization for automatic speech recognition
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US10909331B2 (en) 2018-03-30 2021-02-02 Apple Inc. Implicit identification of translation payload with neural machine translation
TWI691923B (en) * 2018-04-02 2020-04-21 華南商業銀行股份有限公司 Fraud detection system for financial transaction and method thereof
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10984780B2 (en) 2018-05-21 2021-04-20 Apple Inc. Global semantic word embeddings using bi-directional recurrent neural networks
DK179822B1 (en) 2018-06-01 2019-07-12 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
DK201870355A1 (en) 2018-06-01 2019-12-16 Apple Inc. Virtual assistant operation in multi-device environments
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
US11386266B2 (en) 2018-06-01 2022-07-12 Apple Inc. Text correction
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc Attention-aware virtual assistant dismissal
US10496705B1 (en) 2018-06-03 2019-12-03 Apple Inc. Accelerated task performance
US11588902B2 (en) * 2018-07-24 2023-02-21 Newton Howard Intelligent reasoning framework for user intent extraction
US11010561B2 (en) 2018-09-27 2021-05-18 Apple Inc. Sentiment prediction from textual data
US10839159B2 (en) 2018-09-28 2020-11-17 Apple Inc. Named entity normalization in a spoken dialog system
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11170166B2 (en) 2018-09-28 2021-11-09 Apple Inc. Neural typographical error modeling via generative adversarial networks
US10831442B2 (en) * 2018-10-19 2020-11-10 International Business Machines Corporation Digital assistant user interface amalgamation
CN109192209A (en) * 2018-10-23 2019-01-11 珠海格力电器股份有限公司 Speech recognition method and device
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
CN111737670B (en) * 2019-03-25 2023-08-18 广州汽车集团股份有限公司 Method, system and vehicle-mounted multimedia device for multi-mode data collaborative man-machine interaction
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
DK180129B1 (en) 2019-05-31 2020-06-02 Apple Inc. User activity shortcut suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
DK201970511A1 (en) 2019-05-31 2021-02-15 Apple Inc Voice identification in digital assistant systems
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
US11468890B2 (en) 2019-06-01 2022-10-11 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
CN110196642B (en) * 2019-06-21 2022-05-17 济南大学 Navigation type virtual microscope based on intention understanding model
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
US11887600B2 (en) * 2019-10-04 2024-01-30 Disney Enterprises, Inc. Techniques for interpreting spoken input using non-verbal cues
EP3832435A1 (en) * 2019-12-06 2021-06-09 XRSpace CO., LTD. Motion tracking system and method
US11869213B2 (en) * 2020-01-17 2024-01-09 Samsung Electronics Co., Ltd. Electronic device for analyzing skin image and method for controlling the same
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11043220B1 (en) 2020-05-11 2021-06-22 Apple Inc. Digital assistant hardware abstraction
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
CN111968631B (en) * 2020-06-29 2023-10-10 百度在线网络技术(北京)有限公司 Interaction method, apparatus, device, and storage medium for a smart device
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
US11804215B1 (en) 2022-04-29 2023-10-31 Apple Inc. Sonic responses

Family Cites Families (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0375860A (en) * 1989-08-18 1991-03-29 Hitachi Ltd Personalized terminal
US5621858A (en) * 1992-05-26 1997-04-15 Ricoh Corporation Neural network acoustic and visual speech recognition system training method and apparatus
US5473726A (en) * 1993-07-06 1995-12-05 The United States Of America As Represented By The Secretary Of The Air Force Audio and amplitude modulated photo data collection for speech recognition
JP3375449B2 (en) 1995-02-27 2003-02-10 シャープ株式会社 Integrated recognition dialogue device
US5806036A (en) * 1995-08-17 1998-09-08 Ricoh Company, Ltd. Speechreading using facial feature parameters from a non-direct frontal view of the speaker
JP3702978B2 (en) * 1996-12-26 2005-10-05 ソニー株式会社 Recognition device, recognition method, learning device, and learning method
JPH11164186A (en) * 1997-11-27 1999-06-18 Fuji Photo Film Co Ltd Image recorder
US6629065B1 (en) * 1998-09-30 2003-09-30 Wisconsin Alumni Research Foundation Methods and apparata for rapid computer-aided design of objects in virtual reality and other environments
JP2000132305A (en) * 1998-10-23 2000-05-12 Olympus Optical Co Ltd Operation input device
US6842877B2 (en) * 1998-12-18 2005-01-11 Tangis Corporation Contextual responses based on automated learning techniques
US6563532B1 (en) * 1999-01-05 2003-05-13 Interval Research Corporation Low attention recording unit for use by vigorously active recorder
JP2000276190A (en) 1999-03-26 2000-10-06 Yasuto Takeuchi Voice call device requiring no phonation
SE9902229L (en) * 1999-06-07 2001-02-05 Ericsson Telefon Ab L M Apparatus and method of controlling a voice controlled operation
US6904405B2 (en) * 1999-07-17 2005-06-07 Edwin A. Suominen Message recognition using shared language model
JP2001100878A (en) 1999-09-29 2001-04-13 Toshiba Corp Multi-modal input/output device
US7028269B1 (en) * 2000-01-20 2006-04-11 Koninklijke Philips Electronics N.V. Multi-modal video target acquisition and re-direction system and method
JP2001216069A (en) * 2000-02-01 2001-08-10 Toshiba Corp Operation inputting device and direction detecting method
JP2005174356A (en) * 2000-02-01 2005-06-30 Toshiba Corp Direction detection method
NZ503882A (en) * 2000-04-10 2002-11-26 Univ Otago Artificial intelligence system comprising a neural network with an adaptive component arranged to aggregate rule nodes
US6754373B1 (en) * 2000-07-14 2004-06-22 International Business Machines Corporation System and method for microphone activation using visual speech cues
US6894714B2 (en) * 2000-12-05 2005-05-17 Koninklijke Philips Electronics N.V. Method and apparatus for predicting events in video conferencing and other applications
US6964023B2 (en) * 2001-02-05 2005-11-08 International Business Machines Corporation System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input
KR20020068235A (en) 2001-02-20 2002-08-27 유재천 Method and apparatus of recognizing speech using a tooth and lip image
US7171357B2 (en) * 2001-03-21 2007-01-30 Avaya Technology Corp. Voice-activity detection using energy ratios and periodicity
US7102485B2 (en) * 2001-05-08 2006-09-05 Gene Williams Motion activated communication device
US7203643B2 (en) * 2001-06-14 2007-04-10 Qualcomm Incorporated Method and apparatus for transmitting speech activity in distributed voice recognition systems
CA2397703C (en) * 2001-08-15 2009-04-28 At&T Corp. Systems and methods for abstracting portions of information that is represented with finite-state devices
US6990639B2 (en) * 2002-02-07 2006-01-24 Microsoft Corporation System and process for controlling electronic components in a ubiquitous computing environment using multimodal integration
DE10208469A1 (en) * 2002-02-27 2003-09-04 Bsh Bosch Siemens Hausgeraete Electrical device, in particular extractor hood
US7230955B1 (en) * 2002-12-27 2007-06-12 At & T Corp. System and method for improved use of voice activity detection
KR100515798B1 (en) * 2003-02-10 2005-09-21 한국과학기술원 Robot driving method using facial gestures
CA2420129A1 (en) * 2003-02-17 2004-08-17 Catena Networks, Canada, Inc. A method for robustly detecting voice activity
US8745541B2 (en) * 2003-03-25 2014-06-03 Microsoft Corporation Architecture for controlling a computer using hand gestures
US20040243416A1 (en) * 2003-06-02 2004-12-02 Gardos Thomas R. Speech recognition
US7343289B2 (en) * 2003-06-25 2008-03-11 Microsoft Corp. System and method for audio/video speaker detection
US7383181B2 (en) * 2003-07-29 2008-06-03 Microsoft Corporation Multi-sensory speech detection system
US7318030B2 (en) * 2003-09-17 2008-01-08 Intel Corporation Method and apparatus to perform voice activity detection
JP4311190B2 (en) * 2003-12-17 2009-08-12 株式会社デンソー In-vehicle device interface
US20050228673A1 (en) * 2004-03-30 2005-10-13 Nefian Ara V Techniques for separating and evaluating audio and video source data
US8788265B2 (en) * 2004-05-25 2014-07-22 Nokia Solutions And Networks Oy System and method for babble noise detection
US7624355B2 (en) * 2004-05-27 2009-11-24 Baneth Robin C System and method for controlling a user interface
FI20045315A (en) * 2004-08-30 2006-03-01 Nokia Corp Detection of voice activity in an audio signal
JP4630646B2 (en) * 2004-11-19 2011-02-09 任天堂株式会社 Breath blowing discrimination program, breath blowing discrimination device, game program, and game device
EP1686804A1 (en) * 2005-01-26 2006-08-02 Alcatel Predictor of multimedia system user behavior
WO2006104555A2 (en) * 2005-03-24 2006-10-05 Mindspeed Technologies, Inc. Adaptive noise state update for a voice activity detector
GB2426166B (en) * 2005-05-09 2007-10-17 Toshiba Res Europ Ltd Voice activity detection apparatus and method
US7346504B2 (en) * 2005-06-20 2008-03-18 Microsoft Corporation Multi-sensory speech enhancement using a clean speech prior
US20070005363A1 (en) * 2005-06-29 2007-01-04 Microsoft Corporation Location aware multi-modal multi-lingual device
WO2007057879A1 (en) * 2005-11-17 2007-05-24 Shaul Simhi Personalized voice activity detection
KR100820141B1 (en) 2005-12-08 2008-04-08 한국전자통신연구원 Apparatus and method for speech segment detection and system for speech recognition
US7860718B2 (en) * 2005-12-08 2010-12-28 Electronics And Telecommunications Research Institute Apparatus and method for speech segment detection and system for speech recognition
DE102006037156A1 (en) * 2006-03-22 2007-09-27 Volkswagen Ag Interactive operating device and method for operating the interactive operating device
KR20080002187A (en) 2006-06-30 2008-01-04 주식회사 케이티 System and method for customized emotion service with an alteration of human being's emotion and circumstances
US8775168B2 (en) * 2006-08-10 2014-07-08 Stmicroelectronics Asia Pacific Pte, Ltd. Yule-Walker based low-complexity voice activity detector in noise suppression systems
WO2008069519A1 (en) * 2006-12-04 2008-06-12 Electronics And Telecommunications Research Institute Gesture/speech integrated recognition system and method
US8326636B2 (en) * 2008-01-16 2012-12-04 Canyon Ip Holdings Llc Using a physical phenomenon detector to control operation of a speech recognition engine
US20080252595A1 (en) * 2007-04-11 2008-10-16 Marc Boillot Method and Device for Virtual Navigation and Voice Processing
JP2009042910A (en) * 2007-08-07 2009-02-26 Sony Corp Information processor, information processing method, and computer program
US8321219B2 (en) 2007-10-05 2012-11-27 Sensory, Inc. Systems and methods of performing speech recognition using gestures
US20090262078A1 (en) * 2008-04-21 2009-10-22 David Pizzi Cellular phone with special sensor functions
US20100162181A1 (en) * 2008-12-22 2010-06-24 Palm, Inc. Interpreting Gesture Input Including Introduction Or Removal Of A Point Of Contact While A Gesture Is In Progress

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
None
See also references of EP2426598A4

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016148398A1 (en) * 2015-03-16 2016-09-22 주식회사 스마트올웨이즈온 Set-top box and photographing apparatus for performing context-aware function based on multi-modal information to self-learn and self-improve user interface and user experience

Also Published As

Publication number Publication date
EP2426598B1 (en) 2017-06-21
US8606735B2 (en) 2013-12-10
WO2010126321A3 (en) 2011-03-24
JP5911796B2 (en) 2016-04-27
US20100280983A1 (en) 2010-11-04
JP2012525625A (en) 2012-10-22
CN102405463B (en) 2015-07-29
EP2426598A2 (en) 2012-03-07
EP2426598A4 (en) 2012-11-14
CN102405463A (en) 2012-04-04

Similar Documents

Publication Publication Date Title
WO2010126321A2 (en) Apparatus and method for user intention inference using multimodal information
CN106575150B (en) Method for recognizing gestures using motion data and wearable computing device
LaViola Jr 3D gestural interaction: The state of the field
WO2010110573A2 (en) Multi-telepointer, virtual object display device, and virtual object control method
WO2013055025A1 (en) Intelligent robot, system for interaction between intelligent robot and user, and method for interacting between intelligent robot and user
WO2017188801A1 (en) Optimum control method based on multi-mode command of operation-voice, and electronic device to which same is applied
KR20100119250A (en) Apparatus for detecting voice using motion information and method thereof
WO2013009062A2 (en) Method and terminal device for controlling content by sensing head gesture and hand gesture, and computer-readable recording medium
CN104516499B (en) Apparatus and method for event using user interface
LaViola Jr An introduction to 3D gestural interfaces
CN111833872B (en) Voice control method, device, equipment, system and medium for elevator
CN109725727 (en) Gesture control method and device for a screen-equipped device
KR101652705B1 (en) Apparatus for predicting intention of user using multi modal information and method thereof
WO2016036197A1 (en) Hand movement recognizing device and method
CN114167984A (en) Device control method, device, storage medium and electronic device
WO2019156412A1 (en) Method for operating voice recognition service and electronic device supporting same
Wang et al. A gesture-based method for natural interaction in smart spaces
CN109725722 (en) Gesture control method and device for a screen-equipped device
WO2014178491A1 (en) Speech recognition method and apparatus
Babu et al. Controlling Computer Features Through Hand Gesture
Mali et al. Hand gestures recognition using inertial sensors through deep learning
Chaudhry et al. Music Recommendation System through Hand Gestures and Facial Emotions
Costagliola et al. Gesture‐Based Computing
US11464380B2 (en) Artificial intelligence cleaner and operating method thereof
KR20230043285A (en) Method and apparatus for hand movement tracking using deep learning

Legal Events

Date Code Title Description
WWE  Wipo information: entry into national phase
     Ref document number: 201080017476.6
     Country of ref document: CN

121  Ep: the EPO has been informed by WIPO that EP was designated in this application
     Ref document number: 10769966
     Country of ref document: EP
     Kind code of ref document: A2

WWE  Wipo information: entry into national phase
     Ref document number: 2012508401
     Country of ref document: JP

NENP Non-entry into the national phase
     Ref country code: DE

REEP Request for entry into the European phase
     Ref document number: 2010769966
     Country of ref document: EP

WWE  Wipo information: entry into national phase
     Ref document number: 2010769966
     Country of ref document: EP