KR20090090613A - System and method for multimodal conversational mode image management - Google Patents
- Publication number
- KR20090090613A (application KR1020080015938A)
- Authority
- KR
- South Korea
- Prior art keywords
- image
- information
- module
- voice
- tag
- Prior art date: 2008-02-21
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The present invention relates to a multimodal interactive image management system and method. The present invention provides an image recognition module that distinguishes the objects included in an image and generates image classification information when a new image is input from a user terminal; a tagnet management module that extracts tag candidate recommendation information related to the image classification information generated by the image recognition module, using a tag ontology constructed with words associated with each subject; a natural language dialogue module that selects the desired tag through dialogue with the user using the extracted tag candidate recommendation information; a multi-modal interface processing module configured to receive image-related information, including a memo about the image, from the user terminal and integrate it into one piece of information; and an image DB that stores images, related tags, and memo information. This allows a user to upload photos or videos to the multi-modal interactive image management system and then, interacting through multi-modal interfaces such as voice, keyboard, and touch pad together with the dialogue module, attach tag and memo information to the images; after the information is stored in the photo-tag DB, tags and memos related to the photos or videos are provided to user terminals such as PCs and digital photo frames.
Description
The present invention relates to a multi-modal interactive image management system and method, and in particular to a system and method in which, after a user uploads a photo or video to the multi-modal interactive image management system, tags related to the photo or video are recommended using an image recognition system, a photo-tag DB, and a tagnet, or tag information is attached to the image through dialogue with the user using a multi-modal interface and dialogue module that combine voice recognition and a touch pad, thereby providing a tag-based image information retrieval service.
In recent years, voice recognition and speech synthesis techniques have been developed, and the need for a multimodal interface using voice and a pen is increasing for terminals such as mobile terminals, home network terminals, and robots.
A modality is a human sensory channel, such as sight, hearing, touch, taste, or smell, modeled as a channel of a mechanical device; the use of several such modalities together is called multimodal, and the interaction of the modalities is called multimodal interaction.
Multimodal interfaces are classified along two axes: by the point of use, into sequential and simultaneous multimodal, and by the way modalities are combined, into alternative-combination and supplementary-combination multimodal.
The multi-modal interface uses not only voice but also a keyboard, a pen, and a graphical interface between the user and the terminal device; the user can input information by voice, keyboard, pen, and the like, and receives results through the corresponding output interface.
Multimodal inputs include voice, pen (or keypad), keyboard, mouse, touchscreen, lip movement input, gesture input, eye movement input, and the like.
The output of the multimodal includes graphic elements such as icons and tables, sound elements such as earcons, sound effects, and synthesized speech, and haptic elements such as vibration.
The multimodal interface is divided into a voice interface for processing voice input and output, an ink interface for recognizing handwriting made with a pen, and the conventionally used keyboard interface. The voice interface uses the Extensible MultiModal Annotation (EMMA) format so that its results can be represented together with those of the other interfaces. The ink interface covers input made with a pen, whether drawing a picture, writing text, or writing mathematical expressions. The keyboard interface is the one currently in common use and likewise uses the EMMA format to represent its results like the other interfaces.
With the recent advances in speech recognition, speech synthesis, and handwriting recognition technology, and the growing need for services that utilize these multimodal technologies, the W3C Multimodal Interaction Working Group is standardizing the Multimodal Interaction Framework, EMMA (Extensible MultiModal Annotation), and the Ink Markup Language, while the Voice Browser Working Group and the Multimodal Interaction Working Group promote standardization of the voice interface and the multimodal interface, respectively.
The speech interface used in the multi-modal interface is implemented with speech recognition between a person and a terminal device, speech synthesis that converts text into speech by TTS (Text To Speech), and language processing technology, as in the MS Speech Server. Language processing techniques may be included in speech recognition and speech synthesis techniques, but recently languages such as Voice Extensible Markup Language (VoiceXML) or Speech Application Language Tags (SALT), which control voice input and output, are used.
VoiceXML is an XML (eXtensible Markup Language) language that controls voice input/output with voice conversation functions such as voice recognition, voice synthesis, and DTMF signals on the Web.
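To illustrate the kind of dialog document described above, the following is a minimal VoiceXML 2.0 sketch (not taken from the patent) of a form that prompts for a photo tag and confirms the recognized value; the grammar file tags.grxml is a hypothetical placeholder:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal VoiceXML 2.0 sketch (illustrative, not from the patent).
     The grammar file "tags.grxml" is a hypothetical placeholder. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="tagInput">
    <field name="tag">
      <prompt>Which tag would you like to attach to this photo?</prompt>
      <grammar type="application/srgs+xml" src="tags.grxml"/>
      <filled>
        <!-- Echo the recognized tag back to the user -->
        <prompt>Attaching the tag <value expr="tag"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```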
VoiceXML originated with the VoiceXML Forum, founded in August 1999 by AT&T, Lucent Technologies, Motorola, and IBM, which grew to about 400 member companies. In March 2000 the VoiceXML 1.0 specification was submitted to the W3C Voice Browser Working Group; the VoiceXML 2.0 specification was proposed in October 2001 and revised in March 2004. For speech recognition grammars, VoiceXML 2.0 replaced the Java Speech Grammar Format (JSGF) used in VoiceXML 1.0 with the Speech Recognition Grammar Specification (SRGS), finalized in March 2004, and the Speech Synthesis Markup Language (SSML) for speech synthesis was finalized in 2004.
Speech Application Language Tags (SALT) is an extension of Web standard languages such as HTML, XHTML, and XML designed to make it easier for developers to build speech and multimodal applications. In the early 2000s, companies such as Cisco, Intel, Comverse, Philips Speech Processing, and ScanSoft gathered around Microsoft to define it; it is a multi-modal interface language that can control input and output commands using voice, pen, and mouse. If VoiceXML is a high-level language for phone-based voice application developers, SALT is a low-level language for web-based developers.
The biggest difference between VoiceXML and SALT is that VoiceXML works independently of HTML and is difficult to use together with it, while SALT defines tags related to speech recognition and synthesis inside HTML so that they can be used simultaneously and multimodally.
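For contrast, the following is a hedged sketch of how SALT embeds speech tags directly in HTML; the element names follow the SALT 1.0 specification, while the grammar file and the bound text field are hypothetical:

```xml
<!-- Hedged SALT sketch (illustrative): speech tags embedded directly in HTML.
     The grammar file and the bound input element are hypothetical. -->
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body>
    <input name="txtTag" type="text" />
    <salt:prompt id="askTag">Say a tag for this photo.</salt:prompt>
    <salt:listen id="recoTag">
      <salt:grammar src="tags.grxml" />
      <!-- Copy the recognition result into the HTML text field -->
      <salt:bind targetelement="txtTag" value="//tag" />
    </salt:listen>
  </body>
</html>
```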
Three multimodal interface methods are considered: 1) implementing voice recognition, voice synthesis, and dialogue technology in the terminal device itself so that commands can be given and results heard naturally by voice; 2) providing the voice interface function on a server and simply connecting the terminal to the voice interface through a telephone company's telephone network; and 3) a hybrid of the two (1 + 2). In particular, the W3C (World Wide Web Consortium) has formed a study group for multimodal platform research in order to standardize it.
A multi-modal interface means an interface using voice, keyboard, and pen for communication between a human and a terminal; it uses voice, pen, text, and keyboard typing as input and provides the terminal's processing results as voice, audio, and video output. A multi-modal interface that follows the new W3C software architecture standard uses State Chart XML (SCXML), a Harel-statechart-based conversational modeling language. SCXML is an asynchronous, event-based, general-purpose state machine language that works in the interaction manager (conversation manager), which invokes scenario scripts written in SCXML and then runs modality components that process XML markup such as XHTML, VoiceXML, and Scalable Vector Graphics (SVG).
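As a minimal sketch of the kind of state machine such an interaction manager would execute, assuming a simplified tagging dialogue with hypothetical state and event names:

```xml
<!-- Minimal SCXML sketch of a tagging dialogue state machine
     (state and event names are hypothetical illustrations). -->
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="Idle">
  <state id="Idle">
    <!-- A new image arrives from the user terminal -->
    <transition event="image.uploaded" target="Tagging"/>
  </state>
  <state id="Tagging">
    <!-- The user confirms a recommended tag -->
    <transition event="tag.confirmed" target="Stored"/>
    <!-- The user rejects it; ask again with the next candidate -->
    <transition event="tag.rejected" target="Tagging"/>
  </state>
  <final id="Stored"/>
</scxml>
```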
Scalable Vector Graphics (SVG) is an XML-based language designed by the W3C to represent two-dimensional graphics.
Currently, the X+V multimodal system developed by IBM, which allows more than one modality, uses a nested structure of XHTML and VoiceXML.
FIG. 1 is a diagram illustrating the multimodal interaction framework proposed by the W3C in the prior art.
The multimodal interaction framework system includes an input element, an interaction manager (conversation manager) 20, an output element, a session component, and a system & environment component 23.
The interaction manager (conversation manager) 20 executes the actual application service using the information obtained from the input element and provides the result to the output element. Currently, the Harel state diagram, a kind of finite state diagram, is used for this purpose.
The session component supports session management functions, such as maintaining state information for the user's interaction across sessions.
The system & environment component 23 provides an environment so that the output mode can be adapted automatically to a portable terminal, a vehicle terminal, or a desktop according to the state of the terminal and the user environment.
FIG. 2 is a detailed structural diagram of the input elements of the multimodal interaction framework.
The input element consists of a recognition module, an interpretation module, and an integration module; the recognition module captures the user's natural input, such as speech or handwriting, and converts it into a form useful for later processing.
The interpretation module semantically interprets the result recognized by the recognition module and converts it into a single representative form. For example, the interpretation module converts different affirmative expressions such as 'yes', 'yeah', and 'OK' into the one representative form 'yes'. The results of the interpretation module are converted into EMMA and passed to the integration module.
The integration module integrates information from several modalities, such as voice and a pointing device, and delivers the integrated information to the interaction manager.
The input element uses voice, pen, keyboard, or GPS information, and in the long term sensor input is also possible. When such information is input, the multimodal service can also become a context-aware service.
EMMA is a standard language that connects the input elements to the interaction manager. It is a markup language that expresses processing results so that data can be exchanged between the different components of a multimodal system. For example, EMMA allows metadata such as the input time, confidence values for recognition results, and the various input types to be represented.
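For example, a spoken input such as 'birthday party' annotated with EMMA 1.0 metadata might look like the following sketch, in which the timestamps, confidence value, and application-specific <tag> element are illustrative:

```xml
<!-- Illustrative EMMA 1.0 annotation of a spoken input "birthday party".
     Timestamps, confidence, and the application-specific <tag> element
     are hypothetical. -->
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="voice1"
      emma:medium="acoustic" emma:mode="voice"
      emma:confidence="0.87"
      emma:start="1203580800000" emma:end="1203580801500"
      emma:tokens="birthday party">
    <tag>birthday party</tag>
  </emma:interpretation>
</emma:emma>
```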
The ink markup language is a language that expresses the result of recognizing handwriting, based on XML. The handwriting may be a drawing, an emphasis mark, a signature, plain handwritten text, a mathematical expression, or a musical symbol.
FIG. 3 is a detailed block diagram of the output elements of the multimodal interaction framework.
The output element consists of a generation module, a style module, and a rendering module.
The generation module determines in which mode, such as voice or graphics, to present information when information to be transmitted to the user is input from the interaction manager.
The style module adds information about how the output is rendered; for example, information about how a graphic is placed on the screen, or about the spacing between words when voice is output. The multimodal interface uses Cascading Style Sheets (CSS) to control speech output, Extensible HyperText Markup Language (XHTML) to output graphics, and expresses voice output in the Speech Synthesis Markup Language (SSML).
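A short SSML sketch, not from the patent, showing the kind of rendering hints the style module can add to voice output, such as a pause, emphasis, and speaking rate:

```xml
<!-- SSML sketch (illustrative) of rendering hints for voice output:
     a pause, emphasis, and a slower speaking rate. -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  I found three candidate tags.
  <break time="500ms"/>
  <emphasis>sunset</emphasis>, beach,
  <prosody rate="slow">or family</prosody>.
</speak>
```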
The rendering module outputs the graphics drawn on the screen or the voice generated through the style module. The output element of the multi-modal interface can control home appliances, robots, and the like when an actuator function for situation awareness is added, and it can be processed in a distributed manner according to the computing power of the terminal.
FIG. 4 is a screen illustrating a user-centered online image storage and retrieval service (flickr).
flickr (http://www.flickr.com/) is an online image management system that applies Web 2.0 technologies, such as the bookmark (favorites) sharing of del.icio.us (http://del.icio.us/), to tagging and storing photos; by collecting a large number of tagged online photo images by category, it builds collective intelligence and provides a service that makes user-oriented online image storage, retrieval, and sharing possible.
A tag is a keyword or category that classifies information by attaching words related to pictures, music, video clips, and the like and storing them on a search server; this user-driven classification is called a folksonomy, a compound of 'folks' and 'taxonomy' meaning classification managed by people.
However, no conventional online image system existed that used a multimodal interface combining voice recognition, a touch pad, and dialogue processing technology to attach tag information to uploaded pictures through keyboard, voice, or touch-pad input, tagging images by voice and thereby implementing collective intelligence.
The present invention has been proposed to solve the problems of the prior art. It is an interface that can be used easily even by users who are not accustomed to IT services: the user selects image data, such as a video or photo, into which tag information is to be input, uploads the image, selects a specific object through a multi-modal interface such as voice, keyboard, or touch pad, and then receives tag candidates or inputs tag information by voice, after which the tag information related to the uploaded image is automatically stored in the photo-tag DB. The purpose of the present invention is to provide a multi-modal interactive image management system and method that accept voice recognition and touch-pad input at the same time and use natural language dialogue for menu selection, search-word input, and tag information input.
This process can be performed by voice input or by using the keyboard; when the user speaks freely, semantic information is extracted by the conversation management system so that all the controls necessary for image management, such as menu movement and synonym or related-word search, can be operated through free conversation. In addition, the multi-modal interactive image management system provides the ability to tag all images automatically through learning once only a few images have been tagged.
In order to achieve the object of the present invention, the present invention provides a multi-modal interactive image management system comprising: an image recognition module which, when a new image is provided from the user terminal, identifies the objects included in the image and generates image classification information that classifies the image by subject according to the identification result; a tagnet management module for generating tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information; a natural language dialogue module for exchanging question and response information with the user terminal in a natural language format and selecting a tag using the tag candidate recommendation information; a multi-modal interface processing module for receiving and integrating the response information from the user terminal through one or more interfaces; and an image DB for storing the image and the selected tag.
The system further comprises a conversation model DB, connected with the natural language dialogue module, storing a plurality of examples for determining the semantic structure of the response information and the user's intention; and a dialogue example DB storing a plurality of conversation pairs corresponding to the semantic structure of the response information and the user's intention.
The natural language dialogue module classifies the semantic structure of the response information and the user's intention, finds the dialogue pair mapped to that semantic structure and intention, and provides it to the user as the query information.
The system further comprises an image management application interworking module for searching for and providing an image and a tag associated with the image when requested by the user terminal.
The user terminal is characterized in that it is a digital photo frame.
The at least one interface may include a voice input method and a non-voice input method.
The non-voice input method may include an input method using a keyboard or a touch pad.
The system further comprises an image classification training model DB connected with the image recognition module and storing image classification information related to the subjects of a plurality of images.
The system further comprises a tag ontology DB connected to the tagnet management module and storing the tag examples, including words related to each subject of an image.
The multi-modal interface processing module includes a voice modality unit for voice input and output; an HTML modality unit for non-voice input and output; and a runtime framework for identifying and interpreting the contents of the voice input and output data.
The voice modality unit may include a voice I/O processor for voice input/output between the user and the terminal; an IP-CCS for IP-based voice call input/output from the voice I/O processor; a voice recognition unit for recognizing the user's voice; a voice synthesizer for converting text into voice; and a VoiceXML interpreter for interpreting the input/output data of the voice recognition unit and the voice synthesizer.
The voice I/O processor and the IP-CCS communicate with each other by RTP.
The runtime framework includes an integration module for integrating the voice input and output data in conjunction with an EMMA parser; an interaction manager for controlling the voice modality unit by interworking the integrated voice input/output data transmitted from the integration module with an SCXML parser; a session data manager for performing data support and management for session management of the user; a DCI for storing attributes of the terminal and preference data of the user; and a generation module which, when information to be transmitted to the user is input from the interaction manager, determines whether to output it in voice or graphics mode and transmits it to the voice modality unit or the HTML modality unit.
The present invention provides a multi-modal interactive image management method, comprising: (a) receiving an image uploaded from a user terminal; (b) the image recognition module distinguishing the objects included in the image and generating image classification information by using the subject classification of a plurality of images stored in an image classification training model DB; (c) the tagnet management module generating tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information; (d) the natural language dialogue module generating query information based on the tag candidate recommendation information extracted in step (c) and providing it to the user terminal through a multi-modal interface providing module; (e) the multimodal interface providing module receiving the tag information desired by the user through the user terminal by one or more interface methods and integrating the response information; and (f) receiving image-related information through the multi-modal interface providing module and storing the image and the tag related to the image in an image DB.
The step (e) may include: (e1) the natural language dialogue module receiving the integrated response information and determining the semantic structure of the response information and the user's intention through a plurality of examples stored in a conversation model DB; and (e2) generating the query information through the plurality of conversation pairs stored in a dialogue example DB and providing it to the user terminal through the multi-modal interface providing module.
In step (e), the at least one interface method is characterized in that it comprises a voice method and a non-voice method.
The image selection may be input by the non-voice method, and the tag information corresponding to the image may be input by voice.
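As a hedged illustration of how such a combined input could be integrated, the following EMMA 1.0 sketch groups a touch-based image selection with a spoken tag; the <image-id> and <tag> payload elements are hypothetical application data:

```xml
<!-- Hedged sketch: an EMMA 1.0 composite input grouping a touch-based
     image selection with a spoken tag. The <image-id> and <tag>
     payload elements are hypothetical application data. -->
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:group id="tagPhoto">
    <emma:interpretation id="touch1"
        emma:medium="tactile" emma:mode="touch">
      <image-id>photo_0042</image-id>
    </emma:interpretation>
    <emma:interpretation id="voice1"
        emma:medium="acoustic" emma:mode="voice" emma:confidence="0.91">
      <tag>grandmother</tag>
    </emma:interpretation>
  </emma:group>
</emma:emma>
```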
The method may further comprise (g) the image management application interworking module searching for and providing an image and a tag related to the image from the image DB according to the response information.
The present invention also provides a recording medium readable by a computer or digital photo frame, recording a program for realizing: (a) a function of uploading an image from a user terminal to the multimodal interactive image management system; (b) a function of the image recognition module distinguishing the objects included in the image and generating image classification information by using the subject classification of a plurality of images stored in an image classification training model DB; (c) a function of the tagnet management module generating tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information; (d) a function of the natural language dialogue module generating query information based on the tag candidate recommendation information extracted in the function (c) and providing it to the user terminal through a multi-modal interface providing module; (e) a function of receiving the tag information desired by the user through the user terminal by one or more interface methods and integrating the response information; and (f) a function of receiving image-related information through the multi-modal interface providing module and storing the image and the tag and memo information related to the image in an image DB.
In the multi-modal interactive image management system and method according to the present invention, after a user uploads a photo or video to the website, the image is automatically classified using an image recognition system, a photo-tag DB, and a tagnet, and tags related to the photo or video are recommended automatically, or tag information is found through conversation, using a multi-modal interface that combines speech recognition and the touch pad with natural language processing technology. The user can call menus or enter search terms in conversational form and input tagging information simultaneously by touch or mouse click, and the user's intention and semantic information are classified using a trained language understanding model so that the tagging information is attached to the image.
By providing simple program control and a user interface, the present invention can generate revenue through sales of the original technology, which makes it easier for people at various levels who are unfamiliar with IT services to control all types of applications.
In addition, the present invention can be used as the interface of any service in which voice and touch-pad input can be given simultaneously, such as digital photo frames, home network services, IPTV control, robots, portal services, civil document issuing machines, and navigation. These technologies can also be used in intelligent robots as multimodal interface modules specialized for voice and sound processing functions.
Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. Prior to this, the terms and words used in the specification and claims should not be construed in their conventional or dictionary meanings, but should be interpreted with the meanings and concepts that correspond to the technical idea of the present invention, based on the principle that the inventors may properly define the concepts of terms in order to best explain their own invention. Therefore, the embodiments described in the specification and the configurations shown in the drawings are only the most preferred embodiments of the present invention and do not represent all of its technical idea; it should be understood that various equivalents and modifications could replace them at the time of filing of the present application.
FIG. 5 is a block diagram of a multi-modal interactive image management system according to the present invention.
The multi-modal interactive image management system according to the present invention includes an image recognition module, an image classification training model DB, a tagnet management module, a tag ontology DB, a natural language dialogue module, a multi-modal interface processing module, an image management application interworking module, and an image DB.
The image recognition module distinguishes the objects included in a new image provided from the user terminal and generates image classification information that classifies the image by subject according to the identification result.
The image classification training model DB is connected with the image recognition module and stores image classification information related to the subjects of a plurality of images.
That is, the image recognition module classifies the new image by subject by referring to the image classification information stored in the image classification training model DB.
Here, the tagnet management module extracts tag candidate recommendation information, including a plurality of tag examples corresponding to the image classification information, using a tag ontology constructed with words associated with each subject.
The tag ontology DB is connected with the tagnet management module and stores the tag examples, including words related to each subject of an image.
When the user speaks, the natural language dialogue module classifies the semantic structure of the utterance and the user's intention and selects the tag desired by the user through dialogue using the tag candidate recommendation information.
The natural language dialogue module finds the dialogue pair mapped to the classified semantic structure and user's intention and provides it to the user as query information.
In order to perform this function, the natural language dialogue module is connected with a conversation model DB, which stores a plurality of examples for determining the semantic structure of the response information and the user's intention, and with a dialogue example DB, which stores a plurality of conversation pairs corresponding to that semantic structure and intention.
The multi-modal interface processing module receives image-related information, including the response information and a memo about the image, from the user terminal through one or more interfaces, which may include a voice input method and a non-voice input method, and integrates it into one piece of information.
Here, the input method using a keyboard or a touch pad may be applied as the non-voice input method.
As a technique applied to process the above-described functions, multimodal interface processing keeps a user's character, voice, and touch-pad inputs synchronized with the computer and provides an appropriate service by integrating the input information while maintaining that synchronization. That is, in light of the technical concept of the present invention, a user uploads an image to the multi-modal interactive image management system and then attaches tag information to it by using voice and non-voice input at the same time.
In addition, the natural language dialogue module extracts semantic information from the user's free speech, so that all the controls necessary for image management, such as menu movement and search, can be operated through free conversation.
The image management application interworking module searches for and provides an image and the tags associated with the image when requested by the user terminal.
That is, the image management application interworking module retrieves the image and its related tag and memo information from the image DB according to the user's search request and provides them to user terminals such as PCs and digital photo frames.
FIG. 6 is a diagram illustrating an embodiment of the multimodal interface (MMI) processing module.
The multimodal interface (MMI) processing module is composed of a Voice Modality Component, which includes a Voice I/O Processor for voice input and output, a Runtime Framework, an OSGi Service Bundle, and an HTML Modality Component for non-voice input/output.
The Voice Modality Component consists of a Voice I/O Processor for voice input and output between the user and the terminal, the Real-Time Transport Protocol (RTP) for transmitting the voice stream, an IP Call Control Server (IP-CCS) for IP-based voice call control, an Automatic Speech Recognizer (ASR) for recognizing speech between people and terminals, a Text-to-Speech (TTS) engine for converting text into speech, and a VoiceXML Interpreter for recognizing and interpreting the speech recognition and speech synthesis data.
In the present invention, the interaction manager uses VoiceXML for the voice interface between the user and the terminal, and XHTML for the non-voice interface.
The Runtime Framework includes an Integration Module, an EMMA Parser, an Interaction Manager, an SCXML Parser, a Session/Data Model, a Delivery Context Interface (DCI), and a Generation Module.
The Integration Module integrates modality information, such as voice and pointing-device input, in conjunction with the EMMA Parser and transmits it to the Interaction Manager.
The Interaction Manager works in conjunction with the SCXML Parser to control and supervise the Voice Modality Component for interaction between the user and the system, processes the combined multimodal input delivered from the user, and connects with external modules.
The session data management unit (Session/Data Model) performs data support and management for session management.
The DCI (Delivery Context Interface) is a DB that stores personal profile information and terminal environment information in a synchronous messaging manner to enable more personalized service; it stores device attributes and user preference data.
The Generation Module determines whether to output in voice, graphics, or another mode when information to be transmitted to the user is input from the Interaction Manager, and delivers the content to be output to the Voice Modality Component or the HTML Modality Component.
FIG. 7 is a flowchart illustrating a multi-modal interactive image management method for recommending and storing a tag after an image is uploaded according to the present invention.
The multi-modal interactive image management system first receives an image uploaded from the user terminal, and the image recognition module distinguishes the objects included in the image and generates image classification information by using the subject classification of the plurality of images stored in the image classification training model DB.
The tagnet management module then generates tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information.
The natural language dialogue module generates query information based on the extracted tag candidate recommendation information and provides it to the user terminal through the multi-modal interface providing module.
The multi-modal interface providing module receives the tag information desired by the user through one or more interface methods and integrates the response information, after which the image and the tag related to the image are stored in the image DB.
The image management application interworking module searches for and provides an image and a tag related to the image from the image DB according to the response information.
FIG. 8 is a flowchart illustrating the function of the natural language dialogue module.
The natural language dialogue module receives the integrated response information and determines the semantic structure of the response information and the user's intention through the plurality of examples stored in the conversation model DB.
The natural language dialogue module then generates query information through the plurality of conversation pairs stored in the dialogue example DB and provides it to the user terminal through the multi-modal interface providing module.
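The patent does not specify a storage schema for the dialogue example DB, but a hypothetical dialogue-pair record, keyed by semantic structure and user intention, might look like the following sketch:

```xml
<!-- Hypothetical dialogue-pair record; the patent does not specify
     a schema for the dialogue example DB. -->
<dialogue-pair>
  <semantic-structure act="inform" slot="tag"/>
  <user-intention>attach_tag</user-intention>
  <user-utterance>Tag this one as a family trip.</user-utterance>
  <system-query>Shall I attach the tag "family trip" to this photo?</system-query>
</dialogue-pair>
```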
In this way, the present invention, after an image is uploaded from the user terminal to the multi-modal interactive image management system, attaches tag and memo information to the image through dialogue with the user and stores it so that a tag-based image retrieval service can be provided.
According to the present invention, after a user uploads a picture or video to the website, the picture is automatically classified using an image recognition system, a photo-tag DB, and a tagnet, and tags related to the picture or video are automatically recommended. Using the multimodal interface that combines voice recognition and the touch pad with natural language dialogue processing technology, the user inputs tagging information through dialogue, the user's intention and semantic information are classified with a trained language understanding model, and the tag information is attached to the video or image, implementing collective intelligence to provide an image or video search service.
In addition, the multi-modal interactive image management system can be used as an interface for all services in which voice and touch-pad input can be given simultaneously, such as digital photo frames, home network services, IPTV control, robots, portal services, civil document issuing machines, and navigation. The technology can be used as an intelligent multi-modal interface module specialized for voice and sound processing functions.
As described above, the method of the present invention may be implemented as a program and stored in a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.) in a computer-readable form.
As described above, although the present invention has been described with reference to a preferred embodiment, those skilled in the art will understand that the present invention may be variously modified and changed without departing from the spirit and scope of the invention described in the claims below.
FIG. 1 is a diagram of the multimodal interaction framework proposed by the prior art W3C.
FIG. 2 is a detailed structural diagram of the input elements of the conventional multimodal interaction framework.
FIG. 3 is a detailed configuration diagram of the output elements of the conventional multimodal interaction framework.
FIG. 4 is a screen illustrating a user-centered online image storage and retrieval service.
FIG. 5 is a block diagram of a multimodal interactive image management system according to the present invention.
FIG. 6 is a diagram illustrating an embodiment of the multimodal interface (MMI) processing module.
FIG. 7 is a flowchart illustrating a multi-modal interactive image management method for recommending and storing a tag after an image is uploaded according to the present invention.
FIG. 8 is a flowchart illustrating the function of the natural language dialogue module.
Claims (19)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020080015938A KR20090090613A (en) | 2008-02-21 | 2008-02-21 | System and method for multimodal conversational mode image management |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020080015938A KR20090090613A (en) | 2008-02-21 | 2008-02-21 | System and method for multimodal conversational mode image management |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20090090613A true KR20090090613A (en) | 2009-08-26 |
Family
ID=41208371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020080015938A KR20090090613A (en) | 2008-02-21 | 2008-02-21 | System and method for multimodal conversational mode image management |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20090090613A (en) |
2008-02-21: KR application KR1020080015938A filed; published as KR20090090613A (en); status: not active, Application Discontinuation
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101143450B1 (en) * | 2010-11-24 | 2012-05-23 | 단국대학교 산학협력단 | Providing method and system of emotional augmented memory service |
WO2012093811A1 (en) * | 2011-01-04 | 2012-07-12 | (주)올라웍스 | Method for support in such a way as to allow collection of objects comprised in an input image, and a recording medium able to be read by terminal devices and computers |
US8467577B2 (en) | 2011-01-04 | 2013-06-18 | Intel Corporation | Method, terminal, and computer-readable recording medium for supporting collection of object included in inputted image |
KR101219469B1 (en) * | 2011-03-29 | 2013-01-11 | 서울대학교산학협력단 | Methods for Multimodal Learning and Classification of Multimedia Contents |
KR20130100448A (en) * | 2012-03-02 | 2013-09-11 | 엘지전자 주식회사 | Mobile terminal and method for controlling thereof |
KR20140042492A (en) * | 2012-09-28 | 2014-04-07 | 엘지전자 주식회사 | Mobile terminal and operating method for the same |
KR20150016776A (en) * | 2013-08-05 | 2015-02-13 | 삼성전자주식회사 | Interface device and method supporting speech dialogue survice |
KR20160084748A (en) * | 2015-01-06 | 2016-07-14 | 포항공과대학교 산학협력단 | Dialogue system and dialogue method |
US10902262B2 (en) | 2017-01-19 | 2021-01-26 | Samsung Electronics Co., Ltd. | Vision intelligence management for electronic devices |
US10909371B2 (en) | 2017-01-19 | 2021-02-02 | Samsung Electronics Co., Ltd. | System and method for contextual driven intelligence |
CN109176535A (en) * | 2018-07-16 | 2019-01-11 | 北京光年无限科技有限公司 | Exchange method and system based on intelligent robot |
CN109176535B (en) * | 2018-07-16 | 2021-10-19 | 北京光年无限科技有限公司 | Interaction method and system based on intelligent robot |
CN110209784A (en) * | 2019-04-26 | 2019-09-06 | 腾讯科技(深圳)有限公司 | Method for message interaction, computer equipment and storage medium |
CN110209784B (en) * | 2019-04-26 | 2024-03-12 | 腾讯科技(深圳)有限公司 | Message interaction method, computer device and storage medium |
WO2023058944A1 (en) * | 2021-10-08 | 2023-04-13 | 삼성전자주식회사 | Electronic device and response providing method |
KR102492277B1 (en) * | 2022-06-28 | 2023-01-26 | (주)액션파워 | Method for qa with multi-modal information |
US11720750B1 (en) | 2022-06-28 | 2023-08-08 | Actionpower Corp. | Method for QA with multi-modal information |
CN115062131A (en) * | 2022-06-29 | 2022-09-16 | 支付宝(杭州)信息技术有限公司 | Multi-mode-based man-machine interaction method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR20090090613A (en) | System and method for multimodal conversational mode image management | |
US10395654B2 (en) | Text normalization based on a data-driven learning network | |
RU2710984C2 (en) | Performing task without monitor in digital personal assistant | |
KR100561228B1 (en) | Method for VoiceXML to XHTML+Voice Conversion and Multimodal Service System using the same | |
US8073700B2 (en) | Retrieval and presentation of network service results for mobile device using a multimodal browser | |
CN100578614C (en) | Semantic object synchronous understanding implemented with speech application language tags | |
WO2022057712A1 (en) | Electronic device and semantic parsing method therefor, medium, and human-machine dialog system | |
GB2383247A (en) | Multi-modal picture allowing verbal interaction between a user and the picture | |
US20030144843A1 (en) | Method and system for collecting user-interest information regarding a picture | |
JP2009205579A (en) | Speech translation device and program | |
KR20170014353A (en) | Apparatus and method for screen navigation based on voice | |
JP2004310748A (en) | Presentation of data based on user input | |
WO2010124512A1 (en) | Human-machine interaction system and related system, device and method thereof | |
EP3550449A1 (en) | Search method and electronic device using the method | |
KR20240012245A (en) | Method and apparatus for automatically generating faq using an artificial intelligence model based on natural language processing | |
JP2010026686A (en) | Interactive communication terminal with integrative interface, and communication system using the same | |
JP7139157B2 (en) | Search statement generation system and search statement generation method | |
US11714599B2 (en) | Method of browsing a resource through voice interaction | |
Johnston | Extensible multimodal annotation for intelligent interactive systems | |
KR20020013148A (en) | Method and apparatus for internet navigation through continuous voice command | |
Alonso-Martín et al. | Multimodal fusion as communicative acts during human–robot interaction | |
US20240106776A1 (en) | Sign Language Translation Method And System Thereof | |
Griol et al. | The VoiceApp system: Speech technologies to access the semantic web | |
Jeevitha et al. | A study on innovative trends in multimedia library using speech enabled softwares | |
Habeeb et al. | Design module for speech recognition graphical user interface browser to supports the web speech applications |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WITN | Withdrawal due to no request for examination |