KR20090090613A - System and method for multimodal conversational mode image management - Google Patents

System and method for multimodal conversational mode image management Download PDF

Info

Publication number
KR20090090613A
Authority
KR
South Korea
Prior art keywords
image
information
module
voice
tag
Prior art date
Application number
KR1020080015938A
Other languages
Korean (ko)
Inventor
김현정
장두성
Original Assignee
주식회사 케이티
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 케이티 filed Critical 주식회사 케이티
Priority to KR1020080015938A priority Critical patent/KR20090090613A/en
Publication of KR20090090613A publication Critical patent/KR20090090613A/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention relates to a multimodal interactive image management system and method. The system comprises an image recognition module that, when a new image is input from a user terminal, identifies the objects contained in the image and generates image classification information; a tagnet management module that, based on the image classification information produced by the image recognition module, extracts tag candidate recommendation information for the image using a tag ontology built from words associated with each subject; a natural language dialogue module that selects the desired tag through dialogue with the user using the extracted tag candidate recommendation information; a multimodal interface processing module that receives image-related information, including a memo about the image, from the user terminal and integrates it into one piece of information; and an image DB that stores images together with their related tags and memo information. After the user uploads photos or videos to the multimodal interactive image management system, the multimodal interface (voice, keyboard, touch pad) and the dialogue module are used to attach tag and memo information to the images through interaction with the user. Once this information is stored in the photo-tag DB, the tags and memos related to the photos or videos are provided to user terminals such as PCs and digital photo frames.

Description

System and method for multimodal conversational mode image management

The present invention relates to a multimodal interactive image management system and method, and in particular to a system and method in which, after a user uploads a photo or video to the multimodal interactive image management system, the system uses an image recognition module, a photo-tag DB, and a tagnet to recommend tags related to the photo or video, or attaches tag information to the image through dialogue with the user over a multimodal interface that combines voice recognition and a touch pad with a dialogue module, thereby providing a tag-based image information retrieval service.

In recent years, voice recognition and speech synthesis techniques have been developed, and the necessity of a multimodal interface using voice and a pen for a terminal such as a mobile terminal, a home network terminal, a robot, and the like is increasing.

A modality is a channel obtained by modeling a human sensory channel, such as sight, hearing, touch, taste, or smell, on a machine; "multimodal" refers to the use of several such modalities together, and the interaction among these modalities is called multimodal interaction.

Multimodal interfaces are classified along two axes: by temporal relationship into sequential and simultaneous multimodality, and by the manner in which modalities are combined into alternative (substitutive) and supplementary (complementary) multimodality.

A multimodal interface uses not only voice but also keyboard, pen, and graphic interfaces between the user and the terminal device: the user can input information by voice, keyboard, pen, and so on, and the results are returned through a corresponding output interface.

Multimodal inputs include voice, pen (or keypad), keyboard, mouse, touchscreen, lip movement input, gesture input, eye movement input, and the like.

Multimodal output includes graphic elements such as icons, text, and tables; sound elements such as earcons, sound effects, and synthesized speech; and haptic elements such as vibration.

The multimodal interface is divided into a voice interface for processing voice input and output, an ink interface for recognizing handwriting made with a pen, and the conventional keyboard interface. The voice interface uses the Extensible MultiModal Annotation (EMMA) format so that its results can be presented together with those of the other interfaces. The ink interface covers writing with a pen, drawing pictures, and entering mathematical notation. The keyboard interface is the one in common use today and likewise uses the EMMA format to present its results alongside those of the other interfaces.

With recent advances in speech recognition, speech synthesis, and handwriting recognition technology, and the growing need for services that use these multimodal technologies, the W3C Multimodal Interaction Working Group has adopted the Multimodal Interaction Framework and is promoting standardization of EMMA (Extensible MultiModal Annotation) and the Ink Markup Language, while the Voice Browser Working Group and the Multimodal Interaction Working Group are promoting standardization of the voice interface and the multimodal interface, respectively.

The speech interface used as part of the multimodal interface is implemented with technologies such as the MS Speech Server, combining speech recognition between a person and a terminal device, speech synthesis that converts text into speech by TTS (Text To Speech), and language processing. Language processing techniques may be included in the speech recognition and speech synthesis techniques, but recently, technologies such as Voice Extensible Markup Language (VoiceXML) and Speech Application Language Tags (SALT), which control voice input and output, have been used.

VoiceXML is an XML (eXtensible Markup Language) based language that controls voice input/output on the Web with voice dialogue functions such as voice recognition, voice synthesis, and DTMF signals.
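
As a hedged illustration (the patent itself contains no VoiceXML listing), the following Python snippet embeds a minimal VoiceXML 2.0 form of the kind just described, with a prompt, a speech grammar, and a field to be filled by voice, and parses it with the standard library; the dialog content, grammar file name, and submit URL are invented for this example.

```python
# A minimal, hypothetical VoiceXML 2.0 form: the platform plays the prompt,
# listens for an utterance matching the referenced grammar, and stores the
# result in the "tag" field before submitting it.
import xml.etree.ElementTree as ET

VXML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="add_tag">
    <field name="tag">
      <prompt>Which tag would you like to attach to this photo?</prompt>
      <grammar type="application/srgs+xml" src="tags.grxml"/>
      <filled>
        <submit next="http://example.com/save_tag" namelist="tag"/>
      </filled>
    </field>
  </form>
</vxml>"""

NS = {"v": "http://www.w3.org/2001/vxml"}
root = ET.fromstring(VXML_DOC)
for field in root.findall(".//v:field", NS):
    print("field:", field.get("name"))   # -> field: tag
```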

VoiceXML technology originated in August 1999, when AT&T, Lucent Technologies, Motorola, and IBM formed a forum that grew to about 400 member companies. In March 2000 the VoiceXML 1.0 specification was submitted to the W3C Voice Browser Working Group; the VoiceXML 2.0 specification was proposed in October 2001 and revised in March 2004. For speech recognition grammars, VoiceXML 2.0 replaced the Java Speech Grammar Format (JSGF) used in VoiceXML 1.0 with the Speech Recognition Grammar Specification (SRGS), standardized in March 2004, and the Speech Synthesis Markup Language (SSML) for speech synthesis was standardized in 2004.

Speech Application Language Tags (SALT) is an extension of Web standard languages such as HTML, XHTML, and XML intended to make it easier for developers to build speech and multimodal applications. In the early 2000s, companies such as Cisco, Intel, Comverse, Philips Speech Processing, and ScanSoft, led by Microsoft, defined it as a multimodal interface that can control input and output commands using voice, pen, and mouse. If VoiceXML is a high-level language for telephone-based voice application developers, SALT is a low-level language for Web-based developers.

The biggest difference between VoiceXML and SALT is that VoiceXML works independently of HTML and is difficult to use together with it, while SALT defines tags for speech recognition and synthesis inside HTML so that both can be used at the same time in a multimodal fashion.

Three approaches to the multimodal interface are considered: 1) implementing voice recognition, voice synthesis, and dialogue technology in the terminal device itself so that commands can be given and results heard naturally by voice; 2) providing the voice interface function on a server and simply connecting the terminal to it over a telephone company's network; and 3) a hybrid of the two (1 + 2). In particular, the W3C (World Wide Web Consortium) has formed a study group on multimodal platforms to standardize this area.

A multimodal interface is an interface that uses voice, keyboard, and pen for communication between a human and a terminal: it accepts voice, pen, text, and keyboard typing as input, and returns the terminal's processing results as voice, audio, and video output. A multimodal interface that follows the new W3C software architecture standard uses State Chart XML (SCXML), a dialogue modeling language based on Harel statecharts. SCXML is an asynchronous, event-based, general-purpose state machine language that runs in the interaction manager (conversation manager): the manager invokes scenario scripts written in SCXML and then drives modality components that process XML markup such as XHTML, VoiceXML, and Scalable Vector Graphics (SVG).
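
The sketch below, written in Python, loosely mimics the event-driven state machine behaviour that an SCXML scenario describes: an interaction manager keeps the current state and advances it when modality components raise events. The states and event names are hypothetical and are not taken from the patent.

```python
# Hypothetical states for an image-tagging dialog, modeled loosely on SCXML:
# each state maps an incoming event to the next state, as an interaction
# manager (conversation manager) would when running an SCXML scenario.
TRANSITIONS = {
    "idle":              {"image.uploaded":    "recommending_tags"},
    "recommending_tags": {"user.tag_selected": "saving",
                          "user.tag_rejected": "asking_for_tag"},
    "asking_for_tag":    {"user.tag_spoken":   "saving"},
    "saving":            {"db.saved":          "idle"},
}

def run(events, state="idle"):
    for event in events:
        state = TRANSITIONS.get(state, {}).get(event, state)  # ignore unknown events
        print(f"{event:20s} -> {state}")
    return state

run(["image.uploaded", "user.tag_rejected", "user.tag_spoken", "db.saved"])
```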

Scalable Vector Graphics (SVG) is an XML-based language designed by the W3C to represent two-dimensional graphics.

Currently, the X+V multimodal system developed by IBM, which allows more than one modality, nests VoiceXML within XHTML.

FIG. 1 is a diagram illustrating the multimodal interaction framework proposed by the W3C in the prior art.

The multimodal interaction framework includes an input element 10, an output element 30, an interaction manager 20, an application service element 21, a session element 22, and a System & Environment element 23.

The interaction manager (conversation manager) 20 executes the actual application service using the information obtained from the input element and provides the result to the output element. Currently a Harel statechart, a kind of finite state diagram, is used for this purpose.

The session element 22 is responsible for session management between the various terminals and the multimodal application services, and for a sync function when output is sent to several terminal devices.

The System & Environment element 23 provides an environment in which the output mode can automatically adapt to a portable terminal, a vehicle terminal, or a desktop according to the state of the terminal and the user's environment.

FIG. 2 is a detailed structural diagram of the input element of the multimodal interaction framework.

The input element 10 includes a recognition module that converts voice, handwriting, or keyboard input into a form that is easy to recognize and interpret, an interpretation module that semantically interprets the recognized information, and an integration module that integrates the various types of input and transmits them to the interaction manager 20.

The interpretation module semantically interprets the result received from the recognition module and converts it into a common representative text; for example, different affirmative expressions are all converted into the representative form "yes". The results of the interpretation module are converted to EMMA and passed to the integration module.
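
A minimal sketch of such an interpretation step, with a hypothetical normalization table (the concrete surface forms are invented), might look as follows.

```python
# Hypothetical normalization table for an interpretation module: several
# surface forms of a confirmation are mapped to one representative token
# before the result is wrapped in EMMA and passed to the integration module.
REPRESENTATIVE = {
    "yes": "yes", "yeah": "yes", "yep": "yes", "sure": "yes",
    "no": "no", "nope": "no", "not really": "no",
}

def interpret(recognized_text: str) -> str:
    return REPRESENTATIVE.get(recognized_text.strip().lower(), recognized_text)

print(interpret("Yeah"))   # -> yes
```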

The integration module integrates information such as voice and pointing device and delivers the information to the interaction manager 20.

The input element uses voice, pen, keyboard, or GPS information, and in the long term sensor input is also possible. When such information is used as input, the multimodal service can also become a context-aware service.

EMMA is the standard language that connects the input element 10 and the interaction manager 20. In multimodal interaction, EMMA carries the input received from the user (voice, pen, keyboard, or handwriting) together with environment information as metadata.

It is a markup language that expresses processing results to enable data exchange between different components in a multimodal system. For example, EMMA allows metadata such as input time, confidence values for recognition results, and various input types to be represented.
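
For illustration only, the following hypothetical EMMA document annotates one voice input with a recognized tag plus metadata of the kind mentioned above (modality, confidence, and timestamps); the values are invented, and the snippet simply parses the confidence back out.

```python
# A hypothetical EMMA document for one voice input: the interpretation carries
# the recognized tag together with metadata such as confidence, modality, and
# timestamps, which is the kind of information EMMA is designed to represent.
import xml.etree.ElementTree as ET

EMMA_DOC = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1"
                       emma:medium="acoustic" emma:mode="voice"
                       emma:confidence="0.87"
                       emma:start="1203580800000" emma:end="1203580801500">
    <tag>birthday party</tag>
  </emma:interpretation>
</emma:emma>"""

EMMA_NS = "{http://www.w3.org/2003/04/emma}"
interp = ET.fromstring(EMMA_DOC).find(f"{EMMA_NS}interpretation")
print(interp.get(f"{EMMA_NS}confidence"))   # -> 0.87
```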

The Ink Markup Language is an XML-based language that expresses the result of recognizing handwriting. The handwriting may be a drawing, an emphasis mark, a signature, ordinary handwritten text, a mathematical expression, or a musical symbol.

FIG. 3 is a detailed block diagram of the output element of the multimodal interaction framework.

The output element 30 is composed of a generation module, a styling module, and a rendering module.

When information to be delivered to the user arrives from the interaction manager 20, the generation module determines in which mode, such as voice or graphics, it should be output.

The styling module adds information about how the output is rendered: for example, how graphics are laid out on the screen, or the pauses between words when voice is output. The multimodal interface uses Cascading Style Sheets (CSS) to control speech output, Extensible HyperText Markup Language (XHTML) to present graphics, and the voice output itself is expressed in Speech Synthesis Markup Language (SSML).

The rendering module outputs the graphics drawn on the screen or the voice generated by the styling module. When an actuator function for situation awareness is added, the output element of the multimodal interface can control home appliances, robots, and the like, and its processing can be distributed according to the computing power of the terminal.

FIG. 4 is a screen illustrating a user-centered online image storage and retrieval service (flickr).

flickr (http://www.flickr.com/) is an online image management system that applies Web 2.0 technologies, like the bookmark-sharing (favorites) service del.icio.us (http://del.icio.us/), to let users tag and store photos. By collecting a large number of tagged online photo images by category, it builds collective intelligence and provides services that make user-oriented online image storage, search, and sharing possible.

A tag is a keyword or category that classifies information by attaching words related to pictures, music, video clips, and the like and storing them in a search server, a practice known as folksonomy (a compound of "folks" and "taxonomy", i.e., classification managed by people).
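
A toy sketch of such folksonomy-style tagging, with invented photo identifiers and tags, is a simple inverted index from tags to the items that carry them.

```python
# A hypothetical folksonomy-style index: free-form tags chosen by users map to
# the photos that carry them, so a tag works as both a keyword and a category.
from collections import defaultdict

tag_index = defaultdict(set)          # tag -> set of photo ids

def add_tags(photo_id, tags):
    for tag in tags:
        tag_index[tag.lower()].add(photo_id)

add_tags("img_001.jpg", ["beach", "sunset", "family"])
add_tags("img_002.jpg", ["beach", "dog"])
print(sorted(tag_index["beach"]))     # -> ['img_001.jpg', 'img_002.jpg']
```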

However, no conventional online image system existed that combined a multimodal interface, such as voice recognition, a touch pad, and dialogue processing technology, with keyboard, voice, and touch pad input to attach tag information to uploaded pictures by voice, tag the images, and thereby implement collective intelligence.

The present invention has been proposed to solve the problems of the prior art. Its purpose is to provide a multimodal interactive image management system and method offering an interface that can be used easily even by users unfamiliar with IT services: the user selects an image to which tag information is to be added from image data such as video and photos, uploads it, selects a specific object through a multimodal interface such as voice, keyboard, or touch pad, and either accepts a recommended tag candidate or inputs tag information by voice; the tag information related to the uploaded image is then automatically stored in the photo-tag DB. Voice recognition and touch pad input are used at the same time, and natural language dialogue is used for menu selection, search term input, and tag information input.

This process can be carried out by voice input or with the keyboard; when the user speaks freely, semantic information is extracted by the dialogue management system so that all controls necessary for image management, such as menu navigation and searching by synonyms, can be operated through free conversation. In addition, the multimodal interactive image management system provides the ability to tag all images automatically, through learning, once only a few images have been tagged.

In order to achieve the object of the present invention, a multimodal interactive image management system is provided, comprising: an image recognition module that, when a new image is provided from a user terminal, identifies the objects included in the image and generates image classification information that classifies the image by subject according to the identification result; a tagnet management module that generates tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information; a natural language dialogue module that exchanges question and response information with the user terminal in natural language format and selects a tag using the tag candidate recommendation information; a multimodal interface processing module that receives and integrates the response information from the user terminal through one or more interfaces; and an image DB that stores the image and the selected tag.

The system may further comprise a conversation model DB connected with the natural language dialogue module and storing a plurality of examples for determining the semantic structure of the response information and the user's intention; and a dialogue example DB storing a plurality of conversation pairs corresponding to the semantic structure of the response information and the user's intention.

The natural language dialogue module classifies the semantic structure of the response information and the user's intention, finds the dialogue pair mapped to the semantic structure and the user's intention, and provides the dialogue information to the user as the query information.

An image management application interworking module searches for and provides an image and a tag associated with the image when requested by the user terminal.

The user terminal may be a digital picture frame.

The at least one interface may include a voice input method and a non-voice input method.

The non-voice input method may include an input method using a keyboard or a touch pad.

The system may further comprise an image classification training model DB connected with the image recognition module and storing image classification information related to the subjects of a plurality of images.

The system may further comprise a tag ontology DB connected with the tagnet management module and storing the tag examples, including words related to each subject of an image.

The multimodal interface processing module includes a voice modality unit for voice input and output, an HTML modality unit for non-voice input and output, and a runtime framework for identifying and interpreting the contents of the voice input/output data.

The voice modality unit may include a voice I/O processor for voice input/output between the user and the terminal; an IP-CCS for IP-based voice call input/output from the voice I/O processor; a voice recognition unit for recognizing the user's voice; a voice synthesizer for converting text into voice; and a VoiceXML interpreter for interpreting the input/output data of the voice recognition unit and the voice synthesizer.

The voice I / O processor and the IP-CCS communicate with each other by RTP.

The runtime framework includes an integration module that integrates the voice input/output data in conjunction with an EMMA parser; an interaction manager that controls the voice modality unit by processing the integrated voice input/output data delivered from the integration module together with an SCXML parser; a session data manager that performs data support and management for the user's session management; a DCI that stores attributes of the terminal and the user's preference data; and a generation module that, when information to be delivered to the user is received from the interaction manager, determines whether it should be output in voice or graphics mode and transmits it to the voice modality unit or the HTML modality unit accordingly.

The present invention also provides a multimodal interactive image management method, comprising: (a) receiving an image uploaded from a user terminal; (b) the image recognition module distinguishing the objects included in the image and generating image classification information by using the subject classification of a plurality of images stored in an image classification training model DB; (c) the tagnet management module generating tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information; (d) the natural language dialogue module generating query information based on the tag candidate recommendation information extracted in step (c) and providing it to the user terminal through a multimodal interface providing module; (e) the multimodal interface providing module receiving, through the user terminal and one or more interface methods, the tag information desired by the user as response information, and integrating the response information; and (f) receiving image-related information through the multimodal interface providing module and storing the image and the tag related to the image in an image DB.

Step (e) may include: (e1) the natural language dialogue module receiving the integrated response information and determining the semantic structure of the response information and the user's intention through a plurality of examples stored in a conversation model DB; and (e2) generating the query information through the plurality of conversation pairs stored in a dialogue example DB and providing it to the user terminal through the multimodal interface providing module.

In step (e), the one or more interface methods include a voice method and a non-voice method.

The image selection may be input by the non-voice method, and the tag information corresponding to the image may be input by voice.

The method may further comprise (g) the image management application interworking module searching for and providing, from the image DB, an image and a tag related to the image according to the response information.

The present invention also provides a recording medium readable by a computer or digital photo frame, on which is recorded a program for realizing: (a) a function of the multimodal interactive image management system receiving an image uploaded from a user terminal; (b) a function of the image recognition module distinguishing the objects included in the image and generating image classification information by using the subject classification of a plurality of images stored in an image classification training model DB; (c) a function of the tagnet management module generating tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information; (d) a function of the natural language dialogue module generating query information based on the tag candidate recommendation information extracted in (c) and providing it to the user terminal through a multimodal interface providing module; (e) a function of receiving, through the user terminal and one or more interface methods, the tag information desired by the user as response information and integrating the response information; and (f) a function of receiving image-related information through the multimodal interface providing module and storing the image and the tag and memo information related to the image in an image DB.

In the multimodal interactive image management system and method according to the present invention, after a user uploads a photo or video to a website, the image is automatically classified using an image recognition system, a photo-tag DB, and a tagnet; tags related to the photo or video are recommended automatically, or tag information is found through conversation, using a multimodal interface that combines speech recognition with the touch pad and natural language processing technology. Menus can be called up or search terms entered in conversational form, and tagging information can be entered simultaneously by touch or mouse click, so that the user's intention and semantic information are classified with a trained language understanding model and the tagging information is attached to the image.

By providing simple program control and a user interface, the present invention can generate revenue through sales of original technology that makes it easier for people at various levels who are not familiar with IT services to control all types of applications.

In addition, the present invention can be used as the interface of all services that can simultaneously accept voice and touch pad input, such as digital photo frames, home network services, IPTV control, robots, portal services, civil document issuing machines, and navigation. The technology can also be used in intelligent robots as a multimodal interface module specialized for voice and sound processing functions.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. The terms and words used in the specification and claims should not be construed in their conventional or dictionary meanings; based on the principle that the inventor may properly define the concept of a term in order to best explain his or her own invention, they should be interpreted with the meanings and concepts that correspond to the technical idea of the present invention. Therefore, the embodiments described in this specification and the configurations shown in the drawings are only the most preferred embodiments of the present invention and do not represent all of its technical idea, and it should be understood that various equivalents and modifications that could replace them may exist at the time of this application.

FIG. 5 is a block diagram of a multimodal interactive image management system according to the present invention.

The multimodal interactive image management system 100 includes an image recognition module 110, an image classification training model DB 111, a tagnet management module 120, a tag ontology DB 121, a natural language dialogue module 130, a dialogue model DB 131, a multimodal interface processing module 140, an image management application interworking module 145, and an image DB 150.

The image recognition module 110 is a module for automatically distinguishing objects and finding image classification information by using an image classification training model DB 111 when an image is input from a user terminal.

The image classification training model DB 111 is connected to the image recognition module 110 and stores pre-built image classification training model information.

That is, the image recognition module 110 preprocesses the image received from the user terminal to extract a characteristic pattern, and then repeatedly compares that pattern with the plurality of example images stored in the image classification training model DB 111, so as to identify, with a probability, the objects of the most similar image and extract the classification information of the image accordingly.
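
A minimal sketch of this behaviour is shown below, under the assumption that the "characteristic pattern" is a plain feature vector and that the training model DB holds labelled example vectors compared by nearest-neighbour distance; the patent does not fix the feature representation or the classifier, and all values here are invented.

```python
# Nearest-neighbour classification against a hypothetical image classification
# training model DB (subject -> example feature vectors); the result plays the
# role of the image classification information produced by module 110.
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

TRAINING_DB = {
    "person":    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],
    "landscape": [[0.1, 0.6, 0.3], [0.2, 0.5, 0.3]],
    "food":      [[0.3, 0.3, 0.4]],
}

def classify(feature_vector):
    """Return the most similar subject and a similarity score."""
    best_subject, best_dist = None, float("inf")
    for subject, examples in TRAINING_DB.items():
        for example in examples:
            d = distance(feature_vector, example)
            if d < best_dist:
                best_subject, best_dist = subject, d
    return {"subject": best_subject, "similarity": 1.0 / (1.0 + best_dist)}

print(classify([0.75, 0.15, 0.10]))   # -> {'subject': 'person', ...}
```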

When the classification information of the input image has been extracted by the image recognition module 110, the tagnet management module 120 uses the tag ontology DB 121, which is constructed from words associated with each subject, to extract words that are similar or related to the classified subject, thereby producing a plurality of words that are tag candidates for the image (hereinafter referred to as tag candidate recommendation information).

Here, the tagnet management module 120 extracts the related tag set from actual tagging information rather than from dictionary meanings.

The tag ontology DB 121 is connected to the tagnet management module 120 and stores words for each topic for tag recommendation related to an image.
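
As a hedged sketch, a tag ontology of this kind can be modelled as a map from image subjects to related words, from which tag candidate recommendation information is drawn; the vocabulary below is invented.

```python
# A hypothetical tag ontology: for each image subject, a set of related words
# that can serve as tag candidates for an image of that subject.
TAG_ONTOLOGY = {
    "person":    ["family", "friends", "portrait", "birthday", "wedding"],
    "landscape": ["mountain", "sea", "sunset", "travel", "hiking"],
    "food":      ["dinner", "dessert", "restaurant", "home cooking"],
}

def recommend_tags(classification, top_n=3):
    """Return tag candidate recommendation information for one classification result."""
    candidates = TAG_ONTOLOGY.get(classification["subject"], [])
    return candidates[:top_n]

print(recommend_tags({"subject": "landscape"}))  # -> ['mountain', 'sea', 'sunset']
```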

When the user speaks, the natural language dialogue module 130 determines the user's intention and semantic information according to the dialogue model stored in the dialogue model DB (not shown), and on that basis searches the dialogue example DB (not shown) for the most suitable conversation pair and responds to the user; it is the module that processes menu navigation, tagging information input, search term input, tag recommendation, and query responses.

The natural language dialogue module 130 converses with the user, through the multimodal interface processing module 140, on the basis of the extracted tag candidate recommendation information so that the user can select the desired information among the tag candidates for the image; alternatively, it understands the user's intention, classifies the semantic structure, selects the tag most suitable for the image, and stores the image together with its related tag and memo information in the image DB 150.

To perform this function, the natural language dialogue module 130 is connected to a dialogue example DB storing tens of thousands of dialogue examples related to image tagging, and to a dialogue model DB that classifies the semantic structure and the user's intention on the basis of dialogues collected for recommending tags or adding memos to images, using the information received from the multimodal interface processing module 140 during the dialogue with the user.
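
The sketch below illustrates example-based dialogue management in a deliberately simplified form: the user utterance is matched against stored conversation pairs by token overlap, standing in for the trained dialogue model, and the system reply of the closest pair is returned. All conversation pairs are invented.

```python
# A toy dialogue example DB of (user utterance, system reply) pairs, matched by
# token overlap; a real system would use the trained dialogue model instead.
DIALOGUE_EXAMPLE_DB = [
    ("please tag this photo with sunset", "I added the tag 'sunset' to the photo."),
    ("show me pictures of my dog",        "Here are the photos tagged 'dog'."),
    ("add a memo that this was grandma's birthday",
     "I attached the memo to the photo."),
]

def respond(utterance):
    words = set(utterance.lower().split())
    best_pair = max(DIALOGUE_EXAMPLE_DB,
                    key=lambda pair: len(words & set(pair[0].split())))
    return best_pair[1]

print(respond("can you tag this one with sunset please"))
# -> I added the tag 'sunset' to the photo.
```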

The multimodal interface processing module 140 receives input from the user terminal in both voice and non-voice input manners, and integrates and interprets the various input means into one piece of information.

Here, the input method using a keyboard or a touch pad may be applied as the non-voice input method.

The multimodal interface processing technique applied here is a technology that keeps the inputs in sync when the user enters characters, voice, or touch pad input, and integrates the input information so that an appropriate service can be provided. That is, in terms of the technical concept of the present invention, when a user uploads an image to the multimodal interactive image management system 100 using a digital photo frame or a PC and then inputs text with the touch pad or keyboard, or voice with a microphone, the multimodal interface processing module 140 integrates the input information and passes it to the natural language dialogue module 130, which uses the dialogue model to classify the semantic components of the user's utterance and produce an appropriate response.
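
A hypothetical fusion step of this kind might merge a touch event (which image or region was selected) and a voice event (what was said) that arrive within a short sync window into a single piece of information for the dialogue module; the field names and window length below are assumptions.

```python
# Merge a touch selection and a voice utterance that arrive close together in
# time into one integrated input for the natural language dialogue module.
SYNC_WINDOW_MS = 2000

def integrate(touch_event, voice_event):
    if abs(touch_event["time_ms"] - voice_event["time_ms"]) > SYNC_WINDOW_MS:
        return None  # inputs too far apart to belong to the same turn
    return {
        "image_id":  touch_event["image_id"],
        "region":    touch_event.get("region"),
        "utterance": voice_event["text"],
        "time_ms":   max(touch_event["time_ms"], voice_event["time_ms"]),
    }

merged = integrate(
    {"image_id": "img_001.jpg", "region": (120, 80, 200, 160), "time_ms": 1000},
    {"text": "tag this person as my sister", "time_ms": 1800},
)
print(merged["utterance"], "->", merged["image_id"])
```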

The natural language dialogue module 130 is based on natural language dialogue processing technology; more specifically, natural language dialogue processing technology interprets the spoken or typed utterances entered by the user, in sentence or word units, to determine the semantic structure, grasps the user's intention, and infers the most appropriate system response accordingly.

The image management application interworking module 145 is a module for interworking to provide various services such as menu control, search word input, portal linkage, etc. to the user according to the dialogue processing result of the natural language dialogue module 130.

That is, the image management application interworking module 145 operates the image DB built as image-tag sets and, in response to a user request entered through the multimodal interface processing module 140, searches for and provides image information together with the tags and memo information related to the images; or it interworks with applications so that image-tag information is transmitted to a digital photo frame or a home electronic diary system, allowing images to be viewed on the digital photo frame or related tags, memos, and diary information to be searched through the diary system.

FIG. 6 is a diagram illustrating an embodiment of the multimodal interface (MMI) processing module.

The multimodal interface (MMI) processing module is composed of a Voice Modality Component, which includes a Voice I/O Processor for voice input and output, a Runtime Framework, an OSGi Service Bundle, and an HTML Modality Component for non-voice input/output.

The Voice Modality Component consists of a Voice I/O Processor for voice input and output between the user and the terminal, RTP (Real Time Transport Protocol) for transmitting the voice stream, an IP Call Control Server (IP-CCS) for IP-based voice call control, an Automatic Speech Recognizer (ASR) for recognizing speech between people and terminals, Text to Speech (TTS) for converting text into speech, and a VoiceXML Interpreter for recognizing and interpreting the speech recognition and synthesis data.

The present invention uses VoiceXML for the voice interface through which the interaction manager interacts with the user and the terminal, and XHTML for the non-voice interface.

The Runtime Framework includes an Integration Module, an EMMA Parser, an Interaction Manager, an SCXML Parser, a Session/Data Model, a Delivery Context Interface (DCI), and a Generation Module.

The Integration Module integrates modality information such as voice and pointing device in conjunction with EMMA Parser and transmits it to the Interaction Manager.

The Interaction Manager works in conjunction with the SCXML Parser to control and supervise the Voice Modality Component for interaction between the user and the system, and it also processes the simple and composite multimodal input delivered from the user or connects with external modules.

The session data management unit (Session / Data Model) performs data support and management for session management.

The DCI (Delivery Context Interface) is a DB that stores personal profile information and terminal environment information in a synchronous messaging method to enable more personalized service, and stores device attributes and user preference data.

When information to be delivered to the user is received from the Interaction Manager, the Generation Module determines whether it should be output in voice or graphics mode and passes the content to be output to the Voice Modality Component or the HTML Modality Component.

FIG. 7 is a flowchart illustrating a multimodal interactive image management method for recommending and storing tags after an image is uploaded, according to the present invention.

The multi-modal interactive image management system 100 receives an image from the user terminal (step S11).

The image recognition module 110 distinguishes the objects included in the image and generates the image classification information using an image classification training model trained with the image classification training model DB 111 (step S12).

Based on the classification information of the new image generated by the image recognition module 110, the tagnet management module 120 extracts tag candidate recommendation information associated with the image classification information using the tag ontology DB 121, which is constructed from words associated with each subject (step S13), and recommends the tag candidate information of the image for the user to choose from (step S14).

Based on the tag candidate recommendation information, the natural language dialogue module 130 lets the user select the desired tag information among the tag candidates for the image through dialogue, or receives additional explanations or memos about the image from the user through the dialogue; the input arrives from the multimodal interface processing module 140 by voice or non-voice input, and the module understands the user's intention, classifies the semantic structure, selects the tag most suitable for the image, and stores the image together with its related tag and memo information in the image DB 150 (step S15).

The multimodal interface processing module 140 receives input from the user by keyboard, touch pad, and voice, and integrates and interprets the various input means; through the dialogue conducted by the natural language dialogue module 130, the user selects the desired tag information among the tag candidates for the image, or provides additional descriptions or notes about the image, from which the user's intention is understood and the semantic structure classified so that the tag most suitable for the image is selected, and the image, the tag, and the associated memo are stored in the image DB 150.

The image management application interworking module 145 then operates the image DB 150 in which the image-tag pairs are stored, providing the image information that best matches the user's request, or interworks with applications so that the image-tag information is transmitted to a digital photo frame or a home electronic diary system (step S16), allowing the image to be viewed on the digital photo frame or the memo, diary, and tag information associated with the image to be searched through the diary system (step S17).
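
As an assumed end-to-end walk through steps S11 to S16, the following sketch chains trivial stand-ins for the recognition, tag recommendation, and dialogue modules (more detailed sketches of each appear earlier) and stores the result in a dictionary standing in for the image DB.

```python
# Trivial stand-ins for the modules; real classification, ontology lookup, and
# dialogue management are sketched in the earlier examples.
def classify_image(image_bytes):            # S12: image recognition module (stub)
    return {"subject": "landscape"}

def recommend_tags(subject):                # S13-S14: tagnet management module (stub)
    return {"landscape": ["mountain", "sea", "sunset"]}.get(subject, [])

def dialogue_select(candidates, utterance): # S15: natural language dialogue module (stub)
    return next((t for t in candidates if t in utterance),
                candidates[0] if candidates else None)

image_db = {}                               # image DB storing tag and memo per image

def handle_upload(image_id, image_bytes, utterance):
    subject = classify_image(image_bytes)["subject"]
    tag = dialogue_select(recommend_tags(subject), utterance)
    image_db[image_id] = {"tag": tag, "memo": utterance}
    return image_db[image_id]

print(handle_upload("img_003.jpg", b"...", "please tag this one with sunset"))
# -> {'tag': 'sunset', 'memo': 'please tag this one with sunset'}
```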

FIG. 8 is a flowchart illustrating the function of the natural language dialogue module.

The natural language dialogue module 130 builds a dialogue model and a dialogue example DB (step S20). When the user speaks during the dialogue (step S21), the tag information to be selected for the image, additional descriptions, memos, and the like are received through the multimodal interface module 140, and the semantic structure and dialogue act are extracted by a classifier according to the preset dialogue model (step S22) so that the tag most suitable for the image can be selected, or the image together with its tag and associated memo can be stored in the image DB 150.

The natural language dialogue module 130 then selects the most similar dialogue example from the dialogue example DB (step S23), generates a system utterance based on the retrieved system response template (step S24), and provides the system response in reply to the user's conversation (step S25).

In the present invention, after an image is uploaded from the user terminal to the multimodal interactive image management system 100, about 20,000 dialogue pairs are collected in the domain of adding image tags or memos, the semantic structure and the user's intention are annotated manually, a dialogue model is trained from them, and the dialogue examples are built into the dialogue model DB 131. When the user speaks in an actual use environment, the semantic structure and the user's intention are classified using the dialogue model, and the most similar conversation pair is found in the dialogue model DB 131 to respond to the actual user; this constitutes an example-based dialogue management system.

According to the present invention, after a user uploads a picture or video to a website, the picture is automatically classified using an image recognition system, a photo-tag DB, and a tagnet, and tags related to the picture or video are recommended automatically; using the multimodal interface combining voice recognition and touch pad together with natural language dialogue processing technology, the user enters tagging information through dialogue, the user's intention information is classified with a trained language understanding model, and the tag information is attached to the video or image, implementing collective intelligence to provide an image or video search service.

In addition, the multi-modal interactive image management system can be used as an interface for all services that can simultaneously input voice and touch pad, such as digital photo frames, home network services, IPTV control, robots, portal services, civil document issuers, and navigation. This technology can be used as an intelligent multi-modal interface module specialized for voice and sound processing functions.

As described above, the method of the present invention may be implemented as a program and stored in a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.) in a computer-readable form.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention may be variously modified and changed without departing from the spirit and scope of the invention set forth in the claims below.

FIG. 1 is a diagram of the multimodal interaction framework proposed by the W3C in the prior art.

FIG. 2 is a detailed structural diagram of the input element of the conventional multimodal interaction framework.

FIG. 3 is a detailed configuration diagram of the output element of the conventional multimodal interaction framework.

FIG. 4 is a screen illustrating a user-centered online image storage and retrieval service.

FIG. 5 is a block diagram of a multimodal interactive image management system according to the present invention.

FIG. 6 is a diagram illustrating an embodiment of the multimodal interface (MMI) processing module.

FIG. 7 is a flowchart illustrating a multimodal interactive image management method for recommending and storing tags after an image is uploaded, according to the present invention.

FIG. 8 is a flowchart illustrating the functionality of the natural language dialogue module.

Claims (19)

1. A multimodal interactive image management system comprising: an image recognition module for identifying an object included in an image when a new image is provided from a user terminal and generating image classification information that classifies the image by subject according to the identification result; a tagnet management module for generating tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information; a natural language dialogue module for exchanging question and response information with the user terminal in a natural language format and selecting a tag using the tag candidate recommendation information; a multimodal interface processing module for receiving and integrating the response information from the user terminal through one or more interfaces; and an image DB for storing the image and the selected tag.

2. The system of claim 1, further comprising: a conversation model DB connected with the natural language dialogue module and storing a plurality of examples for determining a semantic structure of the response information and a user's intention; and a dialogue example DB in which a plurality of conversation pairs corresponding to the semantic structure of the response information and the user's intention are stored.

3. The system of claim 2, wherein the natural language dialogue module classifies the semantic structure of the response information and the user's intention, finds the dialogue pair mapped to the semantic structure and the user's intention, and provides it to the user as the query information.

4. The system of claim 1, further comprising an image management application interworking module for searching for and providing an image and a tag associated with the image at the request of the user terminal.

5. The system of claim 1, wherein the user terminal is a digital picture frame.

6. The system of claim 1, wherein the one or more interfaces comprise a voice input method and a non-voice input method.

7. The system of claim 6, wherein the non-voice input method comprises an input method using a keyboard or a touch pad.

8. The system of claim 1, further comprising an image classification training model DB connected to the image recognition module and storing image classification information related to the subjects of a plurality of images.

9. The system of claim 1, further comprising a tag ontology DB connected with the tagnet management module and storing the tag examples including words related to each subject of the image.

10. The system of claim 1, wherein the multimodal interface processing module comprises: a voice modality unit for voice input/output; an HTML modality unit for non-voice input/output; and a runtime framework for identifying and interpreting the contents of the voice input/output data.
11. The system of claim 10, wherein the voice modality unit comprises: a voice I/O processor for voice input/output between a user and the terminal; an IP-CCS for IP-based voice call input/output from the voice I/O processor; a voice recognition unit for recognizing the user's voice; a voice synthesizer for converting text into voice; and a VoiceXML interpreter for interpreting the input/output data of the voice recognition unit and the voice synthesizer.

12. The system of claim 11, wherein the voice I/O processor and the IP-CCS communicate with each other by RTP.

13. The system of claim 10, wherein the runtime framework comprises: an integration module for integrating the voice input/output data in association with an EMMA parser; an interaction manager for controlling the voice modality unit by processing the integrated voice input/output data transmitted from the integration module together with an SCXML parser; a session data manager for performing data support and management for session management of the user; a DCI for storing attributes of the terminal and preference data of the user; and a generation module for determining whether information to be transmitted to the user from the interaction manager is to be output in a voice or graphics mode, and transmitting it to the voice modality unit or the HTML modality unit.

14. A multimodal interactive image management method comprising: (a) receiving an image from a user terminal; (b) an image recognition module distinguishing objects included in the image and generating image classification information by using subject classification of a plurality of images stored in an image classification training model DB; (c) a tagnet management module generating tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information; (d) a natural language dialogue module generating query information based on the tag candidate recommendation information extracted in step (c) and providing the query information to the user terminal through a multimodal interface providing module; (e) the multimodal interface providing module receiving, through the user terminal and one or more interface methods, tag information desired by the user as response information, and integrating the response information; and (f) receiving image-related information through the multimodal interface providing module and storing an image and a tag related to the image in an image DB.

15. The method of claim 14, wherein step (e) comprises: (e1) the natural language dialogue module receiving the integrated response information and determining a semantic structure of the response information and the user's intention through a plurality of examples stored in a conversation model DB; and (e2) generating the query information through a plurality of conversation pairs stored in a dialogue example DB and providing the query information to the user terminal through the multimodal interface providing module.

16. The method of claim 14, wherein in step (e) the one or more interface methods comprise a voice method and a non-voice method.

17. The method of claim 14, wherein image selection is input by the non-voice method, and the tag information corresponding to the image is input by the voice method.
18. The method of claim 14, further comprising: (g) an image management application interworking module searching for and providing, from the image DB, an image and a tag related to the image according to the response information.

19. A recording medium readable by a computer or digital photo frame, on which is recorded a program for realizing: (a) a function of a multimodal interactive image management system receiving an image uploaded from a user terminal; (b) a function of an image recognition module distinguishing objects included in the image and generating image classification information by using subject classification of a plurality of images stored in an image classification training model DB; (c) a function of a tagnet management module generating tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information; (d) a function of a natural language dialogue module generating query information based on the tag candidate recommendation information extracted in (c) and providing the query information to the user terminal through a multimodal interface providing module; (e) a function of receiving, through the user terminal and one or more interface methods, tag information desired by the user as response information and integrating the response information; and (f) a function of receiving image-related information through the multimodal interface providing module and storing an image, a tag, and memo information related to the image in an image DB.
KR1020080015938A 2008-02-21 2008-02-21 System and method for multimodal conversational mode image management KR20090090613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020080015938A KR20090090613A (en) 2008-02-21 2008-02-21 System and method for multimodal conversational mode image management

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020080015938A KR20090090613A (en) 2008-02-21 2008-02-21 System and method for multimodal conversational mode image management

Publications (1)

Publication Number Publication Date
KR20090090613A true KR20090090613A (en) 2009-08-26

Family

ID=41208371

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020080015938A KR20090090613A (en) 2008-02-21 2008-02-21 System and method for multimodal conversational mode image management

Country Status (1)

Country Link
KR (1) KR20090090613A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101143450B1 (en) * 2010-11-24 2012-05-23 단국대학교 산학협력단 Providing method and system of emotional augmented memory service
WO2012093811A1 (en) * 2011-01-04 2012-07-12 (주)올라웍스 Method for support in such a way as to allow collection of objects comprised in an input image, and a recording medium able to be read by terminal devices and computers
KR101219469B1 (en) * 2011-03-29 2013-01-11 서울대학교산학협력단 Methods for Multimodal Learning and Classification of Multimedia Contents
KR20130100448A (en) * 2012-03-02 2013-09-11 엘지전자 주식회사 Mobile terminal and method for controlling thereof
KR20140042492A (en) * 2012-09-28 2014-04-07 엘지전자 주식회사 Mobile terminal and operating method for the same
KR20150016776A (en) * 2013-08-05 2015-02-13 삼성전자주식회사 Interface device and method supporting speech dialogue survice
KR20160084748A (en) * 2015-01-06 2016-07-14 포항공과대학교 산학협력단 Dialogue system and dialogue method
CN109176535A (en) * 2018-07-16 2019-01-11 北京光年无限科技有限公司 Exchange method and system based on intelligent robot
CN110209784A (en) * 2019-04-26 2019-09-06 腾讯科技(深圳)有限公司 Method for message interaction, computer equipment and storage medium
US10902262B2 (en) 2017-01-19 2021-01-26 Samsung Electronics Co., Ltd. Vision intelligence management for electronic devices
US10909371B2 (en) 2017-01-19 2021-02-02 Samsung Electronics Co., Ltd. System and method for contextual driven intelligence
CN115062131A (en) * 2022-06-29 2022-09-16 支付宝(杭州)信息技术有限公司 Multi-mode-based man-machine interaction method and device
KR102492277B1 (en) * 2022-06-28 2023-01-26 (주)액션파워 Method for qa with multi-modal information
WO2023058944A1 (en) * 2021-10-08 2023-04-13 삼성전자주식회사 Electronic device and response providing method

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101143450B1 (en) * 2010-11-24 2012-05-23 단국대학교 산학협력단 Providing method and system of emotional augmented memory service
WO2012093811A1 (en) * 2011-01-04 2012-07-12 (주)올라웍스 Method for support in such a way as to allow collection of objects comprised in an input image, and a recording medium able to be read by terminal devices and computers
US8467577B2 (en) 2011-01-04 2013-06-18 Intel Corporation Method, terminal, and computer-readable recording medium for supporting collection of object included in inputted image
KR101219469B1 (en) * 2011-03-29 2013-01-11 서울대학교산학협력단 Methods for Multimodal Learning and Classification of Multimedia Contents
KR20130100448A (en) * 2012-03-02 2013-09-11 엘지전자 주식회사 Mobile terminal and method for controlling thereof
KR20140042492A (en) * 2012-09-28 2014-04-07 엘지전자 주식회사 Mobile terminal and operating method for the same
KR20150016776A (en) * 2013-08-05 2015-02-13 삼성전자주식회사 Interface device and method supporting speech dialogue survice
KR20160084748A (en) * 2015-01-06 2016-07-14 포항공과대학교 산학협력단 Dialogue system and dialogue method
US10902262B2 (en) 2017-01-19 2021-01-26 Samsung Electronics Co., Ltd. Vision intelligence management for electronic devices
US10909371B2 (en) 2017-01-19 2021-02-02 Samsung Electronics Co., Ltd. System and method for contextual driven intelligence
CN109176535A (en) * 2018-07-16 2019-01-11 北京光年无限科技有限公司 Exchange method and system based on intelligent robot
CN109176535B (en) * 2018-07-16 2021-10-19 北京光年无限科技有限公司 Interaction method and system based on intelligent robot
CN110209784A (en) * 2019-04-26 2019-09-06 腾讯科技(深圳)有限公司 Method for message interaction, computer equipment and storage medium
CN110209784B (en) * 2019-04-26 2024-03-12 腾讯科技(深圳)有限公司 Message interaction method, computer device and storage medium
WO2023058944A1 (en) * 2021-10-08 2023-04-13 삼성전자주식회사 Electronic device and response providing method
KR102492277B1 (en) * 2022-06-28 2023-01-26 (주)액션파워 Method for qa with multi-modal information
US11720750B1 (en) 2022-06-28 2023-08-08 Actionpower Corp. Method for QA with multi-modal information
CN115062131A (en) * 2022-06-29 2022-09-16 支付宝(杭州)信息技术有限公司 Multi-mode-based man-machine interaction method and device

Similar Documents

Publication Publication Date Title
KR20090090613A (en) System and method for multimodal conversational mode image management
US10395654B2 (en) Text normalization based on a data-driven learning network
RU2710984C2 (en) Performing task without monitor in digital personal assistant
KR100561228B1 (en) Method for VoiceXML to XHTML+Voice Conversion and Multimodal Service System using the same
US8073700B2 (en) Retrieval and presentation of network service results for mobile device using a multimodal browser
CN100578614C (en) Semantic object synchronous understanding implemented with speech application language tags
WO2022057712A1 (en) Electronic device and semantic parsing method therefor, medium, and human-machine dialog system
GB2383247A (en) Multi-modal picture allowing verbal interaction between a user and the picture
US20030144843A1 (en) Method and system for collecting user-interest information regarding a picture
JP2009205579A (en) Speech translation device and program
KR20170014353A (en) Apparatus and method for screen navigation based on voice
JP2004310748A (en) Presentation of data based on user input
WO2010124512A1 (en) Human-machine interaction system and related system, device and method thereof
EP3550449A1 (en) Search method and electronic device using the method
KR20240012245A (en) Method and apparatus for automatically generating faq using an artificial intelligence model based on natural language processing
JP2010026686A (en) Interactive communication terminal with integrative interface, and communication system using the same
JP7139157B2 (en) Search statement generation system and search statement generation method
US11714599B2 (en) Method of browsing a resource through voice interaction
Johnston Extensible multimodal annotation for intelligent interactive systems
KR20020013148A (en) Method and apparatus for internet navigation through continuous voice command
Alonso-Martín et al. Multimodal fusion as communicative acts during human–robot interaction
US20240106776A1 (en) Sign Language Translation Method And System Thereof
Griol et al. The VoiceApp system: Speech technologies to access the semantic web
Jeevitha et al. A study on innovative trends in multimedia library using speech enabled softwares
Habeeb et al. Design module for speech recognition graphical user interface browser to supports the web speech applications

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination