CN102262624A - System and method for realizing cross-language communication based on multi-mode assistance - Google Patents
- Publication number: CN102262624A (application CN201110225342A)
- Authority: CN (China)
- Prior art keywords: user, chat, content, text, talk
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention provides a system and a method for realizing cross-language communication based on multi-modal assistance. The method uses the system's foreground interactive module, data management module and semantic association module: a natural language processing tool automatically extracts the central topic and keywords of a conversation by analyzing its content, and the semantic association module automatically searches for related images and video clips and presents them to both parties of the conversation in an appropriate way, so as to accelerate mutual understanding and communication. The images and videos used to assist understanding can be retrieved automatically from the network by a search method, or obtained directly from a pre-annotated multimedia library. Finally, the system generates a multi-modal conversation summary from the two parties' text chat messages and the corresponding image and video content.
Description
Technical field
The invention belongs to the fields of multimedia analysis and network communication, and relates to a method for realizing cross-language communication based on multi-modal assistance.
Background art
With the rapid development of communication and Internet technology, network instant messaging systems such as MSN and QQ have appeared that differ fundamentally from traditional communication modes such as mail, telephone and telegram. Traditional mail and telegram are based on text, and the telephone is based on voice, whereas instant messaging can use not only text and voice but also auxiliary multimedia means such as video and pictures. Through an instant messaging system, people separated by vast oceans can converse as if face to face; the whole earth has become a true global village.
For interlocutors who speak different languages, however, the language barrier remains difficult to overcome in instant messaging. In recent years machine translation technology has made rapid progress, so the language problem in communication between users of different languages has been solved to a certain extent. But machine translation still has two significant disadvantages. The first is accurate translation between different languages: machine translation can still only translate some simple dialogues automatically. Even between the two languages with the largest numbers of users in the world, English and Chinese, the accuracy of automatic translation still cannot fully satisfy the needs of daily use; when the world's many minority languages are considered, accurate automatic translation between languages remains a formidable problem. The second is word-sense polysemy, another challenging problem encountered in machine translation.
To enhance communication, a prior-art text-to-image synthesis system presents the main content of an input text in the form of pictures. It completes the text-to-picture conversion through three optimizations: maximizing the probability of keywords given the input text, maximizing the probability of corresponding pictures given the input text and the selected keywords, and maximizing the spatial layout of text and pictures given the selected keywords and corresponding pictures. However, this system has the following three shortcomings:
1) The processing speed of the system is slow. Because the system must compute the optimizations, the conversion from text to pictures is slowed down.
2) The interface of the system is unfriendly. The input text and the retrieved pictures are optimized together into a spatial layout before being presented to the user. If such a mixed text-picture layout were applied to a dialogue between users, it would certainly feel unfriendly to the user.
3) The system is hard to use. Because it is terminal software, the user must download it. A web-based system avoids this shortcoming.
Summary of the invention
The objective of the invention is to overcome the slow processing speed and poor usability of the prior art, and to let people who use different languages communicate smoothly online with the assistance of multi-modal information. By using multi-modal information such as images and video to reduce the ambiguity and polysemy produced in traditional automatic translation, and to assist semantic understanding of the users' conversation content, the invention provides a method for realizing cross-language communication based on multi-modal assistance.
To realize the described purpose, a first aspect of the invention provides a cross-language communication system based on multi-modal assistance, comprising a foreground interactive module, a data management module and a semantic association module, wherein:
the input end of the foreground interactive module accepts the text chat content input by the user and pre-processes it to obtain the text message of the user's chat, and the output end of the front/back-end interactive module of the foreground interactive module transmits the processed user text chat content; the chat page of the foreground interactive module displays, for the chatting users, the text content of both parties' dialogue and the multimedia pictures recommended by the system according to the conversation content;
the input end of the semantic association module is connected with the output end of the foreground interactive module; it receives and analyzes the users' text chat content, extracts the main content of the conversation with a natural language processing tool, obtains and outputs the multimedia information semantically associated with the text and the translated text, and generates a multi-modal summary from the text chat content, the translated content and the corresponding multimedia information;
the input end of the data management module is connected with the output end of the semantic association module; the data management module stores the newly input text chat content, the translated content and the corresponding multimedia information, integrates the historical user information with the new user information, and generates and displays the text content of both chat parties' dialogue and the multimedia picture information recommended by the system according to the conversation content.
In a preferred embodiment, after the back-end semantic association module receives the text message sent by the user, in order to help chat users of different languages understand the other party's meaning from the perspective of their own language, the semantic association module integrates the result of Google Translate, so that in addition to the original user chat message, a Google Translate translation of the chat content is attached.
In a preferred embodiment, the semantic association module extracts the main content of the conversation and, using this main content as keywords, retrieves corresponding candidate pictures from an image database by text-based image retrieval.
To realize the described purpose, a second aspect of the invention provides a method for realizing cross-language communication using the cross-language communication system based on multi-modal assistance. The method is based on the users' conversation: according to the result obtained by analyzing the conversation content with text-parsing technology, it provides the users with multimedia elements that assist semantic understanding between users with a language barrier or different cultural backgrounds. The method comprises the following steps:
Step S1: the user first sends the text content he wants to chat about through the foreground interface of the semantic chat; the foreground interface transmits the text message of the user's chat to the back-end semantic association module through a front/back-end interactive module built with Ajax; a topic-based cross-modal analysis method analyzes the users' conversation content, and a natural language processing tool automatically extracts the central topic and keywords of the dialogue;
Step S2: according to the central topic and keyword information of the dialogue, the semantic association module automatically retrieves relevant pictures and video clips from a database or the Internet by text-based image retrieval, and offers them to both parties according to the conversation topic;
Step S3: according to both parties' text chat information and the corresponding pictures and video clips, the system generates a multi-modal conversation summary, finally realizing smooth semantic communication between users of different languages in multimedia form; at the same time, the system can generate a multi-modal conversation summary for the two parties from their historical text chat information and the corresponding picture and video content.
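Steps S1 to S3 can be sketched as a minimal pipeline. Everything below is an illustrative assumption rather than the patent's actual implementation: the keyword extractor is a toy stop-word filter, the media index is a hand-built dictionary standing in for the pre-annotated multimedia library, and the summary is a plain data structure.

```python
# Hypothetical sketch of steps S1-S3: extract keywords, retrieve media, summarize.
STOP_WORDS = {"i", "a", "an", "the", "to", "would", "like", "please", "and"}

# Toy index standing in for the pre-annotated multimedia library (assumption).
MEDIA_INDEX = {
    "pizza": ["pizza_margherita.jpg", "pizza_menu.mp4"],
    "cola": ["cola_bottle.jpg"],
}

def extract_keywords(message):                     # step S1
    words = [w.strip(".,!?").lower() for w in message.split()]
    return [w for w in words if w and w not in STOP_WORDS]

def retrieve_media(keywords):                      # step S2
    media = []
    for kw in keywords:
        media.extend(MEDIA_INDEX.get(kw, []))
    return media

def build_summary(chat_log):                       # step S3
    summary = []
    for speaker, message in chat_log:
        kws = extract_keywords(message)
        summary.append({"speaker": speaker, "text": message,
                        "keywords": kws, "media": retrieve_media(kws)})
    return summary

chat = [("user", "I would like a pizza and a cola, please.")]
digest = build_summary(chat)
```

A real system would replace each stage with the components the patent describes: NLP keyword extraction, text-based image retrieval, and topic-driven summarization.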
In a preferred embodiment, the multi-modal conversation summary comprises text, audio, image and video information, providing the users with multimedia elements that assist semantic understanding between users with a language barrier or different cultural backgrounds.
In a preferred embodiment, the pictures and video clips are fetched automatically from the network by search, or obtained directly from a pre-annotated multimedia library.
In a preferred embodiment, the multi-modal conversation summary is a topic-based summary; the detected topic is obtained using a relation network and the statistical co-occurrence frequency, in a predefined corpus, of the words appearing in the most recent conversation.
Beneficial effects of the invention: the core of the invention is how to describe a text message with multimedia information (images or video). The proposed cross-language communication system based on multi-modal assistance provides a friendly and convenient environment for online instant messaging, with three principal features: first, friendliness — by assisting the understanding of text content with topic-related image or video search technology, the polysemy and ambiguity of translation are significantly reduced; second, interactivity — the system can better satisfy users' individual needs; third, ease of use — the proposed system can automatically generate a multimedia summary from the conversation record.
To assist communication and understanding between users, the system of the invention adopts a topic-based cross-modal analysis method. The system generates a multi-modal conversation summary from both parties' text chat information and the corresponding pictures and video content. Because this multi-modal summary contains rich, intuitive and understandable auxiliary information — images, video, text and so on — it effectively eliminates the ambiguity that appears in automatic translation of plain text, improves the efficiency and quality of communication, and realizes smooth semantic communication between users of different languages.
Description of drawings
Fig. 1 is the interface diagram of the cross-language communication system based on multi-modal assistance of the invention;
Fig. 2 is the structural block diagram of the cross-language communication system based on multi-modal assistance of the invention;
Fig. 3a and Fig. 3b give example results for ordering a pizza;
Fig. 4 is an example of a multimedia summary of conversation content.
Embodiment
To make the purpose, technical solutions and advantages of the invention clearer, the invention is described in more detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
The invention proposes a cross-language communication system based on multi-modal assistance and a method for realizing cross-language communication. The method uses the foreground interactive module 1, the data management module 2 and the semantic association module 3: by analyzing the conversation content, a natural language processing tool automatically extracts the central topic and keywords of the dialogue, and according to the detected central topic and keyword information, the semantic association module 3 automatically searches for relevant pictures and video clips and offers them to both parties in an appropriate manner, thereby promoting mutual understanding and communication. The pictures and videos used to assist understanding can be fetched automatically from the network by a search method, or obtained directly from a pre-annotated multimedia library. Finally, the system generates a multi-modal conversation summary from both parties' text chat information and the corresponding picture and video content.
Fig. 1 shows the user interface of the multimedia chat system for cross-language communication proposed by the invention; it provides users of different languages with a friendly, interactive and timely communication environment. It mainly comprises three functions: text communication based on instant translation, picture or video retrieval based on the conversation topic, and a multimedia summary of the conversation content (illustrated in Fig. 4). The uppermost part of Fig. 1 mainly displays the name of the system and the topic of the users' chat. Below it is the main display area of the interface, showing the text dialogue and the multimedia auxiliary information, for conversations such as asking for directions, buying a car, booking a hotel, etc. The right-hand part of Fig. 1 is the text communication based on instant translation — the users' text chat area, which presents the users' basic text chat messages together with the corresponding Google Translate text. The left-hand part of Fig. 1 is the topic-based picture or video retrieval and the display area of the multimedia summary of the conversation content: multimedia information relevant to the users' conversation is presented to assist the users' semantic understanding.
Fig. 2 illustrates the structural block diagram of the cross-language communication system based on multi-modal assistance of the invention. The framework is divided into three components: the foreground interactive module 1, the data management module 2 and the semantic association module 3. The foreground design comprises two parts: the chat interface and the front/back-end interaction. The foreground interactive module 1 accepts the text chat content input by the user and pre-processes it to obtain the text message of the user's chat; the processed user text chat content is sent to the semantic association module 3 through the output end of the front/back-end interactive module of the foreground interactive module 1, and the chat page of the foreground interactive module 1 displays, for the chatting users, the text content of both parties' dialogue and the multimedia pictures recommended by the system according to the conversation content.
The input end of the semantic association module 3 is connected with the output end of the foreground interactive module 1; after receiving and analyzing the users' text chat content, it extracts the main content of the conversation with a natural language processing tool, obtains and outputs the multimedia information semantically associated with the text and the translated text, and generates a multi-modal summary from the text chat content, the translated content and the corresponding multimedia information; the semantic association module 3 then outputs the text chat content, the translated content and the corresponding multimedia information together to the data management module 2.
The input end of the data management module 2 is connected with the output end of the semantic association module 3. The data management module 2 stores the newly input text chat content, the translated content and the corresponding multimedia information, integrates the historical user information with the new user information, and generates and displays the text content of both chat parties' dialogue and the multimedia picture information recommended by the system according to the conversation content; everything is finally returned to the foreground interactive module 1, whose chat page displays all the information to the user. The workflow of each module is described in detail below.
The user first sends chat content to the foreground interactive module 1 through the chat interface. As shown in Fig. 1, the semantic chat interface is divided into two main parts: one displays the traditional text content of both parties' dialogue, and the other displays the list of multimedia pictures recommended by the system according to the conversation content. The foreground interface transmits the text message of the user's chat to the back end through the front/back-end interactive module built with Ajax. The back-end framework is divided into two parts: the data management module 2 and the semantic association module 3. After the back end receives the text message sent by the user, in order to help chat users of different languages understand the other party's meaning from the perspective of their own language, the semantic association module 3 integrates the result of Google Translate, so that in addition to the original user chat message, a Google Translate translation of the chat content is attached. The semantic association module 3 then extracts the main content of the conversation from the text message with a natural language processing tool, uses this main content as keywords, and retrieves corresponding candidate pictures from an image database by text-based image retrieval. Finally, the whole dialogue and the corresponding multimedia information are used to generate a multi-modal summary. The example of ordering a pizza illustrates the generated multimedia summary, as shown in Fig. 4. From this multi-modal summary it can be seen that, in the dialogue between the user and the pizza-shop clerk, the kind of pizza, the beverage and the payment method were chosen. With the pictures of the corresponding pizza shop's pizzas fed back by the chat system, the user can choose better according to his own wishes. The multi-modal summary also helps a user who wants the same pizza again in the future, since he can review the multimedia information it provides.
The semantic association mechanism in Fig. 2 is elaborated below. It is divided into three parts: text communication based on instant translation, topic-based picture and video retrieval, and finally the multi-modal summary generated from the users' text chat content and the corresponding multimedia information.
(1) Text communication based on instant translation
Like most instant messaging systems, the proposed system supports basic text communication. However, the two conversing parties may have different language settings. For example, when an English-speaking American chats online with a Chinese speaker, the American does not understand Chinese and the Chinese speaker does not understand English, so ordinary text chat cannot give both sides unobstructed communication. For this reason, the invention integrates a simple machine translation function: during the chat, the speaker's language is automatically translated into the recipient's language before display, so that both conversing parties can roughly understand each other's intention.
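The translate-then-display step can be sketched as follows. This is a minimal sketch under stated assumptions: the translator is a stub lookup table, whereas a real system would call a machine translation service (the patent names Google Translate) — that API call is not shown.

```python
# Sketch of attaching a machine translation to each chat message before display.
# The phrasebook below is a stub standing in for a real MT service (assumption).
PHRASEBOOK = {("zh", "en"): {"你好": "Hello", "谢谢": "Thank you"}}

def translate(text, src, dst):
    # Fall back to the original text when no translation is available.
    return PHRASEBOOK.get((src, dst), {}).get(text, text)

def deliver(message, sender_lang, receiver_lang):
    """Return the message as the receiver sees it: original plus translation."""
    return {"original": message,
            "translation": translate(message, sender_lang, receiver_lang)}

shown = deliver("你好", "zh", "en")
```

Displaying the original alongside the translation, as the patent's interface does, lets the receiver cross-check the machine output against the source text.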
(2) Topic-based picture and video retrieval
Although machine translation serves as a bridge, cross-language communication is still unsatisfactory. The main reason is that the accuracy of machine translation (the intelligibility of the target language) is still low; even translation between major languages, for example English and Chinese, has not reached a practical standard. In addition, because of polysemous words in everyday expressions and sentences, machine translation is hard pressed to satisfy real needs. As shown in Fig. 3a, food includes seafood, fruit and meat, and fruit includes banana, apple and orange; for example, the word "apple" can denote a kind of fruit (Fig. 3a) or the Apple company. To build an understandable, immersive online communication environment, we designed a topic-based picture/video retrieval sub-module to assist communication between users of different language backgrounds. Its three main functions are topic detection, picture retrieval and relevance feedback.
Topic detection is realized through two approaches. In the first, the user selects a topic from a predefined topic list; different topics are associated with different annotated picture/video databases (annotated manually or by learning). The second extracts topic keywords by text analysis. In a dialogue, many entity words expressing the conversation content can be extracted. From these entity words, we first build a semantic relation tree similar to WordNet, which depicts the semantic inheritance between words: as shown in Fig. 3a, the words "apple", "banana" and "orange" all belong to the fruit subclass of food, while, as shown in Fig. 3b, the word "apple" may also belong, together with "Dell" and "Lenovo", to the class of computer brands (Fig. 3b illustrates the Apple computer-brand example, comprising the Mac desktop, the iPad tablet and the iPhone smartphone). These semantic relations can be extracted from WordNet, or obtained from the statistical term frequency-inverse document frequency (TF-IDF) weights of words in a predefined corpus. Once keywords are extracted from the dialogue, the system can automatically infer the corresponding latent topic by analyzing the semantic relations between the keywords.
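The second approach — weighting entity words by TF-IDF and mapping them through a WordNet-like relation tree to a parent topic — can be sketched as follows. The corpus, the relation tree and the voting scheme are all invented for illustration; they stand in for WordNet and the patent's predefined corpus.

```python
import math
from collections import Counter

# Tiny predefined corpus (assumption) supplying the inverse-document-frequency term.
CORPUS = [
    "apple banana orange fruit market",
    "apple dell lenovo laptop price",
    "weather rain sunny forecast",
]

# Miniature WordNet-like relation tree: word -> parent topics (assumption).
SEMANTIC_TREE = {"apple": ["fruit", "computer brand"],
                 "banana": ["fruit"], "orange": ["fruit"],
                 "dell": ["computer brand"], "lenovo": ["computer brand"]}

def tf_idf(word, doc_words):
    tf = doc_words.count(word) / len(doc_words)
    df = sum(1 for d in CORPUS if word in d.split())
    idf = math.log(len(CORPUS) / (1 + df)) + 1.0   # smoothed IDF
    return tf * idf

def infer_topic(dialogue):
    words = dialogue.lower().split()
    # Each known word votes for its parent topics, weighted by its TF-IDF.
    votes = Counter()
    for w in set(words):
        for topic in SEMANTIC_TREE.get(w, []):
            votes[topic] += tf_idf(w, words)
    return votes.most_common(1)[0][0] if votes else None

topic = infer_topic("I bought an apple and a banana at the market")
```

Here the polysemous word "apple" alone would leave the two topics tied; the co-occurring word "banana" breaks the tie in favor of "fruit", which mirrors the disambiguation argument in the text.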
According to the topic extracted from the dialogue, the system automatically retrieves the corresponding picture information from the network or the background database. With text-based retrieval, we can easily find relevant annotated pictures according to the conversation topic. However, most network pictures are unannotated, so we use the annotated pictures retrieved by text as a training set, learn a topic model, and use this topic model to retrieve the large body of unannotated pictures. Topic-based picture retrieval therefore first builds a topic model, whose goal is to discover the latent (implicit) semantic space automatically so as to model the document information in retrieval more accurately. Here the semantic structure of a document comprises some latent concepts or topics (often stable and distinctive co-occurrence patterns among synonymous words). Through a weighted combination of latent topics, a document can be expressed as a series of latent topics, and the combination coefficients can be regarded as a feature representation of the document. This representation has a series of advantages: first, the semantic space usually has lower dimensionality than the word space, which not only saves storage space but also permits fast retrieval; second, the conversion from word space to semantic space not only reduces the noise in the word vector but can also resolve the above-mentioned ambiguity problems, thereby improving retrieval performance. For example, the word "apple" can denote a kind of fruit or a computer brand (Fig. 3b); its exact meaning can be inferred from the other keywords relevant to the same topic.
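Projecting both the query and the candidate pictures into a low-dimensional topic space, then ranking by cosine similarity there, can be sketched as follows. The word-to-topic loadings are hand-built assumptions standing in for a learned topic model, and the picture names and tags are likewise invented; in the patent the model is learned from annotated training pictures.

```python
import math

# Hand-built word -> topic loadings standing in for a learned topic model
# (assumption; a real system learns these from annotated pictures).
TOPICS = ("fruit", "computers")
LOADINGS = {"apple": (0.5, 0.5), "banana": (1.0, 0.0), "orange": (1.0, 0.0),
            "ipad": (0.0, 1.0), "mac": (0.0, 1.0)}

def to_topic_vector(words):
    """Project a bag of words into the 2-D latent topic space."""
    vec = [0.0] * len(TOPICS)
    for w in words:
        for i, weight in enumerate(LOADINGS.get(w, (0.0,) * len(TOPICS))):
            vec[i] += weight
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Annotated candidate pictures (names and tags are assumptions).
PICTURES = {"apple_fruit.jpg": ["apple", "banana"],
            "apple_mac.jpg": ["apple", "mac", "ipad"]}

def rank_pictures(query_words):
    q = to_topic_vector(query_words)
    scored = [(cosine(q, to_topic_vector(tags)), name)
              for name, tags in PICTURES.items()]
    return [name for _, name in sorted(scored, reverse=True)]

best = rank_pictures(["apple", "orange"])[0]
```

Note how the ambiguous query word "apple" is resolved by its companion "orange": in the 2-D topic space the query lands near the fruit axis, so the fruit picture outranks the computer picture.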
Feedback, as a popular human-computer interaction technology, is widely used in the analysis of text and visual information. Through the user's feedback on the system's output, the system can revise itself adaptively; supervision information obtained from user feedback has proved effective in practice. In our system, the user can select the correct topic from the candidate list produced by the automatic topic extraction algorithm; the selected topic is then used to model the temporal relation between the current topic and the next one for subsequent topic extraction. In image retrieval, our system lists the retrieved sample pictures and invites the user to score the relevant pictures according to the conversation topic.
(3) Multi-modal summary
Traditional instant messaging usually saves the chat record as plain text. In our system, the user can express the speaker's intention in multi-modal forms such as pictures, video and text. By preserving chat messages in a multi-modal form rather than as plain text, a more vivid record than before is obtained.
Summarization of text, pictures and video is a research focus in natural language processing and multimedia. A summary briefly expresses the original text (picture or video) information through a more concise text (picture or video). Most current techniques construct the summary from salience features, repetition, or keyword (key-frame) information. In our system, considering that besides text there are also a large number of pictures and videos, we adopt a topic-driven summarization method that analyzes the conversation content between users and generates summary information about a specific topic; this summary information comprises the related text, picture and video content concerning that topic.
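Topic-driven multi-modal summarization — keeping only the conversation turns, and their attached media, whose keywords touch the requested topic — can be sketched as follows. The chat log, keyword sets, topic vocabulary and media file names are all invented for illustration.

```python
# Sketch of topic-driven multi-modal summarization: keep only the turns (and
# their attached media) whose keywords overlap the requested topic's vocabulary.
chat_log = [
    {"text": "One margherita pizza please", "keywords": {"pizza"},
     "media": ["pizza_margherita.jpg"]},
    {"text": "Nice weather today", "keywords": {"weather"}, "media": []},
    {"text": "Pay by card", "keywords": {"pizza", "payment"},
     "media": ["card_reader.jpg"]},
]

# Topic vocabulary (assumption; the patent derives topics from the dialogue).
TOPIC_KEYWORDS = {"ordering pizza": {"pizza", "payment", "beverage"}}

def summarize(log, topic):
    wanted = TOPIC_KEYWORDS[topic]
    turns = [t for t in log if t["keywords"] & wanted]   # set intersection
    return {"topic": topic,
            "text": [t["text"] for t in turns],
            "media": [m for t in turns for m in t["media"]]}

digest = summarize(chat_log, "ordering pizza")
```

The off-topic small talk ("Nice weather today") is dropped, while the on-topic turns and their pictures survive, matching the pizza-ordering summary of Fig. 4.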
The above are only embodiments of the invention, but the protection scope of the invention is not limited thereto; any conversion or replacement that a person skilled in the art can conceive within the technical scope disclosed by the invention shall be encompassed within the protection scope of the claims of the invention.
Claims (7)
1. A cross-language communication system based on multi-modal assistance, characterized in that the system comprises: a foreground interactive module, a data management module and a semantic association module, wherein:
The input end of the foreground interactive module accepts the text chat content entered by the user and pre-processes the user's chat content to obtain the text information of the user's chat, and the processed user text chat content is transmitted through the output end of the front-end/back-end interactive module of the foreground interactive module; the chat page of the foreground interactive module displays, for the chatting users, the text content of both parties' dialogue and the multimedia pictures recommended by the system according to the content of the conversation;
The input end of the semantic association module is connected to the output end of the foreground interactive module; it receives and analyzes the user's text chat content, extracts the main content of the conversation using natural language processing tools, obtains and outputs the translated text and the multimedia information associated with the text, and generates a multi-modal summary from the text chat content, the translated content and the corresponding multimedia information;
The input end of the data management module is connected to the output end of the semantic association module; the data management module stores the newly input text chat content, the translated content and the corresponding multimedia information, and at the same time integrates historical user information with new user information, generating and displaying the text content of all chat parties' dialogues and the multimedia picture information recommended by the system according to the content of the conversation.
2. The cross-language communication system based on multi-modal assistance according to claim 1, characterized in that, after the back-end semantic association module receives the text information sent by the user, in order to help chat users of different languages understand the other party's meaning from the perspective of their own language, the semantic association module integrates the results of Google Translate, so that in addition to the original user's chat message, a translation of the chat content produced by Google Translate is attached.
3. The cross-language communication system based on multi-modal assistance according to claim 1, characterized in that the semantic association module extracts the main content of both parties' conversation, uses this main content as keywords, and adopts text-based image retrieval to retrieve the corresponding candidate pictures from an image database.
4. A method for realizing cross-language communication using the cross-language communication system based on multi-modal assistance according to claim 1, characterized in that the method is based on the users' conversational chat and, according to the results of analyzing the talk content with text-parsing techniques, provides multimedia elements for the users to assist semantic understanding between users who have language barriers or differing cultural backgrounds; the method comprises the following steps:
Step S1: the user first sends, through the foreground interface of the semantic chat, the text content he wants to discuss with the other party; the text information of the user's chat is transmitted to the back-end semantic association module through the front-end/back-end interactive module built with Ajax in the foreground interface; a topic-based cross-modal analysis method is adopted to analyze the user's conversation content, and natural language processing tools are used to automatically extract the central topic and keywords of the dialogue;
Step S2: according to the central topic and keyword information of the dialogue, the semantic association module automatically retrieves relevant pictures and video clips from a database or the Internet by text-based image retrieval according to the conversation topic, and offers them to both parties of the conversation;
Step S3: according to the text chat information of both parties and the corresponding picture and video clip content, the system generates a multi-modal talk summary, finally realizing smooth semantic communication in multimedia form between users of different languages; at the same time, the system can generate a multi-modal talk summary for both parties according to their historical text chat information and the corresponding picture and video content.
5. The method for realizing cross-language communication according to claim 4, characterized in that the multi-modal talk summary comprises text, audio, image and video information, providing multimedia elements for the users to assist semantic understanding between users who have language barriers or differing cultural backgrounds.
6. The method for realizing cross-language communication according to claim 4, characterized in that the picture and video clip content is automatically downloaded from the network by searching, or obtained directly from a pre-annotated multimedia gallery.
7. The method for realizing cross-language communication according to claim 4, characterized in that the multi-modal talk summary is a topic-based summary, in which topics are detected using the relational network of the preceding conversation and the statistical co-occurrence frequency of words in a predefined corpus.
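The co-occurrence-based topic detection of claim 7 can be sketched as follows. This is only an illustrative sketch: the sentence-level co-occurrence window, the additive scoring, and the toy corpus are assumptions, not the claimed method.

```python
from collections import Counter
from itertools import combinations

def detect_topic(utterance_words, corpus_sentences):
    """Score each word of the utterance by how often it co-occurs (within a
    sentence) with the utterance's other words in a predefined corpus, and
    return the highest-scoring word as the detected topic."""
    cooc = Counter()
    for sent in corpus_sentences:
        # count each unordered word pair once per sentence
        for a, b in combinations(sorted(set(sent)), 2):
            cooc[(a, b)] += 1

    def pair(a, b):
        return cooc[tuple(sorted((a, b)))]

    scores = {w: sum(pair(w, o) for o in utterance_words if o != w)
              for w in utterance_words}
    return max(scores, key=scores.get)

corpus = [["pizza", "cheese"], ["pizza", "cheese"], ["pizza", "tomato"]]
print(detect_topic(["pizza", "cheese", "tomato"], corpus))  # "pizza"
```

Here "pizza" co-occurs with both other words in the corpus, so it accumulates the highest score and is chosen as the topic of the utterance.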
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110225342XA CN102262624A (en) | 2011-08-08 | 2011-08-08 | System and method for realizing cross-language communication based on multi-mode assistance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102262624A true CN102262624A (en) | 2011-11-30 |
Family
ID=45009255
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110225342XA Pending CN102262624A (en) | 2011-08-08 | 2011-08-08 | System and method for realizing cross-language communication based on multi-mode assistance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102262624A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101834809A (en) * | 2010-05-18 | 2010-09-15 | 华中科技大学 | Internet instant message communication system |
CN101251855B (en) * | 2008-03-27 | 2010-12-22 | 腾讯科技(深圳)有限公司 | Equipment, system and method for cleaning internet web page |
US20110153752A1 (en) * | 2009-12-21 | 2011-06-23 | International Business Machines Corporation | Processing of Email Based on Semantic Relationship of Sender to Recipient |
Non-Patent Citations (1)
Title |
---|
XINMING ZHANG, ET AL: "A visualized Communication System Using Cross-Media Semantic Association", 《17TH INTERNATIONAL MULTIMEDIA MODELING CONFERENCE》, 7 January 2011 (2011-01-07), pages 88 - 98, XP019159534 * |
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9645987B2 (en) | 2011-12-02 | 2017-05-09 | Hewlett Packard Enterprise Development Lp | Topic extraction and video association |
WO2013080214A1 (en) * | 2011-12-02 | 2013-06-06 | Hewlett-Packard Development Company, L.P. | Topic extraction and video association |
CN102567509B (en) * | 2011-12-26 | 2014-08-27 | 中国科学院自动化研究所 | Method and system for instant messaging with visual messaging assistance |
CN102567509A (en) * | 2011-12-26 | 2012-07-11 | 中国科学院自动化研究所 | Method and system for instant messaging with visual messaging assistance |
CN102750366A (en) * | 2012-06-18 | 2012-10-24 | 海信集团有限公司 | Video search system and method based on natural interactive import and video search server |
CN104679733A (en) * | 2013-11-26 | 2015-06-03 | 中国移动通信集团公司 | Voice conversation translation method, device and system |
US11790156B2 (en) | 2014-07-25 | 2023-10-17 | Samsung Electronics Co., Ltd. | Text editing method and electronic device supporting same |
US10878180B2 (en) | 2014-07-25 | 2020-12-29 | Samsung Electronics Co., Ltd | Text editing method and electronic device supporting same |
CN105335343A (en) * | 2014-07-25 | 2016-02-17 | 北京三星通信技术研究有限公司 | Text editing method and apparatus |
CN104536570A (en) * | 2014-12-29 | 2015-04-22 | 广东小天才科技有限公司 | Information processing method and device of intelligent watch |
US10628524B2 (en) | 2015-03-24 | 2020-04-21 | Beijing Sogou Technology Development Co., Ltd. | Information input method and device |
WO2016150083A1 (en) * | 2015-03-24 | 2016-09-29 | 北京搜狗科技发展有限公司 | Information input method and apparatus |
CN105260396B (en) * | 2015-09-16 | 2019-09-03 | 百度在线网络技术(北京)有限公司 | Word retrieval method and device |
CN105260396A (en) * | 2015-09-16 | 2016-01-20 | 百度在线网络技术(北京)有限公司 | Word retrieval method and apparatus |
CN108027812A (en) * | 2015-09-18 | 2018-05-11 | 迈克菲有限责任公司 | System and method for multipath language translation |
CN108369585A (en) * | 2015-11-30 | 2018-08-03 | 三星电子株式会社 | Method for providing translation service and its electronic device |
CN108369585B (en) * | 2015-11-30 | 2022-07-08 | 三星电子株式会社 | Method for providing translation service and electronic device thereof |
WO2016197767A3 (en) * | 2016-02-16 | 2017-02-02 | 中兴通讯股份有限公司 | Method and device for inputting expression, terminal, and computer readable storage medium |
CN105898627A (en) * | 2016-05-31 | 2016-08-24 | 北京奇艺世纪科技有限公司 | Video playing method and device |
CN105898627B (en) * | 2016-05-31 | 2019-04-12 | 北京奇艺世纪科技有限公司 | A kind of video broadcasting method and device |
CN106295565A (en) * | 2016-08-10 | 2017-01-04 | 中用环保科技有限公司 | Monitor event identifications based on big data and in real time method of crime prediction |
CN113055275B (en) * | 2016-08-30 | 2022-08-02 | 谷歌有限责任公司 | Conditional disclosure of individually controlled content in a group context |
CN113055275A (en) * | 2016-08-30 | 2021-06-29 | 谷歌有限责任公司 | Conditional disclosure of individually controlled content in a group context |
CN107798386A (en) * | 2016-09-01 | 2018-03-13 | 微软技术许可有限责任公司 | More process synergics training based on unlabeled data |
CN106682967A (en) * | 2017-01-05 | 2017-05-17 | 胡开标 | Online translation and chat system |
CN107480766A (en) * | 2017-07-18 | 2017-12-15 | 北京光年无限科技有限公司 | The method and system of the content generation of multi-modal virtual robot |
WO2019109664A1 (en) * | 2017-12-08 | 2019-06-13 | 北京搜狗科技发展有限公司 | Cross-language search method and apparatus, and apparatus for cross-language search |
CN108255939A (en) * | 2017-12-08 | 2018-07-06 | 北京搜狗科技发展有限公司 | A kind of cross-language search method and apparatus, a kind of device for cross-language search |
CN108255939B (en) * | 2017-12-08 | 2020-02-14 | 北京搜狗科技发展有限公司 | Cross-language search method and device for cross-language search |
CN108173747B (en) * | 2017-12-27 | 2021-10-22 | 上海传英信息技术有限公司 | Information interaction method and device |
CN108173747A (en) * | 2017-12-27 | 2018-06-15 | 上海传英信息技术有限公司 | Information interacting method and device |
CN108874787A (en) * | 2018-06-12 | 2018-11-23 | 深圳市合言信息科技有限公司 | A method of analysis speech intention simultaneously carries out depth translation explanation |
CN109255130A (en) * | 2018-07-17 | 2019-01-22 | 北京赛思美科技术有限公司 | A kind of method, system and the equipment of language translation and study based on artificial intelligence |
CN109817351A (en) * | 2019-01-31 | 2019-05-28 | 百度在线网络技术(北京)有限公司 | A kind of information recommendation method, device, equipment and storage medium |
CN110209772B (en) * | 2019-06-17 | 2021-10-08 | 科大讯飞股份有限公司 | Text processing method, device and equipment and readable storage medium |
CN110209772A (en) * | 2019-06-17 | 2019-09-06 | 科大讯飞股份有限公司 | A kind of text handling method, device, equipment and readable storage medium storing program for executing |
CN112307156A (en) * | 2019-07-26 | 2021-02-02 | 北京宝捷拿科技发展有限公司 | Cross-language intelligent auxiliary side inspection method and system |
CN110706771A (en) * | 2019-10-10 | 2020-01-17 | 复旦大学附属中山医院 | Method and device for generating multi-mode education content, server and storage medium |
WO2021233112A1 (en) * | 2020-05-20 | 2021-11-25 | 腾讯科技(深圳)有限公司 | Multimodal machine learning-based translation method, device, equipment, and storage medium |
CN111651674A (en) * | 2020-06-03 | 2020-09-11 | 北京妙医佳健康科技集团有限公司 | Bidirectional searching method and device and electronic equipment |
CN111651674B (en) * | 2020-06-03 | 2023-08-25 | 北京妙医佳健康科技集团有限公司 | Bidirectional searching method and device and electronic equipment |
CN113656613A (en) * | 2021-08-20 | 2021-11-16 | 北京百度网讯科技有限公司 | Method for training image-text retrieval model, multi-mode image retrieval method and device |
CN114663246A (en) * | 2022-05-24 | 2022-06-24 | 中国电子科技集团公司第三十研究所 | Representation modeling method of information product in propagation simulation and multi-agent simulation method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102262624A (en) | System and method for realizing cross-language communication based on multi-mode assistance | |
CN106777018B (en) | Method and device for optimizing input sentences in intelligent chat robot | |
US11580350B2 (en) | Systems and methods for an emotionally intelligent chat bot | |
WO2016197767A2 (en) | Method and device for inputting expression, terminal, and computer readable storage medium | |
CN102567509B (en) | Method and system for instant messaging with visual messaging assistance | |
US20100100371A1 (en) | Method, System, and Apparatus for Message Generation | |
EP3423956A1 (en) | Interpreting and resolving conditional natural language queries | |
JP2023535709A (en) | Language expression model system, pre-training method, device, device and medium | |
Sardinha | 25 years later | |
CN105808695A (en) | Method and device for obtaining chat reply contents | |
CN111241237A (en) | Intelligent question and answer data processing method and device based on operation and maintenance service | |
KR20110115543A (en) | Method for calculating entity similarities | |
CN107491477A (en) | A kind of emoticon searching method and device | |
CN117056471A (en) | Knowledge base construction method and question-answer dialogue method and system based on generation type large language model | |
US10592609B1 (en) | Human emotion detection | |
CN104735480A (en) | Information sending method and system between mobile terminal and television | |
KR20210002619A (en) | Creation of domain-specific models in network systems | |
CN106202200B (en) | A kind of emotion tendentiousness of text classification method based on fixed theme | |
CN115357755B (en) | Video generation method, video display method and device | |
EP3762876A1 (en) | Intelligent knowledge-learning and question-answering | |
CN106156262A (en) | A kind of search information processing method and system | |
KR20110090675A (en) | System and method for generating sign language animation | |
CN114064943A (en) | Conference management method, conference management device, storage medium and electronic equipment | |
Knight | A multi-modal corpus approach to the analysis of backchanneling behaviour | |
US11929100B2 (en) | Video generation method, apparatus, electronic device, storage medium and program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20111130 |