WO2022242706A1 - Multimodal based reactive response generation - Google Patents

Multimodal based reactive response generation Download PDF

Info

Publication number
WO2022242706A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
multimodal
chat
animation
emotion
Prior art date
Application number
PCT/CN2022/093766
Other languages
French (fr)
Chinese (zh)
Inventor
宋睿华
杜涛
Original Assignee
宋睿华
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 宋睿华 filed Critical 宋睿华
Publication of WO2022242706A1 publication Critical patent/WO2022242706A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Definitions

  • Intelligent human-computer interaction systems have been widely used in more and more scenarios and fields, and can effectively improve the efficiency and experience of human-computer interaction.
  • With the development of artificial intelligence (AI) technology, human-computer interaction systems have also achieved more in-depth development in aspects such as intelligent conversation systems.
  • The intelligent conversation system has covered application scenarios such as task-oriented dialogue, knowledge question answering, and open-domain dialogue, and can be realized using template-based, retrieval-based, and deep-learning-based technologies.
  • Multimodal input data can be obtained.
  • At least one information element may be extracted from the multimodal input data.
  • At least one reference information item may be generated based at least on the at least one information element.
  • The at least one reference information item may be used at least to generate multimodal output data.
  • The multimodal output data may be provided.
  • FIG. 1 illustrates an exemplary architecture of a multimodality-based reactive response generation system according to an embodiment.
  • FIG. 2 illustrates an exemplary process for multimodality-based reactive response generation, according to an embodiment.
  • FIG. 3 illustrates an example of an intelligent animated character scenario according to an embodiment.
  • FIG. 4 illustrates an exemplary process for an intelligent animated character scenario according to an embodiment.
  • FIG. 5 illustrates an exemplary process for intelligent animation generation according to an embodiment.
  • FIG. 6 shows a flowchart of an exemplary method for multimodality-based reactive response generation, according to an embodiment.
  • FIG. 7 illustrates an exemplary apparatus for multimodality-based reactive response generation, according to an embodiment.
  • FIG. 8 illustrates an exemplary apparatus for multimodality-based reactive response generation, according to an embodiment.
  • Existing human-computer interaction systems usually use a single medium as the channel for information input and output; for example, humans and machines, or machines and machines, communicate through only one of text, voice, or gestures.
  • Taking the intelligent conversation system as an example, although it can be oriented to text or speech, it still takes text processing or text analysis as its core.
  • The intelligent conversation system does not consider information beyond the text, such as the facial expressions and body movements of the interaction objects, nor factors such as sound and light in the environment, which leads to common problems in the interaction process.
  • One such problem is that the understanding of information is not comprehensive and accurate enough.
  • The embodiments of the present disclosure propose a multimodal-based reactive response generation scheme, which can be implemented on a variety of intelligent conversation subjects and can be widely used in various scenarios involving human-computer interaction.
  • Intelligent conversation subjects can broadly refer to AI product forms that can generate and present information content and provide interactive functions in specific application scenarios, such as chat robots, intelligent animated characters, virtual anchors, intelligent car-machine assistants, smart customer service, smart speakers, etc.
  • an intelligent conversational agent may generate multimodal output data based on multimodal input data, wherein the multimodal output data is a response generated in a reactive manner to be presented to the user.
  • the embodiments of the present disclosure propose a multimodal human-computer interaction method.
  • interaction can broadly refer to the understanding and expression of information, data, content, etc.
  • Human-computer interaction can broadly refer to the interaction between the intelligent conversation subject and its interaction objects, for example, interaction between the intelligent conversation subject and human users, interaction between intelligent conversation subjects, responses of the intelligent conversation subject to various media content or informational data, and so on.
  • the embodiments of the present disclosure have various advantages. In one aspect, more accurate information understanding can be achieved.
  • By processing multimodal input data including media content, collected images or audio, chat sessions, and external environment data, information can be collected and analyzed more comprehensively, misunderstandings caused by missing information can be reduced, and the deep-level intent of interaction objects can be understood more accurately.
  • In another aspect, expression is more efficient.
  • By superimposing and expressing information in multiple ways and multiple modalities, for example, superimposing facial expressions and/or body movements of an avatar or other animation sequences on top of speech or text, information and emotions can be expressed more efficiently.
  • the interactive behavior of the intelligent conversation subject will be more vivid.
  • the understanding and expression of multimodal data will make the subject of intelligent conversation more anthropomorphic, thereby significantly improving user experience.
  • The embodiments of the present disclosure can enable the intelligent conversation subject to imitate human beings in generating natural responses to multimodal input data such as speech, text, music, and video images, i.e., to make reactive responses.
  • The reactive response of the intelligent conversation subject is not limited to, for example, responses to chat messages from the user, but also covers spontaneous responses to various input data such as media content, captured images or audio, the external environment, etc.
  • Taking as an example a scenario in which the intelligent conversation subject acts as an intelligent animated character to provide AI companionship, assuming that the intelligent conversation subject can accompany the user in watching videos through a corresponding avatar, the intelligent conversation subject can not only interact directly with the user, but also respond spontaneously and reactively to the content in the video; for example, the avatar can speak, make facial expressions, make body movements, present text, etc.
  • the behavior of the intelligent conversation subject will be more anthropomorphic.
  • Embodiments of the present disclosure propose a general multimodal-based reactive response generation technology.
  • intelligent conversation subjects can efficiently and quickly obtain multimodal interaction capabilities.
  • multimodal-based reactive response generation technology according to the embodiments of the present disclosure, multimodal input data from various media channels can be integrated and processed, and the intent expressed by the multimodal input data can be interpreted more accurately and effectively.
  • The intelligent conversation subject can provide multimodal output data through multiple channels to express overall consistent information, thereby improving the accuracy and efficiency of information expression, making the information expression of the intelligent conversation subject more vivid and interesting, and thus significantly improving the user experience.
  • The multimodality-based reactive response generation technology can be adaptively applied to various scenarios. Based on the input and output capabilities supported by different scenarios, embodiments of the present disclosure can obtain corresponding multimodal input data in different scenarios and output multimodal output data suitable for the specific scenario. Taking as an example a scenario in which an intelligent conversation subject acting as an intelligent animated character automatically generates animations, embodiments of the present disclosure may generate reactive responses including, for example, animation sequences for the avatar of the intelligent animated character. For example, when the intelligent animated character is used to accompany the user in watching a video, the intelligent animated character can perform deep perception and understanding of multimodal input data from the video content, collected images or audio, chat sessions, external environment data, etc., and respond accordingly in an intelligent and dynamic manner through multiple modalities such as speech, text, and animation sequences including facial expressions and/or body movements, so as to achieve a comprehensive, efficient, and vivid human-computer interaction experience.
  • In this way, the perception and emotional expression abilities of the intelligent animated character are greatly enhanced, and the intelligent animated character becomes more anthropomorphic. This can also serve as the technical basis for content creation, such as intelligent animation, through AI technology.
  • the chat robot can chat with the user in forms such as voice, text, video, etc.
  • The multimodal input data processed by the embodiments of the present disclosure can include, for example, chat sessions, collected images or audio, external environment data, etc.
  • the multimodal output data provided may include, for example, voice, text, animation sequences, etc.
  • The multimodal input data processed by the embodiments of the present disclosure can include, for example, played media content, external environment data, etc.
  • the provided multimodal output data may include, for example, voice, text, animation sequences of avatars, and the like.
  • The smart car-machine assistant can provide assistance or companionship while the user is driving a vehicle; in this case, the multimodal input data processed by the embodiments of the present disclosure may include, for example, chat sessions, collected images or audio, external environment data, etc.
  • the provided multimodal output data may include, for example, voice, text, and the like.
  • The multimodal input data processed by the embodiments of the present disclosure can include, for example, chat sessions, external environment data, etc.
  • the multimodal output data provided may include, for example, voice, text, animation, etc.
  • the voice assistant or chat robot in the smart speaker can interact with the user, play audio content, etc.
  • The multimodal input data processed by the embodiments of the present disclosure can include, for example, played audio content, chat sessions, collected audio, external environment data, etc.
  • the provided multimodal output data may include, for example, voice and the like.
  • FIG. 1 shows an exemplary architecture of a multi-modality-based reactive response generation system 100 according to an embodiment.
  • the system 100 can support the intelligent conversation subject to make multimodal-based reactive responses in different scenarios.
  • An intelligent conversational subject may be implemented or resident on an end device or any user-accessible device or platform.
  • The system 100 may include a multimodal data input interface 110 for obtaining multimodal input data.
  • the multimodal data input interface 110 can collect various types of input data from various data sources.
  • The multimodal data input interface 110 may collect data such as images, audio, and bullet chat files of the target content.
  • the target content may broadly refer to various media content played on a device or presented to a user, for example, video content, audio content, picture content, text content, and the like.
  • the multimodal data input interface 110 can obtain input data about the chat conversation.
  • the multimodal data input interface 110 may collect images and/or audio around the user through a camera and/or a microphone on the terminal device.
  • the multimodal data input interface 110 can also obtain external environment data from a third-party application or any other information source.
  • the external environment data may broadly refer to various environmental parameters in the real world where the terminal device or the user is located, for example, data about weather, temperature, humidity, travel speed, and the like.
  • the multimodal data input interface 110 may provide the obtained multimodal input data 112 to the core processing unit 120 in the system 100 .
  • the core processing unit 120 provides various core processing capabilities required for reactive response generation. Based on the processing stage and type, the core processing unit 120 may further include multiple processing modules, for example, a data integration processing module 130, a scene logic processing module 140, a multimodal output data generation module 150, and the like.
  • the data integration processing module 130 can extract different types of multi-modal information from the multi-modal input data 112 , and the extracted multi-modal information can be in the same context under specific scenarios and time sequence conditions.
  • the data integration processing module 130 can extract one or more information elements 132 from the multimodal input data 112 .
  • information elements can broadly refer to computer-understandable information or information representations extracted from raw data.
  • the data integration processing module 130 may extract information elements from the target content included in the multimodal input data 112, for example, extract information elements from images, audio, bullet chat files, etc. of the target content.
  • the information elements extracted from the image of the target content may include, for example, character features, text, image light, objects, etc.
  • the information elements extracted from the audio of the target content may include, for example, music, voice, etc.
  • The information elements extracted from the bullet chat file of the target content may include, for example, bullet chat text and the like.
  • music may broadly refer to song singing, instrumental performance, or a combination thereof
  • speech may broadly refer to the sound of speech.
  • data integration processing module 130 may extract informational elements, such as message text, from chat sessions included in multimodal input data 112 .
  • the data integration processing module 130 can extract information elements, such as object features, from the captured images included in the multimodal input data 112 .
  • the data integration processing module 130 may extract information elements such as speech, music, etc. from the collected audio included in the multimodal input data 112 . In one aspect, the data integration processing module 130 may extract information elements such as external environment information from the external environment data included in the multimodal input data 112 .
  • the scene logic processing module 140 may generate one or more reference information items 142 based at least on the information elements 132 .
  • a reference information item may broadly refer to various guiding information generated based on various information elements for reference by the system 100 when generating multimodal output data.
  • the reference information item 142 can include an emotion tag that can guide the emotion that the multimodal output data is presented or based on.
  • the reference information item 142 may include an animation tag, which may be used to select the animation to be presented where the multimodal output data is to include an animation sequence.
  • the reference information item 142 may include comment text, and the comment text may be, for example, a comment on the target content, so as to express the intelligent conversation subject's own opinion or evaluation on the target content.
  • The reference information item 142 may include chat response text, which may be a response to message text from a chat session. It should be understood that, optionally, the scene logic processing module 140 may also consider other factors in the process of generating the reference information items 142, for example, a scene-specific emotion, a preset personality of the intelligent conversation subject, a preset role of the intelligent conversation subject, etc.
  • the multimodal output data generation module 150 may utilize at least the reference information item 142 to generate the multimodal output data 152 .
  • the multimodal output data 152 may include various types of output data, such as speech, text, animation sequences, and the like.
  • the voice included in the multimodal output data 152 may be, for example, the voice corresponding to the comment text or the chat response text
  • the text included in the multimodal output data 152 may be, for example, the text corresponding to the comment text or the chat response text
  • the animation sequence included in the multimodal output data 152 may be, for example, an animation sequence of an avatar of the intelligent conversation subject. It should be understood that, optionally, the multimodal output data generation module 150 may also consider more other factors during the process of generating the multimodal output data 152 , for example, scene-specific requirements and the like.
  • System 100 may include a multimodal data output interface 160 for providing multimodal output data 152 .
  • the multimodal data output interface 160 may support providing or presenting multiple types of output data to a user.
  • the multimodal data output interface 160 can present text, animation sequences, etc. via a display screen, and can play voice, etc. via a speaker.
  • the architecture of the multimodal-based reactive response generation system 100 described above is only exemplary, and the system 100 may include more or less component units or modules according to actual application requirements and designs.
  • the system 100 may be implemented by hardware, software or a combination thereof.
  • The multimodal data input interface 110, the core processing unit 120, and the multimodal data output interface 160 may be units implemented based on hardware; for example, the core processing unit 120 may be implemented by a hardware processing unit with computing capability, and the multimodal data input interface 110 and the multimodal data output interface 160 may be implemented by a hardware interface unit with data input/output capability.
  • the units or modules included in the system 100 may also be implemented by software or programs, so these units or modules may be software units or software modules.
  • the units and modules included in the system 100 may be implemented at the terminal device, or may be implemented at the network device or platform, or may be partially implemented at the terminal device while the other part is implemented at the network device or platform.
  • FIG. 2 illustrates an exemplary process 200 for multimodality-based reactive response generation, according to an embodiment.
  • the steps or processes in the process 200 may be performed by, for example, corresponding units or modules in the multi-modality-based reactive response generation system in FIG. 1 .
  • multimodal input data 212 can be obtained.
  • The multimodal input data 212 may include, for example, at least one of images of the target content, audio of the target content, bullet chat files of the target content, chat sessions, collected images, collected audio, external environment data, and the like.
  • Images, audio, and bullet chat files of the target content can be obtained at 210.
  • data about the chat session can be obtained at 210, which includes chat records in the chat session and the like.
  • multimodal input data 212 is not limited to the exemplary input data described above.
  • one or more informational elements 222 may be extracted from the multimodal input data 212 .
  • corresponding information elements may be extracted from these input data, respectively.
  • Where the multimodal input data 212 includes images of the target content, character features may be extracted from the images of the target content.
  • Taking the target content as an example of a concert video played on a terminal device, various character features of the singer can be extracted from the images of the video, such as facial expressions, body movements, clothing colors, and the like. It should be understood that the embodiments of the present disclosure are not limited to any specific character feature extraction technology.
  • text may be identified from the image of the target content.
  • text may be recognized from an image by a text recognition technique such as optical character recognition (OCR).
  • Some images in this video may contain music information, such as song title, lyricist, composer, singer, performer, etc., so this music information can be obtained through text recognition.
  • the embodiments of the present disclosure are not limited to recognizing text by OCR technology, but any other text recognition technology may be used.
  • the text recognized from the image of the target content is not limited to music information, and may also include any other text indicating information related to events occurring in the image, such as subtitles, lyrics, etc.
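  • As a purely illustrative sketch of the text recognition described above (the disclosure is not limited to any particular OCR technology), the following example uses the open-source Tesseract engine via pytesseract; the frame path and language codes are assumptions.

```python
# Hypothetical sketch: recognizing music information or subtitles from a single
# image of the target content using the open-source Tesseract OCR engine (one
# possible choice, not mandated by the disclosure). Path and language are assumed.
from PIL import Image
import pytesseract

def recognize_text_from_frame(frame_path: str) -> str:
    """Return all text recognized in one image of the target content."""
    image = Image.open(frame_path)
    # "chi_sim+eng" assumes Chinese/English captions such as song title or lyrics.
    return pytesseract.image_to_string(image, lang="chi_sim+eng")

if __name__ == "__main__":
    text = recognize_text_from_frame("concert_frame_001.png")  # hypothetical frame
    print(text)  # may contain song title, lyricist, composer, singer, subtitles, etc.
```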
  • Image light may be detected from the image of the target content.
  • Image light may refer to the characteristics of ambient light in the picture presented by the image, for example, bright, dim, gloomy, flickering, and the like.
  • the stage at the concert site may use bright lights, so it can be detected from these images that the image light is bright. It should be understood that embodiments of the present disclosure are not limited to any particular image light detection technique.
  • objects may be identified from the images of the target content.
  • the identified object may be, for example, a representative object in the image, an object appearing in a prominent or important position in the image, an object associated with a person in the image, etc.
  • The identified object may include props, background furnishings, etc.
  • the target content as an example of a concert video, assuming that the singer is playing a guitar while singing a song, the object "guitar" can be identified from the image. It should be understood that embodiments of the present disclosure are not limited to any particular object recognition technology.
  • music may be extracted from the audio of the target content.
  • the target content itself may be audio, for example, a song played to the user on the terminal device, and correspondingly, the music corresponding to the song may be extracted from the audio.
  • the target content may also be a video, such as a concert video, and accordingly, music may be extracted from the audio contained in the video.
  • music may broadly include, for example, musical pieces played by musical instruments, songs sung by singers, special effects sounds produced by special equipment or voice actors, and the like.
  • the extracted music may be background music, foreground music, or the like.
  • music extraction may broadly refer to, for example, obtaining sound files, sound wave data, etc. corresponding to music. It should be understood that embodiments of the present disclosure are not limited to any particular music extraction technique.
  • speech may be extracted from the audio of the target content.
  • speech may refer to the sound of speech.
  • the target content includes conversations, speeches, comments, etc. of people or characters
  • the corresponding voice can be extracted from the audio of the target content.
  • Speech extraction may broadly refer to, for example, obtaining sound files, sound wave data, etc. corresponding to speech. It should be understood that embodiments of the present disclosure are not limited to any specific speech extraction technology.
  • the bullet chatting text may be extracted from the bullet chatting file of the target content.
  • Some video playback applications or playback platforms allow different viewers of a video to send their own comments, feelings, etc. in the form of bullet chats, and these comments, feelings, etc. can be included as bullet chat text in a bullet chat file attached to the video; therefore, the bullet chat text can be extracted from the bullet chat file. It should be understood that the embodiments of the present disclosure are not limited to any specific bullet chat text extraction technology.
  • message text may be extracted from the chat sessions.
  • the message text may include, for example, the text of a chat message sent by the intelligent conversation subject, the text of a chat message sent by at least one other chat participant, and the like.
  • Where the chat session is carried out in the form of text, the message text can be directly extracted from the chat session; where the chat session is in the form of voice, the voice messages in the chat session can be converted into message text. It should be understood that the embodiments of the present disclosure are not limited to any specific message text extraction technology.
  • object features may be extracted from the acquired images.
  • Object features may broadly refer to various characteristics of objects appearing in captured images, and the objects may include, for example, people, objects, and the like.
  • Various features about the user, such as facial expressions and body movements, can be extracted from the image.
  • various features such as vehicles in front, traffic signs, roadside buildings, etc. may be extracted from the image.
  • the embodiments of the present disclosure are not limited to extracting the above exemplary object features from the collected images, but can also extract any other object features.
  • the embodiments of the present disclosure are not limited to any specific object feature extraction technique.
  • Where the multimodal input data 212 includes collected audio, speech and/or music may be extracted from the collected audio. Similar to the above-described manner of extracting speech, music, etc. from the audio of the target content, speech, music, etc. may be extracted from the collected audio.
  • external environment information may be extracted from the external environment data.
  • specific weather information may be extracted from data on weather
  • specific temperature information may be extracted from data on temperature
  • specific speed information may be extracted from data on travel speed, and so on. It should be understood that the embodiments of the present disclosure are not limited to any specific external environment information extraction technology.
  • the above-described information elements extracted from the multimodal input data 212 are exemplary, and embodiments of the present disclosure may also extract any other types of information elements.
  • The extracted information elements can be in the same context under specific scenario and timing conditions. For example, these information elements can be aligned in time, and accordingly, different combinations of information elements can be extracted at different time points.
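  • A minimal sketch of how extracted information elements might be grouped into time-aligned combinations as described above; the field names and window size are illustrative assumptions rather than structures required by the disclosure.

```python
# Illustrative sketch (field names are assumptions): grouping information
# elements extracted from different modalities into time-aligned combinations.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class InformationElement:
    modality: str        # e.g. "target_image", "target_audio", "chat", "environment"
    kind: str            # e.g. "character_feature", "music", "message_text"
    value: object        # the extracted information representation
    timestamp: float     # seconds since the start of the session or target content

def group_by_time(elements: List[InformationElement],
                  window: float = 1.0) -> Dict[int, List[InformationElement]]:
    """Bucket elements into time windows so that each bucket holds the
    combination of information elements sharing the same context."""
    buckets: Dict[int, List[InformationElement]] = {}
    for e in elements:
        buckets.setdefault(int(e.timestamp // window), []).append(e)
    return buckets
```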
  • one or more reference information items 232 may be generated based at least on information elements 222 .
  • the reference information item 232 generated at 230 may include a sentiment tag.
  • Sentiment tags may indicate, for example, emotion type, emotion level, and the like.
  • Embodiments of the present disclosure may encompass any number of predetermined emotion types, and any number of emotion levels defined for each emotion type.
  • Exemplary emotion types may include, for example, happiness, sadness, anger, etc.
  • exemplary emotion levels may include level 1, level 2, level 3, etc. according to the intensity of emotion from low to high.
  • If the emotion tag <happy, level 2> is determined at 230, it indicates that the information elements 222 express the emotion of happiness as a whole and the emotion level is a medium level of level 2.
  • The exemplary emotion types, exemplary emotion levels, and their expressions are given above only for convenience of explanation; the embodiments of the present disclosure may also adopt more or fewer emotion types, any other emotion types and emotion levels, and any other form of expression.
  • the emotions expressed by each information element can be determined first, and then these emotions can be considered comprehensively to determine the final emotion type and emotion level.
  • one or more emotion representations respectively corresponding to one or more information elements in the information elements 222 may be generated first, and then a final emotion label is generated based at least on these emotion representations.
  • the emotion representation may refer to an informational representation of emotion, which may take the form of, for example, an emotion vector, an emotion label, and the like.
  • the emotion vector may include multiple dimensions for expressing emotion distribution, each dimension corresponds to an emotion type, and the value on each dimension indicates the prediction probability or weight of the corresponding emotion type.
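  • For concreteness, an emotion vector of the kind described above might look like the following sketch; the particular emotion types and values are assumptions used only for illustration.

```python
# Illustrative emotion vector (types and values are assumptions): one dimension
# per predefined emotion type, each value being the predicted probability or
# weight of that type for a given information element.
import numpy as np

EMOTION_TYPES = ["happiness", "sadness", "anger"]     # example emotion types only
emotion_vector = np.array([0.7, 0.2, 0.1])            # a distribution over the types

dominant = EMOTION_TYPES[int(np.argmax(emotion_vector))]
print(dominant, float(emotion_vector.max()))          # happiness 0.7
```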
  • a pre-trained machine learning model may be used to generate an emotion representation corresponding to the character feature.
  • a convolutional neural network model for facial emotion recognition can be used to predict the corresponding emotional representation.
  • the convolutional neural network model can also be trained to comprehensively consider other features that may be included in the character features, such as body movements, to predict emotional representation. It should be understood that the embodiments of the present disclosure are not limited to any specific technology for determining the emotional expression corresponding to the character's characteristics.
  • The emotion information corresponding to the music can be retrieved from a pre-established music database based on the music information, so as to form an emotion representation.
  • the music database may include music information of a large amount of music collected in advance and corresponding emotional information, music type, background knowledge, chat corpus, and the like.
  • the music database can be indexed according to various music information such as song name, singer, performer, etc., so that emotional information corresponding to specific music can be found from the music database based on the music information.
  • music genres found from music databases can also be used to form emotion representations.
  • a pre-trained machine learning model may be used to generate an emotion representation corresponding to the subtitle.
  • the machine learning model may be, for example, an emotion classification model based on a convolutional neural network. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to text recognized from an image of target content.
  • the emotion representation corresponding to the object may be determined based on a pre-established machine learning model or a pre-set heuristic rule.
  • Objects in an image can also help express emotion. For example, if the image shows that a number of red ornaments are arranged on the stage to enhance the atmosphere, these red ornaments recognized from the image may help to determine emotions such as joy. It should be appreciated that embodiments of the present disclosure are not limited to any particular technique for determining an emotion representation corresponding to an object recognized from an image of the target content.
  • the emotional representation corresponding to the music may be determined or generated in a number of ways. In one manner, if the music information has been recognized, the emotion information corresponding to the music may be found from a music database based on the music information, so as to form an emotion expression. In one manner, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the music based on various music features extracted from the music.
  • Music features can include the audio average energy (AE) of the music, which may be computed as $AE = \frac{1}{N}\sum_{t=1}^{N} x(t)^{2}$, where x is the discrete audio input signal, t is the time index, and N is the number of input samples of x.
  • Musical features may also include rhythmic features extracted from music represented by the number of beats and/or the distribution of beat intervals.
  • the music feature may also include the aforementioned emotional information corresponding to the music obtained by using the music information.
  • the machine learning model can be trained based on the above one or more music features, so that the trained machine learning model can predict the emotional expression of music. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to music extracted from audio of target content.
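  • The following sketch illustrates how the music features mentioned above, the audio average energy and simple beat-interval statistics, could be computed; librosa is an assumed tool choice and the audio path is hypothetical, since the disclosure does not mandate any particular library.

```python
# Illustrative extraction of music features: audio average energy (AE) and
# rhythm features (number of beats, beat-interval distribution). librosa is
# an assumed choice; the audio path is hypothetical.
import numpy as np
import librosa

def music_features(audio_path: str) -> dict:
    x, sr = librosa.load(audio_path, sr=None, mono=True)    # discrete audio signal x
    n = len(x)
    average_energy = float(np.sum(x ** 2) / n)               # AE = (1/N) * sum x(t)^2
    _, beat_frames = librosa.beat.beat_track(y=x, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    intervals = np.diff(beat_times)                          # beat-interval distribution
    return {
        "average_energy": average_energy,
        "beat_count": int(len(beat_times)),
        "beat_interval_mean": float(intervals.mean()) if len(intervals) else 0.0,
        "beat_interval_std": float(intervals.std()) if len(intervals) else 0.0,
    }

# features = music_features("extracted_music.wav")   # hypothetical extracted music
# These features could then feed a trained model that predicts the emotion representation.
```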
  • a pre-trained machine learning model may be utilized to generate an emotional representation corresponding to the speech. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to speech extracted from audio of target content.
  • a pre-trained machine learning model may be used to generate an emotion representation corresponding to the bullet chat text.
  • The machine learning model may be, for example, a convolutional neural network-based sentiment classification model, denoted as CNN_sen.
  • a pre-trained machine learning model may be utilized to generate an emotional representation corresponding to the message text.
  • the machine learning model may be established in a manner similar to the aforementioned machine learning model for generating an emotion representation corresponding to the bullet chat text. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to message text extracted from a chat session.
  • a pre-trained machine learning model may be utilized to generate an emotional representation corresponding to the object features. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining emotional representations corresponding to object features extracted from captured images.
  • An emotion representation corresponding to the speech and/or music extracted from the collected audio may be generated, in a manner similar to that described above for determining the emotion representation corresponding to the speech and/or music extracted from the audio of the target content. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining emotion representations corresponding to speech and/or music extracted from collected audio.
  • the emotion expression corresponding to the external environment information may be determined based on a pre-established machine learning model or a preset heuristic rule. Taking the external environment information as "cloudy and rainy" weather as an example, since people tend to show slightly sad emotions in cloudy and rainy weather, the emotional expression corresponding to the sad emotion can be determined from the external environment information. It should be understood that the embodiments of the present disclosure are not limited to any specific technology for determining the emotion representation corresponding to the external environment information extracted from the external environment data.
  • a final emotion tag can be generated based at least on these emotion representations.
  • the final sentiment label can be understood as indicating the overall sentiment determined by comprehensively considering various information elements.
  • Sentiment labels can be formed from multiple sentiment representations in various ways. For example, in the case that emotion representations use emotion vectors, multiple emotion representations can be superimposed to obtain a total emotion vector, and the emotion type and emotion level can be derived from the emotion distribution in the total emotion vector to form the final emotion label.
  • the final emotional tag may be calculated, selected or determined from multiple emotional tags corresponding to multiple information elements based on predetermined rules. It should be understood that the embodiments of the present disclosure are not limited to any specific manner of generating emotion tags based on multiple emotion representations.
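  • A minimal sketch of the superposition approach described above, assuming emotion vectors over a small example set of emotion types; the weights and level thresholds are assumptions.

```python
# Illustrative superposition of per-element emotion vectors into a final
# emotion tag (emotion types, weights and level thresholds are assumptions).
import numpy as np

EMOTION_TYPES = ["happiness", "sadness", "anger"]

def combine_emotions(vectors, weights=None):
    """Superimpose several emotion vectors (one per information element)
    into a total emotion vector, then derive an <emotion type, level> tag."""
    vectors = np.asarray(vectors, dtype=float)
    if weights is None:
        weights = np.ones(len(vectors))
    total = (vectors * np.asarray(weights)[:, None]).sum(axis=0)
    total /= total.sum()                       # normalize the emotion distribution
    idx = int(np.argmax(total))
    # bucket the dominant probability into three levels (1 = low, 3 = high)
    level = 1 if total[idx] < 0.45 else (2 if total[idx] < 0.7 else 3)
    return EMOTION_TYPES[idx], level

# e.g. facial expression, bullet chat text and music each contribute a vector
print(combine_emotions([[0.6, 0.3, 0.1],
                        [0.5, 0.4, 0.1],
                        [0.8, 0.1, 0.1]]))     # -> ('happiness', 2)
```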
  • Alternatively, the emotion label may be generated directly based on multiple information elements.
  • For example, a machine learning model can be pre-trained to take multiple information elements as input features and predict the emotion label accordingly.
  • The trained model can then be used to generate the emotion label directly based on the information elements 222.
  • the reference information item 232 generated at 230 may include an animation tag.
  • the animation tag can be used to select the animation to be presented.
  • the animation tag may indicate at least one or a combination of facial expression types, body movement types, etc. of the avatar. Facial expressions may include, for example, smiling, laughing, blinking, curling lips, speaking, etc., and body movements may include, for example, turning left, waving, body swinging, dance moves, and the like.
  • At least one information element 222 may be mapped to an animation tag according to a predetermined rule.
  • various animation tags may be predefined, and a large number of mapping rules from information element sets to animation tags may be predefined, where the information element set may include one or more information elements. Therefore, when an information element set including one or more information elements is given, the corresponding animation label can be determined based on one information element or a combination of multiple information elements in the information element set by referring to a predefined mapping rule .
  • An exemplary mapping rule is: when the character features extracted from the image of the target content indicate that the character is singing, and the bullet chat text includes keywords such as "good to hear" and "intoxicated", these information elements can be mapped to animation tags such as "close eyes" and "sway body", so that the avatar can exhibit behaviors such as listening to a song while intoxicated.
  • Another exemplary mapping rule is: when the speech extracted from the audio of the target content indicates that people are arguing, the bullet chat text includes keywords such as "noisy" and "don't want to listen", and the message text extracted from the chat session includes keywords indicating the user's disgust, these information elements can be mapped to animation tags such as "cover ears with hands" and "shake head", so that the avatar can show behaviors such as not wanting to hear the quarrel.
  • Another exemplary mapping rule is: when the image light detected from the image of the target content indicates rapid light-dark changes, the object identified from the image of the target content is a guitar, and the music extracted from the audio of the target content is fast-paced, these information elements can be mapped to animation tags such as "play the guitar" and "fast-paced dance moves", so that the avatar can show behaviors such as mimicking playing the guitar and dancing along with the lively music. It should be understood that the above only lists several exemplary mapping rules, and embodiments of the present disclosure may also define a large number of other mapping rules.
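  • A minimal rule-based sketch of the mapping from information-element sets to animation tags illustrated by the examples above; the rule encoding, keywords, and tag names are assumptions.

```python
# Illustrative rule-based mapping from a set of information elements to
# animation tags, following the example rules above (encoding is an assumption).
from typing import Dict, List

MAPPING_RULES = [
    # (condition over the information-element set, resulting animation tags)
    (lambda e: "singing" in e.get("character_features", [])
               and any(k in e.get("bullet_chat", "") for k in ("good to hear", "intoxicated")),
     ["close_eyes", "sway_body"]),
    (lambda e: e.get("speech_event") == "arguing"
               and any(k in e.get("bullet_chat", "") for k in ("noisy", "don't want to listen")),
     ["cover_ears", "shake_head"]),
    (lambda e: e.get("image_light") == "rapid_light_dark_change"
               and "guitar" in e.get("objects", [])
               and e.get("music_tempo") == "fast",
     ["play_guitar", "fast_dance"]),
]

def select_animation_tags(elements: Dict) -> List[str]:
    """Return the animation tags of every predefined rule whose condition holds."""
    tags: List[str] = []
    for condition, rule_tags in MAPPING_RULES:
        if condition(elements):
            tags.extend(rule_tags)
    return tags

print(select_animation_tags({
    "character_features": ["singing"],
    "bullet_chat": "so good to hear, totally intoxicated",
}))  # -> ['close_eyes', 'sway_body']
```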
  • the animation tag may also be further generated based on the emotion tag.
  • emotion tags can be used together with information elements to define mapping rules, so that corresponding animation tags can be determined based on the combination of information elements and emotion tags.
  • a direct mapping rule from emotion tags to animation tags can also be defined, so that after the emotion tags are generated, the corresponding animation tags can be determined directly based on the emotion tags by referring to the defined mapping rules.
  • For example, a mapping rule can be defined from the emotion tag <sadness, level 2> to animation tags such as "crying" and "wiping tears with hands".
  • the reference information item 232 generated at 230 may include review text.
  • the comment text may be, for example, a comment on the target content, so as to express the intelligent conversation subject's own opinion or evaluation on the target content.
  • the comment text can be selected from the bullet chat text of the target content.
  • A comment generation model constructed based on the two-tower model can be used to select comment text from the bullet chat text.
  • the bullet chat text of the target content may be time-aligned with the image and/or audio of the target content, wherein the time alignment may refer to being located at the same moment or within the same time period.
  • the bullet chat text at a specific moment may include multiple sentences, and these sentences may be comments of different viewers on the image and/or audio of the target content at that moment or in a nearby time period.
  • the comment generation model can select a suitable sentence from the corresponding bullet chat text as the comment text for the image and/or audio of the target content at that moment or in a nearby time period.
  • the two-tower model can be used to determine the matching degree between the sentences in the bullet chat text of the target content and the image and/or audio of the target content, and the sentence with the highest matching degree is selected from the bullet chat text as the comment text.
  • The comment generation model may include, for example, two two-tower models.
  • One two-tower model can be used to output a first matching score based on the input image of the target content and the sentence, to indicate the degree of matching between the image and the sentence, while the other two-tower model can be used to output a second matching score based on the input audio of the target content and the sentence, to indicate the degree of matching between the audio and the sentence.
  • The first matching score and the second matching score can be combined in any manner to obtain a comprehensive matching score for the sentence.
  • the sentence with the highest matching score may be selected as the comment text for the current image and/or audio.
  • Alternatively, the comment generation model may include only one of the two two-tower models, or may be based on any other model that is trained to determine how well a sentence in the bullet chat text matches the image and/or audio of the target content.
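  • A simplified sketch of the two-tower matching described above: one tower encodes a feature of the image (or audio) of the target content, the other encodes a candidate bullet chat sentence, and their similarity serves as the matching score; the encoder architectures, feature dimensions, and score combination are assumptions.

```python
# Simplified two-tower matching sketch (PyTorch; encoders, dimensions and the
# score combination are assumptions). Each tower maps its input into a shared
# space; the cosine similarity of the two embeddings is the matching score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerMatcher(nn.Module):
    def __init__(self, content_dim=512, text_dim=300, hidden=128):
        super().__init__()
        self.content_tower = nn.Sequential(nn.Linear(content_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, hidden))
        self.text_tower = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, hidden))

    def forward(self, content_feat, sentence_feat):
        # content_feat: (batch, content_dim), sentence_feat: (batch, text_dim)
        c = F.normalize(self.content_tower(content_feat), dim=-1)
        s = F.normalize(self.text_tower(sentence_feat), dim=-1)
        return (c * s).sum(dim=-1)            # cosine-similarity matching score

def pick_comment(image_matcher, audio_matcher,
                 image_feat, audio_feat, sentence_feats, sentences):
    """image_feat/audio_feat: (1, dim); sentence_feats: (n, text_dim).
    Combine the image-based and audio-based scores (simple average here) and
    return the bullet chat sentence with the highest combined score."""
    n = len(sentences)
    s_img = image_matcher(image_feat.expand(n, -1), sentence_feats)
    s_aud = audio_matcher(audio_feat.expand(n, -1), sentence_feats)
    return sentences[int(torch.argmax(0.5 * s_img + 0.5 * s_aud))]
```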
  • the reference information item 232 generated at 230 may also include chat response text.
  • The other chat participant may be, for example, a user, another intelligent conversation subject, or the like.
  • the corresponding chat response text can be generated at least based on the message text through the chat engine.
  • any common chat engine can be used to generate the chat response text.
  • the chat engine can generate chat response text based at least on the sentiment tag.
  • the chat engine may be trained to generate chat response text based at least on the input message text and the emotion tag, so that the chat response text is generated under the influence of the emotion indicated by the emotion tag at least.
  • The intelligent conversation subject can exhibit the characteristic of emotional continuation in a chat session; for example, the response of the intelligent conversation subject is affected not only by the emotion of the currently received message text, but also by the intelligent conversation subject's own current emotional state.
  • If the intelligent conversation subject is currently in a happy emotional state, then even though the currently received message text may carry or cause negative emotions such as anger, the intelligent conversation subject will not immediately give an angry response because of that message text; instead, the happy emotion may still be maintained, or its emotion level may only be slightly lowered.
  • Existing chat engines usually determine the emotion type of a response only for the current round of conversation or only according to the currently received message text, so the emotion type of the response may change frequently with the received message text; this does not conform to human behavior, since humans are usually in a relatively stable emotional state when chatting and do not change their emotional state frequently.
  • the intelligent conversation subject with the emotional continuation characteristic in the chat conversation proposed by the embodiments of the present disclosure will be more anthropomorphic.
  • the chat engine can generate the chat response text based at least on the emotion representation from the emotion transfer network.
  • the emotion transfer network is used to model dynamic emotion transformation, which can not only maintain a stable emotional state, but also make appropriate adjustments or updates to the emotional state in response to the currently received message text.
  • the emotion transfer network can take the current emotion representation and the currently received message text as input, and output an updated emotion representation, wherein the current emotion representation can be, for example, a vector representation of the current emotional state of the intelligent conversation subject.
  • the updated emotion representation contains information reflecting the previous emotion state and information about the emotion change that may be caused by the current message text.
  • the updated emotional representation can be further provided to the chat engine, so that the chat engine can generate a chat response text for the current message text under the influence of the received emotional representation.
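  • A minimal sketch, under assumed dimensions, of an emotion transfer network of the kind described above: a recurrent cell takes the current emotion representation and an embedding of the received message text and outputs an updated emotion representation, which can then condition the chat engine.

```python
# Minimal sketch of an emotion transfer network (PyTorch; dimensions are
# assumptions): it updates the agent's emotion representation given the
# current emotion state and the embedding of the received message text.
import torch
import torch.nn as nn

class EmotionTransferNetwork(nn.Module):
    def __init__(self, emotion_dim=8, message_dim=300):
        super().__init__()
        # A GRU cell keeps part of the previous emotional state (continuation)
        # while letting the current message adjust it moderately.
        self.cell = nn.GRUCell(input_size=message_dim, hidden_size=emotion_dim)

    def forward(self, current_emotion, message_embedding):
        return self.cell(message_embedding, current_emotion)   # updated emotion

# Usage sketch: the updated representation conditions chat response generation.
net = EmotionTransferNetwork()
current_emotion = torch.zeros(1, 8)             # e.g. a neutral starting state
message_embedding = torch.randn(1, 300)         # embedding of the received text
updated_emotion = net(current_emotion, message_embedding)
# chat_response = chat_engine(message_text, updated_emotion)   # hypothetical call
```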
  • The chat engine can be trained to chat about the target content, that is, to discuss topics related to the target content with another chat participant.
  • the chat engine may be a search-based chat engine constructed based on, for example, chat content among people in a forum related to the target content.
  • the construction of the chat engine may include processing in various aspects.
  • chat corpus involving chat content among people may be crawled from forums related to target content.
  • A word embedding model can be trained for use in finding possible alternative names for each named entity. For example, word embedding technology can be used to find words related to each named entity, and then, optionally, correct words can be retained from the related words as possible alternative names of the named entity through, for example, manual checking.
  • Keywords can be extracted from the chat corpus. For example, statistics can be computed based on the word segmentation results of the related corpus and then compared with the statistics of non-related corpora, so as to find words with a large difference in term frequency-inverse document frequency (TF-IDF) as keywords.
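  • The keyword-selection step above might be sketched as follows: compare term statistics in the corpus related to the target content against a general corpus and keep the terms with the largest relative difference; the tokenization and scoring details are assumptions.

```python
# Illustrative keyword extraction: terms whose frequency in the related chat
# corpus is disproportionately higher than in a general corpus (a simple
# TF-IDF-style contrast; thresholds and tokenization are assumptions).
from collections import Counter
from typing import List

def extract_keywords(related_docs: List[List[str]],
                     general_docs: List[List[str]],
                     top_k: int = 20) -> List[str]:
    related = Counter(t for doc in related_docs for t in doc)
    general = Counter(t for doc in general_docs for t in doc)
    n_rel = sum(related.values()) or 1
    n_gen = sum(general.values()) or 1
    # score = relative frequency in the related corpus / relative frequency elsewhere
    scores = {t: (c / n_rel) / ((general.get(t, 0) + 1) / n_gen)
              for t, c in related.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

# extract_keywords(tokenized_forum_posts, tokenized_general_text)  # hypothetical inputs
```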
  • a deep retrieval model based on, for example, a deep convolutional neural network, which is the core network of a chat engine, can be trained. The deep retrieval model can be trained by using the message-reply pairs in the chat corpus as training data. The text in the message-reply pair may include original sentences or extracted keywords in the message and the reply.
  • an intent detection model can be trained to detect which target content the received message text is specifically related to, so that a forum related to the target content can be selected from multiple forums.
  • the intent detection model may be a binary classification classifier, specifically, it may be, for example, a convolutional neural network text classification model.
  • the positive samples used for the intent detection model may come from chat corpus in forums related to the target content, and the negative samples may come from chat corpora in other forums or ordinary text.
  • In this way, a retrieval-based chat engine can be built that responds to an input message text by providing a chat response text based on the corpus in the forum.
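  • Putting the pieces above together, a retrieval-based chat engine could be sketched as follows: detect whether an incoming message concerns the target content, then return the reply of the most similar stored message-reply pair; the intent detector, similarity function, and data layout shown stand in for the trained models and are assumptions.

```python
# Sketch of a retrieval-based chat engine (data layout, intent detector and
# similarity function are assumptions standing in for the trained models).
from typing import Callable, List, Tuple

def retrieve_reply(message: str,
                   message_reply_pairs: List[Tuple[str, str]],
                   is_about_target: Callable[[str], bool],
                   similarity: Callable[[str, str], float]) -> str:
    """Return the chat response text for an input message."""
    if not is_about_target(message):          # stands in for the intent detection model
        return ""                             # fall back to a general-purpose chat engine
    # stands in for the deep retrieval model: pick the closest stored message
    best_msg, best_reply = max(message_reply_pairs,
                               key=lambda pair: similarity(message, pair[0]))
    return best_reply

# Toy usage with a trivial word-overlap similarity:
pairs = [("who wrote this song", "it was written by the band's guitarist"),
         ("the lights on stage are amazing", "yes, the lighting design is great")]
overlap = lambda a, b: len(set(a.split()) & set(b.split()))
print(retrieve_reply("who wrote this song?", pairs, lambda m: True, overlap))
```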
  • It should be understood that the above generation, at 230, of reference information items 232 including, for example, emotion tags, animation tags, comment text, chat response text, etc. is exemplary.
  • The process of generating reference information items may also consider other factors, such as a scene-specific emotion, the preset personality of the intelligent conversation subject, the preset role of the intelligent conversation subject, and the like.
  • a scene-specific emotion may refer to a preset emotion preference associated with a specific scene.
  • the intelligent conversational subject may be required to respond positively and optimistically as much as possible. Therefore, scene-specific emotions that can lead to positive and optimistic responses, such as happiness and excitement, may be preset for these scenarios.
  • a scene-specific emotion may include an emotion type, or an emotion type and its emotion level. Scene-specific emotions can be used to influence the generation of reference information items.
  • the scene-specific emotion and the information element 222 can be used as input, so as to jointly generate emotion tags.
  • the scene-specific emotion may be used as an emotion representation, and the emotion representation may be used together with a plurality of emotion representations respectively corresponding to a plurality of information elements to generate an emotion label.
  • scene-specific emotions can be considered in a similar manner to emotion tags, for example, scene-specific emotions can be used together with information elements to define mapping rules.
  • When generating comment text, the ranking of multiple sentences in the bullet chat text can consider not only the degree of matching between these sentences and the image and/or audio of the target content, but also how well the emotion information detected in these sentences matches the scene-specific emotion.
  • scene-specific emotions can be considered in a similar manner to emotion tags.
  • a chat engine can use input message text together with scene-specific sentiment and possibly sentiment tags to generate chat response text.
  • the preset personality of the intelligent conversation subject may refer to the personality characteristics pre-set for the intelligent conversation subject, for example, lively, cute, gentle, excited and so on.
  • the response made by the intelligent conversation subject can be made to conform to the preset personality as much as possible.
  • This preset personality can be used to influence the generation of reference information items.
  • preset personalities can be mapped to corresponding emotional tendencies, and the emotional tendencies can be used as input together with the information element 222, so as to jointly generate emotional tags.
  • the emotional tendency may be used as an emotional representation, and the emotional representation may be used together with multiple emotional representations respectively corresponding to multiple information elements to generate an emotional label.
  • preset personalities and information elements can be used to define mapping rules. For example, a lively and active preset personality will be more helpful in determining an animation label with more body movements, a cute preset personality will be more helpful in determining an animation label with cute facial expressions, and so on.
  • When generating comment text, the ranking of multiple sentences in the bullet chat text can consider not only the degree of matching between these sentences and the image and/or audio of the target content, but also the degree of matching between the emotion information detected in these sentences and the emotional tendency corresponding to the preset personality.
  • the emotional tendency corresponding to the preset personality can be considered in a manner similar to the emotional label.
  • a chat engine may use the input message text together with the sentiment orientation and possibly sentiment tags to generate a chat response text.
  • the preset role of the intelligent conversation subject may refer to the role to be played by the intelligent conversation subject.
  • the preset roles can be classified according to various standards, for example, roles such as little girls and middle-aged men according to age and gender, roles such as teachers, doctors, and policemen according to occupations, and so on.
  • the response made by the subject of the intelligent conversation can conform to the preset role as much as possible.
  • This preset role can be used to influence the generation of reference information items.
  • In the above-mentioned process of generating emotion tags, preset roles can be mapped to corresponding emotional tendencies, and the emotional tendencies can be used as input together with the information elements 222, so as to jointly generate emotion tags.
  • the emotional tendency may be used as an emotional representation, and the emotional representation may be used together with multiple emotional representations respectively corresponding to multiple information elements to generate an emotional label.
  • the preset roles and information elements can be used to define mapping rules. For example, the preset character of a little girl will be more helpful in determining animation tags with cute facial expressions, more body movements, and the like.
  • the ordering of multiple sentences in the bullet chat text can consider not only the matching degree between these sentences and the image and/or audio of the target content, but also the degree of matching between the emotion information detected in these sentences and the emotional tendency corresponding to the preset role.
  • the emotional tendency corresponding to the preset character may be considered in a manner similar to the emotional label.
  • a chat engine may use the input message text together with the emotional tendency, and possibly emotion tags, to generate chat response text.
  • the training corpus of the chat engine may also include more corpus corresponding to the preset roles, so that the chat response text output by the chat engine is more in line with the language characteristics of the preset roles.
  • the multimodal output data 242 is data to be provided or presented to the user, which may include various types of output data, for example, voice, text of the intelligent conversation subject, animation sequence of the avatar of the intelligent conversation subject, and the like.
  • Speech in the multimodal output data may be generated for comment text, chat response text, etc. in the reference information item.
  • comment text, chat response text, etc. may be converted into corresponding speech by any text-to-speech (TTS) conversion technology.
  • the TTS conversion process may be conditioned on emotion tags such that the generated speech has the emotion indicated by the emotion tags.
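A purely illustrative sketch of how the emotion-conditioned TTS step described above could be wrapped; `synthesize_speech`, `EmotionTag` and their parameters are hypothetical placeholders rather than any specific TTS engine's API.

```python
from dataclasses import dataclass

@dataclass
class EmotionTag:
    emotion_type: str   # e.g. "happy", "sad"
    emotion_level: str  # e.g. "low", "medium", "high"

def synthesize_speech(text: str, emotion: EmotionTag, speech_rate: float = 1.0) -> bytes:
    """Hypothetical wrapper around an emotion-conditioned TTS engine.

    A real implementation would pass the emotion type/level (and possibly a
    speech-rate setting) to the underlying engine as synthesis conditions, so
    that the generated audio carries the emotion indicated by the emotion tag."""
    print(f"TTS: '{text}' emotion={emotion.emotion_type}/{emotion.emotion_level} rate={speech_rate}")
    return b""  # placeholder audio bytes

comment_speech = synthesize_speech("What a lovely melody!", EmotionTag("happy", "high"), speech_rate=1.1)
```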
  • the text in the multimodal output data may be visual text corresponding to comment text, chat response text, etc. in the reference information item. Therefore, the text can be used to visually present the content of comments and chat responses narrated by the subject of the intelligent conversation.
  • the text may be generated with a predetermined font or presentation effect.
  • the animation sequence in the multimodal output data may be generated using at least animation tags and/or emotion tags in the reference information items.
  • An animation library of avatars of intelligent conversational subjects can be pre-built.
  • the animation library may include a large number of pre-created animation templates based on the avatar of the intelligent conversation subject.
  • Each animation template may include, for example, multiple GIF images.
  • the animation templates in the animation library can be indexed by animation tags and/or emotion tags; for example, each animation template can be marked with at least one of a corresponding facial expression type, body movement type, emotion type, emotion level, etc. Therefore, when the reference information item 232 generated at 230 includes animation tags and/or emotion tags, the animation tags and/or emotion tags can be used to select a corresponding animation template from the animation library.
  • time adaptation can be performed on the animation template to form an animation sequence of the avatar of the intelligent conversation subject.
  • Time adaptation aims to adjust the animation template to match the time sequence of the speech corresponding to the comment text and/or chat response text.
  • the duration of facial expressions, body movements, etc. in the animation template can be adjusted to match the duration of the intelligent animated character's voice.
  • the image involving opening and closing the mouth in the animation template may be repeated continuously, so as to present a visual effect that the avatar is speaking.
  • time adaptation is not limited to making the animation template match the time sequence of the speech corresponding to the comment text and/or chat response text; it may also include making the animation template match the time sequence of the extracted one or more information elements 222.
  • for example, if information elements such as the object "guitar" have been identified from the target content and these information elements have been mapped to the animation tag "playing the guitar", then, during the time period in which the singer is playing the guitar, the selected animation template corresponding to "playing the guitar" may be repeated continuously, so as to present a visual effect of the avatar playing the guitar together with the singer in the target content.
  • the intelligent conversation subject may have different avatars, so different animation libraries may be pre-established for different avatars.
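The animation-template selection and time adaptation described in the points above might look roughly like the following sketch; the `AnimationTemplate` structure, tag format and frame-looping strategy are assumptions made for illustration, not the disclosure's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class AnimationTemplate:
    name: str
    frames: List[str]                            # e.g. file names of the template's GIF frames
    frame_duration: float = 0.1                  # seconds that each frame is shown
    tags: Set[str] = field(default_factory=set)  # index tags, e.g. {"talk", "smile", "happy:high"}

def select_template(library, animation_tag: str, emotion_tag: Optional[str] = None):
    """Pick the first template whose index tags cover the requested animation tag
    (and, if given, the emotion tag); fall back to the first template otherwise."""
    wanted = {animation_tag} | ({emotion_tag} if emotion_tag else set())
    for template in library:
        if wanted <= template.tags:
            return template
    return library[0]

def time_adapt(template: AnimationTemplate, speech_duration: float) -> List[str]:
    """Repeat the template's frames so the sequence matches the duration of the
    speech being played, e.g. keeping the mouth opening and closing while talking."""
    frames_needed = max(1, round(speech_duration / template.frame_duration))
    return [template.frames[i % len(template.frames)] for i in range(frames_needed)]

library = [
    AnimationTemplate("talk_smile", ["mouth_open.png", "mouth_closed.png"], tags={"talk", "smile", "happy:high"}),
    AnimationTemplate("idle", ["idle.png"], tags={"idle"}),
]
sequence = time_adapt(select_template(library, "talk", "happy:high"), speech_duration=2.4)
print(len(sequence), "frames")
```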
  • the process of generating multimodal output data 242 at 240 discussed above is exemplary; in other implementations, the process of generating multimodal output data may also consider other factors, for example, scene-specific requirements, i.e., the multimodal output data may be further generated based on scene-specific requirements.
  • multimodal output data suitable for a specific scene can be adaptively output based on the output capabilities supported by different scenes.
  • Scenario-specific requirements may refer to specific requirements of different application scenarios of the intelligent conversation subject.
  • the scene-specific requirements may include, for example, types of supported multi-modal output data, preset speech rate settings, chat mode settings, etc. associated with a specific scene.
  • different scenes may have different data output capabilities. Therefore, the types of multimodal output data supported by different scenes may include outputting only one of voice, animation sequence and text, or outputting at least two of voice, animation sequence and text.
  • intelligent animation characters and virtual anchor scenes require terminal devices to at least support the output of images and audio, so that the specific requirements of the scene can indicate the output of one or more of voice, animation sequence and text.
  • a smart speaker scenario supports audio output only, so scenario-specific requirements can dictate that only voice be output.
  • the speech rate can be preset according to the specific needs of the scene. For example, since users can watch images and hear voices in smart animated character and virtual anchor scenes, the speech rate can be set to be faster in order to express richer emotions. For example, in the scenarios of smart speakers and smart car assistants, users often only obtain or pay attention to voice output; therefore, the speech rate can be set to be slower so that users can clearly understand, through voice alone, the content that the intelligent conversation subject wants to express.
  • different scenarios may have different chat mode preferences, therefore, chat mode settings can be made according to specific requirements of the scenario.
  • the chat engine's chatter output can be reduced.
  • the chat mode setting may also be associated with collected images, collected audio, external environment data, and the like.
  • the voice output of the chat response generated by the chat engine may be reduced when the collected audio indicates that there is loud noise around the user.
  • when the external environment data indicates that the user is traveling fast, for example, driving a vehicle at high speed, the chatting output of the chat engine may be reduced.
  • multimodal output data can be generated based at least on the scene-specific requirements. For example, when the specific requirements of the scene indicate that image output is not supported or only voice output is supported, animation sequence and text generation may not be performed. For example, when a scene-specific requirement indicates a faster speech rate, the speech rate of the generated speech may be accelerated during the TTS conversion process. For example, when the specific requirement of the scenario indicates that the output of the chat response is reduced under a specific condition, the generation of voice or text corresponding to the text of the chat response may be restricted.
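A minimal sketch of how such scene-specific requirements could be represented and applied when generating multimodal output data; the profile names, values and key-naming convention are assumptions of this sketch rather than details from the disclosure.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SceneRequirements:
    supported_outputs: Tuple[str, ...]   # which of "voice", "animation", "text" the scene can present
    speech_rate: float = 1.0             # preset speech-rate multiplier for TTS
    chat_output_enabled: bool = True     # chat-mode setting, e.g. suppress chatter while driving fast

# Illustrative scene profiles; the concrete values are assumptions for this sketch.
SCENE_PROFILES = {
    "smart_animated_character": SceneRequirements(("voice", "animation", "text"), speech_rate=1.2),
    "virtual_anchor":           SceneRequirements(("voice", "animation", "text"), speech_rate=1.2),
    "smart_speaker":            SceneRequirements(("voice",), speech_rate=0.9),
    "smart_car_assistant":      SceneRequirements(("voice",), speech_rate=0.9, chat_output_enabled=False),
}

def adapt_to_scene(outputs: dict, scene: str) -> dict:
    """Keep only output items whose modality is supported by the scene, and drop
    chat-related items when the chat-mode setting asks for reduced chatter."""
    req = SCENE_PROFILES[scene]
    kept = {k: v for k, v in outputs.items() if k.split("_")[-1] in req.supported_outputs}
    if not req.chat_output_enabled:
        kept = {k: v for k, v in kept.items() if not k.startswith("chat_")}
    return kept

candidate = {"comment_voice": b"...", "chat_voice": b"...", "comment_text": "Nice!", "animation": ["f1.png"]}
print(sorted(adapt_to_scene(candidate, "smart_speaker").keys()))  # ['chat_voice', 'comment_voice']
```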
  • multimodal output data can be provided. For example, an animation sequence, text, etc. are displayed through a display screen, and voices are played through a speaker, etc.
  • process 200 may be performed continuously so as to continuously obtain multimodal input data and continuously provide multimodal output data.
  • FIG. 3 shows an example of a smart animated character scene according to an embodiment.
  • the user 310 can watch a video on the terminal device 320 , and at the same time, the smart conversational entity according to the embodiment of the present disclosure can serve as a smart animation character to accompany the user 310 to watch the video together.
  • the terminal device 320 may include, for example, a display screen 330, a camera 322, a speaker (not shown), a microphone (not shown), and the like.
  • a video 332 may be presented as target content in the display screen 330 .
  • the avatar 334 of the intelligent conversation subject can also be presented on the display screen 330 .
  • the intelligent conversation subject can perform multi-modality-based reactive response generation according to an embodiment of the present disclosure, and accordingly, can provide the generated multi-modal-based reactive response on the terminal device 320 via the avatar 334 .
  • the avatar 334 can make facial expressions, body movements, and make voices, etc.
  • FIG. 4 illustrates an exemplary process 400 for a smart animated character scene, according to an embodiment.
  • Process 400 illustrates the processing flow, data/information flow, etc. involved in, for example, the smart animated character scene of FIG. 3 .
  • process 400 may be considered as a specific example of process 200 in FIG. 2 .
  • multimodal input data may be obtained first, including at least one of, for example, video, external environment data, collected images, collected audio, chat sessions, and the like.
  • the video, as the target content, may further include, for example, images, audio, bullet chat files, and the like. It should be understood that the obtained multimodal input data may be aligned in time and accordingly have the same context.
  • Information elements can be extracted from the multimodal input data. For example, extract character features, text, image light, objects, etc. from video images, extract music, voice, etc. from video audio, extract bullet chat text from the video's bullet chat file, extract external environment information from external environment data, extract object features from captured images, extract music, speech, etc. from captured audio, extract message text from chat sessions, and more.
  • the reference information item may be generated based at least on the extracted information elements, which includes, for example, at least one of emotion tags, animation tags, comment text, and chat response text.
  • Comment text may be generated by a comment generation model 430.
  • Chat response text may be generated by chat engine 450 and optionally emotion transfer network 452 .
  • the generated reference information items may be utilized at least to generate multimodal output data, which includes, for example, at least one of an animation sequence, comment speech, comment text, chat response speech, chat response text, and the like.
  • the animation sequence may be generated based on the description above in connection with FIG. 2 .
  • animation selection 410 may be performed in the animation library, using animation tags, emotion tags, etc., to select an animation template, and then animation sequence generation 420 may be executed based on the selected animation template, i.e., time adaptation may be performed on the selected animation template to obtain the animation sequence.
  • the comment speech may be obtained by performing speech generation 440 (e.g., TTS conversion) on the comment text.
  • the visual comment text to be displayed may be obtained based on the comment text.
  • the chat response speech may be obtained by performing speech generation 460 (e.g., TTS conversion) on the chat response text.
  • the visual chat response text to be displayed may be obtained based on the chat response text.
  • the resulting multimodal output data can be provided on an end device. For example, an animation sequence, comment text, chat response text, etc. are presented on the display screen, and the comment voice, chat response voice, etc. are played through a speaker.
  • it should be understood that all the processing and data/information in process 400 are exemplary, and in actual applications process 400 may involve only one or more of these processings and data/information items.
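To make the data flow of process 400 easier to follow, the sketch below wires placeholder stand-ins for the numbered components together in the order described above; the stub bodies are illustrative only and do not implement the actual components.

```python
# Minimal stand-ins for the FIG. 4 components so that the wiring below runs;
# each stub returns trivial data and is not an implementation of the real component.
def map_to_tags(elements):                  return "talk", {"emotion_type": "happy", "emotion_level": "high"}
def comment_generation_model(elements):     return "That guitar solo is great!"          # 430
def chat_engine(message, emotion_tag):      return "I think so too!" if message else ""  # 450 (+452)
def speech_generation(text, emotion_tag):   return b"\x00" * 160 if text else b""        # 440 / 460 (TTS)
def animation_selection(lib, a_tag, e_tag): return lib[0]                                # 410
def animation_sequence_generation(tpl, n):  return [tpl] * n                             # 420

def run_process_400(info_elements, animation_library):
    """Illustrative wiring of the process-400 data flow (component numbers refer to FIG. 4)."""
    animation_tag, emotion_tag = map_to_tags(info_elements)
    comment_text = comment_generation_model(info_elements)
    chat_text = chat_engine(info_elements.get("message_text", ""), emotion_tag)

    template = animation_selection(animation_library, animation_tag, emotion_tag)
    comment_voice = speech_generation(comment_text, emotion_tag)
    chat_voice = speech_generation(chat_text, emotion_tag)
    animation = animation_sequence_generation(template, n=24)

    return {"animation": animation, "comment_voice": comment_voice, "comment_text": comment_text,
            "chat_voice": chat_voice, "chat_text": chat_text}

outputs = run_process_400({"message_text": "Do you like this song?"}, animation_library=["talk_smile"])
print(sorted(outputs))
```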
  • the multi-modality-based reactive response generation can be applied to perform a variety of tasks.
  • the following is only an exemplary intelligent animation generation task among these tasks. It should be understood that embodiments of the present disclosure are not limited to being used for performing intelligent animation generation tasks, but may also be used for performing various other tasks.
  • FIG. 5 illustrates an exemplary process 500 of intelligent animation generation according to an embodiment.
  • Process 500 can be regarded as a specific implementation of process 200 in FIG. 2 .
  • the intelligent animation generation of process 500 is a specific application of the multi-modality-based reactive response generation of process 200 .
  • the intelligent animation generation of the process 500 may involve at least one of the generation of an animation sequence of the avatar, the generation of comment speech of the avatar, the generation of comment text, etc., performed in response to the target content.
  • the step of obtaining multimodal input data at 210 in FIG. 2 may be embodied as obtaining at 510 at least one of image, audio, and barrage files of the target content.
  • the information element extraction step at 220 in FIG. 2 can be embodied as at 520 extracting at least one information element from the image, audio, and barrage files of the target content. For example, extract character features, text, image light, objects, etc. from the image of the target content, extract music, voice, etc. from the audio of the target content, extract bullet chat text from the bullet chat file of the target content, and so on.
  • the step of generating reference information items at 230 in FIG. 2 may be embodied as generating at 530 at least one of animation tags, emotion tags and comment texts.
  • animated tags, sentiment tags, review text, etc. may be generated based at least on the at least one information element extracted at 520 .
  • the step of generating multimodal output data at 240 in FIG. 2 may be embodied as generating at least one of an animation sequence of the avatar, comment voice and comment text.
  • the animation sequence may be generated by at least using animation tags and/or emotion tags in the manner described above in conjunction with FIG. 2 .
  • comment speech and comment text may also be generated in the manner described above in conjunction with FIG. 2 .
  • the step of providing multimodal output data at 250 in FIG. 2 may be embodied as providing at 550 at least one of the generated animation sequence, comment voice, and comment text.
  • process 500 may be performed in a manner similar to that described above for the corresponding step in FIG. 2 .
  • process 500 may also include any other processing described above for process 200 of FIG. 2 .
  • FIG. 6 shows a flowchart of an exemplary method 600 for multimodality-based reactive response generation, according to an embodiment.
  • multimodal input data can be obtained.
  • At 620, at least one informational element can be extracted from the multimodal input data.
  • At 630, at least one reference information item may be generated based at least on the at least one information element.
  • multimodal output data may be generated using at least the at least one reference information item.
  • the multimodal output data can be provided.
  • the multimodal input data may include at least one of the following: images of target content, audio of target content, barrage files of target content, chat sessions, collected images, collected audio, and external environment data.
  • Extracting at least one information element from the multimodal input data may include at least one of the following: extracting character features from an image of the target content; recognizing text from an image of the target content; detecting image light from an image of the target content; recognizing objects from an image of the target content; extracting music from audio of the target content; extracting speech from audio of the target content; extracting bullet chat text from a bullet chat file of the target content; extracting message text from a chat session; extracting object features from collected images; extracting speech and/or music from collected audio; and extracting external environment information from external environment data.
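A minimal sketch of how such per-modality extraction could be dispatched; the extractor names and the keys they return are hypothetical placeholders standing in for the real extraction models.

```python
# Placeholder extractors: each stands in for the real feature/text/audio extractor
# for one input modality; the keys mirror the kinds of information elements listed above.
def extract_from_content_image(image):   return {"character_features": [], "text": "", "image_light": None, "objects": []}
def extract_from_content_audio(audio):   return {"music": None, "speech": None}
def extract_from_bullet_chat(file):      return {"bullet_chat_text": []}
def extract_from_chat_session(session):  return {"message_text": session.get("last_message", "")}
def extract_from_captured_image(image):  return {"object_features": []}
def extract_from_captured_audio(audio):  return {"captured_speech": None, "captured_music": None}
def extract_from_environment(data):      return {"environment_info": data}

EXTRACTORS = {
    "content_image": extract_from_content_image,
    "content_audio": extract_from_content_audio,
    "bullet_chat_file": extract_from_bullet_chat,
    "chat_session": extract_from_chat_session,
    "captured_image": extract_from_captured_image,
    "captured_audio": extract_from_captured_audio,
    "environment_data": extract_from_environment,
}

def extract_information_elements(multimodal_input: dict) -> dict:
    """Run the matching extractor for every modality present in the input and merge
    the resulting information elements into a single, time-aligned context."""
    elements = {}
    for modality, payload in multimodal_input.items():
        extractor = EXTRACTORS.get(modality)
        if extractor is not None:
            elements.update(extractor(payload))
    return elements

print(extract_information_elements({"chat_session": {"last_message": "hello"},
                                    "environment_data": {"weather": "rainy"}}))
```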
  • generating at least one reference information item based at least on the at least one information element may include: generating at least one of emotion tags, animation tags, comment text and chat response text based at least on the at least one information element.
  • Generating the emotion tag based at least on the at least one information element may include: generating one or more emotion representations respectively corresponding to one or more information elements in the at least one information element; and generating the emotion tag based at least on the one or more emotion representations.
  • the emotion tag may indicate an emotion type and/or an emotion level.
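A minimal sketch, assuming probability-like emotion representations, of pooling one or more emotion representations into an emotion tag with a type and a level; the pooling rule, emotion types and thresholds are assumptions made for illustration.

```python
import numpy as np

EMOTION_TYPES = ("happy", "calm", "excited", "sad")

def generate_emotion_tag(emotion_representations):
    """Pool the per-element emotion representations (here by simple averaging of
    probability-like vectors over EMOTION_TYPES) and derive an emotion type and level."""
    pooled = np.mean(np.vstack(emotion_representations), axis=0)
    idx = int(np.argmax(pooled))
    score = float(pooled[idx])
    level = "high" if score >= 0.6 else "medium" if score >= 0.3 else "low"
    return {"emotion_type": EMOTION_TYPES[idx], "emotion_level": level}

# Illustrative representations for two information elements (e.g. music and speech),
# optionally extended with a personality/role tendency as sketched earlier.
music_repr = np.array([0.7, 0.1, 0.2, 0.0])
speech_repr = np.array([0.3, 0.1, 0.5, 0.1])
print(generate_emotion_tag([music_repr, speech_repr]))  # {'emotion_type': 'happy', 'emotion_level': 'medium'}
```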
  • Generating the animation label based at least on the at least one information element may include: mapping the at least one information element to the animation label according to a predetermined rule.
  • the animation tag may indicate the type of facial expression and/or the type of body movement.
  • the animation tag may be further generated based on the emotion tag.
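A minimal sketch of mapping information elements (and optionally the emotion tag) to an animation tag according to predetermined rules; the example rules and tag values are assumptions, not rules taken from the disclosure.

```python
from typing import Optional

# Illustrative predetermined rules: (predicate over information elements, partial animation tag).
RULES = [
    (lambda e: e.get("music") is not None,              {"body_movement": "sway_to_music"}),
    (lambda e: "guitar" in e.get("objects", []),        {"body_movement": "play_guitar"}),
    (lambda e: "birthday" in e.get("text", "").lower(), {"facial_expression": "big_smile"}),
]

def generate_animation_tag(information_elements: dict, emotion_tag: Optional[dict] = None) -> dict:
    """Map the extracted information elements (and optionally the emotion tag) to an
    animation tag indicating a facial expression type and/or a body movement type."""
    tag = {"facial_expression": "neutral", "body_movement": "idle"}
    for predicate, partial in RULES:
        if predicate(information_elements):
            tag.update(partial)
    if emotion_tag and emotion_tag.get("emotion_type") == "happy" and tag["facial_expression"] == "neutral":
        tag["facial_expression"] = "smile"
    return tag

elements = {"objects": ["guitar", "microphone"], "music": "pop_song", "text": ""}
print(generate_animation_tag(elements, {"emotion_type": "happy", "emotion_level": "high"}))
```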
  • Generating comment text based at least on the at least one information element may include: selecting the comment text from bullet chat text of the target content.
  • the selection of the comment text may include: using a two-tower model to determine the matching degree between sentences in the bullet chat text of the target content and the image and/or audio of the target content; and selecting the sentence with the highest matching degree in the bullet chat text as the comment text.
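The disclosure does not detail the two-tower model, so the sketch below only illustrates the general idea: one tower encodes a bullet-chat sentence, the other encodes the image/audio context, and the sentence with the highest similarity is selected; the random projections stand in for trained encoders and are not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random projections stand in for the two trained encoder towers; a real two-tower
# model would use neural encoders that map text and image/audio into a shared space.
W_text = rng.normal(size=(64, 128))
W_context = rng.normal(size=(32, 128))

def encode_sentence(sentence_features: np.ndarray) -> np.ndarray:
    return sentence_features @ W_text   # text tower

def encode_context(av_features: np.ndarray) -> np.ndarray:
    return av_features @ W_context      # image/audio tower

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_comment(sentences, sentence_features, av_features):
    """Score each bullet-chat sentence against the image/audio context of the target
    content and return the sentence with the highest matching degree as the comment."""
    context_vec = encode_context(av_features)
    scores = [cosine(encode_sentence(f), context_vec) for f in sentence_features]
    return sentences[int(np.argmax(scores))]

sentences = ["Amazing solo!", "I'm hungry", "Love this chorus"]
sentence_features = [rng.normal(size=64) for _ in sentences]
av_features = rng.normal(size=32)
print(select_comment(sentences, sentence_features, av_features))
```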
  • Generating the chat response text based at least on the at least one information element may include: generating the chat response text based at least on message text in the chat session by a chat engine.
  • the chat response text may be further generated based on the emotion tag.
  • the chat response text may be further generated based on an emotion representation from an emotion transfer network.
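A sketch of the interface such an emotion-conditioned chat engine might expose; `ChatRequest`, `ToyChatEngine` and the styling rule are hypothetical stand-ins and not the disclosure's chat engine or emotion transfer network.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class ChatRequest:
    message_text: str
    emotion_tag: Optional[dict] = None              # e.g. {"emotion_type": "happy", "emotion_level": "high"}
    emotion_embedding: Optional[np.ndarray] = None  # e.g. a vector produced by an emotion transfer network

class ToyChatEngine:
    """Stand-in for the chat engine: a real system would be a retrieval- or
    generation-based model that consumes the emotion condition as extra input."""

    def generate(self, request: ChatRequest) -> str:
        suffix = ""
        if request.emotion_tag and request.emotion_tag.get("emotion_type") == "happy":
            suffix = " :)"  # the emotion condition only nudges the style in this toy example
        return f"You said: '{request.message_text}'. Tell me more!{suffix}"

engine = ToyChatEngine()
print(engine.generate(ChatRequest("This band is great", emotion_tag={"emotion_type": "happy", "emotion_level": "high"})))
```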
  • the at least one reference information item may be further generated based on at least one of the following: scene-specific emotion; preset personality of the intelligent conversation subject; and preset role of the intelligent conversation subject.
  • the multimodal output data may include at least one of the following: an animation sequence of an avatar of the intelligent conversation subject; voice of the intelligent conversation subject; and text.
  • Generating multimodal output data by using at least the at least one reference information item may include: generating voice and/or text corresponding to the comment text and/or the chat response text.
  • At least using the at least one reference information item to generate multimodal output data may include: using the animation tag and/or the emotion tag to select a corresponding animation template from the animation library of the avatar of the intelligent conversation subject; and performing time adaptation on the animation template to form an animation sequence of the avatar of the intelligent conversation subject.
  • the time adaptation may include: adjusting the animation template to match the time sequence of the speech corresponding to the comment text and/or the chat response text.
  • the multimodal output data may be further generated based on specific requirements of the scene.
  • the scene-specific requirement may include at least one of the following: outputting only one of voice, animation sequence and text; outputting at least two of voice, animation sequence and text; predetermined speech rate setting; and chat mode setting.
  • the multimodal based reactive response generation can include intelligent animation generation.
  • Obtaining the multimodal input data may include: obtaining at least one of image, audio and bullet chat files of the target content.
  • Extracting at least one information element from the multimodal input data may include: extracting at least one information element from image, audio and bullet chat files of the target content.
  • Generating at least one reference information item based at least on the at least one information element may include: generating at least one of animation tags, emotion tags and comment text based on at least the at least one information element.
  • At least using the at least one reference information item to generate multimodal output data may include: using at least one of the animation tag, the emotion tag and the comment text to generate at least one of an animation sequence of the avatar, comment voice and comment text.
  • Providing the multimodal output data may include: providing at least one of the animation sequence, the comment voice and the comment text.
  • the method 600 may also include any steps/processes for multi-modality-based reactive response generation according to the embodiments of the present disclosure described above.
  • FIG. 7 illustrates an exemplary apparatus 700 for multimodality-based reactive response generation, according to an embodiment.
  • the apparatus 700 may include: a multimodal input data obtaining module 710, for obtaining multimodal input data; a data integration processing module 720, for extracting at least one information element from the multimodal input data; a scene logic processing module 730, for generating at least one reference information item based at least on the at least one information element; a multimodal output data generation module 740, for generating multimodal output data by at least utilizing the at least one reference information item; and a multimodal output data providing module 750, for providing the multimodal output data.
  • apparatus 700 may also include any other modules that execute the steps of the method for multimodal-based reactive response generation according to the above-mentioned embodiments of the present disclosure.
  • FIG. 8 illustrates an exemplary apparatus 800 for multimodality-based reactive response generation, according to an embodiment.
  • Apparatus 800 may include: at least one processor 810; and memory 820 storing computer-executable instructions.
  • the at least one processor 810 may execute any steps/processes of the method for multimodal-based reactive response generation according to the above-mentioned embodiments of the present disclosure.
  • Embodiments of the present disclosure propose a multimodal-based reactive response generation system, including: a multimodal data input interface for obtaining multimodal input data; a core processing unit configured to extract at least one information element from the multimodal input data, generate at least one reference information item based at least on the at least one information element, and generate multimodal output data by at least utilizing the at least one reference information item; and a multimodal data output interface for providing the multimodal output data.
  • the multimodal data input interface, the core processing unit, and the multimodal data output interface may also execute any relevant steps/processes of the method for multimodal-based reactive response generation according to the above-mentioned embodiments of the present disclosure.
  • the multimodality-based reactive response generation system may further include any other units and modules for multimodality-based reactive response generation according to the above-mentioned embodiments of the present disclosure.
  • Embodiments of the present disclosure propose a computer program product for multimodal-based reactive response generation, comprising a computer program that is run by at least one processor to execute any step/process of the method for multimodal-based reactive response generation according to the above-mentioned embodiments of the present disclosure.
  • Embodiments of the present disclosure can be embodied on a non-transitory computer readable medium.
  • the non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any steps/processes of the method for multimodal-based reactive response generation according to the embodiments of the present disclosure described above.
  • modules in the apparatus described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. Furthermore, any of these modules may be functionally further divided into sub-modules or grouped together.
  • processors have been described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. As examples, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, microcontroller, digital signal processor (DSP), field programmable gate array (FPGA), programmable logic devices (PLDs), state machines, gate logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors given in this disclosure may be implemented as software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
  • the software may reside on a computer readable medium.
  • the computer readable medium may include, for example, memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic stripe), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), Programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), register or removable disk.

Abstract

The present disclosure provides a method, system and apparatus for multimodal based reactive response generation. Multimodal input data can be obtained. At least one information element can be extracted from the multimodal input data. At least one reference information item can be generated based at least on the at least one information element. Multimodal output data can be generated by at least using the at least one reference information item. The multimodal output data can be provided.

Description

基于多模态的反应式响应生成Reactive Response Generation Based on Multimodality 背景技术Background technique
近年来,智能人机交互系统被广泛地应用于越来越多的场景和领域,其能够有效地提升人机交互效率、优化人机交互体验。随着人工智能(AI)技术的发展,人机交互系统也在例如智能会话系统等方面取得了更为深入的发展。例如,智能会话系统已经涵盖了任务对话、知识问答、开放域对话等应用场景,并且可以采用基于模板的技术、基于检索的技术、基于深度学习的技术等多种技术来实现。In recent years, intelligent human-computer interaction systems have been widely used in more and more scenarios and fields, which can effectively improve the efficiency of human-computer interaction and optimize the experience of human-computer interaction. With the development of artificial intelligence (AI) technology, human-computer interaction systems have also achieved more in-depth development in aspects such as intelligent conversation systems. For example, the intelligent conversation system has covered application scenarios such as task dialogue, knowledge question answering, and open domain dialogue, and can be realized by using template-based technology, retrieval-based technology, and deep learning-based technology.
发明内容Contents of the invention
提供本发明内容以便介绍一组概念,这组概念将在以下的具体实施方式中做进一步描述。本发明内容并非旨在标识所保护主题的关键特征或必要特征,也不旨在用于限制所保护主题的范围。This Summary is provided to introduce a set of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
本公开的实施例提出了用于基于多模态的反应式响应生成的方法、系统和装置。可以获得多模态输入数据。可以从所述多模态输入数据中提取至少一个信息元素。可以至少基于所述至少一个信息元素来生成至少一个参考信息项。可以至少利用所述至少一个参考信息项来产生多模态输出数据。可以提供所述多模态输出数据。Embodiments of the present disclosure propose methods, systems and apparatus for multimodality-based reactive response generation. Multimodal input data can be obtained. At least one informational element may be extracted from said multimodal input data. At least one reference information item may be generated based at least on said at least one information element. The at least one item of reference information may be used at least to generate multimodal output data. The multimodal output data may be provided.
应当注意,以上一个或多个方面包括以下详细描述以及权利要求中具体指出的特征。下面的说明书及附图详细提出了所述一个或多个方面的某些说明性特征。这些特征仅仅指示可以实施各个方面的原理的多种方式,并且本公开旨在包括所有这些方面和其等同变换。It should be noted that one or more of the above aspects include the features specified in the following detailed description as well as in the claims. Certain illustrative features of the one or more aspects are set forth in detail in the following description and accompanying drawings. These features are merely indicative of the various ways in which the principles of various aspects can be implemented and this disclosure is intended to include all such aspects and their equivalents.
附图说明Description of drawings
以下将结合附图描述所公开的多个方面,这些附图被提供用以说明而非限制所公开的多个方面。The disclosed aspects will be described below with reference to the accompanying drawings, which are provided to illustrate but not limit the disclosed aspects.
图1示出了根据实施例的基于多模态的反应式响应生成系统的示例性架构。FIG. 1 illustrates an exemplary architecture of a multimodality-based reactive response generation system according to an embodiment.
图2示出了根据实施例的用于基于多模态的反应式响应生成的示例性过程。FIG. 2 illustrates an exemplary process for multimodality-based reactive response generation, according to an embodiment.
图3示出了根据实施例的智能动画角色场景的实例。Figure 3 shows an example of a smart animated character scene according to an embodiment.
图4示出了根据实施例的智能动画角色场景的示例性过程。Fig. 4 shows an exemplary process of intelligently animating a character scene according to an embodiment.
图5示出了根据实施例的智能动画生成的示例性过程。Fig. 5 shows an exemplary process of smart animation generation according to an embodiment.
图6示出了根据实施例的用于基于多模态的反应式响应生成的示例性方法的流程图。FIG. 6 shows a flowchart of an exemplary method for multimodality-based reactive response generation, according to an embodiment.
图7示出了根据实施例的用于基于多模态的反应式响应生成的示例性装置。FIG. 7 illustrates an exemplary apparatus for multimodality-based reactive response generation, according to an embodiment.
图8示出了根据实施例的用于基于多模态的反应式响应生成的示例性装置。FIG. 8 illustrates an exemplary apparatus for multimodality-based reactive response generation, according to an embodiment.
具体实施方式Detailed ways
现在将参考多种示例性实施方式来讨论本公开。应当理解,这些实施方式的讨论仅仅用于使得本领域技术人员能够更好地理解并从而实施本公开的实施例,而并非教导对本公开的范围的任何限制。The present disclosure will now be discussed with reference to various exemplary embodiments. It should be understood that the discussion of these embodiments is only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than teaching any limitation to the scope of the present disclosure.
现有的人机交互系统通常采用单一媒介来作为信息输入和输出的渠道,例如,通过文本、语音、手势等中之一进行人与机器或机器与机器之间的交流沟通。以智能会话系统为例,尽管其可面向文本或语音,但是其仍然以文本处理或文本分析为核心。智能会话系统在交互过程中缺少对交互对象在文本之外的例如面部表情、肢体动作等信息的考虑,也缺少对环境中的声音、光线等因素的考虑,致使在交互过程中存在较为普遍的问题。一方面的问题在于对信息的理解不够全面准确。人类在实际交流过程中,并不是单一地通过语言文本来表达自己的全部交流内容,而是往往也将语气、 面部表情、肢体动作等作为表达或传递信息的重要渠道。例如,对于相同一句话,如果使用不同的语气或伴随不同的面部表情和肢体动作,则在不同的场合下其可能传达截然不同的语义。现有的以文本处理为核心的智能会话技术缺失了在交互过程中相当重要的这部分信息,由此导致对会话中的上下文信息的提取与应用变得十分困难。另一方面的问题在于对信息的表达不够生动。现有的智能会话技术在信息表达上主要是通过文本来进行的,而在支持语音识别和语音合成的情况下,也可以将输出文本转换为语音。然而,这样的信息传递渠道仍然是受限的,无法像人类一样综合地利用语言、面部表情、肢体动作等来全面准确地表达自身意图,从而导致难以展示生动活泼的拟人化表现。再一方面的问题在于现有的智能会话技术局限于对所接收到的输入会话消息做出响应,而无法自发地对各种环境因素做出反应。例如,现有的聊天机器人仅专注于对来自用户的会话消息做出响应,以便能够围绕来自用户的会话消息而进行聊天。Existing human-computer interaction systems usually use a single medium as a channel for information input and output, for example, communication between humans and machines or between machines through one of text, voice, and gestures. Taking the intelligent conversational system as an example, although it can be oriented to text or speech, it still takes text processing or text analysis as its core. During the interaction process, the intelligent conversation system lacks the consideration of information such as facial expressions and body movements of the interactive objects outside the text, and also lacks the consideration of factors such as the sound and light in the environment, resulting in common errors in the interaction process. question. On the one hand, the problem is that the understanding of information is not comprehensive and accurate enough. In the process of actual communication, human beings do not express their entire communication content through language and text alone, but often use tone of voice, facial expressions, body movements, etc. as important channels for expressing or transmitting information. For example, for the same sentence, if it uses different tones or is accompanied by different facial expressions and body movements, it may convey completely different semantics in different occasions. The existing intelligent conversation technology with text processing as the core lacks this part of the information which is very important in the interaction process, which makes it very difficult to extract and apply the context information in the conversation. Another problem is that the expression of information is not vivid enough. Existing intelligent conversation technology mainly performs information expression through text, and in the case of supporting speech recognition and speech synthesis, the output text can also be converted into speech. However, such information transmission channels are still limited, and it is impossible to comprehensively and accurately express one's own intentions by comprehensively using language, facial expressions, body movements, etc. like humans, making it difficult to show lively anthropomorphic performances. Another problem is that the existing intelligent conversation technology is limited to responding to received input conversation messages, but cannot respond to various environmental factors spontaneously. For example, existing chatbots only focus on responding to conversational messages from users so as to be able to chat around conversational messages from users.
本公开的实施例提出了基于多模态的反应式(reaction)响应生成方案,其可以被实施在多种智能会话主体上,并且可以被广泛地应用于包括人机交互在内的多种场景中。在本文中,智能会话主体可以广泛地指能够在特定的应用场景中生成并呈现信息内容、提供交互功能等的AI产品形态,例如,聊天机器人、智能动画角色、虚拟主播、智能车机助理、智能客服、智能音箱等。根据本公开的实施例,智能会话主体可以基于多模态输入数据来产生多模态输出数据,其中,多模态输出数据是以反应式方式所生成的、将被呈现给用户的响应。The embodiment of the present disclosure proposes a multimodal-based reaction response generation scheme, which can be implemented on a variety of intelligent conversation subjects, and can be widely used in various scenarios including human-computer interaction middle. In this paper, intelligent conversation subjects can broadly refer to AI product forms that can generate and present information content and provide interactive functions in specific application scenarios, such as chat robots, intelligent animated characters, virtual anchors, intelligent car assistants, Smart customer service, smart speakers, etc. According to an embodiment of the present disclosure, an intelligent conversational agent may generate multimodal output data based on multimodal input data, wherein the multimodal output data is a response generated in a reactive manner to be presented to the user.
人与人之间自然的交流方式往往是多模态的。人类在彼此交流时,往往会综合考虑来自交流对象的语音、文字、面部表情、肢体动作等多种类型的信息,同时兼顾所处环境的场景、光线、声音甚至温度、湿度等信息。通过对这些多模态的信息的综合考虑,人类能够更加全面、准确、快速地理解交流对象所要表达的内容。同样地,在表达信息时,人类也会倾向于综合使用语音、面部表情、肢体动作等多模态的表达方式来更加准确、生动、全面地表达自身意图。The natural way people communicate with each other is often multimodal. When human beings communicate with each other, they tend to comprehensively consider various types of information such as speech, text, facial expressions, body movements, etc. from the communication objects, and at the same time take into account the scene, light, sound, and even temperature, humidity and other information of the environment they are in. Through the comprehensive consideration of these multi-modal information, human beings can more comprehensively, accurately and quickly understand what the communication object wants to express. Similarly, when expressing information, human beings tend to use voice, facial expression, body movements and other multi-modal expressions to express their intentions more accurately, vividly and comprehensively.
基于来自上述的人类交流方式的启发,在人机交互的场景下,自然、本真的人机交互方案也应该是多模态的。因此,本公开的实施例提出了基于多模态的人机交互方式。在本文中,交互可以广泛地指例如对信息、数据、内容等的理解和表达,而人机交互可以广泛地指在智能会话主体与交互对象之间的交互,例如,在智能会话主体与人类用户之间的交互、在智能会话主体之间的交互、智能会话主体对各种媒体内容或信息化数据的响应、等等。与现有的基于单一媒介的交互方式相比,本公开的实施例具有多种优势。在一个方面,可以实现更加准确地信息理解。通过对包括例如媒体内容、所采集的图像或音频、聊天会话、外界环境数据等多模态输入数据的综合处理,能够更加全面地收集和分析信息,减少信息缺失造成的误解,从而更加准确地理解交互对象的深层次意图。在一个方面,表达方式更为高效。通过以多种方式多模态地迭加表达信息,例如,在语音或文字的基础上迭加虚拟形象的面部表情和/或肢体动作或者其它动画序列等,可以更高效地表达信息和情感。在一个方面,智能会话主体的交互行为将更加生动。对多模态数据的理解与表达将使得智能会话主体更加拟人化,从而显著地提升用户体验。Based on the inspiration from the above-mentioned human communication methods, in the context of human-computer interaction, natural and authentic human-computer interaction solutions should also be multimodal. Therefore, the embodiments of the present disclosure propose a multimodal human-computer interaction method. In this paper, interaction can broadly refer to the understanding and expression of information, data, content, etc., while human-computer interaction can broadly refer to the interaction between the intelligent conversation subject and the interactive object, for example, between the intelligent conversation subject and the human Interaction between users, interaction between intelligent conversation subjects, responses of intelligent conversation subjects to various media contents or informational data, and so on. Compared with the existing interaction methods based on a single medium, the embodiments of the present disclosure have various advantages. In one aspect, more accurate information understanding can be achieved. Through the comprehensive processing of multimodal input data including media content, collected images or audio, chat sessions, and external environment data, information can be collected and analyzed more comprehensively, misunderstandings caused by missing information can be reduced, and more accurate Understand the deep-level intent of interacting objects. In one respect, the expression is more efficient. By superimposing and expressing information in multiple ways and in multiple modes, for example, superimposing facial expressions and/or body movements of avatars or other animation sequences on the basis of speech or text, information and emotions can be expressed more efficiently. In one aspect, the interactive behavior of the intelligent conversation subject will be more vivid. The understanding and expression of multimodal data will make the subject of intelligent conversation more anthropomorphic, thereby significantly improving user experience.
此外,本公开的实施例可以使得智能会话主体模仿人类来对语音、文本、音乐、视频图像等多模态输入数据产生自然的反应,即,做出反应式响应。在本文中,智能会话主体的反应式响应并不局限于对来自例如用户的聊天消息所做出的反应,还可以涵盖对例如媒体内容、所采集的图像或音频、外界环境等各种输入数据所主动做出的反应。以智能会话主体充当智能动画角色来提供AI智能陪伴的场景为例,假设智能会话主体可以通过对应的虚拟形象来陪伴用户观看视频,则该智能会话主体不仅可以与用户进行直接交互,还可以对该视频中的内容自发地做出反应式响应,例如,该虚拟形象可以发出语音、做出面部表情、做出肢体动作、呈现文字等。从而,智能会话主体的行为将更加拟人化。In addition, the embodiments of the present disclosure can enable the intelligent conversation subject to imitate human beings to generate natural responses to multi-modal input data such as speech, text, music, video images, ie, make reactive responses. In this paper, the reactive response of the intelligent conversation subject is not limited to the response to the chat message from the user, for example, but also covers various input data such as media content, captured image or audio, external environment, etc. proactive response. Taking the scene where the intelligent conversation subject acts as an intelligent animation role to provide AI intelligent companionship as an example, assuming that the intelligent conversation subject can accompany the user to watch videos through the corresponding avatar, the intelligent conversation subject can not only directly interact with the user, but also interact with the user. The content in the video responds spontaneously and reactively, for example, the avatar can speak, make facial expressions, make body movements, present text, etc. Thus, the behavior of the intelligent conversation subject will be more anthropomorphic.
本公开的实施例提出了通用的基于多模态的反应式响应生成技术,通过集成和应用基于多模态的反应式响应生成系统,智能会话主体可以高效快捷地获得多模态交互能力。通过根据本公开实施例的基于多模态的反应式响应生成技术,可以整合处理来自多种媒介渠道的多模态输入数据,并且能够更加准确有效地解读多模态输入数据所表达的意图。此外,通过根据本公开实施例的基于多 模态的反应式响应生成技术,智能会话主体可以经由多种渠道来提供多模态输出数据以表达整体一致的信息,由此提升了信息表达的准确度和效率,使得智能会话主体的信息表达更加生动有趣,从而显著地改善了用户体验。Embodiments of the present disclosure propose a general multimodal-based reactive response generation technology. By integrating and applying the multimodal-based reactive response generation system, intelligent conversation subjects can efficiently and quickly obtain multimodal interaction capabilities. Through the multimodal-based reactive response generation technology according to the embodiments of the present disclosure, multimodal input data from various media channels can be integrated and processed, and the intent expressed by the multimodal input data can be interpreted more accurately and effectively. In addition, through the multimodal-based reactive response generation technology according to the embodiments of the present disclosure, the intelligent conversation subject can provide multimodal output data through multiple channels to express overall consistent information, thereby improving the accuracy of information expression. Accuracy and efficiency make the information expression of intelligent conversation subjects more vivid and interesting, thus significantly improving user experience.
根据本公开实施例的基于多模态的反应式响应生成技术可以被自适应地应用于多种场景中。基于不同场景所支持的输入和输出能力,本公开的的实施例可以在不同场景中获得对应的多模态输入数据,并且输出适合于特定场景的多模态输出数据。以为充当智能动画角色的智能会话主体自动地生成动画的场景为例,本公开的实施例可以为智能动画角色的虚拟形象生成包括例如动画序列等的反应式响应。例如,在该智能动画角色被应用于陪伴用户观看视频的情况下,智能动画角色能够综合处理来自视频内容、采集的图像或音频、聊天会话、外界环境数据等的多模态输入数据,对多模态输入数据进行深度感知和理解,并且相应地以智能且动态的方式通过例如语音、文字、包含面部表情和/或肢体动作的动画序列等多种模态来做出合理的反应,从而实现全面、高效、生动的人机交互体验。智能动画角色的感知能力和情绪表达能力得到极大增强,并且智能动画角色变得更加拟人化。这也可以成为通过AI技术进行例如智能动画内容创作的技术基础。The multi-modality-based reactive response generation technology according to the embodiments of the present disclosure can be adaptively applied to various scenarios. Based on the input and output capabilities supported by different scenarios, embodiments of the present disclosure can obtain corresponding multimodal input data in different scenarios, and output multimodal output data suitable for specific scenarios. Taking as an example a scene in which an intelligent conversational subject acting as an intelligent animation character automatically generates animations, embodiments of the present disclosure may generate reactive responses including, for example, animation sequences, for the avatar of the intelligent animation character. For example, when the smart animated character is used to accompany the user to watch a video, the smart animated character can comprehensively process multi-modal input data from video content, collected images or audio, chat sessions, external environment data, etc. Modal input data for depth perception and understanding, and respond accordingly in an intelligent and dynamic manner through multiple modalities such as speech, text, animation sequences including facial expressions and/or body movements, to achieve Comprehensive, efficient and vivid human-computer interaction experience. The perception ability and emotional expression ability of intelligent animation characters are greatly enhanced, and intelligent animation characters become more anthropomorphic. This can also become the technical basis for content creation such as intelligent animation through AI technology.
以上仅仅对本公开实施例在智能动画角色场景中的应用进行了示例性说明,本公开的实施例还可以应用于多种其它场景。例如,在智能会话主体是聊天机器人的场景下,该聊天机器人可以与用户进行诸如语音、文字、视频等形式的聊天,则本公开的实施例所处理的多模态输入数据可以包括例如聊天会话、采集的图像或音频、外界环境数据等,并且所提供的多模态输出数据可以包括例如语音、文字、动画序列等。例如,在智能会话主体是虚拟主播的场景下,该虚拟主播可以具有对应的虚拟形象并且向多个用户播放和解说预定的媒体内容,则本公开的实施例所处理的多模态输入数据可以包括例如所播放的媒体内容、外界环境数据等,并且所提供的多模态输出数据可以包括例如语音、文字、虚拟形象的动画序列等。例如,在智能会话主体是智能车机助理的场景下,该智能车机助理可以在用户驾驶交通工具(例如,车辆)期间提供辅助或陪伴,则本公开的实施例所处理的多模态输入数据可以包括例如聊天会话、采集的图像或音频、外界环境数据等,并且所提供的多模态输出数据可以包括例如语音、文字等。例如,在智能会话主体是智能客服的场景下,该智能客服可以为顾客提供诸如问题解答、产品信息提供等交互,则本公开的实施例所处理的多模态输入数据可以包括例如聊天会话、外界环境数据等,并且所提供的多模态输出数据可以包括例如语音、文字、动画等。例如,在智能会话主体是智能音箱的场景下,该智能音箱中的语音助理或聊天机器人可以与用户进行交互、播放音频内容等,则本公开的实施例所处理的多模态输入数据可以包括例如所播放的音频内容、聊天会话、采集的音频、外界环境数据等,并且所提供的多模态输出数据可以包括例如语音等。应当理解,除了上述这些示例性场景,本公开的实施例还可以应用于任何其它场景。The above is only an exemplary description of the application of the embodiments of the present disclosure in the intelligent animation character scene, and the embodiments of the present disclosure can also be applied to various other scenes. For example, in the scenario where the subject of the intelligent conversation is a chat robot, the chat robot can chat with the user in forms such as voice, text, video, etc., then the multimodal input data processed by the embodiments of the present disclosure can include, for example, chat session , collected images or audio, external environment data, etc., and the multimodal output data provided may include, for example, voice, text, animation sequences, etc. For example, in the scenario where the subject of the intelligent conversation is a virtual anchor, the virtual anchor can have a corresponding avatar and play and explain predetermined media content to multiple users, then the multimodal input data processed by the embodiments of the present disclosure can be It includes, for example, played media content, external environment data, etc., and the provided multimodal output data may include, for example, voice, text, animation sequences of avatars, and the like. For example, in the scenario where the subject of the intelligent conversation is a smart car-machine assistant, the smart car-machine assistant can provide assistance or companionship while the user is driving a vehicle (for example, a vehicle), then the multimodal input processed by the embodiments of the present disclosure The data may include, for example, chat sessions, collected images or audio, external environment data, etc., and the provided multimodal output data may include, for example, voice, text, and the like. For example, in the scenario where the subject of the intelligent conversation is an intelligent customer service, the intelligent customer service can provide customers with interactions such as answering questions and providing product information, then the multimodal input data processed by the embodiments of the present disclosure can include, for example, chat sessions, External environment data, etc., and the multimodal output data provided may include, for example, voice, text, animation, etc. For example, in the scenario where the subject of the intelligent conversation is a smart speaker, the voice assistant or chat robot in the smart speaker can interact with the user, play audio content, etc., then the multimodal input data processed by the embodiments of the present disclosure can include For example, played audio content, chat sessions, collected audio, external environment data, etc., and the provided multimodal output data may include, for example, voice and the like. 
It should be understood that, in addition to the above exemplary scenarios, the embodiments of the present disclosure may also be applied to any other scenarios.
图1示出了根据实施例的基于多模态的反应式响应生成系统100的示例性架构。系统100可以支持智能会话主体在不同的场景中做出基于多模态的反应式响应。智能会话主体可以实施或驻留在终端设备或任何用户可访问的设备或平台上。FIG. 1 shows an exemplary architecture of a multi-modality-based reactive response generation system 100 according to an embodiment. The system 100 can support the intelligent conversation subject to make multimodal-based reactive responses in different scenarios. An intelligent conversational subject may be implemented or resident on an end device or any user-accessible device or platform.
系统100可以包括多模态数据输入接口110,其用于获得多模态输入数据。多模态数据输入接口110可以从多种数据源处收集多种类型的输入数据。例如,在向用户播放目标内容的情况下,多模态数据输入接口110可以收集到该目标内容的例如图像、音频、弹幕文件等数据。在本文中,目标内容可以广泛地指在设备上播放或呈现给用户的各种媒体内容,例如,视频内容、音频内容、图片内容、文字内容等。例如,在智能会话主体可以与用户进行聊天的情况下,多模态数据输入接口110可以获得关于聊天会话的输入数据。例如,多模态数据输入接口110可以通过终端设备上的摄像头和/或麦克风来采集用户周围的图像和/或音频。例如,多模态数据输入接口110还可以从第三方应用或任何其它信息源处获得外界环境数据。在本文中,外界环境数据可以广泛地指终端设备或用户所处于的真实世界中的各种环境参数,例如,关于天气、温度、湿度、行进速度等的数据。 System 100 may include a multimodal data input interface 110 for obtaining multimodal input data. The multimodal data input interface 110 can collect various types of input data from various data sources. For example, in the case of playing the target content to the user, the multimodal data input interface 110 may collect data such as image, audio, and barrage files of the target content. Herein, the target content may broadly refer to various media content played on a device or presented to a user, for example, video content, audio content, picture content, text content, and the like. For example, in the case that the intelligent conversation subject can chat with the user, the multimodal data input interface 110 can obtain input data about the chat conversation. For example, the multimodal data input interface 110 may collect images and/or audio around the user through a camera and/or a microphone on the terminal device. For example, the multimodal data input interface 110 can also obtain external environment data from a third-party application or any other information source. Herein, the external environment data may broadly refer to various environmental parameters in the real world where the terminal device or the user is located, for example, data about weather, temperature, humidity, travel speed, and the like.
多模态数据输入接口110可以将所获得的多模态输入数据112提供给系统100中的核心处理单元120。核心处理单元120提供反应式响应生成所需要的各种核心处理能力。基于处理阶段和类型, 核心处理单元120可以进而包括多个处理模块,例如,数据整合处理模块130、场景逻辑处理模块140、多模态输出数据生成模块150等。The multimodal data input interface 110 may provide the obtained multimodal input data 112 to the core processing unit 120 in the system 100 . The core processing unit 120 provides various core processing capabilities required for reactive response generation. Based on the processing stage and type, the core processing unit 120 may further include multiple processing modules, for example, a data integration processing module 130, a scene logic processing module 140, a multimodal output data generation module 150, and the like.
数据整合处理模块130可以从多模态输入数据112中提取不同类型的多模态的信息,所提取的多模态的信息可以是在特定场景和时序条件下而处于同一上下文环境中的。在一种实现方式中,数据整合处理模块130可以从多模态输入数据112中提取一个或多个信息元素132。在本文中,信息元素可以广泛地指从原始数据中提取的计算机可理解的信息或信息表示。在一个方面,数据整合处理模块130可以从多模态输入数据112所包括的目标内容中提取信息元素,例如,从目标内容的图像、音频、弹幕文件等中提取信息元素。示例性地,从目标内容的图像中提取的信息元素可以包括例如人物特征、文本、图像光线、物体等,从目标内容的音频中提取的信息元素可以包括例如音乐、语音等,从目标内容的弹幕文件中提取的信息元素可以包括例如弹幕文本等。在本文中,音乐可以广泛地指歌曲演唱、器乐演奏或者其组合,语音可以广泛地指讲话的声音。在一个方面,数据整合处理模块130可以从多模态输入数据112所包括的聊天会话中提取信息元素,例如,消息文本。在一个方面,数据整合处理模块130可以从多模态输入数据112所包括的采集的图像中提取例如对象特征等信息元素。在一个方面,数据整合处理模块130可以从多模态输入数据112所包括的采集的音频中提取例如语音、音乐等信息元素。在一个方面,数据整合处理模块130可以从多模态输入数据112所包括的外界环境数据中提取例如外界环境信息等信息元素。The data integration processing module 130 can extract different types of multi-modal information from the multi-modal input data 112 , and the extracted multi-modal information can be in the same context under specific scenarios and time sequence conditions. In one implementation, the data integration processing module 130 can extract one or more information elements 132 from the multimodal input data 112 . In this context, information elements can broadly refer to computer-understandable information or information representations extracted from raw data. In one aspect, the data integration processing module 130 may extract information elements from the target content included in the multimodal input data 112, for example, extract information elements from images, audio, bullet chat files, etc. of the target content. Exemplarily, the information elements extracted from the image of the target content may include, for example, character features, text, image light, objects, etc., the information elements extracted from the audio of the target content may include, for example, music, voice, etc., and the information elements extracted from the target content The information elements extracted from the bullet chat file may include, for example, bullet chat text and the like. Herein, music may broadly refer to song singing, instrumental performance, or a combination thereof, and speech may broadly refer to the sound of speech. In one aspect, data integration processing module 130 may extract informational elements, such as message text, from chat sessions included in multimodal input data 112 . In one aspect, the data integration processing module 130 can extract information elements, such as object features, from the captured images included in the multimodal input data 112 . In one aspect, the data integration processing module 130 may extract information elements such as speech, music, etc. from the collected audio included in the multimodal input data 112 . In one aspect, the data integration processing module 130 may extract information elements such as external environment information from the external environment data included in the multimodal input data 112 .
场景逻辑处理模块140可以至少基于信息元素132来生成一个或多个参考信息项142。在本文中,参考信息项可以广泛地指基于各种信息元素所生成的、供系统100在产生多模态输出数据时所参考的各种引导性信息。在一个方面,参考信息项142可以包括情感标签,该情感标签可以引导多模态输出数据所要呈现或基于的情感。在一个方面,参考信息项142可以包括动画标签,在多模态输出数据将要包括动画序列的情况下,该动画标签可以用于选择所要呈现的动画。在一个方面,参考信息项142可以包括评论文本,该评论文本可以是针对例如目标内容的评论,以便表达智能会话主体自己对于目标内容的观点或评价等。在一个方面,参考信息项142可以包括聊天响应文本,该聊天响应文本可以是对来自聊天会话的消息文本的响应。应当理解,可选地,场景逻辑处理模块140还可以在生成参考信息项142的过程中考虑更多其它因素,例如,场景特定情感、智能会话主体的预设个性、智能会话主体的预设角色等。The scene logic processing module 140 may generate one or more reference information items 142 based at least on the information elements 132 . Herein, a reference information item may broadly refer to various guiding information generated based on various information elements for reference by the system 100 when generating multimodal output data. In one aspect, the reference information item 142 can include an emotion tag that can guide the emotion that the multimodal output data is presented or based on. In one aspect, the reference information item 142 may include an animation tag, which may be used to select the animation to be presented where the multimodal output data is to include an animation sequence. In one aspect, the reference information item 142 may include comment text, and the comment text may be, for example, a comment on the target content, so as to express the intelligent conversation subject's own opinion or evaluation on the target content. In one aspect, reference information item 142 may include chat response text, which may be a response to message text from a chat session. It should be understood that, optionally, the scene logic processing module 140 may also consider more other factors in the process of generating the reference information item 142, for example, scene-specific emotion, preset personality of the intelligent conversation subject, preset role of the intelligent conversation subject Wait.
The multimodal output data generation module 150 may produce multimodal output data 152 by using at least the reference information items 142. The multimodal output data 152 may include multiple types of output data, e.g., speech, text, animation sequences, etc. The speech included in the multimodal output data 152 may be, e.g., speech corresponding to the comment text or the chat response text; the text included in the multimodal output data 152 may be, e.g., text corresponding to the comment text or the chat response text; and the animation sequence included in the multimodal output data 152 may be, e.g., an animation sequence of an avatar of the intelligent conversation subject. It should be understood that, optionally, the multimodal output data generation module 150 may also consider more other factors during the process of generating the multimodal output data 152, e.g., scene-specific requirements, etc.
The system 100 may include a multimodal data output interface 160 for providing the multimodal output data 152. The multimodal data output interface 160 may support providing or presenting multiple types of output data to a user. For example, the multimodal data output interface 160 may present text, animation sequences, etc. via a display screen, and may play speech, etc. via a speaker.
It should be understood that the architecture of the multimodality-based reactive response generation system 100 described above is merely exemplary, and the system 100 may include more or fewer component units or modules according to actual application requirements and designs. In addition, it should be understood that the system 100 may be implemented by hardware, software, or a combination thereof. For example, in one case, the multimodal data input interface 110, the core processing unit 120, and the multimodal data output interface 160 may be hardware-based units; e.g., the core processing unit 120 may be implemented by a processor, controller, etc. having data processing capability, while the multimodal data input interface 110 and the multimodal data output interface 160 may be implemented by hardware interface units having data input/output capability. For example, in one case, the units or modules included in the system 100 may also be implemented by software or programs, such that these units or modules may be software units or software modules. In addition, it should be understood that the units and modules included in the system 100 may be implemented at a terminal device, or may be implemented at a network device or platform, or may be partly implemented at a terminal device and partly implemented at a network device or platform.
FIG. 2 illustrates an exemplary process 200 for multimodality-based reactive response generation according to an embodiment. The steps or operations in the process 200 may be performed by, e.g., corresponding units or modules in the multimodality-based reactive response generation system 100 in FIG. 1.
At 210, multimodal input data 212 may be obtained. Exemplarily, depending on the application scenario, the multimodal input data 212 may include at least one of, e.g., images of target content, audio of target content, a bullet chat file of target content, a chat session, captured images, captured audio, external environment data, etc. For example, in scenarios where target content exists, e.g., an intelligent animated character scenario, a virtual anchor scenario, etc., data such as images, audio, and bullet chat files of the target content may be obtained at 210. For example, in a scenario where the intelligent conversation subject supports a chat function, data about a chat session may be obtained at 210, including chat records in the chat session, etc. For example, in a scenario where the terminal device implementing the intelligent conversation subject has a camera or a microphone, data such as images captured by the camera and audio captured by the microphone may be obtained at 210. For example, in a scenario where the intelligent conversation subject has the ability to acquire external environment data, various kinds of external environment data may be obtained at 210. It should be understood that the multimodal input data 212 is not limited to the exemplary input data described above.
At 220, one or more information elements 222 may be extracted from the multimodal input data 212. Depending on the specific input data included in the multimodal input data 212, corresponding information elements may be extracted from each of these input data respectively.
Where the multimodal input data 212 includes images of the target content, character features may be extracted from the images of the target content. Taking the target content being a concert video played on a terminal device as an example, various character features of the singer may be extracted from images of the video, e.g., facial expressions, body movements, clothing colors, etc. It should be understood that the embodiments of the present disclosure are not limited to any specific character feature extraction technique.
Where the multimodal input data 212 includes images of the target content, text may be recognized from the images of the target content. In one implementation, text may be recognized from an image by a text recognition technique such as optical character recognition (OCR). Still taking the target content being a concert video as an example, some images in the video may contain music information, e.g., song title, lyricist, composer, singer, performer, etc., and accordingly such music information may be obtained through text recognition. It should be understood that the embodiments of the present disclosure are not limited to recognizing text by OCR, and any other text recognition technique may be adopted. In addition, the text recognized from images of the target content is not limited to music information, and may also include any other text indicating information related to events occurring in the images, e.g., subtitles, lyrics, etc.
Where the multimodal input data 212 includes images of the target content, image light may be detected from the images of the target content. Image light may refer to characteristics of the ambient light within the picture presented by the image, e.g., bright, dim, gloomy, flickering, etc. Still taking the target content being a concert video as an example, assuming that the singer is singing a cheerful song, the stage at the concert may use bright lighting, and thus the image light may be detected as bright from these images. It should be understood that the embodiments of the present disclosure are not limited to any specific image light detection technique.
Where the multimodal input data 212 includes images of the target content, objects may be recognized from the images of the target content. The recognized objects may be, e.g., representative objects in the image, objects appearing at prominent or important positions in the image, objects associated with persons in the image, etc. For example, the recognized objects may include props, background furnishings, etc. Still taking the target content being a concert video as an example, assuming that the singer plays a guitar slung over the shoulder while singing a song, the object "guitar" may be recognized from the images. It should be understood that the embodiments of the present disclosure are not limited to any specific object recognition technique.
Where the multimodal input data 212 includes audio of the target content, music may be extracted from the audio of the target content. The target content itself may be audio, e.g., a song played to a user on a terminal device, and accordingly the music corresponding to the song may be extracted from the audio. In addition, the target content may also be a video, e.g., a concert video, and accordingly music may be extracted from the audio contained in the video. Herein, music may broadly include, e.g., pieces played by musical instruments, songs sung by singers, special effect sounds produced by dedicated equipment or voice actors, and so on. The extracted music may be background music, foreground music, etc. Moreover, music extraction may broadly refer to, e.g., obtaining a sound file, sound wave data, etc. corresponding to the music. It should be understood that the embodiments of the present disclosure are not limited to any specific music extraction technique.
Where the multimodal input data 212 includes audio of the target content, speech may be extracted from the audio of the target content. Herein, speech may refer to the sound of speaking. For example, when the target content includes conversations, speeches, comments, etc. of persons or characters, the corresponding speech may be extracted from the audio of the target content. Speech extraction may broadly refer to, e.g., obtaining a sound file, sound wave data, etc. corresponding to the speech. It should be understood that the embodiments of the present disclosure are not limited to any specific speech extraction technique.
Where the multimodal input data 212 includes a bullet chat file of the target content, bullet chat text may be extracted from the bullet chat file of the target content. In some cases, some video playback applications or platforms allow different viewers of a video to send their own comments, feelings, etc. in the form of bullet chats, and these comments, feelings, etc. may be included as bullet chat text in a bullet chat file attached to the video; therefore, the bullet chat text may be extracted from the bullet chat file. It should be understood that the embodiments of the present disclosure are not limited to any specific bullet chat text extraction technique.
Where the multimodal input data 212 includes a chat session, message text may be extracted from the chat session. The message text may include, e.g., the text of chat messages sent by the intelligent conversation subject, the text of chat messages sent by at least one other chat participant, etc. Where the chat session is conducted in text form, the message text may be extracted directly from the chat session, and where the chat session is conducted in speech form, speech messages in the chat session may be converted into message text through speech recognition. It should be understood that the embodiments of the present disclosure are not limited to any specific message text extraction technique.
Where the multimodal input data 212 includes captured images, object features may be extracted from the captured images. Object features may broadly refer to various features of objects appearing in the captured images, where the objects may include, e.g., persons, physical objects, etc. For example, where an image of a computer user is captured by a computer camera, various features of the user, e.g., facial expressions, body movements, etc., may be extracted from the image. For example, where an image of the road ahead of a car is captured by a camera installed on the car, various features of, e.g., vehicles ahead, traffic signs, roadside buildings, etc. may be extracted from the image. It should be understood that the embodiments of the present disclosure are not limited to extracting the above exemplary object features from captured images, and any other object features may also be extracted. Moreover, the embodiments of the present disclosure are not limited to any specific object feature extraction technique.
Where the multimodal input data 212 includes captured audio, speech and/or music may be extracted from the captured audio. Speech, music, etc. may be extracted from the captured audio in a manner similar to the above-described extraction of speech, music, etc. from the audio of the target content.
Where the multimodal input data 212 includes external environment data, external environment information may be extracted from the external environment data. For example, specific weather information may be extracted from data about weather, specific temperature information may be extracted from data about temperature, specific speed information may be extracted from data about travel speed, and so on. It should be understood that the embodiments of the present disclosure are not limited to any specific external environment information extraction technique.
It should be understood that the information elements extracted from the multimodal input data 212 described above are all exemplary, and the embodiments of the present disclosure may also extract any other types of information elements. In addition, the extracted information elements may belong to the same context under specific scenario and timing conditions; e.g., these information elements may be aligned in time, and accordingly different combinations of information elements may be extracted at different points in time.
At 230, one or more reference information items 232 may be generated based at least on the information elements 222.
According to an embodiment of the present disclosure, the reference information items 232 generated at 230 may include an emotion tag. The emotion tag may indicate, e.g., an emotion type, an emotion level, etc. The embodiments of the present disclosure may cover any number of predetermined emotion types, and any number of emotion levels defined for each emotion type. Exemplary emotion types may include, e.g., happiness, sadness, anger, etc., and exemplary emotion levels may include level 1, level 2, level 3, etc. according to emotion intensity from low to high. Accordingly, if the emotion tag <happy, level 2> is determined at 230, it indicates that the information elements 222 as a whole express the emotion of happiness and that the emotion level is the medium level 2. It should be understood that the above exemplary emotion types, exemplary emotion levels, and their expressions are given merely for ease of explanation, and the embodiments of the present disclosure may also adopt more or fewer of any other emotion types and any other emotion levels, and may adopt any other expressions.
The emotion expressed by each type of information element may first be determined, and these emotions may then be considered together to determine the final emotion type and emotion level. For example, one or more emotion representations respectively corresponding to one or more of the information elements 222 may be generated first, and a final emotion tag may then be generated based at least on these emotion representations. Herein, an emotion representation may refer to an informational representation of emotion, which may take the form of, e.g., an emotion vector, an emotion label, etc. An emotion vector may include multiple dimensions for representing an emotion distribution, each dimension corresponding to an emotion type, and the value in each dimension indicating the predicted probability or weight of the corresponding emotion type.
Where the information elements 222 include character features extracted from images of the target content, e.g., a pre-trained machine learning model may be used to generate an emotion representation corresponding to the character features. Taking facial expressions among the character features as an example, e.g., a convolutional neural network model for facial emotion recognition may be adopted to predict the corresponding emotion representation. Similarly, the convolutional neural network model may also be trained to further take into account other features that may be included in the character features, e.g., body movements, to predict the emotion representation. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to character features.
Where the information elements 222 include text recognized from images of the target content, taking the text being music information as an example, emotion information corresponding to the music may be retrieved from a pre-established music database based on the music information, so as to form an emotion representation. The music database may include pre-collected music information of a large amount of music as well as corresponding emotion information, music genre, background knowledge, chat corpus, etc. The music database may be indexed by various kinds of music information such as song title, singer, performer, etc., so that emotion information corresponding to a specific piece of music may be found from the music database based on the music information. Optionally, since different music genres may also generally indicate different emotions, the music genre found from the music database may also be used to form the emotion representation. In addition, taking the recognized text being subtitles of words spoken by a person in the image as an example, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the subtitles. The machine learning model may be, e.g., an emotion classification model based on a convolutional neural network. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to text recognized from images of the target content.
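A minimal sketch of the music-database lookup follows; the database layout, the (song title, singer) lookup key, and the stored fields are illustrative assumptions rather than the patented data model:

```python
# Hypothetical pre-built music database indexed by music information.
MUSIC_DB = {
    # (song_title, singer) -> pre-collected metadata
    ("Example Song", "Example Singer"): {
        "emotion": {"happy": 0.7, "sad": 0.1, "angry": 0.2},  # emotion information
        "genre": "pop",                                        # music genre
    },
}

def lookup_music_emotion(song_title: str, singer: str):
    """Return the recorded emotion distribution and genre for this music, if any."""
    entry = MUSIC_DB.get((song_title, singer))
    if entry is None:
        return None
    return {"emotion": entry["emotion"], "genre": entry["genre"]}

# e.g. lookup_music_emotion("Example Song", "Example Singer")
```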
Where the information elements 222 include an object recognized from images of the target content, the emotion representation corresponding to the object may be determined based on a pre-established machine learning model or preset heuristic rules. In some cases, objects in an image may also help express emotion. For example, if the image shows that multiple red ornaments are arranged on the stage to enhance the atmosphere, these red ornaments recognized from the image may help determine an emotion such as happiness or joy. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to an object recognized from images of the target content.
Where the information elements 222 include music extracted from the audio of the target content, the emotion representation corresponding to the music may be determined or generated in a number of ways. In one manner, if music information has already been recognized, emotion information corresponding to the music may be found from the music database based on the music information, so as to form an emotion representation. In another manner, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the music based on multiple music features extracted from the music. The music features may include the audio average energy (AE) of the music, denoted as

AE = (1/N) Σ_t x(t)²

where x is the discrete audio input signal, t is time, and N is the number of samples of the input signal x. The music features may also include rhythm features extracted from the music, represented by the number of beats and/or the distribution of beat intervals. Optionally, the music features may also include the above-described emotion information corresponding to the music obtained by using the music information. A machine learning model may be trained based on one or more of the above music features, so that the trained machine learning model is able to predict the emotion representation of music. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to music extracted from the audio of the target content.
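A minimal sketch of computing the two music features named above (audio average energy and beat-interval statistics) is given below; it is not the patented implementation, and it assumes librosa is available for beat tracking:

```python
import numpy as np
import librosa

def audio_average_energy(x: np.ndarray) -> float:
    """AE = (1/N) * sum_t x(t)^2 over the N samples of the discrete signal x."""
    n = len(x)
    return float(np.sum(x.astype(np.float64) ** 2) / n)

def rhythm_features(x: np.ndarray, sr: int) -> dict:
    """Number of beats and distribution (mean/std) of beat intervals in seconds."""
    _tempo, beat_frames = librosa.beat.beat_track(y=x, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    intervals = np.diff(beat_times)
    return {
        "num_beats": int(len(beat_times)),
        "interval_mean": float(intervals.mean()) if len(intervals) else 0.0,
        "interval_std": float(intervals.std()) if len(intervals) else 0.0,
    }

# Usage sketch: build the feature vector fed to an emotion prediction model.
# x, sr = librosa.load("concert_clip.wav", sr=None)
# features = [audio_average_energy(x), *rhythm_features(x, sr).values()]
```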
Where the information elements 222 include speech extracted from the audio of the target content, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the speech. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to speech extracted from the audio of the target content.
Where the information elements 222 include bullet chat text extracted from the bullet chat file of the target content, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the bullet chat text. The machine learning model may be, e.g., an emotion classification model based on a convolutional neural network, denoted as CNN_sen. Assuming that the words in the bullet chat text are denoted as [d_0, d_1, d_2, …], the emotion vector corresponding to the bullet chat text may be predicted by the emotion classification model CNN_sen as [s_0, s_1, s_2, …] = CNN_sen([d_0, d_1, d_2, …]), where each dimension of the emotion vector [s_0, s_1, s_2, …] corresponds to one emotion category. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to bullet chat text extracted from the bullet chat file of the target content.
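A minimal sketch of a CNN_sen-style classifier is shown below; the vocabulary size, embedding dimension, kernel sizes, and number of emotion categories are illustrative assumptions, not values from the disclosure. It maps the word ids of a bullet chat text [d_0, d_1, …] to an emotion vector [s_0, s_1, …], one dimension per emotion category:

```python
import torch
import torch.nn as nn

class CNNSen(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, num_emotions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Parallel convolutions over 2-, 3-, 4-gram windows of word embeddings.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, 64, kernel_size=k) for k in (2, 3, 4)
        )
        self.fc = nn.Linear(64 * 3, num_emotions)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (batch, seq_len) -> embeddings: (batch, embed_dim, seq_len)
        emb = self.embed(word_ids).transpose(1, 2)
        pooled = [torch.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        logits = self.fc(torch.cat(pooled, dim=1))
        return torch.softmax(logits, dim=1)   # emotion vector [s_0, s_1, ...]

# emotion_vector = CNNSen()(torch.tensor([[12, 87, 5, 0, 0]]))
```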
Where the information elements 222 include message text extracted from a chat session, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the message text. The machine learning model may be established in a manner similar to the above-described machine learning model for generating an emotion representation corresponding to bullet chat text. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to message text extracted from a chat session.
Where the information elements 222 include object features extracted from captured images, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the object features. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to object features extracted from captured images.
Where the information elements 222 include speech and/or music extracted from captured audio, an emotion representation corresponding to the speech and/or music may be generated. The emotion representation corresponding to the speech and/or music extracted from the captured audio may be generated in a manner similar to the above-described determination of an emotion representation corresponding to speech and/or music extracted from the audio of the target content. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to speech and/or music extracted from captured audio.
Where the information elements 222 include external environment information extracted from external environment data, the emotion representation corresponding to the external environment information may be determined based on a pre-established machine learning model or preset heuristic rules. Taking the external environment information being "rainy" weather as an example, since people often show slightly sad emotions in rainy weather, an emotion representation corresponding to sadness may be determined from this external environment information. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to external environment information extracted from external environment data.
After the one or more emotion representations respectively corresponding to one or more of the information elements 222 are generated as described above, a final emotion tag may be generated based at least on these emotion representations. The final emotion tag may be understood as indicating the overall emotion determined by comprehensively considering multiple information elements. The emotion tag may be formed from multiple emotion representations in various ways. For example, where the emotion representations take the form of emotion vectors, multiple emotion representations may be superimposed to obtain a total emotion vector, and the emotion type and emotion level may be derived from the emotion distribution in the total emotion vector to form the final emotion tag. For example, where the emotion representations take the form of emotion labels, the final emotion tag may be computed, selected, or determined from multiple emotion labels corresponding to multiple information elements based on predetermined rules. It should be understood that the embodiments of the present disclosure are not limited to any specific manner of generating an emotion tag based on multiple emotion representations.
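A minimal sketch of the emotion-vector route follows; the emotion categories, the simple summation, and the level thresholds are illustrative assumptions. It superimposes per-element emotion vectors into a total vector and derives an <emotion type, emotion level> tag from its distribution:

```python
from typing import List, Tuple
import numpy as np

EMOTIONS = ["happy", "sad", "angry"]          # assumed predefined emotion types

def emotion_tag(emotion_vectors: List[np.ndarray]) -> Tuple[str, int]:
    total = np.sum(emotion_vectors, axis=0)
    total = total / total.sum()               # normalize the superimposed distribution
    dominant = int(np.argmax(total))
    weight = float(total[dominant])
    # Map the dominant weight to a discrete intensity level 1..3 (assumed thresholds).
    level = 1 if weight < 0.5 else (2 if weight < 0.8 else 3)
    return EMOTIONS[dominant], level

# e.g. emotion_tag([np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])])
#      -> ("happy", 2)
```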
It should be understood that although the above discussion involves first generating, at 230, multiple emotion representations respectively corresponding to multiple information elements and then generating the emotion tag based on these emotion representations, alternatively, the embodiments of the present disclosure may also generate the emotion tag directly based on multiple information elements. For example, a machine learning model may be pre-trained, which may be trained to take multiple information elements as multiple input features and predict the emotion tag accordingly. Thus, the trained model may be used to generate the emotion tag directly based on the information elements 222.
According to an embodiment of the present disclosure, the reference information items 232 generated at 230 may include an animation tag. Where the multimodal output data is to include an animation sequence of the avatar of the intelligent conversation subject, the animation tag may be used to select the animation to be presented. The animation tag may indicate at least one or a combination of, e.g., a facial expression type, a body movement type, etc. of the avatar. Facial expressions may include, e.g., smiling, laughing, blinking, pouting, speaking, etc., and body movements may include, e.g., turning left, waving, swaying the body, dance moves, etc.
At least one of the information elements 222 may be mapped to an animation tag according to predetermined rules. For example, various animation tags may be predefined, and a large number of mapping rules from information element sets to animation tags may be predefined, where an information element set may include one or more information elements. Thus, given an information element set including one or more information elements, the corresponding animation tag may be determined, with reference to the predefined mapping rules, based on one information element or a combination of multiple information elements in the set. One exemplary mapping rule is: when the character features extracted from images of the target content indicate a singing action of a person, and the bullet chat text includes key words such as "nice to listen to" and "intoxicated", these information elements may be mapped to animation tags such as "close both eyes" and "sway the body", so that the avatar can exhibit behaviors such as listening to the song with intoxication. One exemplary mapping rule is: when the speech extracted from the audio of the target content indicates that people are quarreling, the bullet chat text includes key words such as "noise" and "don't want to listen", and the message text extracted from the chat session includes key words indicating the user's disgust, these information elements may be mapped to animation tags such as "cover ears with hands" and "shake head", so that the avatar can exhibit behaviors such as not wanting to hear the quarrel. One exemplary mapping rule is: when the image light detected from images of the target content indicates rapid changes between light and dark, the object recognized from images of the target content is a guitar, and the music extracted from the audio of the target content indicates a fast-paced piece, these information elements may be mapped to animation tags such as "play the guitar" and "fast-paced dance moves", so that the avatar can exhibit behaviors such as playing and dancing along with the lively music. It should be understood that the above merely lists several exemplary mapping rules, and the embodiments of the present disclosure may also define a large number of any other mapping rules.
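A minimal sketch of such a rule table is shown below; the rule conditions, key words, and tag names are illustrative assumptions loosely mirroring the exemplary rules above, not an exhaustive or authoritative rule set:

```python
from typing import Callable, Dict, List, Tuple

# Each rule: a predicate over the extracted information elements -> animation tags.
MAPPING_RULES: List[Tuple[Callable[[Dict], bool], List[str]]] = [
    (
        lambda e: "singing" in e.get("character_features", [])
        and any(k in " ".join(e.get("bullet_chat_text", []))
                for k in ("nice to listen to", "intoxicated")),
        ["close_both_eyes", "sway_body"],
    ),
    (
        lambda e: e.get("image_light") == "rapid_flicker"
        and "guitar" in e.get("objects", [])
        and e.get("music_tempo") == "fast",
        ["play_guitar", "fast_dance"],
    ),
]

def animation_tags(elements: Dict) -> List[str]:
    """Return the animation tags of every predefined rule whose condition matches."""
    tags: List[str] = []
    for condition, rule_tags in MAPPING_RULES:
        if condition(elements):
            tags.extend(rule_tags)
    return tags
```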
In addition, optionally, the animation tag may also be generated further based on the emotion tag. For example, the emotion tag may be used together with the information elements to define mapping rules, so that the corresponding animation tag may be determined based on a combination of information elements and the emotion tag. In addition, optionally, direct mapping rules from emotion tags to animation tags may also be defined, so that after the emotion tag is generated, the corresponding animation tag may be determined directly based on the emotion tag with reference to the defined mapping rules. For example, a mapping rule may be defined from the emotion tag <sad, level 2> to animation tags such as "cry" and "wipe tears with hands".
According to an embodiment of the present disclosure, the reference information items 232 generated at 230 may include comment text. The comment text may be a comment on, e.g., the target content, so as to express the intelligent conversation subject's own opinion or evaluation of the target content. The comment text may be selected from the bullet chat text of the target content. Exemplarily, a comment generation model constructed based on two-tower models may be used to select comment text from the bullet chat text. The bullet chat text of the target content may be aligned in time with the images and/or audio of the target content, where being aligned in time may refer to being located at the same moment or within the same time period. The bullet chat text at a specific moment may include multiple sentences, which may be comments of different viewers on the images and/or audio of the target content at that moment or in an adjacent time period. At each moment, the comment generation model may select a suitable sentence from the corresponding bullet chat text as the comment text for the images and/or audio of the target content at that moment or in an adjacent time period. For example, two-tower models may be used to determine the degree of matching between a sentence in the bullet chat text of the target content and the images and/or audio of the target content, and the sentence with the highest matching degree may be selected from the bullet chat text as the comment text. The comment generation model may include, e.g., two two-tower models. For a sentence in the bullet chat text, one two-tower model may be used to output a first matching score based on an input target content image and the sentence, to indicate the degree of matching between the image and the sentence, while the other two-tower model may be used to output a second matching score based on input target content audio and the sentence, to indicate the degree of matching between the audio and the sentence. The first matching score and the second matching score may be combined in any manner to obtain a comprehensive matching score for the sentence. After multiple comprehensive matching scores of multiple sentences of the bullet chat text are obtained, the sentence with the highest matching score may be selected as the comment text for the current image and/or audio. It should be understood that the structure of the above comment generation model is merely exemplary, and the comment generation model may also include only one of the two two-tower models, or may be based on any other model trained to determine the degree of matching between sentences in the bullet chat text and the images and/or audio of the target content.
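A minimal sketch of the selection step is given below; the cosine-similarity scoring of the two towers' embeddings and the equal-weight combination of the two scores are illustrative assumptions (the disclosure only states that the two scores may be combined in any manner). The pre-trained tower encoders are assumed to be provided from outside:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_comment(sentences, image_emb, audio_emb, encode_text,
                   w_image=0.5, w_audio=0.5):
    """Return the bullet chat sentence whose combined image/audio matching score is highest.

    image_emb/audio_emb are the image-tower and audio-tower embeddings of the
    time-aligned target content; encode_text maps a sentence to its text-tower
    embedding (all towers are assumed pre-trained).
    """
    best_sentence, best_score = None, -np.inf
    for sentence in sentences:
        t = encode_text(sentence)
        score = w_image * cosine(t, image_emb) + w_audio * cosine(t, audio_emb)
        if score > best_score:
            best_sentence, best_score = sentence, score
    return best_sentence, best_score
```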
According to an embodiment of the present disclosure, if the intelligent conversation subject is chatting with at least one other chat participant in a chat session, the reference information items 232 generated at 230 may also include chat response text. The other chat participant may be, e.g., a user, another intelligent conversation subject, etc. After message text from the other chat participant is obtained, corresponding chat response text may be generated, through a chat engine, based at least on the message text.
In one implementation, any general-purpose chat engine may be adopted to generate the chat response text.
In one implementation, the chat engine may generate the chat response text based at least on the emotion tag. For example, the chat engine may be trained to generate chat response text based at least on the input message text and the emotion tag, so that the chat response text is generated at least under the influence of the emotion indicated by the emotion tag.
In one implementation, the intelligent conversation subject may exhibit the characteristic of emotion continuation in a chat session; e.g., the response of the intelligent conversation subject is influenced not only by the emotion of the currently received message text, but also by the emotional state that the intelligent conversation subject itself is currently in. As an example, assuming that the intelligent conversation subject is currently in a happy emotional state, then although the currently received message text may carry or cause a negative emotion such as anger, the intelligent conversation subject will not immediately give an angry response because of that message text, but may instead remain happy or only slightly lower the emotion level of its happiness. In contrast, existing chat engines usually determine the emotion type of a response only for the current round of conversation or only according to the currently received message text, so that the emotion type of the response may change frequently with the received message text, which does not conform to the behavior of humans, who are usually in a relatively stable emotional state while chatting and do not change their emotional state frequently. The intelligent conversation subject with the emotion continuation characteristic in chat sessions proposed by the embodiments of the present disclosure will therefore be more anthropomorphic. To implement the emotion continuation characteristic in a chat session, the chat engine may generate the chat response text based at least on an emotion representation from an emotion transfer network. The emotion transfer network is used to model dynamic emotion transitions; it can both maintain a stable emotional state and make appropriate adjustments or updates to the emotional state in response to the currently received message text. For example, the emotion transfer network may take the current emotion representation and the currently received message text as input and output an updated emotion representation, where the current emotion representation may be, e.g., a vector representation of the current emotional state of the intelligent conversation subject. The updated emotion representation contains both information reflecting the previous emotional state and information about the emotion change that may be caused by the current message text. The updated emotion representation may further be provided to the chat engine, so that the chat engine can generate the chat response text for the current message text under the influence of the received emotion representation.
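A minimal sketch of the emotion-continuation update follows; the interpolation-style rule and the retention factor are illustrative assumptions standing in for the learned emotion transfer network described above:

```python
import numpy as np

def update_emotion_state(current_state: np.ndarray,
                         message_emotion: np.ndarray,
                         retention: float = 0.8) -> np.ndarray:
    """Blend the previous emotion vector with the emotion of the current message.

    A retention close to 1.0 keeps the subject emotionally stable, so a single
    angry message only slightly dents a happy state instead of flipping it.
    """
    updated = retention * current_state + (1.0 - retention) * message_emotion
    return updated / updated.sum()            # keep it a distribution

# Happy subject receives an angry message (dimensions: happy, sad, angry):
# update_emotion_state(np.array([0.8, 0.1, 0.1]), np.array([0.05, 0.15, 0.8]))
# -> [0.65, 0.11, 0.24]: still predominantly happy, with a small shift toward anger
```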
In one implementation, the chat engine may be trained to be able to chat about the target content, i.e., to discuss topics related to the target content with another chat participant. Exemplarily, the chat engine may be a retrieval-based chat engine constructed based on, e.g., chat content among people in forums related to the target content. The construction of the chat engine may include processing in multiple aspects. In one aspect, chat corpus involving chat content among people may be crawled from forums related to the target content. In one aspect, a word vector model may be trained for finding the possible names of each named entity. For example, word vector techniques may be used to find related words of each named entity, and then, optionally, correct words may be retained from the related words, e.g., through manual checking, as possible names of the named entity. In one aspect, keywords may be extracted from the chat corpus. For example, statistics may be computed based on the word segmentation results of the related corpus and then compared with the statistical results of non-related corpus, thereby finding words with a large difference in term frequency-inverse document frequency (TF-IDF) as keywords. In one aspect, a deep retrieval model based on, e.g., a deep convolutional neural network, which is the core network of the chat engine, may be trained. The deep retrieval model may be trained by using message-reply pairs in the chat corpus as training data. The text in a message-reply pair may include the original sentences in the message and the reply, or the extracted keywords. In one aspect, an intent detection model may be trained, which may detect which specific target content the received message text is related to, so that a forum related to that target content can be selected from multiple forums. The intent detection model may be a binary classifier; specifically, it may be, e.g., a convolutional neural network text classification model. The positive samples for the intent detection model may come from chat corpus in forums related to the target content, while the negative samples may come from chat corpus in other forums or ordinary text. Through one or more of the above processes, and possibly any other processes, a retrieval-based chat engine may be constructed, which can provide chat response text in response to input message text, where the chat response text is based on the corpus from forums related to the target content.
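A minimal sketch of the contrastive TF-IDF keyword-extraction step is shown below; the use of scikit-learn's TfidfVectorizer, the mean-score comparison, and the top-k cutoff are illustrative assumptions about how the difference between related and non-related corpora could be measured:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def contrastive_keywords(related_docs, unrelated_docs, top_k=50):
    """Keep words whose TF-IDF is much higher in the related corpus than elsewhere."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(related_docs + unrelated_docs)      # shared vocabulary
    related_scores = np.asarray(vectorizer.transform(related_docs).mean(axis=0)).ravel()
    unrelated_scores = np.asarray(vectorizer.transform(unrelated_docs).mean(axis=0)).ravel()
    diff = related_scores - unrelated_scores           # large gap => forum-specific word
    vocab = np.array(vectorizer.get_feature_names_out())
    return list(vocab[np.argsort(diff)[::-1][:top_k]])

# keywords = contrastive_keywords(forum_chat_corpus, generic_corpus)
```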
It should be understood that the processing discussed above of generating, at 230, the reference information items 232 including, e.g., the emotion tag, the animation tag, the comment text, the chat response text, etc. is exemplary; in other implementations, the process of generating the reference information items may also consider more other factors, e.g., scene-specific emotion, a preset personality of the intelligent conversation subject, a preset role of the intelligent conversation subject, etc.
Scene-specific emotion may refer to a preset emotion preference associated with a specific scenario. For example, in some scenarios the intelligent conversation subject may be required to respond as positively and optimistically as possible, and therefore scene-specific emotions that can lead to positive and optimistic responses, e.g., happiness, excitement, etc., may be preset for these scenarios. A scene-specific emotion may include an emotion type, or an emotion type and its emotion level. Scene-specific emotion may be used to influence the generation of reference information items. In one aspect, in the above process of generating the emotion tag, the scene-specific emotion may be taken as an input together with the information elements 222 so as to jointly generate the emotion tag. For example, the scene-specific emotion may be treated as an emotion representation, and this emotion representation may be used together with the multiple emotion representations respectively corresponding to the multiple information elements to generate the emotion tag. In one aspect, in the above process of generating the animation tag, the scene-specific emotion may be considered in a manner similar to the emotion tag; e.g., the scene-specific emotion may be used together with the information elements to define mapping rules. In one aspect, in the above process of generating the comment text, the ranking of multiple sentences in the bullet chat text may consider not only the degree of matching between these sentences and the images and/or audio of the target content, but also the degree of matching between the emotion information detected from these sentences and the scene-specific emotion. In one aspect, in the above process of generating the chat response text, the scene-specific emotion may be considered in a manner similar to the emotion tag. For example, the chat engine may use the input message text together with the scene-specific emotion and possibly the emotion tag to generate the chat response text.
The preset personality of the intelligent conversation subject may refer to personality traits preset for the intelligent conversation subject, e.g., lively and active, cute, mild-tempered, excitable, and so on. The responses made by the intelligent conversation subject may be made to conform to the preset personality as much as possible. The preset personality may be used to influence the generation of reference information items. In one aspect, in the above process of generating the emotion tag, the preset personality may be mapped to a corresponding emotional tendency, and this emotional tendency may be taken as an input together with the information elements 222 so as to jointly generate the emotion tag. For example, the emotional tendency may be treated as an emotion representation, which may be used together with the multiple emotion representations respectively corresponding to the multiple information elements to generate the emotion tag. In one aspect, in the above process of generating the animation tag, the preset personality may be used together with the information elements to define mapping rules. For example, a lively and active preset personality will be more conducive to determining animation tags with more body movements, a cute preset personality will be more conducive to determining animation tags with cute facial expressions, and so on. In one aspect, in the above process of generating the comment text, the ranking of multiple sentences in the bullet chat text may consider not only the degree of matching between these sentences and the images and/or audio of the target content, but also the degree of matching between the emotion information detected from these sentences and the emotional tendency corresponding to the preset personality. In one aspect, in the above process of generating the chat response text, the emotional tendency corresponding to the preset personality may be considered in a manner similar to the emotion tag. For example, the chat engine may use the input message text together with this emotional tendency and possibly the emotion tag to generate the chat response text.
The preset role of the intelligent conversation subject may refer to the role to be played by the intelligent conversation subject. Preset roles may be classified according to various criteria, e.g., roles such as a little girl or a middle-aged man classified by age and gender, roles such as a teacher, a doctor, or a police officer classified by occupation, and so on. The responses made by the intelligent conversation subject may be made to conform to the preset role as much as possible. The preset role may be used to influence the generation of reference information items. In one aspect, in the above process of generating the emotion tag, the preset role may be mapped to a corresponding emotional tendency, and this emotional tendency may be taken as an input together with the information elements 222 so as to jointly generate the emotion tag. For example, the emotional tendency may be treated as an emotion representation, which may be used together with the multiple emotion representations respectively corresponding to the multiple information elements to generate the emotion tag. In one aspect, in the above process of generating the animation tag, the preset role may be used together with the information elements to define mapping rules. For example, the preset role of a little girl will be more conducive to determining animation tags with cute facial expressions, more body movements, etc. In one aspect, in the above process of generating the comment text, the ranking of multiple sentences in the bullet chat text may consider not only the degree of matching between these sentences and the images and/or audio of the target content, but also the degree of matching between the emotion information detected from these sentences and the emotional tendency corresponding to the preset role. In one aspect, in the above process of generating the chat response text, the emotional tendency corresponding to the preset role may be considered in a manner similar to the emotion tag. For example, the chat engine may use the input message text together with this emotional tendency and possibly the emotion tag to generate the chat response text. In addition, the training corpus of the chat engine may also include more corpus corresponding to the preset role, so that the chat response text output by the chat engine better conforms to the language characteristics of the preset role.
According to the process 200, after the reference information items 232 are obtained, multimodal output data 242 may be produced at 240 by using at least the reference information items 232. The multimodal output data 242 is data to be provided or presented to the user, and may include various types of output data, e.g., speech of the intelligent conversation subject, text, an animation sequence of the avatar of the intelligent conversation subject, and so on.
The speech in the multimodal output data may be generated for the comment text, the chat response text, etc. in the reference information items. For example, the comment text, the chat response text, etc. may be converted into corresponding speech through any text-to-speech (TTS) conversion technique. Optionally, the TTS conversion process may be conditioned on the emotion tag, so that the generated speech carries the emotion indicated by the emotion tag.
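As a non-limiting sketch of this emotion-conditioned TTS step, the following Python fragment wraps an arbitrary synthesis backend; the EmotionalTTS class, the backend.tts call and the tag fields are assumptions made for illustration and do not correspond to any specific TTS library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmotionTag:
    emotion_type: str    # e.g. "happy", "excited"
    emotion_level: int   # e.g. 1 (mild) to 5 (strong)

class EmotionalTTS:
    """Hypothetical wrapper around an arbitrary TTS backend."""

    def __init__(self, backend):
        self.backend = backend   # any engine exposing text plus style controls

    def synthesize(self, text: str, tag: Optional[EmotionTag] = None,
                   speech_rate: float = 1.0) -> bytes:
        # Condition the synthesis on the emotion tag when one was generated at 230.
        style = {}
        if tag is not None:
            style = {"emotion": tag.emotion_type, "intensity": tag.emotion_level}
        return self.backend.tts(text, style=style, rate=speech_rate)

# Usage (backend is a placeholder object):
# audio = EmotionalTTS(backend).synthesize("What a great solo!", EmotionTag("excited", 4))
```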
The text in the multimodal output data may be visualized text corresponding to the comment text, the chat response text, etc. in the reference information items. The comment content, chat response content, etc. spoken by the intelligent conversation subject can thus be presented visually through this text. Optionally, the text may be generated with a predetermined font or presentation effect.
The animation sequence in the multimodal output data may be generated by using at least the animation tag and/or the emotion tag in the reference information items. An animation library for the avatar of the intelligent conversation subject may be built in advance. The animation library may include a large number of animation templates pre-created with the avatar of the intelligent conversation subject. Each animation template may include, e.g., multiple GIF images. Furthermore, the animation templates in the animation library may be indexed by animation tags and/or emotion tags, e.g., each animation template may be marked with at least one of a corresponding facial expression type, body movement type, emotion type, emotion level, etc. Therefore, when the reference information items 232 generated at 230 include an animation tag and/or an emotion tag, the animation tag and/or emotion tag may be used to select a corresponding animation template from the animation library. Preferably, after an animation template is selected, time adaptation may be performed on the animation template to form an animation sequence of the avatar of the intelligent conversation subject. The time adaptation aims to adjust the animation template so that it matches the time sequence of the speech corresponding to the comment text and/or the chat response text. For example, the durations of facial expressions, body movements, etc. in the animation template may be adjusted to match the duration of the speech of the intelligent animated character. As an example, during the time period in which the speech of the intelligent animated character is played, the images in the animation template in which the mouth opens and closes may be repeated continuously, thereby presenting the visual effect that the avatar is speaking. Moreover, it should be understood that the time adaptation is not limited to making the animation template match the time sequence of the speech corresponding to the comment text and/or the chat response text; it may also include making the animation template match the time sequence of one or more extracted information elements 222. For example, assuming that in the target content a singer is playing a guitar, that information elements such as the object "guitar" have been identified from the target content, and that these information elements have been mapped to a "playing guitar" animation tag, then during the time period in which the singer plays the guitar, the selected animation template corresponding to "playing guitar" may be repeated continuously, thereby presenting the visual effect that the avatar is playing the guitar along with the singer in the target content.
It should be understood that in different application scenarios the intelligent conversation subject may have different avatars, so that different animation libraries may be pre-established for the different avatars respectively.
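The selection-then-time-adaptation flow described above could be sketched roughly as follows; the animation library layout, the frame-duration arithmetic and all names are assumptions made for illustration, not the actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AnimationTemplate:
    frames: List[str]                    # e.g. file names of the GIF frames
    frame_duration: float                # seconds per frame
    animation_tag: Optional[str] = None  # e.g. "playing_guitar"
    emotion_type: Optional[str] = None   # e.g. "excited"

def select_template(library: List[AnimationTemplate],
                    animation_tag: Optional[str] = None,
                    emotion_type: Optional[str] = None) -> AnimationTemplate:
    """Pick the first template whose index fields match the given tags."""
    for tpl in library:
        if animation_tag and tpl.animation_tag != animation_tag:
            continue
        if emotion_type and tpl.emotion_type != emotion_type:
            continue
        return tpl
    return library[0]  # fall back to a neutral/idle template

def time_adapt(template: AnimationTemplate, speech_duration: float) -> List[str]:
    """Repeat the template frames until they cover the speech duration."""
    frames_needed = max(1, round(speech_duration / template.frame_duration))
    repeated = template.frames * (frames_needed // len(template.frames) + 1)
    return repeated[:frames_needed]

# Usage: choose a "speaking" template and stretch it over a 3.2-second utterance.
# sequence = time_adapt(select_template(library, emotion_type="happy"), 3.2)
```

The same time_adapt step could equally be keyed to the time span of an extracted information element, such as the "playing guitar" interval in the example above.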
It should be understood that the processing of generating the multimodal output data 242 at 240, including, e.g., the animation sequence, the speech, the text, etc., as discussed above is exemplary. In other implementations, the generation of the multimodal output data may further take other factors into account, e.g., scene-specific requirements; that is, the multimodal output data may be produced further based on scene-specific requirements. Taking scene-specific requirements into account enables the embodiments of the present disclosure to be adaptively applied in a variety of scenarios; e.g., multimodal output data suitable for a specific scenario may be adaptively output based on the output capabilities supported by different scenarios.
Scene-specific requirements may refer to the specific requirements of different application scenarios of the intelligent conversation subject. Scene-specific requirements may include, e.g., the types of multimodal output data supported, a predetermined speech-rate setting, a chat-mode setting, etc. associated with a specific scenario. In one aspect, different scenarios may have different data output capabilities; therefore, the types of multimodal output data supported by different scenarios may include outputting only one of speech, animation sequence and text, or outputting at least two of speech, animation sequence and text. For example, the intelligent animated character and virtual anchor scenarios require that the terminal device is at least able to support image and audio output, so the scene-specific requirements may indicate outputting one or more of speech, animation sequence and text. For example, a smart speaker scenario only supports audio output, so the scene-specific requirements may indicate outputting speech only. In one aspect, different scenarios may have different speech-rate preferences; therefore, the scene-specific requirements may include a predetermined speech-rate setting. For example, since in the intelligent animated character and virtual anchor scenarios the user can both watch the images and hear the speech, the speech rate may be set faster in order to express richer emotions. For example, in the smart speaker and smart in-vehicle assistant scenarios, the user can often only obtain, or only pay attention to, the speech output; therefore, the speech rate may be set slower, so that the user can clearly understand what the intelligent conversation entity intends to express through speech alone. In one aspect, different scenarios may have different chat-mode preferences; therefore, the scene-specific requirements may include a chat-mode setting. For example, in the smart in-vehicle assistant scenario, since the user may be driving a vehicle, the chit-chat output of the chat engine may be reduced in order not to distract the user too much. In addition, the chat-mode setting may also be associated with the captured images, the captured audio, the external environment data, and so on. For example, when the captured audio indicates that there is loud noise around the user, the speech output of the chat responses generated by the chat engine may be reduced. For example, when the external environment data indicates that the user is traveling fast, e.g., driving a vehicle at high speed, the chit-chat output of the chat engine may be reduced.
At 240, the multimodal output data may be produced based at least on the scene-specific requirements. For example, when the scene-specific requirements indicate that image output is not supported or that only speech output is supported, the generation of the animation sequence and the text may be skipped. For example, when the scene-specific requirements indicate a faster speech rate, the speech rate of the generated speech may be increased during the TTS conversion process. For example, when the scene-specific requirements indicate reducing chat response output under specific conditions, the generation of speech or text corresponding to the chat response text may be restricted.
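A minimal sketch of how such scene-specific requirements might gate the output stage is given below; the SceneProfile fields, the example profiles and the tts/animator callables are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class SceneProfile:
    supports_image: bool
    supports_audio: bool
    speech_rate: float        # 1.0 = normal, > 1.0 = faster
    allow_chitchat: bool

# Hypothetical profiles for the scenarios mentioned above.
SCENES = {
    "animated_character": SceneProfile(True,  True,  1.2, True),
    "smart_speaker":      SceneProfile(False, True,  0.9, True),
    "in_vehicle":         SceneProfile(False, True,  0.9, False),
}

def produce_output(scene: str, comment_text: str, chat_text: str, tts, animator):
    """Gate which modalities are generated according to the scene profile."""
    profile = SCENES[scene]
    output = {}
    if profile.supports_image:
        output["animation"] = animator(comment_text)   # animation sequence
        output["text"] = comment_text                  # visualized text
    if profile.supports_audio:
        output["comment_speech"] = tts(comment_text, rate=profile.speech_rate)
        if chat_text and profile.allow_chitchat:
            output["chat_speech"] = tts(chat_text, rate=profile.speech_rate)
    return output
```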
At 250, the multimodal output data may be provided. For example, the animation sequence, the text, etc. are displayed on a display screen, the speech is played through a loudspeaker, and so on.
It should be understood that the process 200 may be performed continuously, so as to continuously obtain multimodal input data and continuously provide multimodal output data.
FIG. 3 illustrates an example of an intelligent animated character scenario according to an embodiment. In the intelligent animated character scenario of FIG. 3, a user 310 may watch a video on a terminal device 320, and meanwhile an intelligent conversation entity according to an embodiment of the present disclosure may act as an intelligent animated character to accompany the user 310 in watching the video. The terminal device 320 may include, e.g., a display screen 330, a camera 322, a loudspeaker (not shown), a microphone (not shown), etc. A video 332 may be presented as the target content on the display screen 330. Furthermore, an avatar 334 of the intelligent conversation subject may also be presented on the display screen 330. The intelligent conversation subject may perform multimodality-based reactive response generation according to embodiments of the present disclosure, and accordingly the generated multimodality-based reactive responses may be provided on the terminal device 320 via the avatar 334. For example, in response to the content in the video 332, a chat session with the user 310, captured images and/or audio, obtained external environment data, etc., the avatar 334 may make facial expressions and body movements, utter speech, and so on.
FIG. 4 illustrates an exemplary process 400 of an intelligent animated character scenario according to an embodiment. The process 400 illustrates, e.g., the processing flow, data/information flow, etc. involved in the intelligent animated character scenario of FIG. 3. Furthermore, the process 400 may be regarded as a specific example of the process 200 in FIG. 2.
According to the process 400, multimodal input data may first be obtained, including at least one of, e.g., a video, external environment data, captured images, captured audio, a chat session, etc. The video, as the target content, may in turn include, e.g., images, audio, a bullet-chat file, etc. It should be understood that the obtained multimodal input data may be aligned in time and accordingly share the same context.
Information elements may be extracted from the multimodal input data. For example, character features, text, image light, objects, etc. are extracted from the images of the video; music, speech, etc. are extracted from the audio of the video; bullet-chat text is extracted from the bullet-chat file of the video; external environment information is extracted from the external environment data; object features are extracted from the captured images; music, speech, etc. are extracted from the captured audio; message text is extracted from the chat session; and so on.
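One conceivable way to organize this per-modality extraction is a simple dispatcher keyed by modality, as in the sketch below; the extractor callables are placeholders standing in for whatever vision, audio and text models an implementation actually uses, and all names are assumptions.

```python
from typing import Any, Callable, Dict, List

# Map each modality to a list of extractor callables (all hypothetical).
EXTRACTORS: Dict[str, List[Callable[[Any], dict]]] = {
    "video_image":    [],   # e.g. character features, OCR text, lighting, objects
    "video_audio":    [],   # e.g. music, speech transcription
    "bullet_file":    [],   # bullet-chat text
    "chat_session":   [],   # message text
    "captured_image": [],   # object features
    "captured_audio": [],   # speech and/or music
    "environment":    [],   # external environment information
}

# Example registration: pull the latest message text out of a chat session.
EXTRACTORS["chat_session"].append(
    lambda session: {"type": "message_text", "value": session[-1]["text"]})

def extract_information_elements(inputs: Dict[str, Any]) -> List[dict]:
    """Run every registered extractor on its modality and collect the elements."""
    elements = []
    for modality, payload in inputs.items():
        for extractor in EXTRACTORS.get(modality, []):
            elements.append(extractor(payload))
    return elements
```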
Reference information items may be generated based at least on the extracted information elements, including, e.g., at least one of an emotion tag, an animation tag, comment text and chat response text. The comment text may be generated through a comment generation model 430. The chat response text may be generated through a chat engine 450 and, optionally, an emotion transfer network 452.
Multimodal output data may be produced by using at least the generated reference information items, including, e.g., at least one of an animation sequence, comment speech, comment text, chat response speech, chat response text, etc. The animation sequence may be generated based on the description above in connection with FIG. 2. For example, animation selection 410 may be performed in the animation library by using the animation tag, the emotion tag, etc. so as to select an animation template, and animation sequence generation 420 may then be performed based on the selected animation template, so that the animation sequence is obtained through the time adaptation performed at animation sequence generation 420. The comment speech may be obtained by performing speech generation 440 (e.g., TTS conversion) on the comment text. The visualized comment text may be obtained based on the comment text. The chat response speech may be obtained by performing speech generation 460 (e.g., TTS conversion) on the chat response text. The visualized chat response text may be obtained based on the chat response text.
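By way of illustration only, the output stage of the process 400 might be glued together as in the following sketch, where animation_selection, animation_sequence_generation and speech_generation are hypothetical callables standing in for blocks 410, 420 and 440/460; none of these names or signatures is part of the disclosure.

```python
def generate_multimodal_output(reference, library, animation_selection,
                               animation_sequence_generation, speech_generation):
    """Sketch of blocks 410/420/440/460: select and adapt an animation, then run TTS."""
    output = {}
    if reference.get("animation_tag") or reference.get("emotion_tag"):
        template = animation_selection(library,
                                       reference.get("animation_tag"),
                                       reference.get("emotion_tag"))       # block 410
        output["animation_sequence"] = animation_sequence_generation(
            template, reference)                                           # block 420
    if reference.get("comment_text"):
        output["comment_speech"] = speech_generation(reference["comment_text"])       # block 440
        output["comment_text"] = reference["comment_text"]
    if reference.get("chat_response_text"):
        output["chat_response_speech"] = speech_generation(reference["chat_response_text"])  # block 460
        output["chat_response_text"] = reference["chat_response_text"]
    return output
```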
The produced multimodal output data may be provided on the terminal device. For example, the animation sequence, the comment text, the chat response text, etc. are presented on the display screen, and the comment speech, the chat response speech, etc. are played through the loudspeaker.
It should be understood that all the processing, data/information, etc. in the process 400 are exemplary; in practical applications, the process 400 may involve only one or more of such processing and data/information.
The multimodality-based reactive response generation according to embodiments of the present disclosure may be applied to perform a variety of tasks. In the following, only an exemplary intelligent animation generation task among these tasks is illustrated. It should be understood that the embodiments of the present disclosure are not limited to performing the intelligent animation generation task, but may also be used to perform various other tasks.
FIG. 5 illustrates an exemplary process 500 of intelligent animation generation according to an embodiment. The process 500 may be regarded as a specific implementation of the process 200 in FIG. 2. The intelligent animation generation of the process 500 is a specific application of the multimodality-based reactive response generation of the process 200. The intelligent animation generation of the process 500 may involve at least one of the generation of an animation sequence of the avatar, the generation of comment speech of the avatar, the generation of comment text, etc., performed in response to target content.
In the process 500, the multimodal input data obtaining step at 210 of FIG. 2 may be embodied as obtaining, at 510, at least one of an image, audio and a bullet-chat file of the target content.
In the process 500, the information element extraction step at 220 of FIG. 2 may be embodied as extracting, at 520, at least one information element from the image, audio and bullet-chat file of the target content. For example, character features, text, image light, objects, etc. are extracted from the image of the target content, music, speech, etc. are extracted from the audio of the target content, bullet-chat text is extracted from the bullet-chat file of the target content, and so on.
In the process 500, the reference information item generation step at 230 of FIG. 2 may be embodied as generating, at 530, at least one of an animation tag, an emotion tag and comment text. For example, the animation tag, the emotion tag, the comment text, etc. may be generated based at least on the at least one information element extracted at 520.
In the process 500, the multimodal output data production step at 240 of FIG. 2 may be embodied as producing, at 540, at least one of an animation sequence of the avatar, comment speech of the avatar and comment text by using at least one of the animation tag, the emotion tag and the comment text. Taking the animation sequence as an example, the animation sequence may be produced by using at least the animation tag and/or the emotion tag in the manner described above in connection with FIG. 2. Furthermore, the comment speech and the comment text may also be produced in the manner described above in connection with FIG. 2.
In the process 500, the multimodal output data providing step at 250 of FIG. 2 may be embodied as providing, at 550, at least one of the generated animation sequence, comment speech and comment text.
It should be understood that each step in the process 500 may be performed in a manner similar to that described above for the corresponding step in FIG. 2. Furthermore, the process 500 may also include any other processing described above for the process 200 of FIG. 2.
FIG. 6 illustrates a flowchart of an exemplary method 600 for multimodality-based reactive response generation according to an embodiment.
At 610, multimodal input data may be obtained.
At 620, at least one information element may be extracted from the multimodal input data.
At 630, at least one reference information item may be generated based at least on the at least one information element.
At 640, multimodal output data may be produced by using at least the at least one reference information item.
At 650, the multimodal output data may be provided.
In an implementation, the multimodal input data may include at least one of: an image of target content, audio of the target content, a bullet-chat file of the target content, a chat session, a captured image, captured audio, and external environment data.
Extracting at least one information element from the multimodal input data may include at least one of: extracting character features from the image of the target content; recognizing text from the image of the target content; detecting image light from the image of the target content; recognizing objects from the image of the target content; extracting music from the audio of the target content; extracting speech from the audio of the target content; extracting bullet-chat text from the bullet-chat file of the target content; extracting message text from the chat session; extracting object features from the captured image; extracting speech and/or music from the captured audio; and extracting external environment information from the external environment data.
In an implementation, generating at least one reference information item based at least on the at least one information element may include: generating at least one of an emotion tag, an animation tag, comment text and chat response text based at least on the at least one information element.
Generating the emotion tag based at least on the at least one information element may include: generating one or more emotion representations respectively corresponding to one or more information elements in the at least one information element; and generating the emotion tag based at least on the one or more emotion representations.
The emotion tag may indicate an emotion type and/or an emotion level.
Generating the animation tag based at least on the at least one information element may include: mapping the at least one information element to the animation tag according to predetermined rules.
The animation tag may indicate a facial expression type and/or a body movement type.
The animation tag may be generated further based on the emotion tag.
Generating the comment text based at least on the at least one information element may include: selecting the comment text from bullet-chat text of the target content.
The selecting of the comment text may include: determining, with a two-tower model, the matching degree between sentences in the bullet-chat text of the target content and the image and/or audio of the target content; and selecting the sentence with the highest matching degree from the bullet-chat text as the comment text.
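For illustration, a two-tower matcher of the kind referred to here could be sketched as follows; the encoder layers, the embedding sizes and the usage comments are assumptions made for this sketch, not the model actually trained in the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerMatcher(nn.Module):
    """Scores how well a bullet-chat sentence matches image/audio features."""

    def __init__(self, text_dim: int = 768, media_dim: int = 1024, shared_dim: int = 256):
        super().__init__()
        self.text_tower = nn.Sequential(nn.Linear(text_dim, shared_dim), nn.ReLU(),
                                        nn.Linear(shared_dim, shared_dim))
        self.media_tower = nn.Sequential(nn.Linear(media_dim, shared_dim), nn.ReLU(),
                                         nn.Linear(shared_dim, shared_dim))

    def forward(self, sentence_emb: torch.Tensor, media_emb: torch.Tensor) -> torch.Tensor:
        # Project both modalities into a shared space and compare by cosine similarity.
        t = F.normalize(self.text_tower(sentence_emb), dim=-1)
        m = F.normalize(self.media_tower(media_emb), dim=-1)
        return (t * m).sum(dim=-1)   # one matching score per sentence

# Selecting the best-matching sentence as the comment text:
# scores = matcher(sentence_embs, media_emb.unsqueeze(0))  # (N, 768) text vs (1, 1024) media
# comment_text = sentences[int(scores.argmax())]
```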
Generating the chat response text based at least on the at least one information element may include: generating, by a chat engine, the chat response text based at least on message text in the chat session.
The chat response text may be generated further based on the emotion tag.
The chat response text may be generated further based on an emotion representation from an emotion transfer network.
In an implementation, the at least one reference information item may be generated further based on at least one of: a scene-specific emotion; a preset personality of the intelligent conversation subject; and a preset role of the intelligent conversation subject.
In an implementation, the multimodal output data may include at least one of: an animation sequence of an avatar of the intelligent conversation subject; speech of the intelligent conversation subject; and text.
Producing the multimodal output data by using at least the at least one reference information item may include: generating speech and/or text corresponding to the comment text and/or the chat response text.
Producing the multimodal output data by using at least the at least one reference information item may include: selecting, with the animation tag and/or the emotion tag, a corresponding animation template from an animation library of the avatar of the intelligent conversation subject; and performing time adaptation on the animation template to form an animation sequence of the avatar of the intelligent conversation subject.
The time adaptation may include: adjusting the animation template to match the time sequence of the speech corresponding to the comment text and/or the chat response text.
In an implementation, the multimodal output data may be produced further based on scene-specific requirements.
The scene-specific requirements may include at least one of: outputting only one of speech, animation sequence and text; outputting at least two of speech, animation sequence and text; a predetermined speech-rate setting; and a chat-mode setting.
In an implementation, the multimodality-based reactive response generation may include intelligent animation generation. Obtaining the multimodal input data may include: obtaining at least one of an image, audio and a bullet-chat file of target content. Extracting at least one information element from the multimodal input data may include: extracting at least one information element from the image, audio and bullet-chat file of the target content. Generating at least one reference information item based at least on the at least one information element may include: generating at least one of an animation tag, an emotion tag and comment text based at least on the at least one information element. Producing the multimodal output data by using at least the at least one reference information item may include: producing at least one of an animation sequence of an avatar, comment speech of the avatar and comment text by using at least one of the animation tag, the emotion tag and the comment text. Providing the multimodal output data may include: providing at least one of the animation sequence, the comment speech and the comment text.
It should be understood that the method 600 may also include any steps/processes for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
FIG. 7 illustrates an exemplary apparatus 700 for multimodality-based reactive response generation according to an embodiment.
The apparatus 700 may include: a multimodal input data obtaining module 710 for obtaining multimodal input data; a data integration processing module 720 for extracting at least one information element from the multimodal input data; a scene logic processing module 730 for generating at least one reference information item based at least on the at least one information element; a multimodal output data generation module 740 for producing multimodal output data by using at least the at least one reference information item; and a multimodal output data providing module 750 for providing the multimodal output data.
Furthermore, the apparatus 700 may also include any other modules that perform the steps of the methods for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
FIG. 8 illustrates an exemplary apparatus 800 for multimodality-based reactive response generation according to an embodiment.
The apparatus 800 may include: at least one processor 810; and a memory 820 storing computer-executable instructions. When the computer-executable instructions are executed, the at least one processor 810 may perform any steps/processes of the methods for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure propose a multimodality-based reactive response generation system, including: a multimodal data input interface for obtaining multimodal input data; a core processing unit configured to extract at least one information element from the multimodal input data, generate at least one reference information item based at least on the at least one information element, and produce multimodal output data by using at least the at least one reference information item; and a multimodal data output interface for providing the multimodal output data. Furthermore, the multimodal data input interface, the core processing unit and the multimodal data output interface may also perform any relevant steps/processes of the methods for multimodality-based reactive response generation according to the embodiments of the present disclosure described above. Furthermore, the multimodality-based reactive response generation system may also include any other units and modules for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure propose a computer program product for multimodality-based reactive response generation, including a computer program that is run by at least one processor to perform any steps/processes of the methods for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any steps/processes of the methods for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
It should be understood that all the operations in the methods described above are exemplary only, and the present disclosure is not limited to any operation in the methods or the order of such operations, but should cover all other equivalent transformations under the same or similar concepts.
In addition, unless otherwise specified or clear from the context that a singular form is intended, the articles "a" and "an" as used in this specification and the appended claims should generally be construed to mean "one" or "one or more".
It should also be understood that all the modules in the apparatuses described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. Furthermore, any of these modules may be functionally further divided into sub-modules or combined together.
Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gate logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or another suitable platform.
Software should be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. Software may reside on a computer-readable medium. A computer-readable medium may include, e.g., a memory, which may be, e.g., a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disc, a smart card, a flash memory device, a random access memory (RAM), a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, the memory may also be located inside the processor (e.g., a cache or a register).
The above description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Therefore, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described in this disclosure that are known, or will later come to be known, to those skilled in the art are intended to be covered by the claims.

Claims (26)

  1. A method for multimodality-based reactive response generation, comprising:
    obtaining multimodal input data;
    extracting at least one information element from the multimodal input data;
    generating at least one reference information item based at least on the at least one information element;
    producing multimodal output data by using at least the at least one reference information item; and
    providing the multimodal output data.
  2. The method of claim 1, wherein the multimodal input data comprises at least one of:
    an image of target content, audio of the target content, a bullet-chat file of the target content, a chat session, a captured image, captured audio, and external environment data.
  3. The method of claim 2, wherein extracting at least one information element from the multimodal input data comprises at least one of:
    extracting character features from the image of the target content;
    recognizing text from the image of the target content;
    detecting image light from the image of the target content;
    recognizing objects from the image of the target content;
    extracting music from the audio of the target content;
    extracting speech from the audio of the target content;
    extracting bullet-chat text from the bullet-chat file of the target content;
    extracting message text from the chat session;
    extracting object features from the captured image;
    extracting speech and/or music from the captured audio; and
    extracting external environment information from the external environment data.
  4. The method of claim 1, wherein generating at least one reference information item based at least on the at least one information element comprises:
    generating at least one of an emotion tag, an animation tag, comment text, and chat response text based at least on the at least one information element.
  5. The method of claim 4, wherein generating the emotion tag based at least on the at least one information element comprises:
    generating one or more emotion representations respectively corresponding to one or more information elements in the at least one information element; and
    generating the emotion tag based at least on the one or more emotion representations.
  6. The method of claim 5, wherein
    the emotion tag indicates an emotion type and/or an emotion level.
  7. The method of claim 4, wherein generating the animation tag based at least on the at least one information element comprises:
    mapping the at least one information element to the animation tag according to predetermined rules.
  8. The method of claim 7, wherein
    the animation tag indicates a facial expression type and/or a body movement type.
  9. The method of claim 7, wherein
    the animation tag is generated further based on the emotion tag.
  10. The method of claim 4, wherein generating the comment text based at least on the at least one information element comprises:
    selecting the comment text from bullet-chat text of the target content.
  11. The method of claim 10, wherein the selecting of the comment text comprises:
    determining, with a two-tower model, a matching degree between sentences in the bullet-chat text of the target content and the image and/or audio of the target content; and
    selecting a sentence with the highest matching degree from the bullet-chat text as the comment text.
  12. The method of claim 4, wherein generating the chat response text based at least on the at least one information element comprises:
    generating, by a chat engine, the chat response text based at least on message text in the chat session.
  13. The method of claim 12, wherein
    the chat response text is generated further based on the emotion tag.
  14. The method of claim 12, wherein
    the chat response text is generated further based on an emotion representation from an emotion transfer network.
  15. The method of claim 1, wherein the at least one reference information item is generated further based on at least one of:
    a scene-specific emotion;
    a preset personality of an intelligent conversation subject; and
    a preset role of the intelligent conversation subject.
  16. The method of claim 1, wherein the multimodal output data comprises at least one of:
    an animation sequence of an avatar of an intelligent conversation subject;
    speech of the intelligent conversation subject; and
    text.
  17. The method of claim 4, wherein producing the multimodal output data by using at least the at least one reference information item comprises:
    generating speech and/or text corresponding to the comment text and/or the chat response text.
  18. The method of claim 4, wherein producing the multimodal output data by using at least the at least one reference information item comprises:
    selecting, with the animation tag and/or the emotion tag, a corresponding animation template from an animation library of an avatar of an intelligent conversation subject; and
    performing time adaptation on the animation template to form an animation sequence of the avatar of the intelligent conversation subject.
  19. The method of claim 18, wherein the time adaptation comprises:
    adjusting the animation template to match a time sequence of speech corresponding to the comment text and/or the chat response text.
  20. The method of claim 1, wherein
    the multimodal output data is produced further based on scene-specific requirements.
  21. The method of claim 20, wherein the scene-specific requirements comprise at least one of:
    outputting only one of speech, an animation sequence and text;
    outputting at least two of speech, an animation sequence and text;
    a predetermined speech-rate setting; and
    a chat-mode setting.
  22. The method of claim 1, wherein
    obtaining multimodal input data comprises: obtaining at least one of an image, audio and a bullet-chat file of target content,
    extracting at least one information element from the multimodal input data comprises: extracting at least one information element from the image, audio and bullet-chat file of the target content,
    generating at least one reference information item based at least on the at least one information element comprises: generating at least one of an animation tag, an emotion tag and comment text based at least on the at least one information element,
    producing multimodal output data by using at least the at least one reference information item comprises: producing at least one of an animation sequence of an avatar, comment speech of the avatar and comment text by using at least one of the animation tag, the emotion tag and the comment text, and
    providing the multimodal output data comprises: providing at least one of the animation sequence, the comment speech and the comment text.
  23. A multimodality-based reactive response generation system, comprising:
    a multimodal data input interface for obtaining multimodal input data;
    a core processing unit configured to: extract at least one information element from the multimodal input data; generate at least one reference information item based at least on the at least one information element; and produce multimodal output data by using at least the at least one reference information item; and
    a multimodal data output interface for providing the multimodal output data.
  24. An apparatus for multimodality-based reactive response generation, comprising:
    at least one processor; and
    a memory storing computer-executable instructions that, when executed, cause the at least one processor to perform the steps of the method of any one of claims 1 to 21.
  25. An apparatus for multimodality-based reactive response generation, comprising:
    a multimodal input data obtaining module for obtaining multimodal input data;
    a data integration processing module for extracting at least one information element from the multimodal input data;
    a scene logic processing module for generating at least one reference information item based at least on the at least one information element;
    a multimodal output data generation module for producing multimodal output data by using at least the at least one reference information item; and
    a multimodal output data providing module for providing the multimodal output data.
  26. A computer program product for multimodality-based reactive response generation, comprising a computer program that is run by at least one processor to perform the steps of the method of any one of claims 1 to 21.
PCT/CN2022/093766 2021-05-19 2022-05-19 Multimodal based reactive response generation WO2022242706A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110545116.3A CN113238654A (en) 2021-05-19 2021-05-19 Multi-modal based reactive response generation
CN202110545116.3 2021-05-19

Publications (1)

Publication Number Publication Date
WO2022242706A1 true WO2022242706A1 (en) 2022-11-24

Family

ID=77137616

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/093766 WO2022242706A1 (en) 2021-05-19 2022-05-19 Multimodal based reactive response generation

Country Status (2)

Country Link
CN (1) CN113238654A (en)
WO (1) WO2022242706A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113238654A (en) * 2021-05-19 2021-08-10 宋睿华 Multi-modal based reactive response generation
CN113744369A (en) * 2021-09-09 2021-12-03 广州梦映动漫网络科技有限公司 Animation generation method, system, medium and electronic terminal
CN115658935B (en) * 2022-12-06 2023-05-02 北京红棉小冰科技有限公司 Personalized comment generation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106531162A (en) * 2016-10-28 2017-03-22 北京光年无限科技有限公司 Man-machine interaction method and device used for intelligent robot
CN110267052A (en) * 2019-06-19 2019-09-20 云南大学 A kind of intelligent barrage robot based on real-time emotion feedback
CN112379780A (en) * 2020-12-01 2021-02-19 宁波大学 Multi-mode emotion interaction method, intelligent device, system, electronic device and medium
CN113238654A (en) * 2021-05-19 2021-08-10 宋睿华 Multi-modal based reactive response generation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107831905A (en) * 2017-11-30 2018-03-23 北京光年无限科技有限公司 A kind of virtual image exchange method and system based on line holographic projections equipment

Also Published As

Publication number Publication date
CN113238654A (en) 2021-08-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22804026

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE