WO2022242706A1 - Multimodal based reactive response generation - Google Patents

Multimodal based reactive response generation Download PDF

Info

Publication number
WO2022242706A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
multimodal
chat
animation
emotion
Prior art date
Application number
PCT/CN2022/093766
Other languages
French (fr)
Chinese (zh)
Inventor
宋睿华
杜涛
Original Assignee
宋睿华
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 宋睿华 filed Critical 宋睿华
Publication of WO2022242706A1 publication Critical patent/WO2022242706A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Definitions

  • Intelligent human-computer interaction systems have been widely used in more and more scenarios and fields, and can effectively improve the efficiency and experience of human-computer interaction.
  • With the development of artificial intelligence (AI) technology, human-computer interaction systems have also achieved more in-depth development in aspects such as intelligent conversation systems.
  • The intelligent conversation system has covered application scenarios such as task-oriented dialogue, knowledge question answering, and open-domain dialogue, and can be realized using template-based, retrieval-based, and deep-learning-based technologies.
  • Multimodal input data can be obtained.
  • At least one information element may be extracted from the multimodal input data.
  • At least one reference information item may be generated based at least on the at least one information element.
  • The at least one reference information item may be used at least to generate multimodal output data.
  • The multimodal output data may be provided.
  • FIG. 1 illustrates an exemplary architecture of a multimodality-based reactive response generation system according to an embodiment.
  • FIG. 2 illustrates an exemplary process for multimodality-based reactive response generation, according to an embodiment.
  • FIG. 3 illustrates an example of an intelligent animated character scenario according to an embodiment.
  • FIG. 4 illustrates an exemplary process for an intelligent animated character scenario according to an embodiment.
  • FIG. 5 illustrates an exemplary process for intelligent animation generation according to an embodiment.
  • FIG. 6 shows a flowchart of an exemplary method for multimodality-based reactive response generation, according to an embodiment.
  • FIG. 7 illustrates an exemplary apparatus for multimodality-based reactive response generation, according to an embodiment.
  • FIG. 8 illustrates an exemplary apparatus for multimodality-based reactive response generation, according to an embodiment.
  • Existing human-computer interaction systems usually use a single medium as the channel for information input and output; for example, humans and machines, or machines and machines, communicate through only one of text, voice, or gestures.
  • Taking the intelligent conversation system as an example, although it can be oriented to text or speech, it still takes text processing or text analysis as its core.
  • The intelligent conversation system does not consider information beyond the text, such as the facial expressions and body movements of the interaction objects, nor factors such as sound and light in the environment, which leads to common problems in the interaction process.
  • One such problem is that the understanding of information is not comprehensive and accurate enough.
  • The embodiments of the present disclosure propose a multimodal-based reactive response generation scheme, which can be implemented on a variety of intelligent conversation subjects and can be widely used in various scenarios involving human-computer interaction.
  • Intelligent conversation subjects can broadly refer to AI product forms that can generate and present information content and provide interactive functions in specific application scenarios, such as chat robots, intelligent animated characters, virtual anchors, intelligent car-machine assistants, smart customer service, smart speakers, etc.
  • an intelligent conversational agent may generate multimodal output data based on multimodal input data, wherein the multimodal output data is a response generated in a reactive manner to be presented to the user.
  • the embodiments of the present disclosure propose a multimodal human-computer interaction method.
  • interaction can broadly refer to the understanding and expression of information, data, content, etc.
  • Human-computer interaction can broadly refer to the interaction between the intelligent conversation subject and its interaction objects, for example, interaction between the intelligent conversation subject and human users, interaction between intelligent conversation subjects, responses of the intelligent conversation subject to various media content or informational data, and so on.
  • the embodiments of the present disclosure have various advantages. In one aspect, more accurate information understanding can be achieved.
  • By processing multimodal input data including media content, collected images or audio, chat sessions, and external environment data, information can be collected and analyzed more comprehensively, misunderstandings caused by missing information can be reduced, and the deep-level intent of interaction objects can be understood more accurately.
  • In another aspect, expression is more efficient.
  • By superimposing and expressing information in multiple ways and multiple modalities, for example, superimposing facial expressions and/or body movements of an avatar or other animation sequences on top of speech or text, information and emotions can be expressed more efficiently.
  • the interactive behavior of the intelligent conversation subject will be more vivid.
  • the understanding and expression of multimodal data will make the subject of intelligent conversation more anthropomorphic, thereby significantly improving user experience.
  • The embodiments of the present disclosure can enable the intelligent conversation subject to imitate human beings in generating natural responses to multimodal input data such as speech, text, music, and video images, i.e., to make reactive responses.
  • The reactive response of the intelligent conversation subject is not limited to, for example, responses to chat messages from the user, but also covers spontaneous responses to various input data such as media content, captured images or audio, the external environment, etc.
  • Taking as an example a scenario in which the intelligent conversation subject acts as an intelligent animated character to provide AI companionship, assuming that the intelligent conversation subject can accompany the user in watching videos through a corresponding avatar, the intelligent conversation subject can not only interact directly with the user, but also respond spontaneously and reactively to the content in the video; for example, the avatar can speak, make facial expressions, make body movements, present text, etc.
  • the behavior of the intelligent conversation subject will be more anthropomorphic.
  • Embodiments of the present disclosure propose a general multimodal-based reactive response generation technology.
  • intelligent conversation subjects can efficiently and quickly obtain multimodal interaction capabilities.
  • multimodal-based reactive response generation technology according to the embodiments of the present disclosure, multimodal input data from various media channels can be integrated and processed, and the intent expressed by the multimodal input data can be interpreted more accurately and effectively.
  • The intelligent conversation subject can provide multimodal output data through multiple channels to express overall consistent information, thereby improving the accuracy and efficiency of information expression, making the information expression of the intelligent conversation subject more vivid and interesting, and thus significantly improving the user experience.
  • The multimodality-based reactive response generation technology can be adaptively applied to various scenarios. Based on the input and output capabilities supported by different scenarios, embodiments of the present disclosure can obtain corresponding multimodal input data in different scenarios and output multimodal output data suitable for the specific scenario. Taking as an example a scenario in which an intelligent conversation subject acting as an intelligent animated character automatically generates animations, embodiments of the present disclosure may generate reactive responses including, for example, animation sequences for the avatar of the intelligent animated character. For example, when the intelligent animated character is used to accompany the user in watching a video, the intelligent animated character can perform deep perception and understanding of multimodal input data from the video content, collected images or audio, chat sessions, external environment data, etc., and respond accordingly in an intelligent and dynamic manner through multiple modalities such as speech, text, and animation sequences including facial expressions and/or body movements, so as to achieve a comprehensive, efficient, and vivid human-computer interaction experience.
  • In this way, the perception and emotional expression abilities of the intelligent animated character are greatly enhanced, and the intelligent animated character becomes more anthropomorphic. This can also serve as the technical basis for content creation, such as intelligent animation, through AI technology.
  • the chat robot can chat with the user in forms such as voice, text, video, etc.
  • The multimodal input data processed by the embodiments of the present disclosure can include, for example, chat sessions, collected images or audio, external environment data, etc.
  • the multimodal output data provided may include, for example, voice, text, animation sequences, etc.
  • The multimodal input data processed by the embodiments of the present disclosure can include, for example, played media content, external environment data, etc.
  • the provided multimodal output data may include, for example, voice, text, animation sequences of avatars, and the like.
  • The smart car-machine assistant can provide assistance or companionship while the user is driving a vehicle; in this case, the multimodal input data processed by the embodiments of the present disclosure may include, for example, chat sessions, collected images or audio, external environment data, etc.
  • the provided multimodal output data may include, for example, voice, text, and the like.
  • The multimodal input data processed by the embodiments of the present disclosure can include, for example, chat sessions, external environment data, etc.
  • the multimodal output data provided may include, for example, voice, text, animation, etc.
  • the voice assistant or chat robot in the smart speaker can interact with the user, play audio content, etc.
  • The multimodal input data processed by the embodiments of the present disclosure can include, for example, played audio content, chat sessions, collected audio, external environment data, etc.
  • the provided multimodal output data may include, for example, voice and the like.
  • FIG. 1 shows an exemplary architecture of a multi-modality-based reactive response generation system 100 according to an embodiment.
  • the system 100 can support the intelligent conversation subject to make multimodal-based reactive responses in different scenarios.
  • An intelligent conversational subject may be implemented or resident on an end device or any user-accessible device or platform.
  • The system 100 may include a multimodal data input interface 110 for obtaining multimodal input data.
  • the multimodal data input interface 110 can collect various types of input data from various data sources.
  • The multimodal data input interface 110 may collect data such as images, audio, and bullet chat files of the target content.
  • the target content may broadly refer to various media content played on a device or presented to a user, for example, video content, audio content, picture content, text content, and the like.
  • the multimodal data input interface 110 can obtain input data about the chat conversation.
  • the multimodal data input interface 110 may collect images and/or audio around the user through a camera and/or a microphone on the terminal device.
  • the multimodal data input interface 110 can also obtain external environment data from a third-party application or any other information source.
  • the external environment data may broadly refer to various environmental parameters in the real world where the terminal device or the user is located, for example, data about weather, temperature, humidity, travel speed, and the like.
  • the multimodal data input interface 110 may provide the obtained multimodal input data 112 to the core processing unit 120 in the system 100 .
  • the core processing unit 120 provides various core processing capabilities required for reactive response generation. Based on the processing stage and type, the core processing unit 120 may further include multiple processing modules, for example, a data integration processing module 130, a scene logic processing module 140, a multimodal output data generation module 150, and the like.
  • the data integration processing module 130 can extract different types of multi-modal information from the multi-modal input data 112 , and the extracted multi-modal information can be in the same context under specific scenarios and time sequence conditions.
  • the data integration processing module 130 can extract one or more information elements 132 from the multimodal input data 112 .
  • information elements can broadly refer to computer-understandable information or information representations extracted from raw data.
  • the data integration processing module 130 may extract information elements from the target content included in the multimodal input data 112, for example, extract information elements from images, audio, bullet chat files, etc. of the target content.
  • the information elements extracted from the image of the target content may include, for example, character features, text, image light, objects, etc.
  • the information elements extracted from the audio of the target content may include, for example, music, voice, etc.
  • The information elements extracted from the bullet chat file of the target content may include, for example, bullet chat text and the like.
  • music may broadly refer to song singing, instrumental performance, or a combination thereof
  • speech may broadly refer to the sound of speech.
  • data integration processing module 130 may extract informational elements, such as message text, from chat sessions included in multimodal input data 112 .
  • the data integration processing module 130 can extract information elements, such as object features, from the captured images included in the multimodal input data 112 .
  • the data integration processing module 130 may extract information elements such as speech, music, etc. from the collected audio included in the multimodal input data 112 . In one aspect, the data integration processing module 130 may extract information elements such as external environment information from the external environment data included in the multimodal input data 112 .
  • the scene logic processing module 140 may generate one or more reference information items 142 based at least on the information elements 132 .
  • a reference information item may broadly refer to various guiding information generated based on various information elements for reference by the system 100 when generating multimodal output data.
  • the reference information item 142 can include an emotion tag that can guide the emotion that the multimodal output data is presented or based on.
  • the reference information item 142 may include an animation tag, which may be used to select the animation to be presented where the multimodal output data is to include an animation sequence.
  • the reference information item 142 may include comment text, and the comment text may be, for example, a comment on the target content, so as to express the intelligent conversation subject's own opinion or evaluation on the target content.
  • The reference information item 142 may include chat response text, which may be a response to message text from a chat session. It should be understood that, optionally, the scene logic processing module 140 may also consider other factors in the process of generating the reference information items 142, for example, a scene-specific emotion, a preset personality of the intelligent conversation subject, a preset role of the intelligent conversation subject, etc.
  • the multimodal output data generation module 150 may utilize at least the reference information item 142 to generate the multimodal output data 152 .
  • the multimodal output data 152 may include various types of output data, such as speech, text, animation sequences, and the like.
  • the voice included in the multimodal output data 152 may be, for example, the voice corresponding to the comment text or the chat response text
  • the text included in the multimodal output data 152 may be, for example, the text corresponding to the comment text or the chat response text
  • the animation sequence included in the multimodal output data 152 may be, for example, an animation sequence of an avatar of the intelligent conversation subject. It should be understood that, optionally, the multimodal output data generation module 150 may also consider more other factors during the process of generating the multimodal output data 152 , for example, scene-specific requirements and the like.
  • System 100 may include a multimodal data output interface 160 for providing multimodal output data 152 .
  • the multimodal data output interface 160 may support providing or presenting multiple types of output data to a user.
  • the multimodal data output interface 160 can present text, animation sequences, etc. via a display screen, and can play voice, etc. via a speaker.
  • the architecture of the multimodal-based reactive response generation system 100 described above is only exemplary, and the system 100 may include more or less component units or modules according to actual application requirements and designs.
  • the system 100 may be implemented by hardware, software or a combination thereof.
  • The multimodal data input interface 110, the core processing unit 120, and the multimodal data output interface 160 may be units implemented based on hardware; for example, the core processing unit 120 may be implemented by a hardware processing unit with computing capability, and the multimodal data input interface 110 and the multimodal data output interface 160 may be implemented by a hardware interface unit with data input/output capability.
  • the units or modules included in the system 100 may also be implemented by software or programs, so these units or modules may be software units or software modules.
  • the units and modules included in the system 100 may be implemented at the terminal device, or may be implemented at the network device or platform, or may be partially implemented at the terminal device while the other part is implemented at the network device or platform.
  • FIG. 2 illustrates an exemplary process 200 for multimodality-based reactive response generation, according to an embodiment.
  • the steps or processes in the process 200 may be performed by, for example, corresponding units or modules in the multi-modality-based reactive response generation system in FIG. 1 .
  • multimodal input data 212 can be obtained.
  • The multimodal input data 212 may include, for example, at least one of images of the target content, audio of the target content, bullet chat files of the target content, chat sessions, collected images, collected audio, external environment data, and the like.
  • Images, audio, and bullet chat files of the target content can be obtained at 210.
  • data about the chat session can be obtained at 210, which includes chat records in the chat session and the like.
  • multimodal input data 212 is not limited to the exemplary input data described above.
  • one or more informational elements 222 may be extracted from the multimodal input data 212 .
  • corresponding information elements may be extracted from these input data, respectively.
  • Where the multimodal input data 212 includes images of the target content, character features may be extracted from the images of the target content.
  • Taking the target content as an example of a concert video played on a terminal device, various character features of the singer can be extracted from the images of the video, such as facial expressions, body movements, clothing colors, and the like. It should be understood that the embodiments of the present disclosure are not limited to any specific character feature extraction technology.
  • text may be identified from the image of the target content.
  • text may be recognized from an image by a text recognition technique such as optical character recognition (OCR).
  • Some images in this video may contain music information, such as song title, lyricist, composer, singer, performer, etc., so this music information can be obtained through text recognition.
  • the embodiments of the present disclosure are not limited to recognizing text by OCR technology, but any other text recognition technology may be used.
  • the text recognized from the image of the target content is not limited to music information, and may also include any other text indicating information related to events occurring in the image, such as subtitles, lyrics, etc.
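  • As a purely illustrative sketch of the text recognition described above (the disclosure is not limited to any particular OCR technology), the following example uses the open-source Tesseract engine via pytesseract; the frame path and language codes are assumptions.

```python
# Hypothetical sketch: recognizing music information or subtitles from a single
# image of the target content using the open-source Tesseract OCR engine (one
# possible choice, not mandated by the disclosure). Path and language are assumed.
from PIL import Image
import pytesseract

def recognize_text_from_frame(frame_path: str) -> str:
    """Return all text recognized in one image of the target content."""
    image = Image.open(frame_path)
    # "chi_sim+eng" assumes Chinese/English captions such as song title or lyrics.
    return pytesseract.image_to_string(image, lang="chi_sim+eng")

if __name__ == "__main__":
    text = recognize_text_from_frame("concert_frame_001.png")  # hypothetical frame
    print(text)  # may contain song title, lyricist, composer, singer, subtitles, etc.
```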
  • Image light may be detected from the image of the target content.
  • Image light may refer to the characteristics of ambient light in the picture presented by the image, for example, bright, dim, gloomy, flickering, and the like.
  • the stage at the concert site may use bright lights, so it can be detected from these images that the image light is bright. It should be understood that embodiments of the present disclosure are not limited to any particular image light detection technique.
  • objects may be identified from the images of the target content.
  • the identified object may be, for example, a representative object in the image, an object appearing in a prominent or important position in the image, an object associated with a person in the image, etc.
  • The identified object may include props, background furnishings, etc.
  • the target content as an example of a concert video, assuming that the singer is playing a guitar while singing a song, the object "guitar" can be identified from the image. It should be understood that embodiments of the present disclosure are not limited to any particular object recognition technology.
  • music may be extracted from the audio of the target content.
  • the target content itself may be audio, for example, a song played to the user on the terminal device, and correspondingly, the music corresponding to the song may be extracted from the audio.
  • the target content may also be a video, such as a concert video, and accordingly, music may be extracted from the audio contained in the video.
  • music may broadly include, for example, musical pieces played by musical instruments, songs sung by singers, special effects sounds produced by special equipment or voice actors, and the like.
  • the extracted music may be background music, foreground music, or the like.
  • music extraction may broadly refer to, for example, obtaining sound files, sound wave data, etc. corresponding to music. It should be understood that embodiments of the present disclosure are not limited to any particular music extraction technique.
  • speech may be extracted from the audio of the target content.
  • speech may refer to the sound of speech.
  • the target content includes conversations, speeches, comments, etc. of people or characters
  • the corresponding voice can be extracted from the audio of the target content.
  • Speech extraction may broadly refer to, for example, obtaining sound files, sound wave data, etc. corresponding to speech. It should be understood that embodiments of the present disclosure are not limited to any specific speech extraction technology.
  • the bullet chatting text may be extracted from the bullet chatting file of the target content.
  • Some video playback applications or playback platforms allow different viewers of a video to send their own comments, feelings, etc. in the form of bullet chats, and these comments, feelings, etc. can be included as bullet chat text in a bullet chat file attached to the video; therefore, the bullet chat text can be extracted from the bullet chat file. It should be understood that the embodiments of the present disclosure are not limited to any specific bullet chat text extraction technology.
  • message text may be extracted from the chat sessions.
  • the message text may include, for example, the text of a chat message sent by the intelligent conversation subject, the text of a chat message sent by at least one other chat participant, and the like.
  • Where the chat session is carried out in the form of text, the message text can be directly extracted from the chat session; where the chat session is in the form of voice, the voice messages in the chat session can be converted into message text. It should be understood that the embodiments of the present disclosure are not limited to any specific message text extraction technology.
  • object features may be extracted from the acquired images.
  • Object features may broadly refer to various characteristics of objects appearing in captured images, and the objects may include, for example, people, objects, and the like.
  • Various features about the user, such as facial expressions and body movements, can be extracted from the image.
  • various features such as vehicles in front, traffic signs, roadside buildings, etc. may be extracted from the image.
  • the embodiments of the present disclosure are not limited to extracting the above exemplary object features from the collected images, but can also extract any other object features.
  • the embodiments of the present disclosure are not limited to any specific object feature extraction technique.
  • Where the multimodal input data 212 includes collected audio, speech and/or music may be extracted from the collected audio. Similar to the above-described manner of extracting speech, music, etc. from the audio of the target content, speech, music, etc. may be extracted from the collected audio.
  • external environment information may be extracted from the external environment data.
  • specific weather information may be extracted from data on weather
  • specific temperature information may be extracted from data on temperature
  • specific speed information may be extracted from data on travel speed, and so on. It should be understood that the embodiments of the present disclosure are not limited to any specific external environment information extraction technology.
  • the above-described information elements extracted from the multimodal input data 212 are exemplary, and embodiments of the present disclosure may also extract any other types of information elements.
  • The extracted information elements can be in the same context under specific scenario and timing conditions. For example, these information elements can be aligned in time, and accordingly, different combinations of information elements can be extracted at different time points.
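  • A minimal sketch of how extracted information elements might be grouped into time-aligned combinations as described above; the field names and window size are illustrative assumptions rather than structures required by the disclosure.

```python
# Illustrative sketch (field names are assumptions): grouping information
# elements extracted from different modalities into time-aligned combinations.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class InformationElement:
    modality: str        # e.g. "target_image", "target_audio", "chat", "environment"
    kind: str            # e.g. "character_feature", "music", "message_text"
    value: object        # the extracted information representation
    timestamp: float     # seconds since the start of the session or target content

def group_by_time(elements: List[InformationElement],
                  window: float = 1.0) -> Dict[int, List[InformationElement]]:
    """Bucket elements into time windows so that each bucket holds the
    combination of information elements sharing the same context."""
    buckets: Dict[int, List[InformationElement]] = {}
    for e in elements:
        buckets.setdefault(int(e.timestamp // window), []).append(e)
    return buckets
```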
  • one or more reference information items 232 may be generated based at least on information elements 222 .
  • the reference information item 232 generated at 230 may include a sentiment tag.
  • Sentiment tags may indicate, for example, emotion type, emotion level, and the like.
  • Embodiments of the present disclosure may encompass any number of predetermined emotion types, and any number of emotion levels defined for each emotion type.
  • Exemplary emotion types may include, for example, happiness, sadness, anger, etc.
  • exemplary emotion levels may include level 1, level 2, level 3, etc. according to the intensity of emotion from low to high.
  • If the emotion tag <happy, level 2> is determined at 230, it indicates that the information elements 222 express the emotion of happiness as a whole and the emotion level is a medium level of level 2.
  • The exemplary emotion types, exemplary emotion levels, and their expressions are given above only for convenience of explanation; the embodiments of the present disclosure may also adopt more or fewer emotion types, any other emotion types and emotion levels, and any other form of expression.
  • the emotions expressed by each information element can be determined first, and then these emotions can be considered comprehensively to determine the final emotion type and emotion level.
  • one or more emotion representations respectively corresponding to one or more information elements in the information elements 222 may be generated first, and then a final emotion label is generated based at least on these emotion representations.
  • the emotion representation may refer to an informational representation of emotion, which may take the form of, for example, an emotion vector, an emotion label, and the like.
  • the emotion vector may include multiple dimensions for expressing emotion distribution, each dimension corresponds to an emotion type, and the value on each dimension indicates the prediction probability or weight of the corresponding emotion type.
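  • For concreteness, an emotion vector of the kind described above might look like the following sketch; the particular emotion types and values are assumptions used only for illustration.

```python
# Illustrative emotion vector (types and values are assumptions): one dimension
# per predefined emotion type, each value being the predicted probability or
# weight of that type for a given information element.
import numpy as np

EMOTION_TYPES = ["happiness", "sadness", "anger"]     # example emotion types only
emotion_vector = np.array([0.7, 0.2, 0.1])            # a distribution over the types

dominant = EMOTION_TYPES[int(np.argmax(emotion_vector))]
print(dominant, float(emotion_vector.max()))          # happiness 0.7
```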
  • a pre-trained machine learning model may be used to generate an emotion representation corresponding to the character feature.
  • a convolutional neural network model for facial emotion recognition can be used to predict the corresponding emotional representation.
  • the convolutional neural network model can also be trained to comprehensively consider other features that may be included in the character features, such as body movements, to predict emotional representation. It should be understood that the embodiments of the present disclosure are not limited to any specific technology for determining the emotional expression corresponding to the character's characteristics.
  • The emotion information corresponding to the music can be retrieved from a pre-established music database based on the music information, so as to form an emotion representation.
  • the music database may include music information of a large amount of music collected in advance and corresponding emotional information, music type, background knowledge, chat corpus, and the like.
  • the music database can be indexed according to various music information such as song name, singer, performer, etc., so that emotional information corresponding to specific music can be found from the music database based on the music information.
  • music genres found from music databases can also be used to form emotion representations.
  • a pre-trained machine learning model may be used to generate an emotion representation corresponding to the subtitle.
  • the machine learning model may be, for example, an emotion classification model based on a convolutional neural network. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to text recognized from an image of target content.
  • the emotion representation corresponding to the object may be determined based on a pre-established machine learning model or a pre-set heuristic rule.
  • Objects in an image can also help express emotion. For example, if the image shows that a number of red ornaments are arranged on the stage to enhance the atmosphere, these red ornaments recognized from the image may help to determine emotions such as joy. It should be appreciated that embodiments of the present disclosure are not limited to any particular technique for determining an emotion representation corresponding to an object recognized from an image of the target content.
  • the emotional representation corresponding to the music may be determined or generated in a number of ways. In one manner, if the music information has been recognized, the emotion information corresponding to the music may be found from a music database based on the music information, so as to form an emotion expression. In one manner, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the music based on various music features extracted from the music.
  • Music features can include the audio average energy (AE) of the music, which may be computed as $AE = \frac{1}{N}\sum_{t=1}^{N} x(t)^{2}$, where x is the discrete audio input signal, t is the time index, and N is the number of input samples of x.
  • Musical features may also include rhythmic features extracted from music represented by the number of beats and/or the distribution of beat intervals.
  • the music feature may also include the aforementioned emotional information corresponding to the music obtained by using the music information.
  • the machine learning model can be trained based on the above one or more music features, so that the trained machine learning model can predict the emotional expression of music. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to music extracted from audio of target content.
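  • The following sketch illustrates how the music features mentioned above, the audio average energy and simple beat-interval statistics, could be computed; librosa is an assumed tool choice and the audio path is hypothetical, since the disclosure does not mandate any particular library.

```python
# Illustrative extraction of music features: audio average energy (AE) and
# rhythm features (number of beats, beat-interval distribution). librosa is
# an assumed choice; the audio path is hypothetical.
import numpy as np
import librosa

def music_features(audio_path: str) -> dict:
    x, sr = librosa.load(audio_path, sr=None, mono=True)    # discrete audio signal x
    n = len(x)
    average_energy = float(np.sum(x ** 2) / n)               # AE = (1/N) * sum x(t)^2
    _, beat_frames = librosa.beat.beat_track(y=x, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    intervals = np.diff(beat_times)                          # beat-interval distribution
    return {
        "average_energy": average_energy,
        "beat_count": int(len(beat_times)),
        "beat_interval_mean": float(intervals.mean()) if len(intervals) else 0.0,
        "beat_interval_std": float(intervals.std()) if len(intervals) else 0.0,
    }

# features = music_features("extracted_music.wav")   # hypothetical extracted music
# These features could then feed a trained model that predicts the emotion representation.
```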
  • a pre-trained machine learning model may be utilized to generate an emotional representation corresponding to the speech. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to speech extracted from audio of target content.
  • a pre-trained machine learning model may be used to generate an emotion representation corresponding to the bullet chat text.
  • The machine learning model may be, for example, a convolutional neural network-based sentiment classification model, denoted as CNN_sen.
  • a pre-trained machine learning model may be utilized to generate an emotional representation corresponding to the message text.
  • the machine learning model may be established in a manner similar to the aforementioned machine learning model for generating an emotion representation corresponding to the bullet chat text. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining an emotional representation corresponding to message text extracted from a chat session.
  • a pre-trained machine learning model may be utilized to generate an emotional representation corresponding to the object features. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining emotional representations corresponding to object features extracted from captured images.
  • An emotion representation corresponding to the speech and/or music extracted from the collected audio may be generated, in a manner similar to that described above for determining the emotion representation corresponding to the speech and/or music extracted from the audio of the target content. It should be understood that embodiments of the present disclosure are not limited to any particular technique for determining emotion representations corresponding to speech and/or music extracted from collected audio.
  • the emotion expression corresponding to the external environment information may be determined based on a pre-established machine learning model or a preset heuristic rule. Taking the external environment information as "cloudy and rainy" weather as an example, since people tend to show slightly sad emotions in cloudy and rainy weather, the emotional expression corresponding to the sad emotion can be determined from the external environment information. It should be understood that the embodiments of the present disclosure are not limited to any specific technology for determining the emotion representation corresponding to the external environment information extracted from the external environment data.
  • a final emotion tag can be generated based at least on these emotion representations.
  • the final sentiment label can be understood as indicating the overall sentiment determined by comprehensively considering various information elements.
  • Sentiment labels can be formed from multiple sentiment representations in various ways. For example, in the case that emotion representations use emotion vectors, multiple emotion representations can be superimposed to obtain a total emotion vector, and the emotion type and emotion level can be derived from the emotion distribution in the total emotion vector to form the final emotion label.
  • the final emotional tag may be calculated, selected or determined from multiple emotional tags corresponding to multiple information elements based on predetermined rules. It should be understood that the embodiments of the present disclosure are not limited to any specific manner of generating emotion tags based on multiple emotion representations.
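  • A minimal sketch of the superposition approach described above, assuming emotion vectors over a small example set of emotion types; the weights and level thresholds are assumptions.

```python
# Illustrative superposition of per-element emotion vectors into a final
# emotion tag (emotion types, weights and level thresholds are assumptions).
import numpy as np

EMOTION_TYPES = ["happiness", "sadness", "anger"]

def combine_emotions(vectors, weights=None):
    """Superimpose several emotion vectors (one per information element)
    into a total emotion vector, then derive an <emotion type, level> tag."""
    vectors = np.asarray(vectors, dtype=float)
    if weights is None:
        weights = np.ones(len(vectors))
    total = (vectors * np.asarray(weights)[:, None]).sum(axis=0)
    total /= total.sum()                       # normalize the emotion distribution
    idx = int(np.argmax(total))
    # bucket the dominant probability into three levels (1 = low, 3 = high)
    level = 1 if total[idx] < 0.45 else (2 if total[idx] < 0.7 else 3)
    return EMOTION_TYPES[idx], level

# e.g. facial expression, bullet chat text and music each contribute a vector
print(combine_emotions([[0.6, 0.3, 0.1],
                        [0.5, 0.4, 0.1],
                        [0.8, 0.1, 0.1]]))     # -> ('happiness', 2)
```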
  • Alternatively, the emotion label may be generated directly based on multiple information elements.
  • For example, a machine learning model can be pre-trained to take multiple information elements as input features and predict the emotion label accordingly.
  • The trained model can then be used to generate the emotion label directly based on the information elements 222.
  • the reference information item 232 generated at 230 may include an animation tag.
  • the animation tag can be used to select the animation to be presented.
  • the animation tag may indicate at least one or a combination of facial expression types, body movement types, etc. of the avatar. Facial expressions may include, for example, smiling, laughing, blinking, curling lips, speaking, etc., and body movements may include, for example, turning left, waving, body swinging, dance moves, and the like.
  • At least one information element 222 may be mapped to an animation tag according to a predetermined rule.
  • various animation tags may be predefined, and a large number of mapping rules from information element sets to animation tags may be predefined, where the information element set may include one or more information elements. Therefore, when an information element set including one or more information elements is given, the corresponding animation label can be determined based on one information element or a combination of multiple information elements in the information element set by referring to a predefined mapping rule .
  • An exemplary mapping rule is: when the character features extracted from the image of the target content indicate that the character is singing, and the bullet chat text includes keywords such as "good to hear" and "intoxicated", these information elements can be mapped to animation tags such as "close eyes" and "sway body", so that the avatar can exhibit behaviors such as listening to a song while intoxicated.
  • Another exemplary mapping rule is: when the speech extracted from the audio of the target content indicates that people are arguing, the bullet chat text includes keywords such as "noisy" and "don't want to listen", and the message text extracted from the chat session includes keywords indicating the user's disgust, these information elements can be mapped to animation tags such as "cover ears with hands" and "shake head", so that the avatar can show behaviors such as not wanting to hear the quarrel.
  • Another exemplary mapping rule is: when the image light detected from the image of the target content indicates rapid light-dark changes, the object identified from the image of the target content is a guitar, and the music extracted from the audio of the target content is fast-paced, these information elements can be mapped to animation tags such as "play the guitar" and "fast-paced dance moves", so that the avatar can show behaviors such as mimicking playing the guitar and dancing along with the lively music. It should be understood that the above only lists several exemplary mapping rules, and embodiments of the present disclosure may also define a large number of other mapping rules.
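  • A minimal rule-based sketch of the mapping from information-element sets to animation tags illustrated by the examples above; the rule encoding, keywords, and tag names are assumptions.

```python
# Illustrative rule-based mapping from a set of information elements to
# animation tags, following the example rules above (encoding is an assumption).
from typing import Dict, List

MAPPING_RULES = [
    # (condition over the information-element set, resulting animation tags)
    (lambda e: "singing" in e.get("character_features", [])
               and any(k in e.get("bullet_chat", "") for k in ("good to hear", "intoxicated")),
     ["close_eyes", "sway_body"]),
    (lambda e: e.get("speech_event") == "arguing"
               and any(k in e.get("bullet_chat", "") for k in ("noisy", "don't want to listen")),
     ["cover_ears", "shake_head"]),
    (lambda e: e.get("image_light") == "rapid_light_dark_change"
               and "guitar" in e.get("objects", [])
               and e.get("music_tempo") == "fast",
     ["play_guitar", "fast_dance"]),
]

def select_animation_tags(elements: Dict) -> List[str]:
    """Return the animation tags of every predefined rule whose condition holds."""
    tags: List[str] = []
    for condition, rule_tags in MAPPING_RULES:
        if condition(elements):
            tags.extend(rule_tags)
    return tags

print(select_animation_tags({
    "character_features": ["singing"],
    "bullet_chat": "so good to hear, totally intoxicated",
}))  # -> ['close_eyes', 'sway_body']
```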
  • the animation tag may also be further generated based on the emotion tag.
  • emotion tags can be used together with information elements to define mapping rules, so that corresponding animation tags can be determined based on the combination of information elements and emotion tags.
  • a direct mapping rule from emotion tags to animation tags can also be defined, so that after the emotion tags are generated, the corresponding animation tags can be determined directly based on the emotion tags by referring to the defined mapping rules.
  • For example, a mapping rule can be defined from the emotion tag <sadness, level 2> to animation tags such as "crying" and "wiping tears with hands".
  • the reference information item 232 generated at 230 may include review text.
  • the comment text may be, for example, a comment on the target content, so as to express the intelligent conversation subject's own opinion or evaluation on the target content.
  • the comment text can be selected from the bullet chat text of the target content.
  • A comment generation model constructed based on the two-tower model can be used to select comment text from the bullet chat text.
  • the bullet chat text of the target content may be time-aligned with the image and/or audio of the target content, wherein the time alignment may refer to being located at the same moment or within the same time period.
  • the bullet chat text at a specific moment may include multiple sentences, and these sentences may be comments of different viewers on the image and/or audio of the target content at that moment or in a nearby time period.
  • the comment generation model can select a suitable sentence from the corresponding bullet chat text as the comment text for the image and/or audio of the target content at that moment or in a nearby time period.
  • the two-tower model can be used to determine the matching degree between the sentences in the bullet chat text of the target content and the image and/or audio of the target content, and the sentence with the highest matching degree is selected from the bullet chat text as the comment text.
  • The comment generation model may include, for example, two two-tower models.
  • One two-tower model can be used to output a first matching score based on the input image of the target content and the sentence, to indicate the degree of matching between the image and the sentence, while the other two-tower model can be used to output a second matching score based on the input audio of the target content and the sentence, to indicate the degree of matching between the audio and the sentence.
  • The first matching score and the second matching score can be combined in any manner to obtain a comprehensive matching score for the sentence.
  • the sentence with the highest matching score may be selected as the comment text for the current image and/or audio.
  • Alternatively, the comment generation model may include only one of the two two-tower models, or may be based on any other model that is trained to determine how well a sentence in the bullet chat text matches the image and/or audio of the target content.
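  • A simplified sketch of the two-tower matching described above: one tower encodes a feature of the image (or audio) of the target content, the other encodes a candidate bullet chat sentence, and their similarity serves as the matching score; the encoder architectures, feature dimensions, and score combination are assumptions.

```python
# Simplified two-tower matching sketch (PyTorch; encoders, dimensions and the
# score combination are assumptions). Each tower maps its input into a shared
# space; the cosine similarity of the two embeddings is the matching score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerMatcher(nn.Module):
    def __init__(self, content_dim=512, text_dim=300, hidden=128):
        super().__init__()
        self.content_tower = nn.Sequential(nn.Linear(content_dim, hidden), nn.ReLU(),
                                           nn.Linear(hidden, hidden))
        self.text_tower = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, hidden))

    def forward(self, content_feat, sentence_feat):
        # content_feat: (batch, content_dim), sentence_feat: (batch, text_dim)
        c = F.normalize(self.content_tower(content_feat), dim=-1)
        s = F.normalize(self.text_tower(sentence_feat), dim=-1)
        return (c * s).sum(dim=-1)            # cosine-similarity matching score

def pick_comment(image_matcher, audio_matcher,
                 image_feat, audio_feat, sentence_feats, sentences):
    """image_feat/audio_feat: (1, dim); sentence_feats: (n, text_dim).
    Combine the image-based and audio-based scores (simple average here) and
    return the bullet chat sentence with the highest combined score."""
    n = len(sentences)
    s_img = image_matcher(image_feat.expand(n, -1), sentence_feats)
    s_aud = audio_matcher(audio_feat.expand(n, -1), sentence_feats)
    return sentences[int(torch.argmax(0.5 * s_img + 0.5 * s_aud))]
```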
  • the reference information item 232 generated at 230 may also include chat response text.
  • The other chat participant may be, for example, a user, another intelligent conversation subject, or the like.
  • the corresponding chat response text can be generated at least based on the message text through the chat engine.
  • any common chat engine can be used to generate the chat response text.
  • the chat engine can generate chat response text based at least on the sentiment tag.
  • the chat engine may be trained to generate chat response text based at least on the input message text and the emotion tag, so that the chat response text is generated under the influence of the emotion indicated by the emotion tag at least.
  • The intelligent conversation subject can exhibit the characteristic of emotional continuation in a chat session; for example, the response of the intelligent conversation subject is affected not only by the emotion of the currently received message text, but also by the intelligent conversation subject's own current emotional state.
  • If the intelligent conversation subject is currently in a happy emotional state, then even though the currently received message text may carry or cause negative emotions such as anger, the intelligent conversation subject will not immediately give an angry response because of that message text; instead, the happy emotion may still be maintained, or its emotion level may only be slightly lowered.
  • Existing chat engines usually determine the emotion type of a response only for the current round of conversation or only according to the currently received message text, so the emotion type of the response may change frequently with the received message text; this does not conform to human behavior, since humans are usually in a relatively stable emotional state when chatting and do not change their emotional state frequently.
  • the intelligent conversation subject with the emotional continuation characteristic in the chat conversation proposed by the embodiments of the present disclosure will be more anthropomorphic.
  • the chat engine can generate the chat response text based at least on the emotion representation from the emotion transfer network.
  • the emotion transfer network is used to model dynamic emotion transformation, which can not only maintain a stable emotional state, but also make appropriate adjustments or updates to the emotional state in response to the currently received message text.
  • the emotion transfer network can take the current emotion representation and the currently received message text as input, and output an updated emotion representation, wherein the current emotion representation can be, for example, a vector representation of the current emotional state of the intelligent conversation subject.
  • the updated emotion representation contains information reflecting the previous emotion state and information about the emotion change that may be caused by the current message text.
  • the updated emotional representation can be further provided to the chat engine, so that the chat engine can generate a chat response text for the current message text under the influence of the received emotional representation.
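  • A minimal sketch, under assumed dimensions, of an emotion transfer network of the kind described above: a recurrent cell takes the current emotion representation and an embedding of the received message text and outputs an updated emotion representation, which can then condition the chat engine.

```python
# Minimal sketch of an emotion transfer network (PyTorch; dimensions are
# assumptions): it updates the agent's emotion representation given the
# current emotion state and the embedding of the received message text.
import torch
import torch.nn as nn

class EmotionTransferNetwork(nn.Module):
    def __init__(self, emotion_dim=8, message_dim=300):
        super().__init__()
        # A GRU cell keeps part of the previous emotional state (continuation)
        # while letting the current message adjust it moderately.
        self.cell = nn.GRUCell(input_size=message_dim, hidden_size=emotion_dim)

    def forward(self, current_emotion, message_embedding):
        return self.cell(message_embedding, current_emotion)   # updated emotion

# Usage sketch: the updated representation conditions chat response generation.
net = EmotionTransferNetwork()
current_emotion = torch.zeros(1, 8)             # e.g. a neutral starting state
message_embedding = torch.randn(1, 300)         # embedding of the received text
updated_emotion = net(current_emotion, message_embedding)
# chat_response = chat_engine(message_text, updated_emotion)   # hypothetical call
```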
  • The chat engine can be trained to chat about the target content, that is, to discuss topics related to the target content with another chat participant.
  • the chat engine may be a search-based chat engine constructed based on, for example, chat content among people in a forum related to the target content.
  • the construction of the chat engine may include processing in various aspects.
  • chat corpus involving chat content among people may be crawled from forums related to target content.
  • A word embedding model can be trained for use in finding possible alternative names for each named entity. For example, word embedding technology can be used to find words related to each named entity, and then, optionally, correct words can be retained from the related words as possible alternative names of the named entity through, for example, manual checking.
  • Keywords can be extracted from the chat corpus. For example, statistics can be computed based on the word segmentation results of the related corpus and then compared with the statistics of non-related corpora, so as to find words with a large difference in term frequency-inverse document frequency (TF-IDF) as keywords.
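  • The keyword-selection step above might be sketched as follows: compare term statistics in the corpus related to the target content against a general corpus and keep the terms with the largest relative difference; the tokenization and scoring details are assumptions.

```python
# Illustrative keyword extraction: terms whose frequency in the related chat
# corpus is disproportionately higher than in a general corpus (a simple
# TF-IDF-style contrast; thresholds and tokenization are assumptions).
from collections import Counter
from typing import List

def extract_keywords(related_docs: List[List[str]],
                     general_docs: List[List[str]],
                     top_k: int = 20) -> List[str]:
    related = Counter(t for doc in related_docs for t in doc)
    general = Counter(t for doc in general_docs for t in doc)
    n_rel = sum(related.values()) or 1
    n_gen = sum(general.values()) or 1
    # score = relative frequency in the related corpus / relative frequency elsewhere
    scores = {t: (c / n_rel) / ((general.get(t, 0) + 1) / n_gen)
              for t, c in related.items()}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

# extract_keywords(tokenized_forum_posts, tokenized_general_text)  # hypothetical inputs
```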
  • a deep retrieval model based on, for example, a deep convolutional neural network, which is the core network of a chat engine, can be trained. The deep retrieval model can be trained by using the message-reply pairs in the chat corpus as training data. The text in the message-reply pair may include original sentences or extracted keywords in the message and the reply.
  • an intent detection model can be trained to detect which target content the received message text is specifically related to, so that a forum related to the target content can be selected from multiple forums.
  • the intent detection model may be a binary classification classifier, specifically, it may be, for example, a convolutional neural network text classification model.
  • the positive samples used for the intent detection model may come from chat corpus in forums related to the target content, and the negative samples may come from chat corpora in other forums or ordinary text.
  • In this way, a retrieval-based chat engine can be built that responds to an input message text by providing a chat response text based on the corpus in the forum.
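  • Putting the pieces above together, a retrieval-based chat engine could be sketched as follows: detect whether an incoming message concerns the target content, then return the reply of the most similar stored message-reply pair; the intent detector, similarity function, and data layout shown stand in for the trained models and are assumptions.

```python
# Sketch of a retrieval-based chat engine (data layout, intent detector and
# similarity function are assumptions standing in for the trained models).
from typing import Callable, List, Tuple

def retrieve_reply(message: str,
                   message_reply_pairs: List[Tuple[str, str]],
                   is_about_target: Callable[[str], bool],
                   similarity: Callable[[str, str], float]) -> str:
    """Return the chat response text for an input message."""
    if not is_about_target(message):          # stands in for the intent detection model
        return ""                             # fall back to a general-purpose chat engine
    # stands in for the deep retrieval model: pick the closest stored message
    best_msg, best_reply = max(message_reply_pairs,
                               key=lambda pair: similarity(message, pair[0]))
    return best_reply

# Toy usage with a trivial word-overlap similarity:
pairs = [("who wrote this song", "it was written by the band's guitarist"),
         ("the lights on stage are amazing", "yes, the lighting design is great")]
overlap = lambda a, b: len(set(a.split()) & set(b.split()))
print(retrieve_reply("who wrote this song?", pairs, lambda m: True, overlap))
```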
  • It should be understood that the above generation, at 230, of reference information items 232 including, for example, emotion tags, animation tags, comment text, chat response text, etc. is exemplary.
  • The process of generating reference information items may also consider other factors, such as a scene-specific emotion, the preset personality of the intelligent conversation subject, the preset role of the intelligent conversation subject, and the like.
  • a scene-specific emotion may refer to a preset emotion preference associated with a specific scene.
  • the intelligent conversational subject may be required to respond positively and optimistically as much as possible. Therefore, scene-specific emotions that can lead to positive and optimistic responses, such as happiness and excitement, may be preset for these scenarios.
  • a scene-specific emotion may include an emotion type, or an emotion type and its emotion level. Scene-specific emotions can be used to influence the generation of reference information items.
  • the scene-specific emotion and the information element 222 can be used as input, so as to jointly generate emotion tags.
  • the scene-specific emotion may be used as an emotion representation, and the emotion representation may be used together with a plurality of emotion representations respectively corresponding to a plurality of information elements to generate an emotion label.
  • scene-specific emotions can be considered in a similar manner to emotion tags, for example, scene-specific emotions can be used together with information elements to define mapping rules.
  • When generating comment text, the ranking of multiple sentences in the bullet chat text can consider not only the degree of matching between these sentences and the image and/or audio of the target content, but also how well the emotion information detected in these sentences matches the scene-specific emotion.
  • scene-specific emotions can be considered in a similar manner to emotion tags.
  • a chat engine can use input message text together with scene-specific sentiment and possibly sentiment tags to generate chat response text.
  • the preset personality of the intelligent conversation subject may refer to the personality characteristics pre-set for the intelligent conversation subject, for example, lively, cute, gentle, excited and so on.
  • the response made by the intelligent conversation subject can be made to conform to the preset personality as much as possible.
  • This preset personality can be used to influence the generation of reference information items.
  • preset personalities can be mapped to corresponding emotional tendencies, and the emotional tendencies can be used as input together with the information element 222, so as to jointly generate emotional tags.
  • the emotional tendency may be used as an emotional representation, and the emotional representation may be used together with multiple emotional representations respectively corresponding to multiple information elements to generate an emotional label.
  • preset personalities and information elements can be used to define mapping rules. For example, a lively and active preset personality will be more helpful in determining an animation label with more body movements, a cute preset personality will be more helpful in determining an animation label with cute facial expressions, and so on.
  • When generating comment text, the ranking of multiple sentences in the bullet chat text can consider not only the degree of matching between these sentences and the image and/or audio of the target content, but also the degree of matching between the emotion information detected in these sentences and the emotional tendency corresponding to the preset personality.
  • the emotional tendency corresponding to the preset personality can be considered in a manner similar to the emotional label.
  • a chat engine may use the input message text together with the sentiment orientation and possibly sentiment tags to generate a chat response text.
  • the preset role of the intelligent conversation subject may refer to the role to be played by the intelligent conversation subject.
  • the preset roles can be classified according to various standards, for example, roles such as little girls and middle-aged men according to age and gender, roles such as teachers, doctors, and policemen according to occupations, and so on.
  • the response made by the subject of the intelligent conversation can conform to the preset role as much as possible.
  • This preset role can be used to influence the generation of reference information items.
  • In the above-mentioned process of generating emotion tags, preset roles can be mapped to corresponding emotional tendencies, and the emotional tendencies can be used as input together with the information elements 222, so as to jointly generate emotion tags.
  • the emotional tendency may be used as an emotional representation, and the emotional representation may be used together with multiple emotional representations respectively corresponding to multiple information elements to generate an emotional label.
  • the preset roles and information elements can be used to define mapping rules. For example, the preset character of a little girl will be more helpful in determining animation tags with cute facial expressions, more body movements, and the like.
  • the ordering of multiple sentences in the bullet chat text can consider not only the matching degree between these sentences and the image and/or audio of the target content, but also the degree of matching between the emotion information detected in these sentences and the emotional tendency corresponding to the preset role.
  • the emotional tendency corresponding to the preset character may be considered in a manner similar to the emotional label.
  • a chat engine may use the input message text together with the emotional tendency, and possibly emotion tags, to generate chat response text.
  • the training corpus of the chat engine may also include more corpus corresponding to the preset roles, so that the chat response text output by the chat engine is more in line with the language characteristics of the preset roles.
  • the multimodal output data 242 is data to be provided or presented to the user, which may include various types of output data, for example, voice, text of the intelligent conversation subject, animation sequence of the avatar of the intelligent conversation subject, and the like.
  • Speech in the multimodal output data may be generated for comment text, chat response text, etc. in the reference information item.
  • comment text, chat response text, etc. may be converted into corresponding speech by any text-to-speech (TTS) conversion technology.
  • the TTS conversion process may be conditioned on emotion tags such that the generated speech has the emotion indicated by the emotion tags.
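A purely illustrative sketch of how the emotion-conditioned TTS step described above could be wrapped; `synthesize_speech`, `EmotionTag` and their parameters are hypothetical placeholders rather than any specific TTS engine's API.

```python
from dataclasses import dataclass

@dataclass
class EmotionTag:
    emotion_type: str   # e.g. "happy", "sad"
    emotion_level: str  # e.g. "low", "medium", "high"

def synthesize_speech(text: str, emotion: EmotionTag, speech_rate: float = 1.0) -> bytes:
    """Hypothetical wrapper around an emotion-conditioned TTS engine.

    A real implementation would pass the emotion type/level (and possibly a
    speech-rate setting) to the underlying engine as synthesis conditions, so
    that the generated audio carries the emotion indicated by the emotion tag."""
    print(f"TTS: '{text}' emotion={emotion.emotion_type}/{emotion.emotion_level} rate={speech_rate}")
    return b""  # placeholder audio bytes

comment_speech = synthesize_speech("What a lovely melody!", EmotionTag("happy", "high"), speech_rate=1.1)
```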
  • the text in the multimodal output data may be visual text corresponding to comment text, chat response text, etc. in the reference information item. Therefore, the text can be used to visually present the content of comments and chat responses narrated by the subject of the intelligent conversation.
  • the text may be generated with a predetermined font or presentation effect.
  • the animation sequence in the multimodal output data may be generated using at least animation tags and/or emotion tags in the reference information items.
  • An animation library of avatars of intelligent conversational subjects can be pre-built.
  • the animation library may include a large number of pre-created animation templates based on the avatar of the intelligent conversation subject.
  • Each animation template may include, for example, multiple GIF images.
  • the animation templates in the animation library can be indexed by animation tags and/or emotion tags; for example, each animation template can be marked with at least one of a corresponding facial expression type, body movement type, emotion type, emotion level, etc. Therefore, when the reference information item 232 generated at 230 includes animation tags and/or emotion tags, the animation tags and/or emotion tags can be used to select a corresponding animation template from the animation library.
  • time adaptation can be performed on the animation template to form an animation sequence of the avatar of the intelligent conversation subject.
  • Time adaptation aims to adjust the animation template to match the time sequence of the speech corresponding to the comment text and/or chat response text.
  • the duration of facial expressions, body movements, etc. in the animation template can be adjusted to match the duration of the intelligent animated character's voice.
  • the image involving opening and closing the mouth in the animation template may be repeated continuously, so as to present a visual effect that the avatar is speaking.
  • time adaptation is not limited to making the animation template match the time sequence of the speech corresponding to the comment text and/or chat response text; it may also include making the animation template match the time sequence of the extracted one or more information elements 222.
  • for example, if information elements such as the object "guitar" have been identified from the target content and these information elements have been mapped to the animation tag "playing the guitar", then, during the time period in which the singer is playing the guitar, the selected animation template corresponding to "playing the guitar" may be repeated continuously, so as to present a visual effect of the avatar playing the guitar together with the singer in the target content.
  • the intelligent conversation subject may have different avatars, so different animation libraries may be pre-established for different avatars.
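The animation-template selection and time adaptation described in the points above might look roughly like the following sketch; the `AnimationTemplate` structure, tag format and frame-looping strategy are assumptions made for illustration, not the disclosure's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class AnimationTemplate:
    name: str
    frames: List[str]                            # e.g. file names of the template's GIF frames
    frame_duration: float = 0.1                  # seconds that each frame is shown
    tags: Set[str] = field(default_factory=set)  # index tags, e.g. {"talk", "smile", "happy:high"}

def select_template(library, animation_tag: str, emotion_tag: Optional[str] = None):
    """Pick the first template whose index tags cover the requested animation tag
    (and, if given, the emotion tag); fall back to the first template otherwise."""
    wanted = {animation_tag} | ({emotion_tag} if emotion_tag else set())
    for template in library:
        if wanted <= template.tags:
            return template
    return library[0]

def time_adapt(template: AnimationTemplate, speech_duration: float) -> List[str]:
    """Repeat the template's frames so the sequence matches the duration of the
    speech being played, e.g. keeping the mouth opening and closing while talking."""
    frames_needed = max(1, round(speech_duration / template.frame_duration))
    return [template.frames[i % len(template.frames)] for i in range(frames_needed)]

library = [
    AnimationTemplate("talk_smile", ["mouth_open.png", "mouth_closed.png"], tags={"talk", "smile", "happy:high"}),
    AnimationTemplate("idle", ["idle.png"], tags={"idle"}),
]
sequence = time_adapt(select_template(library, "talk", "happy:high"), speech_duration=2.4)
print(len(sequence), "frames")
```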
  • the process of generating multimodal output data 242 at 240 discussed above is exemplary; in other implementations, the process of generating multimodal output data may also consider other factors, for example, scene-specific requirements, i.e., the multimodal output data may be further generated based on scene-specific requirements.
  • multimodal output data suitable for a specific scene can be adaptively output based on the output capabilities supported by different scenes.
  • Scenario-specific requirements may refer to specific requirements of different application scenarios of the intelligent conversation subject.
  • the scene-specific requirements may include, for example, types of supported multi-modal output data, preset speech rate settings, chat mode settings, etc. associated with a specific scene.
  • different scenes may have different data output capabilities. Therefore, the types of multimodal output data supported by different scenes may include outputting only one of voice, animation sequence and text, or outputting at least two of voice, animation sequence and text.
  • intelligent animation characters and virtual anchor scenes require terminal devices to at least support the output of images and audio, so that the specific requirements of the scene can indicate the output of one or more of voice, animation sequence and text.
  • a smart speaker scenario supports audio output only, so scenario-specific requirements can dictate that only voice be output.
  • the speech rate can be preset according to the specific needs of the scene. For example, since users can watch images and hear voices in smart animated character and virtual anchor scenes, the speech rate can be set to be faster in order to express richer emotions. For example, in the scenarios of smart speakers and smart car assistants, users often only obtain or pay attention to voice output; therefore, the speech rate can be set to be slower so that users can clearly understand, through voice alone, the content that the intelligent conversation subject wants to express.
  • different scenarios may have different chat mode preferences, therefore, chat mode settings can be made according to specific requirements of the scenario.
  • the chat engine's chatter output can be reduced.
  • the chat mode setting may also be associated with collected images, collected audio, external environment data, and the like.
  • the voice output of the chat response generated by the chat engine may be reduced when the collected audio indicates that there is loud noise around the user.
  • when the external environment data indicates that the user is traveling fast, for example, driving a vehicle at high speed, the chatting output of the chat engine may be reduced.
  • multimodal output data can be generated based at least on the scene-specific requirements. For example, when the specific requirements of the scene indicate that image output is not supported or only voice output is supported, animation sequence and text generation may not be performed. For example, when a scene-specific requirement indicates a faster speech rate, the speech rate of the generated speech may be accelerated during the TTS conversion process. For example, when the specific requirement of the scenario indicates that the output of the chat response is reduced under a specific condition, the generation of voice or text corresponding to the text of the chat response may be restricted.
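A minimal sketch of how such scene-specific requirements could be represented and applied when generating multimodal output data; the profile names, values and key-naming convention are assumptions of this sketch rather than details from the disclosure.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SceneRequirements:
    supported_outputs: Tuple[str, ...]   # which of "voice", "animation", "text" the scene can present
    speech_rate: float = 1.0             # preset speech-rate multiplier for TTS
    chat_output_enabled: bool = True     # chat-mode setting, e.g. suppress chatter while driving fast

# Illustrative scene profiles; the concrete values are assumptions for this sketch.
SCENE_PROFILES = {
    "smart_animated_character": SceneRequirements(("voice", "animation", "text"), speech_rate=1.2),
    "virtual_anchor":           SceneRequirements(("voice", "animation", "text"), speech_rate=1.2),
    "smart_speaker":            SceneRequirements(("voice",), speech_rate=0.9),
    "smart_car_assistant":      SceneRequirements(("voice",), speech_rate=0.9, chat_output_enabled=False),
}

def adapt_to_scene(outputs: dict, scene: str) -> dict:
    """Keep only output items whose modality is supported by the scene, and drop
    chat-related items when the chat-mode setting asks for reduced chatter."""
    req = SCENE_PROFILES[scene]
    kept = {k: v for k, v in outputs.items() if k.split("_")[-1] in req.supported_outputs}
    if not req.chat_output_enabled:
        kept = {k: v for k, v in kept.items() if not k.startswith("chat_")}
    return kept

candidate = {"comment_voice": b"...", "chat_voice": b"...", "comment_text": "Nice!", "animation": ["f1.png"]}
print(sorted(adapt_to_scene(candidate, "smart_speaker").keys()))  # ['chat_voice', 'comment_voice']
```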
  • multimodal output data can be provided. For example, an animation sequence, text, etc. are displayed through a display screen, and voices are played through a speaker, etc.
  • process 200 may be performed continuously so as to continuously obtain multimodal input data and continuously provide multimodal output data.
  • FIG. 3 shows an example of a smart animated character scene according to an embodiment.
  • the user 310 can watch a video on the terminal device 320 , and at the same time, the smart conversational entity according to the embodiment of the present disclosure can serve as a smart animation character to accompany the user 310 to watch the video together.
  • the terminal device 320 may include, for example, a display screen 330, a camera 322, a speaker (not shown), a microphone (not shown), and the like.
  • a video 332 may be presented as target content in the display screen 330 .
  • the avatar 334 of the intelligent conversation subject can also be presented on the display screen 330 .
  • the intelligent conversation subject can perform multi-modality-based reactive response generation according to an embodiment of the present disclosure, and accordingly, can provide the generated multi-modal-based reactive response on the terminal device 320 via the avatar 334 .
  • the avatar 334 can make facial expressions, body movements, and make voices, etc.
  • FIG. 4 illustrates an exemplary process 400 for a smart animated character scene, according to an embodiment.
  • Process 400 illustrates the processing flow, data/information flow, etc. involved in, for example, the smart animated character scene of FIG. 3 .
  • process 400 may be considered as a specific example of process 200 in FIG. 2 .
  • multimodal input data may be obtained first, including at least one of, for example, video, external environment data, collected images, collected audio, chat sessions, and the like.
  • the video, as the target content, may further include, for example, images, audio, bullet chat files, and the like. It should be understood that the obtained multimodal input data may be aligned in time and accordingly have the same context.
  • Information elements can be extracted from the multimodal input data. For example, extract character features, text, image light, objects, etc. from video images, extract music, voice, etc. from video audio, extract bullet chat text from the video's bullet chat file, extract external environment information from external environment data, extract object features from captured images, extract music, speech, etc. from captured audio, extract message text from chat sessions, and more.
  • the reference information item may be generated based at least on the extracted information elements, which includes, for example, at least one of emotion tags, animation tags, comment text, and chat response text.
  • Comment text may be generated by a comment generation model 430.
  • Chat response text may be generated by chat engine 450 and optionally emotion transfer network 452 .
  • the generated reference information items may be utilized at least to generate multimodal output data, which includes, for example, at least one of an animation sequence, comment speech, comment text, chat response speech, chat response text, and the like.
  • the animation sequence may be generated based on the description above in connection with FIG. 2 .
  • animation selection 410 may be performed in the animation library, using animation tags, emotion tags, etc., to select an animation template, and then animation sequence generation 420 may be executed based on the selected animation template, i.e., time adaptation may be performed on the selected animation template to obtain the animation sequence.
  • the comment speech may be obtained by performing speech generation 440 (e.g., TTS conversion) on the comment text.
  • the visual comment text to be displayed may be obtained based on the comment text.
  • the chat response speech may be obtained by performing speech generation 460 (e.g., TTS conversion) on the chat response text.
  • the visual chat response text to be displayed may be obtained based on the chat response text.
  • the resulting multimodal output data can be provided on an end device. For example, an animation sequence, comment text, chat response text, etc. are presented on the display screen, and the comment voice, chat response voice, etc. are played through a speaker.
  • it should be understood that all the processing and data/information in process 400 are exemplary, and in actual applications process 400 may involve only one or more of these processings and data/information items.
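To make the data flow of process 400 easier to follow, the sketch below wires placeholder stand-ins for the numbered components together in the order described above; the stub bodies are illustrative only and do not implement the actual components.

```python
# Minimal stand-ins for the FIG. 4 components so that the wiring below runs;
# each stub returns trivial data and is not an implementation of the real component.
def map_to_tags(elements):                  return "talk", {"emotion_type": "happy", "emotion_level": "high"}
def comment_generation_model(elements):     return "That guitar solo is great!"          # 430
def chat_engine(message, emotion_tag):      return "I think so too!" if message else ""  # 450 (+452)
def speech_generation(text, emotion_tag):   return b"\x00" * 160 if text else b""        # 440 / 460 (TTS)
def animation_selection(lib, a_tag, e_tag): return lib[0]                                # 410
def animation_sequence_generation(tpl, n):  return [tpl] * n                             # 420

def run_process_400(info_elements, animation_library):
    """Illustrative wiring of the process-400 data flow (component numbers refer to FIG. 4)."""
    animation_tag, emotion_tag = map_to_tags(info_elements)
    comment_text = comment_generation_model(info_elements)
    chat_text = chat_engine(info_elements.get("message_text", ""), emotion_tag)

    template = animation_selection(animation_library, animation_tag, emotion_tag)
    comment_voice = speech_generation(comment_text, emotion_tag)
    chat_voice = speech_generation(chat_text, emotion_tag)
    animation = animation_sequence_generation(template, n=24)

    return {"animation": animation, "comment_voice": comment_voice, "comment_text": comment_text,
            "chat_voice": chat_voice, "chat_text": chat_text}

outputs = run_process_400({"message_text": "Do you like this song?"}, animation_library=["talk_smile"])
print(sorted(outputs))
```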
  • the multi-modality-based reactive response generation can be applied to perform a variety of tasks.
  • the following is only an exemplary intelligent animation generation task among these tasks. It should be understood that embodiments of the present disclosure are not limited to being used for performing intelligent animation generation tasks, but may also be used for performing various other tasks.
  • FIG. 5 illustrates an exemplary process 500 of intelligent animation generation according to an embodiment.
  • Process 500 can be regarded as a specific implementation of process 200 in FIG. 2 .
  • the intelligent animation generation of process 500 is a specific application of the multi-modality-based reactive response generation of process 200 .
  • the intelligent animation generation of the process 500 may involve at least one of the generation of an animation sequence of the avatar, the generation of comment speech of the avatar, the generation of comment text, etc., performed in response to the target content.
  • the step of obtaining multimodal input data at 210 in FIG. 2 may be embodied as obtaining at 510 at least one of image, audio, and barrage files of the target content.
  • the information element extraction step at 220 in FIG. 2 can be embodied as at 520 extracting at least one information element from the image, audio, and barrage files of the target content. For example, extract character features, text, image light, objects, etc. from the image of the target content, extract music, voice, etc. from the audio of the target content, extract bullet chat text from the bullet chat file of the target content, and so on.
  • the step of generating reference information items at 230 in FIG. 2 may be embodied as generating at 530 at least one of animation tags, emotion tags and comment texts.
  • animated tags, sentiment tags, review text, etc. may be generated based at least on the at least one information element extracted at 520 .
  • the step of generating multimodal output data at 240 in FIG. 2 may be embodied as generating at least one of an animation sequence of the avatar, comment voice and comment text.
  • the animation sequence may be generated by at least using animation tags and/or emotion tags in the manner described above in conjunction with FIG. 2 .
  • comment speech and comment text may also be generated in the manner described above in conjunction with FIG. 2 .
  • the step of providing multimodal output data at 250 in FIG. 2 may be embodied as providing at 550 at least one of the generated animation sequence, comment voice, and comment text.
  • process 500 may be performed in a manner similar to that described above for the corresponding step in FIG. 2 .
  • process 500 may also include any other processing described above for process 200 of FIG. 2 .
  • FIG. 6 shows a flowchart of an exemplary method 600 for multimodality-based reactive response generation, according to an embodiment.
  • multimodal input data can be obtained.
  • At 620, at least one informational element can be extracted from the multimodal input data.
  • At 630, at least one reference information item may be generated based at least on the at least one information element.
  • multimodal output data may be generated using at least the at least one reference information item.
  • the multimodal output data can be provided.
  • the multimodal input data may include at least one of the following: images of target content, audio of target content, barrage files of target content, chat sessions, collected images, collected audio, and external environment data.
  • Extracting at least one information element from the multimodal input data may include at least one of the following: extracting character features from an image of the target content; recognizing text from an image of the target content; detecting image light from an image of the target content; recognizing objects from an image of the target content; extracting music from audio of the target content; extracting speech from audio of the target content; extracting bullet chat text from a bullet chat file of the target content; extracting message text from a chat session; extracting object features from collected images; extracting speech and/or music from collected audio; and extracting external environment information from external environment data.
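A minimal sketch of how such per-modality extraction could be dispatched; the extractor names and the keys they return are hypothetical placeholders standing in for the real extraction models.

```python
# Placeholder extractors: each stands in for the real feature/text/audio extractor
# for one input modality; the keys mirror the kinds of information elements listed above.
def extract_from_content_image(image):   return {"character_features": [], "text": "", "image_light": None, "objects": []}
def extract_from_content_audio(audio):   return {"music": None, "speech": None}
def extract_from_bullet_chat(file):      return {"bullet_chat_text": []}
def extract_from_chat_session(session):  return {"message_text": session.get("last_message", "")}
def extract_from_captured_image(image):  return {"object_features": []}
def extract_from_captured_audio(audio):  return {"captured_speech": None, "captured_music": None}
def extract_from_environment(data):      return {"environment_info": data}

EXTRACTORS = {
    "content_image": extract_from_content_image,
    "content_audio": extract_from_content_audio,
    "bullet_chat_file": extract_from_bullet_chat,
    "chat_session": extract_from_chat_session,
    "captured_image": extract_from_captured_image,
    "captured_audio": extract_from_captured_audio,
    "environment_data": extract_from_environment,
}

def extract_information_elements(multimodal_input: dict) -> dict:
    """Run the matching extractor for every modality present in the input and merge
    the resulting information elements into a single, time-aligned context."""
    elements = {}
    for modality, payload in multimodal_input.items():
        extractor = EXTRACTORS.get(modality)
        if extractor is not None:
            elements.update(extractor(payload))
    return elements

print(extract_information_elements({"chat_session": {"last_message": "hello"},
                                    "environment_data": {"weather": "rainy"}}))
```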
  • generating at least one reference information item based at least on the at least one information element may include: generating at least one of emotion tags, animation tags, comment text and chat response text based at least on the at least one information element.
  • Generating the emotion tag based at least on the at least one information element may include: generating one or more emotion representations respectively corresponding to one or more information elements in the at least one information element; and generating the emotion tag based at least on the one or more emotion representations.
  • the emotion tag may indicate an emotion type and/or an emotion level.
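A minimal sketch, assuming probability-like emotion representations, of pooling one or more emotion representations into an emotion tag with a type and a level; the pooling rule, emotion types and thresholds are assumptions made for illustration.

```python
import numpy as np

EMOTION_TYPES = ("happy", "calm", "excited", "sad")

def generate_emotion_tag(emotion_representations):
    """Pool the per-element emotion representations (here by simple averaging of
    probability-like vectors over EMOTION_TYPES) and derive an emotion type and level."""
    pooled = np.mean(np.vstack(emotion_representations), axis=0)
    idx = int(np.argmax(pooled))
    score = float(pooled[idx])
    level = "high" if score >= 0.6 else "medium" if score >= 0.3 else "low"
    return {"emotion_type": EMOTION_TYPES[idx], "emotion_level": level}

# Illustrative representations for two information elements (e.g. music and speech),
# optionally extended with a personality/role tendency as sketched earlier.
music_repr = np.array([0.7, 0.1, 0.2, 0.0])
speech_repr = np.array([0.3, 0.1, 0.5, 0.1])
print(generate_emotion_tag([music_repr, speech_repr]))  # {'emotion_type': 'happy', 'emotion_level': 'medium'}
```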
  • Generating the animation label based at least on the at least one information element may include: mapping the at least one information element to the animation label according to a predetermined rule.
  • the animation tag may indicate the type of facial expression and/or the type of body movement.
  • the animation tag may be further generated based on the emotion tag.
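A minimal sketch of mapping information elements (and optionally the emotion tag) to an animation tag according to predetermined rules; the example rules and tag values are assumptions, not rules taken from the disclosure.

```python
from typing import Optional

# Illustrative predetermined rules: (predicate over information elements, partial animation tag).
RULES = [
    (lambda e: e.get("music") is not None,              {"body_movement": "sway_to_music"}),
    (lambda e: "guitar" in e.get("objects", []),        {"body_movement": "play_guitar"}),
    (lambda e: "birthday" in e.get("text", "").lower(), {"facial_expression": "big_smile"}),
]

def generate_animation_tag(information_elements: dict, emotion_tag: Optional[dict] = None) -> dict:
    """Map the extracted information elements (and optionally the emotion tag) to an
    animation tag indicating a facial expression type and/or a body movement type."""
    tag = {"facial_expression": "neutral", "body_movement": "idle"}
    for predicate, partial in RULES:
        if predicate(information_elements):
            tag.update(partial)
    if emotion_tag and emotion_tag.get("emotion_type") == "happy" and tag["facial_expression"] == "neutral":
        tag["facial_expression"] = "smile"
    return tag

elements = {"objects": ["guitar", "microphone"], "music": "pop_song", "text": ""}
print(generate_animation_tag(elements, {"emotion_type": "happy", "emotion_level": "high"}))
```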
  • Generating comment text based at least on the at least one information element may include: selecting the comment text from bullet chat text of the target content.
  • the selection of the comment text may include: using a two-tower model to determine the matching degree between sentences in the bullet chat text of the target content and the image and/or audio of the target content; and selecting the sentence with the highest matching degree in the bullet chat text as the comment text.
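The disclosure does not detail the two-tower model, so the sketch below only illustrates the general idea: one tower encodes a bullet-chat sentence, the other encodes the image/audio context, and the sentence with the highest similarity is selected; the random projections stand in for trained encoders and are not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random projections stand in for the two trained encoder towers; a real two-tower
# model would use neural encoders that map text and image/audio into a shared space.
W_text = rng.normal(size=(64, 128))
W_context = rng.normal(size=(32, 128))

def encode_sentence(sentence_features: np.ndarray) -> np.ndarray:
    return sentence_features @ W_text   # text tower

def encode_context(av_features: np.ndarray) -> np.ndarray:
    return av_features @ W_context      # image/audio tower

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_comment(sentences, sentence_features, av_features):
    """Score each bullet-chat sentence against the image/audio context of the target
    content and return the sentence with the highest matching degree as the comment."""
    context_vec = encode_context(av_features)
    scores = [cosine(encode_sentence(f), context_vec) for f in sentence_features]
    return sentences[int(np.argmax(scores))]

sentences = ["Amazing solo!", "I'm hungry", "Love this chorus"]
sentence_features = [rng.normal(size=64) for _ in sentences]
av_features = rng.normal(size=32)
print(select_comment(sentences, sentence_features, av_features))
```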
  • Generating the chat response text based at least on the at least one information element may include: generating the chat response text based at least on message text in the chat session by a chat engine.
  • the chat response text may be further generated based on the emotion tag.
  • the chat response text may be further generated based on an emotion representation from an emotion transfer network.
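A sketch of the interface such an emotion-conditioned chat engine might expose; `ChatRequest`, `ToyChatEngine` and the styling rule are hypothetical stand-ins and not the disclosure's chat engine or emotion transfer network.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class ChatRequest:
    message_text: str
    emotion_tag: Optional[dict] = None              # e.g. {"emotion_type": "happy", "emotion_level": "high"}
    emotion_embedding: Optional[np.ndarray] = None  # e.g. a vector produced by an emotion transfer network

class ToyChatEngine:
    """Stand-in for the chat engine: a real system would be a retrieval- or
    generation-based model that consumes the emotion condition as extra input."""

    def generate(self, request: ChatRequest) -> str:
        suffix = ""
        if request.emotion_tag and request.emotion_tag.get("emotion_type") == "happy":
            suffix = " :)"  # the emotion condition only nudges the style in this toy example
        return f"You said: '{request.message_text}'. Tell me more!{suffix}"

engine = ToyChatEngine()
print(engine.generate(ChatRequest("This band is great", emotion_tag={"emotion_type": "happy", "emotion_level": "high"})))
```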
  • the at least one reference information item may be further generated based on at least one of the following: scene-specific emotion; preset personality of the intelligent conversation subject; and preset role of the intelligent conversation subject.
  • the multimodal output data may include at least one of the following: an animation sequence of an avatar of the intelligent conversation subject; voice of the intelligent conversation subject; and text.
  • Generating multimodal output data by using at least the at least one reference information item may include: generating voice and/or text corresponding to the comment text and/or the chat response text.
  • At least using the at least one reference information item to generate multimodal output data may include: using the animation tag and/or the emotion tag to select a corresponding animation template from the animation library of the avatar of the intelligent conversation subject; and performing time adaptation on the animation template to form an animation sequence of the avatar of the intelligent conversation subject.
  • the time adaptation may include: adjusting the animation template to match the time sequence of the speech corresponding to the comment text and/or the chat response text.
  • the multimodal output data may be further generated based on specific requirements of the scene.
  • the scene-specific requirement may include at least one of the following: outputting only one of voice, animation sequence and text; outputting at least two of voice, animation sequence and text; predetermined speech rate setting; and chat mode setting.
  • the multimodal based reactive response generation can include intelligent animation generation.
  • Obtaining the multimodal input data may include: obtaining at least one of image, audio and bullet chat files of the target content.
  • Extracting at least one information element from the multimodal input data may include: extracting at least one information element from image, audio and bullet chat files of the target content.
  • Generating at least one reference information item based at least on the at least one information element may include: generating at least one of animation tags, emotion tags and comment text based on at least the at least one information element.
  • At least using the at least one reference information item to generate multimodal output data may include: using at least one of the animation tag, the emotion tag and the comment text to generate at least one of an animation sequence of the avatar, comment voice and comment text.
  • Providing the multimodal output data may include: providing at least one of the animation sequence, the comment voice and the comment text.
  • the method 600 may also include any steps/processes for multi-modality-based reactive response generation according to the embodiments of the present disclosure described above.
  • FIG. 7 illustrates an exemplary apparatus 700 for multimodality-based reactive response generation, according to an embodiment.
  • the apparatus 700 may include: a multimodal input data obtaining module 710, for obtaining multimodal input data; a data integration processing module 720, for extracting at least one information element from the multimodal input data; a scene logic processing module 730, for generating at least one reference information item based at least on the at least one information element; a multimodal output data generation module 740, for generating multimodal output data by at least utilizing the at least one reference information item; and a multimodal output data providing module 750, for providing the multimodal output data.
  • apparatus 700 may also include any other modules that execute the steps of the method for multimodal-based reactive response generation according to the above-mentioned embodiments of the present disclosure.
  • FIG. 8 illustrates an exemplary apparatus 800 for multimodality-based reactive response generation, according to an embodiment.
  • Apparatus 800 may include: at least one processor 810; and memory 820 storing computer-executable instructions.
  • the at least one processor 810 may execute any steps/processes of the method for multimodal-based reactive response generation according to the above-mentioned embodiments of the present disclosure.
  • Embodiments of the present disclosure propose a multimodal-based reactive response generation system, including: a multimodal data input interface for obtaining multimodal input data; a core processing unit configured to extract at least one information element from the multimodal input data, generate at least one reference information item based at least on the at least one information element, and generate multimodal output data by at least utilizing the at least one reference information item; and a multimodal data output interface for providing the multimodal output data.
  • the multimodal data input interface, the core processing unit, and the multimodal data output interface may also execute any relevant steps/processes of the method for multimodal-based reactive response generation according to the above-mentioned embodiments of the present disclosure.
  • the multimodality-based reactive response generation system may further include any other units and modules for multimodality-based reactive response generation according to the above-mentioned embodiments of the present disclosure.
  • Embodiments of the present disclosure propose a computer program product for multimodal-based reactive response generation, comprising a computer program that is run by at least one processor to execute any step/process of the method for multimodal-based reactive response generation according to the above-mentioned embodiments of the present disclosure.
  • Embodiments of the present disclosure can be embodied on a non-transitory computer readable medium.
  • the non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any steps/processes of the method for multimodal-based reactive response generation according to the embodiments of the present disclosure described above.
  • modules in the apparatus described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. Furthermore, any of these modules may be functionally further divided into sub-modules or grouped together.
  • processors have been described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether these processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system. As examples, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, microcontroller, digital signal processor (DSP), field programmable gate array (FPGA), programmable logic devices (PLDs), state machines, gate logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors given in this disclosure may be implemented as software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
  • the software may reside on a computer readable medium.
  • the computer readable medium may include, for example, memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic stripe), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), Programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), register or removable disk.

Abstract

The present disclosure provides a method, system and apparatus for multimodal based reactive response generation. Multimodal input data can be obtained. At least one information element can be extracted from the multimodal input data. At least one reference information item can be generated based at least on the at least one information element. Multimodal output data can be generated by at least using the at least one reference information item. The multimodal output data can be provided.

Description

基于多模态的反应式响应生成Reactive Response Generation Based on Multimodality 背景技术Background technique
近年来,智能人机交互系统被广泛地应用于越来越多的场景和领域,其能够有效地提升人机交互效率、优化人机交互体验。随着人工智能(AI)技术的发展,人机交互系统也在例如智能会话系统等方面取得了更为深入的发展。例如,智能会话系统已经涵盖了任务对话、知识问答、开放域对话等应用场景,并且可以采用基于模板的技术、基于检索的技术、基于深度学习的技术等多种技术来实现。In recent years, intelligent human-computer interaction systems have been widely used in more and more scenarios and fields, which can effectively improve the efficiency of human-computer interaction and optimize the experience of human-computer interaction. With the development of artificial intelligence (AI) technology, human-computer interaction systems have also achieved more in-depth development in aspects such as intelligent conversation systems. For example, the intelligent conversation system has covered application scenarios such as task dialogue, knowledge question answering, and open domain dialogue, and can be realized by using template-based technology, retrieval-based technology, and deep learning-based technology.
发明内容Contents of the invention
提供本发明内容以便介绍一组概念,这组概念将在以下的具体实施方式中做进一步描述。本发明内容并非旨在标识所保护主题的关键特征或必要特征,也不旨在用于限制所保护主题的范围。This Summary is provided to introduce a set of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
本公开的实施例提出了用于基于多模态的反应式响应生成的方法、系统和装置。可以获得多模态输入数据。可以从所述多模态输入数据中提取至少一个信息元素。可以至少基于所述至少一个信息元素来生成至少一个参考信息项。可以至少利用所述至少一个参考信息项来产生多模态输出数据。可以提供所述多模态输出数据。Embodiments of the present disclosure propose methods, systems and apparatus for multimodality-based reactive response generation. Multimodal input data can be obtained. At least one informational element may be extracted from said multimodal input data. At least one reference information item may be generated based at least on said at least one information element. The at least one item of reference information may be used at least to generate multimodal output data. The multimodal output data may be provided.
应当注意,以上一个或多个方面包括以下详细描述以及权利要求中具体指出的特征。下面的说明书及附图详细提出了所述一个或多个方面的某些说明性特征。这些特征仅仅指示可以实施各个方面的原理的多种方式,并且本公开旨在包括所有这些方面和其等同变换。It should be noted that one or more of the above aspects include the features specified in the following detailed description as well as in the claims. Certain illustrative features of the one or more aspects are set forth in detail in the following description and accompanying drawings. These features are merely indicative of the various ways in which the principles of various aspects can be implemented and this disclosure is intended to include all such aspects and their equivalents.
附图说明Description of drawings
以下将结合附图描述所公开的多个方面,这些附图被提供用以说明而非限制所公开的多个方面。The disclosed aspects will be described below with reference to the accompanying drawings, which are provided to illustrate but not limit the disclosed aspects.
图1示出了根据实施例的基于多模态的反应式响应生成系统的示例性架构。FIG. 1 illustrates an exemplary architecture of a multimodality-based reactive response generation system according to an embodiment.
图2示出了根据实施例的用于基于多模态的反应式响应生成的示例性过程。FIG. 2 illustrates an exemplary process for multimodality-based reactive response generation, according to an embodiment.
图3示出了根据实施例的智能动画角色场景的实例。Figure 3 shows an example of a smart animated character scene according to an embodiment.
图4示出了根据实施例的智能动画角色场景的示例性过程。Fig. 4 shows an exemplary process of intelligently animating a character scene according to an embodiment.
图5示出了根据实施例的智能动画生成的示例性过程。Fig. 5 shows an exemplary process of smart animation generation according to an embodiment.
图6示出了根据实施例的用于基于多模态的反应式响应生成的示例性方法的流程图。FIG. 6 shows a flowchart of an exemplary method for multimodality-based reactive response generation, according to an embodiment.
图7示出了根据实施例的用于基于多模态的反应式响应生成的示例性装置。FIG. 7 illustrates an exemplary apparatus for multimodality-based reactive response generation, according to an embodiment.
图8示出了根据实施例的用于基于多模态的反应式响应生成的示例性装置。FIG. 8 illustrates an exemplary apparatus for multimodality-based reactive response generation, according to an embodiment.
具体实施方式Detailed ways
现在将参考多种示例性实施方式来讨论本公开。应当理解,这些实施方式的讨论仅仅用于使得本领域技术人员能够更好地理解并从而实施本公开的实施例,而并非教导对本公开的范围的任何限制。The present disclosure will now be discussed with reference to various exemplary embodiments. It should be understood that the discussion of these embodiments is only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than teaching any limitation to the scope of the present disclosure.
现有的人机交互系统通常采用单一媒介来作为信息输入和输出的渠道,例如,通过文本、语音、手势等中之一进行人与机器或机器与机器之间的交流沟通。以智能会话系统为例,尽管其可面向文本或语音,但是其仍然以文本处理或文本分析为核心。智能会话系统在交互过程中缺少对交互对象在文本之外的例如面部表情、肢体动作等信息的考虑,也缺少对环境中的声音、光线等因素的考虑,致使在交互过程中存在较为普遍的问题。一方面的问题在于对信息的理解不够全面准确。人类在实际交流过程中,并不是单一地通过语言文本来表达自己的全部交流内容,而是往往也将语气、 面部表情、肢体动作等作为表达或传递信息的重要渠道。例如,对于相同一句话,如果使用不同的语气或伴随不同的面部表情和肢体动作,则在不同的场合下其可能传达截然不同的语义。现有的以文本处理为核心的智能会话技术缺失了在交互过程中相当重要的这部分信息,由此导致对会话中的上下文信息的提取与应用变得十分困难。另一方面的问题在于对信息的表达不够生动。现有的智能会话技术在信息表达上主要是通过文本来进行的,而在支持语音识别和语音合成的情况下,也可以将输出文本转换为语音。然而,这样的信息传递渠道仍然是受限的,无法像人类一样综合地利用语言、面部表情、肢体动作等来全面准确地表达自身意图,从而导致难以展示生动活泼的拟人化表现。再一方面的问题在于现有的智能会话技术局限于对所接收到的输入会话消息做出响应,而无法自发地对各种环境因素做出反应。例如,现有的聊天机器人仅专注于对来自用户的会话消息做出响应,以便能够围绕来自用户的会话消息而进行聊天。Existing human-computer interaction systems usually use a single medium as a channel for information input and output, for example, communication between humans and machines or between machines through one of text, voice, and gestures. Taking the intelligent conversational system as an example, although it can be oriented to text or speech, it still takes text processing or text analysis as its core. During the interaction process, the intelligent conversation system lacks the consideration of information such as facial expressions and body movements of the interactive objects outside the text, and also lacks the consideration of factors such as the sound and light in the environment, resulting in common errors in the interaction process. question. On the one hand, the problem is that the understanding of information is not comprehensive and accurate enough. In the process of actual communication, human beings do not express their entire communication content through language and text alone, but often use tone of voice, facial expressions, body movements, etc. as important channels for expressing or transmitting information. For example, for the same sentence, if it uses different tones or is accompanied by different facial expressions and body movements, it may convey completely different semantics in different occasions. The existing intelligent conversation technology with text processing as the core lacks this part of the information which is very important in the interaction process, which makes it very difficult to extract and apply the context information in the conversation. Another problem is that the expression of information is not vivid enough. Existing intelligent conversation technology mainly performs information expression through text, and in the case of supporting speech recognition and speech synthesis, the output text can also be converted into speech. However, such information transmission channels are still limited, and it is impossible to comprehensively and accurately express one's own intentions by comprehensively using language, facial expressions, body movements, etc. like humans, making it difficult to show lively anthropomorphic performances. Another problem is that the existing intelligent conversation technology is limited to responding to received input conversation messages, but cannot respond to various environmental factors spontaneously. For example, existing chatbots only focus on responding to conversational messages from users so as to be able to chat around conversational messages from users.
本公开的实施例提出了基于多模态的反应式(reaction)响应生成方案,其可以被实施在多种智能会话主体上,并且可以被广泛地应用于包括人机交互在内的多种场景中。在本文中,智能会话主体可以广泛地指能够在特定的应用场景中生成并呈现信息内容、提供交互功能等的AI产品形态,例如,聊天机器人、智能动画角色、虚拟主播、智能车机助理、智能客服、智能音箱等。根据本公开的实施例,智能会话主体可以基于多模态输入数据来产生多模态输出数据,其中,多模态输出数据是以反应式方式所生成的、将被呈现给用户的响应。The embodiment of the present disclosure proposes a multimodal-based reaction response generation scheme, which can be implemented on a variety of intelligent conversation subjects, and can be widely used in various scenarios including human-computer interaction middle. In this paper, intelligent conversation subjects can broadly refer to AI product forms that can generate and present information content and provide interactive functions in specific application scenarios, such as chat robots, intelligent animated characters, virtual anchors, intelligent car assistants, Smart customer service, smart speakers, etc. According to an embodiment of the present disclosure, an intelligent conversational agent may generate multimodal output data based on multimodal input data, wherein the multimodal output data is a response generated in a reactive manner to be presented to the user.
人与人之间自然的交流方式往往是多模态的。人类在彼此交流时,往往会综合考虑来自交流对象的语音、文字、面部表情、肢体动作等多种类型的信息,同时兼顾所处环境的场景、光线、声音甚至温度、湿度等信息。通过对这些多模态的信息的综合考虑,人类能够更加全面、准确、快速地理解交流对象所要表达的内容。同样地,在表达信息时,人类也会倾向于综合使用语音、面部表情、肢体动作等多模态的表达方式来更加准确、生动、全面地表达自身意图。The natural way people communicate with each other is often multimodal. When human beings communicate with each other, they tend to comprehensively consider various types of information such as speech, text, facial expressions, body movements, etc. from the communication objects, and at the same time take into account the scene, light, sound, and even temperature, humidity and other information of the environment they are in. Through the comprehensive consideration of these multi-modal information, human beings can more comprehensively, accurately and quickly understand what the communication object wants to express. Similarly, when expressing information, human beings tend to use voice, facial expression, body movements and other multi-modal expressions to express their intentions more accurately, vividly and comprehensively.
基于来自上述的人类交流方式的启发,在人机交互的场景下,自然、本真的人机交互方案也应该是多模态的。因此,本公开的实施例提出了基于多模态的人机交互方式。在本文中,交互可以广泛地指例如对信息、数据、内容等的理解和表达,而人机交互可以广泛地指在智能会话主体与交互对象之间的交互,例如,在智能会话主体与人类用户之间的交互、在智能会话主体之间的交互、智能会话主体对各种媒体内容或信息化数据的响应、等等。与现有的基于单一媒介的交互方式相比,本公开的实施例具有多种优势。在一个方面,可以实现更加准确地信息理解。通过对包括例如媒体内容、所采集的图像或音频、聊天会话、外界环境数据等多模态输入数据的综合处理,能够更加全面地收集和分析信息,减少信息缺失造成的误解,从而更加准确地理解交互对象的深层次意图。在一个方面,表达方式更为高效。通过以多种方式多模态地迭加表达信息,例如,在语音或文字的基础上迭加虚拟形象的面部表情和/或肢体动作或者其它动画序列等,可以更高效地表达信息和情感。在一个方面,智能会话主体的交互行为将更加生动。对多模态数据的理解与表达将使得智能会话主体更加拟人化,从而显著地提升用户体验。Based on the inspiration from the above-mentioned human communication methods, in the context of human-computer interaction, natural and authentic human-computer interaction solutions should also be multimodal. Therefore, the embodiments of the present disclosure propose a multimodal human-computer interaction method. In this paper, interaction can broadly refer to the understanding and expression of information, data, content, etc., while human-computer interaction can broadly refer to the interaction between the intelligent conversation subject and the interactive object, for example, between the intelligent conversation subject and the human Interaction between users, interaction between intelligent conversation subjects, responses of intelligent conversation subjects to various media contents or informational data, and so on. Compared with the existing interaction methods based on a single medium, the embodiments of the present disclosure have various advantages. In one aspect, more accurate information understanding can be achieved. Through the comprehensive processing of multimodal input data including media content, collected images or audio, chat sessions, and external environment data, information can be collected and analyzed more comprehensively, misunderstandings caused by missing information can be reduced, and more accurate Understand the deep-level intent of interacting objects. In one respect, the expression is more efficient. By superimposing and expressing information in multiple ways and in multiple modes, for example, superimposing facial expressions and/or body movements of avatars or other animation sequences on the basis of speech or text, information and emotions can be expressed more efficiently. In one aspect, the interactive behavior of the intelligent conversation subject will be more vivid. The understanding and expression of multimodal data will make the subject of intelligent conversation more anthropomorphic, thereby significantly improving user experience.
此外,本公开的实施例可以使得智能会话主体模仿人类来对语音、文本、音乐、视频图像等多模态输入数据产生自然的反应,即,做出反应式响应。在本文中,智能会话主体的反应式响应并不局限于对来自例如用户的聊天消息所做出的反应,还可以涵盖对例如媒体内容、所采集的图像或音频、外界环境等各种输入数据所主动做出的反应。以智能会话主体充当智能动画角色来提供AI智能陪伴的场景为例,假设智能会话主体可以通过对应的虚拟形象来陪伴用户观看视频,则该智能会话主体不仅可以与用户进行直接交互,还可以对该视频中的内容自发地做出反应式响应,例如,该虚拟形象可以发出语音、做出面部表情、做出肢体动作、呈现文字等。从而,智能会话主体的行为将更加拟人化。In addition, the embodiments of the present disclosure can enable the intelligent conversation subject to imitate human beings to generate natural responses to multi-modal input data such as speech, text, music, video images, ie, make reactive responses. In this paper, the reactive response of the intelligent conversation subject is not limited to the response to the chat message from the user, for example, but also covers various input data such as media content, captured image or audio, external environment, etc. proactive response. Taking the scene where the intelligent conversation subject acts as an intelligent animation role to provide AI intelligent companionship as an example, assuming that the intelligent conversation subject can accompany the user to watch videos through the corresponding avatar, the intelligent conversation subject can not only directly interact with the user, but also interact with the user. The content in the video responds spontaneously and reactively, for example, the avatar can speak, make facial expressions, make body movements, present text, etc. Thus, the behavior of the intelligent conversation subject will be more anthropomorphic.
本公开的实施例提出了通用的基于多模态的反应式响应生成技术,通过集成和应用基于多模态的反应式响应生成系统,智能会话主体可以高效快捷地获得多模态交互能力。通过根据本公开实施例的基于多模态的反应式响应生成技术,可以整合处理来自多种媒介渠道的多模态输入数据,并且能够更加准确有效地解读多模态输入数据所表达的意图。此外,通过根据本公开实施例的基于多 模态的反应式响应生成技术,智能会话主体可以经由多种渠道来提供多模态输出数据以表达整体一致的信息,由此提升了信息表达的准确度和效率,使得智能会话主体的信息表达更加生动有趣,从而显著地改善了用户体验。Embodiments of the present disclosure propose a general multimodal-based reactive response generation technology. By integrating and applying the multimodal-based reactive response generation system, intelligent conversation subjects can efficiently and quickly obtain multimodal interaction capabilities. Through the multimodal-based reactive response generation technology according to the embodiments of the present disclosure, multimodal input data from various media channels can be integrated and processed, and the intent expressed by the multimodal input data can be interpreted more accurately and effectively. In addition, through the multimodal-based reactive response generation technology according to the embodiments of the present disclosure, the intelligent conversation subject can provide multimodal output data through multiple channels to express overall consistent information, thereby improving the accuracy of information expression. Accuracy and efficiency make the information expression of intelligent conversation subjects more vivid and interesting, thus significantly improving user experience.
根据本公开实施例的基于多模态的反应式响应生成技术可以被自适应地应用于多种场景中。基于不同场景所支持的输入和输出能力,本公开的的实施例可以在不同场景中获得对应的多模态输入数据,并且输出适合于特定场景的多模态输出数据。以为充当智能动画角色的智能会话主体自动地生成动画的场景为例,本公开的实施例可以为智能动画角色的虚拟形象生成包括例如动画序列等的反应式响应。例如,在该智能动画角色被应用于陪伴用户观看视频的情况下,智能动画角色能够综合处理来自视频内容、采集的图像或音频、聊天会话、外界环境数据等的多模态输入数据,对多模态输入数据进行深度感知和理解,并且相应地以智能且动态的方式通过例如语音、文字、包含面部表情和/或肢体动作的动画序列等多种模态来做出合理的反应,从而实现全面、高效、生动的人机交互体验。智能动画角色的感知能力和情绪表达能力得到极大增强,并且智能动画角色变得更加拟人化。这也可以成为通过AI技术进行例如智能动画内容创作的技术基础。The multi-modality-based reactive response generation technology according to the embodiments of the present disclosure can be adaptively applied to various scenarios. Based on the input and output capabilities supported by different scenarios, embodiments of the present disclosure can obtain corresponding multimodal input data in different scenarios, and output multimodal output data suitable for specific scenarios. Taking as an example a scene in which an intelligent conversational subject acting as an intelligent animation character automatically generates animations, embodiments of the present disclosure may generate reactive responses including, for example, animation sequences, for the avatar of the intelligent animation character. For example, when the smart animated character is used to accompany the user to watch a video, the smart animated character can comprehensively process multi-modal input data from video content, collected images or audio, chat sessions, external environment data, etc. Modal input data for depth perception and understanding, and respond accordingly in an intelligent and dynamic manner through multiple modalities such as speech, text, animation sequences including facial expressions and/or body movements, to achieve Comprehensive, efficient and vivid human-computer interaction experience. The perception ability and emotional expression ability of intelligent animation characters are greatly enhanced, and intelligent animation characters become more anthropomorphic. This can also become the technical basis for content creation such as intelligent animation through AI technology.
以上仅仅对本公开实施例在智能动画角色场景中的应用进行了示例性说明,本公开的实施例还可以应用于多种其它场景。例如,在智能会话主体是聊天机器人的场景下,该聊天机器人可以与用户进行诸如语音、文字、视频等形式的聊天,则本公开的实施例所处理的多模态输入数据可以包括例如聊天会话、采集的图像或音频、外界环境数据等,并且所提供的多模态输出数据可以包括例如语音、文字、动画序列等。例如,在智能会话主体是虚拟主播的场景下,该虚拟主播可以具有对应的虚拟形象并且向多个用户播放和解说预定的媒体内容,则本公开的实施例所处理的多模态输入数据可以包括例如所播放的媒体内容、外界环境数据等,并且所提供的多模态输出数据可以包括例如语音、文字、虚拟形象的动画序列等。例如,在智能会话主体是智能车机助理的场景下,该智能车机助理可以在用户驾驶交通工具(例如,车辆)期间提供辅助或陪伴,则本公开的实施例所处理的多模态输入数据可以包括例如聊天会话、采集的图像或音频、外界环境数据等,并且所提供的多模态输出数据可以包括例如语音、文字等。例如,在智能会话主体是智能客服的场景下,该智能客服可以为顾客提供诸如问题解答、产品信息提供等交互,则本公开的实施例所处理的多模态输入数据可以包括例如聊天会话、外界环境数据等,并且所提供的多模态输出数据可以包括例如语音、文字、动画等。例如,在智能会话主体是智能音箱的场景下,该智能音箱中的语音助理或聊天机器人可以与用户进行交互、播放音频内容等,则本公开的实施例所处理的多模态输入数据可以包括例如所播放的音频内容、聊天会话、采集的音频、外界环境数据等,并且所提供的多模态输出数据可以包括例如语音等。应当理解,除了上述这些示例性场景,本公开的实施例还可以应用于任何其它场景。The above is only an exemplary description of the application of the embodiments of the present disclosure in the intelligent animation character scene, and the embodiments of the present disclosure can also be applied to various other scenes. For example, in the scenario where the subject of the intelligent conversation is a chat robot, the chat robot can chat with the user in forms such as voice, text, video, etc., then the multimodal input data processed by the embodiments of the present disclosure can include, for example, chat session , collected images or audio, external environment data, etc., and the multimodal output data provided may include, for example, voice, text, animation sequences, etc. For example, in the scenario where the subject of the intelligent conversation is a virtual anchor, the virtual anchor can have a corresponding avatar and play and explain predetermined media content to multiple users, then the multimodal input data processed by the embodiments of the present disclosure can be It includes, for example, played media content, external environment data, etc., and the provided multimodal output data may include, for example, voice, text, animation sequences of avatars, and the like. For example, in the scenario where the subject of the intelligent conversation is a smart car-machine assistant, the smart car-machine assistant can provide assistance or companionship while the user is driving a vehicle (for example, a vehicle), then the multimodal input processed by the embodiments of the present disclosure The data may include, for example, chat sessions, collected images or audio, external environment data, etc., and the provided multimodal output data may include, for example, voice, text, and the like. For example, in the scenario where the subject of the intelligent conversation is an intelligent customer service, the intelligent customer service can provide customers with interactions such as answering questions and providing product information, then the multimodal input data processed by the embodiments of the present disclosure can include, for example, chat sessions, External environment data, etc., and the multimodal output data provided may include, for example, voice, text, animation, etc. For example, in the scenario where the subject of the intelligent conversation is a smart speaker, the voice assistant or chat robot in the smart speaker can interact with the user, play audio content, etc., then the multimodal input data processed by the embodiments of the present disclosure can include For example, played audio content, chat sessions, collected audio, external environment data, etc., and the provided multimodal output data may include, for example, voice and the like. 
It should be understood that, in addition to the above exemplary scenarios, the embodiments of the present disclosure may also be applied to any other scenarios.
图1示出了根据实施例的基于多模态的反应式响应生成系统100的示例性架构。系统100可以支持智能会话主体在不同的场景中做出基于多模态的反应式响应。智能会话主体可以实施或驻留在终端设备或任何用户可访问的设备或平台上。FIG. 1 shows an exemplary architecture of a multi-modality-based reactive response generation system 100 according to an embodiment. The system 100 can support the intelligent conversation subject to make multimodal-based reactive responses in different scenarios. An intelligent conversational subject may be implemented or resident on an end device or any user-accessible device or platform.
系统100可以包括多模态数据输入接口110,其用于获得多模态输入数据。多模态数据输入接口110可以从多种数据源处收集多种类型的输入数据。例如,在向用户播放目标内容的情况下,多模态数据输入接口110可以收集到该目标内容的例如图像、音频、弹幕文件等数据。在本文中,目标内容可以广泛地指在设备上播放或呈现给用户的各种媒体内容,例如,视频内容、音频内容、图片内容、文字内容等。例如,在智能会话主体可以与用户进行聊天的情况下,多模态数据输入接口110可以获得关于聊天会话的输入数据。例如,多模态数据输入接口110可以通过终端设备上的摄像头和/或麦克风来采集用户周围的图像和/或音频。例如,多模态数据输入接口110还可以从第三方应用或任何其它信息源处获得外界环境数据。在本文中,外界环境数据可以广泛地指终端设备或用户所处于的真实世界中的各种环境参数,例如,关于天气、温度、湿度、行进速度等的数据。 System 100 may include a multimodal data input interface 110 for obtaining multimodal input data. The multimodal data input interface 110 can collect various types of input data from various data sources. For example, in the case of playing the target content to the user, the multimodal data input interface 110 may collect data such as image, audio, and barrage files of the target content. Herein, the target content may broadly refer to various media content played on a device or presented to a user, for example, video content, audio content, picture content, text content, and the like. For example, in the case that the intelligent conversation subject can chat with the user, the multimodal data input interface 110 can obtain input data about the chat conversation. For example, the multimodal data input interface 110 may collect images and/or audio around the user through a camera and/or a microphone on the terminal device. For example, the multimodal data input interface 110 can also obtain external environment data from a third-party application or any other information source. Herein, the external environment data may broadly refer to various environmental parameters in the real world where the terminal device or the user is located, for example, data about weather, temperature, humidity, travel speed, and the like.
多模态数据输入接口110可以将所获得的多模态输入数据112提供给系统100中的核心处理单元120。核心处理单元120提供反应式响应生成所需要的各种核心处理能力。基于处理阶段和类型, 核心处理单元120可以进而包括多个处理模块,例如,数据整合处理模块130、场景逻辑处理模块140、多模态输出数据生成模块150等。The multimodal data input interface 110 may provide the obtained multimodal input data 112 to the core processing unit 120 in the system 100 . The core processing unit 120 provides various core processing capabilities required for reactive response generation. Based on the processing stage and type, the core processing unit 120 may further include multiple processing modules, for example, a data integration processing module 130, a scene logic processing module 140, a multimodal output data generation module 150, and the like.
数据整合处理模块130可以从多模态输入数据112中提取不同类型的多模态的信息,所提取的多模态的信息可以是在特定场景和时序条件下而处于同一上下文环境中的。在一种实现方式中,数据整合处理模块130可以从多模态输入数据112中提取一个或多个信息元素132。在本文中,信息元素可以广泛地指从原始数据中提取的计算机可理解的信息或信息表示。在一个方面,数据整合处理模块130可以从多模态输入数据112所包括的目标内容中提取信息元素,例如,从目标内容的图像、音频、弹幕文件等中提取信息元素。示例性地,从目标内容的图像中提取的信息元素可以包括例如人物特征、文本、图像光线、物体等,从目标内容的音频中提取的信息元素可以包括例如音乐、语音等,从目标内容的弹幕文件中提取的信息元素可以包括例如弹幕文本等。在本文中,音乐可以广泛地指歌曲演唱、器乐演奏或者其组合,语音可以广泛地指讲话的声音。在一个方面,数据整合处理模块130可以从多模态输入数据112所包括的聊天会话中提取信息元素,例如,消息文本。在一个方面,数据整合处理模块130可以从多模态输入数据112所包括的采集的图像中提取例如对象特征等信息元素。在一个方面,数据整合处理模块130可以从多模态输入数据112所包括的采集的音频中提取例如语音、音乐等信息元素。在一个方面,数据整合处理模块130可以从多模态输入数据112所包括的外界环境数据中提取例如外界环境信息等信息元素。The data integration processing module 130 can extract different types of multi-modal information from the multi-modal input data 112 , and the extracted multi-modal information can be in the same context under specific scenarios and time sequence conditions. In one implementation, the data integration processing module 130 can extract one or more information elements 132 from the multimodal input data 112 . In this context, information elements can broadly refer to computer-understandable information or information representations extracted from raw data. In one aspect, the data integration processing module 130 may extract information elements from the target content included in the multimodal input data 112, for example, extract information elements from images, audio, bullet chat files, etc. of the target content. Exemplarily, the information elements extracted from the image of the target content may include, for example, character features, text, image light, objects, etc., the information elements extracted from the audio of the target content may include, for example, music, voice, etc., and the information elements extracted from the target content The information elements extracted from the bullet chat file may include, for example, bullet chat text and the like. Herein, music may broadly refer to song singing, instrumental performance, or a combination thereof, and speech may broadly refer to the sound of speech. In one aspect, data integration processing module 130 may extract informational elements, such as message text, from chat sessions included in multimodal input data 112 . In one aspect, the data integration processing module 130 can extract information elements, such as object features, from the captured images included in the multimodal input data 112 . In one aspect, the data integration processing module 130 may extract information elements such as speech, music, etc. from the collected audio included in the multimodal input data 112 . In one aspect, the data integration processing module 130 may extract information elements such as external environment information from the external environment data included in the multimodal input data 112 .
场景逻辑处理模块140可以至少基于信息元素132来生成一个或多个参考信息项142。在本文中,参考信息项可以广泛地指基于各种信息元素所生成的、供系统100在产生多模态输出数据时所参考的各种引导性信息。在一个方面,参考信息项142可以包括情感标签,该情感标签可以引导多模态输出数据所要呈现或基于的情感。在一个方面,参考信息项142可以包括动画标签,在多模态输出数据将要包括动画序列的情况下,该动画标签可以用于选择所要呈现的动画。在一个方面,参考信息项142可以包括评论文本,该评论文本可以是针对例如目标内容的评论,以便表达智能会话主体自己对于目标内容的观点或评价等。在一个方面,参考信息项142可以包括聊天响应文本,该聊天响应文本可以是对来自聊天会话的消息文本的响应。应当理解,可选地,场景逻辑处理模块140还可以在生成参考信息项142的过程中考虑更多其它因素,例如,场景特定情感、智能会话主体的预设个性、智能会话主体的预设角色等。The scene logic processing module 140 may generate one or more reference information items 142 based at least on the information elements 132 . Herein, a reference information item may broadly refer to various guiding information generated based on various information elements for reference by the system 100 when generating multimodal output data. In one aspect, the reference information item 142 can include an emotion tag that can guide the emotion that the multimodal output data is presented or based on. In one aspect, the reference information item 142 may include an animation tag, which may be used to select the animation to be presented where the multimodal output data is to include an animation sequence. In one aspect, the reference information item 142 may include comment text, and the comment text may be, for example, a comment on the target content, so as to express the intelligent conversation subject's own opinion or evaluation on the target content. In one aspect, reference information item 142 may include chat response text, which may be a response to message text from a chat session. It should be understood that, optionally, the scene logic processing module 140 may also consider more other factors in the process of generating the reference information item 142, for example, scene-specific emotion, preset personality of the intelligent conversation subject, preset role of the intelligent conversation subject Wait.
The multimodal output data generation module 150 may produce multimodal output data 152 by using at least the reference information items 142. The multimodal output data 152 may include multiple types of output data, e.g., speech, text, animation sequences, etc. The speech included in the multimodal output data 152 may be, e.g., speech corresponding to the comment text or the chat response text; the text included in the multimodal output data 152 may be, e.g., text corresponding to the comment text or the chat response text; and the animation sequence included in the multimodal output data 152 may be, e.g., an animation sequence of an avatar of the intelligent conversation subject. It should be understood that, optionally, the multimodal output data generation module 150 may also consider more other factors during the process of generating the multimodal output data 152, e.g., scene-specific requirements, etc.
The system 100 may include a multimodal data output interface 160 for providing the multimodal output data 152. The multimodal data output interface 160 may support providing or presenting multiple types of output data to a user. For example, the multimodal data output interface 160 may present text, animation sequences, etc. via a display screen, and may play speech, etc. via a speaker.
It should be understood that the architecture of the multimodality-based reactive response generation system 100 described above is merely exemplary, and the system 100 may include more or fewer component units or modules according to actual application requirements and designs. In addition, it should be understood that the system 100 may be implemented by hardware, software, or a combination thereof. For example, in one case, the multimodal data input interface 110, the core processing unit 120, and the multimodal data output interface 160 may be hardware-based units; e.g., the core processing unit 120 may be implemented by a processor, controller, etc. having data processing capability, while the multimodal data input interface 110 and the multimodal data output interface 160 may be implemented by hardware interface units having data input/output capability. For example, in one case, the units or modules included in the system 100 may also be implemented by software or programs, such that these units or modules may be software units or software modules. In addition, it should be understood that the units and modules included in the system 100 may be implemented at a terminal device, or may be implemented at a network device or platform, or may be partly implemented at a terminal device and partly implemented at a network device or platform.
FIG. 2 illustrates an exemplary process 200 for multimodality-based reactive response generation according to an embodiment. The steps or operations in the process 200 may be performed by, e.g., corresponding units or modules in the multimodality-based reactive response generation system 100 in FIG. 1.
At 210, multimodal input data 212 may be obtained. Exemplarily, depending on the application scenario, the multimodal input data 212 may include at least one of, e.g., images of target content, audio of target content, a bullet chat file of target content, a chat session, captured images, captured audio, external environment data, etc. For example, in scenarios where target content exists, e.g., an intelligent animated character scenario, a virtual anchor scenario, etc., data such as images, audio, and bullet chat files of the target content may be obtained at 210. For example, in a scenario where the intelligent conversation subject supports a chat function, data about a chat session may be obtained at 210, including chat records in the chat session, etc. For example, in a scenario where the terminal device implementing the intelligent conversation subject has a camera or a microphone, data such as images captured by the camera and audio captured by the microphone may be obtained at 210. For example, in a scenario where the intelligent conversation subject has the ability to acquire external environment data, various kinds of external environment data may be obtained at 210. It should be understood that the multimodal input data 212 is not limited to the exemplary input data described above.
At 220, one or more information elements 222 may be extracted from the multimodal input data 212. Depending on the specific input data included in the multimodal input data 212, corresponding information elements may be extracted from each of these input data respectively.
Where the multimodal input data 212 includes images of the target content, character features may be extracted from the images of the target content. Taking the target content being a concert video played on a terminal device as an example, various character features of the singer may be extracted from images of the video, e.g., facial expressions, body movements, clothing colors, etc. It should be understood that the embodiments of the present disclosure are not limited to any specific character feature extraction technique.
Where the multimodal input data 212 includes images of the target content, text may be recognized from the images of the target content. In one implementation, text may be recognized from an image by a text recognition technique such as optical character recognition (OCR). Still taking the target content being a concert video as an example, some images in the video may contain music information, e.g., song title, lyricist, composer, singer, performer, etc., and accordingly such music information may be obtained through text recognition. It should be understood that the embodiments of the present disclosure are not limited to recognizing text by OCR, and any other text recognition technique may be adopted. In addition, the text recognized from images of the target content is not limited to music information, and may also include any other text indicating information related to events occurring in the images, e.g., subtitles, lyrics, etc.
Where the multimodal input data 212 includes images of the target content, image light may be detected from the images of the target content. Image light may refer to characteristics of the ambient light within the picture presented by the image, e.g., bright, dim, gloomy, flickering, etc. Still taking the target content being a concert video as an example, assuming that the singer is singing a cheerful song, the stage at the concert may use bright lighting, and thus the image light may be detected as bright from these images. It should be understood that the embodiments of the present disclosure are not limited to any specific image light detection technique.
Where the multimodal input data 212 includes images of the target content, objects may be recognized from the images of the target content. The recognized objects may be, e.g., representative objects in the image, objects appearing at prominent or important positions in the image, objects associated with persons in the image, etc. For example, the recognized objects may include props, background furnishings, etc. Still taking the target content being a concert video as an example, assuming that the singer plays a guitar slung over the shoulder while singing a song, the object "guitar" may be recognized from the images. It should be understood that the embodiments of the present disclosure are not limited to any specific object recognition technique.
Where the multimodal input data 212 includes audio of the target content, music may be extracted from the audio of the target content. The target content itself may be audio, e.g., a song played to a user on a terminal device, and accordingly the music corresponding to the song may be extracted from the audio. In addition, the target content may also be a video, e.g., a concert video, and accordingly music may be extracted from the audio contained in the video. Herein, music may broadly include, e.g., pieces played by musical instruments, songs sung by singers, special effect sounds produced by dedicated equipment or voice actors, and so on. The extracted music may be background music, foreground music, etc. Moreover, music extraction may broadly refer to, e.g., obtaining a sound file, sound wave data, etc. corresponding to the music. It should be understood that the embodiments of the present disclosure are not limited to any specific music extraction technique.
Where the multimodal input data 212 includes audio of the target content, speech may be extracted from the audio of the target content. Herein, speech may refer to the sound of speaking. For example, when the target content includes conversations, speeches, comments, etc. of persons or characters, the corresponding speech may be extracted from the audio of the target content. Speech extraction may broadly refer to, e.g., obtaining a sound file, sound wave data, etc. corresponding to the speech. It should be understood that the embodiments of the present disclosure are not limited to any specific speech extraction technique.
Where the multimodal input data 212 includes a bullet chat file of the target content, bullet chat text may be extracted from the bullet chat file of the target content. In some cases, some video playback applications or platforms allow different viewers of a video to send their own comments, feelings, etc. in the form of bullet chats, and these comments, feelings, etc. may be included as bullet chat text in a bullet chat file attached to the video; therefore, the bullet chat text may be extracted from the bullet chat file. It should be understood that the embodiments of the present disclosure are not limited to any specific bullet chat text extraction technique.
Where the multimodal input data 212 includes a chat session, message text may be extracted from the chat session. The message text may include, e.g., the text of chat messages sent by the intelligent conversation subject, the text of chat messages sent by at least one other chat participant, etc. Where the chat session is conducted in text form, the message text may be extracted directly from the chat session, and where the chat session is conducted in speech form, speech messages in the chat session may be converted into message text through speech recognition. It should be understood that the embodiments of the present disclosure are not limited to any specific message text extraction technique.
Where the multimodal input data 212 includes captured images, object features may be extracted from the captured images. Object features may broadly refer to various features of objects appearing in the captured images, where the objects may include, e.g., persons, physical objects, etc. For example, where an image of a computer user is captured by a computer camera, various features of the user, e.g., facial expressions, body movements, etc., may be extracted from the image. For example, where an image of the road ahead of a car is captured by a camera installed on the car, various features of, e.g., vehicles ahead, traffic signs, roadside buildings, etc. may be extracted from the image. It should be understood that the embodiments of the present disclosure are not limited to extracting the above exemplary object features from captured images, and any other object features may also be extracted. Moreover, the embodiments of the present disclosure are not limited to any specific object feature extraction technique.
Where the multimodal input data 212 includes captured audio, speech and/or music may be extracted from the captured audio. Speech, music, etc. may be extracted from the captured audio in a manner similar to the above-described extraction of speech, music, etc. from the audio of the target content.
Where the multimodal input data 212 includes external environment data, external environment information may be extracted from the external environment data. For example, specific weather information may be extracted from data about weather, specific temperature information may be extracted from data about temperature, specific speed information may be extracted from data about travel speed, and so on. It should be understood that the embodiments of the present disclosure are not limited to any specific external environment information extraction technique.
It should be understood that the information elements extracted from the multimodal input data 212 described above are all exemplary, and the embodiments of the present disclosure may also extract any other types of information elements. In addition, the extracted information elements may belong to the same context under specific scenario and timing conditions; e.g., these information elements may be aligned in time, and accordingly different combinations of information elements may be extracted at different points in time.
At 230, one or more reference information items 232 may be generated based at least on the information elements 222.
According to an embodiment of the present disclosure, the reference information items 232 generated at 230 may include an emotion tag. The emotion tag may indicate, e.g., an emotion type, an emotion level, etc. The embodiments of the present disclosure may cover any number of predetermined emotion types, and any number of emotion levels defined for each emotion type. Exemplary emotion types may include, e.g., happiness, sadness, anger, etc., and exemplary emotion levels may include level 1, level 2, level 3, etc. according to emotion intensity from low to high. Accordingly, if the emotion tag <happy, level 2> is determined at 230, it indicates that the information elements 222 as a whole express the emotion of happiness and that the emotion level is the medium level 2. It should be understood that the above exemplary emotion types, exemplary emotion levels, and their expressions are given merely for ease of explanation, and the embodiments of the present disclosure may also adopt more or fewer of any other emotion types and any other emotion levels, and may adopt any other expressions.
The emotion expressed by each type of information element may first be determined, and these emotions may then be considered together to determine the final emotion type and emotion level. For example, one or more emotion representations respectively corresponding to one or more of the information elements 222 may be generated first, and a final emotion tag may then be generated based at least on these emotion representations. Herein, an emotion representation may refer to an informational representation of emotion, which may take the form of, e.g., an emotion vector, an emotion label, etc. An emotion vector may include multiple dimensions for representing an emotion distribution, each dimension corresponding to an emotion type, and the value in each dimension indicating the predicted probability or weight of the corresponding emotion type.
Where the information elements 222 include character features extracted from images of the target content, e.g., a pre-trained machine learning model may be used to generate an emotion representation corresponding to the character features. Taking facial expressions among the character features as an example, e.g., a convolutional neural network model for facial emotion recognition may be adopted to predict the corresponding emotion representation. Similarly, the convolutional neural network model may also be trained to further take into account other features that may be included in the character features, e.g., body movements, to predict the emotion representation. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to character features.
Where the information elements 222 include text recognized from images of the target content, taking the text being music information as an example, emotion information corresponding to the music may be retrieved from a pre-established music database based on the music information, so as to form an emotion representation. The music database may include pre-collected music information of a large amount of music as well as corresponding emotion information, music genre, background knowledge, chat corpus, etc. The music database may be indexed by various kinds of music information such as song title, singer, performer, etc., so that emotion information corresponding to a specific piece of music may be found from the music database based on the music information. Optionally, since different music genres may also generally indicate different emotions, the music genre found from the music database may also be used to form the emotion representation. In addition, taking the recognized text being subtitles of words spoken by a person in the image as an example, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the subtitles. The machine learning model may be, e.g., an emotion classification model based on a convolutional neural network. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to text recognized from images of the target content.
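A minimal sketch of the music-database lookup follows; the database layout, the (song title, singer) lookup key, and the stored fields are illustrative assumptions rather than the patented data model:

```python
# Hypothetical pre-built music database indexed by music information.
MUSIC_DB = {
    # (song_title, singer) -> pre-collected metadata
    ("Example Song", "Example Singer"): {
        "emotion": {"happy": 0.7, "sad": 0.1, "angry": 0.2},  # emotion information
        "genre": "pop",                                        # music genre
    },
}

def lookup_music_emotion(song_title: str, singer: str):
    """Return the recorded emotion distribution and genre for this music, if any."""
    entry = MUSIC_DB.get((song_title, singer))
    if entry is None:
        return None
    return {"emotion": entry["emotion"], "genre": entry["genre"]}

# e.g. lookup_music_emotion("Example Song", "Example Singer")
```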
Where the information elements 222 include an object recognized from images of the target content, the emotion representation corresponding to the object may be determined based on a pre-established machine learning model or preset heuristic rules. In some cases, objects in an image may also help express emotion. For example, if the image shows that multiple red ornaments are arranged on the stage to enhance the atmosphere, these red ornaments recognized from the image may help determine an emotion such as happiness or joy. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to an object recognized from images of the target content.
Where the information elements 222 include music extracted from the audio of the target content, the emotion representation corresponding to the music may be determined or generated in a number of ways. In one manner, if music information has already been recognized, emotion information corresponding to the music may be found from the music database based on the music information, so as to form an emotion representation. In another manner, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the music based on multiple music features extracted from the music. The music features may include the audio average energy (AE) of the music, denoted as

AE = (1/N) Σ_t x(t)²

where x is the discrete audio input signal, t is time, and N is the number of samples of the input signal x. The music features may also include rhythm features extracted from the music, represented by the number of beats and/or the distribution of beat intervals. Optionally, the music features may also include the above-described emotion information corresponding to the music obtained by using the music information. A machine learning model may be trained based on one or more of the above music features, so that the trained machine learning model is able to predict the emotion representation of music. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to music extracted from the audio of the target content.
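A minimal sketch of computing the two music features named above (audio average energy and beat-interval statistics) is given below; it is not the patented implementation, and it assumes librosa is available for beat tracking:

```python
import numpy as np
import librosa

def audio_average_energy(x: np.ndarray) -> float:
    """AE = (1/N) * sum_t x(t)^2 over the N samples of the discrete signal x."""
    n = len(x)
    return float(np.sum(x.astype(np.float64) ** 2) / n)

def rhythm_features(x: np.ndarray, sr: int) -> dict:
    """Number of beats and distribution (mean/std) of beat intervals in seconds."""
    _tempo, beat_frames = librosa.beat.beat_track(y=x, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    intervals = np.diff(beat_times)
    return {
        "num_beats": int(len(beat_times)),
        "interval_mean": float(intervals.mean()) if len(intervals) else 0.0,
        "interval_std": float(intervals.std()) if len(intervals) else 0.0,
    }

# Usage sketch: build the feature vector fed to an emotion prediction model.
# x, sr = librosa.load("concert_clip.wav", sr=None)
# features = [audio_average_energy(x), *rhythm_features(x, sr).values()]
```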
Where the information elements 222 include speech extracted from the audio of the target content, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the speech. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to speech extracted from the audio of the target content.
Where the information elements 222 include bullet chat text extracted from the bullet chat file of the target content, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the bullet chat text. The machine learning model may be, e.g., an emotion classification model based on a convolutional neural network, denoted as CNN_sen. Assuming that the words in the bullet chat text are denoted as [d_0, d_1, d_2, …], the emotion vector corresponding to the bullet chat text may be predicted by the emotion classification model CNN_sen as [s_0, s_1, s_2, …] = CNN_sen([d_0, d_1, d_2, …]), where each dimension of the emotion vector [s_0, s_1, s_2, …] corresponds to one emotion category. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to bullet chat text extracted from the bullet chat file of the target content.
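A minimal sketch of a CNN_sen-style classifier is shown below; the vocabulary size, embedding dimension, kernel sizes, and number of emotion categories are illustrative assumptions, not values from the disclosure. It maps the word ids of a bullet chat text [d_0, d_1, …] to an emotion vector [s_0, s_1, …], one dimension per emotion category:

```python
import torch
import torch.nn as nn

class CNNSen(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, num_emotions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Parallel convolutions over 2-, 3-, 4-gram windows of word embeddings.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, 64, kernel_size=k) for k in (2, 3, 4)
        )
        self.fc = nn.Linear(64 * 3, num_emotions)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        # word_ids: (batch, seq_len) -> embeddings: (batch, embed_dim, seq_len)
        emb = self.embed(word_ids).transpose(1, 2)
        pooled = [torch.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        logits = self.fc(torch.cat(pooled, dim=1))
        return torch.softmax(logits, dim=1)   # emotion vector [s_0, s_1, ...]

# emotion_vector = CNNSen()(torch.tensor([[12, 87, 5, 0, 0]]))
```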
Where the information elements 222 include message text extracted from a chat session, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the message text. The machine learning model may be established in a manner similar to the above-described machine learning model for generating an emotion representation corresponding to bullet chat text. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to message text extracted from a chat session.
Where the information elements 222 include object features extracted from captured images, a pre-trained machine learning model may be used to generate an emotion representation corresponding to the object features. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to object features extracted from captured images.
Where the information elements 222 include speech and/or music extracted from captured audio, an emotion representation corresponding to the speech and/or music may be generated. The emotion representation corresponding to the speech and/or music extracted from the captured audio may be generated in a manner similar to the above-described determination of an emotion representation corresponding to speech and/or music extracted from the audio of the target content. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to speech and/or music extracted from captured audio.
Where the information elements 222 include external environment information extracted from external environment data, the emotion representation corresponding to the external environment information may be determined based on a pre-established machine learning model or preset heuristic rules. Taking the external environment information being "rainy" weather as an example, since people often show slightly sad emotions in rainy weather, an emotion representation corresponding to sadness may be determined from this external environment information. It should be understood that the embodiments of the present disclosure are not limited to any specific technique for determining an emotion representation corresponding to external environment information extracted from external environment data.
After the one or more emotion representations respectively corresponding to one or more of the information elements 222 are generated as described above, a final emotion tag may be generated based at least on these emotion representations. The final emotion tag may be understood as indicating the overall emotion determined by comprehensively considering multiple information elements. The emotion tag may be formed from multiple emotion representations in various ways. For example, where the emotion representations take the form of emotion vectors, multiple emotion representations may be superimposed to obtain a total emotion vector, and the emotion type and emotion level may be derived from the emotion distribution in the total emotion vector to form the final emotion tag. For example, where the emotion representations take the form of emotion labels, the final emotion tag may be computed, selected, or determined from multiple emotion labels corresponding to multiple information elements based on predetermined rules. It should be understood that the embodiments of the present disclosure are not limited to any specific manner of generating an emotion tag based on multiple emotion representations.
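A minimal sketch of the emotion-vector route follows; the emotion categories, the simple summation, and the level thresholds are illustrative assumptions. It superimposes per-element emotion vectors into a total vector and derives an <emotion type, emotion level> tag from its distribution:

```python
from typing import List, Tuple
import numpy as np

EMOTIONS = ["happy", "sad", "angry"]          # assumed predefined emotion types

def emotion_tag(emotion_vectors: List[np.ndarray]) -> Tuple[str, int]:
    total = np.sum(emotion_vectors, axis=0)
    total = total / total.sum()               # normalize the superimposed distribution
    dominant = int(np.argmax(total))
    weight = float(total[dominant])
    # Map the dominant weight to a discrete intensity level 1..3 (assumed thresholds).
    level = 1 if weight < 0.5 else (2 if weight < 0.8 else 3)
    return EMOTIONS[dominant], level

# e.g. emotion_tag([np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])])
#      -> ("happy", 2)
```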
It should be understood that although the above discussion involves first generating, at 230, multiple emotion representations respectively corresponding to multiple information elements and then generating the emotion tag based on these emotion representations, alternatively, the embodiments of the present disclosure may also generate the emotion tag directly based on multiple information elements. For example, a machine learning model may be pre-trained, which may be trained to take multiple information elements as multiple input features and predict the emotion tag accordingly. Thus, the trained model may be used to generate the emotion tag directly based on the information elements 222.
According to an embodiment of the present disclosure, the reference information items 232 generated at 230 may include an animation tag. Where the multimodal output data is to include an animation sequence of the avatar of the intelligent conversation subject, the animation tag may be used to select the animation to be presented. The animation tag may indicate at least one or a combination of, e.g., a facial expression type, a body movement type, etc. of the avatar. Facial expressions may include, e.g., smiling, laughing, blinking, pouting, speaking, etc., and body movements may include, e.g., turning left, waving, swaying the body, dance moves, etc.
At least one of the information elements 222 may be mapped to an animation tag according to predetermined rules. For example, various animation tags may be predefined, and a large number of mapping rules from information element sets to animation tags may be predefined, where an information element set may include one or more information elements. Thus, given an information element set including one or more information elements, the corresponding animation tag may be determined, with reference to the predefined mapping rules, based on one information element or a combination of multiple information elements in the set. One exemplary mapping rule is: when the character features extracted from images of the target content indicate a singing action of a person, and the bullet chat text includes key words such as "nice to listen to" and "intoxicated", these information elements may be mapped to animation tags such as "close both eyes" and "sway the body", so that the avatar can exhibit behaviors such as listening to the song with intoxication. One exemplary mapping rule is: when the speech extracted from the audio of the target content indicates that people are quarreling, the bullet chat text includes key words such as "noise" and "don't want to listen", and the message text extracted from the chat session includes key words indicating the user's disgust, these information elements may be mapped to animation tags such as "cover ears with hands" and "shake head", so that the avatar can exhibit behaviors such as not wanting to hear the quarrel. One exemplary mapping rule is: when the image light detected from images of the target content indicates rapid changes between light and dark, the object recognized from images of the target content is a guitar, and the music extracted from the audio of the target content indicates a fast-paced piece, these information elements may be mapped to animation tags such as "play the guitar" and "fast-paced dance moves", so that the avatar can exhibit behaviors such as playing and dancing along with the lively music. It should be understood that the above merely lists several exemplary mapping rules, and the embodiments of the present disclosure may also define a large number of any other mapping rules.
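A minimal sketch of such a rule table is shown below; the rule conditions, key words, and tag names are illustrative assumptions loosely mirroring the exemplary rules above, not an exhaustive or authoritative rule set:

```python
from typing import Callable, Dict, List, Tuple

# Each rule: a predicate over the extracted information elements -> animation tags.
MAPPING_RULES: List[Tuple[Callable[[Dict], bool], List[str]]] = [
    (
        lambda e: "singing" in e.get("character_features", [])
        and any(k in " ".join(e.get("bullet_chat_text", []))
                for k in ("nice to listen to", "intoxicated")),
        ["close_both_eyes", "sway_body"],
    ),
    (
        lambda e: e.get("image_light") == "rapid_flicker"
        and "guitar" in e.get("objects", [])
        and e.get("music_tempo") == "fast",
        ["play_guitar", "fast_dance"],
    ),
]

def animation_tags(elements: Dict) -> List[str]:
    """Return the animation tags of every predefined rule whose condition matches."""
    tags: List[str] = []
    for condition, rule_tags in MAPPING_RULES:
        if condition(elements):
            tags.extend(rule_tags)
    return tags
```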
In addition, optionally, the animation tag may also be generated further based on the emotion tag. For example, the emotion tag may be used together with the information elements to define mapping rules, so that the corresponding animation tag may be determined based on a combination of information elements and the emotion tag. In addition, optionally, direct mapping rules from emotion tags to animation tags may also be defined, so that after the emotion tag is generated, the corresponding animation tag may be determined directly based on the emotion tag with reference to the defined mapping rules. For example, a mapping rule may be defined from the emotion tag <sad, level 2> to animation tags such as "cry" and "wipe tears with hands".
According to an embodiment of the present disclosure, the reference information items 232 generated at 230 may include comment text. The comment text may be a comment on, e.g., the target content, so as to express the intelligent conversation subject's own opinion or evaluation of the target content. The comment text may be selected from the bullet chat text of the target content. Exemplarily, a comment generation model constructed based on two-tower models may be used to select comment text from the bullet chat text. The bullet chat text of the target content may be aligned in time with the images and/or audio of the target content, where being aligned in time may refer to being located at the same moment or within the same time period. The bullet chat text at a specific moment may include multiple sentences, which may be comments of different viewers on the images and/or audio of the target content at that moment or in an adjacent time period. At each moment, the comment generation model may select a suitable sentence from the corresponding bullet chat text as the comment text for the images and/or audio of the target content at that moment or in an adjacent time period. For example, two-tower models may be used to determine the degree of matching between a sentence in the bullet chat text of the target content and the images and/or audio of the target content, and the sentence with the highest matching degree may be selected from the bullet chat text as the comment text. The comment generation model may include, e.g., two two-tower models. For a sentence in the bullet chat text, one two-tower model may be used to output a first matching score based on an input target content image and the sentence, to indicate the degree of matching between the image and the sentence, while the other two-tower model may be used to output a second matching score based on input target content audio and the sentence, to indicate the degree of matching between the audio and the sentence. The first matching score and the second matching score may be combined in any manner to obtain a comprehensive matching score for the sentence. After multiple comprehensive matching scores of multiple sentences of the bullet chat text are obtained, the sentence with the highest matching score may be selected as the comment text for the current image and/or audio. It should be understood that the structure of the above comment generation model is merely exemplary, and the comment generation model may also include only one of the two two-tower models, or may be based on any other model trained to determine the degree of matching between sentences in the bullet chat text and the images and/or audio of the target content.
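A minimal sketch of the selection step is given below; the cosine-similarity scoring of the two towers' embeddings and the equal-weight combination of the two scores are illustrative assumptions (the disclosure only states that the two scores may be combined in any manner). The pre-trained tower encoders are assumed to be provided from outside:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_comment(sentences, image_emb, audio_emb, encode_text,
                   w_image=0.5, w_audio=0.5):
    """Return the bullet chat sentence whose combined image/audio matching score is highest.

    image_emb/audio_emb are the image-tower and audio-tower embeddings of the
    time-aligned target content; encode_text maps a sentence to its text-tower
    embedding (all towers are assumed pre-trained).
    """
    best_sentence, best_score = None, -np.inf
    for sentence in sentences:
        t = encode_text(sentence)
        score = w_image * cosine(t, image_emb) + w_audio * cosine(t, audio_emb)
        if score > best_score:
            best_sentence, best_score = sentence, score
    return best_sentence, best_score
```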
According to an embodiment of the present disclosure, if the intelligent conversation subject is chatting with at least one other chat participant in a chat session, the reference information items 232 generated at 230 may also include chat response text. The other chat participant may be, e.g., a user, another intelligent conversation subject, etc. After message text from the other chat participant is obtained, corresponding chat response text may be generated, through a chat engine, based at least on the message text.
In one implementation, any general-purpose chat engine may be adopted to generate the chat response text.
In one implementation, the chat engine may generate the chat response text based at least on the emotion tag. For example, the chat engine may be trained to generate chat response text based at least on the input message text and the emotion tag, so that the chat response text is generated at least under the influence of the emotion indicated by the emotion tag.
In one implementation, the intelligent conversation subject may exhibit the characteristic of emotion continuation in a chat session; e.g., the response of the intelligent conversation subject is influenced not only by the emotion of the currently received message text, but also by the emotional state that the intelligent conversation subject itself is currently in. As an example, assuming that the intelligent conversation subject is currently in a happy emotional state, then although the currently received message text may carry or cause a negative emotion such as anger, the intelligent conversation subject will not immediately give an angry response because of that message text, but may instead remain happy or only slightly lower the emotion level of its happiness. In contrast, existing chat engines usually determine the emotion type of a response only for the current round of conversation or only according to the currently received message text, so that the emotion type of the response may change frequently with the received message text, which does not conform to the behavior of humans, who are usually in a relatively stable emotional state while chatting and do not change their emotional state frequently. The intelligent conversation subject with the emotion continuation characteristic in chat sessions proposed by the embodiments of the present disclosure will therefore be more anthropomorphic. To implement the emotion continuation characteristic in a chat session, the chat engine may generate the chat response text based at least on an emotion representation from an emotion transfer network. The emotion transfer network is used to model dynamic emotion transitions; it can both maintain a stable emotional state and make appropriate adjustments or updates to the emotional state in response to the currently received message text. For example, the emotion transfer network may take the current emotion representation and the currently received message text as input and output an updated emotion representation, where the current emotion representation may be, e.g., a vector representation of the current emotional state of the intelligent conversation subject. The updated emotion representation contains both information reflecting the previous emotional state and information about the emotion change that may be caused by the current message text. The updated emotion representation may further be provided to the chat engine, so that the chat engine can generate the chat response text for the current message text under the influence of the received emotion representation.
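A minimal sketch of the emotion-continuation update follows; the interpolation-style rule and the retention factor are illustrative assumptions standing in for the learned emotion transfer network described above:

```python
import numpy as np

def update_emotion_state(current_state: np.ndarray,
                         message_emotion: np.ndarray,
                         retention: float = 0.8) -> np.ndarray:
    """Blend the previous emotion vector with the emotion of the current message.

    A retention close to 1.0 keeps the subject emotionally stable, so a single
    angry message only slightly dents a happy state instead of flipping it.
    """
    updated = retention * current_state + (1.0 - retention) * message_emotion
    return updated / updated.sum()            # keep it a distribution

# Happy subject receives an angry message (dimensions: happy, sad, angry):
# update_emotion_state(np.array([0.8, 0.1, 0.1]), np.array([0.05, 0.15, 0.8]))
# -> [0.65, 0.11, 0.24]: still predominantly happy, with a small shift toward anger
```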
In one implementation, the chat engine may be trained to be able to chat about the target content, i.e., to discuss topics related to the target content with another chat participant. Exemplarily, the chat engine may be a retrieval-based chat engine constructed based on, e.g., chat content among people in forums related to the target content. The construction of the chat engine may include processing in multiple aspects. In one aspect, chat corpus involving chat content among people may be crawled from forums related to the target content. In one aspect, a word vector model may be trained for finding the possible names of each named entity. For example, word vector techniques may be used to find related words of each named entity, and then, optionally, correct words may be retained from the related words, e.g., through manual checking, as possible names of the named entity. In one aspect, keywords may be extracted from the chat corpus. For example, statistics may be computed based on the word segmentation results of the related corpus and then compared with the statistical results of non-related corpus, thereby finding words with a large difference in term frequency-inverse document frequency (TF-IDF) as keywords. In one aspect, a deep retrieval model based on, e.g., a deep convolutional neural network, which is the core network of the chat engine, may be trained. The deep retrieval model may be trained by using message-reply pairs in the chat corpus as training data. The text in a message-reply pair may include the original sentences in the message and the reply, or the extracted keywords. In one aspect, an intent detection model may be trained, which may detect which specific target content the received message text is related to, so that a forum related to that target content can be selected from multiple forums. The intent detection model may be a binary classifier; specifically, it may be, e.g., a convolutional neural network text classification model. The positive samples for the intent detection model may come from chat corpus in forums related to the target content, while the negative samples may come from chat corpus in other forums or ordinary text. Through one or more of the above processes, and possibly any other processes, a retrieval-based chat engine may be constructed, which can provide chat response text in response to input message text, where the chat response text is based on the corpus from forums related to the target content.
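A minimal sketch of the contrastive TF-IDF keyword-extraction step is shown below; the use of scikit-learn's TfidfVectorizer, the mean-score comparison, and the top-k cutoff are illustrative assumptions about how the difference between related and non-related corpora could be measured:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def contrastive_keywords(related_docs, unrelated_docs, top_k=50):
    """Keep words whose TF-IDF is much higher in the related corpus than elsewhere."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(related_docs + unrelated_docs)      # shared vocabulary
    related_scores = np.asarray(vectorizer.transform(related_docs).mean(axis=0)).ravel()
    unrelated_scores = np.asarray(vectorizer.transform(unrelated_docs).mean(axis=0)).ravel()
    diff = related_scores - unrelated_scores           # large gap => forum-specific word
    vocab = np.array(vectorizer.get_feature_names_out())
    return list(vocab[np.argsort(diff)[::-1][:top_k]])

# keywords = contrastive_keywords(forum_chat_corpus, generic_corpus)
```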
It should be understood that the processing discussed above of generating, at 230, the reference information items 232 including, e.g., the emotion tag, the animation tag, the comment text, the chat response text, etc. is exemplary; in other implementations, the process of generating the reference information items may also consider more other factors, e.g., scene-specific emotion, a preset personality of the intelligent conversation subject, a preset role of the intelligent conversation subject, etc.
Scene-specific emotion may refer to a preset emotion preference associated with a specific scenario. For example, in some scenarios the intelligent conversation subject may be required to respond as positively and optimistically as possible, and therefore scene-specific emotions that can lead to positive and optimistic responses, e.g., happiness, excitement, etc., may be preset for these scenarios. A scene-specific emotion may include an emotion type, or an emotion type and its emotion level. Scene-specific emotion may be used to influence the generation of reference information items. In one aspect, in the above process of generating the emotion tag, the scene-specific emotion may be taken as an input together with the information elements 222 so as to jointly generate the emotion tag. For example, the scene-specific emotion may be treated as an emotion representation, and this emotion representation may be used together with the multiple emotion representations respectively corresponding to the multiple information elements to generate the emotion tag. In one aspect, in the above process of generating the animation tag, the scene-specific emotion may be considered in a manner similar to the emotion tag; e.g., the scene-specific emotion may be used together with the information elements to define mapping rules. In one aspect, in the above process of generating the comment text, the ranking of multiple sentences in the bullet chat text may consider not only the degree of matching between these sentences and the images and/or audio of the target content, but also the degree of matching between the emotion information detected from these sentences and the scene-specific emotion. In one aspect, in the above process of generating the chat response text, the scene-specific emotion may be considered in a manner similar to the emotion tag. For example, the chat engine may use the input message text together with the scene-specific emotion and possibly the emotion tag to generate the chat response text.
The preset personality of the intelligent conversation subject may refer to personality traits preset for the intelligent conversation subject, e.g., lively and active, cute, mild-tempered, excitable, and so on. The responses made by the intelligent conversation subject may be made to conform to the preset personality as much as possible. The preset personality may be used to influence the generation of reference information items. In one aspect, in the above process of generating the emotion tag, the preset personality may be mapped to a corresponding emotional tendency, and this emotional tendency may be taken as an input together with the information elements 222 so as to jointly generate the emotion tag. For example, the emotional tendency may be treated as an emotion representation, which may be used together with the multiple emotion representations respectively corresponding to the multiple information elements to generate the emotion tag. In one aspect, in the above process of generating the animation tag, the preset personality may be used together with the information elements to define mapping rules. For example, a lively and active preset personality will be more conducive to determining animation tags with more body movements, a cute preset personality will be more conducive to determining animation tags with cute facial expressions, and so on. In one aspect, in the above process of generating the comment text, the ranking of multiple sentences in the bullet chat text may consider not only the degree of matching between these sentences and the images and/or audio of the target content, but also the degree of matching between the emotion information detected from these sentences and the emotional tendency corresponding to the preset personality. In one aspect, in the above process of generating the chat response text, the emotional tendency corresponding to the preset personality may be considered in a manner similar to the emotion tag. For example, the chat engine may use the input message text together with this emotional tendency and possibly the emotion tag to generate the chat response text.
The preset role of the intelligent conversation subject may refer to the role to be played by the intelligent conversation subject. Preset roles may be classified according to various criteria, e.g., roles such as a little girl or a middle-aged man classified by age and gender, roles such as a teacher, a doctor, or a police officer classified by occupation, and so on. The responses made by the intelligent conversation subject may be made to conform to the preset role as much as possible. The preset role may be used to influence the generation of reference information items. In one aspect, in the above process of generating the emotion tag, the preset role may be mapped to a corresponding emotional tendency, and this emotional tendency may be taken as an input together with the information elements 222 so as to jointly generate the emotion tag. For example, the emotional tendency may be treated as an emotion representation, which may be used together with the multiple emotion representations respectively corresponding to the multiple information elements to generate the emotion tag. In one aspect, in the above process of generating the animation tag, the preset role may be used together with the information elements to define mapping rules. For example, the preset role of a little girl will be more conducive to determining animation tags with cute facial expressions, more body movements, etc. In one aspect, in the above process of generating the comment text, the ranking of multiple sentences in the bullet chat text may consider not only the degree of matching between these sentences and the images and/or audio of the target content, but also the degree of matching between the emotion information detected from these sentences and the emotional tendency corresponding to the preset role. In one aspect, in the above process of generating the chat response text, the emotional tendency corresponding to the preset role may be considered in a manner similar to the emotion tag. For example, the chat engine may use the input message text together with this emotional tendency and possibly the emotion tag to generate the chat response text. In addition, the training corpus of the chat engine may also include more corpus corresponding to the preset role, so that the chat response text output by the chat engine better conforms to the language characteristics of the preset role.
According to the process 200, after the reference information items 232 are obtained, multimodal output data 242 may be produced at 240 by using at least the reference information items 232. The multimodal output data 242 is data to be provided or presented to the user, and may include various types of output data, e.g., speech of the intelligent conversation subject, text, an animation sequence of the avatar of the intelligent conversation subject, and so on.
The speech in the multimodal output data may be generated for the comment text, the chat response text, etc. in the reference information items. For example, the comment text, the chat response text, etc. may be converted into corresponding speech through any text-to-speech (TTS) conversion technique. Optionally, the TTS conversion process may be conditioned on the emotion tag, so that the generated speech carries the emotion indicated by the emotion tag.
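As a non-limiting sketch of this emotion-conditioned TTS step, the following Python fragment wraps an arbitrary synthesis backend; the EmotionalTTS class, the backend.tts call and the tag fields are assumptions made for illustration and do not correspond to any specific TTS library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmotionTag:
    emotion_type: str    # e.g. "happy", "excited"
    emotion_level: int   # e.g. 1 (mild) to 5 (strong)

class EmotionalTTS:
    """Hypothetical wrapper around an arbitrary TTS backend."""

    def __init__(self, backend):
        self.backend = backend   # any engine exposing text plus style controls

    def synthesize(self, text: str, tag: Optional[EmotionTag] = None,
                   speech_rate: float = 1.0) -> bytes:
        # Condition the synthesis on the emotion tag when one was generated at 230.
        style = {}
        if tag is not None:
            style = {"emotion": tag.emotion_type, "intensity": tag.emotion_level}
        return self.backend.tts(text, style=style, rate=speech_rate)

# Usage (backend is a placeholder object):
# audio = EmotionalTTS(backend).synthesize("What a great solo!", EmotionTag("excited", 4))
```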
The text in the multimodal output data may be visualized text corresponding to the comment text, the chat response text, etc. in the reference information items. The comment content, chat response content, etc. spoken by the intelligent conversation subject can thus be presented visually through this text. Optionally, the text may be generated with a predetermined font or presentation effect.
The animation sequence in the multimodal output data may be generated by using at least the animation tag and/or the emotion tag in the reference information items. An animation library for the avatar of the intelligent conversation subject may be built in advance. The animation library may include a large number of animation templates pre-created with the avatar of the intelligent conversation subject. Each animation template may include, e.g., multiple GIF images. Furthermore, the animation templates in the animation library may be indexed by animation tags and/or emotion tags, e.g., each animation template may be marked with at least one of a corresponding facial expression type, body movement type, emotion type, emotion level, etc. Therefore, when the reference information items 232 generated at 230 include an animation tag and/or an emotion tag, the animation tag and/or emotion tag may be used to select a corresponding animation template from the animation library. Preferably, after an animation template is selected, time adaptation may be performed on the animation template to form an animation sequence of the avatar of the intelligent conversation subject. The time adaptation aims to adjust the animation template so that it matches the time sequence of the speech corresponding to the comment text and/or the chat response text. For example, the durations of facial expressions, body movements, etc. in the animation template may be adjusted to match the duration of the speech of the intelligent animated character. As an example, during the time period in which the speech of the intelligent animated character is played, the images in the animation template in which the mouth opens and closes may be repeated continuously, thereby presenting the visual effect that the avatar is speaking. Moreover, it should be understood that the time adaptation is not limited to making the animation template match the time sequence of the speech corresponding to the comment text and/or the chat response text; it may also include making the animation template match the time sequence of one or more extracted information elements 222. For example, assuming that in the target content a singer is playing a guitar, that information elements such as the object "guitar" have been identified from the target content, and that these information elements have been mapped to a "playing guitar" animation tag, then during the time period in which the singer plays the guitar, the selected animation template corresponding to "playing guitar" may be repeated continuously, thereby presenting the visual effect that the avatar is playing the guitar along with the singer in the target content.
It should be understood that in different application scenarios the intelligent conversation subject may have different avatars, so that different animation libraries may be pre-established for the different avatars respectively.
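The selection-then-time-adaptation flow described above could be sketched roughly as follows; the animation library layout, the frame-duration arithmetic and all names are assumptions made for illustration, not the actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AnimationTemplate:
    frames: List[str]                    # e.g. file names of the GIF frames
    frame_duration: float                # seconds per frame
    animation_tag: Optional[str] = None  # e.g. "playing_guitar"
    emotion_type: Optional[str] = None   # e.g. "excited"

def select_template(library: List[AnimationTemplate],
                    animation_tag: Optional[str] = None,
                    emotion_type: Optional[str] = None) -> AnimationTemplate:
    """Pick the first template whose index fields match the given tags."""
    for tpl in library:
        if animation_tag and tpl.animation_tag != animation_tag:
            continue
        if emotion_type and tpl.emotion_type != emotion_type:
            continue
        return tpl
    return library[0]  # fall back to a neutral/idle template

def time_adapt(template: AnimationTemplate, speech_duration: float) -> List[str]:
    """Repeat the template frames until they cover the speech duration."""
    frames_needed = max(1, round(speech_duration / template.frame_duration))
    repeated = template.frames * (frames_needed // len(template.frames) + 1)
    return repeated[:frames_needed]

# Usage: choose a "speaking" template and stretch it over a 3.2-second utterance.
# sequence = time_adapt(select_template(library, emotion_type="happy"), 3.2)
```

The same time_adapt step could equally be keyed to the time span of an extracted information element, such as the "playing guitar" interval in the example above.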
It should be understood that the processing of generating the multimodal output data 242 at 240, including, e.g., the animation sequence, the speech, the text, etc., as discussed above is exemplary. In other implementations, the generation of the multimodal output data may further take other factors into account, e.g., scene-specific requirements; that is, the multimodal output data may be produced further based on scene-specific requirements. Taking scene-specific requirements into account enables the embodiments of the present disclosure to be adaptively applied in a variety of scenarios; e.g., multimodal output data suitable for a specific scenario may be adaptively output based on the output capabilities supported by different scenarios.
Scene-specific requirements may refer to the specific requirements of different application scenarios of the intelligent conversation subject. Scene-specific requirements may include, e.g., the types of multimodal output data supported, a predetermined speech-rate setting, a chat-mode setting, etc. associated with a specific scenario. In one aspect, different scenarios may have different data output capabilities; therefore, the types of multimodal output data supported by different scenarios may include outputting only one of speech, animation sequence and text, or outputting at least two of speech, animation sequence and text. For example, the intelligent animated character and virtual anchor scenarios require that the terminal device is at least able to support image and audio output, so the scene-specific requirements may indicate outputting one or more of speech, animation sequence and text. For example, a smart speaker scenario only supports audio output, so the scene-specific requirements may indicate outputting speech only. In one aspect, different scenarios may have different speech-rate preferences; therefore, the scene-specific requirements may include a predetermined speech-rate setting. For example, since in the intelligent animated character and virtual anchor scenarios the user can both watch the images and hear the speech, the speech rate may be set faster in order to express richer emotions. For example, in the smart speaker and smart in-vehicle assistant scenarios, the user can often only obtain, or only pay attention to, the speech output; therefore, the speech rate may be set slower, so that the user can clearly understand what the intelligent conversation entity intends to express through speech alone. In one aspect, different scenarios may have different chat-mode preferences; therefore, the scene-specific requirements may include a chat-mode setting. For example, in the smart in-vehicle assistant scenario, since the user may be driving a vehicle, the chit-chat output of the chat engine may be reduced in order not to distract the user too much. In addition, the chat-mode setting may also be associated with the captured images, the captured audio, the external environment data, and so on. For example, when the captured audio indicates that there is loud noise around the user, the speech output of the chat responses generated by the chat engine may be reduced. For example, when the external environment data indicates that the user is traveling fast, e.g., driving a vehicle at high speed, the chit-chat output of the chat engine may be reduced.
At 240, the multimodal output data may be produced based at least on the scene-specific requirements. For example, when the scene-specific requirements indicate that image output is not supported or that only speech output is supported, the generation of the animation sequence and the text may be skipped. For example, when the scene-specific requirements indicate a faster speech rate, the speech rate of the generated speech may be increased during the TTS conversion process. For example, when the scene-specific requirements indicate reducing chat response output under specific conditions, the generation of speech or text corresponding to the chat response text may be restricted.
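A minimal sketch of how such scene-specific requirements might gate the output stage is given below; the SceneProfile fields, the example profiles and the tts/animator callables are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class SceneProfile:
    supports_image: bool
    supports_audio: bool
    speech_rate: float        # 1.0 = normal, > 1.0 = faster
    allow_chitchat: bool

# Hypothetical profiles for the scenarios mentioned above.
SCENES = {
    "animated_character": SceneProfile(True,  True,  1.2, True),
    "smart_speaker":      SceneProfile(False, True,  0.9, True),
    "in_vehicle":         SceneProfile(False, True,  0.9, False),
}

def produce_output(scene: str, comment_text: str, chat_text: str, tts, animator):
    """Gate which modalities are generated according to the scene profile."""
    profile = SCENES[scene]
    output = {}
    if profile.supports_image:
        output["animation"] = animator(comment_text)   # animation sequence
        output["text"] = comment_text                  # visualized text
    if profile.supports_audio:
        output["comment_speech"] = tts(comment_text, rate=profile.speech_rate)
        if chat_text and profile.allow_chitchat:
            output["chat_speech"] = tts(chat_text, rate=profile.speech_rate)
    return output
```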
At 250, the multimodal output data may be provided. For example, the animation sequence, the text, etc. are displayed on a display screen, the speech is played through a loudspeaker, and so on.
It should be understood that the process 200 may be performed continuously, so as to continuously obtain multimodal input data and continuously provide multimodal output data.
FIG. 3 illustrates an example of an intelligent animated character scenario according to an embodiment. In the intelligent animated character scenario of FIG. 3, a user 310 may watch a video on a terminal device 320, and meanwhile an intelligent conversation entity according to an embodiment of the present disclosure may act as an intelligent animated character to accompany the user 310 in watching the video. The terminal device 320 may include, e.g., a display screen 330, a camera 322, a loudspeaker (not shown), a microphone (not shown), etc. A video 332 may be presented as the target content on the display screen 330. Furthermore, an avatar 334 of the intelligent conversation subject may also be presented on the display screen 330. The intelligent conversation subject may perform multimodality-based reactive response generation according to embodiments of the present disclosure, and accordingly the generated multimodality-based reactive responses may be provided on the terminal device 320 via the avatar 334. For example, in response to the content in the video 332, a chat session with the user 310, captured images and/or audio, obtained external environment data, etc., the avatar 334 may make facial expressions and body movements, utter speech, and so on.
FIG. 4 illustrates an exemplary process 400 of an intelligent animated character scenario according to an embodiment. The process 400 illustrates, e.g., the processing flow, data/information flow, etc. involved in the intelligent animated character scenario of FIG. 3. Furthermore, the process 400 may be regarded as a specific example of the process 200 in FIG. 2.
According to the process 400, multimodal input data may first be obtained, including at least one of, e.g., a video, external environment data, captured images, captured audio, a chat session, etc. The video, as the target content, may in turn include, e.g., images, audio, a bullet-chat file, etc. It should be understood that the obtained multimodal input data may be aligned in time and accordingly share the same context.
Information elements may be extracted from the multimodal input data. For example, character features, text, image light, objects, etc. are extracted from the images of the video; music, speech, etc. are extracted from the audio of the video; bullet-chat text is extracted from the bullet-chat file of the video; external environment information is extracted from the external environment data; object features are extracted from the captured images; music, speech, etc. are extracted from the captured audio; message text is extracted from the chat session; and so on.
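One conceivable way to organize this per-modality extraction is a simple dispatcher keyed by modality, as in the sketch below; the extractor callables are placeholders standing in for whatever vision, audio and text models an implementation actually uses, and all names are assumptions.

```python
from typing import Any, Callable, Dict, List

# Map each modality to a list of extractor callables (all hypothetical).
EXTRACTORS: Dict[str, List[Callable[[Any], dict]]] = {
    "video_image":    [],   # e.g. character features, OCR text, lighting, objects
    "video_audio":    [],   # e.g. music, speech transcription
    "bullet_file":    [],   # bullet-chat text
    "chat_session":   [],   # message text
    "captured_image": [],   # object features
    "captured_audio": [],   # speech and/or music
    "environment":    [],   # external environment information
}

# Example registration: pull the latest message text out of a chat session.
EXTRACTORS["chat_session"].append(
    lambda session: {"type": "message_text", "value": session[-1]["text"]})

def extract_information_elements(inputs: Dict[str, Any]) -> List[dict]:
    """Run every registered extractor on its modality and collect the elements."""
    elements = []
    for modality, payload in inputs.items():
        for extractor in EXTRACTORS.get(modality, []):
            elements.append(extractor(payload))
    return elements
```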
Reference information items may be generated based at least on the extracted information elements, including, e.g., at least one of an emotion tag, an animation tag, comment text and chat response text. The comment text may be generated through a comment generation model 430. The chat response text may be generated through a chat engine 450 and, optionally, an emotion transfer network 452.
Multimodal output data may be produced by using at least the generated reference information items, including, e.g., at least one of an animation sequence, comment speech, comment text, chat response speech, chat response text, etc. The animation sequence may be generated based on the description above in connection with FIG. 2. For example, animation selection 410 may be performed in the animation library by using the animation tag, the emotion tag, etc. so as to select an animation template, and animation sequence generation 420 may then be performed based on the selected animation template, so that the animation sequence is obtained through the time adaptation performed at animation sequence generation 420. The comment speech may be obtained by performing speech generation 440 (e.g., TTS conversion) on the comment text. The visualized comment text may be obtained based on the comment text. The chat response speech may be obtained by performing speech generation 460 (e.g., TTS conversion) on the chat response text. The visualized chat response text may be obtained based on the chat response text.
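By way of illustration only, the output stage of the process 400 might be glued together as in the following sketch, where animation_selection, animation_sequence_generation and speech_generation are hypothetical callables standing in for blocks 410, 420 and 440/460; none of these names or signatures is part of the disclosure.

```python
def generate_multimodal_output(reference, library, animation_selection,
                               animation_sequence_generation, speech_generation):
    """Sketch of blocks 410/420/440/460: select and adapt an animation, then run TTS."""
    output = {}
    if reference.get("animation_tag") or reference.get("emotion_tag"):
        template = animation_selection(library,
                                       reference.get("animation_tag"),
                                       reference.get("emotion_tag"))       # block 410
        output["animation_sequence"] = animation_sequence_generation(
            template, reference)                                           # block 420
    if reference.get("comment_text"):
        output["comment_speech"] = speech_generation(reference["comment_text"])       # block 440
        output["comment_text"] = reference["comment_text"]
    if reference.get("chat_response_text"):
        output["chat_response_speech"] = speech_generation(reference["chat_response_text"])  # block 460
        output["chat_response_text"] = reference["chat_response_text"]
    return output
```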
The produced multimodal output data may be provided on the terminal device. For example, the animation sequence, the comment text, the chat response text, etc. are presented on the display screen, and the comment speech, the chat response speech, etc. are played through the loudspeaker.
It should be understood that all the processing, data/information, etc. in the process 400 are exemplary; in practical applications, the process 400 may involve only one or more of such processing and data/information.
The multimodality-based reactive response generation according to embodiments of the present disclosure may be applied to perform a variety of tasks. In the following, only an exemplary intelligent animation generation task among these tasks is illustrated. It should be understood that the embodiments of the present disclosure are not limited to performing the intelligent animation generation task, but may also be used to perform various other tasks.
FIG. 5 illustrates an exemplary process 500 of intelligent animation generation according to an embodiment. The process 500 may be regarded as a specific implementation of the process 200 in FIG. 2. The intelligent animation generation of the process 500 is a specific application of the multimodality-based reactive response generation of the process 200. The intelligent animation generation of the process 500 may involve at least one of the generation of an animation sequence of the avatar, the generation of comment speech of the avatar, the generation of comment text, etc., performed in response to target content.
In the process 500, the multimodal input data obtaining step at 210 of FIG. 2 may be embodied as obtaining, at 510, at least one of an image, audio and a bullet-chat file of the target content.
In the process 500, the information element extraction step at 220 of FIG. 2 may be embodied as extracting, at 520, at least one information element from the image, audio and bullet-chat file of the target content. For example, character features, text, image light, objects, etc. are extracted from the image of the target content, music, speech, etc. are extracted from the audio of the target content, bullet-chat text is extracted from the bullet-chat file of the target content, and so on.
In the process 500, the reference information item generation step at 230 of FIG. 2 may be embodied as generating, at 530, at least one of an animation tag, an emotion tag and comment text. For example, the animation tag, the emotion tag, the comment text, etc. may be generated based at least on the at least one information element extracted at 520.
In the process 500, the multimodal output data production step at 240 of FIG. 2 may be embodied as producing, at 540, at least one of an animation sequence of the avatar, comment speech of the avatar and comment text by using at least one of the animation tag, the emotion tag and the comment text. Taking the animation sequence as an example, the animation sequence may be produced by using at least the animation tag and/or the emotion tag in the manner described above in connection with FIG. 2. Furthermore, the comment speech and the comment text may also be produced in the manner described above in connection with FIG. 2.
In the process 500, the multimodal output data providing step at 250 of FIG. 2 may be embodied as providing, at 550, at least one of the generated animation sequence, comment speech and comment text.
It should be understood that each step in the process 500 may be performed in a manner similar to that described above for the corresponding step in FIG. 2. Furthermore, the process 500 may also include any other processing described above for the process 200 of FIG. 2.
FIG. 6 illustrates a flowchart of an exemplary method 600 for multimodality-based reactive response generation according to an embodiment.
At 610, multimodal input data may be obtained.
At 620, at least one information element may be extracted from the multimodal input data.
At 630, at least one reference information item may be generated based at least on the at least one information element.
At 640, multimodal output data may be produced by using at least the at least one reference information item.
At 650, the multimodal output data may be provided.
In an implementation, the multimodal input data may include at least one of: an image of target content, audio of the target content, a bullet-chat file of the target content, a chat session, a captured image, captured audio, and external environment data.
Extracting at least one information element from the multimodal input data may include at least one of: extracting character features from the image of the target content; recognizing text from the image of the target content; detecting image light from the image of the target content; recognizing objects from the image of the target content; extracting music from the audio of the target content; extracting speech from the audio of the target content; extracting bullet-chat text from the bullet-chat file of the target content; extracting message text from the chat session; extracting object features from the captured image; extracting speech and/or music from the captured audio; and extracting external environment information from the external environment data.
In an implementation, generating at least one reference information item based at least on the at least one information element may include: generating at least one of an emotion tag, an animation tag, comment text and chat response text based at least on the at least one information element.
Generating the emotion tag based at least on the at least one information element may include: generating one or more emotion representations respectively corresponding to one or more information elements in the at least one information element; and generating the emotion tag based at least on the one or more emotion representations.
The emotion tag may indicate an emotion type and/or an emotion level.
Generating the animation tag based at least on the at least one information element may include: mapping the at least one information element to the animation tag according to predetermined rules.
The animation tag may indicate a facial expression type and/or a body movement type.
The animation tag may be generated further based on the emotion tag.
Generating the comment text based at least on the at least one information element may include: selecting the comment text from bullet-chat text of the target content.
The selecting of the comment text may include: determining, with a two-tower model, the matching degree between sentences in the bullet-chat text of the target content and the image and/or audio of the target content; and selecting the sentence with the highest matching degree from the bullet-chat text as the comment text.
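For illustration, a two-tower matcher of the kind referred to here could be sketched as follows; the encoder layers, the embedding sizes and the usage comments are assumptions made for this sketch, not the model actually trained in the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerMatcher(nn.Module):
    """Scores how well a bullet-chat sentence matches image/audio features."""

    def __init__(self, text_dim: int = 768, media_dim: int = 1024, shared_dim: int = 256):
        super().__init__()
        self.text_tower = nn.Sequential(nn.Linear(text_dim, shared_dim), nn.ReLU(),
                                        nn.Linear(shared_dim, shared_dim))
        self.media_tower = nn.Sequential(nn.Linear(media_dim, shared_dim), nn.ReLU(),
                                         nn.Linear(shared_dim, shared_dim))

    def forward(self, sentence_emb: torch.Tensor, media_emb: torch.Tensor) -> torch.Tensor:
        # Project both modalities into a shared space and compare by cosine similarity.
        t = F.normalize(self.text_tower(sentence_emb), dim=-1)
        m = F.normalize(self.media_tower(media_emb), dim=-1)
        return (t * m).sum(dim=-1)   # one matching score per sentence

# Selecting the best-matching sentence as the comment text:
# scores = matcher(sentence_embs, media_emb.unsqueeze(0))  # (N, 768) text vs (1, 1024) media
# comment_text = sentences[int(scores.argmax())]
```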
Generating the chat response text based at least on the at least one information element may include: generating, by a chat engine, the chat response text based at least on message text in the chat session.
The chat response text may be generated further based on the emotion tag.
The chat response text may be generated further based on an emotion representation from an emotion transfer network.
In an implementation, the at least one reference information item may be generated further based on at least one of: a scene-specific emotion; a preset personality of the intelligent conversation subject; and a preset role of the intelligent conversation subject.
In an implementation, the multimodal output data may include at least one of: an animation sequence of an avatar of the intelligent conversation subject; speech of the intelligent conversation subject; and text.
Producing the multimodal output data by using at least the at least one reference information item may include: generating speech and/or text corresponding to the comment text and/or the chat response text.
Producing the multimodal output data by using at least the at least one reference information item may include: selecting, with the animation tag and/or the emotion tag, a corresponding animation template from an animation library of the avatar of the intelligent conversation subject; and performing time adaptation on the animation template to form an animation sequence of the avatar of the intelligent conversation subject.
The time adaptation may include: adjusting the animation template to match the time sequence of the speech corresponding to the comment text and/or the chat response text.
In an implementation, the multimodal output data may be produced further based on scene-specific requirements.
The scene-specific requirements may include at least one of: outputting only one of speech, animation sequence and text; outputting at least two of speech, animation sequence and text; a predetermined speech-rate setting; and a chat-mode setting.
In an implementation, the multimodality-based reactive response generation may include intelligent animation generation. Obtaining the multimodal input data may include: obtaining at least one of an image, audio and a bullet-chat file of target content. Extracting at least one information element from the multimodal input data may include: extracting at least one information element from the image, audio and bullet-chat file of the target content. Generating at least one reference information item based at least on the at least one information element may include: generating at least one of an animation tag, an emotion tag and comment text based at least on the at least one information element. Producing the multimodal output data by using at least the at least one reference information item may include: producing at least one of an animation sequence of an avatar, comment speech of the avatar and comment text by using at least one of the animation tag, the emotion tag and the comment text. Providing the multimodal output data may include: providing at least one of the animation sequence, the comment speech and the comment text.
It should be understood that the method 600 may also include any steps/processes for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
FIG. 7 illustrates an exemplary apparatus 700 for multimodality-based reactive response generation according to an embodiment.
The apparatus 700 may include: a multimodal input data obtaining module 710 for obtaining multimodal input data; a data integration processing module 720 for extracting at least one information element from the multimodal input data; a scene logic processing module 730 for generating at least one reference information item based at least on the at least one information element; a multimodal output data generation module 740 for producing multimodal output data by using at least the at least one reference information item; and a multimodal output data providing module 750 for providing the multimodal output data.
Furthermore, the apparatus 700 may also include any other modules that perform the steps of the methods for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
FIG. 8 illustrates an exemplary apparatus 800 for multimodality-based reactive response generation according to an embodiment.
The apparatus 800 may include: at least one processor 810; and a memory 820 storing computer-executable instructions. When the computer-executable instructions are executed, the at least one processor 810 may perform any steps/processes of the methods for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure propose a multimodality-based reactive response generation system, including: a multimodal data input interface for obtaining multimodal input data; a core processing unit configured to extract at least one information element from the multimodal input data, generate at least one reference information item based at least on the at least one information element, and produce multimodal output data by using at least the at least one reference information item; and a multimodal data output interface for providing the multimodal output data. Furthermore, the multimodal data input interface, the core processing unit and the multimodal data output interface may also perform any relevant steps/processes of the methods for multimodality-based reactive response generation according to the embodiments of the present disclosure described above. Furthermore, the multimodality-based reactive response generation system may also include any other units and modules for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure propose a computer program product for multimodality-based reactive response generation, including a computer program that is run by at least one processor to perform any steps/processes of the methods for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
Embodiments of the present disclosure may be embodied in a non-transitory computer-readable medium. The non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any steps/processes of the methods for multimodality-based reactive response generation according to the embodiments of the present disclosure described above.
It should be understood that all the operations in the methods described above are exemplary only, and the present disclosure is not limited to any operation in the methods or the order of such operations, but should cover all other equivalent transformations under the same or similar concepts.
In addition, unless otherwise specified or clear from the context that a singular form is intended, the articles "a" and "an" as used in this specification and the appended claims should generally be construed to mean "one" or "one or more".
It should also be understood that all the modules in the apparatuses described above may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. Furthermore, any of these modules may be functionally further divided into sub-modules or combined together.
Processors have been described in connection with various apparatuses and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend on the specific application and the overall design constraints imposed on the system. By way of example, a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gate logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described in this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, a microcontroller, a DSP, or another suitable platform.
Software should be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc. Software may reside on a computer-readable medium. A computer-readable medium may include, e.g., a memory, which may be, e.g., a magnetic storage device (e.g., a hard disk, a floppy disk, a magnetic strip), an optical disc, a smart card, a flash memory device, a random access memory (RAM), a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although a memory is shown as being separate from the processor in various aspects presented in this disclosure, the memory may also be located inside the processor (e.g., a cache or a register).
The above description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Therefore, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described in this disclosure that are known, or will later come to be known, to those skilled in the art are intended to be covered by the claims.

Claims (26)

  1. A method for multimodality-based reactive response generation, comprising:
    obtaining multimodal input data;
    extracting at least one information element from the multimodal input data;
    generating at least one reference information item based at least on the at least one information element;
    producing multimodal output data by using at least the at least one reference information item; and
    providing the multimodal output data.
  2. The method of claim 1, wherein the multimodal input data comprises at least one of:
    an image of target content, audio of the target content, a bullet-chat file of the target content, a chat session, a captured image, captured audio, and external environment data.
  3. The method of claim 2, wherein extracting at least one information element from the multimodal input data comprises at least one of:
    extracting character features from the image of the target content;
    recognizing text from the image of the target content;
    detecting image light from the image of the target content;
    recognizing objects from the image of the target content;
    extracting music from the audio of the target content;
    extracting speech from the audio of the target content;
    extracting bullet-chat text from the bullet-chat file of the target content;
    extracting message text from the chat session;
    extracting object features from the captured image;
    extracting speech and/or music from the captured audio; and
    extracting external environment information from the external environment data.
  4. The method of claim 1, wherein generating at least one reference information item based at least on the at least one information element comprises:
    generating at least one of an emotion tag, an animation tag, comment text, and chat response text based at least on the at least one information element.
  5. The method of claim 4, wherein generating the emotion tag based at least on the at least one information element comprises:
    generating one or more emotion representations respectively corresponding to one or more information elements in the at least one information element; and
    generating the emotion tag based at least on the one or more emotion representations.
  6. The method of claim 5, wherein
    the emotion tag indicates an emotion type and/or an emotion level.
  7. The method of claim 4, wherein generating the animation tag based at least on the at least one information element comprises:
    mapping the at least one information element to the animation tag according to predetermined rules.
  8. The method of claim 7, wherein
    the animation tag indicates a facial expression type and/or a body movement type.
  9. The method of claim 7, wherein
    the animation tag is generated further based on the emotion tag.
  10. The method of claim 4, wherein generating the comment text based at least on the at least one information element comprises:
    selecting the comment text from bullet-chat text of the target content.
  11. The method of claim 10, wherein the selecting of the comment text comprises:
    determining, with a two-tower model, a matching degree between sentences in the bullet-chat text of the target content and the image and/or audio of the target content; and
    selecting a sentence with the highest matching degree from the bullet-chat text as the comment text.
  12. The method of claim 4, wherein generating the chat response text based at least on the at least one information element comprises:
    generating, by a chat engine, the chat response text based at least on message text in the chat session.
  13. The method of claim 12, wherein
    the chat response text is generated further based on the emotion tag.
  14. The method of claim 12, wherein
    the chat response text is generated further based on an emotion representation from an emotion transfer network.
  15. The method of claim 1, wherein the at least one reference information item is generated further based on at least one of:
    a scene-specific emotion;
    a preset personality of an intelligent conversation subject; and
    a preset role of the intelligent conversation subject.
  16. The method of claim 1, wherein the multimodal output data comprises at least one of:
    an animation sequence of an avatar of an intelligent conversation subject;
    speech of the intelligent conversation subject; and
    text.
  17. The method of claim 4, wherein producing the multimodal output data by using at least the at least one reference information item comprises:
    generating speech and/or text corresponding to the comment text and/or the chat response text.
  18. The method of claim 4, wherein producing the multimodal output data by using at least the at least one reference information item comprises:
    selecting, with the animation tag and/or the emotion tag, a corresponding animation template from an animation library of an avatar of an intelligent conversation subject; and
    performing time adaptation on the animation template to form an animation sequence of the avatar of the intelligent conversation subject.
  19. The method of claim 18, wherein the time adaptation comprises:
    adjusting the animation template to match a time sequence of speech corresponding to the comment text and/or the chat response text.
  20. The method of claim 1, wherein
    the multimodal output data is produced further based on scene-specific requirements.
  21. The method of claim 20, wherein the scene-specific requirements comprise at least one of:
    outputting only one of speech, an animation sequence and text;
    outputting at least two of speech, an animation sequence and text;
    a predetermined speech-rate setting; and
    a chat-mode setting.
  22. The method of claim 1, wherein
    obtaining multimodal input data comprises: obtaining at least one of an image, audio and a bullet-chat file of target content,
    extracting at least one information element from the multimodal input data comprises: extracting at least one information element from the image, audio and bullet-chat file of the target content,
    generating at least one reference information item based at least on the at least one information element comprises: generating at least one of an animation tag, an emotion tag and comment text based at least on the at least one information element,
    producing multimodal output data by using at least the at least one reference information item comprises: producing at least one of an animation sequence of an avatar, comment speech of the avatar and comment text by using at least one of the animation tag, the emotion tag and the comment text, and
    providing the multimodal output data comprises: providing at least one of the animation sequence, the comment speech and the comment text.
  23. A multimodality-based reactive response generation system, comprising:
    a multimodal data input interface for obtaining multimodal input data;
    a core processing unit configured to: extract at least one information element from the multimodal input data; generate at least one reference information item based at least on the at least one information element; and produce multimodal output data by using at least the at least one reference information item; and
    a multimodal data output interface for providing the multimodal output data.
  24. An apparatus for multimodality-based reactive response generation, comprising:
    at least one processor; and
    a memory storing computer-executable instructions that, when executed, cause the at least one processor to perform the steps of the method of any one of claims 1 to 21.
  25. An apparatus for multimodality-based reactive response generation, comprising:
    a multimodal input data obtaining module for obtaining multimodal input data;
    a data integration processing module for extracting at least one information element from the multimodal input data;
    a scene logic processing module for generating at least one reference information item based at least on the at least one information element;
    a multimodal output data generation module for producing multimodal output data by using at least the at least one reference information item; and
    a multimodal output data providing module for providing the multimodal output data.
  26. A computer program product for multimodality-based reactive response generation, comprising a computer program that is run by at least one processor to perform the steps of the method of any one of claims 1 to 21.
PCT/CN2022/093766 2021-05-19 2022-05-19 Multimodal based reactive response generation WO2022242706A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110545116.3A CN113238654A (en) 2021-05-19 2021-05-19 Multi-modal based reactive response generation
CN202110545116.3 2021-05-19

Publications (1)

Publication Number Publication Date
WO2022242706A1 true WO2022242706A1 (en) 2022-11-24

Family

ID=77137616

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/093766 WO2022242706A1 (en) 2021-05-19 2022-05-19 Multimodal based reactive response generation

Country Status (2)

Country Link
CN (1) CN113238654A (en)
WO (1) WO2022242706A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113238654A (en) * 2021-05-19 2021-08-10 宋睿华 Multi-modal based reactive response generation
CN113744369A (en) * 2021-09-09 2021-12-03 广州梦映动漫网络科技有限公司 Animation generation method, system, medium and electronic terminal
CN115658935B (en) * 2022-12-06 2023-05-02 北京红棉小冰科技有限公司 Personalized comment generation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106531162A (en) * 2016-10-28 2017-03-22 北京光年无限科技有限公司 Man-machine interaction method and device used for intelligent robot
CN110267052A (en) * 2019-06-19 2019-09-20 云南大学 A kind of intelligent barrage robot based on real-time emotion feedback
CN112379780A (en) * 2020-12-01 2021-02-19 宁波大学 Multi-mode emotion interaction method, intelligent device, system, electronic device and medium
CN113238654A (en) * 2021-05-19 2021-08-10 宋睿华 Multi-modal based reactive response generation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107831905A (en) * 2017-11-30 2018-03-23 北京光年无限科技有限公司 A kind of virtual image exchange method and system based on line holographic projections equipment

Also Published As

Publication number Publication date
CN113238654A (en) 2021-08-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22804026

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE