CN116756285A - Virtual robot interaction method, device and storage medium


Info

Publication number
CN116756285A
CN116756285A (application CN202310736256.8A)
Authority
CN
China
Prior art keywords
information
virtual robot
text
interaction
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310736256.8A
Other languages
Chinese (zh)
Inventor
杜平杰 (Du Pingjie)
殷雅俊 (Yin Yajun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huafang Technology Co., Ltd.
Original Assignee
Beijing Huafang Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huafang Technology Co., Ltd.
Priority to CN202310736256.8A
Publication of CN116756285A
Legal status: Pending

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 - Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 - Interaction techniques based on GUIs based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance

Abstract

The invention provides an interaction method, device, and storage medium for a virtual robot. The method comprises the following steps: in response to a virtual robot interaction request triggered by a user, multimodal data of the user's live-streaming room are collected in real time. The collected data are then processed to convert the multimodal data into text information in a unified format, where the text information uniformly characterizes the feature information of the multimodal data. A response text is then generated from the text information. Interaction information is determined based on the response text, and the virtual robot is controlled to perform an interaction operation based on the interaction information. In this scheme, the multimodal data of the live-streaming room are analyzed and processed to fully mine the room's information, so that the virtual robot can automatically interact with the user in light of that information. This lowers the user's operating burden, raises the activity of the live-streaming room, and greatly improves both the anchor's enthusiasm for streaming and the audience experience.

Description

Virtual robot interaction method, device and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to an interaction method, apparatus, and storage medium for a virtual robot.
Background
With the rapid development of science and technology and the improvement of people's living standards, live streaming has become an important form of entertainment and daily life, especially popular with young people. For a live stream, a lull in the room strongly affects both the anchor's mood and the audience's experience. A traditional live-streaming assistant robot can only execute simple instructions, such as playing a sound effect or reminding viewers to follow the channel; it cannot interact effectively with the anchor based on the content of the live-streaming room, lacks intelligent interaction and personalized service, and cannot genuinely help the anchor raise the room's interaction and activity. In addition, a traditional robot requires various rules and instructions to be configured manually, so the barrier to entry for anchors is high.
Disclosure of Invention
The embodiments of the present invention provide an interaction method, device, and storage medium for a virtual robot, which are used to solve the problem of insufficient activity in live-streaming rooms.
In a first aspect, an embodiment of the present invention provides an interaction method for a virtual robot, where the method includes:
responding to a virtual robot interaction request triggered by a user, and collecting multimodal data of the user's live-streaming room in real time;
determining text information corresponding to the multimodal data, where the text information uniformly characterizes the feature information of the multimodal data;
generating a response text according to the text information;
and determining interaction information based on the response text, and controlling the virtual robot to perform an interaction operation based on the interaction information.
In a second aspect, an embodiment of the present invention provides an interaction apparatus for a virtual robot, where the apparatus includes:
a response module, configured to respond to a virtual robot interaction request triggered by a user and collect multimodal data of the user's live-streaming room in real time;
a determining module, configured to determine text information corresponding to the multimodal data, where the text information uniformly characterizes the feature information of the multimodal data;
a generating module, configured to generate a response text according to the text information;
and an execution module, configured to determine interaction information based on the response text and control the virtual robot to perform an interaction operation based on the interaction information.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, and a communication interface; the memory stores executable code which, when executed by the processor, causes the processor to perform the virtual robot interaction method described in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium storing executable code which, when executed by a processor of an electronic device, causes the processor to at least implement the virtual robot interaction method described in the first aspect.
In the virtual robot interaction scheme provided by the embodiments of the present invention, when a user triggers a virtual robot interaction request, data of different modality types from the user's live-streaming room, such as live video data, live audio data, bullet-screen (danmaku) comment information, and viewing information, can be collected in real time in response to the request. The collected data are then processed to convert the multimodal data into unified text information, which uniformly characterizes the feature information of the multimodal data. A response text is then generated from the text information. Interaction information is determined based on the response text, and the virtual robot is controlled to perform an interaction operation based on the interaction information.
In this scheme, the multimodal data of the live-streaming room are analyzed and processed to fully mine the room's information, so that the virtual robot can automatically interact with the user in light of that information. This lowers the user's operating burden, raises the activity of the live-streaming room, and greatly improves both the anchor's enthusiasm for streaming and the audience experience.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below clearly cover only some embodiments of the present invention; a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is a flowchart of an interaction method of a virtual robot according to an embodiment of the present invention;
FIG. 2 is a flowchart for determining interactive information based on response text according to an embodiment of the present invention;
FIG. 3 is a flowchart of determining the instruction operation corresponding to instruction information according to an embodiment of the present invention;
fig. 4 is a schematic diagram of determining an interaction process of a virtual robot in a cloud service mode according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a virtual robot interaction device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. In addition, the sequence of steps in the method embodiments described below is only an example and is not strictly limited.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the embodiments of the present invention are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
With the rapid development of internet technology, network live streaming has entered the public view as a new field: users can watch a live room's performances on their own terminals and interact with the anchor in real time. When interaction in the anchor's live-streaming room is sparse and the atmosphere is flat, the anchor often has to liven things up personally or rely on an assistant robot to do so.
An existing live-streaming assistant robot can only execute simple instructions, such as playing a sound effect or reminding viewers to follow the channel; it cannot interact effectively with the anchor based on the content of the live-streaming room, lacks intelligent interaction and personalized service, and cannot genuinely help the anchor raise the room's interaction and activity. In addition, a traditional robot requires various rules and instructions to be configured manually; the barrier to entry for anchors is high, which affects their use of the robot.
To solve these technical problems, the embodiments of the present invention provide a new virtual robot interaction scheme. In this scheme, the various data sources of a user's live-streaming room are combined and the room's information is fully mined to determine the interaction information with which the virtual robot interacts with the user in real time. For example, the virtual robot can chat with the anchor about the live content, and can comment on, tease, or praise that content according to the user's settings; meanwhile, based on audience feedback, it can naturally thank viewers for gifts and respond to bullet-screen comments within the chat. In other words, it interacts with the anchor very intelligently, raising the interaction and activity of the live-streaming room.
Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the case where there is no conflict between the embodiments, the following embodiments and features in the embodiments may be combined with each other.
Fig. 1 is a flowchart of an interaction method of a virtual robot according to an embodiment of the present invention. As shown in fig. 1, this embodiment provides an interaction method for a virtual robot whose execution subject may be a server; it can be understood that the server may be implemented as software, or as a combination of software and hardware. Specifically, the method includes the following steps:
101. and responding to the virtual robot interaction request triggered by the user, and collecting the multi-mode data of the live broadcasting room of the user in real time.
102. And determining text information corresponding to the multi-mode data, wherein the text information is used for uniformly characterizing characteristic information of the multi-mode data.
103. And generating response text according to the text information.
104. And determining interaction information based on the response text, and controlling the virtual robot to execute interaction operation based on the interaction information.
In the embodiment of the present invention, after a user triggers a virtual robot interaction request, the server responds to the request and collects multimodal data of the user's live-streaming room in real time. The interaction request may include a user identifier, and the room's multimodal data can be collected in real time according to that identifier. Optionally, the multimodal data may include data of different modality types, such as live video data, live audio data, viewing information, and bullet-screen comment information; the specific data types and volumes can be set according to actual requirements. A minimal sketch of such a collected snapshot is given below.
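The sketch below illustrates one way the collected multimodal snapshot could be represented; all type and field names (MultimodalSnapshot, collect_snapshot, and so on) are illustrative assumptions, not part of the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MultimodalSnapshot:
    """One real-time capture of a live-streaming room; field names are hypothetical."""
    user_id: str                                               # from the interaction request
    video_frames: List[bytes] = field(default_factory=list)   # live video data
    audio_chunk: bytes = b""                                   # live audio data
    bullet_comments: List[str] = field(default_factory=list)  # bullet-screen (danmaku) comments
    viewing_info: List[dict] = field(default_factory=list)    # viewer join/gift events

def collect_snapshot(user_id: str) -> MultimodalSnapshot:
    # Hypothetical collector: a real system would read the room's media and
    # message streams from the live-streaming platform using this identifier.
    return MultimodalSnapshot(user_id=user_id)
```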
To facilitate subsequent analysis and processing of the collected multimodal data, data of the various modality types can be converted into a unified representation. After the multimodal data of the user's live-streaming room are obtained, they can be processed to determine the corresponding text information, which uniformly characterizes the feature information of the multimodal data.
For example, the multimodal data may include video data, audio data, viewing information, and bullet-screen comment information. The live video data are converted into corresponding text information that describes the live content they contain. The audio data are converted into corresponding text information that captures the spoken content they contain. The viewing information is converted into text describing the viewing situation and viewer feedback, and the bullet-screen comments are converted into text describing viewer feedback.
In an alternative embodiment, if the collected multimodal data include user audio data, the corresponding text information may be determined by a speech recognition model. Specifically, the user audio data are input into the speech recognition model to obtain the corresponding text information. For example, the anchor's chat audio is converted into chat text using a speech recognition model, yielding the anchor speech text.
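A minimal sketch of this step follows. The patent does not name a concrete speech recognition model, so the Whisper checkpoint loaded through the Hugging Face transformers pipeline is an assumption chosen purely for illustration.

```python
from transformers import pipeline

# Any encoder-decoder ASR model could fill this role; Whisper is a stand-in.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def anchor_speech_to_text(audio_path: str) -> str:
    """Convert the anchor's chat audio into the anchor speech text."""
    result = asr(audio_path)  # returns a dict with a "text" field
    return result["text"]
```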
The speech recognition model mainly comprises an encoder and a decoder. The encoder converts the user's audio data into a vector representation for speech recognition. The decoder completes the recognition from speech to words, recognizing everything the user says in the audio data, and finally outputs the speech recognition result, that is, the text information corresponding to the user's audio data. Optionally, the encoder may comprise a plurality of cascaded encoder layers, each containing two sublayers: an attention layer and a feed-forward neural network layer. The decoder may likewise comprise a plurality of cascaded decoder layers, each containing attention layers and a feed-forward neural network layer. The number of encoder layers may be set according to actual requirements, and similarly for the number of decoder layers, which is not limited herein. In addition, the attention layers in the decoder may include a self-attention layer and a cross-attention layer.
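The cascaded structure described above can be sketched in PyTorch as follows; the layer counts and dimensions are illustrative assumptions, not values taken from the patent.

```python
import torch.nn as nn

# Each encoder layer contains a self-attention sublayer and a feed-forward
# sublayer; each decoder layer additionally contains a cross-attention sublayer.
d_model, n_heads = 256, 4
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=6,  # number of cascaded encoder layers, set per requirements
)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
    num_layers=6,  # number of cascaded decoder layers, set per requirements
)
```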
In an alternative embodiment, if the collected multimodal data include live video data, the corresponding text information may be determined by an image recognition model. Specifically, the live video data are input into the image recognition model to obtain text information describing the live content they contain. Alternatively, the collected live video data are screenshotted at a preset period, and the resulting images are input into the image recognition model to obtain the text information corresponding to the live video data. For example, the anchor's live stream is periodically screenshotted, and a vision-language pre-trained model (a BLIP-2 model) generates a text description for each screenshot, yielding the live video content description text.
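A sketch of the screenshot-to-caption step is shown below using a public BLIP-2 checkpoint on Hugging Face; the patent only names "a BLIP-2 model", so the exact checkpoint, device, and generation settings are assumptions.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")  # illustrative; CPU inference is possible but slow

def caption_screenshot(path: str) -> str:
    """Generate a text description for one periodic live-stream screenshot."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    ids = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
```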
The image recognition model mainly comprises an image encoder, a text encoder, an image-text encoder, and an image-text decoder. The image encoder extracts image feature information, converting the user's live video data into a vector representation for image recognition. The text encoder extracts text feature information. The image-text encoder predicts whether an image-text pair is a positive or negative match. The image-text decoder generates a text description for a given image. The pre-training tasks of BLIP mainly include contrastive learning over the output image and text features, judging whether an image and a text are consistent, and text generation. Training these three pre-training tasks jointly makes fuller use of the collected image-text multimodal data, so that the BLIP model can serve both image-text understanding tasks and image-text generation tasks.
In an alternative embodiment, if the collected multimodal data include audience interaction data, the corresponding text information may be determined by a first language recognition model. Specifically, the audience interaction data are input into the first language recognition model to obtain text information describing audience feedback. The audience interaction data may include audience viewing information and bullet-screen comment information. The first language recognition model may specifically be a Generative Pre-trained Transformer (GPT) model, trained with unsupervised pre-training followed by supervised fine-tuning. For example, a GPT model converts audience gift information and bullet-screen comments into text descriptions and summarizes the audience feedback, yielding the audience feedback summary text.
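One way this summarization could look in practice is the prompt-based sketch below. The OpenAI client and model name are placeholders standing in for whatever GPT-style model a deployment actually uses; neither is named in the patent.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

def summarize_audience_feedback(gifts: list[str], comments: list[str]) -> str:
    """Condense gift events and bullet-screen comments into a feedback summary."""
    prompt = (
        "Summarize the following live-room viewer feedback in one short paragraph.\n"
        f"Gift events: {gifts}\n"
        f"Bullet-screen comments: {comments}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```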
From the above description it is clear that the text information may include the anchor speech text, the live video content description text, and the audience feedback summary text, which together reflect the current state of the user's live-streaming room. A response text for the virtual robot can then be determined from this text information.
Specifically, in the embodiment of the present invention, after the text information corresponding to the multimodal data is obtained, it is analyzed and processed to determine the response text for the virtual robot. The response text may include dialogue information and instruction information. That is, the virtual robot's current response mode and response content can be determined from the current live-stream situation. If it is determined from the current text information that the virtual robot should give a dialogue response, the specific dialogue information for that response is determined; if an instruction response is called for, the specific instruction information is determined. In practice, the current text information may also indicate that the virtual robot needs to give a dialogue response and an instruction response at the same time, in which case both the specific dialogue information and the specific instruction information are determined.
The dialogue information may include real-time conversation with the anchor, commenting on, teasing, or praising the live content, thanking viewers for watching and gifting, responding to bullet-screen comments, and the like. For example, if the text information obtained after processing the collected multimodal data is "what is the weather like today", the dialogue information "the weather is nice today" may be generated.
The instruction information may include an audio playback instruction, a volume adjustment instruction, a background-music switching instruction, and the like. For example, if the text information obtained after processing the collected multimodal data is "the music is too loud", the instruction information "turn the volume down" may be generated.
In an alternative embodiment, the response text corresponding to the text information may be determined by a second language recognition model. Specifically, the text information is input into the second language recognition model to obtain the corresponding dialogue information and/or instruction information. The dialogue information instructs the virtual robot to perform the operation corresponding to the dialogue; the instruction information instructs it to perform the operation corresponding to the instruction. The second language recognition model may likewise be a Generative Pre-trained Transformer (GPT) model, trained with unsupervised pre-training followed by supervised fine-tuning. For example, a GPT model analyzes the text information to generate the response text. When the GPT model analyzes the text information, the anchor speech text is the primary input, supplemented by the live video content description text and the audience feedback summary text, to generate real-time dialogue information and/or instruction information. That is, in this language recognition model the anchor speech text carries a higher weight, while the live video content description text and the audience feedback summary text carry lower weights.
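The weighting described above could be realized at the prompt level, as in the sketch below; this is one possible reading, since the patent does not specify how the weights are applied, and the "INSTRUCTION:" output format is invented here for illustration.

```python
def build_response_prompt(anchor_text: str, video_desc: str, audience_summary: str) -> str:
    """Assemble model input with the anchor speech text as the primary signal."""
    return (
        "You are a live-streaming assistant robot. Respond primarily to what the "
        "anchor just said; use the other context only as background.\n"
        f"Anchor said (primary): {anchor_text}\n"
        f"Live video description (secondary): {video_desc}\n"
        f"Audience feedback summary (secondary): {audience_summary}\n"
        "Output either dialogue text, or an instruction of the form "
        "'INSTRUCTION: <name>', e.g. 'INSTRUCTION: volume_down'."
    )
```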
After the response text for the virtual robot is determined, the interaction information is determined based on it, and the virtual robot is controlled to perform the interaction operation based on the interaction information. When the response text is dialogue information, the dialogue is converted into speech by speech synthesis, and the virtual robot plays the speech. When the response text is instruction information, the instruction is converted into an instruction operation, which the virtual robot executes. In this way the virtual robot can chat naturally with the anchor about the live content, comment on, tease, or praise that content according to the anchor's settings, and, based on audience feedback, thank viewers for gifts and respond to bullet-screen comments within the chat, improving user experience and user stickiness. Meanwhile, operation instructions can be generated automatically from the live content or the anchor's dialogue, so that the virtual robot executes the corresponding operation and the anchor's operating burden is reduced.
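A minimal dispatcher for the two response types might look as follows. The robot object, its tts/play_audio/set_volume methods, and the instruction registry are all placeholders for whatever speech-synthesis engine and control interface an actual deployment provides.

```python
# Map instruction names to operations (hypothetical robot control interface).
INSTRUCTION_OPS = {
    "volume_down": lambda robot: robot.set_volume(robot.volume - 10),
    "play_sound_effect": lambda robot: robot.play_audio("applause.wav"),
}

def execute_response(robot, response: str) -> None:
    if response.startswith("INSTRUCTION:"):
        op = INSTRUCTION_OPS.get(response.split(":", 1)[1].strip())
        if op:
            op(robot)                # instruction response: run the operation
    else:
        audio = robot.tts(response)  # dialogue response: speech synthesis
        robot.play_audio(audio)      # the virtual robot plays the speech
```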
In addition, the voice style of the virtual robot can be configured, so that it chats with the anchor or interacts with the audience in voices of different styles, adding interest to the live stream and attracting more viewers to the user's broadcast.
In the embodiment of the present invention, the multimodal data of the live-streaming room are analyzed and processed to fully mine the room's information, so that the virtual robot can interact with the user automatically and intelligently in light of that information. This lowers the user's operating burden, raises the activity of the live-streaming room, and greatly improves both the anchor's enthusiasm for streaming and the audience experience.
The execution subject of the virtual robot interaction scheme described in the above embodiment is a server. In practical applications, the execution subject may instead be the virtual robot itself, implemented as software or as a combination of software and hardware; the execution subject is not limited and may be chosen according to actual requirements. If the execution subject is the virtual robot, then after the user triggers a virtual robot interaction request, the virtual robot responds to the request and collects the multimodal data of the user's live-streaming room in real time. It processes the multimodal data to determine the corresponding text information, which uniformly characterizes the feature information of the multimodal data. It then analyzes the text information to determine its response text, determines the interaction information based on the response text, and performs the corresponding interaction operation based on the interaction information.
The specific implementation process involved in the embodiment of the present invention may refer to the content in the foregoing embodiment, and will not be described herein again.
The above embodiments describe a specific implementation for determining the interaction operation performed by the virtual robot. In practical applications, to increase the interactivity and interest of the live stream, the user may set persona information for the robot, and different voice output can be generated from different persona settings. The process of determining the interaction information based on the response text and controlling the virtual robot to perform the interaction operation is described by way of example with reference to fig. 2.
FIG. 2 is a flowchart of determining interaction information based on the response text according to an embodiment of the present invention. As shown in fig. 2, in the case where the response text includes dialogue information, the embodiment of the present invention provides a specific implementation for determining the interaction information based on the response text, including the following steps:
201. and determining the human setting information of the virtual robot.
202. And determining voice information corresponding to the dialogue information based on the person setting information, and controlling the virtual robot to play the voice information.
In the embodiment of the present invention, when the generated response text is dialogue information, the interaction information can be generated in combination with the virtual robot's persona information. Specifically, the persona information of the virtual robot is first determined; the voice information corresponding to the dialogue information is then determined based on the persona information, and the virtual robot is controlled to play it.
The user can preset the virtual robot's persona information according to the live content, or the server can set it according to the current live content. The persona information may include age, personality, speaking style, region, and the like; dialogue feedback of different styles is generated according to the configured persona, so that the virtual robot plays the voice information in voices of different styles. A sketch of persona-conditioned speech output is given below.
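The sketch below shows how persona fields could condition the synthesized voice; the Persona fields mirror the ones listed above, while the TTS engine interface (select_voice, synthesize) is a hypothetical stand-in.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    age: int = 22
    personality: str = "lively"
    speaking_style: str = "humorous"
    region: str = "Beijing"

def speak_with_persona(tts_engine, dialogue: str, persona: Persona) -> bytes:
    # A persona-aware engine might pick a voice and prosody preset from the
    # configured age/region, and a delivery style from the speaking style.
    voice = tts_engine.select_voice(age=persona.age, region=persona.region)
    return tts_engine.synthesize(dialogue, voice=voice, style=persona.speaking_style)
```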
In the embodiment of the present invention, by determining the virtual robot's persona information, determining the voice information corresponding to the dialogue information based on it, and controlling the virtual robot to play that voice, the interactivity and interest of the live stream are improved.
The above embodiment describes determining the interaction information based on generated dialogue information. In practical applications, instruction information may also be generated from the text information, and the interaction information determined from it, so that the virtual robot performs the corresponding instruction operation. Specifically, the instruction operation corresponding to the instruction information is determined, and the virtual robot is controlled to execute it. The instruction information may include an audio playback instruction, a volume adjustment instruction, a background-music switching instruction, and the like.
In traditional live streaming, the anchor usually has to trigger audio playback, volume adjustment, background-music switching, and similar commands manually. In some scenarios, however, such as dancing, outdoor streaming, and gaming, it is inconvenient for the anchor to trigger these commands, so atmosphere audio cannot be played or the volume adjusted in time. With the virtual robot interaction method provided by the embodiment of the present invention, instruction information can be generated automatically from the live content, so that the virtual robot executes the corresponding instruction operation without user triggering, reducing user operations and freeing the anchor's hands.
In addition, manually selecting sound effects is subjective in practice: the suitability of the chosen effect cannot be guaranteed, and the best result is often not achieved. To solve this, the embodiment of the present invention uses AI techniques, combining the multimodal data to automatically select a matching sound effect to play, achieving more efficient and intelligent atmosphere control for the live content. The specific process of determining the matching preset sound effect from the multimodal data and controlling the virtual robot to play it automatically is described by way of example with reference to fig. 3.
FIG. 3 is a flowchart of determining the instruction operation corresponding to instruction information according to an embodiment of the present invention. As shown in fig. 3, in the case where the instruction information includes an audio playback instruction, the embodiment of the present invention provides a specific implementation for determining the instruction operation corresponding to the instruction information, including the following steps:
301. and acquiring live video data and live audio data in the current period.
302. And determining video characteristics corresponding to the live video data and audio characteristics corresponding to the live audio data.
303. And determining a live broadcast scene, a live broadcast style and a user emotion corresponding to the current live broadcast room based on the audio characteristics and the video characteristics.
304. Based on the live broadcast scene, the live broadcast style and the emotion of the user, determining a preset sound effect matched with the current live broadcast from the sound effect feature library, and controlling the virtual robot to play the preset sound effect.
In practical applications, after the instruction information is determined from the text information, the corresponding instruction operation is determined. Note that if the determined instruction information is an audio playback instruction, then before the corresponding instruction operation is determined, the preset sound effect to be played by the virtual robot is determined from a sound-effect feature library.
The server may construct the sound-effect feature library in advance. Specifically, a number of preset sound effects are obtained first, and for each one the live scene, user mood, live style, live duration, and similar information it suits are determined. An audio feature extraction model extracts the audio features of each preset sound effect. Correspondences are then established between each effect's audio features and the live scene, user mood, live style, and live duration it suits, and the sound-effect feature library is created from these correspondences. A sketch of such a library follows.
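A minimal sketch of the library is given below, assuming an external feature extractor stands in for the patent's audio feature extraction model; the entry layout and metadata keys are illustrative.

```python
import numpy as np

class SoundEffectLibrary:
    """Pairs each preset effect's audio embedding with its adapted metadata."""

    def __init__(self):
        self.entries = []  # list of (effect_id, audio_embedding, metadata)

    def add(self, effect_id: str, waveform: np.ndarray, extractor, metadata: dict):
        embedding = extractor(waveform)  # audio feature vector for this effect
        self.entries.append((effect_id, embedding, metadata))

library = SoundEffectLibrary()
# Example entry (values illustrative):
# library.add("applause_01", waveform, extract_audio_features,
#             {"scene": "dance", "mood": "excited", "style": "upbeat", "duration": 3.0})
```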
Then, in combination with the live content, a preset sound effect matching the current live content is determined from the sound-effect feature library, and the virtual robot is controlled to play it automatically. Specifically, the live video data and live audio data of the current period are obtained first. Because the user's live content may change from moment to moment, acquiring the room's video and audio data for the current period allows the current content to be judged accurately, so that the selected preset sound effect achieves the best effect.
Next, the video features corresponding to the live video data and the audio features corresponding to the live audio data are determined. In an alternative embodiment, a 3D convolutional neural network processes the acquired live video data to extract temporal and spatial features, which are taken as the video features. A speech recognition model converts the live audio data into text, and a Transformer model extracts text features from it to obtain the audio features corresponding to the live audio data.
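For the video branch, one readily available 3D convolutional network is torchvision's r3d_18, used below purely as an example of extracting spatio-temporal features from a clip; the patent does not name a specific 3D CNN.

```python
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

backbone = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
backbone.fc = torch.nn.Identity()  # keep the features, drop the classifier head
backbone.eval()

@torch.no_grad()
def video_features(clip: torch.Tensor) -> torch.Tensor:
    # clip: (1, 3, T, 112, 112) tensor of normalized frames from the current period
    return backbone(clip)  # -> (1, 512) spatio-temporal feature vector
```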
After the audio and video features are determined, the live scene, live style, and user emotion of the current live-streaming room are determined from them. Then, based on the live scene, live style, and user emotion, the preset sound effect matching the current live stream is determined from the sound-effect feature library. The selected effect thus matches the user's current live content more closely, achieving the best result, making the atmosphere livelier, and further improving the viewer experience. Finally, the virtual robot is controlled to play the preset sound effect.
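Building on the library sketch above, the matching step could be as simple as the scoring rule below; the rule itself is an assumption, since the patent does not specify how scene, style, and emotion are combined.

```python
def match_sound_effect(library, scene: str, style: str, emotion: str):
    """Pick the preset effect whose metadata best matches the inferred live state."""
    def score(metadata: dict) -> int:
        return (
            (metadata.get("scene") == scene)
            + (metadata.get("style") == style)
            + (metadata.get("mood") == emotion)
        )
    best = max(library.entries, key=lambda e: score(e[2]), default=None)
    return best[0] if best else None  # effect_id for the virtual robot to play
```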
In the embodiment of the present invention, by extracting features from the multimodal data and analyzing the feature information and emotion, a comprehensive analysis of the live content is achieved: the atmosphere and emotion of the content can be grasped accurately and a suitable preset sound effect selected for playback, improving the quality and viewing experience of the live stream. The preset sound effects available to the user expand from tens to thousands, and the anchor's hands are freed at the same time, giving the live stream more possibilities.
The specific implementation process involved in the embodiment of the present invention may refer to the content in the foregoing embodiment, and will not be described herein again.
In addition, the virtual robot interaction method provided by the embodiment of the present invention can be executed in the cloud, where several computing nodes (cloud servers) may be deployed, each with processing resources such as computation and storage. In the cloud, a service may be provided jointly by multiple computing nodes, although a single computing node can also provide one or more services. The cloud provides the service by exposing a service interface, which the user calls to use the corresponding service.
For the solution provided by the embodiment of the present invention, the cloud may provide a service interface for the virtual robot interaction service. The user calls this interface through a terminal device to send a virtual robot interaction service request, which includes a user identifier, to the cloud. The cloud determines the computing node that will respond to the request and uses that node's processing resources to perform the following steps:
responding to the virtual robot interaction request triggered by the user, and collecting multimodal data of the user's live-streaming room in real time;
determining text information corresponding to the multimodal data, where the text information uniformly characterizes the feature information of the multimodal data;
generating a response text according to the text information;
and determining interaction information based on the response text, and controlling the virtual robot to perform an interaction operation based on the interaction information.
For the above execution process, refer to the related descriptions in the other embodiments; details are not repeated here.
For ease of understanding, an example is described with reference to fig. 4. The user can call the virtual robot interaction service through the terminal device E1 shown in fig. 4 to collect the multimodal data of the user's live-streaming room in real time and to analyze and process it to obtain the interaction operation for the virtual robot. The service interface through which the user calls the service may be a software development kit (SDK), an application programming interface (API), or the like; fig. 4 illustrates the API case. In the cloud, as shown in fig. 4, suppose the service cluster E2, which contains at least one computing node, provides the virtual robot interaction service. After receiving the request, the service cluster E2 performs the steps of the foregoing embodiments to determine the interaction operation to be performed by the virtual robot and feeds it back to the terminal device E1.
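From the terminal side, the API call illustrated in fig. 4 might look like the sketch below; the endpoint URL and payload shape are hypothetical, as the patent does not define the wire format.

```python
import requests

def request_robot_interaction(user_id: str) -> dict:
    """Call the cloud's virtual robot interaction service from terminal device E1."""
    resp = requests.post(
        "https://api.example.com/virtual-robot/interact",  # placeholder endpoint
        json={"user_id": user_id},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. the interaction operation determined by cluster E2
```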
The virtual robot interaction apparatus according to one or more embodiments of the present invention is described in detail below. Those skilled in the art will appreciate that these apparatuses can be constructed from commercially available hardware components configured through the steps taught in this solution.
Fig. 5 is a schematic structural diagram of a virtual robot interaction apparatus according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes: a response module 11, a determining module 12, a generating module 13, and an execution module 14.
The response module 11 is configured to respond to a virtual robot interaction request triggered by a user and collect multimodal data of the user's live-streaming room in real time.
The determining module 12 is configured to determine text information corresponding to the multimodal data, where the text information uniformly characterizes the feature information of the multimodal data.
The generating module 13 is configured to generate a response text according to the text information.
The execution module 14 is configured to determine interaction information based on the response text, and control the virtual robot to perform an interaction operation based on the interaction information.
Optionally, the multimodal data include user audio data, and the determining module 12 may specifically be configured to: input the user audio data into a speech recognition model to obtain the text information corresponding to the user audio data.
Optionally, the multimodal data include live video data, and the determining module 12 may specifically be configured to: input the live video data into an image recognition model to obtain text information corresponding to the live video data, where the text information describes the live content included in the live video data.
Optionally, the multimodal data include audience interaction data, and the determining module 12 may specifically be configured to: input the audience interaction data into a first language recognition model to obtain text information corresponding to the audience interaction data, where the text information describes audience feedback information.
Optionally, the generating module 13 may specifically be configured to: input the text information into a second language recognition model to obtain dialogue information and/or instruction information corresponding to the text information, where the dialogue information instructs the virtual robot to perform the operation corresponding to the dialogue information, and the instruction information instructs the virtual robot to perform the operation corresponding to the instruction information.
Optionally, the execution module 14 may specifically be configured to: determine the persona information of the virtual robot; and determine the voice information corresponding to the dialogue information based on the persona information, and control the virtual robot to play the voice information.
Optionally, the execution module 14 may specifically be configured to: determine the instruction operation corresponding to the instruction information, and control the virtual robot to execute the instruction operation.
Optionally, the instruction information includes an audio playback instruction, and the execution module 14 may specifically be configured to: acquire the live video data and live audio data of the current period; determine the video features corresponding to the live video data and the audio features corresponding to the live audio data; determine the live scene, live style, and user emotion of the current live-streaming room based on the audio features and video features; and, based on the live scene, live style, and user emotion, determine from the sound-effect feature library a preset sound effect matching the current live stream, and control the virtual robot to play the preset sound effect.
The apparatus shown in fig. 5 can perform the steps of the virtual robot interaction method in the foregoing embodiments; for the detailed execution process and technical effects, refer to the descriptions in the foregoing embodiments, which are not repeated here.
An embodiment of the present invention further provides an electronic device, as shown in fig. 6, which may include: a processor 21, a memory 22, and a communication interface 23. The memory 22 stores executable code which, when executed by the processor 21, causes the processor 21 to implement the virtual robot interaction method of the foregoing embodiments.
In addition, embodiments of the present invention provide a non-transitory machine-readable storage medium storing executable code which, when executed by a processor of an electronic device, causes the processor to at least implement the virtual robot interaction method provided in the foregoing embodiments.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by adding a necessary general-purpose hardware platform, or by a combination of hardware and software. Based on this understanding, the essence of the foregoing technical solutions, or the portion that contributes to the prior art, may be embodied in the form of a computer program product, which may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An interaction method of a virtual robot, comprising:
responding to a virtual robot interaction request triggered by a user, and collecting multimodal data of the user's live-streaming room in real time;
determining text information corresponding to the multimodal data, wherein the text information uniformly characterizes feature information of the multimodal data;
generating a response text according to the text information;
and determining interaction information based on the response text, and controlling the virtual robot to perform an interaction operation based on the interaction information.
2. The method of claim 1, wherein the multimodal data includes user audio data, and wherein the determining text information corresponding to the multimodal data includes:
and inputting the user audio data into a voice recognition model to obtain text information corresponding to the user audio data.
3. The method of claim 1, wherein the multimodal data comprises live video data, and wherein the determining text information corresponding to the multimodal data comprises:
and inputting the live video data into an image recognition model to obtain text information corresponding to the live video data, wherein the text information is used for describing live content included in the live video data.
4. The method of claim 1, wherein the multimodal data includes viewer interaction data, and wherein the determining text information corresponding to the multimodal data includes:
and inputting the audience interaction data into a first language identification model to obtain text information corresponding to the audience interaction data, wherein the text information is used for describing audience feedback information.
5. The method of claim 1, wherein generating a response text from the text information comprises:
inputting the text information into a second language recognition model to obtain dialogue information and/or instruction information corresponding to the text information, wherein the dialogue information is used for instructing the virtual robot to perform the operation corresponding to the dialogue information, and the instruction information is used for instructing the virtual robot to perform the operation corresponding to the instruction information.
6. The method of claim 5, wherein the determining interactive information based on the response text and controlling the virtual robot to perform interactive operations based on the interactive information comprises:
determining persona information of the virtual robot;
and determining voice information corresponding to the dialogue information based on the persona information, and controlling the virtual robot to play the voice information.
7. The method of claim 5, wherein the determining interactive information based on the response text and controlling the virtual robot to perform interactive operations based on the interactive information comprises:
determining instruction operation corresponding to the instruction information, and controlling the virtual robot to execute the instruction operation.
8. The method of claim 7, wherein the instruction information includes an audio playback instruction, the determining an instruction operation corresponding to the instruction information, and controlling the virtual robot to execute the instruction operation, includes:
acquiring live video data and live audio data in a current period;
determining video characteristics corresponding to the live video data and audio characteristics corresponding to the live audio data;
based on the audio characteristics and the video characteristics, determining a live broadcast scene, a live broadcast style and a user emotion corresponding to the current live broadcast room;
and determining preset sound effects matched with the current live broadcast from a sound effect feature library based on the live broadcast scene, the live broadcast style and the emotion of the user, and controlling the virtual robot to play the preset sound effects.
9. An electronic device, comprising: a memory, a processor, a communication interface; wherein the memory has executable code stored thereon, which when executed by the processor causes the processor to perform the method of interaction of a virtual robot as claimed in any one of claims 1 to 8.
10. A non-transitory machine-readable storage medium having executable code stored thereon, which when executed by a processor of an electronic device, causes the processor to perform the method of interaction of a virtual robot according to any of claims 1 to 8.
CN202310736256.8A 2023-06-20 2023-06-20 Virtual robot interaction method, device and storage medium Pending CN116756285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310736256.8A CN116756285A (en) 2023-06-20 2023-06-20 Virtual robot interaction method, device and storage medium

Publications (1)

Publication Number Publication Date
CN116756285A true CN116756285A (en) 2023-09-15

Family

ID=87960516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310736256.8A Pending CN116756285A (en) 2023-06-20 2023-06-20 Virtual robot interaction method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116756285A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117376596A (en) * 2023-12-08 2024-01-09 江西拓世智能科技股份有限公司 Live broadcast method, device and storage medium based on intelligent digital human model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination