CN107423809B - Virtual robot multi-mode interaction method and system applied to video live broadcast platform - Google Patents


Info

Publication number
CN107423809B
CN107423809B
Authority
CN
China
Prior art keywords
mode
data
interaction
live broadcast
virtual robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710551230.0A
Other languages
Chinese (zh)
Other versions
CN107423809A (en)
Inventor
黄钊 (Huang Zhao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Virtual Point Technology Co Ltd
Original Assignee
Beijing Guangnian Wuxian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guangnian Wuxian Technology Co Ltd filed Critical Beijing Guangnian Wuxian Technology Co Ltd
Priority to CN201710551230.0A
Publication of CN107423809A
Application granted
Publication of CN107423809B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/21 Server components or server architectures
    • H04N 21/218 Source of audio or video content, e.g. local disk arrays
    • H04N 21/2187 Live feed
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/41 Structure of client; Structure of client peripherals
    • H04N 21/4104 Peripherals receiving signals from specially adapted client devices
    • H04N 21/4122 Peripherals receiving signals from specially adapted client devices additional display device, e.g. video projector
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/478 Supplemental services, e.g. displaying phone caller identification, shopping application

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a multi-modal interaction method for a virtual robot applied to a live video platform, in which an application of the live video platform accesses a virtual robot with multi-modal interaction capability. The multi-modal interaction method comprises the following steps: displaying a virtual robot with a specific image in a preset area, entering a default live broadcast auxiliary mode, and receiving multi-modal data and multi-modal instructions input in the live broadcast room in real time; analyzing the multi-modal data and the multi-modal instructions, and discriminating and determining a target live broadcast auxiliary mode by utilizing the multi-modal interaction capability of the virtual robot; and starting the target live broadcast auxiliary mode, the virtual robot performing multi-modal interaction and display according to the target live broadcast auxiliary mode. By switching among live broadcast auxiliary modes, the invention presents multi-modal interaction in multiple modes, which increases user interest, maintains user stickiness, and improves the user experience.

Description

Virtual robot multi-mode interaction method and system applied to video live broadcast platform
Technical Field
The invention relates to the technical field of Internet live broadcast platforms, in particular to a multi-mode interaction method and system of a virtual robot applied to a video live broadcast platform.
Background
With the development of the webcast industry, a user can obtain virtual prizes on a live webcast platform by watching streams, taking part in activities, and the like, and can give the obtained virtual prizes to a favorite anchor as a form of interaction, which cultivates the user's viewing habits and stickiness to the platform.
However, on existing live webcast platforms, the mechanisms for monitoring the anchor's live broadcast state are incomplete, the anchor's performance forms are monotonous, and the experience delivered to the user is poor. Improving the intelligence of the live broadcast platform is therefore an important technical problem that urgently needs to be solved.
Disclosure of Invention
In order to solve the above technical problem, an embodiment of the present application first provides a multi-modal interaction method for a virtual robot applied to a live video platform, where an application of the live video platform accesses the virtual robot and the virtual robot has multi-modal interaction capability. The multi-modal interaction method comprises the following steps: a multi-modal information input step, in which a virtual robot with a specific image is displayed in a preset area, a default live broadcast auxiliary mode is entered, and multi-modal data and multi-modal instructions input in the live broadcast room are received in real time; a data processing and mode discrimination step, in which the multi-modal data and/or the multi-modal instructions are analyzed, and a target live broadcast auxiliary mode is discriminated and determined by utilizing the multi-modal interaction capability of the virtual robot; and a multi-modal interaction information output step, in which the target live broadcast auxiliary mode is started and the virtual robot performs multi-modal interaction and display according to the target live broadcast auxiliary mode.
Preferably, the data processing and mode discrimination includes: receiving the multi-modal data during the live broadcast, and extracting wake-up data directed at the virtual robot; and entering one of the multi-modal interaction modes matched with the wake-up data, and executing multi-modal interaction and display actions in the current multi-modal interaction mode.
Preferably, the multi-modal interaction modes comprise: a dialogue mode, a basic performance mode, an audience-interaction mode, and a mode for interacting with other virtual robots.
Preferably, in the data processing and mode discrimination, the multi-modal instruction set by the anchor for mode conversion is further acquired; the mode conversion setting is analyzed and responded to, and the system switches from the current multi-modal interaction mode to another multi-modal interaction mode, namely the target live broadcast auxiliary mode.
Preferably, the multimodal data and/or multimodal instructions comprise: one or more of text information, voice information, visual information, control command information, and combination information thereof.
In another aspect, embodiments of the present application propose a storage medium having stored thereon program code executable to perform the method steps of any one of the above.
On the other hand, an embodiment of the present application further provides a multi-modal interaction system of a virtual robot applied to a live video platform, where an application of the live video platform accesses the virtual robot, the virtual robot has multi-modal interaction capability, and the multi-modal interaction system includes the following modules: the multi-mode information input module displays a virtual robot with a specific image in a preset area, enters a default live broadcast auxiliary mode, and receives multi-mode data and multi-mode instructions input by a live broadcast room in real time; the data processing and mode distinguishing module analyzes the multi-mode data and the multi-mode instructions, and distinguishes and determines a target live broadcast auxiliary mode by utilizing the multi-mode interaction capacity of the virtual robot; and the multi-mode interactive information output module is used for starting a target live broadcast auxiliary mode, and the virtual robot performs multi-mode interaction and display according to the target live broadcast auxiliary mode.
Preferably, in the data processing and mode discrimination module, wake-up data directed at the virtual robot is extracted based on the multi-modal data; one of the multi-modal interaction modes matched with the wake-up data is entered, and multi-modal interaction and display actions are executed in the current multi-modal interaction mode.
Preferably, the multi-modal interaction modes comprise: a dialogue mode, a basic performance mode, an audience-interaction mode, and a mode for interacting with other virtual robots.
Preferably, in the data processing and mode discrimination module, the multi-modal instruction set by the anchor for mode conversion is further acquired; the mode conversion setting is analyzed and responded to, and the system switches from the current multi-modal interaction mode to another multi-modal interaction mode, namely the target live broadcast auxiliary mode.
Preferably, the multi-modal data and/or the multi-modal instructions comprise: one or more of text information, voice information, visual information, control command information, and combinations thereof.
Compared with the prior art, one or more embodiments in the above scheme can have the following advantages or beneficial effects:
the embodiment of the invention provides a solution in which a virtual robot assists an anchor in live broadcast work. The solution lets the virtual robot present multi-modal interaction according to the determined live broadcast auxiliary mode, which increases user interest, maintains user stickiness, and improves the user experience.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure and/or process particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the technology or prior art of the present application and are incorporated in and constitute a part of this specification. The drawings expressing the embodiments of the present application are used for explaining the technical solutions of the present application, and should not be construed as limiting the technical solutions of the present application.
Fig. 1 is a schematic view of a multimodal interactive application scenario of a live webcast platform according to an embodiment of the present application.
Fig. 2a is a schematic view of an acting-cute mode scene of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application.
Fig. 2b is a schematic view of a basic buffering and performance mode scene of the multi-modal interactive system of the live webcast platform according to the embodiment of the present application.
Fig. 2c is a schematic view of a scene of an interaction mode with an audience of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application.
Fig. 2d is a schematic view of a scene of a live webcast platform multimodal interaction system in a microphone connecting mode with another virtual robot according to an embodiment of the present application.
Fig. 2e is a schematic view of a scene of an interaction mode with other virtual robots in the multi-modal interaction system of the live webcast platform according to the embodiment of the present application.
Fig. 3 is a schematic structural diagram of a multimodal interaction system of a live webcast platform according to an embodiment of the present application.
Fig. 4 is a mode conversion diagram of the multi-modal interactive system of the live webcast platform according to the embodiment of the present application.
Fig. 5 is a block diagram of a multimodal interaction system of a live webcast platform according to an embodiment of the present application.
Fig. 6 is a block diagram of a side face detection module 522 in the multi-modal interactive system of the live webcast platform according to the embodiment of the present application.
Fig. 7 is a flowchart illustrating a side face detection function implemented in a multi-modal interactive system of a live webcast platform according to an embodiment of the present application.
Fig. 8 is a block diagram of a speech recognition module 524 of the multimodal interaction system of the live webcast platform according to the embodiment of the present application.
Fig. 9 is a flowchart of implementing a voice recognition function in the multi-modal interactive system of the live webcast platform according to the embodiment of the present application.
Fig. 10 is a block diagram of a mode discrimination module 523 of the multi-modal interactive system of the live webcast platform according to the embodiment of the present application.
Fig. 11 is a block diagram of a semantic analysis module 525 of a multimodal interaction system of a live webcast platform according to an embodiment of the present application.
Fig. 12 is a flowchart illustrating a semantic analysis function implemented in the multi-modal interactive system of the live webcast platform according to the embodiment of the present application.
Detailed Description
The following detailed description of the embodiments of the present invention will be provided with reference to the accompanying drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the corresponding technical effects can be fully understood and implemented. The embodiments and the features of the embodiments can be combined without conflict, and the technical solutions formed are all within the scope of the present invention.
Additionally, the steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
Fig. 1 is a schematic view of an application scenario of a multi-modal interaction system of a live webcast platform according to an embodiment of the present application. As shown in fig. 1, the system is applied to a live webcast platform 300. Before the system is used, live webcast application software needs to be installed on the anchor device 121; the anchor 111 opens the live webcast software and actively initiates a live broadcast task to enter the live webcast platform 300 for a live performance. In addition, the viewer users (211 … 21n) need to install live application software of the same name as on the anchor device 121 on their user devices (221 … 22n); a user (211 … 21n) can enter the live-room web address on his or her device (221 … 22n) and join the live-room platform 300 through the internet, and the user (211 … 21n) watches the live performance of the anchor 111 through the live-room user display interface (2211 … 22n1). It should be noted that the present application does not specifically limit the types of the user devices (221 … 22n) and the anchor device 121, which may be, for example, smart phones, computers, tablet computers, and the like.
Further, when the anchor 111 opens the live application and initiates a live broadcast command, the live application displays a live-room anchor display interface 1211 on the display screen of the anchor device 121. Referring to fig. 1, the live-room anchor display interface 1211 includes the following display areas: an anchor performance area, which feeds back the performance video of the anchor 111 in real time; a barrage, audience message and audience gift display area, which scroll-displays the barrage information, audience message information, and audience gift data sent by viewers; an anchor master control area, from which the anchor 111 sends control commands such as starting and ending the live broadcast or connecting with a viewer (the control commands can be issued, for example, via function buttons); and a robot auxiliary performance area, which feeds back state information of the virtual robot 111f, such as expression, language, and action, in real time. In addition, when a user (211 … 21n) enters the anchor's live broadcast room 300, the user can view substantially the same pictures as the anchor performance area and the robot auxiliary performance area of the anchor display interface 1211 through the live-room user display interface (2211 … 22n1); however, the live-room user display interface (2211 … 22n1) differs from the live-room anchor display interface 1211 in two points: first, the barrage, message and gift display area in the live-room user display interface (2211 … 22n1) has the same functions as the corresponding area in the live-room anchor display interface 1211, and a user (211 … 21n) can also input message text in this area; second, the user control area in the live-room user display interface (2211 … 22n1) contains a control button for the user to leave the live room.
The multi-modal interaction system of the live webcast platform of the present application is configured with a virtual robot 111f having multi-modal interaction capability, which uses an animated image as a carrier and can output multi-modal information such as text information, voice information, expression animation information, and motion information. In the embodiment of the application, the multi-modal interaction system of the live webcast platform can realize the following functions by using the virtual robot 111f: when the anchor issues no instruction, the virtual robot 111f can assist the anchor in performing and thank designated audience users; and it can switch to the corresponding multi-modal interaction mode according to different instructions from the anchor. When the anchor is not articulate or is tired, the loaded virtual robot can interact with the audience in the anchor's place, give corresponding performances for the audience, and converse with the anchor, thereby keeping up the visit volume and popularity of the live broadcast room and maintaining the live broadcast quality and duration.
The anchor commands include the following operations: the anchor turns his or her side face towards the virtual robot 111f, the anchor calls the virtual robot's name in speech, the anchor speaks key instructions such as 'dance', 'sing', or 'tell' in speech, the anchor presses the button for interacting with the audience, and the anchor presses the button for interacting with other virtual robots. In addition, the multi-modal interaction modes include the following: the acting-cute mode, the dialogue mode, the basic performance mode, the audience-interaction mode, and the mode for interacting with other virtual robots.
Next, a detailed description is given of matching and conversion between the anchor command and the virtual robot assisted live mode in the multi-modal interactive system of the live webcast platform, and how the virtual robot implements an assisted performance process in each mode.
(first mode)
In the embodiment of the application, if the anchor is performing live and has issued no special instruction, the virtual robot is in the acting-cute mode.
Fig. 2a is a schematic view of an acting-cute mode scene of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in fig. 2a, in the anchor display interface 1211 of the live broadcast room, the anchor master control area further includes emotion command buttons for the acting-cute mode, for example basic emotion commands such as excited, happy, calm, surprised, and sad. In the acting-cute mode, the virtual robot, by means of its animated image, simultaneously outputs acting-cute multi-modal information corresponding to the commanded emotion, comprising voice, action, and expression. For the same emotion, each modality has a plurality of content items, and the virtual robot randomly selects one of them as the output of that modality under the given emotion command. Specifically, in one embodiment, under the 'excited' emotion command, the voice information that the virtual robot can output includes exclamations such as 'Awesome!', 'No way!', and 'Keep it up!'; the action information that can be output includes spinning, giving a thumbs-up, dancing, and the like; and the expressions that can be output include a toothy grin, a head-thrown-back laugh, and the like. The virtual robot randomly selects different items for each modality and outputs matched acting-cute multi-modal information, for example: the spinning action matched with the 'No way!' voice and the toothy-grin expression; or the dancing action matched with the 'Awesome!' voice and the head-thrown-back-laugh expression. In another embodiment, under the 'surprised' emotion command, the voice information that the virtual robot can output includes exclamations such as 'Really?', 'Oh my!', and 'What?!'; the action information that can be output includes spreading both hands, stepping back, waving the hands, and the like; and the expressions that can be output include an open mouth in a vertical oval shape, widened eyes, and the like. The virtual robot randomly selects different items for each modality and outputs matched acting-cute multi-modal information, for example: the action of spreading both hands matched with the 'Really?' voice and the wide-eyed expression; or the action of stepping back matched with the 'Oh my!' voice and the vertical-oval-mouth expression.
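For illustration only, the following Python sketch shows one way such per-emotion random matching of voice, action, and expression could be implemented; the emotion names and content pools are simplified assumptions, not the system's actual data.

```python
import random

# Illustrative content pools per emotion command; the real pools and emotion
# codes used by the system are not specified here.
ACTING_CUTE_POOLS = {
    "excited": {
        "voice": ["Awesome!", "No way!", "Keep it up!"],
        "action": ["spin", "thumbs_up", "dance"],
        "expression": ["toothy_grin", "head_back_laugh"],
    },
    "surprised": {
        "voice": ["Really?", "Oh my!", "What?!"],
        "action": ["spread_hands", "step_back", "wave_hands"],
        "expression": ["mouth_open_oval", "wide_eyes"],
    },
}

def pick_multimodal_output(emotion: str) -> dict:
    """Randomly pick one content item per modality for the given emotion command."""
    pools = ACTING_CUTE_POOLS[emotion]
    return {modality: random.choice(items) for modality, items in pools.items()}

print(pick_multimodal_output("excited"))
# e.g. {'voice': 'No way!', 'action': 'spin', 'expression': 'toothy_grin'}
```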
On the other hand, the anchor master control area in the anchor display interface of the live broadcast room is also provided with a thank-you command button for the acting-cute mode. When the anchor presses the thank-you command button, the system randomly selects several gift-giving viewers to thank according to the statistics of the audience gift data. Specifically, when the anchor issues the thank-you command, the system, according to the audience gift records, counts the names of viewers who sent a 'yacht', the names of viewers who sent a 'villa', and so on, randomly selects three gift-giving viewers, and the virtual robot outputs the thank-you audio by means of its animated image, for example: 'Thank you xxx for the yacht', 'Thank you xxx for your support', 'Thank you xxx for your attention', 'Thank you xxx for the villa', and so on (where xxx corresponds to the name of the matched gift-giving viewer).
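As a hedged illustration of the thank-you logic described above, the sketch below counts hypothetical gift records, randomly picks up to three gift-giving viewers, and formats thank-you lines; the record format and phrasing are assumptions.

```python
import random

# Hypothetical gift records (viewer_name, gift_name); the real data format is
# whatever the live platform reports for audience gift-sending.
gift_records = [
    ("viewer_a", "yacht"), ("viewer_b", "villa"), ("viewer_c", "rocket"),
    ("viewer_d", "yacht"), ("viewer_e", "flower"),
]

def build_thank_you_lines(records, count=3):
    """Randomly pick up to `count` gift-giving viewers and format thank-you lines."""
    last_gift = dict(records)                     # latest gift per viewer
    chosen = random.sample(list(last_gift), min(count, len(last_gift)))
    return [f"Thank you {name} for the {last_gift[name]}!" for name in chosen]

for line in build_thank_you_lines(gift_records):
    print(line)
```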
In the live-room user display interface, a user can not only watch the anchor's performance in real time, but can also see the animation segments (matched output of voice, expression, and action) displayed by the virtual robot in turn.
(second mode)
In this embodiment of the application, if the anchor calls out the name of the virtual robot during the live performance and/or the anchor turns his or her side face towards the virtual robot, the virtual robot enters the dialogue mode.
Fig. 2b is a schematic view of a dialogue mode scene of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in fig. 2b, when the system enters the dialogue mode, the anchor display interface 1211 and the live-room user display interface (2211 … 22n1) are switched from the acting-cute mode interface to the dialogue mode interface. In the dialogue mode interface, the image of the anchor 111 and the animated image of the virtual robot are displayed on the device screen, while the dialogue contents of both are scroll-displayed in text form in real time.
Specifically, in the embodiment of the present application, when the anchor turns his or her side face towards the virtual robot and/or says to the virtual robot: 'Picture ("Picture" is the name of the virtual robot that assists the anchor 111's live broadcast in the embodiment of the present application), say hello to everyone!', the system's dialogue mode starts, and in one embodiment the anchor and the virtual robot may complete a dialogue as follows.
The virtual robot says to the anchor: "Hi, everyone, I am Picture, everyone's animated co-host."
The anchor says to the virtual robot: "How did I just perform?"
The virtual robot says to the anchor: "Particularly well!"
The anchor says to the virtual robot: "Where do you think I could improve?"
The virtual robot says to the anchor: "It would be even better if your expressions were a bit richer!" ……
In the above dialogue, the virtual robot can respond in real time to questions raised by the anchor. It should be noted that when the multi-modal interaction system of the live webcast platform is in the dialogue mode and no response information (response dialogue text) is being output, the virtual robot enters a buffering state within the dialogue mode. In the buffering state, the virtual robot uses its animated image as the information output carrier and displays expression and voice information on the live-room display interface, filling the time gap caused by an overly long response time during the live broadcast. Specifically, for example, in this state the output emotion of the virtual robot is 'very happy', and the output voice content is common material for interacting with the audience.
(third mode)
In the embodiment of the application, when the system is in the dialogue mode, if the anchor speaks a specific performance form command to the virtual robot during the live performance, the virtual robot enters the basic performance mode. A performance form command is a sentence containing keywords such as 'sing', 'dance', or 'tell'.
Fig. 2c is a schematic view of a basic performance mode scene of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in fig. 2c, when the system enters the basic performance mode, the anchor display interface 1211 of the live broadcast room and the live-room user display interfaces (2211 … 22n1) are switched from the dialogue mode interface to the basic performance mode interface. The system analyzes the performance form command, the virtual robot responds to the analyzed command and gives the corresponding performance, and the virtual robot, using its animated image as a carrier, displays the preset, matched video stream information (the video stream information includes voice information, action information, and the like) on the live-room display interface. Each performance form command is associated with a plurality of groups of video stream information with different content. Specifically, in one embodiment, in the dialogue mode, when the anchor speaks a performance form command to the virtual robot such as 'Picture, dance for everyone!', the system analyzes that the performance form command initiated by the anchor is dancing, and randomly selects one group of data from the plurality of groups of data with different content corresponding to the keyword 'dance' for output, so that the dancing state is displayed on the virtual robot's live-room display interface. In another embodiment, in the dialogue mode, when the anchor speaks a performance form command to the virtual robot such as 'Picture, tell everyone a joke!', the system analyzes that the performance form command initiated by the anchor is telling a joke, and randomly selects one group of data from the plurality of groups of data with different content corresponding to the keywords 'tell' & 'joke' for output, so that the virtual robot's live-room display interface shows the state of telling a joke. It should be noted that, in the embodiment of the present application, the types of keywords of the performance form commands, the number of data groups under each keyword, and the data content are not specifically limited, and the implementer can adjust them according to actual needs.
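The keyword-driven selection described above might be sketched as follows; the keyword tuples and clip identifiers are illustrative placeholders, not the patent's actual data.

```python
import random

# Illustrative mapping from performance-keyword groups to groups of preset
# video-stream clips; the real keywords and clip data are implementer-defined.
PERFORMANCE_LIBRARY = {
    ("dance",): ["dance_clip_01", "dance_clip_02", "dance_clip_03"],
    ("sing",): ["song_clip_01", "song_clip_02"],
    ("tell", "joke"): ["joke_clip_01", "joke_clip_02"],
    ("tell", "story"): ["story_clip_01"],
}

def select_performance(utterance: str):
    """Return one randomly chosen clip whose keyword group fully appears in the utterance."""
    matches = [(kws, clips) for kws, clips in PERFORMANCE_LIBRARY.items()
               if all(k in utterance for k in kws)]
    if not matches:
        return None
    # Prefer the most specific keyword group ("tell" & "joke" over "tell" alone).
    _, clips = max(matches, key=lambda m: len(m[0]))
    return random.choice(clips)

print(select_performance("Picture, tell everyone a joke!"))  # e.g. joke_clip_02
```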
(fourth mode)
In this embodiment of the present application, when the system is in the dialogue mode, if the anchor presses the button for interacting with the audience during the live performance, the virtual robot enters the audience-interaction mode.
Fig. 2d is a schematic view of an audience-interaction mode scene of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in fig. 2d, when the system enters the audience-interaction mode, the anchor display interface 1211 and the live-room user display interfaces (2211 … 22n1) are switched from the dialogue mode interface to the audience-interaction mode interface, and the system stops collecting the live video of the anchor 111. The virtual robot randomly selects several viewer messages to answer; the live-room display interface not only scroll-displays the text of the conversation with the viewers, but the virtual robot also reads out the replies to the viewers' messages by means of its animated image, outputting audio and text synchronously. When the anchor presses the button for returning to the default mode, the audience-interaction mode ends, the collection of the mic-connected viewer's video image stops, and the system returns to the acting-cute mode.
(fifth mode)
In this embodiment of the present application, when the system is in the dialogue mode, if the anchor presses the button for interacting with another virtual robot during the live performance, and the anchor 111a assisted by the other virtual robot presses the button for accepting the mic-connection request after receiving the mic-connection notification, the system enters the mode for interacting with other virtual robots.
Fig. 2e is a schematic view of a scene of the mode for interacting with other virtual robots in the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in fig. 2e, in this scene, virtual robot A and virtual robot B converse with each other. In this embodiment, virtual robot A is the auxiliary live broadcast virtual robot of the anchor 111, and virtual robot B is the auxiliary live broadcast virtual robot of the anchor 111a who connects mics with the anchor 111. Before the mic-connection mode, users (211 … 21n) watch the live performance of the anchor 111 through the live-room platform to which the anchor 111 belongs, and users (211a … 21na) watch the live performance of the anchor 111a through the live-room platform to which the anchor 111a belongs. When the anchor 111 sends a virtual-robot mic-connection request to the anchor 111a and the anchor 111a accepts the request, the system enters the mode for interacting with other virtual robots; the live-room anchor display interfaces (1211 and 1211a) and the live-room user display interfaces switch from the dialogue mode interface to the interface of the mode for interacting with other virtual robots, and the users of both live rooms can simultaneously watch the conversation between the auxiliary live broadcast virtual robots of the anchor 111 and the anchor 111a. In this mode, the system stops capturing the live video of the anchor 111 and the anchor 111a, and the live-room display interfaces scroll-display the dialogue text of the two virtual robots' conversation. When the anchor 111 presses the button for returning to the default mode, this mode ends, and the system returns to the acting-cute mode.
Fig. 3 is a schematic structural diagram of a live webcast platform multimodal interaction system according to an embodiment of the present application, and as shown in fig. 3, the live webcast platform multimodal interaction system includes the following elements: anchor camera 511, anchor microphone 512, anchor main control button 513, live broadcast room platform 300, cloud server 400.
The constituent elements of the system are described below. The anchor camera 511 collects live video of the anchor 111 in real time; the anchor microphone 512 collects live voice information of the anchor 111 in real time; the anchor master control button 513 is operated by the anchor 111 and transmits control command signals. Further, the anchor camera 511, the anchor microphone 512, and the anchor master control button 513 respectively transmit the acquired live video information, live voice information, and control command signals to the anchor-side data acquisition interface of the live broadcast room platform 300. It should be noted that the present application does not specifically limit the installation position of the anchor camera, the device type of the anchor microphone, or the installation position and output form of the control command (a button is one specific example of the output form of a control command).
Referring to fig. 3, the live broadcast room platform 300 includes a live application anchor side and a live application client side that communicate with each other through the internet. The live application anchor side is configured with an API interface having corresponding communication rules and data transmission formats, and the virtual robot 1211 is connected to the live application anchor side through this API interface in the form of a functional plug-in and installed in it. The plug-in of the virtual robot 1211 therefore needs to satisfy the data transmission rules of the API interface so that it can be loaded into the live application software (the virtual robot plug-in is installed in the live application software on the anchor device 121), after which it performs real-time information interaction with the cloud server 400 and with the live application client side through internet transmission protocols. In addition, the plug-in of the virtual robot 1211 needs to run simultaneously with the live application software in order to add the new auxiliary live broadcast function of the virtual robot to an ordinary live-room platform.
The cloud server 400 is connected to the plug-in of the virtual robot 1211 through the internet; it has large storage space and strong computing power and can efficiently compute, store, and analyze large amounts of data. In the embodiment of the present application, the virtual robot 1211 uses the powerful computing and storage capability of the cloud server 400 to obtain multi-modal interaction capability, for example outputting one or more of text information, voice information, visual information, and combinations thereof.
In the embodiment of the application, the virtual robot plug-in outputs multi-modal data through a dedicated animated image, and when the plug-in runs it adds a multi-modal interaction function to ordinary live application software, thereby forming the multi-modal interaction system of the live webcast platform of the present application. When the multi-modal interaction system of the live webcast platform runs, it has the following capabilities: first, it can receive in real time the anchor live video information sent by the anchor camera 511, the anchor voice information sent by the anchor microphone 512, the control command signals sent by the anchor master control button 513, user text messages, user gift data, and the video information of a designated mic-connection object; second, it accesses and interacts with the cloud server 400 through the internet, so that the large amount of received data can be analyzed and computed in real time by the cloud server 400; third, the virtual robot can feed back its response text information, voice information, animation information, video stream information, and the like to users in real time through the internet. Further, the multi-modal interaction system of the live webcast platform also performs the following functions during data processing, for example: detecting the side face of the anchor 111, performing character recognition on voice information, recognizing the system mode, and outputting a response text to the speech text information (the corresponding text into which the voice information has been converted), among others.
In the embodiment of the application, the multi-modal interaction system of the live webcast platform has multiple modes, and once the plug-in of the virtual robot 1211 has been connected to the live application software, the system enters the live video process and can switch among the multiple modes. The multi-modal interaction system of the live webcast platform comprises the following modes: the acting-cute mode and the multi-modal interaction modes. Further, the multi-modal interaction modes include: a dialogue mode, a basic performance mode, an audience-interaction mode, and a mode for interacting with other virtual robots. Fig. 4 is a mode conversion diagram of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in fig. 4, the mode conversion process satisfies the following steps.
First, during the live video process, a virtual robot with a specific image is displayed in a preset area, the default live broadcast auxiliary mode is entered, and the multi-modal data and multi-modal instructions input in the live broadcast room are received in real time. The system then analyzes the multi-modal data and the multi-modal instructions, and discriminates and determines the target live broadcast auxiliary mode by utilizing the multi-modal interaction capability of the virtual robot. The multi-modal data and the multi-modal instructions specifically contain the following information: live video information, voice information, control command signals, user text information, and user gift data. Furthermore, the system extracts wake-up data directed at the virtual robot from the received multi-modal data sent during the live broadcast, enters the multi-modal interaction mode matched with the wake-up data (in the embodiment of the application, the system first enters the dialogue mode), and executes multi-modal interaction and display actions in the current multi-modal interaction mode; it then acquires the anchor's multi-modal instruction for the mode conversion setting, performs functional analysis on the multi-modal instruction together with the multi-modal data, responds to the mode conversion setting, and switches from the current multi-modal interaction mode to another multi-modal interaction mode, namely the target live broadcast auxiliary mode. Specifically, the virtual robot first judges whether the anchor has triggered the wake-up data; if the wake-up data has not been triggered, the system stays in the acting-cute mode; if the wake-up data has been triggered, the system enters the dialogue mode (one of the multi-modal interaction modes), and then analyzes the multi-modal instructions and uses the multi-modal data to switch from the current dialogue mode to another multi-modal interaction mode, namely the target live broadcast auxiliary mode. The wake-up data refers to the anchor calling the virtual robot's name during the live performance and/or the anchor turning his or her side face towards the virtual robot. The multi-modal data and multi-modal instructions include one or more of text information, voice information, visual information, and combinations thereof. In addition, the target live broadcast auxiliary mode includes the basic performance mode, which covers dancing, singing, storytelling, and the like, the audience-interaction mode, and the mode for interacting with other virtual robots.
It should be noted that, during the live video process, the default mode of the system is the acting-cute mode.
It should also be noted that the present application does not limit the actual types of the multi-modal data and instructions, and implementers can adjust their content according to actual requirements. Finally, when the new live broadcast auxiliary mode ends, the system returns to the default mode (the acting-cute mode).
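The mode-switching behaviour described above can be viewed as a small state machine. The sketch below is a simplified assumption of how the transitions might be encoded; trigger names such as wake_up or audience_button are illustrative stand-ins for the wake-up data and anchor control commands.

```python
# A simplified state machine for the live-assist modes described above.
# Trigger names are illustrative; in the patent they come from wake-up data
# (the robot's name being called / the anchor's side face) and anchor commands.
TRANSITIONS = {
    ("acting_cute", "wake_up"): "dialogue",
    ("dialogue", "performance_keyword"): "performance",
    ("dialogue", "audience_button"): "audience_interaction",
    ("dialogue", "link_mic_accepted"): "robot_interaction",
    # Any mode returns to the default acting-cute mode when it ends.
    ("dialogue", "end"): "acting_cute",
    ("performance", "end"): "acting_cute",
    ("audience_interaction", "end"): "acting_cute",
    ("robot_interaction", "end"): "acting_cute",
}

class LiveAssistModeMachine:
    def __init__(self):
        self.mode = "acting_cute"   # default live broadcast auxiliary mode

    def on_event(self, trigger: str) -> str:
        self.mode = TRANSITIONS.get((self.mode, trigger), self.mode)
        return self.mode

machine = LiveAssistModeMachine()
print(machine.on_event("wake_up"))              # dialogue
print(machine.on_event("performance_keyword"))  # performance
print(machine.on_event("end"))                  # acting_cute
```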
Fig. 5 is a block diagram of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in fig. 5, the system is composed of the following devices: a multi-modal information input module 51, a data processing and mode discrimination module 52, and a multi-modal interaction information output module 53. The multi-modal information input module 51 acquires and receives, in real time, the multi-modal data of the live broadcast process and the multi-modal instructions for the mode conversion setting input in the live broadcast room, performs function coding on the different kinds of functional information, and forwards the processed multi-modal input data packet to the cloud server 400. The data processing and mode discrimination module 52 receives and parses the multi-modal input data packet sent by the multi-modal information input module 51, discriminates and determines the target live broadcast auxiliary mode according to the obtained wake-up data and the multi-modal data and instructions for the mode conversion setting, invokes the multi-modal interaction capability to process the data in the corresponding mode to obtain a multi-modal output data packet for the live broadcast room, and sends the multi-modal output data packet to the multi-modal interaction information output module 53 through the internet. The multi-modal interaction information output module 53 starts the target live broadcast auxiliary mode, the virtual robot performs multi-modal interaction and display according to the current live broadcast auxiliary mode, and the module parses the multi-modal output data packet and acquires and outputs the corresponding system output information under the target live broadcast auxiliary mode. It should be noted that the data processing and mode discrimination module 52 is executed by the cloud server 400, where the cloud server 400 has multi-modal interaction capability and can implement functions such as side face detection, speech recognition, and semantic analysis.
The module structure and functions of the multi-modal interaction system of the live webcast platform are described in detail below. First, the multi-modal information input module 51 is explained. As shown in fig. 5, this module consists of six acquisition modules (511-516) and an information forwarding module 517. The first acquisition module 511 acquires video information of the anchor's performance in real time during the live broadcast, converts the information from video format into single-frame image format, function-encodes the frame image information (for example, with function code 111), and outputs an image input data packet comprising the frame image function code and the frame image data. The second acquisition module 512 collects the anchor's voice information in real time during the live broadcast, function-encodes the anchor voice information (for example, with function code 112), and outputs a voice input data packet comprising the voice function code and the voice data. The third acquisition module 513 collects the control command signals sent from the anchor master control area in real time during the live broadcast, function-encodes the control command signals (for example, with function code 113), and outputs a command input data packet comprising the control command function code and the control command signal. The fourth acquisition module 514 collects, in real time during the live broadcast, the text information, including audience message information and barrage information, sent from the live-room user side, function-encodes the text information (for example, with function code 114), and outputs a text input data packet comprising the text function code and the text data. The fifth acquisition module 515 collects, in real time during the live broadcast, the audience gift information sent from the live-room user side, where the audience gift information includes the gift code and the gift-sending user's name, function-encodes the audience gift information (for example, with function code 115), and outputs a gift information input data packet comprising the audience gift function code and the audience gift data. The sixth acquisition module 516 collects the video information of the designated mic-connection object (the other virtual robot) through the internet, function-encodes the video information (for example, with function code 116), and outputs a mic-connection video input data packet comprising the function code and the image information of the designated mic-connection object. The information forwarding module 517 receives the data packets from the first to sixth acquisition modules, integrates the six data packets received in the same acquisition cycle, and encodes the integrated data, thereby obtaining a new collected-information input data packet with a packet code. The control command signals include an audience-interaction command signal (for example, with function code 121), a signal for interacting with other virtual robots (for example, with function code 122), and acting-cute emotion command signals covering several emotions (for example, with function codes 1231-123n).
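A minimal sketch of how the function coding and packet integration described above could look; the dictionary-based packet layout is an assumption, while the example function codes 111-116 follow the text above.

```python
import itertools
import time

# Example function codes from the text above: 111 video frame, 112 anchor voice,
# 113 control command, 114 audience text, 115 gift data, 116 link-mic video.
_packet_seq = itertools.count(1)

def make_function_packet(function_code: int, data) -> dict:
    """Wrap one piece of collected data with its function code."""
    return {"function_code": function_code, "data": data}

def integrate_packets(packets: list) -> dict:
    """Bundle one sampling cycle's function packets under a single packet code,
    as the information forwarding module 517 does."""
    return {"packet_code": next(_packet_seq), "timestamp": time.time(), "items": packets}

cycle = integrate_packets([
    make_function_packet(112, b"<pcm-audio-frame>"),
    make_function_packet(114, "Nice show!"),
    make_function_packet(115, {"gift": "yacht", "user": "viewer_a"}),
])
print(cycle["packet_code"], [p["function_code"] for p in cycle["items"]])
```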
Next, the composition and functions of the data processing and mode discrimination module 52 are described in detail. Referring to fig. 5, the module is composed of a data receiving module 521, a side face detection module 522, a speech recognition module 524, a mode discrimination module 523, a semantic analysis module 525, a reading module 526, a basic performance mode module 527, an acting-cute mode module 528, and a data transmitting module 529. The functions and components of each module in the data processing and mode discrimination module 52 are described in detail below.
The data receiving module 521 is configured to receive the collected-information input data packet sent by the information forwarding module 517, parse the data packet according to the packet code and the data function code, convert the parsed data into functional data packets, and distribute them to the subsequent modules. The collected-information input data are converted into functional data packets with the following data identifiers: packet code, data function code, and data content. Specifically, in a first embodiment, when the parsed data function code is 122, the corresponding data content of the signal for interacting with other virtual robots is '1'; the information is encoded according to the data identifiers to obtain the corresponding functional data packet, which is transmitted to the mode discrimination module 523. In a second embodiment, when the parsed data function code is 113, the corresponding data content is the text of an audience message; the message is encoded according to the data identifiers to obtain the corresponding functional data packet, which is transmitted to the mode discrimination module 523. In a third embodiment, when the parsed data function code is 114, the corresponding data content is audience gift information; the information is encoded according to the data identifiers to obtain the corresponding functional data packet, which is transmitted to the mode discrimination module 523. In a fourth embodiment, when the parsed data function code is 111, the corresponding data content is single-frame image data; the information is encoded according to the data identifiers to obtain the corresponding functional data packet, which is transmitted to the side face detection module 522.
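The dispatch performed by the data receiving module 521 can be pictured as a simple routing function keyed on the data function code, as in the following hedged sketch (the module objects are stubs; only the 111/112 routes named in the text are shown explicitly, with everything else going to mode discrimination).

```python
class _Stub:
    """Stand-in for a downstream processing module."""
    def __init__(self, name):
        self.name = name
    def handle(self, packet):
        print(f"{self.name} <- function code {packet['function_code']}")

def dispatch(packet, side_face_module, speech_module, mode_module):
    """Route a parsed functional data packet by its data function code."""
    code = packet["function_code"]
    if code == 111:                      # single-frame image data
        side_face_module.handle(packet)
    elif code == 112:                    # anchor voice data
        speech_module.handle(packet)
    else:                                # commands, audience text, gift data, ...
        mode_module.handle(packet)

dispatch({"function_code": 111, "data": b"<frame>"},
         _Stub("side_face_detection_522"), _Stub("speech_recognition_524"),
         _Stub("mode_discrimination_523"))
```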
Fig. 6 is a block diagram of a side face detection module 522 in a multi-modal interactive system of a live webcast platform according to an embodiment of the present application, and as shown in fig. 6, the module includes the following units: an image input unit 5221, a side face detection unit 5222, a side face signal determination unit 5223, and a data output unit 5224. The image input unit 5221 is configured to receive and analyze the functional data packet with the data function code 111 sent by the data receiving module 521, and acquire single-frame image data; a side face detection unit 5222 that detects a side face image of a human face in a single frame image and outputs a detection result; a side face signal determination unit 5223 that outputs a side face signal based on the side face detection result; and a data output unit 5224 that functionally encodes the side face signal (for example, functionally encodes it to 222) and constructs a new mode determination packet.
Fig. 7 is a flowchart illustrating the implementation of the side face detection function in the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in fig. 7, after the image input unit 5221 acquires the single-frame image data, the flow enters the side face detection unit 5222. In this unit, an Adaboost algorithm is used to detect a side face image in the image: according to a pre-generated cascade classification detector for human side faces, it is determined whether a side face image exists in the single frame, and the detection result is output and transmitted to the side face signal determination unit 5223. Next, based on the side face detection result, the side face signal determination unit 5223 determines the data content of the side face signal, which is '1' when a side face image is detected and '0' when it is not, and outputs it to the data output unit 5224. When the data output unit 5224 receives the side face signal data, it re-encodes the data processing result of the side face detection module 522 to obtain a new mode decision data packet, which includes data identifiers such as the packet code, the side face signal function code, and the side face signal data.
Further, in the side face detection unit 5222, the human side face classification detector is constructed as follows: using a face database, face rotation angles of 45° to 90° are taken as the rotation range, the extracted face features are recomputed over the labeled side-face rotation range to obtain side-face features, and the side-face feature classification detector is then obtained according to the Adaboost algorithm.
It should be noted that, in the embodiment of the present application, the Adaboost algorithm is used to detect the side face state in the live single-frame image; however, the present application does not specifically limit the implementation method for detecting the human side face, and other methods may be used instead.
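Since the implementation is left open, the following OpenCV-based sketch shows a generic cascade-classifier profile-face check in place of the custom-trained 45°-90° Adaboost detector; the stock haarcascade_profileface.xml model is an assumption, not the classifier described above.

```python
import cv2

# Stock OpenCV profile-face cascade as a stand-in for the custom 45-90 degree
# Adaboost side-face classifier described in the patent.
profile_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_profileface.xml")

def side_face_signal(frame) -> int:
    """Return 1 if a profile (side) face is detected in the frame, else 0."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = profile_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return 1 if len(faces) > 0 else 0

# Example: grab one frame from the anchor camera (device 0) and test it.
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()
if ok:
    print("side face signal:", side_face_signal(frame))
```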
Fig. 8 is a block diagram of the speech recognition module 524 of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in fig. 8, the module includes the following units: a voice input unit 5241, an audio-text conversion unit 5242, a text matching unit 5243, and a speech text output unit 5244. The voice input unit 5241 receives and parses the functional data packet with data function code 112 sent by the data receiving module 521 and obtains the voice data; the audio-text conversion unit 5242 converts the voice data into speech text data matching the voice data; the text matching unit 5243 matches the speech text data against preset keyword information and outputs a keyword code; and the speech text output unit 5244 function-encodes the speech text data and the keyword code data obtained by the speech recognition module 524 to form a new mode decision data packet.
Fig. 9 is a flowchart illustrating the implementation of the speech recognition function in the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in fig. 9, after the voice input unit 5241 parses the voice functional data packet, it obtains the packet code and the voice data, then sends the voice data to the audio-text conversion unit 5242, which is executed next. In this unit, the audio information needs to be converted into text information.
Specifically, the above conversion process comprises the following steps: 1) performing signal preprocessing on the voice signal, such as cutting the head and tail silence segments and framing; 2) extracting features from the voice input data using the pre-trained acoustic model and language model stored in the audio-text conversion unit 5242; 3) matching the single-frame voice features again using the acoustic model and the language model; 4) integrating the matching results with a semantic understanding database and outputting a voice recognition result (voice text information).
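A minimal Python sketch of the preprocessing in step 1) is given below (energy-based silence trimming and 25 ms / 10 ms framing at an assumed 16 kHz sampling rate); the model-based matching of steps 2)–4) is not reproduced because the acoustic model, language model and semantic understanding database are not specified here.

```python
# Sketch of silence trimming and framing, assuming a mono 16 kHz NumPy signal.
import numpy as np

def trim_silence(signal: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Cut leading/trailing samples whose absolute amplitude stays below threshold."""
    voiced = np.where(np.abs(signal) > threshold)[0]
    if voiced.size == 0:
        return signal[:0]
    return signal[voiced[0]:voiced[-1] + 1]

def frame_signal(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split the signal into overlapping frames (25 ms window, 10 ms hop at 16 kHz)."""
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    if n_frames == 0:
        return np.empty((0, frame_len))
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
```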
When the anchor's voice message has been converted into a text message, the text enters the text matching unit 5243. This unit stores a preset keyword database related to mode discrimination; each keyword corresponds to a keyword code, and when a keyword from the database appears in the voice text, the code corresponding to that keyword is output. For example, the keyword database may include the following entries related to mode discrimination: no keyword (e.g., code 212999); "map" (the name of the virtual robot assisting the anchor 111 in the live broadcast, e.g., code 212001); "map" & "xianling" (the names of two auxiliary live virtual robots that can interact with each other, e.g., code 212006); "sing" (e.g., code 212021); "sing" & "Mayday" (e.g., code 212025); "sing" & "swiftlet gesture" (e.g., code 212027); "dance" (e.g., code 212201); "dance" & "ballet" (e.g., code 212014); "dance" & "dance with holes"; "speak" (e.g., code 212401); "speak" & "joke" (e.g., code 212412); "speak" & "story" (e.g., code 212420); "tell" & "fairy tale" (e.g., code 212421); "tell" & "historical story" (e.g., code 212425); and so on.
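The keyword matching in the text matching unit 5243 can be illustrated with the following Python sketch, which maps a few of the example keyword combinations above to their codes and returns the most specific match; the "most keywords matched wins" rule and the English keyword strings are illustrative assumptions rather than the exact procedure.

```python
# Toy keyword-to-code table built from the examples in the text above.
KEYWORD_CODES = {
    frozenset(): "212999",                      # no keyword
    frozenset({"sing"}): "212021",
    frozenset({"sing", "Mayday"}): "212025",
    frozenset({"dance"}): "212201",
    frozenset({"speak", "joke"}): "212412",
    frozenset({"speak", "story"}): "212420",
}

def match_keywords(speech_text: str) -> str:
    """Return the code of the most specific keyword set found in the speech text."""
    best = frozenset()
    for keys in KEYWORD_CODES:
        if keys and all(k in speech_text for k in keys) and len(keys) > len(best):
            best = keys
    return KEYWORD_CODES[best]
```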
Finally, the voice text output unit 5244 is executed. After receiving the packet code transmitted by the voice input unit 5241, the voice text data output by the audio-text conversion unit 5242, and the keyword code data transmitted by the text matching unit 5243, this unit first performs function encoding on the voice text data and the keyword code data (for example, the function code corresponding to the voice text data is 211, and the function code corresponding to the keyword code data is 212); it then re-encodes the new data obtained by the speech recognition module to obtain a new mode determination data packet, where this module's mode determination data packet includes data identifiers such as the data packet code, the voice text function code, the voice text data, the function code of the keyword code, and the keyword code data.
Fig. 10 is a block diagram of the mode discrimination module 523 of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in Fig. 10, the module is divided into the following units: a data input unit 5231, a mode discrimination unit 5232, a data classification unit 5233 and a data classification transmission unit 5234. Each unit in the mode discrimination module 523 is described in detail below.
First, the data input unit 5231 receives the audience gift-sending function data packet, the audience message text information function data packet, the microphone voice information function data packet, and various control command function data packets sent by the data receiving module 521, and analyzes and extracts the key basis data for mode determination together with the pre-response data for the target auxiliary live broadcast mode. The control command function data packets specifically include the following commands: an interaction-with-audience command (e.g., function code 121), an interaction-with-another-virtual-robot command (e.g., function code 122), amusing-mode emotion commands including an amusing-mode "very happy" command (e.g., function code 1232) and an amusing-mode "calm" command (e.g., function code 1235), and a dialogue-mode termination command (e.g., function code 124). In the specific implementation, the module obtains the following data after completing the analysis: key basis data including control command data, the side face signal, keyword code data, and the like; audience text information; voice information of the person connected by microphone; audience gift information; and so on.
Next, the mode determination unit 5232 will be described in detail. This unit analyzes the key basis data for mode determination according to the data analysis result of the data input unit 5231, determines the system target mode, obtains the corresponding mode code, and performs function encoding on the mode code. Specifically: (first embodiment) when the data content of the side face signal is parsed as "0", the keyword code data is "212999", and/or the control command signal has function code "12301-12320", the target mode is determined to be the amusing mode state (e.g., the function code of the amusing mode is 2131); (second embodiment) when the data content of the side face signal is parsed as "1" and/or the keyword code data is in the range "212001-212004" (i.e., the keyword includes a virtual robot name) and the control command signal has function code "120" (no control command signal), the target mode is determined to be the dialogue mode state (e.g., the function code of the dialogue mode is 2132), and the dialogue mode state is locked; the dialogue mode lock is released when the control command signal is parsed with a function code of "121-124" (e.g., the function code of the dialogue-mode termination command is 124) and/or the keyword code data is in the range "212021-212900" (i.e., the keyword refers to specific performance content); (third embodiment) when, in the current dialogue mode state, the keyword code data is parsed to be in the range "212021-212900", the target mode is determined to be the performance basic mode state (e.g., the function code of the performance basic mode is 2133); (fourth embodiment) when the control command signal is parsed with function code "121", the target mode is determined to be the interaction-with-audience mode state (e.g., the function code of the interaction-with-audience mode is 2134); (fifth embodiment) when the control command signal is parsed with function code "122" and the keyword code data is in the range "212006-212020" (i.e., the keyword includes two virtual robot names), the target mode is determined to be the interaction-with-another-virtual-robot mode state (e.g., the function code of the interaction-with-another-virtual-robot mode is 2135).
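The five decision rules above can be summarised in a Python sketch such as the following; the function codes are taken from the text, but collapsing the rules into one function with a dialog-lock flag is an illustrative simplification rather than the exact procedure of the mode determination unit 5232.

```python
# Simplified mode decision: returns the target mode function code and the new
# dialog-lock state; codes are the examples quoted in the text above.
def decide_mode(side_face: str, keyword_code: str, command_code: str,
                dialog_locked: bool) -> tuple:
    kw = int(keyword_code)
    # Release the dialog lock on an explicit command or a performance keyword.
    if command_code in {"121", "122", "123", "124"} or 212021 <= kw <= 212900:
        dialog_locked = False
    if dialog_locked or ((side_face == "1" or 212001 <= kw <= 212004)
                         and command_code == "120"):
        return "2132", True            # dialogue mode, locked
    if command_code == "121":
        return "2134", dialog_locked   # interaction-with-audience mode
    if command_code == "122" and 212006 <= kw <= 212020:
        return "2135", dialog_locked   # interaction-with-another-virtual-robot mode
    if 212021 <= kw <= 212900:
        return "2133", dialog_locked   # performance basic mode
    return "2131", dialog_locked       # amusing mode (default)
```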
After the current live broadcast mode has been determined, the data parsed by the data input unit 5231 are classified and recombined according to the output data required by the current mode, so as to obtain the pre-response data packet for the target mode, and the current mode function code together with the mode pre-response data packet corresponding to the current mode is sent to the data classification transmission unit 5234. Specifically: (first embodiment) when the determined mode is the amusing mode, the amusing mode pre-response data packet includes a data packet code, the amusing mode function code, amusing-mode emotion command code data, audience gift-sending information data, and the like; (second embodiment) when the determined mode is the dialogue mode, the dialogue mode pre-response packet includes a packet code, the dialogue mode function code, the anchor voice text information, keyword code data, and the like; (third embodiment) when the determined mode is the performance basic mode, the performance basic mode pre-response packet includes a packet code, the performance basic mode function code, keyword code data, and the like; (fourth embodiment) when the determined mode is the interaction-with-audience mode, the interaction-with-audience mode pre-response packet includes a packet code, the interaction-with-audience mode function code, the interaction-with-audience command function code, keyword code data, audience message text data, and the like; (fifth embodiment) when the determined mode is the interaction-with-other-virtual-robots mode, the interaction-with-other-virtual-robots mode pre-response packet includes a packet code, the interaction-with-other-virtual-robots mode function code, the interaction-with-other-virtual-robots command function code, keyword code data, the voice text function code, voice text data, and the like.
Finally, the data classification transmission unit 5234 distributes the mode pre-response data packet of the corresponding mode to the subsequent modules according to the current mode function code.
Fig. 11 is a block diagram of the semantic analysis module 525 of the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. As shown in Fig. 11, the module includes the following units: a data input unit 5251, a response data search unit 5252, and a response text data output unit 5253. The data input unit 5251 receives and analyzes the mode pre-response data packet containing the voice text sent by the mode discrimination module 523 and obtains the voice text information data; the response data search unit 5252 searches for the response text corresponding to the input text based on a response database preset in the unit and outputs the response text information; the response text data output unit 5253 performs function encoding on the response text data (for example, its function code is 217) and constructs a new response text response packet.
Fig. 12 is a flowchart of implementing the semantic analysis function in the multi-modal interaction system of the live webcast platform according to the embodiment of the present application. Referring to Fig. 12, after the data input unit 5251 acquires the mode pre-response data packet containing the voice text information, it parses the packet and extracts the input voice text information data. The response data search unit 5252 then searches, through a search engine using the response dialogue database resource, for the response text data corresponding to the input voice text information data. Next, when the response text data output unit 5253 obtains the response text data, it re-encodes data including the packet code, the mode function code data, the command control signal code, the function code of the response text, the response text data, the keyword code data and the like to obtain a new response text packet. In building the response dialogue database resource, the input and output pairs in a large amount of everyday dialogue history data and network-language dialogue history data are used as training data to generate a response dialogue text model, and a large number of input texts from actual applications are then used as test data to complete the construction of the response dialogue database resource.
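A minimal sketch of the response search is shown below; a small in-memory dictionary and a word-overlap rule stand in for the response dialogue database resource and the search engine, whose internals are not detailed in this application.

```python
# Toy response database and retrieval rule; entries are placeholders.
RESPONSE_DB = {
    "what is your name": "I am the live-room assistant robot.",
    "can you sing a song": "Of course, which song would you like?",
    "tell me a joke": "Here is one for you...",
}

def search_response(speech_text: str) -> str:
    """Pick the database entry whose words overlap most with the input text."""
    words = set(speech_text.lower().split())
    best_key = max(RESPONSE_DB, key=lambda k: len(words & set(k.split())))
    return RESPONSE_DB[best_key]
```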
Referring again to fig. 5, the reading module 526, the performance basic mode module 527, the amusing mode module 528, and the data sending module 529 of the data processing and mode discrimination module 52 will now be described in detail.
The reading module 526 receives and analyzes the response dialogue response data packet sent by the semantic analysis module 525, extracts the response text information data, and converts it into an audio format by using a preset character reading database to obtain response voice information data. The response voice information data is then function-encoded (e.g., its function code is 218). Finally, data including the data packet code, the mode function code data, the command control signal code, the function code of the response voice data, the response voice information data, the keyword code data and the like are re-encoded to obtain a new response voice response data packet.
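Assuming the character reading database is simply a mapping from characters to pre-recorded waveform clips, the conversion performed by the reading module 526 can be sketched as follows; the clip contents below are placeholders and real clips would be loaded from storage.

```python
# Lookup-based text-to-audio sketch under the assumption stated above.
import numpy as np

CHARACTER_AUDIO = {  # hypothetical per-character clips (1D sample arrays)
    "h": np.zeros(1600), "i": np.zeros(1600), " ": np.zeros(800),
}

def text_to_speech(text: str) -> np.ndarray:
    """Concatenate the stored clip for each character, skipping unknown ones."""
    clips = [CHARACTER_AUDIO[c] for c in text.lower() if c in CHARACTER_AUDIO]
    return np.concatenate(clips) if clips else np.zeros(0)
```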
Then, the performance basic mode module 527 receives and analyzes the performance basic mode pre-response packet sent by the mode discrimination module 523, acquires the keyword code data ("212021-212900"), uses the keyword code to search a preset performance basic function database for the video stream data corresponding to that keyword code, and acquires the performance basic mode video stream data. It performs function encoding on these data (for example, the function code is 215), and finally re-encodes data including the data packet code, the performance basic mode function code, the keyword code data, the function code of the performance basic mode video stream data and the like, thereby obtaining a new performance video response packet. The performance basic database is preset in the performance basic mode module 527; each keyword code corresponds to several groups of related performance video stream information, and the module randomly selects one group of related video stream data to output. Specifically, in one embodiment, if the parsed keyword code data is "212025", the keyword information corresponding to this code is "sing" & "Mayday"; the code corresponds to the video stream information of several groups of songs, so one group may be output at random.
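The random selection performed by the performance basic mode module 527 can be sketched as follows; the keyword codes come from the examples above, while the stream identifiers are placeholders.

```python
# Keyword code -> candidate video streams; one stream is chosen at random.
import random

PERFORMANCE_DB = {
    "212025": ["mayday_song_01.flv", "mayday_song_02.flv", "mayday_song_03.flv"],
    "212021": ["generic_song_01.flv", "generic_song_02.flv"],
}

def pick_performance_stream(keyword_code: str) -> str:
    """Return one of the preset performance video streams for the keyword code."""
    return random.choice(PERFORMANCE_DB[keyword_code])
```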
Then, the amusing mode module 528 receives and analyzes the amusing mode pre-response data packet sent by the mode discrimination module 523, obtains the amusing-mode emotion command code data ("12301-12320") and the audience gift information data, and uses the amusing-mode emotion command code data to search a preset amusing performance database for the amusing multi-modal information corresponding to the emotion command (combined matching data of multiple pieces of voice information, action information and expression information under the same emotion). After obtaining this multi-modal data, the module performs function encoding on it (e.g., the function code is 216) and finally re-encodes data including the data packet code, the amusing mode function code, the amusing-mode emotion command code data, the amusing-mode multi-modal data function code, the amusing-mode multi-modal data and the like, thereby obtaining a new amusing-mode multi-modal response data packet. The amusing performance database is preset in the amusing mode module 528; each emotion command code corresponds to several sets of amusing multi-modal information, and the module randomly selects one set to output. Specifically, in one embodiment, if the amusing-mode emotion command code data is parsed as "12308", the emotion corresponding to this command code is "thank you"; the data set corresponding to the amusing-mode thank-you emotion command includes multiple pieces of voice information (e.g., "thanks for following", "thanks for the gift"), multiple pieces of action information (e.g., a gesture expressing thanks, a nod), and multiple pieces of expression information (e.g., a smile). The module may randomly select an audience gift message, extract the user name from it so as to thank that specific user, and then randomly select one piece of each of the three kinds of information to generate the corresponding amusing multi-modal information.
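A minimal sketch of the amusing-mode assembly is given below: an emotion command code selects one voice, one action and one expression at random, and the viewer's name is taken from the gift message to personalise the thank-you line; the database contents and the gift-message format are assumptions for illustration only.

```python
# Emotion command code -> candidate voice/action/expression pieces.
import random

AMUSING_DB = {
    "12308": {  # "thank you" emotion command
        "voice":      ["Thanks for following!", "Thanks for the gift, {user}!"],
        "action":     ["bow", "nod", "thank_you_gesture"],
        "expression": ["smile", "wink"],
    },
}

def build_amusing_response(emotion_code: str, gift_message: dict) -> dict:
    """Randomly combine one voice, one action and one expression for the emotion."""
    entry = AMUSING_DB[emotion_code]
    user = gift_message.get("user", "friend")  # assumed gift-message field
    return {
        "voice": random.choice(entry["voice"]).format(user=user),
        "action": random.choice(entry["action"]),
        "expression": random.choice(entry["expression"]),
    }
```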
Finally, referring again to fig. 5, the data sending module 529 of the data processing and mode discrimination module 52 will be described in detail. This module receives and parses the response text response data packet output by the semantic analysis module 525, the response voice response data packet output by the reading module 526, the performance video stream response data packet sent by the performance basic mode module 527, and the amusing-mode multi-modal response data packet sent by the amusing mode module 528, obtains the response data packets sharing the same data packet code, and integrates and encodes these packets into an output response data packet that includes the data packet code, the target mode function code, the response voice data, the response text data function code, the response text data, the function code of the performance basic mode video stream data, the amusing-mode multi-modal data function code, the amusing-mode multi-modal data and the like. The new output response packet is transmitted to the multi-modal interaction information output module 53 via the Internet.
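The integration step of the data sending module 529 can be sketched as merging all response packets that carry the same data packet code into a single output packet, as below; the packets are simplified dictionaries rather than the encoded packets described above.

```python
# Merge response packets with the same packet code into one output packet.
def integrate_packets(packets: list) -> dict:
    """Combine the fields of all packets sharing one data packet code."""
    output = {"packet_code": packets[0]["packet_code"]}
    for p in packets:
        assert p["packet_code"] == output["packet_code"], "mismatched packet codes"
        for key, value in p.items():
            if key != "packet_code":
                output[key] = value
    return output
```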
After the cloud server 400 completes the data processing and mode discrimination, the multi-modal interaction information output module 53 further parses and distributes the processing result. As shown in fig. 5, the multi-modal interaction information output module 53 includes the following modules: an information transceiving module 531, an interface output module 532, a video stream output module 533, a voice output module 534, and a text output module 535. The information transceiving module 531 receives and parses the output response data packet sent by the data processing and mode discrimination module 52; according to the parsed target mode function code, response voice data function code, response text data function code, performance basic mode video stream data function code and amusing-mode multi-modal data function code, it sends the target mode function code to the interface output module 532; sends the target mode function code, the performance basic mode video stream data and the amusing-mode multi-modal data to the video stream output module 533; sends the target mode function code and the response voice data to the voice output module 534; and sends the target mode function code and the response text data to the text output module 535. The interface output module 532 converts the current live broadcast interface into the target live broadcast room display interface of the corresponding mode according to the target mode function code. The video stream output module 533 outputs, based on the target mode function code (amusing mode or performance basic mode), the performance basic mode video stream data or the amusing-mode multi-modal data of the corresponding mode. The voice output module 534 outputs the response voice data based on the target mode function code (dialogue mode, interaction-with-audience mode, or interaction-with-other-virtual-robots mode). The text output module 535 outputs the response text data based on the target mode function code (dialogue mode, interaction-with-audience mode, or interaction-with-other-virtual-robots mode).
Note that the video stream output module 533 stores a cache database. When the information transceiving module 531 parses that the target mode is the dialogue mode and both the response voice data and the response text data are empty, it sends a buffering command signal to the video stream output module 533; when the video stream output module 533 receives the buffering command signal, it randomly retrieves and outputs one of several preset segments of video stream information from the cache database. The cache database is a video stream database preset for the buffering state, and each group of video stream data contains multi-modal information including voice information, action information and expression information.
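The buffering behaviour can be sketched as follows: when both the response voice data and the response text data are empty in dialogue mode, one preset filler segment is drawn at random from the cache database; the entries below are placeholders.

```python
# Toy cache database of filler segments with voice/action/expression fields.
import random

BUFFER_DB = [
    {"voice": "Let me think about that...", "action": "tilt_head", "expression": "thinking"},
    {"voice": "One moment please.",         "action": "idle_sway", "expression": "smile"},
]

def fill_buffer_if_needed(response_voice: str, response_text: str):
    """Return a random filler segment if both responses are empty, else None."""
    if not response_voice and not response_text:
        return random.choice(BUFFER_DB)
    return None
```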
It should be noted that the function codes for the input and output data in the embodiments of the present application are only specific examples; implementers of the present application may design the distinguishing identifiers of the data functions according to the actual application, and the present invention is not specifically limited in this respect.
The method of the present invention may be implemented in a computer system. The computer system may be provided, for example, in a control core processor of the robot. For example, the methods described herein may be implemented as software with control logic that is executed by a CPU in a robot operating system. The functionality described herein may be implemented as a set of program instructions stored in a non-transitory tangible computer-readable medium. When implemented in this manner, the computer program comprises a set of instructions which, when executed by a computer, cause the computer to perform a method capable of carrying out the functions described above. Programmable logic may be temporarily or permanently installed in a non-transitory tangible computer-readable medium, such as a read-only memory chip, computer memory, a disk, or other storage medium.
It is to be understood that the disclosed embodiments of the invention are not limited to the particular structures, process steps, or materials disclosed herein but are extended to equivalents thereof as would be understood by those ordinarily skilled in the relevant arts. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A multi-mode interaction method of a virtual robot applied to a video live broadcast platform, characterized in that an application of the video live broadcast platform accesses the virtual robot, the virtual robot has multi-mode interaction capability, and the multi-mode interaction method comprises the following steps:
a multi-mode information input step, namely displaying a virtual robot with a specific image in a preset area, entering a default live broadcast auxiliary mode, and receiving multi-mode data and multi-mode instructions input by a live broadcast room in real time;
data processing and mode distinguishing, namely analyzing the multi-mode data and/or the multi-mode instructions, and distinguishing and determining a target live broadcast auxiliary mode by utilizing the multi-mode interaction capability of the virtual robot, wherein:
when the multi-mode data is received in the live broadcast process, extracting awakening data aiming at the virtual robot, wherein the awakening data refers to data generated when the anchor calls the name of the virtual robot and/or turns the anchor side face towards the virtual robot in the live broadcast process, and the side face detection angle range is 45-90 degrees;
entering a dialogue interaction mode matched with the awakening data and searched based on a response dialogue database, and executing multi-modal interaction and display actions in the current multi-modal interaction mode;
acquiring the multi-mode instruction that the anchor sets for mode conversion;
analyzing and responding to the mode conversion setting, and switching from the current dialogue interaction mode to a target live broadcast auxiliary mode, wherein the target live broadcast auxiliary mode comprises a performance basic mode, an audience interaction mode and an interaction mode with other virtual robots; in the dialogue interaction mode, if a performance form command is acquired, the virtual robot enters the performance basic mode; if an audience interaction control command is acquired, the virtual robot enters the audience interaction mode; if an interaction control command with other intelligent robots is acquired, the virtual robot enters the interaction mode with other virtual robots; and if no response information is output, the virtual robot enters a buffer state in the dialogue mode to fill the gap in the live broadcast caused by an overlong response time;
and a multi-mode interactive information output step, namely starting a target live broadcast auxiliary mode, and performing multi-mode interaction and display by the virtual robot according to the target live broadcast auxiliary mode.
2. The method of claim 1,
the multimodal data and/or multimodal instructions comprise: one or more of text information, voice information, visual information, control command information, and combination information thereof.
3. A storage medium having stored thereon program code executable to perform the method steps of claim 1 or 2.
4. A multi-mode interaction system of a virtual robot applied to a video live broadcast platform, characterized in that an application of the video live broadcast platform accesses the virtual robot, the virtual robot has multi-mode interaction capability, and the multi-mode interaction system comprises the following modules:
the multi-mode information input module displays a virtual robot with a specific image in a preset area, enters a default live broadcast auxiliary mode, and receives multi-mode data and multi-mode instructions input by a live broadcast room in real time;
the data processing and mode distinguishing module analyzes the multi-mode data and the multi-mode instructions, and distinguishes and determines a target live broadcast auxiliary mode by utilizing the multi-mode interaction capability of the virtual robot, wherein,
when the multi-mode data is received in the live broadcast process, extracting awakening data aiming at the virtual robot, wherein the awakening data refers to data generated when the anchor calls the name of the virtual robot and/or turns the anchor side face towards the virtual robot in the live broadcast process, and the side face detection angle range is 45-90 degrees;
entering a dialogue interaction mode matched with the awakening data and searched based on a response dialogue database, and executing multi-modal interaction and display actions in the current multi-modal interaction mode;
acquiring the multi-mode instruction that the anchor sets for mode conversion;
analyzing and responding to the mode conversion setting, and switching from the current multi-mode interaction mode to a target live broadcast auxiliary mode, wherein the target live broadcast auxiliary mode comprises a performance basic mode, an audience interaction mode and an interaction mode with other virtual robots; in the dialogue interaction mode, if a performance form command is acquired, the virtual robot enters the performance basic mode; if an audience interaction control command is acquired, the virtual robot enters the audience interaction mode; if an interaction control command with other intelligent robots is acquired, the virtual robot enters the interaction mode with other virtual robots; and if no response information is output, the virtual robot enters a buffer state in the dialogue mode to fill the gap in the live broadcast caused by an overlong response time;
and the multi-mode interactive information output module is used for starting a target live broadcast auxiliary mode, and the virtual robot performs multi-mode interaction and display according to the target live broadcast auxiliary mode.
5. The system of claim 4,
the multimodal data and/or the multimodal instructions comprise: one or more of text information, voice information, visual information, control command information, and combination information thereof.
CN201710551230.0A 2017-07-07 2017-07-07 Virtual robot multi-mode interaction method and system applied to video live broadcast platform Active CN107423809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710551230.0A CN107423809B (en) 2017-07-07 2017-07-07 Virtual robot multi-mode interaction method and system applied to video live broadcast platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710551230.0A CN107423809B (en) 2017-07-07 2017-07-07 Virtual robot multi-mode interaction method and system applied to video live broadcast platform

Publications (2)

Publication Number Publication Date
CN107423809A CN107423809A (en) 2017-12-01
CN107423809B true CN107423809B (en) 2021-02-26

Family

ID=60427526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710551230.0A Active CN107423809B (en) 2017-07-07 2017-07-07 Virtual robot multi-mode interaction method and system applied to video live broadcast platform

Country Status (1)

Country Link
CN (1) CN107423809B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052250A (en) * 2017-12-12 2018-05-18 北京光年无限科技有限公司 Virtual idol deductive data processing method and system based on multi-modal interaction
CN108337568A (en) * 2018-02-08 2018-07-27 北京潘达互娱科技有限公司 A kind of information replies method, apparatus and equipment
CN108833810A (en) * 2018-06-21 2018-11-16 珠海金山网络游戏科技有限公司 The method and device of subtitle is generated in a kind of live streaming of three-dimensional idol in real time
WO2020044749A1 (en) * 2018-08-28 2020-03-05 グリー株式会社 Moving-image delivery system for delivering moving-image live that includes animation of character object generated on the basis of motion of delivering user, moving-image delivery method, and moving-image delivery program
CN110871813A (en) * 2018-08-31 2020-03-10 比亚迪股份有限公司 Control method and device of virtual robot, vehicle, equipment and storage medium
CN110543290B (en) * 2018-09-04 2024-03-05 谷歌有限责任公司 Multimodal response
CN111290682A (en) * 2018-12-06 2020-06-16 阿里巴巴集团控股有限公司 Interaction method and device and computer equipment
CN110148406B (en) * 2019-04-12 2022-03-04 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN110730390A (en) * 2019-09-23 2020-01-24 杭州铠麒网络科技有限公司 Interaction method and device for intelligently activating atmosphere of live broadcast room
CN110850983B (en) * 2019-11-13 2020-11-24 腾讯科技(深圳)有限公司 Virtual object control method and device in video live broadcast and storage medium
CN111312240A (en) * 2020-02-10 2020-06-19 北京达佳互联信息技术有限公司 Data control method and device, electronic equipment and storage medium
CN111343473B (en) * 2020-02-25 2022-07-01 北京达佳互联信息技术有限公司 Data processing method and device for live application, electronic equipment and storage medium
CN111443852A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Digital human action control method and device, electronic equipment and storage medium
CN111918073B (en) * 2020-06-30 2022-11-04 北京百度网讯科技有限公司 Live broadcast room management method and device
CN111970530B (en) * 2020-08-24 2022-12-06 北京字节跳动网络技术有限公司 Virtual gift display method, server and target receiving end
CN112135160A (en) * 2020-09-24 2020-12-25 广州博冠信息科技有限公司 Virtual object control method and device in live broadcast, storage medium and electronic equipment
CN112446938B (en) * 2020-11-30 2023-08-18 重庆空间视创科技有限公司 Multi-mode-based virtual anchor system and method
CN112560622B (en) * 2020-12-08 2023-07-21 中国联合网络通信集团有限公司 Virtual object action control method and device and electronic equipment
CN112616063B (en) * 2020-12-11 2022-10-28 北京字跳网络技术有限公司 Live broadcast interaction method, device, equipment and medium
CN112601100A (en) * 2020-12-11 2021-04-02 北京字跳网络技术有限公司 Live broadcast interaction method, device, equipment and medium
CN114727123B (en) * 2021-02-03 2023-03-17 北京城市网邻信息技术有限公司 Data processing method and device for live broadcast room
CN112995777A (en) * 2021-02-03 2021-06-18 北京城市网邻信息技术有限公司 Interaction method and device for live broadcast room
CN112948228B (en) * 2021-03-15 2023-07-21 河海大学 Multi-mode database evaluation benchmark system for stream data and construction method thereof
CN113778580B (en) * 2021-07-28 2023-12-08 赤子城网络技术(北京)有限公司 Modal user interface display method, electronic device and storage medium
CN113691829B (en) * 2021-10-26 2022-04-08 阿里巴巴达摩院(杭州)科技有限公司 Virtual object interaction method, device, storage medium and computer program product
CN114998491B (en) * 2022-08-01 2022-11-18 阿里巴巴(中国)有限公司 Digital human driving method, device, equipment and storage medium
CN115604501B (en) * 2022-11-28 2023-04-07 广州钛动科技股份有限公司 Internet advertisement live broadcasting system and method
CN117319758B (en) * 2023-10-13 2024-03-12 南京霍巴信息科技有限公司 Live broadcast method and live broadcast system based on cloud platform

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103368816A (en) * 2012-03-29 2013-10-23 深圳市腾讯计算机系统有限公司 Instant communication method based on virtual character and system
US9135356B2 (en) * 2009-12-03 2015-09-15 Microsoft Technology Licensing, Llc Pseudonaming anonymous participants
CN105828145A (en) * 2016-03-18 2016-08-03 广州酷狗计算机科技有限公司 Interaction method and interaction device
CN105959718A (en) * 2016-06-24 2016-09-21 乐视控股(北京)有限公司 Real-time interaction method and device in video live broadcasting
CN106507207A (en) * 2016-10-31 2017-03-15 北京小米移动软件有限公司 Interactive method and device in live application
CN106878820A (en) * 2016-12-09 2017-06-20 北京小米移动软件有限公司 Living broadcast interactive method and device
CN106863319A (en) * 2017-01-17 2017-06-20 北京光年无限科技有限公司 A kind of robot awakening method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110081972A1 (en) * 2009-10-01 2011-04-07 Mind Optics Llc Methods for providing an online event in a gaming environment and devices thereof
US20160110922A1 (en) * 2014-10-16 2016-04-21 Tal Michael HARING Method and system for enhancing communication by using augmented reality
CN104363471A (en) * 2014-11-21 2015-02-18 广州华多网络科技有限公司 Interaction method based on live video and relevant device and system
US10455291B2 (en) * 2015-03-20 2019-10-22 Twitter, Inc. Live video stream sharing

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9135356B2 (en) * 2009-12-03 2015-09-15 Microsoft Technology Licensing, Llc Pseudonaming anonymous participants
CN103368816A (en) * 2012-03-29 2013-10-23 深圳市腾讯计算机系统有限公司 Instant communication method based on virtual character and system
CN105828145A (en) * 2016-03-18 2016-08-03 广州酷狗计算机科技有限公司 Interaction method and interaction device
CN105959718A (en) * 2016-06-24 2016-09-21 乐视控股(北京)有限公司 Real-time interaction method and device in video live broadcasting
CN106507207A (en) * 2016-10-31 2017-03-15 北京小米移动软件有限公司 Interactive method and device in live application
CN106878820A (en) * 2016-12-09 2017-06-20 北京小米移动软件有限公司 Living broadcast interactive method and device
CN106863319A (en) * 2017-01-17 2017-06-20 北京光年无限科技有限公司 A kind of robot awakening method and device

Also Published As

Publication number Publication date
CN107423809A (en) 2017-12-01

Similar Documents

Publication Publication Date Title
CN107423809B (en) Virtual robot multi-mode interaction method and system applied to video live broadcast platform
CN108000526B (en) Dialogue interaction method and system for intelligent robot
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
JP6058053B2 (en) Recording control system, system and program
CN107704169B (en) Virtual human state management method and system
US6943794B2 (en) Communication system and communication method using animation and server as well as terminal device used therefor
WO2022089224A1 (en) Video communication method and apparatus, electronic device, computer readable storage medium, and computer program product
CN112100352A (en) Method, device, client and storage medium for interacting with virtual object
CN111263227A (en) Multimedia playing method, device and storage medium
CN110299152A (en) Interactive output control method, device, electronic equipment and storage medium
CN107808191A (en) The output intent and system of the multi-modal interaction of visual human
WO2023226914A1 (en) Virtual character driving method and system based on multimodal data, and device
CN111816190A (en) Voice interaction method and device for upper computer and lower computer
CN111936964A (en) Non-interruptive NUI command
CN114821744A (en) Expression recognition-based virtual character driving method, device and equipment
US20150181161A1 (en) Information Processing Method And Information Processing Apparatus
KR20170135598A (en) System and Method for Voice Conversation using Synthesized Virtual Voice of a Designated Person
CN114567693B (en) Video generation method and device and electronic equipment
CN114610158A (en) Data processing method and device, electronic equipment and storage medium
KR20230102753A (en) Method, computer device, and computer program to translate audio of video into sign language through avatar
CN114760425A (en) Digital human generation method, device, computer equipment and storage medium
CN114422849A (en) Video generation method and device, electronic equipment and storage medium
CN113763925A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN112527987A (en) Interaction method and device for self-service all-in-one machine
US11641448B2 (en) Information processing apparatus and information processing method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230927

Address after: 100000 6198, Floor 6, Building 4, Yard 49, Badachu Road, Shijingshan District, Beijing

Patentee after: Beijing Virtual Dynamic Technology Co.,Ltd.

Address before: 100000 Fourth Floor Ivy League Youth Venture Studio No. 193, Yuquan Building, No. 3 Shijingshan Road, Shijingshan District, Beijing

Patentee before: Beijing Guangnian Infinite Technology Co.,Ltd.

TR01 Transfer of patent right