CN113742473A - Digital virtual human interaction system and calculation transmission optimization method thereof - Google Patents

Digital virtual human interaction system and calculation transmission optimization method thereof

Info

Publication number
CN113742473A
CN113742473A (Application CN202111091529.5A)
Authority
CN
China
Prior art keywords
audio
large visual
visual screen
virtual human
rtsp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111091529.5A
Other languages
Chinese (zh)
Inventor
曹文浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202111091529.5A
Publication of CN113742473A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/3343: Query execution using phonetics
    • G06F 16/34: Browsing; Visualisation therefor
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Abstract

The invention provides a digital virtual human interaction system and a calculation transmission optimization method thereof, belonging to the technical field of audio and video transmission. The system comprises: a large visualization screen for displaying the virtual human image together with chart and text data; an RTSP server cluster for providing audio and video media streams to the large visualization screen; a proxy server connected between the large visualization screen and the RTSP server cluster, used both to forward the real-time audio and video media streams generated by the RTSP servers and to supply audio and video media streams stored locally on the proxy server to the large visualization screen; and a user client for connecting to the large visualization screen and interacting with the virtual human shown on it. By adding a proxy between the RTSP servers and the large visualization screen, the invention can control media stream transmission in real time, makes the large visualization screen easier to manage, and optimizes the way audio and video streams are transmitted.

Description

Digital virtual human interaction system and calculation transmission optimization method thereof
Technical Field
The invention relates to the technical field of audio and video transmission, in particular to a digital virtual human interaction system and a calculation transmission optimization method thereof.
Background
With the massive accumulation of enterprise data, demand for large-screen data visualization keeps growing: monitoring centers and command-and-dispatch centers need to make decisions quickly based on real-time data; company exhibition halls and exhibition centers need data display platforms; e-commerce enterprises publish real-time sales data as an advertising measure during promotional events; and venues such as conference halls, studios, shopping centers, railway stations and airports need large screens to display information, run advertisements and so on.
The Real-Time Streaming Protocol (RTSP) is an application-layer protocol in the TCP/IP suite. It works together with the Real-time Transport Protocol (RTP) and the RTP Control Protocol (RTCP), and defines control methods including OPTIONS, DESCRIBE, SETUP, TEARDOWN, PLAY, PAUSE, SCALE and GET_PARAMETER.
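For illustration only, the following minimal Python sketch shows a client issuing the OPTIONS and DESCRIBE methods over a raw TCP socket; the server address and stream path are placeholders and are not part of the patent.

```python
import socket

# Hypothetical RTSP endpoint; replace with a real server address and stream path.
HOST, PORT, STREAM = "192.0.2.10", 554, "rtsp://192.0.2.10:554/avatar"

def rtsp_request(sock, method, url, cseq, extra_headers=""):
    """Send one RTSP request and return the raw response text."""
    msg = f"{method} {url} RTSP/1.0\r\nCSeq: {cseq}\r\n{extra_headers}\r\n"
    sock.sendall(msg.encode("ascii"))
    return sock.recv(4096).decode("ascii", errors="replace")

with socket.create_connection((HOST, PORT), timeout=5) as s:
    # OPTIONS: ask the server which methods it supports.
    print(rtsp_request(s, "OPTIONS", STREAM, 1))
    # DESCRIBE: request the SDP description of the media stream.
    print(rtsp_request(s, "DESCRIBE", STREAM, 2, "Accept: application/sdp\r\n"))
```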
Technology for transmitting data between a Real-Time Streaming Protocol (RTSP) server and a large screen is maturing, but current large data-visualization screens have poor interactivity and waste RTSP server resources. Because the large visualization screen pulls the RTSP stream directly, the stream cannot be processed in transit; in a cluster environment the screens cannot be load-balanced effectively; and even when no user is interacting, a screen must keep an RTSP connection occupied at all times, so RTSP server resources cannot be fully utilized.
Disclosure of Invention
To solve this technical problem, the invention provides a digital virtual human interaction system and a calculation transmission optimization method thereof. A real-time streaming protocol proxy server is placed between the RTSP servers and the large visualization screen; it provides load balancing and media-stream proxy forwarding, and supports playback of local audio and video data when no user is using the large visualization screen.
To achieve this purpose, the invention adopts the following technical solution:
one of the purposes of the invention is to provide a digital virtual human interaction system, which comprises:
the large visual screen is used for displaying the virtual human image, the chart and the character data; the chart and the character data are matched and updated according to the expression content of the virtual human image;
the RTSP server cluster is composed of a plurality of RTSP servers and is used for providing audio and video media streams for a visual large screen, and the audio and video media streams comprise limbs, facial feature action videos and corresponding audios for driving a virtual human image;
the proxy server is connected between the large visual screen and the RTSP server cluster, and is used for forwarding the real-time audio and video media stream generated by the RTSP server and providing the local audio and video media stream stored by the proxy server to the large visual screen; the proxy server monitors the interactive state of the large visual screen in real time, and when the large visual screen is in the interactive state, a load balancer is used for selecting and connecting one RTSP server from the RTSP server cluster; when the large visual screen is in a non-interactive state, the connection with the RTSP server is disconnected, local audio and video media stored in the proxy server are played by using the large visual screen, and corresponding chart and character data are displayed;
and the user client is used for connecting the large visual screen and realizing interaction with the virtual human in the large visual screen.
Further, the RTSP server cluster further comprises:
a voice collection module for acquiring the voice audio of a user's question;
a text conversion module for converting the user's voice audio into text sentences;
an intention recognition module for obtaining the intention corresponding to a text sentence according to an intention recognition model;
a dialogue knowledge base that stores answers to questions with different intentions, used to receive the recognized intention and output the best answer;
a TTS module for converting the answer output by the dialogue knowledge base into audio;
and a virtual human action synthesis module for fitting the body movements and facial movements of the virtual human according to the audio data corresponding to the answer, generating virtual human video matched to the audio content.
Furthermore, the RTSP server cluster is connected to an external resource database; when the dialogue knowledge base outputs an answer, the chart and text data corresponding to that answer are obtained from the external resource database and output to the large visualization screen together with the audio and video for display.
Another object of the present invention is to provide a calculation transmission optimization method for the above digital virtual human interaction system, comprising the following steps:
Step one: the user client connects to the large visualization screen, requests interaction with it, and collects the voice audio of the user's question with the client's microphone;
Step two: the proxy server detects that the large visualization screen is in an interactive state and uses the load balancer to select and connect one RTSP server from the cluster; the voice audio of the user's question is passed to the voice collection module, and the text conversion module produces the text sentence of the question; the text sentence is fed to the intention recognition module, the best answer is retrieved from the dialogue knowledge base according to the recognition result, and the chart and/or text data corresponding to the best answer are obtained;
the TTS module then converts the text to speech, and the virtual human action synthesis module fits the body and facial movements of the virtual human to generate virtual human video matched to the audio content;
Step three: the proxy server transmits the audio and video stream together with the chart and/or text data back to the large visualization screen, drives the virtual human image to speak, and displays the corresponding content at the same time.
Compared with the prior art, the invention has the following advantages. A real-time streaming protocol proxy server is placed between the RTSP server cluster and the large visualization screen; it provides load balancing and media-stream proxy forwarding, supports playback of local audio and video data when no user is using the screen, can control media-stream transmission in real time, reduces unnecessary connections between the RTSP servers and the screen, improves resource utilization and strengthens management of the large visualization screens. In addition, the RTSP server cluster design provides high scalability and fault tolerance: because the system is deployed as a cluster, servers can be added or removed according to actual demand, and the failure of a single node does not take down the RTSP service as a whole, avoiding the situation in single-node deployment where the entire service becomes unusable once that node goes down.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of the digital virtual human interaction system of the invention, according to an exemplary embodiment;
FIG. 2 is a schematic diagram of the audio and video stream transmission scheme of the invention, according to an exemplary embodiment;
FIG. 3 is a flow diagram of the calculation transmission optimization method of the invention, according to an exemplary embodiment;
FIG. 4 is a data processing flow diagram of the overall interaction process of the invention, according to an exemplary embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The invention provides a digital virtual human interaction system, which comprises:
a large visualization screen for displaying the virtual human image together with chart and text data, where the chart and text data are updated to match the content expressed by the virtual human image;
an RTSP server cluster composed of a plurality of RTSP servers, used to provide audio and video media streams to the large visualization screen, the streams comprising body and facial motion video for driving the virtual human image together with the corresponding audio.
In this embodiment the RTSP server cluster consists of a plurality of RTSP servers, each implementing the same functions, and a plurality of large visualization screens share one RTSP server cluster. Each server is provided with:
a voice collection module for acquiring the voice audio of a user's question;
a text conversion module for converting the user's voice audio into text sentences;
an intention recognition module for obtaining the intention corresponding to a text sentence according to an intention recognition model; this function can be implemented with published techniques such as the deep-learning Bi-LSTM-CRF algorithm or attention-based RNNs;
a dialogue knowledge base that stores answers to questions with different intentions, used to receive the recognized intention and output the best answer;
a TTS module for converting the answer output by the dialogue knowledge base into audio; the TTS deep-learning model Tacotron, for example, is an end-to-end model whose core is a seq2seq architecture with attention, generating the corresponding audio from a sequence of input text word vectors (a minimal stand-in sketch of this step is given after this component list);
and a virtual human action synthesis module for fitting the body movements and facial movements of the virtual human according to the audio data corresponding to the answer, generating virtual human video matched to the audio content.
The system further comprises a proxy server connected between the large visualization screen and the RTSP server cluster, used both to forward the real-time audio and video media streams generated by the RTSP servers and to supply audio and video media streams stored locally on the proxy server to the large visualization screen; the proxy server monitors the interaction state of the large visualization screen in real time: when the screen is in an interactive state, a load balancer selects one RTSP server from the cluster and connects to it; when the screen is in a non-interactive state, the connection to the RTSP server is released, the local audio and video media stored on the proxy server are played on the large visualization screen, and the corresponding chart and text data are displayed;
and a user client for connecting to the large visualization screen and interacting with the virtual human shown on it.
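As referenced above, the following is a minimal, illustrative sketch of the text-to-audio step only. A full Tacotron model is far beyond a short example, so the off-the-shelf pyttsx3 engine is used here purely as a stand-in for the TTS module described in this embodiment.

```python
# Stand-in for the TTS module: convert an answer string into a WAV file.
# pyttsx3 (pip install pyttsx3) substitutes for the Tacotron-style model
# described in the embodiment; the output path is an arbitrary example.
import pyttsx3

def answer_to_audio(answer_text: str, out_path: str = "answer.wav") -> str:
    engine = pyttsx3.init()
    engine.setProperty("rate", 170)          # speaking rate in words per minute
    engine.save_to_file(answer_text, out_path)
    engine.runAndWait()                      # blocks until synthesis finishes
    return out_path

print(answer_to_audio("Reception A is 200 m ahead on the left."))
```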
Around social digitization, integration and modernization in the field of big data, combining large-screen data visualization with artificial intelligence technology, so that AI assists people in retrieving and analyzing data, strengthens the human-machine collaboration capability of the large screen and is a novel innovation and attempt. The functions of the large visualization screen can be extended beyond what is shown in this embodiment: in addition to presenting the virtual human image, the screen can provide a UI that visually displays related charts and text data. The RTSP server cluster is connected to an external resource database; when the dialogue knowledge base outputs an answer, the chart and text data corresponding to that answer are obtained from the external resource database and output to the large visualization screen together with the audio and video for display.
For example, consider a digital virtual human interaction system deployed in an AI demonstration industrial campus, with multiple large visualization screens installed at different locations in the campus. A user connects to one of the screens and asks: "How do I get to reception A?" After the proxy server connects to an RTSP server and the answer is obtained, the virtual human announces the route, the external resource database is called to display the campus map on the large-screen UI, the screen's own location and the destination (reception A) are marked on the map, and route guidance with a text description is given. The map, route and captions in this example are the chart and text data corresponding to the answer, obtained from the external resource database.
There are various ways to obtain the user's spoken question; it can be collected directly through a microphone on the user client. For example, the user's mobile phone can serve directly as the user client, and its microphone can record the question. When interaction with the large screen is needed, each user client can connect to only one large visualization screen; in this embodiment, the connection between the user client and the large visualization screen is established by scanning the two-dimensional code on the screen with a mini-program or an APP. The client can also mirror, synchronously and in real time, the chart and text data displayed on the large screen, making them convenient for the user to consult.
Digital virtual human image synthesis technology can synthesize synchronized lip and tooth movements and other semantically related human actions from the text content (the text corresponding to the answer). For example, by combining existing results such as speech synthesis models, ASR technology, WebSocket technology, TTS models and video generation models, and training on any close-range video set of a person speaking, frame sequences of the virtual human reading arbitrary text can be generated in real time, with closely matched mouth shapes and vivid expressions.
Taking FIG. 4 as an example: after the user scans the WeChat QR code, the mini-program connection turns the mobile phone into a microphone capture device. The audio is sent to the ASR module, which converts the speech collected by the audio device into text; the text is then processed by NLP to understand the intention it expresses, the reply corresponding to that intention is found in the knowledge base, and the answer text is finally passed to TTS and synchronized to the virtual human. The resulting audio and video stream is synthesized and returned to the large visualization screen, which renders and plays it.
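The following is a minimal, illustrative orchestration of that pipeline, not the patent's implementation: every component function here is a simplified stand-in (keyword matching instead of a trained intent model, a dictionary instead of the dialogue knowledge base, and a placeholder URL instead of real TTS and video synthesis).

```python
# Illustrative orchestration of the FIG. 4 pipeline. All component functions
# are simplified stand-ins, not the actual ASR/NLP/TTS/video models.

def run_asr(audio_bytes: bytes) -> str:
    # Stand-in: a real system would call an ASR engine here.
    return "how do I get to reception A"

def recognize_intent(text: str) -> str:
    # Stand-in intent recognition: keyword matching instead of Bi-LSTM-CRF.
    return "ask_route" if "how do i get" in text.lower() else "unknown"

KNOWLEDGE_BASE = {
    "ask_route": "Walk straight ahead 200 m, then turn left; reception A is on your right.",
    "unknown": "Sorry, I did not understand the question.",
}

def lookup_answer(intent: str) -> str:
    return KNOWLEDGE_BASE.get(intent, KNOWLEDGE_BASE["unknown"])

def synthesize_avatar_video(answer_text: str) -> str:
    # Stand-in: TTS plus virtual-human action synthesis would produce an A/V
    # stream; here we just return a placeholder RTSP URL for that stream.
    return "rtsp://rtsp-node-1.example/avatar/session-42"

def handle_question(audio_bytes: bytes) -> str:
    text = run_asr(audio_bytes)
    intent = recognize_intent(text)
    answer = lookup_answer(intent)
    return synthesize_avatar_video(answer)

print(handle_question(b"\x00" * 320))  # returns the stream URL for the big screen
```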
Existing speech technology is maturing steadily. In intelligent voice dialogue robots, a VAD (voice activity detection) algorithm is used, and the combination of VAD and ASR enables millisecond-level barge-in replies: while the robot is speaking, the user can interrupt with a question at any time, and the robot recognizes it and responds quickly in real time, improving the user experience. Speech is converted to text accurately, and exceptional cases such as polyphonic characters and dialect pronunciations are handled correctly. Conversion is effectively real-time: when the number of threads is no greater than the number of physical CPU cores, latency is about 200 ms, so human-machine conversation stays fluent.
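As a concrete example of the VAD step, the sketch below uses the open-source webrtcvad package as a stand-in for the VAD algorithm mentioned above; the frame size and sample rate are assumptions chosen to satisfy that library's input requirements.

```python
# Minimal barge-in detection sketch using webrtcvad (pip install webrtcvad).
import webrtcvad

SAMPLE_RATE = 16000          # webrtcvad accepts 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30                # frames must be 10, 20 or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2

vad = webrtcvad.Vad(2)       # aggressiveness 0 (least) to 3 (most)

def user_is_speaking(pcm_frame: bytes) -> bool:
    """True if this 30 ms frame contains speech; used to trigger barge-in."""
    return vad.is_speech(pcm_frame, SAMPLE_RATE)

# Example with a silent frame: no barge-in is triggered.
print(user_is_speaking(b"\x00" * FRAME_BYTES))   # expected: False
```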
When the intention recognition module and the dialogue knowledge base perform their functions, dialogue generation or retrieval can take context into account for the specific business domain: deep learning is used to model the contextual semantics, together with non-verbal behavior and the topic of the current context; the candidate answers are then ranked according to the context, and the answer best suited to the current context is selected as the reply.
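To make the ranking step concrete, here is a toy sketch that scores candidate answers by bag-of-words cosine similarity to the dialogue context; it is only an assumed stand-in for the deep-learning ranking described above, with made-up example strings.

```python
# Illustrative context-aware answer ranking by bag-of-words cosine similarity.
from collections import Counter
from math import sqrt

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_answers(context: str, candidates: list[str]) -> list[str]:
    ctx_vec = bow(context)
    return sorted(candidates, key=lambda c: cosine(ctx_vec, bow(c)), reverse=True)

context = "user asked about the route to reception A on the campus map"
candidates = [
    "Reception A is 200 m ahead on the left; the route is shown on the map.",
    "Today's sales figures are shown on the dashboard.",
]
print(rank_answers(context, candidates)[0])   # best answer for this context
```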
Because the large visualization screen currently pulls the RTSP stream directly, the stream cannot be processed in transit, the screens cannot be load-balanced effectively in a cluster environment, and a screen must keep an RTSP connection occupied even when no user is interacting, so RTSP server resources cannot be fully utilized. The proxy server introduced by the invention forwards the source RTSP audio and video media stream, controls pausing and seeking of playback through the proxy, and can monitor the state of the large visualization screen, for example its connection time, disconnection time and the number of connected clients, so the screens can be managed better. In addition, audio and video still need to be played when no user is interacting, but keeping the connection between the RTSP server and the client open at all times would waste resources; the proxy server therefore supports playing a default static video and switches to a genuine RTSP server connection only when a user is actually interacting, so servers and clients no longer need a fixed one-to-one binding.
Based on this, the present embodiment further provides a calculation transmission optimization method for the digital virtual human interaction system, whose main characteristic is: when no user client is connected to the large visualization screen, the proxy server sends a default audio and video stream to the screen, which renders and plays it; when a user starts to use the screen, the proxy server obtains the media stream corresponding to the answer from the RTSP server cluster and forwards it to the screen for rendering and playback.
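A minimal sketch of this idle/active switching follows, under the assumption of a simple in-memory per-screen record; the stream URLs are placeholders, and a real proxy would also relay the RTP packets themselves.

```python
# Simplified sketch of the proxy's per-screen state tracking and stream selection.
# URLs and the load-balancer call are illustrative placeholders.
import random
import time

RTSP_CLUSTER = ["rtsp://node-1.example/avatar", "rtsp://node-2.example/avatar"]
LOCAL_DEFAULT_STREAM = "rtsp://proxy.example/local/default-loop"

class ScreenSession:
    def __init__(self, screen_id: str):
        self.screen_id = screen_id
        self.connected_clients = 0
        self.connect_time = None
        self.disconnect_time = None

    def client_connected(self):
        self.connected_clients += 1
        self.connect_time = time.time()

    def client_disconnected(self):
        self.connected_clients = max(0, self.connected_clients - 1)
        self.disconnect_time = time.time()

    def stream_url(self) -> str:
        """Interactive state: proxy a cluster stream; idle: serve the local default."""
        if self.connected_clients > 0:
            return random.choice(RTSP_CLUSTER)   # stand-in for the load balancer
        return LOCAL_DEFAULT_STREAM

session = ScreenSession("lobby-screen-1")
print(session.stream_url())        # idle: local default loop
session.client_connected()
print(session.stream_url())        # interactive: a cluster node via the proxy
```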
As shown in FIG. 2 and FIG. 3, the calculation transmission optimization method is implemented in the following steps:
the method comprises the following steps: the user client is connected with the large visual screen, requests to interact with the large visual screen, and collects voice audio of a user question by using a microphone of the client.
Step two: the proxy server monitors that the visual large screen is in an interactive state, selects one RTSP server from the RTSP server cluster by using the load balancer to connect, transmits voice audio of a user question to the voice collecting module, and obtains a text sentence of the user question through the text conversion module; taking the text sentences as input of an intention recognition module, retrieving the best answer from a conversation knowledge base according to an intention recognition result, and acquiring a chart and/or text data corresponding to the best answer;
then, a TTS module is used for realizing the conversion from characters to voice, and a virtual human action synthesis module is used for fitting the limb actions and the five sense organs actions of the virtual human to generate a virtual human image video matched with the audio content;
and when the connected RTSP server is down, immediately selecting another RTSP server from the RTSP server cluster by using the load balancer to switch.
Step three: the proxy server transmits the audio and video stream and the chart and/or the character data back to the large visual screen, drives the virtual human image to express, and simultaneously displays the corresponding expression content.
In a specific implementation of the invention, when no user client is connected to the large visualization screen, the proxy server detects that the screen is in a non-interactive state; it then returns the RTSP URL of a local audio and video stream, together with the corresponding chart and/or text data, to the screen, which fetches the local resource through that RTSP URL and plays it. When a real-time media stream is to be played, it is obtained through the proxy server and likewise returned to the large visualization screen in the form of an RTSP URL.
In this way, the RTSP servers are hidden from the large-screen side: in cluster mode the screen does not need to know which server it is actually connected to, it only needs to request one RTSP resource from the load balancer in the proxy server; the load balancer then requests an RTSP resource from the RTSP cluster according to a chosen algorithm and returns it. Supported load-balancing algorithms include random selection, round robin, source-address hashing and others.
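The sketch below illustrates the three algorithms named here (random, round robin, source-address hashing) plus the failover behavior of step two, under the assumption of a small in-memory node list with placeholder URLs; it is not the patent's load balancer.

```python
# Minimal load-balancer sketch: random, round robin, and source-address hashing,
# with a health set so a downed node can be skipped as described in step two.
import hashlib
import itertools
import random

class RtspLoadBalancer:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._rr = itertools.cycle(self.nodes)

    def mark_down(self, node):            # called when a node is detected as down
        self.healthy.discard(node)

    def pick_random(self):
        return random.choice([n for n in self.nodes if n in self.healthy])

    def pick_round_robin(self):
        for _ in range(len(self.nodes)):
            node = next(self._rr)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy RTSP node available")

    def pick_source_hash(self, client_addr: str):
        live = sorted(n for n in self.nodes if n in self.healthy)
        if not live:
            raise RuntimeError("no healthy RTSP node available")
        idx = int(hashlib.md5(client_addr.encode()).hexdigest(), 16) % len(live)
        return live[idx]

lb = RtspLoadBalancer(["rtsp://node-1.example/avatar",
                       "rtsp://node-2.example/avatar",
                       "rtsp://node-3.example/avatar"])
lb.mark_down("rtsp://node-2.example/avatar")          # simulate a node failure
print(lb.pick_round_robin())                          # skips the downed node
print(lb.pick_source_hash("203.0.113.7"))             # sticky per client address
```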
In the embodiments provided in this application, it should be understood that the disclosed technology can be implemented in other ways. The above description illustrates only preferred embodiments of the invention and does not limit it; any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A digital virtual human interaction system, comprising:
a large visualization screen for displaying the virtual human image together with chart and text data, where the chart and text data are updated to match the content expressed by the virtual human image;
an RTSP server cluster composed of a plurality of RTSP servers, used to provide audio and video media streams to the large visualization screen, the streams comprising body and facial motion video for driving the virtual human image together with the corresponding audio;
a proxy server connected between the large visualization screen and the RTSP server cluster, used both to forward the real-time audio and video media streams generated by the RTSP servers and to supply audio and video media streams stored locally on the proxy server to the large visualization screen; the proxy server monitors the interaction state of the large visualization screen in real time: when the screen is in an interactive state, a load balancer selects one RTSP server from the cluster and connects to it; when the screen is in a non-interactive state, the connection to the RTSP server is released, the local audio and video media stored on the proxy server are played on the large visualization screen, and the corresponding chart and text data are displayed;
and a user client for connecting to the large visualization screen and interacting with the virtual human shown on it.
2. The digital virtual human interaction system according to claim 1, wherein the RTSP server cluster further comprises:
a voice collection module for acquiring the voice audio of a user's question;
a text conversion module for converting the user's voice audio into text sentences;
an intention recognition module for obtaining the intention corresponding to a text sentence according to an intention recognition model;
a dialogue knowledge base that stores answers to questions with different intentions, used to receive the recognized intention and output the best answer;
a TTS module for converting the answer output by the dialogue knowledge base into audio;
and a virtual human action synthesis module for fitting the body movements and facial movements of the virtual human according to the audio data corresponding to the answer, generating virtual human video matched to the audio content.
3. The digital virtual human interaction system according to claim 2, wherein the RTSP server cluster is connected to an external resource database, and when the dialogue knowledge base outputs an answer, the chart and text data corresponding to that answer are obtained from the external resource database and output to the large visualization screen together with the audio and video for display.
4. The digital virtual human interaction system according to claim 2, wherein the voice audio of the user's question is collected through a microphone on the user client.
5. The digital virtual human interaction system according to claim 1, wherein the user client scans the two-dimensional code on the large visualization screen through a mini-program or an APP to establish the connection between the user client and the large visualization screen.
6. The digital virtual human interaction system according to claim 5, wherein the chart and text data displayed on the large visualization screen are synchronized to the client and visualized in real time.
7. A calculation transmission optimization method based on the digital virtual human interaction system of claim 3, comprising the following steps:
Step one: the user client connects to the large visualization screen, requests interaction with it, and collects the voice audio of the user's question with the client's microphone;
Step two: the proxy server detects that the large visualization screen is in an interactive state and uses the load balancer to select and connect one RTSP server from the cluster; the voice audio of the user's question is passed to the voice collection module, and the text conversion module produces the text sentence of the question; the text sentence is fed to the intention recognition module, the best answer is retrieved from the dialogue knowledge base according to the recognition result, and the chart and/or text data corresponding to the best answer are obtained;
the TTS module then converts the text to speech, and the virtual human action synthesis module fits the body and facial movements of the virtual human to generate virtual human video matched to the audio content;
Step three: the proxy server transmits the audio and video stream together with the chart and/or text data back to the large visualization screen, drives the virtual human image to speak, and displays the corresponding content at the same time.
8. The calculation transmission optimization method of the digital virtual human interaction system according to claim 7, wherein when no user client is connected to the large visualization screen, the proxy server detects that the screen is in a non-interactive state and returns the RTSP URL of a local audio and video stream, together with the corresponding chart and/or text data, to the screen, which fetches the local resource through that RTSP URL and plays it.
9. The method of claim 7, wherein a plurality of the large visualization screens share one RTSP server cluster.
10. The method of claim 7, wherein when the RTSP server connected in step two goes down, the load balancer selects another RTSP server from the RTSP server cluster to switch to.
CN202111091529.5A 2021-09-17 2021-09-17 Digital virtual human interaction system and calculation transmission optimization method thereof Pending CN113742473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111091529.5A CN113742473A (en) 2021-09-17 2021-09-17 Digital virtual human interaction system and calculation transmission optimization method thereof

Publications (1)

Publication Number Publication Date
CN113742473A true CN113742473A (en) 2021-12-03

Family

ID=78739527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111091529.5A Pending CN113742473A (en) 2021-09-17 2021-09-17 Digital virtual human interaction system and calculation transmission optimization method thereof

Country Status (1)

Country Link
CN (1) CN113742473A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114428670A (en) * 2022-03-16 2022-05-03 浙江云针信息科技有限公司 Method and system for realizing Windows application streaming



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination