CN113727051A

CN113727051A - Bidirectional video method, system, equipment and storage medium based on virtual agent

Info

Publication number: CN113727051A
Application number: CN202111017617.0A
Authority: CN
Inventors: 张万忠
Original assignee: Shenzhen Thinkive Information Technology Co ltd
Current assignee: Shenzhen Thinkive Information Technology Co ltd
Priority date: 2021-08-31
Filing date: 2021-08-31
Publication date: 2021-11-30

Abstract

The invention discloses a bidirectional video method, system, device, and storage medium based on a virtual agent. The method comprises the following steps: the virtual agent responds to a video recording instruction sent by the H5 front end, obtains a script text sent by the H5 front end, and obtains synthesized voice according to the script text; the virtual agent sends the synthesized voice to the H5 front end through a bidirectional video channel, and the synthesized voice is played through the H5 front end; the virtual agent receives and recognizes a voice response given in reply to the synthesized voice, and acquires a response text generated according to the voice response; and the virtual agent sends the response text to the H5 front end, responds to a video recording ending instruction sent by the H5 front end, and acquires the recorded video. With this method, the audio and video of the intelligent voice interaction between the virtual agent and the H5 front end are recorded by the H5 front end, and the virtual agent obtains the audio and video so recorded. This solves the problem that the H5 front end cannot by itself complete the video recording and intelligent voice interaction required for account-opening witnessing.

Description

Bidirectional video method, system, equipment and storage medium based on virtual agent

Technical Field

The present invention relates to the field of communications technologies, and in particular, to a bidirectional video method, system, device, and storage medium based on a virtual agent.

Background

At present, with the release of the one-way video account-opening policy, the two-way video witnessing problem that has troubled many securities brokers for years has finally been resolved.

With one-way video witnessing, investors can record a one-way video in self-service mode during account opening without waiting for a human witness. In particular, during periods when account-opening volume is concentrated, there is no need to queue. For securities brokers, one-way video witnessing also saves labor costs to some extent.

Intelligent voice interaction is a common interaction mode in account-opening witnessing: the system automatically plays, at the account-opening front end, a question that confirms the client's intention to open an account; the client answers "yes" or "no" as prompted; and the system performs voice recognition to determine whether the answer passes. However, limited by the processing power of the H5 front end, H5 one-way video currently cannot complete the video recording and intelligent voice interaction of account-opening witnessing at the front end.

Disclosure of Invention

The present invention therefore aims to remedy, at least to some extent, the deficiencies of the prior art by providing a bidirectional video method, system, device, and storage medium based on a virtual agent.

In a first aspect, the present invention provides a bidirectional video method based on virtual agents, the method comprising:

the virtual agent responds to a video recording instruction sent by the H5 front end, acquires a script text sent by the H5 front end, and obtains synthesized voice according to the script text;

the virtual agent sends the synthesized voice to the H5 front end through a bidirectional video channel, and the synthesized voice is played through the H5 front end;

the virtual agent receives and recognizes a voice response given in reply to the synthesized voice, and acquires a response text generated according to the voice response;

and the virtual agent sends the response text to the H5 front end, responds to a video recording ending instruction sent by the H5 front end, and acquires the recorded video.

In a second aspect, the present invention provides a virtual agent based two-way video system, the system comprising:

an acquisition module, used by the virtual agent to obtain a script text sent by the H5 front end and to obtain synthesized voice according to the script text;

a playing module, used by the virtual agent to send the synthesized voice to the H5 front end through a bidirectional video channel, the synthesized voice being played through the H5 front end;

a generation module, used by the virtual agent to receive and recognize a voice response given in reply to the synthesized voice and to acquire a response text generated according to the voice response;

a sending module, used by the virtual agent to send the response text to the H5 front end.

In a third aspect, the present invention further provides a virtual agent-based bidirectional video device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the steps in the virtual agent-based bidirectional video method according to the first aspect.

In a fourth aspect, the present invention also provides a storage medium having stored thereon a computer program which, when executed, performs the steps of the virtual agent-based bidirectional video method according to the first aspect.

The invention provides a bidirectional video method based on a virtual agent, which comprises the following steps: the virtual agent responds to a video recording instruction sent by the H5 front end, obtains a script text sent by the H5 front end, and obtains synthesized voice according to the script text; the virtual agent sends the synthesized voice to the H5 front end through a bidirectional video channel, and the synthesized voice is played through the H5 front end; the virtual agent receives and recognizes a voice response given in reply to the synthesized voice, and acquires a response text generated according to the voice response; and the virtual agent sends the response text to the H5 front end, responds to a video recording ending instruction sent by the H5 front end, and acquires the recorded video. With this method, the audio and video of the intelligent voice interaction between the virtual agent and the H5 front end are recorded by the H5 front end, and the virtual agent obtains the audio and video so recorded. This solves the problem that the H5 front end cannot by itself complete the video recording and intelligent voice interaction required for account-opening witnessing.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from the structures shown in these drawings without creative effort.

FIG. 1 is a schematic flow chart of a virtual agent-based two-way video method according to the present invention;

FIG. 2 is a sub-flow diagram of the virtual agent based bi-directional video method of the present invention;

FIG. 3 is another sub-flow diagram of the virtual agent based bi-directional video method of the present invention;

FIG. 4 is a schematic view of another sub-flow of the virtual agent-based two-way video method of the present invention;

FIG. 5 is a schematic diagram of the program modules of the virtual agent-based two-way video system of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a schematic flowchart of a bidirectional video method based on a virtual agent according to an embodiment of the present application, where in the embodiment, the bidirectional video method based on a virtual agent includes:

Step 101: the virtual agent responds to a video recording instruction sent by the H5 front end, acquires a script text sent by the H5 front end, and obtains synthesized voice according to the script text.

In this embodiment, after the virtual agent connects with the H5 front end, the H5 front end sends a video recording instruction to the virtual agent, and the virtual agent starts recording in response. The H5 front end then sends a script text to the virtual agent, which receives it. Because the processing capability of the H5 front end is limited, the video recording and intelligent voice interaction of account-opening witnessing cannot be completed at the front end while an investor opens an account. The audio and video collected by the H5 front end are therefore transmitted to the back end, where the virtual agent completes the related functions; this is why the H5 front end sends the script text to the virtual agent. The H5 front end is a technology set that includes HTML5, CSS3, JavaScript, and related technologies. The virtual agent is a computer-generated virtual role built with animation and artificial intelligence; it serves customers through chatbot functions, answering their questions and providing information about a company's products and services.

In this embodiment, after the virtual agent receives the script text sent by the H5 front end, the virtual agent processes the script text and finally obtains the processed result, namely the synthesized voice.

Step 102: the virtual agent sends the synthesized voice to the H5 front end through a bidirectional video channel, and the synthesized voice is played through the H5 front end.

In this embodiment, after the virtual agent receives the synthesized voice returned by the intelligent voice end, it sends the synthesized voice to the H5 front end through the bidirectional video channel, and the H5 front end plays the synthesized voice on receipt. The bidirectional video channel carries communication in both directions, so that the virtual agent and the H5 front end can interact.

Step 103: the virtual agent receives and recognizes the voice response given in reply to the synthesized voice, and acquires a response text generated according to the voice response.

In this embodiment, after the H5 front end plays the synthesized voice in step 102, the user hears it while interacting with the H5 front end and gives a voice response. The H5 front end collects the user's voice response and sends it to the virtual agent; the virtual agent receives the voice response, recognizes it, and generates a response text from it.

Step 104: the virtual agent sends the response text to the H5 front end, responds to a video recording ending instruction sent by the H5 front end, and obtains the recorded video.

In this embodiment, the virtual agent sends the response text obtained in step 103 to the H5 front end, which completes one round of interaction: the user gives a voice response, the virtual agent derives a response text from it and sends the response text to the H5 front end, and the H5 front end conveys the content of the response text to the user.

In this embodiment, the H5 front end sends a video recording ending instruction to the virtual agent; the virtual agent responds to this instruction and obtains the video recorded by the H5 front end. Obtaining the video file of the intelligent voice interaction between the user and the virtual agent effectively verifies that the subject of the video witnessing is the client, and the recorded video is retained.

The embodiment of the application provides a bidirectional video method based on a virtual agent, which comprises the following steps: the virtual agent responds to a video recording instruction sent by the H5 front end, obtains a script text sent by the H5 front end, and obtains synthesized voice according to the script text; the virtual agent sends the synthesized voice to the H5 front end through a bidirectional video channel, and the synthesized voice is played through the H5 front end; the virtual agent receives and recognizes a voice response given in reply to the synthesized voice, and acquires a response text generated according to the voice response; and the virtual agent sends the response text to the H5 front end, responds to a video recording ending instruction sent by the H5 front end, and acquires the recorded video. With this method, the audio and video of the intelligent voice interaction between the virtual agent and the H5 front end are recorded by the H5 front end, and the virtual agent obtains the audio and video so recorded. This solves the problem that the H5 front end cannot by itself complete the video recording and intelligent voice interaction required for account-opening witnessing.
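
Steps 101 to 104 can be sketched as a single session object. This is a minimal illustration under stated assumptions, not the implementation described by the application: the `SpeechService` interface, the `FakeSpeechService` stand-in, the `VirtualAgentSession` class, and every method name are hypothetical, and the two channels are reduced to labels in a list.

```python
from dataclasses import dataclass, field
from typing import Protocol


class SpeechService(Protocol):
    """Hypothetical interface of the intelligent voice end."""

    def synthesize(self, text: str) -> bytes: ...

    def recognize(self, audio: bytes) -> str: ...


class FakeSpeechService:
    """Stand-in used only to make the sketch runnable."""

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # pretend these bytes are audio

    def recognize(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend recognition is perfect


@dataclass
class VirtualAgentSession:
    speech: SpeechService
    recording: bool = False
    sent_to_h5: list = field(default_factory=list)

    def on_start_recording(self) -> None:
        # Step 101, first half: respond to the video recording instruction.
        self.recording = True

    def on_script_text(self, script_text: str) -> None:
        # Step 101, second half, and step 102: synthesize the text to be
        # spoken, then push the audio towards the H5 front end.
        audio = self.speech.synthesize(script_text)
        self.sent_to_h5.append(("media", audio))

    def on_voice_response(self, audio: bytes) -> None:
        # Step 103 and step 104, first half: recognize the user's reply
        # and send the response text back to the H5 front end.
        text = self.speech.recognize(audio)
        self.sent_to_h5.append(("signaling", text))

    def on_stop_recording(self, recorded_video: bytes) -> bytes:
        # Step 104, second half: respond to the ending instruction and
        # keep the recorded video.
        self.recording = False
        return recorded_video


session = VirtualAgentSession(FakeSpeechService())
session.on_start_recording()
session.on_script_text("Do you confirm that you are opening this account voluntarily?")
session.on_voice_response(b"yes")
video = session.on_stop_recording(b"<video bytes>")
```

Keeping the speech service behind an interface mirrors the text's separation between the virtual agent and the intelligent voice end, and lets the flow be exercised without a real synthesis or recognition backend.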

Further, after the virtual agent sends the response text to the H5 front end, the method further includes:

performing logic processing on the response text at the H5 front end.

In this embodiment, after the virtual agent sends the response text to the H5 front end, the H5 front end performs logic processing on the response text and then conveys the processed response text to the user. The mode of conveyance may be voice, text, or the like, and is not limited here.
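
The source does not specify what the logic processing consists of. One plausible case, given that the client is asked to answer yes or no, is classifying the recognized response text. The sketch below assumes exactly that; the keyword sets and the function name `classify_response` are hypothetical. A real H5 front end would run JavaScript; Python is used here only to keep the sketch short.

```python
# Hypothetical keyword sets; a production system would need a richer vocabulary.
AFFIRMATIVE = {"yes", "是", "是的"}
NEGATIVE = {"no", "不", "不是", "否"}


def classify_response(response_text: str) -> str:
    """Map a recognized response text to 'confirmed', 'rejected', or 'unclear'."""
    # Normalize case, surrounding whitespace, and trailing punctuation.
    normalized = response_text.strip().lower().rstrip(".!。！")
    if normalized in AFFIRMATIVE:
        return "confirmed"
    if normalized in NEGATIVE:
        return "rejected"
    return "unclear"
```

An "unclear" result could prompt the front end to replay the question rather than fail the witnessing outright; that policy is a design choice outside the text.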

Further, the virtual agent acquires the video recorded at the H5 front end through the C++ audio/video export interface of WebRTC.

In this embodiment, video recording means mixing the audio of the intelligent voice interaction between the user and the virtual agent, and merging the mixed audio with the user's video into a single video file, that is, the video file recorded at the H5 front end. The virtual agent acquires internal audio and video data by registering audio and video callbacks, and captures external audio and video data by replacing the capture source that corresponds to the audio and video tracks in the media stream.
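
The text does not say how the two audio streams are mixed. A common simple approach, assumed here purely for illustration, is to add corresponding 16-bit PCM samples and clip the sum to the valid range. The function name `mix_pcm16` and the sample values are hypothetical.

```python
from array import array


def mix_pcm16(agent_audio: array, user_audio: array) -> array:
    """Mix two signed 16-bit PCM sample sequences by summing with clipping."""
    length = max(len(agent_audio), len(user_audio))
    mixed = array("h")
    for i in range(length):
        # Treat the shorter stream as silence once it runs out.
        a = agent_audio[i] if i < len(agent_audio) else 0
        b = user_audio[i] if i < len(user_audio) else 0
        # Clip to the int16 range so loud passages do not wrap around.
        mixed.append(max(-32768, min(32767, a + b)))
    return mixed


agent = array("h", [1000, 20000, -30000])
user = array("h", [500, 20000, -10000, 7])
mixed = mix_pcm16(agent, user)
# list(mixed) == [1500, 32767, -32768, 7]
```

Both inputs are assumed to share one sample rate and channel layout; resampling and muxing the mixed track with the video are outside this sketch.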

Further, the virtual agent establishes the bidirectional video channel and the bidirectional signaling channel with the H5 front end based on the WebRTC, and the sending of the response text by the virtual agent to the H5 front end specifically includes:

the virtual agent sends the response text to the H5 front end through the bidirectional signaling channel.

In this embodiment, the virtual agent implements bidirectional video with the H5 front end based on WebRTC (Web Real-Time Communication), and the two carry out the audio and video interaction over it. The H5 front end can run on mobile devices, such as iOS and Android devices, all of which must use a browser that supports WebRTC; Safari is recommended on iOS, and QQ, WeChat, and the like on Android. The virtual agent implements a PeerConnection (peer-to-peer connection) flow based on the C++ version of the WebRTC source code, which gives it the capability of two-way video communication with the H5 front end. WebRTC is an API (Application Programming Interface) that enables web browsers to conduct real-time voice or video conversations.
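
The source does not define a message format for the bidirectional signaling channel. As an illustration only, the response text could travel as a small JSON envelope; the field names `type`, `session`, and `text`, and both function names, are assumptions.

```python
import json


def encode_response_message(session_id: str, response_text: str) -> str:
    """Serialize a response text for sending over the signaling channel."""
    return json.dumps(
        {"type": "response_text", "session": session_id, "text": response_text},
        ensure_ascii=False,  # keep Chinese text readable on the wire
    )


def decode_message(raw: str) -> dict:
    """Parse a signaling message and reject unexpected message types."""
    message = json.loads(raw)
    if message.get("type") != "response_text":
        raise ValueError(f"unexpected message type: {message.get('type')!r}")
    return message


raw = encode_response_message("session-1", "是")
decoded = decode_message(raw)
```

Sending text over the signaling channel rather than the media channel keeps recognition results out of the recorded audio and video.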

Further, the sending, by the virtual agent, the synthesized speech to the front end of the H5 through the bidirectional video channel specifically includes:

the virtual agent sends the obtained synthesized voice to the H5 front end in real time.

In this embodiment, when the virtual agent calls the interface of the intelligent voice end to process the script text, it does not wait for the intelligent voice end to finish processing the entire script text before sending the synthesized voice to the H5 front end. Instead, as soon as the virtual agent receives synthesized voice data from the intelligent voice end, it forwards that data to the H5 front end. This avoids the waiting that a long script text would otherwise cause during voice synthesis.
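
The forwarding behavior described above can be sketched with a generator: each chunk is sent as soon as it arrives instead of after the whole utterance is synthesized. The chunked synthesizer `fake_tts_stream`, its chunk size, and the `send` callback are hypothetical stand-ins for the intelligent voice end and the bidirectional video channel.

```python
from typing import Callable, Iterable, Iterator


def fake_tts_stream(script_text: str, chunk_size: int = 4) -> Iterator[bytes]:
    """Stand-in synthesizer that yields 'audio' a few bytes at a time."""
    data = script_text.encode("utf-8")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def forward_as_received(chunks: Iterable[bytes], send: Callable[[bytes], None]) -> int:
    """Send every chunk the moment it is produced; return how many were sent."""
    count = 0
    for chunk in chunks:
        send(chunk)  # no buffering of the full utterance
        count += 1
    return count


sent = []
chunk_count = forward_as_received(fake_tts_stream("please confirm"), sent.append)
```

With this shape, playback at the H5 front end can begin after the first chunk, so the delay before the user hears anything does not grow with the length of the text.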

Further, referring to fig. 2, fig. 2 is a schematic sub-flow diagram of the bidirectional video method based on a virtual agent in the embodiment of the present application. In this embodiment, the step in which the virtual agent responds to a video recording instruction sent by the H5 front end, obtains a script text sent by the H5 front end, and obtains synthesized voice according to the script text specifically includes:

Step 201: the virtual agent obtains the script text sent by the H5 front end;

Step 202: the virtual agent calls an interface of an intelligent voice end to process the script text and obtain the synthesized voice.

In this embodiment, the H5 front end sends a script text to the virtual agent. After receiving the script text, the virtual agent calls the interface of the intelligent voice end to process it. Once the intelligent voice end has produced the synthesized voice, it returns the synthesized voice to the virtual agent; that is, the intelligent voice end returns its result for the script text, and the virtual agent finally obtains the synthesized voice produced by the intelligent voice end.

Further, referring to fig. 3, fig. 3 is another schematic sub-flow diagram of a bidirectional video method based on a virtual agent in the embodiment of the present application, where in this embodiment, the receiving and recognizing, by the virtual agent, a voice response obtained according to the synthesized voice, and acquiring a response text generated according to the voice response specifically include:

Step 301: the virtual agent receives the voice response that the H5 front end obtains in reply to the synthesized voice;

Step 302: the virtual agent calls the interface of the intelligent voice end to recognize the voice response and obtain the response text.

In this embodiment, after the H5 front end plays the synthesized voice, the user gives a voice response to it. The H5 front end collects the user's voice response and sends it to the virtual agent. After receiving the voice response, the virtual agent calls the interface of the intelligent voice end to recognize it; the intelligent voice end produces a response text from the recognition and returns the response text to the virtual agent, which thereby obtains the response text.

Further, referring to fig. 4, fig. 4 is a schematic sub-flow diagram of a bidirectional video method based on a virtual agent in this embodiment of the present application, where in this embodiment, the receiving and recognizing, by the virtual agent, a voice response obtained according to the synthesized voice, and acquiring a response text generated according to the voice response further include:

Step 401: the virtual agent receives and identifies a fixed script sent by the H5 front end;

Step 402: the virtual agent sends a preset recording to the H5 front end according to the fixed script.

In this embodiment, for fixed scripts that the user supplies in advance, the virtual agent calls the interface of the intelligent voice end ahead of time to synthesize recordings, and the intelligent voice end returns each recording to the virtual agent once it is synthesized. When the virtual agent later encounters one of these fixed scripts, it sends the preset recording directly, which reduces unnecessary synthesis requests.
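
The preset-recording idea amounts to a lookup table that is filled before the session and consulted before any synthesis request. The sketch below is a minimal illustration; the class `PresetRecordingCache`, its methods, and the `synthesize` callback standing in for the intelligent voice end are hypothetical.

```python
from typing import Callable, Dict, Iterable


class PresetRecordingCache:
    """Serve pre-synthesized audio for fixed phrases; synthesize everything else."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize
        self._presets: Dict[str, bytes] = {}
        self.synthesis_calls = 0  # counts requests made to the voice service

    def _call_tts(self, text: str) -> bytes:
        self.synthesis_calls += 1
        return self._synthesize(text)

    def preload(self, fixed_scripts: Iterable[str]) -> None:
        # Synthesize each fixed phrase once, ahead of time.
        for text in fixed_scripts:
            self._presets[text] = self._call_tts(text)

    def audio_for(self, script_text: str) -> bytes:
        # A fixed phrase is answered from the cache with no new request.
        preset = self._presets.get(script_text)
        if preset is not None:
            return preset
        return self._call_tts(script_text)


cache = PresetRecordingCache(lambda text: text.encode("utf-8"))
cache.preload(["Please look at the camera."])
first = cache.audio_for("Please look at the camera.")
second = cache.audio_for("Please state your full name.")
```

Matching here is by exact text, which is the simplest reading of "identifies a fixed script"; how the real system identifies one is not stated in the source.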

Further, the specific implementation steps of the embodiment of the application include:

1. The H5 front end starts video recording. The virtual agent processes the fixed scripts in advance through the interface of the intelligent voice end; the intelligent voice end produces a recording for each fixed script and returns it to the virtual agent.

2. The H5 front end sends a script text to the virtual agent. After receiving the script text, the virtual agent calls the interface of the intelligent voice end to process it into synthesized voice, and the intelligent voice end returns the synthesized voice to the virtual agent. If the virtual agent receives a fixed script, it sends the preset recording directly to the H5 front end.

3. After receiving the synthesized voice returned by the intelligent voice end, the virtual agent sends it to the H5 front end in real time through the bidirectional video channel between the two, and the synthesized voice is played through the H5 front end.

4. After the H5 front end plays the synthesized voice, the user gives a voice response to it. The H5 front end collects the voice response and sends it to the virtual agent. On receipt, the virtual agent calls the interface of the intelligent voice end to perform voice recognition; the intelligent voice end processes the voice response into a response text and returns it to the virtual agent.

5. After receiving the response text returned by the intelligent voice end, the virtual agent sends it to the H5 front end through the bidirectional signaling channel. The H5 front end performs logic processing on the response text and conveys the processed response text to the user. The H5 front end then ends the video recording, and the virtual agent obtains the audio and video recorded by the H5 front end.

Further, in this embodiment of the present application, before recording the video, the method further includes:

1. Liveness detection

A liveness action verifies that the subject of the video witnessing is a real person rather than a photo; a frontal photo of the user and a liveness key frame are output at the same time.

2. Face comparison

Comparing the user's frontal photo, the liveness key frame, and the ID card photo verifies that the subject of the video witnessing is the account holder in person.
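
The two checks act as a gate in front of video recording. The sketch below shows only that gating logic; it assumes, without support from the source, that face comparison yields a similarity score between 0 and 1 and that a fixed threshold decides the match. The names `PreRecordingChecks` and `may_start_recording` and the value 0.8 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PreRecordingChecks:
    liveness_passed: bool   # result of the liveness action
    face_similarity: float  # assumed score in [0, 1] from face comparison


def may_start_recording(checks: PreRecordingChecks, similarity_threshold: float = 0.8) -> bool:
    """Allow recording only if both checks succeed."""
    return checks.liveness_passed and checks.face_similarity >= similarity_threshold
```

For example, `may_start_recording(PreRecordingChecks(True, 0.91))` allows recording, while a failed liveness check blocks it regardless of the similarity score.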

Further, in the embodiment of the present application, the logical architecture of the one-way video includes:

1. User layer

The mobile terminal (Android H5 front end and iOS H5 front end) accesses the service through a browser that supports WebRTC. QQ or WeChat is recommended on Android, and the Safari browser on iOS.

2. Application layer

Provides video basic capabilities (video communication, video recording), voice basic capabilities (voice synthesis, voice recognition), and face basic capabilities (liveness detection, face comparison).

3. Service layer

Provides the corresponding service support for the application layer, including the video basic service, the voice basic service, and the face basic service.

4. Storage layer

Provides picture storage and video storage capabilities.
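
The four layers can be written down as plain data, which makes the one-to-one relation between application capabilities and supporting services easy to check. The dictionary layout and the helper `unsupported_capabilities` are illustrative assumptions; only the layer names and their contents come from the text.

```python
ARCHITECTURE = {
    "user": ["Android H5 front end", "iOS H5 front end"],
    "application": {
        "video": ["video communication", "video recording"],
        "voice": ["voice synthesis", "voice recognition"],
        "face": ["liveness detection", "face comparison"],
    },
    "service": ["video", "voice", "face"],
    "storage": ["picture", "video"],
}


def unsupported_capabilities(architecture: dict) -> list:
    """Return application capability groups that lack a matching basic service."""
    return sorted(set(architecture["application"]) - set(architecture["service"]))


missing = unsupported_capabilities(ARCHITECTURE)
```

An empty `missing` list reflects the text's statement that the service layer supports every capability group of the application layer.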

Further, an embodiment of the present application further provides a bidirectional video system 500 based on a virtual agent, and fig. 5 is a schematic diagram of program modules of the bidirectional video system based on a virtual agent in the embodiment of the present application, in which the bidirectional video system 500 based on a virtual agent includes:

the obtaining module 501, used by the virtual agent to respond to a video recording instruction sent by the H5 front end, acquire a script text sent by the H5 front end, and obtain synthesized voice according to the script text;

the playing module 502, used by the virtual agent to send the synthesized voice to the H5 front end through a bidirectional video channel, the synthesized voice being played through the H5 front end;

the generation module 503, used by the virtual agent to receive and recognize a voice response given in reply to the synthesized voice and to acquire a response text generated according to the voice response;

the sending module 504, used by the virtual agent to send the response text to the H5 front end, respond to a video recording ending instruction sent by the H5 front end, and acquire the recorded video.

The embodiment of the present application provides a bidirectional video system 500 based on a virtual agent, which can implement the following: the virtual agent responds to a video recording instruction sent by the H5 front end, obtains a script text sent by the H5 front end, and obtains synthesized voice according to the script text; the virtual agent sends the synthesized voice to the H5 front end through a bidirectional video channel, and the synthesized voice is played through the H5 front end; the virtual agent receives and recognizes a voice response given in reply to the synthesized voice, and acquires a response text generated according to the voice response; and the virtual agent sends the response text to the H5 front end, responds to a video recording ending instruction sent by the H5 front end, and acquires the recorded video. With this system, the audio and video of the intelligent voice interaction between the virtual agent and the H5 front end are recorded by the H5 front end, and the virtual agent obtains the audio and video so recorded. This solves the problem that the H5 front end cannot by itself complete the video recording and intelligent voice interaction required for account-opening witnessing.

Further, the present application also provides a virtual agent-based bidirectional video device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps in the virtual agent-based bidirectional video method when executing the computer program.

Further, the present application also provides a storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the virtual agent based bidirectional video method as described above.

Each functional module in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may be stored in a computer readable storage medium.

Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

It should be noted that, for simplicity of description, the above method embodiments are described as a series of combined acts, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the invention. In the above embodiments, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

For those skilled in the art, according to the idea of the embodiments of the present application, there may be variations in the specific implementation and application scope, and in summary, the content of the present description should not be construed as a limitation to the present invention.

Claims

1. A bidirectional video method based on virtual agents, characterized in that the method comprises:

2. The method according to claim 1, wherein the virtual agent obtains the video recorded at the H5 front end through the C++ audio/video export interface of WebRTC.

3. The method of claim 1, wherein the virtual agent establishes the bidirectional video channel and the bidirectional signaling channel with the H5 front end based on the WebRTC, and the virtual agent sending the answer text to the H5 front end specifically comprises:

4. The method according to claim 1, wherein the virtual agent sending the synthesized speech to the H5 front end through a bidirectional video channel specifically comprises:

5. The method according to claim 1, wherein the virtual agent responding to a video recording instruction sent by the H5 front end, obtaining a script text sent by the H5 front end, and obtaining the synthesized voice according to the script text specifically comprises:

the virtual agent acquires the script text sent by the H5 front end;

and the virtual agent calls an interface of an intelligent voice end to process the script text to obtain the synthesized voice.

6. The method according to claim 5, wherein the virtual agent receives and recognizes a voice response obtained from the synthesized voice, and the obtaining of the response text generated from the voice response specifically comprises:

the virtual agent receives the voice response that the H5 front end obtains in reply to the synthesized voice;

and the virtual agent calls the interface of the intelligent voice end to recognize the voice response to obtain the response text.

7. The method of claim 1, wherein the virtual agent receiving and recognizing a voice response from the synthesized voice and obtaining a response text generated from the voice response further comprises:

the virtual agent receives and identifies a fixed script sent by the H5 front end;

and the virtual agent sends a preset recording to the H5 front end according to the fixed script.

8. A virtual agent based two-way video system, the system comprising:

an acquisition module, used by the virtual agent to respond to a video recording instruction sent by the H5 front end, acquire a script text sent by the H5 front end, and obtain synthesized voice according to the script text;

a sending module, used by the virtual agent to send the response text to the H5 front end, respond to a video recording ending instruction sent by the H5 front end, and acquire the recorded video.

9. A virtual agent based bi-directional video device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor when executing the computer program performs the steps of the virtual agent based bi-directional video method of any of claims 1-7.

10. A storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the virtual agent based bi-directional video method of any of claims 1-7.