CN116962582A

CN116962582A - Interaction method, device, network equipment and readable storage medium

Info

Publication number: CN116962582A
Application number: CN202211078304.0A
Authority: CN
Inventors: 李洋; 许玥; 周铭吉; 韩屹; 路骁虎
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd; CM Intelligent Mobility Network Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd; CM Intelligent Mobility Network Co Ltd
Priority date: 2022-09-05
Filing date: 2022-09-05
Publication date: 2023-10-27

Abstract

The invention provides an interaction method, an interaction device, network equipment and a readable storage medium, and relates to the technical field of artificial intelligence. The method comprises the following steps: establishing a Voice over Long-Term Evolution (VoLTE) video call connection with a terminal; receiving first audio information sent by the terminal; obtaining target response information according to the first audio information based on a target technology, wherein the target technology comprises at least one of the following: voice activity detection (VAD) technology, automatic speech recognition (ASR) technology, and natural language understanding (NLU) technology; driving a digital person by using the target response information to obtain target video information, wherein the target video information comprises digital person image information and target voice information, and the target voice information is generated according to the target response information; and sending the target video information to the terminal. The scheme of the invention solves the problem that a full-duplex interaction scheme between the terminal and the digital person is lacking in VoLTE-based communication.

Description

Interaction method, device, network equipment and readable storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to an interaction method, an interaction device, network equipment and a readable storage medium.

Background

A digital person, also known as a virtual person or avatar, is a digitized character that exists in the virtual world with multiple human traits.

In the prior art, interaction between a terminal and a digital person is usually implemented over the Internet; for communication based on Voice over Long-Term Evolution (VoLTE), however, a full-duplex interaction scheme between the terminal and the digital person is lacking.

Disclosure of Invention

The invention aims to provide an interaction method, an interaction device, network equipment and a readable storage medium, which solve the problem that a full-duplex interaction scheme between a terminal and a digital person is lacking in VoLTE-based communication.

To achieve the above object, an embodiment of the present invention provides an interaction method, applied to a network device, including:

establishing a Voice over Long-Term Evolution (VoLTE) video call connection with a terminal;

receiving first audio information sent by the terminal;

obtaining target response information according to the first audio information based on a target technology, wherein the target technology comprises at least one of the following: VAD (Voice Activity Detection) technology, ASR (Automatic Speech Recognition) technology, and NLU (Natural Language Understanding) technology;

Driving a digital person by using the target response information to obtain target video information, wherein the target video information comprises digital person image information and target voice information, and the target voice information is generated according to the target response information;

and sending the target video information to the terminal.

Optionally, the establishing a VoLTE video call connection with the terminal includes:

receiving a VoLTE call request sent by the terminal;

judging whether the VoLTE call request is a video call according to SDP (Session Description Protocol);

and establishing VoLTE video call connection with the terminal according to the judging result.

Optionally, the establishing VoLTE video call connection with the terminal according to the judging result includes at least one of the following:

under the condition that the VoLTE call request is a video call, establishing a VoLTE video call connection with the terminal;

and under the condition that the VoLTE call request is an audio call, establishing VoLTE video call connection with the terminal according to the support condition of the terminal on resource reservation.

Optionally, the establishing VoLTE video call connection with the terminal according to the supporting situation of the terminal for resource reservation includes:

Negotiating with the terminal according to the supporting condition of the terminal for resource reservation;

and under the condition that negotiation is passed, establishing VoLTE video call connection with the terminal.

Optionally, the negotiating with the terminal according to the supporting situation of the terminal for the resource reservation includes at least one of the following:

in case the terminal supports resource reservation, performing QoS (Quality of Service) resource reservation negotiation with the terminal using SIP (Session Initiation Protocol) signaling;

and under the condition that the terminal does not support resource reservation, carrying out modification session negotiation with the terminal by using the SIP signaling.

Optionally, the obtaining, based on the target technology, target response information according to the first audio information includes:

obtaining target voice information from the first audio information based on the VAD technique;

based on the ASR technology, obtaining voice recognition information according to the target voice information;

based on the NLU technology, target response information is obtained according to the voice recognition information.

Optionally, the method further comprises:

stopping driving the digital person under the condition that a preset condition is met;

Wherein the preset conditions include at least one of:

obtaining the target voice information;

the voice recognition information includes a preset character.

Optionally, the driving the digital person with the target response information to obtain target video information includes:

according to the target response information, invoking a digital human interface to obtain first video information;

adjusting the first video information based on preset media information to obtain the target video information;

the preset media information comprises at least one of characters, pictures, audio and video.

To achieve the above object, an embodiment of the present invention provides an interaction method, applied to a terminal, including:

establishing VoLTE video call connection with network equipment;

transmitting first audio information to the network device;

and receiving target video information sent by the network equipment, wherein the target video information comprises digital human image information and target voice information, the target voice information is generated according to target response information, and the target response information is generated according to the first audio information.

Optionally, the method further comprises:

receiving a Session Initiation Protocol (SIP) signaling sent by the network equipment;

Displaying a first operation interface according to the SIP signaling;

and responding to the first operation of the user on the first operation interface, and sending negotiation result information to the network equipment.

To achieve the above object, an embodiment of the present invention provides an interaction device, which is applied to a network device, including:

the first processing module is used for establishing a Voice over Long-Term Evolution (VoLTE) video call connection with the terminal;

the first receiving module is used for receiving first audio information sent by the terminal;

the second processing module is used for obtaining target response information according to the first audio information based on a target technology, wherein the target technology comprises at least one of the following: voice activity detection (VAD) technology, automatic speech recognition (ASR) technology, and natural language understanding (NLU) technology;

the third processing module is used for driving the digital person by utilizing the target response information to obtain target video information, wherein the target video information comprises digital person image information and target voice information, and the target voice information is generated according to the target response information;

and the first sending module is used for sending the target video information to the terminal.

Optionally, the first processing module includes:

The third receiving module is used for receiving the VoLTE call request sent by the terminal;

a fifth processing module, configured to determine, according to the Session Description Protocol (SDP), whether the VoLTE call request is a video call;

and the sixth processing module is used for establishing VoLTE video call connection with the terminal according to the judging result.

Optionally, the sixth processing module includes:

the first processing sub-module is used for establishing VoLTE video call connection with the terminal under the condition that the VoLTE call request is a video call;

and the second processing sub-module is used for establishing VoLTE video call connection with the terminal according to the support condition of the terminal for resource reservation under the condition that the VoLTE call request is an audio call.

Optionally, the second processing sub-module includes:

the first processing unit is used for negotiating with the terminal according to the supporting condition of the terminal on the resource reservation;

and the second processing unit is used for establishing VoLTE video call connection with the terminal under the condition that negotiation is passed.

Optionally, the first processing unit includes:

a first processing subunit, configured to perform a quality of service (QoS) resource reservation negotiation with the terminal using Session Initiation Protocol (SIP) signaling in a case where the terminal supports resource reservation;

And the second processing subunit is used for carrying out modification session negotiation with the terminal by using the SIP signaling under the condition that the terminal does not support resource reservation.

Optionally, the second processing module includes:

a seventh processing module, configured to obtain target voice information according to the first audio information based on the VAD technique;

an eighth processing module, configured to obtain voice recognition information according to the target voice information based on the ASR technique;

and a ninth processing module, configured to obtain target response information according to the voice recognition information based on the NLU technology.

Optionally, the apparatus further comprises:

a tenth processing module, configured to stop driving the digital person if a preset condition is satisfied;

wherein the preset conditions include at least one of:

obtaining the target voice information;

the voice recognition information includes a preset character.

Optionally, the third processing module includes:

the third processing sub-module is used for calling a digital human interface according to the target response information to obtain first video information;

a fourth processing sub-module, configured to adjust the first video information based on preset media information, to obtain the target video information;

the preset media information comprises at least one of characters, pictures, audio and video.

To achieve the above object, an embodiment of the present invention provides an interaction device, which is applied to a terminal, including:

the fourth processing module is used for establishing VoLTE video call connection with the network equipment;

the second sending module is used for sending the first audio information to the network equipment;

the second receiving module is used for receiving target video information sent by the network equipment, wherein the target video information comprises digital human image information and target voice information, the target voice information is generated according to target response information, and the target response information is generated according to the first audio information.

Optionally, the apparatus further comprises:

a fourth receiving module, configured to receive a session initiation protocol SIP signaling sent by the network device;

an eleventh processing module, configured to display a first operation interface according to the SIP signaling;

and the fourth sending module is used for responding to the first operation of the user on the first operation interface and sending negotiation result information to the network equipment.

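To achieve the above object, an embodiment of the present invention provides a network device, including a processor and a transceiver, where the processor is configured to:

establishing a VoLTE video call connection with a terminal;

receiving first audio information sent by the terminal;

obtaining target response information according to the first audio information based on a target technology, wherein the target technology comprises at least one of the following: voice activity detection (VAD) technology, automatic speech recognition (ASR) technology, and natural language understanding (NLU) technology;

driving a digital person by using the target response information to obtain target video information, wherein the target video information comprises digital person image information and target voice information, and the target voice information is generated according to the target response information;

and sending the target video information to the terminal.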

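Optionally, when the processor establishes a VoLTE video call connection with the terminal, the processor is specifically configured to:

receiving a VoLTE call request sent by the terminal;

judging whether the VoLTE call request is a video call or not according to the Session Description Protocol (SDP);

and establishing VoLTE video call connection with the terminal according to the judging result.

Optionally, when the processor establishes VoLTE video call connection with the terminal according to the judging result, the processor is specifically configured to perform at least one of the following:

under the condition that the VoLTE call request is a video call, establishing a VoLTE video call connection with the terminal;

and under the condition that the VoLTE call request is an audio call, establishing VoLTE video call connection with the terminal according to the support condition of the terminal for resource reservation.

Optionally, when the processor establishes VoLTE video call connection with the terminal according to the support condition of the terminal for resource reservation, the processor is specifically configured to:

negotiating with the terminal according to the support condition of the terminal for resource reservation;

and under the condition that negotiation is passed, establishing VoLTE video call connection with the terminal.

Optionally, when negotiating with the terminal according to the support condition of the terminal for resource reservation, the processor is specifically configured to perform at least one of the following:

under the condition that the terminal supports resource reservation, carrying out QoS resource reservation negotiation with the terminal by using Session Initiation Protocol (SIP) signaling;

and under the condition that the terminal does not support resource reservation, carrying out modification session negotiation with the terminal by using the SIP signaling.

Optionally, when obtaining the target response information according to the first audio information based on the target technology, the processor is specifically configured to:

obtaining target voice information from the first audio information based on the VAD technology;

based on the ASR technology, obtaining voice recognition information according to the target voice information;

based on the NLU technology, obtaining target response information according to the voice recognition information.

Optionally, the processor is further configured to:

stopping driving the digital person under the condition that a preset condition is met;

wherein the preset conditions include at least one of:

obtaining the target voice information;

the voice recognition information includes a preset character.

Optionally, when driving the digital person with the target response information to obtain the target video information, the processor is specifically configured to:

according to the target response information, invoking a digital human interface to obtain first video information;

adjusting the first video information based on preset media information to obtain the target video information;

the preset media information comprises at least one of characters, pictures, audio and video.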

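To achieve the above object, an embodiment of the present invention provides a terminal including a processor and a transceiver, wherein the processor is configured to:

establishing VoLTE video call connection with network equipment;

transmitting first audio information to the network device;

and receiving target video information sent by the network equipment, wherein the target video information comprises digital human image information and target voice information, the target voice information is generated according to target response information, and the target response information is generated according to the first audio information.

Optionally, the processor is further configured to:

receiving Session Initiation Protocol (SIP) signaling sent by the network equipment;

displaying a first operation interface according to the SIP signaling;

and responding to the first operation of the user on the first operation interface, and sending negotiation result information to the network equipment.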

To achieve the above object, an embodiment of the present invention provides a network device including a transceiver, a processor, a memory, and a program or instructions stored on the memory and executable on the processor; the processor, when executing the program or instructions, implements the interaction method as applied to the network device.

To achieve the above object, an embodiment of the present invention provides a terminal including a transceiver, a processor, a memory, and a program or instructions stored on the memory and executable on the processor; the processor, when executing the program or instructions, implements the interaction method as applied to the terminal as described above.

To achieve the above object, an embodiment of the present invention provides a readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement steps in an interaction method as applied to a network device or steps in an interaction method as applied to a terminal.

The technical scheme of the invention has the following beneficial effects:

According to the method provided by the embodiment of the invention, after the network equipment establishes a VoLTE video call connection with the terminal, the network equipment can obtain target response information from the first audio information sent by the terminal based on technologies such as VAD, ASR and NLU, and drive a digital person by using the target response information, so that target video information comprising the digital person image information is obtained and sent to the terminal. In this way, a full-duplex interaction scheme between the terminal and the digital person is realized over VoLTE. Compared with an Internet video call, VoLTE video communication requires neither downloading an app (application) nor using a browser, so that interaction between the terminal and the digital person is more convenient and faster.

Drawings

FIG. 1 is a flow chart of an interaction method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an implementation process of an interaction method according to an embodiment of the present invention;

FIG. 3 is a timing diagram of resource reservation negotiation according to an embodiment of the present invention;

FIG. 4 is a timing diagram of a modified session negotiation according to an embodiment of the present invention;

FIG. 5 is a full duplex flow diagram of an instant break mode according to an embodiment of the invention;

FIG. 6 is a schematic diagram of a workflow of an MCU module according to an embodiment of the invention;

FIG. 7 is a flow chart of an interaction method according to another embodiment of the present invention;

FIG. 8 is a block diagram of an interactive device according to an embodiment of the present invention;

FIG. 9 is a block diagram of an interactive device according to another embodiment of the present invention;

FIG. 10 is a block diagram of a network device according to an embodiment of the present invention;

FIG. 11 is a block diagram of a terminal according to an embodiment of the present invention;

FIG. 12 is a block diagram of a network device according to another embodiment of the present invention;

FIG. 13 is a block diagram of a terminal according to another embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages to be solved more apparent, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.

It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In various embodiments of the present application, it should be understood that the sequence numbers of the following processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.

In addition, the terms "system" and "network" are often used interchangeably herein.

In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A, and that B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.

The VoLTE access mode of existing schemes is relatively limited, and a video call usually has to be initiated by an explicit operation at the terminal. VoLTE is mainly used for voice services and less for video services. For a VoLTE access number provided by an enterprise, most users find it difficult to confirm whether the number supports video services, and may therefore give up using them. In addition, on some terminals (e.g., mobile phones), placing a VoLTE video call requires entering a second-level menu; the process is relatively complicated, which degrades the user experience and reduces the effective utilization of the platform. Furthermore, because users' habits and VoLTE line resources differ, VoLTE video access suffers from various compatibility problems. Therefore, existing VoLTE-based communication does not yet offer a full-duplex interaction scheme between the terminal and the digital person.

As shown in fig. 1, an interaction method of an embodiment of the present invention is applied to a network device, and includes:

Step 101, establishing a Voice over Long-Term Evolution (VoLTE) video call connection with a terminal.

In this embodiment, the network device may be a VoLTE gateway.

As an alternative embodiment, the step 101 may specifically include the following steps: receiving a VoLTE call request sent by the terminal; judging whether the VoLTE call request is a video call according to the Session Description Protocol (SDP); and establishing a VoLTE video call connection with the terminal according to the judging result.

For example, a user may place a VoLTE audio call or video call to a designated fixed number through the terminal, whereupon the terminal sends a VoLTE call request to the network device. The VoLTE call request may reach the network device through an operator IMS (IP Multimedia Subsystem) line. The network device receives the VoLTE call request and may perform related judging and processing operations according to it (for example, judging whether the VoLTE call request is a video call, or negotiating with the terminal). The network device may determine, according to the SDP, whether the VoLTE call request sent by the terminal is a video call.
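The SDP-based judgment described above can be illustrated with a minimal sketch. It assumes that the SDP body of the call request is available as text and that "video call" means the offer contains an `m=video` media line with a non-zero port (in SDP offer/answer, port 0 marks a disabled stream). The function names and the sample offers are hypothetical and are not taken from the embodiment.

```python
from typing import Set


def sdp_media_types(sdp: str) -> Set[str]:
    """Return the media types (e.g. "audio", "video") of all active m= lines."""
    active = set()
    for raw in sdp.splitlines():
        line = raw.strip()
        if not line.startswith("m="):
            continue
        # m=<media> <port>[/<number of ports>] <proto> <fmt> ...
        fields = line[2:].split()
        if len(fields) < 2:
            continue
        media, port = fields[0], fields[1].split("/")[0]
        if port.isdigit() and int(port) != 0:
            active.add(media)
    return active


def is_video_call(sdp: str) -> bool:
    """Treat the request as a video call if it offers an active video stream."""
    return "video" in sdp_media_types(sdp)


# Hypothetical sample offers (addresses from the documentation range).
AUDIO_ONLY_OFFER = "\r\n".join([
    "v=0",
    "o=- 0 0 IN IP4 192.0.2.1",
    "s=-",
    "c=IN IP4 192.0.2.1",
    "t=0 0",
    "m=audio 49170 RTP/AVP 96",
    "a=rtpmap:96 AMR-WB/16000",
])
VIDEO_OFFER = AUDIO_ONLY_OFFER + "\r\n" + "\r\n".join([
    "m=video 51372 RTP/AVP 97",
    "a=rtpmap:97 H264/90000",
])

if __name__ == "__main__":
    print(is_video_call(AUDIO_ONLY_OFFER), is_video_call(VIDEO_OFFER))
```

An audio-only result would then lead to the negotiation branch rather than to a direct video connection.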

Step 102, receiving first audio information sent by the terminal.

Step 103, obtaining target response information according to the first audio information based on a target technology, wherein the target technology comprises at least one of the following: voice activity detection (VAD) technology, automatic speech recognition (ASR) technology, and natural language understanding (NLU) technology.

Based on the VAD technology, the network device can detect whether a speech signal is present in the current audio signal, that is, classify the input signal so as to distinguish speech from various background noise signals and process the two kinds of signal differently. Based on the ASR technology, the network device can convert the lexical content of human speech into computer-readable input such as key presses, binary codes or character sequences. Based on the NLU technology (commonly known as human-machine dialogue), the network device can use a computer to simulate the human language interaction process, so that the computer can understand and use the natural languages of human society (such as Chinese and English), thereby realizing natural-language communication between humans and machines.

Step 104, driving the digital person by using the target response information to obtain target video information, wherein the target video information comprises digital person image information and target voice information, and the target voice information is generated according to the target response information.

It should be noted that, after the network device establishes the VoLTE video call connection with the terminal, the network device may connect to the digital person (service), so as to drive the digital person with the target response information and obtain the target video information. The network device may use RTP (Real-time Transport Protocol) to obtain the real-time audio/video stream of the digital person (i.e., the target video information).
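For reference, the fixed 12-byte RTP header that frames each audio or video packet can be sketched as follows. This is only an illustration of the standard header layout (version 2, no padding, no extension, no CSRC entries); the payload type 96 and the other field values in the usage lines are arbitrary assumptions, and the embodiment does not specify how its RTP stack is implemented.

```python
import struct

RTP_VERSION = 2
RTP_HEADER_FORMAT = "!BBHII"  # network byte order: 1 + 1 + 2 + 4 + 4 = 12 bytes


def pack_rtp_header(payload_type: int, seq: int, timestamp: int,
                    ssrc: int, marker: bool = False) -> bytes:
    """Build a fixed RTP header with P=0, X=0 and CC=0."""
    byte0 = RTP_VERSION << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack(RTP_HEADER_FORMAT, byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(packet: bytes) -> dict:
    """Parse the first 12 bytes of an RTP packet into named fields."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack(RTP_HEADER_FORMAT, packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


if __name__ == "__main__":
    header = pack_rtp_header(payload_type=96, seq=1, timestamp=3000, ssrc=0x1234)
    print(unpack_rtp_header(header + b"payload"))
```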

Step 105, sending the target video information to the terminal.

Here, the network device may push the target video information to the terminal through the IMS using RTP.

Thus, after receiving the target video information, the terminal can play it, displaying the digital person image information and playing the target voice information. At this point, the user sees an animated (non-static) virtual digital person on the terminal interface, and the virtual digital person may perform micro-actions such as blinking and smiling.

It should be noted that VoLTE is a high-speed wireless communication standard for terminals such as mobile phones and data terminals (e.g., Internet of Things devices and wearable devices). It is based on the IMS network, which allows voice/video services (control plane and media plane) to be transmitted as data streams over the LTE (Long Term Evolution) data bearer network, without maintaining or relying on a traditional circuit-switched voice network. VoLTE supports video calls, provides higher-quality voice calls with lower delay and a lower call drop rate, and can integrate seamlessly with RCS (Rich Communication Suite) to offer richer services.

In this embodiment, after the network device establishes the VoLTE video call connection with the terminal, the network device may obtain the target response information from the first audio information sent by the terminal based on technologies such as VAD, ASR and NLU, drive the digital person with the target response information, and finally obtain the target video information including the digital person image information and send it to the terminal. In this way, a full-duplex interaction scheme between the terminal and the digital person is realized over VoLTE. Compared with Internet-based video interaction, VoLTE-based video interaction requires neither downloading an app nor using a browser, so that interaction between the terminal and the digital person is more convenient and faster.

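As an optional embodiment, the establishing a VoLTE video call connection with the terminal according to the judging result includes at least one of the following:

under the condition that the VoLTE call request is a video call, establishing a VoLTE video call connection with the terminal;

and under the condition that the VoLTE call request is an audio call, establishing a VoLTE video call connection with the terminal according to the support condition of the terminal for resource reservation (i.e., precondition).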

In this embodiment, if the terminal requests a video call, a VoLTE video call connection can be established with the terminal directly; if the terminal requests an audio call, the network device can proceed according to the terminal's support for the precondition (e.g., negotiate with the terminal), so as to establish a VoLTE video call connection with the terminal.

As an optional embodiment of the present invention, the establishing a VoLTE video call connection with the terminal according to the support condition of the terminal for resource reservation may specifically include:

negotiating with the terminal according to the support condition of the terminal for resource reservation;

and under the condition that negotiation is passed, establishing a VoLTE video call connection with the terminal.

In this embodiment, the network device may negotiate with the terminal through the IMS line. If the negotiation is passed, the network device and the terminal can establish a VoLTE video call connection; if the negotiation fails, an audio call connection can be established instead, so that the terminal can carry on an audio call.

Optionally, the negotiating with the terminal according to the support condition of the terminal for resource reservation includes at least one of the following:

as shown in FIG. 3, in a case where the terminal supports resource reservation, performing a quality of service (QoS) resource reservation (QoS Precondition) negotiation with the terminal using Session Initiation Protocol (SIP) signaling;

as shown in FIG. 4, in a case where the terminal does not support resource reservation, performing a modification session (re-INVITE) negotiation with the terminal using the SIP signaling.

In this embodiment, based on the QoS Precondition and SIP re-INVITE mechanisms, a VoLTE audio call can be upgraded to a video call, while direct video calls remain supported. Thus, even if the user places an audio call, the network device can negotiate with the terminal to upgrade it to a video call. This simplifies the user access flow, accommodates multiple incoming-call scenarios, improves the user experience, and raises the platform utilization rate.
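The choice between the two negotiation paths can be sketched as below. The embodiment does not state how the network device learns whether the terminal supports resource reservation; this sketch assumes it is inferred from the `precondition` option tag in the `Supported` (compact form `k`) or `Require` header of the incoming INVITE. The function names, the returned labels and the sample messages are hypothetical.

```python
from typing import List, Set


def header_values(sip_message: str, names: Set[str]) -> List[str]:
    """Collect comma-separated values of the given headers (case-insensitive)."""
    values = []
    for line in sip_message.splitlines():
        if not line.strip():
            break  # a blank line ends the header section
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() in names:
            values.extend(v.strip().lower() for v in value.split(",") if v.strip())
    return values


def supports_precondition(invite: str) -> bool:
    """Assumption: support is signalled by the 'precondition' option tag."""
    return "precondition" in header_values(invite, {"supported", "k", "require"})


def choose_upgrade_path(is_video_request: bool, invite: str) -> str:
    """Return a label for the branch the network device would take."""
    if is_video_request:
        return "establish_video"          # video requested: connect directly
    if supports_precondition(invite):
        return "qos_precondition"         # negotiation as in FIG. 3
    return "re_invite"                    # negotiation as in FIG. 4


INVITE_WITH_PRECONDITION = (
    "INVITE sip:service@example.com SIP/2.0\r\n"
    "Supported: 100rel, precondition\r\n"
    "Content-Length: 0\r\n\r\n"
)
INVITE_WITHOUT_PRECONDITION = (
    "INVITE sip:service@example.com SIP/2.0\r\n"
    "Supported: 100rel\r\n"
    "Content-Length: 0\r\n\r\n"
)

if __name__ == "__main__":
    print(choose_upgrade_path(False, INVITE_WITH_PRECONDITION))
    print(choose_upgrade_path(False, INVITE_WITHOUT_PRECONDITION))
```

If either negotiation fails, the fallback described above applies: the call simply continues as an audio call.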

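As an optional embodiment, the obtaining target response information according to the first audio information based on the target technology may specifically include: obtaining target voice information from the first audio information based on the VAD technology; obtaining voice recognition information according to the target voice information based on the ASR technology; and obtaining target response information according to the voice recognition information based on the NLU technology.

Here, the network device may obtain the target response information by calling the VAD service, the ASR service, the NLU service, and the like, respectively.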

It should be noted that, based on the VAD technology, the network device may detect and obtain the target voice information from the first audio information sent by the terminal, and may then call the ASR service and send the target voice information to it. According to the target voice information, the ASR service may output the start of a sentence, intermediate results and a final result to the network device. When the network device receives the start of a sentence, it may consider that the user's speech has been detected, that is, the network device has obtained voice recognition information and determines that the user has started to speak. An intermediate result is a partial recognition result of a sentence, such as "today" or "today's weather"; the final result is a complete sentence, such as "the weather is nice today".
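The three kinds of ASR output described above (sentence start, intermediate result, final result) can be modelled with a small event handler. This is a minimal sketch under the assumption that the ASR service delivers these outputs as discrete events; the class and event names are hypothetical and do not correspond to any particular ASR product.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AsrEventType(Enum):
    SENTENCE_START = "sentence_start"
    INTERMEDIATE = "intermediate"
    FINAL = "final"


@dataclass
class AsrEvent:
    type: AsrEventType
    text: str = ""


@dataclass
class UtteranceTracker:
    """Tracks whether the user is speaking and what has been recognized so far."""
    user_speaking: bool = False
    partial_text: str = ""
    final_sentences: List[str] = field(default_factory=list)

    def handle(self, event: AsrEvent) -> None:
        if event.type is AsrEventType.SENTENCE_START:
            # The start of a sentence is taken as "the user has started to speak".
            self.user_speaking = True
            self.partial_text = ""
        elif event.type is AsrEventType.INTERMEDIATE:
            self.partial_text = event.text
        elif event.type is AsrEventType.FINAL:
            self.user_speaking = False
            self.partial_text = ""
            self.final_sentences.append(event.text)


if __name__ == "__main__":
    tracker = UtteranceTracker()
    for ev in (AsrEvent(AsrEventType.SENTENCE_START),
               AsrEvent(AsrEventType.INTERMEDIATE, "today"),
               AsrEvent(AsrEventType.INTERMEDIATE, "today's weather"),
               AsrEvent(AsrEventType.FINAL, "the weather is nice today")):
        tracker.handle(ev)
    print(tracker.final_sentences)
```

Only the final sentence would normally be forwarded to the NLU service, while the sentence-start event is the earliest point at which the user's speech is known to have begun.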

For example, as shown in FIG. 2, a user establishes a VoLTE video call connection between the terminal and the network device through a VoLTE call request; the user then inputs voice through the terminal, and the network device drives the digital person to answer, so that interaction between the user and the digital person is realized. Specifically, suppose the user says "business consultation". The terminal acquires first audio information whose content is "business consultation" and sends it to the network device. The network device sends the voice stream (i.e., the first audio information) to the ASR service, that is, calls the ASR service, and obtains the final recognition result (i.e., the voice recognition information) "business consultation", which it sends to the NLU service (i.e., calls the NLU service). The NLU service outputs the corresponding text according to the scene, whereby the network device obtains the target response information. The network device can then call the digital human interface according to the target response information to update the audio and video stream, that is, drive the digital person, and finally obtain the target video information.

In the embodiment, based on the VAD technology, the ASR technology and the NLU technology, target response information for driving the digital person can be obtained, and then the digital person is driven to obtain target video information and sent to the terminal, so that a full duplex interaction mode between the terminal and the digital person is realized.
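The VAD, ASR and NLU chain can be sketched end to end as follows. This is a toy illustration only: the VAD is replaced by a simple energy threshold over 16-bit PCM frames, and the ASR and NLU services are passed in as placeholder callables, since the embodiment does not disclose their interfaces. The threshold value, the function names and the reply text are assumptions.

```python
import array
import math
from typing import Callable, List, Optional


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of 16-bit PCM samples."""
    samples = array.array("h")
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def vad_filter(frames: List[bytes], threshold: float = 500.0) -> List[bytes]:
    """Stand-in for VAD: keep only frames whose energy exceeds the threshold."""
    return [f for f in frames if frame_rms(f) >= threshold]


def respond(frames: List[bytes],
            asr: Callable[[bytes], str],
            nlu: Callable[[str], str],
            threshold: float = 500.0) -> Optional[str]:
    """First audio information -> target voice -> recognized text -> response."""
    speech = vad_filter(frames, threshold)
    if not speech:
        return None                      # no speech detected, nothing to answer
    text = asr(b"".join(speech))         # voice recognition information
    return nlu(text)                     # target response information


# Placeholder services for demonstration only.
def fake_asr(audio: bytes) -> str:
    return "business consultation"


def fake_nlu(text: str) -> str:
    replies = {"business consultation": "Which service would you like to ask about?"}
    return replies.get(text, "Sorry, could you repeat that?")


SILENT_FRAME = array.array("h", [0] * 160).tobytes()
LOUD_FRAME = array.array("h", [2000] * 160).tobytes()

if __name__ == "__main__":
    print(respond([SILENT_FRAME, LOUD_FRAME], fake_asr, fake_nlu))
```

The returned text would then be passed to the digital human interface to generate the target video information.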

Optionally, the method further comprises:

wherein the preset conditions include at least one of:

obtaining the target voice information;

the voice recognition information includes a preset character.

Here, the implementation of stopping driving the digital person may be to replace the audio stream of the digital person (service) with a mute stream, i.e. the digital person stops "speaking" waiting for the user's voice input.
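A minimal sketch of replacing the digital person's audio stream with a mute stream is given below. It assumes 16-bit linear PCM at 16 kHz in 20 ms frames before any codec is applied; these parameters and the names `silence_frame` and `DigitalPersonAudio` are assumptions for illustration and are not taken from the embodiment.

```python
from typing import Iterable, Iterator


def silence_frame(sample_rate: int = 16000, frame_ms: int = 20,
                  channels: int = 1, sample_width: int = 2) -> bytes:
    """Return one frame of digital silence (all-zero signed PCM samples)."""
    samples = sample_rate * frame_ms // 1000
    return b"\x00" * (samples * channels * sample_width)


class DigitalPersonAudio:
    """Audio source that can be switched to a mute stream at any time."""

    def __init__(self, speech_frames: Iterable[bytes]) -> None:
        self._frames: Iterator[bytes] = iter(speech_frames)
        self.muted = False

    def stop_driving(self) -> None:
        # From now on the outgoing audio is silence: the digital person
        # stops "speaking" and waits for the user's voice input.
        self.muted = True

    def next_frame(self) -> bytes:
        if self.muted:
            return silence_frame()
        # When the synthesized speech runs out, also fall back to silence.
        return next(self._frames, silence_frame())
```

Sending silence instead of closing the stream keeps the media flow continuous, so the call itself is not disturbed while the digital person waits.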

It should be noted that the existing digital person interaction mode is relatively mechanical and rigid. For example, before the interaction the user needs to wake up the digital person by voice, and during the interaction the user has to wait for the digital person to finish speaking before asking a question. As a result, the interaction between the user and the digital person is not natural enough and the user experience is poor.

In the embodiment of the invention, a full duplex interrupt function can be realized based on VoLTE, VAD and ASR technologies, so that the digital person can listen to the user and respond while speaking. That is, while the digital person is "speaking", if the user finds the reply too long or is not interested in its content, the user can interrupt the digital person by voice without waiting for it to finish. This addresses problems such as long user waiting time, slow response and low utilization of digital person resources during interaction, realizes real-time, bidirectional voice interaction between the user and the digital person, makes the experience closer to interacting with a human customer service agent, and improves the interaction experience between the user and the digital person.

Specifically, the interrupt function may be implemented in two ways:

Mode one, instant interruption: as shown in fig. 5, when the network device receives the target voice information, it can confirm that the user has started speaking; if it detects that the digital person is speaking at that time, it invokes the corresponding interface to interrupt the digital person service, i.e. stops driving the digital person. The moment at which the user starts speaking may be taken as the moment when the network device receives the target voice information or when the network device obtains the voice recognition information.

Mode two, semantic interruption: for example, when the network device receives the voice recognition information from the ASR service, it may send the voice recognition information to the NLU service for semantic matching. If the voice recognition information is found to include preset characters such as "wait a moment" or "wait", the semantic matching is confirmed to be successful; at this time, if it is detected that the digital person is speaking, the digital person service may be interrupted, that is, the driving of the digital person is stopped.

Here, it can be understood that whether the instant interruption mode or the semantic interruption mode is adopted can be configured according to the specific situation.
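The two interruption modes can be sketched as a single configurable decision function. The enum, the function name and the phrase list are hypothetical, and the simple substring check merely stands in for the semantic matching that the embodiment assigns to the NLU service.

```python
from enum import Enum


class InterruptMode(Enum):
    INSTANT = "instant"    # interrupt as soon as user speech is detected
    SEMANTIC = "semantic"  # interrupt only when a preset phrase is recognized


# Placeholder preset characters; a deployment would configure its own.
PRESET_PHRASES = ("wait a moment", "wait")


def should_interrupt(mode: InterruptMode,
                     digital_person_speaking: bool,
                     speech_detected: bool,
                     recognized_text: str = "") -> bool:
    """Decide whether to stop driving the digital person."""
    if not digital_person_speaking:
        # Nothing to interrupt.
        return False
    if mode is InterruptMode.INSTANT:
        return speech_detected
    # Semantic mode: substring matching as a stand-in for NLU matching.
    text = recognized_text.lower()
    return any(phrase in text for phrase in PRESET_PHRASES)
```

Instant interruption reacts fastest but may also be triggered by background speech, whereas semantic interruption waits for a recognition result and is therefore slower but more selective.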

In this embodiment, the interaction between the user and the digital person supports full duplex, and user interruption modes such as instant interruption and semantic interruption are supported, so that the waiting time perceived by the user during interaction with the digital person is reduced, the interaction flow is simplified, and the user experience is improved.

It should be noted that the existing digital person interaction form is relatively limited and is not easy to integrate with a service system: it mainly provides only the expression, action and audio of the digital person, so that collaboration with the service system is poor. In the embodiment of the invention, the first video information of the digital person can be adjusted based on preset media information, for example by adding subtitles or mixing audio, so that the interaction with the digital person is enriched and the interaction process is more realistic.

Here, adjusting the first video information based on the preset media information to obtain the target video information may specifically include:

carrying out YUV coding on the first video information to obtain first coding information;

carrying out YUV coding on preset media information to obtain second coding information;

and obtaining target video information according to the first coding information and the second coding information.

For example, as shown in fig. 6, if text (such as subtitles), pictures, video, etc. are to be embedded in the digital person picture, this may be implemented by an MCU (Multipoint Control Unit) module of the network device. The MCU may extract information and signaling such as audio, video and data, and then complete corresponding operations such as audio mixing or switching and video mixing or switching. Specifically, the network device may convert the digital person video stream (i.e., the first video information) into the YUV coding format to obtain the first coding information, convert the preset media information (such as preset text, pictures, video, etc.) into the YUV coding format to obtain the second coding information, combine the first coding information and the second coding information, and re-encode the result to form a new video stream (i.e., the target video information). The audio stream of the digital person can also be replaced or mixed according to the application scene.
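Combining two YUV pictures can be sketched as copying an overlay into a region of the base frame. The sketch below assumes the planar I420 layout (a full-resolution Y plane followed by U and V planes subsampled by two in each direction), which the embodiment does not specify; the names `I420Frame`, `blank` and `overlay` are hypothetical, and the final re-encoding into a video stream is out of scope.

```python
from dataclasses import dataclass


@dataclass
class I420Frame:
    width: int
    height: int
    y: bytearray  # width * height luma samples
    u: bytearray  # (width // 2) * (height // 2) chroma samples
    v: bytearray  # (width // 2) * (height // 2) chroma samples


def blank(width: int, height: int, y: int = 16, u: int = 128, v: int = 128) -> I420Frame:
    """Create a uniformly coloured frame (defaults to video-range black)."""
    chroma = (width // 2) * (height // 2)
    return I420Frame(width, height,
                     bytearray([y]) * (width * height),
                     bytearray([u]) * chroma,
                     bytearray([v]) * chroma)


def _blit(dst: bytearray, dst_w: int, src: bytearray,
          src_w: int, src_h: int, x: int, y: int) -> None:
    """Copy a src_w x src_h plane into dst at column x, row y."""
    for row in range(src_h):
        start = (y + row) * dst_w + x
        dst[start:start + src_w] = src[row * src_w:(row + 1) * src_w]


def overlay(base: I420Frame, patch: I420Frame, x: int, y: int) -> I420Frame:
    """Return a copy of base with patch pasted at (x, y), without blending."""
    if any(n % 2 for n in (x, y, patch.width, patch.height, base.width, base.height)):
        raise ValueError("I420 overlay needs even sizes and offsets")
    if x < 0 or y < 0 or x + patch.width > base.width or y + patch.height > base.height:
        raise ValueError("patch does not fit inside the base frame")
    out = I420Frame(base.width, base.height,
                    bytearray(base.y), bytearray(base.u), bytearray(base.v))
    _blit(out.y, base.width, patch.y, patch.width, patch.height, x, y)
    _blit(out.u, base.width // 2, patch.u, patch.width // 2, patch.height // 2, x // 2, y // 2)
    _blit(out.v, base.width // 2, patch.v, patch.width // 2, patch.height // 2, x // 2, y // 2)
    return out
```

Working on raw planes is what makes the combination simple: once both sources share one pixel layout, embedding a subtitle strip or a picture is a region copy, and only the merged frame needs to be encoded again.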

Therefore, based on MCU technology, characters, images, videos and the like can be embedded on the basis of digital human videos and used for various information display, the problem of business system integration is solved, the form of interaction with digital human is enriched, the information transmission efficiency is optimized, and the user experience is improved.

It can be understood that the method of adjusting the first video information based on the preset media information is not limited to the scene of interacting with a digital person; it is also applicable to a conversation with a human customer service agent. For example, based on the preset media information, the audio of the human customer service agent may be mixed, or subtitles may be added to the picture of the human customer service agent.

According to the interaction method of this embodiment, after the network device establishes the VoLTE video call connection with the terminal, it can obtain target response information from the received first audio information sent by the terminal based on technologies such as VAD, ASR and NLU, drive a digital person with the target response information, and finally obtain target video information including digital person image information and send it to the terminal. In this way, a full duplex interaction scheme between the terminal and the digital person based on the VoLTE communication mode is realized, and the digital person can be interrupted while speaking, which reduces the waiting time perceived by the user during interaction with the digital person, simplifies the interaction flow and improves the user experience. Compared with an Internet video call, a VoLTE video call requires neither downloading an APP nor using a browser, so the interaction between the terminal and the digital person is more convenient and quick.

As shown in fig. 7, an interaction method in an embodiment of the present invention is applied to a terminal, and includes:

step 701, establishing VoLTE video call connection with network equipment;

step 702, sending first audio information to the network device;

and step 703, receiving target video information sent by the network device, wherein the target video information comprises digital human image information and target voice information, the target voice information is generated according to target response information, and the target response information is generated according to the first audio information.

In this embodiment, after the terminal and the network device establish the VoLTE video call connection, the terminal may send the first audio information to the network device, and may then receive the target video information sent by the network device and display it. In this way, a full duplex interaction scheme between the terminal and the digital person based on the VoLTE communication mode is realized. Compared with an Internet video call, a VoLTE video call requires neither downloading an APP nor using a browser, so the interaction between the terminal and the digital person is more convenient and quick.

Optionally, the method further comprises:

and sending a VoLTE call request to the network equipment.

Here, the VoLTE call request may be a video call or an audio call, and after receiving the VoLTE call request, the network device may perform related judging and processing operations (for example, judging whether the VoLTE call request is a video call or negotiating with the terminal, etc.) according to the VoLTE call request, so as to establish a VoLTE video call connection with the terminal.
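One possible way to judge whether a call request is a video call is sketched below. It assumes the request carries an SDP body, as SIP call setup normally does, and checks for a video media line with a non-zero port (in SDP, a port of 0 marks a stream as not used). The function name and the sample bodies are illustrative; the embodiment does not state how the judgment is made.

```python
def offers_video(sdp: str) -> bool:
    """Return True if the SDP body contains an active video media line."""
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith("m=video "):
            continue
        fields = line.split()
        # fields[1] is "<port>" or "<port>/<number of ports>".
        if len(fields) >= 2 and fields[1].split("/")[0] != "0":
            return True
    return False


# Illustrative SDP bodies (addresses and payload types are made up).
AUDIO_ONLY = "v=0\r\nc=IN IP4 192.0.2.1\r\nm=audio 49170 RTP/AVP 96\r\n"
AUDIO_VIDEO = AUDIO_ONLY + "m=video 51372 RTP/AVP 99\r\n"
VIDEO_DISABLED = AUDIO_ONLY + "m=video 0 RTP/AVP 99\r\n"
```

With such a check, an audio-only request could lead the network device to negotiate an upgrade with the terminal, while a request that already offers video could be answered as a video call directly.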

Optionally, the method further comprises:

displaying a first operation interface according to the SIP signaling;

For example, the first operation interface may include an agree button and a reject button for the user to select. When the user clicks the agree button, the user agrees to upgrade the audio call to a video call; when the user clicks the reject button, the user refuses to upgrade the audio call to a video call, in which case the terminal remains in the audio call service.
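The effect of the user's choice on the first operation interface can be sketched as a small state transition. `CallMode` and `resolve_upgrade` are hypothetical names used only for illustration.

```python
from enum import Enum


class CallMode(Enum):
    AUDIO = "audio"
    VIDEO = "video"


def resolve_upgrade(current: CallMode, user_agrees: bool) -> CallMode:
    """Apply the user's answer to a request to upgrade the call to video."""
    if current is CallMode.AUDIO and user_agrees:
        return CallMode.VIDEO
    # A rejection (or a call that is already video) leaves the mode unchanged.
    return current
```

Rejecting the upgrade never drops the call; the terminal simply stays in the audio call service.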

According to the interaction method of this embodiment, after the VoLTE video call connection is established between the terminal and the network device, the terminal can collect the audio or video information of the user and send the first audio information to the network device accordingly; the terminal can then receive the target video information sent by the network device and display it to the user. In this way, a full duplex interaction scheme between the terminal and the digital person based on the VoLTE communication mode is realized. Compared with an Internet video call, a VoLTE video call requires neither downloading an APP nor using a browser, so the interaction between the terminal and the digital person is more convenient and quick.

As shown in fig. 8, an interaction device of an embodiment of the present invention is applied to a network device, and includes:

the first processing module 810 is configured to establish a Voice over Long Term Evolution (VoLTE) video call connection with the terminal;

a first receiving module 820, configured to receive first audio information sent by the terminal;

a second processing module 830, configured to obtain target response information according to the first audio information based on a target technology, where the target technology includes at least one of the following: a voice activity detection (VAD) technology, an automatic speech recognition (ASR) technology, and a natural language understanding (NLU) technology;

a third processing module 840, configured to drive a digital person with the target response information to obtain target video information, where the target video information includes digital person image information and target voice information, and the target voice information is generated according to the target response information;

and a first sending module 850, configured to send the target video information to the terminal.

In this embodiment, after the network device establishes the VoLTE video call connection with the terminal, the network device may obtain the target response information from the received first audio information sent by the terminal based on technologies such as VAD, ASR and NLU, drive the digital person with the target response information, and finally obtain the target video information including the digital person image information and send it to the terminal. In this way, a full duplex interaction scheme between the terminal and the digital person based on the VoLTE communication mode is realized. Compared with an Internet video call, a VoLTE video call requires neither downloading an APP nor using a browser, so the interaction between the terminal and the digital person is more convenient and quick.

Optionally, the first processing module includes:

Optionally, the sixth processing module includes:

Optionally, the second processing sub-module includes:

Optionally, the first processing unit includes:

Optionally, the second processing module includes:

Optionally, the apparatus further comprises:

wherein the preset conditions include at least one of:

obtaining the target voice information;

the voice recognition information includes a preset character.

Optionally, the third processing module includes:

It should be noted that, the interaction device provided in the embodiment of the present invention can implement all the method steps implemented in the embodiment of the interaction method applied to the network device, and can achieve the same technical effects, and detailed descriptions of the same parts and beneficial effects as those in the embodiment of the method are omitted herein.

As shown in fig. 9, an interaction device according to an embodiment of the present invention is applied to a terminal, and includes:

a fourth processing module 910, configured to establish a VoLTE video call connection with a network device;

a second sending module 920, configured to send the first audio information to the network device;

and a second receiving module 930, configured to receive target video information sent by the network device, where the target video information includes digital human image information and target voice information, the target voice information is generated according to target response information, and the target response information is generated according to the first audio information.

Optionally, the apparatus further comprises:

and the third sending module is used for sending a VoLTE call request to the network equipment.

Optionally, the apparatus further comprises:

It should be noted that, the interaction device provided in the embodiment of the present invention can implement all the method steps implemented in the embodiment of the interaction method applied to the terminal, and can achieve the same technical effects, and detailed descriptions of the same parts and beneficial effects as those in the embodiment of the method are omitted herein.

As shown in fig. 10, a network device 1000 according to an embodiment of the present invention includes a processor 1010 and a transceiver 1020, where the processor is configured to:

receiving first audio information sent by the terminal;

and sending the target video information to the terminal.

receiving a VoLTE call request sent by the terminal;

Optionally, the processor is further configured to:

wherein the preset conditions include at least one of:

obtaining the target voice information;

the voice recognition information includes a preset character.

It should be noted that the network device provided in the embodiment of the present invention can implement all the method steps implemented in the embodiment of the interaction method applied to the network device, and can achieve the same technical effects; detailed descriptions of the parts and beneficial effects that are the same as those in the method embodiment are omitted herein.

As shown in fig. 11, a terminal 1100 according to an embodiment of the present invention includes a processor 1110 and a transceiver 1120, where the processor is configured to:

establishing VoLTE video call connection with network equipment;

transmitting first audio information to the network device;

Optionally, the processor is further configured to:

and sending a VoLTE call request to the network equipment.

Optionally, the processor is further configured to:

displaying a first operation interface according to the SIP signaling;

It should be noted that the terminal provided in the embodiment of the present invention can implement all the method steps implemented in the embodiment of the interaction method applied to the terminal, and can achieve the same technical effects; detailed descriptions of the parts and beneficial effects that are the same as those in the method embodiment are omitted herein.

A network device according to another embodiment of the present invention, as shown in fig. 12, includes a transceiver 1210, a processor 1200, a memory 1220, and a program or instructions stored on the memory 1220 and executable on the processor 1200; the processor 1200, when executing the program or instructions, implements the interaction method described above as being applied to a network device.

The transceiver 1210 is configured to receive and transmit data under the control of the processor 1200.

In fig. 12, a bus architecture may comprise any number of interconnected buses and bridges, which link together various circuits, specifically one or more processors represented by the processor 1200 and a memory represented by the memory 1220. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The transceiver 1210 may be a plurality of elements, i.e. include a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 1200 is responsible for managing the bus architecture and general processing, and the memory 1220 may store data used by the processor 1200 in performing operations.

A terminal according to another embodiment of the present invention, as shown in fig. 13, includes a transceiver 1310, a processor 1300, a memory 1320, and a program or instructions stored on the memory 1320 and executable on the processor 1300; the processor 1300, when executing the program or instructions, implements the interaction method applied to the terminal described above.

The transceiver 1310 is configured to receive and transmit data under the control of the processor 1300.

In fig. 13, a bus architecture may comprise any number of interconnected buses and bridges, which link together various circuits, specifically one or more processors represented by the processor 1300 and a memory represented by the memory 1320. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and therefore will not be described further herein. The bus interface provides an interface. The transceiver 1310 may be a plurality of elements, i.e., include a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium. For different user devices, the user interface 1330 may also be an interface capable of connecting, externally or internally, to required devices; the connected devices include but are not limited to a keypad, a display, a speaker, a microphone, a joystick, etc.

The processor 1300 is responsible for managing the bus architecture and general processing, and the memory 1320 may store data used by the processor 1300 in performing operations.

The readable storage medium of the embodiment of the present invention stores a program or instructions which, when executed by a processor, implement the steps in the interaction method described above and can achieve the same technical effects; to avoid repetition, a detailed description is omitted herein. The readable storage medium may be, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

It is further noted that the terminals described in this specification include, but are not limited to, smartphones, tablets, etc., and that many of the functional components described are referred to as modules in order to more particularly emphasize their implementation independence.

In an embodiment of the invention, the modules may be implemented in software for execution by various types of processors. An identified module of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may, for instance, be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate instructions stored in different locations which, when joined logically together, comprise the module and achieve the stated purpose for the module.

Indeed, a module of executable code may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Likewise, operational data may be identified within modules and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices.

Where a module can be implemented in software, and taking into account the level of existing hardware technology, one skilled in the art can also, cost aside, build corresponding hardware circuitry to achieve the corresponding functions, the hardware circuitry including conventional very-large-scale integration (VLSI) circuits or gate arrays and existing semiconductors such as logic chips, transistors, or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like.

The exemplary embodiments described above are described with reference to the drawings. Many different forms and embodiments are possible without departing from the spirit and teachings of the present invention; therefore, the present invention should not be construed as limited to the exemplary embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will convey the scope of the invention to those skilled in the art. In the drawings, the size of the elements and relative sizes may be exaggerated for clarity. The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Unless otherwise indicated, a range of values includes the upper and lower limits of the range and any subranges therebetween.

While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims

1. An interaction method, applied to a network device, comprising:

receiving first audio information sent by the terminal;

and sending the target video information to the terminal.

2. The method of claim 1, wherein the establishing a Voice over Long Term Evolution (VoLTE) video call connection with the terminal comprises:

receiving a VoLTE call request sent by the terminal;

3. The method according to claim 2, wherein the establishing a VoLTE video call connection with the terminal according to the determination result includes at least one of:

4. A method according to claim 3, wherein said establishing a VoLTE video call connection with said terminal according to the support of resource reservation by said terminal comprises:

5. The method according to claim 4, wherein negotiating with the terminal according to the terminal's support for resource reservation comprises at least one of:

6. The method of claim 1, wherein the obtaining target response information based on the first audio information based on the target technology comprises:

7. The method as recited in claim 6, further comprising:

wherein the preset conditions include at least one of:

obtaining the target voice information;

the voice recognition information includes a preset character.

8. The method of claim 6, wherein driving the digital person with the target response information to obtain target video information comprises:

9. An interaction method, applied to a terminal, comprising:

establishing VoLTE video call connection with network equipment;

transmitting first audio information to the network device;

10. The method as recited in claim 9, further comprising:

displaying a first operation interface according to the SIP signaling;

11. An interaction device, applied to a network device, comprising:

12. An interaction device, applied to a terminal, comprising:

13. A network device, comprising: a transceiver and a processor; the processor is configured to:

receiving first audio information sent by the terminal;

and sending the target video information to the terminal.

14. A terminal, comprising: a transceiver and a processor; the processor is configured to:

establishing VoLTE video call connection with network equipment;

transmitting first audio information to the network device;

15. A readable storage medium having stored thereon a program or instructions, which when executed by a processor, implements the steps of the interaction method of any of claims 1 to 8 or the steps of the interaction method of any of claims 9 to 10.