WO2024012590A1 - Audio and video call method and device - Google Patents

Audio and video call method and device

Info

Publication number
WO2024012590A1
WO2024012590A1 PCT/CN2023/107721 CN2023107721W
Authority
WO
WIPO (PCT)
Prior art keywords
audio
video
stream
media server
calling
Prior art date
Application number
PCT/CN2023/107721
Other languages
English (en)
French (fr)
Inventor
魏学松 (Wei Xuesong)
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2024012590A1

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • H04N7/147Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals

Definitions

  • Embodiments of the present invention relate to the field of communications, and specifically to an audio and video calling method and device.
  • 5G New Calling is an upgrade of basic audio and video calls. On the basis of audio and video calls carried over Voice over LTE (VoLTE) or Voice over New Radio (VoNR), it can deliver a faster, clearer, smarter and broader call experience, supports real-time interaction during calls, and provides users with richer and more convenient call functions.
  • Embodiments of the present invention provide an audio and video calling method and device, so as to at least solve the problem in the related art that audio and video calls offer only a single function.
  • According to one embodiment, an audio and video call method is provided, including: after the audio and video call between the calling user and the called user is anchored to the media server, the artificial intelligence (AI) component receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user; the AI component recognizes specific content in the audio stream and/or the video stream, and an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user.
  • According to another embodiment, an audio and video call method is provided, including: after the audio and video call between the calling user and the called user is anchored to the media server, the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the artificial intelligence (AI) component; based on the AI component's recognition result for specific content in the audio stream and/or the video stream, the media server superimposes an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
  • According to another embodiment, an audio and video calling device is provided, including: a first receiving module which, after the audio and video call between the calling user and the called user is anchored to the media server, receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user; and an identification processing module which recognizes specific content in the audio stream and/or the video stream, so that an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user.
  • According to another embodiment, an audio and video calling device is provided, including: a copy-and-send module configured to, after the audio and video call between the calling user and the called user is anchored to the media server, copy the audio stream and video stream of the audio and video call between the calling user and the called user to the artificial intelligence (AI) component; and an overlay module configured to superimpose, based on the AI component's recognition result for specific content in the audio stream and/or the video stream, an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
  • According to yet another embodiment, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to execute the steps in any of the above method embodiments when run.
  • According to yet another embodiment, an electronic device is also provided, including a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
  • Figure 1 is a block diagram of the hardware structure of a mobile terminal for an audio and video calling method according to an embodiment of the present invention;
  • Figure 2 is a flow chart of an audio and video calling method according to an embodiment of the present invention.
  • Figure 3 is a flow chart of an audio and video calling method according to an embodiment of the present invention.
  • Figure 4 is a flow chart of an audio and video calling method according to an embodiment of the present invention.
  • Figure 5 is a flow chart of an audio and video calling method according to an embodiment of the present invention.
  • Figure 6 is a flow chart of an audio and video calling method according to an embodiment of the present invention.
  • Figure 7 is a flow chart of animation effect overlay according to an embodiment of the present invention.
  • Figure 8 is a structural block diagram of an audio and video calling device according to an embodiment of the present invention.
  • Figure 9 is a structural block diagram of an audio and video calling device according to an embodiment of the present invention.
  • FIG. 10 is a structural block diagram of an identification processing module according to an embodiment of the present invention.
  • FIG. 11 is a structural block diagram of an identification processing module according to an embodiment of the present invention.
  • Figure 12 is a structural block diagram of an audio and video calling device according to an embodiment of the present invention.
  • Figure 13 is a structural block diagram of an audio and video calling device according to an embodiment of the present invention.
  • Figure 14 is a structural block diagram of an audio and video calling device according to an embodiment of the present invention.
  • Figure 15 is a structural block diagram of a superposition module according to an embodiment of the present invention.
  • Figure 16 is a schematic flow chart of user video call anchoring according to a scenario embodiment of the present invention.
  • Figure 17 is a schematic flowchart of AI component recognition and animation effect overlay according to a scenario embodiment of the present invention.
  • FIG. 1 is a block diagram of the hardware structure of a mobile terminal for an audio and video calling method according to an embodiment of the present invention.
  • The mobile terminal may include one or more processors 102 (only one is shown in Figure 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data; the mobile terminal may also include a transmission device 106 for communication functions and an input and output device 108.
  • the structure shown in Figure 1 is only illustrative, and it does not limit the structure of the above-mentioned mobile terminal.
  • the mobile terminal may also include more or fewer components than shown in FIG. 1 , or have a different configuration than shown in FIG. 1 .
  • The memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the audio and video calling method in the embodiment of the present invention. By running the computer program stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method.
  • Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include memory located remotely relative to the processor 102, and these remote memories may be connected to the mobile terminal through a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
  • Transmission device 106 is used to receive or send data via a network.
  • Specific examples of the above-mentioned network may include a wireless network provided by a communication provider of the mobile terminal.
  • the transmission device 106 includes a network adapter (Network Interface Controller, NIC for short), which can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 106 may be a radio frequency (Radio Frequency, RF for short) module configured to communicate with the Internet wirelessly.
  • FIG. 2 is a flow chart of the audio and video call method according to an embodiment of the present invention. As shown in Figure 2, the process includes the following steps:
  • Step S202: After the audio and video call between the calling user and the called user is anchored to the media server, the artificial intelligence AI component receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user;
  • Step S204 The AI component identifies specific content in the audio stream and/or video stream, and superimposes animations corresponding to the specific content on the audio and video call between the calling user and the called user through the media server.
  • Through the above steps, after the audio and video call between the calling user and the called user is anchored to the media server, the AI component receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user; the AI component recognizes specific content in the audio stream and/or video stream, and an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user. This solves the problem in the related art that audio and video calls offer only a single function, and improves the fun and intelligence of audio and video calls.
  • the execution subject of the above steps may be a base station, a terminal, etc., but is not limited thereto.
  • In an exemplary embodiment, before the AI component receives the audio stream and video stream of the audio and video call between the calling user and the called user copied by the media server, the method further includes: the AI component receiving a negotiation request from the media server; and the AI component returning to the media server the receiving end's uniform resource locator URL address and port information.
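As an illustration of this negotiation step, the sketch below shows one possible shape for the AI component's answer. The patent does not specify a wire protocol, so a JSON-style exchange is assumed purely for illustration; every field name, host name, port and codec list here is an invented example.

    # Hypothetical AI-component side of the stream-copy negotiation.
    # Protocol, field names, host and ports are all assumptions.
    def handle_negotiation(request: dict) -> dict:
        """Answer the media server's negotiation request with the
        receiving end's URL address and port information."""
        supported = {"PCMU", "PCMA", "H264"}  # assumed codec support
        chosen = sorted(supported & set(request.get("codecs", [])))
        if not chosen:
            return {"status": "error", "reason": "no common codec"}
        return {
            "status": "ok",
            "recv_url": "rtp://ai.example.com",  # receiving-end URL
            "audio_port": 40000,                 # illustrative ports
            "video_port": 40002,
            "codecs": chosen,
        }

    print(handle_negotiation({"ip": "10.0.0.5", "codecs": ["PCMU", "H264"]}))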
  • Figure 3 is a flow chart of an audio and video calling method according to an embodiment of the present invention. As shown in Figure 3, the process includes the following steps:
  • Step S302 the AI component negotiates with the media server the port information and media information used to receive audio streams and video streams;
  • Step S304 the AI component returns the uniform resource locator URL address and port information used to receive audio streams and video streams to the media server;
  • Step S306: After the audio and video call between the calling user and the called user is anchored to the media server, the artificial intelligence AI component receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user;
  • Step S308 The AI component identifies specific content in the audio stream and/or video stream, and superimposes animations corresponding to the specific content on the audio and video call between the calling user and the called user through the media server.
  • The AI component identifies specific content in the audio stream and/or video stream, including: the AI component transcribes the audio stream into text and sends the text to the business application, so that the business application can identify keywords in the text and query the animation effect corresponding to the keywords.
  • The AI component identifies specific content in the audio stream and/or video stream, further including: the AI component identifies specific actions in the video stream and sends the recognition result to the business application, so that the business application can query the animation effect corresponding to the specific action.
  • The animation effect includes at least one of the following: a static picture or a dynamic video.
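Both the keyword path and the gesture path end in a lookup from recognized content to a user-configured animation. A minimal sketch of that lookup is given below; the table contents, labels and URLs are invented examples, not part of the patent.

    # Hypothetical mapping from recognized keywords/gestures to the
    # animation the user configured (static picture or dynamic video).
    ANIMATIONS = {
        "happy birthday": "https://cdn.example.com/effects/cake.webm",
        "thank you":      "https://cdn.example.com/effects/hearts.webm",
        "like":           "https://cdn.example.com/effects/thumbs_up.png",
        "heart_gesture":  "https://cdn.example.com/effects/hearts.webm",
    }

    def find_animations(transcript: str) -> list[str]:
        """Return animation URLs for every configured keyword in the text."""
        text = transcript.lower()
        return [url for kw, url in ANIMATIONS.items() if kw in text]

    print(find_animations("Thank you, and happy birthday!"))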
  • FIG. 4 is a flow chart of an audio and video calling method according to an embodiment of the present invention. As shown in Figure 4, the process includes the following steps:
  • Step S402: After the audio and video call between the calling user and the called user is anchored to the media server, the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the artificial intelligence AI component;
  • Step S404 The media server superimposes an animation corresponding to the specific content on the audio and video call between the calling user and the called user based on the recognition result of the specific content in the audio stream and/or video stream by the AI component.
  • In an exemplary embodiment, before the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the AI component, the method further includes: the media server allocating media resources to the calling user and the called user respectively according to the application of the call platform, so that the call platform can re-anchor the calling user and the called user to the media server respectively according to the applied media resources of the calling user and the called user.
  • FIG. 5 is a flow chart of an audio and video calling method according to an embodiment of the present invention. As shown in Figure 5, the process includes the following steps:
  • Step S502 The media server allocates media resources to the calling user and the called user respectively according to the call platform's application;
  • Step S504: After the audio and video call between the calling user and the called user is anchored to the media server, the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the artificial intelligence AI component;
  • Step S506 The media server superimposes an animation corresponding to the specific content on the audio and video call between the calling user and the called user based on the recognition result of the specific content in the audio stream and/or video stream by the AI component.
  • In an exemplary embodiment, before the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the AI component, the method further includes: the media server receives a request instruction issued by the business application to copy the audio stream and video stream to the AI component, where the request instruction carries the audio stream ID, the video stream ID and the URL address of the AI component; the media server negotiates with the AI component the port information and media information for receiving the copied audio stream and video stream; and the media server receives the URL address and port information, returned by the AI component, for receiving the copied audio stream and video stream.
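A sketch of what such a copy-request instruction could carry is shown below. The patent names only the three fields (audio stream ID, video stream ID, AI component URL); the concrete representation and all values are assumptions.

    import json

    # Hypothetical payload of the business application's request to the
    # media server; only the three field meanings come from the patent.
    copy_request = {
        "audio_stream_id": "a-7f3c",   # taken from the anchoring result
        "video_stream_id": "v-9b21",
        "ai_url": "https://ai.example.com/ingest",
    }
    print(json.dumps(copy_request, indent=2))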
  • FIG. 6 is a flow chart of an audio and video calling method according to an embodiment of the present invention. As shown in Figure 6, the process includes the following steps:
  • Step S602 The media server receives a request instruction issued by the business application to copy the audio stream and video stream to the AI component, where the request instruction carries the audio stream ID, video stream ID, and URL address of the AI component;
  • Step S604 the media server negotiates with the AI component the port information and media information used to receive audio streams and video streams;
  • Step S606 The media server receives the URL address and port information returned by the AI component for receiving audio streams and video streams;
  • Step S608 the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the artificial intelligence AI component;
  • Step S610 The media server superimposes an animation corresponding to the specific content on the audio and video call between the calling user and the called user based on the recognition result of the specific content in the audio stream and/or video stream by the AI component.
  • In an exemplary embodiment, the media server superimposing, based on the AI component's recognition result of specific content in the audio stream and/or video stream, an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user includes: the media server receives media processing instructions from the business application, where the media processing instructions are generated based on the AI component's recognition result of the specific content in the audio stream and/or video stream; and the media server superimposes the animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
  • Figure 7 is a flow chart of dynamic effect superposition according to an embodiment of the present invention. As shown in Figure 7, the process includes the following steps:
  • Step S702 The media server receives the media processing instruction from the business application and obtains the animation effect according to the URL of the animation effect carried in the media processing instruction;
  • Step S704: The media server encodes and synthesizes the animation effect with the audio stream and/or video stream, and delivers the encoded and synthesized audio stream and video stream to the calling user and the called user.
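One common way to realize this decode/overlay/encode step is an ffmpeg filter graph. The sketch below illustrates that approach under the assumption that the media server shells out to the ffmpeg CLI, which the patent does not specify; file names are placeholders.

    import subprocess

    def overlay_effect(call_video: str, effect: str, out: str) -> None:
        """Decode call_video, overlay the effect centered on top of it,
        re-encode the result, and leave the audio stream untouched."""
        subprocess.run([
            "ffmpeg", "-y",
            "-i", call_video,
            "-i", effect,
            "-filter_complex",
            "[0:v][1:v]overlay=(W-w)/2:(H-h)/2:shortest=1",  # centered overlay
            "-c:a", "copy",
            out,
        ], check=True)

    # The composited stream is delivered downlink to both parties, so the
    # calling and called user see the same result.
    overlay_effect("call_leg.mp4", "cake.webm", "call_leg_with_effect.mp4")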
  • Through the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; it can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the various embodiments of the present invention.
  • According to another embodiment of the present invention, an audio and video calling device is provided. The device is used to implement the above embodiments and preferred implementations; what has already been described will not be repeated. As used below, the terms "module" and "unit" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • Figure 8 is a structural block diagram of an audio and video calling device according to an embodiment of the present invention. As shown in Figure 8, the audio and video calling device 80 includes: a first receiving module 810 which, after the audio and video call between the calling user and the called user is anchored to the media server, receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user; and an identification processing module 820 which recognizes specific content in the audio stream and/or video stream, so that an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user.
  • Figure 9 is a structural block diagram of an audio and video calling device according to an embodiment of the present invention. As shown in Figure 9, in addition to the modules shown in Figure 8, the audio and video calling device 90 also includes: a first negotiation module 910, configured to negotiate with the media server the port information and media information for receiving the audio stream and video stream; and a return module 920, configured to return to the media server the uniform resource locator URL address and port information for receiving the audio stream and video stream.
  • Figure 10 is a structural block diagram of a recognition processing module according to an embodiment of the present invention. As shown in Figure 10, the recognition processing module 820 includes: an audio processing unit 1010, configured to transcribe the audio stream into text and send the text to the business application, so that the business application can identify keywords in the text and query the animation effect corresponding to the keywords.
  • FIG. 11 is a structural block diagram of a recognition processing module according to an embodiment of the present invention. As shown in FIG. 11, in addition to the units shown in FIG. 10, the recognition processing module 820 also includes: a video processing unit 1110, configured to identify specific actions in the video stream and send the recognition results to the business application, so that the business application can query the animation effect corresponding to the specific action.
  • an audio and video calling device is also provided.
  • Figure 12 is a structural block diagram of an audio and video calling device according to an embodiment of the present invention.
  • As shown in Figure 12, the audio and video calling device 120 includes: a copy-and-send module 1210, configured to, after the audio and video call between the calling user and the called user is anchored to the media server, copy the audio stream and video stream of the audio and video call between the calling user and the called user to the artificial intelligence AI component; and an overlay module 1220, configured to superimpose, based on the AI component's recognition result of specific content in the audio stream and/or video stream, an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
  • Figure 13 is a structural block diagram of an audio and video calling device according to an embodiment of the present invention.
  • As shown in Figure 13, in addition to the modules shown in Figure 12, the audio and video calling device 130 also includes: a resource allocation module 1310, configured to allocate media resources to the calling user and the called user respectively according to the application of the call platform, so that the call platform can re-anchor the calling user and the called user to the media server respectively according to the applied media resources of the calling user and the called user.
  • Figure 14 is a structural block diagram of an audio and video calling device according to an embodiment of the present invention.
  • the audio and video calling device 140 also includes:
  • the second receiving module 1410 is configured to receive a request instruction issued by the business application to copy the audio stream and video stream to the AI component, where the request instruction carries the audio stream ID, the video stream ID, and the URL address of the AI component;
  • the second negotiation module 1420 is configured to negotiate with the AI component the port information and media information used to receive audio streams and video streams;
  • the third receiving module 1430 is configured to receive the URL address and port information, returned by the AI component, for receiving the audio stream and video stream.
  • Figure 15 is a structural block diagram of an overlay module according to an embodiment of the present invention. As shown in Figure 15, the overlay module 1220 includes: a receiving unit 1510, configured to receive media processing instructions from the business application and obtain the animation effect according to the URL of the animation effect carried in the media processing instruction; and an overlay unit 1520, configured to encode and synthesize the animation effect with the audio stream and/or video stream, and deliver the encoded and synthesized audio stream and video stream to the calling user and the called user.
  • each of the above modules and units can be implemented through software or hardware.
  • For the latter, this can be achieved in the following ways, but is not limited thereto: the above modules and units are all located in the same processor; or the above modules and units are located in different processors in any combination.
  • Embodiments of the present invention also provide a computer-readable storage medium that stores a computer program, wherein the computer program is configured to execute the steps in any of the above method embodiments when running.
  • the computer-readable storage medium may include but is not limited to: U disk, read-only memory (Read-Only Memory, referred to as ROM), random access memory (Random Access Memory, referred to as RAM) , mobile hard disk, magnetic disk or optical disk and other media that can store computer programs.
  • An embodiment of the present invention also provides an electronic device, including a memory and a processor.
  • a computer program is stored in the memory, and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
  • the above-mentioned electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the above-mentioned processor, and the input-output device is connected to the above-mentioned processor.
  • Embodiments of the present invention are mainly based on VoLTE video calls and require automatic recognition of the audio and video, including voice recognition and video action recognition; after recognition, the recognition results are returned. The business performs video processing based on the returned recognition results, mainly decoding, video overlay processing and encoding, and finally presents animation effects to both users during the video call.
  • The specific implementation is described as follows: the user initiates a native VoLTE video call using a mobile phone terminal, or initiates a voice call and then switches to a video call. The user must have subscribed to the new-calling enhanced call service function; otherwise this function cannot be used.
  • First, the call parties need to be re-anchored: the audio and video of both parties are re-anchored to the media server. The purpose of renegotiating with the call parties and re-anchoring them to the media server is to control the media streams of both parties; generally, anchoring of the calling and called parties to the media plane can start after the called user answers.
  • After anchoring, the media server copies the audio and video streams of the subscribed user to the AI component, and the AI component performs recognition on the audio and video. For audio, the AI component mainly transcribes the speech into text and sends it to the business application, which identifies the keywords; for video, the AI component mainly performs intelligent recognition of the video and identifies specific content.
  • When a keyword in the user's audio or a specific action in the video is recognized: for audio recognition, the AI component returns the transcribed text to the business application, and the business application performs keyword recognition; for video recognition, the AI component performs the recognition directly and sends the recognition result to the business application. Finally, according to the user's settings, the application finds the corresponding special effect configured by the user and instructs the media server to perform media processing of the video.
  • After receiving the instruction, the media server obtains the corresponding animation effect, downloads it locally, and then performs the video media processing function to superimpose the corresponding animation effect onto the videos of both parties.
  • FIG 16 is a schematic diagram of a user video call anchoring process according to a scenario embodiment of the present invention. As shown in Figure 16, the process includes the following steps:
  • Step 1602: The call starts, and call events are reported normally to the service application, such as call origination, ringing, answer and answer-interruption events; the service needs to indicate the next operation.
  • Step 1604 After the call is answered, the service authenticates the user, finds that the enhanced call service has been subscribed, and issues a media renegotiation control command.
  • Step 1606: After receiving the media anchoring instruction, the new-calling platform, which implements business function control and logic control, begins to anchor the called party. It first applies for called-side media resources; after the application, it uses the applied media resources to initiate a reinvite media renegotiation toward the called party, obtains the called-side media resources, returns them to the media server, and then adds the called terminal to the conference (in this scenario embodiment, anchoring is implemented through a conference), thereby completing the audio and video anchoring function on the called side. After anchoring is completed, the parameters of each stream are returned to the anchoring initiator, such as the local stream with its audio stream ID, video stream ID and sending/receiving direction, and the remote stream with its audio stream ID, video stream ID and sending/receiving direction.
  • Step 1608: After completing the anchoring of the called party, the platform likewise applies to the media server for calling-side media resources. After the application, it initiates an update media renegotiation on the calling side, carrying the newly applied media resources to the calling side; the calling side returns its own media resources, which are also added to the conference. In this way, the media resources of both the calling and called sides are added to the media server's conference, realizing the media anchoring function for the calling and called parties.
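To make the ordering of steps 1606 and 1608 easier to follow, the stub sketch below walks through the conference-based anchoring sequence. Every class and method here is a placeholder standing in for the platform and media-server signaling described above, not a real API.

    # Stubs standing in for the new-calling platform and the media server.
    class MediaServer:
        def allocate_media(self, leg): return {"leg": leg, "port": 30000}
        def join_conference(self, leg, sdp): print(leg, "joined:", sdp)
        def stream_params(self):
            # per-stream ids and directions returned to the initiator
            return {"audio_stream_id": "a-1", "video_stream_id": "v-1"}

    class CallPlatform:
        def reinvite(self, leg, res): return {"sdp": f"answer-from-{leg}"}
        def update(self, leg, res): return {"sdp": f"answer-from-{leg}"}

    def anchor_call(platform, server, caller, callee):
        # Step 1606: called side first - allocate, reinvite, join conference.
        callee_sdp = platform.reinvite(callee, server.allocate_media(callee))
        server.join_conference(callee, callee_sdp)
        # Step 1608: calling side - allocate, update, join conference.
        caller_sdp = platform.update(caller, server.allocate_media(caller))
        server.join_conference(caller, caller_sdp)
        return server.stream_params()

    print(anchor_call(CallPlatform(), MediaServer(), "caller", "callee"))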
  • FIG 17 is a schematic flowchart of AI component recognition and animation effect overlay according to a scenario embodiment of the present invention. As shown in Figure 17, the process includes the following steps:
  • Step 1702: After completing the anchoring of the called and calling parties, the business side, that is, the business application, applies to the AI component for an access address, and at the same time requests the AI component to perform the intelligent voice transcription function and video recognition operations, including speech-to-text and video gesture recognition; in its response, the AI returns the Uniform Resource Locator (URL) for subsequent negotiation.
  • Step 1704: The business application issues an audio and video stream copy request instruction to the media server: the audio stream is copied to the corresponding audio recognition AI component platform, and the video stream is copied to the corresponding AI video recognition component platform. The carried parameters mainly include the ID of the audio stream to be copied, the ID of the video stream, and the URL of the requested AI component.
  • Step 1706: The media server receives the stream copy instruction and negotiates with the AI component the specific stream-copy port and media information, including the copy destination IP, port and stream codec type. After receiving the media server's negotiation request, the AI processes it and finally responds with information such as the address and port of the corresponding copy receiving end. After the negotiation is completed, the media server starts stream copying to the AI component platform, and at the same time returns the copy result to the business application.
  • Step 1708: After receiving the copied streams, the AI component platform turns on the AI intelligent recognition functions, including transcribing the audio into text and recognizing the user-specified gestures in the video; after the audio is transcribed into text, the text is returned directly together with the URL address.
  • Step 1710: During video recognition, if the AI component recognizes the corresponding key information, it immediately reports it to the business application. For audio content, the AI component returns the transcribed text, and the business application identifies the keywords. To identify keywords, the business application first saves all of the user's transcribed text, and then starts keyword recognition each time newly added text is received; if a keyword is recognized, the post-recognition process is performed.
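A minimal sketch of this accumulate-and-rescan behavior is given below; the keyword set and the class are illustrative only, not part of the patent.

    # Business-application side: keep all transcribed text and rescan on
    # each new fragment (keywords here are invented examples).
    KEYWORDS = {"happy birthday", "thank you", "like"}

    class TranscriptWatcher:
        def __init__(self) -> None:
            self.text = ""

        def on_new_text(self, fragment: str) -> set[str]:
            """Append the fragment and return any keywords now present."""
            self.text += " " + fragment.lower()
            return {kw for kw in KEYWORDS if kw in self.text}

    w = TranscriptWatcher()
    print(w.on_new_text("happy"))            # set() - nothing yet
    print(w.on_new_text("birthday to you"))  # keyword spans fragments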
  • Step 1712: When a keyword is recognized, whether it is a keyword recognized by the business application itself or a gesture recognized by the AI, the business application queries the corresponding animation effect set by the user based on the recognized information; it may be a static picture or a dynamic short video.
  • Step 1714: The business application issues a media processing instruction to the media server, in which the animation effect is sent as the URL address of the animation resource. After receiving the media processing instruction, the media server first obtains the corresponding animation effect according to its URL; the effect may be cached locally, and if it is not available locally it is fetched via the URL.
  • Step 1716: The media server performs media processing: it decodes the video on the server, performs encoding and synthesis processing on the user video stream, encodes the video after synthesis, and then delivers the video. The synthesized video must be composited on both downlink directions, toward the calling user and the called user, so that both the calling and called parties see the same video processing result.
  • In summary, the audio and video calling method and device mainly include speech recognition, intelligent video recognition and video processing, and specifically include two main functions: a voice-to-animation function and a gesture-to-animation function.
  • Voice-to-animation: during a call, if the user says certain keywords, such as "happy birthday", "thank you" or "like", the system side performs speech recognition; if a keyword is recognized, it is reported to the business side, which instructs the media server to display the specific animation effect or picture, for example animations such as cakes, hearts or fireworks shown in both directions of the video.
  • Gesture-to-animation: in a video call, the user's gestures are automatically recognized; for example, if the user makes a heart gesture, the AI component recognizes the predefined key action and then superimposes the corresponding animation, such as a heart or thumbs-up picture or animation, on the video of both parties.
  • Based on VoLTE calls, the embodiments of the present invention provide a server-based audio and video enhancement function for audio and video calls. It does not need to rely on a client APP or SDK support; as long as the user's terminal supports native VoLTE video calls, more interesting call functions can be provided. Automatic voice recognition and automatic video recognition are realized on the server side, and after recognition animation effects are superimposed, which greatly enhances the fun of audio and video calls and improves the user experience, making users' calls more interesting and smarter. This greatly improves the user's operating experience, makes voice calls more intelligent, and is very conducive to the promotion and application of 5G new calling services.
  • Each module or step of the above embodiments of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices; they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be executed in a sequence different from that described here, or they may be fabricated separately into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. As such, the invention is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present invention provide an audio and video call method and device. After the audio and video call between the calling user and the called user is anchored to a media server, an AI component receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user; the AI component recognizes specific content in the audio stream and/or video stream, and an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user. This solves the problem in the related art that audio and video calls offer only a single function, and improves the fun and intelligence of audio and video calls.

Description

Audio and video call method and device
Technical Field
Embodiments of the present invention relate to the field of communications, and in particular to an audio and video call method and device.
Background
5G New Calling is an upgrade of basic audio and video calls. On the basis of audio and video calls carried over Voice over LTE (VoLTE) or Voice over New Radio (VoNR), it can deliver a faster, clearer, smarter and broader call experience, supports real-time interaction during calls, and provides users with richer and more convenient call functions.
Traditional audio and video calls offer only the call function itself and cannot carry additional intelligent features. With the promotion of 5G video services, more and more people are trying video calling, but current video calls mostly provide only basic functionality, without extra or intelligent features. Although some client apps have tried to introduce fun features such as virtual backgrounds and virtual avatars, such features are rarely available in ordinary voice calls, and these implementations are all based on a client app that the user must install, which greatly hinders the promotion of the service.
Summary
Embodiments of the present invention provide an audio and video call method and device, so as to at least solve the problem in the related art that audio and video calls offer only a single function.
According to one embodiment of the present invention, an audio and video call method is provided, including: after an audio and video call between a calling user and a called user is anchored to a media server, an artificial intelligence (AI) component receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user; the AI component recognizes specific content in the audio stream and/or the video stream, and an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user.
According to another embodiment of the present invention, an audio and video call method is further provided, including: after an audio and video call between a calling user and a called user is anchored to a media server, the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to an artificial intelligence (AI) component; based on the AI component's recognition result for specific content in the audio stream and/or the video stream, the media server superimposes an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
According to another embodiment of the present invention, an audio and video call device is further provided, including: a first receiving module which, after an audio and video call between a calling user and a called user is anchored to a media server, receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user; and a recognition processing module which recognizes specific content in the audio stream and/or the video stream, so that an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user.
According to another embodiment of the present invention, an audio and video call device is further provided, including: a copy-and-send module configured to, after an audio and video call between a calling user and a called user is anchored to a media server, copy the audio stream and video stream of the audio and video call between the calling user and the called user to an artificial intelligence (AI) component; and an overlay module configured to superimpose, based on the AI component's recognition result for specific content in the audio stream and/or the video stream, an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
According to yet another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, wherein the computer program is configured to execute the steps in any of the above method embodiments when run.
According to yet another embodiment of the present invention, an electronic device is further provided, including a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program so as to execute the steps in any of the above method embodiments.
Brief Description of the Drawings
Figure 1 is a block diagram of the hardware structure of a mobile terminal for an audio and video call method according to an embodiment of the present invention;
Figure 2 is a flowchart of an audio and video call method according to an embodiment of the present invention;
Figure 3 is a flowchart of an audio and video call method according to an embodiment of the present invention;
Figure 4 is a flowchart of an audio and video call method according to an embodiment of the present invention;
Figure 5 is a flowchart of an audio and video call method according to an embodiment of the present invention;
Figure 6 is a flowchart of an audio and video call method according to an embodiment of the present invention;
Figure 7 is a flowchart of animation effect overlay according to an embodiment of the present invention;
Figure 8 is a structural block diagram of an audio and video call device according to an embodiment of the present invention;
Figure 9 is a structural block diagram of an audio and video call device according to an embodiment of the present invention;
Figure 10 is a structural block diagram of a recognition processing module according to an embodiment of the present invention;
Figure 11 is a structural block diagram of a recognition processing module according to an embodiment of the present invention;
Figure 12 is a structural block diagram of an audio and video call device according to an embodiment of the present invention;
Figure 13 is a structural block diagram of an audio and video call device according to an embodiment of the present invention;
Figure 14 is a structural block diagram of an audio and video call device according to an embodiment of the present invention;
Figure 15 is a structural block diagram of an overlay module according to an embodiment of the present invention;
Figure 16 is a schematic flowchart of user video call anchoring according to a scenario embodiment of the present invention;
Figure 17 is a schematic flowchart of AI component recognition and animation effect overlay according to a scenario embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings and in combination with the embodiments.
It should be noted that the terms "first", "second" and the like in the specification and claims of the present invention and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.
The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal or a similar computing device. Taking execution on a mobile terminal as an example, Figure 1 is a block diagram of the hardware structure of a mobile terminal for an audio and video call method according to an embodiment of the present invention. As shown in Figure 1, the mobile terminal may include one or more processors 102 (only one is shown in Figure 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data; the mobile terminal may also include a transmission device 106 for communication functions and an input/output device 108. A person of ordinary skill in the art will understand that the structure shown in Figure 1 is merely illustrative and does not limit the structure of the mobile terminal; for example, the mobile terminal may include more or fewer components than shown in Figure 1, or have a configuration different from that shown in Figure 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the audio and video call method in the embodiment of the present invention. By running the computer program stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and such remote memory may be connected to the mobile terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of such a network may include a wireless network provided by the communication provider of the mobile terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module configured to communicate with the Internet wirelessly.
This embodiment provides an audio and video call method running on the above mobile terminal. Figure 2 is a flowchart of an audio and video call method according to an embodiment of the present invention. As shown in Figure 2, the flow includes the following steps:
Step S202: after the audio and video call between the calling user and the called user is anchored to the media server, the artificial intelligence (AI) component receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user;
Step S204: the AI component recognizes specific content in the audio stream and/or video stream, and an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user.
Through the above steps, after the audio and video call between the calling user and the called user is anchored to the media server, the AI component receives the audio stream and video stream of the call copied by the media server; the AI component recognizes specific content in the audio stream and/or video stream, and an animation effect corresponding to the specific content is superimposed on the call through the media server. This solves the problem in the related art that audio and video calls offer only a single function, and improves the fun and intelligence of audio and video calls.
The execution subject of the above steps may be a base station, a terminal, or the like, but is not limited thereto.
In an exemplary embodiment, before the AI component receives the audio stream and video stream of the audio and video call between the calling user and the called user copied by the media server, the method further includes: the AI component receiving a negotiation request from the media server; and the AI component returning to the media server the receiving end's uniform resource locator (URL) address and port information. Figure 3 is a flowchart of an audio and video call method according to an embodiment of the present invention. As shown in Figure 3, the flow includes the following steps:
Step S302: the AI component negotiates with the media server the port information and media information for receiving the audio stream and video stream;
Step S304: the AI component returns to the media server the uniform resource locator (URL) address and port information for receiving the audio stream and video stream;
Step S306: after the audio and video call between the calling user and the called user is anchored to the media server, the artificial intelligence (AI) component receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user;
Step S308: the AI component recognizes specific content in the audio stream and/or video stream, and an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user.
In an exemplary embodiment, the AI component recognizing specific content in the audio stream and/or video stream includes: the AI component transcribing the audio stream into text and sending the text to the business application, so that the business application can recognize keywords in the text and query the animation effect corresponding to the keywords.
In an exemplary embodiment, the AI component recognizing specific content in the audio stream and/or video stream further includes: the AI component recognizing specific actions in the video stream and sending the recognition result to the business application, so that the business application can query the animation effect corresponding to the specific action.
In an exemplary embodiment, the animation effect includes at least one of the following: a static picture or a dynamic video.
In yet another embodiment of the present invention, an audio and video call method is provided. Figure 4 is a flowchart of an audio and video call method according to an embodiment of the present invention. As shown in Figure 4, the flow includes the following steps:
Step S402: after the audio and video call between the calling user and the called user is anchored to the media server, the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the artificial intelligence (AI) component;
Step S404: based on the AI component's recognition result for specific content in the audio stream and/or video stream, the media server superimposes an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
In an exemplary embodiment, before the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the AI component, the method further includes: the media server allocating media resources to the calling user and the called user respectively according to the application of the call platform, so that the call platform can re-anchor the calling user and the called user to the media server respectively according to the applied media resources of the calling user and the called user.
Figure 5 is a flowchart of an audio and video call method according to an embodiment of the present invention. As shown in Figure 5, the flow includes the following steps:
Step S502: the media server allocates media resources to the calling user and the called user respectively according to the application of the call platform;
Step S504: after the audio and video call between the calling user and the called user is anchored to the media server, the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the artificial intelligence (AI) component;
Step S506: based on the AI component's recognition result for specific content in the audio stream and/or video stream, the media server superimposes an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
In an exemplary embodiment, before the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the AI component, the method further includes: the media server receiving a request instruction issued by the business application to copy the audio stream and video stream to the AI component, where the request instruction carries the audio stream ID, the video stream ID and the URL address of the AI component; the media server negotiating with the AI component the port information and media information for receiving the copied audio stream and video stream; and the media server receiving the URL address and port information, returned by the AI component, for receiving the copied audio stream and video stream.
Figure 6 is a flowchart of an audio and video call method according to an embodiment of the present invention. As shown in Figure 6, the flow includes the following steps:
Step S602: the media server receives a request instruction issued by the business application to copy the audio stream and video stream to the AI component, where the request instruction carries the audio stream ID, the video stream ID and the URL address of the AI component;
Step S604: the media server negotiates with the AI component the port information and media information for receiving the audio stream and video stream;
Step S606: the media server receives the URL address and port information, returned by the AI component, for receiving the audio stream and video stream;
Step S608: the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the artificial intelligence (AI) component;
Step S610: based on the AI component's recognition result for specific content in the audio stream and/or video stream, the media server superimposes an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
In an exemplary embodiment, the media server superimposing, based on the AI component's recognition result for specific content in the audio stream and/or video stream, an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user includes: the media server receiving a media processing instruction from the business application, where the media processing instruction is generated according to the AI component's recognition result for the specific content in the audio stream and/or video stream; and the media server superimposing the animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
Figure 7 is a flowchart of animation effect overlay according to an embodiment of the present invention. As shown in Figure 7, the flow includes the following steps:
Step S702: the media server receives the media processing instruction from the business application and obtains the animation effect according to the URL of the animation effect carried in the media processing instruction;
Step S704: the media server encodes and synthesizes the animation effect with the audio stream and/or video stream, and delivers the encoded and synthesized audio stream and video stream to the calling user and the called user.
Through the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; it can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the various embodiments of the present invention.
According to another embodiment of the present invention, an audio and video call device is provided. The device is used to implement the above embodiments and preferred implementations; what has already been described will not be repeated. As used below, the terms "module" and "unit" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Figure 8 is a structural block diagram of an audio and video call device according to an embodiment of the present invention. As shown in Figure 8, the audio and video call device 80 includes: a first receiving module 810 which, after the audio and video call between the calling user and the called user is anchored to the media server, receives the audio stream and video stream, copied by the media server, of the audio and video call between the calling user and the called user; and a recognition processing module 820 which recognizes specific content in the audio stream and/or video stream, so that an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user.
In an exemplary embodiment, Figure 9 is a structural block diagram of an audio and video call device according to an embodiment of the present invention. As shown in Figure 9, in addition to the modules shown in Figure 8, the audio and video call device 90 also includes: a first negotiation module 910, configured to negotiate with the media server the port information and media information for receiving the audio stream and video stream; and a return module 920, configured to return to the media server the uniform resource locator (URL) address and port information for receiving the audio stream and video stream.
In an exemplary embodiment, Figure 10 is a structural block diagram of a recognition processing module according to an embodiment of the present invention. As shown in Figure 10, the recognition processing module 820 includes: an audio processing unit 1010, configured to transcribe the audio stream into text and send the text to the business application, so that the business application can recognize keywords in the text and query the animation effect corresponding to the keywords.
In an exemplary embodiment, Figure 11 is a structural block diagram of a recognition processing module according to an embodiment of the present invention. As shown in Figure 11, in addition to the units shown in Figure 10, the recognition processing module 820 also includes: a video processing unit 1110, configured to recognize specific actions in the video stream and send the recognition result to the business application, so that the business application can query the animation effect corresponding to the specific action.
According to yet another embodiment of the present invention, an audio and video call device is further provided. Figure 12 is a structural block diagram of an audio and video call device according to an embodiment of the present invention. As shown in Figure 12, the audio and video call device 120 includes: a copy-and-send module 1210, configured to, after the audio and video call between the calling user and the called user is anchored to the media server, copy the audio stream and video stream of the audio and video call between the calling user and the called user to the AI component; and an overlay module 1220, configured to superimpose, based on the AI component's recognition result for specific content in the audio stream and/or video stream, an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
In an exemplary embodiment, Figure 13 is a structural block diagram of an audio and video call device according to an embodiment of the present invention. As shown in Figure 13, in addition to the modules in Figure 12, the audio and video call device 130 also includes: a resource allocation module 1310, configured to allocate media resources to the calling user and the called user respectively according to the application of the call platform, so that the call platform can re-anchor the calling user and the called user to the media server respectively according to the applied media resources of the calling user and the called user.
In an exemplary embodiment, Figure 14 is a structural block diagram of an audio and video call device according to an embodiment of the present invention. As shown in Figure 14, in addition to the modules in Figure 13, the audio and video call device 140 also includes: a second receiving module 1410, configured to receive a request instruction issued by the business application to copy the audio stream and video stream to the AI component, where the request instruction carries the audio stream ID, the video stream ID and the URL address of the AI component; a second negotiation module 1420, configured to negotiate with the AI component the port information and media information for receiving the audio stream and video stream; and a third receiving module 1430, configured to receive the URL address and port information, returned by the AI component, for receiving the audio stream and video stream.
In an exemplary embodiment, Figure 15 is a structural block diagram of an overlay module according to an embodiment of the present invention. As shown in Figure 15, the overlay module 1220 includes: a receiving unit 1510, configured to receive media processing instructions from the business application and obtain the animation effect according to the URL of the animation effect carried in the media processing instruction; and an overlay unit 1520, configured to encode and synthesize the animation effect with the audio stream and/or video stream, and deliver the encoded and synthesized audio stream and video stream to the calling user and the called user.
It should be noted that each of the above modules and units can be implemented through software or hardware. For the latter, this can be achieved in the following ways, but is not limited thereto: the above modules and units are all located in the same processor; or the above modules and units are located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium in which a computer program is stored, wherein the computer program is configured to execute the steps in any of the above method embodiments when run.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store a computer program.
Embodiments of the present invention also provide an electronic device including a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps in any of the above method embodiments.
In an exemplary embodiment, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary implementations; details are not repeated here.
In order to enable those skilled in the art to better understand the technical solution of the present invention, the following description is given with reference to specific scenario embodiments.
The embodiments of the present invention are mainly based on VoLTE video calls and require automatic recognition of the audio and video, including voice recognition and video action recognition; after recognition, the recognition results are returned, and the business performs video processing based on the returned results, mainly decoding, video overlay processing and encoding, finally presenting animation effects to both users during the video call. The specific implementation is described as follows:
The user initiates a native VoLTE video call with a mobile phone terminal, or initiates a voice call and then switches to a video call. The user must have subscribed to the new-calling enhanced call service function; otherwise this function cannot be used.
First, the call parties need to be re-anchored: the audio and video of both parties are re-anchored to the media server. The purpose of renegotiating with both parties and re-anchoring them to the media server is to control the media streams of both parties; generally, anchoring of the calling and called parties to the media plane can start after the called user answers.
After anchoring, the user's audio and video flows need to be controlled again. The media server copies the audio and video streams of the subscribed user to the AI component, and the AI component performs recognition on the audio and video. For audio, the AI component mainly transcribes the speech into text and sends it to the business application, which recognizes the keywords; for video, the AI component mainly performs intelligent recognition of the video and identifies specific content.
When a keyword in the user's audio or a specific action in the video is recognized: for audio recognition, the AI component returns the transcribed text to the business application, and the business application performs keyword recognition; for video recognition, the AI component performs the recognition directly and sends the recognition result to the business application. Finally, according to the user's settings, the application finds the corresponding special effect configured by the user and instructs the media server to perform media processing of the video.
After receiving the instruction, the media server obtains the corresponding animation effect, downloads it locally, and then performs the video media processing function to superimpose the corresponding animation effect onto the videos of both parties.
Figure 16 is a schematic flowchart of user video call anchoring according to a scenario embodiment of the present invention. As shown in Figure 16, the flow includes the following steps:
Step 1602: the call starts, and call events are reported normally to the business application, such as call origination, ringing, answer and answer-interruption events; the business needs to indicate the next operation.
Step 1604: after the call is answered, the business authenticates the user, finds that the enhanced call service has been subscribed to, and issues a media renegotiation control command.
Step 1606: after receiving the media anchoring instruction, the new-calling platform, which implements business function control and logic control, begins to anchor the called party. It first applies for called-side media resources; after the application, it uses the applied media resources to initiate a reinvite media renegotiation toward the called party, obtains the called-side media resources, returns them to the media server, and then adds the called terminal to the conference (in this scenario embodiment, anchoring is implemented through a conference), thereby completing the audio and video anchoring function on the called side. After anchoring is completed, the parameters of each stream are returned to the anchoring initiator, such as the local stream with its audio stream ID, video stream ID and sending/receiving direction, and the remote stream with its audio stream ID, video stream ID and sending/receiving direction.
Step 1608: after completing the anchoring of the called party, the platform likewise applies to the media server for calling-side media resources. After the application, it initiates an update media renegotiation on the calling side, carrying the newly applied media resources to the calling side; the calling side returns its own media resources, which are also added to the conference. In this way, the media resources of both the calling and called sides are added to the media server's conference, realizing the media anchoring function for the calling and called parties.
Figure 17 is a schematic flowchart of AI component recognition and animation effect overlay according to a scenario embodiment of the present invention. As shown in Figure 17, the flow includes the following steps:
Step 1702: after the anchoring of the called and calling parties is completed, the business side, that is, the business application, applies to the AI component for an access address, and at the same time requests the AI component to perform the intelligent voice transcription function and video recognition operations, including speech-to-text and video gesture recognition; in its response, the AI returns the Uniform Resource Locator (URL) for subsequent negotiation.
Step 1704: the business application issues an audio and video stream copy request instruction to the media server: the audio stream is copied to the corresponding audio recognition AI component platform, and the video stream is copied to the corresponding AI video recognition component platform. The carried parameters mainly include the ID of the audio stream to be copied, the ID of the video stream, and the URL of the requested AI component.
Step 1706: the media server receives the stream copy instruction and negotiates with the AI component the specific stream-copy port and media information, including the copy destination IP, port and stream codec type. After receiving the media server's negotiation request, the AI processes it and finally responds with information such as the address and port of the corresponding copy receiving end. After the negotiation is completed, the media server starts stream copying to the AI component platform, and at the same time returns the copy result to the business application.
Step 1708: after receiving the copied streams, the AI component platform turns on the AI intelligent recognition functions, including transcribing the audio into text and recognizing the user-specified gestures in the video; after the audio is transcribed into text, the text is returned directly together with the URL address.
Step 1710: during video recognition, if the AI component recognizes the corresponding key information, it immediately reports it to the business application. For audio content, the AI component returns the transcribed text, and the business application recognizes the keywords. To recognize keywords, the business application first saves all of the user's transcribed text, and then starts keyword recognition each time newly added text is received; if a keyword is recognized, the post-recognition process is performed.
Step 1712: when a keyword is recognized, whether it is a keyword recognized by the business application itself or a gesture recognized by the AI, the business application queries the corresponding animation effect set by the user based on the recognized information; it may be a static picture or a dynamic short video.
Step 1714: the business application issues a media processing instruction to the media server, in which the animation effect is sent as the URL address of the animation resource. After receiving the media processing instruction, the media server first obtains the corresponding animation effect according to its URL; the effect may be cached locally, and if it is not available locally it is fetched via the URL.
Step 1716: the media server performs media processing: it decodes the video on the server, performs encoding and synthesis processing on the user video stream, encodes the video after synthesis, and then delivers the video. The synthesized video must be composited on both downlink directions, toward the calling user and the called user, so that both the calling and called parties see the same video processing result.
In summary, the audio and video call method and device provided by the embodiments of the present invention mainly include speech recognition, intelligent video recognition and video processing, and specifically include two main functions: a voice-to-animation function and a gesture-to-animation function.
Voice-to-animation: during a call, if the user says certain keywords, such as "happy birthday", "thank you" or "like", the system side performs speech recognition; if a keyword is recognized, it is reported to the business side, which instructs the media server to display the specific animation effect or picture, for example animations such as cakes, hearts or fireworks shown in both directions of the video.
Gesture-to-animation: in a video call, the user's gestures are automatically recognized; for example, if the user makes a heart gesture, the AI component recognizes the predefined key action and then superimposes the corresponding animation, such as a heart or thumbs-up picture or animation, on the video of both parties.
Based on VoLTE calls, the embodiments of the present invention provide a server-based audio and video enhancement function for audio and video calls. It does not need to rely on a client APP or SDK support; as long as the user's terminal supports native VoLTE video calls, more interesting call functions can be provided. Automatic voice recognition and automatic video recognition are realized on the server side, and after recognition certain animation effects are superimposed, which greatly enhances the fun of audio and video calls and improves the user experience, making users' calls more interesting and smarter. This greatly improves the user's operating experience, makes voice calls more intelligent, and is very conducive to the promotion and application of 5G new calling services.
Obviously, those skilled in the art should understand that the modules or steps of the above embodiments of the present invention may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network composed of multiple computing devices; they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; and in some cases, the steps shown or described may be executed in a sequence different from that described here, or they may be made into individual integrated circuit modules separately, or multiple modules or steps among them may be made into a single integrated circuit module. In this way, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the principles of the present invention shall be included within the protection scope of the present invention.

Claims (19)

  1. An audio and video call method, comprising:
    after an audio and video call between a calling user and a called user is anchored to a media server, receiving, by an artificial intelligence (AI) component, an audio stream and a video stream, copied by the media server, of the audio and video call between the calling user and the called user;
    recognizing, by the AI component, specific content in the audio stream and/or the video stream, and superimposing, through the media server, an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
  2. The method according to claim 1, wherein before the AI component receives the audio stream and video stream of the audio and video call between the calling user and the called user copied by the media server, the method further comprises:
    negotiating, by the AI component with the media server, port information and media information for receiving the audio stream and the video stream;
    returning, by the AI component to the media server, a uniform resource locator (URL) address and port information for receiving the audio stream and the video stream.
  3. The method according to claim 1, wherein the recognizing, by the AI component, of specific content in the audio stream and/or the video stream comprises:
    transcribing, by the AI component, the audio stream into text, and sending the text to a business application, so that the business application recognizes keywords in the text and queries an animation effect corresponding to the keywords.
  4. The method according to claim 1, wherein the recognizing, by the AI component, of specific content in the audio stream and/or the video stream further comprises:
    recognizing, by the AI component, a specific action in the video stream, and sending a recognition result to a business application, so that the business application queries an animation effect corresponding to the specific action.
  5. The method according to claim 1, wherein the animation effect comprises at least one of the following: a static picture or a dynamic video.
  6. An audio and video call method, comprising:
    after an audio and video call between a calling user and a called user is anchored to a media server, copying, by the media server, an audio stream and a video stream of the audio and video call between the calling user and the called user to an artificial intelligence (AI) component;
    superimposing, by the media server according to a recognition result of the AI component for specific content in the audio stream and/or the video stream, an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
  7. The method according to claim 6, wherein before the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the AI component, the method further comprises:
    allocating, by the media server according to an application from a call platform, media resources to the calling user and the called user respectively, so that the call platform re-anchors the calling user and the called user to the media server respectively according to the applied media resources of the calling user and the called user.
  8. The method according to claim 6, wherein before the media server copies the audio stream and video stream of the audio and video call between the calling user and the called user to the AI component, the method further comprises:
    receiving, by the media server, a request instruction issued by a business application to copy the audio stream and the video stream to the AI component, wherein the request instruction carries an ID of the audio stream, an ID of the video stream and a URL address of the AI component;
    negotiating, by the media server with the AI component, port information and media information for receiving the audio stream and the video stream;
    receiving, by the media server, a URL address and port information returned by the AI component for receiving the audio stream and the video stream.
  9. The method according to claim 6, wherein the superimposing, by the media server, of the animation effect corresponding to the specific content on the audio and video call between the calling user and the called user comprises:
    receiving, by the media server, a media processing instruction from a business application, and obtaining the animation effect according to a URL of the animation effect carried in the media processing instruction;
    encoding and synthesizing, by the media server, the animation effect with the audio stream and/or the video stream, and delivering the encoded and synthesized audio stream and video stream to the calling user and the called user.
  10. An audio and video call device, comprising:
    a first receiving module which, after an audio and video call between a calling user and a called user is anchored to a media server, receives an audio stream and a video stream, copied by the media server, of the audio and video call between the calling user and the called user;
    a recognition processing module which recognizes specific content in the audio stream and/or the video stream, so that an animation effect corresponding to the specific content is superimposed, through the media server, on the audio and video call between the calling user and the called user.
  11. The device according to claim 10, further comprising:
    a first negotiation module configured to negotiate with the media server port information and media information for receiving the audio stream and the video stream;
    a return module configured to return to the media server a uniform resource locator (URL) address and port information for receiving the audio stream and the video stream.
  12. The device according to claim 10, wherein the recognition processing module comprises:
    an audio processing unit configured to transcribe the audio stream into text and send the text to a business application, so that the business application recognizes keywords in the text and queries an animation effect corresponding to the keywords.
  13. The device according to claim 10, wherein the recognition processing module further comprises:
    a video processing unit configured to recognize a specific action in the video stream and send a recognition result to a business application, so that the business application queries an animation effect corresponding to the specific action.
  14. An audio and video call device, comprising:
    a copy-and-send module configured to, after an audio and video call between a calling user and a called user is anchored to a media server, copy an audio stream and a video stream of the audio and video call between the calling user and the called user to an artificial intelligence (AI) component;
    an overlay module configured to superimpose, according to a recognition result of the AI component for specific content in the audio stream and/or the video stream, an animation effect corresponding to the specific content on the audio and video call between the calling user and the called user.
  15. The device according to claim 14, further comprising:
    a resource allocation module configured to allocate media resources to the calling user and the called user respectively according to an application from a call platform, so that the call platform re-anchors the calling user and the called user to the media server respectively according to the applied media resources of the calling user and the called user.
  16. The device according to claim 14, further comprising:
    a second receiving module configured to receive a request instruction issued by a business application to copy the audio stream and the video stream to the AI component, wherein the request instruction carries an ID of the audio stream, an ID of the video stream and a URL address of the AI component;
    a second negotiation module configured to negotiate with the AI component port information and media information for receiving the audio stream and the video stream;
    a third receiving module configured to receive a URL address and port information returned by the AI component for receiving the audio stream and the video stream.
  17. The device according to claim 14, wherein the overlay module comprises:
    a receiving unit configured to receive a media processing instruction from a business application, and obtain the animation effect according to a URL of the animation effect carried in the media processing instruction;
    an overlay unit configured to encode and synthesize the animation effect with the audio stream and/or the video stream, and deliver the encoded and synthesized audio stream and video stream to the calling user and the called user.
  18. A computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 9.
  19. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor implements the method according to any one of claims 1 to 9 when executing the computer program.
PCT/CN2023/107721 2022-07-15 2023-07-17 Audio and video call method and device WO2024012590A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210840292.4A CN117440123A (zh) 2022-07-15 2022-07-15 Audio and video call method and device
CN202210840292.4 2022-07-15

Publications (1)

Publication Number Publication Date
WO2024012590A1 (zh)

Family

ID=89535671

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/107721 WO2024012590A1 (zh) 2022-07-15 2023-07-17 音视频呼叫方法及装置

Country Status (2)

Country Link
CN (1) CN117440123A (zh)
WO (1) WO2024012590A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117978788B (zh) * 2024-04-01 2024-06-11 中电科东方通信集团有限公司 Digital human video outbound calling system, method and device based on 5G new calling

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100134588A1 (en) * 2008-12-01 2010-06-03 Samsung Electronics Co., Ltd. Method and apparatus for providing animation effect on video telephony call
CN104902212A * 2015-04-30 2015-09-09 努比亚技术有限公司 Video communication method and apparatus
CN108304753A * 2017-01-24 2018-07-20 腾讯科技(深圳)有限公司 Video communication method and video communication apparatus
CN110650306A * 2019-09-03 2020-01-03 平安科技(深圳)有限公司 Method and apparatus for adding emoticons in video chat, computer device and storage medium
CN114449200A * 2020-10-30 2022-05-06 华为技术有限公司 Audio and video call method, apparatus and terminal device


Also Published As

Publication number Publication date
CN117440123A (zh) 2024-01-23

Similar Documents

Publication Publication Date Title
US9955205B2 (en) Method and system for improving interactive media response systems using visual cues
US8687016B2 (en) Method and system for enhancing the quality of video prompts in an interactive media response system
US9602553B2 (en) Method, apparatus, and system for implementing VOIP call in cloud computing environment
CN111803940B Game processing method and apparatus, electronic device, and computer-readable storage medium
WO2022193595A1 Object playback method and apparatus
WO2024012590A1 Audio and video call method and device
WO2015021650A1 Media stream transmission method, apparatus and system
KR20080038251A Method for signaling a device to perform no synchronization for multimedia streams or to include a synchronization delay
CN110290140B Multimedia data processing method and apparatus, storage medium, and electronic device
WO2020151660A1 STB cloudification method and system, thin STB, virtual STB, platform, and storage medium
WO2019218478A1 Call service response method and device
WO2023160361A1 RTC data processing method and apparatus
CN110113298A Data transmission method, apparatus, signaling server, and computer-readable medium
US20220417813A1 (en) Methods and apparatus for application service relocation for multimedia edge services
EP2219330A1 (en) Method for enhancing service, proxy server and communications system
WO2022203891A1 (en) Method and system for integrating video content in a video conference session
CN114567704A Interaction method applied to calls and related apparatus
WO2022068674A1 Video call method, electronic device and system
CN112738026B Command and dispatch method
WO2022048255A1 Data processing method and system, cloud terminal, server, and computing device
CN113098931B Information sharing method and multimedia session terminal
CN111343407A One-way video communication method and apparatus, electronic device, and storage medium
US20150286633A1 (en) Generation, at least in part, of at least one service request, and/or response to such request
CN118264650A Browser real-time communication method, system, device and storage medium
CN115801740A Audio stream data processing method and apparatus, cloud server, and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23839076

Country of ref document: EP

Kind code of ref document: A1