CN116684652A - Audio and video pulling method and device, storage medium and computer equipment - Google Patents

Audio and video pulling method and device, storage medium and computer equipment

Info

Publication number
CN116684652A
CN116684652A (application number CN202310585715.7A)
Authority
CN
China
Prior art keywords
audio
video stream
video
real
communication interface
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310585715.7A
Other languages
Chinese (zh)
Inventor
李道维
黄惠敬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Cubesili Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Cubesili Information Technology Co Ltd filed Critical Guangzhou Cubesili Information Technology Co Ltd
Priority to CN202310585715.7A priority Critical patent/CN116684652A/en
Publication of CN116684652A publication Critical patent/CN116684652A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21: Server components or server architectures
    • H04N21/218: Source of audio or video content, e.g. local disk arrays
    • H04N21/2187: Live feed
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60: Network streaming of media packets
    • H04L65/65: Network streaming protocols, e.g. real-time transport protocol [RTP] or real-time control protocol [RTCP]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/50: Network services
    • H04L67/56: Provisioning of proxy services
    • H04L67/566: Grouping or aggregating service requests, e.g. for unified processing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16: Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • H04L69/161: Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/238: Interfacing the downstream path of the transmission network, e.g. adapting the transmission rate of a video stream to network bandwidth; Processing of multiplex streams
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/239: Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60: Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
    • H04N21/63: Control signaling related to video distribution between client, server and network components; Network processes for video distribution between server and clients or between remote clients, e.g. transmitting basic layer and enhancement layers over different transmission paths, setting up a peer-to-peer communication via Internet between remote STB's; Communication protocols; Addressing
    • H04N21/643: Communication protocols
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Abstract

Embodiments of the present application disclose an audio and video pulling method, an audio and video pulling device, a storage medium, and computer equipment. In these embodiments, a low-delay play instruction is received; in response to the low-delay play instruction, a first real-time communication interface is called to subscribe to the corresponding target audio/video stream, the first real-time communication interface communicating using the user datagram protocol; and the target audio/video stream is pulled from a media server and played based on a first preset transmission strategy of the first real-time communication interface. By integrating the real-time communication interface into the terminal and acquiring the audio/video stream from the media server over a proprietary RTC protocol, the delay in playing the audio/video stream is reduced and the user's live-viewing experience is optimized.

Description

Audio and video pulling method and device, storage medium and computer equipment
Technical Field
The application relates to the field of network live broadcast, in particular to an audio and video pulling method, an audio and video pulling device, a storage medium and computer equipment.
Background
With the continuous development of the live streaming industry, demand for interaction during live broadcasts has grown steadily. In an interactive live-room scenario, the server decodes the interactive video streams of the anchor terminals, mixes their pictures and audio, and re-encodes them to output a new video stream. The audience terminal then pulls the picture-mixed, audio-mixed video stream through its player, realizing audio and video playback in the live interactive scenario.
The main factors affecting the network live-viewing experience are first-screen time, delay, audio-picture synchronization, and fluency. At the same time, current network live broadcasting increasingly emphasizes interaction between the anchor and the audience, and delay is the factor that most affects interaction.
Players at the audience end generally pull streams from a CDN over the HTTP-FLV or RTMP protocol, with a delay from the anchor end to the audience end (hereinafter, end-to-end) typically between 2 and 5 seconds; some event broadcasts may pull streams from the CDN over the HLS protocol for fluency, in which case the end-to-end delay can reach 5 to 10 seconds. All of these protocols transmit data over TCP, and their delay is difficult to optimize further over public-network transmission.
Disclosure of Invention
The embodiment of the application provides an audio and video pulling method, an audio and video pulling device, a storage medium and computer equipment, which are used for reducing time delay during live audio and video playing and optimizing user experience.
In order to solve the technical problems, the embodiment of the application provides the following technical scheme:
the audio and video pulling method is applied to a first terminal, wherein the first terminal comprises a first real-time communication interface, and the audio and video pulling method comprises the following steps:
receiving a low-delay playing instruction;
responding to the low-delay playing instruction, calling the first real-time communication interface to subscribe to a corresponding target audio/video stream, wherein the first real-time communication interface adopts the user datagram protocol to realize communication;
and pulling the target audio and video stream from a media server to play based on a first preset transmission strategy of the first real-time communication interface.
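The three claimed steps above can be sketched as a minimal client-side flow. Everything here (the `RtcDownlink` class, method names, the instruction format) is an illustrative assumption, not the patent's actual implementation:

```python
# Hypothetical sketch of the three-step pull flow: receive instruction,
# subscribe via the RTC downlink interface, then pull and play.
class RtcDownlink:
    """Stand-in for the first real-time communication interface (RTC downlink)."""

    def subscribe(self, stream_info):
        # In the described scheme this would open a UDP session to the media server.
        self.subscribed = stream_info
        return True


def pull_audio_video(instruction, rtc):
    # Step 1: receive the low-delay play instruction.
    stream_info = instruction["stream"]
    # Step 2: call the RTC interface to subscribe to the target stream (UDP-based).
    rtc.subscribe(stream_info)
    # Step 3: pull the stream under the interface's preset transmission strategy.
    return f"pulling {stream_info} over UDP"


rtc = RtcDownlink()
result = pull_audio_video({"stream": "room-42"}, rtc)
```

The sketch only fixes the control flow; the actual transmission strategy and the proprietary RTC protocol are not specified in detail by the source.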
An audio and video pulling device is applied to a first terminal, wherein the first terminal comprises a first real-time communication interface, and the audio and video pulling device comprises:
the first receiving unit is used for receiving the low-delay playing instruction;
the first subscription unit is used for responding to the low-delay playing instruction and calling the first real-time communication interface to subscribe the corresponding target audio and video stream, wherein the first real-time communication interface adopts a user datagram protocol to realize communication;
and the pulling unit is used for pulling the target audio/video stream to play from the media server based on a first preset transmission strategy of the first real-time communication interface.
In some embodiments, the first subscription unit includes:
the analysis subunit is used for analyzing the low-delay playing instruction based on the first real-time communication interface to obtain the audio and video stream information to be pulled, wherein the audio and video stream information comprises a target transmission header and a transmission address;
and the subscription subunit is used for sending the target transmission header corresponding to the audio and video stream information to the media server and subscribing to the target audio and video stream corresponding to the transmission address.
In some embodiments, the pull unit includes:
a protocol subunit, configured to adopt the user datagram protocol at the first real-time communication interface and transmit the target audio and video stream according to the transmission header, in the audio and video stream information, corresponding to the user datagram protocol, so as to determine a protocol transmission strategy for the target audio and video stream;
the network sub-unit is used for adjusting the transmission network based on a preset network optimization strategy and determining a network transmission strategy of the downlink of the first real-time communication interface;
and the execution subunit is used for actively pulling the target audio/video stream from the media server based on the protocol transmission strategy and the network transmission strategy and playing the target audio/video stream.
In some embodiments, the network sub-unit is further configured to:
monitoring the network quality of the current transmission network in real time based on a real-time bandwidth detection algorithm;
collecting network data of the network quality, analyzing data characteristics of the network data, and determining a network model corresponding to a current transmission network;
and, when the network quality of the network model does not reach the preset standard, adjusting the network model according to a preset sending-end control algorithm and determining a network transmission strategy for pulling the target audio and video stream.
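The monitor-model-adjust loop above can be illustrated with a toy sketch. The loss-rate threshold, bitrate values, and the `nack` flag are all invented for the example; the patent does not specify the actual control algorithm:

```python
# Illustrative monitor -> model -> adjust loop; all thresholds are assumptions.
def classify_network(loss_samples):
    """Derive a coarse network model from recent packet-loss samples."""
    avg_loss = sum(loss_samples) / len(loss_samples)
    return "good" if avg_loss < 0.02 else "congested"


def choose_strategy(model):
    # When quality misses the preset standard, a sender-side control algorithm
    # could lower the target bitrate and enable retransmission requests.
    if model == "congested":
        return {"target_bitrate_kbps": 800, "nack": True}
    return {"target_bitrate_kbps": 2500, "nack": False}


strategy = choose_strategy(classify_network([0.01, 0.05, 0.08]))
```

A real implementation would feed continuous bandwidth-probe data into this loop rather than a fixed list of loss samples.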
In some embodiments, the audio-video pulling device is further configured to:
generating a cache area with a corresponding storage size according to the network transmission strategy;
pre-caching audio and video stream cache data in the cache area;
and playing the audio and video stream cache data until a preset playing frame is acquired, then stopping playback of the audio and video stream cache data and playing the target audio and video stream.
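The pre-cache-then-switch behavior above can be sketched as follows. The buffer-sizing rule (bitrate divided by 100) and frame representation are illustrative assumptions only:

```python
from collections import deque


def make_buffer(strategy):
    # Cache size follows the network transmission strategy (assumed sizing rule).
    return deque(maxlen=strategy["target_bitrate_kbps"] // 100)


def play_from_cache(buffer, live_frames, switch_frame):
    """Play cached frames until the preset play frame arrives, then go live."""
    played = []
    for frame in buffer:
        if frame == switch_frame:
            break  # preset playing frame acquired: stop draining the cache
        played.append(frame)
    played.extend(live_frames)  # continue with the target audio/video stream
    return played


cache = make_buffer({"target_bitrate_kbps": 800})
cache.extend(["f1", "f2", "key", "f3"])
out = play_from_cache(cache, ["live1", "live2"], "key")
```

The bounded `deque` mirrors the idea that the cache area has a storage size derived from the transmission strategy.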
In some embodiments, the audio-video pulling device is further configured to:
based on the played audio and video stream cache data, calculating a corresponding callback speed;
and performing frame chasing on the audio and video stream cache data according to the callback speed, so as to consume the audio and video stream cache data and play the target audio and video stream.
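One plausible reading of "callback speed" is a playback-rate multiplier derived from the cache backlog; the formula, cap, and target backlog below are assumptions, not values from the source:

```python
def callback_speed(cached_ms, target_ms=200, max_speed=1.25):
    """Playback speed proportional to the excess cache backlog (assumed formula)."""
    if cached_ms <= target_ms:
        return 1.0
    return min(max_speed, 1.0 + (cached_ms - target_ms) / 1000.0)


def chase_frames(cached_ms, tick_ms=40, target_ms=200):
    """Consume cached data faster than real time until near the target backlog."""
    ticks = 0
    while cached_ms > target_ms + 1:
        speed = callback_speed(cached_ms, target_ms)
        # Each tick, tick_ms of new data arrives while tick_ms * speed is played,
        # so the backlog shrinks by tick_ms * (speed - 1).
        cached_ms -= tick_ms * (speed - 1.0)
        ticks += 1
    return ticks, cached_ms
</ ```

A gentle speed cap (here 1.25x) keeps the speed-up inaudible while the backlog drains.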
In some embodiments, the audio-video pulling device is further configured to:
calculating playing time delay corresponding to audio and video in the target audio and video stream through a network time protocol;
when the network time protocol is abnormal, respectively adding corresponding jitter buffers in the playing time delays corresponding to the audio and the video in the target audio and video stream, and calculating the target playing time delays corresponding to the audio and the video in the target audio and video stream;
and performing audio and video synchronization on the audio and the video in the target audio and video stream according to the target playing delays corresponding to the audio and the video in the target audio and video stream.
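The fallback described above (pad each stream's delay with its own jitter buffer when NTP is abnormal, then align) can be sketched as below; the jitter-buffer sizes are illustrative assumptions:

```python
def target_play_delays(audio_ms, video_ms, ntp_ok,
                       audio_jitter_ms=40, video_jitter_ms=80):
    """Compute per-stream extra wait so audio and video play in sync.

    When NTP is abnormal, each stream's delay is padded with its corresponding
    jitter buffer before alignment (buffer sizes here are assumed values).
    """
    if not ntp_ok:
        audio_ms += audio_jitter_ms
        video_ms += video_jitter_ms
    # Synchronize by holding back the earlier stream to match the later one.
    common = max(audio_ms, video_ms)
    return common - audio_ms, common - video_ms


extra_audio_wait, extra_video_wait = target_play_delays(100, 160, ntp_ok=True)
```

Holding back the earlier stream, rather than skipping ahead in the later one, avoids discarding frames.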
The audio and video pulling device is applied to a server, the server performs data transmission with a second terminal through a second real-time communication interface, and the audio and video device comprises:
the second receiving unit is used for receiving the uplink audio and video stream uploaded by the second terminal through a second preset transmission strategy of the second real-time communication interface, and the second real-time communication interface adopts a user datagram protocol to realize communication;
the second subscription unit is used for receiving a subscription instruction of the first terminal for requesting to subscribe to the target audio/video stream, wherein the subscription instruction is generated by the operation of the first terminal for calling the first real-time communication interface and subscribing to the target audio/video stream to be pulled, and the subscription instruction comprises a first subscription instruction and a second subscription instruction;
and the second transmission unit is used for responding to the subscription instruction, determining a corresponding target audio-video stream from the uplink audio-video stream and returning the target audio-video stream to the first terminal.
In some embodiments, the second receiving unit is configured to:
adopting the user datagram protocol at the second real-time communication interface and transmitting the uplink audio and video stream according to the transmission header corresponding to the user datagram protocol, so as to determine a protocol transmission strategy for the uplink audio and video stream;
based on a preset network optimization strategy, adjusting an uplink transmission network of an uplink audio/video stream, and determining a network transmission strategy of a second real-time communication interface;
and passively receiving the uplink audio and video stream of the second terminal based on the protocol transmission strategy and the network transmission strategy of the second real-time communication interface.
In some embodiments, the second transmission unit is further configured to:
when the uplink audio-video stream is a single live audio-video stream, responding to the subscription instruction, determining a corresponding target audio-video stream from the uplink audio-video stream, and returning the target audio-video stream to the first terminal;
when the uplink audio-video stream is a multi-user live broadcast interactive audio-video stream, forwarding the uplink audio-video stream to a mixed picture transcoding server, so that the mixed picture transcoding server performs live broadcast audio-video stream mixing processing on the uplink audio-video stream and receives the processed uplink audio-video stream;
and responding to the subscription instruction, determining a corresponding target audio and video stream from the processed uplink audio and video stream, and returning the target audio and video stream to the first terminal.
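The two server-side cases above (forward a single live stream directly; route a multi-person interactive stream through the mixed-picture transcoding server first) can be sketched as simple routing logic. The function and the `mix` stand-in are invented for illustration:

```python
def handle_subscription(upstream, interactive, mix=lambda s: f"mixed({s})"):
    """Route an uplink stream per the two cases above (illustrative stand-in).

    A single live stream is returned directly in response to the subscription;
    a multi-person interactive stream first passes through the mixed-picture
    transcoding server, represented here by the `mix` callable.
    """
    if interactive:
        upstream = mix(upstream)  # mixed-picture transcoding server stand-in
    return upstream               # target stream returned to the first terminal
```

In the described system the `mix` step would itself involve decoding, picture/audio mixing, and re-encoding on a separate server.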
A computer storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the above-described audio video pulling method.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps in the audio video pulling method provided above when the computer program is executed.
A computer program product or computer program comprising computer instructions stored in a storage medium. The processor of the computer device reads the computer instructions from the storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the audio/video pulling method provided above.
The embodiment of the application receives a low-delay playing instruction; responding to the low-delay playing instruction, calling the first real-time communication interface to subscribe a corresponding target audio/video stream, wherein the first real-time communication interface adopts a user datagram protocol to realize communication; and pulling the target audio and video stream from a media server to play based on a first preset transmission strategy of the first real-time communication interface. Therefore, by combining the real-time communication interface with the terminal, data transmission is performed based on the private RTC protocol, and the time delay of live broadcast audio and video playing is reduced. In addition, based on the transmission strategy adjustment of the real-time communication interface, the audio and video stream is acquired from the media server, so that the step of transmitting data of the live audio and video stream is reduced, the pulling efficiency and the playing effect of the target audio and video stream are improved, the time delay of playing the audio and video stream is further reduced, and the live watching experience of a user is optimized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic architecture diagram of a live broadcast system according to an embodiment of the present application;
fig. 2 is a flow chart of an audio/video pulling method according to an embodiment of the present application;
fig. 3 is a schematic view of a scenario of an audio/video pulling method according to an embodiment of the present application;
fig. 4 is a timing flow chart of an audio/video pulling method according to an embodiment of the present application;
fig. 5 is another flow chart of an audio/video pulling method according to an embodiment of the present application;
fig. 5a is another schematic view of an audio/video pulling method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an audio/video pulling device according to an embodiment of the present application;
fig. 7 is another schematic structural diagram of an audio/video pulling device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides an audio and video pulling method, an audio and video pulling device, a storage medium and computer equipment.
The following first describes an architecture of a live broadcast system in the technical solution of the present application, referring to fig. 1, fig. 1 is a schematic diagram of an architecture of a live broadcast system provided in an embodiment of the present application, where the live broadcast system may include: client a, media server B, and mixed picture transcoding server C. The client A and the media server B can be connected through a communication network; the media server B and the mixed picture transcoding server C can be connected through a communication network. The communication network includes a wireless network and a wired network, wherein the wireless network includes a combination of one or more of a wireless wide area network, a wireless local area network, a wireless metropolitan area network, and a wireless personal area network. The network includes network entities such as routers, gateways, etc., which are not shown.
The live broadcast system can comprise an audio and video pulling device, which can be integrated in a terminal with a storage unit, a microprocessor, and computing capability, such as a tablet computer, a mobile phone, a notebook computer, or a desktop computer. The terminal may install the client A, where the client A may include anchor clients and viewer clients, the numbers of which are not limited. An anchor client uploads live audio and video data, or virtual scene pictures obtained through imaging technology, via the uplink module of the RTC; a viewer client pulls and plays the audio and video stream, or the virtual scene picture, via the downlink module of the RTC. The viewer client of the client A may be configured to receive a low-delay play instruction; respond to the low-delay play instruction by calling the first real-time communication interface to subscribe to a corresponding target audio/video stream, where the first real-time communication interface adopts the user datagram protocol to realize communication; and pull the target audio/video stream from a media server for playback based on a first preset transmission strategy of the first real-time communication interface.
The live broadcast system can also comprise a media server B, which may consist of several media servers and stores the audio and video data of the anchor clients' live broadcasts. When the media server B receives audio and video data uploaded by an anchor client, it can detect whether the data belongs to an interactive audio/video scene. If the uplink audio/video stream data is not from an interactive scene, the media server B responds to the subscription instruction of a viewer client of the client A and sends the audio and video data to that viewer client for playback; if the uplink data is from an interactive scene, it is sent to the mixed-picture transcoding server C for mixing, and the mixed interactive audio and video data is cached and recorded. The media server B plays the role of data forwarding in the live broadcast system of this embodiment, and may be configured to receive, through a second preset transmission strategy of the second real-time communication interface, the uplink audio/video stream uploaded by the second terminal, where the second real-time communication interface adopts the user datagram protocol to realize communication; receive a subscription instruction by which the first terminal requests to subscribe to the target audio/video stream, the instruction being generated by the first terminal's operation of calling the first real-time communication interface and subscribing to the target audio/video stream to be pulled; and, in response to the subscription instruction, determine the corresponding target audio/video stream from the uplink audio/video stream and return it to the first terminal.
The live broadcast system can further comprise a mixed-picture transcoding server C, which mixes the pictures and audio of the audio and video streams uploaded by anchor clients in interactive scenes. When the media server B receives multi-person audio and video data from an interactive scene, it sends the data to the mixed-picture transcoding server C, which performs picture mixing, audio mixing, and transcoding on the multi-person audio and video data.
It should be noted that, the architecture schematic diagram of the live broadcast system shown in fig. 1 is only an example, and the live broadcast system and the architecture described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of the live broadcast system and the appearance of a new service scenario, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
The following will describe in detail.
In this embodiment, description will be made from the viewpoint of an audio-video pulling apparatus, which may be integrated in a client of a terminal in particular.
Referring to fig. 2, fig. 2 is a flowchart of an audio/video pulling method according to an embodiment of the application. The audio and video pulling method is applied to a first terminal of a live broadcast system, wherein the first terminal comprises a first real-time communication interface, and the audio and video pulling method comprises the following steps:
in step S101, a low latency play instruction is received.
The first terminal can be an audience client of the live broadcast system, and the audience client can pull the target audio/video stream through the first real-time communication interface and play the target audio/video stream on a player of the audience client. Specifically, the first real-time communication interface may be a downlink module of an RTC (Real Time Communication, real-time communication), and the viewer client performs data transmission with the server through the downlink module of the RTC.
In the prior art, the player at the audience end adopts streaming protocols such as RTMP, HTTP-FLV, and HLS. Because these protocols all transmit over TCP at the bottom layer, they incur higher delay, which degrades the audience's live-viewing experience.
To solve the above problems, the embodiment of the present application adds a trigger component for low-delay playback to the live interface of the viewer client. When the audio and video data currently playing in the live room exhibits a delay problem, the user can trigger the low-delay play component to generate a low-delay play instruction. This instruction is used to adjust the transmission strategy of the live audio/video stream in the current live room, so as to reduce the delay of that stream.
For a better understanding of the present application, reference may be made to fig. 3, and fig. 3 is a schematic view of a scenario provided by an embodiment of the present application. When the time delay problem exists in the audio and video data being played in the current live broadcasting room, a user can switch the live broadcasting audio and video stream in the current live broadcasting room into a low-delay live broadcasting audio and video stream after clicking the low-delay triggering component 3001.
In step S102, the first real-time communication interface is invoked to subscribe to a corresponding target audio/video stream in response to the low-delay play command, where the first real-time communication interface implements communication by using a user datagram protocol.
The phrase "in response to" denotes a condition or state on which an executed operation depends: when the dependent condition or state is satisfied, the one or more operations performed may be executed in real time or with a set delay. Unless otherwise specified, no limitation is placed on the order in which multiple such operations are performed.
It should be noted that the first terminal includes a first real-time communication interface, through which the first terminal realizes audio and video stream data transmission with the server. The first real-time communication interface may be the downlink module of the RTC, which uses UDP (User Datagram Protocol) to implement data transmission of the audio and video stream.
In the related art, a viewer client typically pulls a live audio/video stream from a CDN (Content Delivery Network) on the public network using RTMP (Real-Time Messaging Protocol). The real-time transmission protocol used for the live audio and video stream can also be HTTP-FLV (HTTP FLash Video), HLS (HTTP Live Streaming), and the like. All of these protocols realize data transmission on top of TCP (Transmission Control Protocol). Specifically, when the live audio/video stream is pulled over TCP from the public network of the content delivery network, on the one hand, the delay of public-network transmission is difficult to reduce; on the other hand, this end-to-end transmission mode increases the load and algorithmic pressure at both ends, so that serious delay occurs when the viewer end pulls the audio/video stream, affecting the audience's live-viewing experience.
To solve the above problems, in the embodiment of the present application, a downlink module of the RTC is added to the transmission network between the viewer client and the server. It should be noted that the downlink module of the RTC uses UDP to transmit the audio and video stream; the User Datagram Protocol is a connectionless transport layer protocol that can send encapsulated IP datagrams without establishing a connection. When the user datagram protocol of the downlink module of the RTC is used for downlink transmission of live audio and video stream data, the downlink module of the RTC loads a data transmission algorithm to realize end-to-end transmission of the audio and video stream data, so that the viewer client only needs to pull the live audio and video stream from the server through the downlink module of the RTC. This improves the pulling efficiency of the audio and video stream, reduces the delay of the downlink module during transmission of the audio and video stream data, and improves the user experience of watching the live broadcast.
In some embodiments, invoking the first real-time communication interface to subscribe to a corresponding target audio/video stream includes:
(1) Parsing the low-delay playing instruction based on the first real-time communication interface to obtain the audio and video stream information to be pulled, wherein the audio and video stream information comprises a target transmission header and a transmission address;
(2) Sending the target transmission header corresponding to the audio and video stream information to the media server, and subscribing to the target audio and video stream corresponding to the transmission address.
It should be noted that, in response to the low-delay playing instruction in the live broadcast interface of the viewer client, the low-delay playing instruction is parsed through the first real-time communication interface (the downlink module of the RTC) of the viewer client to obtain the audio and video stream information of the audio and video stream to be pulled, where the audio and video stream information includes the transmission address and the target transmission header of the audio and video stream. The target transmission header, such as yits, is a preset transmission header used when transmitting data over UDP through the first real-time communication interface; it is used to guarantee data stability and security when audio and video stream data is transmitted between the first terminal and the media server based on the user datagram protocol. Correspondingly, after the low-delay playing instruction is parsed through the first real-time communication interface, a low-delay playing instruction composed of the target transmission header and the transmission address of the audio and video stream can be obtained, and the low-delay playing instruction may be a URL (Uniform Resource Locator). Specifically, the low-delay playing instruction may be a low-delay playing instruction for a single live broadcast scene or a low-delay playing instruction for a multi-user live broadcast scene.
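As a minimal sketch of the parsing step above: the URL layout, host name, stream name, and field names below are illustrative assumptions (the patent only states that the instruction contains a transmission header and a transmission address; "yits" is the example header it gives), not a documented format.

```python
from urllib.parse import urlparse

def parse_low_latency_url(url: str) -> dict:
    """Split a low-delay play URL into the target transmission header
    (taken here as the URL scheme) and the transmission address."""
    parsed = urlparse(url)
    return {
        "transport_header": parsed.scheme,                    # e.g. "yits"
        "transmission_address": parsed.netloc + parsed.path,  # where to subscribe
        "stream_name": parsed.path.rsplit("/", 1)[-1],        # stream to pull
    }

info = parse_low_latency_url("yits://media.example.com/live/room42_stream")
# info["transport_header"] == "yits", info["stream_name"] == "room42_stream"
```

The downlink module would then send the transmission header to the media server and subscribe to the stream at the parsed address, as described in steps (1) and (2).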
Therefore, after the low-delay playing instruction corresponding to the current application scene is received, the low-delay playing instruction is sent to the media server, the audio and video stream corresponding to the low-delay playing instruction is pulled from the media server, and the audio and video stream is returned to the player of the viewer client for playing. Specifically, when the audio and video stream corresponding to the low-delay playing instruction is a live audio and video stream in a single live broadcast scene, the audio and video stream is sent from the media server to the first terminal (viewer client) as the target audio and video stream for playing. When the audio and video streams corresponding to the low-delay playing instruction are multiple audio and video streams in a multi-user co-streaming interaction scene, the multiple audio and video streams are sent to the mixed-picture transcoding server for mixed-picture and mixed-audio transcoding to obtain the target audio and video stream, and the target audio and video stream is then transmitted to the first terminal (viewer client) for playing through the downlink module of the RTC; the specific process is described in the following steps.
In some embodiments, the first preset transmission policy is associated with a first real-time communication interface, the first preset transmission policy includes a protocol transmission policy and a network transmission policy, and pulling the target audio-video stream from a media server to play based on the first preset transmission policy of the first real-time communication interface includes:
(1) Adopting the user datagram protocol at the first real-time communication interface, and transmitting the target audio and video stream according to the transmission header corresponding to the user datagram protocol in the audio and video stream name, thereby determining the protocol transmission strategy of the target audio and video stream;
(2) Adjusting the transmission network based on a preset network optimization strategy, and determining the network transmission strategy of the downlink of the first real-time communication interface;
(3) And actively pulling the target audio and video stream from the media server based on the protocol transmission strategy and the network transmission strategy, and playing the target audio and video stream.
It should be noted that, in the embodiment of the present application, the audio and video stream is transmitted through the first real-time communication interface, and the transmission method of the first real-time communication interface (the downlink module of the RTC) is associated with the first preset transmission strategy, where the first preset transmission strategy may include a protocol transmission strategy and a network transmission strategy.
Specifically, the protocol transmission strategy refers to the data transmission protocol specifically adopted when transmitting audio and video stream data under different transmission requirements; the network transmission strategy refers to the network transmission strategy specifically adopted for optimizing the network according to a preset QoS (Quality of Service) policy when transmitting audio and video stream data under different network conditions, where the quality-of-service policy may cover the bandwidth of network transmission, the delay of data transmission, and the packet loss rate of the data. It can be understood that the different transmission requirements may be standard requirements such as the transmission format of the audio and video stream data, the transmission port, and the decision application layer; the different network conditions may be transmission quality conditions such as the network condition at each end and the network link parameters during audio and video streaming.
In this embodiment, on the one hand, the UDP data transmission protocol is adopted at the first real-time communication interface to transmit the audio and video stream, and the protocol transmission strategy for the audio and video stream data is determined by the first preset transmission strategy associated with the first real-time communication interface. A preset transmission header is applied to guarantee data stability and security when the first terminal and the media server transmit audio and video stream data based on the user datagram protocol; for example, when the first real-time communication interface transmits audio and video stream data over UDP, the preset transmission header is used for the transmission.
On the other hand, by adopting various quality-of-service strategies such as ARQ (Automatic Repeat reQuest) and bandwidth detection during the network transmission of the audio and video stream, the transmission network is adaptively adjusted under various complex network conditions, and the network transmission strategy for transmitting the audio and video stream is generated.
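To make the ARQ idea concrete, here is a minimal receiver-side bookkeeping sketch: it tracks received sequence numbers and reports the gaps that a NACK-style retransmission request would cover. The class name and the flat sequence-number space are illustrative assumptions, not part of the patent's protocol.

```python
class ArqReceiver:
    """Minimal NACK-style ARQ bookkeeping: record the sequence numbers
    that have arrived and report the gaps so the sender can retransmit."""
    def __init__(self):
        self.received = set()
        self.highest = -1

    def on_packet(self, seq: int) -> None:
        self.received.add(seq)
        self.highest = max(self.highest, seq)

    def missing(self) -> list:
        # Every sequence number up to the highest seen that has not arrived.
        return [s for s in range(self.highest + 1) if s not in self.received]

rx = ArqReceiver()
for seq in (0, 1, 3, 4, 7):
    rx.on_packet(seq)
# rx.missing() -> [2, 5, 6]
```

A real implementation would bound the window and deadline each retransmission, but the gap detection above is the core of an automatic-repeat request over UDP.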
In some embodiments, adjusting the transmission network used for pulling the target audio and video stream based on the preset optimization strategy, and determining the network transmission strategy of the first real-time communication interface, specifically includes:
1. Monitoring the network quality of the current transmission network in real time based on a real bandwidth detection algorithm;
2. Collecting network data reflecting the network quality, analyzing the data characteristics of the network data, and determining the network model corresponding to the current transmission network;
3. When the network quality of the network model does not reach the preset standard, adjusting the network model according to a preset congestion control algorithm, and determining the network transmission strategy for pulling the target audio and video stream.
It should be noted that, because the first real-time communication interface performs data transmission through the downlink module of the RTC, and the downlink module of the RTC adopts the UDP data transmission protocol, data security and reliability may be low during transmission.
Therefore, the embodiment of the present application adjusts the transmission network of the user datagram protocol based on the preset optimization strategy, so as to guarantee the data security and reliability of the audio and video stream when the first real-time communication interface (the downlink module of the RTC) transmits data over the user datagram protocol. The preset optimization strategy may be a reliable transmission strategy, a congestion control strategy, or a real bandwidth detection strategy.
Specifically, the reliable transmission strategy may be an automatic retransmission strategy for weak-network conditions: during audio and video stream transmission, network data of the transmission network is collected, the data characteristics of the network data are analyzed to obtain the network model corresponding to the current transmission network, and the current transmission network is adjusted according to the network model. The congestion control strategy may combine sender-side and receiver-side congestion algorithms; for example, when a receiver in a poor network environment is detected, a sender-controlled congestion control algorithm, such as GCC (Google Congestion Control) or BBR (Bottleneck Bandwidth and Round-trip propagation time), is scheduled on the server, and the congestion algorithm runs on the server to realize audio and video stream transmission between the two ends. By flexibly invoking the congestion control algorithm during data transmission, the delay of the audio and video streaming is reduced. The real bandwidth detection strategy adds real bandwidth detection on top of the congestion control strategy on the transmission network of the audio and video stream: by reasonably controlling the sending of data, the real bandwidth of the link is detected, which improves the accuracy of the data characteristics corresponding to the network quality of the transmission network.
Illustratively, by implementing GCC at the receiving end, the computation logic is concentrated at the receiver and the algorithm pressure at the sender is relieved; in this embodiment, the computation logic is compiled into the first terminal (viewer client), relieving the computation pressure on the media server. Further, when the network environment is poor, the sender-side congestion control algorithm BBR performs congestion adjustment for the first terminal with a poor network condition, so that audio and video stream data can be transmitted under various network environments and the transmission delay of the audio and video stream data is reduced to a certain extent.
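The selection logic described above can be sketched as a small policy function. The loss-rate and RTT thresholds below are invented for illustration (the patent does not specify when a network counts as "poor"), and the return labels only name where each algorithm runs.

```python
def select_congestion_control(loss_rate: float, rtt_ms: float) -> str:
    """Illustrative policy following the text: run receiver-side GCC normally,
    keeping computation on the viewer client; fall back to sender-side BBR
    when the receiver's network degrades (thresholds are assumptions)."""
    network_poor = loss_rate > 0.05 or rtt_ms > 300
    return "BBR@sender" if network_poor else "GCC@receiver"

select_congestion_control(0.01, 80)   # healthy link -> "GCC@receiver"
select_congestion_control(0.10, 80)   # lossy link   -> "BBR@sender"
```

In practice such a decision would be re-evaluated continuously from the real bandwidth detection measurements rather than from two static numbers.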
In step S103, the target audio/video stream is pulled from the media server to play based on the first preset transmission policy of the first real-time communication interface.
It should be noted that the first real-time communication interface is a downlink module of the RTC, and data transmission between the viewer client and the server can be realized through the first real-time communication interface (the downlink module of the RTC); the data transmission between the viewer client and the server is thus based on the user datagram protocol UDP.
It should be noted that the user datagram protocol UDP is a connectionless transport layer protocol. When data is transmitted based on the user datagram protocol, the packets of the audio and video stream to be transmitted are not grouped, assembled, or ordered, so there are few control options for the audio and video stream, the transmission speed between the ends is accelerated, and the pulling of the audio and video stream has low delay and high efficiency. However, an audio and video stream transmitted over the user datagram protocol carries no reliability guarantee, ordering guarantee, or flow control fields, so while the first real-time communication interface (the downlink module of the RTC) based on the user datagram protocol reduces delay, it can cause packet loss, disorder, and low transmission reliability of the audio and video stream.
Therefore, in this embodiment, the first preset transmission strategy is added to and associated with the first real-time communication interface, and when the downlink module of the RTC transmits the audio and video stream, the transmission network of the user datagram protocol is optimized, so as to guarantee the stability and reliability of the audio and video stream data while reducing the transmission delay of the audio and video stream.
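The connectionless pull described above can be sketched with a plain UDP socket: a subscribe datagram is sent with no handshake, and media datagrams are read until a timeout, which is UDP's only signal that nothing more arrived. The subscribe message text, port, and packet sizes are invented for illustration; the patent's actual framing is not specified.

```python
import socket

def pull_stream(server_addr, max_packets=3, timeout=0.5):
    """Sketch of a connectionless UDP pull: send one subscribe datagram,
    then read media datagrams; there is no connection, no ordering, and
    no delivery guarantee."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    sock.sendto(b"SUBSCRIBE room42_stream", server_addr)
    packets = []
    try:
        for _ in range(max_packets):
            data, _ = sock.recvfrom(1500)   # roughly MTU-sized media datagram
            packets.append(data)
    except socket.timeout:
        pass                                # UDP: silence simply means no data
    finally:
        sock.close()
    return packets
```

Everything the section calls the first preset transmission strategy (retransmission, ordering, congestion control) has to be layered on top of this bare datagram exchange.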
For a better understanding of the present application, reference may be made to fig. 4, which is a schematic timing flow chart provided in an embodiment of the present application. In this embodiment, the player and the downlink module of the RTC are combined, but in order to better show the steps executed by the downlink module of the RTC, the player and the downlink module of the RTC are described separately. Specifically, a URL (Uniform Resource Locator) corresponding to a low-delay playing instruction is input through the player of the viewer client, the player sends the URL to the downlink module of the RTC, the downlink module of the RTC parses the URL to determine the stream name to be pulled, and subscribes to the target audio and video stream from the media server according to the stream name. If the stream name is a stream name in a single live broadcast scene, the audio and video stream corresponding to the stream name is directly returned to the downlink module of the RTC; if the stream name corresponds to multiple audio and video streams in a multi-user co-streaming scene, the multiple audio and video streams corresponding to the stream name are sent to the mixed-picture transcoding server for mixing, the mixed target audio and video stream is returned to the media server, and the mixed target audio and video stream is then sent from the media server to the downlink module of the RTC. The downlink module of the RTC receives the target audio and video stream issued by the server and returns it to the player of the viewer client for decoding, rendering, and playing.
According to the above embodiment, by combining the first terminal with the first real-time communication interface (the viewer client with the downlink module of the RTC), the burden on the server during audio and video stream pulling is reduced, the problem of server congestion during audio and video stream pulling is alleviated, and the delay of live audio and video playback is reduced. In addition, the live audio and video stream is transmitted based on the user datagram protocol of the first real-time communication interface, and the live audio and video stream data is transmitted directly between the first terminal and the server without pulling the audio and video stream from the public network; this reduces the steps of transmitting the live audio and video stream data, alleviates the problems of large playback delay and low efficiency caused by the large volume of live audio and video stream data, and reduces the delay of the live audio and video stream.
In some other embodiments, before the playing the target audio-video stream, the method further includes:
generating a cache area with a corresponding storage size according to the network transmission strategy;
pre-caching audio and video stream cache data in the cache area;
and playing the audio and video stream cache data until the preset playing frame is acquired, stopping playing the audio and video stream cache data and playing the target audio and video stream.
It should be noted that, in order to cope with fluctuations of the network environment during live audio and video stream transmission, a buffer area of a preset storage size is set in the downlink module of the RTC through a jitter buffer technique, and is used to store audio and video stream cache data between pulling the target audio and video stream and playing it, where the audio and video stream cache data may be key frames in the live audio and video stream, such as IDR frames. Therefore, by setting a fixed buffer area for the player of the viewer client, the first-screen time for playing the target audio and video stream is reduced while low delay is guaranteed during the pulling of the target audio and video stream.
It should be noted that, in this embodiment, whether the video is pushed by the anchor client or output by the mixed-picture transcoding server, the first frame of the video is an IDR frame; the player can only decode and play the video stream when its first frame is an IDR frame.
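A minimal sketch of this pre-caching rule: buffer incoming frames, and only release them for decoding starting from the first buffered IDR frame, since decoding can only begin there. The class shape and frame-type labels are illustrative assumptions; the capacity would come from the network transmission strategy as the text describes.

```python
from collections import deque

class JitterBuffer:
    """Pre-cache incoming frames; playback may only start at an IDR frame,
    so leading non-IDR frames are discarded before handing off to decode."""
    def __init__(self, capacity: int):
        self.frames = deque(maxlen=capacity)  # size set by the network policy

    def push(self, frame_type: str, payload: bytes) -> None:
        self.frames.append((frame_type, payload))

    def pop_playable(self) -> list:
        # Drop leading non-IDR frames; decoding starts at the first IDR.
        while self.frames and self.frames[0][0] != "IDR":
            self.frames.popleft()
        return list(self.frames)

jb = JitterBuffer(capacity=8)
for ft in ("P", "B", "IDR", "P", "B"):
    jb.push(ft, b"")
# jb.pop_playable() begins at the IDR frame: [("IDR", ...), ("P", ...), ("B", ...)]
```

Playing from this buffer while the live stream catches up is what reduces the first-screen time without giving up low delay.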
In some other embodiments, the stopping playing the audio/video stream buffer data and playing the target audio/video stream includes:
Based on the played audio and video stream cache data, calculating a corresponding callback speed;
and according to the callback speed, frame tracking is carried out on the audio and video stream cache data so as to consume the audio and video stream cache data and play the target audio and video stream.
In the above scheme for reducing the first-screen time of playing the target audio and video stream, a buffer area of a preset storage size is set in the downlink module of the RTC to store the audio and video stream cache data. However, after the audio and video stream cache data is loaded ahead of the target audio and video stream, the server first delivers the accumulated data quickly, which may cause the playing of the target audio and video stream to lag. Therefore, when the audio and video stream cache data in the buffer area is played, the frame-dropping logic for the cache data is triggered based on the target audio and video stream data received by the first terminal over UDP. Specifically, the downlink module of the RTC controls an accelerated playback catch-up, quickening the speed of the callback to the player, and drops audio frames with close timestamps and non-key video frames, where the priority order of the frame-dropping logic may be: drop B frames first, then P frames, then I frames. In this way, on the premise of respecting the decoding dependencies of the audio and video stream data, the player decodes and plays as soon as possible, which improves the smoothness of playing the target audio and video stream and reduces its delay.
In some other embodiments, after the playing the target audio-video stream, the method further comprises:
calculating playing time delay corresponding to audio and video in the target audio and video stream through a network time protocol;
when the network time protocol is abnormal, respectively adding corresponding jitter buffers in the playing time delays corresponding to the audio and the video in the target audio and video stream, and calculating the target playing time delays corresponding to the audio and the video in the target audio and video stream;
and according to the target playing time delay corresponding to the audio and the video in the target audio and video stream, performing audio and video synchronization on the audio and the video in the target audio and video stream.
In the technical scheme of the present application, in order to guarantee the stability of the timestamps corresponding to the audio and the video in the target audio and video stream, the clock source of the terminal operating system is used as the timestamp for synchronizing the audio and the video in the downlink module of the RTC. Further, the downlink module of the RTC calculates the playing delay corresponding to the audio and the video in the target audio and video stream based on the NTP (Network Time Protocol) service. The playing delay corresponding to the audio and the video in the target audio and video stream is an absolute delay.
When the network time protocol service is abnormal, the jitter buffer of the audio or the video is added to the calculated playing delay of the audio and the video in the target audio and video stream, the total delays of the audio and the video are determined respectively, and audio and video synchronization is performed according to the total delays. When the playing delays of the audio and the video are calculated, the time consumed by the player of the viewer client in decoding and rendering the target audio and video stream is also taken into account, which guarantees the accuracy of the total delays of the audio and the video, further improves the accuracy of the audio and video synchronization of the target audio and video stream, and improves the user's live-viewing experience.
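A compact model of this synchronization step, under stated assumptions: delays are in milliseconds, the NTP-abnormal fallback adds each stream's jitter buffer to its measured delay, and (as one common alignment rule, not spelled out in the patent) both streams are padded to the larger total delay.

```python
def target_delays(audio_delay_ms: float, video_delay_ms: float, ntp_ok: bool,
                  audio_jitter_ms: float = 0, video_jitter_ms: float = 0) -> dict:
    """When NTP-based absolute delays are available, use them directly;
    when the NTP service is abnormal, add each stream's jitter buffer to
    its measured delay, then align both streams to the slower one."""
    if not ntp_ok:
        audio_delay_ms += audio_jitter_ms
        video_delay_ms += video_jitter_ms
    total = max(audio_delay_ms, video_delay_ms)   # align to the slower stream
    return {"audio_wait": total - audio_delay_ms,
            "video_wait": total - video_delay_ms}

target_delays(100, 140, ntp_ok=True)   # audio waits 40 ms to match video
```

Decode and render time, which the text says is also counted, would simply be folded into the measured per-stream delays before calling this.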
Referring to fig. 5, fig. 5 is another flow chart of the audio and video pulling method according to an embodiment of the present application. Specifically, the method is also applied to a server of the live broadcast system, the server performs data transmission with the second terminal through a second real-time communication interface, and the audio and video pulling method further comprises the following steps:
in step 201, the uplink audio/video stream uploaded by the second terminal is received through a second preset transmission policy of the second real-time communication interface, where the second real-time communication interface uses a user datagram protocol to implement communication.
It should be noted that the second real-time communication interface may be an uplink module of the RTC, through which the server performs data transmission with the second terminal. The second terminal may be an anchor client, used for uploading the live audio and video stream of the live broadcasting room; the second real-time communication interface uses the UDP data transmission protocol to transmit the live audio and video stream.
In this way, the anchor client transmits the live audio and video stream to the server through the uplink module of the RTC for subsequent operations, without transmitting the live audio and video stream to the content delivery network. This improves the efficiency of pulling the audio and video stream and reduces the load at both ends of the data transmission, thereby reducing the delay of the uplink module during audio and video stream transmission and improving the user experience of watching the live audio and video stream.
In some embodiments, the second preset transmission policy is associated with a second real-time communication interface, where the second preset transmission policy includes a protocol transmission policy and a network transmission policy, and the uplink audio/video stream uploaded by the second terminal is received through the second preset transmission policy of the second real-time communication interface, and the second real-time communication interface implements communication by adopting a user datagram protocol, including:
(1) Adopting the user datagram protocol at the second real-time communication interface, and transmitting the uplink audio and video stream according to the transmission header corresponding to the user datagram protocol, thereby determining the protocol transmission strategy of the uplink audio and video stream;
(2) Based on a preset network optimization strategy, adjusting an uplink transmission network of an uplink audio/video stream, and determining a network transmission strategy of a second real-time communication interface;
(3) Passively receiving the uplink audio and video stream of the second terminal based on the protocol transmission strategy and the network transmission strategy of the second real-time communication interface.
It should be noted that, in the embodiment of the present application, the audio and video stream is transmitted through the second preset transmission strategy of the second real-time communication interface, where the transmission method of the second real-time communication interface (the uplink module of the RTC) is associated with the second preset transmission strategy. Specifically, the principles of the second preset transmission strategy are the same as those of the first preset transmission strategy: it may include a protocol transmission strategy and a network transmission strategy, and the methods of determining the data transmission protocol adopted under different transmission requirements and the network transmission strategy adopted under different network conditions are consistent with those described above, and are not repeated here.
In step 202, receiving a subscription instruction of a first terminal for requesting to subscribe to a target audio/video stream;
and the subscription instruction is generated by calling a first real-time communication interface for the first terminal and subscribing the operation of the target audio/video stream to be pulled.
It should be noted that, upon receiving the subscription instruction from the first terminal, which requests to subscribe to preset audio and video stream data, the corresponding live audio and video stream is pulled from the media server. After the live audio and video stream corresponding to the subscription instruction is determined, audio and video stream data transmission between the media server and the first terminal is realized through the first real-time communication interface. The first real-time communication interface may be a downlink module of the RTC, where the downlink module of the RTC uses UDP to transmit the audio and video stream.
As can be seen, the server sends the target audio and video stream to the viewer client through the second real-time communication interface based on the second preset transmission strategy, where the second real-time communication interface is consistent in working principle with the first real-time communication interface in the first terminal; reference may be made to the detailed descriptions in step S102 and step S103.
In step 203, in response to the subscription instruction, a corresponding target audio/video stream is determined from the uplink audio/video stream, and the target audio/video stream is returned to the first terminal.
It should be noted that, with the continuous development of the live broadcast industry, co-streaming interaction is increasingly common in live broadcast scenes. In a co-streaming live broadcast scene, as the number of participants in the interaction increases, the number of audio and video streams to be processed and transmitted increases, so that the transmission process of pulling the live audio and video streams suffers from a serious delay problem.
To solve this technical problem, the server receives multiple live audio and video streams uploaded by the anchor clients through the second real-time communication interface (the uplink module of the RTC), and then sends the multiple live audio and video streams to the media server through the second real-time communication interface. After receiving the multiple live audio and video streams, the media server has the mixed-picture transcoding server perform mixed-picture and mixed-audio transcoding on the multiple live audio and video streams to be interacted, obtaining the interacted target audio and video stream for the interaction scene.
According to the above scheme, the step of transmitting multiple live audio and video streams to the public network for data transmission is eliminated; the media server is directly connected to the mixed-picture transcoding server, which performs mixed-picture transcoding on the multiple live audio and video streams and outputs a mixed-picture video stream, reducing the computational burden on the client. In a multi-user co-streaming interaction scene, a viewer can watch the pictures of multiple co-streaming anchors by subscribing to only one mixed-picture, mixed-audio stream, which improves the live broadcast efficiency and user experience in the multi-user co-streaming scene.
In the technical scheme of the present application, in the process of performing mixed-picture and mixed-audio transcoding on multiple live audio and video streams, the mixed-picture transcoding server uses a jitter buffer technique to mix the multiple live audio and video streams into one audio and video stream. The jitter buffer technique is used to improve the fluency of the interacted target audio and video stream after mixing.
In some embodiments, the determining, in response to the subscription instruction, a corresponding target audio-video stream from the uplink audio-video streams, and returning the target audio-video stream to the first terminal includes:
(1) When the uplink audio-video stream is a single live audio-video stream, responding to the subscription instruction, determining a corresponding target audio-video stream from the uplink audio-video stream, and returning the target audio-video stream to the first terminal;
(2) When the uplink audio-video stream is a multi-user live broadcast interactive audio-video stream, forwarding the uplink audio-video stream to a mixed picture transcoding server, so that the mixed picture transcoding server performs live broadcast audio-video stream mixing processing on the uplink audio-video stream and receives the processed uplink audio-video stream;
(3) And responding to the subscription instruction, determining a corresponding target audio-video stream from the processed uplink audio-video stream, and returning the target audio-video stream to the first terminal.
When a user enters an anchor's live room, if the uplink audio and video received by the media server from the anchor client is a live audio and video stream in a single-anchor live scene, then, in response to the subscription instruction of the user in the live room, the corresponding target audio and video stream is determined from the audio and video stream uplinked to the media server, and the target audio and video stream is sent from the media server to the first terminal (the viewer client) for playing.
In other embodiments, the live audio and video stream in the single-anchor live scene may also be uplinked to the media server, and the media server inputs it to the mixed picture transcoding server, so as to improve the smoothness of the picture played from the live audio and video stream at the first terminal.
Further, when a user enters an anchor's live room and the uplink audio and video received from the anchor clients consists of a plurality of audio and video data streams in a multi-user co-streaming interaction scene, the media server sends the plurality of corresponding audio and video streams to the mixed picture transcoding server, which mixes and transcodes them. The corresponding target audio and video stream is then determined from the uplink audio and video stream processed by the mixed picture transcoding server, that is, the live audio and video stream that is subsequently downlinked to the first terminal for playing.
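The dispatch rule described above (serve a single-anchor stream directly; route co-streaming uplinks through the mixed picture transcoding server first) can be sketched as follows; the mixer stand-in and all names are hypothetical:

```python
def route_upstream(streams, mix_transcode):
    """Sketch of the media server's dispatch rule: a single live stream
    is served directly; multiple co-streaming uplinks are first merged
    by the mixing/transcoding service."""
    if len(streams) == 1:
        return streams[0]              # single-anchor room: serve as-is
    return mix_transcode(streams)      # co-streaming room: mix first

# Hypothetical mixer stand-in that just concatenates stream labels.
mixed = route_upstream(["anchor-1", "anchor-2"],
                       mix_transcode=lambda s: "+".join(s))
print(mixed)  # anchor-1+anchor-2
```

In either branch the result is the single stream that the viewer client subscribes to and pulls.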
For the above steps of downlinking the target audio and video stream to the first terminal for playing, refer to the descriptions of step S102 and step S103 of the present disclosure, which are not repeated here.
To better understand the technical solution of the present application, please refer to fig. 5a, which is a schematic view of a multi-user co-streaming live interaction scene provided in this embodiment. The first terminal receives the user's low-delay playing instruction when the "low-delay" trigger component 5001 is clicked at the viewer client in the multi-user co-streaming scene. Based on the low-delay instruction, each anchor client uploads its uplink audio and video to the media server, the live audio and video streams being uplinked from the anchor clients to the media server through the second transmission policy of the RTC uplink module at the second real-time communication interface. In the multi-user co-streaming interaction scene, the media server sends the plurality of live audio and video streams to the mixed picture transcoding server for mixing and transcoding, obtains the target audio and video stream of that scene, and receives it back; the mixed audio and video stream 5002 of the multi-user co-streaming scene is then downlinked to the first terminal for playing through the RTC downlink module, based on the first transmission policy of the first real-time communication interface.
Therefore, for live audio and video streams in a multi-user co-streaming interaction scene, the scheme of the application creates a dedicated mixed picture transcoding server to mix the pictures and audio of the live streams in the interaction scene, reducing the computational pressure and load on the media server. In addition, the mixed picture transcoding server in the embodiment of the application mixes the live audio and video streams using a jitter buffer, which improves the smoothness of the playing pictures of the live room in the multi-user co-streaming interaction scene, improves compatibility between co-streaming and non-co-streaming live rooms, and smooths switching between live rooms in that scene.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an audio/video pulling device according to an embodiment of the present application, where the audio/video pulling device is applied to a first terminal, the first terminal includes a first real-time communication interface, and the audio/video pulling device may include: a first receiving unit 301, a first subscribing unit 302, a pulling unit 303, and the like.
A first receiving unit 301, configured to receive a low-latency playing instruction;
the first subscription unit 302 is configured to invoke the first real-time communication interface to subscribe to a corresponding target audio/video stream in response to the low-latency playing instruction, where the first real-time communication interface implements communication using a user datagram protocol;
And the pulling unit 303 is configured to pull the target audio/video stream from the media server for playing, based on the first preset transmission policy of the first real-time communication interface.
In some embodiments, the first subscription unit 302 includes:
the parsing subunit is used for parsing the low-delay playing instruction based on the first real-time communication interface to obtain audio and video stream information to be pulled, wherein the audio and video stream information comprises a target transmission head and a transmission address;
the subscription subunit is used for sending the target transmission head corresponding to the audio and video stream information to the media server and subscribing to the target audio and video stream corresponding to the transmission address;
and the transmission subunit is used for sending the subscription instruction to the media server so as to subscribe to the target audio and video stream corresponding to the audio and video stream name.
In some embodiments, the pulling unit 303 includes:
a protocol subunit, used for adopting the user datagram protocol at the first real-time communication interface and transmitting the target audio and video stream according to the transmission head, in the audio and video stream name, corresponding to the user datagram protocol, so as to determine the protocol transmission policy of the target audio and video stream;
The network sub-unit is used for adjusting the transmission network based on a preset network optimization strategy and determining a network transmission strategy of the downlink of the first real-time communication interface;
and the execution subunit is used for actively pulling the target audio/video stream from the media server based on the protocol transmission strategy and the network transmission strategy and playing the target audio/video stream.
In some embodiments, the network sub-unit is further configured to:
monitoring the network quality of the current transmission network in real time based on a real bandwidth detection algorithm;
collecting network data of the network quality, analyzing data characteristics of the network data, and determining a network model corresponding to a current transmission network;
and when the network quality of the network model does not reach the preset standard, adjusting the network model according to a preset sending end control algorithm, and determining a network transmission strategy for pulling the target audio and video stream.
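The application names the sending-end control algorithm only abstractly; as a hedged illustration, a toy control rule that backs off when packet loss is high or the bandwidth estimate falls below the current bitrate, and probes gently upward otherwise (all thresholds invented), could be:

```python
def adjust_bitrate(current_kbps, estimated_kbps, loss_rate,
                   min_kbps=300, max_kbps=4000):
    """Toy sender-side control rule in the spirit of the text:
    back off when loss is high or the bandwidth estimate drops below
    the current rate; probe upward otherwise. Thresholds are invented."""
    if loss_rate > 0.10 or estimated_kbps < current_kbps:
        target = min(current_kbps, estimated_kbps) * 0.85  # back off
    else:
        target = current_kbps * 1.05                       # gentle probe
    return max(min_kbps, min(max_kbps, round(target)))

print(adjust_bitrate(2000, 1500, 0.02))  # estimate below rate -> 1275
```

A production system would feed such a rule with continuously measured loss, delay, and throughput samples rather than single snapshots.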
In some embodiments, the audio-video pulling device is further configured to:
generating a cache area with a corresponding storage size according to the network transmission strategy;
pre-caching audio and video stream cache data in the cache area;
and playing the audio and video stream cache data until the preset playing frame is acquired, stopping playing the audio and video stream cache data and playing the target audio and video stream.
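A minimal sketch of this startup behaviour, playing pre-cached data until the first live frame (the "preset playing frame") arrives and then switching over, could look like this (all names hypothetical):

```python
class StartupBuffer:
    """Sketch of the startup behaviour above: play pre-cached frames
    until the first live frame arrives, then switch to the live stream."""

    def __init__(self, cached_frames):
        self.cached = list(cached_frames)
        self.live_started = False

    def next_frame(self, live_frame=None):
        if live_frame is not None:           # live frame available: switch
            self.live_started = True
            return live_frame
        if not self.live_started and self.cached:
            return self.cached.pop(0)        # drain the pre-cache
        return None                          # nothing to play yet

buf = StartupBuffer(["c1", "c2"])
print(buf.next_frame())                  # c1, served from the pre-cache
print(buf.next_frame(live_frame="l1"))   # l1, playback switches to live
```

The cache size would be derived from the network transmission policy, as the text describes.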
In some embodiments, the audio-video pulling device is further configured to:
based on the played audio and video stream cache data, calculating a corresponding callback speed;
and according to the callback speed, frame tracking is carried out on the audio and video stream cache data so as to consume the audio and video stream cache data and play the target audio and video stream.
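The "callback speed" computation is not specified; one hedged reading is a catch-up rule that plays slightly faster than real time while buffered data exceeds a target, so the cache is consumed and playback converges on the live stream (constants invented):

```python
def catchup_speed(buffered_ms, target_ms=200, max_speed=1.25):
    """Hypothetical catch-up rule: when more audio/video is buffered
    than the target, play slightly faster so the backlog is consumed
    and playback converges on the live edge."""
    if buffered_ms <= target_ms:
        return 1.0
    excess = buffered_ms - target_ms
    return min(max_speed, 1.0 + excess / 2000.0)  # +0.05x per 100 ms excess

print(catchup_speed(600))  # 400 ms of excess -> 1.2x playback
```

Capping the speed-up keeps the rate change below the threshold where viewers notice pitch or motion artifacts.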
In some embodiments, the audio-video pulling device is further configured to:
calculating playing time delay corresponding to audio and video in the target audio and video stream through a network time protocol;
when the network time protocol is abnormal, respectively adding corresponding jitter buffers in the playing time delays corresponding to the audio and the video in the target audio and video stream, and calculating the target playing time delays corresponding to the audio and the video in the target audio and video stream;
and according to the target playing time delay corresponding to the audio and the video in the target audio and video stream, performing audio and video synchronization on the audio and the video in the target audio and video stream.
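As a non-limiting sketch of this synchronization step: each stream's measured delay is padded with its own jitter buffer when NTP timing is unavailable, and both streams are then aligned on the larger target playout delay (all values invented):

```python
def target_delays(audio_delay_ms, video_delay_ms, ntp_ok=True,
                  audio_jitter_ms=60, video_jitter_ms=100):
    """Sketch: when NTP timing is unavailable, pad each stream's measured
    delay with its own jitter buffer to get a target playout delay, then
    align both streams on the larger target. Returns the extra wait
    (in ms) each stream needs to stay in sync."""
    if not ntp_ok:
        audio_delay_ms += audio_jitter_ms
        video_delay_ms += video_jitter_ms
    target = max(audio_delay_ms, video_delay_ms)
    return target - audio_delay_ms, target - video_delay_ms

print(target_delays(120, 150, ntp_ok=False))  # (70, 0): audio waits 70 ms
```

With NTP available, the measured delays can be compared directly and the padding step is skipped.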
Referring to fig. 7, fig. 7 is a schematic structural diagram of an audio/video pulling device according to an embodiment of the present application. This audio/video pulling device is applied to a server, the server performs data transmission with a second terminal through a second real-time communication interface, and the device may include: a second receiving unit 401, a second subscription unit 402, a second transmission unit 403, and the like.
the second receiving unit is used for receiving the uplink audio and video stream uploaded by the second terminal through a second preset transmission strategy of the second real-time communication interface, and the second real-time communication interface adopts a user datagram protocol to realize communication;
the second subscription unit is used for receiving a subscription instruction of the first terminal for requesting to subscribe to the target audio/video stream, wherein the subscription instruction is generated by the operation of calling a first real-time communication interface for the first terminal and subscribing to the target audio/video stream to be pulled;
and the second transmission unit is used for responding to the subscription instruction, determining a corresponding target audio-video stream from the uplink audio-video stream and returning the target audio-video stream to the first terminal.
In some embodiments, the second receiving unit is configured to:
the user datagram protocol is adopted at the second real-time communication interface, and the uplink audio and video stream is transmitted according to a transmission head corresponding to the user datagram protocol, so that a protocol transmission strategy of the uplink audio and video stream is determined;
Based on a preset network optimization strategy, adjusting an uplink transmission network of an uplink audio/video stream, and determining a network transmission strategy of a second real-time communication interface;
and passively receiving the uplink audio and video stream of the second terminal based on the protocol transmission policy and the network transmission policy of the second real-time communication interface.
In some embodiments, the second transmission unit is further configured to:
when the uplink audio-video stream is a single live audio-video stream, responding to the subscription instruction, determining a corresponding target audio-video stream from the uplink audio-video stream, and returning the target audio-video stream to the first terminal;
when the uplink audio-video stream is a multi-user live broadcast interactive audio-video stream, forwarding the uplink audio-video stream to a mixed picture transcoding server, so that the mixed picture transcoding server performs live broadcast audio-video stream mixing processing on the uplink audio-video stream and receives the processed uplink audio-video stream;
and responding to the subscription instruction, determining a corresponding target audio-video stream from the processed uplink audio/video stream, and returning the target audio-video stream to the first terminal.
The embodiment of the application also provides a computer device, which may be a terminal. As shown in fig. 8, fig. 8 is a schematic structural diagram of the computer device according to the embodiment of the application. Specifically:
The computer device may include a processor 401 with one or more processing cores, a memory 402 of one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the computer device structure shown in fig. 8 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:
the processor 401 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall monitoring of the computer device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the computer device, etc. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The computer device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of charge, discharge, and power consumption management may be performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The computer device may also include an input unit 404, which input unit 404 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the computer device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
receiving a low-delay playing instruction;
responding to the low-delay playing instruction, calling the first real-time communication interface to subscribe a corresponding target audio/video stream, wherein the first real-time communication interface adopts a user datagram protocol to realize communication;
and pulling the target audio and video stream from a media server to play based on a first preset transmission strategy of the first real-time communication interface.
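The three functions above can be sketched end to end as follows; the RTC and media-server objects are stubs invented purely for illustration and do not correspond to any real API:

```python
class FakeRTC:
    """Stub for the first real-time communication interface."""
    def parse(self, instruction):
        # Step 1: parse the low-delay instruction into header + address.
        return {"header": "udp", "address": instruction["url"]}

class FakeMediaServer:
    """Stub media server that records subscriptions and serves pulls."""
    def __init__(self):
        self.subscribed = None
    def subscribe(self, info):
        self.subscribed = info["address"]
    def pull(self, info):
        return f"stream@{info['address']}"

def pull_and_play(instruction, rtc, server):
    info = rtc.parse(instruction)   # step 1: parse the instruction
    server.subscribe(info)          # step 2: subscribe the target stream
    return server.pull(info)        # step 3: pull the stream for playing

print(pull_and_play({"url": "room-1"}, FakeRTC(), FakeMediaServer()))
# stream@room-1
```

In the actual scheme the transport underneath `pull` would follow the first preset transmission policy (UDP header plus the tuned network policy) rather than a direct return.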
In the foregoing embodiments, each embodiment is described with its own emphasis; for the parts of an embodiment that are not described in detail, refer to the detailed description of the audio/video pulling method above, which is not repeated here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the above embodiments may be completed by instructions, or by instructions controlling the associated hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any one of the audio and video pulling methods provided by the embodiment of the present application.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the various alternative implementations provided in the above embodiments.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein the computer storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disk, and the like.
The instructions stored in the computer storage medium can execute the steps in any audio/video pulling method provided by the embodiment of the present application, so that the beneficial effects that any audio/video pulling method provided by the embodiment of the present application can be achieved, which are detailed in the previous embodiments and are not described herein.
The audio and video pulling method, apparatus, storage medium and computer device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementations and application scope in light of the ideas of the present application. In summary, the content of this description should not be construed as limiting the present application.

Claims (14)

1. An audio and video pulling method, characterized in that the method is applied to a first terminal, the first terminal comprises a first real-time communication interface, and the audio and video pulling method comprises:
receiving a low-delay playing instruction;
responding to the low-delay playing instruction, calling the first real-time communication interface to subscribe a corresponding target audio/video stream, wherein the first real-time communication interface adopts a user datagram protocol to realize communication;
And pulling the target audio and video stream from a media server to play based on a first preset transmission strategy of the first real-time communication interface.
2. The method of claim 1, wherein invoking the first real-time communication interface to subscribe to the corresponding target audio-video stream comprises:
analyzing the low-delay playing instruction based on the first real-time communication interface to obtain audio and video stream information to be pulled, wherein the audio and video stream information comprises a target transmission head and a transmission address;
and sending the target transmission head corresponding to the audio and video stream information to the media server, and subscribing the target audio and video stream corresponding to the transmission address.
3. The audio video pulling method of claim 2, wherein the first predetermined transmission policy is associated with a first real-time communication interface, the first predetermined transmission policy comprising a protocol transmission policy and a network transmission policy,
the pulling the target audio/video stream from the media server to play based on the first preset transmission policy of the first real-time communication interface includes:
the user datagram protocol is adopted at the first real-time communication interface, and the target audio-video stream is transmitted according to a transmission head corresponding to the user datagram protocol in the audio-video stream name, so that a protocol transmission strategy of the target audio-video stream is determined;
Based on a preset network optimization strategy, adjusting a transmission network, and determining a network transmission strategy of a first real-time communication interface downlink;
and actively pulling the target audio and video stream from the media server based on the protocol transmission strategy and the network transmission strategy, and playing the target audio and video stream.
4. The audio/video pulling method as set forth in claim 3, wherein the adjusting the transmission network based on the preset network optimization policy to determine the network transmission policy of the first real-time communication interface downstream comprises:
monitoring the network quality of the current transmission network in real time based on a real bandwidth detection algorithm;
collecting network data of the network quality, analyzing data characteristics of the network data, and determining a network model corresponding to a current transmission network;
and when the network quality of the network model does not reach the preset standard, adjusting the network model according to a preset sending end control algorithm, and determining a network transmission strategy for pulling the target audio and video stream.
5. The audio-video pulling method of claim 4, wherein prior to said playing said target audio-video stream, said method further comprises:
Generating a cache area with a corresponding storage size according to the network transmission strategy;
pre-caching audio and video stream cache data in the cache area;
and playing the audio and video stream cache data until the preset playing frame is acquired, stopping playing the audio and video stream cache data and playing the target audio and video stream.
6. The audio-video pulling method as set forth in claim 5, wherein said stopping playing said audio-video stream buffer data and playing said target audio-video stream comprises:
based on the played audio and video stream cache data, calculating a corresponding callback speed;
and according to the callback speed, frame tracking is carried out on the audio and video stream cache data so as to consume the audio and video stream cache data and play the target audio and video stream.
7. The audio-video pulling method of claim 5, wherein after said playing said target audio-video stream, said method further comprises:
calculating playing time delay corresponding to audio and video in the target audio and video stream through a network time protocol;
when the network time protocol is abnormal, respectively adding corresponding jitter buffers in the playing time delays corresponding to the audio and the video in the target audio and video stream, and calculating the target playing time delays corresponding to the audio and the video in the target audio and video stream;
And according to the target playing time delay corresponding to the audio and the video in the target audio and video stream, performing audio and video synchronization on the audio and the video in the target audio and video stream.
8. The audio and video pulling method is characterized in that the method is applied to a media server, the media server performs data transmission with a second terminal through a second real-time communication interface, and the audio and video pulling method comprises the following steps:
receiving an uplink audio and video stream uploaded by a second terminal through a second preset transmission strategy of the second real-time communication interface, wherein the second real-time communication interface adopts a user datagram protocol to realize communication;
receiving a subscription instruction of a first terminal for subscribing a target audio/video stream, wherein the subscription instruction is generated by the operation of calling a first real-time communication interface for the first terminal and subscribing the target audio/video stream to be pulled;
and responding to the subscription instruction, determining a corresponding target audio-video stream from the uplink audio-video stream, and returning the target audio-video stream to the first terminal.
9. The method of claim 8, wherein the determining, in response to the subscription instruction, a corresponding target audio-video stream from the upstream audio-video stream, and returning the target audio-video stream to the first terminal, comprises:
When the uplink audio-video stream is a single live audio-video stream, responding to the subscription instruction, determining a corresponding target audio-video stream from the uplink audio-video stream, and returning the target audio-video stream to the first terminal;
when the uplink audio-video stream is a multi-user live broadcast interactive audio-video stream, forwarding the uplink audio-video stream to a mixed picture transcoding server, so that the mixed picture transcoding server performs live broadcast audio-video stream mixing processing on the uplink audio-video stream and receives the processed uplink audio-video stream;
and responding to the subscription instruction, determining a corresponding target audio-video stream from the processed uplink audio/video stream, and returning the target audio-video stream to the first terminal.
10. The audio video pulling method of claim 8, wherein the second predetermined transmission policy is associated with a second real-time communication interface, the second predetermined transmission policy comprising a protocol transmission policy and a network transmission policy,
the receiving, by the second preset transmission policy of the second real-time communication interface, the uplink audio/video stream uploaded by the second terminal includes:
the user datagram protocol is adopted at the second real-time communication interface, and the uplink audio and video stream is transmitted according to a transmission head corresponding to the user datagram protocol, so that a protocol transmission strategy of the uplink audio and video stream is determined;
Based on a preset network optimization strategy, adjusting an uplink transmission network of an uplink audio/video stream, and determining a network transmission strategy of a second real-time communication interface;
and passively receiving the uplink audio and video stream of the second terminal based on the protocol transmission policy and the network transmission policy of the second real-time communication interface.
11. An audio and video pulling device, characterized in that the audio and video pulling device is applied to a first terminal, the first terminal comprises a first real-time communication interface, and the audio and video pulling device comprises:
the first receiving unit is used for receiving the low-delay playing instruction;
the first subscription unit is used for responding to the low-delay playing instruction and calling the first real-time communication interface to subscribe the corresponding target audio and video stream, wherein the first real-time communication interface adopts a user datagram protocol to realize communication;
and the pulling unit is used for pulling the target audio/video stream to play from the media server based on a first preset transmission strategy of the first real-time communication interface.
12. An audio and video pulling device, characterized in that the audio and video pulling device is applied to a server, the server performs data transmission with a second terminal through a second real-time communication interface, and the audio and video pulling device comprises:
The second receiving unit is used for receiving the uplink audio and video stream uploaded by the second terminal through a second preset transmission strategy of the second real-time communication interface, and the second real-time communication interface adopts a user datagram protocol to realize communication;
the second subscription unit is used for receiving a subscription instruction of the first terminal for requesting to subscribe to the target audio/video stream, wherein the subscription instruction is generated by the operation of the first terminal for calling the first real-time communication interface and subscribing to the target audio/video stream to be pulled, and the subscription instruction comprises a first subscription instruction and a second subscription instruction;
and the second transmission unit is used for responding to the subscription instruction, determining a corresponding target audio-video stream from the uplink audio-video stream and returning the target audio-video stream to the first terminal.
13. A computer readable storage medium, characterized in that the storage medium stores a plurality of instructions adapted to be loaded by a processor for performing the steps of the audio video pulling method of any one of claims 1 to 10.
14. A computer device comprising a memory, a processor and a computer program stored in the memory and running on the processor, the processor implementing the steps in the audio video pulling method of any one of claims 1 to 10 when the computer program is executed.
CN202310585715.7A 2023-05-23 2023-05-23 Audio and video pulling method and device, storage medium and computer equipment Pending CN116684652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310585715.7A CN116684652A (en) 2023-05-23 2023-05-23 Audio and video pulling method and device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN116684652A true CN116684652A (en) 2023-09-01

Family

ID=87777990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310585715.7A Pending CN116684652A (en) 2023-05-23 2023-05-23 Audio and video pulling method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN116684652A (en)

Similar Documents

Publication Publication Date Title
WO2023024834A9 (en) Game data processing method and apparatus, and storage medium
CN110876080B (en) Video screen projection method and device, computer equipment and storage medium
RU2392753C2 (en) Method for sending instructions to device not to carryout synchronisation or delay synchronisation of multimedia streams
CN109889543B (en) Video transmission method, root node, child node, P2P server and system
CN108696772B (en) Real-time video transmission method and device
US8117332B2 (en) Network streaming over multiple physical interfaces
US11863841B2 (en) Video playing control method and system
KR20130140192A (en) Real-time video detector
WO2017096935A1 (en) Fast channel switching method and server, and iptv system
CN108136259B (en) Method and telecommunication network for streaming and for rendering applications
KR20160003024A (en) Data communication system and method
US9049481B2 (en) Fine-tuning the time for leaving/joining a multicast session during channel changes
KR100982630B1 (en) Device and process for adjusting the bit rate of a stream of contents and associated products
WO2013116975A1 (en) Stream media playing method, device and system
US20110320625A1 (en) Network streaming over multiple data communication channels using content feedback information
Dong et al. Ultra-low latency, stable, and scalable video transmission for free-viewpoint video services
CN112771875B (en) Improving video bit rate while maintaining video quality
Wang et al. A study of live video streaming system for mobile devices
JP5610743B2 (en) Content receiving method and apparatus
CN116684652A (en) Audio and video pulling method and device, storage medium and computer equipment
Erbad et al. Sender-side buffers and the case for multimedia adaptation
CN105306970B (en) A kind of control method and device of live streaming media transmission speed
JP2005348015A (en) Real time streaming data receiver
Erfanian Optimizing QoE and Latency of Live Video Streaming Using Edge Computing and In-Network Intelligence
CN117729157A (en) Data processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination