CN113038128B - Data transmission method and device, electronic equipment and storage medium

Data transmission method and device, electronic equipment and storage medium

Info

Publication number: CN113038128B
Authority: CN (China)
Prior art keywords: layer, video, target, audio, frame
Legal status: Active (granted)
Application number: CN202110099201.1A
Other languages: Chinese (zh)
Other versions: CN113038128A
Inventor: 李志成
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Events: application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202110099201.1A; publication of application CN113038128A; application granted; publication of grant CN113038128B

Classifications

    • H ELECTRICITY → H04 ELECTRIC COMMUNICATION TECHNIQUE → H04N PICTORIAL COMMUNICATION, e.g. TELEVISION → H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 (using adaptive coding) → H04N19/134 (characterised by the element, parameter or criterion affecting or controlling the adaptive coding) → H04N19/146 Data rate or code amount at the encoder output
    • H04N19/10 (using adaptive coding) → H04N19/169 (characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding) → H04N19/17 (the unit being an image region, e.g. an object) → H04N19/172 (the region being a picture, frame or field)
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards

Abstract

The embodiment of the application discloses a data transmission method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring audio/video encoded data to be transmitted, where the data comprise M video frame layers and an audio frame layer, each video frame layer and the audio frame layer has a corresponding hierarchical sequence number, a video frame layer with a higher sequence number depends for decoding on video frame layers with lower sequence numbers, and the sequence number of the audio frame layer is lower than that of any video frame layer; acquiring the current network state of a target network; determining a target frame layer corresponding to the current network state, where the target frame layer comprises the audio frame layer and N video frame layers whose sequence numbers run stepwise from the lowest layer upward; and sending the data corresponding to the target frame layer to a target terminal through the target network. The method relieves network congestion while preserving normal decoding and viewing quality at the target terminal, thereby mitigating abnormal video playback at the target terminal.

Description

Data transmission method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data transmission method and apparatus, an electronic device, and a storage medium.
Background
With the development of network technology, watching audio and video over a network has gradually become mainstream. For users with a poor network, however, frames are easily dropped during audio/video transmission, leading to abnormal playback such as a black screen or stuttering.
Disclosure of Invention
In view of the foregoing problems, embodiments of the present application provide a data transmission method, an apparatus, an electronic device, and a storage medium to address them.
In a first aspect, an embodiment of the present application provides a data transmission method. The method includes: acquiring audio/video encoded data to be transmitted, where the data comprise M video frame layers and an audio frame layer, each video frame layer and the audio frame layer has a corresponding hierarchical sequence number, a video frame layer with a higher sequence number depends for decoding on video frame layers with lower sequence numbers, the sequence number of the audio frame layer is lower than that of any video frame layer, and M is a positive integer; acquiring the current network state of a target network; determining a target frame layer corresponding to the current network state, where the target frame layer comprises the audio frame layer and N video frame layers whose sequence numbers run stepwise from the lowest layer upward, N is a natural number, and N is less than or equal to M; and sending the data corresponding to the target frame layer to a target terminal through the target network.
In a second aspect, an embodiment of the present application provides a data transmission method. The method includes: acquiring video encoded data to be transmitted, where the data comprise K video frame layers, each video frame layer has a corresponding hierarchical sequence number, a video frame layer with a higher sequence number depends for decoding on video frame layers with lower sequence numbers, and K is a positive integer; acquiring the current network state of a target network; determining a target frame layer corresponding to the current network state, where the target frame layer comprises T video frame layers whose sequence numbers run stepwise from the lowest layer upward, T is a natural number, and T is less than or equal to K; and sending the data corresponding to the target frame layer to a target terminal through the target network.
In a third aspect, an embodiment of the present application provides a data transmission apparatus, including: an audio/video encoded data acquisition module, a first network state acquisition module, a first target frame layer determination module, and a first data sending module. The audio/video encoded data acquisition module is used to acquire audio/video encoded data to be transmitted, where the data comprise M video frame layers and an audio frame layer, each layer has a corresponding hierarchical sequence number, a video frame layer with a higher sequence number depends for decoding on layers with lower sequence numbers, the sequence number of the audio frame layer is lower than that of any video frame layer, and M is a positive integer. The first network state acquisition module is used to acquire the current network state of the target network. The first target frame layer determination module is used to determine a target frame layer corresponding to the current network state, where the target frame layer comprises the audio frame layer and N video frame layers whose sequence numbers run stepwise from the lowest layer upward, N is a natural number, and N is less than or equal to M. The first data sending module is used to send the data corresponding to the target frame layer to the target terminal through the target network.
In a fourth aspect, an embodiment of the present application provides a data transmission apparatus, including: a video encoded data acquisition module, a third network state acquisition module, a third target frame layer determination module, and a third data sending module. The video encoded data acquisition module is used to acquire video encoded data to be transmitted, where the data comprise K video frame layers, each video frame layer has a corresponding hierarchical sequence number, a video frame layer with a higher sequence number depends for decoding on layers with lower sequence numbers, and K is a positive integer. The third network state acquisition module is used to acquire the current network state of the target network. The third target frame layer determination module is used to determine a target frame layer corresponding to the current network state, where the target frame layer comprises T video frame layers whose sequence numbers run stepwise from the lowest layer upward, T is a natural number, and T is less than or equal to K. The third data sending module is used to send the data corresponding to the target frame layer to the target terminal through the target network.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a processor and a memory; one or more programs are stored in the memory and configured to be executed by the processor to implement the methods described above.
In a sixth aspect, the present application provides a computer-readable storage medium, in which program codes are stored, wherein the program codes, when executed by a processor, perform the above-mentioned method.
In a seventh aspect, the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The computer instructions are read by a processor of the computer device from a computer-readable storage medium, and the processor executes the computer instructions, causing the computer device to perform the method described above.
According to the data transmission method and apparatus, the electronic device, and the storage medium, audio/video encoded data to be transmitted, comprising M video frame layers and an audio frame layer, is obtained; a target frame layer corresponding to the current network state of a target network is then determined, comprising the audio frame layer and N video frame layers whose hierarchical sequence numbers run stepwise from the lowest layer upward; finally, the data corresponding to the target frame layer is sent to a target terminal through the target network. Because the target frame layer includes the audio frame layer and the N lowest video frame layers, and a video frame layer with a higher sequence number relies for decoding on the layers below it, the target terminal can decode normally even when only the target frame layer corresponding to the network state is transmitted. And because the target frame layer corresponds to the network state, different target frame layers can be selected for transmission, that is, only part of the audio/video encoded data to be transmitted is sent. This reduces the number of transmitted packets, reduces the target terminal's downlink network packets and network consumption, alleviates network congestion, and thus alleviates abnormal video playback at the target terminal. Since the video encoding rate is not reduced, the viewing quality of the decoded and played video is preserved, giving the user a better interactive experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1a is a schematic diagram of an application environment proposed by an embodiment of the present application;
fig. 1b is a schematic diagram of another application environment proposed in the embodiment of the present application;
FIG. 2 is a schematic diagram of another application environment proposed by an embodiment of the present application;
fig. 3 is a schematic structural diagram illustrating an encoding framework of a video encoder according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a partition of a CU/PU/TU according to an embodiment of the present disclosure;
fig. 5 is a flowchart illustrating a data transmission method according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a layered structure of an encoded video frame according to an embodiment of the present application;
fig. 7 shows a schematic diagram of an encoded audio frame according to an embodiment of the present application;
fig. 8 shows a schematic diagram of a hierarchical structure of audio/video coded data to be transmitted according to an embodiment of the present application;
fig. 9 is a flowchart illustrating another data transmission method proposed in the embodiment of the present application;
fig. 10 is a flow chart illustrating an implementation manner of S230 in a data transmission method proposed by the embodiment shown in fig. 9;
fig. 11 is a flowchart illustrating another implementation manner of S230 in a data transmission method according to the embodiment shown in fig. 9;
fig. 12 is a flowchart illustrating another data transmission method proposed in the embodiment of the present application;
fig. 13 is a flow chart illustrating an implementation manner of S320 in a data transmission method proposed by the embodiment shown in fig. 12;
fig. 14 is a graph comparing the stutter count per hundred seconds for a certain client on a video cloud according to an embodiment of the present disclosure;
fig. 15 is a flowchart illustrating another data transmission method proposed in the embodiment of the present application;
fig. 16 is a block diagram of a data transmission apparatus according to an embodiment of the present application;
fig. 17 is a block diagram showing another data transmission apparatus proposed in the embodiment of the present application;
fig. 18 is a block diagram showing another electronic device for executing a data transmission method according to an embodiment of the present application;
fig. 19 shows a storage unit of the embodiment of the present application, which is used for storing or carrying program codes for implementing the data transmission method according to the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In existing mainstream video coding algorithms, such as H.264/H.265/H.266/AV1/VP9, frames in a GOP (Group of Pictures) sequence (I/P/B frames) by default serve as references for other frame types in the sequence. Inter-frame referencing improves coding compression efficiency, because a frame only needs to store the difference between itself and the frame it references, but it also propagates errors: if frame x is corrupted, so is any frame y that references x, then any frame z that references y, and so on. Consequently, when video data is transmitted to a terminal for decoding and playback, if a frame is discarded, decoding fails for every encoded frame that references the discarded frame, and abnormal playback such as a black screen, missing blocks, frame skipping, and stuttering occurs.
As users demand a better playback experience, network-adaptive schemes have begun to be deployed in the distribution process of related video CDNs (Content Delivery Networks) to avoid frame loss for users on poor networks and thereby avoid abnormal playback such as black screens, missing blocks, frame skipping, and stuttering. However, the inventor found in research on related network-adaptive schemes, such as HLS (HTTP Live Streaming) and DASH (Dynamic Adaptive Streaming over HTTP), that all of them reduce the video encoding rate at the network transport layer: when network transmission becomes congested, the video encoding rate is forcibly reduced. Reducing the encoding rate, however, also reduces the user's video viewing quality, giving a poor interactive experience; in addition, these schemes require downloading segments, which introduces large delay.
Therefore, the inventor proposes the data transmission method, apparatus, electronic device, and storage medium provided by the present application. In the application, audio/video encoded data to be transmitted, comprising M video frame layers and an audio frame layer, can be obtained; a target frame layer corresponding to the current network state of a target network is then determined, the target frame layer comprising the audio frame layer and N video frame layers whose hierarchical sequence numbers run stepwise from the lowest layer upward; finally, the data corresponding to the target frame layer is sent to a target terminal through the target network.
In this way, the target frame layer corresponding to the current network state of the target network can be extracted from the audio/video encoded data to be transmitted. Because the target frame layer comprises the audio frame layer and N video frame layers whose hierarchical sequence numbers run stepwise from the lowest layer upward, and a video frame layer with a higher sequence number depends for decoding on the layers below it, the target terminal can decode normally even when only the target frame layer corresponding to the network state is sent. Because the target frame layer corresponds to the network state, different target frame layers can be selected and sent for different network states; that is, part of the audio/video encoded data to be transmitted is sent selectively. This reduces the number of transmitted packets, the target terminal's downlink packets, and the network consumption, which relieves network congestion and thereby alleviates abnormal video playback at the target terminal. The video encoding rate is not reduced, so the user's video viewing quality is guaranteed and the interactive experience improves. In addition, since the scheme of the embodiment of the present application requires no segment downloading, it introduces no large delay.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be applied to the following explanations.
I frame: also known as an intra-coded picture. There may be many I frames in a sequence, and pictures following an I frame may use it, and the pictures between I frames, as motion references.
P frame: a forward-predictive coded frame. A P frame encodes the difference between itself and a previous key frame (or P frame); when decoding, this difference is superimposed on the previously buffered picture to generate the final picture.
B frame: a bi-directionally predictive (interpolation) coded frame. A B frame records the difference between the current frame and both the preceding and following frames; a B frame may or may not serve as a reference frame for other B frames.
Intra-frame prediction: forming a prediction block for the current block from already-encoded and reconstructed blocks within the same frame.
Inter-frame prediction: mainly comprises motion estimation (motion search, motion estimation criteria, sub-pixel interpolation, and motion vector estimation) and motion compensation; it performs reference and prediction/interpolation compensation along the temporal sequence at GOP granularity.
GOP (group of pictures): the interval between two I-frames.
The application environment related to the embodiment of the present application will be described below.
As shown in fig. 1a, fig. 1a is a schematic diagram illustrating an application environment according to an embodiment of the present application. The application environment includes a first terminal 110a, a server 120, and a second terminal 130a. The first terminal 110a collects the audio and video data of a user in real time and sends the collected data to the server 120. After receiving the audio and video data sent by the first terminal 110a, the server 120 encodes it to obtain M video frame layers and audio frames, layers them to obtain the audio/video encoded data to be transmitted comprising the M video frame layers and an audio frame layer, then obtains from this data the target frame layer corresponding to the current network state, and finally sends the data corresponding to the target frame layer to the second terminal 130a through the target network, so that the second terminal 130a decodes and plays it. There may be a plurality of second terminals 130a.
Optionally, after collecting the audio and video data of the user in real time, the first terminal 110a may itself encode the collected data to obtain M video frame layers and audio frames, layer them to obtain the audio/video encoded data to be transmitted comprising the M video frame layers and an audio frame layer, and then send this data to the server 120. After obtaining the audio/video encoded data to be transmitted, the server 120 may obtain from it the target frame layer corresponding to the current network state and finally send the data corresponding to the target frame layer to the second terminal 130a through the target network, so that the second terminal 130a decodes and plays it.
It should be noted that the application environment shown in fig. 1a may include, but is not limited to, a live video scene or a real-time video communication scene.
In addition, fig. 1a is an exemplary application environment, and the method provided by the embodiment of the present application may also be operated in other application environments.
As shown in fig. 1b, fig. 1b is a schematic diagram of another application environment according to an embodiment of the present application. The application environment includes a first terminal 110b and a second terminal 130b directly connected through a network; no server is needed to forward data between them. Here the first terminal 110b, after collecting the audio and video data of the user in real time, encodes and layers the collected data to obtain the audio/video encoded data to be transmitted comprising M video frame layers and an audio frame layer, then obtains from this data the target frame layer corresponding to the current network state, and finally sends the data corresponding to the target frame layer to the second terminal 130b through the network, so that the second terminal 130b decodes and plays it.
It should be noted that the application environment shown in fig. 1b may include, but is not limited to, a scenario in which audio-video communication is performed through a local area network.
In addition to the application environments shown in fig. 1a and fig. 1b, the method provided by the embodiment of the present application can also run in other application environments.
As shown in fig. 2, fig. 2 is a schematic diagram of another application environment according to an embodiment of the present application. The application environment may include a multimedia database 210, a server 220, and a third terminal 230. The server 220 may connect to the multimedia database 210 and obtain the corresponding audio and video data from it. After obtaining the audio and video data, the server 220 may encode it to obtain M video frame layers and audio frames, and layer them to obtain the audio/video encoded data to be transmitted comprising the M video frame layers and an audio frame layer. It may then obtain from this data the target frame layer corresponding to the current network state, and finally send the data corresponding to the target frame layer to the third terminal 230 through the target network, so that the third terminal 230 decodes and plays it.
Alternatively, the multimedia database 210 may be deployed in a hardware device separate from the server 220, or may be deployed directly in the server 220.
It should be noted that the application environment shown in fig. 2 may include, but is not limited to, a video-on-demand scene or a short video playing scene.
It should be noted that the servers in fig. 1a and fig. 2 may be independent physical servers, server clusters or distributed systems formed by a plurality of physical servers, or cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminals in fig. 1a, fig. 1b, and fig. 2 may be, but are not limited to, smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, and the like.
It should be noted that, in the embodiment of the present application, the audio and video data may include audio data and video data.
The audio data may be encoded by an audio encoder, and the video data by a video encoder. When the audio encoder works, it processes the audio data with a preset audio coding algorithm to obtain audio encoded data, where the preset audio coding algorithm may include G.723, MP3 (MPEG-1 Audio Layer 3), AAC (Advanced Audio Coding), and the like; G.723 is a multimedia speech codec standard formulated by the ITU-T in 1996.
When the video encoder works, it processes the video data with a preset video coding algorithm to obtain video encoded data, where the preset video coding algorithm may include H.264/H.265/H.266/AV1 and the like.
Referring to fig. 3, fig. 3 shows a schematic structural diagram of the coding framework of a video encoder according to an embodiment of the present application. Based on fig. 3, a frame of image sent to the encoder is first divided into Coding Tree Units (CTUs) of 64x64 blocks, and Coding Units (CUs) are obtained through depth division, where each CU contains Prediction Units (PUs) and Transform Units (TUs). Inter-frame prediction (including motion estimation, ME, and motion compensation, MC) and intra-frame prediction are performed on each PU to obtain a predicted value; the predicted value is subtracted from the input data to obtain a residual; the residual goes through DCT transformation and quantization, in the DCT transform unit and the quantization unit respectively, to obtain residual coefficients, which are sent to the entropy coding module to output the code stream. The residual coefficients are also inverse-quantized and inverse-transformed to obtain the residual of the reconstructed image; adding this residual to the predicted value yields the reconstructed image, which, after in-loop filtering, namely deblocking filtering (DB) and sample adaptive offset (SAO), enters the reference frame queue to serve as a reference image for the next frame, and coding then proceeds frame by frame. During prediction, starting from the Largest Coding Unit (LCU), each level is divided downward level by level according to a quadtree and computed recursively. First comes top-down division: from depth 0, a 64x64 block is divided into four 32x32 sub-CUs, a 32x32 sub-CU is divided into four 16x16 sub-CUs, and so on until depth 3, where the CU size is 8x8. Then comes bottom-up pruning: the RD costs of the four 8x8 CUs are summed (denoted cost1) and compared with the RD cost of the corresponding 16x16 CU at the previous level (denoted cost2); if cost1 is smaller than cost2, the 8x8 split is kept, otherwise pruning continues upward with level-by-level comparison, finally finding the optimal CU depth division. PU prediction is divided into intra-frame and inter-frame prediction; different PUs within the same prediction type are compared to find the optimal partition mode, and then intra and inter modes are compared to find the optimal prediction mode for the current CU. Meanwhile, a quadtree-based adaptive transform (Residual Quad-tree Transform, RQT) is performed on the CU to find the optimal TU mode. Finally, a frame of image is divided into CUs and the PUs and TUs corresponding to the CUs.
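To make the bottom-up pruning above concrete, here is a minimal Python sketch of the recursive RD-cost comparison. It is an illustration, not the encoder's actual code: rd_cost and quad_split are hypothetical stand-ins for a real rate-distortion evaluation and block splitter.

```python
import random

def rd_cost(block):
    # Hypothetical stand-in for a real rate-distortion evaluation of coding
    # `block` as a single CU; any non-negative score works for the sketch.
    return random.random() * block["size"] ** 2

def quad_split(block):
    # Split a square block into its four equally sized sub-blocks.
    half = block["size"] // 2
    return [{"size": half} for _ in range(4)]

def best_cu_partition(block, depth=0, max_depth=3):
    """Return (cost, tree) for the best CU partition of `block`.

    Top-down: recurse from 64x64 (depth 0) to 8x8 (depth 3).
    Bottom-up: keep the 4-way split only if the summed child cost
    ("cost1" in the text) beats the cost of coding the block whole
    ("cost2"); otherwise prune upward.
    """
    cost_here = rd_cost(block)
    if depth == max_depth:                      # 8x8 reached, no further split
        return cost_here, block
    children = [best_cu_partition(sub, depth + 1, max_depth)
                for sub in quad_split(block)]
    cost_split = sum(c for c, _ in children)
    if cost_split < cost_here:                  # keep the finer partition
        return cost_split, [t for _, t in children]
    return cost_here, block                     # prune: code the block whole

cost, tree = best_cu_partition({"size": 64})
```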
Referring to fig. 4, a schematic diagram of CU/PU/TU partitioning according to an embodiment of the present disclosure is shown. As shown in fig. 4, a PU has 8 partition modes, while a TU has only 2: split or not split.
Based on the coding framework shown in fig. 3, i.e., the coding principle: PU prediction is divided into intra-frame and inter-frame prediction; within the same prediction type, different PUs are compared to find the optimal partition mode, and then intra and inter modes are compared to find the optimal prediction mode for the current CU; meanwhile, a quadtree-based adaptive transform (Residual Quad-tree Transform, RQT) is performed on the CU to find the optimal TU mode. Finally, a frame of image is divided into CUs and the PUs and TUs corresponding to the CUs. Each PU is predicted to obtain a predicted value, the predicted value is subtracted from the input data to obtain a residual, and DCT transformation and quantization yield residual coefficients, which are sent to the entropy coding module to output the code stream. Each frame is determined to be an I, P, or B frame according to the encoding parameter settings and the rate-control strategy, and different frame types have different reference frame queues (an I frame is only intra-referenced, a P frame references forward I/P frames, and a B frame can reference I/P/B frames both forward and backward). The audio/video encoded data to be transmitted in this embodiment includes at least one GOP frame sequence; that is, when the audio/video encoded data to be transmitted is obtained, at least one complete GOP frame sequence is obtained. The GOP frame sequence in this embodiment may include the following cases:
Fixed GOP size and fixed sequence: for example, a fixed GOP of 120 frames, i.e., an I frame is generated every 120 frames, and the GOP frame sequence is fixed as: I B B P … B I.
Fixed GOP size with a non-fixed sequence: for example, a fixed GOP of 120 frames, i.e., an I frame is generated every 120 frames, while whether each frame of the GOP sequence is a P frame or a B frame is determined according to picture complexity and the related P/B frame generation weights.
Open GOP (neither GOP size nor sequence is fixed): the frames are generated automatically according to picture texture, motion complexity, and the I/P/B frame generation strategy and weight configuration. A small sketch of the fixed-pattern case is shown below.
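As an illustration of the fixed-GOP, fixed-sequence case above, the following sketch generates such a frame-type pattern; the B-run length is an assumed parameter, not a value from the patent.

```python
def fixed_gop_pattern(gop_size=120, b_run=2):
    """Frame-type pattern for the fixed-GOP, fixed-sequence case: an I frame
    every `gop_size` frames, with runs of B frames between P anchors. The
    run length `b_run` is an illustrative assumption."""
    types = ["I"]
    while len(types) < gop_size:
        types.extend(["B"] * b_run + ["P"])
    return types[:gop_size]

print("".join(fixed_gop_pattern(gop_size=12)))  # IBBPBBPBBPBB
```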
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 5, fig. 5 is a flowchart illustrating a data transmission method according to an embodiment of the present application, where the method includes:
s110, audio and video coding data to be transmitted are obtained, the audio and video coding data to be transmitted comprise M layers of video frame layers and audio frame layers, each layer of video frame layer and each layer of audio frame layer have corresponding hierarchical sequence numbers, the video frame layer with the higher hierarchical sequence number depends on the video frame layer with the lower hierarchical sequence number for decoding, the hierarchical sequence number of the audio frame layer is lower than that of any video frame layer, and M is a positive integer.
The audio/video encoded data to be transmitted can be regarded as data obtained by encoding the audio/video data to be transmitted to the target terminal into M video frame layers and audio frames, and then layering the M video frame layers together with the audio frames.
It can be understood that, taking the application environment of fig. 1a or fig. 2 as an example, there may be several ways of obtaining the audio/video encoded data to be transmitted. Optionally, if the server acquires raw audio and video data from another device, the server needs to encode and layer it before the audio/video encoded data to be transmitted is obtained. Optionally, if the server acquires already encoded and layered audio/video encoded data to be transmitted from another device, it can be used directly without further encoding or layering. Optionally, if the server acquires encoded audio frames and M video frame layers from another device, the server needs to perform the layering processing on them to obtain the audio/video encoded data to be transmitted. Whichever way it is obtained, the audio/video data to be transmitted to the target terminal must be encoded and layered; the difference lies only in which device encodes and layers the original audio and video data.
To parallelize encoding and improve coding efficiency, in some implementations, obtaining the encoded M video frame layers corresponding to the audio/video to be transmitted includes: performing layered coding on the video data included in the audio/video to be transmitted to obtain the M encoded video frame layers.
The video encoder may adopt a layered coding mode when encoding video data. With video frames obtained through layered coding, one set of logic for the dependency relationships among video frames can be maintained, and all video frames whose dependencies are satisfied can be encoded simultaneously. After layered coding, M hierarchical video frame layers are finally obtained, and each video frame layer includes its corresponding video frames. Each video frame carries a corresponding encoded frame number, which resides in the pps unit of the NALU (Network Abstraction Layer Unit) and is generated automatically when the encoder encodes.
In the embodiment of the present application, among the M hierarchical video frame layers obtained by layered coding of the video data, a video frame in a layer with a higher hierarchical sequence number depends for decoding on video frames in layers with lower sequence numbers, while a video frame in a lower layer does not depend on video frames in higher layers. That is, when a video frame in a higher layer is decoded, data from the layers below the layer it belongs to is required.
Exemplarily, referring to fig. 6, a schematic diagram of the layered structure of encoded video frames is shown. As shown in fig. 6, assume a video sequence includes a GOP frame sequence of 17 frames (frame 0, frame 1, frame 2, … frame 16) and the video encoder adopts a 5-layer hierarchical coding structure. The layered coding proceeds as follows: frames 0 and 16 of the first layer are encoded, then frame 8 of the second layer, after which frames 4 and 12 of the third layer can be encoded at the same time; frames 2 and 6 of the fourth layer can be encoded concurrently once frame 4 of the third layer is encoded, and frames 10 and 14 of the fourth layer once frame 12 is encoded; frames 1, 3, 5, 7, 9, 11, 13, and 15 of the fifth layer can be encoded concurrently once the fourth-layer frames they depend on are encoded, and so on. Finally, 5 hierarchical video frame layers are obtained after layered coding.
In the 5-level video frame hierarchy, predictive decoding of a video frame layer depends on the video frame layers below it; the arrows in fig. 6 express each frame's dependencies. For example, decoding a layer-5 frame depends on frames in layer 4, layer 3, layer 2, and frames 0 and 16 in layer 1; decoding frame 12 of layer 3 depends on frame 8 of layer 2 and frames 0 and 16 of layer 1, but not on frames 10 and 14 of layer 4 or frames 9, 11, 13, and 15 of layer 5.
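The dyadic layering of fig. 6 can be computed directly from the frame index. The following is a minimal sketch assuming the 17-frame, 5-layer structure described above; real encoders take the layering from their configuration instead.

```python
def temporal_layer(frame_idx, gop_size=16):
    """Hierarchy layer (1 = lowest) of a frame in the dyadic hierarchical
    structure of fig. 6 (17 frames, 5 layers, gop_size = 16)."""
    i = frame_idx % gop_size
    if i == 0:                              # frames 0 and 16 -> layer 1
        return 1
    n_layers = gop_size.bit_length()        # 5 when gop_size == 16
    lowest_bit = (i & -i).bit_length()      # position of i's lowest set bit
    return n_layers - lowest_bit + 1        # 8 -> 2, 4/12 -> 3, 2/6/10/14 -> 4, odd -> 5

assert [temporal_layer(i) for i in (0, 8, 4, 12, 2, 1, 16)] == [1, 2, 3, 3, 4, 5, 1]
```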
In other embodiments, a simple coding mode may also be used when encoding the video data, without layered coding; that is, encoding the video data yields a flat sequence of video frames, and the M video frame layers cannot be obtained directly. In this case, obtaining the encoded M video frame layers corresponding to the audio/video to be transmitted includes: encoding the video data included in the audio/video to be transmitted to obtain an encoded video frame sequence; and layering the video frames of the encoded sequence according to the decoding dependencies among them to obtain the M encoded video frame layers.
If a simple coding mode is adopted when the video data is encoded, the video frame sequence then needs to be layered according to the decoding dependencies among the video frames, so that M hierarchical video frame layers are obtained as the layered coding result. Among these M layers, a video frame in a layer with a higher hierarchical sequence number depends for decoding on video frames in lower layers, and a video frame in a lower layer is decoded independently of video frames in higher layers.
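A minimal sketch of this dependency-driven layering, assuming the reference list of each frame is known (the frame indices and references here are illustrative, chosen to match the top of fig. 6):

```python
def assign_layers(refs):
    """Layer a flat encoded frame sequence by decoding dependency: a frame
    with no references sits on layer 1, and every other frame sits one layer
    above the highest layer it references. `refs` maps frame index -> list of
    frame indices that frame depends on for decoding."""
    layers = {}

    def layer_of(f):
        if f not in layers:
            deps = refs[f]
            layers[f] = 1 if not deps else 1 + max(layer_of(r) for r in deps)
        return layers[f]

    for f in refs:
        layer_of(f)
    return layers

# Frame 8 references frames 0 and 16; frame 4 references frames 0 and 8.
print(assign_layers({0: [], 16: [], 8: [0, 16], 4: [0, 8]}))
# -> {0: 1, 16: 1, 8: 2, 4: 3}
```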
It will be appreciated that audio data is encoded independently; therefore, there are no dependencies between audio frames, and the audio encoder does not need layered coding. Referring to fig. 7, a diagram of encoded audio frames is shown.
After the M video frame layers and the audio frames are obtained, the audio frames and the M video frame layers can be layered together to obtain the audio/video encoded data to be transmitted. The layering process may be understood as re-assigning hierarchical sequence numbers to the M video frame layers and to the audio frames. For the same audio/video content, transmitting audio frames consumes less network capacity than transmitting video frames, and when the network quality is extremely poor the user can at least still hear the sound; therefore, a smaller hierarchical sequence number can be allocated to the audio frames, making the sequence number of the audio frame layer smaller than that of any video frame layer. For example, the audio frames may be assigned layer 0. Fig. 8 is a schematic diagram of the hierarchical structure of the audio/video encoded data to be transmitted: the audio/video data to be transmitted to the target terminal is encoded and layered to obtain the audio/video encoded data to be transmitted, which includes 5 video frame layers and an audio frame layer; each layer has a corresponding hierarchical sequence number, a video frame layer with a higher sequence number depends for decoding on layers with lower sequence numbers, and the sequence number of the audio frame layer is 0, lower than that of any video frame layer. The value of M may be determined according to the dependency relationships between the video frames included in the GOP frame sequence.
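The following sketch shows one possible representation of this merging step; MediaUnit and relayer are illustrative names, not structures from the patent.

```python
from dataclasses import dataclass

@dataclass
class MediaUnit:
    layer: int       # 0 = audio frame layer; 1..M = video layers, low to high
    is_audio: bool
    payload: bytes

def relayer(audio_frames, video_layers):
    """Merge encoded audio frames and the M encoded video frame layers into
    one layered stream: audio is assigned layer 0, and video layer k keeps
    its 1-based hierarchical sequence number k."""
    units = [MediaUnit(0, True, a) for a in audio_frames]
    for k, frames in enumerate(video_layers, start=1):
        units.extend(MediaUnit(k, False, f) for f in frames)
    return units
```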
It should be noted that the target terminal is the terminal on which the user watches the audio/video data. Specifically, the target terminal uses the video client installed on it to interact with the server and receive the data corresponding to the target frame layer.
The audio and video data to be transmitted to the target terminal may include audio data and video data, where the original audio data may be encoded by an audio encoder and the video data by a video encoder. When the audio encoder works, it processes the audio data with a preset audio coding algorithm to obtain encoded audio frames, where the preset algorithm may include G.723/MP3/AAC and the like. When the video encoder works, it processes the video data with a preset video coding algorithm to obtain encoded video frames, where the preset algorithm may include H.264/H.265/H.266/AV1 and the like. As one mode, the functions of the video encoder and the audio encoder may be integrated into a single audio/video encoder, which can process audio data into encoded audio frames and video data into encoded video frames.
And S120, acquiring the current network state of the target network.
It can be understood that the audio/video encoded data to be transmitted acquired by the server is sent to the target terminal through a network. The target network can thus be understood as the network over which the server and the target terminal communicate, and the current network state of the target network as the network quality of that network at the moment of acquisition.
It should be noted that the server may be connected to multiple target terminals, each corresponding to its own target network with the server, and that the total network bandwidth of a target terminal and its stability affect the network state of the target network. Whichever target terminal the server sends the audio/video encoded data to, the server needs to acquire the current network state of the target network between itself and that terminal.
As one manner, the server may obtain the current network state of the target network from the bandwidth of the target network and the video coding rate; it can be understood that when the network bandwidth is smaller than the video coding rate, the larger the gap between the two, the worse the current network state of the target network.
Alternatively, the server may first obtain the slow-send ratio of the target network and then derive the current network state from it; it can be understood that the larger the slow-send ratio, the worse the current network state. The slow-send ratio can be understood as the degree to which data sent from the server to the target terminal is delayed: the more delayed the sending, the larger the slow-send ratio.
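Combining the two manners above, a network-state classifier might look like the following sketch. The numeric thresholds are assumptions for illustration; the embodiment fixes only the signals (bandwidth versus coding rate, slow-send ratio), not their values.

```python
def current_network_state(bandwidth_bps, video_bitrate_bps, slow_send_ratio):
    """Classify the target network into the three states used below. The
    thresholds (0.5, 0.2, 0.1) are illustrative assumptions."""
    if slow_send_ratio >= 0.5 or bandwidth_bps < 0.2 * video_bitrate_bps:
        return 1    # type 1: network effectively down -> audio only
    if slow_send_ratio >= 0.1 or bandwidth_bps < video_bitrate_bps:
        return 2    # type 2: congested -> drop the highest video layers
    return 3        # type 3: healthy -> send every layer
```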
S130, determining a target frame layer corresponding to the current network state, where the target frame layer comprises the audio frame layer and N video frame layers whose hierarchical sequence numbers run stepwise from the lowest layer upward, N is a natural number, and N is less than or equal to M.
The target frame layers may be understood as the layers whose data needs to be transmitted to the target terminal, comprising the audio frame layer and N video frame layers counted stepwise from the lowest layer upward. The audio frame layer is the hierarchy corresponding to the audio frames, and a video frame layer is a hierarchy corresponding to video frames.
For a target network whose bandwidth is insufficient or unstable, network congestion and TCP zero-window phenomena easily arise at the target terminal during decoding and playback. Therefore, to reduce their occurrence, part of the video frames may be discarded; that is, the server obtains only the data of some target frame layers according to the current network state of the target network. Note, however, that because prediction dependencies exist between video frames in the same GOP frame sequence, the discarded video frames must not serve as prediction reference frames for the retained ones.
As can be seen from the foregoing, in the embodiments of the present application, the audio/video encoded data to be transmitted comprises M video frame layers and an audio frame layer, and a video frame layer with a higher hierarchical sequence number depends for decoding on layers with lower sequence numbers. Because a higher layer depends only on lower layers for decoding prediction, even if a higher video frame layer is discarded, every retained frame can still find the frames it depends on at decoding time, so normal decoding and playback remain possible; and because the audio frame layer, with the lowest sequence number, is decoded independently, it depends on no video frame layer at all. Therefore, for the different network qualities corresponding to the current network state, N video frame layers can be retained stepwise from the lowest layer upward according to the hierarchical sequence numbers. For an extremely bad user network, such as a breakdown, all video frame layers may be discarded, i.e., N equals 0, and only the audio frame layer is kept; even then, the user is guaranteed to hear the voice, though without seeing the video.
There may be various ways to determine the target frame layer corresponding to the current network state.
As an embodiment, the target frame layer corresponding to the current network state may be determined based on a first correspondence table, where the first correspondence table includes a plurality of network states and a target frame layer corresponding to each network state, and the network quality corresponding to the current network state is positively correlated to the number of the target frame layers.
It can be understood that the better the network quality corresponding to the current network state, the more data packets the current network can carry, the fewer packets are discarded, and the more target frame layers there are; the network quality is therefore positively correlated with the number of target frame layers.
The first correspondence table can be formulated according to the number of network state types and the number of hierarchical layers in the audio/video encoded data to be transmitted.
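For example, with M = 5 video frame layers and the three network states used below, a first correspondence table might be sketched as follows; the concrete rows are illustrative assumptions, not values from the patent.

```python
# Each network state maps to the hierarchical sequence numbers to send
# (0 is always the audio frame layer).
FIRST_CORRESPONDENCE_TABLE = {
    1: [0],                   # worst network: audio frame layer only (N = 0)
    2: [0, 1, 2, 3],          # moderate: audio + the P lowest video layers
    3: [0, 1, 2, 3, 4, 5],    # best: audio + all M video layers (N = M)
}

def target_frame_layers(state):
    return FIRST_CORRESPONDENCE_TABLE[state]
```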
Alternatively, the target frame layer corresponding to the current network state may be determined based on a pre-trained first classification model. After the current network state is obtained, it can be input into the pre-trained first classification model, which determines the corresponding target frame layer through feature extraction followed by feature processing. The first classification model may be obtained by training an initial model, with the training samples being a plurality of network states carrying target-frame-layer labels.
And S140, sending the data corresponding to the target frame layer to a target terminal through a target network.
After the target frame layer corresponding to the current network state is determined, data corresponding to the target frame layer can be sent to the target terminal through the target network.
In the data transmission method, audio/video encoded data to be transmitted, comprising M video frame layers and an audio frame layer, is acquired; the target frame layer corresponding to the current network state of the target network is then determined, comprising the audio frame layer and N video frame layers whose hierarchical sequence numbers run stepwise from the lowest layer upward; finally, the data corresponding to the target frame layer is sent to the target terminal through the target network. Because the target frame layer includes the audio frame layer and the N lowest video frame layers, and a higher video frame layer depends for decoding on the layers below it, the target terminal can decode normally even when only the target frame layer corresponding to the network state is sent. Because the target frame layer corresponds to the network state, different target frame layers can be selected and sent for different network states, i.e., part of the audio/video encoded data to be transmitted is sent selectively, which reduces the number of transmitted packets, the target terminal's downlink packets, and the network consumption, relieves network congestion, and alleviates abnormal video playback at the target terminal. The video encoding rate is not reduced, so the viewing quality of the decoded and played video is guaranteed, giving the user a better interactive experience.
Referring to fig. 9, fig. 9 is a flowchart illustrating a data transmission method according to another embodiment of the present application, where the method includes:
s210, audio and video coding data to be transmitted are obtained, the audio and video coding data to be transmitted comprise M layers of video frame layers and audio frame layers, each layer of video frame layer and each layer of audio frame layer have corresponding hierarchical sequence numbers, the video frame layer with the higher hierarchical sequence number depends on the video frame layer with the lower hierarchical sequence number for decoding, the hierarchical sequence number of the audio frame layer is lower than that of any video frame layer, and M is a positive integer.
S220, acquiring the current network state of the target network.
And S230, determining a target frame layer corresponding to the current network state, wherein the target frame layer comprises an audio frame layer and N video frame layers with hierarchical sequence numbers arranged from the lowest layer to the higher layer in a progressive manner, N is a natural number, and N is less than or equal to M.
As a mode, the audio and video coded data to be transmitted include an audio frame layer with hierarchical sequence number 0 and video frame layers whose hierarchical sequence numbers increase sequentially from layer 1 to layer M, and the network state includes a type 1 network state, a type 2 network state and a type 3 network state, where the network quality of the type 1 network state is the worst, that of the type 2 network state is moderate, and that of the type 3 network state is the best. In this case, referring to fig. 10, determining the target frame layer corresponding to the current network state includes:
S231, when the current network state is the type 1 network state, determining the audio frame layer with hierarchical sequence number 0 as the target frame layer corresponding to the current network state.
S232, when the current network state is the type 2 network state, determining the audio frame layer with hierarchical sequence number 0 and the video frame layers with hierarchical sequence numbers 1 to P as the target frame layers corresponding to the current network state, where P is a positive integer less than M.

S233, when the current network state is the type 3 network state, determining the audio frame layer with hierarchical sequence number 0 and the video frame layers with hierarchical sequence numbers 1 to M as the target frame layers corresponding to the current network state.
In this embodiment, the network states are divided into 3 types. The network quality corresponding to the type 1 network state is the worst, which can be understood as a network crash. In this case only the audio frame layer is kept, so the user at the target terminal can still listen to the voice even though no video can be watched. The overall experience is still better than in the related art, because data packets are deliberately withheld rather than pushed into a failing network, avoiding the continuous retransmission and link congestion that would otherwise occur.
The network quality corresponding to the type 2 network state lies between that of the type 1 and type 3 network states, so discarding part of the video frame layer data can be considered. Not all video data needs to be dropped; only the data of the video frame layers that do not affect the viewing experience is discarded. This relieves network congestion while preserving the user's viewing experience, and thus alleviates abnormal video playing at the target terminal.
The network quality corresponding to the type 3 network state is the best, that is, the network is normal and there is no congestion, so no video data needs to be discarded.
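As an illustrative sketch, the three-type mapping of S231-S233 can be written as follows, where hierarchical sequence number 0 is the audio frame layer and the value of P is a placeholder chosen for the example:

    def layers_for_state(state_type: int, M: int, P: int) -> list:
        # Hierarchical sequence numbers: 0 is the audio frame layer,
        # 1..M are the video frame layers.
        if state_type == 1:                   # worst network: audio only
            return [0]
        if state_type == 2:                   # moderate: audio + layers 1..P, P < M
            return list(range(0, P + 1))
        return list(range(0, M + 1))          # best network: everything

    print(layers_for_state(2, M=5, P=2))      # [0, 1, 2]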
Alternatively, the correspondence between the network states of the various types in S231-S233 and the target frame layer may be obtained based on the first correspondence table.
Alternatively, the correspondence between the network states of the various types in S231-S233 and the target frame layer may be obtained based on the first classification model.
As another mode, the audio and video coded data to be transmitted include an audio frame layer with hierarchical sequence number 0 and video frame layers whose hierarchical sequence numbers increase sequentially from layer 1 to layer M, and the network state includes a type 1 network state to a type M+1 network state in which the network quality sequentially increases, so that the number of network state types equals the number of layers of the audio and video coded data to be transmitted. In this case, referring to fig. 11, determining the target frame layer corresponding to the current network state includes:
S234, when the current network state is the type 1 network state, determining the audio frame layer with hierarchical sequence number 0 as the target frame layer corresponding to the current network state.
S235, when the current network state is the type L network state, determining the audio frame layer with hierarchical sequence number 0 and the video frame layers with hierarchical sequence numbers 1 to L-1 as the target frame layers corresponding to the current network state, where L ∈ [2, M] and L is a natural number.

S236, when the current network state is the type M+1 network state, determining the audio frame layer with hierarchical sequence number 0 and the video frame layers with hierarchical sequence numbers 1 to M as the target frame layers corresponding to the current network state.
In this embodiment, the number of network state types equals the number of layers of the audio and video coded data to be transmitted. As described above, the better the network quality corresponding to the current network state, the more data packets the current network can carry, so the fewer packets need to be discarded and the more layers the corresponding target frame layer contains. In this case, therefore, the target frame layer corresponding to the current network state can be determined directly from the one-to-one correspondence between the network state type and the layers of the audio and video coded data to be transmitted.
Optionally, the one-to-one correspondence between the type of the network status in S234 to S236 and the level of the audio-video encoded data to be transmitted may be obtained based on the first correspondence table.
Optionally, the one-to-one correspondence between the type of the network status in S234 to S236 and the level of the audio-video encoded data to be transmitted may be obtained based on the first classification model.
That is, if the current network state is the type 1 network state, the audio frame layer with hierarchical sequence number 0 is determined as the target frame layer corresponding to the current network state, which is equivalent to keeping only the audio frame layer when the network quality is the worst.

If the current network state is the type L network state, the audio frame layer with hierarchical sequence number 0 and the video frame layers with hierarchical sequence numbers 1 to L-1 are determined as the target frame layers corresponding to the current network state, which is equivalent to keeping the audio frame layer and part of the video frame layers when the network quality lies between the worst and the best.

If the current network state is the type M+1 network state, the audio frame layer with hierarchical sequence number 0 and the video frame layers with hierarchical sequence numbers 1 to M are determined as the target frame layers corresponding to the current network state, that is, all the audio and video coded data to be transmitted are kept when the network quality is the best.
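For illustration, this one-to-one rule of S234-S236 reduces to a single expression. The sketch below assumes layers are identified by their hierarchical sequence numbers 0 (audio) to M (highest video layer):

    def layers_for_state_type(s: int, M: int) -> list:
        # For state type s in 1..M+1, keep layer 0 plus video layers 1..s-1.
        assert 1 <= s <= M + 1
        return list(range(0, s))

    print(layers_for_state_type(1, 5))   # [0]                 audio only
    print(layers_for_state_type(4, 5))   # [0, 1, 2, 3]        audio + layers 1..3
    print(layers_for_state_type(6, 5))   # [0, 1, 2, 3, 4, 5]  all layers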
Illustratively, continuing to take the layered structure of the audio and video coded data to be transmitted shown in fig. 8 as an example, the audio and video coded data to be transmitted include an audio frame layer with hierarchical sequence number 0 and video frame layers whose hierarchical sequence numbers increase sequentially from layer 1 to layer 5, and the network state includes a type 1 network state to a type 6 network state in which the network quality sequentially increases.

The type 1 network state may be a network crash; the type 2 to type 4 network states may be, in order, network extremely poor, network very poor and network relatively poor; the type 5 network state may be network poor; and the type 6 network state may be network good. In this case, the target frame layers respectively corresponding to the network states are as follows:
and when the current network state is the type 1 network state, namely the network is crashed, determining an audio frame layer with the level sequence number of the 0 th layer as a target frame layer corresponding to the current network state.
And when the current network state is the type 2 network state, namely the network is relatively poor, determining an audio frame layer with the hierarchical sequence number of the layer 0 and a video frame layer with the hierarchical sequence number of the layer 1 as target frame layers corresponding to the current network state.
And when the current network state is the 3 rd type network state, namely the network is relatively poor, determining an audio frame layer with the hierarchical sequence number of the 0 th layer and video frame layers with the hierarchical sequence numbers of the 1 st layer to the 2 nd layer as a target frame layer corresponding to the current network state.
When the current network state is the 4 th type network state, that is, the network is relatively very bad, the audio frame layer with the hierarchical sequence number of the 0 th layer and the video frame layers with the hierarchical sequence numbers of the 1 st layer to the 3 rd layer are determined as the target frame layer corresponding to the current network state.
When the current network state is the 5 th type network state, namely the network difference, the audio frame layer with the hierarchical sequence number of the 0 th layer and the video frame layers with the hierarchical sequence numbers of the 1 st layer to the 4 th layer are determined as the target frame layer corresponding to the current network state.
When the current network state is the 6 th type network state, namely the network is good, an audio frame layer with the hierarchy sequence number of the 0 th layer and video frame layers with the hierarchy sequence numbers of the 1 st layer to the 5 th layer are determined as target frame layers corresponding to the current network state.
And S240, sending the data corresponding to the target frame layer to the target terminal through the target network.
And S250, acquiring the network state of the target network at the moment when the preset condition is triggered as a new network state under the condition that the preset condition is met.
Considering that the target network may fluctuate, always sending the data corresponding to the same target frame layer may cause congestion again after a period of time, making decoding and playing at the target terminal abnormal. Therefore, to avoid this situation, when a preset condition is met, the network state of the target network at the moment the preset condition is triggered can be acquired as a new network state.
The preset condition may take various forms. As one way, the preset condition may be a set time period, that is, the server acquires the network state of the target network each time a time point corresponding to the period arrives; for example, the period may be 3 seconds, 5 seconds or 10 seconds. It should be noted that the length of the period may be set according to the network fluctuation: for a target network with larger fluctuation the period should be relatively short, and for a target network with smaller fluctuation it may be set longer.
Alternatively, the preset condition may be a set number of data packets sent, for example, the server may acquire the network state of the target network each time it has sent a preset number of data packets through the target network. It should be noted that the preset number may be set according to the network fluctuation: for a target network with large fluctuation the preset number should be relatively small, and for a target network with small fluctuation it may be larger.
Alternatively, the preset condition may be the start of a new GOP (Group of Pictures) frame sequence, that is, the server acquires the network state of the target network each time it prepares to send a new GOP sequence.
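Purely as a sketch, the three optional preset conditions above could be combined as follows; the period and packet-count defaults are assumed values, not ones prescribed by this embodiment:

    import time

    class RefreshTrigger:
        # Assumed defaults; the embodiment does not prescribe concrete values.
        def __init__(self, period_s: float = 5.0, packet_limit: int = 1000):
            self.period_s = period_s
            self.packet_limit = packet_limit
            self.last_check = time.monotonic()
            self.packets_sent = 0

        def should_refresh(self, new_gop_starting: bool = False) -> bool:
            # True when the network state of the target network should be re-read.
            if new_gop_starting:                                      # condition 3
                return True
            if time.monotonic() - self.last_check >= self.period_s:  # condition 1
                return True
            return self.packets_sent >= self.packet_limit             # condition 2

        def reset(self) -> None:
            self.last_check = time.monotonic()
            self.packets_sent = 0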
And S260, determining a new target frame layer corresponding to the new network state, wherein the new target frame layer comprises an audio frame layer and a Q-layer video frame layer with the sequence number of the layers arranged from the lowest layer to the higher layer in a step-by-step mode, Q is a natural number, and Q is smaller than or equal to M.
After the new network state is obtained, the method for determining the target frame layer corresponding to the current network state may be used to determine a new target frame layer corresponding to the new network state.
And S270, sending the data corresponding to the new target frame layer to the target terminal through the target network.
After determining a new target frame layer corresponding to the new network state, the server may send data corresponding to the new target frame layer to the target terminal through the target network, so that the target terminal can perform decoding and playing according to the data corresponding to the new target frame layer.
According to the data transmission method provided by this embodiment of the application, on the premise that the target terminal can still decode normally from the data corresponding to the target frame layer and the viewing quality of the video is guaranteed, different target frame layers can be selected and sent to suit different network states. This reduces the number of data packets sent, lowers network consumption, relieves network congestion and alleviates abnormal video playing at the target terminal. Moreover, when the preset condition is met, the target frame layer being sent can be adaptively adjusted again, further reducing abnormal decoding and playing at the target terminal.
Referring to fig. 12, fig. 12 is a flowchart illustrating a data transmission method according to another embodiment of the present application, where the method includes:
S310, audio and video coding data to be transmitted are obtained, the audio and video coding data to be transmitted comprise M layers of video frame layers and audio frame layers, each layer of video frame layer and each layer of audio frame layer have corresponding hierarchical sequence numbers, the video frame layer with the higher hierarchical sequence number depends on the video frame layer with the lower hierarchical sequence number for decoding, the hierarchical sequence number of the audio frame layer is lower than that of any video frame layer, and M is a positive integer.
And S320, acquiring a data transmission slow speed ratio of the target network.
The data sending slow speed ratio can be understood as the degree to which the data the server sends to the target terminal lags behind the data it has acquired: the greater the lag in sending, the larger the data sending slow speed ratio.
As one way, as shown in fig. 13, obtaining the data transmission slow speed ratio of the target network includes:
S321: acquiring the decoding timestamp closest to the current time among the decoding timestamps corresponding to the audio and video coded data to be transmitted, and the current decoding timestamp corresponding to the video frame sent by the target network at the current time.
The audio and video coded data to be transmitted include a plurality of video frames, and each video frame carries a DTS (Decoding Time Stamp) when it is encoded.
The decoding timestamp closest to the current time among the decoding timestamps corresponding to the audio and video coded data to be transmitted can be understood as the decoding timestamp of the last, that is, the most recently acquired, video frame among all the coded data the server has acquired. It can be understood that, in order to send the data corresponding to the target frame layer to the target terminal, the server needs to acquire the audio and video coded data to be transmitted continuously, and each acquired video frame carries a decoding timestamp, so the decoding timestamp closest to the current time can be determined from the order in which the video frames were acquired. Illustratively, if the server starts acquiring the audio and video coded data at second 0, the current time is second 3, and the frame acquired at second 3 is the 15th frame, then the decoding timestamp closest to the current time is the decoding timestamp carried by the 15th frame.
The current decoding timestamp corresponding to the video frame sent by the target network at the current time is the decoding timestamp of the video frame being sent through the target network at that moment. Illustratively, if the frame being sent through the target network at the current time is the 5th frame, the decoding timestamp carried by the 5th frame is the current decoding timestamp.
From the foregoing, the embodiments of the present application can be applied to the application environments shown in fig. 1 or fig. 2. When applied to the application environment shown in fig. 1, the decoding timestamp closest to the current time among the decoding timestamps corresponding to the audio and video coded data to be transmitted is the decoding timestamp of the latest encoded video frame uplinked by the first terminal.

When applied to the application environment shown in fig. 2, the decoding timestamp closest to the current time among the decoding timestamps corresponding to the audio and video coded data to be transmitted is the decoding timestamp of the latest encoded video frame that the server obtains from the multimedia database.
S322: acquiring the absolute value of the difference between the decoding timestamp closest to the current moment and the current decoding timestamp as the data sending slow speed ratio corresponding to the target network.
It can be understood that the absolute value T of the difference between the decoding timestamp closest to the current moment and the current decoding timestamp can be regarded as the interval between the time the server acquires a piece of the audio and video coded data to be transmitted and the time it sends it. The larger the absolute value, the larger the interval between the two timestamps: the server only sends out the acquired coded data after a long delay, that is, its sending to the target terminal lags far behind. Conversely, the smaller the absolute value, the less the sending lags.
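As an illustrative sketch of S321-S322, assuming DTS values in milliseconds and a hypothetical 30 fps stream:

    def slow_speed_ratio_seconds(latest_acquired_dts_ms: int,
                                 currently_sending_dts_ms: int) -> float:
        # T = |DTS of the newest acquired frame - DTS of the frame being sent|.
        return abs(latest_acquired_dts_ms - currently_sending_dts_ms) / 1000.0

    # Frame 15 has just been acquired while frame 5 is being sent; at an
    # assumed 30 fps each frame advances the DTS by roughly 33 ms.
    print(slow_speed_ratio_seconds(15 * 33, 5 * 33))   # ~0.33 seconds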
And S330, obtaining the current network state of the target network based on the data sending slow speed ratio.
The current network state of the target network is obtained based on the data sending slow speed ratio, and various modes can be provided.
As one way, the current network state corresponding to the data sending slow speed ratio may be determined based on a second correspondence table. Specifically, the current network state corresponding to the data sending slow speed ratio is looked up in the second correspondence table, which includes a plurality of network states and the slow speed ratio range corresponding to each network state.
For example, assuming that there are 6 network states, the second correspondence table may store the following correspondence relationships:
The network is good: T ∈ [0, 1) seconds;

The network is poor: T ∈ [1, 2) seconds;

The network is relatively poor: T ∈ [2, 3) seconds;

The network is very poor: T ∈ [3, 4) seconds;

The network is extremely poor: T ∈ [4, 5) seconds;

Network crash: T ∈ [5, +∞) seconds.
It should be noted that, the correspondence between the network state and the data transmission slow speed ratio range expressed by the second correspondence table is only an optional example, and in some cases, the correspondence between the network state and the data transmission slow speed ratio range expressed by the second correspondence table may also be appropriately adjusted.
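For illustration, the lookup against such a second correspondence table can be sketched with a sorted-boundary search; the grade names and the [lower, upper) boundaries simply mirror the example listing above and may be adjusted as just noted:

    import bisect

    T_UPPER_BOUNDS = [1, 2, 3, 4, 5]   # upper bounds of the T ranges, in seconds
    NETWORK_STATES = ["network good", "network poor", "network relatively poor",
                      "network very poor", "network extremely poor", "network crash"]

    def current_network_state(t_seconds: float) -> str:
        return NETWORK_STATES[bisect.bisect_right(T_UPPER_BOUNDS, t_seconds)]

    print(current_network_state(0.4))   # network good
    print(current_network_state(3.2))   # network very poor
    print(current_network_state(7.0))   # network crash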
Alternatively, the current network state of the target network may be obtained from the data sending slow speed ratio by a pre-trained second classification model. After the data sending slow speed ratio is obtained, it can be input into the second classification model, which performs feature extraction and then feature processing to determine the current network state corresponding to the ratio. The second classification model may be obtained by training an initial model, and the training samples may be a plurality of data sending slow speed ratios each carrying a network state identifier.
S340, determining a target frame layer corresponding to the current network state, wherein the target frame layer comprises an audio frame layer and N video frame layers with hierarchical sequence numbers arranged from the lowest layer to the high layer in a progressive manner, N is a natural number, and N is less than or equal to M.
And S350, sending the data corresponding to the target frame layer to the target terminal through the target network.
The data transmission method provided by this embodiment of the application can, on the premise that the target terminal can still decode normally from the data corresponding to the target frame layer and the viewing quality of the video is guaranteed, select and send different target frame layers to suit different network states, thereby reducing the number of data packets sent, lowering network consumption, relieving network congestion and alleviating abnormal video playing at the target terminal. In addition, in this embodiment, the absolute value of the difference between the decoding timestamp closest to the current moment and the current decoding timestamp is used as the data sending slow speed ratio of the target network, and the current network state corresponding to that ratio is then determined from the second correspondence table, which provides a simple way of obtaining the current network state.
The following describes the effect of the data transmission method provided by the embodiment of the present application with reference to the accompanying drawings.
Referring to fig. 14, fig. 14 is a comparison graph of the per-hundred-seconds stutter count of a video cloud client on two adjacent days according to an embodiment of the present application, where the abscissa is the time of day from 00:00 to 24:00 and the ordinate is the stutter count per hundred seconds. Fig. 14 contains two stutter-count curves representing the two adjacent days. Curve 1 is the stutter count recorded at the client on a day when the server did not apply the data transmission method of the embodiment of the present application at any time. In curve 2, from 00:00 to 13:45 the server did not apply the method of this embodiment, while from 13:45 to the current time 15:35 it did. As can be seen from fig. 14, while the server does not use the data transmission method of this embodiment, the stutter counts of the client over the same period of the two days are substantially the same; after the server adopts the method, the stutter count of the client is significantly reduced compared with the same period of the previous day. Therefore, the method of the embodiment of the present application can effectively avoid network congestion for users on a poor network and thus reduce stutter when watching audio and video.
It should be noted that the present application provides specific examples of the foregoing implementable embodiments; provided they do not conflict with one another, these examples may be combined arbitrarily to form a new data transmission method, and any data transmission method formed by such a combination falls within the protection scope of the present application.
It is contemplated that in some scenarios the data acquired by the server includes video frames but no audio frames, such as silent short videos or silent live streaming. For this scenario, please refer to fig. 15, which is a flowchart illustrating a data transmission method according to another embodiment of the present application. The method includes:
S410, acquiring video coded data to be transmitted, wherein the video coded data to be transmitted comprise K layers of video frame layers, each layer of video frame layer has a corresponding hierarchical sequence number, the video frame layer with the higher hierarchical sequence number depends on the video frame layer with the lower hierarchical sequence number for decoding, and K is a positive integer;

S420, acquiring the current network state of the target network;

S430, determining a target frame layer corresponding to the current network state, wherein the target frame layer comprises T layers of video frame layers with hierarchical sequence numbers arranged from the lowest layer to the higher layer in a step-by-step mode, T is a natural number and T is less than or equal to K;
and S440, sending the data corresponding to the target frame layer to the target terminal through the target network.
As a mode, the data acquired by the server may be video coded data to be transmitted. In this case the video coded data to be transmitted include K video frame layers but no audio frames; each video frame layer has a corresponding hierarchical sequence number, and a video frame layer with a higher hierarchical sequence number depends on the video frame layers with lower sequence numbers for decoding. Meanwhile, the server may also acquire the current network state of the target network, then determine the target frame layer corresponding to the current network state, and finally send the data corresponding to the target frame layer to the target terminal through the target network.
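As a minimal sketch of the video-only selection, assuming the K video frame layers are numbered 1 to K:

    def video_layers_for_state(T: int, K: int) -> list:
        # Keep the lowest T of the K video frame layers; an empty list means
        # no video frame layer is sent for the current network state.
        assert 0 <= T <= K
        return list(range(1, T + 1))

    print(video_layers_for_state(2, K=4))   # [1, 2]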
According to the data transmission method provided by this embodiment of the application, on the premise that the target terminal can still decode normally from the data corresponding to the target frame layer and the viewing quality of the video is guaranteed, different target frame layers can be selected and sent to suit different network states, which reduces the number of data packets sent, lowers network consumption, relieves network congestion and alleviates abnormal video playing at the target terminal.
Referring to fig. 16, fig. 16 is a block diagram of a data transmission apparatus 500 according to an embodiment of the present application, where the apparatus 500 includes: the audio/video coding data acquisition module 510, the first network state acquisition module 520, the first target frame layer determination module 530 and the first data transmission module 540.
The audio/video coding data acquiring module 510 is configured to acquire audio/video coding data to be transmitted, where the audio/video coding data to be transmitted includes M layers of video frame layers and audio frame layers, each layer of video frame layer and each layer of audio frame layer have a corresponding hierarchical sequence number, a video frame layer with a higher hierarchical sequence number depends on a video frame layer with a lower hierarchical sequence number for decoding, and the hierarchical sequence number of the audio frame layer is lower than the hierarchical sequence number of any video frame layer, where M is a positive integer.
As a manner, the audio/video encoding data obtaining module 510 includes: an acquisition submodule and a hierarchical processing submodule.
And the acquisition submodule is used for acquiring the coded audio frame corresponding to the audio and video to be transmitted and the coded M-layer video frame layer.
And the layering processing submodule is used for layering the audio frame and the M layers of video frame layers to obtain audio and video coding data to be transmitted.
As one way, the acquisition submodule includes:
and the video frame sequence obtaining unit is used for coding the video data included in the audio and video to be transmitted to obtain a coded video frame sequence.
And the video frame layering unit is used for layering the video frames of the coded video frame sequence according to the decoding dependency relationship among the video frames to obtain the coded M-layer video frame layer.
As one way, the acquisition submodule includes:
and the layered coding unit is used for performing layered coding on the video data included by the audio and video to be transmitted to obtain M layers of video frame layers after coding.
The first network status obtaining module 520 is configured to obtain a current network status of the target network.
As one mode, the first network status obtaining module 520 includes: a data sending slow speed ratio obtaining submodule and a network state obtaining submodule.
And the data transmission slow speed ratio acquisition submodule is used for acquiring the data transmission slow speed ratio of the target network.
And the network state acquisition submodule is used for acquiring the current network state of the target network based on the data sending slow speed ratio.
As one mode, the data transmission slow speed ratio acquisition submodule includes:
and the time stamp obtaining unit is used for obtaining the decoding time stamp closest to the current time in the decoding time stamps corresponding to the audio and video coding data to be transmitted and the current decoding time stamp corresponding to the video frame sent by the target network at the current time.
And the absolute value acquisition unit is used for acquiring the decoding timestamp closest to the current moment and the absolute value of the difference value of the current decoding timestamp as the data transmission slow speed ratio corresponding to the target network.
As one mode, the network status obtaining sub-module includes: and the searching unit is used for searching the current network state corresponding to the data transmission slow speed ratio from a second corresponding relation table, and the second corresponding relation table comprises a plurality of network states and a data transmission slow speed ratio range corresponding to each network state.
The first target frame layer determining module 530 is configured to determine a target frame layer corresponding to the current network state, where the target frame layer includes an audio frame layer and N video frame layers with hierarchical sequence numbers arranged from a lowest layer to a higher layer in a stepwise manner, where N is a natural number, and N is less than or equal to M.
By way of one approach, the first target frame layer determining module 530 includes: and the first target frame layer determining submodule is used for determining a target frame layer corresponding to the current network state based on a first corresponding relation table, the first corresponding relation table comprises a plurality of network states and a target frame layer corresponding to each network state, and the network quality corresponding to the current network state is positively correlated with the number of the target frame layers.
As another mode, the audio and video coded data to be transmitted include an audio frame layer with hierarchical sequence number 0 and video frame layers whose hierarchical sequence numbers increase sequentially from layer 1 to layer M, and the network state includes a type 1 network state, a type 2 network state and a type 3 network state in which the network quality sequentially increases. In this case, the first target frame layer determining module 530 includes: a second target frame layer determining submodule, configured to determine, when the current network state is the type 1 network state, the audio frame layer with hierarchical sequence number 0 as the target frame layer corresponding to the current network state; or, when the current network state is the type 2 network state, determine the audio frame layer with hierarchical sequence number 0 and the video frame layers with hierarchical sequence numbers 1 to P as the target frame layers corresponding to the current network state, where P is a positive integer less than M; or, when the current network state is the type 3 network state, determine the audio frame layer with hierarchical sequence number 0 and the video frame layers with hierarchical sequence numbers 1 to M as the target frame layers corresponding to the current network state.
As another mode, the audio and video coded data to be transmitted include an audio frame layer with hierarchical sequence number 0 and video frame layers whose hierarchical sequence numbers increase sequentially from layer 1 to layer M, and the network state includes a type 1 network state to a type M+1 network state in which the network quality sequentially increases. In this case, the first target frame layer determining module 530 includes: a third target frame layer determining submodule, configured to determine, when the current network state is the type 1 network state, the audio frame layer with hierarchical sequence number 0 as the target frame layer corresponding to the current network state; or, when the current network state is the type L network state, determine the audio frame layer with hierarchical sequence number 0 and the video frame layers with hierarchical sequence numbers 1 to L-1 as the target frame layers corresponding to the current network state, where L ∈ [2, M] and L is a natural number; or, when the current network state is the type M+1 network state, determine the audio frame layer with hierarchical sequence number 0 and the video frame layers with hierarchical sequence numbers 1 to M as the target frame layers corresponding to the current network state.
A first data sending module 540, configured to send data corresponding to the target frame layer to the target terminal through the target network.
As a mode, the first network state obtaining module 520 is further configured to obtain, when a preset condition is met, a network state of the target network at the moment when the preset condition is triggered, as a new network state;
the first target frame layer determining module 530 is further configured to determine a new target frame layer corresponding to the new network state, where the new target frame layer includes an audio frame layer and Q video frame layers with hierarchical sequence numbers arranged from a lowest layer to a higher layer in a stepwise manner, where Q is a natural number and Q is less than or equal to M;
the first data sending module 540 is further configured to send data corresponding to the new target frame layer to the target terminal through the target network.
In the data transmission device provided by this embodiment of the application, because the target frame layer includes the audio frame layer and N video frame layers whose hierarchical sequence numbers rise step by step from the lowest layer, the target terminal can decode normally even when only the target frame layer corresponding to the network state is sent. Because the target frame layer corresponds to the network state, different target frame layers can be selected and sent for different network states; that is, only part of the audio and video coded data to be transmitted is sent. This reduces the number of data packets sent and the downlink network data packets of the target terminal, lowers network consumption, relieves network congestion, and thus alleviates abnormal video playing at the target terminal. Since the video coding rate is not reduced, the viewing quality of the decoded and played video is guaranteed, bringing a better interactive experience to the user.
Referring to fig. 17, fig. 17 is a block diagram of a data transmission apparatus 600 according to an embodiment of the present application, where the apparatus 600 includes: a video coded data acquiring module 610, a second network status acquiring module 620, a second target frame layer determining module 630 and a second data transmitting module 640.
The video encoding data obtaining module 610 is configured to obtain video encoding data to be transmitted, where the video encoding data to be transmitted includes K layers of video frame layers, each layer of video frame layer has a corresponding hierarchical sequence number, and a video frame layer with a higher hierarchical sequence number depends on a video frame layer with a lower hierarchical sequence number for decoding, where K is a positive integer.
The second network status obtaining module 620 is configured to obtain a current network status of the target network.
The second target frame layer determining module 630 is configured to determine a target frame layer corresponding to the current network state, where the target frame layer includes T layers of video frame layers with hierarchical sequence numbers arranged from a lowest layer to a higher layer in a stepwise manner, where T is a natural number, and T is less than or equal to K.
The second data sending module 640 is configured to send data corresponding to the target frame layer to the target terminal through the target network.
In the data transmission apparatus according to this embodiment of the application, because the target frame layer includes T video frame layers whose hierarchical sequence numbers rise step by step from the lowest layer, the target terminal can decode normally even when only the target frame layer corresponding to the network state is sent. Because the target frame layer corresponds to the network state, different target frame layers can be selected and sent for different network states; that is, only part of the video coded data to be transmitted is sent. This reduces the number of data packets sent and the downlink network data packets of the target terminal, lowers network consumption, relieves network congestion, and thus alleviates abnormal video playing at the target terminal. Since the video coding rate is not reduced, the viewing quality of the decoded and played video is guaranteed, bringing a better interactive experience to the user.
It should be noted that the apparatus embodiment in the present application corresponds to the foregoing method embodiment, and specific principles in the apparatus embodiment may refer to the contents in the foregoing method embodiment, which is not described herein again.
An electronic device provided by the present application will be described with reference to fig. 18.
Referring to fig. 18, based on the foregoing data transmission method, an embodiment of the present application further provides an electronic device 200 including a processor 102 capable of executing the data transmission method; the electronic device 200 may be a smart phone, a tablet computer, a portable computer or the like. The electronic device 200 further includes a memory 104, a network module 106 and a screen 108. The memory 104 stores a program capable of executing the contents of the foregoing embodiments, and the processor 102 can execute the program stored in the memory 104.
The processor 102 may include one or more processing cores. Using various interfaces and lines to connect the various components of the electronic device 200, the processor 102 executes the various functions of the electronic device 200 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 104 and invoking data stored in the memory 104. Optionally, the processor 102 may be implemented in hardware in at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA) and Programmable Logic Array (PLA) form. The processor 102 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem and the like, where the CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 102 and may instead be implemented by a separate communication chip.
The memory 104 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 104 may be used to store instructions, programs, code, code sets or instruction sets. The memory 104 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playing function or an image playing function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device 200 in use, such as a phone book, audio and video data and chat log data.
The network module 106 is configured to receive and transmit electromagnetic waves, and achieve interconversion between the electromagnetic waves and the electrical signals, so as to communicate with a communication network or other devices, for example, an audio playing device. The network module 106 may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and so forth. The network module 106 may communicate with various networks, such as the internet, an intranet, a wireless network, or with other devices via a wireless network. The wireless network may comprise a cellular telephone network, a wireless local area network, or a metropolitan area network. For example, the network module 106 may perform information interaction with a base station.
The screen 108 may display interface content and may also be used to respond to touch gestures.
It should be noted that, in order to implement more functions, the electronic device 200 may further include more components, for example, a structured light sensor for collecting face information or a camera for collecting an iris.
Referring to fig. 19, a block diagram of a computer-readable storage medium provided in an embodiment of the present application is shown. The computer readable medium 1100 has stored therein a program code that can be called by a processor to execute the method described in the above method embodiments.
The computer-readable storage medium 1100 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 1100 includes a non-volatile computer-readable medium. The computer readable storage medium 1100 has storage space for program code 1110 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 1110 may be compressed, for example, in a suitable form.
Based on the above data transmission method, according to an aspect of the embodiments of the present application, there is provided a computer program product or a computer program, the computer program product or the computer program comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
To sum up, according to the data transmission method and device, the electronic device, the storage medium, the computer program product or the computer program provided in the embodiments of the present application, audio and video coded data to be transmitted, which include M video frame layers and an audio frame layer, are acquired; the target frame layer corresponding to the current network state of the target network is then obtained, the target frame layer including the audio frame layer and N video frame layers whose hierarchical sequence numbers rise step by step from the lowest layer; finally, the data corresponding to the target frame layer are sent to the target terminal through the target network. In this way, because the target frame layer includes the audio frame layer and the N video frame layers arranged upward from the lowest layer, the target terminal can decode normally even when only the target frame layer corresponding to the network state is sent. Because the target frame layer corresponds to the network state, different target frame layers can be selected and sent for different network states, that is, only part of the audio and video coded data to be transmitted is sent, which reduces the number of data packets sent, reduces the downlink network data packets of the target terminal, lowers network consumption, relieves network congestion, and thus alleviates abnormal video playing at the target terminal. Since the video coding rate is not reduced, the viewing quality of the decoded and played video is guaranteed, bringing a better interactive experience to the user.
Finally, it should be noted that: the above embodiments are intended to illustrate the technical solutions of the present application, but not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (13)

1. A method of data transmission, comprising:
acquiring audio and video coded data to be transmitted, wherein the audio and video coded data to be transmitted comprise M layers of video frame layers and audio frame layers, each layer of video frame layer and each layer of audio frame layer have corresponding hierarchical sequence numbers, the video frame layer with the higher hierarchical sequence number depends on the video frame layer with the lower hierarchical sequence number for decoding, the hierarchical sequence number of the audio frame layer is lower than that of any one video frame layer, and M is a positive integer;
acquiring a decoding time stamp closest to the current time in the decoding time stamps corresponding to the audio and video coding data to be transmitted and a current decoding time stamp corresponding to a video frame sent by a target network at the current time; acquiring the decoding timestamp closest to the current moment and the absolute value of the difference value of the current decoding timestamp as a data sending slow speed ratio corresponding to the target network;
obtaining the current network state of the target network based on the data sending slow speed ratio;
determining a target frame layer corresponding to the current network state, wherein the target frame layer comprises the audio frame layer and N layers of the video frame layers with hierarchical sequence numbers arranged from the lowest layer to the high layer in a progressive manner, N is a natural number, and N is less than or equal to M;
and sending the data corresponding to the target frame layer to a target terminal through the target network.
2. The method according to claim 1, wherein the obtaining of audio-video encoding data to be transmitted comprises:
acquiring a coded audio frame corresponding to an audio and video to be transmitted and a coded M-layer video frame layer;
and layering the audio frame and the M layers of video frame layers to obtain the audio and video coding data to be transmitted.
3. The method according to claim 2, wherein obtaining the encoded M layers of video frame layers corresponding to the audio and video to be transmitted comprises:
and carrying out layered coding on the video data included by the audio and video to be transmitted to obtain M layers of video frame layers after coding.
4. The method according to any one of claims 1-3, wherein determining the target frame layer corresponding to the current network state comprises:
and determining a target frame layer corresponding to the current network state based on a first corresponding relation table, wherein the first corresponding relation table comprises a plurality of network states and the target frame layer corresponding to each network state, and the network quality corresponding to the current network state is positively correlated with the number of the target frame layers.
5. The method according to any one of claims 1 to 3, wherein the to-be-transmitted audio/video encoded data includes an audio frame layer with a hierarchical sequence number of layer 0 and video frame layers with hierarchical sequence numbers sequentially increasing from layer 1 to layer M, the network state includes a type 1 network state, a type 2 network state and a type 3 network state, in which network qualities sequentially increase, and the determining a target frame layer corresponding to the current network state includes:
when the current network state is a type 1 network state, determining an audio frame layer with a level sequence number of a layer 0 as a target frame layer corresponding to the current network state; or
When the current network state is a type 2 network state, determining an audio frame layer with a hierarchical sequence number of a layer 0 and video frame layers with hierarchical sequence numbers of layers 1 to P as target frame layers corresponding to the current network state, wherein P is a positive integer smaller than M; or
And when the current network state is the 3 rd type network state, determining an audio frame layer with the hierarchical sequence number of the 0 th layer and video frame layers with the hierarchical sequence numbers of the 1 st layer to the Mth layer as target frame layers corresponding to the current network state.
6. The method according to any one of claims 1 to 3, wherein the to-be-transmitted audio/video encoded data includes an audio frame layer with a hierarchical sequence number of layer 0 and video frame layers with hierarchical sequence numbers sequentially increasing from layer 1 to layer M, the network state includes a type 1 network state to a type M +1 network state in which network quality sequentially increases, and the determining of the target frame layer corresponding to the current network state includes:
when the current network state is the type 1 network state, determining an audio frame layer with the hierarchical sequence number of the layer 0 as a target frame layer corresponding to the current network state; or
When the current network state is the L-th network state, determining an audio frame layer with a hierarchical sequence number of a layer 0 and video frame layers with hierarchical sequence numbers of layers 1 to L-1 as target frame layers corresponding to the current network state, wherein L belongs to [2, M], and L is a natural number; or
And when the current network state is the M +1 type network state, determining an audio frame layer with the hierarchical sequence number of the 0 th layer and video frame layers with the hierarchical sequence numbers of the 1 st layer to the M th layer as a target frame layer corresponding to the current network state.
7. The method according to claim 1, wherein after sending the data corresponding to the target frame layer to a target terminal through the target network, further comprising:
acquiring a network state of a target network at the moment when a preset condition is triggered as a new network state under the condition that the preset condition is met;
determining a new target frame layer corresponding to the new network state, wherein the new target frame layer comprises an audio frame layer and a Q-layer video frame layer with the hierarchical sequence numbers arranged from the lowest layer to the high layer in a progressive manner, Q is a natural number, and Q is less than or equal to M;
and sending the data corresponding to the new target frame layer to a target terminal through the target network.
8. The method of claim 1, wherein deriving the current network state of the target network based on the data transmission slow speed ratio comprises:
and searching the current network state corresponding to the data transmission slow speed ratio from a second corresponding relation table, wherein the second corresponding relation table comprises a plurality of network states and a data transmission slow speed ratio range corresponding to each network state.
9. A data transmission method, comprising:
acquiring to-be-transmitted audio/video encoded data, wherein the to-be-transmitted audio/video encoded data comprises K video frame layers, each video frame layer has a corresponding hierarchical sequence number, a video frame layer with a higher hierarchical sequence number depends on a video frame layer with a lower hierarchical sequence number for decoding, and K is a positive integer;
acquiring, from the decoding timestamps corresponding to the to-be-transmitted audio/video encoded data, the decoding timestamp closest to the current time, and acquiring the current decoding timestamp corresponding to the video frame being sent over a target network at the current time; taking the absolute value of the difference between the decoding timestamp closest to the current time and the current decoding timestamp as the data transmission slow speed ratio corresponding to the target network;
obtaining the current network state of the target network based on the data transmission slow speed ratio;
determining a target frame layer corresponding to the current network state, wherein the target frame layer comprises T video frame layers whose hierarchical sequence numbers increase stepwise from the lowest layer upward, T being a natural number with T ≤ K;
and sending the data corresponding to the target frame layer to a target terminal through the target network.
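Illustrative only: a sketch of the slow-speed-ratio computation recited in claim 9 (the same metric appears in claims 10 and 11), assuming timestamps are in milliseconds.

```python
def slow_speed_ratio(pending_dts_ms: list[int], sending_dts_ms: int,
                     now_ms: int) -> int:
    """Lag metric: how far the frame being sent trails the queue head.

    pending_dts_ms: DTS values of encoded frames still waiting to be sent.
    sending_dts_ms: DTS of the frame the target network is sending now.
    now_ms:         current time on the sender's clock.
    """
    # The decoding timestamp among the queued frames closest to "now".
    nearest_dts = min(pending_dts_ms, key=lambda dts: abs(dts - now_ms))
    # The larger the gap, the further sending has fallen behind real time.
    return abs(nearest_dts - sending_dts_ms)
```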
10. A data transmission apparatus, comprising:
an audio/video encoded data acquisition module, configured to acquire to-be-transmitted audio/video encoded data, wherein the to-be-transmitted audio/video encoded data comprises M video frame layers and an audio frame layer, each video frame layer and the audio frame layer have corresponding hierarchical sequence numbers, a video frame layer with a higher hierarchical sequence number depends on a video frame layer with a lower hierarchical sequence number for decoding, the hierarchical sequence number of the audio frame layer is lower than that of any video frame layer, and M is a positive integer;
a first network state acquisition module, configured to acquire, from the decoding timestamps corresponding to the to-be-transmitted audio/video encoded data, the decoding timestamp closest to the current time, and the current decoding timestamp corresponding to the video frame being sent over a target network at the current time; take the absolute value of the difference between the decoding timestamp closest to the current time and the current decoding timestamp as the data transmission slow speed ratio corresponding to the target network; and obtain the current network state of the target network based on the data transmission slow speed ratio;
a first target frame layer determining module, configured to determine a target frame layer corresponding to the current network state, wherein the target frame layer comprises the audio frame layer and N video frame layers whose hierarchical sequence numbers increase stepwise from the lowest layer upward, N being a natural number with N ≤ M;
and a first data sending module, configured to send the data corresponding to the target frame layer to a target terminal through the target network.
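Illustrative only: a sketch wiring the four modules of claim 10 together in the recited order, reusing the sketches above; encoder, network, terminal, and the frame attributes are hypothetical stand-ins.

```python
class DataTransmissionDevice:
    """The four modules of claim 10 rendered as the steps of one pass."""

    def __init__(self, encoder, network, terminal):
        self.encoder = encoder
        self.network = network
        self.terminal = terminal

    def run_once(self, now_ms: int) -> None:
        # Audio/video encoded data acquisition module.
        frames = self.encoder.encoded_frames()
        # First network state acquisition module.
        ratio = slow_speed_ratio([f.dts for f in frames if f.pending],
                                 self.network.sending_dts(), now_ms)
        state = lookup_network_state(ratio)
        # First target frame layer determining module.
        layers = select_target_layers(state, self.encoder.m)
        # First data sending module.
        for f in frames:
            if f.layer in layers:
                self.network.send_to(self.terminal, f)
```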
11. A data transmission apparatus, comprising:
a video encoded data acquisition module, configured to acquire to-be-transmitted audio/video encoded data, wherein the to-be-transmitted audio/video encoded data comprises K video frame layers, each video frame layer has a corresponding hierarchical sequence number, a video frame layer with a higher hierarchical sequence number depends on a video frame layer with a lower hierarchical sequence number for decoding, and K is a positive integer;
a third network state acquisition module, configured to acquire, from the decoding timestamps corresponding to the to-be-transmitted audio/video encoded data, the decoding timestamp closest to the current time, and the current decoding timestamp corresponding to the video frame being sent over a target network at the current time; take the absolute value of the difference between the decoding timestamp closest to the current time and the current decoding timestamp as the data transmission slow speed ratio corresponding to the target network; and obtain the current network state of the target network based on the data transmission slow speed ratio;
a third target frame layer determining module, configured to determine a target frame layer corresponding to the current network state, wherein the target frame layer comprises T video frame layers whose hierarchical sequence numbers increase stepwise from the lowest layer upward, T being a natural number with T ≤ K;
and a third data sending module, configured to send the data corresponding to the target frame layer to a target terminal through the target network.
12. An electronic device, comprising:
one or more processors;
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method according to any one of claims 1 to 8 or claim 9.
13. A computer-readable storage medium having program code stored therein, the program code being invocable by a processor to perform the method according to any one of claims 1 to 8 or claim 9.
CN202110099201.1A 2021-01-25 2021-01-25 Data transmission method and device, electronic equipment and storage medium Active CN113038128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110099201.1A CN113038128B (en) 2021-01-25 2021-01-25 Data transmission method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110099201.1A CN113038128B (en) 2021-01-25 2021-01-25 Data transmission method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113038128A CN113038128A (en) 2021-06-25
CN113038128B true CN113038128B (en) 2022-07-26

Family

ID=76459804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110099201.1A Active CN113038128B (en) 2021-01-25 2021-01-25 Data transmission method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113038128B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113365066B (en) * 2021-06-29 2022-12-02 北京二六三企业通信有限公司 Video data transmission method and device
CN113747102B (en) * 2021-08-31 2024-04-26 百果园技术(新加坡)有限公司 Video call processing method, device, equipment and storage medium
CN113992639B (en) * 2021-09-29 2024-04-05 杭州阿里云飞天信息技术有限公司 Audio and video processing method and equipment
CN116468071B (en) * 2023-04-24 2024-04-05 北京百度网讯科技有限公司 Model training method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008084179A1 (en) * 2007-01-08 2008-07-17 Nds Limited Buffer management
CN101557512A (en) * 2009-05-19 2009-10-14 武汉长江通信产业集团股份有限公司 Method for processing time delay when video terminal receives video data

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103024369B (en) * 2011-09-20 2017-11-10 中兴通讯股份有限公司 Transmitting terminal, terminal, the system and method for hierarchical coding multiplexing
CN103139559B (en) * 2011-11-30 2016-01-27 中国电信股份有限公司 Multi-media signal transmission method and device
CN103916716B (en) * 2013-01-08 2017-06-20 北京信威通信技术股份有限公司 The code rate smoothing method of realtime video transmission under a kind of wireless network
CN106792264A (en) * 2016-12-16 2017-05-31 浙江宇视科技有限公司 A kind of video frame transmission method and device
CN109275029B (en) * 2018-08-28 2019-10-01 北京达佳互联信息技术有限公司 Video stream processing method and device, mobile terminal and storage medium
CN110177308A (en) * 2019-04-15 2019-08-27 广州虎牙信息科技有限公司 Mobile terminal and its audio-video frame losing method in record screen, computer storage medium
CN111212025B (en) * 2019-11-20 2022-02-08 腾讯科技(深圳)有限公司 Method and device for transmitting network self-adaptive video stream
CN111510720A (en) * 2020-04-24 2020-08-07 深圳市即构科技有限公司 Real-time streaming media data transmission method, electronic device and server

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008084179A1 (en) * 2007-01-08 2008-07-17 Nds Limited Buffer management
CN101557512A (en) * 2009-05-19 2009-10-14 武汉长江通信产业集团股份有限公司 Method for processing time delay when video terminal receives video data

Also Published As

Publication number Publication date
CN113038128A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN113038128B (en) Data transmission method and device, electronic equipment and storage medium
Petrangeli et al. An HTTP/2-based adaptive streaming framework for 360 virtual reality videos
US11451787B2 (en) Method and apparatus for video encoding and decoding
US8548057B2 (en) Video coding redundancy reduction
US7958532B2 (en) Method of transmitting layered video-coded information
KR20150087822A (en) Method and apparatus for entropy coding for slice segment, method and apparatus for entropy decoding for slice segment
CN103748875A (en) Method and device for video coding applying parition-based filters, and storage medium
CN111212025B (en) Method and device for transmitting network self-adaptive video stream
CN114208170A (en) Intra refresh and error tracking based on feedback information
US20230017002A1 (en) File encapsulation method, file transmission method, file decoding method, electronic device, and storage medium
WO2024021772A1 (en) Live streaming media data processing method, system and apparatus, and computer device
US10313699B2 (en) Method and apparatus for parallel video decoding based on multi-core system
CN112866746A (en) Multi-path streaming cloud game control method, device, equipment and storage medium
CN116567228A (en) Encoding method, real-time communication method, apparatus, device and storage medium
WO2024078066A1 (en) Video decoding method and apparatus, video encoding method and apparatus, storage medium, and device
US20230091266A1 (en) Media data processing method and related device
CN116405665A (en) Encoding method, apparatus, device and storage medium
WO2022022299A1 (en) Method, apparatus, and device for constructing motion information list in video coding and decoding
WO2023011044A1 (en) Data processing method and apparatus, computer device, storage medium, and program product
CN114051140B (en) Video encoding method, video encoding device, computer equipment and storage medium
US20230396783A1 (en) Data processing method and apparatus, device, and readable storage medium
CN116708938A (en) Video processing method, device, equipment and storage medium
CN117319688A (en) Video data encoding method, computer device, and storage medium
CN117812268A (en) Video transcoding method, device, equipment and medium
CN116980619A (en) Video processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40045497

Country of ref document: HK

GR01 Patent grant